Introduction to Data Science
Course Information

Assessment:
#  Component        Marks
2  Python Exercise  10
3  R Exercise       10
4  Final Exam       70
   Total            100
Do’s and Don’ts
• Do’s:
o Maintain absolute silence and listen attentively during lectures
o Ask questions
• Don’ts:
o Do not make noise during lectures
o Do not use mobile phones for any purpose
o Do not arrive late
The course covers:
• Introduction
• Fundamentals of data science
• Steps and scope of data science
• Introduction to big data
• Introduction to data mining
• Use cases in various domains
• Introduction to data analytics
• Introduction to R and Python programming
Module Objectives
What is Data Science? (2)
Data science is about extraction,
preparation, analysis, visualization, and
maintenance of information.
[Venn diagram: data science sits at the intersection of data, traditional analytics, software, and business/domain expertise.]
What is Data Science? (5)
Data Science is about data, models, and evaluation.
Data science can solve a variety of problems once we have
the right data and model.
[Diagram: a scientific, social, or business problem leads to collecting and understanding data.]
Data Science Lifecycle
Data science’s lifecycle consists of five distinct stages, each with its own tasks:
Data Science Lifecycle (2)
Process: In this stage, data scientists take the prepared data and examine its patterns, ranges, and biases to determine how useful it will be for predictive analysis, using data mining, clustering/classification, data modeling, and data summarization.
Evolution of Data Science: Growth & Innovation (2)
1962: American mathematician John W. Tukey first
articulated the idea of data science.
Evolution of Data Science: Growth & Innovation (3)
1977: The International Association for Statistical Computing (IASC) was established, building on the theories and predictions of early pioneers such as Tukey and Naur. Its mission was to link traditional statistical methodology, modern computer technology, and the knowledge of domain experts to convert data into information and knowledge.
1990s and early 2000s: Data science emerged as a recognized and specialized field. Several data science academic journals began to circulate, and data science proponents like Jeff Wu and William S. Cleveland continued to help develop and expound upon the necessity and potential of data science.
Evolution of Data Science: Growth & Innovation (5)
2000s: Technology made enormous leaps by providing
universal access to internet connectivity, communication,
and (of course) data collection.
2005: Big data enters the scene. With tech giants such as Google and Facebook amassing large amounts of data, new technologies capable of processing them became necessary. Hadoop rose to the challenge, and later Spark and Cassandra made their debuts.
Data Science Components
Data Science Components (2)
Data engineering is the process of organizing,
managing, and analyzing large amounts of
data. It is about building systems that will enable
the collection and usage of data.
Data Science Components (3)
Statistics is the method of collecting and analyzing numerical data in large quantities to extract useful insights. This is the most critical unit in data science.
Data Science Jobs Roles
The most prominent data science job titles are:
Data Science Jobs Roles (2)
Data Scientist:
A data scientist is a professional who manages enormous amounts of data and comes up with business insights by using various tools, techniques, methodologies, algorithms, etc.
Role:
Identifying data collection sources for business needs
Processing, cleansing, and integrating data
Automating the data collection and management process
Using data science techniques/tools to improve processes
Languages:
R, SAS, Python, SQL, Hive, Matlab, Pig, Spark
Data Science Jobs Roles (3)
Data Engineers: Build and test scalable Big Data
ecosystems for businesses so that the data scientists can run
their algorithms on systems that are stable and highly
optimized.
They also update existing systems with newer or upgraded versions of the current technologies to improve the efficiency of the databases.
Role:
Design and maintain data management systems
Data collection/acquisition and management
Conducting primary and secondary research
Languages:
SQL, Hive, R, SAS, Matlab, Python, Java, Ruby, C++, and
Perl
Data Science Jobs Roles (4)
Data Analysts: are responsible for a variety of tasks, including the visualization and processing of massive amounts of data. They also perform queries on databases from time to time. One of the most important skills of a data analyst is optimization.
Role:
Extracting data from primary and secondary sources using
automated tools
Developing and maintaining databases
Performing data analysis and making reports with
recommendations
Languages:
R, Python, HTML, JS, C, C++, SQL
Data Science Jobs Roles (5)
Statistician: collects, analyzes, and interprets qualitative and quantitative data using statistical techniques, theories, and methods.
Role:
Collecting, analyzing, and interpreting data
Analyzing data, assessing results, and predicting
trends/relationships using statistical
methodologies/tools
Designing data collection processes
Languages:
SQL, R, Matlab, Tableau, Python
Data Science Jobs Roles (6)
Data Administrators: are responsible for the proper functioning of all databases of an enterprise, granting or revoking access for the company's employees depending on their requirements.
Role:
Working on database software to store and manage
data
Working on database design and development
Implementing security measures for the database
Preparing reports, documentation, and operating
manuals
Languages:
Ruby on Rails, SQL, Java, C#, and Python
Data Science Jobs Roles (7)
Business Analyst: Business analysts have a good understanding of how data-oriented technologies work and how to handle large volumes of data. They also separate high-value data from low-value data.
Role:
Understanding the business of the organization
Conducting detailed business analysis – outlining
problems, opportunities, and solutions
Working on improving existing business processes
Languages:
SQL, Tableau, Power BI, and Python
Data Science Jobs Roles (8)
Data Architects: create the blueprints for data management so that databases can easily be integrated, centralized, and protected with the best security measures.
Role:
Developing and implementing overall data strategy in line with
business/organization
Identifying data collection sources in line with the data strategy
Collaborating with cross-functional teams and stakeholders for
the smooth functioning of database systems
Planning and managing end-to-end data architecture
Languages:
SQL, Power BI, Java, C#, and Python
Data Science Jobs Roles (9)
Data Analytics Manager: coordinates the different tasks that must be completed by their team for a big data project.
Role:
Researching and creating effective methods to
collect data.
Analyzing information.
Recommending solutions to a business.
Languages:
R, Python, HTML, JS, C, C++, SQL
Applications of Data Science in Various Fields
1. Search Engines
The most visible application of data science is in search engines. When we want to find something on the internet, we mostly use search engines such as Google, Yahoo, or Bing, and data science is used to return relevant results faster.
2. Transport
Data science also contributes to the transport field, for example in driverless cars. With the help of driverless cars, it is easier to reduce the number of accidents.
For example: for driverless cars, training data is fed into the algorithm, and with the help of data science techniques, the algorithm analyzes data about speed limits on highways, busy streets, narrow roads, etc.
Applications of Data Science in Various Fields (2)
3. Finance
Data science plays a key role in Financial Industries in the
area of fraud detection and minimizing the risk of losses.
Applications of Data Science in Various Fields (4)
5. Health Care:
In the Healthcare Industry, data science is used for:
Detecting tumors.
Drug discoveries.
Medical Image Analysis.
Virtual Medical Bots.
Genetics and Genomics.
Predictive Modeling for Diagnosis etc.
Applications of Data Science in Various Fields (5)
6. Image Recognition
Currently, data science is also used in Image Recognition.
Applications of Data Science in Various Fields (6)
8. Airline Route Planning
With the help of data science, the airline sector can now more easily predict flight delays. It also helps in deciding whether to fly directly to a destination or make a stop along the way. For example, a flight from Delhi to the U.S. can either take a direct route or stop in between.
The data science process follows these steps:
1. Business Requirements
2. Data Acquisition
3. Data Processing
4. Data Exploration
5. Modelling
6. Deployment
How does Data Science Work?
Collect Data → Analyze → Insights
Consider an Example
Suppose there is an organization that is working towards finding potential leads for its sales team. It can follow this approach to get an optimal solution using data science.
Challenges of Data Science Technology
Challenges of Data Science Technology (2)
What is a Data Scientist’s Salary?
What is a Data Scientist’s Salary? (2)
Summary
Data Science is the area of study that involves extracting insights from vast amounts
of data by the use of various scientific methods, algorithms, and processes.
Important Data Scientist job roles are: 1) Data Scientist 2) Data Engineer 3) Data
Analyst 4) Statistician 5) Data Architect 6) Data Admin 7) Business Analyst 8)
Data/Analytics Manager
A high variety of information and data is the biggest challenge of Data Science technology.
Exercises
1. What is data science? How does it relate to and
differ from statistics?
Lecture Two: Methodology of Extracting Knowledge From Big Datasets
Methodology of Extracting Knowledge From Big
Datasets
Extracting knowledge from big datasets involves a variety
of techniques such as data collection, cleaning, analysis,
and visualization.
Methodology of Extracting Knowledge From Big Datasets (2)
Although knowledge extraction is similar to information extraction and data warehousing, the main criterion is that the extraction result goes beyond the creation of structured information or the transformation into a relational schema.
Methodology of Extracting Knowledge From Big Datasets (3)
Knowledge Extraction Steps
Data selection: where data relevant to the analysis task are
retrieved from the database.
Knowledge Extraction Steps (2)
Knowledge Discovery in Datasets
Tools for Data Science
Tools for Data Science (2)
Jupyter Notebook: This is an open-source web application
that allows you to create and share documents that contain
live code, equations, visualizations, and narrative text.
Hands-On Example: Finding the Largest Number
Suppose you are given the following numbers and tasked with finding the largest of them:
7, 24, 62, 11, 4, 39, 42, 5, 97, 54
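A minimal Python sketch of one way to do this, scanning the list once and tracking the largest value seen so far (the built-in max does the same):

numbers = [7, 24, 62, 11, 4, 39, 42, 5, 97, 54]

# Keep the largest value seen so far while scanning once.
largest = numbers[0]
for n in numbers[1:]:
    if n > largest:
        largest = n

print(largest)       # 97
print(max(numbers))  # the built-in equivalent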
Hands-On Example: Analyzing Data (4)
[Scatter plot: visualization of the height vs. weight data; weights (lbs) from 0 to 180 on the y-axis, heights (inches) from 58 to 72 on the x-axis.]
Hands-On Example: Analyzing Data (5)
We know from the lower range of height (less than the average of
65 inches), for each inch of height change, weight changes by
about 3 pounds.
Hands-On Example: Analyzing Data (7)
For someone who is 58 inches in height, the
corresponding weight is 115 pounds.
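A small Python sketch of the linear rule these two slides imply; the baseline point (58 inches, 115 pounds) and the slope of about 3 pounds per inch come from the example above, and treating the trend as exactly linear is an assumption:

def predict_weight(height_inches):
    # Assumed linear trend: 115 lbs at 58 inches, about 3 lbs per
    # extra inch (stated above for the lower range of heights).
    return 115 + 3 * (height_inches - 58)

print(predict_weight(58))  # 115
print(predict_weight(62))  # 127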
Problem One
# Insurance provider Rating (out of 10)
1 GEICO 4.7
2 GEICO 8.3
3 GEICO 9.2
4 Progressive 7.4
5 Progressive 6.7
6 Progressive 8.9
7 USAA 3.8
8 USAA 6.3
9 USAA 8.1
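The original question text is not shown; assuming the task is to compare providers by their average rating, a quick Python sketch over the table above:

ratings = {
    "GEICO": [4.7, 8.3, 9.2],
    "Progressive": [7.4, 6.7, 8.9],
    "USAA": [3.8, 6.3, 8.1],
}

# Average rating per provider, rounded to two decimals.
for provider, values in ratings.items():
    print(provider, round(sum(values) / len(values), 2))
# GEICO 7.4, Progressive 7.67, USAA 6.07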
Problem Two
The table presents a movie review dataset from the past. It consists of three
attributes: the movie name, the leading actor in the movie, and its IMDB
rating. [Note: assume a better rating means a more watchable movie.]
How do you predict which of these actors’ movies you should watch when a new one is released?
Lecture Three: What will you learn?
Definition of Data
Data types
Major data sources
Data formats
What is Data?
Data is a collection of facts, figures, or statistics that can be
analyzed to gain insights and knowledge about a particular
subject.
What is Data? (2)
“Data” is plural; the singular is “datum”. Data is divided into two types:
What is Data? (4)
Why is Data Important?
Why is Data Important? (2)
Data Sources
Data Sources (2)
Data Sources (3)
Tabular Representation of Various Memory Sizes
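A standard table of memory sizes of this kind, using the binary (1,024-based) convention:

Unit            Size
Bit             a single 0 or 1
Byte (B)        8 bits
Kilobyte (KB)   1,024 bytes
Megabyte (MB)   1,024 KB
Gigabyte (GB)   1,024 MB
Terabyte (TB)   1,024 GB
Petabyte (PB)   1,024 TB
Exabyte (EB)    1,024 PB
Zettabyte (ZB)  1,024 EB
Yottabyte (YB)  1,024 ZB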
Different Ways of Combining Data
Different Ways of Combining Data (2)
Joining tables: allows one to combine the information about an observation found in one table with the information found in another table, as in the sketch below.
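A minimal pandas sketch of joining two hypothetical tables on a shared key (the table names and columns here are made up for illustration):

import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Ann", "Ben", "Cara"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3],
                       "amount": [25.0, 40.0, 15.0]})

# Each order is enriched with the customer information found
# in the other table, matched on customer_id.
joined = orders.merge(customers, on="customer_id", how="left")
print(joined)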
Different Ways of Combining Data (3)
Different Ways of Combining Data (4)
Different Ways of Combining Data (5)
Appending data from tables is a common operation, but it requires an equal structure in the tables being appended, as in the sketch below.
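A pandas sketch of appending, again with made-up tables; note that both tables share the same columns, as required:

import pandas as pd

jan = pd.DataFrame({"month": ["Jan", "Jan"], "sales": [100, 120]})
feb = pd.DataFrame({"month": ["Feb", "Feb"], "sales": [90, 140]})

# Appending stacks the observations of one table under the other.
combined = pd.concat([jan, feb], ignore_index=True)
print(combined)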
Basic Classification of Data
Structured data are highly organized information that can be
seamlessly included in a database and readily searched via
simple search operations.
Basic Classification of Data (2)
Semi-structured data doesn’t reside in a relational
database.
Basic Classification of Data (3)
Basic Classification of Data (4)
Open data: Data that is freely available in a
public domain and can be used by anyone as they
wish, without restrictions from copyright, patents,
or other mechanisms of control.
Lecture Four: Scope of Data Science
Scope of Data Science
Data collection is one of the most demanding steps in the data science process.
Data Collection Steps
Defining the aim of your research: This step involves
identifying the problem that needs to be solved and defining the
research question
There are two main methods of data collection in research based on the
information that is required, namely:
Methods of Data Collection (2)
The most common primary data collection methods:
1. Interviews
are direct methods of data collection.
Methods of Data Collection (3)
2. Observations
In this method, researchers observe a situation around
them and record the findings.
Methods of Data Collection (4)
Methods of Data Collection (5)
4. Focus Groups
Methods of Data Collection (6)
5. Oral Histories
Oral histories also involve asking questions like interviews and focus
groups.
Methods of Data Collection (7)
Secondary Data Collection Methods
Secondary data refers to data collected by someone else.
1. The Internet
The use of the Internet has become one of the most popular secondary data collection methods in recent times. There is a large pool of free and paid research resources that can be easily accessed on the Internet.
Methods of Data Collection (8)
2. Government Archives
Methods of Data Collection (9)
3. Libraries
Data Preparation
Data preprocessing, a component of data preparation,
describes any type of processing performed on raw data.
Data preprocessing transforms the data into a format that
is more easily and effectively processed in data mining,
machine learning and other data science tasks.
The techniques are generally used at the earliest stages of
the machine learning and AI development pipeline to
ensure accurate results.
There are several different tools and methods used for
preprocessing data, including the following:
Data Preparation (2)
The main activities of data preparation are:
Data cleaning: fixing incomplete or erroneous data
Data integration: unifying data from different sources
Data transformation: formatting the data
Data reduction: reducing data to its simplest form
Data discretization: reducing the number of values to
make data management easier
Feature engineering: selecting and transforming variables to work better with machine learning (a brief pandas sketch of some of these activities follows below)
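A minimal pandas sketch illustrating two of these activities, cleaning and transformation, on a made-up table; the column names and values are assumptions for illustration only:

import pandas as pd
import numpy as np

df = pd.DataFrame({"country": ["Austria", "Iceland", "Norway"],
                   "wine": [3.0, -0.8, np.nan]})

# Cleaning: fix an erroneous negative entry and impute a missing value.
df["wine"] = df["wine"].abs()
df["wine"] = df["wine"].fillna(df["wine"].mean())

# Transformation: min-max scaling to the [0, 1] range.
df["wine_scaled"] = (df["wine"] - df["wine"].min()) / (df["wine"].max() - df["wine"].min())
print(df)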
Data Preparation (3)
Sampling, which selects a representative subset from a
large population of data;
Transformation, which manipulates raw data to produce
a single input;
Denoising, which removes noise from data;
Imputation, which synthesizes statistically relevant data for missing values;
Normalization, which organizes data for more efficient
access; and
Feature extraction, which pulls out a relevant feature
subset that is significant in a particular context.
Data Preparation (4)
Data preparation is a phase that may need to be revisited
multiple times throughout your project.
Data Cleansing
Methods of Data Cleaning
Data Munging
The process of converting data into a form that is more suitable for a computer to understand.
Data Storage and Presentation
Depending on its nature, data is stored in various formats.
The CSV format is more generic and useful when sharing data with almost anyone.
Disadvantage of CSV
Since the comma is used to separate fields, a data value that itself contains a comma can be problematic unless the field is quoted.
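A short Python sketch showing how the standard csv module handles a field that contains a comma, provided the field is quoted (the extra ", 20mg" detail is made up for illustration):

import csv
import io

raw = 'treat,before,after,diff\n"Seroxat (Paxil), 20mg",17,15,-2\n'

# The quoted first field keeps its embedded comma intact.
for row in csv.reader(io.StringIO(raw)):
    print(row)
# ['treat', 'before', 'after', 'diff']
# ['Seroxat (Paxil), 20mg', '17', '15', '-2']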
Data Storage and Presentation (3)
treat,before,after,diff
No Treatment,13,16,3
No Treatment,10,18,8
No Treatment,16,16,0
Placebo,16,13,-3
Placebo,14,12,-2
Placebo,19,12,-7
Seroxat (Paxil),17,15,-2
Seroxat (Paxil),14,19,5
Seroxat (Paxil),20,14,-6
Effexor,17,19,2
Effexor,20,12,-8
Effexor,13,10,-3
Data Storage and Presentation (4)
TSV (Tab Separated Values) files are used for raw data and can be imported into and exported from spreadsheet software. Instead of commas, the fields are separated using tabs.
Data Storage and Presentation (5)
On the other hand, TSV is less common than other delimited formats such as CSV.
Name<TAB>Age<TAB>Address
Ryan<TAB>33<TAB>1115 W Franklin
Paul<TAB>25<TAB>Big Farm Way
Jim<TAB>45<TAB>W Main St
Samantha<TAB>32<TAB>28 George St.
Data Preprocessing
Data Integration (3)
4. Address redundant data in data integration.
Redundant data is commonly generated in the
process of integrating multiple databases.
For example:
Data Transformation (2)
4. Normalization: scaling values to fall within a small, specified range, and aggregation. Some of the techniques used for accomplishing normalization are:
a. Min-max normalization (feature scaling): rescales values, typically into the range [0, 1].
We have min_A and max_A as the minimum and maximum values observed for attribute A, and V_i is a value of attribute A that has to be normalized. Min-max normalization maps V_i to V'_i in a new, smaller range [new_min_A, new_max_A]. The formula for min-max normalization is:
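$$V'_i = \frac{V_i - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A$$

(the standard min-max formula, where V_i is the original value and V'_i its normalized value).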
Data Transformation (3)
For example, suppose $12,000 and $98,000 are the minimum and maximum values for the attribute income, and [0.0, 1.0] is the range into which we map a value of $73,600. The value $73,600 is transformed by min-max normalization as follows:
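The mapped value works out to:

$$V' = \frac{73600 - 12000}{98000 - 12000}\,(1.0 - 0.0) + 0.0 = \frac{61600}{86000} \approx 0.716$$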
Data Transformation (4)
b. Z-score normalization: a value V_i of attribute A is normalized as V'_i = (V_i − Ā) / σ_A. Here Ā and σ_A are the mean and standard deviation of attribute A, respectively. For example, suppose we have a mean and standard deviation for attribute A of $54,000 and $16,000, and we normalize the value $73,600 using z-score normalization.
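With those figures, the normalized value is:

$$V' = \frac{73600 - 54000}{16000} = 1.225$$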
Data Transformation (5)
c. Normalization by decimal scaling: each value V_i of attribute A is normalized as V'_i = V_i / 10^j, where j is the smallest integer such that max(|V'_i|) < 1. For example, the observed values for attribute A range from −986 to 917, so the maximum absolute value for attribute A is 986. To normalize by decimal scaling, we divide each value of attribute A by 1,000 (i.e., j = 3). So, the value −986 would be normalized to −0.986, and 917 would be normalized to 0.917.
Data Reduction (2)
The two most common techniques used for data reduction are:
In other words, we reduce the data to its more meaningful size and
structure for the task at hand.
Data Compression
Data compression involves converting the structure of data in a way that consumes less space. It builds a compact representation of information by removing redundancy and representing data in binary form. Compression from which the data can be restored exactly is called lossless compression.
Data Compression (2)
Data Discretization
Data Discretization is the process of converting
continuous values into more manageable parts
Hands-On Example 2: Data Pre-processing
Now, let us use a sample dataset and walk through those stages step
by step.
Hands-On Example 2: Data Pre-processing (2)
Smooth Noisy Data. We can see that the wine consumption value per capita for Iceland is −0.800000012. However, wine consumption per capita cannot be negative. Therefore, it must be a faulty entry, and we should change the wine consumption for Iceland to 0.800000012. Using the same logic, the number of deaths in Israel should be converted from −834 to 834.
To do this, we calculate the total alcohol consumption for the country of India as an
average of alcohol consumption, which is 2.95.
Similarly, we can calculate the heart diseases per 100,000 people in India as 171
(approximated to the nearest integer value).
Since we do not have any source for the number of total deaths or the number of
fatal liver diseases in India, we are going to handle these the same way we
previously addressed any missing values. The resultant dataset is shown in Table 4
Hands-On Example 2: Data Pre-processing (5)
Note that we have made some assumptions here in order to use this external dataset. First, when we use the average alcohol consumption of these States as the amount of alcohol consumption for India, we are assuming that:
(a) The populations of these States are the same, or at least similar.
Hands-On Example 2: Data Pre-processing (6)
Data Transformation: This process involves one or more of smoothing (removing noise from the data), summarization, generalization, and normalization.
For this example, we will employ smoothing, which is simpler than summarization
and normalization. As we can see, in our data the wine consumption per capita for
Italy is unusually high, whereas the same for Norway is unusually low. So, chances
are these are outliers.
In this case, we will replace the value of wine consumption for Italy with 7.900000095. Similarly, for Norway, we will use the value 0.800000012 in place of 0.0800000012. We are treating both of these potential errors as “equipment error” or “entry error,” which resulted in an extra digit for each of these countries (an extra “2” in front of Italy’s value and an extra “0” after the decimal point in Norway’s).
This is a reasonable assumption given the limited context we have about the dataset.
A more practical approach would be to look at the nearest geolocation for which we have values and use those to make predictions about the countries with erroneous entries. So, at the end of this step, the dataset will be transformed into what is shown in Table 6.
Table 5: Integration of the Indian data set
# Country Alcohol Deaths Heart Liver
1 Australia 2.5 785 211 15.30000019
2 Austria 3.000000095 863 167 45.59999847
3 Belg. and Lux. 2.900000095 883 131 20.70000076
4 Canada 2.400000095 793 185 16.39999962
5 Denmark 2.900000095 971 220 23.89999962
6 Finland 0.800000012 970 297 19
7 France 9.100000381 751 11 37.90000153
8 Iceland 0.800000012 743 211 11.19999981
9 Ireland 0.699999988 1000 300 6.5
10 Israel 0.600000024 834 183 13.69999981
11 Italy 27.900000095 775 107 42.20000076
12 Japan 1.5 680 36 23.20000076
13 Netherlands 1.799999952 773 167 9.199999809
14 New Zealand 1.899999976 916 266 7.699999809
15 Norway 0.0800000012 806 227 12.19999981
16 Spain 6.5 724 185 20.27
17 Sweden 1.600000024 743 207 11.19999981
18 Switzerland 5.800000191 693 115 20.29999924
19 UK 1.299999952 941 285 10.30000019
20 US 1.200000048 926 199 22.10000038
21 West Germany 2.700000048 861 172 36.70000076
22 India 2.950000000 750 171 20.27
Table 6: Data Transformation
# Country Alcohol Deaths Heart Liver
1 Australia 2.5 785 211 15.30000019
2 Austria 3.000000095 863 167 45.59999847
3 Belg. and Lux. 2.900000095 883 131 20.70000076
4 Canada 2.400000095 793 185 16.39999962
5 Denmark 2.900000095 971 220 23.89999962
6 Finland 0.800000012 970 297 19
7 France 9.100000381 751 11 37.90000153
8 Iceland 0.800000012 743 211 11.19999981
9 Ireland 0.699999988 1000 300 6.5
10 Israel 0.600000024 834 183 13.69999981
11 Italy 7.900000095 775 107 42.20000076
12 Japan 1.5 680 36 23.20000076
13 Netherlands 1.799999952 773 167 9.199999809
14 New Zealand 1.899999976 916 266 7.699999809
15 Norway 0.800000012 806 227 12.19999981
16 Spain 6.5 724 185 20.27
17 Sweden 1.600000024 743 207 11.19999981
18 Switzerland 5.800000191 693 115 20.29999924
19 UK 1.299999952 941 285 10.30000019
20 US 1.200000048 926 199 22.10000038
21 West Germany 2.700000048 861 172 36.70000076
22 India 2.950000000 750 171 20.27
Data Reduction. The process is aimed at producing a reduced
representation of the dataset that can be used to obtain the same or
similar analytical results.
Thus, reducing the liver column to one decimal place and the alcohol
consumption column to two decimal places would result in the dataset
shown in Table 7.
Note that data reduction does not mean just reducing the size of
attributes – it also may involve removing some attributes, which is
known as feature selection.
However, depending on the model you want to build, you may have
to discretize the attribute values into binary or categorical types.
For example, you may want to discretize the wine consumption per capita into four categories, as shown in Table 8 and in the sketch that follows it.
Table 8: Data Discretization
# Country Alcohol Deaths Heart Liver
1 Australia 2 785 211 15.3
2 Austria 2 863 167 45.6
3 Belg. and Lux. 2 883 131 20.7
4 Canada 2 793 185 16.4
5 Denmark 2 971 220 23.9
6 Finland 0 970 297 19.0
7 France 3 751 11 37.9
8 Iceland 0 743 211 11.2
9 Ireland 0 1000 300 6.5
10 Israel 0 834 183 13.7
11 Italy 3 775 107 42.2
12 Japan 1 680 36 23.2
13 Netherlands 1 773 167 9.2
14 New Zealand 1 916 266 7.7
15 Norway 0 806 227 12.2
16 Spain 3 724 185 20.3
17 Sweden 1 743 207 11.2
18 Switzerland 3 693 115 20.3
19 UK 1 941 285 10.3
20 US 1 926 199 22.1
21 West Germany 2 861 172 36.7
22 India 2 750 171 20.3
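A pandas sketch that reproduces the four alcohol categories in Table 8; the bin edges (0, 1, 2, 5, 10) are an assumption inferred from the table, not given in the original:

import pandas as pd

alcohol = pd.Series([2.5, 0.8, 9.1, 1.5, 6.5, 2.95])  # a few values from the table

# Bin the continuous consumption values into the four category codes 0-3.
codes = pd.cut(alcohol, bins=[0, 1, 2, 5, 10], labels=[0, 1, 2, 3])
print(codes.tolist())  # [2, 0, 3, 1, 3, 2], matching Table 8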
Hands-On Example 2: Data Pre-processing (8)
Prediction and forecasting:
Prediction and forecasting: The next step in Data
Science is data analysis, prediction, and forecasting.
One can do this on an individual level or on a larger scale
for the entire customer database.
Steps in the Data Science Process (2)
The first step of this process is setting a research goal.
Steps in the Data Science Process (3)
Now that you have the raw data, it’s time to prepare it.
Steps in the Data Science Process (4)
The goal of this step is to gain a deep understanding of the
data.
The insights you gain from this phase will enable you to start
modeling.
Create a project charter
A project charter requires teamwork, and your input covers
at least the following:
A clear research goal
A timeline
Summary
Setting the research goal: defining the what, the why, and the how of your project in a project charter.
Retrieving data: finding and getting access to the data needed in the project. This data is either found within the company or retrieved from a third party.
Data preparation: checking and removing data errors, enriching the data with data from other data sources, and transforming it into a suitable format for your models.
Data exploration: diving deeper into the data using descriptive statistics and visual techniques.
Data modeling: using machine learning and statistical techniques to achieve the project goal.
Presentation and automation: presenting the results to the stakeholders and industrializing your analysis process for repetitive reuse and integration with other tools.
Lecture Five: Rudiments of Data Visualization
Rudiments of Data Visualization
Data visualization is the process of creating graphical
representations of information.
Data visualization techniques
Pie Chart: ideal for illustrating proportions, or part-to-whole comparisons. Pie charts are relatively simple and easy to read. They are best suited for audiences who might be unfamiliar with the information or are only interested in the key takeaways.
Data visualization techniques (2)
Bar Chart: a visualization technique where one axis of the chart shows the categories being compared, and the other a measured value.
The length of the bar indicates how each group measures
according to the value.
Data visualization techniques (3)
Histogram: A histogram is a graphical representation of
the distribution of numerical data.
It groups data into bins and displays the frequency of the
number of observations in each bin.
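A minimal matplotlib sketch of a histogram; the height values are made up for illustration:

import matplotlib.pyplot as plt

heights = [58, 60, 61, 62, 63, 63, 64, 65, 65, 66, 67, 68, 70, 72]

# Group the data into bins and plot the frequency of each bin.
plt.hist(heights, bins=7, edgecolor="black")
plt.xlabel("Height (inches)")
plt.ylabel("Frequency")
plt.title("Distribution of heights")
plt.show()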
Data visualization techniques (4)
Gantt Chart: A Gantt chart is a type of bar chart that
illustrates a project schedule.
Data visualization techniques (5)
Heat Map: A heat map is a graphical representation of
data where the individual values contained in a matrix
are represented as colors.
They are useful for visualizing large amounts of data
and identifying patterns.
Data visualization techniques (6)
Box and Whisker Plot: A box and whisker plot is a
graphical representation of the distribution of numerical
data.
It displays the median, quartiles, and outliers of the data
Data visualization techniques (7)
Scatter Plot: A scatter plot is a graphical representation
of the relationship between two variables.
It is used to identify trends and relationships between
variables.
Data visualization techniques (8)
Pictogram Chart: A pictogram chart is a type of chart
that uses icons or pictures to represent data.
It is useful for presenting data in a visually appealing way
Data visualization techniques (9)
Timeline: A timeline is a graphical representation of a
chronological sequence of events.
It is useful for presenting historical data or project
schedules.
Data visualization techniques (10)
Word Cloud: A word cloud is a visual representation of
text data.
It displays the most frequently occurring words in a text
document as larger words.
Definitions
Machine Learning is the design of algorithms and models that analyze data based on defined metrics and learn from experience, so that the data science process can be automated and improved without being continuously reprogrammed. These algorithms learn to make predictions about unforeseen/future data.
Data distribution refers to the way data is spread out or
distributed over a range of values.
Predictive Analytics
Descriptive Statistics
Descriptive statistics describe, show, and
summarize the basic features of a dataset found in
a given study.
Types of Descriptive Statistics
Distribution (also called Frequency Distribution)
Measures of Central Tendency
Measures of central tendency estimate a dataset’s average or center. There are three common measures: the mean, the median, and the mode.
Mean. The mean is also known as “M” and is the most common method for finding the average.
You get the mean by adding all the response values together and dividing the sum by the number of responses, or “N.”
For instance, say someone is trying to figure out how many hours a day they sleep in a week.
The data set would be the hour entries (e.g., 6, 8, 7, 10, 8, 4, 9), and the sum of those values is 52.
There are seven responses, so N = 7.
You divide the sum of 52 by N = 7 to find M, which in this instance is about 7.4.
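Python’s statistics module computes all three measures directly; a quick check on the sleep example above:

import statistics

hours = [6, 8, 7, 10, 8, 4, 9]

print(statistics.mean(hours))    # 7.428..., about 7.4
print(statistics.median(hours))  # 8 (middle of the sorted values)
print(statistics.mode(hours))    # 8 (most frequent value)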
Measures of Central Tendency (2)
Mode. The mode is simply the most frequent response value; in the sleep data set, that is eight.
Median. To find the median, look for the number in the set’s middle. In this case, the median is eight.
Measures of Central Tendency (4)
Variability (also called Dispersion)
The measure of variability gives the statistician an idea of how spread out the responses are. The spread has three aspects: range, standard deviation, and variance.
Consider the data set: 14, 19, 100, 27, 54, 52, 93, 50, 61, 87, 68, 85, 75, 82, 95
Step 1: Order the data set: 14, 19, 27, 50, 52, 54, 61, 68, 75, 82, 85, 87, 93, 95, 100
Step 2: Find the median, the middle value of the ordered data set: here it is 68.
Step 3: Find the minimum, the leftmost (first) value of the ordered data set: 14. The maximum is the last value, 100, so the range is 100 − 14 = 86.
Lecture Six: Use Cases in Various Domains
Use Cases in Various Domains
- image
- natural language
- audio
- video.
Use Cases in Various Domains (3)
Natural Language: Data science can be used to
analyze and understand natural language.
The components of a data mining system include the following:
Knowledge Base:
This is the domain knowledge that is used to guide the
search or evaluate the interestingness of resulting patterns.
User Interface:
This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results.
Data Mining Tasks (3)
Classification: the assignment of labels to objects for identification, or the assigning of categories to objects. For example, an e-mail program might attempt to classify an e-mail as “legitimate” or as “spam.” A minimal sketch of such a classifier follows.
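A minimal scikit-learn sketch of the spam example; the tiny corpus and labels are made up, and a bag-of-words Naive Bayes model is just one reasonable choice:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win money now", "meeting at noon",
          "cheap pills win", "project report attached"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = legitimate

# Turn each e-mail into word counts, then fit a Naive Bayes classifier.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)
model = MultinomialNB().fit(X, labels)

print(model.predict(vectorizer.transform(["win cheap money"])))  # expect [1]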
Data Mining Tasks (4)
Regression: modeling the relationship between features in order to predict numeric values.
Basic introduction to knowledge extraction (2)
Business Intelligence (BI) is a set of tools and techniques
used to transform raw data into meaningful and useful
information for business analysis purposes.
Basic introduction to knowledge extraction (3)
Introduction to Big Data Integration and Intelligence
Big data integration is the practice of using people,
processes, suppliers, and technologies to retrieve,
reconcile, and make better use of data from various
sources for decision support.
Brief History of Big Data
Characteristics of Big Data (2)
How much data is generated each minute?
Characteristics of Big Data (3)
Variety: involves collecting data from various sources and in heterogeneous types and formats.
Introduction to R programming
R is an open-source programming language that is widely used as statistical software and as a data analysis tool.
It was designed by Ross Ihaka and Robert Gentleman at
the University of Auckland, New Zealand, and is currently
being developed by the R Development Core Team.
R programming is used as a leading tool for machine
learning, statistics, and data analysis.
It provides an extensive collection of graphical and
statistical techniques, making it a preferred choice for
statisticians and data analysts.
R is available across platforms such as Windows, Linux, and macOS.
Setting Up R
R is a free, open-source environment for data analysis.
It is available as a free binary download for Mac,
Linux and Windows.
To install R on your computer, go to https://cran.r-project.org/index.html and download and install the appropriate binary file.
Setting Up R (2)
Setting Up RStudio
Setting Up RStudio (2)
Working with R
When you installed R, a few illustrative datasets were installed as well. Write the following command in the console: data()
In R, the first way to get help with something is to use the ? operator.
Working with R
Let’s see what the swiss data looks like using the RStudio data viewer.
Working with R
What value is produced when R evaluates the expression swiss?
When we write the expression swiss, we tell R we want the value referenced by the name swiss, that is, the data itself!
Working with R
On your own: use the Data Viewer to explore another of the datasets you saw listed before.
Plotting in RStudio
Let’s make a plot using the swiss dataset:
Plotting in RStudio
Plotting in RStudio
We can then evaluate the expressions in the file one at a
time, or all at the same time.
R Basic
Try a few expressions on the R console:

> 2+2
[1] 4
> x=2
> y=2
> z=x+y
> print(z)
[1] 4
> 2>3
[1] FALSE
> 2==2
[1] TRUE

Now, let us write a small program on the R console:

> year = 2020
> if (year%%4==0)
+ print('Leap year')
[1] "Leap year"

Here we assigned a value to the variable "year" and checked if it is divisible by 4 using the modulus operator (%%).
Now let us put this code in a file. In RStudio, select File > New File > R Script. This should open an editor where you can type your code. There, write the following:
PYTHON
Download and Install Python
A screenshot of Anaconda
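Once Python (for example, via Anaconda) is installed, a quick sanity check that the interpreter works is to run `python --version` in a terminal, or to execute this short script:

import sys

# Print a greeting and the interpreter version to confirm the install.
print("Hello, Data Science!")
print(sys.version)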