Data Analytics Program - Introduction To Data Analytics - Lesson 1
ANALYTICS PROGRAM
Learning Schedule
4. Programming Fundamentals (R/Python)
8. Data Visualization
9. Holistics/Bigquery/Tableau
11. Predictive Analytics 2 - Deep Learning
12. Data Analytics Capstone Project
DATA ANALYTICS
[Figure: “Data Analytics”, past 5 years, Worldwide and Vietnam]
INTRODUCTION TO DATA ANALYTICS lessons
01 What should we do with Data?
1. Data Analysis vs Data Analytics
2. Understanding Data
3. Data issues
4. Data formatting
5. Data Blending
02 Data Analytics Basics
1. Think like a Data Scientist
2. Do you need all that data?
3. Importance of Segmentation of your Analytics
4. Know the difference between your data & your metrics
5. Can your data be trusted?
6. Pitfalls of data-driven decisions
7. Why it’s so hard for us to communicate uncertainty
03 Data Analytics Framework
1. Approach Frameworks
2. CRISP-DM Frameworks
3. Strategic Roadmap
04 Getting Insights from Data
1. Data Analytics Process
2. Data Quality
3. Descriptive
4. Predictive
5. Prescriptive
6. Semantic
Lesson 1
WHAT SHOULD WE DO WITH DATA?
Lesson 1: What should we do with Data?
Data Analysis vs Data Analytics
5 months ago, Bank ABC's loan portfolio decreased by a total of 10.200 bio. VND due to attrition.
(Ending Loan portfolio = Beginning Loan + New loan – Attrition – Maturity)
Top 4 reasons for attrition at the bank:
(1) Dissatisfaction with services (50%)
(2) Lower rates at other banks (30%)
(3) Switching to another loan package within the bank (10%)
(4) Death (10%)
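The loan portfolio formula on this slide can be sketched in a few lines of Python. The function name and all figures below are hypothetical and chosen only to make the arithmetic easy to follow; they are not Bank ABC's numbers.

```python
def ending_loan_portfolio(beginning, new_loans, attrition, maturity):
    """Ending Loan portfolio = Beginning Loan + New loan - Attrition - Maturity."""
    return beginning + new_loans - attrition - maturity


# Hypothetical figures (same currency unit for every term).
ending = ending_loan_portfolio(beginning=100, new_loans=20, attrition=15, maturity=5)
print(ending)
```

Attrition and maturity both reduce the portfolio, so a large attrition figure can shrink the ending balance even when new lending is healthy.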
Understanding Data – Categories of Data
Understanding Data – Data Sources
Understanding Data – Data Types
String: String data can be declared in a number of different ways depending on the character set required and the anticipated length of the string: any kind of characters, alphanumeric, including symbols.
Numeric: Numeric data are numbers, which can be whole numbers, such as Integers, or numbers with decimal places (Byte, Integer, Fixed Decimal, Float, Double).
Date/time: Date/time contains a specific date, or a combination of both date and time.
Boolean: The Boolean type is sometimes also called a logical type and is a conditional flag representing either true or false.
Other data types: Images, Maps, Report objects, Sound.
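As a rough illustration of these data types, the sketch below converts a record of raw strings into typed values using only the Python standard library. The field names and values are hypothetical, and the mapping from the slide's type names to Python types (for example Fixed Decimal to `Decimal`) is an assumption made for this example.

```python
from datetime import datetime
from decimal import Decimal

# Hypothetical raw record: everything arrives as a string.
raw = {
    "name": "Bank ABC",
    "branches": "12",
    "rate": "7.5",
    "balance": "1050.25",
    "opened": "2021-03-15 09:30",
    "active": "true",
}

typed = {
    "name": raw["name"],                                           # String
    "branches": int(raw["branches"]),                              # Integer
    "rate": float(raw["rate"]),                                    # Float / Double
    "balance": Decimal(raw["balance"]),                            # Fixed Decimal
    "opened": datetime.strptime(raw["opened"], "%Y-%m-%d %H:%M"),  # Date/time
    "active": raw["active"].lower() == "true",                     # Boolean
}

for field, value in typed.items():
    print(field, type(value).__name__, value)
```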
Understanding Data – Data Types Exercise
Data Issues – Types of Data Issues
Data Issues:
Dirty Data
Missing data
Outliers
Data Issues – Dirty Data
Dirty Data is data that contains some kind of error, or that is in a format that’s unfriendly or unusable
Data Issues – Dirty Data: Parsing Data (Example)
Data Issues – Dirty Data: Extra Characters
Extra characters can be currency symbols, number signs… We need to remove these before
converting between field types
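A minimal sketch of removing extra characters before a type conversion is shown below. The helper name `to_number` is hypothetical, and it assumes a period is the decimal separator and commas are thousands separators, which will not hold for every locale.

```python
import re


def to_number(text):
    """Drop everything except digits, the minus sign, and the decimal point."""
    cleaned = re.sub(r"[^0-9.\-]", "", text)
    return float(cleaned)


print(to_number("$1,250.50"))
print(to_number("# 42"))
```

Only after the symbols are stripped can the field be changed from a string type to a numeric type.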
Data Issues – Dirty Data: Extra Characters (Example)
Data Issues – Dirty Data: Duplicate Data - Example
Real Data
Downward BIAS
Data Issues – Solutions for Missing data
SOLUTIONS
1. Deleting Missing Data
2. Imputation
3. Advanced methods
Data Issues – Missing data: Deleting Missing Data
Imputation
In statistics, imputation is the process of
substituting values in the data where the
values are missing (when we impute values,
we are making them up). We are creating
fake data in order to develop a model
that makes sense and is as close to
reality as we can get it
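The first two solutions for missing data can be sketched as follows. The ages are hypothetical, and mean imputation is only one simple imputation choice picked for illustration; the slides do not say which method their example uses.

```python
from statistics import mean

# Hypothetical column of ages; None marks a missing value.
ages = [34, None, 29, 41, None, 38]

# Solution 1: delete the missing observations.
observed = [a for a in ages if a is not None]

# Solution 2: impute, here by substituting the mean of the observed values.
fill_value = mean(observed)
imputed = [a if a is not None else fill_value for a in ages]

print(observed)
print(imputed)
```

Deleting keeps only real values but shrinks the sample, while imputing keeps every row at the cost of introducing made-up values.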
Data Issues – Missing data: Imputation (Example)
Data Issues – Missing data: Advanced methods
If your business and results could be significantly off by using a simpler method,
you might want to explore these options
Identifying outliers in the data helps us understand how vulnerable our model would be to a small
set of observations.
Data Issues – Outliers: Identify
Identify outliers methodically rather than simply eyeballing them:
Violin Plot: shows the volume of the distribution
Others: z-scores or standard deviations
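A z-score check can be sketched with the standard library alone. The function name, the income values, and the threshold of 2 standard deviations are all assumptions for this example; a threshold of 3 is also common, but in a sample this small no single point can exceed it when the population standard deviation is used.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, nothing can be an outlier
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical incomes with one extreme value.
incomes = [30, 32, 31, 29, 33, 30, 31, 32, 30, 200]
print(zscore_outliers(incomes, threshold=2.0))
```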
Data Issues – Outliers: Identify
If the outlier could be correct but is just abnormal, then the analysis and modeling process SHOULD INCLUDE that data. That said, it is legitimate to create models without the data as well to compare results, but it should be noted which models do and don’t contain the outliers (2 options: include outliers vs exclude outliers).
If the outliers didn’t change the results and the regression line retained its original slope, then it can be legitimate to remove that observation.
Data Issues – Outliers: Dealing with outliers
4. Truncation
Where we know that a certain value can only be below a given maximum and yet a value is reported above that.
NOTE: We see here where age and income are fairly random with no association between how old a person is and how much income they have. But the outlier creates the slope of the line by just being present... so without outlier 1 (row 10), we have a steep positive slope, but without outliers 2 (row 14) and 3 (row 15), we have a negative slope.
In other words, without the outlier we wouldn’t really be able to draw a legitimate line at all, but the presence of the outlier is what creates the model effect. In cases such as this, we should definitely remove the outlier and investigate other predictor variables.
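The effect described in the note, where a single outlier creates the slope of the regression line, can be reproduced with a small least-squares sketch. The ages and incomes below are hypothetical and are not the rows from the slide.

```python
def slope(xs, ys):
    """Ordinary least-squares slope of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical data: the last point is the outlier.
ages = [25, 30, 35, 40, 45, 70]
incomes = [50, 48, 52, 49, 51, 150]

slope_with = slope(ages, incomes)
slope_without = slope(ages[:-1], incomes[:-1])
print(slope_with, slope_without)
```

With the outlier the fitted line is steep, while without it the slope is close to zero, which is the signal that the outlier alone is producing the apparent relationship.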
Data Formatting
1. Transposing
2. Aggregating Data
3. Cross Tabulation
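The three formatting operations listed above can be sketched in plain Python. The records, field names, and amounts are hypothetical, and summing is just one possible aggregation.

```python
from collections import defaultdict

# Hypothetical transaction records.
rows = [
    {"region": "North", "product": "Loan", "amount": 100},
    {"region": "North", "product": "Card", "amount": 40},
    {"region": "South", "product": "Loan", "amount": 70},
    {"region": "South", "product": "Loan", "amount": 30},
]

# 1. Transposing: swap the rows and columns of a small table.
table = [["region", "North", "South"], ["amount", 140, 100]]
transposed = [list(column) for column in zip(*table)]

# 2. Aggregating: total amount per region.
totals = defaultdict(int)
for row in rows:
    totals[row["region"]] += row["amount"]

# 3. Cross tabulation: total amount for each region/product combination.
crosstab = defaultdict(lambda: defaultdict(int))
for row in rows:
    crosstab[row["region"]][row["product"]] += row["amount"]

print(transposed)
print(dict(totals))
print({region: dict(products) for region, products in crosstab.items()})
```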
Data Formatting - Transposing
Data Formatting – Transposing - Example
Data Formatting - Aggregating Data
Data Formatting - Aggregating Data - Example
Data Formatting - Cross Tabulation
Data Formatting - Cross Tabulation - Example
Data Blending
Unioning allows you to take multiple datasets and deal with them as one
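Unioning can be sketched as stacking datasets that share the same columns. The monthly datasets and the helper name `union` are hypothetical; the column check is an assumption about what a sensible union should enforce.

```python
# Hypothetical monthly datasets with identical columns.
january = [{"customer": "A", "amount": 10}, {"customer": "B", "amount": 25}]
february = [{"customer": "A", "amount": 15}, {"customer": "C", "amount": 5}]


def union(*datasets):
    """Stack datasets row by row, checking that every row has the same columns."""
    combined = [row for dataset in datasets for row in dataset]
    columns = set(combined[0]) if combined else set()
    if any(set(row) != columns for row in combined):
        raise ValueError("datasets must share the same columns to be unioned")
    return combined


all_months = union(january, february)
print(len(all_months), sum(row["amount"] for row in all_months))
```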
Data Blending – Joining Datasets
Data Blending – Fuzzy Matching
Fuzzy Matching will enable you to join 2 data sets
together where a regular join may fail. The Fuzzy
Match identifies records with similar string values
in specified fields.
The match threshold is the minimum score achieved by the fuzzy matching for
it to be considered to be a match
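A fuzzy join with a match threshold can be sketched using `difflib.SequenceMatcher` from the Python standard library. This is only one possible similarity score and not necessarily the one a given fuzzy matching tool uses; the names and the 0.8 threshold are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity score between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_join(left, right, threshold=0.8):
    """Pair each left value with its best right match, if the score reaches the threshold."""
    matches = []
    for value in left:
        best = max(right, key=lambda candidate: similarity(value, candidate))
        score = similarity(value, best)
        if score >= threshold:
            matches.append((value, best, round(score, 2)))
    return matches


# Hypothetical customer names from two systems.
crm_names = ["Jon Smith", "Acme Corp.", "Maria Garcia"]
billing_names = ["John Smith", "ACME Corporation", "Peter Pan"]

matches = fuzzy_join(crm_names, billing_names, threshold=0.8)
print(matches)
```

Raising the threshold reduces false matches but misses more true ones, so the threshold is usually tuned by inspecting borderline pairs.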
Data Blending – Spatial Matching
There aren’t fields that can be used to join them together.
Gray area: to find how many customers fall within a store trade area, match the customers to the area and assign a store number to them.
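Spatial matching can be sketched by assigning each customer to the nearest store when the customer lies inside that store's trade area. Modelling the trade area as a circle of fixed radius is an assumption of this example, and the store IDs, customer IDs, and coordinates are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def assign_store(customer_xy, stores, radius_km=5.0):
    """Return the nearest store's ID if it is within the trade-area radius, else None."""
    store_id, store_xy = min(stores.items(), key=lambda s: haversine_km(*customer_xy, *s[1]))
    return store_id if haversine_km(*customer_xy, *store_xy) <= radius_km else None


# Hypothetical (latitude, longitude) pairs.
stores = {"S1": (10.7769, 106.7009), "S2": (21.0278, 105.8342)}
customers = {"C1": (10.7800, 106.6950), "C2": (21.0300, 105.8500), "C3": (16.0544, 108.2022)}

assignments = {cid: assign_store(xy, stores) for cid, xy in customers.items()}
print(assignments)
```

Once each customer carries a store number, the two datasets can be joined on that field like any other key.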
Data Blending – Spatial Blending - Example
Customer Information
Spatial Data
THANK YOU