Lecture - 04 - Data Understanding and Preparation
Data Analytics
Lecture 4
Data Understanding & Preparation
1
CRISP-DM, Phases and Tasks
Today’s Course: Data Understanding and Data Preparation
• Business Understanding: Determine Business Objectives, Assess Situation, Determine Data Mining Goals, Produce Project Plan
• Data Understanding: Collect Initial Data, Describe Data, Explore Data, Verify Data Quality
• Data Preparation: Select Data, Clean Data, Construct Data, Integrate Data, Format Data
• Modeling: Select Modeling Technique, Generate Test Design, Build Model, Assess Model
• Evaluation: Evaluate Results, Review Process, Determine Next Steps
• Deployment: Plan Deployment, Plan Monitoring & Maintenance, Produce Final Report, Review Project
2
Table of contents
1. Introduction
2. Data cleaning
3. Data reduction
4. Data transformation
3
Introduction
• Data understanding and data preparation (2nd and 3rd phases of CRISP-DM
standard process)
▪ Evaluate the quality of the data
▪ Clean the raw data and deal with missing data
▪ Integrate different sources
▪ Reduce data where necessary
▪ Perform transformations on certain variables
4
Introduction: What is data?
5
Introduction: Attribute values
6
Data quality
• Data have quality if they satisfy the requirements of the intended use
• Measures for data quality: A multidimensional view
▪ Accuracy: correct or wrong, accurate or not
▪ Completeness: not recorded, unavailable, …
▪ Consistency: some modified but some not, dangling, …
▪ Timeliness: timely update?
▪ Believability: how trustable is the data?
▪ Interpretability: how easily the data can be understood?
• Two different users may have very different assessments of the quality of a
given dataset
7
Data Preparation
• Much of the raw data contained in databases is unprocessed, incomplete, and noisy
▪ Poor quality data result in incorrect and unreliable data mining results
▪ In recent years we have moved from “all signal” to “mostly noise” data!
• Data preparation (or data preprocessing) means the manipulation of data into a form suitable for
further analysis and processing
▪ Objective: minimize garbage in, garbage out (GIGO). It improves the quality of data and
consequently helps improve the quality of data mining results
• Depending on the dataset, data preprocessing alone can account for 10-60% of all time and effort
of the data mining process (data preparation activities are routine, tedious, and time consuming)
8
Major tasks in Data Preparation
9
Table of contents
1. Introduction
2. Data cleaning
3. Data reduction
4. Data transformation
10
Data cleaning
• Data cleaning (cleansing or scrubbing) routines attempt to fill in missing values, smooth
out noise while identifying outliers, and correct inconsistencies in the data
11
Missing data
12
Missing data
13
Handling Missing Data
14
Handling missing data: Imputation
➢ Manual imputation: tedious and often infeasible
➢ Automatic imputation using one of the following (a minimal code sketch follows below):
▪ a global constant, e.g., “unknown” (effectively a new class)
▪ the attribute mean, the local mean, or the moving average
▪ the attribute mean for all samples belonging to the same class (smarter)
▪ the most probable value: inference-based, e.g., Bayesian methods, decision trees, or regression
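As an illustration of the simpler automatic strategies above, here is a minimal pandas sketch; the column names income and risk_class and all values are hypothetical:

import pandas as pd
import numpy as np

# Hypothetical data frame with missing values in the numeric column "income"
df = pd.DataFrame({
    "risk_class": ["low", "low", "high", "high", "low"],
    "income": [40_000, np.nan, 55_000, np.nan, 38_000],
})

# 1) Impute with a global constant (use a string such as "unknown" for categoricals)
df["income_const"] = df["income"].fillna(-1)

# 2) Impute with the overall attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# 3) Impute with the mean of the samples belonging to the same class (smarter)
df["income_class_mean"] = df["income"].fillna(
    df.groupby("risk_class")["income"].transform("mean")
)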
15
Noisy data
16
How to handle noisy data (outliers)?
• Binning (see the sketch below):
▪ first sort the data and partition it into (equal-frequency) bins
▪ then smooth by bin means, bin medians, bin boundaries, etc.
• Regression: smooth by fitting the data to regression functions
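A minimal sketch of smoothing by equal-frequency binning; the data values are made up for illustration:

import pandas as pd

values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Partition into 3 equal-frequency bins
bins = pd.qcut(values, q=3, labels=False)

# Smooth by bin means: replace each value by the mean of its bin
smoothed_by_mean = values.groupby(bins).transform("mean")

# Smooth by bin medians
smoothed_by_median = values.groupby(bins).transform("median")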
17
How would you handle this?
18
Identifying misclassifications
• A frequency distribution can be used to verify that all labels of a categorical
variable are valid and consistent
• Example (Larose & Larose, 2015, p. 26)
19
The normal distribution: Review
• Bell shaped
• Symmetrical
• Mean, median and mode are equal
• Location is determined by the mean, μ
• Spread is determined by the standard deviation, σ
• The random variable has an infinite theoretical range: −∞ to +∞
[Figure: the normal density f(x), centered at the mean μ with spread σ]
20
The Standard Normal Distribution: Review
• Also known as the “z” distribution
• Mean is defined to be 0
• Standard deviation is 1
• Values above the mean have positive z-values
• Values below the mean have negative z-values
• 95.0% of the scores fall between z = −1.96 and z = +1.96
• 99.9% of the scores fall between z = −3.30 and z = +3.30
• Translate from x to the standard normal (the “z” distribution) by subtracting the mean of x and dividing by its standard deviation:

z = (x − μ) / σ

[Figure: the standard normal density f(z), with −1.96, 0 and +1.96 marked on the z axis]
21
22
Graphical methods for identifying outliers
➢ Outliers are extreme values that go against the trend of the remaining data.
➢ Certain statistical methods are sensitive to the presence of outliers and may deliver
unreliable results
➢ Graphical methods for identifying outliers for numeric variables include:
▪ Histograms
▪ Box Plot
▪ Scatter diagram
23
Histogram
❖HISTOGRAM: A graph in which the classes are marked on the horizontal axis and the class
frequencies on the vertical axis. The class frequencies are represented by the heights of
the bars and the bars are drawn adjacent to each other.
[Figure: histogram, with bins on the horizontal axis]
Histogram: Discovering Outliers
[Figure: histogram in which an isolated bar, far from the main body of the data, indicates an outlier]
Box Plot: Percentiles & Quartiles
➢ The pth percentile of a data set is a value such that at least p percent of the items take on this value
or less and at least (100 - p) percent of the items take on this value or more.
▪ Arrange the data in ascending order.
▪ Compute index i, the position of the pth percentile.
i = (p/100)n
▪ If i is not an integer, round up. The pth percentile is the value in the ith position.
▪ If i is an integer, the pth percentile is the average of the values in positions i and i+1.
➢ Quartiles are specific percentiles
▪ First Quartile = 25th Percentile
▪ Second Quartile = 50th Percentile = Median
▪ Third Quartile = 75th Percentile
➢ IQR: interquartile range = Q3 − Q1
▪ A measure of the spread of the middle 50% of the data.
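A minimal sketch implementing this textbook percentile rule (note that numpy’s default percentile interpolation can give slightly different values):

import math

def percentile(data, p):
    """pth percentile using the i = (p/100)*n rule described above."""
    x = sorted(data)
    n = len(x)
    i = (p / 100) * n
    if i != int(i):                      # i is not an integer: round up
        return x[math.ceil(i) - 1]       # value in the ith position (1-based)
    i = int(i)
    return (x[i - 1] + x[i]) / 2         # i is an integer: average positions i and i+1

data = [68, 70, 71, 72, 74, 75, 76, 78, 80, 81]
q1, q3 = percentile(data, 25), percentile(data, 75)
print(q1, q3, q3 - q1)                   # Q1, Q3 and the IQR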
Box Plot
[Figure: box plot, with the IQR (interquartile range) spanning Q1 to Q3]
27
Box Plot: Identifying outliers
• A robust rule for identifying outliers (univariate case) is the following: a
data value is an outlier if
▪ it is located 1.5(IQR) or more below Q1, or
▪ it is located 1.5(IQR) or more above Q3.
• Example: Q1 = 70, Q3 = 80. Identify the boundaries beyond which values are
considered outliers (see the sketch below):
• IQR = 80 − 70 = 10
• A value is identified as an outlier if:
• it is lower than Q1 − 1.5·IQR = 55, or
• it is higher than Q3 + 1.5·IQR = 95
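The same rule in code, as a minimal numpy sketch; the sample values are made up, and numpy’s default quantile interpolation may differ slightly from the textbook quartile rule:

import numpy as np

x = np.array([68, 70, 71, 72, 74, 75, 76, 78, 80, 81, 120])

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = x[(x < lower) | (x > upper)]   # here the value 120 is flagged
print(lower, upper, outliers)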
28
Box Plot: Z-Score Example
29
The column ZReactionTime contains the Z scores for ReactionTime.
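A minimal sketch of how such a z-score column could be computed and used to flag unusual values; the reaction-time values are made up, and |z| > 3 is one common cutoff rather than something specified on the slide:

import pandas as pd

df = pd.DataFrame({"ReactionTime": [0.42, 0.48, 0.51, 0.55, 0.60, 1.90]})

mean = df["ReactionTime"].mean()
std = df["ReactionTime"].std()

df["ZReactionTime"] = (df["ReactionTime"] - mean) / std

# Flag observations whose z-score is unusually large in absolute value
df["is_outlier"] = df["ZReactionTime"].abs() > 3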
32
Correlation vs. Possible Relationship Between Variables
❖ Direct cause and effect,
▪ e.g., water causes plants to grow
❖ Both cause and effect,
▪ e.g., coffee consumption causes nervousness, and nervous people also drink more coffee.
❖ Relationship caused by third variable;
▪ Death due to drowning and soft drink consumption during summer.
• Both variables are related to heat and humidity (third variable).
• This is dangerous (Why?)
❖ Coincidental relationship;
▪ Increase in the number of people exercising and increase in the number of people committing crimes.
• This is even more dangerous (Why?)
❖ Correlation measures association and not causation.
33
Other problems
❑ Other data problems which require data cleaning
▪ duplicate records
▪ incomplete data
▪ inconsistent data
34
Table of contents
1. Introduction
2. Data cleaning
3. Data reduction
4. Data transformation
36
Data Reduction
37
Data reduction: keep in mind the following tradeoffs
❖ Data reduction helps reduce the processing time of the two subsequent phases
(modeling & evaluation)
▪ Ideally, we should not throw away data unless we are sure it is not essential
▪ The number of observations in the dataset should be at least 6 times the
number of attributes
• This is a so-called rule of thumb (not an exact rule)
• Need sufficient observations in order to obtain reliable modeling results
38
Table of contents
1. Introduction
2. Data cleaning
3. Data reduction
4. Data transformation
39
Data transformation
• Data are transformed or consolidated to forms appropriate for mining
• Strategies for data transformation include the following
▪ Smoothing
▪ Attribute construction
▪ Aggregation
▪ Normalization
▪ Discretization
▪ Concept hierarchy generation for nominal data
40
Normalization
• Variables tend to have ranges that vary greatly from each other
• The measurement unit used can affect the data analysis
• For some data mining algorithms, differences in ranges will lead to a tendency for the
variable with greater range to have undue influence on the results
• Data miners should normalize their numeric variables in order to standardize the scale
of effect each variable has on the results
41
Normalization
• Normalizing the data attempts to give all attributes an equal weight
• The terms standardize and normalize are used interchangeably in data preprocessing
• Algorithms that make use of distance measures, such as the k-nearest neighbors
algorithm, benefit from normalization
• Notation:
▪ X : original field value
▪ X*: normalized field value
42
Min-max normalization
• Performs a linear transformation on the original data
• Min-max normalization works by seeing how much greater the field value is
than the minimum value, min(X), and scaling this difference by the range:
X*_mm = (X − min(X)) / range(X) = (X − min(X)) / (max(X) − min(X))
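A minimal numpy sketch of min-max normalization; the income values are hypothetical but chosen so that 73,600 reproduces the worked example a few slides ahead:

import numpy as np

x = np.array([12_000, 47_000, 54_000, 73_600, 98_000], dtype=float)

x_mm = (x - x.min()) / (x.max() - x.min())
print(x_mm)   # 73,600 maps to (73600 - 12000) / (98000 - 12000) ≈ 0.716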
43
Z-score normalization
• Also called zero-mean normalization
• Z-score standardization works by taking the difference between the field
value and the field mean value, and scaling this difference by the standard
deviation of the field values:
z-score = (X − mean(X)) / SD(X)
• The z-score normalization is useful when the actual minimum and maximum
of an attribute X are unknown, or when there are outliers that dominate the
min-max normalization
• A variation of the z-score normalization replaces the standard deviation of
X by the mean absolute deviation of X
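A minimal sketch of z-score normalization and of the mean-absolute-deviation variant mentioned above; the values are the same hypothetical incomes as before:

import numpy as np

x = np.array([12_000, 47_000, 54_000, 73_600, 98_000], dtype=float)

# Standard z-score normalization
z = (x - x.mean()) / x.std()

# Variant: scale by the mean absolute deviation instead of the standard deviation
mad = np.mean(np.abs(x - x.mean()))
z_mad = (x - x.mean()) / mad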
44
Decimal scaling
❖ Decimal scaling ensures that every normalized value lies
between -1 and 1
X*_decimal = X / 10^d
• d represents the number of digits in the data value with the largest absolute value
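A minimal sketch of decimal scaling; the values are made up, with −986 and 917 echoing the example on the following slide:

import numpy as np

x = np.array([-986, 120, 917], dtype=float)

# d = number of digits in the data value with the largest absolute value
d = len(str(int(np.abs(x).max())))
x_dec = x / 10 ** d

print(x_dec)   # [-0.986  0.12   0.917]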
45
Normalization-examples
• Suppose that the minimum, maximum, mean, and standard deviation of the
values for the attribute income are $12000, $98000, $54000 and $16000.
Transform a value of $73600 for income using:
▪ Min-max normalization: a value of $73600 for income is transformed to 0.716
▪ Z-score normalization: a value of $73600 for income is transformed to 1.225
• Suppose that the recorded values range from −986 to 917. To normalize by
decimal scaling, we therefore divide each value by 1000 = 10^3
▪ −986 normalizes to −0.986 and 917 normalizes to 0.917
46
Normalization — remarks
• Normalization can change the original data quite a bit,
especially when using the z-score normalization or decimal
scaling
• It is necessary to save the normalization parameters (e.g., the
mean and standard deviation if using z-score normalization) so
that future data can be normalized in a uniform manner
• The normalization parameters now become model parameters, and the same
values should be used when the model is applied to new data (e.g., testing
data)
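A minimal sketch of what this means in practice, assuming z-score normalization with numpy; the training and new values are made up:

import numpy as np

train = np.array([12_000, 47_000, 54_000, 73_600, 98_000], dtype=float)
new_data = np.array([30_000, 105_000], dtype=float)

# "Fit": compute the normalization parameters on the training data only
mu, sigma = train.mean(), train.std()

# "Apply": reuse the SAME parameters for any future / test data
train_z = (train - mu) / sigma
new_z = (new_data - mu) / sigma      # do NOT recompute mean/std on the new data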
47
Transformations to achieve normality
• Some data mining algorithms and statistical methods require that the variables be
normally distributed
• Z-score transformation does not achieve normality
[Figure: histograms of a variable before and after the z-score transformation; the shape of the distribution is unchanged]
48
Transformations to achieve normality
❖ The skewness of a distribution is measured by
sk = 3 (mean − median) / std
▪ Much real-world data is right skewed, including most financial data
➢ right skewed : sk >0
▪ Left-skewed data is not as common (often occurs when the data is right-censored)
➢ Left skewed : sk <0
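A minimal numpy sketch of this skewness measure, using made-up right-skewed values:

import numpy as np

x = np.array([1, 2, 2, 3, 3, 4, 5, 9, 15, 40], dtype=float)

sk = 3 * (x.mean() - np.median(x)) / x.std()
print(sk)    # positive, i.e. right skewed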
49
Transformations to achieve normality
• Common transformations to achieve normality (i.e., make the data more normally
distributed by reducing skewness):
➢ ln(x)
➢ Sqrt(x)
➢ 1/x
➢ 1/sqrt(x)
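A minimal sketch applying these transformations and comparing the skewness statistic before and after, reusing the made-up right-skewed values above; note that ln(x) and 1/sqrt(x) require strictly positive data:

import numpy as np

x = np.array([1, 2, 2, 3, 3, 4, 5, 9, 15, 40], dtype=float)

def skew(v):
    # sk = 3 (mean - median) / std, as defined on the previous slide
    return 3 * (v.mean() - np.median(v)) / v.std()

for name, t in [("original", x),
                ("ln(x)", np.log(x)),
                ("sqrt(x)", np.sqrt(x)),
                ("1/x", 1 / x),
                ("1/sqrt(x)", 1 / np.sqrt(x))]:
    print(name, round(skew(t), 3))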
50
Transformations to achieve normality
• A normal probability plot can be used to check whether the new variable is
normally distributed or not (normality ≠ symmetry).
• Algorithms requiring normality usually do fine when supplied with data that
is symmetric and unimodal.
• Don’t forget to “de-transform” the data when the algorithm is done with its
results:
➢ What does this mean?
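It means that if, for example, a variable was log-transformed before modeling, the results come back on the log scale and must be mapped back to the original units. A minimal sketch under that assumption (the values and the placeholder predictions are made up):

import numpy as np

y = np.array([1.0, 3.0, 9.0, 27.0, 81.0])

# Transform to reduce skewness before modeling
y_log = np.log(y)

# ... fit a model on y_log and obtain predictions on the log scale ...
pred_log = y_log.copy()          # placeholder standing in for the model's output

# De-transform: map the results back to the original scale
pred = np.exp(pred_log)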
51
Checking Normality
52
Checking Normality: Histogram
56
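These checks are typically done visually. A minimal sketch producing a histogram and a normal probability plot with matplotlib and scipy (the data are made up; these libraries are one possible choice, not something prescribed by the slides):

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=500)    # right-skewed, clearly non-normal

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.hist(x, bins=30)                        # look for a symmetric bell shape
ax1.set_title("Histogram")

stats.probplot(x, dist="norm", plot=ax2)    # points should follow the line if normal
ax2.set_title("Normal probability plot")

plt.show()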
Transforming categorical variables into numerical variables
• In most instances, the data analyst should
avoid transforming categorical variables
to numeric variables
57
Discretization by binning numerical values
• Some algorithms prefer categorical rather than continuous predictors, in which case
we would need to partition the numerical predictors into bins or bands
• Common methods for binning numerical predictors:
▪ Equal width binning
▪ Equal frequency binning
▪ Binning by clustering
▪ Binning based on predictive value
58
Discretization by binning
▪ Binning does not use class information and is therefore an unsupervised discretization
technique
▪ Binning is sensitive to the user-specified number of bins, as well as the presence of
outliers
▪ Equal width binning is not recommended for most data mining applications (the width
of the categories can be greatly affected by the presence of outliers)
▪ Equal frequency binning assumes that each category is equally likely (an assumption
which is usually not warranted)
▪ Binning by clustering and binning based on predictive value are preferred
59
Discretization by binning --example
➢ Suppose we have the following tiny data set, which we would like to discretize into
k=3 categories: X = {1, 1, 1, 1, 1, 2, 2, 11, 11, 12, 12, 44}
▪ Equal width binning: [0, 15), [15, 30), [30, 45)
▪ Equal frequency binning: n/k = 12/3 = 4 values per bin
▪ K-means clustering with k = 3 (see the sketch below)
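A minimal sketch of these three binning approaches on the tiny data set above, using pandas and scikit-learn as one possible toolset; pd.qcut needs duplicates="drop" here because the repeated 1s produce duplicate quantile edges:

import pandas as pd
from sklearn.cluster import KMeans

x = pd.Series([1, 1, 1, 1, 1, 2, 2, 11, 11, 12, 12, 44])

# Equal width binning into the categories [0, 15), [15, 30), [30, 45)
equal_width = pd.cut(x, bins=[0, 15, 30, 45], right=False)

# Equal frequency binning (ideally n/k = 4 values per bin)
equal_freq = pd.qcut(x, q=3, duplicates="drop")   # collapses to 2 bins for this data

# Binning by k-means clustering on the single attribute
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
    x.to_numpy().reshape(-1, 1)
)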
Reclassifying categorical variables
▪ Reclassifying categorical variables is the equivalent of binning numerical variables
▪ Often, a categorical variable contains too many field values to be easily analyzable (for example, state)
▪ Data mining methods such as logistic regression perform suboptimally when confronted with
predictors containing too many field values
▪ The data analyst should reclassify the field values (the reclassification should support the objectives of
the business problem or research question)
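A minimal sketch of reclassifying a high-cardinality categorical field, e.g. collapsing US states into regions; the mapping is partial and purely illustrative:

import pandas as pd

df = pd.DataFrame({"state": ["CA", "NY", "TX", "WA", "FL", "NV"]})

# Hypothetical, partial state-to-region mapping chosen for illustration only
region_map = {"CA": "West", "WA": "West", "NV": "West",
              "NY": "Northeast", "TX": "South", "FL": "South"}

df["region"] = df["state"].map(region_map).fillna("Other")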
61