Data Preprocessing
Program on Machine Learning and Deep Learning
Organized By:
Department of Computer Science & Engineering
Parala Maharaja Engineering College,
Berhampur, Odisha, 761003
Data Preprocessing
Presenter:
Dr. Debasis Mohapatra
Assistant Professor
Dept. of CSE, PMEC, BAM, 761003
Overall approach of Machine Learning
(Figure source: KDnuggets, "Text Data Preprocessing: A Walkthrough in Python")
Major Tasks in data preprocessing
Data Cleaning (the process of dealing with incomplete, noisy, and inconsistent data)
Dealing with missing values: (i) ignore the tuple; (ii) fill in the missing value manually; (iii) use a measure of central tendency (mean, median, mode) to fill in the missing value; (iv) use interpolation (a short pandas sketch follows this list).
Dealing with noisy data: (i) binning; (ii) outlier analysis; (iii) regression.
Data Integration
Entity Identification
Removal of duplicate tuples
Removing redundancy using correlation analysis
Data reduction
Dimensionality reduction
Numerosity reduction
Data transformation and data discretization
Normalization, Concept hierarchy generation
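As a pointer to how the missing-value options above look in practice, here is a minimal pandas sketch; the DataFrame, column names, and values are made up for illustration and are not from the slides. Option (ii), manual filling, and the noisy-data techniques are not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values (not from the slides).
df = pd.DataFrame({
    "age":    [23, np.nan, 31, 27, np.nan],
    "income": [50000, 62000, np.nan, 58000, 61000],
    "city":   ["BAM", "BAM", None, "BBS", "BBS"],
})

# (i) Ignore the tuple: drop rows containing any missing value.
dropped = df.dropna()

# (iii) Fill with a measure of central tendency:
#       mean/median for numeric attributes, mode for nominal ones.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())
filled["income"] = filled["income"].fillna(filled["income"].median())
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])

# (iv) Linear interpolation along the tuple order
#      (limit_direction="both" also fills leading/trailing gaps).
interpolated = df[["age", "income"]].interpolate(method="linear",
                                                 limit_direction="both")

print(dropped, filled, interpolated, sep="\n\n")
```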
Data Cleaning
For numeric attributes, redundancy between two attributes can be measured with the correlation coefficient r, which lies in the range -1 <= r <= +1:
If r = +1: perfect positive correlation.
If r = -1: perfect negative correlation.
If r = 0: no linear relationship (the attributes are uncorrelated, though not necessarily independent).
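A one-line NumPy check of the correlation coefficient described above; the two attribute vectors are hypothetical, chosen only to illustrate the range of r.

```python
import numpy as np

# Hypothetical numeric attributes A and B (illustrative values only).
A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
B = np.array([1.0, 3.0, 5.0, 9.0, 11.0])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry
# is Pearson's r, which always lies in [-1, +1].
r = np.corrcoef(A, B)[0, 1]
print(f"r = {r:.3f}")  # a value near +1 flags the attributes as redundant
```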
Chi-Square test for correlation
For nominal data, a correlation relationship between two attributes, A
and B, can be discovered by a chi-square test.
Suppose A has c distinct values a1, a2, ..., ac, and B has r distinct values b1, b2, ..., br. The data tuples described by A and B can be shown as a contingency table, with the c values of A making up the columns and the r values of B making up the rows.
Let (Ai, Bj) denote the joint event that attribute A takes on value ai and attribute B takes on value bj, that is, (A = ai, B = bj).
The chi-square value is computed as
chi2 = sum over i = 1..c and j = 1..r of (o_ij - e_ij)^2 / e_ij
Here o_ij is the observed frequency (actual count) of the joint event (Ai, Bj), and e_ij is its expected frequency under independence,
e_ij = count(A = ai) * count(B = bj) / n
where n is the number of data tuples, count(A = ai) is the number of tuples
having value ai for A, and count(B = bj) is the number of tuples having value
bj for B.
The χ2 statistic tests the hypothesis that A and B are independent, that is, that there is no correlation between them.
If this hypothesis can be rejected, then we say that A and B are statistically correlated.
(The computed χ2 value is compared with the tabulated critical value of the chi-square distribution with (c - 1)(r - 1) degrees of freedom at a chosen significance level α; if the computed value is greater than or equal to the critical value, the independence hypothesis is rejected at confidence level 1 - α.)
Example: Are gender and preferred reading correlated?
Null hypothesis: gender and preferred reading are independent.
Degrees of freedom = (2 - 1) * (2 - 1) = 1 (a 2 x 2 contingency table).
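A hedged SciPy sketch of this test; the 2 x 2 contingency table below uses made-up counts (the actual counts from the slide's table are not in the extracted text), so the numbers only illustrate the mechanics.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed counts (illustrative only):
#                 preferred reading
#                 fiction  non_fiction
# gender: male        120           80
#         female       60          240
observed = np.array([[120, 80],
                     [60, 240]])

# correction=False gives the plain chi-square formula
# (no Yates' continuity correction).
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(f"chi2 = {chi2:.2f}, degrees of freedom = {dof}, p-value = {p:.3g}")
print("expected counts under independence:\n", expected)

# Reject the independence hypothesis if p falls below the chosen
# significance level alpha.
alpha = 0.001
print("statistically correlated" if p < alpha
      else "no evidence against independence")
```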
Data transformation routines convert the data into appropriate forms for
mining.
For example, in normalization, attribute data are scaled so as to fall
within a small range such as 0.0 to 1.0.
Data discretization transforms numeric data by mapping values to interval labels or concept labels (a concept hierarchy is used for this purpose).
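A small pandas sketch of discretization as just described; the ages, bin edges, and concept labels ("youth", "adult", "senior") are assumptions made for illustration, forming a simple two-level concept hierarchy.

```python
import pandas as pd

# Hypothetical ages of data tuples.
ages = pd.Series([5, 17, 23, 35, 46, 58, 63, 72], name="age")

# Discretize into equal-width interval labels.
intervals = pd.cut(ages, bins=4)

# Map the same values to higher-level concept labels (assumed hierarchy).
concepts = pd.cut(ages, bins=[0, 18, 60, 120],
                  labels=["youth", "adult", "senior"])

print(pd.DataFrame({"age": ages, "interval": intervals, "concept": concepts}))
```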
Data Transformation
Normalization:-
Normalizing the data attempts to give all attributes an equal weight.
Normalization is particularly useful for classification algorithms
involving neural networks or distance measurements such as nearest-
neighbor classification and clustering.
Example:- If using the neural network backpropagation algorithm for
classification, normalizing the input values for each attribute
measured in the training tuples will help speed up the learning phase.
Methods:-
Min-Max Normalization
Z-score Normalization
etc.
Data Transformation
Min-max normalization performs a linear transformation on the original data: a value v of a numeric attribute A is mapped to
v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
where [min_A, max_A] is the original range of A and [new_min_A, new_max_A] is the target range (e.g., [0.0, 1.0]).
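A minimal NumPy sketch of the two normalization methods listed earlier (min-max into [0.0, 1.0] and z-score); the attribute values are made up for illustration.

```python
import numpy as np

# Hypothetical values of a numeric attribute (not from the slides).
v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization into [new_min, new_max] = [0.0, 1.0].
new_min, new_max = 0.0, 1.0
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score normalization: center on the mean, scale by the standard deviation.
v_zscore = (v - v.mean()) / v.std()

print("min-max :", v_minmax)
print("z-score :", np.round(v_zscore, 3))
```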
Worked example (covariance matrix, eigenvalues, and the normalized principal eigenvector for two attributes, the core computation behind principal components analysis):

X    Y
4    11
8    4
13   5
7    14

(1) Compute the means: mean(X) = (4 + 8 + 13 + 7)/4 = 8 and mean(Y) = (11 + 4 + 5 + 14)/4 = 8.5.

(2) Compute the sample covariance matrix (dividing by n - 1 = 3):
Cov(X,X) = 1/3 * [(4-8)^2 + (8-8)^2 + (13-8)^2 + (7-8)^2] = 14
Cov(X,Y) = Cov(Y,X) = 1/3 * [(4-8)(11-8.5) + (8-8)(4-8.5) + (13-8)(5-8.5) + (7-8)(14-8.5)] = -11
Cov(Y,Y) = 1/3 * [(11-8.5)^2 + (4-8.5)^2 + (5-8.5)^2 + (14-8.5)^2] = 23

C = |  14  -11 |
    | -11   23 |

(3) Compute the eigenvalues and eigenvectors of the covariance matrix.
det(C - λI) = (14 - λ)(23 - λ) - (-11)^2 = λ^2 - 37λ + 201 = 0, giving λ1 ≈ 30.384 and λ2 ≈ 6.616.

For λ1 = 30.384, the eigenvector e = (e1, e2) satisfies (C - λ1 I) e = 0:
(14 - 30.384) e1 - 11 e2 = 0
-11 e1 + (23 - 30.384) e2 = 0
Taking e1 = 1 gives e2 = (14 - 30.384)/11 ≈ -1.489.

(4) Compute the normalized eigenvector E1:
E1 = (1, -1.489) / ||(1, -1.489)|| ≈ (0.557, -0.830)
E1 is the unit-length direction of maximum variance (the first principal component).
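The worked example above can be reproduced with NumPy; this sketch only verifies the covariance matrix, the eigenvalues, and the normalized principal eigenvector (the eigenvector may come out with both signs flipped, which is equivalent).

```python
import numpy as np

# The four tuples (X, Y) from the worked example above.
X = np.array([4.0, 8.0, 13.0, 7.0])
Y = np.array([11.0, 4.0, 5.0, 14.0])

# Sample covariance matrix (np.cov divides by n - 1 = 3, as in the example).
C = np.cov(np.stack([X, Y]))
print(C)                      # [[ 14. -11.]
                              #  [-11.  23.]]

# Eigenvalues/eigenvectors of the symmetric covariance matrix,
# returned in ascending order of eigenvalue.
eigvals, eigvecs = np.linalg.eigh(C)
print(eigvals)                # ~ [ 6.616 30.384]

# Normalized eigenvector of the largest eigenvalue = first principal component.
E1 = eigvecs[:, -1]
print(E1)                     # ~ [ 0.557 -0.830] (sign may be flipped)
```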
Data reduction (Dimensionality reduction)
Attribute Subset Selection
Data sets for analysis may contain hundreds of attributes, many of which may be irrelevant to the mining task or redundant.
Attribute subset selection reduces the data set size by removing irrelevant or
redundant attributes (or dimensions).
The goal of attribute subset selection is to find a minimum set of attributes
such that the resulting probability distribution of the data classes is as close
as possible to the original distribution obtained using all attributes.
Mining on a reduced set of attributes has an additional benefit:
It reduces the number of attributes appearing in the discovered patterns,
helping to make the patterns easier to understand.
Data reduction (Dimensionality reduction)
Attribute Subset Selection (Cont..)
Stepwise forward selection:
The procedure starts with an empty set of attributes as the reduced set.
The best of the original attributes is determined and added to the reduced
set.
At each subsequent iteration or step, the best of the remaining original
attributes is added to the set.
Stepwise backward elimination:
The procedure starts with the full set of attributes.
At each step, it removes the worst attribute remaining in the set.
Combination of forward selection and backward elimination:
The stepwise forward selection and backward elimination methods can be
combined so that, at each step, the procedure selects the best attribute and
removes the worst from among the remaining attributes.
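A hedged scikit-learn sketch of the two stepwise strategies above; SequentialFeatureSelector is one off-the-shelf greedy implementation, and the estimator, data set, and n_features_to_select=5 are arbitrary choices, not prescribed by the slides.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

# Public toy data set, used only to make the sketch runnable.
X, y = load_breast_cancer(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)

# Stepwise forward selection: start from an empty attribute set and greedily
# add the attribute that most improves cross-validated accuracy.
forward = SequentialFeatureSelector(
    knn, n_features_to_select=5, direction="forward", cv=5)
forward.fit(X, y)
print("forward selection kept:", forward.get_support(indices=True))

# Stepwise backward elimination: start from the full attribute set and
# greedily remove the worst attribute at each step.
backward = SequentialFeatureSelector(
    knn, n_features_to_select=5, direction="backward", cv=5)
backward.fit(X, y)
print("backward elimination kept:", backward.get_support(indices=True))
```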
Data reduction (Dimensionality reduction)
Attribute Subset Selection (Cont..)
Decision tree induction:
Decision tree algorithms (e.g., ID3, C4.5, and CART) were originally intended
for classification.
Decision tree induction constructs a flowchart-like structure where each internal (nonleaf) node denotes a test on an attribute, each branch corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction.
At each node, the algorithm chooses the “best” attribute to partition the
data into individual classes.
When decision tree induction is used for attribute subset selection, a tree is constructed from the given data. All attributes that do not appear in the tree are assumed to be irrelevant. The set of attributes appearing in the tree forms the reduced subset of attributes.
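A hedged scikit-learn sketch of decision-tree-based attribute subset selection: induce a tree, then keep only the attributes that appear at internal (test) nodes. The data set and max_depth=3 are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Public toy data set, used only to make the sketch runnable.
data = load_iris()
X, y = data.data, data.target

# Induce a shallow decision tree on the full attribute set.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Internal (test) nodes store the index of the attribute they split on;
# leaves are marked with a negative feature index in sklearn's tree arrays.
used = np.unique(tree.tree_.feature[tree.tree_.feature >= 0])
reduced_subset = [data.feature_names[i] for i in used]
print("reduced attribute subset:", reduced_subset)
```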
Data reduction (Dimensionality reduction)
Attribute Subset Selection (Cont..)
(Figure: attribute subset selection using decision tree induction)