Summer Training cum Internship Program
on
Machine Learning and Deep Learning

Organized By:
Department of Computer Science & Engineering
Parala Maharaja Engineering College,
Berhampur, Odisha, 761003
Data Preprocessing

Presenter:
Dr. Debasis Mohapatra
Assistant Professor
Dept. of CSE, PMEC, BAM, 761003
Overall approach of Machine Learning

[Figure: overall approach of machine learning. Source: "Text Data Preprocessing: A Walkthrough in Python", KDnuggets]
Major Tasks in data preprocessing

 Data Cleaning (Process of dealing with incomplete, noisy, and inconsistent data)
 Dealing with missing values:- (i) ignore the tuple (ii) fill the missing value manually (iii) use
measures of central tendency (mean, median, mode) to fill the missing value (iv) use
interpolation
 Dealing with noisy data:- (i) Binning (ii) Outlier analysis (iii) Regression
 Data Integration
 Entity Identification
 Removal of duplicate tuples
 Removing redundancy using correlation analysis
 Data reduction
 Dimensionality reduction
 Numerosity reduction
 Data transformation and data discretization
 Normalization, Concept hierarchy generation
Data Cleaning

Process of dealing with incomplete, noisy, and inconsistent data.
Data Cleaning (Dealing with missing values)
 Ignore the tuple:- Remove the tuple containing several attributes with
missing values.
 Fill the missing value manually:- Fill the missing value manually through
observation. In general, this approach is time consuming and may not be
feasible given a large data set with many missing values.
 Using measures of central tendency (mean, median, mode) to fill the
missing value.
 Using interpolation (Lagrange, Newton, Linear, etc.)
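
A minimal sketch of options (i), (iii), and (iv) above with pandas; the DataFrame and its single age column are invented for illustration:

import numpy as np
import pandas as pd

# Hypothetical data with missing values
df = pd.DataFrame({"age": [25.0, np.nan, 31.0, 47.0, np.nan, 52.0]})

dropped = df.dropna()                               # (i) ignore (drop) tuples with missing values
filled_mean = df.fillna({"age": df["age"].mean()})  # (iii) fill with a central-tendency measure
filled_interp = df.interpolate(method="linear")     # (iv) fill by linear interpolation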
Data Cleaning (Dealing with noisy data)
 Noise is a random error or variance in a measured variable.
 Binning:-
 Sort the values.
 Consider a bucket/bin size and fill the bins.
 Replace the values by bin mean/median/closest boundary
 Example:-
 Given data:- 4, 28, 15, 21, 21, 24, 25, 8, 34
 Sorted:- 4, 8, 15, 21, 21, 24, 25, 28, 34
 Prepare bins of size 3:-
 Bin 1:- 4,8,15
 Bin 2:- 21,21,24
 Bin 3:- 25, 28, 34
 Replace by bin means:-
 Bin 1:- 9, 9, 9
 Bin 2:- 22, 22, 22
 Bin 3:- 29, 29, 29
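
The binning example above can be sketched in a few lines of Python (standard library only):

from statistics import mean

data = [4, 28, 15, 21, 21, 24, 25, 8, 34]
bin_size = 3

sorted_data = sorted(data)  # 4, 8, 15, 21, 21, 24, 25, 28, 34
bins = [sorted_data[i:i + bin_size] for i in range(0, len(sorted_data), bin_size)]

# Smooth by replacing every value in a bin with the bin mean
smoothed = [[mean(b)] * len(b) for b in bins]
print(smoothed)  # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]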
Data Cleaning (Dealing with noisy data)
 Outlier analysis:-
 Identifying the outliers and removing them.
 Methods based on the IQR, data visualization, or clustering can be used for this
purpose (see the sketch below).
 Regression:-
 This can also be used for data smoothing, by fitting a linear or non-linear
function to the data.
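
For the IQR-based option, a hedged sketch with pandas; the series below is invented and contains one obvious outlier:

import pandas as pd

values = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34, 120])  # 120 is an obvious outlier

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
cleaned = values[(values >= lower) & (values <= upper)]
print(outliers.tolist())  # [120]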
Data Integration

 The merging of data from multiple data stores.


 Entity Identification:- How can equivalent real-world entities from multiple data
sources be matched up? This is referred to as the entity identification problem.
 For example, how can the data analyst or the computer be sure that customer id
in one database and cust_number in another refer to the same attribute?
 Metadata for each attribute include the name, meaning, data type, and range of
values permitted for the attribute, and null rules for handling blank, zero, or null
values. Such metadata can be used to help avoid errors in schema integration.
 Duplicate tuple:- Duplication should also be detected at the tuple level (e.g., where
there are two or more identical tuples for a given unique data entry case).
 Inconsistencies often arise between various duplicates, due to inaccurate data
entry or updating some but not all data occurrences.
 Duplicate tuples should be removed.
Data Integration

 Redundancy and correlation analysis:-


 Redundancy is another important issue in data integration. An attribute may be
redundant if it can be “derived” from another attribute or set of attributes.
 Inconsistencies in attribute or dimension naming can also cause redundancies in
the resulting data set.
 Some redundancies can be detected by correlation analysis. Given two
attributes, such analysis can measure how strongly one attribute implies the
other, based on the available data.
 For correlation analysis on numeric data, we use Pearson’s correlation coefficient.
 For correlation analysis on nominal/categorical data, we use the chi-square test.
Correlation analysis
 Correlation analysis attempts to determine the degree of relationship between
attributes/variables.
 Examples:-
 Increase in rainfall (up to a point) and the production of rice.
 As her salary increased, so did her spending.
 As attendance at school drops, so does achievement.
 If a train increases speed, the length of time to get to the final point decreases.
 As one exercises more, one's body weight decreases.
 If both variables vary in the same direction, the correlation is called positive
correlation.
 If the variables vary in opposite directions, the correlation is called negative
correlation.
 Correlation does not, by itself, tell anything about a cause-and-effect relationship.
Pearson’s coefficient of correlation
 It is applicable for numeric data only.
 It is denoted by symbol ‘r’.
 ‘r’ between two attributes X and Y (with values xᵢ and yᵢ and means x̄ and ȳ) can be computed as:

r = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / ( √(Σᵢ (xᵢ − x̄)²) × √(Σᵢ (yᵢ − ȳ)²) )
 −1 ≤ r ≤ +1
 If r = +1: perfect positive correlation
 If r = −1: perfect negative correlation
 If r = 0: no linear relationship between the attributes (they are uncorrelated, though
not necessarily independent)
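
A small numeric check with NumPy; the two arrays are invented, and np.corrcoef computes Pearson's r:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])  # roughly y = 2x, so r should be close to +1

r = np.corrcoef(x, y)[0, 1]  # Pearson's correlation coefficient, about 0.999 here
print(r)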
Chi-Square test for correlation
 For nominal data, a correlation relationship between two attributes, A
and B, can be discovered by a chi-square test.
 Suppose A has c distinct values, namely a1,a2,...ac. B has r distinct values,
namely b1,b2,...br . The data tuples described by A and B can be shown as
a contingency table, with the c values of A making up the columns and the
r values of B making up the rows.
 Let (Ai ,Bj) denote the joint event that attribute A takes on value ai and
attribute B takes on value bj , that is, where (A = ai ,B = bj).
 The chi-square value can be computed as:

χ² = Σᵢ Σⱼ (oᵢⱼ − eᵢⱼ)² / eᵢⱼ,   with expected frequency eᵢⱼ = count(A = aᵢ) × count(B = bⱼ) / n

where oᵢⱼ is the observed frequency (actual count) of the joint event (Aᵢ, Bⱼ), n is the number
of data tuples, count(A = aᵢ) is the number of tuples having value aᵢ for A, and count(B = bⱼ)
is the number of tuples having value bⱼ for B.
 The χ² statistic tests the hypothesis that A and B are independent, that is,
there is no correlation between them.
 The test is based on a significance level, with (r − 1) × (c − 1) degrees of freedom.
 If the hypothesis can be rejected, then we say that A and B are statistically
correlated.
(If the computed value is greater than or equal to the tabulated value of the chi-square
distribution at a particular significance level (p-value), then the hypothesis is rejected
with probability 1 − p and accepted with probability p.)
Example:- Are gender and preferred reading correlated?
Hypothesis: Gender and Preferred reading are independent.
 Degrees of freedom = (2 − 1) × (2 − 1) = 1

 For 1 degree of freedom, the χ² value needed to reject the hypothesis at the 0.001
significance level is 10.828 (taken from the table of upper percentage points of the
χ² distribution, typically available in any statistics textbook).
 Since our computed value is above this, we can reject the hypothesis that
gender and preferred reading are independent and conclude that the two
attributes are (strongly) correlated for the given group of people (with
probability 1-0.001=0.999).
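
The same test can be run with SciPy; a sketch in which the contingency table is filled with illustrative counts (the slide's actual table is not reproduced here):

import numpy as np
from scipy.stats import chi2_contingency

# Rows: preferred reading (fiction, non-fiction); columns: gender (male, female).
# The counts below are illustrative placeholders.
observed = np.array([[250,  200],
                     [ 50, 1000]])

# correction=False matches the plain chi-square formula (no Yates continuity correction)
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, dof, p)  # reject independence when chi2 >= the tabulated value (p below the chosen level)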
Data transformation and data discretization

 Data transformation routines convert the data into appropriate forms for
mining.
 For example, in normalization, attribute data are scaled so as to fall
within a small range such as 0.0 to 1.0.
 Data discretization transforms numeric data by mapping values to interval
or concept labels (Concept hierarchy is used for this purpose).
Data Transformation
 Normalization:-
 Normalizing the data attempts to give all attributes an equal weight.
 Normalization is particularly useful for classification algorithms
involving neural networks or distance measurements such as nearest-
neighbor classification and clustering.
 Example:- If using the neural network backpropagation algorithm for
classification, normalizing the input values for each attribute
measured in the training tuples will help speed up the learning phase.
 Methods:-
 Min-Max Normalization
 Z-score Normalization
 etc.
Data Transformation
 Min-max normalization performs a linear transformation on the
original data using the following formula:

v′ = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A

where min_A and max_A are the minimum and maximum values of attribute A, and
[new_min_A, new_max_A] is the new (target) range.
 In z-score normalization (or zero-mean normalization), the values of an attribute A
are normalized based on the mean (i.e., average) Ā and standard deviation σ_A of A:

v′ = (v − Ā) / σ_A

(The worked example on the slide uses Mean = 500 and SD = 282.843.)
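
Both formulas can be sketched directly in NumPy; the values are illustrative, and the target range [0, 1] is an assumption:

import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization to [new_min, new_max]
new_min, new_max = 0.0, 1.0
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score normalization (np.std defaults to the population standard deviation;
# pass ddof=1 for the sample version)
v_zscore = (v - v.mean()) / v.std()

print(v_minmax, v_zscore)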
Data discretization example

[Figure: concept hierarchy for the attribute age, mapping numeric ages to interval and concept labels]
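
A small sketch of age discretization with pandas; the cut points and concept labels below are assumptions, not the ones from the slide:

import pandas as pd

ages = pd.Series([5, 17, 23, 35, 48, 62, 71])

# Map numeric ages to interval/concept labels (a hypothetical hierarchy)
labels = pd.cut(ages,
                bins=[0, 12, 19, 35, 60, 120],
                labels=["child", "teen", "young adult", "middle-aged", "senior"])
print(labels.tolist())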
Data reduction
 Data reduction techniques can be applied to obtain a reduced representation of the data set
that is much smaller in volume, yet closely maintains the integrity of the original data.
 That is, mining/ML tasks on the reduced data set should be more efficient yet produce the
same (or almost the same) analytical results.
 Numerosity reduction techniques replace the original data volume by alternative, smaller
forms of data representation.
 These techniques may be parametric or nonparametric.
 For parametric methods, a model is used to estimate the data, so that typically only the data
parameters need to be stored, instead of the actual data. Regression and log-linear models are
examples.
 Nonparametric methods for storing reduced representations of the data include histograms,
clustering, sampling, and data cube aggregation.
 Dimensionality reduction is the process of reducing the number of variables or attributes
under consideration.
 Methods used for this purpose include the wavelet transform, attribute subset selection,
principal component analysis (PCA), etc.
Data reduction
Numerosity reduction (Sampling)
 Sampling is a non-parametric numerosity reduction method.
 Sampling can be used as a data reduction technique because it allows a large data set
to be represented by a much smaller random data sample (or subset). Suppose that a
large data set, D, contains N tuples.
 Simple random sample without replacement (SRSWOR) of size s: This is created by drawing s
of the N tuples from D (s < N), where the probability of drawing any tuple in D is 1/N, that is,
all tuples are equally likely to be sampled.
 Simple random sample with replacement (SRSWR) of size s: This is similar to SRSWOR,
except that each time a tuple is drawn from D, it is recorded and then replaced. That is, after
a tuple is drawn, it is placed back in D so that it may be drawn again.
 Stratified sample: If D is divided into mutually disjoint parts called strata, a stratified sample
of D is generated by obtaining an SRS at each stratum. This helps ensure a representative
sample from each stratum. For example, a stratified sample may be obtained from customer
data, where a stratum is created for each customer age group.
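
The three schemes can be sketched with pandas; the customer DataFrame and its age_group strata are hypothetical:

import pandas as pd

df = pd.DataFrame({
    "customer_id": range(1, 11),
    "age_group": ["young"] * 4 + ["middle"] * 3 + ["senior"] * 3,
})

srswor = df.sample(n=4, replace=False, random_state=0)  # simple random sample without replacement
srswr = df.sample(n=4, replace=True, random_state=0)    # simple random sample with replacement

# Stratified sample: an SRS of 2 tuples drawn from each stratum (age group)
stratified = df.groupby("age_group").sample(n=2, random_state=0)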
Data reduction
Dimensionality reduction (PCA)
 Principal Component Analysis, or PCA, is a dimensionality-reduction method that
is often used to reduce the dimensionality of large data sets, by transforming a
large set of variables into a smaller one that still contains most of the information
in the large set.
 Steps:-
 Normalize the data set (optional).
 Compute the covariance matrix.
 Find the eigenvalues and eigenvectors of the covariance matrix and
normalize the eigenvectors.
 Find the principal components by multiplying the normalized eigenvectors with
the data points.
Example:-

X Y
4 11
8 4
13 5
7 14

No. of attributes (n) = 2
No. of records (N) = 4

(1) Let us skip step 1, i.e., normalization of the data set.

(2) Compute the covariance matrix (S) by considering all possible pairs of attributes.

    S = | Cov(X,X)  Cov(X,Y) |  =  |  14  −11 |
        | Cov(Y,X)  Cov(Y,Y) |     | −11   23 |

Mean(X) = (4 + 8 + 13 + 7)/4 = 8
Mean(Y) = (11 + 4 + 5 + 14)/4 = 8.5

Cov(X,X) = 1/3 × [(4−8)² + 0 + (13−8)² + (7−8)²] = 14
Cov(X,Y) = Cov(Y,X) = 1/3 × [(4−8)(11−8.5) + 0 + (13−8)(5−8.5) + (7−8)(14−8.5)] = −11
Cov(Y,Y) = 1/3 × [(11−8.5)² + (4−8.5)² + (5−8.5)² + (14−8.5)²] = 23
(3) Compute the eigenvalues and eigenvectors of the covariance
matrix and normalize the eigenvectors.

Solving det(S − λI) = 0, i.e., (14 − λ)(23 − λ) − (−11)² = 0, gives λ₁ ≈ 30.384 and λ₂ ≈ 6.616.

Let e = (e₁, e₂)ᵀ be the eigenvector corresponding to λ₁ = 30.384, so that (S − λ₁I)e = 0:

(14 − λ₁)e₁ − 11e₂ = 0
−11e₁ + (23 − λ₁)e₂ = 0

Take any one of the above equations, fix one component (say e₂ = 1), and solve for the other:
e₁ = 11/(14 − λ₁) ≈ −0.671, so e ≈ (−0.671, 1)ᵀ.

Compute the normalized eigenvector E1.

E1 = e/‖e‖ ≈ (−0.557, 0.830)ᵀ

Likewise, the eigenvector corresponding to λ₂ can be computed.

Here, we are interested in reducing the given data set to one dimension, so
we consider only the principal eigenvector (the one for the largest eigenvalue) and normalize it.
(4) Computation of the principal components (here only
one principal component is generated, as we used
only the principal eigenvector).

1st principal component: PC1 = E1ᵀ · x for each data point x = (X, Y)ᵀ, i.e., each record is
projected onto the normalized principal eigenvector.

Likewise, the 2nd principal component can be computed if two dimensions are required.
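
As a quick check, the worked example above can be reproduced with NumPy (a sketch; np.cov divides by N − 1, matching the 1/3 factor used on the slides):

import numpy as np

data = np.array([[4, 11],
                 [8, 4],
                 [13, 5],
                 [7, 14]], dtype=float)

S = np.cov(data, rowvar=False)        # covariance matrix; should be [[14, -11], [-11, 23]]
eigvals, eigvecs = np.linalg.eigh(S)  # eigh returns eigenvalues in ascending order

principal_vec = eigvecs[:, -1]        # normalized eigenvector of the largest eigenvalue (~30.38)
pc1 = data @ principal_vec            # 1st principal component (one value per record)
print(eigvals, principal_vec, pc1)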
Data reduction (Dimensionality reduction )
Attribute Subset Selection

 Data sets for analysis may contain hundreds of attributes, many of which may
be irrelevant to the mining task or redundant.
 Attribute subset selection reduces the data set size by removing irrelevant or
redundant attributes (or dimensions).
 The goal of attribute subset selection is to find a minimum set of attributes
such that the resulting probability distribution of the data classes is as close
as possible to the original distribution obtained using all attributes.
 Mining on a reduced set of attributes has an additional benefit:
 It reduces the number of attributes appearing in the discovered patterns,
helping to make the patterns easier to understand.
Data reduction (Dimensionality reduction )
Attribute Subset Selection (Cont..)
 Stepwise forward selection:
 The procedure starts with an empty set of attributes as the reduced set.
 The best of the original attributes is determined and added to the reduced
set.
 At each subsequent iteration or step, the best of the remaining original
attributes is added to the set.
 Stepwise backward elimination:
 The procedure starts with the full set of attributes.
 At each step, it removes the worst attribute remaining in the set.
 Combination of forward selection and backward elimination:
 The stepwise forward selection and backward elimination methods can be
combined so that, at each step, the procedure selects the best attribute and
removes the worst from among the remaining attributes.
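
For instance, scikit-learn's SequentialFeatureSelector implements greedy forward and backward selection; a hedged sketch in which the synthetic data set and the kNN estimator are only for illustration:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)

# Greedy stepwise forward selection of 4 attributes; direction="backward" gives backward elimination
selector = SequentialFeatureSelector(
    KNeighborsClassifier(), n_features_to_select=4, direction="forward"
)
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the selected attributes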
Data reduction (Dimensionality reduction )
Attribute Subset Selection (Cont..)
 Decision tree induction:
 Decision tree algorithms (e.g., ID3, C4.5, and CART) were originally intended
for classification.
 Decision tree induction constructs a flowchart-like structure where each
internal (nonleaf) node denotes a test on an attribute, each branch
corresponds to an outcome of the test, and each external (leaf) node
denotes a class prediction.
 At each node, the algorithm chooses the “best” attribute to partition the
data into individual classes.
 When decision tree induction is used for attribute subset selection, a tree is
constructed from the given data. All attributes that do not appear in the tree
are assumed to be irrelevant. The set of attributes appearing in the tree forms
the reduced subset of attributes (see the sketch below).
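
A short sketch of that idea with scikit-learn; the synthetic data are illustrative, and any attribute that never appears as a split in the fitted tree is treated as irrelevant:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Attributes actually used as a test at some internal node of the tree
# (leaf nodes are marked with a negative feature index)
used = np.unique(tree.tree_.feature[tree.tree_.feature >= 0])
print("reduced attribute subset:", used)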
Data reduction (Dimensionality reduction )
Attribute Subset Selection (Contd..)
Attribute subset selection (Decision Tree Induction)
