Summer Training cum Internship Program
on
Machine Learning and Deep Learning

Organized By:
Department of Computer Science & Engineering
Parala Maharaja Engineering College,
Berhampur, Odisha, 761003
Data Preprocessing

Presenter:
Dr. Debasis Mohapatra
Assistant Professor
Dept. of CSE, PMEC, BAM, 761003
Overall approach of Machine Learning

[Figure: overall approach of machine learning. Source: "Text Data Preprocessing: A Walkthrough in Python", KDnuggets]
Major Tasks in data preprocessing

 Data Cleaning (Process of dealing with incomplete, noisy, and inconsistent data)
 Dealing with missing values:- (i) ignore the tuple (ii) fill the missing value manually (iii) use
measures of central tendency (mean, median, mode) to fill the missing value (iv) use
interpolation
 Dealing with noisy data:- (i) Binning (ii) Outlier analysis (iii) Regression
 Data Integration
 Entity Identification
 Removal of duplicate tuples
 Removing redundancy using correlation analysis
 Data reduction
 Dimensionality reduction
 Numerosity reduction
 Data transformation and data discretization
 Normalization, Concept hierarchy generation
Data Cleaning

Process of dealing with incomplete, noisy, and inconsistent data.
Data Cleaning (Dealing with missing values)
 Ignore the tuple:- Remove the tuple containing several attributes with
missing values.
 Fill the missing value manually:- Fill the missing value manually through
observation. In general, this approach is time consuming and may not be
feasible given a large data set with many missing values.
 Using measures of central tendency (mean, median, mode) to fill the
missing value.
 Using interpolation (Lagrange, Newton, Linear, etc.)
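
A minimal sketch of options (i), (iii), and (iv) above with pandas; the DataFrame and its single age column are invented for illustration:

import numpy as np
import pandas as pd

# Hypothetical data with missing values
df = pd.DataFrame({"age": [25.0, np.nan, 31.0, 47.0, np.nan, 52.0]})

dropped = df.dropna()                               # (i) ignore (drop) tuples with missing values
filled_mean = df.fillna({"age": df["age"].mean()})  # (iii) fill with a central-tendency measure
filled_interp = df.interpolate(method="linear")     # (iv) fill by linear interpolation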
Data Cleaning (Dealing with noisy data)
 Noise is a random error or variance in a measured variable.
 Binning:-
 Sort the values.
 Consider a bucket/bin size and fill the bins.
 Replace the values by bin mean/median/closest boundary
 Example:-
 Given data:- 4, 28, 15, 21, 21, 24, 25, 8, 34
 Sorted:- 4, 8, 15, 21, 21, 24, 25, 28, 34
 Prepare bins of size 3:-
 Bin 1:- 4,8,15
 Bin 2:- 21,21,24
 Bin 3:- 25, 28, 34
 Replace by bin means:-
 Bin 1:- 9, 9, 9
 Bin 2:- 22, 22, 22
 Bin 3:- 29, 29, 29
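
The binning example above can be sketched in a few lines of Python (standard library only):

from statistics import mean

data = [4, 28, 15, 21, 21, 24, 25, 8, 34]
bin_size = 3

sorted_data = sorted(data)  # 4, 8, 15, 21, 21, 24, 25, 28, 34
bins = [sorted_data[i:i + bin_size] for i in range(0, len(sorted_data), bin_size)]

# Smooth by replacing every value in a bin with the bin mean
smoothed = [[mean(b)] * len(b) for b in bins]
print(smoothed)  # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]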
Data Cleaning (Dealing with noisy data)
 Outlier analysis:-
 Identifying the outliers and removing them.
 Methods based on the IQR, data visualization, or clustering can be used for this
purpose (see the sketch below).
 Regression:-
 This can also be used for data smoothing, by fitting a linear or non-linear
function to the data.
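
For the IQR-based option, a hedged sketch with pandas; the series below is invented and contains one obvious outlier:

import pandas as pd

values = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34, 120])  # 120 is an obvious outlier

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
cleaned = values[(values >= lower) & (values <= upper)]
print(outliers.tolist())  # [120]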
Data Integration

 The merging of data from multiple data stores.


 Entity Identification:- How can equivalent real-world entities from multiple data
sources be matched up? This is referred to as the entity identification problem.
 For example, how can the data analyst or the computer be sure that customer id
in one database and cust_number in another refer to the same attribute?
 Metadata for each attribute include the name, meaning, data type, and range of
values permitted for the attribute, and null rules for handling blank, zero, or null
values. Such metadata can be used to help avoid errors in schema integration.
 Duplicate tuple:- Duplication should also be detected at the tuple level (e.g., where
there are two or more identical tuples for a given unique data entry case).
 Inconsistencies often arise between various duplicates, due to inaccurate data
entry or updating some but not all data occurrences.
 Duplicate tuples should be removed.
Data Integration

 Redundancy and correlation analysis:-


 Redundancy is another important issue in data integration. An attribute may be
redundant if it can be “derived” from another attribute or set of attributes.
 Inconsistencies in attribute or dimension naming can also cause redundancies in
the resulting data set.
 Some redundancies can be detected by correlation analysis. Given two
attributes, such analysis can measure how strongly one attribute implies the
other, based on the available data.
 For correlation analysis on numeric data, we use Pearson’s correlation coefficient.
 For correlation analysis on nominal/categorical data, we use the chi-square test.
Correlation analysis
 Correlation analysis attempts to determine the degree of relationship between
attributes/variables.
 Examples:-
 Increase in rainfall (up to a point) and the production of rice.
 As her salary increased, so did her spending.
 As attendance at school drops, so does achievement.
 If a train increases speed, the length of time to get to the final point decreases.
 As one exercises more, one's body weight decreases.
 If both variables vary in the same direction, the correlation is called positive
correlation.
 If the variables vary in opposite directions, the correlation is called negative
correlation.
 Correlation does not, by itself, tell anything about a cause-and-effect relationship.
Pearson’s coefficient of correlation
 It is applicable for numeric data only.
 It is denoted by symbol ‘r’.
 ‘r’ between two attributes X and Y (with values xᵢ and yᵢ and means x̄ and ȳ) can be computed as:

r = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / ( √(Σᵢ (xᵢ − x̄)²) × √(Σᵢ (yᵢ − ȳ)²) )
 −1 ≤ r ≤ +1
 If r = +1: perfect positive correlation
 If r = −1: perfect negative correlation
 If r = 0: no linear relationship between the attributes (they are uncorrelated, though
not necessarily independent)
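
A small numeric check with NumPy; the two arrays are invented, and np.corrcoef computes Pearson's r:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])  # roughly y = 2x, so r should be close to +1

r = np.corrcoef(x, y)[0, 1]  # Pearson's correlation coefficient, about 0.999 here
print(r)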
Chi-Square test for correlation
 For nominal data, a correlation relationship between two attributes, A
and B, can be discovered by a chi-square test.
 Suppose A has c distinct values, namely a1,a2,...ac. B has r distinct values,
namely b1,b2,...br . The data tuples described by A and B can be shown as
a contingency table, with the c values of A making up the columns and the
r values of B making up the rows.
 Let (Ai ,Bj) denote the joint event that attribute A takes on value ai and
attribute B takes on value bj , that is, where (A = ai ,B = bj).
 The chi-square value can be computed as:

χ² = Σᵢ Σⱼ (oᵢⱼ − eᵢⱼ)² / eᵢⱼ,   with expected frequency eᵢⱼ = count(A = aᵢ) × count(B = bⱼ) / n

where oᵢⱼ is the observed frequency (actual count) of the joint event (Aᵢ, Bⱼ), n is the number
of data tuples, count(A = aᵢ) is the number of tuples having value aᵢ for A, and count(B = bⱼ)
is the number of tuples having value bⱼ for B.
 The χ² statistic tests the hypothesis that A and B are independent, that is,
there is no correlation between them.
 The test is based on a significance level, with (r − 1) × (c − 1) degrees of freedom.
 If the hypothesis can be rejected, then we say that A and B are statistically
correlated.
(If the computed value is greater than or equal to the tabulated value of the chi-square
distribution at a particular significance level (p-value), then the hypothesis is rejected
with probability 1 − p and accepted with probability p.)
Example:- Are gender and preferred reading correlated?
Hypothesis: Gender and Preferred reading are independent.
 Degrees of freedom = (2 − 1) × (2 − 1) = 1

 For 1 degree of freedom, the χ² value needed to reject the hypothesis at the 0.001
significance level is 10.828 (taken from the table of upper percentage points of the
χ² distribution, typically available in any statistics textbook).
 Since our computed value is above this, we can reject the hypothesis that
gender and preferred reading are independent and conclude that the two
attributes are (strongly) correlated for the given group of people (with
probability 1-0.001=0.999).
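
The same test can be run with SciPy; a sketch in which the contingency table is filled with illustrative counts (the slide's actual table is not reproduced here):

import numpy as np
from scipy.stats import chi2_contingency

# Rows: preferred reading (fiction, non-fiction); columns: gender (male, female).
# The counts below are illustrative placeholders.
observed = np.array([[250,  200],
                     [ 50, 1000]])

# correction=False matches the plain chi-square formula (no Yates continuity correction)
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, dof, p)  # reject independence when chi2 >= the tabulated value (p below the chosen level)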
Data transformation and data discretization

 Data transformation routines convert the data into appropriate forms for
mining.
 For example, in normalization, attribute data are scaled so as to fall
within a small range such as 0.0 to 1.0.
 Data discretization transforms numeric data by mapping values to interval
or concept labels (Concept hierarchy is used for this purpose).
Data Transformation
 Normalization:-
 Normalizing the data attempts to give all attributes an equal weight.
 Normalization is particularly useful for classification algorithms
involving neural networks or distance measurements such as nearest-
neighbor classification and clustering.
 Example:- If using the neural network backpropagation algorithm for
classification, normalizing the input values for each attribute
measured in the training tuples will help speed up the learning phase.
 Methods:-
 Min-Max Normalization
 Z-score Normalization
 etc.
Data Transformation
 Min-max normalization performs a linear transformation on the
original data using the following formula:

v′ = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A

where min_A and max_A are the minimum and maximum values of attribute A, and
[new_min_A, new_max_A] is the new (target) range.
 In z-score normalization (or zero-mean normalization), the values of an attribute A
are normalized based on the mean (i.e., average) Ā and standard deviation σ_A of A:

v′ = (v − Ā) / σ_A

(The worked example on the slide uses Mean = 500 and SD = 282.843.)
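
Both formulas can be sketched directly in NumPy; the values are illustrative, and the target range [0, 1] is an assumption:

import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization to [new_min, new_max]
new_min, new_max = 0.0, 1.0
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score normalization (np.std defaults to the population standard deviation;
# pass ddof=1 for the sample version)
v_zscore = (v - v.mean()) / v.std()

print(v_minmax, v_zscore)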
Data discretization example

[Figure: concept hierarchy for the attribute age, mapping numeric ages to interval and concept labels]
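
A small sketch of age discretization with pandas; the cut points and concept labels below are assumptions, not the ones from the slide:

import pandas as pd

ages = pd.Series([5, 17, 23, 35, 48, 62, 71])

# Map numeric ages to interval/concept labels (a hypothetical hierarchy)
labels = pd.cut(ages,
                bins=[0, 12, 19, 35, 60, 120],
                labels=["child", "teen", "young adult", "middle-aged", "senior"])
print(labels.tolist())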
Data reduction
 Data reduction techniques can be applied to obtain a reduced representation of the data set
that is much smaller in volume, yet closely maintains the integrity of the original data.
 That is, mining/ML tasks on the reduced data set should be more efficient yet produce the
same (or almost the same) analytical results.
 Numerosity reduction techniques replace the original data volume by alternative, smaller
forms of data representation.
 These techniques may be parametric or nonparametric.
 For parametric methods, a model is used to estimate the data, so that typically only the data
parameters need to be stored, instead of the actual data. Regression and log-linear models are
examples.
 Nonparametric methods for storing reduced representations of the data include histograms,
clustering, sampling, and data cube aggregation.
 Dimensionality reduction is the process of reducing the number of variables or attributes
under consideration.
 Methods used for this purpose include the wavelet transform, attribute subset selection,
principal component analysis (PCA), etc.
Data reduction
Numerosity reduction (Sampling)
 Sampling is a non-parametric numerosity reduction method.
 Sampling can be used as a data reduction technique because it allows a large data set
to be represented by a much smaller random data sample (or subset). Suppose that a
large data set, D, contains N tuples.
 Simple random sample without replacement (SRSWOR) of size s: This is created by drawing s
of the N tuples from D (s < N), where the probability of drawing any tuple in D is 1/N, that is,
all tuples are equally likely to be sampled.
 Simple random sample with replacement (SRSWR) of size s: This is similar to SRSWOR,
except that each time a tuple is drawn from D, it is recorded and then replaced. That is, after
a tuple is drawn, it is placed back in D so that it may be drawn again.
 Stratified sample: If D is divided into mutually disjoint parts called strata, a stratified sample
of D is generated by obtaining an SRS at each stratum. This helps ensure a representative
sample from each stratum. For example, a stratified sample may be obtained from customer
data, where a stratum is created for each customer age group.
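
The three schemes can be sketched with pandas; the customer DataFrame and its age_group strata are hypothetical:

import pandas as pd

df = pd.DataFrame({
    "customer_id": range(1, 11),
    "age_group": ["young"] * 4 + ["middle"] * 3 + ["senior"] * 3,
})

srswor = df.sample(n=4, replace=False, random_state=0)  # simple random sample without replacement
srswr = df.sample(n=4, replace=True, random_state=0)    # simple random sample with replacement

# Stratified sample: an SRS of 2 tuples drawn from each stratum (age group)
stratified = df.groupby("age_group").sample(n=2, random_state=0)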
Data reduction
Dimensionality reduction (PCA)
 Principal Component Analysis, or PCA, is a dimensionality-reduction method that
is often used to reduce the dimensionality of large data sets, by transforming a
large set of variables into a smaller one that still contains most of the information
in the large set.
 Steps:-
 Normalize the data set (optional).
 Compute the covariance matrix.
 Find the eigenvalues and eigenvectors of the covariance matrix and
normalize the eigenvectors.
 Find the principal components by multiplying the normalized eigenvectors with
the data points.
Example:-

X Y
4 11
8 4
13 5
7 14

No. of attributes (n) = 2
No. of records (N) = 4

(1) Let us skip step 1, i.e., normalization of the data set.

(2) Compute the covariance matrix (S) by considering all possible pairs of attributes.

    S = | Cov(X,X)  Cov(X,Y) |  =  |  14  −11 |
        | Cov(Y,X)  Cov(Y,Y) |     | −11   23 |

Mean(X) = (4 + 8 + 13 + 7)/4 = 8
Mean(Y) = (11 + 4 + 5 + 14)/4 = 8.5

Cov(X,X) = 1/3 × [(4−8)² + 0 + (13−8)² + (7−8)²] = 14
Cov(X,Y) = Cov(Y,X) = 1/3 × [(4−8)(11−8.5) + 0 + (13−8)(5−8.5) + (7−8)(14−8.5)] = −11
Cov(Y,Y) = 1/3 × [(11−8.5)² + (4−8.5)² + (5−8.5)² + (14−8.5)²] = 23
(3) Compute the eigenvalues and eigenvectors of the covariance
matrix and normalize the eigenvectors.

Solving det(S − λI) = 0, i.e., (14 − λ)(23 − λ) − (−11)² = 0, gives λ₁ ≈ 30.384 and λ₂ ≈ 6.616.

Let e = (e₁, e₂)ᵀ be the eigenvector corresponding to λ₁ = 30.384, so that (S − λ₁I)e = 0:

(14 − λ₁)e₁ − 11e₂ = 0
−11e₁ + (23 − λ₁)e₂ = 0

Take any one of the above equations, fix one component (say e₂ = 1), and solve for the other:
e₁ = 11/(14 − λ₁) ≈ −0.671, so e ≈ (−0.671, 1)ᵀ.

Compute the normalized eigenvector E1.

E1 = e/‖e‖ ≈ (−0.557, 0.830)ᵀ

Likewise, the eigenvector corresponding to λ₂ can be computed.

Here, we are interested in reducing the given data set to one dimension, so
we consider only the principal eigenvector (the one for the largest eigenvalue) and normalize it.
(4) Computation of the principal components (here only
one principal component is generated, as we used
only the principal eigenvector).

1st principal component: PC1 = E1ᵀ · x for each data point x = (X, Y)ᵀ, i.e., each record is
projected onto the normalized principal eigenvector.

Likewise, the 2nd principal component can be computed if two dimensions are required.
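
As a quick check, the worked example above can be reproduced with NumPy (a sketch; np.cov divides by N − 1, matching the 1/3 factor used on the slides):

import numpy as np

data = np.array([[4, 11],
                 [8, 4],
                 [13, 5],
                 [7, 14]], dtype=float)

S = np.cov(data, rowvar=False)        # covariance matrix; should be [[14, -11], [-11, 23]]
eigvals, eigvecs = np.linalg.eigh(S)  # eigh returns eigenvalues in ascending order

principal_vec = eigvecs[:, -1]        # normalized eigenvector of the largest eigenvalue (~30.38)
pc1 = data @ principal_vec            # 1st principal component (one value per record)
print(eigvals, principal_vec, pc1)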
Data reduction (Dimensionality reduction )
Attribute Subset Selection

 Data sets for analysis may contain hundreds of attributes, many of which may
be irrelevant to the mining task or redundant.
 Attribute subset selection reduces the data set size by removing irrelevant or
redundant attributes (or dimensions).
 The goal of attribute subset selection is to find a minimum set of attributes
such that the resulting probability distribution of the data classes is as close
as possible to the original distribution obtained using all attributes.
 Mining on a reduced set of attributes has an additional benefit:
 It reduces the number of attributes appearing in the discovered patterns,
helping to make the patterns easier to understand.
Data reduction (Dimensionality reduction )
Attribute Subset Selection (Cont..)
 Stepwise forward selection:
 The procedure starts with an empty set of attributes as the reduced set.
 The best of the original attributes is determined and added to the reduced
set.
 At each subsequent iteration or step, the best of the remaining original
attributes is added to the set.
 Stepwise backward elimination:
 The procedure starts with the full set of attributes.
 At each step, it removes the worst attribute remaining in the set.
 Combination of forward selection and backward elimination:
 The stepwise forward selection and backward elimination methods can be
combined so that, at each step, the procedure selects the best attribute and
removes the worst from among the remaining attributes.
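
For instance, scikit-learn's SequentialFeatureSelector implements greedy forward and backward selection; a hedged sketch in which the synthetic data set and the kNN estimator are only for illustration:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)

# Greedy stepwise forward selection of 4 attributes; direction="backward" gives backward elimination
selector = SequentialFeatureSelector(
    KNeighborsClassifier(), n_features_to_select=4, direction="forward"
)
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the selected attributes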
Data reduction (Dimensionality reduction )
Attribute Subset Selection (Cont..)
 Decision tree induction:
 Decision tree algorithms (e.g., ID3, C4.5, and CART) were originally intended
for classification.
 Decision tree induction constructs a flowchart-like structure where each
internal (nonleaf) node denotes a test on an attribute, each branch
corresponds to an outcome of the test, and each external (leaf) node
denotes a class prediction.
 At each node, the algorithm chooses the “best” attribute to partition the
data into individual classes.
 When decision tree induction is used for attribute subset selection, a tree is
constructed from the given data. All attributes that do not appear in the tree
are assumed to be irrelevant. The set of attributes appearing in the tree forms
the reduced subset of attributes (see the sketch below).
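
A short sketch of that idea with scikit-learn; the synthetic data are illustrative, and any attribute that never appears as a split in the fitted tree is treated as irrelevant:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Attributes actually used as a test at some internal node of the tree
# (leaf nodes are marked with a negative feature index)
used = np.unique(tree.tree_.feature[tree.tree_.feature >= 0])
print("reduced attribute subset:", used)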
Data reduction (Dimensionality reduction )
Attribute Subset Selection (Contd..)
Attribute subset selection (Decision Tree Induction)
