ICS 2408 - Lecture 2 - Data Preprocessing
[Figure: lossy numerosity reduction – original data vs. approximated data]
Numerosity Reduction
Reduce the data volume by choosing alternative ‘smaller’
forms of data representation
Two types:
Parametric – a model is used to estimate the data; only the
model parameters are stored instead of the actual data, e.g.
regression
log-linear models
Nonparametric – store a reduced representation of the
data, e.g.
Histograms
Clustering
Sampling
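As a sketch of the parametric case, the snippet below fits a straight line y = a·x + b by least squares and keeps only the two parameters instead of all the points; the data values are purely illustrative.

```python
# Parametric numerosity reduction: fit y = a*x + b and store only (a, b)
# instead of every data point. Illustrative, exactly linear data.
xs = list(range(10))
ys = [3 * x + 7 for x in xs]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
# Least-squares slope and intercept.
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x

print(a, b)  # the stored "model": two numbers in place of ten points
```

Any value can then be approximated on demand as `a * x + b`.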
Histograms
A popular data reduction technique
Divide the data into buckets and store the average (or sum)
for each bucket
Use binning to approximate data distributions
Each bucket spans a range on the horizontal axis; the
height (or area) of the bucket gives the average frequency
of the values it represents
A bucket holding a single attribute-value/frequency pair is
a singleton bucket
Buckets usually denote continuous ranges of the given
attribute
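A minimal equal-width histogram sketch: the range is split into buckets and only per-bucket counts are stored. The price list is illustrative.

```python
# Equal-width histogram: store one count per bucket instead of raw values.
def equal_width_histogram(values, num_buckets):
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        # Clamp so the maximum value falls into the last bucket.
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return lo, width, counts

prices = [1, 1, 5, 5, 5, 8, 8, 10, 10, 12, 14, 14, 15, 18, 21, 21, 21, 25, 25, 30]
lo, width, counts = equal_width_histogram(prices, 3)
print(counts)  # one count per bucket; the counts sum to len(prices)
```

The twenty prices are reduced to three bucket counts plus the range information.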
Clustering
Partition the data set into clusters, and store only the
cluster representations.
Can be very effective if the data is naturally clustered, but
not if the data is "smeared"/spread out.
Clusterings can be hierarchical and stored in multi-
dimensional index tree structures.
There are many choices of clustering definitions and
clustering algorithms
Sampling
Sampling: obtaining a small sample s to represent the
whole data set N
Allows a mining algorithm to run in complexity that is
potentially sub-linear in the size of the data
Key principle: choose a representative subset of the data
Simple random sampling may perform very poorly in the
presence of skew
Adaptive sampling methods, e.g. stratified sampling, help
in that case.
Types of Sampling
Simple random sampling
There is an equal probability of selecting any particular
item
Sampling without replacement
Once an object is selected, it is removed from the
population
Sampling with replacement
A selected object is not removed from the population
Stratified sampling:
Partition the data set, and draw samples from each
partition (proportionally, i.e., approximately the same
percentage of the data)
Used in conjunction with skewed data
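The three schemes above can be sketched with the standard library `random` module; the data set and the 10% sampling rate are illustrative.

```python
import random

rng = random.Random(42)
data = list(range(100))

# Simple random sampling WITHOUT replacement: each object picked at most once.
srswor = rng.sample(data, 10)

# Simple random sampling WITH replacement: an object may be picked again.
srswr = [rng.choice(data) for _ in range(10)]

# Stratified sampling: partition by some label, then draw the same
# percentage (here 10%) from each stratum -- robust to skewed strata sizes,
# since the small "rare" stratum is guaranteed representation.
strata = {"rare": list(range(10)), "common": list(range(10, 100))}
stratified = [v for label, items in strata.items()
              for v in rng.sample(items, max(1, len(items) // 10))]

print(len(srswor), len(srswr), len(stratified))
```

With plain random sampling, a 10-item sample could easily miss the "rare" stratum entirely; the stratified draw cannot.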
Discretization and Concept Hierarchy
Discretization
reduce the number of values for a given continuous
attribute by dividing the range of the attribute into
intervals. Interval labels can then be used to replace
actual data values
Concept hierarchies
reduce the data by collecting and replacing low level
concepts (such as numeric values for the attribute age) by
higher level concepts (such as young, middle-aged, or
senior)
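A concept hierarchy for the age example can be a simple mapping from numbers to labels; the cut-off points below are illustrative, not standard.

```python
# Concept hierarchy sketch: replace low-level numeric ages with
# higher-level concepts. Thresholds (35, 60) are illustrative assumptions.
def age_concept(age):
    if age < 35:
        return "young"
    elif age < 60:
        return "middle-aged"
    return "senior"

ages = [23, 41, 67, 30, 58]
labels = [age_concept(a) for a in ages]
print(labels)
```

The reduced data keeps far fewer distinct values (three labels instead of arbitrarily many ages), at the cost of precision.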
Discretization
Three types of attributes:
Nominal — values from an unordered set
Ordinal — values from an ordered set
Continuous — real numbers
Discretization:
divide the range of a continuous attribute into intervals
because some data mining algorithms only accept
categorical attributes.
Some techniques:
Binning methods – equal-width, equal-frequency
Histogram
Entropy-based methods
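The two binning methods can be contrasted in a few lines: equal-width splits the value range into k equal intervals, while equal-frequency puts (roughly) the same number of values in each bin. The values are illustrative.

```python
# Equal-width binning: k intervals of equal size over [min, max].
def equal_width_bins(values, k):
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # Clamp so the maximum value lands in the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

# Equal-frequency binning: roughly the same number of values per bin,
# assigned by rank in sorted order.
def equal_frequency_bins(values, k):
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    per_bin = len(values) / k
    for rank, i in enumerate(order):
        bins[i] = min(int(rank / per_bin), k - 1)
    return bins

values = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(equal_width_bins(values, 3))
print(equal_frequency_bins(values, 3))
```

Note how the two methods disagree: equal-width puts only two values in the first bin because of the wide range, while equal-frequency forces three values into every bin.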
References
J. Han and M. Kamber. Data Mining: Concepts and
Techniques. Morgan Kaufmann, 2000.
T. Dasu and T. Johnson. Exploratory Data Mining and
Data Cleaning. John Wiley & Sons, 2003.
V. Raman and J. Hellerstein. Potter's Wheel: An
Interactive Framework for Data Cleaning and
Transformation. VLDB 2001.
H.V. Jagadish et al. Special Issue on Data Reduction
Techniques. Bulletin of the Technical Committee on
Data Engineering, 20(4), December 1997.