Chapter 3: Data Preprocessing

 Data Preprocessing: An Overview
 Data Quality
 Major Tasks in Data Preprocessing
 Data Cleaning
 Data Integration
 Data Reduction
 Data Transformation and Data Discretization
 Summary
Data Quality: Why Preprocess the Data?
 Measures for data quality: A multidimensional view

 Accuracy: correct or wrong, accurate or not
 Completeness: not recorded, unavailable, …
 Consistency: some modified but some not, dangling, …
 Timeliness: timely update?
 Believability: how much can the data be trusted to be correct?
 Interpretability: how easily can the data be understood?
Major Tasks in Data Preprocessing
 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies

 Data integration
 Integration of multiple databases, data cubes, or files
 Data reduction
 Dimensionality reduction
 Numerosity reduction
 Data compression
 Data transformation and data discretization

 Normalization
 Concept hierarchy generation
Data Cleaning
 Data in the real world is dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission error
 incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data

 e.g., Occupation = “ ” (missing data)
 noisy: containing noise, errors, or outliers
 e.g., Salary = “−10” (an error)
 inconsistent: containing discrepancies in codes or names, e.g.,
 Age = “42”, Birthday = “03/07/2010”
 Was rating “1, 2, 3”, now rating “A, B, C”
 discrepancy between duplicate records
 Intentional (e.g., disguised missing data)
 Jan. 1 as everyone’s birthday?
Incomplete (Missing) Data
 Data is not always available

 E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
 Missing data may be due to
 equipment malfunction
 inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data may not be considered important at the time
of entry
 history or changes of the data not being registered
 Missing data may need to be inferred
How to Handle Missing Data?
 Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the percentage of missing values per attribute varies considerably
 Fill in the missing value manually: tedious and often infeasible
 Fill it in automatically with
 a global constant : e.g., “unknown”, a new class?!
 the attribute mean
 the attribute mean for all samples belonging to the same
class: smarter
 the most probable value: inference-based such as Bayesian
formula or decision tree
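The two mean-based fill strategies above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function names, the toy income values, and the "low"/"high" class labels are all assumptions made for the example, and missing values are represented as None.

```python
from collections import defaultdict
from statistics import mean


def fill_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    attribute_mean = mean(observed)
    return [attribute_mean if v is None else v for v in values]


def fill_with_class_mean(values, labels):
    """Replace None entries with the mean of observed values in the same class.

    Assumes every class has at least one observed value; otherwise the
    lookup below raises KeyError.
    """
    by_class = defaultdict(list)
    for v, c in zip(values, labels):
        if v is not None:
            by_class[c].append(v)
    class_means = {c: mean(vs) for c, vs in by_class.items()}
    return [class_means[c] if v is None else v for v, c in zip(values, labels)]


# Hypothetical toy data: customer income with two missing entries.
income = [30, None, 50, 90, None, 110]
risk = ["low", "low", "low", "high", "high", "high"]

global_filled = fill_with_mean(income)            # gaps become 70
class_filled = fill_with_class_mean(income, risk)  # gaps become 40 and 100
```

The class-conditional version fills each gap with a value typical of that tuple's own class, which is why the slide calls it the smarter choice.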
Noisy Data
 Noise: random error or variance in a measured variable
 Incorrect attribute values may be due to
 faulty data collection instruments
 data entry problems
 data transmission problems
 technology limitation
 inconsistency in naming convention
 Other data problems which require data cleaning

 duplicate records
 incomplete data
 inconsistent data
How to Handle Noisy Data?
 Binning
 first sort data and partition into (equal-frequency) bins
 then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.

 Regression
 smooth by fitting the data into regression functions
 Clustering
 detect and remove outliers
 Combined computer and human inspection

 detect suspicious values and have a human check them (e.g., to deal with possible outliers)
Data Cleaning as a Process
 Data discrepancy detection
 Use metadata (e.g., domain, range, dependency, distribution)
 Check field overloading
 Check uniqueness rule, consecutive rule and null rule
 Use commercial tools
 Data scrubbing: use simple domain knowledge (e.g., postal code,
spell-check) to detect errors and make corrections

 Data auditing: analyze the data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)

 Data migration and integration
 Data migration tools: allow transformations to be specified
 ETL (Extraction/Transformation/Loading) tools: allow users to specify
transformations through a graphical user interface

 Integration of the two processes
 Iterative and interactive (e.g., Potter’s Wheel, a tool that integrates discrepancy detection and transformation)
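As a rough sketch of discrepancy detection, the uniqueness, consecutive, and null rules can each be checked with a few lines of Python. The record layout, the field names, and the function name find_discrepancies are assumptions made for illustration; the sketch also assumes integer keys so that gaps can be enumerated.

```python
def find_discrepancies(records, key, required_fields):
    """Apply three simple rules to a list of dict records.

    Uniqueness rule: each value of `key` appears only once.
    Consecutive rule: no gaps between the lowest and highest `key` values.
    Null rule: required fields are neither None nor an empty string.
    """
    keys = [r[key] for r in records]
    duplicates = sorted({k for k in keys if keys.count(k) > 1})
    gaps = sorted(set(range(min(keys), max(keys) + 1)) - set(keys))
    nulls = [
        (r[key], field)
        for r in records
        for field in required_fields
        if r.get(field) in (None, "")
    ]
    return duplicates, gaps, nulls


# Hypothetical records with a duplicate id, missing ids, and empty fields.
records = [
    {"id": 1, "occupation": "engineer", "salary": 5000},
    {"id": 2, "occupation": "", "salary": 4200},
    {"id": 2, "occupation": "teacher", "salary": -10},
    {"id": 5, "occupation": "nurse", "salary": None},
]

duplicates, gaps, nulls = find_discrepancies(records, "id", ["occupation", "salary"])
```

Note that the salary of −10 passes all three rules; catching it would need a domain or range check taken from the metadata.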

Data Integration
 Data integration:
 Combines data from multiple sources into a coherent store
 Schema integration: e.g., A.cust-id ≡ B.cust-#
 Integrate metadata from different sources
 Entity identification problem:
 Identify real world entities from multiple data sources, e.g., Bill Clinton =
William Clinton
 Detecting and resolving data value conflicts
 For the same real world entity, attribute values from different sources
are different
 Possible reasons: different representations, different scales, e.g., metric
vs. British units
Handling Redundancy in Data Integration
 Redundant data often occur when multiple databases are integrated
 Object identification: The same attribute or object may
have different names in different databases
 Derivable data: One attribute may be a “derived” attribute
in another table, e.g., annual revenue
 Redundant attributes can often be detected by correlation analysis and covariance analysis
 Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve
mining speed and quality
Correlation Analysis (Nominal Data)
 χ² (chi-square) test
\[ \chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} \]
 The larger the χ² value, the more likely the variables are related
 The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
 Correlation does not imply causality
 The number of hospitals and the number of car thefts in a city are correlated
 Both are causally linked to the third variable: population
Chi-Square Calculation: An Example
                          Play chess   Not play chess   Sum (row)
Like science fiction      250 (90)     200 (360)        450
Not like science fiction  50 (210)     1000 (840)       1050
Sum (col.)                300          1200             1500
 Expected count for the first cell (like science fiction, play chess) = (300 × 450)/1500 = 90

 χ² (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories)
\[ \chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} \approx 507.93 \]
 It shows that like_science_fiction and play_chess are correlated in the group
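 Degrees of freedom: (r − 1)(c − 1), where r and c are the numbers of rows and columns in the contingency table
 Level of significance: compare χ² with the critical value for the chosen significance level and the degrees of freedom
 χ² > critical value: reject the null hypothesis that the two attributes are independent
 Otherwise, the null hypothesis is not rejected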
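The chess and science fiction example can be reproduced with a short Python sketch. The function name chi_square is an assumption for illustration. The critical value 10.828 is the standard table value for 1 degree of freedom at the 0.001 significance level. Computed without intermediate rounding, the statistic is about 507.94; the 507.93 on the slide appears to come from rounding each term before summing.

```python
def chi_square(observed):
    """Return (statistic, expected_counts, degrees_of_freedom) for a contingency table."""
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    total = sum(row_sums)
    # Expected count of a cell = (row sum * column sum) / grand total.
    expected = [[r * c / total for c in col_sums] for r in row_sums]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_sums) - 1) * (len(col_sums) - 1)
    return statistic, expected, dof


# Rows: like / not like science fiction; columns: play / not play chess.
table = [[250, 200], [50, 1000]]
statistic, expected, dof = chi_square(table)

# Standard table value for dof = 1 at the 0.001 significance level.
CRITICAL_VALUE = 10.828
reject_independence = statistic > CRITICAL_VALUE
```

Because the statistic is far above the critical value, the hypothesis that the two attributes are independent is rejected for this group.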
4/9/2019 Data Mining: Concepts and Techniques 16

Covariance (Numeric Data)
 Covariance is similar to correlation
\[ Cov(A,B) = E\big((A-\bar{A})(B-\bar{B})\big) = \frac{\sum_{i=1}^{n}(a_i-\bar{A})(b_i-\bar{B})}{n} \]
Correlation coefficient:
\[ r_{A,B} = \frac{Cov(A,B)}{\sigma_A \sigma_B} \]
where n is the number of tuples, \(\bar{A}\) and \(\bar{B}\) are the respective means or expected values of A and B, and \(\sigma_A\) and \(\sigma_B\) are the respective standard deviations of A and B
 Positive covariance: if Cov(A,B) > 0, then A and B both tend to be larger than their expected values
 Negative covariance: if Cov(A,B) < 0, then if A is larger than its expected value, B is likely to be smaller than its expected value
 Independence: if A and B are independent, then Cov(A,B) = 0, but the converse is not true:
 Some pairs of random variables may have a covariance of 0 but are not independent. Only under some additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence
Co-Variance: An Example
 It can be simplified in computation as
\[ Cov(A,B) = E(A \cdot B) - \bar{A}\,\bar{B} \]
 Suppose two stocks A and B have the following values in one week: (2, 5), (3,
8), (5, 10), (4, 11), (6, 14).
 Question: If the stocks are affected by the same industry trends, will their
prices rise or fall together?
 E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
 E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
 Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
 Thus, A and B rise together since Cov(A, B) > 0.
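The stock example can be checked with a small Python sketch that uses the simplified form E(A·B) − E(A)E(B). The function names are assumptions for illustration. Both functions divide by n, as in the worked example, not by n − 1.

```python
from math import sqrt


def covariance(a, b):
    """Population covariance computed as E(A*B) - E(A)*E(B)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum(x * y for x, y in zip(a, b)) / n - mean_a * mean_b


def correlation(a, b):
    """Correlation coefficient: covariance divided by both standard deviations."""
    std_a = sqrt(covariance(a, a))  # Cov(A, A) is the variance of A
    std_b = sqrt(covariance(b, b))
    return covariance(a, b) / (std_a * std_b)


stock_a = [2, 3, 5, 4, 6]
stock_b = [5, 8, 10, 11, 14]

cov_ab = covariance(stock_a, stock_b)  # about 4, matching the example
r_ab = correlation(stock_a, stock_b)   # about 0.94
```

The covariance only gives the direction of the relationship; the correlation coefficient rescales it to [−1, 1], and here it comes out close to 1.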

Data Reduction Strategies
 Data reduction: Obtain a reduced representation of the data set that is much
smaller in volume but yet produces the same (or almost the same) analytical
results
 Why data reduction? — A database/data warehouse may store terabytes of
data. Complex data analysis may take a very long time to run on the
complete data set.
 Data reduction strategies
 Dimensionality reduction, e.g., remove unimportant attributes
 Wavelet transforms
 Principal Components Analysis (PCA)
 Feature subset selection, feature creation
 Numerosity reduction, e.g., replace the data with smaller representations
 Regression and Log-Linear Models (Parametric)
 Histograms, clustering, sampling (Non parametric)
 Data cube aggregation
 Data compression (lossless and lossy)
Attribute Subset Selection
 Another way to reduce dimensionality of data
 Redundant attributes
 Duplicate much or all of the information contained in one or
more other attributes
 E.g., purchase price of a product and the amount of sales
tax paid
 Irrelevant attributes
 Contain no information that is useful for the data mining
task at hand
 E.g., students' ID is often irrelevant to the task of predicting
students' GPA
Heuristic Search in Attribute Selection
 There are 2^d possible attribute combinations of d attributes

 Typical heuristic attribute selection methods:
 Best single attribute under the attribute independence
assumption: choose by significance tests

 Best step-wise feature selection:
 The best single-attribute is picked first
 Then the next best attribute conditioned on the first, ...
 Step-wise attribute elimination:
 Repeatedly eliminate the worst attribute
 Best combined attribute selection and elimination
 Optimal branch and bound:
 Use attribute elimination and backtracking
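Step-wise forward selection can be sketched as a greedy loop. Everything named here is hypothetical: forward_selection, the toy_score function, and the USEFUL weights are made up for the example. In practice the score would come from a significance test or a model evaluated on the candidate attribute subset.

```python
def forward_selection(attributes, score, max_attrs=None):
    """Greedy step-wise forward selection.

    `score` is a caller-supplied function that maps a list of attribute
    names to a number, where larger is better.
    """
    selected = []
    remaining = list(attributes)
    best_score = score(selected)
    while remaining and (max_attrs is None or len(selected) < max_attrs):
        # Pick the attribute that helps most, given those already selected.
        candidate = max(remaining, key=lambda a: score(selected + [a]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best_score:
            break  # no remaining attribute improves the score
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected


# Hypothetical usefulness weights; each selected attribute costs 0.5.
USEFUL = {"income": 3.0, "age": 2.0, "tax_paid": 0.0, "student_id": 0.0}


def toy_score(attrs):
    return sum(USEFUL[a] for a in attrs) - 0.5 * len(attrs)


chosen = forward_selection(list(USEFUL), toy_score)
```

The loop evaluates a number of subsets that grows roughly with d squared instead of all 2^d combinations, at the price of possibly missing the best subset.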
Attribute Creation (Feature Generation)
 Create new attributes (features) that can capture the
important information in a data set more effectively than the
original ones
 Three general methodologies
 Attribute extraction
 Domain-specific
 Mapping data to new space (see: data reduction)
 E.g., Fourier transformation, wavelet transformation,
manifold approaches (not covered)

 Attribute construction
 Combining features (see: discriminative frequent
patterns in Chapter on “Advanced Classification”)

 Data discretization
Data Transformation
 A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values
 Methods
 Smoothing: Remove noise from data
 Attribute/feature construction
 New attributes constructed from the given ones
 Aggregation: Summarization, data cube construction
 Normalization: Scaled to fall within a smaller, specified range
 min-max normalization
 z-score normalization
 normalization by decimal scaling
 Discretization: Concept hierarchy climbing
Normalization
 Min-max normalization: to [new_min_A, new_max_A]
\[ v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A \]
 Ex. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to
\[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716 \]
 Z-score normalization (μ_A: mean, σ_A: standard deviation of attribute A):
\[ v' = \frac{v - \mu_A}{\sigma_A} \]
 Ex. Let μ_A = 54,000 and σ_A = 16,000. Then $73,600 is mapped to
\[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]
 Normalization by decimal scaling
\[ v' = \frac{v}{10^j} \]
where j is the smallest integer such that max(|v'|) < 1
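The three normalization methods can be sketched in Python as follows. The function names are assumptions for illustration. The income figures are the ones from the examples, with an income of 73,600 used for both min-max and z-score, while the values −986 and 917 for decimal scaling are made up.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] linearly onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


def z_score(v, mean_a, std_a):
    """Express v as a number of standard deviations away from the mean."""
    return (v - mean_a) / std_a


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer making every |v'| < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


income_scaled = min_max(73_600, 12_000, 98_000)  # about 0.716
income_z = z_score(73_600, 54_000, 16_000)       # 1.225
scaled, j = decimal_scaling([-986, 917])         # j = 3
```

Min-max needs the attribute's minimum and maximum in advance and is sensitive to outliers at either end; z-score is often preferred when those bounds are unknown or dominated by outliers.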
Discretization
 Three types of attributes
 Nominal: values from an unordered set, e.g., color, profession
 Ordinal: values from an ordered set, e.g., military or academic rank
 Numeric: quantitative values, e.g., integer or real numbers
 Discretization: Divide the range of a continuous attribute into intervals
 Interval labels can then be used to replace actual data values
 Reduce data size by discretization
 Supervised vs. unsupervised
 Split (top-down) vs. merge (bottom-up)
 Discretization can be performed recursively on an attribute
 Prepare for further analysis, e.g., classification
Simple Discretization: Binning
 Equal-width (distance) partitioning

 Divides the range into N intervals of equal size: uniform grid
 if A and B are the lowest and highest values of the attribute, the width of
intervals will be: W = (B –A)/N.
 The most straightforward approach, but outliers may dominate the presentation
 Skewed data is not handled well
 Equal-depth (frequency) partitioning

 Divides the range into N intervals, each containing approximately the same number of samples
 Good data scaling
 Managing categorical attributes can be tricky
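The two partitioning schemes can be compared with a short Python sketch. The function names are assumptions for illustration, and the price list is the one used in the smoothing example in this chapter. The equal-width version assumes the attribute has at least two distinct values, so the interval width is not zero.

```python
def equal_width_bins(values, n_bins):
    """Split the value range [lowest, highest] into n_bins intervals of width W."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # W = (B - A) / N
    bins = [[] for _ in range(n_bins)]
    for v in values:
        # The highest value falls on the upper edge, so clamp it into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        bins[idx].append(v)
    return bins


def equal_depth_bins(values, n_bins):
    """Split the sorted values into n_bins groups of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
by_width = equal_width_bins(prices, 3)
by_depth = equal_depth_bins(prices, 3)
```

With equal width, half of the prices land in the last interval, while equal depth puts four prices in each bin. This is the sense in which skewed data is handled better by equal-depth partitioning.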
Binning Methods for Data Smoothing
 Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means (rounded to the nearest dollar):
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
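The two smoothing steps above can be reproduced with a minimal Python sketch. The function names are assumptions for illustration. Bin means are rounded to the nearest integer, and a value exactly halfway between the two boundaries is sent to the lower one; that tie-breaking rule is a choice made for this sketch, since the example contains no ties.

```python
def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean, rounded to an integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value in a bin by the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        # Ties go to the lower boundary (an assumption of this sketch).
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


# Equal-frequency bins from the price example.
bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]

by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```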
Summary
 Data quality: accuracy, completeness, consistency, timeliness,
believability, interpretability
 Data cleaning: e.g. missing/noisy values, outliers
 Data integration from multiple sources:
 Entity identification problem; Remove redundancies; Detect
inconsistencies
 Data reduction
 Dimensionality reduction; Numerosity reduction; Data
compression
 Data transformation and data discretization
 Normalization; Concept hierarchy generation