Data Preprocessing

By C. Kayathri
Student at ANJAC, Sivakasi

1
Data Preprocessing

 Definition
 Why preprocess the data?
 Data cleaning
 Data integration and transformation
 Data reduction
 Summary

2
Definition

Data preprocessing is a data mining technique that involves transforming raw data into an understandable format.

3
Why Data Preprocessing?

Data in the real world is dirty:
 incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
○ e.g., occupation=“ ”
 noisy: containing errors or outliers
○ e.g., Salary=“-10”
 inconsistent: containing discrepancies in codes or names
○ e.g., Age=“42”, Birthday=“03/07/1997”
○ e.g., rating was “1, 2, 3”, now rating “A, B, C”
○ e.g., discrepancies between duplicate records

4
Why Is Data Preprocessing Important?

No quality data, no quality mining results!
 Quality decisions must be based on quality data
○ e.g., duplicate or missing data may cause incorrect or even misleading statistics.
 A data warehouse needs consistent integration of quality data
 Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse

5
Major Tasks in Data Preprocessing

 Data cleaning
 Data integration
 Data transformation
 Data reduction

6
Forms of Data Preprocessing

7
Data Cleaning

Importance
 “Data cleaning is one of the three biggest problems in data warehousing”—Ralph Kimball
 “Data cleaning is the number one problem in data warehousing”—DCI survey

Data cleaning tasks
 Fill in missing values
 Identify outliers and smooth out noisy data
 Correct inconsistent data
 Resolve redundancy caused by data integration

8
Missing Data

Data is not always available
 e.g., many tuples have no recorded value for several attributes, such as customer income in sales data

Missing data may be due to
 equipment malfunction
 data inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data not considered important at the time of entry
 history or changes of the data not registered

Missing data may need to be inferred (see the sketch below).

9
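A minimal sketch of inferring a missing value, assuming Python with pandas; the table and column names are hypothetical stand-ins for the customer-income example above. Filling with the attribute mean is only one common choice.

import pandas as pd
import numpy as np

# Hypothetical sales data: customer B's income was never recorded.
df = pd.DataFrame({
    "customer": ["A", "B", "C", "D"],
    "income": [52000.0, np.nan, 61000.0, 48000.0],
})

# One simple inference: fill the gap with the mean of the attribute.
df["income"] = df["income"].fillna(df["income"].mean())
print(df)   # B's income becomes the mean of the other three values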
Noisy Data

Noise: random error or variance in a measured variable

Incorrect attribute values may be due to
 faulty data collection instruments
 data entry problems
 data transmission problems
 technology limitations
 inconsistency in naming conventions

Other data problems which require data cleaning
 duplicate records
 incomplete data
 inconsistent data

10
How to Handle Noisy Data?

Binning
 first sort the data and partition it into (equal-frequency) bins
 then smooth by bin means, bin medians, bin boundaries, etc.
Regression
 smooth by fitting the data to regression functions
Clustering
 detect and remove outliers (see the sketch below)
Combined computer and human inspection
 detect suspicious values and have a human check them (e.g., deal with possible outliers)

11
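A minimal sketch of clustering-based outlier detection, assuming one-dimensional toy data and two illustrative seed centroids: a tiny k-means is run, and values stranded in very small clusters fall outside the main groups, making them candidates for human inspection.

def kmeans_1d(points, centroids, iterations=10):
    # Tiny 1-D k-means: alternate assignment and centroid update.
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34, 180]
_, clusters = kmeans_1d(points, centroids=[10.0, 30.0])

# Values left in very small clusters are candidate outliers.
outliers = [p for c in clusters if len(c) <= 1 for p in c]
print(outliers)   # [180]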
Simple Discretization Methods: Binning

Equal-width (distance) partitioning
 Divides the range into N intervals of equal size: a uniform grid
 If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B − A)/N (see the sketch below)
 The most straightforward method, but outliers may dominate the presentation
 Skewed data is not handled well

Equal-depth (frequency) partitioning
 Divides the range into N intervals, each containing approximately the same number of samples
 Good data scaling
 Managing categorical attributes can be tricky

12
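A minimal sketch of both partitioning schemes in Python; the helper names are illustrative. equal_width_bins computes the interval edges from W = (B − A)/N, while equal_depth_bins splits the sorted values into bins of approximately equal size.

def equal_width_bins(values, n):
    # Equal-width partitioning: N intervals of width W = (B - A) / N.
    a, b = min(values), max(values)
    w = (b - a) / n
    return [a + i * w for i in range(n + 1)]   # the N+1 bin edges

def equal_depth_bins(values, n):
    # Equal-depth partitioning: sort, then give each bin roughly the
    # same number of samples (earlier bins absorb any remainder).
    s = sorted(values)
    size, rem = divmod(len(s), n)
    bins, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < rem else 0)
        bins.append(s[start:end])
        start = end
    return bins

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_width_bins(prices, 3))   # [4.0, 14.0, 24.0, 34.0]
print(equal_depth_bins(prices, 3))   # the three bins used on the next slide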
Binning Methods for Data Smoothing

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34

13
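The two smoothing rules above are easy to reproduce; this Python sketch (helper names illustrative) regenerates the slide's numbers exactly.

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = [prices[0:4], prices[4:8], prices[8:12]]   # the equi-depth bins above

def smooth_by_means(bins):
    # Replace every value in a bin with the bin mean (rounded, as on the slide).
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # Replace every value with the nearer of the bin's min or max boundary.
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]          # bins are already sorted
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

print(smooth_by_means(bins))        # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]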
Regression

[Figure: points smoothed by fitting the regression line y = x + 1; the observed value Y1 at X1 is replaced by the fitted value Y1′]
14
Cluster Analysis

15
Data Integration

Data integration:
 combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-# (see the merge sketch below)
 integrate metadata from different sources
Entity identification problem:
 identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
Detecting and resolving data value conflicts
 for the same real-world entity, attribute values from different sources differ
 possible reasons: different representations, different scales, e.g., metric vs. British units

16
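A minimal sketch of schema integration in pandas, assuming the cust-id ≡ cust-# correspondence from the slide; the tables and the other column names are hypothetical.

import pandas as pd

# Source A calls the key "cust-id"; source B calls the same attribute "cust-#".
a = pd.DataFrame({"cust-id": [1, 2], "name": ["Bill Clinton", "J. Doe"]})
b = pd.DataFrame({"cust-#": [1, 2], "balance": [100.0, 250.0]})

# Schema integration: rename B's key to match A's, then merge into one store.
b = b.rename(columns={"cust-#": "cust-id"})
merged = pd.merge(a, b, on="cust-id")
print(merged)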
Data Transformation

Smoothing: remove noise from the data

Aggregation: summarization, data cube construction

Generalization: concept hierarchy climbing

Normalization: scale values to fall within a small, specified range (see the sketch below)
 min-max normalization
 z-score normalization
 normalization by decimal scaling

Attribute/feature construction
 new attributes constructed from the given ones

17
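A minimal sketch of the three normalization methods using their standard textbook formulas; the function names and example values are illustrative, and the decimal-scaling helper assumes the maximum absolute value is at least 1.

def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    # Min-max: map v from [vmin, vmax] onto [new_min, new_max].
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    # Z-score: how many standard deviations v lies from the mean.
    return (v - mean) / std

def decimal_scaling(v, max_abs):
    # Decimal scaling: divide by 10**j for the smallest j that brings
    # the largest absolute value below 1 (assumes max_abs >= 1).
    j = len(str(int(max_abs)))
    return v / (10 ** j)

print(min_max(73600, vmin=12000, vmax=98600))   # ~0.711
print(z_score(73600, mean=54000, std=16000))    # 1.225
print(decimal_scaling(-986, max_abs=986))       # -0.986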
Data Reduction

Why data reduction?
 A database/data warehouse may store terabytes of data
 Complex data analysis/mining may take a very long time to run on the complete data set

Data reduction
 Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results

Data reduction strategies
 Data cube aggregation
 Dimensionality reduction — e.g., remove unimportant attributes
 Data compression
 Numerosity reduction — e.g., fit data into models (see the sampling sketch below)
 Discretization and concept hierarchy generation

18
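A minimal sketch of one simple numerosity-reduction strategy, random sampling; the sizes are illustrative. A small sample can stand in for the full data while producing almost the same analytical results.

import random

# Stand-in for a large data set (e.g., millions of numeric records).
full_data = list(range(1_000_000))

# Numerosity reduction: keep a random sample that is 0.1% of the volume.
sample = random.sample(full_data, 1_000)

# Sanity check: the sample mean closely tracks the full-data mean.
print(sum(sample) / len(sample))
print(sum(full_data) / len(full_data))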
Summary

Data preparation, or preprocessing, is a big issue for both data warehousing and data mining

Descriptive data summarization is needed for quality data preprocessing

Data preparation includes
 data cleaning and data integration
 data reduction and feature selection

Many methods have been developed, but data preprocessing is still an active area of research

19
Thank you!

20
