
Data Preprocessing - Data Cleaning

Why preprocess the data? Data cleaning. Data integration and transformation. Data reduction. Discretization and concept hierarchy generation. Summary.

Data in the real world is dirty:
incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data (e.g., occupation=“ ”)
noisy: containing errors or outliers (e.g., Salary=“-10”)
inconsistent: containing discrepancies in codes or names (e.g., Age=“42” Birthday=“03/07/1997”; was rating “1,2,3”, now rating “A, B, C”; discrepancy between duplicate records)

Uploaded by

tierSarge

Data Preprocessing

January 20, 2015

Data Mining: Concepts and Techniques

Data Preprocessing

Why preprocess the data?

Data cleaning

Data integration and transformation

Data reduction

Discretization and concept hierarchy generation

Summary


Why Data Preprocessing?

Data in the real world is dirty

incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
e.g., occupation=“ ”

noisy: containing errors or outliers
e.g., Salary=“-10”

inconsistent: containing discrepancies in codes or names
e.g., Age=“42” Birthday=“03/07/1997”
e.g., Was rating “1,2,3”, now rating “A, B, C”
e.g., discrepancy between duplicate records

Why Is Data Dirty?

Incomplete data may come from
“Not applicable” data value when collected
Different considerations between the time when the data was collected and when it is analyzed
Human/hardware/software problems

Noisy data (incorrect values) may come from
Faulty data collection instruments
Human or computer error at data entry
Errors in data transmission

Inconsistent data may come from
Different data sources
Functional dependency violation (e.g., modify some linked data)

Duplicate records also need data cleaning

Why Is Data Preprocessing Important?

No quality data, no quality mining results!

Quality decisions must be based on quality data
e.g., duplicate or missing data may cause incorrect or even misleading statistics.

Data warehouse needs consistent integration of quality data

Data extraction, cleaning, and transformation make up the majority of the work of building a data warehouse


Multi-Dimensional Measure of Data Quality

Measures for data quality: A multidimensional view

Accuracy: correct or wrong, accurate or not

Completeness: not recorded, unavailable, ...

Consistency: some modified but some not, dangling, ...

Timeliness: timely update?

Believability: how far can the data be trusted to be correct?

Interpretability: how easily can the data be understood?

Major Tasks in Data Preprocessing

Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies

Data integration
Integration of multiple databases, data cubes, or files

Data reduction
Dimensionality reduction
Numerosity reduction
Data compression

Data transformation and data discretization
Normalization
Concept hierarchy generation

[Figure: Forms of Data Preprocessing]


Data Cleaning

Importance
“Data cleaning is one of the three biggest problems in data warehousing” (Ralph Kimball)
“Data cleaning is the number one problem in data warehousing” (DCI survey)


Data cleaning tasks

Fill in missing values

Identify outliers and smooth out noisy data

Correct inconsistent data

Resolve redundancy caused by data integration


Incomplete (Missing) Data

Data is not always available
E.g., many tuples have no recorded value for several attributes, such as customer income in sales data


Missing data may be due to
equipment malfunction
data that was inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data not being considered important at the time of entry
history or changes of the data not being registered

Missing data may need to be inferred

How to Handle Missing Data?

Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the % of missing values per attribute varies considerably

Fill in the missing value manually: tedious + infeasible?


Fill it in automatically with
a global constant: e.g., “unknown”, a new class?!
the attribute mean
the attribute mean for all samples belonging to the same class: smarter
the most probable value: inference-based, such as a Bayesian formula or decision tree
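
The first three automatic strategies above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `records` list, the `cls` and `income` field names, and the helper function names are hypothetical and do not come from the slides, and `None` is assumed to mark a missing value. The inference-based strategy (Bayesian formula or decision tree) is not covered.

```python
from statistics import mean

# Hypothetical toy records; None marks a missing income value.
records = [
    {"cls": "low", "income": 20.0},
    {"cls": "low", "income": None},
    {"cls": "low", "income": 30.0},
    {"cls": "high", "income": 90.0},
    {"cls": "high", "income": None},
    {"cls": "high", "income": 110.0},
]

def fill_constant(rows, attr, constant="unknown"):
    """Replace every missing value with one global constant."""
    return [dict(r, **{attr: constant if r[attr] is None else r[attr]}) for r in rows]

def fill_mean(rows, attr):
    """Replace missing values with the mean of the observed values."""
    observed = mean(r[attr] for r in rows if r[attr] is not None)
    return [dict(r, **{attr: observed if r[attr] is None else r[attr]}) for r in rows]

def fill_class_mean(rows, attr, class_attr):
    """Replace missing values with the mean observed within the same class.

    Assumes every class has at least one observed value.
    """
    by_class = {}
    for r in rows:
        if r[attr] is not None:
            by_class.setdefault(r[class_attr], []).append(r[attr])
    class_means = {c: mean(v) for c, v in by_class.items()}
    return [
        dict(r, **{attr: class_means[r[class_attr]] if r[attr] is None else r[attr]})
        for r in rows
    ]
```

On this toy data the overall mean fills both gaps with the same value, while the class mean fills the "low" and "high" gaps differently, which is the sense in which the slide calls it "smarter".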

Noisy Data

Noise: random error or variance in a measured variable

Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention


How to Handle Noisy Data?

Binning
first sort data and partition into (equal-frequency) bins
then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.


Simple Discretization Methods: Binning

Equal-width (distance) partitioning
Divides the range into N intervals of equal size: uniform grid
If A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B - A)/N.
The most straightforward, but outliers may dominate presentation
Skewed data is not handled well

Equal-depth (frequency) partitioning
Divides the range into N intervals, each containing approximately the same number of samples
Good data scaling
Managing categorical attributes can be tricky
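
A minimal sketch of the two partitioning schemes, assuming numeric values that are not all equal. The function names are hypothetical, and placing the maximum value B in the last interval is a convention chosen here, not something the slides specify. The `prices` list is the price data used in the smoothing example of this deck.

```python
def equal_width_bins(values, n):
    """Partition values into n intervals of width W = (B - A) / n."""
    a, b = min(values), max(values)
    width = (b - a) / n  # assumes b > a, otherwise the width would be zero
    bins = [[] for _ in range(n)]
    for v in values:
        # The maximum value B lies on the right edge, so clamp it into the last bin.
        idx = min(int((v - a) / width), n - 1)
        bins[idx].append(v)
    return bins

def equal_depth_bins(values, n):
    """Partition sorted values into n bins holding about the same number of samples."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n)
    bins, start = [], 0
    for i in range(n):
        # The first `extra` bins take one additional sample when the split is uneven.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
width_bins = equal_width_bins(prices, 3)
depth_bins = equal_depth_bins(prices, 3)
```

With three bins, the equal-width split puts half of the prices into the last interval, while the equal-depth split yields four values per bin.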


Binning Methods for Data Smoothing

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34

* Smoothing by bin means (rounded to the nearest dollar):
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29

* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
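
The smoothing steps above can be reproduced with a short Python sketch. The function names are hypothetical, and sending ties to the lower boundary is an assumption made here (no tie occurs in this data set).

```python
def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        # Assumption: a value equally far from both boundaries goes to the lower one.
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

The exact bin means are 9, 22.75 and 29.25; rounding them gives the 9, 23 and 29 shown on the slide.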

Regression
smooth by fitting the data to regression functions


[Figure: Regression. A plot with the fitted line y = x + 1 and the marked values X1 and Y1.]
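
As a rough illustration of smoothing by regression, the sketch below fits a straight line by ordinary least squares and replaces each observed y with the value on the fitted line. The sample points are hypothetical (chosen to scatter around y = x + 1), and the choice of a simple linear model is an assumption; the slides do not prescribe a particular regression function.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes the x values are not all identical
    return slope, mean_y - slope * mean_x

def smooth_by_regression(xs, ys):
    """Replace each y with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]

# Hypothetical noisy samples scattered around y = x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.2, 1.8, 3.1, 3.9, 5.0]
slope, intercept = fit_line(xs, ys)
smoothed = smooth_by_regression(xs, ys)
```

For these points the fitted slope and intercept come out close to 1, so the smoothed values lie near the line y = x + 1.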

Clustering
detect and remove outliers


[Figure: Cluster Analysis]
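
One simple way to use clustering for outlier detection, sketched here for one-dimensional data: group sorted values whenever neighbours lie within a distance cutoff, then treat members of very small clusters as outliers. The grouping rule, the `max_gap` and `min_size` parameters, and the sample data are all assumptions made for illustration; the slides do not name a specific clustering algorithm.

```python
def gap_clusters(values, max_gap):
    """Group sorted 1-D values; a new cluster starts when the gap exceeds max_gap."""
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters

def outliers_by_cluster_size(values, max_gap, min_size):
    """Treat members of clusters smaller than min_size as outliers."""
    return [v for c in gap_clusters(values, max_gap) if len(c) < min_size for v in c]

# Hypothetical measurements: two dense groups plus two isolated values.
data = [10, 11, 12, 13, 50, 51, 52, 53, 54, 90, 200]
clusters = gap_clusters(data, max_gap=5)
outliers = outliers_by_cluster_size(data, max_gap=5, min_size=3)
```

Here the two isolated values end up in singleton clusters and are reported as outliers; whether to remove them is still a judgment call that depends on the data.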

Combined computer and human inspection
detect suspicious values and have a human check them (e.g., deal with possible outliers)


Problems
3.3 Suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in increasing order)
13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
i. Use smoothing by bin means and by bin boundaries to smooth the data, using a bin depth of 3. Illustrate your steps.
ii. How might you determine the outliers?
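
One possible approach to part ii, not taken from the slides, is the box-plot rule: flag values lying more than 1.5 times the interquartile range outside the quartiles. The sketch below applies it to the age data; the function name and the factor `k = 1.5` are assumptions, and quartile conventions differ between tools (this uses the default method of Python's `statistics.quantiles`). Clustering or human inspection, as listed earlier, are alternatives.

```python
from statistics import quantiles

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # default "exclusive" quartile method
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

suspects = iqr_outliers(ages)
```

Under these assumptions only the age 70 is flagged; a stricter or looser factor `k` would change the result.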


Data Cleaning as a Process

Data discrepancy detection
Use metadata (e.g., domain, range, dependency, distribution)
Check field overloading
Check uniqueness rule, consecutive rule and null rule
Use commercial tools
Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
Data auditing: analyze the data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)

Data migration and integration
Data migration tools: allow transformations to be specified
ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface

Integration of the two processes
Iterative and interactive (e.g., Potter's Wheel)
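
The rule checks named under discrepancy detection can be sketched as small Python functions. Everything concrete here is hypothetical: the `rows` data, the `id` and `age` field names, the 0 to 120 age range standing in for metadata, and the simplified reading of each rule (null rule as "required value is present", consecutive rule as "no gaps between the lowest and highest integer value").

```python
# Hypothetical customer records; None marks a missing value.
rows = [
    {"id": 1, "age": 42},
    {"id": 2, "age": None},
    {"id": 2, "age": 35},
    {"id": 5, "age": -10},
]

def null_rule(rows, attr):
    """Indexes of rows where a required attribute is missing."""
    return [i for i, r in enumerate(rows) if r[attr] is None]

def uniqueness_rule(rows, attr):
    """Values of attr that occur more than once."""
    seen, dupes = set(), []
    for r in rows:
        v = r[attr]
        if v in seen and v not in dupes:
            dupes.append(v)
        seen.add(v)
    return dupes

def consecutive_rule(rows, attr):
    """Integers missing between the lowest and highest value of attr."""
    present = {r[attr] for r in rows}
    return [v for v in range(min(present), max(present) + 1) if v not in present]

def range_rule(rows, attr, low, high):
    """Indexes of rows whose non-missing value lies outside [low, high]."""
    return [i for i, r in enumerate(rows)
            if r[attr] is not None and not low <= r[attr] <= high]

missing_age = null_rule(rows, "age")
duplicate_ids = uniqueness_rule(rows, "id")
skipped_ids = consecutive_rule(rows, "id")
bad_ages = range_rule(rows, "age", 0, 120)
```

Checks like these only detect discrepancies; deciding how to correct each flagged record remains part of the iterative, interactive process described above.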