Data Preprocessing - Data Cleaning


Uploaded by

tierSarge


Data Preprocessing

January 20, 2015

Data Mining: Concepts and Techniques

Data Preprocessing

Why preprocess the data?

Data cleaning

Data integration and transformation

Data reduction

Discretization and concept hierarchy generation

Summary


Why Data Preprocessing?

Data in the real world is dirty

    incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
        e.g., occupation=" "
    noisy: containing errors or outliers
        e.g., Salary="-10"
    inconsistent: containing discrepancies in codes or names
        e.g., Age="42" Birthday="03/07/1997"
        e.g., Was rating "1,2,3", now rating "A, B, C"
        e.g., discrepancy between duplicate records

Why Is Data Dirty?

Incomplete data may come from
    "Not applicable" data value when collected
    Different considerations between the time when the data was collected and when it is analyzed
    Human/hardware/software problems

Noisy data (incorrect values) may come from
    Faulty data collection instruments
    Human or computer error at data entry
    Errors in data transmission

Inconsistent data may come from
    Different data sources
    Functional dependency violation (e.g., modify some linked data)

Duplicate records also need data cleaning


Why Is Data Preprocessing Important?

No quality data, no quality mining results!
    Quality decisions must be based on quality data
        e.g., duplicate or missing data may cause incorrect or even misleading statistics
    A data warehouse needs consistent integration of quality data

Data extraction, cleaning, and transformation make up the majority of the work of building a data warehouse


Multi-Dimensional Measure of Data Quality

Measures for data quality: a multidimensional view
    Accuracy: correct or wrong, accurate or not
    Completeness: not recorded, unavailable, etc.
    Consistency: some modified but some not, dangling, etc.
    Timeliness: is the data updated in a timely way?
    Believability: how far can the data be trusted to be correct?
    Interpretability: how easily can the data be understood?

Major Tasks in Data Preprocessing

Data cleaning
    Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration
    Integration of multiple databases, data cubes, or files
Data reduction
    Dimensionality reduction
    Numerosity reduction
    Data compression
Data transformation and data discretization
    Normalization
    Concept hierarchy generation

Forms of Data Preprocessing (figure omitted)


Data Cleaning

Importance
    "Data cleaning is one of the three biggest problems in data warehousing" (Ralph Kimball)
    "Data cleaning is the number one problem in data warehousing" (DCI survey)

Data cleaning tasks

Fill in missing values

Identify outliers and smooth out noisy data

Correct inconsistent data

Resolve redundancy caused by data integration


Incomplete (Missing) Data

Data is not always available
    E.g., many tuples have no recorded value for several attributes, such as customer income in sales data

Missing data may be due to
    equipment malfunction
    data that was inconsistent with other recorded data and thus deleted
    data not entered due to misunderstanding
    certain data not being considered important at the time of entry
    history or changes of the data not being registered

Missing data may need to be inferred


How to Handle Missing Data?

Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the percentage of missing values per attribute varies considerably

Fill in the missing value manually: tedious and possibly infeasible

Fill it in automatically with
    a global constant, e.g., "unknown" (a new class?!)
    the attribute mean
    the attribute mean for all samples belonging to the same class: smarter
    the most probable value: inference-based, such as a Bayesian formula or a decision tree
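
The first three automatic strategies listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the class labels, the income figures, and the function names are all hypothetical and do not come from the slides. The inference-based option (Bayesian formula or decision tree) is not covered here.

```python
from statistics import mean

# Hypothetical toy records: (class label, income). None marks a missing value.
records = [
    ("low_risk", 50000),
    ("low_risk", None),
    ("low_risk", 70000),
    ("high_risk", 20000),
    ("high_risk", None),
    ("high_risk", 30000),
]


def fill_with_constant(rows, constant="unknown"):
    """Replace every missing value with one global constant."""
    return [(label, constant if value is None else value) for label, value in rows]


def fill_with_attribute_mean(rows):
    """Replace missing values with the mean of all observed values."""
    overall = mean(value for _, value in rows if value is not None)
    return [(label, overall if value is None else value) for label, value in rows]


def fill_with_class_mean(rows):
    """Replace missing values with the mean of the observed values of the same class.

    Assumes every class has at least one observed value.
    """
    observed = {}
    for label, value in rows:
        if value is not None:
            observed.setdefault(label, []).append(value)
    class_means = {label: mean(values) for label, values in observed.items()}
    return [
        (label, class_means[label] if value is None else value)
        for label, value in rows
    ]
```

On this toy data the attribute mean fills both gaps with 42500, while the class-conditional mean fills them with 60000 and 25000, which shows why the slide calls the per-class variant "smarter".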

Noisy Data

Noise: random error or variance in a measured variable

Incorrect attribute values may be due to
    faulty data collection instruments
    data entry problems
    data transmission problems
    technology limitations
    inconsistency in naming conventions


Simple Discretization Methods: Binning

Equal-width (distance) partitioning
    Divides the range into N intervals of equal size: uniform grid
    If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N
    The most straightforward, but outliers may dominate the presentation
    Skewed data is not handled well

Equal-depth (frequency) partitioning
    Divides the range into N intervals, each containing approximately the same number of samples
    Good data scaling
    Managing categorical attributes can be tricky
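
The two partitioning schemes can be compared with a minimal Python sketch. The function names are hypothetical, the handling of the maximum value (clamped into the last bin) is an assumption, and the sample values are the price list used in this deck's smoothing example.

```python
def equal_width_bins(values, n):
    """Split values into n intervals of equal width W = (B - A) / n.

    Assumes the values are not all identical (otherwise W would be 0).
    """
    low, high = min(values), max(values)
    width = (high - low) / n
    bins = [[] for _ in range(n)]
    for v in sorted(values):
        # The maximum value would land at index n, so clamp it into the last bin.
        index = min(int((v - low) / width), n - 1)
        bins[index].append(v)
    return bins


def equal_depth_bins(values, n):
    """Split sorted values into n bins holding roughly the same number of samples."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n)
    bins, start = [], 0
    for i in range(n):
        # The first `extra` bins take one more value when the count is not divisible by n.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_width_bins(prices, 3))  # [[4, 8, 9], [15, 21, 21], [24, 25, 26, 28, 29, 34]]
print(equal_depth_bins(prices, 3))  # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
```

With N = 3 the equal-width bins hold 3, 3, and 6 values, while the equal-depth bins hold 4 each, which illustrates how equal-width partitioning follows the value range rather than the data distribution.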


Binning Methods for Data Smoothing

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means (rounded to the nearest dollar):
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries (each value is replaced by the closer bin boundary):
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
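
The worked example above can be reproduced with a short Python sketch. The function names are hypothetical, and two details are assumptions rather than something the slides specify: bin means are rounded with Python's built-in `round`, and a value equally close to both boundaries goes to the lower one.

```python
def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean, rounded to the nearest integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = min(b), max(b)
        # Ties go to the lower boundary; this tie-breaking rule is an assumption.
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


# Equal-frequency bins from the price example.
bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]

print(smooth_by_means(bins))
# [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))
# [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```

The exact bin means are 9, 22.75, and 29.25, so the values 23 and 29 on the slide are rounded.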


Regression

(Figure omitted: data plotted against the y axis, with a marked value Y1 and the fitted line y = x + 1)
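
Smoothing by regression can be illustrated with an ordinary least-squares line fit. This is a minimal sketch, not the method used to produce the slide's figure: the sample points are hypothetical and were chosen so that the fitted line comes out as y = x + 1, and the function names are made up for the example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth_by_regression(xs, ys):
    """Replace each observed y by the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


# Hypothetical noisy samples scattered around the line y = x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1.1, 1.8, 3.2, 3.8, 5.1]

slope, intercept = fit_line(xs, ys)
smoothed = smooth_by_regression(xs, ys)
```

Each noisy observation is replaced by the point on the fitted line with the same x, which is the sense in which regression "smooths" the data.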


Cluster Analysis (figure omitted)


How to Handle Noisy Data?

Binning
    first sort the data and partition it into (equal-frequency) bins
    then smooth by bin means, bin medians, bin boundaries, etc.
Regression
    smooth by fitting the data to regression functions
Clustering
    detect and remove outliers
Combined computer and human inspection
    detect suspicious values automatically and have a human check them (e.g., to deal with possible outliers)


Problems
3.3 Suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in increasing order)
13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
i. Use smoothing by bin means and by bin boundaries to smooth the data, using a bin depth of 3. Illustrate your steps.
ii. How might you determine the outliers?
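
One possible way to approach part ii, offered only as a sketch, is the common interquartile-range (IQR) rule: flag values that lie more than 1.5 × IQR outside the quartiles and pass them to a human for inspection. The slides do not prescribe this rule; clustering, which they do mention, is another option. The function name and the factor k = 1.5 are assumptions of this example.

```python
from statistics import quantiles

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points (default method)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


print(iqr_outliers(ages))  # [70]
```

With the default quartile method the quartiles are 20 and 35, so the fences are -2.5 and 57.5 and only the age 70 is flagged as a candidate outlier.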


Data Cleaning as a Process

Data discrepancy detection
    Use metadata (e.g., domain, range, dependency, distribution)
    Check field overloading
    Check the uniqueness rule, consecutive rule, and null rule
    Use commercial tools
        Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
        Data auditing: analyze the data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)

Data migration and integration
    Data migration tools: allow transformations to be specified
    ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface

Integration of the two processes
    Iterative and interactive (e.g., Potter's Wheel)
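
The uniqueness, consecutive, and null rules named under discrepancy detection can be checked mechanically. The slide does not define them, so this sketch assumes the usual readings: a unique rule requires every value of an attribute to be distinct, a consecutive rule additionally allows no gaps between the lowest and highest value, and a null rule lists the markers that stand for a missing value. The marker set, function names, and sample data are all hypothetical.

```python
# Hypothetical markers treated as "null" for this sketch.
NULL_MARKERS = {None, "", "?", "N/A"}


def unique_rule_violations(values):
    """Return the values that occur more than once."""
    seen, duplicates = set(), set()
    for v in values:
        if v in seen:
            duplicates.add(v)
        seen.add(v)
    return sorted(duplicates)


def consecutive_rule_violations(values):
    """Return the integers missing between the lowest and highest value."""
    present = set(values)
    return [v for v in range(min(present), max(present) + 1) if v not in present]


def null_rule_violations(values):
    """Return the positions whose value is one of the null markers."""
    return [i for i, v in enumerate(values) if v in NULL_MARKERS]


check_numbers = [101, 102, 104, 104, 106]
occupations = ["clerk", "", "?", "engineer", None]

print(unique_rule_violations(check_numbers))       # [104]
print(consecutive_rule_violations(check_numbers))  # [103, 105]
print(null_rule_violations(occupations))           # [1, 2, 4]
```

Each function only reports violations; deciding how to correct them is left to the scrubbing and auditing steps described above.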