
Data Mining:

Concepts and Techniques


(3rd ed.)

— Chapter 3 —

Chapter 3: Data Preprocessing

■ Data Preprocessing: An Overview


■ Data Quality
■ Major Tasks in Data Preprocessing
■ Data Cleaning
■ Data Integration

■ Data Reduction
■ Data Transformation and Data Discretization
■ Summary



What went wrong?
Imagine that you are a manager at AllElectronics and have been
charged with analyzing the company’s data with respect to your
branch’s sales. You immediately set out to perform this task. You
carefully inspect the company’s database and data warehouse,
identifying and selecting the attributes or dimensions (e.g., item,
price, and units sold) to be included in your analysis. Alas! You
notice that several of the attributes for various tuples have no
recorded value. For your analysis, you would like to include
information as to whether each item purchased was advertised as on
sale, yet you discover that this information has not been recorded.
Furthermore, users of your database system have reported errors,
unusual values, and inconsistencies in the data recorded for some
transactions.



Data Quality: Why Preprocess the Data?
■ Measures for data quality: A multidimensional view
■ Accuracy: correct or wrong, accurate or not
■ Completeness: not recorded, unavailable, …
■ Consistency: some modified but some not, dangling, …
■ Timeliness: timely update?
■ Believability: how much are the data trusted to be correct?
■ Interpretability: how easily can the data be understood?



Major Tasks in Data Preprocessing
■ Data cleaning
■ Fill in missing values, smooth noisy
data, identify or remove outliers, and
resolve inconsistencies
■ Data integration
■ Integration of multiple databases, data
cubes, or files
■ Data reduction
■ Dimensionality reduction
■ Numerosity reduction
■ Data compression
■ Data transformation and data
discretization
■ Normalization
■ Concept hierarchy generation

Figure 1. Forms of data preprocessing



Chapter 3: Data Preprocessing

■ Data Preprocessing: An Overview


■ Data Quality
■ Major Tasks in Data Preprocessing
■ Data Cleaning
■ Data Integration

■ Data Reduction
■ Data Transformation and Data Discretization
■ Summary



Data Cleaning
■ Data in the Real World Is Dirty: lots of potentially incorrect data, e.g.,
faulty instruments, human or computer error, transmission errors
■ incomplete: lacking attribute values, lacking certain attributes of

interest, or containing only aggregate data


■ e.g., Occupation = “ ” (missing data)

■ noisy: containing noise, errors, or outliers

■ e.g., Salary = “−10” (an error)

■ inconsistent: containing discrepancies in codes or names, e.g.,

■ Age = “42”, Birthday = “03/07/2010”

■ Was rating “1, 2, 3”, now rating “A, B, C”

■ discrepancy between duplicate records

■ Intentional (e.g., disguised missing data)

■ Jan. 1 as everyone’s birthday?



Incomplete (Missing) Data

■ Data is not always available


■ E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
■ Missing data may be due to
■ equipment malfunction
■ inconsistent with other recorded data and thus deleted
■ data not entered due to misunderstanding
■ certain data may not be considered important at the time
of entry
■ history or changes of the data were not registered
■ Missing data may need to be inferred
How to Handle Missing Data?
■ Ignore the tuple: usually done when class label is missing (when
doing classification)—not effective when the % of missing values
per attribute varies considerably
■ Fill in the missing value manually: tedious + infeasible?
■ Fill it in automatically with
■ a global constant: e.g., “unknown”, a new class?!
■ the attribute mean
■ the attribute mean for all samples belonging to the same
class: smarter
■ the most probable value: inference-based such as Bayesian
formula or decision tree
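
A minimal pandas sketch of the three automatic strategies; the income
attribute, class label, and -1 sentinel are illustrative assumptions, not
part of the original slides:

    import pandas as pd

    df = pd.DataFrame({
        "income": [30000.0, None, 52000.0, None, 41000.0],
        "class":  ["low", "low", "high", "high", "low"],
    })

    # Global constant: a sentinel marks the hole instead of guessing.
    filled_constant = df["income"].fillna(-1)

    # Attribute mean over all tuples.
    filled_mean = df["income"].fillna(df["income"].mean())

    # Smarter: the mean within the tuple's own class.
    filled_class_mean = df.groupby("class")["income"].transform(
        lambda s: s.fillna(s.mean())
    )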
Noisy Data
■ Noise: random error or variance in a measured variable
■ Incorrect attribute values may be due to
■ faulty data collection instruments

■ data entry problems

■ data transmission problems

■ technology limitation

■ inconsistency in naming conventions

■ Other data problems which require data cleaning


■ duplicate records

■ incomplete data

■ inconsistent data

How to Handle Noisy Data?

■ Binning
■ first sort data and partition into (equal-frequency) bins

■ then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc. (see the sketch below)


■ Regression
■ smooth by fitting the data into regression functions

■ Clustering
■ detect and remove outliers

■ Combined computer and human inspection


■ detect suspicious values and check by human (e.g., deal

with possible outliers)
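
As a sketch of binning, here is the classic sorted-price example
(4, 8, 15, 21, 21, 24, 25, 28, 34) partitioned into three equal-frequency
bins and smoothed two ways; NumPy is assumed:

    import numpy as np

    prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])   # already sorted
    bins = np.array_split(prices, 3)                        # equal-frequency bins

    # Smooth by bin means: every value becomes its bin's mean.
    by_means = np.concatenate([np.full(len(b), b.mean()) for b in bins])

    # Smooth by bin boundaries: snap each value to the nearer bin edge.
    by_bounds = np.concatenate(
        [np.where(b - b.min() < b.max() - b, b.min(), b.max()) for b in bins]
    )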

Data Cleaning as a Process
■ Data discrepancy detection
■ Use metadata (e.g., domain, range, dependency, distribution)
■ Check field overloading
■ Check the uniqueness rule, consecutive rule, and null rule (see the sketch below)
■ Use commercial tools
■ Data scrubbing: use simple domain knowledge (e.g., postal code,

spell-check) to detect errors and make corrections


■ Data auditing: by analyzing data to discover rules and relationship to

detect violators (e.g., correlation and clustering to find outliers)


■ Data migration and integration
■ Data migration tools: allow transformations to be specified
■ ETL (Extraction/Transformation/Loading) tools: allow users to specify
transformations through a graphical user interface
■ Integration of the two processes
■ Iterative and interactive (e.g., Potter’s Wheel)
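
A small pandas sketch of rule-based discrepancy detection; the table,
column names, and rules are hypothetical:

    import pandas as pd

    tx = pd.DataFrame({
        "tx_id":  [101, 102, 102, 104],        # should be unique and consecutive
        "amount": [59.0, None, 12.5, 80.0],    # null rule: must not be missing
    })

    # Uniqueness rule: flag every tx_id that appears more than once.
    dup_rows = tx[tx["tx_id"].duplicated(keep=False)]

    # Null rule: flag tuples missing a mandatory attribute.
    null_rows = tx[tx["amount"].isna()]

    # Consecutive rule: report gaps between the lowest and highest id.
    missing_ids = set(range(tx["tx_id"].min(), tx["tx_id"].max() + 1)) - set(tx["tx_id"])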
Chapter 3: Data Preprocessing

■ Data Preprocessing: An Overview


■ Data Quality
■ Major Tasks in Data Preprocessing
■ Data Cleaning
■ Data Integration

■ Data Reduction
■ Data Transformation and Data Discretization
■ Summary

Data Integration
■ Data integration:
■ Combines data from multiple sources into a coherent store
■ Schema integration: e.g., A.cust-id ≡ B.cust-#
■ Integrate metadata from different sources
■ Entity identification problem:
■ Identify real world entities from multiple data sources, e.g., Bill Clinton =
William Clinton (see the alias sketch below)
■ Detecting and resolving data value conflicts
■ For the same real world entity, attribute values from different sources are
different
■ Possible reasons: different representations, different scales, e.g., metric
vs. British units
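
One lightweight way to attack the entity identification problem is a curated
alias table; this sketch and every name in it are purely illustrative:

    # Hypothetical alias table mapping source spellings to one canonical name.
    ALIASES = {
        "bill clinton":    "William Clinton",
        "william clinton": "William Clinton",
        "w. clinton":      "William Clinton",
    }

    def canonical(name: str) -> str:
        key = name.strip().lower()
        return ALIASES.get(key, name)

    assert canonical("Bill Clinton") == canonical("WILLIAM CLINTON")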
Handling Redundancy in Data Integration

■ Redundant data often occur when integrating multiple
databases
■ Object identification: The same attribute or object may
have different names in different databases
■ Derivable data: One attribute may be a “derived” attribute
in another table, e.g., annual revenue
■ Redundant attributes may be detected by correlation analysis
and covariance analysis (see the sketch below)
■ Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve
mining speed and quality
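
A sketch of redundancy detection by correlation analysis with pandas; the
attributes and the 0.95 cut-off are illustrative assumptions:

    import pandas as pd

    df = pd.DataFrame({
        "units_sold":      [10, 14, 9, 20, 17],
        "monthly_revenue": [250, 360, 230, 510, 420],
        "annual_revenue":  [3000, 4320, 2760, 6120, 5040],
    })

    # Pearson correlations near +/-1 suggest one attribute derives from another.
    corr = df.corr()
    redundant = [
        (a, b)
        for i, a in enumerate(df.columns)
        for b in df.columns[i + 1:]
        if abs(corr.loc[a, b]) > 0.95
    ]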
Histogram Analysis
■ Divide data into buckets and
store average (sum) for each
bucket
■ Partitioning rules:
■ Equal-width: equal bucket
range
■ Equal-frequency (or
equal-depth)
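
Both partitioning rules in a few lines of NumPy; the data and the bucket
count are made up:

    import numpy as np

    prices = np.array([1, 1, 5, 5, 5, 8, 8, 10, 10, 12, 14, 15, 18, 20, 21, 25, 28, 30])

    # Equal-width: every bucket spans the same value range.
    counts_ew, edges_ew = np.histogram(prices, bins=3)

    # Equal-frequency (equal-depth): edges at quantiles, so each bucket
    # holds roughly the same number of values.
    edges_ef = np.quantile(prices, np.linspace(0, 1, 4))
    counts_ef, _ = np.histogram(prices, bins=edges_ef)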

Clustering
■ Partition data set into clusters based on similarity, and store
cluster representation (e.g., centroid and diameter) only
■ Can be very effective if data is clustered but not if data is
“smeared”
■ Can have hierarchical clustering and be stored in
multi-dimensional index tree structures
■ There are many choices of clustering definitions and
clustering algorithms
■ Cluster analysis will be studied in depth in Chapter 10
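
A sketch of numerosity reduction by clustering, assuming scikit-learn is
available; only a (centroid, diameter) pair is kept per cluster, with the
diameter roughly approximated as twice the largest centroid distance:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    points = np.vstack([
        rng.normal(loc=c, scale=0.5, size=(100, 2))
        for c in ([0, 0], [5, 5], [0, 5])
    ])

    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)

    # Store only the compact representation, not the 300 raw points.
    summary = [
        (centre, 2 * np.linalg.norm(points[km.labels_ == i] - centre, axis=1).max())
        for i, centre in enumerate(km.cluster_centers_)
    ]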

Sampling

■ Sampling: obtaining a small sample s to represent the whole


data set N
■ Allow a mining algorithm to run in complexity that is potentially
sub-linear to the size of the data
■ Key principle: Choose a representative subset of the data
■ Simple random sampling may have very poor performance in
the presence of skew
■ Develop adaptive sampling methods, e.g., stratified
sampling
■ Note: Sampling may not reduce database I/Os (page at a time)

Types of Sampling

■ Simple random sampling


■ There is an equal probability of selecting any particular item

■ Sampling without replacement


■ Once an object is selected, it is removed from the population

■ Sampling with replacement


■ A selected object is not removed from the population

■ Stratified sampling:
■ Partition the data set, and draw samples from each partition

(proportionally, i.e., approximately the same percentage of


the data)
■ Used in conjunction with skewed data
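
All three schemes in pandas; the 80/20 strata and the 10% sampling rate
are illustrative:

    import pandas as pd

    df = pd.DataFrame({
        "stratum": ["youth"] * 80 + ["senior"] * 20,
        "value":   list(range(100)),
    })

    srswor = df.sample(n=10, replace=False, random_state=1)  # without replacement
    srswr  = df.sample(n=10, replace=True,  random_state=1)  # with replacement

    # Stratified: ~10% from each partition, preserving the skewed mix.
    stratified = df.groupby("stratum", group_keys=False).apply(
        lambda g: g.sample(frac=0.1, random_state=1)
    )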

Sampling: With or without Replacement

Figure: simple random sampling without replacement (SRSWOR) and
with replacement (SRSWR), both drawn from the raw data
Sampling: Cluster or Stratified Sampling

Figure: raw data on the left; a cluster/stratified sample on the right
Data Cube Aggregation

■ The lowest level of a data cube (base cuboid)


■ The aggregated data for an individual entity of interest
■ E.g., a customer in a phone calling data warehouse
■ Multiple levels of aggregation in data cubes
■ Further reduce the size of data to deal with
■ Reference appropriate levels
■ Use the smallest representation which is enough to solve the
task
■ Queries regarding aggregated information should be answered
using the data cube, when possible
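
A groupby sketch of climbing one level of aggregation; the sales table is
hypothetical:

    import pandas as pd

    sales = pd.DataFrame({
        "year":    [2022, 2022, 2023, 2023, 2023],
        "quarter": ["Q1", "Q2", "Q1", "Q1", "Q2"],
        "amount":  [200.0, 350.0, 180.0, 220.0, 400.0],
    })

    # Base cuboid: totals per (year, quarter).
    by_quarter = sales.groupby(["year", "quarter"], as_index=False)["amount"].sum()

    # Higher-level cuboid: yearly totals, the smallest representation
    # that still answers yearly queries.
    by_year = by_quarter.groupby("year", as_index=False)["amount"].sum()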

Data Reduction 3: Data Compression
■ String compression
■ There are extensive theories and well-tuned algorithms

■ Typically lossless, but only limited manipulation is possible

without expansion
■ Audio/video compression
■ Typically lossy compression, with progressive refinement

■ Sometimes small fragments of signal can be reconstructed

without reconstructing the whole


■ Time sequences are not audio
■ Typically short and vary slowly with time

■ Dimensionality and numerosity reduction may also be


considered as forms of data compression
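
A minimal lossless example with Python's standard zlib module; the
repetitive byte string is contrived to compress well:

    import zlib

    text = b"AAAABBBCCDAA" * 1000              # highly repetitive string data

    compressed = zlib.compress(text, level=9)  # lossless
    restored = zlib.decompress(compressed)

    assert restored == text                    # nothing was lost
    print(len(text), "->", len(compressed), "bytes")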
Data Compression

Figure: lossless compression turns the original data into compressed
data that can be restored exactly; lossy compression recovers only an
approximation of the original data
Automatic Concept Hierarchy Generation
■ Some hierarchies can be automatically generated based on
the analysis of the number of distinct values per attribute in
the data set
■ The attribute with the most distinct values is placed at
the lowest level of the hierarchy
■ Exceptions, e.g., weekday, month, quarter, year

country: 15 distinct values
province_or_state: 365 distinct values
city: 3,567 distinct values
street: 674,339 distinct values
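
A sketch of the heuristic: count distinct values per attribute and sort
ascending; the six-row table is made up for illustration:

    import pandas as pd

    addr = pd.DataFrame({
        "country":           ["CA", "CA", "CA", "US", "US", "US"],
        "province_or_state": ["BC", "BC", "ON", "NY", "NY", "NY"],
        "city":              ["Vancouver", "Victoria", "Toronto",
                              "NYC", "NYC", "Albany"],
        "street":            ["Main St", "Oak Ave", "King St",
                              "5th Ave", "Broadway", "Elm St"],
    })

    # Fewest distinct values -> top of the hierarchy; most -> bottom.
    order = addr.nunique().sort_values(kind="stable")
    hierarchy = list(order.index)   # country < province_or_state < city < street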
Chapter 3: Data Preprocessing

■ Data Preprocessing: An Overview


■ Data Quality
■ Major Tasks in Data Preprocessing
■ Data Cleaning
■ Data Integration

■ Data Reduction
■ Data Transformation and Data Discretization
■ Summary

Summary
■ Data quality: accuracy, completeness, consistency, timeliness,
believability, interpretability
■ Data cleaning: e.g., missing/noisy values, outliers
■ Data integration from multiple sources:
■ Entity identification problem; Remove redundancies; Detect
inconsistencies
■ Data reduction
■ Dimensionality reduction; Numerosity reduction; Data
compression
■ Data transformation and data discretization
■ Normalization; Concept hierarchy generation
