Data Preprocessing

Data can be collected as objects with attributes and values. Attributes describe objects and can be continuous or categorical. Real-world data is often dirty, with missing values, noise, and inconsistencies. Common techniques for handling dirty data include imputation to fill in missing values, smoothing to remove noise, binning to discretize continuous attributes, outlier detection and removal, and data transformations such as normalization. Sampling is also used to select a representative subset of the data for analysis when the entire data set is too large.


What is Data?

• Collection of data objects and their attributes

• An attribute is a property or characteristic of an object
  – Examples: eye color of a person, temperature, etc.
  – Attribute is also known as variable, field, characteristic, or feature

• A collection of attributes describes an object
  – Object is also known as record, point, case, sample, entity, or instance

Example data set (rows are objects, columns are attributes):

Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes
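For concreteness, the objects-and-attributes view above can be held in a table structure in code. The following is a minimal sketch using a pandas DataFrame (the library choice and column names are assumptions, not part of the slides); each row is one object and each column one attribute.

import pandas as pd

# Each row is one data object (record); each column is one attribute.
records = pd.DataFrame({
    "Tid":           [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Refund":        ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "MaritalStatus": ["Single", "Married", "Single", "Married", "Divorced",
                      "Married", "Divorced", "Single", "Married", "Single"],
    "TaxableIncome": [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],  # in thousands
    "Cheat":         ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})

# Continuous vs. categorical attributes show up as numeric vs. object columns.
print(records.dtypes)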
Attribute Values

• Attribute values are numbers or symbols assigned to an attribute

• Distinction between attributes and attribute values
  – Same attribute can be mapped to different attribute values
    • Example: height can be measured in feet or meters
  – Different attributes can be mapped to the same set of values
    • Example: attribute values for ID and age are integers
    • But properties of attribute values can be different
      – ID has no limit, but age has a maximum and minimum value
Why Data Preprocessing?
• Data in the real world is dirty
– incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or
names
• No quality data, no quality mining results!
How to Handle Missing Data?

• Ignore the tuple: usually done when the class label is missing
  (assuming the task is classification); not effective in certain cases

• Fill in the missing value manually: tedious + infeasible?

• Use a global constant to fill in the missing value: e.g.,
  "unknown", a new class?!

• Use the attribute mean to fill in the missing value

• Use the attribute mean for all samples of the same class
  to fill in the missing value: smarter (see the sketch after this list)

• Use the most probable value to fill in the missing value:
  inference-based, such as regression, Bayesian formula, decision tree
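As a minimal sketch of mean-based imputation (both the overall attribute mean and the per-class mean), the following uses pandas; the column names and toy values are illustrative assumptions, not data from the slides.

import pandas as pd

# Toy data with missing Income values (illustrative, not from the slides).
df = pd.DataFrame({
    "Class":  ["A", "A", "B", "B", "B"],
    "Income": [50.0, None, 70.0, None, 90.0],
})

# Fill with the overall attribute mean.
df["Income_mean"] = df["Income"].fillna(df["Income"].mean())

# Smarter: fill with the mean of samples belonging to the same class.
df["Income_class_mean"] = df["Income"].fillna(
    df.groupby("Class")["Income"].transform("mean")
)
print(df)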
How to Handle Noisy Data?
• Binning method:
– first sort data and partition into (equi-depth) bins
– then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
– used also for discretization (discussed later)
• Clustering
– detect and remove outliers
• Semi-automated method: combined computer and
human inspection
– detect suspicious values and check manually
• Regression
– smooth by fitting the data into regression functions
Data Smoothing

• Data smoothing is performed by applying an algorithm that removes noise from the given data set.
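The slide does not prescribe a particular algorithm; as one simple illustration, here is a sketch of a 3-point moving average applied to the price values used in the binning example below (the window size is an arbitrary assumption).

import numpy as np

# A 3-point moving average as one simple smoothing algorithm.
noisy = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float)
window = 3
smoothed = np.convolve(noisy, np.ones(window) / window, mode="valid")
print(smoothed)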
Outliers

• Outliers are data objects with characteristics that are considerably different from those of most of the other data objects in the data set
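The noisy-data slide suggests clustering for outlier detection; as a simpler illustration, here is a minimal sketch that flags values far from the mean using a z-score rule (the threshold of 3, and the sample values of the price list plus one extreme point, are assumptions).

import numpy as np

# Price values from the binning example, plus one artificial extreme point.
values = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34, 250], dtype=float)

# Flag points more than 3 standard deviations away from the mean.
z_scores = (values - values.mean()) / values.std()
outliers = values[np.abs(z_scores) > 3]
print(outliers)  # [250.]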
Binning Methods for Data Smoothing

* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
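The following is a minimal sketch in Python that reproduces the numbers above; rounding the bin means to whole dollars is an assumption made to match the slide.

# Equi-depth binning and smoothing on the price example above.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

n_bins = 3
depth = len(prices) // n_bins  # equi-depth: 4 values per bin
bins = [prices[i * depth:(i + 1) * depth] for i in range(n_bins)]

# Smoothing by bin means: replace every value with its bin's (rounded) mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: replace every value with the closer boundary.
by_boundaries = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

print(by_means)       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_boundaries)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]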
Duplicate Data

• Data set may include data objects that are duplicates, or almost duplicates, of one another
  – Major issue when merging data from heterogeneous sources

• Examples:
  – Same person with multiple email addresses

• Data cleaning
  – Process of dealing with duplicate data issues
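As part of data cleaning, exact and near duplicates can be dropped. Here is a minimal sketch using pandas; treating the name column as the matching key is a simplifying assumption, and the records are illustrative.

import pandas as pd

# Records merged from two hypothetical sources; the same person appears twice
# under slightly different email addresses (illustrative data).
people = pd.DataFrame({
    "name":  ["Ann Lee", "Ann Lee", "Bob Tran"],
    "email": ["ann@mail.com", "ann.lee@mail.com", "bob@mail.com"],
})

# Exact duplicates: drop rows that repeat on every column.
exact_dedup = people.drop_duplicates()

# Near duplicates: match on a key (here the name) and keep the first record.
near_dedup = people.drop_duplicates(subset=["name"], keep="first")
print(near_dedup)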
Data Transformation: Normalization

Particularly useful for classification (neural networks, distance measurements, nearest-neighbor classification, etc.)

• min-max normalization:
  v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A

• z-score normalization:
  v' = (v - mean_A) / stand_dev_A

• normalization by decimal scaling:
  v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
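The three normalizations above can be written directly from the formulas. The following minimal sketch uses NumPy and reuses the Taxable Income column from the earlier table as sample input; it assumes a non-empty array with a positive maximum absolute value.

import numpy as np

def min_max(v, new_min=0.0, new_max=1.0):
    # Rescale values linearly into [new_min, new_max].
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

def z_score(v):
    # Center on the mean and scale by the standard deviation.
    return (v - v.mean()) / v.std()

def decimal_scaling(v):
    # Divide by 10^j for the smallest j that brings every |v'| below 1.
    j = int(np.floor(np.log10(np.abs(v).max()))) + 1
    return v / (10 ** j)

incomes = np.array([125, 100, 70, 120, 95, 60, 220, 85, 75, 90], dtype=float)
print(min_max(incomes))
print(z_score(incomes))
print(decimal_scaling(incomes))  # j = 3, so every value ends up in (-1, 1)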
Discretization/Quantization
• Three types of attributes:
– Nominal — values from an unordered set
– Ordinal — values from an ordered set
– Continuous — real numbers
• Discretization/Quantization:
  – divide the range of a continuous attribute into intervals
    [diagram: a continuous range of values split into intervals by cut points]
  – Some classification algorithms only accept categorical attributes.
  – Reduce data size by discretization
  – Prepare for further analysis
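As one concrete scheme, the following minimal sketch discretizes the price values from the binning slide into three equal-width intervals using NumPy; equal-width cut points are an assumption here (the earlier binning example used equi-depth bins instead).

import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float)

n_intervals = 3
edges = np.linspace(prices.min(), prices.max(), n_intervals + 1)  # cut points
labels = np.digitize(prices, edges[1:-1])  # interval index (0..2) for each value

print(edges)   # [ 4. 14. 24. 34.]
print(labels)  # [0 0 0 1 1 1 2 2 2 2 2 2]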
Sampling

• Sampling is the main technique employed for data selection.
  – It is often used for both the preliminary investigation of the data and the final data analysis.

• Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.

• Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.
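A minimal sketch of simple random sampling without replacement, assuming the data set is held in a pandas DataFrame; the column, the 10% fraction, and the seed are arbitrary assumptions.

import pandas as pd

# Stand-in data set; in practice this would be the full data of interest.
df = pd.DataFrame({"income": range(1000)})

# Keep a representative 10% subset for analysis.
sample = df.sample(frac=0.1, random_state=42)
print(len(sample))  # 100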
