Data Preprocessing

Data can be collected as objects with attributes and values. Attributes describe objects and can be continuous or categorical. Real-world data is often dirty, with missing values, noise, and inconsistencies. Common techniques for handling dirty data include imputation to fill in missing values, smoothing to remove noise, binning to discretize continuous attributes, outlier detection and removal, and data transformations such as normalization. Sampling is also used to select a representative subset of the data for analysis when the entire data set is too large.


What is Data?

• Collection of data objects and their attributes

• An attribute is a property or characteristic of an object
  – Examples: eye color of a person, temperature, etc.
  – Attribute is also known as variable, field, characteristic, or feature

• A collection of attributes describes an object
  – Object is also known as record, point, case, sample, entity, or instance

Example data set (rows are objects, columns are attributes):

Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes
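For concreteness, the objects-and-attributes view above can be held in a table structure in code. The following is a minimal sketch using a pandas DataFrame (the library choice and column names are assumptions, not part of the slides); each row is one object and each column one attribute.

import pandas as pd

# Each row is one data object (record); each column is one attribute.
records = pd.DataFrame({
    "Tid":           [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Refund":        ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "MaritalStatus": ["Single", "Married", "Single", "Married", "Divorced",
                      "Married", "Divorced", "Single", "Married", "Single"],
    "TaxableIncome": [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],  # in thousands
    "Cheat":         ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})

# Continuous vs. categorical attributes show up as numeric vs. object columns.
print(records.dtypes)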
Attribute Values

• Attribute values are numbers or symbols assigned to an attribute

• Distinction between attributes and attribute values
  – Same attribute can be mapped to different attribute values
    • Example: height can be measured in feet or meters
  – Different attributes can be mapped to the same set of values
    • Example: attribute values for ID and age are integers
    • But properties of attribute values can be different
      – ID has no limit, but age has a maximum and minimum value
Why Data Preprocessing?
• Data in the real world is dirty
– incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or
names
• No quality data, no quality mining results!
How to Handle Missing Data?

• Ignore the tuple: usually done when the class label is missing
  (assuming the task is classification); not effective in certain cases

• Fill in the missing value manually: tedious + infeasible?

• Use a global constant to fill in the missing value: e.g.,
  "unknown", a new class?!

• Use the attribute mean to fill in the missing value

• Use the attribute mean for all samples of the same class
  to fill in the missing value: smarter (see the sketch after this list)

• Use the most probable value to fill in the missing value:
  inference-based, such as regression, Bayesian formula, decision tree
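As a minimal sketch of mean-based imputation (both the overall attribute mean and the per-class mean), the following uses pandas; the column names and toy values are illustrative assumptions, not data from the slides.

import pandas as pd

# Toy data with missing Income values (illustrative, not from the slides).
df = pd.DataFrame({
    "Class":  ["A", "A", "B", "B", "B"],
    "Income": [50.0, None, 70.0, None, 90.0],
})

# Fill with the overall attribute mean.
df["Income_mean"] = df["Income"].fillna(df["Income"].mean())

# Smarter: fill with the mean of samples belonging to the same class.
df["Income_class_mean"] = df["Income"].fillna(
    df.groupby("Class")["Income"].transform("mean")
)
print(df)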
How to Handle Noisy Data?
• Binning method:
– first sort data and partition into (equi-depth) bins
– then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
– used also for discretization (discussed later)
• Clustering
– detect and remove outliers
• Semi-automated method: combined computer and
human inspection
– detect suspicious values and check manually
• Regression
– smooth by fitting the data into regression functions
Data Smoothing

• Data smoothing is performed by applying an algorithm that removes noise from the given data set.
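The slide does not prescribe a particular algorithm; as one simple illustration, here is a sketch of a 3-point moving average applied to the price values used in the binning example below (the window size is an arbitrary assumption).

import numpy as np

# A 3-point moving average as one simple smoothing algorithm.
noisy = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float)
window = 3
smoothed = np.convolve(noisy, np.ones(window) / window, mode="valid")
print(smoothed)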
Outliers

• Outliers are data objects with characteristics that are considerably different from those of most of the other data objects in the data set
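The noisy-data slide suggests clustering for outlier detection; as a simpler illustration, here is a minimal sketch that flags values far from the mean using a z-score rule (the threshold of 3, and the sample values of the price list plus one extreme point, are assumptions).

import numpy as np

# Price values from the binning example, plus one artificial extreme point.
values = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34, 250], dtype=float)

# Flag points more than 3 standard deviations away from the mean.
z_scores = (values - values.mean()) / values.std()
outliers = values[np.abs(z_scores) > 3]
print(outliers)  # [250.]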
Binning Methods for Data Smoothing

* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
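The following is a minimal sketch in Python that reproduces the numbers above; rounding the bin means to whole dollars is an assumption made to match the slide.

# Equi-depth binning and smoothing on the price example above.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

n_bins = 3
depth = len(prices) // n_bins  # equi-depth: 4 values per bin
bins = [prices[i * depth:(i + 1) * depth] for i in range(n_bins)]

# Smoothing by bin means: replace every value with its bin's (rounded) mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: replace every value with the closer boundary.
by_boundaries = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

print(by_means)       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_boundaries)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]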
Duplicate Data

• Data set may include data objects that are duplicates, or almost duplicates, of one another
  – Major issue when merging data from heterogeneous sources

• Examples:
  – Same person with multiple email addresses

• Data cleaning
  – Process of dealing with duplicate data issues
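As part of data cleaning, exact and near duplicates can be dropped. Here is a minimal sketch using pandas; treating the name column as the matching key is a simplifying assumption, and the records are illustrative.

import pandas as pd

# Records merged from two hypothetical sources; the same person appears twice
# under slightly different email addresses (illustrative data).
people = pd.DataFrame({
    "name":  ["Ann Lee", "Ann Lee", "Bob Tran"],
    "email": ["ann@mail.com", "ann.lee@mail.com", "bob@mail.com"],
})

# Exact duplicates: drop rows that repeat on every column.
exact_dedup = people.drop_duplicates()

# Near duplicates: match on a key (here the name) and keep the first record.
near_dedup = people.drop_duplicates(subset=["name"], keep="first")
print(near_dedup)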
Data Transformation: Normalization

Particularly useful for classification (neural networks, distance measurements, nearest-neighbor classification, etc.)

• min-max normalization:
  v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A

• z-score normalization:
  v' = (v - mean_A) / stand_dev_A

• normalization by decimal scaling:
  v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
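The three normalizations above can be written directly from the formulas. The following minimal sketch uses NumPy and reuses the Taxable Income column from the earlier table as sample input; it assumes a non-empty array with a positive maximum absolute value.

import numpy as np

def min_max(v, new_min=0.0, new_max=1.0):
    # Rescale values linearly into [new_min, new_max].
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

def z_score(v):
    # Center on the mean and scale by the standard deviation.
    return (v - v.mean()) / v.std()

def decimal_scaling(v):
    # Divide by 10^j for the smallest j that brings every |v'| below 1.
    j = int(np.floor(np.log10(np.abs(v).max()))) + 1
    return v / (10 ** j)

incomes = np.array([125, 100, 70, 120, 95, 60, 220, 85, 75, 90], dtype=float)
print(min_max(incomes))
print(z_score(incomes))
print(decimal_scaling(incomes))  # j = 3, so every value ends up in (-1, 1)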
Discretization/Quantization
• Three types of attributes:
– Nominal — values from an unordered set
– Ordinal — values from an ordered set
– Continuous — real numbers
• Discretization/Quantization:
  – divide the range of a continuous attribute into intervals
    [diagram: a continuous range of values split into intervals by cut points]
  – Some classification algorithms only accept categorical attributes.
  – Reduce data size by discretization
  – Prepare for further analysis
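As one concrete scheme, the following minimal sketch discretizes the price values from the binning slide into three equal-width intervals using NumPy; equal-width cut points are an assumption here (the earlier binning example used equi-depth bins instead).

import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float)

n_intervals = 3
edges = np.linspace(prices.min(), prices.max(), n_intervals + 1)  # cut points
labels = np.digitize(prices, edges[1:-1])  # interval index (0..2) for each value

print(edges)   # [ 4. 14. 24. 34.]
print(labels)  # [0 0 0 1 1 1 2 2 2 2 2 2]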
Sampling

• Sampling is the main technique employed for data selection.
  – It is often used for both the preliminary investigation of the data and the final data analysis.

• Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.

• Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.
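A minimal sketch of simple random sampling without replacement, assuming the data set is held in a pandas DataFrame; the column, the 10% fraction, and the seed are arbitrary assumptions.

import pandas as pd

# Stand-in data set; in practice this would be the full data of interest.
df = pd.DataFrame({"income": range(1000)})

# Keep a representative 10% subset for analysis.
sample = df.sample(frac=0.1, random_state=42)
print(len(sample))  # 100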
