Data Cleaning Data Transformation Data Reduction Discretization and Generating Concept Hierarchies
DATA PREPROCESSING
WEEK-3
CSE(5317) 4TH YEAR SEMESTER-1(ELECTIVE)
Data cleaning
Data transformation
Data reduction
Discretization and generating concept hierarchies
DATA MINING
STAGES
DATA PREPROCESSING
HOW DATA MINING WORKS?
TYPES OF DATASETS
TYPES OF DATASETS CONTD..
NEED FOR DATA PREPROCESSING (WHY DATA PREPROCESSING?)
(Data in Real World is Dirty!)
Incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data.
e.g., occupation=“ ”
Noisy: containing unwanted information, errors, or outliers.
e.g., salary=“-10”
Inconsistent: containing discrepancies (mismatches) in codes
or names.
e.g., Age=“42” Birthday=“03/07/1997”
e.g., was rating “1,2,3”, now rating “A,B,C”
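The three kinds of dirty data above can be checked for programmatically. The following is a minimal sketch, not a complete cleaning routine. The function name find_problems, the record layout, the reference date, and the reading of “03/07/1997” as 3 July 1997 are all assumptions made for illustration.

```python
from datetime import date

def find_problems(record, today):
    """Return a list of data-quality problems found in one record (hypothetical helper)."""
    problems = []
    # Incomplete: a required attribute is missing or blank.
    if not (record.get("occupation") or "").strip():
        problems.append("incomplete: occupation is empty")
    # Noisy: a value that cannot be right, such as a negative salary.
    salary = record.get("salary")
    if salary is not None and salary < 0:
        problems.append("noisy: salary is negative")
    # Inconsistent: the stored age disagrees with the age derived from the birthday.
    birthday = record.get("birthday")
    age = record.get("age")
    if birthday is not None and age is not None:
        had_birthday_this_year = (today.month, today.day) >= (birthday.month, birthday.day)
        derived = today.year - birthday.year - (0 if had_birthday_this_year else 1)
        if derived != age:
            problems.append("inconsistent: age does not match birthday")
    return problems

# Record built from the slide's examples; the birthday format is an assumption.
record = {"occupation": " ", "salary": -10, "age": 42, "birthday": date(1997, 7, 3)}
problems = find_problems(record, today=date(2024, 1, 1))
for p in problems:
    print(p)
```

Passing the reference date explicitly keeps the age check reproducible; a real pipeline would likely use the date on which the record was captured.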
Preprocessing is used to improve the quality of data.
Clean data is vital for the success of the warehouse.
No quality data, no quality mining results!
Quality decisions must be based on quality data.
e.g., duplicate or missing data may cause incorrect or even misleading
statistics.
A data warehouse (DW) needs consistent integration of quality data.
Example
Seshadri, Sheshadri, Sesadri, Seshadri S., Srinivasan Seshadri, etc. are the same
person.
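One possible way to detect such spelling variants is approximate string matching. The sketch below uses difflib.SequenceMatcher from the Python standard library. The helper names, the choice to compare each word of the name against a known surname, and the 0.8 threshold are assumptions for illustration, not a method prescribed by these slides.

```python
import re
from difflib import SequenceMatcher

def tokens(name):
    """Split a name into lowercase alphabetic words, dropping punctuation."""
    return re.findall(r"[a-z]+", name.lower())

def refers_to(name, canonical_surname, threshold=0.8):
    """True if any word in `name` is similar enough to the canonical surname.

    The threshold is an assumed value and would need tuning on real data.
    """
    target = canonical_surname.lower()
    return any(
        SequenceMatcher(None, tok, target).ratio() >= threshold
        for tok in tokens(name)
    )

variants = ["Seshadri", "Sheshadri", "Sesadri", "Seshadri S.", "Srinivasan Seshadri"]
matches = [refers_to(v, "Seshadri") for v in variants]
print(matches)
```

Matching on the surname alone would also merge different people who share that surname, so a real integration step would compare further attributes (address, date of birth, and so on) before declaring two records duplicates.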
WHAT IS PREPROCESSING?
Definition: Data preprocessing refers to a set of techniques applied to
databases to remove noisy, missing, and inconsistent
data.
It improves the efficiency of the data mining process.
It improves the quality of the selected data.
Data selection
When you select from the data warehouse, it is assumed that the data
is already cleansed.
Preprocessing could also involve enriching the selected data with
external data.
In this preprocessing substep, remove noisy data, that is, data
blatantly out of range. Also ensure that there are no missing values.
Clearly, if the data for mining is selected from the data warehouse, it is
again assumed that all necessary data transformations have already
been completed.
FORMS OF DATA PREPROCESSING
DATA ERROR
DATA SOLUTION
MAJOR TASKS IN DATA PREPROCESSING
Data gathering (collecting data from the data warehouse)
Data cleaning
Converting data into a suitable format
Normalization
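The task list above names normalization without specifying a method. As one common instance, the sketch below shows min-max normalization, which rescales values linearly into a target range. The function name, the default range [0, 1], and the sample salaries are assumptions for illustration.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [new_min for _ in values]
    return [
        new_min + (v - lo) / (hi - lo) * (new_max - new_min)
        for v in values
    ]

salaries = [20000, 35000, 50000, 80000]  # made-up sample data
normalized = min_max_normalize(salaries)
print(normalized)
```

Min-max scaling is sensitive to outliers, because a single extreme value stretches the range; other schemes such as z-score normalization may suit such data better.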
DATA MINING BASIC ARCHITECTURE
Preprocessing
MULTI-DIMENSIONAL MEASURE OF DATA QUALITY
A well-accepted multidimensional view:
Accuracy
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility
PREPROCESSING
Data aggregation (e.g., building a data cube)
Attribute subset selection (removing irrelevant
attributes by correlation analysis)
Dimensionality reduction (encoding
schemes such as wavelets)
Numerosity reduction (replacing the data by clusters or
parametric models)
Data discretization (automatic generation of
concept hierarchies, especially for numerical data),
which is a part of data
reduction.
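To illustrate one of the strategies listed above, the sketch below performs a simple attribute subset selection: it keeps only the attributes whose Pearson correlation with a target attribute is strong enough. This is one possible reading of "correlation analysis". The column names, the sample values, and the 0.5 threshold are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        # A constant attribute carries no information about the target.
        return 0.0
    return cov / (spread_x * spread_y)

def select_attributes(columns, target, threshold=0.5):
    """Keep attribute names whose absolute correlation with the target meets the threshold."""
    return [
        name for name, values in columns.items()
        if abs(pearson(values, target)) >= threshold
    ]

# Made-up sample data for illustration only.
columns = {
    "years_experience": [1, 3, 5, 7, 9],
    "shoe_size": [42, 38, 44, 38, 42],
    "constant_flag": [1, 1, 1, 1, 1],
}
salary = [30, 40, 50, 60, 70]
selected = select_attributes(columns, salary)
print(selected)
```

Correlation only captures linear relationships, so an attribute dropped by this filter may still be useful to a non-linear model.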
DATA DISCRETIZATION
It converts a large number of data values into a
smaller number of intervals or labels, so that data evaluation and data
management become easier.
Example:
We have an attribute Age with the following values.
Age: 10, 11, 13, 17, 19, 31, 32, 38, 40, 70, 72, 73, 75
Table: Before discretization
Age: 10, 11, 13, 17, 19 (Young);
31, 32, 38, 40 (Mature); 70, 72, 73, 75 (Old)
Table: How to discretize
Age: Young, Mature, Old
Table: After discretization
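The age example can be reproduced with a few lines of code. This is a minimal sketch: the slide gives the groups but not the exact boundaries, so the cut points 30 and 60 used here are assumptions chosen to fit the listed values.

```python
from bisect import bisect_right

# Assumed cut points: ages below 30 are Young, 30 to 59 are Mature, 60 and above are Old.
CUT_POINTS = [30, 60]
LABELS = ["Young", "Mature", "Old"]

def discretize_age(age):
    """Map a numeric age to its concept label using the assumed cut points."""
    return LABELS[bisect_right(CUT_POINTS, age)]

ages = [10, 11, 13, 17, 19, 31, 32, 38, 40, 70, 72, 73, 75]
labels = [discretize_age(a) for a in ages]
for age, label in zip(ages, labels):
    print(age, label)
```

After this step, the thirteen distinct numeric values are represented by only three labels, which is the reduction the tables describe.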
DISCRETIZATION PROCESS
A typical discretization process
consists of four steps:
(i) Sort all the continuous values of the
feature to be discretized.
(ii) Choose a cut point to split the continuous
values into intervals.
(iii) Split or merge the intervals of continuous
values.
(iv) Choose the stopping criterion for the
discretization process.
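The four steps can be sketched with equal-frequency binning, which is only one of several discretization methods and is not named on the slide. In this sketch the stopping criterion is simply a fixed number of intervals k; the function names and the sample values are assumptions for illustration.

```python
def equal_frequency_cut_points(values, k):
    """Choose up to k - 1 cut points so each interval holds roughly the same number of values."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and the number of values")
    ordered = sorted(values)                      # step (i): sort the values
    n = len(ordered)
    cuts = []
    for i in range(1, k):                         # step (iv): stop once k intervals exist
        pos = i * n // k
        cut = (ordered[pos - 1] + ordered[pos]) / 2   # step (ii): midpoint as cut point
        if not cuts or cut > cuts[-1]:            # skip repeated cuts caused by duplicate values
            cuts.append(cut)
    return cuts

def split_into_intervals(values, cuts):
    """Step (iii): place each value into the interval defined by the cut points."""
    bins = [[] for _ in range(len(cuts) + 1)]
    for v in sorted(values):
        index = sum(v > c for c in cuts)          # number of cut points below v
        bins[index].append(v)
    return bins

values = [5, 1, 9, 3, 7, 11]  # made-up sample data
cuts = equal_frequency_cut_points(values, k=3)
bins = split_into_intervals(values, cuts)
print(cuts)
print(bins)
```

Supervised methods replace the fixed k with a data-driven stopping rule, for example stopping when a further split no longer improves class purity.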
DATA CLEANING
MISSING DATA
HOW TO HANDLE MISSING DATA
NOISY DATA
DATA REDUCTION: TYPES OF SAMPLING
SUMMARY
END
Thank You!