
Data Cleaning Data Transformation Data Reduction Discretization and Generating Concept Hierarchies

This document discusses data preprocessing for data mining. It covers the major tasks in data preprocessing which include data cleaning, data transformation, data reduction, and data discretization. The goals of preprocessing are to handle incomplete, noisy, and inconsistent data to improve data quality and efficiency. Techniques covered include data cleaning to handle missing data and errors, data discretization to reduce continuous attributes to discrete intervals, and data reduction through sampling to reduce data size. Preprocessing is an important step to transform raw data into a suitable format for data mining.

Uploaded by

Mik Clash

Data Mining: Knowledge Discovery in Databases

DATA PREPROCESSING
WEEK-3 
CSE(5317) 4TH YEAR SEMESTER-1(ELECTIVE)

 Data cleaning
 Data transformation 
 Data reduction 
 Discretization and generating concept hierarchies
DATA MINING
STAGES
DATA PREPROCESSING
HOW DATA MINING WORKS?
TYPES OF DATASETS
TYPES OF DATASETS CONTD..
NEED FOR DATA PREPROCESSING (WHY DATA PREPROCESSING?)
(Data in the real world is dirty!)
Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data.
e.g., occupation=“ ”
Noisy: containing unwanted information, errors, or outliers.
e.g., salary=“-10”
Inconsistent: containing incompatibilities (mismatches/discrepancies) in codes or names.
e.g., Age=“42” Birthday=“03/07/1997”
e.g., rating was “1,2,3”, is now “A,B,C”
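The three kinds of dirty data can be checked for mechanically. The following is a minimal sketch, not a definitive implementation: the record layout, the function name find_issues, the DD/MM/YYYY reading of the birthday, and the reference date are all assumptions made for illustration.

```python
from datetime import date, datetime

def find_issues(record, today):
    """Return the kinds of dirt found in one record (hypothetical layout)."""
    issues = []
    # Incomplete: the attribute value is missing or blank.
    if not record.get("occupation", "").strip():
        issues.append("incomplete")
    # Noisy: a salary cannot be negative, so treat it as an error.
    if record.get("salary", 0) < 0:
        issues.append("noisy")
    # Inconsistent: the stored age disagrees with the age implied by the birthday.
    # Assumption: the birthday is written as DD/MM/YYYY.
    born = datetime.strptime(record["birthday"], "%d/%m/%Y").date()
    had_birthday = (today.month, today.day) >= (born.month, born.day)
    age_from_birthday = today.year - born.year - (0 if had_birthday else 1)
    if age_from_birthday != record["age"]:
        issues.append("inconsistent")
    return issues

dirty = {"occupation": " ", "salary": -10, "age": 42, "birthday": "03/07/1997"}
clean = {"occupation": "engineer", "salary": 50000, "age": 26, "birthday": "03/07/1997"}
reference_day = date(2024, 1, 1)  # assumed "today" so the result is reproducible

print(find_issues(dirty, reference_day))
print(find_issues(clean, reference_day))
```

Passing the reference date in explicitly keeps the age check repeatable; a real cleaning job would more likely use the date on which the data was extracted.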
Preprocessing is used to improve the quality of the data.
Clean data is vital for the success of the warehouse.
No quality data, no quality mining results!
Quality decisions must be based on quality data.
e.g., duplicate or missing data may cause incorrect or even misleading statistics.
A data warehouse (DW) needs consistent integration of quality data.
Example
Seshadri, Sheshadri, Sesadri, Seshadri S., Srinivasan Seshadri, etc. all refer to the same person.
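One way to catch such name variants is approximate string matching. The sketch below uses Python's standard difflib module and compares each word of a name against a canonical surname. The helper names, the 0.85 threshold, and the word-by-word strategy are assumptions chosen for this example, and a crude heuristic like this can also merge people who merely share a surname.

```python
import re
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity ratio in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def refers_to(name, canonical, threshold=0.85):
    """True if any word in `name` is close enough to `canonical`.

    The threshold is an assumed value, not one taken from the slides.
    """
    words = re.findall(r"[A-Za-z]+", name)
    return any(similarity(word, canonical) >= threshold for word in words)

variants = ["Seshadri", "Sheshadri", "Sesadri", "Seshadri S.", "Srinivasan Seshadri"]
matches = [v for v in variants if refers_to(v, "Seshadri")]
print(matches)
```

Matching word by word rather than on the whole string is what lets "Srinivasan Seshadri" be recognised, since the full string differs too much from the bare surname.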
WHAT IS PREPROCESSING?
Definition: Preprocessing refers to a set of techniques applied to databases to handle noisy, missing, and inconsistent data.
 It improves the efficiency of the data mining process.
 It improves the quality of the selected data.

Data selection
 When you select from the data warehouse, it is assumed that the data is already cleansed.
 Preprocessing could also involve enriching the selected data with external data.
 In this preprocessing substep, remove noisy data, that is, data blatantly out of range. Also ensure that there are no missing values.
 If the data for mining is selected from the data warehouse, it is likewise assumed that all necessary data transformations have already been completed.
FORMS OF DATA PREPROCESSING
DATA ERROR
DATA SOLUTION
MAJOR TASKS IN DATA PREPROCESSING
 Data gathering (collecting data from the data warehouse)
 Data cleaning
 Converting data into a suitable format
 Normalization
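The slides do not say which normalization method is meant. As one common possibility, min-max normalization rescales an attribute linearly into a target range. The sketch below assumes that method; the function name and the sample values are illustrative only.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]

ages = [10, 40, 70]  # illustrative values
normalized = min_max_normalize(ages)
print(normalized)
```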
DATA MINING BASIC ARCHITECTURE

Preprocessing
MULTI-DIMENSIONAL MEASURE OF DATA QUALITY
A well-accepted multidimensional view:
 Accuracy
 Completeness
 Consistency
 Timeliness
 Believability
 Value added
 Interpretability
 Accessibility
PREPROCESSING
 Data aggregation (building a data cube)
 Attribute subset selection (removing irrelevant attributes, e.g., by correlation analysis)
 Dimensionality reduction (encoding schemes, e.g., wavelets)
 Numerosity reduction (replacing the data by clusters or parametric models)
 Data discretization (automatic generation of concept hierarchies, especially for numerical data); part of data reduction.
DATA DISCRETIZATION
 It converts a large number of data values into a smaller number of values, so that data evaluation and data management become much easier.
Example:
We have an attribute Age with the following values.
Age: 10, 11, 13, 14, 17, 19, 30, 31, 32, 38, 40, 42, 70, 72, 73, 75
Table: Before discretization
Age 10, 11, 13, 14, 17, 19: Young
Age 30, 31, 32, 38, 40, 42: Mature
Age 70, 72, 73, 75: Old
Table: How to discretize
Age: Young, Mature, Old
Table: After discretization
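The mapping in the tables above can be sketched in a few lines of Python. The cut points 30 and 70 are assumptions inferred from the grouped values (the slides list the groups but not the exact boundaries), and the names CUT_POINTS, LABELS, and age_group are illustrative.

```python
from bisect import bisect_right

# Assumed boundaries: below 30 is Young, 30 to 69 is Mature, 70 and above is Old.
CUT_POINTS = [30, 70]
LABELS = ["Young", "Mature", "Old"]

def age_group(age):
    """Replace a numeric age by its interval label."""
    return LABELS[bisect_right(CUT_POINTS, age)]

ages = [10, 11, 13, 14, 17, 19, 30, 31, 32, 38, 40, 42, 70, 72, 73, 75]
groups = [age_group(a) for a in ages]
print(sorted(set(groups), key=LABELS.index))
```

After the mapping, sixteen distinct numbers have been reduced to three labels, which is the reduction the example is meant to show.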
DISCRETIZATION PROCESS
 A typical discretization process consists of four steps:
(i) sort all the continuous values of the feature to be discretized;
(ii) choose a cut point to split the continuous values into intervals;
(iii) split or merge the intervals of continuous values;
(iv) choose the stopping criterion of the discretization process.
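The four steps can be sketched as a small top-down (splitting) procedure. The slides do not specify how the cut point or the stopping criterion is chosen, so two assumptions are made here: the cut point is the midpoint of the widest gap between neighbouring values, and the process stops once a requested number of intervals is reached. The function names are illustrative.

```python
def widest_gap_cut(values):
    """Return (gap, cut_point) for the widest gap between sorted neighbours, or None."""
    best = None
    for a, b in zip(values, values[1:]):
        gap = b - a
        if gap > 0 and (best is None or gap > best[0]):
            best = (gap, (a + b) / 2)
    return best

def discretize(values, max_intervals):
    # Step (i): sort the continuous values.
    intervals = [sorted(values)]
    # Step (iv): assumed stopping criterion is a fixed number of intervals.
    while len(intervals) < max_intervals:
        # Step (ii): propose one cut point per interval, keep the widest gap overall.
        candidates = []
        for index, interval in enumerate(intervals):
            cut = widest_gap_cut(interval)
            if cut is not None:
                candidates.append((cut[0], cut[1], index))
        if not candidates:
            break  # nothing left to split
        _, cut_point, index = max(candidates)
        # Step (iii): split the chosen interval at the cut point.
        interval = intervals.pop(index)
        left = [v for v in interval if v < cut_point]
        right = [v for v in interval if v >= cut_point]
        intervals[index:index] = [left, right]
    return intervals

ages = [10, 11, 13, 14, 17, 19, 30, 31, 32, 38, 40, 42, 70, 72, 73, 75]
result = discretize(ages, max_intervals=3)
for interval in result:
    print(interval)
```

Other choices fit the same four steps, for example equal-width or entropy-based cut points, or a bottom-up variant that starts from single values and merges neighbouring intervals.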
DATA CLEANING
MISSING DATA
HOW TO HANDLE MISSING DATA
NOISY DATA
DATA REDUCTION: TYPES OF SAMPLING
SUMMARY
END
Thank You!
