
Data Mining:

Concepts and Techniques


(3rd ed.)

— Chapter 3 —

Chapter 3: Data Preprocessing

■ Data Preprocessing: An Overview


■ Data Quality
■ Major Tasks in Data Preprocessing
■ Data Cleaning
■ Data Integration

■ Data Reduction
■ Data Transformation and Data Discretization
■ Summary



What went wrong?
Imagine that you are a manager at AllElectronics and have been
charged with analyzing the company’s data with respect to your
branch’s sales. You immediately set out to perform this task. You
carefully inspect the company’s database and data warehouse,
identifying and selecting the attributes or dimensions (e.g., item,
price, and units sold) to be included in your analysis. Alas! You
notice that several of the attributes for various tuples have no
recorded value. For your analysis, you would like to include
information as to whether each item purchased was advertised as on
sale, yet you discover that this information has not been recorded.
Furthermore, users of your database system have reported errors,
unusual values, and inconsistencies in the data recorded for some
transactions.



Data Quality: Why Preprocess the Data?
■ Measures for data quality: A multidimensional view
■ Accuracy: correct or wrong, accurate or not
■ Completeness: not recorded, unavailable, …
■ Consistency: some modified but some not, dangling, …
■ Timeliness: timely update?
■ Believability: how much are the data trusted to be correct?
■ Interpretability: how easily can the data be understood?



Major Tasks in Data Preprocessing
■ Data cleaning
■ Fill in missing values, smooth noisy
data, identify or remove outliers, and
resolve inconsistencies
■ Data integration
■ Integration of multiple databases, data
cubes, or files
■ Data reduction
■ Dimensionality reduction
■ Numerosity reduction
■ Data compression
■ Data transformation and data
discretization
■ Normalization
■ Concept hierarchy generation

Figure 1. Forms of data preprocessing



Chapter 3: Data Preprocessing

■ Data Preprocessing: An Overview


■ Data Quality
■ Major Tasks in Data Preprocessing
■ Data Cleaning
■ Data Integration

■ Data Reduction
■ Data Transformation and Data Discretization
■ Summary



Data Cleaning
■ Data in the Real World Is Dirty: lots of potentially incorrect data, e.g.,
faulty instruments, human or computer error, transmission errors
■ incomplete: lacking attribute values, lacking certain attributes of

interest, or containing only aggregate data


■ e.g., Occupation = “ ” (missing data)

■ noisy: containing noise, errors, or outliers

■ e.g., Salary = “−10” (an error)

■ inconsistent: containing discrepancies in codes or names, e.g.,

■ Age = “42”, Birthday = “03/07/2010”

■ Was rating “1, 2, 3”, now rating “A, B, C”

■ discrepancy between duplicate records

■ Intentional (e.g., disguised missing data)

■ Jan. 1 as everyone’s birthday?



Incomplete (Missing) Data

■ Data is not always available


■ E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
■ Missing data may be due to
■ equipment malfunction
■ inconsistent with other recorded data and thus deleted
■ data not entered due to misunderstanding
■ certain data may not be considered important at the time
of entry
■ history or changes of the data were not registered
■ Missing data may need to be inferred
How to Handle Missing Data?
■ Ignore the tuple: usually done when class label is missing (when
doing classification)—not effective when the % of missing values
per attribute varies considerably
■ Fill in the missing value manually: tedious + infeasible?
■ Fill it in automatically with
■ a global constant: e.g., “unknown”, a new class?!
■ the attribute mean
■ the attribute mean for all samples belonging to the same
class: smarter
■ the most probable value: inference-based such as Bayesian
formula or decision tree
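
A minimal pandas sketch of the three automatic strategies; the income
attribute, class label, and -1 sentinel are illustrative assumptions, not
part of the original slides:

    import pandas as pd

    df = pd.DataFrame({
        "income": [30000.0, None, 52000.0, None, 41000.0],
        "class":  ["low", "low", "high", "high", "low"],
    })

    # Global constant: a sentinel marks the hole instead of guessing.
    filled_constant = df["income"].fillna(-1)

    # Attribute mean over all tuples.
    filled_mean = df["income"].fillna(df["income"].mean())

    # Smarter: the mean within the tuple's own class.
    filled_class_mean = df.groupby("class")["income"].transform(
        lambda s: s.fillna(s.mean())
    )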
Noisy Data
■ Noise: random error or variance in a measured variable
■ Incorrect attribute values may be due to
■ faulty data collection instruments

■ data entry problems

■ data transmission problems

■ technology limitation

■ inconsistency in naming conventions

■ Other data problems which require data cleaning


■ duplicate records

■ incomplete data

■ inconsistent data

How to Handle Noisy Data?

■ Binning
■ first sort data and partition into (equal-frequency) bins

■ then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc. (see the sketch below)


■ Regression
■ smooth by fitting the data into regression functions

■ Clustering
■ detect and remove outliers

■ Combined computer and human inspection


■ detect suspicious values and check by human (e.g., deal

with possible outliers)
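
As a sketch of binning, here is the classic sorted-price example
(4, 8, 15, 21, 21, 24, 25, 28, 34) partitioned into three equal-frequency
bins and smoothed two ways; NumPy is assumed:

    import numpy as np

    prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])   # already sorted
    bins = np.array_split(prices, 3)                        # equal-frequency bins

    # Smooth by bin means: every value becomes its bin's mean.
    by_means = np.concatenate([np.full(len(b), b.mean()) for b in bins])

    # Smooth by bin boundaries: snap each value to the nearer bin edge.
    by_bounds = np.concatenate(
        [np.where(b - b.min() < b.max() - b, b.min(), b.max()) for b in bins]
    )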

Data Cleaning as a Process
■ Data discrepancy detection
■ Use metadata (e.g., domain, range, dependency, distribution)
■ Check field overloading
■ Check the uniqueness rule, consecutive rule, and null rule (see the sketch below)
■ Use commercial tools
■ Data scrubbing: use simple domain knowledge (e.g., postal code,

spell-check) to detect errors and make corrections


■ Data auditing: by analyzing data to discover rules and relationship to

detect violators (e.g., correlation and clustering to find outliers)


■ Data migration and integration
■ Data migration tools: allow transformations to be specified
■ ETL (Extraction/Transformation/Loading) tools: allow users to specify
transformations through a graphical user interface
■ Integration of the two processes
■ Iterative and interactive (e.g., Potter’s Wheel)
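
A small pandas sketch of rule-based discrepancy detection; the table,
column names, and rules are hypothetical:

    import pandas as pd

    tx = pd.DataFrame({
        "tx_id":  [101, 102, 102, 104],        # should be unique and consecutive
        "amount": [59.0, None, 12.5, 80.0],    # null rule: must not be missing
    })

    # Uniqueness rule: flag every tx_id that appears more than once.
    dup_rows = tx[tx["tx_id"].duplicated(keep=False)]

    # Null rule: flag tuples missing a mandatory attribute.
    null_rows = tx[tx["amount"].isna()]

    # Consecutive rule: report gaps between the lowest and highest id.
    missing_ids = set(range(tx["tx_id"].min(), tx["tx_id"].max() + 1)) - set(tx["tx_id"])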
Chapter 3: Data Preprocessing

■ Data Preprocessing: An Overview


■ Data Quality
■ Major Tasks in Data Preprocessing
■ Data Cleaning
■ Data Integration

■ Data Reduction
■ Data Transformation and Data Discretization
■ Summary

Data Integration
■ Data integration:
■ Combines data from multiple sources into a coherent store
■ Schema integration: e.g., A.cust-id ≡ B.cust-#
■ Integrate metadata from different sources
■ Entity identification problem:
■ Identify real world entities from multiple data sources, e.g., Bill Clinton =
William Clinton (see the alias sketch below)
■ Detecting and resolving data value conflicts
■ For the same real world entity, attribute values from different sources are
different
■ Possible reasons: different representations, different scales, e.g., metric
vs. British units
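
One lightweight way to attack the entity identification problem is a curated
alias table; this sketch and every name in it are purely illustrative:

    # Hypothetical alias table mapping source spellings to one canonical name.
    ALIASES = {
        "bill clinton":    "William Clinton",
        "william clinton": "William Clinton",
        "w. clinton":      "William Clinton",
    }

    def canonical(name: str) -> str:
        key = name.strip().lower()
        return ALIASES.get(key, name)

    assert canonical("Bill Clinton") == canonical("WILLIAM CLINTON")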
Handling Redundancy in Data Integration

■ Redundant data often occur when integrating multiple
databases
■ Object identification: The same attribute or object may
have different names in different databases
■ Derivable data: One attribute may be a “derived” attribute
in another table, e.g., annual revenue
■ Redundant attributes may be detected by correlation analysis
and covariance analysis (see the sketch below)
■ Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve
mining speed and quality
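
A sketch of redundancy detection by correlation analysis with pandas; the
attributes and the 0.95 cut-off are illustrative assumptions:

    import pandas as pd

    df = pd.DataFrame({
        "units_sold":      [10, 14, 9, 20, 17],
        "monthly_revenue": [250, 360, 230, 510, 420],
        "annual_revenue":  [3000, 4320, 2760, 6120, 5040],
    })

    # Pearson correlations near +/-1 suggest one attribute derives from another.
    corr = df.corr()
    redundant = [
        (a, b)
        for i, a in enumerate(df.columns)
        for b in df.columns[i + 1:]
        if abs(corr.loc[a, b]) > 0.95
    ]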
Histogram Analysis
■ Divide data into buckets and
store average (sum) for each
bucket
■ Partitioning rules:
■ Equal-width: equal bucket
range
■ Equal-frequency (or
equal-depth)
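
Both partitioning rules in a few lines of NumPy; the data and the bucket
count are made up:

    import numpy as np

    prices = np.array([1, 1, 5, 5, 5, 8, 8, 10, 10, 12, 14, 15, 18, 20, 21, 25, 28, 30])

    # Equal-width: every bucket spans the same value range.
    counts_ew, edges_ew = np.histogram(prices, bins=3)

    # Equal-frequency (equal-depth): edges at quantiles, so each bucket
    # holds roughly the same number of values.
    edges_ef = np.quantile(prices, np.linspace(0, 1, 4))
    counts_ef, _ = np.histogram(prices, bins=edges_ef)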

Clustering
■ Partition data set into clusters based on similarity, and store
cluster representation (e.g., centroid and diameter) only
■ Can be very effective if data is clustered but not if data is
“smeared”
■ Can have hierarchical clustering and be stored in
multi-dimensional index tree structures
■ There are many choices of clustering definitions and
clustering algorithms
■ Cluster analysis will be studied in depth in Chapter 10
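
A sketch of numerosity reduction by clustering, assuming scikit-learn is
available; only a (centroid, diameter) pair is kept per cluster, with the
diameter roughly approximated as twice the largest centroid distance:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    points = np.vstack([
        rng.normal(loc=c, scale=0.5, size=(100, 2))
        for c in ([0, 0], [5, 5], [0, 5])
    ])

    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)

    # Store only the compact representation, not the 300 raw points.
    summary = [
        (centre, 2 * np.linalg.norm(points[km.labels_ == i] - centre, axis=1).max())
        for i, centre in enumerate(km.cluster_centers_)
    ]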

Sampling

■ Sampling: obtaining a small sample s to represent the whole


data set N
■ Allow a mining algorithm to run in complexity that is potentially
sub-linear to the size of the data
■ Key principle: Choose a representative subset of the data
■ Simple random sampling may have very poor performance in
the presence of skew
■ Develop adaptive sampling methods, e.g., stratified
sampling
■ Note: Sampling may not reduce database I/Os (page at a time)

Types of Sampling

■ Simple random sampling


■ There is an equal probability of selecting any particular item

■ Sampling without replacement


■ Once an object is selected, it is removed from the population

■ Sampling with replacement


■ A selected object is not removed from the population

■ Stratified sampling:
■ Partition the data set, and draw samples from each partition

(proportionally, i.e., approximately the same percentage of


the data)
■ Used in conjunction with skewed data
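
All three schemes in pandas; the 80/20 strata and the 10% sampling rate
are illustrative:

    import pandas as pd

    df = pd.DataFrame({
        "stratum": ["youth"] * 80 + ["senior"] * 20,
        "value":   list(range(100)),
    })

    srswor = df.sample(n=10, replace=False, random_state=1)  # without replacement
    srswr  = df.sample(n=10, replace=True,  random_state=1)  # with replacement

    # Stratified: ~10% from each partition, preserving the skewed mix.
    stratified = df.groupby("stratum", group_keys=False).apply(
        lambda g: g.sample(frac=0.1, random_state=1)
    )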

Sampling: With or without Replacement

Figure: simple random sampling without replacement (SRSWOR) and
with replacement (SRSWR), both drawn from the raw data
Sampling: Cluster or Stratified Sampling

Figure: raw data on the left; a cluster/stratified sample on the right
Data Cube Aggregation

■ The lowest level of a data cube (base cuboid)


■ The aggregated data for an individual entity of interest
■ E.g., a customer in a phone calling data warehouse
■ Multiple levels of aggregation in data cubes
■ Further reduce the size of data to deal with
■ Reference appropriate levels
■ Use the smallest representation which is enough to solve the
task
■ Queries regarding aggregated information should be answered
using the data cube, when possible
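
A groupby sketch of climbing one level of aggregation; the sales table is
hypothetical:

    import pandas as pd

    sales = pd.DataFrame({
        "year":    [2022, 2022, 2023, 2023, 2023],
        "quarter": ["Q1", "Q2", "Q1", "Q1", "Q2"],
        "amount":  [200.0, 350.0, 180.0, 220.0, 400.0],
    })

    # Base cuboid: totals per (year, quarter).
    by_quarter = sales.groupby(["year", "quarter"], as_index=False)["amount"].sum()

    # Higher-level cuboid: yearly totals, the smallest representation
    # that still answers yearly queries.
    by_year = by_quarter.groupby("year", as_index=False)["amount"].sum()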

Data Reduction 3: Data Compression
■ String compression
■ There are extensive theories and well-tuned algorithms

■ Typically lossless, but only limited manipulation is possible

without expansion
■ Audio/video compression
■ Typically lossy compression, with progressive refinement

■ Sometimes small fragments of signal can be reconstructed

without reconstructing the whole


■ Time sequences are not audio
■ Typically short and vary slowly with time

■ Dimensionality and numerosity reduction may also be


considered as forms of data compression
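
A minimal lossless example with Python's standard zlib module; the
repetitive byte string is contrived to compress well:

    import zlib

    text = b"AAAABBBCCDAA" * 1000              # highly repetitive string data

    compressed = zlib.compress(text, level=9)  # lossless
    restored = zlib.decompress(compressed)

    assert restored == text                    # nothing was lost
    print(len(text), "->", len(compressed), "bytes")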
Data Compression

Figure: lossless compression turns the original data into compressed
data that can be restored exactly; lossy compression recovers only an
approximation of the original data
Automatic Concept Hierarchy Generation
■ Some hierarchies can be automatically generated based on
the analysis of the number of distinct values per attribute in
the data set
■ The attribute with the most distinct values is placed at
the lowest level of the hierarchy
■ Exceptions, e.g., weekday, month, quarter, year

country: 15 distinct values
province_or_state: 365 distinct values
city: 3,567 distinct values
street: 674,339 distinct values
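
A sketch of the heuristic: count distinct values per attribute and sort
ascending; the six-row table is made up for illustration:

    import pandas as pd

    addr = pd.DataFrame({
        "country":           ["CA", "CA", "CA", "US", "US", "US"],
        "province_or_state": ["BC", "BC", "ON", "NY", "NY", "NY"],
        "city":              ["Vancouver", "Victoria", "Toronto",
                              "NYC", "NYC", "Albany"],
        "street":            ["Main St", "Oak Ave", "King St",
                              "5th Ave", "Broadway", "Elm St"],
    })

    # Fewest distinct values -> top of the hierarchy; most -> bottom.
    order = addr.nunique().sort_values(kind="stable")
    hierarchy = list(order.index)   # country < province_or_state < city < street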
Chapter 3: Data Preprocessing

■ Data Preprocessing: An Overview


■ Data Quality
■ Major Tasks in Data Preprocessing
■ Data Cleaning
■ Data Integration

■ Data Reduction
■ Data Transformation and Data Discretization
■ Summary

Summary
■ Data quality: accuracy, completeness, consistency, timeliness,
believability, interpretability
■ Data cleaning: e.g., missing/noisy values, outliers
■ Data integration from multiple sources:
■ Entity identification problem; Remove redundancies; Detect
inconsistencies
■ Data reduction
■ Dimensionality reduction; Numerosity reduction; Data
compression
■ Data transformation and data discretization
■ Normalization; Concept hierarchy generation
