Data Preprocessing
• Data preprocessing is a data mining technique
that involves transforming raw data into an
understandable format.
• Real-world data is often incomplete,
inconsistent, and/or lacking in certain
behaviors or trends, and is likely to contain
many errors.
• Data preprocessing is a proven method of
resolving such issues.
• Data preprocessing prepares raw data for
further processing.
• Data preprocessing is used in database-driven
applications such as customer relationship
management, and in model-based applications
such as neural networks.
The main data preprocessing techniques
• Data cleaning
• Data integration
• Data transformation
• Data reduction
Data Preprocessing Techniques
• Data cleaning: can be applied to remove
noise and correct inconsistencies in the data.
• Data integration: merges data from multiple
sources into a coherent data store, such as a
data warehouse.
• Data transformation: operations such as
normalization may be applied.
• Data reduction: can reduce the data size by
aggregating, eliminating redundant features,
or clustering, for instance.
• Data cleaning routines work to “clean” the data
by filling in missing values, smoothing noisy data,
identifying or removing outliers, and resolving
inconsistencies.
• If users believe the data are dirty, they are
unlikely to trust the results of any data mining
that has been applied to them.
• Although most mining routines have some
procedures for dealing with incomplete or noisy
data, they are not always robust.
• Therefore, a useful preprocessing step is to
run your data through some data cleaning
routines.
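For illustration, a minimal cleaning pass of this kind might look like the sketch below (pandas; the column names, fill strategy, and outlier threshold are assumptions, not part of the original slides):

```python
import pandas as pd
import numpy as np

records = pd.DataFrame({
    "age":    [25, np.nan, 47, 51, 38, 120],   # 120 looks like an entry error
    "income": [30000, 42000, np.nan, 58000, 61000, 45000],
})

# Fill in missing values with the column median.
records["age"] = records["age"].fillna(records["age"].median())
records["income"] = records["income"].fillna(records["income"].median())

# Identify and remove outliers: keep rows within 3 standard deviations of the age mean.
z = (records["age"] - records["age"].mean()) / records["age"].std()
records = records[z.abs() <= 3]

print(records)
```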
• Suppose you would like to include data from
multiple sources in your analysis.
• This would involve integrating multiple
databases, data cubes, or files, that is, data
integration.
• Yet some attributes representing a given
concept may have different names in different
databases, causing inconsistencies and
redundancies.
• Having a large amount of redundant data may
slow down or confuse the knowledge
discovery process.
• Clearly, in addition to data cleaning, steps
must be taken to help avoid redundancies
during data integration.
• Typically, data cleaning and data integration
are performed as a preprocessing step when
preparing the data for a data warehouse.
• Additional data cleaning can be performed to
detect and remove redundancies that may
have resulted from data integration.
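A small integration sketch along these lines (the table and attribute names are invented for illustration): resolve a naming inconsistency, drop redundant records, then merge the sources.

```python
import pandas as pd

# Two sources describe the same customers but name the key attribute differently.
crm = pd.DataFrame({"cust_id": [1, 2, 3], "city": ["Pune", "Delhi", "Mumbai"]})
sales = pd.DataFrame({"customer_id": [1, 2, 2, 3], "amount": [250, 100, 100, 400]})

# Resolve the naming inconsistency before merging.
sales = sales.rename(columns={"customer_id": "cust_id"})

# Remove redundant (duplicate) records that would confuse later analysis.
sales = sales.drop_duplicates()

# Integrate the two sources into one coherent table.
integrated = crm.merge(sales, on="cust_id", how="inner")
print(integrated)
```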
• Getting back to your data, suppose you have
decided that you would like to use a distance-based
mining algorithm for your analysis, such
as neural networks, nearest-neighbor
classifiers, or clustering.
• Such methods provide better results if the data to
be analyzed have been normalized, that is,
scaled to a specific range such as [0.0, 1.0].
• You soon realize that data transformation
operations, such as normalization and
aggregation, are additional data preprocessing
procedures that would contribute toward the
success of the mining process.
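A minimal sketch of min-max normalization, scaling an attribute into [0.0, 1.0] (the sample values are arbitrary):

```python
import numpy as np

values = np.array([12.0, 47.0, 35.0, 90.0, 60.0])

# Min-max normalization: rescale every value into the range [0.0, 1.0].
v_min, v_max = values.min(), values.max()
normalized = (values - v_min) / (v_max - v_min)

print(normalized)   # smallest value maps to 0.0, largest to 1.0
```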
• Data reduction obtains a reduced
representation of the data set that is much
smaller in volume, yet produces the same (or
almost the same) analytical results.
• There are a number of strategies for data
reduction.
• These include data aggregation, attribute
subset selection, dimensionality reduction, and
numerosity reduction.
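Two of these strategies, attribute subset selection and aggregation, might look like the following sketch (the column names and data are invented for illustration):

```python
import pandas as pd

daily = pd.DataFrame({
    "date":   pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-18"]),
    "branch": ["A", "A", "B", "B"],
    "sales":  [120, 80, 200, 150],
    "clerk":  ["x", "y", "x", "y"],   # assumed irrelevant to the analysis
})

# Attribute subset selection: drop the attribute that is not needed.
subset = daily.drop(columns=["clerk"])

# Aggregation: one row per branch per month instead of one row per day.
monthly = (subset
           .groupby(["branch", subset["date"].dt.to_period("M")])["sales"]
           .sum()
           .reset_index())
print(monthly)
```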
Data Reduction
• Data can also be “reduced” by generalization
with the use of concept hierarchies, where low-level
concepts, such as city for customer location,
are replaced with higher-level concepts, such as
region, province, or state.
• A concept hierarchy organizes the concepts into
varying levels of abstraction.
• Data discretization is a form of data reduction
that is very useful for the automatic generation of
concept hierarchies from numerical data.
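A small discretization sketch, binning a numeric attribute into higher-level concepts (the bin edges and labels are assumptions for illustration):

```python
import pandas as pd

ages = pd.Series([15, 22, 37, 45, 63, 71])

# Replace raw numeric ages with higher-level concepts.
age_group = pd.cut(ages,
                   bins=[0, 25, 60, 120],
                   labels=["youth", "middle_aged", "senior"])
print(age_group)
```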