Data Preprocessing
• Data preprocessing is a data mining technique
that involves transforming raw data into an
understandable format.
• Real-world data is often incomplete,
inconsistent, and/or lacking in certain
behaviors or trends, and is likely to contain
many errors.
• Data preprocessing is a proven method of
resolving such issues.
• Data preprocessing prepares raw data for
further processing.
• Data preprocessing is used in database-driven
applications such as customer relationship
management, and in model-based applications
such as neural networks.
The main data preprocessing techniques
• Data cleaning
• Data integration
• Data transformation
• Data reduction
Data Preprocessing Techniques
• Data cleaning: can be applied to remove
noise and correct inconsistencies in the data.
• Data integration: merges data from multiple
sources into a coherent data store, such as a
data warehouse.
• Data transformation: operations such as
normalization may be applied.
• Data reduction: can reduce the data size by
aggregating, eliminating redundant features,
or clustering, for instance.
• Data cleaning routines work to “clean” the data
by filling in missing values, smoothing noisy data,
identifying or removing outliers, and resolving
inconsistencies.
• If users believe the data are dirty, they are
unlikely to trust the results of any data mining
that has been applied to them.
• Although most mining routines have some
procedures for dealing with incomplete or noisy
data, they are not always robust.
• Therefore, a useful preprocessing step is to
run your data through some data cleaning
routines.
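For illustration, a minimal cleaning pass of this kind might look like the sketch below (pandas; the column names, fill strategy, and outlier threshold are assumptions, not part of the original slides):

```python
import pandas as pd
import numpy as np

records = pd.DataFrame({
    "age":    [25, np.nan, 47, 51, 38, 120],   # 120 looks like an entry error
    "income": [30000, 42000, np.nan, 58000, 61000, 45000],
})

# Fill in missing values with the column median.
records["age"] = records["age"].fillna(records["age"].median())
records["income"] = records["income"].fillna(records["income"].median())

# Identify and remove outliers: keep rows within 3 standard deviations of the age mean.
z = (records["age"] - records["age"].mean()) / records["age"].std()
records = records[z.abs() <= 3]

print(records)
```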
• Suppose you would like to include data from
multiple sources in your analysis.
• This would involve integrating multiple
databases, data cubes, or files, that is, data
integration.
• Yet some attributes representing a given
concept may have different names in different
databases, causing inconsistencies and
redundancies.
• Having a large amount of redundant data may
slow down or confuse the knowledge
discovery process.
• Clearly, in addition to data cleaning, steps
must be taken to help avoid redundancies
during data integration.
• Typically, data cleaning and data integration
are performed as a preprocessing step when
preparing the data for a data warehouse.
• Additional data cleaning can be performed to
detect and remove redundancies that may
have resulted from data integration.
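A small integration sketch along these lines (the table and attribute names are invented for illustration): resolve a naming inconsistency, drop redundant records, then merge the sources.

```python
import pandas as pd

# Two sources describe the same customers but name the key attribute differently.
crm = pd.DataFrame({"cust_id": [1, 2, 3], "city": ["Pune", "Delhi", "Mumbai"]})
sales = pd.DataFrame({"customer_id": [1, 2, 2, 3], "amount": [250, 100, 100, 400]})

# Resolve the naming inconsistency before merging.
sales = sales.rename(columns={"customer_id": "cust_id"})

# Remove redundant (duplicate) records that would confuse later analysis.
sales = sales.drop_duplicates()

# Integrate the two sources into one coherent table.
integrated = crm.merge(sales, on="cust_id", how="inner")
print(integrated)
```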
• Getting back to your data, suppose you have
decided that you would like to use a distance-based
mining algorithm for your analysis, such
as neural networks, nearest-neighbor
classifiers, or clustering.
• Such methods provide better results if the data to
be analyzed have been normalized, that is,
scaled to a specific range such as [0.0, 1.0].
• You soon realize that data transformation
operations, such as normalization and
aggregation, are additional data preprocessing
procedures that would contribute toward the
success of the mining process.
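A minimal sketch of min-max normalization, scaling an attribute into [0.0, 1.0] (the sample values are arbitrary):

```python
import numpy as np

values = np.array([12.0, 47.0, 35.0, 90.0, 60.0])

# Min-max normalization: rescale every value into the range [0.0, 1.0].
v_min, v_max = values.min(), values.max()
normalized = (values - v_min) / (v_max - v_min)

print(normalized)   # smallest value maps to 0.0, largest to 1.0
```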
• Data reduction obtains a reduced
representation of the data set that is much
smaller in volume, yet produces the same (or
almost the same) analytical results.
• There are a number of strategies for data
reduction.
• These include data aggregation, attribute
subset selection, dimensionality reduction, and
numerosity reduction.
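Two of these strategies, attribute subset selection and aggregation, might look like the following sketch (the column names and data are invented for illustration):

```python
import pandas as pd

daily = pd.DataFrame({
    "date":   pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-18"]),
    "branch": ["A", "A", "B", "B"],
    "sales":  [120, 80, 200, 150],
    "clerk":  ["x", "y", "x", "y"],   # assumed irrelevant to the analysis
})

# Attribute subset selection: drop the attribute that is not needed.
subset = daily.drop(columns=["clerk"])

# Aggregation: one row per branch per month instead of one row per day.
monthly = (subset
           .groupby(["branch", subset["date"].dt.to_period("M")])["sales"]
           .sum()
           .reset_index())
print(monthly)
```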
Data Reduction
• Data can also be “reduced” by generalization
with the use of concept hierarchies, where low-level
concepts, such as city for customer location,
are replaced with higher-level concepts, such as
region, province, or state.
• A concept hierarchy organizes the concepts into
varying levels of abstraction.
• Data discretization is a form of data reduction
that is very useful for the automatic generation of
concept hierarchies from numerical data.
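A small discretization sketch, binning a numeric attribute into higher-level concepts (the bin edges and labels are assumptions for illustration):

```python
import pandas as pd

ages = pd.Series([15, 22, 37, 45, 63, 71])

# Replace raw numeric ages with higher-level concepts.
age_group = pd.cut(ages,
                   bins=[0, 25, 60, 120],
                   labels=["youth", "middle_aged", "senior"])
print(age_group)
```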