COMPAPPABCA50150rDatrAP Data Preprocessing2 (DataMining)
DATA PRE-PROCESSING
Preprocessing refers to the steps and techniques used to prepare raw data for analysis or
modeling.
Preprocessing aims to clean, transform, and organize the data into a suitable format for
further processing.
1. Data Cleaning:
Missing values might be imputed or removed based on the context and the impact on the
analysis.
Missing values may appear as a placeholder such as '0' or as an arbitrary string value. They can be
handled by replacing them with the mean of the observed values or with another substitute value.
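Mean imputation can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function name `impute_with_mean`, the list of missing markers, and the sample data are assumptions made for the example.

```python
def impute_with_mean(values, missing_markers=(None, "", "NA", "?")):
    """Replace missing entries with the mean of the observed entries."""
    # Keep only the entries that are not flagged as missing.
    observed = [float(v) for v in values if v not in missing_markers]
    if not observed:
        raise ValueError("no observed values to compute a mean from")
    mean = sum(observed) / len(observed)
    # Substitute the mean wherever a missing marker appears.
    return [mean if v in missing_markers else float(v) for v in values]


ages = [20, None, 30, "NA", 40]   # hypothetical column with two missing entries
filled = impute_with_mean(ages)
print(filled)
```

If a dataset uses 0 as its placeholder, 0 can be added to `missing_markers`, provided 0 is never a legitimate value in that column.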
Values that contain alphabetic characters where a number is expected (for example, a two-category
label) must be replaced by numeric codes such as the binary values 0 and 1.
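One way to read this step is as binary encoding of a two-category text column. The sketch below assumes that interpretation; the function name `encode_binary` and the "yes"/"no" labels are hypothetical.

```python
def encode_binary(labels, positive):
    """Map each label to 1 if it equals `positive`, otherwise to 0."""
    return [1 if label == positive else 0 for label in labels]


answers = ["yes", "no", "yes", "yes"]   # hypothetical text column
encoded = encode_binary(answers, positive="yes")
print(encoded)
```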
2. Transformation:
Data transformation involves converting variables into a suitable format for analysis.
• Normalization - In this method, the data is rescaled to a specific range, usually between 0 and 1.
Normalization = (x - min) / (max - min), where min and max are the smallest and largest values of the variable.
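A minimal sketch of rescaling to the range 0 to 1, assuming the usual min-max form (x - min) / (max - min). The function name `min_max_normalize` and the sample values are chosen for illustration only.

```python
def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map them to 0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_normalize([10, 20, 30, 50])
print(scaled)
```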
3. Data Integration
Data integration is a crucial step in the data management process that involves combining and unifying
data from various sources into a single, coherent, and organized view.
The goal of data integration is to provide a unified and comprehensive view of data, making it easier to
analyze, report on, and derive insights from the information contained in disparate datasets.
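As a small illustration of integration, the sketch below combines two hypothetical sources (a customer list and an order list) into one record per customer, joined on a shared key. The field names `customer_id`, `name`, `amount`, and `total_spent` are assumptions made for the example.

```python
def integrate(customers, orders):
    """Combine two sources on the shared `customer_id` key into unified records."""
    # Aggregate the order source so there is one total per customer.
    totals = {}
    for order in orders:
        cid = order["customer_id"]
        totals[cid] = totals.get(cid, 0.0) + order["amount"]
    # Attach the aggregated value to each customer record.
    return [
        {**customer, "total_spent": totals.get(customer["customer_id"], 0.0)}
        for customer in customers
    ]


customers = [{"customer_id": 1, "name": "Asha"}, {"customer_id": 2, "name": "Ravi"}]
orders = [{"customer_id": 1, "amount": 250.0}, {"customer_id": 1, "amount": 100.0}]
unified = integrate(customers, orders)
print(unified)
```

Customers with no matching orders are kept with a total of 0.0, which corresponds to a left join on the customer source.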
4. Data Reduction
Data reduction is a technique in data analysis and data mining that reduces the volume of a dataset
while producing the same or similar analytical results.
The primary goal of data reduction is to simplify the data while retaining the essential information and
patterns, which can be beneficial for various purposes, including improving efficiency, speeding up
algorithms, reducing storage requirements, and gaining a better understanding of the data.
(a) Dimensionality Reduction: Dimensionality reduction techniques aim to reduce the number of
variables or features in the dataset while preserving as much relevant information as possible.
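Two common approaches that reduce dimensionality by selecting a subset of attributes are:
1. Stepwise forward selection: start with no attributes and add the most useful remaining attribute at each step.
2. Stepwise backward elimination: start with all attributes and remove the least useful attribute at each step.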
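Stepwise forward selection can be sketched as a greedy loop. This is an illustrative sketch only: the `score` callback, the `USEFULNESS` table, and the feature names are hypothetical stand-ins for a real model-quality measure such as validation accuracy.

```python
def forward_selection(features, score, max_features):
    """Greedily add the feature that most improves `score` until nothing helps."""
    selected = []
    remaining = list(features)
    best_score = score(selected)
    while remaining and len(selected) < max_features:
        # Score every candidate subset formed by adding one more feature.
        candidates = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates)
        if top_score <= best_score:
            break  # no remaining feature improves the score
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected


# Toy scoring function: each feature contributes a fixed, made-up usefulness.
USEFULNESS = {"age": 0.5, "income": 0.3, "id": 0.0}


def toy_score(subset):
    return sum(USEFULNESS[f] for f in subset)


chosen = forward_selection(["id", "age", "income"], toy_score, max_features=3)
print(chosen)
```

Backward elimination mirrors this loop: it starts from the full feature list and repeatedly drops the feature whose removal costs the least.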
(b) Data Compression
Data compression is the process of reducing the size of data files or streams while preserving as much of
the original information as possible.
It is widely used in various applications to save storage space, reduce transmission time over networks,
and improve overall system efficiency.
There are two primary types of data compression:
1. Lossless
2. Lossy
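The difference between the two types can be illustrated with a short sketch. The lossless part uses Python's standard `zlib` module; the lossy part uses simple rounding as a stand-in for a real lossy codec, and the sample data is made up for the example.

```python
import zlib

# Lossless: the original bytes are recovered exactly after decompression.
text = ("data preprocessing " * 50).encode("utf-8")
compressed = zlib.compress(text)
restored = zlib.decompress(compressed)
print(len(text), len(compressed), restored == text)

# Lossy (illustrative): rounding keeps the overall value but discards detail
# that cannot be recovered afterwards.
readings = [21.4387, 21.4402, 21.4391]
lossy = [round(r, 1) for r in readings]
print(lossy)
```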
(c) Numerosity Reduction
Numerosity reduction is a data mining technique used to reduce the number of data points in a dataset
while retaining its essential characteristics and patterns.
The goal of numerosity reduction is to simplify complex datasets by representing them with a smaller set of
representative data points or summary statistics.
This reduction in data volume can make it more manageable for analysis, visualization, and model building
while still preserving meaningful information.
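One simple form of numerosity reduction replaces each group of consecutive points with a summary statistic. The sketch below uses the mean of fixed-size chunks; the function name `reduce_by_chunk_mean` and the sample readings are assumptions made for illustration, and other representations (sampling, histograms, clustering) are equally valid.

```python
def reduce_by_chunk_mean(values, chunk_size):
    """Represent each run of `chunk_size` consecutive values by its mean."""
    summary = []
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        summary.append(sum(chunk) / len(chunk))
    return summary


readings = [2, 4, 6, 8, 10, 12, 14]          # seven original data points
summary = reduce_by_chunk_mean(readings, chunk_size=3)
print(summary)                                # three representative values
```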
(d) Discretization
Discretization is a data preprocessing technique used in data mining and machine learning to convert
continuous (numerical) data into discrete (categorical) intervals or bins.
It involves grouping data points into specific ranges or categories based on their values.
Discretization is used mainly to simplify data analysis, improve model performance, and meet the
requirements of algorithms that accept only categorical inputs.
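A minimal sketch of one discretization scheme, equal-width binning, is shown below. The function name `equal_width_bins` and the sample ages are assumptions for the example; equal-frequency binning or domain-specific cut points are common alternatives.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of `n_bins` equally wide intervals (0-based index)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so they all fall in the first bin.
        return [0 for _ in values]
    # The maximum value would land in bin `n_bins`, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [18, 25, 33, 47, 60]
bins = equal_width_bins(ages, n_bins=3)
print(bins)
```

Here the range 18 to 60 is split into three intervals of width 14, so each age is replaced by the index of the interval it falls into.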