
Data Preprocessing (Data Mining)

Data pre-processing involves cleaning, transforming, and organizing raw data to prepare it for analysis. Common pre-processing tasks include data cleaning to handle missing values, data transformation such as normalization and standardization, data integration to combine disparate sources, and data reduction techniques like dimensionality reduction, data compression, and discretization. The goal of pre-processing is to prepare suitable, high-quality data for modeling and analysis.


DATA PRE-PROCESSING


• Preprocessing refers to the steps and techniques used to prepare raw data for analysis or modeling.

• It is a crucial step in data science and machine learning pipelines.

• Preprocessing aims to clean, transform, and organize the data into a suitable format for further processing.

Common preprocessing tasks include:


1. Data Cleaning:
• This involves dealing with missing values and irrelevant data.

• Missing values might be imputed or removed, depending on the context and their impact on the analysis.

• Missing values may appear as '0', blanks, or placeholder strings; they can be handled by imputing a statistic such as the mean or by replacing the affected values.

• Blank entries can be filled with a default value such as '0'.

• Non-numeric (alphabetic) entries in a numeric column must be encoded, for example as binary values (0 and 1).
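A minimal sketch of these cleaning steps with pandas, assuming a small hypothetical table with missing numeric values and a yes/no text column:

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with missing values and a text column.
df = pd.DataFrame({
    "age":       [25, np.nan, 31, 47, np.nan],
    "salary":    [50000, 62000, np.nan, 81000, 45000],
    "is_member": ["yes", "no", "yes", np.nan, "no"],
})

# Impute missing numeric values with the column mean.
df["age"] = df["age"].fillna(df["age"].mean())
df["salary"] = df["salary"].fillna(df["salary"].mean())

# Encode the alphabetic column as binary values (0/1), filling blanks with 0.
df["is_member"] = df["is_member"].map({"yes": 1, "no": 0}).fillna(0).astype(int)

print(df)
```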

2. Transformation:

• Data transformation involves converting variables into a suitable format for analysis.

This can include:

• Normalization - In this method, the data is transformed to a specific range, usually between 0 and 1.

The formula for normalization is:

x_norm = (x - x_min) / (x_max - x_min)

where x_min and x_max are the minimum and maximum values of the feature.
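A minimal sketch of min-max normalization with NumPy, using hypothetical values:

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Rescale to the [0, 1] range: (x - min) / (max - min).
x_norm = (x - x.min()) / (x.max() - x.min())
print(x_norm)  # [0.   0.25 0.5  0.75 1.  ]
```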

• Standardization (Z-Score Scaling) - Standardization transforms the data to have a mean of 0 and a standard deviation of 1.

The formula for standardization is:

z = (x - μ) / σ

where μ is the mean of the data and σ is its standard deviation.
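A minimal sketch of z-score standardization with NumPy, using the same hypothetical values:

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Subtract the mean and divide by the standard deviation.
z = (x - x.mean()) / x.std()
print(round(z.mean(), 6), round(z.std(), 6))  # approximately 0 and 1
```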


• Robust Scaling - Robust scaling is similar to standardization but uses the median and interquartile range (IQR) instead of the mean and standard deviation. This makes it more resistant to the influence of outliers.

The formula for robust scaling is:

x_scaled = (x - median) / IQR
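A minimal sketch of median/IQR scaling with NumPy, using hypothetical values that include one outlier:

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 500.0])  # note the outlier

median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1

# (x - median) / IQR is less sensitive to the outlier than a z-score would be.
x_scaled = (x - median) / iqr
print(x_scaled)
```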

3. Data Integration

• Data integration is a crucial step in the data management process that involves combining and unifying data from various sources into a single, coherent, and organized view.

• The goal of data integration is to provide a unified and comprehensive view of data, making it easier to analyze, report on, and derive insights from the information contained in disparate datasets.
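A minimal sketch of integrating two hypothetical sources with pandas by joining on a shared key:

```python
import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2, 3],
                          "name": ["Asha", "Ravi", "Meera"]})
orders = pd.DataFrame({"cust_id": [1, 1, 3],
                       "amount": [250, 400, 120]})

# Merge on the shared key to obtain a single, unified view.
unified = customers.merge(orders, on="cust_id", how="left")
print(unified)
```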

4. Data Reduction

• Data reduction is a crucial technique in data analysis and data mining that involves reducing the volume of a dataset while producing the same or similar analytical results.

• The primary goal of data reduction is to simplify the data while retaining the essential information and patterns, which can be beneficial for various purposes, including improving efficiency, speeding up algorithms, reducing storage requirements, and gaining a better understanding of the data.

Techniques for Data Reduction:

(a) Dimensionality Reduction: Dimensionality reduction techniques aim to reduce the number of
variables or features in the dataset while preserving as much relevant information as possible.
Two common attribute (feature) selection approaches for dimensionality reduction are:
1. Stepwise forward selection
2. Stepwise backward elimination
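A minimal sketch of stepwise forward selection using scikit-learn's SequentialFeatureSelector, assuming scikit-learn >= 0.24 and a hypothetical synthetic dataset; setting direction="backward" gives stepwise backward elimination:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Hypothetical dataset with 10 candidate features.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=3,
    direction="forward",   # "backward" performs stepwise backward elimination
)
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the selected features
```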

(b) Data Compression:

• Data compression is the process of reducing the size of data files or streams while preserving as much of the original information as possible.

• It is widely used in various applications to save storage space, reduce transmission time over networks, and improve overall system efficiency.
There are two primary types of data compression:
1. Lossless
2. Lossy
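A minimal sketch of lossless compression using Python's built-in zlib module on a hypothetical, repetitive byte string; the original data is fully recoverable:

```python
import zlib

raw = b"data preprocessing " * 100   # repetitive data compresses well
compressed = zlib.compress(raw)
restored = zlib.decompress(compressed)

print(len(raw), len(compressed))     # compressed size is much smaller
assert restored == raw               # lossless: original fully recovered
```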

(c) Numerosity Reduction:

• It is a data mining technique used to reduce the number of data points in a dataset while retaining its essential characteristics and patterns.

• The goal of numerosity reduction is to simplify complex datasets by representing them with a smaller set of representative data points or summary statistics.

• This reduction in data volume can make it more manageable for analysis, visualization, and model building while still preserving meaningful information.
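A minimal sketch of numerosity reduction by random sampling with pandas on hypothetical data; the sample's summary statistics approximate those of the full dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset of 10,000 numeric values.
rng = np.random.default_rng(42)
df = pd.DataFrame({"value": rng.normal(size=10_000)})

# Keep a 1% random sample as a smaller, representative subset.
sample = df.sample(frac=0.01, random_state=42)

print(df["value"].mean(), sample["value"].mean())
```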

(d) Discretization

• Discretization is a data preprocessing technique used in data mining and machine learning to convert continuous (numerical) data into discrete (categorical) intervals or bins.

• It involves grouping data points into specific ranges or categories based on their values.

• Discretization is primarily used for several reasons, including simplifying data analysis, improving model performance, and addressing certain algorithms' requirements.
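A minimal sketch of discretization with pandas, binning hypothetical ages into labeled intervals:

```python
import pandas as pd

ages = pd.Series([5, 17, 23, 34, 45, 61, 78])

# Group continuous ages into categorical bins with descriptive labels.
bins = pd.cut(ages, bins=[0, 18, 40, 65, 100],
              labels=["child", "young adult", "middle-aged", "senior"])
print(bins)
```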
