COMPAPPABCA50150rDatrAP Data Preprocessing2 (DataMining)

Data pre-processing involves cleaning, transforming, and organizing raw data to prepare it for analysis. Common pre-processing tasks include data cleaning to handle missing values, data transformation such as normalization and standardization, data integration to combine disparate sources, and data reduction techniques like dimensionality reduction, data compression, and discretization. The goal of pre-processing is to prepare suitable, high-quality data for modeling and analysis.

Uploaded by

Mr Kamina

DATA PRE-PROCESSING

 Preprocessing refers to the steps and techniques used to prepare raw data for analysis or
modeling.

 It is a crucial step in the data science and machine learning workflow.

 Preprocessing aims to clean, transform, and organize the data into a suitable format for
further processing.

Common preprocessing tasks include:

1. Data Cleaning:
 This involves dealing with missing values and with irrelevant data.

 Missing values might be imputed or removed based on the context and the impact on the
analysis.

 Missing values may appear as placeholders such as ‘0’ or an arbitrary string. They can be handled by substituting the mean of the observed values or by replacing them with another suitable value.

 For example, all blank entries can be filled with ‘0’ values.

 Entries that mix floating-point values with alphabetic characters must be replaced by binary values, i.e. 0 or 1.
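
The mean-imputation idea above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function name `impute_mean`, the `missing_markers` parameter, and the sample `ages` list are all hypothetical, and which placeholders count as "missing" depends on the dataset.

```python
from statistics import mean

def impute_mean(values, missing_markers=(None, "", "NA")):
    """Replace missing entries with the mean of the observed values.

    `missing_markers` is an assumption: adjust it to whatever placeholders
    the dataset actually uses for missing data.
    """
    observed = [v for v in values if v not in missing_markers]
    fill = mean(observed)
    return [fill if v in missing_markers else v for v in values]

# Hypothetical column with two missing entries.
ages = [25, None, 31, "", 40]
cleaned = impute_mean(ages)
print(cleaned)
```

Removing the affected rows instead of imputing is the alternative mentioned above; which one is appropriate depends on how many values are missing and how they affect the analysis.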

2. Transformation:

 Data transformation involves converting variables into a suitable format for analysis.

This can include –

• Normalization - In this method, the data is transformed to a specific range, usually between 0 and 1.

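The formula for normalization is:

Normalization = $\frac{x - x_{\min}}{x_{\max} - x_{\min}}$

where $x_{\min}$ and $x_{\max}$ are the smallest and largest values of the variable.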
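
As a hedged sketch, min-max normalization to the range 0 to 1 subtracts the minimum from each value and divides by the range (maximum minus minimum). The function name `min_max_normalize` and the sample `scores` are illustrative assumptions; returning zeros for a constant column is one possible convention, not the only one.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: the range is zero, so return zeros by convention.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

scores = [10, 20, 30, 50]
normalized = min_max_normalize(scores)
print(normalized)  # [0.0, 0.25, 0.5, 1.0]
```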

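• Standardization (Z-Score Scaling) - Standardization transforms the data to have a mean of 0 and a standard deviation of 1.

The formula for standardization is:

Standardization = $\frac{x - \mu}{\sigma}$

where $\mu$ is the mean and $\sigma$ is the standard deviation of the variable.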
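
A minimal sketch of z-score standardization follows: subtract the mean from each value and divide by the standard deviation. It assumes the population standard deviation (`statistics.pstdev`); some tools use the sample standard deviation instead. The function name `standardize` and the sample `data` are illustrative.

```python
from statistics import mean, pstdev

def standardize(values):
    """Shift and scale values to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        # Constant column: no spread to scale by, so return zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, population std 2
z = standardize(data)
print(z)
```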

• Scaling - Scaling (often called robust scaling) is similar to standardization but uses the median and interquartile range (IQR) instead of the mean and standard deviation. This makes it more resistant to the influence of outliers.

Scaling = $\frac{x - \mathrm{median}(x)}{\mathrm{IQR}(x)}$
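
The median/IQR scaling described above can be sketched as follows. The quartile convention is an assumption: this version uses `statistics.quantiles` with `method="inclusive"`, and other conventions give slightly different IQR values. The function name `robust_scale` and the sample `data` are illustrative.

```python
from statistics import median, quantiles

def robust_scale(values):
    """Center on the median and divide by the interquartile range (Q3 - Q1)."""
    med = median(values)
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        # No spread between the quartiles: return zeros by convention.
        return [0.0 for _ in values]
    return [(v - med) / iqr for v in values]

data = [1, 2, 3, 4, 100]  # 100 is an outlier
scaled = robust_scale(data)
print(scaled)
```

In this sample the outlier 100 does not shift the median or the IQR, so the other four values keep a sensible scale; with mean and standard deviation they would be squeezed close together.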

3. Data Integration

 Data integration is a crucial step in the data management process that involves combining and unifying
data from various sources into a single, coherent, and organized view.

 The goal of data integration is to provide a unified and comprehensive view of data, making it easier to
analyze, report on, and derive insights from the information contained in disparate datasets.
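
As a small illustration of integration, the sketch below combines two hypothetical sources that describe the same customers under different field names (`customer_id` in one, `cust` in the other) into a single unified view. All names and records here are invented for the example; real integration also has to deal with schema conflicts, duplicates, and inconsistent units.

```python
# Two hypothetical sources with different schemas.
crm_rows = [
    {"customer_id": 1, "name": "Asha"},
    {"customer_id": 2, "name": "Ben"},
]
billing_rows = [
    {"cust": 1, "total": 120.0},
    {"cust": 1, "total": 30.0},
    {"cust": 3, "total": 15.0},  # no matching CRM record
]

def integrate(crm, billing):
    """Attach the summed billing total to each CRM customer record."""
    totals = {}
    for row in billing:
        totals[row["cust"]] = totals.get(row["cust"], 0.0) + row["total"]
    # Customers without billing rows get a total of 0.0.
    return [{**c, "total_billed": totals.get(c["customer_id"], 0.0)} for c in crm]

unified = integrate(crm_rows, billing_rows)
print(unified)
```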

4. Data Reduction

 Data reduction is a technique in data analysis and data mining that reduces the volume of a dataset while still producing the same or similar analytical results.

 The primary goal of data reduction is to simplify the data while retaining the essential information and
patterns, which can be beneficial for various purposes, including improving efficiency, speeding up
algorithms, reducing storage requirements, and gaining a better understanding of the data.

Techniques for Data Reduction:

(a) Dimensionality Reduction: Dimensionality reduction techniques aim to reduce the number of
variables or features in the dataset while preserving as much relevant information as possible.
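Two common approaches to selecting a subset of attributes are:
1. Stepwise forward selection: start with an empty set and repeatedly add the attribute that improves the result the most.
2. Stepwise backward elimination: start with the full set and repeatedly remove the attribute that contributes the least.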
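
Stepwise forward selection can be sketched as a greedy loop: at each step, add the single feature that raises a score the most, and stop when no feature helps. The scoring function is the key assumption. In practice it would be a model-quality measure; here `toy_score` and the `USEFUL` table are invented purely so the example runs.

```python
def forward_selection(features, score, min_gain=1e-9):
    """Greedily add the feature with the best score until no feature improves it."""
    selected = []
    best = score(selected)
    remaining = list(features)
    while remaining:
        # Evaluate each remaining feature when added to the current subset.
        candidate, cand_score = max(
            ((f, score(selected + [f])) for f in remaining),
            key=lambda pair: pair[1],
        )
        if cand_score - best <= min_gain:
            break  # no remaining feature improves the score
        selected.append(candidate)
        remaining.remove(candidate)
        best = cand_score
    return selected

# Hypothetical usefulness of each feature, minus a small penalty per feature.
USEFUL = {"age": 0.5, "income": 0.3, "zip": 0.0}

def toy_score(subset):
    return sum(USEFUL[f] for f in subset) - 0.05 * len(subset)

chosen = forward_selection(["zip", "income", "age"], toy_score)
print(chosen)  # ['age', 'income']
```

Backward elimination mirrors this loop: it starts from all features and removes, one at a time, the feature whose removal hurts the score the least.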

(b) Data Compression:

 Data compression is the process of reducing the size of data files or streams while preserving as much of
the original information as possible.

 It is widely used in various applications to save storage space, reduce transmission time over networks,
and improve overall system efficiency.
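There are two primary types of data compression:
1. Lossless: the original data can be reconstructed exactly from the compressed form.
2. Lossy: only an approximation of the original data can be reconstructed.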
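
The difference between the two types can be illustrated with a short sketch. The lossless part uses Python's standard `zlib` module, which restores the input byte for byte. The lossy part uses rounding as a deliberately simple stand-in: precision is discarded and cannot be recovered. The sample text and readings are invented.

```python
import zlib

# Lossless: compress, then decompress back to the exact original bytes.
text = ("data preprocessing " * 50).encode("utf-8")
compressed = zlib.compress(text)
restored = zlib.decompress(compressed)
print(len(text), len(compressed), restored == text)

# Lossy (illustrative stand-in): rounding discards digits permanently.
readings = [3.14159, 2.71828, 1.41421]
lossy = [round(r, 2) for r in readings]
print(lossy)  # [3.14, 2.72, 1.41]
```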

(c) Numerosity Reduction:

 Numerosity reduction is a data mining technique used to reduce the number of data points in a dataset while retaining its essential characteristics and patterns.

 The goal of numerosity reduction is to simplify complex datasets by representing them with a smaller set of
representative data points or summary statistics.

 This reduction in data volume can make it more manageable for analysis, visualization, and model building
while still preserving meaningful information.
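
One simple form of numerosity reduction is random sampling without replacement, where a smaller subset stands in for the full dataset. The sketch below is a minimal illustration; the function name, the fixed `seed`, and the synthetic `population` are assumptions made for the example.

```python
import random
from statistics import mean

def sample_without_replacement(values, k, seed=0):
    """Return k distinct data points chosen uniformly at random."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    return rng.sample(values, k)

population = list(range(1, 1001))  # 1000 synthetic data points
subset = sample_without_replacement(population, 100)

# The sample mean should usually be close to the population mean.
print(len(subset), mean(population), mean(subset))
```

Other forms mentioned above, such as replacing the data with summary statistics, trade individual data points for an even more compact description.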

(d) Discretization

 Discretization is a data preprocessing technique used in data mining and machine learning to convert
continuous (numerical) data into discrete (categorical) intervals or bins.

 It involves grouping data points into specific ranges or categories based on their values.

 Discretization is used to simplify data analysis, to improve model performance, and to meet the requirements of algorithms that expect categorical input.
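
Equal-width binning is one common way to discretize a continuous variable: the range between the minimum and maximum is split into intervals of equal size, and each value is replaced by the index of its interval. The sketch below is an assumption-laden illustration; the function name, the number of bins, and the sample `ages` are invented.

```python
def equal_width_bins(values, n_bins):
    """Map each value to a bin index from 0 to n_bins - 1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values identical: everything falls in the first bin.
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [18, 22, 35, 47, 60, 78]
labels = equal_width_bins(ages, 3)  # bins: 18-38, 38-58, 58-78
print(labels)  # [0, 0, 0, 1, 2, 2]
```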
