Chap.3 Data Preprocessing
What is Data Preprocessing?
Real-world datasets are generally messy: raw, incomplete, inconsistent, and often unusable as-is. They can
contain manual entry errors, missing values, inconsistent schemas, etc. “Data Preprocessing is
the process of converting raw data into a format that is understandable and usable”. It is a crucial
step in any Data Science project for carrying out efficient and accurate analysis.
Data Preprocessing is an important step in the Data Preparation stage of a Data Science
development lifecycle that will ensure reliable, robust, and consistent results.
Accuracy - Data Preprocessing will ensure that input data is accurate and reliable by
ensuring there are no manual entry errors, no duplicates, etc.
Completeness - It ensures that missing values are handled, and data is complete for
further analysis.
Consistency - Data Preprocessing ensures that input data is consistent, i.e., the same data
kept in different places should match.
Interpretability - Raw data is generally unusable, and Data Preprocessing converts raw
data into an interpretable format.
Data preprocessing is an important step in the data mining process that involves
cleaning and transforming raw data to make it suitable for analysis. Some common
steps in data preprocessing include:
Data Cleaning: This involves identifying and correcting errors or inconsistencies in the
data, such as missing values, outliers, and duplicates. Various techniques can be used
for data cleaning, such as imputation, removal, and transformation.
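As a concrete illustration, the following sketch applies removal, imputation, and transformation to a small, entirely hypothetical table, assuming Python with pandas (the column names and values are invented for illustration):

import pandas as pd
import numpy as np

# Hypothetical raw data with missing values, a duplicate row, and an outlier
df = pd.DataFrame({
    "age":    [25, np.nan, 47, 47, 300],            # 300 is an implausible outlier
    "income": [50000, 62000, np.nan, np.nan, 58000],
    "city":   ["Pune", "Delhi", "Mumbai", "Mumbai", "Delhi"],
})

df = df.drop_duplicates()                            # removal of duplicate records
df["age"] = df["age"].fillna(df["age"].median())     # imputation of missing values
df["income"] = df["income"].fillna(df["income"].median())
df["age"] = df["age"].clip(upper=100)                # transformation: cap the outlier
print(df)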
Data Integration: This involves combining data from multiple sources to create a
unified dataset. Data integration can be challenging as it requires handling data with
different formats, structures, and semantics. Techniques such as record linkage and data
fusion can be used for data integration.
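The sketch below shows one simple form of data integration, assuming pandas and two invented source tables whose key columns are named differently:

import pandas as pd

# Two hypothetical sources describing the same customers under different schemas
crm = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["Asha", "Ravi", "Meena"]})
billing = pd.DataFrame({"customer": [2, 3, 4], "total_spend": [120.0, 75.5, 40.0]})

# Reconcile the key names, then merge into a single unified dataset
billing = billing.rename(columns={"customer": "cust_id"})
unified = crm.merge(billing, on="cust_id", how="outer")
print(unified)

Record linkage in practice also has to match entities that share no common key (for example, by comparing names and addresses); the exact-key merge above is only the simplest case.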
Data Transformation: This involves converting the data into a suitable format for
analysis. Common techniques used in data transformation include normalization,
standardization, and discretization. Normalization is used to scale the data to a common
range, while standardization is used to transform the data to have zero mean and unit
variance. Discretization is used to convert continuous data into discrete categories.
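A minimal sketch of standardization (the z-score transformation), assuming pandas and a hypothetical height column:

import pandas as pd

df = pd.DataFrame({"height_cm": [150.0, 160.0, 170.0, 180.0]})  # invented values

# Standardization: subtract the mean and divide by the standard deviation,
# giving a column with zero mean and unit variance
df["height_std"] = (df["height_cm"] - df["height_cm"].mean()) / df["height_cm"].std()
print(df)

Normalization and discretization are illustrated separately below.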
Data Reduction: This involves reducing the size of the dataset while preserving the
important information. Data reduction can be achieved through techniques such as
feature selection and feature extraction. Feature selection involves selecting a subset of
relevant features from the dataset, while feature extraction involves transforming the
data into a lower-dimensional space while preserving the important information.
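The following sketch contrasts feature selection and feature extraction on a small invented dataset, assuming scikit-learn (SelectKBest for selection, PCA for extraction):

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

# Hypothetical feature matrix (6 samples, 4 features) and class labels
X = np.array([[1.0, 200, 0.5, 7],
              [1.1, 180, 0.4, 6],
              [0.9, 220, 0.6, 7],
              [3.0,  50, 2.5, 1],
              [3.2,  40, 2.7, 2],
              [2.9,  60, 2.4, 1]])
y = np.array([0, 0, 0, 1, 1, 1])

# Feature selection: keep the 2 original features most related to the labels
X_selected = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

# Feature extraction: project all 4 features onto 2 new principal components
X_extracted = PCA(n_components=2).fit_transform(X)

print(X_selected.shape, X_extracted.shape)  # both reduce to (6, 2)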
Data Discretization: This involves dividing continuous data into discrete categories or
intervals. Discretization is often used in data mining and machine learning algorithms
that require categorical data. Discretization can be achieved through techniques such as
equal width binning, equal frequency binning, and clustering.
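A short sketch of equal width and equal frequency binning, assuming pandas and an invented series of ages:

import pandas as pd

ages = pd.Series([22, 25, 31, 38, 45, 52, 61, 70])   # hypothetical continuous values

equal_width = pd.cut(ages, bins=4)    # 4 intervals of equal range (width 12 here)
equal_freq = pd.qcut(ages, q=4)       # 4 bins, each holding roughly 2 values

print(pd.DataFrame({"age": ages, "equal_width": equal_width, "equal_freq": equal_freq}))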
Data Normalization: This involves scaling the data to a common range, such as
between 0 and 1 or -1 and 1. Normalization is often used to handle data with different
units and scales. Common normalization techniques include min-max normalization, z-
score normalization, and decimal scaling.
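The three normalization techniques can be sketched in a few lines, again assuming pandas/NumPy and invented values:

import numpy as np
import pandas as pd

x = pd.Series([200.0, 300.0, 400.0, 600.0, 1000.0])   # hypothetical attribute values

min_max = (x - x.min()) / (x.max() - x.min())          # min-max: rescale to [0, 1]
z_score = (x - x.mean()) / x.std()                     # z-score: zero mean, unit variance

# Decimal scaling: divide by 10^j, where j is the smallest integer such that
# every scaled absolute value is below 1 (here j = 4, since the maximum is 1000)
j = int(np.ceil(np.log10(x.abs().max() + 1)))
decimal_scaled = x / (10 ** j)

print(pd.DataFrame({"x": x, "min_max": min_max, "z_score": z_score, "decimal": decimal_scaled}))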
Data preprocessing plays a crucial role in ensuring the quality of data and the accuracy
of the analysis results. The specific steps involved in data preprocessing may vary
depending on the nature of the data and the analysis goals.
Types of Attributes:
Some attributes are quantitative, taking on numerical values that can be measured or
counted, such as height, weight, or temperature. Others are categorical (qualitative), describing a
quality or characteristic rather than a measurable amount, such as color or marital status.
Categorical attributes can be further classified as nominal (no inherent order) or ordinal
(possessing a meaningful order).
Example –
Consider a person: their name, address, email, etc. are the attributes that make up their
contact information.
Quantitative Attributes:
Numeric Attributes:
A numeric attribute is quantitative; that is, it is a measurable quantity represented by integer or
real values. Numeric attributes can be of two types: interval-scaled and ratio-scaled. Interval-scaled
attributes are measured on a scale of equal-sized units but have no true zero point (e.g., temperature
in Celsius), so ratios of values are not meaningful. Ratio-scaled attributes do have a true zero point,
so one value can meaningfully be described as a multiple of another.
Example – The Kelvin (K) temperature scale has what is considered a true zero point: the point
at which the particles that constitute matter have zero kinetic energy. Kelvin temperature is
therefore a ratio-scaled attribute.
More broadly, attributes can also be classified as discrete or continuous.
Discrete Attribute:
A discrete attribute has a finite or countably infinite set of values, which may or may not be
represented as integers.
Example: The attributes skin color, drinker, medical report, and drink size each have a
finite number of values, and so are discrete.
Continuous Attribute:
A continuous attribute has real numbers as attribute values.
Example – Height, weight, and temperature have real values. In practice, real values can only be
represented and measured using a finite number of digits, so continuous attributes are typically
represented as floating-point variables.
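These attribute types map naturally onto column data types. A small illustration, assuming pandas and invented columns:

import pandas as pd

df = pd.DataFrame({
    "drink_size": ["small", "medium", "large"],   # discrete: a finite set of categories
    "num_visits": [3, 7, 1],                      # discrete: integer counts
    "temperature": [36.6, 37.2, 38.1],            # continuous: real-valued measurements
})

print(df.dtypes)   # object, int64, float64 - the continuous attribute is a float column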
Data Quality:
Data quality is a measure of a data set's condition based on factors such as accuracy,
completeness, consistency, reliability and validity. Measuring data quality can help organizations
identify errors and inconsistencies in their data and assess whether the data fits its intended
purpose.
Low-quality data can have significant business consequences for an organization. Bad data is
often the culprit behind operational snafus, inaccurate analytics and ill-conceived business
strategies. For example, it can cause problems such as shipping products to the wrong customer
addresses.
Data quality is commonly assessed along dimensions such as the following:
Accuracy. The data correctly represents the entities or events it is supposed to represent, and
the data comes from sources that are verifiable and trustworthy.
Consistency. The data is uniform across systems and data sets, and there are no conflicts
between the same data values in different systems or data sets.
Validity. The data conforms to defined business rules and parameters, which ensure that the
data is properly structured and contains the values it should.
Completeness. The data includes all the values and types of data it is expected to contain,
including any metadata that should accompany the data sets.
Timeliness. The data is current (relative to its specific requirements) and is available to use
when it's needed.
Uniqueness. The data does not contain duplicate records within a single data set, and every
record can be uniquely identified.
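Several of these dimensions can be checked programmatically. A minimal sketch, assuming pandas and a hypothetical customer table, of simple completeness, uniqueness, and validity checks:

import pandas as pd

customers = pd.DataFrame({
    "cust_id": [1, 2, 2, 4],
    "email":   ["a@x.com", None, "b@x.com", "b@x.com"],
})

print(customers.isnull().mean())                 # completeness: share of missing values per column
print(customers["cust_id"].duplicated().sum())   # uniqueness: repeated identifiers
print(customers.duplicated().sum())              # uniqueness: fully duplicated records

# Validity: a simple business rule - emails, where present, must contain '@'
print(customers["email"].dropna().str.contains("@").all())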
Data Munging:
Data wrangling is the process of cleaning, structuring, and transforming raw data into a usable
format for analysis. Also known as data munging, it involves tasks such as handling missing or
inconsistent data, formatting data types, and merging different datasets to prepare the data for
further exploration and modeling in data analysis or machine learning projects.
Wrangling data involves the systematic and iterative transformation of raw, unstructured, or
messy data into a clean, structured, and usable format for data science and analytics.
Step 1: Discover
Initially, your focus is on understanding and exploring the data you’ve gathered. This involves
identifying data sources, assessing data quality, and gaining insights into the structure and format
of the data. Your goal is to establish a foundation for the subsequent data preparation steps by
recognizing potential challenges and opportunities in the data.
Step 2: Structure
In the data structuring step, you organize and format the raw data in a way that facilitates
efficient analysis. The specific form your data will take depends on which analytical model
you’re using, but structuring typically involves reshaping data, handling missing values, and
converting data types. This ensures that the data is presented in a coherent and standardized
manner, laying the groundwork for further manipulation and exploration.
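As a small example of structuring, assuming pandas and an invented wide-format sales table, the sketch below converts string columns to integers and reshapes the data into a long format:

import pandas as pd

wide = pd.DataFrame({
    "store": ["A", "B"],
    "jan_sales": ["100", "150"],   # numbers arrived as strings
    "feb_sales": ["120", "130"],
})

# Convert data types, then reshape from wide to long for easier analysis
wide[["jan_sales", "feb_sales"]] = wide[["jan_sales", "feb_sales"]].astype(int)
long_format = wide.melt(id_vars="store", var_name="month", value_name="sales")
print(long_format)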
Step 3: Clean
Data cleansing is a crucial step to address inconsistencies, errors, and outliers within the dataset.
This involves removing or correcting inaccurate data, handling duplicates, and addressing any
anomalies that could impact the reliability of analyses. By cleaning the data, your focus is on
enhancing data accuracy and reliability for downstream processes.
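One common cleaning technique not shown earlier is outlier detection with the interquartile range (IQR) rule; a brief sketch, assuming pandas and invented values:

import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 95])   # hypothetical column; 95 looks anomalous

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers)   # 95 is flagged; it could then be removed, capped, or investigated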
Step 4: Enrich
Enriching your data involves enhancing it with additional information to provide more context or
depth. This can include merging datasets, extracting relevant features, or incorporating external
data sources. The goal is to augment the original dataset, making it more comprehensive and
valuable for analysis. If you do add data, be sure to structure and clean that new data.
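A tiny illustration of enrichment, assuming pandas and an invented orders table, is deriving extra features from an existing date column:

import pandas as pd

orders = pd.DataFrame({
    "order_id": [101, 102],
    "order_date": pd.to_datetime(["2024-03-04", "2024-07-21"]),   # hypothetical data
})

# Derived features add context that the raw date alone does not make obvious
orders["day_of_week"] = orders["order_date"].dt.day_name()
orders["is_weekend"] = orders["order_date"].dt.dayofweek >= 5
print(orders)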
Step 5: Validate
Validation ensures the quality and reliability of your processed data. You’ll check for
inconsistencies, verify data integrity, and confirm that the data adheres to predefined standards.
Validation helps in building your confidence in the accuracy of the dataset and ensures that it
meets the requirements for meaningful analysis.
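Validation rules are often encoded as explicit checks that fail loudly. A minimal sketch, assuming pandas and hypothetical standards for a customer table:

import pandas as pd

df = pd.DataFrame({
    "cust_id": [1, 2, 3],
    "age": [25, 34, 41],
    "email": ["a@x.com", "b@x.com", "c@x.com"],
})

# A failed assertion stops the pipeline before bad data reaches analysis
assert df["cust_id"].is_unique, "cust_id must uniquely identify each record"
assert df["age"].between(0, 120).all(), "age must fall in a plausible range"
assert df.notnull().all().all(), "no missing values are allowed at this stage"
print("All validation checks passed")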
Step 6: Publish
Now your curated and validated dataset is prepared for analysis or dissemination to business
users. This involves documenting data lineage and the steps taken during the entire wrangling
process, sharing metadata, and preparing the data for storage or integration into data science and
analytics tools. Publishing facilitates collaboration and allows others to use the data for their
analyses or decision-making processes.