Chap.3 Data Preprocessing
What is Data Preprocessing?
Real-world datasets are generally messy: raw, incomplete, inconsistent, and often unusable as-is. They can
contain manual entry errors, missing values, inconsistent schemas, etc. “Data Preprocessing is
the process of converting raw data into a format that is understandable and usable”. It is a crucial
step in any Data Science project for carrying out efficient and accurate analysis.
Data Preprocessing is an important step in the Data Preparation stage of a Data Science
development lifecycle that will ensure reliable, robust, and consistent results.
Accuracy - Data Preprocessing will ensure that input data is accurate and reliable by
ensuring there are no manual entry errors, no duplicates, etc.
Completeness - It ensures that missing values are handled, and data is complete for
further analysis.
Consistency - Data Preprocessing ensures that input data is consistent, i.e., the same data
kept in different places should match.
Interpretability - Raw data is generally unusable, and Data Preprocessing converts raw
data into an interpretable format.
Data preprocessing is an important step in the data mining process that involves
cleaning and transforming raw data to make it suitable for analysis. Some common
steps in data preprocessing include:
Data Cleaning: This involves identifying and correcting errors or inconsistencies in the
data, such as missing values, outliers, and duplicates. Various techniques can be used
for data cleaning, such as imputation, removal, and transformation.
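As a concrete illustration, the following sketch applies removal, imputation, and transformation to a small, entirely hypothetical table, assuming Python with pandas (the column names and values are invented for illustration):

import pandas as pd
import numpy as np

# Hypothetical raw data with missing values, a duplicate row, and an outlier
df = pd.DataFrame({
    "age":    [25, np.nan, 47, 47, 300],            # 300 is an implausible outlier
    "income": [50000, 62000, np.nan, np.nan, 58000],
    "city":   ["Pune", "Delhi", "Mumbai", "Mumbai", "Delhi"],
})

df = df.drop_duplicates()                            # removal of duplicate records
df["age"] = df["age"].fillna(df["age"].median())     # imputation of missing values
df["income"] = df["income"].fillna(df["income"].median())
df["age"] = df["age"].clip(upper=100)                # transformation: cap the outlier
print(df)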
Data Integration: This involves combining data from multiple sources to create a
unified dataset. Data integration can be challenging as it requires handling data with
different formats, structures, and semantics. Techniques such as record linkage and data
fusion can be used for data integration.
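The sketch below shows one simple form of data integration, assuming pandas and two invented source tables whose key columns are named differently:

import pandas as pd

# Two hypothetical sources describing the same customers under different schemas
crm = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["Asha", "Ravi", "Meena"]})
billing = pd.DataFrame({"customer": [2, 3, 4], "total_spend": [120.0, 75.5, 40.0]})

# Reconcile the key names, then merge into a single unified dataset
billing = billing.rename(columns={"customer": "cust_id"})
unified = crm.merge(billing, on="cust_id", how="outer")
print(unified)

Record linkage in practice also has to match entities that share no common key (for example, by comparing names and addresses); the exact-key merge above is only the simplest case.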
Data Transformation: This involves converting the data into a suitable format for
analysis. Common techniques used in data transformation include normalization,
standardization, and discretization. Normalization is used to scale the data to a common
range, while standardization is used to transform the data to have zero mean and unit
variance. Discretization is used to convert continuous data into discrete categories.
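A minimal sketch of standardization (the z-score transformation), assuming pandas and a hypothetical height column:

import pandas as pd

df = pd.DataFrame({"height_cm": [150.0, 160.0, 170.0, 180.0]})  # invented values

# Standardization: subtract the mean and divide by the standard deviation,
# giving a column with zero mean and unit variance
df["height_std"] = (df["height_cm"] - df["height_cm"].mean()) / df["height_cm"].std()
print(df)

Normalization and discretization are illustrated separately below.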
Data Reduction: This involves reducing the size of the dataset while preserving the
important information. Data reduction can be achieved through techniques such as
feature selection and feature extraction. Feature selection involves selecting a subset of
relevant features from the dataset, while feature extraction involves transforming the
data into a lower-dimensional space while preserving the important information.
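The following sketch contrasts feature selection and feature extraction on a small invented dataset, assuming scikit-learn (SelectKBest for selection, PCA for extraction):

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

# Hypothetical feature matrix (6 samples, 4 features) and class labels
X = np.array([[1.0, 200, 0.5, 7],
              [1.1, 180, 0.4, 6],
              [0.9, 220, 0.6, 7],
              [3.0,  50, 2.5, 1],
              [3.2,  40, 2.7, 2],
              [2.9,  60, 2.4, 1]])
y = np.array([0, 0, 0, 1, 1, 1])

# Feature selection: keep the 2 original features most related to the labels
X_selected = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

# Feature extraction: project all 4 features onto 2 new principal components
X_extracted = PCA(n_components=2).fit_transform(X)

print(X_selected.shape, X_extracted.shape)  # both reduce to (6, 2)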
Data Discretization: This involves dividing continuous data into discrete categories or
intervals. Discretization is often used in data mining and machine learning algorithms
that require categorical data. Discretization can be achieved through techniques such as
equal width binning, equal frequency binning, and clustering.
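A short sketch of equal width and equal frequency binning, assuming pandas and an invented series of ages:

import pandas as pd

ages = pd.Series([22, 25, 31, 38, 45, 52, 61, 70])   # hypothetical continuous values

equal_width = pd.cut(ages, bins=4)    # 4 intervals of equal range (width 12 here)
equal_freq = pd.qcut(ages, q=4)       # 4 bins, each holding roughly 2 values

print(pd.DataFrame({"age": ages, "equal_width": equal_width, "equal_freq": equal_freq}))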
Data Normalization: This involves scaling the data to a common range, such as
between 0 and 1 or -1 and 1. Normalization is often used to handle data with different
units and scales. Common normalization techniques include min-max normalization, z-
score normalization, and decimal scaling.
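The three normalization techniques can be sketched in a few lines, again assuming pandas/NumPy and invented values:

import numpy as np
import pandas as pd

x = pd.Series([200.0, 300.0, 400.0, 600.0, 1000.0])   # hypothetical attribute values

min_max = (x - x.min()) / (x.max() - x.min())          # min-max: rescale to [0, 1]
z_score = (x - x.mean()) / x.std()                     # z-score: zero mean, unit variance

# Decimal scaling: divide by 10^j, where j is the smallest integer such that
# every scaled absolute value is below 1 (here j = 4, since the maximum is 1000)
j = int(np.ceil(np.log10(x.abs().max() + 1)))
decimal_scaled = x / (10 ** j)

print(pd.DataFrame({"x": x, "min_max": min_max, "z_score": z_score, "decimal": decimal_scaled}))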
Data preprocessing plays a crucial role in ensuring the quality of data and the accuracy
of the analysis results. The specific steps involved in data preprocessing may vary
depending on the nature of the data and the analysis goals.
Types of Attributes:
Some attributes are quantitative, taking on numerical values that can be measured or
counted, such as height, weight, or temperature. Others are categorical (qualitative), describing a
quality or characteristic rather than a measurable amount, such as color or marital status.
Categorical attributes can be further classified as nominal (no inherent order) or ordinal
(possessing a meaningful order).
Example –
Consider a person: their name, address, email, etc. are the attributes that make up their
contact information.
Quantitative Attributes:
Numeric Attributes:
A numeric attribute is quantitative; that is, it is a measurable quantity represented by integer or
real values. Numeric attributes can be of two types: interval-scaled and ratio-scaled. Interval-scaled
attributes are measured on a scale of equal-sized units but have no true zero point (e.g., temperature
in Celsius), so ratios of values are not meaningful. Ratio-scaled attributes do have a true zero point,
so one value can meaningfully be described as a multiple of another.
Example – The Kelvin (K) temperature scale has what is considered a true zero point: the point
at which the particles that constitute matter have zero kinetic energy. Kelvin temperature is
therefore a ratio-scaled attribute.
More broadly, attributes can also be classified as discrete or continuous.
Discrete Attribute:
A discrete attribute has a finite or countably infinite set of values, which may or may not be
represented as integers.
Example: The attributes skin color, drinker, medical report, and drink size each have a
finite number of values, and so are discrete.
Continuous Attribute:
A continuous attribute has real numbers as attribute values.
Example – Height, weight, and temperature have real values. In practice, real values can only be
represented and measured using a finite number of digits, so continuous attributes are typically
represented as floating-point variables.
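These attribute types map naturally onto column data types. A small illustration, assuming pandas and invented columns:

import pandas as pd

df = pd.DataFrame({
    "drink_size": ["small", "medium", "large"],   # discrete: a finite set of categories
    "num_visits": [3, 7, 1],                      # discrete: integer counts
    "temperature": [36.6, 37.2, 38.1],            # continuous: real-valued measurements
})

print(df.dtypes)   # object, int64, float64 - the continuous attribute is a float column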
Data Quality:
Data quality is a measure of a data set's condition based on factors such as accuracy,
completeness, consistency, reliability and validity. Measuring data quality can help organizations
identify errors and inconsistencies in their data and assess whether the data fits its intended
purpose.
Low-quality data can have significant business consequences for an organization. Bad data is
often the culprit behind operational snafus, inaccurate analytics and ill-conceived business
strategies. For example, it can cause problems such as shipping products to the wrong customer
addresses.
Data quality is commonly assessed along dimensions such as the following:
Accuracy. The data correctly represents the entities or events it is supposed to represent, and
the data comes from sources that are verifiable and trustworthy.
Consistency. The data is uniform across systems and data sets, and there are no conflicts
between the same data values in different systems or data sets.
Validity. The data conforms to defined business rules and parameters, which ensure that the
data is properly structured and contains the values it should.
Completeness. The data includes all the values and types of data it is expected to contain,
including any metadata that should accompany the data sets.
Timeliness. The data is current (relative to its specific requirements) and is available to use
when it's needed.
Uniqueness. The data does not contain duplicate records within a single data set, and every
record can be uniquely identified.
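Several of these dimensions can be checked programmatically. A minimal sketch, assuming pandas and a hypothetical customer table, of simple completeness, uniqueness, and validity checks:

import pandas as pd

customers = pd.DataFrame({
    "cust_id": [1, 2, 2, 4],
    "email":   ["a@x.com", None, "b@x.com", "b@x.com"],
})

print(customers.isnull().mean())                 # completeness: share of missing values per column
print(customers["cust_id"].duplicated().sum())   # uniqueness: repeated identifiers
print(customers.duplicated().sum())              # uniqueness: fully duplicated records

# Validity: a simple business rule - emails, where present, must contain '@'
print(customers["email"].dropna().str.contains("@").all())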
Data Munging:
Data wrangling is the process of cleaning, structuring, and transforming raw data into a usable
format for analysis. Also known as data munging, it involves tasks such as handling missing or
inconsistent data, formatting data types, and merging different datasets to prepare the data for
further exploration and modeling in data analysis or machine learning projects.
Wrangling data involves the systematic and iterative transformation of raw, unstructured, or
messy data into a clean, structured, and usable format for data science and analytics.
Step 1: Discover
Initially, your focus is on understanding and exploring the data you’ve gathered. This involves
identifying data sources, assessing data quality, and gaining insights into the structure and format
of the data. Your goal is to establish a foundation for the subsequent data preparation steps by
recognizing potential challenges and opportunities in the data.
Step 2: Structure
In the data structuring step, you organize and format the raw data in a way that facilitates
efficient analysis. The specific form your data will take depends on which analytical model
you’re using, but structuring typically involves reshaping data, handling missing values, and
converting data types. This ensures that the data is presented in a coherent and standardized
manner, laying the groundwork for further manipulation and exploration.
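As a small example of structuring, assuming pandas and an invented wide-format sales table, the sketch below converts string columns to integers and reshapes the data into a long format:

import pandas as pd

wide = pd.DataFrame({
    "store": ["A", "B"],
    "jan_sales": ["100", "150"],   # numbers arrived as strings
    "feb_sales": ["120", "130"],
})

# Convert data types, then reshape from wide to long for easier analysis
wide[["jan_sales", "feb_sales"]] = wide[["jan_sales", "feb_sales"]].astype(int)
long_format = wide.melt(id_vars="store", var_name="month", value_name="sales")
print(long_format)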
Step 3: Clean
Data cleansing is a crucial step to address inconsistencies, errors, and outliers within the dataset.
This involves removing or correcting inaccurate data, handling duplicates, and addressing any
anomalies that could impact the reliability of analyses. By cleaning the data, your focus is on
enhancing data accuracy and reliability for downstream processes.
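One common cleaning technique not shown earlier is outlier detection with the interquartile range (IQR) rule; a brief sketch, assuming pandas and invented values:

import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 95])   # hypothetical column; 95 looks anomalous

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers)   # 95 is flagged; it could then be removed, capped, or investigated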
Step 4: Enrich
Enriching your data involves enhancing it with additional information to provide more context or
depth. This can include merging datasets, extracting relevant features, or incorporating external
data sources. The goal is to augment the original dataset, making it more comprehensive and
valuable for analysis. If you do add data, be sure to structure and clean that new data.
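A tiny illustration of enrichment, assuming pandas and an invented orders table, is deriving extra features from an existing date column:

import pandas as pd

orders = pd.DataFrame({
    "order_id": [101, 102],
    "order_date": pd.to_datetime(["2024-03-04", "2024-07-21"]),   # hypothetical data
})

# Derived features add context that the raw date alone does not make obvious
orders["day_of_week"] = orders["order_date"].dt.day_name()
orders["is_weekend"] = orders["order_date"].dt.dayofweek >= 5
print(orders)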
Step 5: Validate
Validation ensures the quality and reliability of your processed data. You’ll check for
inconsistencies, verify data integrity, and confirm that the data adheres to predefined standards.
Validation helps in building your confidence in the accuracy of the dataset and ensures that it
meets the requirements for meaningful analysis.
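Validation rules are often encoded as explicit checks that fail loudly. A minimal sketch, assuming pandas and hypothetical standards for a customer table:

import pandas as pd

df = pd.DataFrame({
    "cust_id": [1, 2, 3],
    "age": [25, 34, 41],
    "email": ["a@x.com", "b@x.com", "c@x.com"],
})

# A failed assertion stops the pipeline before bad data reaches analysis
assert df["cust_id"].is_unique, "cust_id must uniquely identify each record"
assert df["age"].between(0, 120).all(), "age must fall in a plausible range"
assert df.notnull().all().all(), "no missing values are allowed at this stage"
print("All validation checks passed")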
Step 6: Publish
Now your curated and validated dataset is prepared for analysis or dissemination to business
users. This involves documenting data lineage and the steps taken during the entire wrangling
process, sharing metadata, and preparing the data for storage or integration into data science and
analytics tools. Publishing facilitates collaboration and allows others to use the data for their
analyses or decision-making processes.