
CH. 3 DATA PREPROCESSING

Data Preprocessing
Definition: Pre-processing refers to the transformations applied to data before feeding it
to an algorithm. Data preprocessing is the technique used to convert raw data into a clean
data set. In other words, data gathered from different sources is collected in a raw format
that is not feasible for analysis.
Need for Data Preprocessing:
• To achieve better results from the model applied in a Machine Learning project, the
data has to be in a proper format. Some Machine Learning models need information in a
specific format; for example, the Random Forest algorithm does not support null values,
so null values have to be handled in the original raw data set before the algorithm can
be run.
• Another aspect is that the data set should be formatted so that more than one Machine
Learning or Deep Learning algorithm can be executed on the same data set, and the best
of them chosen.

What is Data Wrangling?

Data wrangling is sometimes referred to as data munging. It is the process of transforming
and mapping data from one "raw" data form into another format to make it more appropriate
and valuable for various downstream purposes such as analytics. The goal of data wrangling is
to ensure quality and useful data. Data analysts typically spend the majority of their time on
data wrangling rather than on the actual analysis of the data.

OR

Data Wrangling is the process of gathering, collecting, and transforming Raw data into
another format for better understanding, decision-making, accessing, and analysis in less
time. Data Wrangling is also known as Data Munging.

The process of data wrangling may include further munging, data visualization, data
aggregation, training a statistical model, and many other potential uses. Data wrangling
typically follows a set of general steps, which begin with extracting the raw data from the data
source, "munging" the raw data (e.g., sorting) or parsing the data into predefined data
structures, and finally depositing the resulting content into a data sink for storage and future
use.
Data Attributes:
• Attributes are qualities or characteristics that describe an object, individual, or
phenomenon.
• Attributes can be categorical, representing distinct categories or classes, such as
colours, types, or labels.
• Some attributes are quantitative, taking on numerical values that can be measured or
counted, such as height, weight, or temperature.

Types of Attributes :
Qualitative Attributes:
These attributes represent categories and do not have a meaningful numeric interpretation.
Examples include gender, colour, or product type. These are often referred to as nominal,
ordinal or binary attributes.
1. Nominal Attributes:
Nominal means “relating to names”. The values of a nominal attribute are symbols or
names of things. Each value represents some kind of category, code or state, and so
nominal attributes are also referred to as categorical. Example: Suppose that skin colour
and education status are two attributes describing a person. Possible values for skin
colour are dark, white, and brown. The attribute education status can contain the
values undergraduate, postgraduate, and matriculate.
2. Binary Attributes:
A binary attribute is a type of nominal attribute that contains only two categories: 0
or 1, where 0 typically means that the attribute is absent and 1 means that it is present.
Binary attributes are referred to as Boolean if the two states correspond to true and
false. Example – Given the attribute drinker describing a patient, 1 indicates that the
patient drinks, while 0 indicates that the patient does not. Similarly, a patient may
undergo a medical test that has only two possible outcomes.
3. Ordinal Attributes :
Ordinal data is a type of categorical data that possesses a meaningful order or ranking
among its categories, yet the intervals between consecutive values are not consistently
measurable or well-defined. Example –In the context of sports, an ordinal data example
would be medal rankings in a competition, such as gold, silver, and bronze.
Quantitative Attributes:
Numeric Attributes:
A numeric attribute is quantitative; that is, it is a measurable quantity represented in integer or
real values. Numeric attributes can be of two types: interval-scaled and ratio-scaled. Let’s
discuss them one by one.
1. Interval-Scaled Attributes:
Interval-scaled attributes are measured on a scale of equal-size units. The values
of interval-scaled attributes have order and can be positive, zero, or negative. Thus, in
addition to providing a ranking of values, such attributes allow us to compare and
quantify the difference between values. Example – A temperature attribute is
interval-scaled. We have a different temperature value for every new day, where each
day is an entity. By ordering the values, we obtain a ranking of entities with respect to
temperature. In addition, we can quantify the difference between values; for example,
a temperature of 20 degrees C is five degrees higher than a temperature of 15 degrees C.
2. Ratio-Scaled Attributes:
A ratio-scaled attribute is a numeric attribute with an inherent (true) zero point. In
addition, the values are ordered, and we can compute the difference between values,
as well as the mean, median, and mode. Example – The Kelvin (K) temperature scale
has what is considered a true zero point: it is the point at which the particles that
constitute matter have zero kinetic energy.
Numeric attributes can also be divided into discrete and continuous data.
• Discrete Attribute:
A discrete attribute has a finite or countably infinite set of values, which may appear
as integers. Example: The attributes skin colour, drinker, medical report, and drink
size each have a finite number of values, and so are discrete.
• Continuous Attribute:
A continuous attribute has real numbers as attribute values. Example – Height, weight,
and temperature have real values. Real values can only be represented and measured
using a finite number of digits. Continuous attributes are typically represented as
floating-point variables.

Data Objects: A collection of attributes that describe an object. Data objects can also
be referred to as samples, examples, instances, cases, entities, data points or objects.
The data object is a location or region of storage that contains a collection of attributes
or groups of values that act as an aspect, characteristic, quality, or descriptor of the
object. A vehicle is a data object which can be defined or described with the help of a
set of attributes or data.
Different data objects are present which are shown below:
• External entities such as a printer, user, speakers, keyboard, etc.
• Things such as reports, displays, signals.
• Occurrences or events such as alarm, telephone calls.
• Sales databases such as customers, store items, sales.
• Organizational units such as division, departments.
• Places such as manufacturing floor, workshops.
• Structures such as student records, accounts, files, documents.
Data Quality: Why preprocess the data?
(What is data quality? Which factors affect data qualities?)

There are six primary, or core, dimensions to data quality. These are the metrics
analysts use to determine the data’s viability and its usefulness to the people who need
it.
• Accuracy
The data must conform to actual, real-world scenarios and reflect real-world objects
and events. Analysts should use verifiable sources to confirm accuracy, which is
determined by how closely the values agree with verified, correct information sources.
• Completeness
Completeness measures the data's ability to deliver all the mandatory values that are
available successfully.
• Consistency
Data consistency describes the data’s uniformity as it moves across applications and
networks and when it comes from multiple sources. Consistency also means that the
same datasets stored in different locations should be the same and not conflict. Note
that consistent data can still be wrong.
• Timeliness
Timely data is information that is readily available whenever it’s needed. This
dimension also covers keeping the data current; data should undergo real-time updates
to ensure that it is always available and accessible.
• Uniqueness
Uniqueness means that no duplicate or redundant information overlaps across the
datasets: no record exists multiple times in the dataset. Analysts use data cleansing
and deduplication to help address a low uniqueness score.
• Validity
Data must be collected according to the organization’s defined business rules and
parameters. The information should also conform to the correct, accepted formats, and
all dataset values should fall within the proper range.
Data Munging/Wrangling Operations:
Data wrangling is the task of converting data into a feasible format that is suitable for
the consumption of the data.
The goal of data wrangling is to ensure quality and useful data.
Data munging includes operations such as cleaning data, data transformation, data
reduction, and data discretization.
Data Cleaning:
Data cleaning is fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or
incomplete data within a dataset.
The importance of data cleaning lies in the following factors:
• Improved data quality: Cleaning the data reduces errors, inconsistencies, and missing
values, which makes the data more accurate and reliable for analysis.
• Better decision-making: Consistent and clean data gives an organization insight into
comprehensive, current information and reduces the risk of making decisions on
outdated or incomplete data.
• Increased efficiency: High-quality data is easier to analyze, model, or report on; clean
data avoids much of the time and effort that otherwise goes into handling poor data
quality.
• Compliance and regulatory requirements: Industries and regulatory authorities set
standard policies on data quality; data cleaning helps an organization conform to these
standards and avoid penalties and legal risks.
Common Data Cleaning Tasks:
Data cleaning involves several key tasks, each aimed at addressing specific issues
within a dataset. Here are some of the most common tasks involved in data cleaning:
1. Handling Missing Data
Missing data is a common problem in datasets. Strategies to handle missing data include:
• Removing records: deleting rows with missing values if they are relatively few and
insignificant.
• Imputing values: replacing missing values with estimated ones, such as the mean,
median, or mode of the dataset.
• Using algorithms: employing advanced techniques like regression or machine
learning models to predict and fill in missing values.
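A minimal sketch of the first two strategies, assuming a pandas DataFrame with hypothetical "age" and "salary" columns:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical dataset with missing values (None becomes NaN)
df = pd.DataFrame({"age": [25, None, 40, 35, None],
                   "salary": [50000, 60000, None, 52000, 58000]})

# Option 1: remove records that contain missing values
df_dropped = df.dropna()

# Option 2: impute missing values with the column mean
imputer = SimpleImputer(strategy="mean")
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```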
2. Removing Duplicates
Duplicates can skew analyses and lead to inaccurate results. Identifying and removing
duplicate records ensures that each data point is unique and accurately represented.
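A short pandas sketch (the column names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 2, 3],
                   "city": ["Pune", "Mumbai", "Mumbai", "Nashik"]})

# Keep only the first occurrence of each duplicated row
df_unique = df.drop_duplicates()
```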
3. Correcting Inaccuracies
Data entry errors, such as typos or incorrect values, need to be identified and corrected.
This can involve cross-referencing with other data sources or using validation rules to
ensure data accuracy.
4. Standardizing Formats
Data may be entered in various formats, making it difficult to analyze. Standardizing
formats, such as dates, addresses, and phone numbers, ensures consistency and makes
the data easier to work with.
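A small sketch of format standardization with pandas; the column names and sample values are illustrative assumptions:

```python
import pandas as pd

# Illustrative records with inconsistent text and date formatting
df = pd.DataFrame({
    "city": [" pune ", "PUNE", "Mumbai"],
    "join_date": ["15-02-2023", "01-03-2023", "20-04-2023"],
})

# Standardize text: trim whitespace, use one consistent casing
df["city"] = df["city"].str.strip().str.title()

# Standardize dates into a single datetime type (day-first strings here)
df["join_date"] = pd.to_datetime(df["join_date"], dayfirst=True)
```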
5. Dealing with Outliers
Outliers can distort analyses and lead to misleading results. Identifying and addressing
outliers, either by removing them or transforming the data, helps maintain
the integrity of the dataset.
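A hedged sketch of outlier handling using the common interquartile-range (IQR) rule; the "salary" column and the 1.5 × IQR cut-off are illustrative choices:

```python
import pandas as pd

df = pd.DataFrame({"salary": [48000, 52000, 50000, 51000, 49000, 250000]})

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as outliers
q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df["salary"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

df_no_outliers = df[mask]   # keep only the non-outlier rows
```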

Data Transformation:
Definition:
Data transformation is the process of converting raw data into a consistent, easy-to-read
format to facilitate analysis.
The data transformation process involves converting, cleansing, and structuring data into a
usable format for analysis and to support decision-making. It includes modifying the format,
organisation, or values of data to prepare it for consumption by an application or for
analysis.
Benefits:
1. Makes data better organized.
2. Organized/transformed data is easier for both humans and computers to use.
3. Properly formatted and validated data improves data quality and protects applications from
problems such as null values, duplicates, and incorrect values.
4. Data transformation facilitates compatibility between applications, systems, and types of
data.
Advantages and Limitations of Data Transformation

Advantages of Data Transformation:
• Enhanced Data Quality: Data transformation aids in the organisation and
cleaning of data, improving its quality.
• Compatibility: It guarantees data consistency between many platforms and
systems, which is necessary for integrated business environments.
• Improved Analysis: Transformed data frequently yields more accurate and
insightful analytical results.
Limitations of Data Transformation:
• Complexity: When working with big or varied datasets, the procedure might be
laborious and complicated.
• Cost: The resources and tools needed for efficient data transformation might be
expensive.
• Risk of Data Loss: Inadequate transformations may cause important data to be
lost or distorted.
Transformation Strategies:
1. Rescaling: Rescaling means transforming the data so that it fits within a
specific scale, like 0-100 or 0-1. Rescaling allows all data values to be scaled so
that they lie between a specified minimum and maximum value (for example, 0 and 1).
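A minimal rescaling sketch with scikit-learn's MinMaxScaler (the sample values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[20.0], [35.0], [50.0], [65.0]])   # illustrative feature values

# Rescale the feature so every value lies between 0 and 1
scaler = MinMaxScaler(feature_range=(0, 1))
X_rescaled = scaler.fit_transform(X)
```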
2. Normalizing: The measurement unit used can affect the data analysis.
For example, changing measurement units from meters to inches for height or from kg
to pounds for weight may lead to very different results.
• To avoid this dependence on measurement units, the data needs to be
normalized.
• Normalization scales the attribute values into a small specified range,
such as 0.0 to 1.0 or -1.0 to 1.0.
• Normalizing the data attempts to give all attributes an equal weight.
3. Binarizing: Binarizing is the process of converting data to either 0 or 1 based on a
threshold value.
Binarization is a preprocessing technique used when we need to convert the data into
binary numbers, i.e., when we need to binarize the data. The scikit-learn function
sklearn.preprocessing.binarize() is used to binarize the data.
This binarize function has a threshold parameter: feature values below or equal to the
threshold are replaced by 0, and values above it are replaced by 1.
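A small sketch of the binarize function described above (the sample array and the 0.5 threshold are illustrative):

```python
import numpy as np
from sklearn.preprocessing import binarize

X = np.array([[1.2, -0.5, 3.4],
              [0.0,  2.1, -1.7]])

# Values <= threshold become 0, values above it become 1
X_binary = binarize(X, threshold=0.5)
```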

4. Standardizing Data: Standardization is a very important concept in feature
scaling, which is an integral part of feature engineering. When you collect data
for data analysis or machine learning, you will have many independent features.
With the help of the independent features, we try to predict the dependent feature
in supervised learning.
Standardization is also called mean removal. In other words, standardization is
another scaling technique where the values are centred around the mean with a
unit standard deviation.
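A minimal standardization sketch using scikit-learn's StandardScaler (the height/weight values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[170.0, 65.0],
              [160.0, 72.0],
              [180.0, 80.0]])   # e.g. height (cm), weight (kg)

# Centre each column around mean 0 with unit standard deviation
scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)
```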
5. Labelling:
Label encoding is used to convert textual labels into numeric form so that they can be
used in a machine-readable format. In label encoding, categorical data is converted to
numerical data and each category is assigned a numeric label.
For example, consider the feature gender having the two values male (0) and female (1).
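A short label-encoding sketch with scikit-learn's LabelEncoder; note that scikit-learn assigns codes alphabetically, so here female becomes 0 and male becomes 1:

```python
from sklearn.preprocessing import LabelEncoder

gender = ["male", "female", "female", "male"]

encoder = LabelEncoder()
gender_encoded = encoder.fit_transform(gender)   # e.g. [1, 0, 0, 1]
print(encoder.classes_)                          # ['female' 'male']
```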
6. One-Hot Encoding: This is the most common encoding technique used in
data transformation. It converts each category in a categorical feature into a
separate binary feature (i.e. 0 or 1). For example, if there is a feature called
‘vehicle’ in the dataset with the categories ‘car’, ‘bike’, and ‘bicycle’,
one-hot encoding will create three separate columns, ‘is_car’, ‘is_bike’, and
‘is_bicycle’, and label them 0 if the category is absent or 1 if it is present.
One-hot encoding is a method of converting categorical variables into a format
that can be provided to machine learning algorithms to improve prediction. It
involves creating a new binary column for each unique category in a feature. Each
column represents one unique category, and a value of 1 or 0 indicates the
presence or absence of that category.
Let's consider an example to illustrate how one-hot encoding works. Suppose we
have a dataset with a single categorical feature, Color, that can take on three
values: Red, Green, and Blue. Using one-hot encoding, we can transform this
feature as follows:

Color  | Color_Red | Color_Green | Color_Blue
Red    |     1     |      0      |      0
Green  |     0     |      1      |      0
Blue   |     0     |      0      |      1

In this example, the original "Color" column is replaced by three new binary
columns, each representing one of the colors. A value of 1 indicates the
presence of the color in that row, while a 0 indicates its absence.
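A minimal one-hot encoding sketch for this Color example using pandas get_dummies (scikit-learn's OneHotEncoder could be used in the same way):

```python
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Green", "Blue", "Green"]})

# Create one binary column per category
one_hot = pd.get_dummies(df, columns=["Color"], dtype=int)
print(one_hot)
#    Color_Blue  Color_Green  Color_Red
# 0           0            0          1
# 1           0            1          0
# 2           1            0          0
# 3           0            1          0
```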

Why Use One-Hot Encoding?

One-hot encoding is an essential technique in data preprocessing for several
reasons. It transforms categorical data into a format that machine learning
models can easily understand and use. This transformation allows each
category to be treated independently without implying any false relationships
between them.
Additionally, many data processing and machine learning libraries support one-
hot encoding. It fits smoothly into the data preprocessing workflow, making it
easier to prepare datasets for various machine learning algorithms.
One hot encoding generates binary columns for each category, whereas label
encoding provides each category with a unique numeric label.
Data Reduction:
Data reduction is the process of reducing the volume of the original data and
representing it in a much smaller volume. Data reduction techniques preserve
the integrity of the data while reducing its size.

The number of input features, variables, or columns present in a given dataset is
known as dimensionality, and the process of reducing these features is called
dimensionality reduction.

A dataset may contain a large number of input features, which makes the predictive
modelling task more complicated. Because it is difficult to visualize or make
predictions for a training dataset with a high number of features,
dimensionality reduction techniques are required in such cases.

A dimensionality reduction technique can be defined as "a way of converting a
higher-dimensional dataset into a lower-dimensional dataset while ensuring that it
provides similar information." These techniques are widely used in Machine
Learning to obtain a better-fitting predictive model when solving classification
and regression problems.

Dimensionality reduction is commonly used in fields that deal with high-dimensional
data, such as speech recognition, signal processing, bioinformatics, etc. It can also
be used for data visualization, noise reduction, cluster analysis, etc.

Data reduction is a technique used in data mining to reduce the size of a
dataset while still preserving the most important information. This is beneficial
when the dataset is too large to be processed efficiently, or when it contains a
large amount of irrelevant or redundant information.

Several data reduction techniques/strategies exist, including dimensionality
reduction, data cube aggregation, and numerosity reduction.

1. Dimensionality Reduction: This technique involves reducing the
number of features in the dataset, either by removing features that are
not relevant or by combining multiple features into a single feature.
OR
Whenever we come across data that is only weakly relevant, we keep just
the attributes required for our analysis. This reduces the data size as it
eliminates outdated or redundant features.
Benefits:
o By reducing the dimensions of the features, the space required to store the
dataset also gets reduced.
o Less computation/training time is required with a reduced number of feature
dimensions.
o Reduced feature dimensions make it easier to visualize the data quickly.
o It removes redundant features (if present) by taking care of
multicollinearity.

Limitations:
There are also some disadvantages of applying dimensionality reduction,
which are given below:
o Some data may be lost due to dimensionality reduction.
o In the PCA (Principal Component Analysis) dimensionality reduction
technique, the number of principal components that need to be considered
is sometimes unknown.

Dimensionality reduction can be divided into two main components:
1. Feature Selection
2. Feature Extraction (PCA, LDA, GDA)

1. Feature Selection: This technique involves selecting a subset of
features from the dataset that are most relevant to the task at hand.

• Step-wise Forward Selection –
The selection begins with an empty set of attributes; at each step we add the
best of the remaining original attributes to the set, based on their relevance.
Statistically, this relevance is often judged with a p-value (a p-value is a
statistical measurement used to validate a hypothesis against observed data).
A code sketch covering both forward and backward selection is given after the
backward selection example below.

Suppose the data set has the following attributes, of which a few are redundant.
Initial attribute set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: { }
Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}

Final reduced attribute set: {X1, X2, X5}


• Step-wise Backward Selection –
This selection starts with the complete set of attributes in the original
data and, at each step, eliminates the worst remaining attribute from the set.

Suppose the data set has the following attributes, of which a few are redundant.
Initial attribute set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: {X1, X2, X3, X4, X5, X6}
Step-1: {X1, X2, X3, X4, X5}
Step-2: {X1, X2, X3, X5}
Step-3: {X1, X2, X5}

Final reduced attribute set: {X1, X2, X5}
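The sketch below uses scikit-learn's SequentialFeatureSelector, which greedily adds (forward) or removes (backward) attributes based on cross-validated model score rather than p-values; the synthetic data and the LogisticRegression estimator are illustrative assumptions, not part of the notes above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic data with 6 attributes, only some of which are informative
X, y = make_classification(n_samples=200, n_features=6,
                           n_informative=3, random_state=0)

# direction="forward" mimics step-wise forward selection;
# direction="backward" mimics step-wise backward elimination.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=3,
    direction="forward",
)
selector.fit(X, y)
print(selector.get_support())   # boolean mask of the selected attributes
```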


• Combination of Forward and Backward Selection –
This combines the two approaches: at each step we can select the best attribute
and remove the worst from among the remaining attributes, saving time and
making the process faster.

• Decision Tree Induction:
This method uses the concept of decision trees for attribute selection. A
decision tree consists of several nodes with branches. The nodes of
the decision tree represent a test applied to an attribute, while the branches
represent the outcomes of the test.
The decision tree helps in discarding irrelevant attributes: the attributes that
do not appear in the tree are considered irrelevant.
A decision tree is a supervised learning method used in data mining for
classification and regression.
It is a tree that helps us in decision-making. The decision tree
creates classification or regression models in a tree structure.
It separates a data set into smaller and smaller subsets while, at the same time,
the decision tree is incrementally developed.
The final tree has decision nodes and leaf nodes. A decision node has at least
two branches.
The leaf nodes show a classification or decision; we cannot split a leaf any
further. The topmost decision node in a tree, corresponding to the best predictor,
is called the root node. Decision trees can deal with both categorical and
numerical data.
Key factors:
Entropy:
Entropy refers to a common way to measure impurity. In the decision tree,
it measures the randomness or impurity in data sets.

Information Gain:
Information Gain refers to the decline in entropy after the dataset is split on an
attribute. It is also called entropy reduction. Building a decision tree is all about
discovering the attributes that return the highest information gain.
In short, a decision tree is just like a flow chart diagram with the terminal
nodes showing decisions. Starting with the whole dataset, we measure the
entropy to find a way to segment the set until the data in each subset belongs
to the same class.
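For reference, the standard formulas for these two quantities (not written out in the notes above) are:

Entropy(S) = −Σ p_i · log2(p_i), summed over all classes i, where p_i is the proportion of records in S belonging to class i.
Gain(S, A) = Entropy(S) − Σ (|S_v| / |S|) · Entropy(S_v), summed over all values v of attribute A, where S_v is the subset of S for which A takes the value v.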
Advantages of Decision Tree Induction:
A decision tree model is automatic and simple to explain to the technical
team as well as to stakeholders.
Compared to other algorithms, decision trees need less effort for data
preparation during pre-processing.
A decision tree does not require standardization of the data.
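As a small illustration (the Iris dataset and the parameter choices are assumptions for this sketch), a decision tree grown with the entropy criterion exposes which attributes it actually used through its feature importances; attributes with zero importance never enter the tree and can be discarded:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# criterion="entropy" grows the tree by maximizing information gain
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)

# Attributes with (near-)zero importance do not appear in the tree
for name, importance in zip(load_iris().feature_names, tree.feature_importances_):
    print(f"{name}: {importance:.3f}")
```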

2) Feature Extraction:
The feature extraction process is used to reduce data in a high-dimensional
space to a lower-dimensional space.
Feature extraction creates a new, smaller set of features that contains
the most useful information.
Methods for feature extraction include:
1. Principal Component Analysis (PCA)
2. Linear Discriminant Analysis (LDA)
3. Generalized Discriminant Analysis (GDA)
What is a Feature?
A feature is an individual measurable property within a recorded
dataset. In machine learning and statistics, features are often called
“variables” or “attributes.” Relevant features have a correlation with, or
bearing on, a model’s use case. In a patient medical dataset, features could be
age, gender, blood pressure, cholesterol level, and other observed
characteristics relevant to the patient.

Why is Feature Extraction Important?
Feature extraction plays a vital role in many real-world applications. It is
critical for image and speech recognition, predictive modelling, and Natural
Language Processing (NLP). In these scenarios, the raw data may contain many
irrelevant or redundant features, which makes it difficult for algorithms
to process the data accurately.
By performing feature extraction, the relevant features are separated
(“extracted”) from the irrelevant ones. With fewer features to process,
the dataset becomes simpler, and the accuracy and efficiency of the
analysis improve.
Common Feature Types:
• Numerical: Values with numeric types (int, float, etc.). Examples: age,
salary, height.
• Categorical Features: Features that can take one of a limited number of
values. Examples: gender (male, female, X), color (red, blue, green).
• Ordinal Features: Categorical features that have a clear ordering.
Examples: T-shirt size (S, M, L, XL).
• Binary Features: A special case of categorical features with only two
categories. Examples: is_smoker (yes, no), has_subscription (true, false).
• Text Features: Features that contain textual data. Textual data typically
requires special preprocessing steps (like tokenization) to transform it into
a format suitable for machine learning models.

(I) Principal Component Analysis (PCA):
PCA is an unsupervised learning algorithm that reduces dimensionality in machine
learning. It is a statistical process that converts observations of correlated
features into a set of linearly uncorrelated features with the help of an
orthogonal transformation. These new transformed features are called
the Principal Components.
OR
Principal Component Analysis (PCA) is a feature extraction method that reduces
the dimensionality of large data sets while preserving the maximum amount of
information. PCA emphasizes variation and captures important patterns and
relationships between variables in the dataset.
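A minimal PCA sketch with scikit-learn, assuming standardized numeric input; the random toy data and the choice of two components are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))             # 100 samples, 5 features

X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)                 # keep two principal components
X_reduced = pca.fit_transform(X_scaled)

print(pca.explained_variance_ratio_)      # share of variance captured by each component
```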
(II) Linear Discriminant Analysis (LDA):
LDA is a supervised method of feature extraction that also creates a linear
combination of the original features. However, it can be used only for
labelled data and can thus be applied only in certain situations. The data has
to be normalized before performing LDA.
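A brief hedged sketch of LDA with scikit-learn (the Iris data and n_components=2 are illustrative assumptions); unlike PCA, LDA needs the class labels y:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)    # normalize/standardize first

lda = LinearDiscriminantAnalysis(n_components=2)
X_projected = lda.fit_transform(X_scaled, y)    # at most (n_classes - 1) components
```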
(III) Generalized Discriminant Analysis (GDA):
GDA extends LDA to non-linear problems by using a kernel function to map the data
into a higher-dimensional space before performing discriminant analysis, so it can
handle classes that are not linearly separable.

https://www.datacamp.com/tutorial/one-hot-encoding-python-tutorial

Example: https://www.statology.org/one-hot-encoding-in-python/
