Ch.3 Data Preprocessing
Definition: Pre-processing refers to the transformations applied to our data before feeding it
to the algorithm. Data preprocessing is a technique used to convert raw data into a clean
data set. In other words, data gathered from different sources is collected in a raw format,
which is not feasible for analysis.
Need for Data Preprocessing:
• To achieve better results from the applied model in a Machine Learning project, the
data has to be in a proper format. Some Machine Learning models need information
in a specified format; for example, the Random Forest algorithm does not support
null values, so null values have to be managed in the original raw data set before a
Random Forest model can be executed.
• Another aspect is that the data set should be formatted in such a way that more
than one Machine Learning or Deep Learning algorithm can be executed on the same
data set, and the best of them chosen.
OR
Data Wrangling is the process of gathering, collecting, and transforming raw data into
another format for better understanding, decision-making, access, and analysis in less
time. Data Wrangling is also known as Data Munging.
The process of data wrangling may include further munging, data visualization, data
aggregation, training a statistical model, and many other potential uses. Data wrangling
typically follows a set of general steps, which begin with extracting the raw data from the data
source, "munging" the raw data (e.g., sorting) or parsing the data into predefined data
structures, and finally depositing the resulting content into a data sink for storage and future
use.
Data Attributes:
• Attributes are qualities or characteristics that describe an object, individual, or
phenomenon.
• Attributes can be categorical, representing distinct categories or classes, such as
colours, types, or labels.
• Some attributes are quantitative, taking on numerical values that can be measured or
counted, such as height, weight, or temperature.
Types of Attributes :
Qualitative Attributes:
These attributes represent categories and do not have a meaningful numeric interpretation.
Examples include gender, colour, or product type. These are often referred to as nominal,
ordinal or binary attributes.
1. Nominal Attributes:
Nominal means “relating to names”. The values of a nominal attribute are symbols or
names of things. Each value represents some kind of category, code, or state, and so
nominal attributes are also referred to as categorical. Example: Suppose that skin colour
and education status are two attributes describing a person object. In our
implementation, possible values for skin colour are dark, white, and brown. The
education status attribute can contain the values undergraduate, postgraduate, and
matriculate.
2. Binary Attributes:
A binary attribute is a category of nominal attribute that contains only two states: 0
or 1, where 0 usually means the attribute is absent and 1 means it is present.
Binary attributes are referred to as Boolean if the two states correspond to true and
false. Example: Given the attribute drinker describing a patient object, 1 indicates that the
patient drinks, while 0 indicates that the patient does not. Similarly, suppose the patient
undergoes a medical test that has two possible outcomes.
3. Ordinal Attributes :
Ordinal data is a type of categorical data that possesses a meaningful order or ranking
among its categories, yet the intervals between consecutive values are not consistently
measurable or well-defined. Example – In the context of sports, an ordinal data example
would be medal rankings in a competition, such as gold, silver, and bronze.
Quantitative Attributes:
Numeric Attributes:
A numeric attribute is quantitative; that is, it is a measurable quantity represented by integer or
real values. Numeric attributes can be of two types: interval-scaled and ratio-scaled.
Let's discuss them one by one.
1. Interval-Scaled Attributes:
Interval-scaled attributes are measured on a scale of equal-size units. The values
of interval-scaled attributes have order and can be positive, zero, or negative. Thus, in
addition to providing a ranking of values, such attributes allow us to compare and
quantify the difference between values. Example: A temperature attribute is
interval-scaled. We have different temperature values for every new day, where each
day is an entity. By ordering the values, we obtain a ranking of entities
with respect to temperature. In addition, we can quantify the difference
between values; for example, a temperature of 20 degrees C is five degrees higher than
a temperature of 15 degrees C.
2. Ratio-Scaled Attributes:
A ratio-scaled attribute is a category of numeric attribute with an inherent, fixed zero
point. In addition, the values are ordered, and we can compute the difference
between values, as well as the mean, median, and mode. Example: The Kelvin (K)
temperature scale has what is considered a true zero point; it is the point at which
the particles that make up matter have zero kinetic energy.
Numeric attributes can also be divided into discrete and continuous data.
• Discrete Attribute:
A discrete attribute has a finite or countably infinite set of values, which may or may not
be represented as integers. Example: The attributes skin colour, drinker, medical report,
and drink size each have a finite number of values, and so are discrete.
• Continuous Attribute:
A continuous attribute has real numbers as attribute values. Example – Height, weight,
and temperature have real values. Real values can only be represented and measured
using a finite number of digits. Continuous attributes are typically represented as
floating-point variables.
Data Objects: A collection of attributes that describe an object. Data objects can also
be referred to as samples, examples, instances, cases, entities, data points or objects.
The data object is a location or region of storage that contains a collection of attributes
or groups of values that act as an aspect, characteristic, quality, or descriptor of the
object. A vehicle is a data object which can be defined or described with the help of a
set of attributes or data.
Different data objects are present which are shown below:
• External entities such as a printer, user, speakers, keyboard, etc.
• Things such as reports, displays, signals.
• Occurrences or events such as alarm, telephone calls.
• Sales databases such as customers, store items, sales.
• Organizational units such as division, departments.
• Places such as manufacturing floor, workshops.
• Structures such as student records, accounts, files, documents.
Data Quality: Why preprocess the data?
(What is data quality? Which factors affect data quality?)
There are six primary, or core, dimensions to data quality. These are the metrics
analysts use to determine the data’s viability and its usefulness to the people who need
it.
• Accuracy
The data must conform to actual, real-world scenarios and reflect real-world objects
and events. Analysts should use verifiable sources to confirm the measure of accuracy,
which is determined by how closely the values agree with verified, correct information sources.
• Completeness
Completeness measures whether the data delivers all of the mandatory values
successfully.
• Consistency
Data consistency describes the data’s uniformity as it moves across applications and
networks and when it comes from multiple sources. Consistency also means that the
same datasets stored in different locations should be the same and not conflict. Note
that consistent data can still be wrong.
• Timeliness
Timely data is information that is readily available whenever it’s needed. This
dimension also covers keeping the data current; data should undergo real-time updates
to ensure that it is always available and accessible.
• Uniqueness
Uniqueness means that no duplicated or redundant information overlaps across the
datasets; no record in the dataset exists multiple times. Analysts use
data cleansing and deduplication to help address a low uniqueness score.
• Validity
Data must be collected according to the organization’s defined business rules and
parameters. The information should also conform to the correct, accepted formats, and
all dataset values should fall within the proper range.
Data Munging/Wrangling Operations:
Data wrangling is the task of converting data into a feasible format that is suitable for
consumption.
The goal of data wrangling is to ensure quality and useful data.
Data munging includes operations such as cleaning data, data transformation, data
reduction, and data discretization.
Data Cleaning:
Data cleaning is fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or
incomplete data within a dataset.
The importance of data cleaning lies in the following factors:
• Improved data quality: Cleaning the data reduces errors, inconsistencies, and
missing values, which ultimately makes the data more accurate and reliable for
analysis.
• Better decision-making: Consistent and clean data gives an organization
comprehensive and accurate information, and reduces the risk of making decisions
based on outdated or incomplete data.
• Increased efficiency: High-quality data is efficient to analyze, model, or report on,
and clean data avoids much of the time and effort that otherwise goes into
handling poor data quality.
• Compliance and regulatory requirements: Industries and regulatory authorities
set standards on data quality; data cleaning helps organizations conform to these
standards and avoid penalties and legal risks.
Common Data Cleaning Tasks
Data cleaning involves several key tasks, each aimed at addressing specific issues
within a dataset. Here are some of the most common tasks involved in data cleaning
(a short code sketch follows this list):
1. Handling Missing Data
Missing data is a common problem in datasets. Strategies to handle missing data
include:
• Removing Records: Deleting rows with missing values if they are relatively few and
insignificant.
• Imputing Values: Replacing missing values with estimated ones, such as the mean,
median, or mode of the dataset.
• Using Algorithms: Employing advanced techniques like regression or machine
learning models to predict and fill in missing values.
2. Removing Duplicates
Duplicates can skew analyses and lead to inaccurate results. Identifying and removing
duplicate records ensures that each data point is unique and accurately represented.
3. Correcting Inaccuracies
Data entry errors, such as typos or incorrect values, need to be identified and corrected.
This can involve cross-referencing with other data sources or using validation rules to
ensure data accuracy.
4. Standardizing Formats
Data may be entered in various formats, making it difficult to analyze. Standardizing
formats, such as dates, addresses, and phone numbers, ensures consistency and makes
the data easier to work with.
5. Dealing with Outliers
Outliers can distort analyses and lead to misleading results. Identifying and addressing
outliers, either by removing them or transforming the data, helps maintain
the integrity of the dataset.
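As a rough illustration of tasks 1, 2, and 5, here is a minimal pandas sketch; the DataFrame, column names, and values are made up for illustration and are not taken from the chapter.

import pandas as pd
import numpy as np

# Hypothetical raw data: a missing value, one duplicate row, one extreme outlier
df = pd.DataFrame({
    "age":    [25, 32, np.nan, 32, 25, 400],
    "salary": [40000, 52000, 48000, 52000, 40000, 61000],
})

# 2. Removing duplicates: keep only the first occurrence of each identical row
df = df.drop_duplicates()

# 1. Handling missing data: impute missing ages with the column median
df["age"] = df["age"].fillna(df["age"].median())

# 5. Dealing with outliers: remove rows with implausible ages (a domain rule)
df = df[df["age"].between(0, 120)]

print(df)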
Data Transformation:
Definition:
Data transformation is a process of converting raw data into a single and easy-to-read format
to facilitate easy analysis.
The data transformation process involves converting, cleansing, and structuring data into a
usable format used to analyse, to support decision-making processes. It includes modifying
the format, organisation, or values of data to prepare it for consumption by an application or
for analysis.
Benefits:
1. Makes data better organized.
2. Organized/transformed data is easier for both humans and computers to use.
3. Properly formatted and validated data improves data quality and protects applications from
problems such as null values, duplicates, and incorrect values.
4. Data transformation facilitates compatibility between applications, systems, and types of
data.
Advantages and Limitations of Data Transformation
The binarize function has a threshold parameter: feature values below
or equal to the threshold are replaced by 0, and values above it are replaced
by 1.
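Assuming this refers to scikit-learn's Binarizer (the threshold value and the input matrix below are arbitrary), a minimal sketch is:

import numpy as np
from sklearn.preprocessing import Binarizer

X = np.array([[1.0, -0.5, 2.0],
              [0.0,  3.0, 0.4]])

# Values <= threshold become 0, values above it become 1
binarizer = Binarizer(threshold=0.5)
print(binarizer.fit_transform(X))
# [[1. 0. 1.]
#  [0. 1. 0.]]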
In a one-hot encoding example, the original "Color" column is replaced by three new binary
columns, each representing one of the colors. A value of 1 indicates the
presence of the color in that row, while a 0 indicates its absence.
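The original example itself is not reproduced here; a minimal reconstruction with pandas (the "Color" values are hypothetical) would be:

import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Green", "Blue", "Green"]})

# One-hot encode the "Color" column; dtype=int gives 0/1 rather than True/False
encoded = pd.get_dummies(df, columns=["Color"], dtype=int)
print(encoded)
#    Color_Blue  Color_Green  Color_Red
# 0           0            0          1
# 1           0            1          0
# 2           1            0          0
# 3           0            1          0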
A dataset often contains many input features, which makes the predictive
modelling task more complicated. Because it is difficult to visualize or make
predictions for a training dataset with a high number of features,
dimensionality reduction techniques are required in such cases.
It is commonly used in the fields that deal with high-dimensional data, such
as speech recognition, signal processing, bioinformatics, etc. It can also be
used for data visualization, noise reduction, cluster analysis, etc.
Limitations:
Suppose there are the following attributes in the data set, of which a few
are redundant (a code sketch follows the steps below):
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: { }
Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}
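These steps describe stepwise forward selection: starting from the empty set, the attribute that most improves the model is added at each step. Below is a minimal sketch using scikit-learn's SequentialFeatureSelector; the synthetic data, the classifier, and the names X1..X6 are assumptions for illustration only.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for attributes X1..X6 (some columns are redundant)
X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           n_redundant=3, random_state=0)

# Greedy forward selection: add one attribute at a time until 3 are kept
selector = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                     n_features_to_select=3,
                                     direction="forward")
selector.fit(X, y)

feature_names = [f"X{i+1}" for i in range(X.shape[1])]
selected = [n for n, keep in zip(feature_names, selector.get_support()) if keep]
print("Reduced attribute set:", selected)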
Information Gain:
Information Gain refers to the decline in entropy after the dataset is split.
It is also called Entropy Reduction. Building a decision tree is all about
discovering attributes that return the highest information gain.
In short, a decision tree is just like a flow chart diagram with the terminal
nodes showing decisions. Starting with the dataset, we can measure the
entropy to find a way to segment the set until all the data in each segment
belongs to the same class.
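As a small worked example (the class labels and the split below are hypothetical), entropy and information gain can be computed directly:

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(ch) / total * entropy(ch) for ch in children)
    return entropy(parent) - weighted

# Hypothetical split: 10 samples divided into two branches by some attribute
parent   = ["yes"] * 5 + ["no"] * 5
children = [["yes"] * 4 + ["no"] * 1,   # left branch
            ["yes"] * 1 + ["no"] * 4]   # right branch

print(round(entropy(parent), 3))                      # 1.0
print(round(information_gain(parent, children), 3))   # 0.278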
Advantages of Decision Tree Induction:
2) Feature Extraction:
The feature extraction process is used to reduce data in a high-dimensional
space to a lower-dimensional space.
Feature extraction creates a new, smaller set of features that contains
the most useful information.
Methods for feature extraction include:
1. Principal component Analysis (PCA)
2. Linear Discriminant Analysis (LDA)
3. Generalized Discriminant Analysis (GDA)
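As an illustration of the first of these methods, here is a minimal PCA sketch with scikit-learn; the Iris data and the choice of two components are assumptions made for illustration.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load a small 4-feature dataset and standardize it (PCA is scale-sensitive)
X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Project the 4 original features onto 2 new components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                  # (150, 2)
print(pca.explained_variance_ratio_)    # share of variance kept by each component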
What is a Feature?
A feature is an individual measurable property within a recorded
dataset. In machine learning and statistics, features are often called
“variables” or “attributes.” Relevant features have a correlation with, or bearing on, a
model’s use case. In a patient medical dataset, features could be age,
gender, blood pressure, cholesterol level, and other observed
characteristics relevant to the patient.
https://www.datacamp.com/tutorial/one-hot-encoding-python-tutorial
Example: https://www.statology.org/one-hot-encoding-in-python/