Module 2
Data pre-processing is the process of transforming raw data into an understandable format.
It is also an important step in data mining as we cannot work with raw data. The quality of
the data should be checked before applying machine learning or data mining algorithms.
Pre-processing of data is mainly about checking and improving data quality, which is typically judged by properties such as accuracy, completeness, and consistency.
There are 4 major tasks in data pre-processing – Data cleaning, Data integration, Data
reduction, and Data transformation.
Data Cleaning
Data cleaning is the process of removing incorrect, incomplete, and inaccurate data from a dataset, and it also replaces missing values. Here are some techniques for data
cleaning:
● Standard values like “Not Available” or “NA” can be used to replace the missing values.
● Missing values can also be filled in manually, but this is not recommended when the dataset is big.
● The attribute’s mean value can be used to replace the missing value when the data is normally distributed; in the case of a non-normal distribution, the attribute’s median value can be used instead.
● Regression or decision tree algorithms can be used to predict the most probable value and substitute it for the missing value.
● Binning: This method is used to smooth or handle noisy data. First, the data is sorted, and then the sorted values are distributed into a number of bins. There are three methods for smoothing the data in a bin. Smoothing by bin mean: the values in the bin are replaced by the mean value of the bin. Smoothing by bin median: the values in the bin are replaced by the median value of the bin. Smoothing by bin boundary: the minimum and maximum values of the bin are taken as the bin boundaries, and each value is replaced by the closest boundary value.
● Regression: This is used to smooth the data and helps handle data when unnecessary attributes are present. For analysis purposes, regression also helps decide which variables are suitable for the analysis.
● Clustering: This is used for finding the outliers and also in grouping the data. Clustering is
generally used in unsupervised learning.
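As a minimal sketch of the cleaning techniques above, mean/median imputation and bin-mean smoothing can be done with the standard library alone. The sample values here are made up for illustration:

```python
from statistics import mean, median

# Hypothetical attribute with missing values represented as None.
ages = [23, None, 27, 31, None, 25, 90]

# Mean imputation (suitable when the data is roughly normally distributed).
known = [v for v in ages if v is not None]
mean_filled = [v if v is not None else round(mean(known), 1) for v in ages]

# Median imputation (more robust when the distribution is skewed).
median_filled = [v if v is not None else median(known) for v in ages]

# Binning: sort the data, split it into equal-frequency bins,
# then smooth by replacing each value with its bin's mean.
values = sorted(known)
bin_size = 2
bins = [values[i:i + bin_size] for i in range(0, len(values), bin_size)]
smoothed = [round(mean(b), 1) for b in bins for _ in b]

print(mean_filled)
print(median_filled)
print(smoothed)
```

Smoothing by bin median or bin boundary follows the same pattern, only the replacement value per bin changes.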
Data Integration
Data integration is the process of combining data from multiple sources into a single dataset. It is one of the main components of data management. Several problems must be
considered during data integration.
● Schema integration: Integrates metadata(a set of data that describes other data) from
different sources.
● Entity identification problem: Identifying entities from multiple databases. For example, the system or the user should recognize that student_id in one database and student_name in another database refer to the same entity.
● Detecting and resolving data value conflicts: The values of the same attribute taken from different databases may differ when merging. For example, the date format may differ, like “MM/DD/YYYY” or “DD/MM/YYYY”.
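The entity identification and date-format conflicts above can be sketched as follows; the record layouts and field names (cust_id, cid, and so on) are hypothetical:

```python
from datetime import datetime

# Hypothetical records from two databases. "cust_id" and "cid" refer to
# the same entity, and the two date formats differ.
sales = [{"cust_id": 1, "signup": "03/25/2021"}]   # MM/DD/YYYY
product = [{"cid": 1, "last_seen": "25/03/2021"}]  # DD/MM/YYYY

def to_iso(date_str, fmt):
    """Resolve the date-format conflict by converting to ISO 8601."""
    return datetime.strptime(date_str, fmt).date().isoformat()

# Unified view: map both key names onto one schema, one date format.
unified = []
for s in sales:
    match = next(p for p in product if p["cid"] == s["cust_id"])
    unified.append({
        "customer_id": s["cust_id"],
        "signup": to_iso(s["signup"], "%m/%d/%Y"),
        "last_seen": to_iso(match["last_seen"], "%d/%m/%Y"),
    })

print(unified)
```

Real integration would also use metadata and correlation analysis to detect such conflicts automatically, rather than hard-coding the formats.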
Data Reduction
This process reduces the volume of the data, which makes analysis easier while producing the same or almost the same results. The reduction also saves storage space. Some data reduction techniques are dimensionality reduction, numerosity reduction, and data compression.
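A rough illustration of two of these techniques on a made-up dataset: dropping near-constant attributes as a very simple stand-in for dimensionality reduction (real pipelines would typically use PCA or similar), and random sampling for numerosity reduction:

```python
import random
from statistics import pvariance

random.seed(42)
# Toy dataset: 3 attributes per row; the last one is constant, so it
# carries no information for analysis.
rows = [(random.random(), random.random(), 0.5) for _ in range(100)]

# Dimensionality reduction (simplified): drop attributes whose variance
# falls below a threshold, since they contribute almost nothing.
cols = list(zip(*rows))
keep = [i for i, c in enumerate(cols) if pvariance(c) > 1e-6]
reduced = [[row[i] for i in keep] for row in rows]

# Numerosity reduction: random sampling keeps a representative subset.
sample = random.sample(reduced, 10)

print(len(keep), len(sample))
```

The reduced dataset keeps only the two informative attributes and, after sampling, a tenth of the rows, while an analysis run on it should give nearly the same picture as the full data.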
Data Transformation
The change made in the format or the structure of the data is called data transformation. This
step can be simple or complex based on the requirements. There are some methods for data
transformation.
● Smoothing: With the help of algorithms, we can remove noise from the dataset, which helps in identifying the important features of the dataset. Smoothing can reveal even subtle patterns that help in prediction.
● Aggregation: In this method, the data is stored and presented in the form of a summary. Data coming from multiple sources is integrated into a summary description for data analysis. This is an important step, since the relevance of the results depends on the quantity and quality of the data: when both are good, the results are more relevant.
● Discretization: The continuous data here is split into intervals. Discretization reduces the
data size. For example, rather than specifying the class time, we can set an interval like (3
pm-5 pm, or 6 pm-8 pm).
● Normalization: It is the method of scaling the data so that it can be represented in a smaller range, for example from -1.0 to 1.0.
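The normalization and discretization methods above can be sketched like this; the score values and time intervals are invented for the example:

```python
# Min-max normalization of made-up scores into the range [-1.0, 1.0].
scores = [10, 20, 30, 40, 50]
lo, hi = min(scores), max(scores)
new_min, new_max = -1.0, 1.0
normalized = [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
              for v in scores]

def discretize(hour):
    """Discretization: map a continuous 24h class time to an interval label."""
    if 15 <= hour < 17:
        return "3pm-5pm"
    if 18 <= hour < 20:
        return "6pm-8pm"
    return "other"

labels = [discretize(h) for h in (15.5, 19, 9)]
print(normalized, labels)
```

The interval labels replace the exact times, which reduces the data size exactly as the discretization bullet describes.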
Data collection
Collecting data for training the ML model is the basic step in the machine learning pipeline.
The predictions made by ML systems can only be as good as the data on which they have
been trained. Following are some of the problems that can arise in data collection:
● Inaccurate data. The collected data could be unrelated to the problem statement.
● Missing data. Parts of the data could be missing. That could take the form of empty values in
columns or missing images for some class of prediction.
● Data imbalance. Some classes or categories in the data may have a
disproportionately high or low number of corresponding samples. As a result, they
risk being under-represented in the model.
● Data bias. Depending on how the data, subjects and labels themselves are chosen,
the model could propagate inherent biases on gender, politics, age or region, for
example. Data bias is difficult to detect and remove.
Several techniques can be applied to address those problems:
● Pre-cleaned, freely available datasets. If the problem statement (for example, image
classification, object recognition) aligns with a clean, pre-existing, properly
formulated dataset, then take advantage of existing, open-source expertise.
● Web crawling and scraping. Automated tools, bots and headless browsers can crawl
and scrape websites for data.
● Private data. ML engineers can create their own data. This is helpful when the
amount of data required to train the model is small and the problem statement is too
specific to generalize over an open-source dataset.
● Custom data. Agencies can create or crowdsource the data for a fee.
Data pre-processing
Real-world raw data and images are often incomplete, inconsistent and lacking in certain
behaviours or trends. They are also likely to contain many errors. So, once collected, they are
pre-processed into a format the machine learning algorithm can use for the model.
Pre-processing includes a number of techniques and actions:
● Data cleaning. These techniques, manual and automated, remove data incorrectly
added or classified.
● Data imputations. Most ML frameworks include methods and APIs for balancing or
filling in missing data. Techniques generally include imputing missing values with
standard deviation, mean, median and k-nearest neighbours (k-NN) of the data in the
given field.
● Oversampling. Bias or imbalance in the dataset can be corrected by generating more
observations/samples with methods like repetition, bootstrapping or Synthetic
Minority Over-Sampling Technique (SMOTE), and then adding them to the
under-represented classes.
● Data integration. Combining multiple datasets to get a large corpus can overcome
incompleteness in a single dataset.
● Data normalization. The size of a dataset affects the memory and processing required
for iterations during training. Normalization reduces the size by reducing the order of
magnitude of the data values.
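A minimal sketch of the oversampling technique mentioned above, assuming a toy two-class dataset; repetition is the simplest variant, whereas SMOTE would synthesize new samples rather than repeat existing ones:

```python
import random

random.seed(1)
# Imbalanced toy dataset: the "spam" class is under-represented.
data = [("ok", i) for i in range(9)] + [("spam", 100)]

# Oversampling by repetition: resample minority rows with replacement
# until the classes are balanced, then add them to the dataset.
minority = [r for r in data if r[0] == "spam"]
majority = [r for r in data if r[0] == "ok"]
balanced = majority + random.choices(minority, k=len(majority))

counts = {c: sum(1 for lbl, _ in balanced if lbl == c)
          for c in ("ok", "spam")}
print(counts)
```

After balancing, both classes contribute equally many samples, so the model no longer sees the minority class as rare.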
I. DATA INTEGRATION:
Data integration is one of the steps of data pre-processing that involves combining data
residing in different sources and providing users with a unified view of these data.
• It merges the data from multiple data stores (data sources)
• It includes multiple databases, data cubes or flat files.
• Metadata, Correlation analysis, data conflict detection, and resolution of semantic
heterogeneity contribute towards smooth data integration.
• There are two major approaches to data integration, commonly known as the "tight
coupling approach" and the "loose coupling approach".
Tight Coupling
o Here data is pulled over from different sources into a single physical location through the
process of ETL - Extraction, Transformation and Loading.
o The single physical location provides a uniform interface for querying the data.
o The ETL layer helps to map the data from the sources so as to provide a uniform data warehouse. This approach is called tight coupling since, in this approach, the data is tightly coupled with the physical repository at the time of query.
ADVANTAGES:
1. Independence (less dependency on source systems, since data is physically copied over)
2. Faster query processing
3. Complex query processing
4. Advanced data summarization and storage possible
5. High-volume data processing
DISADVANTAGES:
1. Latency (since data needs to be loaded using ETL)
2. Costlier (data localization, infrastructure, security)
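A toy sketch of the tight-coupling ETL flow described above; the two sources and their schemas are hypothetical:

```python
# Hypothetical source systems with conflicting schemas (USD vs. cents).
source_a = [{"id": 1, "amount_usd": 10.0}]
source_b = [{"id": 2, "amount_cents": 2500}]

def extract():
    """Extraction: pull raw rows from each source system."""
    return source_a, source_b

def transform(a, b):
    """Transformation: resolve the unit conflict into one schema."""
    rows = [{"id": r["id"], "amount_usd": r["amount_usd"]} for r in a]
    rows += [{"id": r["id"], "amount_usd": r["amount_cents"] / 100}
             for r in b]
    return rows

# Loading: the transformed rows are physically copied into the single
# repository (the "warehouse"), which queries then run against.
warehouse = transform(*extract())

total = sum(r["amount_usd"] for r in warehouse)
print(total)
```

Because queries hit the local warehouse rather than the sources, query processing is fast, but the copy is only as fresh as the last ETL run, which is the latency disadvantage listed above.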
Loose Coupling
o Here a virtual mediated schema provides an interface that takes the query from the user,
transforms it in a way the source database can understand and then sends the query directly
to the source databases to obtain the result.
o In this approach, the data only remains in the actual source databases.
o However, mediated schema contains several "adapters" or "wrappers" that can connect
back to the source systems in order to bring the data to the front end.
ADVANTAGES:
1. Data freshness (low latency, almost real time)
2. Higher agility (when a new source system arrives or an existing source system changes, only the corresponding adapter is created or changed, largely without affecting the other parts of the system)
3. Lower cost (much infrastructure cost can be saved, since data localization is not required)
DISADVANTAGES:
1. Semantic conflicts
2. Slower query response
3. High dependency on the data sources
For example, let's imagine that an electronics company is preparing to roll out a new mobile
device. The marketing department might want to retrieve customer information from a sales
department database and compare it to information from the product department to create
a targeted sales list. A good data integration system would let the marketing department
view information from both sources in a unified way, leaving out any information that didn't
apply to the search.
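The unified view described in this example can be sketched with the loose-coupling adapter pattern; the source "databases" here are hypothetical in-memory stand-ins:

```python
# Hypothetical source systems; the mediated schema holds no data itself.
sales_db = {"alice": 3}                  # customer -> number of orders
product_db = [("alice", "premium")]      # customer, product tier

def sales_adapter(name):
    """Wrapper that translates the query for the sales source."""
    return {"orders": sales_db.get(name, 0)}

def product_adapter(name):
    """Wrapper that translates the query for the product source."""
    tier = next((t for n, t in product_db if n == name), None)
    return {"tier": tier}

def mediated_query(name):
    """Fan the user's query out to every adapter and merge live results."""
    result = {"customer": name}
    for adapter in (sales_adapter, product_adapter):
        result.update(adapter(name))
    return result

print(mediated_query("alice"))
```

The data stays in the source systems and is fetched at query time, which gives the freshness advantage (and the slower-query disadvantage) noted above.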
1. min-max normalization:
o Min-max normalization preserves the relationships among the original data values.
2. z-score normalization:
o Here the values for an attribute, A, are normalized based on the mean and standard
deviation of A.
o A value, v, of A is normalized to v' by computing v' = (v - mean(A)) / std_dev(A), where mean(A) and std_dev(A) are the mean and standard deviation of A.
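A quick sketch of z-score normalization using this formula; the attribute values are chosen so that the statistics come out to round numbers:

```python
from statistics import mean, pstdev

# Attribute A: mean is 5.0 and population standard deviation is 2.0.
A = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mu, sigma = mean(A), pstdev(A)

# z-score normalization: v' = (v - mean(A)) / std_dev(A).
z = [(v - mu) / sigma for v in A]

print(mu, sigma)
print(z)
```

Unlike min-max normalization, z-score normalization needs no predefined range, which makes it useful when the minimum and maximum of A are unknown or dominated by outliers.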