
Data Pre-processing in Machine Learning

Data pre-processing is the process of preparing raw data and making it suitable for a
machine learning model. It is the first and most important step in creating a machine learning
model.

When working on a machine learning project, we do not always come across clean,
well-formatted data. Before performing any operation on the data, it must be cleaned and put
into a structured format, and that is what the data pre-processing step does.

Why is data preprocessing important?

Preprocessing of data is mainly about checking data quality, which can be assessed along the
following dimensions:

 Accuracy: whether the entered data is correct.
 Completeness: whether all required data is available and has been recorded.
 Consistency: whether the same data matches everywhere it is stored.
 Timeliness: whether the data is kept properly up to date.
 Believability: whether the data can be trusted.
 Interpretability: whether the data is easy to understand.

Major Tasks in Data Preprocessing:

1. Data cleaning
2. Data integration
3. Data reduction
4. Data transformation
                                    

Data cleaning:

Data cleaning is the process of removing incorrect, incomplete, and inaccurate data from the
dataset; it also replaces missing values. Some data cleaning techniques are covered in the
later sections of this document.

In Python, the overall data pre-processing workflow involves the steps below:

o Getting the dataset
o Importing libraries
o Importing datasets
o Finding Missing Data
o Encoding Categorical Data
o Splitting dataset into training and test set
o Feature scaling

1) Get the Dataset

To create a machine learning model, the first thing we require is a dataset, since a machine
learning model works entirely on data. The data collected for a particular problem and arranged
in a proper format is known as the dataset.

Datasets come in different formats for different purposes. For example, the dataset needed for
a business problem will differ from the dataset required for a medical problem such as
liver-patient diagnosis, so each dataset is different from the others. To use a dataset in our
code, we usually store it in a CSV file, although sometimes we may also need to use an HTML
or xlsx file.

What is a CSV File?

CSV stands for "Comma-Separated Values". It is a file format for saving tabular data, such as
spreadsheets. It works well even for huge datasets, and CSV files are easy to use in programs.

Here we will use a demo dataset for data preprocessing; for practice, it can be downloaded
from "https://www.superdatascience.com/pages/machine-learning". For real-world problems, we
can download datasets online from various sources such as
https://www.kaggle.com/uciml/datasets, https://archive.ics.uci.edu/ml/index.php, etc.

We can also create our own dataset by gathering data through various APIs with Python and
saving that data into a .csv file, as sketched below.
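
As a small illustration, the sketch below builds a tiny dataset by hand and writes it to a CSV
file with pandas; the records shown are just two rows of the demo dataset used later in this
document, not a real API result.

# Building a small dataset and saving it as a CSV file (illustrative records only)
import pandas as pd

records = [
    {"Country": "India", "Age": 38, "Salary": 68000, "Purchased": "No"},
    {"Country": "France", "Age": 43, "Salary": 45000, "Purchased": "Yes"},
]
df = pd.DataFrame(records)
df.to_csv("Dataset.csv", index=False)   # index=False keeps only the data columns in the file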

2) Importing Libraries
In order to perform data pre-processing using Python, we need to import some predefined
Python libraries, each used for a specific job. There are three libraries that we will use for
data pre-processing:

Numpy: The Numpy library is used to include any kind of mathematical operation in the code. It
is the fundamental package for scientific computing in Python, and it supports large,
multidimensional arrays and matrices. In Python, we can import it as:
import numpy as nm

Here we have used nm, which is a short name for Numpy, and it will be used in the whole
program.

Matplotlib: The second library is matplotlib, a Python 2D plotting library. With this library,
we need to import the pyplot sub-module, which is used to plot any type of chart for the code.
It will be imported as below:

import matplotlib.pyplot as mpt

Here we have used mpt as a short name for this library.

Pandas: The last library is Pandas, one of the most famous Python libraries, used for importing
and managing datasets. It is an open-source data manipulation and analysis library. It will be
imported as below:

import pandas as pd

Here, we have used pd as a short name for this library.

3) Importing the Datasets

Now we need to import the dataset which we have collected for our machine learning project.
But before importing a dataset, we need to set the current directory as the working directory.
To set a working directory in Spyder IDE, follow the steps below:

1. Save your Python file in the directory which contains the dataset.
2. Go to the File explorer option in Spyder IDE and select the required directory.
3. Press the F5 key or the Run option to execute the file.

Note: We can set any directory as the working directory, as long as it contains the required dataset.

Once the Python file has been saved alongside the required dataset, that folder becomes the
working directory.
read_csv() function:

Now, to import the dataset, we will use the read_csv() function of the pandas library, which
reads a CSV file and lets us perform various operations on it. Using this function, we can read
a CSV file locally as well as through a URL.

We can use the read_csv function as below:

data_set= pd.read_csv('Dataset.csv')

Here, data_set is the name of the variable that stores our dataset, and inside the function we
have passed the name of our dataset file. Once we execute the above line of code, the dataset
is imported into our code. We can inspect the imported dataset in Spyder by opening the
Variable Explorer section and double-clicking on data_set. Indexing starts from 0, which is the
default indexing in Python, and the display format of the dataset can be changed with the
format option.
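
Since read_csv() also accepts a URL, the same dataset could be loaded directly from the web.
The address below is only a placeholder, not a real dataset location.

# Reading a CSV file through a URL instead of a local path (placeholder address)
data_set = pd.read_csv("https://example.com/Dataset.csv")
print(data_set.shape)    # quick check of the number of rows and columns
print(data_set.head())   # first five rows of the imported dataset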

Extracting dependent and independent variables:

In machine learning, it is important to distinguish the matrix of features (independent
variables) from the dependent variable in the dataset. In our dataset, there are three
independent variables, Country, Age, and Salary, and one dependent variable, Purchased.

Extracting independent variable:

To extract the independent variables, we will use the iloc[ ] method of the Pandas library. It
is used to extract the required rows and columns from the dataset.

x= data_set.iloc[:,:-1].values

In the above code, the first colon(:) is used to take all the rows, and the second colon(:) is for
all the columns. Here we have used :-1, because we don't want to take the last column as it
contains the dependent variable. So by doing this, we will get the matrix of features.

By executing the above code, we will get output as:

[['India' 38.0 68000.0]
 ['France' 43.0 45000.0]
 ['Germany' 30.0 54000.0]
 ['France' 48.0 65000.0]
 ['Germany' 40.0 nan]
 ['India' 35.0 58000.0]
 ['Germany' nan 53000.0]
 ['France' 49.0 79000.0]
 ['India' 50.0 88000.0]
 ['France' 37.0 77000.0]]

As we can see in the above output, it contains only the three independent variables.

Extracting dependent variable:

To extract dependent variables, again, we will use Pandas .iloc[] method.

y= data_set.iloc[:,3].values

Here we have taken all the rows with the last column only. It will give the array of dependent
variables.

By executing the above code, we will get output as:

Output:

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
dtype=object)

Note: If you are using Python language for machine learning, then extraction is mandatory, but
for R language it is not required.

4) Handling Missing data:


The next step of data pre-processing is to handle missing data in the datasets. If our dataset
contains some missing data, then it may create a huge problem for our machine learning
model. Hence it is necessary to handle missing values present in the dataset.
Ways to handle missing data:

There are mainly two ways to handle missing data, which are:

By deleting the particular row: The first way is commonly used to deal with null values: we
simply delete the specific row or column which contains null values. However, this approach is
not very efficient, and removing data may lead to a loss of information that hurts the accuracy
of the output.

By calculating the mean: In this way, we calculate the mean of the column or row which contains
the missing value and put it in place of the missing value. This strategy is useful for features
that hold numeric data, such as age, salary, or year. Here, we will use this approach.

To handle missing values, we will use the Scikit-learn library in our code, which contains
various utilities for building machine learning models. Here we will use the Imputer class of
the sklearn.preprocessing module (available in older scikit-learn versions). Below is the code
for it:

#handling missing data (Replacing missing data with the mean value)
from sklearn.preprocessing import Imputer
imputer= Imputer(missing_values ='NaN', strategy='mean', axis = 0)
#Fitting imputer object to the independent variables x.
imputer= imputer.fit(x[:, 1:3])
#Replacing missing data with the calculated mean value
x[:, 1:3]= imputer.transform(x[:, 1:3])

Output:

array([['India', 38.0, 68000.0],
       ['France', 43.0, 45000.0],
       ['Germany', 30.0, 54000.0],
       ['France', 48.0, 65000.0],
       ['Germany', 40.0, 65222.22222222222],
       ['India', 35.0, 58000.0],
       ['Germany', 41.111111111111114, 53000.0],
       ['France', 49.0, 79000.0],
       ['India', 50.0, 88000.0],
       ['France', 37.0, 77000.0]], dtype=object)

As we can see in the above output, the missing values have been replaced with the mean of the
remaining values in each column.
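
Note that the Imputer class shown above comes from older scikit-learn releases; in newer
versions it has been removed and replaced by SimpleImputer from sklearn.impute. A roughly
equivalent sketch for current versions:

# Equivalent mean imputation with the newer scikit-learn API
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=nm.nan, strategy='mean')   # nm is the numpy alias used above
x[:, 1:3] = imputer.fit_transform(x[:, 1:3])                      # fill Age and Salary with column means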

5) Encoding Categorical data:

Categorical data is data that takes values from a set of categories; in our dataset, there are
two categorical variables, Country and Purchased.

Since a machine learning model works entirely on mathematics and numbers, a categorical
variable left as text may create trouble while building the model. So it is necessary to encode
these categorical variables into numbers.

For Country variable:

First, we will convert the Country values into numeric labels. To do this, we will use the
LabelEncoder() class from the preprocessing module.

#Categorical data
#for Country Variable
from sklearn.preprocessing import LabelEncoder
label_encoder_x= LabelEncoder()
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])

Output:

Out[15]:
array([[2, 38.0, 68000.0],
[0, 43.0, 45000.0],
[1, 30.0, 54000.0],
[0, 48.0, 65000.0],
[1, 40.0, 65222.22222222222],
[2, 35.0, 58000.0],
[1, 41.111111111111114, 53000.0],
[0, 49.0, 79000.0],
[2, 50.0, 88000.0],
[0, 37.0, 77000.0]], dtype=object)

Explanation:

In the above code, we have imported the LabelEncoder class of the sklearn library, which has
successfully encoded the categories into digits.

However, the Country column has three categories, and as we can see in the above output, they
are encoded as 0, 1, and 2. From these values, the machine learning model may assume that there
is some ordering or correlation between the countries, which would produce wrong output. To
remove this issue, we will use dummy encoding.

Dummy Variables:
Dummy variables take only the values 0 or 1: a 1 indicates the presence of that category in a
particular column, while the rest of the dummy columns hold 0. With dummy encoding, we get a
number of columns equal to the number of categories.

In our dataset, there are 3 country categories, so dummy encoding will produce three columns of
0 and 1 values. For dummy encoding, we will use the OneHotEncoder class of the preprocessing
module.

#for Country Variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
label_encoder_x= LabelEncoder()
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])
#Encoding for dummy variables
onehot_encoder= OneHotEncoder(categorical_features= [0])
x= onehot_encoder.fit_transform(x).toarray()

Output:

array([[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.80000000e+01,
6.80000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.30000000e+01,
4.50000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 3.00000000e+01,
5.40000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.80000000e+01,
6.50000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.00000000e+01,
6.52222222e+04],
[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.50000000e+01,
5.80000000e+04],
[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.11111111e+01,
5.30000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.90000000e+01,
7.90000000e+04],
[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 5.00000000e+01,
8.80000000e+04],
[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.70000000e+01,
7.70000000e+04]])

As we can see in the above output, the Country variable is encoded into 0 and 1 values spread
across three dummy columns. This can be seen more clearly in the Variable Explorer section by
clicking on the x variable.
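
The categorical_features argument used above was removed in later scikit-learn releases. In
current versions, the same dummy encoding is usually done with a ColumnTransformer; a rough
equivalent, assuming x still holds the Country labels in column 0:

# Equivalent dummy encoding with the newer scikit-learn API
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer([('country', OneHotEncoder(), [0])], remainder='passthrough')
x = ct.fit_transform(x)   # dummy columns for Country come first, Age and Salary follow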
For Purchased Variable:

labelencoder_y= LabelEncoder()
y= labelencoder_y.fit_transform(y)

For the second categorical variable, we only use the labelencoder object of the LabelEncoder
class. We are not using the OneHotEncoder class here because the Purchased variable has only
two categories, yes or no, which are automatically encoded into 0 and 1.

Output:

Out[17]: array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

It can also be inspected in the Variable Explorer section.

6) Splitting the Dataset into the Training set and Test set
In machine learning data pre-processing, we divide our dataset into a training set and test set.
This is one of the crucial steps of data pre-processing as by doing this, we can enhance the
performance of our machine learning model.

Suppose we train our machine learning model on one dataset and then test it on a completely
different dataset. The model will then have difficulty applying the correlations it learned.

If we train our model very well and its training accuracy is very high, but then give it a new
dataset, its performance will drop. So we always try to build a machine learning model that
performs well on the training set and also on the test dataset. We can define these datasets as:

Training Set: the subset of the dataset used to train the machine learning model; the outputs
for this subset are already known.

Test set: the subset of the dataset used to test the machine learning model; the model predicts
the outputs for this subset.

For splitting the dataset, we will use the below lines of code:

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)

Explanation:

o In the above code, the first line imports the function used to split arrays of the dataset
into random train and test subsets.
o In the second line, we use four variables for the output:
o x_train: features for the training data
o x_test: features for the testing data
o y_train: dependent variable for the training data
o y_test: dependent variable for the testing data
o In the train_test_split() function, we have passed four arguments, of which the first two are
the arrays of data, and test_size specifies the size of the test set. The test_size may be .5,
.3, or .2, which sets the split ratio between the training and testing sets.
o The last parameter, random_state, sets a seed for the random generator so that you always get
the same split; a commonly used value is 42.

Output:

By executing the above code, we get 4 different variables, which can be seen under the Variable
Explorer section: the x and y arrays are divided into 4 variables with their corresponding
values.

7) Feature Scaling
Feature scaling is the final step of data pre-processing in machine learning. It is a technique
to standardize the independent variables of the dataset within a specific range. In feature
scaling, we put our variables in the same range and on the same scale so that no variable
dominates the others.

Consider the Age and Salary columns of our dataset: the values are not on the same scale. Many
machine learning models are based on Euclidean distance, and if we do not scale the variables,
this difference in scale will cause problems for the model.

The Euclidean distance between two points A(x1, y1) and B(x2, y2) is given as:

distance(A, B) = sqrt((x2 - x1)^2 + (y2 - y1)^2)

If we compute distances using the raw Age and Salary values, the salary values will dominate
the age values and produce a distorted result. To remove this issue, we need to perform feature
scaling for machine learning.

There are two ways to perform feature scaling in machine learning:

Standardization

Normalization
Here, we will use the standardization method for our dataset.
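
For reference, standardization rescales each value as x' = (x - mean) / standard deviation, so
that each feature ends up with mean 0 and unit variance, while normalization (min-max scaling)
uses x' = (x - min) / (max - min), which maps the values into the range 0 to 1.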

For feature scaling, we will import StandardScaler class of sklearn.preprocessing library as:

from sklearn.preprocessing import StandardScaler

Now, we will create the object of StandardScaler class for independent variables or features.
And then we will fit and transform the training dataset.

st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)

For the test dataset, we directly apply the transform() function instead of fit_transform(),
because the scaler has already been fitted on the training set.

x_test= st_x.transform(x_test)

Output:

By executing the above lines of code, we get the scaled values for x_train and x_test. As the
output shows, the variables are scaled so that most values fall roughly between -1 and 1.

Note: Here, we have not scaled the dependent variable because it only takes the two values 0
and 1. But if the dependent variable had a wider range of values, we would need to scale it as
well.

Combining all the steps:

Now, in the end, we can combine all the steps together to make our complete code more
understandable.

# importing libraries
import numpy as nm
import matplotlib.pyplot as mpt
import pandas as pd

#importing datasets
data_set= pd.read_csv('Dataset.csv')

#Extracting Independent Variable
x= data_set.iloc[:, :-1].values

#Extracting Dependent variable
y= data_set.iloc[:, 3].values

#handling missing data(Replacing missing data with the mean value)
from sklearn.preprocessing import Imputer
imputer= Imputer(missing_values ='NaN', strategy='mean', axis = 0)

#Fitting imputer object to the independent variables x.
imputer= imputer.fit(x[:, 1:3])

#Replacing missing data with the calculated mean value
x[:, 1:3]= imputer.transform(x[:, 1:3])

#for Country Variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
label_encoder_x= LabelEncoder()
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])

#Encoding for dummy variables
onehot_encoder= OneHotEncoder(categorical_features= [0])
x= onehot_encoder.fit_transform(x).toarray()

#encoding for purchased variable
labelencoder_y= LabelEncoder()
y= labelencoder_y.fit_transform(y)

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)

#Feature Scaling of datasets
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)

In the above code, we have included all the data preprocessing steps together. But there are
some steps or lines of code which are not necessary for all machine learning models. So we
can exclude them from our code to make it reusable for all models.
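
For newer scikit-learn versions, where the Imputer class and the categorical_features argument
no longer exist, the combined steps can be sketched roughly as below. This is only a sketch
under those assumptions, using SimpleImputer and ColumnTransformer as substitutes, not a
drop-in replacement for the code above.

# A rough modern-scikit-learn sketch of the same preprocessing steps
import numpy as nm
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split

# importing the dataset and separating features from the target
data_set = pd.read_csv('Dataset.csv')
x = data_set.iloc[:, :-1].values
y = data_set.iloc[:, 3].values

# handling missing data (mean of the Age and Salary columns)
imputer = SimpleImputer(missing_values=nm.nan, strategy='mean')
x[:, 1:3] = imputer.fit_transform(x[:, 1:3])

# dummy encoding for the Country column, other columns passed through unchanged
ct = ColumnTransformer([('country', OneHotEncoder(), [0])], remainder='passthrough')
x = ct.fit_transform(x)

# encoding the Purchased variable into 0 and 1
y = LabelEncoder().fit_transform(y)

# splitting the dataset and scaling the features
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)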

Noisy data:

          Noisy generally means random error or unnecessary data points. Removing noise from a
data set is termed data smoothing. Here are some of the methods to handle noisy data.

 Binning: This method is used to smooth or handle noisy data. First the data is sorted, then
the sorted values are separated and stored in the form of bins. There are three methods for
smoothing the data in a bin. Smoothing by bin mean: the values in the bin are replaced by the
mean value of the bin. Smoothing by bin median: the values in the bin are replaced by the
median value. Smoothing by bin boundary: the minimum and maximum values of the bin are taken as
boundaries, and each value is replaced by the closest boundary value. (A short Python sketch of
bin-mean smoothing follows this list.)

Examples:
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

Partition using equal frequency approach:

- Bin 1 : 4, 8, 9, 15

- Bin 2 : 21, 21, 24, 25

- Bin 3 : 26, 28, 29, 34

Smoothing by bin means:

- Bin 1: 9, 9, 9, 9

- Bin 2: 23, 23, 23, 23

- Bin 3: 29, 29, 29, 29

Smoothing by bin boundaries:

- Bin 1: 4, 4, 4, 15

- Bin 2: 21, 21, 25, 25

- Bin 3: 26, 26, 26, 34

Smoothing by bin median:

- Bin 1: 9, 9, 9, 9

- Bin 2: 23, 23, 23, 23

- Bin 3: 29, 29, 29, 29

Approach:
Sort the values of the given data set.

Divide the range into N intervals, each containing approximately the same number of samples
(equal-depth partitioning).

Replace the values in each bin with the bin mean, median, or boundary values.

 Regression: This is used to smooth the data and helps handle data when unnecessary noise is
present. For analysis purposes, regression helps decide the suitable variable. Linear
regression finds the best-fitting line between two variables so that one can be used to predict
the other; multiple linear regression involves more than two variables. Using regression to
find a mathematical equation that fits the data helps to smooth out the noise.

 Clustering: This is used for finding the outliers and also in grouping the data. Clustering
is generally used in unsupervised learning.
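
To make the binning approach above concrete, here is a minimal Python sketch of equal-frequency
binning with bin-mean smoothing, applied to the sorted price data from the example:

# Equal-frequency binning with bin-mean smoothing (price data from the example above)
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = np.split(prices, 3)   # three bins of four values each (equal-depth partitioning)
smoothed = np.concatenate([np.full(len(b), int(round(b.mean()))) for b in bins])
print(smoothed)   # -> 9 9 9 9 23 23 23 23 29 29 29 29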

Data integration:

          Data integration is the process of combining data from multiple sources into a single
dataset. It is one of the main components of data management. There are some problems to be
considered during data integration.

 Schema integration: integrating metadata (data that describes other data) from different
sources.
 Entity identification problem: identifying entities across multiple databases. For example,
the system or the user should know that student_id in one database and student_name in another
database belong to the same entity.
 Detecting and resolving data value conflicts: data taken from different databases may differ
when merged. For example, attribute values in one database may differ from those in another,
such as the date format being "MM/DD/YYYY" in one and "DD/MM/YYYY" in the other.

Data reduction:

         This process reduces the volume of the data, which makes analysis easier yet produces
the same or almost the same result. It also helps to reduce storage space. Some of the
techniques used in data reduction are dimensionality reduction, numerosity reduction, and data
compression.

 Dimensionality Reduction
Whenever we encounter weakly relevant data, we keep only the attributes required for our analysis.
Dimensionality reduction eliminates attributes from the data set under consideration, thereby reducing
the volume of the original data. It reduces data size by eliminating outdated or redundant features. Here
are three methods of dimensionality reduction.

1. Wavelet Transform: In the wavelet transform, a data vector A is transformed into a
numerically different data vector A' such that both A and A' are of the same length. It is
useful for reducing data because the result of the wavelet transform can be truncated: the
compressed data is obtained by retaining only the smallest fragment of the strongest wavelet
coefficients. Wavelet transforms can be applied to data cubes, sparse data, or skewed data.
2. Principal Component Analysis: Suppose we have a data set whose tuples are described by n
attributes. Principal component analysis searches for k orthogonal vectors (the principal
components, with k ≤ n) that can best represent the data set.
In this way, the original data can be projected onto a much smaller space, and dimensionality
reduction is achieved. Principal component analysis can be applied to sparse and skewed data.
(A small scikit-learn sketch follows this list.)
3. Attribute Subset Selection: A large data set has many attributes, some of which are
irrelevant to data mining and some of which are redundant. Attribute subset selection reduces
the data volume and dimensionality by eliminating these redundant and irrelevant attributes,
while ensuring that we still get a good subset of the original attributes: the resulting
probability distribution of the data is as close as possible to the distribution obtained using
all the attributes.
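
As a small illustration of principal component analysis (method 2 above), here is a minimal
scikit-learn sketch; the data is random and purely for demonstration:

# Reducing 5 attributes to 2 principal components with scikit-learn (illustrative random data)
import numpy as np
from sklearn.decomposition import PCA

data = np.random.rand(100, 5)          # 100 tuples described by 5 attributes
pca = PCA(n_components=2)
reduced = pca.fit_transform(data)      # the same 100 tuples, now described by 2 components
print(reduced.shape)                   # (100, 2)
print(pca.explained_variance_ratio_)   # share of the variance kept by each component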

2. Numerosity Reduction

Numerosity reduction reduces the original data volume and represents it in a much smaller form. This
technique includes two types: parametric and non-parametric numerosity reduction.

1. Parametric: Parametric numerosity reduction stores only the model parameters instead of the
original data. One method of parametric numerosity reduction is the regression and log-linear
method.

o Regression and Log-Linear: Linear regression models the relationship between two
attributes by fitting a linear equation to the data set. Suppose we need to model a
linear function between two attributes:
y = wx + b
Here, y is the response attribute and x is the predictor attribute. In data mining terms,
attribute x and attribute y are numeric database attributes, whereas w and b are regression
coefficients.
Multiple linear regression models the response variable y as a linear function of two
or more predictor variables.
A log-linear model discovers the relation between two or more discrete attributes in the
database. Suppose we have a set of tuples presented in n-dimensional space; the log-linear
model is then used to study the probability of each tuple in that multidimensional space.
Regression and log-linear methods can be used for sparse data and skewed data.
2. Non-Parametric: A non-parametric numerosity reduction technique does not assume any model.
It gives a more uniform reduction, irrespective of data size, but may not achieve as high a
volume of reduction as the parametric approach. Common non-parametric data reduction techniques
include histograms, clustering, sampling, data cube aggregation, and data compression.

o Histogram: A histogram is a graph that represents a frequency distribution, describing how
often each value appears in the data. It uses the binning method to represent an attribute's
data distribution, with disjoint subsets called bins or buckets.
A histogram can represent dense, sparse, uniform, or skewed data. Instead of only one
attribute, a histogram can also be built over multiple attributes; it can effectively represent
up to five attributes.
o Clustering: Clustering techniques group similar objects from the data so that the objects in
a cluster are similar to each other but dissimilar to objects in other clusters. How similar
the objects inside a cluster are can be calculated with a distance function: the more similar
two objects are, the closer they appear in the cluster.
The quality of a cluster depends on its diameter, i.e., the maximum distance between any two
objects in the cluster.
The cluster representation then replaces the original data. This technique is most effective
when the data can be divided into distinct clusters.
o Sampling: One of the methods used for data reduction is sampling, as it can reduce a large
data set into a much smaller data sample. Below we discuss the different ways of sampling a
large data set D containing N tuples (a short pandas sketch follows this list):

1. Simple random sample without replacement (SRSWOR) of size s: s tuples are drawn from the N
tuples in data set D, with s < N. The probability of drawing any tuple from D is 1/N, which
means all tuples have an equal probability of being sampled.
2. Simple random sample with replacement (SRSWR) of size s: It is similar to the
SRSWOR, but the tuple is drawn from data set D, is recorded, and then replaced
into the data set D so that it can be drawn again.

3. Cluster sample: The tuples in data set D are clustered into M mutually disjoint
subsets. The data reduction can be applied by implementing SRSWOR on these
clusters. A simple random sample of size s could be generated from these clusters
where s<M.
4. Stratified sample: The large data set D is partitioned into mutually disjoint sets
called 'strata'. A simple random sample is taken from each stratum to get
stratified data. This method is effective for skewed data.
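
A minimal pandas sketch of the simple random sampling variants above; the DataFrame D here is
hypothetical:

# Simple random sampling with and without replacement using pandas
import pandas as pd

D = pd.DataFrame({'value': range(1000)})                  # a data set with N = 1000 tuples
srswor = D.sample(n=100, random_state=0)                  # SRSWOR: without replacement, s < N
srswr = D.sample(n=100, replace=True, random_state=0)     # SRSWR: tuples can be drawn again
# A stratified sample could be drawn by grouping D on a stratum column and sampling each group.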

3. Data Cube Aggregation

This technique is used to aggregate data in a simpler form. Data Cube Aggregation is a multidimensional
aggregation that uses aggregation at various levels of a data cube to represent the original data set, thus
achieving data reduction.

For example, suppose you have the data of All Electronics sales per quarter for the year 2018 to the year
2022. If you want to get the annual sale per year, you just have to aggregate the sales per quarter for each
year. In this way, aggregation provides you with the required data, which is much smaller in size, and
thereby we achieve data reduction even without losing any data.
The data cube aggregation is a multidimensional aggregation that eases multidimensional
analysis. The data cube presents precomputed, summarized data, which gives data mining fast
access to it.
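
A small pandas sketch of this kind of aggregation is shown below; the column names and sales
figures are assumptions for illustration, not data from the example above.

# Aggregating quarterly sales into annual sales (illustrative column names and figures)
import pandas as pd

sales = pd.DataFrame({
    'year':    [2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019],
    'quarter': ['Q1', 'Q2', 'Q3', 'Q4', 'Q1', 'Q2', 'Q3', 'Q4'],
    'amount':  [200, 250, 300, 280, 220, 260, 310, 290],
})
annual = sales.groupby('year')['amount'].sum()   # one summarized row per year instead of four
print(annual)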

4. Data Compression

Data compression modifies, encodes, or converts the structure of data in a way that consumes
less space. It builds a compact representation of information by removing redundancy and
representing data in binary form. Compression from which the original data can be restored
exactly is called lossless compression; in contrast, compression from which the original form
cannot be fully restored is called lossy compression. Dimensionality and numerosity reduction
methods are also used for data compression.
This technique reduces the size of files using different encoding mechanisms, such as Huffman
encoding and run-length encoding. Based on their compression techniques, we can divide them
into two types.

1. Lossless Compression: Encoding techniques (Run Length Encoding) allow a simple and minimal
data size reduction. Lossless data compression uses algorithms to restore the precise original data
from the compressed data.
2. Lossy Compression: In lossy data compression, the decompressed data may differ from the
original data but is still useful enough to retrieve information from. For example, the JPEG
image format uses lossy compression, yet we can recover meaning equivalent to the original
image. Methods such as the discrete wavelet transform and PCA (principal component analysis)
are examples of this kind of compression.
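
To make run-length encoding concrete, here is a minimal lossless sketch, not tied to any
particular compression library:

# A minimal run-length encoding sketch: consecutive repeats become (value, count) pairs
from itertools import groupby

def run_length_encode(text):
    return [(ch, len(list(group))) for ch, group in groupby(text)]

def run_length_decode(pairs):
    return ''.join(ch * count for ch, count in pairs)

encoded = run_length_encode("AAAABBBCCD")
print(encoded)                      # [('A', 4), ('B', 3), ('C', 2), ('D', 1)]
print(run_length_decode(encoded))   # "AAAABBBCCD" restored exactly, hence lossless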

5. Discretization Operation

The data discretization technique divides attributes of a continuous nature into data with
intervals. We replace the many constant values of an attribute with labels for small intervals,
so that mining results are shown in a concise and easily understandable way.

1. Top-down discretization: If you first pick one or a few points (so-called breakpoints or
split points) to divide the whole range of attribute values, and then repeat this on the
resulting parts until the end, the process is known as top-down discretization, also called
splitting.
2. Bottom-up discretization: If you first consider all the constant values as split points and
then discard some of them by merging neighboring values into intervals, the process is called
bottom-up discretization, also called merging.
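
A minimal pandas sketch of discretization is shown below; the ages come from the demo dataset
earlier in this document, while the interval boundaries and labels are illustrative assumptions:

# Discretizing a continuous Age attribute into labeled intervals with pandas.cut
import pandas as pd

ages = pd.Series([30, 35, 37, 38, 40, 43, 48, 49, 50])
labels = pd.cut(ages, bins=[0, 35, 45, 100], labels=['young', 'middle', 'senior'])
print(labels.value_counts())   # how many values fall into each interval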

Data Transformation:

       The change made in the format or the structure of the data is called data transformation.
This step can be simple or complex based on the requirements. There are some methods in
data transformation.

 Smoothing: With the help of algorithms, we can remove noise from the dataset, which helps
highlight the important features of the dataset. Smoothing lets us spot even small changes that
help with prediction.
 Aggregation: In this method, the data is stored and presented in the form of a summary; data
coming from multiple sources is integrated for data analysis. This is an important step, since
the accuracy of the results depends on the quantity and quality of the data: when both are
good, the results are more relevant.
 Discretization: The continuous data is split into intervals, which reduces the data size. For
example, rather than specifying the exact class time, we can use an interval like (3 pm-5 pm,
6 pm-8 pm).
 Normalization: This is the method of scaling the data so that it can be represented in a
smaller range, for example from -1.0 to 1.0, as in the sketch below.
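
A minimal sketch of normalization with scikit-learn's MinMaxScaler, rescaling the salary values
from the demo dataset into the range -1.0 to 1.0:

# Min-max normalization of salary values into the range [-1, 1]
from sklearn.preprocessing import MinMaxScaler

salaries = [[68000.0], [45000.0], [54000.0], [65000.0], [58000.0]]
scaler = MinMaxScaler(feature_range=(-1, 1))
print(scaler.fit_transform(salaries))   # 45000 maps to -1.0, 68000 maps to 1.0, others in between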
