Data Preprocessing in Machine Learning
Data pre-processing is the process of preparing raw data and making it suitable for a machine learning model. It is the first and crucial step in creating a machine learning model.
When creating a machine learning project, we do not always come across clean and formatted data. Before performing any operation on data, it is essential to clean it and put it in a structured format. For this, we use the data pre-processing step.
Preprocessing mainly aims to ensure data quality, which is addressed through the following tasks:
1. Data cleaning
2. Data integration
3. Data reduction
4. Data transformation
Data cleaning:
Data cleaning is the process of removing incorrect, incomplete, and inaccurate data from a dataset and replacing missing values. Several techniques are used in data cleaning, such as handling missing values and noisy data; both are discussed below.
1) Get the Dataset
Datasets come in different formats for different purposes. For example, the dataset needed to build a machine learning model for a business problem is different from the dataset required for a liver-patient problem, so each dataset differs from the others. To use a dataset in our code, we usually put it into a CSV file, although sometimes we may also need to use an HTML or xlsx file.
Here we will use a demo dataset for data preprocessing; for practice, it can be downloaded from https://www.superdatascience.com/pages/machine-learning. For real-world problems, we can download datasets online from various sources such as https://www.kaggle.com/uciml/datasets and https://archive.ics.uci.edu/ml/index.php.
We can also create our own dataset by gathering data through various APIs with Python and putting that data into a .csv file.
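As a rough sketch of this idea (the API endpoint below is only a hypothetical placeholder, and the requests library is assumed to be installed), gathering JSON records and saving them as a CSV file could look like this:
import requests
import pandas as pd

# Hypothetical endpoint: replace it with a real API that returns a list of JSON records.
response = requests.get("https://api.example.com/records", timeout=10)
records = response.json()                 # parse the JSON payload into Python objects

# Put the records into a DataFrame and write them to a .csv file.
pd.DataFrame(records).to_csv("Dataset.csv", index=False)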
2) Importing Libraries
In order to perform data pre-processing using Python, we need to import some predefined Python libraries, each of which performs a specific job. There are three specific libraries that we will use for data pre-processing:
Numpy: The Numpy library is used to include any kind of mathematical operation in the code. It is the fundamental package for scientific computing in Python and supports large, multi-dimensional arrays and matrices. In Python, we can import it as:
import numpy as nm
Here we have used nm as a short alias for Numpy, and it will be used throughout the program.
Matplotlib: The second library is Matplotlib, a Python 2D plotting library whose pyplot sub-package we will use to plot charts of the data. It is imported as:
import matplotlib.pyplot as mtp
Pandas: The last library is Pandas, one of the most popular Python libraries, used for importing and managing datasets. It is an open-source data manipulation and analysis library. It is imported as below:
import pandas as pd
Here, we have used pd as a short alias for this library.
3) Importing the Datasets
Now we need to import the dataset collected for our machine learning project.
Note: We can set any directory as the working directory, but it must contain the required dataset. The Python file is placed in the same folder as the required dataset, and that folder is then set as the working directory.
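As a minimal sketch (the folder path below is only an illustrative placeholder), the working directory can be checked and changed with Python's built-in os module:
import os

print(os.getcwd())                 # show the current working directory
os.chdir(r"C:\MachineLearning")    # hypothetical folder that contains Dataset.csv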
read_csv() function:
Now, to import the dataset, we will use the read_csv() function of the pandas library, which reads a csv file and allows various operations to be performed on it. Using this function, we can read a csv file locally as well as through a URL.
data_set= pd.read_csv('Dataset.csv')
Here, data_set is the name of the variable that stores our dataset, and inside the function we have passed the name of our dataset file. Once we execute the above line of code, the dataset is imported into our code. We can also inspect the imported dataset in the Variable Explorer section by double-clicking on data_set.
Indexing starts from 0, which is the default indexing in Python. We can also change the display format of our dataset by clicking on the Format option.
Extracting the independent variables: To extract the independent variables (the matrix of features), we use the iloc[] method of the Pandas dataframe:
x= data_set.iloc[:,:-1].values
In the above code, the first colon (:) takes all the rows, and the second colon (:) takes all the columns. We have used :-1 because we do not want to include the last column, since it contains the dependent variable. By doing this, we get the matrix of features, which contains only the three independent columns.
Extracting the dependent variable: To extract the dependent variable, we again use the iloc[] method:
y= data_set.iloc[:,3].values
Here we have taken all the rows with the last column only. It will give the array of dependent
variables.
Output:
array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
dtype=object)
Note: If you are using Python language for machine learning, then extraction is mandatory, but
for R language it is not required.
4) Handling Missing Data
There are mainly two ways to handle missing data:
By deleting the particular row: The first way is commonly used to deal with null values: we simply delete the specific row or column that contains null values. However, this approach is not very efficient, and removing data may lead to a loss of information, which will not give accurate output.
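As a rough sketch of this deletion approach with pandas (assuming the data is still held in the data_set dataframe created above), it could be done as follows:
# Drop every row that contains at least one missing value.
data_no_missing_rows = data_set.dropna(axis=0)

# Alternatively, drop every column that contains missing values.
data_no_missing_cols = data_set.dropna(axis=1)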
By calculating the mean: In this way, we calculate the mean of the column or row that contains the missing values and put it in place of the missing value. This strategy is useful for features that have numeric data, such as age, salary, year, etc. Here, we will use this approach.
#handling missing data (Replacing missing data with the mean value)
from sklearn.impute import SimpleImputer   # SimpleImputer replaces the older, removed Imputer class
imputer= SimpleImputer(missing_values=nm.nan, strategy='mean')
#Fitting imputer object to the independent variables x.
imputer= imputer.fit(x[:, 1:3])
#Replacing missing data with the calculated mean value
x[:, 1:3]= imputer.transform(x[:, 1:3])
Output:
As we can see in the above output, the missing values have been replaced with the mean of the remaining values of each column.
5) Encoding Categorical Data
A machine learning model works entirely on mathematics and numbers, so if our dataset has categorical variables, it may create trouble while building the model. It is therefore necessary to encode these categorical variables into numbers.
First, we will convert the country names into numbers. To do this, we will use the LabelEncoder() class from the preprocessing library.
#Categorical data
#for Country Variable
from sklearn.preprocessing import LabelEncoder
label_encoder_x= LabelEncoder()
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])
Output:
array([[2, 38.0, 68000.0],
[0, 43.0, 45000.0],
[1, 30.0, 54000.0],
[0, 48.0, 65000.0],
[1, 40.0, 65222.22222222222],
[2, 35.0, 58000.0],
[1, 41.111111111111114, 53000.0],
[0, 49.0, 79000.0],
[2, 50.0, 88000.0],
[0, 37.0, 77000.0]], dtype=object)
Explanation:
In our case, the Country column has three categories, and as we can see in the above output, they have been encoded into 0, 1, and 2. From these values, the machine learning model may assume that there is some order or correlation between these countries, which would produce wrong output. To remove this issue, we use dummy encoding.
Dummy Variables:
Dummy variables are variables that take only the values 0 or 1. A value of 1 indicates the presence of a category in a particular row, while the remaining dummy columns are 0. With dummy encoding, we get a number of columns equal to the number of categories. Since our dataset has 3 country categories, it will produce three columns containing 0 and 1 values.
For dummy encoding, we will use the OneHotEncoder class of the preprocessing library, applied through ColumnTransformer (in recent scikit-learn versions, OneHotEncoder no longer takes a categorical_features argument, so the column to encode is selected via ColumnTransformer instead).
#for Country Variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
label_encoder_x= LabelEncoder()
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])
#Encoding for dummy variables (one-hot encode column 0, pass the other columns through unchanged)
onehot_encoder= ColumnTransformer([('encoder', OneHotEncoder(), [0])], remainder='passthrough')
x= onehot_encoder.fit_transform(x)
Output:
As we can see in the above output, the country values are encoded into 0s and 1s and divided into three columns.
This can be seen more clearly in the Variable Explorer section by clicking on the x variable.
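As a side note, pandas also offers a one-line alternative for dummy encoding; a minimal sketch (assuming the raw country column in data_set is named 'Country', which is an assumption about the demo file) is:
# pd.get_dummies creates one 0/1 column per category of the 'Country' column.
country_dummies = pd.get_dummies(data_set['Country'])
print(country_dummies.head())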
For Purchased Variable:
labelencoder_y= LabelEncoder()
y= labelencoder_y.fit_transform(y)
For the second categorical variable, we only use the labelencoder_y object of the LabelEncoder class. We do not use the OneHotEncoder class here because the Purchased variable has only two categories, yes or no, which are automatically encoded into 0 and 1.
Output:
6) Splitting the Dataset into Training and Test Set
Suppose we have trained our machine learning model on one dataset and then test it with a completely different dataset. This will make it difficult for the model to understand the correlations between the variables.
If we train our model very well and its training accuracy is very high, but its performance drops when we give it a new dataset, the model does not generalize. So we always try to build a machine learning model that performs well with the training set and also with the test dataset. Here, we can define these datasets as:
Training set: A subset of the dataset used to train the machine learning model; we already know the output for it.
Test set: A subset of the dataset used to test the machine learning model; the model predicts the output for the test set.
For splitting the dataset, we will use the below lines of code:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)
Explanation:
o In the above code, the first line is used for splitting arrays of the dataset into random
train and test subsets.
o In the second line, we have used four variables for the output, which are:
o x_train: features of the training data
o x_test: features of the test data
o y_train: dependent variable of the training data
o y_test: dependent variable of the test data
o In the train_test_split() function, we have passed four parameters, the first two of which are the data arrays, while test_size specifies the size of the test set. The test_size may be 0.5, 0.3, or 0.2, which gives the split ratio between the training and test sets.
o The last parameter, random_state, sets a seed for the random generator so that you always get the same split; a commonly used value is 42.
Output:
By executing the above code, we get 4 different variables, which can be seen under the Variable Explorer section. The x and y arrays are divided into these 4 variables with their corresponding values.
7) Feature Scaling
Feature scaling is the final step of data pre-processing in machine learning. It is a technique to standardize the independent variables of the dataset within a specific range. In feature scaling, we put our variables on the same scale so that no single variable dominates the others. There are two main ways to perform feature scaling:
Standardization: rescales each feature so that it has a mean of 0 and a standard deviation of 1.
Normalization: rescales each feature into a fixed range, typically [0, 1].
Here, we will use the standardization method for our dataset.
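Before doing so, here is a minimal numpy sketch of the two formulas, standardization (x - mean) / standard deviation and min-max normalization (x - min) / (max - min), applied to a made-up salary column:
import numpy as nm

salary = nm.array([45000.0, 54000.0, 58000.0, 65000.0, 79000.0])       # made-up values

standardized = (salary - salary.mean()) / salary.std()                 # mean 0, standard deviation 1
normalized = (salary - salary.min()) / (salary.max() - salary.min())   # values in the range [0, 1]

print(standardized)
print(normalized)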
from sklearn.preprocessing import StandardScaler
Now, we will create an object of the StandardScaler class for the independent variables (features), and then fit and transform the training dataset.
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
For the test dataset, we apply only the transform() function instead of fit_transform(), because the scaler has already been fitted on the training set:
x_test= st_x.transform(x_test)
Output:
By executing the above lines of code, we get the scaled values for x_train and x_test. As we can see in the output, all the features are now on a comparable scale, centred around zero.
Note: Here, we have not scaled the dependent variable because it has only the two values 0 and 1. But if the dependent variable has a wider range of values, it will also need to be scaled.
Now, in the end, we can combine all the steps together to make our complete code more
understandable.
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('Dataset.csv')

#Extracting Independent Variable
x= data_set.iloc[:, :-1].values

#Extracting Dependent variable
y= data_set.iloc[:, 3].values

#handling missing data (Replacing missing data with the mean value)
from sklearn.impute import SimpleImputer
imputer= SimpleImputer(missing_values=nm.nan, strategy='mean')

#Fitting imputer object to the independent variables x.
imputer= imputer.fit(x[:, 1:3])

#Replacing missing data with the calculated mean value
x[:, 1:3]= imputer.transform(x[:, 1:3])

#for Country Variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
label_encoder_x= LabelEncoder()
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])

#Encoding for dummy variables
onehot_encoder= ColumnTransformer([('encoder', OneHotEncoder(), [0])], remainder='passthrough')
x= onehot_encoder.fit_transform(x)

#encoding for purchased variable
labelencoder_y= LabelEncoder()
y= labelencoder_y.fit_transform(y)

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)

#Feature Scaling of datasets
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
In the above code, we have included all the data preprocessing steps together. But there are
some steps or lines of code which are not necessary for all machine learning models. So we
can exclude them from our code to make it reusable for all models.
Noisy data:
Noisy data is data that contains random errors or meaningless values. It can be handled in the following ways.
Binning: This method is used to smooth or handle noisy data. First, the data is sorted, and then the sorted values are separated and stored in the form of bins. There are three methods for smoothing the data in a bin:
Smoothing by bin mean: the values in the bin are replaced by the mean value of the bin.
Smoothing by bin median: the values in the bin are replaced by the median value of the bin.
Smoothing by bin boundary: the minimum and maximum values of the bin are taken as the bin boundaries, and each value in the bin is replaced by the closest boundary value.
Example:
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into equal-depth bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
Smoothing by bin means, Bin 1 becomes: 9, 9, 9, 9
Smoothing by bin boundaries, Bin 1 becomes: 4, 4, 4, 15
Approach:
Sort the array of the given data set.
Divide the range into N intervals, each containing approximately the same number of samples (equal-depth partitioning), then smooth the values within each bin (a small code sketch is given below).
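A small pandas sketch of equal-depth binning with smoothing by bin means, using the price values from the example above:
import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-depth (equal-frequency) partitioning into 3 bins.
bin_labels = pd.qcut(prices, q=3, labels=False)

# Smoothing by bin means: replace every value with the mean of its bin.
smoothed = prices.groupby(bin_labels).transform('mean')
print(smoothed.tolist())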
Regression: This is used to smooth the data and helps handle data when unnecessary (noisy) data is present. For analysis purposes, regression helps decide which variable is suitable. Linear regression refers to finding the best line to fit between two variables so that one can be used to predict the other. Multiple linear regression involves more than two variables. Using regression to find a mathematical equation that fits the data helps to smooth out the noise.
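A minimal scikit-learn sketch of regression-based smoothing, fitting a line to made-up noisy values and replacing them with the fitted values:
import numpy as nm
from sklearn.linear_model import LinearRegression

# Made-up noisy observations that roughly follow a straight line.
x_vals = nm.arange(10).reshape(-1, 1)
y_vals = nm.array([1.0, 2.9, 5.2, 6.8, 9.1, 10.9, 13.2, 14.8, 17.1, 18.9])

# Fit the regression line and use its predictions as the smoothed values.
model = LinearRegression().fit(x_vals, y_vals)
y_smoothed = model.predict(x_vals)
print(y_smoothed)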
Clustering: This is used for finding outliers and for grouping the data. Clustering is generally used in unsupervised learning.
Data integration:
The process of combining data from multiple sources into a single dataset. Data integration is one of the main components of data management. There are some problems to be considered during data integration.
Schema integration: Integrating metadata (a set of data that describes other data) from different sources.
Entity identification problem: Identifying the same entities across multiple databases. For example, the system or the user should know that student_id in one database and student_name in another database belong to the same entity.
Detecting and resolving data value conflicts: The data taken from different databases may differ when merged; for instance, the attribute values in one database may differ from those in another. For example, the date format may differ, such as "MM/DD/YYYY" versus "DD/MM/YYYY".
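As a small illustration of resolving such a conflict with pandas (the date values below are made up), columns coming from two sources in different formats can be parsed into one common datetime representation:
import pandas as pd

us_dates = pd.Series(["03/14/2022", "07/02/2022"])   # MM/DD/YYYY source
eu_dates = pd.Series(["14/03/2022", "02/07/2022"])   # DD/MM/YYYY source

# Parse each column with its own format so that both end up as the same datetime type.
merged = pd.concat([pd.to_datetime(us_dates, format="%m/%d/%Y"),
                    pd.to_datetime(eu_dates, format="%d/%m/%Y")])
print(merged)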
Data reduction:
This process helps reduce the volume of the data, which makes analysis easier while producing the same or almost the same results. It also helps reduce storage space. Some of the techniques used for data reduction are dimensionality reduction, numerosity reduction, and data compression.
1. Dimensionality Reduction
Whenever we encounter weakly important data, we keep only the attributes required for our analysis. Dimensionality reduction eliminates attributes from the data set under consideration, thereby reducing the volume of the original data. It reduces the data size by eliminating outdated or redundant features. Here are three methods of dimensionality reduction.
1. Wavelet Transform: In the wavelet transform, a data vector A is transformed into a numerically different data vector A' such that both A and A' are of the same length. It is useful for data reduction because the data obtained from the wavelet transform can be truncated: the compressed data is obtained by retaining only a small fragment of the strongest wavelet coefficients. The wavelet transform can be applied to data cubes, sparse data, or skewed data.
2. Principal Component Analysis: Suppose we have a data set to be analyzed that has tuples with n attributes. Principal component analysis searches for k orthogonal vectors (the principal components, with k ≤ n) that can best represent the data set. In this way, the original data can be cast onto a much smaller space, and dimensionality reduction is achieved. Principal component analysis can be applied to sparse and skewed data. (A short code sketch is given after this list.)
3. Attribute Subset Selection: A large data set has many attributes, some of which are irrelevant to data mining and some of which are redundant. Attribute subset selection reduces the data volume and dimensionality by eliminating such redundant and irrelevant attributes. It ensures that we still get a good subset of the original attributes after eliminating the unwanted ones: the resulting probability distribution of the data is as close as possible to the original distribution obtained using all the attributes.
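A minimal scikit-learn sketch of principal component analysis (method 2 above), reducing a made-up data set with n = 3 attributes to k = 2 components:
import numpy as nm
from sklearn.decomposition import PCA

# Made-up data set: 5 tuples with 3 attributes each.
data = nm.array([[38.0, 68000.0, 2.0],
                 [43.0, 45000.0, 1.0],
                 [30.0, 54000.0, 3.0],
                 [48.0, 65000.0, 2.0],
                 [40.0, 61000.0, 1.0]])

# Project the data onto the 2 strongest principal components.
pca = PCA(n_components=2)
reduced = pca.fit_transform(data)
print(reduced.shape)    # (5, 2): the same tuples, represented in fewer dimensions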
2. Numerosity Reduction
Numerosity reduction reduces the original data volume and represents it in a much smaller form. This technique includes two types: parametric and non-parametric numerosity reduction.
Sampling is a common non-parametric technique; two examples are:
Cluster sample: The tuples in data set D are clustered into M mutually disjoint subsets. Data reduction can then be applied by taking a simple random sample without replacement (SRSWOR) of s of these clusters, where s < M.
Stratified sample: The large data set D is partitioned into mutually disjoint sets called 'strata'. A simple random sample is taken from each stratum to obtain stratified data. This method is effective for skewed data.
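A short pandas sketch of stratified sampling (the 'region' strata column and the sampling fraction are illustrative assumptions):
import pandas as pd

# Made-up data set D with a skewed 'region' attribute used as the strata.
D = pd.DataFrame({'region': ['north'] * 8 + ['south'] * 2,
                  'sales':  [10, 12, 11, 13, 9, 14, 10, 12, 50, 55]})

# Take a simple random sample from each stratum (here 50% of every region).
stratified_sample = D.groupby('region').sample(frac=0.5, random_state=0)
print(stratified_sample)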
3. Data Cube Aggregation
This technique is used to aggregate data into a simpler form. Data cube aggregation is a multidimensional aggregation that uses aggregation at various levels of a data cube to represent the original data set, thus achieving data reduction.
For example, suppose you have the sales data of All Electronics per quarter for the years 2018 to 2022. If you want the annual sales per year, you just have to aggregate the sales per quarter for each year. In this way, aggregation provides you with the required data, which is much smaller in size, and we achieve data reduction without losing any information.
Data cube aggregation eases multidimensional analysis: the data cube contains precomputed and summarized data, which gives data mining fast access to it.
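A small pandas sketch of the quarterly-to-annual aggregation described above (the sales figures are made up):
import pandas as pd

# Made-up quarterly sales for two of the years.
sales = pd.DataFrame({'year':    [2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019],
                      'quarter': ['Q1', 'Q2', 'Q3', 'Q4', 'Q1', 'Q2', 'Q3', 'Q4'],
                      'amount':  [200, 250, 220, 300, 210, 260, 230, 320]})

# Aggregate the quarterly figures into annual sales: a much smaller summary of the same data.
annual_sales = sales.groupby('year')['amount'].sum()
print(annual_sales)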
4. Data Compression
Data compression modifies, encodes, or converts the structure of data in a way that consumes less space. It builds a compact representation of information by removing redundancy and representing data in binary form. Compression from which the original data can be restored exactly is called lossless compression, while compression from which the original form cannot be fully restored is called lossy compression. Dimensionality reduction and numerosity reduction methods are also used for data compression.
This technique reduces the size of files using different encoding mechanisms, such as Huffman encoding and run-length encoding. Based on the compression technique, we can divide it into two types:
1. Lossless Compression: Encoding techniques such as run-length encoding allow a simple and minimal reduction in data size. Lossless data compression uses algorithms that restore the precise original data from the compressed data.
2. Lossy Compression: In lossy compression, the decompressed data may differ from the original data but is still useful enough to retrieve information from. For example, the JPEG image format uses lossy compression, but we can still recognise the meaning of the original image. Methods such as the discrete wavelet transform and PCA (principal component analysis) are examples of this kind of compression.
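A tiny sketch of lossless compression with Python's built-in zlib module, showing that the original bytes can be restored exactly:
import zlib

original = b"AAAAABBBBBCCCCC" * 20          # highly redundant data compresses well
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), len(compressed))        # the compressed form is much smaller
print(restored == original)                  # True: lossless, the exact data is restored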
5. Discretization Operation
The data discretization technique divides attributes of a continuous nature into data with intervals. We replace the many constant values of an attribute with labels of small intervals, so that the mining results are shown in a concise and easily understandable way.
1. Top-down discretization: If we first consider one or a couple of points (so-called break points or split points) to divide the whole range of the attribute and then repeat this method on the resulting intervals, the process is known as top-down discretization, also known as splitting.
2. Bottom-up discretization: If we first consider all the constant values as split points and then discard some of them by merging neighbourhood values into intervals, the process is called bottom-up discretization.
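A minimal pandas sketch of discretizing a continuous attribute into labelled intervals (the age values, interval boundaries, and labels are made up):
import pandas as pd

ages = pd.Series([23, 31, 38, 44, 52, 61, 67])

# Replace the continuous ages with labels of small intervals.
age_groups = pd.cut(ages, bins=[20, 40, 60, 80], labels=['young', 'middle', 'senior'])
print(age_groups.tolist())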
Data Transformation:
A change made to the format or the structure of the data is called data transformation. This step can be simple or complex depending on the requirements. There are several methods of data transformation:
Smoothing: With the help of algorithms, we can remove noise from the dataset, which helps in identifying the important features of the dataset. Smoothing makes it possible to detect even small changes that help in prediction.
Aggregation: In this method, the data is stored and presented in the form of a summary. Data coming from multiple sources is integrated into a single description for data analysis. This is an important step, since the accuracy of the analysis depends on the quantity and quality of the data, and when both are good, the results are more relevant.
Discretization: The continuous data is split into intervals. Discretization reduces the data size. For example, rather than specifying the exact class time, we can use intervals such as (3 pm-5 pm, 6 pm-8 pm).
Normalization: This is the method of scaling the data so that it can be represented in a smaller range, for example from -1.0 to 1.0.
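A minimal scikit-learn sketch of normalization with MinMaxScaler, scaling made-up salary values into the range -1.0 to 1.0:
import numpy as nm
from sklearn.preprocessing import MinMaxScaler

salaries = nm.array([[45000.0], [54000.0], [58000.0], [65000.0], [79000.0]])   # made-up values

scaler = MinMaxScaler(feature_range=(-1, 1))   # rescale every value into [-1, 1]
scaled = scaler.fit_transform(salaries)
print(scaled.ravel())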