Machine Learning With Python Data Preprocessing, Analysis and Visualization
In the real world, we usually come across lots of raw data which is not fit to be readily processed by machine
learning algorithms. We need to preprocess the raw data before it is fed into various machine learning
algorithms. This chapter discusses various techniques for preprocessing data in Python machine learning.
Data Preprocessing
In this section, let us understand how we preprocess data in Python.
First, create a file with a .py extension, for example prefoo.py, in a text editor such as Notepad, and add the following imports:
import numpy as np
from sklearn import preprocessing
We imported a couple of packages. Now let us create some sample data by adding this line to the file:
input_data = np.array([[3, -1.5, 3, -6.4], [0, 3, -1.3, 4.1], [1, 2.3, -2.9, -4.3]])
Preprocessing Techniques
Data can be preprocessed using several techniques as discussed here −
Mean removal
It involves removing the mean from each feature so that it is centered on zero. Mean removal helps in removing
any bias from the features.
data_standardized = preprocessing.scale(input_data)
print("\nMean =", data_standardized.mean(axis=0))
print("Std deviation =", data_standardized.std(axis=0))
$ python prefoo.py
Observe that in the output, mean is almost 0 and the standard deviation is 1.
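Mean removal amounts to subtracting each column's mean and dividing by each column's standard deviation. The following is a minimal sketch of that arithmetic in plain NumPy, offered as an illustration rather than as the library's implementation; the variable names col_mean, col_std and data_standardized_manual are introduced here for the example only.

```python
import numpy as np

# Same sample data as in prefoo.py
input_data = np.array([[3, -1.5, 3, -6.4],
                       [0, 3, -1.3, 4.1],
                       [1, 2.3, -2.9, -4.3]])

# Per-feature (column) mean and population standard deviation
col_mean = input_data.mean(axis=0)
col_std = input_data.std(axis=0)

# Center each column on zero and rescale it to unit standard deviation
data_standardized_manual = (input_data - col_mean) / col_std

print("Mean =", data_standardized_manual.mean(axis=0))
print("Std deviation =", data_standardized_manual.std(axis=0))
```

The printed means may show tiny non-zero values such as 1e-17 because of floating-point rounding, which is why the mean is described as almost 0.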
https://www.tutorialspoint.com/cgi-bin/printpage.cgi 1/8
6/7/2019 Machine Learning with Python Data Preprocessing, Analysis and Visualization
Scaling
The values of each feature in a data point can vary over very different, arbitrary ranges. So, it is important to scale them so
that every feature falls within a specified range.
Now run the code and you can observe the following output −
Note that all the values have been scaled to lie within the given range.
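A minimal sketch of scaling, assuming scikit-learn's MinMaxScaler and an assumed target range of 0 to 1 (the range is not stated in the text above, and the names data_scaler and data_scaled are chosen for this example):

```python
import numpy as np
from sklearn import preprocessing

# Same sample data as in prefoo.py
input_data = np.array([[3, -1.5, 3, -6.4],
                       [0, 3, -1.3, 4.1],
                       [1, 2.3, -2.9, -4.3]])

# Assumed range: rescale every feature (column) to lie between 0 and 1
data_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
data_scaled = data_scaler.fit_transform(input_data)

print("Min max scaled data =", data_scaled)
```

Each column is scaled independently, so the smallest value in a column maps to 0 and the largest maps to 1.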
Normalization
Normalization involves adjusting the values in the feature vector so as to measure them on a common scale.
Here, the values of each feature vector are adjusted so that their absolute values sum up to 1. We add the following lines to the
prefoo.py file −
Now run the code and you can observe the following output −
Normalization is used to ensure that data points do not get boosted artificially due to the magnitude of their features.
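A minimal sketch of this kind of normalization, assuming scikit-learn's normalize function with the L1 norm (the choice of norm='l1' is inferred from the description that the values sum up to 1, and the name data_normalized is chosen for this example):

```python
import numpy as np
from sklearn import preprocessing

# Same sample data as in prefoo.py
input_data = np.array([[3, -1.5, 3, -6.4],
                       [0, 3, -1.3, 4.1],
                       [1, 2.3, -2.9, -4.3]])

# L1 normalization: divide each row by the sum of its absolute values
data_normalized = preprocessing.normalize(input_data, norm='l1')

print("L1 normalized data =", data_normalized)
print("Row sums of absolute values =", np.abs(data_normalized).sum(axis=1))
```

Note that normalization works on each data point (row), whereas mean removal and scaling work on each feature (column).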
Binarization
Binarization is used to convert a numerical feature vector into a Boolean vector. You can use the following code
for binarization −
data_binarized = preprocessing.Binarizer(threshold=1.4).transform(input_data)
print("\nBinarized data =", data_binarized)
Now run the code and you can observe the following output −
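The thresholding rule can be sketched in plain NumPy: values greater than the threshold become 1 and all other values become 0. This is an illustration of the rule, not the library's code, and the name data_binarized_manual is introduced for this example.

```python
import numpy as np

# Same sample data as in prefoo.py
input_data = np.array([[3, -1.5, 3, -6.4],
                       [0, 3, -1.3, 4.1],
                       [1, 2.3, -2.9, -4.3]])

threshold = 1.4

# Values strictly greater than the threshold map to 1.0, the rest to 0.0
data_binarized_manual = (input_data > threshold).astype(float)

print("Binarized data =", data_binarized_manual)
```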
One Hot Encoding
Sometimes we have to deal with numerical values that are sparse and scattered, and you may not need to store these
values as they are. In such situations you can use the One Hot Encoding technique.
If the number of distinct values is k, it will transform the feature into a k-dimensional vector where only one
value is 1 and all other values are 0.
You can use the following code for one hot encoding −
encoder = preprocessing.OneHotEncoder()
encoder.fit([[0, 2, 1, 12],
             [1, 3, 5, 3],
             [2, 3, 2, 12],
             [1, 2, 4, 3]])
encoded_vector = encoder.transform([[2, 3, 5, 3]]).toarray()
print("\nEncoded vector =", encoded_vector)
Now run the code and you can observe the following output −
In the example above, let us consider the third feature in each feature vector. The values are 1, 5, 2, and 4.
There are four distinct values here, which means the one-hot encoded vector will be of length 4. The encoder arranges
the distinct values in ascending order (1, 2, 4, 5), so if we want to encode the value 5, it will be the vector [0, 0, 0, 1].
Only one value can be 1 in this vector. The fourth element is 1, which indicates that the value is 5.
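To check how the third feature is encoded, the sketch below inspects the categories the encoder learned and picks out the part of the encoded vector that belongs to that feature. It assumes a scikit-learn version that exposes the categories_ attribute; the names third_feature_categories and third_feature_slice are introduced for this example.

```python
import numpy as np
from sklearn import preprocessing

encoder = preprocessing.OneHotEncoder()
encoder.fit([[0, 2, 1, 12],
             [1, 3, 5, 3],
             [2, 3, 2, 12],
             [1, 2, 4, 3]])

# Distinct values learned for the third feature, in sorted order
third_feature_categories = encoder.categories_[2]
print("Third feature categories =", third_feature_categories)

encoded_vector = encoder.transform([[2, 3, 5, 3]]).toarray()

# The features have 3, 2, 4 and 2 distinct values, so the third
# feature occupies positions 5 to 8 of the encoded vector
third_feature_slice = encoded_vector[0, 5:9]
print("Encoding of the value 5 =", third_feature_slice)
```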
Label Encoding
In supervised learning, we mostly come across a variety of labels which can be in the form of numbers or words.
If they are numbers, then they can be used directly by the algorithm. However, many times, labels need to be in
readable form. Hence, the training data is usually labelled with words.
Label encoding refers to changing the word labels into numbers so that the algorithms can understand how to
work on them. Let us understand in detail how to perform label encoding −
Now run the code and you can observe the following output −
Class mapping:
bmw --> 0
ford --> 1
suzuki --> 2
toyota --> 3
As shown in the above output, the words have been changed into 0-indexed numbers. Now, when we deal with a set
of labels, we can transform them as follows −
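A minimal sketch of both steps, fitting the encoder and then transforming a set of labels, assuming scikit-learn's LabelEncoder. The lists input_classes and labels are hypothetical; they were chosen only so that the four classes match the mapping shown above.

```python
from sklearn import preprocessing

# Hypothetical training labels containing the four classes from the mapping
input_classes = ['suzuki', 'ford', 'suzuki', 'toyota', 'ford', 'bmw']

label_encoder = preprocessing.LabelEncoder()
label_encoder.fit(input_classes)

# Classes are stored in sorted order, and the index is the encoded number
print("Class mapping:")
for i, item in enumerate(label_encoder.classes_):
    print(item, '-->', i)

# Hypothetical set of labels to transform with the fitted encoder
labels = ['toyota', 'ford', 'suzuki']
encoded_labels = label_encoder.transform(labels)
print("Labels =", labels)
print("Encoded labels =", list(encoded_labels))
```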
Now run the code and you can observe the following output −
This is more efficient than manually maintaining the mapping between words and numbers. You can check by
transforming numbers back to word labels as shown in the code here −
encoded_labels = [3, 2, 0, 2, 1]
decoded_labels = label_encoder.inverse_transform(encoded_labels)
print("\nEncoded labels =", encoded_labels)
print("Decoded labels =", list(decoded_labels))
Now run the code and you can observe the following output −
From the output, you can observe that the mapping is preserved perfectly.
Data Analysis
This section discusses data analysis in Python machine learning in detail −
We can load the data directly from the UCI Machine Learning repository. Note that here we are using pandas
to load the data. We will also use pandas next to explore the data both with descriptive statistics and data
visualization. Observe the following code and note that we are specifying the names of each column when
loading the data.
import pandas
data = 'pima_indians.csv'
names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'Outcome']
dataset = pandas.read_csv(data, names = names)
When you run the code, you can observe that the dataset loads and is ready to be analyzed. Here, we have
downloaded the pima_indians.csv file and moved it into our working directory and loaded it using the local file
name.
Dimensions of Dataset
You can use the shape property to check how many instances (rows) and attributes (columns) the data
contains.
print(dataset.shape)
Then, for the code that we have discussed, we can see 769 instances and 6 attributes −
(769, 6)
You can view the first 20 rows of the data to get a feel for its contents −
print(dataset.head(20))
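You can view the statistical summary of each attribute by using the following command. For columns read as text, the
summary includes the count, unique, top and freq; for numeric columns it reports the count, mean, standard deviation,
minimum, quartiles and maximum instead.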
print(dataset.describe())
The above command gives you the following output that shows the statistical summary of each attribute −
You can also look at the number of instances (rows) that belong to each outcome as an absolute count, using the
command shown here −
print(dataset.groupby('Outcome').size())
Outcome
0 500
1 268
Outcome 1
dtype: int64
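The extra row labelled Outcome with a count of 1 is consistent with the header line of the file having been read as a data row, which would also explain the count of 769 instances. The sketch below reproduces that effect on a small made-up CSV (the csv_text contents and the two-column names list are hypothetical, not the real dataset) and shows that passing header=0 along with names skips the header line.

```python
import io
import pandas

# Hypothetical three-row CSV with a header line, standing in for the real file
csv_text = "Glucose,Outcome\n148,1\n85,0\n183,1\n"
names = ['Glucose', 'Outcome']

# With names= alone, pandas treats the header line as an ordinary data row
with_header_row = pandas.read_csv(io.StringIO(csv_text), names=names)
print(with_header_row.shape)

# header=0 marks the first line as a header that the given names replace
without_header_row = pandas.read_csv(io.StringIO(csv_text), names=names, header=0)
print(without_header_row.shape)
print(without_header_row.groupby('Outcome').size())
```

Whether the real file has a header line depends on the copy that was downloaded, so it is worth checking the first few rows before choosing between the two calls.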
Data Visualization
You can visualize data using two types of plots as shown −
Univariate Plots
Univariate plots are plots of each individual variable. Consider a case where the input variables are numeric,
and we need to create box and whisker plots of each. You can use the following code for this purpose.
import pandas
import matplotlib.pyplot as plt
data = 'iris_df.csv'
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv(data, names=names)
dataset.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
plt.show()
The output gives a clearer idea of the distribution of the input attributes as shown −
You can create a histogram of each input variable to get an idea of the distribution using the commands shown
below −
# histograms
dataset.hist()
plt.show()
From the output, you can see that two of the input variables appear to have a roughly Gaussian distribution. Thus these plots help
in giving an idea about the algorithms that we can use in our program.
Multivariate Plots
Multivariate plots help us to understand the interactions between the variables.
First, let’s look at scatterplots of all pairs of attributes. This can be helpful to spot structured relationships
between input variables.
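A minimal sketch of such a scatterplot matrix, assuming the scatter_matrix function from pandas.plotting. The data here is randomly generated as a stand-in for the numeric iris columns, with petal-width deliberately made to depend on petal-length so that one pair shows a diagonal grouping; the output file name scatter_matrix.png is also an assumption of this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas
from pandas.plotting import scatter_matrix

# Made-up stand-in data, not the real iris measurements
rng = np.random.default_rng(0)
petal_length = rng.uniform(1.0, 7.0, size=50)
dataset = pandas.DataFrame({
    'sepal-length': rng.normal(5.8, 0.8, size=50),
    'sepal-width': rng.normal(3.0, 0.4, size=50),
    'petal-length': petal_length,
    'petal-width': 0.4 * petal_length + rng.normal(0.0, 0.1, size=50),
})

# One scatterplot per pair of attributes, with histograms on the diagonal
axes = scatter_matrix(dataset)
plt.savefig('scatter_matrix.png')
```

In an interactive session, the backend line can be dropped and plt.savefig replaced with plt.show() to open the figure in a window.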
Observe that in the output there is a diagonal grouping of some pairs of attributes. This indicates a high
correlation and a predictable relationship.