Machine Learning With Python Data Preprocessing, Analysis and Visualization

Uploaded by

lukumon balogun

6/7/2019 Machine Learning with Python Data Preprocessing, Analysis and Visualization

DATA PREPROCESSING, ANALYSIS & VISUALIZATION

https://www.tutorialspoint.com/machine_learning_with_python/machine_learning_with_python_data_preprocessing_analysis_
visualization.htm
Copyright © tutorialspoint.com

Then, add the following piece of code to this file −

import numpy as np
from sklearn import preprocessing

We imported a couple of packages. Let's create some sample data and add the following line to this file:

input_data = np.array([[3, -1.5, 3, -6.4], [0, 3, -1.3, 4.1], [1, 2.3, -2.9, -4.3]])

We are now ready to operate on this data.

Preprocessing Techniques
Data can be preprocessed using several techniques as discussed here −

Mean removal

It involves removing the mean from each feature so that it is centered on zero. Mean removal helps in removing
any bias from the features.

You can use the following code for mean removal −

data_standardized = preprocessing.scale(input_data)
print("\nMean =", data_standardized.mean(axis=0))
print("Std deviation =", data_standardized.std(axis=0))

Now run the following command on the terminal −

$ python prefoo.py

You can observe the following output −

Mean = [ 5.55111512e-17 -3.70074342e-17 0.00000000e+00 -1.85037171e-17]

Std deviation = [1. 1. 1. 1.]

Observe that in the output, the mean of each feature is almost 0 (the tiny nonzero values are floating-point rounding error) and the standard deviation is 1.
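
As a rough check on what preprocessing.scale does, the same result can be reproduced by hand with NumPy: subtract each column's mean and divide by each column's standard deviation. This is a minimal sketch that assumes the default settings (per-column statistics, population standard deviation); the names col_mean, col_std and manual_standardized are ours, not part of the tutorial.

```python
import numpy as np
from sklearn import preprocessing

input_data = np.array([[3, -1.5, 3, -6.4], [0, 3, -1.3, 4.1], [1, 2.3, -2.9, -4.3]])

# Center each column (feature) on zero, then scale it to unit standard deviation
col_mean = input_data.mean(axis=0)
col_std = input_data.std(axis=0)  # population std (ddof=0), the NumPy default
manual_standardized = (input_data - col_mean) / col_std

# Compare the hand-computed result with scikit-learn's
print(np.allclose(manual_standardized, preprocessing.scale(input_data)))
```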


Scaling

The values of different features in a data point can span very different ranges. So, it is important to scale them so
that every feature falls within a specified range.

You can use the following code for scaling −

data_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
data_scaled = data_scaler.fit_transform(input_data)
print("\nMin max scaled data =", data_scaled)

Now run the code and you can observe the following output −

Min max scaled data = [[ 1.          0.          1.          0.        ]
 [ 0.          1.          0.27118644  1.        ]
 [ 0.33333333  0.84444444  0.          0.2       ]]

Note that all the values have been scaled to lie within the given range.
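
The min-max rule itself is short enough to write out. The sketch below assumes the (0, 1) range used above and computes (x - min) / (max - min) for each column, then compares it with MinMaxScaler; col_min, col_max and manual_scaled are illustrative names of our own.

```python
import numpy as np
from sklearn import preprocessing

input_data = np.array([[3, -1.5, 3, -6.4], [0, 3, -1.3, 4.1], [1, 2.3, -2.9, -4.3]])

# Per-column minimum and maximum
col_min = input_data.min(axis=0)
col_max = input_data.max(axis=0)

# Map each column linearly so that its minimum becomes 0 and its maximum becomes 1
manual_scaled = (input_data - col_min) / (col_max - col_min)

data_scaled = preprocessing.MinMaxScaler(feature_range=(0, 1)).fit_transform(input_data)
print(np.allclose(manual_scaled, data_scaled))
```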

Normalization
Normalization involves adjusting the values in the feature vector so as to measure them on a common scale.
Here, the values of each feature vector (row) are adjusted so that their absolute values sum up to 1. You can add
the following lines to the prefoo.py file for normalization −

data_normalized = preprocessing.normalize(input_data, norm='l1')
print("\nL1 normalized data =", data_normalized)

Now run the code and you can observe the following output −

L1 normalized data = [[ 0.21582734 -0.10791367  0.21582734 -0.46043165]
 [ 0.          0.35714286 -0.1547619   0.48809524]
 [ 0.0952381   0.21904762 -0.27619048 -0.40952381]]

Normalization is used to ensure that data points do not get artificially boosted because of the magnitude of their features.
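
To make the L1 rule concrete, the following sketch divides every row by the sum of the absolute values in that row and checks the result against preprocessing.normalize. The names row_l1 and manual_normalized are ours; the sketch assumes no row is all zeros, which holds for this sample data.

```python
import numpy as np
from sklearn import preprocessing

input_data = np.array([[3, -1.5, 3, -6.4], [0, 3, -1.3, 4.1], [1, 2.3, -2.9, -4.3]])

# L1 norm of each row: the sum of the absolute values in that row
row_l1 = np.abs(input_data).sum(axis=1, keepdims=True)

# Dividing by the norm keeps the signs; the absolute values of each row now sum to 1
manual_normalized = input_data / row_l1

print(np.allclose(manual_normalized, preprocessing.normalize(input_data, norm='l1')))
print(np.abs(manual_normalized).sum(axis=1))
```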

Binarization

Binarization is used to convert a numerical feature vector into a Boolean vector: values greater than the
threshold become 1 and all other values become 0. You can use the following code for binarization −

data_binarized = preprocessing.Binarizer(threshold=1.4).transform(input_data)
print("\nBinarized data =", data_binarized)

Now run the code and you can observe the following output −

Binarized data = [[ 1.  0.  1.  0.]
 [ 0.  1.  0.  1.]
 [ 0.  1.  0.  0.]]

This technique is helpful when we have prior knowledge of the data, for example a threshold that is known to be meaningful.
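
The thresholding can also be expressed directly in NumPy. This is a small sketch, assuming the same threshold of 1.4; manual_binarized is an illustrative name, and fit_transform is used here only because it is safe on any scikit-learn version (the Binarizer learns nothing during fit).

```python
import numpy as np
from sklearn import preprocessing

input_data = np.array([[3, -1.5, 3, -6.4], [0, 3, -1.3, 4.1], [1, 2.3, -2.9, -4.3]])

# Values strictly greater than the threshold become 1.0, everything else 0.0
manual_binarized = np.where(input_data > 1.4, 1.0, 0.0)

data_binarized = preprocessing.Binarizer(threshold=1.4).fit_transform(input_data)
print(np.array_equal(manual_binarized, data_binarized))
```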


One Hot Encoding

Sometimes a feature takes only a few distinct numerical values that are scattered over a wide range, and the
values themselves carry no numeric meaning that needs to be stored. In such situations you can use the One Hot
Encoding technique.

If the number of distinct values is k, it will transform the feature into a k-dimensional vector where only one
value is 1 and all other values are 0.

You can use the following code for one hot encoding −

encoder = preprocessing.OneHotEncoder()
encoder.fit([[0, 2, 1, 12],
             [1, 3, 5, 3],
             [2, 3, 2, 12],
             [1, 2, 4, 3]])
encoded_vector = encoder.transform([[2, 3, 5, 3]]).toarray()
print("\nEncoded vector =", encoded_vector)

Now run the code and you can observe the following output −

Encoded vector = [[ 0. 0. 1. 0. 1. 0. 0. 0. 1. 1. 0.]]

In the example above, let us consider the third feature in each feature vector. The values are 1, 5, 2, and 4.

There are four distinct values here, which means the one-hot encoded vector for this feature will be of length 4.
The encoder orders the distinct values as 1, 2, 4, 5, so if we want to encode the value 5, it will be the vector
[0, 0, 0, 1]. Only one value can be 1 in this vector. The fourth element is 1, which indicates that the value is 5.
This matches positions 6 to 9 of the encoded vector printed above.
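
One way to see which position stands for which value is to inspect the fitted encoder. The sketch below assumes a scikit-learn version that exposes the categories_ attribute (0.20 or later); third_feature_block is a name of our own for the slice of the output that belongs to the third feature.

```python
from sklearn import preprocessing

encoder = preprocessing.OneHotEncoder()
encoder.fit([[0, 2, 1, 12],
             [1, 3, 5, 3],
             [2, 3, 2, 12],
             [1, 2, 4, 3]])

# categories_ holds the distinct values seen for each feature, in sorted order
for index, categories in enumerate(encoder.categories_):
    print("Feature", index, "categories:", categories.tolist())

encoded_vector = encoder.transform([[2, 3, 5, 3]]).toarray()

# Features 1 and 2 occupy 3 + 2 columns, so the third feature fills columns 5..8 (0-indexed)
third_feature_block = encoded_vector[0, 5:9]
print(third_feature_block)
```

Because the categories of the third feature are sorted, the value 5 lands in the last slot of its block.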

Label Encoding
In supervised learning, we mostly come across a variety of labels which can be in the form of numbers or words.
If they are numbers, then they can be used directly by the algorithm. However, many times, labels need to be in
readable form. Hence, the training data is usually labelled with words.

Label encoding refers to changing the word labels into numbers so that the algorithms can understand how to
work on them. Let us understand in detail how to perform label encoding −

Create a new Python file, and import the preprocessing package −

from sklearn import preprocessing

label_encoder = preprocessing.LabelEncoder()
input_classes = ['suzuki', 'ford', 'suzuki', 'toyota', 'ford', 'bmw']
label_encoder.fit(input_classes)
print("\nClass mapping:")
for i, item in enumerate(label_encoder.classes_):
    print(item, '-->', i)

Now run the code and you can observe the following output −

Class mapping:
bmw --> 0
ford --> 1
suzuki --> 2
toyota --> 3

As shown in the above output, the words have been mapped to 0-indexed numbers in alphabetical order. Now, when we
deal with a set of labels, we can transform them as follows −


labels = ['toyota', 'ford', 'suzuki']
encoded_labels = label_encoder.transform(labels)
print("\nLabels =", labels)
print("Encoded labels =", list(encoded_labels))

Now run the code and you can observe the following output −

Labels = ['toyota', 'ford', 'suzuki']

Encoded labels = [3, 1, 2]

This is more efficient than manually maintaining a mapping between words and numbers. You can check the mapping by
transforming numbers back to word labels as shown in the code here −

encoded_labels = [3, 2, 0, 2, 1]
decoded_labels = label_encoder.inverse_transform(encoded_labels)
print("\nEncoded labels =", encoded_labels)
print("Decoded labels =", list(decoded_labels))

Now run the code and you can observe the following output −

Encoded labels = [3, 2, 0, 2, 1]

Decoded labels = ['toyota', 'suzuki', 'bmw', 'suzuki', 'ford']

From the output, you can observe that the mapping is preserved perfectly.
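
For comparison, the mapping that the label encoder maintains can be pictured as a pair of dictionaries built from the sorted distinct labels. This is only an illustrative sketch of the idea, not how scikit-learn is implemented; class_to_index and index_to_class are hypothetical names.

```python
input_classes = ['suzuki', 'ford', 'suzuki', 'toyota', 'ford', 'bmw']

# Sort the distinct labels, then number them from 0
class_to_index = {label: i for i, label in enumerate(sorted(set(input_classes)))}
# Reverse lookup for decoding
index_to_class = {i: label for label, i in class_to_index.items()}

encoded = [class_to_index[label] for label in ['toyota', 'ford', 'suzuki']]
decoded = [index_to_class[i] for i in [3, 2, 0, 2, 1]]

print(encoded)
print(decoded)
```

The encoded and decoded lists agree with the outputs shown above for the same inputs.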

Data Analysis
This section discusses data analysis in Python machine learning in detail −

Loading the Dataset

We can load the data directly from the UCI Machine Learning repository. Note that here we are using pandas
to load the data. We will also use pandas next to explore the data both with descriptive statistics and data
visualization. Observe the following code and note that we are specifying the names of each column when
loading the data.

import pandas
data = 'pima_indians.csv'
names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'Outcome']
dataset = pandas.read_csv(data, names=names)

When you run the code, you can observe that the dataset loads and is ready to be analyzed. Here, we have
downloaded the pima_indians.csv file and moved it into our working directory and loaded it using the local file
name.

Summarizing the Dataset

Summarizing the data can be done in many ways as follows −

Check dimensions of the dataset

List the entire data
View the statistical summary of all attributes
Breakdown of the data by the class variable

Dimensions of Dataset


You can use the following command to check how many instances (rows) and attributes (columns) the data
contains with the shape property.

print(dataset.shape)

Then, for the code that we have discussed, we can see 769 instances and 6 attributes −

(769, 6)

List the Entire Data

You can view the first rows of the data to get a feel for its contents −

print(dataset.head(20))

This command prints the first 20 rows of the data as shown −

Sno  Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin  Outcome
1    6            148      72             35             0        1
2    1            85       66             29             0        0
3    8            183      64             0              0        1
4    1            89       66             23             94       0
5    0            137      40             35             168      1
6    5            116      74             0              0        0
7    3            78       50             32             88       1
8    10           115      0              0              0        0
9    2            197      70             45             543      1
10   8            125      96             0              0        1
11   4            110      92             0              0        0
12   10           168      74             0              0        1
13   10           139      80             0              0        0
14   1            189      60             23             846      1
15   5            166      72             19             175      1
16   7            100      0              0              0        1
17   0            118      84             47             230      1
18   7            107      74             0              0        1
19   1            103      30             38             83       0

View the Statistical Summary

You can view the statistical summary of each attribute, which includes the count, unique, top and freq, by using
the following command.

print(dataset.describe())

The above command gives you the following output that shows the statistical summary of each attribute −

        Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin  Outcome
count   769          769      769            769            769      769
unique  18           137      48             52             187      3
top     1            100      70             0              0        0
freq    135          17       57             227            374      500

Breakdown the Data by Class Variable


You can also look at the number of instances (rows) that belong to each outcome as an absolute count, using the
command shown here −

print(dataset.groupby('Outcome').size())

Then you can see the number of instances for each outcome as shown −

Outcome
0 500
1 268
Outcome 1
dtype: int64
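
The summary steps above can be tried without the original file. The sketch below uses a hypothetical five-row sample copied from the first rows listed earlier and loads it twice from an in-memory string. The outputs reported above (769 rows, an extra Outcome group with one member, and count/unique/top/freq instead of numeric statistics) would be consistent with the file's header line being read as a data row; that is our reading, not something the source states. Passing header=0 together with names is one way to avoid it if your file does start with a header line.

```python
import io
import pandas

# Hypothetical sample: a header line plus the first five rows listed above
csv_text = """Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,Outcome
6,148,72,35,0,1
1,85,66,29,0,0
8,183,64,0,0,1
1,89,66,23,94,0
0,137,40,35,168,1
"""
names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'Outcome']

# names= alone: pandas assumes there is no header, so the header line becomes a data row
with_header_row = pandas.read_csv(io.StringIO(csv_text), names=names)

# header=0 discards the file's own header line and applies our column names
clean = pandas.read_csv(io.StringIO(csv_text), names=names, header=0)

print(with_header_row.shape)            # one row more than the real data
print(clean.shape)
print(clean.describe())                 # numeric summary: mean, std, quartiles
print(clean.groupby('Outcome').size())  # instances per outcome
```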

Data Visualization
You can visualize data using two types of plots as shown −

Univariate plots to understand each attribute

Multivariate plots to understand the relationships between attributes

Univariate Plots
Univariate plots are plots of each individual variable. Consider a case where the input variables are numeric,
and we need to create box and whisker plots of each. You can use the following code for this purpose.

import pandas
import matplotlib.pyplot as plt
data = 'iris_df.csv'
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv(data, names=names)
dataset.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
plt.show()

The output gives a clearer idea of the distribution of the input attributes, as shown −

Box and Whisker Plots
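
The numbers behind a box and whisker plot can also be computed directly. This sketch assumes scikit-learn's bundled iris data as a stand-in for iris_df.csv, and the common 1.5 x IQR convention for the whisker limits (matplotlib's default); summary, lower_fence and upper_fence are our own names.

```python
import pandas
from sklearn.datasets import load_iris

names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width']
# Assumption: the bundled iris measurements stand in for the local CSV file
dataset = pandas.DataFrame(load_iris().data, columns=names)

# The box spans the first to third quartile; the line inside it is the median
summary = dataset.quantile([0.25, 0.5, 0.75])

# Whiskers reach the most extreme points that still lie inside these fences
iqr = summary.loc[0.75] - summary.loc[0.25]
lower_fence = summary.loc[0.25] - 1.5 * iqr
upper_fence = summary.loc[0.75] + 1.5 * iqr

print(summary)
print(lower_fence)
print(upper_fence)
```

Points outside the fences are the ones a box plot draws as individual outlier markers.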

You can create a histogram of each input variable to get an idea of the distribution using the commands shown
below −


# histograms
dataset.hist()
plt.show()

From the output, you can see that two of the input variables have a roughly Gaussian distribution. Thus these plots
help in giving an idea about the algorithms that we can use in our program.

Multivariate Plots
Multivariate plots help us to understand the interactions between the variables.

Scatter Plot Matrix

First, let’s look at scatterplots of all pairs of attributes. This can be helpful to spot structured relationships
between input variables.

from pandas.plotting import scatter_matrix

scatter_matrix(dataset)
plt.show()

You can observe the output as shown −


Observe that in the output there is a diagonal grouping of some pairs of attributes. This indicates a high
correlation and a predictable relationship.
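
The visual impression from the scatter plot matrix can be backed up with numbers. The sketch below again assumes scikit-learn's bundled iris data in place of iris_df.csv and prints the pairwise Pearson correlation matrix; correlations is an illustrative name.

```python
import pandas
from sklearn.datasets import load_iris

names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width']
# Assumption: the bundled iris measurements stand in for the local CSV file
dataset = pandas.DataFrame(load_iris().data, columns=names)

# Pairwise Pearson correlation; values near 1 or -1 match the diagonal groupings in the plot
correlations = dataset.corr()
print(correlations.round(2))
```

Pairs with a coefficient close to 1, such as petal length and petal width, are the ones that appear as tight diagonal bands.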
