Machine Learning With Python
This tutorial starts with an introduction to machine learning and the Python language and
shows you how to set up Python and its packages. It then covers important concepts
such as exploratory data analysis, data preprocessing, feature extraction, data
visualization, clustering, classification, regression and model performance evaluation.
The tutorial also provides various projects that teach you techniques and
functionalities such as news topic classification, spam email detection, online ad
click-through prediction and stock price forecasting, along with several other important
machine learning algorithms.
Audience
This tutorial has been prepared for professionals aspiring to learn the basics of Python and
develop applications involving machine learning techniques such as recommendation,
classification, and clustering. Through this tutorial, you will learn to solve data-driven
problems and implement your solutions using the powerful yet simple programming
language, Python and its packages. After completing this tutorial, you will gain a broad
picture of the machine learning environment and the best practices for machine learning
techniques.
Prerequisites
Before you proceed with this tutorial, we assume that you have prior exposure
to Python, NumPy, pandas, SciPy and matplotlib, and to Windows or any of the Linux
operating system flavors. If you are new to any of these, we recommend that you take up
tutorials on these topics before you dig further into this tutorial.
All the content and graphics published in this e-book are the property of Tutorials Point (I)
Pvt. Ltd. The user of this e-book is prohibited to reuse, retain, copy, distribute or republish
any contents or a part of contents of this e-book in any manner without written consent
of the publisher.
We strive to update the contents of our website and tutorials as timely and as precisely as
possible, however, the contents may contain inaccuracies or errors. Tutorials Point (I) Pvt.
Ltd. provides no guarantee regarding the accuracy, timeliness or completeness of our
website or its contents including this tutorial. If you discover any errors on our website or
in this tutorial, please notify us at contact@tutorialspoint.com
Python Machine Learning
Table of Contents
About the Tutorial
Audience
Prerequisites
Installation
Data Analysis
Univariate Plots
Classification
Regression
Recommendation
Clustering
K-Means
1. Python Machine Learning – Introduction
Python is a popular platform used for research and for the development of production systems. It
is a vast language with a number of modules, packages and libraries that provide multiple
ways of achieving a task.
Python and its libraries such as NumPy, SciPy, Scikit-Learn and Matplotlib are used in data science
and data analysis. They are also used extensively for creating scalable machine learning
algorithms. These libraries implement popular machine learning techniques such as
classification, regression, recommendation, and clustering.
Python offers ready-made frameworks for performing data mining tasks on large volumes
of data effectively and in less time. These include implementations of
algorithms such as linear regression, logistic regression, Naïve Bayes, k-means, k-nearest
neighbors, and random forest.
2. Python Machine Learning – Concepts
In this chapter, you will learn about the concepts behind machine learning with Python.
Machine learning is a discipline that deals with programming systems so that
they automatically learn and improve with experience. Here, learning implies recognizing
and understanding the input data and making informed decisions based on the supplied
data. It is very difficult to enumerate all the decisions for all possible inputs. To solve
this problem, algorithms are developed that build knowledge from specific data and past
experience by applying the principles of statistical science, probability, logic, mathematical
optimization, reinforcement learning, and control theory.
Machine learning algorithms are used in application areas such as:
Vision processing
Language processing
Forecasting things like stock market trends, weather
Pattern recognition
Games
Data mining
Expert systems
Robotics
A machine learning project typically involves the following steps:
Defining a Problem
Preparing Data
Evaluating Algorithms
Improving Results
Presenting Results
The best way to get started using Python for machine learning is to work through a project
end-to-end and cover the key steps like loading data, summarizing data, evaluating
algorithms and making some predictions. This gives you a replicable method that can be
used dataset after dataset. You can also add further data and improve the results.
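The steps named above (loading data, summarizing data, evaluating an algorithm and making predictions) can be sketched in a few lines. This is only a minimal illustration: the iris dataset bundled with scikit-learn, the logistic regression model, the 80/20 split and the random seed are all assumptions chosen for the example, not choices made by the tutorial.

```python
# End-to-end sketch: load data, summarize it, evaluate a model, predict.
# The dataset (iris) and the model (logistic regression) are illustrative choices.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. Load the data.
iris = load_iris()
X, y = iris.data, iris.target

# 2. Summarize the data.
print("Shape of the feature matrix:", X.shape)
print("Classes:", list(iris.target_names))

# 3. Hold out part of the data so the algorithm is evaluated on unseen samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y
)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# 4. Make predictions on the held-out data and measure accuracy.
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print("Accuracy on the held-out data:", accuracy)
```

The same skeleton can be reused with another dataset or model by changing only steps 1 and 3.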
3. Python Machine Learning – Environment Setup
In this chapter, you will learn how to set up the working environment for Python machine
learning on your local computer.
Installation
You can install the software for machine learning using either of the two methods discussed here:
Method 1
Download and install Python separately from python.org on various operating systems
as explained below:
To install Python after downloading, double click the .exe (for Windows) or .pkg (for Mac)
file and follow the instructions on the screen.
For Linux, check whether Python is already installed, for example by running
python --version or python3 --version at the prompt.
If Python 2.7 or later is not installed, install Python with the distribution's package
manager. Note that the command and the package name vary between distributions.
Now, open the command prompt and run the following command to verify that Python is
installed correctly:
$ python3 --version
Python 3.6.2
Similarly, we can download and install the necessary libraries such as numpy and matplotlib
individually using an installer like pip. For this purpose, you can use commands such as:
$ pip install numpy
$ pip install matplotlib
Method 2
Alternatively, to install Python and other scientific computing and machine learning
packages in one step, we can install the Anaconda distribution. It is a Python
distribution for Linux, Windows and OS X, and comprises various machine learning
packages such as numpy, scikit-learn, and matplotlib. It also includes Jupyter Notebook, an
interactive Python environment. We can install Python 2.7 or any 3.x version as per our
requirement.
To download the free Anaconda Python distribution from Continuum Analytics, you can do
the following:
Visit the official site of Continuum Analytics and its download page. Note that the
installation process may take 15-20 minutes as the installer contains Python, associated
packages, a code editor, and some other files. Depending on your operating system,
choose the installation process as explained here:
For Windows: Select the Anaconda for Windows section and look in the column with
Python 2.7 or 3.x. You can find that there are two versions of the installer, one for 32-bit
Windows, and one for 64-bit Windows. Choose the relevant one.
For Mac OS: Scroll to the Anaconda for OS X section and look in the column with Python
2.7 or 3.x. Note that there is only one version of the installer here: the 64-bit version.
For Linux OS: Select the Anaconda for Linux section and look in the column with Python
2.7 or 3.x.
Note that you have to ensure that Anaconda’s Python distribution installs into a single
directory, and does not affect other Python installations, if any, on your system.
To work with graphs and plots, we will need these Python library packages: matplotlib
and seaborn.
If you are using Anaconda Python, your system already has numpy, matplotlib, pandas,
seaborn, etc. installed. We start the Anaconda Navigator to access either Jupyter
Notebook or the Spyder IDE, where the libraries can be imported:
import numpy
import matplotlib
Now, we need to check whether the installation was successful. For this, go to the command line and
type in the following command:
$ python
Python 3.6.3 |Anaconda custom (32-bit)| (default, Oct 13 2017, 14:21:34)
[GCC 7.2.0] on linux
Next, you can import the required libraries and print their versions as shown:
>>> import numpy
>>> print(numpy.__version__)
1.14.2
>>> import matplotlib
>>> print(matplotlib.__version__)
2.1.2
>>> import pandas
>>> print(pandas.__version__)
0.22.0
>>> import seaborn
>>> print(seaborn.__version__)
0.8.1
4. Python Machine Learning – Types of Learning
The input to a learning algorithm is training data, representing experience, and the
output is some expertise, which usually takes the form of another algorithm that can
perform a task. The input data to a machine learning system can be numerical, textual,
audio, visual, or multimedia. The corresponding output of the system can be a
floating-point number, for instance the velocity of a rocket, or an integer representing a
category or a class, for example a pigeon or a sunflower in image recognition.
In this chapter, we will learn about the training data our programs will access, how the
learning process is automated, and how the success and performance of such machine
learning algorithms are evaluated.
Concepts of Learning
Learning is the process of converting experience into expertise or knowledge.
Learning can be broadly classified into the four categories listed below, based on the
nature of the learning data and the interaction between the learner and the environment.
Supervised Learning
Unsupervised Learning
Semi-supervised Learning
Reinforcement Learning
Of these, the most commonly used are supervised and unsupervised learning.
Supervised Learning
Supervised learning is commonly used in real world applications, such as face and speech
recognition, products or movie recommendations, and sales forecasting. Supervised
learning can be further classified into two types: Regression and Classification.
In supervised learning, the learning data comes with descriptions, labels, targets or desired
outputs, and the objective is to find a general rule that maps inputs to outputs. This kind
of learning data is called labeled data. The learned rule is then used to label new data
with unknown outputs.
Supervised learning involves building a machine learning model that is based on labeled
samples. For example, if we build a system to estimate the price of a plot of land or a
house based on various features, such as size, location, and so on, we first need to create
a database and label it. We need to teach the algorithm what features correspond to what
prices. Based on this data, the algorithm will learn how to calculate the price of real estate
using the values of the input features.
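The house-price example above can be sketched as a small regression problem. Everything concrete here is an assumption made for illustration: the single feature (size), the made-up prices, which follow price = 2 * size + 50 exactly, and the choice of scikit-learn's LinearRegression as the learning algorithm.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical labeled data: house size in square metres (feature)
# and price in thousands (label). The numbers are invented so that
# price = 2 * size + 50 holds exactly.
sizes = np.array([[50], [80], [100], [120], [150]])
prices = np.array([150, 210, 250, 290, 350])

# Teach the algorithm which feature values correspond to which prices.
model = LinearRegression()
model.fit(sizes, prices)

# Use the learned rule to estimate the price of a house it has not seen.
new_house = np.array([[110]])
predicted_price = model.predict(new_house)[0]
print("Predicted price:", round(predicted_price, 1))
```

Because the invented data is perfectly linear, the model recovers the rule and predicts about 270 for a size of 110; real data would be noisier and need more features.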
Supervised learning deals with learning a function from available training data. Here, a
learning algorithm analyzes the training data and produces a derived function that can be
used for mapping new examples. There are many supervised learning algorithms such
as Logistic Regression, Neural networks, Support Vector Machines (SVMs), and Naive
Bayes classifiers.
Common examples of supervised learning include classifying e-mails into spam and not-
spam categories, labeling webpages based on their content, and voice recognition.
Unsupervised Learning
Unsupervised learning is used to detect anomalies and outliers, such as fraud or defective
equipment, or to group customers with similar behaviors for a sales campaign. Unlike
supervised learning, it works without labeled data.
When learning data contains only some indications without any description or labels, it is
up to the coder or to the algorithm to find the structure of the underlying data, to discover
hidden patterns, or to determine how to describe the data. This kind of learning data is
called unlabeled data.
Suppose that we have a number of data points, and we want to divide them into several
groups. We may not know exactly what the grouping criteria should be. So, an
unsupervised learning algorithm tries to partition the given dataset into a certain number
of groups in an optimum way.
Unsupervised learning algorithms are extremely powerful tools for analyzing data and for
identifying patterns and trends. They are most commonly used for clustering similar inputs
into logical groups. Unsupervised learning algorithms include k-means, hierarchical
clustering and so on.
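As a rough sketch of grouping unlabeled points, the snippet below runs k-means on six hand-made points. The points, the number of clusters (2) and the random seed are assumptions for the example; no labels are supplied to the algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

# Six unlabeled points that visibly form two groups (made-up data).
points = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
                   [8.0, 8.2], [8.1, 7.9], [7.8, 8.0]])

# Ask k-means for two clusters; the algorithm decides the grouping itself.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

print("Cluster labels:", labels)
print("Cluster centers:", kmeans.cluster_centers_)
```

The cluster numbers themselves are arbitrary; what matters is that the first three points share one label and the last three share the other.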
Semi-supervised Learning
If some learning samples are labeled but others are not, then it is semi-supervised
learning. It typically makes use of a small amount of labeled data together with a large
amount of unlabeled data for training. Semi-supervised learning is applied in cases
where it is expensive to acquire a fully labeled dataset but practical to label a small
subset. For example, it often requires skilled experts to label certain remote sensing
images, and lots of field experiments to locate oil at a particular location, while acquiring
unlabeled data is relatively easy.
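One way to picture this setting is with scikit-learn's LabelSpreading, which accepts a label of -1 for unlabeled samples. The sketch below is illustrative only: the synthetic two-group data, the choice of three labeled samples per class and the kNN kernel settings are assumptions, not something the tutorial prescribes.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.semi_supervised import LabelSpreading

# Synthetic data: two well-separated groups of points (an assumption
# made so that the example is easy to follow).
X, y_true = make_blobs(n_samples=100, centers=[[-5, -5], [5, 5]],
                       cluster_std=1.0, random_state=0)

# Keep only three labels per class; -1 marks a sample as unlabeled.
y_partial = np.full(len(y_true), -1)
for cls in (0, 1):
    labeled_idx = np.where(y_true == cls)[0][:3]
    y_partial[labeled_idx] = cls

# Spread the few known labels to the unlabeled samples via nearest neighbors.
model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X, y_partial)

# transduction_ holds the label inferred for every training sample.
inferred = model.transduction_
accuracy = (inferred == y_true).mean()
print("Labeled samples used:", int((y_partial != -1).sum()))
print("Share of samples labeled correctly:", accuracy)
```

Here the true labels are kept only to check the result; in a real semi-supervised problem they would not be available for the unlabeled samples.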
Reinforcement Learning
Here, the learning data provides feedback so that the system adjusts to dynamic conditions in
order to achieve a certain objective. The system evaluates its performance based on the
feedback responses and reacts accordingly. The best known instances include self-driving
cars and AlphaGo, the program that mastered the game of Go.
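The feedback loop described above can be illustrated with a very small toy problem, a so-called multi-armed bandit solved with an epsilon-greedy strategy. This is a swapped-in teaching example, far simpler than the systems named above; the three reward probabilities, the exploration rate and the number of steps are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical environment: three actions, each paying a reward of 1
# with a fixed probability that is hidden from the learning system.
true_reward_prob = [0.2, 0.5, 0.8]
n_actions = len(true_reward_prob)

estimates = np.zeros(n_actions)          # current estimate of each action's value
counts = np.zeros(n_actions, dtype=int)  # how often each action was tried
epsilon = 0.1                            # share of steps spent exploring

for step in range(2000):
    # Explore a random action occasionally, otherwise exploit the best estimate.
    if rng.random() < epsilon:
        action = int(rng.integers(n_actions))
    else:
        action = int(np.argmax(estimates))

    # Feedback from the environment.
    reward = 1.0 if rng.random() < true_reward_prob[action] else 0.0

    # Adjust the estimate toward the observed reward (running average).
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]

best_action = int(np.argmax(estimates))
print("Estimated values:", estimates.round(2))
print("Action the system prefers:", best_action)
```

The system is never told which action is best; it works that out from the rewards it receives.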
As a field of science, machine learning shares common concepts with other disciplines such
as statistics, information theory, game theory, and optimization.
5. Python Machine Learning – Data Preprocessing,
Analysis and Visualization
In the real world, we usually come across lots of raw data which is not fit to be readily
processed by machine learning algorithms. We need to preprocess the raw data before it
is fed into various machine learning algorithms. This chapter discusses various techniques
for preprocessing data in Python machine learning.
Data Preprocessing
In this section, let us understand how we preprocess data in Python.
First, create a file with a .py extension, for example prefoo.py, in a text editor such as
Notepad, and add the following imports:
import numpy as np
from sklearn import preprocessing
We imported a couple of packages. Let's create some sample data by adding the following
line to this file:
input_data = np.array([[3, -1.5, 3, -6.4], [0, 3, -1.3, 4.1], [1, 2.3, -2.9, -4.3]])
Preprocessing Techniques
Data can be preprocessed using several techniques as discussed here:
Mean removal
It involves removing the mean from each feature so that it is centered on zero. Mean
removal helps in removing any bias from the features.
data_standardized = preprocessing.scale(input_data)
print("\nMean =", data_standardized.mean(axis=0))
print("Std deviation =", data_standardized.std(axis=0))
Now run the script:
$ python prefoo.py
Observe that in the output, the mean of each feature is almost 0 and the standard deviation is 1.
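A self-contained version of the mean-removal step, reusing the sample data defined earlier, makes the claim easy to check. The variable names column_means and column_stds are introduced here only for illustration.

```python
import numpy as np
from sklearn import preprocessing

# Sample data from this section: 3 data points with 4 features each.
input_data = np.array([[3, -1.5, 3, -6.4],
                       [0, 3, -1.3, 4.1],
                       [1, 2.3, -2.9, -4.3]])

# scale() subtracts each column's mean and divides by its standard deviation.
data_standardized = preprocessing.scale(input_data)

column_means = data_standardized.mean(axis=0)
column_stds = data_standardized.std(axis=0)
print("Mean =", column_means)
print("Std deviation =", column_stds)
```

The printed means are not exactly 0 because of floating-point rounding, which is why the text says "almost 0".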
Scaling
The values of each feature in a data point can span very different ranges. So, it is
important to scale them so that every feature falls within a specified range.
Now run the code and you can observe the following output:
Note that all the values have been scaled between the given range.
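The code for this step is not shown in this section, so the following is only a sketch of one common way to do it, using scikit-learn's MinMaxScaler. The target range of 0 to 1 is an assumption.

```python
import numpy as np
from sklearn import preprocessing

input_data = np.array([[3, -1.5, 3, -6.4],
                       [0, 3, -1.3, 4.1],
                       [1, 2.3, -2.9, -4.3]])

# Rescale every feature (column) linearly so that its smallest value
# becomes 0 and its largest value becomes 1.
data_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
data_scaled = data_scaler.fit_transform(input_data)
print("Min max scaled data =", data_scaled)
```

Each column is scaled independently, so after the transform every column contains a 0 and a 1.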
Normalization
Normalization involves adjusting the values in the feature vector so as to measure them
on a common scale. Here, the values of a feature vector are adjusted so that their absolute
values sum up to 1. We add the following lines to the prefoo.py file:
Now run the code and you can observe the following output:
Normalization is used to ensure that data points do not get boosted due to the nature of
their features.
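The normalization code is not shown in this section. A minimal sketch, assuming the L1 norm is meant (so that the absolute values in each feature vector sum to 1), could use preprocessing.normalize as follows.

```python
import numpy as np
from sklearn import preprocessing

input_data = np.array([[3, -1.5, 3, -6.4],
                       [0, 3, -1.3, 4.1],
                       [1, 2.3, -2.9, -4.3]])

# L1 normalization divides each row (one data point) by the sum of the
# absolute values of its entries.
data_normalized = preprocessing.normalize(input_data, norm='l1')
row_abs_sums = np.abs(data_normalized).sum(axis=1)

print("L1 normalized data =", data_normalized)
print("Sum of absolute values per row =", row_abs_sums)
```

Unlike mean removal and scaling, which work column by column, this operation works row by row.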
Binarization
Binarization is used to convert a numerical feature vector into a Boolean vector. You can
use the following code for binarization:
data_binarized = preprocessing.Binarizer(threshold=1.4).transform(input_data)
print("\nBinarized data =", data_binarized)
Now run the code and you can observe the following output:
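A self-contained version of the binarization step, using the sample data from this section, shows the effect of the threshold: values greater than 1.4 become 1 and all other values become 0.

```python
import numpy as np
from sklearn import preprocessing

input_data = np.array([[3, -1.5, 3, -6.4],
                       [0, 3, -1.3, 4.1],
                       [1, 2.3, -2.9, -4.3]])

# Values strictly greater than the threshold map to 1, the rest to 0.
data_binarized = preprocessing.Binarizer(threshold=1.4).transform(input_data)
print("Binarized data =", data_binarized)
# Rows of the result: [1, 0, 1, 0], [0, 1, 0, 1], [0, 1, 0, 0]
```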