
MachineLearningusingPython

The document discusses the principles and applications of machine learning using Python, highlighting its importance in processing complex data efficiently across various sectors. It categorizes machine learning into supervised, unsupervised, and reinforcement learning, detailing algorithms like KNN, SVM, and Random Forest, along with their implementations in Python. Additionally, it covers essential Python libraries for machine learning, installation instructions, and provides examples using the Iris dataset.


See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/375962188

Machine Learning using Python Software

Article · November 2023

1 author:

Ritwika Das
Indian Agricultural Statistics Research Institute
45 PUBLICATIONS 116 CITATIONS


All content following this page was uploaded by Ritwika Das on 28 November 2023.



Machine Learning using Python Software

Ritwika Das
Scientist, Division of Agricultural Bioinformatics
ICAR – IASRI, New Delhi – 110012
Email: Ritwika.Das@icar.gov.in

▪ Machine Learning
Machine learning is the scientific discipline concerned with the design and development of
algorithms that allow computers to evolve behaviours based on empirical data. It is a subset
of artificial intelligence (AI) that allows a machine to learn automatically from past data
without being explicitly programmed.

▪ Importance of Machine Learning


The human brain is highly efficient at perceiving and solving problems. However, it cannot
analyse huge amounts of complex data in a short time. Training a machine learning algorithm
to classify objects accurately also takes considerable time and a large dataset. However,
once the model is trained, it can process millions of complex data items, e.g., images,
efficiently in a very short period of time. Nowadays, machine learning is applied in various
sectors, such as:
✓ Medical image analysis
✓ Root zone soil moisture estimation
✓ Estimation of soil organic matter variability with environmental factors
✓ Forecasting of commodity prices, rainfall
✓ Automated video surveillance
✓ Online customer support
✓ Detecting fraudulent activities in credit card transactions
✓ Detecting email spam and malware etc.

▪ Types of Machine Learning


Machine learning algorithms can be broadly classified into 3 categories:
1. Supervised Learning: A class of ML algorithms in which a mapping function is learned
between the features and the associated known labels in the training dataset, so that the
trained model can assign a label to a new input. Examples: Linear Regression, Logistic
Regression, KNN, SVM, Decision Tree, Random Forest, ANN etc.
2. Unsupervised Learning: A class of ML algorithms that work on unlabelled data. Clustering,
the most common case, groups the data points in such a way that homogeneity within each
cluster and heterogeneity between clusters are maximized. Examples: Hierarchical
clustering, K-means clustering, PAM, DBSCAN, CLARA, STING etc.
3. Reinforcement Learning: An area of ML concerned with taking suitable actions to
maximize reward in a particular situation. It is employed by various software and
machines to find the best possible behaviour or path to take in a specific situation.
Examples: Q-Learning, SARSA, DQN etc.

Figure 1: Types and applications of machine learning algorithms

▪ Python Programming Language


Python combines the power of general-purpose programming languages with the
user-friendliness of domain-specific scripting languages like MATLAB or R. It has numerous
modules for data loading, visualization, statistical analysis, natural language processing,
image processing etc. Advantages of the Python programming language are:
✓ Ability to interact directly with the code, using a terminal or other tools like the Jupyter
Notebook
✓ Open source, scalable
✓ Quick iteration and easy interaction
✓ Availability of an extensive range of machine learning libraries, such as Scikit-Learn,
TensorFlow, Keras, PyTorch etc.
✓ Easy to create complex graphical user interfaces (GUIs) and web applications
✓ Easy to integrate into existing software.

▪ Installation of Python
Python can be installed in two ways:
1. Install Python individually: Download the Python installer for the particular operating
system from https://www.python.org/downloads/
2. Install a pre-packaged Python distribution: Anaconda (https://www.anaconda.com/download/)
▪ Jupyter Notebook
It provides an interactive computational environment for developing Python-based data
science applications. Code is arranged in separate blocks (cells), so that changes can be
made easily and analysis results, viz., images, text, output etc., can be viewed in a
step-by-step manner.
✓ Installation of Jupyter Notebook
C:\>pip install notebook
✓ Running Jupyter Notebook
C:\>jupyter notebook

Figure 2: (A) Running Jupyter Notebook in the local browser, (B) Starting a new Python3 kernel

▪ Python libraries used for Machine Learning:


For developing machine learning models, the following Python libraries need to be
installed:
✓ NumPy: It represents data as one-dimensional and multidimensional NumPy arrays and
provides high-level mathematical functions such as linear algebra operations, Fourier
transforms, pseudorandom number generators etc.
✓ Pandas: It represents tabular data as a DataFrame, similar to an R data frame or an Excel
spreadsheet. Input data in various formats like SQL, CSV, Excel etc. can be loaded into a
Python programme using Pandas. It also facilitates easy data manipulation, SQL-like
queries and joins of tables.
✓ Scikit-Learn: The most important Python library for data science and machine learning. It
contains a wide range of tools for data preprocessing, classification, clustering,
regression, dimensionality reduction, model selection, model evaluation etc.
✓ Matplotlib: This library is useful for making publication-quality visualizations such as
line charts, histograms, scatter plots, and so on.
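As a small illustration of the NumPy functionality listed above, the following sketch builds a two-dimensional array, solves a linear system with the linear algebra module, and draws pseudorandom numbers. The matrix values, the seed and the variable names are illustrative assumptions, not taken from the original article.

```python
import numpy as np

# A 2 x 2 coefficient matrix and a right-hand side (illustrative values)
a = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

# Linear algebra: solve a @ x = b
x = np.linalg.solve(a, b)
print(x)

# Pseudorandom number generation with a fixed seed for reproducibility
rng = np.random.default_rng(seed=0)
samples = rng.normal(size=(5, 4))   # 5 rows, 4 columns

# Column-wise mean of the multidimensional array
col_means = samples.mean(axis=0)
print(col_means.shape)
```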

These Python libraries can be installed using the following command:
C:\>pip install <library_name>

➢ Iris Dataset
It is one of the datasets bundled with sklearn. It consists of 4 measurements, viz., sepal length,
sepal width, petal length and petal width, for 3 types of Iris flowers, viz., Setosa, Versicolour
and Virginica, stored as a 150 × 4 NumPy array. Each class, i.e., flower type, consists of 50
samples. The Iris dataset is used to demonstrate supervised and unsupervised
machine learning algorithms in the following sections.
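A minimal sketch of how the dataset can be loaded and inspected is given below. It uses the `load_iris` function of sklearn; the variable names are illustrative.

```python
from collections import Counter

from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target   # X: measurements, y: integer class labels

print(X.shape)                  # rows x columns of the feature array
print(iris.feature_names)       # names of the 4 measurements
print(list(iris.target_names))  # names of the 3 flower types

# Number of samples in each class
class_counts = Counter(y.tolist())
print(class_counts)
```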

❖ Supervised Machine Learning using Python


1. K-Nearest Neighbour (KNN)
It is a non-parametric, lazy learning algorithm which makes no assumptions about
the underlying data distribution. It uses ‘feature similarity’, i.e., the distance between a new
data point and the data points in the training set. The training data points closest to the new
data point are called its “Nearest Neighbours”. The class to which most of the top K
nearest neighbours belong is assigned to the new data point as its predicted class.
The KNeighborsClassifier class of the sklearn library is used for developing a KNN classifier in
Python. The steps for developing a KNN classifier for the Iris dataset are given below:

Step 1: Load required libraries and the Iris dataset


Step 2: Rescaling the features and splitting the whole dataset into training and test set
in 80:20 ratio

Step 3: Build and train the KNN classifier


Step 4: Prediction for the test data and evaluation of the trained classifier
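The four steps above can be sketched as follows. This is one possible implementation, not necessarily the original code of the article: the use of `StandardScaler` for rescaling, K = 5, the stratified split and the `random_state` value are all assumptions. The scaler is fitted on the training set only, so that no information from the test set is used during training.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Step 1: load the Iris dataset
X, y = load_iris(return_X_y=True)

# Step 2: split in 80:20 ratio, then rescale the features
# (random_state and stratify are assumptions made for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
scaler = StandardScaler().fit(X_train)      # fitted on training data only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Step 3: build and train the KNN classifier (K = 5 is an assumed value)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_s, y_train)

# Step 4: predict for the test data and evaluate the classifier
y_pred = knn.predict(X_test_s)
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
print("Accuracy:", accuracy)
print(cm)
```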

The value of K can impact the performance of the KNN classifier. The optimum K
can be identified with the GridSearchCV method. KNN is useful for classifying non-linear data.
However, prediction is slow for large training sets or a large K value, because distances to the
stored training points are computed at prediction time. KNN is also sensitive to the
scale of the data and to outliers.
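The search for an optimum K can be sketched with `GridSearchCV` as shown below. The candidate range of K (1 to 15), the 5-fold cross-validation and the use of a pipeline are assumptions chosen for illustration. Placing the scaler inside the pipeline means it is refitted on each training fold.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaler and classifier combined, so scaling is learned per training fold
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# make_pipeline names each step after its lowercased class name
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 16))}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
test_accuracy = search.score(X_test, y_test)
print("Best K:", best_k)
print("Test accuracy:", test_accuracy)
```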

2. Support Vector Machine (SVM)


SVM is a supervised learning algorithm which can be used for classification and regression
problems as support vector classification (SVC) and support vector regression (SVR). It is
mostly used for small and medium-sized datasets, as training takes long on large ones. The
input data points belonging to various classes are represented as points in a multidimensional
feature space. The goal of SVM is to divide the data points into classes by finding the
maximum marginal hyperplane (MMH). Data points that are closest to the MMH are called
support vectors. SVM is an efficient binary classifier in which the support vectors define two
parallel margin hyperplanes, one on each side of the MMH, that separate the two classes from
each other. The equations of the two hyperplanes and the classification process by SVM can be
represented as:
If $W^{T} x + b \geq 1$, then $x \in \text{Class 1}$    (1)
If $W^{T} x + b \leq -1$, then $x \in \text{Class 2}$    (2)
The aim is to minimize: $\frac{1}{2}\|W\|^{2}$
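The reason why minimizing this quantity gives the maximum margin can be sketched as follows. This is the standard textbook argument; the class labels $y_i \in \{+1, -1\}$ and the index $i = 1, \dots, n$ over training points are notation assumed here and are not used in the original article.

```latex
% Distance between the parallel hyperplanes
% W^T x + b = 1 and W^T x + b = -1
\[
\text{margin} \;=\; \frac{(1 - b) - (-1 - b)}{\|W\|} \;=\; \frac{2}{\|W\|}
\]

% Maximizing 2 / ||W|| is equivalent to minimizing (1/2) ||W||^2,
% subject to every training point lying on the correct side of its margin
\[
\min_{W,\,b} \; \frac{1}{2}\|W\|^{2}
\quad \text{subject to} \quad
y_i \,\bigl(W^{T} x_i + b\bigr) \;\geq\; 1, \qquad i = 1, \dots, n
\]
```

The single constraint combines conditions (1) and (2): it reads $W^{T} x_i + b \geq 1$ for $y_i = +1$ and $W^{T} x_i + b \leq -1$ for $y_i = -1$.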

Figure 3: Binary SVM classifier

Multiclass problems are decomposed into a set of several binary classification cases whose
results are combined. In case of non-linear data, a kernel function such as the radial basis
function, sigmoid (tanh) etc. is applied, which implicitly maps the input data to
a higher dimensional feature space where they can be separated by an optimal
separating hyperplane with two parallel margin hyperplanes. In the following section, the
sklearn class SVC has been used for developing an SVM-based classifier for the Iris dataset.
Here, optimum parameter selection has been done using the GridSearchCV method.

Step 1: Load required libraries and the Iris dataset
(Similar to KNN)

Step 2: Rescaling the features and splitting the whole dataset into training and test set
in 80:20 ratio
(Similar to KNN)
Step 3: Selecting optimum parameters using GridSearchCV method and train the SVM
classifier

Step 4: Prediction for the test data and evaluation of the trained classifier
Step 5: Visualization of SVM decision boundaries and heatmap for the confusion
matrix
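Steps 1 to 4 above can be sketched as follows. The parameter grid (kernels, values of C and gamma), the 5-fold cross-validation and the `random_state` value are assumptions for illustration; the original article's grid is not shown in the text. The plots of Step 5 are omitted here, but the confusion matrix computed at the end is the array such a heatmap would display.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Steps 1 and 2: load the data and split in 80:20 ratio
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Step 3: select parameters by grid search and train the SVM classifier
# (rescaling is done inside the pipeline; the grid values are assumptions)
pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {
    "svc__kernel": ["linear", "rbf"],
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": ["scale", 0.01, 0.1, 1],  # ignored by the linear kernel
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)

# Step 4: predict for the test data and evaluate the classifier
y_pred = search.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
print("Accuracy:", accuracy)
print(cm)

# Number of support vectors found for each class
n_support = search.best_estimator_.named_steps["svc"].n_support_
print("Support vectors per class:", n_support)
```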
SVM can achieve very high accuracy in high dimensional feature spaces while using very
little memory, since the decision function depends only on the support vectors. However, its
performance declines for overlapping classes. Moreover, SVM takes a long time to train on
large datasets.

3. Random Forest
Random Forest is an ensemble machine learning algorithm which can be used as either a
classifier or a regressor. For a classification problem, a large number of candidate decision
trees (viz., classification trees) are first generated, each from a bootstrap replicate sample of
the input training dataset. The final class is decided by majority voting across the trees. For a
regression problem, the predictions from all the candidate decision trees (viz., regression
trees) are averaged.

Figure 4: Random Forest classifier

In the following section, sklearn class RandomForestClassifier has been used for developing
Random Forest based classifier for the Iris dataset. Optimum parameter selection has been
done using GridSearchCV method.

Step 1: Load required libraries and the Iris dataset
(Similar to KNN)
Step 2: Rescaling the features and splitting the whole dataset into training and test set
in 80:20 ratio
(Similar to KNN)

Step 3: Selecting optimum parameters using GridSearchCV method and train the
Random Forest classifier

Step 4: Prediction for the test data and evaluation of the trained classifier
Step 5: Text representation of the 4th Decision Tree from the set of 100 candidate trees
in the developed Random Forest classifier and plotting that chosen tree
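The steps above can be sketched as follows. The parameter grid, the 5-fold cross-validation and the `random_state` value are assumptions for illustration. Rescaling is left out in this sketch, because tree splits do not depend on the scale of the features and the printed thresholds then stay in centimetres. The fourth tree is `estimators_[3]`, since Python indexing starts at 0. Plotting of the tree is omitted; only the text representation is produced.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import export_text

# Steps 1 and 2: load the data and split in 80:20 ratio
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2,
    stratify=iris.target, random_state=42
)

# Step 3: grid search over an assumed parameter grid, 100 trees per forest
param_grid = {"max_depth": [2, 3, None], "max_features": ["sqrt", None]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=42),
    param_grid,
    cv=5,
)
search.fit(X_train, y_train)
forest = search.best_estimator_
print("Best parameters:", search.best_params_)

# Step 4: predict for the test data and evaluate the classifier
accuracy = accuracy_score(y_test, forest.predict(X_test))
print("Accuracy:", accuracy)

# Step 5: text representation of the 4th decision tree of the forest
fourth_tree = forest.estimators_[3]
tree_text = export_text(fourth_tree, feature_names=list(iris.feature_names))
print(tree_text)
```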

Figure 6: (A) Text representation of the fourth Decision Tree, (B) Plotting the fourth Decision Tree of the Random Forest

Random Forest is more robust than a single decision tree, as it reduces the
model over-fitting problem by combining the predictions from all the candidate decision
trees. However, there are some disadvantages of Random Forest algorithms:
▪ Algorithm complexity
▪ Training and prediction processes are time consuming
▪ Requirement of higher computation resources
▪ Accuracy gains level off for a large collection of decision trees, while the computation cost keeps growing.

Apart from KNN, SVM and Random Forest, several other supervised machine learning
algorithms can also be used for classification and regression problems, such as Naïve Bayes,
Artificial Neural Network, XGBoost, Linear Regression, Logistic Regression etc. Specific
Python classes are available for these algorithms.

❖ Unsupervised Machine Learning using Python


1. Hierarchical Clustering
In this method, unlabelled input data points are grouped into clusters using either an
agglomerative (bottom-up) or a divisive (top-down) approach. In agglomerative hierarchical
algorithms, each data point is initially treated as a single cluster, and the closest pairs of
clusters are then successively merged until the desired result is obtained. The hierarchy of the
clusters is represented as a dendrogram or tree structure. In divisive hierarchical algorithms,
all the data points are initially treated as one big cluster, and clustering proceeds by
dividing the big cluster into smaller clusters. There are various linkage criteria for
calculating the distance between clusters, such as single linkage, complete
linkage, average linkage (UPGMA), Ward’s method etc. In the following section, the sklearn
class AgglomerativeClustering has been used for generating agglomerative clusters from the
Iris dataset using the average linkage distance measure.

Step 1: Agglomerative hierarchical clustering and representation of clusters using a
scatterplot
Step 2: Plotting the dendrogram
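The two steps above can be sketched as follows. The choice of 3 clusters is an assumption based on the 3 flower types. The dendrogram is computed with SciPy's `linkage` and `dendrogram` functions, which is an assumed approach since the original code is not shown in the text. Here `no_plot=True` returns the tree layout without drawing it; removing that argument draws the figure when Matplotlib is installed. The scatterplot is omitted.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)   # labels are not used for clustering

# Step 1: agglomerative clustering with average linkage (3 clusters assumed)
model = AgglomerativeClustering(n_clusters=3, linkage="average")
labels = model.fit_predict(X)
cluster_sizes = np.bincount(labels)
print("Cluster sizes:", cluster_sizes)

# Step 2: the merge history behind the dendrogram
# Each row of Z records one merge: the two clusters joined, their
# distance, and the size of the newly formed cluster
Z = linkage(X, method="average")
tree = dendrogram(Z, no_plot=True)  # layout only, nothing is drawn
print("Number of leaves:", len(tree["leaves"]))
```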

There are some disadvantages of hierarchical clustering, such as inability to separate
overlapping clusters efficiently, difficulty in handling non-Euclidean distance, inefficiency
for high dimensional data, lack of scalability and flexibility, sensitivity to outliers, difficulty
in determining the number of clusters etc.

2. K-means Clustering
In this algorithm, the data points are assigned to clusters in such a manner that the sum of
the squared distances between the data points and their cluster centroids is minimum. It
assumes that the number of clusters is already known. Starting from initial cluster
centroids, it reassigns the data points and recomputes the centroids iteratively until the
centroids no longer change. The number of clusters, specified in advance, is represented by
‘K’ in K-means. Less variation within a cluster means more similar data points within that
cluster. In the following section, the sklearn class KMeans has been used for demonstrating
the K-means clustering algorithm using the Iris dataset.
Figure 7: K-means clustering for Iris dataset and visualization of clusters and cluster
centroids using scatterplot diagram
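A minimal sketch of K-means clustering for the Iris dataset is given below. K = 3, `n_init=10` and the `random_state` value are assumptions for illustration, and the scatterplot is omitted. The last lines recompute the sum of squared distances between the data points and their assigned centroids, which is the quantity the algorithm minimizes and which sklearn reports as `inertia_`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)   # labels are not used for clustering

# K = 3 clusters; n_init restarts guard against poor initial centroids
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
centroids = kmeans.cluster_centers_
inertia = kmeans.inertia_

print("Cluster sizes:", np.bincount(labels))
print("Centroids:")
print(centroids)

# Sum of squared distances of each point to its own cluster centroid
recomputed = float(((X - centroids[labels]) ** 2).sum())
print("Inertia:", inertia, "recomputed:", recomputed)
```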

In K-means clustering, choosing the number of clusters and the initial centroid
positions are the most important tasks. Moreover, this method is inefficient for high
dimensional and non-linearly separable data. It assumes that the clusters are spherical and
of equal size. In real-world data, clusters might have irregular shapes and varying sizes.
Thus, K-means may not be suitable for datasets with complex cluster shapes. For such
problems, other clustering algorithms such as DBSCAN, PAM, STING etc. can be used, and
corresponding Python code is also publicly available.
