Machine Learning using Python
All content following this page was uploaded by Ritwika Das on 28 November 2023.
Ritwika Das
Scientist, Division of Agricultural Bioinformatics
ICAR – IASRI, New Delhi – 110012
Email: Ritwika.Das@icar.gov.in
▪ Machine Learning
Machine learning is the scientific discipline concerned with the design and development of
algorithms that allow computers to evolve their behaviour based on empirical data. It is a
subset of artificial intelligence (AI) that allows a machine to learn automatically from past
data without explicit programming.
▪ Installation of Python
Python can be installed in two ways:
1. Install Python individually: Download the Python source code or installer from
https://www.python.org/downloads/ for the particular operating system
2. Install a pre-packaged Python distribution: Anaconda (https://www.anaconda.com/
download/)
▪ Jupyter Notebook
It provides an interactive computational environment for developing Python-based data
science applications. It arranges code in separate cells so that changes can be made easily,
and analysis results, viz., images, text, output etc., can be viewed in a step-by-step manner.
✓ Installation of Jupyter Notebook
C:\>pip install notebook
✓ Running Jupyter Notebook
C:\>jupyter notebook
Figure 2: (A) Running Jupyter Notebook in the local browser, (B) Starting a new
Python 3 kernel
✓ Scikit-Learn: The most important Python library for data science and machine learning.
It provides a wide range of machine learning facilities, including data preprocessing,
classification, clustering, regression, dimensionality reduction, model selection and model
evaluation.
Installation of these Python libraries can be done using the following code:
C:\>pip install <library_name>
➢ Iris Dataset
It is one of the sklearn datasets, consisting of 4 measurements, viz., sepal length, sepal
width, petal length and petal width, for 3 types of Iris flowers, viz., Setosa, Versicolour
and Virginica, stored as a 150 × 4 NumPy array. Each class, i.e., flower type, consists of 50
samples. The Iris dataset has been used for the demonstration of supervised and unsupervised
machine learning algorithms in the following sections.
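The dataset described above can be loaded directly from sklearn; a minimal sketch:

```python
# Loading the Iris dataset bundled with scikit-learn
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target      # 150 x 4 feature array and class labels 0, 1, 2

print(X.shape)                     # (150, 4)
print(iris.feature_names)          # the four sepal/petal measurements
print(iris.target_names)           # ['setosa' 'versicolor' 'virginica']
```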
The choice of K can impact the performance of the KNN classifier, and the optimum K can
be identified by the GridSearchCV method. KNN is useful for classifying non-linear data.
However, the prediction speed is low for large values of K, and KNN is also sensitive to the
scale of the data and to outlier features.
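A minimal sketch of tuning K with GridSearchCV on the Iris dataset; the grid of K values, the 80:20 split and the random_state are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# KNN is distance based, so the features are standardized first
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Search K = 1..15 with 5-fold cross-validation
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={"n_neighbors": list(range(1, 16))}, cv=5)
grid.fit(X_train, y_train)

print("best K:", grid.best_params_["n_neighbors"])
print("test accuracy:", grid.score(X_test, y_test))
```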
Multiclass problems are treated as a set of binary classification problems, and classification
is done iteratively. In case of non-linear data, a kernel function such as the radial basis
function or the sigmoid (tanh) is applied to transform the input data by mapping them to a
higher-dimensional feature space, where the classes can be separated by an optimal
separating hyperplane with maximum margin. In the following section, the sklearn class SVC
has been used for developing an SVM-based classifier for the Iris dataset. Here, optimum
parameter selection has been done using the GridSearchCV method.
Step 2: Rescaling the features and splitting the whole dataset into training and test set
in 80:20 ratio
(Similar to KNN)
Step 3: Selecting optimum parameters using the GridSearchCV method and training the
SVM classifier
Step 4: Prediction for the test data and evaluation of the trained classifier
Step 5: Visualization of SVM decision boundaries and heatmap for the confusion
matrix
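Steps 2 to 4 above can be sketched as follows; the parameter grid, split settings and random_state are illustrative assumptions, and the Step 5 plots (decision boundaries and confusion-matrix heatmap) can be drawn with matplotlib/seaborn from the same objects:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

X, y = load_iris(return_X_y=True)

# Step 2: rescale the features and split the dataset 80:20
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Step 3: select C, gamma and kernel with GridSearchCV and train the SVM
param_grid = {"C": [0.1, 1, 10, 100],
              "gamma": [1, 0.1, 0.01],
              "kernel": ["rbf", "linear"]}
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X_train, y_train)

# Step 4: predict on the test set and evaluate the trained classifier
y_pred = grid.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```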
SVM achieves very high accuracy in high-dimensional feature spaces while using very
little memory. However, its performance declines for overlapping classes, and training
takes longer for large datasets.
3. Random Forest
Random Forest is an ensemble machine learning algorithm which can be used as either a
classifier or a regressor. For a classification problem, a large number of candidate decision
trees (viz., classification trees) are first grown, each from a bootstrap replicate sample of
the input training dataset, and the final class is decided by the majority voting method. For
a regression problem, the predictions from all the candidate decision trees (viz., regression
trees) are averaged.
In the following section, sklearn class RandomForestClassifier has been used for developing
Random Forest based classifier for the Iris dataset. Optimum parameter selection has been
done using GridSearchCV method.
Step 3: Selecting optimum parameters using the GridSearchCV method and training the
Random Forest classifier
Step 4: Prediction for the test data and evaluation of the trained classifier
Step 5: Text representation of the 4th Decision Tree from the set of 100 candidate trees
in the developed Random Forest classifier and plotting that chosen tree
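A minimal sketch of Steps 3 to 5, assuming a grid over max_depth and 100 candidate trees; sklearn.tree.plot_tree would render the chosen tree graphically as in the figure below:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text
from sklearn.metrics import accuracy_score

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42,
    stratify=iris.target)

# Step 3: tune the forest with GridSearchCV (grid values are assumptions)
grid = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=42),
    param_grid={"max_depth": [2, 3, 4, None]}, cv=5)
grid.fit(X_train, y_train)
rf = grid.best_estimator_

# Step 4: prediction for the test data and evaluation
print("test accuracy:", accuracy_score(y_test, rf.predict(X_test)))

# Step 5: text representation of the 4th decision tree (index 3) out of
# the 100 candidate trees; plot_tree(rf.estimators_[3]) draws the same tree
print(export_text(rf.estimators_[3], feature_names=iris.feature_names))
```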
Figure 6: (A) Text representation of the fourth Decision Tree, (B) Plotting the fourth
Decision Tree of the Random Forest
Random Forest is more robust and efficient than a single decision tree, as combining the
predictions of all the candidate decision trees reduces model over-fitting. However, the
Random Forest algorithm has some disadvantages:
▪ Algorithm complexity
▪ Time-consuming training and prediction processes
▪ Requirement of higher computational resources
▪ Prediction slows down for a very large collection of decision trees
Apart from KNN, SVM and Random Forest, several other supervised machine learning
algorithms, such as Naïve Bayes, Artificial Neural Networks, XGBoost, Linear Regression
and Logistic Regression, can also be used for classification and regression problems.
Specific Python classes are available for these algorithms.
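These classifiers share the same fit/predict interface in sklearn; a minimal sketch with two of them (Naïve Bayes and Logistic Regression) on the Iris dataset, where the cv=5 and max_iter settings are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Any sklearn estimator can be dropped into the same evaluation loop
for model in (GaussianNB(), LogisticRegression(max_iter=1000)):
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, "mean CV accuracy:", scores.mean().round(3))
```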
2. K-means Clustering
In this algorithm, the data points are assigned to clusters in such a manner that the sum of
the squared distances between the data points and their cluster centroids is minimized. It
assumes that the number of clusters is already known and recomputes the cluster centroids
iteratively until optimal centroids are obtained. The ‘K’ in K-means denotes the number of
clusters to be identified from the data. Less variation within a cluster means that the data
points in that cluster are more similar to one another. In the following section, the sklearn
class KMeans has been used for demonstrating the K-means clustering algorithm on the
Iris dataset.
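A minimal sketch of K-means on Iris with K = 3; the n_init and random_state values are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X, _ = load_iris(return_X_y=True)   # labels are ignored: clustering is unsupervised

# Fit K-means with K = 3 clusters
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print("cluster sizes:", [list(km.labels_).count(c) for c in range(3)])
print("centroids:\n", km.cluster_centers_.round(2))

# A scatterplot of two features coloured by km.labels_, with
# km.cluster_centers_ overlaid, reproduces the kind of plot in Figure 7.
```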
Figure 7: K-means clustering for the Iris dataset and visualization of clusters and cluster
centroids using a scatterplot
In K-means clustering, choosing the number of clusters and the initial centroid positions
are the most important tasks. Moreover, the method is inefficient for high-dimensional and
non-linearly separable data, and it assumes that clusters are spherical and of equal size. In
real-world data, clusters may have irregular shapes and varying sizes, so K-means may not
be suitable for datasets with complex cluster shapes. For such problems, other clustering
algorithms such as DBSCAN, PAM, STING etc. can be used, and corresponding Python
code is also publicly available.
References:
Müller, A. C., & Guido, S. (2016). Introduction to Machine Learning with Python. O'Reilly
Media, Inc.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Duchesnay,
É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research,
12, 2825-2830.
Raschka, S., & Mirjalili, V. (2017). Python Machine Learning: Machine Learning and Deep
Learning with Python, scikit-learn, and TensorFlow (2nd ed.). Packt Publishing.
https://www.tutorialspoint.com/machine_learning_with_python/machine_learning_with_pyth
on_tutorial.pdf