Scikit - Notes ML
Scikit-learn (Sklearn) is the most useful and robust library for machine learning in
Python. It provides a range of supervised and unsupervised learning algorithms for
machine learning and statistical modeling, including classification, regression,
clustering and dimensionality reduction, via a consistent interface in Python. The
library, which is largely written in Python, is built upon NumPy, SciPy and
Matplotlib.
Because Scikit-learn is built on top of several common data and math Python libraries, it
integrates easily with all of them: you can pass NumPy arrays and pandas data frames
directly to Scikit-learn's ML algorithms, as the sketch below illustrates.
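For instance, a minimal sketch (the toy data and the LogisticRegression estimator here are illustrative choices) of fitting the same estimator once with a NumPy array and once with a pandas DataFrame:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data: two features, binary target
X_np = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([0, 0, 1, 1])

# A NumPy array can be passed directly...
model = LogisticRegression().fit(X_np, y)

# ...and so can a pandas DataFrame with named columns
X_df = pd.DataFrame(X_np, columns=['feature_a', 'feature_b'])
model = LogisticRegression().fit(X_df, y)
print(model.predict(X_df))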
Unsupervised Learning algorithms: It also has all the popular unsupervised learning
algorithms, from clustering, factor analysis and PCA (Principal Component Analysis) to
unsupervised neural networks.
Cross Validation: It is used to check the accuracy of supervised models on unseen data (illustrated in the sketch after this list).
Feature extraction: It is used to extract the features from data to define the attributes
in image and text data.
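As a concrete illustration of the cross-validation utility described above, a minimal sketch using cross_val_score on the iris data (the KNN classifier and cv=5 are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Evaluate a classifier on 5 different train/test splits (5-fold cross-validation)
X, y = load_iris(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=5)
print(scores)         # accuracy on each fold
print(scores.mean())  # average accuracy across the folds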
Regression vs Classification
Regression and Classification algorithms are both Supervised Learning algorithms. Both are
used for prediction in machine learning and work with labelled datasets. The difference
between them lies in how they are applied to different machine learning problems.
The main difference between Regression and Classification algorithms is that Regression
algorithms are used to predict continuous values such as price, salary, age, etc., whereas
Classification algorithms are used to predict/classify discrete values such as Male or
Female, True or False, Spam or Not Spam, etc.
Image: Classification vs Regression
Reference: https://www.javatpoint.com/regression-vs-classification-in-machine-learning
Classification
Classification is a process of finding a function which helps in dividing the dataset into
classes based on different parameters. In Classification, a computer program is trained on the
training dataset and based on that training, it categorizes the data into different classes.
The task of the classification algorithm is to find the mapping function to map the input
(x) to the discrete output (y).
Example: The best example to understand the Classification problem is Email Spam
Detection. The model is trained on the basis of millions of emails on different parameters,
and whenever it receives a new email, it identifies whether the email is spam or not. If the
email is spam, then it is moved to the Spam folder.
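A minimal sketch of that idea (the tiny message list, CountVectorizer and MultinomialNB are illustrative choices, not the pipeline a real spam filter would use):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data: a handful of labelled messages (1 = spam, 0 = not spam)
messages = ["win a free prize now", "lowest price guaranteed",
            "meeting at 10 tomorrow", "please review the attached report"]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)   # turn text into word-count features
model = MultinomialNB().fit(X, labels)   # learn the mapping from x to discrete y

# Classify a new, unseen message
print(model.predict(vectorizer.transform(["claim your free prize"])))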
Types of ML Classification Algorithms:
i) Logistic Regression
ii) K-Nearest Neighbours
iii) Support Vector Machines
iv) Kernel SVM
v) Naïve Bayes
vi) Decision Tree Classification
vii) Random Forest Classification
Regression
Regression is a process of finding the correlations between dependent and independent variables. It
helps in predicting continuous variables such as market trends, house prices, etc.
The task of the Regression algorithm is to find the mapping function to map the input variable (x) to
the continuous output variable (y).
Example: Suppose we want to do weather forecasting, so for this, we will use the Regression
algorithm. In weather prediction, the model is trained on the past data, and once the training is
completed, it can easily predict the weather for future days.
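A minimal sketch of this mapping (the toy temperature data and the LinearRegression estimator are illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: day number (x) vs. recorded temperature (continuous y)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([20.1, 20.9, 22.0, 23.2, 23.8])

model = LinearRegression().fit(X, y)
print(model.predict([[6]]))   # predicted (continuous) temperature for day 6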
Types of Regression Algorithm:
i) Simple Linear Regression
ii) Multiple Linear Regression
iii) Polynomial Regression
iv) Support Vector Regression
v) Decision Tree Regression
vi) Random Forest Regression
Terminology of ML
Dataset: A set of data examples that contain features important to solving the problem.
Features: Important pieces of data that help us understand a problem. These are fed into a Machine
Learning algorithm to help it learn.
Model: The representation (internal model) of a phenomenon that a Machine Learning algorithm has
learnt. It learns this from the data it is shown during training. The model is the output you get after
training an algorithm. For example, a decision tree algorithm would be trained and produce a decision
tree model.
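A short sketch of this algorithm-vs-model distinction, assuming the iris data as a convenient stand-in:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# The algorithm (DecisionTreeClassifier) plus the training data...
tree_model = DecisionTreeClassifier(max_depth=3).fit(X, y)

# ...produce a model: the learnt tree, which can be inspected and reused
print(tree_model.get_depth())
print(tree_model.predict(X[:2]))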
Process
Data Collection: Collect the data that the algorithm will learn from.
Data Preparation: Format and engineer the data into the optimal format, extracting important features
and performing dimensionality reduction.
Training: Also known as the fitting stage, this is where the Machine Learning algorithm actually
learns by showing it the data that has been collected and prepared.
Feature matrix: It is the collection of features, in case there are more than one.
Feature Names: It is the list of all the names of the features.
Response: It is the output variable that basically depends upon the feature variables. They
are also known as target, label or output.
Response Vector: It is used to represent the response column. Generally, we have just one response
column.
Target Names: They represent the possible values taken by the response vector.
Scikit-learn has a few example datasets, such as iris and digits for classification and the
Boston house prices dataset for regression (the latter has been removed in recent scikit-learn versions).
Following is an example to load the iris dataset:
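(The loading code below is a minimal sketch that would produce the output shown, using the standard load_iris API.)

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

print("Feature names:", feature_names)
print("First 10 rows of X:")
print(X[:10])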
Output
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
'petal width (cm)']
First 10 rows of X:
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5. 3.6 1.4 0.2]
[5.4 3.9 1.7 0.4]
[4.6 3.4 1.4 0.3]
[5. 3.4 1.5 0.2]
[4.4 2.9 1.4 0.2]
[4.9 3.1 1.5 0.1]]
Splitting the dataset
To check the accuracy of our model, we can split the dataset into two pieces: a training set
and a testing set. We use the training set to train the model and the testing set to test it.
After that, we can evaluate how well the model did.
The following example will split the data into 70:30 ratio, i.e. 70% data will be used as
training data and 30% will be used as testing data. The dataset is iris dataset as in above
example.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)  # 70:30 split
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
Output
(105, 4)
(45, 4)
(105,)
(45,)
As seen in the example above, the train_test_split() function of scikit-learn is used to split the
dataset. This function has the following arguments:
X, y: Here, X is the feature matrix and y is the response vector, which need to be split.
test_size: This represents the ratio of test data to the total given data. As in the above example, we are
setting test_size = 0.3 for 150 rows of X, which produces test data of 150*0.3 = 45 rows.
random_state: It is used to guarantee that the split will always be the same. This is useful in
situations where you want reproducible results.
In the example below, we are going to use the KNN (K Nearest Neighbors) classifier.
This example is meant only to illustrate the implementation.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
iris = load_iris()
X = iris.data
y = iris.target
# Split parameters assumed here (test_size=0.4, random_state=1) to match the accuracy shown below
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)
classifier_knn = KNeighborsClassifier(n_neighbors=3)
classifier_knn.fit(X_train, y_train)
y_pred = classifier_knn.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
# Providing sample data and the model will make prediction out of that data
sample = [[5, 5, 3, 2], [2, 4, 3, 5]]
print(classifier_knn.predict(sample))
Output
Accuracy: 0.9833333333333333
Clustering
The sklearn.cluster module of scikit-learn provides the following clustering methods:
KMeans, Mean Shift, Affinity Propagation, Hierarchical clustering, DBSCAN (Density-based
spatial clustering of applications with noise), BIRCH (Balanced iterative reducing and
clustering using hierarchies).
KMeans
This algorithm computes the centroids and iterates until it finds the optimal centroids.
It requires the number of clusters to be specified, i.e. it assumes that the number of
clusters is already known.
The main logic of this algorithm is to cluster the data by separating the samples into n
groups of equal variance, minimizing a criterion known as inertia.
The number of clusters identified by the algorithm is represented by K.
While computing cluster centers and the value of inertia, the sample_weight parameter
allows the sklearn.cluster.KMeans module to assign more weight to some samples.
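A minimal sketch of KMeans on a toy 2-D array (the data and n_clusters=2 are illustrative):

from sklearn.cluster import KMeans
import numpy as np

# Toy 2-D data with two obvious groups
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)
print(kmeans.labels_)           # cluster assignment for each sample
print(kmeans.cluster_centers_)  # coordinates of the two centroids
print(kmeans.inertia_)          # sum of squared distances to the closest centroids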
Dimensionality Reduction
Dimensionality reduction, an unsupervised machine learning method, is used to reduce
the number of feature variables for each data sample by selecting a set of principal
features.
Principal Component Analysis (PCA) is used for linear dimensionality reduction using Singular Value
Decomposition (SVD) of the data to project it to a lower dimensional space.
When decomposing with PCA, the input data is centered but not scaled for each feature before
the SVD is applied.
The Scikit-learn ML library provides the sklearn.decomposition.PCA module, which is implemented as a
transformer object that learns n components in its fit() method.
It can also be used on new data to project it onto these components.
Example
The example below uses the sklearn.decomposition.PCA module to find the best 5 principal
components of the Pima Indians Diabetes dataset.
from pandas import read_csv
from sklearn.decomposition import PCA

path = r'C:\Users\Leekha\Desktop\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]

pca = PCA(n_components=5)
fit = pca.fit(X)
print("Explained Variance: %s" % fit.explained_variance_ratio_)
print(fit.components_)
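To actually project the data (or new data) onto these learnt components, as mentioned above, one would additionally call transform; a short follow-up continuing from the fitted pca object in the example:

# Project the original features onto the 5 learnt principal components
X_reduced = pca.transform(X)
print(X_reduced.shape)   # (n_samples, 5)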