Scikit - Notes ML
Scikit-learn (Sklearn) is the most useful and robust library for machine learning in
Python. It provides a range of supervised and unsupervised learning algorithms for
machine learning and statistical modeling, including classification, regression,
clustering and dimensionality reduction, via a consistent interface in Python. The
library, which is largely written in Python, is built upon NumPy, SciPy and
Matplotlib.
Because Scikit-learn is built on top of several common data and math Python libraries, it
integrates easily with all of them: you can pass NumPy arrays and pandas data frames
directly to Scikit-learn's ML algorithms, as the sketch below illustrates.
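For instance, a minimal sketch (the toy data and the LogisticRegression estimator here are illustrative choices) of fitting the same estimator once with a NumPy array and once with a pandas DataFrame:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data: two features, binary target
X_np = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([0, 0, 1, 1])

# A NumPy array can be passed directly...
model = LogisticRegression().fit(X_np, y)

# ...and so can a pandas DataFrame with named columns
X_df = pd.DataFrame(X_np, columns=['feature_a', 'feature_b'])
model = LogisticRegression().fit(X_df, y)
print(model.predict(X_df))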
Unsupervised Learning algorithms: It also has all the popular unsupervised learning
algorithms, from clustering, factor analysis and PCA (Principal Component Analysis) to
unsupervised neural networks.
Cross Validation: It is used to check the accuracy of supervised models on unseen data (illustrated in the sketch after this list).
Feature extraction: It is used to extract the features from data to define the attributes
in image and text data.
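As a concrete illustration of the cross-validation utility described above, a minimal sketch using cross_val_score on the iris data (the KNN classifier and cv=5 are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Evaluate a classifier on 5 different train/test splits (5-fold cross-validation)
X, y = load_iris(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=5)
print(scores)         # accuracy on each fold
print(scores.mean())  # average accuracy across the folds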
Regression vs Classification
Regression and Classification algorithms are both Supervised Learning algorithms. Both are
used for prediction in machine learning and work with labelled datasets. The difference
between them lies in how they are applied to different machine learning problems.
The main difference between Regression and Classification algorithms is that Regression
algorithms are used to predict continuous values such as price, salary, age, etc., whereas
Classification algorithms are used to predict/classify discrete values such as Male or
Female, True or False, Spam or Not Spam, etc.
Image: Classification vs Regression
Reference: https://www.javatpoint.com/regression-vs-classification-in-machine-learning
Classification
Classification is a process of finding a function which helps in dividing the dataset into
classes based on different parameters. In Classification, a computer program is trained on the
training dataset and based on that training, it categorizes the data into different classes.
The task of the classification algorithm is to find the mapping function to map the input
(x) to the discrete output (y).
Example: The best example to understand the Classification problem is Email Spam
Detection. The model is trained on the basis of millions of emails on different parameters,
and whenever it receives a new email, it identifies whether the email is spam or not. If the
email is spam, then it is moved to the Spam folder.
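A minimal sketch of that idea (the tiny message list, CountVectorizer and MultinomialNB are illustrative choices, not the pipeline a real spam filter would use):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data: a handful of labelled messages (1 = spam, 0 = not spam)
messages = ["win a free prize now", "lowest price guaranteed",
            "meeting at 10 tomorrow", "please review the attached report"]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)   # turn text into word-count features
model = MultinomialNB().fit(X, labels)   # learn the mapping from x to discrete y

# Classify a new, unseen message
print(model.predict(vectorizer.transform(["claim your free prize"])))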
Types of ML Classification Algorithms:
i) Logistic Regression
ii) K-Nearest Neighbours
iii) Support Vector Machines
iv) Kernel SVM
v) Naïve Bayes
vi) Decision Tree Classification
vii) Random Forest Classification
Regression
Regression is a process of finding the correlations between dependent and independent variables. It
helps in predicting continuous variables such as market trends, house prices, etc.
The task of the Regression algorithm is to find the mapping function to map the input variable (x) to
the continuous output variable (y).
Example: Suppose we want to do weather forecasting, so for this, we will use the Regression
algorithm. In weather prediction, the model is trained on the past data, and once the training is
completed, it can easily predict the weather for future days.
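A minimal sketch of this mapping (the toy temperature data and the LinearRegression estimator are illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: day number (x) vs. recorded temperature (continuous y)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([20.1, 20.9, 22.0, 23.2, 23.8])

model = LinearRegression().fit(X, y)
print(model.predict([[6]]))   # predicted (continuous) temperature for day 6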
Types of Regression Algorithm:
i) Simple Linear Regression
ii) Multiple Linear Regression
iii) Polynomial Regression
iv) Support Vector Regression
v) Decision Tree Regression
vi) Random Forest Regression
Terminology of ML
Dataset: A set of data examples that contain features important to solving the problem.
Features: Important pieces of data that help us understand a problem. These are fed into a Machine
Learning algorithm to help it learn.
Model: The representation (internal model) of a phenomenon that a Machine Learning algorithm has
learnt. It learns this from the data it is shown during training. The model is the output you get after
training an algorithm. For example, a decision tree algorithm would be trained and produce a decision
tree model.
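A short sketch of this algorithm-vs-model distinction, assuming the iris data as a convenient stand-in:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# The algorithm (DecisionTreeClassifier) plus the training data...
tree_model = DecisionTreeClassifier(max_depth=3).fit(X, y)

# ...produce a model: the learnt tree, which can be inspected and reused
print(tree_model.get_depth())
print(tree_model.predict(X[:2]))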
Process
Data Collection: Collect the data that the algorithm will learn from.
Data Preparation: Format and engineer the data into the optimal format, extracting important features
and performing dimensionality reduction.
Training: Also known as the fitting stage, this is where the Machine Learning algorithm actually
learns by showing it the data that has been collected and prepared.
Feature matrix: It is the collection of features, in case there are more than one.
Feature Names: It is the list of all the names of the features.
Response: It is the output variable that basically depends upon the feature variables. They
are also known as target, label or output.
Response Vector: It is used to represent the response column. Generally, we have just one response
column.
Target Names: They represent the possible values taken by the response vector.
Scikit-learn has a few example datasets, such as iris and digits for classification and the
Boston house prices dataset for regression (the latter has been removed in recent scikit-learn versions).
Following is an example to load the iris dataset:
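(The loading code below is a minimal sketch that would produce the output shown, using the standard load_iris API.)

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

print("Feature names:", feature_names)
print("First 10 rows of X:")
print(X[:10])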
Output
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
'petal width (cm)']
First 10 rows of X:
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5. 3.6 1.4 0.2]
[5.4 3.9 1.7 0.4]
[4.6 3.4 1.4 0.3]
[5. 3.4 1.5 0.2]
[4.4 2.9 1.4 0.2]
[4.9 3.1 1.5 0.1]]
Splitting the dataset
To check the accuracy of our model, we can split the dataset into two pieces: a training set
and a testing set. We use the training set to train the model and the testing set to test it.
After that, we can evaluate how well the model did.
The following example will split the data into 70:30 ratio, i.e. 70% data will be used as
training data and 30% will be used as testing data. The dataset is iris dataset as in above
example.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)  # 70:30 split
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
Output
(105, 4)
(45, 4)
(105,)
(45,)
As seen in the example above, the train_test_split() function of scikit-learn is used to split the
dataset. This function has the following arguments:
X, y: Here, X is the feature matrix and y is the response vector, which need to be split.
test_size: This represents the ratio of test data to the total given data. As in the above example, we are
setting test_size = 0.3 for 150 rows of X, which produces test data of 150*0.3 = 45 rows.
random_state: It is used to guarantee that the split will always be the same. This is useful in
situations where you want reproducible results.
In the example below, we are going to use the KNN (K Nearest Neighbors) classifier.
This example is meant only to illustrate the implementation.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
iris = load_iris()
X = iris.data
y = iris.target
# Split parameters assumed here (test_size=0.4, random_state=1) to match the accuracy shown below
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)
classifier_knn = KNeighborsClassifier(n_neighbors=3)
classifier_knn.fit(X_train, y_train)
y_pred = classifier_knn.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
# Providing sample data and the model will make prediction out of that data
sample = [[5, 5, 3, 2], [2, 4, 3, 5]]
print(classifier_knn.predict(sample))
Output
Accuracy: 0.9833333333333333
Clustering
The sklearn.cluster module of scikit-learn provides the following clustering methods:
KMeans, Mean Shift, Affinity Propagation, Hierarchical clustering, DBSCAN (Density-based
spatial clustering of applications with noise), BIRCH (Balanced iterative reducing and
clustering using hierarchies).
KMeans
This algorithm computes the centroids and iterates until it finds the optimal centroids.
It requires the number of clusters to be specified, i.e. it assumes that the number of
clusters is already known.
The main logic of this algorithm is to cluster the data by separating the samples into n
groups of equal variance, minimizing a criterion known as inertia.
The number of clusters identified by the algorithm is represented by K.
While computing cluster centers and the value of inertia, the sample_weight parameter
allows the sklearn.cluster.KMeans module to assign more weight to some samples.
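A minimal sketch of KMeans on a toy 2-D array (the data and n_clusters=2 are illustrative):

from sklearn.cluster import KMeans
import numpy as np

# Toy 2-D data with two obvious groups
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)
print(kmeans.labels_)           # cluster assignment for each sample
print(kmeans.cluster_centers_)  # coordinates of the two centroids
print(kmeans.inertia_)          # sum of squared distances to the closest centroids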
Dimensionality Reduction
Dimensionality reduction, an unsupervised machine learning method, is used to reduce
the number of feature variables for each data sample by selecting a set of principal
features.
Principal Component Analysis (PCA) is used for linear dimensionality reduction using Singular Value
Decomposition (SVD) of the data to project it to a lower dimensional space.
When decomposing with PCA, the input data is centered but not scaled for each feature before
the SVD is applied.
The Scikit-learn ML library provides the sklearn.decomposition.PCA module, which is implemented as a
transformer object that learns n components in its fit() method.
It can also be used on new data to project it onto these components.
Example
The example below uses the sklearn.decomposition.PCA module to find the best 5 principal
components of the Pima Indians Diabetes dataset.
from pandas import read_csv
from sklearn.decomposition import PCA

path = r'C:\Users\Leekha\Desktop\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]

pca = PCA(n_components=5)
fit = pca.fit(X)
print("Explained Variance: %s" % fit.explained_variance_ratio_)
print(fit.components_)
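To actually project the data (or new data) onto these learnt components, as mentioned above, one would additionally call transform; a short follow-up continuing from the fitted pca object in the example:

# Project the original features onto the 5 learnt principal components
X_reduced = pca.transform(X)
print(X_reduced.shape)   # (n_samples, 5)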