
Applying Machine Learning Algorithms with Scikit-learn (Sklearn)

Topics Covered:
● What Is Regression?
● Why Do We Need Regression?
● Linear Regression
● Regression Metrics
● What Is Clustering?
● KMeans Clustering

What Is Regression?

Regression analysis is one of the most important fields in statistics and machine learning. There are many regression methods available. Linear regression is one of them.
Regression searches for relationships among variables. For example, you
can observe several employees of some company and try to understand
how their salaries depend on their features, such as experience,
education level, role, city of employment, and so on.

This is a regression problem where the data related to each employee represents one observation. The presumption is that the experience, education, role, and city are the independent features, while the salary depends on them.

Similarly, you can try to establish the mathematical dependence of housing prices on area, number of bedrooms, distance to the city center, and so on.

Generally, in regression analysis, you consider some phenomenon of interest and have a number of observations. Each observation has two or more features. Following the assumption that at least one of the features depends on the others, you try to establish a relation among them.

In other words, you need to find a function that maps some features or
variables to others sufficiently well.

The dependent features are called the dependent variables, outputs, or responses. The independent features are called the independent variables, inputs, regressors, or predictors.

Regression problems usually have one continuous and unbounded dependent variable. The inputs, however, can be continuous, discrete, or even categorical data such as gender, nationality, or brand.

It’s a common practice to denote the outputs with 𝑦 and the inputs with
𝑥. If there are two or more independent variables, then they can be
represented as the vector 𝐱 = (𝑥₁, …, 𝑥ᵣ), where 𝑟 is the number of inputs.

Why Do We Need Regression?

We need regression to answer whether and how one phenomenon influences another, or how several variables are related. For example, you can use it to determine if and to what extent experience or gender impacts salaries.

Regression is also useful when you want to forecast a response using a new set of predictors. For example, you could try to predict electricity consumption of a household for the next hour given the outdoor temperature, time of day, and number of residents in that household.

Regression is used in many different fields, including economics, computer science, and the social sciences. Its importance rises every day with the availability of large amounts of data and increased awareness of the practical value of data.

Linear Regression

Linear regression is probably one of the most important and widely used regression techniques. It’s among the simplest regression methods. One of its main advantages is the ease of interpreting results.

It is a predictive modeling technique that investigates the relationship between a dependent variable and one or more independent variables. The dependent/target variable is continuous in nature, e.g., sales, weight, profit, revenue, price, distance, magnitude, height, etc.

y = dependent variable / output / target variable
x = independent variable / input(s)

When implementing linear regression of some dependent variable 𝑦 on the set of independent variables 𝐱 = (𝑥₁, …, 𝑥ᵣ), where 𝑟 is the number of predictors, you assume a linear relationship between 𝑦 and 𝐱: 𝑦 = 𝛽₀ + 𝛽₁𝑥₁ + ⋯ + 𝛽ᵣ𝑥ᵣ + 𝜀. This equation is the regression equation. 𝛽₀, 𝛽₁, …, 𝛽ᵣ are the regression coefficients, and 𝜀 is the random error.

Linear regression calculates the estimators of the regression coefficients, or simply the predicted weights, denoted 𝑏₀, 𝑏₁, …, 𝑏ᵣ. These estimators define the estimated regression function 𝑓(𝐱) = 𝑏₀ + 𝑏₁𝑥₁ + ⋯ + 𝑏ᵣ𝑥ᵣ. This function should capture the dependencies between the inputs and output sufficiently well.

The estimated or predicted response, 𝑓(𝐱ᵢ), for each observation 𝑖 = 1, …, 𝑛, should be as close as possible to the corresponding actual response 𝑦ᵢ. The differences 𝑦ᵢ − 𝑓(𝐱ᵢ) for all observations 𝑖 = 1, …, 𝑛, are called the residuals. Regression is about determining the best predicted weights, that is, the weights corresponding to the smallest residuals.

To get the best weights, you usually minimize the sum of squared
residuals (SSR) for all observations 𝑖 = 1, …, 𝑛: SSR = Σᵢ(𝑦ᵢ - 𝑓(𝐱ᵢ))². This
approach is called the method of ordinary least squares.
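As a rough illustration of this idea, the following sketch (with made-up numbers) computes the SSR for a candidate line and then uses NumPy's polyfit, which solves the ordinary least squares problem for a single input:

```python
import numpy as np

# Toy data (made up for illustration): one input feature and a response.
x = np.array([5, 15, 25, 35, 45, 55], dtype=float)
y = np.array([5, 20, 14, 32, 22, 38], dtype=float)

def ssr(b0, b1, x, y):
    """Sum of squared residuals for the candidate line f(x) = b0 + b1*x."""
    predictions = b0 + b1 * x
    residuals = y - predictions
    return np.sum(residuals ** 2)

# Ordinary least squares picks the weights that minimize SSR.
# np.polyfit solves this directly for a single input (degree-1 polynomial).
b1, b0 = np.polyfit(x, y, deg=1)
print("intercept b0:", b0, "slope b1:", b1)
print("minimal SSR:", ssr(b0, b1, x, y))
```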

Regression Performance

The variation of actual responses 𝑦ᵢ, 𝑖 = 1, …, 𝑛, occurs partly due to the dependence on the predictors 𝐱ᵢ. However, there’s also an additional inherent variance of the output.

The coefficient of determination, denoted as 𝑅², tells you which amount of variation in 𝑦 can be explained by the dependence on 𝐱, using the particular regression model. A larger 𝑅² indicates a better fit and means that the model can better explain the variation of the output with different inputs.

The value 𝑅² = 1 corresponds to SSR = 0. That’s the perfect fit, since the
values of predicted and actual responses fit completely to each other.

Simple Linear Regression

Simple or single-variate linear regression is the simplest case of linear regression, as it has a single independent variable, 𝐱 = 𝑥.
When implementing simple linear regression, you typically start with a given set of input-output (𝑥-𝑦) pairs. These pairs are your observations. For example, one observation might have the input 𝑥 = 5 and the actual output, or response, 𝑦 = 5; the next might have 𝑥 = 15 and 𝑦 = 20, and so on.

It is a regression model that estimates the relationship between one independent variable and one dependent variable using a straight line.

It has an equation of the form

y = mx + c

where:
x = independent variable / input feature / input attribute / input column(s)
y = dependent variable / output feature / target attribute / output column
m = slope or coefficient or weight: how much we expect y to change as x changes
c = intercept / constant / bias

As an example, take x = time spent studying and y = marks obtained. Plotting the data points together with the best-fit line y = mx + c found by linear regression shows the points scattered around that line.
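The sketch below, using hypothetical study-time and marks values (not the original example's data), shows how sklearn's LinearRegression recovers m (coef_) and c (intercept_):

```python
from sklearn.linear_model import LinearRegression
import numpy as np

# Hypothetical study-time (hours) vs. marks data, made up for illustration.
X = np.array([[1], [2], [3], [4], [5], [6]])   # x must be 2D for sklearn
y = np.array([35, 45, 50, 62, 68, 80])         # y is 1D

model = LinearRegression()
model.fit(X, y)

print("slope m:", model.coef_[0])        # how much marks change per extra hour
print("intercept c:", model.intercept_)  # predicted marks at zero hours
print("prediction for 4.5 hours:", model.predict([[4.5]])[0])
```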

Multiple Linear Regression

Multiple or multivariate linear regression is a case of linear regression with two or more independent variables.
Multiple linear regression is used to estimate the relationship between two or more independent variables and one dependent variable.
It has an equation of the form

y = m₁x₁ + m₂x₂ + m₃x₃ + … + mₙxₙ + c

where:
x₁, x₂, x₃, …, xₙ = independent variables / input features
y = dependent variable / output feature
m₁, m₂, m₃, …, mₙ = coefficients/slopes corresponding to x₁ … xₙ
c = intercept / constant / bias

If there are just two independent variables, then the estimated regression function is 𝑓(𝑥₁, 𝑥₂) = 𝑏₀ + 𝑏₁𝑥₁ + 𝑏₂𝑥₂. It represents a regression plane in a three-dimensional space. The goal of regression is to determine the values of the weights 𝑏₀, 𝑏₁, and 𝑏₂ such that this plane is as close as possible to the actual responses, while yielding the minimal SSR.
The case of more than two independent variables is similar, but more general. The estimated regression function is 𝑓(𝑥₁, …, 𝑥ᵣ) = 𝑏₀ + 𝑏₁𝑥₁ + ⋯ + 𝑏ᵣ𝑥ᵣ, and there are 𝑟 + 1 weights to be determined when the number of inputs is 𝑟.
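A minimal sketch with two made-up input features shows how the same LinearRegression estimator produces the weights 𝑏₀, 𝑏₁, 𝑏₂:

```python
from sklearn.linear_model import LinearRegression
import numpy as np

# Hypothetical data with two input features x1, x2 (values are made up).
X = np.array([[1, 4], [2, 5], [3, 7], [4, 8], [5, 10], [6, 11]])
y = np.array([10, 14, 20, 24, 31, 35])

model = LinearRegression()
model.fit(X, y)

# b1, b2 and b0 from f(x1, x2) = b0 + b1*x1 + b2*x2
print("coefficients b1, b2:", model.coef_)
print("intercept b0:", model.intercept_)
print("prediction for x1=7, x2=13:", model.predict([[7, 13]])[0])
```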

Underfitting and Overfitting

● Underfitting occurs when a model can’t accurately capture the dependencies among the data, usually as a consequence of its own simplicity. It often yields a low 𝑅² with known data and bad generalization capabilities when applied to new data.
● Overfitting happens when a model learns both the data dependencies and the random fluctuations. In other words, a model learns the existing data too well. Complex models, which have many features or terms, are often prone to overfitting. When applied to known data, such models usually yield a high 𝑅². However, they often don’t generalize well and have a significantly lower 𝑅² when used with new data.

Regression Metrics

Regression is a problem where we try to predict a continuous dependent variable using a set of independent variables, for example stock market and weather forecasting or sales prediction. Such problems answer questions like “How much?” or “How many?”
In regression problems, the prediction error is used to define the model performance. The prediction error is also referred to as the residuals, and it is defined as the difference between the actual and predicted values.

The regression model tries to fit a line that produces the smallest difference between predicted and actual (measured) values.

Residuals are important when determining the quality of a model. You can examine residuals in terms of their magnitude and/or whether they form a pattern.

Where the residuals are all 0, the model predicts perfectly. The further the residuals are from 0, the less accurate the model is.
Where the average residual is not 0, it implies that the model is systematically biased (i.e., consistently over- or under-predicting).
Where the residuals contain patterns, it implies that the model is qualitatively wrong, as it is failing to explain some properties of the data.

Residual = actual value − predicted value
error (e) = y − ŷ

We can calculate the residual for every point in our data set, and each of
these residuals will be of use in assessment.

Month      Month No   Inflation (%)   Predicted (%)   Residual (%)
January    1          0.6             1.9             -1.3
February   2          0.5             1.9             -1.4
March      3          1.5             2.2             -0.7
April      4          2.1             2.0              0.1
May        5          2.2             2.0              0.2
June       6          1.9             2.5             -0.6

Residual = Inflation − Predicted
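Computing these residuals in code is straightforward; the short sketch below reproduces the table above with pandas:

```python
import pandas as pd

# The inflation example from the table above.
df = pd.DataFrame({
    "Month": ["January", "February", "March", "April", "May", "June"],
    "Inflation": [0.6, 0.5, 1.5, 2.1, 2.2, 1.9],
    "Predicted": [1.9, 1.9, 2.2, 2.0, 2.0, 2.5],
})

# Residual = actual - predicted, computed for every observation.
df["Residual"] = df["Inflation"] - df["Predicted"]
print(df)
```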

We can technically inspect all residuals to judge the model’s accuracy,
but this does not scale if we have thousands or millions of data points.
That’s why we have summary measurements that take our collection of
residuals and condense them into a single value representing our
model's predictive ability.

Below are some popular metrics for regression models.

Best Fit Line:

The linear regression model finds the best-fit line by minimizing the squared vertical differences between the actual data points and the corresponding predictions on the line.

where,
SST = total sum of squares
SSE = error (residual) sum of squares
SSR = regression sum of squares
(Note that in this decomposition SSR denotes the regression sum of squares, not the sum of squared residuals used earlier.)

Mean Absolute Error (MAE):
It is the average of the absolute differences between the actual values and the model’s predicted values:

MAE = (1/N) Σᵢ |Yᵢ − Ŷᵢ|

where,
N = total number of data points
Yᵢ = actual value
Ŷᵢ = predicted value
If we don’t take the absolute values, the negative differences cancel out the positive differences and the sum ends up close to zero.

A small MAE suggests the model is great at prediction, while a large MAE suggests that your model may have trouble in certain areas. An MAE of 0 means that your model is a perfect predictor of the outputs.

The mean absolute error (MAE) has the same unit as the original data,
and it can only be compared between models whose errors are
measured in the same units.
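In sklearn, MAE is available as mean_absolute_error; the short sketch below reuses the inflation figures from the residual table:

```python
from sklearn.metrics import mean_absolute_error

# Actual vs. predicted inflation from the table above.
y_true = [0.6, 0.5, 1.5, 2.1, 2.2, 1.9]
y_pred = [1.9, 1.9, 2.2, 2.0, 2.0, 2.5]

mae = mean_absolute_error(y_true, y_pred)
print("MAE:", mae)   # average of |actual - predicted|, in percentage points here
```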

The bigger the MAE, the larger the typical error. Because it takes absolute values rather than squares, MAE is relatively robust to outliers: a single big error does not overpower many small ones, so it gives a fairly balanced picture of how the model is performing. The flip side is that MAE does not especially punish the bigger error terms.

MAE is not differentiable at zero, which complicates optimization with plain gradient descent; in practice, sub-gradient methods or a smooth surrogate loss are used when optimizing it directly.

Mean Squared Error (MSE):

It is the average of the squared differences between the actual and the predicted values. The lower the value, the better the regression model.

MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²

where,
n = total number of data points
yᵢ = actual value
ŷᵢ = predicted value
Its unit is the square of the variable’s unit.

If the dataset contains outliers, squaring penalizes them the most and the calculated MSE becomes much larger. In short, MSE is not robust to outliers, which was an advantage of MAE.

MSE uses the square operation to remove the sign of each error value
and to punish large errors.
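A corresponding sketch with sklearn's mean_squared_error, again reusing the inflation example:

```python
from sklearn.metrics import mean_squared_error

y_true = [0.6, 0.5, 1.5, 2.1, 2.2, 1.9]
y_pred = [1.9, 1.9, 2.2, 2.0, 2.0, 2.5]

mse = mean_squared_error(y_true, y_pred)
print("MSE:", mse)   # units are the square of the original unit
```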

As we take the square of the error, the effect of larger errors becomes more pronounced than that of smaller errors, so the model focuses more on the larger errors.
A drawback is that a single very bad prediction is squared into an even bigger error, which can skew the metric towards overestimating the model’s badness.
On the other hand, if all the errors are small, or rather smaller than 1, squaring shrinks them and we may underestimate the model’s badness.

Root Mean Squared Error (RMSE):

It is the square root of the average squared difference between the real values and the predicted values; taking the square root of MSE gives the Root Mean Square Error.

RMSE = √( (1/n) Σⱼ (yⱼ − ŷⱼ)² )

We want the value of RMSE to be as low as possible: the lower the RMSE, the better the model is with its predictions. A higher RMSE indicates that there are large deviations between the predicted and actual values.

where,
n = total number of data points
yⱼ = actual value
ŷⱼ = predicted value
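One simple way to compute RMSE is to take the square root of sklearn's mean_squared_error, as in this short sketch:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = [0.6, 0.5, 1.5, 2.1, 2.2, 1.9]
y_pred = [1.9, 1.9, 2.2, 2.0, 2.0, 2.5]

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print("RMSE:", rmse)   # back in the same unit as the target
```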

Max Error:

While RMSE is the most common metric, it can be hard to interpret. One alternative is to look at quantiles of the distribution of the absolute percentage errors. The max-error metric is the worst-case error between the predicted value and the true value.
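sklearn provides this as the max_error metric; a one-line check on the inflation example:

```python
from sklearn.metrics import max_error

y_true = [0.6, 0.5, 1.5, 2.1, 2.2, 1.9]
y_pred = [1.9, 1.9, 2.2, 2.0, 2.0, 2.5]

print("Max error:", max_error(y_true, y_pred))  # worst-case |actual - predicted|
```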

R² score, the coefficient of determination:

R-squared explains to what extent the variance of one variable explains the variance of a second variable. In other words, it measures the proportion of the variance of the dependent variable that is explained by the independent variables.

R-squared is a popular metric for assessing model accuracy. It tells how close the data points are to the fitted line generated by a regression algorithm. A larger R-squared value indicates a better fit. This helps us quantify the relationship between the independent variables and the dependent variable.

The R² score is at most 1. The closer R² is to 1, the better the regression model. If R² is equal to 0, the model is performing no better than always predicting the mean, and if R² is negative, the model performs even worse than that baseline.

It is the ratio of the regression sum of squares to the total sum of squares. Since SST = SSR + SSE, we have SSR = SST − SSE, and

R² score = SSR / SST = (SST − SSE) / SST = 1 − SSE / SST

● When SSE = 0, R² score = 1 (best-case scenario)
● When SSE = SST, R² score = 0 (worst-case scenario)

where SSE = Σᵢ (yᵢ − ŷᵢ)² is the sum of the squared differences between the actual values and the predicted values, and SST = Σᵢ (yᵢ − ȳ)² is the total sum of the squared differences between the actual values and the mean of the actual values.

Here, yᵢ is the observed target value, ŷᵢ is the predicted value, ȳ is the mean of the actual values, and the sums run over all n observations.
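The sketch below computes R² both manually from SSE and SST and with sklearn's r2_score, reusing the inflation figures from earlier (which happen to give a negative R², since those predictions are worse than simply predicting the mean):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([0.6, 0.5, 1.5, 2.1, 2.2, 1.9])
y_pred = np.array([1.9, 1.9, 2.2, 2.0, 2.0, 2.5])

# Manual computation: R^2 = 1 - SSE/SST
sse = np.sum((y_true - y_pred) ** 2)
sst = np.sum((y_true - y_true.mean()) ** 2)
print("R^2 (manual): ", 1 - sse / sst)
print("R^2 (sklearn):", r2_score(y_true, y_pred))  # negative: worse than the mean baseline
```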

When we add new features to our data, the R² score increases or stays constant but never decreases, because adding a variable can never reduce the variance the model explains on the training data.

The problem is that when we add an irrelevant feature to the dataset, R² can still increase, which is misleading.

R² describes the proportion of variance of the dependent variable explained by the regression model. If the regression model is “perfect”, SSE is zero, and R² is 1. If the regression model is a total failure, SSE is equal to SST, no variance is explained by the regression, and R² is zero.

Adjusted R-Square:

Adjusted R² is the same as standard R² except that it penalizes models when additional features are added.

To counter the problem faced by R-squared, adjusted R-squared penalizes adding more independent variables that don’t increase the explanatory power of the regression model.

The value of adjusted R-squared is always less than or equal to the value of R-squared. The closer the value is to 1, the better the model.

It measures the variation explained by only the independent variables that actually affect the dependent variable.

Adjusted R² = 1 − (1 − R²) · (n − 1) / (n − k − 1)

where,
n is the number of data points
k is the number of independent variables in your model
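To my knowledge sklearn does not provide adjusted R² directly, so a small helper can compute it from r2_score using the formula above; the data values here are made up for illustration:

```python
from sklearn.metrics import r2_score

def adjusted_r2(y_true, y_pred, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1),
    where n is the number of observations and k the number of predictors."""
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical values: 6 observations, 2 predictors used by the model.
y_true = [10, 14, 20, 24, 31, 35]
y_pred = [11, 15, 19, 25, 30, 34]
print("Adjusted R^2:", adjusted_r2(y_true, y_pred, k=2))
```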
General Protocol for a Regression Model - using sklearn

● An ML model will not accept any null values.
● It will not accept data types other than int or float.
● x (the independent variables) has to be a DataFrame, a 2D NumPy array, or a 2D list.
● y (the dependent variable) has to be a Series, a 1D NumPy array, or a 1D list.
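A minimal end-to-end sketch following this protocol, with a small made-up salary dataset (the column names are illustrative, not from the original notes):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error

# Hypothetical dataset following the protocol above: no nulls, numeric dtypes,
# X as a DataFrame (2D) and y as a Series (1D).
data = pd.DataFrame({
    "experience": [1, 3, 5, 7, 9, 11, 13, 15],
    "education_level": [1, 2, 2, 3, 3, 4, 4, 5],
    "salary": [30, 38, 45, 55, 60, 72, 78, 90],
})

X = data[["experience", "education_level"]]   # 2D: DataFrame
y = data["salary"]                            # 1D: Series

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("R^2:", r2_score(y_test, y_pred))
print("MAE:", mean_absolute_error(y_test, y_pred))
```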

What Is Clustering?

Clustering is used to group data into segments. Similar data points are clustered together using a distance or similarity measure such as Euclidean distance, Manhattan distance, cosine similarity, or Pearson correlation. We use an unsupervised clustering algorithm when we cannot train our model with labeled data.

● Clustering problems fall into the domain of unsupervised learning.
● Clustering identifies similarities between objects and groups them according to the characteristics they have in common, which also differentiate them from other groups of objects. These groups are known as "clusters".
● Clusters are collections of data points that have similar values or attributes, and clustering algorithms are the methods that group similar data points into different clusters based on those values or attributes.
● Since clustering belongs to unsupervised learning, this type of algorithm only has one set of (unlabeled) input data, about which we must obtain information without knowing in advance what the output will be.
● There is no need to split the data into training and testing datasets.

Examples:
An important real-life problem, marketing a product or service to a specific target audience, can be readily addressed with the help of a form of unsupervised learning known as clustering.

Why Clustering?

Organizing data into clusters helps identify the data’s underlying structure and finds applications across industries. For example, clustering could be used to classify diseases in the field of medical science and can also be used in customer classification in marketing research.

What Are The Two Main Types Of Clustering?

There are mainly two types of clustering algorithms:

● Centroid-based clustering: used when you know the number of clusters upfront. The number of clusters is fixed beforehand and the data is partitioned into that many groups. Each group is represented by its center point, called a centroid, and data points are assigned to a group based on how close they are to its centroid. Algorithms include K-Means.
● Hierarchical clustering: used when you want the machine to find the right number of clusters. Each data item starts as its own cluster, and clusters are then merged recursively based on their distance until an optimum set of clusters is obtained. Algorithms include agglomerative clustering.

There are also other types of clustering algorithms, such as distribution-based clustering, which uses an underlying probability distribution to group data into clusters.

KMeans Clustering

K-Means is an unsupervised clustering algorithm that is used to group data into K clusters, where K is the number of clusters. The algorithm is simple:
● The K-means algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters) where each data point belongs to only one group.
● It tries to make the intra-cluster data points as similar as possible while also keeping the clusters as different (far apart) as possible. It assigns data points to a cluster such that the sum of the squared distances between the data points and the cluster’s centroid (the arithmetic mean of all the data points that belong to that cluster) is at a minimum. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.
● Since clustering algorithms such as K-Means use distance-based measurements to determine the similarity between data points, it is recommended to standardize or scale the data, because the features in a dataset almost always have different units of measurement, for instance age vs. income.

Repeat the two steps below until the clusters and their means are stable:
➔ For each data item, assign it to the nearest cluster center. The nearest center can be found using any of the distance measures above.
➔ Recalculate the mean of each cluster from all the data items assigned to it.
Once the clusters and their means are stable, all data items are grouped into their relevant clusters.

Steps in KMeans algorithm

1. Specify the number of clusters, K.
2. Initialize the centroids by first shuffling the dataset and then randomly selecting K data points to be the centroids, without replacement.
3. Compute the squared distance between each data point and each cluster centroid.
4. Assign each data point to the closest cluster (centroid) based on the smallest distance.
5. Recompute the centroids of the clusters by taking the average of all the data points that belong to each cluster.
6. Repeat steps 3, 4, and 5 until the cluster centroids no longer change.
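A minimal sklearn sketch of K-Means on made-up age/income data, with scaling applied as recommended above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical age vs. income data (made up), scaled because the two
# features have very different units of measurement.
X = np.array([
    [25, 30000], [27, 32000], [29, 31000],   # younger, lower income
    [45, 80000], [47, 82000], [50, 85000],   # older, higher income
])

X_scaled = StandardScaler().fit_transform(X)

# K-Means with K = 2 clusters; n_init and random_state make the run reproducible.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

print("cluster labels:", labels)
print("cluster centroids (scaled space):", kmeans.cluster_centers_)
```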

How to determine the optimal value of K?

The Elbow Method is one of the most popular methods to determine this optimal value of K. To understand the Elbow Method, we first need to understand WCSS, also called inertia (WCSS = Within-Cluster Sum of Squares).

Working of the Elbow Method:
● It is a curve of K (the number of clusters, on the x-axis) against the WCSS (on the y-axis).
● For each value of K, we calculate the WCSS (Within-Cluster Sum of Squares), the sum of squared distances between each point and the centroid of its cluster. When we plot the WCSS against K, the plot looks like an elbow. As the number of clusters increases, the WCSS decreases; it is largest when K = 1. Analyzing the graph, we can see that the rate of decrease changes sharply at one point, creating the elbow shape; that point is a good choice for K.
● A fitted KMeans model exposes an attribute inertia_, which holds the WCSS for that particular value of K.
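A short sketch of the elbow method, fitting KMeans for several values of K on synthetic data and plotting inertia_ (the WCSS):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Made-up 2D data with a few loose groups, just to illustrate the curve.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in [0, 5, 10]])

wcss = []
k_values = range(1, 10)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)   # inertia_ is the WCSS for this value of K

plt.plot(list(k_values), wcss, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("WCSS (inertia_)")
plt.title("Elbow method")
plt.show()
```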

Are There Any Limitations Of K-Means Algorithms?

There are, however, limitations of the K-Means algorithm:

● The K-Means algorithm does not work well with missing data.
● It uses a random seed to initialize the clusters, which makes the results non-deterministic. We can, however, supply our own random seed.
● It can become slow on large datasets.
● It does not work well with categorical (textual) data.

Unsupervised clustering algorithms can help us identify groups within our data. These groups can then help us plan better and make calculated decisions. K-Means is a simple yet powerful algorithm, and it also has great potential for finding anomalies and outliers in our data.

