Applying Machine Learning Algorithms with Scikit-learn (Sklearn) - Notes
Topics Covered:
● What Is Regression?
● Why Do We Need Regression?
● Linear Regression
● Regression Metrics
● What is Clustering?
● KMeans Clustering
What Is Regression?
Regression searches for relationships among two or more features. Following the assumption that at least one of the features depends on the others, you try to establish a relation among them.
In other words, you need to find a function that maps some features or
variables to others sufficiently well.
It’s a common practice to denote the outputs with 𝑦 and the inputs with
𝑥. If there are two or more independent variables, then they can be
represented as the vector 𝐱 = (𝑥₁, …, 𝑥ᵣ), where 𝑟 is the number of inputs.
Linear Regression
For each observation 𝑖 = 1, …, 𝑛, the difference 𝑦ᵢ - 𝑓(𝐱ᵢ) between the actual and the predicted response is called a residual. Regression is about determining the best predicted weights, that is, the weights corresponding to the smallest residuals.
To get the best weights, you usually minimize the sum of squared
residuals (SSR) for all observations 𝑖 = 1, …, 𝑛: SSR = Σᵢ(𝑦ᵢ - 𝑓(𝐱ᵢ))². This
approach is called the method of ordinary least squares.
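As a minimal illustration (with made-up sample data, not from these notes), the SSR for a candidate set of weights can be computed directly with NumPy:

```python
import numpy as np

# Hypothetical sample data: one input feature and a target.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def ssr(w0, w1):
    """Sum of squared residuals for the line f(x) = w0 + w1 * x."""
    predictions = w0 + w1 * x
    residuals = y - predictions
    return np.sum(residuals ** 2)

# Ordinary least squares picks the weights with the smallest SSR.
print(ssr(0.0, 2.0))   # SSR for one candidate set of weights
print(ssr(0.1, 1.95))  # a slightly different candidate
```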
Regression Performance
The value 𝑅² = 1 corresponds to SSR = 0. That's the perfect fit, since the predicted and actual responses match exactly.
Simple Linear Regression
It has an equation of the form:
y = mx + c
Where:
x = independent variable / input feature / input attribute / input column(s)
y = dependent variable / output feature / target attribute / output column
m = slope or coefficient or weight: how much we expect y to change as x changes
c = intercept / constant / bias
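A minimal scikit-learn sketch of fitting this line, on hypothetical data; the fitted coef_ plays the role of m and intercept_ the role of c:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: x must be 2D (n_samples, n_features) for sklearn.
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

model = LinearRegression()
model.fit(x, y)

print("m (slope):", model.coef_[0])        # coefficient of x
print("c (intercept):", model.intercept_)  # bias term
print("prediction for x=6:", model.predict([[6.0]])[0])
```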
Multiple Linear Regression
It has an equation of the form:
y = m1x1 + m2x2 + m3x3 + ... + mnxn + c
Where:
x1, x2, x3, ..., xn = independent variables / input features
y = dependent variable / output feature
m1, m2, m3, ..., mn = coefficients/slopes corresponding to x1 through xn
c = intercept / constant / bias
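The same sketch extends to the multiple-feature case, again with made-up data; coef_ now holds one weight per input feature:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data with two input features per sample.
X = np.array([[1.0, 10.0], [2.0, 8.0], [3.0, 13.0], [4.0, 9.0], [5.0, 11.0]])
y = np.array([14.0, 17.0, 25.0, 26.0, 32.0])

model = LinearRegression().fit(X, y)

print("m1..mn:", model.coef_)   # one coefficient per input feature
print("c:", model.intercept_)   # shared intercept
```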
● Underfitting occurs when a model can't accurately capture the dependencies in the data, usually as a consequence of its own simplicity. It often yields a low 𝑅² with known data and bad generalization capabilities when applied to new data.
● Overfitting happens when a model learns both data dependencies and random fluctuations. In other words, a model learns the existing data too well. Complex models, which have many features or terms, are often prone to overfitting. When applied to known data, such models usually yield high 𝑅². However, they often don't generalize well and have significantly lower 𝑅² when used with new data, which you can detect with the train/test comparison sketched below.
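A hedged sketch of that comparison, on a hypothetical synthetic dataset; a large gap between the training and test scores suggests overfitting:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 3 features, with some noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# score() returns R² for a regression estimator.
print("train R²:", model.score(X_train, y_train))
print("test  R²:", model.score(X_test, y_test))
```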
Regression Metrics
The regression model tries to fit a line that produces the smallest difference between predicted and actual (measured) values.
Where the residuals are all 0, the model predicts perfectly. The further residuals are from 0, the less accurate the model is.
Where the average residual is not 0, it implies that the model is systematically biased (i.e., consistently over- or under-predicting).
Where residuals contain patterns, it implies that the model is qualitatively wrong, as it is failing to explain some properties of the data.
We can calculate the residual for every point in our data set, and each of
these residuals will be of use in assessment.
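A small sketch of computing the residual for every point of a fitted model, with hypothetical data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])

model = LinearRegression().fit(X, y)

residuals = y - model.predict(X)           # one residual per data point
print("residuals:", residuals)
print("mean residual:", residuals.mean())  # far from 0 => systematic bias
```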
We can technically inspect all residuals to judge the model’s accuracy,
but this does not scale if we have thousands or millions of data points.
That’s why we have summary measurements that take our collection of
residuals and condense them into a single value representing our
model's predictive ability.
The Linear Regression model finds the best-fit line that minimizes the squared error between the actual data points and their vertical projections onto the predicted line (the predictions).
Where,
SST = Total Sum of Squares
SSE = Error Sum of Squares
SSR = Regression Sum of Squares (not to be confused with the sum of squared residuals abbreviated SSR earlier)
Mean Absolute Error (MAE):
It is the average of the absolute differences between the actual value and the model's predicted value:
MAE = (1/N) Σ |Yi - Ŷi|
where,
N = total number of data points
Yi = actual value
Ŷi = predicted value
If we don’t take the absolute values, then the negative difference will
cancel out the positive difference and we will be left with a zero upon
summation.
The mean absolute error (MAE) has the same unit as the original data,
and it can only be compared between models whose errors are
measured in the same units.
The bigger the MAE, the more significant the error. MAE is robust to outliers: by taking absolute values rather than squares, it keeps a single big error from overpowering a lot of small errors, so the output gives a relatively unbiased picture of how the model is performing. The flip side is that it fails to punish the bigger error terms.
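A quick sketch using scikit-learn's mean_absolute_error, on made-up values:

```python
from sklearn.metrics import mean_absolute_error

# Hypothetical actual and predicted values.
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

# Average of the absolute differences, in the same unit as y.
print("MAE:", mean_absolute_error(y_true, y_pred))
```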
Mean Squared Error (MSE):
It is the average of the squared differences between the actual value and the model's predicted value:
MSE = (1/n) Σ (yi - ŷi)²
where,
n = total number of data points
yi = actual value
ŷi = predicted value
Its unit is the square of the variable’s unit.
If there are outliers in the dataset, MSE penalizes them the most and the calculated MSE becomes larger. In short, it is not robust to outliers, which was an advantage of MAE.
MSE uses the square operation to remove the sign of each error value
and to punish large errors.
As we take the square of the error, the effect of larger errors becomes more pronounced than that of smaller ones, so the model can now focus more on the larger errors.
The main reason this is not that useful is that if we make a single very
bad prediction, the squaring will make the error even worse and it may
skew the metric towards overestimating the model’s badness.
On the other hand, if all the errors are small, or rather, smaller than 1,
then we may underestimate the model’s badness.
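A matching sketch with scikit-learn's mean_squared_error, on the same made-up values:

```python
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted values.
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

# Average of the squared differences; unit is the square of y's unit.
print("MSE:", mean_squared_error(y_true, y_pred))
```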
Root Mean Squared Error (RMSE):
It is the square root of the MSE, which brings the metric back to the same unit as the target variable:
RMSE = √((1/n) Σ (yj - ŷj)²)
where,
n = total number of data points
yj = actual value
ŷj = predicted value
Max Error:
It is the maximum absolute difference between the actual and predicted values, capturing the worst-case error of the model:
Max Error = max |yj - ŷj|
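A short sketch of both metrics; here RMSE is computed as the square root of the MSE, and max_error comes from sklearn.metrics:

```python
import numpy as np
from sklearn.metrics import max_error, mean_squared_error

# Hypothetical actual and predicted values.
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # back in y's unit
print("RMSE:", rmse)
print("Max Error:", max_error(y_true, y_pred))      # worst single error
```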
R² score, the coefficient of determination:
R² score is at most 1. The closer to 1 the R², the better the regression model is. If R² is equal to 0, the model is not performing better than a model that always predicts the mean. If R² is negative, the regression model is performing worse than that baseline.
It is the ratio of the regression sum of squares to the total sum of squares.
Where,
SST = SSE + SSR
SSR = SST - SSE
R² score = SSR/SST = (SST - SSE)/SST = 1 - SSE/SST
● When SSE = 0, R² score = 1 (best-case scenario)
● When SSE = SST, R² score = 0 (worst-case scenario)
where SSE is the sum of the squared differences between the actual value and the predicted value, SSE = Σ (yi - ŷi)²,
and SST is the total sum of the squared differences between the actual value and the mean of the actual values, SST = Σ (yi - ȳ)².
Here, yi is the observed target value, ŷi is the predicted value, ȳ is the mean of the actual values, and m represents the total number of observations (the sums run over i = 1, …, m).
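A sketch with scikit-learn's r2_score, on made-up values:

```python
from sklearn.metrics import r2_score

# Hypothetical actual and predicted values.
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

# 1 - SSE/SST; 1.0 is a perfect fit, 0.0 matches predicting the mean.
print("R²:", r2_score(y_true, y_pred))
```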
Adjusted R-Square:
The value of adjusted R-square is always less than or equal to the value of R-square:
Adjusted R² = 1 - (1 - R²) · (n - 1) / (n - k - 1)
Where,
n is the number of data points
k is the number of independent variables in your model
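scikit-learn has no built-in adjusted R², but it is easy to derive from r2_score; a sketch with a hypothetical 2-feature model:

```python
from sklearn.metrics import r2_score

def adjusted_r2(y_true, y_pred, k):
    """Adjusted R² for a model with k independent variables."""
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical values from a 2-feature model.
y_true = [3.0, 5.0, 2.5, 7.0, 6.0]
y_pred = [2.8, 5.1, 3.0, 6.5, 6.2]
print("Adjusted R²:", adjusted_r2(y_true, y_pred, k=2))
```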
General Protocol for a Regression Model - using sklearn
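A minimal end-to-end sketch of such a protocol, assuming a generic feature matrix X and target y (synthetic here):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# 1. Load or build the data (hypothetical synthetic data here).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.3, size=200)

# 2. Split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Fit the model on the training set.
model = LinearRegression()
model.fit(X_train, y_train)

# 4. Predict on the test set.
y_pred = model.predict(X_test)

# 5. Evaluate with the regression metrics covered above.
print("MAE :", mean_absolute_error(y_test, y_pred))
print("MSE :", mean_squared_error(y_test, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
print("R²  :", r2_score(y_test, y_pred))
```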
What Is Clustering?
Clustering is a form of unsupervised learning that groups data points so that points in the same group (cluster) are more similar to each other than to points in other groups.
Example:
An important real-life problem of marketing a product or service to a
specific target audience can be easily resolved with the help of a form of
unsupervised learning known as Clustering.
Why Clustering?
What Are The Two Main Types Of Clustering?
KMeans Clustering
● The smaller the variation within a cluster, the more homogeneous (similar) the data points are within the same cluster.
● Since clustering algorithms such as KMeans use distance-based measurements to determine the similarity between data points, it's recommended to standardize or scale the data, since the features in a dataset almost always have different units of measurement (for instance, age vs. income).
Repeat the two steps below until the clusters and their means are stable:
➔ For each data item, assign it to the nearest cluster center. The nearest center can be determined with a distance measure such as Euclidean distance.
➔ Calculate the mean of each cluster over all of its data items.
Once the clusters and their means are stable, all data items are grouped into their final clusters.
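A minimal sketch of fitting KMeans on scaled data, with hypothetical features; the choices of n_clusters, n_init, and random_state are illustrative, not prescribed by these notes:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical data with features in different units (e.g., age, income).
X = np.array([[25, 40000], [30, 42000], [45, 90000],
              [50, 95000], [23, 38000], [48, 88000]], dtype=float)

# Scale first, because KMeans relies on distances.
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=2, random_state=0, n_init=10)
labels = kmeans.fit_predict(X_scaled)

print("cluster labels:", labels)                     # cluster index per item
print("cluster centers:", kmeans.cluster_centers_)   # in scaled units
```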
Working of Elbow Method:
● It is a curve of K (the number of clusters, on the x-axis) against the WCSS (on the y-axis).
● For each value of K, we calculate the WCSS (Within-Cluster Sum of Squares). WCSS is the sum of the squared distances between each point and the centroid of its cluster. When we plot the WCSS against the K value, the plot looks like an elbow. As the number of clusters increases, the WCSS value decreases; it is largest when K = 1. When we analyze the graph, we can see that it changes rapidly at a point, creating an elbow shape; the value of K at that elbow is a good choice for the number of clusters.
● A fitted KMeans model exposes an inertia_ attribute, which holds the WCSS for that particular value of K (see the sketch below).
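A sketch of the elbow computation using inertia_, on the same kind of hypothetical scaled data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical data, scaled as recommended above.
X = np.array([[25, 40000], [30, 42000], [45, 90000],
              [50, 95000], [23, 38000], [48, 88000]], dtype=float)
X_scaled = StandardScaler().fit_transform(X)

# Compute the WCSS (inertia_) for a range of K values.
wcss = []
for k in range(1, 6):
    kmeans = KMeans(n_clusters=k, random_state=0, n_init=10)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)   # WCSS for this K

# Plotting K against wcss (e.g., with matplotlib) reveals the elbow.
for k, w in zip(range(1, 6), wcss):
    print(f"K={k}: WCSS={w:.2f}")
```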
● The K-Means algorithm does not work well with missing data.
● It uses a random seed to generate the initial clusters, which makes the results non-deterministic. We can, however, supply our own random seed (for example, through the random_state parameter).
● It can get slower as the number of data items grows.
● It does not work well with categorical (textual) data.