Math YHP

Linear Regression
1 Introduction
In this project, my aim is to understand and apply simple linear regression to a particular
dataset in order to find its trendline. I will use a standard statistical error function, the Mean
Squared Error (MSE), together with a regression algorithm, gradient descent. I will also use
simple statistical terms such as the mean and the bias of a method to understand the error
function. To start this project, I will study the concepts listed above, and once I reach a
level where I can freely use them in my work, I will begin the coding part of the project
in the Python programming language.
2 Mathematical Content
For this project, I needed to learn the basics of statistics, the basics of calculus such as
derivatives, and linear algebra, in order to understand the derivation of the mathematical
equations and to be able to derive them myself. For statistics, I first began by learning terms like
mean, bias, and error. I learned that the mean is the average of a collection of numbers,
found by summing them all and dividing by how many there are:

$$\bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i \quad\text{for } x = \{x_1, x_2, \dots, x_m\},$$

where the mean of the set is denoted by $\bar{x}$. After learning
what the mean is, I learned what bias is and how to find the bias between predictions
and real data points. Bias is the difference between the estimated and the real value,
$b = y - \hat{y}$, in which $\hat{y}$ denotes the estimated value. To find the bias of a method,
we take the mean of the bias values over a dataset:

$$B = \frac{1}{m}\sum_{i=1}^{m} b_i, \qquad b_i = y_i - \hat{y}_i.$$

Understanding bias
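These definitions can be checked with a short numpy snippet; the data values here are made up for illustration:

```python
import numpy as np

# Made-up data: actual values y and predictions y_hat
y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 6.0, 9.5])

# Mean: sum of the values divided by how many there are
mean_y = y.sum() / y.size          # same as np.mean(y)

# Per-example bias b_i = y_i - y_hat_i; the bias of the method
# is the mean of those differences
b = y - y_hat
B = b.mean()
```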
helped me get a better intuition for error functions such as MSE. MSE (Mean Squared
Error) is a function that measures the error of the prediction function with respect to the
real data points:

$$MSE = \frac{1}{m}\sum_{i=1}^{m} (\hat{y}_i - y_i)^2.$$

This equation is similar to the bias of a method, but the key difference is that we square the
differences between $\hat{y}$ and $y$. We square because we don't want biases like $4$ and $-4$ to
eliminate each other: squaring makes every term non-negative, so opposite-sign errors cannot
cancel in the final result. The result of this function is called the "error", and it tells us the
accuracy of our prediction function. For uni-variate functions, I am now able to determine the
error of the prediction function $h(x) = w_0 + w_1 x$ using MSE. Plotted against the two weights
$w_0$ and $w_1$ of the prediction function, the MSE forms a surface.
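The effect of squaring can be checked directly in numpy (again with made-up data):

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 6.0, 9.5])

# MSE = (1/m) * sum((y_hat - y)^2): squaring keeps opposite-sign
# biases (e.g. +0.5 and -0.5) from cancelling each other out
mse = np.mean((y_hat - y) ** 2)
```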
Our aim is to minimize the error of our prediction; in other words, to obtain the most
accurate result possible without depending on the initial starting coordinates. We can achieve
this by moving from the initial coordinate to a minimum of the function. In general an error
surface can have several local minima, so different starting points can lead to different answers;
for linear regression, however, the MSE surface is convex and has a single global minimum.
To determine the most accurate coefficients, or weights, for the prediction function $h(x)$,
there is an algorithm called Gradient Descent, which is:
$$w_j = w_j - \alpha\,\frac{d}{dw_j}\,MSE \qquad (1)$$
In plain English, the equation updates each parameter of the prediction function by the
slope of the tangent line at the current coordinate, multiplied by a learning step $\alpha$.
Starting with $\alpha$, this modifier controls the speed, so to say, of the process. If we
choose $\alpha$ too small, then each change in $w_j$ is tiny, so the optimization takes many
iterations. If we choose $\alpha$ too big, then the updates overshoot the minimum and fail
to converge, and we would never achieve our aim. As for the derivative, it is used to measure
how close the current coordinate is to the local minimum. To build intuition, let's consider a
hill, and let's represent this hill as the graph of the function $y = x^2$:
Now, let's consider the derivative, that is, the tangent line of the function at the coordinate
$(-2, 4)$:

Figure 3: The tangent line for the x-coordinate $-2$

The slope of this tangent line is $-4$, and if we imagine ourselves at this point on the
hill, we would feel that it is quite steep. But as we move towards the bottom, the ground
becomes less and less steep, eventually reaching a slope of almost $0$. The tangent line at
the coordinate $(-0.5, 0.25)$ of the same function has slope $-1$:
Now it is clear that the slope of the tangent line at a point close to the minimum is
less steep than at a point far from it, as these graphs show. Hence, as the current weight
value gets closer to the minimum, the change in MSE ($\Delta MSE$) per step becomes smaller,
since the slope of the tangent line decreases towards the minimum. Since we want the most
accurate weights, the minimum point of the MSE is what we want to reach, which can be
notated as:
$$\underset{w_j}{\text{minimize}}\; J(w_0, w_1, \dots, w_n)$$

in which $J(w_0, w_1, \dots, w_n)$ is the MSE function, and its inputs are the $n+1$ weight
parameters of the prediction function. Overall, the process of gradient descent
could be visualized as a ball rolling down the error surface toward a minimum. And this is
what Gradient Descent is aiming for: it optimizes by following the instantaneous derivatives,
that is, the slopes of the tangent lines.
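The shrinking-step behaviour can be seen in a tiny one-dimensional demonstration: gradient descent on $y = x^2$, starting from $x = -2$ (the learning rate $\alpha = 0.1$ is an arbitrary choice for illustration):

```python
# Gradient descent on y = x^2, whose derivative is 2x.
# Starting from x = -2, each step is alpha times the local slope,
# so the steps shrink as x approaches the minimum at 0.
alpha = 0.1
x = -2.0
steps = []
for _ in range(50):
    slope = 2 * x          # dy/dx of x^2 at the current point
    step = alpha * slope
    x = x - step
    steps.append(abs(step))
```

Each recorded step is smaller than the one before it, exactly because the tangent slope flattens out near the minimum.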
For prediction functions with multiple inputs ($h(x_0, x_1, \dots, x_n) = w_0x_0 + w_1x_1 + \dots + w_nx_n$),
we can't use ordinary derivatives, since there are multiple variables to consider. That's where
partial derivatives come in handy. We use partial derivatives to deal with functions involving
multiple variables. Just like ordinary derivatives, partial derivatives give us the slope at an
instantaneous point, but in $N$-dimensional space. In fact, a partial derivative treats all
variables except the respective one as constants, whose derivative is $0$ by the constant
rule: $\frac{d}{dx}k = 0$. Since the idea of Gradient Descent is to update each weight $w_j$
according to its own input $x_j$ (the terms with the other weight-input pairs differentiate
to $0$ by the constant rule), we can think of a partial derivative as an operation that slices
an $N$-dimensional space down to a $2$-dimensional plane along one input, and then takes
the ordinary derivative there.
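As a numeric illustration, consider the made-up function $f(w_0, w_1) = w_0^2 + 3w_1$: its partial derivative with respect to $w_0$ treats $3w_1$ as a constant, so it equals $2w_0$, and a finite-difference estimate agrees:

```python
def f(w0, w1):
    # An arbitrary example function of two variables
    return w0 ** 2 + 3 * w1

# Finite-difference estimate of df/dw0 at (w0, w1) = (1.5, 7.0):
# hold w1 fixed and nudge only w0
h = 1e-6
w0, w1 = 1.5, 7.0
approx = (f(w0 + h, w1) - f(w0 - h, w1)) / (2 * h)

# Analytically, d/dw0 (w0^2 + 3*w1) = 2*w0, because 3*w1 is constant
exact = 2 * w0
```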
So, with partial derivatives the equation looks like:

$$w_j = w_j - \alpha\,\frac{\partial}{\partial w_j}\,MSE \qquad (2)$$

Notice that the derivative sign became $\partial$, indicating a partial derivative. If we
expand this equation:

$$w_j = w_j - \alpha\,\frac{\partial}{\partial w_j}\,\frac{1}{m}\sum_{i=1}^{m}(\hat{y} - y)^2 \qquad (3)$$

$$w_j = w_j - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\frac{\partial}{\partial w_j}(\hat{y} - y)^2 \qquad (4)$$

$$\text{let } f(u) = u^2, \quad \text{let } g(W) = \hat{y} - y = u \qquad (5)$$

$$w_j = w_j - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\frac{\partial}{\partial w_j}(f \circ g)(W) \qquad (6)$$

By the chain rule, $\frac{\partial}{\partial w_j}(f \circ g)(W) = f'(g(W))\,\frac{\partial}{\partial w_j}g(W) = 2u\,\frac{\partial}{\partial w_j}(\hat{y} - y)$, so:

$$w_j = w_j - \alpha\,\frac{1}{m}\sum_{i=1}^{m} 2u\,\frac{\partial}{\partial w_j}(\hat{y} - y) \qquad (7)$$
Substituting the expanded prediction function $w_0x_0 + w_1x_1 + \dots + w_nx_n$ into $\hat{y}$:

$$w_j = w_j - \alpha\,\frac{1}{m}\sum_{i=1}^{m} 2u\,\frac{\partial}{\partial w_j}(w_0x_0 + w_1x_1 + \dots + w_nx_n - y) \qquad (11)$$

$$w_j = w_j - \alpha\,\frac{1}{m}\sum_{i=1}^{m} 2u\left(\frac{\partial}{\partial w_j}w_0x_0 + \frac{\partial}{\partial w_j}w_1x_1 + \dots + \frac{\partial}{\partial w_j}w_nx_n - \frac{\partial}{\partial w_j}y\right) \qquad (12)$$

$$w_j = w_j - \alpha\,\frac{1}{m}\sum_{i=1}^{m} 2u \cdot x_j \qquad (13)$$

$$w_j = w_j - \alpha\,\frac{1}{m}\sum_{i=1}^{m} 2(h(X) - y) \cdot x_j \qquad (14)$$

$$w_j = w_j - 2\alpha\,\frac{1}{m}\sum_{i=1}^{m} (h(X) - y) \cdot x_j \qquad (15)$$
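Equation (15) translates almost directly into a (deliberately unvectorized) Python loop; the tiny dataset and the hyperparameters below are made up for illustration:

```python
import numpy as np

# Tiny made-up dataset: each row is [x0, x1] with x0 = 1 as the bias term
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])   # exactly y = 1 + 2*x1
m, n = X.shape

w = np.zeros(n)
alpha = 0.05
for _ in range(5000):
    # Update each weight w_j by equation (15), one at a time,
    # using the OLD weights for every gradient (simultaneous update)
    new_w = w.copy()
    for j in range(n):
        grad_j = sum((X[i] @ w - y[i]) * X[i, j] for i in range(m))
        new_w[j] = w[j] - 2 * alpha * grad_j / m
    w = new_w
```

Copying into `new_w` keeps the update simultaneous for all weights, which matches the vectorized form introduced below.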
After using the chain rule, the sum rule, and the difference rule of derivatives, the equation
of gradient descent is derived. With this equation, the weights are updated iteratively using a
loop. A more efficient way is to use vectorized operations, and that's where linear algebra
comes in. Using matrices and vectors, let's redefine our variables. We can gather the inputs
into a feature-example matrix:

$$X = \begin{bmatrix} x_{1,0} & x_{1,1} & \cdots & x_{1,n} \\ x_{2,0} & x_{2,1} & \cdots & x_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m,0} & x_{m,1} & \cdots & x_{m,n} \end{bmatrix} \qquad (16)$$

where row $i$ holds the features of the $i$-th training example; the weights and the targets
are likewise collected into a vector $W$ and a vector $y$.
Now, with vectorization, MSE becomes:

$$J(W) = \frac{1}{m}(X \cdot W - y)^T(X \cdot W - y) \qquad (19)$$

This little trick, multiplying the residual vector by its own transpose, simply gives us the
sum of each element's square. In this way we can obtain the cost using matrices and vectors,
without any explicit iteration.
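A quick numpy check (with made-up numbers) confirms that the vectorized cost matches the looped sum of squares:

```python
import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([3.0, 5.0, 7.0])
W = np.array([0.5, 1.5])
m = X.shape[0]

residual = X @ W - y
# Dot product of the residual with itself = sum of squared elements
J_vec = residual @ residual / m

# The same cost written as an explicit loop over examples
J_loop = sum((X[i] @ W - y[i]) ** 2 for i in range(m)) / m
```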
Previously, gradient descent was:

$$w_j = w_j - 2\alpha\,\frac{1}{m}\sum_{i=1}^{m} (h(X) - y) \cdot x_j \qquad (20)$$
which updated each weight one by one. Using vectorization, we can instead perform all the
updates at once:

$$W = W - 2\alpha\,\frac{1}{m}\,X^T(X \cdot W - y) \qquad (21)$$

This updates all weights simultaneously using linear algebra (note that $X^T(X \cdot W - y)$
is a column vector of the same shape as $W$), and it is more efficient than the iterative
update. Now this equation can be used to obtain the most accurate weights for the trendline
of a particular dataset.
In Python, the best library for linear algebra operations is numpy, and I used it for all the
vectorized operations. In addition, I used scikit-learn (sklearn) to generate a dataset for
testing the code, and the graphs were made with the matplotlib library. My code is:
Figure 6: My solution for vectorized gradient descent, using numpy library as np.
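Since the code itself appears only as an image (Figure 6), here is a sketch of what such a vectorized numpy implementation could look like; the function name, learning rate, and iteration count are my assumptions rather than the values used in the original code:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, iterations=3000):
    """Vectorized gradient descent for linear regression.

    X: (m, n) matrix whose first column is all ones (the bias feature).
    y: (m,) vector of targets.
    Returns the learned weights W and the MSE history per iteration.
    """
    m, n = X.shape
    W = np.zeros(n)
    history = []
    for _ in range(iterations):
        residual = X @ W - y
        history.append(residual @ residual / m)   # current MSE, eq. (19)
        W = W - 2 * alpha / m * (X.T @ residual)  # vectorized update, eq. (21)
    return W, history

# Made-up noise-free example: y = 4 + 3*x1
x1 = np.linspace(0, 5, 50)
X = np.column_stack([np.ones_like(x1), x1])
y = 4 + 3 * x1
W, history = gradient_descent(X, y)
```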
For testing and training, I generated 100 training examples and plotted them using
matplotlib library:
Figure 7: Dataset with 100 examples
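The dataset generation step with scikit-learn might look like the following; the noise level and random seed here are guesses, not the original settings:

```python
import numpy as np
from sklearn.datasets import make_regression

# 100 examples with one informative feature, as in Figure 7.
# noise and random_state are illustrative guesses.
x, y = make_regression(n_samples=100, n_features=1, noise=15.0,
                       random_state=0)

# Prepend the bias column of ones so that w0 acts as the y-intercept
X = np.column_stack([np.ones(x.shape[0]), x.ravel()])
```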
After feeding this dataset into my code, the trendline was found, as shown in the resulting
plot.
After training, the optimal parameters gave the trendline $h(X) = 5.6087x_0 + 81.900863x_1$,
in which $x_0 = 1$: it is the bias term, which makes $w_0$ the y-intercept. By convention,
$x_0$ is set to $1$ for every training example so that the vectorized operations work out.
Also, the convergence of the MSE shows that it took almost 2600 iterations to find the
optimal weights for this particular dataset, with the MSE settling at its minimum value of
roughly 300, as can be seen from the convergence graph.
3 Conclusion
In this project, I learned the basics of statistics and linear regression to build a program
that finds the trendline of a particular dataset. The statistics needed were the mean and the
bias or error of a function, together with a regression algorithm to find the optimal weights.
Derivatives and their rules were needed to derive the equations and improve my understanding.
Matrices, vectors, and the operations between them were essential to vectorize the computation
and make it efficient. Finally, the program was written in Python using the numpy module.
The full project I made can be found using this link.
References
[1] "Linear Regression: Hypothesis Function, Cost Function and Gradient Descent. Every-
thing". Medium, 2020, https://medium.datadriveninvestor.com/linear-regression-hypothesis-
function-cost-function-and-gradient-descent-part-1-6cd865552923. Accessed 7 Mar 2021.

[3] "Social Network for Programmers and Developers". Morioh.com, 2021,
https://morioh.com/p/15c995420be6. Accessed 7 Mar 2021.

[4] "Linear Algebra". Khan Academy, 2021, https://www.khanacademy.org/math/linear-
algebra. Accessed 7 Mar 2021.

[6] "Chain Rule (Article)". Khan Academy, 2021, https://www.khanacademy.org/math/ap-
calculus-ab/ab-differentiation-2-new/ab-3-1a/a/chain-rule-review. Accessed 7 Mar 2021.

[7] "18.06 Linear Algebra". MIT OpenCourseWare, 2010,
https://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/. Accessed
7 Mar 2021.
[8] All the other graphs were made by me using the matplotlib library for Python.