Math YHP

Linear Regression
1 Introduction
In this project, my aim is to understand and apply simple linear regression to a particular
dataset in order to find its trendline. I will use a standard statistical error function, the Mean
Squared Error (MSE), together with a regression algorithm, gradient descent. I will also use
simple statistical terms such as the mean and the bias of a method to understand the error
function. To start this project, I will study the concepts listed above, and once I reach a
level where I can freely use them in my work, I will begin the coding part of the project
in the Python programming language.
2 Mathematical Content
For this project, I needed to learn the basics of statistics, the basics of calculus such as
derivatives, and linear algebra, in order to understand the derivation of the mathematical
equations and to be able to derive them myself. For statistics, I first began by learning terms like
mean, bias, and error. I learned that the mean is the average of a collection of numbers,
found by summing them all and dividing by how many there are:

$$\bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i \quad\text{for } x = \{x_1, x_2, \dots, x_m\},$$

where the mean of the set is denoted by $\bar{x}$. After learning
what the mean is, I learned what bias is and how to find the bias between predictions
and real data points. Bias is the difference between the estimated and the real value,
$b = y - \hat{y}$, in which $\hat{y}$ denotes the estimated value. To find the bias of a method,
we take the mean of the bias values over a dataset:

$$B = \frac{1}{m}\sum_{i=1}^{m} b_i, \qquad b_i = y_i - \hat{y}_i.$$

Understanding bias
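These definitions can be checked with a short numpy snippet; the data values here are made up for illustration:

```python
import numpy as np

# Made-up data: actual values y and predictions y_hat
y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 6.0, 9.5])

# Mean: sum of the values divided by how many there are
mean_y = y.sum() / y.size          # same as np.mean(y)

# Per-example bias b_i = y_i - y_hat_i; the bias of the method
# is the mean of those differences
b = y - y_hat
B = b.mean()
```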
helped me get a better intuition for error functions such as MSE. MSE (Mean Squared
Error) is a function that measures the error of the prediction function with respect to the
real data points:

$$MSE = \frac{1}{m}\sum_{i=1}^{m} (\hat{y}_i - y_i)^2.$$

This equation is similar to the bias of a method, but the key difference is that we square the
differences between $\hat{y}$ and $y$. We square because we don't want biases like $4$ and $-4$ to
eliminate each other: squaring makes every term non-negative, so opposite-sign errors cannot
cancel in the final result. The result of this function is called the "error", and it tells us the
accuracy of our prediction function. For uni-variate functions, I am now able to determine the
error of the prediction function $h(x) = w_0 + w_1 x$ using MSE. Plotted against the two weights
$w_0$ and $w_1$ of the prediction function, the MSE forms a surface.
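The effect of squaring can be checked directly in numpy (again with made-up data):

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 6.0, 9.5])

# MSE = (1/m) * sum((y_hat - y)^2): squaring keeps opposite-sign
# biases (e.g. +0.5 and -0.5) from cancelling each other out
mse = np.mean((y_hat - y) ** 2)
```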
Our aim is to minimize the error of our prediction; in other words, to obtain the most
accurate result possible without depending on the initial starting coordinates. We can achieve
this by moving from the initial coordinate to a minimum of the function. In general an error
surface can have several local minima, so different starting points can lead to different answers;
for linear regression, however, the MSE surface is convex and has a single global minimum.
To determine the most accurate coefficients, or weights, for the prediction function $h(x)$,
there is an algorithm called Gradient Descent, which is:
$$w_j = w_j - \alpha\,\frac{d}{dw_j}\,MSE \qquad (1)$$
In plain English, the equation updates each parameter of the prediction function by the
slope of the tangent line at the current coordinate, multiplied by a learning step $\alpha$.
Starting with $\alpha$, this modifier controls the speed, so to say, of the process. If we
choose $\alpha$ too small, then each change in $w_j$ is tiny, so the optimization takes many
iterations. If we choose $\alpha$ too big, then the updates overshoot the minimum and fail
to converge, and we would never achieve our aim. As for the derivative, it is used to measure
how close the current coordinate is to the local minimum. To build intuition, let's consider a
hill, and let's represent this hill as the graph of the function $y = x^2$:
Now, let's consider the derivative, that is, the tangent line of the function at the coordinate
$(-2, 4)$:

Figure 3: The tangent line for the x-coordinate $-2$

The slope of this tangent line is $-4$, and if we imagine ourselves at this point on the
hill, we would feel that it is quite steep. But as we move towards the bottom, the ground
becomes less and less steep, eventually reaching a slope of almost $0$. The tangent line at
the coordinate $(-0.5, 0.25)$ of the same function has slope $-1$:
Now it is clear that the slope of the tangent line at a point close to the minimum is
less steep than at a point far from it, as these graphs show. Hence, as the current weight
value gets closer to the minimum, the change in MSE ($\Delta MSE$) per step becomes smaller,
since the slope of the tangent line decreases towards the minimum. Since we want the most
accurate weights, the minimum point of the MSE is what we want to reach, which can be
notated as:
$$\underset{w_j}{\text{minimize}}\; J(w_0, w_1, \dots, w_n)$$

in which $J(w_0, w_1, \dots, w_n)$ is the MSE function, and its inputs are the $n+1$ weight
parameters of the prediction function. Overall, the process of gradient descent
could be visualized as a ball rolling down the error surface toward a minimum. And this is
what Gradient Descent is aiming for: it optimizes by following the instantaneous derivatives,
that is, the slopes of the tangent lines.
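The shrinking-step behaviour can be seen in a tiny one-dimensional demonstration: gradient descent on $y = x^2$, starting from $x = -2$ (the learning rate $\alpha = 0.1$ is an arbitrary choice for illustration):

```python
# Gradient descent on y = x^2, whose derivative is 2x.
# Starting from x = -2, each step is alpha times the local slope,
# so the steps shrink as x approaches the minimum at 0.
alpha = 0.1
x = -2.0
steps = []
for _ in range(50):
    slope = 2 * x          # dy/dx of x^2 at the current point
    step = alpha * slope
    x = x - step
    steps.append(abs(step))
```

Each recorded step is smaller than the one before it, exactly because the tangent slope flattens out near the minimum.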
For prediction functions with multiple inputs ($h(x_0, x_1, \dots, x_n) = w_0x_0 + w_1x_1 + \dots + w_nx_n$),
we can't use ordinary derivatives, since there are multiple variables to consider. That's where
partial derivatives come in handy. We use partial derivatives to deal with functions involving
multiple variables. Just like ordinary derivatives, partial derivatives give us the slope at an
instantaneous point, but in $N$-dimensional space. In fact, a partial derivative treats all
variables except the respective one as constants, whose derivative is $0$ by the constant
rule: $\frac{d}{dx}k = 0$. Since the idea of Gradient Descent is to update each weight $w_j$
according to its own input $x_j$ (the terms with the other weight-input pairs differentiate
to $0$ by the constant rule), we can think of a partial derivative as an operation that slices
an $N$-dimensional space down to a $2$-dimensional plane along one input, and then takes
the ordinary derivative there.
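As a numeric illustration, consider the made-up function $f(w_0, w_1) = w_0^2 + 3w_1$: its partial derivative with respect to $w_0$ treats $3w_1$ as a constant, so it equals $2w_0$, and a finite-difference estimate agrees:

```python
def f(w0, w1):
    # An arbitrary example function of two variables
    return w0 ** 2 + 3 * w1

# Finite-difference estimate of df/dw0 at (w0, w1) = (1.5, 7.0):
# hold w1 fixed and nudge only w0
h = 1e-6
w0, w1 = 1.5, 7.0
approx = (f(w0 + h, w1) - f(w0 - h, w1)) / (2 * h)

# Analytically, d/dw0 (w0^2 + 3*w1) = 2*w0, because 3*w1 is constant
exact = 2 * w0
```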
So, with partial derivatives the equation looks like:

$$w_j = w_j - \alpha\,\frac{\partial}{\partial w_j}\,MSE \qquad (2)$$

Notice that the derivative sign became $\partial$, indicating a partial derivative. If we
expand this equation:

$$w_j = w_j - \alpha\,\frac{\partial}{\partial w_j}\,\frac{1}{m}\sum_{i=1}^{m}(\hat{y} - y)^2 \qquad (3)$$

$$w_j = w_j - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\frac{\partial}{\partial w_j}(\hat{y} - y)^2 \qquad (4)$$

$$\text{let } f(u) = u^2, \quad \text{let } g(W) = \hat{y} - y = u \qquad (5)$$

$$w_j = w_j - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\frac{\partial}{\partial w_j}(f \circ g)(W) \qquad (6)$$

By the chain rule, $\frac{\partial}{\partial w_j}(f \circ g)(W) = f'(g(W))\,\frac{\partial}{\partial w_j}g(W) = 2u\,\frac{\partial}{\partial w_j}(\hat{y} - y)$, so:

$$w_j = w_j - \alpha\,\frac{1}{m}\sum_{i=1}^{m} 2u\,\frac{\partial}{\partial w_j}(\hat{y} - y) \qquad (7)$$
Substituting the expanded prediction function $w_0x_0 + w_1x_1 + \dots + w_nx_n$ into $\hat{y}$:

$$w_j = w_j - \alpha\,\frac{1}{m}\sum_{i=1}^{m} 2u\,\frac{\partial}{\partial w_j}(w_0x_0 + w_1x_1 + \dots + w_nx_n - y) \qquad (11)$$

$$w_j = w_j - \alpha\,\frac{1}{m}\sum_{i=1}^{m} 2u\left(\frac{\partial}{\partial w_j}w_0x_0 + \frac{\partial}{\partial w_j}w_1x_1 + \dots + \frac{\partial}{\partial w_j}w_nx_n - \frac{\partial}{\partial w_j}y\right) \qquad (12)$$

$$w_j = w_j - \alpha\,\frac{1}{m}\sum_{i=1}^{m} 2u \cdot x_j \qquad (13)$$

$$w_j = w_j - \alpha\,\frac{1}{m}\sum_{i=1}^{m} 2(h(X) - y) \cdot x_j \qquad (14)$$

$$w_j = w_j - 2\alpha\,\frac{1}{m}\sum_{i=1}^{m} (h(X) - y) \cdot x_j \qquad (15)$$
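Equation (15) translates almost directly into a (deliberately unvectorized) Python loop; the tiny dataset and the hyperparameters below are made up for illustration:

```python
import numpy as np

# Tiny made-up dataset: each row is [x0, x1] with x0 = 1 as the bias term
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])   # exactly y = 1 + 2*x1
m, n = X.shape

w = np.zeros(n)
alpha = 0.05
for _ in range(5000):
    # Update each weight w_j by equation (15), one at a time,
    # using the OLD weights for every gradient (simultaneous update)
    new_w = w.copy()
    for j in range(n):
        grad_j = sum((X[i] @ w - y[i]) * X[i, j] for i in range(m))
        new_w[j] = w[j] - 2 * alpha * grad_j / m
    w = new_w
```

Copying into `new_w` keeps the update simultaneous for all weights, which matches the vectorized form introduced below.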
After using the chain rule, the sum rule, and the difference rule of derivatives, the equation
of gradient descent is derived. With this equation, the weights are updated iteratively using a
loop. A more efficient way is to use vectorized operations, and that's where linear algebra
comes in. Using matrices and vectors, let's redefine our variables. We can gather the inputs
into a feature-example matrix:

$$X = \begin{bmatrix} x_{1,0} & x_{1,1} & \cdots & x_{1,n} \\ x_{2,0} & x_{2,1} & \cdots & x_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m,0} & x_{m,1} & \cdots & x_{m,n} \end{bmatrix} \qquad (16)$$

where row $i$ holds the features of the $i$-th training example; the weights and the targets
are likewise collected into a vector $W$ and a vector $y$.
Now, with vectorization, MSE becomes:

$$J(W) = \frac{1}{m}(X \cdot W - y)^T(X \cdot W - y) \qquad (19)$$

This little trick, multiplying the residual vector by its own transpose, simply gives us the
sum of each element's square. In this way we can obtain the cost using matrices and vectors,
without any explicit iteration.
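A quick numpy check (with made-up numbers) confirms that the vectorized cost matches the looped sum of squares:

```python
import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([3.0, 5.0, 7.0])
W = np.array([0.5, 1.5])
m = X.shape[0]

residual = X @ W - y
# Dot product of the residual with itself = sum of squared elements
J_vec = residual @ residual / m

# The same cost written as an explicit loop over examples
J_loop = sum((X[i] @ W - y[i]) ** 2 for i in range(m)) / m
```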
Previously, gradient descent was:

$$w_j = w_j - 2\alpha\,\frac{1}{m}\sum_{i=1}^{m} (h(X) - y) \cdot x_j \qquad (20)$$
which updated each weight one by one. Using vectorization, we can instead perform all the
updates at once:

$$W = W - 2\alpha\,\frac{1}{m}\,X^T(X \cdot W - y) \qquad (21)$$

This updates all weights simultaneously using linear algebra (note that $X^T(X \cdot W - y)$
is a column vector of the same shape as $W$), and it is more efficient than the iterative
update. Now this equation can be used to obtain the most accurate weights for the trendline
of a particular dataset.
In Python, the best library for linear algebra operations is numpy, and I used it for all the
vectorized operations. In addition, I used scikit-learn (sklearn) to generate a dataset for
testing the code, and the graphs were made with the matplotlib library. My code is:
Figure 6: My solution for vectorized gradient descent, using numpy library as np.
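Since the code itself appears only as an image (Figure 6), here is a sketch of what such a vectorized numpy implementation could look like; the function name, learning rate, and iteration count are my assumptions rather than the values used in the original code:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, iterations=3000):
    """Vectorized gradient descent for linear regression.

    X: (m, n) matrix whose first column is all ones (the bias feature).
    y: (m,) vector of targets.
    Returns the learned weights W and the MSE history per iteration.
    """
    m, n = X.shape
    W = np.zeros(n)
    history = []
    for _ in range(iterations):
        residual = X @ W - y
        history.append(residual @ residual / m)   # current MSE, eq. (19)
        W = W - 2 * alpha / m * (X.T @ residual)  # vectorized update, eq. (21)
    return W, history

# Made-up noise-free example: y = 4 + 3*x1
x1 = np.linspace(0, 5, 50)
X = np.column_stack([np.ones_like(x1), x1])
y = 4 + 3 * x1
W, history = gradient_descent(X, y)
```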
For testing and training, I generated 100 training examples and plotted them using
matplotlib library:
Figure 7: Dataset with 100 examples
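The dataset generation step with scikit-learn might look like the following; the noise level and random seed here are guesses, not the original settings:

```python
import numpy as np
from sklearn.datasets import make_regression

# 100 examples with one informative feature, as in Figure 7.
# noise and random_state are illustrative guesses.
x, y = make_regression(n_samples=100, n_features=1, noise=15.0,
                       random_state=0)

# Prepend the bias column of ones so that w0 acts as the y-intercept
X = np.column_stack([np.ones(x.shape[0]), x.ravel()])
```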
After feeding this dataset into my code, the trendline was found, as shown in the resulting
plot.
After training, the optimal parameters gave the trendline $h(X) = 5.6087x_0 + 81.900863x_1$,
in which $x_0 = 1$: it is the bias term, which makes $w_0$ the y-intercept. By convention,
$x_0$ is set to $1$ for every training example so that the vectorized operations work out.
Also, the convergence of the MSE shows that it took almost 2600 iterations to find the
optimal weights for this particular dataset, with the MSE settling at its minimum value of
roughly 300, as can be seen from the convergence graph.
3 Conclusion
In this project, I learned the basics of statistics and linear regression to build a program
that finds the trendline of a particular dataset. The statistics needed were the mean and the
bias or error of a function, together with a regression algorithm to find the optimal weights.
Derivatives and their rules were needed to derive the equations and improve my understanding.
Matrices, vectors, and the operations between them were essential to vectorize the computation
and make it efficient. Finally, the program was written in Python using the numpy module.
The full project I made can be found using this link.
References
[1] "Linear Regression: Hypothesis Function, Cost Function and Gradient Descent. Every-
thing". Medium, 2020, https://medium.datadriveninvestor.com/linear-regression-hypothesis-
function-cost-function-and-gradient-descent-part-1-6cd865552923. Accessed 7 Mar 2021.

[3] "Social Network for Programmers and Developers". Morioh.com, 2021,
https://morioh.com/p/15c995420be6. Accessed 7 Mar 2021.

[4] "Linear Algebra". Khan Academy, 2021, https://www.khanacademy.org/math/linear-
algebra. Accessed 7 Mar 2021.

[6] "Chain Rule (Article)". Khan Academy, 2021, https://www.khanacademy.org/math/ap-
calculus-ab/ab-differentiation-2-new/ab-3-1a/a/chain-rule-review. Accessed 7 Mar 2021.

[7] "18.06 Linear Algebra". MIT OpenCourseWare, 2010,
https://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/. Accessed
7 Mar 2021.
[8] All the other graphs were made by me using the matplotlib library for Python.