Lecture Notes 5: Linear Regression
LINEAR REGRESSION
Regression is a problem in supervised learning where the output is a real number.
Consider a data set D = {(x(1), y(1)), . . . , (x(n), y(n))}.
Goal: We want to find a hypothesis h : Rd → R that agrees with the data set D.
[Figure source: https://towardsdatascience.com/how-to-choose-between-a-linear-or-nonlinear-regression-for-your-dataset-e58a568e2a15]
In linear regression the hypothesis is linear (affine) in the input: h(x; θ, θ0) = θT x + θ0, i.e., h(x; θ, θ0) = θ1 x1 + . . . + θd xd + θ0.
Linear regression is one of the oldest models and dates back to the beginning of the 19th century (Gauss, Legendre).
5.1. SIMPLE LINEAR REGRESSION
For simplicity, we will assume x(i) ∈ R and in that case the hypothesis we would like to learn is of
the form
h(x; w, b) = wx + b.
[Figure source: http://abyss.uoregon.edu/ js/glossary/correlation.html]
Step 1: Check whether a linear relationship between x and y is plausible.
• Find the sample correlation coefficient
      r = Cov(x, y)/(s_x s_y) = Σ (x(i) − x̄)(y(i) − ȳ) / ( √(Σ (x(i) − x̄)²) · √(Σ (y(i) − ȳ)²) ),
  where the sums run over i = 1, . . . , n. Values of r close to ±1 indicate a strong linear relationship.
[Figure source: https://www.mathsisfun.com/data/correlation.html]
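As a quick illustration, here is a minimal NumPy sketch of this computation (the toy arrays x and y below are made-up values for illustration only):

```python
import numpy as np

# toy data (placeholder values, for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

# sample correlation coefficient r = Cov(x, y) / (s_x * s_y)
r = np.sum((x - x.mean()) * (y - y.mean())) / (
    np.sqrt(np.sum((x - x.mean()) ** 2)) * np.sqrt(np.sum((y - y.mean()) ** 2))
)
print(r)                        # close to 1: strong positive linear relationship
print(np.corrcoef(x, y)[0, 1])  # the same value computed by NumPy
```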
Step 2: Find the hypothesis h(x; w, b) = wx + b.
• Given the data set D = {(x(1), y(1)), . . . , (x(n), y(n))}, the goal is to find the parameters w
  and b. We find these parameters by minimizing a loss function, which measures the error we
  make by predicting ŷ(i) = h(x(i)) when the true target value is y(i).
The most common loss functions for regression are:
  – Mean Absolute Error is the mean (or average) of the absolute values of the individual
    errors; it tells us how big an error we can expect on average:
        MAE = (1/n) Σ_{i=1}^{n} |ŷ(i) − y(i)|.
    The main advantage of MAE is that it is in the same units as the target variable. The
    disadvantage is that MAE is not differentiable everywhere (the absolute value is not
    differentiable at 0).
  – Mean Squared Error is the mean of the individual squared errors:
        MSE = (1/n) Σ_{i=1}^{n} (ŷ(i) − y(i))².
    Note that MSE is differentiable and convex; however, its units do not match those of
    the original target. The effect of the square term is most apparent in the presence of
    outliers in the data: while each residual in MAE contributes proportionally to the total
    error, in MSE the error grows quadratically. This ultimately means that outliers in our
    data contribute a much higher total error in MSE than they would in MAE, and the
    model is penalized more for making predictions that differ greatly from the
    corresponding actual values.
    There is also a probabilistic interpretation of linear regression that justifies using MSE
    when the training data are generated from an underlying linear hypothesis with added
    Gaussian-distributed noise with mean 0 (see [1], pages 11-13).
  – Root Mean Squared Error is the square root of the MSE:
        RMSE = √MSE = √( (1/n) Σ_{i=1}^{n} (ŷ(i) − y(i))² ).
We will discuss two ways of finding the optimal values w and b for which this loss has a
minimum value.
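Before turning to these two approaches, here is a minimal sketch of the three losses defined above, assuming NumPy and an arbitrary candidate hypothesis h(x) = wx + b (the toy data and the values of w and b are illustrative only):

```python
import numpy as np

def mae(y_hat, y):
    """Mean Absolute Error: average absolute residual."""
    return np.mean(np.abs(y_hat - y))

def mse(y_hat, y):
    """Mean Squared Error: average squared residual."""
    return np.mean((y_hat - y) ** 2)

def rmse(y_hat, y):
    """Root Mean Squared Error: square root of the MSE."""
    return np.sqrt(mse(y_hat, y))

# toy data and an arbitrary candidate hypothesis h(x) = w*x + b
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
w, b = 2.0, 0.0
y_hat = w * x + b

print(mae(y_hat, y), mse(y_hat, y), rmse(y_hat, y))
```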
1. Analytical Solution
   Take the first derivatives of the loss L (here the MSE) with respect to w and b, set them
   to 0, and solve for w and b:
       w = ( n Σ x(i) y(i) − Σ x(i) Σ y(i) ) / ( n Σ (x(i))² − (Σ x(i))² )
       b = ȳ − w x̄
   For details of this derivation in the case of multiple linear regression (the case d ≥ 1),
   see the APPENDIX at the end of these notes. A short numerical sketch comparing this
   closed-form solution with gradient descent follows the gradient-descent discussion below.
2. Gradient Descent
   We initialize w and b and we use the update rule
       w_new = w − α ∂L/∂w (w, b; D)     (1)
       b_new = b − α ∂L/∂b (w, b; D)     (2)
   We use calculus rules to find the derivatives
       ∂L/∂w (w, b; D) = (1/n) Σ_{i=1}^{n} 2 (wx(i) + b − y(i)) x(i) = (2/n) Σ_{i=1}^{n} (ŷ(i) − y(i)) x(i)
   and
       ∂L/∂b (w, b; D) = (1/n) Σ_{i=1}^{n} 2 (wx(i) + b − y(i)) = (2/n) Σ_{i=1}^{n} (ŷ(i) − y(i)).
   We plug these two derivatives into the above updating rule (1)-(2) and find the w and b
   that minimize L by iterating
       w_new = w − α (2/n) Σ_{i=1}^{n} (ŷ(i) − y(i)) x(i)
       b_new = b − α (2/n) Σ_{i=1}^{n} (ŷ(i) − y(i))
   Note that this updating rule requires computing the derivatives on the entire data set D
   before we take a single step; this is called batch gradient descent. A single pass through
   the data set is called an epoch, so in batch gradient descent each update corresponds to
   one epoch.
   If the loss function is convex (as is the case here), batch gradient descent will converge
   to the one global minimum, but it might take a long time if the data set is large, and it
   is also expensive in terms of memory. If the loss function is not convex, we might not
   even reach the global minimum.
The following two versions of gradient descent can solve both of these problems.
– stochastic (or incremental) gradient descent
  In this case the update rule does not wait to use the entire data set; at each
  iteration the gradient is calculated by considering the loss at a single random point
  (x(i), y(i)), which is given by
      L(w, b; x(i), y(i)) = (wx(i) + b − y(i))²,
  so a single update becomes
      w_new = w − α · 2 (ŷ(i) − y(i)) x(i)
      b_new = b − α · 2 (ŷ(i) − y(i)).
– mini-batch gradient descent
  At each iteration the gradient is computed on a small random subset (mini-batch) of the
  data set, which combines the stability of batch gradient descent with the speed and low
  memory cost of stochastic gradient descent.
[Figure: variants of gradient descent; source: https://suniljangirblog.wordpress.com/2018/12/13/variants-of-gradient-descent/]
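The following sketch puts these pieces together on a small synthetic data set, assuming NumPy; the learning rates and numbers of epochs are arbitrary illustrative choices, not recommended values. It compares the closed-form solution with batch gradient descent and stochastic gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0.0, 10.0, size=n)
y = 3.0 * x + 2.0 + rng.normal(0.0, 1.0, size=n)   # data from the line y = 3x + 2 plus noise

# 1. Analytical (closed-form) solution
w_cf = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
b_cf = y.mean() - w_cf * x.mean()

# 2a. Batch gradient descent on the MSE loss (one update per epoch, full data set)
w, b, alpha = 0.0, 0.0, 0.01
for _ in range(1000):
    y_hat = w * x + b
    w -= alpha * (2.0 / n) * np.sum((y_hat - y) * x)
    b -= alpha * (2.0 / n) * np.sum(y_hat - y)

# 2b. Stochastic gradient descent: one randomly chosen point per update
w_s, b_s, alpha_sgd = 0.0, 0.0, 0.002
for _ in range(100):                       # 100 passes (epochs) over the data
    for i in rng.permutation(n):           # n single-point updates per pass
        err = (w_s * x[i] + b_s) - y[i]
        w_s -= alpha_sgd * 2.0 * err * x[i]
        b_s -= alpha_sgd * 2.0 * err

print(w_cf, b_cf)   # all three pairs should be close to w = 3, b = 2
print(w, b)
print(w_s, b_s)     # SGD: close, but generally not exactly optimal
```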
• Remarks:
  1. The learning rate and the number of iterations are hyperparameters of the GD method.
     To find a good learning rate, we can use grid search. Regarding the number of iterations,
     in order not to waste time and memory, set a maximum number of iterations and exit the
     loop when the gradient becomes sufficiently small. For a convex function, the number of
     iterations needed to bring the gradient within ε is of order 1/ε.
  2. In SGD, both the gradient and the loss function decrease only on average: they can go up
     and down even when we are close to the minimum, so the final parameters are good, but
     not always optimal. A solution to this problem is to gradually reduce the learning rate
     over time.
     – Theorem: If the loss function is convex and the learning rate α(t) satisfies
           Σ_{t=1}^{∞} α(t) = ∞   and   Σ_{t=1}^{∞} (α(t))² < ∞,
       then SGD converges to the global minimum (with probability 1). For example, the
       schedule α(t) = c/t satisfies both conditions.
     Note also that linear regression can be viewed as a single neuron with identity
     activation function φ:
         z = wx + b,   φ(z) = z.
     Therefore, ŷ = φ(z) = z = wx + b.
  5. Scaling and normalization
     Linear regression typically does not require scaling the features, but some other models
     compare features, which can be meaningless if the features have different scales. In
     addition, if the features do not have similar scales, GD might take much longer to
     converge. The following two techniques create features with common ranges; a short
     sketch of both follows their definitions.
     – Min-Max Scaling for the jth feature is defined by
           x̃_j(i) = (x_j(i) − m_j) / (M_j − m_j),
       where
           m_j = min_i x_j(i)   and   M_j = max_i x_j(i).
       This scaling creates values within the interval [0, 1], and both end points are achieved.
     – Standard Scaling for the jth feature is defined by
           x̃_j(i) = (x_j(i) − µ_j) / σ_j,
       where µ_j and σ_j are the mean and the standard deviation of the jth feature.
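A minimal NumPy sketch of both scalings applied column-wise to a toy feature matrix (scikit-learn provides the same transformations as MinMaxScaler and StandardScaler):

```python
import numpy as np

# toy feature matrix: n = 4 observations, d = 2 features with very different scales
X = np.array([[1.0, 1000.0],
              [2.0, 1500.0],
              [3.0, 3000.0],
              [4.0, 2000.0]])

# Min-Max scaling: (x - min) / (max - min), column-wise -> values in [0, 1]
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standard scaling: (x - mean) / std, column-wise -> mean 0, std 1 per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_minmax)
print(X_std.mean(axis=0), X_std.std(axis=0))   # approximately [0, 0] and [1, 1]
```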
5.2. MULTIPLE LINEAR REGRESSION
In this case the input is multidimensional, x(i) ∈ Rd with d > 1, and the hypothesis is of the form
    h(x; θ) = θT x + θ0,
where θ ∈ Rd and θ0 are parameters determined either by the normal equation or by gradient
descent.
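A minimal sketch of fitting such a hypothesis with scikit-learn on a synthetic data set (the coefficients used to generate the data are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                     # n = 200 observations, d = 3 features
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0 + rng.normal(0.0, 0.1, size=200)

model = LinearRegression()
model.fit(X, y)
print(model.coef_)        # estimates of theta_1, ..., theta_d
print(model.intercept_)   # estimate of theta_0
```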
5.3. POLYNOMIAL REGRESSION
In this case the hypothesis has the form of a polynomial of a certain degree. For example, if x(i) ∈ R
and the polynomial is of degree three, the hypothesis is of the form
    h(x; θ) = θ0 + θ1 x + θ2 x² + θ3 x³.
Note that even though this function is a polynomial of degree 3 in the variable x, it is a linear function
in the variables x, x², and x³. Therefore, to perform polynomial regression, we can simply add new
(polynomial) features to our data and perform linear regression on the data with those new features.
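A minimal sketch of degree-3 polynomial regression with scikit-learn: the new features x² and x³ are added with PolynomialFeatures and an ordinary linear regression is fit on them (the data-generating coefficients are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=100).reshape(-1, 1)
y = 1.0 + 2.0 * x[:, 0] - 3.0 * x[:, 0] ** 2 + 0.5 * x[:, 0] ** 3 + rng.normal(0.0, 0.2, size=100)

# build the new features [x, x^2, x^3] and fit a linear model in those features
poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly.fit_transform(x)        # columns: x, x^2, x^3
model = LinearRegression().fit(X_poly, y)

print(model.intercept_, model.coef_)  # approximately 1.0 and [2.0, -3.0, 0.5]
```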
APPENDIX:
• We show the derivation of the optimal parameters in the case of multiple linear regression.
  Assume we have data consisting of n observations described by d explanatory variables and a
  target variable, {(x_1(i), x_2(i), . . . , x_d(i), y(i))}, i = 1, . . . , n. We are looking for a regression line
      θ0 + θ1 x1 + . . . + θd xd
  that is the best fit for the data. For convenience, let us introduce x0 = 1 and consider the
  above line as
      h(θ; x) = Σ_{i=0}^{d} θi xi = θT x,
  and the sum of squared errors
      L(θ) = (1/n) Σ_{i=1}^{n} (h(θ; x(i)) − y(i))² = (1/n) (Xθ − y)T (Xθ − y),
  where X ∈ R^(n×(d+1)), θ ∈ R^(d+1), and y ∈ R^n. Note that the above expression can be rewritten
  as
      L(θ) = (1/n) (θT XT − yT)(Xθ − y)
           = (1/n) {θT XT Xθ − 2 θT XT y + yT y}.
  We want to find θ that minimizes L(θ). Note that the gradient of L(θ) is given by
      ∇L(θ) = (2/n) XT Xθ − (2/n) XT y,
  implying that ∇L(θ) = 0 holds for
      (2/n) XT Xθ − (2/n) XT y = 0,   i.e.,   XT Xθ = XT y.
  From here we get
      θ = (XT X)^(−1) XT y.                                        (3)
Since L(θ) is a convex function, the value of θ in (3) is where L(θ) achieves its minimum.
This equation is called the NORMAL EQUATION.
• The computational cost of inverting the square matrix XT X, which has size (d + 1) × (d + 1),
  is of order (d + 1)³, which can be very costly if we have a large number of input variables.
  Since the matrix XT X might not be invertible, the Moore-Penrose pseudo-inverse is used
  instead (this method is based on the singular value decomposition: for a matrix A = U Σ VT,
  its pseudo-inverse is A+ = V Σ+ UT, where Σ+ is obtained by taking the reciprocal of each
  nonzero singular value in Σ and transposing the result). Finding the pseudo-inverse is of
  order (d + 1)².
On the other hand, the computational cost of both the normal equation and the pseudo-
inverse is linear in n.
• In future lectures we will see how to fix the problem that the matrix XT X might not be
  invertible.
• Since computing the optimal coefficients analytically is very costly for a large number of input
  features, the method of gradient descent is often used instead. A small numerical check of the
  normal equation (3) is sketched below.
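The sketch below checks the normal equation (3) with NumPy on synthetic data; the pseudo-inverse and np.linalg.lstsq give the same solution and are generally numerically preferable to forming (XT X)^(−1) explicitly. The data-generating parameters are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X_raw = rng.normal(size=(n, d))
theta_true = np.array([4.0, 1.5, -2.0, 0.5])        # theta_0, theta_1, ..., theta_d

# design matrix with the column x_0 = 1 prepended
X = np.hstack([np.ones((n, 1)), X_raw])
y = X @ theta_true + rng.normal(0.0, 0.1, size=n)

# normal equation: theta = (X^T X)^{-1} X^T y
theta_ne = np.linalg.inv(X.T @ X) @ X.T @ y

# numerically preferable alternatives: pseudo-inverse or least squares
theta_pinv = np.linalg.pinv(X) @ y
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_ne)       # all three are (approximately) equal
print(theta_pinv)
print(theta_lstsq)
```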
Homework 2:
• Part 3: Import the data file mtcars.csv. The goal is to determine two or three continuous
  numerical variables that can be used to predict mpg (miles per gallon) using multiple linear
  regression. You can use sklearn or a custom class; batch GD, SGD, or mini-batch SGD; and
  scaling.
• Part 4: Read about Probabilistic Interpretation of Linear Regression in [1], pages 11-13.