Lecture Notes 5: Linear Regression
LINEAR REGRESSION
Regression is a problem in supervised learning where the output is a real number.
Consider a data set D = {(x(1), y(1)), . . . , (x(n), y(n))}.
Goal: We want to find a hypothesis h : Rd → R that agrees with the data set D.
[Figure source: https://towardsdatascience.com/how-to-choose-between-a-linear-or-nonlinear-regression-for-your-dataset-e58a568e2a15]
In linear regression the hypothesis is linear (affine) in the input: h(x; θ, θ0) = θT x + θ0, i.e., h(x; θ, θ0) = θ1 x1 + . . . + θd xd + θ0.
Linear regression is one of the oldest models and dates back to the beginning of the 19th century (Gauss, Legendre).
5.1. SIMPLE LINEAR REGRESSION
For simplicity, we will assume x(i) ∈ R and in that case the hypothesis we would like to learn is of
the form
h(x; w, b) = wx + b.
[Figure source: http://abyss.uoregon.edu/ js/glossary/correlation.html]
Step 1: Check whether a linear relationship between x and y is plausible.
• Find the sample correlation coefficient
      r = Cov(x, y)/(s_x s_y) = Σ (x(i) − x̄)(y(i) − ȳ) / ( √(Σ (x(i) − x̄)²) · √(Σ (y(i) − ȳ)²) ),
  where the sums run over i = 1, . . . , n. Values of r close to ±1 indicate a strong linear relationship.
[Figure source: https://www.mathsisfun.com/data/correlation.html]
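As a quick illustration, here is a minimal NumPy sketch of this computation (the toy arrays x and y below are made-up values for illustration only):

```python
import numpy as np

# toy data (placeholder values, for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

# sample correlation coefficient r = Cov(x, y) / (s_x * s_y)
r = np.sum((x - x.mean()) * (y - y.mean())) / (
    np.sqrt(np.sum((x - x.mean()) ** 2)) * np.sqrt(np.sum((y - y.mean()) ** 2))
)
print(r)                        # close to 1: strong positive linear relationship
print(np.corrcoef(x, y)[0, 1])  # the same value computed by NumPy
```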
Step 2: Find the hypothesis h(x; w, b) = wx + b.
• Given the data set D = {(x(1), y(1)), . . . , (x(n), y(n))}, the goal is to find the parameters w
  and b. We find these parameters by minimizing a loss function, which measures the error we
  make by predicting ŷ(i) = h(x(i)) when the true target value is y(i).
The most common loss functions for regression are:
  – Mean Absolute Error is the mean (or average) of the absolute values of the individual
    errors; it tells us how big an error we can expect on average:
        MAE = (1/n) Σ_{i=1}^{n} |ŷ(i) − y(i)|.
    The main advantage of MAE is that it is in the same units as the target variable. The
    disadvantage is that MAE is not differentiable everywhere (the absolute value is not
    differentiable at 0).
  – Mean Squared Error is the mean of the individual squared errors:
        MSE = (1/n) Σ_{i=1}^{n} (ŷ(i) − y(i))².
    Note that MSE is differentiable and convex; however, its units do not match those of
    the original target. The effect of the square term is most apparent in the presence of
    outliers in the data: while each residual in MAE contributes proportionally to the total
    error, in MSE the error grows quadratically. This ultimately means that outliers in our
    data contribute a much higher total error in MSE than they would in MAE, and the
    model is penalized more for making predictions that differ greatly from the
    corresponding actual values.
    There is also a probabilistic interpretation of linear regression that justifies using MSE
    when the training data are generated from an underlying linear hypothesis with added
    Gaussian-distributed noise with mean 0 (see [1], pages 11-13).
  – Root Mean Squared Error is the square root of the MSE:
        RMSE = √MSE = √( (1/n) Σ_{i=1}^{n} (ŷ(i) − y(i))² ).
We will discuss two ways of finding the optimal values w and b for which this loss has a
minimum value.
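Before turning to these two approaches, here is a minimal sketch of the three losses defined above, assuming NumPy and an arbitrary candidate hypothesis h(x) = wx + b (the toy data and the values of w and b are illustrative only):

```python
import numpy as np

def mae(y_hat, y):
    """Mean Absolute Error: average absolute residual."""
    return np.mean(np.abs(y_hat - y))

def mse(y_hat, y):
    """Mean Squared Error: average squared residual."""
    return np.mean((y_hat - y) ** 2)

def rmse(y_hat, y):
    """Root Mean Squared Error: square root of the MSE."""
    return np.sqrt(mse(y_hat, y))

# toy data and an arbitrary candidate hypothesis h(x) = w*x + b
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
w, b = 2.0, 0.0
y_hat = w * x + b

print(mae(y_hat, y), mse(y_hat, y), rmse(y_hat, y))
```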
1. Analytical Solution
   Take the first derivatives of the loss L (here the MSE) with respect to w and b, set them
   to 0, and solve for w and b:
       w = ( n Σ x(i) y(i) − Σ x(i) Σ y(i) ) / ( n Σ (x(i))² − (Σ x(i))² )
       b = ȳ − w x̄
   For details of this derivation in the case of multiple linear regression (the case d ≥ 1),
   see the APPENDIX at the end of these notes. A short numerical sketch comparing this
   closed-form solution with gradient descent follows the gradient-descent discussion below.
2. Gradient Descent
   We initialize w and b and we use the update rule
       w_new = w − α ∂L/∂w (w, b; D)     (1)
       b_new = b − α ∂L/∂b (w, b; D)     (2)
   We use calculus rules to find the derivatives
       ∂L/∂w (w, b; D) = (1/n) Σ_{i=1}^{n} 2 (wx(i) + b − y(i)) x(i) = (2/n) Σ_{i=1}^{n} (ŷ(i) − y(i)) x(i)
   and
       ∂L/∂b (w, b; D) = (1/n) Σ_{i=1}^{n} 2 (wx(i) + b − y(i)) = (2/n) Σ_{i=1}^{n} (ŷ(i) − y(i)).
   We plug these two derivatives into the above updating rule (1)-(2) and find the w and b
   that minimize L by iterating
       w_new = w − α (2/n) Σ_{i=1}^{n} (ŷ(i) − y(i)) x(i)
       b_new = b − α (2/n) Σ_{i=1}^{n} (ŷ(i) − y(i))
   Note that this updating rule requires computing the derivatives on the entire data set D
   before we take a single step; this is called batch gradient descent. A single pass through
   the data set is called an epoch, so in batch gradient descent each update corresponds to
   one epoch.
   If the loss function is convex (as is the case here), batch gradient descent will converge
   to the one global minimum, but it might take a long time if the data set is large, and it
   is also expensive in terms of memory. If the loss function is not convex, we might not
   even reach the global minimum.
The following two versions of gradient descent can solve both of these problems.
– stochastic (or incremental) gradient descent
  In this case the update rule does not wait to use the entire data set; at each
  iteration the gradient is calculated by considering the loss at a single random point
  (x(i), y(i)), which is given by
      L(w, b; x(i), y(i)) = (wx(i) + b − y(i))²,
  so a single update becomes
      w_new = w − α · 2 (ŷ(i) − y(i)) x(i)
      b_new = b − α · 2 (ŷ(i) − y(i)).
– mini-batch gradient descent
  At each iteration the gradient is computed on a small random subset (mini-batch) of the
  data set, which combines the stability of batch gradient descent with the speed and low
  memory cost of stochastic gradient descent.
[Figure: variants of gradient descent; source: https://suniljangirblog.wordpress.com/2018/12/13/variants-of-gradient-descent/]
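The following sketch puts these pieces together on a small synthetic data set, assuming NumPy; the learning rates and numbers of epochs are arbitrary illustrative choices, not recommended values. It compares the closed-form solution with batch gradient descent and stochastic gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0.0, 10.0, size=n)
y = 3.0 * x + 2.0 + rng.normal(0.0, 1.0, size=n)   # data from the line y = 3x + 2 plus noise

# 1. Analytical (closed-form) solution
w_cf = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
b_cf = y.mean() - w_cf * x.mean()

# 2a. Batch gradient descent on the MSE loss (one update per epoch, full data set)
w, b, alpha = 0.0, 0.0, 0.01
for _ in range(1000):
    y_hat = w * x + b
    w -= alpha * (2.0 / n) * np.sum((y_hat - y) * x)
    b -= alpha * (2.0 / n) * np.sum(y_hat - y)

# 2b. Stochastic gradient descent: one randomly chosen point per update
w_s, b_s, alpha_sgd = 0.0, 0.0, 0.002
for _ in range(100):                       # 100 passes (epochs) over the data
    for i in rng.permutation(n):           # n single-point updates per pass
        err = (w_s * x[i] + b_s) - y[i]
        w_s -= alpha_sgd * 2.0 * err * x[i]
        b_s -= alpha_sgd * 2.0 * err

print(w_cf, b_cf)   # all three pairs should be close to w = 3, b = 2
print(w, b)
print(w_s, b_s)     # SGD: close, but generally not exactly optimal
```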
• Remarks:
  1. The learning rate and the number of iterations are hyperparameters of the GD method.
     To find a good learning rate, we can use grid search. Regarding the number of iterations,
     in order not to waste time and memory, set a maximum number of iterations and exit the
     loop when the gradient becomes sufficiently small. For a convex function, the number of
     iterations needed to bring the gradient within ε is of order 1/ε.
  2. In SGD, both the gradient and the loss function decrease only on average: they can go up
     and down even when we are close to the minimum, so the final parameters are good, but
     not always optimal. A solution to this problem is to gradually reduce the learning rate
     over time.
     – Theorem: If the loss function is convex and the learning rate α(t) satisfies
           Σ_{t=1}^{∞} α(t) = ∞   and   Σ_{t=1}^{∞} (α(t))² < ∞,
       then SGD converges to the global minimum (with probability 1). For example, the
       schedule α(t) = c/t satisfies both conditions.
     Note also that linear regression can be viewed as a single neuron with identity
     activation function φ:
         z = wx + b,   φ(z) = z.
     Therefore, ŷ = φ(z) = z = wx + b.
  5. Scaling and normalization
     Linear regression typically does not require scaling the features, but some other models
     compare features, which can be meaningless if the features have different scales. In
     addition, if the features do not have similar scales, GD might take much longer to
     converge. The following two techniques create features with common ranges; a short
     sketch of both follows their definitions.
     – Min-Max Scaling for the jth feature is defined by
           x̃_j(i) = (x_j(i) − m_j) / (M_j − m_j),
       where
           m_j = min_i x_j(i)   and   M_j = max_i x_j(i).
       This scaling creates values within the interval [0, 1], and both end points are achieved.
     – Standard Scaling for the jth feature is defined by
           x̃_j(i) = (x_j(i) − µ_j) / σ_j,
       where µ_j and σ_j are the mean and the standard deviation of the jth feature.
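A minimal NumPy sketch of both scalings applied column-wise to a toy feature matrix (scikit-learn provides the same transformations as MinMaxScaler and StandardScaler):

```python
import numpy as np

# toy feature matrix: n = 4 observations, d = 2 features with very different scales
X = np.array([[1.0, 1000.0],
              [2.0, 1500.0],
              [3.0, 3000.0],
              [4.0, 2000.0]])

# Min-Max scaling: (x - min) / (max - min), column-wise -> values in [0, 1]
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standard scaling: (x - mean) / std, column-wise -> mean 0, std 1 per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_minmax)
print(X_std.mean(axis=0), X_std.std(axis=0))   # approximately [0, 0] and [1, 1]
```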
5.2. MULTIPLE LINEAR REGRESSION
In this case the input is multidimensional, x(i) ∈ Rd with d > 1, and the hypothesis is of the form
    h(x; θ) = θT x + θ0,
where θ ∈ Rd and θ0 are parameters determined either by the normal equation or by gradient
descent.
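A minimal sketch of fitting such a hypothesis with scikit-learn on a synthetic data set (the coefficients used to generate the data are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                     # n = 200 observations, d = 3 features
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0 + rng.normal(0.0, 0.1, size=200)

model = LinearRegression()
model.fit(X, y)
print(model.coef_)        # estimates of theta_1, ..., theta_d
print(model.intercept_)   # estimate of theta_0
```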
5.3. POLYNOMIAL REGRESSION
In this case the hypothesis has the form of a polynomial of a certain degree. For example, if x(i) ∈ R
and the polynomial is of degree three, the hypothesis is of the form
    h(x; θ) = θ0 + θ1 x + θ2 x² + θ3 x³.
Note that even though this function is a polynomial of degree 3 in the variable x, it is a linear function
in the variables x, x², and x³. Therefore, to perform polynomial regression, we can simply add new
(polynomial) features to our data and perform linear regression on the data with those new features.
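A minimal sketch of degree-3 polynomial regression with scikit-learn: the new features x² and x³ are added with PolynomialFeatures and an ordinary linear regression is fit on them (the data-generating coefficients are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=100).reshape(-1, 1)
y = 1.0 + 2.0 * x[:, 0] - 3.0 * x[:, 0] ** 2 + 0.5 * x[:, 0] ** 3 + rng.normal(0.0, 0.2, size=100)

# build the new features [x, x^2, x^3] and fit a linear model in those features
poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly.fit_transform(x)        # columns: x, x^2, x^3
model = LinearRegression().fit(X_poly, y)

print(model.intercept_, model.coef_)  # approximately 1.0 and [2.0, -3.0, 0.5]
```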
APPENDIX:
• We show the derivation of the optimal parameters in the case of multiple linear regression.
  Assume we have data consisting of n observations described by d explanatory variables and a
  target variable, {(x_1(i), x_2(i), . . . , x_d(i), y(i))}, i = 1, . . . , n. We are looking for a regression line
      θ0 + θ1 x1 + . . . + θd xd
  that is the best fit for the data. For convenience, let us introduce x0 = 1 and consider the
  above line as
      h(θ; x) = Σ_{i=0}^{d} θi xi = θT x,
  and the sum of squared errors
      L(θ) = (1/n) Σ_{i=1}^{n} (h(θ; x(i)) − y(i))² = (1/n) (Xθ − y)T (Xθ − y),
  where X ∈ R^(n×(d+1)), θ ∈ R^(d+1), and y ∈ R^n. Note that the above expression can be rewritten
  as
      L(θ) = (1/n) (θT XT − yT)(Xθ − y)
           = (1/n) {θT XT Xθ − 2 θT XT y + yT y}.
  We want to find θ that minimizes L(θ). Note that the gradient of L(θ) is given by
      ∇L(θ) = (2/n) XT Xθ − (2/n) XT y,
  implying that ∇L(θ) = 0 holds for
      (2/n) XT Xθ − (2/n) XT y = 0,   i.e.,   XT Xθ = XT y.
  From here we get
      θ = (XT X)^(−1) XT y.                                        (3)
Since L(θ) is a convex function, the value of θ in (3) is where L(θ) achieves its minimum.
This equation is called the NORMAL EQUATION.
• The computational cost of inverting the square matrix XT X, which has size (d + 1) × (d + 1),
  is of order (d + 1)³, which can be very costly if we have a large number of input variables.
  Since the matrix XT X might not be invertible, the Moore-Penrose pseudo-inverse is used
  instead (this method is based on the singular value decomposition: for a matrix A = U Σ VT,
  its pseudo-inverse is A+ = V Σ+ UT, where Σ+ is obtained by taking the reciprocal of each
  nonzero singular value in Σ and transposing the result). Finding the pseudo-inverse is of
  order (d + 1)².
On the other hand, the computational cost of both the normal equation and the pseudo-
inverse is linear in n.
• In future lectures we will see how to fix the problem that the matrix XT X might not be
  invertible.
• Since computing the optimal coefficients analytically is very costly for a large number of input
  features, the method of gradient descent is often used instead. A small numerical check of the
  normal equation (3) is sketched below.
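The sketch below checks the normal equation (3) with NumPy on synthetic data; the pseudo-inverse and np.linalg.lstsq give the same solution and are generally numerically preferable to forming (XT X)^(−1) explicitly. The data-generating parameters are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X_raw = rng.normal(size=(n, d))
theta_true = np.array([4.0, 1.5, -2.0, 0.5])        # theta_0, theta_1, ..., theta_d

# design matrix with the column x_0 = 1 prepended
X = np.hstack([np.ones((n, 1)), X_raw])
y = X @ theta_true + rng.normal(0.0, 0.1, size=n)

# normal equation: theta = (X^T X)^{-1} X^T y
theta_ne = np.linalg.inv(X.T @ X) @ X.T @ y

# numerically preferable alternatives: pseudo-inverse or least squares
theta_pinv = np.linalg.pinv(X) @ y
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_ne)       # all three are (approximately) equal
print(theta_pinv)
print(theta_lstsq)
```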
Homework 2:
• Part 3: Import the data file mtcars.csv. The goal is to determine two or three continuous
  numerical variables that can be used to predict mpg (miles per gallon) using multiple linear
  regression. You can use sklearn or a custom class; batch GD, SGD, or mini-batch SGD; and
  scaling.
• Part 4: Read about Probabilistic Interpretation of Linear Regression in [1], pages 11-13.