Linear Regression (Annotated)
Fabio Galasso
Fundamentals of Data Science | Winter Semester 2020
Outline
• Linear regression
One or multiple variables
Cost function
• Normal equation
• Locally-Weighted Regression
[Figure: Training set → Learning Algorithm → hypothesis h; Size of house (x) → h → Estimated price (y). Scatter plot of price ($1000s) vs. size.]

Training set:
Size (feet²)   Price ($1000)
2104           460
1416           232
1534           315
852            178
…              …
Multiple features (variables)

Size (feet²)   Number of bedrooms   Number of floors   Age of home (years)   Price ($1000)
2104           5                    1                  45                    460
1416           3                    2                  40                    232
1534           3                    2                  30                    315
852            2                    1                  36                    178
…              …                    …                  …                     …

Notation:
n = number of features
x^(i) = input (features) of the i-th training example
xⱼ^(i) = value of feature j in the i-th training example
Hypothesis:
With one variable:    h_θ(x) = θ₀ + θ₁x
With multiple variables:    h_θ(x) = θ₀ + θ₁x₁ + θ₂x₂ + … + θₙxₙ
For convenience of notation, define x₀ = 1, so that
    h_θ(x) = θ₀x₀ + θ₁x₁ + … + θₙxₙ = θᵀx
θⱼ's: parameters
How to choose the θⱼ's?
[Figure: three example datasets with candidate fits h_θ(x) for different parameter choices.]
Parameters: θ₀, θ₁
Cost function: J(θ₀, θ₁) = 1/(2m) Σᵢ₌₁ᵐ (h_θ(x^(i)) − y^(i))²
Goal: minimize J(θ₀, θ₁) over θ₀, θ₁
Intuition with a single parameter θ₁ (simplified hypothesis h_θ(x) = θ₁x):
[Figures: left, h_θ(x) plotted against x for a fixed θ₁ (a function of x); right, the cost J(θ₁) plotted as a function of the parameter θ₁. Each hypothesis on the left corresponds to one point on the cost curve on the right; the best fit is the θ₁ that minimizes J(θ₁).]
Hypothesis: h_θ(x) = θ₀ + θ₁x
Parameters: θ₀, θ₁
Cost function: J(θ₀, θ₁) = 1/(2m) Σᵢ₌₁ᵐ (h_θ(x^(i)) − y^(i))²
Goal: minimize J(θ₀, θ₁) over θ₀, θ₁
[Figures: left, the housing data (price in $1000s vs. size in feet²) with the current hypothesis h_θ(x), which for fixed θ₀, θ₁ is a function of x; right, the cost J(θ₀, θ₁) as a function of the parameters, shown as surface and contour plots. Each hypothesis line corresponds to one point on the cost surface.]
Outline
• Linear regression
One or multiple variables
Cost function
• Normal equation
• Locally-Weighted Regression
Gradient descent
Outline:
• Start with some θ₀, θ₁
• Keep changing θ₀, θ₁ to reduce J(θ₀, θ₁), until we hopefully end up at a minimum
[Figure: surface plots of J(θ₀, θ₁); depending on the starting point, gradient descent may end up in different local minima.]
Gradient descent algorithm
Repeat until convergence:
    θⱼ := θⱼ − α ∂/∂θⱼ J(θ₀, θ₁)        (for j = 0 and j = 1, updated simultaneously; α is the learning rate)
[Figure: J plotted against the current value of θ; each update moves θ downhill along the negative gradient.]
Gradient descent can converge to a local minimum, even with the learning rate α fixed: as we approach a local minimum, the gradient gets smaller, so gradient descent automatically takes smaller steps. There is no need to decrease α over time.
Gradient Descent for Linear Regression
Applying the gradient descent algorithm to the linear regression model gives:
Repeat until convergence:
    θ₀ := θ₀ − α (1/m) Σᵢ₌₁ᵐ (h_θ(x^(i)) − y^(i))
    θ₁ := θ₁ − α (1/m) Σᵢ₌₁ᵐ (h_θ(x^(i)) − y^(i)) · x^(i)
Update θ₀ and θ₁ simultaneously.
[Figure: for linear regression the cost J(θ₀, θ₁) is a convex, bowl-shaped surface, so gradient descent converges to the global minimum.]
[Figures: successive gradient descent iterations. Left: the current hypothesis h_θ(x) on the housing data (for fixed θ₀, θ₁, a function of x). Right: the contour plot of J(θ₀, θ₁) as a function of the parameters, with the iterates moving towards the minimum.]
“Batch” Gradient Descent: each step of gradient descent uses all m training examples.
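As a concrete illustration, here is a minimal NumPy sketch of batch gradient descent for one-variable linear regression (the function and variable names are illustrative, not from the slides):

```python
import numpy as np

def batch_gradient_descent(x, y, alpha=0.01, num_iters=1000):
    """Fit h_theta(x) = theta0 + theta1 * x by batch gradient descent."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        h = theta0 + theta1 * x                 # predictions on all m examples
        error = h - y
        # Gradients of J(theta) = (1 / (2m)) * sum((h - y)^2)
        grad0 = error.sum() / m
        grad1 = (error * x).sum() / m
        # Simultaneous update of both parameters
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Housing data from the slides (size in feet^2, price in $1000s)
x = np.array([2104, 1416, 1534, 852], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)
x_scaled = (x - x.mean()) / x.std()             # feature scaling helps convergence
print(batch_gradient_descent(x_scaled, y, alpha=0.1, num_iters=2000))
```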
Stochastic Gradient Descent (SGD)
• For large training sets, evaluating the gradient over all samples may be expensive
• Stochastic or “online” gradient descent approximates the true gradient with the gradient at a single example

Pseudocode:
- Choose an initial parameter vector 𝜽₀ and learning rate 𝛼
- Repeat until convergence:
  • Randomly shuffle the examples in the training set
  • For k = 1, 2, …, m do:
        𝜽^(i+1) = 𝜽^(i) − 𝛼 ∇J^(k)(𝜽^(i))
    where J^(k) is the cost evaluated on the k-th example only

• Normally preferable: mini-batch gradient descent
  Consider a mini-batch of examples at each step
  This normally results in smoother convergence
  and is normally faster, thanks to vectorization libraries (see the sketch below)
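A minimal sketch of mini-batch stochastic gradient descent for linear regression, vectorized with NumPy (the function name, batch size, and other defaults are illustrative choices, not prescribed by the slides):

```python
import numpy as np

def minibatch_sgd(X, y, alpha=0.01, batch_size=32, epochs=100, seed=0):
    """X: (m, n+1) design matrix with a leading column of ones; y: (m,) targets."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        order = rng.permutation(m)              # randomly shuffle the training set
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (Xb @ theta - yb) / len(idx)   # gradient on the mini-batch
            theta -= alpha * grad
    return theta
```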
Gradient Descent for Multiple Variables
Hypothesis: h_θ(x) = θᵀx = θ₀x₀ + θ₁x₁ + … + θₙxₙ   (with x₀ = 1)
Parameters: θ = (θ₀, θ₁, …, θₙ)
Cost function: J(θ) = 1/(2m) Σᵢ₌₁ᵐ (h_θ(x^(i)) − y^(i))²
Gradient descent:
Repeat until convergence:
    θⱼ := θⱼ − α (1/m) Σᵢ₌₁ᵐ (h_θ(x^(i)) − y^(i)) · xⱼ^(i)
(simultaneously update θⱼ for j = 0, …, n)
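In vectorized form the same update reads θ := θ − (α/m) Xᵀ(Xθ − y). A minimal sketch, assuming a design matrix X with a leading column of ones (names are illustrative):

```python
import numpy as np

def gradient_descent_multi(X, y, alpha=0.01, num_iters=1000):
    """Vectorized batch gradient descent; X is the (m, n+1) design matrix with x0 = 1."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        theta -= (alpha / m) * (X.T @ (X @ theta - y))   # simultaneous update of all theta_j
    return theta
```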
Feature Scaling
Idea: make sure features are on a similar scale.
E.g.  x₁ = size (0–2000 feet²),  x₂ = number of bedrooms (1–5)
[Figure: contour plots of J(θ); with very differently scaled features the contours are elongated ellipses and gradient descent zig-zags, while after scaling they are closer to circles and gradient descent converges faster.]
Feature Scaling
Get every feature into approximately a −1 ≤ xⱼ ≤ 1 range.

Mean normalization
Replace xⱼ with xⱼ − μⱼ to make features have approximately zero mean
(do not apply to x₀ = 1).
E.g.  xⱼ := (xⱼ − μⱼ) / sⱼ,  where μⱼ is the mean of feature j and sⱼ its range (max − min) or standard deviation.
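A minimal NumPy sketch of mean normalization and scaling (names are illustrative); the intercept column x₀ = 1 is added after scaling so it stays untouched:

```python
import numpy as np

def scale_features(X):
    """Mean-normalize and scale each feature column to roughly [-1, 1]."""
    mu = X.mean(axis=0)                 # per-feature mean
    sigma = X.std(axis=0)               # per-feature standard deviation
    X_scaled = (X - mu) / sigma
    return X_scaled, mu, sigma

# Housing features from the slides: size, bedrooms, floors, age
X = np.array([[2104, 5, 1, 45],
              [1416, 3, 2, 40],
              [1534, 3, 2, 30],
              [ 852, 2, 1, 36]], dtype=float)
X_scaled, mu, sigma = scale_features(X)
X_design = np.column_stack([np.ones(len(X_scaled)), X_scaled])   # add x0 = 1 column
```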
Gradient Descent: Learning Rate
Making sure gradient descent is working correctly: plot J(θ) against the number of iterations; with a well-chosen α, J(θ) should decrease after every iteration.
[Figure: J(θ) vs. number of iterations (0–400), decreasing and flattening out as gradient descent converges.]
Example automatic convergence test: declare convergence if J(θ) decreases by less than some small threshold ε (e.g. 10⁻³) in one iteration.
[Figure: J(θ) increasing or oscillating with the number of iterations.]
If J(θ) is increasing or oscillating, gradient descent is not working: use a smaller α.
To choose α, try a range of values (e.g. …, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, …) and pick the largest one for which J(θ) still decreases steadily.
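A small sketch of this diagnostic, recording J(θ) after every iteration for several candidate learning rates (the data and the specific α values are illustrative):

```python
import numpy as np

def cost(X, y, theta):
    """J(theta) = (1 / (2m)) * sum((X @ theta - y)^2)."""
    m = len(y)
    r = X @ theta - y
    return r @ r / (2 * m)

def gd_with_history(X, y, alpha, num_iters=400):
    """Run batch gradient descent and record J(theta) after every iteration."""
    m, n = X.shape
    theta = np.zeros(n)
    history = []
    for _ in range(num_iters):
        theta -= (alpha / m) * (X.T @ (X @ theta - y))
        history.append(cost(X, y, theta))
    return theta, history

# Toy data: scaled size feature plus intercept column
x = np.array([2104, 1416, 1534, 852], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)
X = np.column_stack([np.ones_like(x), (x - x.mean()) / x.std()])

for alpha in (0.01, 0.1, 1.0, 2.5):
    _, hist = gd_with_history(X, y, alpha)
    trend = "decreasing" if hist[-1] < hist[0] else "NOT decreasing -> alpha too large"
    print(f"alpha={alpha}: final J={hist[-1]:.2f} ({trend})")
```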
Linear Regression: Features and Polynomial Regression

Housing prices prediction

Polynomial regression
[Figure: price (y) vs. size (x) with polynomial fits of different degree.]
A polynomial such as h_θ(x) = θ₀ + θ₁x + θ₂x² (optionally with a cubic term) can be fit with the same machinery, by treating x, x², x³, … as separate features of a linear model.

Choice of features
[Figure: price (y) vs. size (x).]
Instead of higher powers, other transformed features such as √x or log(x) of the size can also be used; see the sketch below.
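A minimal sketch of polynomial regression as linear regression on expanded features, solved here by least squares with np.linalg.lstsq (the data reuses the housing example; the degree and scaling choices are illustrative):

```python
import numpy as np

# Polynomial regression as linear regression on expanded features:
# columns are 1, x, x^2 (scaling x first keeps the powers on comparable scales).
x = np.array([2104, 1416, 1534, 852], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)

x_s = (x - x.mean()) / x.std()
X_poly = np.column_stack([np.ones_like(x_s), x_s, x_s**2])

# Least-squares fit of theta (equivalent to minimizing the squared-error cost)
theta, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
predictions = X_poly @ theta
```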
Dangers of (Polynomial) Regression
• Linear regression
One or multiple variables
Cost function
• Normal equation
• Locally-Weighted Regression
Matrix derivatives
For a function f(A) ∈ ℝ of a matrix A ∈ ℝ^{m×n}, define the gradient ∇_A f(A) as the m×n matrix whose (i, j) entry is ∂f / ∂A_{ij}.
• Trace: if A ∈ ℝ^{n×n},   tr(A) = Σᵢ₌₁ⁿ A_{ii}
Facts
tr(AB) = tr(BA)
tr(A) = tr(Aᵀ)
If a ∈ ℝ:  tr(a) = a
For f(A) = tr(AB):   ∇_A tr(AB) = Bᵀ
∇_A tr(ABAᵀC) = CAB + CᵀABᵀ
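These identities are easy to sanity-check numerically; a small sketch (random matrices, with a finite-difference check of the gradient identity):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((4, 4))

# tr(AB) = tr(BA)
assert np.isclose(np.trace(A @ B), np.trace(B @ A))

# tr(A) = tr(A^T) for square matrices
assert np.isclose(np.trace(C), np.trace(C.T))

# Gradient of f(A) = tr(AB) w.r.t. A is B^T: check entry (0, 0) by finite differences
eps = 1e-6
E = np.zeros_like(A); E[0, 0] = eps
numeric = (np.trace((A + E) @ B) - np.trace(A @ B)) / eps
assert np.isclose(numeric, B.T[0, 0], atol=1e-4)
```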
m training examples; n features. Let X ∈ ℝ^{m×(n+1)} be the design matrix whose i-th row is (x^(i))ᵀ (with x₀^(i) = 1), and let y ∈ ℝᵐ stack the targets y^(i).

Cost function
XΘ − y = [ h_Θ(x^(1)) − y^(1),  …,  h_Θ(x^(m)) − y^(m) ]ᵀ,   where Θᵀx^(i) = Σⱼ₌₀ⁿ Θⱼ xⱼ^(i)

Recall that zᵀz = Σᵢ zᵢ² for any vector z, so

    ½ (XΘ − y)ᵀ (XΘ − y) = ½ Σᵢ₌₁ᵐ (h_Θ(x^(i)) − y^(i))² = J(Θ)
Intuition: if Θ were one-dimensional, we would minimize J(Θ) by setting the derivative dJ/dΘ to zero; the same idea applies in higher dimensions by setting the gradient ∇_Θ J(Θ) to zero.
Solve for 𝚯 analytically
The minimum of J(Θ) is attained when ∇_Θ J(Θ) = 0.

Expanding J(Θ):
    J(Θ) = ½ (XΘ − y)ᵀ(XΘ − y) = ½ (ΘᵀXᵀXΘ − ΘᵀXᵀy − yᵀXΘ + yᵀy)

Recall ∇_A tr(ABAᵀC) = CAB + CᵀABᵀ and ∇_A tr(AB) = Bᵀ. Taking the gradient:
    ∇_Θ J(Θ) = XᵀXΘ − Xᵀy

Setting it to zero gives the normal equation:
    XᵀXΘ = Xᵀy        ⇒        Θ = (XᵀX)⁻¹ Xᵀ y
Examples:

x₀   Size (feet²)   Number of bedrooms   Number of floors   Age of home (years)   Price ($1000)
1    2104           5                    1                  45                    460
1    1416           3                    2                  40                    232
1    1534           3                    2                  30                    315
1     852           2                    1                  36                    178

m = 4 training examples, n = 4 features. X is the 4×5 matrix of the first five columns, y is the Price column, and Θ = (XᵀX)⁻¹Xᵀy.
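A minimal NumPy sketch solving this example via the normal equation; note that with only 4 examples and 5 columns XᵀX is singular, so the sketch uses the pseudo-inverse instead of a plain inverse:

```python
import numpy as np

# Design matrix: columns are x0 = 1, size, bedrooms, floors, age
X = np.array([[1, 2104, 5, 1, 45],
              [1, 1416, 3, 2, 40],
              [1, 1534, 3, 2, 30],
              [1,  852, 2, 1, 36]], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)

# Normal equation: Theta = (X^T X)^{-1} X^T y.
# Here m = 4 < n + 1 = 5, so X^T X is singular; np.linalg.pinv returns the
# minimum-norm least-squares solution instead.
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)
print(X @ theta)   # fitted prices; exact here because the system is underdetermined
```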
Gradient Descent                                  Normal Equation
• Need to choose α.                               • No need to choose α.
• Needs many iterations.                          • No need to iterate.
• Works well even when n is large.                • Need to compute (XᵀX)⁻¹.
                                                  • Slow if n is very large.
Normal equation
• Linear regression
One or multiple variables
Cost function
• Normal equation
• Locally-Weighted Regression
Correlation and Causation (recap from the Overview lecture)
Correlation between two variables does not by itself imply that one causes the other.
Regression and Correlation
The better a line fits the points, the more strongly x and y are correlated.
▪ Linear functions only
▪ Correlation: the values track each other
  Positive correlation: when one goes up, the other goes up
▪ Negative correlation also exists: when one goes up, the other goes down
  • Latitude versus temperature
  • Car weight versus gas mileage
  • Class absences versus final grade
Calculating Simple Linear Regression
Method of least squares
▪ Given a point and a line, the error for the point is its vertical distance d from the line, and the squared error is d².
▪ Given a set of points and a line, the sum of squared errors (SSE) is the sum of the squared errors over all the points.
▪ Goal: given a set of points, find the line that minimizes the SSE.

[Figure: five data points with vertical distances d1, …, d5 to a candidate line.]
SSE = d1² + d2² + d3² + d4² + d5²

How to find the minimizing line:
- Gradient descent
- Normal equation
- Software packages, e.g. NumPy polyfit
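For example, with NumPy's polyfit (the data reuses the housing example; this is an illustrative sketch):

```python
import numpy as np

x = np.array([2104, 1416, 1534, 852], dtype=float)   # size (feet^2)
y = np.array([460, 232, 315, 178], dtype=float)      # price ($1000)

# Fit a degree-1 polynomial (a line) by least squares
theta1, theta0 = np.polyfit(x, y, deg=1)             # coefficients, highest degree first
y_hat = theta0 + theta1 * x
sse = np.sum((y - y_hat) ** 2)                       # sum of squared errors of the fit
print(theta0, theta1, sse)
```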
• Linear regression
One or multiple variables
Cost function
• Normal equation
• Locally-Weighted Regression
Recap: y^(i) ∈ ℝ,  x₀ = 1

h_Θ(x) = Σⱼ₌₀ⁿ Θⱼ xⱼ = Θᵀx        J(Θ) = ½ Σᵢ₌₁ᵐ (h_Θ(x^(i)) − y^(i))²
Recap: choice of features
[Figure: price (y) vs. size (x) with candidate fits:]
    Θ₀ + Θ₁x₁
    Θ₀ + Θ₁x + Θ₂x²
    Θ₀ + Θ₁x + Θ₂√x + Θ₃ log(x)
Recap: dangers of polynomial regression
Overfitting (a model that is too flexible fits the noise in the training data) and underfitting (a model that is too simple misses the underlying trend).
Locally-weighted regression
• Locally-weighted regression is also called Loess or Lowess
• To predict at a query point x, fit Θ by minimizing a weighted cost Σᵢ w^(i) (y^(i) − Θᵀx^(i))², where the weights w^(i) = exp(−(x^(i) − x)² / (2τ²)) give more importance to training examples close to the query point (τ is the bandwidth)
[Figure: data y vs. x where a single global line fits poorly, while locally-weighted fits adapt to the local structure.]
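A minimal NumPy sketch of locally-weighted linear regression with Gaussian weights (the function name and the bandwidth value are illustrative):

```python
import numpy as np

def lwr_predict(x_query, X, y, tau=0.5):
    """Predict y at x_query by fitting a weighted linear regression around it.

    X: (m, n+1) design matrix with a leading column of ones; y: (m,) targets.
    """
    # Gaussian weights: examples near the query point count more
    d2 = np.sum((X - x_query) ** 2, axis=1)
    w = np.exp(-d2 / (2 * tau ** 2))
    W = np.diag(w)
    # Weighted normal equation: Theta = (X^T W X)^{-1} X^T W y
    theta = np.linalg.pinv(X.T @ W @ X) @ X.T @ W @ y
    return x_query @ theta

# Example on 1-D data (with intercept column)
x = np.linspace(0, 3, 30)
y = np.sin(2 * x) + 0.1 * np.random.default_rng(0).standard_normal(30)
X = np.column_stack([np.ones_like(x), x])
print(lwr_predict(np.array([1.0, 1.5]), X, y, tau=0.3))
```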
• Linear regression
One or multiple variables
Cost function
• Normal equation
• Locally-Weighted Regression
Acknowledgements: slides and material from Andrew Ng, Eric Xing, Matthew R. Gormley, Jessica Wu