Lecture 4: Practical Notes & Linear Regression
Subhadip Mukherjee
Acknowledgment: Tom S. F. Haines and Xi Chen (for images and other content)
12 Oct. 2022
University of Bath, UK
Evaluating and diagnosing ML systems
• Underfitting
• Overfitting
• Bad data
• ···
Underfitting & overfitting
You can fit a polynomial that passes exactly through the training data but does not generalize.
At the other extreme, your model could be too weak to fit even the training data.
The first case is overfitting; the second is underfitting.
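A minimal sketch of this trade-off (not from the slides; the polynomial degrees, noise level and sine target are illustrative assumptions), using numpy.polyfit on noisy 1-D data:

```python
# Contrast under- and overfitting by varying the polynomial degree.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.standard_normal(10)
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)

for degree in (1, 3, 9):  # too weak / reasonable / interpolates the noise
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

The degree-9 fit interpolates the ten training points, so its training error is essentially zero, while its test error is usually the worst of the three; degree 1 fits neither set well.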
Underfitting causes
• Weak model
• Bad fitting (left, random forest again)
Overfitting causes
Train and test set
• Model can’t overfit on data it doesn’t see!
• Split the data:
• A train set, to fit the model
• A test set, to verify performance

Accuracy:
  Random Forest    Train    Test
  Underfitting     79.2%    79.2%
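A minimal hold-out split sketch (illustrative; the function name, the 80/20 ratio and the shuffling seed are assumptions, not from the slides):

```python
# Shuffle the indices, keep 80% for training and 20% for testing.
import numpy as np

def train_test_split(X, y, test_fraction=0.2, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_fraction)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]
```

Comparing accuracy on the train portion against the test portion then reveals the kind of gap shown in the table above.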
Hyperparameters I
Hyperparameters II
Measuring performance
• Good default: make the validation and test sets as small as possible while still giving a
reliable estimate, and use the rest to train
• . . . but you might also shrink the train set due to computational cost
• “as small as possible” is hard to judge
k-fold cross validation
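A rough sketch of the procedure (the generic `fit` and `score` callables, k = 5 and the shuffling are assumptions, not from the slides): split the shuffled data into k folds, hold each fold out once for validation, and average the k scores.

```python
import numpy as np

def k_fold_scores(X, y, fit, score, k=5, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)   # k roughly equal index folds
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])          # train on the other k-1 folds
        scores.append(score(model, X[val_idx], y[val_idx]))
    return np.mean(scores), np.std(scores)
```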
Some terminologies
Example confusion matrix (rows: predicted label, columns: actual label):

                    Actual
                  False    True
Predicted  False    49        6
           True     14      159
Some more terminologies
• TP / (TP + FN): sensitivity, recall, hit rate, true positive rate
• TN / (TN + FP): specificity, true negative rate
• TP / (TP + FP): precision, positive predictive value
• (TP + TN) / (TP + TN + FP + FN): accuracy
• 2·TP / (2·TP + FP + FN): F1 score
(many more. . . )
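A small sketch computing these metrics for the example confusion matrix above, assuming the rows-are-predictions layout reconstructed there (so TP = 159, TN = 49, FP = 14, FN = 6):

```python
# Metrics from confusion-matrix counts (values from the example matrix).
TP, TN, FP, FN = 159, 49, 14, 6

recall      = TP / (TP + FN)                   # sensitivity, hit rate, TPR
specificity = TN / (TN + FP)                   # true negative rate
precision   = TP / (TP + FP)                   # positive predictive value
accuracy    = (TP + TN) / (TP + TN + FP + FN)
f1          = 2 * TP / (2 * TP + FP + FN)
print(recall, specificity, precision, accuracy, f1)
```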
Imbalanced data
• Balanced accuracy:
  (1/|C|) Σ_{c ∈ C} |{y_i = c ∧ f_θ(x_i) = c}| / |{y_i = c}|
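A minimal sketch of balanced accuracy as the average per-class recall (the code and the toy labels are illustrative, not from the slides):

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]  # per-class recall
    return np.mean(recalls)

# A majority-class predictor looks good on plain accuracy
# but gets only 50% balanced accuracy on imbalanced labels.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)
print(np.mean(y_pred == y_true))          # 0.95 plain accuracy
print(balanced_accuracy(y_true, y_pred))  # 0.5
```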
Linear regression – 1
With the bias folded in as w_0 = b (each input x_i is augmented with a leading 1), stack the data and weights as

  X = [ x_1^⊤ ; x_2^⊤ ; … ; x_n^⊤ ] ∈ R^{n×(d+1)},   y = (y_1, y_2, …, y_n)^⊤ ∈ R^{n×1},
  w = (w_0, w_1, w_2, …, w_d)^⊤ ∈ R^{(d+1)×1}.

The least-squares problem is

  min_w (1/2) ∥Xw − y∥_2^2 ,   where ∥z∥_2^2 = Σ_i z_i^2 .
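A small NumPy sketch of this setup (the data is synthetic and illustrative): build X with a leading column of ones and evaluate J(w) = (1/2)∥Xw − y∥_2^2.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
features = rng.standard_normal((n, d))
X = np.hstack([np.ones((n, 1)), features])     # shape (n, d+1); first column is the bias
w_true = np.array([0.5, 2.0, -1.0, 3.0])       # w_0 plays the role of b
y = X @ w_true + 0.1 * rng.standard_normal(n)

def J(w, X, y):
    r = X @ w - y
    return 0.5 * r @ r                         # 0.5 * ||Xw - y||^2

print(J(w_true, X, y))
```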
Linear regression: direct solution – 1
Let J(w) = (1/2) Σ_{i=1}^n (w^⊤ x_i − y_i)^2 = (1/2) ∥Xw − y∥_2^2 .

Observations:

  ∂J(w)/∂w_t = Σ_{i=1}^n (w^⊤ x_i − y_i) · ∂(w^⊤ x_i)/∂w_t = Σ_{i=1}^n (w^⊤ x_i − y_i) x_{it} ,   for t = 0, 1, 2, …, d,

  i.e.,  ∂J(w)/∂w_t = [ X^⊤ (Xw − y) ]_t .

• Stacking the partial derivatives, ∇J(w) = (∂J(w)/∂w_0, ∂J(w)/∂w_1, …, ∂J(w)/∂w_d)^⊤ = X^⊤ (Xw − y) ∈ R^{(d+1)×1} is called the gradient vector.
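A quick numeric sanity check of ∇J(w) = X^⊤(Xw − y) against central finite differences (self-contained, with illustrative synthetic data):

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.hstack([np.ones((40, 1)), rng.standard_normal((40, 3))])
y = rng.standard_normal(40)

def J(w):
    r = X @ w - y
    return 0.5 * r @ r

def grad_J(w):
    return X.T @ (X @ w - y)

w = rng.standard_normal(X.shape[1])
eps = 1e-6
finite_diff = np.array([(J(w + eps * e) - J(w - eps * e)) / (2 * eps)
                        for e in np.eye(len(w))])      # one coordinate at a time
print(np.max(np.abs(finite_diff - grad_J(w))))         # should be ~1e-6 or smaller
```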
Linear regression: direct solution – 2
For s, t = 0, 1, 2, …, d:

  ∂²J(w)/∂w_s ∂w_t = ∂/∂w_s [ Σ_{i=1}^n (w^⊤ x_i − y_i) x_{it} ] = Σ_{i=1}^n x_{it} x_{is} = [ X^⊤ X ]_{t,s} ,

so the Hessian is ∇²J(w) = X^⊤ X.

Exercise: Show that the Hessian matrix is positive semi-definite (PSD), i.e., for any
u ∈ R^{d+1}, the quadratic form of the Hessian satisfies u^⊤ (∇²J(w)) u ≥ 0.

• A function that is twice differentiable and has a PSD Hessian is called a convex function.
• For a convex J(w), any solution of ∇J(w) = 0 minimizes the function.
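A quick numeric illustration (not a proof of the exercise): the eigenvalues of X^⊤X are non-negative up to floating-point error, consistent with the Hessian being PSD.

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.hstack([np.ones((30, 1)), rng.standard_normal((30, 4))])
hessian = X.T @ X
eigenvalues = np.linalg.eigvalsh(hessian)   # symmetric matrix -> real eigenvalues
print(eigenvalues.min() >= -1e-10)          # True: no (meaningfully) negative eigenvalue
```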
Linear regression: direct solution – 3
Setting ∇J(w) = 0 gives the normal equation X^⊤ (Xw − y) = 0, i.e. X^⊤ X w = X^⊤ y, so (when X^⊤ X is invertible)

  w* = (X^⊤ X)^{-1} X^⊤ y = X† y ,

where X† = (X^⊤ X)^{-1} X^⊤ is called the pseudo-inverse of X.
That is, the residual error e = Xw* − y is orthogonal (normal) to the range
space of X, hence the name!
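A sketch of the direct solution in NumPy (synthetic data; in practice np.linalg.lstsq is numerically preferable to forming (X^⊤X)^{-1} explicitly):

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.hstack([np.ones((100, 1)), rng.standard_normal((100, 3))])
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true + 0.05 * rng.standard_normal(100)

w_pinv = np.linalg.inv(X.T @ X) @ X.T @ y          # (X^T X)^{-1} X^T y
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)    # same solution, better conditioning

print(np.allclose(w_pinv, w_lstsq))                # True
print(np.max(np.abs(X.T @ (X @ w_pinv - y))))      # residual orthogonal to range of X
```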
The direct solution may not always be available
Iterative solution
The intuition behind gradient-descent
How GD works (image source: MathWorks)
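A minimal batch gradient-descent sketch for least squares (the step size, iteration count and synthetic data are arbitrary illustrative choices): repeat w ← w − η ∇J(w) = w − η X^⊤(Xw − y).

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.hstack([np.ones((200, 1)), rng.standard_normal((200, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + 0.1 * rng.standard_normal(200)

w = np.zeros(X.shape[1])
eta = 1e-3                          # small enough for this problem
for _ in range(5000):
    w -= eta * X.T @ (X @ w - y)    # full-batch gradient step
print(w)                            # close to [1, 2, -1]
```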
The least mean square (LMS) algorithm
• Basically applies gradient descent, but only on one randomly chosen sample
instead of the whole dataset
LMS algorithm (also known as Widrow-Hoff algo., or just stochastic gradient descent)
• init k ← 0, w ← w^(0)
• while not converged:
  – Randomly sample a data point (x_k, y_k)
  – Weight update: w ← w − η_k (x_k^⊤ w − y_k) x_k
  – k ← k + 1
• return w
Result: The LMS algorithm recovers a solution of the normal equation if the step-sizes
are chosen appropriately.
Advantage: The computational complexity of every update step is n times smaller
than the batch version.
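A minimal LMS / stochastic-gradient sketch (illustrative; the decaying schedule η_k = 1/(k + 10) is an assumption, since the slides only say the step sizes must be chosen appropriately):

```python
import numpy as np

rng = np.random.default_rng(5)
X = np.hstack([np.ones((500, 1)), rng.standard_normal((500, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + 0.1 * rng.standard_normal(500)

w = np.zeros(X.shape[1])
for k in range(20000):
    i = rng.integers(len(X))                 # one randomly chosen sample
    eta = 1.0 / (k + 10)                     # assumed decaying step size
    w -= eta * (X[i] @ w - y[i]) * X[i]      # w <- w - eta_k (x_k^T w - y_k) x_k
print(w)                                     # roughly approaches [1, 2, -1]
```

Each update touches a single sample, so its cost is n times smaller than a full-batch gradient step, as noted above.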
Generalized linear regression (GLR)
• ERM: min_w (1/2) ∥Φw − y∥_2^2  ⟹  w* = (Φ^⊤ Φ)^{-1} Φ^⊤ y, where Φ stacks the transformed inputs ϕ(x_i)^⊤ as rows
GLR can learn powerful models – 1
" #
x(1)
• Consider binary classification in 2D, x = (2)
x
• A simple linear model f (x) = w x will only allow you to learn lines for fitting the
⊤
data
• Using GLR, you can fit a polynomial, for instance (and more complicated
functions!)
1
x(1)
x(2)
ϕ : x 7→ ϕ(x) =
x(1) · x(2)
2
x(1)
2
x(2)
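A sketch of GLR with this quadratic feature map (the circular toy data and the 0.5 decision threshold are illustrative assumptions): build Φ row by row and reuse the least-squares solution.

```python
import numpy as np

def phi(x):
    # x = (x1, x2) -> (1, x1, x2, x1*x2, x1^2, x2^2)
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x2, x1**2, x2**2])

rng = np.random.default_rng(6)
points = rng.uniform(-1, 1, size=(200, 2))
labels = (points[:, 0]**2 + points[:, 1]**2 < 0.5).astype(float)   # circular boundary

Phi = np.array([phi(x) for x in points])            # n x 6 feature matrix
w, *_ = np.linalg.lstsq(Phi, labels, rcond=None)    # w* = (Phi^T Phi)^{-1} Phi^T y
preds = (Phi @ w > 0.5)
print(np.mean(preds == labels))                     # a plain line could not do this well
```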
GLR can learn powerful models – 2
Easy extension to vector-valued targets
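A minimal sketch of the extension (under the standard formulation the slide title suggests): stack vector targets y_i ∈ R^m as rows of Y ∈ R^{n×m}; the same normal equations then give a weight matrix W ∈ R^{(d+1)×m}, i.e. one least-squares problem per output column, solved simultaneously.

```python
import numpy as np

rng = np.random.default_rng(7)
X = np.hstack([np.ones((100, 1)), rng.standard_normal((100, 3))])   # n x (d+1)
W_true = rng.standard_normal((4, 2))                                # (d+1) x m, m = 2 outputs
Y = X @ W_true + 0.05 * rng.standard_normal((100, 2))               # n x m targets

W_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)    # lstsq accepts a 2-D right-hand side
print(np.max(np.abs(W_hat - W_true)))            # small: each column solved at once
```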