
Machine Learning 1 (CM50264)

Week 2, Lecture 2: Evaluating ML systems and linear regression

Subhadip Mukherjee
Acknowledgment: Tom S. F. Haines and Xi Chen (for images and other content)
12 Oct. 2022
University of Bath, UK

1
Evaluating and diagnosing ML systems

How can your machine learning system fail?

• Underfitting
• Overfitting
• Bad data
• ···

How do you detect?

2
Underfitting & overfitting

• Complicated (non-linear) decision boundary
• Classes overlap
• How would you divide them?

Three possible outcomes:
• Underfitting (over-simplistic classifier)
• Overfitting (too complicated classifier)
• Balanced (does reasonable fitting on training data, generalizes well on test data)

3
Underfitting & overfitting

• Underfitting: logistic regression
• Balanced: tuned random forest (sklearn, min_impurity_decrease=0.008, n_estimators=512)
• Overfitting: badly tuned random forest (sklearn, with default parameters)

4
Underfitting and overfitting in regression

You can fit a polynomial that exactly fits the training data, but does not generalize.
On the other extreme, your model could be too weak to fit the training data.
5
Underfitting causes

• Weak model
• Bad fitting, e.g., if the optimization was not good enough (random forest example again)

6
Overfitting causes

• Powerful model → can model training-specific noise
  +
• Insufficient regularization
  (Regularization controls model complexity, so that the model doesn't fit noise)

• Maybe insufficient training data?

• How to detect?

7
Train and test set

• Model can't overfit on data it doesn't see!
• Split the data:
  • A train set, to fit the model
  • A test set, to verify performance
• Large gap between train/test accuracy indicates overfitting (usually)

Accuracy:
  Random Forest    Train    Test
  Underfitting     79.2%    79.2%
  Balanced         97.6%    95.0%
  Overfitting      99.6%    94.7%

8
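To make the train/test comparison concrete, here is a minimal sketch, assuming scikit-learn and a synthetic dataset from make_classification (neither is part of the lecture); a large gap between the two scores is the overfitting signal described above.

```python
# Hypothetical sketch: detect overfitting by comparing train vs. test accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# A deliberately flexible model (default parameters) may overfit.
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train acc = {train_acc:.3f}, test acc = {test_acc:.3f}")
# A large train-test gap (e.g., 0.99 vs. 0.94) usually indicates overfitting.
```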
Hyperparameters I

• Parameters → fit to training data


• Hyperparameters → parameters that cannot be fit to training data

• Reasons for having hyperparameters:


• Avoiding overfitting, e.g. decision tree depth
• Controlling the amount of computation, e.g. ensemble size
• Hard to optimize (e.g., which degree of polynomial to fit?)
• Bayesian priors have hyperparameters associated with them

9
Hyperparameters II

• Can still fit hyperparameters . . .


(manually or by algorithm)

• . . . but not to the test set!


(this mistake can be found in countless research papers)

• Introduce a third set: validation set


• train – Give to algorithm
• validation – Objective of hyperparameter optimization
• test – To report final performance

10
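A minimal sketch of the three-way split, assuming scikit-learn's train_test_split and illustrative 60/20/20 proportions (the exact proportions are not specified in the slides):

```python
# Hypothetical sketch: carve out train / validation / test sets with two splits.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
# First split off 40% of the data, then halve it into validation and test (60/20/20 overall).
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)
# Fit on X_train, tune hyperparameters against X_val, report final performance on X_test.
```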
Measuring performance

• How do we decide on split percentages?


• Train large → Algorithm performs well
• Validation large → Hyperparameter optimization performs well
• Test large → Accurate performance estimate

• Good default: make validation and test as small as possible while still giving a reliable
  estimate, and use the rest for training
• ... but you might also shrink the train set due to computational cost
• "as small as possible" is hard to judge
11
k-fold cross validation

• Divide the train/validation data into 7 folds (here k = 7)
  • train: six parts
  • validation: one part
• Train for all 7 combinations and report the average performance on test
• Effectively, all samples in the train+val dataset get used for training and validation:
  artificially increases the train and val data size
• More robust estimate of model accuracy

• k-fold = k × slower! Typically 4 ≤ k ≤ 10

• Most extreme: jackknife resampling (leave-one-out cross-validation):
  validation sets of size 1; very slow
  • In practice mostly not done: time = money

12
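A minimal sketch of k-fold cross-validation, assuming scikit-learn's cross_val_score, a random forest, and synthetic data (all illustrative choices, not from the slides); k = 7 as in the example above:

```python
# Hypothetical sketch: estimate accuracy with 7-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
# cross_val_score trains 7 models and reports the score on each held-out fold.
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=7)
print(f"mean accuracy over 7 folds: {scores.mean():.3f} (+/- {scores.std():.3f})")
```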
Measure of performance: confusion matrix

• Makes sense for classification only
• Random forest on breast cancer:

                    Actual
                 False    True
  Predicted
     False         49        6
     True          14      159

• On diagonal means correct, off means wrong
• Can see which classes are confused
• An empty row is a problem
• May want to color code cells as a heat map!

13
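A minimal sketch of computing such a confusion matrix, assuming scikit-learn's breast-cancer dataset and a default random forest as stand-ins for the slide's example (the exact counts will differ):

```python
# Hypothetical sketch: confusion matrix for a random forest on breast-cancer data.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
y_pred = RandomForestClassifier(random_state=0).fit(X_train, y_train).predict(X_test)

# scikit-learn reports rows = actual class, columns = predicted class
# (i.e., the transpose of the table layout used on the slide).
print(confusion_matrix(y_test, y_pred))
```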
Some terminology

                     Actual
              False                   True
  Predicted
     False    True Negative (TN)      False Negative (FN)
     True     False Positive (FP)     True Positive (TP)

14
Some more terminology

Loads of terms are used (ignore most of them):

  TP / (TP + FN)                      sensitivity, recall, hit rate, true positive rate
  TN / (TN + FP)                      specificity, true negative rate
  TP / (TP + FP)                      precision, positive predictive value
  (TP + TN) / (TP + TN + FP + FN)     accuracy
  2·TP / (2·TP + FP + FN)             F1 score

(many more...)

15
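A minimal sketch that evaluates these definitions on the counts from the earlier breast-cancer table (TN = 49, FN = 6, FP = 14, TP = 159):

```python
# Hypothetical sketch: recall, specificity, precision, accuracy and F1 from TP/TN/FP/FN.
TN, FN, FP, TP = 49, 6, 14, 159  # counts from the confusion matrix shown earlier

recall      = TP / (TP + FN)                   # sensitivity, hit rate, true positive rate
specificity = TN / (TN + FP)                   # true negative rate
precision   = TP / (TP + FP)                   # positive predictive value
accuracy    = (TP + TN) / (TP + TN + FP + FN)
f1          = 2 * TP / (2 * TP + FP + FN)
print(recall, specificity, precision, accuracy, f1)
```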
Imbalanced data

• Imbalanced training set → makes training difficult
  • Cancer detection in CT scans: ≈ 0.1% of scans have cancer
  • 99.9% accuracy by predicting "no cancer" – meaningless!

• Often need to adjust training (e.g., oversampling)

• F1 score is better

• Balanced accuracy (see the sketch below):

$$
\frac{1}{|C|} \sum_{c \in C} \frac{|\{ y_i = c \wedge f_\theta(x_i) = c \}|}{|\{ y_i = c \}|}
$$

• $C$ = set of classes, of size $|C|$
• $(y_i, x_i)$ = data points
• $f_\theta(\cdot)$ = your classifier model

16
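A minimal sketch of the balanced-accuracy formula above, with a hypothetical imbalanced toy example; the helper name balanced_accuracy is illustrative (scikit-learn's balanced_accuracy_score computes the same quantity):

```python
# Hypothetical sketch: balanced accuracy = mean of per-class recalls.
import numpy as np

def balanced_accuracy(y_true, y_pred):
    classes = np.unique(y_true)
    # For each class c, fraction of class-c samples that the model labels as c.
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return np.mean(recalls)

y_true = np.array([0] * 990 + [1] * 10)   # heavily imbalanced labels
y_pred = np.zeros(1000, dtype=int)        # classifier that always predicts "no cancer"
print(balanced_accuracy(y_true, y_pred))  # 0.5, even though plain accuracy is 0.99
```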
Linear regression

17
Linear regression – 1

• Training data: $(x_i, y_i)_{i=1}^{n}$, with $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$

• Objective: fit a function $f(x) = \sum_{j=1}^{d} w_j x^{(j)} + b = w^\top x + b$, where $w \in \mathbb{R}^d$, $b \in \mathbb{R}$

• We use the same trick that we used in Lecture 1:

  $f(x) = \begin{bmatrix} b & w^\top \end{bmatrix} \begin{bmatrix} 1 \\ x \end{bmatrix} = \bar{w}^\top \bar{x}$, where $\bar{w}, \bar{x} \in \mathbb{R}^{d+1}$

• We will, however, write $f(x) = w^\top x$ and understand $x \in \mathbb{R}^{d+1}$ as the augmented
  feature vector (with its first element being 1, and $w$ encompassing both the weight
  vector and the bias term)

• Empirical risk minimization (ERM): $\displaystyle \min_{w} \; \frac{1}{2} \sum_{i=1}^{n} \left( w^\top x_i - y_i \right)^2$

18
Linear regression – 2

• A more compact formulation of ERM:

• Let $X \in \mathbb{R}^{n \times (d+1)}$ be such that its $i$-th row is $x_i^\top$:

$$
X = \begin{bmatrix} \text{---} \; x_1^\top \; \text{---} \\ \text{---} \; x_2^\top \; \text{---} \\ \vdots \\ \text{---} \; x_n^\top \; \text{---} \end{bmatrix}_{n \times (d+1)}, \quad
y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}_{n \times 1}, \quad
w = \begin{bmatrix} w_0 = b \\ w_1 \\ w_2 \\ \vdots \\ w_d \end{bmatrix}_{(d+1) \times 1}
$$

ERM can be expressed using the following equivalent matrix-vector notation:

$$
\min_{w} \; \frac{1}{2} \| Xw - y \|_2^2, \quad \text{where } \|z\|_2^2 = \sum_i z_i^2
$$

19
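A minimal NumPy sketch, with hypothetical random data, of building the augmented design matrix $X$ (first column of ones) and evaluating the ERM objective $J(w)$:

```python
# Hypothetical sketch: augmented design matrix and the objective J(w) = 0.5 * ||Xw - y||^2.
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X_raw = rng.normal(size=(n, d))
y = rng.normal(size=n)

X = np.hstack([np.ones((n, 1)), X_raw])   # prepend a column of ones: x -> (1, x)
w = np.zeros(d + 1)                       # w[0] plays the role of the bias b

def J(w):
    r = X @ w - y
    return 0.5 * r @ r

print(J(w))
```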
Linear regression: direct solution – 1
Let $J(w) = \dfrac{1}{2} \sum_{i=1}^{n} \left( w^\top x_i - y_i \right)^2 = \dfrac{1}{2} \| Xw - y \|_2^2$

Observations:

$$
\frac{\partial J(w)}{\partial w_t} = \sum_{i=1}^{n} \left( w^\top x_i - y_i \right) \frac{\partial (w^\top x_i)}{\partial w_t} = \sum_{i=1}^{n} \left( w^\top x_i - y_i \right) x_{it}, \quad \text{where } t = 0, 1, 2, \dots, d
$$

i.e., $\dfrac{\partial J(w)}{\partial w_t} = \left[ X^\top (Xw - y) \right]_t$

• $\nabla J(w) = \begin{bmatrix} \frac{\partial J(w)}{\partial w_0} \\ \frac{\partial J(w)}{\partial w_1} \\ \vdots \\ \frac{\partial J(w)}{\partial w_d} \end{bmatrix}_{(d+1) \times 1} = X^\top (Xw - y)$: called the gradient vector

20
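A minimal sketch, on hypothetical random data, that checks the closed-form gradient $X^\top(Xw - y)$ against a finite-difference approximation:

```python
# Hypothetical sketch: verify the gradient X^T (Xw - y) with central finite differences.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 4
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
y = rng.normal(size=n)
w = rng.normal(size=d + 1)

J = lambda w: 0.5 * np.sum((X @ w - y) ** 2)
grad = X.T @ (X @ w - y)

eps = 1e-6
num_grad = np.array([(J(w + eps * e) - J(w - eps * e)) / (2 * eps) for e in np.eye(d + 1)])
print(np.max(np.abs(grad - num_grad)))   # should be tiny (finite-difference error only)
```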
Linear regression: direct solution – 2

For $s, t = 0, 1, 2, \dots, d$:

$$
\frac{\partial^2 J(w)}{\partial w_s \, \partial w_t} = \frac{\partial}{\partial w_s} \sum_{i=1}^{n} \left( w^\top x_i - y_i \right) x_{it} = \sum_{i=1}^{n} x_{it} \, x_{is} = \left[ X^\top X \right]_{t,s}
$$

• $\nabla^2 J(w) = X^\top X$: called the Hessian matrix ($(d+1) \times (d+1)$)

Exercise: Show that the Hessian matrix is positive semi-definite (PSD), i.e., for any
$u \in \mathbb{R}^{d+1}$, the quadratic form of the Hessian satisfies $u^\top \left( \nabla^2 J(w) \right) u \ge 0$

• A function that is twice differentiable and has a PSD Hessian is called a convex function.
• For a convex $J(w)$, any solution to $\nabla J(w) = 0$ minimizes the function.

21
Linear regression: direct solution – 3

Therefore, to find the optimal $w$ (which we will denote by $w^*$), we need to solve

$$
\nabla J(w) \big|_{w = w^*} = X^\top (Xw^* - y) = 0.
$$

$X^\top (Xw^* - y) = 0$: called the normal equation

$$
\implies \left( X^\top X \right) w^* = X^\top y
$$

This can be solved directly if $X^\top X$ is invertible, and the solution is given by

$$
w^* = \left( X^\top X \right)^{-1} X^\top y = X^\dagger y,
$$

where $X^\dagger = \left( X^\top X \right)^{-1} X^\top$ is called the pseudo-inverse of $X$.

• For the normal matrix $X^\top X \in \mathbb{R}^{(d+1) \times (d+1)}$ to be invertible, we need $n \ge d+1$
• This condition is necessary for the existence of the inverse, but not sufficient

22
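A minimal NumPy sketch, on hypothetical synthetic data, of solving the normal equation directly; np.linalg.lstsq is shown alongside as the numerically preferable least-squares route:

```python
# Hypothetical sketch: solve the normal equation (X^T X) w = X^T y with NumPy.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
w_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

w_star = np.linalg.solve(X.T @ X, X.T @ y)       # direct solve of the normal equation
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares solver (better conditioned)
print(w_star, w_lstsq)                           # both close to w_true
```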
Why ‘normal’ equation?

• Consider the range space of $X$, given by $\{ v \in \mathbb{R}^n : v = Xw \}$

• The algorithm seeks to project $y$ onto this space:

$$
v^\top (Xw^* - y) = (Xw)^\top (Xw^* - y) = w^\top \underbrace{X^\top (Xw^* - y)}_{=0} = 0
$$

That is, the residual error $e = Xw^* - y$ is orthogonal (normal) to the range space of $X$,
hence the name!

23
The direct solution may not always be available

• $X^\top X$ might not always be invertible (possible reasons include $n < d+1$ and/or
  linearly dependent feature vectors)

• A direct solution may not even exist!

• Matrix inversion (or computing the pseudo-inverse) is generally an expensive operation.
  It could be infeasible for large $d$ and/or $n$

Can we approximate the solution iteratively?

24
Iterative solution

A very simple iterative algorithm is gradient descent (GD):

$$
w(k+1) = w(k) - \eta_k \nabla J(w(k)) = w(k) - \eta_k X^\top \left( Xw(k) - y \right)
$$

• For $\eta_k$ small enough, GD converges to a global minimum for convex $J$ (and to a
  stationary point for non-convex $J$), subject to some additional technical requirements

• You will learn more about GD in Xi's lectures on optimization

25
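A minimal sketch of batch gradient descent for this objective, on hypothetical synthetic data; the constant step size and iteration count are illustrative choices, not values from the lecture:

```python
# Hypothetical sketch: batch gradient descent for the least-squares objective.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=n)

w = np.zeros(d + 1)
eta = 1e-3                        # small constant step size (eta_k = eta for all k)
for k in range(5000):
    w -= eta * X.T @ (X @ w - y)  # w(k+1) = w(k) - eta * gradient
print(w)                          # approaches the normal-equation solution
```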
The intuition behind gradient-descent

[Figure: illustration of how GD works; image source: MathWorks]

26
The least mean square (LMS) algorithm

• Basically applies gradient descent, but only on one randomly chosen sample
  instead of the whole dataset

LMS algorithm (also known as the Widrow-Hoff algorithm, or just stochastic gradient descent):
• init $k \leftarrow 0$, $w \leftarrow w(0)$
• while not converged:
  – sample randomly a data point $(x_k, y_k)$
  – do weight update: $w \leftarrow w - \eta_k \left( x_k^\top w - y_k \right) x_k$
• return $w$

Result: The LMS algorithm recovers a solution of the normal equation if the step sizes
are chosen appropriately.

Advantage: The computational complexity of every update step is $n$ times smaller
than that of the batch version.

27
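A minimal sketch of the LMS / stochastic update on the same kind of hypothetical data; the decaying step-size schedule is one possible choice, not one prescribed in the lecture:

```python
# Hypothetical sketch: the LMS / stochastic-gradient update, one random sample per step.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=n)

w = np.zeros(d + 1)
for k in range(20000):
    i = rng.integers(n)                    # pick one data point at random
    eta_k = 1.0 / (100 + k)                # a decaying step size (one possible schedule)
    w -= eta_k * (X[i] @ w - y[i]) * X[i]  # single-sample version of the GD update
print(w)                                   # close to the normal-equation solution
```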
Generalized linear regression (GLR)

• Let $\phi : \mathbb{R}^d \mapsto \mathbb{R}^D$ be a feature transformation, where $D > d$

• Let our model be $f(x) = \sum_{j=1}^{D} w_j \phi^{(j)}(x) + b = w^\top \phi + b$

• Using the augmentation trick, we will write $f(x) = w^\top \phi$, where $w, \phi \in \mathbb{R}^{D+1}$

$$
\Phi = \begin{bmatrix} \text{---} \; \phi(x_1)^\top \; \text{---} \\ \text{---} \; \phi(x_2)^\top \; \text{---} \\ \vdots \\ \text{---} \; \phi(x_n)^\top \; \text{---} \end{bmatrix}_{n \times (D+1)}, \quad
y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}_{n \times 1}, \quad
w = \begin{bmatrix} w_0 = b \\ w_1 \\ w_2 \\ \vdots \\ w_D \end{bmatrix}_{(D+1) \times 1}
$$

• ERM: $\displaystyle \min_{w} \; \frac{1}{2} \| \Phi w - y \|_2^2 \implies w^* = \left( \Phi^\top \Phi \right)^{-1} \Phi^\top y$

28
GLR can learn powerful models – 1

• Consider binary classification in 2D, $x = \begin{bmatrix} x^{(1)} \\ x^{(2)} \end{bmatrix}$

• A simple linear model $f(x) = w^\top x$ will only allow you to learn lines for fitting the data

• Using GLR, you can fit a polynomial, for instance (and more complicated functions!):

$$
\phi : x \mapsto \phi(x) = \begin{bmatrix} 1 \\ x^{(1)} \\ x^{(2)} \\ x^{(1)} \cdot x^{(2)} \\ \left( x^{(1)} \right)^2 \\ \left( x^{(2)} \right)^2 \end{bmatrix}
$$

29
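A minimal sketch of GLR with exactly this quadratic feature map, fit by least squares on hypothetical 2-D data:

```python
# Hypothetical sketch: generalized linear regression with the 2-D quadratic feature map above.
import numpy as np

def phi(x):
    # x = (x1, x2) -> (1, x1, x2, x1*x2, x1^2, x2^2)
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x2, x1**2, x2**2])

rng = np.random.default_rng(0)
X_raw = rng.uniform(-1, 1, size=(300, 2))
y = X_raw[:, 0]**2 + X_raw[:, 1]**2 + 0.05 * rng.normal(size=300)  # a non-linear target

Phi = np.array([phi(x) for x in X_raw])           # n x (D+1) design matrix
w_star, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # solves min_w 0.5 * ||Phi w - y||^2
print(np.round(w_star, 2))                        # weights on x1^2 and x2^2 come out near 1
```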
GLR can learn powerful models – 2

• Interestingly, the feature map ϕ can even transform the features into an
  infinite-dimensional space (without actually ever computing ϕ explicitly)
• Leads to the so-called kernel linear regression algorithm (more about it later in
the course)
• The idea is to have a kernel function K such that K(x1 , x2 ) = ϕ(x1 )⊤ ϕ(x2 ), for
any x1 and x2
• K must be efficiently computable (bypasses the task of inner-product in a very
high-dimensional space)
• More about kernels in the context of SVMs
• Even state-of-the-art deep neural networks can be approximated using kernel
linear regression (neural tangent kernels)

30
Easy extension to vector-valued targets

• Target variable $y \in \mathbb{R}^m$, $m > 1$: fit $m$ linear models, one per output dimension

$$
\min_{W} \; \frac{1}{2} \| XW - Y \|_2^2 \implies W^* = \left( X^\top X \right)^{-1} X^\top Y
$$

$$
X = \begin{bmatrix} \text{---} \; x_1^\top \; \text{---} \\ \text{---} \; x_2^\top \; \text{---} \\ \vdots \\ \text{---} \; x_n^\top \; \text{---} \end{bmatrix}_{n \times (d+1)}, \quad
Y = \begin{bmatrix} \text{---} \; y_1^\top \; \text{---} \\ \text{---} \; y_2^\top \; \text{---} \\ \vdots \\ \text{---} \; y_n^\top \; \text{---} \end{bmatrix}_{n \times m}
$$

$$
W = \begin{bmatrix} w_0^{(1)} & w_0^{(2)} & \cdots & w_0^{(m)} \\ w_1^{(1)} & w_1^{(2)} & \cdots & w_1^{(m)} \\ \vdots & \vdots & \ddots & \vdots \\ w_d^{(1)} & w_d^{(2)} & \cdots & w_d^{(m)} \end{bmatrix}_{(d+1) \times m}
$$

31
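A minimal NumPy sketch, on hypothetical data, of the multi-output fit: the same normal equation, now with a matrix right-hand side.

```python
# Hypothetical sketch: fitting m outputs at once; each column of W is one linear model.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 200, 3, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])  # n x (d+1), augmented
W_true = rng.normal(size=(d + 1, m))                       # (d+1) x m
Y = X @ W_true + 0.1 * rng.normal(size=(n, m))             # n x m targets

W_star = np.linalg.solve(X.T @ X, X.T @ Y)                 # normal equation, matrix RHS
print(np.max(np.abs(W_star - W_true)))                     # small estimation error
```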
