
School of Computing and Information Systems

The University of Melbourne


COMP90049 Introduction to Machine Learning (Semester 1, 2022)
Sample solutions: Week 8

1. What is the difference between “model bias” and “model variance”?

Model Bias:
– Model bias is the propensity of a classifier to systematically produce the same errors; if it doesn’t
produce errors, it is unbiased; if it produces different kinds of errors on different instances, it is also
unbiased. (An example of the latter: the instance is truly of class A, but sometimes the system calls it B
and sometimes the system calls it C.)

– The notion of bias is slightly more natural in a regression context, where we can sensibly measure the
difference between the prediction and the true value. In a classification context, the prediction and the
true value can only be "same" or "different".
– Consequently, a typical interpretation of bias in a classification context is whether the classifier labels
the test data in such a way that the distribution of predicted classes systematically doesn’t match the
distribution of actual classes. For example, "bias towards the majority class", when the model predicts
too many instances as the majority class.
Model Variance:
– Model variance is the tendency of a classifier to produce different classifications if it was trained on
different training sets (randomly sampled from the same population). It is a measure of the inconsistency
of the classifier across different training sets.
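
This distinction can be made concrete with a small simulation. The following is a minimal sketch, not part of the original solutions: it assumes scikit-learn and uses a common 0/1-loss formulation in which bias is measured as the error of the majority-vote prediction across retrained models, and variance as the average disagreement with that majority prediction.

```python
# Minimal sketch: empirically estimating model bias and variance by retraining
# the same learner on different training sets drawn from the same population.
# (Illustrative only; dataset, learner, and sample sizes are arbitrary choices.)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=5000, n_features=10, random_state=0)
X_pool, y_pool = X[:4000], y[:4000]    # population to sample training sets from
X_test, y_test = X[4000:], y[4000:]    # fixed evaluation set

preds = []
for seed in range(20):
    idx = rng.choice(len(X_pool), size=500, replace=False)  # a fresh training set
    model = DecisionTreeClassifier(random_state=seed).fit(X_pool[idx], y_pool[idx])
    preds.append(model.predict(X_test))
preds = np.array(preds)                # shape: (20 retrained models, n_test)

# "Main" prediction: the majority vote of the 20 retrained models.
majority = (preds.mean(axis=0) > 0.5).astype(int)
variance = (preds != majority).mean()  # how often a model disagrees with the vote
bias = (majority != y_test).mean()     # how often even the vote is wrong
print(f"bias ~ {bias:.3f}, variance ~ {variance:.3f}")
```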

(i). Why is a high bias, low variance classifier undesirable?

In short, because it’s consistently wrong. Using the other interpretation: the distribution of labels
predicted by the classifier is consistently different to the distribution of the true labels; this means
that it must be making mistakes.

(ii). Why is a low bias, high variance classifier (usually) undesirable?

This is less obvious: it's low bias, so it must be making a good number of correct decisions. The fact that
it's high variance means that not all of the predictions can possibly be correct (or it would be low-
variance!), and the correct predictions will change, perhaps drastically, as we change the training
data.
One obvious problem here is that it’s difficult to be certain about the performance of the classifier at
all: we might estimate its error rate to be low on one set of data, and high on another set of data.
The real issue becomes more obvious when we consider the alternative formulation: the low bias
means that the distribution of predictions matches the distribution of true labels; however, the high
variance means that which instances are getting assigned to which label must be changing every time.
This suggests the real problem — namely, that what we have is the second kind of unbiased classifier:
one that makes different kinds of errors on different training sets, but always errors; and not the first
kind: one that is usually correct.

2. Describe how a validation set and cross-validation can help to reduce overfitting.

Machine learning models usually have one or more (hyper)parameters that control model complexity:
the ability of the model to fit noise in the training set. In a practical application, we need to determine the values
of such parameters, and the principal objective in doing so is usually to achieve the best predictive
performance on new data. Furthermore, as well as finding the appropriate values for complexity parameters
within a given model, we may wish to consider a range of different types of model in order to find the best
one for our particular application.
We know that the performance on training data is not a good indicator of predictive performance on unseen
data because of overfitting. If data is plentiful, then one approach is simply to use some of the available data
to train a range of models, or a given model with a range of values for its complexity parameters, and then
to compare them on independent data, sometimes called a validation set, and select the one having the
best predictive performance. If the model design is iterated many times using a limited size data set, then
some overfitting to the validation data can occur and so it may be necessary to keep aside a third test set
on which the performance of the selected model is finally evaluated.
In many applications, however, the supply of data for training and testing will be limited, and in order to
build good models, we wish to use as much of the available data as possible for training. However, if the
validation set is small, it will give a relatively noisy estimate of predictive performance. One solution to this
dilemma is to use cross-validation.
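
As an illustration, here is a minimal sketch of the procedure described above (not part of the original solutions; it assumes scikit-learn, and the dataset and candidate hyperparameter values are arbitrary choices): model selection via cross-validation on the development data, with a test set that is only touched once at the end.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
# Keep a final test set aside; it plays no role in model selection.
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 5-fold cross-validation on the development data replaces a single (noisy)
# validation split: each complexity value is scored as the mean over 5 folds.
best_depth, best_score = None, -1.0
for depth in [1, 2, 4, 8, None]:          # candidate complexity parameter values
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    score = cross_val_score(model, X_dev, y_dev, cv=5).mean()
    if score > best_score:
        best_depth, best_score = depth, score

# The test set is touched exactly once, after model selection is finished.
final = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_dev, y_dev)
print("chosen max_depth:", best_depth, "| test accuracy:", final.score(X_test, y_test))
```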

3. Why does ensembling reduce model variance?

We know from statistics that averaging reduces variance: if $Z_1, \ldots, Z_N$ are i.i.d. random variables, then

$$\mathrm{Var}\left(\frac{1}{N}\sum_{i=1}^{N} Z_i\right) = \frac{1}{N}\,\mathrm{Var}(Z_i)$$

So, the idea is that if several models are averaged, the model variance decreases without having an effect
on bias. The problem is that there is only one training set, so how do we get multiple models? The answer
to this problem is ensembling. Ensembling creates multiple models by creating multiple training sets from
one training set via bootstrap resampling (e.g., bagging, random forests), or by training multiple different
learning algorithms on the same data (e.g., stacking). The predictions of the individual models are then
combined (e.g., averaged) to reduce the final model variance.
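
A minimal sketch of bagging along these lines (not part of the original solutions; scikit-learn assumed, with an arbitrary synthetic dataset): B bootstrap resamples of the single training set, one decision tree per resample, and a majority vote over the B predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1200, random_state=0)
X_train, y_train = X[:1000], y[:1000]
X_test, y_test = X[1000:], y[1000:]

rng = np.random.RandomState(0)
B, n = 50, len(X_train)
votes = []
for b in range(B):
    idx = rng.randint(0, n, size=n)   # bootstrap sample: n draws with replacement
    tree = DecisionTreeClassifier(random_state=b).fit(X_train[idx], y_train[idx])
    votes.append(tree.predict(X_test))

# Combine the B individual predictions: a majority vote, the classification
# analogue of averaging.
bagged = (np.mean(votes, axis=0) > 0.5).astype(int)

single = DecisionTreeClassifier(random_state=0).fit(X_train, y_train).predict(X_test)
print("single tree accuracy:", (single == y_test).mean())
print("bagged ensemble accuracy:", (bagged == y_test).mean())
```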

4. Consider the following training set:

(𝒙𝟏 , 𝒙𝟐 ) y
(0,0) 0
(0,1) 1
(1,1) 1

Consider the initial weights $\theta = \{\theta_0, \theta_1, \theta_2\} = \{0.2, -0.4, 0.1\}$ and the activation
function of the perceptron as the step function

$$f(\Sigma) = \begin{cases} 1 & \text{if } \Sigma > 0 \\ 0 & \text{otherwise.} \end{cases}$$

a) Can the perceptron learn a perfect solution for this data set?

A perceptron can only learn a perfect solution for linearly separable problems, so you should ask yourself
whether the data set is linearly separable. Indeed, it is. Imagine drawing the points in a coordinate
system: you will find that you can separate the negative example (0,0) from the positive ones ((0,1) and (1,1))
with a straight line.

[Figure: the three training points plotted in the (x₁, x₂) plane, with a straight line separating (0,0) from (0,1) and (1,1).]

b) Draw the perceptron graph and calculate the accuracy of the perceptron on the training data
before training.

To calculate the accuracy of the system, we first need to compute the output (prediction) of our
perceptron for each instance and then compare it to the actual class labels.

(x₁, x₂) | Σ = θ₀ + θ₁x₁ + θ₂x₂        | ŷ = f(Σ)    | y
(0, 0)   | 0.2 − 0.4×0 + 0.1×0 = 0.2   | f(0.2) = 1  | 0
(0, 1)   | 0.2 − 0.4×0 + 0.1×1 = 0.3   | f(0.3) = 1  | 1
(1, 1)   | 0.2 − 0.4×1 + 0.1×1 = −0.1  | f(−0.1) = 0 | 1

As you can see, only one of the three predictions (outputs) of our perceptron matches the actual label,
and therefore the accuracy of our perceptron is 1/3 at this stage.

c) Using the perceptron learning rule and a learning rate of η = 0.2, train the perceptron for
one epoch. What are the weights after the training?

Recall that the perceptron learning rule, at iteration $t$ and for training instance $i$, updates weight $j$ as follows:

$$\theta_j^{(t)} \leftarrow \theta_j^{(t-1)} + \eta\,\big(y^{i} - \hat{y}^{i,(t)}\big)\,x_j^{i}$$

For epoch 1 we will have:


Instance 1, (0, 0) with y = 0:
$\Sigma = 0.2 - 0.4 \times 0 + 0.1 \times 0 = 0.2$; $\hat{y} = f(0.2) = 1 \neq y$, so we update $\theta$:

$\theta_0^{(1)} = \theta_0^{(0)} + \eta\,(y^{1} - \hat{y}^{1,(1)})\,x_0^{1} = 0.2 + 0.2\,(0 - 1)\cdot 1 = 0$
$\theta_1^{(1)} = \theta_1^{(0)} + \eta\,(y^{1} - \hat{y}^{1,(1)})\,x_1^{1} = -0.4 + 0.2\,(0 - 1)\cdot 0 = -0.4$
$\theta_2^{(1)} = \theta_2^{(0)} + \eta\,(y^{1} - \hat{y}^{1,(1)})\,x_2^{1} = 0.1 + 0.2\,(0 - 1)\cdot 0 = 0.1$

Instance 2, (0, 1) with y = 1:
$\Sigma = 0 - 0.4 \times 0 + 0.1 \times 1 = 0.1$; $\hat{y} = f(0.1) = 1 = y$: correct prediction, so no update.

Instance 3, (1, 1) with y = 1:
$\Sigma = 0 - 0.4 \times 1 + 0.1 \times 1 = -0.3$; $\hat{y} = f(-0.3) = 0 \neq y$, so we update $\theta$:

$\theta_0^{(2)} = \theta_0^{(1)} + \eta\,(y^{3} - \hat{y}^{3,(2)})\,x_0^{3} = 0 + 0.2\,(1 - 0)\cdot 1 = 0.2$
$\theta_1^{(2)} = \theta_1^{(1)} + \eta\,(y^{3} - \hat{y}^{3,(2)})\,x_1^{3} = -0.4 + 0.2\,(1 - 0)\cdot 1 = -0.2$
$\theta_2^{(2)} = \theta_2^{(1)} + \eta\,(y^{3} - \hat{y}^{3,(2)})\,x_2^{3} = 0.1 + 0.2\,(1 - 0)\cdot 1 = 0.3$

The weights after one epoch of training are therefore $\theta = \{0.2, -0.2, 0.3\}$.

d) What is the accuracy of the perceptron on the training data after training for one epoch? Did the
accuracy improve?

With the new weights we get


• for instance (0, 0) with y = 0: 0.2 − 0.2×0 + 0.3×0 = 0.2; f(0.2) = 1; incorrect
• for instance (0, 1) with y = 1: 0.2 − 0.2×0 + 0.3×1 = 0.5; f(0.5) = 1; correct
• for instance (1, 1) with y = 1: 0.2 − 0.2×1 + 0.3×1 = 0.3; f(0.3) = 1; correct
The accuracy of our perceptron is now 2/3, so the accuracy of the system has improved.
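
For reference, the whole exercise can be checked with a few lines of code. This is a minimal sketch (not part of the original solutions) in plain Python that reproduces parts (b)–(d):

```python
# Minimal sketch reproducing parts (b)-(d): step activation, initial
# predictions, one epoch of perceptron-rule updates, and final accuracy.
data = [((0, 0), 0), ((0, 1), 1), ((1, 1), 1)]   # ((x1, x2), y)
theta = [0.2, -0.4, 0.1]                         # [theta0, theta1, theta2]
eta = 0.2                                        # learning rate

def predict(theta, x):
    s = theta[0] + theta[1] * x[0] + theta[2] * x[1]   # Sigma
    return 1 if s > 0 else 0                           # step activation f

def accuracy(theta):
    return sum(predict(theta, x) == y for x, y in data) / len(data)

print("accuracy before training:", accuracy(theta))   # 1/3

# One epoch: the perceptron updates after every single instance.
for (x1, x2), y in data:
    y_hat = predict(theta, (x1, x2))
    for j, xj in enumerate([1, x1, x2]):   # x0 = 1 is the bias input
        theta[j] += eta * (y - y_hat) * xj

print("weights after one epoch:", theta)            # [0.2, -0.2, 0.3] (up to rounding)
print("accuracy after training:", accuracy(theta))  # 2/3
```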

5. [OPTIONAL] Why is a perceptron (which uses a sigmoid activation function) equivalent
to logistic regression?
A perceptron has a weight associated with each input (attribute); the output is acquired by (1) summing up
the weighted input features and (2) applying the activation function to the summed value. The standard
activation function for the perceptron is the step function (as shown in the lectures); however, it can be
replaced with a different (appropriate) function. For example, we could use the sigmoid activation function
$\sigma(x) = \frac{1}{1 + e^{-x}}$ and apply it to the linear combination of inputs ($\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots$),
giving $f(\theta^\top x) = \sigma(\theta^\top x)$. This now has the same form as the logistic regression model.
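
To make the equivalence of the forward computation concrete, here is a minimal sketch (not part of the original solutions; NumPy assumed): a perceptron whose activation is the sigmoid produces exactly the logistic regression hypothesis σ(θᵀx).

```python
import numpy as np

def sigmoid(z):
    # The logistic (sigmoid) function: sigma(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.2, -0.4, 0.1])   # weights [theta0, theta1, theta2]
x = np.array([1.0, 0.0, 1.0])        # input with x0 = 1 as the bias term

# Perceptron view: weighted sum of the inputs, then the activation function.
perceptron_output = sigmoid(theta[0]*x[0] + theta[1]*x[1] + theta[2]*x[2])

# Logistic regression view: sigma(theta^T x).
logistic_output = sigmoid(theta @ x)

assert np.isclose(perceptron_output, logistic_output)
print(perceptron_output)             # interpreted as P(y = 1 | x)
```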
Note also that the perceptron and logistic regression have different objective functions. In logistic
regression, we use the cross-entropy loss (negative log-likelihood) to optimize the weights (θ), while the
objective of the perceptron is simply based on counting errors. The two models will only be completely
equivalent if we change (1) the objective function to the cross-entropy loss and (2) the activation
function to the sigmoid.
The perceptron and logistic regression also have different learning mechanisms (but this difference doesn't
impact the equivalence of the models). For logistic regression, the weights are typically updated after all the
training instances have been processed (batch gradient descent: one update per full pass over the data).
However, we could also apply mini-batch or stochastic gradient descent, updating the weights after a subset
of instances has been observed (and the subset can be as small as a single instance). The perceptron by
definition updates its weights after processing each instance, so the weights (θ) are updated several times
in each iteration over the training data.
