cs188 Fa23 Note21
Linear Regression
Now we’ll move on from our previous discussion of Naive Bayes to Linear Regression. This method, also
called least squares, dates all the way back to Carl Friedrich Gauss and is one of the most studied tools in
machine learning and econometrics.
Regression problems are a form of machine learning problem in which the output is a continuous variable
(denoted with y). The features can be either continuous or categorical. We will denote a set of features with x ∈ R^n for n features, i.e. x = (x_1, ..., x_n).
We use the following linear model to predict the output:
$$h_w(x) = w_0 + w_1 x_1 + \dots + w_n x_n$$
where the weights w_i of the linear model are what we want to estimate. The weight w_0 corresponds to the intercept of the model. Sometimes in the literature we add a 1 to the feature vector x so that we can write the linear model as w^T x, where now x ∈ R^{n+1}. To train the model, we need a metric of how well our model predicts the output. For that we will use the L2 loss function, which penalizes the difference between the predicted and the actual output using the L2 norm. If our training dataset has N data points then the loss function is defined as follows:
$$\text{Loss}(h_w) = \frac{1}{2} \sum_{j=1}^{N} L_2(y_j, h_w(x_j)) = \frac{1}{2} \sum_{j=1}^{N} \left( y_j - h_w(x_j) \right)^2 = \frac{1}{2} \| y - Xw \|_2^2$$
Note that x_j corresponds to the jth data point, x_j ∈ R^n. The term 1/2 is just added to simplify the expressions of the closed form solution. The last expression is an equivalent formulation of the loss function which makes the least squares derivation easier. The quantities y, X, and w are defined as follows:
$$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}, \qquad X = \begin{bmatrix} 1 & x_1^1 & \cdots & x_n^1 \\ 1 & x_1^2 & \cdots & x_n^2 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^N & \cdots & x_n^N \end{bmatrix}, \qquad w = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_n \end{bmatrix}$$
To find the weights that minimize the loss, we take the gradient with respect to w and set it equal to zero. Using the rules of matrix algebra (you can refer to The Matrix Cookbook):
$$\nabla_w \frac{1}{2} \| y - Xw \|_2^2 = \frac{1}{2} \nabla_w (y - Xw)^T (y - Xw) = \frac{1}{2} \nabla_w \left( y^T y - y^T X w - w^T X^T y + w^T X^T X w \right)$$
$$= \frac{1}{2} \nabla_w \left( y^T y - 2 w^T X^T y + w^T X^T X w \right) = -X^T y + X^T X w.$$
Setting the gradient equal to zero yields the normal equations $X^T X w = X^T y$, and hence the closed form solution $\hat{w} = (X^T X)^{-1} X^T y$.
Having obtained the estimated vector of weights $\hat{w}$, we can now make a prediction on a new, unseen test data point $x$ (augmented with a leading 1 as before) as follows:
$$h_{\hat{w}}(x) = \hat{w}^T x$$
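To make the closed form concrete, here is a minimal sketch in Python with NumPy. The data values are made up for illustration, and np.linalg.solve is used in place of an explicit matrix inverse, which is the numerically preferable way to solve the normal equations.

```python
import numpy as np

# Toy training data: N = 4 points, n = 2 features (values are made up).
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5],
                  [4.0, 3.0]])
y = np.array([3.1, 3.9, 6.2, 8.5])

# Prepend a column of 1s so that w[0] plays the role of the intercept w_0.
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

# Closed form solution: solve X^T X w = X^T y (the normal equations).
w = np.linalg.solve(X.T @ X, X.T @ y)

# The L2 loss on the training set, matching the definition above.
loss = 0.5 * np.sum((y - X @ w) ** 2)

# Predict on a new point by augmenting it with a leading 1.
x_new = np.array([1.0, 2.5, 1.0])
prediction = w @ x_new
```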
Perceptron
Linear Classifiers
The core idea behind Naive Bayes is to extract certain attributes of the training data called features and then
estimate the probability of a label given the features: P(y| f1 , f2 , ... fn ). Thus, given a new data point, we
can then extract the corresponding features, and classify the new data point with the label with the highest
probability given the features. All of this, however, requires us to estimate distributions, which we did with MLE. What if we instead decided not to estimate the probability distribution? Let's start by looking at a simple linear classifier, which we can use for binary classification: the label has two possibilities, positive or negative.
The basic idea of a linear classifier is to do classification using a linear combination of the features, a value which we call the activation. Concretely, the activation function takes in a data point, multiplies each feature of our data point, f_i(x), by a corresponding weight, w_i, and outputs the sum of all the resulting values. In vector form, we can also write this as a dot product of our weights as a vector, w, and our featurized data point as a vector f(x):
$$\text{activation}_w(x) = h_w(x) = \sum_i w_i f_i(x) = w^T f(x)$$
How does one do classification using the activation? For binary classification, when the activation of a data
point is positive, we classify that data point with the positive label, and if it is negative, we classify with the
negative label.
$$\text{classify}(x) = \begin{cases} + & \text{if } h_w(x) > 0 \\ - & \text{if } h_w(x) < 0 \end{cases}$$
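As a tiny sketch of this rule (the weight and feature values here are made up), classification is just the sign of a dot product. Following the convention adopted later in this note, we break ties at an activation of exactly 0 toward the positive class:

```python
import numpy as np

def classify(w: np.ndarray, f_x: np.ndarray) -> str:
    """Label a point by the sign of its activation w^T f(x)."""
    activation = w @ f_x
    # Points exactly on the decision boundary (activation == 0) are
    # labeled positive, matching the convention used later in this note.
    return "+" if activation >= 0 else "-"

w = np.array([2.0, -1.0, 0.5])    # made-up weight vector
f_x = np.array([1.0, 1.0, 2.0])   # made-up featurized data point
print(classify(w, f_x))           # activation = 2 - 1 + 1 = 2 > 0, prints "+"
```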
Geometrically, we can rewrite the activation using the definition of the dot product as $w^T f(x) = \|w\| \|f(x)\| \cos(\theta)$, where $\theta$ is the angle between the weight vector and the feature vector. Since magnitudes are always non-negative, and our classification rule looks at the sign of the activation, the only term that matters for determining the class is cos(θ).
$$\text{classify}(x) = \begin{cases} + & \text{if } \cos(\theta) > 0 \\ - & \text{if } \cos(\theta) < 0 \end{cases}$$
We, therefore, are interested in when cos(θ) is negative or positive. It is easily seen that for θ < π/2, cos(θ) will be somewhere in the interval (0, 1], which is positive. For θ > π/2, cos(θ) will be somewhere in the interval [−1, 0), which is negative. You can confirm this by looking at a unit circle. Essentially, our simple
linear classifier is checking to see if the feature vector of a new data point roughly "points" in the same
direction as a predefined weight vector and applies a positive label if it does.
$$\text{classify}(x) = \begin{cases} + & \text{if } \theta < \frac{\pi}{2} \text{ (i.e. when } \theta \text{ is less than 90°, or acute)} \\ - & \text{if } \theta > \frac{\pi}{2} \text{ (i.e. when } \theta \text{ is greater than 90°, or obtuse)} \end{cases}$$
Up to this point, we haven't considered the points where $\text{activation}_w(x) = w^T f(x) = 0$. Following all the same logic, we will see that cos(θ) = 0 for those points. Furthermore, θ = π/2 (i.e. θ is 90°) for those points. In other words, these are the data points with feature vectors that are orthogonal to w. We can add a dotted blue line, orthogonal to w, where any feature vector that lies on this line will have an activation equal to 0.
Decision Boundary
We call this blue line the decision boundary because it is the boundary that separates the region where
we classify data points as positive from the region of negatives. In higher dimensions, a linear decision
boundary is generically called a hyperplane. A hyperplane is a linear surface that is one dimension lower
than the latent space, thus dividing the space in two. For general classifiers (non-linear ones), the decision boundary may not be linear, but is simply defined as a surface in the space of feature vectors that separates the
classes. To classify points that end up on the decision boundary, we can apply either label since both classes
are equally valid (in the algorithms below, we’ll classify points on the line as positive).
Binary Perceptron
Great, now you know how linear classifiers work, but how do we build a good one? When building a classifier, you start with data labeled with the correct class; we call this the training set. You build a classifier by evaluating it on the training data, comparing your predictions to the training labels, and adjusting the parameters of your classifier until you reach your goal.
Let’s explore one specific implementation of a simple linear classifier: the binary perceptron. The perceptron
is a binary classifier, though it can be extended to work on more than two classes. The goal of the binary perceptron is to find a decision boundary that perfectly separates the training data. In other words, we're seeking the best possible weights, the best w, such that every featurized training point, when multiplied by the weights, is classified correctly.
The Algorithm
The perceptron algorithm works as follows:
1. Initialize all weights to 0: w = 0.
2. For each training sample, with featurized input f(x) and true label y∗ ∈ {+1, −1}:
   (a) Classify the sample with the current weights: predict +1 if hw(x) = wT f(x) ≥ 0 and −1 otherwise.
   (b) If the predicted label matches y∗, do nothing. Otherwise, update the weights: w ← w + y∗ f(x).
3. If you went through an entire pass over the training data without making any mistakes, stop; otherwise, repeat step 2.
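Here is a minimal sketch of this procedure in Python with NumPy. The cap on the number of training passes is our own addition so that the loop terminates even when the data is not linearly separable.

```python
import numpy as np

def perceptron(F: np.ndarray, y: np.ndarray, max_passes: int = 100) -> np.ndarray:
    """Binary perceptron. F is an (N, d) matrix of featurized samples (include a
    bias feature of 1 if desired); y holds the true labels, each +1 or -1."""
    w = np.zeros(F.shape[1])                    # step 1: initialize weights to 0
    for _ in range(max_passes):                 # safety cap (our own addition)
        mistakes = 0
        for f_x, y_star in zip(F, y):           # step 2: visit each sample
            y_pred = 1 if w @ f_x >= 0 else -1  # 2a: classify with current weights
            if y_pred != y_star:                # 2b: on a mistake, update weights
                w = w + y_star * f_x
                mistakes += 1
        if mistakes == 0:                       # step 3: stop after a clean pass
            break
    return w
```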
Updating weights
Let’s examine and justify the procedure for updating our weights. Recall that in step 2b above, when our
classifier is right, nothing changes. But when our classifier is wrong, the weight vector is updated as follows:
w ← w + y∗ f(x)
where y∗ is the true label, which is either 1 or -1, and x is the training sample which we mis-classified. You
can interpret this update rule to be:
Case 1: mis-classified positive as negative: w ← w + f(x)
Case 2: mis-classified negative as positive: w ← w − f(x)
To see why this update works, consider Case 1, where the activation was negative when it should have been positive. After the update, the new activation on the same point is:
$$h_{w + f(x)}(x) = (w + f(x))^T f(x) = w^T f(x) + f(x)^T f(x) = h_w(x) + f(x)^T f(x)$$
Using our update rule, we see that the new activation increases by f(x)^T f(x), which is a positive number, therefore showing that our update makes sense: the activation is getting larger, closer to becoming positive. You can repeat the same logic for when the classifier is mis-classifying because the activation is too large (activation is positive when it should be negative). You'll see that the update will cause the new activation to decrease by f(x)^T f(x), thus getting smaller and closer to classifying correctly.
While this makes it clear why we are adding and subtracting something, why would we want to add and
subtract our sample point's features? A good way to think about it is that the weights aren't the only thing that determines this score. The score is determined by multiplying the weights by the relevant sample. This
means that certain parts of a sample contribute more than others. Consider the following situation where x
is a training sample we are given with true label y∗ = −1:
$$w^T = \begin{bmatrix} 2 & 2 & 2 \end{bmatrix}, \qquad f(x) = \begin{bmatrix} 4 \\ 0 \\ 1 \end{bmatrix}, \qquad h_w(x) = (2 \cdot 4) + (2 \cdot 0) + (2 \cdot 1) = 10$$
We know that our weights need to be smaller because activation needs to be negative to classify correctly.
We don’t want to change them all the same amount though. You’ll notice that the first element of our sample,
the 4, contributed much more to our score of 10 than the third element, and that the second element did not
contribute at all. An appropriate weight update, then, would change the first weight a lot, the third weight a little, and the second weight not at all. After all, the second and third weights might not even be that broken, and we don't want to fix what isn't broken!
When thinking about a good way to change our weight vector in order to fulfill the above desires, it turns
out just using the sample itself does in fact do what we want; it changes the first weight by a lot, the third
weight by a little, and doesn’t change the second weight at all!
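Concretely, applying the Case 2 update w ← w − f(x) to the example above gives new weights (2 − 4, 2 − 0, 2 − 1) = (−2, 2, 1). The new activation is (−2 · 4) + (2 · 0) + (1 · 1) = −7, which is now negative: the first weight moved by 4, the third by 1, the second not at all, and this particular point is now classified correctly.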
A visualization may also help. In the figure below, f(x) is the feature vector for a data point with positive
class (y∗ = +1) that is currently misclassified – it lies on the wrong side of the decision boundary defined
by “old w”. Adding it to the weight vector produces a new weight vector which has a smaller angle to f(x).
It also shifts the decision boundary. In this example, it has shifted the decision boundary enough so that x
will now be correctly classified (note that the mistake won’t always be fixed – it depends on the size of the
weight vector, and how far over the boundary f(x) currently is).
Bias
If you try to implement a perceptron based on what has been mentioned thus far, you will notice one particularly unfriendly quirk: any decision boundary that you end up drawing will cross the origin.
Basically, your perceptron can only produce a decision boundary that could be represented by the function
w⊤ f(x) = 0, w, f(x) ∈ Rn . The problem is, even among problems where there is a linear decision boundary
that separates the positive and negative classes in the data, that boundary may not go through the origin, and
we want to be able to draw those lines.
To do so, we will modify our feature and weights to add a bias term: add a feature to your sample feature
vectors that is always 1, and add an extra weight for this feature to your weight vector. Doing so essentially
allows us to produce a decision boundary representable by w⊤ f(x) + b = 0, where b is the weighted bias
term (i.e. 1 * the last weight in the weight vector).
Geometrically, we can visualize this by thinking about what the activation function looks like when it is
w⊤ f(x) and when there is a bias w⊤ f(x) + b. To do so, we need to be one dimension higher than the space
of our featurized data (labeled data space in the figures below). In all the above sections, we had only been
looking at a flat view of the data space.
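As a small sketch of the mechanics (the numbers here are made up), adding the bias term is just an augmentation of the feature matrix with a column of 1s:

```python
import numpy as np

# Hypothetical featurized data: N = 3 points with 2 features each.
F = np.array([[1.0, 2.0],
              [3.0, 0.5],
              [0.0, 4.0]])

# Append a bias feature that is always 1; the matching extra weight acts as b.
F_bias = np.hstack([F, np.ones((F.shape[0], 1))])

w = np.array([0.5, -1.0, 2.0])   # last entry is the bias weight b
activations = F_bias @ w         # computes w^T f(x) + b for every point at once
```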
Example
Let's see an example of running the perceptron algorithm step by step, taking one pass through the data and visiting each data point in order. We'll start with the weight vector [w0, w1, w2] = [−1, 0, 0] (where w0 is the weight for our bias feature, which, remember, is always 1).
Multiclass Perceptron
The perceptron presented above is a binary classifier, but we can extend it to account for multiple classes
rather easily. The main difference is in how we set up weights and how we update said weights. For the
binary case, we had one weight vector, which had a dimension equal to the number of features (plus the bias
feature). For the multi-class case, we will have one weight vector for each class, so in the 3 class case, we
have 3 weight vectors. In order to classify a sample, we compute a score for each class by taking the dot
product of the feature vector with each of the weight vectors. Whichever class yields the highest score is the
one we choose as our prediction.
For example, consider the 3-class case. Let our sample have features $f(x) = \begin{bmatrix} -2 & 3 & 1 \end{bmatrix}^T$ and our weights for classes 0, 1, and 2 be:
$$w_0 = \begin{bmatrix} -2 & 2 & 1 \end{bmatrix}, \qquad w_1 = \begin{bmatrix} 0 & 3 & 4 \end{bmatrix}, \qquad w_2 = \begin{bmatrix} 1 & 4 & -2 \end{bmatrix}$$
Taking dot products for each class gives us scores s0 = 11, s1 = 13, s2 = 8. We would thus predict that x
belongs to class 1.
An important thing to note is that in actual implementation, we do not keep track of the weights as separate
structures, we usually stack them on top of each other to create a weight matrix. This way, instead of doing
as many dot products as there are classes, we can instead do a single matrix-vector multiplication. This tends
to be much more efficient in practice (because matrix-vector multiplication usually has a highly optimized
implementation).
In our above case, that would be:
$$W = \begin{bmatrix} -2 & 2 & 1 \\ 0 & 3 & 4 \\ 1 & 4 & -2 \end{bmatrix}, \qquad x = \begin{bmatrix} -2 \\ 3 \\ 1 \end{bmatrix}$$
And our label would be:
$$\arg\max(Wx) = \arg\max \begin{bmatrix} 11 \\ 13 \\ 8 \end{bmatrix} = 1$$
Along with the structure of our weights, our weight update also changes when we move to a multi-class
case. If we correctly classify our data point, then we do nothing, just like in the binary case. If we choose incorrectly, say we choose class y ̸= y∗, then we add the feature vector to the weight vector corresponding to the true class y∗, and subtract the feature vector from the weight vector corresponding to the predicted class y. In our
above example, let’s say that the correct class was class 2, but we predicted class 1. We would now take the
weight vector corresponding to class 1 and subtract x from it,
$$w_1 = \begin{bmatrix} 0 & 3 & 4 \end{bmatrix} - \begin{bmatrix} -2 & 3 & 1 \end{bmatrix} = \begin{bmatrix} 2 & 0 & 3 \end{bmatrix}$$
Next we take the weight vector corresponding to the correct class, class 2 in our case, and add x to it:
$$w_2 = \begin{bmatrix} 1 & 4 & -2 \end{bmatrix} + \begin{bmatrix} -2 & 3 & 1 \end{bmatrix} = \begin{bmatrix} -1 & 7 & -1 \end{bmatrix}$$
What this amounts to is 'rewarding' the correct weight vector, 'punishing' the misleading, incorrect weight vector, and leaving all other weight vectors alone. With the differences in the weight structure and weight updates accounted for, the rest of the multi-class algorithm proceeds exactly as in the binary case.
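Putting the pieces together, here is a minimal sketch of the multi-class perceptron in Python with NumPy. The stacked weight matrix follows the description above, and the cap on training passes is our own addition so that the loop always terminates.

```python
import numpy as np

def multiclass_perceptron(F: np.ndarray, y: np.ndarray, num_classes: int,
                          max_passes: int = 100) -> np.ndarray:
    """F is an (N, d) matrix of featurized samples; y holds integer class labels.
    Returns a (num_classes, d) weight matrix W with one weight vector per class."""
    W = np.zeros((num_classes, F.shape[1]))
    for _ in range(max_passes):              # safety cap (our own addition)
        mistakes = 0
        for f_x, y_star in zip(F, y):
            y_pred = int(np.argmax(W @ f_x)) # one matrix-vector product scores all classes
            if y_pred != y_star:
                W[y_star] += f_x             # 'reward' the true class's weight vector
                W[y_pred] -= f_x             # 'punish' the predicted class's weight vector
                mistakes += 1
        if mistakes == 0:
            break
    return W
```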
Summary
In this note, we introduced several fundamental principles of machine learning, including:
• Splitting our data into training data, validation data, and test data.
• The difference between supervised learning, which learns from labeled data, and unsupervised learning, which has no labeled data and so attempts to infer the data's inherent structure.
We then proceeded to discuss a number of supervised learning algorithms such as Naive Bayes, Linear Regression, and the Perceptron Algorithm.
• We covered the Naive Bayes algorithm and derived the maximum likelihood estimates of the unknown
model parameters. We extended this idea to discuss the problem of overfitting in the context of Naive Bayes and how this issue can be mitigated with Laplace smoothing.
• We talked about Linear Regression, a simple model where we predict real-valued outputs as linear
combinations of our input features. We also derived the linear regression closed form solution using
vector calculus.
• Finally, we talked about linear decision boundaries and the perceptron algorithm, a method for classification that repeatedly iterates over all our data and updates weight vectors when it classifies points incorrectly.