Machine Learning Notes
Machine Learning
Kevin Zhou
kzhou7@gmail.com
These notes follow Stanford’s CS 229 machine learning course, as offered in Summer 2020. Other
good resources for this material include:
The most recent version is here; please report any errors found to kzhou7@gmail.com.
Contents
1 Supervised Learning
1.1 Introduction
1.2 Linear Regression
1.3 Logistic Regression
1.4 Generalized Linear Models
1.5 Kernels
3 General Learning
3.1 Bias-Variance Tradeoff
3.2 Practical Advice
3.3 Evaluation Metrics
3.4 PAC Learning
4 Reinforcement Learning
4.1 Markov Decision Processes
4.2 Policy Gradient
5 Unsupervised Learning
5.1 Clustering
5.2 Expectation Maximization
5.3 Principal and Independent Component Analysis
1 Supervised Learning
1.1 Introduction
We begin with an overview of the subfields of machine learning (ML).
• According to Arthur Samuel, ML is the field of study that gives computers the ability to learn
without being explicitly programmed. This gives ML systems the potential to outperform the
programmers that made them. More formally, ML algorithms learn from experiences by using
them to increase a performance measure on some task.
• Broadly speaking, ML can be broken into three categories: supervised learning, unsupervised
learning, and reinforcement learning.
• Supervised learning problems are characterized by having a “training set” that has “correct”
labels. Simple examples include regression, i.e. fitting a curve to points, and classification.
Supervised learning has its roots in statistics.
• Generically, a supervised learning algorithm will minimize some cost function over a hypothesis
class H evaluated on the training set. This will give a model that can be used to predict labels
on a “testing set”. Choosing the minimization procedure itself constitutes an entire field of
study, “optimization”. We’ll mostly avoid the details of it here.
• Picking a good H is called the model selection problem. If it's too simple, it may underfit, not
being powerful enough to capture the truth; if it's too complicated, it may overfit to the
training set, and do badly on the testing set. This is called the bias-variance tradeoff.
• In unsupervised learning, the goal is instead to find structure in an unlabeled data set. A typical
example is clustering, e.g. finding news articles about the same topic, or groups in a social
network. Other examples include dimensionality reduction, (probability) density estimation,
and outlier/anomaly detection. There are also semi-supervised learning algorithms, which do
classification when only some of the examples are labeled, and weakly supervised algorithms,
which can cope with noisy labels.
• In reinforcement learning (RL), we think in terms of an “agent” which moves between “states”
in an “environment”, seeking to maximize a reward signal by learning an appropriate “policy”,
without being explicitly told how. Most popular depictions of ML focus on RL, which has its
roots in control theory from engineering. It also has roots in biology, with similar language used
to describe animal learning. Examples include learning to play a board game, or maneuvering
a robot past obstacles.
• Often, RL systems model and predict their environment, and may use supervised or unsupervised
learning algorithms as subroutines. However, the overall problem is more general, and more
realistic. Also, model-free RL systems exist, such as genetic algorithms or simulated annealing.
• One of the fundamental problems of RL is the explore-exploit tradeoff, i.e. whether the agent
should keep on doing the same thing to get a known reward, or try new things for the potential
of an even higher reward. We can describe this tradeoff formally by having the agent maximize
a “value function”, which accounts for both the reward itself, and the possible rewards it could
get by further exploration.
• As an example, if an agent plays tic tac toe against a fixed opponent, it’s easy to find a way to
guarantee a tie. But if the opponent is imperfect, there could be a different sequence of moves
that could force a win. If the agent explores sufficiently, it can find this sequence.
Note that in the second expression we can take A to be symmetric without loss of generality,
since it doesn’t change the left-hand side. We’ll drop the subscript when it’s clear what the
gradient is with respect to, and we distinguish vectors and matrices by context.
which is best shown using the component expression for the determinant. As a special case, for
positive definite A we can always define log det A, and
(\nabla_x^2 f(x))_{ij} = \frac{\partial^2 f(x)}{\partial x_i \partial x_j}.
• As a simple example,
\nabla_x^2 (x^T A x) = A + A^T = 2A
where we assumed A was symmetric.
Intuitively, x should be aligned with the eigenvector of A with largest eigenvalue. To show this
formally, we form the Lagrangian
L(x, λ) = xT Ax − λ(xT x − 1)
along with the analogous expressions with A and B swapped. The first result just follows from the
definitions of Σ and ΣAA , while the second can be shown by completing the square.
For the conditional distribution, the term added to the mean represents how knowledge of xB
shifts the distribution of xA , which is why it is proportional to ΣAB , while the term subtracted
from the variance represents how xB reduces the uncertainty in xA . This is easier to see in the case
where the blocks have one element each,
x = \begin{pmatrix} x_a \\ x_b \end{pmatrix}, \qquad \mu = \begin{pmatrix} \mu_a \\ \mu_b \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \sigma_a^2 & \rho \sigma_a \sigma_b \\ \rho \sigma_a \sigma_b & \sigma_b^2 \end{pmatrix}
where we can have an intercept by including a feature x0 = 1. The parameters θi are weights.
• To learn θ, we will have our algorithm minimize a cost function, also called the loss function or
empirical risk. The choice
J(\theta) = \frac{1}{2} \sum_i |h_\theta(x^{(i)}) - y^{(i)}|^2
• For least squares, we can find the optimal θ in closed form. A more general method is gradient
descent, which works for general cost functions, and scales better for large data sets. We start
with an initial guess for θ, and at each step perform the update

\theta_j := \theta_j + \alpha \sum_i \big( y^{(i)} - h_\theta(x^{(i)}) \big) x_j^{(i)}.

The single-example term \alpha (y^{(i)} - h_\theta(x^{(i)})) x_j^{(i)} is called the LMS update rule, or the Widrow–Hoff learning rule.
• This procedure is called batch gradient descent, since each step depends on all of the data points.
However, when there are many data points, this can be slow. Another option is stochastic
gradient descent, where we loop through the data points i, and at each step set

\theta_j := \theta_j + \alpha \big( y^{(i)} - h_\theta(x^{(i)}) \big) x_j^{(i)}.
Typically, stochastic gradient descent can get close to the minimum faster, so it is preferred
for large data sets. Another advantage is that it is less prone to getting stuck in local minima,
since iterating over data points gives the trajectory some natural “jitter”. (This isn’t relevant
for least squares, where the cost function is convex.)
Note. Getting gradient descent to work in practice requires some know-how. If α is too small,
convergence is slow; if it’s too large, J(θ) doesn’t decrease monotonically. In general, α should be
decreased as the learning continues, to home in on the minimum; this is especially useful for stochastic
gradient descent. Also, note that gradient descent and least squares are not “reparametrization
invariant”. If one of the features has much larger values than the others, it dominates the gradient
and the cost, so convergence of the other θi is slow and the final result may not be useful. In general
it’s best to shift and scale all the features into a standard range, which e.g. can be done by replacing
features with their z-scores.
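As a concrete illustration of the above (not from the original notes), here is a minimal NumPy sketch of batch gradient descent for least squares on z-scored features. The 1/m averaging of the gradient is a common rescaling of the update rule above, and the helper names are purely illustrative.

```python
import numpy as np

def z_score(X):
    """Shift and scale each feature to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def batch_gradient_descent(X, y, alpha=0.1, steps=1000):
    """Minimize J(theta) = (1/2) sum_i (theta^T x_i - y_i)^2 by batch updates."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(steps):
        residual = y - X @ theta              # y_i - h_theta(x_i) for all i
        theta += alpha * X.T @ residual / m   # averaged over the batch for stability
    return theta

# Toy usage: fit a linear relation with an intercept feature x0 = 1.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 2))
y = 3 * X_raw[:, 0] - 2 * X_raw[:, 1] + 0.1 * rng.normal(size=100)
X = np.column_stack([np.ones(100), z_score(X_raw)])   # intercept + scaled features
print(batch_gradient_descent(X, y))
```

Swapping the batch update for one example at a time gives stochastic gradient descent with the same code structure.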
Note. We can also minimize the least squares loss function analytically. Combine the feature
vectors into a matrix X, and the outputs into a vector y. Then the loss function is
J(\theta) = \frac{1}{2} (X\theta - y)^T (X\theta - y)
and setting the derivative ∇θ J to zero gives the “normal equations”,
X T Xθ − X T y = 0.
This condition has a geometric interpretation: when the length of the error vector Xθ − y is
minimized, it should be orthogonal to the column space of X, since that gives the set of possible
outputs. Thus, Xθ − y must be in the nullspace of X T . Solving for θ, we get
θ = (X T X)−1 X T y.
Computationally, the hard part of using this expression is the matrix inversion, which takes O(n^3)
time in the number of features n. It may turn out that X^T X is noninvertible, e.g. if some of the features are redundant. However,
for this purpose it’s good enough to use the pseudoinverse.
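A minimal NumPy sketch of the closed-form solution, using the pseudoinverse as suggested above; np.linalg.lstsq is a more numerically stable alternative.

```python
import numpy as np

def least_squares_closed_form(X, y):
    """Solve the normal equations X^T X theta = X^T y.

    np.linalg.pinv uses the pseudoinverse, so this still returns a sensible
    answer when X^T X is singular (e.g. redundant features)."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y

# Equivalent and more stable in practice: np.linalg.lstsq(X, y, rcond=None)[0]
```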
Note. Linear regression may appear weak, but its flexibility actually lies in the choice of features;
you can fit anything with a line, if you choose the right (nonlinear) features. For example, if we use
both x and x2 as features, then we’re really fitting a quadratic. The features can be chosen using
knowledge of the data set. There are also learning algorithms that effectively automatically choose
features.
Note. Why least squares? Suppose that the output and input variables are related by
y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}

where the error term \epsilon^{(i)} represents unmodeled effects or random noise. The central limit theorem
motivates us to model the \epsilon^{(i)} as iid Gaussians with zero mean and variance \sigma^2. Then
the likelihood function is

L(\theta) = \prod_i p(y^{(i)} | x^{(i)}; \theta) = \prod_i \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right).
Maximizing the likelihood of the data is equivalent to minimizing the least squares cost function.
Note. Linear regression is a parametric algorithm, which means that after training, it can make
predictions by storing a fixed number of parameters. A nonparametric algorithm instead needs
space linear in the size of the training set. One example is locally weighted linear regression. Here,
to output the prediction for a query point x, we minimize the cost function
J(\theta) = \sum_i w^{(i)}(x^{(i)}, x) \, (y^{(i)} - \theta^T x^{(i)})^2
and return θT x, where the weights emphasize the data points near x, e.g.
w^{(i)} = \exp\left( -\frac{(x^{(i)} - x)^2}{2\tau^2} \right)
where τ is the bandwidth parameter. To construct the cost function at all, we need to keep all x(i) .
This is a useful method when the data is not too high-dimensional, and one wants to get a decent
result without having to think about feature engineering.
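A minimal sketch of locally weighted linear regression at a single query point, assuming NumPy and solving the weighted normal equations; the function name is illustrative.

```python
import numpy as np

def lwr_predict(x_query, X, y, tau=0.5):
    """Locally weighted linear regression prediction at one query point.

    Solves the weighted normal equations (X^T W X) theta = X^T W y, where the
    weights decay with distance from x_query with bandwidth tau."""
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)
    theta = np.linalg.pinv(X.T @ W @ X) @ X.T @ W @ y
    return x_query @ theta
```

Note that the full training set (X, y) must be kept around, which is the nonparametric aspect described above.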
Note. What’s the difference between machine learning and statistics? Some people quip “$200,000
per year”, or that machine learning is just “regressions done in California”. There’s definitely some
truth to that, since people try to apply spin for funding applications and press releases. But more
seriously, while many algorithms appear in both fields, machine learning cares more about the
predictions hθ (x), while statistics cares more about extracting and understanding the parameters
θ. Thus, machine learning commonly includes much more complex procedures that are extremely
powerful, at the cost of rendering the parameters uninterpretable.
• As a simple hack, we could use linear regression to find θ, compute θT x, and report which of 0
or 1 it’s closer to. However, this doesn’t actually make any sense. For example, suppose that
we are classifying tumors as malignant or benign. A very large tumor will almost certainly turn
out to be malignant. Since this will be no surprise, such a data point should have very little
effect on the cost function, but in least squares such outliers generally have a massive effect.
• The conceptual reason for this is that the likelihood (from which we can infer a cost function)
is qualitatively different. Let’s suppose that the data is generated by
P (y = 1|x; θ) = hθ (x).
Then the log-likelihood of the data, which is effectively the negative of our cost function, is
\ell(\theta) = \log L(\theta) = \sum_i \Big[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\big(1 - h_\theta(x^{(i)})\big) \Big]
which is very different from the least squares cost function; we call it the cross-entropy loss.
• Since the data is generated from a probability in [0, 1], it’s natural to output a probability in
[0, 1]. In a generalized linear model, we assume h_\theta(x) can be written as

h_\theta(x) = g(\theta^T x).

However, this still leaves the choice of g open. In logistic regression, we choose
g(z) = \frac{1}{1 + e^{-z}}
which is called the sigmoid function. It is useful because it smoothly transitions between 0 and
1, the input z can be thought of as the log-odds ratio, and the resulting cost function has an
analytically tractable form. We will give a more canonical justification for it below.
• Now, we need to evaluate the derivative of the log-likelihood. Suppressing the superscripts,
\nabla_\theta \ell(\theta) = \left( \frac{y}{g} - \frac{1-y}{1-g} \right) \nabla_\theta g(\theta^T x) = \big( y(1-g) - (1-y)g \big) x = (y - h_\theta(x)) \, x
which is remarkably the exact same functional form as in linear regression. The difference
between the two algorithms is entirely in the choice of hθ (x). In particular, there is no analogue
of the normal equations here. The cost function is still convex, ensuring no local minima.
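As a concrete illustration (not from the original notes), here is a minimal NumPy sketch of logistic regression trained by gradient ascent on the log-likelihood, using the update derived above; the 1/m averaging is a common rescaling and the names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(X, y, alpha=0.1, steps=5000):
    """Maximize sum_i [y_i log h + (1 - y_i) log(1 - h)] by gradient ascent.

    The gradient has the same form as in linear regression: sum_i (y_i - h_theta(x_i)) x_i."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        h = sigmoid(X @ theta)
        theta += alpha * X.T @ (y - h) / len(y)   # averaged gradient step
    return theta
```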
Note. There are many alternatives to gradient descent. Newton’s method is a fast method to find
the zeroes of a function, but the zeroes of the derivative are the extrema. Thus, to maximize `(θ)
for a one-dimensional θ, we can perform the iteration
\theta := \theta - \frac{\ell'(\theta)}{\ell''(\theta)}.

For a vector-valued \theta, this generalizes to

\theta := \theta - H^{-1} \nabla_\theta \ell(\theta)
where H = ∇2θ `(θ) is the Hessian matrix. Applying this method to logistic regression is called Fisher
scoring. Newton's method converges in far fewer steps than gradient descent. Formally, it has
“quadratic convergence”, heuristically meaning that for reasonable functions, if the error is \epsilon,
then the error after the next step is of order \epsilon^2. However, each step takes much longer, since calculating H
takes O(n^3 m) time.
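A sketch of Newton's method applied to the logistic regression log-likelihood, assuming the standard gradient X^T(y − h) and Hessian −X^T diag(h(1 − h)) X; solving a linear system avoids forming H^{-1} explicitly. The function name is illustrative.

```python
import numpy as np

def newton_logistic(X, y, steps=10):
    """Newton's method (Fisher scoring) for logistic regression.

    Uses gradient = X^T (y - h) and Hessian H = -X^T diag(h(1-h)) X of the
    log-likelihood; each step solves H d = gradient instead of inverting H."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        h = 1.0 / (1.0 + np.exp(-X @ theta))
        grad = X.T @ (y - h)
        H = -(X.T * (h * (1 - h))) @ X        # -sum_i h_i (1 - h_i) x_i x_i^T
        theta -= np.linalg.solve(H, grad)     # theta := theta - H^{-1} grad
    return theta
```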
Note. Once we’ve computed hθ (x), we can use the output values to construct a decision boundary,
by assigning a class of 1 when hθ (x) is above some cutoff. For logistic regression, the decision
boundary is always a hyperplane perpendicular to θ. Like for linear regression, a more complicated
decision boundary can be constructed by using nonlinear features.
Note. We could also try choosing g to force an output of 0 or 1, by letting g(z) = θ(z), the step function. If we
continue to use the same update rule,

\theta_j := \theta_j + \alpha \big( y^{(i)} - h_\theta(x^{(i)}) \big) x_j^{(i)},

then we have the perceptron learning algorithm. Since both h_\theta(x^{(i)}) and y^{(i)} are 0 or 1, the update
rule actually has a very simple form. In stochastic gradient descent, we can think of the algorithm
as outputting a prediction hθ (x(i) ) as it sees each data point (thereby making it an “online” learning
algorithm). Then the update rule is simple: if the prediction is correct, θ is unchanged; otherwise, θ := θ + α x^{(i)} when y^{(i)} = 1 and θ := θ − α x^{(i)} when y^{(i)} = 0.
In the 1960s, the perceptron was thought to describe how neurons behave, with a 1 describing a
neuron firing. It also forms a crucial component of a neural network. However, it doesn’t have the
same probabilistic interpretation that linear or logistic regression have.
• A distribution is in the exponential family if it can be written as p(y; \eta) = b(y) \exp(\eta^T T(y) - a(\eta)), where η is the natural/canonical parameter, T(y) is the sufficient statistic (often equal to y
itself), b(y) is the base measure, and a(η) is the log partition function, which just normalizes
the distribution. A choice of T (y), a(η), and b(y) defines a family of distributions parametrized
by η. In general η and T (y) can be vectors, but we’ll focus on one-parameter families.
• To derive the cost function for linear regression, we assumed p(y|x; θ) was Gaussian. For
logistic regression, we assumed p(y|x; θ) was Bernoulli. Both of these distributions are in the
exponential family. For Bernoulli(φ), where we took φ = g(θT x), note that
p(y; \phi) = \phi^y (1-\phi)^{1-y} = \exp\left( y \log \frac{\phi}{1-\phi} + \log(1-\phi) \right)
which corresponds to
\eta = \log \frac{\phi}{1-\phi}, \qquad T(y) = y, \qquad b(y) = 1, \qquad a(\eta) = -\log(1-\phi).
Interestingly, this means the probability is
\phi = \frac{1}{1 + e^{-\eta}}
which is the sigmoid function.
• The Gaussian distribution was a two-parameter family, but we can just set σ 2 = 1 since it
played no essential role. We have
p(y; \mu) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2} (y - \mu)^2 \right)
from which we read off
\eta = \mu, \qquad T(y) = y, \qquad b(y) = \frac{e^{-y^2/2}}{\sqrt{2\pi}}, \qquad a(\eta) = \mu^2/2.
The multinomial, Poisson, gamma, exponential, beta, and Dirichlet distributions are all members
of the exponential family. On the other hand, the set of uniform distributions on intervals [c, d]
is not a member, because the support of the distribution is determined by b(y) alone, and hence
cannot depend on the parameters c and d.
• More complicated families of distributions can be described with more complicated T (y). For
example, for a general multivariate Gaussian, we can take
where the inner product of the two matrices should be interpreted as an element-wise product.
• In a generalized linear model (GLM), we assume the output distribution p(y; η) is in the
exponential family, and furthermore that η for each data point depends linearly on the inputs,
η = θT x
Example. Consider a classification problem where y ∈ {1, 2, . . . , k}. The data is multinomial
distributed, and this is an example of a (k − 1)-parameter set of distributions in the exponential
family. To see this explicitly, let φi = p(y = i; φ). Then the multinomial distribution is described by
ηi = log(φi /φk ), T (y)i = 1(y = i), b(y) = 1, a(η) = − log(φk )
as can be seen by plugging in the definitions. The outputs are the probabilities,
E[T(y; \eta)_i] = \phi_i = \frac{e^{\eta_i}}{\sum_{j=1}^k e^{\eta_j}}
where in a GLM, ηi = θiT x and the θi are fit by maximizing the log likelihood. The quantity on the
right-hand side here is the canonical response function g(η) for the multinomial distribution. It is
called the softmax function, and it generalizes the logistic function. This model can be trained by
gradient descent in an almost identical fashion to logistic regression; the log-likelihood is the cross
entropy between the distribution E[T (y; η)i ] and 1(y = i).
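A minimal NumPy sketch (not from the notes) of one gradient step of softmax regression with the cross-entropy loss described above; Theta is assumed to have shape (k, n) with one row per class, and the names are illustrative.

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax; subtracting the max is a standard numerical safeguard."""
    Z = Z - Z.max(axis=1, keepdims=True)
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

def softmax_regression_step(Theta, X, y, alpha=0.1):
    """One gradient-ascent step on the multinomial log-likelihood.

    Theta has shape (k, n); y holds class labels in {0, ..., k-1}. The gradient
    of the cross entropy has the familiar (indicator - probability) form."""
    probs = softmax(X @ Theta.T)            # shape (m, k)
    onehot = np.eye(Theta.shape[0])[y]      # 1{y = i} as rows
    return Theta + alpha * (onehot - probs).T @ X / len(y)
```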
1.5 Kernels
As mentioned, we can fit nonlinear functions with a linear model and nonlinear features. To make
this distinction explicit, we’ll write the “original” input values, called “attributes”, as x, and the
features as φ(x), where φ : Rd → Rp is a feature map. We let there be n training examples.
A typical nonlinear feature map could contain the xi themselves, along with all quadratic
combinations xi xj and cubic combinations xi xj xk . But this makes every step of gradient
descent take O(p) = O(d3 ) time, which could be slow if there are many attributes.
• However, note that each step of gradient descent adds a term to θ proportional to φ(x(i) ). So
if θ is initially zero, it can be written as
\theta = \sum_i \beta_i \phi(x^{(i)})
where we can think of βi as the weight of training example i. (Also, the representer theorem
states that for a wide variety of cost functions, the optimal θ can also be written in this form.)
• Now, all relevant quantities can be written in terms of inner products of the φ(x(i) ). In one
step of stochastic gradient descent on example i, the coefficient βi updates by
\beta_i := \beta_i + \alpha \left( y^{(i)} - \sum_j \beta_j \, \phi(x^{(j)})^T \phi(x^{(i)}) \right)
where this expression works equally well if the φ(x(i) ) are an undercomplete or overcomplete
basis for feature space. The final prediction is
\theta^T \phi(x^{(i)}) = \sum_j \beta_j \, \phi(x^{(j)})^T \phi(x^{(i)}).
All of these inner products can be packaged as K(x, z) = \phi(x)^T \phi(z), where the function K is called the kernel. It would seem this takes O(n^2 p) time, but many
kernels can be computed much faster. For example, for the cubic features above,

K(x, z) = 1 + \sum_i x_i z_i + \sum_{ij} x_i x_j z_i z_j + \sum_{ijk} x_i x_j x_k z_i z_j z_k = 1 + (x^T z) + (x^T z)^2 + (x^T z)^3
which only requires O(d) time to compute. However, note that each step of gradient descent
now takes O(n) time, while previously it didn’t scale with n at all.
• More generally, the kernel K(x, z) = (xT z + c)k corresponds to all features up to order k, with
c controlling the relative weighting of different orders.
• Intuitively, the dot product measures how close two vectors are, so a kernel function should fill
the same role. This motivates the choice of the Gaussian/radial basis kernel,
K(x, z) = \exp\left( -\frac{\|x - z\|^2}{2\sigma^2} \right).
This is essentially an infinite-dimensional polynomial kernel, with a penalty for high degree
terms parametrized by σ. For small enough σ, this kernel can perfectly separate any training set,
but only by overfitting, drawing a narrow decision boundary around each data point. Increasing
σ regularizes the model. This is a case where the feature mapping is so complicated that it's
better to not think about it at all.
• Clearly, not all functions K(x, z) correspond to possible sets of features. However, many valid
kernel functions can be constructed using the following rules.
– K(x, z) = 1 is a kernel function, corresponding to the constant feature.
– If K(x, z) is a kernel function, so is f (x)K(x, z)f (z), corresponding to changing the feature
mapping from φ(x) to f (x)φ(x).
– Kernel functions are closed under addition, corresponding to concatenating feature vectors.
– Kernel functions are closed under multiplication, corresponding to taking the set of quadratic
combinations of the elements of a feature vector.
Using these rules, we can construct the kernels mentioned above.
• Alternatively, let Kij = K(x(i) , x(j) ) be the kernel matrix. Clearly, the kernel matrix must be
symmetric to have a valid kernel. Moreover, letting φk (x) denote the k th component of φ(x),
z^T K z = \sum_{ij} z_i K_{ij} z_j = \sum_{ijk} z_i \phi_k(x^{(i)}) \phi_k(x^{(j)}) z_j = \sum_k \left( \sum_i z_i \phi_k(x^{(i)}) \right)^2 \geq 0

so the kernel matrix must also be positive semidefinite.
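To make the kernel trick concrete, here is a minimal NumPy sketch (not from the notes) of kernelized stochastic gradient descent and prediction, using the polynomial and Gaussian kernels above; the function names are illustrative.

```python
import numpy as np

def poly_kernel(x, z, c=1.0, k=3):
    """(x^T z + c)^k: all monomial features up to order k, weighted by c."""
    return (x @ z + c) ** k

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian / radial basis kernel."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def kernel_sgd(X, y, kernel, alpha=0.1, epochs=20):
    """Stochastic gradient descent in the dual: track one beta per example."""
    n = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    beta = np.zeros(n)
    for _ in range(epochs):
        for i in range(n):
            beta[i] += alpha * (y[i] - K[i] @ beta)   # the beta update above
    return beta

def kernel_predict(x_new, X, beta, kernel):
    """theta^T phi(x_new) = sum_j beta_j K(x_j, x_new)."""
    return sum(b * kernel(xj, x_new) for b, xj in zip(beta, X))
```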
• We can fit the parameters φ, µi , and Σ by maximizing the log-likelihood of the data. In our
discriminative algorithms, the likelihood was the product of the conditional probability of seeing
each y (i) given its x(i) . For a generative algorithm, we model both x and y, so we instead want
the product of the joint likelihood of the pairs (x(i) , y (i) ).
• Explicitly, we have
\ell(\phi, \mu_0, \mu_1, \Sigma) = \sum_{i=1}^m \log p(x^{(i)} | y^{(i)}; \mu_0, \mu_1, \Sigma) + \sum_{i=1}^m \log p(y^{(i)}; \phi)
Naive Bayes is a generative learning algorithm that works on feature vectors with discrete values.
We’ll use the running example of classifying text into spam and non-spam, where xj = 1 if an email
contains the j th word in the dictionary.
• Given a dictionary of d words, it takes O(2d ) parameters to specify a general form for p(x|y),
which is clearly unusable. The naive Bayes assumption is that all of the xi are conditionally
independent given y, so

p(x|y) = \prod_j p(x_j | y)
• This is a distinct assumption from independence of the xj , which would be unreasonable; such
a model would not know that “Nigerian” and “prince” tend to occur together. Conditional
independence is still not remotely true, e.g. within the class of spam emails, “Nigerian” makes
“prince” more likely, but “cheap” makes “Viagra” more likely. However, the hope is that this
does not adversely affect the classification accuracy, and it indeed works well in practice.
• The parameters φj|y=k = p(xj = 1|y = k) and φk = p(y = k) are found by maximum likelihood,
and again match the empirical probabilities,
\phi_{j|y=k} = \frac{\sum_i 1\{x_j^{(i)} = 1 \wedge y^{(i)} = k\}}{\sum_i 1\{y^{(i)} = k\}}, \qquad \phi_k = \frac{\sum_i 1\{y^{(i)} = k\}}{m}
where the wedge denotes conjunction. Similar results hold when xi has more than two possible
values. Naive Bayes can also be applied to continuous variables by discretizing them.
• If there’s a word that appears in the testing data but wasn’t in the training data, then Naive
Bayes will produce nonsense 0/0 predictions. Also, rare words might inadvertently cause 100%
identification as spam or non-spam. In Laplace smoothing, we insert one dummy observation
of every outcome to avoid this, meaning
\phi_{j|y=k} = \frac{\sum_i 1\{x_j^{(i)} = 1 \wedge y^{(i)} = k\} + 1}{\sum_i 1\{y^{(i)} = k\} + 2}.
This can be shown to be the optimal prescription under certain mathematical conditions.
• This specific prescription is also what we get if we assume a flat prior on each φj|y=k . If p(φ) is
flat, then after observing p positive counts and q negative counts, the posterior distribution is
p(\phi | p, q) \propto \phi^p (1 - \phi)^q
which has an expectation value of (p + 1)/(p + q + 2). Variations on our prescription above,
such as adding different numbers of dummy observations, correspond to a family of priors.
• In practice, the dictionary of words should be constructed from what actually appears in the
training data, rather than copied from a standard dictionary, for computational efficiency. Also,
it may be useful to remove stop words, since they occur often but carry little useful information.
• The algorithm we described above uses the “Bernoulli event model”. To generate an email
under its assumptions, we first decide if the email is spam or not (so y is Bernoulli), and then
for the j th word in the dictionary we independently decide if it’s in the email or not (so xj is
binary and xj |y is Bernoulli). But for text classification, there’s a better method. Keeping the
first step the same, for the `th word in the email we independently decide which word it is (so
x` is a word in the dictionary and x` |y is multinomial).
• We are still using the naive Bayes assumption of conditional independence, but now we can
account for the frequency with which words are used. The parameters are φk = p(y = k) as
before, and φj|y=k = p(x` = j|y = k) for any `. The likelihood is
L(\phi) = \prod_i \left( \phi_{y^{(i)}} \prod_{\ell=1}^{d_i} \phi_{x_\ell^{(i)} | y^{(i)}} \right)
where di is the number of words in email i. Maximizing it gives the expected results,
\phi_{j|y=k} = \frac{\sum_i \sum_{\ell} 1\{x_\ell^{(i)} = j \wedge y^{(i)} = k\}}{\sum_i d_i \, 1\{y^{(i)} = k\}}, \qquad \phi_k = \frac{\sum_i 1\{y^{(i)} = k\}}{m}.
To do Laplace smoothing, we would add 1 to the numerator of φj|y=k and d to the denominator.
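A minimal sketch (not from the notes) of the multinomial event model with Laplace smoothing, assuming each email is given as a list of word indices into a dictionary of size vocab_size; the function names are illustrative.

```python
import numpy as np

def train_multinomial_nb(emails, labels, vocab_size):
    """Multinomial-event-model naive Bayes with Laplace smoothing.

    `emails` is a list of lists of word indices; `labels` holds 0/1 classes.
    Returns log class priors and log word probabilities phi_{j|y=k}."""
    log_prior = np.zeros(2)
    log_phi = np.zeros((2, vocab_size))
    for k in (0, 1):
        docs = [e for e, c in zip(emails, labels) if c == k]
        log_prior[k] = np.log(len(docs) / len(emails))
        counts = np.ones(vocab_size)                 # the +1 dummy observations
        for e in docs:
            for j in e:
                counts[j] += 1
        log_phi[k] = np.log(counts / counts.sum())   # denominator gains +vocab_size
    return log_prior, log_phi

def predict_nb(email, log_prior, log_phi):
    """Pick the class maximizing log p(y=k) + sum_l log p(x_l | y=k)."""
    scores = log_prior + log_phi[:, email].sum(axis=1)
    return int(np.argmax(scores))
```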
• In the brain, neurons receive information from other neurons via dendrites, then decide whether
or not to fire, and output that information along their axons. From the perspective of machine
learning, the decision of firing or not is a highly nonlinear step, which allows neurons to learn
complex hypotheses. Thus, in a neural network, each neuron weights inputs from other neurons,
and applies a nonlinear function to form its output. We can think of logistic regression or the
perceptron as a neural network with a single neuron.
• For regression and classification, we can use the same cost/loss functions as before. For con-
creteness, the cost for one example for regression with the least squares loss would be
J^{(i)}(\theta) = \frac{1}{2} \big( h_\theta(x^{(i)}) - y^{(i)} \big)^2.
In practice, however, it is very difficult to train a deep neural network with this loss. It is
generally best to convert everything into a classification problem if possible, e.g. predicting a
star rating could be viewed as a classification problem with five classes.
hθ (x) = σ(wT x + b)
where b is a bias term, w is a vector of weights, and σ is a nonlinear activation function. Typical
examples include
\sigma(t) = \begin{cases} \max(t, 0) & \text{rectified linear unit (ReLU)} \\ 1/(1 + e^{-t}) & \text{sigmoid} \\ \tanh(t) & \text{hyperbolic tangent} \end{cases}
Since the derivatives of these functions must be computed often, it’s useful to have simple
expressions for them,
\sigma'(t) = \begin{cases} \theta(t) & \text{ReLU (with } \theta \text{ the step function)} \\ \sigma(t)(1 - \sigma(t)) & \text{sigmoid} \\ 1 - \sigma(t)^2 & \text{hyperbolic tangent} \end{cases}
• In a multi-layer network, the layers are computed recursively as z^{[k]} = W^{[k]} a^{[k-1]} + b^{[k]} and a^{[k]} = \sigma(z^{[k]}),
where W^{[k]} is a matrix of weights, b^{[k]} is a vector of biases, the inputs are a^{[0]} = x, and the
outputs are h_\theta(x) = W^{[r]} a^{[r-1]} + b^{[r]}. Note that the W^{[k]} need not be square, i.e. we can have
any number of neurons per layer. For a real-valued output, the final layer has one neuron.
• This matrix notation is elegant, and it is also practically important. When neural networks
are trained, as much as possible should be put into “vectorized” form. The recent interest in
neural networks is due to their great power when they are sufficiently large, which requires
computation power to be used as efficiently as possible.
• It’s doubtful if neural networks actually resemble the brain. In deep learning, the layer structure
is emphasized due to the efficiency of matrix multiplication, and one can have architectures
with 1000 layers, which seem unlikely to exist in real brains. On the other hand, the “one
learning algorithm” hypothesis says that distinct regions of the brain all run on essentially the
same learning algorithm, which neural networks might partially capture. For example, if visual
input is given to the auditory cortex in a lab animal, it will learn to see, though not well.
• Opponents of the “one learning algorithm” idea point out the great complexity of the brain,
its division into many distinct subparts, and the existence of detailed instinctual knowledge
(e.g. for language acquisition). In AI, Minsky’s “society of mind” hypothesis posits that human
intelligence arises from the interaction of a diverse range of relatively simple “agents”.
• A linear model can deal with nonlinearity if it is given an appropriate feature map, which
requires careful “feature engineering”. Neural networks are useful because they essentially do
this automatically; the output of each layer can be thought of as constructing useful features
for the next layer to operate on. We can even think of the first r − 1 layers as defining a kernel,
and then swap out the final layer for one of the other algorithms we’ve considered.
Example. Consider two binary-valued inputs. A linear decision boundary, e.g. from a single
neuron, can represent boolean operators such as (N)AND and (N)OR. Such a decision boundary
can’t represent XOR, but a two-layer neural network can, because XOR can be written in terms of
a two-layer binary circuit.
Note. A neural network with a single hidden layer can approximate any function. To see this, note
that any reasonable activation function becomes a step function if the weights are large enough. Let
the output of each hidden neuron be a step function in the input, and let the output neuron just
sum up the hidden neuron outputs. The conclusion follows since any function can be approximated
as a sum of step functions.
While this “universality” property might sound powerful, it’s not relevant to practical learning,
because almost everything is universal. For example, polynomials can approximate any function
arbitrarily well, as can sums of cosines and sines, as can discretizations (which these step functions
essentially are). This doesn’t mean that we can replace all of machine learning with Taylor series.
What we really need are algorithms that efficiently learn and represent the particular patterns that
exist in real data; this is the power of deep neural networks.
If we know δ [k] = ∂J/∂z [k] , then we can easily compute ∂J/∂z [k−1] by the chain rule,
\frac{\partial J}{\partial z_i^{[k-1]}} = \sum_j \frac{\partial J}{\partial z_j^{[k]}} \frac{\partial z_j^{[k]}}{\partial z_i^{[k-1]}} = \sum_j \frac{\partial J}{\partial z_j^{[k]}} \, W_{ji}^{[k]} \, \sigma'(z_i^{[k-1]}).
• The final layer δ^{[r]} is easy to compute. For example, for the least squares cost function,
\delta^{[r]} = \partial J / \partial z^{[r]} = h_\theta(x) - y, and the relation above allows us to recursively compute the δ^{[k]}. Since this goes from the last
layer to the first, it is known as “backpropagation”.
• Given the δ [k] , we can read off the gradients with respect to the parameters,
\frac{\partial J}{\partial W_{ij}^{[k]}} = \delta_i^{[k]} a_j^{[k-1]}, \qquad \frac{\partial J}{\partial b_i^{[k]}} = \delta_i^{[k]}
or in matrix/vectorized notation,
\frac{\partial J}{\partial W^{[k]}} = \delta^{[k]} (a^{[k-1]})^T, \qquad \frac{\partial J}{\partial b^{[k]}} = \delta^{[k]}.
Different sources will differ on the index conventions, though the substance is the same.
• Mathematically, backpropagation is just the chain rule from introductory calculus. But it’s an
important idea because it’s far more efficient than a naive calculation of the gradient. Consider
a neural network with p parameters. Naively, calculating the gradient due to a single parameter
requires changing that parameter and seeing how the output changes, which takes O(p) time,
leading to O(p2 ) total time. Backpropagation memoizes the results at layer k to speed up the
calculation at layer k − 1, so the whole process takes O(p) time.
• Also note that to perform backpropagation, we need to know all the z [k] . But we already do
this as an intermediate step while computing the output z [r] in “forward propagation”.
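As a concrete illustration of forward and backward propagation (not from the notes), here is a minimal NumPy sketch for a fully connected network with sigmoid hidden layers, a linear output, and the least squares loss; the function names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(x, y, Ws, bs):
    """One forward and backward pass for a fully connected network.

    Ws[k], bs[k] are the weights/biases of layer k+1; hidden layers use a
    sigmoid, the last layer is linear, and the loss is (1/2)(h - y)^2."""
    # Forward pass: cache pre-activations z and activations a.
    a, zs, activations = x, [], [x]
    for k, (W, b) in enumerate(zip(Ws, bs)):
        z = W @ a + b
        a = z if k == len(Ws) - 1 else sigmoid(z)
        zs.append(z)
        activations.append(a)

    # Backward pass: delta = dJ/dz, propagated from the last layer.
    grads_W, grads_b = [None] * len(Ws), [None] * len(Ws)
    delta = a - y                                     # least-squares output layer
    for k in reversed(range(len(Ws))):
        grads_W[k] = np.outer(delta, activations[k])  # dJ/dW = delta a^T
        grads_b[k] = delta                            # dJ/db = delta
        if k > 0:
            sig = sigmoid(zs[k - 1])
            delta = (Ws[k].T @ delta) * sig * (1 - sig)
    return grads_W, grads_b
```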
• In practice, it’s useful to do “minibatch” stochastic gradient descent, i.e. using several training
examples simultaneously. This is useful because using one training example at a time results in
many matrix-vector multiplications that must be done sequentially. With a minibatch, every
operation is a matrix multiplication, allowing us to parallelize.
• Notationally, we generalize all the vectors above to matrices, with each column representing
one training example. Then we have, e.g.

Z^{[k]} = W^{[k]} A^{[k-1]} + B^{[k]}

where each column of B^{[k]} is a copy of b^{[k]}. The backpropagation steps become

\Delta_{i\ell}^{[k-1]} = \sum_j \Delta_{j\ell}^{[k]} \, W_{ji}^{[k]} \, \sigma'(Z_{i\ell}^{[k-1]})
The expressions for the parameter derivatives are similar, but with a sum over the new index,
\frac{\partial J}{\partial W^{[k]}} = \Delta^{[k]} (A^{[k-1]})^T, \qquad \frac{\partial J}{\partial b_i^{[k]}} = \sum_j \Delta_{ij}^{[k]}.
• In terms of linear algebra, some of the expressions above look unnatural, but computationally
they are very efficient since they are just matrix multiplications, element-wise multiplications,
and summations over rows. Also note that there is no need to explicitly construct the matrix
B [k] . The operations above with B [k] can be performed efficiently by “broadcasting” b[k] .
• In general, we can run into objects with more than two indices, which we call tensors. Conven-
tions differ on the index ordering, so other sources may have a different matrix multiplication
order, or extra transposes, relative to us.
• In a physics context, we would usually indicate what is being kept constant for each partial
derivative. For the neural networks we’ve considered above, this isn’t necessary because of the
layer structure: each layer only influences the final result through its influence on the next
layer. Thus, partial derivatives of quantities at layer k are defined by changing layer k alone
and computing its effect on the later layers. This can become slightly more subtle for networks
without a straightforward layer structure.
• For classification with n classes, we can have n neurons in the final layer, with outputs z_i^{[r]}.
These can be converted to probabilities using the softmax function,
p_i = \frac{e^{z_i^{[r]}}}{\sum_j e^{z_j^{[r]}}}
• When we covered linear and logistic regression, the cost functions had clear probabilistic inter-
pretations. The interpretation of a cost function is much less clear for a neural network; we
instead just use cost functions if they lead to good performance. For example, another common
option for classification is to take sigmoids in the final layer, rather than a softmax, and use
the cross-entropy loss.
Note. As with all learning algorithms, neural networks need to be regularized for good performance.
As described later, we can use L1 or L2 regularization; the latter tends to perform better but the
former gives more interpretable parameters. Another natural idea would be to regularize by shrinking
the neural network, but in general this is not preferred; we get the best performance by using as
many parameters as computation power permits.
A better regularization technique, specific to neural networks, is “dropout”, where for each
minibatch we temporarily remove a random half of the hidden neurons; for predictions, we restore
all neurons and halve the weights. Dropout was found to be very effective in the early 2010s.
Heuristically, it works because it makes each neuron learn less fragile features, since it cannot rely
on its neighbors. Another way to think about it is that it trains multiple neural networks inside the
main neural network simultaneously, and the output is an ensemble average. Another useful idea is
to augment the training set, e.g. in image classification we could add slightly rotated or cropped
versions of the original training images. This keeps the neural network from fitting to irrelevant
details.
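As a sketch of how dropout can be implemented, here is the now-common “inverted” variant, which scales the surviving activations by 1/p at training time instead of halving the weights at prediction time (an assumption, not the exact prescription described above); NumPy is assumed.

```python
import numpy as np

def dropout(a, p_keep=0.5, training=True):
    """Apply (inverted) dropout to a layer's activations.

    During training, each hidden unit is zeroed with probability 1 - p_keep and
    the survivors are scaled by 1/p_keep, so no rescaling is needed at test time.
    In expectation this matches keeping all units with shrunken weights."""
    if not training:
        return a
    mask = (np.random.rand(*a.shape) < p_keep) / p_keep
    return a * mask
```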
Note. How do we choose the activation function? In general, it’s just whatever works best, but
heuristically it’s useful for the activation function’s range to match the output we want, so the
sigmoid is good for classification, the tanh is good for bounded quantities, and ReLU is good for
regression.
Both tanh and the sigmoid have the “vanishing gradient” problem: their derivatives can be very
small when a neuron is saturated, and this problem worsens exponentially as we backpropagate,
making the early layers in a deep network hard to train. ReLU does not have this problem as long
as the argument is positive, but when the argument is negative the gradient vanishes entirely. This
leads to the “dying ReLU” problem: in the course of training, neurons can effectively die, never
outputting anything nonzero again. An alternative is the “leaky ReLU” function max(x/10, x).
Also, even though all of our examples have gradients with magnitude less than 1, the fact that
each neuron has many associated weights means that gradients can grow during backpropagation,
leading to numeric instability; this is the “exploding gradient” problem. The inherent instability of
gradient descent is one of the main obstacles in deep learning.
ReLU is the most popular activation function, because it is very cheap to evaluate and quick
to train with, but research is done on alternatives. For example, Google researchers proposed the
Swish activation function x/(1+e−x ), which seems to perform better than ReLU. It has the unusual
property of being nonmonotonic! One can also use different activation functions for different layers.
In general, not much is known for sure about activation functions.
Note. There are many small tricks needed to train neural networks reliably. It’s important to
initialize the weights to random values, in order to “break the symmetry” of the neurons in each
layer. To prevent neurons from beginning already saturated, and hence training slowly, these values
should be O(1/\sqrt{n}) where n is the number of inputs/outputs to each neuron; refinements of this
idea include Xavier and He initialization. One should also normalize the inputs, again to avoid
saturation. Finally, one should perform sanity checks on everything, such as the gradients, step
sizes, regularization, and loss function, before committing to a long computation. Even if all these
sanity checks pass, many things can go wrong in training, as amusingly shown here.
Also, we have talked about implementing (stochastic) gradient descent, but in practice one would
often use an off-the-shelf gradient descent algorithm such as Adam or Adagrad, reviewed here. These
algorithms often converge faster, using tricks such as adaptive learning rates for different parameters,
maintaining “momentum” between update steps (i.e. treating the gradient as a force which produces
an acceleration, in the presence of friction), annealing (gradually slowing) the learning rate, and
using second derivative information. On the other hand, vanilla gradient descent with a predefined
annealing schedule is quite robust. To prevent overfitting, training can also feature “early stopping”,
i.e. stopping once performance on the validation set stops improving.
Note. Some historical context about deep learning, ML, and AI. In popular culture, ML and AI are
synonyms, but in most universities, they’re distinct courses that cover completely different material.
The professors for AI courses tend to be about three decades older, and both sides actively avoid
mentioning each other. What’s going on?
Technically, ML is just a subfield of AI, which consists of the techniques described in these notes.
But when people contrast the two, they typically are comparing ML to “symbolic AI” (also called
“good old fashioned AI”), a family of ideas that was dominant in the 1950s to 1980s. Symbolic AI
explicitly encodes structure in terms of human-readable representations such as logical propositions.
For example, a symbolic approach to recognizing a picture of a cat would begin by saying that it
is a member of the category “animal”, and has a property “legs” which has the value “four”, and
so on. Searle’s Chinese room argument, for example, focused on such systems. This paradigm was
so dominant that the cognitive science major at Stanford is called “symbolic systems”.
But while it sounds intuitively appealing, symbolic AI was brittle, catastrophically failing when
applied to real-world problems. Moravec’s paradox was the observation that AI systems found
emulating human reasoning about formally defined objects easy (for instance, AI systems rapidly
progressed in chess playing, defeating the world champion in 1997), but couldn’t perform sensori-
motor tasks like image recognition as well as a human infant. This led to “AI winters” in the early
1970s and late 1980s, where funding for AI would collapse after a cycle of hype failed to lead to
results. For this reason, some ML practitioners refer to ML as “the part of AI that actually works”.
Philosophically, the reason for the failure is as in the Chinese room argument: symbolic systems
are just pushing around symbols without knowing what they really mean, i.e. without “grounding”
them in the real world. Practically, this leads to bad performance because real-world problems
require a tremendous amount of tacit knowledge to solve, which cannot reasonably be encoded by
hand. Many AI practitioners concluded that true AI required “embodied cognition”, giving the AI
sensory experience in the real world, but ML methods were able to circumvent this by picking up
tacit knowledge from the training data by themselves.
Deep learning also has a long history, as it is the paradigmatic example of non-symbolic AI.
Simple versions of neural networks, called perceptrons, were investigated in the 1950s and 1960s.
However, they were infamously demolished in the 1969 book Perceptrons, by Minsky and Papert,
which led to the growth of symbolic AI in its place. Among other things, Perceptrons showed that
the simplest neural networks couldn’t compute XOR (which is why it is a standard example in ML
courses today) and dismissed deep neural networks as a “sterile” extension.
Of course, this wasn’t the only reason people didn’t use deep learning. Deep learning requires
an extreme amount of computation power, and proponents of perceptrons made many bold claims
that were infeasible given the computers available at the time. In the 1980s, deep learning was
revived under the name of “connectionism”, and backpropagation began to be used in 1985. This
led to a surge of interest, and in 1989, Yann LeCun used neural networks to identify digits on the
MNIST data set. However, progress stalled due to lack of computation power until around the year
2000, when GPUs began to be used. In the late 2000s, the paradigm of “big data” emerged, as
large tech companies accumulated the vast data sets required to train large neural networks. In the
2010s, the combination of increasing computation power, improved neural network architectures
and training techniques, and widely available huge data sets led to an explosion of spectacular results.
• SVMs output decision boundaries, rather than probabilistic predictions. Suppose for concrete-
ness that a data set can be perfectly separated by a hyperplane. Generically, multiple different
hyperplanes will work. The key idea behind the SVM is that it picks the hyperplane with the
largest “margin”, i.e. with the greatest distance to any of the data points, which is called the
optimal margin classifier. Note that it is determined by the closest data points to the boundary;
these are the “support vectors”.
• Specifically, for binary classification we output
h(x) = sign(wT x + b)
where the notation is conventional, and we have separated out the intercept b. The decision
boundary is thus at wT x + b = 0.
• For each data point i, we define the functional margin
γ̂ (i) = y (i) (wT x(i) + b)
where a high number indicates a confident and correct prediction. The functional margin of
the entire data set is the minimum over data points, \hat\gamma = \min_i \hat\gamma^{(i)}.
• The functional margin is not invariant under scaling w and b, even though this doesn’t change
the decision boundary. So it is more useful to think about maximizing the geometric margin,
\gamma = \min_i \gamma^{(i)}, \qquad \gamma^{(i)} = \frac{\hat\gamma^{(i)}}{\|w\|}.
We could then maximize the geometric margin using, e.g. gradient descent on w and b.
• However, there is a better way. We note that maximizing the geometric margin is equivalent
to minimizing |w| for fixed functional margin γ̂ = 1,
\text{minimize}_{w,b} \ \frac{1}{2} \|w\|^2 \quad \text{such that} \quad y^{(i)} (w^T x^{(i)} + b) \geq 1.
This optimization problem only involves a quadratic “objective function” subject to linear
constraints, so it can be solved using powerful “quadratic programming” methods. (The case
where the objective function is also linear is “linear programming”, and both are subsets of
convex optimization.)
We can train an SVM more efficiently by transforming the “primal” optimization problem above to
a “dual” problem.
In the method of Lagrange multipliers, this can be solved by defining the Lagrangian
L(w, \beta) = f(w) + \sum_i \beta_i h_i(w)
and setting its partial derivatives to zero, since this yields precisely the same constraints.
Then solving the primal problem is equivalent to minimizing θP (w). Let p∗ be the corresponding
optimal value.
• The dual problem switches the order of the maximum and minimum. We let
• It can be shown that if f and the gi are convex, and the hi are affine, and the constraints gi
are strictly feasible (i.e. we can simultaneously have gi (w) < 0), then there is no duality gap,
p∗ = d∗ = L(w∗ , α∗ , β ∗ ).
In this case, by combining all of the constraints from solving the primal and dual problems, the
solution must obey the KKT conditions,
\frac{\partial L}{\partial w_i} = \frac{\partial L}{\partial \beta_i} = 0, \qquad \alpha_i^* g_i(w^*) = 0, \qquad g_i(w^*) \leq 0, \qquad \alpha^* \geq 0.
The first two are familiar from Lagrange multipliers. The third is the dual complementarity
condition, which states that if αi∗ > 0, then the corresponding constraint is binding.
which has only inequality constraints. The conditions for the duality gap to vanish hold, so we
can solve the dual problem instead.
• At first glance, the dual problem might seem harder, because there are many more parameters,
with one αi per data point. However, by the dual complementarity condition, αi is only nonzero
for the support vectors, so the vast majority are zero!
• Furthermore, the dual problem is written in terms of inner products of the data points x(i) .
Once the optimal αi are found, we can output predictions on a new data point x by computing
w^T x + b = \sum_i \alpha_i y^{(i)} (x^{(i)} \cdot x) + b
which is also in terms of inner products. Thus, we can directly “kernelize” the SVM. Note that
the value of b can be inferred by noting that y (i) (wT x(i) + b) = 1 for the support vectors.
• So far, we’ve assumed the data set is linearly separable, but this might not be the case, even
after using a feature map, in which case there is no solution at all. Moreover, a single outlier or
misclassified data point can strongly affect the entire decision boundary. We can address this
by adding “slack” to the constraints, yielding the optimization problem
\text{minimize}_{w,b,\xi} \ \frac{1}{2} \|w\|^2 + C \sum_i \xi_i \quad \text{where} \quad y^{(i)} (w^T x^{(i)} + b) \geq 1 - \xi_i \ \text{and} \ \xi_i \geq 0.
• Following the same logic as before, the dual problem is to maximize the same θD (α) as above,
but subject to 0 ≤ αi ≤ C rather than just 0 ≤ αi . Again, αi = 0 for the “obvious” data points,
while those that impact the decision boundary are nonzero.
• The sequential minimal optimization (SMO) algorithm is a particularly fast way to solve the
dual problem, invented in 1998. The essential idea is to optimize only two of the αi at once.
This is fast, because each optimization step can be done in closed form. This approach, of
optimizing a subset of the coefficients at once, is called coordinate ascent.
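In practice one rarely implements SMO by hand; a minimal sketch using scikit-learn's off-the-shelf SVM solver (assuming scikit-learn is available; the toy data is illustrative) looks like this. Here C is the slack penalty from the soft-margin objective above, and the RBF kernel kernelizes the classifier as described earlier.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2D data: two Gaussian blobs with a little overlap.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, size=(50, 2)), rng.normal(2, 1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Fit a soft-margin, RBF-kernel SVM and inspect its support vectors.
clf = SVC(C=1.0, kernel="rbf", gamma=0.5).fit(X, y)
print("number of support vectors:", len(clf.support_))
print("prediction at the origin:", clf.predict([[0.0, 0.0]]))
```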
• In Bayesian linear regression, we start with a prior distribution over the parameters. For
concreteness, we let θ ∼ N (0, τ 2 I), which is reasonable if the features have been appropriately
normalized. We assume the data set is generated with Gaussian noise as before,
Given a new input x∗ , we have the posterior predictive distribution of the corresponding y∗ ,
p(y_* | x_*, S) = \int p(y_* | x_*, \theta) \, p(\theta | S) \, d\theta.
Finally, a confidence region can be plotted by encompassing, for each x∗ , a region in y∗ that
contains a given proportion of the probability. (Confidence regions can also be made in the
frequentist approach, but they have a different meaning, as discussed in the notes on Statistics.)
• In general, these integrals are expensive to evaluate, and Bayesian methods usually use approx-
imations such as maximum a posteriori (MAP) estimation, where some posterior distributions
are replaced with delta function peaks at their maxima.
• However, for the particular assumptions we have made, the solution can be written down in
closed form! Using our earlier results about multivariate Gaussians, it can be shown that
where, e.g. we have K(X, X)ij = k(x(i) , x(j) ). The output values are
\begin{pmatrix} y \\ y_* \end{pmatrix} \Big| X, X_* = \begin{pmatrix} f \\ f_* \end{pmatrix} + \begin{pmatrix} \epsilon \\ \epsilon_* \end{pmatrix} \sim N\left( 0, \begin{pmatrix} K(X, X) + \sigma^2 I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) + \sigma^2 I \end{pmatrix} \right).
• It turns out that Bayesian linear regression is equivalent to Gaussian process regression, upon
using an appropriate kernel; however, showing this directly is messy. Just like Bayesian linear
regression, Gaussian process regression automatically gives a confidence region. The kernel
function itself plays the role of the regulator, and can be selected based on the structure of the
data to improve performance.
• In geostatistics, Gaussian process regression is known as kriging, and was originally used to
determine the distribution of underground gold for mining.
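A minimal NumPy sketch of Gaussian process regression, obtained by conditioning the joint Gaussian above on the observed outputs using the standard formulas for multivariate Gaussians; the helper names are illustrative.

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    """Kernel matrix K(A, B) for the Gaussian kernel."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def gp_predict(X, y, X_star, noise=0.1, sigma=1.0):
    """Posterior mean and covariance of y_* given the training data.

    Conditions the joint Gaussian of (y, y_*) on y: the mean is
    K(X_*, X) (K(X, X) + noise^2 I)^{-1} y, with the matching Schur-complement
    covariance, which directly gives a confidence region."""
    K = rbf(X, X, sigma) + noise ** 2 * np.eye(len(X))
    K_star = rbf(X_star, X, sigma)
    K_ss = rbf(X_star, X_star, sigma) + noise ** 2 * np.eye(len(X_star))
    mean = K_star @ np.linalg.solve(K, y)
    cov = K_ss - K_star @ np.linalg.solve(K, K_star.T)
    return mean, cov
```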
3 General Learning
3.1 Bias-Variance Tradeoff
In this section we describe some general features of machine learning. We begin with an explicit
example of a bias-variance tradeoff in parameter estimation in statistics, which overlaps with the
notes on Statistics.
• Consider an iid data set (x(i) , y (i) ) with n data points, where x(i) is a d-element vector, combined
into an n × d matrix X. In the context of statistical inference, the x(i) are fixed, and the
corresponding y (i) is generated from it according to some probability distribution, with unknown
parameter θ∗ ∈ Rd .
• The goal is to estimate θ∗ . An estimator θ̂ is a function of the data. Since the y (i) are random,
θ̂ is itself a random variable, and its distribution is called the sampling distribution.
• The quality of an estimator is quantified by its mean squared error, MSE(\hat\theta) = E[|\hat\theta - \theta^*|^2].
This can be decomposed into a bias and variance term using the usual “parallel axis theorem”
reasoning, E[x^2] = var(x) + E[x]^2, giving

MSE(\hat\theta) = |E[\hat\theta] - \theta^*|^2 + E[|\hat\theta - E[\hat\theta]|^2].

Both bias and variance contribute, and typically reducing one increases the other. Informally,
the bias is the part that survives as the number of data points goes to infinity.
• For example, the least squares estimator is \hat\theta = (X^T X)^{-1} X^T y. In ridge regression, we add a penalty term (\lambda/2) |\theta|^2 to the least squares cost function,
with λ > 0, and by similar reasoning to that used to derive the normal equations, we have
θ̂ = (X T X + λI)−1 X T y.
Note that the inverse exists since X^T X + λI is positive definite: X^T X is PSD and λ is positive.
X T X = U DU T , D = diag(σi2 )
in which case
(X T X + λI)−1 = U diag(1/(σi2 + λ))U T .
Since E[\epsilon] = 0, the first term alone yields the bias. As is intuitive from the cost function,
nonzero λ biases θ̂ towards zero.
• From the perspective of the variance, the first term is just a constant (since X is fixed), so only
the second term contributes. It in turn is just a linearly transformed Gaussian, so
• We are more interested in predictions than parameters, so we would like to quantify the bias-
variance tradeoff in terms of prediction error. Let’s suppose that the data is generated by
y = f(x) + \epsilon

where the random error \epsilon is iid across data points, and satisfies

E[\epsilon] = 0, \qquad \text{var}(\epsilon) = \tau^2.
We don’t know f (x), but we would like to learn it from training data, so that our learned
function fˆ yields a good prediction for y on a previously unseen test data point (x∗ , y∗ ).
Splitting the last term using “parallel axis theorem” reasoning again, we get

E[(y_* - \hat f(x_*))^2] = \tau^2 + \big( f(x_*) - E[\hat f(x_*)] \big)^2 + \text{var}(\hat f(x_*))

where the terms represent the irreducible error due to noise, the bias, and the variance.
• To relate this to our bias and variance for estimators, suppose that f(x) = \theta^{*T} x and we use
ridge regression to find \hat f, giving \hat f = \hat\theta^T x; the bias and variance of the prediction then follow from the results for \hat\theta above.
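A minimal sketch of the ridge estimator in NumPy, for comparison with the plain normal equations; the λI term guarantees invertibility and illustrates the tradeoff above, trading a little bias for reduced variance.

```python
import numpy as np

def ridge_regression(X, y, lam=1.0):
    """Closed-form ridge estimator (X^T X + lambda I)^{-1} X^T y."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)
```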
• In practice, we split our data set into a testing and training set, and split the training set into a
training and a validation/development set. The model is trained on the training set, and errors
for both the training set and validation set are computed.
• High bias (underfitting) corresponds to having a high error on both. It can be fixed by using
a more expressive model or adding features. High variance (overfitting) corresponds to having
a high error on just the validation set, which means the algorithm is fitting to noise in the
training set. It can be fixed by getting more data, increasing regularization, or simplifying the
model.
• A typical split is 70/30 for the training and validation sets. As a typical methodology, one
might train a variety of models with different hyperparameters on the training set, and choose
the one with the lowest error on the validation set.
• When data is scarce, we can do a more elaborate procedure to avoid “losing” training data. In
k-fold cross validation, we instead split the training set into k equal subsets. We train each
model k times, each time picking a different subset to act as the validation set, then average the
validation errors; a typical value is k = 10.
• For cases with extremely scarce data, we can set k to the total number of data points, which is
called leave-one-out cross validation. A disadvantage of higher k is the increased time required.
• The final performance of the model should be reported as error on the testing set, which should
not be used at all during the above procedure. The reason we want a separate testing set is
that typically one goes through many models. If we validated on the testing set, then we would
effectively be fitting hyperparameters to it, making the error on the testing set lower than the
true generalization error.
• On the other hand, this argument means that we are fitting hyperparameters to the validation
set, so how can we know if we are overfitting to it? In practice, one can use multiple layers of
validation sets, one for each cycle of hyperparameter tuning.
• A typical train/validation/test split would be 60/20/20. However, when we have very large
data sets, the testing set can be much smaller (e.g. 0.1%), since its only purpose is to estimate
the generalization error.
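To make the cross validation procedure concrete, here is a minimal sketch of k-fold cross validation; train_fn and error_fn are illustrative placeholders for whatever model and metric are being used.

```python
import numpy as np

def k_fold_error(X, y, train_fn, error_fn, k=10):
    """Estimate validation error by k-fold cross validation.

    `train_fn(X, y)` returns a fitted model and `error_fn(model, X, y)` scores
    it; the data is split into k folds, each used once as the validation set."""
    folds = np.array_split(np.random.permutation(len(X)), k)
    errors = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[train_idx], y[train_idx])
        errors.append(error_fn(model, X[val_idx], y[val_idx]))
    return np.mean(errors)
```

Setting k = len(X) recovers leave-one-out cross validation, at the cost of more training runs.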
• Often, one will have many possible features, but only a small proportion will be relevant to
the learning algorithm. However, given d features there are 2d possible subsets, so trying each
subset is not an option.
• One heuristic idea is to use a greedy “forward search” which adds in features one at a time, at
each step adding the one that results in the best validation error. This requires O(d2 ) calls to
the learning algorithm. We can also do “backward search”, removing the least useful feature at
each step. Another option is to see the order in which the features “turn on” as λ is decreased
in lasso regularization.
• These options are “wrapper model” feature selection algorithms, because they call the learning
algorithm as a subroutine. A “filter” feature selection method computes a score for each feature
reflecting how informative it is. Examples of scoring functions include the correlation and the
mutual information,
\text{MI}(x_i, y) = \sum_{x_i, y} p(x_i, y) \log \frac{p(x_i, y)}{p(x_i)\, p(y)}.
We then take the top k features, where k is chosen to minimize validation error.
• In information theory, there are statistics such as the Akaike information criterion or Bayesian
information criterion, which trade off how well the model fits the data against its complexity.
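As a concrete example of a filter method, here is a minimal sketch that estimates the mutual information between a discrete feature and the labels from empirical counts; it is illustrative, not an excerpt from any library.

```python
import numpy as np

def mutual_information(x, y):
    """Empirical mutual information between a discrete feature and the labels.

    Estimates p(x, y), p(x), p(y) from counts and sums p(x,y) log[p(x,y)/(p(x)p(y))]."""
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))
            p_x, p_y = np.mean(x == xv), np.mean(y == yv)
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi

# Scoring each feature: rank the columns of X by mutual_information(X[:, j], y).
```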
• In almost all cases, after putting together a machine learning algorithm, the result just “won’t
work”. The outputs will be qualitatively wrong, and fixing it is like debugging.
• To see if there’s enough data, try plotting a “learning curve”, i.e. the cost functions as a function
of the number of data points in the training set. If these curves have flattened out, the data is
sufficient to train the model, though the model itself might still exhibit high bias.
• We’ve already discussed diagnosing bias and variance above. There can also be issues with the
learning procedure, e.g. gradient descent might not be converging. To diagnose this, it’s useful
to run sanity checks, such as plotting the data, or comparing cost functions to a dummy model
such as the one that always predicts the average.
• It is also important to clean and sanity check the data, by looking at summary statistics such
as the range. Sometimes, features can be missing entirely, or have nonsensical values. In
these cases, the corresponding data points can be thrown out entirely, or one can use “data
imputation”, filling the missing values in with a reasonable reference value.
• There are many other possible subtle issues with data. For example, the full data set might
contain duplicates, which then appear in both the training and testing set, underestimating
the generalization error. Or, the data might have irrelevant features, e.g. the infamous story of
an algorithm that “learned” to detect camouflaged tanks by looking at the weather, since all
photos with tanks were taken on cloudy days.
• Another possible issue is that one is minimizing the wrong cost function, i.e. a lower cost function
does not correspond to qualitatively good performance. For example, in spam detection one
might want to penalize false positives more than false negatives, but the logistic regression cost
function does not account for this.
• Hyperparameters should generally be tuned by trying values on a log scale. For more than
one hyperparameter, try an automated “grid search”. This can get very inefficient for many
hyperparameters, in which case one can try specialized hyperparameter selection algorithms,
such as random search (values picked randomly from a pre-specified distribution) or Bayesian
hyperparameter optimization (choosing the next values to try based on performance of past
ones). There are even gradient-based or evolutionary optimization techniques.
Example. Consider a reinforcement learning setup for flying a helicopter, which is trained to
minimize a cost function in a simulator, but qualitatively doesn’t work well in reality. If the setup
works qualitatively well in simulation but not in reality, then the problem is in the simulator. (For
example, the simulated detector inputs might have more or less noise than in reality.) Otherwise, if
the real-life human control achieves a lower cost function than the algorithm, then the cost function
is not being minimized properly. But if the human has a higher cost function, then the cost function
itself is not reflecting good autonomous flight.
Example. Consider a facial recognition algorithm. Often, such algorithms are made of many
components strung together, such as one that cleans up the image, one that detects the face, one
that detects the eyes, and so on. In this case, we can determine how much error is attributable
to each component by plugging in the ground truth and seeing how the accuracy changes. An
“ablative analysis” runs in the opposite direction. We can see how much, e.g. a new feature helped
a classifier by removing it entirely. This is important in applications where speed is a concern, or
in research where one wants to understand the reason for good performance.
3.3 Evaluation Metrics
• Suppose we set a threshold for the output score, so that everything above the threshold is classified as positive. Let the number of positive and negative examples be P and N, with a total number of examples S. Then the prevalence of positive examples is P/S.
• Upon applying the classifier, the positive examples are split into true positives and false negatives,
while the negative examples are split into true negatives and false positives. These four numbers
are typically presented together in a “confusion matrix”. In statistics, we call false positives
type I errors, and false negatives type II errors.
• The point of using evaluation metrics beyond just the accuracy is that often the classes are asymmetric. For example, often the positive class is rare, so it is reasonable to define metrics that focus on it; a typical cost function would be swamped by the majority class.
• We can also use a weighted accuracy, which assigns a weight to each entry in the confusion
matrix. This is useful if the weights reflect our utility, e.g. they can account for false positives
hurting more or less than false negatives.
• We can adjust the threshold that divides positively and negatively labeled examples, leading to a tradeoff between sensitivity and specificity. This can be visualized using a ROC (receiver operating characteristic) curve, which plots the two against each other. The general quality of the classifier can be quantified by the area under the ROC curve. Note that randomly guessing positive with probability p would produce a straight ROC curve, and hence an area of 1/2. Similarly, we can plot the precision against the recall; a sketch of these metrics is given below.
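The following sketch computes a confusion matrix at a fixed threshold, the resulting precision and recall, and the area under the ROC curve by sweeping the threshold; the synthetic labels and scores are placeholders for a real classifier's output.

```python
import numpy as np

def confusion_matrix(y_true, scores, threshold):
    # split examples into true/false positives and negatives at this threshold
    pred = scores >= threshold
    tp = np.sum(pred & (y_true == 1)); fn = np.sum(~pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0)); tn = np.sum(~pred & (y_true == 0))
    return tp, fp, fn, tn

def roc_auc(y_true, scores):
    # sweep the threshold over all scores and integrate TPR vs FPR by trapezoids
    thresholds = np.sort(np.unique(scores))[::-1]
    tpr, fpr = [0.0], [0.0]
    for t in thresholds:
        tp, fp, fn, tn = confusion_matrix(y_true, scores, t)
        tpr.append(tp / (tp + fn)); fpr.append(fp / (fp + tn))
    return np.trapz(tpr, fpr)

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
scores = np.clip(0.5 * y + rng.normal(0.3, 0.2, size=1000), 0, 1)  # imperfect classifier
tp, fp, fn, tn = confusion_matrix(y, scores, 0.5)
print("precision:", round(tp / (tp + fp), 3), "recall/sensitivity:", round(tp / (tp + fn), 3))
print("AUC:", round(roc_auc(y, scores), 3))
```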
• Everything mentioned above only depends on the ranking of the scores. However, it is also
useful to have a model make “confident” predictions, i.e. if the scores are probabilities, positive
examples should have high probabilities. This is just the motivation behind the standard log-loss
cost function used in logistic regression, but it can also be used as an evaluation metric for
other models.
• Calibration is the property that data points assigned probability p turn out positive with probability p. This is quantified by the Brier score, which is the mean square deviation of the predictions h(x^{(i)}) from the y^{(i)}. We used this as a cost function for linear regression, and in the context of classification it is useful because it is a proper scoring rule, i.e. it is minimized when the classifier is calibrated.
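As a small illustration of why the Brier score rewards calibration, the sketch below compares a predictor that reports the true probabilities against one that rounds every prediction to 0 or 1; the simulated probabilities are arbitrary.

```python
import numpy as np

def brier_score(probs, y):
    # mean squared difference between predicted probabilities and 0/1 outcomes
    return np.mean((probs - y) ** 2)

# compare a calibrated predictor to an overconfident one on synthetic data
rng = np.random.default_rng(0)
true_p = rng.uniform(0.1, 0.9, size=10000)
y = (rng.random(10000) < true_p).astype(float)
calibrated = true_p                           # reports the true probability
overconfident = (true_p > 0.5).astype(float)  # pushes every prediction to 0 or 1
print("calibrated Brier score:   ", round(brier_score(calibrated, y), 4))
print("overconfident Brier score:", round(brier_score(overconfident, y), 4))
```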
Some of these ideas about scoring rules come from decision theory, which is in turn tied to game
theory and Bayesian statistics.
3.4 PAC Learning
• We assume that our n training examples are drawn iid from the same distribution D. For concreteness, we focus on binary classification.
• A hypothesis is a function from the sample space to {0, 1}. For a hypothesis h, we define the
training error (also called the empirical risk) as
\[ \hat{\epsilon}(h) = \frac{1}{n} \sum_{i=1}^n 1\{ h(x^{(i)}) \neq y^{(i)} \}. \]
The generalization error is the expected error on a new data point drawn the same way,
\[ \epsilon(h) = P_{(x, y) \sim \mathcal{D}}\big( h(x) \neq y \big). \]
The assumption that this test point is drawn from the same distribution as the training data is one of the strongest assumptions of PAC learning.
• We consider a set of hypotheses H. The simplest way to pick one is to use the one that minimizes the empirical risk,
\[ \hat{h} = \operatorname*{argmin}_{h \in \mathcal{H}} \hat{\epsilon}(h). \]
Our goal is to show that ε(ĥ) is, with high probability, not too much higher than ε̂(ĥ), i.e. that ĥ is probably approximately correct. This gives a theoretical guarantee on performance.
The Hoeffding inequality/Chernoff bound states that for iid random variables Z_i ∼ Bernoulli(φ), the mean φ̂ = (1/n) Σ_i Z_i is bounded near its expectation value by
\[ P\big( |\hat{\phi} - \phi| > \gamma \big) \leq 2 \exp(-2\gamma^2 n). \]
• First, for each hypothesis h_i we can bound the difference between the training error and generalization error using the Hoeffding inequality,
\[ P\big( |\epsilon(h_i) - \hat{\epsilon}(h_i)| > \gamma \big) \leq 2 \exp(-2\gamma^2 n). \]
We can turn this into a bound on all k hypotheses at once using the union bound,
\[ P\big( \exists\, h \in \mathcal{H} : |\epsilon(h) - \hat{\epsilon}(h)| > \gamma \big) \leq 2k \exp(-2\gamma^2 n) \equiv \delta. \]
• We can use the above result twice to bound the generalization error of ĥ relative to the actual best hypothesis h* = argmin_{h∈H} ε(h). With probability at least 1 − δ,
\[ \epsilon(\hat{h}) \leq \epsilon(h^*) + 2\gamma, \qquad \gamma = \sqrt{\frac{1}{2n} \log \frac{2k}{\delta}}. \]
This bound illustrates the bias-variance tradeoff, with ε(h*) representing the bias and 2γ representing the variance, because at fixed δ, increasing k increases γ.
• Unfortunately, many of our algorithms use infinite |H|, since they have real-valued coefficients. Informally, real numbers have finite machine precision, so a hypothesis class parametrized with d real numbers really contains (2^64)^d hypotheses. Plugging this in above gives a sample complexity n ≥ O(d/γ^2); a numerical illustration is given below.
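To get a feel for the numbers, the snippet below evaluates the sample complexity implied by the union bound above, n ≥ log(2k/δ)/(2γ²), for a hypothesis class with d real parameters stored at 64-bit precision; the particular values of d, γ, and δ are arbitrary.

```python
import numpy as np

def sample_complexity(k, gamma, delta):
    # n >= (1 / (2 gamma^2)) log(2k / delta) guarantees |eps(h) - eps_hat(h)| <= gamma
    # for all k hypotheses simultaneously, with probability at least 1 - delta
    return np.log(2 * k / delta) / (2 * gamma ** 2)

# a class with d real parameters at 64-bit precision has roughly k = (2^64)^d hypotheses
d, gamma, delta = 10, 0.05, 0.01
k = 2.0 ** (64 * d)          # fine as a float for moderate d; overflows for very large d
print("required n:", int(np.ceil(sample_complexity(k, gamma, delta))))
```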
• More rigorously, we can characterize the size of a hypothesis class using the Vapnik-Chervonenkis
dimension VC(H). For binary classification, the VC dimension is defined as the largest n for
which there exists a set of n points that the hypothesis class can shatter, i.e. for which H contains hypotheses realizing all 2^n possible labelings of the points.
• For example, linear classifiers over two features have a VC dimension of 3, and a finite hypothesis class of size k has a VC dimension of at most ⌊log_2 k⌋. For most “reasonable” infinite hypothesis classes, the VC dimension is roughly equal to the number of real parameters.
• One of the most famous theorems of learning theory is that if VC(H) = d, then with probability at least 1 − δ,
\[ |\epsilon(h) - \hat{\epsilon}(h)| \leq O\!\left( \sqrt{ \frac{d}{n} \log \frac{n}{d} + \frac{1}{n} \log \frac{1}{\delta} } \right) \quad \text{for all } h \in \mathcal{H}. \]
4 Reinforcement Learning
4.1 Markov Decision Processes
Reinforcement learning (RL) solves qualitatively different problems than supervised learning.
• As a simple example, we will model the environment as a Markov decision process (MDP),
containing a set S of states and a set A of actions. If one is at state s at some timestep and takes
action a, then the probability distribution of the next state is Psa . Finally, at each timestep
there is a reward R : S × A → R. For notational simplicity, we’ll suppose that the reward only
depends on the state, though this assumption can be lifted without much trouble.
• The total payoff of a sequence of states is
\[ R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots \]
where γ ∈ [0, 1) is called the discount factor. The goal of the RL algorithm is to maximize the expected payoff. The purpose of the discount factor is to make the sum well-behaved, and to give the algorithm a sense of urgency, getting positive rewards as quickly as possible and deferring negative rewards as long as possible.
• A policy (also called a controller, in the context of control theory) is a function π : S → A which prescribes an action for each state. The value function of a policy is the expected payoff starting from state s and acting according to π,
\[ V^\pi(s) = E\big[ R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots \,\big|\, s_0 = s, \pi \big]. \]
Using the definition of V^π(s) on the right-hand side gives the Bellman equations,
\[ V^\pi(s) = R(s) + \gamma \sum_{s' \in S} P_{s \pi(s)}(s') V^\pi(s') \]
where the first term represents the immediate reward, and the second term represents the expected future reward. Focusing on discrete state spaces for now, the Bellman equations are a set of |S| linear equations, which can be used to solve for V^π(s); a small worked example is sketched below.
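For a toy discrete MDP with a fixed policy, solving the Bellman equations is just a linear solve; the transition matrix and rewards below are made up for illustration.

```python
import numpy as np

# a toy 3-state MDP under a fixed policy pi: P[s, s'] = P_{s, pi(s)}(s')
R = np.array([0.0, 0.0, 1.0])          # reward depends only on the state
P = np.array([[0.8, 0.2, 0.0],
              [0.1, 0.6, 0.3],
              [0.0, 0.1, 0.9]])
gamma = 0.9

# Bellman equations: V = R + gamma P V  =>  (I - gamma P) V = R
V = np.linalg.solve(np.eye(3) - gamma * P, R)
print("V^pi =", np.round(V, 3))
```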
• For each state, the optimal value function is the highest value function attainable by any policy, V^*(s) = max_π V^π(s).
• The first possibly unintuitive result is that V^*(s) can be attained, for all states, by a single policy π^*. This policy is defined by
\[ \pi^*(s) = \operatorname*{argmax}_{a \in A} \sum_{s' \in S} P_{sa}(s') V^*(s') \]
and its value function is V^{\pi^*}(s) = V^*(s), because it satisfies the Bellman equations for V^*(s).
• This result is due to the “memoryless” nature of the reward function in an MDP. For example,
if reward was instead given for having the learner return to the initial state, whatever that
initial state was, then V ∗ (s) for each s is attained by a strategy that tries to return to s, so
there is no policy that attains V ∗ (s) for all s simultaneously. (On the other hand, literally any
environment including this one can in principle be viewed as an MDP in a larger state space;
in this case the state space would be S × S where the first factor represents the initial state.)
• There are two useful iterative methods to compute π^*. In value iteration, we find V^* and use it to infer π^*. We initialize V(s) = 0 and update
\[ V(s) := R(s) + \gamma \max_{a \in A} \sum_{s'} P_{sa}(s') V(s'). \]
We loop over the states repeatedly until convergence. This can be done synchronously (updating all the V(s) at once after one full loop), or asynchronously (updating one V(s) at a time); a minimal sketch is given below.
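A minimal sketch of synchronous value iteration on a small randomly generated MDP (the transition tensor and rewards are placeholders) might look as follows.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    # P has shape (|S|, |A|, |S|): P[s, a, s'] = P_{sa}(s')
    V = np.zeros(R.shape[0])
    while True:
        # synchronous Bellman backup: V(s) := R(s) + gamma max_a sum_{s'} P_sa(s') V(s')
        V_new = R + gamma * np.max(P @ V, axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

rng = np.random.default_rng(0)
P = rng.random((4, 2, 4)); P /= P.sum(axis=2, keepdims=True)   # random toy transitions
R = np.array([0.0, 0.0, 0.0, 1.0])
V_star = value_iteration(P, R)
pi_star = np.argmax(P @ V_star, axis=1)    # greedy policy with respect to V*
print("V* =", np.round(V_star, 3), " pi* =", pi_star)
```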
• To understand why value iteration works, it’s easier to think in terms of synchronous updates.
Suppose time suddenly ends at t = 0. After the first iteration, V (s) = R(s), which means V (s)
accounts for the immediate reward only; it thus computes the optimal value as if the agent
started at t = 0. After the second iteration, V (s) contains both the immediate reward and the
best possible reward in the next timestep, so it’s as if the agent started at t = −1. Thus, we
are computing V ∗ (s) in “dynamic programming” style. As long as the rewards are bounded,
this process converges over many updates because of the discount factor γ.
• Formally, the update step defines a “Bellman backup operator” on the space of value functions.
We can show convergence by demonstrating that it is a contraction mapping.
• The second method is policy iteration. We initialize π randomly, then repeatedly solve the Bellman equations for V^π and update
\[ \pi(s) := \operatorname*{argmax}_{a \in A} \sum_{s'} P_{sa}(s') V^\pi(s'), \]
with all states updated synchronously. In other words, π(s) is set to be greedy with respect to the value function defined by the previous π(s).
• The intuition for why policy iteration converges is similar to value iteration. We now suppose
that at time t = 0, the agent is forced to begin executing π0 . After the first iteration, π is the
best policy if the agent starts at t = −1 and knows this will happen at t = 0. After the second
iteration, π is the best policy if the agent starts at t = −2, and so on.
• Unlike value iteration, policy iteration converges to the exact optimal policy, and hence the
exact optimal value function, in a finite number of iterations. Policy iteration tends to be faster
for small |S|, but for larger |S| it becomes impractical because it requires computing V π (s) at
every step, which naively takes O(|S|3 ) time. Most modern applications use value iteration.
The above discussion assumed that everything about the MDP was known in advance, but in reality
many parameters are unknown.
• For example, consider learning to play a videogame. The RL agent knows only the actions A
ahead of time. The states, transition probabilities, and rewards must be learned from experience.
• These constraints make RL nontrivial, especially when the rewards are sparse. In this case, the
agent faces an “explore/exploit” tradeoff, where it must balance the benefit of accruing known rewards against exploring for potentially higher rewards. It also faces the “credit
assignment problem”: if the reward is a binary win or loss, then it can be challenging for the
agent to understand why it won or lost. These issues are exacerbated by the potentially huge
size of the state space, which can be subject to the curse of dimensionality, and the fact that
the agent might not even know exactly what state it’s in.
• As a first step, let’s consider the case where P_sa is unknown. Then the RL agent can infer it from experience using a maximum likelihood estimate,
\[ P_{sa}(s') = \frac{\#\text{times action } a \text{ in state } s \text{ led to } s'}{\#\text{times action } a \text{ was taken in state } s}, \]
taking the estimate to be uniform if the denominator is zero.
• Thus, one can use a modified value iteration. After initializing π randomly, we repeat the following until convergence:
1. Execute π in the MDP for some number of trials.
2. Using the accumulated experience, update the estimates of P_sa (and of R, if it is also unknown).
3. Run value iteration with the estimated parameters to get a new estimate of V.
4. Update π to be greedy with respect to V.
For performance, it’s useful to save V between iterations, so that value iteration doesn’t have to start over from scratch. (A sketch of the estimation step appears below.)
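The sketch below illustrates only the estimation step: transition counts are accumulated from simulated experience (a small amount of smoothing is added so that unvisited state-action pairs default to a uniform estimate) and converted into an estimate of P_sa. In the full loop, this estimate would be fed into value iteration as described above; the toy MDP and random action choices are placeholders.

```python
import numpy as np

n_states, n_actions = 5, 2
counts = np.ones((n_states, n_actions, n_states))   # smoothed transition counts

def record_transition(s, a, s_next):
    counts[s, a, s_next] += 1

def estimated_P():
    # MLE-style estimate: P_sa(s') ~ (# times a in s led to s') / (# times a taken in s)
    return counts / counts.sum(axis=2, keepdims=True)

# simulate experience from a hypothetical true MDP, then compare the estimate to the truth
rng = np.random.default_rng(0)
P_true = rng.random((n_states, n_actions, n_states))
P_true /= P_true.sum(axis=2, keepdims=True)
s = 0
for _ in range(20000):
    a = rng.integers(n_actions)                      # epsilon-greedy exploration would go here
    s_next = rng.choice(n_states, p=P_true[s, a])
    record_transition(s, a, s_next)
    s = s_next
print("max estimation error:", np.max(np.abs(estimated_P() - P_true)).round(3))
```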
• To increase the amount of exploration, we could use “ε-greedy” exploration, where in step 1 we execute a random action with probability ε. However, for larger ε, convergence will be slower. Alternatively, in “Boltzmann” exploration, the chance of exploring a seemingly less promising option is proportional to a Boltzmann factor e^{V/T}. Over time, ε and T can be annealed. Another idea is “intrinsic motivation”, where the RL agent is rewarded for reaching new states.
• For example, an inverted pendulum on a cart has a 4d state space (position, linear velocity,
angle, angular velocity) and a 1d action space (acceleration). A helicopter has a 12d state space
(six for translation, six for rotation) and a 4d action space.
• The state/action space can depend on the level of description. For example, if we want to
navigate a car over a huge distance, we can use a 2d state space (its current position) and a 2d
action space (its velocity). But for navigating a specific road, we actually want to include the
velocity in the state space; the action space would instead be the position of the gas pedal and
orientation of the steering wheel. If we wanted to model avoiding a car crash, we might need to
account for the time it takes to turn the wheel, so its orientation would go into the state space
and its angular velocity would be in the action space, and so on.
• A simple idea is to discretize the state space. This works well when the state space is low-
dimensional (in practice, d ≤ 3, though higher d can be made to work with clever choices
of discretization), but then degrades due to the curse of dimensionality. Discretizations can
approximate any function, just like neural networks, but they are inefficient and generalize
poorly. For example, they give no prediction at all for bins containing no observations, even if
they are surrounded by bins with observations, because they ignore this structure.
• Instead, we will consider approximating V ∗ directly, using “fitted value iteration”. As a first
step, we assume we have a model/simulator for the environment, which is a black box that
takes the current state and action and outputs a next state.
• In some cases, such as in some games, the model is just the known Psa themselves. In other
cases, such as in robotics, the model can be derived from the laws of physics. Or, the model
can be inferred from experience as described above, in which case constructing a model is just
a regression problem. The model’s output could be deterministic, or it could be stochastic,
e.g. by adding Gaussian noise. This noise helps prevent the RL agent from learning “brittle”
strategies.
• For concreteness, we suppose S is continuous but |A| is small. Recall that in value iteration, we perform the update
\[ V(s) := R(s) + \gamma \max_{a \in A} E_{s' \sim P_{sa}}[V(s')]. \]
The main idea of fitted value iteration is to use the model to estimate the right-hand side, while approximating the value function as a linear function of features of the state,
\[ V(s; \theta) = \theta^T \phi(s). \]
In each step of fitted value iteration, we start with a set of coefficients θ. We sample n states s^{(i)} and use the model to estimate E[V(s'; θ) | s' ∼ P_{s^{(i)} a}] for each action a, giving values y^{(i)} for the right-hand side. Finally, we set the new θ by fitting it to the data points (s^{(i)}, y^{(i)}) as usual in linear regression; a sketch is given below.
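Here is a rough sketch of fitted value iteration with quadratic features of a 2d state and a small discrete action set; the dynamics model, reward, feature map, and all constants are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
actions = [-1.0, 0.0, 1.0]                       # small, discrete action set
gamma, n_samples, n_next = 0.95, 200, 10

def simulate(s, a):
    # hypothetical black-box model of the environment: next state given (s, a)
    return 0.9 * s + 0.1 * a + 0.05 * rng.normal(size=s.shape)

def reward(s):
    return -np.sum(s ** 2, axis=-1)              # reward for staying near the origin

def phi(s):
    # simple quadratic features of a 2d state
    return np.column_stack([np.ones(len(s)), s, s ** 2])

theta = np.zeros(5)
for _ in range(50):                              # fitted value iteration sweeps
    S = rng.uniform(-1, 1, size=(n_samples, 2))  # sampled states s^(i)
    q = []
    for a in actions:
        # Monte Carlo estimate of E[V(s'; theta)] under the model, for each sampled state
        next_vals = np.mean([phi(simulate(S, a)) @ theta for _ in range(n_next)], axis=0)
        q.append(reward(S) + gamma * next_vals)
    y = np.max(q, axis=0)                        # y^(i) = max_a estimated backup
    theta = np.linalg.lstsq(phi(S), y, rcond=None)[0]   # refit V(s; theta) = theta^T phi(s)
print("fitted theta:", np.round(theta, 3))
```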
• Fitted value iteration doesn’t have the same kinds of convergence guarantees as value iteration,
but it works well for many problems in practice.
• The same idea holds for different forms of V (s; θ). For instance, “deep RL” essentially consists
of making V (s; θ) a neural network.
• Once we have a suitable approximation for V ∗ , we can use the model again to infer π ∗ . Note
that if the model has stochastic noise, then we can turn the noise off for this step to save time.
This doesn’t make the model brittle, since the noise already has suitably smoothed V ∗ .
• This approach is “model based”, but there are also “model free” approaches that never construct
Psa at all; the contrast is like that between generative and discriminative supervised learning
algorithms. Both approaches are used in practice. Model free approaches are more general, but
take much longer to train, so model based approaches are more common in robotics.
• For example, Q-learning is a model-free approach where everything is phrased in terms of the Q-function Q(s, a), which gives the expected payoff if we take action a in state s and act optimally afterward. The Q-function is learned directly; no reference is made to what future states a leads to.
5 Unsupervised Learning
5.1 Clustering
Consider the grouping of data x^{(i)} into cohesive clusters. This is an unsupervised learning problem, as no labels y^{(i)} are given.
• In k-means clustering, we initialize k cluster centroids µ_j (e.g. at randomly chosen data points) and alternate two steps. First, we assign each point to its nearest centroid, c^{(i)} = argmin_j ‖x^{(i)} − µ_j‖^2. Then we reassign each cluster centroid to the centroid of the data points inside,
\[ \mu_j = \frac{\sum_i 1\{ c^{(i)} = j \}\, x^{(i)}}{\sum_i 1\{ c^{(i)} = j \}}. \]
• To see that this converges, note that we can define a distortion/cost function
\[ J(\mu_j, c^{(i)}) = \sum_i \| x^{(i)} - \mu_{c^{(i)}} \|^2. \]
The two update steps in k-means perform coordinate descent, minimizing this cost function with respect to the c^{(i)} and µ_j respectively. Thus, the cost function decreases monotonically, though it may be trapped in a local minimum. To address this, we can run the algorithm multiple times and keep the run with the least cost; future data points are then assigned to the cluster with the nearest centroid. (A sketch with multiple restarts is given below.)
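A minimal numpy sketch of k-means with multiple random restarts, run on synthetic blob data, might look as follows.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]        # centroids initialized at data points
    for _ in range(n_iters):
        # assignment step: c^(i) = argmin_j ||x^(i) - mu_j||^2
        c = np.argmin(((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2), axis=1)
        # update step: mu_j = mean of the points assigned to cluster j
        mu = np.array([X[c == j].mean(axis=0) if np.any(c == j) else mu[j] for j in range(k)])
    J = np.sum((X - mu[c]) ** 2)                              # distortion of the final clustering
    return mu, c, J

# toy data: three well-separated Gaussian blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.3, size=(100, 2)) for loc in ([0, 0], [3, 0], [0, 3])])
mu, c, J = min((kmeans(X, 3, seed=s) for s in range(5)), key=lambda r: r[2])  # restarts
print("centroids:\n", np.round(mu, 2), "\ndistortion:", round(J, 2))
```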
• We can also get variations on k-means by changing the distance function. However, for a
general distance function, the µj update step can’t be done in closed form. Instead we can use
k-medoids, where the µj are required to be “exemplars”, i.e. members of the data set. When
we update µj , we thus only need to minimize over the data set.
• In general, the cost function will be lower the higher k is, but taking k arbitrarily high would
be useless. The “best” value of k depends on what the clustering is being used to do, but an
old rule of thumb is that we should stop when Jmin (k) hits an “elbow”, after which it stops
falling as steeply.
• A more principled approach is “gap statistics”, where we estimate how large Jmin (k) is expected
to be for a hypothetical data set without structure, and choose the k so that the actual Jmin (k)
looks best in comparison.
• Alternatively, in some cases the clustering might be “semi-supervised”, where a small number
of data points come with labels. In this case, we can split the labeled points into training and
validation sets, fix the labels of the points in the training set while training on the unlabeled
points, and choose k to minimize the validation error.
5.2 Expectation Maximization
• The simplest technique for density estimation would just be fitting a Gaussian to the whole data set. But often this is not general enough, because the data will have clusters. The next simplest thing to do is to fit a mixture of Gaussians,
\[ p(x^{(i)}, z^{(i)}) = p(x^{(i)} | z^{(i)})\, p(z^{(i)}), \qquad z^{(i)} \sim \mathrm{Multinomial}(\phi), \qquad x^{(i)} | z^{(i)} = j \sim \mathcal{N}(\mu_j, \Sigma_j), \]
where the z^{(i)} indicate which Gaussian to use. (Statisticians would call the individual z^{(i)} “categorical” distributed, and the set of total counts of each z^{(i)} value “multinomial” distributed; the distinction between the two is analogous to that between the Bernoulli and binomial distributions. However, we will treat the two as interchangeable.)
This is similar to GDA, but the z (i) are now unobserved (“latent”) variables, rather than
provided labels. This makes it impossible to maximize the log-likelihood in closed form.
• In the E-step, we compute the weights w_j^{(i)} = p(z^{(i)} = j | x^{(i)}; φ, µ, Σ) using Bayes’ rule with the current parameters. In the M-step, we treat these weights w_j^{(i)} as soft GDA labels and update the parameters as
\[ \phi_j = \frac{1}{n} \sum_i w_j^{(i)}, \qquad \mu_j = \frac{\sum_i w_j^{(i)} x^{(i)}}{\sum_i w_j^{(i)}}, \qquad \Sigma_j = \frac{\sum_i w_j^{(i)} (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^T}{\sum_i w_j^{(i)}}. \]
We can think of the first step as computing our “expectation” for the z^{(i)}, and the second step as carrying out a “maximization”; a sketch is given below.
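Below is a hedged sketch of EM for a mixture of Gaussians on synthetic two-cluster data; a small amount of regularization is added to each covariance update (anticipating the collapse issue mentioned in the next bullet), and all constants are illustrative.

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    d = X.shape[1]
    diff = X - mu
    expo = -0.5 * np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)
    return np.exp(expo) / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

def em_gmm(X, k, n_iters=100, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    phi = np.full(k, 1.0 / k)
    mu = X[rng.choice(n, size=k, replace=False)]
    Sigma = np.stack([np.eye(d)] * k)
    for _ in range(n_iters):
        # E-step: soft assignments w_j^(i) = p(z^(i) = j | x^(i))
        w = np.column_stack([phi[j] * gaussian_pdf(X, mu[j], Sigma[j]) for j in range(k)])
        w /= w.sum(axis=1, keepdims=True)
        # M-step: weighted maximum likelihood updates
        phi = w.mean(axis=0)
        for j in range(k):
            mu[j] = w[:, j] @ X / w[:, j].sum()
            diff = X - mu[j]
            Sigma[j] = (w[:, j, None] * diff).T @ diff / w[:, j].sum() + 1e-6 * np.eye(d)
    return phi, mu, Sigma

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0], 0.5, size=(200, 2)), rng.normal([3, 3], 1.0, size=(200, 2))])
phi, mu, Sigma = em_gmm(X, 2)
print("mixing weights:", np.round(phi, 2), "\nmeans:\n", np.round(mu, 2))
```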
• Like k-means, the mixture of Gaussians model can get stuck in local optima. One particularly
bad possibility is if one of the clusters centers on only one data point; in this case it will shrink
indefinitely. This can be prevented by regularization.
• Probabilistic unsupervised algorithms can be tested for overfitting by using a training and
testing set. We maximize the log-likelihood over the training set, and evaluate it on the testing
set; if the two are dramatically different, the model is overfit.
Just as for k-means, we would like a more general way of viewing this algorithm which lets us
establish convergence.
• The EM algorithm is the analogue of maximum likelihood estimation in the presence of latent
variables. It alternates between computing expectations for the distribution of the latent
variables given the parameters, and maximizing the likelihood over the parameters given those
latent variable distributions.
• Jensen’s inequality states that for a convex function f and random variable X,
\[ E[f(X)] \geq f(E[X]), \]
and if f is strictly convex, equality holds if and only if X = E[X] with probability one.
• Now, for each data point i, let Q_i(z) be a probability distribution over the z’s. Abbreviating all the parameters as θ, the log-likelihood is
\[ \ell(\theta) = \sum_i \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta) = \sum_i \log \sum_{z^{(i)}} Q_i(z^{(i)}) \, \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}. \]
This has the form of the logarithm of the expectation value of p(x^{(i)}, z^{(i)}; θ)/Q_i(z^{(i)}), distributed according to Q_i. Since the logarithm is concave, Jensen’s inequality holds in reverse, giving
\[ \ell(\theta) \geq \sum_{i, z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \equiv J(Q, \theta). \]
Thus, for each choice of the Q_i, we have a lower bound on the log-likelihood. The quantity J(Q, θ) is also called the evidence lower bound (ELBO).
Thus, for each choice of Qi , we have a lower bound on the log-likelihood. The quantity J(Q, θ)
is also called the evidence lower bound (ELBO).
• The equality case of Jensen’s inequality is achieved if the argument of the logarithm does not depend on z^{(i)}. This is achieved for the probability distribution
\[ Q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_z p(x^{(i)}, z; \theta)} = p(z^{(i)} | x^{(i)}; \theta), \]
i.e. the posterior over the latent variables, for which J(Q, θ) = ℓ(θ).
• Now the EM algorithm can be phrased as follows. If the current parameters are θ_0, then in the E-step we construct Q_0(z^{(i)}) = p(z^{(i)} | x^{(i)}; θ_0), and in the M-step we set the new value of θ to
\[ \theta := \operatorname*{argmax}_\theta J(Q_0, \theta). \]
Then ℓ(θ) ≥ J(Q_0, θ) ≥ J(Q_0, θ_0) = ℓ(θ_0), so the log-likelihood increases monotonically, thus establishing convergence. However, we are not guaranteed to converge to the global maximum of ℓ.
• Alternatively, we can choose to view J(Q, θ) as the cost function. In this case the analogy with
k-means is very close: the E-step maximizes over Q, while the M-step maximizes over θ.
• The EM algorithm is also useful for semi-supervised learning, which is a supervised learning
task where only some of the data points have labels. For the other data points, the labels are
treated as latent variables.
Note. We can also think of J(Q, θ) as (minus) the KL divergence from Q to the joint distribution p(x, z; θ). This is useful because we can also write it in the forms
\[ J(Q, \theta) = E_{z \sim Q}[\log p(x | z; \theta)] - D_{\mathrm{KL}}(Q \,\|\, p_z) = \log p(x) - D_{\mathrm{KL}}(Q \,\|\, p_{z|x}). \]
The first form tells us that the M-step maximizes J by optimizing the first term, i.e. by maximizing the log-likelihood. The second form tells us that the E-step maximizes J by optimizing the second term, i.e. by setting Q = p_{z|x}.
Example. Let’s properly derive the mixture of Gaussians model, starting from this notation. The E-step is the same, if we identify w_j^{(i)} = Q_i(z^{(i)} = j). In the M-step, we maximize J(Q_0, θ) over the parameters. Differentiating with respect to the means,
\[ \frac{\partial J}{\partial \mu_\ell} = -\frac{1}{2} \frac{\partial}{\partial \mu_\ell} \sum_{ij} w_j^{(i)} (x^{(i)} - \mu_j)^T \Sigma_j^{-1} (x^{(i)} - \mu_j) = \sum_i w_\ell^{(i)} \Sigma_\ell^{-1} (x^{(i)} - \mu_\ell) \]
and setting this to zero recovers our previously stated result. To set the φ, note that
\[ J(Q_0, \theta) \supset \sum_{ij} w_j^{(i)} \log \phi_j. \]
To maximize this, we use Lagrange multipliers to account for the constraint Σ_j φ_j = 1, which again
recovers the previous result. The maximization over Σj is a bit more complicated, but similar.
• Suppose we have n data points x^{(i)} ∈ R^d that we wish to fit a Gaussian to. The maximum likelihood estimators are simply
\[ \mu = \frac{1}{n} \sum_i x^{(i)}, \qquad \Sigma = \frac{1}{n} \sum_i (x^{(i)} - \mu)(x^{(i)} - \mu)^T. \]
However, in some cases we have d > n, which means Σ is singular. Geometrically, all the
probability is concentrated on the hyperplane containing the data points.
• We could avoid singular Σ by restricting it, such as by making it diagonal or even proportional
to the identity. But this prevents us from modeling correlations between the variables. We
could account for this by letting some off-diagonal elements of Σ be nonzero, but we don’t know
ahead of time which – if we already knew, then we wouldn’t have to fit Σ in the first place!
• As for the mixture of Gaussians model, the resolution is to introduce a latent random variable which parametrizes the correlated degrees of freedom. In factor analysis, we let these be a set of k-dimensional “factors”,
\[ z \sim \mathcal{N}(0, I), \qquad z \in \mathbb{R}^k \]
where k < d, and model the data as generated by
\[ x = \mu + \Lambda z + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \Psi) \]
where Ψ is diagonal with positive entries, and Λ is a d × k matrix.
• Using standard properties of Gaussians and the definition of the covariance,
\[ \begin{pmatrix} z \\ x \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} 0 \\ \mu \end{pmatrix}, \begin{pmatrix} I & \Lambda^T \\ \Lambda & \Lambda\Lambda^T + \Psi \end{pmatrix} \right). \]
The marginal distribution of x is
\[ x \sim \mathcal{N}(\mu, \Lambda\Lambda^T + \Psi) \]
and the log-likelihood is
\[ \ell(\mu, \Lambda, \Psi) = \log \prod_{i=1}^n \frac{1}{(2\pi)^{d/2} |\Lambda\Lambda^T + \Psi|^{1/2}} \exp\!\left( -\frac{1}{2} (x^{(i)} - \mu)^T (\Lambda\Lambda^T + \Psi)^{-1} (x^{(i)} - \mu) \right). \]
• To train the model, we maximize the log-likelihood. Now, the latent variable z does not appear
in the likelihood. In fact, we could have skipped introducing it entirely, and just said that
we are restricting Σ to the form ΛΛT + Ψ. But maximizing the log-likelihood is hard! We
introduced z not because it’s necessary to formulate the model, but because it allows us to use
the EM algorithm, which is the best maximization method for the job.
• For brevity, we suppress the index i labeling the training examples. In the E-step, we set
\[ Q(z) = p(z | x; \mu, \Lambda, \Psi) = \frac{1}{(2\pi)^{k/2} |\Sigma_{z|x}|^{1/2}} \exp\!\left( -\frac{1}{2} (z - \mu_{z|x})^T \Sigma_{z|x}^{-1} (z - \mu_{z|x}) \right). \]
Using our earlier results for conditionals of Gaussians,
\[ \mu_{z|x} = \Lambda^T (\Lambda\Lambda^T + \Psi)^{-1} (x - \mu), \qquad \Sigma_{z|x} = I - \Lambda^T (\Lambda\Lambda^T + \Psi)^{-1} \Lambda. \]
Since µ doesn’t depend on the z (i) , it can be calculated in the first step and need not be updated.
• With a bit more effort, we can show that Ψ_ii = Φ_ii, where
\[ \Phi = \frac{1}{n} \sum_i \Big( x^{(i)} x^{(i)T} - x^{(i)} \mu_{z^{(i)}|x^{(i)}}^T \Lambda^T - \Lambda \mu_{z^{(i)}|x^{(i)}} x^{(i)T} + \Lambda \big( \mu_{z^{(i)}|x^{(i)}} \mu_{z^{(i)}|x^{(i)}}^T + \Sigma_{z^{(i)}|x^{(i)}} \big) \Lambda^T \Big). \]
• Note that the M-step in principle depends on the full distribution of the z (i) , rather than just
the expected value µz (i) |x(i) , as can be seen from the Σz (i) |x(i) terms above. Some sources will
incorrectly state that we only need the expectation value E[z (i) ] from the E-step. This is a
misconception which arises because of the name of the E-step, and because this happens to be
true in the simplest examples, such as the Gaussian mixture model.
5.3 Principal and Independent Component Analysis
Principal component analysis (PCA) reduces the dimensionality of data by projecting it onto the directions of greatest variance.
• As a first step, we should normalize the data by replacing the features with their z-scores. If all the features are known to be similar, e.g. if they are the pixel values in an image, then we only have to subtract their means.
• The variance of the data points is (1/n) Σ_i x^{(i)} x^{(i)T}. The first principal component is the direction that captures the most of this variance, i.e. the unit vector u that maximizes
\[ \frac{1}{n} \sum_i (u^T x^{(i)})^2 = u^T \left( \frac{1}{n} \sum_i x^{(i)} x^{(i)T} \right) u \equiv u^T \Sigma u. \]
Equivalently, it is the direction that minimizes the sum of the squared distances from the data points to the line it spans. Note that it is essential to normalize the data, or else the features with larger values will always dominate.
• As we showed earlier, the critical points of this optimization problem are when u is an eigenvector
of Σ. Thus the first principal component corresponds to the eigenvector of Σ with the largest
eigenvalue, the second to the second largest, and so on.
• If there are d features, finding all the eigenvectors of Σ takes O(d^3) time. However, finding the first principal component only takes O(d^2) time per iteration of power iteration: if we start with any vector and repeatedly apply Σ and normalize the vector, then it will converge to the desired eigenvector. We can then restrict to the orthogonal subspace and repeat to find the second principal component, so finding k principal components takes O(d^2 k) time; a sketch is given below.
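A sketch of PCA by power iteration with deflation, assuming the data has already been normalized, is given below; the correlated synthetic data is a placeholder.

```python
import numpy as np

def pca_power_iteration(X, k, n_iters=200, seed=0):
    # assumes X has already been normalized (zero-mean columns)
    n, d = X.shape
    Sigma = X.T @ X / n
    rng = np.random.default_rng(seed)
    components = []
    for _ in range(k):
        u = rng.normal(size=d)
        for _ in range(n_iters):
            u = Sigma @ u
            for v in components:
                u -= (u @ v) * v          # deflate: stay orthogonal to earlier components
            u /= np.linalg.norm(u)
        components.append(u)
    U = np.column_stack(components)       # d x k matrix of principal directions
    return U, X @ U                       # reduced feature vectors y^(i) = U^T x^(i)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])   # correlated 2d data
X -= X.mean(axis=0)
U, Y = pca_power_iteration(X, 1)
print("first principal direction:", np.round(U[:, 0], 3))
```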
• Finally, given the unit vectors u_i corresponding to the k chosen principal components, we can define the reduced feature vectors
\[ y^{(i)} = \begin{pmatrix} u_1^T x^{(i)} \\ \vdots \\ u_k^T x^{(i)} \end{pmatrix}. \]
• Another way to phrase the above results is that PCA finds the eigenvectors of X T X, where X
is the design matrix; this is equivalent to performing the singular value decomposition of X,
with the principal components corresponding to the right singular vectors of X.
• PCA does not model how the data was generated; it is strictly for dimensionality reduction. For example, it can be used for data compression, or to reduce the dimensionality of input data by identifying redundant features. It can also be used by human beings to gain insight.
• PCA can also be used for noise reduction, by projecting data down into a lower-dimensional
subspace. For example, the “eigenfaces” method uses PCA to identify a subspace of the set of
images of faces. Face matching is performed using the Euclidean distance in this subspace.
Independent component analysis (ICA) can find a basis where the features are independent.
• A classic application of ICA is the cocktail party problem: suppose that we have a room with d
microphones and d speakers. The speakers produce a vector of outputs s at every time, whose
components are iid. The microphone readings are x = As where A is the unknown mixing
matrix. The purpose of ICA is to find the “unmixing” matrix W = A−1 and hence recover the
original speakers.
• ICA is practically used in EEG measurements, where it can separate out artifacts such as
blinking and heartbeats, leaving clean data. Also, when ICA is applied to natural images, the
independent components are essentially “edges”.
• Properly speaking, we aren’t looking for a matrix W , but for an unordered normalized basis in
which the features are independent. This is because the order of the speakers and their scalings
are not defined. Thus, W is not exactly determined, though this isn’t important in practice.
• A more serious problem is if the speakers produce Gaussians. In this case, the joint distribution
is s ∼ N (0, cI), which is rotationally invariant, which means W is completely undetermined.
This problem is unique to Gaussians, and is related to the central limit theorem.
• Let the distribution of each source s_j have probability density p_s, so the joint distribution is
\[ p(s) = \prod_j p_s(s_j). \]
The distribution of x = W^{-1} s is
\[ p(x; W) = |\det W| \prod_j p_s(w_j^T x), \]
where w_j^T is the j-th row of W.
• Summing over the training examples and assuming they are independent, the log-likelihood is
\[ \ell(W) = \sum_i \left( \log |\det W| + \sum_j \log p_s(w_j^T x^{(i)}) \right). \]
Taking the derivative with respect to W gives the stochastic gradient ascent update rule,
\[ W := W + \alpha \left( \begin{pmatrix} p_s'(w_1^T x^{(i)}) / p_s(w_1^T x^{(i)}) \\ \vdots \\ p_s'(w_d^T x^{(i)}) / p_s(w_d^T x^{(i)}) \end{pmatrix} x^{(i)T} + (W^T)^{-1} \right). \]
• To continue we must postulate a form for p_s, which can come from domain knowledge. In the absence of such knowledge, one option is to normalize the data and assume a logistic distribution, i.e. a sigmoid cdf, p_s(s) = g'(s), which will generally perform decently. In this case, p_s'/p_s simplifies to 1 − 2g, as used in the sketch below.
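Here is a rough sketch of this stochastic gradient ascent rule with the logistic choice of p_s, applied to a toy cocktail party problem; the heavy-tailed Laplacian sources, the mixing matrix, the learning rate, and the number of epochs are all illustrative choices, and the recovered W A should only be roughly a scaled permutation.

```python
import numpy as np

def ica(X, lr=0.01, n_epochs=30, seed=0):
    # stochastic gradient ascent on the log-likelihood with a sigmoid cdf for p_s,
    # so that p_s'(s) / p_s(s) = 1 - 2 g(s) with g the sigmoid
    n, d = X.shape
    rng = np.random.default_rng(seed)
    W = np.eye(d)
    for _ in range(n_epochs):
        for i in rng.permutation(n):          # reshuffle the data each epoch
            x = X[i]
            g = 1.0 / (1.0 + np.exp(-W @ x))
            W += lr * (np.outer(1 - 2 * g, x) + np.linalg.inv(W.T))
    return W

# cocktail party toy problem: two independent heavy-tailed sources, two microphones
rng = np.random.default_rng(1)
S = rng.laplace(size=(2000, 2))               # independent Laplacian sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])        # unknown mixing matrix
X = S @ A.T                                   # microphone readings x = A s
W = ica(X)
print("W A (roughly a scaled permutation):\n", np.round(W @ A, 2))
```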
• For time series data, nearby training examples are often not independent, but this is not an
issue in practice. Counterintuitively, if we have such data, then randomly reshuffling the data
points can accelerate the convergence of stochastic gradient descent.
• ICA also works perfectly well if there are more microphones than speakers, though the formalism
above has to be slightly generalized to let A be non-square. When there are fewer microphones
than speakers, it doesn’t work; we would need to use additional domain knowledge to have a
chance of separating all the speakers. For example, for one microphone and a male and female
speaker, we can separate by frequency.