Estimation
Max Welling
California Institute of Technology 136-93
Pasadena, CA 91125
welling@vision.caltech.edu
1  Preliminaries
Let x denote a random variable and p(x) its probability density. x may be multidimensional and continuous
or discrete. In the discrete case p(x = s_i) is the probability that x assumes the value s_i. Obviously, all
probabilities have to sum up to one,

∑_x p(x) = 1.    (1)
In the continuous case, p(x = s)∆x is the probability of finding the vector x in the range [s, s + ∆x]. We
also have the constraint,

∫ dx p(x) = 1    (2)
Important characteristics of a probability density are its mean and its (co)variance, defined by,
µ = E[x] = ∫ dx p(x) x    (3)

Σ = E[(x − µ)(x − µ)^T] = ∫ dx p(x) (x − µ)(x − µ)^T    (4)
The mean is the average value of the random variable, while the variance (the diagonal terms) is a measure of
the spread around the mean. The off-diagonal terms of the covariance matrix capture the (second order)
dependencies between the different entries of x. In general, higher order moments are needed to characterize
the probability density function (PDF); indeed, usually infinitely many are needed, just as infinitely many terms
of a Taylor series are needed to represent a general function. For a normal distribution (Gaussian PDF), these
two moments completely specify the PDF,
G_x[µ, Σ] = (2π)^{−D/2} det[Σ]^{−1/2} exp( −(1/2) (x − µ)^T Σ^{−1} (x − µ) )    (5)
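As a quick check of equation (5), the following short Python sketch (an illustration, not part of the original notes; it assumes numpy is available) evaluates the Gaussian density directly from its mean and covariance.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Evaluate the multivariate Gaussian density G_x[mu, Sigma] of equation (5)."""
    x, mu = np.asarray(x, float), np.asarray(mu, float)
    D = mu.size
    diff = x - mu
    # Solve Sigma^{-1} (x - mu) without forming the explicit inverse.
    sol = np.linalg.solve(Sigma, diff)
    norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ sol) / norm

# Example: a 2-d Gaussian evaluated at its mean (the peak of the density).
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
print(gaussian_pdf(mu, mu, Sigma))  # equals 1 / ((2*pi)^{D/2} sqrt(det Sigma))
```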
If two random variables are independent, information about one does not reveal information about the other.
Mathematically this is expressed as,
p(x, y) = p(x) p(y)    (6)

and in particular,

E[xy] = E[x] E[y]    (7)

2  Supervised versus unsupervised Learning
The main topic of this class will be to train probabilistic models from observed data sequences. This can
be divided into two categories: supervised and unsupervised learning. Unsupervised learning takes a set of
input samples and fits a probabilistic model. Supervised learning uses input samples and target samples,
provided by a “teacher”, and learns a mapping from input to output, under a probabilistic model. Let’s
assume we observe N independent samples x^N = {x_1, ..., x_N} for unsupervised learning, and N independent
input samples x^N together with N independent target samples t^N = {t_1, ..., t_N} for supervised learning. We
assume that we have a parametrized model p(x|θ) for unsupervised learning and p(t|x, θ) for supervised learning.
The main question we want to answer is: what is the optimal set of parameters θ that can explain the data?
We would like to mention here that the most powerful models have unobserved random variables in them.
In that case, the probabilistic model can be written as follows,
p(x|θ) = ∫ dy p(x, y|θ)    (8)

p(t|x, θ) = ∫ dy p(t, y|x, θ),    (9)
where y is now the hidden (or latent or missing) variable. In this course all models will be of this sort!
3  Bayes Rule
The single most important equation that we will use is Bayes rule,
p(x, y) = p(x|y) p(y)    (10)

p(x, y) = p(y|x) p(x)    (11)
In words, the probability of the events x and y happening together is equal to the probability of x happening,
times the probability of y happening given that x happened. Combining the above equations we find,
p(y|x) = p(x|y) p(y) / p(x)    (12)
and the same where the roles of x and y are interchanged. p(x) can be viewed as a normalization constant,
i.e.
p(x) = ∫ dy p(x|y) p(y)    (13)
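To make equations (12) and (13) concrete, here is a small numerical sketch (my own illustrative numbers, not from the notes) for a binary y and a binary observation x.

```python
# Hypothetical numbers: prior p(y) and likelihood p(x|y), both over binary variables.
p_y = {0: 0.7, 1: 0.3}                      # prior p(y)
p_x_given_y = {0: {0: 0.9, 1: 0.1},          # p(x|y=0)
               1: {0: 0.2, 1: 0.8}}          # p(x|y=1)

x_obs = 1
# Normalization constant p(x) = sum_y p(x|y) p(y), equation (13).
p_x = sum(p_x_given_y[y][x_obs] * p_y[y] for y in p_y)
# Posterior p(y|x) = p(x|y) p(y) / p(x), equation (12).
posterior = {y: p_x_given_y[y][x_obs] * p_y[y] / p_x for y in p_y}
print(posterior)  # roughly {0: 0.226, 1: 0.774}
```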
4  Bayesian estimation
A Bayesian would typically consider θ as a random variable. He would probably have some prior knowledge
about its distribution, which he formalizes in a prior distribution p(θ). For instance, he could express
his intuition that the distribution over θ should be smooth, and that the values of θ should not become
unreasonably large. For instance, he could use a Gaussian prior, p(θ) = G_θ[0, α^2 I], where the parameter
α is called a hyperparameter, and could either be fixed or again be treated as a random variable (like θ).
Next, the Bayesian would want to know how his prior knowledge of the random variable θ changes in the
light of the new observations d. The data will consist of input samples x^N for unsupervised learning and
pairs of input and target data {x^N, t^N} for supervised learning. To answer this, he calculates the posterior
distribution,
p(θ|d) = p(d|θ) p(θ) / p(d)    (14)
The data will move the modes of the distribution to the most probable values of θ and also determine a
spread around those values, expressing how well the data actually determine θ. Notice that we need to
specify the likelihood p(d|θ), while p(d) acts as a normalization constant, which is hard to calculate because
we need to integrate over all possible values of θ,
p(d) = ∫ dθ p(d|θ) p(θ).    (15)
This normalization constant is sometimes called the “evidence” or “marginal likelihood”. For definiteness let’s
look at supervised learning and use the following likelihood, p(t|x, θ) = G_t[f(x, θ), β^2 I], where β^2 represents the
noise variance around the deterministic mapping f(x, θ). The supervised Bayesian would want to calculate
the probability over target values, given the input datum and the previous data,

p(t|x, d) = ∫ dθ p(t|x, θ) p(θ|d)    (16)
Similarly, an unsupervised Bayesian would want to be able to compute the probability of a new data point
x, given the data, i.e.
p(x|d) = ∫ dθ p(x|θ) p(θ|d)    (17)
Notice that we never actually estimate or choose any value for θ. Instead, we determine the posterior density
over all values for θ and use it to integrate over all possible values of θ. The downside is that these integrals
are usually very difficult, and only approximately computable using time consuming sampling techniques.
In our example, if we linearize the deterministic mapping f(x, θ) around the ‘maximum a posteriori’ (MAP)
value of θ (the value which maximizes p(θ|d)), i.e.

f(x, θ) ≈ f_0(x, θ_{MAP}) + f_1(x, θ_{MAP}) (θ − θ_{MAP}),    (18)

then we can actually perform the integral in (16), resulting in a Gaussian G_t[f_0(x, θ_{MAP}), σ^2(x, θ_{MAP})].
The average value of this mapping (which is the most likely value for a Gaussian) is determined by taking
the deterministic map using the parameters at their MAP-value. The variance of the Gaussian around its
mean gives us an indication of our confidence in the estimate of the mean. In our example this can be worked
out with the result,
σ^2(x, θ) = β^2 + f_1^T(x, θ_{MAP}) A^{−1} f_1(x, θ_{MAP})    (19)

A = (1/α^2) I + (1/β^2) ∑_{n=1}^{N} [ f_1(x_n, θ_{MAP}) f_1(x_n, θ_{MAP})^T + (f_0(x_n, θ_{MAP}) − t_n) ∇f_1^T(x_n, θ_{MAP}) ]    (20)
where ∇ denotes taking the derivative with respect to θ. More important than this explicit form is that we
notice that the mean and variance depend on the data through the value of θ_{MAP}, and the variance also
through the matrix A. Both are also functions of x, the input value of the new datapoint. It is important to
realize that the variance contains two contributions. One comes from the noise which is added to the deterministic
mapping f(x, θ); this term is independent of the data. The other, data dependent term, however, reflects
our confidence in the value of θ_{MAP}. If the data cannot determine θ_{MAP} accurately for certain values of x,
this term dominates the total variance. If there are enough data in a certain region of x space, then β^2 will
give the main contribution (see demo-Bayes).
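Since the predictive integrals (16) and (17) are rarely tractable in closed form, one often falls back on sampling. The Python sketch below (illustrative only; it assumes a made-up scalar Gaussian posterior over θ and a toy unit-variance Gaussian likelihood, neither of which comes from the notes) approximates equation (17) by a Monte Carlo average over posterior samples.

```python
import numpy as np

rng = np.random.default_rng(0)

def likelihood(x, theta):
    """p(x|theta): a unit-variance Gaussian with mean theta (an assumed toy model)."""
    return np.exp(-0.5 * (x - theta) ** 2) / np.sqrt(2 * np.pi)

# Pretend the posterior p(theta|d) came out Gaussian with this mean and std.
post_mean, post_std = 1.3, 0.2
theta_samples = rng.normal(post_mean, post_std, size=10_000)

# p(x|d) ~ (1/S) sum_s p(x|theta_s), with theta_s ~ p(theta|d)  -- equation (17)
x_new = 1.0
p_x_given_d = likelihood(x_new, theta_samples).mean()
print(p_x_given_d)
```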
5  MAP Estimation
Because the integrals in the Bayesian approach are often intractable, we may use the following approximation
for the unsupervised and supervised problems respectively,

p(x|d) ≈ ∫ dθ p(x|θ_{MAP}) p(θ|d) = p(x|θ_{MAP})    (21)

p(t|x, d) ≈ ∫ dθ p(t|x, θ_{MAP}) p(θ|d) = p(t|x, θ_{MAP})    (22)
It is important to realize that all information about our confidence in the value θ_{MAP} is thrown away, i.e.
in the example of the previous section we are left with a Gaussian with the same mean, but the variance now
only contains the term β^2. So we are indeed trading off some information against computational efficiency.
In order to find θ_{MAP} we need to solve,

θ_{MAP} = argmax_θ p(θ|d)    (23)
        = argmax_θ p(d|θ) p(θ) / p(d)    (24)
        = argmax_θ p(d|θ) p(θ)    (25)
        = argmax_θ log[p(d|θ)] + log[p(θ)]    (26)
since the log is a monotonic function. Notice that also in this calculation the integral term p(d) is irrelevant.
The prior p(θ) is still there and plays a crucial role in protecting against overfitting, an issue we will pick
up later. It can be considered as a regularizer term if one expresses in the prior his belief that the function
should be smooth. The term can be used to penalize too complex models. Also, some complexity terms, like the
“minimum description length” criterion (MDL, see later), Akaike’s information criterion (AIC) and the Bayesian
information criterion (BIC), can be understood as a prior on the complexity of the model.
To illustrate the effect of a prior we will solve the following toy problem (demo-MAP). Let’s assume
we have N datapoints which we think are generated by a one dimensional Gaussian with unit variance
and mean µ, p(x|µ) = G_x[µ, 1]. Since we think that the mean should not be very big, we use as a prior
p(µ) = G_µ[0, α^2], where α is a hyperparameter. The total objective function is thus,

E ∝ − ∑_{n=1}^{N} (x_n − µ)^2 − µ^2/α^2    (27)
which is easily maximized to give,
µ = (1 / (N + 1/α^2)) ∑_{n=1}^{N} x_n    (28)
The first thing to observe is that for very large N ≫ 1/α^2 the influence of the prior is negligible, and the
result is the Maximum Likelihood estimate for the mean of a Gaussian (see next section). On the other hand,
for very strong belief in the prior, 1/α^2 ≫ N, the estimate tends to zero. Thus, if few data are available, the
prior will bias the estimate towards the prior expected value.
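A minimal numerical sketch of this toy problem (my own illustration with made-up data, in the spirit of demo-MAP; it assumes numpy) compares the MAP estimate (28) with the plain sample mean for a weak and a strong prior.

```python
import numpy as np

rng = np.random.default_rng(1)
true_mu = 2.0
x = rng.normal(true_mu, 1.0, size=5)   # few datapoints from a unit-variance Gaussian

def map_mean(x, alpha):
    """MAP estimate of the mean under the prior G_mu[0, alpha^2], equation (28)."""
    return x.sum() / (len(x) + 1.0 / alpha**2)

print("sample mean      :", x.mean())
print("MAP, weak prior  :", map_mean(x, alpha=10.0))   # 1/alpha^2 << N: close to sample mean
print("MAP, strong prior:", map_mean(x, alpha=0.1))    # 1/alpha^2 >> N: shrunk towards 0
```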
6  Maximum Likelihood Estimation
If we omit the prior, and thus maximize the likelihood alone, we are calculating the “maximum likelihood”
(ML) estimate. Where MAP-estimation still relied on the interpretation of θ as a random variable (since
we had a prior over θ), in ML-estimation θ is simply viewed as a parameter. Frequentists think that it makes
no sense to view θ as a random variable, and that ML-estimation is the right thing to do. The argument is that
a Bayesian probability is more like a belief that some event happens than a truly physical probability,
which must be measured by doing repeated experiments (the frequency of an event occurring is then the
probability). We have already seen that when ample data are available, there is no significant difference
between ML and MAP estimation (even Bayesian estimation would not add much). We can only expect
Bayesian methods to be beneficial when the model complexity (number of parameters) becomes very large
relative to the number of data samples. The ML estimate may overfit to the data, i.e. it will fit noise instead
of underlying structure, and this will lead to bad generalization performance on a new (test) data set. We
will come back to these issues later on in this course, but for the next classes we will assume that there are
enough data, and proceed with using the ML estimation procedure. To summarize we have,
θ_{ML} = argmax_θ log[p(d|θ)]    (29)
Assuming independent samples this will give for unsupervised and supervised learning respectively,
θ_{ML} = argmax_θ ∑_{n=1}^{N} log[p(x_n|θ)]    (30)

θ_{ML} = argmax_θ ∑_{n=1}^{N} log[p(t_n|x_n, θ)]    (31)
As an example we will work out the case of a Gaussian (demo-ML). The model is given by (5). The
log-likelihood is given by,
L = (1/2) ∑_{n=1}^{N} { log(det[Σ^{−1}]) − (x_n − µ)^T Σ^{−1} (x_n − µ) }    (32)
To maximize this we need to take derivatives with respect to µ and Σ^{−1} and equate them to zero. The
following identities are useful in doing so,

a^T A b = tr[A b a^T]    (33)

tr[AB] = tr[BA]    (34)

∂/∂A tr[AB] = B^T    (35)

∂/∂A tr[A^T B] = B    (36)

log det[A] = − log det[A^{−1}]    (37)

∂/∂A log det[A] = (A^T)^{−1}    (38)
We find,

∂L/∂µ = ∑_{n=1}^{N} Σ^{−1}(x_n − µ) = 0  ⇒  µ = (1/N) ∑_{n=1}^{N} x_n    (39)
For Σ we find,
∂L/∂Σ^{−1} = (1/2) ∑_{n=1}^{N} [ Σ − (x_n − µ)(x_n − µ)^T ] = 0  ⇒  Σ = (1/N) ∑_{n=1}^{N} (x_n − µ)(x_n − µ)^T    (40)
These are called the sample mean (arithmetic mean) and sample covariance.
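The ML estimates (39) and (40) are just the sample mean and sample covariance, which the following sketch (an illustration assuming numpy, in the spirit of demo-ML) computes directly from synthetic data.

```python
import numpy as np

rng = np.random.default_rng(2)
true_mu = np.array([1.0, -2.0])
true_Sigma = np.array([[1.0, 0.5], [0.5, 2.0]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=1000)   # rows are samples x_n

# ML estimates, equations (39) and (40).
mu_ml = X.mean(axis=0)
diff = X - mu_ml
Sigma_ml = diff.T @ diff / len(X)          # note the 1/N factor (biased, see later)

print("mu_ML    =", mu_ml)
print("Sigma_ML =", Sigma_ml)
```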
One is tempted to say that ML estimation is nothing more than MAP estimation with a constant prior,
p(θ) = c. This interpretation is somewhat dangerous though. Assume we change variables from θ → φ(θ).
The prior changes according to:

p(φ) = |det(J(θ))| p(θ) = |det(J(θ))| c    (41)

where J is the Jacobian of this transformation. So a uniform prior changes into a non-uniform prior under
this transformation! It is reassuring to know, however, that the ML estimate is equivariant: if θ̂ is the MLE of
θ, then f(θ̂) is the MLE of f(θ).
Before reviewing some more good properties of ML-estimation let me list two bad ones. First, there
is the above mentioned problem of overfitting in the presence of too few data points. Another problem is
that the method is not guaranteed to be robust. Robustness implies that the estimate of θ is not too much
influenced by deviations from the assumptions. For instance, we could use a slightly wrong model to describe
the data, or there could be outliers present (not described by the model). The arithmetic mean and variance
are examples of non-robust estimators.
Now let me discuss the good properties. We will assume that we have an infinite number of datasets,
each consisting of N independent samples, x^N_i, i = 1, 2, .... For practical purposes, we may assume a very large
number of such datasets. In the following we will denote by E the expectation taken over those datasets, or
alternatively over p(x) directly (the true distribution from which the data are sampled). We will also assume
that the data are generated by a parameterized density p(x^N|θ). For every dataset we estimate θ using some
kind of estimator θ̂(x^N) and study the distribution of those estimates. At this stage this is not necessarily
the ML-estimate, but we will show that if we use the ML estimate, then some nice properties hold. We find
for the expected squared error (θ_* denotes the true value),
E[(θ̂ − θ_*)^2] = E[θ̂^2 − 2 θ̂ θ_* + θ_*^2]
              = E[θ̂^2] − 2 E[θ̂] θ_* + θ_*^2
              = E[θ̂^2] − θ̄^2 + θ̄^2 − 2 θ̄ θ_* + θ_*^2        (writing θ̄ = E[θ̂])
              = (θ_* − θ̄)^2 + E[(θ̂ − θ̄)^2]    (42)
The first term is the squared bias of the estimate θ̂, while the second term is the variance. The bias term is
independent of N, while the variance scales like 1/N with the size of the dataset. This implies that for a very
large data-set, the bias dominates the error. Without proof we will now state an important theorem about
the ML-estimator.
For large N, the distribution of the parameter estimates θ_{ML} is Gaussian with mean θ_* and variance
(1/N) J^{−1}. This implies that the estimator becomes asymptotically unbiased, and maximally efficient (minimum
variance). This follows because we can prove in general the Cramer-Rao bound Σ ≥ (1/N) J^{−1}, where J is the
Fisher information matrix.
Basically it says that if the data are distributed according to p(x^N|θ) for some θ, then for very large
datasets, the mean of the ML estimates will give the correct parameter value (no bias) and, moreover, we
could not do this more efficiently since the variance of the estimates is minimal.
This notion of asymptotically unbiased estimators is closely related to consistency, which means that as the
dataset grows, the value of the estimator approaches the true parameter value in probability. One can show
that if an estimator is asymptotically unbiased and its variance vanishes asymptotically, then the estimator
is also consistent. Hence, ML estimates are always consistent if certain regularity conditions hold
on the likelihood function.
As an example we mention that the ML estimate of the covariance as in equation (40) is biased, but
asymptotically unbiased. We can easily correct for this by replacing the factor 1/N with 1/(N − 1).
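A small simulation (illustrative, assuming numpy) makes the bias of the 1/N estimate visible in one dimension: averaged over many datasets, the 1/N estimator underestimates the true variance, while the 1/(N − 1) version does not.

```python
import numpy as np

rng = np.random.default_rng(3)
true_var, N, n_datasets = 4.0, 5, 100_000

# Draw many small datasets and estimate the variance of each, both ways.
X = rng.normal(0.0, np.sqrt(true_var), size=(n_datasets, N))
var_ml = X.var(axis=1, ddof=0)        # ML estimate, 1/N factor
var_unbiased = X.var(axis=1, ddof=1)  # corrected, 1/(N-1) factor

print("true variance      :", true_var)
print("mean of ML estimate:", var_ml.mean())        # roughly (N-1)/N * true_var = 3.2
print("mean of corrected  :", var_unbiased.mean())  # roughly 4.0
```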
Finally, I’ll say one word on another property of estimators called sufficiency. A set of estimators is called
sufficient for a model if the likelihood function can be completely specified in terms of these estimators. For
instance, the sample mean and variance are sufficient for a Gaussian model. What we would really like is a minimal
set of sufficient statistics. This means that any sufficient set of estimators can be expressed in terms of this
minimal set, i.e. a basis set of some sort.
For completeness let us prove the Cramer-Rao inequality. First let us define the score function as,

s(x^N, θ) = ∂/∂θ log p(x^N|θ)    (43)
          = ∑_{n=1}^{N} s(x_n, θ)    (44)
where xN denotes N independent random variables (samples). It is easy to prove the following two properties
for s,
E[s(x^N, θ)] = ∫ dx^N p(x^N|θ) [∂p(x^N|θ)/∂θ] / p(x^N|θ)
             = ∂/∂θ ∫ dx^N p(x^N|θ)
             = ∂/∂θ 1 = 0    (45)
and

E[θ_{ML}(x^N) s(x^N, θ)] = ∫ dx^N p(x^N|θ) θ_{ML}(x^N) [∂p(x^N|θ)/∂θ] / p(x^N|θ)
                         = ∂/∂θ ∫ dx^N p(x^N|θ) θ_{ML}(x^N)
                         = ∂/∂θ θ = I    (46)

where the last step assumes the estimator is unbiased, E[θ_{ML}(x^N)] = θ.
Then we define the Fisher information as the variance of the score function, which is, since the mean is zero,
equal to,
J_N(θ) = E[s(x^N, θ) s(x^N, θ)^T]
       = E[ ∑_{n,m=1}^{N} s(x_n, θ) s(x_m, θ)^T ]
       = ∑_{n=1}^{N} E[s(x_n, θ) s(x_n, θ)^T] + ∑_{n≠m} E[s(x_n, θ)] E[s(x_m, θ)]^T
       = N J,    (47)
where we used the independence of the samples and the fact that the score function has zero mean. Finally,
we prove the inequality by defining,

α = a^T θ_{ML}(x^N),    β = b^T s(x^N, θ)    (48)
From the Cauchy-Schwartz inequality we have,
E[(α − E[α])^2] E[(β − E[β])^2] ≥ E[(α − E[α])(β − E[β])]^2    (49)

Using (45) and (46) we may write,

(a^T Σ a)(b^T J_N b) ≥ (a^T b)^2  ⇒  (a^T b)^2 / [(a^T Σ a)(b^T J_N b)] ≤ 1    (50)
Now we will maximize the right hand side over b using the lemma,

max_b (a^T b)^2 / (b^T J_N b) = a^T J_N^{−1} a    (51)

giving

a^T Σ a ≥ a^T J_N^{−1} a = (1/N) a^T J^{−1} a    ∀a    (52)

which is the definition of Σ ≥ (1/N) J^{−1}.
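To make the Fisher information concrete, the sketch below (an illustration assuming numpy; the toy model is my own choice, not part of the notes) estimates J numerically for a unit-variance Gaussian with unknown mean, where the per-sample score is s(x, θ) = x − θ and the theoretical value is J = 1.

```python
import numpy as np

rng = np.random.default_rng(4)
theta_true = 0.5

# Samples from p(x|theta) = G_x[theta, 1]; the per-sample score is s(x, theta) = x - theta.
x = rng.normal(theta_true, 1.0, size=100_000)
score = x - theta_true

print("mean of score (should be ~0):", score.mean())   # property (45)
print("variance of score (J, ~1)   :", score.var())    # Fisher information per sample
# Cramer-Rao: the variance of the ML estimate (the sample mean) of N points is >= 1/(N*J).
```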
7  Generalization, Bias/Variance tradeoff, Minimum Description Length
We have already mentioned several times that ML-estimation does not solve the problem of overfitting. What
we need is a criterion to choose from a set of different models the one which is expected to generalize best, i.e.
the one which has not fitted the noise. This amounts to choosing the model with the right complexity. We
may need to choose between different models with different numbers of parameters (model order selection) or
equivalent models, but trained in a different way. In the latter case, the effective complexity can be different,
e.g. smooth solutions will have a lower complexity than highly structured solutions.
To illustrate what is going on we will go back to the derivation of the squared error of an estimator. Now
we will look at the integrated squared error between the true probability density and the estimated one and
prove a similar bias-variance trade-off. Let’s call p̂(x) the estimate of the pdf using the data x^N. We could
repeat the same analysis for p(t|x) in the supervised case. We are interested in the goodness of fit, i.e.,

L_2 = ∫ dx E[(p_*(x) − p̂(x))^2]    (53)
where p_* denotes the true density. Following essentially the same procedure as for the bias/variance trade-off
for estimators, and using E[p̂(x)] = p̄(x) and E[p_*(x)] = p_*(x), we find,

L_2 = ∫ dx { (p_*(x) − p̄(x))^2 + E[(p̂(x) − p̄(x))^2] }    (54)
The first term is again the bias and the second the variance around the mean estimate. The reason that
this equation is important is that it illuminates a trade-off between the bias and variance. We may reason
as follows. Let’s assume we have a very simple model to describe the data. It is then very likely that there
is a large bias in the estimation of p(x), simply because our model is not flexible enough to capture all the
structure. The variance however is expected to be small since the parameters are well determined by the
data. On the other hand, imagine a very complex model. Due to the enormous flexibility we probably have a
very small bias term, but since the data cannot determine all parameters, the different datasets drawn from
the true PDF cause the parameters to acquire very different values every time we estimate them. Thus, the
variance is expected to be large. What we want is a small expected error, and therefore we need to trade off
the bias and the variance by choosing a model which has the right complexity.
A fully Bayesian approach can deal with these issues by simply integrating over all possible values of the
parameters and/or summing over the different models. However, the choice of the priors for both parameter
values and models is a difficult issue, and moreover the integrals involved are usually too complicated to do
analytically or very time consuming to do approximately. As an alternative one could use MAP-estimation,
with a log-prior which acts as a penalty term for too complex models. To choose between different models
some penalty terms were suggested in the literature (AIC, BIC, MDL), one of which (MDL) we will explore
a bit further below. These terms provide rough estimates for the correct number of parameters in the light
of the available data. To avoid overfitting during training of a specific model, one may add regularizer terms
which penalize too high derivatives for instance, i.e. they enforce smoothness. Adding noise to the data has
a similar effect. Also weight decay in neural nets is an instance of a log-prior term which acts as a regularizer.
Yet another alternative is to split the dataset into a training set and a validation set. During training of
the model, the training error is expected to decrease, but when the model starts overfitting its generalization
performance will become worse, resulting in an increasing validation error. A simple heuristic is therefore
to stop the training when the validation error increases (early stopping). The same procedure may be used
to test different models with different numbers of degrees of freedom. The downside is of course that one
cannot use all data for training. Cross-validation is designed to overcome this problem, by splitting the data
into many small chunks, training on all but one chunk and testing on the remaining chunk. Then cycle through the
chunks and average the results. Of course, this procedure is computationally demanding.
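As a rough sketch of k-fold cross-validation (illustrative Python; the model here is just the ML Gaussian fit of Section 6 scored by held-out log-likelihood, which is my own choice of example):

```python
import numpy as np

rng = np.random.default_rng(5)

def gaussian_loglik(x, mu, var):
    """Log-likelihood of 1-d data under a Gaussian with mean mu and variance var."""
    return np.sum(-0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mu) ** 2 / var)

def cross_validate(x, k=5):
    """Average held-out log-likelihood of an ML Gaussian fit over k folds."""
    folds = np.array_split(rng.permutation(x), k)
    scores = []
    for i in range(k):
        held_out = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        mu, var = train.mean(), train.var()     # ML fit on the training chunks
        scores.append(gaussian_loglik(held_out, mu, var))
    return np.mean(scores)

x = rng.normal(0.0, 1.0, size=200)
print(cross_validate(x))
```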
Let me explain in a little bit more detail what the philosophy behind the “minimum description length”
(MDL, mentioned above) principle is. The idea is that we imagine that we need to send the samples x^N
over a communication channel to an imaginary receiver. If both sender and receiver agree beforehand on a
probability distribution p(x^N), then the minimum cost of sending these data is, according to Shannon’s
theorem, − log[p(x^N)] bits, and we can approximate this bound arbitrarily closely by some coding scheme.
Let’s assume therefore that this is the actual cost of sending the data (for continuous data we need to
quantize the space in bins of size dx^N, and the probability density is then multiplied by that binsize). There is
however a smarter idea, which is to build a model M, which captures the dependencies (structure) in the
data, and then send:
• The specifics of the model
• The activities of the model when applied to the data
• The reconstruction errors which are the differences of the predicted values of the model and the true
values.
Of course, for all continuous values we need to decide on the tolerances (precision) of the coding. With
this information the receiver can losslessly reconstruct the data, but we have sent fewer bits! The principle
behind MDL is the belief that the model for which the sum of those three costs is minimized also has the
best generalization performance. To see how this can work, we first imagine that we have very few datapoints.
In this case it makes no sense to build a sophisticated model, since sending the specifics of the model requires
more bits than sending the data in the first place. Next, imagine we have enormous amounts of data at our disposal;
then the specifics of the model are a fraction of the total cost and we may profit from building complicated
models. Note also that building accurate models reduces the error term, so the simplest, most accurate
model that can describe the data will win. We see that there is a trade-off between the first cost and the
last two costs, which will be minimized by a model of the “correct” complexity. As an example consider
some data which lie almost on a straight line in two dimensions. We may send the values {x_1, y_1, ..., x_N, y_N}
directly, or we may recognize that it is cheaper to send the two parameters determining the straight line
(cost 1), then project the datapoints onto the line and send distances along the line, for instance from the
x-intercept (cost 2). Finally, send the errors (cost 3). Notice that instead of sending two potentially large
values for every datapoint, we now send one large value (distance along the line) and a small one (error) per
datapoint, which may be coded with fewer bits, plus two values for the line specifics which are independent
of the number of datapoints.
Using this philosophy, Rissanen derived a penalty term which should be added to the ML criterion,

MDL = ML − (1/2) (# parameters) log(N)    (55)

which, when maximized, gives the model with the best generalization performance. It must be said however
that for very small datasets (55) is not very accurate.
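A toy sketch of using (55) for model order selection (illustrative Python, my own construction): fit polynomials of increasing degree to noisy data by least squares, convert the residual fit into a Gaussian log-likelihood (the ML term), and subtract the complexity penalty.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 50
x = np.linspace(-1, 1, N)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.3, size=N)   # the true model is a straight line

def mdl_score(x, y, degree):
    """ML log-likelihood minus the MDL penalty of equation (55) for a polynomial model."""
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    sigma2 = resid.var()                    # ML estimate of the noise variance
    n_params = degree + 2                   # polynomial coefficients plus noise variance
    loglik = -0.5 * len(y) * (np.log(2 * np.pi * sigma2) + 1.0)
    return loglik - 0.5 * n_params * np.log(len(y))

scores = {d: mdl_score(x, y, d) for d in range(6)}
print(max(scores, key=scores.get), scores)  # degree 1 should typically win
```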
Recommended literature: [1], [2].
References
[1] C.M. Bishop. Neural networks for pattern recognition. Clarendon Press, 1995.
[2] B.D. Ripley. Pattern recognition and neural networks. Cambridge University Press, 1996.