Maximum Likelihood Estimation
1 MLE
Let f1(·|θ) with θ ∈ Θ be a parametric family. Let X = (X1, ..., Xn) be a random sample from distribution f1(·|θ0) with θ0 ∈ Θ. Then the joint pdf is f(x|θ) = ∏_{i=1}^n f1(xi|θ), where x = (x1, ..., xn). The log-likelihood is ℓ(θ|x) = ∑_{i=1}^n log f1(xi|θ) = ∑_{i=1}^n ℓ1(θ|xi), where ℓ1(θ|xi) = log f1(xi|θ) is the log-likelihood for one draw. The maximum likelihood estimator is, by definition,

θ̂_ML = arg max_{θ∈Θ} ℓ(θ|x).

The FOC is

(1/n) ∑_{i=1}^n ∂ℓ1(θ̂_ML|xi)/∂θ = 0.
Note that the first information equality is E[∂ℓ1(θ0|Xi)/∂θ] = 0. Thus the MLE is the method of moments estimator corresponding to the first information equality. So we can expect that the MLE is consistent. Indeed, the following theorem holds.
Theorem 1 (MLE consistency). In the setting above, assume that (1) θ0 is identifiable, i.e. for any θ ≠ θ0 there exists x such that f(x|θ) ≠ f(x|θ0), (2) the support of f(·|θ) does not depend on θ, and (3) θ0 is an interior point of the parameter space Θ. Then θ̂_ML →p θ0.
The proof of MLE consistency will be given in 14.382 and 14.385. What the proof does is show that the function g(θ) = E_{θ0}[ℓ1(θ|Xi)] (here Xi ∼ f1(·|θ0)) is maximized at θ = θ0 and that the random process (1/n)ℓ(θ|X) converges to the function g(θ) in a uniform manner in probability. Then it argues that the maximizer of (1/n)ℓ(θ|X), which is θ̂_ML, converges in probability to the maximizer of g(θ), which is θ0.
Theorem 2 (MLE asymptotic normality). In the setting above, assume that conditions (1)-(3) in the MLE consistency theorem hold. In addition, assume that (4) f1(xi|θ) is thrice differentiable with respect to θ and we can interchange integration with respect to x and differentiation with respect to θ, and (5) |∂³ log f1(xi|θ)/∂θ³| ≤ M(xi) with E[M(Xi)] < ∞. Then

√n(θ̂_ML − θ0) ⇒ N(0, I1^{-1}(θ0)).
Proof. This is a sketch of the proof, as it misses an important step. By definition, ∂ℓ(θ̂_ML|x)/∂θ = 0. By the Taylor theorem with a remainder, there is some random variable θ̃ with value between θ0 and θ̂_ML such that

0 = ∂ℓ(θ̂_ML|X)/∂θ = ∂ℓ(θ0|X)/∂θ + (∂²ℓ(θ̃|X)/∂θ²)(θ̂_ML − θ0).

So,

√n(θ̂_ML − θ0) = −[(1/n)∂²ℓ(θ̃|X)/∂θ²]^{-1} (1/√n)∂ℓ(θ0|X)/∂θ.
Since θ̂_ML →p θ0 and θ̃ is between θ0 and θ̂_ML, θ̃ →p θ0 as well. From θ̃ →p θ0, one can prove that

(1/n)∂²ℓ(θ̃|X)/∂θ² − (1/n)∂²ℓ(θ0|X)/∂θ² →p 0.

We will not discuss this result here since it requires knowledge of the concept of asymptotic equicontinuity, which we do not cover in this class. You will learn it in 14.385. Note, however, that this result does not follow from the Continuous mapping theorem, since we have a sequence of random functions ℓ(θ|X) instead of just one non-random function. Suppose we believe in this result. Then, by the Law of large numbers,
(1/n)∂²ℓ(θ0|X)/∂θ² = (1/n) ∑_{i=1}^n ∂² log f1(Xi|θ0)/∂θ² →p E[∂² log f1(Xi|θ0)/∂θ²] = −I1(θ0).
Next, by the first information equality, E[∂ log f1(Xi|θ0)/∂θ] = 0 while Var[∂ log f1(Xi|θ0)/∂θ] = I1(θ0). Thus, by the Central limit theorem,

(1/√n)∂ℓ(θ0|X)/∂θ = (1/√n) ∑_{i=1}^n ∂ log f1(Xi|θ0)/∂θ ⇒ N(0, I1(θ0)).

Combining the two displays above by Slutsky's theorem gives √n(θ̂_ML − θ0) ⇒ N(0, I1^{-1}(θ0)).
One interpretation of MLE asymptotics is that the MLE is asymptotically efficient: it hits the Rao-Cramer lower bound asymptotically.
Example Let X1, ..., Xn be a random sample from a distribution with pdf f(x|λ) = λ exp(−λx). This distribution is called exponential. Its log-likelihood for one draw is ℓ1(λ|xi) = log λ − λxi. So ∂ℓ1(λ|xi)/∂λ = 1/λ − xi and ∂²ℓ1(λ|xi)/∂λ² = −1/λ². So the Fisher information is I1(λ) = 1/λ². Let us find the MLE for λ. The log-likelihood for the whole sample is ℓ(λ|x) = n log λ − λ ∑_{i=1}^n xi. The FOC is n/λ̂_ML − ∑_{i=1}^n xi = 0. So λ̂_ML = 1/X̄n. Its asymptotic distribution is given by √n(λ̂_ML − λ) ⇒ N(0, λ²).
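As a quick numerical check (a sketch added for illustration, not part of the original notes), the simulation below draws many exponential samples, computes λ̂_ML = 1/X̄n in each, and compares the Monte Carlo variance of √n(λ̂_ML − λ) with the predicted limit variance λ². The true rate, sample size, and number of replications are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, n, reps = 2.0, 500, 5000          # true rate, sample size, Monte Carlo replications

# MLE in each replication: lambda_hat = 1 / sample mean
lam_hat = np.array([1.0 / rng.exponential(scale=1.0 / lam, size=n).mean()
                    for _ in range(reps)])

z = np.sqrt(n) * (lam_hat - lam)        # sqrt(n) (lambda_hat - lambda)
print("simulated variance of sqrt(n)(lam_hat - lam):", z.var())
print("theoretical asymptotic variance lambda^2:    ", lam**2)
```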
2 Inference using MLE
We will have a longer discussion about how to estimate the asymptotic variance of the MLE, I1^{-1}(θ0), later when we discuss asymptotic tests. Right now I want to mention several suggestions.
First of all, if I1(θ) is a continuous function of θ (which is needed for the asymptotic results), then, given that θ̂_ML is consistent for θ0, the quantity (I1(θ̂_ML))^{-1} is consistent for (I1(θ0))^{-1}.
Second, by the definition of Fisher information, it equals the expectation of either the negative second derivative of the log-likelihood or the squared score. Instead of taking the expectation, one may approximate it by the corresponding sample average evaluated at θ̂_ML.
The third idea to be used in this context is the parametric bootstrap. Assume θ̂_ML is the MLE we obtained from the sample. We can then draw bootstrap samples from f(·|θ̂_ML), recompute the MLE in each of them, and use the spread of the bootstrap MLEs to approximate the sampling variability of θ̂_ML.
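To make these three suggestions concrete, here is a hedged sketch (my addition) for the exponential example above, where I1(λ) = 1/λ². It computes the plug-in variance I1(λ̂_ML)^{-1}/n, the two sample-average versions (negative second derivative and squared score), and a parametric bootstrap variance; all tuning choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
lam0, n = 2.0, 500
x = rng.exponential(scale=1.0 / lam0, size=n)     # one observed sample

lam_hat = 1.0 / x.mean()                           # MLE

# (1) plug-in: I1(lam_hat)^{-1} / n = lam_hat^2 / n
var_plugin = lam_hat**2 / n

# (2) sample-average versions of the Fisher information, evaluated at lam_hat
score = 1.0 / lam_hat - x                          # per-observation score
hess = -np.ones(n) / lam_hat**2                    # per-observation second derivative
var_hessian = 1.0 / (n * (-hess).mean())
var_score = 1.0 / (n * (score**2).mean())

# (3) parametric bootstrap: resample from f(.|lam_hat) and recompute the MLE
B = 2000
boot = np.array([1.0 / rng.exponential(scale=1.0 / lam_hat, size=n).mean()
                 for _ in range(B)])
var_boot = boot.var()

print(var_plugin, var_hessian, var_score, var_boot)   # all four should be close
```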
Example A word of caution. For asymptotic normality of the MLE, we should have common support. Let us see what might happen otherwise. Let X1, ..., Xn be a random sample from U[0, θ]. Then θ̂_ML = X(n). So √n(θ̂_ML − θ) is always nonpositive, and hence it does not converge to a mean-zero normal distribution. In fact, E[X(n)] = (n/(n+1))θ and Var(X(n)) = θ²n/((n+1)²(n+2)) ≈ θ²/n². On the other hand, if the theorem worked, we would have Var(X(n)) ≈ 1/(nI1(θ)). The MLE happens to be super-consistent here, meaning that it converges to the true value at a faster rate than the regular parametric rate of 1/√n.
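The following simulation sketch (again my addition) illustrates the rate claim: Var(X(n)) scales like θ²/n², so n²Var(X(n)) stays roughly constant while √n(θ − X(n)) collapses toward zero as n grows.

```python
import numpy as np

rng = np.random.default_rng(2)
theta, reps = 1.0, 5000

for n in (100, 400, 1600):
    x_max = rng.uniform(0.0, theta, size=(reps, n)).max(axis=1)   # MLE = X_(n)
    print(n,
          n**2 * x_max.var(),                    # roughly constant, close to theta^2
          np.sqrt(n) * (theta - x_max).mean())   # shrinks like 1/sqrt(n)
```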
Example Now, let us consider what might happen if the true parameter value θ0 were on the boundary of Θ. Let X1, ..., Xn be a random sample from the distribution N(µ, 1) with µ ≥ 0. As an exercise, check that µ̂_ML = X̄n if X̄n ≥ 0 and µ̂_ML = 0 otherwise. Suppose that µ0 = 0. Then √n(µ̂_ML − µ0) is always nonnegative, so again it cannot converge to a mean-zero normal distribution.
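A short simulation sketch (my addition) shows what the limit looks like instead: with µ0 = 0, roughly half of the simulated values of µ̂_ML are exactly zero and the rest behave like the positive half of a normal distribution.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 200, 5000
xbar = rng.normal(0.0, 1.0, size=(reps, n)).mean(axis=1)   # sample means when mu0 = 0
mu_hat = np.maximum(xbar, 0.0)                              # MLE under the constraint mu >= 0

z = np.sqrt(n) * mu_hat                                     # sqrt(n)(mu_hat - mu0)
print("fraction exactly zero:", (z == 0).mean())            # close to 1/2
print("mean of sqrt(n)*mu_hat:", z.mean())                  # strictly positive
```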
Example Finally, note that it is implicitly assumed both in the consistency and the asymptotic normality theorems that the parameter space Θ is fixed, i.e. independent of n. In particular, the number of parameters is not allowed to grow with the sample size. Consider an example where it does grow. Let

Xi = (X1i, X2i)' ∼ N((µi, µi)', diag(σ², σ²))

for i = 1, ..., n, and let X1, ..., Xn be mutually independent. One can show that if the sample size n increases to infinity, the MLE for σ² is inconsistent in this case, though a consistent estimator for σ² exists.
What is interesting, though we won't show it here, is that the bootstrap does not help in this case; that is, the bootstrap approximation to the distribution of θ̂_ML is not close to the true finite-sample distribution of θ̂_ML.
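To see the inconsistency numerically, here is a sketch (my addition; the individual means µi below are arbitrary): the MLE sets µ̂i = (X1i + X2i)/2, and the resulting σ̂²_ML converges to σ²/2 rather than σ², while doubling it yields a consistent estimator.

```python
import numpy as np

rng = np.random.default_rng(4)
sigma2, n = 1.0, 100_000
mu = rng.normal(0.0, 3.0, size=n)                 # individual means mu_i (arbitrary)

x1 = mu + rng.normal(0.0, np.sqrt(sigma2), size=n)
x2 = mu + rng.normal(0.0, np.sqrt(sigma2), size=n)

mu_hat = (x1 + x2) / 2.0                          # MLE of each mu_i
sigma2_mle = ((x1 - mu_hat)**2 + (x2 - mu_hat)**2).mean() / 2.0

print("MLE of sigma^2:       ", sigma2_mle)        # close to sigma2 / 2 = 0.5
print("consistent estimator: ", 2.0 * sigma2_mle)  # close to sigma2 = 1.0
```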
4 Pseudo-MLE
Let us have a sample X = (X1, ..., Xn) i.i.d. from some distribution. We do not know what distribution it is; let us just assume it has pdf g(xi). But we wrongly assumed a specific parametric family, that is, we assumed Xi ∼ f1(xi|θ). What would happen if we do MLE? Apparently, the MLE will be estimating a pseudo-true parameter value θ0 which minimizes, in some sense, the distance between g(·) and the family f1(·|θ). In particular,

θ0 = arg max_θ ∫ log[f1(xi|θ)] g(xi) dxi = arg max_θ E[log f1(Xi|θ)].
Parameter θ0 may or may not be of interest. Under some regularity conditions θ̂_ML →p θ0, and in most parts the logic of the proof of the asymptotic normality theorem will hold. However, the information equality in general fails: with Σ1 = Var[∂ log f1(Xi|θ0)/∂θ] and Σ2 = −E[∂² log f1(Xi|θ0)/∂θ²], in general Σ1 ≠ Σ2. But using the logic of the proof, we can show that

√n(θ̂_ML − θ0) ⇒ N(0, Σ2^{-1} Σ1 Σ2^{-1}).

This sandwich form of the asymptotic variance is what should be used to construct standard errors.
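As an illustration (my own sketch, with an arbitrary choice of true distribution), suppose we fit the exponential model f1(x|θ) = θ exp(−θx) to data that are actually log-normal. The pseudo-MLE is still θ̂_ML = 1/X̄n, and the sandwich variance Σ2^{-1}Σ1Σ2^{-1}, estimated by sample analogues at θ̂_ML, differs from the information-based variance, which is incorrect under misspecification.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000
x = rng.lognormal(mean=0.0, sigma=1.0, size=n)    # true pdf g: log-normal, not exponential

theta_hat = 1.0 / x.mean()                        # pseudo-MLE for the exponential model

score = 1.0 / theta_hat - x                       # score of the (misspecified) model
Sigma1 = (score**2).mean()                        # sample analogue of Var of the score
Sigma2 = 1.0 / theta_hat**2                       # sample analogue of -E[second derivative]

var_sandwich = Sigma1 / (Sigma2**2 * n)           # Sigma2^{-1} Sigma1 Sigma2^{-1} / n
var_info = 1.0 / (n * Sigma2)                     # naive I1(theta_hat)^{-1} / n

print("sandwich variance:         ", var_sandwich)
print("information-based variance:", var_info)    # differs under misspecification
```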
MIT OpenCourseWare
https://ocw.mit.edu
For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms