
Maximum Likelihood Estimation

A general and powerful method of estimation is that known as maximum
likelihood. The idea is as follows. Suppose we start with a random sample of n
observations x1 , x2 , . . . , xn drawn from a probability density function f (x | θ)
involving an unknown parameter θ. If we denote the complete set of observations
by the vector X = (x1 , x2 , . . . , xn ), then the joint probability density function
of X may be written as

f (X | θ) = f (x1 | θ) f (x2 | θ) . . . f (xn | θ)

In probability theory the function f (X | θ) gives the probability of observing
the sample given the parameter θ (remembering that a random sample gives
independent drawings from the population so the probabilities multiply). In
particular if
f (X1 | θ) > f (X2 | θ)
then we may (loosely) say that observing X1 is “more likely” than observing
X2 . However, in estimation theory we observe the value of X and wish to say
something about θ. Now consider f (X | θ) as a function of θ (with X fixed at
the observed values). We call this the likelihood function of θ. If

f (X | θ1 ) > f (X | θ2 )

we may (loosely) say that θ1 is a “more plausible” value of θ than θ2 , since θ1
ascribes a larger probability to the observed X than does θ2 . The principle
of maximum likelihood just says that we should use as our estimator of θ the
parameter point for which our observed sample is most likely. That is, we should
choose the value of θ which maximises the likelihood function. To emphasise
that we have moved from considering the probability of the sample given the
parameter θ to considering the likelihood of the parameter θ given the sample
we use a different notation for the likelihood function, L(θ | X). So

L(θ | X) = f (X | θ)
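For instance, the comparison can be made concrete with a few lines of Python. In the sketch below the sample values and the two candidate parameter values are invented purely for illustration, and the density is taken to be N (θ, 1) (the same specification as the first worked example further on):

    import numpy as np

    # Hypothetical observed sample and two candidate parameter values
    # (all invented for illustration).
    x = np.array([1.2, 0.7, 1.9, 1.1, 0.4])

    def likelihood(theta, x):
        # L(theta | X): product of the individual N(theta, 1) densities.
        return np.prod(np.exp(-(x - theta) ** 2 / 2) / np.sqrt(2 * np.pi))

    print(likelihood(1.0, x))   # the larger of the two values
    print(likelihood(3.0, x))   # far smaller: theta = 1.0 is the "more plausible" value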

If the likelihood function is differentiable as a function of θ then possible
candidates for the maximum likelihood estimator are the values of θ for which

\[
\frac{\partial}{\partial \theta} L(\theta \mid X) = 0
\]
Note that solutions are only candidates for the MLE: a zero derivative is
necessary for a maximum but not sufficient. The second derivative must be
negative for a maximum, and even then this guarantees only a local maximum,
not a global one. In addition, if the parameter θ is restricted to lie in some
region then the extrema may occur on the boundary, and the derivative need not
be zero there. Maximum likelihood estimators can also be found by direct
maximisation of the likelihood function via numerical search methods. With
modern computing power this can be done very easily for a wide variety of
specifications of the likelihood function, allowing estimation of very complicated
models. Since the maximum of the logarithm of something occurs at the same
point as the maximum itself it is often convenient to consider the log-likelihood
function. The extension to multiple parameter models (i.e. where θ is a vector
of unknown parameters) is straightforward. Under fairly mild conditions on the
likelihood function it can be shown that, at least asymptotically (i.e. as sample
sizes tend to infinity), maximum likelihood estimators are unbiased and
fully efficient.
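As a rough illustration of both points (the data below are simulated, and the crude grid search merely stands in for a proper search algorithm), the likelihood and the log-likelihood of an iid N (θ, 1) sample can be evaluated over a grid of candidate values; they peak at the same point, close to the sample mean:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=2.0, scale=1.0, size=100)   # simulated N(theta = 2, 1) sample

    thetas = np.linspace(0.0, 4.0, 2001)           # grid of candidate parameter values

    # Log-likelihood and likelihood evaluated at every grid point.
    loglik = np.array([-0.5 * np.sum((x - t) ** 2) - 0.5 * len(x) * np.log(2 * np.pi)
                       for t in thetas])
    lik = np.exp(loglik)

    # Both are maximised at the same grid point, which lies close to the sample mean.
    print(thetas[np.argmax(lik)], thetas[np.argmax(loglik)], x.mean())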
Example Let x1 , x2 , . . . , xn be iid N (θ, 1) and let L(θ | X) denote the
likelihood function. Then
\[
L(\theta \mid X) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}}\, e^{-(x_i - \theta)^2 / 2}
\]

so that
\[
\log L(\theta \mid X) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^{n}(x_i - \theta)^2
\]

The equation ∂L(θ | X)/∂θ = 0 reduces to

\[
\sum_{i=1}^{n} (x_i - \hat\theta) = 0
\]

so the mean \(\sum_{i=1}^{n} x_i / n\) is a candidate for the maximum likelihood estimator of
θ. In fact we can verify that it is indeed a global maximum of the likelihood
function (there are no other extrema of L, the second order condition is easily
checked, and the likelihood is zero at the boundaries ±∞).
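This can also be checked numerically. The sketch below (with simulated data) confirms that the sum of deviations vanishes at the sample mean and that nearby parameter values give a lower log-likelihood:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(loc=0.5, scale=1.0, size=200)   # simulated N(theta = 0.5, 1) sample

    def loglik(theta, x):
        # log L(theta | X) = -(n/2) log(2 pi) - (1/2) sum (x_i - theta)^2
        return -0.5 * len(x) * np.log(2 * np.pi) - 0.5 * np.sum((x - theta) ** 2)

    theta_hat = x.mean()                 # candidate from the first order condition
    print(np.sum(x - theta_hat))         # essentially zero
    print(loglik(theta_hat, x) > loglik(theta_hat + 0.1, x))   # True
    print(loglik(theta_hat, x) > loglik(theta_hat - 0.1, x))   # True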
Example Let x1 , x2 , . . . , xn be iid N (θ, 1) where it is known that θ must be
non-negative. With no restrictions on θ we know the MLE of θ is \(\sum_{i=1}^{n} x_i / n\).
However, if this is negative it will lie outside the permitted range of the
parameter. If the mean is negative it is easy to check that the likelihood function
L(θ | X) is decreasing in θ for θ ≥ 0 and is maximised at θ = 0. Hence in this
case the MLE of θ is \(\max\{0, \sum_{i=1}^{n} x_i / n\}\).
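In code the restricted estimator is just the sample mean truncated at zero. The sample below is invented so that its mean happens to be negative:

    import numpy as np

    # Invented sample whose mean is negative (illustration only).
    x = np.array([-0.8, 0.3, -0.5, -1.1, 0.2])

    theta_unrestricted = x.mean()             # unrestricted MLE: -0.38, outside the range
    theta_ml = max(0.0, theta_unrestricted)   # restricted MLE: max{0, sum(x_i)/n} = 0.0
    print(theta_unrestricted, theta_ml)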
Example Suppose we have a random sample consisting of N observations
of a variable yi and corresponding explanatory variable xi . We wish to estimate
a linear regression model, and specify that the error term is independently
normally distributed. That is, we have
\[
y_i = \beta x_i + \varepsilon_i, \qquad i = 1, 2, 3, \ldots, N, \qquad \varepsilon_i \sim N(0, \sigma^2)
\]

The unknown parameters are β and σ². Now yi − βxi = εi ∼ N(0, σ²), so the
likelihood of each observation (that is, the pair (xi , yi )) is given by
\(\frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-(y_i - \beta x_i)^2 / 2\sigma^2\right)\). So the likelihood for the whole sample is:

\[
L(y, x \mid \beta, \sigma^2) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(y_i - \beta x_i)^2 / 2\sigma^2}
\]

The log-likelihood is thus:

\[
\log L(y, x \mid \beta, \sigma^2) = K - \frac{N}{2}\log(\sigma^2) - \sum_{i=1}^{N} (y_i - \beta x_i)^2 / 2\sigma^2
\]

where K is a constant (reflecting the log of the constants in the expression for
the likelihood). Maximising over β and σ 2 (notice we differentiate with respect
to σ 2 , not σ), we get the following conditions

\[
\frac{\partial \log L}{\partial \beta} = \frac{1}{\sigma^2} \sum_{i=1}^{N} (y_i - \beta x_i)\, x_i
\]

\[
\frac{\partial \log L}{\partial \sigma^2} = -\frac{N}{2\sigma^2} + \frac{\sum_{i=1}^{N} (y_i - \beta x_i)^2}{2\sigma^4}
\]
Setting these to zero and solving we get

\[
\hat\beta_{ML} = \frac{\sum x_i y_i}{\sum x_i^2}, \qquad
\hat\sigma^2_{ML} = \frac{\sum_{i=1}^{N} (y_i - \hat\beta x_i)^2}{N}
\]
The alert reader will notice that \(\hat\beta_{ML}\) coincides with the OLS estimator. The
second order conditions and boundary behaviour can be checked (a tedious
exercise). In this specific case, where the disturbances are known to be
independently normally distributed, the maximum likelihood estimator coincides
with the least squares estimator.
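The coincidence is easy to verify numerically. The sketch below (simulated data, regression through the origin as in the specification above) applies the closed-form maximum likelihood formulas and compares the slope with an ordinary least squares fit:

    import numpy as np

    rng = np.random.default_rng(3)
    N = 200
    x = rng.uniform(1.0, 5.0, size=N)
    y = 1.5 * x + rng.normal(scale=0.8, size=N)   # y_i = beta x_i + eps_i, eps_i ~ N(0, sigma^2)

    # Maximum likelihood estimates from the closed-form solutions above.
    beta_ml = np.sum(x * y) / np.sum(x ** 2)
    sigma2_ml = np.sum((y - beta_ml * x) ** 2) / N

    # OLS through the origin yields exactly the same slope estimate.
    beta_ols, *_ = np.linalg.lstsq(x[:, None], y, rcond=None)
    print(beta_ml, beta_ols[0], sigma2_ml)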

0.0.1 Numerical Optimisation


Maximum likelihood provides a powerful estimation technique, but frequently
generates first order conditions that cannot be solved explicitly for the maximum
likelihood estimates. In these circumstances numerical maximisation of the
likelihood function via some search algorithm is usually needed to obtain the
ML estimates. Given a likelihood function L(θ) where θ are the parameters of
the model, numerical optimisation generally requires three stages:
1. A set of starting values θ0
2. A rule for updating the parameter estimates, i.e. θi+1 = Φ(θi ), at each iteration
3. A rule for stopping
The starting values may be chosen at random or some weaker estimation tech-
nique may suggest values. There are a variety of updating rules that may
be suitable depending on the form of the likelihood function and the general
nature of the problem, from simple grid searches to methods such as steepest
ascent, Davidon-Fletcher-Powell or Newton-Raphson/Gauss-Newton type
methods. Convergence is usually declared when the parameter updates fail to
improve the objective function by more than some threshold. Problems that may arise
include failure of the search algorithm to converge, finding local rather than
global maxima, or the maximum not being well-defined. It is usually wise to
check the maximum likelihood estimates obtained from search algorithms care-
fully. Standard errors of the estimates are usually obtained by considering the
second derivative of the objective function (since the curvature of the objective
function gives some indication of the precision of the estimates).
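As a rough illustration of these three stages (the model is the regression example above; scipy's BFGS routine is only one of many possible updating rules, and its inverse-Hessian approximation is used here as a stand-in for the exact second-derivative matrix when computing standard errors):

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(4)
    N = 200
    x = rng.uniform(1.0, 5.0, size=N)
    y = 1.5 * x + rng.normal(scale=0.8, size=N)

    def negloglik(params, y, x):
        # Negative log-likelihood of y_i = beta x_i + eps_i, eps_i ~ N(0, sigma^2),
        # parameterised as (beta, log sigma^2) so the search is unconstrained.
        beta, log_s2 = params
        s2 = np.exp(log_s2)
        resid = y - beta * x
        return 0.5 * len(y) * np.log(2 * np.pi * s2) + 0.5 * np.sum(resid ** 2) / s2

    theta0 = np.array([0.0, 0.0])                    # 1. starting values
    res = minimize(negloglik, theta0, args=(y, x),   # 2. BFGS updating rule
                   method="BFGS")                    # 3. stops when the gradient is small

    beta_hat, log_s2_hat = res.x
    # Approximate standard errors (for beta and log sigma^2) from the inverse
    # Hessian of the negative log-likelihood.
    std_errs = np.sqrt(np.diag(res.hess_inv))
    print(beta_hat, np.exp(log_s2_hat), std_errs)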
