
Maximum Likelihood Estimation

A general and powerful method of estimation is that known as maximum
likelihood. The idea is as follows. Suppose we start with a random sample of n
observations x1 , x2 , . . . , xn drawn from a probability density function f (x | θ)
involving an unknown parameter θ. If we denote the complete set of observations
by the vector X = (x1 , x2 , . . . , xn ), then the joint probability density function
of X may be written as

f (X | θ) = f (x1 | θ) f (x2 | θ) . . . f (xn | θ)

In probability theory the function f (X | θ) gives the probability of observing
the sample given the parameter θ (remembering that a random sample gives
independent drawings from the population so the probabilities multiply). In
particular if
f (X1 | θ) > f (X2 | θ)
then we may (loosely) say that observing X1 is “more likely” than observing
X2 . However, in estimation theory we observe the value of X and wish to say
something about θ. Now consider f (X | θ) as a function of θ (with X fixed at
the observed values). We call this the likelihood function of θ. If

f (X | θ1 ) > f (X | θ2 )

we may (loosely) say that θ1 is a “more plausible” value of θ than θ2 , since θ1
ascribes a larger probability to the observed X than does θ2 . The principle
of maximum likelihood just says that we should use as our estimator of θ the
parameter point for which our observed sample is most likely. That is, we should
choose the value of θ which maximises the likelihood function. To emphasise
that we have moved from considering the probability of the sample given the
parameter θ to considering the likelihood of the parameter θ given the sample
we use a different notation for the likelihood function, L(θ | X). So

L(θ | X) = f (X | θ)
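For instance, the comparison can be made concrete with a few lines of Python. In the sketch below the sample values and the two candidate parameter values are invented purely for illustration, and the density is taken to be N (θ, 1) (the same specification as the first worked example further on):

    import numpy as np

    # Hypothetical observed sample and two candidate parameter values
    # (all invented for illustration).
    x = np.array([1.2, 0.7, 1.9, 1.1, 0.4])

    def likelihood(theta, x):
        # L(theta | X): product of the individual N(theta, 1) densities.
        return np.prod(np.exp(-(x - theta) ** 2 / 2) / np.sqrt(2 * np.pi))

    print(likelihood(1.0, x))   # the larger of the two values
    print(likelihood(3.0, x))   # far smaller: theta = 1.0 is the "more plausible" value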

If the likelihood function is differentiable as a function of θ then possible
candidates for the maximum likelihood estimator are the values of θ for which

\[
\frac{\partial}{\partial \theta} L(\theta \mid X) = 0
\]
Note that solutions are only candidates for the MLE: a zero derivative is
necessary for a maximum but not sufficient. The second derivative must be
negative for a maximum, and even then this guarantees only a local maximum,
not a global one. In addition, if the parameter θ is restricted to lie in some
region then the extrema may occur on the boundary, and the derivative need not
be zero there. Maximum likelihood estimators can also be found by direct
maximisation of the likelihood function via numerical search methods. With
modern computing power this can be done very easily for a wide variety of
specifications of the likelihood function, allowing estimation of very complicated
models. Since the maximum of the logarithm of something occurs at the same
point as the maximum itself it is often convenient to consider the log-likelihood
function. The extension to multiple parameter models (i.e. where θ is a vector
of unknown parameters) is straightforward. Under fairly mild conditions on the
likelihood function it can be shown that, at least asymptotically (i.e. as sample
sizes tend to infinity), maximum likelihood estimators are unbiased and
fully efficient.
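As a rough illustration of both points (the data below are simulated, and the crude grid search merely stands in for a proper search algorithm), the likelihood and the log-likelihood of an iid N (θ, 1) sample can be evaluated over a grid of candidate values; they peak at the same point, close to the sample mean:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=2.0, scale=1.0, size=100)   # simulated N(theta = 2, 1) sample

    thetas = np.linspace(0.0, 4.0, 2001)           # grid of candidate parameter values

    # Log-likelihood and likelihood evaluated at every grid point.
    loglik = np.array([-0.5 * np.sum((x - t) ** 2) - 0.5 * len(x) * np.log(2 * np.pi)
                       for t in thetas])
    lik = np.exp(loglik)

    # Both are maximised at the same grid point, which lies close to the sample mean.
    print(thetas[np.argmax(lik)], thetas[np.argmax(loglik)], x.mean())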
Example Let x1 , x2 , . . . , xn be iid N (θ, 1) and let L(θ | X) denote the
likelihood function. Then
\[
L(\theta \mid X) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}}\, e^{-(x_i - \theta)^2 / 2}
\]

so that
\[
\log L(\theta \mid X) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^{n}(x_i - \theta)^2
\]

The equation ∂L(θ | X)/∂θ = 0 reduces to

\[
\sum_{i=1}^{n} (x_i - \hat\theta) = 0
\]

so the mean \(\sum_{i=1}^{n} x_i / n\) is a candidate for the maximum likelihood estimator of
θ. In fact we can verify that it is indeed a global maximum of the likelihood
function (there are no other extrema of L, the second order condition is easily
checked, and the likelihood is zero at the boundaries ±∞).
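This can also be checked numerically. The sketch below (with simulated data) confirms that the sum of deviations vanishes at the sample mean and that nearby parameter values give a lower log-likelihood:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(loc=0.5, scale=1.0, size=200)   # simulated N(theta = 0.5, 1) sample

    def loglik(theta, x):
        # log L(theta | X) = -(n/2) log(2 pi) - (1/2) sum (x_i - theta)^2
        return -0.5 * len(x) * np.log(2 * np.pi) - 0.5 * np.sum((x - theta) ** 2)

    theta_hat = x.mean()                 # candidate from the first order condition
    print(np.sum(x - theta_hat))         # essentially zero
    print(loglik(theta_hat, x) > loglik(theta_hat + 0.1, x))   # True
    print(loglik(theta_hat, x) > loglik(theta_hat - 0.1, x))   # True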
Example Let x1 , x2 , . . . , xn be iid N (θ, 1) where it is known that θ must be
non-negative. With no restrictions on θ we know the MLE of θ is \(\sum_{i=1}^{n} x_i / n\).
However, if this is negative it will lie outside the permitted range of the
parameter. If the mean is negative it is easy to check that the likelihood function
L(θ | X) is decreasing in θ for θ ≥ 0 and is maximised at θ = 0. Hence in this
case the MLE of θ is \(\max\{0, \sum_{i=1}^{n} x_i / n\}\).
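In code the restricted estimator is just the sample mean truncated at zero. The sample below is invented so that its mean happens to be negative:

    import numpy as np

    # Invented sample whose mean is negative (illustration only).
    x = np.array([-0.8, 0.3, -0.5, -1.1, 0.2])

    theta_unrestricted = x.mean()             # unrestricted MLE: -0.38, outside the range
    theta_ml = max(0.0, theta_unrestricted)   # restricted MLE: max{0, sum(x_i)/n} = 0.0
    print(theta_unrestricted, theta_ml)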
Example Suppose we have a random sample consisting of N observations
of a variable yi and corresponding explanatory variable xi . We wish to estimate
a linear regression model, and specify that the error term is independently
normally distributed. That is, we have
\[
y_i = \beta x_i + \varepsilon_i, \qquad i = 1, 2, 3, \ldots, N, \qquad \varepsilon_i \sim N(0, \sigma^2)
\]

The unknown parameters are β and σ². Now yi − βxi = εi ∼ N(0, σ²), so the
likelihood of each observation (that is, the pair (xi , yi )) is given by
\(\frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-(y_i - \beta x_i)^2 / 2\sigma^2\right)\). So the likelihood for the whole sample is:

\[
L(y, x \mid \beta, \sigma^2) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(y_i - \beta x_i)^2 / 2\sigma^2}
\]

The log-likelihood is thus:

\[
\log L(y, x \mid \beta, \sigma^2) = K - \frac{N}{2}\log(\sigma^2) - \sum_{i=1}^{N} (y_i - \beta x_i)^2 / 2\sigma^2
\]

where K is a constant (reflecting the log of the constants in the expression for
the likelihood). Maximising over β and σ 2 (notice we differentiate with respect
to σ 2 , not σ), we get the following conditions

\[
\frac{\partial \log L}{\partial \beta} = \frac{1}{\sigma^2} \sum_{i=1}^{N} (y_i - \beta x_i)\, x_i
\]

\[
\frac{\partial \log L}{\partial \sigma^2} = -\frac{N}{2\sigma^2} + \frac{\sum_{i=1}^{N} (y_i - \beta x_i)^2}{2\sigma^4}
\]
Setting these to zero and solving we get

\[
\hat\beta_{ML} = \frac{\sum x_i y_i}{\sum x_i^2}, \qquad
\hat\sigma^2_{ML} = \frac{\sum_{i=1}^{N} (y_i - \hat\beta x_i)^2}{N}
\]
The alert reader will notice that \(\hat\beta_{ML}\) coincides with the OLS estimator. The
second order conditions and boundary behaviour can be checked (a tedious
exercise). In this specific case, where the disturbances are known to be
independently normally distributed, the maximum likelihood estimator coincides
with the least squares estimator.
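The coincidence is easy to verify numerically. The sketch below (simulated data, regression through the origin as in the specification above) applies the closed-form maximum likelihood formulas and compares the slope with an ordinary least squares fit:

    import numpy as np

    rng = np.random.default_rng(3)
    N = 200
    x = rng.uniform(1.0, 5.0, size=N)
    y = 1.5 * x + rng.normal(scale=0.8, size=N)   # y_i = beta x_i + eps_i, eps_i ~ N(0, sigma^2)

    # Maximum likelihood estimates from the closed-form solutions above.
    beta_ml = np.sum(x * y) / np.sum(x ** 2)
    sigma2_ml = np.sum((y - beta_ml * x) ** 2) / N

    # OLS through the origin yields exactly the same slope estimate.
    beta_ols, *_ = np.linalg.lstsq(x[:, None], y, rcond=None)
    print(beta_ml, beta_ols[0], sigma2_ml)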

0.0.1 Numerical Optimisation


Maximum likelihood provides a powerful estimation technique, but frequently
generates first order conditions that cannot be solved explicitly for the maximum
likelihood estimates. In these circumstances numerical maximisation of the
likelihood function via some search algorithm is usually needed to obtain the
ML estimates. Given a likelihood function L(θ) where θ are the parameters of
the model, numerical optimisation generally requires three stages:
1. A set of starting values θ0
2. A rule for updating the parameter estimates, i.e. θi+1 = Φ(θi ), at each iteration
3. A rule for stopping
The starting values may be chosen at random or some weaker estimation tech-
nique may suggest values. There are a variety of updating rules that may
be suitable depending on the form of the likelihood function and the general
nature of the problem, from simple grid searches to methods such as steepest
ascent, Davidon-Fletcher-Powell or Newton-Raphson/Gauss-Newton type
methods. Convergence is usually declared when the parameter updates fail to
improve the objective function by more than some threshold. Problems that may arise
include failure of the search algorithm to converge, finding local rather than
global maxima, or the maximum not being well-defined. It is usually wise to
check the maximum likelihood estimates obtained from search algorithms care-
fully. Standard errors of the estimates are usually obtained by considering the
second derivative of the objective function (since the curvature of the objective
function gives some indication of the precision of the estimates).
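As a rough illustration of these three stages (the model is the regression example above; scipy's BFGS routine is only one of many possible updating rules, and its inverse-Hessian approximation is used here as a stand-in for the exact second-derivative matrix when computing standard errors):

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(4)
    N = 200
    x = rng.uniform(1.0, 5.0, size=N)
    y = 1.5 * x + rng.normal(scale=0.8, size=N)

    def negloglik(params, y, x):
        # Negative log-likelihood of y_i = beta x_i + eps_i, eps_i ~ N(0, sigma^2),
        # parameterised as (beta, log sigma^2) so the search is unconstrained.
        beta, log_s2 = params
        s2 = np.exp(log_s2)
        resid = y - beta * x
        return 0.5 * len(y) * np.log(2 * np.pi * s2) + 0.5 * np.sum(resid ** 2) / s2

    theta0 = np.array([0.0, 0.0])                    # 1. starting values
    res = minimize(negloglik, theta0, args=(y, x),   # 2. BFGS updating rule
                   method="BFGS")                    # 3. stops when the gradient is small

    beta_hat, log_s2_hat = res.x
    # Approximate standard errors (for beta and log sigma^2) from the inverse
    # Hessian of the negative log-likelihood.
    std_errs = np.sqrt(np.diag(res.hess_inv))
    print(beta_hat, np.exp(log_s2_hat), std_errs)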
