ML Notes
Given data X with density f(X | θ), the value θ1 is regarded as more plausible than θ2 whenever
\[
f(X \mid \theta_1) > f(X \mid \theta_2).
\]
The likelihood function is defined by
\[
L(\theta \mid X) = f(X \mid \theta).
\]
The approach accommodates very general specifications of the likelihood function, allowing estimation of very complicated models.
Since the maximum of the logarithm of a function occurs at the same point as the maximum of the function itself, it is often convenient to consider the log-likelihood function. The extension to multiple-parameter models (i.e. where θ is a vector of unknown parameters) is straightforward. Under fairly mild conditions on the likelihood function it can be shown that, at least asymptotically (i.e. as sample sizes tend to infinity), maximum likelihood estimators are unbiased and fully efficient.
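As a minimal numerical illustration of the first point (not part of the original notes; it assumes NumPy and anticipates the normal example below), the likelihood and the log-likelihood of a simulated N(θ, 1) sample are maximised at the same value of θ:

```python
# Sketch (simulated data, not from the notes; assumes NumPy): the likelihood and
# the log-likelihood of an iid N(theta, 1) sample peak at the same theta.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=50)     # iid N(theta = 2, 1) data

thetas = np.linspace(0.0, 4.0, 2001)            # grid of candidate values for theta
loglik = np.array([-0.5 * len(x) * np.log(2 * np.pi) - 0.5 * np.sum((x - t) ** 2)
                   for t in thetas])            # log-likelihood on the grid
lik = np.exp(loglik)                            # likelihood itself

# The two maximisers coincide (up to the grid resolution) with the sample mean.
print(thetas[np.argmax(lik)], thetas[np.argmax(loglik)], x.mean())
```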
Example Let x1, x2, ..., xn be iid N(θ, 1) and let L(θ | X) denote the likelihood function. Then
\[
L(\theta \mid X) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}}\, e^{-(x_i - \theta)^2/2}
\]
so that
\[
\log L(\theta \mid X) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^{n}(x_i - \theta)^2
\]
The equation ∂L(θ | X)/∂θ = 0 reduces to
\[
\sum_{i=1}^{n} (x_i - \hat\theta) = 0
\]
so the mean \(\sum_{i=1}^{n} x_i / n\) is a candidate for the maximum likelihood estimator of
θ. In fact we can verify that it is indeed a global maximum of the likelihood
function (there are no other extrema of L, the second order condition is easily
checked, and the likelihood is zero at the boundaries ±∞).
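A short sketch of this result (simulated data, not from the notes; assumes NumPy and SciPy): a numerical minimisation of the negative log-likelihood recovers the sample mean.

```python
# Sketch (simulated data, not from the notes): the sample mean maximises the
# N(theta, 1) likelihood, so a numerical minimiser of the negative
# log-likelihood should land on it.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.normal(loc=0.7, scale=1.0, size=200)    # iid N(theta = 0.7, 1) data

def neg_loglik(theta):
    # negative log-likelihood of an iid N(theta, 1) sample (constant included)
    return 0.5 * len(x) * np.log(2 * np.pi) + 0.5 * np.sum((x - theta) ** 2)

res = minimize_scalar(neg_loglik)               # unconstrained scalar minimisation
print(res.x, x.mean())                          # the two agree to numerical precision
```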
Example Let x1, x2, ..., xn be iid N(θ, 1) where it is known that θ must be non-negative. With no restrictions on θ we know the MLE of θ is \(\sum_{i=1}^{n} x_i / n\). However if this is negative it will lie outside the permitted range of the parameter. If the mean is negative it is easy to check that the likelihood function L(θ | X) is decreasing in θ for θ ≥ 0 and is maximised at θ = 0. Hence in this case the MLE of θ is \(\max\{0, \sum_{i=1}^{n} x_i / n\}\).
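The restricted case can be checked numerically in the same way. The sketch below (simulated data, not from the notes; assumes SciPy's bounded scalar minimiser) compares a search restricted to θ ≥ 0 with the closed form \(\max\{0, \sum_{i=1}^{n} x_i / n\}\).

```python
# Sketch (simulated data, not from the notes): with theta restricted to be
# non-negative, a bounded numerical search agrees with the closed form
# max{0, sample mean}.  Whether the restriction binds depends on the sample.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
x = rng.normal(loc=0.05, scale=1.0, size=50)    # true theta is small, so the
                                                # sample mean may well be negative

def neg_loglik(theta):
    return 0.5 * np.sum((x - theta) ** 2)       # constants dropped

# search restricted to theta >= 0 (the upper bound 10 is arbitrary but generous)
res = minimize_scalar(neg_loglik, bounds=(0.0, 10.0), method="bounded")
print(res.x, max(0.0, x.mean()))                # agree to the solver's tolerance
```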
Example Suppose we have a random sample consisting of N observations
of a variable yi and corresponding explanatory variable xi . We wish to estimate
a linear regression model, and specify that the error term is independently
normally distributed. That is, we have
\[
y_i = \beta x_i + \varepsilon_i, \qquad i = 1, 2, 3, \dots, N, \qquad \varepsilon_i \sim N(0, \sigma^2)
\]
so that the likelihood is
\[
L(y, x \mid \beta, \sigma^2) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(y_i - \beta x_i)^2 / 2\sigma^2}
\]
The log-likelihood is thus:
\[
\log L(y, x \mid \beta, \sigma^2) = K - \frac{N}{2}\log(\sigma^2) - \sum_{i=1}^{N} \frac{(y_i - \beta x_i)^2}{2\sigma^2}
\]
where K is a constant (reflecting the log of the constants in the expression for
the likelihood). Maximising over β and σ 2 (notice we differentiate with respect
to σ 2 , not σ), we get the following conditions
\[
\frac{\partial \log L}{\partial \beta} = \frac{1}{\sigma^2} \sum_{i=1}^{N} (y_i - \beta x_i) x_i
\]
\[
\frac{\partial \log L}{\partial \sigma^2} = -\frac{N}{2\sigma^2} + \frac{\sum_{i=1}^{N} (y_i - \beta x_i)^2}{2\sigma^4}
\]
Setting these to zero and solving, we get
\[
\hat\beta_{ML} = \frac{\sum x_i y_i}{\sum x_i^2}
\]
\[
\hat\sigma^2_{ML} = \frac{\sum_{i=1}^{N} (y_i - \hat\beta x_i)^2}{N}
\]
The alert reader will notice that \(\hat\beta_{ML}\) coincides with the OLS estimator. The second-order conditions and boundary behaviour can be checked (tedious). In this
specific case where the disturbances are known to be independently normally
distributed the maximum likelihood estimator coincides with the least squares
estimator.
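A brief numerical check of this coincidence (simulated data, not from the notes; assumes NumPy, with the no-intercept OLS fit done via np.linalg.lstsq):

```python
# Sketch (simulated data, not from the notes): the closed-form ML estimates of the
# no-intercept normal regression coincide with the OLS fit of y on x.
import numpy as np

rng = np.random.default_rng(3)
N = 500
x = rng.uniform(1.0, 5.0, size=N)
y = 2.5 * x + rng.normal(0.0, 1.5, size=N)      # beta = 2.5, sigma^2 = 2.25

# closed-form ML estimates from the notes
beta_ml = np.sum(x * y) / np.sum(x ** 2)
sigma2_ml = np.sum((y - beta_ml * x) ** 2) / N

# OLS estimate of beta in the same no-intercept regression
beta_ols = np.linalg.lstsq(x[:, None], y, rcond=None)[0][0]

print(beta_ml, beta_ols)                        # identical up to floating-point error
print(sigma2_ml)                                # note: divides by N, not N - 1
```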
In practice the likelihood often has to be maximised numerically, using a search algorithm suited to the nature of the problem, from simple grid searches to methods such as steepest ascent, Davidon-Fletcher-Powell or Newton-Raphson/Gauss-Newton type methods. Convergence is usually declared when the parameter updates fail to improve the objective function by more than some threshold. Problems that may arise include failure of the search algorithm to converge, finding local rather than global maxima, or the maximum not being well-defined. It is usually wise to check the maximum likelihood estimates obtained from search algorithms carefully. Standard errors of the estimates are usually obtained by considering the second derivative of the objective function (since the curvature of the objective function gives some indication of the precision of the estimates).
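The sketch below illustrates these ideas on the regression example above (simulated data, not from the notes; assumes NumPy): a Newton-Raphson search using finite-difference derivatives, a convergence check based on the improvement in the objective, and standard errors taken from the inverse of the negative Hessian at the maximum.

```python
# Sketch (simulated data, not from the notes): Newton-Raphson on the regression
# log-likelihood with finite-difference derivatives, stopping when the objective
# stops improving, and standard errors from the inverse of the negative Hessian.
import numpy as np

rng = np.random.default_rng(4)
N = 400
x = rng.uniform(1.0, 5.0, size=N)
y = 1.8 * x + rng.normal(0.0, 1.0, size=N)      # beta = 1.8, sigma^2 = 1

def loglik(params):
    beta, sigma2 = params
    resid = y - beta * x
    return -0.5 * N * np.log(2 * np.pi * sigma2) - np.sum(resid ** 2) / (2 * sigma2)

def num_grad(f, p, h=1e-5):
    # central finite-difference gradient
    g = np.zeros_like(p)
    for i in range(p.size):
        e = np.zeros_like(p); e[i] = h
        g[i] = (f(p + e) - f(p - e)) / (2 * h)
    return g

def num_hess(f, p, h=1e-4):
    # finite-difference Hessian built column by column from the gradient
    H = np.zeros((p.size, p.size))
    for i in range(p.size):
        e = np.zeros_like(p); e[i] = h
        H[:, i] = (num_grad(f, p + e) - num_grad(f, p - e)) / (2 * h)
    return H

# crude but sensible starting values: ratio of means and the implied residual variance
beta0 = y.mean() / x.mean()
theta = np.array([beta0, np.var(y - beta0 * x)])
old = loglik(theta)
for _ in range(100):
    step = np.linalg.solve(num_hess(loglik, theta), num_grad(loglik, theta))
    theta = theta - step                        # Newton-Raphson update
    new = loglik(theta)
    if abs(new - old) < 1e-10:                  # objective no longer improving
        break
    old = new

# standard errors: square roots of the diagonal of the inverse negative Hessian
se = np.sqrt(np.diag(np.linalg.inv(-num_hess(loglik, theta))))
print(theta)                                    # ML estimates of beta and sigma^2
print(se)                                       # their approximate standard errors
```

In practice one would use analytical derivatives or a library optimiser, but the structure of the Newton update and of the standard-error calculation is the same.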