Lecture 4: Frequentist Estimation
Suppose we have an unknown parameter $\theta$ and some data $X$. We may want to use the data to estimate the value of $\theta$.
An estimator $\hat{\theta}$ of an unknown parameter $\theta$ is any function of the data that is intended to estimate $\theta$. Although we usually write the estimator simply as $\hat{\theta}$, it is actually $\hat{\theta}(X)$, a function of the random sample $X$.
4.1 Likelihood Function
Many procedures in statistical inference involve a mathematical object called the likelihood
function.
Definition of Likelihood
Let $\theta$ be an unknown parameter, and let $X$ be a sample with joint pmf or pdf $f_\theta(x)$. The likelihood function for a particular set of data values $x$ is $L_x(\theta) = f_\theta(x)$.
The function $L_x(\theta)$ is simply a function of $\theta$ (though it may be a different function for different values of $x$). The function itself is not random.
Sometimes we may also need to consider $L_X(\theta)$, which is a random function of $\theta$.
Example 4.1.1: Let $X \sim \mathrm{Bin}(n, \theta)$. If we observe $X = x$, then $L_x(\theta) = \binom{n}{x}\theta^x(1-\theta)^{n-x}$. We might also consider the random function $L_X(\theta) = \binom{n}{X}\theta^X(1-\theta)^{n-X}$.
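As a rough illustration of Example 4.1.1 (not part of the original notes), the Python sketch below evaluates the binomial likelihood on a grid of $\theta$ values for hypothetical data $n = 10$, $x = 3$; the grid size and data values are arbitrary choices.

    import numpy as np
    from scipy.stats import binom

    # Hypothetical data: n = 10 trials, x = 3 successes (illustrative values only).
    n, x = 10, 3

    # Evaluate L_x(theta) = C(n, x) * theta^x * (1 - theta)^(n - x) on a grid of theta values.
    theta_grid = np.linspace(0.0, 1.0, 1001)
    likelihood = binom.pmf(x, n, theta_grid)

    # The likelihood is largest near theta = x / n = 0.3.
    print(theta_grid[np.argmax(likelihood)])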
Interpretation of Likelihood
The likelihood function is essentially the same mathematical object as the joint pmf or pdf,
but its interpretation is different.
For $f_\theta(x)$, we think about fixing a parameter value $\theta$ and allowing $x$ to vary.
For $L_x(\theta)$, we think about fixing a collection of sample values $x$ and allowing $\theta$ to vary.
Since the pmf or pdf is nonnegative, the likelihood must be nonnegative as well.
What Likelihood Is Not
The likelihood is not a pdf (or pmf) of $\theta$ given the data. There are several things wrong with such an interpretation:
In frequentist inference, the unknown parameter $\theta$ is not a random variable, so talking about its distribution makes no sense.
Even in Bayesian inference, the likelihood is still the same mathematical object as the
pmf or pdf of the data. Hence, it describes probabilities of observing data values given
certain parameter values, not the other way around.
The likelihood does not (in general) sum or integrate to 1 when summing or integrating over $\theta$. In fact, it may sum or integrate to $\infty$, in which case we cannot even rescale it to make it a pdf (or pmf).
Likelihood of Independent Samples
If the sample $X$ consists of iid observations from a common individual pmf or pdf $f_\theta(x)$, then the likelihood of the sample $X$ is simply the product of the likelihoods associated with the individual random variables $X_1, \dots, X_n$, in which case $L_x(\theta) = \prod_{i=1}^n L_{x_i}(\theta) = \prod_{i=1}^n f_\theta(x_i)$.
Log-Likelihood
It is often more convenient to work with the logarithm of the likelihood, $\ell_x(\theta) = \log L_x(\theta)$. We adopt the convention that $\ell_x(\theta) = -\infty$ when $L_x(\theta) = 0$.
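As a quick illustration (not from the original notes), the sketch below computes the log-likelihood of an iid sample by summing the individual log densities, which is the logarithm of the product form above; the Exponential model and the data and parameter values are made up.

    import numpy as np
    from scipy.stats import expon

    # Hypothetical iid sample (illustrative values only).
    x = np.array([0.8, 1.3, 0.4, 2.2, 1.0])

    def log_likelihood(theta):
        # l_x(theta) = sum_i log f_theta(x_i), here with f_theta the Exponential(rate theta) pdf.
        return np.sum(expon.logpdf(x, scale=1.0 / theta))

    print(log_likelihood(theta=1.0))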
4.2 Maximum Likelihood Estimation
An estimate $\hat{\theta}(x)$ of $\theta$ (with allowed parameter space $\Theta$) is called a maximum likelihood estimate (MLE) if it maximizes the likelihood over $\Theta$, i.e., if
\[
L_x[\hat{\theta}(x)] = \max_{\theta \in \Theta} L_x(\theta),
\]
noting that maximizing the likelihood is equivalent to maximizing the log-likelihood. Also note that the MLE can only take values within the allowed parameter space $\Theta$.
For example, suppose $X_1, \dots, X_n$ are iid Poisson($\theta$) observations, where $\theta \ge 0$ is unknown. The log-likelihood is
\[
\ell_x(\theta) = \log\!\left[\prod_{i=1}^n \frac{\theta^{x_i} \exp(-\theta)}{(x_i)!}\right] = -n\theta + n\bar{x}\log\theta - \sum_{i=1}^n \log[(x_i)!].
\]
Then
\[
\frac{\partial}{\partial\theta}\,\ell_x(\theta) = -n + \frac{n\bar{x}}{\theta} = 0 \implies \theta = \bar{x}.
\]
Observe that $\theta = \bar{x}$ is the only critical point. Also observe that if $\bar{x} > 0$, then $\ell_x(\theta) \to -\infty$ both as $\theta \to 0$ and as $\theta \to \infty$, so $\theta = \bar{x}$ is clearly the maximizer if $\bar{x} > 0$. If instead $\bar{x} = 0$, then $\ell_x(\theta)$ is simply a decreasing function of $\theta$, and so $\theta = 0 = \bar{x}$ is again the maximizer. Hence,
\[
\ell_x(\bar{x}) = \max_{\theta \ge 0} \ell_x(\theta) \quad \text{for all } x \in \mathbb{N}_0^n,
\]
where $\mathbb{N}_0 = \{0, 1, 2, \dots\}$. Thus, the maximum likelihood estimator of $\theta$ is $\hat{\theta}_{\mathrm{MLE}} = \bar{X}$.
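As an illustration beyond the original notes, the sketch below checks numerically that the Poisson log-likelihood is maximized at the sample mean for a made-up sample; the data values and the optimization bounds are arbitrary.

    import numpy as np
    from scipy.optimize import minimize_scalar
    from scipy.special import gammaln

    # Hypothetical Poisson counts (illustrative values only).
    x = np.array([2, 0, 3, 1, 4, 2])

    def neg_log_likelihood(theta):
        # -l_x(theta) = n*theta - n*xbar*log(theta) + sum_i log(x_i!)
        return len(x) * theta - x.sum() * np.log(theta) + gammaln(x + 1).sum()

    res = minimize_scalar(neg_log_likelihood, bounds=(1e-8, 20.0), method="bounded")
    print(res.x, x.mean())  # the numerical maximizer should be close to the sample mean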
Next, consider a sample of iid $N(\mu, \sigma^2)$ observations $X_1, \dots, X_n$, where both $\mu$ and $\sigma^2$ are unknown. The partial derivatives of the log-likelihood are
\[
\frac{\partial}{\partial\mu}\,\ell_x(\mu, \sigma^2) = \frac{1}{\sigma^2}\sum_{i=1}^n x_i - \frac{n\mu}{\sigma^2} = \frac{n}{\sigma^2}(\bar{x} - \mu),
\]
\[
\frac{\partial}{\partial\sigma^2}\,\ell_x(\mu, \sigma^2) = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n (x_i - \mu)^2 = \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n \left[(x_i - \mu)^2 - \sigma^2\right].
\]
We now set both partial derivatives equal to zero and solve. First, note that
\[
\frac{\partial}{\partial\mu}\,\ell_x(\mu, \sigma^2) = \frac{n}{\sigma^2}(\bar{x} - \mu) = 0 \implies \mu = \bar{x}.
\]
We can now substitute $\mu = \bar{x}$ into the other partial derivative and set it equal to zero, yielding
\[
\frac{\partial}{\partial\sigma^2}\,\ell_x(\bar{x}, \sigma^2) = \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n \left[(x_i - \bar{x})^2 - \sigma^2\right] = 0 \implies \sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2 = \left(\frac{n-1}{n}\right) S^2.
\]
It can be seen from the form of $\ell_x(\mu, \sigma^2)$ that this point is indeed the maximizer, i.e.,
\[
\ell_x\!\left[\bar{x},\ \left(\frac{n-1}{n}\right) S^2\right] = \max_{\mu \in \mathbb{R},\, \sigma^2 > 0} \ell_x(\mu, \sigma^2). \tag{4.2.1}
\]
Note: The result in (4.2.1) holds for all $x \in \mathbb{R}^n$ such that $x_1, \dots, x_n$ are not all equal. If instead $x_1, \dots, x_n$ are all equal, then $(n-1)S^2/n = 0$, which is outside the allowed parameter space for $\sigma^2$. However, since $X_1, \dots, X_n$ are continuous random variables, $P(X_i = X_j) = 0$ for $i \ne j$ regardless of the values of $\mu$ and $\sigma^2$, meaning that this issue arises with probability zero. As mentioned in an earlier note, we are typically willing to ignore problematic samples if they occur with probability zero for all parameter values.
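As a sanity check outside the original notes, the sketch below compares the closed-form normal MLEs $\bar{x}$ and $(n-1)S^2/n$ with a direct numerical maximization of the log-likelihood; the sample values and the starting point are made up.

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    # Hypothetical iid normal sample (illustrative values only).
    x = np.array([2.3, 1.1, 3.4, 2.8, 1.9, 2.5])

    def neg_log_likelihood(params):
        mu, sigma2 = params
        return -np.sum(norm.logpdf(x, loc=mu, scale=np.sqrt(sigma2)))

    res = minimize(neg_log_likelihood, x0=[0.0, 1.0],
                   bounds=[(None, None), (1e-8, None)])

    # Closed-form MLEs: the sample mean and (n-1)S^2/n.
    print(res.x)                    # numerical maximizer (mu, sigma^2)
    print(x.mean(), x.var(ddof=0))  # xbar and (n-1)S^2/n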
In the linear regression model of Example 4.2.3, the responses $Y_1, \dots, Y_n$ are independent with
\[
Y_i \sim N\!\left(\sum_{j=1}^{p} \beta_j x_{ij},\ \sigma^2\right),
\]
where the $x_{ij}$ are known covariate values and $\beta \in \mathbb{R}^p$ is unknown.
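As a brief illustration beyond the notes: under the normal-error model just stated, maximizing the likelihood over $\beta$ (for any fixed $\sigma^2$) amounts to minimizing $\|Y - X\beta\|_2^2$, so the MLE of $\beta$ is the least squares fit. The sketch below computes it for simulated data; all values are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical design matrix and responses (illustrative values only).
    n, p = 20, 3
    X = rng.normal(size=(n, p))
    beta_true = np.array([1.0, -2.0, 0.5])
    Y = X @ beta_true + rng.normal(size=n)

    # Under normal errors, maximizing the likelihood over beta is the same as
    # minimizing ||Y - X beta||_2^2, i.e., ordinary least squares.
    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    print(beta_hat)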
Invariance to Reparametrization
The next theorem provides a very convenient property of the maximum likelihood estimator.
Theorem 4.2.4. Let $\hat{\theta}_{\mathrm{MLE}}$ be a maximum likelihood estimator of $\theta$ over the parameter space $\Theta$, and let $g$ be a function with domain $\Theta$ and image $\Psi$. Then $\hat{\psi}_{\mathrm{MLE}} = g(\hat{\theta}_{\mathrm{MLE}})$ is a maximum likelihood estimator of $\psi = g(\theta)$ over the parameter space $\Psi$.
Proof. See the proof of Theorem 7.6.2 in DeGroot & Schervish.
Example 4.2.5: Suppose that in Example 4.2.2, we had taken the second unknown parameter to be the standard deviation $\sigma$ instead of the variance $\sigma^2$. Then the maximum likelihood estimator of $\sigma$ would have simply been
\[
\hat{\sigma}_{\mathrm{MLE}} = \sqrt{(\hat{\sigma}^2)_{\mathrm{MLE}}} = \sqrt{\frac{(n-1)S^2}{n}}
\]
by Theorem 4.2.4.
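To connect this with the earlier numerical check (an illustration, not part of the notes): once the MLE of the variance has been computed, Theorem 4.2.4 says the MLE of the standard deviation is simply its square root.

    import numpy as np

    # Hypothetical iid normal sample (illustrative values only).
    x = np.array([2.3, 1.1, 3.4, 2.8, 1.9, 2.5])

    sigma2_mle = x.var(ddof=0)       # (n-1)S^2 / n, the MLE of the variance
    sigma_mle = np.sqrt(sigma2_mle)  # by invariance, the MLE of the standard deviation
    print(sigma2_mle, sigma_mle)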
Now suppose that, given known values $x_1, \dots, x_n$, the responses $Y_1, \dots, Y_n$ are independent with
\[
Y_i \sim \mathrm{Bernoulli}\!\left[\frac{\exp(\alpha + \beta x_i)}{1 + \exp(\alpha + \beta x_i)}\right],
\]
where $\alpha \in \mathbb{R}$ and $\beta \in \mathbb{R}$ are both unknown. (This is often called logistic regression.) The log-likelihood based on the sample $y = (y_1, \dots, y_n)$ is
\[
\ell_y(\alpha, \beta) = \log L_y(\alpha, \beta) = \log \prod_{i=1}^n \left[\frac{\exp(\alpha + \beta x_i)}{1 + \exp(\alpha + \beta x_i)}\right]^{y_i} \left[\frac{1}{1 + \exp(\alpha + \beta x_i)}\right]^{1 - y_i},
\]
with partial derivatives
\[
\frac{\partial}{\partial\alpha}\,\ell_y(\alpha, \beta) = \sum_{i=1}^n y_i - \sum_{i=1}^n \frac{\exp(\alpha + \beta x_i)}{1 + \exp(\alpha + \beta x_i)},
\]
\[
\frac{\partial}{\partial\beta}\,\ell_y(\alpha, \beta) = \sum_{i=1}^n x_i y_i - \sum_{i=1}^n \frac{x_i \exp(\alpha + \beta x_i)}{1 + \exp(\alpha + \beta x_i)}.
\]
Setting both partial derivatives above equal to zero yields a system of equations that cannot be solved in closed form. However, we can find a solution numerically to obtain maximum likelihood estimates $\hat{\alpha}_{\mathrm{MLE}}$ and $\hat{\beta}_{\mathrm{MLE}}$ for most samples $y \in \{0, 1\}^n$. (Note that no matter what the true values of $\alpha$ and $\beta$ are, there is a nonzero probability of obtaining a sample for which the maximum likelihood estimates do not exist.)
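Since the estimates must be found numerically, here is a minimal sketch (not from the original notes) that maximizes the logistic-regression log-likelihood with scipy; the parameter names follow the reconstruction above, and the data values are made up.

    import numpy as np
    from scipy.optimize import minimize

    # Hypothetical covariates and binary responses (illustrative values only).
    x = np.array([-1.2, -0.5, 0.1, 0.8, 1.5, 2.0])
    y = np.array([0, 0, 1, 0, 1, 1])

    def neg_log_likelihood(params):
        alpha, beta = params
        eta = alpha + beta * x
        # l_y(alpha, beta) = sum_i [ y_i * eta_i - log(1 + exp(eta_i)) ]
        return -np.sum(y * eta - np.logaddexp(0.0, eta))

    res = minimize(neg_log_likelihood, x0=[0.0, 0.0], method="BFGS")
    alpha_hat, beta_hat = res.x
    print(alpha_hat, beta_hat)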
4.3
Example 4.3.2: In the regression setup of Example 4.2.3, we could instead consider the least absolute deviation estimator
\[
\hat{\beta}_{\mathrm{LAD}} = \arg\min_{\beta \in \mathbb{R}^p} \|Y - X\beta\|_1,
\]
where $\|v\|_1 = \sum_i |v_i|$ denotes the $\ell_1$ norm.
Example 4.3.3: In the regression setup of Example 4.2.3, we could instead consider the lasso estimator
\[
\hat{\beta}_{\mathrm{LASSO}} = \arg\min_{\beta \in \mathbb{R}^p} \left( \|Y - X\beta\|_2^2 + \lambda \|\beta\|_1 \right),
\]
where $\lambda > 0$ is a tuning parameter.
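As an illustration beyond the notes, the sketch below computes the lasso estimator by numerically minimizing the penalized criterion above on simulated data; the tuning parameter value and the use of a derivative-free optimizer are arbitrary choices.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)

    # Hypothetical regression data (illustrative values only): n = 30 observations, p = 4 covariates.
    n, p = 30, 4
    X = rng.normal(size=(n, p))
    beta_true = np.array([2.0, 0.0, -1.0, 0.0])
    Y = X @ beta_true + rng.normal(size=n)

    lam = 5.0  # arbitrary tuning parameter

    def lasso_objective(beta):
        # ||Y - X beta||_2^2 + lambda * ||beta||_1
        return np.sum((Y - X @ beta) ** 2) + lam * np.sum(np.abs(beta))

    # Nelder-Mead avoids derivatives, since the l1 penalty is not differentiable at zero.
    res = minimize(lasso_objective, x0=np.zeros(p), method="Nelder-Mead")
    print(res.x)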