
STATS 200 (Stanford University, Summer 2015)

Lecture 4: Frequentist Estimation

Suppose we have an unknown parameter $\theta$ and some data $X$. We may want to use the data to estimate the value of $\theta$.

An estimator $\hat{\theta}$ of an unknown parameter $\theta$ is any function of the data that is intended to approximate $\theta$ in some sense. Although we typically just write $\hat{\theta}$, it is actually $\hat{\theta}(X)$, a random variable.

An estimate is the value an estimator takes for a particular set of data values. Thus, the estimator $\hat{\theta}(X)$ would yield the estimate $\hat{\theta}(x)$ if we observe the data $X = x$.
Good and Bad Estimators
Any function of the data can be considered an estimator for any parameter. However, it
may not be a good estimator. A good estimator will usually be close to the parameter
it estimates, in a sense to be formalized later. It may or may not be obvious whether an
estimator is good, or whether one estimator is better than another.
Example 4.0.1: Suppose $X_1, \ldots, X_n \sim$ iid $\mathrm{Exp}(\theta)$, and we wish to estimate $\theta = 1/E(X)$.
Estimators such as $\hat{\theta} = (\bar{X})^{-1}$ or $\hat{\theta} = (\text{sample median})^{-1}$ might be good estimators.
Estimators such as $\hat{\theta} = 1 + (\bar{X})^{-3}$ or $\hat{\theta} = 17$ are probably bad estimators.
Anything involving $\theta$ itself is not an estimator.
Note that $\hat{\theta}(x_1, \ldots, x_n) = 17$ might be fine as an estimate for some particular data set $X_1 = x_1, \ldots, X_n = x_n$. However, the estimator $\hat{\theta}(X_1, \ldots, X_n) = 17$, which ignores the data and takes the value 17 no matter what, is probably a bad estimator.
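As a quick illustration (a sketch with hypothetical values, not part of the notes: rate $\theta = 2$, $n = 200$ observations, numpy assumed), we can compare the data-driven estimator $(\bar{X})^{-1}$ with the constant estimator 17 on a simulated sample:

```python
import numpy as np

# Hypothetical setup (not from the notes): rate theta = 2, sample size n = 200.
rng = np.random.default_rng(0)
theta, n = 2.0, 200

# For the rate parametrization, E(X_i) = 1/theta, so the scale is 1/theta.
x = rng.exponential(scale=1 / theta, size=n)

theta_hat = 1 / x.mean()   # the estimator (X-bar)^{-1}: a function of the data
theta_const = 17.0         # the constant estimator: ignores the data entirely

print(f"true theta = {theta}, (X-bar)^-1 = {theta_hat:.3f}, constant = {theta_const}")
```

The first estimate lands near the true rate, while the constant estimator is far off no matter what data we observe.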

4.1 Likelihood Function

Many procedures in statistical inference involve a mathematical object called the likelihood
function.
Definition of Likelihood
Let $\theta$ be an unknown parameter, and let $X$ be a sample with joint pmf or pdf $f_\theta(x)$. The likelihood function for a particular set of data values $x$ is $L_x(\theta) = f_\theta(x)$.
The function $L_x(\theta)$ is simply a function of $\theta$ (though it may be a different function for different values of $x$). The function itself is not random.
Sometimes we may also need to consider $L_X(\theta)$, which is a random function of $\theta$.
Example 4.1.1: Let $X \sim \mathrm{Bin}(n, \theta)$. If we observe $X = x$, then $L_x(\theta) = \binom{n}{x}\,\theta^x (1-\theta)^{n-x}$. We might also consider the random function $L_X(\theta) = \binom{n}{X}\,\theta^X (1-\theta)^{n-X}$.
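To make the definition concrete, here is a minimal sketch (hypothetical values $n = 10$ and observed $x = 7$, scipy assumed) that evaluates $L_x(\theta)$ on a grid of $\theta$ values:

```python
import numpy as np
from scipy.stats import binom

# Hypothetical data: n = 10 trials with x = 7 observed successes.
n, x = 10, 7

# L_x(theta) = C(n, x) * theta^x * (1 - theta)^(n - x), evaluated on a grid of theta.
theta_grid = np.linspace(0.01, 0.99, 99)
likelihood = binom.pmf(x, n, theta_grid)

# The grid maximizer is close to x/n = 0.7 (foreshadowing Section 4.2).
print(theta_grid[np.argmax(likelihood)])
```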


Interpretation of Likelihood
The likelihood function is essentially the same mathematical object as the joint pmf or pdf,
but its interpretation is different.
For $f_\theta(x)$, we think about fixing a parameter value $\theta$ and allowing $x$ to vary.
For $L_x(\theta)$, we think about fixing a collection of sample values $x$ and allowing $\theta$ to vary.
Since the pmf or pdf is nonnegative, the likelihood must be nonnegative as well.
What Likelihood Is Not
The likelihood is not a pdf (or pmf) of $\theta$ given the data. There are several things wrong with such an interpretation:
In frequentist inference, the unknown parameter $\theta$ is not a random variable, so talking about its distribution makes no sense.
Even in Bayesian inference, the likelihood is still the same mathematical object as the pmf or pdf of the data. Hence, it describes probabilities of observing data values given certain parameter values, not the other way around.
The likelihood does not (in general) sum or integrate to 1 when summing or integrating over $\theta$. In fact, it may sum or integrate to $\infty$, in which case we cannot even rescale it to make it a pdf (or pmf).
Likelihood of Independent Samples
If the sample $X$ consists of iid observations from a common individual pmf or pdf $f_\theta(x)$, then the likelihood of the sample $X$ is simply the product of the likelihoods associated with the individual random variables $X_1, \ldots, X_n$, in which case
$$L_x(\theta) = \prod_{i=1}^n L_{x_i}(\theta) = \prod_{i=1}^n f_\theta(x_i).$$
Log-Likelihood
It is often more convenient to work with the logarithm of the likelihood, $\ell_x(\theta) = \log L_x(\theta)$.
We adopt the convention that $\ell_x(\theta) = -\infty$ when $L_x(\theta) = 0$.
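As a small numerical aside (a sketch with hypothetical iid exponential data and an arbitrary candidate rate $\theta_0$, numpy/scipy assumed), the likelihood of an iid sample is a product of densities and the log-likelihood is a sum of log-densities; the sum is usually much better behaved numerically:

```python
import numpy as np
from scipy.stats import expon

# Hypothetical iid Exp(theta) data with rate theta = 2 (so scale = 1/theta),
# and an arbitrary candidate value theta0 at which to evaluate the likelihood.
rng = np.random.default_rng(1)
x = rng.exponential(scale=0.5, size=50)
theta0 = 1.5

likelihood = np.prod(expon.pdf(x, scale=1 / theta0))        # product of densities
log_likelihood = np.sum(expon.logpdf(x, scale=1 / theta0))  # sum of log-densities

print(likelihood, log_likelihood, np.log(likelihood))       # last two agree
```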

4.2 Maximum Likelihood Estimation

An estimate $\hat{\theta}(x)$ of $\theta$ (with allowed parameter space $\Theta$) is called a maximum likelihood estimate (MLE) of $\theta$ if $\hat{\theta}(x)$ maximizes $L_x(\theta)$, the likelihood of $\theta$, over $\Theta$. An estimator that takes the value of a maximum likelihood estimate for every possible sample $X = x$ is called a maximum likelihood estimator (also MLE). If the MLE is unique, then we can write
$$\hat{\theta}_{\mathrm{MLE}} = \operatorname*{arg\,max}_{\theta \in \Theta} L_X(\theta) = \operatorname*{arg\,max}_{\theta \in \Theta} \ell_X(\theta),$$
noting that maximizing the likelihood is equivalent to maximizing the log-likelihood. Also note that the MLE can only take values within the allowed parameter space $\Theta$.


Existence and Uniqueness


The maximum likelihood estimator need not be unique or even exist. It may be the case that for certain possible samples $X = x$, the likelihood function $L_x(\theta)$ has a non-unique maximum or fails to achieve its maximum altogether. Further discussion of these possibilities can be found in Examples 7.5.8, 7.5.9, and 7.5.10 of DeGroot & Schervish.
Note: We are typically willing to ignore any such problematic samples if they occur with probability zero for all $\theta$.

Finding the MLE


Maximizing $L_x(\theta)$, or equivalently $\ell_x(\theta)$, is just a calculus problem. Typically we find all points in $\Theta$ where the derivative $\partial L_x(\theta)/\partial\theta$ is zero or undefined. Then we identify the global maximum, considering all critical points and boundary points of $\Theta$.
Example 4.2.1: Suppose $X_1, \ldots, X_n \sim$ iid $\mathrm{Poisson}(\theta)$, where $\theta \ge 0$, and we want to find the MLE of $\theta$. The log-likelihood based on the sample $x = (x_1, \ldots, x_n)$ is
$$\ell_x(\theta) = \log\left[\prod_{i=1}^n \frac{\theta^{x_i}\exp(-\theta)}{x_i!}\right] = -n\theta + n\bar{x}\log\theta - \sum_{i=1}^n \log(x_i!).$$
Then
$$\frac{\partial}{\partial\theta}\,\ell_x(\theta) = -n + \frac{n\bar{x}}{\theta} = 0 \quad\Longrightarrow\quad \theta = \bar{x}.$$
Observe that $\theta = \bar{x}$ is the only critical point. Also observe that if $\bar{x} > 0$, then $\ell_x(\theta) \to -\infty$ both as $\theta \to 0$ and as $\theta \to \infty$, so $\hat{\theta} = \bar{x}$ is clearly the maximizer if $\bar{x} > 0$. If instead $\bar{x} = 0$, then $\ell_x(\theta)$ is simply a decreasing function of $\theta$, and so $\hat{\theta} = 0 = \bar{x}$ is again the maximizer. Hence,
$$\ell_x(\bar{x}) = \max_{\theta \ge 0} \ell_x(\theta) \quad\text{for all } x \in (\mathbb{N}_0)^n,$$
where $\mathbb{N}_0 = \{0, 1, 2, \ldots\}$. Thus, the maximum likelihood estimator of $\theta$ is $\hat{\theta}_{\mathrm{MLE}} = \bar{X}$.
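A quick numerical sanity check of this example (a sketch with hypothetical Poisson data; scipy's bounded scalar minimizer is assumed): minimizing the negative log-likelihood numerically should recover the sample mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

# Hypothetical iid Poisson(theta) sample with theta = 3.5.
rng = np.random.default_rng(2)
x = rng.poisson(lam=3.5, size=100)

# Negative log-likelihood, minimized numerically over theta > 0.
neg_loglik = lambda theta: -np.sum(poisson.logpmf(x, mu=theta))
result = minimize_scalar(neg_loglik, bounds=(1e-6, 50), method="bounded")

print(result.x, x.mean())  # the numerical maximizer agrees with x-bar
```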

MLE with Multiple Parameters


The definition of the maximum likelihood estimator still holds if the unknown parameter is really $\theta = (\theta_1, \ldots, \theta_p)$, i.e., if there are multiple unknown parameters. We still find the MLE $\hat{\theta}_{\mathrm{MLE}} = (\hat{\theta}_1, \ldots, \hat{\theta}_p)$ the same way, though the calculus problem may be more complicated.
Example 4.2.2: Let $X_1, \ldots, X_n \sim$ iid $N(\mu, \sigma^2)$, where $\mu \in \mathbb{R}$ and $\sigma^2 > 0$ are both unknown, and we want to find the MLE of both parameters. The likelihood and log-likelihood based on the sample $x = (x_1, \ldots, x_n)$ are
$$L_x(\mu, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(x_i - \mu)^2}{2\sigma^2}\right] = (2\pi\sigma^2)^{-n/2} \exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2\right],$$
$$\ell_x(\mu, \sigma^2) = \log L_x(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2.$$

Differentiating with respect to each parameter yields
$$\frac{\partial}{\partial\mu}\,\ell_x(\mu, \sigma^2) = \frac{1}{\sigma^2}\sum_{i=1}^n x_i - \frac{n\mu}{\sigma^2} = \frac{n}{\sigma^2}(\bar{x} - \mu),$$
$$\frac{\partial}{\partial(\sigma^2)}\,\ell_x(\mu, \sigma^2) = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n (x_i - \mu)^2 = \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n \left[(x_i - \mu)^2 - \sigma^2\right].$$

We now set both partial derivatives equal to zero and solve. First, note that
$$\frac{\partial}{\partial\mu}\,\ell_x(\mu, \sigma^2) = \frac{n}{\sigma^2}(\bar{x} - \mu) = 0 \quad\Longrightarrow\quad \mu = \bar{x}.$$
We can now substitute $\mu = \bar{x}$ into the other partial derivative and set it equal to zero, yielding
$$\frac{\partial}{\partial(\sigma^2)}\,\ell_x(\bar{x}, \sigma^2) = \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n \left[(x_i - \bar{x})^2 - \sigma^2\right] = 0 \quad\Longrightarrow\quad \sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2 = \left(\frac{n-1}{n}\right)S^2.$$

It can be seen from the form of $\ell_x(\mu, \sigma^2)$ that this point is indeed the maximizer, i.e.,
$$\ell_x\!\left[\bar{x}, \left(\frac{n-1}{n}\right)S^2\right] = \max_{\mu \in \mathbb{R},\ \sigma^2 > 0} \ell_x(\mu, \sigma^2). \tag{4.2.1}$$

Note: The result in (4.2.1) holds for all $x \in \mathbb{R}^n$ such that $x_1, \ldots, x_n$ are not all equal. If instead $x_1, \ldots, x_n$ are all equal, then $(n-1)S^2/n = 0$, which is outside the allowed parameter space for $\sigma^2$. However, since $X_1, \ldots, X_n$ are continuous random variables, $P(X_i = X_j) = 0$ for $i \ne j$ regardless of the values of $\mu$ and $\sigma^2$, meaning that this issue arises with probability zero. As mentioned in an earlier note, we are typically willing to ignore problematic samples if they occur with probability zero for all parameter values.

Thus, the maximum likelihood estimators of $\mu$ and $\sigma^2$ are, respectively, $\hat{\mu}_{\mathrm{MLE}} = \bar{X}$ and $(\hat{\sigma}^2)_{\mathrm{MLE}} = (n-1)S^2/n$. Notice that the MLE of the variance is smaller than the usual sample variance $S^2$ by a factor of $(n-1)/n$.
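The following sketch (hypothetical normal data, numpy only) checks these formulas numerically; note that `np.var` divides by $n$ by default, which is exactly $(n-1)S^2/n$, while `ddof=1` gives the usual sample variance $S^2$.

```python
import numpy as np

# Hypothetical iid N(mu, sigma^2) data with mu = 1 and sigma = 2.
rng = np.random.default_rng(3)
x = rng.normal(loc=1.0, scale=2.0, size=500)
n = x.size

mu_mle = x.mean()          # MLE of mu: the sample mean
sigma2_mle = np.var(x)     # divides by n, i.e. the MLE (n-1)S^2/n
s2 = np.var(x, ddof=1)     # the usual sample variance S^2 (divides by n-1)

print(mu_mle, sigma2_mle, (n - 1) * s2 / n)  # the last two values coincide
```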

Example 4.2.3: Let $X$ be an $n \times p$ matrix of known constants (not random variables, despite the capital letter), and let $Y_1, \ldots, Y_n$ be independent random variables with
$$Y_i \sim N\!\left(\sum_{j=1}^p \beta_j x_{ij},\ \sigma^2\right) \quad\text{for each } i \in \{1, \ldots, n\},$$
where $\beta = (\beta_1, \ldots, \beta_p) \in \mathbb{R}^p$ and $\sigma^2 > 0$ are both unknown. Let $Y = (Y_1, \ldots, Y_n)$. It can be shown (see, e.g., Theorem 11.5.1 of DeGroot & Schervish) that if the matrix $X$ has rank $p$, then the maximum likelihood estimators of $\beta$ and $\sigma^2$ are $\hat{\beta}_{\mathrm{MLE}} = (X^T X)^{-1} X^T Y$ and $(\hat{\sigma}^2)_{\mathrm{MLE}} = n^{-1}\|Y - X\hat{\beta}_{\mathrm{MLE}}\|_2^2$, respectively, where $\|u\|_2^2 = \sum_{i=1}^k u_i^2$ for any $u \in \mathbb{R}^k$.
Note: If the rank of $X$ is strictly less than $p$ (which is automatically the case if $n < p$), then the MLE of $\beta$ still exists but is not unique. However, the MLE of $\sigma^2$ does not exist (although we could take it to be zero if we expand the parameter space to $\sigma^2 \ge 0$ and adopt the convention that a normal distribution with variance zero is simply a degenerate distribution).

Observe that $\hat{\beta}_{\mathrm{MLE}}$ is just the ordinary least squares estimator of $\beta$.
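A small numerical sketch of these formulas (hypothetical design matrix and response, numpy only): $\hat{\beta}_{\mathrm{MLE}}$ solves the normal equations $X^T X \beta = X^T Y$, and $(\hat{\sigma}^2)_{\mathrm{MLE}}$ is the average squared residual.

```python
import numpy as np

# Hypothetical design matrix (n = 100, p = 3) and a response generated from it.
rng = np.random.default_rng(4)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
Y = X @ beta_true + rng.normal(scale=1.5, size=n)

# MLE of beta: solve the normal equations X^T X beta = X^T Y.
beta_mle = np.linalg.solve(X.T @ X, X.T @ Y)

# MLE of sigma^2: average squared residual (divides by n, not n - p).
sigma2_mle = np.sum((Y - X @ beta_mle) ** 2) / n

print(beta_mle, sigma2_mle)
```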


Invariance to Reparametrization
The next theorem provides a very convenient property of the maximum likelihood estimator.
Theorem 4.2.4. Let $\hat{\theta}_{\mathrm{MLE}}$ be a maximum likelihood estimator of $\theta$ over the parameter space $\Theta$, and let $g$ be a function with domain $\Theta$ and image $\Psi$. Then $\hat{\psi}_{\mathrm{MLE}} = g(\hat{\theta}_{\mathrm{MLE}})$ is a maximum likelihood estimator of $\psi = g(\theta)$ over the parameter space $\Psi$.
Proof. See the proof of Theorem 7.6.2 in DeGroot & Schervish.
Example 4.2.5: Suppose that in Example 4.2.2, we had taken the second unknown parameter to be the standard deviation $\sigma$ instead of the variance $\sigma^2$. Then the maximum likelihood estimator of $\sigma$ would have simply been
$$\hat{\sigma}_{\mathrm{MLE}} = \sqrt{(\hat{\sigma}^2)_{\mathrm{MLE}}} = \sqrt{\frac{(n-1)S^2}{n}}$$
by Theorem 4.2.4.

Numerical Calculation of Maximum Likelihood Estimates


It is often the case that the maximum likelihood estimator $\hat{\theta}_{\mathrm{MLE}}(X)$ cannot be expressed in closed form as a function of $X$. However, the maximum likelihood estimate $\hat{\theta}_{\mathrm{MLE}}(x)$ for a particular sample $X = x$ can usually still be found numerically.
Example 4.2.6: Let $x_1, \ldots, x_n$ be known constants, and let $Y_1, \ldots, Y_n$ be independent random variables with
$$Y_i \sim \mathrm{Bin}\!\left[1,\ \frac{\exp(\alpha + \beta x_i)}{1 + \exp(\alpha + \beta x_i)}\right] \quad\text{for each } i \in \{1, \ldots, n\},$$
where $\alpha \in \mathbb{R}$ and $\beta \in \mathbb{R}$ are both unknown. (This is often called logistic regression.) The log-likelihood based on the sample $y = (y_1, \ldots, y_n)$ is
$$\ell_y(\alpha, \beta) = \log L_y(\alpha, \beta) = \log \prod_{i=1}^n \left[\frac{\exp(\alpha + \beta x_i)}{1 + \exp(\alpha + \beta x_i)}\right]^{y_i} \left[\frac{1}{1 + \exp(\alpha + \beta x_i)}\right]^{1-y_i}$$
$$= \sum_{i=1}^n y_i(\alpha + \beta x_i) - \sum_{i=1}^n \log[1 + \exp(\alpha + \beta x_i)].$$

Differentiating with respect to $\alpha$ and $\beta$ yields
$$\frac{\partial}{\partial\alpha}\,\ell_y(\alpha, \beta) = \sum_{i=1}^n y_i - \sum_{i=1}^n \frac{\exp(\alpha + \beta x_i)}{1 + \exp(\alpha + \beta x_i)},$$
$$\frac{\partial}{\partial\beta}\,\ell_y(\alpha, \beta) = \sum_{i=1}^n x_i y_i - \sum_{i=1}^n \frac{x_i \exp(\alpha + \beta x_i)}{1 + \exp(\alpha + \beta x_i)}.$$

Setting both partial derivatives above equal to zero yields a system of equations that cannot be solved in closed form. However, we can find a solution numerically to obtain maximum likelihood estimates $\hat{\alpha}_{\mathrm{MLE}}$ and $\hat{\beta}_{\mathrm{MLE}}$ for most samples $y \in \{0, 1\}^n$. (However, no matter what the true values of $\alpha$ and $\beta$ are, there is a nonzero probability of obtaining a sample such that the maximum likelihood estimates do not exist.)
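One way to carry out the numerical maximization (a sketch with hypothetical data; scipy's BFGS minimizer is assumed, applied to the negative log-likelihood, with `np.logaddexp` used for a stable evaluation of $\log[1 + \exp(\cdot)]$):

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical covariates and binary responses from a logistic model
# with alpha = -0.5 and beta = 1.2.
rng = np.random.default_rng(5)
n = 200
x = rng.normal(size=n)
prob = 1 / (1 + np.exp(-(-0.5 + 1.2 * x)))
y = rng.binomial(1, prob)

def neg_loglik(params):
    alpha, beta = params
    eta = alpha + beta * x
    # -l_y(alpha, beta) = -sum_i [ y_i * eta_i - log(1 + exp(eta_i)) ]
    return -np.sum(y * eta - np.logaddexp(0.0, eta))

result = minimize(neg_loglik, x0=np.zeros(2), method="BFGS")
alpha_mle, beta_mle = result.x
print(alpha_mle, beta_mle)
```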


4.3 Estimators that Optimize Other Functions

Sometimes we may want to find an estimator by maximizing or minimizing some real-valued function other than the likelihood. There are many reasons why we might want to do this:
The likelihood itself may be difficult to work with.
We may be unsure of some aspect of our model (e.g., we may not know if the observations are normally distributed).
We may want to favor certain kinds of estimates over others.
An estimator that is found by maximizing or minimizing some real-valued function other than the likelihood is called an M-estimator. Note that the maximum likelihood estimator is a special case of an M-estimator.
Example 4.3.1: In the regression setup of Example 4.2.3, the least squares estimator
$$\hat{\beta}_{\mathrm{LS}} = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p} \|Y - X\beta\|_2^2$$
is an M-estimator of $\beta$. However, $\hat{\beta}_{\mathrm{LS}} = \hat{\beta}_{\mathrm{MLE}}$, so this estimator coincides with the maximum likelihood estimator.

Example 4.3.2: In the regression setup of Example 4.2.3, we could instead consider the least absolute deviation estimator
$$\hat{\beta}_{\mathrm{LAD}} = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p} \|Y - X\beta\|_1,$$
where $\|u\|_1 = \sum_{i=1}^k |u_i|$ for any $u \in \mathbb{R}^k$. Then $\hat{\beta}_{\mathrm{LAD}}$ is another M-estimator.

Example 4.3.3: In the regression setup of Example 4.2.3, we could instead consider the lasso estimator
$$\hat{\beta}_{\mathrm{LASSO}} = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p} \left(\|Y - X\beta\|_2^2 + \lambda\|\beta\|_1\right),$$
where $\lambda > 0$ is some fixed constant.

Reference: Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58, 267-288.

The addition of the term $\lambda\|\beta\|_1$ does several things:
It typically results in an estimate $\hat{\beta}_{\mathrm{LASSO}}$ for which some components are exactly zero.
The nonzero components are typically smaller in absolute value than the corresponding components of $\hat{\beta}_{\mathrm{MLE}}$ (provided the MLE exists and is unique).
$\hat{\beta}_{\mathrm{LASSO}}$ is unique in many cases when $\hat{\beta}_{\mathrm{MLE}}$ is not. For example, $\hat{\beta}_{\mathrm{LASSO}}$ is usually still unique even when $p > n$.
The lasso and other similar methods are often called regularized or penalized regression.
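As one concrete way to compute $\hat{\beta}_{\mathrm{LASSO}}$ (a sketch, not part of the notes; scikit-learn is assumed), note that scikit-learn's `Lasso` minimizes $\frac{1}{2n}\|Y - X\beta\|_2^2 + \alpha\|\beta\|_1$ by coordinate descent, so its `alpha` corresponds to $\lambda/(2n)$ in the notation above.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical sparse regression problem: p = 10 coefficients, only two nonzero.
rng = np.random.default_rng(6)
n, p = 80, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:2] = [3.0, -2.0]
Y = X @ beta_true + rng.normal(size=n)

lam = 20.0  # the lambda in the objective ||Y - X beta||_2^2 + lambda * ||beta||_1
model = Lasso(alpha=lam / (2 * n), fit_intercept=False)
model.fit(X, Y)

print(model.coef_)  # several components are driven exactly to zero
```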
Numerical Calculation of M-Estimates
Even when an M-estimator cannot be expressed in closed form, the corresponding estimate
for a sample X = x can usually still be found numerically.
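For instance, the least absolute deviation estimate of Example 4.3.2 has no closed form, but for small $p$ it can be approximated directly (a sketch with hypothetical data; since the $\ell_1$ objective is convex but not differentiable, a derivative-free method such as Nelder-Mead is used here rather than a gradient-based one).

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical regression data in the setup of Example 4.2.3, with heavy-tailed noise.
rng = np.random.default_rng(7)
n, p = 100, 3
X = rng.normal(size=(n, p))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_t(df=2, size=n)

# Least absolute deviation objective: ||Y - X beta||_1.
lad_objective = lambda beta: np.sum(np.abs(Y - X @ beta))

result = minimize(lad_objective, x0=np.zeros(p), method="Nelder-Mead")
print(result.x)  # approximate LAD estimate
```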
