1 Review
We introduced the method of maximum likelihood for simple linear regression
in the notes for two lectures ago. Let’s review.
We start with the statistical model, which is the Gaussian-noise simple linear
regression model, defined as follows: for each observation i, Yi = β0 + β1 xi + εi,
where the noise terms εi are independent of each other and of the X's, and each
εi ∼ N(0, σ²).
When we see the data, we do not know the true parameters, but any guess
at them, say (b0, b1, s²), gives us a probability density:
\[
\prod_{i=1}^{n} p(y_i \mid x_i; b_0, b_1, s^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi s^2}} \exp\left( -\frac{(y_i - (b_0 + b_1 x_i))^2}{2 s^2} \right)
\]
In multiplying together the probabilities like this, we are using the independence
of the Yi.
This is the likelihood, a function of the parameter values. It’s just as informative,
and much more convenient, to work with the log-likelihood,
\[
\begin{aligned}
L(b_0, b_1, s^2) &= \log \prod_{i=1}^{n} p(y_i \mid x_i; b_0, b_1, s^2) & (1) \\
&= \sum_{i=1}^{n} \log p(y_i \mid x_i; b_0, b_1, s^2) & (2) \\
&= -\frac{n}{2}\log 2\pi - n \log s - \frac{1}{2 s^2} \sum_{i=1}^{n} \left(y_i - (b_0 + b_1 x_i)\right)^2 & (3)
\end{aligned}
\]
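For concreteness, equation (3) can be coded directly; a small R function along the following lines (the name loglik.simple is just illustrative, not from these notes) evaluates the log-likelihood at candidate parameter values:

# Log-likelihood of the Gaussian-noise simple linear regression model, per Eq. (3)
# x, y: data vectors; b0, b1: candidate intercept and slope; s2: candidate noise variance
loglik.simple <- function(x, y, b0, b1, s2) {
  n <- length(y)
  -(n/2)*log(2*pi) - (n/2)*log(s2) - sum((y - (b0 + b1*x))^2)/(2*s2)
}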
As you will recall, the estimators for the slope and the intercept exactly match
the least squares estimators. This is a special property of assuming independent
Gaussian noise. Similarly, σ̂² is exactly the in-sample mean squared error.
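For reference, writing c_XY for the sample covariance of x and y and s²_X for the sample variance of the xi (notation not used above, but convenient here), the closed forms are the familiar least-squares expressions:
\[
\hat{\beta}_1 = \frac{c_{XY}}{s^2_X}, \qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \qquad
\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\right)^2 .
\]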
2 Sampling Distributions
We may seem not to have gained much from the Gaussian-noise assumption,
because our point estimates are just the same as they were from least squares.
What makes the Gaussian noise assumption important is that it gives us an
exact conditional distribution for each Yi , and this in turn gives us a distribution
— the sampling distribution — for the estimators. Remember, from the notes
from last time, that we can write β̂1 and β̂0 in the form “constant plus sum of
noise variables”. For instance,
\[
\hat{\beta}_1 = \beta_1 + \sum_{i=1}^{n} \frac{x_i - \bar{x}}{n s^2_X}\, \epsilon_i
\]
Now, in the Gaussian-noise model, the εi are all independent Gaussians. Therefore,
β̂1 is also Gaussian. Since we worked out its mean and variance last time,
we can just say
\[
\hat{\beta}_1 \sim N\!\left(\beta_1,\; \sigma^2 / n s^2_X\right)
\]
Again, we saw that the fitted value at an arbitrary point x, m̂(x), is a
constant plus a weighted sum of the εi:
\[
\hat{m}(x) = \beta_0 + \beta_1 x + \frac{1}{n} \sum_{i=1}^{n} \left(1 + \frac{(x - \bar{x})(x_i - \bar{x})}{s^2_X}\right) \epsilon_i
\]
Once again, because the εi are independent Gaussians, a weighted sum of them
is also Gaussian, and we can just say
\[
\hat{m}(x) \sim N\!\left(\beta_0 + \beta_1 x,\; \frac{\sigma^2}{n}\left(1 + \frac{(x - \bar{x})^2}{s^2_X}\right)\right)
\]
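These variance formulas translate directly into R. Helpers along the following lines (illustrative names of my own; note that R’s var() uses an n − 1 denominator, a negligible difference from s²_X at reasonable sample sizes) compute the two theoretical standard errors:

# Theoretical standard errors implied by the sampling distributions above
# sigma.sq: noise variance; x: the fixed predictor values; x0: point at which to predict
se.beta1 <- function(sigma.sq, x) {
  sqrt(sigma.sq / (length(x) * var(x)))
}
se.mhat <- function(sigma.sq, x, x0) {
  sqrt((sigma.sq / length(x)) * (1 + (x0 - mean(x))^2 / var(x)))
}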
2.1 Illustration
To make the idea of these sampling distributions more concrete, I present a small
simulation. Figure 1 provides code which simulates a particular Gaussian-noise
linear model: β0 = 5, β1 = −2, σ² = 3, with twenty X’s initially randomly
drawn from an exponential distribution, but thereafter held fixed through all
the simulations. The theory above lets us calculate just what the distribution
of β̂1 should be, in repeated simulations, and the distribution of m̂(−1). (By
construction, we have no observations where x = −1; this is an example of using
the model to extrapolate beyond the data.) Figure 2 compares the theoretical
sampling distributions to what we actually get by repeated simulation, i.e., by
repeating the experiment.
# Fix x values for all runs of the simulation; draw from an exponential
n <- 20 # So we don't have magic #s floating around
beta.0 <- 5
beta.1 <- -2
sigma.sq <- 3
fixed.x <- rexp(n=n)
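A minimal definition of sim.lin.gauss, consistent with how it is called below (it must return a fitted lm object, with predictor named x, when model=TRUE), could look like this; the argument names here are illustrative assumptions, not fixed by the notes:

# Simulate once from the Gaussian-noise model at the fixed x's.
# If model=TRUE, return the least-squares fit; otherwise return the simulated data frame.
sim.lin.gauss <- function(x = fixed.x, intercept = beta.0, slope = beta.1,
                          noise.variance = sigma.sq, model = FALSE) {
  y <- intercept + slope * x + rnorm(length(x), mean = 0, sd = sqrt(noise.variance))
  df <- data.frame(x = x, y = y)
  if (model) { lm(y ~ x, data = df) } else { df }
}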
[Figure 2: histograms of β̂1 (top) and m̂(−1) (bottom), on the density scale, from 10^4 simulation runs, with the theoretical Gaussian sampling densities overlaid in blue; produced by the code below.]
par(mfrow=c(2,1))
# Sampling distribution of the slope estimate over 10^4 simulated data sets
slope.sample <- replicate(1e4, coefficients(sim.lin.gauss(model=TRUE))["x"])
hist(slope.sample,freq=FALSE,breaks=50,xlab=expression(hat(beta)[1]),main="")
# Overlay the theoretical N(beta.1, sigma.sq/(n*s^2_X)) density
curve(dnorm(x,-2,sd=sqrt(3/(n*var(fixed.x)))), add=TRUE, col="blue")
# Sampling distribution of the fitted value at x = -1
pred.sample <- replicate(1e4, predict(sim.lin.gauss(model=TRUE),
                                      newdata=data.frame(x=-1)))
hist(pred.sample, freq=FALSE, breaks=50, xlab=expression(hat(m)(-1)),main="")
# Overlay the theoretical density of m-hat(-1)
curve(dnorm(x, mean=beta.0+beta.1*(-1),
            sd=sqrt((sigma.sq/n)*(1+(-1-mean(fixed.x))^2/var(fixed.x)))),
      add=TRUE,col="blue")
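As a quick numerical check (not part of the original figure), the simulated standard deviations should come out close to the theoretical standard errors:

# Empirical standard deviations from the simulation vs. theoretical standard errors
sd(slope.sample); sqrt(sigma.sq/(n*var(fixed.x)))
sd(pred.sample); sqrt((sigma.sq/n)*(1+(-1-mean(fixed.x))^2/var(fixed.x)))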
3 Virtues and Limitations of Maximum Likelihood
The method of maximum likelihood does not always work; there are models
where it gives poor or even pathological estimates. For Gaussian-noise linear
models, however, it actually works very well. Indeed, in more advanced statistics
classes, one proves that for such models, as for many other “regular” statistical
models, maximum likelihood is asymptotically efficient, meaning that its
parameter estimates converge on the truth as quickly as possible.² This is on
top of having exact sampling distributions for the estimators.
Of course, all these wonderful abilities come at a cost, which is the Gaussian
noise assumption. If that is wrong, then so are the sampling distributions I
gave above, and so are the inferential calculations which rely on those sampling
distributions. Before we begin to do those inferences on any particular data
set, and especially before we begin to make grand claims about the world on the
basis of those inferences, we should really check all those modeling assumptions.
That, however, brings us into the topics for next week.
Exercises
To think through or to practice on, not to hand in.
Suppose that the noise terms in the simple linear regression model are drawn
independently from a t distribution with ν degrees of freedom and scale σ,
rather than from a Gaussian, so that the model has four parameters: β0, β1, σ
and ν.
(a) Write down the log-likelihood function. Use an explicit formula for
the density of the t distribution.
² Very roughly: writing θ for the true parameter, θ̂ for the MLE, and θ̃ for any other consistent
estimator, asymptotic efficiency means $\lim_{n\to\infty} E\left[n \lVert \hat{\theta} - \theta \rVert^2\right] \leq \lim_{n\to\infty} E\left[n \lVert \tilde{\theta} - \theta \rVert^2\right]$.
(This way of formulating it takes it for granted that the MSE of estimation goes to zero like
1/n, but it typically does in parametric problems.) For more precise statements, see, for
instance, ?, ? or ?.
(b) Find the derivatives of this log-likelihood with respect to the four
parameters β0, β1, σ (or σ², if more convenient) and ν. Simplify as
much as possible. (It is legitimate to use derivatives of the gamma
function here, since that’s another special function.)
(c) Can you solve for the maximum likelihood estimators of β0 and β1
without knowing σ and ν? If not, why not? If you can, do they
match the least-squares estimators again? If they don’t match, how
do they differ?
(d) Can you solve for the MLE of all four parameters at once? (Again,
you may have to express your answer in terms of the gamma function
and its derivatives.)