HW3
1. (10 points) Suppose we have data $(X_1, Y_1), \ldots, (X_n, Y_n)$. We fit a linear model using
least squares to get $\hat{\beta}_0$ and $\hat{\beta}_1$. Let $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$. (We do not assume that the linear
model is true.) Prove that
$$\frac{1}{n}\sum_{i=1}^{n} \hat{Y}_i = \frac{1}{n}\sum_{i=1}^{n} Y_i .$$
2. (40 points) Assume the linear model with Normal errors. In other words, assume that
$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$$
where $\epsilon_1, \ldots, \epsilon_n \sim N(0, \sigma^2)$.
(a) (20) Let $\hat{\sigma}^2 = \frac{1}{n-2}\sum_{i=1}^{n} e_i^2$ where $e_i = Y_i - \hat{Y}_i$. Show that this estimator is unbiased.
In other words, show that $E[\hat{\sigma}^2] = \sigma^2$.
Hint: You may use this fact that we stated in class:
$$\frac{\sum_{i=1}^{n} e_i^2}{\sigma^2} \sim \chi^2_{n-2} .$$
(b) (20) We saw that the maximum likelihood estimator of $\sigma^2$ is
$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} e_i^2 .$$
What is the amount of bias in this estimator? What happens to the bias as the sample
size n becomes large?
Note: Recall (from 36-226) that “amount of bias” is defined as the difference between
the expected value of an estimator and the true parameter value. Of course, for unbi-
ased estimators, this will be zero.
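The definition in the note can be written compactly as follows. The symbols $\hat{\theta}$ (a generic estimator) and $\theta$ (the parameter it targets) are notation assumed here for illustration and are not used elsewhere in this assignment:

$$\operatorname{bias}(\hat{\theta}) = E[\hat{\theta}] - \theta .$$

In part (b), $\theta$ is $\sigma^2$ and $\hat{\theta}$ is the maximum likelihood estimator.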
3. (50 points) Urban economies.
This question will help you prepare for writing full reports on the results of statistical
analyses. Hence, when asked to explain, comment, or discuss, use complete sentences
and answer thoroughly in the context of the problem for full credit. Use R when
appropriate for the homework problems. You do not need to hand in code, but you do
need to hand in graphs and a writeup of the R output. Figures should be clearly
labeled and readable.
The data file http://www.stat.cmu.edu/~larry/=stat401/bea-2006.csv contains
information about the economies of the 366 “metropolitan statistical areas” (≈ cities)
of the US in 2006. In particular, it lists, for each city, the population, the total value
of all goods and services produced for sale in the city that year per person (“per capita
gross metropolitan product”, pcgmp), and the share of economic output coming from
four selected industries.
(a) (2) Load the data file and verify that it has 366 rows and 7 columns. Why should
it have seven columns, when the paragraph above described only six variables?
(b) (2) Calculate summary statistics for the six numerical-valued columns.
(c) (6) Make univariate EDA plots for population and for per-capita GMP, and de-
scribe their distributions in words. (Use the commands hist and boxplot.)
(d) (6) Make a bivariate EDA plot for per-capita GMP as a function of population.
Describe the relationship in words.
(e) (5) Using only the functions mean, var, cov, sum and arithmetic, calculate the
slope and intercept of the least-squares regression line.
(f) (3) What are the slope and intercept returned by the function lm? Do they agree
with your answers in the previous part? Should they?
(g) (4) Add both lines to the bivariate EDA plot. (Add only one line, of course, if
you think they are the same.) Comment on the fit. Do the assumptions of the
simple linear regression model appear to hold? Are there any places where the fit
seems better than others?
(h) (4) Find Pittsburgh in the data set. What is the population? The per-capita
GMP? The per-capita GMP predicted by the model? The residual for Pittsburgh?
(i) (2) What is the mean squared error of the regression? That is, what is $n^{-1}\sum_{i=1}^{n} e_i^2$?