Econ5025 Practice Problems
$$\widehat{GPA} = \hat{\beta}_0 + \hat{\beta}_1 ACT$$
b. Compute the fitted values and residuals for each observation, and verify that
the residuals (approximately) sum to zero.
c. What is the predicted value of GPA when ACT = 20?
d. How much of the variation in GPA for these eight students is explained by
ACT? Explain.
4. The data set BWGHT.RAW contains data on births to women in the United
States. Two variables of interest are the dependent variable, infant birth weight in
ounces (bwght) and an explanatory variable, average number of cigarettes the
mother smoked per day during pregnancy (cigs). The following simple regression
was estimated using data on n = 1388 births:
$$\widehat{cons} = \hat{\beta}_0 + \hat{\beta}_1 inc$$
the (estimated) marginal propensity to consume (MPC) out of income is simply the
slope, $\hat{\beta}_1$, and
$$\widehat{cons}/inc = \hat{\beta}_0/inc + \hat{\beta}_1.$$
Using observations for 100 families on annual income and consumption (both
measured in dollars), the following equation is obtained:
$$\widehat{colgpa} = 1.392 - .0135\,hsperc + .00148\,sat$$
$$n = 4{,}137, \quad R^2 = .273,$$
where colgpa is measured on a four-point scale, hsperc is percentile in the high
school graduating class (defined so that, for example, hsperc = 5 means the top five
percent of the class), and sat is the combined math and verbal scores on the
student achievement test.
a. Why does it make sense for the coefficient on hsperc to be negative?
b. What is the predicted college GPA when hsperc = 20 and sat = 1050?
c. Suppose that two high school graduates, A and B, graduated in the same
percentile from high school, but student A's SAT score was 140 points
higher (about one standard deviation in the sample). What is the predicted
difference in college GPA for these two students?
d. Holding hsperc fixed, what difference in SAT scores leads to a predicted
colgpa difference of 0.50, or one-half of a grade point?
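Parts (b) through (d) only require plugging values into the fitted equation. The sketch below assumes the estimated equation is colgpa-hat = 1.392 - .0135 hsperc + .00148 sat; the function name `predict_colgpa` is introduced here for illustration and does not come from the problem set.

```python
# Assumed fitted equation: colgpa-hat = 1.392 - .0135*hsperc + .00148*sat
B0, B_HSPERC, B_SAT = 1.392, -0.0135, 0.00148


def predict_colgpa(hsperc, sat):
    """Predicted college GPA from the fitted equation."""
    return B0 + B_HSPERC * hsperc + B_SAT * sat


# (b) predicted GPA at hsperc = 20, sat = 1050
gpa_b = predict_colgpa(20, 1050)

# (c) predicted difference for a 140-point SAT gap, hsperc held fixed
gap_c = B_SAT * 140

# (d) SAT gap needed for a 0.50 difference in predicted GPA
sat_gap_d = 0.50 / B_SAT

print(round(gpa_b, 3), round(gap_c, 3), round(sat_gap_d, 1))
```

Because the model is linear, the answers to (c) and (d) do not depend on the value at which hsperc is held fixed.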
2. The data in WAGE2.RAW on working men was used to estimate the following
equation:
$$\widehat{educ} = 10.36 - .094\,sibs + .131\,meduc + .210\,feduc$$
$$n = 722, \quad R^2 = .214,$$
where educ is years of schooling, sibs is number of siblings, meduc is mother's
years of schooling, and feduc is father's years of schooling.
a. Does sibs have the expected effect? Explain. Holding meduc and feduc
fixed, by how much does sibs have to increase to reduce predicted years of
education by one year? (A noninteger answer is acceptable here.)
b. Discuss the interpretation of the coefficient on meduc.
c. Suppose that Man A has no siblings, and his mother and father each have
12 years of education. Man B has no siblings, and his mother and father
each have 16 years of education. What is the predicted difference in years
of education between B and A?
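The arithmetic behind parts (a) and (c) can be sketched as follows, assuming the fitted equation is educ-hat = 10.36 - .094 sibs + .131 meduc + .210 feduc. The variable names are illustrative only.

```python
# Assumed slope estimates from the fitted equation
b_sibs, b_meduc, b_feduc = -0.094, 0.131, 0.210

# (a) increase in sibs that lowers predicted education by one year
sibs_needed = -1 / b_sibs

# (c) B's parents each have 4 more years of schooling than A's; sibs is equal
educ_gap = (b_meduc + b_feduc) * (16 - 12)

print(round(sibs_needed, 2), round(educ_gap, 3))
```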
3. The median starting salary for new law school graduates is determined by
$$\log(salary) = \beta_0 + \beta_1 LSAT + \beta_2 GPA + \beta_3 \log(libvol) + \beta_4 \log(cost) + \beta_5 rank + u,$$
where LSAT is the median LSAT score for the graduating class, GPA is the median
college GPA for the class, libvol is the number of volumes in the law school library,
cost is the annual cost of attending law school, and rank is a law school ranking (with
rank=1 being the best).
a. Explain why we expect $\beta_5 \le 0$.
b. What signs do you expect for the other slope parameters? Justify your
answers.
c. Using the data in LAWSCH85.RAW, the estimated equation is
$$\widehat{\log(salary)} = 8.34 + .0047\,LSAT + .248\,GPA + .095\,\log(libvol) + .038\,\log(cost) - .0033\,rank$$
$$n = 136, \quad R^2 = .842$$
What is the predicted ceteris paribus difference in salary for schools with a
median GPA different by one point? (Report your answer as a percentage.)
d. Interpret the coefficient on the variable log(libvol).
e. Would you say it is better to attend a better ranked law school? How much
is a difference in ranking of 20 worth in terms of predicted starting salary?
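For parts (c) and (e), a coefficient in a log-level equation translates into a percentage change. The sketch below assumes the slope estimates .248 on GPA and -.0033 on rank, and shows both the usual approximation (100 times the change in the log) and the exact figure from exponentiating. The helper `pct_change` is a name chosen here, not part of the problem set.

```python
import math

b_gpa, b_rank = 0.248, -0.0033  # assumed estimates from the fitted equation


def pct_change(delta_log):
    """Exact percentage change implied by a change in log(salary)."""
    return 100 * (math.exp(delta_log) - 1)


# (c) one-point difference in median GPA
gpa_approx = 100 * b_gpa
gpa_exact = pct_change(b_gpa)

# (e) a school ranked 20 places better has rank lower by 20
delta_log_rank = b_rank * (-20)
rank_approx = 100 * delta_log_rank
rank_exact = pct_change(delta_log_rank)

print(round(gpa_approx, 1), round(gpa_exact, 1))
print(round(rank_approx, 1), round(rank_exact, 1))
```

The gap between the approximate and exact figures is small for the rank effect and more noticeable for the larger GPA effect.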
4. In a study relating college grade point average to time spent in various activities,
you distribute a survey to several students. The students are asked how many
hours they spend each week in four activities: studying, sleeping, working, and
leisure. Any activity is put into one of the four categories, so that for each student,
the sum of hours in the four activities must be 168.
a. In the model
$$GPA = \beta_0 + \beta_1 study + \beta_2 sleep + \beta_3 work + \beta_4 leisure + u,$$
does it make sense to hold sleep, work, and leisure fixed, while changing
study?
b. Explain why this model violates Assumption MLR.3.
c. How could you reformulate the model so that its parameters have a useful
interpretation and it satisfies Assumption MLR.3?
5. Consider the multiple regression model containing three independent variables,
under Assumptions MLR.1 through MLR.4:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u.$$
You are interested in estimating the sum of the parameters on $x_1$ and $x_2$; call this
$\theta_1 = \beta_1 + \beta_2$.
a. Show that $\hat{\theta}_1 = \hat{\beta}_1 + \hat{\beta}_2$ is an unbiased estimator of $\theta_1$.
b. Find $Var(\hat{\theta}_1)$ in terms of $Var(\hat{\beta}_1)$, $Var(\hat{\beta}_2)$, and $Corr(\hat{\beta}_1, \hat{\beta}_2)$.
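One way to organize both parts, sketched under the assumption that $\hat{\theta}_1 = \hat{\beta}_1 + \hat{\beta}_2$ and that the OLS estimators are unbiased under MLR.1 through MLR.4 (all expectations and variances are conditional on the sample values of the regressors):

```latex
\begin{align*}
E(\hat{\theta}_1) &= E(\hat{\beta}_1) + E(\hat{\beta}_2) = \beta_1 + \beta_2 = \theta_1, \\
Var(\hat{\theta}_1) &= Var(\hat{\beta}_1) + Var(\hat{\beta}_2)
  + 2\,Cov(\hat{\beta}_1, \hat{\beta}_2) \\
&= Var(\hat{\beta}_1) + Var(\hat{\beta}_2)
  + 2\,Corr(\hat{\beta}_1, \hat{\beta}_2)
    \sqrt{Var(\hat{\beta}_1)\,Var(\hat{\beta}_2)}.
\end{align*}
```

The last line uses only the definition of correlation, $Corr = Cov / (sd \cdot sd)$.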
6. Which of the following can cause OLS estimators to be biased?
a. Heteroskedasticity.
b. Omitting an important variable.
c. A sample correlation coefficient of .95 between two independent variables
both included in the model.
7. Suppose that average worker productivity at manufacturing firms (avgprod)
depends on two factors, average hours of training (avgtrain) and average worker
ability (avgabil):
$$avgprod = \beta_0 + \beta_1 avgtrain + \beta_2 avgabil + u.$$
Assume that this equation satisfies MLR.1 through MLR.4. If grants have been
given to firms whose workers have less than average ability, so that avgtrain and
avgabil are negatively correlated, what is the likely bias in $\tilde{\beta}_1$ obtained from the
simple regression of avgprod on avgtrain? (Use terminology such as
upward bias, downward bias, or biased toward zero.)
8. Suppose that you are interested in estimating the ceteris paribus relationship
between y and $x_1$. For this purpose, you can collect data on two control variables,
$x_2$ and $x_3$. (For concreteness, you might think of y as final exam score, $x_1$ as class
attendance, $x_2$ as GPA up to the previous semester, and $x_3$ as SAT or ACT score.)
Let $\tilde{\beta}_1$ be the simple regression estimate from y on $x_1$ and let $\hat{\beta}_1$ be the multiple
regression estimate from y on $x_1$, $x_2$, $x_3$.
a. If $x_1$ is highly correlated with $x_2$ and $x_3$ in the sample, and $x_2$ and $x_3$ have
large partial effects on y, would you expect $\tilde{\beta}_1$ and $\hat{\beta}_1$ to be similar or
very different? Explain.
b. If $x_1$ is almost uncorrelated with $x_2$ and $x_3$, but $x_2$ and $x_3$ are highly
correlated, will $\tilde{\beta}_1$ and [...] $se(\hat{\beta}_1)$ to be smaller? Explain.
d. If $x_1$ is almost uncorrelated with $x_2$ and $x_3$, and $x_2$ and $x_3$ have large partial
effects on y, and $x_2$ and $x_3$ are highly correlated, would you expect $se(\tilde{\beta}_1)$
or $se(\hat{\beta}_1)$ to be smaller? Explain.
9. Suppose the population model is
$$y = \beta_0 + \beta_1 x + u.$$
The key condition needed for OLS to consistently estimate the $\beta_j$ is that the error
term has mean zero and is uncorrelated with the regressor:
$$E(u) = 0, \quad E(xu) = 0.$$
Show that the zero conditional mean assumption $E(u \mid x) = 0$ is stronger than the
above condition. (In fact, given the zero conditional mean assumption, you can
show that the error term is uncorrelated with any function of x.)
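A sketch of the argument, assuming $E(u \mid x) = 0$ and that the relevant moments exist, uses the law of iterated expectations; $g(x)$ stands for an arbitrary function of x and is notation introduced here:

```latex
\begin{align*}
E(u) &= E\big[E(u \mid x)\big] = E(0) = 0, \\
E(xu) &= E\big[E(xu \mid x)\big] = E\big[x\,E(u \mid x)\big] = 0, \\
E\big[g(x)\,u\big] &= E\big[g(x)\,E(u \mid x)\big] = 0 .
\end{align*}
```

The converse fails in general: $E(u) = 0$ and $E(xu) = 0$ rule out only linear dependence of u on x, so they do not imply $E(u \mid x) = 0$.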
10. Derivations related to OLS estimators
a. Deriving OLS estimator for a simple regression (p.29)
b. Show that $\bar{\hat{y}} = \bar{y}$
c. Show that $\sum_{i=1}^{n} \hat{u}_i \hat{y}_i = 0$
d. Show that SST = SSE + SSR (page 39)
e. Partialling out interpretation of multiple regression
Suppose the population regression is
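$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + u_i$$
Claim: [...] $\hat{r}_{i1}$
Step 2: regress $y_i$ on $\hat{r}_{i1}$ with an intercept
$$y_i = \gamma_0 + \gamma_1 \hat{r}_{i1} + e_i$$
then we claim:
$$\hat{\gamma}_1 = \hat{\beta}_1$$
where
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} \hat{r}_{i1} y_i}{\sum_{i=1}^{n} \hat{r}_{i1}^2}$$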
According to (2.19) on page 29, for the simple regression in step 2, we have
$$\hat{\gamma}_1 = \frac{\sum_{i=1}^{n} (\hat{r}_{i1} - \bar{\hat{r}}_1)(y_i - \bar{y})}{\sum_{i=1}^{n} (\hat{r}_{i1} - \bar{\hat{r}}_1)^2}$$
Show that
$$\frac{\sum_{i=1}^{n} (\hat{r}_{i1} - \bar{\hat{r}}_1)(y_i - \bar{y})}{\sum_{i=1}^{n} (\hat{r}_{i1} - \bar{\hat{r}}_1)^2} = \frac{\sum_{i=1}^{n} \hat{r}_{i1} y_i}{\sum_{i=1}^{n} \hat{r}_{i1}^2}$$
(you need $\sum_{i=1}^{n} \hat{r}_{i1} = 0$, thus $\bar{\hat{r}}_1 = 0$)
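The partialling-out result can be checked numerically. The sketch below uses a small made-up data set with two regressors (the numbers are hypothetical and chosen only for illustration) and a hand-written least-squares routine, so it needs no external libraries. It compares the coefficient on x1 from the multiple regression with the slope obtained from the residuals of x1 regressed on x2.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def ols(columns, y):
    """OLS coefficients of y on an intercept plus the given regressor columns."""
    X = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(X[0])
    XtX = [[sum(row[a] * row[c] for row in X) for c in range(k)] for a in range(k)]
    Xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(k)]
    return solve(XtX, Xty)


# Hypothetical data, for illustration only.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 9.0]
y = [3.1, 3.9, 7.2, 7.8, 11.1, 11.9, 15.2, 17.0]

# Multiple regression of y on x1 and x2; beta[1] is the coefficient on x1.
beta = ols([x1, x2], y)

# Step 1: residuals from regressing x1 on x2 (with an intercept).
d0, d1 = ols([x2], x1)
r1 = [a - d0 - d1 * c for a, c in zip(x1, x2)]

# Step 2: slope from regressing y on the residuals r1.
slope = sum(r * yi for r, yi in zip(r1, y)) / sum(r * r for r in r1)

print(beta[1], slope)
```

The two printed numbers agree up to floating-point error, and the residuals `r1` sum to zero, which is the fact the derivation relies on.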
Show that [...] $\hat{\beta}_1$ and $\hat{\beta}_2$, respectively.
Based on the above regression results, verify that
$$\tilde{\beta}_1 = \hat{\beta}_1 + \hat{\beta}_2 \tilde{\delta}_1.$$
C3.6 The data in MEAP93.RAW are used to estimate the following regression.
a. I estimate the model
$$math10 = \beta_0 + \beta_1 \log(expend) + \beta_2 lnchprg + u,$$
Report the SRF, including the sample size and R-squared.
b. What do you make of the intercept in (a)? In particular, does it make sense to set
the two explanatory variables to zero? [Hint: Recall that log(1)=0.]
c. Now I run the simple regression of math10 on log(expend), and compare the
slope coefficient with the estimate obtained in (a). Is the estimated spending
effect now larger or smaller than in (a)?
d. Report the correlation between lexpend = log(expend) and lnchprg. Does its
sign make sense to you?
e. Use (d) to explain your findings in (c).
C3.7 I use the data in DISCRIM.RAW for this question. These are zip code-level data on
prices for various items at fast-food restaurants, along with characteristics of the zip code
population, in New Jersey and Pennsylvania. The idea is to see whether fast-food
restaurants charge higher prices in areas with a larger concentration of blacks.
a. Report the sample mean of prpblck and income, along with their standard
deviations. Can you deduce the units of measurement of prpblck and income?
b. Consider a model to explain the price of soda, psoda, in terms of the
proportion of the population that is black and median income:
$$psoda = \beta_0 + \beta_1 prpblck + \beta_2 income + u$$
Report the SRF, including the sample size and R-squared. Interpret the
coefficient on prpblck. Do you think the effect of prpblck on price of soda is
economically large (Comparing two hypothetical communities, one with
100% white and the other with 100% black)?
c. Compare the estimate from (b) with the simple regression estimate from
psoda on prpblck. Is the discrimination effect larger or smaller when you
control for income?
d. A model with constant price elasticity with respect to income may be more
appropriate. Report estimates of the model
$$\log(psoda) = \beta_0 + \beta_1 prpblck + \beta_2 \log(income) + u$$
If prpblck increases by .20 (20 percentage points), what is the estimated
percentage change in psoda?
e. Now add the variable prppov to the regression in (d). What happens to
$\hat{\beta}_{prpblck}$?
f. Report the correlation between log(income) and prppov. Is it roughly what
you expected?
g. Evaluate the following statement: "Because log(income) and prppov are so
highly correlated, they have no business being in the same regression."
Chapter 4
1. Consider an equation to explain salaries of CEOs in terms of annual firm sales,
return on equity (roe, in percentage), and return on the firm's stock (ros, in
percentage):
$$\log(salary) = \beta_0 + \beta_1 \log(sales) + \beta_2 roe + \beta_3 ros + u.$$
a. State the null hypothesis that, after controlling for sales and roe, ros has
no effect on CEO salary. State the alternative that better stock market
performance (higher ros) increases a CEO's salary.
b. Using the data in CEOSAL1.RAW, the following SRF was obtained by
OLS:
$$\widehat{\log(salary)} = \underset{(.32)}{4.32} + \underset{(.035)}{.280}\,\log(sales) + \underset{(.0041)}{.0174}\,roe + \underset{(.00054)}{.00024}\,ros$$
$$n = 209, \quad R^2 = .283.$$
What is the effect of ros on the predicted salary if ros increases by 50
percentage points? Does ros have a practically large effect on salary?
c. Test the null hypothesis that ros has no effect on salary against the
alternative that ros has a positive effect. Carry out the test at the 10%
significance level.
d. Would you include ros in a final model explaining CEO compensation in
terms of firm performance? Explain.
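Parts (b) and (c) reduce to two short calculations, sketched below under the assumption that the ros coefficient is .00024 with standard error .00054. The one-sided 10% critical value of about 1.28 is the standard normal approximation, which is an assumption made here because the degrees of freedom (209 - 4 = 205) are large.

```python
b_ros, se_ros = 0.00024, 0.00054  # assumed estimate and standard error

# (b) approximate percentage change in salary when ros rises by 50 points
pct_effect = 100 * b_ros * 50

# (c) t statistic for H0: beta_ros = 0 against H1: beta_ros > 0
t_ros = b_ros / se_ros
CRIT_10_ONE_SIDED = 1.28  # assumed critical value (normal approximation)
reject = t_ros > CRIT_10_ONE_SIDED

print(round(pct_effect, 2), round(t_ros, 3), reject)
```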
2. The variable rdintens is expenditures on research and development (R&D) as a
percentage of sales. Sales are measured in millions of dollars. The variable
profmarg is profits as a percentage of sales.
Using the data in RDCHEM.RAW for 32 firms in the chemical industry, the
following equation is estimated:
$$\widehat{rdintens} = \underset{(1.369)}{.472} + \underset{(.216)}{.321}\,\log(sales) + \underset{(.046)}{.050}\,profmarg$$
$$n = 32, \quad R^2 = .099.$$
a. Interpret the coefficient on log(sales). In particular, if sales increases by 10%,
what is the estimated effect on rdintens? Is this an economically large effect?
b. Test the hypothesis that R&D intensity does not change with sales against the
alternative that it does increase with sales. Do the test at the 5% and 10%
levels.
c. Interpret the coefficient on profmarg. Is it economically large?
d. Does profmarg have a statistically significant effect on rdintens?
3. Are rent rates influenced by the student population in a college town? Let rent be
the average monthly rent paid on rental units in a college town in the United
States. Let pop denote the total city population, avginc the average city income,
and pctstu the student population as a percentage of the total population. One
model to test for a relationship between rent rates and percentage of students in
overall population is
$$\log(rent) = \beta_0 + \beta_1 \log(pop) + \beta_2 \log(avginc) + \beta_3 pctstu + u.$$
a. State the null hypothesis that size of the student body relative to the
population has no ceteris paribus effect on monthly rents. State the
alternative that there is an effect.
b. What signs do you expect for $\beta_1$ and $\beta_2$?
c. The equation estimated using 1990 data from RENTAL.RAW for 64
college towns is
$$\widehat{\log(rent)} = \underset{(.844)}{.043} + \underset{(.039)}{.066}\,\log(pop) + \underset{(.081)}{.507}\,\log(avginc) + \underset{(.0017)}{.0056}\,pctstu$$
$$n = 64, \quad R^2 = .458.$$
What is wrong with the statement: "A 10% increase in population is
associated with about a 6.6% increase in rent"?
d. Test the hypothesis stated in (a) at the 1% level.
4. Consider the estimated equation from Example 4.3, which can be used to study
the effect of skipping class on college GPA:
$$\widehat{colGPA} = \underset{(.33)}{1.39} + \underset{(.094)}{.412}\,hsGPA + \underset{(.011)}{.015}\,ACT - \underset{(.026)}{.083}\,skipped$$
$$n = 141, \quad R^2 = .234$$
a. Find the 95% confidence interval for $\beta_{hsGPA}$.
b. Can you reject the null hypothesis $H_0: \beta_{hsGPA} = .4$ against the two-
sided alternative at the 5% level?
c. Can you reject the null hypothesis $H_0: \beta_{hsGPA} = 1$ against the two-sided
alternative at the 5% level?
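All three parts follow from one confidence interval. The sketch below assumes the hsGPA estimate is .412 with standard error .094 and uses a critical value of about 1.98 for 141 - 4 = 137 degrees of freedom; that critical value is an approximate table figure, stated here as an assumption.

```python
b_hsgpa, se_hsgpa = 0.412, 0.094  # assumed estimate and standard error
T_CRIT = 1.98                     # assumed 5% two-sided critical value, df = 137

ci_low = b_hsgpa - T_CRIT * se_hsgpa
ci_high = b_hsgpa + T_CRIT * se_hsgpa


def rejected(null_value):
    """A two-sided 5% test rejects exactly when the null value lies outside the CI."""
    return not (ci_low <= null_value <= ci_high)


print(round(ci_low, 3), round(ci_high, 3), rejected(0.4), rejected(1.0))
```

Reading the tests off the interval avoids computing a separate t statistic for each hypothesized value.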
5. In section 4.5, we used as an example testing the rationality of assessments of
housing prices. There, we used a log-log model in price and assess [see equation
(4.47)]. Here, we use a level-level specification.
a. In the simple regression model
$$price = \beta_0 + \beta_1 assess + u,$$
the assessment is rational if $\beta_1 = 1$ and $\beta_0 = 0$. The estimated equation is
$$\widehat{price} = \underset{(16.27)}{-14.47} + \underset{(.049)}{.976}\,assess$$
$$n = 88, \quad SSR = 165{,}644.51, \quad R^2 = .820$$
First, test the hypothesis that $H_0: \beta_0 = 0$ against a two-sided
alternative. Then, test $H_0: \beta_1 = 1$ against a two-sided alternative. What
do you conclude?
b. To test the joint hypothesis that $\beta_0 = 0$ and $\beta_1 = 1$, we need the SSR in the
restricted model. This amounts to computing $\sum_{i=1}^{n} (price_i - assess_i)^2$, where
n = 88, since the residuals in the restricted model are just $price_i - assess_i$.
(No estimation is needed for the restricted model because both parameters
are specified under $H_0$.) This turns out to yield SSR = 209,448.99. Carry
out the F test for the joint hypothesis. Is the null hypothesis rejected at the
1% level?
c. Now, test $H_0: \beta_2 = 0$, $\beta_3 = 0$, and $\beta_4 = 0$ in the model
$$price = \beta_0 + \beta_1 assess + \beta_2 lotsize + \beta_3 sqrft + \beta_4 bdrms + u.$$
The R-squared from estimating this model using the same 88 houses is
.829. Can we reject the null hypothesis at the 10% level?
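The two F statistics in parts (b) and (c) can be computed directly from the reported quantities. This sketch uses the SSR form for (b) and the R-squared form for (c). The critical values (about 4.86 for F with 2 and 86 degrees of freedom at 1%, and about 2.15 for F with 3 and 83 degrees of freedom at 10%) are approximate table values and are assumptions, not figures from the problem set.

```python
def f_from_ssr(ssr_r, ssr_ur, q, df_ur):
    """F statistic from restricted and unrestricted sums of squared residuals."""
    return ((ssr_r - ssr_ur) / q) / (ssr_ur / df_ur)


def f_from_r2(r2_ur, r2_r, q, df_ur):
    """F statistic from unrestricted and restricted R-squareds."""
    return ((r2_ur - r2_r) / q) / ((1 - r2_ur) / df_ur)


# (b) H0: beta0 = 0 and beta1 = 1; unrestricted model has 88 - 2 = 86 df
f_b = f_from_ssr(209448.99, 165644.51, q=2, df_ur=86)
reject_b = f_b > 4.86  # assumed 1% critical value

# (c) H0: beta2 = beta3 = beta4 = 0; unrestricted model has 88 - 5 = 83 df
f_c = f_from_r2(0.829, 0.820, q=3, df_ur=83)
reject_c = f_c > 2.15  # assumed 10% critical value

print(round(f_b, 2), reject_b, round(f_c, 2), reject_c)
```

The R-squared form is valid in (c) because the restricted and unrestricted models share the same dependent variable; it would not apply in (b), where the restricted model has no estimated parameters.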
6. Consider the multiple regression model with three independent variables, under
the classical linear model assumptions MLR.1 through MLR.6:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u.$$
You would like to test the null hypothesis $H_0: \beta_1 - 3\beta_2 = 1$.
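a. Let $\hat{\beta}_1$ and $\hat{\beta}_2$ [...] $Var(\hat{\beta}_1 - 3\hat{\beta}_2)$ in terms of the variances of $\hat{\beta}_1$ and
$\hat{\beta}_2$ [...] $(\hat{\beta}_1 - 3\hat{\beta}_2)$?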
b. Write the t statistic for testing $H_0: \beta_1 - 3\beta_2 = 1$.
c. Define $\theta_1 = \beta_1 - 3\beta_2$ and $\hat{\theta}_1 = \hat{\beta}_1 - 3\hat{\beta}_2$. Write a regression equation
involving $\beta_0$, $\theta_1$, $\beta_2$, and $\beta_3$ that allows you to directly obtain $\hat{\theta}_1$ and its
standard error.
7. The following table was created based on results from three regressions using the
data in CEOSAL2.RAW:
Dependent Variable: log(salary)

Independent Variables      (1)         (2)         (3)
log(sales)                 .224        .158        .188
                          (.027)      (.040)      (.040)
log(mktval)               _______      .112        .100
                                      (.050)      (.049)
profmarg                  _______      .0023       .0022
                                      (.0022)     (.0021)
ceoten                    _______     _______      .0171
                                                  (.0055)
comten                    _______     _______     -.0092
                                                  (.0033)
intercept                  4.94        4.62        4.57
                          (0.20)      (0.25)      (0.25)
Observations               177         177         177
R-squared                  .281        .304        .353
The variable mktval is market value of the firm, profmarg is the profit as a percentage
of sales, ceoten is years as CEO with the current company, and comten is total years
with the company.
a. Comment on the effect of profmarg on CEO salary based on the second and
third regressions in the table.
b. Based on the third regression in the table, does market value have a significant
effect in a two-sided test? Explain.
c. Interpret the coefficients on ceoten and comten in the third regression. Are the
variables statistically significant for a two-sided test at the 5% level?
d. What do you make of the fact that longer tenure with the company, holding
the other factors fixed, is associated with a lower salary?
Computer exercises
C4.1 The following model can be used to study whether campaign expenditures affect
election outcomes:
$$voteA = \beta_0 + \beta_1 \log(expendA) + \beta_2 \log(expendB) + \beta_3 prtystrA + u$$
where voteA is the percentage of the vote received by Candidate A, expendA and expendB
are campaign expenditures by Candidates A and B, and prtystrA is a measure of party
strength for Candidate A (the percentage of the most recent presidential vote that went to
A's party).
a. What is the interpretation of $\beta_1$?
b. In terms of the parameters, state the null hypothesis that a 1% increase in A's
expenditures is offset by a 1% increase in B's expenditures.
c. I estimate the given model using the data in VOTE1.RAW. Report the SRF
with standard errors in parentheses. Is A's expenditures variable statistically
significant? What about B's expenditures? Can you use these results to test the
hypothesis in (b)?
d. Write down the model that directly gives the t statistic for testing the
hypothesis in (b).
C4.2 Use the data in LAWSCH85.RAW for this exercise.
a. Using the same model as problem 3 of chapter 3, state the null hypothesis that
the rank of law schools has no ceteris paribus effect on median starting salary
and a one-sided alternative hypothesis.
b. Based on the STATA output, interpret the rank coefficient. Can you reject the
null hypothesis in a) at the 5% level?
c. Are features of the incoming class of students, LSAT and GPA, individually or
jointly significant for explaining salary? (to account for missing data on LSAT
and GPA, I estimated the restricted model using individuals only if their LSAT
and GPA are not missing.)
d. Test whether the size of the entering class (clsize) or the size of the faculty
(faculty) needs to be added to this equation by carrying out a single test at the
5% level. (Again I accounted for missing data on clsize and faculty.)
C4.3 Use the data in MLB1.RAW for this exercise.
a. I estimate the model in equation (4.31) and drop the variable rbisyr. What
happens to the statistical significance of hrunsyr? What about the size of the
coefficient on hrunsyr?
b. I then add the variables runsyr (runs per year), fldperc (fielding percentage),
and sbasesyr (stolen bases per year) to the model in (a). Which of these
factors are individually significant? Interpret the significant coefficient(s).
c. In the model in (b), test the joint significance of bavg, fldperc, and sbasesyr.
C4.4 Use the data in WAGE2.RAW for this exercise.
a. Consider the standard wage equation
$$\log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 tenure + u.$$
State the null hypothesis that another year of general workforce experience
has the same effect on log(wage) as another year of tenure with the current
employer.
b. Test the null hypothesis in (a) against a two-sided alternative, at the 5%
significance level, by constructing a 95% confidence interval. What do you
conclude?
C4.5 Refer to the example used in Section 4.4. I will use the data set TWOYEAR.RAW.
a. The variable phsrank is the person's high school percentile. (A larger number
is better. For example, 90 means you are ranked better than 90 percent of your
graduating class.) Find the smallest, largest, and average phsrank in the
sample.
b. I then add phsrank to equation (4.26) and estimate the new model. Report the
OLS estimates in the usual form. Is phsrank statistically significant? How
much is 10 percentage points of high school rank worth in terms of wage?
c. Does adding phsrank to (4.26) substantively change the conclusions on the
returns to two- and four-year colleges? Explain.
C4.6 Use the data in DISCRIM.RAW to answer this question. (See also Computer
Exercise C3.7 in Chapter 3.)
a. I estimate the model using STATA
$$\log(psoda) = \beta_0 + \beta_1 prpblck + \beta_2 \log(income) + \beta_3 prppov + u,$$
Report the SRF with standard errors, number of observations, and $R^2$. Is $\hat{\beta}_1$
statistically different from zero at the 5% level against a two-sided
alternative? What about at the 1% level?
b. What is the correlation between log(income) and prppov? For both variables,
report the t statistics and two-sided p-values.
c. To the regression in (a), add the variable log(hseval) (hseval is
median housing value at zipcode level). Interpret its coefficient and report the
two-sided p-value for $H_0: \beta_{\log(hseval)} = 0$.
d. In the regression in (c), what happens to the individual statistical significance
of log(income) and prppov? Are these variables jointly significant? (Compute
a p-value.) What do you make of your answers?
e. Given the results of the previous regressions, which one would you report as
most reliable in determining whether the racial makeup of a zip code
influences local fast-food prices? What is the effect of prpblck on price of
soda based on the model you picked as the most reliable?
C4.7 Use the data in HPRICE1.dta to answer this question. We set a population model
$$\log(price) = \beta_0 + \beta_1 sqrft + \beta_2 bdrms + u$$
a. You are interested in estimating and obtaining a confidence interval for the
percentage change in price when a 150-square-foot bedroom is added to a
house. In decimal form, this is $\theta_1 = 150\beta_1 + \beta_2$. Use the data to estimate $\theta_1$.
b. Write $\beta_2$ in terms of $\theta_1$ and $\beta_1$ and plug this into the regression equation
above.
c. Use the new regression you get in b) to obtain a standard error for $\hat{\theta}_1$ and use
this standard error to construct a 95% confidence interval.
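The substitution asked for in part b) can be sketched as follows, assuming the model $\log(price) = \beta_0 + \beta_1 sqrft + \beta_2 bdrms + u$ and the definition $\theta_1 = 150\beta_1 + \beta_2$:

```latex
\begin{align*}
\beta_2 &= \theta_1 - 150\,\beta_1, \\
\log(price) &= \beta_0 + \beta_1\, sqrft + (\theta_1 - 150\,\beta_1)\, bdrms + u \\
            &= \beta_0 + \beta_1\,(sqrft - 150\, bdrms) + \theta_1\, bdrms + u .
\end{align*}
```

Regressing log(price) on the constructed variable $sqrft - 150\,bdrms$ and on bdrms then reports $\hat{\theta}_1$ and its standard error as the coefficient on bdrms.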
Chapter 5
Computer exercises
C5.1 Use the data in WAGE1.dta for this exercise.
a. Estimate the equation
$$wage = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 tenure + u$$
Save the residuals and plot a histogram.
b. Repeat part (a), but with log(wage) as the dependent variable.
c. Would you say that Assumption MLR.6 is closer to being satisfied for the
level-level model or the log-level model?
C5.2 Use the data in GPA2.dta for this exercise.
a. Using all 4,137 observations, estimate the equation
$$colgpa = \beta_0 + \beta_1 hsperc + \beta_2 sat + u$$
and report the results.
b. Reestimate the equation in part (a), using the first 2,070 observations.
c. Find the ratio of the standard errors on hsperc from parts (a) and (b). Compare
this with the result from equation (5.10) in the book.
Chapter 6
1. The following SRF was estimated using the data in CEOSAL.RAW:
$$\widehat{\log(salary)} = \underset{(.324)}{4.322} + \underset{(.033)}{.276}\,\log(sales) + \underset{(.0129)}{.0215}\,roe - \underset{(.00026)}{.00008}\,roe^2$$
$$n = 209, \quad R^2 = .282.$$
This model allows roe to have a diminishing effect on log(salary). Is this
generality necessary? Explain why or why not.
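2. Let $\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_k$ [...] $\tilde{\beta}_0 = c_0\hat{\beta}_0$, $\tilde{\beta}_1 = (c_0/c_1)\hat{\beta}_1, \dots, \tilde{\beta}_k = (c_0/c_k)\hat{\beta}_k$.
(Hint: Use the fact that the $\hat{\beta}_j$ [...]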
$$\widehat{rdintens} = \underset{(0.429)}{2.613} + \underset{(0.00014)}{0.00030}\,sales - \underset{(0.0000000037)}{0.0000000070}\,sales^2$$
$$n = 32, \quad R^2 = .1484$$
a. At what point does the marginal effect of sales on rdintens become
negative?
b. Would you keep the quadratic term in the model? Explain.
c. Define salesbil as sales measured in billions of dollars: salesbil =
sales/1,000. Rewrite (without re-estimating the model) the estimated
equation with salesbil and $salesbil^2$ as the independent variables. Be sure
to report standard errors and the R-squared.
d. For the purpose of reporting the result, which equation do you prefer?
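Parts (a) and (c) are pure arithmetic on the reported estimates. The sketch below takes the coefficient on the squared term as negative (as part (a) implies) and assumes sales is measured in millions of dollars, so that salesbil = sales/1,000. Variable names are illustrative.

```python
# Assumed estimates: rdintens-hat = 2.613 + .00030 sales - .0000000070 sales^2
b1, se1 = 0.00030, 0.00014
b2, se2 = -0.0000000070, 0.0000000037

# (a) marginal effect b1 + 2*b2*sales turns negative beyond this sales level
turning_point = -b1 / (2 * b2)  # in millions of dollars

# (c) rescaling: sales = 1000 * salesbil, so the slope scales by 1000
#     and the quadratic coefficient by 1000**2; standard errors scale the same way
b1_bil, se1_bil = b1 * 1000, se1 * 1000
b2_bil, se2_bil = b2 * 1000**2, se2 * 1000**2

print(round(turning_point, 1))
print(round(b1_bil, 4), round(se1_bil, 4), round(b2_bil, 4), round(se2_bil, 4))
```

The intercept, the t statistics, and the R-squared are unchanged by the rescaling, since only the units of the regressors differ.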
4. The following model allows the return to education to depend upon the total
amount of both parents' education, called pareduc:
$$\log(wage) = \beta_0 + \beta_1 educ + \beta_2\, educ \cdot pareduc + \beta_3 exper + \beta_4 tenure + u.$$
a. Use calculus to show that the return to another year of education in this
model is roughly
$$\Delta \log(wage) / \Delta educ = \beta_1 + \beta_2\, pareduc.$$
What sign do you expect for $\beta_2$? Why?
b. Using the data in WAGE2.RAW, the estimated equation is
$$\widehat{\log(wage)} = \underset{(.13)}{5.65} + \underset{(.010)}{.047}\,educ + \underset{(.00021)}{.00078}\,educ \cdot pareduc + \underset{(.004)}{.019}\,exper + \underset{(.003)}{.010}\,tenure$$
$$n = 722, \quad R^2 = .169$$
(Only 722 observations contain full information on parents' education.)
Interpret the coefficient on the interaction term. It might help to choose two
specific values for pareduc, for example, pareduc=32 if both parents have a
college education, or pareduc=24 if both parents have a high school
education, and to compare the estimated return to educ.
c. When pareduc is added as a separate variable to the equation, we get:
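$$\widehat{\log(wage)} = \underset{(.38)}{4.94} + \underset{(.027)}{.097}\,educ + \underset{(.017)}{.033}\,pareduc - \underset{(.0012)}{.0016}\,educ \cdot pareduc + \underset{(.004)}{.020}\,exper + \underset{(.003)}{.010}\,tenure$$
$$n = 722, \quad R^2 = .174$$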
Does the estimated return to education now depend positively on parent
education? Test the null hypothesis that the return to education does not depend
on parent education.
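For parts (b) and (c), the estimated return to education is the educ coefficient plus the interaction coefficient times pareduc. The sketch below assumes the estimates .047 and .00078 from part (b), and takes the interaction estimate in part (c) as -.0016 with standard error .0012; the negative sign is an assumption about a sign the extracted equation may not display. The 1.96 cutoff is the usual large-sample 5% two-sided value, also an assumption.

```python
# (b) return to education at two levels of parents' education
b_educ, b_inter = 0.047, 0.00078  # assumed estimates from part (b)


def return_to_educ(pareduc):
    """Approximate proportional change in wage per extra year of educ."""
    return b_educ + b_inter * pareduc


ret_college = return_to_educ(32)  # both parents have a college education
ret_hs = return_to_educ(24)       # both parents have a high school education
ret_gap = ret_college - ret_hs

# (c) t statistic on the interaction once pareduc enters separately
b_inter_c, se_inter_c = -0.0016, 0.0012  # sign is an assumption
t_inter = b_inter_c / se_inter_c
reject_5pct = abs(t_inter) > 1.96        # assumed large-sample critical value

print(round(ret_college, 5), round(ret_hs, 5), round(ret_gap, 5))
print(round(t_inter, 2), reject_5pct)
```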
5. In example 4.2, where the percentage of students receiving a passing score on a
tenth-grade math exam (math10) is the dependent variable, does it make sense to
include sci10, the percentage of tenth graders passing a science exam, as an
additional explanatory variable?
6. When $atndrte^2$ and $ACT \cdot atndrte$ are added to the equation estimated in (6.19), the
R-squared becomes 0.232. Are these additional terms jointly significant at the
10% level? Would you include them in the model?
7. Suppose we want to estimate the effects of alcohol consumption (alcohol) on
college grade point average (colGPA). In addition to collecting information on
grade point average and alcohol usage, we also obtain attendance information
(say, percentage of lectures attended, called attend). A standardized test score
(say, SAT) and high school GPA (hsGPA) are also available.
a. Should we include attend along with alcohol as explanatory variables in a
multiple regression model? (Think about how you would interpret $\beta_{alcohol}$.)
b. Should SAT and hsGPA be included as explanatory variables? Explain.
Computer exercises
C6.1 I use the data in KEILMC.RAW for the year 1981 to run the following regressions.
The data are for houses that sold during 1981 in North Andover, Massachusetts; 1981
was the year construction began on a local garbage incinerator.
a. To study the effects of the incinerator location on housing price, consider the
simple regression model
$$\log(price) = \beta_0 + \beta_1 \log(dist) + u,$$
where price is housing price in dollars and dist is distance from the house to
the incinerator measured in feet. Interpreting this equation causally, what sign
do you expect for $\beta_1$ if the presence of the incinerator depresses housing
prices?
b. I estimate this simple equation. Report the regression results and interpret the
results.
c. To the simple regression model in (a), I add the variables log(intst), log(area),
log(land), rooms, baths, and age, where intst is distance from the home to the
interstate (highway) measured in feet, area is square footage of the house,
land is the lot size in square feet, rooms is total number of rooms, baths is
number of bathrooms, and age is age of the house in years. Now, what do you
conclude about the effects of the incinerator?
d. Next I add $[\log(intst)]^2$ to the model from c). Now what happens? What do
you conclude about the importance of functional form?
e. Is the square of log(dist) significant when I add it to the model in d)?
C6.2 I use the data in WAGE1.RAW for this exercise.
a. I estimate the equation
$$\log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 exper^2 + u,$$
Report the results using the usual format.
b. Is $exper^2$ statistically significant at the 1% level?
c. Find the return to the fifth year of experience. What is the return to the
twentieth year of experience? (not using approximations)
d. At what value of exper does additional experience actually lower predicted
log(wage)? How many people have more experience in this sample?
C6.3 Consider a model where the return to education depends upon the amount of work
experience (and vice versa):
$$\log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3\, educ \cdot exper + u.$$
a. Show that the return to another year of education, holding exper fixed, is
$\beta_1 + \beta_3\, exper$.
b. State the null hypothesis that the return to education does not depend on the
level of exper. What do you think is the appropriate alternative?
c. Test the null hypothesis in (b) against your stated alternative.
d. Let $\theta_1$ denote the return to education. Write down the model that directly
gives the estimate and standard error for $\theta_1$.
C6.4 Use the housing price data in HPRICE1.dta for this exercise.
a. Estimate the model
$$\log(price) = \beta_0 + \beta_1 \log(lotsize) + \beta_2 \log(sqrft) + \beta_3 bdrms + u$$
and report the results in the usual OLS format (as on page 154).
b. Find the predicted value of log(price) when lotsize = 20,000, sqrft = 2,500, and
bdrms = 4. Using the method of equation (6.43), find the predicted value of
price at the same values of the explanatory variables.
C6.5 Use the data in VOTE1.dta for this exercise.
a. Consider a model with an interaction between expenditures:
$$voteA = \beta_0 + \beta_1 prtystrA + \beta_2 expendA + \beta_3 expendB + \beta_4\, expendA \cdot expendB + u$$
What is the partial effect of expendB on voteA, holding prtystrA and expendA
fixed? What is the partial effect of expendA on voteA? Is the expected sign for
$\beta_4$ obvious?
b. Estimate the equation in a) and report the results in the usual form. Is the
interaction term statistically significant?
c. Find the average of expendA in the sample. Fix expendA at 300 (for
$300,000). What is the estimated effect of another $100,000 spent by
Candidate B on voteA? Is this a large effect?
d. Now fix expendB at 100. What is the estimated effect of $\Delta expendA = 100$ on
voteA? Is this a large effect?
e. Now, estimate a model that replaces the interaction with shareA, Candidate
A's percentage share of total campaign expenditures. Does it make sense to
hold both expendA and expendB fixed, while changing shareA?
f. In the model from e), find the partial effect of expendB on voteA, holding
prtystrA and expendA fixed. Evaluate this at expendA = 300 and expendB = 0
and comment on the results.
C6.6 Use the data in ATTEND.dta for this exercise.
a. Given the population regression function in Example 6.3, we have
$$\frac{\partial\, stndfnl}{\partial\, priGPA} = \beta_2 + 2\beta_4\, priGPA + \beta_6\, atndrte$$
Use equation (6.19) to estimate the partial effect when priGPA = 2.59 and
atndrte = 82. Interpret your estimate.
b. Reparameterize the model to capture the above effect by a single parameter
and estimate the reparameterized model.
$$stndfnl = \theta_0 + \theta_1 atndrte + \theta_2 priGPA + \theta_3 ACT + \theta_4 (priGPA - 2.59)^2 + \theta_5 ACT^2 + \theta_6\, priGPA\,(atndrte - 82) + u$$
where $\theta_2 = \beta_2 + 2\beta_4(2.59) + \beta_6(82)$. (Note that the intercept has changed, but
this is not important.) Use this to obtain the standard error of $\hat{\theta}_2$. Is it
statistically significant?
C6.7 Use the data in HPRICE1.dta for this exercise.
a. Estimate the model
$$price = \beta_0 + \beta_1 lotsize + \beta_2 sqrft + \beta_3 bdrms + u$$
and report the results in the usual form, including the standard error of the
regression. Obtain predicted price when we plug in lotsize = 10,000,
sqrft = 2,300, and bdrms = 4; round this price to the nearest dollar.
b. Run a regression that allows you to put a 95% confidence interval around the
predicted value in a). Note that your prediction will differ somewhat due to
rounding error.
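One common way to set up the regression in b) is to recenter each regressor at the value of interest, so that the new intercept is the prediction itself. The sketch below uses $\theta_0$ as a label introduced here for illustration; it is not notation from the exercise.

```latex
\begin{align*}
\theta_0 &= \beta_0 + \beta_1(10{,}000) + \beta_2(2{,}300) + \beta_3(4),\\
\mathit{price} &= \theta_0 + \beta_1(\mathit{lotsize} - 10{,}000)
                 + \beta_2(\mathit{sqrft} - 2{,}300)
                 + \beta_3(\mathit{bdrms} - 4) + u.
\end{align*}
```

The estimated intercept and its standard error from this regression then give an interval of the form $\hat{\theta}_0 \pm c\cdot\mathrm{se}(\hat{\theta}_0)$, where $c$ is the 97.5th percentile of the t distribution with $n - 4$ degrees of freedom.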
Chapter 7
1. In example 7.2, let noPC be a dummy variable equal to one if the student does not
own a PC, and zero otherwise.
a. If noPC is used in place of PC in equation 7.6, what happens to the
intercept in the estimated equation? What will be the coefficient on noPC?
b. What will happen to the R-squared if noPC is used in place of PC?
c. Should PC and noPC both be included as independent variables in the
model? Explain.
2. Suppose you collect data from a survey on wages, education, and gender. In
addition, you ask for information about marijuana usage. The original question is:
"On how many separate occasions last month did you smoke marijuana?"
a. Write an equation that would allow you to estimate the effects of
marijuana usage on wage, while controlling for other factors. You should
be able to make statements such as, "Smoking marijuana five more times
per month is estimated to change wage by x%."
b. Write a model that would allow you to test whether drug usage has
different effects on wages for men and women. How could you test that
there are no differences in the effects of drug usage for men and women?
c. Suppose you think it is better to measure marijuana usage by putting people
into one of four categories: nonuser, light user (1 to 5 times per month),
moderate user (6 to 10 times per month), and heavy user (more than 10
times per month). Now write a model that allows you to estimate the
effects of marijuana usage on wage.
d. Using the model in c), explain in detail how to test the null hypothesis that
marijuana usage has no effect on wage. Be very specific and include a
careful listing of degrees of freedom.
e. What are some potential problems with drawing causal inference using the
survey data that you collected?
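The four usage categories in part c) of Problem 2 can be coded as dummy variables, with one category left out as the base group. The following is a minimal sketch, assuming nonusers are the base group. The names usage_dummies, lghtuser, moduser, and hvyuser are hypothetical labels introduced here, not names from the survey or from any data set.

```python
def usage_dummies(times_per_month):
    """Map a monthly usage count to three 0/1 dummies.

    Nonuser (a count of zero) is the omitted base group, so a nonuser
    gets zeros for all three dummies. Variable names are hypothetical.
    """
    if times_per_month < 0:
        raise ValueError("usage count cannot be negative")
    light = 1 if 1 <= times_per_month <= 5 else 0      # 1 to 5 times per month
    moderate = 1 if 6 <= times_per_month <= 10 else 0  # 6 to 10 times per month
    heavy = 1 if times_per_month > 10 else 0           # more than 10 times per month
    return {"lghtuser": light, "moduser": moderate, "hvyuser": heavy}


# Example: a respondent who reports 7 occasions falls in the moderate group.
print(usage_dummies(7))
```

Leaving out one category avoids perfect collinearity with the intercept; each included dummy then measures a difference relative to nonusers.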
Computer Exercises
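C 7.1 Use the data in WAGE2.dta for this exercise.
a. Estimate the model
$$\log(\mathit{wage}) = \beta_0 + \beta_1\,\mathit{educ} + \beta_2\,\mathit{exper} + \beta_3\,\mathit{tenure} + \beta_4\,\mathit{married} + \beta_5\,\mathit{black} + \beta_6\,\mathit{south} + \beta_7\,\mathit{urban} + u$$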
and report the results in the usual form. Holding other factors fixed, what
is the approximate difference in monthly salary between blacks and non-
blacks? Is this difference statistically significant?
b. Expand the model in a) to allow the return to education to depend on race
and test whether the return to education does depend on race.
c. Again, start with the model in a), but now allow wages to differ across
four groups of people: married and black, married and nonblack, single
and black, and single and nonblack. What is the estimated wage
differential between married blacks and married nonblacks?
C 7.2 Use the data in GPA2.dta for this exercise.
a. Consider the equation
$$\mathit{colgpa} = \beta_0 + \beta_1\,\mathit{hsize} + \beta_2\,\mathit{hsize}^2 + \beta_3\,\mathit{hsperc} + \beta_4\,\mathit{sat} + \beta_5\,\mathit{female} + \beta_6\,\mathit{athlete} + u,$$
where colgpa is cumulative college grade point average, hsize is size of high
school graduating class, in hundreds, hsperc is academic percentile in
graduating class, sat is combined SAT score, female is a binary gender
variable, and athlete is a binary variable, which is one for student-athletes.
What are your expectations for the coefficients in this equation? Which ones
are you unsure about?
b. Estimate the equation in a) and report the results in the usual form. What
is the estimated GPA differential between athletes and nonathletes? Is it
statistically significant?
c. Drop sat from the model and reestimate the equation. Now what is the
estimated effect of being an athlete? Discuss why the estimate is different
than that obtained in b).
d. In the model from a), allow the effect of being an athlete to differ by
gender and test the null hypothesis that there is no ceteris paribus
difference between women athletes and women nonathletes.
e. Does the effect of sat on colgpa differ by gender? Justify your answer.
Chapter 8
Computer Exercises
C 8.1
a. Use the data in HPRICE1.dta to obtain the heteroskedasticity-robust
standard errors for equation (8.17). Discuss any important differences with
the usual standard errors.
b. Repeat a) for equation (8.18).
c. What does this example suggest about heteroskedasticity and the
transformation used for the dependent variable?
Chapter 9
Computer Exercises
C9.1 Let math10 denote the percentage of students at a Michigan high school receiving a
passing score on a standardized math test (see also Example 4.2). We are interested in
estimating the effect of per student spending on math performance. A simple model is
$$\mathit{math10} = \beta_0 + \beta_1\log(\mathit{expend}) + \beta_2\log(\mathit{enroll}) + \beta_3\,\mathit{poverty} + u,$$
where poverty is the percentage of students living in poverty.
a. The variable lnchprg is the percentage of students eligible for the federally
funded school lunch program. Why is this a sensible proxy variable for
poverty?
b. Estimate the model with and without lnchprg as an explanatory variable
and report your regression results. Compare the effect of expenditures on
math10 from both regressions.
c. Does it appear that pass rates are lower at larger schools, other factors
being equal? Explain.
d. Interpret the coefficient of lnchprg.
e. What do you make of the substantial increase in $R^2$ after adding lnchprg?
C 9.2 Use the data set WAGE2.dta for this exercise.
a. Use the variable KWW (the "knowledge of the world of work" test score)
as a proxy variable for ability in place of IQ in Example 9.3. What is the
estimated return to education?
b. Now, use IQ and KWW together as proxy variables. What happens to the
estimated return to education?
c. In b), are IQ and KWW individually significant? Are they jointly
significant?
C 9.3 Use the data from JTRAIN.dta for this exercise.
a. Consider the simple regression model
$$\log(\mathit{scrap}) = \beta_0 + \beta_1\,\mathit{grant} + u,$$
where scrap is the firm scrap rate and grant is a dummy variable
indicating whether a firm received a job training grant. Can you think of
some reasons why the unobserved factor in u might be correlated with
grant?
b. Estimate the simple regression model using the data for 1988. (You should
have 54 observations.) Does receiving a job training grant significantly
lower a firm's scrap rate?
c. Now, add as an explanatory variable $\log(\mathit{scrap}_{87})$. How does this change
the estimated effect of grant? Interpret the coefficient on grant. Is it
statistically significant at the 5% level against the one-sided alternative
$H_a\colon \beta_{\mathit{grant}} < 0$?
d. Test the null hypothesis that the parameter on $\log(\mathit{scrap}_{87})$ is one against
the two-sided alternative. Report the p-value of the test.
e. Repeat c) and d), using heteroskedasticity-robust standard errors, and
briefly discuss any notable differences.
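Part d) tests a null value other than zero, so the t statistic printed by most regression software does not apply directly. The sketch below shows the arithmetic, using hypothetical function names and made-up numbers. The p-value uses the standard normal as a large-sample approximation; with only 54 observations, the t distribution with n - k - 1 degrees of freedom is the more appropriate reference.

```python
import math


def t_stat(estimate, std_error, null_value=0.0):
    """t statistic for H0: parameter = null_value."""
    if std_error <= 0:
        raise ValueError("standard error must be positive")
    return (estimate - null_value) / std_error


def two_sided_p_normal(t):
    """Two-sided p-value, P(|Z| > |t|), from the standard normal.

    This is a large-sample approximation, not the exact t-distribution p-value.
    """
    return math.erfc(abs(t) / math.sqrt(2.0))


# Made-up numbers for illustration only (not estimates from JTRAIN.dta).
t = t_stat(estimate=0.8, std_error=0.1, null_value=1.0)
print(t, two_sided_p_normal(t))
```

The same function covers e): only the standard error passed in changes when robust standard errors are used.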
C 9.4 You need to use two data sets for this exercise: JTRAIN2.dta and JTRAIN3.dta.
(Before solving this problem, read the data dictionary regarding both data sets.) The
former is data from a job training experiment, where job training was assigned by
randomization. The latter contains observational data (a random sample from the
population of American men working in 1978), where job training participation was
largely determined by individual choice. The two data sets cover the same time period.
a. In the data set JTRAIN2.dta, what fraction of the men received job
training? What is the fraction in JTRAIN3.dta? Why do you think there is
such a big difference?
b. Using JTRAIN2.dta, run a simple regression of re78 on train. What is the
estimated effect of participating in job training on real earnings?
c. Now add as controls to the regression in b) the variables re74, re75, educ,
age, black, and hisp. Does the estimated effect of job training on re78
change much? Why?
d. Do the regression in b) and c) using the data in JTRAIN3.dta, reporting
only the estimated coefficients on train, along with their t statistics. What
is the effect now of controlling for the extra factors, and why?
e. Define avgre = (re74 + re75)/2. Find the sample averages, standard
deviations, and minimum and maximum values in the two data sets. Are
these data sets representative of the same populations in 1978?
f. Almost 96% of men in the data set JTRAIN2.dta have avgre less than
$10,000. Using only these men, run the regression re78 on train, re74,
re75, educ, age, black, hisp and report the training estimate and its t
statistic. Run the same regression for JTRAIN3.dta, using only men with
avgre less than $10,000. For the subsample of low-income men, how do
the estimated training effects compare across the experimental and
nonexperimental data sets?
g. Now use each data set to run the simple regression re78 on train, but only
for men who were unemployed in 1974 and 1975. How do the training
estimates compare now? If you find that the estimate from the observational
data is higher than that from the experimental data, can you think of an
explanation?
h. Using your findings from the previous regressions, discuss the potential
importance of having comparable populations underlying comparisons of
experimental and nonexperimental estimates.
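Parts e) and f) of C 9.4 rely on constructing avgre, summarizing it, and then restricting the sample. A minimal sketch of those steps follows, using made-up earnings pairs rather than values from JTRAIN2.dta or JTRAIN3.dta. The helper names and the cutoff value are assumptions for illustration; check the data dictionary for the units of re74 and re75 before choosing the cutoff that corresponds to $10,000.

```python
import statistics


def avgre(re74, re75):
    """Average of 1974 and 1975 real earnings."""
    return (re74 + re75) / 2


def summarize(values):
    """Sample mean, sample standard deviation, minimum, and maximum."""
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # divides by n - 1
        "min": min(values),
        "max": max(values),
    }


def below(values, cutoff):
    """Keep only the values strictly less than the cutoff."""
    return [v for v in values if v < cutoff]


# Hypothetical (re74, re75) pairs, NOT taken from either data set.
earnings = [(0.0, 0.0), (2.0, 4.0), (10.0, 14.0)]
values = [avgre(a, b) for a, b in earnings]
summary = summarize(values)
low = below(values, 10.0)  # the cutoff is in whatever units the data use
print(summary, low)
```

Running the same summary on each data set makes the comparison in e) direct, and the filter gives the subsample used in f).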
Chapter 13
1. In example 13.1, assume that the averages of all factors other than educ have
remained constant over time and that the average level of education is 12.2 for the
1972 sample and 13.3 in the 1984 sample. Using the estimates in Table 13.1, find
the estimated change in average fertility between 1972 and 1984. (Be sure to
account for the intercept change and the change in average education.)
2. Using the data in KIELMC.dta, the following two equations were estimated using
the years 1978 and 1981:
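$$\widehat{\log(\mathit{price})} = \underset{(.26)}{11.49} - \underset{(.058)}{.547}\,\mathit{nearinc} + \underset{(.080)}{.394}\,\mathit{y81}\cdot\mathit{nearinc}$$
$$n = 321,\quad R^2 = .220$$
$$\widehat{\log(\mathit{price})} = \underset{(.27)}{11.18} + \underset{(.044)}{.563}\,\mathit{y81} - \underset{(.067)}{.403}\,\mathit{y81}\cdot\mathit{nearinc}$$
$$n = 321,\quad R^2 = .337$$
The estimates on the interaction term $\mathit{y81}\cdot\mathit{nearinc}$ from the above two equations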
are very different from that in equation (13.9). Explain the difference between
these two regressions and equation (13.9).
3. Suppose we want to estimate the effect of several variables on annual saving and
that we have a panel data set on individuals collected on January 31, 1990, and
January 31, 1992. If we include a year dummy for 1992 and use first differencing,
can we also include age in the original model (the model before differencing)?
Explain.
Computer Exercises
C13.1
Use the data in FERTIL1.data for this exercise.
a. In the equation estimated in Example 13.1, test whether living
environment at age 16 has an effect on fertility. (The base group is large
city.) Report the value of the F statistic and the p-value.
b. Test whether region of the country at age 16 (South is the base group) has
an effect on fertility.
c. Add the interaction terms $\mathit{y74}\cdot\mathit{educ}$, $\mathit{y76}\cdot\mathit{educ}$, $\ldots$, and $\mathit{y84}\cdot\mathit{educ}$ to the
model estimated in Table 13.1. Explain what these terms represent. Are
they jointly significant?
d. Based on the SRF you got in c), find the fertility level in 1984 relative
to the base year 1972, first for 12 years of education and then at the
sample mean of education in 1984. Explain how we can tell whether these
two estimates are significant; you only need to suggest a regression to
run for each situation (educ = 12 and educ at the sample mean of 1984).
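The joint tests in a) through c) compare a restricted model (the variables in question dropped) with the unrestricted one. The sketch below computes the usual F statistic from the two sums of squared residuals. The function name and the numbers in the example are hypothetical, and the formula assumes the classical (homoskedastic) setting. The p-value requires the F distribution with (q, df_unrestricted) degrees of freedom, which is left to a statistics package.

```python
def f_statistic(ssr_restricted, ssr_unrestricted, q, df_unrestricted):
    """F statistic for q exclusion restrictions.

    ssr_restricted:   sum of squared residuals with the q variables dropped
    ssr_unrestricted: sum of squared residuals from the full model
    q:                number of restrictions being tested
    df_unrestricted:  n - k - 1 for the full model
    """
    if q <= 0 or df_unrestricted <= 0:
        raise ValueError("q and df_unrestricted must be positive")
    if ssr_unrestricted <= 0 or ssr_restricted < ssr_unrestricted:
        raise ValueError("need 0 < ssr_unrestricted <= ssr_restricted")
    numerator = (ssr_restricted - ssr_unrestricted) / q
    denominator = ssr_unrestricted / df_unrestricted
    return numerator / denominator


# Made-up sums of squared residuals, not results from FERTIL1.
print(f_statistic(120.0, 100.0, q=4, df_unrestricted=50))
```

Both regressions must use the same observations, otherwise the two sums of squared residuals are not comparable.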
C13.2
Use the data in CPS78_85.dat for this exercise.
a. How do you interpret the coefficient on $\mathit{y85}$ in equation (13.2)? Does it
have an interesting interpretation? (Be careful here; you must account for
the interaction terms $\mathit{y85}\cdot\mathit{educ}$ and $\mathit{y85}\cdot\mathit{female}$.)
b. Holding other factors fixed, what is the estimated percent increase in
nominal wage for a male with 12 years of education over this time period?
Propose a regression to obtain a confidence interval for this estimate.
c. Reestimate equation (13.2) but let all wages be measured in 1978 dollars.
In particular, define the real wage as rwage = wage for 1978 and as rwage
= wage/1.65 for 1985. Now use $\log(\mathit{rwage})$ in place of $\log(\mathit{wage})$ in
estimating (13.2). Before running the regression, try to predict which
coefficients will differ from those in equation (13.2).
d. Explain why the $R^2$ from your regression in c) is not the same as in
equation (13.2).
e. Describe how union participation changed from 1978 to 1985.
f. Starting with equation (13.2), test whether the union wage differential
changed over time.
g. Do your findings in e) and f) conflict? Explain.
C 13.3
Use the data in KIELMC.dta for this exercise.
a. The variable dist is the distance from each home to the incinerator site, in
feet. Consider the model
$$\log(\mathit{price}) = \beta_0 + \delta_0\,\mathit{y81} + \beta_1\log(\mathit{dist}) + \delta_1\,\mathit{y81}\cdot\log(\mathit{dist}) + u.$$
If building the incinerator reduces the value of homes closer to the site,
what is the sign of $\delta_1$? What does it mean if $\beta_1 > 0$?
b. Estimate the model in a) and report the results in the usual form. Interpret
the coefficient on $\mathit{y81}\cdot\log(\mathit{dist})$. What do you conclude?
c. Add age, $\mathit{age}^2$, rooms, baths, $\log(\mathit{intst})$, $\log(\mathit{land})$, and $\log(\mathit{area})$ to the
equation. Now, what do you conclude about the effect of the incinerator
on housing values?
C 13.4
For this exercise, we use JTRAIN.dta to determine the effect of the job training grant on
hours of job training per employee. The basic model for the three years is
$$\mathit{hrsemp}_{it} = \beta_0 + \delta_1\,\mathit{d88}_t + \delta_2\,\mathit{d89}_t + \beta_1\,\mathit{grant}_{it} + \beta_2\,\mathit{grant}_{i,t-1} + \beta_3\log(\mathit{employ}_{it}) + a_i + u_{it}.$$
a. Estimate the equation using first differencing. How many firms are used in
the estimation? How many total observations would be used if each firm
had data on all variables for all three time periods?
b. Interpret the coefficient on grant and comment on its significance.
c. Is it surprising that $\mathit{grant}_{-1}$ is insignificant? Explain.
d. Do larger firms train their employees more or less, on average? How big
are the differences in training due to firm size?
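First differencing in C 13.4 subtracts each firm's previous-year value from its current-year value, which removes the time-constant effect a_i. The sketch below shows the mechanics for a single variable on a made-up two-firm panel. The function name and the data layout (a dictionary keyed by firm, then by year) are assumptions for illustration, not the layout of JTRAIN.dta.

```python
def first_differences(panel, years):
    """First-difference one variable within each unit.

    panel: {unit_id: {year: value}}; a missing year is simply absent.
    years: the full ordered list of sample years.
    Returns {unit_id: {year: value_t - value_(t-1)}}, keeping a difference
    only when both adjacent years are observed. Units with no usable
    difference are dropped.
    """
    diffs = {}
    for unit, series in panel.items():
        unit_diffs = {}
        for prev, curr in zip(years, years[1:]):
            if series.get(prev) is not None and series.get(curr) is not None:
                unit_diffs[curr] = series[curr] - series[prev]
        if unit_diffs:
            diffs[unit] = unit_diffs
    return diffs


# Hypothetical panel: firm_b is missing 1988, so it contributes no differences.
panel = {
    "firm_a": {1987: 10.0, 1988: 14.0, 1989: 13.0},
    "firm_b": {1987: 5.0, 1989: 9.0},
}
diffs = first_differences(panel, [1987, 1988, 1989])
print(diffs)
```

With three years, a firm with complete data contributes two differenced observations, which is the counting behind part a).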
Chapter 15
1. Consider a simple model to estimate the effect of personal computer (PC)
ownership on college grade point average for graduating seniors at a large public
university:
$$\mathit{GPA} = \beta_0 + \beta_1\,\mathit{PC} + u,$$
where PC is a binary variable indicating PC ownership.
a. Why might PC ownership be correlated with u?
b. Explain why PC is likely to be related to parents' annual income. Does
this mean parental income is a good IV for PC? Why or why not?
c. Suppose that, four years ago, the university gave grants to buy computers
to roughly one-half of the incoming students, and the students who
received grants were randomly chosen. Carefully explain how you would
use this information to construct an instrumental variable for PC.
2. Suppose that you wish to estimate the effect of class attendance on student
performance, as in Example 6.3. A basic model is
$$\mathit{stndfnl} = \beta_0 + \beta_1\,\mathit{atndrte} + \beta_2\,\mathit{priGPA} + \beta_3\,\mathit{ACT} + u.$$
a. Let dist be the distance from the student's living quarters to the lecture
hall. Assuming that dist and u are uncorrelated, what other assumption
must dist satisfy in order to be a valid IV for atndrte?
b. Suppose, as in equation (6.18), we add the interaction term
$\mathit{priGPA}\cdot\mathit{atndrte}$. What might be a good IV for $\mathit{priGPA}\cdot\mathit{atndrte}$? [Hint:
if $E(u \mid \mathit{priGPA}, \mathit{ACT}, \mathit{dist}) = 0$, as happens when priGPA, ACT, and dist
are all exogenous, then any function of priGPA and dist is uncorrelated
with u.]
3. Consider the simple regression model
$$y = \beta_0 + \beta_1 x + u$$
and let z be a binary instrumental variable for x. Use (15.10) to show that the IV
estimator $\hat{\beta}_1$ can be written as
$$\hat{\beta}_1 = (\bar{y}_1 - \bar{y}_0)/(\bar{x}_1 - \bar{x}_0),$$
where $\bar{y}_0$ and $\bar{x}_0$ are the sample averages of $y_i$ and $x_i$ over the part of the sample
with $z_i = 0$, and where $\bar{y}_1$ and $\bar{x}_1$ are the sample averages of $y_i$ and $x_i$ over the
part of the sample with $z_i = 1$. This estimator, known as a grouping estimator, was
first suggested by Wald (1940).
4. Refer to equations (5.19) and (15.20). Assume that $\sigma_u = \sigma_x$, so that the
population variation in the error term is the same as it is in x. Suppose that the
instrumental variable, z, is slightly correlated with u: $\mathrm{Corr}(z, u) = 0.1$. Suppose
that z and x have a somewhat stronger correlation: $\mathrm{Corr}(z, x) = 0.2$.
a. What is the asymptotic bias in the IV estimator?
b. How much correlation would have to exist between u and x before OLS
has more asymptotic bias than 2SLS?
5. The following is a simple model to measure the effect of a school choice program
on standardized test performance (see Rouse [1998] for motivation):
$$\mathit{score} = \beta_0 + \beta_1\,\mathit{choice} + \beta_2\,\mathit{faminc} + u_1,$$
where score is the score on a statewide test, choice is a binary variable indicating
whether a student attended a choice school in the last year, and faminc is family
income. The IV for choice is grant, the dollar amount granted to students to use
for tuition at choice schools. The grant amount differed by family income level,
which is why we control for faminc in the equation.
a. Even with faminc in the equation, why might choice be correlated with $u_1$?
b. If, within each income class, the grant amounts were assigned randomly,
is grant uncorrelated with $u_1$?
c. Write the reduced form equation for choice. What is needed for grant to
be partially correlated with choice?
6. Suppose that, in equation (15.8), you do not have a good instrumental variable
candidate for skipped. But you have two other pieces of information on students:
combined SAT score and cumulative GPA prior to the semester. What would you
do instead of IV estimation?
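The grouping estimator in Problem 3 can be checked numerically before attempting the proof. The sketch below computes it alongside the covariance-ratio form of the simple IV estimator, which is assumed here to be what (15.10) states. The function names and the four data points are made up for illustration.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def wald_estimator(y, x, z):
    """Grouping (Wald) estimator for a binary instrument z coded 0/1."""
    y1 = [yi for yi, zi in zip(y, z) if zi == 1]
    y0 = [yi for yi, zi in zip(y, z) if zi == 0]
    x1 = [xi for xi, zi in zip(x, z) if zi == 1]
    x0 = [xi for xi, zi in zip(x, z) if zi == 0]
    return (mean(y1) - mean(y0)) / (mean(x1) - mean(x0))


def iv_estimator(y, x, z):
    """Simple IV slope: sample cov(z, y) divided by sample cov(z, x)."""
    zbar, ybar, xbar = mean(z), mean(y), mean(x)
    num = sum((zi - zbar) * (yi - ybar) for zi, yi in zip(z, y))
    den = sum((zi - zbar) * (xi - xbar) for zi, xi in zip(z, x))
    return num / den


# Made-up numbers, not from any data set used in these exercises.
z = [0, 0, 1, 1]
x = [1.0, 3.0, 4.0, 6.0]
y = [2.0, 4.0, 7.0, 11.0]
wald = wald_estimator(y, x, z)
iv = iv_estimator(y, x, z)
print(wald, iv)
```

For these numbers both functions return 2.0. A numerical match on one sample is not a proof, but it is a useful sanity check on the algebra.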
Computer Exercises
C15.1
Use the data in WAGE2.dta for this exercise.
a. In Example 15.2, using sibs as an instrument for educ, the IV estimate of the
return to education is 0.122. To convince yourself that using sibs as an IV for
educ is not the same as just plugging sibs in for educ and running an OLS
regression, run the regression of $\log(\mathit{wage})$ on sibs and explain your findings.
b. The variable brthord is birth order (it is one for a first-born child, two for a
second-born child, and so on). Explain why educ and brthord might be negatively
correlated. Regress educ on brthord to determine whether there is a statistically
significant negative correlation.
c. Use brthord as an IV for educ in equation (15.1). Report and interpret the results.
d. Now, suppose that we include number of siblings as an explanatory variable in the
wage equation; this controls for family background, to some extent:
$$\log(\mathit{wage}) = \beta_0 + \beta_1\,\mathit{educ} + \beta_2\,\mathit{sibs} + u$$
Suppose that we want to use brthord as an IV for educ, assuming that sibs is
exogenous. The reduced form for educ is
$$\mathit{educ} = \pi_0 + \pi_1\,\mathit{sibs} + \pi_2\,\mathit{brthord} + v$$
State and test the identification assumption.
e. Estimate the wage equation in d) using brthord as an IV for educ (and sibs as its
own IV). Comment on the standard errors for $\hat{\beta}_{\mathit{educ}}$ and $\hat{\beta}_{\mathit{sibs}}$.
f. Using the fitted values from e), $\widehat{\mathit{educ}}$, compute the correlation between $\widehat{\mathit{educ}}$ and
sibs. Use this result to explain your findings from e).
C15.2
Use the data in CARD.dta for this exercise.
a. The equation we estimated in Example 15.4 can be written as
$$\log(\mathit{wage}) = \beta_0 + \beta_1\,\mathit{educ} + \beta_2\,\mathit{exper} + \ldots + u$$
where the other explanatory variables are listed in Table 15.1. In order for IV to
be consistent, the IV for educ, nearc4, must be uncorrelated with u. Could nearc4
be correlated with things in the error term, such as unobserved ability? Explain.
b. For a subsample of the men in the data set, an IQ score is available. Regress IQ
on nearc4 to check whether average IQ scores vary by whether the man grew up
near a four-year college. What do you conclude?
c. Now, regress IQ on nearc4, smsa66, and the 1966 regional dummy variables
reg662, ..., reg669. Are IQ and nearc4 related after the geographic dummy
variables have been partialled out?
d. From b) and c), what do you conclude about the importance of controlling for
smsa66 and the 1966 regional dummies in the $\log(\mathit{wage})$ equation?
C15.3
The purpose of this exercise is to compare the estimates and standard errors obtained by
correctly using 2SLS with those obtained using inappropriate procedures. Use the data
file WAGE2.dta.
a. Use a 2SLS routine to estimate the equation
$$\log(\mathit{wage}) = \beta_0 + \beta_1\,\mathit{educ} + \beta_2\,\mathit{exper} + \beta_3\,\mathit{tenure} + \beta_4\,\mathit{black} + u$$
where sibs is the IV for educ. Report the results in the usual form.
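b. Now, manually carry out 2SLS. That is, first regress educ on sibs, exper, tenure
and black and obtain the fitted values, $\widehat{\mathit{educ}}$. Then run the second stage
regression of $\log(\mathit{wage})$ on $\widehat{\mathit{educ}}$, exper, tenure and black. Verify that the $\hat{\beta}_j$ are identical to
those obtained from a), but that the standard errors are somewhat different. The
standard errors obtained from the second stage regression when manually carrying
out 2SLS are generally inappropriate.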
c. Now, use the following two-step procedure, which generally yields inconsistent
parameter estimates of the $\beta_j$, and not just inconsistent standard errors. In step one,
regress educ on sibs only and obtain the fitted values, say $\widetilde{\mathit{educ}}$. In step two,
run the regression of $\log(\mathit{wage})$ on $\widetilde{\mathit{educ}}$, exper, tenure and black. Compare the estimate of the return
to education from this incorrect procedure with that from the proper procedure of
a).
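The claim in C15.3 b) that manual 2SLS reproduces the IV point estimate can be illustrated in the simplest setting: one endogenous regressor, one instrument, and no other controls. The sketch below is not the model of this exercise, which has additional exogenous regressors; it only shows the mechanics. All names and numbers are made up for illustration.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def ols_simple(y, x):
    """Intercept and slope from a simple OLS regression of y on x."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def manual_2sls_slope(y, x, z):
    """Two manual stages: regress x on z, then y on the fitted values."""
    a, b = ols_simple(x, z)              # first stage: x on z
    x_hat = [a + b * zi for zi in z]     # fitted values of x
    _, slope = ols_simple(y, x_hat)      # second stage: y on fitted x
    return slope


def iv_slope(y, x, z):
    """Simple IV slope: sample cov(z, y) divided by sample cov(z, x)."""
    zbar, ybar, xbar = mean(z), mean(y), mean(x)
    num = sum((zi - zbar) * (yi - ybar) for zi, yi in zip(z, y))
    den = sum((zi - zbar) * (xi - xbar) for zi, xi in zip(z, x))
    return num / den


# Made-up numbers, not from WAGE2.dta.
z = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [2.0, 1.0, 4.0, 3.0, 6.0]
y = [3.0, 2.0, 5.0, 6.0, 9.0]
two_stage = manual_2sls_slope(y, x, z)
direct_iv = iv_slope(y, x, z)
print(two_stage, direct_iv)
```

The two slopes agree up to floating-point error. The second-stage standard errors are a different matter: they are computed from residuals based on the fitted values rather than the original regressor, which is why the exercise calls them inappropriate.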