Solutions - PS5 (DAE)
SOLUTIONS PS5:
Estimation Problems
a- The first empirical studies aimed at measuring the impact of class size on education
performance were based on data comparing the grades in comprehensive tests
achieved by students from different schools and different class sizes. If we aimed at
measuring the relationship between class size and academic performance with such
data, could we infer that size has a causal effect on performance? Justify.
b- The presence of more policemen to fight crime is a matter of controversy. Suppose
that we have data for all the capital cities in France about crime incidence per 10,000
inhabitants and number of police units per 10,000 inhabitants. With such data, could
we obtain the causal effect of police surveillance on crime incidence? Explain.
c- Suppose that there is a strong positive correlation between the number of
children's books in a home and the academic performance of the children in that
home. Could you say that the number of children's books at home has a positive
causal effect on the academic performance of the children in that home? Justify.
2 Suppose you are interested in estimating the effect of hours spent in a SAT
preparation course (hours) on total SAT score (sat). The population is all college-bound high
school seniors for a particular year.
a- Suppose you are given a grant to run a controlled experiment. Explain how you would
structure the experiment in order to estimate the causal effect of hours on sat.
b- Consider the more realistic case where students choose how much time to spend in a
preparation course, and you can only randomly sample sat and hours from the
population. Write the population model as:
$sat_i = \beta_0 + \beta_1\,hours_i + u_i$
List at least two factors contained in the random disturbance term. Are these likely
to have a positive or negative correlation with hours? Explain.
3 The following equation describes the number of hours of television watched per
week by a child as a function of his age, his education, his mother's education, his father's
education and the number of siblings:
We suspect the dependent variable contains a certain error of measurement. Explain the
consequences in your estimation results.
Since the dependent variable contains measurement error, the main consequence for
your estimation results is that your estimators are less efficient, that is, their
standard errors will be higher than they would be without the error. Provided the
measurement error is uncorrelated with the explanatory variables, the OLS estimators
remain unbiased.
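The efficiency loss described above can be illustrated with a small simulation. This is only a sketch on synthetic data: the coefficients (1 and 2), the noise levels and the sample size are assumptions chosen for illustration, not values from the problem set.

```python
import random
import statistics

def slope_and_se(x, y):
    """OLS slope and its standard error for y = b0 + b1*x + u."""
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    b0 = my - b1 * mx
    ssr = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    se = (ssr / (n - 2) / sxx) ** 0.5
    return b1, se

rng = random.Random(0)  # fixed seed so the run is reproducible
n = 500
x = [rng.uniform(0, 10) for _ in range(n)]
# Synthetic "true" dependent variable (assumed model: y = 1 + 2x + u).
y_true = [1.0 + 2.0 * xi + rng.gauss(0, 1) for xi in x]
# The same variable observed with measurement error unrelated to x.
y_obs = [yi + rng.gauss(0, 3) for yi in y_true]

b_true, se_true = slope_and_se(x, y_true)
b_obs, se_obs = slope_and_se(x, y_obs)
print(f"no measurement error:   slope={b_true:.3f}, se={se_true:.4f}")
print(f"with measurement error: slope={b_obs:.3f}, se={se_obs:.4f}")
```

Both slopes should stay close to the assumed value of 2, while the standard error is clearly larger when the dependent variable is measured with error.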
X: Family income.
P: Price index.
Two different regressions are estimated with the following estimation results (standard errors
are in brackets and sample size is 500):
Find and discuss the specification error from which the first model suffers. Explain it using the
estimation results in the table above.
The estimation problem from which the first regression model suffers is the omission of
a relevant explanatory variable (X). You can infer this problem because (1) the adjusted
R² is lower in the first model than in the second model, so introducing X in the second
model increases the explanatory power (X being relevant), or (2) you can perform a t-test
on X in the second model to see that it is individually significant.
You can also see that omitting the variable X in the first regression model causes the
OLS estimator of the effect of P to be overestimated. In other words, the estimated
effect of P on Y is greater than it should be: b2 = 2.462 in the first model whereas
b2 = −0.739 in the second model. Additionally, the efficiency of the OLS estimator
in the first model is lower than in the second (compare the standard errors for b2 in
both models). Finally, the determination coefficient in model one is unreliable because
you are omitting a relevant factor, which makes the explanatory power of the model lower
than it is once X is introduced.
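The direction of the bias can be read from the standard omitted-variable formula. The notation below is an assumption, since the estimated equations are not reproduced in the text: the full model is taken to be Y = β1 + β2 P + β3 X + u, and b̃2 denotes the estimator of the coefficient on P when X is left out.

```latex
E(\tilde{b}_2) = \beta_2 + \beta_3 \, \frac{\operatorname{Cov}(P, X)}{\operatorname{Var}(P)}
```

Under these assumptions, an estimate that is too high in the first model (2.462 versus −0.739) means that the bias term is positive, that is, β3 and the covariance between P and X have the same sign.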
a- Find the assumption that does not hold in this model and explain why.
That is, there is a perfect linear relationship among the explanatory variables
included in the above regression model. Therefore, the OLS minimization problem has
no unique solution and the coefficients cannot be estimated.
b- How would you rewrite the model in order to solve the problem?
One possible solution would be dropping one regressor from the model such as:
Knowing that the standard error associated to the effect of drive is 0.065:
a.- Interpret the effect of the drive variable and test its individual significance at
the 5% significance level. Do you think it is realistic?
We reject the null hypothesis since −2.323 < −1.976 at the 5% significance
level and therefore driving to campus is a statistically significant
explanatory variable that helps us to understand variations in GPA.
Given that the correlation between students' age and drive is 0.57 and the correlation
between GPA and age is -0.48:
b.- Explain the potential problem the above model may suffer.
Given the above correlations, it is likely that driving is not the factor affecting
grades, but rather a third variable correlated both with GPA and with driving to campus.
Maybe older students are more likely to drive to campus and are also more likely to get
lower grades. You first check the correlation coefficients to test this hypothesis. Indeed,
the variable age is positively correlated with the variable drive (the correlation is 0.57)
and negatively correlated with GPA (the correlation is -0.48). Therefore, the estimation
problem is one of spurious correlation.
c.- Knowing that in this model drive is statistically insignificant and age is
individually significant, is your answer in question b consistent with the
second model estimation results? Explain.
It is consistent because, according to the results of the second model, age could indeed
be the variable that was omitted from Model 1 and was driving the spurious
correlation. Once age is included in the regression, the effect of driving to campus
becomes statistically insignificant.
7 We have estimated a SLRM explaining office rental prices in the city of Madrid (Y)
with the information contained in distance to the city center (X). The following two graphs,
Figure 1 (Y versus X) and Figure 2 (residuals versus fitted values of Y), are related to the
above model.
a- Discuss, according to the two graphs, whether the model may suffer from a non-linearity
problem.
According to the first figure, the relationship between Y and X seems to be non-linear.
Moreover, it seems to exhibit decreasing returns. That is, as the distance to the city
center increases, the negative effect of distance on office rental prices seems to
decrease.
When plotting the residuals versus the fitted values (Figure 2), there seems to be a
systematic pattern between them, so they do not look independent. This signals that the
model suffers from a non-linearity problem, since we want residuals and predicted values
of the dependent variable to be independent. This figure is consistent with the analysis
of the first figure.
As you move further away from the city center, you do not expect the negative effect of
distance on rental prices to be as big as when you are very close to the city center,
and this is the reason why the relationship between rental prices and distance may
exhibit decreasing returns.
c- What should Figure 2 look like if the relationship between office rental prices and
distance were linear?
If the relationship were linear, Figure 2 should be a random cloud of points,
signaling that residuals and predicted values of the dependent variable are
independent and therefore that the linearity assumption is satisfied.
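The contrast between a random cloud and a systematic pattern can be sketched numerically. The data below are hypothetical (a price that falls with distance at a decreasing rate), not the Madrid sample, and `fit_line` is a helper written for this illustration.

```python
import statistics

def fit_line(x, y):
    """Intercept and slope of the simple OLS regression of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b1 * mx, b1

# Hypothetical data: the true relationship is non-linear (price = 100 / distance).
distance = [float(d) for d in range(1, 21)]
price = [100.0 / d for d in distance]

b0, b1 = fit_line(distance, price)
fitted = [b0 + b1 * d for d in distance]
residuals = [p - f for p, f in zip(price, fitted)]

# Residuals listed next to fitted values, as in a residuals-versus-fitted plot.
for d, f, e in zip(distance, fitted, residuals):
    print(f"distance={d:4.0f}  fitted={f:7.2f}  residual={e:7.2f}")
```

With this made-up data the residuals are positive at both ends of the range and negative in the middle, which is the kind of curved pattern that signals non-linearity; under a truly linear relationship their signs would be scattered without any such shape.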
8 The following table shows two different samples, each with two explanatory variables,
used to study the behaviour of Y (dependent variable):
                  Sample 1     Sample 2
Observation   Y   X1    X2     Z1    Z2
1             1   2     4      2     4
2             4   6     12     6     12
3             2   4     11     4     8
b- If yes, please explain the consequences in your OLS estimations in each sample.
c- If yes, please explain the strategies that you would use in order to solve the problem
in each sample.
Sample 1: We could do nothing and accept that there may be a problem; we could
transform the variables in order to solve the problem; or we could drop the explanatory
variable that is causing the imperfect multicollinearity problem.
Sample 2: We need to eliminate one of the explanatory variables (the one that is less
relevant in explaining the behaviour of the dependent variable).
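The difference between the two samples can be checked by computing the correlation between the regressors. This sketch uses only the three observations visible in the table above; `correlation` is a helper written for this illustration.

```python
def correlation(a, b):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

# Regressors as listed in the table (three observations each).
x1, x2 = [2, 6, 4], [4, 12, 11]   # Sample 1
z1, z2 = [2, 6, 4], [4, 12, 8]    # Sample 2

r_sample1 = correlation(x1, x2)
r_sample2 = correlation(z1, z2)
print(f"Sample 1: corr(X1, X2) = {r_sample1:.3f}")
print(f"Sample 2: corr(Z1, Z2) = {r_sample2:.3f}")
```

In Sample 2, Z2 is exactly twice Z1, so the correlation equals 1 (perfect multicollinearity), whereas in Sample 1 the correlation is high but below 1 (imperfect multicollinearity).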
9 Consider the regression of country level GDP per capita on percentage urban
population in several countries (1995) obtaining a determination coefficient of 0.457 and
obtaining the following graph when plotting the data (Figure 1):
a- Can you detect a non-linear relationship between the two variables? Why?
In Figure 1, we can observe that the relationship between urban population and GDP
per capita does not seem to be linear, because the positive association between the two
variables may exhibit decreasing returns. Note that, given the above graph, if
you were to plot residuals versus estimated values of the dependent variable you
should obtain some kind of pattern indicating that they are not independent, which
signals the non-linearity problem.
Solutions that can be implemented in order to solve this non-linearity problem are
the so-called transformations of variables, that is, using a log, semi-log or quadratic
specification to transform the above non-linear model into a linear relationship so
that the linearity assumption is satisfied.
Suppose now that we estimate the same model but using a semilog transformation obtaining
the following estimation results:
c- Compare the determination coefficients and the graphs between the two models. Do
you think the semilog transformation might be a good solution for the nonlinearity
problem? Explain your answer.
When taking the semi-log transformation, we see clearly that the relationship between
GDP per capita and urban population is now linear when compared with the first figure.
Therefore the semi-log transformation works well in solving the non-linearity
problem that we were suffering from before. Moreover, the determination coefficient in
the semi-log specification is higher than in the previous model, and therefore the
semi-log specification increases the explanatory power of the model since it solves the
non-linearity problem.
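The comparison of determination coefficients can be sketched on synthetic data. Everything here is an assumption for illustration: the numbers are made up, and the dependent variable is the one being logged, which may or may not match the specification estimated in the exercise.

```python
import math
import statistics

def r_squared(x, y):
    """R-squared of the simple OLS regression of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy ** 2 / (sxx * syy)

# Synthetic data: ln(gdp) is exactly linear in urban, so gdp itself is not.
urban = [float(u) for u in range(10, 95, 5)]
gdp = [math.exp(5.0 + 0.05 * u) for u in urban]

r2_linear = r_squared(urban, gdp)
r2_semilog = r_squared(urban, [math.log(g) for g in gdp])
print(f"linear model:   R2 = {r2_linear:.3f}")
print(f"semi-log model: R2 = {r2_semilog:.3f}")
```

In this constructed case the semi-log fit is essentially perfect while the linear fit is not. When the dependent variable itself is transformed, the two R² values refer to different dependent variables, so the comparison is best read as indicative and checked against the graphs.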
10 We have data for a sample of high schools in Vietnam where the variable math
denotes the percentage of students who passed a maths test. We want to estimate the effect
that spending per student has on the outcomes of this test and propose the following model:
Where poverty describes the percentage of students living below the poverty line, spend denotes
spending per student and enroll is the number of students enrolled in the high school.
a- We do not have data for poverty variable but the variable lnchprg describes the
percentage of students eligible for a program subsidizing school lunches. Why is this
variable a sensible proxy variable for poverty?
Since we do not have data for the poverty variable, we need to find a proxy (a similar
variable that captures the same effect). lnchprg is a good proxy because
students living below the poverty line will be, on average, the students eligible for the
program subsidizing school lunches.
b- The table below shows the OLS estimates with and without the inclusion of lnchprg
as an explanatory variable:
Explain why the effects of spending and enroll are greater in the first model than in
the second one.
We can conclude that the second model is a better specification than the first one
because it includes an additional relevant and significant explanatory variable, the
signs of the coefficients are the expected ones, the standard errors are smaller than
in the first model (the estimators are more efficient) and it has greater explanatory
power than the first model.
VARIABLE NAME   DESCRIPTION
a- In order to avoid specification errors, which variables would you keep in your analysis
according to practical significance? Justify your choices.
I will not take into consideration those variables that, according to real life (practical
significance), do not have any meaningful effect on property prices. The advance variable
has nothing to do with property prices: it is the price of the property that may
affect the loan amount and not the other way around. Additionally, I will also drop
those variables related to buyers' characteristics, such as the age of the buyer (buyage)
and first-time buyer (ftbuyer), since these characteristics do not affect how sellers set
property prices.
b- Explain, the process you would follow in order to specify your final model and to
choose the final variables in your model.
I would follow step-wise regression. That is, I would estimate a model with all
the variables that I did not drop because of practical significance (in order to avoid
the omission of relevant variables). Then, I would use the individual t-test to keep
those independent factors that are individually significant and drop those that are
individually insignificant. I would repeat this process until I reach a model
in which all the explanatory variables are individually significant (in order to avoid
the inclusion of irrelevant variables).
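The procedure just described can be sketched as a loop. This is a hedged illustration rather than the method used to obtain any result in this problem set: `t_stats_for` stands for whatever routine re-estimates the model, the variable names and t-statistics are made up, and dropping one variable per round (the least significant one) is a design choice of this sketch.

```python
def backward_elimination(variables, t_stats_for, critical=1.96):
    """Drop the least significant regressor until all |t| reach `critical`.

    `t_stats_for` is supplied by the caller: given a list of variable names,
    it re-estimates the model and returns {name: t-statistic}.
    """
    kept = list(variables)
    while kept:
        t_stats = t_stats_for(kept)
        weakest = min(kept, key=lambda name: abs(t_stats[name]))
        if abs(t_stats[weakest]) >= critical:
            break  # every remaining regressor is individually significant
        kept.remove(weakest)
    return kept

def toy_t_stats(names):
    """Stand-in for a real estimation routine (made-up t-statistics)."""
    made_up = {"rooms": 5.2, "area": 3.1, "garage": 0.4, "floor": 1.2}
    return {name: made_up[name] for name in names}

selected = backward_elimination(["rooms", "area", "garage", "floor"], toy_t_stats)
print(selected)
```

In a real application the t-statistics change every time a variable is removed, which is why the model is re-estimated inside the loop instead of testing all variables once.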
Practical significance is about real-life arguments that you can use to choose the
regressors to be included in your model, while statistical significance is about using
hypothesis tests (based on probability criteria) to choose the right regressors.
12 We have the following information on the annual growth rates (%) of stock prices (Y)
and consumer prices (X) in different countries:
Sweden 8 4
UK 7.5 3.9
USA 9 2.1
b- Show both graphically and formally if the above data suffers from an outlier problem.
[Figure: Stock and Predicted Stock Prices; Estimation Residuals]
According to the above graphs, there may be an outlier problem related to the
observation for Israel (slightly different behaviour from the rest of the country
observations).
If z > 2.06 or z < −2.06 (critical values leaving a 2% probability in each of the right
and left tails of the normal distribution), then the data point associated
with that specific estimation residual can be considered an outlier.
Note that the standard deviation of the estimation residuals is 3.22 and the sample
mean of the estimation residuals is close to 0.
None of the normalized estimation residuals satisfies the above conditions and
therefore our model does not suffer from a significant outlier problem.
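The 2.06 critical value and the resulting rule can be checked with the standard library. The function name `is_outlier` is introduced here for illustration; the standard deviation of 3.22 is the one quoted above.

```python
from statistics import NormalDist

critical = 2.06
# Probability in one tail of the standard normal beyond the critical value.
tail = 1.0 - NormalDist().cdf(critical)
print(f"P(Z > {critical}) = {tail:.4f}")
print(f"two-tailed significance level = {2 * tail:.4f}")

def is_outlier(residual, sd_residuals, critical=2.06):
    """Flag a residual whose normalized value lies beyond the critical value.

    Assumes the residuals have mean (approximately) zero.
    """
    return abs(residual / sd_residuals) > critical

# Smallest absolute residual that would be flagged when the s.d. is 3.22.
threshold = critical * 3.22
print(f"|residual| must exceed {threshold:.2f} to be flagged")
```

Each tail holds roughly 2% of the probability, so the rule corresponds to a two-tailed test at about the 4% level.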
c- If the answer to b is positive, please explain any strategy you would perform in order
to solve the problem.
13 Imagine that you are interested in analyzing the determinants of infant mortality rates
worldwide. Using the Development Reports from the World Bank in 2013, you get the
following information for 248 countries:
IMR Infant Mortality rate - is the number of deaths of infants per 1,000 live births.
GDP GDP per capita (constant 2005 US$)
Source: World Bank Development Reports, 2013.
a- Have a look at the graph above. Why might Angola and Guinea be considered
outliers in this regression model? Comment on the implications of including
these two countries in the analysis.
They seem to behave slightly differently from the rest of our observations and
therefore they could be considered possible outliers within our sample. Including
outliers could distort your estimation results and lead to erroneous
conclusions. Additionally, the results are less accurate since the estimation residuals
are large.
b- Angola presents one of the highest infant mortality rates in this sample (103 per 1,000
live births). Compute the residual for this country given that our model predicts for
Angola an infant mortality rate of 28.6 per 1,000 live births.
c- Knowing that the standard deviation of the estimation residuals (using all the
observations) is 26.22, is Angola a significant outlier?
$z = \frac{b - \beta_0}{s.d.(b)} = \frac{74.4 - 0}{26.22} = 2.837$
Since 2.837 is higher than 2.06, which is the critical value at the 4% significance level
in a two-tailed normal test, we reject the null hypothesis. That is, the Angola observation
is a significant outlier, consistent with our previous analysis.
d- What about Guinea? Note that the estimation residual associated with the Guinea
observation is 52.
This is about performing the normal test for outliers again, now for Guinea:
$z = \frac{52 - 0}{26.22} = 1.983$
Since 1.983 is lower than 2.06, which is the critical value at the 4% significance level
in a two-tailed normal test, we fail to reject the null hypothesis. That is, the Guinea
observation is not a significant outlier, even if it looks like one in the above figure.
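The calculations in parts b to d can be reproduced in a few lines. The figures (103, 28.6, 52, 26.22 and the 2.06 critical value) come from the exercise; the variable names are chosen for this sketch, and the residual mean is taken to be zero.

```python
sd_residuals = 26.22   # standard deviation of the estimation residuals
critical = 2.06        # two-tailed critical value used in this problem set

# Residual = observed minus predicted infant mortality rate.
angola_residual = 103 - 28.6
guinea_residual = 52.0

z_scores = {
    "Angola": angola_residual / sd_residuals,
    "Guinea": guinea_residual / sd_residuals,
}

for country, z in z_scores.items():
    verdict = "significant outlier" if abs(z) > critical else "not a significant outlier"
    print(f"{country}: z = {z:.3f} -> {verdict}")
```

The Angola residual works out to 74.4 deaths per 1,000 live births, the value that enters the test in part c.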
14 We have representative data for 30-year-olds in the US. Levine, Gustafson and
Velenchik (1997) estimated a wage equation using the following variables:
Y = ln(wage)
ED = years of education
(se=0.031)
(se=0.021) (se=0.0004)
Coefficient of determination = 0.68
Compare the two fitted models and explain what happens when we omit one relevant variable (in this
case, years of education).
All of the above is indicative of model 1 suffering the omission of a relevant factor (years of
education).
15 Consider the following regression model to analyze salaries with respect to age groups
in a sample of 150 individuals:
$w = \beta_1 d_1 + \beta_2 d_2 + \beta_3 d_3 + u$
Such that d1 refers to individuals between 20-30 years old, d2 accounts for individuals between
30-40 years old and finally, d3 captures individuals between 40-50 years old.
In order to test for heteroscedasticity, the second age group is eliminated and we run two
different regression models (with the same specification): the first for the youngest
group (with 50 observations and SSR1=234) and the second for the oldest group (with 50
observations and SSR2=387).
16 A time series analysis is conducted to study the relationship between consumption and
income in a country. The analysis is carried out using yearly observations from 1959 until
1988. Since the researcher suspects heteroscedasticity issues, she decides to perform
the Goldfeld-Quandt test (GQT) by running the same simple linear regression model, first
using yearly data from 1959 until 1971 and secondly using yearly observations from 1976
until 1988. She obtains an SSR of 2532224 in the first subsample and an SSR of 10339356
in the second subsample.
a- Help her by testing heteroscedasticity at 5% significance level.
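A minimal sketch of how this test could be computed follows. It assumes 13 yearly observations in each subsample (1959 to 1971 and 1976 to 1988) and two estimated parameters per regression (intercept and slope); the 5% critical value of about 2.82 for F(11, 11) is taken from standard F tables and should be checked against your own table. The function name is chosen for this illustration.

```python
def goldfeld_quandt(ssr_low, ssr_high, n_low, n_high, k):
    """F statistic for the Goldfeld-Quandt test and its degrees of freedom.

    The subsample suspected of having the larger variance goes in the numerator.
    """
    df_low, df_high = n_low - k, n_high - k
    f_stat = (ssr_high / df_high) / (ssr_low / df_low)
    return f_stat, df_high, df_low

n1 = 1971 - 1959 + 1   # yearly observations in the first subsample
n2 = 1988 - 1976 + 1   # yearly observations in the second subsample

f_stat, df_num, df_den = goldfeld_quandt(2532224, 10339356, n1, n2, k=2)
f_critical = 2.82      # approximate 5% critical value of F(11, 11), from tables

print(f"F = {f_stat:.3f} with ({df_num}, {df_den}) degrees of freedom")
print("reject homoscedasticity" if f_stat > f_critical else "fail to reject homoscedasticity")
```

Under these assumptions the statistic is well above the critical value, so the null hypothesis of homoscedasticity would be rejected at the 5% level.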
b- Explain the implications in the estimation results of the model of your result in the
above question.
The OLS estimated coefficients remain unbiased but will be inefficient, and the usual
standard errors are biased, so the t-tests will be invalid.
17 We have data regarding profits and sales for 20 companies and are interested in
estimating a model explaining profits with sales. The following graph shows the
existing relationship between the two variables.
a- Interpret the above diagram. Could you infer any possible potential problem?
According to the above graph, it seems that the dispersion increases as sales
increase, and therefore this may be a case of heteroscedasticity.
b- Ordering the sample from the lowest to the highest sales and dropping the six central
observations, we obtain SSR1=1.134 for the first observations when running a SLRM
explaining profits with sales and SSR2=32.934 when estimating the same model
for the last observations. Test for heteroscedasticity at the 5% significance level.
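A worked version of the test, under stated assumptions: dropping the six central observations from 20 leaves 14, taken here as 7 in each subsample, and each SLRM estimates k = 2 parameters, so each sum of squared residuals has 7 − 2 = 5 degrees of freedom. The critical value of about 5.05 for F(5, 5) at the 5% level comes from standard F tables.

```latex
F = \frac{SSR_2 / (n_2 - k)}{SSR_1 / (n_1 - k)}
  = \frac{32.934 / 5}{1.134 / 5}
  \approx 29.04 > F_{0.05}(5, 5) \approx 5.05
```

Under these assumptions the null hypothesis of homoscedasticity is rejected at the 5% level, which agrees with the increasing dispersion seen in the graph.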
B- A situation in which measures of two or more variables are statistically related but
are not in fact causally linked because the statistical relationship is caused by a third
omitted variable is called:
a- Partial correlation
b- Linear correlation
c- Spurious correlation
d- Marginal correlation
C- Step-wise regression is the most widely used search procedure of developing the
……….. regression model without examining all possible models.
a- worst
b- best
c- medium
d- least
a- The dependent variable is highly correlated with the explanatory variables included
in the regression model.
b- There is a high degree of correlation between the explanatory variables
included in a multiple regression model.
c- The application of a multiple regression model yields estimates that are nonlinear
in form.
d- None of the above.
G- If your dataset has heteroscedasticity, but you completely ignore the problem and
use OLS, you will