Solutions - PS5 (DAE)

DAE: Solutions PS5
SOLUTIONS PS5:
Estimation Problems
Professor: Rodrigo Alegría
1 Answer to the following three questions:
a- The first empirical studies aimed at measuring the impact of class size on education
performance were based on data comparing the grades in comprehensive tests
achieved by students from different schools and different class sizes. If we aimed at
measuring the relationship between class size and academic performance with such
data, could we infer that size has a causal effect on performance? Justify.
b- The presence of more policemen to fight crime is a matter of controversy. Suppose
that we have data for all the capital cities in France about crime incidence per 10,000
inhabitants and number of police units per 10,000 inhabitants. With such data, could
we obtain the causal effect of police surveillance on crime incidence? Explain.
c- Suppose that there is a strong positive correlation between the number of
children's books in a home and the academic performance of the children in that
home. Could you say that the number of children's books at home has a positive
causal effect on the academic performance of the children in that home? Justify.
2 Suppose you are interested in estimating the effect of hours spent in a SAT
preparation course (hours) on total SAT score (sat). The population is all college-bound high
school seniors for a particular year.
a- Suppose you are given a grant to run a controlled experiment. Explain how you would
structure the experiment in order to estimate the causal effect of hours on sat.
b- Consider the more realistic case where students choose how much time to spend in a
preparation course, and you can only randomly sample sat and hours from the
population. Write the population model as:
$sat_i = \beta_0 + \beta_1 hours_i + u_i$
List at least two factors contained in the random disturbance term. Are these likely
to have positive or negative correlation with hours? Explain.
3 The following equation describes the number of hours of television watched per
week by a child as a function of his age, his education, his mother´s education, his father´s
education and the number of siblings:
$tvhours^* = \alpha + \beta_1 age + \beta_2 age^2 + \beta_3 mothedu + \beta_4 fathedu + \beta_5 sibs + u$
We suspect the dependent variable contains a certain error of measurement. Explain the
consequences in your estimation results.
Measurement error in the dependent variable is absorbed into the error term. Provided
that the measurement error is uncorrelated with the explanatory variables, the OLS
estimators remain unbiased and consistent, but they are less efficient: the error
variance is larger, so their standard errors will be higher than they would be without
measurement error.
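The loss of efficiency can be sketched with a short derivation. This assumes classical measurement error; the symbols $tvhours$ (true, unobserved hours), $e$ (the measurement error) and $x$ (the set of regressors) are notation introduced here for illustration and do not appear in the problem statement.

```latex
\begin{align*}
tvhours^* &= tvhours + e, \qquad E[e \mid x] = 0, \quad \operatorname{Cov}(u, e) = 0 \\
tvhours^* &= \alpha + \beta_1 age + \beta_2 age^2 + \beta_3 mothedu
             + \beta_4 fathedu + \beta_5 sibs + (u + e) \\
\operatorname{Var}(u + e) &= \sigma_u^2 + \sigma_e^2 > \sigma_u^2
\end{align*}
```

Under these assumptions the composite error $u + e$ still has zero conditional mean, so the coefficients are not biased, but its larger variance inflates the standard errors.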
4 We have the following variables:
Y: Food expenditure in USA.
X: Family income.
P: Price index.
Two different regressions are estimated with the following estimation results (standard errors
are in brackets and sample size is 500):
Regression    Coefficient for X    Coefficient for P    Adjusted determination coefficient
Y / P         -                    2.462 (0.407)        0.614
Y / X; P      0.112 (0.003)        -0.739 (0.114)       0.978
Find and discuss the specification error the first model is suffering. Explain it using the estimation
results of the above table.
The estimation problem that the first regression model suffers from is the omission of
a relevant explanatory variable (X). You can infer this because (1) the adjusted
determination coefficient is lower in the first model (0.614) than in the second (0.978),
so introducing X increases the explanatory power (X is relevant), and (2) a t-test on
X in the second model shows that it is individually significant (t = 0.112/0.003 ≈ 37.3).
You can also see that omitting X in the first regression model biases the OLS
estimator of the effect of P upwards. In other words, the estimated effect of P on Y is
greater than it should be: b2 = 2.462 in the first model, whereas b2 = −0.739 in the
second model, so even the sign changes. Additionally, the OLS estimator
in the first model is less precise than in the second (compare the standard errors for b2:
0.407 versus 0.114). Finally, the determination coefficient of the first model is
unreliable: because a relevant factor is omitted, the explanatory power of the model is
lower than it would be if X were included.
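The individual significance tests behind this discussion can be reproduced with a few lines of Python. This is a minimal sketch: the helper name `t_statistic` and the critical value of 1.96 (a common approximation for a two-tailed 5% test with a sample of 500) are assumptions made here, while the coefficients and standard errors come from the table above.

```python
def t_statistic(coefficient, standard_error, hypothesized=0.0):
    """t statistic for H0: beta = hypothesized."""
    return (coefficient - hypothesized) / standard_error


# (coefficient, standard error) pairs taken from the table above.
models = {
    "Y / P": {"P": (2.462, 0.407)},
    "Y / X; P": {"X": (0.112, 0.003), "P": (-0.739, 0.114)},
}

# Assumed approximate two-tailed 5% critical value for a large sample.
T_CRIT = 1.96

t_values = {
    model: {name: t_statistic(b, se) for name, (b, se) in coefs.items()}
    for model, coefs in models.items()
}

for model, stats in t_values.items():
    for name, t in stats.items():
        verdict = "significant" if abs(t) > T_CRIT else "not significant"
        print(f"{model}: t({name}) = {t:.2f} ({verdict})")
```

The very large t statistic for X in the second regression is what identifies X as a relevant variable that the first regression omits.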
5 An econometric study at IE University relates the average grade in
Econometrics to the time students spend on different activities during the week. Some
students are asked how many hours they spend on four different activities: study,
sleep, work and leisure. Every activity must be included in one of these four categories, so
the time spent on the four activities adds up to 168 hours for each student.
The model is the following:
𝐴𝐺𝐸 = 𝛽0 + 𝛽1𝑠𝑡𝑢𝑑𝑦 + 𝛽2𝑠𝑙𝑒𝑒𝑝 + 𝛽3𝑤𝑜𝑟𝑘 + 𝛽4𝑙𝑒𝑖𝑠𝑢𝑟𝑒 + 𝑢
a- Find the assumption that does not hold in this model and explain why.
The above model suffers a perfect multicollinearity problem because:
𝒔𝒕𝒖𝒅𝒚 + 𝒔𝒍𝒆𝒆𝒑 + 𝒘𝒐𝒓𝒌 + 𝒍𝒆𝒊𝒔𝒖𝒓𝒆 = 𝟏𝟔𝟖 ∀𝒊
That is, there is a perfect linear relationship among the explanatory variables
included in the above regression model: their sum is always 168 times the constant
regressor. Therefore, the OLS minimization problem has no unique solution and the
coefficients cannot be estimated.
b- How would you rewrite the model in order to solve the problem?
One possible solution is to drop one regressor from the model (for example, leisure), so
that the remaining coefficients are interpreted relative to the omitted activity:
𝑨𝑮𝑬 = 𝜷0 + 𝜷𝟏𝒔𝒕𝒖𝒅𝒚 + 𝜷𝟐𝒔𝒍𝒆𝒆𝒑 + 𝜷𝟑𝒘𝒐𝒓𝒌 + 𝒖
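The rank problem can be checked numerically. The sketch below uses made-up weekly hours for five hypothetical students (not data from the study) and an exact Gaussian elimination written with the standard library; `column_rank` is a helper name chosen here for illustration.

```python
from fractions import Fraction


def column_rank(rows):
    """Rank of a matrix given as a list of rows, using exact arithmetic."""
    m = [[Fraction(v) for v in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Find a row at or below the current rank with a non-zero entry.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


# Hypothetical (study, sleep, work) hours; leisure is whatever is left of 168.
hours = [(20, 56, 10), (30, 50, 0), (10, 60, 20), (25, 49, 15), (15, 55, 5)]

# Design matrix with a constant and all four activities.
full = [[1, s, sl, w, 168 - s - sl - w] for s, sl, w in hours]
# Design matrix after dropping leisure.
reduced = [[1, s, sl, w] for s, sl, w in hours]

rank_full = column_rank(full)
rank_reduced = column_rank(reduced)
print(f"full model: {len(full[0])} columns, rank {rank_full}")
print(f"without leisure: {len(reduced[0])} columns, rank {rank_reduced}")
```

The full design matrix has five columns but only rank four, so the normal equations cannot be solved uniquely; after dropping leisure the matrix has full column rank.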
6 On a university campus someone argues that going to class by car lowers
grades. In order to test this hypothesis, we have estimated a model, using 141
students, explaining GPA with a dummy variable taking a value of 1 if the student
goes to class by car and 0 otherwise, obtaining the following regression results:
GPA = 3.43 − 0.151drive
Knowing that the standard error associated with the effect of drive is 0.065:
a.- Interpret the effect of the drive variable and test its individual significance at
the 5% significance level. Do you think it is realistic?
$H_0: \beta = 0, \quad H_a: \beta \neq 0, \qquad t_s = -0.151/0.065 = -2.323$
$t_c = \pm 1.977$ (two-tailed t-test with 139 degrees of freedom at 5%)
We reject the null hypothesis at the 5% significance level since −2.323 < −1.977.
Therefore, driving to campus is a statistically significant
explanatory variable that helps us to understand variations in GPA.
Interpretation: a student who drives to campus has, on average, a GPA
0.151 points lower than other students. This effect is not realistic, as one
should not expect driving to campus to affect academic performance in a
significant way.
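The test can be reproduced as follows. This is a sketch using only the standard library: the critical value of roughly 1.98 is read from a t table (an assumption, since the standard library has no t distribution), and the p-value uses a normal approximation that is reasonable with 139 degrees of freedom but not exact.

```python
from statistics import NormalDist

coefficient = -0.151      # estimated effect of drive
standard_error = 0.065    # its standard error

# Assumed two-tailed 5% critical value for 139 degrees of freedom (t table).
T_CRIT = 1.98

t_stat = coefficient / standard_error
reject_h0 = abs(t_stat) > T_CRIT

# Normal approximation to the two-tailed p-value (slightly optimistic).
p_value_normal = 2 * NormalDist().cdf(-abs(t_stat))

print(f"t = {t_stat:.3f}, reject H0 at 5%: {reject_h0}")
print(f"approximate two-tailed p-value: {p_value_normal:.4f}")
```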
Given that the correlation between students' age and drive is 0.57 and the correlation
between GPA and age is -0.48:
b.- Explain the potential problem the above model may suffer.
Given the above correlations, it is likely that driving is not the factor affecting grades,
but rather a third variable correlated with both GPA and driving to campus. Older students
may be more likely to drive to campus and also more likely to get lower
grades. The correlation coefficients support this hypothesis: the variable age is
positively correlated with the variable drive (the correlation is 0.57)
and negatively correlated with GPA (the correlation is -0.48). Therefore, the estimation
problem is a spurious correlation caused by the omission of a relevant variable (age).
A new regression model is estimated such that:
GPA = 3.43 − 0.151drive − 0.056age
c.- Knowing that in this model drive is statistically insignificant and age is
individually significant, is your answer in question b consistent with the
second model estimation results? Explain.
It is consistent: according to the results of the second model, age could indeed
be the variable that was omitted from Model 1 and was driving the spurious
correlation. Once age is included in the regression, the effect of driving to campus
becomes statistically insignificant.
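The direction of the bias in the first model can be sketched with the standard omitted-variable-bias formula. The notation is introduced here for illustration and is not part of the problem set: $\tilde{b}_1$ is the slope of drive when age is omitted, $\beta_1$ and $\beta_2$ are the true effects of drive and age, and $\delta_1$ is the slope of a regression of age on drive.

```latex
E[\tilde{b}_1] = \beta_1 + \beta_2\,\delta_1,
\qquad
\delta_1 = \frac{\operatorname{Cov}(drive, age)}{\operatorname{Var}(drive)}
```

With $\beta_2 < 0$ (older students obtain lower grades) and $\delta_1 > 0$ (older students are more likely to drive), the bias term $\beta_2\,\delta_1$ is negative, so the coefficient on drive in the first model looks more negative than the true effect.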
7 We have estimated a SLRM explaining office rental prices in the city of Madrid (Y)
with the information contained in distance to the city center (X). The following two graphs:
Figure 1( Y versus X) and Figure 2 (residuals versus fitted values of Y) are related to the above
model.
a- Discuss, according to the two graphs, whether the model may suffer from a non-linearity problem.
According to the first figure, the relationship between Y and X appears to be
non-linear. Moreover, it seems to exhibit decreasing returns. That is,
as the distance to the city center increases, the negative effect of distance on office
rental prices seems to decrease.
When plotting the residuals versus the fitted values (Figure 2), there seems to be a
systematic relationship between them: the residuals are not scattered randomly around
zero. (Their sample covariance with the fitted values is zero by construction in OLS, so
it is the pattern that matters.) This is a sign that the model suffers from a non-linearity
problem, as we want the residuals to show no relationship with the predicted values of
the dependent variable. This figure is consistent with the analysis of the first figure.
b- Provide an economic reason explaining the possible non-linearity in the above

relationship.
As you move further away from the city center, you do not expect the negative effect of
distance on rental prices to be as large as when you are very close to the city center.
This is why the relationship between rental prices and distance may
exhibit decreasing returns.
c- What should Figure 2 look like if the relationship between office rental prices and
distance were linear?
If the relationship were linear, Figure 2 should be a random cloud of points,
signaling that residuals and predicted values of the dependent variable are
independent and therefore that the linearity assumption is satisfied.
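The residual pattern described above can be illustrated with a small simulation. The distances and rents below are invented for the example (rent falls with distance at a decreasing rate); they are not the Madrid data, and `ols_fit` is a helper name chosen here.

```python
from statistics import mean


def ols_fit(x, y):
    """Intercept and slope of a simple linear regression of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical data: distance in km and a rent that decays non-linearly.
distance = [1, 2, 3, 4, 5, 6, 7, 8, 9]
rent = [100 / d for d in distance]

intercept, slope = ols_fit(distance, rent)
fitted = [intercept + slope * d for d in distance]
residuals = [r - f for r, f in zip(rent, fitted)]

# Average residual for offices near, at medium distance and far from the center.
near = mean(residuals[:3])
middle = mean(residuals[3:6])
far = mean(residuals[6:])
print(f"slope = {slope:.2f}")
print(f"mean residual: near {near:.2f}, middle {middle:.2f}, far {far:.2f}")
```

A straight line fitted to this curved relationship leaves positive residuals at both ends and negative residuals in the middle, which is the kind of systematic pattern that a residuals-versus-fitted plot reveals. With a truly linear relationship the three averages would all be close to zero.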
8 The following table shows two different samples, each with two explanatory variables,
used to study the behaviour of Y (dependent variable):
                     Sample 1      Sample 2
Observation    Y     X1    X2      Z1    Z2
1              1     2     4       2     4
2              4     6     12      6     12
3              2     4     11      4     8
a- Can you detect a multicollinearity problem in any of the two samples?
Sample 1: there is a problem of imperfect multicollinearity since the linear correlation
coefficient between the two explanatory variables is approximately 0.92 (strong and
positive linear association).
Sample 2: there is a problem of perfect multicollinearity since the linear correlation
coefficient between the two explanatory variables is 1 (perfect and positive linear
association, because Z2 = 2·Z1).
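The two correlation coefficients can be verified directly from the table. The helper `pearson_correlation` is written here for illustration with the standard library only.

```python
from math import sqrt


def pearson_correlation(a, b):
    """Linear correlation coefficient between two equally long lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


# Explanatory variables from the table above.
x1, x2 = [2, 6, 4], [4, 12, 11]   # Sample 1
z1, z2 = [2, 6, 4], [4, 12, 8]    # Sample 2

r_sample_1 = pearson_correlation(x1, x2)
r_sample_2 = pearson_correlation(z1, z2)
print(f"Sample 1: r = {r_sample_1:.3f}")
print(f"Sample 2: r = {r_sample_2:.3f}")
```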
b- If yes, please explain the consequences in your OLS estimations in each sample.
Sample 1: the estimates may be unstable (unexpected signs and magnitudes) and
imprecise, with large standard errors.
Sample 2: there is no unique solution to the OLS minimization problem.
c- If yes, please explain the strategies that you would use in order to solve the problem
in each sample.
Sample 1: We could do nothing and accept that there may be a problem; we could
transform the variables in order to solve the problem or drop one explanatory variable
from the model that is causing the imperfect multicollinearity problem.
Sample 2: We need to eliminate one of the explanatory variables (the less relevant in
explaining the behaviour of the dependent variable).
9 Consider the regression of country level GDP per capita on percentage urban
population in several countries (1995) obtaining a determination coefficient of 0.457 and
obtaining the following graph when plotting the data (Figure 1):
Figure 1: GDPpc versus % urban pop
a- Can you detect a non-linear relationship between the two variables? Why?
In Figure 1, we can observe that the relationship between urban population and GDP
per capita does not seem to be linear, because the positive effect of GDP per capita on
urban population may exhibit decreasing returns. Note that, given the above graph, if
you were to plot residuals versus estimated values of the dependent variable, you
would obtain a systematic pattern rather than a random cloud of points, which is a sign
of the non-linearity problem.
b- Can you explain solutions to be implemented in order to solve the non-linearity

problem?
Solutions that can be implemented to solve this non-linearity problem are
the so-called transformations of variables, that is, using a log, semi-log or quadratic
specification so that the model becomes linear in the parameters and
the linearity assumption is satisfied.
Suppose now that we estimate the same model but using a semilog transformation obtaining
the following estimation results:
$\log(GDPpc_i) = 4.631 + 0.052\,urban_i, \qquad R^2 = 0.549$
and obtaining Figure 2 when plotting the data:
Figure 2: logGDPpc versus % urban pop.
c- Compare the determination coefficients and the graphs between the two models. Do
you think the semilog transformation might be a good solution for the nonlinearity
problem? Explain your answer.
When taking the semi-log transformation, we see clearly that the relationship between
the log of GDP per capita and urban population is now approximately linear compared
with Figure 1. Therefore, the semi-log transformation works well in solving the
non-linearity problem that we had before. Moreover, the determination coefficient in the
semi-log specification (0.549) is higher than in the previous model (0.457), which
suggests that the semi-log specification fits the data better, although the two
coefficients are not strictly comparable because the dependent variables differ.
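The slope of the semi-log model has a percentage interpretation, which the following sketch computes. It assumes that the logarithm in the estimated equation is the natural logarithm; the problem set does not state the base.

```python
from math import exp

slope = 0.052  # coefficient on urban in the semi-log estimation above

# Usual approximation: 100 * slope percent per extra percentage point of urban.
approx_pct = 100 * slope

# Exact percentage change implied by the model: 100 * (e^slope - 1).
exact_pct = 100 * (exp(slope) - 1)

print(f"approximate effect: {approx_pct:.2f}% per percentage point")
print(f"exact effect: {exact_pct:.2f}% per percentage point")
```

Under that assumption, one additional percentage point of urban population is associated with GDP per capita that is roughly 5% higher, whatever the starting level, which is what makes the effect non-constant in absolute terms.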
10 We have data for a sample of high schools in Vietnam where the variable math
denotes the percentage of students who passed a maths test. We want to estimate the effect
that spending per student has on the outcomes of this test and propose the following model:
$math = \beta_0 + \beta_1 \log(spend) + \beta_2 \log(enroll) + \beta_3 poverty + u$
Where poverty describes the percentage of students living below the poverty line, spend denotes
spending per student and enroll is the number of students enrolled in the high school.
a- We do not have data for poverty variable but the variable lnchprg describes the
percentage of students eligible for a program subsidizing school lunches. Why is this
variable a sensible proxy variable for poverty?
Since we do not have data for the poverty variable, we need to find a proxy (a
variable that is closely related to it and captures the same effect). lnchprg is a good
proxy because students living below the poverty line will be, on average, the students
eligible for the program subsidizing school lunches.
b- The table below shows the OLS estimates with and without the inclusion of lnchprg
as an explanatory variable:
Explanatory variables        (1)        (2)
log(spend)                   11.13      7.75
                             (3.30)     (3.04)
log(enroll)                  0.022      -1.26
                             (0.615)    (0.58)
lnchprg                      -          -0.324
                                        (0.036)
intercept                    -69.24     -23.14
                             (26.74)    (24.99)
n                            408        408
Determination coefficient    0.0293     0.1893
Explain why the effects of spending and enrollment are greater in the first model than in
the second one.
In the above table we have a problem of omission of a relevant explanatory variable.
The first model omits the lnchprg variable (a significant explanatory variable in the
second model: t = −0.324/0.036 = −9). One consequence of omitting relevant explanatory
variables is that the OLS coefficients are biased. In our example, the coefficients
associated with spend and enroll are biased upwards (11.13 versus 7.75 and 0.022 versus
−1.26), therefore overestimating the effect of both variables on the dependent
variable. In addition, comparing the standard errors associated with each of the
explanatory variables, we can see that the estimates of the first model are less precise
(standard errors are greater than in the second model).
c- What conclusions can you derive when comparing both models?
We can conclude that the second model is a better specification than the first one
because it includes an additional relevant and significant explanatory variable, the
signs of the coefficients are the expected ones, the standard errors are smaller than
in the first model and it has greater explanatory power than the first model.
11 We want to estimate a regression model explaining the behavior of property prices

in the city of Barcelona in 2015 (cross sectional analysis). We are provided with a dataset
containing information about property, neighborhood and buyer´s characteristics that can
be used as explanatory variables. The following table describes those variables:
VARIABLE NAME    DESCRIPTION
advance Loan amount when buying the property
age Age of property
bathroom Number of bathrooms
bedroom Number of bedrooms
buyage Age of main buyer
chnone No central heating (dummy)
dcitycenter Distance to city center (km)
floorm2 Floor area of dwelling (m2)
ftbuyer First time buyer (dummy)
lagood Dwelling is in neighborhood with higher-status social housing
labad Dwelling is in neighborhood with lower-status social housing
pflat Flat/maisonnette dwelling (dummy)
psemi Semi-detached dwelling (dummy)
pdetach Detached dwelling (dummy)
pterrace Terraced dwelling (dummy)
a- In order to avoid specification errors, which variables would you keep in your analysis
according to practical significance? Justify your choices.
I would not take into consideration those variables that, according to real life (practical
significance), do not have any meaningful effect on property prices. The advance variable
has nothing to do with determining property prices: it is the price of the property that
may affect the loan amount and not the other way around. Additionally, I would also drop
the variables related to buyers' characteristics, such as the age of the main buyer
(buyage) and first time buyer (ftbuyer), since these characteristics do not affect
how sellers set property prices.
b- Explain, the process you would follow in order to specify your final model and to
choose the final variables in your model.
I would follow a stepwise regression procedure. That is, I would estimate a model with all
the variables that I did not drop because of practical significance (in order to avoid
the omission of relevant variables). Then, I would use the individual t-test to keep
those explanatory variables that are individually significant and drop those that are
individually insignificant. I would repeat this process until I reach a model
in which all the explanatory variables are individually significant (in order to avoid the
inclusion of irrelevant variables).
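The procedure can be sketched as a backward elimination loop. This is an illustration under stated assumptions, not the course's prescribed algorithm: it drops one variable at a time (the one with the smallest absolute t statistic), uses 1.96 as an approximate 5% critical value, and relies on a hypothetical `fit_t_stats` callback that would re-estimate the model and return t statistics. The t values below are invented so that the example runs without data.

```python
def backward_elimination(candidates, fit_t_stats, t_crit=1.96):
    """Drop the least significant variable until all remaining ones are significant."""
    selected = list(candidates)
    while selected:
        t_stats = fit_t_stats(selected)  # re-estimate with the current variables
        weakest = min(selected, key=lambda name: abs(t_stats[name]))
        if abs(t_stats[weakest]) >= t_crit:
            break  # every remaining variable is individually significant
        selected.remove(weakest)
    return selected


# Invented t statistics standing in for real regression output (illustration only).
FAKE_T = {"floorm2": 8.1, "bedroom": 2.4, "dcitycenter": -5.3, "age": -0.7, "chnone": -1.1}


def fake_fit(variables):
    """Placeholder for a real estimation routine; returns fixed t statistics."""
    return {name: FAKE_T[name] for name in variables}


final_model = backward_elimination(FAKE_T, fake_fit)
print(final_model)
```

In a real application the t statistics change every time a variable is removed, which is why the model is re-estimated inside the loop rather than tested once.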
c- Explain the difference between practical and statistical significance.
Practical significance is about real-life arguments that you can use to choose the
regressors to be included in your model, while statistical significance is about using
hypothesis tests (probability criteria) to choose the right regressors.
12 We have the following information for the annual growth rates (%) in different
countries about stock prices (Y) and in consumer prices (X):
Country        Stock prices (Y)   Consumer prices (X)   Predicted Y   Estimation residuals
Australia      5                  4.3
Austria        11.1               4.6
Belgium        3.2                2.4
Canada         7.9                2.4
Denmark        3.8                4.2
Finland        11.1               5.5
France         9.9                4.7
Germany        13.5               2.2
India          1.5                4
Ireland        6.4                4
Israel         8.9                8.4
Italy          8.1                3.3
Japan          13.5               4.7
Mexico         4.7                5.2
Netherlands    7.5                3.6
New Zealand    4.7                3.6
Sweden         8                  4
UK             7.5                3.9
USA            9                  2.1
Knowing that: $\hat{y}_i = 6.83 + 0.201 x_i$
Answer the following questions:
a- Complete the missing values in the above table.
Country        Stock prices (Y)   Consumer prices (X)   Predicted Y   Residual   Normalized residual
Australia      5                  4.3                   7.694         -2.694     -0.812
Austria        11.1               4.6                   7.755         3.345      1.008
Belgium        3.2                2.4                   7.312         -4.112     -1.240
Canada         7.9                2.4                   7.312         0.588      0.177
Denmark        3.8                4.2                   7.674         -3.874     -1.168
Finland        11.1               5.5                   7.936         3.165      0.954
France         9.9                4.7                   7.775         2.125      0.641
Germany        13.5               2.2                   7.272         6.228      1.878
India          1.5                4                     7.634         -6.134     -1.849
Ireland        6.4                4                     7.634         -1.234     -0.372
Israel         8.9                8.4                   8.518         0.382      0.115
Italy          8.1                3.3                   7.493         0.607      0.182
Japan          13.5               4.7                   7.775         5.725      1.726
Mexico         4.7                5.2                   7.875         -3.175     -0.957
Netherlands    7.5                3.6                   7.554         -0.054     -0.016
New Zealand    4.7                3.6                   7.554         -2.854     -0.861
Sweden         8                  4                     7.634         0.366      0.110
UK             7.5                3.9                   7.614         -0.114     -0.034
USA            9                  2.1                   7.252         1.748      0.527
b- Show both graphically and formally if the above data suffers from an outlier problem.
[Figures: scatter plots of stock prices, predicted stock prices and estimation residuals]
According to the above graphs, there may be an outlier problem related to the
observation for Israel (slightly different behaviour from the rest of the country
observations).
Formally, we have to compare each of the normalized estimation residuals to the
critical values (residual analysis) such that:
If z > 2.06 or z < −2.06 (critical values leaving a probability of 2% in each of the right-
and left-hand tails of the normal distribution), then the data point associated
with that estimation residual can be considered an outlier.
Note that the standard deviation of the estimation residuals used to normalize them is
approximately 3.32 (for example, −2.694/−0.812 ≈ 3.32 for Australia) and the sample
mean of the estimation residuals is close to 0.
None of the normalized estimation residuals satisfies the above conditions and
therefore our model does not suffer from a significant outlier problem.
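The table in (a) and the formal check in (b) can be reproduced with the following sketch. It assumes that residuals are normalized by dividing by their sample standard deviation (the n − 1 version computed by `statistics.stdev`), which matches the normalized residuals reported in the table; the function name `residual_table` is chosen here for illustration.

```python
from statistics import stdev

INTERCEPT, SLOPE = 6.83, 0.201   # fitted line given in the exercise
Z_CRIT = 2.06                    # critical value used in the text

# Country: (stock price growth Y, consumer price growth X), from the table.
data = {
    "Australia": (5.0, 4.3), "Austria": (11.1, 4.6), "Belgium": (3.2, 2.4),
    "Canada": (7.9, 2.4), "Denmark": (3.8, 4.2), "Finland": (11.1, 5.5),
    "France": (9.9, 4.7), "Germany": (13.5, 2.2), "India": (1.5, 4.0),
    "Ireland": (6.4, 4.0), "Israel": (8.9, 8.4), "Italy": (8.1, 3.3),
    "Japan": (13.5, 4.7), "Mexico": (4.7, 5.2), "Netherlands": (7.5, 3.6),
    "New Zealand": (4.7, 3.6), "Sweden": (8.0, 4.0), "UK": (7.5, 3.9),
    "USA": (9.0, 2.1),
}


def residual_table(observations):
    """Predicted values, residuals and normalized residuals per country."""
    rows = {}
    for country, (y, x) in observations.items():
        predicted = INTERCEPT + SLOPE * x
        rows[country] = {"predicted": predicted, "residual": y - predicted}
    sd = stdev([row["residual"] for row in rows.values()])
    for row in rows.values():
        row["z"] = row["residual"] / sd
    return rows, sd


rows, residual_sd = residual_table(data)
outliers = [c for c, row in rows.items() if abs(row["z"]) > Z_CRIT]

print(f"standard deviation of residuals: {residual_sd:.3f}")
for country, row in rows.items():
    print(f"{country:12s} {row['predicted']:.3f} {row['residual']:7.3f} {row['z']:7.3f}")
print("outliers:", outliers)
```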
c- If the answer to b is positive, please explain any strategy you would perform in order
to solve the problem.
No strategy is required as none of the observations is a significant outlier.
13 Imagine that you are interested in analyzing the determinants of infant mortality rates
worldwide. Using the Development Reports from the World Bank in 2013, you get the
following information for 248 countries:
IMR Infant Mortality rate - is the number of deaths of infants per 1,000 live births.
GDP GDP per capita (constant 2005 US$)
Source: World Bank Development Reports, 2013.
And construct the following figure:
a- Have a look at the graph above, why Angola and Guinea might be considered as
outliers in this regression model? Comment on the implications of the inclusion of
these two countries in the analysis.
They seem to behave in a slightly different way from the rest of our observations and
therefore they could be considered as possible outliers within our sample. Including
outliers could affect your estimation results in a negative way, leading to erroneous
conclusions. Additionally, the results are less accurate since the estimation residuals
are large.
b- Angola presents one of the highest infant mortality rates in this sample (103 per 1,000
live births). Compute the residual for this country given that our model predicts for
Angola an infant mortality rate of 28.6 per 1,000 live births.
In order to compute the estimation residual associated with Angola:
$\hat{u} = y - \hat{y} = 103 - 28.6 = 74.4$ per 1,000 live births.
c- Knowing that the standard deviation of the estimation residuals (using all the
observations) is 26.22, is Angola a significant outlier?
This is about performing a normal test for outliers:
$H_0$: Insignificant outlier, $H_a$: Significant outlier
Then, we have to normalize the residual, that is, compute the z score:
$z = \frac{\hat{u} - 0}{s.d.(\hat{u})} = \frac{74.4 - 0}{26.22} = 2.838$
Since 2.838 is higher than 2.06, which is the critical value at the 4% significance level in
a two-tailed normal test, we reject the null hypothesis. That is, the Angola observation
is a significant outlier, consistent with our previous analysis.
d- What about Guinea? Note that the estimation residual associated to Guinea
observation is 52.
This is again about performing a normal test for outliers, now for Guinea:
$H_0$: Insignificant outlier, $H_a$: Significant outlier
Then, we have to normalize the residual, that is, compute the z score:
$z = \frac{\hat{u} - 0}{s.d.(\hat{u})} = \frac{52 - 0}{26.22} = 1.983$
Since 1.983 is lower than 2.06, which is the critical value at the 4% significance level in a
two-tailed normal test, we fail to reject the null hypothesis. That is, the Guinea
observation is not a significant outlier, even if it looks like one in the above figure.
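Both outlier tests follow the same steps, which can be sketched as a small function. The function name `outlier_z_test` is an assumption made here; the residuals, the standard deviation and the 2.06 critical value come from the exercise, and the two-tailed p-value is added only as a complementary reading of the same test.

```python
from statistics import NormalDist


def outlier_z_test(residual, residual_sd, z_crit=2.06):
    """Return the z score, its two-tailed normal p-value and the outlier decision."""
    z = residual / residual_sd
    p_two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_tailed, abs(z) > z_crit


RESIDUAL_SD = 26.22  # standard deviation of the estimation residuals

angola = outlier_z_test(103 - 28.6, RESIDUAL_SD)   # observed minus predicted IMR
guinea = outlier_z_test(52, RESIDUAL_SD)           # residual given in the exercise

for name, (z, p, is_outlier) in {"Angola": angola, "Guinea": guinea}.items():
    print(f"{name}: z = {z:.3f}, p = {p:.4f}, significant outlier: {is_outlier}")
```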
14 We have representative data for 30-year-olds in the US. Levine, Gustafson and
Velenchik (1997) estimated a wage equation using the following variables:
Y = ln(wage)
F = a dummy variable that takes a value of 1 for smokers and 0, otherwise
ED = years of education
Two specifications are considered:
MODEL 1: Y = -0.176F (omitting education)
(se=0.031)
Coefficient of determination = 0.35
MODEL 2: Y = -0.080F + 0.070ED (including education)
(se=0.021) (se=0.0004)
Coefficient of determination = 0.68
Compare the two fitted models and explain what happens when we omit one relevant variable (in this
case, years of education).
When omitting years of education (which is an individually significant explanatory
variable: t = 0.070/0.0004 = 175) in the first model, we can see that the negative effect of
smoking on salaries is overestimated compared with the second regression model (the
coefficient in Model 1, −0.176, is more negative than in Model 2, −0.080). In addition, the
standard error associated with the effect of smoking is higher in the first model (0.031)
than in the second (0.021). That is, omitting education makes the estimator in Model 1
less precise than in Model 2.
Finally, if we compare the two regression models in terms of explanatory power using
the determination coefficients (0.35 versus 0.68), we can see that Model 2 is better than
Model 1. That is, including education in the second model helps to explain the variability
in salaries better than the first model.
All of the above indicates that Model 1 suffers from the omission of a relevant factor
(years of education).
15 Consider the following regression model to analyze salaries with respect to age groups
in a sample with 150 individuals:
$w = \beta_1 d_1 + \beta_2 d_2 + \beta_3 d_3 + u$
Such that d1 refers to individuals between 20-30 years old, d2 accounts for individuals between
30-40 years old and finally, d3 captures individuals between 40-50 years old.
In order to test for heteroscedasticity, the second group of age is eliminated and we run two
different regressions models (with the same specification). The first model, taking into
consideration the youngest individuals group (with 50 observations and with SSR1=234). The
second regression model accounting for the oldest individuals group (with 50 observations
and SSR2=387).
a-Test for heteroscedasticity at 1% significance level and interpret your result.
To test for heteroscedasticity, we use the Goldfeld-Quandt test (GQT) in the following way:
$H_0$: Homoscedasticity
$H_1$: Heteroscedasticity
GQT = SSR2/SSR1 = 387/234 = 1.654
Degrees of freedom (numerator): n2 − k − 1 = 50 − 1 − 1 = 48
Degrees of freedom (denominator): n1 − k − 1 = 50 − 1 − 1 = 48
F critical value at 1% significance level = 1.976
Since 1.654 < 1.976, we fail to reject the null hypothesis. Thus, there is no evidence of
heteroscedasticity and the second GM condition appears to be satisfied: dispersion is
homogeneous across the observations.
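The steps of the test can be sketched as a small function. The name `goldfeld_quandt` and its signature are assumptions made here for illustration; the critical value is taken from the solution above rather than computed, since the standard library has no F distribution. Dividing each SSR by its degrees of freedom reduces to the simple ratio SSR2/SSR1 whenever both subsamples have the same size.

```python
def goldfeld_quandt(ssr_low, ssr_high, n_low, n_high, k=1):
    """Goldfeld-Quandt F statistic and its (numerator, denominator) degrees of freedom."""
    df_low = n_low - k - 1
    df_high = n_high - k - 1
    statistic = (ssr_high / df_high) / (ssr_low / df_low)
    return statistic, df_high, df_low


# Values from the exercise: youngest group (SSR1) and oldest group (SSR2).
gq_stat, df_num, df_den = goldfeld_quandt(ssr_low=234, ssr_high=387, n_low=50, n_high=50)

F_CRIT_1PCT = 1.976  # critical value quoted in the solution for (48, 48) df
reject_homoscedasticity = gq_stat > F_CRIT_1PCT

print(f"GQ = {gq_stat:.3f} with ({df_num}, {df_den}) degrees of freedom")
print(f"reject homoscedasticity at 1%: {reject_homoscedasticity}")
```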
16 A time series analysis is conducted to see the relationship between consumption and
income in a country. The analysis is carried out using yearly observations from 1959 until
1988. Since the researcher suspects heteroscedasticity issues, she decides to perform
the GQT by running the same simple linear regression model, first using yearly data
from 1959 until 1971 and secondly using yearly observations from 1976 until 1988.
She finds that the SSR in the first subsample is 2532224 while the SSR in the second
subsample is 10339356.
a- Help her by testing heteroscedasticity at 5% significance level.

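GQT = SSR2/SSR1 = 10339356/2532224 = 4.083
Degrees of freedom (numerator): n2 − k − 1 = 13 − 1 − 1 = 11
Degrees of freedom (denominator): n1 − k − 1 = 13 − 1 − 1 = 11
F critical value at 5% significance level = 2.817
Since 4.083 > 2.817, we reject the null hypothesis of homoscedasticity. Thus, the model
is heteroscedastic and the second GM condition is not satisfied: dispersion is not
homogeneous across the observations.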
b- Explain the implications in the estimation results of the model of your result in the
above question.
The OLS estimated coefficients remain unbiased but will be inefficient, and the usual
standard errors are incorrect, so the t-tests will be invalid.
17 We have data regarding profits and sales for 20 companies and are interested in
estimating a model explaining profits with sales. The following graph shows the
existing relationship between the two variables.
[Figure: Profits vs. Sales (scatter plot)]
a- Interpret the above diagram. Could you infer any possible potential problem?
According to the above graph, it seems that the dispersion increases as sales
increase and therefore there may be a heteroscedasticity problem.
b- Ordering the sample from the lowest to the highest sales and dropping the six central
observations, we obtain SSR1=1.134 for the first observations when running a SLRM
explaining profits with sales and SSR2=32.934 when estimating the same model
for the last observations. Test for heteroscedasticity at 5% significance level.

GQT = SSR2/SSR1 = 32.934/1.134 = 29.042
Degrees of freedom (numerator and denominator): n2 − k − 1 = n1 − k − 1 = 7 − 1 − 1 = 5
F critical value at 5% significance level = 5.05
Since 29.042 > 5.05, we reject the null hypothesis of homoscedasticity. Thus, the model
is heteroscedastic and the second GM condition is not satisfied: dispersion is not
homogeneous across the observations.
18 Answer the following multiple choice questions about estimation problems:
A- Autocorrelation refers to a situation in which:
a- Successive error terms derived from the application of regression analysis to

time series data are correlated.
b- There is a high degree of correlation between two or more of the independent
variables included in a multiple regression model.
c- The dependent variable is highly correlated with the independent variable(s) in a
regression analysis.
d- The application of a multiple regression model yields estimates that are nonlinear in
form.
B- A situation in which measures of two or more variables are statistically related but
are not in fact causally linked because the statistical relationship is caused by a third
omitted variable is called:
a- Partial correlation
b- Linear correlation
c- Spurious correlation
d- Marginal correlation
C- Step-wise regression is the most widely used search procedure of developing the
……….. regression model without examining all possible models.
a- worst
b- best
c- medium
d- least
D- If there is measurement error in both dependent and explanatory variables of your

simple linear regression model, then
a- OLS is unbiased but inefficient.

b- OLS is unbiased but inconsistent.
c- OLS is biased and inefficient.
d- OLS is biased but efficient.
E- A non-formal way to detect a non-linearity problem is plotting your model fitted

values versus the
a- Values of your independent variables

b- Values of your explanatory variables
c- Model residuals
d- Model predictions
F- Multicollinearity refers to a situation in which
a- The dependent variable is highly correlated with the explanatory variables included
in the regression model.
b- There is a high degree of correlation between the explanatory variables
included in a multiple regression model.
c- The application of a multiple regression model yields estimates that are nonlinear
in form.
d- None of the above.
G- If your dataset has heteroscedasticity, but you completely ignore the problem and
use OLS, you will
a- Get biased estimates of the parameters.

b- Get parameter standard errors that could be either too large or too small.
c- Get t-statistics that make you too optimistic about your parameters being statistically
different from zero.
d- Get t-statistics that make you too pessimistic about your parameters being statistically
different from zero.
H- A useful graphical method for detecting the presence of heteroscedasticity is
a- Plot 𝑦 against each 𝑥 variable in turn
b- Plot the residuals from a preliminary regression against the 𝑥 variables, each in turn
c- Plot the squared residuals from a preliminary regression against the 𝑥
variables, each in turn
d- Plot the logarithm of the squared residuals from a preliminary regression against the
𝑥 variables, each in turn