Tutorial-4
2024-12-01
library(readxl)
library(tidyverse)
library(modelsummary)
library(ggfortify)
library(car)
library(estimatr)
library(lmtest)
library(stargazer)
##
## Please cite as:
##
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
a. Estimate a linear regression model that explains airq from the other variables using
OLS.
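The estimation step for part a. might look like the sketch below. The formula is the one shown in the `Call:` line of the regression output; the data frame `airq_sim` and the model name `reg_sim` are hypothetical stand-ins with simulated values, used only so that the snippet runs on its own. With the tutorial data, the same call would use `data = airquality`.

```r
# Simulated stand-in for the tutorial data (hypothetical values, same column names)
set.seed(1)
n <- 30
airq_sim <- data.frame(
  vala = runif(n, 1000, 20000),
  rain = runif(n, 10, 60),
  coas = rbinom(n, 1, 0.7),
  dens = runif(n, 500, 10000),
  medi = runif(n, 5000, 60000)
)
airq_sim$airq <- 110 - 30 * airq_sim$coas + rnorm(n, sd = 20)

# OLS regression of airq on all other variables
reg_sim <- lm(airq ~ vala + rain + coas + dens + medi, data = airq_sim)
summary(reg_sim)
```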
b. Test the null hypothesis that average income does not affect air quality. Test the joint
hypothesis that none of the variables has an effect on air quality.
# Null hypothesis: average income does not affect air quality
summary(reg1)
##
## Call:
## lm(formula = airq ~ vala + rain + coas + dens + medi, data = airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.958 -9.891 -6.173 13.714 69.430
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.119e+02 1.533e+01 7.301 1.53e-07 ***
## vala 8.834e-04 2.256e-03 0.392 0.6989
## rain 2.507e-01 3.435e-01 0.730 0.4726
## coas -3.340e+01 1.046e+01 -3.194 0.0039 **
## dens -1.073e-03 1.623e-03 -0.661 0.5148
## medi 5.545e-04 8.503e-04 0.652 0.5205
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.2 on 24 degrees of freedom
## Multiple R-squared: 0.3829, Adjusted R-squared: 0.2544
## F-statistic: 2.979 on 5 and 24 DF, p-value: 0.03133
The coefficient on average income (medi) is not significantly different from zero
(t = 0.652, p = 0.52), so we cannot reject the null hypothesis that average income has no
effect on air quality. The regressors are jointly significant at the 5% level (F = 2.979 on 5
and 24 df, p-value = 0.03133), which implies that at least one of the regressors has a
statistically significant effect on air quality. Note that neither test is reliable under
heteroskedasticity, since the standard errors would then have been calculated incorrectly.
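As a quick check, both p-values can be recomputed from the statistics printed in the summary output. This is a minimal sketch that uses only the reported (rounded) values; the variable names are illustrative.

```r
# Joint F test: upper-tail probability of F(5, 24) at the reported statistic
f_stat <- 2.979
p_joint <- pf(f_stat, df1 = 5, df2 = 24, lower.tail = FALSE)
p_joint

# Single-coefficient t test for medi: two-sided p-value with 24 residual df
t_medi <- 0.652
p_medi <- 2 * pt(abs(t_medi), df = 24, lower.tail = FALSE)
p_medi
```

Up to rounding, the two values should match the 0.03133 and 0.5205 shown in the output.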
c. Use the command autoplot to graphically inspect the residuals; is there any sign of
heteroskedasticity?
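A possible way to produce the diagnostic plots for part c. is sketched below. It assumes the `ggfortify` package is installed and again uses a simulated data frame (`airq_sim`, hypothetical) so that it runs on its own. In the residuals-versus-fitted and scale-location panels, a fan or funnel shape would hint at heteroskedasticity.

```r
library(ggplot2)
library(ggfortify)

# Simulated stand-in for the tutorial data (hypothetical values)
set.seed(1)
n <- 30
airq_sim <- data.frame(
  vala = runif(n, 1000, 20000),
  rain = runif(n, 10, 60),
  coas = rbinom(n, 1, 0.7),
  dens = runif(n, 500, 10000),
  medi = runif(n, 5000, 60000)
)
airq_sim$airq <- 110 - 30 * airq_sim$coas + rnorm(n, sd = 20)
reg_sim <- lm(airq ~ vala + rain + coas + dens + medi, data = airq_sim)

# Panel 1: residuals vs fitted; panel 3: scale-location
diag_plots <- autoplot(reg_sim, which = c(1, 3))
diag_plots
```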
#BP TEST
bptest(reg1)
##
## studentized Breusch-Pagan test
##
## data: reg1
## BP = 3.1416, df = 5, p-value = 0.6782
We fail to reject the null hypothesis that the residuals are homoskedastic. Since the p-value
(0.6782) is not less than 0.05, we do not have sufficient evidence that heteroskedasticity is
present in the regression model.
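To see what the studentized Breusch-Pagan statistic measures, it can be rebuilt by hand: regress the squared residuals on the regressors and multiply the R-squared of that auxiliary regression by the sample size. The sketch below does this on simulated data (`airq_sim` is a hypothetical stand-in) and compares the result with `bptest()`.

```r
library(lmtest)

# Simulated stand-in for the tutorial data (hypothetical values)
set.seed(1)
n <- 30
airq_sim <- data.frame(
  vala = runif(n, 1000, 20000),
  rain = runif(n, 10, 60),
  coas = rbinom(n, 1, 0.7),
  dens = runif(n, 500, 10000),
  medi = runif(n, 5000, 60000)
)
airq_sim$airq <- 110 - 30 * airq_sim$coas + rnorm(n, sd = 20)
reg_sim <- lm(airq ~ vala + rain + coas + dens + medi, data = airq_sim)

# Auxiliary regression: squared residuals on the same regressors
airq_sim$e2 <- residuals(reg_sim)^2
aux <- lm(e2 ~ vala + rain + coas + dens + medi, data = airq_sim)

# Studentized BP statistic = n * R^2, chi-square with 5 df under the null
bp_manual <- n * summary(aux)$r.squared
p_manual <- pchisq(bp_manual, df = 5, lower.tail = FALSE)

# Compare with the packaged test
bp_pkg <- bptest(reg_sim)
c(manual = bp_manual, package = unname(bp_pkg$statistic))
```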
# you can also define the z vars heteroskedasticity depends on, for example:
bptest(reg1, ~ vala + rain, data = airquality)
##
## studentized Breusch-Pagan test
##
## data: reg1
## BP = 0.19782, df = 2, p-value = 0.9058
Here, instead of testing for heteroskedasticity across all predictors, the test checks
specifically whether the variance of the residuals depends on these two variables. Again,
since the p-value (0.9058) is not less than 0.05, we do not have sufficient evidence that
heteroskedasticity is present with respect to vala and rain.
e. Perform a White test for heteroskedasticity. How reliable is this test given that we have
30 observations, and how many degrees of freedom does the chi-square distribution have?
# General pattern: bptest(model, ~ <all variables to include>, data = dataset)
#WHITE TEST
bptest(reg1, ~ vala + rain + coas + dens + medi + I(vala^2) +
I(rain^2) + I(dens^2)+ I(medi^2), data = airquality)
##
## studentized Breusch-Pagan test
##
## data: reg1
## BP = 5.4355, df = 9, p-value = 0.7948
The White test examines heteroskedasticity by allowing the residual variance to depend on
the predictors and their squares (the square of the dummy coas is omitted because it equals
coas itself; this version also leaves out cross-products). Its alternative hypothesis is more
general than that of the Breusch-Pagan test. Since the p-value (0.7948) is higher than 0.05,
we fail to reject the null hypothesis: there is no evidence of heteroskedasticity in the model.
The test statistic is compared with a chi-square distribution with 9 degrees of freedom, one
for each variable in the auxiliary regression.
The White test relies on asymptotic (large-sample) approximations, so it is generally
reliable only for larger samples. With only 30 observations and 9 auxiliary regressors, the
result may be unreliable and the test has little power to detect heteroskedasticity.
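The degrees-of-freedom question can be made concrete with a little arithmetic. The sketch below recomputes the p-value from the reported statistic and counts how many auxiliary regressors a full White test (levels, squares and all pairwise cross-products) would need with these five regressors. The count treats the square of the dummy as redundant, and the variable names are illustrative.

```r
# p-value of the reported statistic under a chi-square with 9 df
bp_white <- 5.4355
p_white <- pchisq(bp_white, df = 9, lower.tail = FALSE)
p_white

# Auxiliary regressors in a full White test with k = 5 regressors,
# one of which is a dummy (its square adds nothing new)
k <- 5
df_full <- k + (k - 1) + choose(k, 2)  # levels + squares + cross-products
df_full
```

With 30 observations, an auxiliary regression with that many terms would leave very few degrees of freedom, which is one reason to be cautious about the full version here.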
How do you interpret the results? Can you explain why df1 = 15 and df2 = 3?
#Goldfeld-Quandt test
summary(airquality$coas)
##
## Goldfeld-Quandt test
##
## data: reg1
## GQ = 27.244, df1 = 15, df2 = 3, p-value = 0.009802
## alternative hypothesis: variance increases from segment 1 to 2
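Since the p-value (0.009802) is less than 0.05, we reject the null hypothesis of
homoskedasticity. There is significant evidence of heteroskedasticity: the variance of the
residuals increases from the first group to the second.
The degrees of freedom of each group are df = number of observations in the group - number of
estimated parameters (k). With k = 6, df1 = 15 and df2 = 3 correspond to groups of 21 and 9
observations.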
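The degrees-of-freedom rule can be checked against the reported output. This sketch works backwards from the printed df values to the implied group sizes and recomputes the p-value from the F distribution; only the numbers shown in the output are used, and the variable names are illustrative.

```r
# Six estimated parameters: intercept plus five slopes
k <- 6
df_reported <- c(df1 = 15, df2 = 3)

# Implied group sizes: df + k observations in each segment
group_sizes <- df_reported + k
group_sizes
sum(group_sizes)  # should equal the sample size of 30

# Upper-tail p-value of the reported GQ statistic
p_gq <- pf(27.244, df1 = 15, df2 = 3, lower.tail = FALSE)
p_gq
```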
# suppose the order.by variable was continuous, for example medi
gq_test2 <- gqtest(reg1, order.by = ~ medi, point=0.5, fraction=0.2,
data = airquality)
# point = 0.5 splits the sample evenly
# fraction = 0.2 excludes 20% of the observations from the middle of the sorted
# dataset, leaving two smaller subsets for the comparison of variances.
print(gq_test2)
##
## Goldfeld-Quandt test
##
## data: reg1
## GQ = 7.1341, df1 = 6, df2 = 6, p-value = 0.01532
## alternative hypothesis: variance increases from segment 1 to 2
g. Both the BP and White tests have a rather generic alternative hypothesis and therefore have
low power. Assume now that heteroskedasticity is multiplicative.
Plot the residuals against the various regressors (one at a time) to see
what variables heteroskedasticity may be depending upon.
# Plot the residuals against the various regressors (one at a time) to see
# which variables heteroskedasticity may depend upon.
res <- residuals(reg1)
plot(airquality$vala, res)
plot(airquality$rain, res)
plot(airquality$dens, res)
plot(airquality$medi, res)
plot(airquality$coas, res)
There is clear heteroskedasticity only in relation to the variable coas; for the other
variables there are too few observations at high values of the regressor to tell.
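Under multiplicative heteroskedasticity the error variance is modelled as an exponential function of some variables, so the log of the squared residuals is regressed on them. The sketch below sets up such an auxiliary regression on simulated data. The names `airq_sim`, `reg_sim` and `aux_sim` are hypothetical, and the choice of coas and medi as variance drivers simply mirrors the auxiliary regression reported in this tutorial.

```r
# Simulated data whose error variance depends on coas (hypothetical values)
set.seed(1)
n <- 30
airq_sim <- data.frame(
  coas = rbinom(n, 1, 0.7),
  medi = runif(n, 5000, 60000)
)
airq_sim$airq <- 110 - 30 * airq_sim$coas +
  rnorm(n, sd = exp(1 + 0.8 * airq_sim$coas))

# Step 1: OLS residuals from the main regression
reg_sim <- lm(airq ~ coas + medi, data = airq_sim)
airq_sim$res_sq <- residuals(reg_sim)^2

# Step 2: regress log squared residuals on the suspected variance drivers
aux_sim <- lm(log(res_sq) ~ coas + medi, data = airq_sim)
summary(aux_sim)
```

Significant slopes in the auxiliary regression indicate which variables the variance depends on; its fitted values, once exponentiated, give estimated variances that can serve as FGLS weights.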
##
## Call:
## lm(formula = log(res_sq) ~ coas + medi, data = airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1050 -0.8973 0.1499 0.9418 2.9630
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.437e+00 5.271e-01 8.418 4.98e-09 ***
## coas 1.572e+00 6.149e-01 2.556 0.0165 *
## medi -5.288e-05 2.293e-05 -2.306 0.0290 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.521 on 27 degrees of freedom
## Multiple R-squared: 0.2731, Adjusted R-squared: 0.2192
## F-statistic: 5.072 on 2 and 27 DF, p-value: 0.01349
i. Based on the test in h. compute the FGLS estimator for the linear model. Compare the results
with the OLS estimates.
#FGLS ESTIMATOR
variance <- exp(fitted(aux.multip))
# Step 3: estimate the model using lm and specify weights
model_fgls <- lm(airq ~ vala + rain + coas + dens + medi,
weights=1/variance, data=airquality)
stargazer(reg1, model_fgls,
type="text",
no.space=TRUE)
##
## ==========================================================
## Dependent variable:
## ----------------------------
## airq
## (1) (2)
## ----------------------------------------------------------
## vala 0.001 0.0001
## (0.002) (0.001)
## rain 0.251 0.165
## (0.344) (0.300)
## coas -33.398*** -32.647***
## (10.458) (7.744)
## dens -0.001 -0.001
## (0.002) (0.001)
## medi 0.001 0.001*
## (0.001) (0.0004)
## Constant 111.935*** 115.794***
## (15.332) (11.715)
## ----------------------------------------------------------
## Observations 30 30
## R2 0.383 0.602
## Adjusted R2 0.254 0.519
## Residual Std. Error (df = 24) 24.203 1.628
## F Statistic (df = 5; 24) 2.979** 7.262***
## ==========================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Note that medi is now marginally significant (at about the 6% level); the significance of the
other regressors is unchanged. The R-squared is inflated: it measures the variation in the
transformed (weighted) dependent variable explained by the model, not the variation in airq
itself (observations with a high estimated variance h receive less weight).
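The remark about the transformed variable can be illustrated directly: weighted least squares with weights 1/h gives the same coefficients as OLS on variables divided by sqrt(h), including the constant. The following is a minimal sketch on simulated data where the variance function `h` is known by construction; all names are illustrative.

```r
# Simulated heteroskedastic data with a known variance function h
set.seed(1)
n <- 30
x <- runif(n, 0, 10)
h <- exp(0.3 * x)
y <- 2 + 0.5 * x + rnorm(n, sd = sqrt(h))

# Weighted least squares with weights 1/h
wls <- lm(y ~ x, weights = 1 / h)

# Equivalent OLS on transformed variables (no automatic intercept)
y_t <- y / sqrt(h)
c_t <- 1 / sqrt(h)  # transformed constant
x_t <- x / sqrt(h)
ols_t <- lm(y_t ~ 0 + c_t + x_t)

# The two sets of coefficients coincide
same_coefs <- all.equal(unname(coef(wls)), unname(coef(ols_t)))
same_coefs
```

Because the fit is judged on the transformed scale, the reported R-squared refers to `y_t` rather than `y`, which is why it is not comparable with the OLS R-squared.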
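j. Obtain the robust standard errors. You need to install estimatr and car.
# lm_robust() reports heteroskedasticity-robust standard errors (HC2 by default)
model_robust <- lm_robust(airq ~ vala + rain + coas + dens + medi, data = airquality)
modelsummary(list(reg1, model_fgls, model_robust), stars = TRUE)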
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.1193e+02 1.2151e+01 9.2118 2.382e-09 ***
## vala 8.8339e-04 3.4095e-03 0.2591 0.7977709
## rain 2.5070e-01 3.3457e-01 0.7493 0.4609464
## coas -3.3398e+01 8.5125e+00 -3.9235 0.0006392 ***
## dens -1.0734e-03 3.2379e-03 -0.3315 0.7431462
## medi 5.5449e-04 1.4824e-03 0.3740 0.7116556
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
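For intuition, heteroskedasticity-robust standard errors can be computed by hand with the sandwich formula. The sketch below implements the basic White (HC0) variant on simulated data; which variant produced the table above is not shown, so the numbers are not meant to reproduce it. The names `airq_sim` and `reg_sim` are hypothetical.

```r
# Simulated data with variance depending on coas (hypothetical values)
set.seed(1)
n <- 30
airq_sim <- data.frame(
  coas = rbinom(n, 1, 0.7),
  rain = runif(n, 10, 60)
)
airq_sim$airq <- 110 - 30 * airq_sim$coas +
  rnorm(n, sd = 10 + 15 * airq_sim$coas)
reg_sim <- lm(airq ~ coas + rain, data = airq_sim)

# Sandwich estimator: (X'X)^-1 X' diag(e^2) X (X'X)^-1
X <- model.matrix(reg_sim)
e <- residuals(reg_sim)
bread <- solve(crossprod(X))
meat <- crossprod(X * e)  # each row of X scaled by its residual
vcov_hc0 <- bread %*% meat %*% bread
se_hc0 <- sqrt(diag(vcov_hc0))

# Coefficients with conventional and robust standard errors side by side
cbind(estimate = coef(reg_sim),
      se_ols = sqrt(diag(vcov(reg_sim))),
      se_hc0 = se_hc0)
```

The coefficient estimates are unchanged; only the standard errors, and hence the t values and p-values, differ from the conventional OLS output.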