
Tutorial 4

Sahrish Aisha Khan

2024-12-01
library(readxl)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2
## ── Conflicts ─────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all
##   conflicts to become errors

library(modelsummary)

## `modelsummary` 2.0.0 now uses `tinytable` as its default table-drawing
## backend. Learn more at: https://vincentarelbundock.github.io/tinytable/
##
## Revert to `kableExtra` for one session:
##
## options(modelsummary_factory_default = 'kableExtra')
## options(modelsummary_factory_latex = 'kableExtra')
## options(modelsummary_factory_html = 'kableExtra')
##
## Silence this message forever:
##
## config_modelsummary(startup_message = FALSE)

library(ggfortify)
library(car)

## Warning: package 'car' was built under R version 4.4.2

## Loading required package: carData


##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some

library(estimatr)
library(lmtest)

## Loading required package: zoo


##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric

library(stargazer)

##
## Please cite as:
##
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary
## Statistics Tables. R package version 5.2.3.
## https://CRAN.R-project.org/package=stargazer

airquality <- read_excel("airquality.xlsx")
airquality <- data.frame(airquality)

a. Estimate a linear regression model that explains airq from the other variables using
OLS.

# Estimate a linear regression model by OLS
reg1 <- lm(airq ~ vala + rain + coas + dens + medi, data = airquality)

b. Test the null hypothesis that average income does not affect air quality. Test the joint
hypothesis that none of the variables has an effect on air quality.

# Null hypothesis: average income does not affect air quality
summary(reg1)

##
## Call:
## lm(formula = airq ~ vala + rain + coas + dens + medi, data = airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.958 -9.891 -6.173 13.714 69.430
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.119e+02 1.533e+01 7.301 1.53e-07 ***
## vala 8.834e-04 2.256e-03 0.392 0.6989
## rain 2.507e-01 3.435e-01 0.730 0.4726
## coas -3.340e+01 1.046e+01 -3.194 0.0039 **
## dens -1.073e-03 1.623e-03 -0.661 0.5148
## medi 5.545e-04 8.503e-04 0.652 0.5205
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.2 on 24 degrees of freedom
## Multiple R-squared: 0.3829, Adjusted R-squared: 0.2544
## F-statistic: 2.979 on 5 and 24 DF, p-value: 0.03133

The coefficient on average income (medi) is not significantly different from zero. The
variables are jointly significant at the 5% level (F-statistic = 2.979 with p-value = 0.03133),
which implies that at least one of the regressors has a statistically significant effect on air
quality. Note that none of these tests is reliable if there is heteroskedasticity, since in that
case the standard errors would have been calculated incorrectly.
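
The joint null can also be tested explicitly; a minimal sketch using car::linearHypothesis
(car is loaded above), which should reproduce the overall F-statistic:

# sketch: joint F-test that none of the five regressors has an effect
linearHypothesis(reg1, c("vala = 0", "rain = 0", "coas = 0",
                         "dens = 0", "medi = 0"))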
c. Use the command autoplot to graphically inspect the residuals; is there any sign of
heteroskedasticity?

# Graphical inspection of residuals to check for heteroskedasticity
autoplot(reg1) +
  theme_bw(base_size = 14)
plot(reg1)
Yes, the residuals fan out as the fitted values increase, so the errors are potentially heteroskedastic.

d. Perform a Breusch-Pagan test for heteroskedasticity related to all five explanatory variables.

bptest(model)

#BP TEST
bptest(reg1)

##
## studentized Breusch-Pagan test
##
## data: reg1
## BP = 3.1416, df = 5, p-value = 0.6782

We fail to reject the null hypothesis that the residuals are homoskedastic. Since the p-value is
not below 0.05, we do not have sufficient evidence to say that heteroskedasticity is
present in the regression model.
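
Under the hood, the studentized BP statistic is n times the R-squared of an auxiliary
regression of the squared residuals on the regressors; a minimal sketch by hand:

# sketch: studentized Breusch-Pagan statistic computed manually
u2 <- resid(reg1)^2
aux.bp <- lm(u2 ~ vala + rain + coas + dens + medi, data = airquality)
bp_stat <- nobs(aux.bp) * summary(aux.bp)$r.squared  # should match BP = 3.1416
pchisq(bp_stat, df = 5, lower.tail = FALSE)          # should match p = 0.6782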
# you can also define the z vars heteroskedasticity depends on, for example:
bptest(reg1, ~ vala + rain, data = airquality)

##
## studentized Breusch-Pagan test
##
## data: reg1
## BP = 0.19782, df = 2, p-value = 0.9058

Here, instead of testing for heteroskedasticity across all predictors, the test specifically
checks whether the variance of the residuals depends on these two variables. Again, since
the p-value is not below 0.05, we do not have sufficient evidence of heteroskedasticity with
respect to the variables vala and rain.

e. Perform a White test for heteroskedasticity. How reliable is this test given that we have
30 observations, and how many degrees of freedom does the chi-square distribution have?

bptest(model, ~ all variables that need to be included, data=dataset)

# White test: all regressors plus their squares (coas is a 0/1 dummy, so
# coas^2 = coas and its square is omitted to avoid collinearity)
bptest(reg1, ~ vala + rain + coas + dens + medi + I(vala^2) +
         I(rain^2) + I(dens^2) + I(medi^2), data = airquality)

##
## studentized Breusch-Pagan test
##
## data: reg1
## BP = 5.4355, df = 9, p-value = 0.7948

The White test examines heteroskedasticity by allowing the residual variance to depend on
all predictors and their squares; it is more general than the Breusch-Pagan test. Since the
p-value is above 0.05, we fail to reject the null hypothesis, which suggests no evidence of
heteroskedasticity in the model. Here the chi-square test has 9 degrees of freedom, one per
auxiliary regressor.
The White test is generally reliable for larger sample sizes, as it relies on asymptotic
(large-sample) approximations. With only 30 observations, the results may be less reliable
because the test has little power to detect heteroskedasticity.
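
For reference, a sketch of the full White specification, which also includes the
cross-products; it burns 19 degrees of freedom on 30 observations, so its small-sample
reliability is weaker still:

# sketch: full White test with squares and cross-products
# (vala + rain + coas + dens + medi)^2 expands to the 5 regressors
# plus their 10 pairwise interactions
bptest(reg1, ~ (vala + rain + coas + dens + medi)^2 +
         I(vala^2) + I(rain^2) + I(dens^2) + I(medi^2),
       data = airquality)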

f. Perform a Goldfeld-Quandt test for equality of the variance between smsa on the coast (coas=1) and not (coas=0).

- First find the percentage of observations with coas=0:

summary(airquality$coas)

- Then use this command:

gq_test <- gqtest(model, order.by = ~ coas, point=0.3, data = airquality)
print(gq_test)

How do you interpret the results? Can you understand why df1 = 15, df2 = 3?

# Goldfeld-Quandt test
summary(airquality$coas)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     0.0     0.0     1.0     0.7     1.0     1.0

gq_test <- gqtest(reg1, order.by = ~ coas, point = 0.3, data = airquality)
print(gq_test)

##
## Goldfeld-Quandt test
##
## data: reg1
## GQ = 27.244, df1 = 15, df2 = 3, p-value = 0.009802
## alternative hypothesis: variance increases from segment 1 to 2

Since the p-value is less than 0.05, we reject the null hypothesis: there is significant
evidence of heteroskedasticity. Specifically, the variance of the residuals increases
between the two groups.
For each segment, df = number of observations in the segment − number of estimated
parameters (k). With point = 0.3 and n = 30, the first segment contains the 9 coas = 0
observations and the second the 21 coas = 1 observations; with k = 6, df1 = 21 − 6 = 15
and df2 = 9 − 6 = 3.
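
A quick sketch of that arithmetic in R:

# sketch: where df1 = 15 and df2 = 3 come from
n  <- nrow(airquality)      # 30 observations
k  <- length(coef(reg1))    # 6 estimated parameters (intercept + 5 slopes)
n1 <- 0.3 * n               # point = 0.3: 9 observations with coas = 0
n2 <- n - n1                # 21 observations with coas = 1
c(df1 = n2 - k, df2 = n1 - k)  # 15 and 3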
# Suppose the order.by variable was continuous, for example medi:
gq_test2 <- gqtest(reg1, order.by = ~ medi, point = 0.5, fraction = 0.2,
                   data = airquality)
# point = 0.5 means the sample is split evenly
# fraction = 0.2 means 20% of observations are excluded from the middle of the
# sorted dataset, leaving two smaller subsets for the comparison of variances
print(gq_test2)

##
## Goldfeld-Quandt test
##
## data: reg1
## GQ = 7.1341, df1 = 6, df2 = 6, p-value = 0.01532
## alternative hypothesis: variance increases from segment 1 to 2


g. Both the BP and White tests have quite a generic alternative hypothesis and therefore
low power. Assume now that heteroskedasticity is multiplicative.
Plot the residuals against the various regressors (one at a time) to see
which variables heteroskedasticity may depend upon.

plot(airquality$vala, res) for example

# Plot the residuals against each regressor (one at a time) to see which
# variables heteroskedasticity may depend upon
res <- residuals(reg1)
plot(airquality$vala, res)
plot(airquality$rain, res)
plot(airquality$dens, res)
plot(airquality$medi, res)
plot(airquality$coas, res)
There is clear heteroskedasticity only in relation to the variable coas; for the other
variables there are too few observations at high values of the x variable to judge.
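
As a compact alternative, a sketch drawing all five plots in one faceted figure with
ggplot2 (loaded with the tidyverse above):

# sketch: residuals against every regressor in a single faceted plot
airquality %>%
  mutate(res = residuals(reg1)) %>%
  pivot_longer(c(vala, rain, coas, dens, medi),
               names_to = "regressor", values_to = "x") %>%
  ggplot(aes(x = x, y = res)) +
  geom_point() +
  facet_wrap(~ regressor, scales = "free_x") +
  theme_bw()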

h. Assume that multiplicative heteroskedasticity is related to coas and medi.
Estimate the coefficients by running a regression of log(ε̂ᵢ²) upon these two variables.
Test the null hypothesis of homoskedasticity on the basis of this auxiliary regression.

# Multiplicative heteroskedasticity: auxiliary regression of the log of the
# squared residuals on coas and medi
res_sq <- res^2
aux.multip <- lm(log(res_sq) ~ coas + medi, data = airquality)
summary(aux.multip)

##
## Call:
## lm(formula = log(res_sq) ~ coas + medi, data = airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1050 -0.8973 0.1499 0.9418 2.9630
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.437e+00 5.271e-01 8.418 4.98e-09 ***
## coas 1.572e+00 6.149e-01 2.556 0.0165 *
## medi -5.288e-05 2.293e-05 -2.306 0.0290 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.521 on 27 degrees of freedom
## Multiple R-squared: 0.2731, Adjusted R-squared: 0.2192
## F-statistic: 5.072 on 2 and 27 DF, p-value: 0.01349

Here, the auxiliary regression is used to test for multiplicative heteroskedasticity: we
assume a specific functional form and restrict the z vector to two variables. Both variables
(coas and medi) have p-values below 0.05, meaning they are statistically significant in
explaining the heteroskedasticity. The F-statistic is 5.072 with p-value = 0.01349 (< 0.05),
indicating that the model as a whole is significant in explaining the variance of the
residuals. Testing for the joint significance of both coefficients, you reject the null of
homoskedastic errors at the 5% level, but not at 1%.
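
The same conclusion can be reached with an LM-type statistic; a sketch, using n times the
R-squared of the auxiliary regression against a chi-square with 2 degrees of freedom:

# sketch: LM version of the joint test, n * R^2 ~ chi-square(2)
lm_stat <- nobs(aux.multip) * summary(aux.multip)$r.squared  # 30 * 0.2731 = 8.19
pchisq(lm_stat, df = 2, lower.tail = FALSE)  # about 0.017: reject at 5%, not at 1%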

i. Based on the test in h., compute the FGLS estimator for the linear model. Compare the
results with the OLS estimates.

In general, for the FGLS estimator:

res_sq <- (res^2)

# Step 1: run auxiliary regression (estimate a model for the variance in logs)
aux.variance <- lm(log(res_sq) ~ x1 + x2, data=dataset)

# Step 2: obtain an estimate of the individual variance
variance <- exp(fitted(aux.variance))

# Step 3: estimate the model using lm and specify weights
model_fgls <- lm(y ~ x1 + x2 + x3, weights=1/variance, data=dataset)

Note: we have already done up to and including step 1 when answering h.

stargazer(model, model_fgls,
          type="text",
          no.space=TRUE)

# FGLS estimator
variance <- exp(fitted(aux.multip))
# Step 3: estimate the model using lm and specify weights
model_fgls <- lm(airq ~ vala + rain + coas + dens + medi,
                 weights = 1/variance, data = airquality)
stargazer(reg1, model_fgls,
          type = "text",
          no.space = TRUE)

##
## ==========================================================
## Dependent variable:
## ----------------------------
## airq
## (1) (2)
## ----------------------------------------------------------
## vala 0.001 0.0001
## (0.002) (0.001)
## rain 0.251 0.165
## (0.344) (0.300)
## coas -33.398*** -32.647***
## (10.458) (7.744)
## dens -0.001 -0.001
## (0.002) (0.001)
## medi 0.001 0.001*
## (0.001) (0.0004)
## Constant 111.935*** 115.794***
## (15.332) (11.715)
## ----------------------------------------------------------
## Observations 30 30
## R2 0.383 0.602
## Adjusted R2 0.254 0.519
## Residual Std. Error (df = 24) 24.203 1.628
## F Statistic (df = 5; 24) 2.979** 7.262***
## ==========================================================
## Note: *p<0.1; **p<0.05; ***p<0.01

Note that medi is now marginally significant, at about the 6% level; there is no change in
the significance of the other regressors. The R-squared is inflated: it measures the
variation of the transformed (weighted) dependent variable explained by the model, not the
variation of y itself, since observations with a high estimated variance receive less weight.
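
To compare fit on a common footing, a sketch of an R-squared computed on the original
scale of airq from the FGLS fitted values:

# sketch: R-squared on the original (unweighted) scale of airq
y <- airquality$airq
1 - sum((y - fitted(model_fgls))^2) / sum((y - mean(y))^2)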

j. Obtain the robust standard errors. You need to install estimatr and car
model_robust <- lm_robust(y ~ x1 + x2 + x3, data=dataset)
modelsummary(list(model,model_fgls,model_robust), stars=TRUE)

# Robust standard errors
model_robust <- lm_robust(airq ~ vala + rain + coas + dens + medi,
                          data = airquality)
modelsummary(list(reg1, model_fgls, model_robust), stars = TRUE)

              (1)           (2)           (3)
(Intercept)   111.935***    115.794***    111.935***
              (15.332)      (11.715)      (12.646)
vala          0.001         0.000         0.001
              (0.002)       (0.001)       (0.002)
rain          0.251         0.165         0.251
              (0.344)       (0.300)       (0.334)
coas          -33.398**     -32.647***    -33.398***
              (10.458)      (7.744)       (7.415)
dens          -0.001        -0.001        -0.001
              (0.002)       (0.001)       (0.001)
medi          0.001         0.001+        0.001
              (0.001)       (0.000)       (0.000)
Num.Obs.      30            30            30
R2            0.383         0.602         0.383
R2 Adj.       0.254         0.519         0.254
AIC           283.6         272.8         283.6
BIC           293.4         282.6         293.4
Log.Lik.      -134.815      -129.378
F             2.979         7.262
RMSE          21.65         21.81         21.65
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

OLS provides unbiased coefficients but unreliable standard errors under heteroskedasticity.
FGLS accounts for heteroskedasticity by reweighting observations according to their
estimated variance (here via a multiplicative heteroskedasticity model); it gives more
efficient estimates under heteroskedasticity, but assumes the variance model is correctly
specified. Robust standard errors keep the OLS coefficients but compute heteroskedasticity-
robust standard errors; this gives valid inference without assuming a specific variance
model, though it may be less efficient than FGLS when the heteroskedasticity model is correct.
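
lm_robust uses HC2 standard errors by default; as a sketch, the robust column above can be
reproduced with coeftest and the sandwich package (loaded for part k below):

# sketch: reproduce the lm_robust column with sandwich's HC2 estimator
library(sandwich)
coeftest(reg1, vcov = vcovHC(reg1, type = "HC2"))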

k. Obtain the bootstrapped s.e. You need to install sandwich.

coeftest(model, vcov = vcovBS(model))

# Bootstrapped standard errors
library(sandwich)

## Warning: package 'sandwich' was built under R version 4.4.2

coeftest(reg1, vcov = vcovBS(reg1))

##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.1193e+02 1.2151e+01 9.2118 2.382e-09 ***
## vala 8.8339e-04 3.4095e-03 0.2591 0.7977709
## rain 2.5070e-01 3.3457e-01 0.7493 0.4609464
## coas -3.3398e+01 8.5125e+00 -3.9235 0.0006392 ***
## dens -1.0734e-03 3.2379e-03 -0.3315 0.7431462
## medi 5.5449e-04 1.4824e-03 0.3740 0.7116556
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Bootstrapping is non-parametric and doesn't assume a specific error distribution. It is
particularly useful when classical methods (e.g., heteroskedasticity-robust covariance
estimators) might fail or when the sample size is small.
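
Because the bootstrap resamples at random, the standard errors vary slightly from run to
run; a sketch of fixing the seed and raising the number of replications (vcovBS defaults
to R = 250):

# sketch: a reproducible bootstrap with more replications
set.seed(123)  # any fixed seed makes the resampling draws reproducible
coeftest(reg1, vcov = vcovBS(reg1, R = 2000))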
