
Robust Regression Modeling with STATA
Lecture Notes

Robert A. Yaffee, Ph.D.

Statistics, Social Science, and Mapping Group
Academic Computing Services
Office: 75 Third Avenue, Level C-3
Phone: 212-998-3402
Email: yaffee@nyu.edu
1

What does Robust mean?

1. Definitions differ in scope and content. In the most general sense, robust models are models that are stable and reliable.
2. Strictly speaking: threats to stability and reliability include influential outliers, which play havoc with statistical estimation. Since 1960, many robust estimation techniques have been developed that are resistant to the effects of such outliers.
   SAS: Proc Robustreg in Version 9 deals with these.
   S-PLUS: the robust library.
   Stata: the rreg, prais, and arima models.
3. Broadly speaking: heteroskedasticity is handled with heteroskedastically consistent variance estimators.
   Stata: regress y x1 x2, robust
4. Non-normal residuals are handled with:
   1. Nonparametric regression models (Stata: qreg, rreg)
   2. Bootstrapped regression (bstrap, bsqreg)

Outline

1. Regression modeling preliminaries
   1. Tests for misspecification
      1. Outlier influence
      2. Testing for normality
      3. Testing for heteroskedasticity
      4. Autocorrelation of residuals
2. Robust techniques
   1. Robust regression
   2. Median or quantile regression
   3. Regression with robust standard errors
   4. Robust autoregression models
3. Validation and cross-validation
   1. Resampling
   2. Sample splitting
4. Comparison of STATA with S-PLUS and SAS

Preliminary Testing

Prior to linear regression modeling, use a matrix graph to confirm linearity of relationships:

graph y x1 x2, matrix

[Scatterplot matrix of y, x1, and x2]

The independent variables appear to be linearly related to y.

We try to keep the models simple. If the relationships are linear, we model them with linear models. If the relationships are nonlinear, we model them with nonlinear or nonparametric models.
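The command above uses older Stata graphics syntax. A minimal sketch of the same check in current Stata, assuming the y, x1, and x2 variables used in these notes:

* Scatterplot matrix to eyeball linearity before fitting the model
graph matrix y x1 x2, half

* Pairwise correlations as a numeric complement to the graph
pwcorr y x1 x2, sig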

Theory of Regression Analysis

What is linear regression analysis? It is finding the relationship between a dependent and an independent variable:

Y = a + bx + e

Graphically, this can be done with a simple Cartesian graph.

The Multiple Regression Formula

Y = a + bx + e

Y is the dependent variable
a is the intercept
b is the regression coefficient
x is the predictor variable
e is the error term

Graphical Decomposition of Effects

[Graph: observations around the fitted line \hat{y} = a + bx, with one point decomposed into its components]

y_i - \bar{y} = total effect
y_i - \hat{y}_i = error (residual)
\hat{y}_i - \bar{y} = regression effect

Derivation of the Intercept

y_i = a + b x_i + e_i
e_i = y_i - a - b x_i

\sum_{i=1}^{n} e_i = \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} a - b \sum_{i=1}^{n} x_i

Because by definition \sum_{i=1}^{n} e_i = 0,

0 = \sum_{i=1}^{n} y_i - na - b \sum_{i=1}^{n} x_i

na = \sum_{i=1}^{n} y_i - b \sum_{i=1}^{n} x_i

a = \bar{y} - b\bar{x}
9

Derivation of the Regression Coefficient

Given: y_i = a + b x_i + e_i
e_i = y_i - a - b x_i

\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2

Differentiating with respect to b and setting the result to zero:

\frac{\partial \sum e_i^2}{\partial b} = -2 \sum_{i=1}^{n} x_i (y_i - a - b x_i) = 0

which, with the variables expressed as deviations from their means, yields

b = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}
10

If we recall that the formula for the correlation coefficient can be expressed as follows:

r = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\left(\sum_{i=1}^{n} x_i^2\right)\left(\sum_{i=1}^{n} y_i^2\right)}}

where
x_i = X_i - \bar{X}
y_i = Y_i - \bar{Y}

then it can be seen that the regression coefficient b is a function of r:

b_j = r \cdot \frac{sd_y}{sd_x}
12

Extending the bivariate to the multivariate case

b_{yx_1 \cdot x_2} = \frac{r_{yx_1} - r_{yx_2} r_{x_1 x_2}}{1 - r_{x_1 x_2}^2} \cdot \frac{sd_y}{sd_{x_1}}    (6)

b_{yx_2 \cdot x_1} = \frac{r_{yx_2} - r_{yx_1} r_{x_1 x_2}}{1 - r_{x_1 x_2}^2} \cdot \frac{sd_y}{sd_{x_2}}    (7)

It is also easy to extend the bivariate intercept to the multivariate case as follows:

a = \bar{Y} - b_1 \bar{x}_1 - b_2 \bar{x}_2    (8)
14

Linear Multiple
Regression
Suppose that we have the
following data set.

15

Stata OLS regression model syntax

We now see that the significance levels reveal that x1 and x2


are both statistically significant. The R2 and adjusted R2
have not been significantly reduced, indicating that this model still
fits well. Therefore, we leave the interaction term pruned from the
model.
What are the assumptions of multiple linear regression analysis?

16

Regression modeling and the assumptions

1. What are the assumptions?
   1. Linearity
   2. No heteroskedasticity (constant error variance)
   3. No influential outliers in small samples
   4. No multicollinearity
   5. No autocorrelation of residuals
   6. Fixed independent variables (no measurement error)
   7. Normality of residuals

17

Testing the model for misspecification and robustness

Linearity: matrix graphs shown above
Multicollinearity: vif

Misspecification tests:
   Heteroskedasticity tests: rvfplot, hettest
   Residual autocorrelation tests: corrgram
   Outlier detection: tabulation of standardized residuals, influence assessment
   Residual normality tests: sktest
   Specification tests (not covered in this lecture)

18

Misspecification tests

We need to test the residuals for normality. We can save the residuals in STATA by issuing a command that creates them after we have run the regression command. The command to generate the residuals is:

predict resid, residuals

19

Generation of the regression residuals

20

Generation of
standardized residuals
predict rstd, rstandard

21

Generation of
studentized residuals
predict rstud, rstudent

22
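Putting the last few slides together, a minimal workflow sketch, assuming the model of y on the x1 and x2 predictors used in these notes:

* Fit the model, then save raw, standardized, and studentized residuals
regress y x1 x2
predict resid, residuals      // raw residuals
predict rstd, rstandard       // standardized residuals
predict rstud, rstudent       // studentized residuals

* Quick look at the residuals
summarize resid rstd rstud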

Testing the Residuals for Normality

1. We use the skewness-kurtosis test for normality.
2. The command for the test is:

sktest resid

This compares the skewness and kurtosis of the residuals with those of the theoretical normal distribution, combining them in a chi-square test to determine whether there is a statistically significant difference. The null hypothesis is that there is no difference. When the probability is less than .05, we must reject the null hypothesis and infer that the residuals are non-normally distributed.

Testing the Residuals


for heteroskedasticity
1. We may graph the standardized or
studentized residuals against the
predicted scores to obtain a graphical
indication of heteroskedasticity.
2. The Cook-Weisberg test is used to test
the residuals for heteroskedasticity.

24

A Graphical test of
heteroskedasticity:
rvfplot, border yline(0)

This displays any problematic patterns that might suggest heteroskedasticity. But it doesn't tell us which residuals are outliers.
25

Cook-Weisberg Test

Var(e_i) = \sigma^2 \exp(z_i t)

where
e_i = error in the regression model
z = x, or a variable list supplied by the user.

The test is whether t = 0.

hettest estimates the model  e_i^2 = \alpha + z_i t + \nu_i  and forms the score test

S = \frac{\text{model SS}}{2}

which under H_0 is distributed as \chi^2 with df = p, where p = the number of parameters.
26

Cook-Weisberg test
syntax
1. The command for this test is:
hettest resid

An insignificant result indicates lack of heteroskedasticity. That is, such a result indicates the presence of equal variance of the residuals along the predicted line. This condition is otherwise known as homoskedasticity.

27
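A minimal sketch of the test sequence (the estat form reflects current Stata syntax, where hettest is also available as estat hettest; variable names are the ones used in these notes):

* Cook-Weisberg / Breusch-Pagan test after fitting the model
regress y x1 x2
estat hettest              // uses the fitted values by default
estat hettest x1 x2        // or test against a supplied variable list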

Testing the residuals for autocorrelation

1. One can use the command dwstat after the regression to obtain the Durbin-Watson d statistic to test for first-order autocorrelation.
2. There is a better way. Generate a case-number variable to serve as a time index:

gen casenum = _n

28

Create a time
dependent series

29

Run the Ljung-Box Q statistic, which tests previous lags for autocorrelation and partial autocorrelation.

The STATA command is:

corrgram resid

The significance of the AC (autocorrelation) and PAC (partial autocorrelation) is shown in the Prob column. None of these residuals has any significant autocorrelation.
30
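A minimal end-to-end sketch of this check, assuming the data are ordered by a generated case number and using hypothetical variable names:

* Declare a time index so time-series commands will run
gen time = _n
tsset time

* Fit the model, save residuals, and inspect their autocorrelation
regress y x1 x2
predict resid, residuals
corrgram resid             // Ljung-Box Q, AC, and PAC at each lag
estat dwatson              // Durbin-Watson d for first-order autocorrelation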

One can run autoregression in the event of autocorrelation

This can be done with:

newey y x1 x2 x3, lag(1) t(time)
prais y x1 x2 x3

31

Outlier detection

Outlier detection involves determining whether the residual (error = actual minus predicted) is an extreme negative or positive value. We may plot the residuals versus the fitted values to determine which errors are large, after running the regression. The command syntax was already demonstrated with the rvfplot graph shown earlier: rvfplot, border yline(0)

32

Create Standardized Residuals

A standardized residual is one divided by its standard deviation.

resid_{standardized} = \frac{y_i - \hat{y}_i}{s}

where s = standard deviation of the residuals

33

Standardized residuals
predict residstd, rstandard
list residstd
tabulate residstd

34

Limits of Standardized Residuals

If the standardized residuals have values in excess of 3.5 or below -3.5, they are outliers. If the absolute values are less than 3.5, as these are, then there are no outliers.

While outliers by themselves only distort mean prediction when the sample size is small enough, it is important to gauge the influence of outliers.
35

Outlier Influence
Suppose we had a different
data set with two outliers.
We tabulate the standardized
residuals and obtain the
following output:

36

Outlier a does not distort the regression line but outlier b does.

[Graph: regression line Y = a + bx with outliers a and b marked]

Outlier b has bad leverage and outlier a does not.

37

In this data set, we have two outliers. One is negative and the
other is positive.

38

Studentized Residuals
Alternatively, we could form
studentized residuals. These are
distributed as a t distribution with
df=n-p-1, though they are not
quite independent. Therefore, we
can approximately determine if
they are statistically significant or
not.
Belsley et al. (1980)
recommended the use of
studentized residuals.
39

Studentized Residual

e_i^s = \frac{e_i}{\sqrt{s_{(i)}^2 (1 - h_i)}}

where
e_i^s = studentized residual
s_{(i)} = standard deviation of the residuals with the ith observation deleted
h_i = leverage statistic

These are useful in estimating the statistical significance of a particular observation, for which a dummy variable indicator is formed. The t value of the studentized residual will indicate whether or not that observation is a significant outlier.

The command to generate studentized residuals, called rstudt, is:

predict rstudt, rstudent
40

Influence of Outliers

1. Leverage is measured by the diagonal components of the hat matrix.
2. The hat matrix comes from the formula for the regression of Y:

\hat{Y} = X\beta = X(X'X)^{-1}X'Y

where X(X'X)^{-1}X' = the hat matrix, H. Therefore,

\hat{Y} = HY

41

Leverage and the Hat Matrix

1. The hat matrix transforms Y into the predicted scores.
2. The diagonals of the hat matrix indicate which values will be outliers or not.
3. The diagonals are therefore measures of leverage.
4. Leverage is bounded by two limits: 1/n and 1. The closer the leverage is to unity, the more leverage the value has.
5. The trace of the hat matrix = the number of variables in the model.
6. When the leverage > 2p/n there is high leverage, according to Belsley et al. (1980), cited in Long, J.F., Modern Methods of Data Analysis (p. 262). For smaller samples, Velleman and Welsch (1981) suggested 3p/n as the criterion.

42

Cook's D

1. Another measure of influence.
2. This is a popular one. The formula for it is:

Cook's D_i = \frac{e_i^2}{p\, s^2} \cdot \frac{h_i}{(1 - h_i)^2}

Cook and Weisberg (1982) suggested that values of D that exceed the 50th percentile of the F distribution (df = p, n - p) are large.

43

Using Cook's D in STATA

predict cook, cooksd

Finding the influential outliers:

list cook if cook > 4/_N

Belsley suggests 4/(n-k-1) as a cutoff.

44
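A minimal sketch of the full sequence, with the 4/n cutoff computed explicitly (variable names are hypothetical):

* Cook's distance after the regression
regress y x1 x2
predict cook, cooksd

* Flag observations above the conventional 4/n cutoff
scalar cutoff = 4/_N
list cook if cook > cutoff & !missing(cook)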

Graphical Exploration of Outlier Influence

graph cook residstd, xlab ylab

The two influential outliers can be found easily here in the upper right.

45

DFbeta

One can use the DFbetas to ascertain the magnitude of influence that an observation has on a particular parameter estimate if that observation is deleted.

DFbeta_j = b_j - b_{(i)j} = \frac{u_j \, e_i}{(1 - h_i) \sum u_j^2}

where u_j = the residuals of the regression of x_j on the remaining x's.
46

Obtaining DFbetas in
STATA

47
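As a minimal sketch, DFbetas can be obtained after regress either with the dfbeta command or with predict (variable names here are hypothetical):

regress y x1 x2

* All DFbetas at once; Stata creates variables such as _dfbeta_1, _dfbeta_2
dfbeta

* Or one coefficient at a time
predict dfb_x1, dfbeta(x1)
summarize dfb_x1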

Robust statistical options when assumptions are violated

1. Nonlinearity
   1. Transformation to linearity
   2. Nonlinear regression
2. Influential outliers
   1. Robust regression with robust weight functions: rreg y x1 x2
3. Heteroskedasticity of residuals
   1. Regression with Huber/White/sandwich variance-covariance estimators: regress y x1 x2, robust
4. Residual autocorrelation correction
   1. Autoregression with prais y x1 x2, robust
   2. Newey-West regression
5. Nonnormality of residuals
   1. Quantile regression: qreg y x1 x2
   2. Bootstrapping the regression coefficients
48

Nonlinearity: Transformations to linearity

1. When the equation is not intrinsically nonlinear, the dependent variable or independent variable may be transformed to effect a linearization of the relationship.
2. Semi-log, translog, Box-Cox, or power transformations may be used for these purposes.
   1. Box-Cox regression determines the optimal parameters for many of these transformations.
49

Fix for nonlinear functional form: Nonlinear Regression Analysis

Examples of two exponential growth curve models, the first of which we estimate with our data:

nl exp2 y x    estimates  y = b1 * b2^x
nl exp3 y x    estimates  y = b0 + b1 * b2^x

50

Nonlinear Regression in Stata

. nl exp2 y x
(obs = 15)

Iteration 0:  residual SS = 56.08297
Iteration 1:  residual SS = 49.46372
Iteration 2:  residual SS = 49.4593
Iteration 3:  residual SS = 49.4593

      Source |         SS    df          MS         Number of obs =       15
-------------+------------------------------        F(2, 13)      =  1585.01
       Model |  12060.5407     2  6030.27035        Prob > F      =   0.0000
    Residual |  49.4592999    13  3.80456153        R-squared     =   0.9959
-------------+------------------------------        Adj R-squared =   0.9953
       Total |       12110    15  807.333333        Root MSE      = 1.950529
                                                    Res. dev.     = 60.46465

2-param. exp. growth curve, y = b1*b2^x
------------------------------------------------------------------------------
           y |     Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
          b1 |  58.60656    1.472156   39.81    0.000     55.42616    61.78696
          b2 |  .9611869    .0016449  584.36    0.000     .9576334    .9647404
------------------------------------------------------------------------------
(SEs, P values, CIs, and correlations are asymptotic approximations)

51
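The exp2 shorthand comes from older Stata releases. A minimal sketch of the same two-parameter growth curve in current substitutable-expression syntax (the starting values are illustrative assumptions):

* y = b1 * b2^x, with rough starting values for the parameters
nl (y = {b1=50}*{b2=1}^x)

* Three-parameter version with an intercept term b0
nl (y = {b0=0} + {b1=50}*{b2=1}^x)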

Heteroskedasticity correction

1. Prof. Halbert White showed that heteroskedasticity could be handled in a regression with a heteroskedasticity-consistent covariance matrix estimator (Davidson & MacKinnon (1993), Estimation and Inference in Econometrics, Oxford University Press, p. 552).
2. This variance-covariance matrix under ordinary least squares is shown on the next page.
52

OLS Covariance Matrix Estimator

(X'X)^{-1} (X' \hat{\Phi} X) (X'X)^{-1}

where, under homoskedasticity, \hat{\Phi} = s^2 I, so that the estimator reduces to the familiar s^2 (X'X)^{-1}.

53

White's heteroskedasticity-consistent estimator

1. White's estimator is for large samples.
2. White's heteroskedasticity-corrected variances and standard errors can be larger or smaller than the OLS variances and standard errors.

54

Heteroskedastically consistent covariance matrix: the sandwich estimator (H. White)

      Bread                    Meat (tofu)             Bread
n^{-1} (n^{-1} X'X)^{-1} \; (n^{-1} X' \hat{\Phi} X) \; (n^{-1} X'X)^{-1}

where \hat{\Phi} = diag(\phi_t). However, there are different versions of the weights \phi_t:

HC0:  \phi_t = e_t^2
HC1:  \phi_t = \frac{n}{n-k}\, e_t^2
HC2:  \phi_t = \frac{e_t^2}{1 - h_t}
HC3:  \phi_t = \frac{e_t^2}{(1 - h_t)^2}

55

Regression with robust standard errors for heteroskedasticity

regress y x1 x2, robust

Options other than robust are hc2 and hc3, referring to the versions described by Davidson and MacKinnon above.

56

Robust options for the VCV matrix in Stata

regress y x1 x2, hc2
regress y x1 x2, hc3

These correspond to Davidson and MacKinnon's versions 2 and 3 of the heteroskedastically consistent VCV estimator.

57
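In current Stata releases these options are usually written through vce(); a minimal sketch with the hypothetical variables used above:

* White/Huber sandwich standard errors
regress y x1 x2, vce(robust)

* HC2 and HC3 small-sample variants
regress y x1 x2, vce(hc2)
regress y x1 x2, vce(hc3)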

Problems with Autoregressive Errors

1. Problems in estimation with OLS
   1. When there is first-order autocorrelation of the residuals,
      e_t = \rho e_{t-1} + v_t
2. Effect on the variance
   1. e_t^2 = \rho^2 e_{t-1}^2 + v_t^2

58

Sources of Autocorrelation
1. Lagged endogenous variables
2. Misspecification of the model
3. Simultaneity, feedback, or reciprocal
relationships
4. Seasonality or trend in the model

59

Prais-Winsten Transformation (cont'd)

e_t^2 = \frac{v_t^2}{1 - \rho^2}, \quad \text{therefore} \quad e_t = \frac{v_t}{\sqrt{1 - \rho^2}}

It follows that

Y_t = a + b x_t + \frac{v_t}{\sqrt{1 - \rho^2}}

\sqrt{1 - \rho^2}\, Y_t = \sqrt{1 - \rho^2}\, a + \sqrt{1 - \rho^2}\, b x_t + v_t

Y_t^* = a^* + b x_t^* + v_t
60

Autocorrelation of the residuals: prais & newey regression

To test whether the variable is autocorrelated:

tsset time
corrgram y

prais y x1 x2, robust
newey y x1 x2, lag(1) t(time)

61
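A minimal sketch of both corrections, assuming the data have been tsset on a time variable (the lag length of 1 is illustrative):

tsset time

* Prais-Winsten AR(1) regression with a sandwich VCE
* (current syntax; the slides write the same option as ", robust")
prais y x1 x2, vce(robust)

* Newey-West standard errors allowing autocorrelation up to lag 1
newey y x1 x2, lag(1)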

Testing for autocorrelation


of residuals
regress mna10 l5sumprc
predict resid10, residual
corrgram resid10

62

Prais-Winsten Regression for AR(1) errors

Using the robust option here guarantees that the White heteroskedasticity-consistent sandwich variance-covariance estimator will be used in the autoregression procedure.

63

Newey-West Robust Standard Errors

An autocorrelation correction is added to the meat (or tofu) of the White sandwich estimator by Newey-West.

n^{-1} (n^{-1} X'X)^{-1} \; (n^{-1} X' \hat{\Phi} X) \; (n^{-1} X'X)^{-1}

where, as before, the weights \phi_t in \hat{\Phi} may be any of the different versions:

HC0:  \phi_t = e_t^2
HC1:  \phi_t = \frac{n}{n-k}\, e_t^2
HC2:  \phi_t = \frac{e_t^2}{1 - h_t}
HC3:  \phi_t = \frac{e_t^2}{(1 - h_t)^2}
64

Central part of the Newey-West sandwich estimator

X' \hat{\Omega} X_{\,Newey\text{-}West} = X' \hat{\Omega} X_{\,White}
  + \frac{n}{n-k} \sum_{l=1}^{m} \left(1 - \frac{l}{m+1}\right) \sum_{t=l+1}^{n} e_t e_{t-l} \left( x_t' x_{t-l} + x_{t-l}' x_t \right)

where
k = number of predictors
l = time lag
m = maximum time lag

65

Newey-West Robust
Standard errors
Newey West standard errors are robust to autocorrelation
and heteroskedasticity with time series regression models.

66

Assume OLS
regression
We regress y on x1 x2 x3
We obtain the following output

Next we examine the residuals

67

Residual Assessment

The data set is too small to drop case 21, so I use robust regression.

68

Robust regression algorithm: rreg

1. A regression is performed and the absolute residuals are computed:

r_i = | y_i - x_i b |

2. These residuals are then scaled:

u_i = \frac{r_i}{s} = \frac{| y_i - x_i b |}{s}
69

Scaling the residuals

s = \frac{M}{0.6745}

where

M = \mathrm{med}\left( \left| r_i - \mathrm{med}(r_i) \right| \right)

The residuals are scaled by the median absolute deviation about the median residual.

70

Essential Algorithm

The estimator of the parameter b minimizes the sum of a less rapidly increasing function of the residuals (SAS Institute, The Robustreg Procedure, draft copy, p. 3505, forthcoming):

Q(b) = \sum_{i=1}^{n} \rho\!\left(\frac{r_i}{\sigma}\right)

where r_i = y_i - x_i b, and \sigma is estimated by s.
71

Essential algorithm (cont'd)

1. If this were OLS, \rho would be a quadratic function.
2. If we can ascertain s, we can, by taking derivatives with respect to b, find a first-order solution to

\sum_{i=1}^{n} \psi\!\left(\frac{r_i}{s}\right) x_{ij} = 0, \quad j = 1, \ldots, p

where \psi = \rho'.

72

Case weights are developed


from weight functions
1. Case weights are formed based
on those residuals.
2. Weight functions for those case
weights are first the Huber
weights and then the Tukey
bisquare weights:
3. A weighted regression is rerun
with the case weights.

73

Iteratively reweighted least squares

The case weight w(x) is defined as:

w(x) = \frac{\psi(x)}{x}

It is updated at each iteration until it converges on a value and the change from iteration to iteration declines below a criterion.

74

Weight functions for reducing outlier influence

c is the tuning constant used in determining the case weights. For the Huber weights, c = 1.345 by default.

75

Weight Functions

Tukey biweight (bisquare)

c is also the biweight tuning constant; c is set at 4.685 for the biweight.
76

Tuning Constants
When the residuals are normally
distributed and the tuning
constants are set at the default,
they give the procedure about
95% of the efficiency of OLS.
The tuning constants may be
adjusted to provide
downweighting of the outliers at
the expense of Gaussian
efficiency.
Higher tuning constants cause the
estimator to more closely
approximate OLS.
77

Robust Regression algorithm (cont'd)

3. WLS regression is performed using those case weights.
4. Iterations cease when changes in the case weights drop below a tolerance level.
5. Weights are based initially on Huber weights. Then Beaton and Tukey biweights are used.
6. Caveat: M estimation is not that robust with regard to leverage points.
78

Robust Regression for down-weighting outliers

rreg y x1 x2 x3

This uses Huber and Tukey biweights to downweight the influence of outliers in the estimation of the mean of y (upper panel), whereas OLS regression is given in the lower panel.

79
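A minimal sketch of the comparison described above, assuming the hypothetical variables y, x1, x2, and x3:

* Robust (M-estimation) fit versus the OLS fit
rreg y x1 x2 x3
estimates store robustfit

regress y x1 x2 x3
estimates store olsfit

* Side-by-side coefficient comparison
estimates table robustfit olsfit, se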

A Corrective Option for Nonnormality of the Residuals

1. Quantile regression (median regression is the default) is one option.
2. Algorithm
   1. Minimizes the sum of the absolute residuals.
   2. The residual in this case is the value minus the predicted (conditional) median.
3. This produces a formula that predicts the median of the dependent variable:

Y_{med} = a + bx
80

Quantile Regression

qreg in STATA estimates least absolute value regression (LAV, MAD, or L1-norm regression). The algorithm minimizes the sum of the absolute deviations about the median. The formula generated estimates the median rather than the mean (which rreg estimates):

Y_{median} = constant + bx
81

Median regression

82
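A minimal sketch (median regression by default, with an additional quantile shown for illustration; variable names are hypothetical):

* Median (0.5 quantile) regression
qreg y x1 x2 x3

* First-quartile regression of the same model
qreg y x1 x2 x3, quantile(.25)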

Bootstrapping
Bootstrapping may be used to
obtain empirical regression
coefficients, standard errors,
confidence intervals, etc. when
the distribution is non-normal.
Bootstrapping may be applied
to qreg with bsqreg

83

Bootstrapping quantile or
median regression
standard errors
qreg y x1 x2 x3
bsqreg y x1 x2 x3, reps(1000)

84

Methods of Model
Validation
These methods may be
necessary where the sampling
distributions of the parameters
of interest are nonnormal or
unknown.
Bootstrapping
Cross-validation
Data-splitting

85

Bootstrapping
When the distribution of the
residuals is nonnormal or the
distribution is unknown,
bootstrapping can provide
proper regression coefficients,
standard errors, and
confidence intervals.

86

Stata Bootstrapping Syntax

bs "regress y x1 x2 x3" "_b[x1] _b[x2] _b[x3]", reps(1000) saving(mybstrap1)

87
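The line above uses the older bs command. A minimal sketch with the current bootstrap prefix (the file name mybstrap1 is carried over from the slide, and reps(1000) as before):

* Bootstrap the regression coefficients, saving the replicates
bootstrap _b, reps(1000) saving(mybstrap1, replace): regress y x1 x2 x3

* Percentile-based confidence intervals from the replicates
estat bootstrap, percentile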

Internal Validation

R-squared and adjusted R-squared

1. Plot predicted Y against observed Y. Compute an R-squared and an adjusted R-squared.

88

Cross-validation: Jackknifing

This is repeated sampling, where one group or observation is left out. The analysis is reiterated and the results are averaged to obtain a validation.

89
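A minimal sketch using Stata's jackknife prefix, which leaves out one observation at a time and averages the results (variable names are hypothetical):

* Leave-one-out jackknife estimates of the regression coefficients
jackknife _b: regress y x1 x2 x3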

Resampling

1. Bootstrapping was developed by Efron. Resampling generally needs to be done at least B = 100 times.
2. Resampling with replacement is performed on a sample. From each bootstrapped sample, a mean is computed. The average of all of these b bootstrapped means is the bootstrapped mean.
3. The bootstrapped means are used to compute a bootstrapped variance estimate. If b is the number of bootstraps, then b is the n used in the computation. A bootstrapped variance estimate is now known.
4. After enough resampling, an empirical distribution function is formed.

90

Bootstrapped Formulae

\bar{x}_b = \frac{\sum x_i}{n}

Var(\bar{x})_b = \frac{\sum_{b=1}^{B} \left( \bar{x}_b - \overline{\bar{x}} \right)^2}{B - 1}

91

Data-splitting
1. Sample Splitting
1. Subset the sample into a training
and a validation subsample. One
has to be careful about the tail
wagging the dog, as David Reilly
is wont to say.
2. This results in poorer accuracy and
loss of power unless there is plenty
of data.
3. Tests for parameter constancy

92

Comparison of STATA, SAS, and S-PLUS

Stata has rreg, qreg, and bsqreg:
   rreg is M estimation with Huber and Tukey bisquare weight functions
   qreg is quantile regression
   bsqreg is bootstrapped quantile regression
   Bootstrapping

SAS has M, Least Trimmed Squares, S, and MM estimation in Proc Robustreg in version 9. It can perform robust ANOVA as well. SAS has 10 different weight functions that may be applied. It does not have bootstrapping.

S-PLUS has a robust library of procedures. Among the procedures it can apply are robust regression, robust ANOVA, robust principal components analysis, robust covariance matrix estimation, robust discriminant function analysis, and robust distribution estimation for asymmetric distributions. S-PLUS has procedures to run OLS regression side by side with robust MM regression to show the differences. It has a wide variety of graphical diagnostics as well.

93
