An Introduction To Robust Regression

1 Introduction
In OLS regression, the parameter estimates are obtained by minimizing the sum of squared deviations of the observed responses from the fitted values, i.e.:

$$\hat{\beta}_j = \underset{\beta_j}{\arg\min} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \qquad \text{where } \hat{y} = \hat{\beta}_0 + \sum_{j=1}^{p} \hat{\beta}_j X_j, \quad j = 0, 1, \ldots, p$$
OLS regression is popular because of the convenience brought about by its properties: the parameter estimates are BLUE, computation is easy, and interpretation is simple.
However, there is a caveat to the beauty of OLS regression: it imposes stringent assumptions, viz. normality, independence of observations, and homoskedasticity. OLS is quite sensitive to departures from these classical assumptions.
But it is not just the fulfillment of the classical assumptions that affects the tenability of inferences. OLS regression is also quite sensitive to outliers because of the nature of how the parameter estimates are arrived at.
1.1 Outliers: What they are and what they do
Outliers persist for various reasons: encoding errors, data contamination, or observations surrounded by unique circumstances. Regardless of source, outliers pose a serious threat to data analysis through the distortion of the resulting inferences.
In fact, the presence of outliers introduces non-normality into the equation through heavy-tailed error distributions (Hamilton, 1992). Robust regression assigns lower weights to outlying observations so as to limit their spurious influence, thus rendering the inferences resistant.
In order to appreciate the benefits brought by robust regression, the different characteristics of outliers and how they garble the analysis are presented first.
Leverage Point An observation whose explanatory value(s) lie far from the bulk of the dataset is deemed a leverage point. Leverage points need to be paid special attention because of their potential to greatly influence the resulting OLS estimates. Ipso facto, the presence of a leverage point has the potential to severely distort inferences made from the subject data.
To illustrate its effect (and understand where the term "leverage" comes from), consider the following datasets taken from Rousseeuw and Leroy (1987).
Dataset.noLev
     x     y
1 0.20  0.30
2 1.00  1.23
3 1.27  1.78
4 1.57  2.79
5 2.10  3.90

Dataset.wLev
     x     y
1 5.00  0.30
2 1.00  1.23
3 1.27  1.78
4 1.57  2.79
5 2.10  3.90
One of the datapoints in Dataset.noLev has been erroneously encoded into Dataset.wLev, in particular its x-value, causing an observation to lie far from the other data points along the x-axis (a plot is presented in Figure 1 to better visualize the datasets). The resulting fitted OLS models on the two datasets are then compared.
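To make the comparison reproducible, here is a minimal R sketch (the data frames mirror the listings above; lm() is base R's OLS routine; this is an illustration rather than the original computation):

# Recreate the two datasets from the listings above
Dataset.noLev <- data.frame(x = c(0.20, 1.00, 1.27, 1.57, 2.10),
                            y = c(0.30, 1.23, 1.78, 2.79, 3.90))
Dataset.wLev <- Dataset.noLev
Dataset.wLev$x[1] <- 5.00        # the erroneously encoded x-value (the leverage point)

# Fit OLS on both datasets and compare the summaries
fit.noLev <- lm(y ~ x, data = Dataset.noLev)
fit.wLev  <- lm(y ~ x, data = Dataset.wLev)
summary(fit.noLev)               # R-squared near 0.96, slope near 1.93
summary(fit.wLev)                # R-squared near 0.23, slope near -0.41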
Fitted OLS Model without Leverage
R-squared: 0.9557
Parameter Estimates:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.3726911 0.3316215 -1.123845 0.342892886
x 1.9321589 0.2402323 8.042877 0.004014026
Fitted OLS Model with Leverage
R-squared: 0.2277
Parameter Estimates:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.8956610 1.1431918 2.5329618 0.08520275
x -0.4093515 0.4352826 -0.9404268 0.41637622
Notice that the stability of the OLS model fitted on the dataset with the leverage point is comparably lower than that of the one fitted on the dataset without, with the R-squared falling from 0.9557 to 0.2277. Furthermore, the validity of the parameter estimates of the fitted OLS model has become dubious upon the introduction of the leverage point, as reflected by the difference in the standard errors (or equivalently, the p-values).
Apart from the degradation in the tenability of the parameter estimates, juxtaposing the two models also points out the drastic change in the estimated slope, which even switches sign. This is a very dangerous situation in the context of regression, as it may lead to misleading inferences.
[Figure 1: Scatterplot and fitted OLS lines of the two datasets (legend: no leverage, with leverage).]
The substantial change in the values of the parameter estimates caused by the presence of the leverage point is illustrated in Figure 1. On another note, notice that the outlier pulled the fitted OLS line towards it, similar to how an external force acting on a lever changes the lever's orientation (hence the term "leverage").
Setting the trivia aside, the potential for leverage points to mislead does not only come from the wild change in parameter estimates, but also from how the drastic change in the fitted OLS line masks which observations are supposed to be treated as outliers. In other words, discrimination of outliers based on the fitted regression model becomes misleading as well.
To provide insight on this, the residuals of Dataset.wLev from its own fitted OLS model are compared with the residuals of the same dataset measured against the model fitted on Dataset.noLev.[1]

[1] Technically speaking, this kind of procedure is spurious; the proper procedure will be discussed later on. But for the purpose of illustrating the effects of leverage points, this example is enough, since the premise is that the fitted OLS model on the dataset without a leverage point and the model fitted on the bulk of the observations in the dataset are close enough.
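A rough R sketch of this comparison, reusing fit.noLev and fit.wLev from the earlier snippet, is shown below. The exact standardization behind the Std.Residuals columns in the tables that follow is not spelled out here, so the sketch reports R's rstandard() for the own-model case and only raw residuals for the cross-model case; the numbers need not match the tables digit for digit.

# Residuals of Dataset.wLev from its own OLS fit
data.frame(Dataset.wLev,
           Residuals     = resid(fit.wLev),
           Std.Residuals = rstandard(fit.wLev))

# Residuals of the same observations measured against the model
# fitted on Dataset.noLev (the dataset without the leverage point)
Dataset.wLev$y - predict(fit.noLev, newdata = Dataset.wLev)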
From the OLS Model Fitted on Dataset.wLev
x y Residuals Std.Residuals
1 5.00 0.30 -0.5489037 -0.345044
2 1.00 1.23 -1.2563095 -1.331951
3 1.27 1.78 -0.5957846 -0.689797
4 1.57 2.79 0.5370208 0.676806
5 2.10 3.90 1.8639771 2.548244
From Model Fitted on Data without Leverage
x y Residuals Std.Residuals
1 5.00 0.30 -8.9881033 -1.97266528
2 1.00 1.23 -0.3294678 -0.13005900
3 1.27 1.78 -0.3011507 -0.12613446
4 1.57 2.79 0.1292017 0.04767381
5 2.10 3.90 0.2151575 0.05292135
Looking at the standardized residuals from the first and second models generated, it can be observed that the observations considered as outliers by the two models are different. The OLS model identifies the observation at x = 2.1 as an outlier, despite its consistency with the general linear trend followed by the rest of the points. Meanwhile, the second model discriminates the erroneously encoded observation at x = 5.00 as the relatively wayward one, which is not surprising because the second set of residuals was obtained from a model fitted on a set of points that closely follow a specific linear trend.
While on the topic of identifying outlying observations using residuals, it is worth mentioning that Studentized residuals could be applied to the OLS model fitted on the dataset with a leverage point to discover that, in this example, the first observation rather than the fifth is the actual outlier; however, this is not always the case. The Studentized residual singles out only one observation at a time. Ipso facto, when other outliers are included in the computation of the Studentized residual of one of the actual outliers, the Studentized residual may fail to inflate.
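For reference, externally Studentized residuals are available in base R through rstudent(); applied to the leverage example (fit.wLev from the earlier sketch), they hold out one observation at a time before rescaling, which is exactly why a single wayward point can be caught while a group of outliers can mask one another:

rstudent(fit.wLev)   # externally Studentized residuals, one observation held out at a time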
Apart from the propensity of leverage points to severely affect analysis through spurious estimates and misleading discrimination of outliers, special attention is given to them because they are more likely to occur in a multidimensional setting. Naturally, the consideration of more explanatory variables provides more opportunities for leverage points to appear.
That said, not all outliers are detrimental to analysis; some outliers are benign simply because they do not debilitate the inferences.

Figure 2 highlights how important it is to keep in mind that leverage points only have the potential to impair analysis. While Figure 2(a) is similar to the earlier illustration of the effects of leverage points (Dataset.noLev and Dataset.wLev), Figure 2(b) shows that the outlying observation, despite being a leverage point, is still consistent with the linear trend followed by the bulk of the data points.

[Figure 2: Examples of a Debilitating and a Benign Leverage Point. (a) Debilitating outlier; (b) benign outlier. Source: Hamilton (1992)]
If a leverage point is consistent with the linear trend followed by the majority of the dataset, it follows that the observation is not just a leverage point but also an outlier in the y-direction. This is not to say, however, that an observation that is both a leverage point and an outlier in the y-direction is necessarily consistent with the linear trend of the majority: if the value of its response or of one of its explanatory variables is too far off, it will not follow the linear trend.
Outlying points that deviate away from the linear trend exhibited by the majority of the
datapoints are labeled as regression outliers.
That said, it is actually the presence of regression outliers that erodes the tenability of the parameter estimates. So, leverage points are considered detrimentally influential if they are regression outliers as well. If not, then the subject observation is a benign leverage point.
1.2 Robust Regression: What does it do
It is worth mentioning, before discussing the essence of robust regression, that in OLS regression outliers are determined based on their deviation from the fitted line using various measures such as the adjusted, standardized, and Studentized residuals; DFFits; DFBetas; Cook's Distance; etc.
As previously mentioned, this sort of discrimination based on residuals could lead to complications, as fitting an OLS line may mask regression outliers before their effects can be marginalized.
Robust regression, on the other hand, fits a line using resistant estimators first. By using resistant estimators in the estimation process itself, the effects of outlying values are marginalized, thus obtaining a robust solution.
[Figure 3: Illustration of the difference in response of OLS and robust regression to outliers. Source: Hamilton (1992)]
Notice in Figure 3 that the fitted OLS model was pulled downward by the data values of four regression outliers (San Jose, San Diego, San Francisco, and Los Angeles), while the model fitted using robust regression ignored the excessive influence imposed by said outliers. In this sense, robust regression is sometimes referred to as resistant regression.
It is noteworthy that these resistant estimators essentially assign weights to observations (similar to L-estimators). Wayward observations are simply assigned lower weights, though not always zero weights.
Outliers are then identified based on their deviation from the robust solution.
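As one concrete (and widely available) option, the MASS package in R exposes LMS and LTS regression through lqs(); a brief sketch on the leverage example from Section 1.1 might look like the following, with Dataset.wLev taken from the earlier snippet:

library(MASS)    # provides lqs() for resistant regression

rob.lms <- lqs(y ~ x, data = Dataset.wLev, method = "lms")
rob.lts <- lqs(y ~ x, data = Dataset.wLev, method = "lts")

coef(lm(y ~ x, data = Dataset.wLev))   # OLS slope is dragged toward the leverage point
coef(rob.lms)                          # the resistant fits should follow the bulk of the data
coef(rob.lts)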
While there is a plethora of robust estimators available (e.g. the repeated median and iteratively reweighted least squares), this article will focus on only two: the least median of squares and the least trimmed squares. These two estimators are noted for their very high breakdown bound.
2 Univariate Robust Estimation
This section presents two estimators used in robust regression, the least median of squares (LMS) and the least trimmed squares (LTS), but only under the univariate setting. LMS and LTS estimation for multidimensional data will be presented in the next section, already under the context of regression, since estimation involving more than one variable is usually done in that setting.
That said, this section will proceed as follows: a brief description of each estimation procedure is presented, followed by an outline of its computation, then an illustration, ending with a presentation of its properties.
2.1 Least Median of Squares (LMS)
As the name implies, the LMS estimator, $\hat{\theta}_{LMS}$ (where $\theta$ denotes the location parameter being estimated), is computed as the value that would minimize the median squared deviation, i.e.:

$$\hat{\theta}_{LMS} = \arg\inf_{\theta} \left\{ \operatorname{Med}\left[ \left( y_i - \theta \right)^2 \right] \right\}$$
In fact, this definition would imply that the LMS estimator's objective function is given by:

$$\rho\left( y_i; \theta \right) = \operatorname{Med}\left[ \left( y_i - \theta \right)^2 \right]$$
However, LMS estimation is not a form of M-estimation, because the objective function above does not include all observations, which is inconsistent with the definition of M-estimation. In fact, the Help and Documentation of SAS 9.3 (SAS Institute, Inc., 2011) differentiates LMS and LTS estimation from M-estimation.
Computing for the LMS estimator
(1) The first order of business is to arrange the batch of size $n$ in ascending order:

$$y_{(1)}, y_{(2)}, \ldots, y_{(n)} \qquad y_{(1)} \leq y_{(2)} \leq \ldots \leq y_{(n)}$$

If $n$ is odd, just repeat the median and include it in the ordered batch. Adjust the batch size accordingly and still denote it by $n$.
(2) Compute for:

$$h = \left\lfloor \frac{n}{2} \right\rfloor + 1$$
(3) Partition the batch into two parts, where the second partition starts at $y_{(h)}$. Denote them as:

$$\begin{array}{cccc} y_{(1)}, & y_{(2)}, & \ldots, & y_{(n-h+1)} \\ y_{(h)}, & y_{(h+1)}, & \ldots, & y_{(n)} \end{array}$$

Note that both of the sub-batches are of size $n - h + 1$. Ipso facto, there is a one-to-one correspondence between the sub-batches.
(4) Compute for:

$$y^{(d)}_i = y_{(i+h-1)} - y_{(i)}, \qquad i = 1, 2, \ldots, n-h+1$$
(5) The LMS estimate is the midpoint of the values corresponding to the pair with the least difference, i.e.:

$$\hat{\theta}_{LMS} = \frac{y_{(h+k-1)} + y_{(k)}}{2} \qquad \text{where } y_{(h+k-1)} - y_{(k)} = \min_{i}\; y^{(d)}_i$$
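The five steps translate directly into a short R function. This is only a sketch (the name lms_estimate is ours, not a package function), but it reproduces the illustration that follows:

lms_estimate <- function(y) {
  y <- sort(y)                          # step (1): sort the batch
  n <- length(y)
  if (n %% 2 == 1) {                    # odd n: repeat the median and adjust n
    y <- sort(c(y, median(y)))
    n <- n + 1
  }
  h <- floor(n / 2) + 1                 # step (2)
  d <- y[h:n] - y[1:(n - h + 1)]        # steps (3)-(4): differences of the paired sub-batches
  k <- which.min(d)                     # step (5): pair with the least difference
  (y[k + h - 1] + y[k]) / 2             # midpoint of that pair
}

lms_estimate(c(40, 75, 80, 83, 86, 88, 90, 92, 93, 95))   # returns 90.5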
Illustration Consider the following batch of numbers taken from Rousseeuw and Leroy
(1987):
40 75 80 83 86 88 90 92 93 95
Note that $n = 10$, which means that $h = \left\lfloor \frac{10}{2} \right\rfloor + 1 = 6$. The two sub-batches are therefore:

40 75 80 83 86   (first sub-batch)
88 90 92 93 95   (second sub-batch)
After dividing the batch into two sub-batches, the sub-batches are then paired up and their differences obtained:

Second sub-batch:  88  90  92  93  95
First sub-batch:   40  75  80  83  86
Difference:        48  15  12  10   9

$$\min\left\{ y^{(d)}_1, y^{(d)}_2, y^{(d)}_3, y^{(d)}_4, y^{(d)}_5 \right\} = \min\{48, 15, 12, 10, 9\} = 9 = y^{(d)}_5 = y_{(10)} - y_{(5)} = 95 - 86$$

$$\hat{\theta}_{LMS} = \frac{y_{(10)} + y_{(5)}}{2} = \frac{95 + 86}{2} = 90.5$$
Properties of the LMS estimator Having laid down its computation, some salient properties of the LMS estimator are presented. The LMS estimator:

1. has a breakdown bound of 50% (illustrated in the sketch below);
2. is location and scale equivariant (i.e. linearly equivariant);
3. always has a solution to its objective function;
4. has an objective function that is not smooth; and
5. has an objective function with a low convergence rate.
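To make the 50% breakdown bound concrete, here is a quick check with the lms_estimate() sketch from above: corrupting four of the ten values wrecks the mean but barely moves the LMS estimate.

y <- c(40, 75, 80, 83, 86, 88, 90, 92, 93, 95)
c(mean = mean(y), median = median(y), lms = lms_estimate(y))

y.bad <- replace(y, 1:4, c(400, 500, 600, 700))    # corrupt 4 of the 10 values
c(mean = mean(y.bad), lms = lms_estimate(y.bad))   # the LMS estimate is still 90.5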
[Figure 4: Wilkinson dot plot and location of the mean, median, and the LMS estimator.]
In addition to the properties mentioned, the LMS estimator is also considered a sort of mode estimator, in that it tends toward the modal value of the batch. In simpler terms, the LMS estimator tends to where the values cluster, as seen in Figure 4, compared to orthodox location estimators such as the mean and median.

Since the LMS estimator is affected by the shape (or the skewness) of the data, it is inherently less reliable than other robust estimators because it is more variable.
Despite its higher variability and a non-smooth objective function with a slow convergence rate, the LMS estimator is generalizable to the multidimensional case while still maintaining a high breakdown bound and linear equivariance.
2.2 Least Trimmed Squares (LTS)
The LTS estimator, meanwhile, is computed as the value that would minimize the trimmed sum of ordered squared deviations. In mathematical notation:

$$\hat{\theta}_{LTS} = \arg\inf_{\theta} \left\{ \sum_{i=1}^{h} r^2_{(i)} \right\} \qquad \text{where} \quad h = \left\lfloor \frac{n}{2} \right\rfloor + 1, \quad r_j = \left( y_j - \theta \right), \;\; j = 1, 2, \ldots, n, \quad r^2_{(1)} \leq r^2_{(2)} \leq \ldots \leq r^2_{(n)}$$
Again, this definition would imply that the objective function of the LTS estimator is given by:

$$\rho\left( y_i; \theta \right) = \sum_{i=1}^{h} r^2_{(i)}$$
As before, it must be kept in mind that the LTS estimator is still not an M-estimator because
it does not include all observations in evaluating its objective function, similar to the premise
of how the LMS estimator is not an M-estimator.
Note that the upper bound of the summation is $h$, not $n$. So in essence, the LTS estimator minimizes the sum of the lower $h$ ordered squared residuals, equivalently discarding the upper $n - h$ squared deviations.
Computing for the LTS estimator
(1) As before, the first order of business is to sort the data:

$$y_{(1)}, y_{(2)}, \ldots, y_{(n)} \qquad y_{(1)} \leq y_{(2)} \leq \ldots \leq y_{(n)}$$

But $n$ here can take on any positive integer value; no special procedure is needed for odd or even $n$.
(2) Compute for $h = \left\lfloor \frac{n}{2} \right\rfloor + 1$.
(3) Now, partition the sorted data into $n - h + 1$ sub-batches, each of size $h$, in the following manner:

$$\left\{ y_{(1)}, y_{(2)}, \ldots, y_{(h)} \right\}, \; \left\{ y_{(2)}, y_{(3)}, \ldots, y_{(h+1)} \right\}, \; \ldots, \; \left\{ y_{(n-h+1)}, y_{(n-h+2)}, \ldots, y_{(n)} \right\}$$

i.e. simply enclose the first $h$ units of the sorted batch to obtain the first sub-batch. To obtain the second sub-batch, just move the left and the right enclosures one unit to the right. Repeat the process $n - h + 1$ times (including the first iteration) until the right enclosure reaches the end of the batch. Each repetition then corresponds to one sub-batch.
(4) Next, compute for the means of each sub-batch. There are two ways to go about this:

$$\bar{y}_{(j)} = \frac{1}{h} \sum_{i=j}^{j+h-1} y_{(i)} \tag{1}$$

$$\bar{y}_{(j)} = \frac{h\,\bar{y}_{(j-1)} - y_{(j-1)} + y_{(j+h-1)}}{h}, \qquad j = 2, 3, \ldots, n-h+1 \tag{2}$$

Note that Equation 1 is simply the sub-batch mean.
To understand Equation 2, keep in mind that the $n - h + 1$ sub-batches are obtained in a progressive manner. For example, the second sub-batch contains some elements of the first sub-batch, but the first ordered observation is excluded while the $(h+1)$th observation is included.

Generally speaking, the $(j+1)$th sub-batch is the same as the $j$th sub-batch, but excluding the $j$th observation and including the $(j+h)$th observation, where $j = 1, 2, \ldots, n-h$.

That said, note that before Equation 2 can be used, Equation 1 must first be evaluated at $j = 1$.
(5) After obtaining the $n - h + 1$ means, the $n - h + 1$ variances must then be computed for. Either of the two formulae can be used for this:

$$SQ_{(j)} = \sum_{i=j}^{j+h-1} \left( y_{(i)} - \bar{y}_{(j)} \right)^2 \tag{3}$$

$$SQ_{(j)} = SQ_{(j-1)} - y^2_{(j-1)} + h\left(\bar{y}_{(j-1)}\right)^2 + y^2_{(j+h-1)} - h\left(\bar{y}_{(j)}\right)^2, \qquad j = 2, 3, \ldots, n-h+1 \tag{4}$$

Again, Equation 4 is a recursive form of Equation 3. Also, Equation 3 must first be evaluated at $j = 1$ before proceeding to use Equation 4.
(6) The LTS estimator is then taken as the mean corresponding to the sub-batch with the least variance, $SQ_{(j)}$, i.e.:

$$\hat{\theta}_{LTS} = \bar{y}_{(k)} \qquad \text{where } SQ_{(k)} = \min_{j}\; SQ_{(j)}$$
Before moving on, care must be taken when using the recursive formulae (Equations 2 and 4) in that rounding off must not be done within each iteration. Rounding off the $\bar{y}_{(j)}$'s and the $SQ_{(j)}$'s in each iteration will result not just in grouping errors, but also in their propagation.
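As with LMS, the procedure can be written as a short R sketch (again, the name lts_estimate is ours, not a package function). For clarity it computes the sub-batch means and variances directly instead of using the recursive Equations 2 and 4, which also sidesteps the rounding issue just mentioned:

lts_estimate <- function(y) {
  y <- sort(y)                                   # step (1): sort the batch
  n <- length(y)
  h <- floor(n / 2) + 1                          # step (2)
  m <- n - h + 1                                 # number of sub-batches, step (3)
  means <- sapply(1:m, function(j) mean(y[j:(j + h - 1)]))                 # step (4)
  SQ    <- sapply(1:m, function(j) sum((y[j:(j + h - 1)] - means[j])^2))   # step (5)
  means[which.min(SQ)]                           # step (6): mean of the least-variable sub-batch
}

lts_estimate(c(40, 75, 80, 83, 86, 88, 90, 92, 93, 95))   # returns about 90.6667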
LTS Illustration Consider the same batch of numbers from the previous illustration:
40 75 80 83 86 88 90 92 93 95
The resulting sub-batch means and sub-batch variances are approximated here only to conserve space; again, these values must not be rounded off before obtaining the actual LTS estimate.
That said, the $\bar{y}_{(j)}$'s and the $SQ_{(j)}$'s are computed as follows (brackets mark the current sub-batch):

[40 75 80 83 86 88] 90 92 93 95     ȳ(1) ≈ 75.3333,  SQ(1) ≈ 1603.3333
40 [75 80 83 86 88 90] 92 93 95     ȳ(2) ≈ 83.6667,  SQ(2) ≈ 153.3333
40 75 [80 83 86 88 90 92] 93 95     ȳ(3) = 86.5,     SQ(3) = 99.5
40 75 80 [83 86 88 90 92 93] 95     ȳ(4) ≈ 88.6667,  SQ(4) ≈ 71.3333
40 75 80 83 [86 88 90 92 93 95]     ȳ(5) ≈ 90.6667,  SQ(5) ≈ 55.3333

$$\min\left\{ SQ_{(1)}, SQ_{(2)}, SQ_{(3)}, SQ_{(4)}, SQ_{(5)} \right\} = \min\{1603.33, 153.33, 99.5, 71.33, 55.33\} = 55.33 = SQ_{(5)}$$

$$\hat{\theta}_{LTS} = \bar{y}_{(5)} \approx 90.6667$$
As previously mentioned, the LTS estimator includes only the elements of the sub-batch, of size $h$, with the lowest variance. In doing so, the other $n - h$ observations are excluded. So really, the LTS estimator, at least as presented, is the trimmed mean of the sub-batch with the lowest sum of squared deviations, with a trimming proportion of $\left( 1 - \frac{h}{n} \right)$. It need not be said that, having described the LTS estimator as a trimmed mean, it allows for an asymmetric trimming of observations.
Properties of the LTS Estimator Unlike the LMS estimator, the LTS estimator performs (relatively) well in terms of asymptotic efficiency. Meaning to say, it has a comparably faster convergence rate, or equivalently, it takes fewer iterations before a value for the estimate is arrived at, at least compared to the LMS estimator.
Like the LMS estimator, the LTS estimator:

1. has a breakdown bound of 50%;
2. is linearly equivariant (i.e. location and scale equivariant);
3. is extendable to multidimensional cases (while still maintaining a high breakdown bound and linear equivariance); and
4. lacks a smooth objective function.
[Figure 5: Wilkinson dot plot and locations of the mean, median, LTS, and LMS estimators.]
Like the LMS estimator, the LTS estimator should also be located somewhere near the modal value of the batch (at least relative to the mean and the median). Since the objective function of the LTS estimator is based on the ordered partition of the batch with the smallest variance, which more often than not is the interval around which the data values cluster, it follows that the LTS estimator can likewise be likened to a modal estimator.
2.3 LMS and LTS Estimation in Large Batches
The compromise for having a high breakdown bound, among other properties, is inefficiency in computation. As illustrated in the previous examples, computing these estimators involves solving for the scales of the sub-batches (the range and the sum of squared deviations for LMS and LTS, respectively) $n - h + 1$ times.

In especially large batches, this is quite impractical. To render efficiency in solving for the LMS and LTS estimators of large batches, resampling techniques are used instead. Thus, solutions are determined randomly for large batch sizes. Ipso facto, repeated runs may yield slightly inconsistent computed values.
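The resampling idea can be caricatured in a few lines of R, reusing the lms_estimate() sketch from Section 2.1: rather than scanning every sub-batch of a huge batch, candidate estimates are computed on random sub-samples and the candidate with the smallest median squared deviation over the full batch is kept. This is only an illustration of the idea, not the algorithm used by any particular implementation.

lms_large <- function(y, n_samples = 50, sample_size = 100) {
  best <- Inf
  est  <- NA
  for (k in seq_len(n_samples)) {
    cand <- lms_estimate(sample(y, sample_size))   # candidate from a random sub-sample
    crit <- median((y - cand)^2)                   # score it on the full batch
    if (crit < best) { best <- crit; est <- cand }
  }
  est   # repeated runs may return slightly different values
}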
3 Robust Regression
This section is outlined as follows: a brief description of the properties of the robust regression techniques is presented, in particular the objective function used to arrive at the parameter estimates and the breakdown bounds of those estimates. Afterwards, inferential properties under the robust regression techniques are presented.
3.1 LMS Regression
The parameter estimates in LMS regression are estimated as those that would yield the minimum median of squared residuals, i.e.:

$$\arg\min_{\beta} \left\{ \operatorname{Med}\left[ r^2_{(i)} \right] \right\} = \arg\min_{\beta} \left\{ \operatorname{Med}\left| r_{(i)} \right| \right\} \qquad \text{where } r_i = y_i - \hat{y}_i \;\; \forall i$$
The breakdown bound of the resulting estimates is:

$$BDB(LMS) = \frac{\left\lfloor \frac{n-p}{2} \right\rfloor + 1}{n}$$

provided that $p > 1$, $p$ being the number of parameters estimated.
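To make the objective concrete, here is a naive R sketch for simple (one-predictor) LMS regression: it examines the line through every pair of points and keeps the one with the smallest median squared residual. The function name lms_line is ours, and an exhaustive pair search is only practical for small datasets; it is the brute-force cousin of the resampling idea from Section 2.3.

lms_line <- function(x, y) {
  n <- length(x)
  best <- list(crit = Inf, coef = c(intercept = NA, slope = NA))
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      if (x[i] == x[j]) next                    # skip pairs with an undefined slope
      b1 <- (y[j] - y[i]) / (x[j] - x[i])       # slope of the line through the pair
      b0 <- y[i] - b1 * x[i]                    # intercept of that line
      crit <- median((y - b0 - b1 * x)^2)       # LMS criterion: median squared residual
      if (crit < best$crit) best <- list(crit = crit, coef = c(intercept = b0, slope = b1))
    }
  }
  best
}

lms_line(Dataset.wLev$x, Dataset.wLev$y)$coef   # should follow the bulk, not the leverage point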
3.2 LTS Regression
The parameter estimates in LTS regression are computed as the ones that would yield the minimum trimmed sum of ordered squared residuals:

$$\arg\inf_{\beta} \left\{ \sum_{i=1}^{h} r^2_{(i)} \right\} \qquad \text{where} \quad h = \left\lfloor \frac{n}{2} \right\rfloor + 1, \quad r_j = \left( y_j - \hat{y}_j \right), \;\; j = 1, 2, \ldots, n, \quad r^2_{(1)} \leq r^2_{(2)} \leq \ldots \leq r^2_{(n)}$$
with a breakdown bound of:

$$BDB(LTS) = \frac{\left\lfloor \frac{n-p}{2} \right\rfloor + 1}{n}$$

where $p$ is the number of parameters estimated.
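The same pair-search sketch adapts to the LTS criterion by scoring each candidate line with the sum of its h smallest squared residuals, using the document's h = ⌊n/2⌋ + 1 (again, lts_line is our own illustrative name):

lts_line <- function(x, y) {
  n <- length(x)
  h <- floor(n / 2) + 1
  best <- list(crit = Inf, coef = c(intercept = NA, slope = NA))
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      if (x[i] == x[j]) next
      b1 <- (y[j] - y[i]) / (x[j] - x[i])
      b0 <- y[i] - b1 * x[i]
      crit <- sum(sort((y - b0 - b1 * x)^2)[1:h])   # trimmed sum of ordered squared residuals
      if (crit < best$crit) best <- list(crit = crit, coef = c(intercept = b0, slope = b1))
    }
  }
  best
}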
3.3 Inference in Robust Regression
Scale estimator of the error terms, $\hat{\sigma}$ The scale of the error terms is estimated by:

$$s_{LMS} = \left( 1 + \frac{5}{n-p} \right) c_{h,n}\, |r|_{(h)} \tag{5}$$

$$s_{LTS} = d_{h,n} \sqrt{\frac{1}{h} \sum_{i=1}^{h} r^2_{(i)}} \tag{6}$$

where

$$d_{h,n} = \frac{1}{\sqrt{1 - \dfrac{2}{h\, c_{h,n}}\, \phi\!\left( \dfrac{1}{c_{h,n}} \right)}}, \qquad c_{h,n} = \frac{1}{\Phi^{-1}\!\left( \dfrac{n+h}{2n} \right)}, \qquad h = \left\lfloor \frac{n}{2} \right\rfloor + 1$$

with $|r|_{(h)}$ the $h$-th smallest absolute residual, and $\phi$ and $\Phi^{-1}$ the standard normal density and quantile function, respectively.
Note that $c_{h,n}$ and $d_{h,n}$ are chosen to make the scale estimators consistent with the Gaussian model (Rousseeuw and Hubert, 1997).
Moreover, it is important to note that Equation 5 only applies for odd $n$, and that $\left( 1 + \frac{5}{n-p} \right)$ is a finite population correction factor (see Rousseeuw and Hubert (1997)).
It is noteworthy that there are more efficient scale estimates (see Rousseeuw and Hubert (1997)) based on Equations 5 and 6; but for the purposes of just introducing the notion of robust regression, these equations should suffice.
Coefficient of determination, $R^2$ Rousseeuw and Hubert (1997) propose a robust counterpart of the OLS notion of $R^2$, based on Equations 5 and 6, for LMS and LTS regression as:

$$R^2_{LMS} = 1 - \frac{s_{LMS}}{\left( 1 + \frac{5}{n-p} \right) c_{h,n}\, |r|_{(h),LMS}}$$

$$R^2_{LTS} = 1 - \frac{s_{LTS}}{d_{h,n} \sqrt{\dfrac{1}{h} \sum_{i=1}^{h} r^2_{(i),LTS}}}$$
where the r