Ordinary Least Squares: Linear Model
Linear model
Suppose the data consist of n observations { y_i, x_i }, i = 1, …, n. Each observation includes a scalar response y_i and a
vector of p predictors (or regressors) x_i. In a linear regression model the response variable is a linear function of the
regressors:
    \[ y_i = x_i'\beta + \varepsilon_i, \]
where β is a p×1 vector of unknown parameters; the ε_i are unobserved scalar random variables (errors) which account
for the discrepancy between the actually observed responses y_i and the "predicted outcomes" x_i′β; and ′ denotes
matrix transpose, so that x′β is the dot product between the vectors x and β. This model can also be written in matrix
notation as
    \[ y = X\beta + \varepsilon, \]
where y and ε are n×1 vectors, and X is an n×p matrix of regressors, which is also sometimes called the design
matrix.
As a rule, a constant term is included in the set of regressors X, say, by taking x_i1 = 1 for all i = 1, …, n.
The coefficient β_1 corresponding to this regressor is called the intercept.
There may be some relationship between the regressors. For instance, the third regressor may be the square of the
second regressor. In this case (assuming that the first regressor is constant) we have a quadratic model in the second
regressor. But this is still considered a linear model because it is linear in the βs.
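The point that a quadratic model is still linear in the coefficients can be illustrated with a small sketch. The data values, the variable names, and the coefficient vector below are assumptions made up for illustration; only the structure of the design matrix (constant, regressor, squared regressor) comes from the text.

```python
import numpy as np

# Hypothetical data: one raw predictor z observed at five points.
z = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Design matrix with a constant column, z, and z**2.
# The model y = b1 + b2*z + b3*z**2 + error is quadratic in z
# but linear in the coefficients (b1, b2, b3).
X = np.column_stack([np.ones_like(z), z, z**2])

beta = np.array([0.5, -1.0, 2.0])   # illustrative coefficient values
y_pred = X @ beta                   # the "predicted outcomes" x_i' beta
```

Because the predictions are a matrix-vector product X @ beta, every formula for the linear model applies unchanged to this design matrix.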
Assumptions
There are several different frameworks in which the linear regression model can be cast in order to make the OLS
technique applicable. Each of these settings produces the same formulas and the same results; the only differences are the
interpretation and the assumptions which have to be imposed in order for the method to give meaningful results. The
choice of the applicable framework depends mostly on the nature of the data at hand, and on the inference task which
has to be performed.
One of the lines of difference in interpretation is whether to treat the regressors as random variables, or as predefined
constants. In the first case (random design) the regressors xi are random and sampled together with the yi's from
some population, as in an observational study. This approach allows for more natural study of the asymptotic
properties of the estimators. In the other interpretation (fixed design), the regressors X are treated as known
constants set by a design, and y is sampled conditionally on the values of X as in an experiment. For practical
purposes, this distinction is often unimportant, since estimation and inference are carried out while conditioning on X.
All results stated in this article are within the random design framework.
• Strict exogeneity. The errors in the regression should have conditional mean zero:
    \[ \operatorname{E}[\varepsilon \mid X] = 0. \]
The immediate consequence of the exogeneity assumption is that the errors have mean zero: E[ε] = 0, and that the
regressors are uncorrelated with the errors: E[X′ε] = 0.
The exogeneity assumption is critical for the OLS theory. If it holds then the regressor variables are called
exogenous. If it doesn't, then those regressors that are correlated with the error term are called endogenous,[2] and
the OLS estimates become invalid. In such a case the method of instrumental variables may be used to carry
out inference.
• No linear dependence. The regressors in X must all be linearly independent. Mathematically it means that the
matrix X must have full column rank almost surely:[3]
    \[ \Pr[\operatorname{rank}(X) = p] = 1. \]
Usually, it is also assumed that the regressors have finite moments up to at least the second order. In such a case the matrix
Qxx = E[X′X / n] will be finite and positive semi-definite.
When this assumption is violated the regressors are called linearly dependent or perfectly multicollinear. In such a
case the value of the regression coefficient β cannot be learned, although prediction of y values is still possible for
new values of the regressors that lie in the same linearly dependent subspace.
• Spherical errors:[3]
    \[ \operatorname{Var}[\varepsilon \mid X] = \sigma^2 I_n, \]
where I_n is an n×n identity matrix, and σ² is a parameter which determines the variance of each observation. This
σ² is considered a nuisance parameter in the model, although usually it is also estimated. If this assumption is
violated then the OLS estimates are still valid, but no longer efficient.
It is customary to split this assumption into two parts:
• Homoscedasticity: E[ ε_i² | X ] = σ², which means that the error term has the same variance σ² in each
observation. When this requirement is violated this is called heteroscedasticity; in such a case a more efficient
estimator would be weighted least squares. If the errors have infinite variance then the OLS estimates will also
have infinite variance (although by the law of large numbers they will nonetheless tend toward the true values
so long as the errors have zero mean). In this case, robust estimation techniques are recommended.
• Nonautocorrelation: the errors are uncorrelated between observations: E[ εiεj | X ] = 0 for i ≠ j. This
assumption may be violated in the context of time series data, panel data, cluster samples, hierarchical data,
repeated measures data, longitudinal data, and other data with dependencies. In such cases generalized least
squares provides a better alternative than the OLS.
• Normality. It is sometimes additionally assumed that the errors have normal distribution conditional on the
regressors:[4]
    \[ \varepsilon \mid X \sim \mathcal{N}(0, \sigma^2 I_n). \]
This assumption is not needed for the validity of the OLS method, although certain additional finite-sample
properties can be established when it holds (especially in the area of hypothesis testing). Also when the
errors are normal, the OLS estimator is equivalent to the MLE, and therefore it is asymptotically efficient in the class
of all regular estimators.
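The no-linear-dependence assumption in the list above can be checked numerically. The following is a minimal sketch using made-up regressor values (an assumption for illustration); it contrasts an exact linear dependence with the harmless case of a squared regressor.

```python
import numpy as np

x2 = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical regressor values
const = np.ones_like(x2)

# Third column is an exact linear combination of the first two:
# perfectly multicollinear, so X'X is singular.
X_bad = np.column_stack([const, x2, 2.0 * x2 + 3.0 * const])
# Third column is the square of the second: related, but not *linearly*.
X_ok = np.column_stack([const, x2, x2**2])

rank_bad = np.linalg.matrix_rank(X_bad)   # 2, one column is redundant
rank_ok = np.linalg.matrix_rank(X_ok)     # 3, full column rank
```

A rank below the number of columns signals that β cannot be identified from the data.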
Estimation
Suppose b is a "candidate" value for the parameter β. The quantity y_i − x_i′b is called the residual for the i-th
observation; it measures the vertical distance between the data point (x_i, y_i) and the hyperplane y = x′b, and thus
assesses the degree of fit between the actual data and the model. The sum of squared residuals (SSR) (also called
the error sum of squares (ESS) or residual sum of squares (RSS))[5] is a measure of the overall model fit:
    \[ S(b) = \sum_{i=1}^{n} (y_i - x_i'b)^2 = (y - Xb)'(y - Xb). \]
The value of b which minimizes this sum is called the OLS estimator for β. When X has full column rank, the function S(b) is quadratic in b with
positive-definite Hessian, and therefore this function possesses a unique global minimum, which can be given by an
explicit formula:[6][proof]
    \[ \hat\beta = \arg\min_{b} S(b) = (X'X)^{-1}X'y. \]
After we have estimated β, the fitted values (or predicted values) from the regression will be
    \[ \hat y = X\hat\beta = Py, \]
where P = X(X′X)⁻¹X′ is the projection matrix onto the space spanned by the columns of X. This matrix P is also
sometimes called the hat matrix because it "puts a hat" onto the variable y. Another matrix, closely related to P, is the
annihilator matrix M = I_n − P; this is a projection matrix onto the space orthogonal to the columns of X. Both matrices P and M are
symmetric and idempotent (meaning that P² = P and M² = M), and relate to the data matrix X via the identities PX = X and MX = 0.[7]
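The estimator and the two projection matrices described above can be sketched numerically. The data here are synthetic (an assumption for illustration), and the names beta_true, beta_hat, P and M are chosen for this example only. Forming P explicitly is fine for small n but is not how one would compute fitted values on large data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
# Synthetic data: a constant plus two random regressors.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# OLS estimator from the normal equations (X'X) b = X'y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
# Same estimate from a least-squares solver, which avoids forming X'X.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Projection ("hat") matrix P and annihilator matrix M.
P = X @ np.linalg.solve(X.T @ X, X.T)
M = np.eye(n) - P
y_fit = P @ y          # fitted values, equal to X @ beta_hat
```

The identities P² = P, PX = X and MX = 0 hold here up to floating-point rounding.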
Matrix M creates the residuals from the regression:
    \[ \hat\varepsilon = y - X\hat\beta = My = M\varepsilon. \]
Using these residuals we can estimate the value of σ²:
    \[ s^2 = \frac{\hat\varepsilon'\hat\varepsilon}{n-p} = \frac{y'My}{n-p} = \frac{S(\hat\beta)}{n-p}, \qquad \hat\sigma^2 = \frac{n-p}{n}\,s^2. \]
The denominator, n − p, is the statistical degrees of freedom. The first quantity, s², is the OLS estimate for σ², whereas
the second, σ̂², is the MLE estimate for σ². The two estimators are quite similar in large samples; the first one is
unbiased, while the second is biased downward but has a smaller variance. In practice s² is
used more often, since it is more convenient for hypothesis testing. The square root of s² is called the standard
error of the regression (SER), or standard error of the equation (SEE).[7]
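The two variance estimates and the standard error of the regression can be computed directly from the residuals. This is a sketch on synthetic data (an assumption for illustration); the variable names are not from the source.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # synthetic regressors
y = X @ np.array([2.0, 1.5]) + rng.normal(size=n)       # synthetic response

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat            # residuals, equal to M y
ssr = resid @ resid                 # sum of squared residuals S(beta_hat)

s2 = ssr / (n - p)                  # unbiased estimate of sigma^2
sigma2_mle = ssr / n                # maximum likelihood estimate (biased)
ser = np.sqrt(s2)                   # standard error of the regression
```

The two estimates differ only by the factor (n − p)/n, so the gap closes as n grows with p fixed.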
It is common to assess the goodness-of-fit of the OLS regression by comparing how much the initial variation in the
sample can be reduced by regressing onto X. The coefficient of determination R² is defined as a ratio of
"explained" variance to the "total" variance of the dependent variable y:[8]
    \[ R^2 = \frac{\sum (\hat y_i - \bar y)^2}{\sum (y_i - \bar y)^2} = \frac{y'P'LPy}{y'Ly} = 1 - \frac{y'My}{y'Ly} = 1 - \frac{\mathrm{SSR}}{\mathrm{TSS}}, \]
where TSS is the total sum of squares for the dependent variable, L = I_n − 11′/n, and 1 is an n×1 vector of ones. (L
is a "centering matrix" which is equivalent to regression on a constant; it simply subtracts the mean from a variable.)
In order for R² to be meaningful, the matrix X of data on regressors must contain a column vector of ones to
represent the constant whose coefficient is the regression intercept. In that case, R² will always be a number between
0 and 1, with values close to 1 indicating a good degree of fit.
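The coefficient of determination can be computed from the residual and total sums of squares. The sketch below uses synthetic data (an assumption for illustration) and checks that the "explained" and "1 − SSR/TSS" forms agree when the design matrix contains a constant column.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])        # includes the constant column
y = 1.0 + 2.0 * x + rng.normal(size=n)      # synthetic response

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_fit = X @ beta_hat

ssr = np.sum((y - y_fit) ** 2)              # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)           # total sum of squares, y'Ly
r2 = 1.0 - ssr / tss
# With an intercept, the "explained variance" form gives the same number.
r2_explained = np.sum((y_fit - y.mean()) ** 2) / tss
```

Without the constant column the two forms generally differ, and 1 − SSR/TSS can fall below zero.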
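In the simple regression model, where X contains only a constant and a single scalar regressor x_i, so that
y_i = α + βx_i + ε_i, the least squares estimates are given by simple formulas
    \[ \hat\beta = \frac{\sum (x_i - \bar x)(y_i - \bar y)}{\sum (x_i - \bar x)^2}, \qquad \hat\alpha = \bar y - \hat\beta\,\bar x. \]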
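Assuming the case meant here is a regression on a constant and one scalar regressor, the closed-form slope and intercept can be sketched and cross-checked against the general matrix formula. The data values are hypothetical.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # hypothetical regressor values
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])     # hypothetical responses

# Slope = sample covariance of (x, y) over sample variance of x.
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# Intercept makes the fitted line pass through the point of means.
intercept = y.mean() - slope * x.mean()

# Cross-check against the general matrix formula (X'X)^{-1} X'y.
X = np.column_stack([np.ones_like(x), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```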
Alternative derivations
In the previous section the least squares estimator was obtained as a value that minimizes the sum of squared
residuals of the model. However it is also possible to derive the same estimator from other approaches. In all cases
the formula for the OLS estimator remains the same: β̂ = (X′X)⁻¹X′y; the only difference is in how we interpret this
result.
Geometric approach
For mathematicians, OLS is an approximate solution to an overdetermined system of linear equations Xβ ≈ y,
where β is the unknown. Assuming the system cannot be solved exactly (the number of equations n is much
larger than the number of unknowns p), we are looking for a solution that could provide the smallest
discrepancy between the right- and left-hand sides. In other words, we are looking for the solution that
satisfies
    \[ \hat\beta = \arg\min_{\beta} \lVert y - X\beta \rVert, \]
where ‖·‖ is the standard Euclidean norm.
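The geometric reading can be checked numerically: at the least squares solution the residual vector is orthogonal to every column of X, and any other coefficient vector leaves a larger discrepancy. The system below is synthetic (an assumption for illustration).

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 20, 3
X = rng.normal(size=(n, p))          # synthetic overdetermined system, n > p
y = rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

# The residual is orthogonal to every column of X ...
orth = X.T @ resid
# ... so a perturbed coefficient vector has a larger discrepancy.
b_other = beta_hat + rng.normal(scale=0.1, size=p)
norm_hat = np.linalg.norm(resid)
norm_other = np.linalg.norm(y - X @ b_other)
```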
Maximum likelihood
The OLS estimator is identical to the maximum likelihood estimator under the normality assumption for the error
terms.[10][proof] This normality assumption has historical importance, as it provided the basis for the early work in
linear regression analysis by Yule and Pearson. From the properties of MLE, we can infer that the OLS estimator is
asymptotically efficient (in the sense of attaining the Cramér-Rao bound for variance) if the normality assumption is
satisfied.[11]
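The equivalence with maximum likelihood can be seen from a short derivation. This is a sketch under the normality assumption ε | X ~ N(0, σ²I_n), writing S(b) for the sum of squared residuals at a candidate value b.

```latex
% Log-likelihood of the sample under \varepsilon \mid X \sim \mathcal{N}(0, \sigma^2 I_n)
\ell(b, \sigma^2)
  = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\,(y - Xb)'(y - Xb)
  = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{S(b)}{2\sigma^2}.

% For any fixed \sigma^2 > 0, maximizing over b is the same as minimizing S(b):
\hat\beta_{\mathrm{ML}} = \arg\max_b \ell(b, \sigma^2) = \arg\min_b S(b) = (X'X)^{-1}X'y.

% Setting \partial\ell / \partial\sigma^2 = 0 at b = \hat\beta then gives
\hat\sigma^2_{\mathrm{ML}} = \frac{S(\hat\beta)}{n}.
```

The coefficient estimate therefore coincides with OLS, while the variance estimate divides by n rather than n − p.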
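Generalized method of moments
In the i.i.d. case the OLS estimator can also be viewed as a GMM estimator arising from the moment conditions
    \[ \operatorname{E}\bigl[\,x_i (y_i - x_i'\beta)\,\bigr] = 0. \]
These moment conditions state that the regressors should be uncorrelated with the errors. Since x_i is a p-vector, the
number of moment conditions is equal to the dimension of the parameter vector β, and thus the system is exactly
identified. This is the so-called classical GMM case, when the estimator does not depend on the choice of the
weighting matrix.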
Note that the original strict exogeneity assumption E[εi | xi] = 0 implies a far richer set of moment conditions than
stated above. In particular, this assumption implies that for any vector-function ƒ, the moment condition E[ƒ(xi)·εi] =
0 will hold. However it can be shown using the Gauss–Markov theorem that the optimal choice of function ƒ is to
take ƒ(x) = x, which results in the moment equation posted above.
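Finite sample properties
Under the strict exogeneity assumption the OLS estimator β̂ is unbiased, and under spherical errors so is s², meaning
that their expected values coincide with the true values of the parameters:
    \[ \operatorname{E}[\hat\beta \mid X] = \beta, \qquad \operatorname{E}[s^2 \mid X] = \sigma^2. \]
If the strict exogeneity does not hold (as is the case with many time series models, where exogeneity is assumed only
with respect to the past shocks but not the future ones), then these estimators will be biased in finite samples.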
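The variance-covariance matrix of β̂ is equal to [13]
    \[ \operatorname{Var}[\hat\beta \mid X] = \sigma^2 (X'X)^{-1}. \]
In particular, the standard error of each coefficient β̂_j is equal to the square root of the j-th diagonal element of this
matrix. The estimate of this standard error is obtained by replacing the unknown quantity σ² with its estimate s².
Thus,
    \[ \widehat{\operatorname{s.e.}}(\hat\beta_j) = \sqrt{ s^2 \bigl[(X'X)^{-1}\bigr]_{jj} }. \]
It can also be easily shown that the estimator β̂ is uncorrelated with the residuals from the model:[13]
    \[ \operatorname{Cov}[\hat\beta, \hat\varepsilon \mid X] = 0. \]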
The Gauss–Markov theorem states that under the spherical errors assumption (that is, the errors should be
uncorrelated and homoscedastic) the estimator β̂ is efficient in the class of linear unbiased estimators. This is called
the best linear unbiased estimator (BLUE). Efficiency should be understood as follows: if we were to find some other
estimator β̃ which is linear in y and unbiased, then [13]
    \[ \operatorname{Var}[\tilde\beta \mid X] - \operatorname{Var}[\hat\beta \mid X] \ge 0 \]
in the sense that this is a nonnegative-definite matrix. This theorem establishes optimality only in the class of linear
unbiased estimators, which is quite restrictive. Depending on the distribution of the error terms ε, other, non-linear
estimators may provide better results than OLS.
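The unbiasedness and variance results above can be illustrated by simulation. The sketch below holds a synthetic design matrix fixed, redraws spherical errors many times, and compares the empirical mean and covariance of the OLS estimates with β and σ²(X′X)⁻¹. The sample size, number of replications, and parameter values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n, sigma = 30, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # held fixed across draws
beta = np.array([1.0, 2.0])
XtX_inv = np.linalg.inv(X.T @ X)

# Theoretical conditional covariance of the OLS estimator.
cov_theory = sigma**2 * XtX_inv

# Monte Carlo: redraw spherical errors many times, re-estimate each time.
reps = 20000
eps = rng.normal(scale=sigma, size=(reps, n))
Y = X @ beta + eps                      # each row is one simulated sample
B = Y @ X @ XtX_inv                     # each row is one OLS estimate
cov_mc = np.cov(B, rowvar=False)
mean_mc = B.mean(axis=0)

# Estimated standard errors from a single sample: sqrt(s^2 [(X'X)^-1]_jj).
resid = Y[0] - X @ B[0]
s2 = resid @ resid / (n - X.shape[1])
std_err = np.sqrt(s2 * np.diag(XtX_inv))
```

The last three lines show what is available in practice, where only one sample is observed and σ² has to be replaced by s².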
Assuming normality
The properties listed so far are all valid regardless of the underlying distribution of the error terms. However if you
are willing to assume that the normality assumption holds (that is, that ε ~ N(0, σ2In)), then additional properties of
the OLS estimators can be stated.
The estimator β̂ is normally distributed, with mean and variance as given before:[14]
    \[ \hat\beta \mid X \sim \mathcal{N}\bigl(\beta, \sigma^2 (X'X)^{-1}\bigr). \]
This estimator reaches the Cramér–Rao bound for the model, and thus is optimal in the class of all unbiased
estimators.[11] Note that unlike the Gauss–Markov theorem, this result establishes optimality among both linear and
non-linear estimators, but only in the case of normally distributed error terms.
The estimator s² will be proportional to the chi-squared distribution:[15]
    \[ s^2 \mid X \sim \frac{\sigma^2}{n-p}\,\chi^2_{n-p}. \]
The variance of this estimator is equal to 2σ⁴/(n − p), which does not attain the Cramér–Rao bound of 2σ⁴/n.
However it was shown that there are no unbiased estimators of σ² with variance smaller than that of the estimator
s².[16] If we are willing to allow biased estimators, and consider the class of estimators that are proportional to the
sum of squared residuals (SSR) of the model, then the best (in the sense of the mean squared error) estimator in this
class will be σ̃² = SSR / (n − p + 2), which even beats the Cramér–Rao bound in the case when there is only one
regressor (p = 1).[17]
Moreover, the estimators β̂ and s² are independent,[18] a fact which comes in useful when constructing the t- and
F-tests for the regression.
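The claim that SSR / (n − p + 2) has the smallest mean squared error among estimators proportional to SSR can be verified with a short calculation. This is a sketch under the normality assumption, using only the mean and variance of the chi-squared distribution; the shorthand k = n − p is introduced here for convenience.

```latex
% Write k = n - p. Under normality, SSR / \sigma^2 \sim \chi^2_k, so
\operatorname{E}[\mathrm{SSR}] = k\sigma^2, \qquad \operatorname{Var}[\mathrm{SSR}] = 2k\sigma^4.

% Mean squared error of the estimator SSR / c (variance plus squared bias):
\operatorname{MSE}(c) = \frac{2k\sigma^4}{c^2} + \Bigl(\frac{k}{c} - 1\Bigr)^2 \sigma^4,
\qquad
\frac{d}{dc}\operatorname{MSE}(c) = -\frac{2k\sigma^4}{c^3}\,(k + 2 - c).

% The derivative vanishes at c = k + 2, where
\operatorname{MSE}(k + 2) = \frac{2\sigma^4}{k + 2} = \frac{2\sigma^4}{n - p + 2}
\;<\; \operatorname{MSE}(k) = \frac{2\sigma^4}{n - p}.

% With a single regressor (p = 1) this is 2\sigma^4 / (n + 1) < 2\sigma^4 / n.
```

The comparison with the Cramér–Rao bound is possible only because the bound applies to unbiased estimators, while SSR / (n − p + 2) is biased.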
Influential observations
As was mentioned before, the estimator β̂ is linear in y, meaning that it represents a linear combination of the
dependent variables y_i. The weights in this linear combination are functions of the regressors X, and generally are
unequal. The observations with high weights are called influential because they have a more pronounced effect on
the value of the estimator.
To analyze which observations are influential we remove a specific j-th observation and consider how much the
estimated quantities are going to change (similarly to the jackknife method). It can be shown that the change in the
OLS estimator for β will be equal to [19]
    \[ \hat\beta^{(j)} - \hat\beta = -\frac{1}{1 - h_j}\,(X'X)^{-1} x_j\,\hat\varepsilon_j, \]
where h_j = x_j′(X′X)⁻¹x_j is the j-th diagonal element of the hat matrix P, and x_j is the vector of regressors
corresponding to the j-th observation. Similarly, the change in the predicted value for the j-th observation resulting from
omitting that observation from the dataset will be equal to [19]
    \[ \hat y_j^{(j)} - \hat y_j = x_j'\hat\beta^{(j)} - x_j'\hat\beta = -\frac{h_j}{1 - h_j}\,\hat\varepsilon_j. \]
From the properties of the hat matrix, 0 ≤ h_j ≤ 1, and they sum up to p, so that on average h_j = p/n. These quantities
h_j are called the leverages, and observations with high h_j are called leverage points.[20] Usually the observations with
high leverage ought to be scrutinized more carefully, in case they are erroneous, or outliers, or in some other way
atypical of the rest of the dataset.
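The leave-one-out result can be checked against a brute-force refit. The sketch below uses synthetic data and an arbitrarily chosen observation index (both assumptions for illustration); it computes the leverages and compares the closed-form change in the estimator with the change obtained by actually dropping the observation.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 25, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # synthetic
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
# Leverages h_j = x_j' (X'X)^{-1} x_j, the diagonal of the hat matrix.
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)

j = 7  # arbitrary observation to drop
# Closed-form change in the estimator when observation j is removed.
delta_formula = -(XtX_inv @ X[j]) * resid[j] / (1.0 - h[j])
# Brute-force check: refit without observation j.
keep = np.arange(n) != j
beta_loo, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
delta_refit = beta_loo - beta_hat
```

The closed form needs only one fit of the full model, which is why it is convenient for screening every observation in turn.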
Partitioned regression
Sometimes the variables and corresponding parameters in the regression can be logically split into two groups, so
that the regression takes the form
    \[ y = X_1\beta_1 + X_2\beta_2 + \varepsilon, \]
where X_1 and X_2 have dimensions n×p_1 and n×p_2, and β_1, β_2 are p_1×1 and p_2×1 vectors, with p_1 + p_2 = p.
The Frisch–Waugh–Lovell theorem states that in this regression the residuals ε̂ and the OLS estimate β̂_2 will be
numerically identical to the residuals and the OLS estimate for β_2 in the following regression:[21]
    \[ M_1 y = M_1 X_2\beta_2 + \eta, \]
where M_1 = I_n − X_1(X_1′X_1)⁻¹X_1′ is the annihilator matrix for the regressors X_1.
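The Frisch–Waugh–Lovell theorem can be verified numerically by comparing the full regression with the regression of M₁y on M₁X₂, where M₁ is the annihilator matrix of the first group of regressors. The data and the split into groups below are synthetic assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 60
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])        # synthetic group 1
X2 = rng.normal(size=(n, 2)) + 0.5 * X1[:, [1]]               # correlated group 2
y = X1 @ np.array([1.0, 2.0]) + X2 @ np.array([-1.0, 0.5]) + rng.normal(size=n)

# Full regression of y on [X1, X2].
X = np.column_stack([X1, X2])
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
resid_full = y - X @ beta_full

# Annihilator matrix for X1: M1 = I - X1 (X1'X1)^{-1} X1'.
M1 = np.eye(n) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)
# Regression of M1 y on M1 X2.
beta2_fwl, *_ = np.linalg.lstsq(M1 @ X2, M1 @ y, rcond=None)
resid_fwl = M1 @ y - M1 @ X2 @ beta2_fwl
```

Both the coefficients on the second group and the residual vectors agree up to rounding.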
Constrained estimation
Suppose it is known that the coefficients in the regression satisfy a system of linear equations
    \[ H_0\colon\ Q'\beta = c, \]
where Q is a p×q matrix of full rank, and c is a q×1 vector of known constants, where q < p. In this case least
squares estimation is equivalent to minimizing the sum of squared residuals of the model subject to the constraint H_0.
The constrained least squares (CLS) estimator can be given by an explicit formula:[22]
    \[ \hat\beta^c = \hat\beta - (X'X)^{-1}Q\,\bigl(Q'(X'X)^{-1}Q\bigr)^{-1}(Q'\hat\beta - c). \]
This expression for the constrained estimator is valid as long as the matrix X′X is invertible. It was assumed from the
beginning of this article that this matrix is of full rank, and it was noted that when the rank condition fails, β will not
be identifiable. However it may happen that adding the restriction H_0 makes β identifiable, in which case one would
like to find the formula for the estimator. The estimator is equal to [23]
    \[ \hat\beta^c = R(R'X'XR)^{-1}R'X'y + \bigl(I_p - R(R'X'XR)^{-1}R'X'X\bigr)\,Q(Q'Q)^{-1}c, \]
where R is a p×(p−q) matrix such that the matrix [Q R] is non-singular, and R′Q = 0. Such a matrix can always be
found, although generally it is not unique. The second formula coincides with the first in the case when X′X is
invertible.[23]
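The first constrained least squares formula (the one that requires X′X to be invertible) can be sketched as follows. The data are synthetic and the particular constraint, that the two slope coefficients sum to one, is a hypothetical example chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # synthetic
y = X @ np.array([1.0, 0.7, 0.3]) + rng.normal(scale=0.5, size=n)

# Hypothetical constraint Q'beta = c: the two slopes sum to one.
Q = np.array([[0.0], [1.0], [1.0]])     # p x q with q = 1
c = np.array([1.0])

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y            # unconstrained OLS
# beta_c = beta_hat - (X'X)^{-1} Q (Q'(X'X)^{-1} Q)^{-1} (Q'beta_hat - c)
A = Q.T @ XtX_inv @ Q
beta_c = beta_hat - XtX_inv @ Q @ np.linalg.solve(A, Q.T @ beta_hat - c)

ssr_u = np.sum((y - X @ beta_hat) ** 2)  # unconstrained SSR
ssr_c = np.sum((y - X @ beta_c) ** 2)    # constrained SSR, never smaller
```

The constrained estimate satisfies the restriction exactly, at the cost of a residual sum of squares that is at least as large as the unconstrained one.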
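Large sample properties
Under the assumptions above, the OLS estimator β̂ is consistent and asymptotically normal:
    \[ \sqrt{n}\,(\hat\beta - \beta) \xrightarrow{d} \mathcal{N}\bigl(0, \sigma^2 Q_{xx}^{-1}\bigr), \]
where Qxx = E[X′X / n].
Using this asymptotic distribution, approximate two-sided confidence intervals for the j-th component of the vector
β at the 1 − α confidence level can be constructed as
    \[ \beta_j \in \Bigl[\ \hat\beta_j \pm q_{1-\alpha/2}\,\sqrt{\tfrac{1}{n}\,\hat\sigma^2\,\bigl[\hat Q_{xx}^{-1}\bigr]_{jj}}\ \Bigr], \qquad \hat Q_{xx} = X'X / n, \]
where q denotes the quantile function of the standard normal distribution, and [·]_jj is the j-th diagonal element of a
matrix.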
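An approximate confidence interval based on the standard normal quantile can be sketched as follows. The data are synthetic with deliberately non-normal errors, the 95% level is an arbitrary choice, and the quantile comes from Python's standard-library statistics.NormalDist; all of these are assumptions for illustration.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(8)
n, p = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])        # synthetic regressors
y = X @ np.array([1.0, 2.0]) + rng.standard_t(df=5, size=n)  # non-normal errors

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / n           # large-sample estimate of sigma^2

alpha = 0.05
q = NormalDist().inv_cdf(1 - alpha / 2)  # standard normal quantile, about 1.96
# sigma2_hat * [(X'X)^{-1}]_jj equals (1/n) * sigma2_hat * [(X'X/n)^{-1}]_jj.
half_width = q * np.sqrt(sigma2_hat * np.diag(XtX_inv))
ci_lower = beta_hat - half_width
ci_upper = beta_hat + half_width
```

The interval relies on the asymptotic approximation, so its coverage is only approximate in small samples.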
Similarly, the least squares estimator for σ² is also consistent and asymptotically normal (provided that the fourth
moment of ε_i exists) with limiting distribution
    \[ \sqrt{n}\,(\hat\sigma^2 - \sigma^2) \xrightarrow{d} \mathcal{N}\bigl(0, \operatorname{E}[\varepsilon_i^4] - \sigma^4\bigr). \]
These asymptotic distributions can be used for prediction, testing hypotheses, constructing other estimators, etc. As
an example consider the problem of prediction. Suppose x_0 is some point within the domain of distribution of the
regressors, and one wants to know what the response variable would have been at that point. The mean response is
the quantity y_0 = x_0′β, whereas the predicted response is ŷ_0 = x_0′β̂. Clearly the predicted response is a random
variable; its distribution can be derived from that of β̂:
    \[ \sqrt{n}\,(\hat y_0 - y_0) \xrightarrow{d} \mathcal{N}\bigl(0, \sigma^2 x_0' Q_{xx}^{-1} x_0\bigr). \]
Height (m): 1.47 1.50 1.52 1.55 1.57 1.60 1.63 1.65 1.68 1.70 1.73 1.75 1.78 1.80 1.83
Weight (kg): 52.21 53.12 54.48 55.84 57.20 58.57 59.93 61.29 63.11 64.47 66.28 68.10 69.92 72.19 74.46
When only one dependent variable is being modeled, a scatterplot will suggest the form and strength of the
relationship between the dependent variable and regressors. It might also reveal outliers, heteroscedasticity, and
other aspects of the data that may complicate the interpretation of a fitted regression model. The scatterplot suggests
that the relationship is strong and can be approximated as a quadratic function. OLS can handle non-linear
relationships by introducing the regressor HEIGHT². The regression model then becomes a multiple linear model:
    \[ w_i = \beta_1 + \beta_2 h_i + \beta_3 h_i^2 + \varepsilon_i, \]
where w_i is the weight and h_i the height of the i-th observation.
The output from most popular statistical packages will look similar to this:
[Figure: Fitted regression]
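The quadratic regression of weight on height can be reproduced from the data listed above. The following sketch computes the coefficient estimates, their standard errors, t-statistics and R² with NumPy; it illustrates the formulas in this article and does not reproduce the output layout of any particular statistical package.

```python
import numpy as np

height = np.array([1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
                   1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83])
weight = np.array([52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
                   63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46])

n = height.size
X = np.column_stack([np.ones(n), height, height**2])   # constant, h, h^2
p = X.shape[1]

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ weight
resid = weight - X @ beta_hat

s2 = resid @ resid / (n - p)                  # estimate of sigma^2
std_err = np.sqrt(s2 * np.diag(XtX_inv))      # standard errors
t_stat = beta_hat / std_err                   # t-statistics for beta_j = 0
r2 = 1.0 - resid @ resid / np.sum((weight - weight.mean()) ** 2)

for name, b, se, t in zip(["const", "height", "height^2"], beta_hat, std_err, t_stat):
    print(f"{name:>9}  coef={b:10.4f}  se={se:8.4f}  t={t:8.4f}")
print(f"R-squared = {r2:.4f}")
```

The columns of this design matrix are strongly correlated, so the explicit inverse is used here only for readability; a solver such as numpy.linalg.lstsq is the numerically safer choice.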
In this table:
• The Coefficient column gives the least squares estimates of the parameters β_j.
• The Std. errors column shows standard errors of each coefficient estimate:
    \[ \hat\sigma_j = \sqrt{ s^2 \bigl[(X'X)^{-1}\bigr]_{jj} }. \]
• The t-statistic and p-value columns test whether each individual coefficient might be equal to zero. The
t-statistic is calculated simply as t = β̂_j / σ̂_j. If the errors ε follow a normal distribution, t
follows a Student-t distribution. Under weaker conditions, t is asymptotically normal. Large absolute values of t indicate
that the null hypothesis can be rejected and that the corresponding coefficient is not zero. The p-value column
expresses the results of the hypothesis test as a significance level. Conventionally, p-values smaller than
0.05 are taken as evidence that the population coefficient is nonzero.
• R-squared is the coefficient of determination indicating goodness-of-fit of the regression. This statistic will be
equal to one if fit is perfect, and to zero when regressors X have no explanatory power whatsoever. This is a
biased estimate of the population R-squared, and will never decrease if additional regressors are added, even if
they are irrelevant.
• Adjusted R-squared is a slightly modified version of R², designed to penalize for the excess number of
regressors which do not add to the explanatory power of the regression. This statistic is never larger than R²,
can decrease as you add new regressors, and can even be negative for poorly fitting models:
    \[ \bar R^2 = 1 - \frac{n-1}{n-p}\,(1 - R^2). \]
• Log-likelihood is calculated under the assumption that errors follow normal distribution. Even though the
assumption is not very reasonable, this statistic may still find its use in conducting LR tests.
• Durbin–Watson statistic tests whether there is any evidence of serial correlation between the residuals. As a rule
of thumb, a value smaller than 2 is evidence of positive correlation.
• Akaike information criterion and Schwarz criterion are both used for model selection. Generally when comparing
two alternative models, smaller values of one of these criteria will indicate a better model.[24]
• Standard error of regression is an estimate of σ, the standard deviation of the error term.
• Total sum of squares, model sum of squares, and residual sum of squares tell us how much of the initial variation
in the sample was explained by the regression.
• F-statistic tests the hypothesis that all coefficients (except the intercept) are equal to zero. This statistic has
F(p–1,n–p) distribution under the null hypothesis and normality assumption, and its p-value is the probability, under
the null hypothesis, of obtaining an F-statistic at least as large as the one observed. Note that when errors are not
normal this statistic no longer has an exact F distribution, and other tests such as the Wald test or LR test should be used.
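Several of the summary statistics in the list above can be computed directly from the residuals. The sketch below uses the height and weight data from this example and the standard textbook definitions of adjusted R-squared, the overall F-statistic and the Durbin–Watson statistic; the F and Durbin–Watson formulas are not spelled out in this article and are stated here as assumptions.

```python
import numpy as np

height = np.array([1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
                   1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83])
weight = np.array([52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
                   63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46])

n = height.size
X = np.column_stack([np.ones(n), height, height**2])
p = X.shape[1]
beta_hat, *_ = np.linalg.lstsq(X, weight, rcond=None)
resid = weight - X @ beta_hat

ssr = resid @ resid                              # residual sum of squares
tss = np.sum((weight - weight.mean()) ** 2)      # total sum of squares
r2 = 1.0 - ssr / tss
adj_r2 = 1.0 - (n - 1) / (n - p) * (1.0 - r2)    # adjusted R-squared

# F-statistic for H0: all coefficients except the intercept are zero.
f_stat = ((tss - ssr) / (p - 1)) / (ssr / (n - p))

# Durbin-Watson statistic: squared successive residual differences over SSR.
dw = np.sum(np.diff(resid) ** 2) / ssr
```

The Durbin–Watson statistic always lies between 0 and 4, with values near 2 indicating little first-order serial correlation.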
[Figure: Residuals plot]
Ordinary least squares analysis often includes the use of diagnostic plots designed to detect departures of the data from the
assumed form of the model. These are some of the common diagnostic plots:
• Residuals against the explanatory variables in the model. A non-linear relation between these variables suggests
that the linearity of the conditional mean function may not hold. Different levels of variability in the residuals for different
levels of the explanatory variables suggest possible heteroscedasticity.
• Residuals against explanatory variables not in the model. Any relation of the residuals to these variables would
suggest considering these variables for inclusion in the model.
• Residuals against the fitted values, ŷ.
• Residuals against the preceding residual. This plot may identify serial correlations in the residuals.
An important consideration when carrying out statistical inference using regression models is how the data were
sampled. In this example, the data are averages rather than measurements on individual women. The fit of the model
is very good, but this does not imply that the weight of an individual woman can be predicted with high accuracy
based only on her height.
Beware
This example also demonstrates that sophisticated calculations will not overcome the use of badly prepared data. The
heights were originally given in inches, and have been converted to the nearest centimetre. Since the conversion
factor is one inch to 2.54 cm, the rounded values are not an exact conversion. The original inches can be recovered by
Round(x/0.0254) and then re-converted to metric without rounding; if this is done, the regression results change.
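The effect of the rounding can be explored by recovering the heights in whole inches, converting them back to metres without rounding, and refitting the quadratic model. This is a sketch of that procedure; the helper function fit_quadratic is a name introduced here for illustration.

```python
import numpy as np

height_m = np.array([1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
                     1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83])
weight = np.array([52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
                   63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46])

# Recover whole inches, then convert back to metres without rounding.
inches = np.round(height_m / 0.0254)
height_exact = inches * 0.0254


def fit_quadratic(h, w):
    """OLS coefficients (const, h, h^2) for w regressed on a quadratic in h."""
    X = np.column_stack([np.ones_like(h), h, h**2])
    coef, *_ = np.linalg.lstsq(X, w, rcond=None)
    return coef


coef_rounded = fit_quadratic(height_m, weight)
coef_exact = fit_quadratic(height_exact, weight)
print("rounded heights:", coef_rounded)
print("exact heights:  ", coef_exact)
```

The heights move by less than half a centimetre each, yet the fitted coefficients shift noticeably because the constant, linear and squared terms are nearly collinear over this narrow range.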
References
[1] Hayashi (2000, page 7)
[2] Hayashi (2000, page 187)
[3] Hayashi (2000, page 10)
[4] Hayashi (2000, page 34)
[5] Hayashi (2000, page 15)
[6] Hayashi (2000, page 18)
[7] Hayashi (2000, page 19)
[8] Hayashi (2000, page 20)
[9] Hayashi (2000, page 5)
[10] Hayashi (2000, page 49)
[11] Hayashi (2000, page 52)
[12] Hayashi (2000, pages 27, 30)
License
Creative Commons Attribution-Share Alike 3.0 Unported
//creativecommons.org/licenses/by-sa/3.0/