Econometrics | Chapter 3 | Multiple Linear Regression Model | Shalabh, IIT Kanpur
y = X1β1 + X2β2 + ... + Xkβk + ε.
This is called the multiple linear regression model. The parameters β1, β2, ..., βk are the regression coefficients associated with X1, X2, ..., Xk respectively, and ε is the random error component reflecting the difference between the observed and fitted linear relationship. There can be various reasons for such a difference, e.g., the joint effect of variables not included in the model, random factors which cannot be accounted for in the model, etc.
Note that the jth regression coefficient βj represents the expected change in y per unit change in the jth explanatory variable Xj (holding the other explanatory variables fixed):
βj = ∂E(y)/∂Xj.
Linear model:
A model is said to be linear when it is linear in parameters. In such a case ∂y/∂βj (or equivalently ∂E(y)/∂βj) should not depend on any of the parameters.
(iv) y = β0 + β1/(X − β2)
is nonlinear in the parameters and variables both. So it is a nonlinear model.
(v) y = β0 + β1 X^β2
is nonlinear in the parameters and variables both. So it is a nonlinear model.
(vi) y = β0 + β1X + β2X² + β3X³
is a cubic polynomial model which can be written as
y = β0 + β1X1 + β2X2 + β3X3,
with X1 = X, X2 = X², X3 = X³, which is linear in the parameters β0, β1, β2, β3 and linear in the variables X1, X2, X3. So it is a linear model.
Example:
The income and education of a person are related. It is expected that, on average, a higher level of education provides higher income. So a simple linear regression model can be expressed as
income = β0 + β1 education + ε.
Note that β1 reflects the change in income with respect to a per unit change in education and β0 reflects the income when education is zero, as it is expected that even an illiterate person can have some income.
Further, this model neglects that most people have higher income when they are older than when they are young, regardless of education. So β1 will over-state the marginal impact of education. If age and education are positively correlated, then the regression model will associate all the observed increase in income with an increase in education. So a better model is
income = β0 + β1 education + β2 age + ε.
This is how we proceed in regression modelling in real-life situations. One needs to consider the experimental conditions and the phenomenon before deciding on how many, why and how to choose the dependent and independent variables.
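To make this concrete, here is a small Python sketch (not from the notes; the data, coefficient values and seed are invented) showing how omitting a positively correlated regressor such as age inflates the estimated education coefficient:

# Illustrative sketch: simulated income data with hypothetical coefficients.
import numpy as np

rng = np.random.default_rng(0)
n = 500
age = rng.uniform(20, 60, n)
education = 8 + 0.15 * age + rng.normal(0, 2, n)      # education correlated with age
income = 10 + 2.0 * education + 0.5 * age + rng.normal(0, 5, n)

# Full model: income = b0 + b1*education + b2*age + error
X_full = np.column_stack([np.ones(n), education, age])
b_full, *_ = np.linalg.lstsq(X_full, income, rcond=None)

# Misspecified model: age omitted
X_small = np.column_stack([np.ones(n), education])
b_small, *_ = np.linalg.lstsq(X_small, income, rcond=None)

print("education coefficient, full model  :", b_full[1])   # close to 2.0
print("education coefficient, age omitted :", b_small[1])  # biased upward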
In matrix notation, the n observations on the model can be written compactly as
y = Xβ + ε,
where y = (y1, y2, ..., yn)' is an n × 1 vector of observations on the study variable, X is the n × k matrix of observations on the k explanatory variables, β = (β1, β2, ..., βk)' is the k × 1 vector of regression coefficients and ε = (ε1, ε2, ..., εn)' is an n × 1 vector of random error components or disturbance terms.
The following assumptions are made:
(i) E(ε) = 0
(ii) E(εε') = σ²In
(iii) Rank(X) = k
(iv) X is a non-stochastic matrix
(v) ε ~ N(0, σ²In).
These assumptions are used to study the statistical properties of the estimators of the regression coefficients. The following assumption is required to study, in particular, the large sample properties of the estimators:
(vi) lim_{n→∞} (X'X)/n = Δ exists and is a non-stochastic and nonsingular matrix (with finite elements).
The explanatory variables can also be stochastic in some cases. We assume that X is non-stochastic unless
stated separately.
We consider the problems of estimation and testing of hypotheses on the regression coefficient vector under the stated assumptions.
Estimation of parameters:
A general procedure for the estimation of the regression coefficient vector is to minimize
∑_{i=1}^n M(εi) = ∑_{i=1}^n M(yi − xi1β1 − xi2β2 − ... − xikβk)
for a suitably chosen function M, e.g., M(x) = |x|^p, in general.
We consider the principle of least squares, which corresponds to M(x) = x², and the method of maximum likelihood estimation for the estimation of the parameters.
The principle of least squares minimizes the sum of squares of the errors,
S(β) = ∑_{i=1}^n εi² = ε'ε = (y − Xβ)'(y − Xβ),
for given y and X. A minimum will always exist, as S(β) is a real valued, convex and differentiable function. Write
S(β) = y'y + β'X'Xβ − 2β'X'y.
Differentiating S(β) with respect to β,
∂S(β)/∂β = 2X'Xβ − 2X'y
∂²S(β)/∂β² = 2X'X (at least non-negative definite).
The normal equation is
∂S(β)/∂β = 0
⇒ X'Xb = X'y.
Since it is assumed that rank(X) = k (full rank), X'X is positive definite and the unique solution of the normal equation is
b = (X'X)⁻¹X'y,
which is termed the ordinary least squares estimator (OLSE) of β.
Since ∂²S(β)/∂β² = 2X'X is at least non-negative definite, b minimizes S(β).
If rank(X) < k, a solution of the normal equation can be written as
b = (X'X)⁻X'y + [I − (X'X)⁻X'X]ω,
where (X'X)⁻ is a generalized inverse of X'X and ω is an arbitrary vector. The generalized inverse satisfies X(X'X)⁻X'X = X, so the fitted values are
ŷ = Xb = X(X'X)⁻X'y,
which is independent of ω. This implies that ŷ has the same value for all solutions b of X'Xb = X'y.
Further, for any β,
S(β) = [y − Xb + X(b − β)]'[y − Xb + X(b − β)]
= (y − Xb)'(y − Xb) + (b − β)'X'X(b − β) + 2(b − β)'X'(y − Xb)
= (y − Xb)'(y − Xb) + (b − β)'X'X(b − β)  (using X'Xb = X'y)
≥ (y − Xb)'(y − Xb) = S(b),
which confirms that b minimizes S(β). The minimum value of the sum of squares is
S(b) = (y − Xb)'(y − Xb)
= y'y − 2b'X'y + b'X'Xb
= y'y − b'X'Xb
= y'y − ŷ'ŷ.
Fitted values:
If β̂ is any estimator of β for the model y = Xβ + ε, then the fitted values are defined as ŷ = Xβ̂.
In case of β̂ = b,
ŷ = Xb = X(X'X)⁻¹X'y = Hy,
where H = X(X'X)⁻¹X' is termed the hat matrix. It is a symmetric and idempotent matrix, and its trace is
tr H = tr X(X'X)⁻¹X' = tr X'X(X'X)⁻¹ = tr Ik = k.
Residuals
The difference between the observed and fitted values of the study variable is called the residual. It is denoted as
e = y − ŷ
= y − Xb
= y − Hy
= (I − H)y
= H̄y,
where H̄ = I − H is also a symmetric and idempotent matrix.
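As an illustration, the following Python sketch (simulated data; all values assumed) computes the OLSE, the hat matrix, the fitted values and the residuals exactly as defined above:

# Sketch with assumed data: b = (X'X)^{-1} X'y, hat matrix H, fitted values and residuals.
import numpy as np

rng = np.random.default_rng(1)
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # includes intercept column
beta = np.array([1.0, 2.0, -1.0])
y = X @ beta + rng.normal(0, 0.5, n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y            # OLS estimator
H = X @ XtX_inv @ X.T            # hat matrix, symmetric and idempotent
y_hat = H @ y                    # fitted values (= X @ b)
e = y - y_hat                    # residuals, e = (I - H) y

print("b     :", b)
print("tr(H) :", np.trace(H))    # equals k
print("X'e   :", X.T @ e)        # numerically ~ 0 (normal equations)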
Properties of OLSE
(i) Estimation error:
The estimation error of b is
b − β = (X'X)⁻¹X'y − β
= (X'X)⁻¹X'(Xβ + ε) − β
= (X'X)⁻¹X'ε.
(ii) Bias
Since X is assumed to be nonstochastic and E (ε ) = 0
E (b − β ) = ( X ' X ) −1 X ' E (ε )
= 0.
Thus OLSE is an unbiased estimator of β .
(iii) Covariance matrix:
The covariance matrix of b is
V(b) = E[(b − β)(b − β)'] = (X'X)⁻¹X'E(εε')X(X'X)⁻¹ = σ²(X'X)⁻¹.
The total variability of b around β can be summarized by the trace of the covariance matrix of b. Thus
Var(b) = tr[V(b)] = ∑_{i=1}^k E(bi − βi)² = ∑_{i=1}^k Var(bi).
Estimation of σ 2
The least squares criterion cannot be used to estimate σ² because σ² does not appear in S(β). Since
e = y − ŷ
= y − X(X'X)⁻¹X'y
= [I − X(X'X)⁻¹X']y
= H̄y,
consider the residual sum of squares
SS_res = ∑_{i=1}^n ei²
= e'e
= (y − Xb)'(y − Xb)
= y'(I − H)(I − H)y
= y'(I − H)y
= y'H̄y.
Also
SS_res = (y − Xb)'(y − Xb)
= y'y − 2b'X'y + b'X'Xb
= y'y − b'X'y  (using X'Xb = X'y)
and
SS_res = y'H̄y
= (Xβ + ε)'H̄(Xβ + ε)
= ε'H̄ε  (using H̄X = 0).
Since ε ~ N(0, σ²I), we have y ~ N(Xβ, σ²I), and SS_res = ε'H̄ε with H̄ symmetric and idempotent of rank (n − k).
Hence SS_res/σ² = y'H̄y/σ² ~ χ²(n − k).
Thus E[y'H̄y] = (n − k)σ²,
or E[y'H̄y/(n − k)] = σ²,
or E[MS_res] = σ²,
where MS_res = SS_res/(n − k) is the mean sum of squares due to residuals. Thus MS_res provides an unbiased estimator of σ².
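A small Python sketch (simulated data, assumed values) illustrating that MS_res = e'e/(n − k) is an unbiased estimator of σ²:

# Monte Carlo check of E[MS_res] = sigma^2 under assumed data.
import numpy as np

rng = np.random.default_rng(2)
n, k, sigma = 40, 4, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = rng.normal(size=k)
H_bar = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T   # I - H

ms_res = []
for _ in range(2000):
    y = X @ beta + rng.normal(0, sigma, n)
    e = H_bar @ y
    ms_res.append(e @ e / (n - k))   # SS_res / (n - k)

print("average MS_res :", np.mean(ms_res))   # close to sigma^2 = 2.25
print("sigma^2        :", sigma**2)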
Variance of ŷ
The variance of ŷ is
V ( yˆ ) = V ( Xb)
= XV (b) X '
= σ 2 X ( X ' X ) −1 X '
= σ 2H.
Gauss-Markov Theorem:
The ordinary least squares estimator (OLSE) is the best linear unbiased estimator (BLUE) of β .
Proof: The OLSE of β is
b = ( X ' X ) −1 X ' y
which is a linear function of y. Consider an arbitrary linear estimator
b* = a'y
of the linear parametric function ℓ'β, where the elements of a are arbitrary constants.
Then for b*,
E(b*) = E(a'y) = a'Xβ,
and b* is an unbiased estimator of ℓ'β when
a'Xβ = ℓ'β for all β
⇒ a'X = ℓ'.
Since we wish to consider only those estimators that are linear and unbiased, we restrict ourselves to those estimators for which a'X = ℓ'.
Further
Var(a'y) = a'Var(y)a = σ²a'a
Var(ℓ'b) = ℓ'Var(b)ℓ = σ²ℓ'(X'X)⁻¹ℓ = σ²a'X(X'X)⁻¹X'a  (using ℓ' = a'X).
Consider
Var(a'y) − Var(ℓ'b) = σ²[a'a − a'X(X'X)⁻¹X'a]
= σ²a'[I − X(X'X)⁻¹X']a
= σ²a'(I − H)a.
Since (I − H) is a positive semi-definite matrix,
Var(a'y) − Var(ℓ'b) ≥ 0.
This reveals that if b* is any linear unbiased estimator of ℓ'β, then its variance must be no smaller than that of ℓ'b. Since this holds for every ℓ, b is the best linear unbiased estimator, where 'best' refers to the fact that b is efficient within the class of linear and unbiased estimators.
Maximum likelihood estimation:
In the model y = Xβ + ε, it is assumed that ε ~ N(0, σ²In). The likelihood function of β and σ² is the joint density of the observations:
L(β, σ²) = (2πσ²)^{-n/2} exp[− (1/(2σ²)) ∑_{i=1}^n εi²]
= (2πσ²)^{-n/2} exp[− (1/(2σ²)) ε'ε]
= (2πσ²)^{-n/2} exp[− (1/(2σ²)) (y − Xβ)'(y − Xβ)].
Since the log transformation is monotonic, we maximize ln L(β, σ²) instead of L(β, σ²):
ln L(β, σ²) = − (n/2) ln(2πσ²) − (1/(2σ²)) (y − Xβ)'(y − Xβ).
The maximum likelihood estimators (m.l.e.) of β and σ 2 are obtained by equating the first order
derivatives of ln L( β , σ 2 ) with respect to β and σ 2 to zero as follows:
∂ ln L(β, σ²)/∂β = (1/σ²) X'(y − Xβ) = 0
∂ ln L(β, σ²)/∂σ² = − n/(2σ²) + (1/(2(σ²)²)) (y − Xβ)'(y − Xβ) = 0.
The likelihood equations are given by
X'Xβ = X'y
σ² = (1/n)(y − Xβ)'(y − Xβ).
Since rank(X) = k, the unique MLEs of β and σ² are obtained as
β̃ = (X'X)⁻¹X'y
σ̃² = (1/n)(y − Xβ̃)'(y − Xβ̃).
Note that the MLE β̃ of β is the same as the OLSE b, whereas σ̃² differs from the unbiased estimator MS_res of σ² in that the divisor is n rather than (n − k).
Further to verify that these values maximize the likelihood function, we find
∂² ln L(β, σ²)/∂β² = − (1/σ²) X'X
∂² ln L(β, σ²)/∂(σ²)² = n/(2σ⁴) − (1/σ⁶) (y − Xβ)'(y − Xβ)
∂² ln L(β, σ²)/∂β∂σ² = − (1/σ⁴) X'(y − Xβ).
The matrix of second order partial derivatives
[ ∂² ln L/∂β²     ∂² ln L/∂β∂σ²
  ∂² ln L/∂σ²∂β   ∂² ln L/∂(σ²)² ]
is negative definite at β = β̃ and σ² = σ̃². This ensures that the likelihood function is maximized at these values.
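A brief Python sketch (simulated data, assumed values) showing that the MLE of β coincides with the OLSE while the MLE of σ² uses the divisor n:

import numpy as np

rng = np.random.default_rng(3)
n, k, sigma = 50, 3, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([0.5, 1.0, -2.0])
y = X @ beta + rng.normal(0, sigma, n)

b = np.linalg.solve(X.T @ X, X.T @ y)   # OLSE = MLE of beta
e = y - X @ b
sigma2_mle = e @ e / n                  # MLE of sigma^2 (biased)
sigma2_unb = e @ e / (n - k)            # unbiased estimator (MS_res)

print("beta_tilde (= b):", b)
print("sigma2 MLE      :", sigma2_mle)
print("sigma2 unbiased :", sigma2_unb)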
Consistency of estimators
(i) Consistency of b :
Under the assumption that lim_{n→∞} (X'X)/n = Δ exists as a nonstochastic and nonsingular matrix (with finite elements), we have
lim_{n→∞} V(b) = σ² lim_{n→∞} (1/n) (X'X/n)⁻¹
= σ² · 0 · Δ⁻¹
= 0.
This implies that OLSE converges to β in quadratic mean. Thus OLSE is a consistent estimator of β . This
holds true for maximum likelihood estimators also.
Same conclusion can also be proved using the concept of convergence in probability.
An estimator θ̂n converges to θ in probability if
lim_{n→∞} P[|θ̂n − θ| ≥ δ] = 0 for any δ > 0.
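A small simulation sketch in Python (assumed design and values) illustrating the consistency of b — its mean squared error shrinks as n grows:

import numpy as np

rng = np.random.default_rng(4)
beta = np.array([1.0, 2.0])

for n in [50, 500, 5000]:
    errors = []
    for _ in range(500):
        X = np.column_stack([np.ones(n), rng.normal(size=n)])
        y = X @ beta + rng.normal(0, 1, n)
        b = np.linalg.solve(X.T @ X, X.T @ y)
        errors.append(np.sum((b - beta) ** 2))
    print(f"n = {n:5d}, mean squared error of b: {np.mean(errors):.5f}")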
(ii) Consistency of s 2
Now we look at the consistency of s² as an estimator of σ². We have
s² = e'e/(n − k)
= ε'H̄ε/(n − k)
= (1 − k/n)⁻¹ [ε'ε/n − ε'X(X'X)⁻¹X'ε/n]
= (1 − k/n)⁻¹ [ε'ε/n − (ε'X/n)(X'X/n)⁻¹(X'ε/n)].
Note that ε'ε/n = (1/n) ∑_{i=1}^n εi², and {εi², i = 1, 2, ..., n} is a sequence of independently and identically distributed random variables with mean σ². Using the law of large numbers, ε'ε/n converges in probability to σ². Further, X'ε/n converges in probability to zero and (X'X/n)⁻¹ converges to Δ⁻¹, so the second term converges to zero, while (1 − k/n)⁻¹ → 1. Hence s² is a consistent estimator of σ², and the same conclusion holds for σ̃².
Cramer-Rao lower bound:
Let θ = (β, σ²)'. The information matrix is obtained from the second order derivatives of the log-likelihood as
I(θ) = − E[ ∂² ln L/∂β²     ∂² ln L/∂β∂σ²
            ∂² ln L/∂σ²∂β   ∂² ln L/∂(σ²)² ]
= [ X'X/σ²   0
    0        n/(2σ⁴) ].
Then
[I(θ)]⁻¹ = [ σ²(X'X)⁻¹   0
             0            2σ⁴/n ]
is the Cramer-Rao lower bound matrix for the variances of unbiased estimators of β and σ².
The covariance matrix of the estimators b and s² is
Σ_OLS = [ σ²(X'X)⁻¹   0
          0            2σ⁴/(n − k) ],
which means that the Cramer-Rao lower bound is attained for the covariance of b but not for s².
Standardized regression coefficients:
Usually it is difficult to compare regression coefficients directly because the magnitude of β̂j reflects the units of measurement of the jth explanatory variable Xj. For example, in the fitted regression model
ŷ = 5 + X1 + 1000X2,
where y is measured in litres, X1 in millilitres and X2 in litres, β̂2 >> β̂1, yet the effect of both explanatory variables is identical: a one litre change in either X1 or X2, when the other variable is held fixed, produces the same change in ŷ.
Sometimes it is helpful to work with scaled explanatory variables and a scaled study variable that produce dimensionless regression coefficients. These dimensionless regression coefficients are called standardized regression coefficients.
There are two popular approaches for scaling which give standardized regression coefficients. We discuss them as follows:
1. Unit normal scaling: Employ the scaling
zij = (xij − x̄j)/sj,  yi* = (yi − ȳ)/sy,  i = 1, 2, ..., n, j = 1, 2, ..., k,
where
sj² = (1/(n − 1)) ∑_{i=1}^n (xij − x̄j)²  and  sy² = (1/(n − 1)) ∑_{i=1}^n (yi − ȳ)²
are the sample variances of the jth explanatory variable and of the study variable, respectively.
All scaled explanatory variables and the scaled study variable have mean zero and sample variance unity. Using these new variables, the regression model becomes
yi* = γ1 zi1 + γ2 zi2 + ... + γk zik + εi,  i = 1, 2, ..., n.
Such centering removes the intercept term from the model. The least squares estimate of γ = (γ1, γ2, ..., γk)' is
γ̂ = (Z'Z)⁻¹Z'y*.
This scaling has a similarity to standardizing a normal random variable, i.e., subtracting its mean and dividing by its standard deviation. So it is called unit normal scaling.
2. Unit length scaling: Employ the scaling
ωij = (xij − x̄j)/Sjj^{1/2},  yi0 = (yi − ȳ)/SST^{1/2},  i = 1, 2, ..., n, j = 1, 2, ..., k,
where
Sjj = ∑_{i=1}^n (xij − x̄j)²
is the corrected sum of squares for the jth explanatory variable Xj and
SST = ∑_{i=1}^n (yi − ȳ)²
is the total sum of squares. In this scaling, each new explanatory variable Wj has
mean ω̄j = (1/n) ∑_{i=1}^n ωij = 0 and length √(∑_{i=1}^n (ωij − ω̄j)²) = 1.
In this scaling, the matrix W'W takes the form of a correlation matrix whose (i, j)th element is
rij = ∑_{u=1}^n (xui − x̄i)(xuj − x̄j) / (Sii Sjj)^{1/2} = Sij / (Sii Sjj)^{1/2},
the simple correlation coefficient between the explanatory variables Xi and Xj. Similarly,
the jth element of W'y0 is
rjy = ∑_{u=1}^n (xuj − x̄j)(yu − ȳ) / (Sjj SST)^{1/2} = Sjy / (Sjj SST)^{1/2},
the simple correlation coefficient between the jth explanatory variable Xj and the study variable y.
Note that it is customary to refer to rij and rjy as correlation coefficients even though the Xi's are not random variables.
The estimates of the regression coefficients under unit normal scaling (i.e., γ̂) and unit length scaling (i.e., δ̂) are identical, since the two scalings differ only by the common factor (n − 1)^{1/2}, which cancels. The regression coefficients obtained after such scaling, viz., γ̂ or δ̂, are usually called standardized regression coefficients.
The relationship between the original and the standardized regression coefficients is bj = δ̂j (SST/Sjj)^{1/2} for each slope, and b0 = ȳ − ∑_j bj x̄j, where b0 is the OLSE of the intercept term and the bj are the OLSEs of the slope parameters.
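A Python sketch (simulated data, assumed values) computing standardized regression coefficients by unit length scaling and recovering the original slopes from them:

import numpy as np

rng = np.random.default_rng(5)
n = 60
X = rng.normal(size=(n, 2)) * np.array([1.0, 100.0])   # regressors on very different scales
y = 3.0 + 0.8 * X[:, 0] + 0.004 * X[:, 1] + rng.normal(0, 0.5, n)

# ordinary OLS with intercept
Xd = np.column_stack([np.ones(n), X])
b = np.linalg.solve(Xd.T @ Xd, Xd.T @ y)

# unit length scaling: w_ij = (x_ij - xbar_j)/sqrt(S_jj), y0 = (y - ybar)/sqrt(SST)
S_jj = np.sum((X - X.mean(axis=0)) ** 2, axis=0)
SST = np.sum((y - y.mean()) ** 2)
W = (X - X.mean(axis=0)) / np.sqrt(S_jj)
y0 = (y - y.mean()) / np.sqrt(SST)

delta = np.linalg.solve(W.T @ W, W.T @ y0)   # standardized (dimensionless) coefficients
print("ordinary slopes b_j       :", b[1:])
print("standardized coefficients :", delta)
print("slopes recovered from delta:", delta * np.sqrt(SST / S_jj))   # matches b[1:]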
Model in deviation form:
First, all the data are expressed in terms of deviations from the sample mean.
The matrix
A = In − (1/n) ll',
where l = (1, 1, ..., 1)' is the n × 1 vector of ones, transforms observations into deviations from their mean.
Then
ȳ = (1/n) ∑_{i=1}^n yi = (1/n) (1, 1, ..., 1)(y1, y2, ..., yn)' = (1/n) l'y
and
Ay = y − ȳl = (y1 − ȳ, y2 − ȳ, ..., yn − ȳ)'.
Note that
Al = l − (1/n) l(l'l) = l − (n/n) l = l − l = 0,
and A is a symmetric and idempotent matrix.
In the model
y = Xβ + ε,
the OLSE of β is b = (X'X)⁻¹X'y and the residual vector is e = y − Xb. Note that Ae = e, since l'e = 0 when an intercept term is included in the model.
Partition X = [X1, X2*], where X1 = (1, 1, ..., 1)' is the n × 1 vector with all elements unity (the intercept column), X2* is the n × (k − 1) matrix of observations on the (k − 1) explanatory variables X2, X3, ..., Xk, and the OLSE b = (b1, b2*')' is partitioned accordingly, with b1 the OLSE of the intercept term and b2* the (k − 1) × 1 vector of OLSEs of the slope parameters.
Then
y = X1b1 + X2*b2* + e.
Premultiplying by A and using AX1 = 0 and Ae = e,
Ay = AX1b1 + AX2*b2* + Ae
= AX2*b2* + e.
Premultiplying by X2*' and using X2*'e = 0 gives
X2*'Ay = X2*'AX2*b2* + X2*'e
= X2*'AX2*b2*,
or, since A is symmetric and idempotent,
(AX2*)'(Ay) = (AX2*)'(AX2*) b2*.
This equation can be compared with the normal equations X'y = X'Xb in the model y = Xβ + ε. Such a comparison yields the following conclusions:
• b2* is the sub-vector of the OLSE consisting of the slope coefficients.
• This is the normal equation in terms of deviations. Its solution gives the OLSEs of the slope coefficients as
b2* = [(AX2*)'(AX2*)]⁻¹ (AX2*)'(Ay).
The expression for the total sum of squares (TSS) remains the same as earlier and is given by
TSS = y'Ay.
Since
Ay = AX2*b2* + e,
we have
y'Ay = y'AX2*b2* + y'e
= b2*'X2*'AX2*b2* + e'e,
i.e., the total sum of squares partitions into the sum of squares due to regression and the sum of squares due to residuals, TSS = SS_reg + SS_res.
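A Python sketch (simulated data, assumed values) verifying that the slope estimates from the deviation-form regression coincide with those from the full model with intercept:

import numpy as np

rng = np.random.default_rng(6)
n = 40
X2 = rng.normal(size=(n, 3))                      # explanatory variables (no intercept column)
y = 2.0 + X2 @ np.array([1.0, -0.5, 0.3]) + rng.normal(0, 1, n)

# full model with intercept column
X = np.column_stack([np.ones(n), X2])
b = np.linalg.solve(X.T @ X, X.T @ y)

# deviation form: centre y and X2 with A = I - (1/n) l l'
A = np.eye(n) - np.ones((n, n)) / n
b2_dev = np.linalg.solve((A @ X2).T @ (A @ X2), (A @ X2).T @ (A @ y))

print("slopes from full model      :", b[1:])
print("slopes from deviation form  :", b2_dev)
print("intercept b1 = ybar - xbar'b2:", y.mean() - X2.mean(axis=0) @ b2_dev)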
Testing of hypothesis:
There are several important questions which can be answered through the test of hypothesis concerning the
regression coefficients. For example
1. What is the overall adequacy of the model?
2. Which specific explanatory variables seem to be important?
etc.
In order to answer such questions, we first develop the test of hypothesis for a general framework, viz., the general linear hypothesis. Then several tests of hypothesis can be derived as its special cases. So first we discuss the test of a general linear hypothesis.
The general linear hypothesis is
H0: Rβ = r,
where R is a J × k matrix of known constants and r is a J × 1 vector of known constants. We assume that rank(R) = J (full row rank), i.e., there is no linear dependence among the hypotheses. Some special cases are as follows:
(i) H 0 : βi = 0
Choose J = 1, r = 0, R = [0, 0, ..., 0, 1, 0, ..., 0], where 1 occurs at the ith position in R.
This particular hypothesis explains whether X i has any effect on the linear model or not.
(ii) H0: β3 = β4, or H0: β3 − β4 = 0
Choose J = 1, r = 0, R = [0, 0, 1, −1, 0, ..., 0].
(iii) H0: β3 = β4 = β5, or H0: β3 − β4 = 0, β3 − β5 = 0
Choose J = 2, r = (0, 0)',
R = [ 0 0 1 −1  0 0 ... 0
      0 0 1  0 −1 0 ... 0 ].
(iv) H0: β3 + 5β4 = 2
Choose J = 1, r = 2, R = [0, 0, 1, 5, 0, ..., 0].
(v) H0: β2 = β3 = ... = βk = 0
Choose J = k − 1, r = (0, 0, ..., 0)',
R = [0  Ik−1], a (k − 1) × k matrix whose first column is zero and whose remaining columns form the identity matrix Ik−1.
This particular hypothesis relates to the goodness of fit of the model. It tells whether the βi's have a linear effect or not and whether they are of any importance. It also tests whether X2, X3, ..., Xk have any influence in the determination of y. Here β1 = 0 is excluded because this would involve the additional implication that the mean level of y is zero. Our main concern is to know whether the explanatory variables help to explain the variation in y around its mean value or not.
The test of H0: Rβ = r is based on the likelihood ratio
λ = max L(β, σ² | y, X) / max L(β, σ² | y, X, Rβ = r) = L̂(Ω) / L̂(ω),
where Ω denotes the whole parametric space and ω the space restricted by H0.
If both the likelihoods are maximized, one constrained and the other unconstrained, then the value of the unconstrained maximum will not be smaller than the value of the constrained one. Hence λ ≥ 1.
First we discuss the likelihood ratio test for the simpler case R = Ik and r = β0, i.e., H0: β = β0. This will give a better and more detailed understanding of the steps involved, and then we generalize it to Rβ = r.
H 0 : β = β0
where β 0 is specified by the investigator. The elements of β 0 can take on any value, including zero. The
concerned alternative hypothesis is
H1 : β ≠ β 0 .
Since ε ~ N(0, σ²I) in y = Xβ + ε, we have y ~ N(Xβ, σ²I). The whole parametric space Ω and the restricted parametric space ω under H0 are
Ω = {(β, σ²): −∞ < βi < ∞, σ² > 0, i = 1, 2, ..., k}
ω = {(β, σ²): β = β0, σ² > 0}.
Over Ω, the likelihood is maximized at β̃ = (X'X)⁻¹X'y and (1/n)(y − Xβ̃)'(y − Xβ̃) for σ², so that
L̂(Ω) = n^{n/2} exp(−n/2) / [(2π)^{n/2} ((y − Xβ̃)'(y − Xβ̃))^{n/2}].
Over ω, β is fixed at β0 and the likelihood is maximized at
σ̂ω² = (1/n)(y − Xβ0)'(y − Xβ0),
so that
L̂(ω) = n^{n/2} exp(−n/2) / [(2π)^{n/2} ((y − Xβ0)'(y − Xβ0))^{n/2}].
Thus
λ^{2/n} = (y − Xβ0)'(y − Xβ0) / (y − Xβ̃)'(y − Xβ̃)
= [(y − Xβ̃)'(y − Xβ̃) + (β̃ − β0)'X'X(β̃ − β0)] / (y − Xβ̃)'(y − Xβ̃)
= 1 + (β̃ − β0)'X'X(β̃ − β0) / (y − Xβ̃)'(y − Xβ̃),
or
λ0 = λ^{2/n} − 1 = (β̃ − β0)'X'X(β̃ − β0) / (y − Xβ̃)'(y − Xβ̃),
where 0 ≤ λ0 < ∞. Since λ is a monotonic function of λ0, the test can be based on λ0.
Now
(y − Xβ̃)'(y − Xβ̃) = e'e
= y'[I − X(X'X)⁻¹X']y
= y'H̄y
= (Xβ + ε)'H̄(Xβ + ε)
= ε'H̄ε  (using H̄X = 0)
= (n − k)σ̂²,
where σ̂² = e'e/(n − k).
We use the following result: if Z is an n × 1 random vector distributed as N(0, σ²In) and A is a symmetric idempotent n × n matrix of rank p, then Z'AZ/σ² ~ χ²(p). If B is another n × n symmetric idempotent matrix of rank q, then Z'BZ/σ² ~ χ²(q). If, in addition, AB = 0, then Z'AZ is distributed independently of Z'BZ.
If H0: β = β0 is true, then β̃ − β0 = (X'X)⁻¹X'ε, so the numerator of λ0 equals ε'X(X'X)⁻¹X'ε = ε'Hε while the denominator equals ε'H̄ε. By the above result, ε'Hε/σ² ~ χ²(k) and ε'H̄ε/σ² ~ χ²(n − k). Furthermore, the product of the quadratic form matrices in the numerator (ε'Hε) and denominator (ε'H̄ε) of λ0 is
HH̄ = H(I − H) = H − H = 0,
and hence the χ² random variables in the numerator and denominator of λ0 are independent. Dividing each of them by its degrees of freedom gives
λ1 = [(β̃ − β0)'X'X(β̃ − β0)/(σ²k)] / [(n − k)σ̂²/(σ²(n − k))]
= (β̃ − β0)'X'X(β̃ − β0) / (kσ̂²)
= [(y − Xβ0)'(y − Xβ0) − (y − Xβ̃)'(y − Xβ̃)] / (kσ̂²)
~ F(k, n − k) under H0.
Note that
(y − Xβ0)'(y − Xβ0): restricted error sum of squares
(y − Xβ̃)'(y − Xβ̃): unrestricted error sum of squares.
Numerator in λ1: difference between the restricted and unrestricted error sums of squares.
The decision rule is to reject H0 at level of significance α whenever
λ1 ≥ Fα(k, n − k),
where Fα(k, n − k) is the upper α percent point of the central F-distribution with k and (n − k) degrees of freedom.
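A Python sketch (simulated data; β0 and all values assumed) computing this F statistic for H0: β = β0 and comparing it with the critical value:

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta_true = np.array([1.0, 0.5, -0.5])
y = X @ beta_true + rng.normal(0, 1, n)

beta0 = np.array([1.0, 0.5, -0.5])            # hypothesised value (here equal to the truth)
b = np.linalg.solve(X.T @ X, X.T @ y)
ss_unres = (y - X @ b) @ (y - X @ b)          # unrestricted error sum of squares
ss_res0 = (y - X @ beta0) @ (y - X @ beta0)   # restricted error sum of squares
sigma2_hat = ss_unres / (n - k)

F = (ss_res0 - ss_unres) / (k * sigma2_hat)   # ~ F(k, n-k) under H0
F_crit = stats.f.ppf(0.95, k, n - k)
print("F statistic                :", F)
print("5% critical value F(k, n-k):", F_crit)
print("reject H0?                 :", F >= F_crit)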
Now we generalize the test to the general linear hypothesis H0: Rβ = r. The whole and restricted parametric spaces are
Ω = {(β, σ²): −∞ < βi < ∞, σ² > 0, i = 1, 2, ..., k}
ω = {(β, σ²): −∞ < βi < ∞, Rβ = r, σ² > 0}.
Then, with β̃ = (X'X)⁻¹X'y,
E(Rβ̃) = Rβ
V(Rβ̃) = E[R(β̃ − β)(β̃ − β)'R'] = R V(β̃) R' = σ²R(X'X)⁻¹R'.
Since β̃ ~ N(β, σ²(X'X)⁻¹), we have Rβ̃ ~ N(Rβ, σ²R(X'X)⁻¹R'), and under H0 (so that r = Rβ),
Rβ̃ − r = Rβ̃ − Rβ = R(β̃ − β) ~ N(0, σ²R(X'X)⁻¹R').
There exists a matrix Q such that R(X'X)⁻¹R' = QQ', which reduces the quadratic form in Rβ̃ − r to a sum of squares of independent normal variables. Proceeding as in the earlier case, the resulting χ² statistics in the numerator and denominator are independent, and dividing each by its degrees of freedom gives
λ1 = (Rβ̃ − r)'[R(X'X)⁻¹R']⁻¹(Rβ̃ − r) / (Jσ̂²)
~ F(J, n − k) under H0.
The decision rule is to reject H0 at level of significance α whenever
λ1 ≥ Fα(J, n − k),
where Fα(J, n − k) is the upper α percent point of the central F-distribution with J and (n − k) degrees of freedom.
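A Python sketch (simulated data, assumed values) computing the F statistic for a general linear hypothesis, here H0: β2 = β3 with R = [0, 1, −1] and r = 0:

import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n, k = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.7, 0.7]) + rng.normal(0, 1, n)   # beta_2 = beta_3 holds

R = np.array([[0.0, 1.0, -1.0]])
r = np.array([0.0])
J = R.shape[0]

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
sigma2_hat = (y - X @ b) @ (y - X @ b) / (n - k)

diff = R @ b - r
F = diff @ np.linalg.solve(R @ XtX_inv @ R.T, diff) / (J * sigma2_hat)
p_value = stats.f.sf(F, J, n - k)
print("F statistic:", F, " p-value:", p_value)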
Test of significance of regression (Analysis of variance)
If we set R = [0  Ik−1] and r = 0, then the hypothesis H0: Rβ = r reduces to the following null hypothesis:
H 0 : β 2= β3= ...= β k= 0
against the alternative hypothesis
H1 : β j ≠ 0 for at least one j = 2,3,..., k
This hypothesis determines if there is a linear relationship between y and any set of the explanatory
variables X 2 , X 3 ,..., X k . Notice that X 1 corresponds to the intercept term in the model and hence
xi1 = 1 for all i = 1, 2, ..., n.
This is an overall or global test of model adequacy. Rejection of the null hypothesis indicates that at least one of the explanatory variables among X2, X3, ..., Xk contributes significantly to the model. This procedure is called the analysis of variance.
Since ε ~ N (0, σ 2 I ),
so y ~ N ( X β , σ 2 I )
b = ( X ' X ) −1 X ' y ~ N β , σ 2 ( X ' X ) −1 .
Also
σ̂² = SS_res/(n − k)
= (y − ŷ)'(y − ŷ)/(n − k)
= y'[I − X(X'X)⁻¹X']y/(n − k)
= y'H̄y/(n − k)
= (y'y − b'X'y)/(n − k).
Since (X'X)⁻¹X'H̄ = 0, b and σ̂² are independently distributed. Moreover,
SS_res/σ² ~ χ²(n − k).
Partition β = (β1, β2*')', where the subvector β2* contains the regression coefficients β2, β3, ..., βk, and partition X = [X1, X2*] correspondingly as before. The total sum of squares can then be partitioned as SST = SS_reg + SS_res, where SS_reg = b2*'X2*'AX2*b2* is the sum of squares due to regression and the sum of squares due to residuals is
SS_res = (y − Xb)'(y − Xb)
= y'H̄y
= SST − SS_reg.
Further,
SS_reg/σ² ~ χ²_{k−1}(β2*'X2*'AX2*β2*/(2σ²)),
i.e., a non-central χ² distribution with (k − 1) degrees of freedom and non-centrality parameter β2*'X2*'AX2*β2*/(2σ²).
Since H̄X2* = 0, SS_reg and SS_res are independently distributed. The mean square due to regression is
MS_reg = SS_reg/(k − 1)
and the mean square due to error is
MS_res = SS_res/(n − k).
Then
MS_reg/MS_res ~ F_{k−1, n−k}(β2*'X2*'AX2*β2*/(2σ²)),
which is a non-central F-distribution with (k − 1, n − k) degrees of freedom and non-centrality parameter β2*'X2*'AX2*β2*/(2σ²).
Under H0: β2 = β3 = ... = βk = 0, the non-centrality parameter is zero and
F = MS_reg/MS_res ~ F_{k−1, n−k}.
The decision rule is to reject H0 at α level of significance whenever
F ≥ Fα(k − 1, n − k).
The analysis of variance table is set up as follows:
Source of variation    Sum of squares    Degrees of freedom    Mean square
Regression             SS_reg            k − 1                 MS_reg
Error (residual)       SS_res            n − k                 MS_res
Total                  SST               n − 1
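A Python sketch (simulated data, assumed values) carrying out the overall F test from the analysis of variance quantities:

import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
n, k = 80, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([2.0, 1.0, 0.0, -0.5]) + rng.normal(0, 1, n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
SST = np.sum((y - y.mean()) ** 2)     # total sum of squares, y'Ay
SS_res = e @ e
SS_reg = SST - SS_res

MS_reg = SS_reg / (k - 1)
MS_res = SS_res / (n - k)
F = MS_reg / MS_res
print("ANOVA: SS_reg =", SS_reg, " SS_res =", SS_res, " SST =", SST)
print("F =", F, " p-value =", stats.f.sf(F, k - 1, n - k))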
Adding explanatory variables to the model also increases the variance of the fitted values ŷ, so one needs to be careful that only those regressors are added that are of real value in explaining the response. Adding unimportant explanatory variables may increase the residual mean square, which may decrease the usefulness of the model.
Test of hypothesis on individual regression coefficients:
The test of the null hypothesis H0: βj = 0 against the alternative H1: βj ≠ 0 has already been discussed in the case of the simple linear regression model. In the present case, if H0 is accepted, it implies that the explanatory variable Xj can be deleted from the model. The corresponding test statistic is
t = bj / se(bj) ~ t(n − k) under H0,
where se(bj) is the standard error of bj. The decision rule is to reject H0 at level of significance α if
|t| > t_{α/2, n−k}.
Note that this is only a partial or marginal test because bj depends on all the other explanatory variables Xi (i ≠ j) that are in the model. This is a test of the contribution of Xj given the other explanatory variables in the model.
Confidence interval estimation:
Since y ~ N(Xβ, σ²I), we have
b ~ N(β, σ²(X'X)⁻¹).
Thus the marginal distribution of any regression coefficient estimate is
bj ~ N(βj, σ²Cjj),
where Cjj is the jth diagonal element of (X'X)⁻¹. Replacing σ² by its estimate σ̂²,
tj = (bj − βj) / √(σ̂²Cjj) ~ t(n − k), j = 1, 2, ..., k.
Hence
P[ −t_{α/2, n−k} ≤ (bj − βj)/√(σ̂²Cjj) ≤ t_{α/2, n−k} ] = 1 − α
P[ bj − t_{α/2, n−k} √(σ̂²Cjj) ≤ βj ≤ bj + t_{α/2, n−k} √(σ̂²Cjj) ] = 1 − α.
So the 100(1 − α)% confidence interval for βj is
[ bj − t_{α/2, n−k} √(σ̂²Cjj),  bj + t_{α/2, n−k} √(σ̂²Cjj) ].
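A Python sketch (simulated data, assumed values) computing the t statistics for H0: βj = 0 and the corresponding 95% confidence intervals:

import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
n, k, alpha = 50, 3, 0.05
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.8, 0.0]) + rng.normal(0, 1, n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
sigma2_hat = (y - X @ b) @ (y - X @ b) / (n - k)

se = np.sqrt(sigma2_hat * np.diag(XtX_inv))   # se(b_j) = sqrt(sigma2_hat * C_jj)
t_stat = b / se                               # for H0: beta_j = 0
t_crit = stats.t.ppf(1 - alpha / 2, n - k)

for j in range(k):
    lo, hi = b[j] - t_crit * se[j], b[j] + t_crit * se[j]
    print(f"b_{j+1} = {b[j]:6.3f}, t = {t_stat[j]:6.2f}, 95% CI = [{lo:6.3f}, {hi:6.3f}]")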
Simultaneous confidence intervals on regression coefficients:
A set of confidence intervals that are true simultaneously with probability (1 − α ) are called simultaneous or
joint confidence intervals.
It is relatively easy to define a joint confidence region for β in multiple regression model.
Since
(b − β)'X'X(b − β) / (k MS_res) ~ F(k, n − k),
we have
P[ (b − β)'X'X(b − β) / (k MS_res) ≤ Fα(k, n − k) ] = 1 − α.
So a 100(1 − α)% joint confidence region for all of the parameters in β is
(b − β)'X'X(b − β) / (k MS_res) ≤ Fα(k, n − k),
which describes an elliptically shaped region.
Coefficient of determination (R²):
The coefficient of determination is defined as
R² = 1 − SS_res/SST = SS_reg/SST,
where SST = ∑_{i=1}^n (yi − ȳ)² is the total sum of squares.
R² measures the explanatory power of the model, which in turn reflects the goodness of fit of the model. It reflects the model adequacy in the sense of how much of the variation in y is explained by the explanatory variables.
Since
e'e = y'[I − X(X'X)⁻¹X']y = y'H̄y
and
∑_{i=1}^n (yi − ȳ)² = ∑_{i=1}^n yi² − nȳ²,
where ȳ = (1/n) ∑_{i=1}^n yi = (1/n) l'y with l = (1, 1, ..., 1)' and y = (y1, y2, ..., yn)', we have
∑_{i=1}^n (yi − ȳ)² = y'y − nȳ²
= y'y − (1/n) y'll'y
= y'y − y'l(l'l)⁻¹l'y
= y'[I − l(l'l)⁻¹l']y
= y'Ay.
So R² = 1 − y'H̄y / y'Ay.
Similarly, any other value of R² between 0 and 1 indicates the adequacy of the fitted model. Since R² never decreases when explanatory variables are added, it can paint an overly optimistic picture of the fit. To correct for this, the adjusted R², denoted as R̄² or adj R², is used, which is defined as
R̄² = 1 − [SS_res/(n − k)] / [SST/(n − 1)]
= 1 − ((n − 1)/(n − k)) (1 − R²).
We will see later that (n − k) and (n − 1) are the degrees of freedom associated with the distributions of SS_res and SST, respectively. Moreover, the quantities SS_res/(n − k) and SST/(n − 1) are based on the unbiased estimators of the respective variances of e and y in the context of analysis of variance.
The adjusted R² will decline if the addition of an extra variable produces too small a reduction in (1 − R²) to compensate for the increase in (n − 1)/(n − k).
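A Python sketch (simulated data, assumed values) computing R² and the adjusted R², and showing that adding an irrelevant regressor raises R² while typically lowering the adjusted R²:

import numpy as np

def r2_and_adj(X, y):
    n, k = X.shape
    b = np.linalg.solve(X.T @ X, X.T @ y)
    ss_res = np.sum((y - X @ b) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / sst
    adj = 1 - (n - 1) / (n - k) * (1 - r2)
    return r2, adj

rng = np.random.default_rng(11)
n = 40
x1 = rng.normal(size=n)
noise = rng.normal(size=n)              # irrelevant variable
y = 1.0 + 2.0 * x1 + rng.normal(0, 1, n)

X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([np.ones(n), x1, noise])
print("small model (R2, adj R2):", r2_and_adj(X_small, y))
print("with junk   (R2, adj R2):", r2_and_adj(X_big, y))   # R2 up, adj R2 typically down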
Reason why R² is valid only in linear models with an intercept term:
In the model y = Xβ + ε, the ordinary least squares estimator of β is b = (X'X)⁻¹X'y. Consider the
fitted model as
y = Xb + ( y − Xb)
= Xb + e
where e is the residual. Note that
y − lȳ = Xb + e − lȳ
= ŷ + e − lȳ,
where ŷ = Xb is the vector of fitted values and l = (1, 1, ..., 1)' is an n × 1 vector of elements unity. The total sum of squares TSS = ∑_{i=1}^n (yi − ȳ)² is then obtained as
TSS = (y − lȳ)'(y − lȳ)
= (ŷ − lȳ)'(ŷ − lȳ) + e'e + 2(ŷ − lȳ)'e
= SS_reg + SS_res + 2(ŷ'e − ȳ l'e).
Since ŷ'e = b'X'e = 0 always, the cross-product term vanishes, and hence the decomposition TSS = SS_reg + SS_res holds, only when l'e = 0.
Consider first a simple linear regression model without an intercept term, yi = β1xi + εi (i = 1, 2, ..., n), in which the parameter β1 is estimated as
b1* = ∑_{i=1}^n xiyi / ∑_{i=1}^n xi².
Then
l'e = ∑_{i=1}^n ei = ∑_{i=1}^n (yi − ŷi) = ∑_{i=1}^n (yi − b1*xi) ≠ 0, in general.
Next we consider a simple linear regression model with an intercept term, yi = β0 + β1xi + εi (i = 1, 2, ..., n), where the parameters β0 and β1 are estimated as b0 = ȳ − b1x̄ and b1 = sxy/sxx respectively, with
sxy = ∑_{i=1}^n (xi − x̄)(yi − ȳ),  sxx = ∑_{i=1}^n (xi − x̄)²,  x̄ = (1/n) ∑_{i=1}^n xi,  ȳ = (1/n) ∑_{i=1}^n yi.
We find that
l'e = ∑_{i=1}^n ei = ∑_{i=1}^n (yi − ŷi)
= ∑_{i=1}^n (yi − b0 − b1xi)
= ∑_{i=1}^n (yi − ȳ + b1x̄ − b1xi)
= ∑_{i=1}^n [(yi − ȳ) − b1(xi − x̄)]
= ∑_{i=1}^n (yi − ȳ) − b1 ∑_{i=1}^n (xi − x̄)
= 0.
In a multiple linear regression model with an intercept term, y = β0l + Xβ + ε, the parameters are estimated as β̂0 = ȳ − x̄'b and b (the OLSE of the slope vector obtained from the centred data, as derived earlier), where x̄ is the vector of sample means of the explanatory variables. We find that
l'e = l'(y − ŷ)
= l'(y − β̂0l − Xb)
= l'(y − (ȳ − x̄'b)l − Xb)
= l'(y − ȳl) − l'(X − lx̄')b
= 0,
since l'(y − ȳl) = 0 and l'(X − lx̄') = 0.
Thus we conclude that for the Fisher–Cochran theorem to hold true, in the sense that the total sum of squares can be divided into two orthogonal components, viz., the sum of squares due to regression and the sum of squares due to errors, it is necessary that l'e = l'(y − ŷ) = 0 holds, and this is guaranteed only when the intercept term is present in the model.
3. R² always increases with an increase in the number of explanatory variables in the model. The main drawback of this property is that even when irrelevant explanatory variables are added to the model, R² still increases. This indicates that the model is getting better, which is not really correct.
4. Consider a situation where we have the following two models:
yi = β1 + β2Xi2 + ... + βkXik + ui,  i = 1, 2, ..., n
log yi = γ1 + γ2Xi2 + ... + γkXik + vi,  i = 1, 2, ..., n.
The question now is which model is better?
For the first model,
R1² = 1 − ∑_{i=1}^n (yi − ŷi)² / ∑_{i=1}^n (yi − ȳ)²,
and for the second model, writing yi* = log yi,
R2² = 1 − ∑_{i=1}^n (yi* − ŷi*)² / ∑_{i=1}^n (yi* − ȳ*)²,
where ŷi* is the fitted value of yi* from the second model and ȳ* is the mean of the yi*.
As such, R1² and R2² are not comparable because they are based on different study variables. If the two models still need to be compared, a better measure is
R3² = 1 − ∑_{i=1}^n (yi − antilog ŷi*)² / ∑_{i=1}^n (yi − ȳ)²,
where ŷi* is the fitted value of yi* = log yi from the second model, so that antilog ŷi* is on the same scale as yi. Now R1² and R3², on comparison, may give an idea about the adequacy of the two models.
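A Python sketch (simulated data, assumed values; natural logarithms are used for the antilog) comparing the two models via R1² and R3²:

import numpy as np

rng = np.random.default_rng(12)
n = 100
x = rng.uniform(1, 5, n)
y = np.exp(0.5 + 0.4 * x + rng.normal(0, 0.2, n))   # data generated on the log scale
X = np.column_stack([np.ones(n), x])

b1 = np.linalg.solve(X.T @ X, X.T @ y)          # model 1: y regressed on x
b2 = np.linalg.solve(X.T @ X, X.T @ np.log(y))  # model 2: log y regressed on x

sst = np.sum((y - y.mean()) ** 2)
R1_sq = 1 - np.sum((y - X @ b1) ** 2) / sst
R3_sq = 1 - np.sum((y - np.exp(X @ b2)) ** 2) / sst   # antilog of fitted log-values

print("R1^2 (linear model)       :", R1_sq)
print("R3^2 (log model, y scale) :", R3_sq)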
Relationship between the overall F statistic and R²:
The statistic for testing the overall significance of the regression can be written as
F = [R²/(k − 1)] / [(1 − R²)/(n − k)],
where R² is the coefficient of determination. So F and R² are closely related. When R² = 0, then F = 0. In the limit, as R² → 1, F → ∞. So both F and R² vary directly: a larger R² implies a greater F value. That is why the F test under analysis of variance is termed the measure of the overall significance of the estimated regression. It is also a test of the significance of R². If F is highly significant, it implies that we can reject H0, i.e., y is linearly related to the X's.
The confidence interval on the mean response at a particular point, such as x01 , x02 ,..., x0 k can be found as
follows:
Define x0 = ( x01 , x02 ,..., x0 k ) '. The fitted value at x0 is yˆ 0 = x0' b.
Then
(ŷ0 − E(y/x0)) / √(σ̂² x0'(X'X)⁻¹x0) ~ t(n − k),
so
P[ −t_{α/2, n−k} ≤ (ŷ0 − E(y/x0))/√(σ̂² x0'(X'X)⁻¹x0) ≤ t_{α/2, n−k} ] = 1 − α
P[ ŷ0 − t_{α/2, n−k} √(σ̂² x0'(X'X)⁻¹x0) ≤ E(y/x0) ≤ ŷ0 + t_{α/2, n−k} √(σ̂² x0'(X'X)⁻¹x0) ] = 1 − α.
The 100(1 − α)% confidence interval on the mean response at the point x01, x02, ..., x0k, i.e., on E(y/x0), is
[ ŷ0 − t_{α/2, n−k} √(σ̂² x0'(X'X)⁻¹x0),  ŷ0 + t_{α/2, n−k} √(σ̂² x0'(X'X)⁻¹x0) ].
Similarly, a 100(1 − α)% interval for a future observation at the point x0, with predicted value p_f = x0'b, is
[ p_f − t_{α/2, n−k} √(σ̂²[1 + x0'(X'X)⁻¹x0]),  p_f + t_{α/2, n−k} √(σ̂²[1 + x0'(X'X)⁻¹x0]) ].
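A Python sketch (simulated data; the point x0 and all values are assumed) computing the confidence interval on the mean response and the interval for a future observation at x0:

import numpy as np
from scipy import stats

rng = np.random.default_rng(13)
n, k, alpha = 50, 3, 0.05
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(0, 1, n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
sigma2_hat = (y - X @ b) @ (y - X @ b) / (n - k)
t_crit = stats.t.ppf(1 - alpha / 2, n - k)

x0 = np.array([1.0, 0.5, -0.2])          # point of interest (first entry 1 for the intercept)
y0_hat = x0 @ b
var_mean = sigma2_hat * x0 @ XtX_inv @ x0            # for the mean response
var_pred = sigma2_hat * (1 + x0 @ XtX_inv @ x0)      # for a future observation

print("CI for E(y|x0)     :", (y0_hat - t_crit * np.sqrt(var_mean), y0_hat + t_crit * np.sqrt(var_mean)))
print("prediction interval:", (y0_hat - t_crit * np.sqrt(var_pred), y0_hat + t_crit * np.sqrt(var_pred)))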