1) Derivations - Regression
1) Regression
1.1) Means and Predictors
1.1.1) Best Linear Predictor (data on 𝒚)
1.1.2) Best Linear Predictor (data on 𝒚 and 𝒙)
1.1.3) Best Linear Predictor (data on 𝒚 and 𝒙 + intercept)
1.1.4) Best Linear Predictor (data on 𝒚 and a vector 𝒙)
1.1.5) Decomposition of 𝒚𝒊 into 2 orthogonal components
1.2) Consistency and Asymptotic Normality of Predictors
1.2.1) Population level predictor 𝜷
1.2.2) Estimation error 𝜷̂ − 𝜷
1.2.3) Consistency of OLS estimator
1.2.4) Asymptotic normality of (1/n)sum(xiui)
1.2.5) Asymptotic Normality of OLS & Sandwich Formula
1.2.6) (Heteroskedasticity consistent) Estimators of the asymptotic variance of the estimation error aka "estimator of the sandwich"
1.2.7) Asymptotic normality for Individual Coefficients
1.2.8) Confidence Intervals
1.3) Classical Regression Model (CRM)
1.3.1) Assumptions
1.3.2) Properties
1.3.3) Variance of 𝒚 = Variance of 𝒖 & Variance of 𝒖 = Expectation of 𝒖𝒖′
1.3.4) The conditional distribution of u determines the conditional distribution of y
1.3.5) Deriving the Estimator
1.3.6) Conditional & Unconditional Expectation of 𝜷̂ (showing unbiasedness)
1.3.7) Conditional & Unconditional Variance of 𝜷̂
1.3.8) Conditional asymptotic distribution of estimation error & Sandwich Formula
1.3.9) Estimation of the Error Variance
1.4) Weighted Least Squares (WLS)
1.4.1) Deriving the WLS estimator
1.4.2) Deriving the WLS population 𝜷
1.4.3) Consistency of WLS estimator
1.4.4) Asymptotic normality of (1/n)sum(wixiui)
1.4.1) Asymptotic Normality
1.4.2) Asymptotic Efficiency (Optimal choice of weights wi)
1.4.3) Proving that the variance is smaller when we use the optimal choice of weights (Ask Rob)
1.4.4) Generalized Least Squares (GLS) aka "WLS with optimal weights"
1.4.5) Asymptotic normality of GLS
1.5) Clustered data
1.5.1) Individual observations
1.5.2) Expression for one cluster
1.5.3) Expression for all the data grouped (not really useful)
1.5.4) Expression for all the data (no grouping)
1.5.5) OLS Estimator in 3 equivalent formats
1.5.6) Estimation error
1.5.7) Consistency
1.5.8) Asymptotic normality of sum(X'h uh)
1.5.9) Asymptotic normality
1.5.1) Cluster robust standard errors aka "estimator of the W-sandwich"
1.6) Fixed effects
1.6.1) Individual observations
1.6.2) At a cluster level
1.6.3) Expression for all the data grouped (not really useful)
1.6.4) Expression for all the data (no grouping)
1.6.5) OLS Estimator w/ Fixed Effects through Partitioned Regression
1.6.6) Breaking down the Q matrix
1.6.7) Conditional Variance of 𝜷̂
2) Appendix
2.1) Matrix Differentiation
2.2) Adding matrix expressions that are actually scalars
2.3) Slutsky's Theorem (Convergence of Transformations)
2.4) Slutsky's Theorem (Convergence of sums or products of R.V.'s)
2.5) Law of Large Numbers (LLN)
2.6) Cramér's Theorem
2.7) Difference between Sample and Population
2.8) Tricks for estimating population objects
2.9) Delta Method
2.10) CLT
2.11) Operations with normal distribution operator
2.11.1) Univariate distributions
2.11.2) Multivariate distributions
2.12) Linear Regression and CEF
2.12.1) CEF
2.12.2) Population Linear Regression / Best Linear Predictor (BLP)
2.12.3) Useful tricks for Sandwich Formulas
2.12.4) Projection Matrix
2.12.5) Annihilator Matrix
2.12.6) Useful Matrix Properties
2.12.7) Victor's TA comments
2.12.8) Multivariate Normal Distribution of the y's
2.13) Useful trick for working in deviations from the mean
1) Regression
Isolate $a$:
$$-2\sum_{i=1}^{n}(y_i - a) = 0 \;\Rightarrow\; \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} a = 0 \;\Rightarrow\; \sum_{i=1}^{n} y_i = \sum_{i=1}^{n} a = na$$
$$\bar{y} \equiv a = \frac{1}{n}\sum_{i=1}^{n} y_i$$
$$\frac{\partial}{\partial\beta}(\cdot) = -2\,\mathbb{E}[x_i(y_i - \beta x_i)] = 0 \;\Rightarrow\; \mathbb{E}[x_i y_i] = \beta\,\mathbb{E}[x_i^2] \;\Rightarrow\; \beta = \frac{\mathbb{E}[x_i y_i]}{\mathbb{E}[x_i^2]}$$
Sample level ($\hat{\beta}$):
$$\frac{\partial}{\partial\beta}(\cdot) = -2\sum_{i=1}^{n} x_i(y_i - \beta x_i) = 0 \;\Rightarrow\; \sum_{i=1}^{n} x_i y_i = \beta\sum_{i=1}^{n} x_i^2 \;\Rightarrow\; \hat{\beta} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}$$
From the FOC for $a$:
$$na = \sum_{i=1}^{n} y_i - b\sum_{i=1}^{n} x_i \;\Rightarrow\; a = \frac{1}{n}\sum_{i=1}^{n} y_i - \frac{b}{n}\sum_{i=1}^{n} x_i \qquad (1)$$
FOC for $b$:
$$\frac{\partial}{\partial b}(\cdot) = \sum_{i=1}^{n} -2x_i(y_i - a - bx_i) = 0 \;\Rightarrow\; \sum_{i=1}^{n} x_iy_i - a\sum_{i=1}^{n} x_i - b\sum_{i=1}^{n} x_i^2 = 0 \qquad (2)$$
Substitute (1) into (2):
$$\sum_{i=1}^{n} x_iy_i - \left(\frac{1}{n}\sum_{i=1}^{n} y_i - \frac{b}{n}\sum_{i=1}^{n} x_i\right)\sum_{i=1}^{n} x_i - b\sum_{i=1}^{n} x_i^2 = 0$$
$$\sum_{i=1}^{n} x_iy_i - \frac{1}{n}\sum_{i=1}^{n} y_i\sum_{i=1}^{n} x_i + \frac{b}{n}\left(\sum_{i=1}^{n} x_i\right)^2 - b\sum_{i=1}^{n} x_i^2 = 0$$
Isolate $b$:
$$b\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right] = \sum_{i=1}^{n} x_iy_i - \frac{1}{n}\sum_{i=1}^{n} y_i\sum_{i=1}^{n} x_i$$
And finally:
$$\hat{\beta} \equiv b = \frac{\sum_{i=1}^{n} x_iy_i - \frac{1}{n}\sum_{i=1}^{n} y_i\sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2}
= \frac{\frac{1}{n}\sum_{i=1}^{n} x_iy_i - \frac{1}{n}\sum_{i=1}^{n} y_i \cdot \frac{1}{n}\sum_{i=1}^{n} x_i}{\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)^2}
= \frac{Cov(x_i, y_i)}{Var(x_i)}$$
(where we have multiplied numerator and denominator by $1/n$).
Conclusion:
$$\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x} \qquad\wedge\qquad \hat{\beta} = \frac{Cov(x_i, y_i)}{Var(x_i)}$$
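As a quick numerical sanity check (not part of the original notes; the data-generating values are made up for the example), a short numpy sketch confirming that the closed-form slope and intercept match a generic least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000
x = rng.normal(size=n)
y = 1.5 + 2.0 * x + rng.normal(size=n)          # true alpha = 1.5, beta = 2

# Closed-form estimates from the derivation above: Cov/Var and ybar - beta*xbar
beta_hat = np.sum((x - x.mean()) * y) / np.sum((x - x.mean()) ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()

# Cross-check against numpy's least-squares solver
X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

print(alpha_hat, beta_hat)      # ~ (1.5, 2.0)
print(coef)                     # same numbers: [alpha_hat, beta_hat]
```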
Sigma Notation
➢ 𝑦𝑖 is an individual observation of the dependent variable
➢ $\mathbf{x}_i$ is a vector of regressors or independent variables: $\mathbf{x}_i = (1, x_{i2}, \ldots, x_{ik})'$
$$\hat{\boldsymbol{\beta}} = \arg\min_{\mathbf{b}}\sum_{i=1}^{n}\big(\underbrace{y_i}_{1\times 1} - \underbrace{\mathbf{x}_i'}_{1\times k}\underbrace{\mathbf{b}}_{k\times 1}\big)^2$$
FOC:
$$\frac{\partial}{\partial\mathbf{b}}(\cdot) = \sum_{i=1}^{n} -2\mathbf{x}_i(y_i - \mathbf{x}_i'\mathbf{b}) = 0 \;\Rightarrow\; \sum_{i=1}^{n}\underbrace{\mathbf{x}_i}_{k\times 1}\underbrace{y_i}_{1\times 1} = \sum_{i=1}^{n}\underbrace{\mathbf{x}_i\mathbf{x}_i'}_{k\times k}\underbrace{\mathbf{b}}_{k\times 1}$$
$$\left(\sum_{i=1}^{n}\mathbf{x}_iy_i\right) = \left(\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i'\right)\mathbf{b}
\;\Rightarrow\;
\underbrace{\left(\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i'\right)^{-1}}_{k\times k}\underbrace{\left(\sum_{i=1}^{n}\mathbf{x}_iy_i\right)}_{k\times 1} = \underbrace{\mathbf{b}}_{k\times 1}$$
Hence:
$$\hat{\boldsymbol{\beta}} \equiv \mathbf{b} = \left(\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i'\right)^{-1}\left(\sum_{i=1}^{n}\mathbf{x}_iy_i\right)$$
Note that the FOC has an important implication. Recall that it states:
$$\sum_{i=1}^{n}\begin{pmatrix}1\\ x_{i2}\\ \vdots\\ x_{ik}\end{pmatrix}\hat{u}_i
= \begin{pmatrix}\sum\hat{u}_i\\ \sum x_{i2}\hat{u}_i\\ \vdots\\ \sum x_{ik}\hat{u}_i\end{pmatrix}
= \begin{pmatrix}0\\ 0\\ \vdots\\ 0\end{pmatrix}$$
So:
$$\sum_{i=1}^{n}\hat{u}_i = 0 \qquad\text{and}\qquad \sum_{i=1}^{n} x_{ik}\hat{u}_i = 0 \text{ for any regressor } k$$
➢ The $x$'s and the $\hat{u}$'s have 0 (sample) covariance, thanks to $\sum\hat{u}_i = 0$ and $\sum x_{ik}\hat{u}_i = 0$ for any regressor $k$.
Define:
$$\underbrace{\hat{\mathbf{u}}}_{n\times 1} = \underbrace{\mathbf{y}}_{n\times 1} - \underbrace{\mathbf{X}}_{n\times k}\underbrace{\mathbf{b}}_{k\times 1}$$
$$\hat{\boldsymbol{\beta}} = \arg\min_{\mathbf{b}}\big(\underbrace{\hat{\mathbf{u}}'}_{1\times n}\underbrace{\hat{\mathbf{u}}}_{n\times 1}\big)$$
where:
$$\hat{\mathbf{u}}'\hat{\mathbf{u}} = (\mathbf{y} - \mathbf{Xb})'(\mathbf{y} - \mathbf{Xb}) = \mathbf{y}'\mathbf{y} - \mathbf{y}'\mathbf{Xb} - (\mathbf{Xb})'\mathbf{y} + (\mathbf{Xb})'(\mathbf{Xb}) = \mathbf{y}'\mathbf{y} - \mathbf{y}'\mathbf{Xb} - \mathbf{b}'\mathbf{X}'\mathbf{y} + \mathbf{b}'\mathbf{X}'\mathbf{Xb}$$
Hence:
$$\hat{\boldsymbol{\beta}} = \arg\min_{\mathbf{b}}(\mathbf{y}'\mathbf{y} - \mathbf{y}'\mathbf{Xb} - \mathbf{b}'\mathbf{X}'\mathbf{y} + \mathbf{b}'\mathbf{X}'\mathbf{Xb})$$
FOC:
$$\frac{\partial}{\partial\mathbf{b}}(\hat{\mathbf{u}}'\hat{\mathbf{u}}) = -(\mathbf{y}'\mathbf{X})' - \mathbf{X}'\mathbf{y} + 2\mathbf{X}'\mathbf{Xb} = -\mathbf{X}'\mathbf{y} - \mathbf{X}'\mathbf{y} + 2\mathbf{X}'\mathbf{Xb} = -2\mathbf{X}'\mathbf{y} + 2\mathbf{X}'\mathbf{Xb} = 0$$
Solve for $\mathbf{b}$:
$$\mathbf{X}'\mathbf{Xb} = \mathbf{X}'\mathbf{y} \;\Rightarrow\; (\mathbf{X}'\mathbf{X})^{-1}(\mathbf{X}'\mathbf{X})\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}(\mathbf{X}'\mathbf{y}) \;\Rightarrow\; \mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}(\mathbf{X}'\mathbf{y})$$
Thus:
$$\hat{\boldsymbol{\beta}} \equiv \mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$$
This lets us decompose $y_i$ into two orthogonal components:
$$y_i = \hat{y}_i + \hat{u}_i$$
• $\hat{y}_i = \mathbf{x}_i'\hat{\boldsymbol{\beta}}$
• $\hat{u}_i = y_i - \mathbf{x}_i'\hat{\boldsymbol{\beta}}$
$$Cov(\hat{y}_i, \hat{u}_i) = Cov(\mathbf{x}_i'\hat{\boldsymbol{\beta}}, \hat{u}_i) = \underbrace{Cov(\mathbf{x}_i', \hat{u}_i)}_{0}\,\hat{\boldsymbol{\beta}} = 0$$
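A minimal numpy sketch (illustrative data only, not from the notes) of the matrix formula and of the FOC implication $\mathbf{X}'\hat{\mathbf{u}} = \mathbf{0}$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])   # first regressor is a constant
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(size=n)

# beta_hat = (X'X)^{-1} X'y, computed with solve() rather than an explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ beta_hat

# FOC implication: X'u_hat = 0, so residuals sum to zero and are orthogonal to every regressor
print(X.T @ u_hat)              # ~ [0, 0, 0] up to floating-point error
print(u_hat.sum())              # ~ 0
```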
$$y_i = \mathbf{x}_i'\boldsymbol{\beta} + u_i$$
$$\frac{\partial}{\partial\mathbf{b}}(\cdot) = \mathbb{E}[-2\mathbf{x}_i(y_i - \mathbf{x}_i'\mathbf{b})] = 0 \;\Rightarrow\; \underbrace{\mathbb{E}[\mathbf{x}_iy_i]}_{k\times 1} = \underbrace{\mathbb{E}[\mathbf{x}_i\mathbf{x}_i'\mathbf{b}]}_{k\times 1}$$
or equivalently:
$$\mathbb{E}[\mathbf{x}_iu_i] = \mathbf{0}$$
which states that the regressors are uncorrelated with the population errors.
1.2.2) Estimation error $\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}$
Recall the expression for $\hat{\boldsymbol{\beta}}$:
$$\hat{\boldsymbol{\beta}} \equiv \left(\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i'\right)^{-1}\left(\sum_{i=1}^{n}\mathbf{x}_iy_i\right)$$
Substitute $y_i = \mathbf{x}_i'\boldsymbol{\beta} + u_i$:
$$\hat{\boldsymbol{\beta}} = \underbrace{\left(\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i'\right)^{-1}}_{k\times k}\underbrace{\left(\sum_{i=1}^{n}\mathbf{x}_i(\mathbf{x}_i'\boldsymbol{\beta} + u_i)\right)}_{k\times 1}
= \left(\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i'\right)^{-1}\left(\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i'\right)\boldsymbol{\beta} + \left(\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i'\right)^{-1}\left(\sum_{i=1}^{n}\mathbf{x}_iu_i\right)$$
$$\underbrace{\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}}_{k\times 1} = \left(\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i'\right)^{-1}\left(\sum_{i=1}^{n}\mathbf{x}_iu_i\right)
= \left(\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i'\right)^{-1}\left(\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_iu_i\right)$$
Take the plim of this expression:
$$\operatorname{plim}[\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}] = \operatorname{plim}\left[\left(\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i'\right)^{-1}\left(\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_iu_i\right)\right]$$
By the LLN and Slutsky's Theorem (I), $\bar{Z}_n \xrightarrow{p} Z \Rightarrow g(\bar{Z}_n) \xrightarrow{p} g(Z)$ with $g(\cdot) = (\cdot)^{-1}$:
$$\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i' \xrightarrow{p} \mathbb{E}(\mathbf{x}_i\mathbf{x}_i') \;\Rightarrow\; \left(\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i'\right)^{-1} \xrightarrow{p} [\mathbb{E}(\mathbf{x}_i\mathbf{x}_i')]^{-1}$$
By (the weaker version of) Slutsky's Theorem (II) applied to convergence in probability, if:
$$\mathbf{D}_n \xrightarrow{p} \mathbf{D}:\qquad \left(\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i'\right)^{-1} \xrightarrow{p} [\mathbb{E}(\mathbf{x}_i\mathbf{x}_i')]^{-1}$$
$$\mathbf{K}_n \xrightarrow{p} \mathbf{K}:\qquad \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_iu_i \xrightarrow{p} \mathbb{E}[\mathbf{x}_iu_i] = \mathbf{0}$$
then:
$$\mathbf{D}_n\mathbf{K}_n \xrightarrow{p} \mathbf{DK}:\qquad \left(\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i'\right)^{-1}\left(\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_iu_i\right) \xrightarrow{p} [\mathbb{E}(\mathbf{x}_i\mathbf{x}_i')]^{-1}\underbrace{\mathbb{E}[\mathbf{x}_iu_i]}_{\mathbf{0}} = \mathbf{0}$$
Hence $\hat{\boldsymbol{\beta}}$ is a consistent estimator.
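A small Monte Carlo sketch (simulated data with an assumed DGP, purely illustrative) of consistency: the estimation error shrinks as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(2)
beta = np.array([1.0, 2.0])

def ols(n):
    # One draw of the OLS estimator with sample size n
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    u = rng.standard_t(df=5, size=n)          # non-normal errors; E[x_i u_i] = 0 still holds
    y = X @ beta + u
    return np.linalg.solve(X.T @ X, X.T @ y)

for n in (50, 500, 5_000, 50_000):
    est = ols(n)
    print(n, np.abs(est - beta).max())         # estimation error shrinks as n grows
```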
o $Cov[\mathbf{x}_iu_i, \mathbf{x}_ju_j] = 0$ follows from the independence of observations, which implies that $\mathbb{E}[\mathbf{x}_iu_i\,\mathbf{x}_ju_j] = \mathbb{E}[\mathbf{x}_iu_i]\,\mathbb{E}[\mathbf{x}_ju_j] = 0$.
$$\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\mathbf{x}_iu_i \xrightarrow{d} \mathcal{N}(0, \mathbf{V})$$
We want to apply Cramér's Theorem to approximate the asymptotic distribution of this object. Hence, we need to decompose the expression above into two matrices: one that converges in probability to a known object and one whose asymptotic distribution we know:
✓ $\left(\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i'\right)^{-1}$ converges in probability to a known object, namely $[\mathbb{E}(\mathbf{x}_i\mathbf{x}_i')]^{-1}$.
✓ $\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_iu_i$ is an object whose asymptotic distribution we know; in fact, we calculated it just above. It will be convenient to use the final expression we arrived at: $\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\mathbf{x}_iu_i \xrightarrow{d} \mathcal{N}(0, \mathbf{V})$.
Thus, we will scale our estimation error to apply the theorem in a cleaner way:
$$\sqrt{n}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) = \underbrace{\left(\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i'\right)^{-1}}_{\xrightarrow{p}\,[\mathbb{E}(\mathbf{x}_i\mathbf{x}_i')]^{-1}}\underbrace{\left(\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\mathbf{x}_iu_i\right)}_{\xrightarrow{d}\,\mathcal{N}(0,\mathbf{V})}$$
Using the properties of distributional operators, the expression above is equivalent to:
$$\left(\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i'\right)^{-1}\left(\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\mathbf{x}_iu_i\right) \xrightarrow{d} \mathcal{N}\big(0,\,[\mathbb{E}(\mathbf{x}_i\mathbf{x}_i')]^{-1}\mathbf{V}[\mathbb{E}(\mathbf{x}_i\mathbf{x}_i')]^{-1}\big)$$
or simply:
$$\left(\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i'\right)^{-1}\left(\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\mathbf{x}_iu_i\right) \xrightarrow{d} \mathcal{N}(0, \mathbf{W})$$
where the $k\times k$ sandwich matrix is:
$$\mathbf{W} = [\mathbb{E}(\mathbf{x}_i\mathbf{x}_i')]^{-1}\,\mathbb{E}[u_i^2\mathbf{x}_i\mathbf{x}_i']\,[\mathbb{E}(\mathbf{x}_i\mathbf{x}_i')]^{-1}$$
Since $\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}$ is a $k\times 1$ vector, the variance of this object has to be a square $k\times k$ matrix.
Hence:
$$\sqrt{n}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) \xrightarrow{d} \mathcal{N}(0, \mathbf{W})$$
A consistent estimator can be proposed by simply replacing population objects with sample objects:
Instead of expectations $\mathbb{E}(\cdot)$, use a sample mean $\frac{1}{n}\sum_{i=1}^{n}(\cdot)$
Instead of the population error $u_i$, use the sample residual $\hat{u}_i$
$$\hat{\mathbf{W}} = \left[\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i'\right]^{-1}\left[\frac{1}{n}\sum_{i=1}^{n}\hat{u}_i^2\mathbf{x}_i\mathbf{x}_i'\right]\left[\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i'\right]^{-1}$$
Important: $\mathbf{W}$ and its estimator $\hat{\mathbf{W}}$ are HETEROSKEDASTICITY-ROBUST, as we didn't impose any assumption to obtain them.
For an individual coefficient $\beta_j$ (letting $w_{jj}$ denote the $j$-th diagonal element of $\mathbf{W}$):
$$\sqrt{n}(\hat{\beta}_j - \beta_j) \xrightarrow{d} \mathcal{N}(0, w_{jj}) \;\Rightarrow\; (\hat{\beta}_j - \beta_j) \xrightarrow{d} \mathcal{N}\!\left(0, \frac{w_{jj}}{n}\right) \;\Rightarrow\; \hat{\beta}_j \xrightarrow{d} \beta_j + \mathcal{N}\!\left(0, \frac{w_{jj}}{n}\right) \;\Rightarrow\; \hat{\beta}_j \xrightarrow{d} \mathcal{N}\!\left(\beta_j, \frac{w_{jj}}{n}\right)$$
$$\hat{\beta}_j \xrightarrow{d} \mathcal{N}\!\left(\beta_j, \frac{w_{jj}}{n}\right) \;\Longleftrightarrow\; \frac{\sqrt{n}(\hat{\beta}_j - \beta_j)}{\sqrt{w_{jj}}} \xrightarrow{d} \mathcal{N}(0,1)$$
Replacing $w_{jj}$ by its consistent estimator $\hat{w}_{jj}$:
$$\hat{\beta}_j \xrightarrow{d} \mathcal{N}\!\left(\beta_j, \frac{\hat{w}_{jj}}{n}\right) \;\Longleftrightarrow\; \frac{\sqrt{n}(\hat{\beta}_j - \beta_j)}{\sqrt{\hat{w}_{jj}}} \xrightarrow{d} \mathcal{N}(0,1)$$
$$CI_{0.95} = \left(\hat{\beta}_j - 1.96\sqrt{\frac{\hat{w}_{jj}}{n}},\;\; \hat{\beta}_j + 1.96\sqrt{\frac{\hat{w}_{jj}}{n}}\right)$$
This confidence interval uses heteroskedasticity-consistent standard errors, as homoskedasticity was not assumed to obtain $\mathbf{W}$ and its estimate $\hat{\mathbf{W}}$.
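A numpy sketch of the whole pipeline above (simulated heteroskedastic data; not the notes' own example): compute $\hat{\mathbf{W}}$, the robust standard errors $\sqrt{\hat{w}_{jj}/n}$, and the 95% confidence intervals.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
u = (0.5 + np.abs(X[:, 1])) * rng.normal(size=n)    # heteroskedastic errors
beta = np.array([1.0, 2.0])
y = X @ beta + u

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ beta_hat

# W_hat = (1/n Σ x x')^{-1} (1/n Σ u_hat^2 x x') (1/n Σ x x')^{-1}
Sxx_inv = np.linalg.inv(X.T @ X / n)
meat = (X * u_hat[:, None] ** 2).T @ X / n
W_hat = Sxx_inv @ meat @ Sxx_inv

se = np.sqrt(np.diag(W_hat) / n)                    # heteroskedasticity-robust standard errors
ci_low, ci_high = beta_hat - 1.96 * se, beta_hat + 1.96 * se
print(np.column_stack([beta_hat, ci_low, ci_high]))
```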
1.3.1) Assumptions
Assumption 1 (A1) 𝔼(𝒚|𝑿) = 𝑿𝜷
1a) Strict Exogeneity 𝔼(𝑦𝑖 | 𝑥1 , … , 𝑥𝑛 ) = 𝔼(𝑦𝑖 | 𝑥𝑖 )
1b) Linearity 𝔼(𝑦𝑖 | 𝑥𝑖 ) = 𝛼 + 𝑥𝑖′ 𝛽
Assumption 2 (A2) 𝑉𝑎𝑟(𝒚|𝑿) = 𝜎 2 𝑰𝒏
2a) Conditional Uncorrelatedness 𝑉𝑎𝑟(𝑦𝑖 | 𝑥1 , … , 𝑥𝑛 ) = 𝑉𝑎𝑟(𝑦𝑖 | 𝑥𝑖 ) ⇒ 𝐶𝑜𝑣(𝑦𝑖 , 𝑦𝑗 |𝑥1 , … , 𝑥𝑛 ) = 0
2b) Homoskedasticity 𝑉𝑎𝑟(𝑦𝑖 |𝑥𝑖 ) = 𝜎 2
Assumption 3 (A3) 𝒚|𝑿 ∼ 𝓝(𝑿𝜷, 𝜎 2 𝑰𝒏 )
The joint pdf must be normal. (Multivariate) normality implies that the conditional mean is linear (1b) and that the conditional variance is constant (2b).
1.3.2) Properties
Property: Assumption required
• Conditional Unbiasedness: A1
• BLUE: A1 & A2
• BUE: A1, A2 & A3
Under A2: $Var(\underbrace{\mathbf{y}}_{N\times 1}|\mathbf{X}) = \sigma^2\mathbf{I}_n$
$$Var(\mathbf{y}|\mathbf{X}) = Var(\underbrace{\mathbf{X}\boldsymbol{\beta}}_{\text{constant}} + \mathbf{u}|\mathbf{X}) = Var(\mathbf{u}|\mathbf{X})$$
and:
$$Var(\mathbf{u}|\mathbf{X}) = \underbrace{\mathbb{E}(\mathbf{uu}'|\mathbf{X})}_{N\times N} - \underbrace{\mathbb{E}(\mathbf{u}|\mathbf{X})}_{N\times 1,\;=\,\mathbf{0}}\underbrace{[\mathbb{E}(\mathbf{u}|\mathbf{X})]'}_{1\times N} = \mathbb{E}(\mathbf{uu}'|\mathbf{X})$$
Recapitulation:
$$Var(\mathbf{y}|\mathbf{X}) = Var(\mathbf{u}|\mathbf{X}) = \mathbb{E}(\mathbf{uu}'|\mathbf{X}) \overset{(A2)}{=} \sigma^2\mathbf{I}_n$$
Given that: 𝒚 = 𝑿𝜷 + 𝒖
Sigma notation
$$\hat{\boldsymbol{\beta}} = \arg\min_{\mathbf{b}}\sum_{i=1}^{n}(y_i - \mathbf{x}_i'\mathbf{b})^2$$
FOC:
$$\frac{\partial}{\partial\mathbf{b}}(\cdot) = \sum_{i=1}^{n} -2\mathbf{x}_i(y_i - \mathbf{x}_i'\mathbf{b}) = 0 \;\Rightarrow\; \sum_{i=1}^{n}\mathbf{x}_iy_i = \sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i'\mathbf{b}$$
Hence:
$$\hat{\boldsymbol{\beta}} \equiv \mathbf{b} = \left(\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i'\right)^{-1}\left(\sum_{i=1}^{n}\mathbf{x}_iy_i\right)$$
Matrix notation
$$\hat{\boldsymbol{\beta}} = \arg\min_{\mathbf{b}}\big(\underbrace{\hat{\mathbf{u}}'}_{1\times n}\underbrace{\hat{\mathbf{u}}}_{n\times 1}\big)$$
$$\hat{\mathbf{u}}'\hat{\mathbf{u}} = (\mathbf{y} - \mathbf{Xb})'(\mathbf{y} - \mathbf{Xb}) = \mathbf{y}'\mathbf{y} - \mathbf{y}'\mathbf{Xb} - \mathbf{b}'\mathbf{X}'\mathbf{y} + \mathbf{b}'\mathbf{X}'\mathbf{Xb}$$
Hence:
$$\hat{\boldsymbol{\beta}} = \arg\min_{\mathbf{b}}(\mathbf{y}'\mathbf{y} - \mathbf{y}'\mathbf{Xb} - \mathbf{b}'\mathbf{X}'\mathbf{y} + \mathbf{b}'\mathbf{X}'\mathbf{Xb})$$
FOC:
$$\frac{\partial}{\partial\mathbf{b}}(\hat{\mathbf{u}}'\hat{\mathbf{u}}) = -(\mathbf{y}'\mathbf{X})' - \mathbf{X}'\mathbf{y} + 2\mathbf{X}'\mathbf{Xb} = -2\mathbf{X}'\mathbf{y} + 2\mathbf{X}'\mathbf{Xb} = 0$$
Solve for $\mathbf{b}$:
$$\mathbf{X}'\mathbf{Xb} = \mathbf{X}'\mathbf{y} \;\Rightarrow\; (\mathbf{X}'\mathbf{X})^{-1}(\mathbf{X}'\mathbf{X})\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} \;\Rightarrow\; \underbrace{\mathbf{b}}_{k\times 1} = \underbrace{(\mathbf{X}'\mathbf{X})^{-1}}_{k\times k}\underbrace{\mathbf{X}'\mathbf{y}}_{k\times 1}$$
Thus:
$$\hat{\boldsymbol{\beta}} \equiv \mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$$
1.3.6) Conditional & Unconditional Expectation of $\hat{\boldsymbol{\beta}}$ (showing unbiasedness)
➔ Conditional Expectation
Under A1, $\mathbb{E}(\mathbf{y}|\mathbf{X}) = \mathbf{X}\boldsymbol{\beta}$. Hence, we can show that $\hat{\boldsymbol{\beta}}$ is an unbiased estimator:
$$\mathbb{E}[\hat{\boldsymbol{\beta}}|\mathbf{X}] = \mathbb{E}[\underbrace{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'}_{\text{constant}}\mathbf{y}|\mathbf{X}] = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\underbrace{\mathbb{E}[\mathbf{y}|\mathbf{X}]}_{\mathbf{X}\boldsymbol{\beta}} = \underbrace{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}}_{\mathbf{I}_k}\boldsymbol{\beta} = \boldsymbol{\beta}$$
Hence: $\mathbb{E}[\hat{\boldsymbol{\beta}}|\mathbf{X}] = \boldsymbol{\beta}$
➔ Unconditional Expectation
By the law of iterated expectations, $\hat{\boldsymbol{\beta}}$ is also unconditionally unbiased:
$$\mathbb{E}[\hat{\boldsymbol{\beta}}] = \mathbb{E}\big[\mathbb{E}[\hat{\boldsymbol{\beta}}|\mathbf{X}]\big] = \mathbb{E}[\boldsymbol{\beta}] = \boldsymbol{\beta}$$
Hence: $\mathbb{E}[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}$
1.3.7) Conditional & Unconditional Variance of $\hat{\boldsymbol{\beta}}$
➔ Conditional Variance
Under A2, $Var(\mathbf{y}|\mathbf{X}) = \sigma^2\mathbf{I}_n$:
$$Var[\hat{\boldsymbol{\beta}}|\mathbf{X}] = Var[\underbrace{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'}_{\text{constant}}\mathbf{y}|\mathbf{X}]
= \big((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\big)\underbrace{Var[\mathbf{y}|\mathbf{X}]}_{\sigma^2\mathbf{I}_n}\big((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\big)'
= \sigma^2\underbrace{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}}_{\mathbf{I}_k}(\mathbf{X}'\mathbf{X})^{-1} = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}$$
➔ Unconditional Variance
$$Var[\hat{\boldsymbol{\beta}}] = Var\big[\underbrace{\mathbb{E}(\hat{\boldsymbol{\beta}}|\mathbf{X})}_{\boldsymbol{\beta}}\big] + \mathbb{E}\big[\underbrace{Var(\hat{\boldsymbol{\beta}}|\mathbf{X})}_{\sigma^2(\mathbf{X}'\mathbf{X})^{-1}}\big]
= \underbrace{Var[\boldsymbol{\beta}]}_{0} + \mathbb{E}[\sigma^2(\mathbf{X}'\mathbf{X})^{-1}] = \sigma^2\mathbb{E}[(\mathbf{X}'\mathbf{X})^{-1}]$$
where:
$$\mathbb{E}[\sqrt{n}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})|\mathbf{X}] = \sqrt{n}\big\{\underbrace{\mathbb{E}[\hat{\boldsymbol{\beta}}|\mathbf{X}]}_{\boldsymbol{\beta}} - \underbrace{\mathbb{E}[\boldsymbol{\beta}|\mathbf{X}]}_{\boldsymbol{\beta}}\big\} = \mathbf{0}$$
and to find our sandwich formula $\mathbf{W}$ we need to calculate the conditional variance of the rescaled estimation error. Recall that $\hat{\boldsymbol{\beta}} - \boldsymbol{\beta} = \left(\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i'\right)^{-1}\sum_{i=1}^{n}\mathbf{x}_iu_i$, so:
$$Var[\sqrt{n}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})|\mathbf{X}] = n\cdot Var[\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}|\mathbf{X}]
= n\left(\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i'\right)^{-1}\left(\sum_{i=1}^{n}\mathbf{x}_i\,\sigma^2\,\mathbf{x}_i'\right)\left(\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i'\right)^{-1}
= n\sigma^2\left(\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i'\right)^{-1}
= \sigma^2\left(\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i'\right)^{-1}$$
Note that $\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i' \xrightarrow{p} \mathbb{E}(\mathbf{x}_i\mathbf{x}_i')$.
Conclusion:
$$Var[\sqrt{n}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})|\mathbf{X}] \xrightarrow{p} \sigma^2[\mathbb{E}(\mathbf{x}_i\mathbf{x}_i')]^{-1}$$
Hence, we have:
$$\sqrt{n}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})|\mathbf{X} \sim \mathcal{N}(\mathbf{0}, \mathbf{W}) \qquad\text{with}\qquad \mathbf{W} = \sigma^2[\mathbb{E}(\mathbf{x}_i\mathbf{x}_i')]^{-1}$$
$$\mathbb{E}[\underbrace{\hat{\mathbf{u}}'}_{1\times n}\underbrace{\hat{\mathbf{u}}}_{n\times 1}] = \mathbb{E}[(\mathbf{Mu})'(\mathbf{Mu})] = \mathbb{E}[\mathbf{u}'\underbrace{\mathbf{M}'\mathbf{M}}_{\mathbf{M}}\mathbf{u}] = \mathbb{E}[\mathbf{u}'\mathbf{Mu}] = \mathbb{E}[\mathbb{E}(\mathbf{u}'\mathbf{Mu}|\mathbf{X})]$$
$$= \mathbb{E}\big[\mathbb{E}\big(\operatorname{tr}(\underbrace{\mathbf{M}}_{\mathbf{A}}\underbrace{\mathbf{uu}'}_{\mathbf{B}})|\mathbf{X}\big)\big]
= \mathbb{E}\big[\operatorname{tr}\big(\mathbf{M}\,\underbrace{\mathbb{E}(\mathbf{uu}'|\mathbf{X})}_{\sigma^2\mathbf{I}_n}\big)\big]
= \mathbb{E}[\operatorname{tr}(\mathbf{M}\sigma^2)]
= \sigma^2\mathbb{E}[\underbrace{\operatorname{tr}(\mathbf{M})}_{n-k}] = \sigma^2(n-k)$$
Hence: $\mathbb{E}[\hat{\mathbf{u}}'\hat{\mathbf{u}}] = \sigma^2(n-k)$
This means that $\sigma^2 = \frac{\mathbb{E}[\hat{\mathbf{u}}'\hat{\mathbf{u}}]}{n-k} = \mathbb{E}\!\left[\frac{\hat{\mathbf{u}}'\hat{\mathbf{u}}}{n-k}\right]$. So, working backwards, our unbiased estimator of $\sigma^2$ is:
$$\hat{\sigma}^2 = \frac{\hat{\mathbf{u}}'\hat{\mathbf{u}}}{n-k}$$
which is unbiased, since $\mathbb{E}[\hat{\sigma}^2] = \mathbb{E}\!\left[\frac{\hat{\mathbf{u}}'\hat{\mathbf{u}}}{n-k}\right] = \sigma^2$.
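A short numpy sketch (simulated homoskedastic data, purely illustrative) of the unbiased variance estimator and the classical conditional variance $\hat{\sigma}^2(\mathbf{X}'\mathbf{X})^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
sigma = 1.3
beta = np.array([0.5, 1.0, -1.0])
y = X @ beta + sigma * rng.normal(size=n)           # homoskedastic errors (A2)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ beta_hat

sigma2_hat = (u_hat @ u_hat) / (n - k)              # unbiased estimator of sigma^2
var_beta_hat = sigma2_hat * np.linalg.inv(X.T @ X)  # classical Var(beta_hat | X)
print(sigma2_hat, np.sqrt(np.diag(var_beta_hat)))   # ~ sigma^2 and the classical standard errors
```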
Weighted Least Squares minimizes a weighted sum of squared residuals, $\tilde{\boldsymbol{\beta}} = \arg\min_{\mathbf{b}}\sum_{i=1}^{n} w_i(y_i - \mathbf{x}_i'\mathbf{b})^2$. FOC:
$$\frac{\partial}{\partial\mathbf{b}}(\cdot) = \sum_{i=1}^{n} w_i(-2)\mathbf{x}_i(y_i - \mathbf{x}_i'\mathbf{b}) = 0 \;\Rightarrow\; \sum_{i=1}^{n} w_i\mathbf{x}_iy_i = \sum_{i=1}^{n} w_i\mathbf{x}_i\mathbf{x}_i'\mathbf{b}$$
$$\underbrace{\left(\sum_{i=1}^{n} w_i\mathbf{x}_i\mathbf{x}_i'\right)^{-1}}_{k\times k}\underbrace{\left(\sum_{i=1}^{n} w_i\mathbf{x}_iy_i\right)}_{k\times 1} = \underbrace{\mathbf{b}}_{k\times 1}$$
Hence:
$$\tilde{\boldsymbol{\beta}} = \left(\sum_{i=1}^{n} w_i\mathbf{x}_i\mathbf{x}_i'\right)^{-1}\left(\sum_{i=1}^{n} w_i\mathbf{x}_iy_i\right)$$
Matrix Notation
➢ $\mathbf{y} = (y_1, \ldots, y_n)'$, an $n\times 1$ vector
➢ $\mathbf{X} = \begin{pmatrix}x_{11} & \cdots & x_{1k}\\ \vdots & \ddots & \vdots\\ x_{n1} & \cdots & x_{nk}\end{pmatrix}_{n\times k} = \begin{pmatrix}\mathbf{x}_1'\\ \vdots\\ \mathbf{x}_n'\end{pmatrix}$
➢ $\boldsymbol{\Omega} = \operatorname{diag}(w_1, \ldots, w_n)$, an $n\times n$ matrix; it's symmetric, so $\boldsymbol{\Omega} = \boldsymbol{\Omega}'$
Define $\underbrace{\hat{\mathbf{u}}}_{n\times 1} = \mathbf{y} - \mathbf{Xb}$.
$$\tilde{\boldsymbol{\beta}} = \arg\min_{\mathbf{b}}\big(\underbrace{\hat{\mathbf{u}}'}_{1\times n}\underbrace{\boldsymbol{\Omega}}_{n\times n}\underbrace{\hat{\mathbf{u}}}_{n\times 1}\big)$$
where:
$$\hat{\mathbf{u}}'\boldsymbol{\Omega}\hat{\mathbf{u}} = (\mathbf{y} - \mathbf{Xb})'\boldsymbol{\Omega}(\mathbf{y} - \mathbf{Xb}) = \mathbf{y}'\boldsymbol{\Omega}\mathbf{y} - \mathbf{y}'\boldsymbol{\Omega}\mathbf{Xb} - \mathbf{b}'\mathbf{X}'\boldsymbol{\Omega}\mathbf{y} + \mathbf{b}'\mathbf{X}'\boldsymbol{\Omega}\mathbf{Xb}$$
Hence:
$$\tilde{\boldsymbol{\beta}} = \arg\min_{\mathbf{b}}(\mathbf{y}'\boldsymbol{\Omega}\mathbf{y} - \mathbf{y}'\boldsymbol{\Omega}\mathbf{Xb} - \mathbf{b}'\mathbf{X}'\boldsymbol{\Omega}\mathbf{y} + \mathbf{b}'\mathbf{X}'\boldsymbol{\Omega}\mathbf{Xb})$$
FOC:
$$\frac{\partial}{\partial\mathbf{b}}(\hat{\mathbf{u}}'\boldsymbol{\Omega}\hat{\mathbf{u}}) = -(\mathbf{y}'\boldsymbol{\Omega}\mathbf{X})' - \mathbf{X}'\boldsymbol{\Omega}\mathbf{y} + 2\mathbf{X}'\boldsymbol{\Omega}\mathbf{Xb} = -2\mathbf{X}'\boldsymbol{\Omega}\mathbf{y} + 2\mathbf{X}'\boldsymbol{\Omega}\mathbf{Xb} = 0$$
where we have applied $\boldsymbol{\Omega} = \boldsymbol{\Omega}'$ to combine $-\mathbf{X}'\boldsymbol{\Omega}'\mathbf{y} - \mathbf{X}'\boldsymbol{\Omega}\mathbf{y} = -2\mathbf{X}'\boldsymbol{\Omega}\mathbf{y}$.
Solve for $\mathbf{b}$:
$$\mathbf{X}'\boldsymbol{\Omega}\mathbf{Xb} = \mathbf{X}'\boldsymbol{\Omega}\mathbf{y} \;\Rightarrow\; \mathbf{b} = (\mathbf{X}'\boldsymbol{\Omega}\mathbf{X})^{-1}(\mathbf{X}'\boldsymbol{\Omega}\mathbf{y})$$
Thus:
$$\tilde{\boldsymbol{\beta}} \equiv \mathbf{b} = (\mathbf{X}'\boldsymbol{\Omega}\mathbf{X})^{-1}(\mathbf{X}'\boldsymbol{\Omega}\mathbf{y})$$
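A minimal numpy sketch of the WLS estimator (the weights below are arbitrary and chosen only for illustration), checking that the matrix and sigma-notation formulas agree:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

w = 1.0 / (1.0 + X[:, 1] ** 2)                       # some weights that depend on x_i only
Omega = np.diag(w)

# beta_tilde = (X' Omega X)^{-1} (X' Omega y)
beta_wls = np.linalg.solve(X.T @ Omega @ X, X.T @ Omega @ y)

# Equivalent "sigma notation" computation: sums of w_i x_i x_i' and w_i x_i y_i
A = sum(w[i] * np.outer(X[i], X[i]) for i in range(n))
b = sum(w[i] * X[i] * y[i] for i in range(n))
print(beta_wls, np.linalg.solve(A, b))               # identical up to floating-point error
```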
$$y_i = \mathbf{x}_i'\boldsymbol{\beta} + u_i$$
$$\frac{\partial}{\partial\mathbf{b}}(\cdot) = \mathbb{E}[w_i(-2)\mathbf{x}_i(y_i - \mathbf{x}_i'\mathbf{b})] = 0 \;\Rightarrow\; \underbrace{\mathbb{E}[w_i\mathbf{x}_iy_i]}_{k\times 1} = \underbrace{\mathbb{E}[w_i\mathbf{x}_i\mathbf{x}_i'\mathbf{b}]}_{k\times 1}$$
or equivalently:
𝔼[𝑤𝑖 𝒙𝒊 𝑢𝑖 ] = 0
which states that the regressors are uncorrelated with the weighted population errors
▪ 𝑦𝑖 = 𝒙′𝒊 𝜷 + 𝑢𝑖 ;
▪ (𝒙𝒊 )𝑘×1 ; (𝒙′𝒊 )1×𝑘 ; (𝒙𝒊 𝒙′𝒊 )𝑘×𝑘
$$\tilde{\boldsymbol{\beta}} = \left(\sum_{i=1}^{n} w_i\mathbf{x}_i\mathbf{x}_i'\right)^{-1}\left(\sum_{i=1}^{n} w_i\mathbf{x}_i(\mathbf{x}_i'\boldsymbol{\beta} + u_i)\right)
= \boldsymbol{\beta} + \left(\sum_{i=1}^{n} w_i\mathbf{x}_i\mathbf{x}_i'\right)^{-1}\sum_{i=1}^{n} w_i\mathbf{x}_iu_i$$
Hence:
$$\tilde{\boldsymbol{\beta}} - \boldsymbol{\beta} = \left(\sum_{i=1}^{n} w_i\mathbf{x}_i\mathbf{x}_i'\right)^{-1}\sum_{i=1}^{n} w_i\mathbf{x}_iu_i$$
which we can conveniently rewrite to make it look like sample means (multiplying and dividing by $n$):
$$\tilde{\boldsymbol{\beta}} - \boldsymbol{\beta} = \left(\frac{1}{n}\sum_{i=1}^{n} w_i\mathbf{x}_i\mathbf{x}_i'\right)^{-1}\left(\frac{1}{n}\sum_{i=1}^{n} w_i\mathbf{x}_iu_i\right)$$
Proving consistency requires showing that $\operatorname{plim}(\tilde{\boldsymbol{\beta}} - \boldsymbol{\beta}) = \mathbf{0}$. We want to find:
$$\operatorname{plim}(\tilde{\boldsymbol{\beta}} - \boldsymbol{\beta}) = \operatorname{plim}\left[\left(\frac{1}{n}\sum_{i=1}^{n} w_i\mathbf{x}_i\mathbf{x}_i'\right)^{-1}\left(\frac{1}{n}\sum_{i=1}^{n} w_i\mathbf{x}_iu_i\right)\right]$$
➔ By the LLN, sample means converge to population values, $\frac{1}{n}\sum_{i=1}^{n}(\bullet) \xrightarrow{p} \mathbb{E}[\bullet]$:
o $\frac{1}{n}\sum_{i=1}^{n} w_i\mathbf{x}_i\mathbf{x}_i' \xrightarrow{p} \mathbb{E}[w_i\mathbf{x}_i\mathbf{x}_i']$
o $\frac{1}{n}\sum_{i=1}^{n} w_i\mathbf{x}_iu_i \xrightarrow{p} \mathbb{E}[w_i\mathbf{x}_iu_i]$
➔ By Slutsky's Theorem (I): $W_n \xrightarrow{p} c \Rightarrow g(W_n) \xrightarrow{p} g(c)$:
o $\frac{1}{n}\sum_{i=1}^{n} w_i\mathbf{x}_i\mathbf{x}_i' \xrightarrow{p} \mathbb{E}[w_i\mathbf{x}_i\mathbf{x}_i'] \;\Rightarrow\; \left(\frac{1}{n}\sum_{i=1}^{n} w_i\mathbf{x}_i\mathbf{x}_i'\right)^{-1} \xrightarrow{p} \{\mathbb{E}[w_i\mathbf{x}_i\mathbf{x}_i']\}^{-1}$
➔ By Slutsky's Theorem (II) in its weaker version (applied to convergence in probability), if:
$$\mathbf{Z}_n \xrightarrow{p} \mathbf{Z}:\qquad \left(\frac{1}{n}\sum_{i=1}^{n} w_i\mathbf{x}_i\mathbf{x}_i'\right)^{-1} \xrightarrow{p} [\mathbb{E}(w_i\mathbf{x}_i\mathbf{x}_i')]^{-1}$$
$$\mathbf{K}_n \xrightarrow{p} \mathbf{K}:\qquad \frac{1}{n}\sum_{i=1}^{n} w_i\mathbf{x}_iu_i \xrightarrow{p} \mathbb{E}[w_i\mathbf{x}_iu_i]$$
then:
$$\mathbf{Z}_n\mathbf{K}_n \xrightarrow{p} \mathbf{ZK}:\qquad \left(\frac{1}{n}\sum_{i=1}^{n} w_i\mathbf{x}_i\mathbf{x}_i'\right)^{-1}\left(\frac{1}{n}\sum_{i=1}^{n} w_i\mathbf{x}_iu_i\right) \xrightarrow{p} [\mathbb{E}(w_i\mathbf{x}_i\mathbf{x}_i')]^{-1}\mathbb{E}[w_i\mathbf{x}_iu_i]$$
The limit is $\mathbf{0}$ (so $\tilde{\boldsymbol{\beta}}$ is consistent) provided $\mathbb{E}[w_i\mathbf{x}_iu_i] = \mathbb{E}[w_i\mathbf{x}_i\,\mathbb{E}(u_i|\mathbf{x}_i)] = \mathbf{0}$, which holds when:
✓ $w_i = w(x_i)$ is a function of $x_i$ only (so that we can take it out of the expectation conditional on $\mathbf{X}$).
✓ $\mathbb{E}(u_i|\mathbf{X}) = 0$.
In general, $\tilde{\boldsymbol{\beta}}$ is not consistent when the CEF is not linear: $\mathbb{E}(y_i|x_i) \ne x_i'\beta$.
➔ $Var[w_i\mathbf{x}_iu_i] = \mathbb{E}[(w_i\mathbf{x}_iu_i)(w_i\mathbf{x}_iu_i)'] - \mathbb{E}[w_i\mathbf{x}_iu_i]\,\mathbb{E}[w_i\mathbf{x}_iu_i]' = \mathbb{E}[w_i^2u_i^2\mathbf{x}_i\mathbf{x}_i']$, a $k\times k$ matrix.
➔ $Cov[w_i\mathbf{x}_iu_i, w_j\mathbf{x}_ju_j] = \mathbb{E}[w_i\mathbf{x}_iu_i\,w_j\mathbf{x}_ju_j] - \mathbb{E}[w_i\mathbf{x}_iu_i]\,\mathbb{E}[w_j\mathbf{x}_ju_j] = 0$
o $Cov[w_i\mathbf{x}_iu_i, w_j\mathbf{x}_ju_j] = 0$ follows from the independence of observations, which implies that $\mathbb{E}[w_i\mathbf{x}_iu_i\,w_j\mathbf{x}_ju_j] = \mathbb{E}[w_i\mathbf{x}_iu_i]\,\mathbb{E}[w_j\mathbf{x}_ju_j] = 0$.
$$\frac{1}{\sqrt{n}}\sum_{i=1}^{n} w_i\mathbf{x}_iu_i \xrightarrow{d} \mathcal{N}(0, \mathbf{V})$$
We want to apply Cramér's Theorem to approximate the asymptotic distribution of this object. Hence, we need to decompose the expression above into two matrices: one that converges in probability to a known object and one whose asymptotic distribution we know:
✓ $\left(\frac{1}{n}\sum_{i=1}^{n} w_i\mathbf{x}_i\mathbf{x}_i'\right)^{-1}$ converges in probability to a known object, namely $[\mathbb{E}(w_i\mathbf{x}_i\mathbf{x}_i')]^{-1}$ (we already showed this using Slutsky's Theorem in the consistency discussion).
✓ $\frac{1}{n}\sum_{i=1}^{n} w_i\mathbf{x}_iu_i$ is an object whose asymptotic distribution we know; in fact, we calculated it just above. It will be convenient to use the final expression we arrived at: $\frac{1}{\sqrt{n}}\sum_{i=1}^{n} w_i\mathbf{x}_iu_i \xrightarrow{d} \mathcal{N}(0, \mathbf{V})$.
Thus, we will scale our estimation error to apply the theorem in a cleaner way:
$$\sqrt{n}(\tilde{\boldsymbol{\beta}} - \boldsymbol{\beta}) = \underbrace{\left(\frac{1}{n}\sum_{i=1}^{n} w_i\mathbf{x}_i\mathbf{x}_i'\right)^{-1}}_{\xrightarrow{p}\,[\mathbb{E}(w_i\mathbf{x}_i\mathbf{x}_i')]^{-1}}\underbrace{\left(\frac{1}{\sqrt{n}}\sum_{i=1}^{n} w_i\mathbf{x}_iu_i\right)}_{\xrightarrow{d}\,\mathcal{N}(0,\mathbf{V})}$$
Using the properties of distributional operators, the expression above is equivalent to:
$$\left(\frac{1}{n}\sum_{i=1}^{n} w_i\mathbf{x}_i\mathbf{x}_i'\right)^{-1}\left(\frac{1}{\sqrt{n}}\sum_{i=1}^{n} w_i\mathbf{x}_iu_i\right) \xrightarrow{d} \mathcal{N}\big(0,\,[\mathbb{E}(w_i\mathbf{x}_i\mathbf{x}_i')]^{-1}\mathbf{V}[\mathbb{E}(w_i\mathbf{x}_i\mathbf{x}_i')]^{-1}\big) = \mathcal{N}(0, \mathbf{W})$$
where the $k\times k$ sandwich matrix is:
$$\mathbf{W} = [\mathbb{E}(w_i\mathbf{x}_i\mathbf{x}_i')]^{-1}\,\mathbb{E}[u_i^2w_i^2\mathbf{x}_i\mathbf{x}_i']\,[\mathbb{E}(w_i\mathbf{x}_i\mathbf{x}_i')]^{-1}$$
Since $\tilde{\boldsymbol{\beta}} - \boldsymbol{\beta}$ is a $k\times 1$ vector, the variance of this object has to be a square $k\times k$ matrix.
Hence:
$$\sqrt{n}(\tilde{\boldsymbol{\beta}} - \boldsymbol{\beta}) \xrightarrow{d} \mathcal{N}(0, \mathbf{W})$$
When the weights $w_i$ are chosen to be proportional to the reciprocal of $\sigma_i^2 = \mathbb{E}(u_i^2|\mathbf{x}_i)$, the asymptotic variance becomes:
$$\mathbf{W} = \sigma_i^2[\mathbb{E}(\mathbf{x}_i\mathbf{x}_i')]^{-1}$$
Proof:
Take each part of the sandwich $\mathbf{W}$, substitute $w_i = \frac{1}{\sigma_i^2}$, and take iterated expectations:
• The bread
$$[\mathbb{E}(w_i\mathbf{x}_i\mathbf{x}_i')]^{-1} = \left[\mathbb{E}\!\left(\frac{1}{\sigma_i^2}\mathbf{x}_i\mathbf{x}_i'\right)\right]^{-1} = \left[\frac{1}{\sigma_i^2}\mathbb{E}(\mathbf{x}_i\mathbf{x}_i')\right]^{-1} = \sigma_i^2[\mathbb{E}(\mathbf{x}_i\mathbf{x}_i')]^{-1}$$
• The ham
$$\mathbb{E}[u_i^2w_i^2\mathbf{x}_i\mathbf{x}_i'] = \mathbb{E}\!\left[u_i^2\left(\frac{1}{\sigma_i^2}\right)^{2}\mathbf{x}_i\mathbf{x}_i'\right] = \left(\frac{1}{\sigma_i^2}\right)^{2}\mathbb{E}\!\left[\mathbf{x}_i\mathbf{x}_i'\,\underbrace{\mathbb{E}(u_i^2|\mathbf{X})}_{\sigma_i^2}\right] = \frac{1}{\sigma_i^2}\mathbb{E}[\mathbf{x}_i\mathbf{x}_i']$$
Putting the sandwich together:
$$\mathbf{W} = \underbrace{\sigma_i^2[\mathbb{E}(\mathbf{x}_i\mathbf{x}_i')]^{-1}}_{\text{bread}}\;\underbrace{\frac{1}{\sigma_i^2}\mathbb{E}[\mathbf{x}_i\mathbf{x}_i']}_{\text{ham}}\;\underbrace{\sigma_i^2[\mathbb{E}(\mathbf{x}_i\mathbf{x}_i')]^{-1}}_{\text{bread}} = \sigma_i^2[\mathbb{E}(\mathbf{x}_i\mathbf{x}_i')]^{-1}$$
1.4.3) Proving that the variance is smaller when we use the optimal choice of weights (Ask Rob)
1.4.4) Generalized Least Squares (GLS) aka "WLS with optimal weights"
The GLS estimator is simply the WLS estimator using the optimal weights:
$$w_i = \frac{1}{\mathbb{E}(u_i^2|\mathbf{x}_i)} = \frac{1}{\sigma_i^2}$$
Applying this weight to the WLS estimator we get the GLS estimator:
$$\tilde{\boldsymbol{\beta}}_{GLS} = \left(\sum_{i=1}^{n}\frac{\mathbf{x}_i\mathbf{x}_i'}{\sigma_i^2}\right)^{-1}\left(\sum_{i=1}^{n}\frac{\mathbf{x}_iy_i}{\sigma_i^2}\right)$$
This estimator is asymptotically efficient in the sense of having the smallest asymptotic variance among the class of consistent WLS estimators.
In matrix notation:
$$\tilde{\boldsymbol{\beta}}_{GLS} \equiv \mathbf{b} = (\mathbf{X}'\boldsymbol{\Omega}\mathbf{X})^{-1}(\mathbf{X}'\boldsymbol{\Omega}\mathbf{y}) \qquad\text{with}\qquad \boldsymbol{\Omega} = \operatorname{diag}\!\left(\frac{1}{\sigma_1^2}, \ldots, \frac{1}{\sigma_N^2}\right)$$
1.4.5) Asymptotic normality of GLS
$$\sqrt{n}(\tilde{\boldsymbol{\beta}}_{GLS} - \boldsymbol{\beta}) \xrightarrow{d} \mathcal{N}\big(\mathbf{0},\, \sigma_i^2[\mathbb{E}(\mathbf{x}_i\mathbf{x}_i')]^{-1}\big)$$
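A Monte Carlo sketch of the efficiency claim (the skedastic function σ_i is treated as known here, purely for illustration): the sampling variance of GLS comes out smaller than that of OLS.

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps = 200, 500
beta = np.array([1.0, 2.0])

ols_draws, gls_draws = [], []
for _ in range(reps):
    x = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    sigma_i = 0.3 + np.abs(x)                        # assumed-known conditional std dev
    y = X @ beta + sigma_i * rng.normal(size=n)

    ols_draws.append(np.linalg.solve(X.T @ X, X.T @ y))

    w = 1.0 / sigma_i**2                             # optimal weights w_i = 1/sigma_i^2
    Xw, yw = X * w[:, None], y * w
    gls_draws.append(np.linalg.solve(X.T @ Xw, X.T @ yw))

print(np.var(np.array(ols_draws), axis=0))           # sampling variance of OLS coefficients
print(np.var(np.array(gls_draws), axis=0))           # smaller for GLS
```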
• 𝐻 groups
• 𝑀ℎ observations in each group, ℎ = 1, … , 𝐻
𝑛 = 𝑀1 + ⋯ + 𝑀𝐻
Note that we don’t necessarily have the same number of observations in each group, that’s precisely what
incorporating the subindex ℎ in 𝑀ℎ allows us to consider:
𝑀1 ≠ 𝑀2 ≠ ⋯ ≠ 𝑀ℎ
$$\underbrace{\mathbf{y}_h}_{M_h\times 1} = \underbrace{\mathbf{X}_h}_{M_h\times k}\underbrace{\boldsymbol{\beta}}_{k\times 1} + \underbrace{\mathbf{u}_h}_{M_h\times 1}$$
$$\mathbf{y}_h = \begin{pmatrix}y_{h1}\\ \vdots\\ y_{hM_h}\end{pmatrix}\qquad
\mathbf{X}_h = \begin{pmatrix}1 & x_{h1}^2 & \cdots & x_{h1}^k\\ \vdots & \vdots & \ddots & \vdots\\ 1 & x_{hM_h}^2 & \cdots & x_{hM_h}^k\end{pmatrix}\qquad
\boldsymbol{\beta} = \begin{pmatrix}\beta_1\\ \beta_2\\ \vdots\\ \beta_k\end{pmatrix}\qquad
\mathbf{u}_h = \begin{pmatrix}u_{h1}\\ \vdots\\ u_{hM_h}\end{pmatrix}
\qquad\forall h = 1, \ldots, H$$
1.5.3) Expression for all the data grouped (not really useful)
Compact representation (we necessarily assume here that 𝑀1 = ⋯ = 𝑀ℎ = ⋯ 𝑀𝐻 )
We don’t estimate the model like this. To do this, we need to aggregate all the data (see in the next sections)
$$\underbrace{\mathbf{Y}}_{H\times M_h} = \underbrace{\mathbf{A}}_{H\times M_h} + \sum_{l=1}^{k}\Big[\underbrace{\mathbf{X}_l}_{H\times M_h}\odot\,\beta_l\Big] + \underbrace{\mathbf{U}}_{H\times M_h}$$
where we use double notation: $(y_{hm}, \mathbf{x}_{hm})$ for $h = 1, \ldots, H$ (group index) and $m = 1, \ldots, M_h$ (within-group index).
$$\underbrace{\mathbf{y}}_{n\times 1} = \underbrace{\mathbf{X}}_{n\times k}\underbrace{\boldsymbol{\beta}}_{k\times 1} + \underbrace{\mathbf{u}}_{n\times 1}$$
$$\mathbf{y} = \begin{pmatrix}y_{11}\\ \vdots\\ y_{1M_1}\\ y_{21}\\ \vdots\\ y_{2M_2}\\ \vdots\\ y_{H1}\\ \vdots\\ y_{HM_H}\end{pmatrix}\qquad
\mathbf{X} = \begin{pmatrix}1 & x_{11}^2 & \cdots & x_{11}^k\\ \vdots & \vdots & & \vdots\\ 1 & x_{1M_1}^2 & \cdots & x_{1M_1}^k\\ 1 & x_{21}^2 & \cdots & x_{21}^k\\ \vdots & \vdots & & \vdots\\ 1 & x_{HM_H}^2 & \cdots & x_{HM_H}^k\end{pmatrix}\qquad
\boldsymbol{\beta} = \begin{pmatrix}\beta_1\\ \beta_2\\ \vdots\\ \beta_k\end{pmatrix}\qquad
\mathbf{u} = \begin{pmatrix}u_{11}\\ \vdots\\ u_{1M_1}\\ u_{21}\\ \vdots\\ u_{HM_H}\end{pmatrix}$$
where
• $n = \sum_{h=1}^{H} M_h = M_1 + \cdots + M_H$
Matrix notation (all the data, ignoring groups):
$$\hat{\boldsymbol{\beta}} = \underbrace{(\underbrace{\mathbf{X}'}_{k\times n}\underbrace{\mathbf{X}}_{n\times k})^{-1}}_{k\times k}\underbrace{\mathbf{X}'\mathbf{y}}_{k\times 1}$$
Sigma notation by clusters:
$$\hat{\boldsymbol{\beta}} = \left(\sum_{h=1}^{H}\underbrace{\mathbf{X}_h'}_{k\times M_h}\underbrace{\mathbf{X}_h}_{M_h\times k}\right)^{-1}\sum_{h=1}^{H}\mathbf{X}_h'\mathbf{y}_h$$
Sigma notation by individual observations:
$$\hat{\boldsymbol{\beta}} = \left(\sum_{h=1}^{H}\sum_{m=1}^{M_h}\underbrace{\mathbf{x}_{hm}}_{k\times 1}\underbrace{\mathbf{x}_{hm}'}_{1\times k}\right)^{-1}\sum_{h=1}^{H}\sum_{m=1}^{M_h}\mathbf{x}_{hm}\,y_{hm}$$
The three are equivalent because any cross-product over the full data can be accumulated cluster by cluster (or observation by observation):
$$\bullet✠ = \sum_{h=1}^{H}\bullet_h✠_h = \sum_{h=1}^{H}\sum_{m=1}^{M_h}\bullet_{hm}✠_{hm}$$
For example:
$$\mathbf{X}'\mathbf{X} = \sum_{h=1}^{H}\mathbf{X}_h'\mathbf{X}_h = \sum_{h=1}^{H}\sum_{m=1}^{M_h}\mathbf{x}_{hm}\mathbf{x}_{hm}'
\qquad\qquad
\mathbf{X}'\mathbf{y} = \sum_{h=1}^{H}\mathbf{X}_h'\mathbf{y}_h = \sum_{h=1}^{H}\sum_{m=1}^{M_h}\mathbf{x}_{hm}y_{hm}$$
So we can rewrite:
$$\hat{\boldsymbol{\beta}} - \boldsymbol{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\sum_{h=1}^{H}\mathbf{X}_h'\mathbf{u}_h$$
After some rescaling (using $\sqrt{H} = \frac{H}{\sqrt{H}} = H\cdot\frac{1}{\sqrt{H}}$):
$$\sqrt{H}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) = H(\mathbf{X}'\mathbf{X})^{-1}\frac{1}{\sqrt{H}}\sum_{h=1}^{H}\mathbf{X}_h'\mathbf{u}_h$$
Hence:
$$\sqrt{H}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) = \left(\frac{\mathbf{X}'\mathbf{X}}{H}\right)^{-1}\frac{1}{\sqrt{H}}\sum_{h=1}^{H}\mathbf{X}_h'\mathbf{u}_h$$
1.5.7) Consistency
$$\hat{\boldsymbol{\beta}} - \boldsymbol{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\sum_{h=1}^{H}\mathbf{X}_h'\mathbf{u}_h$$
We can conveniently adjust it in order to have sample means. Namely, we can apply $\bullet✠ = \sum_{h=1}^{H}\bullet_h✠_h$:
$$\mathbf{X}'\mathbf{X} = \sum_{h=1}^{H}\underbrace{\mathbf{X}_h'}_{k\times M_h}\underbrace{\mathbf{X}_h}_{M_h\times k}$$
so that:
$$\hat{\boldsymbol{\beta}} - \boldsymbol{\beta} = \left(\sum_{h=1}^{H}\mathbf{X}_h'\mathbf{X}_h\right)^{-1}\sum_{h=1}^{H}\mathbf{X}_h'\mathbf{u}_h$$
Proving consistency requires showing that $\operatorname{plim}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) = \mathbf{0}$. We want to find:
$$\operatorname{plim}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) = \operatorname{plim}\left[\left(\frac{1}{H}\sum_{h=1}^{H}\mathbf{X}_h'\mathbf{X}_h\right)^{-1}\frac{1}{H}\sum_{h=1}^{H}\mathbf{X}_h'\mathbf{u}_h\right]$$
➔ By the LLN, sample means converge to population values, $\frac{1}{H}\sum_{h=1}^{H}(\bullet) \xrightarrow{p} \mathbb{E}[\bullet]$:
o $\frac{1}{H}\sum_{h=1}^{H}\mathbf{X}_h'\mathbf{X}_h \xrightarrow{p} \mathbb{E}[\mathbf{X}_h'\mathbf{X}_h]$
o $\frac{1}{H}\sum_{h=1}^{H}\mathbf{X}_h'\mathbf{u}_h \xrightarrow{p} \mathbb{E}[\mathbf{X}_h'\mathbf{u}_h]$
➔ By Slutsky's Theorem (I): $W_n \xrightarrow{p} c \Rightarrow g(W_n) \xrightarrow{p} g(c)$:
o $\frac{1}{H}\sum_{h=1}^{H}\mathbf{X}_h'\mathbf{X}_h \xrightarrow{p} \mathbb{E}[\mathbf{X}_h'\mathbf{X}_h] \;\Rightarrow\; \left(\frac{1}{H}\sum_{h=1}^{H}\mathbf{X}_h'\mathbf{X}_h\right)^{-1} \xrightarrow{p} \{\mathbb{E}[\mathbf{X}_h'\mathbf{X}_h]\}^{-1}$
➔ By Slutsky's Theorem (II) in its weaker version (applied to convergence in probability), if:
$$\mathbf{Z}_n \xrightarrow{p} \mathbf{Z}:\qquad \left(\frac{1}{H}\sum_{h=1}^{H}\mathbf{X}_h'\mathbf{X}_h\right)^{-1} \xrightarrow{p} [\mathbb{E}(\mathbf{X}_h'\mathbf{X}_h)]^{-1}$$
$$\mathbf{K}_n \xrightarrow{p} \mathbf{K}:\qquad \frac{1}{H}\sum_{h=1}^{H}\mathbf{X}_h'\mathbf{u}_h \xrightarrow{p} \mathbb{E}[\mathbf{X}_h'\mathbf{u}_h]$$
then:
$$\mathbf{Z}_n\mathbf{K}_n \xrightarrow{p} \mathbf{ZK}:\qquad \left(\frac{1}{H}\sum_{h=1}^{H}\mathbf{X}_h'\mathbf{X}_h\right)^{-1}\left(\frac{1}{H}\sum_{h=1}^{H}\mathbf{X}_h'\mathbf{u}_h\right) \xrightarrow{p} [\mathbb{E}(\mathbf{X}_h'\mathbf{X}_h)]^{-1}\mathbb{E}[\mathbf{X}_h'\mathbf{u}_h]$$
The second term is zero:
$$\mathbb{E}\left[\frac{1}{H}\sum_{h=1}^{H}\mathbf{X}_h'\mathbf{u}_h\right] = \frac{1}{H}\sum_{h=1}^{H}\mathbb{E}[\mathbf{X}_h'\mathbf{u}_h] = \frac{1}{H}\sum_{h=1}^{H}\mathbb{E}[\mathbb{E}(\mathbf{X}_h'\mathbf{u}_h|\mathbf{X})] = \frac{1}{H}\sum_{h=1}^{H}\mathbb{E}[\mathbf{X}_h'\underbrace{\mathbb{E}(\mathbf{u}_h|\mathbf{X})}_{\mathbf{0}}] = \mathbf{0}$$
Hence $\hat{\boldsymbol{\beta}}$ is consistent.
$$Var\left[\frac{1}{H}\sum_{h=1}^{H}\mathbf{X}_h'\mathbf{u}_h\right] = \frac{1}{H^2}\big\{H\cdot\mathbb{E}[\mathbf{X}_h'\mathbf{u}_h\mathbf{u}_h'\mathbf{X}_h]\big\} = \frac{1}{H}\mathbb{E}[\mathbf{X}_h'\mathbf{u}_h\mathbf{u}_h'\mathbf{X}_h]$$
where we have used that $\mathbb{E}[\mathbf{X}_h'\mathbf{u}_h] = \mathbf{0}$ (from above) and that observations are independent across clusters.
Let $\underbrace{\mathbf{V}}_{k\times k} \equiv \mathbb{E}\big[\underbrace{\mathbf{X}_h'}_{k\times M_h}\underbrace{\mathbf{u}_h}_{M_h\times 1}\underbrace{\mathbf{u}_h'}_{1\times M_h}\underbrace{\mathbf{X}_h}_{M_h\times k}\big]$. Then, by the CLT:
$$\frac{1}{\sqrt{H}}\sum_{h=1}^{H}\mathbf{X}_h'\mathbf{u}_h \xrightarrow{d} \mathcal{N}\big(0,\,\underbrace{\mathbb{E}[\mathbf{X}_h'\mathbf{u}_h\mathbf{u}_h'\mathbf{X}_h]}_{\mathbf{V}}\big)$$
We can conveniently adjust it in order to have sample means. Namely, we can apply $\bullet✠ = \sum_{h=1}^{H}\bullet_h✠_h$:
$$\mathbf{X}'\mathbf{X} = \sum_{h=1}^{H}\mathbf{X}_h'\mathbf{X}_h \qquad\text{so that}\qquad \left(\frac{\mathbf{X}'\mathbf{X}}{H}\right)^{-1} = \left(\frac{1}{H}\sum_{h=1}^{H}\mathbf{X}_h'\mathbf{X}_h\right)^{-1}$$
We want to apply Cramér's Theorem to approximate the asymptotic distribution of this object. Hence, we need to decompose the expression into two matrices: one that converges in probability to a known object and one whose asymptotic distribution we know:
✓ $\left(\frac{1}{H}\sum_{h=1}^{H}\mathbf{X}_h'\mathbf{X}_h\right)^{-1}$ converges in probability to a known object, namely $[\mathbb{E}(\mathbf{X}_h'\mathbf{X}_h)]^{-1}$.
✓ $\frac{1}{\sqrt{H}}\sum_{h=1}^{H}\mathbf{X}_h'\mathbf{u}_h$ is an object whose asymptotic distribution we know; in fact, we calculated it just above: $\frac{1}{\sqrt{H}}\sum_{h=1}^{H}\mathbf{X}_h'\mathbf{u}_h \xrightarrow{d} \mathcal{N}(0, \mathbf{V})$.
Thus:
$$\sqrt{H}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) = \underbrace{\left(\frac{1}{H}\sum_{h=1}^{H}\mathbf{X}_h'\mathbf{X}_h\right)^{-1}}_{\xrightarrow{p}\,[\mathbb{E}(\mathbf{X}_h'\mathbf{X}_h)]^{-1}}\underbrace{\left(\frac{1}{\sqrt{H}}\sum_{h=1}^{H}\mathbf{X}_h'\mathbf{u}_h\right)}_{\xrightarrow{d}\,\mathcal{N}(0,\mathbf{V})}$$
Using the properties of distributional operators, the expression above is equivalent to:
$$\left(\frac{1}{H}\sum_{h=1}^{H}\mathbf{X}_h'\mathbf{X}_h\right)^{-1}\left(\frac{1}{\sqrt{H}}\sum_{h=1}^{H}\mathbf{X}_h'\mathbf{u}_h\right) \xrightarrow{d} \mathcal{N}\big(0,\,[\mathbb{E}(\mathbf{X}_h'\mathbf{X}_h)]^{-1}\mathbf{V}[\mathbb{E}(\mathbf{X}_h'\mathbf{X}_h)]^{-1}\big) = \mathcal{N}(0, \mathbf{W})$$
where the $k\times k$ sandwich matrix is:
$$\mathbf{W} = [\mathbb{E}(\mathbf{X}_h'\mathbf{X}_h)]^{-1}\,\mathbb{E}[\mathbf{X}_h'\mathbf{u}_h\mathbf{u}_h'\mathbf{X}_h]\,[\mathbb{E}(\mathbf{X}_h'\mathbf{X}_h)]^{-1}$$
Since $\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}$ is a $k\times 1$ vector, the variance of this object has to be a square $k\times k$ matrix.
Hence:
$$\sqrt{H}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) \xrightarrow{d} \mathcal{N}(0, \mathbf{W})$$
A consistent estimator can be proposed by simply replacing population objects with sample objects:
Instead of expectations $\mathbb{E}(\cdot)$, use a sample mean $\frac{1}{H}\sum_{h=1}^{H}(\cdot)$
Instead of the population errors $\mathbf{u}_h$, use the sample residuals $\hat{\mathbf{u}}_h$
$$\hat{\mathbf{W}} = \left[\frac{1}{H}\sum_{h=1}^{H}\mathbf{X}_h'\mathbf{X}_h\right]^{-1}\left[\frac{1}{H}\sum_{h=1}^{H}\mathbf{X}_h'\hat{\mathbf{u}}_h\hat{\mathbf{u}}_h'\mathbf{X}_h\right]\left[\frac{1}{H}\sum_{h=1}^{H}\mathbf{X}_h'\mathbf{X}_h\right]^{-1}$$
Recall that $\sum_{h=1}^{H}\mathbf{X}_h'\mathbf{X}_h = \mathbf{X}'\mathbf{X}$. Thus:
$$\hat{\mathbf{W}} = \left(\frac{\mathbf{X}'\mathbf{X}}{H}\right)^{-1}\left[\frac{1}{H}\sum_{h=1}^{H}\mathbf{X}_h'\hat{\mathbf{u}}_h\hat{\mathbf{u}}_h'\mathbf{X}_h\right]\left(\frac{\mathbf{X}'\mathbf{X}}{H}\right)^{-1}$$
Recall that $\hat{\mathbf{W}}$ gives us an estimate of the asymptotic variance of the rescaled estimation error. To find an estimator for the asymptotic variance of $\hat{\boldsymbol{\beta}}$ itself, we do the following:
$$\sqrt{H}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) \xrightarrow{d} \mathcal{N}(0, \hat{\mathbf{W}})
\;\Rightarrow\; \hat{\boldsymbol{\beta}} - \boldsymbol{\beta} \xrightarrow{d} \frac{1}{\sqrt{H}}\mathcal{N}(0, \hat{\mathbf{W}})
\;\Rightarrow\; \hat{\boldsymbol{\beta}} - \boldsymbol{\beta} \xrightarrow{d} \mathcal{N}\!\left(0, \frac{\hat{\mathbf{W}}}{H}\right)
\;\Rightarrow\; \hat{\boldsymbol{\beta}} \xrightarrow{d} \mathcal{N}\!\left(\boldsymbol{\beta}, \frac{\hat{\mathbf{W}}}{H}\right)$$
Hence:
$$\widehat{Var}[\hat{\boldsymbol{\beta}}] = \frac{\hat{\mathbf{W}}}{H} = \frac{1}{H}\left(\frac{\mathbf{X}'\mathbf{X}}{H}\right)^{-1}\left[\frac{1}{H}\sum_{h=1}^{H}\mathbf{X}_h'\hat{\mathbf{u}}_h\hat{\mathbf{u}}_h'\mathbf{X}_h\right]\left(\frac{\mathbf{X}'\mathbf{X}}{H}\right)^{-1}$$
which is a $k\times k$ matrix. Simplifying:
$$\widehat{Var}[\hat{\boldsymbol{\beta}}] = (\mathbf{X}'\mathbf{X})^{-1}\left[\sum_{h=1}^{H}\underbrace{\mathbf{X}_h'}_{k\times M_h}\underbrace{\hat{\mathbf{u}}_h}_{M_h\times 1}\underbrace{\hat{\mathbf{u}}_h'}_{1\times M_h}\underbrace{\mathbf{X}_h}_{M_h\times k}\right](\mathbf{X}'\mathbf{X})^{-1}$$
Once we have an estimator for the asymptotic variance, it's straightforward to obtain standard errors. "Standard errors" are simply the name we give to the estimator of the standard deviation, which is the square root of the estimator of the variance. So the fancy name "cluster robust standard errors" boils down to computing the square root of the estimated variance-covariance matrix for $\hat{\boldsymbol{\beta}}$, which is $\frac{\hat{\mathbf{W}}}{H}$:
$$\widehat{SE}[\hat{\boldsymbol{\beta}}] = \sqrt{\frac{\hat{\mathbf{W}}}{H}}$$
If we want the standard error associated with the $j$-th coefficient, simply take the square root of the $j$-th diagonal element of this matrix:
$$\widehat{SE}[\hat{\beta}_j] = \sqrt{\frac{\hat{W}_{jj}}{H}}$$
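A minimal numpy sketch (simulated clustered data, purely illustrative) of the cluster robust variance $(\mathbf{X}'\mathbf{X})^{-1}\big[\sum_h \mathbf{X}_h'\hat{\mathbf{u}}_h\hat{\mathbf{u}}_h'\mathbf{X}_h\big](\mathbf{X}'\mathbf{X})^{-1}$ and the resulting standard errors:

```python
import numpy as np

rng = np.random.default_rng(7)
H, M = 60, 25                                        # H clusters, M observations each (equal sizes for simplicity)
n, k = H * M, 2
g = np.repeat(np.arange(H), M)                       # cluster id for every observation

x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
cluster_shock = rng.normal(size=H)[g]                # common shock within each cluster
y = X @ np.array([1.0, 2.0]) + cluster_shock + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ beta_hat

# Meat: sum over clusters of X_h' u_h u_h' X_h
meat = np.zeros((k, k))
for h in range(H):
    Xh, uh = X[g == h], u_hat[g == h]
    s = Xh.T @ uh                                    # k-vector X_h' u_h
    meat += np.outer(s, s)

XtX_inv = np.linalg.inv(X.T @ X)
var_cluster = XtX_inv @ meat @ XtX_inv               # estimated Var(beta_hat)
print(np.sqrt(np.diag(var_cluster)))                 # cluster robust standard errors
```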
$$\underbrace{\mathbf{y}_h}_{M_h\times 1} = \underbrace{\boldsymbol{\alpha}_h}_{M_h\times 1} + \underbrace{\mathbf{X}_h}_{M_h\times k}\underbrace{\boldsymbol{\beta}}_{k\times 1} + \underbrace{\mathbf{u}_h}_{M_h\times 1}$$
$$\mathbf{y}_h = \begin{pmatrix}y_{h1}\\ \vdots\\ y_{hM_h}\end{pmatrix}\qquad
\boldsymbol{\alpha}_h = \begin{pmatrix}\alpha_h\\ \vdots\\ \alpha_h\end{pmatrix}\qquad
\mathbf{X}_h = \begin{pmatrix}x_{h1}^1 & x_{h1}^2 & \cdots & x_{h1}^k\\ \vdots & \vdots & \ddots & \vdots\\ x_{hM_h}^1 & x_{hM_h}^2 & \cdots & x_{hM_h}^k\end{pmatrix}\qquad
\boldsymbol{\beta} = \begin{pmatrix}\beta_1\\ \vdots\\ \beta_k\end{pmatrix}\qquad
\mathbf{u}_h = \begin{pmatrix}u_{h1}\\ \vdots\\ u_{hM_h}\end{pmatrix}
\qquad\forall h = 1, \ldots, H$$
1.6.3) Expression for all the data grouped (not really useful)
Compact representation (we necessarily assume here that 𝑀1 = ⋯ = 𝑀ℎ = ⋯ 𝑀𝐻 )
We don’t estimate the model like this. To do this, we need to aggregate all the data (see in the next sections)
$$\underbrace{\mathbf{Y}}_{H\times M_h} = \underbrace{\mathbf{A}}_{H\times M_h} + \sum_{l=1}^{k}\Big[\underbrace{\mathbf{X}_l}_{H\times M_h}\odot\,\beta_l\Big] + \underbrace{\mathbf{U}}_{H\times M_h}$$
where we use double notation: $(y_{hm}, \mathbf{x}_{hm})$ for $h = 1, \ldots, H$ (group index) and $m = 1, \ldots, M_h$ (within-group index).
$$\underbrace{\mathbf{y}}_{n\times 1} = \underbrace{\mathbf{D}}_{n\times H}\underbrace{\boldsymbol{\alpha}}_{H\times 1} + \underbrace{\mathbf{X}}_{n\times k}\underbrace{\boldsymbol{\beta}}_{k\times 1} + \underbrace{\mathbf{u}}_{n\times 1}$$
$$\mathbf{D} = \begin{pmatrix}\mathbf{1}_{M_1} & \mathbf{0}_{M_1} & \cdots & \mathbf{0}_{M_1}\\ \mathbf{0}_{M_2} & \mathbf{1}_{M_2} & \cdots & \mathbf{0}_{M_2}\\ \vdots & & \ddots & \vdots\\ \mathbf{0}_{M_H} & \mathbf{0}_{M_H} & \cdots & \mathbf{1}_{M_H}\end{pmatrix}_{n\times H}
\qquad\qquad
\boldsymbol{\alpha} = \begin{pmatrix}\alpha_1\\ \alpha_2\\ \vdots\\ \alpha_H\end{pmatrix}_{H\times 1}$$
where
• $n = \sum_{h=1}^{H} M_h = M_1 + \cdots + M_H$
• each column of $\mathbf{D}$ is a group dummy: a block of ones for the observations in that group and zeros elsewhere.
$$(\hat{\boldsymbol{\alpha}}, \hat{\boldsymbol{\beta}}) = \arg\min_{(\boldsymbol{\alpha},\boldsymbol{\beta})}\{\underbrace{\mathbf{u}'}_{1\times n}\underbrace{\mathbf{u}}_{n\times 1}\} \qquad\text{where } \mathbf{u} = \mathbf{y} - \mathbf{D}\boldsymbol{\alpha} - \mathbf{X}\boldsymbol{\beta}$$
$$(\hat{\boldsymbol{\alpha}}, \hat{\boldsymbol{\beta}}) = \arg\min_{(\boldsymbol{\alpha},\boldsymbol{\beta})}\{(\mathbf{y} - \mathbf{D}\boldsymbol{\alpha} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{D}\boldsymbol{\alpha} - \mathbf{X}\boldsymbol{\beta})\}$$
Expand it.
FOC for $\boldsymbol{\alpha}$:
$$\frac{\partial\mathbf{u}'\mathbf{u}}{\partial\boldsymbol{\alpha}} = -\underbrace{\frac{\partial}{\partial\boldsymbol{\alpha}}[\mathbf{y}'\mathbf{D}\boldsymbol{\alpha}]}_{(\mathbf{y}'\mathbf{D})'} - \underbrace{\frac{\partial}{\partial\boldsymbol{\alpha}}[\boldsymbol{\alpha}'\mathbf{D}'\mathbf{y}]}_{\mathbf{D}'\mathbf{y}} + \underbrace{\frac{\partial}{\partial\boldsymbol{\alpha}}[\boldsymbol{\alpha}'\mathbf{D}'\mathbf{D}\boldsymbol{\alpha}]}_{2\mathbf{D}'\mathbf{D}\boldsymbol{\alpha}} + \underbrace{\frac{\partial}{\partial\boldsymbol{\alpha}}[\boldsymbol{\alpha}'\mathbf{D}'\mathbf{X}\boldsymbol{\beta}]}_{\mathbf{D}'\mathbf{X}\boldsymbol{\beta}} + \underbrace{\frac{\partial}{\partial\boldsymbol{\alpha}}[\boldsymbol{\beta}'\mathbf{X}'\mathbf{D}\boldsymbol{\alpha}]}_{(\boldsymbol{\beta}'\mathbf{X}'\mathbf{D})'} = 0$$
$$\Rightarrow -\mathbf{D}'\mathbf{y} - \mathbf{D}'\mathbf{y} + 2\mathbf{D}'\mathbf{D}\boldsymbol{\alpha} + \mathbf{D}'\mathbf{X}\boldsymbol{\beta} + \mathbf{D}'\mathbf{X}\boldsymbol{\beta} = \mathbf{0}
\;\Rightarrow\; -\mathbf{D}'\mathbf{y} + \mathbf{D}'\mathbf{D}\boldsymbol{\alpha} + \mathbf{D}'\mathbf{X}\boldsymbol{\beta} = \mathbf{0}$$
Hence: $\mathbf{D}'\mathbf{y} = \mathbf{D}'\mathbf{X}\boldsymbol{\beta} + \mathbf{D}'\mathbf{D}\boldsymbol{\alpha}$
FOC for $\boldsymbol{\beta}$:
$$\frac{\partial\mathbf{u}'\mathbf{u}}{\partial\boldsymbol{\beta}} = -\underbrace{\frac{\partial}{\partial\boldsymbol{\beta}}[\mathbf{y}'\mathbf{X}\boldsymbol{\beta}]}_{(\mathbf{y}'\mathbf{X})'} + \underbrace{\frac{\partial}{\partial\boldsymbol{\beta}}[\boldsymbol{\alpha}'\mathbf{D}'\mathbf{X}\boldsymbol{\beta}]}_{(\boldsymbol{\alpha}'\mathbf{D}'\mathbf{X})'} - \underbrace{\frac{\partial}{\partial\boldsymbol{\beta}}[\boldsymbol{\beta}'\mathbf{X}'\mathbf{y}]}_{\mathbf{X}'\mathbf{y}} + \underbrace{\frac{\partial}{\partial\boldsymbol{\beta}}[\boldsymbol{\beta}'\mathbf{X}'\mathbf{D}\boldsymbol{\alpha}]}_{\mathbf{X}'\mathbf{D}\boldsymbol{\alpha}} + \underbrace{\frac{\partial}{\partial\boldsymbol{\beta}}[\boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta}]}_{2\mathbf{X}'\mathbf{X}\boldsymbol{\beta}} = 0$$
$$\Rightarrow -\mathbf{X}'\mathbf{y} + \mathbf{X}'\mathbf{D}\boldsymbol{\alpha} - \mathbf{X}'\mathbf{y} + \mathbf{X}'\mathbf{D}\boldsymbol{\alpha} + 2\mathbf{X}'\mathbf{X}\boldsymbol{\beta} = \mathbf{0}
\;\Rightarrow\; -\mathbf{X}'\mathbf{y} + \mathbf{X}'\mathbf{D}\boldsymbol{\alpha} + \mathbf{X}'\mathbf{X}\boldsymbol{\beta} = \mathbf{0}$$
Hence: $\mathbf{X}'\mathbf{y} = \mathbf{X}'\mathbf{X}\boldsymbol{\beta} + \mathbf{X}'\mathbf{D}\boldsymbol{\alpha}$
We can summarize the two FOCs obtained above in compact form:
$$\begin{pmatrix}\mathbf{X}'\mathbf{X} & \mathbf{X}'\mathbf{D}\\ \mathbf{D}'\mathbf{X} & \mathbf{D}'\mathbf{D}\end{pmatrix}\begin{pmatrix}\hat{\boldsymbol{\beta}}\\ \hat{\boldsymbol{\alpha}}\end{pmatrix} = \begin{pmatrix}\mathbf{X}'\mathbf{y}\\ \mathbf{D}'\mathbf{y}\end{pmatrix}$$
Modus Operandi:
1) Solve for $\boldsymbol{\alpha}$ in the 2nd block (FOC for $\boldsymbol{\alpha}$):
$$\mathbf{D}'\mathbf{D}\boldsymbol{\alpha} = \mathbf{D}'\mathbf{y} - \mathbf{D}'\mathbf{X}\boldsymbol{\beta} \;\Rightarrow\; \boldsymbol{\alpha} = (\mathbf{D}'\mathbf{D})^{-1}\mathbf{D}'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$$
2) Substitute into the 1st block (FOC for $\boldsymbol{\beta}$): $\mathbf{X}'\mathbf{y} = \mathbf{X}'\mathbf{X}\boldsymbol{\beta} + \mathbf{X}'\mathbf{D}\boldsymbol{\alpha}$, so
$$\big(\mathbf{X}'\underbrace{[\mathbf{I} - \mathbf{D}(\mathbf{D}'\mathbf{D})^{-1}\mathbf{D}']}_{=\mathbf{Q}}\mathbf{X}\big)\boldsymbol{\beta} = \mathbf{X}'\underbrace{[\mathbf{I} - \mathbf{D}(\mathbf{D}'\mathbf{D})^{-1}\mathbf{D}']}_{=\mathbf{Q}}\mathbf{y}
\;\Rightarrow\; (\mathbf{X}'\mathbf{Q}\mathbf{X})\boldsymbol{\beta} = \mathbf{X}'\mathbf{Q}\mathbf{y}
\;\Rightarrow\; \hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{Q}\mathbf{X})^{-1}\mathbf{X}'\mathbf{Q}\mathbf{y}$$
3) Recover $\hat{\boldsymbol{\alpha}}$ using $\hat{\boldsymbol{\beta}}$:
$$\hat{\boldsymbol{\alpha}} = (\mathbf{D}'\mathbf{D})^{-1}\mathbf{D}'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) \qquad\text{or, by groups:}\qquad \hat{\alpha}_h = \bar{y}_h - \bar{\mathbf{x}}_h'\hat{\boldsymbol{\beta}}$$
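A short numpy sketch (simulated grouped data, purely illustrative) checking that the partitioned-regression estimator $(\mathbf{X}'\mathbf{Q}\mathbf{X})^{-1}\mathbf{X}'\mathbf{Q}\mathbf{y}$ matches the long regression on $[\mathbf{X}, \mathbf{D}]$, and that $\hat{\alpha}_h = \bar{y}_h - \bar{\mathbf{x}}_h'\hat{\boldsymbol{\beta}}$:

```python
import numpy as np

rng = np.random.default_rng(8)
H, M = 30, 8
n = H * M
g = np.repeat(np.arange(H), M)                       # group index h for every row

x = rng.normal(size=n) + 0.5 * g / H                 # regressor correlated with the group
alpha = rng.normal(size=H)                           # group fixed effects
y = alpha[g] + 2.0 * x + rng.normal(size=n)

X = x[:, None]                                       # k = 1 regressor (no constant: D spans it)
D = (g[:, None] == np.arange(H)[None, :]).astype(float)   # n x H dummy matrix

# Partitioned-regression estimator with Q = I - D(D'D)^{-1}D'
Q = np.eye(n) - D @ np.linalg.inv(D.T @ D) @ D.T
beta_fe = np.linalg.solve(X.T @ Q @ X, X.T @ Q @ y)

# Same answer from the long regression on [X, D]
full = np.linalg.lstsq(np.column_stack([X, D]), y, rcond=None)[0]
print(beta_fe, full[0])                              # identical up to floating-point error

# Recover the fixed effects: alpha_h = ybar_h - xbar_h * beta_hat
alpha_hat = np.array([y[g == h].mean() - x[g == h].mean() * beta_fe[0] for h in range(H)])
print(np.allclose(alpha_hat, full[1:], atol=1e-8))
```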
$$\mathbf{Q} = \mathbf{I} - \mathbf{D}(\mathbf{D}'\mathbf{D})^{-1}\mathbf{D}'
\qquad\qquad
\underbrace{\mathbf{D}}_{n\times H} = \begin{pmatrix}\mathbf{1}_{M_1} & \mathbf{0}_{M_1} & \cdots & \mathbf{0}_{M_1}\\ \mathbf{0}_{M_2} & \mathbf{1}_{M_2} & \cdots & \mathbf{0}_{M_2}\\ \vdots & & \ddots & \vdots\\ \mathbf{0}_{M_H} & \mathbf{0}_{M_H} & \cdots & \mathbf{1}_{M_H}\end{pmatrix}$$
Example with $H = 3$ groups of sizes $M_1 = 3$, $M_2 = 2$, $M_3 = 1$:
$$\mathbf{D} = \begin{pmatrix}1&0&0\\1&0&0\\1&0&0\\0&1&0\\0&1&0\\0&0&1\end{pmatrix}_{6\times 3}
\qquad
\mathbf{D}' = \begin{pmatrix}1&1&1&0&0&0\\0&0&0&1&1&0\\0&0&0&0&0&1\end{pmatrix}_{3\times 6}$$
$$\mathbf{D}'\mathbf{D} = \begin{pmatrix}3&0&0\\0&2&0\\0&0&1\end{pmatrix}
\qquad
(\mathbf{D}'\mathbf{D})^{-1} = \begin{pmatrix}1/3&0&0\\0&1/2&0\\0&0&1/1\end{pmatrix}$$
$$\tilde{\mathbf{y}} = \mathbf{Q}\mathbf{y} = [\mathbf{I} - \mathbf{D}(\mathbf{D}'\mathbf{D})^{-1}\mathbf{D}']\mathbf{y}$$
In other words, $\mathbf{D}(\mathbf{D}'\mathbf{D})^{-1}\mathbf{D}'\mathbf{y}$ replaces each observation by its group mean, so $\mathbf{Q}\mathbf{y}$ demeans $\mathbf{y}$ within groups.
• 𝔼[𝒚|𝑿, 𝑫] = 𝑿𝜷 + 𝑫𝜶
• 𝑉𝑎𝑟[𝒚|𝑿, 𝑫] = 𝜎 2 𝑰𝒏
$$Var[\hat{\boldsymbol{\beta}}|\mathbf{X},\mathbf{D}] = Var[\underbrace{(\mathbf{X}'\mathbf{Q}\mathbf{X})^{-1}\mathbf{X}'\mathbf{Q}}_{\text{constant}}\,\mathbf{y}|\mathbf{X},\mathbf{D}]
= [(\mathbf{X}'\mathbf{Q}\mathbf{X})^{-1}\mathbf{X}'\mathbf{Q}]\,\underbrace{Var[\mathbf{y}|\mathbf{X},\mathbf{D}]}_{\sigma^2\mathbf{I}_n}\,[(\mathbf{X}'\mathbf{Q}\mathbf{X})^{-1}\mathbf{X}'\mathbf{Q}]'$$
$$= \sigma^2(\mathbf{X}'\mathbf{Q}\mathbf{X})^{-1}\mathbf{X}'\underbrace{\mathbf{Q}\mathbf{Q}'}_{\mathbf{Q}}\mathbf{X}(\mathbf{X}'\mathbf{Q}\mathbf{X})^{-1} = \sigma^2(\mathbf{X}'\mathbf{Q}\mathbf{X})^{-1}$$
$$Var[\hat{\alpha}_h|\mathbf{X},\mathbf{D}] = Var[\bar{y}_h - \bar{\mathbf{x}}_h'\hat{\boldsymbol{\beta}}|\mathbf{X},\mathbf{D}]
= Var\left[\frac{1}{M_h}\sum_{m=1}^{M_h}y_{hm}\Big|\mathbf{X},\mathbf{D}\right] + Var[\bar{\mathbf{x}}_h'\hat{\boldsymbol{\beta}}|\mathbf{X},\mathbf{D}] - Cov\big(\bar{y}_h, \underbrace{\bar{\mathbf{x}}_h'}_{\text{constant}}\hat{\boldsymbol{\beta}}\,\big|\,\mathbf{X},\mathbf{D}\big)$$
$$= Var\left[\frac{1}{M_h}\sum_{m=1}^{M_h}y_{hm}\Big|\mathbf{X},\mathbf{D}\right] + Var[\bar{\mathbf{x}}_h'\hat{\boldsymbol{\beta}}|\mathbf{X},\mathbf{D}]
= \frac{1}{M_h^2}\sum_{m=1}^{M_h}\underbrace{Var[y_{hm}|\mathbf{X},\mathbf{D}]}_{\sigma^2} + \bar{\mathbf{x}}_h'\underbrace{Var[\hat{\boldsymbol{\beta}}|\mathbf{X},\mathbf{D}]}_{\sigma^2(\mathbf{X}'\mathbf{Q}\mathbf{X})^{-1}}\bar{\mathbf{x}}_h$$
$$= \frac{\sigma^2}{M_h} + \sigma^2\,\bar{\mathbf{x}}_h'(\mathbf{X}'\mathbf{Q}\mathbf{X})^{-1}\bar{\mathbf{x}}_h$$
2) Appendix
• https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf
$$\underbrace{\boldsymbol{\Omega}'}_{1\times n}\underbrace{\boldsymbol{\Sigma}'}_{n\times k}\underbrace{\boldsymbol{\mathcal{G}}}_{k\times 1} = \Big(\underbrace{\boldsymbol{\Omega}'}_{1\times n}\underbrace{\boldsymbol{\Sigma}'}_{n\times k}\underbrace{\boldsymbol{\mathcal{G}}}_{k\times 1}\Big)' = \underbrace{\boldsymbol{\mathcal{G}}'}_{1\times k}\underbrace{\boldsymbol{\Sigma}}_{k\times n}\underbrace{\boldsymbol{\Omega}}_{n\times 1}$$
A $1\times 1$ (scalar) expression equals its own transpose, so the two forms can be added or interchanged freely.
Consider the product of two random variables, one of which converges in distribution and the other converges
in probability to a constant: the asymptotic distribution of this product is unaffected by replacing the one that
converges to a constant by this constant. Formally, let 𝐴𝑛 be a statistic with an asymptotic distribution and let
𝐵𝑛 be a statistic with probability limit 𝑏. Then 𝐴𝑛 𝐵𝑛 and 𝐴𝑛 𝑏 have the same asymptotic distribution.
$$A_nB_n \xrightarrow{d} A_nb$$
This value is the sample mean – from a much wider population, we have drawn a finite sequence of
observations, and calculated the average across them. How do we know that this sample parameter is
meaningful with respect to the population, and therefore that we can make inferences from it?
WLLN states that the mean of a sequence of i.i.d. random variables converges in probability to the expected
value of the random variable as the length of that sequence tends to infinity. By ‘converging in probability’, we
mean that the probability that the sample mean differs from the expected value of the random variable by more than any fixed amount tends to zero.
In short, WLLN guarantees that with a large enough sample size the sample mean should approximately match
the true population parameter. Clearly, this is a powerful theorem for any statistical exercise: given we are
(always) constrained by a finite sample, WLLN ensures that we can infer from the data something meaningful
about the population. For example, from a large enough sample of voters we can estimate the average support
for a candidate or party.
Note that convergence in probability is defined as the limit in probability as 𝑛 → ∞. And we can intuitively see
that, as 𝑛 → ∞, our sample becomes the population. So it makes sense that, as 𝑛 gets increasingly large, the
sample mean becomes the population mean.
At a population level,
➢ $y = x\beta + u$
➢ minimize expected squared error
At a sample level,
➢ $y = x\hat{\beta} + \hat{u}$
➢ minimize mean squared error
2.8) Tricks for estimating population objects
Substitute parameters by estimated parameters (with a hat)
$$X_n \sim \mu + \mathcal{N}\!\left(0, \frac{\sigma^2}{n}\right)
\;\Rightarrow\; X_n - \mu \sim \mathcal{N}\!\left(0, \frac{\sigma^2}{n}\right)
\;\Rightarrow\; X_n - \mu \sim \frac{\sigma}{\sqrt{n}}\,\mathcal{N}(0,1)
\;\Rightarrow\; \frac{\sqrt{n}(X_n - \mu)}{\sigma} \sim \mathcal{N}(0,1)$$
Conclusions:
$$\bullet \sim \mathcal{N}(◆, ✠)
\;\Leftrightarrow\; \bullet \sim ◆ + \mathcal{N}(0, ✠)
\;\Leftrightarrow\; \bullet - ◆ \sim \mathcal{N}(0, ✠)
\;\Leftrightarrow\; \bullet - ◆ \sim \sqrt{✠}\cdot\mathcal{N}(0,1)
\;\Leftrightarrow\; \frac{\bullet - ◆}{\sqrt{✠}} \sim \mathcal{N}(0,1)$$
2.11.2) Multivariate distributions
$$\underbrace{\mathbf{a}}_{k\times 1} \sim \mathcal{N}\big(\underbrace{\mathbf{b}}_{k\times 1},\; \underbrace{\mathbf{C}\mathbf{D}\mathbf{C}'}_{k\times k}\big)
\;\Leftrightarrow\; \mathbf{a} \sim \mathbf{b} + \mathcal{N}(\mathbf{0}, \mathbf{C}\mathbf{D}\mathbf{C}')
\;\Leftrightarrow\; (\mathbf{a} - \mathbf{b}) \sim \mathcal{N}(\mathbf{0}, \mathbf{C}\mathbf{D}\mathbf{C}')
\;\Leftrightarrow\; (\mathbf{a} - \mathbf{b}) \sim \mathbf{C}\,\mathcal{N}(\mathbf{0}, \mathbf{D})
\;\Leftrightarrow\; \mathbf{C}^{-1}(\mathbf{a} - \mathbf{b}) \sim \mathcal{N}(\mathbf{0}, \mathbf{D})$$
2.12.1) CEF
The CEF is the best predictor of $y_i$ given $\mathbf{x}_i$ in the class of all functions of $\mathbf{x}_i$:
$$[y_i - m(\mathbf{x}_i)]^2 = [y_i - \mathbb{E}(y_i|\mathbf{x}_i) + \mathbb{E}(y_i|\mathbf{x}_i) - m(\mathbf{x}_i)]^2
= \big(y_i - \mathbb{E}(y_i|\mathbf{x}_i)\big)^2 + \big(\mathbb{E}(y_i|\mathbf{x}_i) - m(\mathbf{x}_i)\big)^2 + 2\underbrace{\big(y_i - \mathbb{E}(y_i|\mathbf{x}_i)\big)}_{u_i}\underbrace{\big(\mathbb{E}(y_i|\mathbf{x}_i) - m(\mathbf{x}_i)\big)}_{h(\mathbf{x}_i)}$$
The first term doesn't depend on $m(\mathbf{x}_i)$, so it drops out of the program, and the cross term has zero expectation since $\mathbb{E}[u_ih(\mathbf{x}_i)] = \mathbb{E}[h(\mathbf{x}_i)\,\mathbb{E}(u_i|\mathbf{x}_i)] = 0$. We are left with:
$$\min_{m(\mathbf{x}_i)}\;\mathbb{E}\Big[\big(\mathbb{E}(y_i|\mathbf{x}_i) - m(\mathbf{x}_i)\big)^2\Big]$$
where we can clearly see that the functional form of $m(\mathbf{x}_i)$ that minimizes the expression is $m(\mathbf{x}_i) = \mathbb{E}(y_i|\mathbf{x}_i)$.
The population linear regression coefficient (BLP) solves $\min_{\mathbf{b}}\mathbb{E}[(y_i - \mathbf{x}_i'\mathbf{b})^2]$. It is found by solving the FOC:
$$\frac{\partial}{\partial\mathbf{b}}(\cdot) = \mathbb{E}[-2\mathbf{x}_i\underbrace{(y_i - \mathbf{x}_i'\mathbf{b})}_{u_i}] = 0 \;\Rightarrow\; \mathbb{E}[\mathbf{x}_i(y_i - \mathbf{x}_i'\mathbf{b})] = 0 \;\Rightarrow\; \underbrace{\mathbb{E}[\mathbf{x}_iy_i]}_{k\times 1} = \underbrace{\mathbb{E}[\mathbf{x}_i\mathbf{x}_i'\mathbf{b}]}_{k\times 1}$$
Link between the CEF and the population linear regression model (BLP)
✓ If the CEF is linear, then the population linear regression (BLP) is equal to the CEF
✓ Even when the CEF is not linear, the BLP provides us the best mean-squared-error (MSE) approximation
to 𝔼(𝑦𝑖 |𝑥𝑖 ).
$$(y_i - \mathbf{x}_i'\mathbf{b})^2 = [y_i - \mathbb{E}(y_i|\mathbf{x}_i) + \mathbb{E}(y_i|\mathbf{x}_i) - \mathbf{x}_i'\mathbf{b}]^2
= \big(y_i - \mathbb{E}(y_i|\mathbf{x}_i)\big)^2 + \big(\mathbb{E}(y_i|\mathbf{x}_i) - \mathbf{x}_i'\mathbf{b}\big)^2 + 2\underbrace{\big(y_i - \mathbb{E}(y_i|\mathbf{x}_i)\big)}_{u_i}\underbrace{\big(\mathbb{E}(y_i|\mathbf{x}_i) - \mathbf{x}_i'\mathbf{b}\big)}_{h(\mathbf{x}_i)}$$
Take the expectation:
$$\mathbb{E}[(y_i - \mathbf{x}_i'\mathbf{b})^2] = \mathbb{E}\big[\big(y_i - \mathbb{E}(y_i|\mathbf{x}_i)\big)^2\big] + \mathbb{E}\big[\big(\mathbb{E}(y_i|\mathbf{x}_i) - \mathbf{x}_i'\mathbf{b}\big)^2\big] + 2\underbrace{\mathbb{E}[u_ih(\mathbf{x}_i)]}_{0}$$
where the first term is irrelevant to the minimization program as it doesn't depend on $\mathbf{b}$. We are left with:
$$\min_{\mathbf{b}}\;\mathbb{E}\big[\big(\mathbb{E}(y_i|\mathbf{x}_i) - \mathbf{x}_i'\mathbf{b}\big)^2\big]$$
Hence, finding the $\boldsymbol{\beta}$ that best approximates the dependent variable $y_i$ (in mean squared error) is equivalent to finding the best linear approximation to the CEF.
Properties
▪ $\mathbf{P}\mathbf{X} = \mathbf{X}$: $\;\mathbf{P}\mathbf{X} = \underbrace{\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'}_{\mathbf{P}}\mathbf{X} = \mathbf{X}$; more generally, for $\mathbf{Z} = \mathbf{X}\boldsymbol{\Gamma}$: $\mathbf{P}\mathbf{Z} = \underbrace{\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'}_{\mathbf{P}}\underbrace{\mathbf{X}\boldsymbol{\Gamma}}_{\mathbf{Z}} = \mathbf{X}\boldsymbol{\Gamma} = \mathbf{Z}$
▪ Idempotent: $\mathbf{P}\mathbf{P} = \mathbf{P}$: $\;\mathbf{P}\mathbf{P} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\underbrace{\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}}_{\mathbf{I}_k}\mathbf{X}' = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$
▪ Symmetric: $\mathbf{P}' = \mathbf{P}$: $\;\mathbf{P}' = [\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']' = \mathbf{X}[(\mathbf{X}'\mathbf{X})^{-1}]'\mathbf{X}' = \mathbf{X}[(\mathbf{X}'\mathbf{X})']^{-1}\mathbf{X}' = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' = \mathbf{P}$
▪ "Hat matrix": $\mathbf{P}\mathbf{Y} = \hat{\mathbf{Y}}$: $\;\mathbf{P}\mathbf{Y} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \mathbf{X}\hat{\boldsymbol{\beta}} = \hat{\mathbf{Y}}$
▪ $\operatorname{tr}(\mathbf{P}) = k$: $\;\operatorname{tr}(\mathbf{P}) = \operatorname{tr}[\underbrace{\mathbf{X}}_{\mathbf{A}}\underbrace{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'}_{\mathbf{B}}] = \operatorname{tr}[\underbrace{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'}_{\mathbf{B}}\underbrace{\mathbf{X}}_{\mathbf{A}}] = \operatorname{tr}[\mathbf{I}_k] = k$
▪ $\mathbf{M}\mathbf{X} = \mathbf{0}$: $\;\mathbf{M}\mathbf{X} = [\mathbf{I}_n - \mathbf{P}]\mathbf{X} = \mathbf{X} - \mathbf{P}\mathbf{X} = \mathbf{X} - \mathbf{X} = \mathbf{0}$
▪ Idempotent: $\mathbf{M}^r = \mathbf{M}$
▪ Symmetric: $\mathbf{M}' = \mathbf{M}$
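A quick numpy check (random data, purely illustrative) of the projection and annihilator matrix properties listed above:

```python
import numpy as np

rng = np.random.default_rng(9)
n, k = 50, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])

P = X @ np.linalg.inv(X.T @ X) @ X.T                 # projection ("hat") matrix
M = np.eye(n) - P                                    # annihilator matrix

print(np.allclose(P @ X, X))                         # PX = X
print(np.allclose(P @ P, P), np.allclose(P, P.T))    # idempotent and symmetric
print(np.isclose(np.trace(P), k))                    # tr(P) = k
print(np.allclose(M @ X, 0))                         # MX = 0

y = rng.normal(size=n)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(P @ y, X @ beta_hat))              # Py = fitted values
print(np.allclose(M @ y, y - X @ beta_hat))          # My = residuals
```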
2.12.6) Useful Matrix Properties
▪ Inverses and Transposes
o (𝑨𝑩𝑪𝑫)′ = 𝑫′𝑪′𝑩′𝑨′
o (𝑨 + 𝑩 + 𝑪)′ = 𝑨′ + 𝑩′ + 𝑪′
o (𝒌𝑨)′ = 𝒌𝑨′
o (𝑨−1 )′ = (𝑨′ )−1
o (𝑨′ )′ = 𝑨
o (𝑨−1 )−1 = 𝑨
▪ Traces
o tr(𝑐) = 𝑐, (𝑐 is a constant)
o tr(𝑨𝑩) = tr(𝑩𝑨)
o tr(𝑨𝑩𝑪) = tr(𝑪𝑨𝑩) = tr(𝑩𝑪𝑨)
o tr(𝑐𝑨) = 𝑐 × tr(𝑨)
o tr(𝑨′) = tr(𝑨)
o tr(𝑨 + 𝑩) = tr(𝑨) + tr(𝑩)
o 𝔼[tr(●)] = tr[𝔼(●)]
o tr(𝑰𝑛 ) = 𝑛
o tr(𝟎) = 0
▪ Determinants
o det[𝑑𝑖𝑎𝑔(𝜆, … , 𝜆)] = ∏𝑛𝑖=1 𝜆 = 𝜆𝑛
o det[𝑑𝑖𝑎𝑔(𝜆1 , … , 𝜆𝑛 )] = ∏𝑛𝑖=1 𝜆𝑖
o det[𝐼𝑛 ] = 1𝑛 = 1
If we are asked to compute the expectation or variance of an estimator, unless stated otherwise, we are being
asked to compute the conditional expectation and conditional variance.
$$\mathbf{y} \sim \mathcal{N}(\mathbf{X}\boldsymbol{\beta}, \boldsymbol{\Sigma})$$
$$f_{\mathbf{Y}}(\mathbf{y}) = (2\pi)^{-\frac{n}{2}}\,|\det(\boldsymbol{\Sigma})|^{-\frac{1}{2}}\,\exp\Big[-\tfrac{1}{2}\underbrace{\underbrace{(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})'}_{1\times n}\underbrace{\boldsymbol{\Sigma}^{-1}}_{n\times n}\underbrace{(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})}_{n\times 1}}_{1\times 1}\Big]$$
Proof:
$$\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) = \sum_{i=1}^{n}(x_i-\bar{x})y_i - \sum_{i=1}^{n}(x_i-\bar{x})\bar{y} = \sum_{i=1}^{n}(x_i-\bar{x})y_i - \bar{y}\underbrace{\sum_{i=1}^{n}(x_i-\bar{x})}_{0} = \sum_{i=1}^{n}(x_i-\bar{x})y_i$$
$$Var[\mathbf{A}\mathbf{B}\,|\,\mathbf{X}] = \mathbf{A}\,Var[\mathbf{B}|\mathbf{X}]\,\mathbf{A}' \qquad\text{(when $\mathbf{A}$ is a function of $\mathbf{X}$ only, i.e. constant conditional on $\mathbf{X}$)}$$
In general, for non-random $\bullet$:
$$Var[\bullet\,◈] = \bullet\,Var[◈]\,\bullet'$$