
1) Derivations - Regression


Index

1) Regression
1.1) Means and Predictors
1.1.1) Best Linear Predictor (data on 𝒚)
1.1.2) Best Linear Predictor (data on 𝒚 and 𝒙)
1.1.3) Best Linear Predictor (data on 𝒚 and 𝒙 + intercept)
1.1.4) Best Linear Predictor (data on 𝒚 and a vector 𝒙)
1.1.5) Decomposition of 𝒚𝒊 into 2 orthogonal components
1.2) Consistency and Asymptotic Normality of Predictors
1.2.1) Population level predictor 𝜷
1.2.2) Estimation error 𝜷̂ − 𝜷
1.2.3) Consistency of OLS estimator
1.2.4) Asymptotic normality of (1/n)sum(xiui)
1.2.5) Asymptotic Normality of OLS & Sandwich Formula
1.2.6) (Heteroskedasticity-consistent) Estimators of the asymptotic variance of the estimation error aka "estimator of the sandwich"
1.2.7) Asymptotic normality for Individual Coefficients
1.2.8) Confidence Intervals
1.3) Classical Regression Model (CRM)
1.3.1) Assumptions
1.3.2) Properties
1.3.3) Variance of 𝒚 = Variance of 𝒖 & Variance of 𝒖 = Expectation of 𝒖𝒖′
1.3.4) The conditional distribution of u determines the conditional distribution of y
1.3.5) Deriving the Estimator
1.3.6) Conditional & Unconditional Expectation of 𝜷̂ (showing unbiasedness)
1.3.7) Conditional & Unconditional Variance of 𝜷̂
1.3.8) Conditional asymptotic distribution of estimation error & Sandwich Formula
1.3.9) Estimation of the Error Variance
1.4) Weighted Least Squares (WLS)
1.4.1) Deriving the WLS estimator
1.4.2) Deriving the WLS population 𝜷
1.4.3) Consistency of WLS estimator
1.4.4) Asymptotic normality of (1/n)sum(wixiui)
1.4.5) Asymptotic Normality
1.4.6) Asymptotic Efficiency (Optimal choice of weights wi)
1.4.7) Proving that the variance is smaller when we use the optimal choice of weights (Ask Rob)
1.4.8) Generalized Least Squares (GLS) aka "WLS with optimal weights"
1.4.9) Asymptotic normality of GLS
1.5) Clustered data
1.5.1) Individual observations
1.5.2) Expression for one cluster
1.5.3) Expression for all the data grouped (not really useful)
1.5.4) Expression for all the data (no grouping)
1.5.5) OLS Estimator in 3 equivalent formats
1.5.6) Estimation error
1.5.7) Consistency
1.5.8) Asymptotic normality of sumX'huh
1.5.9) Asymptotic normality
1.5.10) Cluster robust standard errors aka "estimator of the W-sandwich"
1.6) Fixed effects
1.6.1) Individual observations
1.6.2) At a cluster level
1.6.3) Expression for all the data grouped (not really useful)
1.6.4) Expression for all the data (no grouping)
1.6.5) OLS Estimator w/ Fixed Effects through Partitioned Regression
1.6.6) Breaking down the Q matrix
1.6.7) Conditional Variance of 𝜷̂
2) Appendix
2.1) Matrix Differentiation
2.2) Adding matrix expressions that are actually scalars
2.3) Slutsky's Theorem (Convergence of Transformations)
2.4) Slutsky's Theorem (Convergence of sums or products of R.V.'s)
2.5) Law of Large Numbers (LLN)
2.6) Cramér's Theorem
2.7) Difference between Sample and Population
2.8) Tricks for estimating population objects
2.9) Delta Method
2.10) CLT
2.11) Operations with normal distribution operator
2.11.1) Univariate distributions
2.11.2) Multivariate distributions
2.12) Linear Regression and CEF
2.12.1) CEF
2.12.2) Population Linear Regression / Best Linear Predictor (BLP)
2.12.3) Useful tricks for Sandwich Formulas
2.12.4) Projection Matrix
2.12.5) Annihilator Matrix
2.12.6) Useful Matrix Properties
2.12.7) Victor's TA comments
2.12.8) Multivariate Normal Distribution of the y's
2.13) Useful trick for working in deviations from the mean
1) Regression

1.1) MEANS AND PREDICTORS


1.1.1) Best Linear Predictor (data on 𝒚)
Given data $\{y_i\}_{i=1}^n$, the mean $\bar y$ (a sample object) minimizes the mean squared error, where the error is defined as $\hat u_i = y_i - a$:

$\bar y = \arg\min_a \sum_{i=1}^n \hat u_i^2 = \arg\min_a \sum_{i=1}^n (y_i - a)^2$

Open the summation:

$\sum_{i=1}^n (y_i - a)^2 = (y_1 - a)^2 + \dots + (y_n - a)^2$

F.O.C. with respect to $a$:

$\frac{\partial}{\partial a}(\cdot) = \sum_{i=1}^n -2(y_i - a) = 0$

Isolate $a$:

$-2\sum_{i=1}^n (y_i - a) = 0 \;\Rightarrow\; \sum_{i=1}^n y_i - \sum_{i=1}^n a = 0 \;\Rightarrow\; \sum_{i=1}^n y_i = na$

$\bar y \equiv a = \frac{1}{n}\sum_{i=1}^n y_i$

1.1.2) Best Linear Predictor (data on 𝒚 and 𝒙)


Population level ($\beta$)

$\min_\beta E[u_i^2] = \min_\beta E[(y_i - \beta x_i)^2]$

$\frac{\partial}{\partial\beta}(\cdot) = -2E[x_i(y_i - \beta x_i)] = 0 \;\Rightarrow\; E[x_i y_i] = \beta E[x_i^2] \;\Rightarrow\; \beta = \frac{E[x_i y_i]}{E[x_i^2]}$

Sample level ($\hat\beta$)

$\min_\beta \sum_{i=1}^n u_i^2 = \min_\beta \sum_{i=1}^n (y_i - \beta x_i)^2$

$\frac{\partial}{\partial\beta}(\cdot) = -2\sum_{i=1}^n x_i(y_i - \beta x_i) = 0 \;\Rightarrow\; \sum_{i=1}^n x_i y_i = \beta\sum_{i=1}^n x_i^2 \;\Rightarrow\; \hat\beta = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}$

1.1.3) Best Linear Predictor (data on 𝒚 and 𝒙 + intercept)


Given data $\{y_i, x_i\}_{i=1}^n$, we look for parameters $\hat\alpha, \hat\beta$ (sample objects) that minimize the mean squared error, where the error is defined as $\hat u_i = y_i - a - b x_i$:

$(\hat\alpha, \hat\beta) = \arg\min_{(a,b)} \sum_{i=1}^n (y_i - a - b x_i)^2$

F.O.C. with respect to $a$:

$\frac{\partial}{\partial a} = \sum_{i=1}^n -2(y_i - a - b x_i) = 0 \;\Rightarrow\; \sum_{i=1}^n y_i = na + b\sum_{i=1}^n x_i$

$a = \frac{1}{n}\sum_{i=1}^n y_i - \frac{b}{n}\sum_{i=1}^n x_i \qquad (1)$

F.O.C. with respect to $b$:

$\frac{\partial}{\partial b} = \sum_{i=1}^n -2x_i(y_i - a - b x_i) = 0 \;\Rightarrow\; \sum_{i=1}^n x_i y_i - a\sum_{i=1}^n x_i - b\sum_{i=1}^n x_i^2 = 0 \qquad (2)$

Plug (1) into (2):

$\sum_{i=1}^n x_i y_i - \left(\frac{1}{n}\sum_{i=1}^n y_i - \frac{b}{n}\sum_{i=1}^n x_i\right)\sum_{i=1}^n x_i - b\sum_{i=1}^n x_i^2 = 0$

$\sum_{i=1}^n x_i y_i - \frac{1}{n}\sum_{i=1}^n y_i\sum_{i=1}^n x_i + \frac{b}{n}\left(\sum_{i=1}^n x_i\right)^2 - b\sum_{i=1}^n x_i^2 = 0$

Isolate $b$:

$b\left[\sum_{i=1}^n x_i^2 - \frac{1}{n}\left(\sum_{i=1}^n x_i\right)^2\right] = \sum_{i=1}^n x_i y_i - \frac{1}{n}\sum_{i=1}^n y_i\sum_{i=1}^n x_i$

And finally, multiplying numerator and denominator by $1/n$:

$\hat\beta \equiv b = \frac{\sum_{i=1}^n x_i y_i - \frac{1}{n}\sum_{i=1}^n y_i\sum_{i=1}^n x_i}{\sum_{i=1}^n x_i^2 - \frac{1}{n}\left(\sum_{i=1}^n x_i\right)^2} = \frac{\frac{1}{n}\sum_{i=1}^n x_i y_i - \frac{1}{n}\sum_{i=1}^n y_i \cdot \frac{1}{n}\sum_{i=1}^n x_i}{\frac{1}{n}\sum_{i=1}^n x_i^2 - \left(\frac{1}{n}\sum_{i=1}^n x_i\right)^2} = \frac{Cov(x_i, y_i)}{Var(x_i)}$

Recall that:

➢ $Cov(x, y) = E(xy) - E(x)E(y)$; at a sample level: $Cov(x, y) = \frac{1}{n}\sum_{i=1}^n x_i y_i - \frac{1}{n}\sum_{i=1}^n y_i \cdot \frac{1}{n}\sum_{i=1}^n x_i$
➢ $Var(x) = E(x^2) - [E(x)]^2$; at a sample level: $Var(x) = \frac{1}{n}\sum_{i=1}^n x_i^2 - \left(\frac{1}{n}\sum_{i=1}^n x_i\right)^2$

Plug the expression for $\hat\beta$ back into the expression for $a$:

$\hat\alpha \equiv a = \frac{1}{n}\sum_{i=1}^n y_i - \frac{b}{n}\sum_{i=1}^n x_i = \frac{1}{n}\sum_{i=1}^n y_i - \hat\beta\,\frac{1}{n}\sum_{i=1}^n x_i = \bar y - \hat\beta\,\bar x$

Conclusion:

$\hat\alpha = \bar y - \hat\beta\,\bar x \quad\wedge\quad \hat\beta = \frac{Cov(x_i, y_i)}{Var(x_i)}$
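The closed-form solution can be checked numerically. A minimal sketch (assuming numpy is available; the simulated data, seed and variable names are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)        # simulated data with alpha = 1, beta = 2

# slope and intercept from the closed-form solution derived above
beta_hat = np.cov(x, y, ddof=0)[0, 1] / np.var(x)   # Cov(x, y) / Var(x), both with 1/n
alpha_hat = y.mean() - beta_hat * x.mean()           # y-bar - beta_hat * x-bar

# cross-check against a generic least-squares fit
slope, intercept = np.polyfit(x, y, deg=1)
print(beta_hat, alpha_hat, slope, intercept)          # the two pairs coincide
```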

1.1.4) Best Linear Predictor (data on 𝒚 and a vector 𝒙)

Given data $\{y_i, x_i\}_{i=1}^n$, where $x_i$ is a $k\times 1$ vector of data, we look for a $k\times 1$ vector of parameters $\hat\beta$ (sample objects) that minimizes the mean squared error.

Sigma Notation

➢ $y_i$ is an individual observation of the dependent variable
➢ $x_i = (1, x_{i2}, \dots, x_{ik})'$ is the $k\times 1$ vector of regressors or independent variables (its first entry is the constant)

$\hat\beta = \arg\min_b \sum_{i=1}^n (y_i - x_i'b)^2$

F.O.C.:

$\frac{\partial}{\partial b} = \sum_{i=1}^n -2x_i(y_i - x_i'b) = 0 \;\Rightarrow\; \sum_{i=1}^n x_i y_i = \left(\sum_{i=1}^n x_i x_i'\right)b$

Pre-multiplying both sides by $\left(\sum_{i=1}^n x_i x_i'\right)^{-1}$ (a $k\times k$ matrix):

$\hat\beta \equiv b = \left(\sum_{i=1}^n x_i x_i'\right)^{-1}\left(\sum_{i=1}^n x_i y_i\right)$

Note that the FOC has an important implication. Open it to see what it contains:

$\sum_{i=1}^n x_i(y_i - x_i'\hat\beta) = 0 \;\Leftrightarrow\; \sum_{i=1}^n x_i\hat u_i = 0$

$\sum_{i=1}^n \begin{pmatrix} 1 \\ x_{i2} \\ \vdots \\ x_{ik} \end{pmatrix}\hat u_i = \begin{pmatrix} \sum_i \hat u_i \\ \sum_i x_{i2}\hat u_i \\ \vdots \\ \sum_i x_{ik}\hat u_i \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}$

Conclusions: when an intercept is included in the model,

➢ the sum of the residuals is equal to 0: $\sum_{i=1}^n \hat u_i = 0$
➢ the $x$'s and the residuals have zero sample covariance, thanks to $\sum_i \hat u_i = 0$ and $\sum_i x_{ik}\hat u_i = 0$ for any regressor $k$:

$Cov(x_k, \hat u) = E(x_k\hat u) - E(x_k)E(\hat u) = \frac{1}{n}\sum_{i=1}^n x_{ik}\hat u_i - \left(\frac{1}{n}\sum_{i=1}^n x_{ik}\right)\left(\frac{1}{n}\sum_{i=1}^n \hat u_i\right) = 0$

Matrix Notation

➢ $y = (y_1, \dots, y_n)'$ is $n\times 1$
➢ $X$ is the $n\times k$ matrix whose $i$-th row is $x_i'$

Define the $n\times 1$ vector of residuals:

$\hat u = y - X\hat\beta$

We seek to minimize the sum of squared errors $\hat u'\hat u$ (a $1\times 1$ object):

$\hat\beta = \arg\min_b (y - Xb)'(y - Xb) = \arg\min_b \left(y'y - y'Xb - b'X'y + b'X'Xb\right)$

F.O.C.:

$\frac{\partial}{\partial b}(\hat u'\hat u) = -(y'X)' - X'y + 2X'Xb = -2X'y + 2X'Xb = 0$

Solve for $b$:

$X'Xb = X'y \;\Rightarrow\; (X'X)^{-1}(X'X)b = (X'X)^{-1}X'y$

Thus:

$\hat\beta \equiv b = (X'X)^{-1}X'y$
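The matrix formula and the FOC implication $X'\hat u = 0$ can be transcribed directly into numpy. A minimal sketch (simulated data; all names are illustrative), using `np.linalg.solve` rather than an explicit inverse, which is the standard numerically stable way to evaluate $(X'X)^{-1}X'y$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])   # first column = intercept
beta = np.array([1.0, 0.5, -2.0])
y = X @ beta + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # solves (X'X) b = X'y

u_hat = y - X @ beta_hat
print(beta_hat)
print(X.T @ u_hat)   # ~0: residuals sum to zero and are orthogonal to every regressor
```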

1.1.5) Decomposition of 𝒚𝒊 into 2 orthogonal components


A linear regression decomposes $y_i$ into 2 orthogonal components:

$y_i = \hat y_i + \hat u_i$

• $\hat y_i = x_i'\hat\beta$
• $\hat u_i = y_i - x_i'\hat\beta$

They are orthogonal:

$Cov(\hat y_i, \hat u_i) = Cov(x_i'\hat\beta, \hat u_i) = \underbrace{Cov(x_i', \hat u_i)}_{0}\,\hat\beta = 0$

since $Cov(x_i', \hat u_i) = 0$ by the FOC.

1.2) CONSISTENCY AND ASYMPTOTIC NORMALITY OF PREDICTORS


1.2.1) Population level predictor 𝜷
At the population level, the linear regression model is:

$y_i = x_i'\beta + u_i$

To find the population predictor $\beta$, we minimize expected quadratic loss:

$\beta = \arg\min_b E[(y_i - x_i'b)^2]$

$\frac{\partial}{\partial b}(\cdot) = E[-2x_i(y_i - x_i'b)] = 0 \;\Rightarrow\; E[x_i y_i] = E[x_i x_i']\,b$

Pre-multiplying by $E[x_i x_i']^{-1}$:

$\beta = E[x_i x_i']^{-1}E[x_i y_i]$

and, as we have seen, by construction the first order condition imposes:

$E[-2x_i(y_i - x_i'\beta)] = 0$

or equivalently:

$E[x_i u_i] = 0$

which states that the regressors are uncorrelated with the population errors.

1.2.2) Estimation error $\hat\beta - \beta$

Recall the expression for $\hat\beta$:

$\hat\beta \equiv \left(\sum_{i=1}^n x_i x_i'\right)^{-1}\left(\sum_{i=1}^n x_i y_i\right)$

Substitute $y_i = x_i'\beta + u_i$ and take $\beta$ out of the summation:

$\hat\beta = \left(\sum_{i=1}^n x_i x_i'\right)^{-1}\left(\sum_{i=1}^n x_i(x_i'\beta + u_i)\right) = \underbrace{\left(\sum_{i=1}^n x_i x_i'\right)^{-1}\left(\sum_{i=1}^n x_i x_i'\right)}_{I_k}\beta + \left(\sum_{i=1}^n x_i x_i'\right)^{-1}\left(\sum_{i=1}^n x_i u_i\right)$

Hence:

$\hat\beta - \beta = \left(\sum_{i=1}^n x_i x_i'\right)^{-1}\left(\sum_{i=1}^n x_i u_i\right)$

1.2.3) Consistency of OLS estimator

We need to show that $\mathrm{plim}(\hat\beta) = \beta$, or equivalently, that $\mathrm{plim}(\hat\beta - \beta) = 0$.

Take the expression of the estimation error:

$\hat\beta - \beta = \left(\sum_{i=1}^n x_i x_i'\right)^{-1}\left(\sum_{i=1}^n x_i u_i\right)$

Multiply and divide by $n$: we multiply the inverted matrix by $\frac{1}{n}$ (which is like multiplying by $n$) and we multiply the other matrix by $\frac{1}{n}$; overall, the expression remains unaltered:

$\hat\beta - \beta = \left(\frac{1}{n}\sum_{i=1}^n x_i x_i'\right)^{-1}\left(\frac{1}{n}\sum_{i=1}^n x_i u_i\right)$

Take the plim of this expression and apply the property $\mathrm{plim}[AB] = \mathrm{plim}[A]\,\mathrm{plim}[B]$:

$\mathrm{plim}[\hat\beta - \beta] = \mathrm{plim}\left[\left(\frac{1}{n}\sum_{i=1}^n x_i x_i'\right)^{-1}\right]\mathrm{plim}\left[\frac{1}{n}\sum_{i=1}^n x_i u_i\right]$

By the Law of Large Numbers, sample means converge in probability to population means:

$\frac{1}{n}\sum_{i=1}^n x_i x_i' \overset{p}{\to} E(x_i x_i') \qquad\qquad \frac{1}{n}\sum_{i=1}^n x_i u_i \overset{p}{\to} E[x_i u_i]$

By Slutsky's Theorem (I), a continuous transformation of a sequence that converges in probability converges to the transformed limit:

$\frac{1}{n}\sum_{i=1}^n x_i x_i' \overset{p}{\to} E(x_i x_i') \;\Longrightarrow\; \left(\frac{1}{n}\sum_{i=1}^n x_i x_i'\right)^{-1} \overset{p}{\to} [E(x_i x_i')]^{-1}$

By (the weaker version of) Slutsky's Theorem (II) applied to convergence in probability, the product converges to the product of the limits:

$\left(\frac{1}{n}\sum_{i=1}^n x_i x_i'\right)^{-1}\left(\frac{1}{n}\sum_{i=1}^n x_i u_i\right) \overset{p}{\to} [E(x_i x_i')]^{-1}\underbrace{E[x_i u_i]}_{0}$

Therefore:

$\mathrm{plim}[\hat\beta - \beta] = [E(x_i x_i')]^{-1}\,E[x_i u_i] = 0$

Hence $\hat\beta$ is a consistent estimator.
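Consistency can also be seen in a quick simulation: the estimation error shrinks toward zero as $n$ grows. A minimal sketch (not part of the derivation; the data-generating process and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
beta = np.array([1.0, 2.0])

def ols_error(n):
    # one draw of beta_hat - beta at sample size n (heteroskedastic errors, E[x_i u_i] = 0)
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    u = rng.normal(size=n) * (1 + 0.5 * np.abs(X[:, 1]))
    y = X @ beta + u
    return np.linalg.solve(X.T @ X, X.T @ y) - beta

for n in (50, 500, 5000, 50000):
    print(n, np.round(ols_error(n), 4))
```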

1.2.4) Asymptotic normality of (1/n)sum(xiui)


We have that $\frac{1}{n}\sum_{i=1}^n x_i u_i$ is a sample mean with known expectation:

$E\left[\frac{1}{n}\sum_{i=1}^n x_i u_i\right] = \frac{1}{n}\sum_{i=1}^n E[x_i u_i] = \frac{1}{n}\cdot n\cdot 0 = 0$

and with variance:

$Var\left[\frac{1}{n}\sum_{i=1}^n x_i u_i\right] = \frac{1}{n^2}\left\{\sum_{i=1}^n Var[x_i u_i] + 2\sum_{i>j}\underbrace{Cov[x_i u_i, x_j u_j]}_{0}\right\} = \frac{1}{n^2}\left\{n\,E[u_i^2 x_i x_i']\right\} = \frac{1}{n}E[u_i^2 x_i x_i']$

where we have used:

➔ $Var[x_i u_i] = E[x_i u_i(x_i u_i)'] - E[x_i u_i]\,E[x_i u_i]' = E[u_i^2 x_i x_i']$
➔ $Cov[x_i u_i, x_j u_j] = E[x_i u_i(x_j u_j)'] - E[x_i u_i]\,E[x_j u_j]' = 0$, which follows from the independence of observations $i$ and $j$ (random sampling), implying $E[x_i u_i(x_j u_j)'] = E[x_i u_i]\,E[x_j u_j]' = 0$.

By the Central Limit Theorem,

$\frac{1}{n}\sum_{i=1}^n x_i u_i \overset{d}{\to} \mathcal{N}\left(0, \frac{1}{n}E[u_i^2 x_i x_i']\right)$

or, rescaling by $\sqrt n$,

$\frac{1}{\sqrt n}\sum_{i=1}^n x_i u_i \overset{d}{\to} \mathcal{N}\left(0, E[u_i^2 x_i x_i']\right)$

Let $V \equiv E[u_i^2 x_i x_i']$ (a $k\times k$ matrix); then

$\frac{1}{\sqrt n}\sum_{i=1}^n x_i u_i \overset{d}{\to} \mathcal{N}(0, V)$

1.2.5) Asymptotic Normality of OLS & Sandwich Formula


Recall the expression of the estimation error, or equivalently its conveniently rearranged version (written in terms of sample means):

$\hat\beta - \beta = \left(\frac{1}{n}\sum_{i=1}^n x_i x_i'\right)^{-1}\left(\frac{1}{n}\sum_{i=1}^n x_i u_i\right)$

We want to apply Cramér's Theorem to approximate the asymptotic distribution of this object; hence, we need to decompose the expression above into 2 matrices: one that converges in probability to a known object and another one whose asymptotic distribution we know:

✓ $\left(\frac{1}{n}\sum_{i=1}^n x_i x_i'\right)^{-1}$ converges in probability to a known object, namely $[E(x_i x_i')]^{-1}$.
✓ $\frac{1}{\sqrt n}\sum_{i=1}^n x_i u_i$ is an object whose asymptotic distribution we calculated just above: $\frac{1}{\sqrt n}\sum_{i=1}^n x_i u_i \overset{d}{\to} \mathcal{N}(0, V)$.

Thus, we scale our estimation error by $\sqrt n$ to apply the theorem in a cleaner way:

$\sqrt n(\hat\beta - \beta) = \underbrace{\left(\frac{1}{n}\sum_{i=1}^n x_i x_i'\right)^{-1}}_{\overset{p}{\to}\,[E(x_i x_i')]^{-1}}\underbrace{\left(\frac{1}{\sqrt n}\sum_{i=1}^n x_i u_i\right)}_{\overset{d}{\to}\,\mathcal{N}(0,V)}$

By Cramér's Theorem, if $A_n \overset{p}{\to} A$ and $\xi_n \overset{d}{\to} \xi$, then $A_n\xi_n \overset{d}{\to} A\xi$. Hence:

$\left(\frac{1}{n}\sum_{i=1}^n x_i x_i'\right)^{-1}\left(\frac{1}{\sqrt n}\sum_{i=1}^n x_i u_i\right) \overset{d}{\to} [E(x_i x_i')]^{-1}\,\mathcal{N}(0, V)$

Using the properties of the normal distribution operator, this is equivalent to:

$\sqrt n(\hat\beta - \beta) \overset{d}{\to} \mathcal{N}\left(0, [E(x_i x_i')]^{-1}V[E(x_i x_i')]^{-1}\right) = \mathcal{N}(0, W)$

where:

$W = [E(x_i x_i')]^{-1}\,E[u_i^2 x_i x_i']\,[E(x_i x_i')]^{-1}$

Since $\hat\beta - \beta$ is a $k\times 1$ vector, its asymptotic variance $W$ has to be a square $k\times k$ matrix.

1.2.6) (Heteroskedasticity-consistent) Estimators of the asymptotic variance of the estimation error aka "estimator of the sandwich"

Recall the population object:

$W = [E(x_i x_i')]^{-1}\,E[u_i^2 x_i x_i']\,[E(x_i x_i')]^{-1}$

A consistent estimator can be proposed by simply replacing population objects with sample objects:

• Instead of expectations $E(\cdot)$, use a sample mean $\frac{1}{n}\sum_{i=1}^n(\cdot)$
• Instead of the population error $u_i$, use the residual $\hat u_i$

$\hat W = \left[\frac{1}{n}\sum_{i=1}^n x_i x_i'\right]^{-1}\left[\frac{1}{n}\sum_{i=1}^n \hat u_i^2 x_i x_i'\right]\left[\frac{1}{n}\sum_{i=1}^n x_i x_i'\right]^{-1}$

Important: $W$ and its estimator $\hat W$ are HETEROSKEDASTICITY-ROBUST, as we didn't impose any assumption on $Var(u_i|x_i)$ to obtain them.
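The estimator $\hat W$ translates directly into numpy. A minimal sketch (an HC0-type estimator without finite-sample corrections; the function name and inputs are illustrative):

```python
import numpy as np

def sandwich_W_hat(X, y):
    """Heteroskedasticity-robust estimate of W, the asymptotic variance of sqrt(n)(beta_hat - beta)."""
    n = X.shape[0]
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    u_hat = y - X @ beta_hat
    bread = np.linalg.inv(X.T @ X / n)              # [ (1/n) sum x_i x_i' ]^{-1}
    meat = (X * u_hat[:, None] ** 2).T @ X / n      # (1/n) sum u_hat_i^2 x_i x_i'
    W_hat = bread @ meat @ bread
    return beta_hat, W_hat

# The estimated variance of beta_hat itself is W_hat / n,
# so robust standard errors are np.sqrt(np.diag(W_hat) / n).
```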

1.2.7) Asymptotic normality for Individual Coefficients


Recall that we found:

$\sqrt n(\hat\beta - \beta) \overset{d}{\to} \mathcal{N}(0, W)$

which, for an individual coefficient, corresponds to:

$\sqrt n(\hat\beta_j - \beta_j) \overset{d}{\to} \mathcal{N}(0, w_{jj})$

or, dividing by $\sqrt n$ and shifting by $\beta_j$, the asymptotic approximations:

$\hat\beta_j \overset{a}{\sim} \mathcal{N}\left(\beta_j, \frac{w_{jj}}{n}\right) \;\Longleftrightarrow\; \frac{\sqrt n(\hat\beta_j - \beta_j)}{\sqrt{w_{jj}}} \overset{d}{\to} \mathcal{N}(0, 1)$

where $w_{jj}$ is the $j$-th diagonal element of $W$.

Replacing $W$ with a consistent estimator $\hat W$, the asymptotic normality result remains unaltered:

$\hat\beta_j \overset{a}{\sim} \mathcal{N}\left(\beta_j, \frac{\hat w_{jj}}{n}\right) \;\Longleftrightarrow\; \frac{\sqrt n(\hat\beta_j - \beta_j)}{\sqrt{\hat w_{jj}}} \overset{d}{\to} \mathcal{N}(0, 1)$

1.2.8) Confidence Intervals

Given that $\frac{\sqrt n(\hat\beta_j - \beta_j)}{\sqrt{w_{jj}}} \overset{d}{\to} \mathcal{N}(0, 1)$, we can construct a 95% confidence interval as follows:

$\Pr\left[-1.96 < \frac{\sqrt n(\hat\beta_j - \beta_j)}{\sqrt{w_{jj}}} < 1.96\right] = 0.95 \;\Longleftrightarrow\; \Pr\left[-1.96\sqrt{\frac{w_{jj}}{n}} < \hat\beta_j - \beta_j < 1.96\sqrt{\frac{w_{jj}}{n}}\right] = 0.95$

$\Longleftrightarrow\; \Pr\left[\hat\beta_j - 1.96\sqrt{\frac{w_{jj}}{n}} < \beta_j < \hat\beta_j + 1.96\sqrt{\frac{w_{jj}}{n}}\right] = 0.95$

Hence, if we plug in a consistent estimate of $W$ so that we get $\hat w_{jj}$, a confidence interval is:

$CI_{0.95} = \left(\hat\beta_j - 1.96\sqrt{\frac{\hat w_{jj}}{n}},\; \hat\beta_j + 1.96\sqrt{\frac{\hat w_{jj}}{n}}\right)$

This confidence interval uses heteroskedasticity-consistent standard errors, as homoskedasticity was not assumed to obtain $W$ and its estimate $\hat W$.
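This maps one-to-one into code. A minimal sketch (assuming some estimate $\hat W$, e.g. from the hypothetical `sandwich_W_hat` above):

```python
import numpy as np

def conf_int_95(beta_hat, W_hat, n, j):
    # 95% CI for the j-th coefficient using the robust sandwich estimate W_hat
    se_j = np.sqrt(W_hat[j, j] / n)
    return beta_hat[j] - 1.96 * se_j, beta_hat[j] + 1.96 * se_j
```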

1.3) CLASSICAL REGRESSION MODEL (CRM)


The CRM tries to find the conditional expectation function:

$\beta = \arg\min_b E_x\left\{[E(y_i|x_i) - x_i'b]^2\right\}$

Note that the outer mean is taken with respect to the distribution of $x$.

Hence, $\beta$ is the best linear approximation to the conditional mean of $y$ given $x$, where the conditional mean is itself the best predictor of $y_i$ under quadratic loss: $E(y_i|x) = \arg\min_c E[(y_i - c)^2|x]$.

1.3.1) Assumptions
Assumption 1 (A1): $E(y|X) = X\beta$
 1a) Strict Exogeneity: $E(y_i|x_1, \dots, x_n) = E(y_i|x_i)$
 1b) Linearity: $E(y_i|x_i) = \alpha + x_i'\beta$
Assumption 2 (A2): $Var(y|X) = \sigma^2 I_n$
 2a) Conditional Uncorrelatedness: $Var(y_i|x_1, \dots, x_n) = Var(y_i|x_i)$ and $Cov(y_i, y_j|x_1, \dots, x_n) = 0$
 2b) Homoskedasticity: $Var(y_i|x_i) = \sigma^2$
Assumption 3 (A3), Normality: $y|X \sim \mathcal{N}(X\beta, \sigma^2 I_n)$. The joint pdf must be normal; (multivariate) normality implies that the conditional mean is linear (1b) and the conditional variance is constant (2b).

• Random Sampling delivers [1a) Exogeneity] and [2a) Conditional Uncorrelatedness]
• Normality delivers [1b) Linearity] and [2b) Homoskedasticity]
• Thus, having Random Sampling + Normality delivers all the assumptions

If we are given $E(y_i|x_1, \dots, x_n)$, exogeneity always holds:

$E(y_i|x_1, \dots, x_n) = E(y_i|x_i)$

To show it, apply the Law of Iterated Expectations on $E(y_i|x_1, \dots, x_n)$:

$E[E(y_i|x_1, \dots, x_n)\,|\,x_i] = E(y_i|x_i)$

1.3.2) Properties
Property (assumption required):
• Conditional Unbiasedness: A1
• BLUE: A1 & A2
• BUE: A1, A2 & A3

1.3.3) Variance of $y$ = Variance of $u$ & Variance of $u$ = Expectation of $uu'$

Recall: $E(u|X) = 0$ (which follows from A1, since $E(y|X) = X\beta$), and $Var(c) = 0$, $Cov(c, W) = 0$ where $c$ denotes a constant.

Under A2: $Var(y|X) = \sigma^2 I_n$ (an $n\times n$ matrix).

$Var(y|X) = Var(\underbrace{X\beta}_{\text{constant given }X} + u\,|\,X) = Var(u|X)$

and:

$Var(u|X) = E(uu'|X) - E(u|X)[E(u|X)]' = E(uu'|X)$

Recapitulation:

$Var(y|X) = Var(u|X) = E(uu'|X) \overset{(A2)}{=} \sigma^2 I_n$

1.3.4) The conditional distribution of u determines the conditional distribution of y


Claim: $u|X \sim \mathcal{N}(0, \sigma^2 I_n) \;\Longrightarrow\; y|X \sim \mathcal{N}(X\beta, \sigma^2 I_n)$

Proof: given that $y = X\beta + u$ and that $X\beta$ is a constant conditional on $X$,

• $E[y|X] = X\beta + \underbrace{E[u|X]}_{0} = X\beta$
• $Var[y|X] = Var[X\beta + u|X] = \underbrace{Var[u|X]}_{\sigma^2 I_n} = \sigma^2 I_n$

and, since an affine transformation of a multivariate normal vector is multivariate normal, $y|X$ is normal with this mean and variance.

1.3.5) Deriving the Estimator


The derivations are the same as in Section 1.1.4 (see it for a more detailed explanation).

Sigma notation

$\hat\beta = \arg\min_b \sum_{i=1}^n (y_i - x_i'b)^2$

$\frac{\partial}{\partial b} = \sum_{i=1}^n -2x_i(y_i - x_i'b) = 0 \;\Rightarrow\; \sum_{i=1}^n x_i y_i = \left(\sum_{i=1}^n x_i x_i'\right)b$

Hence:

$\hat\beta \equiv b = \left(\sum_{i=1}^n x_i x_i'\right)^{-1}\left(\sum_{i=1}^n x_i y_i\right)$

Matrix notation

$\hat\beta = \arg\min_b \hat u'\hat u$, where

$\hat u'\hat u = (y - Xb)'(y - Xb) = y'y - y'Xb - b'X'y + b'X'Xb$

FOC:

$\frac{\partial}{\partial b}(\hat u'\hat u) = -(y'X)' - X'y + 2X'Xb = -2X'y + 2X'Xb = 0$

Solve for $b$:

$X'Xb = X'y \;\Rightarrow\; b = (X'X)^{-1}X'y$

Thus:

$\hat\beta \equiv b = (X'X)^{-1}X'y$

1.3.6) Conditional & Unconditional Expectation of $\hat\beta$ (showing unbiasedness)

➔ Conditional Expectation

Under A1, $E(y|X) = X\beta$. Hence, we can show that $\hat\beta$ is an unbiased estimator:

$E[\hat\beta|X] = E[(X'X)^{-1}X'y\,|\,X] = \underbrace{(X'X)^{-1}X'}_{\text{constant given }X}E[y|X] = \underbrace{(X'X)^{-1}X'X}_{I_k}\beta = \beta$

Hence: $E[\hat\beta|X] = \beta$

➔ Unconditional Expectation

By the law of iterated expectations, $\hat\beta$ is also unconditionally unbiased:

$E[\hat\beta] = E\left[E[\hat\beta|X]\right] = E[\beta] = \beta$

as $\beta$ is just a vector of constants.

Hence: $E[\hat\beta] = \beta$

1.3.7) Conditional & Unconditional Variance of $\hat\beta$

➔ Conditional Variance

Under A2, $Var(y|X) = \sigma^2 I_n$. Since $(X'X)^{-1}X'$ is a constant given $X$:

$Var[\hat\beta|X] = Var[(X'X)^{-1}X'y\,|\,X] = \left((X'X)^{-1}X'\right)\underbrace{Var[y|X]}_{\sigma^2 I_n}\left((X'X)^{-1}X'\right)' = \sigma^2\underbrace{(X'X)^{-1}X'X}_{I_k}(X'X)^{-1} = \sigma^2(X'X)^{-1}$

Hence: $Var[\hat\beta|X] = \sigma^2(X'X)^{-1}$

➔ Unconditional Variance

By the law of total variance:

$Var[\hat\beta] = Var\left[\underbrace{E(\hat\beta|X)}_{\beta}\right] + E\left[\underbrace{Var(\hat\beta|X)}_{\sigma^2(X'X)^{-1}}\right] = \underbrace{Var[\beta]}_{0} + E[\sigma^2(X'X)^{-1}] = \sigma^2 E[(X'X)^{-1}]$

Hence: $Var[\hat\beta] = \sigma^2 E[(X'X)^{-1}]$

1.3.8) Conditional asymptotic distribution of estimation error & Sandwich Formula


The conditional distribution of our estimation error (scaled by the normalization factor $\sqrt n$) is given by:

$\sqrt n(\hat\beta - \beta)\,|\,X \sim \mathcal{N}(0, W)$

where the conditional mean is zero:

$E[\sqrt n(\hat\beta - \beta)|X] = \sqrt n\left\{\underbrace{E[\hat\beta|X]}_{\beta} - \underbrace{E[\beta|X]}_{\beta}\right\} = 0$

and to find our sandwich formula $W$ we need to calculate the conditional variance of the rescaled estimation error. Recall that $\hat\beta - \beta = \left(\sum_{i=1}^n x_i x_i'\right)^{-1}\sum_{i=1}^n x_i u_i$, where $\left(\sum_{i=1}^n x_i x_i'\right)^{-1}$ is a constant given $X$:

$Var[\sqrt n(\hat\beta - \beta)|X] = n\,Var[\hat\beta - \beta\,|\,X]$
$= n\left(\sum_{i=1}^n x_i x_i'\right)^{-1}Var\left[\sum_{i=1}^n x_i u_i\,\Big|\,X\right]\left(\sum_{i=1}^n x_i x_i'\right)^{-1}$
$= n\left(\sum_{i=1}^n x_i x_i'\right)^{-1}\left(\sum_{i=1}^n x_i\,\underbrace{Var[u_i|X]}_{Var[y_i|X]}\,x_i'\right)\left(\sum_{i=1}^n x_i x_i'\right)^{-1}$
$= n\left(\sum_{i=1}^n x_i x_i'\right)^{-1}\left(\sum_{i=1}^n x_i\,Var[y_i|X]\,x_i'\right)\left(\sum_{i=1}^n x_i x_i'\right)^{-1}$

This is the expression that we get if we don't impose homoskedasticity. We can simplify it by imposing Assumption 2, $Var[y_i|X] = \sigma^2$:

$= n\sigma^2\left(\sum_{i=1}^n x_i x_i'\right)^{-1}\underbrace{\left(\sum_{i=1}^n x_i x_i'\right)\left(\sum_{i=1}^n x_i x_i'\right)^{-1}}_{I_k} = n\sigma^2\left(\sum_{i=1}^n x_i x_i'\right)^{-1} = \sigma^2\left(\frac{1}{n}\sum_{i=1}^n x_i x_i'\right)^{-1}$

Note that:

▪ $Var\left[\sum_{i=1}^n x_i u_i\,|\,X\right] = \sum_{i=1}^n Var[x_i u_i|X] + 2\sum_{i>j}\underbrace{Cov[x_i u_i, x_j u_j|X]}_{0} = \sum_{i=1}^n Var[x_i u_i|X]$
▪ $Cov[x_i u_i, x_j u_j|X] = 0$ follows from the independence of observations (random sampling), which implies $E[u_i u_j|X] = E[u_i|X]E[u_j|X] = 0$:
$Cov[x_i u_i, x_j u_j|X] = E[x_i u_i(x_j u_j)'|X] - \underbrace{E[x_i u_i|X]}_{0}\underbrace{E[x_j u_j|X]'}_{0} = x_i x_j'\,E[u_i u_j|X] = x_i x_j'\underbrace{E[u_i|X]}_{0}\underbrace{E[u_j|X]}_{0} = 0$
▪ $Var[u_i|X] = Var[y_i - \underbrace{x_i'\beta}_{\text{constant}}\,|\,X] = Var[y_i|X]$

Conclusion: imposing A2, we have obtained

$Var[\sqrt n(\hat\beta - \beta)|X] = \sigma^2\left(\frac{1}{n}\sum_{i=1}^n x_i x_i'\right)^{-1}$

Asymptotically, we can see that:

$Var[\sqrt n(\hat\beta - \beta)|X] \overset{p}{\to} \sigma^2[E(x_i x_i')]^{-1}$

Hence, we have:

$\sqrt n(\hat\beta - \beta)\,|\,X \sim \mathcal{N}(0, W)$ with $W = \sigma^2[E(x_i x_i')]^{-1}$

1.3.9) Estimation of the Error Variance


In order to find an estimator of the error variance $\sigma^2$, we need an expression based on the sum of squared residuals of the regression, $\hat u'\hat u$, that produces an unbiased estimator. To do this, we start from the expectation $E[\hat u'\hat u]$ and find the factor by which we need to divide $\hat u'\hat u$ in order to produce an unbiased estimator.

Recall: $\hat u = Mu$, $M' = M$, $MM = M$, and $E[\mathrm{tr}(\cdot)] = \mathrm{tr}[E(\cdot)]$, $\mathrm{tr}(AB) = \mathrm{tr}(BA)$.

$E[\hat u'\hat u] = E[(Mu)'(Mu)] = E[u'\underbrace{M'M}_{M}u] = E[u'Mu] = E[E(u'Mu|X)]$

Note that $u'Mu$ is a scalar, so $\mathrm{tr}(u'Mu) = u'Mu$. Applying $\mathrm{tr}(AB) = \mathrm{tr}(BA)$:

$E[E(u'Mu|X)] = E[E(\mathrm{tr}(u'Mu)|X)] = E[E(\mathrm{tr}(Muu')|X)]$

Apply $E[\mathrm{tr}(\cdot)] = \mathrm{tr}[E(\cdot)]$ and $E(uu'|X) = \sigma^2 I_n$:

$E[E(\mathrm{tr}(Muu')|X)] = E[\mathrm{tr}(M\,E(uu'|X))] = E[\mathrm{tr}(M\sigma^2)]$

Apply $\mathrm{tr}(c\,A) = c\,\mathrm{tr}(A)$ and $\mathrm{tr}(M) = n - k$:

$E[\mathrm{tr}(M\sigma^2)] = \sigma^2 E[\mathrm{tr}(M)] = \sigma^2(n - k)$

Hence: $E[\hat u'\hat u] = \sigma^2(n - k)$

This means that $\sigma^2 = \frac{E[\hat u'\hat u]}{n - k} = E\left[\frac{\hat u'\hat u}{n - k}\right]$. So, working backwards, our unbiased estimator of $\sigma^2$ is:

$\hat\sigma^2 = \frac{\hat u'\hat u}{n - k}$

which is unbiased, since $E[\hat\sigma^2] = E\left[\frac{\hat u'\hat u}{n - k}\right] = \sigma^2$.
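As a minimal sketch in numpy (function name and inputs illustrative), the unbiased $\hat\sigma^2$ and the classical $Var[\hat\beta|X] = \sigma^2(X'X)^{-1}$ from Section 1.3.7 can be computed as:

```python
import numpy as np

def classical_ols_variance(X, y):
    """sigma2_hat = u_hat'u_hat / (n - k) and the homoskedastic Var(beta_hat | X)."""
    n, k = X.shape
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    u_hat = y - X @ beta_hat
    sigma2_hat = (u_hat @ u_hat) / (n - k)
    var_beta_hat = sigma2_hat * np.linalg.inv(X.T @ X)
    return sigma2_hat, var_beta_hat
```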

Sampling distribution under conditional normality

Other distributional results


1.4) WEIGHTED LEAST SQUARES (WLS)
Motivation: WLS is useful when there is heteroskedasticity, as it allows us to put more weight on the more precise observations (the ones with lower variance).

1.4.1) Deriving the WLS estimator


The program that we solve is (note that OLS is the special case where $w_i = 1$):

$\tilde\beta = \arg\min_b \sum_{i=1}^n w_i(y_i - x_i'b)^2$

$\frac{\partial}{\partial b} = \sum_{i=1}^n w_i(-2)x_i(y_i - x_i'b) = 0 \;\Rightarrow\; \sum_{i=1}^n w_i x_i y_i = \left(\sum_{i=1}^n w_i x_i x_i'\right)b$

Pre-multiplying by $\left(\sum_{i=1}^n w_i x_i x_i'\right)^{-1}$:

$\tilde\beta = \left(\sum_{i=1}^n w_i x_i x_i'\right)^{-1}\left(\sum_{i=1}^n w_i x_i y_i\right)$

Matrix Notation

➢ $y = (y_1, \dots, y_n)'$ is $n\times 1$
➢ $X$ is the $n\times k$ matrix whose $i$-th row is $x_i'$
➢ $\Omega = \mathrm{diag}(w_1, \dots, w_n)$ is $n\times n$; it is a symmetric matrix, so $\Omega = \Omega'$

Define the residual vector $\hat u = y - Xb$. We seek to minimize the weighted sum of squared errors:

$\tilde\beta = \arg\min_b \hat u'\Omega\hat u$

where:

$\hat u'\Omega\hat u = (y - Xb)'\Omega(y - Xb) = y'\Omega y - y'\Omega Xb - b'X'\Omega y + b'X'\Omega Xb$

FOC:

$\frac{\partial}{\partial b}(\hat u'\Omega\hat u) = -(y'\Omega X)' - X'\Omega y + 2X'\Omega Xb = -X'\Omega'y - X'\Omega y + 2X'\Omega Xb = -2X'\Omega y + 2X'\Omega Xb = 0$

where we have applied $\Omega = \Omega'$ to combine $-X'\Omega'y - X'\Omega y = -2X'\Omega y$.

Solve for $b$:

$X'\Omega Xb = X'\Omega y \;\Rightarrow\; (X'\Omega X)^{-1}(X'\Omega X)b = (X'\Omega X)^{-1}X'\Omega y$

Thus:

$\tilde\beta \equiv b = (X'\Omega X)^{-1}X'\Omega y$
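In code, the weighted normal equations can be formed without ever building the $n\times n$ matrix $\Omega$. A minimal sketch (illustrative names):

```python
import numpy as np

def wls(X, y, w):
    """WLS estimate (X' Omega X)^{-1} X' Omega y with Omega = diag(w)."""
    XtW = X.T * w                       # same as X.T @ np.diag(w), without the n x n matrix
    return np.linalg.solve(XtW @ X, XtW @ y)

# OLS is the special case wls(X, y, np.ones(len(y)));
# GLS (see below) uses w = 1 / sigma_i^2 when the conditional variances are known.
```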

1.4.2) Deriving the WLS population 𝜷


At the population level, the WLS regression model is:

$y_i = x_i'\beta + u_i$

To find the population predictor $\beta$, we minimize expected weighted quadratic loss:

$\beta = \arg\min_b E\left[w_i(y_i - x_i'b)^2\right]$

$\frac{\partial}{\partial b}(\cdot) = E[w_i(-2)x_i(y_i - x_i'b)] = 0 \;\Rightarrow\; E[w_i x_i y_i] = E[w_i x_i x_i']\,b$

Pre-multiplying by $E[w_i x_i x_i']^{-1}$:

$\beta = E[w_i x_i x_i']^{-1}E[w_i x_i y_i]$

and, as we have seen, by construction the first order condition imposes:

$E[w_i(-2)x_i(y_i - x_i'\beta)] = 0$

or equivalently:

$E[w_i x_i u_i] = 0$

which states that the regressors are uncorrelated with the weighted population errors.

1.4.3) Consistency of WLS estimator


Recall:

▪ $y_i = x_i'\beta + u_i$
▪ $x_i$ is $k\times 1$, $x_i'$ is $1\times k$, $x_i x_i'$ is $k\times k$

$\tilde\beta = \left(\sum_{i=1}^n w_i x_i x_i'\right)^{-1}\left(\sum_{i=1}^n w_i x_i(x_i'\beta + u_i)\right) = \underbrace{\left(\sum_{i=1}^n w_i x_i x_i'\right)^{-1}\left(\sum_{i=1}^n w_i x_i x_i'\right)}_{I_k}\beta + \left(\sum_{i=1}^n w_i x_i x_i'\right)^{-1}\sum_{i=1}^n w_i x_i u_i$

Hence:

$\tilde\beta - \beta = \left(\sum_{i=1}^n w_i x_i x_i'\right)^{-1}\sum_{i=1}^n w_i x_i u_i$

which we can conveniently rewrite to make it look like sample means (multiplying and dividing by $n$):

$\tilde\beta - \beta = \left(\frac{1}{n}\sum_{i=1}^n w_i x_i x_i'\right)^{-1}\left(\frac{1}{n}\sum_{i=1}^n w_i x_i u_i\right)$

Proving consistency requires showing that $\mathrm{plim}(\tilde\beta - \beta) = 0$.

➔ By plim properties, $\mathrm{plim}(ab) = \mathrm{plim}(a)\mathrm{plim}(b)$:

$\mathrm{plim}(\tilde\beta - \beta) = \mathrm{plim}\left[\left(\frac{1}{n}\sum_{i=1}^n w_i x_i x_i'\right)^{-1}\right]\mathrm{plim}\left[\frac{1}{n}\sum_{i=1}^n w_i x_i u_i\right]$

➔ By the LLN, sample means converge to population means: $\frac{1}{n}\sum_{i=1}^n w_i x_i x_i' \overset{p}{\to} E[w_i x_i x_i']$ and $\frac{1}{n}\sum_{i=1}^n w_i x_i u_i \overset{p}{\to} E[w_i x_i u_i]$
➔ By Slutsky's Theorem (I): $\left(\frac{1}{n}\sum_{i=1}^n w_i x_i x_i'\right)^{-1} \overset{p}{\to} [E(w_i x_i x_i')]^{-1}$
➔ By Slutsky's Theorem (II) in its weaker version (applied to convergence in probability), the product converges to the product of the limits:

$\left(\frac{1}{n}\sum_{i=1}^n w_i x_i x_i'\right)^{-1}\left(\frac{1}{n}\sum_{i=1}^n w_i x_i u_i\right) \overset{p}{\to} [E(w_i x_i x_i')]^{-1}E[w_i x_i u_i]$

Therefore:

$\mathrm{plim}(\tilde\beta - \beta) = [E(w_i x_i x_i')]^{-1}E[w_i x_i u_i]$

Hence, we need $E[w_i x_i u_i] = 0$ in order for $\tilde\beta$ to be a consistent estimator.

When do we obtain $E[w_i x_i u_i] = 0$? Apply iterated expectations:

$E[w_i x_i u_i] = E[E(w_i x_i u_i|X)] = E[w_i x_i\underbrace{E(u_i|X)}_{0}] = 0$

Conclusion: $E[w_i x_i u_i] = 0$ if:

✓ $w_i = w(x_i)$ is a function of $x_i$ only (so that we can take it out of the expectation conditional on $X$), and
✓ $E(u_i|X) = 0$.

In general, $\tilde\beta$ is not consistent when the CEF is not linear: $E(y_i|x_i) \neq x_i'\beta$.

1.4.4) Asymptotic normality of (1/n)sum(wixiui)


We have that $\frac{1}{n}\sum_{i=1}^n w_i x_i u_i$ is a sample mean with known expectation:

$E\left[\frac{1}{n}\sum_{i=1}^n w_i x_i u_i\right] = \frac{1}{n}\sum_{i=1}^n E[w_i x_i u_i] = \frac{1}{n}\sum_{i=1}^n E[w_i x_i\underbrace{E(u_i|X)}_{0}] = 0$

and with variance:

$Var\left[\frac{1}{n}\sum_{i=1}^n w_i x_i u_i\right] = \frac{1}{n^2}\left\{\sum_{i=1}^n Var[w_i x_i u_i] + 2\sum_{i>j}\underbrace{Cov[w_i x_i u_i, w_j x_j u_j]}_{0}\right\} = \frac{1}{n^2}\left\{n\,E[w_i^2 u_i^2 x_i x_i']\right\} = \frac{1}{n}E[w_i^2 u_i^2 x_i x_i']$

where we have used (recall $E[w_i x_i u_i] = 0$ from the FOC):

➔ $Var[w_i x_i u_i] = E[(w_i x_i u_i)(w_i x_i u_i)'] - E[w_i x_i u_i]\,E[w_i x_i u_i]' = E[w_i^2 u_i^2 x_i x_i']$
➔ $Cov[w_i x_i u_i, w_j x_j u_j] = E[w_i x_i u_i(w_j x_j u_j)'] - E[w_i x_i u_i]\,E[w_j x_j u_j]' = 0$, which follows from the independence of observations $i$ and $j$.

By the Central Limit Theorem,

$\frac{1}{n}\sum_{i=1}^n w_i x_i u_i \overset{d}{\to} \mathcal{N}\left(0, \frac{1}{n}E[w_i^2 u_i^2 x_i x_i']\right)$

or, rescaling by $\sqrt n$,

$\frac{1}{\sqrt n}\sum_{i=1}^n w_i x_i u_i \overset{d}{\to} \mathcal{N}(0, V), \qquad V \equiv E[w_i^2 u_i^2 x_i x_i'] \quad (k\times k)$

1.4.5) Asymptotic Normality

Recall the expression for the estimation error (conveniently rearranged to have sample means):

$\tilde\beta - \beta = \left(\frac{1}{n}\sum_{i=1}^n w_i x_i x_i'\right)^{-1}\left(\frac{1}{n}\sum_{i=1}^n w_i x_i u_i\right)$

We want to apply Cramér's Theorem to approximate the asymptotic distribution of this object; hence, we decompose it into 2 matrices: one that converges in probability to a known object and another one whose asymptotic distribution we know:

✓ $\left(\frac{1}{n}\sum_{i=1}^n w_i x_i x_i'\right)^{-1}$ converges in probability to a known object, namely $[E(w_i x_i x_i')]^{-1}$ (we already showed this using Slutsky's Theorem in the consistency discussion).
✓ $\frac{1}{\sqrt n}\sum_{i=1}^n w_i x_i u_i$ is an object whose asymptotic distribution we calculated just above: $\frac{1}{\sqrt n}\sum_{i=1}^n w_i x_i u_i \overset{d}{\to} \mathcal{N}(0, V)$.

Thus, we scale our estimation error by $\sqrt n$ to apply the theorem in a cleaner way:

$\sqrt n(\tilde\beta - \beta) = \underbrace{\left(\frac{1}{n}\sum_{i=1}^n w_i x_i x_i'\right)^{-1}}_{\overset{p}{\to}\,[E(w_i x_i x_i')]^{-1}}\underbrace{\left(\frac{1}{\sqrt n}\sum_{i=1}^n w_i x_i u_i\right)}_{\overset{d}{\to}\,\mathcal{N}(0,V)}$

By Cramér's Theorem ($A_n \overset{p}{\to} A$ and $\xi_n \overset{d}{\to} \xi$ imply $A_n\xi_n \overset{d}{\to} A\xi$) and the properties of the normal distribution operator:

$\sqrt n(\tilde\beta - \beta) \overset{d}{\to} \mathcal{N}(0, W)$

where:

$W = [E(w_i x_i x_i')]^{-1}\,E[w_i^2 u_i^2 x_i x_i']\,[E(w_i x_i x_i')]^{-1}$

Since $\tilde\beta - \beta$ is a $k\times 1$ vector, its asymptotic variance $W$ has to be a square $k\times k$ matrix.

1.4.6) Asymptotic Efficiency (Optimal choice of weights wi)

The optimal choice of weights is

$w_i \propto \frac{1}{E(u_i^2|x_i)} = \frac{1}{\sigma_i^2}$

When the weights $w_i$ are chosen proportional to the reciprocal of $\sigma_i^2 = E(u_i^2|x_i)$, the asymptotic variance

$W = [E(w_i x_i x_i')]^{-1}\,E[w_i^2 u_i^2 x_i x_i']\,[E(w_i x_i x_i')]^{-1}$

becomes:

$W = \left[E\left(\frac{x_i x_i'}{\sigma_i^2}\right)\right]^{-1}$

(under homoskedasticity, $\sigma_i^2 = \sigma^2$, this reduces to $\sigma^2[E(x_i x_i')]^{-1}$).

Proof: take each part of the sandwich $W$, substitute $w_i = \frac{1}{\sigma_i^2}$, and use iterated expectations; note that $\sigma_i^2$ is a function of $x_i$, so it stays inside the outer expectation.

• The bread:

$[E(w_i x_i x_i')]^{-1} = \left[E\left(\frac{x_i x_i'}{\sigma_i^2}\right)\right]^{-1}$

• The ham:

$E[u_i^2 w_i^2 x_i x_i'] = E\left[\frac{u_i^2}{\sigma_i^4}x_i x_i'\right] = E\left[\frac{x_i x_i'}{\sigma_i^4}\underbrace{E(u_i^2|x_i)}_{\sigma_i^2}\right] = E\left[\frac{x_i x_i'}{\sigma_i^2}\right]$

• The sandwich = bread × ham × bread:

$W = \left[E\left(\frac{x_i x_i'}{\sigma_i^2}\right)\right]^{-1}E\left[\frac{x_i x_i'}{\sigma_i^2}\right]\left[E\left(\frac{x_i x_i'}{\sigma_i^2}\right)\right]^{-1} = \left[E\left(\frac{x_i x_i'}{\sigma_i^2}\right)\right]^{-1}$

1.4.7) Proving that the variance is smaller when we use the optimal choice of weights (Ask Rob)
1.4.8) Generalized Least Squares (GLS) aka "WLS with optimal weights"

The GLS estimator is simply the WLS estimator using the optimal weights

$w_i = \frac{1}{E(u_i^2|x_i)} = \frac{1}{\sigma_i^2}$

Applying these weights to the WLS estimator we get the GLS estimator:

$\tilde\beta_{GLS} = \left(\sum_{i=1}^n \frac{x_i x_i'}{\sigma_i^2}\right)^{-1}\left(\sum_{i=1}^n \frac{x_i y_i}{\sigma_i^2}\right)$

This estimator is asymptotically efficient in the sense of having the smallest asymptotic variance among the class of consistent WLS estimators.

In matrix notation:

$\tilde\beta_{GLS} \equiv b = (X'\Omega X)^{-1}X'\Omega y$ with $\Omega = \mathrm{diag}\left(\frac{1}{\sigma_1^2}, \dots, \frac{1}{\sigma_n^2}\right)$

1.4.9) Asymptotic normality of GLS

Since GLS is just a special case of WLS with the optimal weights, and we have already proven that

• WLS is consistent and asymptotically normal (so GLS is too), and
• with the optimal weights the asymptotic variance is $\left[E\left(\frac{x_i x_i'}{\sigma_i^2}\right)\right]^{-1}$,

we can conclude that:

$\sqrt n(\tilde\beta_{GLS} - \beta) \overset{d}{\to} \mathcal{N}\left(0, \left[E\left(\frac{x_i x_i'}{\sigma_i^2}\right)\right]^{-1}\right)$

1.5) CLUSTERED DATA


Consider a sample that consists of:

• 𝐻 groups
• 𝑀ℎ observations in each group, ℎ = 1, … , 𝐻

In total, we have a sample of 𝑛 observations {𝑦𝑖 , 𝒙𝒊 }𝑛𝑖=1 , with

𝑛 = 𝑀1 + ⋯ + 𝑀𝐻

Note that we don't necessarily have the same number of observations in each group; that's precisely what the subindex $h$ in $M_h$ allows us to consider: $M_1, M_2, \dots, M_H$ need not be equal.

Characteristics of the sample

• Observations are independent across groups


• Observations are dependent within groups
• 𝐻 is large
• 𝑀ℎ is small

1.5.1) Individual observations



$\underbrace{y_{hm}}_{1\times 1} = \underbrace{x_{hm}'}_{1\times k}\underbrace{\beta}_{k\times 1} + \underbrace{u_{hm}}_{1\times 1}$

where $x_{hm} = (1, x_{hm}^2, \dots, x_{hm}^k)'$ is the $k\times 1$ vector of regressors and $\beta = (\beta_1, \dots, \beta_k)'$ is $k\times 1$,

• $\forall h = 1, \dots, H$
• $\forall m = 1, \dots, M_h$

1.5.2) Expression for one cluster


In compact form, for each cluster $h$:

$\underbrace{y_h}_{M_h\times 1} = \underbrace{X_h}_{M_h\times k}\underbrace{\beta}_{k\times 1} + \underbrace{u_h}_{M_h\times 1}$

where $y_h = (y_{h1}, \dots, y_{hM_h})'$, $X_h$ is the $M_h\times k$ matrix whose $m$-th row is $x_{hm}' = (1, x_{hm}^2, \dots, x_{hm}^k)$, and $u_h = (u_{h1}, \dots, u_{hM_h})'$, $\forall h = 1, \dots, H$.

1.5.3) Expression for all the data grouped (not really useful)
Compact representation (we necessarily assume here that $M_1 = \dots = M_H = M$).

We don't estimate the model like this; to estimate it, we need to stack all the data (see the next sections).

$Y = A + \sum_{l=1}^k\left[X^l \odot B^l\right] + U$

where

• $\odot$ denotes the Hadamard (element-wise) matrix product
• the superscript $l$ refers to the $l$-th regressor
• all matrices are $H\times M$, with rows indexed by the group $h$ and columns by the within-group index $m$: $Y$ collects the $y_{hm}$, $A$ has every entry equal to the intercept $\alpha$, $X^l$ collects the observations of regressor $l$, $B^l$ has every entry equal to $\beta_l$, and $U$ collects the errors $u_{hm}$.

Opened up element by element, row $h$ and column $m$ of this equation reads $y_{hm} = \alpha + x_{hm}^1\beta_1 + \dots + x_{hm}^k\beta_k + u_{hm}$. We use double-index notation $(y_{hm}, x_{hm})$ for $h = 1, \dots, H$ (group index) and $m = 1, \dots, M_h$ (within group index).

1.5.4) Expression for all the data (no grouping)


We forget about groups and simply put all the data together. This is how we estimate the OLS model:

$\underbrace{y}_{n\times 1} = \underbrace{X}_{n\times k}\underbrace{\beta}_{k\times 1} + \underbrace{u}_{n\times 1}$

where $n = \sum_{h=1}^H M_h = M_1 + \dots + M_H$. The rows are stacked cluster by cluster: the first $M_1$ rows contain $(y_{1m}, x_{1m}')$ for $m = 1, \dots, M_1$, the next $M_2$ rows contain cluster 2, and so on up to cluster $H$. Each row of $X$ is $x_{hm}' = (1, x_{hm}^2, \dots, x_{hm}^k)$, and $u$ stacks the errors $u_{hm}$ in the same order.

1.5.5) OLS Estimator in 3 equivalent formats


We can estimate the model using 3 different types of notation (all of them are equivalent):

Matrix notation (all the data, ignoring groups):

$\hat\beta = (\underbrace{X'X}_{k\times k})^{-1}\underbrace{X'y}_{k\times 1}$

Sigma notation by clusters:

$\hat\beta = \left(\sum_{h=1}^H \underbrace{X_h'X_h}_{k\times k}\right)^{-1}\sum_{h=1}^H \underbrace{X_h'y_h}_{k\times 1}$

Sigma notation by individual observations:

$\hat\beta = \left(\sum_{h=1}^H\sum_{m=1}^{M_h} \underbrace{x_{hm}x_{hm}'}_{k\times k}\right)^{-1}\sum_{h=1}^H\sum_{m=1}^{M_h} \underbrace{x_{hm}y_{hm}}_{k\times 1}$

There is a one-to-one equivalence between:

$●✠ = \sum_{h=1}^H ●_h✠_h = \sum_{h=1}^H\sum_{m=1}^{M_h} ●_{hm}✠_{hm}$

however, be careful with dimensions when using this. For example:

$X'X = \sum_{h=1}^H \underbrace{X_h'}_{k\times M_h}\underbrace{X_h}_{M_h\times k} = \sum_{h=1}^H\sum_{m=1}^{M_h} \underbrace{x_{hm}}_{k\times 1}\underbrace{x_{hm}'}_{1\times k}$

$X'y = \sum_{h=1}^H \underbrace{X_h'}_{k\times M_h}\underbrace{y_h}_{M_h\times 1} = \sum_{h=1}^H\sum_{m=1}^{M_h} \underbrace{x_{hm}}_{k\times 1}\underbrace{y_{hm}}_{1\times 1}$

1.5.6) Estimation error


Take the estimator in matrix notation:

$\hat\beta = (X'X)^{-1}X'y$

Recall that $y = X\beta + u$. Hence, we can rewrite the estimator as:

$\hat\beta = (X'X)^{-1}X'(X\beta + u) = \underbrace{(X'X)^{-1}X'X}_{I_k}\beta + (X'X)^{-1}X'u$

So the estimation error is

$\hat\beta - \beta = (X'X)^{-1}X'u$

We can re-express $X'u$ as a sum over clusters using $●✠ = \sum_{h=1}^H ●_h✠_h$:

$\underbrace{X'u}_{k\times 1} = \sum_{h=1}^H \underbrace{X_h'}_{k\times M_h}\underbrace{u_h}_{M_h\times 1}$

So we can rewrite:

$\hat\beta - \beta = (X'X)^{-1}\sum_{h=1}^H X_h'u_h$

After rescaling by $\sqrt H$ (using $\sqrt H = \frac{H}{\sqrt H} = H\cdot\frac{1}{\sqrt H}$):

$\sqrt H(\hat\beta - \beta) = H(X'X)^{-1}\frac{1}{\sqrt H}\sum_{h=1}^H X_h'u_h = \left(\frac{X'X}{H}\right)^{-1}\frac{1}{\sqrt H}\sum_{h=1}^H X_h'u_h$

1.5.7) Consistency

$\hat\beta - \beta = (X'X)^{-1}\sum_{h=1}^H X_h'u_h$

We can conveniently adjust it in order to have sample means. Namely, we can apply $●✠ = \sum_{h=1}^H ●_h✠_h$ to write $X'X = \sum_{h=1}^H X_h'X_h$, so that:

$\hat\beta - \beta = \left(\sum_{h=1}^H X_h'X_h\right)^{-1}\sum_{h=1}^H X_h'u_h$

And multiplying and dividing by $H$ yields:

$\hat\beta - \beta = \left(\frac{1}{H}\sum_{h=1}^H X_h'X_h\right)^{-1}\frac{1}{H}\sum_{h=1}^H X_h'u_h$

Proving consistency requires showing that $\mathrm{plim}(\hat\beta - \beta) = 0$.

➔ By plim properties, $\mathrm{plim}(ab) = \mathrm{plim}(a)\mathrm{plim}(b)$:

$\mathrm{plim}(\hat\beta - \beta) = \mathrm{plim}\left[\left(\frac{1}{H}\sum_{h=1}^H X_h'X_h\right)^{-1}\right]\mathrm{plim}\left[\frac{1}{H}\sum_{h=1}^H X_h'u_h\right]$

➔ By the LLN (over clusters), sample means converge to population means: $\frac{1}{H}\sum_{h=1}^H X_h'X_h \overset{p}{\to} E[X_h'X_h]$ and $\frac{1}{H}\sum_{h=1}^H X_h'u_h \overset{p}{\to} E[X_h'u_h]$
➔ By Slutsky's Theorem (I): $\left(\frac{1}{H}\sum_{h=1}^H X_h'X_h\right)^{-1} \overset{p}{\to} [E(X_h'X_h)]^{-1}$
➔ By Slutsky's Theorem (II) in its weaker version (applied to convergence in probability), the product converges to the product of the limits:

$\left(\frac{1}{H}\sum_{h=1}^H X_h'X_h\right)^{-1}\left(\frac{1}{H}\sum_{h=1}^H X_h'u_h\right) \overset{p}{\to} [E(X_h'X_h)]^{-1}E[X_h'u_h]$

Therefore:

$\mathrm{plim}(\hat\beta - \beta) = [E(X_h'X_h)]^{-1}\underbrace{E[X_h'u_h]}_{0} = 0$

So our estimator is consistent.

1.5.8) Asymptotic normality of sumX'huh

We have that $\frac{1}{H}\sum_{h=1}^H X_h'u_h$ is a sample mean (over clusters) with known expectation:

$E\left[\frac{1}{H}\sum_{h=1}^H X_h'u_h\right] = \frac{1}{H}\sum_{h=1}^H E[X_h'u_h] = \frac{1}{H}\sum_{h=1}^H E[X_h'\underbrace{E(u_h|X)}_{0}] = 0$

and with variance:

$Var\left[\frac{1}{H}\sum_{h=1}^H X_h'u_h\right] = \frac{1}{H^2}\left\{\sum_{h=1}^H Var[X_h'u_h] + 2\sum_{h>j}\underbrace{Cov[X_h'u_h, X_j'u_j]}_{0}\right\} = \frac{1}{H^2}\left\{H\,E[X_h'u_hu_h'X_h]\right\} = \frac{1}{H}E[X_h'u_hu_h'X_h]$

where we have used (recall $E[X_h'u_h] = 0$):

➔ $Var[X_h'u_h] = E[X_h'u_h(X_h'u_h)'] - E[X_h'u_h]\{E[X_h'u_h]\}' = E[X_h'u_hu_h'X_h]$
➔ $Cov[X_h'u_h, X_j'u_j] = E[X_h'u_h(X_j'u_j)'] - E[X_h'u_h]\{E[X_j'u_j]\}' = 0$, which follows from the independence of observations across clusters. Note that nothing is assumed about $u_hu_h'$ within a cluster, so the "meat" $E[X_h'u_hu_h'X_h]$ does not simplify.

By the Central Limit Theorem (over clusters), rescaling by $\sqrt H$,

$\frac{1}{\sqrt H}\sum_{h=1}^H X_h'u_h \overset{d}{\to} \mathcal{N}(0, V), \qquad V \equiv E[\underbrace{X_h'}_{k\times M_h}\underbrace{u_h}_{M_h\times 1}\underbrace{u_h'}_{1\times M_h}\underbrace{X_h}_{M_h\times k}] \quad (k\times k)$

1.5.9) Asymptotic normality


Recall the expression for the estimation error:

$\sqrt H(\hat\beta - \beta) = \left(\frac{X'X}{H}\right)^{-1}\frac{1}{\sqrt H}\sum_{h=1}^H X_h'u_h$

We can conveniently adjust it in order to have sample means, applying $X'X = \sum_{h=1}^H X_h'X_h$:

$\sqrt H(\hat\beta - \beta) = \left(\frac{1}{H}\sum_{h=1}^H X_h'X_h\right)^{-1}\frac{1}{\sqrt H}\sum_{h=1}^H X_h'u_h$

We want to apply Cramér's Theorem to approximate the asymptotic distribution of this object; hence, we decompose it into 2 matrices: one that converges in probability to a known object and another one whose asymptotic distribution we know:

✓ $\left(\frac{1}{H}\sum_{h=1}^H X_h'X_h\right)^{-1}$ converges in probability to a known object, which is $[E(X_h'X_h)]^{-1}$.
✓ $\frac{1}{\sqrt H}\sum_{h=1}^H X_h'u_h$ is an object whose asymptotic distribution we calculated just above: $\frac{1}{\sqrt H}\sum_{h=1}^H X_h'u_h \overset{d}{\to} \mathcal{N}(0, V)$.

Thus:

$\sqrt H(\hat\beta - \beta) = \underbrace{\left(\frac{1}{H}\sum_{h=1}^H X_h'X_h\right)^{-1}}_{\overset{p}{\to}\,[E(X_h'X_h)]^{-1}}\underbrace{\left(\frac{1}{\sqrt H}\sum_{h=1}^H X_h'u_h\right)}_{\overset{d}{\to}\,\mathcal{N}(0,V)}$

By Cramér's Theorem ($A_n \overset{p}{\to} A$ and $\xi_n \overset{d}{\to} \xi$ imply $A_n\xi_n \overset{d}{\to} A\xi$) and the properties of the normal distribution operator:

$\sqrt H(\hat\beta - \beta) \overset{d}{\to} \mathcal{N}\left(0, [E(X_h'X_h)]^{-1}V[E(X_h'X_h)]^{-1}\right) = \mathcal{N}(0, W)$

where:

$W = [E(X_h'X_h)]^{-1}\,E[X_h'u_hu_h'X_h]\,[E(X_h'X_h)]^{-1}$

Since $\hat\beta - \beta$ is a $k\times 1$ vector, its asymptotic variance $W$ has to be a square $k\times k$ matrix.

1.5.10) Cluster robust standard errors aka "estimator of the W-sandwich"

Recall the population object:

$W = [E(X_h'X_h)]^{-1}\,E[X_h'u_hu_h'X_h]\,[E(X_h'X_h)]^{-1}$

A consistent estimator can be proposed by simply replacing population objects with sample objects:

• Instead of expectations $E(\cdot)$, use a sample mean over clusters, $\frac{1}{H}\sum_{h=1}^H(\cdot)$
• Instead of the population errors $u_h$, use the residuals $\hat u_h$

$\hat W = \left[\frac{1}{H}\sum_{h=1}^H X_h'X_h\right]^{-1}\left[\frac{1}{H}\sum_{h=1}^H X_h'\hat u_h\hat u_h'X_h\right]\left[\frac{1}{H}\sum_{h=1}^H X_h'X_h\right]^{-1}$

We can express the "bread" of the sandwich in compact notation using $\sum_{h=1}^H X_h'X_h = X'X$:

$\hat W = \left(\frac{X'X}{H}\right)^{-1}\left[\frac{1}{H}\sum_{h=1}^H X_h'\hat u_h\hat u_h'X_h\right]\left(\frac{X'X}{H}\right)^{-1}$

Recall that $\hat W$ estimates the asymptotic variance of the rescaled estimation error $\sqrt H(\hat\beta - \beta)$. In order to find an estimator of the variance of $\hat\beta$ itself, we undo the rescaling:

$\sqrt H(\hat\beta - \beta) \overset{a}{\sim} \mathcal{N}(0, \hat W) \;\Rightarrow\; \hat\beta - \beta \overset{a}{\sim} \mathcal{N}\left(0, \frac{\hat W}{H}\right) \;\Rightarrow\; \hat\beta \overset{a}{\sim} \mathcal{N}\left(\beta, \frac{\hat W}{H}\right)$

Hence:

$\widehat{Var}[\hat\beta] = \frac{\hat W}{H} = (X'X)^{-1}\left[\sum_{h=1}^H \underbrace{X_h'}_{k\times M_h}\underbrace{\hat u_h}_{M_h\times 1}\underbrace{\hat u_h'}_{1\times M_h}\underbrace{X_h}_{M_h\times k}\right](X'X)^{-1}$

which is a $k\times k$ matrix.

Once we have an estimator for the asymptotic variance, it is straightforward to obtain standard errors. "Standard error" is simply the word we use to refer to the estimator of the standard deviation, and this is simply the square root of the estimator of the variance. So the fancy name "cluster robust standard errors" boils down to taking the square root of the diagonal of the estimated variance-covariance matrix of $\hat\beta$, which is $\frac{\hat W}{H}$. For the $j$-th coefficient, take the $j$-th diagonal element:

$\widehat{SE}[\hat\beta_j] = \sqrt{\frac{\hat w_{jj}}{H}}$
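A minimal numpy sketch of $\widehat{Var}[\hat\beta] = (X'X)^{-1}\left[\sum_h X_h'\hat u_h\hat u_h'X_h\right](X'X)^{-1}$ (no finite-sample corrections, which statistical software usually adds; names are illustrative):

```python
import numpy as np

def cluster_robust_vcov(X, y, groups):
    """OLS with cluster-robust variance; `groups` holds the cluster label of each row."""
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    u_hat = y - X @ beta_hat
    k = X.shape[1]
    meat = np.zeros((k, k))
    for g in np.unique(groups):
        idx = groups == g
        score_h = X[idx].T @ u_hat[idx]        # X_h' u_hat_h, a k-vector
        meat += np.outer(score_h, score_h)     # X_h' u_hat_h u_hat_h' X_h
    bread = np.linalg.inv(X.T @ X)
    V_hat = bread @ meat @ bread               # = W_hat / H
    return beta_hat, V_hat, np.sqrt(np.diag(V_hat))   # cluster-robust standard errors
```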

1.6) FIXED EFFECTS


1.6.1) Individual observations
In a regression with fixed effects, we regress $y_{hm}$ on $x_{hm}$ and group dummy variables:

$\underbrace{y_{hm}}_{1\times 1} = \underbrace{\alpha_h}_{1\times 1} + \underbrace{x_{hm}'}_{1\times k}\underbrace{\beta}_{k\times 1} + \underbrace{u_{hm}}_{1\times 1}$

Note that before we had a column of 1's in the $x$ variable and $\beta_1$ was the intercept. Now the intercept is captured separately by the group effects, so there is no column of ones in the $x$'s and $\beta_1$ is no longer the intercept but the coefficient associated with $x^1$.

Here $x_{hm} = (x_{hm}^1, \dots, x_{hm}^k)'$ and $\beta = (\beta_1, \dots, \beta_k)'$ are $k\times 1$,

• $\forall h = 1, \dots, H$
• $\forall m = 1, \dots, M_h$

1.6.2) At a cluster level


For each cluster $h$:

$\underbrace{y_h}_{M_h\times 1} = \underbrace{\alpha_h\mathbf{1}_{M_h}}_{M_h\times 1} + \underbrace{X_h}_{M_h\times k}\underbrace{\beta}_{k\times 1} + \underbrace{u_h}_{M_h\times 1}$

where $y_h = (y_{h1}, \dots, y_{hM_h})'$, $\mathbf{1}_{M_h}$ is an $M_h\times 1$ vector of ones (so $\alpha_h\mathbf{1}_{M_h}$ repeats the group effect $\alpha_h$), $X_h$ is the $M_h\times k$ matrix whose $m$-th row is $x_{hm}' = (x_{hm}^1, \dots, x_{hm}^k)$ (no column of ones), and $u_h = (u_{h1}, \dots, u_{hM_h})'$, $\forall h = 1, \dots, H$.

1.6.3) Expression for all the data grouped (not really useful)
Compact representation (we necessarily assume here that $M_1 = \dots = M_H = M$).

We don't estimate the model like this; to estimate it, we need to stack all the data (see the next sections).

$Y = A + \sum_{l=1}^k\left[X^l \odot B^l\right] + U$

where

• $\odot$ denotes the Hadamard (element-wise) matrix product
• the superscript $l$ refers to the $l$-th regressor
• all matrices are $H\times M$, with rows indexed by the group $h$ and columns by the within-group index $m$: $Y$ collects the $y_{hm}$, the $h$-th row of $A$ has every entry equal to the group effect $\alpha_h$, $X^l$ collects the observations of regressor $l$, $B^l$ has every entry equal to $\beta_l$, and $U$ collects the errors.

Opened up element by element, row $h$ and column $m$ reads $y_{hm} = \alpha_h + x_{hm}^1\beta_1 + \dots + x_{hm}^k\beta_k + u_{hm}$. We use double-index notation $(y_{hm}, x_{hm})$ for $h = 1, \dots, H$ (group index) and $m = 1, \dots, M_h$ (within group index).

1.6.4) Expression for all the data (no grouping)


We forget about groups and simply put all the data together. This is how we estimate the model by OLS:

$\underbrace{y}_{n\times 1} = \underbrace{D}_{n\times H}\underbrace{\alpha}_{H\times 1} + \underbrace{X}_{n\times k}\underbrace{\beta}_{k\times 1} + \underbrace{u}_{n\times 1}$

where the matrix of group dummies $D$ is block-diagonal in the stacked (cluster-by-cluster) ordering:

$D = \begin{pmatrix} \mathbf{1}_{M_1} & \mathbf{0}_{M_1} & \cdots & \mathbf{0}_{M_1} \\ \mathbf{0}_{M_2} & \mathbf{1}_{M_2} & \cdots & \mathbf{0}_{M_2} \\ \vdots & & \ddots & \vdots \\ \mathbf{0}_{M_H} & \mathbf{0}_{M_H} & \cdots & \mathbf{1}_{M_H} \end{pmatrix} \qquad \alpha = \begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_H \end{pmatrix}$

where

• $\mathbf{1}_{M_h}$ is an $M_h\times 1$ vector of 1's
• $\mathbf{0}_{M_h}$ is an $M_h\times 1$ vector of 0's

$y$, $X$ and $u$ stack the $y_{hm}$, the rows $x_{hm}' = (x_{hm}^1, \dots, x_{hm}^k)$ and the $u_{hm}$ cluster by cluster, exactly as in Section 1.5.4, except that $X$ now has no column of ones, and

• $n = \sum_{h=1}^H M_h = M_1 + \dots + M_H$

1.6.5) OLS Estimator w/ Fixed Effects through Partitioned Regression


The minimization problem is:

$(\hat\alpha, \hat\beta) = \arg\min_{(\alpha,\beta)} u'u$

where $u = y - D\alpha - X\beta$:

$(\hat\alpha, \hat\beta) = \arg\min_{(\alpha,\beta)}\{(y - D\alpha - X\beta)'(y - D\alpha - X\beta)\}$

Open it:

$u'u = y'y - y'D\alpha - y'X\beta - \alpha'D'y + \alpha'D'D\alpha + \alpha'D'X\beta - \beta'X'y + \beta'X'D\alpha + \beta'X'X\beta$

Now we take FOCs. Recall the rules of matrix differentiation:

$\frac{\partial}{\partial b}(b'X) = X \qquad \frac{\partial}{\partial b}(X'b) = X \qquad \frac{\partial}{\partial b}(b'Xb) = 2Xb$

FOC for $\alpha$:

$\frac{\partial u'u}{\partial\alpha} = -\underbrace{(y'D)'}_{D'y} - D'y + 2D'D\alpha + D'X\beta + \underbrace{(\beta'X'D)'}_{D'X\beta} = 0 \;\Rightarrow\; -2D'y + 2D'D\alpha + 2D'X\beta = 0$

Hence: $D'y = D'X\beta + D'D\alpha$

FOC for $\beta$:

$\frac{\partial u'u}{\partial\beta} = -\underbrace{(y'X)'}_{X'y} + \underbrace{(\alpha'D'X)'}_{X'D\alpha} - X'y + X'D\alpha + 2X'X\beta = 0 \;\Rightarrow\; -2X'y + 2X'D\alpha + 2X'X\beta = 0$

Hence: $X'y = X'X\beta + X'D\alpha$

We can summarize the 2 FOCs obtained above in compact form:

$\begin{pmatrix} X'X & X'D \\ D'X & D'D \end{pmatrix}\begin{pmatrix} \hat\beta \\ \hat\alpha \end{pmatrix} = \begin{pmatrix} X'y \\ D'y \end{pmatrix}$

Modus operandi:

1) Solve for $\alpha$ in the 2nd block (FOC for $\alpha$):

$D'D\alpha = D'y - D'X\beta \;\Rightarrow\; \hat\alpha = (D'D)^{-1}D'(y - X\beta)$

2) Insert the result in the 1st block (FOC for $\beta$):

$X'y = X'X\beta + X'D(D'D)^{-1}D'(y - X\beta)$
$\Rightarrow\; \left(X'X - X'D(D'D)^{-1}D'X\right)\beta = X'y - X'D(D'D)^{-1}D'y$
$\Rightarrow\; X'\underbrace{[I - D(D'D)^{-1}D']}_{= Q}X\beta = X'\underbrace{[I - D(D'D)^{-1}D']}_{= Q}y$
$\Rightarrow\; (X'QX)\beta = X'Qy \;\Rightarrow\; \hat\beta = (X'QX)^{-1}X'Qy$

3) Recover $\hat\alpha$ using $\hat\beta$:

$\hat\alpha = (D'D)^{-1}D'(y - X\hat\beta)$

or, group by group, $\hat\alpha_h = \bar y_h - \bar x_h'\hat\beta$, where $\bar y_h$ and $\bar x_h$ are the within-group means.
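In practice, the partitioned-regression result is usually implemented by demeaning within groups, since multiplying by $Q$ is the same as taking deviations from group means (shown in the next subsection). A minimal sketch (here $X$ has no column of ones; names are illustrative):

```python
import numpy as np

def fixed_effects(X, y, groups):
    """beta_hat = (X'QX)^{-1} X'Qy via within-group demeaning, then alpha_h = ybar_h - xbar_h'beta_hat."""
    Xd, yd = X.astype(float).copy(), y.astype(float).copy()
    labels = np.unique(groups)
    for g in labels:
        idx = groups == g
        Xd[idx] -= X[idx].mean(axis=0)     # Q applied to X: deviations from group means
        yd[idx] -= y[idx].mean()           # Q applied to y
    beta_hat = np.linalg.solve(Xd.T @ Xd, Xd.T @ yd)
    alpha_hat = np.array([y[groups == g].mean() - X[groups == g].mean(axis=0) @ beta_hat
                          for g in labels])
    return beta_hat, alpha_hat
```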

1.6.6) Breaking down the Q matrix


Note that:

𝑸 = [𝑰 − 𝑫(𝑫′ 𝑫)−𝟏 𝑫′ ]
𝟏𝑴𝟏 𝟎𝑴𝟏 ⋯ 𝟎𝑴𝟏
𝟎𝑴𝟐 𝟏𝑴𝟐 ⋯ 𝟎𝑴𝟐
𝑫
⏟ =
𝑛×𝐻
⋮ ⋯ ⋱ ⋮
𝟎
( 𝑴𝑯 𝟎𝑴𝑯 ⋯ 𝟏𝑴𝑯 )
1 0 0
1 0 0
1 1 1 0 0 0
1 0 0
𝑫= 𝑫′ = (0 0 0 1 1 0)
0 1 0
⏟0 0 0 0 0 1
0 1 0 3×6
(
⏟0 0 1)
6×3
1 0 0
1 0 0
1 1 1 0 0 0 3 0 0
1 0 0
𝑫′ 𝑫 = (0 0 0 1 1 0) = (0 2 0)
0 1 0
⏟0 0 0 0 0 1 0 0 1
3×6
0 1 0
(
⏟0 0 1)
6×3

1/3 0 0
(𝑫′ 𝑫)−𝟏
=( 0 1/2 0 )
0 0 1/1

1/3 0 0 1 1 1 0 0 0 1/3 1/3 1/3 0 0 0


(𝑫′ −𝟏
𝑫) ′
𝑫 =( 0 1/2 0 ) (0 0 0 1 1 0) = ( 0 0 0 1/3 1/2 0 )
⏟0 0 1/1 ⏟0 0 0 0 0 1 ⏟0 0 0 0 0 1/2
3×3 3×6 3×6

1 0 0 1/3 1/3 1/3 0 0 0


1 0 0 1/3 1/3 1/3 0 0 0
1/3 1/3 1/3 0 0 0
′ −𝟏 ′ 1 0 0 1/3 1/3 1/3 0 0 0
𝑫(𝑫 𝑫) 𝑫 = ( 0 0 0 1/3 1/2 0 )=
0 1 0 0 0 0 1/2 1/2 0
⏟0 0 0 0 0 1/2
0 0 0 1/2 1/2 0
0 1 0
3×6
(0
⏟ 0 1) ( 0
⏟ 0 0 0 0 1/1)
6×3 6×6

When pre-multiplying 𝒚 by 𝑸, we obtain 𝑦 in deviations from the mean.


$$
\boldsymbol{y} = \begin{pmatrix}y_{11}\\ y_{12}\\ y_{13}\\ y_{21}\\ y_{22}\\ y_{31}\end{pmatrix},
\qquad
\tilde{\boldsymbol{y}} = \boldsymbol{Q}\boldsymbol{y} = [\boldsymbol{I} - \boldsymbol{D}(\boldsymbol{D}'\boldsymbol{D})^{-1}\boldsymbol{D}']\,\boldsymbol{y}
$$
$$
\tilde{\boldsymbol{y}} =
\begin{pmatrix}y_{11}\\ y_{12}\\ y_{13}\\ y_{21}\\ y_{22}\\ y_{31}\end{pmatrix}
-
\begin{pmatrix}
(y_{11}+y_{12}+y_{13})/3\\
(y_{11}+y_{12}+y_{13})/3\\
(y_{11}+y_{12}+y_{13})/3\\
(y_{21}+y_{22})/2\\
(y_{21}+y_{22})/2\\
y_{31}/1
\end{pmatrix}
=
\begin{pmatrix}y_{11}\\ y_{12}\\ y_{13}\\ y_{21}\\ y_{22}\\ y_{31}\end{pmatrix}
-
\begin{pmatrix}\bar{y}_1\\ \bar{y}_1\\ \bar{y}_1\\ \bar{y}_2\\ \bar{y}_2\\ \bar{y}_3\end{pmatrix}
$$
Note that the last entry is $y_{31} - \bar{y}_3 = 0$: a group with a single observation is demeaned away entirely.
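A quick numerical illustration (my own made-up numbers, not from the notes) that pre-multiplying by Q is the same as subtracting each observation's group mean:

```python
import numpy as np

groups = np.array([0, 0, 0, 1, 1, 2])     # group sizes M = (3, 2, 1), n = 6
y = np.array([1.0, 2.0, 6.0, 4.0, 8.0, 5.0])
D = (groups[:, None] == np.arange(3)).astype(float)

Q = np.eye(len(y)) - D @ np.linalg.solve(D.T @ D, D.T)

# Subtracting each observation's group mean gives the same vector as Qy
group_means = np.array([y[groups == h].mean() for h in range(3)])
assert np.allclose(Q @ y, y - group_means[groups])
print(Q @ y)   # last entry is 0: the singleton group is fully demeaned away
```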
1.6.7) Conditional Variance of 𝜷̂
Under the assumptions:

• 𝔼[𝒚|𝑿, 𝑫] = 𝑿𝜷 + 𝑫𝜶
• 𝑉𝑎𝑟[𝒚|𝑿, 𝑫] = 𝜎 2 𝑰𝒏

𝛼̂ and 𝛽̂ are unbiased, with conditional variances given by:

$$
\mathrm{Var}[\hat{\boldsymbol{\beta}}\,|\,\boldsymbol{X},\boldsymbol{D}]
= \mathrm{Var}\big[\underbrace{(\boldsymbol{X}'\boldsymbol{Q}\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{Q}}_{\text{constant given }\boldsymbol{X},\boldsymbol{D}}\,\boldsymbol{y}\,\big|\,\boldsymbol{X},\boldsymbol{D}\big]
= (\boldsymbol{X}'\boldsymbol{Q}\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{Q}\,\underbrace{\mathrm{Var}[\boldsymbol{y}\,|\,\boldsymbol{X},\boldsymbol{D}]}_{\sigma^2\boldsymbol{I}_n}\,\boldsymbol{Q}'\boldsymbol{X}(\boldsymbol{X}'\boldsymbol{Q}\boldsymbol{X})^{-1}
$$
$$
= \sigma^2(\boldsymbol{X}'\boldsymbol{Q}\boldsymbol{X})^{-1}\boldsymbol{X}'\underbrace{\boldsymbol{Q}\boldsymbol{Q}'}_{\boldsymbol{Q}}\boldsymbol{X}(\boldsymbol{X}'\boldsymbol{Q}\boldsymbol{X})^{-1}
= \sigma^2(\boldsymbol{X}'\boldsymbol{Q}\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{Q}\boldsymbol{X}(\boldsymbol{X}'\boldsymbol{Q}\boldsymbol{X})^{-1}
= \sigma^2(\boldsymbol{X}'\boldsymbol{Q}\boldsymbol{X})^{-1}
$$
(using that $\boldsymbol{Q}$ is symmetric and idempotent, so $\boldsymbol{Q}\boldsymbol{Q}' = \boldsymbol{Q}$)
$$
\mathrm{Var}[\hat{\alpha}_h\,|\,\boldsymbol{X},\boldsymbol{D}] = \mathrm{Var}[\bar{y}_h - \bar{\boldsymbol{x}}_h'\hat{\boldsymbol{\beta}}\,|\,\boldsymbol{X},\boldsymbol{D}]
= \mathrm{Var}\Big[\frac{1}{M_h}\sum_{m=1}^{M_h} y_{hm}\,\Big|\,\boldsymbol{X},\boldsymbol{D}\Big]
+ \mathrm{Var}[\bar{\boldsymbol{x}}_h'\hat{\boldsymbol{\beta}}\,|\,\boldsymbol{X},\boldsymbol{D}]
- 2\,\mathrm{Cov}\big(\bar{y}_h,\ \bar{\boldsymbol{x}}_h'\hat{\boldsymbol{\beta}}\,\big|\,\boldsymbol{X},\boldsymbol{D}\big)
$$
The covariance term is zero: $\hat{\boldsymbol{\beta}}$ depends on $\boldsymbol{y}$ only through $\boldsymbol{Q}\boldsymbol{y}$, and with $\mathrm{Var}[\boldsymbol{y}|\boldsymbol{X},\boldsymbol{D}] = \sigma^2\boldsymbol{I}_n$ we get $\mathrm{Cov}(\bar{y}_h, \boldsymbol{Q}\boldsymbol{y}\,|\,\boldsymbol{X},\boldsymbol{D}) = \frac{\sigma^2}{M_h}\boldsymbol{d}_h'\boldsymbol{Q}' = \boldsymbol{0}$, since $\boldsymbol{Q}\boldsymbol{D} = \boldsymbol{0}$ (here $\boldsymbol{d}_h$ is the $h$-th column of $\boldsymbol{D}$). Hence, using that the $y_{hm}$ are conditionally uncorrelated with variance $\sigma^2$ (since $\mathrm{Var}[\boldsymbol{y}|\boldsymbol{X},\boldsymbol{D}] = \sigma^2\boldsymbol{I}_n$):
$$
\mathrm{Var}[\hat{\alpha}_h\,|\,\boldsymbol{X},\boldsymbol{D}]
= \frac{1}{M_h^2}\sum_{m=1}^{M_h}\underbrace{\mathrm{Var}[y_{hm}\,|\,\boldsymbol{X},\boldsymbol{D}]}_{\sigma^2}
+ \bar{\boldsymbol{x}}_h'\underbrace{\mathrm{Var}[\hat{\boldsymbol{\beta}}\,|\,\boldsymbol{X},\boldsymbol{D}]}_{\sigma^2(\boldsymbol{X}'\boldsymbol{Q}\boldsymbol{X})^{-1}}\bar{\boldsymbol{x}}_h
= \frac{\sigma^2}{M_h} + \sigma^2\,\bar{\boldsymbol{x}}_h'(\boldsymbol{X}'\boldsymbol{Q}\boldsymbol{X})^{-1}\bar{\boldsymbol{x}}_h
$$
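A Monte Carlo sketch (illustrative assumptions; parameter values and group sizes are made up) that the simulated sampling variances of 𝜷̂ and 𝛼̂₁ match the conditional-variance formulas derived above:

```python
import numpy as np

rng = np.random.default_rng(2)
groups = np.repeat([0, 1, 2], [3, 2, 4])
n, H, k, sigma = groups.size, 3, 2, 0.5
X = rng.normal(size=(n, k))
D = (groups[:, None] == np.arange(H)).astype(float)
Q = np.eye(n) - D @ np.linalg.solve(D.T @ D, D.T)
alpha, beta = np.array([1.0, -2.0, 0.5]), np.array([2.0, -1.0])

# Theoretical conditional variances derived above
V_beta = sigma**2 * np.linalg.inv(X.T @ Q @ X)
xbar1 = X[groups == 0].mean(axis=0)
V_alpha1 = sigma**2 / (groups == 0).sum() + xbar1 @ V_beta @ xbar1

# Monte Carlo over repeated draws of u (X and D held fixed)
B = 20_000
beta_hats, alpha1_hats = np.empty((B, k)), np.empty(B)
for b in range(B):
    y = alpha[groups] + X @ beta + sigma * rng.normal(size=n)
    bh = np.linalg.solve(X.T @ Q @ X, X.T @ Q @ y)
    beta_hats[b] = bh
    alpha1_hats[b] = y[groups == 0].mean() - xbar1 @ bh

print(np.cov(beta_hats.T), V_beta)        # simulated vs theoretical Var[beta_hat | X, D]
print(alpha1_hats.var(), V_alpha1)        # simulated vs theoretical Var[alpha_hat_1 | X, D]
```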
2) Appendix

2.1) MATRIX DIFFERENTIATION


➢ $\frac{\partial}{\partial\boldsymbol{b}}(\boldsymbol{b}'\boldsymbol{X}) = \boldsymbol{X}$
➢ $\frac{\partial}{\partial\boldsymbol{b}}(\boldsymbol{X}'\boldsymbol{b}) = \boldsymbol{X}$
➢ $\frac{\partial}{\partial\boldsymbol{b}}(\boldsymbol{b}'\boldsymbol{b}) = 2\boldsymbol{b}$
➢ $\frac{\partial}{\partial\boldsymbol{b}}(\boldsymbol{b}'\boldsymbol{X}\boldsymbol{b}) = 2\boldsymbol{X}\boldsymbol{b}$ (for symmetric $\boldsymbol{X}$; in general $(\boldsymbol{X}+\boldsymbol{X}')\boldsymbol{b}$)

Matrix Cookbook Page 10

• https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf
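As a sanity check (my own sketch, not part of the notes), the quadratic-form rule can be verified numerically with a central finite-difference gradient; a symmetric X is assumed so that the gradient is 2Xb:

```python
import numpy as np

rng = np.random.default_rng(3)
k = 4
A = rng.normal(size=(k, k))
X = A + A.T                       # symmetric X, so d(b'Xb)/db = 2Xb
b = rng.normal(size=k)

def f(v):
    return v @ X @ v              # scalar quadratic form b'Xb

eps = 1e-6
grad_fd = np.array([(f(b + eps * e) - f(b - eps * e)) / (2 * eps) for e in np.eye(k)])

assert np.allclose(grad_fd, 2 * X @ b, atol=1e-5)
```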

2.2) ADDING MATRIX EXPRESSIONS THAT ARE ACTUALLY SCALARS


If you have any matrix expression that yields a scalar, it is equal to its own transpose:
$$
\underbrace{\boldsymbol{\Omega}'}_{1\times n}\underbrace{\boldsymbol{\Sigma}'}_{n\times k}\underbrace{\boldsymbol{\mathcal{G}}}_{k\times 1}
= \Big(\underbrace{\boldsymbol{\Omega}'}_{1\times n}\underbrace{\boldsymbol{\Sigma}'}_{n\times k}\underbrace{\boldsymbol{\mathcal{G}}}_{k\times 1}\Big)'
= \underbrace{\boldsymbol{\mathcal{G}}'}_{1\times k}\underbrace{\boldsymbol{\Sigma}}_{k\times n}\underbrace{\boldsymbol{\Omega}}_{n\times 1}
$$

2.3) SLUTSKY’S THEOREM (CONVERGENCE OF TRANSFORMATIONS)


If a random variable (𝑊𝑛 ) converges in probability to a constant (𝑐), then a transformation 𝑔(·) of that random
variable 𝑔(𝑊𝑛 ) converges in probability to the same transformation applied to the constant 𝑔(𝑐).

Hence, if 𝑔(·) is continuous at 𝑐


$$
W_n \xrightarrow{p} c \ \Rightarrow\ g(W_n) \xrightarrow{p} g(c)
$$

2.4) SLUTSKY’S THEOREM (CONVERGENCE OF SUMS OR PRODUCTS OR R.V’S)


Consider the sum of two random variables, one of which converges in distribution (in other words, has an
asymptotic distribution) and the other converges in probability to a constant: the asymptotic distribution of this
sum is unaffected by replacing the one that converges to a constant by this constant. Formally, let 𝐴𝑛 be a statistic
with an asymptotic distribution and let 𝐵𝑛 be a statistic with probability limit 𝑏. Then 𝐴𝑛 + 𝐵𝑛 and 𝐴𝑛 + 𝑏 have
the same asymptotic distribution.
$$
A_n + B_n \xrightarrow{d} A_n + b
$$

Consider the product of two random variables, one of which converges in distribution and the other converges
in probability to a constant: the asymptotic distribution of this product is unaffected by replacing the one that
converges to a constant by this constant. Formally, let 𝐴𝑛 be a statistic with an asymptotic distribution and let
𝐵𝑛 be a statistic with probability limit 𝑏. Then 𝐴𝑛 𝐵𝑛 and 𝐴𝑛 𝑏 have the same asymptotic distribution.
$$
A_n B_n \xrightarrow{d} A_n b
$$

There is also a weaker version where we just impose convergence in probability.

2.5) LAW OF LARGE NUMBERS (LLN)


Suppose we have a random variable 𝑋. From 𝑋, we can generate a sequence of random variables 𝑋1 , … , 𝑋𝑛 that
are independent and identically distributed (i.i.d.) draws of 𝑋. Assuming 𝑛 is finite, we can perform calculations
on this sequence of random numbers. For example, we can calculate the mean of the sequence
$$
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i
$$

This value is the sample mean – from a much wider population, we have drawn a finite sequence of
observations, and calculated the average across them. How do we know that this sample parameter is
meaningful with respect to the population, and therefore that we can make inferences from it?

The WLLN states that the mean of a sequence of i.i.d. random variables converges in probability to the expected value of the random variable as the length of that sequence tends to infinity. By 'converging in probability', we mean that, for any fixed tolerance, the probability that the difference between the sample mean and the expected value of the random variable exceeds that tolerance tends to zero.

In short, WLLN guarantees that with a large enough sample size the sample mean should approximately match
the true population parameter. Clearly, this is a powerful theorem for any statistical exercise: given that we are
(always) constrained by a finite sample, WLLN ensures that we can infer from the data something meaningful
about the population. For example, from a large enough sample of voters we can estimate the average support
for a candidate or party.

More formally, we can state WLLN as follows:


$$
\bar{X}_n \xrightarrow{p} \mathbb{E}[X]
$$
where $\xrightarrow{p}$ denotes convergence in probability.

Note that convergence in probability is defined as the limit in probability as 𝑛 → ∞. And we can intuitively see
that, as 𝑛 → ∞, our sample becomes the population. So it makes sense that, as 𝑛 gets increasingly large, the
sample mean becomes the population mean.
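A tiny simulation (illustrative only, with made-up numbers) of the WLLN at work: the running sample mean of i.i.d. Bernoulli draws settles around the population mean as n grows.

```python
import numpy as np

rng = np.random.default_rng(4)
mu = 0.3                                   # population mean: Bernoulli(0.3) "support for a candidate"
x = rng.binomial(1, mu, size=100_000)

running_mean = np.cumsum(x) / np.arange(1, x.size + 1)
for n in (10, 100, 10_000, 100_000):
    print(n, running_mean[n - 1])          # drifts towards 0.3 as n grows
```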

2.6) CRAMÉR’S THEOREM


Let $A_n$ be a stochastic matrix that converges in probability to a nonrandom matrix $A$, and let $X_n$ be a random vector with a limiting distribution, i.e.
$$
A_n \xrightarrow{p} A, \qquad X_n \xrightarrow{d} X
$$

Then, Cramér’s Theorem states that:


$$
A_n X_n \xrightarrow{d} A X
$$

2.7) DIFFERENCE BETWEEN SAMPLE AND POPULATION


At a population level,

➢ 𝑦 = 𝑥𝛽 + 𝑢
➢ minimize expected squared error

At a sample level,

➢ 𝑦 = 𝑥𝛽̂ + 𝑢̂
➢ minimize mean squared error
2.8) TRICKS FOR ESTIMATING POPULATION OBJECTS
Substitute parameters by estimated parameters (with a hat)

Substitute expectations by averages

2.9) DELTA METHOD

If $\sqrt{n}(\hat{\theta} - \theta) \xrightarrow{d} \mathcal{N}(0, V)$ and $g(\cdot)$ is differentiable at $\theta$ with Jacobian $G = \partial g(\theta)/\partial\theta'$, then
$$
\sqrt{n}\big(g(\hat{\theta}) - g(\theta)\big) \xrightarrow{d} \mathcal{N}\big(0,\ G V G'\big)
$$

2.10) CLT

For i.i.d. $X_1, \dots, X_n$ with mean $\mu$ and finite variance $\sigma^2$ (Lindeberg–Lévy):
$$
\sqrt{n}\,(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)
$$

2.11) OPERATIONS WITH NORMAL DISTRIBUTION OPERATOR


2.11.1) Univariate distributions
$$
X_n \sim \mathcal{N}\!\left(\mu, \tfrac{\sigma^2}{n}\right)
\ \Leftrightarrow\
X_n \sim \mu + \mathcal{N}\!\left(0, \tfrac{\sigma^2}{n}\right)
\ \Leftrightarrow\
X_n - \mu \sim \mathcal{N}\!\left(0, \tfrac{\sigma^2}{n}\right)
\ \Leftrightarrow\
X_n - \mu \sim \tfrac{\sigma}{\sqrt{n}}\,\mathcal{N}(0,1)
\ \Leftrightarrow\
\frac{\sqrt{n}\,(X_n - \mu)}{\sigma} \sim \mathcal{N}(0,1)
$$
Conclusions:

➔ The mean can be taken out additively:

● ∼ 𝒩(◆, ✠)
● ∼ ◆ + 𝒩(0, ✠)
● − ◆ ∼ 𝒩(0, ✠)

➔ The variance can be taken out as a multiplicative square root:

● − ◆ ∼ √✠ · 𝒩(0,1)
(● − ◆)/√✠ ∼ 𝒩(0,1)
2.11.2) Multivariate distributions

$$
\underbrace{\boldsymbol{a}}_{k\times 1} \sim \mathcal{N}\Big(\underbrace{\boldsymbol{b}}_{k\times 1},\ \underbrace{\boldsymbol{C}\boldsymbol{D}\boldsymbol{C}'}_{k\times k}\Big)
$$
$$
\underbrace{\boldsymbol{a}}_{k\times 1} \sim \underbrace{\boldsymbol{b}}_{k\times 1} + \mathcal{N}\big(\underbrace{\boldsymbol{0}}_{k\times 1},\ \underbrace{\boldsymbol{C}\boldsymbol{D}\boldsymbol{C}'}_{k\times k}\big)
$$
$$
(\boldsymbol{a} - \boldsymbol{b}) \sim \mathcal{N}\big(\boldsymbol{0},\ \boldsymbol{C}\boldsymbol{D}\boldsymbol{C}'\big)
$$
$$
(\boldsymbol{a} - \boldsymbol{b}) \sim \underbrace{\boldsymbol{C}}_{k\times k}\,\mathcal{N}\big(\boldsymbol{0},\ \underbrace{\boldsymbol{D}}_{k\times k}\big)
$$
$$
\underbrace{\boldsymbol{C}^{-1}}_{k\times k}\underbrace{(\boldsymbol{a} - \boldsymbol{b})}_{k\times 1} \sim \mathcal{N}\big(\boldsymbol{0},\ \underbrace{\boldsymbol{D}}_{k\times k}\big)
$$

2.12) LINEAR REGRESSION AND CEF


The regression coefficients defined in this section are not estimators; rather, they are nonstochastic features of
the joint distribution of dependent and independent variables. This joint distribution is what you would observe
if you had a complete enumeration of the population of interest (or knew the stochastic process generating the
data). You probably don’t have such information. Still, it’s good empirical practice to think about what
population parameters mean before worrying about how to estimate them.

2.12.1) CEF
The CEF is the best predictor of $y_i$ given $\boldsymbol{x}_i$ in the class of all functions of $\boldsymbol{x}_i$.

It solves the program:
$$
\mathbb{E}(y_i\,|\,\boldsymbol{x}_i) = \arg\min_{m(\boldsymbol{x}_i)} \mathbb{E}\big\{[y_i - m(\boldsymbol{x}_i)]^2\big\}
$$

To show it, let's go step by step.

Expand the expression $[y_i - m(\boldsymbol{x}_i)]^2$:
$$
[y_i - m(\boldsymbol{x}_i)]^2 = [y_i - \mathbb{E}(y_i|\boldsymbol{x}_i) + \mathbb{E}(y_i|\boldsymbol{x}_i) - m(\boldsymbol{x}_i)]^2
$$
$$
= \big(y_i - \mathbb{E}(y_i|\boldsymbol{x}_i)\big)^2 + \big(\mathbb{E}(y_i|\boldsymbol{x}_i) - m(\boldsymbol{x}_i)\big)^2
+ 2\underbrace{\big(y_i - \mathbb{E}(y_i|\boldsymbol{x}_i)\big)}_{u_i}\underbrace{\big(\mathbb{E}(y_i|\boldsymbol{x}_i) - m(\boldsymbol{x}_i)\big)}_{h(\boldsymbol{x}_i)}
= \big(y_i - \mathbb{E}(y_i|\boldsymbol{x}_i)\big)^2 + \big(\mathbb{E}(y_i|\boldsymbol{x}_i) - m(\boldsymbol{x}_i)\big)^2 + 2u_i h(\boldsymbol{x}_i)
$$

Take the expectation of the resulting expression; the cross term vanishes by iterated expectations, since $\mathbb{E}[u_i h(\boldsymbol{x}_i)] = \mathbb{E}\big[h(\boldsymbol{x}_i)\,\mathbb{E}(u_i|\boldsymbol{x}_i)\big] = 0$:
$$
\mathbb{E}[y_i - m(\boldsymbol{x}_i)]^2
= \mathbb{E}\big[\big(y_i - \mathbb{E}(y_i|\boldsymbol{x}_i)\big)^2\big]
+ \mathbb{E}\big[\big(\mathbb{E}(y_i|\boldsymbol{x}_i) - m(\boldsymbol{x}_i)\big)^2\big]
+ 2\underbrace{\mathbb{E}[u_i h(\boldsymbol{x}_i)]}_{0}
$$

The program is then:
$$
\min_{m(\boldsymbol{x}_i)}\Big\{\mathbb{E}\big[\big(y_i - \mathbb{E}(y_i|\boldsymbol{x}_i)\big)^2\big]
+ \mathbb{E}\big[\big(\mathbb{E}(y_i|\boldsymbol{x}_i) - m(\boldsymbol{x}_i)\big)^2\big]\Big\}
$$

The first term doesn't depend on $m(\boldsymbol{x}_i)$, so it drops out of the program. We are left with:
$$
\min_{m(\boldsymbol{x}_i)}\Big\{\mathbb{E}\big[\big(\mathbb{E}(y_i|\boldsymbol{x}_i) - m(\boldsymbol{x}_i)\big)^2\big]\Big\}
$$
which is clearly minimized by choosing
$$
m(\boldsymbol{x}_i) = \mathbb{E}(y_i|\boldsymbol{x}_i)
$$
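An illustrative simulation (my own sketch with an assumed nonlinear design, not from the notes) showing that the CEF attains the smallest mean squared error among functions of x:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
x = rng.uniform(-2, 2, size=n)
y = x**2 + rng.normal(size=n)         # here the CEF is E[y|x] = x^2

def mse(pred):
    return np.mean((y - pred) ** 2)

print(mse(x**2))                   # the CEF attains the minimum, MSE close to Var(u) = 1
print(mse(np.abs(x)))              # another function of x does worse
print(mse(np.full(n, y.mean())))   # so does the best constant predictor
```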


2.12.2) Population Linear Regression / Best Linear Predictor (BLP)
The population linear regression is the best predictor of 𝒚𝒊 given 𝒙𝒊 in the class of linear functions (indeed, this
is why it’s called the best linear predictor)

It is found by solving:
$$
\boldsymbol{\beta} = \arg\min_{\boldsymbol{b}} \mathbb{E}\big[(y_i - \boldsymbol{x}_i'\boldsymbol{b})^2\big]
$$

FOC:
$$
\frac{\partial}{\partial\boldsymbol{b}}(\cdot) = \mathbb{E}\big[-2\boldsymbol{x}_i\underbrace{(y_i - \boldsymbol{x}_i'\boldsymbol{b})}_{u_i}\big] = \boldsymbol{0}
\ \Rightarrow\
\mathbb{E}[\boldsymbol{x}_i(y_i - \boldsymbol{x}_i'\boldsymbol{b})] = \boldsymbol{0}
$$
$$
\underbrace{\mathbb{E}[\boldsymbol{x}_i y_i]}_{k\times 1} = \underbrace{\mathbb{E}[\boldsymbol{x}_i\boldsymbol{x}_i'\boldsymbol{b}]}_{k\times 1}
\ \Rightarrow\
\mathbb{E}[\boldsymbol{x}_i y_i] = \mathbb{E}[\boldsymbol{x}_i\boldsymbol{x}_i']\,\boldsymbol{b}
\ \Rightarrow\
\boldsymbol{b} = \big(\mathbb{E}[\boldsymbol{x}_i\boldsymbol{x}_i']\big)^{-1}\mathbb{E}[\boldsymbol{x}_i y_i]
$$
Hence: $\boldsymbol{\beta} = \big(\mathbb{E}[\boldsymbol{x}_i\boldsymbol{x}_i']\big)^{-1}\mathbb{E}[\boldsymbol{x}_i y_i]$, and by construction (FOC) $\mathbb{E}[\boldsymbol{x}_i u_i] = \boldsymbol{0}$
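A sketch (with assumed, made-up parameter values) of the "replace expectations with averages" version of this formula, i.e. the sample-analog estimator of the BLP coefficient:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50_000
x = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 2))])   # regressors including a constant
beta_true = np.array([1.0, 2.0, -0.5])
y = x @ beta_true + rng.normal(size=n)

# Replace the population moments E[x x'] and E[x y] with sample averages
Exx_hat = x.T @ x / n
Exy_hat = x.T @ y / n
beta_hat = np.linalg.solve(Exx_hat, Exy_hat)
print(beta_hat)     # numerically identical to the OLS coefficients, close to beta_true
```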

Link between the CEF and the population linear regression model (BLP)

✓ If the CEF is linear, then the population linear regression (BLP) is equal to the CEF
✓ Even when the CEF is not linear, the BLP provides us the best mean-squared-error (MSE) approximation
to 𝔼(𝑦𝑖 |𝑥𝑖 ).

To show this, break down the minimization program of the BLP:

$$
(y_i - \boldsymbol{x}_i'\boldsymbol{b})^2 = [y_i - \mathbb{E}(y_i|\boldsymbol{x}_i) + \mathbb{E}(y_i|\boldsymbol{x}_i) - \boldsymbol{x}_i'\boldsymbol{b}]^2
= \big(y_i - \mathbb{E}(y_i|\boldsymbol{x}_i)\big)^2 + \big(\mathbb{E}(y_i|\boldsymbol{x}_i) - \boldsymbol{x}_i'\boldsymbol{b}\big)^2
+ 2\underbrace{\big(y_i - \mathbb{E}(y_i|\boldsymbol{x}_i)\big)}_{u_i}\underbrace{\big(\mathbb{E}(y_i|\boldsymbol{x}_i) - \boldsymbol{x}_i'\boldsymbol{b}\big)}_{h(\boldsymbol{x}_i)}
$$
Take the expectation:
$$
\mathbb{E}\big[(y_i - \boldsymbol{x}_i'\boldsymbol{b})^2\big]
= \mathbb{E}\big[\big(y_i - \mathbb{E}(y_i|\boldsymbol{x}_i)\big)^2\big]
+ \mathbb{E}\big[\big(\mathbb{E}(y_i|\boldsymbol{x}_i) - \boldsymbol{x}_i'\boldsymbol{b}\big)^2\big]
+ 2\underbrace{\mathbb{E}[u_i h(\boldsymbol{x}_i)]}_{0}
$$

The minimization program becomes:
$$
\min_{\boldsymbol{b}}\ \mathbb{E}\big[(y_i - \boldsymbol{x}_i'\boldsymbol{b})^2\big]
= \min_{\boldsymbol{b}}\Big\{\mathbb{E}\big[\big(y_i - \mathbb{E}(y_i|\boldsymbol{x}_i)\big)^2\big]
+ \mathbb{E}\big[\big(\mathbb{E}(y_i|\boldsymbol{x}_i) - \boldsymbol{x}_i'\boldsymbol{b}\big)^2\big]\Big\}
$$
where the first term is irrelevant to the minimization program, as it doesn't depend on $\boldsymbol{b}$. We are left with:
$$
\min_{\boldsymbol{b}}\ \mathbb{E}\big[(y_i - \boldsymbol{x}_i'\boldsymbol{b})^2\big]
= \min_{\boldsymbol{b}}\Big\{\mathbb{E}\big[\big(\mathbb{E}(y_i|\boldsymbol{x}_i) - \boldsymbol{x}_i'\boldsymbol{b}\big)^2\big]\Big\}
$$
Hence, finding the $\boldsymbol{\beta}$ that best approximates the dependent variable $y_i$ in a mean-squared-error sense is equivalent to finding the best linear approximation to the CEF.

2.12.3) Useful tricks for Sandwich Formulas


𝑉𝑎𝑟[●✠|●] = ● 𝑉𝑎𝑟[✠|●] ●′
2.12.4) Projection Matrix
𝑷 = 𝑿(𝑿′ 𝑿)−𝟏 𝑿′

Properties

▪ $\boldsymbol{P}\boldsymbol{X} = \boldsymbol{X}$
$$
\boldsymbol{P}\boldsymbol{X} = \underbrace{\boldsymbol{X}(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'}_{\boldsymbol{P}}\boldsymbol{X} = \boldsymbol{X}
$$
▪ $\boldsymbol{P}\boldsymbol{Z} = \boldsymbol{Z}$ (if $\boldsymbol{Z} = \boldsymbol{X\Gamma}$, for any matrix $\boldsymbol{\Gamma}$)
$$
\boldsymbol{P}\boldsymbol{Z} = \underbrace{\boldsymbol{X}(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'}_{\boldsymbol{P}}\underbrace{\boldsymbol{X\Gamma}}_{\boldsymbol{Z}} = \boldsymbol{X\Gamma}
$$
▪ Idempotent: $\boldsymbol{P}\boldsymbol{P} = \boldsymbol{P}$
$$
\boldsymbol{P}\boldsymbol{P} = \underbrace{\boldsymbol{X}(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'}_{\boldsymbol{P}}\underbrace{\boldsymbol{X}(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'}_{\boldsymbol{P}} = \boldsymbol{X}(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'
$$
▪ Symmetric: $\boldsymbol{P}' = \boldsymbol{P}$
$$
\boldsymbol{P}' = \big[\boldsymbol{X}(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\big]' = (\boldsymbol{X}')'\big[(\boldsymbol{X}'\boldsymbol{X})^{-1}\big]'\boldsymbol{X}' = \boldsymbol{X}\big[(\boldsymbol{X}'\boldsymbol{X})'\big]^{-1}\boldsymbol{X}' = \boldsymbol{X}(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}' = \boldsymbol{P}
$$
▪ "Hat matrix": $\boldsymbol{P}\boldsymbol{Y} = \hat{\boldsymbol{Y}}$
$$
\boldsymbol{P}\boldsymbol{Y} = \underbrace{\boldsymbol{X}(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'}_{\boldsymbol{P}}\boldsymbol{Y} = \boldsymbol{X}\hat{\boldsymbol{\beta}} = \hat{\boldsymbol{Y}}
$$
▪ $\operatorname{tr}(\boldsymbol{P}) = k$
$$
\operatorname{tr}(\boldsymbol{P}) = \operatorname{tr}\big[\underbrace{\boldsymbol{X}(\boldsymbol{X}'\boldsymbol{X})^{-1}}_{\boldsymbol{A}}\underbrace{\boldsymbol{X}'}_{\boldsymbol{B}}\big]
= \operatorname{tr}\big[\underbrace{\boldsymbol{X}'}_{\boldsymbol{B}}\underbrace{\boldsymbol{X}(\boldsymbol{X}'\boldsymbol{X})^{-1}}_{\boldsymbol{A}}\big]
= \operatorname{tr}[\boldsymbol{I}_k] = k
$$
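A quick numerical verification (illustrative sketch, random X assumed to have full column rank) of the projection-matrix properties listed above:

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 20, 3
X = rng.normal(size=(n, k))
P = X @ np.linalg.inv(X.T @ X) @ X.T     # projection matrix

assert np.allclose(P @ X, X)             # PX = X
assert np.allclose(P @ P, P)             # idempotent
assert np.allclose(P, P.T)               # symmetric
assert np.isclose(np.trace(P), k)        # tr(P) = k
```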

2.12.5) Annihilator Matrix


𝑴 = 𝑰𝒏 − 𝑷 = 𝑰𝒏 − 𝑿(𝑿′ 𝑿)−𝟏 𝑿′

▪ 𝑴𝑿 = 𝟎
𝑴𝑿 = [𝑰𝒏 − 𝑷]𝑿 = 𝑿 − 𝑷𝑿 = 𝑿 − 𝑿 = 𝟎

▪ 𝑴𝒁 = 𝟎 (if 𝒁 = 𝑿𝚪, for any matrix 𝚪)

𝑴𝒁 = [𝑰𝒏 − 𝑷]𝑿𝚪 = 𝑿𝚪 − 𝑷𝑿𝚪 = 𝒁 − 𝑷𝒁 = 𝒁 − 𝒁 = 𝟎


▪ tr(𝑴) = 𝑛 − 𝑘
tr(𝑴) = tr[𝑰𝒏 − 𝑷] = tr[𝑰𝒏 ] − tr[𝑷] = 𝑛 − 𝑘
▪ $\boldsymbol{M}\boldsymbol{Y} = \hat{\boldsymbol{u}}$
$$
\boldsymbol{M}\boldsymbol{Y} = [\boldsymbol{I}_n - \boldsymbol{P}]\boldsymbol{Y} = \boldsymbol{Y} - \boldsymbol{P}\boldsymbol{Y} = \boldsymbol{Y} - \boldsymbol{X}\hat{\boldsymbol{\beta}} = \hat{\boldsymbol{u}}
$$
▪ $\boldsymbol{M}\boldsymbol{u} = \hat{\boldsymbol{u}}$
$$
\boldsymbol{M}\boldsymbol{u} = \boldsymbol{M}[\boldsymbol{Y} - \boldsymbol{X\beta}] = \boldsymbol{M}\boldsymbol{Y} - \underbrace{\boldsymbol{M}\boldsymbol{X}}_{\boldsymbol{0}}\boldsymbol{\beta} = \boldsymbol{M}\boldsymbol{Y} = \hat{\boldsymbol{u}}
$$

▪ Idempotent: (𝑴)𝒓 = 𝑴
▪ Symmetric: 𝑴′ = 𝑴
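And the analogous numerical check (again a sketch with assumed random data) for the annihilator matrix:

```python
import numpy as np

rng = np.random.default_rng(8)
n, k = 20, 3
X = rng.normal(size=(n, k))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T   # annihilator matrix
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

assert np.allclose(M @ X, 0)                       # MX = 0
assert np.allclose(M @ Y, Y - X @ beta_hat)        # MY = u_hat (the OLS residuals)
assert np.isclose(np.trace(M), n - k)              # tr(M) = n - k
```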
2.12.6) Useful Matrix Properties
▪ Inverses and Transposes
o (𝑨𝑩𝑪𝑫)′ = 𝑫′𝑪′𝑩′𝑨′
o (𝑨 + 𝑩 + 𝑪)′ = 𝑨′ + 𝑩′ + 𝑪′
o (𝒌𝑨)′ = 𝒌𝑨′
o (𝑨−1 )′ = (𝑨′ )−1
o (𝑨′ )′ = 𝑨
o (𝑨−1 )−1 = 𝑨
▪ Traces
o tr(𝑐) = 𝑐, (𝑐 is a constant)
o tr(𝑨𝑩) = tr(𝑩𝑨)
o tr(𝑨𝑩𝑪) = tr(𝑪𝑨𝑩) = tr(𝑩𝑪𝑨)
o tr(𝑐𝑨) = 𝑐 × tr(𝑨)
o tr(𝑨′) = tr(𝑨)
o tr(𝑨 + 𝑩) = tr(𝑨) + tr(𝑩)
o 𝔼[tr(●)] = tr[𝔼(●)]
o tr(𝑰𝑛 ) = 𝑛
o tr(𝟎) = 0
▪ Determinants
o det[𝑑𝑖𝑎𝑔(𝜆, … , 𝜆)] = ∏𝑛𝑖=1 𝜆 = 𝜆𝑛
o det[𝑑𝑖𝑎𝑔(𝜆1 , … , 𝜆𝑛 )] = ∏𝑛𝑖=1 𝜆𝑖
o det[𝐼𝑛 ] = 1𝑛 = 1

2.12.7) Victor’s TA comments:


When looking for asymptotic distributional results, we don’t condition on the x’s when calculating expectations
and variances

If we are asked to compute the expectation or variance of an estimator, unless stated otherwise, we are being
asked to compute the conditional expectation and conditional variance.

2.12.8) Multivariate Normal Distribution of the y’s


$$
\mathbf{X} \sim \mathcal{N}_n(\boldsymbol{\mu}, \boldsymbol{\Sigma}),
\qquad
f_{\mathbf{X}}(\mathbf{x}) = (2\pi)^{-\frac{n}{2}}\,|\det(\boldsymbol{\Sigma})|^{-\frac{1}{2}}\,\exp\Big[-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\Big]
$$
• $\boldsymbol{\Sigma}$ ≡ $n \times n$ positive definite matrix; $|\boldsymbol{\Sigma}|$ is its determinant
• $\boldsymbol{\mu}$ ≡ $n \times 1$ vector of constants

Application: Linear Regression
$$
\boldsymbol{y} \sim \mathcal{N}(\boldsymbol{X\beta}, \boldsymbol{\Sigma}),
\qquad
f_{\mathbf{Y}}(\mathbf{y}) = (2\pi)^{-\frac{n}{2}}\,|\det(\boldsymbol{\Sigma})|^{-\frac{1}{2}}\,\exp\Big[-\tfrac{1}{2}\underbrace{(\boldsymbol{y}-\boldsymbol{X\beta})'}_{1\times n}\underbrace{\boldsymbol{\Sigma}^{-1}}_{n\times n}\underbrace{(\boldsymbol{y}-\boldsymbol{X\beta})}_{n\times 1}\Big]
$$
• $\boldsymbol{\Sigma}$ ≡ $n \times n$ positive definite matrix; $|\boldsymbol{\Sigma}|$ is its determinant
• $\boldsymbol{\mu} \equiv \mathbb{E}[\boldsymbol{y}\,|\,\boldsymbol{X}] = \boldsymbol{X\beta}$, an $n \times 1$ vector
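A small sketch (using scipy, which is assumed available; data and parameters are made up) evaluating this log-density for the homoskedastic case Σ = σ²Iₙ, once with scipy and once from the formula above:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(9)
n, sigma2 = 50, 0.25
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 1))])
beta = np.array([1.0, 2.0])
y = X @ beta + np.sqrt(sigma2) * rng.normal(size=n)

# log f_Y(y) for y ~ N(X beta, sigma^2 I_n), evaluated two equivalent ways
Sigma = sigma2 * np.eye(n)
ll_scipy = multivariate_normal.logpdf(y, mean=X @ beta, cov=Sigma)
r = y - X @ beta
ll_manual = -0.5 * n * np.log(2 * np.pi) - 0.5 * np.log(np.linalg.det(Sigma)) \
            - 0.5 * r @ np.linalg.inv(Sigma) @ r
assert np.isclose(ll_scipy, ll_manual)
```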
2.13) USEFUL TRICK FOR WORKING IN DEVIATIONS FROM THE MEAN

$$
\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n}(x_i - \bar{x})\,y_i
$$

Proof:

Simply expand it:


$$
\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})
= \sum_{i=1}^{n}(x_i - \bar{x})y_i - \sum_{i=1}^{n}(x_i - \bar{x})\bar{y}
= \sum_{i=1}^{n}(x_i - \bar{x})y_i - \bar{y}\underbrace{\sum_{i=1}^{n}(x_i - \bar{x})}_{0}
= \sum_{i=1}^{n}(x_i - \bar{x})y_i
$$

and note that:


$$
\sum_{i=1}^{n}(x_i - \bar{x}) = \sum_{i=1}^{n} x_i - \sum_{i=1}^{n}\bar{x} = n\bar{x} - n\bar{x} = 0
$$
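A one-line numerical confirmation of this identity (illustrative random data):

```python
import numpy as np

rng = np.random.default_rng(10)
x, y = rng.normal(size=100), rng.normal(size=100)

lhs = np.sum((x - x.mean()) * (y - y.mean()))
rhs = np.sum((x - x.mean()) * y)
assert np.isclose(lhs, rhs)
```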

2.14) SANDWICH FORMULAS


If matrix 𝐀 is a function of 𝐗: 𝐀(𝐗)

Var[𝐀𝐁 | 𝐗] = 𝐀 Var[𝐁|𝐗] 𝐀′

In general (whenever ● is nonstochastic, or is being conditioned on):

Var[●◈] = ● Var[◈] ●′
