A Bayesian Approach to
Identification of Structural VAR
Models
Christiane Baumeister
University of Notre Dame
CEPR and NBER

Training Course
Central Reserve Bank of Peru
March 13-15, 2023
What is Econometrics All About?
• Econometrics is concerned with the use of sample data to
learn about a phenomenon the researcher is interested in
→ use data to learn about unknown economic parameters
that capture the relationships between macro variables

• Think of a simple regression model: y = Xβ + ε

Goal: We want to learn about something unknown
(the regression coefficient)
given something known
(the data)

2
Why Bayesian?
• In many applications, the econometrician possesses, in
addition to the sample, other information about the
parameters:
– theoretical constraints on the parameter space: integrate theory
with empirics (e.g. identifying restrictions, stability constraints)
– previous empirical research: past samples, data from other
countries, micro data (e.g. surveys)
• Bayesian analysis allows us to:
– include non-sample information in estimation in a flexible way
→ Vector autoregressions (VARs)
→ Short time series, measurement error
→ Lag length
– account for uncertainty in a decision-making context (e.g. policy)
– analyze models that are intractable using classical methods
3
What Are The Goals of This Course?
• Chris Sims (2007):
Being Bayesian is more than a basket of “methods”,
it is a mindset.

What we want to do is to study methods and
applications of Bayesian inference to
develop and embrace this mindset.

4
Course Overview

• Bayesian Methods and Numerical Simulation Methods

• Bayesian Vector Autoregressive (BVAR) models

• Identification and structural (causal) analysis

5
Vector Autoregressive Models
• Workhorse models in empirical macroeconomics
– capture the dynamic interrelationships between variables
that represent the economy
– used for data description, forecasting, structural
dynamics, and policy & counterfactual analysis

• Bayesian estimation of VAR models
– Markov Chain Monte Carlo (MCMC) methods:
Gibbs sampling algorithm
– Choice of priors
→ Minnesota prior
→ normal-inverse Wishart prior
→ prior using dummy observations (data augmentation)
6
Identification and Structural Analysis
• The Identification Problem in Structural VARs
– Identification strategies: point and set identification
• A Bayesian Interpretation of Traditional Approaches
to Identification
– What prior information?
– Delay, sign, and boundary restrictions
• Inference when Identifying Assumptions are
Doubted
– How to compute credibility sets
– Prior information about structural coefficients and
impacts of shocks
7
Introduction to
Bayesian
Estimation
How do Classical and Bayesian Analysis
Differ?
Consider a simple model: y_t = μ + ε_t where ε_t ~ N(0, σ²)

Assume σ² is known → we want to estimate μ

95% confidence interval:

[ ȳ − 1.96 σ/√T ,  ȳ + 1.96 σ/√T ]
2
How do Classical and Bayesian Analysis
Differ?
1. Classical analysis
– μ is a fixed, unknown quantity (the “true value”)
– The estimator is a random variable and is
evaluated via repeated sampling → the interval
we constructed will contain the true value in 95%
of cases if we estimate it for a thousand different
samples taken from a population with given μ and
σ²
– The estimator is “best” in the sense of having the
highest probability of being close to the true μ
→ Probability is objective and is the limit of the
relative frequency of an event.

3
How do Classical and Bayesian Analysis
Differ?
2. Bayesian analysis
– μ is treated as a random variable → it has a
probability distribution
– The distribution summarizes our knowledge about
the model parameter → two sources of information:
→ Prior information (before seeing the data):
subjective belief about how likely different
parameter values are
→ Sample information: leads the researcher to
revise/update his prior beliefs
– Probabilities are subjective and not necessarily
related to the relative frequency of an event.
– Explicit use of probabilities to quantify
uncertainty.
4
Key Ingredients for Bayesian Analysis

1. Probabilities
Review some probability rules to derive Bayes’
rule
2. Initial information
What is the reason for using prior information?
How to specify a prior distribution for
parameters?
3. How to combine data and non-data (prior)
information?
Bayesian estimation in practice

5
Some Rules of Probability
Consider two random variables: A and B

The rules of probability imply: p(A, B) = p(A | B) p(B)

where
• p(A, B) is the joint probability of A and B
• p(A | B) is the probability of A occurring
conditional on B having occurred
• p(B) is the marginal probability of B

Alternatively, we can reverse the roles of A and B so that:

p(A, B) = p(B | A) p(A)
6
Bayes’ Rule
Equating the two expressions for the joint probability of
A and B provides us with Bayes’ rule:

p(A | B) = p(B | A) p(A) / p(B)

Let’s map this rule into a simple regression model where
we want to learn about a parameter θ given the data y:

p(θ | y) = p(y | θ) p(θ) / p(y)

7
A Closer Look at Each Component
Key object of interest: p(θ | y)

 p(y) marginal data density


Since we are interested in learning about
θ, we can ignore p(y) since it does not
involve θ.
 p(θ) prior density
It does not depend on the data y; instead,
it contains non-data information about θ.
 p(y | θ) likelihood function
It is the density of the data conditional
on the parameters.
8
The Posterior Distribution

p(θ | y) ∝ p(y | θ) p(θ)

“The posterior is proportional to the likelihood times the
prior.”
– The posterior summarizes all we know about θ
after seeing the data.
→ The posterior combines both data and non-
data information.
– The equation can be viewed as an updating rule
where the data allow us to update our prior views
about θ.

9
Skills for Bayesian Inference
Bayesian inference requires a good knowledge of:
• Probability distributions
 to formulate prior distributions
 to generate draws from them
 to analyze posterior distributions

• Numerical simulation techniques


 Gibbs sampling
 Metropolis-Hastings algorithm

10
More on Priors
• Two decisions with regard to priors:
1. Family of the prior distribution
2. Hyperparameters of the prior distribution
• In principle any distribution can be combined with the
likelihood to form the posterior.
• Conjugate priors
If a prior is conjugate, then the posterior belongs to the same family of
distributions as the prior. Very convenient!
• Natural conjugate priors
Additional property: they have the same functional form as the
likelihood function. The prior can be interpreted as arising from an
earlier data analysis.
11
The Linear Regression Model
• Consider the linear regression model with fixed
regressors:

Y = Xβ + ε,   ε ~ N(0, σ²I_T)

where Y and ε are T×1 vectors, β is K×1, and X is a T×K matrix of
exogenous variables and deterministic terms.

• Likelihood:

L(β, σ²|Y) = (2πσ²)^(−T/2) exp( −(1/(2σ²)) (Y − Xβ)′(Y − Xβ) )
12
Bayesian Analysis
• Idea 1: The parameters are random variables
with a probability distribution.
• Idea 2: A Bayesian estimate of this distribution combines
prior beliefs and information from the data.
– Step 1: Form prior beliefs about the parameters (based on past
experience or other studies) and express them in the form of
a probability distribution: p(θ)
– Step 2: Information contained in the data is summarized by the
likelihood function: L(θ|Y)
– Step 3: Bayes’ Rule gives the posterior distribution of the
parameters: p(θ|Y) ∝ L(θ|Y) p(θ)

13
Example 1: Inference of β when σ² known
Prior distribution of β
p(β|σ²) ~ N(β₀, Σ₀)

Prior density: (2π)^(−K/2) |Σ₀|^(−1/2) exp( −½ (β − β₀)′ Σ₀⁻¹ (β − β₀) )

p(β|σ²) ∝ exp( −½ (β − β₀)′ Σ₀⁻¹ (β − β₀) )

Example: β₀ = (1, 1)′ and Σ₀ = [ 1  0
                                 0 10 ]

Likelihood
L(β|σ², Y) ∝ exp( −(1/(2σ²)) (Y − Xβ)′(Y − Xβ) )
14
Combining prior density and likelihood
p(β|σ², Y) ∝ p(β|σ²) L(β|σ², Y)
∝ exp( −½ (β − β₀)′ Σ₀⁻¹ (β − β₀) − (1/(2σ²)) (Y − Xβ)′(Y − Xβ) )

Posterior distribution of β
p(β|σ², Y) ~ N(β₁, Σ₁)
where
β₁ = (Σ₀⁻¹ + σ⁻²X′X)⁻¹ (Σ₀⁻¹β₀ + σ⁻²X′Y)
   = (Σ₀⁻¹ + σ⁻²X′X)⁻¹ (Σ₀⁻¹β₀ + σ⁻²X′Xb)  with b = (X′X)⁻¹X′Y
Σ₁ = (Σ₀⁻¹ + σ⁻²X′X)⁻¹
15
Example 2: Inference of σ² when β known
• Recall: if z_i ~ N(0, δ⁻¹) for i = 1, 2, ..., ν,
then W = z₁² + z₂² + ⋯ + z_ν² ~ Γ(ν, δ),
with the density for the Gamma distribution given by:

p(W) ∝ W^(ν/2 − 1) exp(−δW/2)   for W > 0

where E(W) = ν/δ and Var(W) = 2ν/δ²

• Use this as a prior for the inverse of the variance (also
called the “precision”):

1/σ² ~ Γ(ν₀, δ₀)

16
Why Use this Prior?
1) defined only for positive values, as required for a precision
2) flexible family (different shapes)

17
Gamma Distributions with Mean Unity

[Figure: Gamma densities with mean ν/δ = 1 for (ν = 1, δ = 1), (ν = 2, δ = 2), and (ν = 4, δ = 4)]

18
Why Use this Prior?
3) It is the “natural conjugate prior” given the likelihood,
meaning that if the prior is 1/σ² ~ Γ(ν₀, δ₀),
then the posterior turns out to be 1/σ² | Y ~ Γ(ν₁, δ₁)

– If the prior were derived from an earlier data analysis, it would
have this form:
it is equivalent to having observed ν₀ observations with
sum of squared residuals δ₀
– This prior makes analytical treatment of the problem
tractable

19
Example 2: Inference of σ² when β known

Prior distribution of 1/σ²

p(1/σ²|β) ~ Γ(ν₀, δ₀)

Prior density: p(1/σ²|β) ∝ (1/σ²)^(ν₀/2 − 1) exp( −δ₀/(2σ²) )

Likelihood

L(1/σ²|β, Y) ∝ (1/σ²)^(T/2) exp( −(1/(2σ²)) (Y − Xβ)′(Y − Xβ) )

20
Combining prior density and likelihood
p(1/σ²|β, Y) ∝ p(1/σ²|β) L(1/σ²|β, Y)

∝ (1/σ²)^(ν₀/2 − 1) exp( −δ₀/(2σ²) ) (1/σ²)^(T/2) exp( −(1/(2σ²)) (Y − Xβ)′(Y − Xβ) )

= (1/σ²)^((ν₀ + T)/2 − 1) exp( −(1/(2σ²)) [δ₀ + (Y − Xβ)′(Y − Xβ)] )

Posterior distribution of 1/σ²

p(1/σ²|β, Y) ~ Γ(ν₁, δ₁)
where
ν₁ = ν₀ + T
δ₁ = δ₀ + (Y − Xβ)′(Y − Xβ)
21
What If All Parameters Are Unknown?
• Setting the prior: joint density for β and 1/σ²

p(β, 1/σ²) = p(β|σ²) p(1/σ²)

where p(β|σ²) ~ N(β₀, σ²Σ₀)
      p(1/σ²) ~ Γ(ν₀, δ₀)

• Setting up the likelihood function:
L(β, σ²|Y) ∝ (1/σ²)^(T/2) exp( −(1/(2σ²)) (Y − Xβ)′(Y − Xβ) )
22
What If All Parameters Are Unknown?
• Calculating the joint posterior distribution

p(β, 1/σ²|Y) ∝ p(Y, β, 1/σ²)

∝ (1/σ²)^(T/2) exp( −(1/(2σ²)) (Y − Xβ)′(Y − Xβ) )

× (1/σ²)^(ν₀/2 − 1) exp( −δ₀/(2σ²) )

× (1/σ²)^(K/2) exp( −(1/(2σ²)) (β − β₀)′ Σ₀⁻¹ (β − β₀) )

23
Posterior for β

p(β | σ², Y) ~ N(β*, σ²Σ*)

β* = (Σ₀⁻¹ + X′X)⁻¹ (Σ₀⁻¹β₀ + X′Y)

Σ* = (Σ₀⁻¹ + X′X)⁻¹

24
Posterior for β

β* = (Σ₀⁻¹ + X′X)⁻¹ (Σ₀⁻¹β₀ + X′Y)
Σ* = (Σ₀⁻¹ + X′X)⁻¹

• Diffuse prior: Σ₀⁻¹ → 0

β* → (X′X)⁻¹X′Y
σ²Σ* → σ²(X′X)⁻¹

= usual OLS formulas
25
Posterior for β

• Dogmatic prior: Σ₀ → 0

β* → β₀

→ posterior = prior
26
Posterior for β

• In general: β* is a matrix-weighted average of β₀
and b = (X′X)⁻¹X′Y, where the weights depend on
confidence in the prior (Σ₀⁻¹) and the strength of
evidence from the data (X′X)
27
Another Way to Interpret the Prior
• Suppose I had observed an earlier sample of T₀
observations:
{Y₀, X₀}
which were independent of the currently observed
sample:
{Y, X}

28
Another Way to Interpret the Prior
• Then my OLS estimate based on all information
would be:
β̂ = (X₀′X₀ + X′X)⁻¹ (X₀′Y₀ + X′Y)
with variance (given σ²) of:
σ² (X₀′X₀ + X′X)⁻¹
29
Another Way to Interpret the Prior
• Let β₀ be the OLS estimate based on the prior
sample alone:
β₀ = (X₀′X₀)⁻¹X₀′Y₀
and let σ²Σ₀ denote its variance:
σ²Σ₀ = σ² (X₀′X₀)⁻¹
30
Another Way to Interpret the Prior

With Σ₀⁻¹ = X₀′X₀ and X₀′Y₀ = Σ₀⁻¹β₀:

β̂ = (Σ₀⁻¹ + X′X)⁻¹ (Σ₀⁻¹β₀ + X′Y)

→ identical to the formula for the posterior mean β*

31
Another Way to Interpret the Prior

σ² (X₀′X₀ + X′X)⁻¹ = σ² (Σ₀⁻¹ + X′X)⁻¹ = σ²Σ*

→ identical to the formula for the posterior variance defined earlier

32
What About the Marginal Posterior for β?
• To make inference on β, we need to know the marginal
posterior:

p(β|Y) = ∫₀^∞ p(β, 1/σ²|Y) d(1/σ²)

• For this simple model under the natural conjugate prior
analytical results can be obtained:
→ multivariate Student t with ν* = ν₀ + T degrees of
freedom, mean β*, and scale matrix (δ*/ν*)Σ* as
defined before
• BUT » integration is hard
» with other prior distributions analytical derivation
of the joint and marginal posterior is not possible
33
Solution: Gibbs Sampling
• Suppose the parameter vector can be partitioned
as θ = (θ₁′, θ₂′, θ₃′)′ with the property that p(θ₁, θ₂, θ₃|Y) is
of unknown form but

p(θ₁|θ₂, θ₃, Y),  p(θ₂|θ₁, θ₃, Y),  p(θ₃|θ₁, θ₂, Y)

are of known form and we can easily sample from
these conditional distributions (the same idea works for
2, 4, or n blocks)
34
Gibbs Sampling: Theory
• What does that buy us?
– Theory suggests that if we obtain many samples
from these conditional distributions, then these will also
be samples from the joint posterior p(θ₁, θ₂, θ₃|Y)
(see Geman and Geman, 1984; Casella and George, 1992)
– Intuition: Bayes’ rule
p(θ₁|θ₂, θ₃, Y) = p(θ₁, θ₂, θ₃|Y) / p(θ₂, θ₃|Y)

– The marginal posterior distribution p(θ₁|Y) can be
approximated by the empirical distribution of the draws θ₁⁽ʲ⁾;
for example: the estimate of the mean of θ₁ is the sample
mean of the retained draws θ₁⁽ʲ⁾
35
Gibbs Sampling: Implementation
(1) Start with arbitrary initial guesses θ₂⁽⁰⁾, θ₃⁽⁰⁾
for θ₂ and θ₃.

(2) Generate: θ₁⁽¹⁾ from p(θ₁|θ₂⁽⁰⁾, θ₃⁽⁰⁾, Y)

θ₂⁽¹⁾ from p(θ₂|θ₁⁽¹⁾, θ₃⁽⁰⁾, Y)

θ₃⁽¹⁾ from p(θ₃|θ₁⁽¹⁾, θ₂⁽¹⁾, Y)
(3) Repeat step (2) for j = 2, ..., D
(4) Throw out the first D₀ draws (for D₀ large) and use the
remaining draws for inference
36
Back to our Regression Model
• Idea: By sampling repeatedly from the conditional
distributions p(β|σ², Y) and p(1/σ²|β, Y), we can
approximate the joint and marginal distributions of our
parameters of interest
• Steps:
1. Set priors and an initial guess for σ²
2. Sample β conditional on σ²
3. Sample 1/σ² conditional on β
4. Cycle through steps (2) and (3) a large number of times and keep
only the last D − D₀ draws
37
Application 1
• Linear regression model with one exogenous variable:
y_t = βx_t + ε_t,  ε_t ~ N(0, σ²),  t = 1, ..., T

• Gibbs sampling algorithm:

(1) a. Set priors: p(β) ~ N(β₀, Σ₀) and p(1/σ²) ~ Γ(ν₀, δ₀)
Prior hyperparameters: β₀, Σ₀, ν₀, δ₀

b. Set a starting value σ²⁽⁰⁾ for the first iteration

38
Application 1

(2) At iteration j, conditional on the draw σ²⁽ʲ⁻¹⁾, draw β⁽ʲ⁾ from
N(β₁, Σ₁)

where
β₁ = (Σ₀⁻¹ + σ⁻²⁽ʲ⁻¹⁾X′X)⁻¹ (Σ₀⁻¹β₀ + σ⁻²⁽ʲ⁻¹⁾X′Y)
Σ₁ = (Σ₀⁻¹ + σ⁻²⁽ʲ⁻¹⁾X′X)⁻¹

(3) Conditional on the draw β⁽ʲ⁾, draw 1/σ²⁽ʲ⁾ from Γ(ν₁, δ₁)

where
ν₁ = ν₀ + T
δ₁ = δ₀ + (Y − Xβ⁽ʲ⁾)′(Y − Xβ⁽ʲ⁾)
39
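A minimal MATLAB sketch of this algorithm, using simulated data; all values and variable names below are illustrative assumptions, not part of the application itself. In slide notation, Γ(ν, δ) corresponds to gamrnd with shape ν/2 and scale 2/δ.

% Gibbs sampler for y = X*beta + eps, eps ~ N(0, sigma2) (sketch)
rng(1); T = 200; X = [ones(T,1) randn(T,1)]; K = size(X,2);
y = X*[1; 0.5] + randn(T,1);          % simulated data, true beta = (1, 0.5)'

b0 = zeros(K,1); S0 = 10*eye(K);      % prior for beta: N(b0, S0)
nu0 = 4; delta0 = 4;                  % prior for 1/sigma2: Gamma(nu0, delta0)
D = 11000; D0 = 1000; sigma2 = 1;     % draws, burn-in, starting value
draws = zeros(D, K+1);

for j = 1:D
    % (2) draw beta | sigma2, Y from N(beta1, Sigma1)
    Sigma1 = inv(inv(S0) + X'*X/sigma2);
    beta1  = Sigma1*(S0\b0 + X'*y/sigma2);
    beta   = beta1 + chol(Sigma1)'*randn(K,1);
    % (3) draw 1/sigma2 | beta, Y from Gamma(nu1, delta1)
    u = y - X*beta;
    nu1 = nu0 + T; delta1 = delta0 + u'*u;
    sigma2 = 1/gamrnd(nu1/2, 2/delta1);
    draws(j,:) = [beta' sigma2];
end
draws = draws(D0+1:end,:);            % (4) discard burn-in
disp(mean(draws))                     % posterior means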
How to Take Draws
• Normal distribution
To sample a vector β from N(m, V), generate a
draw z₀ from the standard normal distribution (randn in
Matlab) and then apply the following transformation:

β = m + V^(1/2) z₀

– V^(1/2) is said to be a square root of V if the matrix product
V^(1/2) (V^(1/2))′ = V
– For positive-definite matrices, one way to obtain the
square root is the Choleski decomposition (chol in
Matlab)
40
How to Take Draws
• Inverse gamma distribution

To sample a scalar σ² from an inverse gamma with ν degrees
of freedom and scale parameter δ, there are 2 options:
– generate ν numbers from s ~ N(0, 1) and apply the following
transformation:
σ² = δ / (s′s)
– generate a draw s̄ from a gamma with degrees of freedom ν and scale
parameter δ (gamrnd in Matlab) and compute
σ² = 1/s̄
41
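A short MATLAB sketch of both draw types on this slide plus the normal draw from the previous one (ν, δ, m, V are illustrative):

% inverse gamma draws, two equivalent options (sketch)
nu = 10; delta = 5;
s = randn(nu,1);   sigma2_a = delta/(s'*s);        % option 1: nu standard normals
sbar = gamrnd(nu/2, 2/delta); sigma2_b = 1/sbar;   % option 2: gamma draw, then invert

% multivariate normal draw via the Cholesky square root
m = [0; 0]; V = [1 0.5; 0.5 2];
beta = m + chol(V)'*randn(2,1);                    % chol(V)' satisfies P*P' = V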
Posterior Distribution

[Figure: posterior distributions approximated by the retained Gibbs draws]

42
Bayesian Analysis of
Structural VAR Models
The Identification Problem
• Classic questions in empirical macroeconomics:
 What is the effect of a policy intervention (interest
rate increase, fiscal stimulus) on macroeconomic
aggregates (output, inflation, employment,…)?
 Which structural shocks drive aggregate
fluctuations?

• What we are interested in is the dynamic causal


effect of structural shock on a vector of
macro time series :

,
2
Dynamic structural model:
A y_t = α + B₁ y_{t−1} + ⋯ + B_m y_{t−m} + D^{1/2} v_t
(A: n×n, y_t: n×1, α: n×1, B_j: n×n, D: n×n)

v_t ~ i.i.d. N(0, I_n)

x_{t−1} = (1, y′_{t−1}, y′_{t−2}, ..., y′_{t−m})′

D^{1/2} = diag( d₁₁^{1/2}, d₂₂^{1/2}, ..., d_nn^{1/2} )
3
Example: supply and demand
supply: q_t = α_s + β_s p_t + b_{s11} p_{t−1} + b_{s12} q_{t−1} + b_{s21} p_{t−2}
+ b_{s22} q_{t−2} + ⋯ + b_{sm1} p_{t−m} + b_{sm2} q_{t−m} + d_s^{1/2} v_{st}
demand: q_t = α_d + β_d p_t + b_{d11} p_{t−1} + b_{d12} q_{t−1} + b_{d21} p_{t−2}
+ b_{d22} q_{t−2} + ⋯ + b_{dm1} p_{t−m} + b_{dm2} q_{t−m} + d_d^{1/2} v_{dt}

y_t = ( q_t )      A = [ 1  −β_s ]
      ( p_t )          [ 1  −β_d ]
4
Reduced-form (can easily estimate):
y_t = c + Φ₁ y_{t−1} + ⋯ + Φ_m y_{t−m} + ε_t
ε_t ~ i.i.d. N(0, Ω)

Φ̂_T = [ Σ_{t=1}^T y_t x′_{t−1} ] [ Σ_{t=1}^T x_{t−1} x′_{t−1} ]⁻¹

x_{t−1} = (1, y′_{t−1}, y′_{t−2}, ..., y′_{t−m})′
Φ = [ c  Φ₁  Φ₂  ⋯  Φ_m ]
ε̂_t = y_t − Φ̂_T x_{t−1}

Ω̂_T = T⁻¹ Σ_{t=1}^T ε̂_t ε̂′_t

5
Reminder of Bayesian Principles
• Bayesian idea: before observing a data sample Y_T,
the researcher has some beliefs about how likely
different parameter values are, which can be
expressed in the form of a distribution
→ prior density
• Combine prior information with the information in the data
through the likelihood function to obtain the posterior
distribution:
p(θ | Y_T) ∝ p(Y_T | θ) p(θ)

“The posterior is proportional to the likelihood
times the prior.”
6
Why Bayesian Estimation of VARs?
• Given the large number of parameters in VARs,
estimates of objects of interest (e.g. impulse
responses, forecasts) can become imprecise in large
models.
– need restrictions (e.g. lag length, which variables
to include)
– Bayesian philosophy: use “soft” restrictions
→ guide estimates toward prior restrictions,
but don’t insist
→ incorporating prior information generally
yields more precise estimates
• Bayesian simulation methods provide an easy way to
characterize estimation uncertainty.
7
Prior for VAR Coefficients
Stacking the T observations, the system can be
written as:  Y = XΦ + ε
(Y: T×n, X: T×k, Φ: k×n, ε: T×n)

Define φ = vec(Φ) and assume that the prior
for the VAR coefficients φ is normal:
p(φ) = N(φ₀, V₀)
Conditional on Ω, the posterior for the VAR
coefficients is normal:
p(φ | Ω, Y) = N(φ*, V*)

where V* = [ V₀⁻¹ + Ω⁻¹ ⊗ X′X ]⁻¹

φ* = V* [ V₀⁻¹ φ₀ + (Ω⁻¹ ⊗ X′X) φ̂_OLS ]
8
Prior for Variance Matrix
• Univariate regression:
σ² = E(ε_t²)
Let z_i ~ N(0, λ⁻¹) for i = 1, 2, ..., ν;
then W = z₁² + z₂² + ⋯ + z_ν² ~ Γ(ν, λ)
• Multivariate regression:
Ω = E(ε_t ε′_t)
Now z_i ~ N(0, Λ⁻¹) where z_i is n×1;
then W = z₁z₁′ + z₂z₂′ + ⋯ + z_ν z_ν′ ~ W(ν, Λ)
(a Wishart distribution)
10
Prior for Variance Matrix
• The conjugate prior for the VAR covariance matrix Ω
is an inverse Wishart distribution:
p(Ω) = IW(ν₀, S₀)
where ν₀ are the prior degrees of freedom
and S₀ is the prior scale matrix

• Conditional on Φ, the posterior for Ω is
also inverse Wishart:
p(Ω ∣ Φ, Y) = IW(ν₀ + T, S*)
where S* = S₀ + (Y − XΦ)′(Y − XΦ)
11
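The two conditional posteriors above are all that is needed for one Gibbs iteration of the BVAR. A sketch in MATLAB, assuming Y (T×n), X (T×k), prior hyperparameters phi0, V0, nu0, S0, and a current draw Omega are already in memory (iwishrnd is from the Statistics Toolbox):

% one Gibbs iteration for the VAR with normal / inverse Wishart priors (sketch)
[T,n] = size(Y); k = size(X,2);
phi_ols = X\Y; phi_ols = phi_ols(:);            % vec of the OLS estimates

% draw phi | Omega, Y ~ N(phi_star, V_star)
iOm = inv(Omega);
V_star   = inv(inv(V0) + kron(iOm, X'*X));
phi_star = V_star*(V0\phi0 + kron(iOm, X'*X)*phi_ols);
phi = phi_star + chol(V_star)'*randn(n*k,1);
Phi = reshape(phi, k, n);

% draw Omega | Phi, Y ~ IW(nu0 + T, S_star)
U = Y - X*Phi;
S_star = S0 + U'*U;
Omega = iwishrnd(S_star, nu0 + T);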
Key Prior Distributions for VARs
1. Minnesota prior (Litterman, 1980; Doan, Litterman,
and Sims, 1984)
2. Normal-inverse Wishart prior (see Uhlig, JME
2005)
3. Independent Normal-inverse Wishart prior
4. Dummy observations (see Banbura, Giannone
and Reichlin, JAE 2010)
5. Steady state priors (see Villani, JAE 2009)

12
Minnesota Prior
Structured prior beliefs:
1) The endogenous variables in the VAR follow a
random walk or an AR(1) process.
Example: bivariate VAR(2) model

[ y_t ]   [ c₁ ]   [ b₁₁ b₁₂ ][ y_{t−1} ]   [ d₁₁ d₁₂ ][ y_{t−2} ]   [ ε₁ₜ ]
[ x_t ] = [ c₂ ] + [ b₂₁ b₂₂ ][ x_{t−1} ] + [ d₂₁ d₂₂ ][ x_{t−2} ] + [ ε₂ₜ ]

Prior belief:
[ y_t ]   [ 0 ]   [ b⁰₁₁  0  ][ y_{t−1} ]   [ 0 0 ][ y_{t−2} ]   [ ε₁ₜ ]
[ x_t ] = [ 0 ] + [ 0   b⁰₂₂ ][ x_{t−1} ] + [ 0 0 ][ x_{t−2} ] + [ ε₂ₜ ]

Prior mean: φ₀ = (0, b⁰₁₁, 0, 0, 0, 0, 0, b⁰₂₂, 0, 0)′

Under the RW assumption: b⁰₁₁ = b⁰₂₂ = 1
13
Minnesota Prior
2) The variance of the prior for the VAR coefficients
is set based on the following observations:

a. Greater confidence that coefficients on higher


lags are zero.
b. Greater confidence that coefficients other than
own lags are zero.

14
To make this operational, define a set of hyperparameters that
control the tightness of the prior:

• λ₁: controls the standard deviation of the prior on own lags
(overall confidence in the prior)
→ as λ₁ → 0, greater weight is given to RW or AR(1)

• λ₂: controls the standard deviation of the prior on lags of
variables other than the dependent variable
→ with λ₂ = 1, no distinction between own and other lags

• λ₃: controls the degree to which coefficients on lags higher
than 1 are likely to be zero
→ for λ₃ = 0, all lags are given equal weight

• λ₄: controls the prior variance of the constant
→ as λ₄ → 0, constant terms are shrunk to zero
15
Formulas for Minnesota Prior
• Variance of the prior on own lag ℓ of the dependent variable:
( λ₁ / ℓ^{λ₃} )²
• Variance of the prior on lag ℓ of variable j in equation i (i ≠ j):
( σᵢ λ₁ λ₂ / (σⱼ ℓ^{λ₃}) )²
• Variance of the prior for the constant in equation i:
( σᵢ λ₄ )²
σᵢ and σⱼ are the standard deviations of the error
terms from AR regressions estimated via OLS
→ the ratio of σᵢ and σⱼ accounts for the possibility that
variables i and j may have different scales
16
What does V₀ look like?
For the bivariate VAR(2), V₀ is the 10×10 diagonal matrix with
diagonal elements (equation 1 first, then equation 2):

(σ₁λ₄)²,  λ₁²,  (σ₁λ₁λ₂/σ₂)²,  (λ₁/2^{λ₃})²,  (σ₁λ₁λ₂/(σ₂·2^{λ₃}))²,
(σ₂λ₄)²,  (σ₂λ₁λ₂/σ₁)²,  λ₁²,  (σ₂λ₁λ₂/(σ₁·2^{λ₃}))²,  (λ₁/2^{λ₃})²

Typical values for the hyperparameters λ₁, λ₂, λ₃, λ₄ are used in the
literature (see Doan 2013).
17
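A minimal sketch of how this diagonal can be assembled in MATLAB for the bivariate VAR(2) above (the hyperparameter and σ values are illustrative assumptions):

% diagonal of the Minnesota prior variance V0, bivariate VAR(2) with constant (sketch)
n = 2; m = 2; lam = [0.2 0.5 1 100]; sig = [1.2 0.8];   % [lam1 lam2 lam3 lam4], AR residual std devs
v0 = zeros(n*(1+n*m),1); pos = 0;
for i = 1:n                                % equation i
    pos = pos + 1;
    v0(pos) = (sig(i)*lam(4))^2;           % constant term
    for L = 1:m                            % lag L
        for j = 1:n                        % variable j
            pos = pos + 1;
            if i == j
                v0(pos) = (lam(1)/L^lam(3))^2;                        % own lag
            else
                v0(pos) = (sig(i)*lam(1)*lam(2)/(sig(j)*L^lam(3)))^2; % other lags
            end
        end
    end
end
V0 = diag(v0);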
Prior Using Dummy Observations
• The computation of the moments of the conditional
posterior distribution for the VAR coefficients requires
the computation of the inverse of an nk × nk matrix:

V* = [ V₀⁻¹ + Ω⁻¹ ⊗ X′X ]⁻¹

• Alternative approach: use dummy observations to
represent prior densities
• Dummy observations can be obtained as follows:
1. Use actual observations from other countries or a pre-sample
2. Use observations generated by simulating a macro model
3. Use observations generated from “introspection”
⇒ represent by hyperparameters
18
Prior Using Dummy Observations
• Augment the actual data Y and X with "artificial"
data Y_D and X_D:  Y* = [Y; Y_D] and X* = [X; X_D]

• For example:
– Prior mean: Φ₀ = (X_D′X_D)⁻¹ X_D′Y_D
– Posterior mean: Φ* = (X*′X*)⁻¹ X*′Y*

→ the weight placed on the artificial data determines
how much confidence the researcher has in the prior
19
System Dynamics
Reduced-form VAR(m) model:
y_t = c + Φ₁ y_{t−1} + ⋯ + Φ_m y_{t−m} + ε_t
For notational convenience, we
introduce the lag operator:
L¹y_t ≡ y_{t−1},  L²y_t ≡ y_{t−2},  ...,  Lᵐy_t ≡ y_{t−m}
Rewrite the VAR(m) in this form:
y_t = c + Φ₁L¹y_t + ⋯ + Φ_m Lᵐ y_t + ε_t
(I_n − Φ₁L¹ − ⋯ − Φ_m Lᵐ) y_t = c + ε_t
Φ(L) y_t = c + ε_t
20
Mean μ of this vector process y_t is:
μ = (I_n − Φ₁ − Φ₂ − ⋯ − Φ_m)⁻¹ c

Rewrite the VAR(m) in deviations from the mean:

y_t − μ = Φ₁(y_{t−1} − μ) + Φ₂(y_{t−2} − μ)
+ ⋯ + Φ_m(y_{t−m} − μ) + ε_t

ξ_t ≡ [ (y_t − μ)′, (y_{t−1} − μ)′, ..., (y_{t−m+1} − μ)′ ]′   (nm × 1)

e_t ≡ [ ε_t′, 0′, ..., 0′ ]′   (nm × 1)
21
Rewriting a VAR(m) as a VAR(1)
Companion matrix:
      Φ₁   Φ₂   Φ₃  ⋯  Φ_{m−1}  Φ_m
      I_n  0    0   ⋯  0        0
F ≡   0    I_n  0   ⋯  0        0      (nm × nm)
      ⋮    ⋮    ⋮       ⋮        ⋮
      0    0    0   ⋯  I_n      0

ξ_t = F ξ_{t−1} + e_t
22
Stability Condition

ξ_t = F ξ_{t−1} + e_t

Iterating the system s periods forward, we get:

ξ_{t+s} = e_{t+s} + F e_{t+s−1} + F² e_{t+s−2} + ⋯ + F^{s−1} e_{t+1} + F^s ξ_t

→ if the eigenvalues of F all lie inside the
unit circle, the VAR model is stable
→ implies: any shock must eventually die out

23
Vector MA(∞) Representation

Mapping this back into the original system:

y_{t+s} = μ + ε_{t+s} + Ψ₁ ε_{t+s−1} + ⋯ + Ψ_{s−1} ε_{t+1}
+ F₁₁⁽ˢ⁾ (y_t − μ) + ⋯ + F₁ₘ⁽ˢ⁾ (y_{t−m+1} − μ)
where
Ψ_j = F₁₁⁽ʲ⁾
with F₁₁⁽ʲ⁾ being the top left n × n block of F
raised to the j-th power

24
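A short MATLAB sketch of how the Ψ_s can be computed from powers of the companion matrix (assumes the reduced-form lag matrices are stacked in Phi = [Φ₁ ... Φ₂ ... Φ_m], an n × nm array):

% reduced-form impulse responses Psi_s from the companion matrix (sketch)
n = size(Phi,1); m = size(Phi,2)/n; S = 24;
F = [Phi; eye(n*(m-1)) zeros(n*(m-1),n)];   % companion matrix
Psi = zeros(n,n,S+1); Fs = eye(n*m);
for s = 0:S
    Psi(:,:,s+1) = Fs(1:n,1:n);             % Psi_s = top-left n x n block of F^s
    Fs = Fs*F;
end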
Nonorthogonalized Impulse Responses

∂y_{t+s} / ∂ε′_t = Ψ_s

• The moving average coefficients are called dynamic
multipliers:
they allow us to see how a one-time shock
propagates through time, i.e. how a shock affects
the endogenous variables in the system at a
specific horizon

25
Structural model:
A y_t = B x_{t−1} + D^{1/2} v_t,   v_t ~ i.i.d. N(0, I_n)
Structural impulse responses:
∂y_{t+s}/∂v′_t = (∂y_{t+s}/∂ε′_t)(∂ε_t/∂v′_t) = Ψ_s H,   Ψ₀ = I_n

∂ε_t/∂v′_t = H = A⁻¹ D^{1/2}

Reduced form:
y_t = Φ x_{t−1} + ε_t,   ε_t ~ i.i.d. N(0, Ω)
Φ = A⁻¹ B
ε_t = A⁻¹ D^{1/2} v_t = H v_t
E(ε_t ε′_t) = Ω = A⁻¹ D (A⁻¹)′
26
The Identification Problem

Ω = A⁻¹ D (A⁻¹)′
Supply and demand example:
4 structural parameters in A and D
(β_s, β_d, d_s, d_d)
BUT can only estimate 3 parameters in Ω by OLS
(ω₁₁, ω₁₂, ω₂₂)

27
The Identification Problem

[Figure: illustration of the identification problem in the supply-and-demand example]

28
What is the Problem?
Structural model:
A y_t = α + B₁ y_{t−1} + ⋯ + B_m y_{t−m} + u_t,   u_t = D^{1/2} v_t
u_t ~ i.i.d. N(0, D),   D diagonal
Intuition:
If we knew row i of A (denoted a′_i),
then we could estimate the coefficients of the
i-th structural equation (b′_i) by an OLS
regression of a′_i y_t on x_{t−1}:

b̂_i = [ Σ_{t=1}^T x_{t−1} x′_{t−1} ]⁻¹ [ Σ_{t=1}^T x_{t−1} y′_t ] a_i = Φ̂′_T a_i

d̂_ii = a′_i Ω̂_T a_i,   D̂ = A Ω̂_T A′ (diagonal)
29
Traditional Approach to Identification

Put enough restrictions on A and D
so that for any Ω there is a unique
A and D for which Ω = A⁻¹ D (A⁻¹)′

→ Point identification

30
Popular Identification Strategies for
Exact Identification
• Recursive ordering of the variables based on timing
assumptions (Sims 1986)
• Short-run structural relationships (Bernanke 1986,
Blanchard and Watson 1986)
• Separating transitory from permanent components by
assuming long-run structural relationships (Blanchard
and Quah 1989)
• Combination of short-run and long-run restrictions
(Galí 1992)
31
Example
• Assume that the short-run price elasticity of supply β_s = 0

A = [ 1   0   ]
    [ 1  −β_d ]

→ means A and A⁻¹ are lower triangular

• If D is diagonal, then H is also lower triangular:
∂y_t/∂v′_t = H = A⁻¹ D^{1/2}

• H = Cholesky factor of Ω: the unique
lower triangular matrix such that HH′ = Ω
32
Estimate from reduced form:
Ω̂_T = T⁻¹ Σ_{t=1}^T ε̂_t ε̂′_t

PP′ = Ω̂_T

Ψ_s = ∂y_{t+s} / ∂ε′_t

H = P
Then we infer the dynamic structural responses:
∂y_{t+s}/∂v′_t = (∂y_{t+s}/∂ε′_t)(∂ε_t/∂v′_t) = Ψ_s P

33
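In MATLAB this amounts to two extra steps on top of the previous companion-matrix sketch (Omega_hat and the Psi array are assumed given):

% structural responses under the Cholesky identification (sketch)
P = chol(Omega_hat)';                  % lower triangular, P*P' = Omega_hat
IRF = zeros(size(Psi));
for s = 1:size(Psi,3)
    IRF(:,:,s) = Psi(:,:,s)*P;         % dy_{t+s}/dv_t' = Psi_s * P
end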
Point Identification: Example
• Application 2: Simple bivariate supply and demand
model of the global oil market

y_t = (Δq_t, p_t)′
Δq_t = oil production growth
p_t = real price of oil
→ monthly VAR(24) for 1975M2 to 2007M12

Compute IRFs for oil supply and oil demand shocks
assuming that the supply elasticity β_s = 0
34
Identification Using Inequality Constraints

• The assumption that the short-run supply elasticity is zero is
very strong
→ Can we make inference using weaker assumptions?

35
Identification Using Inequality Constraints
• We may have confidence in signs:

H = [ ∂q_t/∂v_dt   ∂q_t/∂v_st ]   [ +  − ]
    [ ∂p_t/∂v_dt   ∂p_t/∂v_st ] = [ +  + ]

Set identification: want the set of all n × n H such that
(1) HH′ = Ω
(2) H satisfies certain sign restrictions
(3) columns of H are orthogonal to each other

36
How Do We Obtain Such an H?
• Claim: the set of all H such that HH′ = Ω
is the set defined by H = PQ where Q is
any orthogonal matrix (an n × n matrix
satisfying QQ′ = I_n):
HH′ = PQ(PQ)′ = PQQ′P′ = P I_n P′ = Ω
• Q is called a rotation matrix because it allows us
to "rotate" the initial Cholesky factor (recursive
matrix) while maintaining the property that the
shocks are uncorrelated
37
One strategy:

(1) Generate a million matrices Q⁽ʲ⁾, j = 1, ..., 10⁶,
drawn "uniformly" from the set of all orthogonal
matrices.

(2) For each Q⁽ʲ⁾ calculate H⁽ʲ⁾ = P̂Q⁽ʲ⁾ for
P̂P̂′ = Ω̂.
(3) Keep H⁽ʲ⁾ if it satisfies the restrictions.

38
How to Generate a Draw for Q?

Rubio-Ramírez, Waggoner, and Zha (2010)
algorithm to generate Q⁽ʲ⁾:
(1) Generate an n × n matrix X⁽ʲ⁾ of
independent N(0, 1) variables
(2) Calculate the QR decomposition
X⁽ʲ⁾ = Q⁽ʲ⁾R⁽ʲ⁾ where Q⁽ʲ⁾ is orthogonal
and R⁽ʲ⁾ is upper triangular
(e.g. use the qr command in Matlab)
39
Problem:
If we do this for fixed Ω̂, the resulting set
just reports uncertainty resulting from
incomplete identification.
But we also don’t know the true Ω.

40
Solution:
(i) Generate (Ω⁽ʲ⁾)⁻¹ from a Wishart with
T − p degrees of freedom and scale matrix (TΩ̂)⁻¹.
(ii) Calculate P⁽ʲ⁾ with P⁽ʲ⁾P⁽ʲ⁾′ = Ω⁽ʲ⁾ and H⁽ʲ⁾ = P⁽ʲ⁾Q⁽ʲ⁾.

41
Traditional Sign Restriction Algorithm
Step 1 Take a draw (Φ, Ω) from the posterior
of the reduced-form coefficients
Step 2 Compute the Cholesky factor P of Ω
Step 3 Generate an n × n matrix X = (x_ij) from N(0, 1)
Step 4 Take the QR decomposition X = QR for
Q orthogonal and R upper triangular. Normalize
the elements in Q such that the diagonal entries
of R are positive.
Step 5 Compute IRFs using H = PQ
Step 6 Keep H if it satisfies the sign restrictions; otherwise,
discard it.
43
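A MATLAB sketch of Steps 2-6 for the bivariate example, assuming a reduced-form posterior draw (Phi, Omega) is already in hand; the sign pattern checked is the one shown earlier (demand shock: q and p up; supply shock: q down, p up):

% one pass of the traditional sign-restriction algorithm (sketch)
P = chol(Omega)';                     % Step 2: Cholesky factor
X = randn(2,2);                       % Step 3
[Q,R] = qr(X);                        % Step 4: QR decomposition
Q = Q*diag(sign(diag(R)));            % normalize so diag(R) >= 0
H = P*Q;                              % Step 5: candidate impact matrix
ok = H(1,1)>0 && H(2,1)>0 && H(1,2)<0 && H(2,2)>0;   % Step 6: sign check
if ok
    % keep H (e.g., compute and store impulse responses Psi_s*H)
end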
Let’s Look at the Algorithm
• Application 2: Simple bivariate supply and demand
model of the global oil market

y_t = (Δq_t, p_t)′
Δq_t = oil production growth
p_t = real price of oil
→ monthly VAR(24) for 1975M2 to 2007M12

Question: What happens if we don’t impose
ANY restrictions at all?
44
[Figure: Impulse responses for one-standard-deviation
supply and demand shocks]

45
[Figure: Histogram of impact effect of one-standard-
deviation shocks: bivariate model]

46
[Figure: Histogram of impact effect of one-standard-
deviation shocks: 6-variable VAR]

What is going on here?

47
A Closer Look at Q
• Q is an orthogonal matrix, which means
that its columns and rows are orthogonal
unit vectors (orthonormal vectors)
• First column of Q = first column of X
normalized to have unit length:

q₁₁ = x₁₁ / √(x₁₁² + ⋯ + x_n1²)
⋮
q_n1 = x_n1 / √(x₁₁² + ⋯ + x_n1²)
47
Bivariate Case

[Figure: unit circle with the point (q₁₁, q₂₁) and the angle θ measured from the x-axis]

q₁₁ = x₁₁ / √(x₁₁² + x₂₁²)
q₂₁ = x₂₁ / √(x₁₁² + x₂₁²)
θ is the angle between (1, 0) and (x₁₁, x₂₁)
48
Some Trigonometry

[Figure: right triangle with hypotenuse c, opposite side b, adjacent side a, and angle θ]

cosine of the angle θ gives the length of the x-component:

cos θ = adjacent/hypotenuse = a/c

sine of the angle θ gives the length of the y-component:

sin θ = opposite/hypotenuse = b/c

Pythagoras: a² + b² = c²  →  c = √(a² + b²)
49
Rotation Matrix
q₁₁ = x₁₁ / √(x₁₁² + x₂₁²) = cos θ
q₂₁ = x₂₁ / √(x₁₁² + x₂₁²) = sin θ

     [ cos θ   −sin θ ]
     [ sin θ    cos θ ]   with prob 1/2
Q =
     [ cos θ    sin θ ]
     [ sin θ   −cos θ ]   with prob 1/2

θ ~ U(−π, π)
50
q i1  x i1 / x 211    x 2n1
q 2i1  x 2i1 /x 211    x 2n1 
Recall: x 11  N0, 1  x 211   2 1
 x 211    x 2n1    2 n
In general: if X   2  and Y   2  are independent,

then X
XY
 Beta 2 , 2

 q 2i1  Beta1/2, n − 1/2
Γn/2
Γ1/2Γn−1/2
1 − q 2i1  n−3/2 if q i1 ∈ −1, 1
pq i1  
0 otherwise
h 11  p 11 q 11   11 q 11
51
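This implicit prior is easy to verify by simulation. A MATLAB sketch comparing simulated draws of q₁₁ with the analytic Beta-based density (n = 6 is an illustrative choice):

% simulate the implicit prior on q_11 and overlay the analytic density (sketch)
n = 6; N = 1e5;
X = randn(n,N);
q11 = X(1,:)./sqrt(sum(X.^2,1));
histogram(q11, 'Normalization','pdf'); hold on
g = linspace(-1,1,200);
pdf_q = gamma(n/2)/(gamma(1/2)*gamma((n-1)/2))*(1-g.^2).^((n-3)/2);
plot(g, pdf_q, 'r', 'LineWidth', 1.5)      % analytic density of q_11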
Impact effect of one-standard-deviation
shock on variable i: analytic distribution

[Figure: analytic prior densities of the impact effect]

52
Impact effect of one-standard-deviation
shock on variable i: evidence

Ω̂_T = [ 2.28   −0.47 ]     √ω̂₁₁ = 1.51 (bivariate)
       [ −0.47  32.26 ]
√ω̂₁₁ = 2.94 (6-variable VAR)
53
Take-Away #1
• A prior that is UNINFORMATIVE about a parameter
(here: the angle of rotation θ) is in general
informative about nonlinear transformations of θ

→ there is an informative prior for IRFs
implicit in the traditional sign restriction
algorithm

55
Other Objects of Interest
• Suppose we are interested in the effect on quantity of a shock that
raises the price by 1%
• In the bivariate case with y_t = (q_t, p_t)′, we normalize
the impact matrix H by dividing the first column by its
first element:

H̃ = [ 1         ... ]
    [ h₂₁/h₁₁   ... ]

→ this ratio is called a price elasticity

56
[Figure: Impact effect of 1% increase in price
on quantity (without sign restrictions)]

56
What’s the implicit prior distribution here?
H = PQ
[ h₁₁  h₁₂ ]   [ p₁₁   0  ] [ q₁₁  q₁₂ ]
[ h₂₁  h₂₂ ] = [ p₂₁  p₂₂ ] [ q₂₁  q₂₂ ]
If we normalize shock 1 as something
that raises variable 1 by 1 unit:

h*₂₁ = h₂₁/h₁₁ = (p₂₁q₁₁ + p₂₂q₂₁)/(p₁₁q₁₁) = p₂₁/p₁₁ + (p₂₂/p₁₁)(x₂₁/x₁₁)

x₂₁/x₁₁ ~ Cauchy(0, 1)
→ h*_ij | Ω ~ Cauchy(c*_ij, σ*_ij)
location parameter: c*_ij = ω_ij/ω_jj
scale parameter: σ*_ij = √(ω_ii − ω_ij²/ω_jj) / √ω_jj
57
Impact effect on variable i of shock
that increases j by one unit

[Figure: implied Cauchy prior densities]

58
What Happens If We Impose
Sign Restrictions?
• Sign restrictions confine these distributions to particular
regions but do not change their basic features.
• Apply the traditional algorithm to an 8-lag VAR fit to growth rates
of U.S. real compensation per worker and of U.S.
employment for the period 1970:Q1-2014:Q2
→ Application 3
• Identify supply and demand shocks by sign restrictions:

        demand   supply
Δw_t:     +        +
Δn_t:     +        −
59

Implied elasticity of labor demand (h*₂₂ = h₂₂/h₁₂)

[Figure]
Red = truncated Cauchy
Blue = output of traditional algorithm
60

Implied elasticity of labor supply (h*₂₁ = h₂₁/h₁₁)

[Figure]
Red = truncated Cauchy
Blue = output of traditional algorithm
61
What’s the Nature of this Truncation?
H = PQ
[ h₁₁  h₁₂ ]   [ p₁₁   0  ] [ cos θ    sin θ ]
[ h₂₁  h₂₂ ] = [ p₂₁  p₂₂ ] [ sin θ   −cos θ ]

  [ p₁₁ cos θ                p₁₁ sin θ             ]
= [ p₂₁ cos θ + p₂₂ sin θ    p₂₁ sin θ − p₂₂ cos θ ]

variable 1 = wage, variable 2 = employment
shock 1 = demand, shock 2 = supply

[ h₁₁  h₁₂ ]   [ +  + ]
[ h₂₁  h₂₂ ] = [ +  − ]

h₁₁, h₁₂ ≥ 0  →  θ ∈ [0, π/2]
62
What’s the Nature of this Truncation?
[ h₁₁  h₁₂ ]   [ p₁₁ cos θ                p₁₁ sin θ             ]
[ h₂₁  h₂₂ ] = [ p₂₁ cos θ + p₂₂ sin θ    p₂₁ sin θ − p₂₂ cos θ ]

[Figure: feasible region of (h₂₁, h₂₂) as θ varies over [0, π/2]]

63
What Does This Imply for the Elasticities?

[ h₁₁  h₁₂ ]   [ p₁₁ cos θ                p₁₁ sin θ             ]
[ h₂₁  h₂₂ ] = [ p₂₁ cos θ + p₂₂ sin θ    p₂₁ sin θ − p₂₂ cos θ ]

h*₂₁ = h₂₁/h₁₁ = (p₂₁ cos θ + p₂₂ sin θ)/(p₁₁ cos θ) = p₂₁/p₁₁ + (p₂₂/p₁₁) tan θ

h*₂₂ = h₂₂/h₁₂ = (p₂₁ sin θ − p₂₂ cos θ)/(p₁₁ sin θ) = p₂₁/p₁₁ − (p₂₂/p₁₁) cot θ

for the allowable θ:  h*₂₁ ∈ [ω₂₁/ω₁₁, ω₂₂/ω₂₁]
                      h*₂₂ ∈ (−∞, 0]
64
for θ = 0:   h*₂₁ = p₂₁/p₁₁ = ω₂₁/ω₁₁          since tan 0 = 0
             h*₂₂ = −∞                          since cot 0 = ∞

for θ = θ̄ with cot θ̄ = p₂₁/p₂₂ (so that h₂₂ = 0):

h*₂₁ = p₂₁/p₁₁ + (p₂₂/p₁₁)(p₂₂/p₂₁) = (p₂₁² + p₂₂²)/(p₁₁ p₂₁) = ω₂₂/ω₂₁
since tan θ̄ = 1/cot θ̄ = p₂₂/p₂₁

h*₂₂ = p₂₁/p₁₁ − (p₂₂/p₁₁)(p₂₁/p₂₂) = 0

where
Ω = PP′ = [ p₁₁   0  ] [ p₁₁  p₂₁ ]   [ p₁₁²      p₁₁ p₂₁     ]   [ ω₁₁  ω₂₁ ]
          [ p₂₁  p₂₂ ] [ 0    p₂₂ ] = [ p₂₁ p₁₁   p₂₁² + p₂₂² ] = [ ω₂₁  ω₂₂ ]
65
Take-Away #2
• The sign restrictions may end up implying no or
trivial restrictions on the feasible set
β_d ∈ (−∞, 0)
β_s ∈ (0.04, 4.06)
→ the allowable set is uselessly large

• Without acknowledging prior information, there is
no statistical justification for reporting the median
and the upper and lower range around it
→ every draw in the set is consistent with the data
67
Take-Away #3

• Causal interpretation of correlations requires


information about economic structure.

• Instead of having an implicit prior built into some


mechanical algorithm, make prior beliefs about the
economic structure explicit and defend them.

• Use all the information that you may have.

68
Bayesian Inference in Set-Identified
Structural VAR Models

Structural model of interest:

A y_t = α + B₁ y_{t−1} + ⋯ + B_m y_{t−m} + u_t
(A: n×n, y_t: n×1)

u_t ~ i.i.d. N(0, D)
D diagonal

68
Bayesian approach:
Summarize whatever information we have
that helps identify A in the form of a density p(A).
p(A) is highest for values of A we think are
most plausible.
p(A) = 0 for values of A we rule out altogether.
p(A) can also impose sign restrictions
and zeros.

69
The Bayesian begins with prior beliefs
before seeing the data:
p(A, D, B) = p(A) p(D|A) p(B|D, A)

p(A) could be any density

Use natural conjugate priors for D and B:
diagonals of D⁻¹|A are independent gamma
rows of B|D, A are independent normal

70
Prior for pD|A

d −1
ii |A  Γ i ,  i 
where
Ed −1
ii |A   i / i
−1 2
Vard ii |A   i / i

uninformative priors:  i ,  i → 0

71
Prior for pB|D, A

B  B1 B2  Bm

b i |A, D  Nm i , d ii M i 

uninformative priors: M −1
i → 0

72
Likelihood:
p(Y_T|A, D, B) = (2π)^{−Tn/2} |det A|^T |D|^{−T/2} ×
exp[ −(1/2) Σ_{t=1}^T (Ay_t − Bx_{t−1})′ D⁻¹ (Ay_t − Bx_{t−1}) ]

prior:
p(A, D, B) = p(A) p(D|A) p(B|A, D)
posterior:
p(A, D, B|Y_T) = p(Y_T|A, D, B) p(A, D, B) / ∫ p(Y_T|A, D, B) p(A, D, B) dA dD dB

= p(A|Y_T) p(D|A, Y_T) p(B|A, D, Y_T)

73
prior:
d_ii⁻¹ | A ~ Γ(κ_i, τ_i)
posterior:
d_ii⁻¹ | A, Y_T ~ Γ(κ*_i, τ*_i)

As T → ∞, d̂_ii →p the true value

75
Posterior distribution for A

prior: p(A)

If M_i⁻¹ = 0 and κ_i = τ_i = 0,

posterior: p(A|Y_T) = k_T p(A) |det(A Ω̂_T A′)|^{T/2} / {det[diag(A Ω̂_T A′)]}^{T/2}

k_T = constant that makes this integrate to 1

76
Posterior distribution for A

p(A|Y_T) = k_T p(A) |det(A Ω̂_T A′)|^{T/2} / {det[diag(A Ω̂_T A′)]}^{T/2}

If we evaluate the posterior at an A that
diagonalizes Ω̂_T, i.e. A Ω̂_T A′ = diag(A Ω̂_T A′),
then p(A|Y_T) = k_T p(A)

77
Posterior distribution for A
p(A|Y_T) = k_T p(A) |det(A Ω̂_T A′)|^{T/2} / {det[diag(A Ω̂_T A′)]}^{T/2}

If we evaluate the posterior at an A
for which A Ω̂_T A′ ≠ diag(A Ω̂_T A′),
from Hadamard’s inequality it follows that
det[diag(A Ω̂_T A′)] > det(A Ω̂_T A′)
and as the sample size grows
p(A|Y_T) → 0
78
What Does this Mean?
• If there is more than one matrix A that
diagonalizes Ω̂_T, then the model is
under-identified.
• If the model is under-identified, the
influence of the prior will not vanish
even if the sample size T goes to ∞.
• Posterior is a re-scaled version of the prior:
p(A|Y_T) = k_T p(A)
79
Is This a Bad Thing?
• If enough restrictions are imposed so that the model is
exactly identified, then there exists only one allowable
value for A that diagonalizes Ω and the posterior
converges to that value as T → ∞.
BUT we need to be absolutely certain about the identifying
assumptions.
• If you are not certain about some of the identifying
assumptions, you want to be able to take that
uncertainty into account. You do that in the form of
p(A).
→ Doubts about the identification itself will be
reflected in the posterior distribution.
80
Application 4: Labor Market Dynamics

demand:
Δn_t = k_d + β_d Δw_t + b_{d11} Δw_{t−1} + b_{d12} Δn_{t−1} + b_{d21} Δw_{t−2}
+ b_{d22} Δn_{t−2} + ⋯ + b_{dm1} Δw_{t−m} + b_{dm2} Δn_{t−m} + u_{dt}
supply:
Δn_t = k_s + β_s Δw_t + b_{s11} Δw_{t−1} + b_{s12} Δn_{t−1} + b_{s21} Δw_{t−2}
+ b_{s22} Δn_{t−2} + ⋯ + b_{sm1} Δw_{t−m} + b_{sm2} Δn_{t−m} + u_{st}

81
Prior for the Elasticities

for y_t = (Δw_t, Δn_t)′:   A = [ −β_d  1 ]
                               [ −β_s  1 ]

Any arbitrary prior density that best summarizes
our prior beliefs about the labor supply and demand
elasticities: p(A) = p(β_s) p(β_d)

82
What do we know about the short-run wage
elasticity of labor demand?
• Hamermesh (1996) surveys microeconometric
studies: 0.1 to 0.75
• Lichter et al. (2014) conduct a meta-analysis of
942 estimates: lower end of the Hamermesh range
• Theoretical macro models can imply values
above 2.5 (Akerlof and Dickens, 2007; Galí,
Smets and Wouters 2012)

83
Student t prior for labor demand elasticity

[Figure: prior density]

85
What do we know about the wage elasticity
of labor supply?
• Long run: often assumed to be zero because
income and substitution effects cancel (e.g.,
Kydland and Prescott, 1982)
• Short run: often interpreted as the Frisch elasticity
• Reichling and Whalen survey of microeconometric
studies: 0.27-0.53
• Chetty et al. (2013) review 15 quasi-experimental
studies: < 0.5
• Macro models often assume a value greater than 2
(Kydland and Prescott, 1982, Cho and Cooley,
1994, Smets and Wouters, 2007)
86
Student t prior for labor supply elasticity

[Figure: prior density]

88
Prior for the inverse of the structural
variances
• Recall: d_ii⁻¹ | A ~ Γ(κ_i, τ_i)

• Considerations:
– Prior should in part reflect the scale of the data
– Scales of individual innovations are obtained from the
residuals of univariate autoregressions, denoted ê_t
– S is the sample variance matrix of those
residuals
– Prior mean κ_i/τ_i is set equal to the
reciprocal of the i-th diagonal element of A S A′:
τ_i(A) = κ_i a_i′ S a_i

• We set κ_i = 2, which puts modest weight on our prior
beliefs (equivalent to 4 observations of data).
89
Prior for the lagged structural coefficients
• Recall: b_i | A, D ~ N(m_i, d_ii M_i)
• Basic idea: Minnesota prior (Doan, Litterman, and
Sims 1984, Sims and Zha 1998)
– Prior mean
→ The most useful variable for predicting any variable is its
own lagged value.
→ Prior belief that macro variables behave like random
walks.
– Prior variance
→ Coefficients on higher lags are more likely to be zero
implies: smaller values for the diagonal elements of M_i
for higher lags
→ Prior variance is governed by a few hyperparameters λ
90
Prior mean
Reduced form might look like a random walk:

Φ = A⁻¹ B

E(Φ) = [ I_n  0 ]
       (n×n) (n×(k−n))

→ E(B|A) = A [ I_n  0 ]

m_i(A) = E(b_i|A) = ( a_i′, 0′ )′

91
Prior variance
Variance reflects increasing confidence in the prior
expectation as the lag order increases:
λ₁: confidence in higher-order lags (= 0 shuts them off)
v₁′ = ( 1/1^{2λ₁}, 1/2^{2λ₁}, ..., 1/m^{2λ₁} )   (1×m)
s_ii are the diagonal elements of S:
v₂′ = ( s₁₁⁻¹, s₂₂⁻¹, ..., s_nn⁻¹ )   (1×n)

λ₀: overall confidence in the prior (λ₃ for the constant term)

v₃ = λ₀² [ v₁ ⊗ v₂ ]
         [ λ₃²     ]
92
Prior variance

v₃ = λ₀² [ v₁ ⊗ v₂ ]
         [ λ₃²     ]

M_i is a diagonal matrix whose (r, r) element is
the r-th element of v₃:
M_{i,rr} = v₃ᵣ

Hyperparameters for the labor market example:
λ₀ = 0.2, λ₁ = 1, λ₃ = 100
93
Posterior Distribution for b_i
b_i | A, D, Y_T ~ N(m*_i, d_ii M*_i)

Regressions on augmented data sets:

Ỹ_i = ( a_i′y₁, ..., a_i′y_T, m_i′P_i )′   ((T+k)×1),  where P_i P_i′ = M_i⁻¹

X̃_i = ( x₀, ..., x_{T−1}, P_i )′   ((T+k)×k)

m*_i(A) = ( X̃_i′ X̃_i )⁻¹ X̃_i′ Ỹ_i(A)

M*_i = ( X̃_i′ X̃_i )⁻¹
94
Posterior Distribution for d_ii

d_ii⁻¹ | A, Y_T ~ Γ(κ*_i, τ*_i)

κ*_i = κ_i + T/2
τ*_i = τ_i(A) + ζ*_i(A)/2

ζ*_i(A) = Ỹ_i′Ỹ_i − Ỹ_i′X̃_i ( X̃_i′X̃_i )⁻¹ X̃_i′Ỹ_i

95
Posterior Distribution for A

p(A|Y_T) = k_T p(A) [det(A Ω̂_T A′)]^{T/2} ∏_{i=1}^n [τ_i(A)]^{κ_i}
           / ∏_{i=1}^n [(2/T) τ*_i(A)]^{κ*_i}

where Ω̂_T is the sample variance matrix
of the reduced-form VAR residuals:

Ω̂_T = T⁻¹ { Σ y_t y_t′ − [ Σ y_t x′_{t−1} ] [ Σ x_{t−1}x′_{t−1} ]⁻¹ [ Σ x_{t−1}y_t′ ] }

96
Baumeister-Hamilton Algorithm
• Goal:
Generate draws from the joint posterior distribution
p(A, D, B|Y_T)
• Procedure:
Draw A⁽ℓ⁾ from p(A|Y_T)
Draw D⁽ℓ⁾ from p(D|A⁽ℓ⁾, Y_T)
Draw B⁽ℓ⁾ from p(B|A⁽ℓ⁾, D⁽ℓ⁾, Y_T)
Repeat for ℓ = 1, ..., 10⁶
• {A⁽ℓ⁾, D⁽ℓ⁾, B⁽ℓ⁾}_{ℓ=1}^N is a sample from the joint posterior
97
How to Generate Draws for ?
• Problem:
Posterior distribution for A is of unknown form
cannot directly sample from it

• Solution:
Use random-walk Metropolis-Hastings algorithm to
approximate the posterior distribution pA|Y T 

98
Metropolis-Hastings Algorithm
• Goal:
Draw samples from a distribution p*(θ) with an unusual
form (referred to as the target density),
where θ is a K × 1 vector of parameters
• How can we do that?
– Specify a candidate-generating (proposal)
density q(θ^{G+1}|θ^G) or q(θ^{G+1}) from which
we can generate draws easily.
– Evaluate the ratio
[ p*(θ^{G+1}) / q(θ^{G+1}) ] / [ p*(θ^G) / q(θ^G) ]
to decide whether
to keep the candidate draw or to discard it.
100


Random-Walk Metropolis-Hastings Algorithm
• As the name suggests, the proposal density is a
random walk:
θ^{G+1} = θ^G + e
where e ~ i.i.d. N(0, Σ) is a K × 1 vector
• Given that e = θ^{G+1} − θ^G is normally distributed,
the density p(θ^{G+1} − θ^G) = p(θ^G − θ^{G+1}) due to
symmetry.

• Thus, q(θ^{G+1}|θ^G) = q(θ^G|θ^{G+1}) and the acceptance
probability simplifies to:  p*(θ^{G+1}) / p*(θ^G)
101
RW-MH Algorithm Step by Step
Step 1 Specify a starting value for the parameters θ,
denoted θ⁽⁰⁾, and set the variance Σ of the
shocks to the random walk.
Step 2 Draw a new value for the parameters θ_new using
θ_new = θ_old + e
where for the first draw θ_old = θ⁽⁰⁾.
Step 3 Compute the acceptance probability
α = min( p*(θ_new)/p*(θ_old), 1 )
If α > u ~ U(0, 1), retain θ_new;
otherwise, retain θ_old.
Step 4 Repeat steps (2) and (3) M times and
use the last L draws for inference.
102
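A generic MATLAB sketch of these four steps for a scalar parameter; the bimodal target below is purely illustrative:

% random-walk Metropolis-Hastings for a target known up to scale (sketch)
logtarget = @(th) log(0.5*normpdf(th,-2,1) + 0.5*normpdf(th,3,0.8));
M = 50000; L = 40000; Sigma = 1;       % total draws, retained draws, RW variance
th = 0; keep = zeros(M,1);             % Step 1: starting value
for g = 1:M
    th_new = th + sqrt(Sigma)*randn;                         % Step 2: proposal
    alpha = min(exp(logtarget(th_new) - logtarget(th)), 1);  % Step 3
    if alpha > rand
        th = th_new;                   % accept; otherwise keep old value
    end
    keep(g) = th;
end
keep = keep(end-L+1:end);              % Step 4: use the last L draws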
Generating draws for A
• Generate a candidate  α̃⁽ℓ⁺¹⁾ = α̃⁽ℓ⁾ + ξ Q̂ v⁽ℓ⁺¹⁾
where
v⁽ℓ⁺¹⁾ is a 2 × 1 vector of independent standard Student t
variables with 2 degrees of freedom,
Q̂ is the Cholesky factor of Ψ̂ (see below), and
ξ is a scalar tuning parameter set to achieve a 30% acceptance rate

• Check whether the sign restrictions for the
elements in α̃ are satisfied:
β_d < 0 and β_s > 0
103
Generating draws for A
• For those α̃ that have the correct signs,
calculate the log of the target function:
q(A) = log p(A) + (T/2) log det( A Ω̂_T A′ )
− Σ_{i=1}^n κ*_i log[ (2/T) τ*_i(A) ] + Σ_{i=1}^n κ_i log τ_i(A)
where
A = [ −β_d  1 ]
    [ −β_s  1 ]

104
Generating draws for A
• To enhance the efficiency of the algorithm, find the
value α̂ that maximizes the target function numerically

– Use α̂ as the starting value for the RW-MH step
– Use the inverse of the matrix of second derivatives (Hessian) as the
variance matrix:

Ψ̂ = − [ ∂²q(A) / ∂α ∂α′ ]⁻¹  evaluated at α = α̂

105
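A sketch of this initialization step in MATLAB; log_q stands for the user-coded log target q(A) and is an assumed name, and fminunc is from the Optimization Toolbox:

% posterior mode and proposal scaling for the RW-MH step (sketch)
neg_q = @(a) -log_q(a);                        % minimize the negative log target
a0 = [-0.5; 0.5];                              % illustrative starting values
[a_hat,~,~,~,~,Hess] = fminunc(neg_q, a0);     % Hess approximates -d2q/(da da')
Psi_hat = inv(Hess);                           % proposal variance matrix
Q_hat = chol(Psi_hat)';                        % its Cholesky factor
% candidate: a_new = a_old + xi*Q_hat*trnd(2,2,1); tune xi for ~30% acceptance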
A Bayesian Interpretation of
Traditional Approaches to
Identification
Structural model of interest:
A y_t = α + B₁ y_{t−1} + ⋯ + B_m y_{t−m} + u_t
(A: n×n, y_t: n×1)

u_t ~ i.i.d. N(0, D)
D diagonal

2
Identification Strategy
• Traditional approach:
assume perfect knowledge of structure to achieve
identification

• Bayesian approach:
represent imperfect information about elements in A
in the form of a Bayesian prior distribution pA

generalization of traditional methods

3
How does this relate to non-Bayesian approaches?
(1) Traditional point-identified structural VARs
are a special case of a Bayesian prior that
is dogmatic.
A. Cholesky identification
B. Long-run restrictions
(2) Set-identified VARs with implicit informative
priors that cannot be chosen by the user

B. Sign and boundary restrictions


4
A. Cholesky Identification

Kilian AER (2009)


q t  world oil production
y t  real global economic activity
p t  real price of oil

⇒ Application 5

5
Structural Model of the Global Oil Market

oil supply:
q_t = α_qy y_t + α_qp p_t + b₁′x_{t−1} + u_{1t}
economic activity:
y_t = α_yq q_t + α_yp p_t + b₂′x_{t−1} + u_{2t}
inverse of oil demand curve:
p_t = α_pq q_t + α_py y_t + b₃′x_{t−1} + u_{3t}
Note: α_pq = inverse of the short-run
price-elasticity of oil demand
6
Put in Canonical Form

A y_t = α + B₁ y_{t−1} + ⋯ + B_m y_{t−m} + u_t

A = [ 1      −α_qy   −α_qp ]
    [ −α_yq   1      −α_yp ]
    [ −α_pq  −α_py    1    ]

7
Example 1: Kilian (AER 2009)
What Does Cholesky Identification Imply?
(Cholesky identification)
α_qy = α_qp = α_yp = 0
oil supply:
q_t = α_qy y_t + α_qp p_t + b₁′x_{t−1} + u_{1t}
economic activity:
y_t = α_yq q_t + α_yp p_t + b₂′x_{t−1} + u_{2t}
inverse of oil demand curve:
p_t = α_pq q_t + α_py y_t + b₃′x_{t−1} + u_{3t}
8
Bayesian Translation of Cholesky
Identification
• I put absolutely zero probability on any A
unless the (1,2), (1,3), and (2,3) elements
are all zero.
→ dogmatic prior: degenerate distribution

• I have no information at all about the
(2,1), (3,1), and (3,2) elements.
⇒ totally uninformative prior
9
Special Case of Bayesian Prior Beliefs

A = [ 1       0      0 ]
    [ −α_yq   1      0 ]
    [ −α_pq  −α_py   1 ]

p(A) = p(α_yq) p(α_pq) p(α_py)

10
How to Represent “No Information”?

• Requirement:
Prior has to be a proper density
• Proposal:
(2,1) element: p(α_yq) = Student t with location 0,
scale 100, d.f. = 3

Same for p(α_pq) and p(α_py)

11
Prior for Lower-Triangular Elements of A

[Figure: prior densities]

p(α_yq): If oil production goes up by 1% this month,
maybe economic activity goes up by 100%,
maybe it goes down by 100%.
12
[Figure]
Blue: posterior median IRFs as calculated using the Baumeister-
Hamilton algorithm for the above prior.
Red: IRFs calculated using Kilian’s (AER 2009) Cholesky analysis.
13
Prior (red) and posterior (blue) distributions for
unknown elements of A

[Figure]

15
Prior (red) and posterior (blue) distributions for
unknown elements of A

[Figure]

16
Posterior density of short-run oil demand elasticity

[Figure]

12% posterior probability that the demand elasticity > 0
94% posterior probability that abs(elasticity) > 2
16
Standard Cholesky Identification

• Demand elasticity estimate: -5.9562
⇨ A 10% increase in the price of oil with no change
in income would result in a 60% drop in oil
consumption within the month.
• Why is the demand elasticity so large?
⇨ α_qp = 0 plays a key role in this conclusion
– Suppose instead that α_qp > 0; the positive correlation between
ũ_dt and q_t would bias the OLS estimate of α_pq upward (closer to zero)
⇒ implies a bias of the estimated demand elasticity toward a
larger absolute value (Baumeister and Hamilton, ET)
What do we know about the price elasticity
of demand?
• Cross-country regression of the log of petroleum use per
dollar of GDP on the real price of gasoline for 23 OECD
countries in 2004

[Figure: scatter of log petroleum consumption (gallons/year/real GDP) in 2004
against log gasoline price ($ per gallon) in 2004; slope = -0.51]

18
What do we know about the price elasticity
of demand?
• Cross-sectional evidence based on household
surveys
 Newey and Hausman (1995): -0.81
 Yatchew and No (2001): -0.9

• Estimates from previous literature surveys:


 Dahl and Sterner (1991): -0.86
 Graham and Glaister (2004): -0.77
 Brons et al. (2008): -0.84

short-run elasticity < long-run elasticity


19
What about the price elasticity of supply?

[Figure]

Saudi Arabian oil production can respond quickly to
changing economic conditions:
→ the short-run supply elasticity is not zero
20
What else do we know about the price
elasticity of oil supply?

• Estimates from multiple historical episodes


 Caldara, Cavallo, and Iacoviello (JME 2019): 0.081

• Estimates from individual oil wells in North Dakota


 Bjørnland, Nordvik, and Rohrer (JAE 2021): 0.3-0.9

• Estimates from micro data of shale producers


 Aastveit, Bjørnland, and Gundersen (2021): 0.62

21
Take-Away #4

• Better use non-dogmatic priors to reflect doubts


that we entertain about identifying assumptions
replace zero restrictions with prior density

• Use all available information to form prior beliefs

• Bayesian posterior distribution incorporates both


uncertainty from a finite data set and
incomplete confidence in identifying assumptions

22
B. Bayesian Interpretation of
Sign-Restricted VARs
Application 6: Kilian and Murphy JEEA (2012)
• Know with certainty the signs of the impact matrix
H = A⁻¹
• Know with certainty the interval in which elasticities fall
(boundary restrictions)

23
Sign Restrictions

[ ∂q_t/∂u_{1t}  ∂q_t/∂u_{2t}  ∂q_t/∂u_{3t} ]       [ +  +  + ]
[ ∂y_t/∂u_{1t}  ∂y_t/∂u_{2t}  ∂y_t/∂u_{3t} ]  has  [ +  +  − ]
[ ∂p_t/∂u_{1t}  ∂p_t/∂u_{2t}  ∂p_t/∂u_{3t} ]       [ −  +  + ]
24

What Prior Beliefs on the α's Produce
Those Signs?

oil supply: α_qy = 0, α_qp ≥ 0
q_t = α_qy y_t + α_qp p_t + b₁′x_{t−1} + u_{1t}
economic activity: α_yq = 0, α_yp ≤ 0
y_t = α_yq q_t + α_yp p_t + b₂′x_{t−1} + u_{2t}
inverse of oil demand curve: α_pq ≤ 0, α_py ≥ 0
p_t = α_pq q_t + α_py y_t + b₃′x_{t−1} + u_{3t}

25
A = [ 1       0      −α_qp ]
    [ 0       1      −α_yp ]
    [ −α_pq  −α_py    1    ]

Under the above assumptions:

(1) Model still unidentified
(2) ∂y_t/∂u_t′ has the desired signs for all allowable A

26
How Do We Know?

H = A⁻¹ = (1/det A) ×
[ 1 − α_yp α_py    α_qp α_py       α_qp ]
[ α_yp α_pq        1 − α_qp α_pq   α_yp ]
[ α_pq             α_py            1    ]

det A = 1 − α_yp α_py − α_qp α_pq > 0

supply curve slopes up: α_qp ≥ 0
higher oil prices depress economic activity: α_yp ≤ 0
demand curve slopes down: α_pq ≤ 0
higher income boosts oil demand: α_py ≥ 0
27
Boundary Restrictions
KM argued that sign restrictions are not enough.
(1) Claimed we know with certainty that the
short-run price elasticity of oil supply
α_qp falls in [0, 0.0258]
(2) Claimed we know with certainty that
the response of economic activity to
an oil-specific demand shock on
impact falls in [−1.5, 0]
28
(1) Prior for the Supply Elasticity

[Figure: U(0, 0.0258) prior]

29
(2) Prior for the Impact Effect
This is a statement about the (2,3) element of the IRF:

IRF₀(2,3) = H₂₃ · d₃₃^{0.5}

implying the following constraint:

−1.5 ≤ (α_yp / det A) · d₃₃^{0.5} ≤ 0

prior: U(−1.5, 0)

30
(2) Prior for the Impact Effect

[Figure: implied prior]

31
Summary of Prior Beliefs
Prior for A:
α_qy = α_yq = 0
α_qp ~ U(0, 0.0258)
α_yp ~ Student t (0, 100, 3)
truncated to be negative
α_pq ~ Student t (0, 100, 3)
truncated to be negative
α_py ~ Student t (0, 100, 3)
truncated to be positive
32
[Figure]
Blue: posterior median IRFs calculated using the BH algorithm
Red: IRFs calculated using the Kilian-Murphy (JEEA, 2012)
sign and boundary restrictions
33
Prior (red) and posterior (blue) distributions
for unknown elements of A and H

[Figure]

34
Posterior density of short-run
price elasticity of demand

[Figure]

96% posterior probability that the demand elasticity < -1.

35
Additional Extensions within the
Bayesian Approach
• Treatment of older data:
how long a data set to use

• Measurement error:
is pervasive but tends to be ignored in most
structural inferences

⇒ see Baumeister and Hamilton (2019) for an


illustration in the context of the global oil market
36
Objects of Structural Interest

BH recommendation: use the literature
and other data sets as a source of prior
information about the structural parameters A:
A y_t = α + B₁ y_{t−1} + ⋯ + B_m y_{t−m} + u_t
u_t ~ N(0, D)
Most applications of structural VARs focus
instead on the impacts of the structural shocks H:

H = A⁻¹ D^{1/2}

• Uhlig (Adv Econometrics 2017): H is the better
focus because we care about the equilibrium
effects of policy interventions.

• However, the question of identification comes
down to what we know from prior evidence or
theory.
• Rows of A correspond to the behavior of individual
agents (e.g. consumers, producers, govt policy)
• Rows of H correspond to the general-equilibrium
consequences of changes in those agents’
behavior
• Prior information usually comes in the form of
information about A:
– Elasticities (BH 2019; Aastveit, Bjørnland, and Cross,
2020; Brinca, Duarte, and Faria-e-Castro, 2020)
– Policy rules (BH 2018; Belongia and Ireland, JMCB
2020)
A Common Error in Estimating Elasticities
• Many studies have tried to draw an inference
about behavioral elasticities by calculating ratios of
the elements of a single column of H.

• For example: Kilian and Murphy (KM 2014)
estimate the demand elasticity from the ratio of the change in
oil consumption to the change in price in response to
an oil supply shock ⇨ β̂ = h₁₁/h₃₁
– Other studies: Kilian and Murphy (2012), Güntner
(2014), Riggi and Venditti (2015), Kilian and Lütkepohl
(2017), Ludvigson et al. (2017), Antolín-Díaz and Rubio-
Ramírez (2018), Basher et al. (2018), Herrera and
Rangaraju (2020), Zhou (2020)
The problem with this approach can be seen
in a 3-variable example:

supply:  q_t = η y_t + φ p_t + u_st
income:  y_t = θ q_t + δ p_t + u_yt
demand:  q_t = γ y_t + β p_t + u_dt

• Meaning of the demand elasticity β:
If price were to increase 1% with income held
constant, by how much would quantity
demanded change?
Effects of shocks in this 3-variable system

Structural model: A y_t = u_t
→ impact matrix: H = A⁻¹

(1) Response of q to a supply shock: h₁₁
(2) Response of y to a supply shock: h₂₁
(3) Response of p to a supply shock: h₃₁
A supply shock leaves the demand equation undisturbed,
so these responses must satisfy h₁₁ = γ h₂₁ + β h₃₁.

• Ratio of (1) to (3): h₁₁/h₃₁ = β + γ (h₂₁/h₃₁)
→ this is the estimate used by KM.
• This equals β in the special case when γ = 0.
• When γ ≠ 0, their estimate is a combination
of the sensitivity of demand to price and the
sensitivity of demand to income.

⇨ Use inverse of H!
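A small numeric MATLAB illustration of the pitfall; all coefficient values below are made up for the example:

% the ratio h11/h31 recovers beta only if the income sensitivity gamma = 0 (sketch)
beta = -0.5; gamma = 0.8;
A = [ 1   -0.1  -0.2;       % supply:  q = 0.1*y + 0.2*p + u_s
     -0.3  1    -0.1;       % income:  y = 0.3*q + 0.1*p + u_y
      1   -gamma -beta];    % demand:  q = gamma*y + beta*p + u_d
H = inv(A);                 % impact effects of the structural shocks
km_ratio = H(1,1)/H(3,1);   % response of q / response of p, supply shock
disp([km_ratio, beta + gamma*H(2,1)/H(3,1)])   % identical, and != beta here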
A Fully Bayesian
Approach: Estimation
and Inference
What Have We Learned So Far?
• Better to use nondogmatic priors and use all
available information
⇒ Be an economist!

• Uncertainty about identifying assumptions should


be incorporated in statements about any feature of
the model

• Bayesian posterior distribution reflects both


randomness in a finite data set and incomplete
confidence in identifying assumptions

2
Another advantage of being openly
Bayesian

• Can clearly motivate optimal inference given


uncertainty about identifying assumptions
⇒ credibility sets for impulse responses, variance
and historical decompositions are grounded in
statistical decision theory

3
Structural model of interest:
A y_t = α + B₁ y_{t−1} + ⋯ + B_m y_{t−m} + u_t
(A: n×n, y_t: n×1)

u_t ~ i.i.d. N(0, D)
D diagonal

4
Application 7:
A Three-Variable Macro Model

Baumeister and Hamilton (2018)

y_t = output gap
π_t = inflation (year-over-year PCE)
r_t = fed funds rate
t = 1986:Q1 - 2008:Q3

5
Model Description

Aggregate Supply or Phillips Curve

y_t = k^s + α^s π_t + b^s′ x_{t−1} + u^s_t

Aggregate Demand or Euler Equation

y_t = k^d + β^d π_t + γ^d r_t + b^d′ x_{t−1} + u^d_t

Monetary Policy or Taylor Rule

r_t = k^m + ψ^m_y y_t + ψ^m_π π_t + b^m′ x_{t−1} + u^m_t

6
Commonly used Taylor Rule
r_t = r̄ + (1 − ρ)[ ψ_y y_t + ψ_π π_t ] + ρ(r_{t−1} − r̄) + u^m_t
is a special case of our equation
r_t = k^m + ψ^m_y y_t + ψ^m_π π_t + b^m′ x_{t−1} + u^m_t
with
ψ^m_y = (1 − ρ) ψ_y
ψ^m_π = (1 − ρ) ψ_π

7
Prior information:
Taylor (1993) proposed values of
ψ_y = 0.5 and ψ_π = 1.5

Bayesian prior densities:

ψ_y ~ Student t(0.5, 0.4, 3)
truncated to be positive
Prob(ψ_y ≤ 1) = 0.82, Prob(ψ_y ≤ 2) = 0.98
ψ_π ~ Student t(1.5, 0.4, 3)
truncated to be positive
8
Prior Distributions for Taylor Coefficients

[Figure]

9
Prior information for smoothing parameter :
Lubik and Schorfheide (2004) and
Del Negro and Schorfheide (2004)

  Beta2. 6, 2. 6
mean  0. 5, std dev  0. 2

10
1  s 0
A 1  d  d
1   y 1    1

11
Commonly used dynamic IS curve
y_t = b_y y_{t+1|t} − ψ(r_t − π_{t+1|t}) + u^d_t
where ψ = intertemporal elasticity of substitution

DSGE would imply

y_{t+1|t} = c_y + η_y′ x_t
π_{t+1|t} = c_π + η_π′ x_t

12
y_{t+1|t} = c_y + η_y′ x_t

π_{t+1|t} = c_π + η_π′ x_t

• One approach: find DSGE-implied values
for η_y and η_π in terms of deep structural
parameters, use these for the prior.

• Our approach: use prior beliefs about the
reduced form directly.

13
Minnesota prior: the most useful variable for
predicting any variable is its own lagged value.
η_y′ x_t ≈ η_y y_t
η_π′ x_t ≈ η_π π_t
Minnesota prior: everything is a random walk
η_y = η_π = 1
For our variables (output gap, inflation) we
instead expect
η_y = η_π = 0.75
14
y t  c d  y t1|t  r t   t1|t   u dt
  d   y y t  r t     t  u dt
y t   d  /1   y r t    /1   y  t  ũ dt

From DSGE literature:


  2/3
  0. 5

15
So for our AD equation
d 
y t  k    t   r t  b  x t1  u dt
d d d

we expect
 d  /1   y   0. 5/0. 5  1
 d    /1   y   0. 75
Bayesian prior
 d  Student t1, 0. 4, 3 truncated  0
 d  Student t0. 75, 0. 4, 3 no sign restriction

16
Phillips Curve
y_t = k^s + α^s π_t + b^s′ x_{t−1} + u^s_t

Lubik and Schorfheide (2004):

α^s ~ Student t(2, 0.4, 3) truncated ≥ 0

17
Priors for Contemporaneous Structural
Coefficients

      [ 1            −α^s          0    ]
A =   [ 1            −β^d         −γ^d  ]
      [ −(1−ρ)ψ_y    −(1−ρ)ψ_π     1    ]

p(A) = p(α^s) p(β^d) p(γ^d) p(ψ_y) p(ψ_π) p(ρ)

18
Additional Considerations

• What do these priors imply for the impulse


responses?

• Don’t we also have information about the


effects of structural shocks on the endogenous
variables?

19
Solutions

• What do these priors imply for the impulse


responses?
⇒ check via simulation or analytically
⇒ graphically or by computing prior probabilities

• Don’t we also have information about the


effects of structural shocks on the endogenous
variables?
⇒ form priors about the impacts of shocks
20
Priors for Impacts of Shocks

H = A⁻¹ = (1/det A) H*

       [ −β^d − γ^d(1−ρ)ψ_π           α^s                       α^s γ^d   ]
H* =   [ −(1 − γ^d(1−ρ)ψ_y)           1                         γ^d       ]
       [ −((1−ρ)ψ_π + β^d(1−ρ)ψ_y)    (1−ρ)ψ_π + α^s(1−ρ)ψ_y    α^s − β^d ]

det A = α^s (1 − γ^d(1−ρ)ψ_y) − β^d − γ^d(1−ρ)ψ_π

21
Priors for Impacts of Shocks

Sign restrictions on the contemporaneous coefficients

α^s > 0, γ^d < 0, ψ_y > 0, ψ_π > 0, and 1 − ρ > 0

imply the following signs on H:

           [ ?  +  − ]
sign(H) =  [ −  +  − ]
           [ ?  +  ? ]

22
What Information on Impact Effects?
• Sign restriction on the response of the output gap
to a supply shock:
h₁ = β^d + γ^d(1−ρ)ψ_π ≤ 0
→ expect an increase of output after a
favorable supply shock
(from the H matrix above, the impact response is
proportional to −h₁)

• Magnitude of the response of the output gap
to a 1% monetary contraction:
h₂ = α^s γ^d / (α^s − β^d) ≤ 0
→ expect output to fall modestly
23
Extension to the BH Algorithm
• In keeping with the idea of not wanting to impose
constraints dogmatically, we introduce a new flexible
family of densities: an asymmetric t distribution
• Consider a random variable ℎ ∈ (−∞, ∞) with the
following density:
ph  k 1  v h h   h / h  h h/ h 

h

where
 v x is a standard Student t variable
x is the cumulative distribution function
for a standard N0, 1 variable 24
Features of the Asymmetric t Distribution
p(h) = k^{−1} φ_{ν_h}((h − μ_h)/σ_h) Φ(λ_h h/σ_h)

λ_h governs asymmetry (shape parameter)
  λ_h = 0 : symmetric
  λ_h > 0 : positive skew
  λ_h < 0 : negative skew
  λ_h → +∞ : Student t(μ_h, σ_h, ν_h)
             truncated to be positive
  λ_h → −∞ : Student t(μ_h, σ_h, ν_h)
             truncated to be negative
25
Features of the Asymmetric t Distribution
p(h) = k^{−1} φ_{ν_h}((h − μ_h)/σ_h) Φ(λ_h h/σ_h)

ν_h degrees of freedom parameter
  ν_h → ∞ : N(μ_h, σ_h²)
  ν_h = 1 : Cauchy distribution
μ_h location parameter
σ_h scale parameter
26
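A sketch of this density in Python (my illustration; the normalizing constant k is left out, so this is the kernel only):

    import numpy as np
    from scipy import stats

    def asym_t_kernel(h, mu, sigma, nu, lam):
        """Unnormalized asymmetric Student t density."""
        return stats.t.pdf((h - mu) / sigma, df=nu) * stats.norm.cdf(lam * h / sigma)

    # lam = 0 gives a symmetric t; a large positive lam approaches a
    # Student t truncated to h > 0.
    h = np.linspace(-3, 3, 7)
    print(asym_t_kernel(h, mu=0.0, sigma=1.0, nu=3, lam=0.0))
    print(asym_t_kernel(h, mu=0.0, sigma=1.0, nu=3, lam=50.0))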
How to Determine the Parameters of
This Distribution?
• By simulation:
1. Take draws from the distributions for β^d, γ^d, ψ_π, and ρ
2. Compute for each draw h1 = −γ^d − β^d(1−ρ)ψ_π
3. Compute mean and standard deviation from
empirical distribution
• By economic theory:
The output gap is unlikely to move one-for-one with a
change in the policy rate on impact.
27
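A sketch of the simulation approach just described (my illustration, using the priors from the earlier slides; truncations are imposed by rejection):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n = 500_000

    def trunc_t(loc, scale, df, size, negative=False):
        d = stats.t(df=df, loc=loc, scale=scale).rvs(size=4 * size, random_state=rng)
        return (d[d < 0] if negative else d[d > 0])[:size]

    beta_d  = trunc_t(-1.0, 0.4, 3, n, negative=True)
    gamma_d = stats.t(df=3, loc=0.75, scale=0.4).rvs(size=n, random_state=rng)
    psi_pi  = trunc_t(1.5, 0.4, 3, n)
    rho     = stats.beta(2.6, 2.6).rvs(size=n, random_state=rng)

    h1 = -gamma_d - beta_d * (1 - rho) * psi_pi
    print(h1.mean(), h1.std())    # roughly 0.1 and 1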
Prior for ℎ1
• Simulation results: μ_{h1} ≃ 0.1 and σ_{h1} ≃ 1
• Set: ν_{h1} = 3, λ_{h1} = 4 and ζ_{h1} = 1
28
Prior for ℎ2
• Economic intuition: μ_{h2} ≃ −0.3 and σ_{h2} ≃ 0.5
• Set: ν_{h2} = 3, λ_{h2} = −2 and ζ_{h2} = 1
29
Joint Prior
Add to
logpA  logp s   logp d   logp d 
 logp y   logp    logp
the following two terms
logph 1    h 1 log v h1 h 1   h 1 / h 1   log h 1 h 1 / h 1 
logph 2    h 2 log v h2 h 1   h 2 / h 2   log h 2 h 2 / h 2 
where
 h 1 and  h 2 govern the overall weight put on
prior for h 1 and h 2 (here  h 1  1 and  h 2  1
30
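In code, the two extra terms are straightforward to add (a sketch under the hyperparameter values set on the previous slides; log_prior_A is the earlier illustrative function, and the h1, h2 values are those implied by the prior-mode draw of A):

    from scipy import stats

    def log_p_h(h, mu, sigma, nu, lam, zeta):
        # zeta = 0 switches the information about h off entirely
        return zeta * (stats.t.logpdf((h - mu) / sigma, df=nu)
                       + stats.norm.logcdf(lam * h / sigma))

    log_prior = log_prior_A(2.0, -1.0, 0.75, 0.5, 1.5, 0.5)
    h1, h2 = 0.0, -0.8                       # computed as on the earlier slides
    log_prior += log_p_h(h1, 0.1, 1.0, 3, 4.0, zeta=1.0)
    log_prior += log_p_h(h2, -0.3, 0.5, 3, -2.0, zeta=1.0)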
Joint Prior
• Resulting prior is no longer independent across the
individual elements of A, but includes some joint
information about their interaction
⇒ favors combinations of parameters that are in
line with h1 and h2 over those that are not

• Flexible in that it can be controlled by one parameter
⇒ setting ζ_h = 0 ignores information about h
and gets us back to the default
31
Prior Toolbox
• To visualize prior densities that best reflect your prior
beliefs
• To calculate and simulate moments of prior
distributions
• Toolbox contains a set of useful distributions:
 (truncated) Student t
 Gamma
 Beta
 Asymmetric Student t
• Toolbox contains function to compute impact matrix
32
Priors for Structural Variances
• d_ii^{−1} | A ~ Γ(κ_i, a_i′ S a_i / 2)

• S is the sample variance matrix of univariate
residuals from AR(4) regressions:

s_ij = T^{−1} Σ_{t=1}^{T} ê_it ê_jt
33
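A sketch of constructing S (my illustration; Y is a T×n array of the observed series):

    import numpy as np

    def ar4_residuals(x, p=4):
        """OLS residuals from regressing x_t on a constant and p own lags."""
        T = len(x)
        X = np.column_stack([np.ones(T - p)] +
                            [x[p - j - 1:T - j - 1] for j in range(p)])
        beta, *_ = np.linalg.lstsq(X, x[p:], rcond=None)
        return x[p:] - X @ beta

    def sample_S(Y, p=4):
        E = np.column_stack([ar4_residuals(Y[:, i], p) for i in range(Y.shape[1])])
        return E.T @ E / E.shape[0]   # s_ij = (1/T) sum_t e_it e_jt, T net of initial lags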
Priors for Lagged Structural Coefficients
Minnesota prior:
• coeff on own lag in reduced form is 0.75
⇒ first three elements of b_i may be close
to 0.75 a_i, for a_i′ the i-th row of A
• all other coeffs 0

Also have information that the third element of b_r
should be near ρ

b_i | A, D ~ N(m_i(A), d_ii M_i)
34
Prior (red) and posterior (blue) distributions for
contemporaneous coefficients
35
Impulse-Response Functions
Solid blue lines: posterior median. Shaded regions: 68% posterior credibility set.
Dotted blue lines: 95% posterior credibility set. Dashed red lines: prior median.
36
Prior and posterior probabilities that effect of shock is positive at horizon s

           Supply shock         Demand shock       Monetary policy shock
           (1)      (2)         (3)      (4)          (5)      (6)
Variable   Prior  Posterior     Prior  Posterior      Prior  Posterior

s=0
y          0.851    1.000       1.000    1.000        0.000    0.000
π          0.000    0.000       1.000    1.000        0.000    0.000
r          0.008    0.229       1.000    1.000        0.999    1.000

s=1
y          0.717    1.000       0.994    1.000        0.037    0.079
π          0.006    0.000       0.961    1.000        0.117    0.046
r          0.054    0.374       0.965    1.000        0.981    1.000

s=2
y          0.617    1.000       0.974    1.000        0.143    0.206
π          0.021    0.000       0.879    1.000        0.272    0.078
r          0.156    0.478       0.869    1.000        0.916    1.000
37
Historical Decompositions

y_{t+s} = ŷ_{t+s|t} + Σ_{m=0}^{s−1} Ψ_m ε_{t+s−m}
        = ŷ_{t+s|t} + Σ_{m=0}^{s−1} Ψ_m A^{−1} u_{t+s−m}

This decomposes the value of y_{t+s} into the forecast
at t and the n structural shocks between t and t+s.
⇒ Answers questions such as:
• How would a particular variable have evolved if only a
specific shock had occurred historically?
• What is the relative contribution of different types of
structural shocks to fluctuations in the observed variable?    38
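A sketch of the shock-j term in this decomposition (my illustration; Psi is the list of MA coefficient matrices Ψ_0, ..., Ψ_{s−1} and u holds the structural shocks u_{t+1}, ..., u_{t+s} in its rows):

    import numpy as np

    def historical_contribution(Psi, A_inv, u, j):
        """Portion of y_{t+s} - yhat_{t+s|t} attributable to structural shock j."""
        s = len(Psi)
        contrib = np.zeros(A_inv.shape[0])
        for m in range(s):
            H_m = Psi[m] @ A_inv               # impact of u_{t+s-m} on y_{t+s}
            contrib += H_m[:, j] * u[s - 1 - m, j]
        return contrib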
Historical Decomposition of the Output Gap
Dashed red: actual data in deviation from mean. Solid blue: portion
attributed to indicated structural shock. Shaded regions: 68% posterior
credibility sets. Dotted blue: 95% posterior credibility sets.
39
Historical Decomposition of Inflation
Dashed red: actual data in deviation from mean. Solid blue: portion
attributed to indicated structural shock. Shaded regions: 68% posterior
credibility sets. Dotted blue: 95% posterior credibility sets.
40
Historical Decomposition of Fed Funds Rate
Dashed red: actual data in deviation from mean. Solid blue: portion
attributed to indicated structural shock. Shaded regions: 68% posterior
credibility sets. Dotted blue: 95% posterior credibility sets.
41
Variance Decompositions
What is the contribution of structural shock 𝑗 to
the s-period-ahead mean squared forecast
error of the i-th element of y_{t+s}?

⇒ s-period-ahead forecast error:

y_{t+s} − ŷ_{t+s|t} = Σ_{m=0}^{s−1} Ψ_m ε_{t+s−m} = Σ_{m=0}^{s−1} Ψ_m A^{−1} u_{t+s−m}

where H_m = Ψ_m A^{−1}
42
Variance Decompositions
Mean-squared error:

E[(y_{t+s} − ŷ_{t+s|t})(y_{t+s} − ŷ_{t+s|t})′ | θ] = Σ_{j=1}^{n} Q_j(s)

where
Q_j(s) = d_jj Σ_{m=0}^{s−1} h_j(m; θ) h_j(m; θ)′

⇒ fraction of variance accounted for by the j-th
structural shock:

Q_j(s) / Σ_{j=1}^{n} Q_j(s)
43
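A sketch of these variance shares (my illustration; d holds the structural variances d_jj and Psi the MA matrices Ψ_0, ..., Ψ_{s−1}):

    import numpy as np

    def variance_shares(Psi, A_inv, d, s):
        """(i, j) entry: share of the s-step MSE of variable i due to shock j."""
        n = len(d)
        Q = np.zeros((n, n))
        for m in range(s):
            H_m = Psi[m] @ A_inv
            Q += (H_m ** 2) * d       # diagonal of d_jj h_j(m) h_j(m)'
        return Q / Q.sum(axis=1, keepdims=True)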
Variance decomposition of 4-quarter-ahead forecast errors

                    Supply          Demand        Monetary policy
Output gap        0.36 [35%]      0.62 [60%]        0.05 [5%]
                 (0.10, 0.84)    (0.34, 1.10)      (0.01, 0.19)
Inflation         0.38 [69%]      0.16 [28%]        0.02 [3%]
                 (0.20, 0.68)    (0.05, 0.36)      (0.00, 0.09)
Fed funds rate    0.02 [1%]       0.94 [71%]        0.37 [28%]
                 (0.00, 0.16)    (0.37, 1.74)      (0.11, 0.92)

Notes. Estimated median contribution of each structural shock to the 4-quarter-ahead
squared forecast error of each variable in bold, and expressed as a percent of
total MSE in brackets. Parentheses indicate 95% credibility intervals.
44
Because we have used multiple sources
of prior information, we can look at what
difference it makes if we drop any one.
E.g., replace γ^d ~ Student t(0.75, 0.4, 3)
with γ^d ~ Student t(0.75, 10, 3).
45
Plot of Student t density with location parameter 0.75,
3 degrees of freedom, and scale parameter of 0.4, 2,
or 10.
46
Response of output gap
to monetary shock
with an uninformative prior
for indicated parameter.
Solid blue:
Posterior median.
Dashed red lines:
Benchmark posterior.
Parentheses:
Median contribution of MP
shock to 4-quarter-ahead
squared forecast error of
output gap.
47
Priors for Larger Systems
• Add 3 variables to the 3-variable system:
corporate bond spread, commodity spot price,
and wages
• Use informative priors for the original set of
parameters
• Use relatively uninformative priors for parameters
on the additional variables: Student t(0, 1, 3)
• Use sign restrictions for impact effects of the
monetary policy shock
• Identify only a subset of the structural shocks
48
     [  1     a_12   a_13   a_14   a_15   a_16 ]
     [  1     a_22   a_23   a_24   a_25   a_26 ]
A =  [ a_31   a_32    1     a_34   a_35   a_36 ]
     [ a_41   a_42   a_43    1     a_45   a_46 ]
     [ a_51   a_52   a_53   a_54    1     a_56 ]
     [ a_61   a_62   a_63   a_64   a_65    1   ]
49
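A Student t(0, 1, 3) is loose by the standards of the priors above; as a quick illustration of the spread it implies for each new a_ij:

    from scipy import stats

    loose = stats.t(df=3, loc=0.0, scale=1.0)    # prior for each new a_ij
    print(loose.interval(0.9))                   # about (-2.35, 2.35)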
Responses to a monetary policy shock in 6-variable VAR
with sign restrictions on impact matrix
50
Conclusions
(1) Structural interpretation of correlations is only possible
by drawing on prior understanding of economic structure.

(2) Identifying assumptions are not a "necessary evil" in
structural analysis.

(3) Think carefully about what prior evidence and
economic theory tell us about the structure!

(4) Be an economist and use all available information in a
sensible way ⇒ be explicit about uncertainty of the
identifying assumptions.    51
Conclusions
(5) The Bayesian approach will incorporate doubts about
the quality of that prior information formally in all the
conclusions we draw.

(6) The Bayesian approach offers many advantages over
existing approaches and can be easily applied in
many other contexts.

52