Lasso-Based Inference for High-Dimensional Time Series
Citation for published version (APA):
Adámek, R. (2022). Lasso-Based Inference for High-Dimensional Time Series [Doctoral Thesis, Maastricht University]. Maastricht University. https://doi.org/10.26481/dis.20221205ra
Published: 01/01/2022
DOI: 10.26481/dis.20221205ra
Document version: Publisher's PDF, also known as Version of Record
Lasso-Based Inference for High-Dimensional Time Series
R.X. Adámek
This research was financially supported by the Netherlands Organization for Scientific
Research (NWO) under grant number 452-17-010.
© R.X. Adámek, Maastricht 2022
All rights reserved. No part of this publication may be reproduced, stored in a
retrieval system, or transmitted in any form, or by any means, electronic, mechanical,
photocopying, recording or otherwise, without the prior permission in writing from
the author.
This book was typeset by the author using LaTeX.
Published by Universitaire Pers Maastricht
ISBN: 978-94-6469-120-7
Cover: Pavel Baláš, 2022
Printed in The Netherlands by ProefschriftMaken
Lasso-Based Inference for High-Dimensional Time Series
DISSERTATION
to obtain the degree of Doctor at
Maastricht University,
on the authority of the Rector Magnificus,
Prof. dr. Pamela Habibović,
in accordance with the decision of the Board of Deans,
to be defended in public
on Monday the 5th of December 2022, at 16:00 hours
by
Robert Xerxes Adámek
Supervisor
Dr. S.J.M. Smeekes
Co-supervisor
Dr. I. Wilms
Assessment Committee
Prof. dr. A.W. Hecq (chair)
Dr. S. Basu, Cornell University
Prof. dr. J. van den Brakel
Dr. O. Boldea, Tilburg University
To my loving family
Acknowledgements
Sometimes the path to my PhD feels like a long sequence of happy coincidences, as if I just happened to stumble into things I was happy to do. But I think that view would sell short the many people without whom this thesis would never have been written.
I would not have chosen to study econometrics in Maastricht if my parents hadn't travelled with me to several open day events across the Netherlands. After finishing the Bachelor, I might not have continued on to the Master without encouragement from Jean-Pierre, who supervised my Bachelor thesis project and ignited my love for time series econometrics. Without Stephan and his Big Data course, I wouldn't have chosen a rather ambitious Master thesis topic under his supervision, and eventually been offered a PhD position to work on his project. I can't name everyone who helped put me on this path, but I will do my best!
It should go without saying that Stephan and Ines were great supervisors. Anyone
who’s worked with them could tell you they are highly knowledgeable about their
field of research, they are excellent communicators and teachers, they take their work
seriously and have an eye for detail. I want to thank you both on a more personal
level – I feel like I’ve grown quite a lot as a person over the last four years, and you’ve
played a big role in that. I always felt like I could tell you about my problems and
you had a lot of great advice for dealing with them. I appreciate all the work you put
into going over my writing and giving me useful feedback, especially relating to my
job market materials. Our regular chats were a highlight of the week; I always came
out of them motivated and with a smile on my face. You probably think these are all
things a good supervisor should do, but I certainly don’t take them for granted. It
wouldn’t have been the same without you.
It’s no secret that I’m an introvert, and I didn’t put much effort into socializing with people at the department. In retrospect, I wish I had spent more time getting to
know everyone, especially with covid making that very difficult in the latter half of my
PhD. That being said, Etienne took me under his wing as soon as I started, so work
at the office was never lonely. I fondly remember our discussions about cool proofs on
the whiteboard, gossiping about students in Mathematical Statistics, talking about
music, cooking and videogames instead of working... Caterina, Luca, the conference
in Rome was some of the most fun I had during my PhD. Dewi, Eric, Enrico, thank
you for taking me along for the Econometric Game. I loved our reading groups about
high-dimensional CLTs and SVARs with Lenard. Adam, Daniel, Elisa, Francesco,
Marie, it was great to talk with you at various NESG meetings, seminars and workshops.
I also want to thank many people who taught me over the years: Alain, Christian,
Dries, Hanno, Rasmus, Sean, to name a few – you are part of why I want to continue
working in academia and being a teacher myself.
Finally, I want to thank everyone in my life outside of the university, for a needed
distraction from work, for listening to my rambling about work, and for telling me to
stop talking about my work. My family has always supported me during my studies
and PhD, being at my side for every big decision in my life. Getting to spend more
time with you was a huge upside of working from home – I hope I will never be too
far from you. Charlotte, Kubo, Sofie, thank you for always being there for me when
I need you. Conor, Daniel, Demane, Štěpáne, Tom, thanks for all the evenings of
chatting, gaming and laughing.
Robert Adámek
Aarhus, October 2022
Contents

Acknowledgements  vii
Contents  ix

1 Introduction  3
1.1 Inference  3
1.2 High-dimensionality  5
1.3 The lasso  6
1.4 The desparsified lasso  7
1.5 Time series  8
1.6 Chapter overview  9

2 Desparsified Lasso in Time Series  11
2.1 Introduction  12
2.2 The High-Dimensional Linear Model  15
2.3 Error Bound and Consistency for the Lasso  22
2.4 Uniformly Valid Inference via the Desparsified Lasso  23
2.4.1 Assumptions  25
2.4.2 Inference on low-dimensional parameters  28
2.4.3 Inference on high-dimensional parameters  33
2.5 Analysis of Finite-Sample Performance  35
2.5.1 Tuning parameter selection  36
2.5.2 Autoregressive model with exogenous variables  37
2.5.3 Factor model  39
2.5.4 Weakly sparse VAR(1)  40
2.6 Conclusion  41
2.A Proofs for Section 2.3  44
2.A.1 Definitions  44
2.A.2 Preliminary results  44
2.A.3 Proofs of the main results  45
2.B Proofs for Section 2.4  46
2.B.1 Preliminary results  46
2.B.2 Proofs of main results  49
2.C Supplementary Results  59
2.C.1 Proofs of preliminary results Section 2.3  59
2.C.2 Proofs of preliminary results Section 2.4  63
2.C.3 Illustration of conditions for Corollary 2.1  78
2.C.4 Properties of induced p-norms for 0 ≤ p < 1  80
2.C.5 Additional notes on Examples 2.5 and 2.6  82
2.C.6 Algorithmic details for choosing the lasso tuning parameter  87
2.C.7 Additional simulation details  88

3 Local Projection Inference in High Dimensions  95
3.1 Introduction  96
3.2 High-dimensional Local Projections  98
3.2.1 Local Projection Estimation  100
3.2.2 Local Projection Inference  102
3.3 Simulations  104
3.4 Structural Impulse Responses Estimated by HDLPs  106
3.4.1 Impulse Responses to a Shock in Monetary Policy  106
3.4.2 Impulse Responses to a Shock in Government Spending  109
3.5 Conclusion  111
3.A Assumptions  113
3.B Proofs  114
3.C Simulations: Extra Figures  117
3.D Data used in Section 3.4.1  118
3.E FAVAR Implementation  122

4 Sparse High-Dimensional Vector Autoregressive Bootstrap  125
4.1 Introduction  126
4.2 Vector Autoregressive Bootstrap  127
4.2.1 Bootstrap for High-Dimensional VARs  128
4.2.2 Bootstrap Inference on (Approximate) Means  130
4.3 HDCLT for linear processes  131
4.4 Application to VAR models  133
4.5 Bootstrap consistency  136
4.6 Bootstrap Consistency for VAR Estimation by the lasso  139
4.7 Conclusion  141
4.A Preliminary Lemmas  142
4.B Proofs  145

5 Conclusion  169
Bibliography  173
Impact  183
Curriculum Vitae  187
Chapter 1
Introduction
The common themes of my work can be succinctly summarized by the 4 components
of my thesis title: the lasso, statistical inference, high-dimensionality, and time series.
In this chapter, I will introduce each of these concepts, and motivate why combining
them presents an important and interesting challenge – one that I hope to
help overcome. Most of my work is theoretical, which may carry the connotation of limited practical use. I believe that applying the methods I discuss to real-life problems is the ultimate goal of anyone studying them, but we should ideally only use them if we have compelling arguments for why they work. My work helps arm
researchers with the arguments and confidence to apply these methods, and informs
where their limits may lie. That being said, to motivate the more practically minded
reader, my work also includes some empirical applications, simulation studies, and a
software package which implements these methods in a user-friendly way.
1.1 Inference
To give a simple example of inference, consider a coin-tossing experiment: I flip a
coin 100 times and count 55 Heads. Is the coin fair or biased? To answer this
question, I adopt the frequentist (as opposed to Bayesian) philosophy of statistics.
The probability of getting Heads is a fixed number between 0 and 1, presumably
determined by the coin’s physical properties, that describes the frequency at which I
would get Heads if I kept flipping it over and over. For this coin experiment, we could
get an arbitrarily accurate estimate of this probability with enough patience, or like
Diaconis et al. (2007), with a coin-flipping machine. However, in many situations,
repeating an experiment is either expensive, or simply impossible – I may only ever
get 100 flips. Inference lets me make an informed judgement about the coin’s bias
by quantifying the uncertainty associated with these 55 Heads. After all, even a fair
coin could appear biased due to random chance. As it turns out, a fair coin produces
outcomes at least as extreme as 55 Heads (that is, 55 or more, or 45 or fewer) around 37% of the time.
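This two-sided probability can be checked directly from the Binomial distribution; a minimal sketch, using only the numbers from the example above:

```python
from math import comb

n, p = 100, 0.5
# Probability that a fair coin gives an outcome at least as extreme as
# 55 Heads: 55 or more, or 45 or fewer (X ~ Binomial(100, 0.5)).
p_extreme = sum(comb(n, k) * p**k * (1 - p)**(n - k)
                for k in range(n + 1)
                if k >= 55 or k <= 45)
print(round(p_extreme, 3))  # ≈ 0.37, the "around 37%" in the text
```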
While such a calculation is easy for coin flipping, the problem of quantifying
uncertainty becomes considerably more difficult when our data comes from a more
complicated distribution, or even a distribution we do not know. Consider an example
where we wish to measure the average height of a Dutch man. Schönbeck et al.
(2013) found that in 2009, from a sample of 5,811, the average height at age 21 was
183.8cm. How representative is this of the male Dutch population as a whole? In
such cases, we can often appeal to asymptotic approximations. By the Central Limit
Theorem, we know that means over larger and larger samples tend to resemble a
Normal distribution more and more closely. If we treat this Normal approximation
as accurate, we could say that with a 95% probability, our estimate is within around
1.8mm of the true mean. However, this approximation is only exact in the limit,
letting the number of Dutch men in our sample go to infinity – one may argue such
a situation is not very realistic, considering there was only a finite number of Dutch
men in 2009. Despite this, the accuracy of asymptotic approximations can be verified by simulations, and they often work very well in practice, especially when the data itself is already close to Normal and the sample size is large, as is the case here.
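The 1.8mm figure can be reproduced from the CLT-based interval. In the sketch below, the standard deviation of 7cm is an assumed value (it is not reported above), chosen because adult height dispersion is of that order:

```python
from math import sqrt

n = 5811        # sample size from Schönbeck et al. (2013)
sd_mm = 70.0    # assumed standard deviation of height: 7 cm (hypothetical value)
z = 1.96        # 95% two-sided critical value of the standard Normal

# CLT approximation: the sample mean is roughly Normal with standard
# error sd / sqrt(n), so the 95% half-width is z * sd / sqrt(n).
half_width_mm = z * sd_mm / sqrt(n)
print(round(half_width_mm, 1))  # ≈ 1.8 mm
```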
While probabilities in the frequentist sense were easy to understand in the coin
flipping example, it may be more tricky with average heights. When I say the height
estimate was within 1.8mm of the true mean with a probability of 95%, it means that
if we could repeat this experiment again, only choosing the random sample of Dutch
men differently, the estimated mean would be within 1.8mm of the true mean 95%
of the time. This is of course not possible, since we cannot travel back in time to
redo the experiment. However, it motivates another approximation technique: the
bootstrap. Introduced by Efron (1979), the bootstrap is a method which re-samples
our original data to create new artificial samples, which we then use as samples from
an “alternate reality”. In essence, the extent to which the means of these bootstrap
samples resemble the original mean lets us infer how the original mean relates to that
of the population in 2009. Unlike asymptotic approximations, the bootstrap often works well even in small samples and when the data is far from Normal, e.g. when it is heavily
skewed or bimodal.
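A minimal sketch of the resampling idea, using a toy skewed sample in place of any particular study's data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy skewed data standing in for a real sample; the bootstrap needs
# no Normality assumption about its distribution.
x = rng.exponential(scale=1.0, size=200)

B = 2000
# Each bootstrap sample re-draws n observations from x with replacement,
# playing the role of a sample from an "alternate reality".
boot_means = np.array([rng.choice(x, size=x.size, replace=True).mean()
                       for _ in range(B)])

# Percentile 95% confidence interval for the population mean
lo, hi = np.quantile(boot_means, [0.025, 0.975])
print(round(lo, 3), round(hi, 3))
```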
The methods I consider in this thesis are more complicated than the examples
above, but the idea of statistical inference remains central to my work. In addition
to the estimates themselves, I am interested in quantifying their uncertainty, allowing
practitioners to make an informed judgement about the statistical significance of their
results.
1.2 High-dimensionality
One of the defining features of my work is the focus on methods for high-dimensional
data, or “Big Data”. While this term can mean different things to different people, in
this thesis it refers to the setting where the number of variables is very large relative
to the number of observations we have for each variable. In a classical econometric
setting, we may encounter the following dataset. For 100 individuals, indexed by $i$, we have data about their income ($y_i$), years of education ($ed_i$), number of children ($ch_i$), marital status ($mar_i$), and gender ($g_i$). If we wanted to explain an individual's income by
their other characteristics, we could estimate the following linear model
\[ y_i = \alpha + \beta_1\, ed_i + \beta_2\, ch_i + \beta_3\, mar_i + \beta_4\, g_i + \epsilon_i, \tag{1.1} \]
where $\alpha$ is an intercept, and $\epsilon_i$ is an error term which we wish to minimize. The parameters, which I typically denote with Greek letters, represent the effect on income resulting from a one unit change in the explanatory variable. To make the notation
more compact, we can write eq. (1.1) in terms of vectors
\[ y_i = \beta' x_i + \epsilon_i, \tag{1.2} \]
where $\beta = (\alpha, \beta_1, \ldots, \beta_4)'$ and $x_i = (1, ed_i, ch_i, mar_i, g_i)'$. This kind of model is then
typically estimated by ordinary least squares (OLS), which chooses β in such a way
as to minimize the sum of squared residuals, or
\[ \hat{\beta} = \arg\min_{\beta} \sum_i \left( y_i - \beta' x_i \right)^2. \tag{1.3} \]
With 5 explanatory variables (including the intercept) and 100 observations, the properties of β̂ are well-known, including methods for inference. For example, we could
check if the estimated parameter β̂4 is significantly different from 0, which might
indicate the presence of a wage gap between men and women. However, we could
also think of many other factors which affect a person’s wages. Their nationality,
age, ethnicity, where they work and in which sector, which high-school or university
they attended, the wages and education levels of their parents, etc. With the rise of
social media such as Facebook, it is plausible such detailed data could be available
– though not without ethical concerns. One might be tempted to simply include all
these variables into a model as in eq. (1.1); after all, more data should improve the
model and improve its estimates. Unfortunately, this approach would run afoul of the
curse of dimensionality.
A model with many variables (and therefore many parameters) is more flexible
than one with few; it can explain a larger proportion of the variation in yi and
achieve a better fit of the data. However, this flexibility comes at the price of high
variance in our estimates, and therefore high uncertainty which makes meaningful
inference difficult. Worse yet, when the number of variables exceeds the number of
observations, methods such as OLS fail completely; this is because the optimization
problem eq. (1.3) has no unique solution. The model becomes so flexible that it
can fit the data perfectly, and it can do so in an infinite number of equally valid
ways. One of the canonical examples of such high-dimensional problems is gene
expression models, where we want to estimate which genes are associated with the
occurrence of certain diseases. For example, Simon et al. (2013) consider a data set
with 127 patients, and data on 22,283 genes. If we hope to ever identify the correct
genes with a statistical approach, it is paramount to use methods which function
in such a high-dimensional setting, and also allow for valid inference. As another
example, the FRED-QD database (McCracken and Ng, 2020) contains 253 quarterly US macroeconomic variables over the last 128 quarters. While this may not appear high-dimensional at first glance, practitioners typically use at least 4 lags of the variables in their models, which makes the effective number of variables 1,265 (253 variables times 5, counting the contemporaneous value and 4 lags).
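The contrast between the classical and the high-dimensional setting can be illustrated numerically; in this sketch the design and coefficients are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
# Classical setting: intercept plus 4 regressors, as in eq. (1.1)
X = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])
beta_true = np.array([1.0, 0.5, -0.3, 0.2, 0.0])
y = X @ beta_true + rng.normal(size=n)

# OLS, eq. (1.3): the unique minimizer solves the normal equations X'X b = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.round(beta_hat, 2))

# High-dimensional setting: 50 variables but only 10 observations.
# X'X is singular, so eq. (1.3) no longer has a unique solution.
X_wide = rng.normal(size=(10, 50))
print(np.linalg.matrix_rank(X_wide.T @ X_wide))  # at most 10, far below 50
```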
1.3 The lasso
The lasso (or Least Absolute Shrinkage and Selection Operator), introduced by Tibshirani (1996), is an estimation method for sparse high-dimensional models. It addresses
the problem of too-flexible models by adding a penalty for large parameter estimates.
Compare to eq. (1.3) the lasso estimator:
\[ \hat{\beta}^{(L)} = \arg\min_{\beta} \sum_i \left( y_i - \beta' x_i \right)^2 + \lambda \|\beta\|_1, \tag{1.4} \]
where $\|\beta\|_1 = \sum_j |\beta_j|$, and $\lambda > 0$ is a tuning parameter. The intuition behind this method is that the additional penalty prioritizes estimates $\hat{\beta}$ with small entries, while maintaining a good fit for the data. The trade-off between these two features is governed by $\lambda$, where larger values make the penalty term more important, thus resulting in $\hat{\beta}^{(L)}$'s with smaller entries, eventually giving $\hat{\beta}^{(L)} = 0$ if $\lambda$ is sufficiently large.
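A minimal illustration of eq. (1.4), computed by cyclic coordinate descent with soft-thresholding (a standard way of solving the lasso problem, not the implementation used later in this thesis); the data-generating process is made up:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize sum_i (y_i - x_i'b)^2 + lam * ||b||_1 by coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: remove variable j's current contribution
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j
            # Soft-thresholding sets small coefficients exactly to zero
            beta[j] = np.sign(rho) * max(abs(rho) - lam / 2, 0.0) / col_ss[j]
    return beta

rng = np.random.default_rng(2)
n, p = 50, 100                      # more variables than observations
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]    # only 3 variables actually matter
y = X @ beta_true + 0.1 * rng.normal(size=n)

beta_hat = lasso_cd(X, y, lam=5.0)
print(np.count_nonzero(beta_hat))   # far fewer than 100 nonzero entries
```

Unlike OLS, the estimate is well-defined even with 100 variables and 50 observations, and most of its entries are exactly zero.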
The lasso has several attractive features: It has a unique solution regardless of
how many variables are in the model, and the geometry of the optimization problem
gives rise to sparse estimates; that is, many entries of $\hat{\beta}^{(L)}$ are exactly 0. This
means it selects those variables which are most important to give a good model
fit, and sets the parameters of all other variables to 0, making the results easily
interpretable. However, these properties come at the cost of introducing some bias
into our estimation. Compared to OLS, the lasso produces estimates which are closer to zero, which may lead us to underestimate the effects of some variables. The upside is that it greatly reduces the variance of the estimates, and when λ is chosen carefully,
this lower variance outweighs the additional bias, resulting in estimates which are on
average closer to their true values.
Unfortunately, it is difficult to do inference with the lasso. Its tendency to reduce
small parameter values to 0 may lead to omitted variable bias, which may invalidate inference on the remaining variables in the model. For example, if we return to
eq. (1.1), it may be the case that the lasso gives an estimate of β̂3 = 0, implying
that marital status has no effect on income. If married individuals have particularly
high incomes and also have more children on average than unmarried individuals, this
high income will become wrongly associated with the number of children, thus overestimating the size of β2 . When multiple mutually correlated variables are included
in the model, the lasso has a tendency to eliminate all but one of them, which has a
high risk of creating this problem.
1.4 The desparsified lasso
The desparsified (also known as debiased) lasso, introduced by van de Geer et al.
(2014), addresses the lasso’s issues with inference by adding an adjustment term to
the lasso in eq. (1.4)
\[ \hat{\beta}^{(DL)} = \hat{\beta}^{(L)} + \hat{\Theta} \left( \frac{1}{T} \sum_{i=1}^{T} x_i \hat{\epsilon}_i \right), \tag{1.5} \]
where $\hat{\epsilon}_i$ is the residual of the lasso model, i.e. the amount by which the dependent variable $y_i$ differs from the model prediction $\hat{\beta}^{(L)\prime} x_i$. To give some intuition behind
this adjustment, consider the example of omitted variable bias above, where the lasso
penalized the parameter of the marital status variable to 0. For a high-earning married
individual with many children, we would expect this issue to result in a particularly
large value for ϵ̂i . Since their value for the number of children is also large, the product
of xi and ϵ̂i will contain a large entry, giving us an indication that this omitted variable
bias may be occurring. The term in the brackets then takes an average of how big
this issue is over all individuals, and gets scaled by the matrix Θ̂. This is the inverse
covariance (or precision) matrix, which measures how correlated each pair of variables
is, while taking into account the effects of all other variables. Therefore, if we miss
important variables which are also correlated in a relevant way to other variables, the
desparsified lasso will make a large adjustment to the lasso estimator to compensate.
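The mechanics of eq. (1.5) can be sketched in a few lines. In this low-dimensional toy example, Θ̂ is the exact inverse of X'X/T (rather than the nodewise-lasso estimate used in high dimensions), and a crude ridge fit stands in for the lasso as the initial estimator; with the exact inverse, the adjustment recovers OLS regardless of the initial estimate:

```python
import numpy as np

rng = np.random.default_rng(3)
T, p = 100, 5
X = rng.normal(size=(T, p))
y = X @ np.array([1.0, -0.5, 0.0, 0.0, 0.3]) + rng.normal(size=T)

# Biased initial estimator (ridge, standing in for the lasso in this sketch)
beta_init = np.linalg.solve(X.T @ X + 10.0 * np.eye(p), X.T @ y)

# Desparsified adjustment, eq. (1.5): add Theta_hat * (average of x_i * resid_i)
resid = y - X @ beta_init
Theta_hat = np.linalg.inv(X.T @ X / T)  # exact precision matrix here
beta_dl = beta_init + Theta_hat @ (X.T @ resid / T)

# In this regular low-dimensional setting, the result is exactly OLS
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(beta_dl, beta_ols))  # True
```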
The two names of this method hint at its properties: $\hat{\beta}^{(DL)}$ is no longer sparse in
the sense of containing many zeroes, since the adjustment term is never 0. This means
we lose the variable selection properties of the lasso, but also avoid the associated
problems with omitted variables. It also counteracts the lasso’s inherent bias towards
0; in fact, when used in a regular low-dimensional setting, $\hat{\beta}^{(DL)}$ becomes identical
to the OLS estimator of eq. (1.1). This then provides some intuition for why we also
recover some of the nice inference properties of the OLS, even in the high-dimensional
setting.
1.5 Time series
The main contribution of my thesis lies in extending the concepts described above to
a time series setting. In the examples estimating Dutch men’s heights, or modelling
a person’s income, I was implicitly assuming that the features of different individuals
were independent of each other. If the data were collected in a well-designed survey,
this is reasonable. Time series are a type of data where the observations refer to some
quantity at different points in time, usually ordered from oldest to newest, and they
typically exhibit some sort of dependence over time. While we could think of the
series of coin flips as a sort of time series – they were presumably flipped sequentially
over time – we would not be very interested in the order of heads or tails, since we
expect these flips were independent of each other. Similarly, if we changed the order of
individuals in our income data set, we would not expect to see different results. With
time series, this ordering is a crucial feature of the data, and we are often interested
in studying how a series changes over time. For example, we may want to examine
whether average temperatures have been increasing over the past decades, or how the gross domestic product (GDP) of a country has been affected by COVID-19.
Time-dependence brings with it many complications when it comes to statistical
modelling, estimation, and inference. However, understanding these patterns in our
data allows us to make informed predictions about the future, which makes the additional effort worthwhile. Understanding the effects of greenhouse gas emissions on
temperatures can help us develop effective climate policy to address climate change.
In macroeconomics, studying the relationships between interest rates and GDP can
lead to monetary policy which brings a country out of recession. As in my previous
examples, inference is an invaluable tool for analyzing time series. When we forecast
that GDP will grow by 1% next year, we may also want to know how confident we
are of this prediction. If we are 95% confident that GDP will grow by between 0.8%
and 1.2%, this tells a very different story than if it were between -3% and 4%.
In my work, I largely consider time series which are weakly dependent, or stationary. These series may have strong dependence between values that are close to each
other in time, but the dependence dies out as they grow further apart. As such, they
generally move in a tunnel; they may get disturbed by a short-term shock, but they tend to return to their long-term mean after a while. For example, GDP growth
dropped sharply during 2020, likely as a result of COVID-19 and various lockdown
measures, and these will likely have lasting effects even after the pandemic.
However, we wouldn’t expect this to have any noticeable effect in 100 years, in the
same way that the Spanish flu has little effect on GDP today.
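This mean-reverting behaviour is easy to see in the simplest stationary model, an AR(1) with |φ| < 1; the sketch below compares a simulated path with and without a large one-off shock (all numbers are illustrative):

```python
import numpy as np

def ar1(shocks, phi):
    """Simulate y_t = phi * y_{t-1} + shock_t starting from y_0 = 0."""
    y = np.zeros(len(shocks))
    for t in range(1, len(shocks)):
        y[t] = phi * y[t - 1] + shocks[t]
    return y

rng = np.random.default_rng(4)
T, phi = 200, 0.7            # |phi| < 1 makes the series stationary
eps = rng.normal(size=T)

y_base = ar1(eps, phi)
eps_shocked = eps.copy()
eps_shocked[100] += 10.0     # a large one-off disturbance at t = 100
y_shocked = ar1(eps_shocked, phi)

# The shock's effect decays geometrically, as 10 * phi**h after h periods,
# so the series returns to its long-run mean
effect = y_shocked - y_base
print(round(effect[100], 1), round(effect[120], 3))  # 10.0, then near 0
```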
At the time I started working on this thesis, inference in high dimensions was well-established for independent data, with only a few authors pioneering the use of lasso-based methods in time series settings. Among them were Kock and Callot (2015),
Basu and Michailidis (2015), and Medeiros and Mendes (2016), who showed many
promising and useful results, and whose theoretical approaches greatly influenced
my work. In particular, the latter’s relaxation of the Gaussianity assumption to
allow for more general and potentially fat-tailed distributions is also a running theme
throughout this thesis. The field has grown rapidly since then, and I am pleased to
see it flourish. The main contribution of this thesis is extending the theory behind
the lasso and desparsified lasso to a high-dimensional, weakly dependent time series
setting, deriving novel theoretical results for valid inference in linear models such as
eq. (1.2).
1.6 Chapter overview
Chapter 2 is heavily theoretical, and forms the backbone of my thesis. In this chapter,
we derive asymptotic results for the desparsified lasso in eq. (1.5), under a highly
general form of weak dependence known as near-epoch dependence, which covers many
popular dependence concepts such as vector autoregressive (VAR) or mixing processes.
We derive these results under the assumption of weak sparsity of β, which makes its
use justifiable in many practical settings. Unlike exact sparsity, which requires that
many elements of β are exactly 0, weak sparsity allows for a large number of nonzero
elements, provided they are not “too large”. Inference in time series typically involves
the long-run covariance matrix of our process, and we provide a consistent estimator
for this matrix, showing it works well even in high dimensions. We also provide a
data-dependent way of choosing the λ tuning parameter for the lasso in eq. (1.4), thus
giving a complete toolbox required to do high-dimensional inference. To facilitate the
use of this method by practitioners, we created the package desla for the open-source statistical software R, which efficiently implements the desparsified lasso and
our proposed inference method. Furthermore, we perform an extensive simulation
study demonstrating the accuracy of our inference in several relevant settings.
In Chapter 3, I focus on applying the desparsified lasso to high-dimensional local
projection. This modelling technique allows us to do structural, or causal inference;
that is, rather than only considering correlations, this approach lets practitioners
estimate the causal effects of structural shocks in the form of impulse responses. Local projections are typically used in macroeconomic settings, and the recent work of
Plagborg-Møller and Wolf (2021) showed that they are in some sense equivalent to
structural VARs – another highly popular method for estimating impulse responses.
In addition to showing how the results of Chapter 2 can be applied in these local projections, we also propose a small modification to the desparsified lasso which greatly
improves its inference performance, and demonstrate this in a simulation. Finally,
we present two empirical applications where we investigate the impulse responses to shocks in monetary policy and government spending. These applications also highlight
why these high-dimensional methods are important. For one of our analyses, we use
13 lags of 122 macroeconomic variables from the FRED-MD database (McCracken
and Ng, 2016) in a non-linear state-dependent model, resulting in 3309 explanatory
variables with only 707 time series observations. The implementation of our proposed
high-dimensional local projections is also a part of the desla package.
Finally, in Chapter 4, we return to more theoretical results, with our proposal of
the sparse high-dimensional VAR bootstrap. Unlike Chapter 2, where we develop time
series theory for a high-dimensional method, this chapter develops high-dimensional
theory for a method which is well-established in low-dimensional time series settings.
In this bootstrap, we propose to estimate high-dimensional VARs with the lasso, and
use these VARs to build our bootstrap samples. We then show that this bootstrap procedure is consistent, and provides a valid approximation for high-dimensional means.
To do so, we build on our previous results in Section 2.3, where we derive error bounds
for the lasso. To prove our main results, we also derive a high-dimensional central
limit theorem for linear processes, which may be of independent interest.
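The bootstrap scheme sketched in that paragraph — fit the VAR equation-by-equation with the lasso, then iterate the fitted system on resampled, centred residuals — can be illustrated as follows. This is a hedged, minimal sketch of the idea only (my own code, not the chapter's actual procedure or the desla implementation); the coordinate-descent lasso and all names are illustrative.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=100):
    """Minimal coordinate-descent lasso for (1/T)||y - Xb||_2^2 + 2*lam*||b||_1."""
    T, N = X.shape
    b = np.zeros(N)
    col_ss = (X ** 2).sum(axis=0) / T
    for _ in range(n_iter):
        for j in range(N):
            rho = X[:, j] @ (y - X @ b + X[:, j] * b[j]) / T
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_ss[j]
    return b

def sparse_var_bootstrap(Y, p, lam, B, rng):
    """Fit a VAR(p) equation-by-equation with the lasso, then generate B
    bootstrap samples by iterating the fitted VAR on resampled residuals."""
    T, K = Y.shape
    # Regressor matrix with rows (Y_{t-1}', ..., Y_{t-p}') for t = p, ..., T-1
    Z = np.hstack([Y[p - 1 - i:T - 1 - i] for i in range(p)])
    Phi = np.vstack([lasso_cd(Z, Y[p:, k], lam) for k in range(K)])  # (K, K*p)
    U = Y[p:] - Z @ Phi.T
    U = U - U.mean(axis=0)                    # centre residuals before resampling
    boot_means = np.empty((B, K))
    for bi in range(B):
        idx = rng.integers(0, len(U), size=len(U))
        Ys = Y.copy()                         # initial p observations kept fixed
        for t in range(p, T):
            z = np.concatenate([Ys[t - 1 - i] for i in range(p)])
            Ys[t] = Phi @ z + U[idx[t - p]]
        boot_means[bi] = Ys.mean(axis=0)
    return Phi, boot_means

# Small diagonal VAR(1) with persistence 0.5 as a sanity check.
rng = np.random.default_rng(0)
K, T = 3, 400
A = 0.5 * np.eye(K)
Y = np.zeros((T, K))
for t in range(1, T):
    Y[t] = Y[t - 1] @ A.T + rng.standard_normal(K)
Phi, boot_means = sparse_var_bootstrap(Y, p=1, lam=0.05, B=50, rng=rng)
```

The bootstrap distribution of `boot_means` is the kind of object whose validity for high-dimensional means Chapter 4 establishes; the lasso shrinkage is visible as a small downward bias in the fitted diagonal of `Phi`.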
Throughout the following chapters, I generally denote scalar quantities by lower
case letters (e.g. x), vectors by bold lower case letters (e.g. v), matrices by bold
capital letters (e.g. M), and unknown quantities by Greek letters. The notation of
each chapter is otherwise self-contained and defined separately. In cases where one
chapter refers to results of another, any relevant notation differences are made clear.
Chapter 2
Desparsified Lasso in Time
Series
Abstract†
In this chapter we develop valid inference for high-dimensional time series. We extend the desparsified lasso to a time series setting under Near-Epoch Dependence
(NED) assumptions allowing for non-Gaussian, serially correlated and heteroskedastic processes, where the number of regressors can possibly grow faster than the time
dimension. We first derive an error bound under weak sparsity, which, coupled with
the NED assumption, means this inequality can also be applied to the (inherently
misspecified) nodewise regressions performed in the desparsified lasso. This allows us
to establish the uniform asymptotic normality of the desparsified lasso under general
conditions, including for inference on parameters of increasing dimensions. Additionally, we show consistency of a long-run variance estimator, thus providing a complete
set of tools for performing inference in high-dimensional linear time series models.
Finally, we perform a simulation exercise to demonstrate the small sample properties
of the desparsified lasso in common time series settings.
† This chapter is based on joint work with S.J.M. Smeekes and I. Wilms. It is forthcoming in the
Journal of Econometrics.
2.1 Introduction
In this chapter we propose methods for performing uniformly valid inference on high-dimensional time series regression models. Specifically, we establish the uniform
asymptotic normality of the desparsified lasso method (van de Geer et al., 2014) under
very general conditions, thereby allowing for inference in high-dimensional time series
settings that encompass many econometric applications. That is, we establish validity
for potentially misspecified time series models, where the regressors and errors may
exhibit serial dependence, heteroskedasticity and fat tails. In addition, as part of our
analysis we derive new error bounds for the lasso (Tibshirani, 1996), on which the
desparsified lasso is based.
Although approaches to high-dimensionality in econometric time series have traditionally been dominated by factor models (cf. Bai and Ng, 2008; Stock and Watson,
2011), shrinkage methods have rapidly been gaining ground. Unlike factor models
where dimensionality is reduced by assuming common structures underlying regressors, shrinkage methods assume a certain structure on the parameter vector. Typically, sparsity is assumed, where only a small, unknown subset of the variables is
thought to have “significantly non-zero” coefficients, and all the other variables have
negligible – or even exactly zero – coefficients. The most prominent among shrinkage
methods exploiting sparsity is the lasso proposed by Tibshirani (1996), which adds a
penalty on the absolute value of the parameters to the least squares objective function. This penalty ensures that many of the coefficients will be set to zero and thus
variable selection is performed, an attractive feature that helps to make the results of
a high-dimensional analysis interpretable. Due to this feature, the lasso and its many
extensions are now standard tools for high-dimensional analysis (see e.g., Hesterberg
et al., 2008; Vidaurre et al., 2013; Hastie et al., 2015, for reviews).
Much effort has been devoted to establishing error bounds for lasso-based methods to
guarantee consistency for prediction (e.g., Greenshtein and Ritov, 2004; Bühlmann,
2006) and estimation of a high-dimensional parameter (e.g., Bunea et al., 2007; Zhang
and Huang, 2008; Bickel et al., 2009; Meinshausen and Yu, 2009; Huang et al., 2008).
While most of these advances have been made in frameworks with independent and
identically distributed (IID) data, early extensions of lasso-based methods to the
time series case can be found in Wang et al. (2007) and Hsu et al. (2008). These authors,
however, only consider the case where the number of variables is smaller than the
sample size. Various papers (e.g., Nardi and Rinaldo, 2011; Kock and Callot, 2015;
Basu and Michailidis, 2015) let the number of variables increase with the sample
size, but often require restrictive assumptions (for instance Gaussianity) on the error
process when investigating theoretical properties of lasso-based estimators in time
series models.
Exceptions are Medeiros and Mendes (2016), Wu and Wu (2016), Masini et al.
(2021), and Wong et al. (2020). Medeiros and Mendes (2016) consider the adaptive
lasso for sparse, high-dimensional time series models and show that it is model selection consistent and has the oracle property, even when the errors are non-Gaussian and
conditionally heteroskedastic. Wu and Wu (2016) consider high-dimensional linear
models with dependent non-Gaussian errors and/or regressors and provide asymptotic theory for the lasso with deterministic design. To this end, they adopt the
functional dependence framework of Wu (2005). Masini et al. (2021) focus on weakly
sparse high-dimensional vector autoregressions for a class of potentially heteroskedastic and serially dependent errors, which encompass many multivariate volatility models. The authors derive finite sample estimation error bounds for the parameter vector
and establish consistency properties of lasso estimation. Wong et al. (2020) derive
nonasymptotic inequalities for estimation error and prediction error of the lasso without assuming any specific parametric form of the DGP (data-generating process).
The authors assume the series to be either α-mixing Gaussian processes or β-mixing
processes with sub-Weibull marginal distributions thereby accommodating settings
with heavy-tailed non-Gaussian errors.
While one of the attractive features of lasso-type methods is their ability to perform
variable selection, this also causes serious issues when performing inference on the
estimated parameters. In particular, performing inference on a (data-driven) selected
model, while ignoring the selection, causes the inference to be invalid. This has been
discussed by, among others, Leeb and Pötscher (2005) in the general context of model
selection and Leeb and Pötscher (2008) for shrinkage estimators. As a consequence,
recent statistical literature has seen a surge in the development of so-called post-selection inference methods that circumvent the problem induced by model selection;
see for example the literature on selective inference (cf. Fithian et al., 2015; Lee et al.,
2016) and simultaneous inference (Berk et al., 2013; Bachoc et al., 2020).
In the context of lasso-type estimation, methods have been developed based on
the idea of orthogonalizing the estimation of the parameter of interest to the estimation (and potential incorrect selection) of the other parameters. Belloni et al.
(2014) and Chernozhukov et al. (2015) propose a post-double-selection approach that
uses a Frisch-Waugh partialling out strategy to achieve this orthogonalization by selecting important covariates in initial selection steps on both the dependent variable
and the variable of interest, and show this approach yields uniformly valid and standard normal inference for independent data. In a related approach, Javanmard and
Montanari (2014), van de Geer et al. (2014) and Zhang and Zhang (2014) introduce
debiased or desparsified versions of the lasso that achieve uniform validity based on
similar principles for IID Gaussian data. Extensions to the time series case include
Chernozhukov et al. (2021) who provide desparsified simultaneous inference on the
parameters in a high-dimensional regression model allowing for temporal and cross-sectional dependency in covariates and error processes, Krampe et al. (2021) who
introduce bootstrap-based inference for autoregressive time series models based on
the desparsification idea, Hecq et al. (2021) who use the post-double-selection procedure of Belloni et al. (2014) for constructing uniformly valid Granger causality tests
in high-dimensional VAR models, and Babii et al. (2019) who use a debiased sparse
group lasso for inference on a low dimensional group of parameters.
In this chapter, we contribute to the literature on shrinkage methods for high-dimensional time series models by providing novel theoretical results for both point
estimation and inference via the desparsified lasso. We consider a very general
time series framework where the regressors and error terms are allowed to be non-Gaussian, serially correlated and heteroskedastic, and the number of variables can
grow faster than the time dimension. Moreover, our assumptions allow for both correctly specified and misspecified models, thus providing results relevant for structural
interpretations if the overall model is specified correctly, but not limited to this.
We derive error bounds for the lasso in high-dimensional, linear time series models
under mixingale assumptions and a weak sparsity assumption on the parameter vector.
Our setting generalizes the one from Medeiros and Mendes (2016), who require a
martingale difference sequence (m.d.s.) assumption – and hence correct specification
– on the error process. Moreover, we relax the traditional sparsity assumption to allow
for weak sparsity, thereby recognizing that the true parameters are likely not exactly
zero. The error bounds are used to establish estimation and prediction consistency
even when the number of parameters grows faster than the sample size.
We extend the error bounds to the nodewise regressions performed in the desparsified lasso, where each regressor (on which inference is performed) is regressed on
all other regressors. Note that, contrary to the setting with independence over time,
these nodewise regressions are inherently misspecified in dynamic models with temporal dependence. As such our error bounds are specifically derived under potential
misspecification. We then establish the asymptotic normality of the desparsified lasso
under general conditions. As such, we ensure uniformly valid inference over the class
of weakly sparse models. This result is accompanied by a consistent estimator for the
long run variance, thereby providing a complete set of tools for performing inference in
high-dimensional, linear time series models. As such, our theoretical results accommodate various financial and macro-economic applications encountered by applied
researchers.
The remainder of this chapter is structured as follows. Section 2.2 introduces the
time series setting and assumptions thereof. In Section 2.3, we derive an error bound
for the lasso (Corollary 2.1) that forms the basis for the nodewise regressions performed for the desparsified lasso. In Section 2.4, we establish the theory that allows
for uniform inference with the desparsified lasso. Section 2.5 contains a simulation
study examining the small sample performance of the desparsified lasso, and Section 2.6 concludes. The main proofs and preliminary lemmas needed for Section 2.3
are contained in Appendix 2.A, while Appendix 2.B contains the results and proofs
on Section 2.4. Appendix 2.C contains supplementary material.
A word on notation. For any N-dimensional vector x, ∥x∥_r = (∑_{i=1}^N |x_i|^r)^{1/r} denotes the L_r-norm, with the familiar convention that ∥x∥_0 = ∑_i 1(|x_i| > 0) and ∥x∥_∞ = max_i |x_i|. For a matrix A, we let ∥A∥_r = max_{∥x∥_r = 1} ∥Ax∥_r for any r ∈ [0, ∞] and ∥A∥_max = max_{i,j} |a_{i,j}|. We use →^p and →^d to denote convergence in probability and distribution respectively. Depending on the context, ∼ denotes equivalence in order of magnitude of sequences, or equivalence in distribution. We frequently make use of arbitrary positive finite constants C (or its sub-indexed version C_i) whose values may change from line to line throughout the chapter, but they are always independent of the time and cross-sectional dimension. Similarly, generic sequences converging to zero as T → ∞ are denoted by η_T (or its sub-indexed version η_{T,i}). We say a sequence η_T is of size −x if η_T = O(T^{−x−ε}) for some ε > 0.
2.2 The High-Dimensional Linear Model
Consider the linear model

y_t = x_t′ β^0 + u_t,  t = 1, …, T,  (2.1)

where x_t = (x_{1,t}, …, x_{N,t})′ is an N × 1 vector of explanatory variables, β^0 is an N × 1 parameter vector and u_t is an error term. Throughout the chapter, we examine the high-dimensional time series model where N can be larger than T.
We impose the following assumptions on the processes {x_t} and {u_t}.

Assumption 2.1. Let z_t = (x_t′, u_t)′, and let there exist some constants m̄ > m > 2 and d ≥ max{1, (m̄/m − 1)/(m̄ − 2)} such that

(i) E[z_t] = 0, E[x_t u_t] = 0, and max_{1≤j≤N+1, 1≤t≤T} E|z_{j,t}|^{2m̄} ≤ C.

(ii) Let s_{T,t} denote a k(T)-dimensional triangular array that is α-mixing of size −d/(1/m − 1/m̄) with σ-field F_t^s := σ{s_{T,t}, s_{T,t−1}, …} such that z_t is F_t^s-measurable. The process {z_{j,t}} is L_{2m}-near-epoch-dependent (NED) of size −d on s_{T,t} with positive bounded NED constants, uniformly over j = 1, …, N + 1.
Assumption 2.1(i) ensures that the error terms are contemporaneously uncorrelated with each of the regressors, and that the process has finite and constant unconditional moments. One can think of s_{T,t} in Assumption 2.1(ii) as an underlying shock process driving the regressors and errors in z_t, where we assume z_t to depend almost entirely on the “near epoch” of s_{T,t}.¹
Near-epoch dependence of z_t can be interpreted as z_t being “approximately” mixing, in the sense that it can be well-approximated by a mixing process. The NED
framework in Assumption 2.1 therefore allows for very general forms of dependence
that are often encountered in econometrics applications including, but not limited to,
strong mixing processes (McLeish, 1975), linear processes including ARMA models,
various types of stochastic volatility and GARCH models (Hansen, 1991a), and nonlinear processes (Davidson, 2002a). Moreover, NED holds in cases where mixing has
well-known failures for common processes, such as the AR(1) process discussed in
Andrews (1984). These properties have made NED a very popular tool for modelling
dependence in econometrics (Davidson, 2002b, Sections 14, 17).2
To our knowledge, our work in this chapter is the first to utilize the NED framework for establishing uniformly valid high-dimensional inference. Wong et al. (2020)
consider time series models with β-mixing errors, which has the advantage of allowing for general forms of dynamic misspecification resulting in serially correlated error
terms, but, as discussed above, rules out several relevant data generating processes,
and is in addition typically difficult to verify. Alternative approaches that avoid mixing assumptions are found in Babii et al. (2019), who consider τ-dependence, as well
as Wu and Wu (2016) and Chernozhukov et al. (2021), who use functional dependence for modeling the dependence allowed in regressors and innovations. Finally,
Masini et al. (2021) use an m.d.s. assumption on the innovations in combination with
sub-Weibull tails and a mixingale assumption on the conditional covariance matrix.
The m.d.s. assumption of Medeiros and Mendes (2016) and Masini et al. (2021), however, does not allow for dynamic misspecification of the full model. Importantly, the
NED assumption on ut does allow for misspecified models as well, in which case we
view β 0 as the coefficients of the pseudo-true model when restricting the class of
models to those linear in xt . In particular, it allows one to view (2.1) as simply the
¹ Since z_t grows asymptotically in dimension, it is natural to let the dimension of s_{T,t} grow with T, though this is not theoretically required. Although, like s_{T,t}, technically our stochastic process z_t is a triangular array due to dimension N increasing with T, in the remainder of the chapter we suppress the dependence on T for notational convenience.
² To make the chapter self-contained, we include formal definitions of NED and mixingales in Appendix 2.A.1.
linear projection of yt on all the variables in xt , with β 0 in that case representing
the corresponding best linear projection coefficients. In such a case E [ut ] = 0 and
E [ut xj,t ] = 0 hold by construction, and the additional conditions of Assumption 2.1
can be shown to hold under weak further assumptions. On the other hand, ut is not
likely to be an m.d.s. in that case. As will be explained later, allowing for misspecified
dynamics is crucial for developing the theory for the nodewise regressions underlying
the desparsified lasso.
It is important to note that we do not consider β 0 as the projection coefficients
of the (lasso) selected model, but only of the full, pseudo-true, model. Our approach
simply allows for the possibility of the full model being misspecified, for instance
if the econometrician has missed relevant confounders in the initial dataset. This
does not imply a “failure” of our lasso inference method, but rather a failure of the
econometrician in setting up the initial model.3 Allowing for such misspecification
is crucial for the nodewise regressions we consider in Section 2.4 which are simply
projections of one explanatory variable on all the others, and therefore inherently
misspecified.
We further elaborate on misspecification in Example 2.3, after we present two
examples of correctly specified common econometric time series DGPs.
Remark 2.1. The NED order m and sequence size −d play a key role in later theorems, where they enter the asymptotic rates. In Assumption 2.1(i), we require z_t to have 2m̄ moments, with m̄ being slightly larger than m. The more moments, the tighter the error bounds and the weaker the conditions on the tuning parameter, but a high m̄ implies stronger restrictions on the model (see e.g., the GARCH parameters in Example 2.1 below). Additionally, there is a tradeoff between the thickness of the tails allowed for and the amount of dependence – measured through the mixing rate in Assumption 2.1(ii). Under strong dependence, fewer moments are needed; the reduction from m̄ to m then reflects the price one needs to pay for allowing more dependence through a smaller mixing rate.
³ Of course, the misspecification may be intentional, as even in dynamically misspecified models, the parameter of interest can still have a structural meaning. One example is the local projections of Jordà (2005), where h-step ahead predictive regressions with generally serially correlated error terms are performed.

Example 2.1 (ARDL model with GARCH errors). Consider the autoregressive distributed lag (ARDL) model with GARCH errors

y_t = ∑_{i=1}^p ρ_i y_{t−i} + ∑_{i=0}^q θ_i′ w_{t−i} + u_t = x_t′ β^0 + u_t,

u_t = √(h_t) ε_t,  ε_t ∼ IID(0, 1),  h_t = π_0 + π_1 h_{t−1} + π_2 u_{t−1}²,

where the roots of the lag polynomial ρ(z) = 1 − ∑_{i=1}^p ρ_i z^i are outside the unit circle. Take ε_t, π_1 and π_2 such that E ln(π_1 + π_2 ε_t²) < 0; then u_t is a strictly stationary geometrically β-mixing process (Francq and Zakoïan, 2010, Theorem 3.4), and additionally such that E|u_t|^{2m̄} < ∞ for some m̄ ∈ N (the number of moments depends on π_1, π_2 and the moments of ε_t, cf. Francq and Zakoïan, 2010, Example 2.3). Also assume that the vector of exogenous variables w_t is stationary and geometrically β-mixing as well, with finite 2m̄ moments. Given the invertibility of the lag polynomial, we may then write y_t = ρ^{−1}(L)v_t, where v_t = ∑_{i=0}^q θ_i′ w_{t−i} + u_t and the inverse lag polynomial ρ^{−1}(z) has geometrically decaying coefficients. Then it follows directly that y_t is NED on v_t, where v_t is strong mixing of size −∞ as its components are geometrically β-mixing, and the sum inherits the mixing properties. Furthermore, if ∥θ_i∥_1 ≤ C for all i = 0, …, q, it follows directly from Minkowski's inequality that E|v_t|^{2m̄} ≤ C and consequently E|y_t|^{2m̄} ≤ C. Then y_t is NED of size −∞ on (w_t, u_t), and consequently z_t = (y_{t−1}, w_t, u_t) is as well.
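The strict stationarity condition in Example 2.1 can be checked numerically. The sketch below is my own illustration: it pairs π_2 with the squared innovation, matching the recursion h_t = π_0 + (π_1 + π_2 ε_{t−1}²) h_{t−1}, and then simulates the GARCH errors to show that u_t is serially uncorrelated while u_t² is autocorrelated — exactly the conditional heteroskedasticity the assumptions are designed to allow.

```python
import numpy as np

rng = np.random.default_rng(0)
pi0, pi1, pi2 = 0.1, 0.85, 0.1   # coefficients on 1, h_{t-1}, u_{t-1}^2

# Monte Carlo check of the strict stationarity (Lyapunov) condition
eps = rng.standard_normal(10**6)
lyap = np.mean(np.log(pi1 + pi2 * eps**2))
print(lyap)   # negative, so u_t is strictly stationary

# Simulate the GARCH(1,1) error process u_t = sqrt(h_t) * eps_t
T = 100_000
u = np.zeros(T)
h = np.zeros(T)
h[0] = pi0 / (1 - pi1 - pi2)     # unconditional variance as starting value
for t in range(1, T):
    h[t] = pi0 + pi1 * h[t - 1] + pi2 * u[t - 1] ** 2
    u[t] = np.sqrt(h[t]) * rng.standard_normal()

# Serially uncorrelated levels, autocorrelated squares:
ac_u = np.corrcoef(u[1:], u[:-1])[0, 1]
ac_u2 = np.corrcoef(u[1:] ** 2, u[:-1] ** 2)[0, 1]
print(ac_u, ac_u2)
```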
Example 2.2 (Equation-by-equation VAR). Consider the vector autoregressive model

y_t = ∑_{i=1}^p Φ_i y_{t−i} + u_t,

where y_t is a K × 1 vector of dependent variables, E|u_t|^{2m̄} ≤ C, and the K × K matrices Φ_i satisfy appropriate stationarity and 2m̄-th order summability conditions. The equivalent equation-by-equation representation is

y_{k,t} = ∑_{i=1}^p [Φ_{k,1,i}, …, Φ_{k,K,i}] y_{t−i} + u_{k,t} = (y_{t−1}′, …, y_{t−p}′) β_k + u_{k,t},  k ∈ {1, …, K}.

Assuming a well-specified model with E[u_t | y_{t−1}, …, y_{t−p}] = 0, the conditions of Assumption 2.1 are then satisfied trivially.
Examples 2.1 and 2.2 demonstrate that Assumption 2.1 is sufficiently general to
include common time series models in econometrics. While these examples are equally
well covered by other commonly used assumptions such as the martingale difference
sequence (m.d.s.) framework chosen in Medeiros and Mendes (2016) or Masini et al.
(2021), we opt for the more general NED framework, as it additionally covers many
relevant cases – in particular for our nodewise regressions – where properties such as
m.d.s. fail. The following examples provide simple illustrations of these cases.
Example 2.3 (Misspecified AR model). Consider an autoregressive (AR) model of order 2,

y_t = ρ_1 y_{t−1} + ρ_2 y_{t−2} + v_t,  v_t ∼ IID(0, 1),

where E|v_t|^{2m̄} ≤ C and the roots of 1 − ρ_1 L − ρ_2 L² are outside the unit circle. Define the misspecified model y_t = ρ̃ y_{t−1} + u_t, where

ρ̃ = arg min_ρ E[(y_t − ρ y_{t−1})²] = E[y_t y_{t−1}] / E[y_{t−1}²] = ρ_1/(1 − ρ_2),

and u_t is autocorrelated. An m.d.s. assumption would be inappropriate in this case, as

E[u_t | σ{y_{t−1}, y_{t−2}, …}] = E[y_t − ρ̃ y_{t−1} | σ{y_{t−1}, y_{t−2}, …}] = −(ρ_1 ρ_2/(1 − ρ_2)) y_{t−1} + ρ_2 y_{t−2} ≠ 0.

However, it can be shown that (y_{t−1}, u_t)′ satisfies Assumption 2.1(ii) by considering the moving average representation of y_t and, by extension, of u_t = y_t − ρ̃ y_{t−1}. As the coefficients are geometrically decaying, u_t is clearly NED on v_t and Assumption 2.1(ii) is satisfied.
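Example 2.3 is easy to verify by simulation. The following sketch (my own illustration) fits an AR(1) by OLS to data generated from an AR(2), recovers the pseudo-true coefficient ρ_1/(1 − ρ_2), and confirms that the resulting pseudo-true errors are autocorrelated, so an m.d.s. assumption on u_t would indeed fail.

```python
import numpy as np

rng = np.random.default_rng(1)
rho1, rho2, T = 0.5, 0.3, 200_000   # stationary AR(2) parameters
v = rng.standard_normal(T)
y = np.zeros(T)
for t in range(2, T):
    y[t] = rho1 * y[t - 1] + rho2 * y[t - 2] + v[t]

# OLS of y_t on y_{t-1} alone recovers the pseudo-true coefficient
# rho_tilde = rho1 / (1 - rho2) of the misspecified AR(1) model.
rho_tilde = (y[1:] @ y[:-1]) / (y[:-1] @ y[:-1])
print(rho_tilde, rho1 / (1 - rho2))   # both close to 0.714

# The pseudo-true errors u_t = y_t - rho_tilde * y_{t-1} are autocorrelated,
# so a martingale difference assumption on u_t fails, while NED still holds.
u = y[1:] - rho_tilde * y[:-1]
ac_u = (u[1:] @ u[:-1]) / (u @ u)
print(ac_u)                           # clearly nonzero (negative here)
```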
The key condition to apply the lasso successfully is that the parameter vector β^0 is (at least approximately) sparse. We formulate this in Assumption 2.2 below.

Assumption 2.2. For some 0 ≤ r < 1 and sparsity level s_r, define the N-dimensional sparse compact parameter space

B_N(r, s_r) := {β ∈ R^N : ∥β∥_r^r ≤ s_r, ∥β∥_∞ ≤ C, ∃C < ∞},

and assume that β^0 ∈ B_N(r, s_r).
Assumption 2.2 implies that β^0 is sparse with the degree of sparsity governed by
both r and sr . Without further assumptions on r and sr , Assumption 2.2 is not
binding, but as will be seen later, the allowed rates will interact with other DGP
parameters creating binding conditions. Assumption 2.2 generalizes the common
assumption of exact sparsity taking r = 0 (see e.g., Medeiros and Mendes, 2016;
van de Geer et al., 2014; Chernozhukov et al., 2021; Babii et al., 2019), which assumes
that there are only a few (at most s0 ) non-zero components in β 0 , to weak sparsity
(see e.g., van de Geer, 2019). This allows us to have many non-zero elements in the
parameter vector, as long as they are sufficiently small. It follows directly from the
formulation in Assumption 2.2 that, given the compactness of the parameter space,
exact sparsity of order s0 implies weak sparsity with r > 0 of the same order (up to a
fixed constant). In general, the smaller r is, the more restrictive the assumption. The
relaxation to weak sparsity is straightforward and follows from elementary inequalities
(see e.g., Section 2.10 of van de Geer, 2016 and the proof of Lemma 2.A.7).
Example 2.4 (Infinite order AR). Consider an infinite order autoregressive model

y_t = ∑_{j=1}^∞ ρ_j y_{t−j} + ε_t,

where ε_t is a stationary m.d.s. with sufficient moments existing, and the lag polynomial 1 − ∑_{j=1}^∞ ρ_j L^j is invertible and satisfies the summability condition ∑_{j=1}^∞ j^a |ρ_j| < ∞ for some a ≥ 0. One might consider fitting an autoregressive approximation of order P to y_t,

y_t = ∑_{j=1}^P β_j y_{t−j} + u_t,

as it is well known that if P is sufficiently large, the best linear predictors β_j will be close to the true coefficients ρ_j (see e.g., Kreiss et al., 2011, Lemma 2.2). To relate the summability condition above to the weak sparsity condition, note that by Hölder's inequality we have that

∥β∥_r^r = ∑_{j=1}^P (j^a |β_j|)^r j^{−ar} ≤ (∑_{j=1}^P j^a |β_j|)^r (∑_{j=1}^P j^{−ar/(1−r)})^{1−r} ≤ C max{P^{1−(a+1)r}, 1}.

The constant comes from bounding the first term by the convergence of β_j to ρ_j plus the summability of the latter, while the second term involving P follows from Lemma 5.1 of Phillips and Solo (1992).⁴ As such, summability conditions on lag polynomials imply weak sparsity conditions, where the strength of the summability condition (measured through a) and the required strictness of the sparsity (measured through r) determine the order s_r of the sparsity. Therefore, weak sparsity – unlike exact sparsity – can accommodate sparse sieve estimation of infinite-order, appropriately summable, processes, providing an alternative to least-squares estimation of lower order approximations. For VAR models we can apply the same reasoning, with the addition that appropriate row sparsity is needed for the coefficients in the row of interest of the VAR if the number of series increases with the sample size.

⁴ As the same lemma shows, one should in fact treat the case r = 1/(a + 1) separately, in which case a bound of order (ln P)^{a/(a+1)} holds.
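The bound ∥β∥_r^r ≤ C max{P^{1−(a+1)r}, 1} from Example 2.4 can be checked numerically. This is my own sketch: it takes hypothetical coefficients β_j = j^{−(a+1.1)}, which satisfy the summability condition ∑_j j^a |β_j| < ∞, and verifies that the ratio of the weak sparsity norm to the bound stays bounded as P grows.

```python
import numpy as np

a, r = 1.0, 0.3
ratios = []
for P in [10**2, 10**4, 10**6]:
    j = np.arange(1.0, P + 1)
    beta = j ** -(a + 1.1)                  # ensures sum_j j^a |beta_j| < infinity
    weak_norm = np.sum(np.abs(beta) ** r)   # ||beta||_r^r
    bound = max(P ** (1 - (a + 1) * r), 1.0)
    ratios.append(weak_norm / bound)
    print(P, ratios[-1])                    # ratio stays bounded as P grows
```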
For λ ≥ 0, define the weak sparsity index set

S_λ := {j : |β_j^0| > λ} with cardinality |S_λ|,  (2.2)

and complement set S_λ^c = {1, …, N} \ S_λ. With an appropriate choice of λ, this set contains all ‘sufficiently large’ coefficients; for λ = 0 it contains all non-zero parameters. We need this set in the following condition, which formulates the standard compatibility conditions needed for lasso consistency (see e.g., Bühlmann and van de Geer, 2011, Chapter 6).
Assumption 2.3. Let Σ := (1/T) ∑_{t=1}^T E[x_t x_t′]. For a general index set S with cardinality |S|, define the compatibility constant

ϕ_Σ²(S) := min_{{z ∈ R^N \ 0 : ∥z_{S^c}∥_1 ≤ 3∥z_S∥_1}} (|S| z′Σz) / ∥z_S∥_1².

Assume that ϕ_Σ²(S_λ) ≥ 1/C, which implies that

∥z_{S_λ}∥_1² ≤ |S_λ| z′Σz / ϕ_Σ²(S_λ) ≤ C |S_λ| z′Σz,

for all z satisfying ∥z_{S_λ^c}∥_1 ≤ 3∥z_{S_λ}∥_1 ≠ 0.
The compatibility constant in Assumption 2.3 is bounded from below by the minimum eigenvalue of Σ, so this condition is considerably weaker than assuming Σ to be positive definite. We formulate the compatibility condition in Assumption 2.3 on the population covariance matrix rather than directly on the sample covariance matrix Σ̂ := X′X/T; see e.g., the restricted eigenvalue condition in Medeiros and Mendes (2016) or Assumption (A2) in Chernozhukov et al. (2021). Verifying this assumption on the population covariance matrix is generally more straightforward than directly on the sample covariance matrix.⁵
Finally, note that the compatibility assumption for the weak sparsity index set Sλ
is weaker than (and implied by) its equivalent for S_0, see Lemma 6.19 in Bühlmann
and van de Geer (2011), and that the strictness of this assumption depends on the
choice of the tuning parameter λ.
⁵ Though note that Basu and Michailidis (2015) show in their Proposition 3.1 that the restricted eigenvalue condition holds with high probability under general time series conditions when x_t is a stable process with full-rank spectral density and T is sufficiently large. Their Proposition 4.2 includes a stable VAR process as an example.
2.3 Error Bound and Consistency for the Lasso
In this section, we derive a new error bound for the lasso in a high-dimensional time series model. The lasso estimator (Tibshirani, 1996) of the parameter vector β^0 in Model (2.1) is given by

β̂ := arg min_{β ∈ R^N} { ∥y − Xβ∥_2²/T + 2λ∥β∥_1 },  (2.3)

where y = (y_1, …, y_T)′ is the T × 1 response vector, X = (x_1, …, x_T)′ the T × N design matrix and λ > 0 a tuning parameter. Optimization problem (2.3) adds a penalty term to the least squares objective to penalize parameters that are different from zero.
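Optimization problem (2.3) is commonly solved by cyclic coordinate descent, where each coordinate update is a soft-thresholding step; the thresholding is what sets many coefficients exactly to zero. The following minimal numpy sketch is my own illustration (not the desla implementation) of this standard algorithm on a toy sparse design.

```python
import numpy as np

def lasso(X, y, lam, n_iter=100):
    """Cyclic coordinate descent for the objective in eq. (2.3):
    min_b ||y - X b||_2^2 / T + 2 * lam * ||b||_1.
    Each coordinate update soft-thresholds the partial residual correlation."""
    T, N = X.shape
    b = np.zeros(N)
    col_ss = (X ** 2).sum(axis=0) / T
    for _ in range(n_iter):
        for j in range(N):
            # correlation of column j with the partial residual (b_j left out)
            rho = X[:, j] @ (y - X @ b + X[:, j] * b[j]) / T
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_ss[j]
    return b

# Variable selection demo: only 3 of 50 true coefficients are nonzero.
rng = np.random.default_rng(0)
T, N = 200, 50
X = rng.standard_normal((T, N))
beta0 = np.zeros(N)
beta0[:3] = [1.0, -2.0, 1.5]
y = X @ beta0 + rng.standard_normal(T)
b_hat = lasso(X, y, lam=0.15)
print(np.sum(np.abs(b_hat) > 1e-8))   # far fewer than 50 nonzero coefficients
```

The large coefficients are recovered up to the familiar shrinkage bias of order λ, which is precisely the bias the desparsified lasso of Section 2.4 removes.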
When deriving this error bound, one typically requires that λ is chosen sufficiently large to exceed the empirical process max_j |(1/T) ∑_{t=1}^T x_{j,t} u_t| with high probability. To this end, we define the set

E_T(z) := {max_{j≤N, l≤T} |∑_{t=1}^l u_t x_{j,t}| ≤ z},

and establish the conditions under which P(E_T(Tλ/4)) → 1. In addition, since we formulate the compatibility condition in Assumption 2.3 on the population covariance matrix, we need to show that Σ and Σ̂ are sufficiently close under the DGP assumptions. To this end, we define the set

CC_T(S) := {∥Σ̂ − Σ∥_max ≤ C/|S|},

and show that P(CC_T(S_λ)) → 1. Theorem 2.1 then presents both results.
Theorem 2.1. Let Assumptions 2.1 to 2.3 hold, and assume that

0 < r < 1:  λ ≥ C ln(ln(T))^{(d+m−1)/(r(dm+m−1))} s_r^{1/r} [N^{(2/d + 2/(m−1))}/√T]^{1/(rm(1/d + 1/(m−1)))},

r = 0:  s_0 ≤ C ln(ln(T))^{−(d+m−1)/(dm+m−1)} [N^{(2/d + 2/(m−1))}/√T]^{−1/(m(1/d + 1/(m−1)))}  and  λ ≥ C ln(ln(T))^{1/m} N^{1/m}/√T.  (2.4)

When N, T are sufficiently large, P(E_T(Tλ/4) ∩ CC_T(S_λ)) ≥ 1 − C ln(ln(T))^{−1}.
Theorem 2.1 thus establishes that the sets E_T(Tλ/4) and CC_T(S_λ) hold with high probability. Each set has a condition under which its probability converges to 1, which follow from Lemmas 2.A.3 and 2.A.4 respectively. For the set E_T(Tλ/4), the condition λ ≥ C ln(ln(T))^{1/m} N^{1/m}/√T is required. The ln(ln(T)) appearing throughout the theorem is chosen arbitrarily as a sequence which grows slowly as T → ∞; we only need some sequence tending to infinity sufficiently slowly. The details can be
found in the proof of Theorem 2.1. For the set CC_T(S_λ), we need to distinguish the cases 0 < r < 1 and r = 0 due to the way the size of the sparsity index set in eq. (2.2) is bounded. For 0 < r < 1, a lower bound on λ is imposed which is stricter than the one for the empirical process, hence only that bound appears in Theorem 2.1. For r = 0, the conditions do not depend on λ, hence both bounds appear in Theorem 2.1.
Theorem 2.1 directly yields an error bound for the lasso in high-dimensional time
series models by standard arguments in the literature, see e.g., Chapter 2 of van de
Geer (2016). The proofs of Lemmas 2.A.6 and 2.A.7 in Section 2.C.1 provide details.
Corollary 2.1. Under Assumptions 2.1 to 2.3 and the conditions of Theorem 2.1, when N, T are sufficiently large, the following holds with probability at least 1 − C(ln ln T)^{−1}:

(i) (1/T)∥X(β̂ − β^0)∥_2² ≤ Cλ^{2−r} s_r,

(ii) ∥β̂ − β^0∥_1 ≤ Cλ^{1−r} s_r.
Under the additional assumption that $\lambda^{1-r} s_r \to 0$, these error bounds directly establish prediction and estimation consistency. The bounds in Theorem 2.1 thereby put implicit limits on the divergence rates of $N$ and $s_r$ relative to $T$. In particular, the term offsetting the divergence in $N$ and $s_r$ is of polynomial order in $T$. The order of the polynomial, and therefore the restriction on the growth of $N$ and $s_r$, is determined by the number of moments $m$ and the dependence parameter $d$: the higher the number of moments $m$ and the larger the dependence parameter $d$, the fewer restrictions one has on the allowed polynomial growth of $N$ and $s_r$. In the limit, if $m$ and $d$ tend to infinity (all moments exist and the data are mixing), the order of the polynomial restriction on $N$ tends to infinity, thereby approaching exponential growth. A similar trade-off between the allowed growth of $N$ and the existence of moments was found in Medeiros and Mendes (2016). In Example 2.C.1 we study in greater detail how the different rates interact, thereby providing an overview of the restrictions under different scenarios.
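The consistency statement in Corollary 2.1(ii) can be illustrated numerically. The sketch below is an illustration only; the design, dimensions, and the penalty level of order $\sqrt{\ln N / T}$ are our own choices, not quantities taken from the chapter:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Illustration of the estimation-error bound in Corollary 2.1(ii): for an
# exactly sparse beta^0 (r = 0), the lasso l1-error shrinks as T grows.
rng = np.random.default_rng(1)
N = 50
beta0 = np.zeros(N)
beta0[:3] = [1.0, -0.5, 0.5]
errors = []
for T in (100, 800):
    X = rng.standard_normal((T, N))
    y = X @ beta0 + 0.5 * rng.standard_normal(T)
    lam = 0.5 * np.sqrt(np.log(N) / T)   # an assumed penalty of order sqrt(ln N / T)
    beta_hat = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_
    errors.append(np.abs(beta_hat - beta0).sum())
print(errors)  # the l1-error for T = 800 is smaller than for T = 100
```

Here scikit-learn's objective $\frac{1}{2T}\|y - Xb\|_2^2 + \alpha\|b\|_1$ corresponds, up to a factor 2, to the lasso objective of eq. (2.3) with $\alpha = \lambda$.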
While Corollary 2.1 is a useful result in its own right, it is vital for deriving the theoretical results for the desparsified lasso, which we turn to next.

2.4 Uniformly Valid Inference via the Desparsified Lasso

We use the desparsified lasso to perform uniformly valid inference in general high-dimensional time series settings. After briefly reviewing the desparsified lasso, we formulate the assumptions needed in Section 2.4.1. The asymptotic theory is then derived in Section 2.4.2 for inference on low-dimensional parameters of interest, and in Section 2.4.3 for inference on high-dimensional parameters.
The desparsified lasso (van de Geer et al., 2014) is defined as

$$
\hat{b} := \hat{\beta} + \frac{\hat{\Theta} X'(y - X\hat{\beta})}{T},
\tag{2.5}
$$

where $\hat{\beta}$ is the lasso estimator from eq. (2.3) and $\hat{\Theta} := \hat{\Upsilon}^{-2}\hat{\Gamma}$ is a reasonable approximation for the inverse of $\hat{\Sigma}$. By de-sparsifying the initial lasso, the bias in the lasso estimator is removed and uniformly valid inference can be obtained. The matrix $\hat{\Gamma}$ is constructed using nodewise regressions: each column of $X$ is regressed on all other explanatory variables using the lasso. Let the lasso estimates of the $j = 1, \dots, N$ nodewise regressions be

$$
\hat{\gamma}_j := \underset{\gamma_j \in \mathbb{R}^{N-1}}{\arg\min}\left\{\frac{\|x_j - X_{-j}\gamma_j\|_2^2}{T} + 2\lambda_j\|\gamma_j\|_1\right\},
\tag{2.6}
$$

where the $T \times (N-1)$ matrix $X_{-j}$ is $X$ with its $j$th column removed. Their components are given by $\hat{\gamma}_j = \{\hat{\gamma}_{j,k} : k \in \{1,\dots,N\}\setminus j\}$. Stacking these estimated parameter vectors row-wise with ones on the diagonal gives the matrix
$$
\hat{\Gamma} :=
\begin{pmatrix}
1 & -\hat{\gamma}_{1,2} & \dots & -\hat{\gamma}_{1,N}\\
-\hat{\gamma}_{2,1} & 1 & \dots & -\hat{\gamma}_{2,N}\\
\vdots & \vdots & \ddots & \vdots\\
-\hat{\gamma}_{N,1} & -\hat{\gamma}_{N,2} & \dots & 1
\end{pmatrix}.
$$

We then take $\hat{\Upsilon}^{-2} := \mathrm{diag}\left(1/\hat{\tau}_1^2, \dots, 1/\hat{\tau}_N^2\right)$, where $\hat{\tau}_j^2 := \frac{1}{T}\left\|x_j - X_{-j}\hat{\gamma}_j\right\|_2^2 + 2\lambda_j\left\|\hat{\gamma}_j\right\|_1$.
We use the index set $H \subseteq \{1,\dots,N\}$ with cardinality $h = |H|$ to denote the set of variables whose coefficients we wish to perform inference on. In this case, computational gains can be obtained with respect to the nodewise regressions, as we only need to obtain the sub-vector of the desparsified lasso corresponding to $\hat{b}_H := \hat{\beta}_H + \hat{\Theta}_H X'(y - X\hat{\beta})/T$, with the subscript $H$ indicating that we only take the respective rows of $\hat{\beta}$ and $\hat{\Theta}$. To compute $\hat{\Theta}_H$, one only needs to compute $h$ nodewise regressions instead of $N$, which can be a considerable reduction for small $h$ relative to large $N$.
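Eqs. (2.5) and (2.6) translate directly into code. The sketch below is a minimal illustration, not the chapter's implementation: it uses scikit-learn's Lasso (whose objective $\frac{1}{2T}\|y - Xb\|_2^2 + \alpha\|b\|_1$ matches eq. (2.3) up to a factor 2 when $\alpha = \lambda$), and the penalty levels `lam` and `lam_j` are assumed inputs rather than the theoretically prescribed choices:

```python
import numpy as np
from sklearn.linear_model import Lasso

def desparsified_lasso(y, X, H, lam, lam_j):
    """Desparsified lasso (eq. 2.5) for the coefficients indexed by H,
    using nodewise lasso regressions (eq. 2.6); lam and lam_j are
    assumed penalty levels for the initial and nodewise regressions."""
    T = X.shape[0]
    beta_hat = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_
    resid = y - X @ beta_hat
    b_hat, tau2 = {}, {}
    for j in H:
        X_mj = np.delete(X, j, axis=1)
        gamma_j = Lasso(alpha=lam_j, fit_intercept=False).fit(X_mj, X[:, j]).coef_
        v_j = X[:, j] - X_mj @ gamma_j
        # tau_j^2 as defined below eq. (2.6)
        tau2[j] = v_j @ v_j / T + 2 * lam_j * np.abs(gamma_j).sum()
        # the j-th entry of Theta_hat X'(y - X beta_hat)/T reduces to
        # v_j' resid / (T tau_j^2), so only the h nodewise fits are needed
        b_hat[j] = beta_hat[j] + v_j @ resid / (T * tau2[j])
    return b_hat, tau2
```

Only the rows $j \in H$ of $\hat{\Theta}$ are ever formed, which is exactly the computational saving described above.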
2.4.1 Assumptions
Consider the population nodewise regressions defined by the linear projections

$$
x_{j,t} = x_{-j,t}'\gamma_j^0 + v_{j,t}, \qquad \gamma_j^0 := \underset{\gamma}{\arg\min}\, E\left[\frac{1}{T}\sum_{t=1}^{T}\left(x_{j,t} - x_{-j,t}'\gamma\right)^2\right], \qquad j = 1,\dots,N,
\tag{2.7}
$$

with $\tau_j^2 := \frac{1}{T}\sum_{t=1}^{T} E\left[v_{j,t}^2\right]$. Note that by construction, it holds that $E[v_{j,t}] = 0$, $\forall t, j$, and $E[v_{j,t} x_{k,t}] = 0$, $\forall t, k \neq j$. We first present Assumptions 2.4 and 2.5, which allow us to extend Corollary 2.1 to the nodewise lasso regressions.
Assumption 2.4. Let $\max\limits_{1\leq j\leq N,\, 1\leq t\leq T} E|v_{j,t}|^{2\bar{m}} \leq C$.

Assumption 2.5.

(i) For some $0 \leq r < 1$ and sparsity levels $s_r^{(j)}$, let $\gamma_j^0 \in \mathbb{B}_{N-1}(r, s_r^{(j)})$, $\forall j \in H$.

(ii) Let $\max\limits_{1\leq j\leq N} \sigma_{j,j} \leq C$ and $\Lambda_{\min} \geq 1/C$, where $\Lambda_{\min}$ is the smallest eigenvalue of $\Sigma$.
Assumption 2.4 requires the errors $v_{j,t}$ from the nodewise linear projections to have bounded moments of an order greater than four. By the properties of NED processes, we use Assumptions 2.1 and 2.4 to establish mixingale properties of the products $v_{j,t}u_t =: w_{j,t}$ and $w_{j,t}w_{k,t-l}$ in Lemma 2.B.2, which are used extensively in the derivation of the desparsified lasso's asymptotic distribution.

Assumption 2.5(i), similar to Assumption 2.2, requires weak sparsity of the nodewise regressions rather than exact sparsity. The latter could be problematic, as it would imply that many of the regressors are uncorrelated. In contrast, weak sparsity is a plausible alternative, see e.g., Example 2.4. Importantly, the weak sparsity of the nodewise regressions is fully determined by the model and hence should be verified. Below, we provide concrete examples where the weak sparsity assumption holds.
Assumption 2.5(ii) requires the population covariance matrix to be positive definite, with its smallest eigenvalue bounded away from zero, and to have finite variances. Assumption 2.5(ii) implies the compatibility condition and thus replaces Assumption 2.3 in Section 2.3, with $\Lambda_{\min}$ fulfilling the role of $\phi_\Sigma^2$. It also implies that the explanatory variables, including the irrelevant ones, cannot be linear combinations of each other, even as we let the number of variables tend to infinity. Although this is a considerable strengthening of Assumption 2.3, it is important to realize that this assumption is still made on the population matrix instead of the sample version, and may therefore still hold in fairly general, high-dimensional models. For example, Basu and Michailidis (2015) provide a lower bound for $\Lambda_{\min}$ in VAR models in their Proposition 2.3, which can be shown to be bounded away from zero under realistic conditions, see also Masini et al. (2021, p. 6). Similarly, this assumption can be shown to hold in factor models under minimal assumptions on the idiosyncratic errors (see Example 2.5 below).
Example 2.5 (Sparse factor model). Consider the factor model

$$
y_t = \beta^{0\prime} x_t + u_t, \quad u_t \sim IID(0, 1), \qquad x_t = \underset{N\times k}{\Lambda}\underset{k\times 1}{f_t} + \nu_t, \quad \nu_t \sim IID(0, \Sigma_\nu), \quad f_t \sim IID(0, \Sigma_f),
$$

where $\Lambda$ has bounded elements, $\Sigma_f$ and $\Sigma_\nu$ are positive definite with bounded eigenvalues, and $\nu_t$ and $f_t$ are uncorrelated. In this DGP,

$$
\Sigma = \Lambda\Sigma_f\Lambda' + \Sigma_\nu \implies \Theta = \Sigma_\nu^{-1} - \Sigma_\nu^{-1}\Lambda\left(\Sigma_f^{-1} + \Lambda'\Sigma_\nu^{-1}\Lambda\right)^{-1}\Lambda'\Sigma_\nu^{-1}.
$$
As shown in Appendix 2.C.5, the sparsity of the nodewise regression parameters can be bounded as

$$
\max_j \left\|\gamma_j^0\right\|_r^r \leq \left\|\Sigma_\nu^{-1}\right\|_r^r\left(1 + C\left\|\Sigma_\nu^{-1}\right\|_r^r \left\|\Lambda\right\|_r^r\, k^{2-r/2}\, N^{-ar}\right),
$$

where $N^a$ is the rate at which the $k$-th largest eigenvalue of $\Sigma$ diverges. This result allows for weak factor models where $a < 1$, which have been proposed as a theoretical explanation for the often observed empirical phenomenon that the separation between the eigenvalues of the Gram matrix is not as large as the strong factor model with $a = 1$ implies (cf. De Mol et al., 2008; Onatski, 2012; Uematsu and Yamagata, 2022a,b).
The bound on the nodewise regressions further depends on the number of factors, the sparsity of the factor loadings, and the sparsity of $\Sigma_\nu^{-1}$. Sparse factor loadings are intimately linked to weak factor models, and may provide accurate descriptions of the data in various economic and financial applications, see Uematsu and Yamagata (2022a,b) and Appendix 2.C.5 for details.

Sparsity in $\Sigma_\nu^{-1}$ holds when the idiosyncratic components are not too strongly cross-sectionally dependent, which is a standard assumption in factor models. It occurs for instance for block-diagonal structures of $\Sigma_\nu$, in which case $\left\|\Sigma_\nu^{-1}\right\|_r^r \leq Cb$, where $b$ is the size of the largest $b \times b$ block matrix with $b^2$ nonzero elements, or for Toeplitz structures $\sigma_{\nu\,i,j} = \rho^{|i-j|}$, $|\rho| < 1$, in which case $\left\|\Sigma_\nu^{-1}\right\|_r^r \leq C$. Note that to satisfy the minimum eigenvalue condition (Assumption 2.5(ii)), we only need the minimum eigenvalue of $\Sigma_\nu$ to be bounded away from 0.
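The Toeplitz case can be checked numerically: the inverse of an AR(1)-type Toeplitz covariance matrix is exactly tridiagonal, so all entries beyond the first off-diagonal vanish and the sparsity measure stays bounded as the dimension grows. A small sketch (the dimension and $\rho$ are our own illustrative choices):

```python
import numpy as np

rho, n = 0.4, 30
idx = np.arange(n)
# Toeplitz covariance: Sigma_nu[i, j] = rho**|i - j|
sigma_nu = rho ** np.abs(idx[:, None] - idx[None, :])
theta = np.linalg.inv(sigma_nu)
# Entries beyond the first off-diagonal are (numerically) zero:
far_entries = theta[np.abs(idx[:, None] - idx[None, :]) > 1]
print(np.abs(far_entries).max())  # effectively zero: the inverse is tridiagonal
```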
Example 2.6 (Sparse VAR(1)). Consider a stationary VAR(1) model for $z_t = (y_t, x_t')'$,

$$
z_t = \Phi z_{t-1} + u_t, \qquad E u_t u_t' := \Omega, \qquad E u_t u_{t-l}' = 0, \ \forall l \neq 0,
$$

with our regression of interest being the first line of the VAR, that is $y_t = \phi_1 z_{t-1} + u_{1,t}$, where $\phi_j$ is the $j$th row of $\Phi$. Under this DGP, the nodewise regression parameters $\gamma_j^0$ are determined entirely by $\Phi$ and $\Omega$, and we now consider two cases for which we derive explicit results in Section 2.C.5.

(a) Let $\Phi$ be symmetric and block diagonal with largest block of size $b$. Assume that $\Phi$ has eigenvalues strictly between 0 and 1, and $\|\Phi\|_{\max} \leq C$. Furthermore, let $\Omega = I$. Then the nonzero entries of $\gamma_j^0$ follow the block structure of $\Phi$, such that $\max_j \left\|\gamma_j^0\right\|_0 \leq Cb$.

(b) Let $\Phi = \phi I$ with $|\phi| < 1$, and let $\Omega$ have a Toeplitz structure $\omega_{i,j} = \rho^{|i-j|}$, $|\rho| < 1$. Then $\gamma_j^0$ is only weakly sparse, in the sense that it contains no zeroes, but its entries follow a geometrically decaying pattern, meaning that $\max_j \left\|\gamma_j^0\right\|_r^r \leq C$.

More generally, sparsity of $\gamma_j^0$ requires that the autoregressive coefficient matrix $\Phi$ and the error covariance matrix $\Omega$ are row- and column-sparse in such a way that matrix multiplication preserves this sparsity. For case (a), we may relax the assumption on $\Omega$ to block-diagonality, provided the block structure is similar to that of $\Phi$. For case (b), the result holds even when we let $\Phi$ have a similar Toeplitz structure as $\Omega$, as we numerically investigate in Section 2.C.5. To verify the minimum eigenvalue condition in Assumption 2.5(ii), we may apply the bound derived in Masini et al. (2021, p. 6), which gives $\Lambda_{\min} \geq \Lambda_{\min}(\Omega)\left[1 + \left(\|\Phi\|_1 + \|\Phi\|_\infty\right)/2\right]^{-2}$, where $\Lambda_{\min}(\Omega)$ is the smallest eigenvalue of $\Omega$.
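Case (a) can be verified numerically: solving the discrete Lyapunov equation $\Sigma = \Phi\Sigma\Phi' + \Omega$ for the stationary covariance and computing the population linear-projection coefficients $\gamma_j^0 = \Sigma_{-j,-j}^{-1}\Sigma_{-j,j}$ shows that entries outside variable $j$'s block vanish. A sketch with our own illustrative block values:

```python
import numpy as np
from scipy.linalg import block_diag, solve_discrete_lyapunov

# Symmetric block-diagonal Phi with eigenvalues in (0, 1), Omega = I (case (a)).
block = 0.1 * np.ones((3, 3)) + 0.2 * np.eye(3)   # eigenvalues 0.5, 0.2, 0.2
phi = block_diag(block, block)                     # N = 6, two 3x3 blocks
sigma = solve_discrete_lyapunov(phi, np.eye(6))    # solves Sigma = Phi Sigma Phi' + I
# Population nodewise coefficients for variable j:
j = 0
gamma_j = np.linalg.solve(np.delete(np.delete(sigma, j, 0), j, 1),
                          np.delete(sigma[:, j], j))
print(np.round(gamma_j, 6))  # entries belonging to the second block are zero
```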
Remark 2.2. Alternative approaches exist that circumvent the need to directly impose weak sparsity assumptions on the nodewise regressions. Krampe et al. (2021) use the desparsified lasso for inference in the context of stationary VARs with IID errors, but do not use nodewise regressions to build an estimator of $\Theta$ as we do. Instead, they use the VAR model structure to derive an estimator based on regularized estimates of the VAR coefficients and the error covariances. Such an approach requires knowledge of the full model underlying the covariates to provide an analytical expression for the nodewise projections. While this is natural in a VAR model, it is considerably more difficult to apply in a more general setting, where the structure underlying the covariates is typically unknown. Moreover, they still require conditions on sparsity, which are similar to those found for the VAR model of Example 2.6, i.e., row- and column-sparsity of the VAR coefficient matrices in addition to sparsity of the inverse error covariance matrix.
Deshpande et al. (2021) use an online debiasing strategy for inference in VAR models with IID Gaussian errors, among other settings. Rather than using a single estimate of $\Theta$, they use a sequence of precision matrix estimates based on an episodic structure, which can be seen as a generalization of sample splitting. In addition, they use the precision matrix estimator of Javanmard and Montanari (2014), which does not require sparsity of $\Theta$. It is an interesting topic for future research to investigate whether these techniques can be leveraged in our setting, which allows for misspecification and potentially serially correlated or heteroskedastic errors.
Assumptions 2.4 and 2.5 allow us to apply Corollary 2.1 to the nodewise regressions. Specifically, if the conditions on $\lambda$ formulated in (2.4) hold for both $\underline{\lambda} := \min_{j\in H}\lambda_j$ and $\bar{\lambda} := \max_{j\in H}\lambda_j$, the error bounds (with $\bar{s}_r := \max_{j\in H} s_r^{(j)}$ substituted for $s_r$) apply to the nodewise regressions as well. As we generally need the error bounds to hold uniformly over all relevant nodewise regressions as well as the initial regression, we combine these bounds and state our results in terms of the quantities

$$
\lambda_{\min} = \min\{\lambda, \underline{\lambda}\}, \qquad \lambda_{\max} = \max\{\lambda, \bar{\lambda}\}, \qquad s_{r,\max} = \max\{s_r, \bar{s}_r\},
\tag{2.8}
$$

which simplifies many of the final expressions. While some conditions could be weakened if we kept them in terms of $\bar{\lambda}$ or $\bar{s}_r$ explicitly, this would come at the expense of more conditions and reduced readability, and therefore we opt against it.
2.4.2 Inference on low-dimensional parameters
In this section we establish the uniform asymptotic normality of the desparsified lasso, focusing on low-dimensional parameters of interest. We consider testing $P$ joint hypotheses of the form $R_N\beta^0 = q$ via a Wald statistic, where $R_N$ is an appropriate $P \times N$ matrix whose non-zero columns are indexed by the set $H := \left\{j : \sum_{p=1}^{P}|r_{N,p,j}| > 0\right\}$ of cardinality $h := |H|$. As can be seen from the lemmas in Appendix 2.B, all our results up to the application of the central limit theorem allow for $h$ to increase in $N$ (and therefore $T$). In Theorem 2.2 we first focus on inference on a finite set of parameters, such that we can apply a standard central limit theorem under the assumptions listed above. An alternative, high-dimensional approach under more stringent conditions is considered in Section 2.4.3.
Given our time series setting, the long-run covariance matrix

$$
\Omega_{N,T} = E\left[\frac{1}{T}\left(\sum_{t=1}^{T} w_t\right)\left(\sum_{t=1}^{T} w_t'\right)\right],
$$

where $w_t = (v_{1,t}u_t, \dots, v_{N,t}u_t)'$, enters the asymptotic distribution in Theorem 2.2. $\Omega_{N,T}$ can equivalently be written as $\Omega_{N,T} = \Xi(0) + \sum_{l=1}^{T-1}\left(\Xi(l) + \Xi'(l)\right)$, where $\Xi(l) = \frac{1}{T}\sum_{t=l+1}^{T} E\left[w_t w_{t-l}'\right]$.
Theorem 2.2. Let Assumptions 2.1 to 2.5 hold, and assume that the smallest eigenvalue of $\Omega_{N,T}$ is bounded away from 0. Furthermore, assume that $\lambda_{\max}^2 \leq (\ln\ln T)^{-1}\lambda_{\min}^r\left[\sqrt{T}\, s_{r,\max}\right]^{-1}$, and

$$
\begin{aligned}
0 < r < 1:&\quad \lambda_{\min} \geq (\ln\ln T)\, s_{r,\max}\left[\frac{N^{\left(\frac{2}{d}+\frac{2}{m-1}\right)}}{\sqrt{T}}\right]^{\frac{1/m}{\left(\frac{1}{d}+\frac{1}{m-1}\right)}\frac{1}{r}},\\
r = 0:&\quad s_{0,\max} \leq (\ln\ln T)^{-1}\left[\frac{\sqrt{T}}{N^{\left(\frac{2}{d}+\frac{2}{m-1}\right)}}\right]^{\frac{1/m}{\left(\frac{1}{d}+\frac{1}{m-1}\right)}}, \qquad \lambda_{\min} \geq (\ln\ln T)\frac{N^{1/m}}{\sqrt{T}}.
\end{aligned}
$$

Let $R_N \in \mathbb{R}^{P\times N}$ satisfy $\max_{1\leq p\leq P}\|r_{N,p}\|_1 \leq C$, where $r_{N,p}$ denotes the $p$-th row of $R_N$, and $P, h \leq C$. Then we have that

$$
\sqrt{T}\, R_N\left(\hat{b} - \beta^0\right) \xrightarrow{d} N(0, \Psi),
$$

uniformly in $\beta^0 \in \mathbb{B}_N(r, s_r)$, where $\Psi := \lim_{N,T\to\infty} R_N\Upsilon^{-2}\Omega_{N,T}\Upsilon^{-2}R_N'$ and $\Upsilon^{-2} := \mathrm{diag}\left(1/\tau_1^2, \dots, 1/\tau_N^2\right)$.
Remark 2.3. Unlike van de Geer et al. (2014), we do not require the regularization
parameters λj to have a uniform growth rate. We only control the slowest and fastest
converging λj (covered by λmax and λmin respectively) through convergence rates that
also involve N, T , and the sparsity sr,max . We provide a specific example of a joint
asymptotic setup for these quantities in Corollary 2.2.
Remark 2.4. Belloni et al. (2012) and Chernozhukov et al. (2018), among others, show that sample splitting can improve the convergence rates for the desparsified lasso in IID settings. The idea is to estimate the initial and nodewise regressions with two independent parts of the sample, and exploit this independence to efficiently bound certain terms in the proofs. Efficiency loss is then avoided by so-called cross-fitting, combining two estimators in which the roles of the two sub-samples are swapped. However, with time series data, naive sample splitting will not yield (asymptotically) independent subsamples. Instead, subsamples must be chosen carefully, leaving sufficiently large 'gaps' in between to ensure (at least asymptotic) independence. These ideas are explored in Lunde (2019) and Beutner et al. (2021), though for different purposes and dependence concepts. They could however provide a useful starting point for future research investigating the potential of sample splitting in the NED framework.
In order to estimate the asymptotic variance $\Psi$, we suggest estimating $\Omega_{N,T}$ with the long-run variance kernel estimator

$$
\hat{\Omega} = \hat{\Xi}(0) + \sum_{l=1}^{Q_T-1} K\left(\frac{l}{Q_T}\right)\left(\hat{\Xi}(l) + \hat{\Xi}'(l)\right),
\tag{2.9}
$$

where $\hat{\Xi}(l) = \frac{1}{T-l}\sum_{t=l+1}^{T}\hat{w}_t\hat{w}_{t-l}'$ with $\hat{w}_{j,t} = \hat{v}_{j,t}\hat{u}_t$. The kernel $K(\cdot)$ can be taken as the Bartlett kernel $K(l/Q_T) = 1 - \frac{l}{Q_T}$ (Newey and West, 1987), and the bandwidth $Q_T$ should increase with the sample size at an appropriate rate. A similar heteroskedasticity and autocorrelation consistent (HAC) estimator was considered by Babii et al. (2019), though under a different framework of dependence. In Theorem 2.3, we show that $\hat{\Psi} = R_N\left(\hat{\Upsilon}^{-2}\hat{\Omega}\hat{\Upsilon}^{-2}\right)R_N'$ is a consistent estimator of $\Psi$ in our NED framework.
Theorem 2.3. Take $\hat{\Omega}$ with $Q_T \to \infty$ as $T \to \infty$, such that $Q_T h^2\left(\sqrt{T}h^2\right)^{-\frac{1}{1/d + m/(m-2)}} \to 0$. Assume that

$$
\begin{aligned}
\lambda_{\max}^{2-r} &\leq (\ln\ln T)^{-1}\min\left\{\left[\sqrt{Q_T T}\, s_{r,\max}\right]^{-1},\ \left[Q_T h^{1/m} T^{1/m} s_{r,\max}\right]^{-1}\right\},\\
\lambda_{\max}^{2} &\leq (\ln\ln T)^{-1}\lambda_{\min}^{r}\min\left\{\left[Q_T^2 h^{3/m} T^{(3-m)/m} s_{r,\max}\right]^{-1},\ \left[Q_T^{2/3} h^{1/(3m)} T^{(m+1)/(3m)} s_{r,\max}\right]^{-1},\ \left[\sqrt{T} h^{2/m} s_{r,\max}\right]^{-1}\right\},
\end{aligned}
$$

and

$$
\begin{aligned}
0 < r < 1:&\quad \lambda_{\min} \geq (\ln\ln T)\, s_{r,\max}^2\left[\frac{(hN)^{\left(\frac{2}{d}+\frac{2}{m-1}\right)}}{\sqrt{T}}\right]^{\frac{1/m}{\left(\frac{1}{d}+\frac{1}{m-1}\right)}\frac{1}{r}},\\
r = 0:&\quad s_{0,\max} \leq (\ln\ln T)^{-1}\left[\frac{\sqrt{T}}{(hN)^{\left(\frac{2}{d}+\frac{2}{m-1}\right)}}\right]^{\frac{1/m}{\left(\frac{1}{d}+\frac{1}{m-1}\right)}}, \qquad \lambda_{\min} \geq (\ln\ln T)\frac{(hN)^{1/m}}{\sqrt{T}}.
\end{aligned}
$$

Furthermore, let $R_N \in \mathbb{R}^{P\times N}$ satisfy $\max_{1\leq p\leq P}\|r_{N,p}\|_1 \leq C$ and $P \leq Ch$. Then under Assumptions 2.1 to 2.5, uniformly in $\beta^0 \in \mathbb{B}_N(r, s_r)$,

$$
\left\|R_N\left(\hat{\Upsilon}^{-2}\hat{\Omega}\hat{\Upsilon}^{-2} - \Upsilon^{-2}\Omega_{N,T}\Upsilon^{-2}\right)R_N'\right\|_{\max} \xrightarrow{p} 0.
$$
Note that here we restrict RN such that the number of hypotheses P may not
grow faster than the number of parameters of interest h, but h may grow with T at a
controlled rate. Theorem 2.3 therefore allows for variance estimation of an increasing
number of estimators. We believe the restrictions on P are reasonable, as they apply
to the most commonly performed hypothesis tests in practice, such as joint significance
tests (where RN is the identity matrix), or tests for the equality of parameter pairs.
As a natural implication of Theorems 2.2 and 2.3, Corollary 2.2 gives an asymptotic distribution result for a quantity composed exclusively of estimated components.
Corollary 2.2. Let Assumptions 2.1 to 2.5 hold, and assume that the smallest eigenvalue of $\Omega_{N,T}$ is bounded away from 0, and $Q_T T^{-\frac{1}{2/d + 2m/(m-2)}} \to 0$ for some $Q_T \to \infty$. Further, assume that $\lambda \sim \lambda_{\max} \sim \lambda_{\min}$, and

$$
\begin{aligned}
0 < r < 1:&\quad (\ln\ln T)^{-1} s_{r,\max}^{1/r}\left[\frac{N^{\left(\frac{2}{d}+\frac{2}{m-1}\right)}}{\sqrt{T}}\right]^{\frac{1/m}{\left(\frac{1}{d}+\frac{1}{m-1}\right)}\frac{1}{r}} \leq \lambda \leq \left[\ln\ln T\, Q_T^2\sqrt{T}\, s_{r,\max}\right]^{-1/(2-r)},\\
r = 0:&\quad (\ln\ln T)^{-1}\frac{N^{1/m}}{\sqrt{T}} \leq \lambda \leq \left[\ln\ln T\, Q_T^2\sqrt{T}\, s_{0,\max}\right]^{-1/2}.
\end{aligned}
$$

These bounds are feasible when $Q_T^r\, s_{r,\max}\, N^{(2-r)\frac{d+m-1}{dm+m-1}}\, T^{\frac{1}{4}\left(r - \frac{d(m-1)(2-r)}{dm+m-1}\right)} \to 0$, and additionally when $Q_T^2\, s_{0,\max}\frac{N^{2/m}}{\sqrt{T}} \to 0$ if $r = 0$. Under these conditions, for $R_N \in \mathbb{R}^{P\times N}$ with $\max_{1\leq p\leq P}\|r_{N,p}\|_1 \leq C$ and $P, h \leq C$, we have that

$$
\sup_{\beta^0 \in \mathbb{B}_N(r,s_r)}\ \sup_{1\leq p\leq P,\, z\in\mathbb{R}}\left|P\left(\frac{\sqrt{T}\, r_{N,p}\left(\hat{b} - \beta^0\right)}{\sqrt{r_{N,p}\left(\hat{\Upsilon}^{-2}\hat{\Omega}\hat{\Upsilon}^{-2}\right)r_{N,p}'}} \leq z\right) - \Phi(z)\right| = o_p(1),
$$

$$
\sup_{\beta^0 \in \mathbb{B}_N(r,s_r)}\ \sup_{z\in\mathbb{R}}\left|P\left(\left[R_N\hat{b} - q\right]'\left[\frac{R_N\hat{\Upsilon}^{-2}\hat{\Omega}\hat{\Upsilon}^{-2}R_N'}{T}\right]^{-1}\left[R_N\hat{b} - q\right] \leq z\right) - F_P(z)\right| = o_p(1),
\tag{2.10}
$$

where $\Phi(\cdot)$ is the CDF of $N(0,1)$, $F_P(z)$ is the CDF of $\chi_P^2$, and $q \in \mathbb{R}^P$ is chosen to test a null hypothesis of the form $R_N\beta^0 = q$.
Corollary 2.2 allows one to perform a variety of hypothesis tests. For a significance test on a single variable $j$, for instance, take $R_N$ as the $j$th basis vector. Then, inference on $\beta_j^0$ of the form $\left|P\left(\frac{\sqrt{T}(\hat{b}_j - \beta_j^0)}{\sqrt{\hat{\omega}_{j,j}/\hat{\tau}_j^4}} \leq z\right) - \Phi(z)\right| = o_p(1)$, $\forall z \in \mathbb{R}$, can be obtained, where $\Phi(\cdot)$ is the standard normal CDF. One can then obtain standard confidence intervals $CI(\alpha) := \left[\hat{b}_j - z_{\alpha/2}\sqrt{\frac{\hat{\omega}_{j,j}/\hat{\tau}_j^4}{T}},\ \hat{b}_j + z_{\alpha/2}\sqrt{\frac{\hat{\omega}_{j,j}/\hat{\tau}_j^4}{T}}\right]$, where $z_{\alpha/2} := \Phi^{-1}(1-\alpha/2)$, with the property that $\sup_{\beta^0 \in \mathbb{B}(s_r)}\left|P\left(\beta_j^0 \in CI(\alpha)\right) - (1-\alpha)\right| = o_p(1)$.
For a joint test with $P$ restrictions on $h$ variables of interest of the form $R_N\beta^0 = q$, one can construct a Wald-type test statistic based on eq. (2.10) and compare it to the critical value $F_P^{-1}(1-\alpha)$. Note that these results can also be used to test nonlinear restrictions on the parameters via the Delta method (e.g., Casella and Berger, 2002, Theorems 5.5.23 and 5.5.28).
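Given the estimated components, the confidence interval above is a one-liner. A minimal sketch (the function and variable names are our own):

```python
import math

def confidence_interval(b_hat_j, omega_jj, tau2_j, big_t, z_crit=1.96):
    """CI(alpha) for one coefficient: b_hat_j +/- z * sqrt((omega_jj / tau_j^4) / T),
    where omega_jj is the j-th diagonal entry of Omega_hat and tau2_j is tau_hat_j^2."""
    se = math.sqrt(omega_jj / tau2_j**2 / big_t)
    return b_hat_j - z_crit * se, b_hat_j + z_crit * se
```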
As the bounds and convergence rates displayed in full generality in Corollary 2.2 may be hard to interpret, we investigate in Example 2.7 how the conditions of Corollary 2.2 can be satisfied in a simplified asymptotic setup, thereby illustrating how the different growth rates interact. As for Corollary 2.1, the conditions on $\lambda$ effectively require that $Q_T$, $N$, and $s_{r,\max}$ grow at a polynomial rate of $T$, which we exploit in Example 2.7 to simplify the conditions.
Example 2.7. The requirements of Corollary 2.2 are satisfied when $N \sim T^a$ for $a > 0$, $s_{r,\max} \sim T^b$ for $b > 0$, $Q_T \sim T^Q$ for an arbitrarily small $Q > 0$, and $\lambda \sim T^{-\ell}$ for

$$
\begin{aligned}
0 < r < 1:&\quad \frac{b + 1/2}{2-r} < \ell < \frac{\frac{1}{m}}{r\left(\frac{1}{d}+\frac{1}{m-1}\right)}\left[\frac{1}{2} - bm\left(\frac{1}{d}+\frac{1}{m-1}\right) - 2a\left(\frac{1}{d}+\frac{1}{m-1}\right)\right],\\
r = 0:&\quad \frac{b + 1/2}{2} < \ell < \frac{1}{2} - \frac{a}{m}.
\end{aligned}
$$
This choice of $\ell$ is feasible if

$$
\left(\frac{(4b+r)m}{2-r} + 4a\right)\left(\frac{1}{d} + \frac{1}{m-1}\right) < 1.
\tag{2.11}
$$
There is thus a limit on how fast $s_{r,\max}$ and $N$ can grow relative to $T$, and there exists a trade-off between the two: $s_{r,\max}$ can grow faster if we limit the growth rate of $N$, and vice versa. Moreover, for larger $r$, the conditions on the growth rate of $s_{r,\max}$ are stricter. The strictness of these bounds is additionally influenced by the number of moments $m$ and the NED size $-d$: the bounds become easier to satisfy when $m$ and $d$ are large.
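A small checker makes the restriction concrete; here eq. (2.11) is written as $((4b+r)m/(2-r) + 4a)(1/d + 1/(m-1)) < 1$, our reading of the condition, so treat the function as illustrative:

```python
def growth_rates_feasible(a, b, d, m, r):
    """Check the feasibility restriction of eq. (2.11) for
    N ~ T^a, s_{r,max} ~ T^b, NED size -d, and m moments."""
    dep = 1.0 / d + 1.0 / (m - 1)
    return ((4 * b + r) * m / (2 - r) + 4 * a) * dep < 1

# With b = 0, a = 1, r = 0 the condition reduces to 1/d + 1/(m-1) < 1/4:
print(growth_rates_feasible(a=1, b=0, d=10, m=8, r=0))  # 1/10 + 1/7 < 1/4 -> True
```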
Depending on the growth rates of $s_{r,\max}$ and $N$, inequality (2.11) may put stricter requirements on $m$ and $d$ than those in Assumption 2.1. For example, if we assume that $s_{r,\max}$ is asymptotically bounded ($b = 0$), and $N$ grows proportionally to $T$ ($a = 1$), then $m$ and $d$ should satisfy $\frac{1}{d} + \frac{1}{m-1} < \frac{1}{4}$. If, on the other hand, $m$ and $d$ are allowed to be arbitrarily large, such as when the data are mixing and sub-exponential, then we only need $b < \frac{1-r}{2}$, and we do not have an effective upper bound on $a$, implying that $N$ can grow at any polynomial rate of $T$. For a more general understanding of the restrictions imposed by eq. (2.11), Figure 2.1 shows feasible regions for different combinations of $a$, $b$, $d$, and $r$, as well as how many moments $m$ are needed in those cases.
Figure 2.1: Required moments m implied by eq. (2.11). Contours mark intervals
of 10 moments, and values above m = 100 are truncated to 100. Non-shaded areas
indicate infeasible regions.
2.4.3 Inference on high-dimensional parameters
The reason for considering $h \leq C$ in Theorem 2.2 lies entirely in the application of the central limit theorem. However, while inference on a finite set of parameters covers many cases of interest in practice, it does not allow for simultaneous inference on all parameters. We therefore next consider inference on a growing number of parameters (or hypotheses). We follow the approach pioneered by Chernozhukov et al. (2013) of considering tests which can be formulated as a maximum over individual tests, and apply a high-dimensional CLT for the maximum of a random vector of increasing length. Zhang and Wu (2017) and Zhang and Cheng (2018) provide such a CLT for high-dimensional time series, with serial dependence characterized through the functional dependence framework of Wu (2005), while Chernozhukov et al. (2019) derive a similar result under general β-mixing conditions. In more recent work, Chang et al. (2021) derive a high-dimensional CLT for α-mixing processes, on which we base our result. Recalling that a process which is NED on an α-mixing process can be well-approximated by a mixing process, this mixing condition remains conceptually close to, if more stringent than, our NED framework.⁶ We therefore build on their results to provide distributional results for high-dimensional inference in Corollary 2.3. While the core of the proof directly follows by applying the CLT of Chang et al. (2021), one still needs to integrate this with the results from Theorem 2.3 on the consistency of the covariance matrix, as well as adapt the CLT to our estimators. We therefore believe it is worthwhile to state this as a formal result in Corollary 2.3. Correspondingly, we now strengthen our assumptions as follows.
Assumption 2.6.

(i) Let $z_t$ be uniformly $\alpha$-mixing with mixing coefficients satisfying $\alpha_T(q) \leq C_1\exp\left(-C_2 q^K\right)$ for some $K > 0$ and all $q \geq 1$.

(ii) Let there exist sequences $d_{u,T}$, $d_{v,T}$, $D_T = d_{u,T}d_{v,T} \geq 1$ such that $\|u_t\|_{\psi_2} \leq d_{u,T}$ and $\|m'v_t\|_{\psi_2} \leq d_{v,T}$, $\forall m \in \mathbb{R}^N : \|m\|_1 \leq C$, where $\|x\|_{\psi_2} := \inf\left\{c > 0 : E\left[\exp\left((x/c)^2\right) - 1\right] \leq 1\right\}$.
Assumption 2.6(i) implies Assumption 2.1(ii). Assumption 2.1(ii) states that the NED process $z_t$ can be well-approximated by an $\alpha$-mixing process; clearly this holds when it is itself $\alpha$-mixing. More specifically, the sequence is NED on itself, such that Assumption 2.1(ii) is satisfied for any positive $d$. Furthermore, the exponential decay of the $\alpha$-mixing coefficients is stricter than our restrictions on $s_{T,t}$. Similarly, the sub-Gaussian moments in Assumption 2.6(ii) imply that all finite moments in Assumption 2.1(i) and Assumption 2.4 exist, so $m$ may be arbitrarily large.
Corollary 2.3. Let Assumptions 2.1 to 2.6 hold, and let $h \sim T^H$ for $H > 0$, $N \sim T^a$ for $a > 0$, $s_{r,\max} \sim T^b$ for $0 < b < \frac{1-r}{2}$, $Q_T \sim T^Q$ for $0 < Q < 2/3$, and $\lambda_{\min} \sim \lambda_{\max} \sim \lambda \sim T^{-\ell}$, where

$$
0 < r < 1:\ \frac{b + 1/2}{2-r} < \ell < \frac{1/2 - b}{r}, \qquad r = 0:\ \frac{b + 1/2}{2} < \ell < \frac{1}{2}.
$$

Additionally, let the smallest eigenvalue of $\Omega_{N,T}$ be bounded away from 0, and

$$
\frac{D_T^{2/3}(\ln T)^{(1+2K)/(3K)}}{T^{1/9}} + \frac{D_T(\ln T)^{7/6}}{T^{1/9}} \to 0.
$$

Then, for $1/C \leq \max_{1\leq p\leq P}\|r_{N,p}\|_1 \leq C$ and $P \leq Ch$,

$$
\sup_{z\in\mathbb{R},\,\beta^0\in\mathbb{B}_N(r,s_r)}\left|P\left(\max_{1\leq p\leq P}\sqrt{T}\, r_{N,p}\left(\hat{b} - \beta^0\right) \leq z\right) - P^*\left(\max_{1\leq p\leq P}\hat{g}_p \leq z\right)\right| = o_p(1),
$$

where $\hat{g}$ is a $P$-dimensional vector which is distributed as $N\left(0, R_N\hat{\Upsilon}^{-2}\hat{\Omega}\hat{\Upsilon}^{-2}R_N'\right)$ conditionally on the data, and $P^*$ is the corresponding conditional probability.

⁶ Ideally one would directly have a high-dimensional CLT available for NED processes, such that it would directly fit our assumptions. However, such a result is, to our knowledge, currently not available in the literature. While such a result would clearly be very interesting to obtain, this is left for future research given the intricacies needed to derive it.
Unlike Corollary 2.2, Corollary 2.3 allows one to simultaneously test a growing number of hypotheses while controlling the family-wise error rate, for example by the stepdown method described in Section 5 of Chernozhukov et al. (2013). One such test is an overall test of significance, with the null hypothesis $\beta^0 = 0$; in this case $P = h = N$ and $R_N = I$. Note that although $P^*\left(\max_{1\leq p\leq P}\hat{g}_p \leq z\right)$ cannot be calculated analytically, it can easily be approximated with arbitrary accuracy by simulation.
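Such a simulation approximation is a few lines of code; a minimal sketch (the function name and defaults are our own choices):

```python
import numpy as np

def max_gaussian_quantile(cov, alpha=0.05, n_sim=100_000, seed=0):
    """Approximate the (1 - alpha) quantile of max_p g_p for
    g ~ N(0, cov), as needed for the simultaneous critical value."""
    rng = np.random.default_rng(seed)
    g = rng.multivariate_normal(np.zeros(cov.shape[0]), cov, size=n_sim)
    return np.quantile(g.max(axis=1), 1 - alpha)
```

Feeding in the estimated covariance $R_N\hat{\Upsilon}^{-2}\hat{\Omega}\hat{\Upsilon}^{-2}R_N'$ gives the conditional critical value for the max statistic.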
Due to the stronger assumptions in Corollary 2.3, we can relax the conditions on the growth rates of $N$ and $s_{r,\max}$ compared to Corollary 2.2 and Example 2.7. In particular, the sizes of $a$ and $H$ are not restricted, meaning that $N$ and $h$ can grow at an arbitrarily large polynomial rate of $T$. The conditions on $s_{r,\max}$ can also be relaxed, so that it can grow at a rate up to $\sqrt{T}$, depending on $r$. This corresponds to our analysis in Example 2.7 when we let $m$ and $d$ tend to infinity.
2.5 Analysis of Finite-Sample Performance
We analyze the finite sample performance of the desparsified lasso by means of simulations. We start by discussing tuning parameter selection in Section 2.5.1. We
then discuss three simulation settings: a high-dimensional autoregressive model with
exogenous variables (in Section 2.5.2), a factor model (in Section 2.5.3), and a weakly
sparse VAR model (in Section 2.5.4). In Section 2.5.2 and Section 2.5.3, we compute
coverage rates of confidence intervals for single hypothesis tests. In Section 2.5.4, we
perform a multiple hypothesis test for Granger causality.
2.5.1 Tuning parameter selection
While the previous sections give some theoretical restrictions on the tuning parameter
choice, these results cannot be used in practice since its value depends on properties
of the underlying model that are unobservable. In this section, we provide a feasible
recommendation to select the tuning parameters (in both the original regression and
nodewise regressions) in a data-driven way.
In particular, we adapt the iterative plug-in (PI) procedure used in, for instance, Belloni et al. (2012, 2014, 2017) to a time series setting. We build on the theoretical relation between the tuning parameter and the empirical process in Theorem 2.1, namely the restriction that $\frac{1}{T}\left\|X'u\right\|_\infty \leq C\lambda$ needs to hold with high probability, to guide the choice of $\lambda$. For large $N$ and $T$, $\frac{1}{T}\left\|X'u\right\|_\infty$ can be approximated by the maximum over an $N$-dimensional multivariate Gaussian distribution with covariance matrix $\Omega_{N,T}^{(E)} = E\left[\frac{1}{T}X'uu'X\right]$.⁷ One may therefore approximate its quantiles by simulating from a multivariate Gaussian with covariance matrix a consistent estimate $\hat{\Omega}^{(E)}$ of $\Omega_{N,T}^{(E)}$.
Our time series setting requires the use of a consistent long-run variance estimator, which is provided by Theorem 2.3. We therefore take $\hat{\Omega}^{(E)}$ as in eq. (2.9) with $\hat{\Xi}^{(E)}(l) = \frac{1}{T-l}\sum_{t=l+1}^{T} x_t\hat{u}_t\hat{u}_{t-l}x_{t-l}'$. We set the number of lags in the long-run covariance estimator by the automatic bandwidth estimator of Andrews (1991), specifically $Q_T = 1.1447\left(\hat{\alpha}(1)T\right)^{1/3}$, with $\hat{\alpha}(1)$ computed based on an AR(1) model, as detailed in eq. (6.4) therein. As the estimates $\hat{u}_t$ require a choice of $\lambda$, we iterate the algorithm until the chosen $\lambda$ converges. Full details are provided in Algorithm 2.1, Appendix 2.C.6. Throughout all simulations, the lasso estimates are obtained through the coordinate descent algorithm (Friedman et al., 2010) applied to standardized data.
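The iteration can be sketched as follows. This is a simplified illustration of the plug-in idea, not Algorithm 2.1 itself: for brevity it uses only the zero-lag covariance of $x_t\hat{u}_t$ (ignoring the kernel terms and the Andrews bandwidth), and the scaling constant `c` is an assumption of the example:

```python
import numpy as np
from sklearn.linear_model import Lasso

def plugin_lambda(y, X, alpha=0.05, n_sim=2000, n_iter=5, c=1.1, seed=0):
    """Iterative plug-in lambda: simulate the (1 - alpha) quantile of
    ||X'u/T||_inf from a Gaussian with estimated covariance of x_t*u_t,
    then refit the lasso and update the residuals."""
    rng = np.random.default_rng(seed)
    big_t, n = X.shape
    u = y - y.mean()                     # initial residuals
    lam = 0.0
    for _ in range(n_iter):
        xu = X * u[:, None]              # rows x_t * u_t
        omega = xu.T @ xu / big_t        # zero-lag covariance estimate
        g = rng.multivariate_normal(np.zeros(n), omega, size=n_sim)
        # max |N(0, Omega)| / sqrt(T) approximates the quantile of ||X'u/T||_inf
        lam = c * np.quantile(np.abs(g).max(axis=1), 1 - alpha) / np.sqrt(big_t)
        u = y - X @ Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_
    return lam
```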
Remark 2.5. We opt to base our empirical choice of $\lambda$ only on its relation to the empirical process, and hence the set $\mathcal{E}_T(\cdot)$ in Theorem 2.1, not on its relation to the set $\mathcal{CC}_T(S_\lambda)$, which also implies a lower bound on $\lambda$. The latter bound, however, requires one to approximate $\|\hat{\Sigma} - \Sigma\|_{\max}$, which is considerably more difficult, as it cannot be approximated by plugging in estimated quantities directly. With eigenvalue assumptions typically stated in terms of the sample rather than the population, this kind of additional restriction may be avoided, but such assumptions often still need to be justified by showing that the sample covariance matrix is close to the population matrix. As the additional bound only appears under weak sparsity ($r > 0$), it can also be avoided by assuming exact sparsity. However, given that weak sparsity may often be the more relevant concept in practice, it may well be that the extra restriction on $\lambda$ from bounding $\|\hat{\Sigma} - \Sigma\|_{\max}$ is relevant beyond this chapter. Investigating ways to incorporate this in the tuning parameter selection therefore seems an interesting avenue for future research.

⁷ Under minimal extra assumptions (sub-Gaussian moments for $x_t$, and the minimum eigenvalue of the long-run covariance matrix bounded away from 0), Corollary 2.3 substantiates the validity of this approximation.
2.5.2 Autoregressive model with exogenous variables
Inspired by the simulation studies in Kock and Callot (2015) (Experiment B) and
Medeiros and Mendes (2016), we take the following DGP
$$
y_t = \rho y_{t-1} + \beta' x_{t-1} + u_t, \qquad x_t = A_1 x_{t-1} + A_4 x_{t-4} + \nu_t,
$$

where $x_t$ is an $(N-1)\times 1$ vector of exogenous variables. In this simulation design (and the following ones), we consider different values of the time series length $T \in \{100, 200, 500, 1000\}$ and number of regressors $N \in \{101, 201, 501, 1001\}$. For this data generating process, we take $\rho = 0.6$, $\beta_j = \frac{1}{\sqrt{s}}(-1)^j$ for $j = 1,\dots,s$, and zero otherwise; we set $s = 5$ for $N = 101, 201$ and $s = 10$ for $N = 501, 1001$. The autoregressive parameter matrices $A_1$ and $A_4$ are block-diagonal with each block of dimension $5\times 5$. Within each matrix, all blocks are identical, with typical elements of 0.15 for $A_1$ and $-0.1$ for $A_4$. Due to the misspecification of the nodewise regressions, there is induced autocorrelation in the nodewise errors $v_{j,t}$. However, the block-diagonal structure of $A_1$ and $A_4$ keeps the sparsity of the nodewise regressions asymptotically constant.
We consider different processes for the error terms ut and ν t :
(A) IID errors: (u_t, ν_t′)′ ∼ IID N(0, I). Since all moments of the Normal distribution are finite, all moment conditions are satisfied.

(B) GARCH(1,1) errors: u_t = √(h_t) ε_t, h_t = 5 × 10^{−4} + 0.9h_{t−1} + 0.05u_{t−1}², ε_t ∼ IID N(0, 1), ν_{j,t} ∼ u_t for j = 1, . . . , N − 1. Under this choice of GARCH parameters, not all moments of u_t are guaranteed to exist, but E[u_t^{24}] < ∞.

(C) Correlated errors: ν_t ∼ IID N(0, S), where S has a Toeplitz structure S_{j,k} = (−1)^{|j−k|} ρ^{|j−k|+1}, with ρ = 0.4.
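To make the design above concrete, the following minimal numpy sketch simulates the IID-error version (design A). This is an illustration under the stated parameter choices, not the authors' simulation code; the function name `simulate_arx` is ours, and designs (B) and (C) would only change how the innovations are drawn.

```python
import numpy as np

def simulate_arx(T=200, N=101, s=5, rho=0.6, burn=100, seed=0):
    # Sketch of the DGP: y_t = rho*y_{t-1} + beta'x_{t-1} + u_t,
    #                    x_t = A1 x_{t-1} + A4 x_{t-4} + nu_t  (design A errors).
    rng = np.random.default_rng(seed)
    Nx = N - 1                        # number of exogenous variables
    beta = np.zeros(Nx)
    beta[:s] = [(-1) ** j / np.sqrt(s) for j in range(1, s + 1)]
    # Block-diagonal A1, A4: identical 5x5 blocks with entries 0.15 and -0.1.
    A1 = np.kron(np.eye(Nx // 5), 0.15 * np.ones((5, 5)))
    A4 = np.kron(np.eye(Nx // 5), -0.1 * np.ones((5, 5)))
    x = np.zeros((T + burn, Nx))
    y = np.zeros(T + burn)
    for t in range(4, T + burn):
        x[t] = A1 @ x[t - 1] + A4 @ x[t - 4] + rng.standard_normal(Nx)
        y[t] = rho * y[t - 1] + beta @ x[t - 1] + rng.standard_normal()
    return y[burn:], x[burn:]      # drop the burn-in sample

y, x = simulate_arx()
```

The burn-in period is our own choice to reduce dependence on the zero starting values.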
For all designs, we evaluate whether the 95% confidence intervals corresponding to ρ and β1 cover their true values at the correct rates. The intervals are constructed as

ρ̂ ± z_{0.025} √( (ω̂_{1,1}/τ̂_1⁴) / T )  and  β̂1 ± z_{0.025} √( (ω̂_{2,2}/τ̂_2⁴) / T ).

These results are obtained based on 2,000 replications. The rates at which the intervals contain the true values are reported in Table 2.1.
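The interval construction can be written as a one-line helper. This is a sketch: the inputs `omega_hat` and `tau_hat` stand in for the long-run variance and nodewise-residual variance estimates defined earlier in the chapter, and the values used below are hypothetical.

```python
from statistics import NormalDist

def despars_ci(b_hat, omega_hat, tau_hat, T, level=0.95):
    # b_hat +/- z_{alpha/2} * sqrt((omega_hat / tau_hat^4) / T)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)   # z_{0.025} ~ 1.96 at 95%
    half = z * (omega_hat / tau_hat ** 4 / T) ** 0.5
    return b_hat - half, b_hat + half

lo, hi = despars_ci(0.6, omega_hat=1.0, tau_hat=1.0, T=100)
```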
2 Desparsified Lasso in Time Series
Table 2.1: Autoregressive model with exogenous variables: 95% confidence interval
coverage. The mean interval widths are reported in parentheses.
                ρ                                      β1
Model  N \ T    100      200      500      1000       100      200      500      1000
A      101      0.958    0.953    0.951    0.948      0.809    0.731    0.751    0.843
                (0.366)  (0.220)  (0.113)  (0.070)    (0.383)  (0.257)  (0.152)  (0.102)
       201      0.965    0.955    0.959    0.955      0.790    0.720    0.721    0.802
                (0.387)  (0.224)  (0.116)  (0.071)    (0.388)  (0.258)  (0.154)  (0.103)
       501      0.937    0.950    0.955    0.952      0.850    0.786    0.773    0.770
                (0.418)  (0.238)  (0.129)  (0.081)    (0.399)  (0.260)  (0.165)  (0.113)
       1001     0.936    0.950    0.944    0.946      0.819    0.777    0.780    0.821
                (0.429)  (0.244)  (0.130)  (0.083)    (0.388)  (0.260)  (0.164)  (0.114)
B      101      0.961    0.957    0.953    0.941      0.797    0.735    0.760    0.839
                (0.374)  (0.219)  (0.115)  (0.071)    (0.390)  (0.261)  (0.153)  (0.102)
       201      0.949    0.959    0.954    0.959      0.810    0.726    0.721    0.817
                (0.387)  (0.227)  (0.117)  (0.073)    (0.398)  (0.260)  (0.156)  (0.103)
       501      0.951    0.960    0.953    0.954      0.838    0.796    0.759    0.775
                (0.425)  (0.241)  (0.130)  (0.082)    (0.400)  (0.263)  (0.165)  (0.114)
       1001     0.937    0.960    0.947    0.942      0.820    0.787    0.769    0.806
                (0.434)  (0.246)  (0.131)  (0.084)    (0.394)  (0.261)  (0.165)  (0.115)
C      101      0.964    0.960    0.956    0.943      0.936    0.887    0.902    0.911
                (0.410)  (0.231)  (0.121)  (0.080)    (0.628)  (0.394)  (0.232)  (0.166)
       201      0.975    0.965    0.968    0.964      0.917    0.899    0.901    0.900
                (0.421)  (0.239)  (0.123)  (0.081)    (0.646)  (0.398)  (0.233)  (0.166)
       501      0.969    0.965    0.951    0.948      0.950    0.935    0.892    0.903
                (0.457)  (0.260)  (0.129)  (0.081)    (0.665)  (0.420)  (0.243)  (0.168)
       1001     0.974    0.960    0.957    0.960      0.947    0.938    0.895    0.894
                (0.475)  (0.265)  (0.132)  (0.082)    (0.669)  (0.421)  (0.244)  (0.168)
We start by discussing the results for the model with Gaussian errors (Model A).
Coverage for ρ is close to the nominal level of 95% for all combinations of N and T ,
with some combinations producing slightly conservative results. The coverage rates for
β1 are worse than for ρ. This is likely due to the fact that the exogenous variables xt
within the same block are strongly correlated to each other which negatively impacts
the performance of the lasso.
Turning to the results for the model with GARCH errors (Model B), similar finite sample coverage rates are obtained. We do see a small increase in the mean interval width, which is to be expected given the heteroskedastic error structure.

With correlated errors (Model C), we again observe consistent coverage rates near the nominal level for ρ. Interestingly, the coverage rates for β1 appear considerably better than in Models A and B, though in most cases still remaining below the nominal rate at around 90%. We also observe higher mean interval widths than in Model A, which is due to the larger variance of x_t induced by the cross-sectional covariance of the errors.
In Appendix 2.C.7, we provide details on an examination of various selection
methods for tuning parameters through heat maps for the coverage levels, which also
shed some further light on the relatively poor performance for β1 compared to ρ
visible for models A and B. In addition to selection by our PI method, we indicate
selection by the BIC, the AIC, and the EBIC as in Chen and Chen (2012), with γ = 1.⁸ We summarize the main findings below. First, notice that there are regions
with coverage close to the nominal level in nearly all scenarios and combinations of
N and T , suggesting that good coverage could be achieved by selecting the tuning
parameters well. Second, across all scenarios, PI generally tends to result in coverage
rates closest to the nominal coverage of 95%. As expected, the AIC produces, overall,
the least sparse solutions, the EBIC the sparsest and BIC lies in between. PI lies
mostly between the BIC and EBIC. Third, there is a region of relatively low coverage
for large values of the tuning parameter in the initial and nodewise regressions (see
the top right corner of the heat maps). This is more pronounced for β1 than for ρ, and especially for T = 1000. Since PI tends to select near this region, this partly explains why its coverage is worse for β1. The relatively better coverage of β1 in Model
C is matched by this region being much less prominent. Given that the regions of
good coverage are in different places for ρ and β1 , using the BIC or EBIC for generally
smaller or larger λ would not lead to consistently better coverage across scenarios.9
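For concreteness, the information criteria compared above can be sketched as follows. This is our illustration using the standard Gaussian-likelihood form, with the combinatorial penalty of Chen and Chen (2012) for the EBIC; the function name `ic` and the exact variant used in the chapter's experiments are assumptions.

```python
from math import lgamma, log

def ic(rss, T, k, N, kind="bic", gamma=1.0):
    # Criteria for a fit with residual sum of squares `rss` and k nonzero
    # coefficients out of N candidates, based on T observations.
    base = T * log(rss / T)
    if kind == "aic":
        return base + 2 * k
    if kind == "bic":
        return base + k * log(T)
    if kind == "ebic":
        # EBIC = BIC + 2*gamma*log C(N, k), computed via log-gamma for stability
        log_binom = lgamma(N + 1) - lgamma(k + 1) - lgamma(N - k + 1)
        return base + k * log(T) + 2 * gamma * log_binom
    raise ValueError(kind)
```

With γ = 0 the EBIC reduces to the BIC, and larger γ penalizes dense models more heavily, matching the ordering (AIC least sparse, EBIC sparsest) observed above.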
2.5.3 Factor model
We take the following factor model
yt = β ′ xt + ut , ut ∼ IID N (0, 1)
xt = Λft + ν t , ν t ∼ IID N (0, I),
ft = 0.5ft−1 + εt , εt ∼ IID N (0, 1),
where xt is a N × 1 vector generated by the AR(1) factor ft . We take β as in
Section 2.5.2 with s increased by one to match the number of non-zero parameters.
The N × 1 vector of factor loadings Λ is chosen with the first s entries (corresponding
to the variables with non-zero entries in β) set to 0.5, and the remaining entries
Λi = (i − s + 1)−1 . This choice of weakly sparse factor loadings ensures that the
nodewise regressions are weakly sparse too, as shown in Example 2.5. By letting the
large loadings coincide with the non-zero entries in β, we ensure that there is a large
potential for incurring (omitted variable) bias in the estimates, and thus that this
DGP provides a serious test for the desparsified lasso.
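The factor DGP can be sketched in a few lines of numpy. Again this is our illustration rather than the authors' code; the function name is hypothetical, and the burn-in is our own choice.

```python
import numpy as np

def simulate_factor(T=200, N=101, s=6, burn=100, seed=0):
    # Sketch of the factor DGP: f_t = 0.5 f_{t-1} + eps_t,
    # x_t = Lambda f_t + nu_t,  y_t = beta'x_t + u_t.
    rng = np.random.default_rng(seed)
    beta = np.zeros(N)
    beta[:s] = [(-1) ** j / np.sqrt(s) for j in range(1, s + 1)]
    lam = np.empty(N)
    lam[:s] = 0.5                              # large loadings on the variables
    lam[s:] = 1.0 / (np.arange(s, N) - s + 2)  # with nonzero beta; thereafter
    f = 0.0                                    # Lambda_i = (i - s + 1)^{-1}
    x = np.empty((T + burn, N))
    y = np.empty(T + burn)
    for t in range(T + burn):
        f = 0.5 * f + rng.standard_normal()
        x[t] = lam * f + rng.standard_normal(N)
        y[t] = beta @ x[t] + rng.standard_normal()
    return y[burn:], x[burn:]

y, x = simulate_factor()
```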
We investigate whether the confidence interval for β1, β̂1 ± z_{0.025} √( (ω̂_{1,1}/τ̂_1⁴) / T ), covers the true value at the correct rate. Results are reported in Table 2.2. Coverage
8 For additional stability in the high-dimensional settings, we restrict the BIC, AIC, and EBIC
to only select models with at most T /2 nonzero parameters, though this restriction appears to be
binding for the AIC only.
9 To confirm this analysis, we also performed the simulations for all three setups using
selection of λ by BIC (the best performing information criterion); in line with the heat maps, the
coverage rates for BIC are generally somewhat worse than for PI. Results are available upon request.
Table 2.2: Factor model: 95% confidence interval coverage for β1 . The mean interval
widths are reported in parentheses.
N \ T    100      200      500      1000
101      0.890    0.851    0.889    0.907
         (0.480)  (0.299)  (0.163)  (0.112)
201      0.873    0.849    0.879    0.897
         (0.490)  (0.307)  (0.165)  (0.112)
501      0.956    0.940    0.890    0.910
         (0.489)  (0.327)  (0.180)  (0.117)
1001     0.951    0.943    0.881    0.896
         (0.498)  (0.331)  (0.184)  (0.117)
rates improve with growing values of N and T , with empirical coverages of approximately 85% for small N and T , and increasing towards the nominal level when either
N or T increases. This result is therefore in line with our theoretical framework, and
provides a relevant practical setting in which the desparsified lasso is appropriate to
use even if exact sparsity is not present.
2.5.4 Weakly sparse VAR(1)
Inspired by Kock and Callot (2015) (Experiment D), we consider the VAR(1) model

z_t = (y_t, x_t, w_t)′ = A₁ z_{t−1} + u_t,  u_t ∼ IID N(0, I),

with z_t a (N/2) × 1 vector. We focus on testing whether x_t Granger causes y_t by fitting a VAR(2) model, such that we have a total of N explanatory variables per equation. The (j, k)-th element of the autoregressive matrix is A₁^{(j,k)} = (−1)^{|j−k|} ρ^{|j−k|+1}, with ρ = 0.4. To measure the size of the test, we set A₁^{(1,2)} = 0; to measure the power of the test, we keep its regular value of −ρ². Weak sparsity holds¹⁰ under our choice of the autoregressive parameters, but exact sparsity is violated by having half of the parameters non-zero. Note that the desparsified lasso is convenient for estimating the full VAR equation-by-equation, since all equations share the same regressors, and Θ̂ needs to be computed only once. For our Granger causality test, however, only a single equation needs to be estimated.
We test whether x_t Granger causes y_t by regressing y_t on the first and second lag of z_t. To this end, we test the null hypothesis A₁^{(1,2)} = A₂^{(1,2)} = 0 by using the Wald test statistic in eq. (2.10), with b̂_H = (0, Â₁^{(1,2)}, 0 . . . 0, Â₂^{(1,2)}, 0 . . . 0)′, H = {2, N/2 + 1}, and Â₁^{(1,2)}, Â₂^{(1,2)} obtained by regressing y_t on (z_{t−1}′, z_{t−2}′)′. We reject the null hypothesis when the statistic exceeds χ²_{2,0.05} ≈ 5.99.

10 The weak sparsity measure is Σ_{j=1}^N |ρ^j|^r with asymptotic limit ρ^r/(1−ρ^r) < ∞, trivially satisfying B = 0.

Table 2.3: Weakly sparse VAR: Joint test rejection rates for a nominal size of α = 5%.

         Size                               Power
N \ T    100     200     500     1000      100     200     500     1000
102      0.050   0.070   0.070   0.073     0.415   0.751   0.982   1.000
202      0.062   0.075   0.081   0.078     0.411   0.775   0.987   1.000
502      0.051   0.067   0.106   0.076     0.401   0.776   0.990   1.000
1002     0.059   0.083   0.101   0.091     0.407   0.769   0.995   1.000
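The ingredients of this design are easy to verify numerically. The sketch below (function name ours) builds the Toeplitz-type slope matrix, computes the χ²₂ critical value in closed form (for 2 degrees of freedom, 1 − F(x) = exp(−x/2)), and checks footnote 10's weak sparsity limit for an assumed r = 1/2.

```python
import numpy as np
from math import log

def var_slope(half_n, rho=0.4, power=True):
    # (j,k) element of A1: (-1)^{|j-k|} * rho^{|j-k|+1}
    j, k = np.indices((half_n, half_n))
    d = np.abs(j - k)
    A1 = (-1.0) ** d * rho ** (d + 1.0)
    if not power:
        A1[0, 1] = 0.0          # null of no Granger causality: A1^{(1,2)} = 0
    return A1

A1 = var_slope(51)              # N = 102, so z_t is 51 x 1
crit = -2 * log(0.05)           # chi2(2) 5% critical value, approx 5.99

# weak sparsity measure sum_j |rho^j|^r, which converges to rho^r/(1-rho^r)
r = 0.5
ws = sum(abs(0.4 ** j) ** r for j in range(1, 200))
```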
We start by discussing the size of the test in Table 2.3. Overall, the empirical sizes
exceed the nominal size of 5%, with performance generally not improving for larger
sample sizes. In particular, rejection rates slightly deteriorate for larger N . However,
the observed changes in performance across N and T are rather small and may be
due to simulation randomness. The power of the test increases with both N and T ,
reaching 1 at T = 1000 regardless of the value for N .
To improve the finite-sample performance of the method, a natural extension
would be to consider the bootstrap for constructing confidence intervals as opposed
to asymptotic theory. Bootstrap-based inference for desparsified lasso methods in high
dimensions has already been explored by several authors, for example Dezeure et al.
(2017) in the IID setting, and in time series by Krampe et al. (2021), Chernozhukov
et al. (2019) and Chernozhukov et al. (2021). In particular, block or block multiplier
bootstrap methods, which would allow one to capture serial dependence nonparametrically, would fit our setup well. The block bootstrap has the additional advantage
of correcting the finite-sample performance of statistics based on long-run variance
estimators, which might be a factor for our tests as well (Gonçalves and Vogelsang,
2011). However, due to the lack of theory about such bootstrap methods, and the
associated selection of tuning parameters like the block length, for high-dimensional
NED processes, we do not consider such methods here. The development of such
theory would be a highly relevant and interesting topic for future research.
2.6 Conclusion
We provide a complete set of tools for uniformly valid inference in high-dimensional stationary time series settings, where the number of regressors N can possibly grow at a faster rate than the time dimension T. Our main results include (i) an error bound for the lasso under a weak sparsity assumption on the parameter vector, thereby establishing parameter and prediction consistency; (ii) the asymptotic normality of the desparsified lasso under a general set of conditions, leading to uniformly valid inference for finite subsets of parameters; (iii) asymptotic normality of a maximum-type statistic over a growing, potentially high-dimensional number of tests, valid under more stringent conditions, thereby also permitting simultaneous inference over a potentially large number of parameters; and (iv) a consistent Bartlett-kernel (Newey-West) long-run covariance estimator to conduct inference in practice.
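The Bartlett-kernel long-run variance estimator of item (iv) can be sketched for a scalar series as follows. The chapter's estimator targets the multivariate Ω̂ built from the products ŵ_{j,t}; this univariate version, including the demeaning step, is our simplification for illustration.

```python
import numpy as np

def newey_west_lrv(w, Q):
    # Bartlett-kernel (Newey-West) long-run variance:
    # xi(0) + 2 * sum_{l=1}^{Q-1} (1 - l/Q) * xi(l),
    # with xi(l) = (1/T) * sum_{t>l} w_t * w_{t-l}.
    w = np.asarray(w, dtype=float)
    w = w - w.mean()                     # demeaning: an illustration choice
    T = len(w)
    lrv = w @ w / T
    for l in range(1, Q):
        lrv += 2 * (1 - l / Q) * (w[l:] @ w[:-l]) / T
    return lrv
```

With Q = 1 the estimator reduces to the sample variance; larger bandwidths Q_T trade bias for variance, which is why its growth rate is restricted in the theory above.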
These results are established under very general conditions, thereby allowing for
typical settings encountered in many econometric applications where the errors may
be non-Gaussian, autocorrelated, heteroskedastic and weakly dependent. Crucially,
this allows for certain types of misspecified time series models, such as omitted lags
in an AR model.
Through a small simulation study, we examine the finite sample performance of
the desparsified lasso in popular types of time series models. We perform both single
and joint hypothesis tests and examine the desparsified lasso’s robustness to, amongst
others, regressors and error terms exhibiting serial dependence and conditional heteroskedasticity, and a violation of the sparsity assumption in the nodewise regressions.
Overall our results show that good coverage rates are obtained even when N and T
increase jointly. The factor model design shows that the desparsified lasso remains
applicable when the exact sparsity assumption of the nodewise regressions is violated.
Finally, Granger causality tests in the VAR are slightly oversized, but empirical sizes
generally remain close to the nominal sizes, and the test’s power increases with both
N and T .
There are several extensions to our approach that are interesting to consider. The
development of a high-dimensional central limit theorem for NED processes would
allow one to weaken the dependence conditions needed for establishing simultaneous,
high-dimensional inference. Similarly, using sample splitting would likely allow for
weakening sparsity assumptions. Finally, improvements in finite sample performance
may be achieved by bootstrap procedures. All of these extensions would require the
development of novel theory, and thus provide challenging but worthwhile avenues for
future research.
Acknowledgements
We thank the editor, associate editor and three referees for their thorough review
and highly appreciate their constructive comments which substantially improved the
quality of the manuscript.
The first and second author were financially supported by the Dutch Research
Council (NWO) under grant number 452-17-010. The third author was supported by
the European Union’s Horizon 2020 research and innovation programme under the
Marie Skłodowska-Curie grant agreement No 832671. Previous versions of this chapter
were presented at CFE-CM Statistics 2019, NESG 2020, Bernoulli-IMS One World
Symposium 2020, (EC)2 2020, and the 2021 Maastricht Workshop on Dimensionality
Reduction and Inference in High-Dimensional Time Series. We gratefully acknowledge
the comments by participants at these conferences. In addition, we thank Etienne
Wijler for helpful discussions. All remaining errors are our own.
Appendix 2.A Proofs for Section 2.3
This section provides the theory for the lasso consistency established in Section 2.3. We first
provide some definitions in Section 2.A.1 and preliminary lemmas in Section 2.A.2 which
are proved in the Supplementary Appendix 2.C.1. The proofs of the main results are then
provided in Section 2.A.3.
2.A.1 Definitions
Definition 2.A.1 (Near-Epoch Dependence, Davidson (2002b), ch. 17). Let there exist non-negative NED constants {c_t}_{t=−∞}^∞, an NED sequence {ψ_q}_{q=0}^∞ such that ψ_q → 0 as q → ∞, and a (possibly vector-valued) stochastic sequence {s_t}_{t=−∞}^∞ with F_{t−q}^{t+q} = σ{s_{t−q}, . . . , s_{t+q}}, such that {F_{t−q}^{t+q}}_{q=0}^∞ is an increasing sequence of σ-fields. For p > 0, the random variable {X_t}_{t=−∞}^∞ is L_p-NED on s_t if

[ E | X_t − E( X_t | F_{t−q}^{t+q} ) |^p ]^{1/p} ≤ c_t ψ_q,

for all t and q ≥ 0. Furthermore, we say {X_t} is L_p-NED of size −d on s_t if ψ_q = O(q^{−d−ε}) for some ε > 0.
Definition 2.A.2 (Mixingale, Davidson (2002b), ch. 16). Let there exist non-negative mixingale constants {c_t}_{t=−∞}^∞ and a mixingale sequence {ψ_q}_{q=0}^∞ such that ψ_q → 0 as q → ∞. For p ≥ 1, the random variable {X_t}_{t=−∞}^∞ is an L_p-mixingale with respect to the σ-algebra {F_t}_{t=−∞}^∞ if

( E[ |E( X_t | F_{t−q} )|^p ] )^{1/p} ≤ c_t ψ_q,
( E[ |X_t − E( X_t | F_{t+q} )|^p ] )^{1/p} ≤ c_t ψ_q,

for all t and q ≥ 0. Furthermore, we say {X_t} is an L_p-mixingale of size −d with respect to {F_t} if ψ_q = O(q^{−d−ε}) for some ε > 0. Note that the latter condition holds automatically when X_t is F_t-measurable, as is the case in this chapter. We use the same notation for the constants c_t and sequence ψ_q as with near-epoch dependence, since they play the same role in both types of dependence.
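As a worked example of Definition 2.A.1 (our illustration, with the σ-fields generated by the process's own innovations), a stationary AR(1) is L₂-NED of any size:

```latex
% X_t = \sum_{j \ge 0} \phi^j \varepsilon_{t-j}, \; |\phi| < 1, \;
% \varepsilon_t \sim \text{IID}(0, \sigma^2).
% Conditioning on \mathcal{F}_{t-q}^{t+q} = \sigma\{\varepsilon_{t-q}, \dots,
% \varepsilon_{t+q}\} keeps the innovations inside the window and replaces
% those outside by their mean of zero:
\left( \mathrm{E}\left| X_t
  - \mathrm{E}\!\left( X_t \mid \mathcal{F}_{t-q}^{t+q} \right) \right|^2
\right)^{1/2}
= \Big\| \sum_{j > q} \phi^j \varepsilon_{t-j} \Big\|_2
= \sigma \Big( \sum_{j > q} \phi^{2j} \Big)^{1/2}
= \frac{\sigma \, |\phi|^{q+1}}{\sqrt{1-\phi^2}} ,
```

so taking c_t = σ/√(1−φ²) and ψ_q = |φ|^{q+1} shows X_t is L₂-NED on ε_t of size −d for every d > 0, by the geometric decay of ψ_q.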
2.A.2 Preliminary results
Lemma 2.A.1. Under Assumption 2.1, for every j = 1, . . . , N, {u_t x_{j,t}} is an L_m-mixingale with respect to F_t = σ{z_t, z_{t−1}, . . .}, with non-negative mixingale constants c_t ≤ C and sequence ψ_q satisfying Σ_{q=1}^∞ ψ_q < ∞.
Lemma 2.A.2. Under Assumption 2.1, {x_{i,t}x_{j,t} − E x_{i,t}x_{j,t}} is L_m̄-bounded and an L_m-mixingale with respect to F_t = σ{z_t, z_{t−1}, . . .}, with non-negative mixingale constants c_t ≤ C, and mixingale sequences of size −d.
Lemma 2.A.3. Recall the set CC_T(S) := {∥Σ̂ − Σ∥_max ≤ C/|S|} and S_λ = {j : |β_j^0| > λ}. Under Assumptions 2.1 to 2.3, for a sequence η_T → 0 such that η_T ≤ N²/e, if the following is satisfied

λ^{−r} s_r ≤ C η_T^{(d+m−1)/(dm+m−1)} ( √T / N^{2/d+2/(m−1)} )^{1/(1/d+m/(m−1))},

then P(CC_T(S_λ)) ≥ 1 − 3η_T → 1 as N, T → ∞.
Lemma 2.A.4. Let E_T(z) := { max_{j≤N, l≤T} |Σ_{t=1}^l u_t x_{j,t}| ≤ z }. Under Assumption 2.1, we have for z > 0 that

P( E_T(z) ) ≥ 1 − C N ( √T / z )^m.
2
′
Lemma 2.A.5. Take an index set S with cardinality |S|. Assuming that
n ∥β S ∥1 ≤ C|S|β Σβ
o
N
holds for β ∈ R : ∥β S c ∥1 ≤ 3∥β S ∥1 , then on the set CC T (S) = ∥Σ̂ − Σ∥max ≤ C/|S|
∥β S ∥1 ≤ C
q
|S|β ′ Σ̂β,
for β ∈ RN : ∥β S c ∥1 ≤ 3∥β S ∥1 .
Lemma 2.A.6. Let Assumption 2.3 hold for an index set S, i.e. ϕ_Σ²(S) ≥ 1/C ⟹ ∥z_S∥₁² ≤ C|S| z′Σz. On the set E_T(Tλ/4) ∩ CC_T(S):

∥X(β̂ − β⁰)∥₂²/T + (λ/4) ∥β̂ − β⁰∥₁ ≤ C λ² |S| + (8/3) λ ∥β⁰_{S^c}∥₁.
Lemma 2.A.7. Under Assumptions 2.2 and 2.3, on the set CC_T(S_λ) ∩ E_T(Tλ/4),

∥X(β̂ − β⁰)∥₂²/T + (λ/4) ∥β̂ − β⁰∥₁ ≤ C λ^{2−r} s_r.
2.A.3 Proofs of the main results
Proof of Theorem 2.1. In this proof we combine the results of Lemmas 2.A.3 and 2.A.4. By applying Lemma 2.A.4 to the set E_T(Tλ/4), we have that P(E_T(Tλ/4)) ≥ 1 − CN(λ√T)^{−m}. Choose η_T such that N(λ√T)^{−m} ≤ η_T, meaning that

P( E_T(Tλ/4) ) ≥ 1 − η_T  when  λ ≥ C η_T^{−1/m} N^{1/m}/√T.

For Lemma 2.A.3, we need that η_T ≤ N²/e, which is true for sufficiently large N, T, since N diverges, and η_T converges with T → ∞. Then

P( CC_T(S_λ) ) ≥ 1 − η_T  when  λ^{−r} s_r ≤ C η_T^{(d+m−1)/(dm+m−1)} ( √T / N^{2/d+2/(m−1)} )^{1/(1/d+m/(m−1))}.

When 0 < r < 1, the required bound for the set E_T(Tλ/4) is dominated by the bound for CC_T(S_λ) when s_r does not converge to 0, i.e. s_r ≥ 1/C (when s_r → 0 these results are trivial). To show this, note that for m > 2, d ≥ 1,

( N^{2/d+2/(m−1)} / √T )^{1/(1/d+m/(m−1))} ≥ N^{1/m}/√T,  η_T^{−(d+m−1)/(r(dm+m−1))} ≥ η_T^{−1/m},  and 1/r > 1.

The result then follows by the union bound,

P( CC_T(S_λ) ∩ E_T(Tλ/4) ) ≥ 1 − (1 − P(CC_T(S_λ))) − (1 − P(E_T(Tλ/4))) ≥ 1 − Cη_T → 1

as N, T → ∞. The result of the theorem follows from choosing η_T = C(ln ln T)^{−1}. ■
Proof of Corollary 2.1. By Theorem 2.1, the set CC_T(S_λ) ∩ E_T(Tλ/4) holds with probability at least 1 − Cη_T, and so the error bound of Lemma 2.A.7 holds with the same probability. With the error bound, items (i) and (ii) follow straightforwardly. ■

Appendix 2.B Proofs for Section 2.4
This section provides the theory for the desparsified lasso established in Section 2.4. We first
provide some preliminary lemmas in Section 2.B.1 which are proved in the Supplementary
Appendix 2.C.2. The proofs of the main results are then provided in Section 2.B.2.
2.B.1 Preliminary results
Lemma 2.B.1. Under Assumptions 2.1 and 2.4, the following holds:

(i) E[v_{j,t}] = 0, ∀j; E[v_{j,t} x_{k,t}] = 0, ∀k ≠ j, t.

(ii) max_{1≤j≤N, 1≤t≤T} E[|v_{j,t} x_{j,t}|^m] ≤ C.

(iii) {v_{j,t} x_{k,t}} is an L_m-mixingale with respect to F_t^{(j)} = σ{v_{j,t}, x_{−j,t}, v_{j,t−1}, x_{−j,t−1}, . . .}, ∀k ≠ j, with non-negative mixingale constants c_t ≤ C and sequences ψ_q satisfying Σ_{q=1}^∞ ψ_q ≤ C.
Lemma 2.B.2. Let w_t = (w_{1,t}, . . . , w_{N,t})′ with w_{j,t} = v_{j,t} u_t. Under Assumptions 2.1 and 2.4 the following holds:

(i) {w_{j,t}} is L_m̄-bounded and an L_m-mixingale of size −d uniformly over j ∈ {1, . . . , N} with respect to F_t = σ{u_t, v_t, u_{t−1}, v_{t−1}, . . .}, with non-negative mixingale constants C₁ ≤ c_t ≤ C₂.

(ii) max_{1≤j,k≤N, 1≤t≤T} |E[w_{j,t} w_{k,t−l}]| ≤ Cϕ_l, where ϕ_l is a sequence of size −d, and the covariances are therefore absolutely summable.

(iii) For all l, {w_{j,t}w_{k,t−l} − E[w_{j,t}w_{k,t−l}]} is L_{m/2}-bounded and an L₁-mixingale of size −d uniformly over j, k ∈ {1, . . . , N} with respect to F_t, with non-negative mixingale constants c_t ≤ C.
Lemma 2.B.3. Recall the sets CC_T(S) := {∥Σ̂ − Σ∥_max ≤ C/|S|}, S_λ = {j : |β_j^0| > λ}, and S_{λ,j} := {k : |γ_{j,k}^0| > λ_j}. Under Assumptions 2.1 to 2.3, for a sequence η_T → 0 such that η_T ≤ N²/e, if the following is satisfied

λ_min^{−r} s_{r,max} ≤ C η_T^{(d+m−1)/(dm+m−1)} ( √T / N^{2/d+2/(m−1)} )^{1/(1/d+m/(m−1))},

then

P( CC_T(S_λ) ∩ ⋂_{j∈H} CC_T(S_{λ,j}) ) ≥ 1 − 3(1+h)η_T.
Lemma 2.B.4. Under Assumptions 2.1 and 2.4, for x_j > 0 the following holds:

P( ⋂_{j∈H} E_T^{(j)}(x_j) ) ≥ 1 − C h N T^{m/2} / min_{j∈H} x_j^m.
Lemma 2.B.5. Define the set L_T := { max_{j∈H} |(1/T) Σ_{t=1}^T v_{j,t}² − τ_j²| ≤ h/δ_T }, and let Assumption 2.4 hold. When

δ_T ≤ C η_T ( √T h )^{1/(1/d+m/(m−1))},

P(L_T) ≥ 1 − 3η_T^{(dm+m−1)/(d+m−1)} → 1 as N, T → ∞.
Lemma 2.B.6. Under Assumption 2.5(ii),

1/C ≤ τ_j² ≤ C, uniformly over j = 1, . . . , N.  (2.B.1)

Furthermore, define the set P_{T,nw} := ⋂_{j∈H} E_T^{(j)}(Tλ_j/4) ∩ ⋂_{j∈H} CC_T(S_{λ,j}) and let Assumption 2.5(i) hold. On the set P_{T,nw} ∩ L_T, we have

max_{j∈H} |τ̂_j² − τ_j²| ≤ √(h/δ_T) + C₁ λ̄^{2−r} s̄_r + C₂ λ̄² λ̲^{−r} s̄_r,

and

max_{j∈H} |1/τ̂_j² − 1/τ_j²| ≤ [ √(h/δ_T) + C₁ λ̄^{2−r} s̄_r + C₂ λ̄² λ̲^{−r} s̄_r ] / [ C₃ − C₄ ( √(h/δ_T) + C₁ λ̄^{2−r} s̄_r + C₂ λ̄² λ̲^{−r} s̄_r ) ].
Lemma 2.B.7. Under Assumption 2.5(i)-(ii), it holds for sufficiently large T that on the set ⋂_{j∈H} E_T^{(j)}(Tλ_j/4) ∩ L_T,

max_{j∈H} ∥e_j′ − Θ̂_j Σ̂∥_∞ ≤ λ̄ / ( C₁ − √(h/δ_T) − C₂ λ̄^{2−r} s̄_r ),

where Θ̂_j is the jth row of Θ̂.
Lemma 2.B.8. Define ∆ := √T (Θ̂Σ̂ − I)(β̂ − β⁰), and P_{T,las} := E_T(Tλ/4) ∩ CC_T(S_λ). Under Assumptions 2.1, 2.2 and 2.5(i)-(ii), on the set P_{T,las} ∩ P_{T,nw} ∩ L_T we have that

max_{j∈H} |∆_j| ≤ √T λ^{1−r} s_r λ̄ / ( C₁ − √(h/δ_T) − C₂ λ̄^{2−r} s̄_r ).
Lemma 2.B.9. Under Assumption 2.5(i)-(ii), on the set E_T(Tλ) ∩ P_{T,nw},

max_{j∈H} (1/√T) |v̂_j′u − v_j′u| ≤ C √T λ_max^{2−r} s̄_r.
Lemma 2.B.10. Define the set E_{T,uv}^{(j)}(x) := { max_{s≤T} |Σ_{t=1}^s v_{j,t} u_t| ≤ x }. Under Assumptions 2.1 and 2.4, for x > 0 it follows that

P( ⋂_{j∈H} E_{T,uv}^{(j)}(x) ) ≥ 1 − C h T^{m/2} / x^m.
Lemma 2.B.11. Under Assumptions 2.1 and 2.3 to 2.5(i)-(ii), on the set E_T(Tλ) ∩ P_{T,nw} ∩ L_T ∩ ⋂_{j∈H} E_{T,uv}^{(j)}(h^{1/m} T^{1/2} η_T^{−1}) with η_T^{−1} ≤ C√T, we have

max_{j∈H} | (1/√T) v̂_j′u / τ̂_j² − (1/√T) v_j′u / τ_j² |
≤ [ h^{1/m} η_T^{−1} √(h/δ_T) + C₁ h^{1/m} η_T^{−1} √T λ_max^{2−r} s̄_r + C₂ h^{1/m} η_T^{−1} λ̄² λ̲^{−r} s̄_r ] / [ C₃ − C₄ ( √(h/δ_T) + C₁ λ̄^{2−r} s̄_r + C₂ λ̄² λ̲^{−r} s̄_r ) ].
Lemma 2.B.12. For any process {d_t}_{t=1}^T and constant x > 0, define the set E_{T,d}(x) := {∥d∥_∞ ≤ x}. Let max_t E|d_t|^p ≤ C < ∞. Then for x > 0, P({E_{T,d}(x)}^c) ≤ C x^{−p} T.
Lemma 2.B.13. Under Assumptions 2.1, 2.2, 2.4 and 2.5(i)-(ii), on the set P_{T,uv} := P_{T,las} ∩ P_{T,nw} ∩ E_{T,uvw},

max_{(j,k)∈H²} (1/T) | Σ_{t=l+1}^T (ŵ_{j,t}ŵ_{k,t−l} − w_{j,t}w_{k,t−l}) |
≤ C₁ T^{1/2} [λ_max^{2−r} s_{r,max}]² + C₂ h^{1/m} T^{1/m} λ_max^{2−r} s_{r,max} + C₃ h^{3/m} T^{(3−m)/m} λ_max^{2−r} s_{r,max} + C₄ h^{(m+1)/(3m)} T^{2/(3m)} √( [λ_max^{2−r} s_{r,max}]³ ).
Lemma 2.B.14. Define

E_{T,ww}(x) := { max_{(j,k)∈H²} |(1/T) Σ_{t=l+1}^T (w_{j,t}w_{k,t−l} − E w_{j,t}w_{k,t−l})| ≤ x }.

Under Assumptions 2.1 and 2.4, it holds that

P( E_{T,ww}( η_T^{−1} h² (√T h²)^{−1/(1/d+m/(m−2))} ) ) ≥ 1 − 3η_T^{(dm+m−2)/(2d+m−2)}.
Lemma 2.B.15. Assume that λ_max² λ_min^{−r} ≤ η_T h^{−2/m} [√T s_{r,max}]^{−1}, h^{(m+1)/(dm)+2/(m−1)}/√T → 0,

λ_min^{−r} s_{r,max} ≤ C η_T^{(d+m−1)/(dm+m−1)} ( √T / (hN)^{2/d+2/(m−1)} )^{1/(1/d+m/(m−1))},

and if r = 0, λ_min ≥ η_T^{−1} (hN)^{1/m}/√T. Furthermore, assume that R_N satisfies max_{1≤p≤P} ∥r_{N,p}∥₁ ≤ C, and P ≤ Ch. Then, as N, T → ∞,

max_{1≤p≤P} | r_{N,p} ( Θ̂X′u/√T + ∆ − Υ^{−2}V′u/√T ) | →p 0.
Lemma 2.B.16. Let Assumptions 2.1 to 2.6 hold, and let h ∼ T^H for H > 0, N ∼ T^a for a > 0, s_{r,max} ∼ T^b for 0 < b < (1−r)/2, λ_min ∼ λ_max ∼ λ ∼ T^{−ℓ} and

0 < r < 1:  (1/2 + b)/(2 − r) < ℓ < (1/2 − b)/r,
r = 0:     (1/2 + b)/(2 − r) < ℓ < 1/2,

and Q_T ∼ T^Q for 0 < Q < 2/3. Under these conditions,

R_{N,T}^Ω := ∥ R_N ( Υ^{−2}Ω_{N,T}Υ^{−2} − Υ̂^{−2}Ω̂_{N,T}Υ̂^{−2} ) R_N′ ∥_max = O_p( T^{(b−ℓ(2−r))/2} ),  (2.B.2)

R_{N,T}^β := max_{1≤p≤P} | r_{N,p} ( Θ̂X′u/√T + ∆ − Υ^{−2}V′u/√T ) | = O_p( T^{ϵ−1/2} + T^{1/2+b−ℓ(2−r)} ),  (2.B.3)

for an arbitrarily small ϵ > 0, with (b − ℓ(2−r))/2 < −1/4, and 1/2 + b − ℓ(2−r) < 0.

2.B.2 Proofs of main results
Proof of Theorem 2.2. Using eq. (2.5), we can write

√T R_N (b̂ − β⁰) = √T R_N ( β̂ − β⁰ + Θ̂X′(y − Xβ̂)/T ) = R_N ( Θ̂X′u/√T + ∆ ),

and by Lemma 2.B.15,

max_{1≤p≤P} | r_{N,p} ( Θ̂X′u/√T + ∆ − Υ^{−2}V′u/√T ) | →p 0.

Note that under the assumption that h ≤ C, the requirements for Lemma 2.B.15 reduce to the requirements for Theorem 2.2 (note that one of the bounds becomes redundant for 0 < r < 1, see the proof of Theorem 2.1 for details). The proof will therefore continue by deriving the asymptotic distribution of

R_N Υ^{−2}V′u/√T = (1/√T) R_N Υ^{−2} Σ_{t=1}^T w_t,

and applying Slutsky's theorem. Regarding R_N, under the assumption that h < ∞, we may without loss of generality consider the case with P = 1. In the multivariate setting, let R*_N be a P × N matrix with 1 < P < ∞, and non-zero columns indexed by the set H of cardinality h = |H| < ∞. By the Cramér-Wold theorem, √T R*_N(b̂ − β⁰) →d N(0, Ψ*) if and only if √T α′R*_N(b̂ − β⁰) →d N(0, α′Ψ*α) for all α ≠ 0. We show this directly by letting the 1 × N vector R_N = α′R*_N and the scalar ψ = lim_{N,T→∞} α′R*_N (Υ^{−2}Ω_{N,T}Υ^{−2}) R*_N′ α. The final part of the proof is then devoted to establishing the central limit theorem. This result can be shown by applying Theorem 24.6 and Corollary 24.7 of Davidson (2002b). Following the notation therein, let X_{T,t} = (1/√(P_{N,T} ψ T)) R_N Υ^{−2} w_t, where P_{N,T} = R_N Υ^{−2}Ω_{N,T}Υ^{−2}R_N′ / ψ; note that by definition of ψ, P_{N,T} → 1 as N, T → ∞. Further, let F_{T,−∞}^t = σ{s_{T,t}, s_{T,t−1}, . . .}, the positive constant array {c_{T,t}} = 1/√(P_{N,T} ψ T), and r = m̄. We show that the requirements of this Theorem are satisfied.

Part (a): F_{T,−∞}^t-measurability of X_{T,t} follows from the measurability of z_t in Assumption 2.1(ii); E[X_{T,t}] = (1/√(P_{N,T} ψ T)) R_N Υ^{−2} E[w_t] = 0 follows from rewriting w_{j,t} = (x_{j,t} − x_{−j,t}′ γ_j⁰) u_t and noting that E[x_{j,t}u_t] = 0, ∀j, by Assumption 2.1(i); and

E( Σ_{t=1}^T X_{T,t} )² = (1/(P_{N,T} ψ)) R_N Υ^{−2} (1/T) E[ (Σ_{t=1}^T w_t)(Σ_{t=1}^T w_t′) ] Υ^{−2} R_N′ = (1/(P_{N,T} ψ)) R_N Υ^{−2} Ω_{N,T} Υ^{−2} R_N′ = 1.

For part (b) we get that

sup_{T,t} { E|R_N Υ^{−2} w_t|^m̄ }^{1/m̄} = sup_{T,t} ( E| Σ_{j∈H} (r_{N,j}/τ_j²) w_{j,t} |^m̄ )^{1/m̄} ≤(1) sup_{T,t} Σ_{j∈H} (|r_{N,j}|/τ_j²) { E|w_{j,t}|^m̄ }^{1/m̄} ≤(2) C,

where (1) is due to Minkowski's inequality, and (2) follows from h < ∞, 1/τ_j² ≤ C by eq. (2.B.1), and w_{j,t} being L_m̄-bounded by Lemma 2.B.2(i).

For part (c'), by the arguments in the proof of Lemma 2.B.2, w_{j,t} is L_m-NED of size −d, and therefore also of size −1, on s_{T,t}, which is α-mixing of size −d/(1/m − 1/m̄) < −m̄/(m̄ − 2) under Assumption 2.1.

For (d'), we let M_T = max_t {c_{T,t}} = 1/√(P_{N,T} ψ T), such that sup_T T M_T² = sup_T [ R_N Υ^{−2} Ω_{N,T} Υ^{−2} R_N′ ]^{−1} ≤ C, where the inequality follows from 1/τ_j² ≥ 1/C by eq. (2.B.1), and R_N Υ^{−2} Ω_{N,T} Υ^{−2} R_N′ is bounded from below by the minimum eigenvalue of Ω_{N,T} (assumed to be bounded away from 0), via the Min-max theorem.

Finally, Theorem 2.2 states that this convergence is uniform in β⁰ ∈ B(s_r). This follows by noting that eq. (2.C.3) holds uniformly in β⁰ ∈ B(s_r). ■
Proof of Theorem 2.3. The following derivations collectively require that the set

P_{T,las} ∩ P_{T,nw} ∩ L_T ∩ E_{T,uvw} ∩ E_{T,ww}( η_T^{−1} h² (√T h²)^{−1/(1/d+m/(m−2))} )

holds with probability converging to 1. For P_{T,las} ∩ P_{T,nw} ∩ L_T, this can be shown by the arguments in the proof of Lemma 2.B.15 when the following convergence rates hold: λ_max² λ_min^{−r} ≤ η_T h^{−2/m} [√T s_{r,max}]^{−1}, h^{(m+1)/(dm)+2/(m−1)}/√T → 0,

λ_min^{−r} s_{r,max} ≤ C η_T^{(d+m−1)/(dm+m−1)} ( √T / (hN)^{2/d+2/(m−1)} )^{1/(1/d+m/(m−1))},

and if r = 0, λ_min ≥ η_T^{−1} (hN)^{1/m}/√T. E_{T,uvw} follows from Lemma 2.B.13, and E_{T,ww}( η_T^{−1} h² (√T h²)^{−1/(1/d+m/(m−2))} ) holds with probability converging to 1 by Lemma 2.B.14.
We can write

∥ R_N [ Υ̂^{−2}Ω̂Υ̂^{−2} − Υ^{−2}Ω_{N,T}Υ^{−2} ] R_N′ ∥ ≤ ∥ R_N [ Υ̂^{−2}Ω̂Υ̂^{−2} − Υ^{−2}Ω̂Υ^{−2} ] R_N′ ∥ + ∥ R_N [ Υ^{−2}Ω̂Υ^{−2} − Υ^{−2}Ω_{N,T}Υ^{−2} ] R_N′ ∥ =: R^{(a)} + R^{(b)}.
For R^{(a)} we get that

R^{(a)} ≤ ∥ R_N [Υ̂^{−2} − Υ^{−2}] Ω̂ [Υ̂^{−2} − Υ^{−2}] R_N′ ∥ + 2 ∥ R_N [Υ̂^{−2} − Υ^{−2}] Ω̂ Υ^{−2} R_N′ ∥
≤ ∥ R_N [Υ̂^{−2} − Υ^{−2}] [Ω̂ − Ω_{N,Q_T}] [Υ̂^{−2} − Υ^{−2}] R_N′ ∥ + ∥ R_N [Υ̂^{−2} − Υ^{−2}] Ω_{N,Q_T} [Υ̂^{−2} − Υ^{−2}] R_N′ ∥ + 2 ∥ R_N [Υ̂^{−2} − Υ^{−2}] [Ω̂ − Ω_{N,Q_T}] Υ^{−2} R_N′ ∥ + 2 ∥ R_N [Υ̂^{−2} − Υ^{−2}] Ω_{N,Q_T} Υ^{−2} R_N′ ∥,

where

Ω_{N,Q_T} := E[ (1/Q_T) ( Σ_{t=1}^{Q_T} w_t )( Σ_{t=1}^{Q_T} w_t′ ) ] = Ξ(0) + Σ_{l=1}^{Q_T−1} [ Ξ(l) + Ξ′(l) ],

where the (j, k)th element of Ξ(l) is ξ_{j,k}(l) = (1/T) Σ_{t=l+1}^T E w_{j,t} w_{k,t−l}.
Starting with the third term of R^{(a)}, applying the triangle inequality,

∥ R_N [Υ̂^{−2} − Υ^{−2}] [Ω̂ − Ω_{N,Q_T}] Υ^{−2} R_N′ ∥_max
≤ max_{1≤p,q≤P} Σ_{j∈H} Σ_{k∈H} |r_{N,p,j}| |1/τ̂_j² − 1/τ_j²| |ω̂_{j,k} − ω_{j,k}^{N,Q_T}| (1/τ_k²) |r_{N,q,k}|
≤ max_{j∈H} |1/τ̂_j² − 1/τ_j²| max_{j∈H} (1/τ_j²) max_{(j,k)∈H²} |ω̂_{j,k} − ω_{j,k}^{N,Q_T}| max_{1≤p,q≤P} ∥r_{N,p}∥₁ ∥r_{N,q}∥₁,

where max_{1≤p≤P} ∥r_{N,p}∥₁ ≤ C by assumption, max_{j∈H} 1/τ_j² ≤ C by eq. (2.B.1), and

max_{j∈H} |1/τ̂_j² − 1/τ_j²| ≤ [ √(h/δ_T) + C₁ λ̄^{2−r} s̄_r + C₂ λ̄² λ̲^{−r} s̄_r ] / [ C₃ − C₄ ( √(h/δ_T) + C₁ λ̄^{2−r} s̄_r + C₂ λ̄² λ̲^{−r} s̄_r ) ] → 0,

on the set P_{T,nw} ∩ L_T by Lemma 2.B.6. Finally, we show that max_{(j,k)∈H²} |ω̂_{j,k} − ω_{j,k}^{N,Q_T}| → 0.
max_{(j,k)∈H²} |ω̂_{j,k} − ω_{j,k}^{N,Q_T}| ≤ 2 Σ_{l=0}^{Q_T−1} max_{(j,k)∈H²} | (1 − l/Q_T) ξ̂_{j,k}(l) − ξ_{j,k}(l) |
= 2 Σ_{l=0}^{Q_T−1} max_{(j,k)∈H²} | (1 − l/Q_T) (1/(T−l)) Σ_{t=l+1}^T ŵ_{j,t}ŵ_{k,t−l} − (1/T) Σ_{t=l+1}^T E w_{j,t}w_{k,t−l} |.

Using a telescopic sum argument,

max_{(j,k)∈H²} |ω̂_{j,k} − ω_{j,k}^{N,Q_T}| ≤ 2 Σ_{l=0}^{Q_T−1} [ max_{(j,k)∈H²} |(1/T) Σ_{t=l+1}^T (ŵ_{j,t}ŵ_{k,t−l} − E w_{j,t}w_{k,t−l})| + (l/Q_T) max_{(j,k)∈H²} |(1/T) Σ_{t=l+1}^T E w_{j,t}w_{k,t−l}| ].
For the second term, it follows by Lemma 2.B.2(ii) that

2 Σ_{l=0}^{Q_T−1} (l/Q_T) max_{j,k∈H} |(1/T) Σ_{t=l+1}^T E w_{j,t}w_{k,t−l}| ≤ (C/Q_T) Σ_{l=1}^{Q_T−1} l^{1−d−ϵ} ≤ C Q_T^{1−d−ϵ/2} Σ_{l=1}^{Q_T−1} l^{−1−ϵ/2} ≤ C Q_T^{1−d−ϵ/2},

since l/Q_T < 1, Q_T^{1−d−ϵ/2} → 0 for d ≥ 1, and Σ_{l=1}^{Q_T−1} l^{−1−ϵ/2} ≤ C by properties of p-series.
It follows from Lemmas 2.B.13 and 2.B.14 that

max_{(j,k)∈H²} |(1/T) Σ_{t=l+1}^T (ŵ_{j,t}ŵ_{k,t−l} − w_{j,t}w_{k,t−l})| ≤ C₁ T^{1/2} [λ_max^{2−r} s_{r,max}]² + C₂ h^{1/m} T^{1/m} λ_max^{2−r} s_{r,max} + C₃ h^{3/m} T^{(3−m)/m} λ_max^{2−r} s_{r,max} + C₄ h^{(m+1)/(3m)} T^{2/(3m)} √( [λ_max^{2−r} s_{r,max}]³ ),

max_{(j,k)∈H²} |(1/T) Σ_{t=l+1}^T (w_{j,t}w_{k,t−l} − E w_{j,t}w_{k,t−l})| ≤ C₅ η_T^{−1} h² (√T h²)^{−1/(1/d+m/(m−2))},

on the set P_{T,uv} ∩ E_{T,ww}( η_T^{−1} h² (√T h²)^{−1/(1/d+m/(m−2))} ). Plugging the upper bounds in, we find that

max_{(j,k)∈H²} |ω̂_{j,k} − ω_{j,k}^{N,Q_T}| ≤ 2Q_T { C₁ T^{1/2} [λ_max^{2−r} s_{r,max}]² + C₂ h^{1/m} T^{1/m} λ_max^{2−r} s_{r,max} + C₃ h^{3/m} T^{(3−m)/m} λ_max^{2−r} s_{r,max} + C₄ h^{(m+1)/(3m)} T^{2/(3m)} √( [λ_max^{2−r} s_{r,max}]³ ) + C₅ η_T^{−1} h² (√T h²)^{−1/(1/d+m/(m−2))} } + C₆ Q_T^{1−d−ϵ}.

Hence, max_{(j,k)∈H²} |ω̂_{j,k} − ω_{j,k}^{N,Q_T}| →p 0 if we take

λ_max^{2−r} ≤ η_T min{ [ (Q_T √T)^{1/2} s_{r,max} ]^{−1}, [ Q_T h^{1/m} T^{1/m} s_{r,max} ]^{−1}, [ Q_T² h^{3/m} T^{(3−m)/m} s_{r,max} ]^{−1}, [ Q_T h^{1/(3m)} T^{(m+1)/(3m)} s_{r,max} ]^{−2/3} },

and Q_T η_T^{−1} h² (√T h²)^{−1/(1/d+m/(m−2))} → 0. For the latter term, since we can choose η_T^{−1} to grow arbitrarily slowly, it is sufficient to assume Q_T h² (√T h²)^{−1/(1/d+m/(m−2))} → 0. Furthermore, this convergence rate is stricter than the previous rate h^{(m+1)/(dm)+2/(m−1)}/√T → 0, and therefore makes it redundant.
For the fourth term of R^{(a)}, we may bound as follows:

∥ R_N [Υ̂^{−2} − Υ^{−2}] Ω_{N,Q_T} Υ^{−2} R_N′ ∥_max
≤ max_{1≤p,q≤P} Σ_{j∈H} Σ_{k∈H} |r_{N,p,j}| |1/τ̂_j² − 1/τ_j²| |ω_{j,k}^{N,Q_T}| (1/τ_k²) |r_{N,q,k}|
≤ max_{j∈H} |1/τ̂_j² − 1/τ_j²| max_{j∈H} (1/τ_j²) max_{(j,k)∈H²} |ω_{j,k}^{N,Q_T}| max_{1≤p,q≤P} ∥r_{N,p}∥₁ ∥r_{N,q}∥₁.

The only new term here is max_{(j,k)∈H²} |ω_{j,k}^{N,Q_T}|, which can be bounded by

max_{(j,k)∈H²} |ω_{j,k}^{N,Q_T}| ≤ ∥Ω_{N,Q_T}∥_max ≤ 2 Σ_{l=0}^{Q_T−1} ∥Ξ(l)∥_max ≤ C,

where the last inequality follows from Lemma 2.B.2(ii).

Note that when the third and fourth terms of R^{(a)} converge to 0, this holds for the first and second terms as well; one may simply replace max_{j∈H} (1/τ_j²) by a second max_{j∈H} |1/τ̂_j² − 1/τ_j²| → 0 in the upper bound.
This concludes the part of $R^{(a)}$. With the results above, it remains to be shown for $R^{(b)}$ that $\left\|R_N\Upsilon^{-2}\left(\hat{\Omega} - \Omega^{N,T}\right)\Upsilon^{-2}R_N'\right\|_{\max} \to 0$. Using similar arguments as for the terms of
2 Desparsified Lasso in Time Series
$R^{(a)}$, it suffices to show that $\max_{(j,k)\in H^2}\left|\omega^{N,Q_T}_{j,k} - \omega^{N,T}_{j,k}\right| \to 0$. Note that by Lemma 2.B.2(ii)
\[
\left|\omega^{N,Q_T}_{j,k} - \omega^{N,T}_{j,k}\right| \le \left|\sum_{l=1}^{Q_T-1}\xi_{j,k}(l) - \sum_{l=1}^{T-1}\xi_{j,k}(l)\right| \le \sum_{l=Q_T}^{T}\left|\xi_{j,k}(l)\right| \le \sum_{l=Q_T}^{T}C\phi_l \le C\sum_{l=Q_T}^{T}l^{-d-\epsilon},
\]
which converges to 0 by letting $\delta = \epsilon/2$, and writing $\sum_{l=Q_T}^{T}l^{-d-\epsilon} \le Q_T^{1-d-\delta}\sum_{l=Q_T}^{T}l^{-1-\delta}$, where $Q_T^{1-d-\delta} \to 0$ for $d \ge 1$, and $\sum_{l=Q_T}^{T}l^{-1-\delta} \to 0$ by properties of p-series and $Q_T \to \infty$. This shows that $\left\|R^{(b)}\right\|_{\max} \overset{p}{\to} 0$.
Summarizing the above, we argue that for some $\delta > 0$,
\[
\left\|R_N\left(\Upsilon^{-2}\Omega^{N,T}\Upsilon^{-2} - \hat{\Upsilon}^{-2}\hat{\Omega}^{N,T}\hat{\Upsilon}^{-2}\right)R_N'\right\|_{\max} \le C_1\Delta\tau\left[1 + \Delta\tau + \Delta\tau\Delta\omega\right] + C_2Q_T^{1-d-\delta},
\]
where $\Delta\tau := \max_{j\in H}\left|\frac{1}{\hat{\tau}_j^2} - \frac{1}{\tau_j^2}\right|$ and $\Delta\omega := \max_{(j,k)\in H^2}\left|\hat{\omega}^{N,Q_T}_{j,k} - \omega_{j,k}\right|$.
Finally, this result holding uniformly in $\beta^0 \in \mathcal{B}(s_r)$ follows the same logic as the proof of Theorem 2.2, namely that eq. (2.C.3) holds uniformly in $\beta^0 \in \mathcal{B}(s_r)$.
■
Proof of Corollary 2.2. The result follows by applying Theorems 2.2 and 2.3, so the assumed conditions from both must be satisfied. Since we assume that $h \le C$ and $\lambda \sim \lambda_{\max} \sim \lambda_{\min}$, the conditions will simplify considerably. To summarize, we require the following conditions: For Theorem 2.2 we require that
\[
\begin{aligned}
&(1)\quad \lambda_{\max}^{2}\lambda_{\min}^{-r} \le \eta_T\left[\sqrt{T}\,s_{r,\max}\right]^{-1}, \\
&(2)\quad \lambda_{\min}^{-r}s_{r,\max} \le \eta_T\left[\frac{\sqrt{T}}{N^{\left(\frac{2}{d}+\frac{2}{m-1}\right)}}\right]^{\frac{1}{1/d+m/(m-1)}}, \\
&(3^*)\quad \lambda_{\min} \ge \eta_T^{-1}\frac{N^{1/m}}{\sqrt{T}} \quad\text{when } r = 0,
\end{aligned}
\]
and for Theorem 2.3 that
\[
\begin{aligned}
&(4)\quad \lambda_{\max}^{2-r} \le \eta_T\min\left\{\left[\sqrt{Q_TT}\,s_{r,\max}\right]^{-1}, \left[Q_Th^{1/m}T^{1/m}s_{r,\max}\right]^{-1}, \left[Q_T^2h^{3/m}T^{(3-m)/m}s_{r,\max}\right]^{-1}, \left[Q_T^{2/3}h^{1/(3m)}T^{(m+1)/(3m)}s_{r,\max}\right]^{-1}\right\}, \\
&(5)\quad \lambda_{\max}^{2}\lambda_{\min}^{-r} \le \eta_T\left[\sqrt{T}h^{2/m}s_{r,\max}\right]^{-1}, \\
&(6)\quad \lambda_{\min}^{-r}s_{r,\max} \le \eta_T\left[\frac{\sqrt{T}}{(hN)^{\left(\frac{2}{d}+\frac{2}{m-1}\right)}}\right]^{\frac{1}{1/d+m/(m-1)}}, \\
&(7^*)\quad \lambda_{\min} \ge \eta_T^{-1}\frac{(hN)^{1/m}}{\sqrt{T}} \quad\text{when } r = 0, \\
&(8)\quad Q_Th^2\left(\sqrt{T}h^2\right)^{-\frac{1}{1/d+m/(m-2)}} \to 0,
\end{aligned}
\]
where (1)-(3$^*$) follow from Theorem 2.2 and (4)-(8) from Theorem 2.3. Note that (1), (2), and (3$^*$) are the same as the terms (5), (6), and (7$^*$) without the $h$ terms. For (4), this can be simplified into a single (slightly more strict) upper bound $\lambda_{\max}^{2-r} \le C\eta_T\left[Q_T^2\sqrt{T}h^{3/m}s_{r,\max}\right]^{-1}$. We may then combine this with (5), and both are satisfied when $\lambda_{\max}^{2}\lambda_{\min}^{-r} \le \eta_T\left[Q_T^2\sqrt{T}h^{3/m}s_{r,\max}\right]^{-1}$.
Using $h \le C$ and $\lambda \sim \lambda_{\max} \sim \lambda_{\min}$, these simplify to
\[
\begin{aligned}
&(1)\quad \lambda^{-r}s_{r,\max} \le \eta_T\left[\frac{\sqrt{T}}{N^{\left(\frac{2}{d}+\frac{2}{m-1}\right)}}\right]^{\frac{1}{1/d+m/(m-1)}}, \\
&(2^*)\quad \lambda \ge \eta_T^{-1}\frac{N^{1/m}}{\sqrt{T}} \quad\text{when } r = 0, \\
&(3)\quad \lambda^{2-r} \le \eta_T\left[Q_T^2\sqrt{T}\,s_{r,\max}\right]^{-1}, \\
&(4)\quad Q_TT^{-\frac{1}{2/d+2m/(m-2)}} \to 0.
\end{aligned}
\]
When $0 < r < 1$, from (1) and (3) we get
\[
\eta_T^{-1}s_{r,\max}^{1/r}\left[\frac{N^{\left(\frac{2}{d}+\frac{2}{m-1}\right)}}{\sqrt{T}}\right]^{\frac{1}{r\left(1/d+m/(m-1)\right)}} \le \lambda \le \eta_T\left[Q_T^2\sqrt{T}\,s_{r,\max}\right]^{-1/(2-r)},
\]
and by combining the upper and lower bounds, we obtain the condition
\[
Q_T^r\,s_{r,\max}\,N^{(2-r)\left(\frac{d+m-1}{dm+m-1}\right)}\,T^{\frac{1}{4}\left(r - \frac{d(m-1)(2-r)}{dm+m-1}\right)} \to 0.
\]
When $r = 0$, the bounds on $\lambda$ come from (2$^*$) and (3)
\[
\eta_T^{-1}\frac{N^{1/m}}{\sqrt{T}} \le \lambda \le \eta_T\left[Q_T^2\sqrt{T}\,s_{0,\max}\right]^{-1/2}.
\]
Combining the upper and lower bounds, we obtain the condition
\[
Q_T^2\,s_{0,\max}\frac{N^{2/m}}{\sqrt{T}} \to 0.
\]
From (1), we then obtain the condition
\[
s_{0,\max}\,N^{2\left(\frac{d+m-1}{dm+m-1}\right)}\,T^{-\frac{1}{2}\frac{d(m-1)}{dm+m-1}} \to 0,
\]
which is the same condition which came from (1) and (3) in the $0 < r < 1$ case. Collectively, we then need to satisfy the following
\[
\begin{aligned}
&\eta_T^{-1}s_{r,\max}^{1/r}\left[\frac{N^{\left(\frac{2}{d}+\frac{2}{m-1}\right)}}{\sqrt{T}}\right]^{\frac{1}{r\left(1/d+m/(m-1)\right)}} \le \lambda \le \eta_T\left[Q_T^2\sqrt{T}\,s_{r,\max}\right]^{-1/(2-r)} \quad\text{when } 0 < r < 1, \\
&\eta_T^{-1}\frac{N^{1/m}}{\sqrt{T}} \le \lambda \le \eta_T\left[Q_T^2\sqrt{T}\,s_{0,\max}\right]^{-1/2} \quad\text{when } r = 0, \\
&Q_T^r\,s_{r,\max}\,N^{(2-r)\left(\frac{d+m-1}{dm+m-1}\right)}\,T^{\frac{1}{4}\left(r - \frac{d(m-1)(2-r)}{dm+m-1}\right)} \to 0, \\
&Q_T^2\,s_{0,\max}\frac{N^{2/m}}{\sqrt{T}} \to 0 \quad\text{when } r = 0, \\
&Q_TT^{-\frac{1}{2/d+2m/(m-2)}} \to 0.
\end{aligned}
\]
By implication of Theorem 2.2,
\[
\sqrt{T}\,r_{N,p}\left(\hat{b} - \beta^0\right) \overset{d}{\to} N(0, \psi),
\]
uniformly in $\beta^0 \in \mathcal{B}(s_r)$. Then, by Theorem 2.3,
\[
r_{N,p}\left(\hat{\Upsilon}^{-2}\hat{\Omega}\hat{\Upsilon}^{-2}\right)r_{N,p}' \overset{p}{\to} \psi,
\]
also uniformly in $\beta^0 \in \mathcal{B}(s_r)$. By Slutsky's Theorem, it is then the case that
\[
\frac{\sqrt{T}\,r_{N,p}\left(\hat{b} - \beta^0\right)}{\sqrt{r_{N,p}\left(\hat{\Upsilon}^{-2}\hat{\Omega}\hat{\Upsilon}^{-2}\right)r_{N,p}'}} \overset{d}{\to} N(0, 1),
\]
uniformly in $\beta^0 \in \mathcal{B}(s_r)$, for every $1 \le p \le P$. As $P < \infty$ by assumption, it follows that
\[
\sup_{\beta^0 \in \mathcal{B}(s_r)}\ \max_{1\le p\le P,\ z\in\mathbb{R}}\left|P\left(\frac{\sqrt{T}\,r_{N,p}\left(\hat{b} - \beta^0\right)}{\sqrt{r_{N,p}\left(\hat{\Upsilon}^{-2}\hat{\Omega}\hat{\Upsilon}^{-2}\right)r_{N,p}'}} \le z\right) - \Phi(z)\right| = o_p(1).
\]
Note that uniform convergence over $z \in \mathbb{R}$ follows automatically by Lemma 2.11 in Van der Vaart (1998), since the distribution is continuous. The second result then follows from the fact that a sum of $P$ squared standard Normal variables has a $\chi^2_P$ distribution.
■
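As a quick numerical illustration of the last step (a hypothetical simulation, not part of the proof): the sum of $P$ squared independent standard normals has mean $P$ and variance $2P$, the first two moments of the $\chi^2_P$ distribution.

```python
import random
import statistics

# Empirical check that a sum of P squared standard normals behaves like
# a chi^2_P variable: the chi^2_P distribution has mean P and variance 2P.
random.seed(42)
P = 5
draws = [sum(random.gauss(0.0, 1.0) ** 2 for _ in range(P)) for _ in range(200_000)]

mean = statistics.fmean(draws)
var = statistics.pvariance(draws)
print(round(mean, 2), round(var, 2))  # should be close to P = 5 and 2P = 10
```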
Proof of Corollary 2.3. Define $g \sim N\left(0, R_N\Upsilon^{-2}\Omega^{N,T}\Upsilon^{-2}R_N'\right)$ as the 'population counterpart' of $\hat{g}$ and define the following distribution functions:
\[
\begin{aligned}
F_{1,T}(z) &:= P\left(\max_{1\le p\le P}\sqrt{T}\,r_{N,p}\left(\hat{b} - \beta^0\right) \le z\right), & F_{2,T}(z) &:= P\left(\max_{1\le p\le P}\frac{1}{\sqrt{T}}r_{N,p}\Upsilon^{-2}\sum_{t=1}^{T}w_t \le z\right), \\
G_T(z) &:= P\left(\max_{1\le p\le P}g_p \le z\right), & G_T^*(z) &:= P^*\left(\max_{1\le p\le P}\hat{g}_p \le z\right).
\end{aligned}
\]
Now note that
\[
\left|F_{1,T}(z) - G_T^*(z)\right| \le \underbrace{\left|F_{1,T}(z) - G_T(z)\right|}_{R_T^{FG}(z)} + \underbrace{\left|G_T(z) - G_T^*(z)\right|}_{R_T^{GG}(z)}.
\]
For $R_T^{FG}(z)$, write $\hat{x}_{T,p} = \sqrt{T}\,r_{N,p}\left(\hat{b} - \beta^0\right)$ and $x_{T,p} = \frac{1}{\sqrt{T}}r_{N,p}\Upsilon^{-2}\sum_{t=1}^{T}w_t$, such that
$F_{1,T}(z) = P(\max_p \hat{x}_{T,p} \le z)$ and $F_{2,T}(z) = P(\max_p x_{T,p} \le z)$, and let $r_T := \max_{1\le p\le P}\hat{x}_{T,p} - \max_{1\le p\le P}x_{T,p}$. Then
\[
|r_T| = \left|\max_{1\le p\le P}\hat{x}_{T,p} - \max_{1\le p\le P}x_{T,p}\right| \le \max_{1\le p\le P}\left|\hat{x}_{T,p} - x_{T,p}\right| = R^{\beta}_{N,T},
\]
where $R^{\beta}_{N,T}$ is defined in (2.B.3). Given our assumptions, we therefore know that there exist sequences $\eta_{T,1}$ and $\eta_{T,2}$ such that $P\left(|r_T| > \eta_{T,1}\right) \le \eta_{T,2}$, and
\[
\begin{aligned}
\left|F_{1,T}(z) - G_T(z)\right| &\le \left|P\left(\max_p x_{T,p} + r_T \le z \,\middle|\, |r_T| \le \eta_{T,1}\right)P\left(|r_T| \le \eta_{T,1}\right) - P\left(\max_p g_p \le z\right)\right| \\
&\quad + P\left(\max_p \hat{x}_{T,p} \le z \,\middle|\, |r_T| > \eta_{T,1}\right)P\left(|r_T| > \eta_{T,1}\right) \\
&\le \left|P\left(\max_p x_{T,p} \le z + \eta_{T,1}\right) - P\left(\max_p g_p \le z\right)\right| + 2\eta_{T,2} \\
&\le \underbrace{\left|P\left(\max_p x_{T,p} \le z + \eta_{T,1}\right) - P\left(\max_p g_p \le z + \eta_{T,1}\right)\right|}_{R^{FG}_{T,1}(z+\eta_{T,1})} \\
&\quad + \underbrace{\left|P\left(\max_p g_p \le z + \eta_{T,1}\right) - P\left(\max_p g_p \le z\right)\right|}_{R^{FG}_{T,2}(z)} + 2\eta_{T,2}.
\end{aligned}
\]
For the term $R^{FG}_{T,1}(z + \eta_{T,1})$ we apply the high-dimensional CLT in Theorem 1 of Chang et al. (2021), noting that our assumptions imply the conditions required for this theorem. In particular, for the sub-exponential moment assumption, we need that $\left\|r_{N,p}\Upsilon^{-2}w_t\right\|_{\psi_{\gamma_1}} \le D_T$ for all $t$ and $p$, for some $\gamma_1 \ge 1$. We choose $\gamma_1 = 1$, and use Lemma 2.7.7 of Vershynin (2019) to bound $\left\|r_{N,p}\Upsilon^{-2}w_t\right\|_{\psi_1} \le \left\|r_{N,p}\Upsilon^{-2}v_t\right\|_{\psi_2}\left\|u_t\right\|_{\psi_2} \le d_{v,T}d_{u,T} = D_T$. We assume that $L_1$-bounded linear combinations of $v_t$ are sub-Gaussian, which covers this case, since $\left\|r_{N,p}\right\|_1 \le C$ by assumption, and $\left\|\Upsilon^{-2}\right\|_{\max} \le C$ by eq. (2.B.1). The non-degeneracy condition then follows from choosing $1/C \le \left\|R_N\right\|_1$, and assuming the minimum eigenvalue (and therefore the smallest diagonal element) of $\Omega^{N,T}$ is bounded away from 0. Defining $\underline{\omega}_T := \min_{1\le p\le P}\mathrm{E}\,g_p^2$, this implies that $\underline{\omega}_T \ge C > 0$. Applying the CLT, we bound as follows
\[
R^{FG}_{T,1}(z + \eta_{T,1}) \le \sup_{z\in\mathbb{R}}R^{FG}_{T,1}(z) \le \sup_{z\in\mathbb{R}^P}\left|P\left(\frac{R_N\Upsilon^{-2}}{\sqrt{T}}\sum_{t=1}^{T}w_t \le z\right) - P\left(g \le z\right)\right| \le C_1\frac{B_T^{2/3}(\ln P)^{(1+2K)/(3K)}}{T^{1/9}} + C_2\frac{B_T(\ln P)^{7/6}}{T^{1/9}} \to 0.
\]
The final result holds as $\ln P \le \ln(Ch) = O\left(\ln T^H\right) = O(\ln T)$, since $H$ is a constant.
For the term $R^{FG}_{T,2}(z)$, apply the anti-concentration bound in Lemma 2.1 of Chernozhukov et al. (2013) to show that
\[
R^{FG}_{T,2}(z) \le \sup_{z\in\mathbb{R}}P\left(z \le \max_p g_p \le z + \eta_{T,1}\right) \le \sup_{z\in\mathbb{R}}P\left(\left|\max_p g_p - z\right| \le \eta_{T,1}\right) \le C\eta_{T,1}\left[\sqrt{2\ln P} + \sqrt{1 \vee \ln\left(\underline{\omega}_T/\eta_{T,1}\right)}\right] \le C_1\eta_{T,1}\sqrt{2\ln P}.
\]
By Lemma 2.B.16 we find that $R^{\beta}_{N,T} = O_p\left(\left[T^{\epsilon-1/2} + T^{1/2+b-\ell(2-r)}\right]\sqrt{\ln T}\right) = O_p\left(T^{-\delta}\right)$ for some $\delta > 0$, since $\epsilon > 0$ can be chosen arbitrarily small, and $1/2 + b - \ell(2-r) < 0$. We may therefore take $\eta_{T,1}$ at a polynomial rate as well, such that $\eta_{T,1}\sqrt{2\ln(P)} \to 0$.
For $R_T^{GG}(z)$, it follows by Theorem 2 in Chernozhukov et al. (2015) that
\[
\sup_{z\in\mathbb{R}}R_T^{GG}(z) \le C\left(R^{\Omega}_{N,T}\right)^{1/3}\max\left\{1, \ln\left(P/R^{\Omega}_{N,T}\right)\right\}^{2/3},
\]
with $R^{\Omega}_{N,T}$ as defined in eq. (2.B.2). By Lemma 2.B.16 we have $R^{\Omega}_{N,T} = O_p\left(T^{-1/4}\right)$, such that
\[
\left(R^{\Omega}_{N,T}\right)^{1/3}\max\left\{1, \ln\left(P/R^{\Omega}_{N,T}\right)\right\}^{2/3} = O_p\left(T^{-1/12}\left(\max\left\{1, (H + 1/4)\ln T\right\}\right)^{2/3}\right) = o_p(1).
\]
■
Appendix 2.C
Supplementary Results
Sections 2.C.1 and 2.C.2 present the proofs of the preliminary results from Section 2.3 and Section 2.4, respectively. Section 2.C.5 provides the details on Examples 2.5 and 2.6. Section 2.C.6 contains the algorithm for choosing the tuning parameter.
2.C.1
Proofs of preliminary results Section 2.3
Proof of Lemma 2.A.1. Lm̄ -boundedness of {xj,t ut } follows directly from the L2m̄ -boundedness
of {z t } and the Cauchy-Schwarz inequality. By Theorem 17.9 in Davidson (2002b) it follows
that {xj,t ut } is Lm -NED on {sT,t } of size −1. We then apply Theorem 17.5 in Davidson
(2002b) to conclude that $\{x_{j,t}u_t\}$ is an $L_m$-mixingale of size $-\min\left\{1, \frac{d}{(1/m - 1/\bar{m})}(1/m - 1/\bar{m})\right\} = -1$, with respect to $\mathcal{F}_t^s = \sigma\{s_{T,t}, s_{T,t-1}, \ldots\}$; the $\mathcal{F}_t^s$-measurability of $z_t$ implies $\sigma\{z_t, z_{t-1}, \ldots\} \subset \mathcal{F}_t^s$, which in turn implies that $\{x_{j,t}u_t\}$ is also an $L_m$-mixingale with respect to $\mathcal{F}_t = \sigma\{z_t, z_{t-1}, \ldots\}$. The summability condition $\sum_{q=1}^{\infty}\psi_q < \infty$ is satisfied by the convergence property of p-series: $\sum_{q=1}^{\infty}q^{-p} < \infty$ for any $p > 1$.
■
Proof of Lemma 2.A.2. Lm̄ -boundedness of {xi,t xj,t −Exi,t xj,t } follows directly from the
L2m̄ -boundedness of {z t } and the Cauchy-Schwarz inequality. By Theorem 17.9 of Davidson
(2002b) the product of two NED processes is also NED, with the order halved. It follows that
{xi,t xj,t } is Lm -NED on {sT,t } of size −d. Therefore, Exi,t xj,t is trivially NED. Theorem
17.8 in Davidson (2002b) implies that also {xi,t xj,t − Exi,t xj,t } is Lm -NED. We then apply
Theorem 17.5 in Davidson (2002b) to conclude that {xi,t xj,t − Exi,t xj,t } is an Lm -mixingale
of size $-\min\left\{d, \frac{d}{(1/m - 1/\bar{m})}(1/m - 1/\bar{m})\right\} = -d$, with respect to $\mathcal{F}_t^s = \sigma\{s_{T,t}, s_{T,t-1}, \ldots\}$; the $\mathcal{F}_t^s$-measurability of $z_t$ implies $\sigma\{z_t, z_{t-1}, \ldots\} \subset \mathcal{F}_t^s$, which in turn implies that $\{x_{i,t}x_{j,t} - \mathrm{E}x_{i,t}x_{j,t}\}$ is also an $L_m$-mixingale with respect to $\mathcal{F}_t = \sigma\{z_t, z_{t-1}, \ldots\}$. The boundedness of the mixingale constants comes from Theorem 17.5, noting that the NED constants of $\{z_{j,t}\}$ are bounded by Assumption 2.1(ii), and $\{x_{i,t}x_{j,t} - \mathrm{E}x_{i,t}x_{j,t}\}$ is appropriately $L_{\bar{m}}$-bounded.
■
Proof of Lemma 2.A.3. By the union bound
\[
P\left(\left\|\hat{\Sigma} - \Sigma\right\|_{\max} > C/|S|\right) \le \sum_{i=1}^{N}\sum_{j=1}^{N}P\left(\left|\sum_{t=1}^{T}\left(x_{i,t}x_{j,t} - \mathrm{E}\left[x_{i,t}x_{j,t}\right]\right)\right| > CT/|S|\right).
\]
Now apply the Triplex inequality (Jiang, 2009)
\[
\begin{aligned}
P&\left(\left|\sum_{t=1}^{T}\left(x_{i,t}x_{j,t} - \mathrm{E}\left[x_{i,t}x_{j,t}\right]\right)\right| > CT/|S|\right) \le 2q\exp\left(-T\frac{C^2}{288|S|^2q^2\kappa_T^2}\right) \\
&\quad + \frac{6|S|}{C}\frac{1}{T}\sum_{t=1}^{T}\mathrm{E}\left[\left|\mathrm{E}\left(x_{i,t}x_{j,t}|\mathcal{F}_{t-q}\right) - \mathrm{E}\left(x_{i,t}x_{j,t}\right)\right|\right] + \frac{15|S|}{C}\frac{1}{T}\sum_{t=1}^{T}\mathrm{E}\left[\left|x_{i,t}x_{j,t}\right|1_{\left\{|x_{i,t}x_{j,t}|>\kappa_T\right\}}\right] := R^{(i)} + R^{(ii)} + R^{(iii)}.
\end{aligned}
\]
For the first term, we have
\[
\sum_{i=1}^{N}\sum_{j=1}^{N}R^{(i)} = 2N^2q\exp\left(-T\frac{C^2}{288|S|^2q^2\kappa_T^2}\right),
\]
so we need $N^2q\exp\left(\frac{-T}{|S|^2q^2\kappa_T^2}\right) \to 0$. By Lemma 2.A.2 and Jensen's inequality, we have that $\mathrm{E}\left[\left|\mathrm{E}\left[x_{i,t}x_{j,t}|\mathcal{F}_{t-q}\right] - \mathrm{E}\left[x_{i,t}x_{j,t}\right]\right|\right] \le c_t\psi_q$, and thus for the second term that
\[
R^{(ii)} \le \frac{6|S|}{C}\frac{1}{T}\sum_{t=1}^{T}c_t\psi_q \le C|S|\psi_q, \qquad \sum_{i=1}^{N}\sum_{j=1}^{N}R^{(ii)} \le CN^2|S|q^{-d},
\]
so we need $N^2|S|q^{-d} \to 0$. For the third term, we have by Hölder's and Markov's inequalities
\[
\mathrm{E}\left[\left|x_{i,t}x_{j,t}\right|1_{\left\{|x_{i,t}x_{j,t}|>\kappa_T\right\}}\right] \le \left(\mathrm{E}\left|x_{i,t}x_{j,t}\right|^m\right)^{1/m}\left[\frac{\mathrm{E}\left|x_{i,t}x_{j,t}\right|^m}{\kappa_T^m}\right]^{1-1/m} \le \kappa_T^{1-m}\mathrm{E}\left[\left|x_{i,t}x_{j,t}\right|^m\right], \qquad \sum_{i=1}^{N}\sum_{j=1}^{N}R^{(iii)} \le CN^2|S|\kappa_T^{1-m},
\]
so we need $N^2|S|\kappa_T^{1-m} \to 0$. We then jointly bound all three terms
\[
(1)\quad CN^2q\exp\left(\frac{-T}{|S|^2q^2\kappa_T^2}\right) \le \eta_T, \qquad (2)\quad CN^2|S|q^{-d} \le \eta_T, \qquad (3)\quad CN^2|S|\kappa_T^{1-m} \le \eta_T,
\]
by a sequence $\eta_T \to 0$. Note that in the Triplex inequality, $q$ is a positive integer, $\kappa_T > 0$, and $\lambda^{-r}s_r > 0$ is also satisfied. We further assume that
\[
\frac{\eta_T}{N^2} \le \frac{1}{e} \implies \frac{\eta_T}{qN^2} \le \frac{1}{e}.
\]
First, isolate $\kappa_T$ in (1),
\[
CN^2q\exp\left(\frac{-T}{|S|^2q^2\kappa_T^2}\right) \le \eta_T \iff \kappa_T \le C\frac{\sqrt{T}}{|S|q\sqrt{\ln\left(qN^2/\eta_T\right)}}.
\]
Similarly, isolating $\kappa_T$ from (3), gives
\[
CN^2|S|\kappa_T^{1-m} \le \eta_T \iff \kappa_T \ge C\left(N^2|S|\right)^{\frac{1}{m-1}}\eta_T^{\frac{-1}{m-1}}.
\]
Since we have a lower and upper bound on $\kappa_T$, we need to make sure both bounds are satisfied,
\[
C_1\left(N^2|S|\right)^{\frac{1}{m-1}}\eta_T^{\frac{-1}{m-1}} \le C_2\frac{\sqrt{T}}{|S|q\sqrt{\ln\left(qN^2/\eta_T\right)}} \iff q\sqrt{\ln\left(qN^2/\eta_T\right)} \le C\sqrt{T}|S|^{\frac{-m}{m-1}}N^{\frac{-2}{m-1}}\eta_T^{\frac{1}{m-1}}.
\]
Isolating $q$ from (2),
\[
CN^2|S|q^{-d} \le \eta_T \iff q \ge CN^{\frac{2}{d}}|S|^{\frac{1}{d}}\eta_T^{\frac{-1}{d}}.
\]
Assuming that $\frac{\eta_T}{qN^2} \le \frac{1}{e}$, we have that $q \le q\sqrt{\ln\left(qN^2/\eta_T\right)}$ and therefore we need to ensure
\[
CN^{\frac{2}{d}}|S|^{\frac{1}{d}}\eta_T^{\frac{-1}{d}} \le C\sqrt{T}|S|^{\frac{-m}{m-1}}N^{\frac{-2}{m-1}}\eta_T^{\frac{1}{m-1}} \iff |S| \le C\eta_T^{\frac{d+m-1}{dm+m-1}}\left[\frac{\sqrt{T}}{N^{\left(\frac{2}{d}+\frac{2}{m-1}\right)}}\right]^{\frac{1}{1/d+m/(m-1)}}.
\]
For the set $S_\lambda$, we have the bound
\[
|S_\lambda| \le \sum_{j=1}^{N}1_{\left\{|\beta_j^0|>\lambda\right\}}\left(\frac{\left|\beta_j^0\right|}{\lambda}\right)^r \le \lambda^{-r}\sum_{j=1}^{N}1_{\left\{|\beta_j^0|>0\right\}}\left|\beta_j^0\right|^r = \lambda^{-r}s_r,
\]
and it is sufficient to assume that
\[
\lambda^{-r}s_r \le C\eta_T^{\frac{d+m-1}{dm+m-1}}\left[\frac{\sqrt{T}}{N^{\left(\frac{2}{d}+\frac{2}{m-1}\right)}}\right]^{\frac{1}{1/d+m/(m-1)}}.
\]
When this bound is satisfied, $\sum_{i=1}^{N}\sum_{j=1}^{N}\left(R^{(i)} + R^{(ii)} + R^{(iii)}\right) \le 3\eta_T$, and $P\left(\mathcal{CC}_T(S_\lambda)\right) \ge 1 - 3\eta_T$.
■
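As a small numerical sanity check of the weak-sparsity bound $|S_\lambda| \le \lambda^{-r}s_r$ above (illustrative only; the coefficient vector below is made up): for any $\beta$ and $\lambda > 0$, the number of coefficients exceeding $\lambda$ in absolute value is at most $\lambda^{-r}\sum_j |\beta_j^0|^r$.

```python
# Numerical check of |S_lambda| <= lambda^{-r} * s_r, where
# S_lambda = {j : |beta_j| > lambda} and s_r = sum_j |beta_j|^r over nonzero beta_j.
# The coefficient vector is hypothetical, chosen for illustration.
beta = [1.5, -0.8, 0.3, 0.05, 0.01, 0.0, 0.0]
lam, r = 0.1, 0.5

s_r = sum(abs(b) ** r for b in beta if b != 0.0)
lhs = sum(1 for b in beta if abs(b) > lam)   # |S_lambda|
rhs = lam ** (-r) * s_r                      # weak-sparsity bound

print(lhs, round(rhs, 3))
assert lhs <= rhs
```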
Proof of Lemma 2.A.4. By the union bound, Markov's inequality and the mixingale concentration inequality of Hansen (1991b, Lemma 2), it follows that
\[
\begin{aligned}
P\left(\max_{j\le N, l\le T}\left|\sum_{t=1}^{l}u_tx_{j,t}\right| > z\right) &\le \sum_{j=1}^{N}P\left(\max_{l\le T}\left|\sum_{t=1}^{l}u_tx_{j,t}\right| > z\right) \le z^{-m}\sum_{j=1}^{N}\mathrm{E}\left[\max_{l\le T}\left|\sum_{t=1}^{l}u_tx_{j,t}\right|^m\right] \\
&\le z^{-m}\sum_{j=1}^{N}C_1^m\left(\sum_{t=1}^{T}c_t^2\right)^{m/2} \le CNT^{m/2}z^{-m},
\end{aligned}
\]
as $\{x_{j,t}u_t\}$ is a mixingale of appropriate size by Lemma 2.A.1.
■
Proof of Lemma 2.A.5. This result follows directly by Corollary 6.8 in Bühlmann and van de Geer (2011).
■
Proof of Lemma 2.A.6. The proof largely follows Theorem 2.2 of van de Geer (2016) applied to $\beta = \beta^0$ with some modifications. For the sake of clarity and readability, we include the full proof here. Consider two cases. First, consider the case where $\frac{\|X(\hat{\beta}-\beta^0)\|_2^2}{T} < -\frac{\lambda}{4}\|\hat{\beta} - \beta^0\|_1 + 2\lambda\|\beta^0_{S^c}\|_1$. Then
\[
\frac{\|X(\hat{\beta} - \beta^0)\|_2^2}{T} + \frac{\lambda}{4}\|\hat{\beta} - \beta^0\|_1 < 2\lambda\|\beta^0_{S^c}\|_1 < \frac{8}{3}\lambda\|\beta^0_{S^c}\|_1 + C\lambda^2|S|,
\]
which satisfies Lemma 2.A.6.
Next, consider the case where $\frac{\|X(\hat{\beta}-\beta^0)\|_2^2}{T} \ge -\frac{\lambda}{4}\|\hat{\beta} - \beta^0\|_1 + 2\lambda\|\beta^0_{S^c}\|_1$. From the Lasso optimization problem in eq. (2.3), we have the Karush-Kuhn-Tucker conditions $\frac{X'(y - X\hat{\beta})}{T} = \lambda\hat{\kappa}$, where $\hat{\kappa}$ is the subdifferential of $\|\hat{\beta}\|_1$. Premultiplying by $(\beta^0 - \hat{\beta})'$, we get
\[
\frac{(\beta^0 - \hat{\beta})'X'(y - X\hat{\beta})}{T} = \lambda(\beta^0 - \hat{\beta})'\hat{\kappa} = \lambda{\beta^0}'\hat{\kappa} - \lambda\|\hat{\beta}\|_1 \le \lambda\|\beta^0\|_1 - \lambda\|\hat{\beta}\|_1.
\]
By plugging in $y = X\beta^0 + u$, the left-hand side can be re-written as $\frac{\|X(\hat{\beta}-\beta^0)\|_2^2}{T} + \frac{u'X(\beta^0 - \hat{\beta})}{T}$, and therefore
\[
\begin{aligned}
\frac{\|X(\hat{\beta} - \beta^0)\|_2^2}{T} &\le \frac{u'X(\hat{\beta} - \beta^0)}{T} + \lambda\|\beta^0\|_1 - \lambda\|\hat{\beta}\|_1 \overset{(1)}{\le} \frac{1}{T}\left\|u'X\right\|_\infty\|\hat{\beta} - \beta^0\|_1 + \lambda\|\beta^0\|_1 - \lambda\|\hat{\beta}\|_1 \\
&\overset{(2)}{\le} \frac{\lambda}{4}\|\hat{\beta} - \beta^0\|_1 + \lambda\|\beta^0\|_1 - \lambda\|\hat{\beta}\|_1 \overset{(3)}{\le} \frac{5\lambda}{4}\|\hat{\beta}_S - \beta^0_S\|_1 - \frac{3\lambda}{4}\|\hat{\beta}_{S^c}\|_1 + \frac{5\lambda}{4}\|\beta^0_{S^c}\|_1 \\
&\overset{(4)}{\le} \frac{5\lambda}{4}\|\hat{\beta}_S - \beta^0_S\|_1 - \frac{3\lambda}{4}\|\hat{\beta}_{S^c} - \beta^0_{S^c}\|_1 + 2\lambda\|\beta^0_{S^c}\|_1,
\end{aligned}
\]
where (1) follows from the dual norm inequality, (2) from the bound on the empirical process given by $\mathcal{E}_T\left(\frac{T\lambda}{4}\right)$, (3) from the property $\|\beta\|_1 = \|\beta_S\|_1 + \|\beta_{S^c}\|_1$ with $\beta_{j,S} = \beta_j1_{\{j\in S\}}$, as well as several applications of the triangle inequality, and (4) follows from the fact that $\|\hat{\beta}_{S^c}\|_1 \ge \|\hat{\beta}_{S^c} - \beta^0_{S^c}\|_1 - \|\beta^0_{S^c}\|_1$. Note that it follows from the condition $\frac{\|X(\hat{\beta}-\beta^0)\|_2^2}{T} \ge -\frac{\lambda}{4}\|\hat{\beta} - \beta^0\|_1 + 2\lambda\|\beta^0_{S^c}\|_1$ combined with the previous inequality that $\|\hat{\beta}_{S^c} - \beta^0_{S^c}\|_1 \le 3\|\hat{\beta}_S - \beta^0_S\|_1$ such that Lemma 2.A.5 can be applied. Adding $\frac{3\lambda}{4}\|\hat{\beta}_S - \beta^0_S\|_1$ to both sides
and re-arranging, we get by applying Lemma 2.A.5
\[
\begin{aligned}
\frac{4}{3}\frac{\|X(\hat{\beta} - \beta^0)\|_2^2}{T} + \frac{\lambda}{4}\|\hat{\beta} - \beta^0\|_1 &\le \frac{8}{3}\lambda\|\hat{\beta}_S - \beta^0_S\|_1 + \frac{8}{3}\lambda\|\beta^0_{S^c}\|_1 \\
&\le \frac{8}{3}\lambda C\sqrt{|S|(\hat{\beta} - \beta^0)'\hat{\Sigma}(\hat{\beta} - \beta^0)} + \frac{8}{3}\lambda\|\beta^0_{S^c}\|_1.
\end{aligned}
\]
Using that $2uv \le u^2 + v^2$ with $u = \frac{1}{\sqrt{3}}\sqrt{(\hat{\beta} - \beta^0)'\hat{\Sigma}(\hat{\beta} - \beta^0)}$, $v = \frac{4}{\sqrt{3}}C\lambda\sqrt{|S|}$, we further bound the right-hand side to arrive at
\[
\frac{4}{3}\frac{\|X(\hat{\beta} - \beta^0)\|_2^2}{T} + \frac{\lambda}{4}\|\hat{\beta} - \beta^0\|_1 \le \frac{1}{3}\frac{\|X(\hat{\beta} - \beta^0)\|_2^2}{T} + C\lambda^2|S| + \frac{8}{3}\lambda\|\beta^0_{S^c}\|_1,
\]
from which the result follows.
■
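The KKT conditions invoked above can be checked numerically. The sketch below (illustrative, with made-up data; not the chapter's estimator) fits a lasso by coordinate descent and verifies that $|x_j'(y - X\hat{\beta})/T| \le \lambda$ for every coordinate, with equality (up to sign) on active coordinates.

```python
import random

# Coordinate-descent lasso on a tiny synthetic dataset, then a check of the
# Karush-Kuhn-Tucker conditions X'(y - X beta_hat)/T = lambda * kappa_hat,
# where kappa_hat is a subgradient of the L1 norm.  All data are made up.
random.seed(0)
T, N, lam = 200, 3, 0.1
X = [[random.gauss(0, 1) for _ in range(N)] for _ in range(T)]
beta0 = [1.0, 0.0, -0.5]
y = [sum(X[t][j] * beta0[j] for j in range(N)) + 0.3 * random.gauss(0, 1)
     for t in range(T)]

def soft(z, g):
    # Soft-thresholding operator, the proximal map of the L1 penalty.
    return (z - g) if z > g else (z + g) if z < -g else 0.0

beta = [0.0] * N
for _ in range(500):  # coordinate descent sweeps
    for j in range(N):
        r_j = [y[t] - sum(X[t][k] * beta[k] for k in range(N) if k != j)
               for t in range(T)]
        rho = sum(X[t][j] * r_j[t] for t in range(T)) / T
        norm = sum(X[t][j] ** 2 for t in range(T)) / T
        beta[j] = soft(rho, lam) / norm

resid = [y[t] - sum(X[t][j] * beta[j] for j in range(N)) for t in range(T)]
grad = [sum(X[t][j] * resid[t] for t in range(T)) / T for j in range(N)]
# KKT: |grad_j| <= lam everywhere; grad_j = lam * sign(beta_j) on active coords.
assert all(abs(g) <= lam + 1e-8 for g in grad)
print([round(b, 2) for b in beta], [round(g, 3) for g in grad])
```

At a fixed point of the coordinate updates the full-residual gradient equals $\lambda\,\mathrm{sign}(\hat{\beta}_j)$ on active coordinates and lies in $[-\lambda, \lambda]$ on inactive ones, which is exactly the subdifferential condition used in the proof.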
Proof of Lemma 2.A.7. By Assumption 2.3 and Lemma 2.A.6, we have on the set ET (T λ4 )∩
CC T (Sλ )
∥X(β̂ − β 0 )∥22
λ
8
+ ∥β̂ − β 0 ∥1 ≤Cλ2 |Sλ | + λ∥β 0S c ∥1 .
λ
T
4
3
62
2.C Supplementary Results
It follows directly from Assumption 2.2 that
β 0S c
λ
=
1
N
X
j=1
1{0<|β 0 |≤λ} βj0
j
≤
N
X
j=1
1{|β 0 |>0}
j
λ
βj0
!1−r
βj0 = λ1−r
N
X
1{|β 0 |>0} βj0
j=1
r
j
≤ λ1−r sr .
and by arguments in the proof of Lemma 2.A.3, |Sλ | ≤ λ−r sr Plugging these in, we obtain
∥X(β̂ − β 0 )∥22
λ
8
+ ∥β̂ − β 0 ∥1 ≤ Cλ2 λ−r sr + λλ1−r sr = Cλ2−r sr .
T
4
3
2.C.2
■
Proofs of preliminary results Section 2.4
Proof of Lemma 2.B.1. As vj,t are the projection errors from projecting xj,t on all other
xk,t , it follows directly that E [vj,t ] = 0 and E [vj,t xk,t ] = 0. Lm̄ -boundedness of {vj,t xk,t }, ∀j, k
follows from Assumption 2.1(i), Assumption 2.4, and the Cauchy-Schwarz inequality. By
Theorem 17.8 in Davidson (2002b), {vj,t } is L2m -NED on {sT,t } of size −d. The remainder
■
of the proof follows as in the proof of Lemma 2.A.1.
Proof of Lemma 2.B.2. It follows by the Cauchy-Schwarz inequality that {wj,t } is Lm̄ bounded for all j = 1, . . . , p, and from the properties of {vj,t } by Theorem 17.9 in Davidson
(2002b) that {wj,t } is Lm -NED of size −d. Part (i) then follows by Theorem 17.5 in Davidson
(2002b). For part (ii), we adapt the proof of Theorem 17.7 in Davidson (2002b). Letting
Yt = wj,t and Xt = wk,t , Ewj,t wk,t−l = EYt Xt−l . By the triangle inequality, choosing
t−l+q
q = [l/2], and using Ft−l−q
as in Definition 2.A.1,
i
oi
h
h
n
t−l+q
t−l+q
.
+ E Yt E Xt−l |Ft−l−q
|EYt Xt−l | ≤ E Yt Xt−l − E Xt−l |Ft−l−q
By Hölder’s inequality, we can bound the first term
n
oi
h
h
n
o
i m−1 h
m
m
t−l+q
t−l+q
≤ E |Yt+q | m−1
E Yt Xt−l − E Xt−l |Ft−l−q
E Xt−l − E Xt−l |Ft−l−q
1
m i m
i m−1
h
m
m
m
Since m−1
≤ C, and since Xt−l is NED of size −d,
< m < m̄, E |Yt+q | m−1
h
n
o m i 1
m
t−l+q
E Xt−l − E Xt−l |Ft−l−q
≤ Cψq , where ψq = O(q −d−ϵ ) for some ϵ > 0. For the
second term, we use the tower property and Hölder’s inequality again
h
i
h
i
t−l+q
t−l+q
t−l+q
E Yt E Xt−l |Ft−l−q
= E E Yt |Ft−l−q
E Xt−l |Ft−l−q
h
t−l+q
≤ E E Yt |Ft−l−q
1
m i m
t−l+q
E E Xt−l |Ft−l−q
m
m−1
m−1
m
.
63
.
2 Desparsified Lasso in Time Series
Since conditioning is a contractionary projection in Lp spaces,
h
t−l+q
E E Yt |Ft−l−q
1
m i m
h
t−l+q
≤ E E Yt |F−∞
m−1
m
m−1
t−l+q
E E Xt−l |Ft−l−q
m
1
m i m
h
i m−1
m
m
≤ E |Xt−l | m−1
≤ C.
Since Yt is a Mixingale of size −d, the first term can be bounded by Cψq−l , where ψq−l =
O((q − l)−d−ϵ ). The sequence ϕl is then obtained by recalling that we chose q = [l/2],
ϕl = O((l/2)−d−ϵ ) = O(l−d−ϵ ). Absolute summability follows by properties of p-series,
since d ≥ 1. Note this results also holds for
max
q≤j,k≤N, 1≤t≤T
|E [wj,t wk,t−l ]| since C and ϕl
are independent of j, k, and t. (iii) follows by repeated application of Corollary 17.11 and
Theorem 17.5 in Davidson (2002b), noting that E(wj,t wk,t−l ) is a non-random and bounded,
■
so trivially NED.
Proof of Lemma 2.B.3. By Lemma 2.A.3, P (CC T (Sλ )) ≥ 1 − 3ηT when
d+m−1
λ−r sr ≤ CηTdm+m−1
√
1
T
2
d
+
1
m
m−1
2
N ( d + m−1 )
for a sequence ηT → 0 such that ηT ≤
N2
.
e
,
We can similarly apply this lemma to the sets
CC T (Sλ,j ); when
d+m−1
dm+m−1
λ−r
j sr,j ≤ CηT
√
T
2
1
d
2
N ( d + m−1 )
+
1
m
m−1
,
!
T
CC T (Sλ,j ) ≥ 1−[1 − P (CC T (Sλ ))]−
P (CC T (Sλ,j )) ≥ 1−3ηT . By the union bound, P CC T (Sλ )
j∈H
P
[1 − P (CC T (Sλ,j ))] ≥ 1−3(1+h)ηT , when the conditions above hold for all j ∈ H. These
j∈H
conditions are then jointly satisfied by the conditions this lemma, which are expressed in
terms of sr,max and λmin .
■
√
(j)
Proof of Lemma 2.B.4. By Lemmas 2.A.4 and 2.B.1, we have P ET (xj ) ≤ CN ( T /xj )m .
Then
!
P
\
j∈H
(j)
ET (xj )
≥1−
X
j∈H
P
n
oc
hN T m/2
(j)
E T xj
≥1−C
.
min xm
j
Proof of Lemma 2.B.5. Note that
(
)!
T
\
h
1 X 2
2
P(LT ) = P
vj,t − τj ≤
=1−P
T t=1
δT
j∈H
!
T
X
1 X 2
h
2
≥1−
P
vj,t − τj >
.
T t=1
δT
j∈H
64
■
j∈H
(
[
j∈H
T
1 X 2
h
vj,t − τj2 >
T t=1
δT
)!
2.C Supplementary Results
Recalling that τj2 =
T
P
1
T
2
E vj,t
, write P
t=1
1
T
T
P
2
vj,t
− τj2 >
t=1
h
δT
=P
T
P
2
2
(vj,t
− Evj,t
) >
t=1
As in the proof of Lemma 2.A.3, we use the Triplex inequality to bound this probability.
T
X
h
2
2
(vj,t
− Evj,t
) >T
δ
T
t=1
P
+6
!
≤ 2q exp −
T h2
288q 2 κ2T δT2
T
T
i
δT X
δT X h 2
2
2
|Ft−q − Evj,t
E E vj,t
+ 15
E vj,t 1{|v2 |>κT }
j,t
T h t=1
T h t=1
:= R(i) + R(ii) + R(iii) .
For the second term, note by the proof of Lemma 2.B.1 that {vj,t } is L2m -NED on {sT,t }
2
is Lm̄ -bounded, and by Theorem 17.9 of Davidson
of size −d. By Assumption 2.4, vj,t
2
2
(2002b), it is Lm -NED on {sT,t } of size −d. By Theorem 17.5 vj,t
− Evj,t
is then an
2
2
Lm -mixingale of size −d. It then follows that E E vj,t |Ft−q − Evj,t ≤ ct ψq ≤ Cq −d ,
and
X
R(ii) ≤
j∈H
X
6
j∈H
T
δT
δT X −d
Cq = C d .
T h t=1
q
For the third term, we have by Hölder’s and Markov’s inequalities
h
i
2
1{|v2 |>κT } ≤ Cκ1−m
.
E vj,t
T
j,t
and therefore
X
R(iii) ≤
j∈H
X
15
j∈H
T
δT X
δT
Cκ1−m
= C m−1 .
T
T h t=1
κT
We jointly bound all three terms by a sequence ηT → 0.
T h2
Cqh exp − 2 2 2 ≤ ηT ,
q κT δ T
(1)
C
(2)
For the steps below, we assume that
ηT
h
1
e
≤
δT
≤ ηT ,
qd
=⇒
(3)
C
δT
m−1
κT
≤ ηT .
p
− ln(ηT /(hq)) ≥ 1. Isolate κT in
(1) and (2),
Cqh exp
C
δT
κm−1
T
−T h2
q 2 κ2T δT2
√
≤ ηT ⇐⇒ κT ≤ C
≤ ηT ⇐⇒ κT ≥ C
δT
ηT
Th
,
qδT
1/(m−1)
.
Combining both bounds on κT ,
C1
δT
ηT
1/(m−1)
≤ C2
√
Th
qδT
⇐⇒
√
1/(m−1) −m/(m−1)
q ≤ C T hηT
δT
.
65
Th
δT
.
2 Desparsified Lasso in Time Series
Isolating q from (2), gives
CδT q −d ≤ ηT
⇐⇒
−1/d 1/d
δT .
q ≥ CηT
Combining both bounds on q,
√
1/(m−1) −m/(m−1)
1/d −1/d
C1 T hηT
δT
≥ C2 δT η T
When δT satisfies this upper bound,
P
d+m−1
√
1
δT ≤ CηTdm+m−1 ( T h) 1/d+m/(m−1) .
⇐⇒
(R(i) + R(ii) + R(iii) ) ≤ 3ηT , and P (LT ) ≥ 1 − 3ηT ,
j∈H
which completes the proof.
■
Proof of Lemma 2.B.6. Note that τ̂j2 can be rewritten as follows
τ̂j2
=
xj − X −j γ 0j
2
2
+
X −j γ̂ j − γ 0j
2
2
T
T
′
2 xj − X −j γ 0j X −j γ̂ j − γ 0j
−
+ λj ∥γ̂ j ∥1
T
′
2
T
X −j γ̂ j − γ 0j 2
2 xj − X −j γ 0j X −j γ̂ j − γ 0j
1 X 2
=
vj,t +
−
+ λj ∥γ̂ j ∥1 .
T t=1
T
T
(2.C.1)
Then
|τ̂j2
−
τj2 |
2
T
X −j γ̂ j − γ 0j 2
1 X 2
2
≤
vj,t − τj +
T t=1
T
′
2 xj − X −j γ 0j X −j γ̂ j − γ 0j
+ λj ∥γ̂ j ∥1
+
T
=: R(i) + R(ii) + R(iii) + R(iv) .
By the set LT , we have R(i) ≤ max
j∈H
1
T
T
P
2
vj,t
− τj2 ≤
t=1
nodewise regression, it holds that R(ii) ≤
(j)
C1 λ2−r
sr
j
h
δT
. By Corollary 2.1 applied to the
T
λ
(j)
≤ C1 λ̄2−r s¯r . By the set
{ET (T 4j )}
j∈H
and the same error bound, we have
R(iii)
2 v ′j X −j γ̂ j − γ 0j
=
T
≤ C2 λj γ̂ j − γ 0j
1
≤ C2 λ̄2−r s̄r .
By the triangle inequality R(iv) ≤ λj ∥γ 0j ∥1 + λj ∥γ̂ j − γ 0j ∥1 . Using the weak sparsity index
for the nodewise regressions Sλ,j = {k ̸= j : |γj,k | > λj }, write ∥γ 0j ∥1 =
(γ 0j )Sλ,j
. These terms can then be bounded as follows
1
c
(γ 0j )Sλ,j
66
=
1
X
1{|γ 0
j,k
k̸=j
0
|≤λj } |γj,k |
≤ λ1−r
sr(j) ≤ λ̄1−r s̄r .
j
c
(γ 0j )Sλ,j
+
1
2.C Supplementary Results
Bounding the L1 norm by the L2 norm, we get
2
(γ 0j )Sλ,j
1
≤|Sλ,j |∥γ 0j ∥22 ≤ λ−r s̄r ∥γ 0j ∥22 ,
¯
To further bound ∥γ 0j ∥22 , consider the matrix Θ = Σ−1 =
1
T
PT
t=1
−1
E [xt x′t ]
and the
partitioning
"
Σ=
1
T
1
T
PT
t=1
PT
t=1
E x2j,t
PT
1
t=1 E
T
P
T
1
t=1 E
T
E (x−j,t xj,t )
xj,t x′−j,t
#
x−j,t x′−j,t
.
By blockwise matrix inversion, we can write the jth row of Θ as
"
#−1
T
T
X
1 X
1
1
1
′
′
= 1 1, (γ 0j )′ . (2.C.2)
Θj = 2 , − 2
E xj,t x−j,t
E x−j,t x−j,t
τj
τj T t=1
T t=1
τj2
It then follows that
∥γ 0j ∥22 =
X 0 2
X 0 2
τj4
(γj,k ) ≤ 1 +
(γj,k ) = τj4 Θj Θ′j ≤ 2 ,
Λmin
k̸=j
as
1
Λmin
k̸=j
is the largest eigenvalue of Θ. For a bound on τj2 , by the definition of γ 0j from
eq. (2.7) and Assumption 2.5(ii), it follows that
( "
τj2
#)
T
2
1 X
′
= min E
xj,t − x−j,t γ j
γj
T t=1
"
#
T
T
2
1 X
1 X 2
′
≤E
xj,t − x−j,t 0
=
E xj,t = Σj,j ≤ C.
T t=1
T t=1
Similar arguments can be used to bound τj2 from below. By the proof of Lemma 5.3 in
van de Geer et al. (2014), τj2 =
1
Θj,j
, and therefore τj2 ≥ Λmin . It then follows from Assump-
tion 2.5(ii) that
1
≤ τj2 ≤ C, uniformly over j ∈ 1, . . . , N.
C
We therefore have ∥γ 0j ∥2 ≤
τj2
Λmin
≤ C 2 , such that we can bound the fourth term as
c
R(iv) ≤ λj ∥γ 0j ∥1 + λj ∥γ̂ j − γ 0j ∥1 = λj (γ 0j )Sλ,j
q
≤ λ̄2−r s̄r + λ̄ λ−r s̄r C12 + C2 λ̄2−r s̄r
1
+ λj (γ 0j )Sλ,j
1
+ λj ∥γ̂ j − γ 0j ∥1
¯
Combining all bounds, we have
q
h
+ C1 λ̄2−r s¯r + C2 λ̄2−r s¯r + λ̄2−r s̄r + λ̄2 λ−r s̄r C32 + C4 λ̄2−r s̄r
¯
δT
q
h
=
+ C5 λ̄2−r s̄r + C6 λ̄2 λ−r s̄r .
¯
δT
|τ̂j2 − τj2 | ≤
67
2 Desparsified Lasso in Time Series
For the second statement in Lemma 2.B.6, we have by the triangle inequality and eq. (2.B.1)
that
|τ̂j2 − τj2 |
1
1
≤
−
≤
τ̂j2
τj2
τj4 − τj2 |τ̂j2 − τj2 |
|τ̂j2 − τj2 |
1
− C|τ̂j2 − τj2 |
C2
q
h
+ C5 λ̄2−r s̄r + C6 λ̄2 λ−r s̄r
δT
¯
.
≤
q
−r
h
2−r
2
C7 − C8 δT + C5 λ̄
s̄r + C6 λ̄ λ s̄r
■
¯
Proof of Lemma 2.B.7. First, note that since Σ̂ is a symmetric matrix
o
n
o
n
′
max ∥e′j − Θ̂j Σ̂∥∞ = max ∥Σ̂Θ̂j − ej ∥∞ .
j∈H
j∈H
By the extended KKT conditions
(see Section 2.1.1 of van de Geer et al., 2014), we have
n
o
′
λ
that max ∥Σ̂Θ̂j − ej ∥∞ ≤ max τ̂ 2j ≤ min λ̄ τ̂ 2 . For a lower bound on min τ̂j2 , note
}
{
j∈H
j∈H
j∈H
j
j
j∈H
that by eq. (2.C.1), τ̂j2 can be rewritten as
τ̂j2
′
∥xj − X −j γ 0j ∥22
∥X −j γ̂ j − γ 0j ∥22
2 xj − X −j γ 0j X −j γ̂ j − γ 0j
=
+
−
+ λj ∥γ̂ j ∥1 .
T
T
T
With
2
∥X −j (γ̂ j −γ 0
j )∥2
T
≥ 0 and λj ∥γ̂ j ∥1 ≥ 0 by definition for all j, we have
′
∥xj − X −j γ 0j ∥22
2 xj − X −j γ 0j X −j γ̂ j − γ 0j
τ̂j2 ≥
−
=
T
T
T
P
2
vj,t
t=1
T
−
2v ′j X −j γ̂ j − γ 0j
.
T
The dual norm inequality in combination with the triangle inequality then gives
T
1 X 2
2
vj,t − τj2 − max |v ′j xk | ∥γ̂ j − γ 0j ∥1 ,
T t=1
T k̸=j
)
(
T
2
1
1 X 2
≥
− max
vj,t − τj2 − max |v ′j xk | ∥γ̂ j − γ 0j ∥1 ,
j
C
T t=1
T k̸=j
τ̂j2 ≥ τj2 −
(j)
where the second line follows from eq. (2.B.1). Then, on the sets LT and ET (T
τ̂j2 ≥ C1 −
λj
4
)
λj
h
h
h
−
∥γ̂ j − γ 0j ∥1 ≥ C1 −
− C2 λ2−r
− C2 λ̄2−r s̄r ,
sr(j) ≥ C1 −
j
δT
2
δT
δT
where Corollary 2.1 yields the second inequality. As λ̄2−r s̄r → 0, for a large enough T we
have that
min
j
1
≤
τ̂j2 C1 −
h
δT
1
− C2 λ̄2−r s̄r
from which the result follows.
■
Proof of Lemma 2.B.8. Note that the jth row of the matrix I − Θ̂Σ̂ is e′j − Θ̂j Σ̂, where
68
2.C Supplementary Results
Θ̂j is the jth row of Θ̂. Plugging in the definition of ∆, we have
n
o
√
√
max |∆j | = T max e′j − Θ̂j Σ̂ β̂ − β 0 ≤ T max ∥e′j − Θ̂j Σ̂∥∞ ∥β̂ − β 0 ∥1 .
j∈H
j∈H
j∈H
By Lemma 2.A.7, under Assumptions 2.2 and 2.5(ii), on the sets ET (T λ4 ) ∩ CC T (Sλ ), we have
∥X(β̂ − β 0 )∥22
+ λ∥β̂ − β 0 ∥1 ≤ Cλ2−r sr ,
T
(2.C.3)
from which it follows that ∥β̂ − β 0 ∥1 ≤ Cλ1−r sr . Combining this bound with Lemma 2.B.7
gives
√
max |∆j | ≤ T λ1−r sr
j∈H
C1 −
h
δT
λ̄
.
− C2 λ̄2−r s̄r
■
Proof of Lemma 2.B.9. Starting from the nodewise regression model, write
1
1
1
√ v̂ ′j u − v ′j u = √ u′ X −j γ 0j − γ̂ j ≤ √
T
T
T
u′ X
∞
γ̂ j − γ 0j
1
.
By the set ET (T λ) and Corollary 2.1,
′
{|u Xj |}
√ max
j
T
γ̂ j − γ 0j
T
1
√
≤ T λ γ̂ j − γ 0j
1
√
√
≤ C T λλ1−r
sr(j) ≤ C T λ2−r
max s̄r ,
j
where the upper bound is uniform over j ∈ H.
■
Proof of Lemma 2.B.10. By the union bound
(
P
\
j∈H
max
s≤T
s
X
)!
vj,t ut ≤ x
≥1−
t=1
X
P max
j∈H
s≤T
!
s
X
vj,t ut > x .
t=1
By the Markov inequality, Lemma 2.B.2 and the mixingale concentration inequality of
(Hansen, 1991b, Lemma 2),
P max
s≤T
s
X
s
P
vj,t ut
E max
!
vj,t ut > x
≤
t=1
s≤T
t=1
xm
m
C1m
≤
T
P
(j)
ct
t=1
xm
2 m/2
=
CT m/2
,
xm
■
from which the result follows.
Proof of Lemma 2.B.11. Start by writing
1 v̂ ′j u
1 v ′j u
1
√
−√
≤ √
2
2
τ̂
τ
T j
T j
T
v̂ ′j u − v ′j u
1
1
+ 2 − 2
τ̂j2
τ̂j
τj
v ′j u
√
=: R(i) + R(ii) .
T
For the first term, we can bound from above using Lemmas 2.B.6 and 2.B.9 and eq. (2.B.1),
69
2 Desparsified Lasso in Time Series
all providing bounds uniform over j ∈ H. We then get
R(i) ≤
|v̂ ′j u − v ′j u|
1
√
≤
|τj2 | − |τ̂j2 − τj2 |
T
√
C5 T λ2−r
max s̄r
1/C6 −
h
δT
+ C1 λ̄2−r s̄r + C2
q
λ̄2 λ−r s̄r
.
¯
For the second term, we can bound from above using Lemma 2.B.6 and the set
T (j)
ET,uv (h1/m T 1/2 ηT−1 ) to get the uniform bound
j∈H
R(ii)
q
h1/m ηT−1 δhT + C7 λ̄2−r s̄r h1/m ηT−1 + C8 λ̄2 λ−r s̄r h1/m ηT−1
¯
.
≤
q
C9 − C10 δhT + C1 λ̄2−r s̄r + C2 λ̄2 λ−r s̄r
¯
Combining both bounds gives
R(i) + R(ii)
q
√
1/m −1
h1/m ηT−1 δhT + C1 h1/m ηT−1 T λ2−r
ηT
λ̄2 λ−r s̄r
max s̄r + C2 h
¯
≤
q
−r
C3 − C4 δhT + C1 λ̄2−r s̄r + C2 λ̄2 λ s̄r
¯
■
from which the result follows.
Proof of Lemma 2.B.12. The result follows directly from the Markov inequality
h
i
P ∥d∥∞ > x ≤ x−p E max |dt |p ≤ x−p T max E |dt |p ≤ Cx−p T.
t
t
■
Proof of Lemma 2.B.13. We can write
T
T
1 X
1 X
(ŵj,t ŵk,t−l − wj,t wk,t−l ) ≤
(ŵj,t − wj,t ) (ŵk,t−l − wk,t−l )
T
T
t=l+1
+
1
T
t=l+1
T
X
(ŵj,t − wj,t ) wk,t−l +
t=l+1
T
1 X
wj,t (ŵk,t−l − wk,t−l )
T
t=l+1
1
=:
R(i) + R(ii) + R(iii) .
T
Take R(i) first. Using that ŵj,t−q = ût−q v̂j,t−q , straightforward but tedious calculations
70
2.C Supplementary Results
show that
T
X
R(i) ≤
(ût − ut ) (ût−l − ut−l ) (v̂j,t − vj,t ) (v̂k,t−l − vk,t−l )
t=l+1
+
T
X
(ût − ut ) (ût−l − ut−l ) (v̂j,t − vj,t ) vk,t−l +
t=l+1
+
T
X
T
X
(ût − ut ) (ût−l − ut−l ) vj,t (v̂k,t−l − vk,t−l ) +
t=l+1
+
T
X
T
X
T
X
(ût − ut ) (ût−l − ut−l ) vj,t vk,t−l
t=l+1
(ût − ut ) ut−l vj,t (v̂k,t−l − vk,t−l ) +
t=l+1
+
(ût − ut ) ut−l (v̂j,t − vj,t ) (v̂k,t−l − vk,t−l )
t=l+1
T
X
ut (ût−l − ut−l ) (v̂j,t − vj,t ) (v̂k,t−l − vk,t−l )
t=l+1
ut (ût−l − ut−l ) (v̂j,t − vj,t ) vk,t−l +
T
X
ut ut−l (v̂j,t − vj,t ) (v̂k,t−l − vk,t−l ) =:
R(i),i .
i=1
t=l+1
t=l+1
9
X
p
X −j γ̂ 0 − γ 0j 2 ≤ C T λ̄2−r s̄r on the set PT,nw by Corol
√
≤ C T λ2−r sr on the set PT,las by Corollary 2.1,
lary 2.1, and ∥û − u∥2 = X β̂ − β 0
Using that ∥v̂ j − v j ∥2 =
2
we can use the Cauchy-Schwarz inequality to conclude that
2
R(i),1 ≤ ∥û − u∥22 ∥v̂ j − v j ∥2 ∥v̂ k − v k ∥2 ≤ CT 2 λ2−r sr λ̄2−r s̄r ≤ CT 2 λ2−r
.
max sr,max
On the set ET,u (T 1/2m )
T
j∈H
ET,vj (T 1/2m ), we have that ∥u∥∞ ≤ CT 1/2m , and
∥v j ∥∞ ≤ C(hT )1/2m , uniformly over j ∈ H. Then we can use this, plus the previous results
to find that
R(i),2 ≤ ∥v k ∥∞
T
X
|ût − ut | |ût−l − ut−l | |v̂j,t − vj,t |
t=l+1
3/2
1
≤ ∥v k ∥∞ ∥û − u∥22 ∥v̂ j − v j ∥2 ≤ C(hT ) 2m T 3/2 λ2−r
.
max sr,max
We then find in the same way that
3/2
1
,
R(i),3 ≤ ∥u∥∞ ∥û − u∥2 ∥v̂ j − v j ∥2 ∥v̂ k − v k ∥2 ≤ CT 2m T 3/2 λ2−r
max sr,max
2−r
3/2
1
2
3/2
R(i),4 ≤ ∥û − u∥2 ∥v j ∥∞ ∥v̂ k − v k ∥2 ≤ C(hT ) 2m T
λmax sr,max
,
1
R(i),5 ≤ ∥û − u∥22 ∥v j ∥∞ ∥v k ∥∞ ≤ C(hT ) m T λ2−r
max sr,max .
Defining w̃j,l = (u1 vk,l+1 , . . . , uT vj,T )′ , w̃k,−l = (ul+1 vk,1 , . . . , uT vk,T )′ and ũl = (u1 ul+1 , . . . , uT uT )′ ,
all with m̄ bounded moments, we find on the set
ET,u (T 1/2m ) ∩ ET,ũl (T 1/m )
\
j∈H
ET,w̃j,l (T 1/m )
\
ET,w̃k,−l (T 1/m )
k∈H
71
2 Desparsified Lasso in Time Series
that
1
R(i),6 ≤ ∥w̃j,l ∥∞ ∥û − u∥2 ∥v̂ k − v k ∥2 ≤ C(hT ) m T λ2−r
max sr,max ,
3/2
1
,
R(i),7 ≤ ∥u∥∞ ∥û − u∥2 ∥v̂ j − v j ∥2 ∥v̂ k − v k ∥2 ≤ CT 2m T λ2−r
max sr,max
1
R(i),8 ≤ ∥w̃k,−l ∥∞ ∥û − u∥2 ∥v̂ j − v j ∥2 ≤ C(hT ) m T λ2−r
max sr,max ,
1
R(i),9 ≤ ∥ũl ∥2∞ ∥v̂ j − v j ∥2 ∥v̂ k − v k ∥2 ≤ CT m T λ2−r
max sr,max .
It then follows that
2
3/2
1
R(i) ≤ C1 T λ2−r
+ C2 h1/2m T (m+1)/2m λ2−r
max sr,max
max sr,max
T
+ C3 h1/m T 1/m λ2−r
max sr,max .
For R(ii) we get analogously on the set ET,u (T 1/2m )
ET,vj ((hT )1/2m )
T
j∈H
R(ii) ≤
T
ET,wj ((hT )1/m )
j∈H
T
1 X
(ût − ut ) (v̂j,t − vj,t ) wk,t−l
T
t=l+1
+
T
T
1 X
1 X
(ût − ut ) vj,t wk,t−l +
ut (v̂j,t − vj,t ) wk,t−l
T
T
t=l+1
t=l+1
≤ ∥û − u∥2 ∥v̂ j − v j ∥2 ∥wk ∥∞ + ∥û − u∥2 ∥v j ∥∞ ∥wk ∥∞ + ∥u∥∞ ∥v̂ j − v j ∥2 ∥wk ∥∞ ,
q
q
1
1
3
3
1/2
1/2
2m T
≤ C1 (hT ) m T λ2−r
λ2−r
λ2−r
max sr,max + C3 h m T 2m T
max sr,max .
max sr,max + C2 (hT )
It then follows that
1
T
3/2m (3−m)/2m
R(ii) ≤ C1 h1/m T 1/m λ2−r
T
max sr,max + C2 h
q
λ2−r
max sr,max .
Finally, R(iii) follows identically to R(ii) .
Collect all sets in the set
(j,k)
ET,uvw := ET,u (T 1/2m )
\
ET,vj ((hT )1/2m )
j∈H
∩ ET,ũ (T
1/m
)
\
j∈H
ET,w̃j,l ((hT )1/m )
\
ET,w̃k,−l ((hT )1/m ).
k∈H
Now note that by application of Lemma 2.B.12, we can show that all sets, and by extension
their intersection, have a probability of at least 1 − CT −c for some c > 0. Take for instance
the sets with x = T 1/m . In that
can apply Lemma 2.B.12 with p = m̄ moments to
case
−we
m̄
1/m
obtain a probability of 1 − C T
T = 1 − CT 1−m̄/m , so c = m̄/m − 1 > 0. The sets
for p = 2m̄ moments can be treated similarly. For the sets involving intersections over j ∈!H,
T
Lemma 2.B.12 can be used with an additional union bound argument: P
ET,d (x) ≥
j∈H
1 − Cx−p hT . These sets therefore hold with probability at least 1 − C(hT )−c . Since h is
non-decreasing, this probability converges no slower than 1 − CT −c .
72
■
2.C Supplementary Results
Proof of Lemma 2.B.14. Consider the set
(
)
T
P
max T1
(wj,t wk,t−l − Ewj,t wk,t−l ) ≤ h2 χT . As in Lemma 2.A.3, we use the
(j,k)∈H 2
t=l+1
Triplex inequality (Jiang, 2009) to show under which conditions this set holds with probability converging to 1. By the union bound,
$$P\left(\max_{(j,k)\in H^2}\Bigl|\frac{1}{T}\sum_{t=l+1}^T(w_{j,t}w_{k,t-l} - \mathbb{E}w_{j,t}w_{k,t-l})\Bigr| \le h^2\chi_T\right) \ge 1 - \sum_{(j,k)\in H^2}P\left(\Bigl|\frac{1}{T}\sum_{t=l+1}^T(w_{j,t}w_{k,t-l} - \mathbb{E}w_{j,t}w_{k,t-l})\Bigr| > h^2\chi_T\right).$$
Let $z_t = w_{j,t}w_{k,t-l}$:
$$P\left(\Bigl|\sum_{t=l+1}^T[z_t - \mathbb{E}z_t]\Bigr| > h^2\chi_T T\right) \le 2q\exp\left(\frac{-Th^4\chi_T^2}{288q^2\kappa_T^2}\right) + \frac{6}{h^2T\chi_T}\sum_{t=1}^T\mathbb{E}\bigl|\mathbb{E}(z_t|\mathcal{F}_{t-q}) - \mathbb{E}(z_t)\bigr| + \frac{15}{h^2T\chi_T}\sum_{t=1}^T\mathbb{E}\bigl[|z_t|1_{\{|z_t|>\kappa_T\}}\bigr] =: R_{(i)} + R_{(ii)} + R_{(iii)}.$$
We treat the first term last, as we first need to establish the restrictions put on $\chi_T$, $q$ and $\kappa_T$ by $R_{(ii)}$ and $R_{(iii)}$. For the second term, by Lemma 2.B.2(iii),
$$\mathbb{E}\bigl|\mathbb{E}(z_t|\mathcal{F}_{t-q}) - \mathbb{E}(z_t)\bigr| \le c_t\psi_q \le C\psi_q \le C_1 q^{-d},$$
such that $R_{(ii)} \le Ch^{-2}\chi_T^{-1}q^{-d}$. Hence we need $\chi_T^{-1}q^{-d} \to 0$ as $T \to \infty$, such that $\sum_{(j,k)\in H^2}R_{(ii)} \to 0$.
For the third term, we have by Hölder's and Markov's inequalities
$$\mathbb{E}\bigl[|z_t|1_{\{|z_t|>\kappa_T\}}\bigr] \le \kappa_T^{1-m/2}\mathbb{E}|z_t|^{m/2},$$
so $R_{(iii)} \le Ch^{-2}\chi_T^{-1}\kappa_T^{1-m/2}$. Hence we know that we need to take $\kappa_T$ and $\chi_T$ such that $\chi_T^{-1}\kappa_T^{1-m/2} \to 0$ as $T \to \infty$, giving $\sum_{(j,k)\in H^2}R_{(iii)} \to 0$.
Our goal is to minimize $\chi_T$ while ensuring all conditions are satisfied. We jointly bound all three terms by a sequence $\eta_T \to 0$:
$$(1)\ \sum_{(j,k)\in H^2}R_{(i)} \le Cqh^2\exp\left(\frac{-Th^4\chi_T^2}{q^2\kappa_T^2}\right) \le \eta_T,\qquad (2)\ C\chi_T^{-1}q^{-d} \le \eta_T,\qquad (3)\ C\chi_T^{-1}\kappa_T^{1-m/2} \le \eta_T.$$
For the steps below, we assume that $\frac{\eta_T}{qh^2} \le \frac{1}{e} \implies \sqrt{-\ln(\eta_T/(qh^2))} \ge 1$. First, isolate $\kappa_T$ in (1) and (3),
$$Cqh^2\exp\left(\frac{-Th^4\chi_T^2}{q^2\kappa_T^2}\right) \le \eta_T \iff \kappa_T \le C\frac{\sqrt{T}h^2\chi_T}{q}.$$
2 Desparsified Lasso in Time Series
$$C\chi_T^{-1}\kappa_T^{1-m/2} \le \eta_T \iff \kappa_T \ge C\left(\frac{1}{\chi_T\eta_T}\right)^{2/(m-2)}.$$
Combining both bounds,
$$C_1\left(\frac{1}{\chi_T\eta_T}\right)^{2/(m-2)} \le C_2\frac{\sqrt{T}h^2\chi_T}{q} \iff q \le C\sqrt{T}h^2\chi_T^{m/(m-2)}\eta_T^{2/(m-2)}.$$
Isolating $q$ from (2),
$$C\chi_T^{-1}q^{-d} \le \eta_T \iff q \ge C\left(\frac{1}{\eta_T\chi_T}\right)^{1/d}.$$
Satisfying both bounds on $q$,
$$C_1\sqrt{T}h^2\chi_T^{m/(m-2)}\eta_T^{2/(m-2)} \ge C_2\left(\frac{1}{\eta_T\chi_T}\right)^{1/d} \iff \chi_T \ge C\eta_T^{-\frac{2d+m-2}{dm+m-2}}\left(\sqrt{T}h^2\right)^{-\frac{1}{1/d+m/(m-2)}}.$$
When $\chi_T$ satisfies this lower bound, $\sum_{(j,k)\in H^2}\bigl(R_{(i)} + R_{(ii)} + R_{(iii)}\bigr) \le 3\eta_T$, and
$$P\left(\max_{(j,k)\in H^2}\Bigl|\frac{1}{T}\sum_{t=l+1}^T(w_{j,t}w_{k,t-l} - \mathbb{E}w_{j,t}w_{k,t-l})\Bigr| \le h^2\chi_T\right) \ge 1 - 3\eta_T,$$
which completes the proof. ■
Proof of Lemma 2.B.15. By the definition of $\hat{\Theta}$, it follows directly that $\hat{\Theta}X' = \hat{\Upsilon}^{-2}\hat{V}'$, where $\hat{V} = (\hat{v}_1, \dots, \hat{v}_N)$, such that $\hat{\Theta}X'u/\sqrt{T} = \hat{\Upsilon}^{-2}\hat{V}'u/\sqrt{T}$.
The proof will now proceed by showing that $\max_{1\le p\le P}\bigl|r_{N,p}\bigl(\hat{\Theta}X'u - \Upsilon^{-2}V'u\bigr)/\sqrt{T}\bigr| \xrightarrow{p} 0$ and $\max_{1\le p\le P}|r_{N,p}\Delta| \xrightarrow{p} 0$. By Lemma 2.B.8, it holds that
1≤p≤P
max |∆j | ≤
j∈H
√
T λ1−r sr
λ̄
=: U∆,T ,
C1 − ηT − C2 λ̄2−r s̄r
on the set $P_{T,las} \cap P_{T,nw} \cap L_T$. First note that $U_{\Delta,T} \to 0$, as the assumption $\lambda_{\max}^2\lambda_{\min}^{-r} \le \eta_T\bigl[h^{2/m}\sqrt{T}s_{r,\max}\bigr]^{-1}$ implies that $\sqrt{T}\bar{\lambda}\lambda^{1-r}s_r \to 0$ and $\bar{\lambda}^{2-r}\bar{s}_r \to 0$. Regarding $P_{T,las} \cap P_{T,nw} \cap L_T$, it follows from Lemma 2.A.4 that $P(\mathcal{E}_T(T\lambda/4)) \ge 1 - C\frac{hN}{T^{m/2}\lambda^m}$, and from Lemma 2.B.4 that $P\Bigl(\bigcap_{j\in H}\mathcal{E}_T^{(j)}(T\lambda_j/4)\Bigr) \ge 1 - C\frac{hN}{T^{m/2}\underline{\lambda}^m}$; both of these probabilities converge to 1 when $\lambda_{\min} \ge \eta_T^{-1}\frac{(hN)^{1/m}}{\sqrt{T}}$. By Lemma 2.B.3, $P\Bigl(CC_T(S_\lambda) \cap \bigcap_{j\in H}CC_T(S_{\lambda,j})\Bigr) \ge 1 - 3(1+h)\eta_T' \to 1$ when $h\eta_T' \to 0$ and
$$\lambda_{\min}^{-r}s_{r,\max} \le C\eta_T^{\frac{d+m-1}{dm+m-1}}\left[\frac{\sqrt{T}}{N^{\left(\frac{2}{d}+\frac{2}{m-1}\right)}}\right]^{\frac{1}{\frac{1}{d}+\frac{m}{m-1}}}.$$
For the former condition, we may let hηT′ ≤ ηT =⇒ ηT′ ≤ ηT h−1 and ηT′−1 ≥ ηT−1 h, and
combining this with the latter condition we require that
$$\lambda_{\min}^{-r}s_{r,\max} \le C\eta_T^{\frac{d+m-1}{dm+m-1}}\left[\frac{\sqrt{T}}{(hN)^{\left(\frac{2}{d}+\frac{2}{m-1}\right)}}\right]^{\frac{1}{\frac{1}{d}+\frac{m}{m-1}}},$$
which we assume in this lemma. Note that this bound makes redundant the previous bound $\lambda_{\min} \ge \eta_T^{-1}\frac{(hN)^{1/m}}{\sqrt{T}}$ when $0 < r < 1$, by arguments similar to those in the proof of Theorem 2.1. The probability of $L_T$ converges to 1 by Lemma 2.B.5 when $\delta_T \le C\eta_{T,1}(\sqrt{T}h)^{\frac{1}{1/d+m/(m-1)}}$.
We may therefore let $\delta_T = C\eta_{T,1}(\sqrt{T}h)^{\frac{1}{1/d+m/(m-1)}}$, where $\eta_{T,1}$ will be addressed later in the proof. We assume that $\max_{1\le p\le P}\|r_{N,p}\|_1 < C$, from which it follows that $\max_{1\le p\le P}|r_{N,p}\Delta| \le \max_{1\le p\le P}\|r_{N,p}\|_1\max_{j\in H}|\Delta_j| \to 0$. Similarly
$$\max_{1\le p\le P}\Bigl|r_{N,p}\bigl(\hat{\Theta}X'u - \Upsilon^{-2}V'u\bigr)/\sqrt{T}\Bigr| \le \max_{1\le p\le P}\|r_{N,p}\|_1\,\max_{j\in H}\frac{1}{\sqrt{T}}\left|\frac{\hat{v}_j'u}{\hat{\tau}_j^2} - \frac{v_j'u}{\tau_j^2}\right|.$$
By Lemma 2.B.11, on the set
$$\mathcal{E}_{V,T} := \mathcal{E}_T(T\lambda/4) \cap P_{T,nw} \cap L_T \cap \bigcap_{j\in H}\mathcal{E}_{T,uv}^{(j)}(h^{1/m}T^{1/2}\eta_T^{-1})$$
it holds that
$$\max_{j\in H}\frac{1}{\sqrt{T}}\left|\frac{\hat{v}_j'u}{\hat{\tau}_j^2} - \frac{v_j'u}{\tau_j^2}\right| \le \frac{h^{1/m}\eta_{T,2}^{-1}\frac{h}{\delta_T} + C_1 h^{1/m}\eta_{T,2}^{-1}\sqrt{T}\lambda_{\max}^{2-r}\bar{s}_r + C_2 h^{1/m}\eta_{T,2}^{-1}\sqrt{\bar{\lambda}^2\underline{\lambda}^{-r}\bar{s}_r}}{C_3 - C_4\frac{h}{\delta_T} + C_1\bar{\lambda}^{2-r}\bar{s}_r + C_2\sqrt{\bar{\lambda}^2\underline{\lambda}^{-r}\bar{s}_r}} =: U_{V,T}.$$
Plugging in our choice of $\delta_T$ into the first term in the numerator,
$$h^{1/m}\eta_{T,2}^{-1}\frac{h}{\delta_T} = C(\eta_{T,1}\eta_{T,2})^{-1}h^{1+1/m}(\sqrt{T}h)^{-\frac{1}{1/d+m/(m-1)}} = C(\eta_{T,1}\eta_{T,2})^{-1}\left(\frac{h^{\frac{m+1}{dm}+\frac{2}{m-1}}}{\sqrt{T}}\right)^{\frac{1}{1/d+m/(m-1)}}.$$
We may choose $\eta_{T,1}$ and $\eta_{T,2}$ such that $(\eta_{T,1}\eta_{T,2})^{-1}$ grows arbitrarily slowly. Therefore, this term converges to 0 when $\frac{h^{\frac{m+1}{dm}+\frac{2}{m-1}}}{\sqrt{T}} \to 0$. The two other terms in the numerator then converge to 0 when $\lambda_{\max}^2\lambda_{\min}^{-r} \le \eta_T\bigl[h^{2/m}\sqrt{T}s_{r,\max}\bigr]^{-1}$. Under these rates the denominator then converges to $C_3$, which gives $U_{V,T} \to 0$. The only new set appearing in $\mathcal{E}_{V,T}$ is $\bigcap_{j\in H}\mathcal{E}_{T,uv}^{(j)}(h^{1/m}T^{1/2}\eta_T^{-1})$, whose probability converges to 1 by Lemma 2.B.10. It follows directly that
$$\Bigl|R_N\bigl(\hat{\Theta}X'u - \Upsilon^{-2}V'u\bigr)/\sqrt{T}\Bigr| \xrightarrow{p} 0.$$
■
■
Proof of Lemma 2.B.16. The following bounds on $R_{N,T}^\beta$ and $R_{N,T}^\Omega$ hold on the set
$$P_{T,las} \cap P_{T,nw} \cap L_T \cap \mathcal{E}_{T,uvw} \cap \mathcal{E}_{T,ww}\left(\eta_T^{-1}h^2\bigl(\sqrt{T}h^2\bigr)^{-\frac{1}{1/d+m/(m-2)}}\right),$$
which holds with probability converging to 1 when $\lambda_{\max}^2\lambda_{\min}^{-r} \le \eta_T\bigl[h^{2/m}\sqrt{T}s_{r,\max}\bigr]^{-1}$, $\frac{h^{\frac{m+1}{dm}+\frac{2}{m-1}}}{\sqrt{T}} \to 0$, $\lambda_{\min}^{-r}s_{r,\max} \le C\eta_T^{\frac{d+m-1}{dm+m-1}}\left[\frac{\sqrt{T}}{(hN)^{\left(\frac{2}{d}+\frac{2}{m-1}\right)}}\right]^{\frac{1}{\frac{1}{d}+\frac{m}{m-1}}}$, and, if $r = 0$, $\lambda_{\min} \ge \eta_T^{-1}\frac{(hN)^{1/m}}{\sqrt{T}}$; see the proof of Theorem 2.3 for details. Under Assumption 2.6, $m$ and $d$ may be arbitrarily large, and assuming polynomial growth rates allows us to simplify these conditions to the following:
$$0 < r < 1:\quad \frac{1/2+b}{2-r} < \ell < \frac{1/2-b}{r},$$
$$r = 0:\quad \frac{1/2+b}{2-r} < \ell < 1/2.$$
These bounds are feasible when $b < \frac{1-r}{2}$.
By eq. (2.B.2),
$$R_{N,T}^\Omega \le C_1\Delta\tau\bigl[1 + \Delta\tau + \Delta\tau\Delta\omega\bigr] + C_2 Q_T^{1-d-\delta},$$
where $\delta > 0$,
$$\Delta\tau = \max_{j\in H}\left|\frac{1}{\hat{\tau}_j^2} - \frac{1}{\tau_j^2}\right| \le \frac{\frac{h}{\delta_T} + C_1\bar{\lambda}^{2-r}\bar{s}_r + C_2\sqrt{\bar{\lambda}^2\underline{\lambda}^{-r}\bar{s}_r}}{C_3 - C_4\frac{h}{\delta_T} + C_1\bar{\lambda}^{2-r}\bar{s}_r + C_2\sqrt{\bar{\lambda}^2\underline{\lambda}^{-r}\bar{s}_r}},$$
with $\delta_T = C\eta_{T,1}(\sqrt{T}h)^{\frac{1}{1/d+m/(m-1)}}$, and
$$\Delta\omega = \max_{(j,k)\in H^2}\left|\hat{\omega}_{j,k}^{N,Q_T} - \omega_{j,k}\right| \le (2Q_T+1)\Bigl\{\bigl[C_1 T^{1/2}\lambda_{\max}^{2-r}s_{r,\max} + C_2 h^{\frac{1}{m}}T^{\frac{1}{m}}\lambda_{\max}^{2-r}s_{r,\max}\bigr]^2 + \bigl[C_3 h^{\frac{1}{m}}T^{\frac{3-m}{2m}}\sqrt{\lambda_{\max}^{2-r}s_{r,\max}} + C_4 h^{\frac{m+1}{3m}}T^{\frac{m+3}{6m}}\sqrt{\lambda_{\max}^{2-r}s_{r,\max}}\bigr]^3 + C_5\eta_T^{-1}h^2\bigl(\sqrt{T}h^2\bigr)^{-\frac{1}{1/d+m/(m-2)}}\Bigr\}.$$
$C_2 Q_T^{1-d-\delta}$ is dominated by the term $C_1\Delta\tau[1 + \Delta\tau + \Delta\tau\Delta\omega]$, since $d$ may be arbitrarily large, and we can limit the analysis to $\Delta\tau$ and $\Delta\omega$.
For $\Delta\tau$, we first consider the numerator of the upper bound
$$\frac{h}{\delta_T} + C_1\bar{\lambda}^{2-r}\bar{s}_r + C_2\sqrt{\bar{\lambda}^2\underline{\lambda}^{-r}\bar{s}_r} = O\Bigl(T^{H-(H+1/2)\frac{1}{1/d+m/(m-1)}} + T^{b-\ell(2-r)} + T^{\frac{1}{2}(b-\ell(2-r))}\Bigr) = O\Bigl(T^{\epsilon-1/2} + T^{b-\ell(2-r)} + T^{\frac{1}{2}(b-\ell(2-r))}\Bigr),$$
for some arbitrarily small $\epsilon > 0$. From the earlier conditions, $\frac{1/2+b}{2-r} < \ell \implies b - \ell(2-r) < -1/2$, which implies that the numerator converges to 0, and that it converges at the rate of $O\bigl(T^{\frac{1}{2}(b-\ell(2-r))}\bigr)$, since the two other terms have a smaller exponent of $T$. The same expression from the numerator also appears in the denominator, so the latter converges to a non-zero constant, and $\Delta\tau = O\bigl(T^{\frac{1}{2}(b-\ell(2-r))}\bigr)$.
For $\Delta\omega$, we may simplify the upper bound as follows
$$(2Q_T+1)\Bigl\{\bigl[C_1 T^{1/2}\lambda_{\max}^{2-r}s_{r,\max} + C_2 h^{\frac{1}{m}}T^{\frac{1}{m}}\lambda_{\max}^{2-r}s_{r,\max}\bigr]^2 + \bigl[C_3 h^{\frac{1}{m}}T^{\frac{3-m}{2m}}\sqrt{\lambda_{\max}^{2-r}s_{r,\max}} + C_4 h^{\frac{m+1}{3m}}T^{\frac{m+3}{6m}}\sqrt{\lambda_{\max}^{2-r}s_{r,\max}}\bigr]^3 + C_5\eta_T^{-1}h^2\bigl(\sqrt{T}h^2\bigr)^{-\frac{1}{1/d+m/(m-2)}}\Bigr\}$$
$$= O\Bigl(T^Q\Bigl[T^{2(1/2+b-\ell(2-r))} + T^{\epsilon+b-\ell(2-r)} + T^{\epsilon+\frac{3}{2}(-1+b-\ell(2-r))} + T^{\epsilon+\frac{3}{2}(1/3+b-\ell(2-r))} + T^{\epsilon-1/2}\Bigr]\Bigr) = O\Bigl(T^{Q+2(1/2+b-\ell(2-r))} + T^{Q+\epsilon-1/2}\Bigr).$$
Since $\Delta\tau \to 0$,
$$\Delta\tau\bigl[1 + \Delta\tau + \Delta\tau\Delta\omega\bigr] = O\bigl(\Delta\tau + [\Delta\tau]^2\Delta\omega\bigr) = O\Bigl(T^{\frac{1}{2}(b-\ell(2-r))} + T^{Q+1+3(b-\ell(2-r))} + T^{Q-1/2+(b-\ell(2-r))}\Bigr).$$
When $Q < \min\bigl\{-1 - \frac{10}{3}(b-\ell(2-r)),\ \frac{1}{2} - \frac{1}{2}(b-\ell(2-r))\bigr\}$, the first term dominates the others, and $R_{N,T}^\Omega = O\bigl(T^{\frac{1}{2}(b-\ell(2-r))}\bigr)$. Note that since $b - \ell(2-r) < -1/2$, this bound on $Q$ is satisfied when $Q < 2/3$. Following the proof of Lemma 2.B.15,
$$R_{N,T}^\beta := \max_{1\le p\le P}\left|r_{N,p}\left(\frac{\hat{\Theta}X'u}{\sqrt{T}} + \Delta - \frac{\Upsilon^{-2}V'u}{\sqrt{T}}\right)\right| \le U_{\Delta,T} + U_{V,T}, \tag{2.C.4}$$
where
$$U_{\Delta,T} = \frac{\sqrt{T}\bar{\lambda}\lambda^{1-r}s_r}{C_1 - \eta_T - C_2\bar{\lambda}^{2-r}\bar{s}_r},$$
and
$$U_{V,T} = \frac{h^{1/m}\eta_T^{-1}\frac{h}{\delta_T} + C_1 h^{1/m}\eta_T^{-1}\sqrt{T}\lambda_{\max}^{2-r}\bar{s}_r + C_2 h^{1/m}\eta_T^{-1}\sqrt{\bar{\lambda}^2\underline{\lambda}^{-r}\bar{s}_r}}{C_3 - C_4\frac{h}{\delta_T} + C_1\bar{\lambda}^{2-r}\bar{s}_r + C_2\sqrt{\bar{\lambda}^2\underline{\lambda}^{-r}\bar{s}_r}},$$
with $\delta_T = C\eta_{T,1}(\sqrt{T}h)^{\frac{1}{1/d+m/(m-1)}}$. For $U_{\Delta,T}$, the numerator is of order $O\bigl(T^{1/2+b-\ell(2-r)}\bigr)$,
and the denominator of order $O\bigl(1 + T^{b-\ell(2-r)}\bigr) = O(1)$, so $U_{\Delta,T} = O\bigl(T^{1/2+b-\ell(2-r)}\bigr)$. For $U_{V,T}$, note that each term in the numerator is multiplied by $h^{1/m}\eta_T^{-1}$, which we can take to be $O(T^\epsilon)$ for an arbitrarily small $\epsilon > 0$. The remainder of the numerator is then
$$\frac{h}{\delta_T} + C_1\sqrt{T}\lambda_{\max}^{2-r}\bar{s}_r + C_2\sqrt{\bar{\lambda}^2\underline{\lambda}^{-r}\bar{s}_r} = O\Bigl(T^{H-(H+1/2)\frac{1}{1/d+m/(m-1)}} + T^{1/2+b-\ell(2-r)} + T^{\frac{1}{2}(b-\ell(2-r))}\Bigr) = O\Bigl(T^{\epsilon-1/2} + T^{1/2+b-\ell(2-r)} + T^{\frac{1}{2}(b-\ell(2-r))}\Bigr) = O\Bigl(T^{\epsilon-1/2} + T^{1/2+b-\ell(2-r)}\Bigr).$$
Since the denominator contains the same expression as $\Delta\tau$, it converges to a non-zero constant, and $U_{V,T} = O\bigl(T^\epsilon\bigl[T^{-1/2} + T^{\frac{1}{2}(b-\ell(2-r))}\bigr]\bigr)$. Combining these terms,
$$R_{N,T}^\beta = O\Bigl(T^{1/2+b-\ell(2-r)} + T^\epsilon\bigl[T^{-1/2} + T^{\frac{1}{2}(b-\ell(2-r))}\bigr]\Bigr) = O\bigl(T^{\epsilon-1/2} + T^{1/2+b-\ell(2-r)}\bigr).$$
Finally, as mentioned at the start of the proof, these results hold on a set whose probability converges to 1. We therefore replace $O(\cdot)$ with $O_p(\cdot)$ and the proof is complete.
■

2.C.3 Illustration of conditions for Corollary 2.1
Example 2.C.1. The requirements of Corollary 2.1 are satisfied when $N \sim T^a$ for $a > 0$, $s_r \sim T^b$ for $b > 0$, and $\lambda \sim T^{-\ell}$ for
$$0 < r < 1:\quad \frac{b}{1-r} < \ell < \frac{1}{r\left(\frac{1}{d}+\frac{m}{m-1}\right)}\left[\frac{1}{2} - b\left(\frac{1}{d}+\frac{m}{m-1}\right) - 2a\left(\frac{1}{d}+\frac{1}{m-1}\right)\right],$$
$$r = 0:\quad \frac{b}{1-r} < \ell < \frac{1}{2} - \frac{a}{m}.$$
This choice of $\ell$ is feasible when
$$\frac{2b}{1-r}\left(\frac{1}{d}+\frac{m}{m-1}\right) + 4a\left(\frac{1}{d}+\frac{1}{m-1}\right) < 1. \tag{2.C.5}$$
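The feasibility condition (2.C.5) is easy to probe numerically. A minimal sketch (the helper names and grid values are ours, not from the text; this mirrors the computation behind Figure 2.2):

```python
# Feasibility check for eq. (2.C.5):
# (2b/(1-r)) * (1/d + m/(m-1)) + 4a * (1/d + 1/(m-1)) < 1.
def feasible(a, b, r, d, m):
    return 2 * b / (1 - r) * (1 / d + m / (m - 1)) \
        + 4 * a * (1 / d + 1 / (m - 1)) < 1

def min_moments(a, b, r, d, m_max=100):
    """Smallest integer m <= m_max satisfying (2.C.5), or None if infeasible.

    The left-hand side decreases in m, so the first feasible m is minimal.
    """
    for m in range(2, m_max + 1):
        if feasible(a, b, r, d, m):
            return m
    return None
```

For instance, mild growth rates (small $a$ and $b$) are feasible with few moments, while $b \ge (1-r)/2$ makes the first term exceed 1 for every $m$, matching the non-shaded (infeasible) regions of Figure 2.2.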
Figure 2.2 demonstrates which values of a, b, m, d, and r are feasible, as well as
how many moments m are required for different combinations of the other parameters.
Figure 2.2: Required moments m implied by eq. (2.C.5). Contours mark intervals
of 10 moments, and values above m = 100 are truncated to 100. Non-shaded areas
indicate infeasible regions.
2.C.4 Properties of induced p-norms for 0 ≤ p < 1

Lemma 2.C.1 (0 < p < 1). For matrices $A, B \in \mathbb{R}^{n\times m}$ with column vectors $a_j$ and $b_j$ and $0 < p < 1$, define the induced pseudo-norm $\|A\|_p = \max_{x\ne0}\frac{\|Ax\|_p}{\|x\|_p} = \max_{\|x\|_p=1}\|Ax\|_p$, where for a vector $x$ the pseudo-norm $\|x\|_p = \bigl(\sum_j|x_j|^p\bigr)^{1/p}$.
(1) $\|c\times A\|_p = |c|\,\|A\|_p$
(2) $\|A\|_p = \max_j\|a_j\|_p$
(3) $\|AB\|_p \le \|A\|_p\|B\|_p$
(4) $\|A+B\|_p^p \le \|A\|_p^p + \|B\|_p^p$
(5) $m^{1/2-1/p}\|A\|_2 \le \|A\|_p \le n^{1/p-1/2}\|A\|_2$
Proof. We first show that the p-norm satisfies absolute homogeneity, i.e. for a scalar $c$,
$$\|xc\|_p = \left(\sum_j|x_jc|^p\right)^{1/p} = \left(|c|^p\sum_j|x_j|^p\right)^{1/p} = |c|\left(\sum_j|x_j|^p\right)^{1/p} = |c|\,\|x\|_p.$$
Property (1) then follows:
$$\|c\times A\|_p = \max_{x\ne0}\frac{\|Axc\|_p}{\|x\|_p} = \max_{x\ne0}\frac{|c|\,\|Ax\|_p}{\|x\|_p} = |c|\,\|A\|_p.$$
By absolute homogeneity, the alternative definition of $\|\cdot\|_p$ follows from
$$\max_{x\ne0}\frac{\|Ax\|_p}{\|x\|_p} = \max_{x\ne0}\left\|\frac{Ax}{\|x\|_p}\right\|_p = \max_{x\ne0}\left\|A\frac{x}{\|x\|_p}\right\|_p = \max_{\|y\|_p=1}\|Ay\|_p.$$
Property (2) follows from the following arguments:
$$\|A\|_p^p = \left(\max_{\|x\|_p=1}\|Ax\|_p\right)^p = \max_{\|x\|_p=1}\|Ax\|_p^p = \max_{\|x\|_p=1}\Bigl\|\sum_j a_jx_j\Bigr\|_p^p \le \max_{\|x\|_p=1}\sum_j\|a_jx_j\|_p^p = \max_{\|x\|_p=1}\sum_j|x_j|^p\|a_j\|_p^p.$$
Note that the condition $\|x\|_p = 1 \iff \|x\|_p^p = \sum_j|x_j|^p = 1$. We can therefore rewrite as
$$\max_{\|x\|_p=1}\sum_j|x_j|^p\|a_j\|_p^p = \max_{y_j\ge0,\ \sum_j y_j=1}\sum_j y_j\|a_j\|_p^p.$$
This maximum is then straightforward to evaluate: check which $\|a_j\|_p^p$ is the largest, and set its corresponding $y_j$ to 1. This gives us an upper bound on the induced norm:
$$\|A\|_p^p \le \max_j\|a_j\|_p^p \iff \|A\|_p \le \max_j\|a_j\|_p.$$
The inequality can also be shown in the other direction: for any $j$, we may write $\|a_j\|_p = \|Ae_j\|_p$, where $e_j$ is the $j$th basis vector. $e_j$ is a vector which satisfies $\|e_j\|_p = 1$, so we may upper bound $\|a_j\|_p = \|Ae_j\|_p \le \max_{\|x\|_p=1}\|Ax\|_p = \|A\|_p$. Note that the inequality $\|a_j\|_p \le \|A\|_p$ holds for all $j$, including the $j$ which maximizes $\|a_j\|_p$. Therefore, we have $\max_j\|a_j\|_p \le \|A\|_p$, and property (2) holds by the sandwich theorem.
Property (3) follows from
$$\|A\|_p = \max_{x\ne0}\frac{\|Ax\|_p}{\|x\|_p} \ge \frac{\|Ax\|_p}{\|x\|_p} \implies \|Ax\|_p \le \|A\|_p\|x\|_p,\ \forall x\ne0,$$
$$\|AB\|_p = \max_{\|x\|_p=1}\|ABx\|_p \le \max_{\|x\|_p=1}\|A\|_p\|Bx\|_p \le \max_{\|x\|_p=1}\|A\|_p\|B\|_p\|x\|_p = \|A\|_p\|B\|_p.$$
Property (4) follows from the following arguments. $\|\cdot\|_p^p$ satisfies the triangle inequality:
$$\|x+y\|_p^p = \sum_j|x_j+y_j|^p \le \sum_j|x_j|^p + \sum_j|y_j|^p = \|x\|_p^p + \|y\|_p^p.$$
$$\|A+B\|_p^p = \left(\max_{x\ne0}\frac{\|(A+B)x\|_p}{\|x\|_p}\right)^p = \max_{x\ne0}\frac{\|(A+B)x\|_p^p}{\|x\|_p^p} \le \max_{x\ne0}\frac{\|Ax\|_p^p + \|Bx\|_p^p}{\|x\|_p^p} \le \max_{x\ne0}\frac{\|Ax\|_p^p}{\|x\|_p^p} + \max_{x\ne0}\frac{\|Bx\|_p^p}{\|x\|_p^p} = \|A\|_p^p + \|B\|_p^p.$$
For property (5), by the $C_r$-inequality we have for $x \in \mathbb{R}^n$
$$\|x\|_p = \left[\left(\sum_{i=1}^n(x_i^2)^{p/2}\right)^{2/p}\right]^{1/2} \le \left[n^{2/p-1}\sum_{i=1}^n x_i^2\right]^{1/2} = n^{1/p-1/2}\|x\|_2,$$
and also $\|x\|_2 \le \|x\|_p$. Consequently
$$\|A\|_p = \max_{x\ne0}\frac{\|Ax\|_p}{\|x\|_p} \le \max_{x\ne0}\frac{n^{1/p-1/2}\|Ax\|_2}{\|x\|_p} \le n^{1/p-1/2}\max_{x\ne0}\frac{\|Ax\|_2}{\|x\|_2} = n^{1/p-1/2}\|A\|_2.$$
Similarly,
$$\|A\|_2 = \max_{x\ne0}\frac{\|Ax\|_2}{\|x\|_2} \le \max_{x\ne0}\frac{\|Ax\|_p}{\|x\|_2} \le \max_{x\ne0}\frac{\|Ax\|_p}{m^{1/2-1/p}\|x\|_p} = m^{1/p-1/2}\|A\|_p.$$
■
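These properties are straightforward to sanity-check numerically. A small sketch for p = 1/2 (the random matrices, tolerance, and helper names `vec_p`/`mat_p` are ours, for illustration only):

```python
import numpy as np

def vec_p(x, p):
    # pseudo-norm ||x||_p = (sum_j |x_j|^p)^(1/p)
    return np.sum(np.abs(x) ** p) ** (1 / p)

def mat_p(A, p):
    # induced pseudo-norm via property (2): the maximum column pseudo-norm
    return max(vec_p(A[:, j], p) for j in range(A.shape[1]))

rng = np.random.default_rng(0)
p = 0.5
A = rng.normal(size=(4, 3))
B = rng.normal(size=(3, 5))

# property (3): submultiplicativity
assert mat_p(A @ B, p) <= mat_p(A, p) * mat_p(B, p) + 1e-12
# property (5): equivalence with the spectral norm (A is n x m with n=4, m=3)
n, m = A.shape
s2 = np.linalg.norm(A, 2)
assert m ** (0.5 - 1 / p) * s2 <= mat_p(A, p) + 1e-12
assert mat_p(A, p) <= n ** (1 / p - 0.5) * s2 + 1e-12
```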
Lemma 2.C.2 (p = 0). For matrices $A$ and $B$ with column vectors $a_j$ and $b_j$, define the induced pseudo-norm $\|A\|_0 = \max_{x\ne0}\frac{\|Ax\|_0}{\|x\|_0}$, where for a vector $x$ the pseudo-norm $\|x\|_0 = \sum_j 1(|x_j| > 0)$.
(1) $\|c\times A\|_0 = \|A\|_0$, for $c \ne 0$
(2) $\|A\|_0 = \max_j\|a_j\|_0$
(3) $\|AB\|_0 \le \|A\|_0\|B\|_0$
(4) $\|A+B\|_0 \le \|A\|_0 + \|B\|_0$
Proof. For property (1), note that $\|x\|_0 = \|xc\|_0$ for any scalar $c \ne 0$:
$$\|c\times A\|_0 = \max_{x\ne0}\frac{\|Axc\|_0}{\|x\|_0} = \max_{x\ne0}\frac{\|Ax\|_0}{\|x\|_0} = \|A\|_0.$$
For property (2), let $S(x)$ be the index set $\{j : |x_j| > 0\}$ with cardinality $|S(x)|$; note that $\|x\|_0 = |S(x)|$. Furthermore, note that the 0-norm satisfies the triangle inequality:
$$\|x+y\|_0 = \sum_j 1(|x_j+y_j| > 0) \le \sum_j\bigl[1(|x_j| > 0) + 1(|y_j| > 0)\bigr] = \|x\|_0 + \|y\|_0.$$
Then
$$\|A\|_0 = \max_{x\ne0}\frac{\|Ax\|_0}{\|x\|_0} = \max_{x\ne0}\frac{\bigl\|\sum_j a_jx_j\bigr\|_0}{|S(x)|} \le \max_{x\ne0}\frac{\sum_{j\in S(x)}\|a_jx_j\|_0}{|S(x)|} = \max_{x\ne0}\frac{\sum_{j\in S(x)}\|a_j\|_0}{|S(x)|} \le \max_{x\ne0}\frac{|S(x)|\max_j\|a_j\|_0}{|S(x)|} = \max_j\|a_j\|_0.$$
This inequality can also be shown in the other direction: for any $j$, we may write $\|a_j\|_0 = \|Ae_j\|_0 = \frac{\|Ae_j\|_0}{\|e_j\|_0}$, where $e_j$ is the $j$th basis vector, noting that $\|e_j\|_0 = 1$. $e_j$ is a vector which satisfies $e_j \ne 0$, so we may upper bound $\|a_j\|_0 = \frac{\|Ae_j\|_0}{\|e_j\|_0} \le \max_{x\ne0}\frac{\|Ax\|_0}{\|x\|_0} = \|A\|_0$. Note that the inequality $\|a_j\|_0 \le \|A\|_0$ holds for all $j$, including the $j$ which maximizes $\|a_j\|_0$. Therefore, we have $\max_j\|a_j\|_0 \le \|A\|_0$, and property (2) holds by the sandwich theorem.
Property (3) follows from
$$\|A\|_0 = \max_{x\ne0}\frac{\|Ax\|_0}{\|x\|_0} \ge \frac{\|Ax\|_0}{\|x\|_0} \implies \|Ax\|_0 \le \|A\|_0\|x\|_0,\ \forall x\ne0,$$
$$\|AB\|_0 = \max_{x\ne0}\frac{\|ABx\|_0}{\|x\|_0} \le \max_{x\ne0}\frac{\|A\|_0\|Bx\|_0}{\|x\|_0} \le \max_{x\ne0}\frac{\|A\|_0\|B\|_0\|x\|_0}{\|x\|_0} = \|A\|_0\|B\|_0.$$
Property (4) follows from the triangle inequality of the 0-norm:
$$\|A+B\|_0 = \max_{x\ne0}\frac{\|(A+B)x\|_0}{\|x\|_0} \le \max_{x\ne0}\frac{\|Ax\|_0 + \|Bx\|_0}{\|x\|_0} \le \max_{x\ne0}\frac{\|Ax\|_0}{\|x\|_0} + \max_{x\ne0}\frac{\|Bx\|_0}{\|x\|_0} = \|A\|_0 + \|B\|_0.$$
■
2.C.5 Additional notes on Examples 2.5 and 2.6
Using the properties of p-norms for 0 ≤ p < 1 described in section 2.C.4, we provide further
details on Examples 2.5 and 2.6.
2.C.5.1 Example 2.5: Sparse factor model

Recall the factor model
$$y_t = \beta_0'x_t + u_t,\quad u_t \sim IID(0,1),$$
$$x_t = \underset{N\times k}{\Lambda}\,\underset{k\times1}{f_t} + \nu_t,\quad \nu_t \sim IID(0,\Sigma_\nu),\quad f_t \sim IID(0,\Sigma_f),$$
where $\Lambda$ has bounded elements, $\Sigma_f$ and $\Sigma_\nu$ are positive definite with bounded eigenvalues, and $\nu_t$ and $f_t$ are uncorrelated. We make the following assumptions on the factor loadings:
$$C_1N^a \le \lambda_{\min}(\Lambda'\Lambda) \le \lambda_{\max}(\Lambda'\Lambda) \le C_2N^b,\quad 0 < a \le b \le 1. \tag{2.C.6}$$
These assumptions imply that the k largest eigenvalues of Σ = ΛΣf Λ′ + Σν diverge at rates
between N a and N b , while the remaining N − k + 1 eigenvalues do not diverge. This holds
as we can bound the largest eigenvalue λmax (Σ) from above by
λmax (Σ) ≤ λmax (ΛΣf Λ′ ) + λmax (Σν ) ≤ λmax (Σf )λmax (Λ′ Λ) + λmax (Σν ) ≤ C1 N b + C2 .
Similarly, we can bound the $k$-th largest eigenvalue $\lambda_k(\Sigma)$ from below using Weyl's inequality and the min-max theorem:
$$\lambda_k(\Sigma) \ge \lambda_k(\Lambda\Sigma_f\Lambda') + \lambda_{\min}(\Sigma_\nu) = \max_{\dim(\mathcal{U})=N-k+1}\ \min_{x\in\mathcal{U}\setminus\{0\}}\frac{x'\Lambda\Sigma_f\Lambda'x}{x'x} + \lambda_{\min}(\Sigma_\nu)$$
$$\ge \lambda_{\min}(\Sigma_f)\max_{\dim(\mathcal{U})=N-k+1}\ \min_{x\in\mathcal{U}\setminus\{0\}}\frac{x'\Lambda\Lambda'x}{x'x} + \lambda_{\min}(\Sigma_\nu) = \lambda_{\min}(\Sigma_f)\lambda_k(\Lambda\Lambda') + \lambda_{\min}(\Sigma_\nu) = \lambda_{\min}(\Sigma_f)\lambda_{\min}(\Lambda'\Lambda) + \lambda_{\min}(\Sigma_\nu) \ge C_1N^a + C_2,$$
where we used that $\lambda_k(\Lambda\Lambda') = \lambda_k(\Lambda'\Lambda) = \lambda_{\min}(\Lambda'\Lambda)$.
Therefore, this assumption generates a weak factor model if b < 1, while if b = 1 but
a < 1 some factors, but not all, are weak; see e.g. Uematsu and Yamagata (2022a,b) and
the references therein.11 If a = b = 1 we have the standard strong factor model with dense
loadings.
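A quick numerical illustration of this eigenvalue structure (our own toy example with disjoint sparse loadings and $\Sigma_f = \Sigma_\nu = I$, not taken from the text):

```python
import numpy as np

# Two sparse factors with disjoint supports: Lambda'Lambda = diag(20, 200),
# so the two largest eigenvalues of Sigma = Lambda Lambda' + I are
# 200 + 1 and 20 + 1, while the remaining N - 2 eigenvalues stay at 1.
N, k = 400, 2
Lambda = np.zeros((N, k))
Lambda[:20, 0] = 1.0        # weak factor, loading on 20 series
Lambda[20:220, 1] = 1.0     # stronger factor, loading on 200 series
Sigma = Lambda @ Lambda.T + np.eye(N)
eig = np.sort(np.linalg.eigvalsh(Sigma))[::-1]
assert abs(eig[0] - 201.0) < 1e-8
assert abs(eig[1] - 21.0) < 1e-8
assert abs(eig[2] - 1.0) < 1e-8
```

The number of non-zero loadings per column plays the role of $N^{a_j}$, so thinning the supports directly weakens the corresponding eigenvalues.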
Sparse factor loadings satisfy these assumptions. In particular, from Lemma 2.C.1(5)
we find that λmax (Λ′ Λ) = ∥Λ∥22 ≤ k2/r−1 ∥Λ∥2r ; thus, with a fixed number k of factors,
the sparsity of Λ provides an upper bound for the strength of divergence of the largest
eigenvalues.12 Sparse factor models may provide accurate descriptions of various economic
and financial datasets. For example, Uematsu and Yamagata (2022b) find strong evidence
of sparse factor loadings in the FRED-MD macroeconomic dataset (McCracken and Ng,
2016), as well as of firm-level excess returns of the S&P500 beyond the market return factor.
Freyaldenhoven (2021) uses sparsity in the loadings to identify the factors, motivating the
sparsity empirically through the presence of “local” factors in economic and financial data.
11 Our setup corresponds to the framework with factors of varying strength as proposed by Uematsu
and Yamagata (2022a,b) by setting λj (Λ′ Λ) ∼ N aj where b = a1 ≥ . . . ≥ ak = a.
12 This bound only holds for r > 0. Uematsu and Yamagata (2022a) consider the case r = 0.
Further empirical evidence for sparse factor models is reviewed in Uematsu and Yamagata
(2022a).
We now derive the sparsity bound of Example 2.5. We bound $\|\gamma_j^0\|_r^r$ based on the fact that $\Theta = \Upsilon^{-2}\Gamma$, where $\Upsilon^{-2} = \mathrm{diag}(1/\tau_1^2, \dots, 1/\tau_N^2)$, and
$$\Gamma := \begin{pmatrix} 1 & -\gamma_{1,2} & \dots & -\gamma_{1,N} \\ -\gamma_{2,1} & 1 & \dots & -\gamma_{2,N} \\ \vdots & \vdots & \ddots & \vdots \\ -\gamma_{N,1} & -\gamma_{N,2} & \dots & 1 \end{pmatrix}.$$
This result follows from the definition of $\gamma_j^0$ as linear projection coefficients, and the block matrix inverse identity for $\Theta$. Then
$$\max_j\|\gamma_j^0\|_r^r \le 1 + \max_j\|\gamma_j^0\|_r^r = \max_j\|(1, -\gamma_j^{0\prime})'\|_r^r = \|\Gamma\|_r^r = \|(\Upsilon^{-2})^{-1}\Theta\|_r^r \le \|(\Upsilon^{-2})^{-1}\|_r^r\|\Theta\|_r^r \le \max_j\tau_j^{2r}\|\Theta\|_r^r \le C\|\Theta\|_r^r,$$
where $\max_j\tau_j^{2r} \le C$ follows from eq. (2.B.1). Note that when $r = 0$, these steps follow similarly, noting that $\|(\Upsilon^{-2})^{-1}\|_0 = 1$, and therefore $C = 1$.
By the Woodbury matrix identity,
$$\Theta = \Sigma^{-1} = \Sigma_\nu^{-1} - \Sigma_\nu^{-1}\Lambda/N^a\left(\Sigma_f^{-1}/N^a + \Lambda'\Sigma_\nu^{-1}\Lambda/N^a\right)^{-1}\Lambda'\Sigma_\nu^{-1}.$$
Then
$$\|\Theta\|_r^r \le \|\Sigma_\nu^{-1}\|_r^r + \|\Sigma_\nu^{-1}\|_r^r\,\|\Lambda/N^a\|_r^r\,\left\|\left(\Sigma_f^{-1}/N^a + \Lambda'\Sigma_\nu^{-1}\Lambda/N^a\right)^{-1}\right\|_r^r\,\|\Lambda'\|_r^r\,\|\Sigma_\nu^{-1}\|_r^r.$$
As for positive semidefinite symmetric matrices $A$ and $B$ we have that
$$\left\|(A+B)^{-1}\right\|_2 \le \frac{1}{\lambda_{\min}(A+B)} \le \frac{1}{\lambda_{\min}(A)+\lambda_{\min}(B)} \le \frac{1}{\lambda_{\min}(B)},$$
it follows that
$$\left\|\left(\Sigma_f^{-1}/N^a + \Lambda'\Sigma_\nu^{-1}\Lambda/N^a\right)^{-1}\right\|_2 \le \frac{1}{\lambda_{\min}\bigl(\Lambda'\Sigma_\nu^{-1}\Lambda/N^a\bigr)} \le \frac{1}{\lambda_{\min}(\Sigma_\nu^{-1})\lambda_{\min}(\Lambda'\Lambda/N^a)}.$$
As $\lambda_{\min}(\Sigma_\nu^{-1}) = 1/\lambda_{\max}(\Sigma_\nu) \ge 1/C$, it follows from our assumptions that $\lambda_{\min}(\Lambda'\Lambda/N^a) \ge C$ and therefore $\left\|\left(\Sigma_f^{-1}/N^a + \Lambda'\Sigma_\nu^{-1}\Lambda/N^a\right)^{-1}\right\|_2 \le C$. It then also follows from Lemma 2.C.1(5) that $\left\|\left(\Sigma_f^{-1}/N^a + \Lambda'\Sigma_\nu^{-1}\Lambda/N^a\right)^{-1}\right\|_r^r \le Ck^{1-r/2}$ and
$$\|\Theta\|_r^r \le \|\Sigma_\nu^{-1}\|_r^r + Ck^{1-r/2}\|\Sigma_\nu^{-1}\|_r^r\,\|\Lambda/N^a\|_r^r\,\|\Lambda'\|_r^r\,\|\Sigma_\nu^{-1}\|_r^r. \tag{2.C.7}$$
With $\|\Lambda'\|_r^r \le Ck$, we then find the bound
$$\|\Theta\|_r^r \le \|\Sigma_\nu^{-1}\|_r^r + Ck^{2-r/2}N^{-ra}\|\Sigma_\nu^{-1}\|_r^{2r}\|\Lambda\|_r^r.$$
We provide two examples of $\Sigma_\nu$ such that $\Sigma_\nu^{-1}$ is sparse. For block diagonal structures, this follows trivially, since the inverse maintains the same block diagonal structure. For a Toeplitz structure $\Sigma_{\nu,i,j} = \rho^{|i-j|}$, by Section 8.8.4 of Gentle (2007),
$$\Sigma_\nu^{-1} = \frac{1}{1-\rho^2}\begin{pmatrix} 1 & -\rho & 0 & \dots & 0 \\ -\rho & 1+\rho^2 & -\rho & \dots & 0 \\ 0 & -\rho & 1+\rho^2 & \ddots & \vdots \\ \vdots & \vdots & \ddots & \ddots & -\rho \\ 0 & 0 & \dots & -\rho & 1 \end{pmatrix},$$
and we can bound
$$\|\Sigma_\nu^{-1}\|_r^r = \max_j\|\Sigma_{\nu,\cdot,j}^{-1}\|_r^r = \|\Sigma_{\nu,\cdot,\lceil N/2\rceil}^{-1}\|_r^r = \frac{(1+\rho^2)^r + 2|\rho|^r}{|1-\rho^2|^r} \le C,$$
or simply $\max_j\|\Sigma_{\nu,\cdot,j}^{-1}\|_0 = 3$ for $r = 0$.
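The tridiagonal inverse above is easy to verify numerically (a sketch; the function names are ours):

```python
import numpy as np

def toeplitz_ar1(N, rho):
    # Sigma_nu with entries rho^|i-j|
    idx = np.arange(N)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def tridiagonal_inverse(N, rho):
    # the claimed inverse: 1/(1-rho^2) times the tridiagonal matrix above
    M = np.zeros((N, N))
    np.fill_diagonal(M, 1 + rho ** 2)
    M[0, 0] = M[-1, -1] = 1.0
    off = np.arange(N - 1)
    M[off, off + 1] = -rho
    M[off + 1, off] = -rho
    return M / (1 - rho ** 2)

N, rho = 6, 0.5
assert np.allclose(np.linalg.inv(toeplitz_ar1(N, rho)),
                   tridiagonal_inverse(N, rho))
```

Each column has at most three non-zero entries, which is exactly what delivers the $\max_j\|\Sigma_{\nu,\cdot,j}^{-1}\|_0 = 3$ bound.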
Note that a (potentially weak) factor model without sparse loadings does not yield a sufficiently sparse matrix $\Theta$ for all values of $r$. In eq. (2.C.7) we may try to bound $\|\Lambda\|_r^r$ directly, using Lemma 2.C.1(5) to bound $\|\Lambda/N^a\|_r^r \le N^{1+(b-2a-1)r/2}\bigl[\lambda_{\max}(\Lambda'\Lambda/N^b)\bigr]^{r/2}$, such that $\|\Theta\|_r^r \le \|\Sigma_\nu^{-1}\|_r^r\bigl[1 + Ck^{2-r/2}N^{1+(b-2a-1)r/2}\bigr]$. This is not a tight enough bound to guarantee sparsity of $\Theta$. To illustrate, for the standard dense factor model with $a = b = 1$ and $k$ fixed, we get $\|\Theta\|_r^r \le CN^{1-r}$. Weaker divergence of the eigenvalues even increases the power of $N$.
2.C.5.2 Example 2.6: Sparse VAR(1)

Recall the sparse VAR(1) model
$$z_t = \Phi z_{t-1} + u_t,\quad \mathbb{E}u_tu_t' := \Omega,\quad \mathbb{E}u_tu_{t-l}' = 0,\ \forall l \ne 0,$$
with our regression of interest being $y_t = \phi_1 z_{t-1} + u_{1,t}$. For Example 2.6(a) with a symmetric block-diagonal coefficient matrix $\Phi$ and the error covariance matrix $\Omega$ being the identity, we can simplify $\Sigma = \sum_{q=0}^\infty \Phi^q\Omega\Phi'^q = \sum_{q=0}^\infty \Phi^{2q} = \bigl(I - \Phi^2\bigr)^{-1}$, where $\Phi^0 = \Phi'^0 = I$, and $\Theta = \Sigma^{-1} = I - \Phi^2$. Note that $I - A$ is invertible iff 1 is not an eigenvalue of $A$. Since the eigenvalues of $\Phi^2$ are between (and not including) 0 and 1, $\Sigma$ exists. $I - \Phi^2$ inherits the block diagonal structure of $\Phi$, so we may bound $\max_j\|\gamma_j^0\|_r^r \le C\|\Theta\|_r^r \le Cb$.
This result can be extended to the case where $\Omega$ has the same block diagonal structure as the VAR coefficient matrix $\Phi$. While the simplified expression for $\Sigma$ provided above no longer holds, both $\Sigma$ and $\Sigma^{-1}$ remain block diagonal when $\Omega$ and $\Phi$ share the same block structure. As a result, the nonzero structure of $\gamma_j^0$ remains unaltered.
Figure 2.3: Example 2.6(b): We display $\ln\max_j\|\gamma_j^0\|_r^r$ for $N$ between 10 and 1000, and $r$ between 0.1 and 0.9.
[Figure: log sparsity in the $L_r$-norm plotted against $N$ (0 to 1000), with curves for values of $r$ including 0.25, 0.50, and 0.75.]
For Example 2.6(b) with a diagonal $\Phi$ and Toeplitz $\Omega$, we can simplify $\Sigma = \sum_{q=0}^\infty \Phi^q\Omega\Phi'^q = \sum_{q=0}^\infty \phi^{2q}\Omega = \frac{1}{1-\phi^2}\Omega$, and by similar arguments to section 2.C.5.1,
$$\Theta = \frac{1-\phi^2}{1-\rho^2}\begin{pmatrix} 1 & -\rho & 0 & \dots & 0 \\ -\rho & 1+\rho^2 & -\rho & \dots & 0 \\ 0 & -\rho & 1+\rho^2 & \ddots & \vdots \\ \vdots & \vdots & \ddots & \ddots & -\rho \\ 0 & 0 & \dots & -\rho & 1 \end{pmatrix}.$$
The precision matrix is clearly sparse in this case, and $\max_j\|\gamma_j^0\|_r^r \le C\|\Theta\|_r^r \le C$.
Finally, we numerically investigated the extension where the VAR coefficient matrix also has a Toeplitz structure, namely $\Phi_{i,j} = 0.4^{1+|i-j|}$. We vary the dimension between $N = 10$ and $N = 1000$ and display the boundedness in $r$-norm of the parameter vector in the nodewise regressions in Figure 2.3 for different values of $r$. We use a log scale since this sparsity grows by orders of magnitude for decreasing $r$.
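The numerical experiment just described can be sketched as follows (illustrative only; `sparsity_r` is our name, and the series for $\Sigma$ is truncated at a numerical tolerance):

```python
import numpy as np

def sparsity_r(N, r, tol=1e-12):
    # Toeplitz VAR coefficients Phi[i,j] = 0.4^(1+|i-j|), Omega = I
    idx = np.arange(N)
    Phi = 0.4 ** (1 + np.abs(idx[:, None] - idx[None, :]))
    # Sigma = sum_q Phi^q Omega Phi'^q, accumulated until terms are negligible
    Sigma = np.eye(N)
    term = np.eye(N)
    while np.abs(term).max() > tol:
        term = Phi @ term @ Phi.T
        Sigma += term
    Theta = np.linalg.inv(Sigma)
    # nodewise coefficients: gamma_j = -Theta[j, -j] / Theta[j, j]
    gamma = -Theta / np.diag(Theta)[:, None]
    np.fill_diagonal(gamma, 0.0)
    # ln max_j ||gamma_j||_r^r, as displayed in Figure 2.3
    return np.log(np.max(np.sum(np.abs(gamma) ** r, axis=1)))
```

Evaluating this over a grid of $N$ and $r$ reproduces the qualitative pattern of Figure 2.3: the log sparsity grows with $N$ and increases as $r$ decreases.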
2.C.6 Algorithmic details for choosing the lasso tuning parameter

Algorithm 2.1: Plug-in choice of λ
1  At k = 0, initialize $\lambda^{(0)} \leftarrow \|X'y\|_\infty/T$ and $\hat{u}^{(0)} \leftarrow y - \frac{1}{T}\sum_{t=1}^T y_t$;
2  while $1 \le k \le K$ do
3    Obtain the estimated long-run covariance matrix $\hat{\Omega}^{(k)}$ as in eq. (2.9), with $\hat{\Xi}(l) = \frac{1}{T-l}\sum_{t=l+1}^T x_t\hat{u}_t^{(k-1)}\hat{u}_{t-l}^{(k-1)}x_{t-l}'$;
4    while $1 \le b \le B$ do
5      Draw $\hat{g}^{(b)}$ from $N\bigl(0, \hat{\Omega}^{(k)}\bigr)$;
6      $m_b \leftarrow \|\hat{g}^{(b)}\|_\infty$;
7    $\lambda^{(k)} \leftarrow c\frac{1}{\sqrt{T}}q_{(1-\alpha)}$, where $q_{(1-\alpha)}$ is the $(1-\alpha)$-quantile of $m_1, \dots, m_B$;
8    if $|\lambda^{(k)} - \lambda^{(k-1)}|/\lambda^{(k-1)} < \epsilon$ then
9      $\lambda \leftarrow \lambda^{(k)}$;
10     break;
11   Estimate $\hat{\beta}^{(k)}$ with the lasso using $\lambda^{(k)}$ as the tuning parameter;
12   $\hat{u}^{(k)} \leftarrow y - X\hat{\beta}^{(k)}$;
13 $\lambda \leftarrow \lambda^{(K)}$;

We set K = 15, ϵ = 0.01, B = 1000, α = 0.05, and c = 0.8 throughout the simulation study.
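A compact sketch of Algorithm 2.1 follows. Two assumptions are flagged: the lasso step is delegated to a user-supplied `lasso` callable, and the long-run covariance of eq. (2.9) is simplified to its contemporaneous $l = 0$ term, so this is illustrative rather than the full estimator:

```python
import numpy as np

def plugin_lambda(X, y, lasso, K=15, eps=0.01, B=1000, alpha=0.05, c=0.8, seed=0):
    rng = np.random.default_rng(seed)
    T = X.shape[0]
    lam = np.max(np.abs(X.T @ y)) / T        # lambda^(0)
    u = y - y.mean()                         # u_hat^(0)
    for _ in range(K):
        # long-run covariance, simplified here to Xi_hat(0)
        Xu = X * u[:, None]
        Omega = Xu.T @ Xu / T
        g = rng.multivariate_normal(np.zeros(X.shape[1]), Omega, size=B)
        m = np.max(np.abs(g), axis=1)        # m_b = ||g^(b)||_inf
        lam_new = c * np.quantile(m, 1 - alpha) / np.sqrt(T)
        if abs(lam_new - lam) / lam < eps:   # convergence check (line 8)
            return lam_new
        lam = lam_new
        beta = lasso(X, y, lam)              # lasso fit at lambda^(k)
        u = y - X @ beta
    return lam
```

The quantile draw replaces an intractable maximum over $N$ correlated score components with a simulated one, which is what makes the choice adaptive to the dependence in the data.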
2.C.7 Additional simulation details
Figure 2.4: Model A, ρ heat map coverage: Contours mark the coverage thresholds
at 5% intervals, from 75% to the nominal 95%, from dark green to white respectively.
Units on the axes are not proportional to the λ-value but rather its position in the
grid. The value of λ is (10T )−1 at 0, and increases exponentially to a value that
sets all parameters to zero at 50. Plots are based on 100 replications, with colored
dots representing combinations of λ’s selected by PI (purple), AIC (red), BIC (blue),
EBIC (yellow).
[Figure: 4×4 grid of heat maps, one per combination of N ∈ {101, 201, 501, 1001} and T ∈ {100, 200, 500, 1000}, plotting coverage (0.00 to 1.00) over the initial lambda (vertical axis) and nodewise lambda (horizontal axis) grids.]
Figure 2.5: Model A, β1 heat map coverage: Contours mark the coverage thresholds
at 5% intervals, from 75% to the nominal 95%, from dark green to white respectively.
Units on the axes are not proportional to the λ-value but rather its position in the
grid. The value of λ is (10T )−1 at 0, and increases exponentially to a value that
sets all parameters to zero at 50. Plots are based on 100 replications, with colored
dots representing combinations of λ’s selected by PI (purple), AIC (red), BIC (blue),
EBIC (yellow).
[Figure: 4×4 grid of heat maps, one per combination of N ∈ {101, 201, 501, 1001} and T ∈ {100, 200, 500, 1000}, plotting coverage (0.00 to 1.00) over the initial lambda (vertical axis) and nodewise lambda (horizontal axis) grids.]
Figure 2.6: Model B, ρ heat map coverage: Contours mark the coverage thresholds
at 5% intervals, from 75% to the nominal 95%, from dark green to white respectively.
Units on the axes are not proportional to the λ-value but rather its position in the
grid. The value of λ is (10T )−1 at 0, and increases exponentially to a value that
sets all parameters to zero at 50. Plots are based on 100 replications, with colored
dots representing combinations of λ’s selected by PI (purple), AIC (red), BIC (blue),
EBIC (yellow).
[Figure: 4×4 grid of heat maps, one per combination of N ∈ {101, 201, 501, 1001} and T ∈ {100, 200, 500, 1000}, plotting coverage (0.00 to 1.00) over the initial lambda (vertical axis) and nodewise lambda (horizontal axis) grids.]