Citation for the published version (APA): Adámek, R. (2022). Lasso-Based Inference for High-Dimensional Time Series. [Doctoral Thesis, Maastricht University]. Maastricht University. https://doi.org/10.26481/dis.20221205ra

Lasso-Based Inference for High-Dimensional Time Series

R.X. Adámek

This research was financially supported by the Netherlands Organization for Scientific Research (NWO) under grant number 452-17-010.

© R.X. Adámek, Maastricht 2022
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form, or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior permission in writing from the author.

This book was typeset by the author using LaTeX.
Published by Universitaire Pers Maastricht
ISBN: 978-94-6469-120-7
Cover: Pavel Baláš, 2022
Printed in The Netherlands by ProefschriftMaken

Lasso-Based Inference for High-Dimensional Time Series

DISSERTATION

to obtain the degree of Doctor at Maastricht University, on the authority of the Rector Magnificus, Prof. dr. Pamela Habibović, in accordance with the decision of the Board of Deans, to be defended in public on Monday the 5th of December 2022, at 16:00 hours

by

Robert Xerxes Adámek

Supervisor
Dr. S.J.M. Smeekes

Co-supervisor
Dr. I. Wilms

Assessment Committee
Prof. dr. A.W. Hecq (chair)
Dr. S. Basu, Cornell University
Prof. dr. J. van den Brakel
Dr. O. Boldea, Tilburg University

To my loving family

Acknowledgements

Sometimes the path to my PhD feels like a long sequence of happy coincidences – that I just happened to stumble into things I was happy to do.
But I think this would sell short many people without whom this thesis would never have been written. I would not have chosen to study econometrics in Maastricht if my parents hadn't travelled with me to several open day events across the Netherlands. After finishing the Bachelor, I may not have continued on to the Master without encouragement from Jean-Pierre, who supervised my Bachelor thesis project and ignited my love for time series econometrics. Without Stephan and his Big Data course, I wouldn't have chosen a rather ambitious Master thesis topic under his supervision, and eventually been offered a PhD position to work on his project. I can't name everyone who helped put me on this path, but I will do my best!

It should go without saying that Stephan and Ines were great supervisors. Anyone who's worked with them could tell you they are highly knowledgeable about their field of research, they are excellent communicators and teachers, and they take their work seriously and have an eye for detail. I want to thank you both on a more personal level – I feel like I've grown quite a lot as a person over the last four years, and you've played a big role in that. I always felt like I could tell you about my problems and you had a lot of great advice for dealing with them. I appreciate all the work you put into going over my writing and giving me useful feedback, especially relating to my job market materials. Our regular chats were a highlight of the week; I always came out of them motivated and with a smile on my face. You probably think these are all things a good supervisor should do, but I certainly don't take them for granted. It wouldn't have been the same without you.

It's no secret that I'm an introvert, and I didn't put much effort into socializing with people at the department. In retrospect, I wish I had spent more time getting to know everyone, especially with COVID making that very difficult in the latter half of my PhD. That being said, Etienne took me under his wing as soon as I started, so work at the office was never lonely. I fondly remember our discussions about cool proofs on the whiteboard, gossiping about students in Mathematical Statistics, talking about music, cooking and videogames instead of working... Caterina, Luca, the conference in Rome was some of the most fun I had during my PhD. Dewi, Eric, Enrico, thank you for taking me along for the Econometric Game. I loved our reading groups about high-dimensional CLTs and SVARs with Lenard. Adam, Daniel, Elisa, Francesco, Marie, it was great to talk with you at various NESGs, seminars and workshops. I also want to thank many people who taught me over the years: Alain, Christian, Dries, Hanno, Rasmus, Sean, to name a few – you are part of why I want to continue working in academia and being a teacher myself.

Finally, I want to thank everyone in my life outside of the university, for a needed distraction from work, for listening to my rambling about work, and for telling me to stop talking about my work. My family has always supported me during my studies and PhD, being at my side for every big decision in my life. Getting to spend more time with you was a huge upside of working from home – I hope I will never be too far from you. Charlotte, Kubo, Sofie, thank you for always being there for me when I need you. Conor, Daniel, Demane, Štěpáne, Tom, thanks for all the evenings of chatting, gaming and laughing.
Robert Adámek
Aarhus, October 2022

Contents

Acknowledgements
Contents
1 Introduction
  1.1 Inference
  1.2 High-dimensionality
  1.3 The lasso
  1.4 The desparsified lasso
  1.5 Time series
  1.6 Chapter overview
2 Desparsified Lasso in Time Series
  2.1 Introduction
  2.2 The High-Dimensional Linear Model
  2.3 Error Bound and Consistency for the Lasso
  2.4 Uniformly Valid Inference via the Desparsified Lasso
    2.4.1 Assumptions
    2.4.2 Inference on low-dimensional parameters
    2.4.3 Inference on high-dimensional parameters
  2.5 Analysis of Finite-Sample Performance
    2.5.1 Tuning parameter selection
    2.5.2 Autoregressive model with exogenous variables
    2.5.3 Factor model
    2.5.4 Weakly sparse VAR(1)
  2.6 Conclusion
  2.A Proofs for Section 2.3
    2.A.1 Definitions
    2.A.2 Preliminary results
    2.A.3 Proofs of the main results
  2.B Proofs for Section 2.4
    2.B.1 Preliminary results
    2.B.2 Proofs of main results
  2.C Supplementary Results
    2.C.1 Proofs of preliminary results Section 2.3
    2.C.2 Proofs of preliminary results Section 2.4
    2.C.3 Illustration of conditions for Corollary 2.1
    2.C.4 Properties of induced p-norms for 0 ≤ p < 1
    2.C.5 Additional notes on Examples 2.5 and 2.6
    2.C.6 Algorithmic details for choosing the lasso tuning parameter
    2.C.7 Additional simulation details
3 Local Projection Inference in High Dimensions
  3.1 Introduction
  3.2 High-dimensional Local Projections
    3.2.1 Local Projection Estimation
    3.2.2 Local Projection Inference
  3.3 Simulations
  3.4 Structural Impulse Responses Estimated by HDLPs
    3.4.1 Impulse Responses to a Shock in Monetary Policy
    3.4.2 Impulse Responses to a Shock in Government Spending
  3.5 Conclusion
  3.A Assumptions
  3.B Proofs
  3.C Simulations: Extra Figures
  3.D Data used in Section 3.4.1
  3.E FAVAR Implementation
4 Sparse High-Dimensional Vector Autoregressive Bootstrap
  4.1 Introduction
  4.2 Vector Autoregressive Bootstrap
    4.2.1 Bootstrap for High-Dimensional VARs
    4.2.2 Bootstrap Inference on (Approximate) Means
  4.3 HDCLT for linear processes
  4.4 Application to VAR models
  4.5 Bootstrap consistency
  4.6 Bootstrap Consistency for VAR Estimation by the lasso
  4.7 Conclusion
  4.A Preliminary Lemmas
  4.B Proofs
5 Conclusion
Bibliography
Impact
Curriculum Vitae

Chapter 1
Introduction

The common themes of my work can be succinctly summarized by the 4 components of my thesis title: the lasso, statistical inference, high-dimensionality, and time series. In this chapter, I will introduce each of these concepts, and motivate why combining them presents an important and interesting challenge – one that I hope to help overcome.

Most of my work is theoretical, which may carry the connotations of limited practical use. I believe that applying the methods I discuss to real life problems is the ultimate goal of anyone studying them, but we should ideally only use them if we have compelling arguments for why they work. My work helps arm researchers with the arguments and confidence to apply these methods, and informs where their limits may lie. That being said, to motivate the more practically minded reader, my work also includes some empirical applications, simulation studies, and a software package which implements these methods in a user-friendly way.

1.1 Inference

To give a simple example of inference, consider a coin-tossing experiment: I flip a coin 100 times and count 55 Heads. Is the coin fair or biased? To answer this question, I adopt the frequentist (as opposed to Bayesian) philosophy of statistics. The probability of getting Heads is a fixed number between 0 and 1, presumably determined by the coin's physical properties, that describes the frequency at which I would get Heads if I kept flipping it over and over. For this coin experiment, we could get an arbitrarily accurate estimate of this probability with enough patience, or like Diaconis et al. (2007), with a coin-flipping machine. However, in many situations, repeating an experiment is either expensive, or simply impossible – I may only ever get 100 flips. Inference lets me make an informed judgement about the coin's bias by quantifying the uncertainty associated with these 55 Heads. After all, even a fair coin could appear biased due to random chance.
As it turns out, a fair coin produces outcomes at least as extreme as 55 Heads – that is, larger than 55 or smaller than 45 – around 37% of the time. While such a calculation is easy for coin flipping, the problem of quantifying uncertainty becomes considerably more difficult when our data comes from a more complicated distribution, or even a distribution we do not know.

Consider an example where we wish to measure the average height of a Dutch man. Schönbeck et al. (2013) found that in 2009, from a sample of 5,811, the average height at age 21 was 183.8cm. How representative is this of the male Dutch population as a whole? In such cases, we can often appeal to asymptotic approximations. By the Central Limit Theorem, we know that means over larger and larger samples tend to resemble a Normal distribution more and more closely. If we treat this Normal approximation as accurate, we could say that with a 95% probability, our estimate is within around 1.8mm of the true mean. However, this approximation is only exact in the limit, letting the number of Dutch men in our sample go to infinity – one may argue such a situation is not very realistic, considering there was only a finite number of Dutch men in 2009. Despite this, the accuracy of asymptotic approximations can be verified by simulations, and they often work very well in practice, especially when the data itself is already close to Normal and the sample size is large, as is the case here.

While probabilities in the frequentist sense were easy to understand in the coin flipping example, it may be more tricky with average heights. When I say the height estimate was within 1.8mm of the true mean with a probability of 95%, it means that if we could repeat this experiment again, only choosing the random sample of Dutch men differently, the estimated mean would be within 1.8mm of the true mean 95% of the time. This is of course not possible, since we cannot travel back in time to redo the experiment. However, it motivates another approximation technique: the bootstrap. Introduced by Efron (1979), the bootstrap is a method which re-samples our original data to create new artificial samples, which we then use as samples from an "alternate reality". In essence, the extent to which the means of these bootstrap samples resemble the original mean lets us infer how the original mean relates to that of the population in 2009. Unlike asymptotic approximations, the bootstrap works well in small samples, and when the data is far from Normal, e.g. when it is heavily skewed or bimodal.

The methods I consider in this thesis are more complicated than the examples above, but the idea of statistical inference remains central to my work. In addition to the estimates themselves, I am interested in quantifying their uncertainty, allowing practitioners to make an informed judgement about the statistical significance of their results.
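To make these two examples concrete, both calculations can be reproduced in a few lines of R, the language used for the software accompanying this thesis. The sketch below is purely illustrative: the "height" data are simulated here, since the original survey data are not reproduced in this chapter.

```r
# Two-sided probability of an outcome at least as extreme as 55 Heads
# in 100 flips of a fair coin: approx. 0.368, the "around 37%" above.
2 * pbinom(54, size = 100, prob = 0.5, lower.tail = FALSE)
# An exact binomial test returns the same number as its p-value:
binom.test(55, 100, p = 0.5)$p.value

# A minimal nonparametric bootstrap of a sample mean: resample the data
# with replacement many times and look at the spread of the resampled means.
set.seed(1)
heights <- rnorm(100, mean = 183.8, sd = 7)   # artificial data, for illustration only
boot_means <- replicate(2000, mean(sample(heights, replace = TRUE)))
quantile(boot_means, c(0.025, 0.975))         # a simple 95% bootstrap interval
```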
1.2 High-dimensionality

One of the defining features of my work is the focus on methods for high-dimensional data, or "Big Data". While this term can mean different things to different people, in this thesis it refers to the setting where the number of variables is very large relative to the number of observations we have for each variable.

In a classical econometric setting, we may encounter the following dataset. For 100 individuals (i), we have data about their income (y_i), years of education (ed_i), number of children (ch_i), marital status (mar_i), and gender (g_i). If we wanted to explain an individual's income by their other characteristics, we could estimate the following linear model
\[
y_i = \alpha + \beta_1 ed_i + \beta_2 ch_i + \beta_3 mar_i + \beta_4 g_i + \epsilon_i, \tag{1.1}
\]
where α is an intercept, and ε_i is an error term which we wish to minimize. The parameters, which I typically denote with Greek letters, represent the effect on income resulting from a one unit change in the explanatory variable. To make the notation more compact, we can write eq. (1.1) in terms of vectors
\[
y_i = \beta' x_i + \epsilon_i, \tag{1.2}
\]
where β = (α, β_1, ..., β_4)' and x_i = (1, ed_i, ch_i, mar_i, g_i)'. This kind of model is then typically estimated by ordinary least squares (OLS), which chooses β in such a way as to minimize the sum of squared residuals, or
\[
\hat{\beta} = \arg\min_{\beta} \sum_i \left(y_i - \beta' x_i\right)^2. \tag{1.3}
\]
With 5 explanatory variables (including the intercept) and 100 observations, the properties of β̂ are well-known, including methods for inference. For example, we could check if the estimated parameter β̂_4 is significantly different from 0, which might indicate the presence of a wage gap between men and women.

However, we could also think of many other factors which affect a person's wages: their nationality, age, ethnicity, where they work and in which sector, which high school or university they attended, the wages and education levels of their parents, etc. With the rise of social media such as Facebook, it is plausible that such detailed data could be available – though not without ethical concerns. One might be tempted to simply include all these variables in a model as in eq. (1.1); after all, more data should improve the model and its estimates. Unfortunately, this approach would run afoul of the curse of dimensionality. A model with many variables (and therefore many parameters) is more flexible than one with few; it can explain a larger proportion of the variation in y_i and achieve a better fit of the data. However, this flexibility comes at the price of high variance in our estimates, and therefore high uncertainty which makes meaningful inference difficult. Worse yet, when the number of variables exceeds the number of observations, methods such as OLS fail completely; this is because the optimization problem in eq. (1.3) has no unique solution. The model becomes so flexible that it can fit the data perfectly, and it can do so in an infinite number of equally valid ways.

One of the canonical examples of such high-dimensional problems are gene expression models, where we want to estimate which genes are associated with the occurrence of certain diseases. For example, Simon et al. (2013) consider a data set with 127 patients, and data on 22,283 genes. If we hope to ever identify the correct genes with a statistical approach, it is paramount to use methods which function in such a high-dimensional setting, and also allow for valid inference. As another example, the FRED-QD database (McCracken and Ng, 2020) contains 128 quarterly US macroeconomic variables over the last 253 quarters. While this may not appear high-dimensional at first glance, practitioners typically use at least 4 lags of variables in their models, which makes the effective number of variables 1,265.
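The breakdown of OLS when the number of variables exceeds the number of observations is easy to see in practice. The following R sketch fits a small regression as in eq. (1.3) and then adds more regressors than observations; all variable names and numbers below are made up for illustration and do not correspond to any dataset used in this thesis.

```r
set.seed(1)
n <- 100
# Low-dimensional case: a handful of regressors, as in eq. (1.1)
X_small <- data.frame(ed = rnorm(n), ch = rpois(n, 1.5),
                      mar = rbinom(n, 1, 0.5), g = rbinom(n, 1, 0.5))
y <- 2 + 0.5 * X_small$ed + 0.3 * X_small$ch + rnorm(n)
summary(lm(y ~ ., data = X_small))    # standard OLS estimates and t-tests

# High-dimensional case: 200 regressors but only 100 observations.
X_big <- matrix(rnorm(n * 200), n, 200)
fit_big <- lm(y ~ X_big)
sum(is.na(coef(fit_big)))             # many coefficients cannot be identified (NA)
```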
1.3 The lasso

The lasso (or Least Absolute Shrinkage and Selection Operator), introduced by Tibshirani (1996), is an estimation method for sparse high-dimensional models. It addresses the problem of too-flexible models by adding a penalty for large parameter estimates. Compare to eq. (1.3) the lasso estimator
\[
\hat{\beta}^{(L)} = \arg\min_{\beta} \sum_i \left(y_i - \beta' x_i\right)^2 + \lambda \|\beta\|_1, \tag{1.4}
\]
where ‖β‖₁ = Σ_j |β_j|, and λ > 0 is a tuning parameter. The intuition behind this method is that the additional penalty prioritizes estimates β̂ with small entries, while maintaining a good fit for the data. The trade-off between these two features is governed by λ, where larger values make the penalty term more important, thus resulting in β̂^(L)'s with smaller entries – eventually giving β̂^(L) = 0 if λ is sufficiently large.

The lasso has several attractive features: it has a unique solution regardless of how many variables are in the model, and the geometry of the optimization problem gives rise to sparse estimates; that is, many entries of β̂^(L) are exactly 0. This means it selects those variables which are most important to give a good model fit, and sets the parameters of all other variables to 0, making the results easily interpretable. However, these properties come at the cost of introducing some bias into our estimation. Compared to OLS, the lasso produces estimates which are closer to zero and may lead to our underestimating the effects of some variables. The upshot is that it greatly reduces the variance of the estimates, and when λ is chosen carefully, this lower variance outweighs the additional bias, resulting in estimates which are on average closer to their true values.

Unfortunately, it is difficult to do inference with the lasso. Its tendency to reduce small parameter values to 0 may lead to omitted variable bias, which may invalidate inference on the remaining variables in the model. For example, if we return to eq. (1.1), it may be the case that the lasso gives an estimate of β̂₃ = 0, implying that marital status has no effect on income. If married individuals have particularly high incomes and also have more children on average than unmarried individuals, this high income will become wrongly associated with the number of children, thus overestimating the size of β₂. When multiple mutually correlated variables are included in the model, the lasso has a tendency to eliminate all but one of them, which has a high risk of creating this problem.
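As a concrete illustration of eq. (1.4), the widely used R package glmnet computes the lasso path on simulated data in a few lines. Note that glmnet parameterizes the penalty slightly differently (it scales the squared-error term by the sample size), so its lambda is not numerically identical to the λ written above; the sketch is illustrative only.

```r
library(glmnet)

set.seed(1)
n <- 100; p <- 200                     # more variables than observations
X <- matrix(rnorm(n * p), n, p)
beta0 <- c(2, -1.5, 1, rep(0, p - 3))  # sparse truth: only 3 nonzero coefficients
y <- X %*% beta0 + rnorm(n)

# Lasso fit with a cross-validated choice of the tuning parameter
cv_fit  <- cv.glmnet(X, y, alpha = 1)  # alpha = 1 gives the lasso penalty
b_lasso <- as.numeric(coef(cv_fit, s = "lambda.min"))

sum(b_lasso != 0)                      # only a handful of nonzero estimates survive
head(b_lasso[b_lasso != 0])            # shrunken versions of the large true coefficients
```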
1.4 The desparsified lasso

The desparsified (also known as debiased) lasso, introduced by van de Geer et al. (2014), addresses the lasso's issues with inference by adding an adjustment term to the lasso in eq. (1.4)
\[
\hat{\beta}^{(DL)} = \hat{\beta}^{(L)} + \hat{\Theta}\left(\frac{1}{T}\sum_{i=1}^{T} x_i \hat{\epsilon}_i\right), \tag{1.5}
\]
where ε̂_i is the residual of the lasso model, i.e. the amount by which the dependent variable y_i differs from the model prediction β̂^(L)'x_i. To give some intuition behind this adjustment, consider the example of omitted variable bias above, where the lasso penalized the parameter of the marital status variable to 0. For a high-earning married individual with many children, we would expect this issue to result in a particularly large value for ε̂_i. Since their value for the number of children is also large, the product of x_i and ε̂_i will contain a large entry, giving us an indication that this omitted variable bias may be occurring. The term in the brackets then takes an average of how big this issue is over all individuals, and gets scaled by the matrix Θ̂. This is the inverse covariance (or precision) matrix, which measures how correlated each pair of variables is, while taking into account the effects of all other variables. Therefore, if we miss important variables which are also correlated in a relevant way with other variables, the desparsified lasso will make a large adjustment to the lasso estimator to compensate.

The two names of this method hint at its properties: β̂^(DL) is no longer sparse in the sense of containing many zeroes, since the adjustment term is never 0. This means we lose the variable selection properties of the lasso, but also avoid the associated problems with omitted variables. It also counteracts the lasso's inherent bias towards 0; in fact, when used in a regular low-dimensional setting, β̂^(DL) becomes identical to the OLS estimator of eq. (1.1). This then provides some intuition for why we also recover some of the nice inference properties of OLS, even in the high-dimensional setting.
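The formal version of this estimator is developed in Chapter 2 and implemented in the desla package. Purely as an illustration of the adjustment in eq. (1.5), the R sketch below debiases a single lasso coefficient by hand, using a nodewise lasso regression of the variable of interest on the remaining regressors in place of the corresponding row of Θ̂. This is a simplified, generic sketch on simulated data, not the procedure proposed in this thesis.

```r
library(glmnet)

set.seed(2)
n <- 200; p <- 100
X <- matrix(rnorm(n * p), n, p)
beta0 <- c(1, 0.5, rep(0, p - 2))
y <- X %*% beta0 + rnorm(n)

j <- 1                                  # coefficient we want to debias

# Step 1: initial lasso of y on all regressors, and its residuals
fit_y   <- cv.glmnet(X, y, alpha = 1)
b_init  <- as.numeric(coef(fit_y, s = "lambda.min"))[-1]          # drop intercept
resid_y <- as.numeric(y - X %*% b_init -
                        as.numeric(coef(fit_y, s = "lambda.min"))[1])

# Step 2: nodewise lasso of x_j on the other regressors
fit_j <- cv.glmnet(X[, -j], X[, j], alpha = 1)
z_j   <- as.numeric(X[, j] - predict(fit_j, X[, -j], s = "lambda.min"))

# Step 3: single-coefficient version of the adjustment in eq. (1.5)
b_despars <- b_init[j] + sum(z_j * resid_y) / sum(z_j * X[, j])
c(lasso = b_init[j], desparsified = b_despars, truth = beta0[j])
```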
1.5 Time series

The main contribution of my thesis lies in extending the concepts described above to a time series setting. In the examples estimating Dutch men's heights, or modelling a person's income, I was implicitly assuming that the features of different individuals were independent of each other. If the data were collected in a well-designed survey, this is reasonable. Time series are a type of data where the observations refer to some quantity at different points in time, usually ordered from oldest to newest, and they typically exhibit some sort of dependence over time. While we could think of the series of coin flips as a sort of time series – they were presumably flipped sequentially over time – we would not be very interested in the order of heads or tails, since we expect these flips were independent of each other. Similarly, if we changed the order of individuals in our income data set, we would not expect to see different results. With time series, this ordering is a crucial feature of the data, and we are often interested in studying how a series changes over time. For example, we may want to examine whether average temperatures have been increasing over the past decades, or how the gross domestic product (GDP) of a country has been affected by COVID-19.

Time-dependence brings with it many complications when it comes to statistical modelling, estimation, and inference. However, understanding these patterns in our data allows us to make informed predictions about the future, which makes the additional effort worthwhile. Understanding the effects of greenhouse gas emissions on temperatures can help us develop effective climate policy to address climate change. In macroeconomics, studying the relationships between interest rates and GDP can lead to monetary policy which brings a country out of recession. As in my previous examples, inference is an invaluable tool for analyzing time series. When we forecast that GDP will grow by 1% next year, we may also want to know how confident we are of this prediction. If we are 95% confident that GDP will grow by between 0.8% and 1.2%, this tells a very different story than if it were between -3% and 4%.

In my work, I largely consider time series which are weakly dependent, or stationary. These series may have strong dependence between values that are close to each other in time, but the dependence dies out as they grow further apart. As such, they generally move in a tunnel; they may get disturbed by a short term shock, but they tend to return back to their long term mean after a while. For example, GDP growth dropped sharply during 2020, likely as a result of COVID-19 and various lockdown measures, and these will likely have lasting consequences even after the pandemic. However, we wouldn't expect this to have any noticeable effect in 100 years, in the same way that the Spanish flu has little effect on GDP today.

At the time I started working on this thesis, inference in high dimensions was well-established for independent data, with only a few authors pioneering the use of lasso-based methods in time series settings. Among them were Kock and Callot (2015), Basu and Michailidis (2015), and Medeiros and Mendes (2016), who showed many promising and useful results, and whose theoretical approaches greatly influenced my work. In particular, the latter's relaxation of the Gaussianity assumption to allow for more general and potentially fat-tailed distributions is also a running theme throughout this thesis. The field has grown rapidly since then, and I am pleased to see it flourish. The main contribution of this thesis is extending the theory behind the lasso and desparsified lasso to a high-dimensional, weakly dependent time series setting, deriving novel theoretical results for valid inference in linear models such as eq. (1.2).

1.6 Chapter overview

Chapter 2 is heavily theoretical, and forms the backbone of my thesis. In this chapter, we derive asymptotic results for the desparsified lasso in eq. (1.5), under a highly general form of weak dependence known as near-epoch dependence, which covers many popular dependence concepts such as vector autoregressive (VAR) or mixing processes. We derive these results under the assumption of weak sparsity of β, which makes its use justifiable in many practical settings. Unlike exact sparsity, which requires that many elements of β are exactly 0, weak sparsity allows for a large number of nonzero elements, provided they are not "too large". Inference in time series typically involves the long-run covariance matrix of our process, and we provide a consistent estimator for this matrix, showing it works well even in high dimensions. We also provide a data-dependent way of choosing the λ tuning parameter for the lasso in eq. (1.4), thus giving a complete toolbox required to do high-dimensional inference. To facilitate the use of this method by practitioners, we created the package desla for the open-source statistical software R, which efficiently implements the desparsified lasso and our proposed inference method. Furthermore, we perform an extensive simulation study demonstrating the accuracy of our inference in several relevant settings.

In Chapter 3, I focus on applying the desparsified lasso to high-dimensional local projections. This modelling technique allows us to do structural, or causal, inference; that is, rather than only considering correlations, this approach lets practitioners estimate the causal effects of structural shocks in the form of impulse responses. Local projections are typically used in macroeconomic settings, and the recent work of Plagborg-Møller and Wolf (2021) showed that they are in some sense equivalent to structural VARs – another highly popular method for estimating impulse responses. In addition to showing how the results of Chapter 2 can be applied in these local projections, we also propose a small modification to the desparsified lasso which greatly improves its inference performance, and demonstrate this in a simulation.
Finally, we present two empirical applications where we investigate the impulse responses to shocks in monetary policy and in government spending. These applications also highlight why these high-dimensional methods are important. For one of our analyses, we use 13 lags of 122 macroeconomic variables from the FRED-MD database (McCracken and Ng, 2016) in a non-linear state-dependent model, resulting in 3309 explanatory variables with only 707 time series observations. The implementation of our proposed high-dimensional local projections is also a part of the desla package.

Finally, in Chapter 4, we return to more theoretical results, with our proposal of the sparse high-dimensional VAR bootstrap. Unlike Chapter 2, where we develop time series theory for a high-dimensional method, this chapter develops high-dimensional theory for a method which is well-established in low-dimensional time series settings. In this bootstrap, we propose to estimate high-dimensional VARs with the lasso, and use these VARs to build our bootstrap samples. We then show that this bootstrap procedure is consistent, and provides a valid approximation for high-dimensional means. To do so, we build on our previous results in Section 2.3, where we derive error bounds for the lasso. To prove our main results, we also derive a high-dimensional central limit theorem for linear processes, which may be of independent interest.

Throughout the following chapters, I generally denote scalar quantities by lower case letters (e.g. x), vectors by bold lower case letters (e.g. v), matrices by bold capital letters (e.g. M), and unknown quantities by Greek letters. The notation of each chapter is otherwise self-contained and defined separately. In cases where one chapter refers to results of another, any relevant notation differences are made clear.

Chapter 2
Desparsified Lasso in Time Series

Abstract†

In this chapter we develop valid inference for high-dimensional time series. We extend the desparsified lasso to a time series setting under Near-Epoch Dependence (NED) assumptions allowing for non-Gaussian, serially correlated and heteroskedastic processes, where the number of regressors can possibly grow faster than the time dimension. We first derive an error bound under weak sparsity, which, coupled with the NED assumption, means this inequality can also be applied to the (inherently misspecified) nodewise regressions performed in the desparsified lasso. This allows us to establish the uniform asymptotic normality of the desparsified lasso under general conditions, including for inference on parameters of increasing dimensions. Additionally, we show consistency of a long-run variance estimator, thus providing a complete set of tools for performing inference in high-dimensional linear time series models. Finally, we perform a simulation exercise to demonstrate the small sample properties of the desparsified lasso in common time series settings.

† This chapter is based on joint work with S.J.M. Smeekes and I. Wilms. It is forthcoming in the Journal of Econometrics.

2.1 Introduction

In this chapter we propose methods for performing uniformly valid inference on high-dimensional time series regression models. Specifically, we establish the uniform asymptotic normality of the desparsified lasso method (van de Geer et al., 2014) under very general conditions, thereby allowing for inference in high-dimensional time series settings that encompass many econometric applications.
That is, we establish validity for potentially misspecified time series models, where the regressors and errors may exhibit serial dependence, heteroskedasticity and fat tails. In addition, as part of our analysis we derive new error bounds for the lasso (Tibshirani, 1996), on which the desparsified lasso is based.

Although traditionally approaches to high-dimensionality in econometric time series have been dominated by factor models (cf. Bai and Ng, 2008; Stock and Watson, 2011), shrinkage methods have rapidly been gaining ground. Unlike factor models, where dimensionality is reduced by assuming common structures underlying regressors, shrinkage methods assume a certain structure on the parameter vector. Typically, sparsity is assumed, where only a small, unknown subset of the variables is thought to have "significantly non-zero" coefficients, and all the other variables have negligible – or even exactly zero – coefficients. The most prominent among shrinkage methods exploiting sparsity is the lasso proposed by Tibshirani (1996), which adds a penalty on the absolute value of the parameters to the least squares objective function. This penalty ensures that many of the coefficients will be set to zero and thus variable selection is performed, an attractive feature that helps to make the results of a high-dimensional analysis interpretable. Due to this feature, the lasso and its many extensions are now standard tools for high-dimensional analysis (see e.g., Hesterberg et al., 2008; Vidaurre et al., 2013; Hastie et al., 2015, for reviews).

Much effort has been devoted to establishing error bounds for lasso-based methods to guarantee consistency for prediction (e.g., Greenshtein and Ritov, 2004; Bühlmann, 2006) and estimation of a high-dimensional parameter (e.g., Bunea et al., 2007; Zhang and Huang, 2008; Bickel et al., 2009; Meinshausen and Yu, 2009; Huang et al., 2008). While most of these advances have been made in frameworks with independent and identically distributed (IID) data, early extensions of lasso-based methods to the time series case can be found in Wang et al. (2007) and Hsu et al. (2008). These authors, however, only consider the case where the number of variables is smaller than the sample size. Various papers (e.g., Nardi and Rinaldo, 2011; Kock and Callot, 2015; Basu and Michailidis, 2015) let the number of variables increase with the sample size, but often require restrictive assumptions (for instance Gaussianity) on the error process when investigating theoretical properties of lasso-based estimators in time series models. Exceptions are Medeiros and Mendes (2016), Wu and Wu (2016), Masini et al. (2021), and Wong et al. (2020). Medeiros and Mendes (2016) consider the adaptive lasso for sparse, high-dimensional time series models and show that it is model selection consistent and has the oracle property, even when the errors are non-Gaussian and conditionally heteroskedastic. Wu and Wu (2016) consider high-dimensional linear models with dependent non-Gaussian errors and/or regressors and provide asymptotic theory for the lasso with deterministic design. To this end, they adopt the functional dependence framework of Wu (2005). Masini et al. (2021) focus on weakly sparse high-dimensional vector autoregressions for a class of potentially heteroskedastic and serially dependent errors, which encompass many multivariate volatility models.
The authors derive finite sample estimation error bounds for the parameter vector and establish consistency properties of lasso estimation. Wong et al. (2020) derive non-asymptotic inequalities for the estimation error and prediction error of the lasso without assuming any specific parametric form of the DGP (data-generating process). The authors assume the series to be either α-mixing Gaussian processes or β-mixing processes with sub-Weibull marginal distributions, thereby accommodating settings with heavy-tailed non-Gaussian errors.

While one of the attractive features of lasso-type methods is their ability to perform variable selection, this also causes serious issues when performing inference on the estimated parameters. In particular, performing inference on a (data-driven) selected model, while ignoring the selection, causes the inference to be invalid. This has been discussed by, among others, Leeb and Pötscher (2005) in the general context of model selection and Leeb and Pötscher (2008) for shrinkage estimators. As a consequence, the recent statistical literature has seen a surge in the development of so-called post-selection inference methods that circumvent the problem induced by model selection; see for example the literature on selective inference (cf. Fithian et al., 2015; Lee et al., 2016) and simultaneous inference (Berk et al., 2013; Bachoc et al., 2020).

In the context of lasso-type estimation, methods have been developed based on the idea of orthogonalizing the estimation of the parameter of interest to the estimation (and potential incorrect selection) of the other parameters. Belloni et al. (2014) and Chernozhukov et al. (2015) propose a post-double-selection approach that uses a Frisch-Waugh partialling out strategy to achieve this orthogonalization by selecting important covariates in initial selection steps on both the dependent variable and the variable of interest, and show this approach yields uniformly valid and standard normal inference for independent data. In a related approach, Javanmard and Montanari (2014), van de Geer et al. (2014) and Zhang and Zhang (2014) introduce debiased or desparsified versions of the lasso that achieve uniform validity based on similar principles for IID Gaussian data. Extensions to the time series case include Chernozhukov et al. (2021), who provide desparsified simultaneous inference on the parameters in a high-dimensional regression model allowing for temporal and cross-sectional dependency in covariates and error processes, Krampe et al. (2021), who introduce bootstrap-based inference for autoregressive time series models based on the desparsification idea, Hecq et al. (2021), who use the post-double-selection procedure of Belloni et al. (2014) for constructing uniformly valid Granger causality tests in high-dimensional VAR models, and Babii et al. (2019), who use a debiased sparse group lasso for inference on a low-dimensional group of parameters.

In this chapter, we contribute to the literature on shrinkage methods for high-dimensional time series models by providing novel theoretical results for both point estimation and inference via the desparsified lasso. We consider a very general time series framework where the regressors and error terms are allowed to be non-Gaussian, serially correlated and heteroskedastic, and the number of variables can grow faster than the time dimension.
Moreover, our assumptions allow for both correctly specified and misspecified models, thus providing results relevant for structural interpretations if the overall model is specified correctly, but not limited to this. We derive error bounds for the lasso in high-dimensional, linear time series models under mixingale assumptions and a weak sparsity assumption on the parameter vector. Our setting generalizes the one from Medeiros and Mendes (2016), who require a martingale difference sequence (m.d.s.) assumption – and hence correct specification – on the error process. Moreover, we relax the traditional sparsity assumption to allow for weak sparsity, thereby recognizing that the true parameters are likely not exactly zero. The error bounds are used to establish estimation and prediction consistency even when the number of parameters grows faster than the sample size.

We extend the error bounds to the nodewise regressions performed in the desparsified lasso, where each regressor (on which inference is performed) is regressed on all other regressors. Note that, contrary to the setting with independence over time, these nodewise regressions are inherently misspecified in dynamic models with temporal dependence. As such, our error bounds are specifically derived under potential misspecification. We then establish the asymptotic normality of the desparsified lasso under general conditions. As such, we ensure uniformly valid inference over the class of weakly sparse models. This result is accompanied by a consistent estimator for the long-run variance, thereby providing a complete set of tools for performing inference in high-dimensional, linear time series models. As such, our theoretical results accommodate various financial and macroeconomic applications encountered by applied researchers.

The remainder of this chapter is structured as follows. Section 2.2 introduces the time series setting and the assumptions thereof. In Section 2.3, we derive an error bound for the lasso (Corollary 2.1) that forms the basis for the nodewise regressions performed for the desparsified lasso. In Section 2.4, we establish the theory that allows for uniform inference with the desparsified lasso. Section 2.5 contains a simulation study examining the small sample performance of the desparsified lasso, and Section 2.6 concludes. The main proofs and preliminary lemmas needed for Section 2.3 are contained in Appendix 2.A, while Appendix 2.B contains the results and proofs for Section 2.4. Appendix 2.C contains supplementary material.

A word on notation. For any N-dimensional vector x, $\|x\|_r = \left(\sum_{i=1}^{N} |x_i|^r\right)^{1/r}$ denotes the $L_r$-norm, with the familiar conventions that $\|x\|_0 = \sum_i 1(|x_i| > 0)$ and $\|x\|_\infty = \max_i |x_i|$. For a matrix A, we let $\|A\|_r = \max_{\|x\|_r = 1} \|Ax\|_r$ for any $r \in [0, \infty]$ and $\|A\|_{\max} = \max_{i,j} |a_{i,j}|$. We use $\xrightarrow{p}$ and $\xrightarrow{d}$ to denote convergence in probability and in distribution, respectively. Depending on the context, ∼ denotes equivalence in order of magnitude of sequences, or equivalence in distribution. We frequently make use of arbitrary positive finite constants C (or its sub-indexed version C_i) whose values may change from line to line throughout the chapter, but they are always independent of the time and cross-sectional dimension. Similarly, generic sequences converging to zero as T → ∞ are denoted by η_T (or its sub-indexed version η_{T,i}). We say a sequence η_T is of size −x if $\eta_T = O(T^{-x-\varepsilon})$ for some ε > 0.
2.2 The High-Dimensional Linear Model

Consider the linear model
\[
y_t = x_t' \beta^0 + u_t, \qquad t = 1, \ldots, T, \tag{2.1}
\]
where $x_t = (x_{1,t}, \ldots, x_{N,t})'$ is an N × 1 vector of explanatory variables, β^0 is an N × 1 parameter vector and u_t is an error term. Throughout the chapter, we examine the high-dimensional time series model where N can be larger than T. We impose the following assumptions on the processes {x_t} and {u_t}.

Assumption 2.1. Let $z_t = (x_t', u_t)'$, and let there exist some constants $\bar{m} > m > 2$ and $d \geq \max\{1, (\bar{m}/m - 1)/(\bar{m} - 2)\}$ such that

(i) $\mathbb{E}[z_t] = 0$, $\mathbb{E}[x_t u_t] = 0$, and $\max_{1 \leq j \leq N+1,\, 1 \leq t \leq T} \mathbb{E}|z_{j,t}|^{2\bar{m}} \leq C$.

(ii) Let $s_{T,t}$ denote a k(T)-dimensional triangular array that is α-mixing of size $-d/(1/m - 1/\bar{m})$ with σ-field $\mathcal{F}_t^s := \sigma\{s_{T,t}, s_{T,t-1}, \ldots\}$ such that $z_t$ is $\mathcal{F}_t^s$-measurable. The process $\{z_{j,t}\}$ is $L_{2m}$-near-epoch-dependent (NED) of size −d on $s_{T,t}$ with positive bounded NED constants, uniformly over j = 1, ..., N + 1.
In particular, it allows one to view (2.1) as simply the 1 Since z grows asymptotically in dimension, it is natural to let the dimension of s t T,t grow with T , though this is not theoretically required. Although, like sT,t , technically our stochastic process z t is a triangular array due to dimension N increasing with T , in the remainder of the chapter we suppress the dependence on T for notational convenience. 2 To make the chapter self-contained, we include formal definitions on NED and mixingales in Appendix 2.A.1. 16 2.2 The High-Dimensional Linear Model linear projection of yt on all the variables in xt , with β 0 in that case representing the corresponding best linear projection coefficients. In such a case E [ut ] = 0 and E [ut xj,t ] = 0 hold by construction, and the additional conditions of Assumption 2.1 can be shown to hold under weak further assumptions. On the other hand, ut is not likely to be an m.d.s. in that case. As will be explained later, allowing for misspecified dynamics is crucial for developing the theory for the nodewise regressions underlying the desparsified lasso. It is important to note that we do not consider β 0 as the projection coefficients of the (lasso) selected model, but only of the full, pseudo-true, model. Our approach simply allows for the possibility of the full model being misspecified, for instance if the econometrician has missed relevant confounders in the initial dataset. This does not imply a “failure” of our lasso inference method, but rather a failure of the econometrician in setting up the initial model.3 Allowing for such misspecification is crucial for the nodewise regressions we consider in Section 2.4 which are simply projections of one explanatory variable on all the others, and therefore inherently misspecified. We further elaborate on misspecification in Example 2.3, after we present two examples of correctly specified common econometric time series DGPs. Remark 2.1. The NED-order m and sequence size −d play a key role in later theorems where they enter the asymptotic rates. In Assumption 2.1(i), we require z t to have m̄ moments, with m̄ being slightly larger than m. The more moments, the tighter the error bounds and the weaker conditions on the tuning parameter are, but a high m̄ implies stronger restrictions on the model (see e.g., the GARCH parameters in the to be discussed Example 2.1). Additionally, there is a tradeoff between the thickness of the tails allowed for and the amount of dependence – measured through the mixing rate in Assumption 2.1(ii). Under strong dependence, fewer moments are needed; the reduction from m̄ to m then reflects the price one needs to pay for allowing more dependence through a smaller mixing rate. Example 2.1 (ARDL model with GARCH errors). Consider the autoregressive dis3 Of course, the misspecification may be intentional, as even in dynamically misspecified models, the parameter of interest can still have a structural meaning. One example is the local projections of Jordà (2005), where h-step ahead predictive regressions with generally serially correlated error terms are performed. 17 2 Desparsified Lasso in Time Series tributed lag (ARDL) model with GARCH errors p X yt = ρi yt−i + i=1 q X θ ′i wt−i + ut = x′t β 0 + ut , i=0 p ut = ht εt , εt ∼ IID(0, 1), ht = π0 + π1 ht−1 + π2 u2t−1 , p P where the roots of the lag polynomial ρ(z) = 1 − ρi z i are outside the unit circle. 
Take ε_t, π_1 and π_2 such that $\mathbb{E}\left[\ln(\pi_1 \varepsilon_t^2 + \pi_2)\right] < 0$; then u_t is a strictly stationary geometrically β-mixing process (Francq and Zakoïan, 2010, Theorem 3.4), and additionally such that $\mathbb{E}|u_t|^{2\bar{m}} < \infty$ for some $\bar{m} \in \mathbb{N}$ (the number of moments depends on π_1, π_2 and the moments of ε_t, cf. Francq and Zakoïan, 2010, Example 2.3). Also assume that the vector of exogenous variables w_t is stationary and geometrically β-mixing as well, with finite 2m̄ moments. Given the invertibility of the lag polynomial, we may then write $y_t = \rho^{-1}(L) v_t$, where $v_t = \sum_{i=0}^{q} \theta_i' w_{t-i} + u_t$ and the inverse lag polynomial $\rho^{-1}(z)$ has geometrically decaying coefficients. Then it follows directly that y_t is NED on v_t, where v_t is strong mixing of size −∞ as its components are geometrically β-mixing, and the sum inherits the mixing properties. Furthermore, if $\|\theta_i\|_1 \leq C$ for all i = 0, ..., q, it follows directly from Minkowski that $\mathbb{E}|v_t|^{2\bar{m}} \leq C$ and consequently $\mathbb{E}|y_t|^{2\bar{m}} \leq C$. Then y_t is NED of size −∞ on (w_t, u_t), and consequently $z_t = (y_{t-1}, w_t, u_t)$ as well.

Example 2.2 (Equation-by-equation VAR). Consider the vector autoregressive model
\[
y_t = \sum_{i=1}^{p} \Phi_i y_{t-i} + u_t,
\]
where y_t is a K × 1 vector of dependent variables, $\mathbb{E}|u_t|^{2\bar{m}} \leq C$, and the K × K matrices Φ_i satisfy appropriate stationarity and 2m̄-th order summability conditions. The equivalent equation-by-equation representation is
\[
y_{k,t} = \sum_{i=1}^{p} \left[\Phi_{k,1,i}, \ldots, \Phi_{k,K,i}\right] y_{t-i} + u_{k,t} = \left(y_{t-1}', \ldots, y_{t-p}'\right) \beta_k + u_{k,t}, \qquad k \in (1, \ldots, K).
\]
Assuming a well-specified model with $\mathbb{E}\left[u_t \,\middle|\, y_{t-1}, \ldots, y_{t-p}\right] = 0$, the conditions of Assumption 2.1 are then satisfied trivially.

Examples 2.1 and 2.2 demonstrate that Assumption 2.1 is sufficiently general to include common time series models in econometrics. While these examples are equally well covered by other commonly used assumptions such as the martingale difference sequence (m.d.s.) framework chosen in Medeiros and Mendes (2016) or Masini et al. (2021), we opt for the more general NED framework, as it additionally covers many relevant cases – in particular for our nodewise regressions – where properties such as m.d.s. fail. The following examples provide simple illustrations of these cases.

Example 2.3 (Misspecified AR model). Consider an autoregressive (AR) model of order 2
\[
y_t = \rho_1 y_{t-1} + \rho_2 y_{t-2} + v_t, \qquad v_t \sim IID(0,1),
\]
where $\mathbb{E}|v_t|^{2\bar{m}} \leq C$ and the roots of $1 - \rho_1 L - \rho_2 L^2$ are outside the unit circle. Define the misspecified model $y_t = \tilde{\rho} y_{t-1} + u_t$, where $\tilde{\rho} = \arg\min_{\rho} \mathbb{E}\left[(y_t - \rho y_{t-1})^2\right] = \frac{\mathbb{E}[y_t y_{t-1}]}{\mathbb{E}[y_{t-1}^2]} = \frac{\rho_1}{1 - \rho_2}$ and u_t is autocorrelated. An m.d.s. assumption would be inappropriate in this case, as
\[
\mathbb{E}\left[u_t \,\middle|\, \sigma\{y_{t-1}, y_{t-2}, \ldots\}\right] = \mathbb{E}\left[y_t - \tilde{\rho} y_{t-1} \,\middle|\, \sigma\{y_{t-1}, y_{t-2}, \ldots\}\right] = -\frac{\rho_1 \rho_2}{1 - \rho_2} y_{t-1} + \rho_2 y_{t-2} \neq 0.
\]
However, it can be shown that $(y_{t-1}, u_t)'$ satisfies Assumption 2.1(ii) by considering the moving average representation of y_t and, by extension, of $u_t = y_t - \tilde{\rho} y_{t-1}$. As the coefficients are geometrically decaying, u_t is clearly NED on v_t and Assumption 2.1(ii) is satisfied.
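A quick simulation makes Example 2.3 concrete: regressing y_t on y_{t-1} alone recovers the pseudo-true coefficient ρ_1/(1 − ρ_2) rather than ρ_1, and the resulting errors are visibly autocorrelated. The R sketch below uses arbitrarily chosen values of ρ_1 and ρ_2 purely for illustration.

```r
set.seed(3)
rho1 <- 0.5; rho2 <- 0.3                 # any stationary AR(2) pair works here
y <- as.numeric(arima.sim(list(ar = c(rho1, rho2)), n = 1e5))

# Least-squares fit of the (misspecified) AR(1) model
fit <- lm(y[-1] ~ 0 + y[-length(y)])
c(estimated = unname(coef(fit)), pseudo_true = rho1 / (1 - rho2))  # both approx. 0.71

# The errors of the misspecified model are serially correlated,
# so an m.d.s. assumption on them would be inappropriate:
acf(resid(fit), lag.max = 5, plot = FALSE)
```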
The key condition to apply the lasso successfully is that the parameter vector β^0 is (at least approximately) sparse. We formulate this in Assumption 2.2 below.

Assumption 2.2. For some 0 ≤ r < 1 and sparsity level s_r, define the N-dimensional sparse compact parameter space
\[
B_N(r, s_r) := \left\{\beta \in \mathbb{R}^N : \|\beta\|_r^r \leq s_r,\ \|\beta\|_\infty \leq C,\ \exists C < \infty\right\},
\]
and assume that $\beta^0 \in B_N(r, s_r)$.

Assumption 2.2 implies that β^0 is sparse, with the degree of sparsity governed by both r and s_r. Without further assumptions on r and s_r, Assumption 2.2 is not binding, but as will be seen later, the allowed rates will interact with other DGP parameters, creating binding conditions. Assumption 2.2 generalizes the common assumption of exact sparsity taking r = 0 (see e.g., Medeiros and Mendes, 2016; van de Geer et al., 2014; Chernozhukov et al., 2021; Babii et al., 2019), which assumes that there are only a few (at most s_0) non-zero components in β^0, to weak sparsity (see e.g., van de Geer, 2019). This allows us to have many non-zero elements in the parameter vector, as long as they are sufficiently small. It follows directly from the formulation in Assumption 2.2 that, given the compactness of the parameter space, exact sparsity of order s_0 implies weak sparsity with r > 0 of the same order (up to a fixed constant). In general, the smaller r is, the more restrictive the assumption. The relaxation to weak sparsity is straightforward and follows from elementary inequalities (see e.g., Section 2.10 of van de Geer, 2016 and the proof of Lemma 2.A.7).

Example 2.4 (Infinite order AR). Consider an infinite order autoregressive model
\[
y_t = \sum_{j=1}^{\infty} \rho_j y_{t-j} + \varepsilon_t,
\]
where ε_t is a stationary m.d.s. with sufficient moments existing, and the lag polynomial $1 - \sum_{j=1}^{\infty} \rho_j L^j$ is invertible and satisfies the summability condition $\sum_{j=1}^{\infty} j^a |\rho_j| < \infty$ for some a ≥ 0. One might consider fitting an autoregressive approximation of order P to y_t,
\[
y_t = \sum_{j=1}^{P} \beta_j y_{t-j} + u_t,
\]
as it is well known that if P is sufficiently large, the best linear predictors β_j will be close to the true coefficients ρ_j (see e.g., Kreiss et al., 2011, Lemma 2.2). To relate the summability condition above to the weak sparsity condition, note that by Hölder's inequality we have that
\[
\|\beta\|_r^r = \sum_{j=1}^{P} \left(j^a |\beta_j|\right)^r j^{-ar} \leq \left(\sum_{j=1}^{P} j^a |\beta_j|\right)^r \left(\sum_{j=1}^{P} j^{-\frac{ar}{1-r}}\right)^{1-r} \leq C \max\{P^{1-(a+1)r}, 1\}.
\]
The constant comes from bounding the first term by the convergence of β_j to ρ_j plus the summability of the latter, while the second term involving P follows from Lemma 5.1 of Phillips and Solo (1992).⁴ As such, summability conditions on lag polynomials imply weak sparsity conditions, where the strength of the summability condition (measured through a) and the required strictness of the sparsity (measured through r) determine the order s_r of the sparsity. Therefore, weak sparsity – unlike exact sparsity – can accommodate sparse sieve estimation of infinite-order, appropriately summable, processes, providing an alternative to least-squares estimation of lower order approximations. For VAR models we can apply the same reasoning, with the addition that appropriate row sparsity is needed for the coefficients in the row of interest of the VAR if the number of series increases with the sample size.

⁴ As the same lemma shows, one should in fact treat the case r = 1/(a + 1) separately, in which a bound of order $(\ln P)^{\frac{a}{a+1}}$ holds.
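To get a feel for Assumption 2.2 and Example 2.4, the short R sketch below evaluates the weak sparsity measure ‖β‖_r^r for two stylized coefficient vectors: an exactly sparse one and a dense one with polynomially decaying entries. The vectors and values of r are chosen purely for illustration.

```r
weak_sparsity <- function(beta, r) {
  if (r == 0) sum(beta != 0) else sum(abs(beta)^r)   # ||beta||_r^r; r = 0 counts nonzeros
}

P <- 1000
beta_exact <- c(1, -0.5, 0.25, rep(0, P - 3))   # exactly sparse: 3 nonzero entries
beta_decay <- 0.8 * (1:P)^(-2)                  # dense, but with rapidly decaying entries

sapply(c(0, 0.5, 0.9), function(r)
  c(r = r, exact = weak_sparsity(beta_exact, r), decay = weak_sparsity(beta_decay, r)))
# The decaying vector has ||beta||_0 = P, yet its ||beta||_r^r for r well above 0
# remains modest, since the many small coefficients contribute very little.
```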
We need this set in the following condition, which formulates the standard compatibility conditions needed for lasso consistency (see e.g., Bühlmann and van de Geer, 2011, Chapter 6).

Assumption 2.3. Let Σ := (1/T) ∑_{t=1}^T E[x_t x_t′]. For a general index set S with cardinality |S|, define the compatibility constant

φ²_Σ(S) := min_{z ∈ R^N \ {0} : ∥z_{S^c}∥₁ ≤ 3∥z_S∥₁} { |S| z′Σz / ∥z_S∥₁² }.

Assume that φ²_Σ(S_λ) ≥ 1/C, which implies that

∥z_{S_λ}∥₁² ≤ |S_λ| z′Σz / φ²_Σ(S_λ) ≤ C |S_λ| z′Σz,

for all z satisfying ∥z_{S_λ^c}∥₁ ≤ 3∥z_{S_λ}∥₁ ≠ 0.

The compatibility constant in Assumption 2.3 is bounded from below by the minimum eigenvalue of Σ, so this condition is considerably weaker than assuming Σ to be positive definite with smallest eigenvalue bounded away from zero. We formulate the compatibility condition in Assumption 2.3 on the population covariance matrix rather than directly on the sample covariance matrix Σ̂ := X′X/T; compare, e.g., the restricted eigenvalue condition in Medeiros and Mendes (2016) or Assumption (A2) in Chernozhukov et al. (2021). Verifying this assumption on the population covariance matrix is generally more straightforward than verifying it directly on the sample covariance matrix.⁵ Finally, note that the compatibility assumption for the weak sparsity index set S_λ is weaker than (and implied by) its equivalent for S_0, see Lemma 6.19 in Bühlmann and van de Geer (2011), and that the strictness of this assumption depends on the choice of the tuning parameter λ.

⁵ Though note that Basu and Michailidis (2015) show in their Proposition 3.1 that the restricted eigenvalue condition holds with high probability under general time series conditions when x_t is a stable process with full-rank spectral density and T is sufficiently large. Their Proposition 4.2 includes a stable VAR process as an example.

2.3 Error Bound and Consistency for the Lasso

In this section, we derive a new error bound for the lasso in a high-dimensional time series model. The lasso estimator (Tibshirani, 1996) of the parameter vector β⁰ in Model (2.1) is given by

β̂ := arg min_{β ∈ R^N} { ∥y − Xβ∥₂²/T + 2λ∥β∥₁ },     (2.3)

where y = (y_1, . . . , y_T)′ is the T × 1 response vector, X = (x_1, . . . , x_T)′ the T × N design matrix, and λ > 0 a tuning parameter. Optimization problem (2.3) adds a penalty term to the least squares objective to penalize parameters that are different from zero.

When deriving this error bound, one typically requires that λ is chosen sufficiently large to exceed the empirical process max_j |(1/T) ∑_{t=1}^T x_{j,t} u_t| with high probability. To this end, we define the set E_T(z) := { max_{j≤N, l≤T} |∑_{t=1}^l u_t x_{j,t}| ≤ z }, and establish the conditions under which P(E_T(Tλ/4)) → 1. In addition, since we formulate the compatibility condition in Assumption 2.3 on the population covariance matrix, we need to show that Σ and Σ̂ are sufficiently close under the DGP assumptions. To this end, we define the set CC_T(S) := { ∥Σ̂ − Σ∥_max ≤ C/|S| }, and show that P(CC_T(S_λ)) → 1. Theorem 2.1 then presents both results.

Theorem 2.1. Let Assumptions 2.1 to 2.3 hold, and assume that

0 < r < 1:  λ ≥ C (ln(ln T))^((d+m−1)/(r(dm+m−1))) [ s_r ( N^(2/d + 2/(m−1)) / √T )^((1/m)(1/d + 1/(m−1))) ]^(1/r),

r = 0:  s_0 ≤ C (ln(ln T))^(−(d+m−1)/(dm+m−1)) [ √T / N^(2/d + 2/(m−1)) ]^((1/m)(1/d + 1/(m−1))),   λ ≥ C (ln(ln T))^(1/m) N^(1/m)/√T.     (2.4)

When N, T are sufficiently large, P(E_T(Tλ/4) ∩ CC_T(S_λ)) ≥ 1 − C (ln(ln T))^(−1).

Theorem 2.1 thus establishes that the sets E_T(Tλ/4) and CC_T(S_λ) hold with high probability.
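As an aside on computation, the minimizer of eq. (2.3) can be obtained with any standard lasso solver. A minimal sketch (Python, using scikit-learn's coordinate-descent implementation; the simulated data are purely illustrative) shows the tuning-parameter scaling implied by comparing the two objectives:

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_estimate(X, y, lam):
    """Minimize ||y - X b||_2^2 / T + 2*lam*||b||_1, as in eq. (2.3).

    scikit-learn's Lasso minimizes ||y - X b||_2^2 / (2 T) + alpha*||b||_1,
    which has the same minimizer when alpha = lam.
    """
    model = Lasso(alpha=lam, fit_intercept=False, max_iter=10_000)
    model.fit(X, y)
    return model.coef_

# Toy usage with simulated data
rng = np.random.default_rng(0)
T, N = 200, 50
X = rng.standard_normal((T, N))
beta0 = np.r_[1.0, -0.5, 0.25, np.zeros(N - 3)]
y = X @ beta0 + rng.standard_normal(T)
beta_hat = lasso_estimate(X, y, lam=0.1)
print("indices of nonzero estimates:", np.flatnonzero(beta_hat))
```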
Each set has a condition under which its probability converges to 1; these conditions follow from Lemmas 2.A.3 and 2.A.4, respectively. For the set E_T(Tλ/4), the condition λ ≥ C (ln(ln T))^(1/m) N^(1/m)/√T is required. The ln(ln T) appearing throughout the theorem is chosen arbitrarily as a sequence which grows slowly as T → ∞; we only need some sequence tending to infinity sufficiently slowly. The details can be found in the proof of Theorem 2.1. For the set CC_T(S_λ), we need to distinguish the cases 0 < r < 1 and r = 0 due to the way the size of the sparsity index set in eq. (2.2) is bounded. For 0 < r < 1, a lower bound on λ is imposed which is stricter than the one for the empirical process, hence only that bound appears in Theorem 2.1. For r = 0, the condition on s_0 does not depend on λ, hence both bounds appear in Theorem 2.1.

Theorem 2.1 directly yields an error bound for the lasso in high-dimensional time series models by standard arguments in the literature, see e.g., Chapter 2 of van de Geer (2016). The proofs of Lemmas 2.A.6 and 2.A.7 in Section 2.C.1 provide details.

Corollary 2.1. Under Assumptions 2.1 to 2.3 and the conditions of Theorem 2.1, when N, T are sufficiently large, the following holds with probability at least 1 − C (ln ln T)^(−1):

(i) ∥X(β̂ − β⁰)∥₂²/T ≤ C λ^(2−r) s_r,

(ii) ∥β̂ − β⁰∥₁ ≤ C λ^(1−r) s_r.

Under the additional assumption that λ^(1−r) s_r → 0, these error bounds directly establish prediction and estimation consistency. The bounds in Theorem 2.1 thereby put implicit limits on the divergence rates of N and s_r relative to T. In particular, the term offsetting the divergence in N and s_r is of polynomial order in T. The order of the polynomial, and therefore the restriction on the growth of N and s_r, is determined by the number of moments m and the dependence parameter d; the higher the number of moments m and the larger the dependence parameter d, the fewer restrictions one has on the allowed polynomial growth of N and s_r. In the limit, if m and d tend to infinity (all moments exist and the data are mixing), the order of the polynomial restriction on N tends to infinity, thereby approaching exponential growth. A similar trade-off between the allowed growth of N and the existence of moments was found in Medeiros and Mendes (2016). In Example 2.C.1 we study in greater detail how the different rates interact, thereby providing an overview of the restrictions under different scenarios. While Corollary 2.1 is a useful result in its own right, it is also vital for deriving the theoretical results for the desparsified lasso, to which we turn next.

2.4 Uniformly Valid Inference via the Desparsified Lasso

We use the desparsified lasso to perform uniformly valid inference in general high-dimensional time series settings. After briefly reviewing the desparsified lasso, we formulate the assumptions needed in Section 2.4.1. The asymptotic theory is then derived in Section 2.4.2 for inference on low-dimensional parameters of interest, and in Section 2.4.3 for inference on high-dimensional parameters.

The desparsified lasso (van de Geer et al., 2014) is defined as

b̂ := β̂ + Θ̂ X′(y − X β̂)/T,     (2.5)

where β̂ is the lasso estimator from eq. (2.3) and Θ̂ := Υ̂^(−2) Γ̂ is a reasonable approximation of the inverse of Σ̂. By de-sparsifying the initial lasso, the bias in the lasso estimator is removed and uniformly valid inference can be obtained.
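To make the bias-correction mechanism explicit, substituting y = Xβ⁰ + u from Model (2.1) into eq. (2.5) gives the standard decomposition (a sketch of the argument; the remainder term is treated formally in Appendix 2.B):

```latex
\hat{b} - \beta^0
  = \frac{\hat{\Theta} X' u}{T}
  + \bigl( I_N - \hat{\Theta}\hat{\Sigma} \bigr)\bigl( \hat{\beta} - \beta^0 \bigr).
```

After scaling by √T, the first term drives the Gaussian limit, while the second is asymptotically negligible because the lasso error β̂ − β⁰ is small by Corollary 2.1 and the rows of Θ̂Σ̂ − I_N are small by construction of the nodewise regressions; up to sign and the √T scaling, this is the term ∆ controlled in Lemma 2.B.8.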
The matrix Γ̂ is constructed using nodewise regressions: each column of X is regressed on all other explanatory variables using the lasso. Let the lasso estimates of the j = 1, . . . , N nodewise regressions be

γ̂_j := arg min_{γ_j ∈ R^(N−1)} { ∥x_j − X_{−j} γ_j∥₂²/T + 2λ_j ∥γ_j∥₁ },     (2.6)

where the T × (N − 1) matrix X_{−j} is X with its jth column removed. Their components are given by γ̂_j = {γ̂_{j,k} : k ∈ {1, . . . , N} \ j}. Stacking these estimated parameter vectors row-wise with ones on the diagonal gives the matrix

Γ̂ := [  1          −γ̂_{1,2}   . . .   −γ̂_{1,N}
        −γ̂_{2,1}    1          . . .   −γ̂_{2,N}
         ⋮           ⋮          ⋱        ⋮
        −γ̂_{N,1}   −γ̂_{N,2}   . . .    1       ].

We then take Υ̂^(−2) := diag(1/τ̂_1², . . . , 1/τ̂_N²), where τ̂_j² := ∥x_j − X_{−j} γ̂_j∥₂²/T + 2λ_j ∥γ̂_j∥₁.

We use the index set H ⊆ {1, . . . , N} with cardinality h = |H| to denote the set of variables whose coefficients we wish to perform inference on. In this case, computational gains can be obtained with respect to the nodewise regressions, as we only need to obtain the sub-vector of the desparsified lasso corresponding to H, b̂_H := β̂_H + Θ̂_H X′(y − X β̂)/T, with the subscript H indicating that we only take the respective rows of β̂ and Θ̂. To compute Θ̂_H, one only needs to compute h nodewise regressions instead of N, which can be a considerable reduction for small h relative to large N.

2.4.1 Assumptions

Consider the population nodewise regressions defined by the linear projections

x_{j,t} = x_{−j,t}′ γ_j⁰ + v_{j,t},   γ_j⁰ := arg min_γ (1/T) ∑_{t=1}^T E[(x_{j,t} − x_{−j,t}′ γ)²],   j = 1, . . . , N,     (2.7)

with τ_j² := (1/T) ∑_{t=1}^T E[v_{j,t}²]. Note that by construction, it holds that E[v_{j,t}] = 0, ∀t, j, and E[v_{j,t} x_{k,t}] = 0, ∀t, k ≠ j. We first present Assumptions 2.4 and 2.5, which allow us to extend Corollary 2.1 to the nodewise lasso regressions.

Assumption 2.4. Let max_{1≤j≤N, 1≤t≤T} E[|v_{j,t}|^(2m̄)] ≤ C.

Assumption 2.5.
(i) For some 0 ≤ r < 1 and sparsity levels s_r^(j), let γ_j⁰ ∈ B^(N−1)(r, s_r^(j)), ∀j ∈ H.
(ii) Let max_{1≤j≤N} σ_{j,j} ≤ C and Λ_min ≥ 1/C, where Λ_min is the smallest eigenvalue of Σ.

Assumption 2.4 requires the errors v_{j,t} from the nodewise linear projections to have bounded moments of an order greater than four. By the properties of NED processes, we use Assumptions 2.1 and 2.4 to establish mixingale properties of the products v_{j,t} u_t =: w_{j,t} and w_{j,t} w_{k,t−l} in Lemma 2.B.2, which are used extensively in the derivation of the desparsified lasso's asymptotic distribution. Assumption 2.5(i), similar to Assumption 2.2, requires weak sparsity of the nodewise regressions, rather than exact sparsity. The latter could be problematic, as it would imply that many of the regressors are uncorrelated. In contrast, weak sparsity is a plausible alternative, see e.g., Example 2.4. Importantly, the weak sparsity of the nodewise regressions is fully determined by the model and hence should be verified. Below, we provide concrete examples where the weak sparsity assumption holds. Assumption 2.5(ii) requires the population covariance matrix to be positive definite, with its smallest eigenvalue bounded away from zero, and to have finite variances. Assumption 2.5(ii) implies the compatibility condition and thus replaces Assumption 2.3 in Section 2.3, with Λ_min fulfilling the role of φ²_Σ. It also implies that the explanatory variables, including the irrelevant ones, cannot be linear combinations of each other, even as we let the number of variables tend to infinity.
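The population quantities in eq. (2.7) and Assumption 2.5 can be inspected numerically whenever the process generating x_t is known. A minimal sketch for a VAR(1) design of the kind considered in Example 2.6 below (Python; the dimension and parameter values are arbitrary and purely illustrative):

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

# For z_t = Phi z_{t-1} + u_t with Var(u_t) = Omega, the regressor covariance
# Sigma = Var(z_t) solves the discrete Lyapunov equation Sigma = Phi Sigma Phi' + Omega.
N = 20
phi, rho = 0.5, 0.4
Phi = phi * np.eye(N)
Omega = rho ** np.abs(np.subtract.outer(np.arange(N), np.arange(N)))   # Toeplitz
Sigma = solve_discrete_lyapunov(Phi, Omega)

j = 0                                    # population nodewise regression of x_j on x_{-j}
idx = np.delete(np.arange(N), j)
gamma_j = np.linalg.solve(Sigma[np.ix_(idx, idx)], Sigma[idx, j])      # gamma_j^0
tau_j2 = Sigma[j, j] - Sigma[j, idx] @ gamma_j                         # tau_j^2
lambda_min = np.linalg.eigvalsh(Sigma).min()                           # cf. Assumption 2.5(ii)

print(f"tau_j^2 = {tau_j2:.3f}, Lambda_min = {lambda_min:.3f}")
print("largest |gamma_j^0| entries:", np.sort(np.abs(gamma_j))[::-1][:5])
```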
Although this is a considerable strengthening of Assumption 2.3, it is important to realize this assumption is still made on the population matrix instead of the sample version, and may therefore still hold in fairly general, high-dimensional models. For example, Basu and 25 2 Desparsified Lasso in Time Series Michailidis (2015) provide a lower bound for Λmin in VAR models on their Proposition 2.3, which can be shown to be bounded away from zero under realistic conditions, see also Masini et al. (2021, p. 6). Similarly, this assumption can be shown to hold in factor models under minimal assumptions on the idiosyncratic errors (see Example 2.5 below). Example 2.5. (Sparse factor model) Consider the factor model ′ yt = β 0 xt + ut , ut ∼ IID(0, 1) xt = Λ f t + ν t , ν t ∼ IID(0, Σν ), N ×kk×1 f t ∼ IID(0, Σf ), where Λ has bounded elements, Σf and Σν are positive definite with bounded eigenvalues, and ν t and f t are uncorrelated. In this DGP, −1  −1 ′ −1 −1 Λ′ Σ−1 Σ = ΛΣf Λ′ + Σν =⇒ Θ = Σ−1 ν . ν − Σν Λ Σf + Λ Σν Λ As shown in Appendix 2.C.5, the sparsity of the nodewise regression parameters can be bounded as max γ 0j j r r ≤ Σ−1 ν r r  1 + C Σ−1 ν r r  r ∥Λ∥r k 2−r/2 N −ar , where N a is the rate at which the k-th largest eigenvalue of Σ diverges. This result allows for weak factor models where a < 1, which have been proposed for providing a theoretical explanation for the often observed empirical phenomenon where the separation between the eigenvalues of the Gram matrix is not as large as the strong factor model with a = 1 implies (cf. De Mol et al., 2008; Onatski, 2012; Uematsu and Yamagata, 2022a,b). The bound of the nodewise regressions further depends on the number of factors, the sparsity of the factor loadings and the sparsity of Σ−1 ν . Sparse factor loadings are intimately linked to weak factor models, and may provide accurate descriptions of the data in various economic and financial applications, see Uematsu and Yamagata (2022a,b) and Appendix 2.C.5 for details. Sparsity in Σ−1 holds when the idiosyncratic components are not too strongly ν cross-sectionally dependent, which is a standard assumption in factor models. It occurs for instance for block diagonal structures of Σν , in which case Σ−1 ν r r ≤ Cb where b is the size of the largest b × b block matrix with b nonzero elements, or for 2 Toeplitz structures σν i,j = ρ|i−j| , |ρ| < 1, in which case Σ−1 ν r r ≤ C. Note that to satisfy the minimum eigenvalue condition (Assumption 2.5(ii)), we only need the minimum eigenvalue of Σν to be bounded away from 0. 26 2.4 Uniformly Valid Inference via the Desparsified Lasso Example 2.6 (Sparse VAR(1)). Consider a stationary VAR(1) model for z t = (yt , x′t )′ z t = Φz t−1 + ut , Eut u′t := Ω, Eut u′t−l = 0, ∀l ̸= 0, with our regression of interest being the first line of the VAR, that is yt = ϕ1 z t−1 +u1,t , where ϕj is the jth row of Φ. Under this DGP, the nodewise regression parameters γ 0j are determined entirely by Φ and Ω, and we now consider two cases for which we derive explicit results in Section 2.C.5. (a) Let Φ be symmetric and block diagonal with largest block of size b. Assume that Φ has eigenvalues strictly between 0 and 1, and ∥Φ∥max ≤ C. Furthermore, let Ω = I. Then the nonzero entries of γ 0j follow the block structure of Φ, such that max γ 0j j 0 ≤ Cb. (b) Let Φ = ϕI with |ϕ| < 1, and let Ω have a Toeplitz structure ωi,j = ρ|i−j| , |ρ| < 1. 
Then γ 0j is only weakly sparse, in the sense that it contains no zeroes, but its entries follow a geometrically decaying pattern, meaning that max γ 0j j r r ≤ C. More generally, sparsity of γ 0j requires that the autoregressive coefficient matrix Φ and the error covariance matrix Ω are row- and column-sparse in such a way that matrix multiplication preserves this sparsity. For case (a), we may relax the assumption on Ω to block-diagonality, provided the block structure is similar to that of Φ. For case (b), the result holds even when we let Φ have a similar Toeplitz structure as Ω, as we numerically investigate in Section 2.C.5. To verify the minimum eigenvalue condition in Assumption 2.5(ii), we may apply the bound derived in (Masini et al., 2021, p. 6), 2 which gives Λmin ≥ Λmin (Ω) [1 + (∥Φ∥1 + ∥Φ∥∞ ) /2] , where Λmin (Ω) is the smallest eigenvalue of Ω. Remark 2.2. Alternative approaches exist that circumvent the need to directly impose weak sparsity assumptions on the nodewise regressions. Krampe et al. (2021) use the desparsified lasso for inference in the context of stationary VARs with IID errors, but do not use nodewise regressions to build an estimator of Θ as we do. Instead, they use the VAR model structure to derive an estimator based on regularized estimates of the VAR coefficients and the error covariances. Such an approach requires knowledge of the full model underlying the covariates to provide an analytical expression for the nodewise projections. While this is a natural approach in a VAR model, this approach is considerably more difficult to apply in a more general setting, where the structure underlying the covariates is typically unknown. Moreover, they still require conditions on sparsity, which are similar to those found for the VAR 27 2 Desparsified Lasso in Time Series model of Example 2.6, i.e. row- and column-sparsity of the VAR coefficient matrices in addition to sparsity of the inverse error covariance matrix. Deshpande et al. (2021) use an online debiasing strategy for inference in VAR models with IID Gaussian errors, among other settings. Rather than using a single estimate of Θ, they use a sequence of precision matrix estimates based on an episodic structure, which can be seen as a generalization of sample-splitting. In addition, they use the precision matrix estimator as in Javanmard and Montanari (2014), which does not require sparsity of Θ. It is an interesting topic for future research to investigate whether these techniques can be leveraged in our setting allowing for misspecification and with potentially serially correlated/heteroskedastic errors. Assumptions 2.4 and 2.5 allow us to apply Corollary 2.1 to the nodewise regressions. Specifically, if the conditions on λ formulated in (2.4) hold for both λ := min λj ¯ (j) and λ̄ := max λj , the error bounds – with s̄r := max sr j∈H j∈H j∈H substituted for sr – apply to the nodewise regressions as well. As we generally need the error bounds to hold uniformly over all relevant nodewise regressions as well as the initial regression, we combine these bounds and state our results on the quantities λmin = min{λ, λ}, ¯ λmax = max{λ, λ̄}, sr,max = max{sr , s̄r }, (2.8) which simplifies many of the final expressions. While some conditions could be weakened if we keep them in terms of λ̄ or s̄r explicitly, this would be at the expense of more conditions and readability, and therefore we opt against it. 
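Before turning to the asymptotic theory, the following sketch summarizes how the estimator of eqs. (2.5)–(2.6) is assembled in practice for the coefficients indexed by H (a didactic Python sketch, assuming scikit-learn's Lasso as the underlying solver; the tuning parameters are treated as given here, and their selection is discussed in Section 2.5.1):

```python
import numpy as np
from sklearn.linear_model import Lasso

def desparsified_lasso(X, y, lam, lam_nw, H):
    """Sketch of the desparsified lasso, eqs. (2.5)-(2.6), for coefficients j in H.

    lam    : tuning parameter of the initial lasso regression
    lam_nw : dict {j: lambda_j} of nodewise tuning parameters
    Returns the initial lasso beta_hat, the desparsified estimates b_hat[j],
    the nodewise variances tau2[j], and the nodewise residuals v_hat[j].
    """
    T, N = X.shape
    beta_hat = Lasso(alpha=lam, fit_intercept=False, max_iter=10_000).fit(X, y).coef_
    u_hat = y - X @ beta_hat                          # lasso residuals
    b_hat, tau2, v_hat = {}, {}, {}
    for j in H:
        idx = np.delete(np.arange(N), j)              # columns of X_{-j}
        gamma_j = Lasso(alpha=lam_nw[j], fit_intercept=False,
                        max_iter=10_000).fit(X[:, idx], X[:, j]).coef_
        v_hat[j] = X[:, j] - X[:, idx] @ gamma_j      # nodewise residuals
        tau2[j] = v_hat[j] @ v_hat[j] / T + 2 * lam_nw[j] * np.abs(gamma_j).sum()
        # j-th element of Theta_hat X'(y - X beta_hat)/T equals v_hat_j' u_hat / (T tau_j^2)
        b_hat[j] = beta_hat[j] + v_hat[j] @ u_hat / (T * tau2[j])
    return beta_hat, b_hat, tau2, v_hat
```

The nodewise residuals v̂_j returned here, combined with the lasso residuals û_t, form the products ŵ_{j,t} = v̂_{j,t} û_t that enter the long-run variance estimation discussed in the next subsection.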
2.4.2 Inference on low-dimensional parameters In this section we establish the uniform asymptotic normality of the desparsified lasso focusing on low-dimensional parameters of interest. We consider testing P joint hypotheses of the form RN β 0 = q via a Wald statistic, where RN is an appropriate P × N matrix whose non-zero columns are indexed by the set H := n P o P j : p=1 |rN,p,j | > 0 of cardinality h := |H|. As can be seen from the lemmas in Appendix 2.B, all our results up to application of the central limit theorem allow for h to increase in N (and therefore T ). In Theorem 2.2 we first focus on inference on a finite set of parameters, such that we can apply a standard central limit theorem under the assumptions listed above. An alternative, high-dimensional approach under more stringent conditions is considered in Section 2.4.3. 28 2.4 Uniformly Valid Inference via the Desparsified Lasso Given our time series setting, the long-run covariance matrix " ΩN,T 1 =E T T X t=1 ! wt T X !# w′t , t=1 where wt = (v1,t ut , . . . , vN,t ut )′ , enters the asymptotic distribution in Theorem 2.2. TP −1 ΩN,T can equivalently be written as ΩN,T = Ξ(0) + (Ξ(l) + Ξ′ (l)), where Ξ(l) = l=1 1 T T P E t=l+1  wt w′t−l  . Theorem 2.2. Let Assumptions 2.1 to 2.5 hold, and assume that the smallest eigenvalue of ΩN,T is bounded away from 0. Furthermore, assume that λ2max ≤ h√ i−1 (ln ln T )λrmin T sr,max , and  0 < r < 1 : λmin ≥ (ln ln T ) sr,max √ " r = 0 : s0,max ≤ (ln ln T )−1 2 2 N ( d + m−1 ) √ T N( # T 2 2 d + m−1 ! 1 m ( d1 + m−1 )  r1  1 m ) ( d1 + m−1 ) , N 1/m λmin ≥ (ln ln T ) √ . T Let RN ∈ RP ×N satisfy max ∥r N,p ∥1 ≤ C, where r N,p denotes the p-th row of 1≤p≤P RN , and P, h ≤ C. Then we have that √ d T RN (b̂ − β 0 ) → N (0, Ψ) , uniformly in β 0 ∈ B N (r, sr ), where Ψ := lim N,T →∞ 2 RN Υ−2 ΩN,T Υ−2 R′N and Υ−2 := diag(1/τ12 , . . . , 1/τN ). Remark 2.3. Unlike van de Geer et al. (2014), we do not require the regularization parameters λj to have a uniform growth rate. We only control the slowest and fastest converging λj (covered by λmax and λmin respectively) through convergence rates that also involve N, T , and the sparsity sr,max . We provide a specific example of a joint asymptotic setup for these quantities in Corollary 2.2. Remark 2.4. Belloni et al. (2012) and Chernozhukov et al. (2018), among others, show that sample splitting can improve the convergence rates for the desparsified lasso in IID settings. The idea is to estimate the initial and nodewise regressions with two independent parts of the sample, and exploit this independence to efficiently bound certain terms in the proofs. Efficiency loss is then avoided by so-called cross-fitting 29 2 Desparsified Lasso in Time Series and combining two estimators in which the roles of the two sub-samples are swapped. However, with time series data naive sample splitting will not yield (asymptotically) independent subsamples. Instead, subsamples must carefully be chosen to leave sufficiently large ‘gaps’ in-between to ensure (at least asymptotic) independence. These ideas are explored in Lunde (2019) and Beutner et al. (2021), though for different purposes and dependence concepts. They could however provide a useful starting point for future research on investigating the potential of sample-splitting in the NED framework. 
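Ahead of the formal definition in eq. (2.9) below, the following sketch shows the Bartlett-kernel long-run covariance estimator that we use to estimate Ω_{N,T} in practice (Python; the input W is assumed to stack the estimated products ŵ_{j,t} = v̂_{j,t} û_t for j ∈ H):

```python
import numpy as np

def bartlett_lrv(W, Q_T):
    """Bartlett-kernel (Newey-West type) long-run covariance estimator, cf. eq. (2.9).

    W   : (T, h) array with rows w_hat_t = (v_hat_{j,t} * u_hat_t)_{j in H}
    Q_T : bandwidth; lags l = 1, ..., Q_T - 1 receive weight 1 - l / Q_T
    """
    T = W.shape[0]
    Omega = W.T @ W / T                              # Xi_hat(0)
    for l in range(1, int(Q_T)):
        Xi_l = W[l:].T @ W[:-l] / (T - l)            # Xi_hat(l)
        Omega += (1.0 - l / Q_T) * (Xi_l + Xi_l.T)
    return Omega
```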
In order to estimate the asymptotic variance Ψ, we suggest to estimate ΩN,T with the long-run variance kernel estimator Ω̂ = Ξ̂(0) + QX T −1  K l=1 l QT   ′ Ξ̂(l) + Ξ̂ (l) , (2.9) T P ŵt ŵ′t−l with ŵj,t = v̂j,t ût , the kernel K(·) can be taken as the   Bartlett kernel K(l/QT ) = 1 − QlT (Newey and West, 1987) and the bandwidth QT where Ξ̂(l) = 1 T −l t=l+1 should increase with the sample size at an appropriate rate. A similar heteroskedasticity and autocorrelation consistent (HAC) estimator was considered by Babii et al. (2019), though under a different framework of dependence. In Theorem 2.3, we show −2 −2 )R′N is a consistent estimator of Ψ in our NED framework. √ 1 Theorem 2.3. Take Ω̂ with QT → ∞ as T → ∞, such that QT h2 ( T h2 )− 1/d+m/(m−2) → 0. Assume that that Ψ̂ = RN (Υ̂ Ω̂Υ̂ h i−1 h i−1 p √ QT T sr,max , QT h1/m T 1/m sr,max , −1 min λ2−r max ≤ (ln ln T ) h i−1 h i−1  2/3 Q2T h3/m T (3−m)/m sr,max , QT h1/(3m) T (m+1)/3m sr,max , λ2max ≤ (ln ln T )−1 λrmin h√ T h2/m sr,max  0<r<1: " r=0: s0,max ≤ (ln ln T ) , and 2 λmin ≥ (ln ln T ) sr,max −1 i−1 2 (hN )( d + m−1 ) √ T ! 1 m ( d1 + m−1 ) 1 # √ m ( d1 + m−1 ) T , 2+ 2 ( ) (hN ) d m−1  r1  , λmin ≥ (ln ln T ) (hN )1/m √ . T Furthermore, let RN ∈ RP ×N satisfy max ∥r N,p ∥1 ≤ C and P ≤ Ch. Then under 1≤p≤P Assumptions 2.1 to 2.5, uniformly in β 0 ∈ B N (r, sr ), RN (Υ̂ 30 −2 Ω̂Υ̂ −2 − Υ−2 ΩN,T Υ−2 )R′N p → 0. max 2.4 Uniformly Valid Inference via the Desparsified Lasso Note that here we restrict RN such that the number of hypotheses P may not grow faster than the number of parameters of interest h, but h may grow with T at a controlled rate. Theorem 2.3 therefore allows for variance estimation of an increasing number of estimators. We believe the restrictions on P are reasonable, as they apply to the most commonly performed hypothesis tests in practice, such as joint significance tests (where RN is the identity matrix), or tests for the equality of parameter pairs. As a natural implication of Theorems 2.2 and 2.3, Corollary 2.2 gives an asymptotic distribution result for a quantity composed exclusively of estimated components. Corollary 2.2. Let Assumptions 2.1 to 2.5 hold, and assume that the smallest eigen1 value of ΩN,T is bounded away from 0, and QT T − 2/d+2m/(m−2) → 0 for some QT → ∞. Further, assume that λ ∼ λmax ∼ λmin , and " 0<r<1: (ln ln T )−1 s1/r r,max r=0: 2 2 N ( d + m−1 ) √ T # 1 r m ) ( d1 + m−1 h √ i−1/(2−r) ≤ λ ≤ ln ln T Q2T T sr,max , h √ i−1/2 N 1/m ≤ λ ≤ ln ln T Q2T T s0,max . (ln ln T )−1 √ T d(m−1)(2−r) d+m−1 1 These bounds are feasible when QrT sr,max N (2−r)( dm+m−1 ) T 4 (r− dm+m−1 ) → 0, 2/m and additionally when Q2T s0,max N√T → 0 if r = 0. Under these conditions, for RN ∈ RP ×N with max ∥r N,p ∥1 ≤ C and P, h ≤ C, we have that 1≤p≤P  0 √ r ( b̂ − β ) N,p ≤ z  − Φ(z) = op (1), P T q −2 −2 r N,p (Υ̂ Ω̂Υ̂ )r ′N,p  sup β 0 ∈B N (r,sr ) 1≤p≤P,z∈R sup β 0 ∈B N (r,sr ) h P " # ! i′ R Υ̂−2 Ω̂Υ̂−2 R′ −1 h i N N RN b̂ − q RN b̂ − q ≤ z − FP (z) = op (1), T z∈R (2.10) where Φ(·) is the CDF of N (0, 1), FP (z) is the CDF of χ2P , and q ∈ RP is chosen to test a null hypothesis of the form RN β 0 = q. Corollary 2.2 allows one to perform a variety of hypothesis tests. For a significance test on a single variable j, forinstance, takeRN as the jth basis vector. Then, √ T (b̂j −βj0 ) inference on βj0 of the form P √ ≤ z − Φ(z) = op (1), ∀z ∈ R, can be 4 ω̂j,j /τ̂j obtained where Φ(·) is the standard normal CDF. 
One can standard   q q then4 obtain ω̂j,j /τ̂j4 ω̂j,j /τ̂j confidence intervals CI(α) := b̂j − zα/2 , b̂j + zα/2 , where zα/2 := T T  Φ−1 (1 − α/2), with the property that sup P βj0 ∈ CI(α) − (1 − α) = op (1). β 0 ∈B(sr ) 31 2 Desparsified Lasso in Time Series For a joint test with P restrictions on h variables of interest of the form RN β 0 = q, one can construct a Wald type test statistic based on eq. (2.10), and compare it to the critical value FP−1 (1 − α). Note that these results can also be used to test for nonlinear restrictions of parameters via the Delta method (e.g., Casella and Berger, 2002, Theorems 5.5.23,28). As the bounds and convergence rates as displayed in full generality in Corollary 2.2 may be hard to interpret, we investigate in Example 2.7 how the conditions of Corollary 2.2 can be satisfied in a simplified asymptotic setup, thereby illustrating how the different growth rates interact. As for Corollary 2.1, the conditions on λ effectively require that QT , N , and sr,max grow at a polynomial rate of T , which we exploit in Example 2.7 to simplify the conditions. Example 2.7. The requirements of Corollary 2.2 are satisfied when N ∼ T a for a > 0, sr,max ∼ T b for b > 0, QT ∼ T Q for an arbitrarily small Q > 0, and λ ∼ T −ℓ for 0<r<1: r=0:      b + 1/2 1 1 m 1 1 1 <ℓ< 1 −b + − 2a + , m 2−r d m−1 d m−1 r( d + m−1 ) 2 b + 1/2 1 a <ℓ< − . 2 2 m This choice of ℓ is feasible if      1 1 4b + r m 1 + + 4a + < 1. 2−r d m−1 d m−1 (2.11) There is thus a limit on how fast sr,max and N can grow relative to T , and there exists a trade-off between both: sr,max can grow faster if we limit the growth rate of N , and vice versa. Besides, for larger r, the conditions on the growth rate of sr,max are more strict. The strictness of these bounds is additionally influenced by the number of moments m and the size of the NED −d: the bounds become easier to satisfy when m and d are large. Depending on the growth rates of sr,max and N , inequality (2.11) may put stricter requirements on m and d than those in Assumption 2.1. For example, if we assume that sr,max is asymptotically bounded (b = 0), and N grows proportionally to T (a = 1), then m and d should satisfy 1 d + 1 m−1 < 1 4. If, on the other hand, m and d are allowed to be arbitrarily large, such as when the data are mixing and subexponential, then we only need b < 1−r 2 , and we do not have an effective upper bound on a, implying that N can grow at any polynomial rate of T . For a more general understanding of the restrictions imposed by eq. (2.11), Figure 2.1 shows feasible regions for different combinations of a, b, d, and r, as well as how many moments m 32 2.4 Uniformly Valid Inference via the Desparsified Lasso are needed in those cases. Figure 2.1: Required moments m implied by eq. (2.11). Contours mark intervals of 10 moments, and values above m = 100 are truncated to 100. Non-shaded areas indicate infeasible regions. 2.4.3 Inference on high-dimensional parameters The reason for considering h ≤ C in Theorem 2.2 lies entirely in the application of the central limit theorem. However, while inference on a finite set of parameters covers many cases of interest in practice, it does not allow for simultaneous inference on all parameters. We therefore next consider inference on a growing number of parameters (or hypotheses). We follow the approach pioneered by Chernozhukov 33 2 Desparsified Lasso in Time Series et al. 
(2013) to consider tests which can be formulated as a maximum over individual tests, and apply a high-dimensional CLT for the maximum of a random vector of increasing length. Zhang and Wu (2017) and Zhang and Cheng (2018) provide such a CLT for high-dimensional time series, with serial dependence characterized through the functional dependence framework of Wu (2005), while Chernozhukov et al. (2019) derive a similar result under general β-mixing conditions. In more recent work, Chang et al. (2021) derive a high-dimensional CLT for α-mixing processes, that we base our result on. Recalling that a process which is NED on an α-mixing process can be wellapproximated by a mixing process, this mixing condition remains conceptually close to, if more stringent than, our NED framework.6 We therefore build on their results to provide distributional results for high-dimensional inference in Corollary 2.3. While the core of the proof directly follows by applying the CLT of Chang et al. (2021), one still needs to integrate this with the results from Theorem 2.3 on the consistency of the covariance matrix, as well as adapting the CLT to our estimators. We therefore believe it is worthwhile to state this as a formal result in Corollary 2.3. Correspondingly, we now strengthen our assumptions as follows. Assumption 2.6. (i) Let z t be uniformly α-mixing with mixing coefficients satisfying  αT (q) ≤ C1 exp −C2 q K for some K > 0 and all q ≥ 1. (ii) Let there exist sequences du,T , dv,T , DT = du,T dv,T ≥ 1 such that ∥ut ∥ψ2 ≤ du,T , ∥m′ v t ∥ψ2 ≤ dv,T , ∀m ∈ RN : ∥m∥1 ≤ C, where ∥x∥ψ2 := h n h i o i 2 inf c > 0 : E exp (x/c) − 1 ≤ 1 . Assumption 2.6(i) implies Assumption 2.1(ii). Assumption 2.1(ii) states that the NED process z t can be well-approximated by an α-mixing process; clearly this holds when it is itself α-mixing. More specifically, the sequence is NED on itself, such that Assumption 2.1(ii) is satisfied for any positive d. Furthermore, the exponential decay of the α-mixing coefficients is stricter than our restrictions on sT,t . Similarly, the sub-gaussian moments in Assumption 2.6(ii) imply that all finite moments in Assumption 2.1(i) and Assumption 2.4 exist, so m may be arbitrarily large. Corollary 2.3. Let Assumptions 2.1 to 2.6 hold, and let h ∼ T H for H > 0, N ∼ T a for a > 0, sr,max ∼ T b for 0 < b < 1−r 2 , QT ∼ T Q for 0 < Q < 2/3 and λmin ∼ 6 Ideally one would directly have a high-dimensional CLT available for NED processes, such that it would directly fit to our assumptions. However, such a result is, to our knowledge, currently not available in the literature. While such a result would clearly be very interesting to obtain, this is left for future research given the intricacies needed to derive it. 34 2.5 Analysis of Finite-Sample Performance λmax ∼ λ ∼ T −ℓ where 0<r<1: r=0: b + 1/2 1/2 − b <ℓ< , 2−r r b + 1/2 < ℓ < 1/2. 2 Additionally, let the smallest eigenvalue of ΩN,T be bounded away from 0, and 2/3 DT (ln T )(1+2K)/(3K) T 1/9 + sup P DT (ln T )7/6 T 1/9 √  z∈R,β 0 ∈B N (r,sr ) max 1≤p≤P → 0. Then, for 1/C ≤ max ∥r N,p ∥1 ≤ C, P ≤ Ch, 1≤p≤P      T r N,p b̂ − β 0 ≤ z − P∗ max ĝp ≤ z = op (1), 1≤p≤P where ĝ is a P -dimensional vector which is distributed as N (0, RN Υ̂ −2 Ω̂Υ̂ −2 R′N ) conditionally on the data, and P∗ is the corresponding conditional probability. 
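The conditional distribution of max_p ĝ_p is not available in closed form, but it is straightforward to approximate by simulation. A minimal sketch (Python; R, Ω̂ and the τ̂_j² are assumed to have been computed as above, and all names are illustrative):

```python
import numpy as np

def max_stat_critical_value(R, Omega_hat, tau2_hat, alpha=0.05, n_sim=100_000, seed=0):
    """Simulated (1 - alpha) critical value for max_p g_hat_p in Corollary 2.3,
    where g_hat ~ N(0, R Upsilon_hat^{-2} Omega_hat Upsilon_hat^{-2} R')."""
    rng = np.random.default_rng(seed)
    Ups_inv2 = np.diag(1.0 / np.asarray(tau2_hat))
    Psi_hat = R @ Ups_inv2 @ Omega_hat @ Ups_inv2 @ R.T
    g = rng.multivariate_normal(np.zeros(R.shape[0]), Psi_hat, size=n_sim)
    return np.quantile(g.max(axis=1), 1 - alpha)
```

The sketch only illustrates the simulation step; in practice one would typically work with absolute values or studentized versions of the individual statistics, for instance within the step-down procedure mentioned in the next paragraph.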
Unlike Corollary 2.2, Corollary 2.3 allows one to simultaneously test a growing number of hypotheses, while controlling for family-wise error rate, for example by the stepdown method described in Section 5 of Chernozhukov et al. (2013). One such test is an overall test of significance, with the  null hypothesis β 0 = 0; in this case P = h = N and RN = I. Note that although P max ĝp ≤ z cannot be calculated 1≤p≤P analytically, it can easily be approximated with arbitrary accuracy by simulation. Due to the stronger assumptions in Corollary 2.3, we can relax the conditions on the growth rates of N and sr,max compared to Corollary 2.2 and Example 2.7. In particular, the size of a and H are not restricted, meaning that N and h can grow at an arbitrarily large polynomial rate of T . The conditions on sr,max can also be √ relaxed so it can grow up to a rate of T , depending on r. This corresponds to our analysis in Example 2.7 when we let m and d tend to infinity. 2.5 Analysis of Finite-Sample Performance We analyze the finite sample performance of the desparsified lasso by means of simulations. We start by discussing tuning parameter selection in Section 2.5.1. We then discuss three simulation settings: a high-dimensional autoregressive model with exogenous variables (in Section 2.5.2), a factor model (in Section 2.5.3), and a weakly sparse VAR model (in Section 2.5.4). In Section 2.5.2 and Section 2.5.3, we compute coverage rates of confidence intervals for single hypothesis tests. In Section 2.5.4, we perform a multiple hypothesis test for Granger causality. 35 2 Desparsified Lasso in Time Series 2.5.1 Tuning parameter selection While the previous sections give some theoretical restrictions on the tuning parameter choice, these results cannot be used in practice since its value depends on properties of the underlying model that are unobservable. In this section, we provide a feasible recommendation to select the tuning parameters (in both the original regression and nodewise regressions) in a data-driven way. In particular, we adapt the iterative plug-in procedure (PI) used in, for instance, Belloni et al. (2012, 2014, 2017) to a time series setting. We build on the theoretical relation between the tuning parameter and the empirical process in Theorem 2.1, namely the restriction that 1 T X ′u ∞ ≤ Cλ needs to hold with high probability, to guide the choice of λ. For large N and T , 1 T X ′u ∞ can be approximated by the maximum over an N -dimensional multivariate Gaussian distribution with covariance   (E) matrix ΩN,T = E T1 X ′ uu′ X .7 One may therefore approximate its quantiles by simulating from a multivariate Gaussian with covariance matrix a consistent estimate (E) Ω̂(E) of ΩN,T . Our time series setting requires the usage of a consistent long-run variance estimator, which is provided by Theorem 2.3. We therefore take Ω̂(E) as in eq. (2.9) with T P (E) Ξ̂ (l) = T 1−l xt ût ût−l x′t−l . We set the number of lags in the long-run covarit=l+1 ance estimator as the automatic bandwidth estimator in Andrews (1991), specifically   QT = 1.1447(α̂(1)T )1/3 , with α̂(1) computed based on an AR(1) model, as detailed in eq. (6.4) therein. As the estimates ût require a choice of λ, we iterate the algorithm until the chosen λ converges. Full details are provided in Algorithm 2.1, Appendix 2.C.6. Throughout all simulations, the lasso estimates are obtained through the coordinate descent algorithm (Friedman et al., 2010) applied to standardized data. Remark 2.5. 
We opt to only base our empirical choice for λ on its relation to the empirical process and hence the set ET (·) in Theorem 2.1, not on its relation to the set CC Sλ which also implies a lower hound λ. The latter bound, however, requires one to approximate ∥Σ̂ − Σ∥max which is considerably more difficult as it cannot be approximated by plugging in estimated quantities directly. With eigenvalue assumptions typically stated in terms of the sample rather than the population, this kind of additional restriction may be avoided, but such assumptions often still need to be justified by showing that the sample covariance matrix is close to the population matrix. As the additional bound only appears under weak sparsity (r > 0), it can also be avoided by assuming exact sparsity. However, given that weak sparsity may often 7 Under minimal extra assumptions (sub-Gaussian moments for x , and minimum eigenvalue of t the long-run covariance matrix bounded away from 0), Corollary 2.3 substantiates the validity of this approximation. 36 2.5 Analysis of Finite-Sample Performance be the more relevant concept in practice, it may well be that the extra restriction on λ from bounding ∥Σ̂ − Σ∥max is relevant beyond this chapter. Investigating ways to incorporate this in the tuning parameter selection therefore seems an interesting avenue for future research. 2.5.2 Autoregressive model with exogenous variables Inspired by the simulation studies in Kock and Callot (2015) (Experiment B) and Medeiros and Mendes (2016), we take the following DGP yt = ρyt−1 + β ′ xt−1 + ut , xt = A1 xt−1 + A4 xt−4 + ν t , where xt is a (N − 1) × 1 vector of exogenous variables. In this simulation design (and the following ones), we consider different values of the time series length T = {100, 200, 500, 1000} and number of regressors N = {101, 201, 501, 1001}. For this data generating process, we take ρ = 0.6, βj = √1s (−1)j for j = 1, . . . , s, and zero otherwise. For N = 101, 201 we set s = 5 and s = 10 for N = 501, 1001. The autoregressive parameter matrices A1 and A4 are block-diagonal with each block of dimension 5 × 5. Within each matrix, all blocks are identical with typical elements of 0.15 and -0.1 for A1 and A4 respectively. Due to the misspecification of nodewise regressions, there is induced autocorrelation in the nodewise errors vj,t . However, the block diagonal structure of A1 and A4 keeps the sparsity of nodewise regressions constant asymptotically. We consider different processes for the error terms ut and ν t : (A) IID errors: (ut , ν ′t )′ ∼ IID N (0, I). Since all moments of the Normal distribution are finite, all moment conditions are satisfied. (B) GARCH(1,1) errors: ut = √ ht εt , ht = 5 × 10−4 + 0.9ht−1 + 0.05u2t−1 , εt ∼ IID N (0, 1), νj,t ∼ ut for j = 1, . . . , N − 1. Under this choice of GARCH   parameters, not all moments of ut are guaranteed to exist, but E u24 < ∞. t (C) Correlated errors: ν t ∼ IID N (0, S), where S has a Toeplitz structure Sj,k = (−1)|j−k| ρ|j−k|+1 , with ρ = 0.4. For all designs, we evaluate whether the 95% confidence intervals corresponding to ρ and at the q correct rates. The intervals are constructed  β1 cover  values   q their true ω̂1,1 /τ̂14 ω̂2,2 /τ̂24 as ρ̂ ± z0.025 and β̂1 ± z0.025 . These results are obtained based T T on 2,000 replications. The rates at which the intervals contain the true values are reported in Table 2.1. 37 2 Desparsified Lasso in Time Series Table 2.1: Autoregressive model with exogenous variables: 95% confidence interval coverage. 
The mean interval widths are reported in parentheses.

Coverage for ρ:

Model  N \ T    100            200            500            1000
A      101      0.958 (0.366)  0.953 (0.220)  0.951 (0.113)  0.948 (0.070)
       201      0.965 (0.387)  0.955 (0.224)  0.959 (0.116)  0.955 (0.071)
       501      0.937 (0.418)  0.950 (0.238)  0.955 (0.129)  0.952 (0.081)
       1001     0.936 (0.429)  0.950 (0.244)  0.944 (0.130)  0.946 (0.083)
B      101      0.961 (0.374)  0.957 (0.219)  0.953 (0.115)  0.941 (0.071)
       201      0.949 (0.387)  0.959 (0.227)  0.954 (0.117)  0.959 (0.073)
       501      0.951 (0.425)  0.960 (0.241)  0.953 (0.130)  0.954 (0.082)
       1001     0.937 (0.434)  0.960 (0.246)  0.947 (0.131)  0.942 (0.084)
C      101      0.964 (0.410)  0.960 (0.231)  0.956 (0.121)  0.943 (0.080)
       201      0.975 (0.421)  0.965 (0.239)  0.968 (0.123)  0.964 (0.081)
       501      0.969 (0.457)  0.965 (0.260)  0.951 (0.129)  0.948 (0.081)
       1001     0.974 (0.475)  0.960 (0.265)  0.957 (0.132)  0.960 (0.082)

Coverage for β1:

Model  N \ T    100            200            500            1000
A      101      0.809 (0.383)  0.731 (0.257)  0.751 (0.152)  0.843 (0.102)
       201      0.790 (0.388)  0.720 (0.258)  0.721 (0.154)  0.802 (0.103)
       501      0.850 (0.399)  0.786 (0.260)  0.773 (0.165)  0.770 (0.113)
       1001     0.819 (0.388)  0.777 (0.260)  0.780 (0.164)  0.821 (0.114)
B      101      0.797 (0.390)  0.735 (0.261)  0.760 (0.153)  0.839 (0.102)
       201      0.810 (0.398)  0.726 (0.260)  0.721 (0.156)  0.817 (0.103)
       501      0.838 (0.400)  0.796 (0.263)  0.759 (0.165)  0.775 (0.114)
       1001     0.820 (0.394)  0.787 (0.261)  0.769 (0.165)  0.806 (0.115)
C      101      0.936 (0.628)  0.887 (0.394)  0.902 (0.232)  0.911 (0.166)
       201      0.917 (0.646)  0.899 (0.398)  0.901 (0.233)  0.900 (0.166)
       501      0.950 (0.665)  0.935 (0.420)  0.892 (0.243)  0.903 (0.168)
       1001     0.947 (0.669)  0.938 (0.421)  0.895 (0.244)  0.894 (0.168)

We start by discussing the results for the model with Gaussian errors (Model A). Coverage for ρ is close to the nominal level of 95% for all combinations of N and T, with some combinations producing slightly conservative results. The coverage rates for β1 are worse than for ρ. This is likely due to the fact that the exogenous variables x_t within the same block are strongly correlated with each other, which negatively impacts the performance of the lasso. Turning to the results for the model with GARCH errors (Model B), similar finite sample coverage rates are obtained. We do see a small increase in the mean interval width, which is to be expected given the heteroskedastic error structure. With correlated errors (Model C), we again observe consistent coverage rates near the nominal level for ρ. Interestingly, the coverage rates for β1 appear considerably better than in Models A and B, though in most cases still remaining below the nominal rate at around 90%. We also observe higher mean interval widths than in Model A, which is due to the larger variance of x_t induced by the cross-sectional covariance of the errors.

In Appendix 2.C.7, we provide details on an examination of various selection methods for tuning parameters through heat maps for the coverage levels, which also shed some further light on the relatively poor performance for β1 compared to ρ visible for Models A and B. In addition to selection by our PI method, we indicate
As expected, the AIC produces, overall, the least sparse solutions, the EBIC the sparsest and BIC lies in between. PI lies mostly between the BIC and EBIC. Third, there is a region of relatively low coverage for large values of the tuning parameter in the initial and nodewise regressions (see the top right corner of the heat maps). This occurs more pronouncedly for β1 than for ρ and especially for T = 1000. Since PI tends to select near this region, it partly explains why its coverage is worse for β1 . The relatively better coverage of β1 in Model C is matched by this region being much less prominent. Given that the regions of good coverage are in different places for ρ and β1 , using the BIC or EBIC for generally smaller or larger λ would not lead to consistently better coverage across scenarios.9 2.5.3 Factor model We take the following factor model yt = β ′ xt + ut , ut ∼ IID N (0, 1) xt = Λft + ν t , ν t ∼ IID N (0, I), ft = 0.5ft−1 + εt , εt ∼ IID N (0, 1), where xt is a N × 1 vector generated by the AR(1) factor ft . We take β as in Section 2.5.2 with s increased by one to match the number of non-zero parameters. The N × 1 vector of factor loadings Λ is chosen with the first s entries (corresponding to the variables with non-zero entries in β) set to 0.5, and the remaining entries Λi = (i − s + 1)−1 . This choice of weakly sparse factor loadings ensures that the nodewise regressions are weakly sparse too, as shown in Example 2.5. By letting the large loadings coincide with the non-zero entries in β, we ensure that there is a large potential for incurring (omitted variable) bias in the estimates, and thus that this DGP provides a serious test for the desparsified lasso.   q ω̂1,1 /τ̂24 We investigate whether the confidence interval for β1 , β̂1 ± z0.025 , covT ers the true value at the correct rate. Results are reported in Table 2.2. Coverage 8 For additional stability in the high-dimensional settings, we restrict the BIC, AIC, and EBIC to only select models with at most T /2 nonzero parameters, though this restriction appears to be binding for the AIC only. 9 To confirm this analysis, we also performed the simulations results for all three setups using selection of λ by BIC (the best performing information criterion); in line with the heat maps, the coverage rates for BIC are generally somewhat worse than for PI. Results are available upon request. 39 2 Desparsified Lasso in Time Series Table 2.2: Factor model: 95% confidence interval coverage for β1 . The mean interval widths are reported in parentheses. N \T 101 201 501 1001 100 0.890 200 0.851 500 0.889 1000 0.907 (0.480) (0.299) (0.163) (0.112) 0.873 0.849 0.879 0.897 (0.490) (0.307) (0.165) (0.112) 0.956 0.940 0.890 0.910 (0.489) (0.327) (0.180) (0.117) 0.951 0.943 0.881 0.896 (0.498) (0.331) (0.184) (0.117) rates improve with growing values of N and T , with empirical coverages of approximately 85% for small N and T , and increasing towards the nominal level when either N or T increases. This result is therefore in line with our theoretical framework, and provides a relevant practical setting in which the desparsified lasso is appropriate to use even if exact sparsity is not present. 2.5.4 Weakly sparse VAR(1) Inspired by Kock and Callot (2015) (Experiment D), we consider the VAR(1) model z t = (yt , xt , wt )′ = A1 z t−1 + ut , ut ∼ IID N (0, 1), with z t a (N/2)×1 vector. We focus on testing whether xt Granger causes yt by fitting a a VAR(2) model, such that we have a total of N explanatory variables per equation. 
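For the Granger causality test described in the remainder of this subsection, the Wald statistic of eq. (2.10) can be assembled from the quantities computed in Section 2.4. A hedged sketch (Python; the function and argument names are illustrative and not part of any package):

```python
import numpy as np
from scipy.stats import chi2

def wald_test(b_hat_H, tau2_hat_H, Omega_hat, R, q, T):
    """Wald statistic of eq. (2.10) for H0: R beta^0 = q, with all inputs
    restricted to the variables of interest in H; returns the statistic
    and its chi-squared(P) p-value."""
    Ups_inv2 = np.diag(1.0 / np.asarray(tau2_hat_H))
    Psi_hat = R @ Ups_inv2 @ Omega_hat @ Ups_inv2 @ R.T
    diff = R @ np.asarray(b_hat_H) - q
    W = T * diff @ np.linalg.solve(Psi_hat, diff)
    return W, chi2.sf(W, df=R.shape[0])
```

In the Granger causality test below, R picks out the coefficients on the first and second lag of x_t in the y-equation and q = (0, 0)′.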
(j,k) The (j, k)-th element of the autoregressive matrix A1 = (−1)|j−k| ρ|j−k|+1 , with (1,2) A1 = 0; to measure the power of ρ = 0.4. To measure the size of the test, we set the test, we keep its regular value of −ρ . Weak sparsity holds10 under our choice 2 of the autoregressive parameters, but exact sparsity is violated by having half of the parameters non-zero. Note that the desparsified lasso is convenient for estimating the full VAR equation-by-equation, since all equations share the same regressors, and Θ̂ needs to be computed only once. For our Granger causality test, however, only a single equation needs to be estimated. We test whether xt Granger causes yt by regressing yt on the first and second lag of (1,2) (1,2) z t . To this end, we test the null hypothesis A1 = A2 = 0 by using the Wald test  ′ (1,2) (1,2) statistic in eq. (2.10), with b̂H = 0, Â1 , 0 . . . 0, Â2 , 0 . . . 0 , H = {2, N/2 + 1}, ′ (1,2) (1,2) and Â1 , Â2 obtained by regressing yt on z ′t−1 , z ′t−2 . We reject the null 10 The weak sparsity measure is N P j=1 B = 0. 40 |ρj |r with asymptotic limit ρr 1−ρr < ∞, trivially satisfying 2.6 Conclusion Table 2.3: Weakly sparse VAR: Joint test rejection rates for a nominal size of α = 5%. N \T 102 202 502 1002 100 0.050 0.062 0.051 0.059 Size 200 500 0.070 0.070 0.075 0.081 0.067 0.106 0.083 0.101 1000 0.073 0.078 0.076 0.091 100 0.415 0.411 0.401 0.407 Power 200 500 0.751 0.982 0.775 0.987 0.776 0.990 0.769 0.995 1000 1.000 1.000 1.000 1.000 hypothesis when the statistic exceeds χ22,0.05 ≈ 5.99. We start by discussing the size of the test in Table 2.3. Overall, the empirical sizes exceed the nominal size of 5%, with performance generally not improving for larger sample sizes. In particular, rejection rates slightly deteriorate for larger N . However, the observed changes in performance across N and T are rather small and may be due to simulation randomness. The power of the test increases with both N and T , reaching 1 at T = 1000 regardless of the value for N . To improve the finite-sample performance of the method, a natural extension would be to consider the bootstrap for constructing confidence intervals as opposed to asymptotic theory. Bootstrap-based inference for desparsified lasso methods in high dimensions has already been explored by several authors, for example Dezeure et al. (2017) in the IID setting, and in time series by Krampe et al. (2021), Chernozhukov et al. (2019) and Chernozhukov et al. (2021). In particular, block or block multiplier bootstrap methods, which would allow one to capture serial dependence nonparametrically, would fit our setup well. The block bootstrap has the additional advantage of correcting the finite-sample performance of statistics based on long-run variance estimators, which might be a factor for our tests as well (Gonçalves and Vogelsang, 2011). However, due to the lack of theory about such bootstrap methods, and the associated selection of tuning parameters like the block length, for high-dimensional NED processes, we do not consider such methods here. The development of such theory would be a highly relevant and interesting topic for future research. 2.6 Conclusion We provide a complete set of tools for uniformly valid inference in high-dimensional stationary time series settings, where the number of regressors N can possibly grow at a faster rate than the time dimension T . 
Our main results include (i) an error bound for the lasso under a weak sparsity assumption on the parameter vector, thereby establishing parameter and prediction consistency; (ii) the asymptotic normality of the 41 2 Desparsified Lasso in Time Series desparsified lasso under a general set of conditions, leading to uniformly valid inference for finite subsets of parameters; (iii) asymptotic normality of a maximum-type statistic of a growing, high-dimensional, number of tests, valid under more stringent conditions, thereby also permitting simultaneous inference over a potentially large number of parameters, and (iv) a consistent Bartlett kernel Newey-West long-run covariance estimator to conduct inference in practice. These results are established under very general conditions, thereby allowing for typical settings encountered in many econometric applications where the errors may be non-Gaussian, autocorrelated, heteroskedastic and weakly dependent. Crucially, this allows for certain types of misspecified time series models, such as omitted lags in an AR model. Through a small simulation study, we examine the finite sample performance of the desparsified lasso in popular types of time series models. We perform both single and joint hypothesis tests and examine the desparsified lasso’s robustness to, amongst others, regressors and error terms exhibiting serial dependence and conditional heteroskedasticity, and a violation of the sparsity assumption in the nodewise regressions. Overall our results show that good coverage rates are obtained even when N and T increase jointly. The factor model design shows that the desparsified lasso remains applicable when the exact sparsity assumption of the nodewise regressions is violated. Finally, Granger causality tests in the VAR are slightly oversized, but empirical sizes generally remain close to the nominal sizes, and the test’s power increases with both N and T . There are several extensions to our approach that are interesting to consider. The development of a high-dimensional central limit theorem for NED processes would allow to weaken the dependence conditions needed for establishing simultaneous, high-dimensional inference. Similarly, using sample splitting would likely allow for weakening sparsity assumptions. Finally, improvements in finite sample performance may be achieved by bootstrap procedures. All of these extensions would require the development of novel theory, and thus provide challenging but worthwhile avenues for future research. Acknowledgements We thank the editor, associate editor and three referees for their thorough review and highly appreciate their constructive comments which substantially improved the quality of the manuscript. The first and second author were financially supported by the Dutch Research Council (NWO) under grant number 452-17-010. The third author was supported by 42 2.6 Conclusion the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 832671. Previous versions of this chapter were presented at CFE-CM Statistics 2019, NESG 2020, Bernoulli-IMS One World Symposium 2020, (EC)2 2020, and the 2021 Maastricht Workshop on Dimensionality Reduction and Inference in High-Dimensional Time Series. We gratefully acknowledge the comments by participants at these conferences. In addition, we thank Etienne Wijler for helpful discussions. All remaining errors are our own. 
43 2 Desparsified Lasso in Time Series Appendix 2.A Proofs for Section 2.3 This section provides the theory for the lasso consistency established in Section 2.3. We first provide some definitions in Section 2.A.1 and preliminary lemmas in Section 2.A.2 which are proved in the Supplementary Appendix 2.C.1. The proofs of the main results are then provided in Section 2.A.3. 2.A.1 Definitions Definition 2.A.1 (Near-Epoch Dependence, Davidson (2002b), ch. 17). Let there ex∞ ist non-negative NED constants {ct }∞ t=−∞ , an NED sequence {ψq }q=0 such that ψq → 0 t−l+q as q → ∞, and a (possibly vector-valued) stochastic sequence {st }∞ t=−∞ with Ft−l−q = t−l+q ∞ σ{st−q , . . . , st+q }, such that {Ft−l−q }q=0 is an increasing sequence of σ-fields. For p > 0, the random variable {Xt }∞ t=−∞ is Lp -NED on st if  p i1/p  h  t−l+q ≤ ct ψq . E Xt − E Xt |Ft−l−q for all t and q ≥ 0. Furthermore, we say {Xt } is Lp -NED of size −d on st if ψq = O(q −d−ε ) for some ε > 0. Definition 2.A.2 (Mixingale, Davidson (2002b), ch. 16). Let there exist non-negative ∞ mixingale constants {ct }∞ t=−∞ and mixingale sequence {ψq }q=0 such that ψq → 0 as q → ∞. For p ≥ 1, the random variable {Xt }∞ t=−∞ is an Lp -mixingale with respect to the σ-algebra {Ft }∞ t=−∞ if (E [|E (Xt |Ft−q )|p ])1/p ≤ ct ψq , (E [|Xt − E (Xt |Ft+q )|p ])1/p ≤ ct ψq , for all t and q ≥ 0. Furthermore, we say {Xt } is an Lp -mixingale of size −d with respect to {Ft } if ψq = O(q −d−ε ) for some ε > 0. Note that the latter condition holds automatically when Xt is Ft -measurable, as is the case in this chapter. We use the same notation for the constants ct and sequence ψq as with near-epoch dependence, since they play the same role in both types of dependence. 2.A.2 Preliminary results Lemma 2.A.1. Under Assumption 2.1, for every j = 1, . . . , N , {ut xj,t } is an Lm -Mixingale with respect to Ft = σ {z t , z t−1 , . . . }, with non-negative mixingale constants ct ≤ C and ∞ P sequence ψq satisfying ψq < ∞. q=1 Lemma 2.A.2. Under Assumption 2.1, {xi,t xj,t − Exi,t xj,t } is Lm̄ -bounded and an Lm mixingale with respect to Ft = σ {z t , z t−1 , . . . }, with non-negative mixingale constants ct ≤ C, and mixingale sequences of size −d. 44 2.A Proofs for Section 2.3 Lemma 2.A.3. Recall the set CC T (S) := n Σ̂ − Σ max o ≤ C/ |S| and Sλ = {j : βj0 > λ}. Under Assumptions 2.1 to 2.3, for a sequence ηT → 0 such that ηT ≤ N2 , e if the following is satisfied d+m−1 λ−r sr ≤ CηTdm+m−1 √  1 T 2 d + 1 m m−1 2 N ( d + m−1 ) . then P (CC T (Sλ )) ≥ 1 − 3ηT → 1 as N, T → ∞.   l   P ut xj,t ≤ z . Under Assumption 2.1, we have Lemma 2.A.4. Let ET (z) := max j≤N,l≤T t=1 for z > 0 that  √ m T . P (ET (z)) ≥ 1 − CN z 2 ′ Lemma 2.A.5. Take an index set S with cardinality |S|. Assuming that n ∥β S ∥1 ≤ C|S|β Σβ o  N holds for β ∈ R : ∥β S c ∥1 ≤ 3∥β S ∥1 , then on the set CC T (S) = ∥Σ̂ − Σ∥max ≤ C/|S| ∥β S ∥1 ≤ C q |S|β ′ Σ̂β,  for β ∈ RN : ∥β S c ∥1 ≤ 3∥β S ∥1 . Lemma 2.A.6. Let Assumption 2.3 hold for an index set S, i.e. ϕ2Σ (S) ≥ 1/C =⇒ ∥z S ∥21 ≤ C |S| z ′ Σz. On the set ET (T λ/4) ∩ CC T (S): ∥X(β̂ − β 0 )∥22 λ 8 + ∥β̂ − β 0 ∥1 ≤Cλ2 |S| + λ∥β 0S c ∥1 . T 4 3 Lemma 2.A.7. Under Assumptions 2.2 and 2.3, on the set CC T (Sλ ) ∩ ET (T λ/4), ∥X(β̂ − β 0 )∥22 λ + ∥β̂ − β 0 ∥1 ≤ Cλ2−r sr . T 4 2.A.3 Proofs of the main results Proof of Theorem 2.1. In this proof we combine the results of Lemmas 2.A.3 and 2.A.4. √ By applying Lemma 2.A.4 to the set ET (T λ/4), we have that P (ET (T λ/4)) ≥ 1−CN (λ T )−m . 
√ −m Choose ηT such that N (λ T ) ≤ ηT , meaning that P (ET (T λ/4)) ≥ 1 − ηT −1/m N λ ≥ CηT when For Lemma 2.A.3, we need that ηT ≤ N2 , e 1/m √ T . which is true for sufficiently large N, T , since N diverges, and ηT converges with T → ∞. Then d+m−1 P (CC T (Sλ )) ≥ 1 − ηT when λ−r sr ≤ CηTdm+m−1  √ T 2 1 d 2 N ( d + m−1 ) + 1 m m−1 . 45 2 Desparsified Lasso in Time Series When 0 < r < 1 , the required bound for the set ET (T λ/4) is dominated by the bound for CC T (Sλ ) when sr does not converge to 0, i.e. sr ≥ 1/C (when sr → 0 these results are  2 2  1 1m ( d + m−1 ) ( d + m−1 ) 1/m trivial). To show this, note that for m > 2, d ≥ 1, N √T ≥ N√T , d+m−1 − r(dm+m−1) ηT P (CC T (Sλ ) T −1/−m ≥ ηT , and 1/r > 1. The result then follows by the union bound, ET (T λ/4)) ≥ 1 − (1 − P(CC T (Sλ ))) − (1 − P(ET (T λ/4))) ≥ 1 − CηT → 1 as N, T → ∞. The result of the theorem follows from choosing ηT = C(ln ln T )−1 . ■ Proof of Corollary 2.1. By Theorem 2.1, the set CC T (Sλ )∩ET (T λ/4) holds with probability at least 1 − CηT , and so the error bound of Lemma 2.A.7 holds with the same probability. With the error bound, items (i) and (ii) follow straightforwardly. Appendix 2.B ■ Proofs for Section 2.4 This section provides the theory for the desparsified lasso established in Section 2.4. We first provide some preliminary lemmas in Section 2.B.1 which are proved in the Supplementary Appendix 2.C.2. The proofs of the main results are then provided in Section 2.B.2. 2.B.1 Preliminary results Lemma 2.B.1. Under Assumptions 2.1 and 2.4, the following holds: (i) E [vj,t ] = 0, ∀j, E [vj,t xk,t ] = 0, ∀k ̸= j, t. (ii) max 1≤j≤N, 1≤t≤T E [|vj,t xj,t |m ] ≤ C. (j) (iii) {vj,t xk,t } is an Lm -Mixingale with respect to Ft = σ {vj,t , x−j,t , vj,t−1 , x−j,t−1 , . . . }, ∀k ̸= j, with non-negative mixingale constants ct ≤ C and sequences ψq satisfying ∞ P ψq ≤ C. q=1 Lemma 2.B.2. Let wt = (w1,t , . . . , wN,t )′ with wj,t = vj,t ut . Under Assumptions 2.1 and 2.4 the following holds: (i) {wj,t } is Lm̄ -bounded and an Lm -Mixingale of size −d uniformly over j ∈ {1, . . . , N } with respect to Ft = σ {ut , v t , ut−1 , v t−1 , . . . }, with non-negative mixingale constants C1 ≤ ct ≤ C2 . (ii) max q≤j,k≤N, 1≤t≤T |E [wj,t wk,t−l ]| ≤ Cϕl , where ϕl is a sequence of size −d, and the co- variances are therefore absolutely summable. (iii) For all l, {wj,t wk,t−l − E [wj,t wk,t−l ]} is Lm/2 -bounded and an L1 -Mixingale of size −d uniformly over j, k ∈ {1, . . . , N } with respect to Ft , with non-negative mixingale constants ct ≤ C. 46 2.B Proofs for Section 2.4 Lemma 2.B.3. Recall the sets CC T (S) := n Σ̂ − Σ max o ≤ C/ |S| , Sλ = {j : βj0 > λ}, 0 and Sλ,j := {k : γj,k > λj }. Under Assumptions 2.1 to 2.3, for a sequence ηT → 0 such that ηT ≤ N2 , e if the following is satisfied d+m−1 dm+m−1 λ−r min sr,max ≤ CηT √  1 T d + 1 m m−1 , 2 2 N ( d + m−1 ) ! T P CC T (Sλ ) CC T (Sλ,j ) ≥ 1 − 3(1 + h)ηT . j∈H Lemma 2.B.4. Under Assumptions 2.1 and 2.4, for xj > 0 the following holds ! \ P (j) ET (xj ) ≥1−C j∈H hN T m/2 . min xm j j∈H  Lemma 2.B.5. Define the set LT := max j∈H 1 T T P 2 vj,t − τj2 ≤ t=1 h δT  , and let Assumption 2.4 hold. When √ 1 δT ≤ CηT ( T h) 1/d+m/(m−1) , dm+m−1 P (LT ) ≥ 1 − 3ηTd+m−1 → 1 as N, T → ∞. Lemma 2.B.6. Under Assumption 2.5(ii) 1 ≤ τj2 ≤ C, uniformly over j = 1, . . . , N. C Furthermore, define the set PT,nw := (j) T ET (T j∈H λj 4 (2.B.1) ) T CC T (Sλ,j ) and let Assumption 2.5(i) j∈H hold. 
On the set PT,nw ∩ LT , we have max τ̂j2 − τj2 ≤ j∈H q h + C1 λ̄2−r s̄r + C2 λ̄2 λ−r s̄r , ¯ δT and 1 1 max 2 − 2 ≤ j∈H τ̂j τj q + C1 λ̄2−r s̄r + C2 λ̄2 λ−r s̄r ¯  . q C3 − C4 δhT + C1 λ̄2−r s̄r + C2 λ̄2 λ−r s̄r h δT ¯ Lemma 2.B.7. Under Assumption 2.5(i)-(ii), it holds for a sufficiently large T that on the T (j) λj set ET (T 4 ) ∩ LT , j∈H n o max ∥e′j − Θ̂j Σ̂∥∞ ≤ j∈H C1 − h δT λ̄ , − C2 λ̄2−r s̄r where Θ̂j is the jth row of Θ̂. 47 2 Desparsified Lasso in Time Series Lemma 2.B.8. Define ∆ :=   √  T Θ̂Σ̂ − I β̂ − β 0 , and PT,las := ET (T λ4 ) ∩ CC T (Sλ ) Under Assumptions 2.1, 2.2 and 2.5(i)-(ii), on the set PT,las ∩ PT,nw ∩ LT we have that max |∆j | ≤ √ j∈H T λ1−r sr C1 − λ̄ . − C2 λ̄2−r s̄r h δT Lemma 2.B.9. Under Assumption 2.5(i)-(ii), on the set ET (T λ) ∩ PT,nw , √ 1 max √ v̂ ′j u − v ′j u ≤ C T λ2−r max s̄r . j∈H T  (j) Lemma 2.B.10. Define the set ET,uv (x) := T and 2.4, for x > 0 it follows that P (j)  s P max vj,t ut ≤ x . Under Assumptions 2.1 s≤T t=1 ! ET,uv (x) j∈H ≥1− ChT m/2 . xm Lemma 2.B.11. Under Assumptions 2.1 and 2.3 to 2.5(i)-(ii), on the set √ T (j) ET,uv (h1/m T 1/2 ηT−1 ) with ηT−1 ≤ C T , we have ET (T λ) ∩ PT,nw ∩ LT j∈H q √ 1/m −1 h1/m ηT−1 δhT + C1 h1/m ηT−1 T λ2−r ηT λ̄2 λ−r s̄r max s̄r + C2 h 1 v̂ ′j u 1 v ′j u ¯   √ . max √ − ≤ q j∈H T τ̂j2 T τj2 −r h C3 − C4 δT + C1 λ̄2−r s̄r + C2 λ̄2 λ s̄r ¯ Lemma 2.B.12. For any process {dt }Tt=1 and constant x > 0, define the set ET,d (x) :=  ∥d∥∞ ≤ x . Let maxt E |dt |p ≤ C < ∞. Then for x > 0, P ({ET,d (x)}c ) ≤ Cx−p T . Lemma 2.B.13. Under Assumptions 2.1, 2.2, 2.4 and 2.5(i)-(ii), on the set PT,uv := PT,las ∩ PT,nw ∩ ET,uvw , max (j,k)∈H 2 T h i2 1 X (ŵj,t ŵk,t−l − wj,t wk,t−l ) ≤ C1 T 1/2 λ2−r max sr,max T t=l+1 q h 1 m+1 i3 1 1 3−m 3 2 2−r + C2 h m T m λ2−r s + C h m T m λ2−r . 3 max sr,max + C4 h 3m T 3m λmax sr,max max r,max Lemma 2.B.14. Define ( ET,ww (x) := max (j,k)∈H 2 ) T 1 X (wj,t wk,t−l − Ewj,t wk,t−l ) ≤ x . T t=l+1 Under Assumptions 2.1 and 2.4, it holds that    1 √ − dm+m−2 1/d+m/(m−2) P ET,ww ηT−1 h2 T h2 ≥ 1 − 3ηT2d+m−2 . 48 2.B Proofs for Section 2.4 h i−1 √ 2/m Lemma 2.B.15. Assume that λ2max λ−r T sr,max , min ≤ ηT h λ−r min sr,max d+m−1 dm+m−1 " ≤ CηT 1/m and if r = 0, λmin ≥ ηT−1 (hN√)T m+1 2 + h dm√ m−1 T → 0, # 1 1m √ + d m−1 T , 2+ 2 (hN )( d m−1 ) . Furthermore, assume that RN satisfies max ∥r N,p ∥1 ≤ 1≤p≤P C, and P ≤ Ch. Then, as N, T → ∞, max r N,p 1≤p≤P Θ̂X ′ u Υ−2 V ′ u √ √ +∆− T T ! p → 0. Lemma 2.B.16. Let Assumptions 2.1 to 2.6 hold, and let h ∼ T H for H > 0, N ∼ T a for a > 0, sr,max ∼ T b for 0 < b < 1−r , 2 λmin ∼ λmax ∼ λ ∼ T −ℓ and 1/2 + b 1/2 − b <ℓ< , 2−r r 1/2 + b r=0: < ℓ < 1/2, 2−r 0<r<1: and QT ∼ T Q for 0 < Q < 2/3. Under these conditions,   −2 −2 Ω R′N RN,T := RN Υ−2 ΩN,T Υ−2 − Υ̂ Ω̂N,T Υ̂  1  = Op T 2 (b−ℓ(2−r)) , max (2.B.2) β RN,T := max r N,p 1≤p≤P Θ̂X ′ u Υ−2 V ′ u √ √ +∆− T T !   = Op T ϵ−1/2 + T 1/2+b−ℓ(2−r) , (2.B.3) for an arbitrarily small ϵ > 0, with 2.B.2 1 (b 2 − ℓ(2 − r)) < −1/4, and 1/2 + b − ℓ(2 − r) < 0. Proofs of main results Proof of Theorem 2.2. Using eq. (2.5), we can write   √ √ T RN b̂ − β 0 = T RN Θ̂X ′ (y − X β̂) β̂ − β + T 0 ! = RN ! Θ̂X ′ u √ +∆ , T and by Lemma 2.B.15, max r N,p 1≤p≤P Θ̂X ′ u Υ−2 V ′ u √ √ +∆− T T ! p → 0. Note that under the assumption that h ≤ C, the requirements for Lemma 2.B.15 reduce to the requirements for Theorem 2.2 (note that one of the bounds becomes redundant for 0 < r < 1, see the proof of Theorem 2.1 for details). 
The proof will therefore continue by 49 2 Desparsified Lasso in Time Series deriving the asymptotic distribution of RN T X Υ−2 V ′ u 1 √ = √ RN Υ−2 wt , T T t=1 and applying Slutsky’s theorem. Regarding RN , under the assumption that h < ∞, we may without loss of generality consider the case with P = 1. In the multivariate setting, let R∗N be a P × N matrix with 1 < P < ∞, and non-zero columns indexed by the set H of √ d cardinality h = |H| < ∞. By the Cramér-Wold theorem, T R∗N (b̂ − β 0 ) → N (0, Ψ∗ ) if and √ d only if T α′ R∗N (b̂ − β 0 ) → N (0, α′ Ψ∗ α) for all α ̸= 0. We show this directly by letting the 1 × N vector RN = α′ R∗N and the scalar ψ = lim N,T →∞ α′ R∗N (Υ−2 ΩN,T Υ−2 )R∗′ N α. The final part of the proof is then devoted to establishing the central limit theorem. This result can be shown by applying Theorem 24.6 and Corollary 24.7 of Davidson (2002b). Following the R Υ−2 ΩN,T Υ−2 R′N notation therein, let XT,t = √P 1 ψT RN Υ−2 wt , where PN,T = N ; note ψ N,T t = σ {sT,t , sT,t−1 , . . . }, that by definition of ψ, PN,T → 1 as N, T → ∞. Further, let FT,−∞ 1 the positive constant array {cT,t } = √ , and r = m̄. We show that the requirements PN,T ψT of this Theorem are satisfied. t -measurability of XT,t , follows from the measurability of z t in AsPart (a), FT,−∞ sumption 2.1(ii), E [XT,t ] = √P 1 ψT RN Υ−2 E [wt ] = 0 follows from the rewriting wj,t = N,T  xj,t − x′−j,t γ 0j ut and noting that E [xj,t ut ] = 0, ∀j by Assumption 2.1(i), and  E T X !2  XT,t  = " 1 −2 PN,T ψ t=1 1 = RN Υ −2 PN,T ψ RN Υ T X 1 E T ! T X wt t=1 −2 ΩN,T Υ R′N !# Υ−2 R′N w′t t=1 = 1. For part (b) we get that sup n −2 E|RN Υ wt | m̄ 1/m̄   T,t X |rN,j | sup τj2 T,t (1) j∈H ≤ n E|wj,t |m̄  X rN,j = sup E wj,t τj2 T,t  j∈H o 1/m̄ o m̄ !1/m̄   ≤ C, (2) where (1) is due to Minkowski’s inequality, and (2) follows from h < 0, τj2 ≤ C by eq. (2.B.1), and wj,t is Lm̄ -bounded by Lemma 2.B.2(i). For part (c’), by the arguments in the proof of Lemma 2.B.2, wj,t is Lm -NED of size d −d, and therefore also size −1 on sT,t , which is α-mixing of size − 1/m−1/ < −m̄/(m̄ − 2) m̄ under Assumption 2.1. For (d’), we let MT = max {cT,t } = √P t C, where the inequality follows from 1 τj2 1 N,T ψT ≥ 1 C , such that sup T MT2 = sup T T −2 by eq. (2.B.1), and RN Υ 1 RN Υ−2 ΩN,T Υ−2 R′N ΩN,T Υ −2 R′N is bounded from below by the minimum eigenvalue of ΩN,T (assumed to be bounded away from 0), via the Min-max theorem. 50 ≤ 2.B Proofs for Section 2.4 Finally, Theorem 2.2 states that this convergence is uniform in β 0 ∈ B(sr ). This follows by noting that eq. (2.C.3) holds uniformly in β 0 ∈ B(sr ). ■ Proof of Theorem 2.3. The following derivations collectively require that the set  PT,las ∩ PT,nw ∩ LT ∩ ET,uvw ∩ ET,ww ηT−1 h2  1 √ − 1/d+m/(m−2) T h2 holds with probability converging to 1. For PT,las ∩ PT,nw ∩ LT , this can be shown by the arguments in the proof of Lemma 2.B.15 when the following convergence rates hold: h i−1 m+1 + 2 √ dm m−1 2/m λ2max λ−r T sr,max , h √T → 0, min ≤ ηT h λ−r min sr,max d+m−1 dm+m−1 " ≤ CηT # 1 1m √ + d m−1 T , 2+ 2 ( ) (hN ) d m−1 1/m and if r = 0, λmin ≥ ηT−1 (hN√)T . ET,uvw follows from Lemma 2.B.13, and   1 √ − 1/d+m/(m−2) ET,ww ηT−1 h2 T h2 holds with probability converging to 1 by Lemma 2.B.14. We can write i h −2 h −2 i −2 −2 RN Υ̂ Ω̂Υ̂ − Υ−2 ΩN,T Υ−2 R′N ≤ RN Υ̂ Ω̂Υ̂ − Υ−2 Ω̂Υ−2 R′N i h + RN Υ−2 Ω̂Υ−2 − Υ−2 ΩN,T Υ−2 R′N =: R(a) + R(b) . 
For R(a) we get that h −2 i h −2 i h −2 i R(a) ≤ RN Υ̂ − Υ−2 Ω̂ Υ̂ − Υ−2 R′N + 2 RN Υ̂ − Υ−2 Ω̂Υ−2 R′N h −2 ih i h −2 i ≤ RN Υ̂ − Υ−2 Ω̂ − ΩN,QT Υ̂ − Υ−2 R′N h −2 i h −2 i + RN Υ̂ − Υ−2 ΩN,QT Υ̂ − Υ−2 R′N h −2 i h −2 ih i + 2 RN Υ̂ − Υ−2 Ω̂ − ΩN,QT Υ−2 R′N + 2 RN Υ̂ − Υ−2 ΩN,QT Υ−2 R′N , where " ΩN,QT 1 := E QT QT X t=1 ! wt QT X QT −1 !# w′t = Ξ(0) + t=1 where the (j, k)th element of Ξ(l) is ξj,k = X Ξ(l) + Ξ′ (l), l=1 1 T T P Ewj,t wk,t−l . t=l+1 51 2 Desparsified Lasso in Time Series Starting with the third term of R(a) , applying the triangle inequality h −2 ih i RN Υ̂ − Υ−2 Ω̂ − ΩN,QT Υ−2 R′N max ) (    1 XX 1  1 N,QT − 2 ω̂j,k − ωj,k rN,q,k ≤ max rN,p,j 1≤p,q≤P τ̂j2 τj τk2 j∈H k∈H 1 1 1 ≤ max 2 − 2 max 2 j∈H τ̂j τj j∈H τj max N,QT ω̂j,k − ωj,k 1 2 j∈H τj ≤ C by eq. (2.B.1), and (j,k)∈H 2 max ∥r N,p ∥1 ≤ C by assumption, max 1≤p≤P 1 1 max 2 − 2 ≤ j∈H τ̂j τj max  1≤p,q≤P ∥r N,p ∥1 ∥r N,q ∥1 , q + C1 λ̄2−r s̄r + C2 λ̄2 λ−r s̄r ¯   → 0, q h C3 − C4 δT + C1 λ̄2−r s̄r + C2 λ̄2 λ−r s̄r h δT ¯ on the set PT,nw ∩ LT by Lemma 2.B.6. Finally, we show that max (j,k)∈H 2 N,QT ω̂j,k − ωj,k → 0. QT −1 max (j,k)∈H 2 X N,QT ω̂j,k − ωj,k ≤2 QT −1 =2 X l=0 (1 − l/QT ) max (j,k)∈H 2 (1 − l/QT )ξˆj,k (l) − ξj,k (l) max (j,k)∈H 2 l=0 T T X 1 X 1 ŵj,t ŵk,t−l − Ewj,t wk,t−l . T −l T t=l+1 t=l+1 Using a telescopic sum argument, QT −1 max (j,k)∈H 2 N,QT ω̂j,k − ωj,k ≤2 X l=0 + l QT T 1 X (ŵj,t ŵk,t−l − Ewj,t wk,t−l ) T max (j,k)∈H 2 max (j,k)∈H 2 1 T t=l+1 T X Ewj,t wk,t−l . t=l+1 For the second term, it follows by Lemma 2.B.2(ii) that QT −1 2 X l=0 QT −1 QT −1 1−d−ϵ X l C X 1−d−ϵ l max |Ewj,t wk,t−l | ≤ l ≤ CQ−d−ϵ ≤ CQ1−d−ϵ , T T 1−d−ϵ 2 QT j,k∈H QT Q T l=1 l=1 since l/QT < 1, and Q1−d−ϵ → 0 for d ≥ 1, and T PQT −1 l=1 l−1−δ ≤ C by properties of p-series. It follows from Lemmas 2.B.13 and 2.B.14 that max (j,k)∈H 2 max (j,k)∈H 2 52 T h i2 1 1 1 X (ŵj,t ŵk,t−l − wj,t wk,t−l ) ≤ C1 T 1/2 λ2−r + C2 h m T m λ2−r max sr,max max sr,max T t=l+1 q h 1 m+1 i3 3−m 3 2 2−r + C3 h m T m λ2−r , max sr,max + C4 h 3m T 3m λmax sr,max T 1 √ − 1 X 1/d+m/(m−2) (wj,t wk,t−l − Ewj,t wk,t−l ) ≤ C5 ηT−1 h2 T h2 . T t=l+1 2.B Proofs for Section 2.4 on the set PT,uv ∩ ET,ww   1 √ − 1/d+m/(m−2) ηT−1 h2 T h2 . Plugging the upper bounds in, we find that  h i2 1 1 N,QT 2−r sr,max + C2 h m T m λ2−r ω̂j,k − ωj,k ≤ 2QT C1 T 1/2 λmax max sr,max max (j,k)∈H 2 h 1 m+1 i3 3−m 3 2 2−r h m T m λ2−r max sr,max + C4 h 3m T 3m λmax sr,max  1 √ − 1/d+m/(m−2) +C5 ηT−1 h2 + C6 Q1−d−ϵ . T h2 T q + C3 Hence, max p (j,k)∈H 2 N,QT ω̂j,k − ωj,k − → 0 if we take λ2−r max ≤ ηT min h p h and QT ηT−1 h2 √ QT i−1 h i−1 √ T sr,max , QT h1/m T 1/m sr,max , Q2T h3/m T (3−m)/m sr,max − i−1 h i−1  2/3 , QT h1/(3m) T (m+1)/3m sr,max , 1 1/d+m/(m−2) → 0. For the latter term, since we can choose ηT−1 to 1 √ − 1/d+m/(m−2) grow arbitrarily slowly, it is sufficient to assume QT h2 T h2 → 0. FurtherT h2 m+1 more, this convergence rate is stricter than the previous rate 2 + h dm√ m−1 T → 0, and therefore makes it redundant. For the fourth term of R(a) , we may bound as follows h −2 i RN Υ̂ − Υ−2 ΩN,QT Υ−2 R′N max ( )   XX 1 1 N,QT 1 ≤ max rN,p,j − 2 ωj,k rN,q,k 1≤p,q≤P τ̂j2 τj τk2 j∈H k∈H 1 1 1 ≤ max 2 − 2 max 2 j∈H τ̂j τj j∈H τj The only new term here is max max (j,k)∈H 2 (j,k)∈H 2 N,QT ωj,k max  1≤p,q≤P ∥r N,p ∥1 ∥r N,q ∥1 , N,QT , which can by bounded by ωj,k QT −1 max (j,k)∈H 2 N,QT ≤ ∥ΩN,QT ∥max ≤ 2 ωj,k X ∥Ξ(l)∥max ≤ C, l=0 where the last inequality follows from Lemma 2.B.2(ii). 
Note that when the third and fourth terms of R(a) converge to 0, this holds for the first 1 2 j∈H τj and second terms as well; one may simply replace max by a second max j∈H 1 τ̂j2 − 1 τj2 → 0 in the upper bound. This concludes  the part of R(a) . With the results above, it remains to be shown for R(b) that RN Υ−2 Ω̂ − ΩN,T Υ−2 R′N → 0. Using similar arguments as for the terms of max 53 2 Desparsified Lasso in Time Series R(a) , it suffices to show that QT −1 X N,QT ωj,k − ωj,k ≤ N,QT ωj,k − ωj,k → 0. Note that by Lemma 2.B.2(ii) max (j,k)∈H 2 ξj,k − l=1 T −1 X ξj,k ≤ l=1 T X |ξj,k (l)| ≤ l=QT T X Cϕl ≤ C l=QT T X l−d−ϵ , l=QT PT P −1−δ , which converges to 0 by letting δ = ϵ/2, and writing Tl=QT l−d−ϵ ≤ Q1−d−δ T l=QT l PT 1−d−δ −1−ϵ → 0 by properties of p-series and QT → ∞. where QT → 0 for d ≥ 1, and l=QT l This shows that R(b) p max − → 0. Summarizing the above, we argue that for some δ > 0,   −2 −2 RN Υ−2 ΩN,T Υ−2 − Υ̂ Ω̂N,T Υ̂ R′N where ∆τ := max j∈H 1 τ̂j2 − 1 τj2 and ∆ω := max max (j,k)∈H 2 0 ≤ C1 ∆τ [1 + ∆τ + ∆τ ∆ω] + C2 Q1−d−δ T N,QT ω̂j,k − ωj,k . Finally, this result holding uniformly in β ∈ B(sr ) follows the same logic as the proof of Theorem 2.2, namely that eq. (2.C.3) holds uniformly in β 0 ∈ B(sr ). ■ Proof of Corollary 2.2. The result follows by applying Theorems 2.2 and 2.3, so the assumed conditions from both must be satisfied. Since we assume that h ≤ C and λ ∼ λmax ∼ λmin , the conditions will simplify considerably. To summarize, we require the following six conditions: For Theorem 2.2 we require that (1) λ2max λ−r min ≤ ηT (2) λ−r min sr,max (3∗ ) (4) h√ i−1 T sr,max , √  ≤ ηT 2 1  T m ( d1 + m−1 ) 2 N ( d + m−1 ) , N 1/m when r = 0, λmin ≥ ηT−1 √ T h i−1 h i−1 p √ λ2−r QT T sr,max , QT h1/m T 1/m sr,max , max ≤ ηT min h Q2T h3/m T (3−m)/m sr,max h√ i−1 h i−1  2/3 , QT h1/(3m) T (m+1)/3m sr,max , i−1 (5) λ2max λ−r min ≤ ηT (6) λ−r min sr,max (7∗ ) λmin ≥ ηT−1 (8) √ 1 − QT h2 ( T h2 ) 1/d+m/(m−2) → 0, " ≤ ηT T h2/m sr,max 1 # √ m ( d1 + m−1 ) T , 2+ 2 (hN )( d m−1 ) (hN )1/m √ when r = 0, T where (1)-(3∗ ) follow from Theorem 2.2 and (4)-(8) from Theorem 2.3. Note that (1), (2), and (3∗ ) are same as the terms (4), (5), and (6∗ ) and without the h terms. For (4), this can be sim- 54 2.B Proofs for Section 2.4 h √ i −1 2−r plified into a single (slightly more strict) upper bound λmax ≤ CηT Q2T T h3/m sr,max . h √ i−1 2 We may then combine this with (5), and both are satisfied when λ2max λ−r T h3/m sr,max . min ≤ ηT QT Using h ≤ C and λ ∼ λmax ∼ λmin , these simplify to (1) −r λ √  sr,max ≤ ηT 1  T 2 2 N ( d + m−1 ) (3) N 1/m λ ≥ ηT−1 √ when r = 0, T i−1 h √ , λ2−r ≤ ηT Q2T T sr,max (4) QT T (2∗ ) m ) ( d1 + m−1 1 − 2/d+2m/(m−2) , → 0. When 0 < r < 1, from (1) and (3) we get " ηT−1 s1/r r,max 2 2 N ( d + m−1 ) √ T # 1 r m ) ( d1 + m−1 h √ i−1/(2−r) ≤ λ ≤ ηT Q2T T sr,max , and by combining the upper and lower bounds, we obtain the condition d+m−1 1 QrT sr,max N (2−r)( dm+m−1 ) T 4   d(m−1)(2−r) r− dm+m−1 → 0. When r = 0, the bounds on λ come from (2∗ ) and (3) h √ i−1/2 N 1/m ηT−1 √ ≤ λ ≤ ηT Q2T T s0,max . T Combining the upper and lower bounds, we obtain the condition N 2/m → 0. Q2T s0,max √ T From (1), we then obtain the condition d+m−1 s0,max N 2( dm+m−1 ) T −1 2  d(m−1) dm+m−1  → 0, which is the same condition which came from (1) and (3) in the 0 < r < 1 case. 
Collectively, 55 2 Desparsified Lasso in Time Series we then need to satisfy the following   2 2  1 1m i−1/(2−r) h √  ( d + m−1 ) r( d + m−1 )  −1 1/r  when 0 < r < 1, ≤ λ ≤ ηT Q2T T sr,max ηT sr,max N √T     h √ i−1/2   1/m   ηT−1 N√ ≤ λ ≤ ηT Q2T T s0,max when r = 0, T   1 r− d(m−1)(2−r) d+m−1  dm+m−1  → 0, QrT sr,max N (2−r)( dm+m−1 ) T 4   2/m  2 N   QT s0,max √T → 0 when r = 0,    1  − QT T 2/d+2m/(m−2) → 0. By implication of Theorem 2.2 √ d T r N,p (b̂ − β 0 ) → N (0, ψ), uniformly in β 0 ∈ B(sr ). Then, by Theorem 2.3 −2 r N,p (Υ̂ Ω̂Υ̂ −2 p )r ′N,p → ψ, also uniformly in β 0 ∈ B(sr ). By Slutsky’s Theorem, it is then the case that √ d T r N,p (b̂ − β 0 ) → N (0, ψ), uniformly in β 0 ∈ B(sr ), for every 1 ≤ p ≤ P . As P < ∞ by assumption, it follows that  √ P T q sup β 0 ∈B(sr ) 1≤p≤P,z∈R  r N,p (b̂ − β 0 ) −2 r N,p (Υ̂ Ω̂Υ̂ −2 )r ′N,p ≤ z  − Φ(z) = op (1). Note that uniform convergence over z ∈ R follows automatically by Lemma 2.11 in Van der Vaart (1998), since the distribution is continuous. The second result then follows from the fact that a sum of P squared standard Normal variables have a χ2P distribution. ■ Proof of Corollary 2.3. Define g ∼ N (0, RN Υ−2 ΩN,T Υ−2 R′N ) as the ‘population counterpart’ of ĝ and define the following distribution functions:  F1,T (z) := P  GT (z) := P T X 1 max √ r N,p Υ−2 wt ≤ z 1≤p≤P T t=1   G∗T (z) := P∗ max ĝp ≤ z .    √ max T r N,p b̂ − β 0 ≤ z , F2,T (z) := P 1≤p≤P  max gp ≤ z , 1≤p≤P 1≤p≤P Now note that |F1,T (z) − G∗T (z)| ≤ |F1,T (z) − GT (z)| + |GT (z) − G∗T (z)| . {z } | {z } | F G (z) RT For RTF G (z), write x̂T = 56 GG (z) RT   √ T r N,p b̂ − β 0 and xT = √1 T r N,p Υ−2 PT t=1 wt , such that ! , 2.B Proofs for Section 2.4 F1,T (z) = P(maxp x̂T,p ≤ z) and F2,T (z) = P(maxp xT,p ≤ z), and let rT := max1≤p≤P x̂T,p − max1≤p≤P xT,p . Then |rT | = β , max x̂T,p − max xT,p ≤ max |x̂T,p − xT,p | = RN,T 1≤p≤P 1≤p≤P 1≤p≤P β where RN,T is defined in (2.B.3). Given our assumptions, we therefore know that there exist sequences ηT,1 and ηT,2 such that P (|rT | > ηT,1 ) ≤ ηT,2 , such that   |F1,T (z) − GT (z)| ≤ P max xT,p + rT ≤ z |rT | ≤ ηT,1 P (|rT | ≤ ηT,1 ) − P(max gp ≤ z) p p   + P max x̂T,p ≤ z |rT | > ηT,1 P (|rT | > ηT,1 ) p   ≤ P max xT,p ≤ z + ηT,1 − P(max gp ≤ z) + 2ηT,2 p p   ≤ P max xT,p ≤ z + ηT,1 − P(max gp ≤ z + ηT,1 ) p p | {z } F G (z+η RT T ,1 ) ,1   + P max gp ≤ z + ηT,1 p | {z F G (z) RT ,2 − P(max gp ≤ z) +2ηT,2 . p } FG (z + ηT,1 ) we apply the high-dimensional CLT in Theorem 1 of Chang For the term RT,1 et al. (2021), noting that our assumptions imply the conditions required for this theorem. In particular, for the sub-exponential moment assumption, we need that r N,p Υ−2 wt ψγ1 ≤ DT for all t and p, for some γ1 ≥ 1. We choose γ1 = 1, and use Lemma 2.7.7 of Vershynin (2019) to bound r N,p Υ−2 wt ψ1 ≤ r N,p Υ−2 v t ψ2 ∥ut ∥ψ2 ≤ dv,T du,T = DT . We assume that L1 -bounded linear combinations of v t are sub-Gaussian, which covers this case, since the ∥rN,p ∥1 ≤ C by assumption, and Υ−2 max ≤ C by eq. (2.B.1). The non-degeneracy condition then follows from choosing 1/C ≤ ∥RN ∥1 , and assuming the minimum eigenvalue (and therefore the smallest diagonal element) of ΩN,T is bounded away from 0. Defining ω T := min1≤p≤P Egi2 , this implies that ω T ≥ C > 0. Applying the CLT, we bound as follows FG RT,1 (z + ηT,1 ) ≤ FG sup RT,1 (z) z∈R ≤ sup P z∈RP T RN Υ−2 X √ wt ≤ z T t=1 ! − P (g ≤ z) 2/3 BT (ln P )(1+2K)/(3K) BT (ln P )7/6 + C2 → 0. 
1/9 T T 1/9  The final result holds as ln P ≤ ln Ch = O ln T H = O(ln T ), since H is a constant. ≤ C1 FG For the term RT,2 (z), apply the anti-concentration bound in Lemma 2.1 of Chernozhukov 57 2 Desparsified Lasso in Time Series et al. (2013) to show that   FG RT,2 (z) ≤ sup P(z ≤ max gp ≤ z + ηT,1 ) ≤ sup P max gp − z ≤ ηT,1 p p z∈R z∈R   q √ √ ≤ CηT,1 2 ln P + 1 ∨ ln(ω T /ηT,1 ) ≤ C1 ηT,1 2 ln P . β By Lemma 2.B.16 we find that RN,T = Op h T ϵ−1/2 + T 1/2+b−ℓ(2−r) i√ ln T  = Op (T −δ ) for some δ > 0, since ϵ > 0 can be chosen arbitrarily small, and 1/2 + b − ℓ(2 − r) < 0. We p may therefore take ηT,1 at a polynomial rate as well, such that ηT,1 2 ln(P ) → 0. For RTGG (z), it follows by Theorem 2 in Chernozhukov et al. (2015) that  2/3 Ω Ω sup RTGG (z) ≤ C(RN,T )1/3 max{1, ln(P/RN,T )} , z∈R Ω Ω with RN,T as defined in eq. (2.B.2). By Lemma 2.B.16 we have RN,T = Op (T −1/4 ), such that  2/3   Ω Ω (RN,T )1/3 max{1, ln(P/RN,T )} = Op T −1/12 (max {1, (H + 1/4) ln T })2/3 = op (1).■ 58 2.C Supplementary Results Appendix 2.C Supplementary Results Section 2.C.1 and 2.C.2 present the proofs of the preliminary results from Section 2.3 and Section 2.4, respectively. Section 2.C.5 provides the details on Examples 2.5 and 2.6. Section 2.C.6 contains the algorithm for choosing the tuning parameter. 2.C.1 Proofs of preliminary results Section 2.3 Proof of Lemma 2.A.1. Lm̄ -boundedness of {xj,t ut } follows directly from the L2m̄ -boundedness of {z t } and the Cauchy-Schwarz inequality. By Theorem 17.9 in Davidson (2002b) it follows that {xj,t ut } is Lm -NED on {sT,t } of size −1. We then apply Theorem 17.5 in Davidson d (1/m − (1/m−1/m̄) s Ft -measurability of z t implies (2002b) to conclude that {xj,t ut } is an Lm -mixingale of size − min{1, 1/m̄)} = −1, with respect to Fts = σ{sT,t , sT,t−1 , . . . }; the σ{z t , z t−1 , . . . } ⊂ Fts , which in turn implies that {xj,t ut } it is also an Lm -mixingale with ∞ P respect to Ft = σ{z t , z t−1 , . . . }. The summability condition ψq < ∞ is satisfied by the q=1 convergence property of p-series: ∞ P q −p < ∞ for any p > 1. ■ q=1 Proof of Lemma 2.A.2. Lm̄ -boundedness of {xi,t xj,t −Exi,t xj,t } follows directly from the L2m̄ -boundedness of {z t } and the Cauchy-Schwarz inequality. By Theorem 17.9 of Davidson (2002b) the product of two NED processes is also NED, with the order halved. It follows that {xi,t xj,t } is Lm -NED on {sT,t } of size −d. Therefore, Exi,t xj,t is trivially NED. Theorem 17.8 in Davidson (2002b) implies that also {xi,t xj,t − Exi,t xj,t } is Lm -NED. We then apply Theorem 17.5 in Davidson (2002b) to conclude that {xi,t xj,t − Exi,t xj,t } is an Lm -mixingale d (1/m−1/m̄)} = −d, with respect to Fts = (1/m−1/m̄) s Ft -measurability of z t implies σ{z t , z t−1 , . . . } ⊂ Fts , which in turn of size − min{d, σ{sT,t , sT,t−1 , . . . }; the implies that {xi,t xj,t − Exi,t xj,t } is also an Lm -mixingale with respect to Ft = σ{z t , z t−1 , . . . }. The boundedness of mixingale constants comes from Theorem 17.5, noting that the NED constants of {z j,t } are bounded by Assumption 2.1(ii), and {xi,t xj,t − Exi,t xj,t } is appropriately Lm̄ -bounded. ■ Proof of Lemma 2.A.3. By the union bound  P Σ̂ − Σ max N X N  X > C/ |S| ≤ P T X ! (xi,t xj,t − E [xi,t xj,t ]) > CT / |S| . t=1 i=1 j=1 Now apply the Triplex inequality (Jiang, 2009) P T X ! (xi,t xj,t − E [xi,t xj,t ]) > CT / |S| t=1 + ≤ 2q exp −T C2 288 |S|2 q 2 κ2T ! 
T T i 6 |S| X 15 |S| X h E [|E (xi,t xj,t |Ft−q ) − E (xi,t xj,t )|] + E |xi,t xj,t | 1{|xi,t xj,t |>κT } C T t=1 C T t=1 := R(i) + R(ii) + R(iii) . 59 2 Desparsified Lasso in Time Series For the first term, we have N X N X C2 −T 288 |S|2 q 2 κ2T 2 R(i) = 2N q exp i=1 j=1 so we need N 2 q exp  −T |S|2 q 2 κ2 T  ! → 0. By Lemma 2.A.2 and Jensen’s inequality, we have that E [|E [xi,t xj,t |Ft−q ] − E [xi,t xj,t ]|] ≤ ct ψq , and thus for the second term that R(ii) ≤ T 6 |S| X ct ψq ≤ C |S| ψq , C T t=1 N X N X R(ii) ≤ CN 2 |S| q −d , i=1 j=1 so we need N 2 |S| q −d → 0. For the third term, we have by Hölder’s and Markov’s inequalities  1−1/m h i E |xi,t xj,t |m E [|xi,t xj,t |m ] , E |xi,t xj,t | 1{|xi,t xj,t |>κT } ≤ (E |xi,t xj,t |m )1/m ≤ κ1−m T κm T N N X X R(iii) ≤ CN 2 |S| κ1−m T i=1 j=1 so we need N 2 |S| κ1−m → 0. We then jointly bound all three terms T −T |S|2 q 2 κ2T (1) CN 2 q exp (2) CN 2 |S| q −d ≤ ηT , ! ≤ ηT , (3) ≤ ηT . CN 2 |S| κ1−m T by a sequence ηT → 0. Note that in the Triplex inequality, q is a positive integer, κT > 0, and λ−r sr > 0 is also satisfied. We further assume that ηT N2 ≤ 1 e =⇒ ηT qN 2 ≤ 1 . e First, isolate κT in (1), CN 2 q exp −T |S|2 q 2 κ2T √ ! ≤ ηT ⇐⇒ κT ≤ C T 1 p . |S| q ln (qN 2 /ηT ) Similarly, isolating κT from (3), gives CN 2 |S| κ1−m ≤ ηT T ⇐⇒ κT ≥ C N 2 |S| 1  m−1 −1 ηTm−1 . Since we have a lower and upper bound on κT , we need to make sure both bounds are satisfied, C1 −1  1 N |S| m−1 ηTm−1 ≤ C2 2 ⇐⇒ 60 √ T 1 p |S| q ln (qN 2 /ηT ) 1 p √ −m −2 q ln (qN 2 /ηT ) ≤ C T |S| m−1 N m−1 ηTm−1 . 2.C Supplementary Results Isolating q from (2), CN 2 |S| q −d ≤ ηT ηT qN 2 Assuming that CN 2 d 1 d −1 d |S| ηT 2 ⇐⇒ −1 1 q ≥ CN d |S| d ηTd . ≤ 1e , we have that q ≤ q p 1 √ −m −2 ≤ C T |S| m−1 N m−1 ηTm−1 ln (qN 2 /ηT ) and therefore we need to ensure d+m−1 dm+m−1 ⇐⇒ √  |S| ≤ CηT 2 1 T d + 1 m m−1 2 N ( d + m−1 ) For the set Sλ , we have the bound |Sλ | ≤ N X 1{|β 0 |>λ} j j=1 !r βj0 λ ≤ λ−r N X 1{|β 0 |>0} βj0 j j=1 r = λ−r sr , and it is sufficient to assume that −r λ d+m−1 dm+m−1 √  sr ≤ CηT 1 T 2 d + 1 m m−1 . 2 N ( d + m−1 ) When this bound is satisfied, N P N P (R(i) + R(ii) + R(iii) ) ≤ 3ηT , and P (CC T (Sλ )) ≥ 1 − i=1 j=1 3ηT . ■ Proof of Lemma 2.A.4. By the union bound, Markov’s inequality and the mixingale concentration inequality of (Hansen, 1991b, Lemma 2), it follows that " P max j≤N,l≤T ≤z −m N X l X # ut xj,t ! >z t=1 " P max j=1 " E max j=1 ≤ N X l≤T l X l≤T m# ut xj,t t=1 ≤z −m N X j=1 l X # ut xj,t ! >z t=1 C1m T X !m/2 c2t ≤ CN T m/2 z −m , t=1 as {xj,t ut } is a mixingale of appropriate size by Lemma 2.A.1. ■ Proof of Lemma 2.A.5. This result follows directly by Corollary 6.8 in Bühlmann and ■ van De Geer (2011). Proof of Lemma 2.A.6. The proof largely follows Theorem 2.2 of van de Geer (2016) applied to β = β 0 with some modifications. For the sake of clarity and readability, we include the full proof here. Consider two cases. First, consider the case where ∥X(β̂−β 0 )∥2 2 T < − λ4 ∥β̂ − β 0 ∥1 + 2λ∥β 0S c ∥1 . Then ∥X(β̂ − β 0 )∥22 λ 8 + ∥β̂ − β 0 ∥1 < 2λ∥β 0S c ∥1 < λ∥β 0S c ∥1 + Cλ2 |S|, T 4 3 which satisfies Lemma 2.A.6. 61 . 2 Desparsified Lasso in Time Series Next, consider the case where ∥X(β̂−β 0 )∥2 2 T ≥ − λ4 ∥β̂ − β 0 ∥1 + 2λ∥β 0S c ∥1 . From the Lasso X ′ (y−X β̂) T optimization problem in eq. (2.3), we have the Karush-Kuhn-Tucker conditions = λκ̂, where κ̂ is the subdifferential of ∥β̂∥1 . Premultiplying by (β 0 − β̂)′ , we get (β 0 − β̂)′ X ′ (y − X β̂) ′ =λ(β 0 − β̂)′ κ̂ = λβ 0 κ̂ − λ∥β̂∥1 ≤ λ∥β 0 ∥1 − λ∥β̂∥1 . 
T By plugging in y = Xβ 0 +u, the left-hand-side can be re-written as ∥X(β̂−β 0 )∥2 2 T ′ +u X(β 0 −β̂) , T and therefore ∥X(β̂ − β 0 )∥22 u′ X(β̂ − β 0 ) ≤ + λ∥β 0 ∥1 − λ∥β̂∥1 T T 1 ≤ u′ X ∞ ∥β̂ − β 0 ∥1 + λ∥β 0 ∥1 − λ∥β̂∥1 (1) T ≤ (2) ≤ (4) λ 5λ 3λ 5λ 0 ∥β̂ − β 0 ∥1 + λ∥β 0 ∥1 − λ∥β̂∥1 ≤ ∥β̂ S − β 0S ∥1 − ∥β̂ S c ∥1 + ∥β S c ∥1 4 4 4 (3) 4 5λ 3λ ∥β̂ S − β 0S ∥1 − ∥β̂ S c − β 0S c ∥1 + 2λ∥β 0S c ∥1 , 4 4 where (1) follows from the dual norm inequality, (2) from the bound on the empirical process given by ET (T λ4 ), (3) from the property ∥β∥1 = ∥β S ∥1 + ∥β S c ∥1 with βj,S = βj 1{j∈S} , as well as several applications of the itriangle inequality, and (4) follows from the fact that h ∥X(β̂−β 0 )∥2 2 ∥β̂ S c ∥1 ≤ ∥β̂ S c − β 0S c ∥1 − ∥β 0S c ∥1 . Note that it follows from the condition ≥ T − λ4 ∥β̂ − β 0 ∥1 + 2λ∥β 0S c ∥1 combined with the previous inequality that ∥β̂ S c − β 0S c ∥1 ≤ 3∥β̂ S − β 0S ∥1 such that Lemma 2.A.5 can be applied. Adding 3λ ∥β̂ S 4 − β 0S ∥1 to both sides and re-arranging, we get by applying Lemma 2.A.5 4 ∥X(β̂ − β 0 )∥22 8 λ 8 + ∥β̂ − β 0 ∥1 ≤ λ∥β̂ S − β 0S ∥1 + λ∥β 0S c ∥1 3 T 4 3 3 q 8 8 0 ′ ≤ λC |S|(β̂ − β ) Σ̂(β̂ − β 0 ) + λ∥β 0S c ∥1 . 3 3 q p 1 Using that 2uv ≤ u2 + v 2 with u = (β̂ − β 0 )′ Σ̂(β̂ − β 0 ), v = √43 Cλ |S|, we further 3 bound the right-hand-side to arrive at 4 ∥X(β̂ − β 0 )∥22 λ 1 ∥X(β̂ − β 0 )∥22 8 + ∥β̂ − β 0 ∥1 ≤ + Cλ2 |S| + λ∥β 0S c ∥1 , 3 T 4 3 T 3 from which the result follows. ■ Proof of Lemma 2.A.7. By Assumption 2.3 and Lemma 2.A.6, we have on the set ET (T λ4 )∩ CC T (Sλ ) ∥X(β̂ − β 0 )∥22 λ 8 + ∥β̂ − β 0 ∥1 ≤Cλ2 |Sλ | + λ∥β 0S c ∥1 . λ T 4 3 62 2.C Supplementary Results It follows directly from Assumption 2.2 that β 0S c λ = 1 N X j=1 1{0<|β 0 |≤λ} βj0 j ≤ N X j=1 1{|β 0 |>0} j λ βj0 !1−r βj0 = λ1−r N X 1{|β 0 |>0} βj0 j=1 r j ≤ λ1−r sr . and by arguments in the proof of Lemma 2.A.3, |Sλ | ≤ λ−r sr Plugging these in, we obtain ∥X(β̂ − β 0 )∥22 λ 8 + ∥β̂ − β 0 ∥1 ≤ Cλ2 λ−r sr + λλ1−r sr = Cλ2−r sr . T 4 3 2.C.2 ■ Proofs of preliminary results Section 2.4 Proof of Lemma 2.B.1. As vj,t are the projection errors from projecting xj,t on all other xk,t , it follows directly that E [vj,t ] = 0 and E [vj,t xk,t ] = 0. Lm̄ -boundedness of {vj,t xk,t }, ∀j, k follows from Assumption 2.1(i), Assumption 2.4, and the Cauchy-Schwarz inequality. By Theorem 17.8 in Davidson (2002b), {vj,t } is L2m -NED on {sT,t } of size −d. The remainder ■ of the proof follows as in the proof of Lemma 2.A.1. Proof of Lemma 2.B.2. It follows by the Cauchy-Schwarz inequality that {wj,t } is Lm̄ bounded for all j = 1, . . . , p, and from the properties of {vj,t } by Theorem 17.9 in Davidson (2002b) that {wj,t } is Lm -NED of size −d. Part (i) then follows by Theorem 17.5 in Davidson (2002b). For part (ii), we adapt the proof of Theorem 17.7 in Davidson (2002b). Letting Yt = wj,t and Xt = wk,t , Ewj,t wk,t−l = EYt Xt−l . By the triangle inequality, choosing t−l+q q = [l/2], and using Ft−l−q as in Definition 2.A.1, i oi h  h  n t−l+q t−l+q . + E Yt E Xt−l |Ft−l−q |EYt Xt−l | ≤ E Yt Xt−l − E Xt−l |Ft−l−q By Hölder’s inequality, we can bound the first term n oi  h h  n o i m−1  h m m t−l+q t−l+q ≤ E |Yt+q | m−1 E Yt Xt−l − E Xt−l |Ft−l−q E Xt−l − E Xt−l |Ft−l−q 1 m i m i m−1  h m m m Since m−1 ≤ C, and since Xt−l is NED of size −d, < m < m̄, E |Yt+q | m−1  h n o m i 1 m t−l+q E Xt−l − E Xt−l |Ft−l−q ≤ Cψq , where ψq = O(q −d−ϵ ) for some ϵ > 0. 
For the second term, we use the tower property and Hölder’s inequality again h  i h    i t−l+q t−l+q t−l+q E Yt E Xt−l |Ft−l−q = E E Yt |Ft−l−q E Xt−l |Ft−l−q  h   t−l+q ≤ E E Yt |Ft−l−q 1 m i m     t−l+q E E Xt−l |Ft−l−q m m−1  m−1 m . 63 . 2 Desparsified Lasso in Time Series Since conditioning is a contractionary projection in Lp spaces,  h   t−l+q E E Yt |Ft−l−q 1 m i m  h   t−l+q ≤ E E Yt |F−∞  m−1 m m−1     t−l+q E E Xt−l |Ft−l−q m 1 m i m  h i m−1 m m ≤ E |Xt−l | m−1 ≤ C. Since Yt is a Mixingale of size −d, the first term can be bounded by Cψq−l , where ψq−l = O((q − l)−d−ϵ ). The sequence ϕl is then obtained by recalling that we chose q = [l/2], ϕl = O((l/2)−d−ϵ ) = O(l−d−ϵ ). Absolute summability follows by properties of p-series, since d ≥ 1. Note this results also holds for max q≤j,k≤N, 1≤t≤T |E [wj,t wk,t−l ]| since C and ϕl are independent of j, k, and t. (iii) follows by repeated application of Corollary 17.11 and Theorem 17.5 in Davidson (2002b), noting that E(wj,t wk,t−l ) is a non-random and bounded, ■ so trivially NED. Proof of Lemma 2.B.3. By Lemma 2.A.3, P (CC T (Sλ )) ≥ 1 − 3ηT when d+m−1 λ−r sr ≤ CηTdm+m−1 √  1 T 2 d + 1 m m−1 2 N ( d + m−1 ) for a sequence ηT → 0 such that ηT ≤ N2 . e , We can similarly apply this lemma to the sets CC T (Sλ,j ); when d+m−1 dm+m−1 λ−r j sr,j ≤ CηT √ T  2 1 d 2 N ( d + m−1 ) + 1 m m−1 , ! T CC T (Sλ,j ) ≥ 1−[1 − P (CC T (Sλ ))]− P (CC T (Sλ,j )) ≥ 1−3ηT . By the union bound, P CC T (Sλ ) j∈H P [1 − P (CC T (Sλ,j ))] ≥ 1−3(1+h)ηT , when the conditions above hold for all j ∈ H. These j∈H conditions are then jointly satisfied by the conditions this lemma, which are expressed in terms of sr,max and λmin . ■   √ (j) Proof of Lemma 2.B.4. By Lemmas 2.A.4 and 2.B.1, we have P ET (xj ) ≤ CN ( T /xj )m . Then ! P \ j∈H (j) ET (xj ) ≥1− X j∈H P n oc  hN T m/2 (j) E T xj ≥1−C . min xm j Proof of Lemma 2.B.5. Note that ( )! T \ h 1 X 2 2 P(LT ) = P vj,t − τj ≤ =1−P T t=1 δT j∈H ! T X 1 X 2 h 2 ≥1− P vj,t − τj > . T t=1 δT j∈H 64 ■ j∈H ( [ j∈H T 1 X 2 h vj,t − τj2 > T t=1 δT )! 2.C Supplementary Results Recalling that τj2 = T P 1 T  2  E vj,t , write P t=1  1 T T P 2 vj,t − τj2 > t=1 h δT   =P T P 2 2 (vj,t − Evj,t ) > t=1 As in the proof of Lemma 2.A.3, we use the Triplex inequality to bound this probability. T X h 2 2 (vj,t − Evj,t ) >T δ T t=1 P +6 !  ≤ 2q exp − T h2 288q 2 κ2T δT2  T T i  δT X  δT X h 2 2 2  |Ft−q − Evj,t E E vj,t + 15 E vj,t 1{|v2 |>κT } j,t T h t=1 T h t=1 := R(i) + R(ii) + R(iii) . For the second term, note by the proof of Lemma 2.B.1 that {vj,t } is L2m -NED on {sT,t }  2 is Lm̄ -bounded, and by Theorem 17.9 of Davidson of size −d. By Assumption 2.4, vj,t  2 2 (2002b), it is Lm -NED on {sT,t } of size −d. By Theorem 17.5 vj,t − Evj,t is then an    2 2 Lm -mixingale of size −d. It then follows that E E vj,t |Ft−q − Evj,t ≤ ct ψq ≤ Cq −d , and X R(ii) ≤ j∈H X 6 j∈H T δT δT X −d Cq = C d . T h t=1 q For the third term, we have by Hölder’s and Markov’s inequalities h i 2 1{|v2 |>κT } ≤ Cκ1−m . E vj,t T j,t and therefore X R(iii) ≤ j∈H X 15 j∈H T δT X δT Cκ1−m = C m−1 . T T h t=1 κT We jointly bound all three terms by a sequence ηT → 0.   T h2 Cqh exp − 2 2 2 ≤ ηT , q κT δ T (1) C (2) For the steps below, we assume that ηT h 1 e ≤ δT ≤ ηT , qd =⇒ (3) C δT m−1 κT ≤ ηT . p − ln(ηT /(hq)) ≥ 1. Isolate κT in (1) and (2),  Cqh exp C δT κm−1 T −T h2 q 2 κ2T δT2 √  ≤ ηT ⇐⇒ κT ≤ C  ≤ ηT ⇐⇒ κT ≥ C δT ηT Th , qδT 1/(m−1) . Combining both bounds on κT ,  C1 δT ηT 1/(m−1) ≤ C2 √ Th qδT ⇐⇒ √ 1/(m−1) −m/(m−1) q ≤ C T hηT δT . 65 Th δT  . 
2 Desparsified Lasso in Time Series Isolating q from (2), gives CδT q −d ≤ ηT ⇐⇒ −1/d 1/d δT . q ≥ CηT Combining both bounds on q, √ 1/(m−1) −m/(m−1) 1/d −1/d C1 T hηT δT ≥ C2 δT η T When δT satisfies this upper bound, P d+m−1 √ 1 δT ≤ CηTdm+m−1 ( T h) 1/d+m/(m−1) . ⇐⇒ (R(i) + R(ii) + R(iii) ) ≤ 3ηT , and P (LT ) ≥ 1 − 3ηT , j∈H which completes the proof. ■ Proof of Lemma 2.B.6. Note that τ̂j2 can be rewritten as follows τ̂j2 = xj − X −j γ 0j 2 2 + X −j γ̂ j − γ 0j  2 2 T T ′  2 xj − X −j γ 0j X −j γ̂ j − γ 0j − + λj ∥γ̂ j ∥1 T  ′  2 T X −j γ̂ j − γ 0j 2 2 xj − X −j γ 0j X −j γ̂ j − γ 0j 1 X 2 = vj,t + − + λj ∥γ̂ j ∥1 . T t=1 T T (2.C.1) Then |τ̂j2 − τj2 |  2 T X −j γ̂ j − γ 0j 2 1 X 2 2 ≤ vj,t − τj + T t=1 T   ′ 2 xj − X −j γ 0j X −j γ̂ j − γ 0j + λj ∥γ̂ j ∥1 + T =: R(i) + R(ii) + R(iii) + R(iv) . By the set LT , we have R(i) ≤ max j∈H 1 T T P 2 vj,t − τj2 ≤ t=1 nodewise regression, it holds that R(ii) ≤ (j) C1 λ2−r sr j h δT . By Corollary 2.1 applied to the T λ (j) ≤ C1 λ̄2−r s¯r . By the set {ET (T 4j )} j∈H and the same error bound, we have R(iii) 2 v ′j X −j γ̂ j − γ 0j = T  ≤ C2 λj γ̂ j − γ 0j 1 ≤ C2 λ̄2−r s̄r . By the triangle inequality R(iv) ≤ λj ∥γ 0j ∥1 + λj ∥γ̂ j − γ 0j ∥1 . Using the weak sparsity index for the nodewise regressions Sλ,j = {k ̸= j : |γj,k | > λj }, write ∥γ 0j ∥1 = (γ 0j )Sλ,j . These terms can then be bounded as follows 1 c (γ 0j )Sλ,j 66 = 1 X 1{|γ 0 j,k k̸=j 0 |≤λj } |γj,k | ≤ λ1−r sr(j) ≤ λ̄1−r s̄r . j c (γ 0j )Sλ,j + 1 2.C Supplementary Results Bounding the L1 norm by the L2 norm, we get 2 (γ 0j )Sλ,j 1 ≤|Sλ,j |∥γ 0j ∥22 ≤ λ−r s̄r ∥γ 0j ∥22 , ¯ To further bound ∥γ 0j ∥22 , consider the matrix Θ = Σ−1 =  1 T PT t=1 −1 E [xt x′t ] and the partitioning " Σ= 1 T 1 T PT t=1 PT t=1 E x2j,t PT 1 t=1 E T P T 1 t=1 E T  E (x−j,t xj,t ) xj,t x′−j,t # x−j,t x′−j,t  . By blockwise matrix inversion, we can write the jth row of Θ as  " #−1  T T X  1 X    1 1 1 ′ ′  = 1 1, (γ 0j )′ . (2.C.2) Θj =  2 , − 2 E xj,t x−j,t E x−j,t x−j,t τj τj T t=1 T t=1 τj2 It then follows that ∥γ 0j ∥22 = X 0 2 X 0 2 τj4 (γj,k ) ≤ 1 + (γj,k ) = τj4 Θj Θ′j ≤ 2 , Λmin k̸=j as 1 Λmin k̸=j is the largest eigenvalue of Θ. For a bound on τj2 , by the definition of γ 0j from eq. (2.7) and Assumption 2.5(ii), it follows that ( " τj2 #) T 2 1 X ′ = min E xj,t − x−j,t γ j γj T t=1 " # T T 2 1 X 1 X  2  ′ ≤E xj,t − x−j,t 0 = E xj,t = Σj,j ≤ C. T t=1 T t=1 Similar arguments can be used to bound τj2 from below. By the proof of Lemma 5.3 in van de Geer et al. (2014), τj2 = 1 Θj,j , and therefore τj2 ≥ Λmin . It then follows from Assump- tion 2.5(ii) that 1 ≤ τj2 ≤ C, uniformly over j ∈ 1, . . . , N. C We therefore have ∥γ 0j ∥2 ≤ τj2 Λmin ≤ C 2 , such that we can bound the fourth term as c R(iv) ≤ λj ∥γ 0j ∥1 + λj ∥γ̂ j − γ 0j ∥1 = λj (γ 0j )Sλ,j q ≤ λ̄2−r s̄r + λ̄ λ−r s̄r C12 + C2 λ̄2−r s̄r 1 + λj (γ 0j )Sλ,j 1 + λj ∥γ̂ j − γ 0j ∥1 ¯ Combining all bounds, we have q h + C1 λ̄2−r s¯r + C2 λ̄2−r s¯r + λ̄2−r s̄r + λ̄2 λ−r s̄r C32 + C4 λ̄2−r s̄r ¯ δT q h = + C5 λ̄2−r s̄r + C6 λ̄2 λ−r s̄r . ¯ δT |τ̂j2 − τj2 | ≤ 67 2 Desparsified Lasso in Time Series For the second statement in Lemma 2.B.6, we have by the triangle inequality and eq. (2.B.1) that |τ̂j2 − τj2 | 1 1 ≤ − ≤ τ̂j2 τj2 τj4 − τj2 |τ̂j2 − τj2 | |τ̂j2 − τj2 | 1 − C|τ̂j2 − τj2 | C2 q h + C5 λ̄2−r s̄r + C6 λ̄2 λ−r s̄r δT ¯  . ≤ q −r h 2−r 2 C7 − C8 δT + C5 λ̄ s̄r + C6 λ̄ λ s̄r ■ ¯ Proof of Lemma 2.B.7. First, note that since Σ̂ is a symmetric matrix o n o n ′ max ∥e′j − Θ̂j Σ̂∥∞ = max ∥Σ̂Θ̂j − ej ∥∞ . 
j∈H j∈H By the extended KKT conditions  (see Section 2.1.1 of van de Geer et al., 2014), we have n o  ′ λ that max ∥Σ̂Θ̂j − ej ∥∞ ≤ max τ̂ 2j ≤ min λ̄ τ̂ 2 . For a lower bound on min τ̂j2 , note } { j∈H j∈H j∈H j j j∈H that by eq. (2.C.1), τ̂j2 can be rewritten as τ̂j2  ′  ∥xj − X −j γ 0j ∥22 ∥X −j γ̂ j − γ 0j ∥22 2 xj − X −j γ 0j X −j γ̂ j − γ 0j = + − + λj ∥γ̂ j ∥1 . T T T With 2 ∥X −j (γ̂ j −γ 0 j )∥2 T ≥ 0 and λj ∥γ̂ j ∥1 ≥ 0 by definition for all j, we have  ′ ∥xj − X −j γ 0j ∥22 2 xj − X −j γ 0j X −j γ̂ j − γ 0j τ̂j2 ≥ − = T T T P 2 vj,t t=1 T −  2v ′j X −j γ̂ j − γ 0j . T The dual norm inequality in combination with the triangle inequality then gives T  1 X 2 2 vj,t − τj2 − max |v ′j xk | ∥γ̂ j − γ 0j ∥1 , T t=1 T k̸=j ) ( T  2 1 1 X 2 ≥ − max vj,t − τj2 − max |v ′j xk | ∥γ̂ j − γ 0j ∥1 , j C T t=1 T k̸=j τ̂j2 ≥ τj2 − (j) where the second line follows from eq. (2.B.1). Then, on the sets LT and ET (T τ̂j2 ≥ C1 − λj 4 ) λj h h h − ∥γ̂ j − γ 0j ∥1 ≥ C1 − − C2 λ2−r − C2 λ̄2−r s̄r , sr(j) ≥ C1 − j δT 2 δT δT where Corollary 2.1 yields the second inequality. As λ̄2−r s̄r → 0, for a large enough T we have that min j 1 ≤ τ̂j2 C1 − h δT 1 − C2 λ̄2−r s̄r from which the result follows. ■ Proof of Lemma 2.B.8. Note that the jth row of the matrix I − Θ̂Σ̂ is e′j − Θ̂j Σ̂, where 68 2.C Supplementary Results Θ̂j is the jth row of Θ̂. Plugging in the definition of ∆, we have    n o √ √ max |∆j | = T max e′j − Θ̂j Σ̂ β̂ − β 0 ≤ T max ∥e′j − Θ̂j Σ̂∥∞ ∥β̂ − β 0 ∥1 . j∈H j∈H j∈H By Lemma 2.A.7, under Assumptions 2.2 and 2.5(ii), on the sets ET (T λ4 ) ∩ CC T (Sλ ), we have ∥X(β̂ − β 0 )∥22 + λ∥β̂ − β 0 ∥1 ≤ Cλ2−r sr , T (2.C.3) from which it follows that ∥β̂ − β 0 ∥1 ≤ Cλ1−r sr . Combining this bound with Lemma 2.B.7 gives √ max |∆j | ≤ T λ1−r sr j∈H C1 − h δT λ̄ . − C2 λ̄2−r s̄r ■ Proof of Lemma 2.B.9. Starting from the nodewise regression model, write  1 1 1 √ v̂ ′j u − v ′j u = √ u′ X −j γ 0j − γ̂ j ≤ √ T T T u′ X ∞ γ̂ j − γ 0j 1 . By the set ET (T λ) and Corollary 2.1, ′ {|u Xj |} √ max j T γ̂ j − γ 0j T 1 √ ≤ T λ γ̂ j − γ 0j 1 √ √ ≤ C T λλ1−r sr(j) ≤ C T λ2−r max s̄r , j where the upper bound is uniform over j ∈ H. ■ Proof of Lemma 2.B.10. By the union bound ( P \ j∈H max s≤T s X )! vj,t ut ≤ x ≥1− t=1 X P max j∈H s≤T ! s X vj,t ut > x . t=1 By the Markov inequality, Lemma 2.B.2 and the mixingale concentration inequality of (Hansen, 1991b, Lemma 2), P max s≤T s X  s P vj,t ut E max ! vj,t ut > x ≤ t=1 s≤T t=1 xm m C1m  ≤ T  P (j) ct t=1 xm 2 m/2 = CT m/2 , xm ■ from which the result follows. Proof of Lemma 2.B.11. Start by writing 1 v̂ ′j u 1 v ′j u 1 √ −√ ≤ √ 2 2 τ̂ τ T j T j T  v̂ ′j u − v ′j u 1 1 + 2 − 2 τ̂j2 τ̂j τj v ′j u √ =: R(i) + R(ii) . T For the first term, we can bound from above using Lemmas 2.B.6 and 2.B.9 and eq. (2.B.1), 69 2 Desparsified Lasso in Time Series all providing bounds uniform over j ∈ H. We then get R(i) ≤ |v̂ ′j u − v ′j u| 1 √ ≤ |τj2 | − |τ̂j2 − τj2 | T √ C5 T λ2−r max s̄r  1/C6 − h δT + C1 λ̄2−r s̄r + C2 q λ̄2 λ−r s̄r . ¯ For the second term, we can bound from above using Lemma 2.B.6 and the set T (j) ET,uv (h1/m T 1/2 ηT−1 ) to get the uniform bound j∈H R(ii) q h1/m ηT−1 δhT + C7 λ̄2−r s̄r h1/m ηT−1 + C8 λ̄2 λ−r s̄r h1/m ηT−1 ¯   . ≤ q C9 − C10 δhT + C1 λ̄2−r s̄r + C2 λ̄2 λ−r s̄r ¯ Combining both bounds gives R(i) + R(ii) q √ 1/m −1 h1/m ηT−1 δhT + C1 h1/m ηT−1 T λ2−r ηT λ̄2 λ−r s̄r max s̄r + C2 h ¯   ≤ q −r C3 − C4 δhT + C1 λ̄2−r s̄r + C2 λ̄2 λ s̄r ¯ ■ from which the result follows. Proof of Lemma 2.B.12. 
The result follows directly from the Markov inequality h i  P ∥d∥∞ > x ≤ x−p E max |dt |p ≤ x−p T max E |dt |p ≤ Cx−p T. t t ■ Proof of Lemma 2.B.13. We can write T T 1 X 1 X (ŵj,t ŵk,t−l − wj,t wk,t−l ) ≤ (ŵj,t − wj,t ) (ŵk,t−l − wk,t−l ) T T t=l+1 + 1 T t=l+1 T X (ŵj,t − wj,t ) wk,t−l + t=l+1 T 1 X wj,t (ŵk,t−l − wk,t−l ) T t=l+1  1  =: R(i) + R(ii) + R(iii) . T Take R(i) first. Using that ŵj,t−q = ût−q v̂j,t−q , straightforward but tedious calculations 70 2.C Supplementary Results show that T X R(i) ≤ (ût − ut ) (ût−l − ut−l ) (v̂j,t − vj,t ) (v̂k,t−l − vk,t−l ) t=l+1 + T X (ût − ut ) (ût−l − ut−l ) (v̂j,t − vj,t ) vk,t−l + t=l+1 + T X T X (ût − ut ) (ût−l − ut−l ) vj,t (v̂k,t−l − vk,t−l ) + t=l+1 + T X T X T X (ût − ut ) (ût−l − ut−l ) vj,t vk,t−l t=l+1 (ût − ut ) ut−l vj,t (v̂k,t−l − vk,t−l ) + t=l+1 + (ût − ut ) ut−l (v̂j,t − vj,t ) (v̂k,t−l − vk,t−l ) t=l+1 T X ut (ût−l − ut−l ) (v̂j,t − vj,t ) (v̂k,t−l − vk,t−l ) t=l+1 ut (ût−l − ut−l ) (v̂j,t − vj,t ) vk,t−l + T X ut ut−l (v̂j,t − vj,t ) (v̂k,t−l − vk,t−l ) =: R(i),i . i=1 t=l+1 t=l+1 9 X p  X −j γ̂ 0 − γ 0j 2 ≤ C T λ̄2−r s̄r on the set PT,nw by Corol  √ ≤ C T λ2−r sr on the set PT,las by Corollary 2.1, lary 2.1, and ∥û − u∥2 = X β̂ − β 0 Using that ∥v̂ j − v j ∥2 = 2 we can use the Cauchy-Schwarz inequality to conclude that  2 R(i),1 ≤ ∥û − u∥22 ∥v̂ j − v j ∥2 ∥v̂ k − v k ∥2 ≤ CT 2 λ2−r sr λ̄2−r s̄r ≤ CT 2 λ2−r . max sr,max On the set ET,u (T 1/2m ) T j∈H ET,vj (T 1/2m ), we have that ∥u∥∞ ≤ CT 1/2m , and ∥v j ∥∞ ≤ C(hT )1/2m , uniformly over j ∈ H. Then we can use this, plus the previous results to find that R(i),2 ≤ ∥v k ∥∞ T X |ût − ut | |ût−l − ut−l | |v̂j,t − vj,t | t=l+1  3/2 1 ≤ ∥v k ∥∞ ∥û − u∥22 ∥v̂ j − v j ∥2 ≤ C(hT ) 2m T 3/2 λ2−r . max sr,max We then find in the same way that  3/2 1 , R(i),3 ≤ ∥u∥∞ ∥û − u∥2 ∥v̂ j − v j ∥2 ∥v̂ k − v k ∥2 ≤ CT 2m T 3/2 λ2−r max sr,max  2−r 3/2 1 2 3/2 R(i),4 ≤ ∥û − u∥2 ∥v j ∥∞ ∥v̂ k − v k ∥2 ≤ C(hT ) 2m T λmax sr,max , 1 R(i),5 ≤ ∥û − u∥22 ∥v j ∥∞ ∥v k ∥∞ ≤ C(hT ) m T λ2−r max sr,max . Defining w̃j,l = (u1 vk,l+1 , . . . , uT vj,T )′ , w̃k,−l = (ul+1 vk,1 , . . . , uT vk,T )′ and ũl = (u1 ul+1 , . . . , uT uT )′ , all with m̄ bounded moments, we find on the set ET,u (T 1/2m ) ∩ ET,ũl (T 1/m ) \ j∈H ET,w̃j,l (T 1/m ) \ ET,w̃k,−l (T 1/m ) k∈H 71 2 Desparsified Lasso in Time Series that 1 R(i),6 ≤ ∥w̃j,l ∥∞ ∥û − u∥2 ∥v̂ k − v k ∥2 ≤ C(hT ) m T λ2−r max sr,max ,  3/2 1 , R(i),7 ≤ ∥u∥∞ ∥û − u∥2 ∥v̂ j − v j ∥2 ∥v̂ k − v k ∥2 ≤ CT 2m T λ2−r max sr,max 1 R(i),8 ≤ ∥w̃k,−l ∥∞ ∥û − u∥2 ∥v̂ j − v j ∥2 ≤ C(hT ) m T λ2−r max sr,max , 1 R(i),9 ≤ ∥ũl ∥2∞ ∥v̂ j − v j ∥2 ∥v̂ k − v k ∥2 ≤ CT m T λ2−r max sr,max . It then follows that  2  3/2 1 R(i) ≤ C1 T λ2−r + C2 h1/2m T (m+1)/2m λ2−r max sr,max max sr,max T + C3 h1/m T 1/m λ2−r max sr,max . For R(ii) we get analogously on the set ET,u (T 1/2m ) ET,vj ((hT )1/2m ) T j∈H R(ii) ≤ T ET,wj ((hT )1/m ) j∈H T 1 X (ût − ut ) (v̂j,t − vj,t ) wk,t−l T t=l+1 + T T 1 X 1 X (ût − ut ) vj,t wk,t−l + ut (v̂j,t − vj,t ) wk,t−l T T t=l+1 t=l+1 ≤ ∥û − u∥2 ∥v̂ j − v j ∥2 ∥wk ∥∞ + ∥û − u∥2 ∥v j ∥∞ ∥wk ∥∞ + ∥u∥∞ ∥v̂ j − v j ∥2 ∥wk ∥∞ , q q 1 1 3 3 1/2 1/2 2m T ≤ C1 (hT ) m T λ2−r λ2−r λ2−r max sr,max + C3 h m T 2m T max sr,max . max sr,max + C2 (hT ) It then follows that 1 T 3/2m (3−m)/2m R(ii) ≤ C1 h1/m T 1/m λ2−r T max sr,max + C2 h q λ2−r max sr,max . Finally, R(iii) follows identically to R(ii) . Collect all sets in the set (j,k) ET,uvw := ET,u (T 1/2m ) \ ET,vj ((hT )1/2m ) j∈H ∩ ET,ũ (T 1/m ) \ j∈H ET,w̃j,l ((hT )1/m ) \ ET,w̃k,−l ((hT )1/m ). 
k∈H Now note that by application of Lemma 2.B.12, we can show that all sets, and by extension their intersection, have a probability of at least 1 − CT −c for some c > 0. Take for instance the sets with x = T 1/m . In that can apply Lemma 2.B.12 with p = m̄ moments to  case −we m̄ 1/m obtain a probability of 1 − C T T = 1 − CT 1−m̄/m , so c = m̄/m − 1 > 0. The sets for p = 2m̄ moments can be treated similarly. For the sets involving intersections over j ∈!H, T Lemma 2.B.12 can be used with an additional union bound argument: P ET,d (x) ≥ j∈H 1 − Cx−p hT . These sets therefore hold with probability at least 1 − C(hT )−c . Since h is non-decreasing, this probability converges no slower than 1 − CT −c . 72 ■ 2.C Supplementary Results Proof of Lemma 2.B.14. Consider the set ( ) T P max T1 (wj,t wk,t−l − Ewj,t wk,t−l ) ≤ h2 χT . As in Lemma 2.A.3, we use the (j,k)∈H 2 t=l+1 Triplex inequality (Jiang, 2009) to show under which conditions this set holds with probability converging to 1. By the union bound, P T 1 X (wj,t wk,t−l − Ewj,t wk,t−l ) ≤ h2 χT T max (j,k)∈H 2 t=l+1 X ≥1− ! P (j,k)∈H 2 T 1 X (wj,t wk,t−l − Ewj,t wk,t−l ) > h2 χT T ! . t=l+1 Let zt = wj,t wk,t−l : T X P ! [zt − Ezt ] > h2 χT (T )  ≤ 2q exp t=l+1 + −T h4 χ2T 288q 2 κ2T  T T X  6 15 X  E |E (z |F ) − E(z )| + E |zt | 1{|zt |>κT } t t−q t h2 T χT t=1 h2 T χT t=1 =: R(i) + R(ii) + R(iii) . We treat the first term last, as we first need to establish the restrictions put on χT , q and κT from R(ii) and R(iii) . For the second term, by Lemma 2.B.2(iii) E |E (zt |Ft−q ) − E(zt )| ≤ ct ψq ≤ Cψq ≤ C1 q −d , −d such that R(ii) ≤ Ch−2 χ−1 . T q P R(ii) → 0. −1 Hence we need χ−1 → 0 as T → ∞, such that T q (j,k)∈H 2 For the third term, we have by Hölder’s and Markov’s inequalities   1−m/2 E |zt | 1{|zt |>κT } ≤ κT E |zt |m/2 1−m/2 so R(iii) ≤ Ch−2 χ−1 T κT 1−m/2 χ−1 T κT . Hence we know that we need to take κT and χT such that P R(iii) → 0. → 0 as T → ∞, giving (j,k)∈H 2 Our goal is to minimize χT while ensuring all conditions are satisfied. We jointly bound all three terms by a sequence ηT → 0: (1) X R(i) ≤ Cqh2 exp (j,k)∈H 2  −T h4 χ2T q 2 κ2T For the steps below, we assume that  ηT h2 −d ≤ ηT , (2) Cχ−1 ≤ ηT , (3) Cχ−1 T q T κT ≤ 1−m/2 1 e =⇒ ≤ ηT . p − ln(ηT /(qh2 )) ≥ 1. First, isolate κT in (1) and (2), Cqh2 exp  −T h4 χ2T q 2 κ2T √  ≤ ηT ⇐⇒ κT ≤ C T h2 χT . q 73 2 Desparsified Lasso in Time Series Cχ−1 T κT 1−m/2  ≤ ηT ⇐⇒ κT ≥ C 1 χT ηT 2/(m−2) . Combining both bounds,  C1 1 χT η T 2/(m−2) √ 2 T h χT ≤ C2 q √ m/(m−2) 2/(m−2) ηT , q ≤ C T h2 χT ⇐⇒ Isolating q from (2), −d Cχ−1 T q  ≤ ηT ⇐⇒ q≥C 1 ηT χT 1/d . Satisfying both bounds on q, √ m/(m−2) 2/(m−2) C1 T h2 χT ηT ≥ C2  1 ηT χT 1/d P When χT satisfies this lower bound, 2d+m−2 − dm+m−2 ⇐⇒ χT ≥ CηT √ 1 − ( T h2 ) 1/d+m/(m−2) . (R(i) + R(ii) + R(iii) ) ≤ 3ηT , and (j,k)∈H 2 P max (j,k)∈H 2 T 1 X (wj,t wk,t−l − Ewj,t wk,t−l ) ≤ h2 χT T ! ≥ 1 − 3ηT , t=l+1 ■ Which completes the proof. −2 ′ Proof of Lemma 2.B.15. By the definition of Θ̂, it follows directly that Θ̂X ′ = Υ̂ V̂ , √ √ −2 ′ where V̂ = (v̂ 1 , . . . , v̂ N ), such that Θ̂X ′ u/ T = Υ̂ V̂ u/ T .  √ p The proof will now proceed by showing that max r N,p Θ̂X ′ u − Υ−2 V ′ u / T − →0 1≤p≤P p and max |r N,p ∆| − → 0. By Lemma 2.B.8, it holds that 1≤p≤P max |∆j | ≤ j∈H √ T λ1−r sr λ̄ =: U∆,T , C1 − ηT − C2 λ̄2−r s̄r on the set PT,las ∩ PT,nw ∩ LT . First note that U∆,T → 0 as the assumption λ2max λ−r min ≤ h i−1 √ √ 2/m 1−r 2−r ηT h T sr,max sr → 0 and λ̄ s̄r → 0. 
Regarding PT,las ∩ implies that T λ̄λ N , and from PT,nw ∩ LT , it follows from Lemma 2.A.4 that P (ET (T λ/4)) ≥ 1 − C T m/2 λm ! n o T λ (j) hN Lemma 2.B.4 that P ET (T 4j ) ≥ 1 − C T m/2 ; both of these probabilities conλm j∈H ¯ ! 1/m T CC T (Sλ,j ) ≥ verge to 1 when λmin ≥ ηT−1 (hN√)T . By Lemma 2.B.3, P CC T (Sλ ) j∈H 1 − 3(1 + h)ηT′ → 1 when hηT′ → 0 and d+m−1 dm+m−1 λ−r min sr,max ≤ CηT  √ 2 1 T d 2 N ( d + m−1 ) + 1 m m−1 . For the former condition, we may let hηT′ ≤ ηT =⇒ ηT′ ≤ ηT h−1 and ηT′−1 ≥ ηT−1 h, and 74 2.C Supplementary Results combining this with the latter condition we require that λ−r min sr,max d+m−1 dm+m−1 " ≤ CηT # 1 1m √ + d m−1 T , 2+ 2 ( ) d m−1 (hN ) which we assume in this lemma. Note that this bound makes redundant the previous bound 1/m λmin ≥ ηT−1 (hN√)T when 0 < r < 1, by arguments similar to those in the proof of Theo√ 1 rem 2.1. The probability of LT converges to 1 by Lemma 2.B.5 when δT ≤ CηT,1 ( T h) 1/d+m/(m−1) . √ 1 We may therefore let δT = CηT,1 ( T h) 1/d+m/(m−1) , where ηT,1 will be addressed later in the proof. We assume that max ∥r N,p ∥1 < C, from which it follows that max |r N,p ∆| ≤ 1≤p≤P 1≤p≤P ∥r N,p ∥1 max |∆j | → 0. Similarly j∈H   √ v ′j u 1 v̂ ′j u max r N,p Θ̂X ′ u − Υ−2 V ′ u / T ≤ max ∥r N,p ∥1 max √ − . j∈H 1≤p≤P 1≤p≤P τj2 T τ̂j2 By Lemma 2.B.11, on the set EV,T := ET (T λ/4) ∩ PT,nw ∩ LT \ ET,uv (h1/m T 1/2 ηT−1 ) (j) j∈H it holds that q √ 2−r −1 h −1 −1 h1/m ηT,2 λ̄2 λ−r s̄r + C1 h1/m ηT,2 T λmax s̄r + C2 h1/m ηT,2 δT v ′j u 1 v̂ ′j u ¯   max √ − ≤ =: UV,T . q j∈H τj2 T τ̂j2 −r h C3 − C4 δT + C1 λ̄2−r s̄r + C2 λ̄2 λ s̄r ¯ Plugging in our choice of δT into the first term in the numerator, −1 h1/m ηT,2 √ 1 h − = C(ηT,1 ηT,2 )−1 h1+1/m ( T h) 1/d+m/(m−1) = C(ηT,1 ηT,2 )−1 δT h m+1 2 + m−1 dm 1 ! 1/d+m/(m−1) √ T . We may choose ηT,1 and ηT,2 such that (ηT,1 ηT,2 )−1 grows arbitrarily slowly. Therefore, m+1 this term converges to 0 when then converge to 0 when 2 + h dm√ m−1 T λ2max λ−r min → 0. The two other terms in the numerator h i−1 √ 2/m ≤ ηT h T sr,max . Under these rates the denom- inator then converges to C3 , which gives UV,T → 0. The only new set appearing in EV,T T (j) ET,uv (h1/m T 1/2 ηT−1 ), whose probability converges to 1 by Lemma 2.B.10. It follows is j∈H directly that   √ p RN Θ̂X ′ u − Υ−2 V ′ u / T − → 0. ■ β Ω Proof of Lemma 2.B.16. The following bounds on RN,T and RN,T hold on the set PT,las ∩ PT,nw ∩ LT ∩ ET,uvw ∩ ET,ww   1 √ − 1/d+m/(m−2) ηT−1 h2 T h2 , 75 2 Desparsified Lasso in Time Series h i−1 √ 2/m which holds with probability converging to 1 when λ2max λ−r T sr,max , min ≤ ηT h  1 1m  m+1 + 2 d+m−1 √ + dm+m−1 d m−1 T h dm√ m−1 → 0, λ−r , and, if r = 0, λmin ≥ min sr,max ≤ CηT T (2+ 2 ) (hN ) d m−1 1/m ηT−1 (hN√)T , see the proof of Theorem 2.3 for details. Under Assumption 2.6, m and d may be arbitrarily large, and assuming polynomial growth rates allows us to simplify these conditions to the following: 1/2 + b 1/2 − b <ℓ< , 2−r r 1/2 + b r=0: < ℓ < 1/2. 2−r 0<r<1: These bounds are feasible when b < 1−r . 2 By eq. (2.B.2) Ω , RN,T ≤ C1 ∆τ [1 + ∆τ + ∆τ ∆ω] + C2 Q1−d−δ T where δ > 0, 1 1 ∆τ = max 2 − 2 ≤ j∈H τ̂j τj q + C1 λ̄2−r s̄r + C2 λ̄2 λ−r s̄r ¯  , q h 2−r C3 − C4 δT + C1 λ̄ s̄r + C2 λ̄2 λ−r s̄r h δT ¯ √ 1 with δT = CηT,1 ( T h) 1/d+m/(m−1) , and ∆ω = max (j,k)∈H 2  h i2 1 1 N,QT ω̂j,k − ωj,k ≤ (2QT + 1) C1 T 1/2 λ2−r + C2 h m T m λ2−r max sr,max max sr,max q h 1 m+1 i3 3−m 3 2 2−r h m T m λ2−r max sr,max + C4 h 3m T 3m λmax sr,max  1 √ − 1/d+m/(m−2) 2 −1 2 Th . 
+C5 ηT h + C3 Q1−d−δ is dominated by the term C1 ∆τ [1 + ∆τ + ∆τ ∆ω], since d may be arbitrarily T large, and we can limit the analysis to ∆τ and ∆ω. For ∆τ , we first consider the numerator of the upper bound   q 1 1 h H−(H+1/2) 1/d+m/(m−1) + C1 λ̄2−r s̄r + C2 λ̄2 λ−r s̄r =O T + T b−ℓ(2−r) + T 2 (b−ℓ(2−r)) ¯ δT   1 =O T ϵ−1/2 + T b−ℓ(2−r) + T 2 (b−ℓ(2−r)) , for some arbitrarily small ϵ > 0. From the earlier conditions, 1/2+b 2−r < ℓ =⇒ b − ℓ(2 − r) < −1/2, which implies  that the numerator converges to 0, and that it converges at the rate 1 (b−ℓ(2−r)) 2 of O T , since the two other terms have a smaller exponent of T . The same expression from the numerator also  1appears inthe denominator, so the latter converges to a non-zero constant, and ∆τ = O T 2 (b−ℓ(2−r)) . 76 2.C Supplementary Results For ∆ω, we may simplify the upper bound as follows  h i2 1 1 2−r sr,max (2QT + 1) C1 T 1/2 λ2−r + C2 h m T m λmax max sr,max  q 1 √ h 1 m+1 − i3 3−m 3 2 1/d+m/(m−2) 2−r + C3 h m T m λ2−r +C5 ηT−1 h2 T h2 max sr,max + C4 h 3m T 3m λmax sr,max i  h 3 1 = O T Q T 2(1/2+b−ℓ(2−r)) + T ϵ+b−ℓ(2−r) + T ϵ+ 2 (−1+b−ℓ(2−r)) + T ϵ+ 2 (1/3+b−ℓ(2−r)) + T ϵ−1/2   = O T Q+2(1/2+b−ℓ(2−r)) + T Q+ϵ−1/2 . Since ∆τ → 0,  ∆τ [1 + ∆τ + ∆τ ∆ω] =O ∆τ + [∆τ ]2 ∆ω  1  =O T 2 (b−ℓ(2−r)) + T Q+1+3(b−ℓ(2−r)) + T Q−1/2+(b−ℓ(2−r)) .  1 1 When Q < min −1 − 65 (b − ℓ(2 − r)),  2 − 2 (b − ℓ(2 − r)) , the first term dominates the 1 (b−ℓ(2−r)) Ω . Note that since b − ℓ(2 − r) < −1/2, this bound on others, and RN,T = O T 2 Q is satisfied when Q < 2/3. Following the proof of Lemma 2.B.15, β RN,T := max r N,p 1≤p≤P Υ−2 V ′ u Θ̂X ′ u √ √ +∆− T T ! ≤ U∆,T + UV,T , (2.C.4) where √ U∆,T = T λ1−r sr λ̄ , C1 − ηT − C2 λ̄2−r s̄r and UV,T q √ 1/m −1 h1/m ηT−1 δhT + C1 h1/m ηT−1 T λ2−r ηT λ̄2 λ−r s̄r max s̄r + C2 h ¯   = , q C3 − C4 δhT + C1 λ̄2−r s̄r + C2 λ̄2 λ−r s̄r ¯   √ 1 with δT = CηT,1 ( T h) 1/d+m/(m−1) . For U∆,T , the numerator is of order O T 1/2+b−ℓ(2−r) ,     and the denominator of order O 1 + T b−ℓ(2−r) = O(1), so U∆,T = O T 1/2+b−ℓ(2−r) . For UV,T , note that each term in the numerator is multiplied by h1/m ηT−1 , which we can take to be O(T ϵ ) for an arbitrarily small ϵ > 0. The remainder of the numerator is then   q √ 1 1 h H−(H+1/2) 1/d+m/(m−1) + C1 T λ2−r + T 1/2+b−ℓ(2−r) + T 2 (b−ℓ(2−r)) λ̄2 λ−r s̄r =O T max s¯r + C2 ¯ δT   1 =O T ϵ−1/2 + T 1/2+b−ℓ(2−r) + T 2 (b−ℓ(2−r)) ,   =O T ϵ−1/2 + T 1/2+b−ℓ(2−r) . Since the denominator contains the same expression as ∆τ , it converges to a non-zero con- 77 2 Desparsified Lasso in Time Series i  h 1 stant, and UV,T = O T ϵ T −1/2 + T 2 (b−ℓ(2−r)) . Combining these terms,  h i   1 β RN,T = O T 1/2+b−ℓ(2−r) + T ϵ T −1/2 + T 2 (b−ℓ(2−r)) = O T ϵ−1/2 + T 1/2+b−ℓ(2−r) . Finally, as mentioned at the start of the proof, these results hold on a set whose probability converges to 1. We therefore replace O(·) with Op (·) and the proof is complete. 2.C.3 ■ Illustration of conditions for Corollary 2.1 Example 2.C.1. The requirements of Corollary 2.1 are satisfied when N ∼ T a for a > 0, sr ∼ T b for b > 0, and λ ∼ T −ℓ for 0<r<1: r=0:      b 1 1 1 m 1 1 <ℓ< 1 − b + − 2a + , m 1−r d m−1 d m−1 r( d + m−1 ) 2 b 1 a <ℓ< − . 1−r 2 m This choice of ℓ is feasible when      1 m 1 1 2b + + 4a + < 1. 1−r d m−1 d m−1 (2.C.5) Figure 2.2 demonstrates which values of a, b, m, d, and r are feasible, as well as how many moments m are required for different combinations of the other parameters. 78 2.C Supplementary Results Figure 2.2: Required moments m implied by eq. (2.C.5). 
Contours mark intervals of 10 moments, and values above m = 100 are truncated to 100. Non-shaded areas indicate infeasible regions. 79 2 Desparsified Lasso in Time Series 2.C.4 Properties of induced p-norms for 0 ≤ p < 1 Lemma 2.C.1 (0 < p < 1). For matrices A, B ∈ Rn×m with column vectors aj and bj ∥Ax∥r ∥x∥p and 0 < p < 1, define the induced pseudo-norm ∥A∥p = max x̸=0 P 1/p p for a vector x the pseudo-norm ∥x∥p = . j |xj | = max ∥Ax∥p , where ∥x∥p =1 (1) ∥c × A∥p = |c| ∥A∥p (2) ∥A∥p = maxj ∥aj ∥p (3) ∥AB∥p ≤ ∥A∥p ∥B∥p (4) ∥A + B∥pp ≤ ∥A∥pp + ∥B∥pp (5) m1/2−1/r ∥A∥2 ≤ ∥A∥r ≤ n1/r−1/2 ∥A∥2 Proof. We can show the p-norm satisfied absolute homogeneity, i.e. for a scalar c, !1/p !1/p X ∥xc∥p = |xj × a| p |c| = p X |xj | p !1/p X = |c| |xj | = |c| ∥x∥p . j j j p Property (1) then follows: ∥Axc∥p ∥c × A∥p = max ∥x∥p x̸=0 = max |c| ∥Ax∥p = |c| ∥A∥p ∥x∥p x̸=0 By absolute homogeneity, the alternative definition of ∥·∥p follows from max ∥Ax∥p x̸=0 ∥x∥p = max x̸=0 Ax ∥x∥p = max A x̸=0 p x ∥x∥p = max ∥Ay∥p . p ∥y∥p =1 Property (2) follows from the following arguments: !p ∥A∥pp = max ∥Ax∥p ∥x∥p =1 p = max  ∥x∥p =1 ∥Ax∥pp  X = max ∥x∥p =1 ! ≤ max X ∥x∥p =1 ∥aj xj ∥pp X = max ∥x∥p =1 j |xj |p ∥aj ∥pp max ∥x∥p =1 j |xj | ∥aj ∥pp . j P |xj |p = 1. We can therefore rewrite as j ! p p ! Note that the condition ∥x∥p = 1 ⇐⇒ ∥x∥pp = X j aj xj ! = X max P yj ≥0, yj =1 yj ∥aj ∥pp . j This maximum is then straightforward to evaluate: check which ∥aj ∥pp is the largest, and set its corresponding yj to 1. This gives us an upper bound on the induced norm: ∥A∥pp ≤ max ∥aj ∥pp ⇐⇒ ∥A∥p ≤ max ∥aj ∥p . j 80 j 2.C Supplementary Results The inequality can also be shown in the other direction: For any j, we may write ∥aj ∥p = ∥Aej ∥p , where ej is the jth basis vector. ej is a vector which satisfies ∥ej ∥p = 1, so we may upper bound ∥aj ∥p = ∥Aej ∥p ≤ max ∥Ax∥p = ∥A∥p . Note that the inequality ∥x∥p =1 ∥aj ∥ ≤ ∥A∥p holds for all j, including the j which maximizes ∥aj ∥p . Therefore, we have maxj ∥aj ∥p ≤ ∥A∥p , and property (2) holds by the sandwich theorem. Property (3) follows from ∥A∥p = max ∥Ax∥p ∥x∥p x̸=0 ≥ ∥Ax∥p =⇒ ∥Ax∥p ≤ ∥A∥p ∥x∥p , ∀x ̸= 0, ∥x∥p ∥AB∥p = max ∥ABx∥p ≤ max ∥A∥p ∥Bx∥p ≤ max ∥A∥p ∥B∥p ∥x∥p = ∥A∥p ∥B∥p . ∥x∥p =1 ∥x∥p =1 ∥x∥p =1 Property (4) follows from the following arguments. ∥·∥pp satisfies the triangle inequality: ∥x + y∥pp = X |xj + yj |p ≤ X j ∥A + B∥pp = |xj |p + X j max ≤ max !p = max ∥x∥p ∥Ax∥pp x̸=0 j ∥(A + B)x∥p x̸=0 ∥x∥pp |yj |p = ∥x∥pp + ∥y∥pp . ∥(A + B)x∥pp x̸=0 + max x̸=0 ∥Bx∥pp ∥x∥pp ∥x∥pp ≤ max ∥Ax∥pp + ∥Bx∥pp x̸=0 ∥x∥pp = ∥A∥pp + ∥B∥pp . For property (5), by the Cr -inequality we have for x ∈ Rn  n X (x2i )p/2 ∥x∥p =  #1/2 !2/p 1/2 " n X 2 2/p−1  xi = n1/p−1/2 ∥x∥2 , ≤ n i=1 i=1 and also ∥x∥2 ≤ ∥x∥p . Consequently ∥A∥p = max ∥Ax∥p x̸=0 ∥x∥p ≤ max x̸=0 n1/p−1/2 ∥Ax∥2 ∥Ax∥2 ≤ n1/p−1/2 max = n1/p−1/2 ∥A∥2 . x̸=0 ∥x∥ ∥x∥p 2 Similarly, ∥A∥2 = max x̸=0 ∥Ax∥p ∥Ax∥p ∥Ax∥2 ≤ max ≤ max 1/2−1/p = m1/p−1/2 ∥A∥p . x̸ = 0 x̸ = 0 ∥x∥2 ∥x∥2 m ∥x∥p ■ Lemma 2.C.2 (p = 0). For matrices A and B with column vectors aj and bj define the induced pseudo-norm ∥A∥0 = max x̸=0 P j 1(|xj | > 0). ∥Ax∥0 , ∥x∥0 where for a vector x the pseudo-norm ∥x∥0 = (1) ∥c × A∥0 = ∥A∥0 , for c ̸= 0 (2) ∥A∥0 = maxj ∥aj ∥0 (3) ∥AB∥0 ≤ ∥A∥0 ∥B∥0 (4) ∥A + B∥0 ≤ ∥A∥0 + ∥B∥0 81 2 Desparsified Lasso in Time Series Proof. For property (1), note that ∥x∥0 = ∥xc∥0 for any scalar c ̸= 0: ∥c × A∥0 = max x̸=0 ∥Axc∥0 ∥Ax∥0 = max = ∥A∥0 . 
x̸ = 0 ∥x∥0 ∥x∥0 For property (2), let S(x) be the index set {j : |xj | > 0} with cardinality |S(x)|; note that ∥x∥0 = |S(x)|. Furthermore, note that the 0-norm satisfies the triangle inequality: ∥x + y∥0 = X 1(|xj + yj | > 0) ≤ j P j∈S(x) |S(x)| x̸=0 1(|xj | > 0) + 1(|yj | > 0) = ∥x∥0 + ∥y∥0 . j ∥Ax∥0 = max ∥A∥0 = max x̸=0 ∥x∥ x̸=0 P0 ∥aj ∥0 = max X aj xj j P 0 |S(x)| ≤ max x̸=0 j P ∥aj xj ∥0 |S(x)| = max ∥aj xj ∥0 j∈S(x) x̸=0 |S(x)| |S(x)| max ∥aj ∥0 ≤ max x̸=0 j |S(x)| = max ∥aj ∥0 . j This inequality can also be shown in the other direction: For any j, we may write ∥aj ∥0 = ∥Aej ∥ ∥Aej ∥0 = e 0 , where ej is the jth basis vector, noting that ∥ej ∥0 = 1. ej is a vector ∥ j ∥0 ∥Aej ∥ ∥Ax∥ which satisfies ej ̸= 0, so we may upper bound ∥aj ∥0 = e 0 ≤ max ∥x∥ 0 = ∥A∥0 . Note 0 ∥ j ∥0 x̸=0 that the inequality ∥aj ∥0 ≤ ∥A∥0 holds for all j, including the j which maximizes ∥aj ∥0 . Therefore, we have maxj ∥aj ∥0 ≤ ∥A∥0 , and property (2) holds by the sandwich theorem. Property (3) follows from ∥A∥0 = max x̸=0 ∥Ax∥0 ∥Ax∥0 ≥ =⇒ ∥Ax∥0 ≤ ∥A∥0 ∥x∥0 , ∀x ̸= 0, ∥x∥0 ∥x∥0 ∥AB∥0 = max x̸=0 ∥ABx∥0 ∥A∥0 ∥Bx∥0 ∥A∥0 ∥B∥0 ∥x∥0 ≤ max ≤ max = ∥A∥0 ∥B∥0 . x̸=0 x̸=0 ∥x∥0 ∥x∥0 ∥x∥0 Property (4) follows from the triangle inequality of the 0-norm: ∥A + B∥0 = max x̸=0 ≤ max x̸=0 ∥(A + B)x∥0 ∥Ax∥0 + ∥Bx∥0 ≤ max x̸=0 ∥x∥0 ∥x∥0 ∥Ax∥0 ∥Bx∥0 + max = ∥A∥0 + ∥B∥0 . x̸=0 ∥x∥ ∥x∥0 0 ■ 2.C.5 Additional notes on Examples 2.5 and 2.6 Using the properties of p-norms for 0 ≤ p < 1 described in section 2.C.4, we provide further details on Examples 2.5 and 2.6. 82 2.C Supplementary Results 2.C.5.1 Example 2.5: Sparse factor model Recall the factor model ′ yt = β 0 xt + ut , ut ∼ IID(0, 1) xt = Λ f t + ν t , ν t ∼ IID(0, Σν ), N ×kk×1 f t ∼ IID(0, Σf ), where Λ has bounded elements, Σf and Σν are positive definite with bounded eigenvalues, and ν t and f t uncorrelated. We make the following assumptions on the factor loadings: C1 N a ≤ λmin (Λ′ Λ) ≤ λmax (Λ′ Λ) ≤ C2 N b , 0 < a ≤ b ≤ 1. (2.C.6) These assumptions imply that the k largest eigenvalues of Σ = ΛΣf Λ′ + Σν diverge at rates between N a and N b , while the remaining N − k + 1 eigenvalues do not diverge. This holds as we can bound the largest eigenvalue λmax (Σ) from above by λmax (Σ) ≤ λmax (ΛΣf Λ′ ) + λmax (Σν ) ≤ λmax (Σf )λmax (Λ′ Λ) + λmax (Σν ) ≤ C1 N b + C2 . Similarly, we can bound bound the k-th largest eigenvalue λk (Σ) using Weyl’s inequality and the min-max theorem from below by   x′ ΛΣf Λ′ x λk (Σ) ≥ λk (ΛΣf Λ′ ) + λmin (Σν ) = max min dim(U) = N − k + 1 + λmin (Σν ) U x∈U \0 x′ x   x′ ΛΛ′ x ≥ λmin (Σf ) max min dim(U) = N − k + 1 + λmin (Σν ) U x∈U \0 x′ x = λmin (Σf )λk (ΛΛ′ ) + λmin (Σν ) = λmin (Σf )λmin (ΛΛ′ ) + λmin (Σν ) ≥ C1 N a + C2 , where we used that λk (ΛΛ′ ) = λk (Λ′ Λ) = λmin (Λ′ Λ). Therefore, this assumption generates a weak factor model if b < 1, while if b = 1 but a < 1 some factors, but not all, are weak; see e.g. Uematsu and Yamagata (2022a,b) and the references therein.11 If a = b = 1 we have the standard strong factor model with dense loadings. Sparse factor loadings satisfy these assumptions. In particular, from Lemma 2.C.1(5) we find that λmax (Λ′ Λ) = ∥Λ∥22 ≤ k2/r−1 ∥Λ∥2r ; thus, with a fixed number k of factors, the sparsity of Λ provides an upper bound for the strength of divergence of the largest eigenvalues.12 Sparse factor models may provide accurate descriptions of various economic and financial datasets. 
We now derive the sparsity bound of Example 2.5. We bound γ_j^0 based on the fact that Θ = Υ^{−2}Γ, where Υ^{−2} = diag(1/τ_1^2, . . . , 1/τ_N^2) and Γ is the N × N matrix with ones on the diagonal and off-diagonal elements Γ_{i,j} = −γ_{i,j}. This result follows from the definition of γ_j^0 as linear projection coefficients, and the block matrix inverse identity for Θ. Then

max_j ∥γ_j^0∥_r^r ≤ 1 + max_j ∥γ_j^0∥_r^r = max_j ∥(1, −γ_j^{0′})′∥_r^r = ∥Γ∥_r^r = ∥(Υ^{−2})^{−1}Θ∥_r^r ≤ ∥(Υ^{−2})^{−1}∥_r^r ∥Θ∥_r^r = max_j τ_j^{2r} ∥Θ∥_r^r ≤ C ∥Θ∥_r^r,

where max_j τ_j^{2r} ≤ C follows from eq. (2.B.1). Note that when r = 0, these steps follow similarly, noting that ∥(Υ^{−2})^{−1}∥_0 = 1, and therefore C = 1. By the Woodbury matrix identity,

Θ = Σ^{−1} = Σ_ν^{−1} − Σ_ν^{−1}(Λ/N^a)(Σ_f^{−1}/N^a + Λ′Σ_ν^{−1}Λ/N^a)^{−1}Λ′Σ_ν^{−1}.

Then

∥Θ∥_r^r ≤ ∥Σ_ν^{−1}∥_r^r + ∥Σ_ν^{−1}∥_r^r ∥Λ/N^a∥_r^r ∥(Σ_f^{−1}/N^a + Λ′Σ_ν^{−1}Λ/N^a)^{−1}∥_r^r ∥Λ′∥_r^r ∥Σ_ν^{−1}∥_r^r.

As for positive semidefinite symmetric matrices A and B we have that

∥(A + B)^{−1}∥_2 ≤ 1/λ_min(A + B) ≤ 1/(λ_min(A) + λ_min(B)) ≤ 1/λ_min(B),

it follows that

∥(Σ_f^{−1}/N^a + Λ′Σ_ν^{−1}Λ/N^a)^{−1}∥_2 ≤ 1/λ_min(Λ′Σ_ν^{−1}Λ/N^a) ≤ 1/(λ_min(Σ_ν^{−1})λ_min(Λ′Λ/N^a)).

As λ_min(Σ_ν^{−1}) = 1/λ_max(Σ_ν) ≥ 1/C, it follows from our assumptions that λ_min(Λ′Λ/N^a) ≥ C and therefore ∥(Σ_f^{−1}/N^a + Λ′Σ_ν^{−1}Λ/N^a)^{−1}∥_2 ≤ C. It then also follows from Lemma 2.C.1(5) that ∥(Σ_f^{−1}/N^a + Λ′Σ_ν^{−1}Λ/N^a)^{−1}∥_r^r ≤ Ck^{1−r/2} and

∥Θ∥_r^r ≤ ∥Σ_ν^{−1}∥_r^r + Ck^{1−r/2} ∥Σ_ν^{−1}∥_r^r ∥Λ/N^a∥_r^r ∥Λ′∥_r^r ∥Σ_ν^{−1}∥_r^r.  (2.C.7)

With ∥Λ′∥_r^r ≤ Ck, we then find the bound ∥Θ∥_r^r ≤ ∥Σ_ν^{−1}∥_r^r + Ck^{2−r/2} N^{−ra} ∥Σ_ν^{−1}∥_r^{2r} ∥Λ∥_r^r.

We provide two examples of Σ_ν such that Σ_ν^{−1} is sparse. For block diagonal structures, this follows trivially, since the inverse maintains the same block diagonal structure. For a Toeplitz structure Σ_{ν,i,j} = ρ^{|i−j|}, by Section 8.8.4 of Gentle (2007), Σ_ν^{−1} is the tridiagonal matrix with entries (Σ_ν^{−1})_{1,1} = (Σ_ν^{−1})_{N,N} = 1/(1 − ρ^2), (Σ_ν^{−1})_{i,i} = (1 + ρ^2)/(1 − ρ^2) for 1 < i < N, (Σ_ν^{−1})_{i,i±1} = −ρ/(1 − ρ^2), and zeros elsewhere. We can then bound

∥Σ_ν^{−1}∥_r^r = max_j ∥Σ_{ν,·,j}^{−1}∥_r^r = ∥Σ_{ν,·,⌈N/2⌉}^{−1}∥_r^r = [(1 + ρ^2)^r + 2|ρ|^r] / |1 − ρ^2|^r ≤ C,

or simply max_j ∥Σ_{ν,·,j}^{−1}∥_0 = 3 for r = 0.

Note that a (potentially weak) factor model without sparse loadings does not yield a sufficiently sparse matrix Θ for all values of r. In eq. (2.C.7) we may try to bound ∥Λ∥_r^r directly, using Lemma 2.C.1(5) to bound ∥Λ/N^a∥_r^r ≤ N^{1+(b−2a−1)r/2} [λ_max(Λ′Λ/N^b)]^{r/2}, such that ∥Θ∥_r^r ≤ ∥Σ_ν^{−1}∥_r^r (1 + Ck^{2−r/2} N^{1+(b−2a−1)r/2}). This is not a tight enough bound to guarantee sparsity of Θ. To illustrate, for the standard dense factor model with a = b = 1 and k fixed, we get ∥Θ∥_r^r ≤ CN^{1−r}. Weaker divergence of the eigenvalues even increases the power of N.
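As a rough numerical companion to this derivation (again an illustration under assumed designs, not part of the chapter's analysis; the helper max_col_rnorm is a stand-in), the sketch below evaluates ∥Θ∥_r^r = max_j Σ_i |Θ_{ij}|^r directly for sparse and dense loadings, reusing the Toeplitz Σ_ν with ρ = 0.5 and Σ_f = I_k from the previous sketch. The sparse design yields a much smaller and more slowly growing value, in line with the N^{−ra}∥Λ∥_r^r term in the bound following eq. (2.C.7), whereas the dense design grows roughly in line with the CN^{1−r} bound discussed above.

```python
import numpy as np

def max_col_rnorm(Theta, r):
    """||Theta||_r^r = max_j sum_i |Theta_{ij}|^r, using Lemma 2.C.1(2)."""
    return np.max(np.sum(np.abs(Theta) ** r, axis=0))

rng = np.random.default_rng(0)
k, a, r = 3, 0.5, 0.5
for N in (100, 400, 1600):
    s = int(np.ceil(N ** a))
    Sigma_nu = 0.5 ** np.abs(np.subtract.outer(np.arange(N), np.arange(N)))  # Toeplitz, rho = 0.5
    Lam_sparse = np.zeros((N, k))
    for j in range(k):
        Lam_sparse[j * s:(j + 1) * s, j] = 1.0       # roughly N^a nonzero loadings per factor
    Lam_dense = rng.standard_normal((N, k))          # dense loadings: a = b = 1
    for label, Lam in (("sparse", Lam_sparse), ("dense", Lam_dense)):
        Theta = np.linalg.inv(Lam @ Lam.T + Sigma_nu)  # Sigma_f = I_k
        print(N, label, max_col_rnorm(Theta, r))
```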
2.C.5.2 Example 2.6: Sparse VAR(1)

Recall the sparse VAR(1) model

z_t = Φz_{t−1} + u_t,  E u_t u_t′ := Ω,  E u_t u_{t−l}′ = 0, ∀l ≠ 0,

with our regression of interest being y_t = ϕ_1 z_{t−1} + u_{1,t}. For Example 2.6(a) with a symmetric block-diagonal coefficient matrix Φ and the error covariance matrix Ω being the identity, Σ = Σ_{q=0}^∞ Φ^q Ω Φ′^q = Σ_{q=0}^∞ Φ^{2q} = (I − Φ^2)^{−1}, where Φ^0 = Φ′^0 = I, and we can simplify Θ = Σ^{−1} = I − Φ^2. Note that I − A is invertible iff 1 is not an eigenvalue of A. Since the eigenvalues of Φ^2 are between (and not including) 0 and 1, Σ exists. I − Φ^2 inherits the block diagonal structure of Φ, so we may bound

max_j ∥γ_j^0∥_r^r ≤ C ∥Θ∥_r^r ≤ Cb.

This result can be extended to the case where Ω has the same block diagonal structure as the VAR coefficient matrix Φ. While the simplified expression for Σ provided above no longer holds, both Σ and Σ^{−1} remain block diagonal when Ω and Φ share the same block structure. As a result, the nonzero structure of γ_j^0 remains unaltered.

For Example 2.6(b) with a diagonal Φ and Toeplitz Ω, we can simplify Σ = Σ_{q=0}^∞ Φ^q Ω Φ′^q = Σ_{q=0}^∞ ϕ^{2q} Ω = Ω/(1 − ϕ^2), and by similar arguments to Section 2.C.5.1, Θ = (1 − ϕ^2) Ω^{−1}, where Ω^{−1} takes the same tridiagonal form as Σ_ν^{−1} above. The precision matrix is clearly sparse in this case, and

max_j ∥γ_j^0∥_r^r ≤ C ∥Θ∥_r^r ≤ C.

Finally, we numerically investigated the extension where the VAR coefficient matrix also has a Toeplitz structure, namely Φ_{i,j} = 0.4^{1+|i−j|}. We vary the dimension between N = 10 and N = 1000 and display the boundedness in r-norm of the parameter vector in the nodewise regressions in Figure 2.3 for different values of r. We use a log scale since this sparsity grows by orders of magnitude for decreasing r.

Figure 2.3: Example 2.6(b): We display ln(max_j ∥γ_j^0∥_r^r) for N between 10 and 1000, and r between 0.1 and 0.9. [Plot of the log sparsity in the L_r-norm against N, colored by r.]

2.C.6 Algorithmic details for choosing the lasso tuning parameter

Algorithm 2.1: Plug-in choice of λ
1: At k = 0, initialize λ^(0) ← ∥X′y∥_∞/T and û^(0) ← y − (1/T) Σ_{t=1}^T y_t;
2: while 1 ≤ k ≤ K do
3:   Obtain the estimated long-run covariance matrix Ω̂^(k) as in eq. (2.9), with Ξ̂(l) = (1/(T − l)) Σ_{t=l+1}^T x_t û_t^{(k−1)} û_{t−l}^{(k−1)} x_{t−l}′;
4:   while 1 ≤ b ≤ B do
5:     Draw ĝ^(b) from N(0, Ω̂^(k));
6:     m_b ← ∥ĝ^(b)∥_∞;
7:   λ^(k) ← c q_{(1−α)}/√T, where q_{(1−α)} is the (1 − α)-quantile of m_1, . . . , m_B;
8:   if |λ^(k) − λ^(k−1)|/λ^(k−1) < ϵ then
9:     λ ← λ^(k);
10:    break;
11:  Estimate β̂^(k) with the lasso using λ^(k) as the tuning parameter;
12:  û^(k) ← y − Xβ̂^(k);
13: λ ← λ^(K);

We set K = 15, ϵ = 0.01, B = 1000, α = 0.05, and c = 0.8 throughout the simulation study.
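For readers who prefer code to pseudocode, the following Python sketch mirrors the structure of Algorithm 2.1. It is only a structural outline: the function name plug_in_lambda and its defaults are placeholders, eq. (2.9) is not reproduced in this appendix, so the long-run covariance estimator below uses a Bartlett-kernel (Newey-West) construction with an ad hoc bandwidth as a stand-in, and the lasso step is delegated to scikit-learn's Lasso, whose penalty scaling may differ from the chapter's implementation by a constant factor.

```python
import numpy as np
from sklearn.linear_model import Lasso  # stand-in lasso solver, not the chapter's implementation

def plug_in_lambda(X, y, c=0.8, alpha=0.05, B=1000, K=15, eps=0.01, bandwidth=None, seed=0):
    """Sketch of Algorithm 2.1: iterative plug-in choice of the lasso penalty."""
    rng = np.random.default_rng(seed)
    T, N = X.shape
    Q = int(np.floor(T ** (1 / 3))) if bandwidth is None else bandwidth  # ad hoc bandwidth choice
    lam = np.max(np.abs(X.T @ y)) / T        # lambda^(0) = ||X'y||_inf / T
    u = y - y.mean()                         # u_hat^(0) = y minus its sample mean
    for _ in range(K):
        xu = X * u[:, None]                  # rows x_t * u_hat_t
        Omega = xu.T @ xu / T                # Xi_hat(0)
        for l in range(1, Q + 1):            # Bartlett weights as a stand-in for eq. (2.9)
            Xi_l = xu[l:].T @ xu[:-l] / (T - l)
            Omega += (1 - l / (Q + 1)) * (Xi_l + Xi_l.T)
        g = rng.multivariate_normal(np.zeros(N), Omega, size=B)
        q = np.quantile(np.max(np.abs(g), axis=1), 1 - alpha)   # (1-alpha)-quantile of ||g||_inf
        lam_new = c * q / np.sqrt(T)
        if abs(lam_new - lam) / lam < eps:   # relative change below epsilon: stop
            return lam_new
        lam = lam_new
        beta = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_
        u = y - X @ beta                     # update residuals for the next iteration
    return lam
```

The update order (estimate Ω̂ from the previous residuals, recompute λ, check convergence, then refit the lasso) follows steps 3 to 12 of the pseudocode above.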
2.C.7 Additional simulation details

Figure 2.4: Model A, ρ heat map coverage: Contours mark the coverage thresholds at 5% intervals, from 75% to the nominal 95%, from dark green to white respectively. Units on the axes are not proportional to the λ-value but rather its position in the grid. The value of λ is (10T)^{−1} at 0, and increases exponentially to a value that sets all parameters to zero at 50. Plots are based on 100 replications, with colored dots representing combinations of λ's selected by PI (purple), AIC (red), BIC (blue), EBIC (yellow). [Grid of heat map panels for N ∈ {101, 201, 501, 1001} and T ∈ {100, 200, 500, 1000}, with the nodewise λ on the horizontal axis, the initial λ on the vertical axis, and coverage shown on a 0 to 1 color scale.]

Figure 2.5: Model A, β_1 heat map coverage: Contours mark the coverage thresholds at 5% intervals, from 75% to the nominal 95%, from dark green to white respectively. Units on the axes are not proportional to the λ-value but rather its position in the grid. The value of λ is (10T)^{−1} at 0, and increases exponentially to a value that sets all parameters to zero at 50. Plots are based on 100 replications, with colored dots representing combinations of λ's selected by PI (purple), AIC (red), BIC (blue), EBIC (yellow). [Grid of heat map panels for the same N and T combinations, with the nodewise λ on the horizontal axis, the initial λ on the vertical axis, and coverage shown on a 0 to 1 color scale.]
Figure 2.6: Model B, ρ heat map coverage: Contours mark the coverage thresholds at 5% intervals, from 75% to the nominal 95%, from dark green to white respectively. Units on the axes are not proportional to the λ-value but rather its position in the grid. The value of λ is (10T)^{−1} at 0, and increases exponentially to a value that sets all parameters to zero at 50. Plots are based on 100 replications, with colored dots representing combinations of λ's selected by PI (purple), AIC (red), BIC (blue), EBIC (yellow). [Grid of heat map panels for the same N and T combinations, with the nodewise λ on the horizontal axis, the initial λ on the vertical axis, and coverage shown on a 0 to 1 color scale.]
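As a side note on how such a grid can be constructed, the snippet below builds a 51-point exponentially spaced penalty sequence running from (10T)^{−1} to the smallest λ that sets all lasso coefficients to zero; the helper lambda_grid is hypothetical, and equating that endpoint with ∥X′y∥_∞/T matches the initialization of Algorithm 2.1 but is an assumption here, since the captions do not state the exact endpoint.

```python
import numpy as np

def lambda_grid(X, y, n_points=51):
    """Exponentially spaced penalty grid: position 0 is 1/(10T), position 50 is
    the smallest lambda that sets all lasso coefficients to zero (assumed to be
    ||X'y||_inf / T, as in the initialization of Algorithm 2.1)."""
    T = X.shape[0]
    lam_min = 1.0 / (10.0 * T)
    lam_max = np.max(np.abs(X.T @ y)) / T
    return np.exp(np.linspace(np.log(lam_min), np.log(lam_max), n_points))
```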