
License: CC BY 4.0
arXiv:2402.12724v1 [stat.ME] 20 Feb 2024

Controlled Variable Selection from Summary Statistics Only?
A Solution via GhostKnockoffs and Penalized Regression

Zhaomeng Chen; Zihuai He (Department of Neurology and Neurological Sciences, Stanford University); Benjamin B. Chu (Department of Biomedical Data Science, Stanford University); Jiaqi Gu (Department of Neurology and Neurological Sciences, Stanford University); Tim Morrison (Department of Statistics, Stanford University); Chiara Sabatti (Department of Statistics and Department of Biomedical Data Science, Stanford University); Emmanuel Candès (Department of Statistics and Department of Mathematics, Stanford University)
Abstract

Identifying which variables do influence a response while controlling false positives pervades statistics and data science. In this paper, we consider a scenario in which we only have access to summary statistics, such as the values of marginal empirical correlations between each explanatory variable of potential interest and the response. This situation may arise due to privacy concerns, e.g., to avoid the release of sensitive genetic information. We extend GhostKnockoffs (He et al., 2022) and introduce variable selection methods based on penalized regression that achieve false discovery rate (FDR) control. We report empirical results from extensive simulation studies, demonstrating enhanced performance over previous work. We also apply our methods to genome-wide association studies of Alzheimer’s disease and find a significant improvement in power.

* Equal contribution.

Keywords— Variable selection, replicability, summary statistics, false discovery rate (FDR), knockoffs, genome-wide association study (GWAS), pseudo-lasso

1 Introduction

1.1 Background and contributions

Modern large-scale studies frequently involve a multitude of explanatory variables potentially associated with an outcome we would like to better understand. Oftentimes, the goal is to select those explanatory variables that are meaningfully associated with the response variable. For instance, with recent advances in genome sequencing technologies and genotype imputation techniques, one can now gather tens of millions of variants from hundreds of thousands of samples in large-scale genetic studies, with the aim of pinpointing which genetic variants are biologically associated with specific diseases. This information could provide mechanistic insights and potentially aid the development of targeted drugs. In statistics, this challenge is typically framed as a multiple testing problem. Further, due to the sheer number of hypotheses considered and the cost of following false leads, it is generally required to control some form of error rate on the false positives.

In this paper, we focus on controlling the false discovery rate (FDR), which is the expected proportion of false selections among all selected variables. Compared to the more stringent familywise error rate (FWER) control, keeping the FDR under a nominal level allows for more discoveries while maintaining a reasonable statistical guarantee on the rate of false positives. Several methods for FDR control have been proposed in the literature, with the Benjamini-Hochberg procedure being particularly popular (Benjamini and Hochberg, 1995). However, these approaches often assume a parametric model or the availability of valid $p$-values, which remain difficult, and even problematic, to obtain in high-dimensional settings.

Candès et al. (2018) proposed model-X knockoffs, a broad and flexible framework that allows the statistician to select variables that retain dependence with the response conditional on all other covariates while maintaining FDR control. Model-X knockoffs differs from previous approaches in that (1) it makes no modeling assumptions on the distribution of the response $Y$ we wish to study conditional on the family of covariates $X$, and (2) it does not require the construction of valid $p$-values. Instead, the crucial assumption is that the distribution of $X$ is known. The main idea in Candès et al. (2018) is to generate fake variables $\widetilde{X}$, called knockoffs, which we can view as negative controls and use to tease apart variables that do influence the response from those that do not. Model-X knockoffs has proved effective in a number of real-world applications, particularly in GWAS; see Bates et al. (2020), Sesia et al. (2021) and He et al. (2022) for examples.

To deploy model-X knockoffs, researchers must have in hand the covariates and responses from all samples. However, in certain situations, individual-level data that may reveal sensitive personal information are not readily accessible. For example, due to privacy concerns, many GWAS only publish summary statistics of the original data (Pasaniuc and Price, 2017). In such cases, we would still like to develop controlled variable selection methods that rely solely on summary statistics. In genetic studies, this would enable us to utilize available summary data from different data centers to conduct meta-analysis, enhancing the effective sample size and improving variable selection power. On this front, He et al. (2022) proposed the framework of GhostKnockoffs, which implements the knockoffs procedure with the marginal correlation difference feature importance statistic directly from summary statistics. As we shall review next, the main idea is to generate knockoff $Z$-scores directly without creating knockoff variables; all that is needed are marginal correlations between the response and the features under study. In detail, with $n$ being the sample size and $p$ the number of variables being assayed, the method operates with only $\mathbf{X}^\top \mathbf{Y}$ and $\lVert \mathbf{Y} \rVert_2^2$, where $\mathbf{X}$ is the $n \times p$ matrix of covariates, and $\mathbf{Y}$ is the $n \times 1$ response vector.

In this paper, we extend the family of GhostKnockoffs methods to incorporate feature importance statistics obtained from penalized regression. We first consider in Section 3 the situation in which the empirical covariance of the covariate-response pair $(X, Y)$ is available; with the above notation, this means that the summary statistics $\mathbf{X}^\top \mathbf{X}$, $\mathbf{X}^\top \mathbf{Y}$, $\lVert \mathbf{Y} \rVert_2^2$ are available along with the sample size $n$. Unsurprisingly, we observe substantial power improvement over the method of He et al. (2022) because we can now employ far more effective test statistics. Next, in Section 4, we consider the case where the empirical covariance $\mathbf{X}^\top \mathbf{X}$ of the features is not available. There, we propose new imputation methods that consistently outperform He et al. (2022) in comprehensive synthetic and semi-synthetic simulations and rigorously control the FDR under suitable conditions. Finally, in Section 5 we apply our methods to a meta-analysis of nine large-scale array-based genome-wide association and whole-exome/-genome sequencing studies of Alzheimer’s disease, in which our methods yield more discoveries than He et al. (2022). We note that existing work in the genetics literature has implemented variable selection methods based on penalized regression with summary statistics, e.g., Mak et al. (2017) and Zou et al. (2022). However, none of these provide any guarantee of FDR control. In fact, as we note in the main text, these methods can be leveraged in our approach to create knockoffs versions that do control the FDR.

1.2 Code availability and reproducibility

The software and example code that reproduce the results presented in this paper can be found at https://github.com/biona001/ghostknockoff-gwas-reproducibility/tree/main/chen_et_al. Simulation results in Section 3.5, Section 4.4.2 and Section 4.4.3 can be exactly reproduced. Due to data accessibility issues, we only provide code without real data for Section 4.4.1 and Section 5.

2 Model-X Knockoffs and GhostKnockoffs

To begin with, we define the controlled variable selection problem and give a brief review of model-X knockoffs and GhostKnockoffs. For a more detailed exposition, we refer readers to Candès et al. (2018), Barber and Candès (2015), and He et al. (2022). In the following, we use boldface letters for vectors and matrices. (As an exception, we use $X$, $\widetilde{X}$, and $Y$ to represent generic covariates, their knockoffs, and the response.) We use $\mathbf{X}_j \in \mathbb{R}^n$ and $\mathbf{x}_i \in \mathbb{R}^p$ to respectively represent the $j$th column and $i$th row of the covariate matrix $\mathbf{X}$.

2.1 Problem statement

Given covariates $X \in \mathbb{R}^p$ and a response $Y \in \mathbb{R}$, we are interested in understanding which variables influence $Y$. We formulate this selection problem as testing the conditional independence hypotheses $\mathcal{H}_0^j : X_j \perp\!\!\!\perp Y \mid X_{-j}$ for $1 \leq j \leq p$, where $X_{-j}$ is shorthand for all the variables except the $j$th; that is, $X_{-j} = \{X_1, \ldots, X_{j-1}, X_{j+1}, \ldots, X_p\}$. In words, we should reject $\mathcal{H}_0^j$ if we believe that $X_j$ can help better predict the outcome than if we only had available the values of all the other variables. Put differently, $X_j$ has information about $Y$ which cannot be subsumed by the information contained in all the other variables. By conditioning on $X_{-j}$, these hypothesis tests aim to weed out variables whose relationship to $Y$ is driven by residual correlations with other covariates.
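To make the distinction between marginal and conditional association concrete, the following is a minimal simulation sketch. It assumes a toy linear Gaussian model (two covariates with correlation `rho`, of which only the first enters the response); the function name, sample size, and parameter values are illustrative choices and do not come from the paper. In this special model, a zero regression coefficient for the second covariate corresponds to the conditional null hypothesis being true.

```python
import numpy as np

def marginal_and_conditional(n=20000, rho=0.8, seed=0):
    """Simulate two correlated Gaussian covariates where only X1 drives Y."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    X = rng.multivariate_normal(np.zeros(2), cov, size=n)
    Y = X[:, 0] + rng.standard_normal(n)  # X2 plays no role once X1 is known
    # Marginal view: X2 is clearly correlated with Y (through X1).
    marginal_corr = np.corrcoef(X[:, 1], Y)[0, 1]
    # Conditional view: least-squares coefficient of X2 when X1 is included.
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return marginal_corr, coef[1]

marginal_corr, conditional_coef = marginal_and_conditional()
print(f"corr(X2, Y) = {marginal_corr:.3f}")
print(f"coefficient of X2 given X1 = {conditional_coef:.3f}")
```

A marginal test would flag the second covariate, whereas a test of the conditional hypothesis should not; this is the kind of variable the conditional formulation is designed to screen out.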

Let $\mathcal{H}_0 \subset [p]$ be the set of indices for which the null conditional independence hypothesis $\mathcal{H}_0^j$ is true, and let $\mathcal{S} \subset [p]$ be the set of indices of the hypotheses rejected by a selection procedure. The false discovery rate (FDR) is the expected fraction of false positives among the selected variables, defined as

$$\text{FDR} := \mathbb{E}\left[\frac{|\mathcal{S} \cap \mathcal{H}_0|}{|\mathcal{S}|}\right]$$

with the convention that $0/0 = 0$. Our goal is to make as many rejections as possible while controlling the FDR below a user-specified level $q$.
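The quantity inside the expectation is the false discovery proportion of a single selection. As a small illustration (the function name and the example index sets are hypothetical), it can be computed as follows, including the $0/0 = 0$ convention; the FDR would then be the average of this quantity over repeated draws of the data.

```python
def false_discovery_proportion(selected, nulls):
    """Fraction of selected indices that are true nulls, with 0/0 defined as 0."""
    selected, nulls = set(selected), set(nulls)
    if not selected:
        return 0.0  # convention: no selections means no false discoveries
    return len(selected & nulls) / len(selected)

# Four variables selected, two of which (3 and 4) are true nulls.
example_fdp = false_discovery_proportion({1, 2, 3, 4}, {3, 4, 7})
empty_fdp = false_discovery_proportion(set(), {3, 4, 7})
print(example_fdp, empty_fdp)
```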

In this paper, we consider the setting in which, instead of observing i.i.d. samples from the distribution of $(X, Y)$, we only have some summary statistics of the i.i.d. samples. In particular, we will show how one can, quite remarkably, perform tests of conditional independence when we do not directly observe the i.i.d. samples. Throughout this paper, we assume that $X \sim \mathcal{N}(\mathbf{0}, \mathbf{\Sigma})$, where $\mathbf{\Sigma}$ is known (or, in practice, can be estimated).

2.2 Model-X knockoffs

2.2.1 The procedure

Suppose we observe $n$ i.i.d. samples $(X_i, Y_i)$, $1 \leq i \leq n$, arranged in a data matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$ and response vector $\mathbf{Y} \in \mathbb{R}^n$. In the model-X knockoffs framework (Candès et al., 2018), we assume we know the distribution $P_X$ of the covariates $X$ while having no knowledge of the conditional distribution $Y \mid X$. The model-X approach is well-suited to genetic applications where reference panels may be available to estimate $P_X$ or where we have good models of linkage disequilibrium.

To implement model-X knockoffs, we first generate a matrix $\widetilde{\mathbf{X}} \in \mathbb{R}^{n \times p}$ of knockoffs such that the following two conditions hold:

(Exchangeability): $(\mathbf{X}_j, \widetilde{\mathbf{X}}_j, \mathbf{X}_{-j}, \widetilde{\mathbf{X}}_{-j}) \stackrel{d}{=} (\widetilde{\mathbf{X}}_j, \mathbf{X}_j, \mathbf{X}_{-j}, \widetilde{\mathbf{X}}_{-j}), \;\forall\; 1 \leq j \leq p$ (1)
(Conditional independence): $\widetilde{\mathbf{X}} \perp\!\!\!\perp \mathbf{Y} \mid \mathbf{X}.$ (2)

Roughly, the first condition says that we cannot distinguish between $[\mathbf{X}\; \widetilde{\mathbf{X}}]$ and $[\mathbf{X}\; \widetilde{\mathbf{X}}]_{\text{swap}(j)}$, where $[\mathbf{X}\; \widetilde{\mathbf{X}}]_{\text{swap}(j)}$ is obtained from $[\mathbf{X}\; \widetilde{\mathbf{X}}]$ by swapping the $j$th and $(j+p)$th columns. The second condition implies that $\widetilde{\mathbf{X}}$ does not provide any new information about $Y$ conditional on $X$ and is guaranteed if $\widetilde{\mathbf{X}}$ is constructed without looking at $\mathbf{Y}$. If these properties hold, it can be shown that $\mathbf{X}_j$ and $\widetilde{\mathbf{X}}_j$ are indistinguishable conditional on $\mathbf{Y}$ for each $j \in \mathcal{H}_0$.

Next, we define feature importance statistics $\mathbf{W} = w([\mathbf{X}, \widetilde{\mathbf{X}}], \mathbf{Y}) \in \mathbb{R}^p$ to be any function of $\mathbf{X}$, $\widetilde{\mathbf{X}}$ and $\mathbf{Y}$ such that a flip-sign property holds; namely, switching a column $\mathbf{X}_j$ with its knockoff $\widetilde{\mathbf{X}}_j$ flips the sign of the $j$th component of the output. Formally, $w_j([\mathbf{X}, \widetilde{\mathbf{X}}]_{\text{swap}(j)}, \mathbf{Y}) = -w_j([\mathbf{X}, \widetilde{\mathbf{X}}], \mathbf{Y})$. Common choices include $W_j = |\mathbf{X}_j^\top \mathbf{Y}| - |\widetilde{\mathbf{X}}_j^\top \mathbf{Y}|$ (marginal correlation difference statistic) and $W_j = |\hat{\beta}_j(\lambda_{\text{CV}})| - |\hat{\beta}_{j+p}(\lambda_{\text{CV}})|$ (Lasso coefficient difference statistic), where $\hat{\bm{\beta}}(\lambda_{\text{CV}})$ is the solution to the Lasso problem

$$\operatorname*{arg\,min}_{\bm{\beta} \in \mathbb{R}^{2p}} \frac{1}{2} \lVert \mathbf{Y} - [\mathbf{X}\; \widetilde{\mathbf{X}}] \bm{\beta} \rVert_2^2 + \lambda_{\text{CV}} \lVert \bm{\beta} \rVert_1,$$

and $\lambda_{\text{CV}}$ is usually chosen by cross-validation.
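The two statistics above can be sketched in a few lines of NumPy. This is an illustrative sketch only: the Lasso is solved by a plain cyclic coordinate descent at a fixed penalty `lam` (no cross-validation), all function names are hypothetical, and the matrix `X_tilde` in the demo is a random stand-in rather than a valid knockoff matrix. The stand-in suffices to illustrate the flip-sign property, which is purely algebraic.

```python
import numpy as np

def lasso_cd(A, y, lam, n_sweeps=1000, tol=1e-12):
    """Cyclic coordinate descent for 0.5*||y - A b||_2^2 + lam*||b||_1."""
    b = np.zeros(A.shape[1])
    col_sq = np.sum(A ** 2, axis=0)
    resid = np.asarray(y, dtype=float).copy()
    for _ in range(n_sweeps):
        max_change = 0.0
        for k in range(A.shape[1]):
            if col_sq[k] == 0.0:
                continue  # an all-zero column keeps a zero coefficient
            # Inner product of column k with the residual that excludes column k.
            rho = A[:, k] @ resid + col_sq[k] * b[k]
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[k]
            resid -= A[:, k] * (new - b[k])
            max_change = max(max_change, abs(new - b[k]))
            b[k] = new
        if max_change < tol:
            break
    return b

def marginal_difference(X, X_tilde, Y):
    """W_j = |X_j^T Y| - |X_tilde_j^T Y|."""
    return np.abs(X.T @ Y) - np.abs(X_tilde.T @ Y)

def lasso_difference(X, X_tilde, Y, lam):
    """W_j = |beta_j| - |beta_{j+p}| from the Lasso on [X, X_tilde]."""
    p = X.shape[1]
    beta = lasso_cd(np.hstack([X, X_tilde]), Y, lam)
    return np.abs(beta[:p]) - np.abs(beta[p:])

def swap_columns(X, X_tilde, j):
    """Return copies of (X, X_tilde) with column j exchanged between them."""
    Xs, Xts = X.copy(), X_tilde.copy()
    Xs[:, j], Xts[:, j] = X_tilde[:, j], X[:, j]
    return Xs, Xts

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 3))
X_tilde = rng.standard_normal((60, 3))  # stand-in columns, NOT valid knockoffs
Y = 2.0 * X[:, 0] + rng.standard_normal(60)

W_marg = marginal_difference(X, X_tilde, Y)
W_lasso = lasso_difference(X, X_tilde, Y, lam=5.0)
print(W_marg, W_lasso)
```

Swapping column $j$ between the two matrices with `swap_columns` and recomputing either statistic should negate the $j$th entry and leave the others unchanged (up to solver tolerance for the Lasso), which is the flip-sign property.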

Finally, the knockoff filter selects the variables $\mathcal{S} = \{j : W_j \geq T\}$, where

$$T = \min\left\{t \in \mathcal{W} : \frac{1 + \#\{j : W_j \leq -t\}}{\#\{j : W_j \geq t\} \vee 1} \leq q\right\}. \qquad (3)$$

Here, $\mathcal{W} = \{|W_j| : j = 1, \ldots, p\} \setminus \{0\}$, and $T = +\infty$ if $\mathcal{W}$ is empty. Intuitively, the threshold $T$ is chosen to be the most liberal one such that an estimate of the false discovery proportion (FDP) is bounded by $q$. Candès et al. (2018) showed that this procedure controls the FDR of the conditional testing problem at level $q$.
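The threshold in (3) can be computed by scanning the candidate values in increasing order and stopping at the first one whose estimated FDP is at most $q$. The sketch below is one straightforward way to do this; the function names and the example vector are illustrative, and the code returns $+\infty$ whenever no candidate satisfies the inequality.

```python
import numpy as np

def knockoff_threshold(W, q):
    """Smallest t among the nonzero |W_j| with (1 + #{W_j <= -t}) / max(#{W_j >= t}, 1) <= q."""
    W = np.asarray(W, dtype=float)
    candidates = np.unique(np.abs(W[W != 0]))  # np.unique returns sorted values
    for t in candidates:
        fdp_hat = (1 + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp_hat <= q:
            return float(t)
    return np.inf  # no candidate works: nothing is selected

def knockoff_select(W, q):
    """Indices j with W_j >= T."""
    T = knockoff_threshold(W, q)
    return np.flatnonzero(np.asarray(W, dtype=float) >= T)

W_example = [5, 4, 3, 2.5, -1, 2, 1.5, -0.5, 0.2, 6]
T_example = knockoff_threshold(W_example, q=0.25)
selected_example = knockoff_select(W_example, q=0.25)
print(T_example, selected_example)
```

Large positive entries of `W` are evidence for a variable, while large negative entries are evidence that knockoffs look as important as real variables; the numerator counts the latter to estimate the number of false discoveries above the threshold.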

2.2.2 Gaussian knockoff sampler

Under the assumption that the rows of the data matrix $\mathbf{X}$ are i.i.d. from the Gaussian distribution $\mathcal{N}(\mathbf{0}, \mathbf{\Sigma})$, we can generate a knockoff vector $\widetilde{\mathbf{x}}_i$ for each row $\mathbf{x}_i$ of the data matrix $\mathbf{X}$ by sampling $\widetilde{\mathbf{x}}_i \sim \mathcal{N}(\mathbf{P}^\top \mathbf{x}_i, \mathbf{V})$ independently across rows, where $\mathbf{P} = \mathbf{I} - \mathbf{\Sigma}^{-1}\mathbf{D}$, $\mathbf{V} = 2\mathbf{D} - \mathbf{D}\mathbf{\Sigma}^{-1}\mathbf{D}$, $\mathbf{D} = \text{diag}\{\mathbf{s}\}$, and $\mathbf{s} \in \mathbb{R}^p$ is a vector of free parameters usually obtained by solving a convex optimization problem that depends on $\mathbf{\Sigma}$ (Candès et al., 2018). See Appendix A for details of computing $\mathbf{s}$. Concatenating all the knockoff vectors then gives a valid matrix $\widetilde{\mathbf{X}} \in \mathbb{R}^{n \times p}$ of knockoffs. In matrix form, the construction above is

$$\widetilde{\mathbf{X}} = \mathbf{X}\mathbf{P} + \mathbf{E}\mathbf{V}^{1/2}, \qquad (4)$$

where $\mathbf{E}$ is an $n$ by $p$ matrix with i.i.d. standard Gaussian entries, independent of $\mathbf{X}$ and $\mathbf{Y}$. For later reference, we summarize the Gaussian knockoff sampler in Algorithm 1 and denote it as $\mathcal{G}$.
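As a quick sanity check on this construction (a sketch of a standard calculation, not the proof given in the cited references), one can compute the second moments of a row $\mathbf{x}_i$ and its knockoff $\widetilde{\mathbf{x}}_i = \mathbf{P}^\top \mathbf{x}_i + \mathbf{e}_i$ with $\mathbf{e}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{V})$ independent of $\mathbf{x}_i$:

```latex
\begin{align*}
\operatorname{Cov}(\mathbf{x}_i, \widetilde{\mathbf{x}}_i)
  &= \mathbf{\Sigma}\mathbf{P}
   = \mathbf{\Sigma}\,(\mathbf{I} - \mathbf{\Sigma}^{-1}\mathbf{D})
   = \mathbf{\Sigma} - \mathbf{D}, \\
\operatorname{Cov}(\widetilde{\mathbf{x}}_i)
  &= \mathbf{P}^\top \mathbf{\Sigma}\mathbf{P} + \mathbf{V}
   = \bigl(\mathbf{\Sigma} - 2\mathbf{D} + \mathbf{D}\mathbf{\Sigma}^{-1}\mathbf{D}\bigr)
     + \bigl(2\mathbf{D} - \mathbf{D}\mathbf{\Sigma}^{-1}\mathbf{D}\bigr)
   = \mathbf{\Sigma}.
\end{align*}
```

Hence the joint vector $(\mathbf{x}_i, \widetilde{\mathbf{x}}_i)$ is Gaussian with covariance blocks $\mathbf{\Sigma}$ on the diagonal and $\mathbf{\Sigma} - \mathbf{D}$ off the diagonal. Because $\mathbf{D}$ is diagonal, this covariance is unchanged when coordinate $j$ is swapped with coordinate $j + p$, which is what the exchangeability condition (1) requires in the Gaussian case.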

Algorithm 1 Gaussian Knockoff Sampler $\mathcal{G}$
1:  Input: $\mathbf{X}$ and $\mathbf{\Sigma}$.
2:  Compute $\mathbf{s}$ by solving a convex optimization problem as defined in (15).
3:  Compute $\mathbf{D} = \text{diag}\{\mathbf{s}\}$, $\mathbf{P} = \mathbf{I} - \mathbf{\Sigma}^{-1}\mathbf{D}$, and $\mathbf{V} = 2\mathbf{D} - \mathbf{D}\mathbf{\Sigma}^{-1}\mathbf{D}$.
4:  Simulate $\mathbf{E} \in \mathbb{R}^{n \times p}$ whose entries are i.i.d. standard Gaussian variables.
5:  Output: $\widetilde{\mathbf{X}} = \mathbf{X}\mathbf{P} + \mathbf{E}\mathbf{V}^{1/2}$.
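The steps of Algorithm 1 can be sketched in NumPy as follows. One important caveat: step 2 refers to a convex program (15) that is not reproduced here, so the sketch substitutes a simple equicorrelated-style choice, $s_j = c \cdot \min(1, 2\lambda_{\min}(\mathbf{\Sigma}))$ with a shrinkage factor $c = 0.99$, which assumes that $\mathbf{\Sigma}$ is a correlation matrix. This substitution, the shrinkage factor, and all function names are assumptions made for illustration and are not the paper's choices.

```python
import numpy as np

def equicorrelated_s(Sigma, shrink=0.99):
    """Hypothetical stand-in for step 2: a constant s keeping 2*Sigma - diag(s) positive definite."""
    lam_min = np.linalg.eigvalsh(Sigma)[0]  # eigenvalues are returned in ascending order
    return np.full(Sigma.shape[0], shrink * min(1.0, 2.0 * lam_min))

def knockoff_matrices(Sigma, s):
    """Step 3: P = I - Sigma^{-1} D and V = 2D - D Sigma^{-1} D."""
    D = np.diag(s)
    Sigma_inv_D = np.linalg.solve(Sigma, D)
    P = np.eye(len(s)) - Sigma_inv_D
    V = 2.0 * D - D @ Sigma_inv_D
    return P, (V + V.T) / 2.0  # symmetrize to remove rounding asymmetry

def psd_sqrt(V):
    """Symmetric square root of a positive semidefinite matrix."""
    vals, vecs = np.linalg.eigh(V)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def gaussian_knockoffs(X, Sigma, s=None, rng=None):
    """Steps 4-5: X_tilde = X P + E V^{1/2} with E having i.i.d. N(0, 1) entries."""
    rng = np.random.default_rng() if rng is None else rng
    s = equicorrelated_s(Sigma) if s is None else s
    P, V = knockoff_matrices(Sigma, s)
    E = rng.standard_normal(X.shape)
    return X @ P + E @ psd_sqrt(V)

# Small demo with an AR(1)-type correlation matrix (illustrative values).
p = 4
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
rng = np.random.default_rng(1)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=200)
X_tilde = gaussian_knockoffs(X, Sigma, rng=rng)
print(X_tilde.shape)
```

The sampler never looks at the response, so the conditional independence condition (2) holds by construction; the choice of $\mathbf{s}$ affects power, since larger entries make knockoffs less correlated with the original features.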

2.3 GhostKnockoffs with marginal correlation difference statistic

The original model-X knockoffs procedure relies on having access to the covariates and responses from all data points, i.e., the matrix of covariates $\mathbf{X}$ and the response vector $\mathbf{Y}$. Henceforth, we call these individual-level data. In many application scenarios, however, individual-level data are not available due to privacy concerns. Instead, we only have access to some summary statistics of $\mathbf{X}$ and $\mathbf{Y}$, e.g., the empirical covariance matrix of the covariates and the empirical covariance between each covariate and the response.

He et al. (2022) proposed GhostKnockoffs, which implements the knockoffs procedure with the marginal correlation difference statistic when only $\mathbf{X}^\top \mathbf{Y}$ and $\lVert \mathbf{Y} \rVert_2^2$ are available. The key idea of He et al. (2022) is to sample the knockoff $Z$-score $\widetilde{\mathbf{Z}}_s$ from $\mathbf{X}^\top \mathbf{Y}$ and $\lVert \mathbf{Y} \rVert_2^2$ directly, in a way such that

$$\widetilde{\mathbf{Z}}_s \mid \mathbf{X}, \mathbf{Y} \stackrel{d}{=} \widetilde{\mathbf{X}}^\top \mathbf{Y} \mid \mathbf{X}, \mathbf{Y}, \qquad (5)$$

where $\widetilde{\mathbf{X}} = \mathcal{G}(\mathbf{X}, \mathbf{\Sigma})$ is the knockoff matrix generated by the Gaussian knockoff sampler (Algorithm 1). If we use $\mathbf{W} = \lvert \mathbf{Z}_s \rvert - \lvert \widetilde{\mathbf{Z}}_s \rvert$ (where $\mathbf{Z}_s = \mathbf{X}^\top \mathbf{Y}$ and the absolute values are taken entrywise) as the feature importance statistic and run the knockoff filter, the resulting rejection set will have the same distribution as that of the knockoffs procedure with the marginal correlation difference statistic. Therefore, the two procedures are statistically identical. In particular, they both control the FDR.

Specifically, He et al. (2022) showed that for $\mathbf{P}$ and $\mathbf{V}$ computed in step 3 of Algorithm 1,

$$\widetilde{\mathbf{Z}}_s = \mathbf{P}^\top \mathbf{X}^\top \mathbf{Y} + \lVert \mathbf{Y} \rVert_2 \, \mathbf{Z}, \quad \text{where } \mathbf{Z} \sim \mathcal{N}(\mathbf{0}, \mathbf{V}) \text{ is independent of } \mathbf{X} \text{ and } \mathbf{Y}, \qquad (6)$$

satisfies (5) as detailed in Appendix B. All this is summarized in Algorithm 2. In the following sections, we refer to Algorithm 2 as GhostKnockoffs with marginal correlation difference statistic (GK-marginal).
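For intuition, here is a brief sketch of why (6) matches (5), written under the assumption that $\mathbf{V}^{1/2}$ denotes the symmetric square root; Appendix B of the paper remains the authoritative argument. Multiplying the transpose of (4) by $\mathbf{Y}$ gives

```latex
\begin{align*}
\widetilde{\mathbf{X}}^\top \mathbf{Y}
  &= \mathbf{P}^\top \mathbf{X}^\top \mathbf{Y} + \mathbf{V}^{1/2} \mathbf{E}^\top \mathbf{Y}, \\
\mathbf{E}^\top \mathbf{Y} \mid \mathbf{X}, \mathbf{Y}
  &\sim \mathcal{N}\bigl(\mathbf{0}, \lVert \mathbf{Y} \rVert_2^2 \, \mathbf{I}_p\bigr), \\
\mathbf{V}^{1/2} \mathbf{E}^\top \mathbf{Y} \mid \mathbf{X}, \mathbf{Y}
  &\sim \mathcal{N}\bigl(\mathbf{0}, \lVert \mathbf{Y} \rVert_2^2 \, \mathbf{V}\bigr)
   \stackrel{d}{=} \lVert \mathbf{Y} \rVert_2 \, \mathbf{Z},
   \qquad \mathbf{Z} \sim \mathcal{N}(\mathbf{0}, \mathbf{V}).
\end{align*}
```

The second line uses the fact that each entry of $\mathbf{E}^\top \mathbf{Y}$ is a sum $\sum_i E_{ij} Y_i$ of independent Gaussians with total variance $\lVert \mathbf{Y} \rVert_2^2$, and that distinct columns of $\mathbf{E}$ are independent. The individual-level noise $\mathbf{E}$ therefore enters only through a $p$-dimensional Gaussian vector whose scale is determined by $\lVert \mathbf{Y} \rVert_2$.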

Algorithm 2 GhostKnockoffs with Marginal Correlation Difference Statistic (GK-marginal)
1:  Input: $\mathbf{X}^\top \mathbf{Y}$, $\lVert \mathbf{Y} \rVert_2^2$, and $\mathbf{\Sigma}$.
2:  Compute $\mathbf{s}$, $\mathbf{P}$, and $\mathbf{V}$ as in Algorithm 1.
3:  Compute the feature importance statistics $\mathbf{W} = \lvert \mathbf{Z}_s \rvert - \lvert \widetilde{\mathbf{Z}}_s \rvert$, where $\widetilde{\mathbf{Z}}_s$ is generated according to (6).
4:  Input $\mathbf{W}$ into the knockoffs selection procedure.
5:  Output: Knockoffs selection set.
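Steps 1 to 3 of Algorithm 2 can be sketched as below. The sketch is illustrative rather than the authors' implementation: the vector $\mathbf{s}$ is again set by a simple constant rule (an assumption, standing in for the convex program referenced in Algorithm 1), the function names are hypothetical, and the individual-level data are simulated only to produce the summary statistics that the method consumes.

```python
import numpy as np

def ghost_knockoff_zscores(XtY, Y_sq_norm, P, V, rng):
    """Draw Z_tilde = P^T X^T Y + ||Y||_2 * Z with Z ~ N(0, V), following (6)."""
    Z = rng.multivariate_normal(np.zeros(len(XtY)), V)
    return P.T @ XtY + np.sqrt(Y_sq_norm) * Z

def gk_marginal_statistics(XtY, Y_sq_norm, P, V, rng):
    """W = |Z_s| - |Z_tilde_s| with Z_s = X^T Y, using summary statistics only."""
    Z_tilde = ghost_knockoff_zscores(XtY, Y_sq_norm, P, V, rng)
    return np.abs(XtY) - np.abs(Z_tilde)

rng = np.random.default_rng(0)
p, n = 5, 500
Sigma = 0.3 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))

# Hypothetical constant choice of s (assumes Sigma is a correlation matrix).
s = np.full(p, 0.99 * min(1.0, 2.0 * np.linalg.eigvalsh(Sigma)[0]))
D = np.diag(s)
P = np.eye(p) - np.linalg.solve(Sigma, D)
V = 2.0 * D - D @ np.linalg.solve(Sigma, D)
V = (V + V.T) / 2.0  # remove rounding asymmetry before sampling

# Simulated individual-level data, used here only to form the summary statistics.
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
Y = 0.5 * X[:, 0] + rng.standard_normal(n)
XtY, Y_sq_norm = X.T @ Y, float(Y @ Y)

W = gk_marginal_statistics(XtY, Y_sq_norm, P, V, rng)
print(W)
```

The resulting vector `W` would then be passed to the knockoff filter in (3) to obtain the selection set (steps 4 and 5). Note that neither function touches `X` or `Y` directly.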

3 GhostKnockoffs with Penalized Regression: Known Empirical Covariance

3.1 Setting

As we have just seen, GhostKnockoffs-marginal gives a way to test conditional hypotheses while maintaining FDR control when only the summary statistics $\mathbf{X}^\top \mathbf{Y}$ and $\lVert \mathbf{Y} \rVert_2^2$ are available to the analyst. Now, we consider the setting in which we have knowledge of the empirical covariance matrix $\mathbf{X}^\top \mathbf{X}$ and the sample size $n$, in addition to $\mathbf{X}^\top \mathbf{Y}$ and $\lVert \mathbf{Y} \rVert_2^2$. These quantities only reveal sample averages of relevant quantities, as opposed to all the individual-level information.
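To illustrate what these four summary statistics contain, the sketch below computes them from simulated data and checks a simple algebraic fact: the residual sum of squares of any linear fit, $\lVert \mathbf{Y} - \mathbf{X}\bm{\beta} \rVert_2^2 = \lVert \mathbf{Y} \rVert_2^2 - 2\bm{\beta}^\top \mathbf{X}^\top \mathbf{Y} + \bm{\beta}^\top \mathbf{X}^\top \mathbf{X} \bm{\beta}$, can be evaluated from the summary statistics alone. The function names, dictionary keys, and simulated values are hypothetical and chosen only for illustration.

```python
import numpy as np

def summary_statistics(X, Y):
    """Collect X^T X, X^T Y, ||Y||_2^2 and n from individual-level data."""
    return {"XtX": X.T @ X, "XtY": X.T @ Y, "Y_sq_norm": float(Y @ Y), "n": X.shape[0]}

def rss_from_summary(stats, beta):
    """||Y - X beta||_2^2 expanded so that only summary statistics are needed."""
    return stats["Y_sq_norm"] - 2.0 * beta @ stats["XtY"] + beta @ stats["XtX"] @ beta

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
Y = X @ np.array([1.0, 0.0, -0.5]) + rng.standard_normal(100)

stats = summary_statistics(X, Y)
beta = np.array([0.8, 0.1, -0.4])  # an arbitrary coefficient vector
rss_direct = float(np.sum((Y - X @ beta) ** 2))
rss_summary = float(rss_from_summary(stats, beta))
print(rss_direct, rss_summary)
```

The two printed values should agree up to rounding, which hints at why squared-error losses of the kind used in penalized regression are compatible with a summary-statistics-only setting.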

In this section, we propose a variable selection method that utilizes only 𝐗𝐗superscript𝐗top𝐗\mathbf{X}^{\top}\mathbf{X}bold_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_X, 𝐗𝐘superscript𝐗top𝐘\mathbf{\mathbf{X}}^{\top}\mathbf{Y}bold_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_Y, 𝐘22superscriptsubscriptdelimited-∥∥𝐘22\lVert\mathbf{Y}\rVert_{2}^{2}∥ bold_Y ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT, and n𝑛nitalic_n. Our method achieves FDR control and power comparable to the knockoffs procedure with the cross-validated Lasso coefficient difference statistic defined in Section 2. This is interesting because the latter usually outperforms GhostKnockoffs with the marginal correlation difference statistic by a significant margin. Notably, for a fixed tuning parameter λ𝜆\lambdaitalic_λ, we show that our procedure is equivalent to a corresponding knockoffs method using the Lasso coefficient difference statistic with the same penalty level λ𝜆\lambdaitalic_λ.

3.2 GhostKnockoffs with the Lasso

Recall that in the knockoffs procedure with the Lasso coefficient difference statistic, we solve the optimization problem

$$\hat{\bm{\beta}}(\lambda)\in\operatorname*{arg\,min}_{\bm{\beta}\in\mathbb{R}^{2p}}\frac{1}{2}\lVert\mathbf{Y}-[\mathbf{X}\;\widetilde{\mathbf{X}}]\bm{\beta}\rVert_{2}^{2}+\lambda\lVert\bm{\beta}\rVert_{1}, \tag{7}$$

where $\widetilde{\mathbf{X}}=\mathcal{G}(\mathbf{X},\mathbf{\Sigma})$. We then define the Lasso coefficient difference feature importance statistics by $W_{j}=\lvert\hat{\beta}_{j}(\lambda)\rvert-\lvert\hat{\beta}_{j+p}(\lambda)\rvert$ for $1\leq j\leq p$. If we have access to individual-level data, $\lambda$ is usually chosen by cross-validation (Candès et al., 2018; Weinstein et al., 2020).

Footnote: In the case that $Y$ is binary, one may think that utilizing (penalized) logistic regression would give much better power than the Lasso. In Appendix P, we show through simulations that this intuition may not be correct, even when $Y$ is generated according to a logistic regression model.

As a first step, we would like to run a statistically equivalent procedure using $\mathbf{X}^{\top}\mathbf{X}$, $\mathbf{X}^{\top}\mathbf{Y}$, $\lVert\mathbf{Y}\rVert_{2}^{2}$, and $n$ for a fixed $\lambda$. Note that, with $\lambda$ fixed, (7) depends on the data only through

$$\begin{bmatrix}\mathbf{X}^{\top}\mathbf{X}&\mathbf{X}^{\top}\widetilde{\mathbf{X}}\\ \widetilde{\mathbf{X}}^{\top}\mathbf{X}&\widetilde{\mathbf{X}}^{\top}\widetilde{\mathbf{X}}\end{bmatrix}$$

and

$$\begin{bmatrix}\mathbf{X}^{\top}\mathbf{Y}\\ \widetilde{\mathbf{X}}^{\top}\mathbf{Y}\end{bmatrix}.$$
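
To see why, one can expand the squared loss in (7). The shorthand $\mathbf{A}=[\mathbf{X}\;\widetilde{\mathbf{X}}]$ is introduced here for illustration only:

```latex
\frac{1}{2}\lVert\mathbf{Y}-\mathbf{A}\bm{\beta}\rVert_{2}^{2}+\lambda\lVert\bm{\beta}\rVert_{1}
=\frac{1}{2}\lVert\mathbf{Y}\rVert_{2}^{2}
-\bm{\beta}^{\top}\mathbf{A}^{\top}\mathbf{Y}
+\frac{1}{2}\bm{\beta}^{\top}\mathbf{A}^{\top}\mathbf{A}\bm{\beta}
+\lambda\lVert\bm{\beta}\rVert_{1}.
```

The first term on the right-hand side does not involve $\bm{\beta}$, so the set of minimizers is determined by $\mathbf{A}^{\top}\mathbf{A}$ and $\mathbf{A}^{\top}\mathbf{Y}$, which are the block matrix and the stacked vector displayed above.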

Define the Gram matrix of $[\mathbf{X},\widetilde{\mathbf{X}},\mathbf{Y}]$

$$\mathcal{T}(\mathbf{X},\widetilde{\mathbf{X}},\mathbf{Y})=[\mathbf{X},\widetilde{\mathbf{X}},\mathbf{Y}]^{\top}\,[\mathbf{X},\widetilde{\mathbf{X}},\mathbf{Y}].$$

The Gram matrix can of course be equivalently reconstructed from $(\lVert\mathbf{Y}\rVert_{2}^{2},\mathbf{X}^{\top}\mathbf{Y},\widetilde{\mathbf{X}}^{\top}\mathbf{Y},\mathbf{X}^{\top}\mathbf{X},\widetilde{\mathbf{X}}^{\top}\mathbf{X},\widetilde{\mathbf{X}}^{\top}\widetilde{\mathbf{X}})$. The main idea is to sample from the joint distribution of $\mathcal{T}(\mathbf{X},\widetilde{\mathbf{X}},\mathbf{Y})$ using the Gram matrix of $[\mathbf{X},\mathbf{Y}]$ only. Based on this, we can then generate the solution to the Lasso problem (7) (in distribution) for a fixed $\lambda$. This is achieved via the following Proposition 1, which says in words that if we generate 'fake' data matrices $\widecheck{\mathbf{X}}$ and $\widecheck{\mathbf{Y}}$ that lead to the same Gram matrix as that of $\mathbf{X}$ and $\mathbf{Y}$, then the distribution of $\mathcal{T}$ remains unchanged if we replace the original data matrices by the fake data matrices.

Footnote: Careful readers may realize that the solution of the Lasso problem does not depend on $\lVert\mathbf{Y}\rVert_{2}^{2}$. Here we include $\lVert\mathbf{Y}\rVert_{2}^{2}$ as an input so that we can later make a more general statement that goes beyond the Lasso.

Proposition 1.

Suppose $\widecheck{\mathbf{X}}\in\mathbb{R}^{n\times p}$ and $\widecheck{\mathbf{Y}}\in\mathbb{R}^{n}$ are constructed such that $[\widecheck{\mathbf{X}}\ \widecheck{\mathbf{Y}}]^{\top}[\widecheck{\mathbf{X}}\ \widecheck{\mathbf{Y}}]=[\mathbf{X}\ \mathbf{Y}]^{\top}[\mathbf{X}\ \mathbf{Y}]$. Setting $\widetilde{\mathbf{X}}=\mathcal{G}(\mathbf{X},\mathbf{\Sigma})$ and $\widetilde{\widecheck{\mathbf{X}}}=\mathcal{G}(\widecheck{\mathbf{X}},\mathbf{\Sigma})$ as the outputs of Algorithm 1, we have

$$\mathcal{T}(\mathbf{X},\widetilde{\mathbf{X}},\mathbf{Y})\mid\mathbf{X},\mathbf{Y}\stackrel{d}{=}\mathcal{T}(\widecheck{\mathbf{X}},\widetilde{\widecheck{\mathbf{X}}},\widecheck{\mathbf{Y}})\mid\mathbf{X},\mathbf{Y}.$$

Footnote: Note that $\widecheck{\mathbf{X}}$ may not be a data matrix with i.i.d. rows and covariance matrix $\mathbf{\Sigma}$, and we should call $\widetilde{\widecheck{\mathbf{X}}}$ the pseudo-Gaussian knockoff data matrix.

The proof of Proposition 1 is provided in Appendix C. Specifically, Proposition 1 suggests that the summary statistics ($\mathbf{X}^{\top}\mathbf{X}$, $\mathbf{X}^{\top}\mathbf{Y}$, $\lVert\mathbf{Y}\rVert_{2}^{2}$, $\mathbf{\Sigma}$) are sufficient for sampling the Gram matrix $\mathcal{T}(\mathbf{X},\widetilde{\mathbf{X}},\mathbf{Y})$.
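
Both knockoff matrices in Proposition 1 come from the same sampler $\mathcal{G}$. Algorithm 1 is not reproduced in this section, so the following is only a sketch of a standard Gaussian knockoff sampler, under the assumptions that $\mathbf{P}=\mathbf{I}-\mathbf{\Sigma}^{-1}\mathbf{D}$ and $\mathbf{V}=2\mathbf{D}-\mathbf{D}\mathbf{\Sigma}^{-1}\mathbf{D}$ with $\mathbf{D}=\mathrm{diag}(\mathbf{s})$, that $\mathbf{\Sigma}$ is a correlation matrix, and that $\mathbf{s}$ is the simple equicorrelated choice. The function name and the `shrink` factor are hypothetical.

```python
import numpy as np

def sample_gaussian_knockoffs(X, Sigma, rng, shrink=0.99):
    """Draw rows of X_tilde | X ~ N(X P, V), with P = I - Sigma^{-1} D and
    V = 2 D - D Sigma^{-1} D, where D = diag(s).

    Assumptions (not taken from the text): Sigma is a correlation matrix and
    s_j = shrink * min(1, 2 * lambda_min(Sigma)) for every j.
    """
    p = Sigma.shape[0]
    lam_min = np.linalg.eigvalsh(Sigma)[0]      # eigenvalues come back ascending
    s = np.full(p, shrink * min(1.0, 2.0 * lam_min))
    D = np.diag(s)
    Sigma_inv_D = np.linalg.solve(Sigma, D)     # Sigma^{-1} D without explicit inverse
    P = np.eye(p) - Sigma_inv_D
    V = 2.0 * D - D @ Sigma_inv_D
    V = (V + V.T) / 2.0                         # symmetrize against round-off
    L = np.linalg.cholesky(V)                   # V = L L'
    E = rng.standard_normal(X.shape)            # i.i.d. N(0, 1) entries
    return X @ P + E @ L.T
```

The same function would be applied to $\widecheck{\mathbf{X}}$ in place of $\mathbf{X}$; a more powerful choice of $\mathbf{s}$ (for example, one obtained by convex optimization) can be substituted without changing the rest of the sketch.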

Algorithm 3 GhostKnockoffs with Penalized Regression: Known Empirical Covariance
1:  Input: $\mathbf{X}^{\top}\mathbf{X}$, $\mathbf{X}^{\top}\mathbf{Y}$, $\lVert\mathbf{Y}\rVert_{2}^{2}$, $\mathbf{\Sigma}$, and $n$.
2:  Find $\widecheck{\mathbf{X}}$ and $\widecheck{\mathbf{Y}}$ such that $[\widecheck{\mathbf{X}}\ \widecheck{\mathbf{Y}}]^{\top}[\widecheck{\mathbf{X}}\ \widecheck{\mathbf{Y}}]=[\mathbf{X}\ \mathbf{Y}]^{\top}[\mathbf{X}\ \mathbf{Y}]$ by eigen-decomposition or Cholesky decomposition.
3:  Generate $\widetilde{\widecheck{\mathbf{X}}}=\mathcal{G}(\widecheck{\mathbf{X}},\mathbf{\Sigma})$ via Algorithm 1.
4:  Run the standard knockoffs procedure (at level $q$) with the Lasso coefficient difference statistic on $\widecheck{\mathbf{X}}$ and $\widetilde{\widecheck{\mathbf{X}}}$ for a fixed penalty level $\lambda$ or use the methods from Sections 3.3 and 3.4.
5:  Output: Knockoffs selection set.

We are now able to write down a procedure, namely, Algorithm 3, which is statistically equivalent to the corresponding individual-level knockoffs procedure using the Lasso coefficient difference statistic (or any statistic defined in Sections 3.3 and 3.4). In step 2, $\widecheck{\mathbf{X}}$ and $\widecheck{\mathbf{Y}}$ can be obtained by performing the eigen-decomposition or Cholesky decomposition of $[\mathbf{X}\ \mathbf{Y}]^{\top}[\mathbf{X}\ \mathbf{Y}]$. Brief procedures to construct $\widecheck{\mathbf{X}}$ and $\widecheck{\mathbf{Y}}$ via eigen-decomposition are provided in Appendix D. All we need to do is to run the knockoffs procedure with $\widecheck{\mathbf{X}}$ and $\widetilde{\widecheck{\mathbf{X}}}$ in lieu of $\mathbf{X}$ and $\widetilde{\mathbf{X}}$. We say that the procedure is equivalent since the rejection sets have the same distribution. In particular, this proves that Algorithm 3 controls the FDR.
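
Step 2 of Algorithm 3 can be sketched with an eigen-decomposition as below. This is a minimal illustration rather than the construction of Appendix D, which may organize the computation differently; the function name `fake_data_from_gram` and the zero-padding of rows are assumptions made here.

```python
import numpy as np

def fake_data_from_gram(XtX, XtY, YtY, n):
    """Build (X_check, Y_check) with n rows whose Gram matrix matches the inputs.

    Assumption: keep the top min(n, p + 1) eigenpairs of the Gram matrix of
    [X, Y] and pad the remaining rows with zeros.
    """
    p = XtX.shape[0]
    G = np.empty((p + 1, p + 1))
    G[:p, :p] = XtX
    G[:p, p] = XtY
    G[p, :p] = XtY
    G[p, p] = YtY
    evals, evecs = np.linalg.eigh(G)            # ascending eigenvalues
    r = min(n, p + 1)                           # rank of G is at most r
    evals = np.clip(evals[-r:], 0.0, None)      # remove tiny negative round-off
    evecs = evecs[:, -r:]
    A = np.zeros((n, p + 1))
    A[:r, :] = np.sqrt(evals)[:, None] * evecs.T
    return A[:, :p], A[:, p]                    # X_check, Y_check
```

By construction $\mathbf{A}^{\top}\mathbf{A}$ equals the Gram matrix of $[\mathbf{X}\ \mathbf{Y}]$ up to floating-point error, which is the only property that Proposition 1 requires of the fake data.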

Corollary 1.

Consider a knockoffs feature importance statistic $\mathbf{W}=\mathbf{f}(\mathcal{T}(\mathbf{X},\widetilde{\mathbf{X}},\mathbf{Y}),\mathbf{U})\in\mathbb{R}^{p}$, which is a deterministic function of $\mathcal{T}(\mathbf{X},\widetilde{\mathbf{X}},\mathbf{Y})$ and an independent random variable $\mathbf{U}$. Define $\widehat{\mathbf{W}}=\mathbf{f}(\mathcal{T}(\widecheck{\mathbf{X}},\widetilde{\widecheck{\mathbf{X}}},\widecheck{\mathbf{Y}}),\mathbf{U})$. Let $\mathcal{S}_{1}$ (resp. $\mathcal{S}_{2}$) be the rejection set obtained from applying the knockoffs filter to $\mathbf{W}$ (resp. $\widehat{\mathbf{W}}$). Then $\mathcal{S}_{1}\mid\mathbf{X},\mathbf{Y}\stackrel{d}{=}\mathcal{S}_{2}\mid\mathbf{X},\mathbf{Y}$. Thus, if $\mathbf{W}$ obeys the flip-sign property, both procedures have the same FDR, which is at most $q$.

Proof.

Proposition 1 gives $\mathbf{W}\mid\mathbf{X},\mathbf{Y}\stackrel{d}{=}\widehat{\mathbf{W}}\mid\mathbf{X},\mathbf{Y}$. Since the selection set is uniquely determined by the values of $\mathbf{W}$ (or $\widehat{\mathbf{W}}$), it follows that $\mathcal{S}_{1}\mid\mathbf{X},\mathbf{Y}\stackrel{d}{=}\mathcal{S}_{2}\mid\mathbf{X},\mathbf{Y}$. Therefore, the procedures have the same FDR. ∎

We can easily adapt the method above to accommodate other types of regularization, such as Ridge regression and Elastic Net.
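
As an illustration of the fixed-$\lambda$ step, the Lasso problem (7) can be solved from Gram quantities alone. The sketch below uses plain cyclic coordinate descent, which is a choice made here and not necessarily the solver used by the authors; the names `lasso_from_gram` and `lasso_coef_diff` are hypothetical.

```python
import numpy as np

def lasso_from_gram(G, c, lam, n_sweeps=500, tol=1e-10):
    """Minimize 0.5 * b'Gb - c'b + lam * ||b||_1 by cyclic coordinate descent.

    G = A'A and c = A'Y for the augmented design A = [X, X_tilde].
    Assumes every diagonal entry of G is strictly positive.
    """
    G = np.asarray(G, dtype=float)
    c = np.asarray(c, dtype=float)
    beta = np.zeros(len(c))
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(len(c)):
            # Correlation of feature j with the residual that excludes feature j.
            rho = c[j] - G[j] @ beta + G[j, j] * beta[j]
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / G[j, j]
            max_change = max(max_change, abs(new - beta[j]))
            beta[j] = new
        if max_change < tol:
            break
    return beta

def lasso_coef_diff(beta):
    """W_j = |beta_j| - |beta_{j+p}| for a coefficient vector of length 2p."""
    p = len(beta) // 2
    return np.abs(beta[:p]) - np.abs(beta[p:])
```

Replacing the soft-thresholding update by the corresponding Ridge or Elastic Net update would give the other penalties mentioned above, again using only `G` and `c`.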

3.3 GhostKnockoffs with the square-root Lasso

In Section 3.2, we assumed that the tuning parameter $\lambda$ in (7) is fixed. In practice, one may choose the penalty level using information from the Gram matrix of $[\mathbf{X},\mathbf{Y}]$, and the sample size $n$. Since individual-level data is not available, we are unable to use data-splitting approaches such as cross-validation.

An alternative way to define feature importance is to use the square-root Lasso (Belloni et al., 2011), for which the choice of a reasonable tuning parameter is convenient. The square-root Lasso applied to the knockoffs setting solves

$$\hat{\bm{\beta}}(\lambda)\in\operatorname*{arg\,min}_{\bm{\beta}\in\mathbb{R}^{2p}}\lVert\mathbf{Y}-[\mathbf{X}\;\widetilde{\mathbf{X}}]\bm{\beta}\rVert_{2}+\lambda\lVert\bm{\beta}\rVert_{1}, \tag{8}$$

and a good choice of $\lambda$ is given by

$$\lambda=\kappa\cdot\mathbb{E}\left[\frac{\lVert[\mathbf{X}\;\widetilde{\mathbf{X}}]^{\top}\bm{\epsilon}\rVert_{\infty}}{\lVert\bm{\epsilon}\rVert_{2}}\;\middle|\;\mathbf{X},\widetilde{\mathbf{X}}\right], \tag{9}$$

where $\bm{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_{n})$ and $\kappa$ is a unitless hyperparameter (Tian et al., 2018). This value is a scalar multiple of the expected value of the minimal penalty level required such that all the coefficients are shrunk to zero under the global null model. The square-root Lasso has the benefit that the value of the hyperparameter does not depend on the details of the distribution of $Y$ conditional on $X$. We also found that the performance of our procedure does not depend very sensitively on the choice of $\kappa$. In our data examples, we take $\kappa=0.3$.

In the setting where we only know the values of the summary statistics, we simply replace $(\mathbf{X},\widetilde{\mathbf{X}},\mathbf{Y})$ by $(\widecheck{\mathbf{X}},\widetilde{\widecheck{\mathbf{X}}},\widecheck{\mathbf{Y}})$ in (8). Further, we note that for any orthogonal matrix $\mathbf{Q}$,

$$\begin{aligned}([\mathbf{X}\;\widetilde{\mathbf{X}}]^{\top}\mathbf{Q}^{\top}\bm{\epsilon},\bm{\epsilon}^{\top}\bm{\epsilon})\mid\mathbf{X},\widetilde{\mathbf{X}}&\stackrel{d}{=}([\mathbf{X}\;\widetilde{\mathbf{X}}]^{\top}\mathbf{Q}^{\top}\bm{\epsilon},\bm{\epsilon}^{\top}\mathbf{Q}\mathbf{Q}^{\top}\bm{\epsilon})\mid\mathbf{X},\widetilde{\mathbf{X}}\\ &\stackrel{d}{=}([\mathbf{X}\;\widetilde{\mathbf{X}}]^{\top}\bm{\epsilon},\bm{\epsilon}^{\top}\bm{\epsilon})\mid\mathbf{X},\widetilde{\mathbf{X}},\end{aligned}$$

where the second equality follows from $\mathbf{Q}^{\top}\bm{\epsilon}\stackrel{d}{=}\bm{\epsilon}$. Therefore, the value of the hyperparameter in (9) remains unchanged if we multiply $[\mathbf{X}\;\widetilde{\mathbf{X}}]$ by $\mathbf{Q}$ on the left. Since two $n\times 2p$ matrices with the same Gram matrix differ only by such a left multiplication, this implies that (9) is a deterministic function of $[\mathbf{X}\;\widetilde{\mathbf{X}}]^{\top}[\mathbf{X}\;\widetilde{\mathbf{X}}]$. Hence, the feature importance statistic is a function of $\mathcal{T}(\mathbf{X},\widetilde{\mathbf{X}},\mathbf{Y})$. Following Corollary 1, we can apply the knockoffs procedure with the square-root Lasso and matrices $(\widecheck{\mathbf{X}},\widetilde{\widecheck{\mathbf{X}}})$ in lieu of $(\mathbf{X},\widetilde{\mathbf{X}})$. Upon choosing

$$\lambda=\kappa\;\mathbb{E}\left[\frac{\lVert[\widecheck{\mathbf{X}}\;\widetilde{\widecheck{\mathbf{X}}}]^{\top}\bm{\epsilon}\rVert_{\infty}}{\lVert\bm{\epsilon}\rVert_{2}}\;\middle|\;\widecheck{\mathbf{X}},\widetilde{\widecheck{\mathbf{X}}}\right], \tag{10}$$

we get a procedure that is statistically indistinguishable from the one we would obtain by performing the same steps with $\mathbf{X}$ and $\widetilde{\mathbf{X}}$. (In practice, we compute the value in (10) via Monte Carlo simulation.) In the sequel, we call the resulting procedure summary statistics GhostKnockoffs with square-root Lasso importance statistic (GK-sqrtlasso). Note that GK-sqrtlasso controls the FDR as the flip-sign property of the feature importance statistic holds. This is because swapping a variable with its knockoff does not change the value of the hyperparameter. Therefore, by Corollary 1, applying the knockoff filter to the square-root Lasso feature importance statistics yields FDR control.
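
The Monte Carlo approximation of (10) can be sketched as follows. The function name `sqrt_lasso_lambda` is hypothetical; the defaults $\kappa=0.3$ and 200 draws mirror the values quoted in the text, and the argument `A` stands for the $n\times 2p$ matrix $[\widecheck{\mathbf{X}}\;\widetilde{\widecheck{\mathbf{X}}}]$.

```python
import numpy as np

def sqrt_lasso_lambda(A, kappa=0.3, n_mc=200, rng=None):
    """Monte Carlo estimate of kappa * E[ ||A' eps||_inf / ||eps||_2 | A ],
    with eps ~ N(0, I_n) and A of shape (n, 2p).
    """
    rng = np.random.default_rng() if rng is None else rng
    n = A.shape[0]
    eps = rng.standard_normal((n_mc, n))        # one noise vector per row
    num = np.max(np.abs(eps @ A), axis=1)       # ||A' eps||_inf for each draw
    den = np.linalg.norm(eps, axis=1)           # ||eps||_2 for each draw
    return kappa * np.mean(num / den)
```

By the Cauchy-Schwarz inequality each ratio is bounded by the largest column norm of `A`, which gives a quick sanity check on the returned value.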

3.4 GhostKnockoffs with the Lasso-max

In the standard fixed-X knockoffs setting, cross-validation is also not feasible, since doing so would violate the sufficiency condition required for the feature importance statistics. As one possible alternative, Barber and Candès (2015) considered using as the feature importance statistic the value of $\lambda$ on the Lasso path at which feature $X_{j}$ first enters the model. Formally, they define the feature importance statistic

$$W_{j}=\sup\{\lambda:\hat{\beta}_{j}(\lambda)\neq 0\}-\sup\{\lambda:\hat{\beta}_{j+p}(\lambda)\neq 0\},$$

where $\hat{\bm{\beta}}(\lambda)$ is as in (7). We call this statistic the Lasso-max statistic. Intuitively, a larger penalty level is required to shrink an important feature to zero, so we should expect $W_{j}$ to be large and positive for non-nulls.
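
Given Lasso solutions on a finite grid of penalty levels, the Lasso-max statistic can be approximated as in the sketch below. The grid-based approximation of the supremum and the function name `lasso_max_statistic` are assumptions made for illustration; an exact path algorithm would return the entry points directly.

```python
import numpy as np

def lasso_max_statistic(lambdas, coef_path):
    """Approximate W_j = sup{lam: beta_j(lam) != 0} - sup{lam: beta_{j+p}(lam) != 0}
    on a finite grid of positive penalty levels.

    lambdas   : shape (m,), penalty levels at which the Lasso (7) was solved
    coef_path : shape (2p, m), column k holds beta_hat(lambdas[k])
    A coefficient that is zero along the whole grid gets entry value 0.
    """
    lambdas = np.asarray(lambdas, dtype=float)
    active = np.asarray(coef_path) != 0                      # (2p, m) mask
    entry = np.max(np.where(active, lambdas[None, :], 0.0), axis=1)
    p = entry.shape[0] // 2
    return entry[:p] - entry[p:]
```

A finer grid gives a closer approximation to the supremum, at the cost of more Lasso solves.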

By Corollary 1, with the Lasso-max statistic Algorithm 3 produces a rejection set that has the same distribution as the rejection set obtained from the corresponding individual-data-based knockoffs procedure. We call this summary-statistic-based procedure GhostKnockoffs with Lasso-max statistic (GK-lassomax).

We remark that choices of other tuning parameters and feature importance statistics are also possible. For instance, we may choose $\lambda$ to minimize Stein's unbiased risk estimate (SURE) associated with (7). We shall however focus on the two approaches we have described.

3.5 Numerical simulations

We consider a variety of simulation settings in which we compare the performance of the proposed GhostKnockoffs with square-root Lasso and Lasso-max statistics (GK-sqrtlasso and GK-lassomax, defined in Sections 3.3 and 3.4), GhostKnockoffs with marginal correlation difference statistic (GK-marginal, defined in Section 2), and the knockoffs procedure with (cross-validated) Lasso coefficient difference statistic with individual-level data (KF-lassocv). Note that the first three are statistically equivalent to the corresponding knockoffs procedures with individual-level data.

3.5.1 Independent features

In the first set of simulations (Figure 1), we generate random samples $\mathbf{x}_{i}\stackrel{iid}{\sim}\mathcal{N}(\mathbf{0},\mathbf{I}_{p})$ and $Y_{i}=\bm{\beta}^{\top}\mathbf{x}_{i}+\sqrt{n}\,\epsilon_{i}$, where $\epsilon_{i}\stackrel{iid}{\sim}\mathcal{N}(0,1)$ for $i\in\{1,2,\ldots,n\}$. We consider three settings of varying dimensionality measured by the ratio $p/n$: $(n,p)\in\{(600,200),(400,400),(200,600)\}$. In each of the three settings, we create a sparse vector $\bm{\beta}$ by selecting 30 coordinates to be non-zero uniformly at random. The signs of these non-zero coordinates are assigned to be either positive or negative with equal probability. We vary the signal amplitudes such that we explore a wide power range below.
For the square-root Lasso, we average over 200 Monte Carlo samples to calculate

$$\lambda=\kappa\cdot\mathbb{E}\left[\frac{\lVert[\mathbf{X}\;\widetilde{\mathbf{X}}]^{\top}\bm{\epsilon}\rVert_{\infty}}{\lVert\bm{\epsilon}\rVert_{2}}\;\middle|\;\mathbf{X},\widetilde{\mathbf{X}}\right].$$

Footnote: The simulation setting is designed in a way that the signal-to-noise ratio has the same scale as $n$ varies.

The target FDR is 20%. Each point on the curves represents the average of the results from 200 replications.
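
The data-generating process of this first simulation can be sketched as follows. The amplitude value used in the example call is hypothetical (the text varies it over a range that is not listed here), and the function name `simulate_independent` is introduced only for illustration.

```python
import numpy as np

def simulate_independent(n, p, amplitude, k=30, rng=None):
    """Draw X with i.i.d. N(0, 1) entries and Y = X beta + sqrt(n) * eps,
    where beta has k non-zero entries of magnitude `amplitude` and random signs.
    """
    rng = np.random.default_rng() if rng is None else rng
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    support = rng.choice(p, size=k, replace=False)   # uniformly random support
    beta[support] = amplitude * rng.choice([-1.0, 1.0], size=k)
    Y = X @ beta + np.sqrt(n) * rng.standard_normal(n)
    return X, Y, beta

# Hypothetical amplitude; (n, p) = (400, 400) is one of the three settings.
X, Y, beta = simulate_independent(400, 400, amplitude=5.0,
                                  rng=np.random.default_rng(0))
# Summary statistics that the GhostKnockoffs procedures take as input.
XtX, XtY, YtY, n = X.T @ X, X.T @ Y, Y @ Y, X.shape[0]
```

Only the last line would be shared with the analyst in the summary-statistics setting; `X` and `Y` themselves stay private.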

Figure 1: Power and FDR plots for independent features and a Gaussian linear model with varying dimensions. Each point is an average over 200 replications.

We observe that GK-sqrtlasso and GK-lassomax generally demonstrate greater power than GK-marginal. This enhanced performance is not surprising, as GK-sqrtlasso and GK-lassomax (1) have access to additional information via $\mathbf{X}^{\top}\mathbf{X}$, and (2) employ a joint modeling algorithm such as the Lasso, which generally provides a better assessment of variable importance for understanding conditional (in)dependence since such a model explicitly adjusts for the effects of all the other variables. We also note the presence of power gaps between KF-lassocv and GK-sqrtlasso/GK-lassomax, likely due to the fact that we are unable to perform cross-validation without individual-level data. All methods control the FDR at the desired level.

3.5.2 AR(1) features

In the second set of simulations (Figure 2), we generate $\mathbf{x}_{i}\stackrel{iid}{\sim}\mathcal{N}(\mathbf{0},\mathbf{\Sigma}_{\rho})$ for $i\in\{1,2,\ldots,n\}$, where $[\mathbf{\Sigma}_{\rho}]_{s,t}=\rho^{|s-t|}$ for $1\leq s,t\leq p$. As before, we generate $Y_{i}=\bm{\beta}^{\top}\mathbf{x}_{i}+\sqrt{n}\epsilon_{i}$, where $\epsilon_{i}\stackrel{iid}{\sim}\mathcal{N}(0,1)$ for $i\in\{1,2,\ldots,n\}$. We consider the same three $(n,p)$ combinations. In each of the three cases, we create a sparse vector $\bm{\beta}$ exactly as before, except that we fix the signal amplitudes to 4, 4, and 7 respectively to explore a wide power range. 
We vary $\rho$ in $\{0,0.1,0.2,\ldots,0.8\}$. The target FDR is set to be 20%. Each point represents the average of the results from 200 replications.
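
The data-generating process described above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and we assume that "signal amplitude" means the common absolute value of the non-zero coefficients of $\bm{\beta}$. Setting `rho=0` recovers the independent-feature setting of the first set of simulations.

```python
import numpy as np

def ar1_covariance(p, rho):
    """AR(1) covariance matrix with entries rho^{|s - t|}."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def simulate_ar1_linear_model(n, p, rho, amplitude, n_signals=30, seed=0):
    """Draw (X, y) with x_i ~ N(0, Sigma_rho) and y_i = beta^T x_i + sqrt(n) * eps_i."""
    rng = np.random.default_rng(seed)
    sigma = ar1_covariance(p, rho)
    # Rows of X are N(0, sigma): multiply iid normals by a Cholesky factor.
    X = rng.standard_normal((n, p)) @ np.linalg.cholesky(sigma).T
    # Sparse beta: 30 non-zero coordinates chosen uniformly, random signs.
    beta = np.zeros(p)
    support = rng.choice(p, size=n_signals, replace=False)
    beta[support] = amplitude * rng.choice([-1.0, 1.0], size=n_signals)
    y = X @ beta + np.sqrt(n) * rng.standard_normal(n)
    return X, y, beta, sigma

# One of the three (n, p) settings, at a single autocorrelation value.
X, y, beta, sigma = simulate_ar1_linear_model(n=400, p=400, rho=0.5, amplitude=4.0)
```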

Figure 2: Power and FDR plots for AR(1) features and a Gaussian linear model with varying dimensions. Each point is an average over 200 replications.

Again, we observe that GK-sqrtlasso and GK-lassomax generally have greater power than GK-marginal. All methods have (almost) decreasing power as the autocorrelation coefficient increases, since it becomes harder to separate true signals from null variables that are correlated with them. All methods control the FDR at the desired level.

4 GhostKnockoffs with Penalized Regression: Missing Empirical Covariance

4.1 Setting

Thus far, we have discussed how incorporating the additional information from $\mathbf{X}^{\top}\mathbf{X}$ and $n$ could enhance our ability to detect significant features. However, in applications such as genetics, $\mathbf{X}^{\top}\mathbf{X}$ may not be available. In this section, we propose alternative procedures when the scientist only knows about $\mathbf{X}^{\top}\mathbf{Y}$, $\lVert\mathbf{Y}\rVert^{2}$ and the sample size $n$. As before, we assume that $X\sim\mathcal{N}(\mathbf{0},\mathbf{\Sigma})$, where the covariance matrix $\mathbf{\Sigma}$ is known (or can be estimated from other data sources).

4.2 GhostKnockoffs with pseudo-lasso

The idea of our method is to modify the Lasso objective function so that it can be constructed from the available summary statistics. It turns out that the solution of our modified objective function is proportional to that of the scout procedure (with known precision matrix) proposed by Witten and Tibshirani (2009). We will see through simulation studies that our procedure improves the power of the original GhostKnockoffs method of He et al. (2022) while maintaining FDR control.

4.2.1 The procedure

Recall that in the knockoffs procedure with the Lasso statistic, we solve the following optimization problem:

$$\hat{\bm{\beta}}(\lambda)=\operatorname*{arg\,min}_{\bm{\beta}\in\mathbb{R}^{2p}}\frac{1}{2n}\bm{\beta}^{\top}\begin{bmatrix}\mathbf{X}^{\top}\mathbf{X}&\mathbf{X}^{\top}\widetilde{\mathbf{X}}\\ \widetilde{\mathbf{X}}^{\top}\mathbf{X}&\widetilde{\mathbf{X}}^{\top}\widetilde{\mathbf{X}}\end{bmatrix}\bm{\beta}-\frac{1}{n}\bm{\beta}^{\top}\begin{bmatrix}\mathbf{X}^{\top}\mathbf{Y}\\ \widetilde{\mathbf{X}}^{\top}\mathbf{Y}\end{bmatrix}+\lambda\lVert\bm{\beta}\rVert_{1}.$$

To mimic the form of the loss function when we do not observe the empirical covariance of the features, we substitute the empirical covariances with their population versions: we replace $\mathbf{X}^{\top}\mathbf{X}/n$ and $\widetilde{\mathbf{X}}^{\top}\widetilde{\mathbf{X}}/n$ by $\mathbf{\Sigma}$, and $\mathbf{X}^{\top}\widetilde{\mathbf{X}}/n$ by $\mathbf{\Sigma}-\mathbf{D}$. As usual, $\mathbf{D}=\text{diag}\{\mathbf{s}\}$ is obtained by solving the convex optimization problem (15). In the language of fixed-X knockoffs (Barber and Candès, 2015), this is equivalent to regarding $\widetilde{\mathbf{X}}$ as a fixed-X knockoff of $\mathbf{X}$ and replacing $\mathbf{X}^{\top}\mathbf{X}/n$ by $\mathbf{\Sigma}$. (Footnote: We remark that similar objective functions have been used in, for example, Mak et al. (2017) and Zou et al. (2022).) This yields Algorithm 4.

Algorithm 4 GhostKnockoffs with Penalized Regression: Missing Empirical Covariance
1:  Input: $\mathbf{X}^{\top}\mathbf{Y}$, $\lVert\mathbf{Y}\rVert_{2}^{2}$, $\mathbf{\Sigma}$ and $n$.
2:  Simulate $\mathbf{Z}\sim\mathcal{N}(\mathbf{0},\mathbf{V})$, where $\mathbf{V}$ is defined as in Algorithm 2.
3:  Solve $\hat{\bm{\beta}}(\lambda)=\operatorname*{arg\,min}_{\bm{\beta}\in\mathbb{R}^{2p}}\frac{1}{2}\bm{\beta}^{\top}\begin{bmatrix}\mathbf{\Sigma}&\mathbf{\Sigma}-\mathbf{D}\\ \mathbf{\Sigma}-\mathbf{D}&\mathbf{\Sigma}\end{bmatrix}\bm{\beta}-\frac{1}{n}\bm{\beta}^{\top}\begin{bmatrix}\mathbf{X}^{\top}\mathbf{Y}\\ \mathbf{P}^{\top}\mathbf{X}^{\top}\mathbf{Y}+\lVert\mathbf{Y}\rVert_{2}\mathbf{Z}\end{bmatrix}+\lambda\lVert\bm{\beta}\rVert_{1}$, where $\mathbf{D}$ and $\mathbf{P}$ are defined as in Section 2.2.2 and $\lambda$ is fixed or as chosen in Section 4.2.2.
4:  Run the standard knockoffs procedure (at level $q$) with importance statistic $W_{j}=\lvert\hat{\beta}_{j}(\lambda)\rvert-\lvert\hat{\beta}_{j+p}(\lambda)\rvert$.
5:  Output: Knockoffs selection set.
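
The steps of Algorithm 4 can be sketched in code as follows. This is a minimal sketch under stated assumptions, not the authors' implementation. The definitions of $\mathbf{P}$ and $\mathbf{V}$ are not restated in this section, so the code assumes the usual Gaussian-knockoff forms $\mathbf{P}=\mathbf{I}-\mathbf{\Sigma}^{-1}\mathbf{D}$ and $\mathbf{V}=2\mathbf{D}-\mathbf{D}\mathbf{\Sigma}^{-1}\mathbf{D}$. The vector `s` (the diagonal of $\mathbf{D}$) is taken as an input rather than obtained from problem (15), the solver is a plain coordinate descent, and the "standard knockoffs procedure" is taken to be the knockoff+ threshold. All function names are hypothetical.

```python
import numpy as np

def solve_l1_quadratic(C, d, lam, n_sweeps=500, tol=1e-10):
    """Coordinate descent for min_b 0.5 * b'Cb - d'b + lam * ||b||_1."""
    beta = np.zeros(len(d))
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(len(d)):
            # d_j minus the contribution of all coordinates other than j.
            r = d[j] - C[j] @ beta + C[j, j] * beta[j]
            new = np.sign(r) * max(abs(r) - lam, 0.0) / C[j, j]
            max_change = max(max_change, abs(new - beta[j]))
            beta[j] = new
        if max_change < tol:
            break
    return beta

def knockoff_plus_select(W, q):
    """Select {j: W_j >= t} for the smallest t with (1 + #{W <= -t}) / max(1, #{W >= t}) <= q."""
    for t in np.sort(np.abs(W[W != 0])):
        if (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t)) <= q:
            return np.flatnonzero(W >= t)
    return np.array([], dtype=int)

def gk_pseudolasso(XtY, y_norm_sq, Sigma, s, n, lam, q, seed=0):
    rng = np.random.default_rng(seed)
    p = len(XtY)
    D = np.diag(s)
    Sigma_inv_D = np.linalg.solve(Sigma, D)
    P = np.eye(p) - Sigma_inv_D          # assumed form of P
    V = 2 * D - D @ Sigma_inv_D          # assumed form of V
    V = (V + V.T) / 2                    # symmetrize against round-off
    # Step 2: simulate Z ~ N(0, V).
    Z = rng.multivariate_normal(np.zeros(p), V)
    # Step 3: the second block stands in for the unobserved X_tilde' Y.
    ghost = P.T @ XtY + np.sqrt(y_norm_sq) * Z
    C = np.block([[Sigma, Sigma - D], [Sigma - D, Sigma]])
    d = np.concatenate([XtY, ghost]) / n
    beta = solve_l1_quadratic(C, d, lam)
    # Step 4: importance statistics and knockoff filter at level q.
    W = np.abs(beta[:p]) - np.abs(beta[p:])
    return knockoff_plus_select(W, q), W
```

Note that the function touches the data only through `XtY`, `y_norm_sq`, and `n`, which mirrors the summary-statistics-only setting of this section.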

We call this procedure GhostKnockoffs with pseudo-lasso statistic (GK-pseudolasso). We show below that Algorithm 4 controls the FDR of selections at level $q$. Before doing so, we first state a general proposition that includes GK-marginal as a special case.

Proposition 2.

Suppose $\mathbf{V}$ and $\mathbf{P}$ are defined as in Algorithm 2, $\mathbf{Z}\sim\mathcal{N}(\mathbf{0},\mathbf{V})$ is independent of $\mathbf{X}$ and $\mathbf{Y}$, and $\widetilde{\mathbf{X}}=\mathcal{G}(\mathbf{X},\mathbf{\Sigma})$. Consider a knockoffs feature importance statistic $\mathbf{W}=\mathbf{g}(\lVert\mathbf{Y}\rVert_{2}^{2},\mathbf{X}^{\top}\mathbf{Y},\widetilde{\mathbf{X}}^{\top}\mathbf{Y},\mathbf{U})\in\mathbb{R}^{p}$, which is a deterministic function of $\lVert\mathbf{Y}\rVert_{2}^{2},\mathbf{X}^{\top}\mathbf{Y},\widetilde{\mathbf{X}}^{\top}\mathbf{Y}$ and an independent random variable $\mathbf{U}$. 
Define $\widehat{\mathbf{W}}=\mathbf{g}(\lVert\mathbf{Y}\rVert_{2}^{2},\mathbf{X}^{\top}\mathbf{Y},\mathbf{P}^{\top}\mathbf{X}^{\top}\mathbf{Y}+\lVert\mathbf{Y}\rVert_{2}\mathbf{Z},\mathbf{U})$. Let $\mathcal{S}_{1}$ (resp. $\mathcal{S}_{2}$) be the rejection set obtained from applying the knockoffs filter on $\mathbf{W}$ (resp. $\widehat{\mathbf{W}}$). Then $\mathcal{S}_{1}\mid\mathbf{X},\mathbf{Y}\stackrel{d}{=}\mathcal{S}_{2}\mid\mathbf{X},\mathbf{Y}$. Thus, if $\mathbf{W}$ obeys the flip-sign property, both procedures have the same FDR, which is at most $q$.

Proof.

In Appendix B, we prove that

$$\widetilde{\mathbf{X}}^{\top}\mathbf{Y}\mid\mathbf{X},\mathbf{Y}\stackrel{d}{=}\mathbf{P}^{\top}\mathbf{X}^{\top}\mathbf{Y}+\lVert\mathbf{Y}\rVert_{2}\mathbf{Z}\mid\mathbf{X},\mathbf{Y}.$$

As a result, $\mathbf{W}\mid\mathbf{X},\mathbf{Y}\stackrel{d}{=}\widehat{\mathbf{W}}\mid\mathbf{X},\mathbf{Y}$. Since the selection set is uniquely determined by the values of $\mathbf{W}$ (or $\widehat{\mathbf{W}}$), it follows that $\mathcal{S}_{1}\mid\mathbf{X},\mathbf{Y}\stackrel{d}{=}\mathcal{S}_{2}\mid\mathbf{X},\mathbf{Y}$. Therefore, the procedures have the same FDR. ∎
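
For intuition, the distributional identity used in this proof can be sketched in a few lines; the full argument is in Appendix B. The sketch assumes, as is standard for Gaussian knockoffs but not restated in this section, that the rows of $\widetilde{\mathbf{X}}$ are drawn independently as $\widetilde{\mathbf{x}}_{i}=\mathbf{P}^{\top}\mathbf{x}_{i}+\mathbf{e}_{i}$ with $\mathbf{e}_{i}\stackrel{iid}{\sim}\mathcal{N}(\mathbf{0},\mathbf{V})$, independently of $\mathbf{Y}$ given $\mathbf{X}$. Then

$$\widetilde{\mathbf{X}}^{\top}\mathbf{Y}=\sum_{i=1}^{n}Y_{i}\widetilde{\mathbf{x}}_{i}=\mathbf{P}^{\top}\mathbf{X}^{\top}\mathbf{Y}+\sum_{i=1}^{n}Y_{i}\mathbf{e}_{i},\qquad\sum_{i=1}^{n}Y_{i}\mathbf{e}_{i}\mid\mathbf{X},\mathbf{Y}\sim\mathcal{N}\big(\mathbf{0},\lVert\mathbf{Y}\rVert_{2}^{2}\mathbf{V}\big),$$

and $\lVert\mathbf{Y}\rVert_{2}\mathbf{Z}$ with $\mathbf{Z}\sim\mathcal{N}(\mathbf{0},\mathbf{V})$ has the same conditional distribution as the second sum.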

Set $\lambda$ to be a fixed numerical constant. Consider the feature importance statistics $\mathbf{W}$ defined by $W_{j}=\lvert\hat{\beta}_{j}(\lambda)\rvert-\lvert\hat{\beta}_{j+p}(\lambda)\rvert$, where $\hat{\bm{\beta}}(\lambda)$ is the solution to

$$\operatorname*{arg\,min}_{\bm{\beta}\in\mathbb{R}^{2p}}\frac{1}{2}\bm{\beta}^{\top}\begin{bmatrix}\mathbf{\Sigma}&\mathbf{\Sigma}-\mathbf{D}\\ \mathbf{\Sigma}-\mathbf{D}&\mathbf{\Sigma}\end{bmatrix}\bm{\beta}-\frac{1}{n}\bm{\beta}^{\top}\begin{bmatrix}\mathbf{X}^{\top}\mathbf{Y}\\ \widetilde{\mathbf{X}}^{\top}\mathbf{Y}\end{bmatrix}+\lambda\lVert\bm{\beta}\rVert_{1},\tag{11}$$

and $\widetilde{\mathbf{X}}=\mathcal{G}(\mathbf{X},\mathbf{\Sigma})$ is the Gaussian knockoff data matrix. The feature importance statistic in Algorithm 4 is thus obtained by replacing $\widetilde{\mathbf{X}}^{\top}\mathbf{Y}$ by $\mathbf{P}^{\top}\mathbf{X}^{\top}\mathbf{Y}+\lVert\mathbf{Y}\rVert_{2}\mathbf{Z}$ in (11). Since $\mathbf{W}$ is determined by $\lVert\mathbf{Y}\rVert_{2}^{2},\mathbf{X}^{\top}\mathbf{Y}$ and $\widetilde{\mathbf{X}}^{\top}\mathbf{Y}$, it follows from Proposition 2 that the rejection set of Algorithm 4 has the same distribution as that obtained from running the knockoff filter on $\mathbf{W}$.

Thus, to prove that Algorithm 4 controls the FDR of rejections at level $q$, it suffices to verify the flip-sign property of the feature importance statistic for $\mathbf{W}$ (see Section 2). This is a consequence of the following lemma:

Lemma 1.

Consider the problem

$$\operatorname*{arg\,min}_{\bm{\beta}\in\mathbb{R}^{2p}}\frac{1}{2}\bm{\beta}^{\top}\mathbf{C}\bm{\beta}-\mathbf{d}^{\top}\bm{\beta}+\lambda\lVert\bm{\beta}\rVert_{1}+\gamma\lVert\bm{\beta}\rVert_{2}^{2}.\tag{12}$$

Let $\bm{\Pi}_{S}$ be the permutation matrix which swaps the $j$th and $(j+p)$th entries of a $2p$-dimensional vector for each $j\in S\subset\{1,\ldots,p\}$. Assume that $\mathbf{C}$ is $S$-swap invariant in the sense that $\bm{\Pi}_{S}^{\top}\mathbf{C}\bm{\Pi}_{S}=\mathbf{C}$. Then $\hat{\bm{\beta}}$ is a solution to (12) if and only if $\bm{\Pi}_{S}\hat{\bm{\beta}}$ is a solution to the same problem with $\mathbf{d}$ replaced by $\bm{\Pi}_{S}\mathbf{d}$. In other words, swapping the entries of $\mathbf{d}$ has the effect of swapping the corresponding entries of the solution.

Proof.

Consider the objective with problem data $\bm{\Pi}_{S}\mathbf{d}$:

$$\frac{1}{2}\bm{\beta}^{\top}\mathbf{C}\bm{\beta}-(\bm{\Pi}_{S}\mathbf{d})^{\top}\bm{\beta}+\lambda\lVert\bm{\beta}\rVert_{1}+\gamma\lVert\bm{\beta}\rVert_{2}^{2}=\frac{1}{2}\bm{\beta}^{\top}\mathbf{C}\bm{\beta}-\mathbf{d}^{\top}\bm{\Pi}_{S}^{\top}\bm{\beta}+\lambda\lVert\bm{\beta}\rVert_{1}+\gamma\lVert\bm{\beta}\rVert_{2}^{2}.$$

Set $\bm{\beta}^{\prime}=\bm{\Pi}_{S}^{\top}\bm{\beta}$ so that $\bm{\beta}=\bm{\Pi}_{S}\bm{\beta}^{\prime}$. Upon changing variables, the objective takes the form

$$\frac{1}{2}(\bm{\beta}^{\prime})^{\top}\bm{\Pi}_{S}^{\top}\mathbf{C}\bm{\Pi}_{S}\bm{\beta}^{\prime}-\mathbf{d}^{\top}\bm{\beta}^{\prime}+\lambda\lVert\bm{\Pi}_{S}\bm{\beta}^{\prime}\rVert_{1}+\gamma\lVert\bm{\Pi}_{S}\bm{\beta}^{\prime}\rVert_{2}^{2}=\frac{1}{2}(\bm{\beta}^{\prime})^{\top}\mathbf{C}\bm{\beta}^{\prime}-\mathbf{d}^{\top}\bm{\beta}^{\prime}+\lambda\lVert\bm{\beta}^{\prime}\rVert_{1}+\gamma\lVert\bm{\beta}^{\prime}\rVert_{2}^{2},$$

where the equality follows because $\bm{\Pi}_{S}^{\top}\mathbf{C}\bm{\Pi}_{S}=\mathbf{C}$ and because the 1-norm and 2-norm are invariant under permutation. Now, the objective on the right-hand side is the objective with data $\mathbf{d}$. If $\hat{\bm{\beta}}$ is the solution with data $\mathbf{d}$, it follows that $\bm{\Pi}_{S}\hat{\bm{\beta}}$ is the solution with data $\bm{\Pi}_{S}\mathbf{d}$, and vice versa. This proves the lemma. ∎
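
The lemma can be checked numerically on a small instance. The following is an illustrative sketch, not part of the proof: the coordinate-descent solver, the AR(1) choice of $\mathbf{\Sigma}$, the value $\mathbf{D}=0.5\,\mathbf{I}$, and the swap set `S` are all assumptions made for the example.

```python
import numpy as np

def solve_elastic_quadratic(C, d, lam, gamma, n_sweeps=5000, tol=1e-13):
    """Coordinate descent for 0.5 * b'Cb - d'b + lam * ||b||_1 + gamma * ||b||_2^2."""
    A = C + 2.0 * gamma * np.eye(len(d))   # ridge term folded into the quadratic
    beta = np.zeros(len(d))
    for _ in range(n_sweeps):
        old = beta.copy()
        for j in range(len(d)):
            r = d[j] - A[j] @ beta + A[j, j] * beta[j]
            beta[j] = np.sign(r) * max(abs(r) - lam, 0.0) / A[j, j]
        if np.max(np.abs(beta - old)) < tol:
            break
    return beta

p = 4
idx = np.arange(p)
Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])
D = 0.5 * np.eye(p)
C = np.block([[Sigma, Sigma - D], [Sigma - D, Sigma]])

# Index permutation that swaps coordinates j and j + p for each j in S.
S = [0, 2]
perm = np.arange(2 * p)
for j in S:
    perm[[j, j + p]] = perm[[j + p, j]]

d = np.random.default_rng(0).standard_normal(2 * p)
beta_d = solve_elastic_quadratic(C, d, lam=0.1, gamma=0.01)
beta_swapped = solve_elastic_quadratic(C, d[perm], lam=0.1, gamma=0.01)

swap_invariant = np.allclose(C[np.ix_(perm, perm)], C)   # Pi' C Pi == C
max_gap = np.max(np.abs(beta_swapped - beta_d[perm]))    # solution is swapped too

# Consequence for W_j = |beta_j| - |beta_{j+p}|: the sign flips exactly on S.
W_d = np.abs(beta_d[:p]) - np.abs(beta_d[p:])
W_swapped = np.abs(beta_swapped[:p]) - np.abs(beta_swapped[p:])
```

Up to solver tolerance, `max_gap` should be negligible and `W_swapped` should equal `W_d` with the signs of the entries in `S` reversed, which is the flip-sign behavior the lemma is used to establish.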

Corollary 2.

Algorithm 4 with a fixed $\lambda$ controls the FDR of rejections at level $q$.

Proof.

It is easy to show that $\begin{bmatrix}\mathbf{\Sigma}&\mathbf{\Sigma}-\mathbf{D}\\ \mathbf{\Sigma}-\mathbf{D}&\mathbf{\Sigma}\end{bmatrix}$ is $S$-swap invariant for any $S\subset\{1,\ldots,p\}$. Taking

$$\mathbf{C}=\begin{bmatrix}\mathbf{\Sigma}&\mathbf{\Sigma}-\mathbf{D}\\ \mathbf{\Sigma}-\mathbf{D}&\mathbf{\Sigma}\end{bmatrix}$$

and

$$\mathbf{d}=\frac{1}{n}\begin{bmatrix}\mathbf{X}^{\top}\mathbf{Y}\\ \widetilde{\mathbf{X}}^{\top}\mathbf{Y}\end{bmatrix}$$

in Lemma 1 establishes the flip-sign property of $\mathbf{W}$ and, therefore, the FDR control of Algorithm 4 for a fixed $\lambda$. ∎

In practice, to ensure numerical stability, we add a small positive constant multiple of the identity matrix to

$$\begin{bmatrix}\mathbf{\Sigma}&\mathbf{\Sigma}-\mathbf{D}\\ \mathbf{\Sigma}-\mathbf{D}&\mathbf{\Sigma}\end{bmatrix}$$

when solving for $\hat{\bm{\beta}}$. This is equivalent to incorporating a small Ridge penalty into the objective function. It is easy to see that the lemma proved above guarantees that this modification does not compromise the FDR control as

$$\begin{bmatrix}\mathbf{\Sigma}+c\mathbf{I}&\mathbf{\Sigma}-\mathbf{D}\\ \mathbf{\Sigma}-\mathbf{D}&\mathbf{\Sigma}+c\mathbf{I}\end{bmatrix}$$

is also $S$-swap invariant for any $c\in\mathbb{R}$ and any $S\subset\{1,\ldots,p\}$.
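
To make the Ridge equivalence explicit, write $\mathbf{C}$ for the unmodified block matrix, so that the stabilized matrix is $\mathbf{C}+c\mathbf{I}_{2p}$. Expanding the quadratic term gives

$$\frac{1}{2}\bm{\beta}^{\top}(\mathbf{C}+c\mathbf{I}_{2p})\bm{\beta}=\frac{1}{2}\bm{\beta}^{\top}\mathbf{C}\bm{\beta}+\frac{c}{2}\lVert\bm{\beta}\rVert_{2}^{2},$$

so the modification corresponds to problem (12) with $\gamma=c/2$.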

4.2.2 Choice of tuning parameter

Several methods can be used to tune the value of the hyperparameter λ𝜆\lambdaitalic_λ. We here consider two approaches.

Method 1 (lasso-min)

Pretend a homogeneous Gaussian linear model holds, i.e. $\mathbf{Y}=\mathbf{X}\bm{\beta}^{*}+\sigma\bm{\epsilon}$ for some $\bm{\beta}^{*}\in\mathbb{R}^{p}$, $\sigma>0$ and $\bm{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_{n})$.

Focus on (11) first and imagine that we have a method for computing $\lambda$ that depends on data only through $\lVert\mathbf{Y}\rVert_{2}^{2},\mathbf{X}^{\top}\mathbf{Y}$, and $\widetilde{\mathbf{X}}^{\top}\mathbf{Y}$. Note that the objective in Algorithm 4 only substitutes $\widetilde{\mathbf{X}}^{\top}\mathbf{Y}$ in (11) with $\mathbf{P}^{\top}\mathbf{X}^{\top}\mathbf{Y}+\lVert\mathbf{Y}\rVert_{2}\mathbf{Z}$. Therefore, by Proposition 2, if we set $\lambda$ via the same functional and work with $\mathbf{P}^{\top}\mathbf{X}^{\top}\mathbf{Y}+\lVert\mathbf{Y}\rVert_{2}\mathbf{Z}$ in lieu of $\widetilde{\mathbf{X}}^{\top}\mathbf{Y}$, we shall achieve FDR control with this data-driven value of the hyperparameter $\lambda$. This holds, of course, with the proviso that our selection of hyperparameter is symmetric in the sense that it produces a feature importance statistic obeying the flip-sign property.

To set the tuning parameter $\lambda_{0}$ in (11), we use the common choice of taking a constant multiple of the expected value of the minimum $\lambda$ value such that $\hat{\bm{\beta}}(\lambda)=\mathbf{0}_{2p}$ under the null model $\mathbf{Y}=\sigma\bm{\epsilon}$. By the Karush–Kuhn–Tucker (KKT) conditions (Boyd and Vandenberghe, 2004), this results in a tuning parameter of the form

$$\lambda_{0}=\kappa\cdot\frac{\sigma}{n}\cdot\mathbb{E}\left[\left\lVert\begin{bmatrix}\mathbf{X}&\widetilde{\mathbf{X}}\end{bmatrix}^{\top}\bm{\epsilon}\right\rVert_{\infty}\right],$$

where $\kappa$ is a hyperparameter between 0 and 1. Since $\begin{bmatrix}\mathbf{X}&\widetilde{\mathbf{X}}\end{bmatrix}$ is a data matrix whose rows are iid samples from

$$\mathcal{N}\left(\mathbf{0},\begin{bmatrix}\mathbf{\Sigma}&\mathbf{\Sigma}-\mathbf{D}\\ \mathbf{\Sigma}-\mathbf{D}&\mathbf{\Sigma}\end{bmatrix}\right),$$

$\mathbb{E}\left[\left\lVert\begin{bmatrix}\mathbf{X}&\widetilde{\mathbf{X}}\end{bmatrix}^{\top}\bm{\epsilon}\right\rVert_{\infty}\right]$ is a numerical constant, which can be estimated arbitrarily well via Monte Carlo simulations. We use the approach from Dicker (2014) to give an estimate of $\sigma$, which crucially requires knowing only $\lVert\mathbf{Y}\rVert_{2}^{2}$, $\mathbf{X}^{\top}\mathbf{Y}$, and $\widetilde{\mathbf{X}}^{\top}\mathbf{Y}$. Dicker (2014) showed that the estimator is consistent and asymptotically normal in the high-dimensional regime. Specifically, in our setting, we estimate $\sigma$ by

$$\widehat{\sigma}_{0}=\sqrt{\max\left(\frac{2p+n+1}{n(n+1)}\lVert\mathbf{Y}\rVert_{2}^{2}-\frac{1}{n(n+1)}\mathbf{Y}^{\top}\begin{bmatrix}\mathbf{X}&\widetilde{\mathbf{X}}\end{bmatrix}\begin{bmatrix}\mathbf{\Sigma}&\mathbf{\Sigma}-\mathbf{D}\\ \mathbf{\Sigma}-\mathbf{D}&\mathbf{\Sigma}\end{bmatrix}^{-1}\begin{bmatrix}\mathbf{X}&\widetilde{\mathbf{X}}\end{bmatrix}^{\top}\mathbf{Y},\ 0\right)}.$$

In sum, a choice for $\lambda$ in Algorithm 4 is this:

  1. Approximate $\mathbb{E}[\lVert\mathbf{R}^{\top}\bm{\epsilon}\rVert_{\infty}]$ via Monte Carlo simulations, where $\mathbf{R}\in\mathbb{R}^{n\times 2p}$ has iid $\mathcal{N}\left(\mathbf{0},\begin{bmatrix}\mathbf{\Sigma}&\mathbf{\Sigma}-\mathbf{D}\\ \mathbf{\Sigma}-\mathbf{D}&\mathbf{\Sigma}\end{bmatrix}\right)$ rows and $\bm{\epsilon}\sim N(\mathbf{0},\mathbf{I}_{n})$ is independent of $\mathbf{R}$.

  2. Compute

    $$\widehat{\sigma}_{0}=\sqrt{\max\left(\frac{2p+n+1}{n(n+1)}\lVert\mathbf{Y}\rVert_{2}^{2}-\frac{1}{n(n+1)}\begin{bmatrix}\mathbf{Y}^{\top}\mathbf{X}&\mathbf{Y}^{\top}\mathbf{X}\mathbf{P}+\lVert\mathbf{Y}\rVert_{2}\mathbf{Z}^{\top}\end{bmatrix}\begin{bmatrix}\mathbf{\Sigma}&\mathbf{\Sigma}-\mathbf{D}\\ \mathbf{\Sigma}-\mathbf{D}&\mathbf{\Sigma}\end{bmatrix}^{-1}\begin{bmatrix}\mathbf{X}^{\top}\mathbf{Y}\\ \mathbf{P}^{\top}\mathbf{X}^{\top}\mathbf{Y}+\lVert\mathbf{Y}\rVert_{2}\mathbf{Z}\end{bmatrix},\ 0\right)},$$

    where $\mathbf{Z}$ is independent of everything else.

  3. Output $\lambda\approx\kappa\cdot\frac{\widehat{\sigma}_{0}}{n}\cdot\mathbb{E}[\lVert\mathbf{R}^{\top}\bm{\epsilon}\rVert_{\infty}]$, where the approximation sign $\approx$ reminds us that the expectation is only approximated by Monte Carlo.

As in the square-root Lasso case, we observe that the power of our method is not very sensitive to the choice of $\kappa$. We use $\kappa=0.6$ in our simulations below. In Appendix E, we provide details of the computation of $\lambda$ and prove that Algorithm 4 maintains FDR control with the computed $\lambda$.
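The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`joint_cov`, `mc_sup_norm`, `sigma_hat0`, `lasso_min_lambda`) are hypothetical, and the ghost statistic $\mathbf{P}^{\top}\mathbf{X}^{\top}\mathbf{Y}+\lVert\mathbf{Y}\rVert_{2}\mathbf{Z}$ is assumed to have been computed already (as in Algorithm 4) and is passed in as `ghost_xkty`.

```python
import numpy as np

def joint_cov(Sigma, D):
    """Covariance of a row of [X, X~]: [[Sigma, Sigma - D], [Sigma - D, Sigma]]."""
    return np.block([[Sigma, Sigma - D], [Sigma - D, Sigma]])

def mc_sup_norm(G, n, n_draws=200, rng=None):
    """Step 1: Monte Carlo estimate of E||R^T eps||_inf, R having iid N(0, G) rows."""
    rng = np.random.default_rng(0) if rng is None else rng
    L = np.linalg.cholesky(G)
    draws = []
    for _ in range(n_draws):
        R = rng.standard_normal((n, G.shape[0])) @ L.T  # rows ~ N(0, G)
        eps = rng.standard_normal(n)                    # drawn independently of R
        draws.append(np.max(np.abs(R.T @ eps)))
    return float(np.mean(draws))

def sigma_hat0(y_norm_sq, s, G, n):
    """Step 2: plug the stacked statistics s = [X^T Y; ghost X~^T Y] into sigma_hat_0."""
    p = s.shape[0] // 2
    quad = s @ np.linalg.solve(G, s)                    # s^T G^{-1} s
    val = (2 * p + n + 1) / (n * (n + 1)) * y_norm_sq - quad / (n * (n + 1))
    return float(np.sqrt(max(val, 0.0)))

def lasso_min_lambda(y_norm_sq, xty, ghost_xkty, Sigma, D, n,
                     kappa=0.6, n_draws=200, rng=None):
    """Step 3: lambda ~ kappa * sigma_hat_0 / n * E||R^T eps||_inf."""
    G = joint_cov(Sigma, D)
    s = np.concatenate([xty, ghost_xkty])
    return kappa * sigma_hat0(y_norm_sq, s, G, n) / n * mc_sup_norm(G, n, n_draws, rng)
```

Only $\lVert\mathbf{Y}\rVert_{2}^{2}$, $\mathbf{X}^{\top}\mathbf{Y}$, the ghost statistic, $\mathbf{\Sigma}$, $\mathbf{D}$ and $n$ enter the computation, so no individual-level data are needed. The number of Monte Carlo draws (`n_draws`) is an illustrative default.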

Method 2 (pseudo-sum)

An alternative way of choosing $\lambda$ is to adapt the pseudo-summary statistics approach proposed by Zhang et al. (2021). Set $\mathbf{r}=\mathbf{X}^{\top}\mathbf{Y}/n$ and $\widetilde{\mathbf{r}}=\mathbf{P}^{\top}\mathbf{r}+\lVert\mathbf{Y}\rVert_{2}\mathbf{Z}/n$. The main idea of Zhang et al. (2021) is to generate training summary statistics $\mathbf{r}_{t}$ and validation summary statistics $\mathbf{r}_{v}$ from $\mathbf{r}$ and $\widetilde{\mathbf{r}}$ based on the training and validation sample sizes $n_{t}$ and $n_{v}$ respectively (in this paper we take $n_{t}=0.8n$ and $n_{v}=0.2n$). Following Zhang et al. (2021), we generate the training summary statistics

$$\begin{bmatrix}\mathbf{r}\\ \widetilde{\mathbf{r}}\end{bmatrix}_{t}=\begin{bmatrix}\mathbf{r}\\ \widetilde{\mathbf{r}}\end{bmatrix}+\sqrt{\frac{n_{v}}{n\times n_{t}}}\,\mathbf{R},$$

where

$$\mathbf{R}\sim\mathcal{N}\left(\mathbf{0},\begin{bmatrix}\mathbf{\Sigma}&\mathbf{\Sigma}-\mathbf{D}\\ \mathbf{\Sigma}-\mathbf{D}&\mathbf{\Sigma}\end{bmatrix}\right),$$

and the validation summary statistics

$$\begin{bmatrix}\mathbf{r}\\ \widetilde{\mathbf{r}}\end{bmatrix}_{v}=\frac{1}{n_{v}}\left[n\begin{bmatrix}\mathbf{r}\\ \widetilde{\mathbf{r}}\end{bmatrix}-n_{t}\begin{bmatrix}\mathbf{r}\\ \widetilde{\mathbf{r}}\end{bmatrix}_{t}\right].$$

Given a sequence of candidate $\lambda$ values, we choose the one that maximizes an approximation $f(\lambda)$ of the correlation between the predicted values and the true values on the pseudo-validation set. (Footnote: Unlike the previous approach, this tuning parameter choice will not induce the exact flip-sign property. However, we observe empirically that our method is robust to this issue, and no FDR inflation occurred. In theory, one could randomly swap all the variables with their corresponding knockoffs and compute the average of all the $\lambda$ values obtained. In the limit, the average will give a data-driven value of $\lambda$ that is invariant to swapping variables with their knockoffs due to symmetry.) Specifically, Zhang et al. (2021) considered the approximation

$$f(\lambda)=\frac{\hat{\bm{\beta}}^{\top}_{t,\lambda}\begin{bmatrix}\mathbf{r}\\ \widetilde{\mathbf{r}}\end{bmatrix}_{v}}{\sqrt{\hat{\bm{\beta}}^{\top}_{t,\lambda}\begin{bmatrix}\mathbf{\Sigma}&\mathbf{\Sigma}-\mathbf{D}\\ \mathbf{\Sigma}-\mathbf{D}&\mathbf{\Sigma}\end{bmatrix}\hat{\bm{\beta}}_{t,\lambda}}},\tag{13}$$

where

$$\hat{\bm{\beta}}_{t,\lambda}=\operatorname*{arg\,min}_{\bm{\beta}\in\mathbb{R}^{2p}}\ \frac{1}{2}\bm{\beta}^{\top}\begin{bmatrix}\mathbf{\Sigma}&\mathbf{\Sigma}-\mathbf{D}\\ \mathbf{\Sigma}-\mathbf{D}&\mathbf{\Sigma}\end{bmatrix}\bm{\beta}-\bm{\beta}^{\top}\begin{bmatrix}\mathbf{r}\\ \widetilde{\mathbf{r}}\end{bmatrix}_{t}+\lambda\lVert\bm{\beta}\rVert_{1}.\tag{14}$$

Therefore, we choose the $\lambda$ value that maximizes (13) among a set of candidate values. Since the objective function (11) is convex in $\bm{\beta}$, we may employ the BASIL framework proposed by Qian et al. (2020), which implements a batch version of the strong rules introduced in Tibshirani et al. (2012). BASIL can be directly applied to compute the solution path of (14) efficiently.
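The pseudo-sum recipe can be illustrated with a short NumPy sketch. To keep it self-contained, (14) is solved here by plain cyclic coordinate descent rather than by BASIL, so this is a stand-in solver and not the implementation used in the paper. All function names are hypothetical, `r_all` denotes the stacked vector $[\mathbf{r};\widetilde{\mathbf{r}}]$, and `G` denotes the $2p\times 2p$ block covariance matrix appearing in (13) and (14).

```python
import numpy as np

def split_summary_stats(r_all, G, n, n_t, rng):
    """Pseudo training/validation statistics from the stacked vector [r; r~]."""
    n_v = n - n_t
    R = np.linalg.cholesky(G) @ rng.standard_normal(len(r_all))  # R ~ N(0, G)
    r_t = r_all + np.sqrt(n_v / (n * n_t)) * R
    r_v = (n * r_all - n_t * r_t) / n_v
    return r_t, r_v

def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def lasso_from_summary(G, r_t, lam, n_iter=500, tol=1e-10):
    """Minimize 0.5 b^T G b - b^T r_t + lam * ||b||_1 by cyclic coordinate descent."""
    beta = np.zeros(len(r_t))
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(len(beta)):
            # Gradient information for coordinate j with beta_j's own term removed.
            rho = r_t[j] - G[j] @ beta + G[j, j] * beta[j]
            new = soft_threshold(rho, lam) / G[j, j]
            max_change = max(max_change, abs(new - beta[j]))
            beta[j] = new
        if max_change < tol:
            break
    return beta

def validation_score(beta, r_v, G):
    """f(lambda) in (13); defined as -inf when beta is identically zero."""
    denom = np.sqrt(beta @ G @ beta)
    return -np.inf if denom == 0 else float(beta @ r_v / denom)

def choose_lambda(r_all, G, n, lambdas, frac_train=0.8, rng=None):
    """Return the candidate lambda maximizing (13), together with all scores."""
    rng = np.random.default_rng(0) if rng is None else rng
    r_t, r_v = split_summary_stats(r_all, G, n, frac_train * n, rng)
    scores = [validation_score(lasso_from_summary(G, r_t, lam), r_v, G)
              for lam in lambdas]
    return lambdas[int(np.argmax(scores))], scores
```

By construction, $n\,[\mathbf{r};\widetilde{\mathbf{r}}]=n_{t}[\mathbf{r};\widetilde{\mathbf{r}}]_{t}+n_{v}[\mathbf{r};\widetilde{\mathbf{r}}]_{v}$, mirroring how a full-sample statistic decomposes over a real training/validation split. For large $p$ a path solver with screening, such as the BASIL approach mentioned above, would be the more practical choice.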

Note that there exist other ways to choose the penalty level $\lambda$ using $\mathbf{X}^{\top}\mathbf{Y}$, $\lVert\mathbf{Y}\rVert_{2}$ and $n$ (for example, the Lassosum by Mak et al. (2017)). We do not attempt to claim an optimal strategy.

Connection with the scout procedure

It turns out that step 3 of Algorithm 4 is closely related to the scout procedure (Witten and Tibshirani, 2009). The scout procedure defines a family of covariance-regularized regression methods that achieve superior prediction via shrinking the inverse covariance matrix. It includes the Lasso, Ridge and Elastic Net as special cases. In Appendix F, we show that the solution of objective function (11) is proportional to that of the scout procedure (with known precision matrix $\mathbf{\Sigma}^{-1}$). This connection provides a justification for why the objective function (11) is effective.

4.2.3 GhostKnockoffs with other feature importance statistics

In the previous sections, we presented a feature importance statistic based on summary statistics that leads to better power than the marginal correlation difference statistic. By Proposition 2, GhostKnockoffs techniques can be combined with any other feature importance statistics that i) are based on the summary statistics $\mathbf{X}^{\top}\mathbf{Y}$, $\lVert\mathbf{Y}\rVert^{2}$ and the sample size $n$ and ii) satisfy the flip-sign property. The resulting procedures still guarantee FDR control. In our simulation studies, we found that using the posterior inclusion probability (PIP) produced by the SuSiE-RSS model (Zou et al., 2022) as the feature importance statistic also results in consistent power improvement over GK-marginal. SuSiE-RSS is based on the Sum of Single Effects (SuSiE) model proposed by Wang et al. (2020), which assumes a Bayesian linear model with true coefficients $\bm{\beta}$ represented as the sum of multiple one-hot (random) individual effect vectors. Zou et al. (2022) combine SuSiE with a modified likelihood function to accommodate applications in which only summary statistics are available (see Zou et al. (2022) for details). (Footnote: We used the susie_rss function inside the R package susieR in our simulations.) We call the resulting procedure GhostKnockoffs with SuSiE-RSS statistic and denote it by GK-susie-rss. We include this method in the simulation section below.

4.3 Variants of GhostKnockoffs

The methods we presented so far can be adapted to work with various related procedures. We give three examples below for illustration.

4.3.1 Multi-knockoffs

The knockoffs procedure is a randomized procedure which could produce very different selection sets on different runs. This is especially true when the knockoffs rejection set is small. In fact, the offset in the numerator in (3) implies that knockoffs either rejects at least $\lceil\frac{1}{q}\rceil$ hypotheses, where $q$ is the target FDR level, or rejects nothing. To improve the stability of the knockoffs procedure, Gimenez and Zou (2019) proposed simultaneous multi-knockoffs, which is substantially more stable and powerful than knockoffs when the rejection set is small and maintains FDR control in general.
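The all-or-nothing behaviour follows from a short counting argument. The sketch below assumes that (3) is the usual knockoffs threshold with an offset of 1 in the numerator; the symbols $W_{j}$ (feature importance statistics), $\tau$ (threshold) and $\widehat{S}$ (rejection set) are standard knockoff notation assumed here for illustration.

```latex
\[
\tau=\min\left\{t>0:\ \frac{1+\#\{j: W_{j}\le -t\}}{\max\left(\#\{j: W_{j}\ge t\},\,1\right)}\le q\right\},
\qquad
\widehat{S}=\{j: W_{j}\ge\tau\}.
\]
If $\widehat{S}\neq\emptyset$, then
\[
\frac{1}{|\widehat{S}|}
\;\le\;\frac{1+\#\{j: W_{j}\le -\tau\}}{|\widehat{S}|}
\;\le\; q
\quad\Longrightarrow\quad
|\widehat{S}|\ge\frac{1}{q}
\quad\Longrightarrow\quad
|\widehat{S}|\ge\left\lceil\frac{1}{q}\right\rceil .
\]
```

For example, at target level $q=0.1$ any nonempty rejection set must contain at least 10 variables, which is why runs with only a handful of strong signals can return nothing.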

The idea of Gimenez and Zou (2019) is to create $M$ (instead of one) knockoff copies for every feature so that they jointly satisfy an extended exchangeability condition. (Footnote: Specifically, the extended exchangeability condition says that if we permute variables with their corresponding (multiple) knockoffs arbitrarily, the joint distribution remains unchanged.) If $X\sim\mathcal{N}(\mathbf{0},\mathbf{\Sigma})$, Gimenez and Zou (2019) showed that $\widetilde{X}\in\mathbb{R}^{pM}$ is a valid $M$ multi-knockoff for $X\in\mathbb{R}^{p}$ if $\begin{bmatrix}X&\widetilde{X}\end{bmatrix}\sim\mathcal{N}(\mathbf{0},\mathbf{G})$, where

$$\mathbf{G}=\begin{bmatrix}\mathbf{\Sigma}&\mathbf{\Sigma}-\mathbf{D}&\cdots&\mathbf{\Sigma}-\mathbf{D}\\ \mathbf{\Sigma}-\mathbf{D}&\mathbf{\Sigma}&\cdots&\mathbf{\Sigma}-\mathbf{D}\\ \vdots&\vdots&\ddots&\vdots\\ \mathbf{\Sigma}-\mathbf{D}&\cdots&\cdots&\mathbf{\Sigma}\end{bmatrix}\in\mathbb{R}^{(M+1)p\times(M+1)p}.$$

Here, $\mathbf{D}=\text{diag}\{\mathbf{s}\}$, and $\mathbf{s}$ is obtained by solving a more restrictive convex optimization problem than in (15) which guarantees that $\mathbf{G}$ is positive semi-definite (see Gimenez and Zou (2019) for details). In data matrix form, we generate valid $M$ multi-knockoffs by

$$\widetilde{\mathbf{X}}=\mathbf{X}\mathbf{P}+\mathbf{E}\mathbf{V}^{1/2},$$

where $\mathbf{P}=\begin{bmatrix}\mathbf{I}-\mathbf{\Sigma}^{-1}\mathbf{D}&\cdots&\mathbf{I}-\mathbf{\Sigma}^{-1}\mathbf{D}\end{bmatrix}\in\mathbb{R}^{p\times Mp}$, $\mathbf{E}\in\mathbb{R}^{n\times Mp}$ has i.i.d. standard normal entries, and

$$\mathbf{V}=\begin{bmatrix}2\mathbf{D}-\mathbf{D}\mathbf{\Sigma}^{-1}\mathbf{D}&\mathbf{D}-\mathbf{D}\mathbf{\Sigma}^{-1}\mathbf{D}&\cdots&\mathbf{D}-\mathbf{D}\mathbf{\Sigma}^{-1}\mathbf{D}\\ \mathbf{D}-\mathbf{D}\mathbf{\Sigma}^{-1}\mathbf{D}&2\mathbf{D}-\mathbf{D}\mathbf{\Sigma}^{-1}\mathbf{D}&\cdots&\mathbf{D}-\mathbf{D}\mathbf{\Sigma}^{-1}\mathbf{D}\\ \vdots&\vdots&\ddots&\vdots\\ \mathbf{D}-\mathbf{D}\mathbf{\Sigma}^{-1}\mathbf{D}&\mathbf{D}-\mathbf{D}\mathbf{\Sigma}^{-1}\mathbf{D}&\cdots&2\mathbf{D}-\mathbf{D}\mathbf{\Sigma}^{-1}\mathbf{D}\end{bmatrix}.$$

Gimenez and Zou (2019) generalized the knockoffs threshold (3) and the flip-sign property to produce FDR-controlling rejection sets after generating multiple knockoffs via this procedure.
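As a sanity check on the construction, the following NumPy sketch builds $\mathbf{G}$, $\mathbf{P}$ and $\mathbf{V}$ for a given $M$ and verifies numerically that $\widetilde{\mathbf{X}}=\mathbf{X}\mathbf{P}+\mathbf{E}\mathbf{V}^{1/2}$ has the joint covariance $\mathbf{G}$ with $\mathbf{X}$. The function names are hypothetical, $\mathbf{D}$ is taken as given (the optimization for $\mathbf{s}$ is not shown), and a Cholesky factor is used in place of $\mathbf{V}^{1/2}$, which is an implementation choice: any factor $\mathbf{L}$ with $\mathbf{L}\mathbf{L}^{\top}=\mathbf{V}$ yields the same distribution.

```python
import numpy as np

def multi_knockoff_matrices(Sigma, D, M):
    """Build G ((M+1)p x (M+1)p), P (p x Mp) and V (Mp x Mp) as defined in the text."""
    p = Sigma.shape[0]
    Sinv_D = np.linalg.solve(Sigma, D)                       # Sigma^{-1} D
    # Every block of G is Sigma - D, plus D on the diagonal blocks (giving Sigma).
    G = np.kron(np.ones((M + 1, M + 1)), Sigma - D) + np.kron(np.eye(M + 1), D)
    # P repeats I - Sigma^{-1} D side by side, M times.
    P = np.tile(np.eye(p) - Sinv_D, (1, M))
    # Every block of V is D - D Sigma^{-1} D, plus D on the diagonal blocks.
    V = np.kron(np.ones((M, M)), D - D @ Sinv_D) + np.kron(np.eye(M), D)
    return G, P, V

def sample_multi_knockoffs(X, Sigma, D, M, rng):
    """Draw X~ = X P + E L^T, where L L^T = V (Cholesky factor used as V^{1/2})."""
    _, P, V = multi_knockoff_matrices(Sigma, D, M)
    L = np.linalg.cholesky(V)
    E = rng.standard_normal((X.shape[0], V.shape[0]))
    return X @ P + E @ L.T

def implied_joint_cov(Sigma, P, V):
    """Covariance of a row of [X, X~] implied by X~ = X P + E V^{1/2}."""
    return np.block([[Sigma, Sigma @ P],
                     [P.T @ Sigma, P.T @ Sigma @ P + V]])
```

Comparing `implied_joint_cov(Sigma, P, V)` with `G` via `np.allclose` confirms the block identities $\mathbf{\Sigma}\mathbf{P}=[\mathbf{\Sigma}-\mathbf{D},\ldots,\mathbf{\Sigma}-\mathbf{D}]$ and $\mathbf{P}^{\top}\mathbf{\Sigma}\mathbf{P}+\mathbf{V}$ having $\mathbf{\Sigma}$ on its diagonal blocks and $\mathbf{\Sigma}-\mathbf{D}$ elsewhere.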

In the summary statistics setting, upon redefining $\mathbf{P}$, $\mathbf{V}$ and $\mathbf{s}$ as above and replacing the standard knockoffs filter by the multi-knockoffs filter, Algorithms 2 and 3 produce rejection sets that have the same distribution as those produced by their corresponding versions with individual-level data. For Algorithm 4, we simply need to further replace

$$\begin{bmatrix}\mathbf{\Sigma}&\mathbf{\Sigma}-\mathbf{D}\\ \mathbf{\Sigma}-\mathbf{D}&\mathbf{\Sigma}\end{bmatrix}$$

by $\mathbf{G}$.

4.3.2 Group knockoffs

When variables are highly correlated, selection procedures become conservative. For example, if a non-null variable $X_{j}$ is highly correlated with a null variable $X_{k}$, it becomes difficult to reject $X_{j}\perp\!\!\!\perp Y\mid X_{-j}$. This is an important practical concern because highly correlated features are ubiquitous in many settings, particularly GWAS datasets. To overcome this challenge, group knockoffs (Dai and Barber, 2016) can be useful; please see Chu et al. (2023), whose algorithms we employ in the data analyses of Section 5. In group knockoffs, the object of inference is shifted from single variables to groups of highly correlated variables. Specifically, suppose we partition $p$ features into $g$ groups and reorder all features such that features of the same group are in adjacent columns of $\mathbf{X}$. The objective is to test the group conditional independence hypotheses

$$H_{\gamma}^{0}:X_{\gamma}\perp\!\!\!\perp Y\mid X_{-\gamma},$$

where $\gamma\in\{1,\ldots,g\}$ denotes a group and $X_{\gamma}$ is the vector of features in group $\gamma$. When features within these groups are strongly correlated, single-variable knockoffs may struggle to identify signals, but group knockoffs retain power to identify significant groups. As in Section 4.3.1, all methods described in this paper apply to group knockoffs after redefining $\mathbf{D}$ to the equivalent version in group knockoffs. In Appendix G, we detail the construction of group knockoffs and examples of importance scores at the group level for inference.

4.3.3 Conditional randomization test

The conditional randomization test (CRT) (Candès et al., 2018) is an alternative method to test the conditional independence hypotheses $H_{j}:X_{j}\perp\!\!\!\perp Y\mid X_{-j}$ for $1\leq j\leq p$. By generating a valid 'CRT $p$-value' $p_{j}$ for each hypothesis $H_{j}$, existing multiple testing procedures, including the Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995) and the selective SeqStep+ filter (Li and Candès, 2021), can be used to simultaneously test $H_{1},\ldots,H_{p}$ with FDR control. (Footnote: In general, CRT $p$-values may not be independent of each other or satisfy the PRDS property (Benjamini and Yekutieli, 2001). Therefore, applying the Benjamini-Hochberg procedure on CRT $p$-values does not guarantee FDR control theoretically. However, as noted in Candès et al. (2018), the FDR is usually under control empirically.) As shown in Candès et al. (2018) and Wang and Janson (2021), doing so can improve the power of multiple testing at the cost of greater computational complexity.
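To make the CRT-plus-multiple-testing pipeline concrete, here is a minimal individual-level sketch, assuming Gaussian features with known covariance $\mathbf{\Sigma}$. It is illustrative only and is not the GhostCRT procedure of Appendix H: the test statistic $|X_{j}^{\top}\mathbf{Y}|$, the number of resamples, and the function names are all assumptions made for this example.

```python
import numpy as np

def crt_pvalues(X, Y, Sigma, n_resamples=199, rng=None):
    """CRT p-values for H_j: X_j independent of Y given X_{-j}, rows of X ~ N(0, Sigma)."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, p = X.shape
    Theta = np.linalg.inv(Sigma)                       # precision matrix
    pvals = np.empty(p)
    for j in range(p):
        # Gaussian conditional law of X_j given X_{-j}:
        #   variance 1 / Theta_jj, mean X_j - (X Theta)_j / Theta_jj.
        cond_var = 1.0 / Theta[j, j]
        cond_mean = X[:, j] - (X @ Theta[:, j]) * cond_var
        t_obs = abs(X[:, j] @ Y)                       # illustrative test statistic
        X_new = cond_mean[:, None] + np.sqrt(cond_var) * rng.standard_normal((n, n_resamples))
        t_null = np.abs(X_new.T @ Y)                   # statistic on each resampled column
        pvals[j] = (1 + np.sum(t_null >= t_obs)) / (n_resamples + 1)
    return pvals

def benjamini_hochberg(pvals, q):
    """Indices rejected by the Benjamini-Hochberg procedure at level q."""
    pvals = np.asarray(pvals, dtype=float)
    m = len(pvals)
    order = np.argsort(pvals)
    passed = np.nonzero(pvals[order] <= q * np.arange(1, m + 1) / m)[0]
    if passed.size == 0:
        return np.array([], dtype=int)
    return np.sort(order[: passed[-1] + 1])
```

The inner loop resamples each feature `n_resamples` times, which is where the extra computational cost relative to knockoffs comes from: the work grows with $p$ times the number of resamples, and with a more expensive statistic (for example a Lasso fit) each resample requires a refit.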

In Appendix H, we introduce GhostKnockoffs for CRT (GhostCRT), which adapts the techniques introduced in this paper to the framework of CRT.

4.4 Numerical simulations

We conduct simulations on synthetic data as well as semi-synthetic data generated from a real-world genetic dataset. Specifically, we apply GhostKnockoffs with pseudo-lasso statistic (GK-pseudolasso, defined in Algorithm 4 with tuning parameter $\lambda$ chosen by either lasso-min or pseudo-sum from Section 4.2.2) and GhostKnockoffs with SuSiE-RSS statistic (GK-susie-rss, defined in Section 4.2.3). We compare their performance with GhostKnockoffs with marginal correlation difference statistic (GK-marginal, defined in Section 2) and the knockoffs procedure with (cross-validated) Lasso coefficient difference statistic based on individual-level data (KF-lassocv). We also demonstrate empirically the robustness of our procedures by showing the FDR control when only an estimate of the true covariance matrix $\mathbf{\Sigma}$ is available and when the features are discrete.

4.4.1 Simulations based on real-world genetic data

To mimic the dependency structure among features in real-world applications, we generate synthetic data based on the whole genome sequencing (WGS) data from the Alzheimer’s Disease Sequencing Project (ADSP). The data are obtained from the ADSP consortium following the SNP/Indel Variant Calling Pipeline and data management tool (VCPA) (Leung et al., 2019). The ADSP WGS data record counts of minor alleles of genetic variants over 16,906 individuals. Using reference populations from the 1000 Genomes Consortium (The 1000 Genomes Project Consortium, 2015), we estimate ancestry rates of each individual by SNPWeights v2.1 (Chen et al., 2013) and extract 6,952 individuals with estimated European ancestry rate greater than 80%. We further restrict our simulations to 2,000 randomly selected genetic variants with minor allele frequency (MAF) larger than $0.01$ that lie within 0.5 Mb of the APOE gene (chr19:44909011-45912650; hg38), whose $\varepsilon 2$ and $\varepsilon 4$ alleles are known to be, respectively, the strongest genetic protective factor and the strongest genetic risk factor for Alzheimer’s disease (Serrano-Pozo et al., 2021; Belloy et al., 2023). Since our simulations focus on performance at identifying relevant clusters of tightly linked variants, we simplify the simulation design by pruning variants to eliminate pairs with absolute correlation greater than $0.75$.
To do so, we first compute the correlation matrix $[\text{cor}(X_j,X_k)]_{2000\times 2000}$ of the 2,000 selected variants over the 6,952 extracted individuals using the shrinkage estimate in the R package corpcor (Schäfer and Strimmer, 2005) and apply hierarchical clustering (single-linkage with cutoff value $0.25$) on the distance matrix $[1-|\text{cor}(X_j,X_k)|]_{2000\times 2000}$. As a result, we obtain 512 variant clusters such that the correlation between any pair of variants from different clusters is in $[-0.75,0.75]$. By randomly choosing one representative variant from each cluster, we include $p=512$ tested genetic variants in the simulation study.
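The pruning step above can be sketched as follows. This is a minimal illustration in Python with SciPy, not the code used for the paper (which relies on R and corpcor); the function name `prune_by_clustering` and the toy correlation matrix are hypothetical. With single linkage and cutoff $0.25$ on the distance $1-|\text{cor}|$, any two variants in different clusters are more than $0.25$ apart, so their absolute correlation is below $0.75$.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def prune_by_clustering(corr, cutoff=0.25, seed=0):
    """Cluster variants on the distance 1 - |cor| and keep one variant per cluster."""
    dist = 1.0 - np.abs(corr)
    dist = (dist + dist.T) / 2.0      # enforce exact symmetry
    np.fill_diagonal(dist, 0.0)       # a distance matrix has a zero diagonal
    # Single linkage merges clusters while their closest members are within
    # `cutoff`, so variants in different clusters satisfy |cor| < 1 - cutoff.
    tree = linkage(squareform(dist, checks=False), method="single")
    labels = fcluster(tree, t=cutoff, criterion="distance")
    rng = np.random.default_rng(seed)
    keep = np.array(
        [rng.choice(np.flatnonzero(labels == c)) for c in np.unique(labels)]
    )
    return labels, np.sort(keep)


# Toy stand-in for a genotype correlation matrix: two tightly linked blocks
# of 15 variants each, with weak correlation (0.1) between the blocks.
idx = np.arange(30)
toy_corr = 0.9 ** np.abs(idx[:, None] - idx[None, :])
toy_corr[:15, 15:] = toy_corr[15:, :15] = 0.1
labels, keep = prune_by_clustering(toy_corr)
```

In this toy example each block collapses into a single cluster, so `keep` holds one representative index per block.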

For each replicate, we obtain synthetic data by randomly sampling $n=3{,}000$ individuals without replacement and collecting the sampled individuals’ records on the $p=512$ tested genetic variants as the $n\times p$ covariate matrix $\mathbf{X}$. We further sample another $n=3{,}000$ individuals without replacement as the reference panel on which we compute the correlation matrix $\mathbf{\Sigma}$ using the shrinkage estimate in the R package corpcor (Schäfer and Strimmer, 2005). Based on the covariate matrix $\mathbf{X}$, we generate the response vector $\mathbf{Y}=(Y_1,\ldots,Y_n)^{\top}$ from either the linear model (continuous response),

$$Y_i=\beta_1X_{i1}+\ldots+\beta_pX_{ip}+\epsilon^C_i,\quad\text{where }\epsilon^C_i\sim N(0,3^2),$$

or the mixed-effect logit model (binary response),

$$Y_i\sim\text{Bernoulli}(\mu_i),\quad\text{where }g(\mu_i)=\beta_0+\beta_1X_{i1}+\ldots+\beta_pX_{ip}+\epsilon^B_i,\ \epsilon^B_i\sim N(0,1^2)\text{ and }g(x)=\log\Big(\frac{x}{1-x}\Big).$$

Specifically, $\beta_0$ under the mixed-effect logit model is $-\log(9)$ so that the prevalence (or the expected proportion of $Y_i=1$) is $10\%$. The $\epsilon^C_i$’s and $\epsilon^B_i$’s reflect variation due to unobserved covariates. Only $10$ randomly selected coefficients $\beta_j$ are nonzero, with value $\beta_j=\frac{1}{\sqrt{20\cdot m_j(1-m_j)}}$, where $m_j$ is the MAF of the $j$-th variant.
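The two response models can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions rather than the simulation code behind the reported results: the function `simulate_response` is hypothetical, the toy genotypes are drawn as binomial minor-allele counts, and the columns are centered, which the text above does not specify.

```python
import numpy as np


def simulate_response(X, maf, model="linear", n_causal=10, seed=0):
    """Draw Y from the linear or the mixed-effect logit model described above."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    causal = rng.choice(p, size=n_causal, replace=False)
    beta[causal] = 1.0 / np.sqrt(20.0 * maf[causal] * (1.0 - maf[causal]))
    signal = X @ beta
    if model == "linear":
        y = signal + rng.normal(0.0, 3.0, size=n)      # eps^C ~ N(0, 3^2)
    else:
        eps = rng.normal(0.0, 1.0, size=n)             # eps^B ~ N(0, 1^2)
        logit = -np.log(9.0) + signal + eps            # beta_0 = -log(9)
        mu = 1.0 / (1.0 + np.exp(-logit))              # inverse of g
        y = rng.binomial(1, mu).astype(float)
    return y, beta


# Hypothetical toy genotypes (minor-allele counts); centering is an assumption.
toy_rng = np.random.default_rng(1)
maf = toy_rng.uniform(0.01, 0.5, size=512)
X = toy_rng.binomial(2, maf, size=(3000, 512)).astype(float)
X -= X.mean(axis=0)
y_lin, beta = simulate_response(X, maf, model="linear")
y_bin, _ = simulate_response(X, maf, model="logit")
```

If genotype counts are binomial with two trials and success probability $m_j$, then $\text{Var}(X_{ij})=2m_j(1-m_j)$, so each causal variant contributes $\beta_j^2\cdot 2m_j(1-m_j)=0.1$ to the variance of the linear predictor, regardless of its MAF.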

With the relevant summary statistics computed, we apply GK-pseudolasso and GK-susie-rss and compare their performances with GK-marginal and KF-lassocv.

Over 1000 replicates under both the linear model and the mixed-effect logit model, the average power and FDR of the different methods at different target FDR levels are shown in Figure 3. Under both models, we observe that GK-pseudolasso (with either way of selecting the tuning parameter) and GK-susie-rss are uniformly more powerful than GK-marginal. The performance of the proposed methods is very close to that of KF-lassocv. Despite the covariance matrix being estimated using an independent sample and the entries of $X$ being discrete, the FDRs of our proposed methods are controlled in both settings, suggesting the robustness of our methods.

GhostKnockoffs with discrete features

We note that discrete covariates do not follow a Gaussian distribution. However, the knockoffs procedure ensures FDR control whenever the feature importance statistics take the form $W_j=w(T_j,T_{p+j})$, where $w$ is an anti-symmetric function and $\mathbf{T}\in\mathbb{R}^{2p}$ is distributionally invariant upon swapping $T_j$ with $T_{j+p}$ for each null $j$. Using Lemma 1, we know that Algorithm 4 controls the FDR if swapping the $j$-th entry of $\mathbf{Z}=\mathbf{X}^{\top}\mathbf{Y}$ and the $j$-th entry of $\tilde{\mathbf{Z}}=\mathbf{P}^{\top}\mathbf{X}^{\top}\mathbf{Y}+\lVert\mathbf{Y}\rVert_2\mathbf{Z}$ does not change their joint distribution for each null $j$. In Appendix J, we visually demonstrate the approximate preservation of this distributional invariance. This, along with the robustness of knockoffs (Candès et al., 2018; Barber et al., 2020), helps in explaining why we have not observed FDR inflation with discrete covariates.
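To make the role of anti-symmetry concrete, the following sketch pairs one anti-symmetric choice, $w(a,b)=|a|-|b|$, with the knockoff+ threshold of Barber and Candès (2015). Both choices are assumptions made for illustration; this section does not spell out the statistic or the threshold, and the function names are hypothetical.

```python
import numpy as np


def importance_statistics(T):
    """W_j = |T_j| - |T_{p+j}|; swapping T_j and T_{p+j} flips the sign of W_j."""
    p = len(T) // 2
    return np.abs(T[:p]) - np.abs(T[p:])


def knockoff_plus_select(W, q):
    """Select {j : W_j >= tau}, with tau the knockoff+ threshold at level q."""
    W = np.asarray(W, dtype=float)
    for t in np.sort(np.abs(W[W != 0])):
        # Estimated false discovery proportion at candidate threshold t.
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return np.flatnonzero(W >= t)
    return np.array([], dtype=int)


W_example = np.array([5.0, 4.0, 3.0, 2.5, 2.0, -1.0, 1.5, 0.5, -0.3, 6.0])
selected = knockoff_plus_select(W_example, q=0.2)
```

For a null $j$, the swap invariance of $\mathbf{T}$ makes the sign of $W_j$ a fair coin flip, which is what allows the count of large negative $W_j$ to serve as an estimate of the number of false positives among large positive ones.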

Figure 3: Average power and FDR over 1000 replications with respect to different target FDR levels in simulations based on genetic data, where features are genotypes of existing patients, and the response is simulated from a linear model (continuous response) or a mixed-effect logit model (binary response).

4.4.2 Independent features

We revisit the setting from Section 3.5.1 in which $\Sigma=I_p$. For the pseudo-sum method for GK-pseudolasso, we optimize over $\lambda$ using a grid of 100 candidate values interpolating between $\lambda_{\text{max}}$ and $\lambda_{\text{max}}/1000$ linearly in log scale, where

$$\lambda_{\text{max}}=\frac{1}{n}\mathbb{E}\left[\left\|\begin{bmatrix}\mathbf{X}^{\top}\mathbf{Y}\\ \mathbf{P}^{\top}\mathbf{X}^{\top}\mathbf{Y}+\lVert\mathbf{Y}\rVert_2\mathbf{Z}\end{bmatrix}\right\|_{\infty}\right]$$

is the minimal $\lambda$ value that shrinks all the coefficients to zero. To calculate $\mathbb{E}[\lVert\mathbf{R}^{\top}\bm{\epsilon}\rVert_{\infty}]$ for the lasso-min parameter method, we use a Monte Carlo estimate averaged over 200 samples. The target FDR is 20%. Each point represents an average over 200 replications.
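The candidate grid described above is a geometric sequence. A minimal sketch, assuming $\lambda_{\text{max}}$ has already been computed (the value `0.5` below is an arbitrary placeholder, and `lambda_grid` is a hypothetical helper name):

```python
import numpy as np


def lambda_grid(lam_max, n_values=100, ratio=1000.0):
    """Values from lam_max down to lam_max / ratio, evenly spaced in log scale."""
    return np.geomspace(lam_max, lam_max / ratio, num=n_values)


grid = lambda_grid(lam_max=0.5)  # 0.5 is an arbitrary illustrative value
```

Consecutive grid values share a constant ratio, here $1000^{-1/99}$, which is what "linearly in log scale" amounts to.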

Note that when $\mathbf{\Sigma}=\mathbf{I}_p$, the solution to (15) is $\mathbf{D}=\mathbf{I}_p$. It is easy to see that (11) gives

$$\hat{\bm{\beta}}=\frac{1}{n}S_{\lambda}\left(\begin{bmatrix}\mathbf{X}&\widetilde{\mathbf{X}}\end{bmatrix}^{\top}\mathbf{Y}\right),$$

where the soft-threshold operator $S_{\lambda}(x)=\text{sign}(x)(\lvert x\rvert-\lambda)_{+}$ is applied coordinate-wise. Therefore, the method in Section 4.2 soft-thresholds the marginal correlation of $\mathbf{X}$ and $\mathbf{Y}$.
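The closed form above is short enough to write out directly. The sketch below is illustrative only: the helper names are hypothetical, and the toy knockoff matrix is simply an independent Gaussian draw, which is a valid knockoff copy only in the independent Gaussian setting with $\mathbf{D}=\mathbf{I}_p$ considered here.

```python
import numpy as np


def soft_threshold(x, lam):
    """S_lam(x) = sign(x) * (|x| - lam)_+, applied coordinate-wise."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)


def thresholded_coefficients(X, X_knockoff, y, lam):
    """beta_hat = (1/n) * S_lam([X, X_knockoff]^T y), as in the display above."""
    n = X.shape[0]
    scores = np.concatenate([X, X_knockoff], axis=1).T @ y
    return soft_threshold(scores, lam) / n


rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.standard_normal((n, p))
X_knockoff = rng.standard_normal((n, p))      # independent copy (Sigma = I_p)
y = 2.0 * X[:, 0] + rng.standard_normal(n)    # only the first feature matters
beta_hat = thresholded_coefficients(X, X_knockoff, y, lam=50.0)
```

The first $p$ entries of `beta_hat` correspond to the original features and the last $p$ to their knockoffs.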

Figure 4: Power and FDR plots for independent features and a Gaussian linear model with varying dimensions. Each point is an average over 200 replications.

As shown in Figure 4, all three new methods (GK-pseudolasso with lasso-min/pseudo-sum and GK-susie-rss) consistently outperform GK-marginal, and the FDR is always controlled at the expected level, as theoretically guaranteed. As $n/p$ grows, we see that the three new methods have power closer to KF-lassocv. This is further demonstrated in additional simulations in Appendix I.

4.4.3 AR(1) features

Figure 5 shows the corresponding plots when the covariate matrix is generated from an AR(1) distribution. We found similar patterns to those with independent features. The power of all methods drops when the autocorrelation coefficient increases, as it is then harder to separate true signals from other variables.
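For readers who wish to reproduce a setting of this kind, AR(1) features can be generated as in the sketch below. It assumes the usual parameterization $\Sigma_{jk}=\rho^{|j-k|}$ with unit variances; the exact design (values of $\rho$, dimensions, signal strength) is specified elsewhere in the paper and is not restated here, so the numbers in the example are placeholders.

```python
import numpy as np


def ar1_covariance(p, rho):
    """Covariance matrix with entries rho^{|j-k|}."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])


def sample_ar1_features(n, p, rho, seed=0):
    """Rows are i.i.d. N(0, Sigma) with Sigma = ar1_covariance(p, rho)."""
    rng = np.random.default_rng(seed)
    innovations = rng.standard_normal((n, p))
    X = np.empty((n, p))
    X[:, 0] = innovations[:, 0]
    for j in range(1, p):
        # The sqrt(1 - rho^2) scaling keeps every column at unit variance.
        X[:, j] = rho * X[:, j - 1] + np.sqrt(1.0 - rho**2) * innovations[:, j]
    return X


Sigma = ar1_covariance(p=10, rho=0.5)
X = sample_ar1_features(n=20000, p=10, rho=0.5)
```

A larger $\rho$ makes neighboring columns harder to tell apart, in line with the drop in power noted above.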

Figure 5: Power and FDR plots for AR(1) features and a Gaussian linear model with varying dimensions. Each point is an average over 200 replications.

5 Application to meta-analysis for Alzheimer’s disease

To illustrate the empirical performance of the methods in detecting genetic variants associated with Alzheimer’s disease (AD), we apply them to a meta-analysis of nine large-scale array-based genome-wide association and whole-exome/-genome sequencing studies for AD. We include the details of the nine studies in Appendix K.

As all studies share the same focus on individuals with European ancestry, we perform a meta-analysis by aggregating their $Z$-scores and obtain the meta-analysis $Z$-score $\mathbf{Z}_{\text{meta}}$ (see Appendix L for details). In addition, we obtain the block-diagonal covariance matrix $\mathbf{\Sigma}$ with respect to approximately independent linkage disequilibrium blocks provided by Berisa and Pickrell (2016). Within each block, we use the UK Biobank directly genotyped data as the reference panel and compute the covariance matrix via the Pan-UKB consortium (https://pan.ukbb.broadinstitute.org) with details in Appendix M. To improve the power in the presence of tightly linked variants, we apply the group knockoffs construction on top of the GhostKnockoff algorithm, as detailed in Section 4.3.2. Finally, we implement GK-pseudolasso with tuning parameter chosen by the lasso-min method on the meta-analysis $Z$-score $\mathbf{Z}_{\text{meta}}$ and the covariance matrix $\mathbf{\Sigma}$. To stabilize the GhostKnockoffs procedures, we use $M=5$ multi-knockoffs as defined in Section 4.3.1.

Figure 6: Graphical representation of the feature importance statistics after applying GK-pseudolasso to a meta-analysis of AD. Each point represents a group of genetic variants. With a target FDR level of 0.1, identified groups are highlighted in blue or purple. For each locus with at least one identified group, the name of the locus is presented at the variant group with the largest importance statistic (highlighted in purple). Variant density is shown at the bottom of the plot (number of variants per 1 Mb).

Figure 6 presents the result of the meta-analysis of the nine studies via our proposed method with target FDR level 0.1. Here, we specify loci based on variant groups and annotate two loci as different loci if they are at least 1 Mb away from each other. We adopt the most proximal gene’s name as the locus name. (Specifically, we consider the variant group with the largest group knockoff feature importance statistic within a locus, and then map the locus to the most proximal gene of the variant within the group that has the highest knockoff importance score.) As shown by Table 1 in Appendix N, GK-pseudolasso identifies variant groups in 42 and 63 loci when the target FDR level is 0.1 and 0.2, respectively, substantially more than GK-marginal (10 and 17 when the target FDR level is 0.1 and 0.2, respectively). This is consistent with our simulation results in Section 4.4. In addition, we observe from Table 1 that GK-susie-rss identifies fewer loci (35 and 47 when the target FDR level is 0.1 and 0.2, respectively), although it exhibits similar power in simulation studies. In Appendix O, we analogously visualize results of the meta-analysis via the conventional marginal association test (with $p$-value cutoff $5\times 10^{-8}$), GK-marginal (with target FDR level 0.10), and GK-susie-rss (with target FDR level 0.10).
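One way to turn identified variant groups into loci under the 1 Mb rule is sketched below. It is a hypothetical illustration, not the procedure used to produce Table 1: it assumes each group is summarized by a single chromosome and base-pair position (for instance, that of its top variant), and the coordinates in the example are made up.

```python
def assign_loci(groups, min_gap=1_000_000):
    """Give each (chromosome, position) pair a locus index.

    Groups on the same chromosome whose positions differ by less than
    `min_gap` from the previous group (in sorted order) share a locus.
    """
    order = sorted(range(len(groups)), key=lambda i: groups[i])
    loci = [0] * len(groups)
    locus = -1
    previous = None
    for i in order:
        chrom, pos = groups[i]
        new_locus = (
            previous is None
            or chrom != previous[0]
            or pos - previous[1] >= min_gap
        )
        if new_locus:
            locus += 1
        loci[i] = locus
        previous = (chrom, pos)
    return loci


# Made-up coordinates, for illustration only.
example_groups = [(19, 44_900_000), (19, 45_300_000), (19, 47_000_000), (1, 1_000_000)]
loci = assign_loci(example_groups)
```

In this example the first two groups on chromosome 19 are 0.4 Mb apart and share a locus, while the third is 1.7 Mb further along and starts a new one.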

Table 2 in Appendix N shows the top variant with the largest feature importance statistic in each identified group. Most discoveries exhibit relatively strong marginal associations (marginal $p$-value $\leq 0.05$) in individual studies and the same direction of effects across all studies. Although some loci have an opposite direction of effect in one individual study, such effects are not significant. The consistency across individual studies supports the validity of the proposed method in discovering putative causal variants. In addition, we observe that all top variants of identified groups have small meta-analysis $p$-values (less than 0.05), though some are not smaller than the stringent genome-wide threshold ($5\times 10^{-8}$) in marginal association tests with FWER control.

To further investigate whether the identified groups are functionally enriched, we apply a SNP-to-gene linking strategy proposed by Gazal et al. (2022) to link the top variants of identified groups to the genes that they potentially regulate. Out of 63 top variants, we find that 34 (54.0%) can be mapped with functional evidence (e.g., being an expression quantitative trait locus, in a Hi-C linked enhancer region, near the exon of a gene, etc.), a proportion significantly higher than the average percentage for the background genome (28.6%). In summary, the proposed method can identify functional genetic variants with weaker statistical effects missed by conventional association tests.
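The arithmetic behind this comparison can be checked with an exact binomial tail probability. The text does not say which test underlies "significantly higher", so the one-sided binomial test below, which treats 28.6% as a fixed background rate, is only one plausible way to quantify the enrichment.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


proportion = 34 / 63                               # share of mapped top variants
p_value = binomial_upper_tail(34, 63, 0.286)       # background rate 28.6%
expected_under_background = 63 * 0.286             # about 18 variants
```

Under the background rate one would expect about 18 of the 63 top variants to be mapped, compared with the 34 observed.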

6 Discussion

This paper introduced novel approaches for performing variable selection with FDR control on the basis of summary statistics. We proposed methods for testing conditional independence hypotheses from summary statistics alone. For the methods from Section 4, all we need are essentially the marginal correlations between $X$ and $Y$ (along with $\lVert\mathbf{Y}\rVert^2$ and $n$), which, at first sight, may appear surprising. Our arguments rely on the assumption that the covariates follow a Gaussian distribution, as well as on the linearity and rotational invariance of Gaussian distributions. Since our methods are based on the knockoffs procedure, they do not require any knowledge about the model of $Y$ given $X$. Our methods extend, and generally give better power than, the work by He et al. (2022) by employing penalized regression to produce the measure of feature importance. The techniques employed in this paper provide a wrapper that can be combined with a variety of feature selection methods, yielding knockoffs versions that guarantee FDR control.

We applied our methods to genetic studies, in which summary statistics are typically available. Due to linkage disequilibrium, the application of our methods to individual genetic variants may yield conservative results. In parallel work (Chu et al., 2023), we have developed tools for constructing group knockoffs efficiently and effectively. When combined, our methods offer a powerful new approach to controlled variable selection in GWAS. This is further supported in our companion work (He et al., 2023), where the methods in this paper led to significant scientific discoveries.

7 Acknowledgement

Z.C. would like to thank Kevin Guo and Amber Hu for helpful discussions. Z.C. was supported by the Simons Foundation under award 814641. Z.H. was supported by NIH/NIA award AG066206 and AG066515. T.M. was supported by a B.C. and E.J. Eaves Stanford Graduate Fellowship. C.S. was supported by the grants NIH R56HG010812 and NSF DMS2210392. E.J.C. was supported by the Office of Naval Research grant N00014-20-1-2157.

References

  • Barber and Candès (2015) R. F. Barber and E. J. Candès. Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5):2055 – 2085, 2015. URL https://doi.org/10.1214/15-AOS1337.
  • Barber et al. (2020) R. F. Barber, E. J. Candès, and R. J. Samworth. Robust inference with knockoffs. 2020.
  • Bates et al. (2020) S. Bates, M. Sesia, C. Sabatti, and E. Candès. Causal inference in genetic trio studies. Proceedings of the National Academy of Sciences, 117(39):24117–24126, 2020.
  • Belloni et al. (2011) A. Belloni, V. Chernozhukov, and L. Wang. Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791–806, 2011.
  • Belloy et al. (2022a) M. E. Belloy, S. J. Eger, Y. Le Guen, V. Damotte, S. Ahmad, M. A. Ikram, A. Ramirez, A. C. Tsolaki, G. Rossi, I. E. Jansen, et al. Challenges at the APOE locus: a robust quality control approach for accurate APOE genotyping. Alzheimer’s Research & Therapy, 14:22, 2022a.
  • Belloy et al. (2022b) M. E. Belloy, Y. Le Guen, S. J. Eger, V. Napolioni, M. D. Greicius, and Z. He. A Fast and Robust Strategy to Remove Variant-Level Artifacts in Alzheimer Disease Sequencing Project Data. Neurology Genetics, 8(5):e200012, 2022b.
  • Belloy et al. (2023) M. E. Belloy, S. J. Andrews, Y. Le Guen, M. Cuccaro, L. A. Farrer, V. Napolioni, and M. D. Greicius. APOE Genotype and Alzheimer Disease Risk Across Age, Sex, and Population Ancestry. JAMA Neurology, 80(12):1284–1294, 2023.
  • Benjamini and Hochberg (1995) Y. Benjamini and Y. Hochberg. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289–300, 1995.
  • Benjamini and Yekutieli (2001) Y. Benjamini and D. Yekutieli. The control of the false discovery rate in multiple testing under dependency. Annals of statistics, pages 1165–1188, 2001.
  • Berisa and Pickrell (2016) T. Berisa and J. K. Pickrell. Approximately independent linkage disequilibrium blocks in human populations. Bioinformatics, 32(2):283–285, 2016.
  • Bis et al. (2020) J. C. Bis, X. Jian, B. W. Kunkle, Y. Chen, K. L. Hamilton-Nelson, W. S. Bush, W. J. Salerno, D. Lancour, Y. Ma, A. E. Renton, et al. Whole exome sequencing study identifies novel rare and common Alzheimer’s-Associated variants involved in immune response and transcriptional regulation. Molecular psychiatry, 25:1859–1875, 2020.
  • Boyd and Vandenberghe (2004) S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
  • Candès et al. (2018) E. Candès, Y. Fan, L. Janson, and J. Lv. Panning for Gold: ‘Model-X’ Knockoffs for High Dimensional Controlled Variable Selection. Journal of the Royal Statistical Society Series B: Statistical Methodology, 80(3):551–577, 2018.
  • Chen et al. (2013) C.-Y. Chen, S. Pollack, D. J. Hunter, J. N. Hirschhorn, P. Kraft, and A. L. Price. Improved ancestry inference using weights from external reference panels. Bioinformatics, 29(11):1399–1406, 2013.
  • Chu et al. (2023) B. B. Chu, J. Gu, Z. Chen, T. Morrison, E. Candès, Z. He, and C. Sabatti. Second-order group knockoffs with applications to GWAS. arXiv preprint arXiv:2310.15069, 2023.
  • Dai and Barber (2016) R. Dai and R. Barber. The knockoff filter for FDR control in group-sparse and multitask regression. In Proceedings of The 33rd International Conference on Machine Learning, volume 48, pages 1851–1859. PMLR, 2016.
  • Dicker (2014) L. H. Dicker. Variance estimation in high-dimensional linear models. Biometrika, 101(2):269–284, 2014.
  • Gazal et al. (2022) S. Gazal, O. Weissbrod, F. Hormozdiari, K. K. Dey, J. Nasser, K. A. Jagadeesh, D. J. Weiner, H. Shi, C. P. Fulco, L. J. O’Connor, et al. Combining SNP-to-gene linking strategies to identify disease genes and assess disease omnigenicity. Nature Genetics, 54:827–836, 2022.
  • Gimenez and Zou (2019) J. R. Gimenez and J. Zou. Improving the Stability of the Knockoff Procedure: Multiple Simultaneous Knockoffs and Entropy Maximization. In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, volume 89, pages 2184–2192. PMLR, 2019.
  • He et al. (2021) Z. He, L. Liu, C. Wang, Y. Le Guen, J. Lee, S. Gogarten, F. Lu, S. Montgomery, H. Tang, E. K. Silverman, et al. Identification of putative causal loci in whole-genome sequencing data via knockoff statistics. Nature Communications, 12:3152, 2021.
  • He et al. (2022) Z. He, L. Liu, M. E. Belloy, Y. Le Guen, A. Sossin, X. Liu, X. Qi, S. Ma, P. K. Gyawali, T. Wyss-Coray, et al. Ghostknockoff inference empowers identification of putative causal variants in genome-wide association studies. Nature Communications, 13:7209, 2022.
  • He et al. (2023) Z. He et al. In silico identification of putative causal genetic variants. 2023.
  • Huang et al. (2017) K.-l. Huang, E. Marcora, A. A. Pimenova, A. F. Di Narzo, M. Kapoor, S. C. Jin, O. Harari, S. Bertelsen, B. P. Fairfax, J. Czajkowski, et al. A common haplotype lowers PU.1 expression in myeloid cells and delays onset of Alzheimer’s disease. Nature Neuroscience, 20:1052–1061, 2017.
  • Jansen et al. (2019) I. E. Jansen, J. E. Savage, K. Watanabe, J. Bryois, D. M. Williams, S. Steinberg, J. Sealock, I. K. Karlsson, S. Hägg, L. Athanasiu, et al. Genome-wide meta-analysis identifies new loci and functional pathways influencing Alzheimer’s disease risk. Nature Genetics, 51:404–413, 2019.
  • Kunkle et al. (2019) B. W. Kunkle, B. Grenier-Boley, R. Sims, J. C. Bis, V. Damotte, A. C. Naj, A. Boland, M. Vronskaya, S. J. Van Der Lee, A. Amlie-Wolf, et al. Genetic meta-analysis of diagnosed Alzheimer’s disease identifies new risk loci and implicates A$\beta$, tau, immunity and lipid processing. Nature Genetics, 51:414–430, 2019.
  • Le Guen et al. (2021) Y. Le Guen, M. E. Belloy, V. Napolioni, S. J. Eger, G. Kennedy, R. Tao, Z. He, and M. D. Greicius. A novel age-informed approach for genetic association analysis in Alzheimer’s disease. Alzheimer’s Research & Therapy, 13:72, 2021.
  • Leung et al. (2019) Y. Y. Leung, O. Valladares, Y.-F. Chou, H.-J. Lin, A. B. Kuzma, L. Cantwell, L. Qu, P. Gangadharan, W. J. Salerno, G. D. Schellenberg, et al. VCPA: genomic variant calling pipeline and data management tool for Alzheimer’s Disease Sequencing Project. Bioinformatics, 35(10):1768–1770, 2019.
  • Li and Candès (2021) S. Li and E. J. Candès. Deploying the Conditional Randomization Test in High Multiplicity Problems. arXiv preprint arXiv:2110.02422, 2021.
  • Mak et al. (2017) T. S. H. Mak, R. M. Porsch, S. W. Choi, X. Zhou, and P. C. Sham. Polygenic scores via penalized regression on summary statistics. Genetic Epidemiology, 41:469–480, 2017.
  • Pasaniuc and Price (2017) B. Pasaniuc and A. L. Price. Dissecting the genetics of complex traits using summary association statistics. Nature Reviews Genetics, 18:117–127, 2017.
  • Qian et al. (2020) J. Qian, Y. Tanigawa, W. Du, M. Aguirre, C. Chang, R. Tibshirani, M. A. Rivas, and T. Hastie. A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK Biobank. PLoS Genetics, 16(10):e1009141, 2020.
  • Schäfer and Strimmer (2005) J. Schäfer and K. Strimmer. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology, 4:32, 2005.
  • Schwartzentruber et al. (2021) J. Schwartzentruber, S. Cooper, J. Z. Liu, I. Barrio-Hernandez, E. Bello, N. Kumasaka, A. M. Young, R. J. Franklin, T. Johnson, K. Estrada, et al. Genome-wide meta-analysis, fine-mapping and integrative prioritization implicate new Alzheimer’s disease risk genes. Nature Genetics, 53:392–402, 2021.
  • Serrano-Pozo et al. (2021) A. Serrano-Pozo, S. Das, and B. T. Hyman. APOE and Alzheimer’s disease: advances in genetics, pathophysiology, and therapeutic approaches. The Lancet Neurology, 20(1):68–80, 2021.
  • Sesia et al. (2021) M. Sesia, S. Bates, E. Candès, J. Marchini, and C. Sabatti. False discovery rate control in genome-wide association studies with population structure. Proceedings of the National Academy of Sciences, 118(40):e2105841118, 2021.
  • Spector and Janson (2022) A. Spector and L. Janson. Powerful knockoffs via minimizing reconstructability. The Annals of Statistics, 50(1):252–276, 2022.
  • The 1000 Genomes Project Consortium (2015) The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature, 526:68–74, 2015.
  • Tian et al. (2018) X. Tian, J. R. Loftus, and J. E. Taylor. Selective inference with unknown variance via the square-root lasso. Biometrika, 105(4):755–768, 2018.
  • Tibshirani et al. (2012) R. Tibshirani, J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor, and R. J. Tibshirani. Strong Rules for Discarding Predictors in Lasso-Type Problems. Journal of the Royal Statistical Society Series B: Statistical Methodology, 74(2):245–266, 2012.
  • Wang et al. (2020) G. Wang, A. Sarkar, P. Carbonetto, and M. Stephens. A Simple New Approach to Variable Selection in Regression, with Application to Genetic Fine Mapping. Journal of the Royal Statistical Society Series B: Statistical Methodology, 82(5):1273–1300, 2020.
  • Wang and Janson (2021) W. Wang and L. Janson. A high-dimensional power analysis of the conditional randomization test and knockoffs. Biometrika, 109(3):631–645, 2021.
  • Weinstein et al. (2020) A. Weinstein, W. J. Su, M. Bogdan, R. F. Barber, and E. J. Candès. A Power Analysis for Model-X Knockoffs with $\ell_p$-Regularized Statistics. arXiv preprint arXiv:2007.15346, 2020.
  • Willer et al. (2010) C. J. Willer, Y. Li, and G. R. Abecasis. METAL: fast and efficient meta-analysis of genomewide association scans. Bioinformatics, 26(17):2190–2191, 2010.
  • Witten and Tibshirani (2009) D. M. Witten and R. Tibshirani. Covariance-regularized regression and classification for high dimensional problems. Journal of the Royal Statistical Society Series B: Statistical Methodology, 71(3):615–636, 2009.
  • Zhang et al. (2021) Q. Zhang, F. Privé, B. Vilhjálmsson, and D. Speed. Improved genetic prediction of complex traits from individual-level data or summary statistics. Nature Communications, 12:4192, 2021.
  • Zou et al. (2022) Y. Zou, P. Carbonetto, G. Wang, and M. Stephens. Fine-mapping from summary data with the “Sum of Single Effects” model. PLoS Genetics, 18(7):e1010299, 2022.

Appendix A Computation of free parameters $\mathbf{s}$

In this paper, we use the semidefinite program (SDP) construction of second-order knockoffs Candès et al. [2018]. Without loss of generality, we assume that columns of the data matrix 𝐗𝐗\mathbf{X}bold_X have been standardized with mean 0 and variance 1 such that diagonal entries 𝚺𝚺\mathbf{\Sigma}bold_Σ are 1. As a result, 𝐬𝐬\mathbf{s}bold_s is the solution of the convex optimization problem.

\[
\begin{aligned}
\text{minimize} \quad & \sum_{j=1}^{p} |1 - s_j| \\
\text{subject to} \quad & s_j \geq 0, \quad 1 \leq j \leq p, \\
& \mathrm{diag}\{\mathbf{s}\} \preceq 2\mathbf{\Sigma}.
\end{aligned}
\tag{15}
\]
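The constraints of (15) can be checked numerically for any candidate $\mathbf{s}$. The sketch below is only an illustration and does not solve the SDP: it evaluates the objective and the two constraints, and uses the equicorrelated choice $s_j = \min(1, 2\lambda_{\min}(\mathbf{Σ}))$, a standard feasible but generally suboptimal point, as a stand-in. The function names and the toy AR(1)-type correlation matrix are assumptions made for this example.

```python
import numpy as np

def equicorrelated_s(Sigma):
    """Feasible (generally suboptimal) point for (15): s_j = min(1, 2 * lambda_min(Sigma))."""
    lam_min = np.linalg.eigvalsh(Sigma)[0]  # eigvalsh returns eigenvalues in ascending order
    return np.full(Sigma.shape[0], min(1.0, 2.0 * lam_min))

def is_feasible(s, Sigma, tol=1e-10):
    """Check s_j >= 0 and diag(s) <= 2 * Sigma in the positive semidefinite order."""
    gap = 2.0 * Sigma - np.diag(s)
    return bool(np.all(s >= -tol) and np.linalg.eigvalsh(gap)[0] >= -tol)

def sdp_objective(s):
    """Objective of (15): sum_j |1 - s_j|."""
    return float(np.sum(np.abs(1.0 - s)))

# Toy correlation matrix with entries rho^|i-j| (hypothetical example data).
p, rho = 5, 0.6
idx = np.arange(p)
Sigma = rho ** np.abs(np.subtract.outer(idx, idx))

s = equicorrelated_s(Sigma)
print("feasible:", is_feasible(s, Sigma), "objective:", sdp_objective(s))
```

An SDP solver would search over all feasible $\mathbf{s}$ for the smallest objective value; the equicorrelated point only provides an upper bound on that optimum.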

Other methods to compute $\mathbf{s}$ include the minimum variance-based reconstructability (MVR) construction [Spector and Janson, 2022] and the maximum entropy (ME) construction [Gimenez and Zou, 2019, Spector and Janson, 2022], both of which are compatible with the methods in this paper.

Appendix B Equivalence of GhostKnockoffs and the Gaussian knockoff sampler in sampling the knockoff $Z$-score $\widetilde{\mathbf{Z}}_s$

In this section, we summarize the proof of He et al. [2022] that $\widetilde{\mathbf{Z}}_s$ computed by (6) satisfies (5).

Lemma 2.

[He et al., 2022] For any $\mathbf{P}$ and $\mathbf{V}$ computed in step 3 of Algorithm 1, we have

\[
\widetilde{\mathbf{Z}}_s \mid \mathbf{X}, \mathbf{Y} \stackrel{d}{=} \widetilde{\mathbf{X}}^{\top}\mathbf{Y} \mid \mathbf{X}, \mathbf{Y},
\]

where $\widetilde{\mathbf{Z}}_s$ is computed by (6) and $\widetilde{\mathbf{X}}$ is the output of Algorithm 1.

Proof.

By step 5 of Algorithm 1, we have $\widetilde{\mathbf{X}} = \mathbf{X}\mathbf{P} + \mathbf{E}\mathbf{V}^{1/2}$, where $\mathbf{E}$ is an $n$ by $p$ matrix with i.i.d. standard Gaussian entries, independent of $\mathbf{X}$ and $\mathbf{Y}$. Therefore,

\[
\widetilde{\mathbf{X}}^{\top}\mathbf{Y} \mid \mathbf{X}, \mathbf{Y} = \mathbf{P}^{\top}\mathbf{X}^{\top}\mathbf{Y} + \mathbf{V}^{1/2}\mathbf{E}^{\top}\mathbf{Y} \mid \mathbf{X}, \mathbf{Y}.
\]

Because $\mathbf{E}^{\top}\mathbf{Y} \mid \mathbf{X}, \mathbf{Y} \sim \mathcal{N}(\mathbf{0}, \|\mathbf{Y}\|_2^2\,\mathbf{I}_p)$, we have

\[
\mathbf{E}^{\top}\mathbf{Y} \mid \mathbf{X}, \mathbf{Y} \stackrel{d}{=} \|\mathbf{Y}\|_2\,\mathbf{S} \mid \mathbf{X}, \mathbf{Y}, \quad \text{where } \mathbf{S} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_p) \text{ is independent of } \mathbf{X} \text{ and } \mathbf{Y}.
\]

Thus, we have

\[
\begin{aligned}
\widetilde{\mathbf{X}}^{\top}\mathbf{Y} \mid \mathbf{X}, \mathbf{Y}
&\stackrel{d}{=} \mathbf{P}^{\top}\mathbf{X}^{\top}\mathbf{Y} + \|\mathbf{Y}\|_2\,\mathbf{V}^{1/2}\mathbf{S} \mid \mathbf{X}, \mathbf{Y} \\
&\stackrel{d}{=} \mathbf{P}^{\top}\mathbf{X}^{\top}\mathbf{Y} + \|\mathbf{Y}\|_2\,\mathbf{Z} \mid \mathbf{X}, \mathbf{Y}, \quad \text{where } \mathbf{Z} \sim \mathcal{N}(\mathbf{0}, \mathbf{V}) \text{ is independent of } \mathbf{X} \text{ and } \mathbf{Y}, \\
&= \widetilde{\mathbf{Z}}_s \mid \mathbf{X}, \mathbf{Y}.
\end{aligned}
\]
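The distributional identity can be checked by simulation. The sketch below is an illustration under stated assumptions rather than the authors' implementation: it takes $\mathbf{P} = \mathbf{I} - \mathbf{Σ}^{-1}\mathbf{D}$ and $\mathbf{V} = 2\mathbf{D} - \mathbf{D}\mathbf{Σ}^{-1}\mathbf{D}$ with $\mathbf{D} = \mathrm{diag}\{\mathbf{s}\}$ (the usual Gaussian knockoff formulas, assumed here to match step 3 of Algorithm 1), uses a toy $\mathbf{Σ}$ and $\mathbf{s}$, and compares the conditional mean and covariance of $\widetilde{\mathbf{X}}^{\top}\mathbf{Y}$ with those of $\mathbf{P}^{\top}\mathbf{X}^{\top}\mathbf{Y} + \|\mathbf{Y}\|_2\mathbf{Z}$ for fixed $\mathbf{X}$ and $\mathbf{Y}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, draws = 40, 3, 20000

# Toy inputs (assumptions for illustration only).
Sigma = np.array([[1.0, 0.3, 0.1],
                  [0.3, 1.0, 0.3],
                  [0.1, 0.3, 1.0]])
D = np.diag(np.full(p, 0.5))            # diag{s} with s_j = 0.5
Sigma_inv = np.linalg.inv(Sigma)
P = np.eye(p) - Sigma_inv @ D           # assumed form of P
V = 2.0 * D - D @ Sigma_inv @ D         # assumed form of V (PSD for this s)

# Symmetric square root of V via its eigen-decomposition.
w, U = np.linalg.eigh(V)
V_half = (U * np.sqrt(np.clip(w, 0.0, None))) @ U.T

# Fixed data: everything below is conditional on X and Y.
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)
center = P.T @ X.T @ Y
y_norm = np.linalg.norm(Y)

# Direct route: X_tilde^T Y = P^T X^T Y + V^{1/2} E^T Y with fresh noise E.
E = rng.standard_normal((draws, n, p))
EtY = np.einsum("dnp,n->dp", E, Y)      # each row is E^T Y for one draw
direct = center + EtY @ V_half          # V_half is symmetric

# GhostKnockoffs route: P^T X^T Y + ||Y||_2 Z with Z ~ N(0, V).
Z = rng.multivariate_normal(np.zeros(p), V, size=draws)
ghost = center + y_norm * Z

target_cov = y_norm**2 * V
print(np.abs(np.cov(direct, rowvar=False) - target_cov).max())
print(np.abs(np.cov(ghost, rowvar=False) - target_cov).max())
```

Both empirical covariances should be close to $\|\mathbf{Y}\|_2^2\mathbf{V}$ and both empirical means close to $\mathbf{P}^{\top}\mathbf{X}^{\top}\mathbf{Y}$, up to Monte Carlo error.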

Appendix C Proof of Proposition 1

To prove Proposition 1, we need to first prove Lemma 3.

Lemma 3.

Let $\mathbf{Z}_1$ and $\mathbf{Z}_2$ be two real $n$ by $p$ matrices. For any $n$ and $p$, if $\mathbf{Z}_1^{\top}\mathbf{Z}_1 = \mathbf{Z}_2^{\top}\mathbf{Z}_2$, there must exist an orthogonal matrix $\mathbf{Q} \in \mathbb{R}^{n \times n}$ such that $\mathbf{Z}_1 = \mathbf{Q}\mathbf{Z}_2$.

Proof.

Suppose $\mathbf{Z}_1^{\top}\mathbf{Z}_1 = \mathbf{Z}_2^{\top}\mathbf{Z}_2 = \mathbf{U}\bm{\Lambda}\mathbf{U}^{\top}$, where $\mathbf{U} \in \mathbb{R}^{p \times r}$ is a matrix with orthonormal columns such that $\mathbf{U}^{\top}\mathbf{U} = \mathbf{I}_r$, $\bm{\Lambda} \in \mathbb{R}^{r \times r}$ is diagonal with positive entries, and $r$ is the rank of $\mathbf{Z}_1^{\top}\mathbf{Z}_1$. In other words, we perform the eigen-decomposition of $\mathbf{Z}_1^{\top}\mathbf{Z}_1 = \mathbf{Z}_2^{\top}\mathbf{Z}_2$ and remove all zero eigenvalues and their corresponding eigenvectors.
Note that $\mathbf{U}\mathbf{U}^{\top}$ is a projection matrix that projects any vector onto $\textit{colspace}(\mathbf{U})$, the column space of $\mathbf{U}$.

It is clear that

\[
\textit{colspace}(\mathbf{U}\bm{\Lambda}\mathbf{U}^{\top}) \subseteq \textit{colspace}(\mathbf{U}).
\]

Because $\mathbf{U} = (\mathbf{U}\bm{\Lambda}\mathbf{U}^{\top})\mathbf{U}\bm{\Lambda}^{-1}$, we also have

\[
\textit{colspace}(\mathbf{U}) \subseteq \textit{colspace}(\mathbf{U}\bm{\Lambda}\mathbf{U}^{\top}).
\]

As a result, we have $\textit{colspace}(\mathbf{U}\bm{\Lambda}\mathbf{U}^{\top}) = \textit{colspace}(\mathbf{U})$.

Thus, for $k = 1, 2$, $\mathbf{U}\mathbf{U}^{\top}$ is a projection matrix that projects any vector onto the column space of $\mathbf{U}\bm{\Lambda}\mathbf{U}^{\top} = \mathbf{Z}_k^{\top}\mathbf{Z}_k$. Because $\textit{colspace}(\mathbf{Z}_k^{\top}\mathbf{Z}_k) = \textit{rowspace}(\mathbf{Z}_k)$, we have

\[
\mathbf{Z}_k = \mathbf{Z}_k\mathbf{U}\mathbf{U}^{\top} = \mathbf{Z}_k\mathbf{U}\bm{\Lambda}^{-1/2}\bm{\Lambda}^{1/2}\mathbf{U}^{\top}.
\]

Letting $\mathbf{Q}_k = \mathbf{Z}_k\mathbf{U}\bm{\Lambda}^{-1/2}$, we have $\mathbf{Z}_k = \mathbf{Q}_k\bm{\Lambda}^{1/2}\mathbf{U}^{\top}$ and

\[
\mathbf{Q}_k^{\top}\mathbf{Q}_k = \bm{\Lambda}^{-1/2}\mathbf{U}^{\top}\mathbf{Z}_k^{\top}\mathbf{Z}_k\mathbf{U}\bm{\Lambda}^{-1/2} = \bm{\Lambda}^{-1/2}\mathbf{U}^{\top}\mathbf{U}\bm{\Lambda}\mathbf{U}^{\top}\mathbf{U}\bm{\Lambda}^{-1/2} = \mathbf{I}_r, \quad (k = 1, 2).
\]

Thus, we have

\[
\begin{aligned}
\mathbf{Z}_1 &= \mathbf{Q}_1\bm{\Lambda}^{1/2}\mathbf{U}^{\top} = \mathbf{Q}_1\mathbf{Q}_2^{\top}\mathbf{Q}_2\bm{\Lambda}^{1/2}\mathbf{U}^{\top} = \mathbf{Q}_1\mathbf{Q}_2^{\top}\mathbf{Z}_2, \\
\mathbf{Z}_2 &= \mathbf{Q}_2\bm{\Lambda}^{1/2}\mathbf{U}^{\top} = \mathbf{Q}_2\mathbf{Q}_1^{\top}\mathbf{Q}_1\bm{\Lambda}^{1/2}\mathbf{U}^{\top} = \mathbf{Q}_2\mathbf{Q}_1^{\top}\mathbf{Z}_1.
\end{aligned}
\]

Because $\mathbf{Q}_1^{\top}\mathbf{Q}_1 = \mathbf{Q}_2^{\top}\mathbf{Q}_2 = \mathbf{I}_r$ and each $\mathbf{Q}_k$ is an $n$ by $r$ matrix, there exist $\mathbf{Q}_1^{\perp}, \mathbf{Q}_2^{\perp} \in \mathbb{R}^{n \times (n-r)}$ such that $\mathbf{V}_1 = \begin{bmatrix}\mathbf{Q}_1 & \mathbf{Q}_1^{\perp}\end{bmatrix}$ and $\mathbf{V}_2 = \begin{bmatrix}\mathbf{Q}_2 & \mathbf{Q}_2^{\perp}\end{bmatrix}$ are both orthogonal matrices. Thus, we have

\[
\begin{aligned}
\mathbf{Z}_1 &= \mathbf{Q}_1\mathbf{Q}_2^{\top}\mathbf{Z}_2 = \left(\mathbf{V}_1\mathbf{V}_2^{\top} - \mathbf{Q}_1^{\perp}(\mathbf{Q}_2^{\perp})^{\top}\right)\mathbf{Z}_2, \\
\mathbf{Z}_2 &= \mathbf{Q}_2\mathbf{Q}_1^{\top}\mathbf{Z}_1 = \left(\mathbf{V}_2\mathbf{V}_1^{\top} - \mathbf{Q}_2^{\perp}(\mathbf{Q}_1^{\perp})^{\top}\right)\mathbf{Z}_1.
\end{aligned}
\]

Substituting $\mathbf{Z}_1 = \mathbf{Q}_1\mathbf{Q}_2^{\top}\mathbf{Z}_2$ in $\mathbf{Z}_2 = \mathbf{Q}_2\mathbf{Q}_1^{\top}\mathbf{Z}_1$, we have

\[
\mathbf{Z}_2 = \mathbf{Q}_2\mathbf{Q}_1^{\top}\mathbf{Q}_1\mathbf{Q}_2^{\top}\mathbf{Z}_2 = \mathbf{Q}_2\mathbf{Q}_2^{\top}\mathbf{Z}_2
\]

and thus, because $\mathbf{V}_2\mathbf{V}_2^{\top} = \mathbf{Q}_2\mathbf{Q}_2^{\top} + \mathbf{Q}_2^{\perp}(\mathbf{Q}_2^{\perp})^{\top}$ is the identity matrix,

\[
\mathbf{Q}_2^{\perp}(\mathbf{Q}_2^{\perp})^{\top}\mathbf{Z}_2 = \mathbf{Z}_2 - \mathbf{Q}_2\mathbf{Q}_2^{\top}\mathbf{Z}_2 = \mathbf{0}.
\]

Left-multiplying by $(\mathbf{Q}_2^{\perp})^{\top}$ gives $(\mathbf{Q}_2^{\perp})^{\top}\mathbf{Z}_2 = \mathbf{0}$, and hence $\mathbf{Q}_1^{\perp}(\mathbf{Q}_2^{\perp})^{\top}\mathbf{Z}_2 = \mathbf{0}$. Thus, there exists an orthogonal matrix $\mathbf{Q} = \mathbf{V}_1\mathbf{V}_2^{\top}$ such that $\mathbf{Z}_1 = \mathbf{Q}\mathbf{Z}_2$.
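The construction in this proof can be traced numerically. The following sketch follows the same steps (reduced eigen-decomposition, $\mathbf{Q}_k = \mathbf{Z}_k\mathbf{U}\bm{\Lambda}^{-1/2}$, completion to an orthogonal $\mathbf{V}_k$, and $\mathbf{Q} = \mathbf{V}_1\mathbf{V}_2^{\top}$); the function name `orthogonal_link`, the tolerance, and the use of a complete QR factorization to obtain $\mathbf{Q}_k^{\perp}$ are choices made for this illustration.

```python
import numpy as np

def orthogonal_link(Z1, Z2, tol=1e-10):
    """Return an orthogonal Q with Z1 = Q @ Z2, assuming Z1.T @ Z1 == Z2.T @ Z2."""
    lam, U = np.linalg.eigh(Z2.T @ Z2)
    keep = lam > tol * lam.max()           # drop (numerically) zero eigenvalues
    lam, U = lam[keep], U[:, keep]
    r = lam.size
    V = []
    for Zk in (Z1, Z2):
        Qk = Zk @ U / np.sqrt(lam)         # Q_k = Z_k U Lambda^{-1/2}, shape (n, r)
        full, _ = np.linalg.qr(Qk, mode="complete")
        Qk_perp = full[:, r:]              # orthonormal basis of the complement of colspace(Q_k)
        V.append(np.hstack([Qk, Qk_perp])) # V_k = [Q_k, Q_k_perp], an n x n orthogonal matrix
    return V[0] @ V[1].T

# Toy example: Z1 is a rotated copy of Z2, so the two Gram matrices agree.
rng = np.random.default_rng(0)
n, p = 6, 3
Z2 = rng.standard_normal((n, p))
Q0, _ = np.linalg.qr(rng.standard_normal((n, n)))
Z1 = Q0 @ Z2

Q = orthogonal_link(Z1, Z2)
print(np.allclose(Q @ Q.T, np.eye(n)), np.allclose(Z1, Q @ Z2))
```

The recovered $\mathbf{Q}$ need not equal the rotation used to generate the data, since the lemma only asserts existence; when $r < n$ there are many valid choices of $\mathbf{Q}_k^{\perp}$.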

We can then prove Proposition 1 as follows. By Lemma 3, since $[\widecheck{\mathbf{X}}\ \widecheck{\mathbf{Y}}]^{\top}[\widecheck{\mathbf{X}}\ \widecheck{\mathbf{Y}}] = [\mathbf{X}\ \mathbf{Y}]^{\top}[\mathbf{X}\ \mathbf{Y}]$, we know that $[\widecheck{\mathbf{X}}\ \widecheck{\mathbf{Y}}] = \mathbf{Q}^{\top}[\mathbf{X}\ \mathbf{Y}]$ for some orthogonal matrix $\mathbf{Q}$.

Let $\mathbf{E} \in \mathbb{R}^{n \times p}$ be a matrix with i.i.d. standard Gaussian entries. Then $\mathbf{Q}\mathbf{E}$ is also a matrix with i.i.d. standard Gaussian entries (i.e., $\mathbf{E} \stackrel{d}{=} \mathbf{Q}\mathbf{E}$) and

\[
\begin{aligned}
(\mathbf{E}^{\top}[\widecheck{\mathbf{X}}\ \widecheck{\mathbf{Y}}],\ \mathbf{E}^{\top}\mathbf{E}) \mid \mathbf{X}, \mathbf{Y}
&= (\mathbf{E}^{\top}\mathbf{Q}^{\top}[\mathbf{X}\ \mathbf{Y}],\ \mathbf{E}^{\top}\mathbf{Q}^{\top}\mathbf{Q}\mathbf{E}) \mid \mathbf{X}, \mathbf{Y} \\
&\stackrel{d}{=} (\mathbf{E}^{\top}[\mathbf{X}\ \mathbf{Y}],\ \mathbf{E}^{\top}\mathbf{E}) \mid \mathbf{X}, \mathbf{Y}.
\end{aligned}
\]

By the construction of $[\widecheck{\mathbf{X}}\ \widecheck{\mathbf{Y}}]^{\top}[\widecheck{\mathbf{X}}\ \widecheck{\mathbf{Y}}]$, we have that $\widecheck{\mathbf{X}}^{\top}\widecheck{\mathbf{X}} = \mathbf{X}^{\top}\mathbf{X}$, $\widecheck{\mathbf{X}}^{\top}\widecheck{\mathbf{Y}} = \mathbf{X}^{\top}\mathbf{Y}$ and $\lVert\widecheck{\mathbf{Y}}\rVert_2 = \lVert\mathbf{Y}\rVert_2$. Therefore, we focus on the third, fifth and sixth arguments of $\mathcal{T}$, where

\[
\begin{aligned}
&(\widetilde{\mathbf{X}}^{\top}\mathbf{Y},\ \widetilde{\mathbf{X}}^{\top}\mathbf{X},\ \widetilde{\mathbf{X}}^{\top}\widetilde{\mathbf{X}}) \mid \mathbf{X}, \mathbf{Y} \\
&\quad= (\mathbf{P}^{\top}\mathbf{X}^{\top}\mathbf{Y} + \mathbf{V}^{1/2}\mathbf{E}^{\top}\mathbf{Y},\ \mathbf{P}^{\top}\mathbf{X}^{\top}\mathbf{X} + \mathbf{V}^{1/2}\mathbf{E}^{\top}\mathbf{X},\ \mathbf{P}^{\top}\mathbf{X}^{\top}\mathbf{X}\mathbf{P} + \mathbf{V}^{1/2}\mathbf{E}^{\top}\mathbf{E}\mathbf{V}^{1/2} \\
&\qquad\quad + \mathbf{P}^{\top}\mathbf{X}^{\top}\mathbf{E}\mathbf{V}^{1/2} + \mathbf{V}^{1/2}\mathbf{E}^{\top}\mathbf{X}\mathbf{P}) \mid \mathbf{X}, \mathbf{Y} \\
&\quad\stackrel{d}{=} (\mathbf{P}^{\top}\widecheck{\mathbf{X}}^{\top}\widecheck{\mathbf{Y}} + \mathbf{V}^{1/2}\mathbf{E}^{\top}\widecheck{\mathbf{Y}},\ \mathbf{P}^{\top}\widecheck{\mathbf{X}}^{\top}\widecheck{\mathbf{X}} + \mathbf{V}^{1/2}\mathbf{E}^{\top}\widecheck{\mathbf{X}},\ \mathbf{P}^{\top}\widecheck{\mathbf{X}}^{\top}\widecheck{\mathbf{X}}\mathbf{P} + \mathbf{V}^{1/2}\mathbf{E}^{\top}\mathbf{E}\mathbf{V}^{1/2} \\
&\qquad\quad + \mathbf{P}^{\top}\widecheck{\mathbf{X}}^{\top}\mathbf{E}\mathbf{V}^{1/2} + \mathbf{V}^{1/2}\mathbf{E}^{\top}\widecheck{\mathbf{X}}\mathbf{P}) \mid \mathbf{X}, \mathbf{Y} \\
&\quad= (\widetilde{\widecheck{\mathbf{X}}}^{\top}\widecheck{\mathbf{Y}},\ \widetilde{\widecheck{\mathbf{X}}}^{\top}\widecheck{\mathbf{X}},\ \widetilde{\widecheck{\mathbf{X}}}^{\top}\widetilde{\widecheck{\mathbf{X}}}) \mid \mathbf{X}, \mathbf{Y}.
\end{aligned}
\]

Hence,

$$\mathcal{T}(\mathbf{X},\widetilde{\mathbf{X}},\mathbf{Y})\mid\mathbf{X},\mathbf{Y}\;\stackrel{d}{=}\;\mathcal{T}(\widecheck{\mathbf{X}},\widetilde{\widecheck{\mathbf{X}}},\widecheck{\mathbf{Y}})\mid\mathbf{X},\mathbf{Y}.$$

Appendix D Construction of $[\widecheck{\mathbf{X}}\ \widecheck{\mathbf{Y}}]$ via eigen-decomposition

In this section, we give details on how to construct $[\widecheck{\mathbf{X}}\ \widecheck{\mathbf{Y}}]$ such that $[\widecheck{\mathbf{X}}\ \widecheck{\mathbf{Y}}]^{\top}[\widecheck{\mathbf{X}}\ \widecheck{\mathbf{Y}}]=[\mathbf{X}\ \mathbf{Y}]^{\top}[\mathbf{X}\ \mathbf{Y}]$ using eigen-decomposition,

$$[\mathbf{X}\ \mathbf{Y}]^{\top}[\mathbf{X}\ \mathbf{Y}]=\mathbf{U}\mathbf{D}\mathbf{U}^{\top},\quad\text{where }\mathbf{U}=[\mathbf{u}_{1}\ \ldots\ \mathbf{u}_{p+1}]\text{ is an orthogonal matrix, }\mathbf{D}=\text{diag}(d_{1},\ldots,d_{p+1}),$$

with $d_{1}\geq\cdots\geq d_{p+1}$. We consider two cases as follows.

Case 1 ($n<p+1$): Since $\text{rank}([\mathbf{X}\ \mathbf{Y}]^{\top}[\mathbf{X}\ \mathbf{Y}])\leq n$, we have $d_{n+1}=\cdots=d_{p+1}=0$ and

$$[\mathbf{X}\ \mathbf{Y}]^{\top}[\mathbf{X}\ \mathbf{Y}]=\mathbf{U}_{1}\mathbf{D}_{n}\mathbf{U}_{1}^{\top},$$

where $\mathbf{U}_{1}=[\mathbf{u}_{1}\ \ldots\ \mathbf{u}_{n}]$ and $\mathbf{D}_{n}=\text{diag}(d_{1},\ldots,d_{n})$. In this case, we let $[\widecheck{\mathbf{X}}\ \widecheck{\mathbf{Y}}]=\mathbf{D}_{n}^{1/2}\mathbf{U}_{1}^{\top}$, so that $[\widecheck{\mathbf{X}}\ \widecheck{\mathbf{Y}}]^{\top}[\widecheck{\mathbf{X}}\ \widecheck{\mathbf{Y}}]=[\mathbf{X}\ \mathbf{Y}]^{\top}[\mathbf{X}\ \mathbf{Y}]$ is satisfied.

Case 2 ($n\geq p+1$): In this case, we let

$$[\widecheck{\mathbf{X}}\ \widecheck{\mathbf{Y}}]=\begin{bmatrix}\mathbf{D}^{1/2}\mathbf{U}^{\top}\\ \mathbf{0}_{(n-p-1)\times(p+1)}\end{bmatrix}$$

such that

$$[\widecheck{\mathbf{X}}\ \widecheck{\mathbf{Y}}]^{\top}[\widecheck{\mathbf{X}}\ \widecheck{\mathbf{Y}}]=\mathbf{U}\mathbf{D}\mathbf{U}^{\top}=[\mathbf{X}\ \mathbf{Y}]^{\top}[\mathbf{X}\ \mathbf{Y}]$$

is satisfied.
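The two cases can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: the function name `pseudo_data_from_gram` and the toy dimensions are illustrative assumptions, and the Gram matrix is simulated here only so that the identity can be verified.

```python
import numpy as np

def pseudo_data_from_gram(gram, n):
    """Return an n x (p+1) matrix A with A.T @ A equal to gram (up to rounding).

    gram : (p+1) x (p+1) symmetric PSD matrix standing in for [X Y]^T [X Y].
    n    : number of rows of the original data (rank(gram) <= n is assumed).
    """
    q = gram.shape[0]                      # q = p + 1
    d, U = np.linalg.eigh(gram)            # eigh returns ascending eigenvalues
    order = np.argsort(d)[::-1]            # reorder so that d_1 >= ... >= d_q
    d, U = d[order], U[:, order]
    d = np.clip(d, 0.0, None)              # remove tiny negative rounding errors
    if n < q:
        # Case 1: keep the n leading eigenpairs; the rest are numerically zero.
        return np.sqrt(d[:n])[:, None] * U[:, :n].T
    # Case 2: stack D^{1/2} U^T on top of n - q rows of zeros.
    top = np.sqrt(d)[:, None] * U.T
    return np.vstack([top, np.zeros((n - q, q))])

rng = np.random.default_rng(0)
for n, p in [(5, 8), (20, 4)]:             # one example per case
    XY = rng.standard_normal((n, p + 1))   # hypothetical individual-level data
    gram = XY.T @ XY
    A = pseudo_data_from_gram(gram, n)
    print(A.shape, np.allclose(A.T @ A, gram))
```

Only the Gram matrix and $n$ enter the function, which is the point of the construction: the pseudo-data reproduce the required cross-products without access to the original rows.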

Appendix E Computation of the tuning parameter $\lambda$ for the lasso-min method

If we had access to individual-level data, so that Gaussian knockoffs $\widetilde{\mathbf{X}}$ could be constructed, we could follow the method of Dicker [2014] and estimate the noise level $\sigma$ by

$$\widehat{\sigma}_{0}=\sqrt{\max\left(\frac{2p+n+1}{n(n+1)}\lVert\mathbf{Y}\rVert_{2}^{2}-\frac{\mathbf{Y}^{\top}\begin{bmatrix}\mathbf{X}&\widetilde{\mathbf{X}}\end{bmatrix}\mathbf{G}^{-1}\begin{bmatrix}\mathbf{X}&\widetilde{\mathbf{X}}\end{bmatrix}^{\top}\mathbf{Y}}{n(n+1)},\,0\right)},\quad\text{where }\mathbf{G}=\begin{bmatrix}\mathbf{\Sigma}&\mathbf{\Sigma}-\mathbf{D}\\ \mathbf{\Sigma}-\mathbf{D}&\mathbf{\Sigma}\end{bmatrix}.\tag{16}$$

We could then compute $\lambda=\kappa\cdot\frac{\widehat{\sigma}_{0}}{n}\cdot\mathbb{E}[\lVert\mathbf{R}^{\top}\bm{\epsilon}\rVert_{\infty}]$, where $\mathbf{R}\in\mathbb{R}^{n\times 2p}$ is a data matrix whose rows are i.i.d. samples from $\mathcal{N}(\mathbf{0},\mathbf{G})$, and $\bm{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_{n})$ is independent of $\mathbf{R}$. In the summary statistics setting, we replace $\widetilde{\mathbf{X}}^{\top}\mathbf{Y}$ in (16) by $\mathbf{P}^{\top}\mathbf{X}^{\top}\mathbf{Y}+\lVert\mathbf{Y}\rVert_{2}\mathbf{Z}$, where $\mathbf{P}$ and $\mathbf{Z}$ are obtained in Algorithm 4.
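As an illustration of how (16) depends on the data only through $\lVert\mathbf{Y}\rVert_2^2$ and the cross-products with $\mathbf{Y}$, here is a hedged NumPy sketch. The function name `sigma0_hat` and its argument names are hypothetical; the knockoff cross-product `xkty` is passed in as an argument, since its summary-statistics replacement comes from Algorithm 4 and is not reproduced here. The toy check assumes $\mathbf{\Sigma}=\mathbf{D}=\mathbf{I}$, in which case independent Gaussian columns are valid knockoffs and $\mathbf{G}=\mathbf{I}_{2p}$.

```python
import numpy as np

def sigma0_hat(n, yty, xty, xkty, G):
    """Evaluate estimator (16) from summary quantities only.

    yty  : ||Y||_2^2
    xty  : X^T Y (length p)
    xkty : knockoff counterpart of X^T Y (length p), supplied by the caller
    G    : 2p x 2p matrix [[Sigma, Sigma - D], [Sigma - D, Sigma]], assumed invertible
    """
    p = xty.shape[0]
    v = np.concatenate([xty, xkty])           # [X Xk]^T Y, length 2p
    quad = v @ np.linalg.solve(G, v)          # v^T G^{-1} v without forming G^{-1}
    val = (2 * p + n + 1) / (n * (n + 1)) * yty - quad / (n * (n + 1))
    return np.sqrt(max(val, 0.0))

# Toy check with individual-level data (all numbers below are made up).
rng = np.random.default_rng(0)
n, p, sigma = 2000, 20, 1.5
X = rng.standard_normal((n, p))
Xk = rng.standard_normal((n, p))              # valid knockoffs when Sigma = D = I
beta = np.zeros(p)
beta[:3] = 1.0
Y = X @ beta + sigma * rng.standard_normal(n)
est = sigma0_hat(n, Y @ Y, X.T @ Y, Xk.T @ Y, np.eye(2 * p))
print(est)                                    # should be in the vicinity of sigma
```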

The expectation $\mathbb{E}[\lVert\mathbf{R}^{\top}\bm{\epsilon}\rVert_{\infty}]$ can be computed using Monte Carlo integration. However, when both $n$ and $p$ are very large, sampling $\mathbf{R}$ and $\bm{\epsilon}$ becomes too time-consuming. Observing that

$$\mathbf{R}^{\top}\bm{\epsilon}=\mathbf{R}^{\top}\frac{\bm{\epsilon}}{\lVert\bm{\epsilon}\rVert_{2}}\lVert\bm{\epsilon}\rVert_{2},$$

where $\mathbf{R}^{\top}\frac{\bm{\epsilon}}{\lVert\bm{\epsilon}\rVert_{2}}\sim\mathcal{N}(\mathbf{0},\mathbf{G})$ and $\bm{\epsilon}$ are independent, we have

$$\mathbb{E}[\lVert\mathbf{R}^{\top}\bm{\epsilon}\rVert_{\infty}]=\mathbb{E}[\lVert\mathcal{N}(\mathbf{0},\mathbf{G})\rVert_{\infty}]\,\mathbb{E}[\lVert\bm{\epsilon}\rVert_{2}]=\mathbb{E}[\lVert\mathcal{N}(\mathbf{0},\mathbf{G})\rVert_{\infty}]\cdot\sqrt{2}\,\frac{\Gamma\{(n+1)/2\}}{\Gamma(n/2)}.$$

By Stirling's formula,

$$\Gamma(z)=\sqrt{\frac{2\pi}{z}}\left(\frac{z}{e}\right)^{z}\left(1+O\left(\frac{1}{z}\right)\right),$$

we have

$$\sqrt{2}\,\frac{\Gamma\{(n+1)/2\}}{\Gamma(n/2)}\sim\sqrt{n}.$$
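The quality of this asymptotic equivalence can be inspected numerically. The short sketch below uses only the Python standard library; the helper name `gamma_ratio` is an illustrative choice, and the ratio is computed on the log scale to avoid overflow of the Gamma function for large $n$.

```python
import math

def gamma_ratio(n):
    """sqrt(2) * Gamma((n+1)/2) / Gamma(n/2), evaluated via log-Gamma."""
    return math.sqrt(2.0) * math.exp(math.lgamma((n + 1) / 2) - math.lgamma(n / 2))

# The ratio to sqrt(n) approaches 1 from below as n grows.
for n in (10, 100, 1000, 100000):
    print(n, gamma_ratio(n) / math.sqrt(n))
```

Since the left-hand side equals $\mathbb{E}[\lVert\bm{\epsilon}\rVert_2]$ and $\mathbb{E}[\lVert\bm{\epsilon}\rVert_2^2]=n$, Jensen's inequality implies that the ratio never exceeds 1.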

Therefore, we may approximate

$$\mathbb{E}[\lVert\mathbf{R}^{\top}\bm{\epsilon}\rVert_{\infty}]\approx\sqrt{n}\;\mathbb{E}[\lVert\mathbf{L}\mathbf{Z}\rVert_{\infty}],$$

where $\mathbf{L}$ is the Cholesky factor of $\mathbf{G}$ (so that $\mathbf{G}=\mathbf{L}\mathbf{L}^{\top}$) and $\mathbf{Z}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_{2p})$.

In practice, the simulated $\lVert\mathbf{L}\mathbf{Z}\rVert_{\infty}$ usually concentrates around its mean, as shown in Figure 7. Thus, only a few Monte Carlo samples are needed to accurately estimate $\mathbb{E}[\lVert\mathbf{L}\mathbf{Z}\rVert_{\infty}]$, and we draw $10$ Monte Carlo samples throughout the numerical experiments of this paper.

Figure 7: Boxplots of 100 simulated samples of $\lVert\mathbf{L}\mathbf{Z}\rVert_{\infty}$ with $p=200$, $\mathbf{\Sigma}_{i,j}=\rho^{|i-j|}$ and $\mathbf{D}$ obtained via (15) for different values of $\rho$.
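The Monte Carlo approximation of $\lambda$ described above can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions rather than the paper's code: the function names, the value of `kappa`, the value of `sigma0`, and the choice $\mathbf{D}=s\mathbf{I}$ (used here instead of the solution of (15), which is not reproduced) are all hypothetical.

```python
import numpy as np

def expected_max_abs(G, n_draws=10, rng=None):
    """Monte Carlo estimate of E||L Z||_inf with G = L L^T and Z ~ N(0, I)."""
    rng = np.random.default_rng() if rng is None else rng
    L = np.linalg.cholesky(G)                        # lower-triangular factor
    Z = rng.standard_normal((G.shape[0], n_draws))   # one column per draw
    return np.abs(L @ Z).max(axis=0).mean()

def lasso_min_lambda(kappa, sigma0, n, G, n_draws=10, rng=None):
    """lambda = kappa * (sigma0 / n) * sqrt(n) * E||L Z||_inf."""
    return kappa * sigma0 / n * np.sqrt(n) * expected_max_abs(G, n_draws, rng)

# Toy setup: AR(1) Sigma and D = s * I with s strictly inside the PSD boundary,
# so that G is positive definite and the Cholesky factorisation exists.
p, rho = 50, 0.5
idx = np.arange(p)
Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
s = 0.9 * min(1.0, 2 * np.linalg.eigvalsh(Sigma)[0])
D = s * np.eye(p)
G = np.block([[Sigma, Sigma - D], [Sigma - D, Sigma]])
lam = lasso_min_lambda(kappa=0.6, sigma0=1.0, n=1000, G=G,
                       rng=np.random.default_rng(0))
print(lam)
```

Each draw costs one matrix-vector product with a $2p\times 2p$ triangular factor, independently of $n$, which is what makes this route cheaper than sampling $\mathbf{R}$ and $\bm{\epsilon}$ directly.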

Next, we prove that Algorithm 4 maintains FDR control when $\lambda$ is computed as described in Section 4.2.2. By Proposition 2, it suffices to show that (11) with the computed $\lambda$ produces feature importance statistics that satisfy the flip-sign property. Since $\lambda\approx\kappa\cdot\frac{\widehat{\sigma}_{0}}{n}\cdot\mathbb{E}[\lVert\mathbf{R}^{\top}\bm{\epsilon}\rVert_{\infty}]$, where the expectation depends only on $\mathbf{G}$ and $n$, it suffices to show that $\widehat{\sigma}_{0}$ is invariant to swapping variables with their knockoffs [Candès et al., 2018].

Let $\bm{\Pi}_{j}\in\mathbb{R}^{2p\times 2p}$ be the permutation matrix that swaps the $j$-th column with the $(j+p)$-th column of a matrix. Thus, we have $\bm{\Pi}_{j}^{-1}=\bm{\Pi}_{j}^{\top}=\bm{\Pi}_{j}$. This leads to

$$\begin{aligned}
\mathbf{Y}^{\top}\begin{bmatrix}\mathbf{X}&\widetilde{\mathbf{X}}\end{bmatrix}\bm{\Pi}_{j}\mathbf{G}^{-1}\left(\begin{bmatrix}\mathbf{X}&\widetilde{\mathbf{X}}\end{bmatrix}\bm{\Pi}_{j}\right)^{\top}\mathbf{Y}
&=\mathbf{Y}^{\top}\begin{bmatrix}\mathbf{X}&\widetilde{\mathbf{X}}\end{bmatrix}\bm{\Pi}_{j}\mathbf{G}^{-1}\bm{\Pi}_{j}^{\top}\begin{bmatrix}\mathbf{X}&\widetilde{\mathbf{X}}\end{bmatrix}^{\top}\mathbf{Y}\\
&=\mathbf{Y}^{\top}\begin{bmatrix}\mathbf{X}&\widetilde{\mathbf{X}}\end{bmatrix}(\bm{\Pi}_{j}\mathbf{G}\bm{\Pi}_{j})^{-1}\begin{bmatrix}\mathbf{X}&\widetilde{\mathbf{X}}\end{bmatrix}^{\top}\mathbf{Y}\\
&=\mathbf{Y}^{\top}\begin{bmatrix}\mathbf{X}&\widetilde{\mathbf{X}}\end{bmatrix}\mathbf{G}^{-1}\begin{bmatrix}\mathbf{X}&\widetilde{\mathbf{X}}\end{bmatrix}^{\top}\mathbf{Y},
\end{aligned}$$

where the last equality uses $\bm{\Pi}_{j}\mathbf{G}\bm{\Pi}_{j}=\mathbf{G}$, which holds because $\mathbf{D}$ is diagonal. This shows that $\widehat{\sigma}_{0}$ is invariant to swapping variables with their knockoffs [Candès et al., 2018]. Therefore, the FDR of Algorithm 4 is controlled when $\lambda$ is computed as described in Section 4.2.2.
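The invariance argument can be sanity-checked numerically. The sketch below is an illustration only: $\mathbf{\Sigma}$, the diagonal $\mathbf{D}$, the data matrix standing in for $[\mathbf{X}\ \widetilde{\mathbf{X}}]$, and the index `j` (0-based here) are all made-up toy inputs, and the identity being tested is purely algebraic, so the placeholder data need not be genuine knockoffs.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, j = 30, 4, 1                      # j is a 0-based column index in this sketch
idx = np.arange(p)
Sigma = 0.4 ** np.abs(idx[:, None] - idx[None, :])
D = np.diag(np.full(p, 0.5))            # diagonal D with 2 * Sigma - D positive definite
G = np.block([[Sigma, Sigma - D], [Sigma - D, Sigma]])

# Permutation matrix that swaps coordinates j and j + p.
Pi = np.eye(2 * p)
Pi[[j, j + p]] = Pi[[j + p, j]]

XXk = rng.standard_normal((n, 2 * p))   # placeholder for [X Xk]
Y = rng.standard_normal(n)

def quad_form(M):
    """Y^T M G^{-1} M^T Y, the data-dependent term of the noise estimator."""
    v = M.T @ Y
    return v @ np.linalg.solve(G, v)

g_invariant = np.allclose(Pi @ G @ Pi, G)
q_invariant = np.isclose(quad_form(XXk @ Pi), quad_form(XXk))
print(g_invariant, q_invariant)
```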

Since all variables in $\widetilde{\mathbf{X}}$ are null, in practice we may replace $\begin{bmatrix}\mathbf{X}&\widetilde{\mathbf{X}}\end{bmatrix}$ by $\mathbf{X}$, $\mathbf{G}$ by $\mathbf{\Sigma}$ and $2p+n+1$ by $p+n+1$ in (16) to reduce the dimension when estimating $\sigma$. Although this would, in theory, break the flip-sign property required for FDR control, no FDR inflation is observed in our simulations.

Appendix F Connection with the scout procedure

In this section, we explain the connection between the feature importance statistic defined in Algorithm 4 and the scout procedure [Witten and Tibshirani, 2009].

For covariates $X\in\mathbb{R}^{p}$ and response $Y\in\mathbb{R}$, Witten and Tibshirani [2009] assume that $\begin{bmatrix}X\\ Y\end{bmatrix}\sim\mathcal{N}(\mathbf{0},\mathbf{\Sigma}_{X,Y})$. The population linear regression coefficient of $Y$ on $X$, which induces a linear predictor that achieves the minimal mean squared prediction error, is given by $\bm{\beta}=-\bm{\Theta}_{XY}/\bm{\Theta}_{YY}$, where $\bm{\Theta}=\begin{bmatrix}\bm{\Theta}_{XX}&\bm{\Theta}_{XY}\\ \bm{\Theta}_{YX}&\bm{\Theta}_{YY}\end{bmatrix}=\mathbf{\Sigma}_{X,Y}^{-1}$ is the precision matrix. Letting $\mathbf{S}$ be the empirical covariance matrix of $X$ and $Y$, they consider the following covariance-regularized regression approach to estimate $\bm{\beta}$:

  1. Compute $\hat{\bm{\Theta}}_{XX}$ to maximize $\log\{\det(\bm{\Theta}_{XX})\}-\text{tr}(\mathbf{S}_{XX}\bm{\Theta}_{XX})-J_{1}(\bm{\Theta}_{XX})$.

  2. Compute $\hat{\bm{\Theta}}$ to maximize $\log\{\det(\bm{\Theta})\}-\text{tr}(\mathbf{S}\bm{\Theta})-J_{2}(\bm{\Theta})$ subject to $\bm{\Theta}_{XX}=\hat{\bm{\Theta}}_{XX}$ obtained from Step 1.

  3. Compute $\hat{\bm{\beta}}=-\hat{\bm{\Theta}}_{XY}/\hat{\Theta}_{YY}$.

  4. Compute $\hat{\bm{\beta}}^{*}=c\hat{\bm{\beta}}$, where $c$ is the regression coefficient of $\mathbf{Y}$ onto $\mathbf{X}\hat{\bm{\beta}}$.

Here, $J_{1}$ and $J_{2}$ are two penalty functions. The first two steps are to appropriately separate true conditional correlations from those purely due to noise. As shown in Witten and Tibshirani [2009], when $J_{2}(\bm{\Theta})=\lambda_{2}\lVert\bm{\Theta}\rVert_{1}$ (resp. $\lambda_{2}\lVert\bm{\Theta}\rVert_{2}^{2}$), the solution to step 3 is proportional to the solution of

$$\hat{\bm{\beta}}=\operatorname*{arg\,min}_{\bm{\beta}}\;\bm{\beta}^{\top}\mathbf{G}_{XX}\bm{\beta}-2\mathbf{S}_{XY}^{\top}\bm{\beta}+\lambda_{2}\lVert\bm{\beta}\rVert_{1}\;(\text{resp. }\lambda_{2}\lVert\bm{\beta}\rVert_{2}^{2}),$$

where $\mathbf{G}_{XX}$ is the inverse of the solution $\hat{\bm{\Theta}}_{XX}$ from step 1. In other words, the Lasso corresponds to the setting in which $J_{1}=0$ and $\mathbf{G}_{XX}=\mathbf{S}_{XX}$. Witten and Tibshirani [2009] consider various settings in which they demonstrate the superiority of the scout procedure over the Lasso, Ridge and Elastic Net. In the setting of Section 4, we have $\text{cov}(X,\widetilde{X})=\begin{bmatrix}\mathbf{\Sigma}&\mathbf{\Sigma}-\mathbf{D}\\ \mathbf{\Sigma}-\mathbf{D}&\mathbf{\Sigma}\end{bmatrix}$. Therefore, the objective function (11) corresponds to the case in which the true $\bm{\Theta}_{XX}$ is used in step 1 (here we include both $X$ and $\widetilde{X}$ as explanatory variables).
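To make the $\ell_1$-penalized quadratic objective above concrete, the following sketch minimizes $\bm{\beta}^{\top}\mathbf{G}\bm{\beta}-2\mathbf{s}^{\top}\bm{\beta}+\lambda\lVert\bm{\beta}\rVert_{1}$ by cyclic coordinate descent. Coordinate descent is a generic solver swapped in here for illustration, not necessarily the one used in the paper or in the scout procedure; the function names and the toy values of $\mathbf{G}$, $\mathbf{s}$ and $\lambda$ are assumptions.

```python
import numpy as np

def soft_threshold(x, t):
    """Scalar soft-thresholding operator."""
    return np.sign(x) * max(abs(x) - t, 0.0)

def quad_lasso_cd(G, s, lam, n_iter=200):
    """Minimise b^T G b - 2 s^T b + lam * ||b||_1 by cyclic coordinate descent.

    G is assumed symmetric positive definite, so each coordinate update
    has the closed form soft_threshold(r_j, lam / 2) / G_jj.
    """
    b = np.zeros(len(s))
    for _ in range(n_iter):
        for j in range(len(s)):
            # s_j minus the contribution of all coordinates other than j.
            r = s[j] - G[j] @ b + G[j, j] * b[j]
            b[j] = soft_threshold(r, lam / 2.0) / G[j, j]
    return b

# Toy inputs: G plays the role of cov(X, Xk) and s the role of the stacked
# cross-products with the response; all numbers are made up.
p = 3
idx = np.arange(p)
Sigma = 0.3 ** np.abs(idx[:, None] - idx[None, :])
D = 0.8 * np.eye(p)
G = np.block([[Sigma, Sigma - D], [Sigma - D, Sigma]])
s = np.array([1.0, 0.1, -0.7, 0.2, 0.0, -0.1])
lam = 0.4
beta_hat = quad_lasso_cd(G, s, lam)
print(np.round(beta_hat, 3))
```

When $\mathbf{G}$ is the identity, the solution reduces to coordinate-wise soft-thresholding of $\mathbf{s}$ at level $\lambda/2$, which gives a simple way to check the solver.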

Appendix G Construction of group knockoffs and examples of importance scores at the group level

For group knockoffs, we test the group conditional independence hypothesis:

$$H_{\gamma}^{0}:X_{\gamma}\perp\!\!\!\perp Y\mid X_{-\gamma}$$

where $\gamma\in\{1,\ldots,g\}$ denotes a group and $X_{\gamma}$ is the vector of features in group $\gamma$.

In addition to the conditional independence (2), group knockoffs 𝐗~~𝐗\widetilde{\mathbf{X}}over~ start_ARG bold_X end_ARG must satisfy the group exchangeability condition that

(Group exchangeability):(𝐗γ,𝐗~γ,𝐗γ,𝐗~γ)=d(𝐗~γ,𝐗γ,𝐗γ,𝐗~γ),γ{1,,g}.formulae-sequencesuperscript𝑑(Group exchangeability):subscript𝐗𝛾subscript~𝐗𝛾subscript𝐗𝛾subscript~𝐗𝛾subscript~𝐗𝛾subscript𝐗𝛾subscript𝐗𝛾subscript~𝐗𝛾for-all𝛾1𝑔\textbf{(Group exchangeability):}\;(\mathbf{X}_{\gamma},\widetilde{\mathbf{X}}% _{\gamma},\mathbf{X}_{-\gamma},\widetilde{\mathbf{X}}_{-\gamma})\stackrel{{% \scriptstyle d}}{{=}}(\widetilde{\mathbf{X}}_{\gamma},\mathbf{X}_{\gamma},% \mathbf{X}_{-\gamma},\widetilde{\mathbf{X}}_{-\gamma}),\;\forall\;\gamma\in\{1% ,...,g\}.(Group exchangeability): ( bold_X start_POSTSUBSCRIPT italic_γ end_POSTSUBSCRIPT , over~ start_ARG bold_X end_ARG start_POSTSUBSCRIPT italic_γ end_POSTSUBSCRIPT , bold_X start_POSTSUBSCRIPT - italic_γ end_POSTSUBSCRIPT , over~ start_ARG bold_X end_ARG start_POSTSUBSCRIPT - italic_γ end_POSTSUBSCRIPT ) start_RELOP SUPERSCRIPTOP start_ARG = end_ARG start_ARG italic_d end_ARG end_RELOP ( over~ start_ARG bold_X end_ARG start_POSTSUBSCRIPT italic_γ end_POSTSUBSCRIPT , bold_X start_POSTSUBSCRIPT italic_γ end_POSTSUBSCRIPT , bold_X start_POSTSUBSCRIPT - italic_γ end_POSTSUBSCRIPT , over~ start_ARG bold_X end_ARG start_POSTSUBSCRIPT - italic_γ end_POSTSUBSCRIPT ) , ∀ italic_γ ∈ { 1 , … , italic_g } .

No exchangeability property is required for features within the same group, which allows greater flexibility in the construction of knockoffs.

When $X\sim\mathcal{N}(\mathbf{0},\mathbf{\Sigma})$, the group exchangeability condition allows the diagonal matrix $\mathbf{D}=\text{diag}(\mathbf{s})$ described in Sections 2 and 4 to become a block-diagonal matrix $\mathbf{D}=\text{diag}(\mathbf{S}_{1},\ldots,\mathbf{S}_{g})$, where $\mathbf{S}_{\gamma}$ is a symmetric matrix whose dimension equals the number of variables in group $\gamma$ ($\gamma\in\{1,\ldots,g\}$). With the block-diagonal matrix $\mathbf{D}$ obtained following the SDP construction of Chu et al. [2023] in step 3, Algorithm 1 can construct valid group knockoffs $\widetilde{\mathbf{X}}$ of $\mathbf{X}$ with respect to $g$ feature groups. Analogously, Algorithms 2-4 can be modified correspondingly to perform inference on the $H_{\gamma}^{0}$'s.

Although it is conceptually straightforward to modify $\mathbf{D}$ from a diagonal to a block-diagonal matrix, note that doing so introduces significantly more optimization variables. To reduce the computational burden in practice, we exploit a form of conditional independence across groups, described in Section 4 of Chu et al. [2023]. The main idea is to select a few key variables in each group that capture most of the within-group variation, and to solve a reduced optimization problem only on the key variables. In the real data analysis, we defined groups via average-linkage hierarchical clustering with correlation cutoff $0.5$, selected representatives within groups via Algorithm A1 of Chu et al. [2023] with $c=0.5$, and replaced objective (15) by the maximum entropy (ME) objective, which has improved power over SDP constructions in simulations.

In this paper, we use $M$ multi-knockoffs. To define the importance score for group $\gamma$ and its knockoffs, we sum the absolute effect sizes of the variants in each group. With $M$ knockoff copies, we explicitly compute $Z_{\gamma}=\sum_{i\in\mathcal{A}_{\gamma}}|\beta_{i}|$ and $\widetilde{Z}_{\gamma}^{(\ell)}=\sum_{i\in\mathcal{A}_{\gamma}}|\widetilde{\beta}^{(\ell)}_{i}|$ for $\ell=1,\ldots,M$, where $\bm{\beta}=(\bm{\beta},\widetilde{\bm{\beta}}^{1},\ldots,\widetilde{\bm{\beta}}^{M})$ are the estimated effect sizes from step 4 of Algorithm 4. One may use other choices of feature importance such as the $l_{2}$ norm. The group-wise Lasso coefficient difference is then defined as

$$W_{\gamma}=\left(Z_{\gamma}-\operatorname{median}(\widetilde{Z}_{\gamma}^{(1)},\ldots,\widetilde{Z}_{\gamma}^{(M)})\right)I_{Z_{\gamma}\geq\max(\widetilde{Z}^{(1)}_{\gamma},\ldots,\widetilde{Z}^{(M)}_{\gamma})}$$

and groups with $W_{\gamma}>\tau$ are selected, where $\tau$ is calculated from the multiple knockoff filter [Gimenez and Zou, 2019]. Note that $W_{\gamma}$ is the feature importance statistic first introduced in He et al. [2021].
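The group-wise statistic above can be illustrated with a minimal Python sketch. The function name `group_importance`, the list-based inputs, and the toy numbers are assumptions made for illustration only; the threshold $\tau$ from the multiple knockoff filter is not computed here.

```python
from statistics import median

def group_importance(beta, beta_knockoffs, groups):
    """Return the list of W_gamma, one entry per group.

    beta           : p estimated effect sizes for the original features
    beta_knockoffs : M lists of length p, one per knockoff copy
    groups         : list of index lists; groups[g] plays the role of A_gamma
    """
    W = []
    for idx in groups:
        # Z_gamma: sum of absolute effects of the original features in the group
        z = sum(abs(beta[i]) for i in idx)
        # Z~_gamma^(l): the same sum for each of the M knockoff copies
        z_tilde = [sum(abs(bk[i]) for i in idx) for bk in beta_knockoffs]
        # The score is nonzero only if the original group beats every knockoff copy
        indicator = 1.0 if z >= max(z_tilde) else 0.0
        W.append((z - median(z_tilde)) * indicator)
    return W

# Toy example (hypothetical numbers): p = 4 features, M = 3 knockoff copies, 2 groups
beta = [0.9, -0.4, 0.0, 0.1]
beta_knockoffs = [
    [0.1, 0.0, 0.3, 0.0],
    [0.0, -0.2, 0.0, 0.2],
    [0.05, 0.0, 0.1, 0.0],
]
groups = [[0, 1], [2, 3]]
W = group_importance(beta, beta_knockoffs, groups)
```

In this toy input the first group receives a score of about 1.2, while the second group is set to zero because one of its knockoff copies has a larger sum than the original.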

Appendix H GhostKnockoffs for CRT (GhostCRT)

Let $\mathbf{X}\in\mathbb{R}^{n\times p}$ and $\mathbf{Y}\in\mathbb{R}^{n}$ be the covariate matrix and the response vector, respectively. Recall that in the conditional randomization test (CRT), to test $H_{j}:X_{j}\perp\!\!\!\perp Y\mid X_{-j}$, Candès et al. [2018] draw i.i.d. samples $\widetilde{\mathbf{X}}_{j}^{1},\ldots,\widetilde{\mathbf{X}}_{j}^{B}\sim\mathcal{L}(\mathbf{X}_{j}|\mathbf{X}_{-j})$ ($\mathbf{X}_{j}$ is the $j$-th column of the covariate matrix $\mathbf{X}$) and compute the CRT $p$-value as

$$p_{j}=\frac{1}{B+1}\left[1+\sum_{b=1}^{B}\mathbbm{1}_{T(\widetilde{\mathbf{X}}_{j}^{b},\mathbf{X}_{-j},\mathbf{Y})\geq T(\mathbf{X}_{j},\mathbf{X}_{-j},\mathbf{Y})}\right], \tag{17}$$

for some feature importance function $T$.

Under the assumption that rows of $\mathbf{X}$ are i.i.d. samples from $\mathcal{N}(\mathbf{0},\mathbf{\Sigma})$, we can generate $\widetilde{\mathbf{X}}_{j}^{1},\ldots,\widetilde{\mathbf{X}}_{j}^{B}$ by

$$\widetilde{\mathbf{X}}_{j}^{b}=\mathbf{X}_{-j}\bm{\gamma}_{j}+v_{j}^{1/2}\mathbf{E}_{j}^{b}, \tag{18}$$

where $\bm{\gamma}_{j}=\mathbf{\Sigma}_{-j,-j}^{-1}\mathbf{\Sigma}_{-j,j}\in\mathbb{R}^{p-1}$, $v_{j}=\Sigma_{j,j}-\mathbf{\Sigma}_{j,-j}\mathbf{\Sigma}_{-j,-j}^{-1}\mathbf{\Sigma}_{-j,j}$, and $\mathbf{E}_{j}^{1},\ldots,\mathbf{E}_{j}^{B}\stackrel{iid}{\sim}\mathcal{N}(0,\mathbf{I}_{n})$ are independent of everything else. Utilizing the analogy between (4) and (18), we develop the GhostCRT with counterparts of Algorithms 2-3 as follows; the counterpart of Algorithm 4 is derived in a similar way.

Algorithm 5 GhostKnockoffs with Marginal Correlation Difference Statistic for CRT
1:  Input: $\mathbf{X}^{\top}\mathbf{Y}$, $\|\mathbf{Y}\|_{2}^{2}$, and $\mathbf{\Sigma}$.
2:  for $j=1,\ldots,p$ do
3:     Compute $\bm{\gamma}_{j}=\mathbf{\Sigma}_{-j,-j}^{-1}\mathbf{\Sigma}_{-j,j}\in\mathbb{R}^{p-1}$ and $v_{j}=\Sigma_{j,j}-\mathbf{\Sigma}_{j,-j}\mathbf{\Sigma}_{-j,-j}^{-1}\mathbf{\Sigma}_{-j,j}$.
4:     for $b=1,\ldots,B$ do
5:        Generate $\widetilde{Z}^{b}_{j}=\bm{\gamma}_{j}^{\top}\mathbf{X}_{-j}^{\top}\mathbf{Y}+\|\mathbf{Y}\|_{2}Z_{j}^{b}$ where $Z_{j}^{b}\sim\mathcal{N}(0,v_{j})$ and is independent of everything else.
6:     end for
7:     Compute the CRT $p$-value $p_{j}$ via (17) with $T(\mathbf{X}_{j},\mathbf{X}_{-j},\mathbf{Y})=|\mathbf{X}_{j}^{\top}\mathbf{Y}|$ and $T(\widetilde{\mathbf{X}}_{j}^{b},\mathbf{X}_{-j},\mathbf{Y})=\widetilde{Z}^{b}_{j}$.
8:  end for
9:  Output: Selection set by conducting existing multiple testing procedures on CRT $p$-values $p_{1},\ldots,p_{p}$.
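The loop in Algorithm 5 can be sketched in Python with NumPy as follows. This is a hedged illustration, not the authors' implementation: the function name `ghost_crt_marginal`, the simulated data, and the choice $B=999$ are assumptions, and the sketch compares absolute values on both sides of (17), which is an assumption made so that the same $T$ is applied to the observed and resampled statistics.

```python
import numpy as np

def ghost_crt_marginal(XtY, Y_norm_sq, Sigma, B, rng):
    """CRT p-values computed from summary statistics only.

    XtY       : vector X^T Y of length p
    Y_norm_sq : scalar ||Y||_2^2
    Sigma     : p x p feature covariance matrix
    """
    p = len(XtY)
    y_norm = np.sqrt(Y_norm_sq)
    pvals = np.empty(p)
    for j in range(p):
        rest = np.delete(np.arange(p), j)                         # indices -j
        S_rj = Sigma[rest, j]                                     # Sigma_{-j,j}
        gamma = np.linalg.solve(Sigma[np.ix_(rest, rest)], S_rj)  # gamma_j
        v = Sigma[j, j] - S_rj @ gamma                            # v_j
        # Step 5: Z~_j^b = gamma_j^T X_{-j}^T Y + ||Y||_2 Z_j^b, with Z_j^b ~ N(0, v_j)
        z_tilde = gamma @ XtY[rest] + y_norm * np.sqrt(v) * rng.standard_normal(B)
        # Step 7: p-value (17); absolute values on both sides are an assumption
        exceed = np.sum(np.abs(z_tilde) >= np.abs(XtY[j]))
        pvals[j] = (1 + exceed) / (B + 1)
    return pvals

# Simulated example (hypothetical): only feature 0 is non-null
rng = np.random.default_rng(1)
n, p = 500, 4
Sigma = 0.3 * np.ones((p, p)) + 0.7 * np.eye(p)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
Y = 0.5 * X[:, 0] + rng.standard_normal(n)
pvals = ghost_crt_marginal(X.T @ Y, Y @ Y, Sigma, B=999, rng=rng)
```

Note that the individual-level data `X` and `Y` are used only to form the summary statistics passed to the function; the resulting p-values would then be fed to a multiple testing procedure, as in step 9.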
Algorithm 6 GhostKnockoffs with Penalized Regression for CRT: Known Empirical Covariance
1:  Input: $\mathbf{X}^{\top}\mathbf{X}$, $\mathbf{X}^{\top}\mathbf{Y}$, $\|\mathbf{Y}\|_{2}^{2}$, $\mathbf{\Sigma}$, and $n$.
2:  Find $\widecheck{\mathbf{X}}$ and $\widecheck{\mathbf{Y}}$ such that $[\widecheck{\mathbf{X}}\ \widecheck{\mathbf{Y}}]^{\top}[\widecheck{\mathbf{X}}\ \widecheck{\mathbf{Y}}]=[\mathbf{X}\ \mathbf{Y}]^{\top}[\mathbf{X}\ \mathbf{Y}]$ by eigen-decomposition or Cholesky decomposition.
3:  for $j=1,\ldots,p$ do
4:     Compute $\bm{\gamma}_{j}=\mathbf{\Sigma}_{-j,-j}^{-1}\mathbf{\Sigma}_{-j,j}\in\mathbb{R}^{p-1}$ and $v_{j}=\Sigma_{j,j}-\mathbf{\Sigma}_{j,-j}\mathbf{\Sigma}_{-j,-j}^{-1}\mathbf{\Sigma}_{-j,j}$.
5:     for $b=1,\ldots,B$ do
6:        Generate $\widetilde{\widecheck{\mathbf{X}}}_{j}^{b}$ via (18) using $\widecheck{\mathbf{X}}_{-j}$ as input.
7:     end for
8:     Compute the CRT $p$-value $p_{j}$ via (17) and replacing $\widetilde{\mathbf{X}}_{j}^{b}$ by $\widetilde{\widecheck{\mathbf{X}}}_{j}^{b}$ with feature importance statistic defined by $T(\mathbf{X}_{j},\mathbf{X}_{-j},\mathbf{Y})=|\hat{\beta}_{j}|$, where
$$(\hat{\beta}_{j},\hat{\bm{\beta}}_{-j})=\operatorname*{arg\,min}_{(\beta_{j},\bm{\beta}_{-j})\in\mathbb{R}^{p}}\frac{1}{2}\|\mathbf{Y}-\mathbf{X}_{j}\beta_{j}-\mathbf{X}_{-j}\bm{\beta}_{-j}\|_{2}^{2}+\lambda\|(\beta_{j},\bm{\beta}_{-j})\|_{1}.$$
9:  end for
10:  Output: Selection set by conducting existing multiple testing procedures on CRT $p$-values $p_{1},\ldots,p_{p}$.
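Step 2 of Algorithm 6 asks for pseudo-data whose Gram matrix matches that of $[\mathbf{X}\ \mathbf{Y}]$. A minimal NumPy sketch of the eigen-decomposition route is given below; the function name `pseudo_data` and the simulated inputs are assumptions for illustration, and clipping small negative eigenvalues is a numerical safeguard added here, not something stated in the algorithm.

```python
import numpy as np

def pseudo_data(XtX, XtY, YtY):
    """Return (X_check, Y_check) whose Gram matrix equals that of [X Y]."""
    p = XtX.shape[0]
    # Assemble the (p+1) x (p+1) Gram matrix [X Y]^T [X Y] from summary statistics
    G = np.empty((p + 1, p + 1))
    G[:p, :p] = XtX
    G[:p, p] = XtY
    G[p, :p] = XtY
    G[p, p] = YtY
    # Symmetric eigen-decomposition G = U diag(w) U^T; clip tiny negative
    # eigenvalues caused by rounding when G is close to rank deficient
    w, U = np.linalg.eigh(G)
    R = np.sqrt(np.clip(w, 0.0, None))[:, None] * U.T   # satisfies R^T R = G
    return R[:, :p], R[:, p]

# Simulated example (hypothetical data)
rng = np.random.default_rng(2)
X = rng.standard_normal((50, 3))
Y = X @ np.array([1.0, 0.0, -1.0]) + rng.standard_normal(50)
X_check, Y_check = pseudo_data(X.T @ X, X.T @ Y, Y @ Y)
```

Because the Lasso objective in step 8 depends on the data only through $\mathbf{X}^{\top}\mathbf{X}$, $\mathbf{X}^{\top}\mathbf{Y}$ and $\mathbf{Y}^{\top}\mathbf{Y}$, fitting it on the $(p+1)$-row pseudo-data gives the same minimizer as fitting it on the original $n$ rows.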

As (18) is a special case of (4) where

  • $\mathbf{P}$ is obtained by substituting the $(j,j)$-entry and other entries in the $j$-th column of $\mathbf{I}_{p}$ by $0$ and $\bm{\gamma}_{j}$ respectively;

  • $\mathbf{V}$ is a matrix of zeros except for the $(j,j)$-entry, which equals $v_{j}$,

all theoretical results in Sections 2-4 remain true for the GhostCRT.
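To make the two bullet points concrete, the following NumPy sketch builds $\mathbf{P}$ and $\mathbf{V}$ for one index $j$ and checks that the $j$-th column of $\mathbf{X}\mathbf{P}$ equals $\mathbf{X}_{-j}\bm{\gamma}_{j}$. It assumes that (4) has the form $\widetilde{\mathbf{X}}=\mathbf{X}\mathbf{P}+\mathbf{E}\mathbf{V}^{1/2}$, which is not restated in this appendix; the function name `crt_P_and_V` and the example covariance are likewise hypothetical.

```python
import numpy as np

def crt_P_and_V(Sigma, j):
    """Build the matrices P and V that express (18) in the form assumed for (4)."""
    p = Sigma.shape[0]
    rest = np.delete(np.arange(p), j)
    S_rj = Sigma[rest, j]
    gamma = np.linalg.solve(Sigma[np.ix_(rest, rest)], S_rj)   # gamma_j
    v = Sigma[j, j] - S_rj @ gamma                             # v_j
    P = np.eye(p)
    P[j, j] = 0.0          # (j, j)-entry of I_p replaced by 0
    P[rest, j] = gamma     # other entries of column j replaced by gamma_j
    V = np.zeros((p, p))
    V[j, j] = v            # only nonzero entry of V
    return P, V

# Example with an AR(1)-type covariance (hypothetical)
rng = np.random.default_rng(3)
Sigma = np.array([[1.0, 0.5, 0.25],
                  [0.5, 1.0, 0.5],
                  [0.25, 0.5, 1.0]])
X = rng.standard_normal((6, 3))
P, V = crt_P_and_V(Sigma, j=1)
XP = X @ P   # column 1 is X_{-1} gamma_1; the other columns are copied from X
```

For this covariance, $\bm{\gamma}_{1}=(0.4,0.4)$ and $v_{1}=0.6$, so the resampled column is centered at $0.4\,\mathbf{X}_{0}+0.4\,\mathbf{X}_{2}$ (zero-based column indices) with noise variance $0.6$.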

Remark 1.

In Algorithm 6, the tuning parameter $\lambda$ is allowed to depend on $\mathbf{X}^{\top}\mathbf{X}$, $\mathbf{X}^{\top}\mathbf{Y}$, $\mathbf{Y}^{\top}\mathbf{Y}$ and $n$. We may also use the square-root Lasso or the Lasso-max importance statistic as outlined in Sections 3.3 and 3.4.

Appendix I Additional results for Section 4.4.2

To further demonstrate the effect of sample size on the new GhostKnockoffs methods in comparison to the individual-level knockoffs with (cross-validated) Lasso coefficient difference, we consider additional experiments with $p=600$ and $n=600/1800/3000$ under the same setting as Section 4.4.2. Note that the noise level scales in the order of $\sqrt{n}$ such that the signal-to-noise ratio does not change dramatically. From Figure 8, we observe that as $n$ increases, all three new methods proposed in this paper have comparable power with KF-lassocv and outperform GK-marginal [He et al., 2022] consistently, with FDR controlled in all cases.

Figure 8: Power and FDR plots for independent features and a Gaussian linear model with varying sample sizes and a fixed feature dimension. Each point shown is an average over 200 replications.

Appendix J Supplementary plots for Section 4.4.1

Let $\mathbf{Z}=\mathbf{X}^{\top}\mathbf{Y}$ and $\tilde{\mathbf{Z}}=\mathbf{P}^{\top}\mathbf{X}^{\top}\mathbf{Y}+\lVert\mathbf{Y}\rVert_{2}\mathbf{Z}$ be defined as in Algorithm 4. Note that for Algorithm 4 to control the FDR, it suffices to require that swapping the $j$-th entry of $\mathbf{Z}$ with the $j$-th entry of $\tilde{\mathbf{Z}}$ does not change the joint distribution of $(\mathbf{Z},\tilde{\mathbf{Z}})$ for each null $j$.

By the Central Limit Theorem, small sets of entries (e.g., single entries, pairs, and triples) of $(\mathbf{Z},\tilde{\mathbf{Z}})$ are approximately Gaussian. Additionally, in Figure 11, we show empirically that the covariance of $(\mathbf{Z},\tilde{\mathbf{Z}})$ (approximately) satisfies the required swap-invariance for null positions. These approximations, coupled with the robustness of the knockoff framework, empirically yield FDR control. This is similar to the robustness of second-order knockoffs observed empirically in Candès et al. [2018].

In the setting from Section 4.4.1, Figure 9 depicts the ordered empirical values of $Z_{j}$ (respectively $\tilde{Z}_{j}$) plotted against an equal-size ordered random sample from a Gaussian distribution whose mean and variance match the empirical mean and variance of $Z_{j}$ (respectively $\tilde{Z}_{j}$), for three randomly selected indices. This comparison is based on the 1000 sub-sampled data replications from Section 4.4.1. In Figure 10, we overlay the plots for all indices. We observe that $Z_{j}$ and $\tilde{Z}_{j}$ approximately follow Gaussian distributions. In Figure 11, we present the scatter plots of relevant empirical covariances. We observe that all the points roughly concentrate around the $y=x$ line. This shows the approximate swap-invariance of $\mathbf{Z}$ and $\tilde{\mathbf{Z}}$ (for null indices).

Figure 9: QQ plots of $Z_{j}$'s (left) and $\tilde{Z}_{j}$'s (right) against Gaussian samples with matching mean and variance for three randomly sampled indices.
Figure 10: QQ plots of $Z_{j}$'s (left) and $\tilde{Z}_{j}$'s (right) against Gaussian samples with matching mean and variance for all indices overlaid.
Figure 11: Scatter plots for relevant empirical covariances. Each correlation is estimated from 1000 samples drawn as described in Section 4.4.1.
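The covariance comparison discussed in this appendix can be sketched as follows: given replicated draws of $(\mathbf{Z},\tilde{\mathbf{Z}})$, compute the empirical covariance of the stacked vector before and after swapping the $j$-th entries and report the largest entrywise difference. The function name `swap_covariance_gap` and the synthetic draws are assumptions made for illustration; they are not the sub-sampled replications of Section 4.4.1.

```python
import numpy as np

def swap_covariance_gap(Z, Z_tilde, j):
    """Largest absolute change in the empirical covariance of (Z, Z~)
    when the j-th entries of Z and Z~ are swapped.

    Z, Z_tilde : arrays of shape (replications, p)
    """
    p = Z.shape[1]
    joint = np.hstack([Z, Z_tilde])        # columns: Z_1..Z_p, Z~_1..Z~_p
    swapped = joint.copy()
    swapped[:, [j, p + j]] = joint[:, [p + j, j]]
    cov_before = np.cov(joint, rowvar=False)
    cov_after = np.cov(swapped, rowvar=False)
    return np.max(np.abs(cov_before - cov_after))

# Synthetic illustration (hypothetical): index 1 is exchangeable by construction,
# while index 0 has an inflated variance in Z only, which breaks swap-invariance
rng = np.random.default_rng(4)
reps, p = 1000, 3
Z = rng.standard_normal((reps, p))
Z_tilde = rng.standard_normal((reps, p))
Z[:, 0] += 5.0 * rng.standard_normal(reps)
gap_null = swap_covariance_gap(Z, Z_tilde, j=1)
gap_signal = swap_covariance_gap(Z, Z_tilde, j=0)
```

A small gap for an index is the numerical counterpart of points lying close to the $y=x$ line in a scatter plot of covariances before and after the swap.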

Appendix K Details of the nine studies in Section 5

Section 5 considers the following nine studies for Alzheimer’s disease:

  1. The genome-wide association study performed by Huang et al. [2017].

  2. The genome-wide meta-analysis of clinically diagnosed AD and AD-by-proxy by Jansen et al. [2019].

  3. The genome-wide meta-analysis of clinically diagnosed AD by Kunkle et al. [2019].

  4. The genome-wide meta-analysis by Schwartzentruber et al. [2021].

  5. In-house genome-wide association study of 15,209 cases and 14,452 controls aggregating 27 cohorts across 39 SNP array data sets, imputed using the TOPMed reference panels [Belloy et al., 2022a].

  6. A whole-exome sequencing analysis of data from ADSP by Bis et al. [2020].

  7. A whole-exome sequencing analysis of data from ADSP by Le Guen et al. [2021].

  8. In-house whole-exome sequencing analysis of ADSP (6155 cases, 5418 controls).

  9. In-house whole-genome sequencing analysis of the 2021 ADSP release (3584 cases, 2949 controls) [Belloy et al., 2022b].

Appendix L Calculation of meta-analysis Z-score

Based on $Z$-scores $\mathbf{Z}_{1},\mathbf{Z}_{2},\ldots,\mathbf{Z}_{K}$ from $K$ studies, we adopt the definition in He et al. [2022] that the meta-analysis $Z$-score with overlapping samples is

$$\mathbf{Z}_{\text{meta}}=\mathbf{H}\sum_{k=1}^{K}w_{k}\mathbf{C}_{k}\mathbf{Z}_{k}.$$

Specifically,

  • optimal weights $w_{1},\ldots,w_{K}$ are obtained by solving the optimization problem

    $$\text{minimize }\sum_{k=1}^{K}\sum_{l=1}^{K}w_{k}w_{l}\,cor.S_{kl}\quad\text{subject to }\sum_{k=1}^{K}w_{k}\sqrt{n_{k}}=1,\ w_{1},\ldots,w_{K}\geq 0;$$

  • for $k=1,\ldots,K$, $\mathbf{C}_{k}=\text{diag}\{c_{k1},\ldots,c_{kp}\}$ is a diagonal matrix where $c_{kj}=1$ if the $Z$-score of the $j$-th variant is observed in the $k$-th study and $c_{kj}=0$ otherwise ($j=1,\ldots,p$);

  • $\mathbf{H}=\text{diag}\{h_{1},\ldots,h_{p}\}$ is a diagonal matrix where $h_{j}=\left(\sum_{k}\sum_{l}w_{k}w_{l}c_{kj}c_{lj}\,cor.S_{kl}\right)^{-1/2}$ ($j=1,\ldots,p$);

  • $cor.S_{kl}$ is the study correlation between the $k$-th study and the $l$-th study.

In practice, when calculating $cor.S_{kl}$, we only use variants whose $Z$-scores lie in $[-1.96, 1.96]$ in both the $k$-th study and the $l$-th study to eliminate the impact of polygenic effects. This meta-analysis approach is a generalization of the METAL method proposed by Willer et al. [2010].
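The quantities defined above can be sketched as follows. This is a minimal illustration assuming NumPy, not the authors' implementation: the function names `study_correlation` and `scaling_factors` are hypothetical, the weights `w` are taken as given (the quadratic program for the weights is not solved here), and every variant is assumed to be observed in at least one study with non-zero weight.

```python
import numpy as np

def study_correlation(z, observed, bound=1.96):
    """Estimate cor.S_kl from variants whose Z-scores lie in [-bound, bound] in both studies.

    z        : (K, p) array of per-study Z-scores
    observed : (K, p) 0/1 array holding the indicators c_kj
    """
    K = z.shape[0]
    cor_s = np.eye(K)
    for k in range(K):
        for l in range(k + 1, K):
            keep = ((observed[k] == 1) & (observed[l] == 1)
                    & (np.abs(z[k]) <= bound) & (np.abs(z[l]) <= bound))
            # Assumption of this sketch: pairs with too few shared variants keep correlation 0.
            if keep.sum() > 2:
                r = np.corrcoef(z[k, keep], z[l, keep])[0, 1]
                cor_s[k, l] = cor_s[l, k] = r
    return cor_s

def scaling_factors(w, observed, cor_s):
    """Return h_j = (sum_k sum_l w_k w_l c_kj c_lj cor.S_kl)^(-1/2) for every variant j."""
    a = w[:, None] * observed                      # (K, p) array with entries w_k * c_kj
    quad = np.einsum("kj,kl,lj->j", a, cor_s, a)   # quadratic form evaluated per variant
    return 1.0 / np.sqrt(quad)
```

For example, with two equally weighted studies (`w = [0.5, 0.5]`) and study correlation 0.2, a variant observed only in the first study gets $h_j = (0.25)^{-1/2} = 2$, while a variant observed in both gets $h_j = (0.6)^{-1/2}$.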

Appendix M Obtaining the covariance matrix $\mathbf{\Sigma}$ in meta-analysis for AD

To perform meta-analysis for AD, we need a suitable estimate of the covariance matrix $\mathbf{\Sigma}$. In this paper, we adopt the strategies in He et al. [2023] and Chu et al. [2023] as follows.

We first download the covariance matrix from the Pan-UKB consortium (https://pan.ukbb.broadinstitute.org), which contains about 24 million variants across the human genome derived from about 500,000 British samples. We then extract $p=650,576$ variants that satisfy the following three conditions: (a) the variant is recorded in the UK Biobank genotype array, (b) its MAF exceeds 0.01, (c) its reference/alternate allele pair matches the ones listed in all nine studies in the meta-analysis. Based on the covariance matrix of the extracted variants, we further partition them into 1703 quasi-independent blocks using the partition given by Berisa and Pickrell [2016]. Finally, we compute the block-diagonal covariance matrix

$$\mathbf{\Sigma}=\begin{bmatrix}\mathbf{\Sigma}_{1}&&\\ &\ddots&\\ &&\mathbf{\Sigma}_{1703}\end{bmatrix},$$

where $\mathbf{\Sigma}_l$ is the shrinkage estimator of the covariance matrix of variants in the $l$-th block, computed using the R package corpcor [Schäfer and Strimmer, 2005]. To ensure that all blocks $\mathbf{\Sigma}_1,\ldots,\mathbf{\Sigma}_{1703}$ are positive definite, we perform an eigen-decomposition of each block and raise every eigenvalue not larger than $10^{-5}$ to $10^{-5}$.
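The eigenvalue adjustment and the block-diagonal assembly described above can be sketched as follows. This is a minimal illustration assuming NumPy; the function names `floor_eigenvalues` and `block_diagonal` are hypothetical, and the shrinkage step (done with the R package corpcor in the paper) is not reproduced here.

```python
import numpy as np

def floor_eigenvalues(sigma_block, floor=1e-5):
    """Raise every eigenvalue of a symmetric block that is <= floor up to floor."""
    sym = (sigma_block + sigma_block.T) / 2.0      # guard against round-off asymmetry
    eigvals, eigvecs = np.linalg.eigh(sym)
    eigvals = np.maximum(eigvals, floor)           # eigenvalues above the floor are unchanged
    return (eigvecs * eigvals) @ eigvecs.T         # V diag(lambda) V^T

def block_diagonal(blocks):
    """Assemble Sigma = diag(Sigma_1, ..., Sigma_L) as a dense array (small examples only)."""
    p = sum(b.shape[0] for b in blocks)
    sigma = np.zeros((p, p))
    start = 0
    for b in blocks:
        stop = start + b.shape[0]
        sigma[start:stop, start:stop] = b
        start = stop
    return sigma
```

With $p = 650,576$ variants a dense $\mathbf{\Sigma}$ would be impractical to store, so in practice one would presumably keep the 1703 blocks separately; the dense assembly is shown only to make the structure explicit.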

Appendix N Supplementary tables of meta-analysis for AD

Tables 1 and 2 provide more details of the meta-analysis for AD in Section 5. Specifically, Table 1 presents the number of loci, the average number of signals per locus, the standard deviation of the number of signals per locus, the average number of groups per locus, and the standard deviation of the number of groups per locus identified by the conventional marginal association test, GK-marginal, GK-pseudolasso, and GK-susie-rss. Here, the $p$-value threshold of the conventional marginal association test is $5\times 10^{-8}$, and GK-pseudolasso uses the tuning parameter chosen by the lasso-min method in Section 4.2.2. For GK-marginal, GK-pseudolasso, and GK-susie-rss, we display results with respect to target FDR levels 0.05, 0.1, and 0.2. Table 2 provides details of the top variants of identified loci given by GK-pseudolasso (target FDR level: 0.20), including their positions (columns “Chr.” and “SNP”), their reference alleles (column “Ref.”) and alternative alleles (column “Alt.”), genes that they potentially regulate (column “TopS2GGene”), their closest genes, their $Z$-scores from different individual studies, their meta-analysis $Z$-scores, their feature importance scores ($W$), and the marginal $p$-values obtained from meta-analysis $Z$-scores.

Table 1: Summary of results by applying different methods on meta-analysis for AD

| Method | Target FDR level | Number of identified loci | Average signals per locus | SD of signals per locus | Average groups per locus | SD of groups per locus |
|---|---|---|---|---|---|---|
| Marginal association test | | 29 | 15.517 | 32.231 | 1.000 | 0.000 |
| GK-marginal | 0.05 | 3 | 107.667 | 97.027 | 3.667 | 3.055 |
| GK-marginal | 0.10 | 10 | 95.600 | 214.568 | 2.700 | 3.622 |
| GK-marginal | 0.20 | 17 | 76.176 | 174.714 | 2.412 | 3.318 |
| GK-pseudolasso | 0.05 | 30 | 21.500 | 44.323 | 2.333 | 4.722 |
| GK-pseudolasso | 0.10 | 42 | 17.214 | 38.019 | 2.024 | 4.015 |
| GK-pseudolasso | 0.20 | 63 | 15.889 | 47.074 | 1.794 | 3.561 |
| GK-susie-rss | 0.05 | 21 | 14.619 | 27.902 | 1.286 | 0.644 |
| GK-susie-rss | 0.10 | 35 | 12.000 | 23.506 | 1.257 | 0.657 |
| GK-susie-rss | 0.20 | 47 | 10.191 | 20.591 | 1.319 | 0.695 |

Table 2: Details of top variants of identified loci given by GK-pseudolasso (target FDR level: 0.20).
Columns: Chr. | SNP | Ref. | Alt. | TopS2GGene | Closest gene | $Z$-scores from individual studies (Study 1 through Study 9) | Meta-analysis $Z$-score | $W$ | Marginal $p$-value
1 20853688 C T EIF4G3 EIF4G3 2.72 3.93 3.29 4.29 2.89 0.68 1.65 4.95 2.516×10^{-3} 3.697×10^{-7}
1 200984367 A G KIF21B KIF21B -2.21 -3.73 -4.33 -3.60 -2.92 -0.67 -4.36 1.837×10^{-3} 6.490×10^{-6}
1 207611623 A G CR1 CR1 -4.84 -8.81 -7.97 -9.65 -6.37 -3.96 -10.97 1.228×10^{-2} 2.802×10^{-28}
2 37270395 G A NDUFAF7 1.50 4.02 3.73 4.03 2.05 4.62 1.998×10^{-3} 1.953×10^{-6}
2 44026309 T C LRPPRC -1.23 -3.80 -1.88 -3.55 -2.26 -3.64 -4.37 2.089×10^{-3} 6.208×10^{-6}
2 65409567 G A SPRED2 -3.93 -2.41 -4.05 -0.23 0.15 -4.44 1.993×10^{-3} 4.538×10^{-6}
2 105805908 T C NCK2 0.10 -3.94 -2.80 -4.67 -2.08 -4.72 2.490×10^{-3} 1.185×10^{-6}
2 127136908 A T BIN1 BIN1 3.77 10.94 8.68 11.95 8.74 4.90 13.36 1.141×10^{-2} 5.466×10^{-41}
2 233117202 G C NGEF INPP5D 2.03 6.15 5.16 6.42 2.29 2.40 7.21 4.762×10^{-3} 2.826×10^{-13}
3 136105288 G A SLC35G2 PPP2R3A -1.24 -3.66 -2.03 -4.64 -1.98 -3.04 -4.84 2.250×10^{-3} 6.607×10^{-7}
4 11024404 A G CLNK -2.47 -6.00 -4.06 -6.50 -4.32 -2.52 -7.32 5.039×10^{-3} 1.275×10^{-13}
4 71303158 G A SLC4A4 -2.60 -3.77 -3.65 -3.34 -1.49 -1.53 -4.28 1.817×10^{-3} 9.466×10^{-6}
4 112082387 A C FAM241A 2.47 4.82 1.76 3.36 0.40 -0.71 4.68 2.478×10^{-3} 1.418×10^{-6}
4 143428212 C T GAB1 -3.68 -2.84 -3.95 -1.56 -1.36 -4.37 2.017×10^{-3} 6.081×10^{-6}
4 158808801 G A RAPGEF2 FNIP2 -2.43 -3.95 -2.39 -3.82 -2.40 -1.91 -4.66 1.894×10^{-3} 1.553×10^{-6}
5 4068226 C T IRX1 4.53 1.20 3.62 0.34 -0.91 4.50 2.144×10^{-3} 3.323×10^{-6}
5 14707491 C T ANKH ANKH -3.92 -3.20 -3.95 -4.36 -3.60 0.60 -4.66 2.480×10^{-3} 1.602×10^{-6}
5 86923485 A G LINC02059 2.49 4.70 2.45 3.76 3.49 2.49 5.12 3.059×10^{-3} 1.517×10^{-7}
5 177559423 G A RAB24 FAM193B 1.96 3.85 3.99 4.16 2.48 1.38 4.71 2.313×10^{-3} 1.248×10^{-6}
5 179373099 C T ADAMTS2 1.52 2.54 4.29 4.80 3.12 2.35 4.36 1.938×10^{-3} 6.512×10^{-6}
6 935171 T C LINC01622 -2.80 -3.20 -3.33 -4.55 -3.37 -2.17 -4.75 2.380×10^{-3} 1.040×10^{-6}
6 32686937 T C HLA-DQA2 HLA-DQB1 -3.88 -6.46 -4.86 -7.53 -2.29 -1.10 -8.13 4.461×10^{-3} 2.090×10^{-16}
6 41066261 G C OARD1 OARD1 2.69 3.78 6.91 7.12 4.06 6.37 5.558×10^{-3} 9.364×10^{-11}
6 47484147 C T CD2AP CD2AP 2.95 5.74 5.21 6.10 5.33 2.24 7.05 3.271×10^{-3} 8.942×10^{-13}
7 1543652 A G TMEM184A MAFK 2.33 4.06 2.93 3.64 2.36 0.33 4.54 1.810×10^{-3} 2.868×10^{-6}
7 37842715 G A NME8 2.95 4.15 3.81 3.74 3.20 1.13 4.79 2.045×10^{-3} 8.230×10^{-7}
7 100406823 C T ZCWPW1 4.25 7.53 4.01 8.41 5.04 3.59 1.29 9.35 8.987×10^{-3} 4.266×10^{-21}
7 143410495 G T EPHA1-AS1 EPHA1 1.19 6.56 4.37 6.81 2.70 1.63 7.52 3.795×10^{-3} 2.751×10^{-14}
8 27362470 C T PTK2B PTK2B 3.84 6.79 6.12 7.94 5.19 2.12 8.70 4.345×10^{-3} 1.668×10^{-18}
8 95041772 C T NDUFAF6 4.06 3.96 4.03 4.50 2.81 0.36 5.17 2.207×10^{-3} 1.172×10^{-7}
8 97359646 A G SNORD3H 2.70 3.01 3.70 3.99 2.42 1.30 4.25 1.767×10^{-3} 1.067×10^{-5}
8 102564430 G A ODF1 1.72 4.00 2.66 3.53 1.42 -0.48 4.29 1.855×10^{-3} 8.825×10^{-6}
8 111515902 C T LINC02237 4.13 0.44 3.77 -0.41 4.40 2.051×10^{-3} 5.387×10^{-6}
8 144042819 T C PARP10 SPATC1 0.17 4.66 2.47 4.57 3.68 5.16 3.389×10^{-3} 1.210×10^{-7}
10 29966853 G A JCAD 3.72 2.05 4.56 1.31 0.48 4.68 2.501×10^{-3} 1.443×10^{-6}
10 42722997 T C LOC283028 0.39 4.79 2.57 4.34 1.14 0.25 5.02 2.128×10^{-3} 2.616×10^{-7}
10 59962515 T G LINC01553 1.43 3.63 3.30 5.14 3.48 3.17 5.18 2.031×10^{-3} 1.130×10^{-7}
10 80494228 C T TSPAN14 TSPAN14 3.23 3.22 4.17 5.83 2.03 5.35 2.041×10^{-3} 4.295×10^{-8}
11 60254475 G A MS4A4E -5.74 -7.97 -8.27 -9.09 -6.66 -3.32 -10.30 8.499×10^{-3} 3.570×10^{-25}
11 65888811 G A FIBP FIBP -2.13 -4.59 -1.22 -3.57 -1.62 -0.38 -4.74 2.589×10^{-3} 1.070×10^{-6}
11 86156833 A G PICALM PICALM 6.78 8.67 8.07 10.55 5.11 3.08 11.50 1.074×10^{-2} 6.418×10^{-31}
11 121578263 T C SORL1 -3.10 -4.40 -3.82 -5.59 -3.38 -0.52 -5.90 3.920×10^{-3} 1.768×10^{-9}
13 43679792 C T ENOX1 0.19 3.79 1.22 4.28 0.01 -1.03 4.30 1.865×10^{-3} 8.441×10^{-6}
13 93594511 A T GPC6-AS2 -0.04 -1.09 -0.85 -2.34 -0.57 -0.62 7.282×10^{-2} 2.672×10^{-1}
14 32478306 T C AKAP6 AKAP6 -1.45 -4.35 -1.77 -3.63 -0.28 0.77 -4.44 1.869×10^{-3} 4.449×10^{-6}
14 52924962 A G FERMT2 4.68 4.58 4.97 6.27 2.90 1.32 6.58 4.682×10^{-3} 2.429×10^{-11}
14 92470949 C T SLC24A4 -3.83 -6.10 -5.16 -6.67 -2.90 -2.58 -7.57 4.647×10^{-3} 1.836×10^{-14}
15 50735410 C T HDC SPPL2A -3.16 -4.81 -4.09 -6.02 -2.45 0.09 -6.29 5.133×10^{-3} 1.547×10^{-10}
15 58753575 A G ADAM10 -2.86 -5.90 -4.16 -5.97 -2.81 -2.16 -6.94 3.385×10^{-3} 1.910×10^{-12}
15 63277703 C T APH1B APH1B 1.20 5.52 3.68 5.72 2.58 2.46 1.61 0.98 2.05 6.45 3.285×10^{-3} 5.482×10^{-11}
16 31120929 A G KAT8 KAT8 -2.28 -5.50 -2.72 -5.84 -2.89 -1.45 -6.56 3.913×10^{-3} 2.702×10^{-11}
17 5233752 G A SCIMP SCIMP 3.30 6.04 3.82 5.48 1.93 2.40 6.79 3.297×10^{-3} 5.560×10^{-12}
17 7581494 G A CD68 LOC100996842 -1.82 -3.60 -1.57 -3.49 -3.37 -1.95 -1.61 -2.72 -3.18 -4.42 1.933×10^{-3} 4.941×10^{-6}
17 49219935 T C ABI3 ABI3 -4.94 -4.75 -2.68 0.20 -2.61 -5.25 2.982×10^{-3} 7.430×10^{-8}
17 58331728 G C BZRAP1 MIR142 -1.00 -4.94 -5.09 -5.12 -3.81 -1.35 -5.75 3.909×10^{-3} 4.412×10^{-9}
17 63482562 C T ACE ACE 2.73 5.07 3.54 5.25 3.92 1.93 2.67 2.09 2.45 6.32 5.299×10^{-3} 1.268×10^{-10}
19 1058177 A G ABCA7 -0.93 -4.61 -2.73 -4.94 -3.96 -1.16 -1.48 -0.38 0.52 -5.45 4.973×10^{-3} 2.534×10^{-8}
19 6876985 T C VAV1 ADGRE1 1.05 3.04 3.58 4.42 1.59 0.42 4.21 2.119×10^{-3} 1.254×10^{-5}
19 44888997 C T PVRL2 NECTIN2 20.83 51.85 53.66 8.573 0.000
19 51224706 C A CD33 CD33 -3.40 -5.84 -5.09 -5.69 -3.76 -3.97 -6.96 4.936×10^{-3} 1.696×10^{-12}
19 54664811 A G LILRB4 LILRB4 -2.61 -3.61 -3.13 -3.89 -1.05 0.54 -4.37 1.958×10^{-3} 6.300×10^{-6}
20 56409712 G T CASS4 CASS4 -3.82 -5.84 -4.56 -6.07 -5.14 -7.12 6.582×10^{-3} 5.526×10^{-13}
21 26775872 C T ADAMTS1 ADAMTS1 -1.60 -2.90 -5.17 -5.54 -3.39 -0.22 -4.87 2.469×10^{-3} 5.668×10^{-7}

Appendix O Supplementary figures of meta-analysis for AD

Analogous to Figure 6 in Section 5, Figures 12, 13, and 14 respectively present Manhattan plots of the meta-analysis of the nine studies via the conventional marginal association test (with $p$-value cutoff $5\times 10^{-8}$), GK-marginal (with target FDR level 0.10), and GK-susie-rss (with target FDR level 0.10).

Figure 12: Graphical illustration of the result of applying the conventional marginal association test to the meta-analysis for AD. The dotted line represents the conventional genome-wide $p$-value threshold of $5\times 10^{-8}$. $P$-values are truncated at $10^{-50}$ for better visualization. The results are obtained from the meta-analysis $p$-values calculated based on Section L. Variant density is shown at the bottom of the plot (number of variants per 1Mb).
Figure 13: Graphical illustration of the result of applying GK-marginal to the meta-analysis for AD. Each point represents a group of genetic variants. With respect to the target FDR level 0.1, points of identified groups are highlighted in blue or purple. For each locus with at least one identified group, the name of the locus is presented at the variant group with the largest importance statistic (highlighted in purple). Variant density is shown at the bottom of the plot (number of variants per 1Mb).
Figure 14: Graphical illustration of the result of applying GK-susie-rss to the meta-analysis for AD. Each point represents a group of genetic variants. With respect to the target FDR level 0.1, points of identified groups are highlighted in blue or purple. For each locus with at least one identified group, the name of the locus is presented at the variant group with the largest importance statistic (highlighted in purple). Variant density is shown at the bottom of the plot (number of variants per 1Mb).

The conventional marginal association test selects many feature groups because it focuses on marginal correlations between feature groups and the response while ignoring spurious correlation induced by linkage disequilibrium. This is shown in Figure 12, where the conventional marginal association test tends to select many nearby loci. This issue is alleviated by the GhostKnockoffs approach that tests conditional independence as seen in Figures 6, 13, and 14.

Appendix P Running Lasso on binary responses

In genetic datasets, the response $Y$ is often binary. Performing Lasso or Lasso-type regressions on a binary response may sound unreasonable since it violates the usual linear model assumption. One might assume that using penalized logistic regression to generate feature importance statistics would be much more effective. Somewhat surprisingly, the following two simulations show that this intuition may not be correct.

For the first column of Figure 15, we generate $X_i \stackrel{iid}{\sim} \mathcal{N}(\mathbf{0}, \frac{1}{\sqrt{n}}\mathbf{I}_p)$, and, conditional on $X_i$, $\mathbb{P}(Y_i=1)=\frac{1}{1+e^{-\bm{\beta}^{\top}X_i}}$ and $\mathbb{P}(Y_i=0)=1-\mathbb{P}(Y_i=1)$, where $n=1000$ and $p=300$. We create $\bm{\beta}$ by selecting 30 coordinates uniformly at random to be non-zero. The signs of these non-zero coordinates are assigned to be either positive or negative with equal probability. The dark curve represents the knockoffs procedure with the Lasso coefficient difference statistic (with tuning parameter chosen by cross-validation), i.e., KF-lassocv. The red curve represents the knockoffs procedure with the coefficient difference statistic generated by $L_1$-penalized logistic regression.
We vary the signal amplitudes such that we observe relatively complete power profiles below. The target FDR is $0.1$. Each point on the curves represents an average over 200 replications. For the second column of Figure 15, we show the result for AR(1) features. Here, $n=600$, $p=200$ and the signal amplitude (i.e., the magnitude of non-zero $\beta$ values) is fixed to be 0.5. Otherwise, the simulation setting is exactly the same as the independent case. We observe that the two methods considered have almost the same power and FDR, so the use of penalized logistic regression does not meaningfully affect the results.
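The data-generating step of the first simulation (independent features) can be sketched as follows. This is a minimal illustration assuming NumPy, not the authors' code: the function name `simulate_logistic`, the default `amplitude`, and the seed are hypothetical, and the second argument of $\mathcal{N}(\cdot,\cdot)$ is read as a covariance, so each feature has standard deviation $n^{-1/4}$. The knockoff construction and the two importance statistics are not reproduced.

```python
import numpy as np

def simulate_logistic(n=1000, p=300, n_signals=30, amplitude=1.0, seed=0):
    """Draw (X, Y, beta) from the logistic model with independent Gaussian features."""
    rng = np.random.default_rng(seed)
    # Covariance (1/sqrt(n)) * I_p as written in the text, i.e. standard deviation n^(-1/4).
    x = rng.normal(scale=n ** -0.25, size=(n, p))
    beta = np.zeros(p)
    # Pick the non-zero coordinates uniformly at random, without replacement.
    support = rng.choice(p, size=n_signals, replace=False)
    # Each non-zero coordinate is +amplitude or -amplitude with equal probability.
    beta[support] = amplitude * rng.choice([-1.0, 1.0], size=n_signals)
    prob = 1.0 / (1.0 + np.exp(-(x @ beta)))       # P(Y_i = 1 | X_i)
    y = rng.binomial(1, prob)
    return x, y, beta
```

Calling `simulate_logistic(amplitude=a)` over a grid of hypothetical amplitudes `a` would give the inputs for one replication at each point of a power curve.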

Figure 15: Power and FDR plots when the response is generated by a logistic regression model. Each point is an average over 200 replications.