Controlled Variable Selection from Summary Statistics Only?
A Solution via GhostKnockoffs and Penalized Regression

Zhaomeng Chen Zihuai He Department of Neurology and Neurological Sciences, Stanford University Benjamin B. Chu Department of Biomedical Data Science, Stanford University Jiaqi Gu Department of Neurology and Neurological Sciences, Stanford University Tim Morrison Department of Statistics, Stanford University Chiara Sabatti Department of Statistics, Stanford University Department of Biomedical Data Science, Stanford University Emmanuel Candès Department of Statistics, Stanford University Department of Mathematics, Stanford University

Abstract

Identifying which variables do influence a response while controlling false positives pervades statistics and data science. In this paper, we consider a scenario in which we only have access to summary statistics, such as the values of marginal empirical correlations between each explanatory variable of potential interest and the response. This situation may arise due to privacy concerns, e.g., to avoid the release of sensitive genetic information. We extend GhostKnockoffs (He et al., 2022) and introduce variable selection methods based on penalized regression achieving false discovery rate (FDR) control. We report empirical results in extensive simulation studies, demonstrating enhanced performance over previous work. We also apply our methods to genome-wide association studies of Alzheimer’s disease, and evidence a significant improvement in power.

* Equal contribution.

Keywords— Variable selection, replicability, summary statistics, false discovery rate (FDR), knockoffs, genome-wide association study (GWAS), pseudo-lasso

1 Introduction

1.1 Background and contributions

Modern large-scale studies frequently involve a multitude of explanatory variables potentially associated with an outcome we would like to better understand. Oftentimes, the goal is to select those explanatory variables that are meaningfully associated with the response variable. For instance, with recent advances in genome sequencing technologies and genotype imputation techniques, one can now gather tens of millions of variants from hundreds of thousands of samples in large-scale genetic studies, with the aim of pinpointing which genetic variants are biologically associated with specific diseases. This information could provide mechanistic insights and potentially aid the development of targeted drugs. In statistics, this challenge is typically framed as a multiple testing problem. Further, due to the sheer number of hypotheses considered and the cost of following false leads, it is generally required to control some form of error rate on the false positives.

In this paper, we focus on controlling the false discovery rate (FDR), which is the expected proportion of false selections among all selected variables. Compared to the more stringent familywise error rate (FWER) control, keeping the FDR under a nominal level allows for more discoveries while maintaining a reasonable statistical guarantee on the rate of false positives. Several methods for FDR control have been proposed in the literature, with the Benjamini-Hochberg procedure being particularly popular (Benjamini and Hochberg, 1995). However, these approaches often assume a parametric model or the existence of valid $p$-values, whose construction remains difficult, and even problematic, in high-dimensional settings.

Candès et al. (2018) proposed model-X knockoffs, a broad and flexible framework which allows the statistician to select variables that retain dependence with the response conditional on all other covariates while maintaining FDR control. Model-X knockoffs differs from previous approaches in that (1) it makes no modeling assumptions on the distribution of the response $Y$ we wish to study conditional on the family of covariates $X$, and (2) it does not require the construction of valid $p$-values. Instead, the crucial assumption is that the distribution of $X$ is known. The main idea in Candès et al. (2018) is to generate fake variables $\widetilde{X}$, called knockoffs, which we can view as negative controls and which can be used to tease apart variables that do influence the response from those that do not. Model-X knockoffs has proved effective in a number of real-world applications, particularly in GWAS; see Bates et al. (2020), Sesia et al. (2021) and He et al. (2022) for examples.

To deploy model-X knockoffs, researchers must have in hand the covariates and responses from all samples. However, in certain situations, individual-level data that may reveal sensitive personal information is not readily accessible. For example, due to privacy concerns, many GWAS studies only publish summary statistics of the original data (Pasaniuc and Price, 2017). Yet in such cases, we would still like to develop controlled variable selection methods that rely solely on summary statistics. In genetic studies, this would enable us to utilize available summary data from different data centers to conduct meta-analysis, enhancing the effective sample size and improving variable selection power. On this front, He et al. (2022) proposed the framework of GhostKnockoffs, which implements the knockoffs procedure with the marginal correlation difference feature importance statistic directly from summary statistics. As we shall review next, the main idea is to generate knockoff $Z$-scores directly without creating knockoff variables; all that is needed are marginal correlations between the response and the features under study. In detail, with $n$ being the sample size and $p$ the number of variables being assayed, the method operates with only $\mathbf{X}^{\top}\mathbf{Y}$ and $\lVert\mathbf{Y}\rVert_{2}^{2}$, where $\mathbf{X}$ is the $n\times p$ matrix of covariates, and $\mathbf{Y}$ is the $n\times 1$ response vector.

In this paper, we extend the family of GhostKnockoffs methods to incorporate feature importance statistics obtained from penalized regression. We first consider in Section 3 the situation in which the empirical covariance of the covariate-response pair $(X,Y)$ is available; with the above notation, this means that the summary statistics $\mathbf{X}^{\top}\mathbf{X}$ , $\mathbf{X}^{\top}\mathbf{Y}$ , $\lVert\mathbf{Y}\rVert_{2}^{2}$ are available along with the sample size $n$ . Unsurprisingly, we observe substantial power improvement over the method of He et al. (2022) because we can now employ far more effective test statistics. Next, in Section 4, we consider the case where the empirical covariance $\mathbf{X}^{\top}\mathbf{X}$ of the features is not available. There, we propose new imputation methods that consistently outperform He et al. (2022) in comprehensive synthetic and semi-synthetic simulations and rigorously control the FDR under suitable conditions. Finally, in Section 5 we apply our methods to a meta-analysis of nine large-scale array-based genome-wide association and whole-exome/-genome sequencing studies of Alzheimer’s disease, in which our methods yield more discoveries than He et al. (2022). We note that existing work in the genetics literature has implemented variable selection methods based on penalized regression with summary statistics, e.g., Mak et al. (2017) and Zou et al. (2022). However, none of these provide any guarantee of FDR control. In fact, as we note in the main text, these methods can be leveraged in our approach to create knockoffs versions that do control the FDR.

1.2 Code availability and reproducibility

The software and example code that reproduce the results presented in this paper can be found at https://github.com/biona001/ghostknockoff-gwas-reproducibility/tree/main/chen_et_al. Simulation results in Section 3.5, Section 4.4.2 and Section 4.4.3 can be exactly reproduced. Due to data accessibility issues, we only provide code without real data for Section 4.4.1 and Section 5.

2 Model-X Knockoffs and GhostKnockoffs

To begin with, we define the controlled variable selection problem and give a brief review of model-X knockoffs and GhostKnockoffs. For a more detailed exposition, we refer readers to Candès et al. (2018), Barber and Candès (2015), and He et al. (2022). In the following, we use boldface letters for vectors and matrices. (As an exception, we use $X$, $\widetilde{X}$, and $Y$ to represent generic covariates, their knockoffs, and the response.) We use $\mathbf{X}_{j}\in\mathbb{R}^{n}$ and $\mathbf{x}_{i}\in\mathbb{R}^{p}$ to respectively represent the $j$th column and $i$th row of the covariate matrix $\mathbf{X}$.

2.1 Problem statement

Given covariates $X\in\mathbb{R}^{p}$ and a response $Y\in\mathbb{R}$, we are interested in understanding which variables influence $Y$. We formulate this selection problem as testing the conditional independence hypotheses $\mathcal{H}_{0}^{j}:X_{j}\perp\!\!\!\perp Y\mid X_{-j}$ for $1\leq j\leq p$, where $X_{-j}$ is a shorthand for all the variables except the $j$th; that is, $X_{-j}=\{X_{1},...,X_{j-1},X_{j+1},...,X_{p}\}$. In words, we should reject $\mathcal{H}_{0}^{j}$ if we believe that $X_{j}$ can help better predict the outcome than if we only had available the values of all the other variables. Put differently, $X_{j}$ has information about $Y$ which cannot be subsumed by the information contained in all the other variables. By conditioning on $X_{-j}$, these hypothesis tests aim to weed out variables whose relationship to $Y$ is driven by residual correlations with other covariates.

Let $\mathcal{H}_{0}\subset[p]$ be the set of indices for which the null conditional independence hypothesis $\mathcal{H}_{0}^{j}$ is true, and let ${\mathcal{S}}\subset[p]$ be the set of indices of the hypotheses rejected by a selection procedure. The false discovery rate (FDR) is the expected fraction of false positives among the selected, defined as

\text{FDR}:=\mathbb{E}\left[\frac{|\mathcal{S}\cap\mathcal{H}_{0}|}{|\mathcal{S}|}\right]

with the convention that $0/0=0$ . Our goal is to make as many rejections as possible while controlling the FDR below a user-specified level $q$ .

In this paper, we consider the setting in which, instead of observing i.i.d. samples from the distribution of ( $X,Y$ ), we only have some summary statistics of the i.i.d. samples. In particular, we will show how one can, quite remarkably, perform tests of conditional independence when we do not directly observe the i.i.d. samples. Throughout this paper, we assume that $X\sim\mathcal{N}(\mathbf{0},\mathbf{\Sigma})$ where $\mathbf{\Sigma}$ is known (or, in practice, can be estimated).

2.2 Model-X knockoffs

2.2.1 The procedure

Suppose we observe $n$ i.i.d. samples $(X_{i},Y_{i})$, $1\leq i\leq n$, arranged in a data matrix $\mathbf{X}\in\mathbb{R}^{n\times p}$ and response vector $\mathbf{Y}\in\mathbb{R}^{n}$. In the model-X knockoffs framework (Candès et al., 2018), we assume we know the distribution $P_{X}$ of the covariates $X$ while having no knowledge of the conditional distribution $Y\mid X$. The model-X approach is well-suited to genetic applications where reference panels may be available to estimate $P_{X}$ or where we have good models of linkage disequilibrium.

To implement model-X knockoffs, we first generate a matrix $\widetilde{\mathbf{X}}\in\mathbb{R}^{n\times p}$ of knockoffs such that the following two conditions hold:

	(Exchangeability):	$\displaystyle\;(\mathbf{X}_{j},\widetilde{\mathbf{X}}_{j},\mathbf{X}_{-j},\widetilde{\mathbf{X}}_{-j})\stackrel{d}{=}(\widetilde{\mathbf{X}}_{j},\mathbf{X}_{j},\mathbf{X}_{-j},\widetilde{\mathbf{X}}_{-j}),\;\forall\;1\leq j\leq p$		(1)
	(Conditional independence):	$\displaystyle\;\widetilde{\mathbf{X}}\perp\!\!\!\perp\mathbf{Y}\mid\mathbf{X}.$		(2)

Roughly, the first says that we cannot distinguish between $[\mathbf{X}\;\widetilde{\mathbf{X}}]$ and $[\mathbf{X}\;\widetilde{\mathbf{X}}]_{\text{swap}(j)}$ , where $[\mathbf{X}\;\widetilde{\mathbf{X}}]_{\text{swap}(j)}$ is obtained from $[\mathbf{X}\;\widetilde{\mathbf{X}}]$ by swapping the $j$ th and $(j+p)$ th columns. The second condition implies that $\widetilde{\mathbf{X}}$ does not provide any new information about $Y$ conditional on $X$ and is guaranteed if $\widetilde{\mathbf{X}}$ is constructed without looking at $\mathbf{Y}$ . If these properties hold, it can be shown that $\mathbf{X}_{j}$ and $\widetilde{\mathbf{X}}_{j}$ are indistinguishable conditional on $\mathbf{Y}$ for each $j\in\mathcal{H}_{0}$ .

Next, we define feature importance statistics $\mathbf{W}=w([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y})\in\mathbb{R}^{p}$ to be any function of $\mathbf{X}$, $\widetilde{\mathbf{X}}$ and $\mathbf{Y}$ such that a flip-sign property holds; namely, switching a column $\mathbf{X}_{j}$ with its knockoff $\widetilde{\mathbf{X}}_{j}$ flips the sign of the $j$th component of the output; formally, $w_{j}([\mathbf{X},\widetilde{\mathbf{X}}]_{\text{swap}(j)},\mathbf{Y})=-w_{j}([\mathbf{X},\widetilde{\mathbf{X}}],\mathbf{Y})$. Common choices include $W_{j}=|\mathbf{X}_{j}^{\top}\mathbf{Y}|-|\widetilde{\mathbf{X}}_{j}^{\top}\mathbf{Y}|$ (marginal correlation difference statistic) and $W_{j}=|\hat{\beta}_{j}(\lambda_{\text{CV}})|-|\hat{\beta}_{j+p}(\lambda_{\text{CV}})|$ (Lasso coefficient difference statistic), where $\hat{\bm{\beta}}(\lambda_{\text{CV}})$ is the solution to the Lasso problem

\operatorname*{arg\,min}_{\bm{\beta}\in\mathbb{R}^{2p}}\frac{1}{2}||\mathbf{Y}-[\mathbf{X}\;\widetilde{\mathbf{X}}]\bm{\beta}||_{2}^{2}+\lambda_{\text{CV}}||\bm{\beta}||_{1},

and $\lambda_{\text{CV}}$ is usually chosen by cross-validation.

Finally, the knockoff filter selects the variables $\mathcal{S}=\{j:W_{j}\geq T\}$ , where

T=\min\left\{t\in\mathcal{W}:\frac{1+\#\{j:\;W_{j}\leq-t\}}{\#\{j:\;W_{j}\geq t\}\vee 1}\leq q\right\}.

(3)

Here, $\mathcal{W}=\{|W_{j}|:j=1,...,p\}\backslash\{0\}$ , and $T=+\infty$ if $\mathcal{W}$ is empty. Intuitively, the threshold $T$ is chosen to be the most liberal one such that an estimate of FDP is bounded by $q$ . Candès et al. (2018) showed that this procedure controls the FDR of the conditional testing problem at level $q$ .
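As an illustration, the threshold in (3) and the resulting selection set can be computed with a few lines of code. The following is a minimal sketch in plain Python; the function names `knockoff_threshold` and `knockoff_select` are ours and are not taken from the accompanying software.

```python
import math

def knockoff_threshold(W, q):
    """Threshold T of equation (3): the smallest t in {|W_j| : W_j != 0}
    whose estimated FDP is at most q; +inf if no such t exists."""
    candidates = sorted({abs(w) for w in W if w != 0})
    for t in candidates:
        n_neg = sum(1 for w in W if w <= -t)   # proxy for the number of false positives
        n_pos = sum(1 for w in W if w >= t)    # number of selections at threshold t
        if (1 + n_neg) / max(n_pos, 1) <= q:
            return t
    return math.inf

def knockoff_select(W, q):
    """Indices (0-based) of the selected variables {j : W_j >= T}."""
    T = knockoff_threshold(W, q)
    return [j for j, w in enumerate(W) if w >= T]

W = [5.0, 4.0, 3.0, -1.0, 2.5, 0.5, -0.2, 6.0, 3.5, 2.0]
print(knockoff_threshold(W, 0.2), knockoff_select(W, 0.2))
```

In this toy example with $q=0.2$, the candidate thresholds $0.2$, $0.5$ and $1$ give estimated FDPs of $3/8$, $2/8$ and $2/7$, all above $q$, while $t=2$ gives $1/7$, so $T=2$ and seven variables are selected.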

2.2.2 Gaussian knockoff sampler

Under the assumption that the rows of the data matrix $\mathbf{X}$ are i.i.d. from the Gaussian distribution $\mathcal{N}(\mathbf{0},\mathbf{\Sigma})$, we can generate a knockoff vector $\widetilde{\mathbf{x}}_{i}$ for each row $\mathbf{x}_{i}$ of the data matrix $\mathbf{X}$ by sampling $\widetilde{\mathbf{x}}_{i}\sim\mathcal{N}(\mathbf{P}^{\top}\mathbf{x}_{i},\mathbf{V})$ independently across rows, where $\mathbf{P}=\mathbf{I}-\mathbf{\Sigma}^{-1}\mathbf{D}$, $\mathbf{V}=2\,\mathbf{D}-\mathbf{D}\mathbf{\Sigma}^{-1}\mathbf{D}$, $\mathbf{D}=\text{diag}\{\mathbf{s}\}$, and $\mathbf{s}\in\mathbb{R}^{p}$ is a vector of free parameters usually obtained by solving a convex optimization problem that depends on $\mathbf{\Sigma}$ (Candès et al., 2018). See Appendix A for details of computing $\mathbf{s}$. Concatenating all the knockoff vectors then gives a valid matrix $\widetilde{\mathbf{X}}\in\mathbb{R}^{n\times p}$ of knockoffs. In matrix form, the construction above is

\widetilde{\mathbf{X}}=\mathbf{X}\mathbf{P}+\mathbf{E}\mathbf{V}^{1/2},

(4)

where $\mathbf{E}$ is an $n$ by $p$ matrix with i.i.d. standard Gaussian entries, independent of $\mathbf{X}$ and $\mathbf{Y}$ . For later reference, we summarize the Gaussian knockoff sampler in Algorithm 1 and denote it as $\mathcal{G}$ .

Algorithm 1 Gaussian Knockoff Sampler $\mathcal{G}$

1: Input: $\mathbf{X}$ and $\mathbf{\Sigma}$.

2: Compute $\mathbf{s}$ by solving a convex optimization problem as defined in (15).

3: Compute $\mathbf{D}=\text{diag}\{\mathbf{s}\}$, $\mathbf{P}=\mathbf{I}-\mathbf{\Sigma}^{-1}\mathbf{D}$, and $\mathbf{V}=2\,\mathbf{D}-\mathbf{D}\mathbf{\Sigma}^{-1}\mathbf{D}$.

4: Simulate $\mathbf{E}\in\mathbb{R}^{n\times p}$ whose entries are i.i.d. standard Gaussian variables.

5: Output: $\widetilde{\mathbf{X}}=\mathbf{X}\mathbf{P}+\mathbf{E}\mathbf{V}^{1/2}$.
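Steps 3 to 5 of Algorithm 1 can be sketched in NumPy as follows. This is an illustration under stated assumptions and not the implementation used in the paper: the vector $\mathbf{s}$ is taken as given (the example uses the simple equi-correlated choice $s_{j}=\min\{1,2\lambda_{\min}(\mathbf{\Sigma})\}$ for a correlation matrix $\mathbf{\Sigma}$, rather than the optimization problem of Appendix A), and the helper names are ours.

```python
import numpy as np

def knockoff_matrices(Sigma, s):
    """Return (P, V) from step 3 of Algorithm 1 for a given vector s."""
    D = np.diag(s)
    Sigma_inv_D = np.linalg.solve(Sigma, D)      # Sigma^{-1} D without forming the inverse
    P = np.eye(Sigma.shape[0]) - Sigma_inv_D
    V = 2.0 * D - D @ Sigma_inv_D
    return P, (V + V.T) / 2.0                    # symmetrize against round-off

def psd_sqrt(V):
    """Symmetric square root of a positive semidefinite matrix."""
    vals, vecs = np.linalg.eigh(V)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def gaussian_knockoffs(X, Sigma, s, rng):
    """Steps 4 and 5: X_tilde = X P + E V^{1/2} with E i.i.d. standard Gaussian."""
    P, V = knockoff_matrices(Sigma, s)
    E = rng.standard_normal(X.shape)
    return X @ P + E @ psd_sqrt(V)

# Toy example with an AR(1) correlation matrix (illustrative values only).
rng = np.random.default_rng(0)
p, n, rho = 4, 50, 0.5
idx = np.arange(p)
Sigma = rho ** np.abs(np.subtract.outer(idx, idx))
s = np.full(p, min(1.0, 2.0 * np.linalg.eigvalsh(Sigma)[0]))  # assumed equi-correlated s
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
X_tilde = gaussian_knockoffs(X, Sigma, s, rng)
```

One can check the construction algebraically: $\mathbf{P}^{\top}\mathbf{\Sigma}\mathbf{P}+\mathbf{V}=\mathbf{\Sigma}$ and $\mathbf{\Sigma}\mathbf{P}=\mathbf{\Sigma}-\mathbf{D}$, so each row of $[\mathbf{X}\;\widetilde{\mathbf{X}}]$ has covariance blocks $\mathbf{\Sigma}$ on the diagonal and $\mathbf{\Sigma}-\mathbf{D}$ off the diagonal.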

2.3 GhostKnockoffs with marginal correlation difference statistic

The original model-X knockoffs procedure relies on having access to the covariates and responses from all data points, i.e., the matrix of covariates $\mathbf{X}$ and the response vector $\mathbf{Y}$. Henceforth, we call these individual-level data. In many application scenarios, however, individual-level data are not available due to privacy concerns. Instead, we only have access to some summary statistics of $\mathbf{X}$ and $\mathbf{Y}$, e.g., the empirical covariance matrix of the covariates and the empirical covariance between each covariate and the response.

He et al. (2022) proposed GhostKnockoffs, which implements the knockoffs procedure with marginal correlation difference statistic when only $\mathbf{X}^{\top}\mathbf{Y}$ and $||\mathbf{Y}||_{2}^{2}$ are available. The key idea of He et al. (2022) is to sample the knockoff $Z$ -score $\widetilde{\mathbf{Z}}_{s}$ from $\mathbf{X}^{\top}\mathbf{Y}$ and $||\mathbf{Y}||_{2}^{2}$ directly, in a way such that

\widetilde{\mathbf{Z}}_{s}\mid\mathbf{X},\mathbf{Y}\stackrel{d}{=}\widetilde{\mathbf{X}}^{\top}\mathbf{Y}\mid\mathbf{X},\mathbf{Y},

(5)

where $\widetilde{\mathbf{X}}=\mathcal{G}(\mathbf{X},\mathbf{\Sigma})$ is the knockoff matrix generated by the Gaussian knockoff sampler (Algorithm 1). If we use $\mathbf{W}=\lvert\mathbf{Z}_{s}\rvert-\lvert\widetilde{\mathbf{Z}}_{s}\rvert$ (where $\mathbf{Z}_{s}=\mathbf{X}^{\top}\mathbf{Y}$ and the absolute values are taken componentwise) as the feature importance statistic and run the knockoff filter, the resulting rejection set will have the same distribution as that of the knockoffs procedure with marginal correlation difference statistic. Therefore, the two procedures are statistically identical. In particular, they both control the FDR.

Specifically, He et al. (2022) showed that for $\mathbf{P}$ and $\mathbf{V}$ computed in step 3 of Algorithm 1,

\widetilde{\mathbf{Z}}_{s}=\mathbf{P}^{\top}\mathbf{X}^{\top}\mathbf{Y}+||\mathbf{Y}||_{2}\mathbf{Z}\ \text{where}\ \mathbf{Z}\sim\mathcal{N}(\mathbf{0},\mathbf{V})\;\text{is independent of}\;\mathbf{X}\;\text{and}\;\mathbf{Y}

(6)

satisfies (5) as detailed in Appendix B. All this is summarized in Algorithm 2. In the following sections, we refer to Algorithm 2 as GhostKnockoffs with marginal correlation difference statistic (GK-marginal).

Algorithm 2 GhostKnockoffs with Marginal Correlation Difference Statistic (GK-marginal)

1: Input: $\mathbf{X}^{\top}\mathbf{Y}$, $||\mathbf{Y}||_{2}^{2}$, and $\mathbf{\Sigma}$.

2: Compute $\mathbf{s}$, $\mathbf{P}$, and $\mathbf{V}$ as in Algorithm 1.

3: Compute the feature importance statistics $\mathbf{W}=\lvert\mathbf{Z}_{s}\rvert-\lvert\widetilde{\mathbf{Z}}_{s}\rvert$, where $\widetilde{\mathbf{Z}}_{s}$ is generated according to (6).

4: Input $\mathbf{W}$ into the knockoffs selection procedure.

5: Output: Knockoffs selection set.
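Steps 2 and 3 of Algorithm 2 can be sketched in NumPy as below. This is a hedged illustration and not the authors' software: the vector $\mathbf{s}$ is assumed to be supplied by the user, and the function name `gk_marginal_statistics` is ours. The only data-dependent inputs are $\mathbf{X}^{\top}\mathbf{Y}$ and $||\mathbf{Y}||_{2}^{2}$.

```python
import numpy as np

def gk_marginal_statistics(XtY, y_sq_norm, Sigma, s, rng):
    """Return W = |Z_s| - |Z_tilde_s| with Z_tilde_s drawn according to (6)."""
    p = len(XtY)
    D = np.diag(s)
    Sigma_inv_D = np.linalg.solve(Sigma, D)
    P = np.eye(p) - Sigma_inv_D                     # P = I - Sigma^{-1} D
    V = 2.0 * D - D @ Sigma_inv_D                   # V = 2D - D Sigma^{-1} D
    V = (V + V.T) / 2.0                             # symmetrize against round-off
    vals, vecs = np.linalg.eigh(V)
    V_half = (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T
    Z = V_half @ rng.standard_normal(p)             # Z ~ N(0, V)
    Z_tilde = P.T @ XtY + np.sqrt(y_sq_norm) * Z    # equation (6)
    return np.abs(XtY) - np.abs(Z_tilde)

# Toy inputs (illustrative values only).
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
s = np.array([0.5, 0.5])
XtY = np.array([3.0, -1.0])
W = gk_marginal_statistics(XtY, 10.0, Sigma, s, np.random.default_rng(0))
```

The resulting vector `W` is then passed to the knockoff filter at the desired level $q$ (steps 4 and 5).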

3 GhostKnockoffs with Penalized Regression: Known Empirical Covariance

3.1 Setting

As we have just seen, GhostKnockoffs-marginal gives a way to test conditional hypotheses while maintaining FDR control when only the summary statistics $\mathbf{X}^{\top}\mathbf{Y}$ and $\lVert\mathbf{Y}\rVert_{2}^{2}$ are available to the analyst. Now, we consider the setting in which we have knowledge of the empirical covariance matrix $\mathbf{X}^{\top}\mathbf{X}$ and the sample size $n$, in addition to $\mathbf{X}^{\top}\mathbf{Y}$ and $\lVert\mathbf{Y}\rVert_{2}^{2}$. These quantities only reveal sample averages of relevant quantities, as opposed to all the individual-level information.

In this section, we propose a variable selection method that utilizes only $\mathbf{X}^{\top}\mathbf{X}$, $\mathbf{X}^{\top}\mathbf{Y}$, $\lVert\mathbf{Y}\rVert_{2}^{2}$, and $n$. Our method achieves FDR control and power comparable to the knockoffs procedure with the cross-validated Lasso coefficient difference statistic defined in Section 2. This is interesting because the latter usually outperforms GhostKnockoffs with the marginal correlation difference statistic by a significant margin. Notably, for a fixed tuning parameter $\lambda$, we show that our procedure is equivalent to a corresponding knockoffs method using the Lasso coefficient difference statistic with the same penalty level $\lambda$.

3.2 GhostKnockoffs with the Lasso

Recall that in the knockoffs procedure with the Lasso coefficient difference statistic, we solve the optimization problem

\hat{\bm{\beta}}(\lambda)\in\operatorname*{arg\,min}_{\bm{\beta}\in\mathbb{R}^{2p}}\frac{1}{2}||\mathbf{Y}-[\mathbf{X}\;\widetilde{\mathbf{X}}]\bm{\beta}||_{2}^{2}+\lambda||\bm{\beta}||_{1},

(7)

where $\widetilde{\mathbf{X}}=\mathcal{G}(\mathbf{X},\mathbf{\Sigma})$. We then define the Lasso coefficient difference feature importance statistics by $W_{j}=\lvert\hat{\beta}_{j}(\lambda)\rvert-\lvert\hat{\beta}_{j+p}(\lambda)\rvert$ for $1\leq j\leq p$. If we have access to individual-level data, $\lambda$ is usually chosen by cross-validation (Candès et al., 2018; Weinstein et al., 2020). (In the case that $Y$ is binary, one may think that utilizing (penalized) logistic regression would give much better power than the Lasso. In Appendix P, we show through simulations that this intuition may not be correct, even when $Y$ is generated according to a logistic regression model.)

As a first step, we would like to run a statistically equivalent procedure using $\mathbf{X}^{\top}\mathbf{X},\mathbf{X}^{\top}\mathbf{Y}$ , $\lVert\mathbf{Y}\rVert_{2}^{2}$ , and $n$ for a fixed $\lambda$ . Note that, with $\lambda$ fixed, (7) depends on the data only through

\begin{bmatrix}\mathbf{X}^{\top}\mathbf{X}&\mathbf{X}^{\top}\widetilde{\mathbf{X}}\\ \widetilde{\mathbf{X}}^{\top}\mathbf{X}&\widetilde{\mathbf{X}}^{\top}\widetilde{\mathbf{X}}\end{bmatrix}

and

\begin{bmatrix}\mathbf{X}^{\top}\mathbf{Y}\\ \widetilde{\mathbf{X}}^{\top}\mathbf{Y}\end{bmatrix}.
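To see this, one can expand the squared loss in (7). Writing $\mathbf{A}=[\mathbf{X}\;\widetilde{\mathbf{X}}]$ (a shorthand used only in this display), we have

\frac{1}{2}||\mathbf{Y}-\mathbf{A}\bm{\beta}||_{2}^{2}+\lambda||\bm{\beta}||_{1}=\frac{1}{2}||\mathbf{Y}||_{2}^{2}-\bm{\beta}^{\top}\mathbf{A}^{\top}\mathbf{Y}+\frac{1}{2}\bm{\beta}^{\top}\mathbf{A}^{\top}\mathbf{A}\bm{\beta}+\lambda||\bm{\beta}||_{1}.

The first term on the right-hand side does not involve $\bm{\beta}$, so the set of minimizers depends on the data only through $\mathbf{A}^{\top}\mathbf{A}$ and $\mathbf{A}^{\top}\mathbf{Y}$, which are exactly the two block matrices displayed above.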

Define the Gram matrix of $[\mathbf{X},\widetilde{\mathbf{X}},\mathbf{Y}]$

\mathcal{T}(\mathbf{X},\widetilde{\mathbf{X}},\mathbf{Y})=[\mathbf{X},\widetilde{\mathbf{X}},\mathbf{Y}]^{\top}\,[\mathbf{X},\widetilde{\mathbf{X}},\mathbf{Y}].

The Gram matrix can of course be equivalently reconstructed from $(\lVert\mathbf{Y}\rVert_{2}^{2},\mathbf{X}^{\top}\mathbf{Y},\widetilde{\mathbf{X}}^{\top}\mathbf{Y},\mathbf{X}^{\top}\mathbf{X},\widetilde{\mathbf{X}}^{\top}\mathbf{X},\widetilde{\mathbf{X}}^{\top}\widetilde{\mathbf{X}})$. The main idea is to sample from the joint distribution of $\mathcal{T}(\mathbf{X},\widetilde{\mathbf{X}},\mathbf{Y})$ using the Gram matrix of $[\mathbf{X},\mathbf{Y}]$ only. Based on this, we can then generate the solution to the Lasso problem (7) (in distribution) for a fixed $\lambda$. (Careful readers may realize that the solution of the Lasso problem does not depend on $\lVert\mathbf{Y}\rVert_{2}^{2}$; we include $\lVert\mathbf{Y}\rVert_{2}^{2}$ as an input so that we can later make a more general statement that goes beyond the Lasso.) This is achieved via the following Proposition 1, which says in words that if we generate ‘fake’ data matrices $\widecheck{\mathbf{X}}$ and $\widecheck{\mathbf{Y}}$ that lead to the same Gram matrix as that of $\mathbf{X}$ and $\mathbf{Y}$, then the distribution of $\mathcal{T}$ remains unchanged if we replace the original data matrices by the fake data matrices.

Proposition 1.

Suppose $\widecheck{\mathbf{X}}\in\mathbb{R}^{n\times p}$ and $\widecheck{\mathbf{Y}}\in\mathbb{R}^{n}$ are constructed such that $[\widecheck{\mathbf{X}}\ \widecheck{\mathbf{Y}}]^{\top}[\widecheck{\mathbf{X}}\ \widecheck{\mathbf{Y}}]=[\mathbf{X}\ \mathbf{Y}]^{\top}[\mathbf{X}\ \mathbf{Y}]$. Setting $\widetilde{\mathbf{X}}=\mathcal{G}(\mathbf{X},\mathbf{\Sigma})$ and $\widetilde{\widecheck{\mathbf{X}}}=\mathcal{G}(\widecheck{\mathbf{X}},\mathbf{\Sigma})$ as the outputs of Algorithm 1 (note that $\widecheck{\mathbf{X}}$ may not be a data matrix with i.i.d. rows and covariance matrix $\mathbf{\Sigma}$, so we should call $\widetilde{\widecheck{\mathbf{X}}}$ the pseudo-Gaussian knockoff data matrix), we have

\mathcal{T}(\mathbf{X},\widetilde{\mathbf{X}},\mathbf{Y})\mid\mathbf{X},\mathbf{Y}\stackrel{d}{=}\mathcal{T}(\widecheck{\mathbf{X}},\widetilde{\widecheck{\mathbf{X}}},\widecheck{\mathbf{Y}})\mid\mathbf{X},\mathbf{Y}.

The proof of Proposition 1 is provided in Appendix C. In particular, Proposition 1 implies that the summary statistics ($\mathbf{X}^{\top}\mathbf{X},\mathbf{X}^{\top}\mathbf{Y},||\mathbf{Y}||_{2}^{2}$, $\mathbf{\Sigma}$), together with the sample size $n$, are sufficient for sampling the Gram matrix $\mathcal{T}(\mathbf{X},\widetilde{\mathbf{X}},\mathbf{Y})$.

Algorithm 3 GhostKnockoffs with Penalized Regression: Known Empirical Covariance

1: Input: $\mathbf{X}^{\top}\mathbf{X},\mathbf{X}^{\top}\mathbf{Y},||\mathbf{Y}||_{2}^{2}$, $\mathbf{\Sigma}$, and $n$.

2: Find $\widecheck{\mathbf{X}}$ and $\widecheck{\mathbf{Y}}$ such that $[\widecheck{\mathbf{X}}\ \widecheck{\mathbf{Y}}]^{\top}[\widecheck{\mathbf{X}}\ \widecheck{\mathbf{Y}}]=[\mathbf{X}\ \mathbf{Y}]^{\top}[\mathbf{X}\ \mathbf{Y}]$ by eigen-decomposition or Cholesky decomposition.

3: Generate $\widetilde{\widecheck{\mathbf{X}}}=\mathcal{G}(\widecheck{\mathbf{X}},\mathbf{\Sigma})$ via Algorithm 1.

4: Run the standard knockoffs procedure (at level $q$) with the Lasso coefficient difference statistic on $\widecheck{\mathbf{X}}$ and $\widetilde{\widecheck{\mathbf{X}}}$ for a fixed penalty level $\lambda$, or use the methods from Sections 3.3 and 3.4.

5: Output: Knockoffs selection set.

We are now able to write down a procedure, namely, Algorithm 3, which is statistically equivalent to the corresponding individual-level knockoffs procedure using the Lasso coefficient difference statistic (or any statistic defined in Sections 3.3 and 3.4). In step 2, $\widecheck{\mathbf{X}}$ and $\widecheck{\mathbf{Y}}$ can be obtained by performing the eigen-decomposition or Cholesky decomposition of $[\mathbf{X}\ \mathbf{Y}]^{\top}[\mathbf{X}\ \mathbf{Y}]$ . Brief procedures to construct $\widecheck{\mathbf{X}}$ and $\widecheck{\mathbf{Y}}$ via eigen-decomposition are provided in Appendix D. All we need to do is to run the knockoffs procedure with $\widecheck{\mathbf{X}}$ and $\widetilde{\widecheck{\mathbf{X}}}$ in lieu of ${\mathbf{X}}$ and $\widetilde{{\mathbf{X}}}$ . We say that the procedure is equivalent since the rejection sets have the same distribution. In particular, this proves that Algorithm 3 controls the FDR.
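Step 2 of Algorithm 3 can be sketched in NumPy as follows. This is only an illustration under our own assumptions and may differ from the construction in Appendix D: we factor the $(p+1)\times(p+1)$ Gram matrix by eigen-decomposition and, to obtain exactly $n$ rows, either pad with zero rows (when $n\geq p+1$) or keep the $n$ leading eigen-directions (when $n<p+1$, in which case the Gram matrix has rank at most $n$). The function name `pseudo_data_from_gram` is ours.

```python
import numpy as np

def pseudo_data_from_gram(XtX, XtY, y_sq_norm, n):
    """Build (X_check, Y_check) with n rows whose Gram matrix equals that of [X Y]."""
    p = XtX.shape[0]
    G = np.block([[XtX, XtY[:, None]],
                  [XtY[None, :], np.array([[y_sq_norm]])]])
    vals, vecs = np.linalg.eigh(G)                 # eigenvalues in ascending order
    vals = np.clip(vals, 0.0, None)                # guard against tiny negative round-off
    R = np.sqrt(vals)[:, None] * vecs.T            # R^T R = G
    if n >= p + 1:
        R = np.vstack([R, np.zeros((n - (p + 1), p + 1))])  # zero rows leave R^T R unchanged
    else:
        R = R[-n:]                                 # dropped directions have (near) zero eigenvalue
    return R[:, :p], R[:, p]

# Toy check against individual-level data (illustrative values only).
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
Y = rng.standard_normal(20)
X_check, Y_check = pseudo_data_from_gram(X.T @ X, X.T @ Y, Y @ Y, n=20)
```

The pair `(X_check, Y_check)` generally differs from `(X, Y)`, but it reproduces $\mathbf{X}^{\top}\mathbf{X}$, $\mathbf{X}^{\top}\mathbf{Y}$ and $||\mathbf{Y}||_{2}^{2}$, which is all that Proposition 1 requires.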

Corollary 1.

Consider a knockoffs feature importance statistic $\mathbf{W}=\mathbf{f}(\mathcal{T}(\mathbf{X},\widetilde{\mathbf{X}},\mathbf{Y}),\mathbf{U})\in\mathbb{R}^{p}$, which is a deterministic function of $\mathcal{T}(\mathbf{X},\widetilde{\mathbf{X}},\mathbf{Y})$ and an independent random variable $\mathbf{U}$. Define $\widehat{\mathbf{W}}=\mathbf{f}(\mathcal{T}(\widecheck{\mathbf{X}},\widetilde{\widecheck{\mathbf{X}}},\widecheck{\mathbf{Y}}),\mathbf{U})$. Let $\mathcal{S}_{1}$ (resp. $\mathcal{S}_{2}$) be the rejection set obtained from applying the knockoffs filter on $\mathbf{W}$ (resp. $\widehat{\mathbf{W}}$). Then $\mathcal{S}_{1}\mid\mathbf{X},\mathbf{Y}\stackrel{d}{=}\mathcal{S}_{2}\mid\mathbf{X},\mathbf{Y}$. Thus, if $\mathbf{W}$ obeys the flip-sign property, both procedures have the same FDR, which is at most $q$.

Proof.

Proposition 1 gives $\mathbf{W}\mid\mathbf{X},\mathbf{Y}\stackrel{d}{=}\widehat{\mathbf{W}}\mid\mathbf{X},\mathbf{Y}$. Since the selection set is uniquely determined by the values of $\mathbf{W}$ (or $\widehat{\mathbf{W}}$), it follows that $\mathcal{S}_{1}\mid\mathbf{X},\mathbf{Y}\stackrel{d}{=}\mathcal{S}_{2}\mid\mathbf{X},\mathbf{Y}$. Therefore, the procedures have the same FDR. ∎

We can easily adapt the method above to accommodate other types of regularization, such as Ridge regression and Elastic Net.

3.3 GhostKnockoffs with the square-root Lasso

In Section 3.2, we assumed that the tuning parameter $\lambda$ in (7) is fixed. In practice, one may choose the penalty level using information from the Gram matrix of $[\mathbf{X},\mathbf{Y}]$ , and the sample size $n$ . Since individual-level data is not available, we are unable to use data-splitting approaches such as cross-validation.

An alternative way to define feature importance is to use the square-root Lasso (Belloni et al., 2011), for which the choice of a reasonable tuning parameter is convenient. The square-root Lasso applied to the knockoffs setting solves

\hat{\bm{\beta}}(\lambda)\in\operatorname*{arg\,min}_{\bm{\beta}\in\mathbb{R}^{2p}}||\mathbf{Y}-[\mathbf{X}\;\widetilde{\mathbf{X}}]\bm{\beta}||_{2}+\lambda||\bm{\beta}||_{1},

(8)

and a good choice of $\lambda$ is given by

\lambda=\kappa\cdot\mathbb{E}\left[\frac{\lVert[\mathbf{X}\;\widetilde{\mathbf{X}}]^{\top}\bm{\epsilon}\rVert_{\infty}}{\lVert\bm{\epsilon}\rVert_{2}}\mid\mathbf{X},\widetilde{\mathbf{X}}\right],

(9)

where $\bm{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_{n})$ and $\kappa$ is a unitless hyperparameter (Tian et al., 2018). This value is a scalar multiple of the expected value of the minimal penalty level required such that all the coefficients are shrunk to zero under the global null model. The square-root Lasso has the benefit that the value of the hyperparameter does not depend on the details of the distribution of $Y$ conditional on $X$ . We also found that the performance of our procedure does not depend very sensitively on the choice of $\kappa$ . In our data examples, we take $\kappa=0.3$ .

In the setting where we only know the values of the summary statistics, we simply replace $(\mathbf{X},\widetilde{\mathbf{X}},\mathbf{Y})$ by $(\widecheck{\mathbf{X}},\widetilde{\widecheck{\mathbf{X}}},\widecheck{\mathbf{Y}})$ in (8). Further, we note that for any orthogonal matrix $\mathbf{Q}$,

	$\displaystyle([\mathbf{X}\;\widetilde{\mathbf{X}}]^{\top}\mathbf{Q}^{\top}\bm{\epsilon},\bm{\epsilon}^{\top}\bm{\epsilon})\mid\mathbf{X},\widetilde{\mathbf{X}}$	$\displaystyle\stackrel{d}{=}([\mathbf{X}\;\widetilde{\mathbf{X}}]^{\top}\mathbf{Q}^{\top}\bm{\epsilon},\bm{\epsilon}^{\top}\mathbf{Q}\mathbf{Q}^{\top}\bm{\epsilon})\mid\mathbf{X},\widetilde{\mathbf{X}}$
		$\displaystyle\stackrel{d}{=}([\mathbf{X}\;\widetilde{\mathbf{X}}]^{\top}\bm{\epsilon},\bm{\epsilon}^{\top}\bm{\epsilon})\mid\mathbf{X},\widetilde{\mathbf{X}},$

where the second equality follows from $\mathbf{Q}^{\top}\bm{\epsilon}\stackrel{{\scriptstyle d}}{{=}}\bm{\epsilon}$ . Therefore, the value of the hyperparameter in (9) remains unchanged if we multiply $[\mathbf{X}\;\widetilde{\mathbf{X}}]$ by $\mathbf{Q}$ on the left. This implies that (9) is a deterministic function of $[\mathbf{X}\;\widetilde{\mathbf{X}}]^{\top}[\mathbf{X}\;\widetilde{\mathbf{X}}]$ . Hence, the feature importance statistic is a function of $\mathcal{T}(\mathbf{X},\widetilde{\mathbf{X}},\mathbf{Y})$ . Following Corollary 1, we can apply the knockoffs procedure with the square-root Lasso and matrices $(\widecheck{\mathbf{X}},\widetilde{\widecheck{\mathbf{X}}})$ in lieu of $({\mathbf{X}},\widetilde{{\mathbf{X}}})$ . Upon choosing

\lambda=\kappa\;\mathbb{E}\left[\frac{\lVert[\widecheck{\mathbf{X}}\;\widetilde{\widecheck{\mathbf{X}}}]^{\top}\bm{\epsilon}\rVert_{\infty}}{\lVert\bm{\epsilon}\rVert_{2}}\mid\widecheck{\mathbf{X}},\widetilde{\widecheck{\mathbf{X}}}\right],

(10)

we obtain a procedure that is statistically indistinguishable from the one we would get by performing the same steps with $\mathbf{X}$ and $\widetilde{\mathbf{X}}$. (In practice, we compute the value in (10) via Monte Carlo simulation.) In the sequel, we call the resulting procedure summary statistics GhostKnockoffs with square-root Lasso importance statistic (GK-sqrtlasso). Note that GK-sqrtlasso controls the FDR as the flip-sign property of the feature importance statistic holds. This is because swapping a variable with its knockoff does not change the value of the hyperparameter. Therefore, by Corollary 1, applying the knockoff filter to the square-root Lasso feature importance statistics yields FDR control.
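The Monte Carlo approximation of (10) can be sketched as follows. This is a minimal illustration, assuming the $n\times 2p$ matrix $[\widecheck{\mathbf{X}}\;\widetilde{\widecheck{\mathbf{X}}}]$ is passed as a single array `A`; the function name `sqrt_lasso_lambda` and the default number of draws are our own choices and are not taken from the paper's software.

```python
import numpy as np

def sqrt_lasso_lambda(A, kappa=0.3, n_draws=1000, rng=None):
    """Monte Carlo estimate of kappa * E[ ||A^T eps||_inf / ||eps||_2 ], eps ~ N(0, I_n)."""
    rng = np.random.default_rng() if rng is None else rng
    n = A.shape[0]
    eps = rng.standard_normal((n, n_draws))          # one column per Monte Carlo draw
    sup_norms = np.abs(A.T @ eps).max(axis=0)        # ||A^T eps||_inf for each draw
    l2_norms = np.linalg.norm(eps, axis=0)           # ||eps||_2 for each draw
    return kappa * np.mean(sup_norms / l2_norms)

# Toy usage with a random stand-in for [X_check, X_check_tilde] (illustrative only).
A = np.random.default_rng(0).standard_normal((30, 6))
lam = sqrt_lasso_lambda(A, kappa=0.3, n_draws=500, rng=np.random.default_rng(1))
```

By the Cauchy-Schwarz inequality, each ratio is at most the largest column norm of `A`, so the estimate always lies between $0$ and $\kappa$ times that norm; this gives a quick sanity check on the output.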

3.4 GhostKnockoffs with the Lasso-max

In the standard fixed-X knockoffs setting, cross-validation is also not feasible, since doing so would violate the sufficiency condition required for the feature importance statistics. As one possible alternative, Barber and Candès (2015) considered using as the feature importance statistic the value of $\lambda$ on the Lasso path at which feature $X_{j}$ first enters the model. Formally, they define the feature importance statistic

W_{j}=\text{sup}\{\lambda:\hat{\beta}_{j}(\lambda)\neq 0\}-\text{sup}\{\lambda% :\hat{\beta}_{j+p}(\lambda)\neq 0\},

where $\hat{\bm{\beta}}(\lambda)$ is as in (7). We call this statistic the Lasso-max statistic. Intuitively, a larger penalty level is required to shrink an important feature to zero, so we should expect $W_{j}$ to be large and positive for non-nulls.
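As a small illustration of the definition, the sketch below evaluates the Lasso-max statistic from a coefficient path that has already been computed on a finite grid of penalty values. The grid approximation, the function name `lasso_max_statistics` and the convention that a never-selected coefficient gets entry value 0 are assumptions made for this example.

```python
def lasso_max_statistics(lambdas, path, p):
    """W_j = sup{lambda: beta_j != 0} - sup{lambda: beta_{j+p} != 0} on a grid.

    `lambdas[k]` is the penalty for the coefficient vector `path[k]`
    (length 2p). A coefficient that is zero along the whole grid gets
    entry value 0 (an illustrative convention).
    """
    entry = [0.0] * (2 * p)
    for lam, beta in zip(lambdas, path):
        for j, b in enumerate(beta):
            # Keep the largest penalty at which coefficient j is active.
            if b != 0.0 and lam > entry[j]:
                entry[j] = lam
    return [entry[j] - entry[j + p] for j in range(p)]
```

On a grid, the supremum is only resolved up to the grid spacing, so a fine grid near the largest penalties matters most for the ordering of the $W_{j}$.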

By Corollary 1, with the Lasso-max statistic Algorithm 3 produces a rejection set that has the same distribution as the rejection set obtained from the corresponding individual-data-based knockoffs procedure. We call this summary-statistic-based procedure GhostKnockoffs with Lasso-max statistic (GK-lassomax).

We remark that other choices of tuning parameters and feature importance statistics are also possible. For instance, we may choose $\lambda$ to minimize Stein’s unbiased risk estimate (SURE) associated with (7). We shall, however, focus on the two approaches we have described.

3.5 Numerical simulations

We consider a variety of simulation settings in which we compare the performance of the proposed GhostKnockoffs with square-root Lasso and Lasso-max statistics (GK-sqrtlasso and GK-lassomax, defined in Sections 3.3 and 3.4), GhostKnockoffs with marginal correlation difference statistic (GK-marginal, defined in Section 2), and the knockoffs procedure with (cross-validated) Lasso coefficient difference statistic with individual-level data (KF-lassocv). Note that the first three are statistically equivalent to the corresponding knockoffs procedures with individual-level data.

3.5.1 Independent features

In the first set of simulations (Figure 1), we generate random samples $\mathbf{x}_{i}\stackrel{{\scriptstyle iid}}{{\sim}}\mathcal{N}(\mathbf{0},\mathbf{I}_{p})$ and $Y_{i}=\bm{\beta}^{\top}\mathbf{x}_{i}+\sqrt{n}\epsilon_{i}$, where $\epsilon_{i}\stackrel{{\scriptstyle iid}}{{\sim}}\mathcal{N}(0,1)$ for $i\in\{1,2,...,n\}$. (The simulation setting is designed so that the signal-to-noise ratio stays on the same scale as $n$ varies.) We consider three settings of varying dimensionality, measured by the ratio $p/n$: $(n,p)\in\{(600,200),(400,400),(200,600)\}$. In each of the three settings, we create a sparse vector $\bm{\beta}$ by selecting 30 coordinates uniformly at random to be non-zero. The signs of these non-zero coordinates are positive or negative with equal probability. We vary the signal amplitudes so as to explore a wide power range. For the square-root Lasso, we average over 200 Monte Carlo samples to calculate

\lambda=\kappa\cdot\mathbb{E}\big{[}\frac{\lVert[\mathbf{X}\;\widetilde{% \mathbf{X}}]^{\top}\bm{\epsilon}\rVert_{\infty}}{\lVert\bm{\epsilon}\rVert_{2}% }\mid\mathbf{X},\widetilde{\mathbf{X}}\big{]}.

The target FDR is 20%. Each point on the curves represents the average of the results from 200 replications.

Figure 1: Power and FDR plots for independent features and a Gaussian linear model with varying dimensions. Each point is an average over 200 replications.

We observe that GK-sqrtlasso and GK-lassomax generally demonstrate greater power than GK-marginal. This enhanced performance is not surprising, as GK-sqrtlasso and GK-lassomax (1) have access to additional information via $\mathbf{X}^{\top}\mathbf{X}$, and (2) employ a joint modeling algorithm, the Lasso, which generally provides a better assessment of variable importance for understanding conditional (in)dependence since such a model explicitly adjusts for the effects of all the other variables. We also note the presence of power gaps between KF-lassocv and GK-sqrtlasso/GK-lassomax, likely because we are unable to perform cross-validation without individual-level data. All methods control the FDR at the desired level.

3.5.2 AR(1) features

In the second set of simulations (Figure 2), we generate $\mathbf{x}_{i}\stackrel{{\scriptstyle iid}}{{\sim}}\mathcal{N}(\mathbf{0},\mathbf{\Sigma}_{\rho})$ for $i\in\{1,2,...,n\}$, where $\left[\mathbf{\Sigma}_{\rho}\right]_{s,t}=\rho^{|s-t|}$ for $1\leq s,t\leq p$. As before, we generate $Y_{i}=\bm{\beta}^{\top}\mathbf{x}_{i}+\sqrt{n}\epsilon_{i}$, where $\epsilon_{i}\stackrel{{\scriptstyle iid}}{{\sim}}\mathcal{N}(0,1)$ for $i\in\{1,2,...,n\}$. We consider the same three $(n,p)$ combinations. In each of the three cases, we create a sparse vector $\bm{\beta}$ exactly as before, except that we fix the signal amplitudes to 4, 4, and 7, respectively, to explore a wide power range. We vary $\rho$ in $\{0,0.1,0.2,...,0.8\}$. The target FDR is set to 20%. Each point represents the average of the results from 200 replications.
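The data-generating process of these simulations can be sketched as below. This is an illustrative plain-Python version under stated assumptions: the function name, its arguments and the use of a common amplitude for all non-zero coefficients are choices made for the example, and setting `rho=0` recovers the independent-feature design of Section 3.5.1.

```python
import math
import random


def simulate_ar1_linear_model(n, p, rho, n_signals, amplitude, seed=0):
    """Draw (X, Y, beta) with AR(1) Gaussian features and a sparse signal."""
    rng = random.Random(seed)
    # Sparse coefficients: random support, random signs, common amplitude.
    beta = [0.0] * p
    for j in rng.sample(range(p), n_signals):
        beta[j] = amplitude if rng.random() < 0.5 else -amplitude
    innov_sd = math.sqrt(1.0 - rho * rho)
    X, Y = [], []
    for _ in range(n):
        # Stationary AR(1) recursion gives Cov(x_s, x_t) = rho^|s - t|.
        row = [rng.gauss(0.0, 1.0)]
        for t in range(1, p):
            row.append(rho * row[-1] + innov_sd * rng.gauss(0.0, 1.0))
        # Noise is scaled by sqrt(n), as in the simulation design.
        noise = math.sqrt(n) * rng.gauss(0.0, 1.0)
        X.append(row)
        Y.append(sum(b * x for b, x in zip(beta, row)) + noise)
    return X, Y, beta
```

For example, `simulate_ar1_linear_model(400, 400, 0.5, 30, 4.0)` would mimic the middle $(n,p)$ setting with $\rho=0.5$.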

Again, we observe that GK-sqrtlasso and GK-lassomax generally have greater power than GK-marginal. All methods have (almost) decreasing power as the autocorrelation coefficient increases, since it becomes harder to separate true signals from null variables that are correlated with them. All methods control the FDR at the desired level.

4 GhostKnockoffs with Penalized Regression: Missing Empirical Covariance

4.1 Setting

Thus far, we have discussed how incorporating the additional information from $\mathbf{X}^{\top}\mathbf{X}$ and $n$ could enhance our ability to detect significant features. However, in applications such as genetics, $\mathbf{X}^{\top}\mathbf{X}$ may not be available. In this section, we propose alternative procedures when the scientist only knows about $\mathbf{X}^{\top}\mathbf{Y}$ , $\lVert\mathbf{Y}\rVert^{2}$ and the sample size $n$ . As before, we assume that $X\sim\mathcal{N}(\mathbf{0},\mathbf{\Sigma})$ , where the covariance matrix $\mathbf{\Sigma}$ is known (or can be estimated from other data sources).

4.2 GhostKnockoffs with pseudo-lasso

The idea of our method is to modify the Lasso objective function so that it can be constructed from the available summary statistics. It turns out that the solution of our modified objective function is proportional to that of the scout procedure (with known precision matrix) proposed by Witten and Tibshirani (2009). We will see through simulation studies that our procedure improves the power of the original GhostKnockoffs method of He et al. (2022) while maintaining FDR control.

4.2.1 The procedure

Recall that in the knockoffs procedure with the Lasso statistic, we solve the following optimization problem:

\hat{\bm{\beta}}(\lambda)=\operatorname*{arg\,min}_{\bm{\beta}\in\mathbb{R}^{2% p}}\frac{1}{2n}\bm{\beta}^{\top}\begin{bmatrix}\mathbf{X}^{\top}\mathbf{X}&% \mathbf{X}^{\top}\widetilde{\mathbf{X}}\\ \widetilde{\mathbf{X}}^{\top}\mathbf{X}&\widetilde{\mathbf{X}}^{\top}% \widetilde{\mathbf{X}}\end{bmatrix}\bm{\beta}-\frac{1}{n}\bm{\beta}^{\top}% \begin{bmatrix}\mathbf{X}^{\top}\mathbf{Y}\\ \widetilde{\mathbf{X}}^{\top}\mathbf{Y}\end{bmatrix}+\lambda||\bm{\beta}||_{1}.

To mimic the form of the loss function when we do not observe the empirical covariance of the features, we substitute the empirical quantities with their population versions: we replace $\mathbf{X}^{\top}\mathbf{X}/n$ and $\widetilde{\mathbf{X}}^{\top}\widetilde{\mathbf{X}}/n$ with $\mathbf{\Sigma}$, and $\mathbf{X}^{\top}\widetilde{\mathbf{X}}/n$ with $\mathbf{\Sigma}-\mathbf{D}$. As usual, $\mathbf{D}=\text{diag}\{\mathbf{s}\}$ is obtained by solving the convex optimization problem (15). In the language of fixed-X knockoffs (Barber and Candès, 2015), this is equivalent to regarding $\widetilde{\mathbf{X}}$ as a fixed-X knockoff of $\mathbf{X}$ and replacing $\mathbf{X}^{\top}\mathbf{X}/n$ by $\mathbf{\Sigma}$. (We remark that similar objective functions have been used in, for example, Mak et al. (2017) and Zou et al. (2022).) This yields Algorithm 4.

Algorithm 4 GhostKnockoffs with Penalized Regression: Missing Empirical Covariance

1: Input: $\mathbf{X}^{\top}\mathbf{Y}$, $\lVert\mathbf{Y}\rVert_{2}^{2}$, $\mathbf{\Sigma}$ and $n$.

2: Simulate $\mathbf{Z}\sim\mathcal{N}(\mathbf{0},\mathbf{V})$, where $\mathbf{V}$ is defined as in Algorithm 2.

3: Solve

\hat{\bm{\beta}}(\lambda)=\operatorname*{arg\,min}_{\bm{\beta}\in\mathbb{R}^{2p}}\frac{1}{2}\bm{\beta}^{\top}\begin{bmatrix}\mathbf{\Sigma}&\mathbf{\Sigma}-\mathbf{D}\\ \mathbf{\Sigma}-\mathbf{D}&\mathbf{\Sigma}\end{bmatrix}\bm{\beta}-\frac{1}{n}\bm{\beta}^{\top}\begin{bmatrix}\mathbf{X}^{\top}\mathbf{Y}\vspace{1mm}\\ \mathbf{P}^{\top}\mathbf{X}^{\top}\mathbf{Y}+\lVert\mathbf{Y}\rVert_{2}\mathbf{Z}\end{bmatrix}+\lambda||\bm{\beta}||_{1},

where $\mathbf{D}$ and $\mathbf{P}$ are defined as in Section 2.2.2 and $\lambda$ is fixed or chosen as in Section 4.2.2.

4: Run the standard knockoffs procedure (at level $q$) with importance statistic $W_{j}=\lvert\hat{\beta}_{j}(\lambda)\rvert-\lvert\hat{\beta}_{j+p}(\lambda)\rvert$.

5: Output: Knockoffs selection set.

We call this procedure GhostKnockoffs with pseudo-lasso statistic (GK-pseudolasso). We show below that Algorithm 4 controls the FDR of selections at level $q$ . Before doing so, we first state a general proposition that includes GK-marginal as a special case.
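Step 2 and the knockoff part of the linear term in step 3 of Algorithm 4 can be sketched with NumPy as follows. This is a hedged illustration: it assumes $\mathbf{P}=\mathbf{I}-\mathbf{\Sigma}^{-1}\mathbf{D}$ and $\mathbf{V}=2\mathbf{D}-\mathbf{D}\mathbf{\Sigma}^{-1}\mathbf{D}$ (the single-knockoff case of the formulas in Section 4.3.1), and the function name `ghost_knockoff_scores` and its arguments are hypothetical.

```python
import numpy as np


def ghost_knockoff_scores(xty, y_norm_sq, Sigma, s, rng):
    """Sample a surrogate for X_knockoff^T Y from summary statistics only.

    Assumes P = I - Sigma^{-1} D and V = 2 D - D Sigma^{-1} D, D = diag(s).
    `rng` is a numpy.random.Generator supplied by the caller.
    """
    D = np.diag(s)
    Sigma_inv_D = np.linalg.solve(Sigma, D)   # Sigma^{-1} D without an explicit inverse
    P = np.eye(len(s)) - Sigma_inv_D
    V = 2.0 * D - D @ Sigma_inv_D
    Z = rng.multivariate_normal(np.zeros(len(s)), V)
    # P^T X^T Y + ||Y||_2 Z, the second block of the linear term in step 3.
    return P.T @ xty + np.sqrt(y_norm_sq) * Z
```

Stacking `xty` on top of the returned vector and dividing by $n$ gives the vector that multiplies $\bm{\beta}^{\top}$ in the objective of step 3.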

Proposition 2.

Suppose $\mathbf{V}$ and $\mathbf{P}$ are defined as in Algorithm 2, $\mathbf{Z}\sim\mathcal{N}(\mathbf{0},\mathbf{V})$ is independent of $\mathbf{X}$ and $\mathbf{Y}$, and $\widetilde{\mathbf{X}}=\mathcal{G}(\mathbf{X},\mathbf{\Sigma})$. Consider a knockoffs feature importance statistic $\mathbf{W}=\mathbf{g}(\lVert\mathbf{Y}\rVert_{2}^{2},\mathbf{X}^{\top}\mathbf{Y},\widetilde{\mathbf{X}}^{\top}\mathbf{Y},\mathbf{U})\in\mathbb{R}^{p}$, which is a deterministic function of $\lVert\mathbf{Y}\rVert_{2}^{2},\mathbf{X}^{\top}\mathbf{Y},\widetilde{\mathbf{X}}^{\top}\mathbf{Y}$ and an independent random variable $\mathbf{U}$. Define $\widehat{\mathbf{W}}=\mathbf{g}(\lVert\mathbf{Y}\rVert_{2}^{2},\mathbf{X}^{\top}\mathbf{Y},\mathbf{P}^{\top}\mathbf{X}^{\top}\mathbf{Y}+\lVert\mathbf{Y}\rVert_{2}\mathbf{Z},\mathbf{U})$. Let $\mathcal{S}_{1}$ (resp. $\mathcal{S}_{2}$) be the rejection set obtained from applying the knockoffs filter to $\mathbf{W}$ (resp. $\widehat{\mathbf{W}}$). Then $\mathcal{S}_{1}\mid\mathbf{X},\mathbf{Y}\stackrel{{\scriptstyle d}}{{=}}\mathcal{S}_{2}\mid\mathbf{X},\mathbf{Y}$. Thus, if $\mathbf{W}$ obeys the flip-sign property, both procedures have the same FDR, which is at most $q$.

Proof.

In Appendix B, we prove that

\widetilde{\mathbf{X}}^{\top}\mathbf{Y}\mid\mathbf{X},\mathbf{Y}\stackrel{{% \scriptstyle d}}{{=}}\mathbf{P}^{\top}\mathbf{X}^{\top}\mathbf{Y}+||\mathbf{Y}% ||_{2}\mathbf{Z}\mid\mathbf{X},\mathbf{Y}.

As a result, $\mathbf{W}\mid\mathbf{X},\mathbf{Y}\stackrel{{\scriptstyle d}}{{=}}\widehat{% \mathbf{W}}\mid\mathbf{X},\mathbf{Y}$ . Since the selection set is uniquely determined by the values of $\mathbf{W}$ (or $\widehat{\mathbf{W}}$ ), it follows that $\mathcal{S}_{1}\mid\mathbf{X},\mathbf{Y}\stackrel{{\scriptstyle d}}{{=}}% \mathcal{S}_{2}\mid\mathbf{X},\mathbf{Y}$ . Therefore, the procedures have the same FDR. ∎
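For intuition, here is a brief sketch of why the distributional identity used in the proof is plausible; the full argument is in Appendix B. It assumes that the Gaussian knockoffs are generated as $\widetilde{\mathbf{X}}=\mathbf{X}\mathbf{P}+\mathbf{E}\mathbf{V}^{1/2}$, where $\mathbf{E}\in\mathbb{R}^{n\times p}$ has i.i.d. standard normal entries and is independent of $(\mathbf{X},\mathbf{Y})$ (the single-knockoff case of the construction in Section 4.3.1). Under this assumption,

\widetilde{\mathbf{X}}^{\top}\mathbf{Y}=\mathbf{P}^{\top}\mathbf{X}^{\top}\mathbf{Y}+\mathbf{V}^{1/2}\mathbf{E}^{\top}\mathbf{Y},\qquad\mathbf{E}^{\top}\mathbf{Y}\mid\mathbf{X},\mathbf{Y}\sim\mathcal{N}(\mathbf{0},\lVert\mathbf{Y}\rVert_{2}^{2}\,\mathbf{I}_{p}),

so that $\mathbf{V}^{1/2}\mathbf{E}^{\top}\mathbf{Y}\mid\mathbf{X},\mathbf{Y}\sim\mathcal{N}(\mathbf{0},\lVert\mathbf{Y}\rVert_{2}^{2}\mathbf{V})$, which is the conditional law of $\lVert\mathbf{Y}\rVert_{2}\mathbf{Z}$ with $\mathbf{Z}\sim\mathcal{N}(\mathbf{0},\mathbf{V})$.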

Set $\lambda$ to be a fixed numerical constant. Consider the feature importance statistics $\mathbf{W}$ defined by $W_{j}=\lvert\hat{\beta}_{j}(\lambda)\rvert-\lvert\hat{\beta}_{j+p}(\lambda)\rvert,$ where $\hat{\bm{\beta}}(\lambda)$ is the solution to

\operatorname*{arg\,min}_{\bm{\beta}\in\mathbb{R}^{2p}}\frac{1}{2}\bm{\beta}^{% \top}\begin{bmatrix}\mathbf{\Sigma}&\mathbf{\Sigma}-\mathbf{D}\\ \mathbf{\Sigma}-\mathbf{D}&\mathbf{\Sigma}\end{bmatrix}\bm{\beta}-\frac{1}{n}% \bm{\beta}^{\top}\begin{bmatrix}\mathbf{X}^{\top}\mathbf{Y}\vspace{1mm}\\ \widetilde{\mathbf{X}}^{\top}\mathbf{Y}\end{bmatrix}+\lambda||\bm{\beta}||_{1},

(11)

and $\widetilde{\mathbf{X}}=\mathcal{G}(\mathbf{X},\mathbf{\Sigma})$ is the Gaussian knockoff data matrix. The feature importance statistic in Algorithm 4 is thus obtained by replacing $\widetilde{\mathbf{X}}^{\top}\mathbf{Y}$ by $\mathbf{P}^{\top}\mathbf{X}^{\top}\mathbf{Y}+\lVert\mathbf{Y}\rVert_{2}\mathbf% {Z}$ in (11). Since $\mathbf{W}$ is determined by $\lVert\mathbf{Y}\rVert_{2}^{2},\mathbf{X}^{\top}\mathbf{Y}$ and $\widetilde{\mathbf{X}}^{\top}\mathbf{Y}$ , it follows from Proposition 2 that the rejection set of Algorithm 4 has the same distribution as that obtained from running the knockoff filter on $\mathbf{W}$ .

Thus to prove that Algorithm 4 controls the FDR of rejections at level $q$ , it suffices to verify the flip-sign property of the feature importance statistic for $\mathbf{W}$ (see Section 2). This is a consequence of the following lemma:

Lemma 1.

Consider the problem

\operatorname*{arg\,min}_{\bm{\beta}\in\mathbb{R}^{2p}}\frac{1}{2}\bm{\beta}^{% \top}\mathbf{C}\bm{\beta}-\mathbf{d}^{\top}\bm{\beta}+\lambda||\bm{\beta}||_{1% }+\gamma\lVert\bm{\beta}\rVert^{2}_{2}.

(12)

Let $\bm{\Pi}_{S}$ be the permutation matrix that swaps the $j$th and $(j+p)$th entries of a $2p$-dimensional vector for each $j\in S\subset\{1,...,p\}$. Assume that $\mathbf{C}$ is $S$-swap invariant in the sense that $\bm{\Pi}_{S}^{\top}\mathbf{C}\bm{\Pi}_{S}=\mathbf{C}$. Then $\hat{\bm{\beta}}$ is a solution to (12) if and only if $\bm{\Pi}_{S}\hat{\bm{\beta}}$ is a solution to the same problem with $\mathbf{d}$ replaced by $\bm{\Pi}_{S}\mathbf{d}$. In other words, swapping the entries of $\mathbf{d}$ has the effect of swapping the corresponding entries of the solution.

Proof.

Consider the objective with problem data $\bm{\Pi}_{S}\mathbf{d}$:

\frac{1}{2}\bm{\beta}^{\top}\mathbf{C}\bm{\beta}-(\bm{\Pi}_{S}\mathbf{d})^{% \top}\bm{\beta}+\lambda\lVert\bm{\beta}\rVert_{1}+\gamma\lVert\bm{\beta}\rVert% _{2}^{2}=\frac{1}{2}\bm{\beta}^{\top}\mathbf{C}\bm{\beta}-\mathbf{d}^{\top}\bm% {\Pi}_{S}^{\top}\bm{\beta}+\lambda\lVert\bm{\beta}\rVert_{1}+\gamma\lVert\bm{% \beta}\rVert_{2}^{2}.

Set $\bm{\beta}^{\prime}=\bm{\Pi}_{S}^{\top}\bm{\beta}$ so that $\bm{\beta}=\bm{\Pi}_{S}\bm{\beta}^{\prime}$ . Upon changing variables, the objective takes the form

\frac{1}{2}(\bm{\beta}^{\prime})^{\top}\bm{\Pi}_{S}^{\top}\mathbf{C}\bm{\Pi}_{S}\bm{\beta}^{\prime}-\mathbf{d}^{\top}\bm{\beta}^{\prime}+\lambda\lVert\bm{\Pi}_{S}\bm{\beta}^{\prime}\rVert_{1}+\gamma\lVert\bm{\Pi}_{S}\bm{\beta}^{\prime}\rVert_{2}^{2}=\frac{1}{2}(\bm{\beta}^{\prime})^{\top}\mathbf{C}\bm{\beta}^{\prime}-\mathbf{d}^{\top}\bm{\beta}^{\prime}+\lambda\lVert\bm{\beta}^{\prime}\rVert_{1}+\gamma\lVert\bm{\beta}^{\prime}\rVert_{2}^{2},

where the equality follows because $\bm{\Pi}_{S}^{\top}\mathbf{C}\bm{\Pi}_{S}=\mathbf{C}$ and because the 1-norm and 2-norm are invariant under permutation. Now, the objective on the right-hand side is the objective with data $\mathbf{d}$ . If $\hat{\bm{\beta}}$ is the solution with data $\mathbf{d}$ , it follows that $\bm{\Pi}_{S}\hat{\bm{\beta}}$ is the solution with data $\bm{\Pi}_{S}\mathbf{d}$ , and vice versa. This proves the lemma. ∎
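The lemma can be checked numerically on a toy instance. The sketch below solves problem (12) by cyclic coordinate descent, which is one standard choice and not necessarily the solver used by the authors, and then compares the solutions obtained with $\mathbf{d}$ and with $\bm{\Pi}_{S}\mathbf{d}$. All names and the numerical values of $\mathbf{\Sigma}$, $\mathbf{s}$, $\mathbf{d}$ and $\lambda$ are hypothetical.

```python
def soft_threshold(x, t):
    """Soft-thresholding operator sign(x) * max(|x| - t, 0)."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def solve_quadratic_lasso(C, d, lam, gamma=0.0, n_sweeps=500):
    """Coordinate descent for 0.5 b'Cb - d'b + lam ||b||_1 + gamma ||b||_2^2."""
    m = len(d)
    beta = [0.0] * m
    for _ in range(n_sweeps):
        for j in range(m):
            # Linear term seen by coordinate j with the others held fixed.
            r = d[j] - sum(C[j][k] * beta[k] for k in range(m) if k != j)
            beta[j] = soft_threshold(r, lam) / (C[j][j] + 2.0 * gamma)
    return beta


def swap_pairs(vec, S, p):
    """Swap entries j and j + p of a length-2p vector for every j in S."""
    out = list(vec)
    for j in S:
        out[j], out[j + p] = out[j + p], out[j]
    return out


# Toy instance with p = 2: C = [[Sigma, Sigma - D], [Sigma - D, Sigma]].
p = 2
Sigma = [[1.0, 0.5], [0.5, 1.0]]
s = [0.6, 0.6]
C = [[Sigma[a % p][b % p] - (s[a % p] if (a % p == b % p and a != b) else 0.0)
      for b in range(2 * p)] for a in range(2 * p)]
d = [0.9, -0.2, 0.1, 0.3]
lam = 0.05
S = [0]  # swap feature 0 with its knockoff

beta = solve_quadratic_lasso(C, d, lam)
beta_swapped = solve_quadratic_lasso(C, swap_pairs(d, S, p), lam)
W = [abs(beta[j]) - abs(beta[j + p]) for j in range(p)]
W_swapped = [abs(beta_swapped[j]) - abs(beta_swapped[j + p]) for j in range(p)]
```

Up to solver tolerance, `beta_swapped` coincides with `swap_pairs(beta, S, p)`, so `W_swapped[0]` equals `-W[0]` while `W_swapped[1]` equals `W[1]`, which is the flip-sign behavior the lemma is used to establish.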

Corollary 2.

Algorithm 4 with a fixed $\lambda$ controls the FDR of rejections at level $q$ .

Proof.

It is easy to show that $\begin{bmatrix}\mathbf{\Sigma}&\mathbf{\Sigma}-\mathbf{D}\\ \mathbf{\Sigma}-\mathbf{D}&\mathbf{\Sigma}\end{bmatrix}$ is $S$ -swap invariant for any $S\subset\{1,...,p\}$ . Taking

\mathbf{C}=\begin{bmatrix}\mathbf{\Sigma}&\mathbf{\Sigma}-\mathbf{D}\\ \mathbf{\Sigma}-\mathbf{D}&\mathbf{\Sigma}\end{bmatrix}

and

\mathbf{d}=\frac{1}{n}\begin{bmatrix}\mathbf{X}^{\top}\mathbf{Y}\vspace{1mm}\\ \widetilde{\mathbf{X}}^{\top}\mathbf{Y}\end{bmatrix}

in Lemma 1 establishes the flip-sign property of $\mathbf{W}$ and, therefore, the FDR control of Algorithm 4 for a fixed $\lambda$ . ∎

In practice, to ensure numerical stability, we add a small positive constant multiple of the identity matrix to

\begin{bmatrix}\mathbf{\Sigma}&\mathbf{\Sigma}-\mathbf{D}\\ \mathbf{\Sigma}-\mathbf{D}&\mathbf{\Sigma}\end{bmatrix}

when solving for $\hat{\bm{\beta}}$ . This is equivalent to incorporating a small Ridge penalty into the objective function. It is easy to see that the lemma proved above guarantees that this modification does not compromise the FDR control as

\begin{bmatrix}\mathbf{\Sigma}+c\mathbf{I}&\mathbf{\Sigma}-\mathbf{D}\\ \mathbf{\Sigma}-\mathbf{D}&\mathbf{\Sigma}+c\mathbf{I}\end{bmatrix}

is also $S$ -swap invariant for any $c\in\mathbb{R}$ and any $S\subset\{1,...,p\}.$

4.2.2 Choice of tuning parameter

Several methods can be used to tune the value of the hyperparameter $\lambda$ . We here consider two approaches.

Method 1 (lasso-min)

Pretend a homogeneous Gaussian linear model holds, i.e. $\mathbf{Y}=\mathbf{X}\bm{\beta}^{*}+\sigma\bm{\epsilon}$ for some $\bm{\beta}^{*}\in\mathbb{R}^{p}$ , $\sigma>0$ and $\bm{\epsilon}\sim N(\mathbf{0},\mathbf{I}_{n})$ .

Focus on (11) first and imagine that we have a method for computing $\lambda$ that depends on data only through $\lVert\mathbf{Y}\rVert_{2}^{2},\mathbf{X}^{\top}\mathbf{Y}$, and $\widetilde{\mathbf{X}}^{\top}\mathbf{Y}$. Note that the objective in Algorithm 4 only substitutes $\widetilde{\mathbf{X}}^{\top}\mathbf{Y}$ in (11) with $\mathbf{P}^{\top}\mathbf{X}^{\top}\mathbf{Y}+\lVert\mathbf{Y}\rVert_{2}\mathbf{Z}$. Therefore, by Proposition 2, if we set $\lambda$ via the same functional and work with $\mathbf{P}^{\top}\mathbf{X}^{\top}\mathbf{Y}+\lVert\mathbf{Y}\rVert_{2}\mathbf{Z}$ in lieu of $\widetilde{\mathbf{X}}^{\top}\mathbf{Y}$, we shall achieve FDR control with this data-driven value of the hyperparameter $\lambda$. This holds, of course, with the proviso that our selection of the hyperparameter is symmetric, in the sense that it produces feature importance statistics obeying the flip-sign property.

To set the tuning parameter $\lambda_{0}$ in (11), we use the common choice of taking a constant multiple of the expected value of the minimum $\lambda$ value such that $\hat{\bm{\beta}}(\lambda)=\mathbf{0}_{2p}$ under the null model $\mathbf{Y}=\sigma\bm{\epsilon}$ . By the Karush–Kuhn–Tucker (KKT) conditions (Boyd and Vandenberghe, 2004), this results in a tuning parameter of the form

\lambda_{0}=\kappa\cdot\frac{\sigma}{n}\cdot\mathbb{E}[\lVert\begin{bmatrix}% \mathbf{X}&\widetilde{\mathbf{X}}\end{bmatrix}^{\top}\bm{\epsilon}\rVert_{% \infty}],

where $\kappa$ is a hyperparameter between 0 and 1. Since $\begin{bmatrix}\mathbf{X}&\widetilde{\mathbf{X}}\end{bmatrix}$ is a data matrix whose rows are iid samples from

\mathcal{N}\left(\mathbf{0},\begin{bmatrix}\mathbf{\Sigma}&\mathbf{\Sigma}-% \mathbf{D}\\ \mathbf{\Sigma}-\mathbf{D}&\mathbf{\Sigma}\end{bmatrix}\right),

$\mathbb{E}[\lVert\begin{bmatrix}\mathbf{X}&\widetilde{\mathbf{X}}\end{bmatrix}^{\top}\bm{\epsilon}\rVert_{\infty}]$ is a numerical constant, which can be estimated arbitrarily well via Monte Carlo simulations. We use the approach from Dicker (2014) to give an estimate of $\sigma$, which crucially requires knowing only $\lVert\mathbf{Y}\rVert_{2}^{2},\mathbf{X}^{\top}\mathbf{Y}$, and $\widetilde{\mathbf{X}}^{\top}\mathbf{Y}$. Dicker (2014) showed that the estimator is consistent and asymptotically normal in the high-dimensional regime. Specifically, in our setting, we estimate $\sigma$ by

\widehat{\sigma}_{0}=\sqrt{\text{max}\left(\frac{2p+n+1}{n(n+1)}\lVert\mathbf{% Y}\rVert_{2}^{2}-\frac{1}{n(n+1)}\mathbf{Y}^{\top}\begin{bmatrix}\mathbf{X}&% \widetilde{\mathbf{X}}\end{bmatrix}\begin{bmatrix}\mathbf{\Sigma}&\mathbf{% \Sigma}-\mathbf{D}\\ \mathbf{\Sigma}-\mathbf{D}&\mathbf{\Sigma}\end{bmatrix}^{-1}\begin{bmatrix}% \mathbf{X}&\widetilde{\mathbf{X}}\end{bmatrix}^{\top}\mathbf{Y},0\right)}.

In sum, a choice for $\lambda$ in Algorithm 4 is this:

1. Approximate $\mathbb{E}[\lVert\mathbf{R}^{\top}\bm{\epsilon}\rVert_{\infty}]$ via Monte Carlo simulations, where $\mathbf{R}\in\mathbb{R}^{n\times 2p}$ has iid $\mathcal{N}\left(\mathbf{0},\begin{bmatrix}\mathbf{\Sigma}&\mathbf{\Sigma}-\mathbf{D}\\ \mathbf{\Sigma}-\mathbf{D}&\mathbf{\Sigma}\end{bmatrix}\right)$ rows and $\bm{\epsilon}\sim N(\mathbf{0},\mathbf{I}_{n})$ is independent of $\mathbf{R}$.

2. Compute

\widehat{\sigma}_{0}=\sqrt{\text{max}\left(\frac{2p+n+1}{n(n+1)}\lVert\mathbf{Y}\rVert_{2}^{2}-\frac{1}{n(n+1)}\begin{bmatrix}\mathbf{Y}^{\top}\mathbf{X}&\mathbf{Y}^{\top}\mathbf{X}\mathbf{P}+\lVert\mathbf{Y}\rVert_{2}\mathbf{Z}^{\top}\end{bmatrix}\begin{bmatrix}\mathbf{\Sigma}&\mathbf{\Sigma}-\mathbf{D}\\ \mathbf{\Sigma}-\mathbf{D}&\mathbf{\Sigma}\end{bmatrix}^{-1}\begin{bmatrix}\mathbf{X}^{\top}\mathbf{Y}\\ \mathbf{P}^{\top}\mathbf{X}^{\top}\mathbf{Y}+\lVert\mathbf{Y}\rVert_{2}\mathbf{Z}\end{bmatrix},0\right)},

where $\mathbf{Z}$ is independent of everything else.

3. Output $\lambda\approx\kappa\cdot\frac{\widehat{\sigma}_{0}}{n}\cdot\mathbb{E}[\lVert\mathbf{R}^{\top}\bm{\epsilon}\rVert_{\infty}]$, where the approximation sign $\approx$ reminds us that the expectation is only approximated by Monte Carlo.

As in the square-root Lasso case, we observe that the power of our method is not very sensitive to the choice of $\kappa$ . We use $\kappa=0.6$ in our simulations below. In Appendix E, we provide details of computation of $\lambda$ and prove that Algorithm 4 maintains FDR control with the computed $\lambda$ .
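The three steps of the lasso-min rule can be sketched with NumPy as follows. This is an illustrative version under stated assumptions: the function name and argument names are hypothetical, `xkty` denotes the surrogate $\mathbf{P}^{\top}\mathbf{X}^{\top}\mathbf{Y}+\lVert\mathbf{Y}\rVert_{2}\mathbf{Z}$ computed beforehand, and the joint covariance is assumed to be strictly positive definite so that its Cholesky factor exists.

```python
import numpy as np


def lasso_min_lambda(xty, xkty, y_norm_sq, Sigma, s, n, kappa=0.6,
                     n_draws=200, seed=0):
    """Lasso-min penalty: kappa * sigma_hat / n * E||R^T eps||_inf."""
    rng = np.random.default_rng(seed)
    p = len(s)
    D = np.diag(s)
    G = np.block([[Sigma, Sigma - D], [Sigma - D, Sigma]])
    # Step 1: Monte Carlo estimate of E||R^T eps||_inf, rows of R ~ N(0, G).
    L = np.linalg.cholesky(G)  # requires G to be positive definite
    sup_norms = []
    for _ in range(n_draws):
        R = rng.standard_normal((n, 2 * p)) @ L.T
        eps = rng.standard_normal(n)
        sup_norms.append(np.max(np.abs(R.T @ eps)))
    expected_sup = float(np.mean(sup_norms))
    # Step 2: noise-level estimate from summary statistics, truncated at 0.
    v = np.concatenate([xty, xkty])
    quad = v @ np.linalg.solve(G, v)
    sigma_sq = ((2 * p + n + 1) * y_norm_sq - quad) / (n * (n + 1))
    sigma_hat = np.sqrt(max(sigma_sq, 0.0))
    # Step 3: combine the two pieces.
    return kappa * sigma_hat / n * expected_sup
```

Since the expectation in step 1 depends only on $\mathbf{\Sigma}$, $\mathbf{D}$ and $n$, it could be computed once and reused across different responses.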

Method 2 (pseudo-sum)

An alternative way of choosing $\lambda$ is to adapt the pseudo-summary statistics approach proposed by Zhang et al. (2021). Set $\mathbf{r}=\mathbf{X}^{\top}\mathbf{Y}/n$ and $\widetilde{\mathbf{r}}=\mathbf{P}^{\top}\mathbf{r}+\lVert\mathbf{Y}\rVert_{2}% \mathbf{Z}/n$ . The main idea of Zhang et al. (2021) is to generate training summary statistics $\mathbf{r}_{t}$ and validation summary statistics $\mathbf{r}_{v}$ from $\mathbf{r}$ and $\widetilde{\mathbf{r}}$ based on the training and validation sample sizes $n_{t}$ and $n_{v}$ respectively (in this paper we take $n_{t}=0.8n$ and $n_{v}=0.2n$ ). Following Zhang et al. (2021), we generate the training summary statistics

\begin{bmatrix}\mathbf{r}\\ \widetilde{\mathbf{r}}\end{bmatrix}_{t}=\begin{bmatrix}\mathbf{r}\\ \widetilde{\mathbf{r}}\end{bmatrix}+\sqrt{\frac{n_{v}}{n\times n_{t}}}\mathbf{% R},

where

\mathbf{R}\sim\mathcal{N}\left(\mathbf{0},\begin{bmatrix}\mathbf{\Sigma}&% \mathbf{\Sigma}-\mathbf{D}\\ \mathbf{\Sigma}-\mathbf{D}&\mathbf{\Sigma}\end{bmatrix}\right),

and the validation summary statistics

\begin{bmatrix}\mathbf{r}\\ \widetilde{\mathbf{r}}\end{bmatrix}_{v}=\frac{1}{n_{v}}\left[n\begin{bmatrix}% \mathbf{r}\\ \widetilde{\mathbf{r}}\end{bmatrix}-n_{t}\begin{bmatrix}\mathbf{r}\\ \widetilde{\mathbf{r}}\end{bmatrix}_{t}\right].

Given a sequence of candidate $\lambda$ values, we choose that which maximizes an approximation $f(\lambda)$ of the correlation between the predicted values and the true values on the pseudo-validation set. (Unlike the previous approach, this tuning parameter choice does not induce the exact flip-sign property. However, we observe empirically that our method is robust to this issue, and no FDR inflation occurred. In theory, one could randomly swap all the variables with their corresponding knockoffs and compute the average of all the $\lambda$ values obtained. In the limit, the average gives a data-driven value of $\lambda$ that is invariant to swapping variables with their knockoffs due to symmetry.) Specifically, Zhang et al. (2021) considered the approximation

f(\lambda)=\frac{\hat{\bm{\beta}}^{\top}_{t,\lambda}\begin{bmatrix}\mathbf{r}% \\ \widetilde{\mathbf{r}}\end{bmatrix}_{v}}{\sqrt{\hat{\bm{\beta}}^{\top}_{t,% \lambda}\begin{bmatrix}\mathbf{\Sigma}&\mathbf{\Sigma}-\mathbf{D}\\ \mathbf{\Sigma}-\mathbf{D}&\mathbf{\Sigma}\end{bmatrix}\hat{\bm{\beta}}_{t,% \lambda}}},

(13)

where

\hat{\bm{\beta}}_{t,\lambda}=\operatorname*{arg\,min}_{\bm{\beta}\in\mathbb{R}% ^{2p}}\frac{1}{2}\bm{\beta}^{\top}\begin{bmatrix}\mathbf{\Sigma}&\mathbf{% \Sigma}-\mathbf{D}\\ \mathbf{\Sigma}-\mathbf{D}&\mathbf{\Sigma}\end{bmatrix}\bm{\beta}-\bm{\beta}^{% \top}\begin{bmatrix}\mathbf{r}\\ \widetilde{\mathbf{r}}\end{bmatrix}_{t}+\lambda||\bm{\beta}||_{1}.

(14)

Therefore, we choose the $\lambda$ value that maximizes (13) among a set of candidate values. Since the objective function (11) is convex in $\bm{\beta}$ , we may employ the BASIL framework proposed by Qian et al. (2020), which implements a batch version of the strong rules introduced in Tibshirani et al. (2012). BASIL can be directly applied to compute the solution path of (14) efficiently.
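The pseudo-training and pseudo-validation split and the criterion (13) can be sketched as below. This is a hedged NumPy illustration: the function names are hypothetical, `r_full` stands for the stacked vector of $\mathbf{r}$ and $\widetilde{\mathbf{r}}$, `G` for the joint covariance matrix, and the solver that produces $\hat{\bm{\beta}}_{t,\lambda}$ from (14) is left to the reader.

```python
import numpy as np


def pseudo_split(r_full, G, n, n_train, rng):
    """Pseudo training/validation summary statistics from full-sample r_full."""
    n_val = n - n_train
    noise = rng.multivariate_normal(np.zeros(len(r_full)), G)
    # Training statistics: full-sample statistics plus scaled Gaussian noise.
    r_train = r_full + np.sqrt(n_val / (n * n_train)) * noise
    # Validation statistics: chosen so that n r = n_t r_t + n_v r_v.
    r_val = (n * r_full - n_train * r_train) / n_val
    return r_train, r_val


def validation_score(beta, r_val, G):
    """Approximate correlation f(lambda) between predictions and responses."""
    denom = np.sqrt(beta @ G @ beta)
    # An all-zero coefficient vector carries no predictive information.
    return float(beta @ r_val / denom) if denom > 0 else float("-inf")
```

One would then evaluate `validation_score` at the solution of (14) for each candidate $\lambda$ and keep the maximizer.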

Note that there exist other ways to choose the penalty level $\lambda$ using $\mathbf{X}^{\top}\mathbf{Y},\lVert\mathbf{Y}\rVert_{2}$ and $n$ (for example, the Lassosum by Mak et al. (2017)). We do not attempt to claim an optimal strategy.

Connection with the scout procedure

It turns out that step 3 of Algorithm 4 is closely related to the scout procedure (Witten and Tibshirani, 2009). The scout procedure defines a family of covariance-regularized regression methods that achieve superior prediction via shrinking the inverse covariance matrix. It includes the Lasso, Ridge and Elastic Net as special cases. In Appendix F, we show that the solution of objective function (11) is proportional to that of the scout procedure (with known precision matrix $\mathbf{\Sigma}^{-1}$). This connection provides a justification for why the objective function (11) is effective.

4.2.3 GhostKnockoffs with other feature importance statistics

In the previous sections, we presented a feature importance statistic based on summary statistics that leads to better power than the marginal correlation difference statistic. By Proposition 2, GhostKnockoffs techniques can be combined with any other feature importance statistics that i) are based on the summary statistics $\mathbf{X}^{\top}\mathbf{Y}$, $\lVert\mathbf{Y}\rVert^{2}$ and the sample size $n$ and ii) satisfy the flip-sign property. The resulting procedures still guarantee FDR control. In our simulation studies, we found that using the posterior inclusion probability (PIP) produced by the SuSiE-RSS model (Zou et al., 2022) as the feature importance statistic also results in consistent power improvement over GK-marginal. SuSiE-RSS is based on the Sum of Single Effects (SuSiE) model proposed by Wang et al. (2020), which assumes a Bayesian linear model with true coefficients $\bm{\beta}$ represented as the sum of multiple one-hot (random) individual effect vectors. Zou et al. (2022) combine SuSiE with a modified likelihood function to accommodate applications in which only summary statistics are available (see Zou et al. (2022) for details). (We used the susie_rss function inside the R package susieR in our simulations.) We call the resulting procedure GhostKnockoffs with SuSiE-RSS statistic and denote it by GK-susie-rss. We include this method in the simulation section below.

4.3 Variants of GhostKnockoffs

The methods we presented so far can be adapted to work with various related procedures. We give three examples below for illustration.

4.3.1 Multi-knockoffs

The knockoffs procedure is a randomized procedure that can produce very different selection sets on different runs. This is especially true when the knockoffs rejection set is small. In fact, the offset in the numerator of (3) implies that knockoffs either rejects at least $\lceil\frac{1}{q}\rceil$ hypotheses, where $q$ is the target FDR level, or rejects nothing. To improve the stability of the knockoffs procedure, Gimenez and Zou (2019) proposed simultaneous multi-knockoffs, which is substantially more stable and powerful than knockoffs when the rejection set is small and maintains FDR control in general.
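The all-or-nothing behavior can be seen directly in a small implementation of the filter. The sketch below assumes that the threshold in (3) is the knockoff+ threshold of Barber and Candès (2015), that is, the smallest $t>0$ with $(1+\#\{j:W_{j}\leq-t\})/\max(\#\{j:W_{j}\geq t\},1)\leq q$; the function name is hypothetical.

```python
def knockoff_plus_select(W, q):
    """Indices selected by the knockoff+ filter at target FDR level q."""
    # Candidate thresholds are the distinct non-zero magnitudes of W.
    candidates = sorted({abs(w) for w in W if w != 0.0})
    for t in candidates:
        n_neg = sum(1 for w in W if w <= -t)
        n_pos = sum(1 for w in W if w >= t)
        # The "+1" offset forces n_pos >= 1/q before anything is selected.
        if (1 + n_neg) / max(n_pos, 1) <= q:
            return [j for j, w in enumerate(W) if w >= t]
    return []
```

With $q=0.2$, five positive statistics and no negative ones yield five selections, whereas four positive statistics yield none, since $1/4>0.2$.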

The idea of Gimenez and Zou (2019) is to create $M$ (instead of one) knockoff copies for every feature so that they jointly satisfy an extended exchangeability condition. (Specifically, the extended exchangeability condition says that if we permute variables with their corresponding (multiple) knockoffs arbitrarily, the joint distribution remains unchanged.) If $X\sim\mathcal{N}(\mathbf{0},\mathbf{\Sigma})$, Gimenez and Zou (2019) showed that $\widetilde{X}\in\mathbb{R}^{pM}$ is a valid $M$ multi-knockoff for $X\in\mathbb{R}^{p}$ if $\begin{bmatrix}X&\widetilde{X}\end{bmatrix}\sim\mathcal{N}(\mathbf{0},\mathbf{G})$, where

\mathbf{G}=\begin{bmatrix}\mathbf{\Sigma}&\mathbf{\Sigma}-\mathbf{D}&\cdots&% \mathbf{\Sigma}-\mathbf{D}\\ \mathbf{\Sigma}-\mathbf{D}&\mathbf{\Sigma}&\cdots&\mathbf{\Sigma}-\mathbf{D}\\ \vdots&\vdots&\ddots&\vdots\\ \mathbf{\Sigma}-\mathbf{D}&\cdots&\cdots&\mathbf{\Sigma}\end{bmatrix}\in% \mathbb{R}^{(M+1)p\times(M+1)p},

Here, $\mathbf{D}=\text{diag}\{\mathbf{s}\}$ , and $\mathbf{s}$ is obtained by solving a more restrictive convex optimization problem than in (15) which guarantees that $\mathbf{G}$ is positive semi-definite (see Gimenez and Zou (2019) for details). In data matrix form, we generate valid $M$ multi-knockoffs by

\widetilde{\mathbf{X}}=\mathbf{X}\mathbf{P}+\mathbf{E}\mathbf{V}^{1/2},

where $\mathbf{P}=\begin{bmatrix}\mathbf{I}-\mathbf{\Sigma}^{-1}\mathbf{D}&\cdots&% \mathbf{I}-\mathbf{\Sigma}^{-1}\mathbf{D}\end{bmatrix}\in\mathbb{R}^{p\times Mp},$ $\mathbf{E}\in\mathbb{R}^{n\times Mp}$ has i.i.d. standard normal entries, and

\mathbf{V}=\begin{bmatrix}2\mathbf{D}-\mathbf{D}\mathbf{\Sigma}^{-1}\mathbf{D}% &\mathbf{D}-\mathbf{D}\mathbf{\Sigma}^{-1}\mathbf{D}&\cdots&\mathbf{D}-\mathbf% {D}\mathbf{\Sigma}^{-1}\mathbf{D}\\ \mathbf{D}-\mathbf{D}\mathbf{\Sigma}^{-1}\mathbf{D}&2\mathbf{D}-\mathbf{D}% \mathbf{\Sigma}^{-1}\mathbf{D}&\cdots&\mathbf{D}-\mathbf{D}\mathbf{\Sigma}^{-1% }\mathbf{D}\\ \vdots&\vdots&\ddots&\vdots\\ \mathbf{D}-\mathbf{D}\mathbf{\Sigma}^{-1}\mathbf{D}&\mathbf{D}-\mathbf{D}% \mathbf{\Sigma}^{-1}\mathbf{D}&\cdots&2\mathbf{D}-\mathbf{D}\mathbf{\Sigma}^{-% 1}\mathbf{D}\end{bmatrix}.

Gimenez and Zou (2019) generalized the knockoffs threshold (3) and the flip-sign property to produce FDR-controlling rejection sets after generating multiple knockoffs via this procedure.
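The block structure of $\mathbf{G}$, $\mathbf{P}$ and $\mathbf{V}$ lends itself to Kronecker products. The following NumPy sketch builds the three matrices for a given $\mathbf{\Sigma}$, $\mathbf{s}$ and $M$; the function name is hypothetical, and $\mathbf{s}$ is assumed to have been obtained beforehand so that $\mathbf{G}$ is positive semi-definite.

```python
import numpy as np


def multi_knockoff_matrices(Sigma, s, M):
    """Return (G, P, V) for M Gaussian multi-knockoffs with D = diag(s)."""
    p = Sigma.shape[0]
    D = np.diag(s)
    Sigma_inv_D = np.linalg.solve(Sigma, D)
    # G: Sigma on the diagonal blocks, Sigma - D off the diagonal.
    G = np.kron(np.ones((M + 1, M + 1)), Sigma - D) + np.kron(np.eye(M + 1), D)
    # P: M horizontal copies of I - Sigma^{-1} D.
    P = np.kron(np.ones((1, M)), np.eye(p) - Sigma_inv_D)
    # V: 2D - D Sigma^{-1} D on the diagonal blocks, D - D Sigma^{-1} D elsewhere.
    V = np.kron(np.ones((M, M)), D - D @ Sigma_inv_D) + np.kron(np.eye(M), D)
    return G, P, V
```

A quick consistency check is that $\mathbf{\Sigma}\mathbf{P}$ reproduces the upper-right block of $\mathbf{G}$ and $\mathbf{P}^{\top}\mathbf{\Sigma}\mathbf{P}+\mathbf{V}$ its lower-right block, as implied by $\widetilde{\mathbf{X}}=\mathbf{X}\mathbf{P}+\mathbf{E}\mathbf{V}^{1/2}$.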

In the summary statistics settings, upon redefining $\mathbf{P}$ , $\mathbf{V}$ and $\mathbf{s}$ as above and replacing the standard knockoffs filter by the multi-knockoffs filter, Algorithms 2 and 3 produce rejection sets that have the same distribution as those produced by their corresponding versions with individual-level data. For Algorithm 4, we simply need to further replace

\begin{bmatrix}\mathbf{\Sigma}&\mathbf{\Sigma}-\mathbf{D}\\ \mathbf{\Sigma}-\mathbf{D}&\mathbf{\Sigma}\end{bmatrix}

by $\mathbf{G}$ .

4.3.2 Group knockoffs

When variables are highly correlated, selection procedures become conservative. For example, if a non-null variable $X_{j}$ is highly correlated with a null variable $X_{k}$, it becomes difficult to reject $X_{j}\perp\!\!\!\perp Y|X_{-j}$. This is an important practical concern because highly correlated features are ubiquitous in many settings, particularly GWAS datasets. To overcome this challenge, group knockoffs (Dai and Barber, 2016) can be useful; see Chu et al. (2023), whose algorithms we employ in the data analyses of Section 5. In group knockoffs, the object of inference is shifted from single variables to groups of highly correlated variables. Specifically, suppose we partition $p$ features into $g$ groups and reorder all features such that features of the same group are in adjacent columns of $\mathbf{X}$. The objective is to test the group conditional independence hypotheses:

\displaystyle H_{\gamma}^{0}:X_{\gamma}\perp\!\!\!\perp Y\mid X_{-\gamma}

where $\gamma\in\{1,...,g\}$ denotes a group and $X_{\gamma}$ is the vector of features in group $\gamma$ . When these groups have strong correlation, single-variable knockoffs may struggle to identify signals, but group knockoffs retain power to identify significant groups. As in Section 4.3.1, all methods described in this paper apply to group knockoffs after redefining $\mathbf{D}$ to the equivalent version in group knockoffs. In Appendix G, we detail the construction of group knockoffs and examples of importance scores at the group level for inference.

4.3.3 Conditional randomization test

The conditional randomization test (CRT) (Candès et al., 2018) is an alternative method to test the conditional independence hypotheses $H_{j}:X_{j}\perp\!\!\!\perp Y\mid X_{-j}$ for $1\leq j\leq p$ . By generating a valid CRT $p$ -value $p_{j}$ for each hypothesis ${H}_{j}$ , existing multiple testing procedures, including the Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995) and the selective SeqStep+ filter (Li and Candès, 2021), can be used to simultaneously test $H_{1},\ldots,H_{p}$ with FDR control. (In general, CRT $p$ -values may not be independent of each other or satisfy the PRDS property (Benjamini and Yekutieli, 2001). Therefore, applying the Benjamini-Hochberg procedure to CRT $p$ -values does not guarantee FDR control theoretically. However, as noted in Candès et al. (2018), the FDR is usually under control empirically.) As shown in Candès et al. (2018) and Wang and Janson (2021), doing so can improve the power of multiple testing at the cost of greater computational complexity.

In Appendix H, we introduce GhostKnockoffs for CRT (GhostCRT), which adapts the techniques introduced in this paper to the framework of CRT.
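To make the multiple testing step concrete, the following is a minimal sketch of the standard Benjamini-Hochberg step-up procedure applied to a vector of $p$ -values. The function name `benjamini_hochberg` and the toy $p$ -values are illustrative assumptions, not part of the paper's software; as noted above, applying it to CRT $p$ -values controls the FDR only empirically.

```python
import numpy as np

def benjamini_hochberg(pvals, q):
    """Indices rejected by the BH step-up procedure at target FDR level q.

    Hypothetical helper for illustration only.
    """
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size
    order = np.argsort(pvals)                    # indices sorted by p-value
    thresholds = q * np.arange(1, m + 1) / m     # q * k / m for k = 1..m
    below = pvals[order] <= thresholds
    if not below.any():
        return np.array([], dtype=int)
    k = np.max(np.nonzero(below)[0])             # largest k with p_(k) <= q k / m
    return np.sort(order[: k + 1])               # reject the k smallest p-values

# Toy p-values (made up for this example).
pvals = [0.001, 0.008, 0.039, 0.041, 0.20, 0.60]
rejected = benjamini_hochberg(pvals, q=0.1)
```

Here the four smallest $p$ -values fall below their thresholds $q\,k/m$ , so the first four hypotheses are rejected.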

4.4 Numerical simulations

We conduct simulations on synthetic data as well as semi-synthetic data generated from a real-world genetic dataset. Specifically, we apply GhostKnockoffs with pseudo-lasso statistic (GK-pseudolasso, defined in Algorithm 4 with tuning parameter $\lambda$ chosen by either lasso-min or pseudo-sum from Section 4.2.2) and GhostKnockoffs with SuSiE-RSS statistic (GK-susie-rss, defined in Section 4.2.3). We compare their performance with GhostKnockoffs with marginal correlation difference statistic (GK-marginal, defined in Section 2) and the knockoffs procedure with (cross-validated) Lasso coefficient difference statistic based on individual-level data (KF-lassocv). We also demonstrate empirically the robustness of our procedures by showing the FDR control when only an estimate of the true covariance matrix $\mathbf{\Sigma}$ is available and when the features are discrete.

4.4.1 Simulations based on real-world genetic data

To mimic the dependency structure among features in real-world applications, we generate synthetic data based on the whole genome sequencing (WGS) data from the Alzheimer’s Disease Sequencing Project (ADSP). The data are obtained from the ADSP consortium following the SNP/Indel Variant Calling Pipeline and data management tool (VCPA) (Leung et al., 2019). The ADSP WGS data records counts of minor alleles of genetic variants over 16,906 individuals. Using reference populations from the 1000 Genomes Consortium (The 1000 Genomes Project Consortium, 2015), we estimate ancestry rates of each individual by SNPWeights v2.1 (Chen et al., 2013) and extract 6,952 individuals with estimated European ancestry rate greater than 80%. We further restrict our simulations to 2,000 randomly selected genetic variants within 0.5Mb distance to the APOE gene (chr19:44909011-45912650; hg38), whose $\varepsilon$ 2 allele and $\varepsilon$ 4 allele are known to be respectively the strongest genetic protective factor and the strongest genetic risk factor for Alzheimer’s disease (Serrano-Pozo et al., 2021; Belloy et al., 2023), and with minor allele frequency (MAF) larger than $0.01$ . Since our simulations focus on performance at identifying relevant clusters of tightly linked variants, we simplify the simulation design by pruning variants to eliminate pairs with absolute correlation greater than $0.75$ . To do so, we first compute the correlation matrix $[\text{cor}(X_{j},X_{k})]_{2000\times 2000}$ of the 2,000 selected variants over the 6,952 extracted individuals using the shrinkage estimate in the R package corpcor (Schäfer and Strimmer, 2005) and apply hierarchical clustering (single-linkage with cutoff value $0.25$ ) on the distance matrix $[1-|\text{cor}(X_{j},X_{k})|]_{2000\times 2000}$ . As a result, we obtain 512 variant clusters such that pairwise correlation between any pair of variants from different clusters is in $[-0.75,0.75]$ . 
By randomly choosing one representative variant from each cluster, we include $p=512$ tested genetic variants in the simulation study.
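The pruning step can be sketched as follows, assuming SciPy's hierarchical clustering routines stand in for whatever software was actually used. The function name `pick_cluster_representatives` and the toy correlation matrix are hypothetical. Cutting a single-linkage tree at distance $0.25$ on $1-|\text{cor}|$ guarantees that variants in different clusters have absolute correlation below $0.75$ .

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def pick_cluster_representatives(corr, cutoff=0.25, seed=0):
    """Cluster variables on 1 - |corr| (single linkage) and draw one per cluster."""
    rng = np.random.default_rng(seed)
    dist = np.clip(1.0 - np.abs(corr), 0.0, None)   # distance matrix 1 - |cor|
    dist = (dist + dist.T) / 2.0                    # enforce exact symmetry
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method="single")
    labels = fcluster(tree, t=cutoff, criterion="distance")
    reps = [rng.choice(np.flatnonzero(labels == c)) for c in np.unique(labels)]
    return np.sort(np.array(reps)), labels

# Toy correlation matrix with two tightly linked pairs (made-up values).
corr = np.array([
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
])
reps, labels = pick_cluster_representatives(corr)
```

In this toy example, variables 0 and 1 form one cluster and variables 2 and 3 another, so two representatives are retained.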

For each replicate, we obtain synthetic data by randomly sampling $n=3,000$ individuals without replacement and collecting the sampled individuals’ records on the $p=512$ tested genetic variants as the $n\times p$ covariate matrix $\mathbf{X}$ . We further sample another $n=3,000$ individuals without replacement as the reference panel on which we compute the correlation matrix $\mathbf{\Sigma}$ using the shrinkage estimate in the R package corpcor (Schäfer and Strimmer, 2005). Based on the covariate matrix $\mathbf{X}$ , we generate the response vector $\mathbf{Y}=(Y_{1},\ldots,Y_{n})^{\top}$ from either the linear model (continuous response),

Y_{i}=\beta_{1}X_{i1}+\ldots+\beta_{p}X_{ip}+\epsilon^{C}_{i},\quad\text{where }\epsilon^{C}_{i}\sim N(0,3^{2}),

or the mixed-effect logit model (binary response),

Y_{i}\sim\text{Bernoulli}(\mu_{i}),\quad\text{where }g(\mu_{i})=\beta_{0}+\beta_{1}X_{i1}+\ldots+\beta_{p}X_{ip}+\epsilon^{B}_{i},\ \epsilon^{B}_{i}\sim N(0,1^{2})\text{ and }g(x)=\log\Big(\frac{x}{1-x}\Big).

Specifically, $\beta_{0}$ under the mixed-effect logit model is $-\log(9)$ so that the prevalence (or the expected proportion of $Y_{i}=1$ ) is $10\%$ . $\epsilon^{C}_{i}$ ’s and $\epsilon^{B}_{i}$ ’s reflect variation due to unobserved covariates. Only $10$ randomly selected coefficients $\beta_{j}$ are nonzero, with value $\beta_{j}=\frac{1}{\sqrt{20\cdot m_{j}(1-m_{j})}}$ , where $m_{j}$ is the MAF of the $j$ -th variant.
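The two response models can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the simulation code used for the paper: the function name `simulate_responses` is hypothetical, and the toy genotypes are independent Binomial $(2,m_{j})$ draws instead of real ADSP data. Under that binomial assumption, $\text{Var}(\beta_{j}X_{ij})=2m_{j}(1-m_{j})/\{20\,m_{j}(1-m_{j})\}=0.1$ for each nonzero coefficient.

```python
import numpy as np

def simulate_responses(X, maf, n_signals=10, seed=0):
    """Draw a continuous and a binary response from the two models above."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    support = rng.choice(p, size=n_signals, replace=False)   # random causal set
    beta[support] = 1.0 / np.sqrt(20.0 * maf[support] * (1.0 - maf[support]))
    signal = X @ beta

    # Linear model with N(0, 3^2) noise.
    y_cont = signal + rng.normal(0.0, 3.0, size=n)

    # Mixed-effect logit model with intercept -log(9) and N(0, 1) random effect.
    eta = -np.log(9.0) + signal + rng.normal(0.0, 1.0, size=n)
    mu = 1.0 / (1.0 + np.exp(-eta))                           # inverse logit
    y_bin = rng.binomial(1, mu)
    return beta, y_cont, y_bin

# Toy genotype matrix: independent minor-allele counts (an assumption).
rng = np.random.default_rng(1)
n, p = 300, 50
maf = rng.uniform(0.05, 0.5, size=p)
X = rng.binomial(2, maf, size=(n, p)).astype(float)
beta, y_cont, y_bin = simulate_responses(X, maf)
```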

With the relevant summary statistics computed, we apply GK-pseudolasso and GK-susie-rss and compare their performances with GK-marginal and KF-lassocv.

Over 1000 replicates under both the linear model and the mixed-effect logit model, the average power and FDR of the different methods at different target FDR levels are shown in Figure 3. Under both models, we observe that GK-pseudolasso with both ways of selecting the tuning parameter and GK-susie-rss are uniformly more powerful than GK-marginal. The performance of the proposed methods is very close to that of KF-lassocv. Despite the covariance matrix being estimated using an independent sample and the entries of $\mathbf{X}$ being discrete, the FDRs of our proposed methods are controlled in both settings, suggesting the robustness of our methods.

GhostKnockoffs with discrete features

We note that discrete covariates do not follow a Gaussian distribution. However, the knockoffs procedure ensures FDR control whenever the feature importance statistics take the form $W_{j}=w(T_{j},T_{p+j})$ , where $w$ is an anti-symmetric function and $\mathbf{T}\in\mathbb{R}^{2p}$ is distributionally invariant upon swapping $T_{j}$ with $T_{j+p}$ for each null $j$ . Using Lemma 1, we know that Algorithm 4 controls the FDR if swapping the $j$ -th entry of $\mathbf{Z}=\mathbf{X}^{\top}\mathbf{Y}$ and the $j$ -th entry of $\tilde{\mathbf{Z}}=\mathbf{P}^{\top}\mathbf{X}^{\top}\mathbf{Y}+\lVert\mathbf{Y}\rVert_{2}\mathbf{Z}$ does not change their joint distribution for each null $j$ . In Appendix J, we visually demonstrate the approximate preservation of this distributional invariance. This, along with the robustness of knockoffs (Candès et al., 2018; Barber et al., 2020), helps explain why we have not observed FDR inflation with discrete covariates.

4.4.2 Independent features

We revisit the setting from Section 3.5.1 in which $\mathbf{\Sigma}=\mathbf{I}_{p}$ . For the pseudo-sum method for GK-pseudolasso, we optimize over $\lambda$ using a grid of 100 candidate values interpolating between $\lambda_{\text{max}}$ and $\lambda_{\text{max}}/1000$ linearly in log scale, where

\lambda_{\text{max}}=\frac{1}{n}\mathbb{E}\left[\left\|\begin{bmatrix}\mathbf{X}^{\top}\mathbf{Y}\\ \mathbf{P}^{\top}\mathbf{X}^{\top}\mathbf{Y}+\lVert\mathbf{Y}\rVert_{2}\mathbf{Z}\end{bmatrix}\right\|_{\infty}\right]

is the minimal $\lambda$ value that shrinks all the coefficients to zero. To calculate $\mathbb{E}[\lVert\mathbf{R}^{\top}\bm{\epsilon}\rVert_{\infty}]$ for the lasso-min parameter method, we use a Monte Carlo estimate averaged over 200 samples. The target FDR is 20%. Each point represents an average over 200 replications.
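The grid construction and a Monte Carlo approximation of $\lambda_{\text{max}}$ might look as follows. This is a sketch under assumptions: the helper names `lambda_grid` and `estimate_lambda_max` are hypothetical, the number of draws is our choice, and the toy inputs use $\mathbf{P}=\mathbf{0}$ and $\mathbf{V}=\mathbf{I}_{p}$ , which is what the standard Gaussian knockoff formulas give when $\mathbf{\Sigma}=\mathbf{D}=\mathbf{I}_{p}$ .

```python
import numpy as np

def lambda_grid(lambda_max, n_values=100, ratio=1000.0):
    """Candidate values from lambda_max down to lambda_max / ratio, log-spaced."""
    return np.geomspace(lambda_max, lambda_max / ratio, num=n_values)

def estimate_lambda_max(z, P, V, y_norm, n, n_draws=200, seed=0):
    """Monte Carlo estimate of (1/n) E || [z ; P^T z + ||Y||_2 Z] ||_inf, Z ~ N(0, V)."""
    rng = np.random.default_rng(seed)
    evals, evecs = np.linalg.eigh(V)
    V_half = evecs * np.sqrt(np.clip(evals, 0.0, None))   # V_half @ V_half.T equals V
    sup_norms = []
    for _ in range(n_draws):
        noise = V_half @ rng.standard_normal(z.size)      # one draw of Z ~ N(0, V)
        z_knock = P.T @ z + y_norm * noise                # knockoff Z-score
        sup_norms.append(max(np.abs(z).max(), np.abs(z_knock).max()))
    return float(np.mean(sup_norms)) / n

# Toy inputs (made up): z plays the role of X^T Y.
z = np.array([3.0, -1.0, 0.5])
lam_max = estimate_lambda_max(z, P=np.zeros((3, 3)), V=np.eye(3), y_norm=2.0, n=10)
grid = lambda_grid(lam_max)
```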

Note that when $\mathbf{\Sigma}=\mathbf{I}_{p}$ , the solution to (15) is $\mathbf{D}=\mathbf{I}_{p}$ . It is easy to see that (11) gives

\hat{\bm{\beta}}=\frac{1}{n}S_{\lambda}\left(\begin{bmatrix}\mathbf{X}&\widetilde{\mathbf{X}}\end{bmatrix}^{\top}\mathbf{Y}\right),

where the soft-threshold operator $S_{\lambda}(x)=\operatorname{sign}(x)(\lvert x\rvert-\lambda)_{+}$ is applied coordinate-wise. Therefore, the method in Section 4.2 soft-thresholds the marginal correlation of $\mathbf{X}$ and $\mathbf{Y}$ .
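For reference, the coordinate-wise soft-threshold operator can be written in one line of NumPy. The function name `soft_threshold` and the input values are illustrative assumptions.

```python
import numpy as np

def soft_threshold(x, lam):
    """Coordinate-wise S_lambda(x) = sign(x) * max(|x| - lambda, 0)."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# Entries with magnitude at most lambda are set to zero; the rest shrink by lambda.
shrunk = soft_threshold([-3.0, -0.5, 0.0, 0.5, 3.0], lam=1.0)
```

In this example the entries $\pm 3$ shrink to $\pm 2$ and the remaining entries become zero.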

As shown in Figure 4, all three new methods (GK-pseudolasso with lasso-min/pseudo-sum and GK-susie-rss) consistently outperform GK-marginal, and the FDR is always controlled at the expected level, as theoretically guaranteed. As $n/p$ grows, we see that the three new methods have power closer to KF-lassocv. This is further demonstrated in additional simulations in Appendix I.

4.4.3 AR(1) features

Figure 5 shows the corresponding plots when the covariate matrix is generated from an AR(1) distribution. We found similar patterns to those with independent features. The power of all methods drops when the autocorrelation coefficient increases, as it is then harder to separate true signals from other variables.

5 Application to meta-analysis for Alzheimer’s disease

To illustrate the empirical performance of the methods in detecting genetic variants associated with Alzheimer’s disease (AD), we apply them to a meta-analysis of nine large-scale array-based genome-wide association and whole-exome/-genome sequencing studies for AD. We include the details of the nine studies in Appendix K.

As all studies share the same focus on individuals with European ancestry, we perform a meta-analysis by aggregating their $Z$ -scores and obtain the meta-analysis $Z$ -score $\mathbf{Z}_{\text{meta}}$ (see Appendix L for details). In addition, we obtain the block-diagonal covariance matrix $\mathbf{\Sigma}$ with respect to approximately independent linkage disequilibrium blocks provided by Berisa and Pickrell (2016). Within each block, we use the UK Biobank directly genotyped data as the reference panel and compute the covariance matrix via the Pan-UKB consortium (https://pan.ukbb.broadinstitute.org) with details in Appendix M. To improve the power in the presence of tightly linked variants, we apply the group knockoffs construction on top of the GhostKnockoff algorithm, as detailed in Section 4.3.2. Finally, we implement GK-pseudolasso with tuning parameter chosen by the lasso-min method on the meta-analysis $Z$ -score $\mathbf{Z}_{\text{meta}}$ and the covariance matrix $\mathbf{\Sigma}$ . To stabilize the GhostKnockoffs procedures, we use $M=5$ multi-knockoffs as defined in Section 4.3.1.

Figure 6 presents the results of the meta-analysis of the nine studies via our proposed method with target FDR level 0.1. Here, we specify loci based on variant groups and annotate two loci as different loci if they are 1 Mb away from each other. We adopt the most proximal gene’s name as the locus name. (Specifically, we consider the variant group with the largest group knockoff feature importance statistic within a locus, and then map the locus to the most proximal gene of the variant within the group that has the highest knockoff importance score.) As shown by Table 1 in Appendix N, GK-pseudolasso identifies variant groups in 42 and 63 loci when the target FDR level is 0.1 and 0.2, respectively, substantially more than GK-marginal (10 and 17 when the target FDR level is 0.1 and 0.2, respectively). This is consistent with our simulation results in Section 4.4. In addition, we observe from Table 1 that GK-susie-rss identifies fewer loci (35 and 47 when the target FDR level is 0.1 and 0.2, respectively), although it exhibits similar power in simulation studies. In Appendix O, we analogously visualize results of the meta-analysis via the conventional marginal association test (with $p$ -value cutoff $5\times 10^{-8}$ ), GK-marginal (with target FDR level 0.10), and GK-susie-rss (with target FDR level 0.10).

Table 2 in Appendix N shows the top variant with the largest feature importance statistic in each identified group. Most discoveries exhibit relatively strong marginal associations (marginal $p$ -value $\leq 0.05$ ) in individual studies and the same direction of effects across all studies. Although some loci have an opposite direction of effect in one individual study, such effects are not significant. The consistency across individual studies supports the validity of the proposed method in discovering putative causal variants. In addition, we observe that all top variants of identified groups have small meta-analysis $p$ -values (less than 0.05), though some are not smaller than the stringent genome-wide threshold ( $5\times 10^{-8}$ ) in marginal association tests with FWER control.

To further investigate whether the identified groups are functionally enriched, we apply the SNP-to-gene linking strategy proposed by Gazal et al. (2022) to link the top variants of identified groups to the genes that they potentially regulate. Out of 63 top variants, we find that 34 (54.0%) can be mapped with functional evidence (e.g., being an expression quantitative trait locus, lying in a Hi-C linked enhancer region, or being near the exon of a gene), a proportion significantly higher than the average percentage in the background genome (28.6%). In summary, the proposed method can identify functional genetic variants with weaker statistical effects that are missed by conventional association tests.

6 Discussion

This paper introduced novel approaches for performing variable selection with FDR control on the basis of summary statistics. We proposed methods for testing conditional independence hypotheses from summary statistics alone. For the methods from Section 4, all we need are essentially the marginal correlations between $X$ and $Y$ (along with $\lVert\mathbf{Y}\rVert^{2}$ and $n$ ), which, at first sight, may appear surprising. Our arguments rely on the assumption that the covariates follow a Gaussian distribution, as well as on the linearity and rotational invariance of Gaussian distributions. Since our methods are based on the knockoffs procedure, they do not require any knowledge about the model of $Y$ given $X$ . Our methods extend, and generally give better power than, the work by He et al. (2022) by employing penalized regression to produce the measure of feature importance. The techniques employed in this paper provide a wrapper that can be combined with a variety of feature selection methods, yielding knockoffs versions that guarantee FDR control.

We applied our methods to genetic studies, in which summary statistics are typically available. Due to linkage disequilibrium, the application of our methods to individual genetic variants may yield conservative results. In parallel work (Chu et al., 2023), we have developed tools for constructing group knockoffs efficiently and effectively. When combined, our methods offer a powerful new approach to controlled variable selection in GWAS. This is further supported by our companion work (He et al., 2023), in which the methods in this paper led to significant scientific discoveries.

7 Acknowledgement

Z.C. would like to thank Kevin Guo and Amber Hu for helpful discussions. Z.C. was supported by the Simons Foundation under award 814641. Z.H. was supported by NIH/NIA awards AG066206 and AG066515. T.M. was supported by a B.C. and E.J. Eaves Stanford Graduate Fellowship. C.S. was supported by the grants NIH R56HG010812 and NSF DMS2210392. E.J.C. was supported by the Office of Naval Research grant N00014-20-1-2157.

References

Barber and Candès (2015) R. F. Barber and E. J. Candès. Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5):2055 – 2085, 2015. URL https://doi.org/10.1214/15-AOS1337.
Barber et al. (2020) R. F. Barber, E. J. Candès, and R. J. Samworth. Robust inference with knockoffs. 2020.
Bates et al. (2020) S. Bates, M. Sesia, C. Sabatti, and E. Candès. Causal inference in genetic trio studies. Proceedings of the National Academy of Sciences, 117(39):24117–24126, 2020.
Belloni et al. (2011) A. Belloni, V. Chernozhukov, and L. Wang. Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791–806, 2011.
Belloy et al. (2022a) M. E. Belloy, S. J. Eger, Y. Le Guen, V. Damotte, S. Ahmad, M. A. Ikram, A. Ramirez, A. C. Tsolaki, G. Rossi, I. E. Jansen, et al. Challenges at the APOE locus: a robust quality control approach for accurate APOE genotyping. Alzheimer’s Research & Therapy, 14:22, 2022a.
Belloy et al. (2022b) M. E. Belloy, Y. Le Guen, S. J. Eger, V. Napolioni, M. D. Greicius, and Z. He. A Fast and Robust Strategy to Remove Variant-Level Artifacts in Alzheimer Disease Sequencing Project Data. Neurology Genetics, 8(5):e200012, 2022b.
Belloy et al. (2023) M. E. Belloy, S. J. Andrews, Y. Le Guen, M. Cuccaro, L. A. Farrer, V. Napolioni, and M. D. Greicius. APOE Genotype and Alzheimer Disease Risk Across Age, Sex, and Population Ancestry. JAMA Neurology, 80(12):1284–1294, 2023.
Benjamini and Hochberg (1995) Y. Benjamini and Y. Hochberg. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289–300, 1995.
Benjamini and Yekutieli (2001) Y. Benjamini and D. Yekutieli. The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics, 29(4):1165–1188, 2001.
Berisa and Pickrell (2016) T. Berisa and J. K. Pickrell. Approximately independent linkage disequilibrium blocks in human populations. Bioinformatics, 32(2):283–285, 2016.
Bis et al. (2020) J. C. Bis, X. Jian, B. W. Kunkle, Y. Chen, K. L. Hamilton-Nelson, W. S. Bush, W. J. Salerno, D. Lancour, Y. Ma, A. E. Renton, et al. Whole exome sequencing study identifies novel rare and common Alzheimer’s-Associated variants involved in immune response and transcriptional regulation. Molecular Psychiatry, 25:1859–1875, 2020.
Boyd and Vandenberghe (2004) S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
Candès et al. (2018) E. Candès, Y. Fan, L. Janson, and J. Lv. Panning for Gold: ‘Model-X’ Knockoffs for High Dimensional Controlled Variable Selection. Journal of the Royal Statistical Society Series B: Statistical Methodology, 80(3):551–577, 2018.
Chen et al. (2013) C.-Y. Chen, S. Pollack, D. J. Hunter, J. N. Hirschhorn, P. Kraft, and A. L. Price. Improved ancestry inference using weights from external reference panels. Bioinformatics, 29(11):1399–1406, 2013.
Chu et al. (2023) B. B. Chu, J. Gu, Z. Chen, T. Morrison, E. Candès, Z. He, and C. Sabatti. Second-order group knockoffs with applications to GWAS. arXiv preprint arXiv:2310.15069, 2023.
Dai and Barber (2016) R. Dai and R. Barber. The knockoff filter for FDR control in group-sparse and multitask regression. In Proceedings of The 33rd International Conference on Machine Learning, volume 48, pages 1851–1859. PMLR, 2016.
Dicker (2014) L. H. Dicker. Variance estimation in high-dimensional linear models. Biometrika, 101(2):269–284, 2014.
Gazal et al. (2022) S. Gazal, O. Weissbrod, F. Hormozdiari, K. K. Dey, J. Nasser, K. A. Jagadeesh, D. J. Weiner, H. Shi, C. P. Fulco, L. J. O’Connor, et al. Combining SNP-to-gene linking strategies to identify disease genes and assess disease omnigenicity. Nature Genetics, 54:827–836, 2022.
Gimenez and Zou (2019) J. R. Gimenez and J. Zou. Improving the Stability of the Knockoff Procedure: Multiple Simultaneous Knockoffs and Entropy Maximization. In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, volume 89, pages 2184–2192. PMLR, 2019.
He et al. (2021) Z. He, L. Liu, C. Wang, Y. Le Guen, J. Lee, S. Gogarten, F. Lu, S. Montgomery, H. Tang, E. K. Silverman, et al. Identification of putative causal loci in whole-genome sequencing data via knockoff statistics. Nature Communications, 12:3152, 2021.
He et al. (2022) Z. He, L. Liu, M. E. Belloy, Y. Le Guen, A. Sossin, X. Liu, X. Qi, S. Ma, P. K. Gyawali, T. Wyss-Coray, et al. Ghostknockoff inference empowers identification of putative causal variants in genome-wide association studies. Nature Communications, 13:7209, 2022.
He et al. (2023) Z. He et al. In silico identification of putative causal genetic variants. 2023.
Huang et al. (2017) K.-l. Huang, E. Marcora, A. A. Pimenova, A. F. Di Narzo, M. Kapoor, S. C. Jin, O. Harari, S. Bertelsen, B. P. Fairfax, J. Czajkowski, et al. A common haplotype lowers PU.1 expression in myeloid cells and delays onset of Alzheimer’s disease. Nature Neuroscience, 20:1052–1061, 2017.
Jansen et al. (2019) I. E. Jansen, J. E. Savage, K. Watanabe, J. Bryois, D. M. Williams, S. Steinberg, J. Sealock, I. K. Karlsson, S. Hägg, L. Athanasiu, et al. Genome-wide meta-analysis identifies new loci and functional pathways influencing Alzheimer’s disease risk. Nature Genetics, 51:404–413, 2019.
Kunkle et al. (2019) B. W. Kunkle, B. Grenier-Boley, R. Sims, J. C. Bis, V. Damotte, A. C. Naj, A. Boland, M. Vronskaya, S. J. Van Der Lee, A. Amlie-Wolf, et al. Genetic meta-analysis of diagnosed Alzheimer’s disease identifies new risk loci and implicates A $\beta$ , tau, immunity and lipid processing. Nature Genetics, 51:414–430, 2019.
Le Guen et al. (2021) Y. Le Guen, M. E. Belloy, V. Napolioni, S. J. Eger, G. Kennedy, R. Tao, Z. He, and M. D. Greicius. A novel age-informed approach for genetic association analysis in Alzheimer’s disease. Alzheimer’s Research & Therapy, 13:72, 2021.
Leung et al. (2019) Y. Y. Leung, O. Valladares, Y.-F. Chou, H.-J. Lin, A. B. Kuzma, L. Cantwell, L. Qu, P. Gangadharan, W. J. Salerno, G. D. Schellenberg, et al. VCPA: genomic variant calling pipeline and data management tool for Alzheimer’s Disease Sequencing Project. Bioinformatics, 35(10):1768–1770, 2019.
Li and Candès (2021) S. Li and E. J. Candès. Deploying the Conditional Randomization Test in High Multiplicity Problems. arXiv preprint arXiv:2110.02422, 2021.
Mak et al. (2017) T. S. H. Mak, R. M. Porsch, S. W. Choi, X. Zhou, and P. C. Sham. Polygenic scores via penalized regression on summary statistics. Genetic Epidemiology, 41:469–480, 2017.
Pasaniuc and Price (2017) B. Pasaniuc and A. L. Price. Dissecting the genetics of complex traits using summary association statistics. Nature Reviews Genetics, 18:117–127, 2017.
Qian et al. (2020) J. Qian, Y. Tanigawa, W. Du, M. Aguirre, C. Chang, R. Tibshirani, M. A. Rivas, and T. Hastie. A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK Biobank. PLoS Genetics, 16(10):e1009141, 2020.
Schäfer and Strimmer (2005) J. Schäfer and K. Strimmer. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology, 4:32, 2005.
Schwartzentruber et al. (2021) J. Schwartzentruber, S. Cooper, J. Z. Liu, I. Barrio-Hernandez, E. Bello, N. Kumasaka, A. M. Young, R. J. Franklin, T. Johnson, K. Estrada, et al. Genome-wide meta-analysis, fine-mapping and integrative prioritization implicate new Alzheimer’s disease risk genes. Nature Genetics, 53:392–402, 2021.
Serrano-Pozo et al. (2021) A. Serrano-Pozo, S. Das, and B. T. Hyman. APOE and Alzheimer’s disease: advances in genetics, pathophysiology, and therapeutic approaches. The Lancet Neurology, 20(1):68–80, 2021.
Sesia et al. (2021) M. Sesia, S. Bates, E. Candès, J. Marchini, and C. Sabatti. False discovery rate control in genome-wide association studies with population structure. Proceedings of the National Academy of Sciences, 118(40):e2105841118, 2021.
Spector and Janson (2022) A. Spector and L. Janson. Powerful knockoffs via minimizing reconstructability. The Annals of Statistics, 50(1):252–276, 2022.
The 1000 Genomes Project Consortium (2015) The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature, 526:68–74, 2015.
Tian et al. (2018) X. Tian, J. R. Loftus, and J. E. Taylor. Selective inference with unknown variance via the square-root lasso. Biometrika, 105(4):755–768, 2018.
Tibshirani et al. (2012) R. Tibshirani, J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor, and R. J. Tibshirani. Strong Rules for Discarding Predictors in Lasso-Type Problems. Journal of the Royal Statistical Society Series B: Statistical Methodology, 74(2):245–266, 2012.
Wang et al. (2020) G. Wang, A. Sarkar, P. Carbonetto, and M. Stephens. A Simple New Approach to Variable Selection in Regression, with Application to Genetic Fine Mapping. Journal of the Royal Statistical Society Series B: Statistical Methodology, 82(5):1273–1300, 2020.
Wang and Janson (2021) W. Wang and L. Janson. A high-dimensional power analysis of the conditional randomization test and knockoffs. Biometrika, 109(3):631–645, 2021.
Weinstein et al. (2020) A. Weinstein, W. J. Su, M. Bogdan, R. F. Barber, and E. J. Candès. A Power Analysis for Model-X Knockoffs with $\ell_{p}$ -Regularized Statistics. arXiv preprint arXiv:2007.15346, 2020.
Willer et al. (2010) C. J. Willer, Y. Li, and G. R. Abecasis. METAL: fast and efficient meta-analysis of genomewide association scans. Bioinformatics, 26(17):2190–2191, 2010.
Witten and Tibshirani (2009) D. M. Witten and R. Tibshirani. Covariance-regularized regression and classification for high dimensional problems. Journal of the Royal Statistical Society Series B: Statistical Methodology, 71(3):615–636, 2009.
Zhang et al. (2021) Q. Zhang, F. Privé, B. Vilhjálmsson, and D. Speed. Improved genetic prediction of complex traits from individual-level data or summary statistics. Nature Communications, 12:4192, 2021.
Zou et al. (2022) Y. Zou, P. Carbonetto, G. Wang, and M. Stephens. Fine-mapping from summary data with the “Sum of Single Effects” model. PLoS Genetics, 18(7):e1010299, 2022.

Appendix A Computation of free parameters $\mathbf{s}$

In this paper, we use the semidefinite program (SDP) construction of second-order knockoffs Candès et al. [2018]. Without loss of generality, we assume that columns of the data matrix $\mathbf{X}$ have been standardized to have mean 0 and variance 1, so that the diagonal entries of $\mathbf{\Sigma}$ are 1. As a result, $\mathbf{s}$ is the solution of the convex optimization problem

\begin{aligned}
\text{minimize}\quad & \sum_{j=1}^{p}|1-s_{j}| \\
\text{subject to}\quad & s_{j}\geq 0,\quad 1\leq j\leq p, \\
& \text{diag}\{\mathbf{s}\}\preceq 2\mathbf{\Sigma}.
\end{aligned} \tag{15}

Other methods to compute $\mathbf{s}$ include the minimum variance-based reconstructability (MVR) construction [Spector and Janson, 2022] and maximum entropy (ME) construction [Gimenez and Zou, 2019, Spector and Janson, 2022], which are all compatible with our methods in this paper.
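
As a quick sanity check of (15), consider two cases that can be solved by hand; the second relies on a symmetry argument that is added here for illustration and is not part of the original derivation. When $\mathbf{\Sigma}=\mathbf{I}_{p}$ , the constraint $\text{diag}\{\mathbf{s}\}\preceq 2\mathbf{I}_{p}$ reduces to $s_{j}\leq 2$ , so

s_{j}=1,\quad 1\leq j\leq p,\qquad\sum_{j=1}^{p}|1-s_{j}|=0,

is feasible and attains the smallest possible objective value. When $p=2$ and $\mathbf{\Sigma}$ has off-diagonal entry $\rho$ , the objective and the feasible set are both convex and unchanged by swapping $s_{1}$ and $s_{2}$ , so one may take $s_{1}=s_{2}=s$ . The matrix $2\mathbf{\Sigma}-s\mathbf{I}_{2}$ then has eigenvalues $2(1\pm\rho)-s$ , which are nonnegative if and only if $s\leq 2(1-|\rho|)$ , giving

s=\min\{1,\ 2(1-|\rho|)\}.

For instance, $\rho=0.8$ gives $s=0.4$ , which illustrates how strong correlation forces the entries of $\mathbf{s}$ to be small.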

Appendix B Equivalence of GhostKnockoffs and the Gaussian knockoff sampler in sampling the knockoff $Z$ -score $\widetilde{\mathbf{Z}}_{s}$

In this section, we summarize the proof of He et al. [2022] that $\widetilde{\mathbf{Z}}_{s}$ computed by (6) satisfies (5) as follows.

Lemma 2.

[He et al., 2022] For any $\mathbf{P}$ and $\mathbf{V}$ computed in step 3 of Algorithm 1, we have

\widetilde{\mathbf{Z}}_{s}\mid\mathbf{X},\mathbf{Y}\stackrel{d}{=}\widetilde{\mathbf{X}}^{\top}\mathbf{Y}\mid\mathbf{X},\mathbf{Y},

where $\widetilde{\mathbf{Z}}_{s}$ is computed by (6) and $\widetilde{\mathbf{X}}$ is the output of Algorithm 1.

Proof.

By step 5 of Algorithm 1, we have $\widetilde{\mathbf{X}}=\mathbf{X}\mathbf{P}+\mathbf{E}\mathbf{V}^{1/2}$ , where $\mathbf{E}$ is an $n$ by $p$ matrix with i.i.d. standard Gaussian entries, independent of $\mathbf{X}$ . Therefore,

\widetilde{\mathbf{X}}^{\top}\mathbf{Y}\mid\mathbf{X},\mathbf{Y}=\mathbf{P}^{\top}\mathbf{X}^{\top}\mathbf{Y}+\mathbf{V}^{1/2}\mathbf{E}^{\top}\mathbf{Y}\mid\mathbf{X},\mathbf{Y}.

Because $\mathbf{E}^{\top}\mathbf{Y}\mid\mathbf{X},\mathbf{Y}\sim\mathcal{N}(\mathbf{0},\|\mathbf{Y}\|_{2}^{2}\mathbf{I}_{p})$ , we have

\mathbf{E}^{\top}\mathbf{Y}\mid\mathbf{X},\mathbf{Y}\stackrel{d}{=}\|\mathbf{Y}\|_{2}\mathbf{S}\mid\mathbf{X},\mathbf{Y},\quad\text{where}\;\mathbf{S}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_{p})\;\text{is independent of}\;\mathbf{X}\;\text{and}\;\mathbf{Y}.

Thus, we have

\begin{aligned}
\widetilde{\mathbf{X}}^{\top}\mathbf{Y}\mid\mathbf{X},\mathbf{Y} &\stackrel{d}{=}\mathbf{P}^{\top}\mathbf{X}^{\top}\mathbf{Y}+\|\mathbf{Y}\|_{2}\mathbf{V}^{1/2}\mathbf{S}\mid\mathbf{X},\mathbf{Y} \\
&\stackrel{d}{=}\mathbf{P}^{\top}\mathbf{X}^{\top}\mathbf{Y}+\|\mathbf{Y}\|_{2}\mathbf{Z}\mid\mathbf{X},\mathbf{Y}\quad\text{where}\;\mathbf{Z}\sim\mathcal{N}(\mathbf{0},\mathbf{V})\;\text{is independent of}\;\mathbf{X}\;\text{and}\;\mathbf{Y} \\
&=\widetilde{\mathbf{Z}}_{s}\mid\mathbf{X},\mathbf{Y}.
\end{aligned}

∎
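
The key step of the proof, that $\mathbf{E}^{\top}\mathbf{Y}$ given $\mathbf{Y}$ is distributed as $\|\mathbf{Y}\|_{2}\mathbf{S}$ with $\mathbf{S}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_{p})$ , can be checked numerically. The sketch below is a Monte Carlo illustration with made-up dimensions and a fixed seed; it is not part of the original proof.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, draws = 50, 3, 20000
y = rng.standard_normal(n)            # a fixed response vector (toy data)
y_norm = np.linalg.norm(y)

# Each draw of E is an n x p matrix of i.i.d. N(0, 1) entries, independent of y.
E = rng.standard_normal((draws, n, p))
ETy = np.einsum("dnp,n->dp", E, y)    # row d holds E_d^T y

# After rescaling by ||y||_2, the draws should look like N(0, I_p).
scaled = ETy / y_norm
emp_mean = scaled.mean(axis=0)
emp_cov = scaled.T @ scaled / draws
```

Up to Monte Carlo error, `emp_mean` is close to the zero vector and `emp_cov` is close to the identity matrix, in line with the rotational invariance used in the proof.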

Appendix C Proof of Proposition 1

To prove Proposition 1, we need to first prove Lemma 3.

Lemma 3.

Let $\mathbf{Z}_{1}$ and $\mathbf{Z}_{2}$ be two real $n$ by $p$ matrices. For any $n$ and $p$ , if $\mathbf{Z}_{1}^{\top}\mathbf{Z}_{1}=\mathbf{Z}_{2}^{\top}\mathbf{Z}_{2}$ , there must exist an orthogonal matrix $\mathbf{Q}\in\mathbb{R}^{n\times n}$ such that $\mathbf{Z}_{1}=\mathbf{Q}\mathbf{Z}_{2}$ .

Proof.

Suppose $\mathbf{Z}_{1}^{\top}\mathbf{Z}_{1}=\mathbf{Z}_{2}^{\top}\mathbf{Z}_{2}=\mathbf{U}\bm{\Lambda}\mathbf{U}^{\top}$ , where $\mathbf{U}\in\mathbb{R}^{p\times r}$ has orthonormal columns, so that $\mathbf{U}^{\top}\mathbf{U}=\mathbf{I}_{r}$ , $\bm{\Lambda}\in\mathbb{R}^{r\times r}$ is diagonal with positive entries and $r$ is the rank of $\mathbf{Z}_{1}^{\top}\mathbf{Z}_{1}$ . In other words, we perform the eigen-decomposition of $\mathbf{Z}_{1}^{\top}\mathbf{Z}_{1}=\mathbf{Z}_{2}^{\top}\mathbf{Z}_{2}$ and remove all zero eigenvalues and their corresponding eigenvectors. Note that $\mathbf{U}\mathbf{U}^{\top}$ is a projection matrix that projects any vector onto $\textit{colspace}(\mathbf{U})$ , the column space of $\mathbf{U}$ .

It is clear that

\textit{colspace}(\mathbf{U}\bm{\Lambda}\mathbf{U}^{\top})\subseteq\textit{colspace}(\mathbf{U}).

Because $\mathbf{U}=(\mathbf{U}\bm{\Lambda}\mathbf{U}^{\top})\mathbf{U}\bm{\Lambda}^{-1}$ , we also have

\textit{colspace}(\mathbf{U})\subseteq\textit{colspace}(\mathbf{U}\bm{\Lambda}\mathbf{U}^{\top}).

As a result, we have $\textit{colspace}(\mathbf{U}\bm{\Lambda}\mathbf{U}^{\top})=\textit{colspace}(\mathbf{U})$ .

Thus, for $k=1,2$ , $\mathbf{U}\mathbf{U}^{\top}$ is a projection matrix that projects any vector onto the column space of $\mathbf{U}\bm{\Lambda}\mathbf{U}^{\top}=\mathbf{Z}_{k}^{\top}\mathbf{Z}_{k}$ . Because $\textit{colspace}(\mathbf{Z}_{k}^{\top}\mathbf{Z}_{k})=\textit{rowspace}(\mathbf{Z}_{k})$ , every row of $\mathbf{Z}_{k}$ is left unchanged by this projection, and we have

\mathbf{Z}_{k}=\mathbf{Z}_{k}\mathbf{U}\mathbf{U}^{\top}=\mathbf{Z}_{k}\mathbf{U}\bm{\Lambda}^{-1/2}\bm{\Lambda}^{1/2}\mathbf{U}^{\top}.

Letting $\mathbf{Q}_{k}=\mathbf{Z}_{k}\mathbf{U}\bm{\Lambda}^{-1/2}\in\mathbb{R}^{n\times r}$ , we have $\mathbf{Z}_{k}=\mathbf{Q}_{k}\bm{\Lambda}^{1/2}\mathbf{U}^{\top}$ and

\mathbf{Q}_{k}^{\top}\mathbf{Q}_{k}=\bm{\Lambda}^{-1/2}\mathbf{U}^{\top}\mathbf{Z}_{k}^{\top}\mathbf{Z}_{k}\mathbf{U}\bm{\Lambda}^{-1/2}=\bm{\Lambda}^{-1/2}\mathbf{U}^{\top}\mathbf{U}\bm{\Lambda}\mathbf{U}^{\top}\mathbf{U}\bm{\Lambda}^{-1/2}=\mathbf{I}_{r},\quad(k=1,2).

Thus, we have

\begin{aligned}
\mathbf{Z}_{1} &=\mathbf{Q}_{1}\bm{\Lambda}^{1/2}\mathbf{U}^{\top}=\mathbf{Q}_{1}\mathbf{Q}_{2}^{\top}\mathbf{Q}_{2}\bm{\Lambda}^{1/2}\mathbf{U}^{\top}=\mathbf{Q}_{1}\mathbf{Q}_{2}^{\top}\mathbf{Z}_{2}, \\
\mathbf{Z}_{2} &=\mathbf{Q}_{2}\bm{\Lambda}^{1/2}\mathbf{U}^{\top}=\mathbf{Q}_{2}\mathbf{Q}_{1}^{\top}\mathbf{Q}_{1}\bm{\Lambda}^{1/2}\mathbf{U}^{\top}=\mathbf{Q}_{2}\mathbf{Q}_{1}^{\top}\mathbf{Z}_{1}.
\end{aligned}

Because $\mathbf{Q}_{1}^{\top}\mathbf{Q}_{1}=\mathbf{Q}_{2}^{\top}\mathbf{Q}_{2}=\mathbf{I}_{r}$ , there exist $\mathbf{Q}_{1}^{\perp},\mathbf{Q}_{2}^{\perp}\in\mathbb{R}^{n\times(n-r)}$ such that $\mathbf{V}_{1}=\begin{bmatrix}\mathbf{Q}_{1}&\mathbf{Q}_{1}^{\perp}\end{bmatrix}$ and $\mathbf{V}_{2}=\begin{bmatrix}\mathbf{Q}_{2}&\mathbf{Q}_{2}^{\perp}\end{bmatrix}$ are both $n\times n$ orthogonal matrices. Thus, we have

\begin{aligned}
\mathbf{Z}_{1} &=\mathbf{Q}_{1}\mathbf{Q}_{2}^{\top}\mathbf{Z}_{2}=(\mathbf{V}_{1}\mathbf{V}_{2}^{\top}-\mathbf{Q}_{1}^{\perp}(\mathbf{Q}_{2}^{\perp})^{\top})\mathbf{Z}_{2}, \\
\mathbf{Z}_{2} &=\mathbf{Q}_{2}\mathbf{Q}_{1}^{\top}\mathbf{Z}_{1}=(\mathbf{V}_{2}\mathbf{V}_{1}^{\top}-\mathbf{Q}_{2}^{\perp}(\mathbf{Q}_{1}^{\perp})^{\top})\mathbf{Z}_{1}.
\end{aligned}

Substituting $\mathbf{Z}_{1}=\mathbf{Q}_{1}\mathbf{Q}_{2}^{\top}\mathbf{Z}_{2}$ in $\mathbf{Z}_{2}=\mathbf{Q}_{2}\mathbf{Q}_{1}^{\top}\mathbf{Z}_{1}$ , we have

\mathbf{Z}_{2}=\mathbf{Q}_{2}\mathbf{Q}_{1}^{\top}\mathbf{Q}_{1}\mathbf{Q}_{2}% ^{\top}\mathbf{Z}_{2}=\mathbf{Q}_{2}\mathbf{Q}_{2}^{\top}\mathbf{Z}_{2}

and thus

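\mathbf{Q}_{1}^{\perp}(\mathbf{Q}_{2}^{\perp})^{\top}\mathbf{Z}_{2}=\mathbf{Q}_{1}^{\perp}(\mathbf{Q}_{2}^{\perp})^{\top}\mathbf{Q}_{2}\mathbf{Q}_{2}^{\top}\mathbf{Z}_{2}=\mathbf{0},\quad\text{since }(\mathbf{Q}_{2}^{\perp})^{\top}\mathbf{Q}_{2}=\mathbf{0}.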

Thus, there exists an orthogonal matrix $\mathbf{Q}=\mathbf{V}_{1}\mathbf{V}_{2}^{\top}$ such that $\mathbf{Z}_{1}=\mathbf{Q}\mathbf{Z}_{2}$.

∎
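The construction in this proof can be checked numerically. The following is a minimal sketch in Python with NumPy, not part of the original analysis; the function name `orthogonal_link` and the tolerance `tol` are illustrative assumptions. It builds $\mathbf{Q}_{k}=\mathbf{Z}_{k}\mathbf{U}\bm{\Lambda}^{-1/2}$, completes each to an orthonormal basis, and returns $\mathbf{Q}=\mathbf{V}_{1}\mathbf{V}_{2}^{\top}$.

```python
import numpy as np

def orthogonal_link(Z1, Z2, tol=1e-10):
    """Return an orthogonal Q with Z1 = Q @ Z2, assuming Z1.T @ Z1 == Z2.T @ Z2.

    Hypothetical helper that mirrors the proof; `tol` is an assumed cutoff
    for deciding which eigenvalues count as zero.
    """
    lam, U = np.linalg.eigh(Z1.T @ Z1)       # eigen-decomposition of the shared Gram matrix
    keep = lam > tol * lam.max()             # discard (numerically) zero eigenvalues
    lam, U = lam[keep], U[:, keep]
    r = lam.size
    V = []
    for Z in (Z1, Z2):
        Qk = (Z @ U) / np.sqrt(lam)          # Q_k = Z_k U Lambda^{-1/2}, orthonormal columns
        full, _ = np.linalg.qr(Qk, mode="complete")
        Qk_perp = full[:, r:]                # orthonormal basis of the complement of colspace(Q_k)
        V.append(np.hstack([Qk, Qk_perp]))   # V_k = [Q_k  Q_k_perp]
    return V[0] @ V[1].T                     # Q = V_1 V_2^T
```

For two matrices that share a Gram matrix, `orthogonal_link(Z1, Z2) @ Z2` should reproduce `Z1` up to rounding error.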

We can then prove Proposition 1 as follows. By Lemma 3, since $[\widecheck{\mathbf{X}}\ \widecheck{\mathbf{Y}}]^{\top}[\widecheck{\mathbf{X}}% \ \widecheck{\mathbf{Y}}]=[\mathbf{X}\ \mathbf{Y}]^{\top}[\mathbf{X}\ \mathbf{% Y}]$ , we know that $[\widecheck{\mathbf{X}}\ \widecheck{\mathbf{Y}}]=\mathbf{Q}^{\top}[\mathbf{X}% \ \mathbf{Y}]$ for some orthogonal matrix $\mathbf{Q}$ .

Let $\mathbf{E}\in\mathbb{R}^{n\times p}$ be a matrix with i.i.d. standard Gaussian entries. Then $\mathbf{Q}\mathbf{E}$ is also a matrix with i.i.d. standard Gaussian entries (i.e., $\mathbf{E}\stackrel{{\scriptstyle d}}{{=}}\mathbf{Q}\mathbf{E}$) and

	$\displaystyle(\mathbf{E}^{\top}[\widecheck{\mathbf{X}}\ \widecheck{\mathbf{Y}}% ],\mathbf{E}^{\top}\mathbf{E})\mid\mathbf{X},\mathbf{Y}$	$\displaystyle\stackrel{{\scriptstyle}}{{=}}(\mathbf{E}^{\top}\mathbf{Q}^{\top}% [\mathbf{X}\ \mathbf{Y}],\mathbf{E}^{\top}\mathbf{Q}^{\top}\mathbf{Q}\mathbf{E% })\mid\mathbf{X},\mathbf{Y}$
		$\displaystyle\stackrel{{\scriptstyle d}}{{=}}(\mathbf{E}^{\top}[\mathbf{X}\ % \mathbf{Y}],\mathbf{E}^{\top}\mathbf{E})\mid\mathbf{X},\mathbf{Y}.$

By the construction of $[\widecheck{\mathbf{X}}\ \widecheck{\mathbf{Y}}]^{\top}[\widecheck{\mathbf{X}}% \ \widecheck{\mathbf{Y}}]$ , we have that $\widecheck{\mathbf{X}}^{\top}\widecheck{\mathbf{X}}=\mathbf{X}^{\top}\mathbf{X}$ , $\widecheck{\mathbf{X}}^{\top}\widecheck{\mathbf{Y}}=\mathbf{X}^{\top}\mathbf{Y}$ and $\lVert\widecheck{\mathbf{Y}}\rVert_{2}=\lVert\mathbf{Y}\rVert_{2}$ . Therefore, we focus on the third, fifth and sixth arguments of $\mathcal{T}$ where

		$\displaystyle(\widetilde{\mathbf{X}}^{\top}\mathbf{Y},\mathbf{X}^{\top}\mathbf% {X},\widetilde{\mathbf{X}}^{\top}\widetilde{\mathbf{X}})\mid\mathbf{X},\mathbf% {Y}$
	$\displaystyle\stackrel{{\scriptstyle}}{{=}}$	$\displaystyle\;(\mathbf{P}^{\top}\mathbf{X}^{\top}\mathbf{Y}+\mathbf{V}^{1/2}% \mathbf{E}^{\top}\mathbf{Y},\mathbf{P}^{\top}\mathbf{X}^{\top}\mathbf{X}+% \mathbf{V}^{1/2}\mathbf{E}^{\top}\mathbf{X},\mathbf{P}^{\top}\mathbf{X}^{\top}% \mathbf{X}\mathbf{P}+\mathbf{V}^{1/2}\mathbf{E}^{\top}\mathbf{E}\mathbf{V}^{1/% 2}+$
		$\displaystyle\mathbf{P}^{\top}\mathbf{X}^{\top}\mathbf{E}\mathbf{V}^{1/2}+% \mathbf{V}^{1/2}\mathbf{E}^{\top}\mathbf{X}\mathbf{P})\mid\mathbf{X},\mathbf{Y}$
	$\displaystyle\stackrel{{\scriptstyle d}}{{=}}$	$\displaystyle\;(\mathbf{P}^{\top}\widecheck{\mathbf{X}}^{\top}\widecheck{% \mathbf{Y}}+\mathbf{V}^{1/2}\mathbf{E}^{\top}\widecheck{\mathbf{Y}},\mathbf{P}% ^{\top}\widecheck{\mathbf{X}}^{\top}\widecheck{\mathbf{X}}+\mathbf{V}^{1/2}% \mathbf{E}^{\top}\widecheck{\mathbf{X}},\mathbf{P}^{\top}\widecheck{\mathbf{X}% }^{\top}\widecheck{\mathbf{X}}\mathbf{P}+\mathbf{V}^{1/2}\mathbf{E}^{\top}% \mathbf{E}\mathbf{V}^{1/2}+$
		$\displaystyle\mathbf{P}^{\top}\widecheck{\mathbf{X}}^{\top}\mathbf{E}\mathbf{V% }^{1/2}+\mathbf{V}^{1/2}\mathbf{E}^{\top}\widecheck{\mathbf{X}}\mathbf{P})\mid% \mathbf{X},\mathbf{Y}$
	$\displaystyle\stackrel{{\scriptstyle}}{{=}}$	$\displaystyle\;(\widetilde{\widecheck{\mathbf{X}}}^{\top}\widecheck{\mathbf{Y}% },\widetilde{\widecheck{\mathbf{X}}}^{\top}\widecheck{\mathbf{X}},\widetilde{% \widecheck{\mathbf{X}}}^{\top}\widetilde{\widecheck{\mathbf{X}}})\mid\mathbf{X% },\mathbf{Y}.$

Hence,

\mathcal{T}(\mathbf{X},\widetilde{\mathbf{X}},\mathbf{Y})\mid\mathbf{X},% \mathbf{Y}\stackrel{{\scriptstyle d}}{{=}}\mathcal{T}(\widecheck{\mathbf{X}},% \widetilde{\widecheck{\mathbf{X}}},\widecheck{\mathbf{Y}})\mid\mathbf{X},% \mathbf{Y}.

Appendix D Construction of $[\widecheck{\mathbf{X}}\ \widecheck{\mathbf{Y}}]$ via eigen-decomposition

In this section, we give details on how to construct $[\widecheck{\mathbf{X}}\ \widecheck{\mathbf{Y}}]$ such that $[\widecheck{\mathbf{X}}\ \widecheck{\mathbf{Y}}]^{\top}[\widecheck{\mathbf{X}}% \ \widecheck{\mathbf{Y}}]=[\mathbf{X}\ \mathbf{Y}]^{\top}[\mathbf{X}\ \mathbf{% Y}]$ using eigen-decomposition,

[\mathbf{X}\ \mathbf{Y}]^{\top}[\mathbf{X}\ \mathbf{Y}]=\mathbf{U}\mathbf{D}% \mathbf{U}^{\top},\quad\text{where }\mathbf{U}=[\textbf{u}_{1}\ \ldots\ % \textbf{u}_{p+1}]\text{ is an orthogonal matrix, }\mathbf{D}=\text{diag}(d_{1}% ,\ldots,d_{p+1}),

with $d_{1}\geq\cdots\geq d_{p+1}$ . We consider two cases as follows.

Case 1 ( $n<p+1$ ): Since $\text{rank}([\mathbf{X}\ \mathbf{Y}]^{\top}[\mathbf{X}\ \mathbf{Y}])\leq n$ , we have $d_{n+1}=\cdots=d_{p+1}=0$ and

[\mathbf{X}\ \mathbf{Y}]^{\top}[\mathbf{X}\ \mathbf{Y}]=\mathbf{U}_{1}\mathbf{% D}_{n}\mathbf{U}_{1}^{\top},

where $\mathbf{U}_{1}=[\textbf{u}_{1}\ \ldots\ \textbf{u}_{n}]$ , and $\mathbf{D}_{n}=\text{diag}(d_{1},\ldots,d_{n})$ . Under this case, we let $[\widecheck{\mathbf{X}}\ \widecheck{\mathbf{Y}}]=\mathbf{D}_{n}^{1/2}\mathbf{U% }_{1}^{\top}$ such that $[\widecheck{\mathbf{X}}\ \widecheck{\mathbf{Y}}]^{\top}[\widecheck{\mathbf{X}}% \ \widecheck{\mathbf{Y}}]=[\mathbf{X}\ \mathbf{Y}]^{\top}[\mathbf{X}\ \mathbf{% Y}]$ is satisfied.

Case 2 ( $n\geq p+1$ ): Under this case, we let

[\widecheck{\mathbf{X}}\ \widecheck{\mathbf{Y}}]=\begin{bmatrix}\mathbf{D}^{1/% 2}\mathbf{U}^{\top}\\ \mathbf{0}_{(n-p-1)\times(p+1)}\end{bmatrix}

such that

[\widecheck{\mathbf{X}}\ \widecheck{\mathbf{Y}}]^{\top}[\widecheck{\mathbf{X}}% \ \widecheck{\mathbf{Y}}]=\mathbf{U}\mathbf{D}\mathbf{U}^{\top}=[\mathbf{X}\ % \mathbf{Y}]^{\top}[\mathbf{X}\ \mathbf{Y}]

is satisfied.
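Both cases can be sketched in a few lines of Python with NumPy. This is an illustrative sketch rather than the implementation used in the paper; the function name `pseudo_data_from_gram` is an assumption, and tiny negative eigenvalues caused by rounding are clipped to zero.

```python
import numpy as np

def pseudo_data_from_gram(gram, n):
    """Build an n x (p+1) matrix A with A.T @ A equal to `gram` up to rounding.

    `gram` plays the role of [X Y]^T [X Y]; the helper name is hypothetical.
    """
    m = gram.shape[0]                         # m = p + 1
    d, U = np.linalg.eigh(gram)               # eigenvalues in ascending order
    order = np.argsort(d)[::-1]               # reorder so that d_1 >= ... >= d_{p+1}
    d = np.clip(d[order], 0.0, None)          # clip rounding-induced negative eigenvalues
    U = U[:, order]
    if n < m:
        # Case 1: only the first n eigenvalues can be nonzero.
        return np.sqrt(d[:n])[:, None] * U[:, :n].T
    # Case 2: stack D^{1/2} U^T on top of (n - p - 1) rows of zeros.
    top = np.sqrt(d)[:, None] * U.T
    return np.vstack([top, np.zeros((n - m, m))])
```

In Case 1 the result is exact only if the input Gram matrix indeed has rank at most $n$, which holds when it is computed from $n$ observations.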

Appendix E Computation of the tuning parameter $\lambda$ for the lasso-min method

If we had access to individual-level data, so that Gaussian knockoffs $\widetilde{\mathbf{X}}$ could be constructed, we could follow the method of Dicker [2014] and estimate the noise level $\sigma$ by

\widehat{\sigma}_{0}=\sqrt{\text{max}\left(\frac{2p+n+1}{n(n+1)}\lVert\mathbf{% Y}\rVert_{2}^{2}-\frac{\mathbf{Y}^{\top}\begin{bmatrix}\mathbf{X}&\widetilde{% \mathbf{X}}\end{bmatrix}\mathbf{G}^{-1}\begin{bmatrix}\mathbf{X}&\widetilde{% \mathbf{X}}\end{bmatrix}^{\top}\mathbf{Y}}{n(n+1)},0\right)},\quad\text{ where% }\mathbf{G}=\begin{bmatrix}\mathbf{\Sigma}&\mathbf{\Sigma}-\mathbf{D}\\ \mathbf{\Sigma}-\mathbf{D}&\mathbf{\Sigma}\end{bmatrix}.

(16)

We could then compute $\lambda=\kappa\cdot\frac{\widehat{\sigma}_{0}}{n}\cdot\mathbb{E}[\lVert\mathbf% {R}^{\top}\bm{\epsilon}\rVert_{\infty}]$ , where $\mathbf{R}\in\mathbb{R}^{n\times 2p}$ is a data matrix whose rows are i.i.d. samples from $\mathcal{N}(\mathbf{0},\mathbf{G})$ , and $\bm{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_{n})$ is independent of $\mathbf{R}$ . In the summary statistics setting, we replace $\widetilde{\mathbf{X}}^{\top}\mathbf{Y}$ in (16) by $\mathbf{P}^{\top}\mathbf{X}^{\top}\mathbf{Y}+\lVert\mathbf{Y}\rVert_{2}\mathbf% {Z}$ , where $\mathbf{P}$ and $\mathbf{Z}$ are obtained in Algorithm 4.

The expectation $\mathbb{E}[\lVert\mathbf{R}^{\top}\bm{\epsilon}\rVert_{\infty}]$ can be computed using Monte Carlo integration. However, when both $n$ and $p$ are very large, sampling $\mathbf{R}$ and $\bm{\epsilon}$ becomes too time-consuming. Observing that

\mathbf{R}^{\top}\bm{\epsilon}=\mathbf{R}^{\top}\frac{\bm{\epsilon}}{\lVert\bm% {\epsilon}\rVert_{2}}\lVert\bm{\epsilon}\rVert_{2}

where $\mathbf{R}^{\top}\frac{\bm{\epsilon}}{\lVert\bm{\epsilon}\rVert_{2}}\sim% \mathcal{N}(\mathbf{0},\mathbf{G})$ and $\bm{\epsilon}$ are independent, we have

\mathbb{E}[\lVert\mathbf{R}^{\top}\bm{\epsilon}\rVert_{\infty}]=\mathbb{E}[% \lVert N(\mathbf{0},\mathbf{G})\rVert_{\infty}]\mathbb{E}[\lVert\bm{\epsilon}% \rVert_{2}]=\mathbb{E}[\lVert N(\mathbf{0},\mathbf{G})\rVert_{\infty}]\cdot% \sqrt{2}\frac{\Gamma\{(n+1)/2\}}{\Gamma(n/2)}.

By Stirling’s formula,

\Gamma(z)=\sqrt{\frac{2\pi}{z}}\left(\frac{z}{e}\right)^{z}\left(1+O\left(\frac{1}{z}\right)\right),

we have

\sqrt{2}\frac{\Gamma\{(n+1)/2\}}{\Gamma(n/2)}\sim\sqrt{n}.

Therefore, we may approximate

\mathbb{E}[\lVert\mathbf{R}^{\top}\bm{\epsilon}\rVert_{\infty}]\approx\sqrt{n}% \;\mathbb{E}[\lVert\mathbf{L}\mathbf{Z}\rVert_{\infty}],

where $\mathbf{L}$ is the Cholesky factor of $\mathbf{G}$ (so that $\mathbf{G}=\mathbf{L}\mathbf{L}^{\top}$) and $\mathbf{Z}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_{2p})$.

In practice, the simulated $\lVert\mathbf{L}\mathbf{Z}\rVert_{\infty}$ usually concentrates around its mean, as shown in Figure 7. Thus, only a few Monte Carlo samples are needed to estimate $\mathbb{E}[\lVert\mathbf{L}\mathbf{Z}\rVert_{\infty}]$ accurately, and we draw $10$ Monte Carlo samples throughout the numerical experiments of this paper.
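The approximation above can be sketched as follows in Python with NumPy. This is a minimal illustration under stated assumptions: the function names `expected_sup_norm` and `exact_norm_factor` and the arguments `num_draws` and `seed` are hypothetical, and $\mathbf{G}$ is assumed positive definite so that its Cholesky factor exists.

```python
import math
import numpy as np

def expected_sup_norm(G, n, num_draws=10, seed=0):
    """Approximate E||R^T eps||_inf by sqrt(n) * E||L Z||_inf, where G = L L^T."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(G)                          # lower-triangular Cholesky factor
    Z = rng.standard_normal((G.shape[0], num_draws))   # each column is one N(0, I) draw
    sup_norms = np.abs(L @ Z).max(axis=0)              # ||L Z||_inf for each draw
    return math.sqrt(n) * sup_norms.mean()

def exact_norm_factor(n):
    """E||eps||_2 = sqrt(2) * Gamma((n+1)/2) / Gamma(n/2), computed via log-gamma."""
    return math.sqrt(2.0) * math.exp(math.lgamma((n + 1) / 2) - math.lgamma(n / 2))
```

Comparing `exact_norm_factor(n)` with `math.sqrt(n)` for moderately large `n` gives a quick check on the quality of the Stirling approximation.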

Next, we prove that Algorithm 4 maintains FDR control when $\lambda$ is computed as described in Section 4.2.2. By Proposition 2, it suffices to show that (11) with the computed $\lambda$ produces feature importance statistics that satisfy the flip sign property. By $\lambda\approx\kappa\cdot\frac{\widehat{\sigma}_{0}}{n}\cdot\mathbb{E}[\lVert% \mathbf{R}^{\top}\bm{\epsilon}\rVert_{\infty}]$ , it suffices to show that $\hat{\sigma}_{0}$ is invariant to swapping variables with their knockoffs [Candès et al., 2018].

Let $\bm{\Pi}_{j}\in\mathbb{R}^{2p\times 2p}$ be the permutation matrix that swaps the $j$ -th column with the $(j+p)$ -th column of a matrix. Thus, we have $\bm{\Pi}_{j}^{-1}=\bm{\Pi}_{j}^{\top}=\bm{\Pi}_{j}$ . This leads to

	$\displaystyle\mathbf{Y}^{\top}\begin{bmatrix}\mathbf{X}&\widetilde{\mathbf{X}}% \end{bmatrix}\bm{\Pi}_{j}(\mathbf{G})^{-1}(\begin{bmatrix}\mathbf{X}&% \widetilde{\mathbf{X}}\end{bmatrix}\bm{\Pi}_{j})^{\top}\mathbf{Y}=$	$\displaystyle\mathbf{Y}^{\top}\begin{bmatrix}\mathbf{X}&\widetilde{\mathbf{X}}% \end{bmatrix}\bm{\Pi}_{j}(\mathbf{G})^{-1}\bm{\Pi}_{j}^{\top}\begin{bmatrix}% \mathbf{X}&\widetilde{\mathbf{X}}\end{bmatrix}^{\top}\mathbf{Y}$
	$\displaystyle=$	$\displaystyle\mathbf{Y}^{\top}\begin{bmatrix}\mathbf{X}&\widetilde{\mathbf{X}}% \end{bmatrix}(\bm{\Pi}_{j}\mathbf{G}\bm{\Pi}_{j})^{-1}\begin{bmatrix}\mathbf{X% }&\widetilde{\mathbf{X}}\end{bmatrix}^{\top}\mathbf{Y}$
	$\displaystyle=$	$\displaystyle\mathbf{Y}^{\top}\begin{bmatrix}\mathbf{X}&\widetilde{\mathbf{X}}% \end{bmatrix}(\mathbf{G})^{-1}\begin{bmatrix}\mathbf{X}&\widetilde{\mathbf{X}}% \end{bmatrix}^{\top}\mathbf{Y},$

where the last equality holds because $\bm{\Pi}_{j}\mathbf{G}\bm{\Pi}_{j}=\mathbf{G}$. Hence $\hat{\sigma}_{0}$ is invariant to swapping variables with their knockoffs [Candès et al., 2018]. Therefore, the FDR of Algorithm 4 is controlled when $\lambda$ is computed as described in Section 4.2.2.
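The invariance of the quadratic form can also be verified numerically. The sketch below, in Python with NumPy, is illustrative only: the helper names `joint_covariance`, `swap_matrix` and `quadratic_form` are assumptions, indices are 0-based, and the vector `a` stands in for $[\mathbf{X}\ \widetilde{\mathbf{X}}]^{\top}\mathbf{Y}$.

```python
import numpy as np

def joint_covariance(Sigma, s):
    """G = [[Sigma, Sigma - D], [Sigma - D, Sigma]] with D = diag(s)."""
    D = np.diag(s)
    return np.block([[Sigma, Sigma - D], [Sigma - D, Sigma]])

def swap_matrix(p, j):
    """Permutation matrix exchanging coordinates j and j + p (0-based j)."""
    Pi = np.eye(2 * p)
    Pi[[j, j + p]] = Pi[[j + p, j]]   # swap the two rows of the identity
    return Pi

def quadratic_form(a, G):
    """a^T G^{-1} a, computed with a linear solve instead of an explicit inverse."""
    return float(a @ np.linalg.solve(G, a))
```

For any valid pair $(\mathbf{\Sigma},\mathbf{s})$, `quadratic_form(swap_matrix(p, j) @ a, G)` should agree with `quadratic_form(a, G)` for every `j`, because swapping rows and columns `j` and `j + p` leaves `G` unchanged.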

Since all variables in $\widetilde{\mathbf{X}}$ are null, in practice we may replace $\begin{bmatrix}\mathbf{X}&\widetilde{\mathbf{X}}\end{bmatrix}$ by $\mathbf{X}$ , $\mathbf{G}$ by $\mathbf{\Sigma}$ and $2p+n+1$ by $p+n+1$ in (16) to reduce the dimension when estimating $\sigma$ . Although this would, in theory, break the flip-sign property required for FDR control, no FDR inflation is observed in our simulations.

Appendix F Connection with the scout procedure

In this section, we explain the connection between the feature importance statistic defined in Algorithm 4 and the scout procedure [Witten and Tibshirani, 2009].

For covariates $X\in\mathbb{R}^{p}$ and response $Y\in\mathbb{R}$, Witten and Tibshirani [2009] assume that $\begin{bmatrix}X\\ Y\end{bmatrix}\sim\mathcal{N}(\mathbf{0},\mathbf{\Sigma}_{X,Y})$. The population linear regression coefficient of $Y$ on $X$, which induces a linear predictor that achieves the minimal mean squared prediction error, is given by $\bm{\beta}=-\bm{\Theta}_{XY}/\bm{\Theta}_{YY}$, where $\bm{\Theta}=\begin{bmatrix}\bm{\Theta}_{XX}&\bm{\Theta}_{XY}\\ \bm{\Theta}_{YX}&\bm{\Theta}_{YY}\end{bmatrix}=\mathbf{\Sigma}_{X,Y}^{-1}$ is the precision matrix. Letting $\mathbf{S}$ be the empirical covariance matrix of $X$ and $Y$, they consider the following covariance-regularized regression approach to estimate $\bm{\beta}$:

1. Compute $\hat{\bm{\Theta}}_{XX}$ to maximize $\log\{\det(\bm{\Theta}_{XX})\}-\text{tr}(\mathbf{S}_{XX}\bm{\Theta}_{XX})-J_{1}(\bm{\Theta}_{XX})$.
2. Compute $\hat{\bm{\Theta}}$ to maximize $\log\{\det(\bm{\Theta})\}-\text{tr}(\mathbf{S}\bm{\Theta})-J_{2}(\bm{\Theta})$ subject to $\bm{\Theta}_{XX}=\hat{\bm{\Theta}}_{XX}$ obtained from Step 1.
3. Compute $\hat{\bm{\beta}}=-\hat{\bm{\Theta}}_{XY}/\hat{\bm{\Theta}}_{YY}$.
4. Compute $\hat{\bm{\beta}}^{*}=c\hat{\bm{\beta}}$, where $c$ is the regression coefficient of $\mathbf{Y}$ onto $\mathbf{X}\hat{\bm{\beta}}$.

Here, $J_{1}$ and $J_{2}$ are two penalty functions. The first two steps are to appropriately separate true conditional correlations from those purely due to noise. As shown in Witten and Tibshirani [2009], when $J_{2}(\Theta)=\lambda_{2}{\lVert\bm{\Theta}\rVert_{1}}$ (resp. $\lambda_{2}{\lVert\bm{\Theta}\rVert_{2}^{2}}$ ), the solution to step 3 is proportional to the solution of

\hat{\bm{\beta}}=\operatorname*{arg\,min}_{\bm{\beta}}\bm{\beta}^{\top}\mathbf% {G}_{XX}\bm{\beta}-2\mathbf{S}_{XY}^{\top}\bm{\beta}+\lambda_{2}\lVert\bm{% \beta}\rVert_{1}\;(\text{resp.}\;\lambda_{2}\lVert\bm{\beta}\rVert_{2}^{2}),

where $\mathbf{G}_{XX}$ is the inverse of the solution $\hat{\bm{\Theta}}_{XX}$ from step 1. In other words, the Lasso corresponds to the setting that $J_{1}=0$ and $\mathbf{G}_{XX}=\mathbf{S}_{XX}$ . Witten and Tibshirani [2009] consider various settings in which they demonstrate the superiority of the scout procedure over the Lasso, Ridge and Elastic Net. In the setting of Section 4, we have $\text{cov}(X,\widetilde{X})=\begin{bmatrix}\mathbf{\Sigma}&\mathbf{\Sigma}-% \mathbf{D}\\ \mathbf{\Sigma}-\mathbf{D}&\mathbf{\Sigma}\end{bmatrix}$ . Therefore, the objective function (11) corresponds to the case that the true $\bm{\Theta}_{XX}$ is used in step 1 (here we include both $X$ and $\widetilde{X}$ as explanatory variables).

Appendix G Construction of group knockoffs and examples of importance scores at the group level

For group knockoffs, we test the group conditional independence hypothesis:

\displaystyle H_{\gamma}^{0}:X_{\gamma}\perp\!\!\!\perp Y\mid X_{-\gamma}

where $\gamma\in\{1,...,g\}$ denotes a group and $X_{\gamma}$ is the vector of features in group $\gamma$ .

In addition to the conditional independence (2), group knockoffs $\widetilde{\mathbf{X}}$ must satisfy the group exchangeability condition that

\textbf{(Group exchangeability):}\;(\mathbf{X}_{\gamma},\widetilde{\mathbf{X}}% _{\gamma},\mathbf{X}_{-\gamma},\widetilde{\mathbf{X}}_{-\gamma})\stackrel{{% \scriptstyle d}}{{=}}(\widetilde{\mathbf{X}}_{\gamma},\mathbf{X}_{\gamma},% \mathbf{X}_{-\gamma},\widetilde{\mathbf{X}}_{-\gamma}),\;\forall\;\gamma\in\{1% ,...,g\}.

No exchangeability property is required for features within the same group, which allows greater flexibility in the construction of knockoffs.

When $X\sim\mathcal{N}(\mathbf{0},\mathbf{\Sigma})$, the group exchangeability condition allows the diagonal matrix $\mathbf{D}=\text{diag}(\mathbf{s})$ described in Sections 2 and 4 to become a block-diagonal matrix $\mathbf{D}=\text{diag}(\mathbf{S}_{1},...,\mathbf{S}_{g})$, where $\mathbf{S}_{\gamma}$ is a symmetric matrix whose dimension equals the number of variables in group $\gamma$ ($\gamma\in\{1,...,g\}$). With the block-diagonal matrix $\mathbf{D}$ obtained following the SDP construction of Chu et al. [2023] in step 3, Algorithm 1 can construct valid group knockoffs $\widetilde{\mathbf{X}}$ of $\mathbf{X}$ with respect to $g$ feature groups. Analogously, Algorithms 2-4 can be modified to perform inference on the $H_{\gamma}^{0}$’s.

Although it is conceptually straightforward to modify $\mathbf{D}$ from a diagonal to a block-diagonal matrix, doing so introduces significantly more optimization variables. To reduce the computational burden in practice, we exploit a form of conditional independence across groups, described in Section 4 of Chu et al. [2023]. The main idea is to select a few key variables in each group that capture most of the within-group variation, and to solve a reduced optimization problem on the key variables only. In the real data analysis, we defined groups via average-linkage hierarchical clustering with correlation cutoff $0.5$, selected representatives within groups via Algorithm A1 of Chu et al. [2023] with $c=0.5$, and replaced objective (15) by the maximum entropy (ME) objective, which has improved power over SDP constructions in simulations.

In this paper, we use $M$ multi-knockoffs. To define the importance score for group $\gamma$ and its knockoffs, we sum the absolute estimated effects of the variants in each group. With $M$ knockoff copies, we explicitly compute $Z_{\gamma}=\sum_{i\in\mathcal{A}_{\gamma}}|\beta_{i}|$ and $\widetilde{Z}_{\gamma}^{(\ell)}=\sum_{i\in\mathcal{A}_{\gamma}}|\widetilde{\beta}^{(\ell)}_{i}|$ for $\ell=1,...,M$, where $\mathcal{A}_{\gamma}$ is the set of variants in group $\gamma$ and $(\bm{\beta},\widetilde{\bm{\beta}}^{(1)},...,\widetilde{\bm{\beta}}^{(M)})$ are the estimated effect sizes from step 4 of Algorithm 4. One may use other choices of feature importance, such as the $\ell_{2}$ norm. The group-wise Lasso coefficient difference is then defined as

\displaystyle W_{\gamma}=(Z_{\gamma}-\operatorname{median}(\widetilde{Z}_{% \gamma}^{(1)},...,\widetilde{Z}_{\gamma}^{(M)}))I_{Z_{\gamma}\geq\operatorname% {max}(\widetilde{Z}^{(1)}_{\gamma},...,\widetilde{Z}^{(M)}_{\gamma})}

and groups with $W_{\gamma}>\tau$ are selected, where $\tau$ is calculated from the multiple knockoff filter [Gimenez and Zou, 2019]. Note that $W_{\gamma}$ is the feature importance statistic first introduced in He et al. [2021].
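The group-level statistic can be sketched as follows in Python with NumPy. This is an illustration rather than the authors' implementation: the function name `group_importance` and the encoding of groups as integer labels $0,\ldots,g-1$ are assumptions, and the threshold $\tau$ of the multiple knockoff filter is not computed here.

```python
import numpy as np

def group_importance(beta, beta_knock, groups):
    """Group-wise importance W_gamma from original and knockoff effect estimates.

    beta:       (p,) estimated effects of the original variables
    beta_knock: (M, p) estimated effects of the M knockoff copies
    groups:     (p,) integer group labels in {0, ..., g-1} (assumed encoding)
    """
    g = int(groups.max()) + 1
    # Z_gamma: sum of absolute effects within each group
    Z = np.bincount(groups, weights=np.abs(beta), minlength=g)
    # Z~_gamma^(l): the same sum for each knockoff copy, shape (M, g)
    Zk = np.stack([np.bincount(groups, weights=np.abs(b), minlength=g)
                   for b in beta_knock])
    # W_gamma = (Z - median of knockoff scores) * 1{Z >= max of knockoff scores}
    return (Z - np.median(Zk, axis=0)) * (Z >= Zk.max(axis=0))
```

A group receives a nonzero score only when its original score is at least as large as the scores of all of its knockoff copies.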

Appendix H GhostKnockoffs for CRT (GhostCRT)

Let $\mathbf{X}\in\mathbb{R}^{n\times p}$ and $\mathbf{Y}\in\mathbb{R}^{n}$ be the covariate matrix and the response vector respectively. Recall that in the conditional randomization test, to test $H_{j}:X_{j}\perp\!\!\!\perp Y\mid X_{-j}$ , Candès et al. [2018] draw i.i.d. samples $\widetilde{\mathbf{X}}_{j}^{1},\ldots,\widetilde{\mathbf{X}}_{j}^{B}\sim% \mathcal{L}(\mathbf{X}_{j}|\mathbf{X}_{-j})$ ( $\mathbf{X}_{j}$ is the $j$ -th column of the covariate matrix $\mathbf{X}$ ) and compute the CRT $p$ -value as

p_{j}=\frac{1}{B+1}\left[1+\sum_{b=1}^{B}\mathbbm{1}_{T(\widetilde{\mathbf{X}}% _{j}^{b},\mathbf{X}_{-j},\mathbf{Y})\geq T(\mathbf{X}_{j},\mathbf{X}_{-j},% \mathbf{Y})}\right],

(17)

for some feature importance function $T$ .
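As a small illustration, the $p$-value in (17) can be computed from the observed statistic and the $B$ resampled statistics as in the following Python sketch; the function name `crt_pvalue` is hypothetical.

```python
import numpy as np

def crt_pvalue(t_obs, t_null):
    """CRT p-value (17): (1 + #{b : T_b >= T_obs}) / (B + 1)."""
    t_null = np.asarray(t_null, dtype=float)   # the B resampled statistics
    return (1.0 + np.sum(t_null >= t_obs)) / (t_null.size + 1.0)
```

The smallest attainable value is $1/(B+1)$, so $B$ limits the resolution of the resulting $p$-values.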

Under the assumption that rows of $\mathbf{X}$ are i.i.d. samples from $\mathcal{N}(\mathbf{0},\mathbf{\Sigma})$ , we can generate $\widetilde{\mathbf{X}}_{j}^{1},\ldots,\widetilde{\mathbf{X}}_{j}^{B}$ by

\widetilde{\mathbf{X}}_{j}^{b}=\mathbf{X}_{-j}\bm{\gamma}_{j}+v_{j}^{1/2}% \mathbf{E}_{j}^{b},

(18)

where $\bm{\gamma}_{j}=\mathbf{\Sigma}_{-j,-j}^{-1}\mathbf{\Sigma}_{-j,j}\in\mathbb{R}^{p-1}$, $v_{j}=\Sigma_{j,j}-\mathbf{\Sigma}_{j,-j}\mathbf{\Sigma}_{-j,-j}^{-1}\mathbf{\Sigma}_{-j,j}$, and $\mathbf{E}_{j}^{1},\ldots,\mathbf{E}_{j}^{B}\stackrel{{\scriptstyle iid}}{{\sim}}\mathcal{N}(0,\mathbf{I}_{n})$ are independent of everything else. Exploiting the analogy between (4) and (18), we develop the GhostCRT counterparts of Algorithms 2-3 as follows; the counterpart of Algorithm 4 can be derived in a similar way.

Algorithm 5 GhostKnockoffs with Marginal Correlation Difference Statistic for CRT

1: Input: $\mathbf{X}^{\top}\mathbf{Y}$, $||\mathbf{Y}||_{2}^{2}$, and $\mathbf{\Sigma}$
2: for $j=1,\ldots,p$
3: Compute $\bm{\gamma}_{j}=\mathbf{\Sigma}_{-j,-j}^{-1}\mathbf{\Sigma}_{-j,j}\in\mathbb{R}^{p-1}$ and $v_{j}=\Sigma_{j,j}-\mathbf{\Sigma}_{j,-j}\mathbf{\Sigma}_{-j,-j}^{-1}\mathbf{\Sigma}_{-j,j}$
4: for $b=1,\ldots,B$
5: Generate $\widetilde{Z}^{b}_{j}=\bm{\gamma}_{j}^{\top}\mathbf{X}_{-j}^{\top}\mathbf{Y}+||\mathbf{Y}||_{2}Z_{j}^{b}$, where $Z_{j}^{b}\sim\mathcal{N}(0,v_{j})$ is independent of everything else.
6: end for
7: Compute the CRT $p$-value $p_{j}$ via (17) with $T(\mathbf{X}_{j},\mathbf{X}_{-j},\mathbf{Y})=|\mathbf{X}_{j}^{\top}\mathbf{Y}|$ and $T(\widetilde{\mathbf{X}}_{j}^{b},\mathbf{X}_{-j},\mathbf{Y})=|\widetilde{Z}^{b}_{j}|$
8: end for
9: Output: Selection set by conducting existing multiple testing procedures on CRT $p$-values $p_{1},\ldots,p_{p}$
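The steps of Algorithm 5 can be sketched in Python with NumPy as follows. This is a minimal illustration, not the authors' code: the function name `ghost_crt_marginal` and the arguments `B` and `seed` are assumptions, the same statistic $T=|\cdot|$ is applied to the observed and the resampled marginal correlations, and the final multiple testing step is left to the user.

```python
import numpy as np

def ghost_crt_marginal(XtY, y_norm_sq, Sigma, B=200, seed=0):
    """CRT p-values from summary statistics X^T Y, ||Y||_2^2 and Sigma."""
    rng = np.random.default_rng(seed)
    p = XtY.shape[0]
    y_norm = np.sqrt(y_norm_sq)
    pvals = np.empty(p)
    for j in range(p):
        rest = np.delete(np.arange(p), j)                  # index set -j
        # gamma_j = Sigma_{-j,-j}^{-1} Sigma_{-j,j}
        gamma = np.linalg.solve(Sigma[np.ix_(rest, rest)], Sigma[rest, j])
        # v_j = Sigma_{jj} - Sigma_{j,-j} Sigma_{-j,-j}^{-1} Sigma_{-j,j}
        v = Sigma[j, j] - Sigma[j, rest] @ gamma
        # Z~_j^b = gamma_j^T X_{-j}^T Y + ||Y||_2 * N(0, v_j), for b = 1, ..., B
        noise = np.sqrt(max(v, 0.0)) * rng.standard_normal(B)
        z_tilde = gamma @ XtY[rest] + y_norm * noise
        # CRT p-value (17) with T = absolute marginal correlation
        pvals[j] = (1.0 + np.sum(np.abs(z_tilde) >= abs(XtY[j]))) / (B + 1.0)
    return pvals
```

Solving a $(p-1)\times(p-1)$ linear system for every $j$ is wasteful when $p$ is large; a single inverse of $\mathbf{\Sigma}$ would yield all $\bm{\gamma}_{j}$ and $v_{j}$, but the loop above stays closer to the algorithm as stated.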

Algorithm 6 GhostKnockoffs with Penalized Regression for CRT: Known Empirical Covariance

1: Input: $\mathbf{X}^{\top}\mathbf{X}$, $\mathbf{X}^{\top}\mathbf{Y}$, $||\mathbf{Y}||_{2}^{2}$, $\mathbf{\Sigma}$, and $n$
2: Find $\widecheck{\mathbf{X}}$ and $\widecheck{\mathbf{Y}}$ such that $[\widecheck{\mathbf{X}}\ \widecheck{\mathbf{Y}}]^{\top}[\widecheck{\mathbf{X}}\ \widecheck{\mathbf{Y}}]=[\mathbf{X}\ \mathbf{Y}]^{\top}[\mathbf{X}\ \mathbf{Y}]$ by eigen-decomposition or Cholesky decomposition.
3: for $j=1,\ldots,p$
4: Compute $\bm{\gamma}_{j}=\mathbf{\Sigma}_{-j,-j}^{-1}\mathbf{\Sigma}_{-j,j}\in\mathbb{R}^{p-1}$ and $v_{j}=\Sigma_{j,j}-\mathbf{\Sigma}_{j,-j}\mathbf{\Sigma}_{-j,-j}^{-1}\mathbf{\Sigma}_{-j,j}$
5: for $b=1,\ldots,B$
6: Generate $\widetilde{\widecheck{\mathbf{X}}}_{j}^{b}$ via (18) using $\widecheck{\mathbf{X}}_{-j}$ as input.
7: end for
8: Compute the CRT $p$-value $p_{j}$ via (17), replacing $\widetilde{\mathbf{X}}_{j}^{b}$ by $\widetilde{\widecheck{\mathbf{X}}}_{j}^{b}$, with feature importance statistic defined by $T(\mathbf{X}_{j},\mathbf{X}_{-j},\mathbf{Y})=|\hat{\beta}_{j}|$, where
$(\hat{\beta}_{j},\hat{\bm{\beta}}_{-j})=\operatorname*{arg\,min}_{(\beta_{j},\bm{\beta}_{-j})\in\mathbb{R}^{p}}\frac{1}{2}||\mathbf{Y}-\mathbf{X}_{j}{\beta}_{j}-\mathbf{X}_{-j}{\bm{\beta}}_{-j}||_{2}^{2}+\lambda||(\beta_{j},\bm{\beta}_{-j})||_{1}.$
9: end for
10: Output: Selection set by conducting existing multiple testing procedures on CRT $p$-values $p_{1},\ldots,p_{p}$

Since (18) is a special case of (4) where

• $\mathbf{P}$ is obtained by substituting the $(j,j)$-entry and the other entries in the $j$-th column of $\mathbf{I}_{p}$ by $0$ and $\bm{\gamma}_{j}$, respectively;
• $\mathbf{V}$ is a matrix of zeros except that its $(j,j)$-entry equals $v_{j}$,

all theoretical results in Sections 2-4 remain true for the GhostCRT.
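To make the correspondence concrete, the matrices $\mathbf{P}$ and $\mathbf{V}$ for a given $j$ can be assembled as in the following Python sketch with NumPy. The function name `crt_P_and_V` is hypothetical, indices are 0-based, and (4) is assumed to take the form $\widetilde{\mathbf{X}}=\mathbf{X}\mathbf{P}+\mathbf{E}\mathbf{V}^{1/2}$, consistent with the expansion used in the proof of Proposition 1.

```python
import numpy as np

def crt_P_and_V(Sigma, j):
    """Matrices P and V that reduce (4) to the CRT resampling step (18) for column j."""
    p = Sigma.shape[0]
    rest = np.delete(np.arange(p), j)                      # index set -j
    gamma = np.linalg.solve(Sigma[np.ix_(rest, rest)], Sigma[rest, j])
    v = Sigma[j, j] - Sigma[j, rest] @ gamma               # conditional variance v_j
    P = np.eye(p)
    P[:, j] = 0.0                                          # (j, j)-entry becomes 0
    P[rest, j] = gamma                                     # other entries of column j become gamma_j
    V = np.zeros((p, p))
    V[j, j] = v                                            # only the (j, j)-entry is nonzero
    return P, V
```

With these choices, column $j$ of $\mathbf{X}\mathbf{P}$ equals $\mathbf{X}_{-j}\bm{\gamma}_{j}$ and every other column of $\mathbf{X}$ is left unchanged.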

Remark 1.

In Algorithm 6, the tuning parameter $\lambda$ is allowed to depend on $\mathbf{X}^{\top}\mathbf{X}$ , $\mathbf{X}^{\top}\mathbf{Y}$ , $\mathbf{Y}^{\top}\mathbf{Y}$ and $n$ . We may also use the square-root Lasso or the Lasso-max importance statistic as outlined in Sections 3.3 and 3.4.

Appendix I Additional results for Section 4.4.2

To further demonstrate the effect of sample size on the new GhostKnockoffs methods in comparison to the individual-level knockoffs with (cross-validated) Lasso coefficient difference, we consider additional experiments with $p=600$ and $n\in\{600,1800,3000\}$ under the same setting as Section 4.4.2. Note that the noise level scales on the order of $\sqrt{n}$, so that the signal-to-noise ratio does not change dramatically. From Figure 8, we observe that as $n$ increases, all three new methods proposed in this paper have power comparable to KF-lassocv and consistently outperform GK-marginal [He et al., 2022], with FDR controlled in all cases.

Appendix J Supplementary plots for Section 4.4.1

Let $\mathbf{Z}=\mathbf{X}^{\top}\mathbf{Y}$ and $\tilde{\mathbf{Z}}=\mathbf{P}^{\top}\mathbf{X}^{\top}\mathbf{Y}+\lVert\mathbf{Y}\rVert_{2}\mathbf{Z}$ be defined as in Algorithm 4. Note that for Algorithm 4 to control the FDR, it suffices to require that swapping the $j$-th entry of $\mathbf{Z}$ with the $j$-th entry of $\tilde{\mathbf{Z}}$ does not change the joint distribution of $(\mathbf{Z},\tilde{\mathbf{Z}})$ for each null $j$.

By the Central Limit Theorem, small sets of entries (e.g., single entries, pairs, and triples) of $(\mathbf{Z},\tilde{\mathbf{Z}})$ are approximately Gaussian. Additionally, in Figure 11, we show empirically that the covariance of $(\mathbf{Z},\tilde{\mathbf{Z}})$ (approximately) satisfies the required swap-invariance for null positions. These approximations, coupled with the robustness of the knockoff framework, yield FDR control empirically. This is similar to the robustness of second-order knockoffs observed empirically in Candès et al. [2018].

In the setting from Section 4.4.1, Figure 9 depicts the ordered empirical values of $Z_{j}$ (respectively $\tilde{Z}_{j}$ ) plotted against an equal-size ordered random sample from a Gaussian distribution with matching mean and variance as the empirical mean and variance of $Z_{j}$ (respectively $\tilde{Z}_{j}$ ), for three randomly selected indices. This comparison is based on the 1000 sub-sampled data replications from Section 4.4.1. In Figure 10, we overlay the plots for all indices. We observe that $Z_{j}$ and $\tilde{Z}_{j}$ approximately follow Gaussian distributions. In Figure 11, we present the scatter plots of relevant empirical covariances. We observe that all the points roughly concentrate around the $y=x$ line. This shows the approximate swap-invariance of $\mathbf{Z}$ and $\tilde{\mathbf{Z}}$ (for null indices).

Appendix K Details of the nine studies in Section 5

Section 5 considers the following nine studies for Alzheimer’s disease:

1. The genome-wide association study performed by Huang et al. [2017].
2. The genome-wide meta-analysis of clinically diagnosed AD and AD-by-proxy by Jansen et al. [2019].
3. The genome-wide meta-analysis of clinically diagnosed AD by Kunkle et al. [2019].
4. The genome-wide meta-analysis by Schwartzentruber et al. [2021].
5. In-house genome-wide association study of 15,209 cases and 14,452 controls aggregating 27 cohorts across 39 SNP array data sets, imputed using the TOPMed reference panels [Belloy et al., 2022a].
6. A whole-exome sequencing analysis of data from ADSP by Bis et al. [2020].
7. A whole-exome sequencing analysis of data from ADSP by Le Guen et al. [2021].
8. In-house whole-exome sequencing analysis of ADSP (6155 cases, 5418 controls).
9. In-house whole-genome sequencing analysis of the 2021 ADSP release (3584 cases, 2949 controls) [Belloy et al., 2022b].

Appendix L Calculation of meta-analysis Z-score

Based on the $Z$-scores $\mathbf{Z}_{1},\mathbf{Z}_{2},...,\mathbf{Z}_{K}$ from $K$ studies, we adopt the definition in He et al. [2022], under which the meta-analysis $Z$-score with overlapping samples is

\mathbf{Z}_{\text{meta}}=\mathbf{H}\sum_{k=1}^{K}w_{k}\mathbf{C}_{k}\mathbf{Z}% _{k}.

Specifically,

• optimal weights $w_{1},\ldots,w_{K}$ are obtained by solving the optimization problem

\text{minimize }\sum_{k=1}^{K}\sum_{l=1}^{K}w_{k}w_{l}\,cor.S_{kl}\quad\textrm{subject to }\sum_{k=1}^{K}w_{k}\sqrt{n_{k}}=1,\ w_{1},\ldots,w_{K}\geq 0;

• for $k=1,\ldots,K$, $\mathbf{C}_{k}=\text{diag}\{c_{k1},...,c_{kp}\}$ is a diagonal matrix where $c_{kj}=1$ if the $Z$-score of the $j$-th variant is observed in the $k$-th study and $c_{kj}=0$ otherwise ($j=1,\ldots,p$);
• $\mathbf{H}=\text{diag}\{h_{1},...,h_{p}\}$ is a diagonal matrix where $h_{j}=(\sum_{k}\sum_{l}w_{k}w_{l}c_{kj}c_{lj}\,cor.S_{kl})^{-1/2}$ ($j=1,\ldots,p$);
• $cor.S_{kl}$ is the study correlation between the $k$-th study and the $l$-th study.

In practice, when calculating $cor.S_{kl}$ , we only use variants whose $Z$ -scores are bounded in $[-1.96,1.96]$ in both the $k$ -th study and the $l$ -th study to eliminate the impact of polygenic effects. This meta-analysis approach is a generalization of the METAL method proposed by Willer et al. [2010].
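Given the weights and the study correlations, the meta-analysis $Z$-score can be assembled as in the following Python sketch with NumPy. This is an illustration under stated assumptions: the function name `meta_z_score` is hypothetical, the weights `w` and the matrix `cor_S` are taken as inputs (neither the optimization problem for the weights nor the estimation of $cor.S_{kl}$ is implemented), and unobserved $Z$-scores may be stored as NaN.

```python
import numpy as np

def meta_z_score(Z, C, w, cor_S):
    """Meta-analysis Z-score  H * sum_k w_k C_k Z_k  for p variants and K studies.

    Z:     (K, p) per-study Z-scores (entries with C == 0 are ignored, may be NaN)
    C:     (K, p) indicators c_kj, 1 if variant j is observed in study k
    w:     (K,) study weights, assumed to be precomputed
    cor_S: (K, K) study correlation matrix, assumed to be precomputed
    """
    Z_obs = np.where(C > 0, Z, 0.0)                  # C_k Z_k without propagating NaN
    numerator = (w[:, None] * Z_obs).sum(axis=0)     # sum_k w_k c_kj Z_kj
    WC = w[:, None] * C                              # entries w_k c_kj
    # h_j^{-2} = sum_k sum_l w_k w_l c_kj c_lj cor.S_kl
    var = np.einsum("kj,kl,lj->j", WC, cor_S, WC)
    return numerator / np.sqrt(var)
```

A variant that is observed in no study has `var` equal to zero and is returned as NaN; such variants would have to be filtered out beforehand.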

Appendix M Obtaining the covariance matrix $\mathbf{\Sigma}$ in meta-analysis for AD

To perform meta-analysis for AD, we need a suitable estimate of the covariance matrix $\mathbf{\Sigma}$ . In this paper, we adopt strategies in He et al. [2023] and Chu et al. [2023] as follows.

We first download the covariance matrix from the Pan-UKB consortium (https://pan.ukbb.broadinstitute.org), which contains about $24$ million variants across the human genome derived from about $500,000$ British samples. We then extract $p=650,576$ variants which satisfy the following three conditions: (a) the variant is recorded in the UK Biobank genotype array, (b) its MAF exceeds 0.01, (c) its reference/alternate allele pair matches with the ones listed in all the nine studies in meta-analysis. Based on the covariance matrix of extracted variants, we further partition extracted variants into 1703 quasi-independent blocks using the partition given by Berisa and Pickrell [2016]. Finally, we compute the block-diagonal covariance matrix

\mathbf{\Sigma}=\begin{bmatrix}\mathbf{\Sigma}_{1}&&\\ &\ddots&\\ &&\mathbf{\Sigma}_{1703}\end{bmatrix},

where $\mathbf{\Sigma}_{l}$ is the shrinkage estimator of the covariance matrix of variants in the $l$-th block, computed using the R package corpcor [Schäfer and Strimmer, 2005]. To ensure that all blocks $\mathbf{\Sigma}_{1},...,\mathbf{\Sigma}_{1703}$ are positive definite, we perform an eigen-decomposition of each block and raise every eigenvalue smaller than $10^{-5}$ to $10^{-5}$.
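The eigenvalue adjustment applied to each block can be sketched in Python with NumPy as follows; the function name `floor_eigenvalues` is hypothetical, and the shrinkage estimation itself (done with corpcor in R) is not reproduced here.

```python
import numpy as np

def floor_eigenvalues(Sigma_block, floor=1e-5):
    """Raise every eigenvalue of a symmetric block below `floor` up to `floor`."""
    vals, vecs = np.linalg.eigh(Sigma_block)   # symmetric eigen-decomposition
    vals = np.maximum(vals, floor)             # floor the small eigenvalues
    return (vecs * vals) @ vecs.T              # reassemble U diag(vals) U^T
```

Blocks whose eigenvalues already exceed the floor are returned unchanged up to rounding error.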

Appendix N Supplementary tables of meta-analysis for AD

Tables 1 and 2 provide more details of the meta-analysis for AD in Section 5. Specifically, Table 1 presents the number of loci, average signals per locus, standard deviation of the number of signals per locus, average groups per locus, and standard deviation of the number of groups per locus identified by conventional marginal association test, GK-marginal, GK-pseudolasso, and GK-susie-rss. Here, the $p$ -value threshold of the conventional marginal association test is $5\times 10^{-8}$ , and GK-pseudolasso uses the tuning parameter chosen by the lasso-min method in Section 4.2.2. For GK-marginal, GK-pseudolasso, and GK-susie-rss, we display results with respect to target FDR levels 0.05, 0.1, and 0.2. Table 2 provides details of top variants of identified loci given by GK-pseudolasso (target FDR level: 0.20), including their positions (columns “Chr.” and “SNP”), their reference alleles (column “Ref.”) and alternative alleles (column “Alt.”), genes that they potentially regulate (column “TopS2GGene”), their closest genes, their $Z$ -scores from different individual studies, their meta-analysis $Z$ -scores, their feature importance scores ( $W$ ), and the marginal $p$ -values obtained from meta-analysis $Z$ -scores.

Table 1: Summary of results by applying different methods on meta-analysis for AD

Method	Target FDR level	Number of identified loci	Average signals per locus	SD of signals per locus	Average groups per locus	SD of groups per locus
Marginal association test		29	15.517	32.231	1.000	0.000
GK-marginal	0.05	3	107.667	97.027	3.667	3.055
	0.10	10	95.600	214.568	2.700	3.622
	0.20	17	76.176	174.714	2.412	3.318
GK-pseudolasso	0.05	30	21.500	44.323	2.333	4.722
	0.10	42	17.214	38.019	2.024	4.015
	0.20	63	15.889	47.074	1.794	3.561
GK-susie-rss	0.05	21	14.619	27.902	1.286	0.644
	0.10	35	12.000	23.506	1.257	0.657
	0.20	47	10.191	20.591	1.319	0.695

Table 2: Details of top variants of identified loci given by GK-pseudolasso (target FDR level: 0.20).

Chr.	SNP	Ref.	Alt.	TopS2GGene	Closest gene	$Z$-score (Study 1)	$Z$-score (Study 2)	$Z$-score (Study 3)	$Z$-score (Study 4)	$Z$-score (Study 5)	$Z$-score (Study 6)	$Z$-score (Study 7)	$Z$-score (Study 8)	$Z$-score (Study 9)	Meta-analysis $Z$-score	$W$	Marginal $p$-value
1	20853688	C	T	EIF4G3	EIF4G3	2.72	3.93	3.29	4.29	2.89	–	–	0.68	1.65	4.95	2.516 $\times 10^{-3}$	3.697 $\times 10^{-7}$
1	200984367	A	G	KIF21B	KIF21B	-2.21	-3.73	-4.33	-3.60	-2.92	–	–	–	-0.67	-4.36	1.837 $\times 10^{-3}$	6.490 $\times 10^{-6}$
1	207611623	A	G	CR1	CR1	-4.84	-8.81	-7.97	-9.65	-6.37	–	–	–	-3.96	-10.97	1.228 $\times 10^{-2}$	2.802 $\times 10^{-28}$
2	37270395	G	A	–	NDUFAF7	1.50	4.02	3.73	4.03	2.05	–	–	–	–	4.62	1.998 $\times 10^{-3}$	1.953 $\times 10^{-6}$
2	44026309	T	C	–	LRPPRC	-1.23	-3.80	-1.88	-3.55	-2.26	–	–	–	-3.64	-4.37	2.089 $\times 10^{-3}$	6.208 $\times 10^{-6}$
2	65409567	G	A	–	SPRED2	–	-3.93	-2.41	-4.05	-0.23	–	–	–	0.15	-4.44	1.993 $\times 10^{-3}$	4.538 $\times 10^{-6}$
2	105805908	T	C	–	NCK2	0.10	-3.94	-2.80	-4.67	-2.08	–	–	–	–	-4.72	2.490 $\times 10^{-3}$	1.185 $\times 10^{-6}$
2	127136908	A	T	BIN1	BIN1	3.77	10.94	8.68	11.95	8.74	–	–	–	4.90	13.36	1.141 $\times 10^{-2}$	5.466 $\times 10^{-41}$
2	233117202	G	C	NGEF	INPP5D	2.03	6.15	5.16	6.42	2.29	–	–	–	2.40	7.21	4.762 $\times 10^{-3}$	2.826 $\times 10^{-13}$
3	136105288	G	A	SLC35G2	PPP2R3A	-1.24	-3.66	-2.03	-4.64	-1.98	–	–	–	-3.04	-4.84	2.250 $\times 10^{-3}$	6.607 $\times 10^{-7}$
4	11024404	A	G	–	CLNK	-2.47	-6.00	-4.06	-6.50	-4.32	–	–	–	-2.52	-7.32	5.039 $\times 10^{-3}$	1.275 $\times 10^{-13}$
4	71303158	G	A	–	SLC4A4	-2.60	-3.77	-3.65	-3.34	-1.49	–	–	–	-1.53	-4.28	1.817 $\times 10^{-3}$	9.466 $\times 10^{-6}$
4	112082387	A	C	–	FAM241A	2.47	4.82	1.76	3.36	0.40	–	–	–	-0.71	4.68	2.478 $\times 10^{-3}$	1.418 $\times 10^{-6}$
4	143428212	C	T	–	GAB1	–	-3.68	-2.84	-3.95	-1.56	–	–	–	-1.36	-4.37	2.017 $\times 10^{-3}$	6.081 $\times 10^{-6}$
4	158808801	G	A	RAPGEF2	FNIP2	-2.43	-3.95	-2.39	-3.82	-2.40	–	–	–	-1.91	-4.66	1.894 $\times 10^{-3}$	1.553 $\times 10^{-6}$
5	4068226	C	T	–	IRX1	–	4.53	1.20	3.62	0.34	–	–	–	-0.91	4.50	2.144 $\times 10^{-3}$	3.323 $\times 10^{-6}$
5	14707491	C	T	ANKH	ANKH	-3.92	-3.20	-3.95	-4.36	-3.60	–	–	–	0.60	-4.66	2.480 $\times 10^{-3}$	1.602 $\times 10^{-6}$
5	86923485	A	G	–	LINC02059	2.49	4.70	2.45	3.76	3.49	–	–	–	2.49	5.12	3.059 $\times 10^{-3}$	1.517 $\times 10^{-7}$
5	177559423	G	A	RAB24	FAM193B	1.96	3.85	3.99	4.16	2.48	–	–	–	1.38	4.71	2.313 $\times 10^{-3}$	1.248 $\times 10^{-6}$
5	179373099	C	T	–	ADAMTS2	1.52	2.54	4.29	4.80	3.12	–	–	–	2.35	4.36	1.938 $\times 10^{-3}$	6.512 $\times 10^{-6}$
6	935171	T	C	–	LINC01622	-2.80	-3.20	-3.33	-4.55	-3.37	–	–	–	-2.17	-4.75	2.380 $\times 10^{-3}$	1.040 $\times 10^{-6}$
6	32686937	T	C	HLA-DQA2	HLA-DQB1	-3.88	-6.46	-4.86	-7.53	-2.29	–	–	–	-1.10	-8.13	4.461 $\times 10^{-3}$	2.090 $\times 10^{-16}$
6	41066261	G	C	OARD1	OARD1	2.69	3.78	6.91	7.12	4.06	–	–	–	–	6.37	5.558 $\times 10^{-3}$	9.364 $\times 10^{-11}$
6	47484147	C	T	CD2AP	CD2AP	2.95	5.74	5.21	6.10	5.33	–	–	–	2.24	7.05	3.271 $\times 10^{-3}$	8.942 $\times 10^{-13}$
7	1543652	A	G	TMEM184A	MAFK	2.33	4.06	2.93	3.64	2.36	–	–	–	0.33	4.54	1.810 $\times 10^{-3}$	2.868 $\times 10^{-6}$
7	37842715	G	A	–	NME8	2.95	4.15	3.81	3.74	3.20	–	–	–	1.13	4.79	2.045 $\times 10^{-3}$	8.230 $\times 10^{-7}$
7	100406823	C	T	–	ZCWPW1	4.25	7.53	4.01	8.41	5.04	–	–	3.59	1.29	9.35	8.987 $\times 10^{-3}$	4.266 $\times 10^{-21}$
7	143410495	G	T	EPHA1-AS1	EPHA1	1.19	6.56	4.37	6.81	2.70	–	–	–	1.63	7.52	3.795 $\times 10^{-3}$	2.751 $\times 10^{-14}$
8	27362470	C	T	PTK2B	PTK2B	3.84	6.79	6.12	7.94	5.19	–	–	–	2.12	8.70	4.345 $\times 10^{-3}$	1.668 $\times 10^{-18}$
8	95041772	C	T	–	NDUFAF6	4.06	3.96	4.03	4.50	2.81	–	–	–	0.36	5.17	2.207 $\times 10^{-3}$	1.172 $\times 10^{-7}$
8	97359646	A	G	–	SNORD3H	2.70	3.01	3.70	3.99	2.42	–	–	–	1.30	4.25	1.767 $\times 10^{-3}$	1.067 $\times 10^{-5}$
8	102564430	G	A	–	ODF1	1.72	4.00	2.66	3.53	1.42	–	–	–	-0.48	4.29	1.855 $\times 10^{-3}$	8.825 $\times 10^{-6}$
8	111515902	C	T	–	LINC02237	–	4.13	0.44	3.77	-0.41	–	–	–	–	4.40	2.051 $\times 10^{-3}$	5.387 $\times 10^{-6}$
8	144042819	T	C	PARP10	SPATC1	0.17	4.66	2.47	4.57	3.68	–	–	–	–	5.16	3.389 $\times 10^{-3}$	1.210 $\times 10^{-7}$
10	29966853	G	A	–	JCAD	–	3.72	2.05	4.56	1.31	–	–	–	0.48	4.68	2.501 $\times 10^{-3}$	1.443 $\times 10^{-6}$
10	42722997	T	C	–	LOC283028	0.39	4.79	2.57	4.34	1.14	–	–	–	0.25	5.02	2.128 $\times 10^{-3}$	2.616 $\times 10^{-7}$
10	59962515	T	G	–	LINC01553	1.43	3.63	3.30	5.14	3.48	–	–	–	3.17	5.18	2.031 $\times 10^{-3}$	1.130 $\times 10^{-7}$
10	80494228	C	T	TSPAN14	TSPAN14	3.23	3.22	4.17	5.83	2.03	–	–	–	–	5.35	2.041 $\times 10^{-3}$	4.295 $\times 10^{-8}$
11	60254475	G	A	–	MS4A4E	-5.74	-7.97	-8.27	-9.09	-6.66	–	–	–	-3.32	-10.30	8.499 $\times 10^{-3}$	3.570 $\times 10^{-25}$
11	65888811	G	A	FIBP	FIBP	-2.13	-4.59	-1.22	-3.57	-1.62	–	–	–	-0.38	-4.74	2.589 $\times 10^{-3}$	1.070 $\times 10^{-6}$
11	86156833	A	G	PICALM	PICALM	6.78	8.67	8.07	10.55	5.11	–	–	–	3.08	11.50	1.074 $\times 10^{-2}$	6.418 $\times 10^{-31}$
11	121578263	T	C	–	SORL1	-3.10	-4.40	-3.82	-5.59	-3.38	–	–	–	-0.52	-5.90	3.920 $\times 10^{-3}$	1.768 $\times 10^{-9}$
13	43679792	C	T	–	ENOX1	0.19	3.79	1.22	4.28	0.01	–	–	–	-1.03	4.30	1.865 $\times 10^{-3}$	8.441 $\times 10^{-6}$
13	93594511	A	T	–	GPC6-AS2	–	-0.04	-1.09	-0.85	-2.34	–	–	–	-0.57	-0.62	7.282 $\times 10^{-2}$	2.672 $\times 10^{-1}$
14	32478306	T	C	AKAP6	AKAP6	-1.45	-4.35	-1.77	-3.63	-0.28	–	–	–	0.77	-4.44	1.869 $\times 10^{-3}$	4.449 $\times 10^{-6}$
14	52924962	A	G	–	FERMT2	4.68	4.58	4.97	6.27	2.90	–	–	–	1.32	6.58	4.682 $\times 10^{-3}$	2.429 $\times 10^{-11}$
14	92470949	C	T	–	SLC24A4	-3.83	-6.10	-5.16	-6.67	-2.90	–	–	–	-2.58	-7.57	4.647 $\times 10^{-3}$	1.836 $\times 10^{-14}$
15	50735410	C	T	HDC	SPPL2A	-3.16	-4.81	-4.09	-6.02	-2.45	–	–	–	0.09	-6.29	5.133 $\times 10^{-3}$	1.547 $\times 10^{-10}$
15	58753575	A	G	–	ADAM10	-2.86	-5.90	-4.16	-5.97	-2.81	–	–	–	-2.16	-6.94	3.385 $\times 10^{-3}$	1.910 $\times 10^{-12}$
15	63277703	C	T	APH1B	APH1B	1.20	5.52	3.68	5.72	2.58	2.46	1.61	0.98	2.05	6.45	3.285 $\times 10^{-3}$	5.482 $\times 10^{-11}$
16	31120929	A	G	KAT8	KAT8	-2.28	-5.50	-2.72	-5.84	-2.89	–	–	–	-1.45	-6.56	3.913 $\times 10^{-3}$	2.702 $\times 10^{-11}$
17	5233752	G	A	SCIMP	SCIMP	3.30	6.04	3.82	5.48	1.93	–	–	–	2.40	6.79	3.297 $\times 10^{-3}$	5.560 $\times 10^{-12}$
17	7581494	G	A	CD68	LOC100996842	-1.82	-3.60	-1.57	-3.49	-3.37	-1.95	-1.61	-2.72	-3.18	-4.42	1.933 $\times 10^{-3}$	4.941 $\times 10^{-6}$
17	49219935	T	C	ABI3	ABI3	–	-4.94	–	–	-4.75	-2.68	0.20	–	-2.61	-5.25	2.982 $\times 10^{-3}$	7.430 $\times 10^{-8}$
17	58331728	G	C	BZRAP1	MIR142	-1.00	-4.94	-5.09	-5.12	-3.81	–	–	–	-1.35	-5.75	3.909 $\times 10^{-3}$	4.412 $\times 10^{-9}$
17	63482562	C	T	ACE	ACE	2.73	5.07	3.54	5.25	3.92	1.93	2.67	2.09	2.45	6.32	5.299 $\times 10^{-3}$	1.268 $\times 10^{-10}$
19	1058177	A	G	–	ABCA7	-0.93	-4.61	-2.73	-4.94	-3.96	-1.16	-1.48	-0.38	0.52	-5.45	4.973 $\times 10^{-3}$	2.534 $\times 10^{-8}$
19	6876985	T	C	VAV1	ADGRE1	1.05	3.04	3.58	4.42	1.59	–	–	–	0.42	4.21	2.119 $\times 10^{-3}$	1.254 $\times 10^{-5}$
19	44888997	C	T	PVRL2	NECTIN2	20.83	51.85	–	–	–	–	–	–	–	53.66	8.573	0.000
19	51224706	C	A	CD33	CD33	-3.40	-5.84	-5.09	-5.69	-3.76	–	–	–	-3.97	-6.96	4.936 $\times 10^{-3}$	1.696 $\times 10^{-12}$
19	54664811	A	G	LILRB4	LILRB4	-2.61	-3.61	-3.13	-3.89	-1.05	–	–	–	0.54	-4.37	1.958 $\times 10^{-3}$	6.300 $\times 10^{-6}$
20	56409712	G	T	CASS4	CASS4	-3.82	-5.84	-4.56	-6.07	-5.14	–	–	–	–	-7.12	6.582 $\times 10^{-3}$	5.526 $\times 10^{-13}$
21	26775872	C	T	ADAMTS1	ADAMTS1	-1.60	-2.90	-5.17	-5.54	-3.39	–	–	–	-0.22	-4.87	2.469 $\times 10^{-3}$	5.668 $\times 10^{-7}$

Appendix O Supplementary figures of meta-analysis for AD

Analogous to Figure 6 in Section 5, Figures 12, 13, and 14 present Manhattan plots of the meta-analysis of the nine studies via the conventional marginal association test (with $p$-value cutoff $5\times 10^{-8}$), GK-marginal (with target FDR level 0.10), and GK-susie-rss (with target FDR level 0.10), respectively.

The conventional marginal association test selects many feature groups because it focuses on marginal correlations between feature groups and the response while ignoring spurious correlation induced by linkage disequilibrium. This is shown in Figure 12, where the conventional marginal association test tends to select many nearby loci. This issue is alleviated by the GhostKnockoffs approach that tests conditional independence as seen in Figures 6, 13, and 14.

Appendix P Running Lasso on binary responses

In genetic datasets, the response $Y$ is often binary. Running the Lasso or Lasso-type regressions on a binary response may sound unreasonable, since it violates the usual linear model assumption. One might expect penalized logistic regression to produce much more effective feature importance statistics. Somewhat surprisingly, the following two simulations show that this intuition may not be correct.

For the first column of Figure 15, we generate $X_{i}\stackrel{\text{iid}}{\sim}\mathcal{N}(\mathbf{0},\frac{1}{\sqrt{n}}\mathbf{I}_{p})$ and, conditional on $X_{i}$, draw $Y_{i}$ with $\mathbb{P}(Y_{i}=1)=\frac{1}{1+e^{-\bm{\beta}^{\top}X_{i}}}$ and $\mathbb{P}(Y_{i}=0)=1-\mathbb{P}(Y_{i}=1)$, where $n=1000$ and $p=300$. We create $\bm{\beta}$ by selecting 30 coordinates uniformly at random to be non-zero; the sign of each non-zero coordinate is positive or negative with equal probability. The dark curve represents the knockoffs procedure with the Lasso coefficient difference statistic (with tuning parameter chosen by cross-validation), i.e., KF-lassocv. The red curve represents the knockoffs procedure with the coefficient difference statistic generated by $L_{1}$-penalized logistic regression. We vary the signal amplitude so that the curves trace out relatively complete power profiles. The target FDR is $0.1$. Each point on the curves is an average over 200 replications.

For the second column of Figure 15, we show the results for AR(1) features. Here, $n=600$, $p=200$, and the signal amplitude (i.e., the magnitude of the non-zero $\beta$ values) is fixed at 0.5. Otherwise, the simulation setting is exactly the same as in the independent case. We observe that the two methods have almost the same power and FDR, so using penalized logistic regression instead of the Lasso does not meaningfully affect the results.
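The data-generating step of the independent-feature simulation can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' code: it reads the second argument of $\mathcal{N}(\cdot,\cdot)$ as a covariance, so each coordinate has variance $1/\sqrt{n}$. The function name `simulate_logistic_data`, the `seed` argument, and the default `amplitude` are hypothetical (the text varies the amplitude without listing the values used). Knockoff construction and the Lasso or logistic fits are not shown.

```python
import numpy as np

def simulate_logistic_data(n=1000, p=300, n_signals=30, amplitude=1.0, seed=0):
    """Draw (X, Y, beta) for the independent-feature logistic simulation.

    Hypothetical helper; `amplitude` and `seed` are illustrative choices.
    """
    rng = np.random.default_rng(seed)
    # Assumed reading: covariance (1/sqrt(n)) I_p, i.e. standard deviation n^(-1/4).
    x = rng.normal(loc=0.0, scale=n ** -0.25, size=(n, p))
    # 30 non-zero coordinates chosen uniformly at random, with random signs.
    beta = np.zeros(p)
    support = rng.choice(p, size=n_signals, replace=False)
    signs = rng.choice([-1.0, 1.0], size=n_signals)
    beta[support] = amplitude * signs
    # Binary response from the logistic model P(Y = 1 | X) = 1 / (1 + exp(-beta^T X)).
    prob = 1.0 / (1.0 + np.exp(-(x @ beta)))
    y = rng.binomial(1, prob)
    return x, y, beta

x, y, beta = simulate_logistic_data()
```

Repeating the draw with different seeds and amplitudes would give the replications over which power and FDR are averaged.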
