Controlled Variable Selection from Summary Statistics Only?
A Solution via GhostKnockoffs and Penalized Regression
Abstract
Identifying which variables do influence a response while controlling false positives pervades statistics and data science. In this paper, we consider a scenario in which we only have access to summary statistics, such as the values of marginal empirical correlations between each dependent variable of potential interest and the response. This situation may arise due to privacy concerns, e.g., to avoid the release of sensitive genetic information. We extend GhostKnockoffs He et al. (2022) and introduce variable selection methods based on penalized regression achieving false discovery rate (FDR) control. We report empirical results in extensive simulation studies, demonstrating enhanced performance over previous work. We also apply our methods to genome-wide association studies of Alzheimer’s disease, and evidence a significant improvement in power.
Keywords— Variable selection, replicability, summary statistics, false discovery rate (FDR), knockoffs, genome-wide association study (GWAS), pseudo-lasso
1 Introduction
1.1 Background and contributions
Modern large-scale studies frequently involve a multitude of explanatory variables potentially associated with an outcome we would like to better understand. Oftentimes, the goal is to select those explanatory variables that are meaningfully associated with the response variable. For instance, with recent advances in genome sequencing technologies and genotype imputation techniques, one can now gather tens of millions of variants from hundreds of thousands of samples in large-scale genetic studies, with the aim of pinpointing which genetic variants are biologically associated with specific diseases. This information could provide mechanistic insights and potentially aid the development of targeted drugs. In statistics, this challenge is typically framed as a multiple testing problem. Further, due to the sheer number of hypotheses considered and the cost of following false leads, it is generally required to control some form of error rate on the false positives.
In this paper, we focus on controlling the false discovery rate (FDR), which is the expected proportion of false selections among all selected variables. Compared to the more stringent familywise error rate (FWER) control, keeping the FDR under a nominal level allows for more discoveries while maintaining a reasonable statistical guarantee on the rate of false positives. Several methods for FDR control have been proposed in the literature, with the Benjamini-Hochberg procedure being particularly popular (Benjamini and Hochberg, 1995). However, these approaches often assume a parametric model or the existence of valid -values, which remains difficult, and even problematic, in high-dimensional settings.
Candès et al. (2018) proposed the model-X knockoffs, a broad and flexible framework which allows the statistician to select variables that retain dependence with the response conditional on all other covariates while maintaining FDR control. Model-X knockoffs differs from previous approaches in that (1) it makes no modeling assumptions on the distribution of the response we wish to study conditional on the family of covariates , and (2) it does not require the construction of valid -values. Instead, the crucial assumption is that the distribution of is known. The main idea in Candès et al. (2018) is to generate fake variables , knockoffs, which we can view as negative controls and can be used to tease apart variables that do influence the response from those who do not. Model-X knockoffs has proved effective in a number of real-world applications, particularly in GWAS; see Bates et al. (2020), Sesia et al. (2021) and He et al. (2022) for examples.
To deploy model-X knockoffs, researchers must have in hand the covariates and responses from all samples. However, in certain situations, individual-level data that may reveal sensitive personal information is not readily accessible. For example, due to privacy concerns, many GWAS studies only publish summary statistics of the original data (Pasaniuc and Price, 2017). Yet in such cases, we would still like to develop controlled variable selection methods that rely solely on summary statistics. In genetic studies, this would enable us to utilize available summary data from different data centers to conduct meta-analysis, enhancing the effective sample size and improving variable selection power. On this front, He et al. (2022) proposed the framework of GhostKnockoffs, which implements the knockoffs procedure with the marginal correlation difference feature importance statistic directly from summary statistics. As we shall review next, the main idea is to generate knockoff scores directly without creating knockoff variables; all that is needed are marginal correlations between the response and the features under study. In details, with being the sample size and the number of variables being assayed, the method operates with only and , where is the matrix of covariates, and is the response vector.
In this paper, we extend the family of GhostKnockoffs methods to incorporate feature importance statistics obtained from penalized regression. We first consider in Section 3 the situation in which the empirical covariance of the covariate-response pair is available; with the above notation, this means that the summary statistics , , are available along with the sample size . Unsurprisingly, we observe substantial power improvement over the method of He et al. (2022) because we can now employ far more effective test statistics. Next, in Section 4, we consider the case where the empirical covariance of the features is not available. There, we propose new imputation methods that consistently outperform He et al. (2022) in comprehensive synthetic and semi-synthetic simulations and rigorously control the FDR under suitable conditions. Finally, in Section 5 we apply our methods to a meta-analysis of nine large-scale array-based genome-wide association and whole-exome/-genome sequencing studies of Alzheimer’s disease, in which our methods yield more discoveries than He et al. (2022). We note that existing work in the genetics literature has implemented variable selection methods based on penalized regression with summary statistics, e.g., Mak et al. (2017) and Zou et al. (2022). However, none of these provide any guarantee of FDR control. In fact, as we note in the main text, these methods can be leveraged in our approach to create knockoffs versions that do control the FDR.
1.2 Code availability and reproducibility
The software and example code that reproduce the results presented in this paper can be found at https://github.com/biona001/ghostknockoff-gwas-reproducibility/tree/main/chen_et_al. Simulation results in Section 3.5, Section 4.4.2 and Section 4.4.3 can be exactly reproduced. Due to data accessibility issue, we only provide code without real data for Section 4.4.1 and Section 5.
2 Model-X Knockoffs and GhostKnockoffs
To begin with, we define the controlled variable selection problem and give a brief review of model-X knockoffs and GhostKnockoffs. For a more detailed exposition, we refer readers to Candès et al. (2018), Barber and Candès (2015), and He et al. (2022). In the following, we use boldface letters for vectors and matrices.***As an exception, we use , , and to represent generic covariates, their knockoffs, and the response. We use and to respectively represent the th column and th row of the covariate matrix .
2.1 Problem statement
Given covariates and a response , we are interested in understanding which variables influence . We formulate this selection problem as testing the conditional independence hypotheses for , where is a shorthand for all the variables except the th; that is . In words, we should reject if we believe that can help better predict the outcome than if we only had available the values of all the other variables. Put differently, has information about which cannot be subsumed by the information contained in all the other variables. By conditioning on , these hypothesis tests aim to weed out variables whose relationship to is driven by residual correlations with other covariates.
Let be the set of indices for which the null conditional independence hypothesis is true, and let be the set of indices of the hypotheses rejected by a selection procedure. The false discovery rate (FDR) is the expected fraction of false positives among the selected, defined as
with the convention that . Our goal is to make as many rejections as possible while controlling the FDR below a user-specified level .
In this paper, we consider the setting in which, instead of observing i.i.d. samples from the distribution of (), we only have some summary statistics of the i.i.d. samples. In particular, we will show how one can, quite remarkably, perform tests of conditional independence when we do not directly observe the i.i.d. samples. Throughout this paper, we assume that where is known (or, in practice, can be estimated).
2.2 Model-X knockoffs
2.2.1 The procedure
Suppose we observe i.i.d. samples , , arranged in a data matrix and response vector . In the model-X knockoffs framework Candès et al. (2018), we assume we know the distribution of the covariates while having no knowledge of the conditional distribution . The model-X approach is well-suited to genetic applications where reference panels may be available to estimate or where we have good models of linkage disequilibrium.
To implement model-X knockoffs, we first generate a matrix of knockoffs such that the following two conditions hold:
(Exchangeability): | (1) | |||
(Conditional independence): | (2) |
Roughly, the first says that we cannot distinguish between and , where is obtained from by swapping the th and th columns. The second condition implies that does not provide any new information about conditional on and is guaranteed if is constructed without looking at . If these properties hold, it can be shown that and are indistinguishable conditional on for each .
Next, we define feature importance statistics to be any function of , and such that a flip-sign property holds; namely, switching a column with its knockoff flips the sign of the th component of the output; formally, . Common choices include (marginal correlation difference statistic) and (Lasso coefficient difference statistic), where is the solution to the Lasso problem
and is usually chosen by cross-validation.
Finally, the knockoff filter selects the variables , where
(3) |
Here, , and if is empty. Intuitively, the threshold is chosen to be the most liberal one such that an estimate of FDP is bounded by . Candès et al. (2018) showed that this procedure controls the FDR of the conditional testing problem at level .
2.2.2 Gaussian knockoff sampler
Under the assumption that the rows of the data matrix are i.i.d. from the Gaussian distribution , we can generate a knockoff vector for each row of the data matrix by sampling independently across rows, where , , , and is a vector of free parameters usually obtained by solving a convex optimization problem that depends on (Candès et al., 2018). See Appendix A for details of computing . Concatenating all the knockoff vectors then gives a valid matrix of knockoffs. In matrix form, the construction above is
(4) |
where is an by matrix with i.i.d. standard Gaussian entries, independent of and . For later reference, we summarize the Gaussian knockoff sampler in Algorithm 1 and denote it as .
2.3 GhostKnockoffs with marginal correlation difference statistic
The original model-X knockoffs procedure relies on having access to the covariates and responses from all data points, i.e., the matrix of covariates and the response vector . Henceforth, we call these individual-level data. In many application scenarios, however, individual-level data are not available due to privacy concerns. Instead, we only have access to some summary statistics of and , e.g., the empirical covariance matrix of the covariaties and the empirical covariance between each covariate and the response.
He et al. (2022) proposed GhostKnockoffs, which implements the knockoffs procedure with marginal correlation difference statistic when only and are available. The key idea of He et al. (2022) is to sample the knockoff -score from and directly, in a way such that
(5) |
where is the knockoff matrix generated by the Gaussian knockoff sampler (Algorithm 1). If we use (where ) as the feature importance statistic and run the knockoff filter, the resulting rejection set will have the same distribution as that of the knockoffs procedure with marginal correlation difference statistic. Therefore, the two procedures are statistically identical. In particular, they both control the FDR.
Specifically, He et al. (2022) showed that for and computed in step 3 of Algorithm 1,
(6) |
satisfies (5) as detailed in Appendix B. All this is summarized in Algorithm 2. In the following sections, we refer to Algorithm 2 as GhostKnockoffs with marginal correlation difference statistic (GK-marginal).
3 GhostKnockoffs with Penalized Regression: Known Empirical Covariance
3.1 Setting
As we have just seen, GhostKnockoffs-marginal gives a way to test conditional hypotheses while maintaining FDR control when only the summary statistics and are available to the analyst. Now, we consider the setting in which we have knowledge of the empirical covariance matrix and the sample size , in addition to and . These quantities only reveal sample averages of relevant quantities, as opposed to all the individual-level information.
In this section, we propose a variable selection method that utilizes only , , , and . Our method achieves FDR control and power comparable to the knockoffs procedure with the cross-validated Lasso coefficient difference statistic defined in Section 2. This is interesting because the latter usually outperforms GhostKnockoffs with the marginal correlation difference statistic by a significant margin. Notably, for a fixed tuning parameter , we show that our procedure is equivalent to a corresponding knockoffs method using the Lasso coefficient difference statistic with the same penalty level .
3.2 GhostKnockoffs with the Lasso
Recall that in the knockoffs procedure with the Lasso coefficient difference statistic, we solve the optimization problem
(7) |
where We then define the Lasso coefficient difference feature importance statistics by for . If we have access to individual-level data, is usually chosen by cross-validation (Candès et al. (2018) and Weinstein et al. (2020)).***In the case that is binary, one may think that utilizing (penalized) logistic regression would give much better power than Lasso. In Appendix P, we show that this intuition may not be correct through simulations, even when is generated according to a logistic regression model.
As a first step, we would like to run a statistically equivalent procedure using , , and for a fixed . Note that, with fixed, (7) depends on the data only through
and
Define the Gram matrix of
The Gram matrix can of course be equivalently reconstructed from . The main idea is to sample from the joint distribution of using the Gram matrix of only. Based on this, we can then generate the solution to the Lasso problem (7) (in distribution) for a fixed .***Careful readers may realize that the solution of the Lasso problem does not depend on . Here we include as an input of to be able to make a more general statement later that goes beyond the Lasso. This is achieved via the following Proposition 1, which says in words that if we generate ‘fake’ data matrices and that lead to the same Gram matrix as that of and , then the distribution of remains unchanged if we replace the original data matrices by the fake data matrices.
Proposition 1.
Suppose and are constructed such that . Setting and as the outputs of Algorithm 1,***Note that may not be a data matrix with i.i.d. rows and covariance matrix and we should call the pseudo-Gaussian knockoff data matrix. we have
Proof of Proposition 1 is provided in Appendix C. Specifically, Proposition 1 suggests that summary statistics (, ) are sufficient for sampling the Gram matrix .
We are now able to write down a procedure, namely, Algorithm 3, which is statistically equivalent to the corresponding individual-level knockoffs procedure using the Lasso coefficient difference statistic (or any statistic defined in Sections 3.3 and 3.4). In step 2, and can be obtained by performing the eigen-decomposition or Cholesky decomposition of . Brief procedures to construct and via eigen-decomposition are provided in Appendix D. All we need to do is to run the knockoffs procedure with and in lieu of and . We say that the procedure is equivalent since the rejection sets have the same distribution. In particular, this proves that Algorithm 3 controls the FDR.
Corollary 1.
Consider a knockoffs feature importance statistic , which is a deterministic function of and an independent random variable . Define . Let (resp. ) be the rejection set obtained from applying the knockoffs filter on (resp. ). Then . Thus, if obeys the flip-sign property, both procedures have equal FDR at most equal to .
Proof.
Proposition 1 gives . Since the selection set is uniquely determined by the values of (or ), it follows that . Therefore, the procedures have the same FDR. ∎
We can easily adapt the method above to accommodate other types of regularization, such as Ridge regression and Elastic Net.
3.3 GhostKnockoffs with the square-root Lasso
In Section 3.2, we assumed that the tuning parameter in (7) is fixed. In practice, one may choose the penalty level using information from the Gram matrix of , and the sample size . Since individual-level data is not available, we are unable to use data-splitting approaches such as cross-validation.
An alternative way to define feature importance is to use the square-root Lasso (Belloni et al., 2011), for which the choice of a reasonable tuning parameter is convenient. The square-root Lasso applied to the knockoffs setting solves
(8) |
and a good choice of is given by
(9) |
where and is a unitless hyperparameter (Tian et al., 2018). This value is a scalar multiple of the expected value of the minimal penalty level required such that all the coefficients are shrunk to zero under the global null model. The square-root Lasso has the benefit that the value of the hyperparameter does not depend on the details of the distribution of conditional on . We also found that the performance of our procedure does not depend very sensitively on the choice of . In our data examples, we take .
In the setting where we only know about values of the summary statistics, we simply replace () by ( in (8). Further, we note that for any orthogonal matrix ,
where the second equality follows from . Therefore, the value of the hyperparameter in (9) remains unchanged if we multiply by on the left. This implies that (9) is a deterministic function of . Hence, the feature importance statistic is a function of . Following Corollary 1, we can apply the knockoffs procedure with the square-root Lasso and matrices in lieu of . Upon choosing
(10) |
we get a procedure, which is statistically indistinguishable from that we would get if we were performing all the same steps with and . (In practice, we compute the value in (10) via Monte Carlo simulation.) In the sequel, we call the resulting procedure summary statistics GhostKnockoffs with square-root Lasso importance statistic (GK-sqrtlasso). Note that GK-sqrtlasso controls the FDR as the flip-sign property of the feature importance statistic holds. This is because swapping a variable with its knockoff does not change the value of the hyperparameter. Therefore, by Corollary 1, applying the knockoff filter to the square-root Lasso feature importance statistics yields FDR control.
3.4 GhostKnockoffs with the Lasso-max
In the standard fixed-X knockoffs setting, cross-validation is also not feasible, since doing so would violate the sufficiency condition required for the feature importance statistics. As one possible alternative, Barber and Candès (2015) considered using as the feature importance statistic the value of on the Lasso path at which feature first enters the model. Formally, they define the feature importance statistic
where is as in (7). We call this statistic the Lasso-max statistic. Intuitively, a larger penalty level is required to shrink an important feature to zero, so we should expect to be large and positive for non-nulls.
By Corollary 1, with the Lasso-max statistic Algorithm 3 produces a rejection set that has the same distribution as the rejection set obtained from the corresponding individual-data-based knockoffs procedure. We call this summary-statistic-based procedure GhostKnockoffs with Lasso-max statistic (GK-lassomax).
We remark that choices of other tuning parameters and feature importance statistics are also possible. For instance, we may choose to minimize the Stein’s unbiased risk estimate (SURE) associated with (7). We shall however focus on the two approaches we have described.
3.5 Numerical simulations
We consider a variety of simulation settings in which we compare the performance of the proposed GhostKnockoffs with square-root Lasso and Lasso-max statistics (GK-sqrtlasso and GK-lassomax, defined in Sections 3.3 and 3.4), GhostKnockoffs with marginal correlation difference statistic (GK-marginal, defined in Section 2), and the knockoffs procedure with (cross-validated) Lasso coefficient difference statistic with individual-level data (KF-lassocv). Note that the first three are statistically equivalent to the corresponding knockoffs procedures with individual-level data.
3.5.1 Independent features
In the first set of simulations (Figure 1), we generate random samples and , where for .***The simulation setting is designed in a way that the signal-to-noise ratio has the same scale as varies. We consider three settings of varying dimensionality measured by the ratio : . In each of the three settings, we create a sparse vector by selecting 30 coordinates to be non-zero uniformly at random. The signs of these non-zero coordinates are assigned to be either positive or negative with equal probability. We vary the signal amplitudes such that we explore a wide power range below. For the square-root Lasso, we average over 200 Monte Carlo samples to calculate
The target FDR is 20%. Each point on the curves represents the average of the results from 200 replications.
We observe that GK-sqrtlasso and GK-lassomax generally demonstrate greater power than GK-marginal. This enhanced performance is not surprising, as GK-sqrtlasso and GK-lassomax (1) have access to additional information via , and (2) employing a joint modeling algorithm such as Lasso generally provides a better assessment of variable importance for understanding conditional (in)dependence since such a model explicitly adjusts for the effects from all the other variables. We also note the presence of power gaps between GK-lassocv and GK-sqrtlasso/GK-lassomax, likely due to the fact that we are unable to perform cross-validation without individual-level data. All methods control the FDR at the desired level.
3.5.2 AR(1) features
In the second set of simulations (Figures 2), we generate for , where for . As before, we generate , where for . We consider the same three combinations. In each of the three cases, we create a sparse vector exactly as before, except that we fix the signal amplitudes to 4, 4, and 7 respectively to explore a wide power range. We vary in The target FDR is set to be 20%. Each point represents the average of the results from 200 replications.
Again, we observe that GK-sqrtlasso and GK-lassomax generally have greater power than GK-marginal. All methods have (almost) decreasing power as the autocorrelation coefficient increases, since it becomes harder to separate true signals from null variables that are correlated with them. All methods control the FDR at the desired level.
4 GhostKnockoffs with Penalized Regression: Missing Empirical Covariance
4.1 Setting
Thus far, we have discussed how incorporating the additional information from and could enhance our ability to detect significant features. However, in applications such as genetics, may not be available. In this section, we propose alternative procedures when the scientist only knows about , and the sample size . As before, we assume that , where the covariance matrix is known (or can be estimated from other data sources).
4.2 GhostKnockoffs with pseudo-lasso
The idea of our method is to modify the Lasso objective function so that it can be constructed from the available summary statistics. It turns out that the solution of our modified objective function is proportional to that of the scout procedure (with known precision matrix) proposed by Witten and Tibshirani (2009). We will see through simulation studies that our procedure improves the power of the original GhostKnockoffs method of (He et al., 2022) while maintaining FDR control.
4.2.1 The procedure
Recall that in the knockoffs procedure with the Lasso statistic, we solve the following optimization problem:
To mimic the form of the loss function when we do not observe the empirical covariance of the features, we may want to substitute them with their population version: i.e. we swap and with and with . As usual, is obtained by solving the convex optimization problem (15). In the language of fixed-X knockoffs (Barber and Candès, 2015), this is equivalent to regarding as a fixed-X knockoff of and replacing by .***We remark that similar objective functions have been used in, for example, Mak et al. (2017) and Zou et al. (2022). This yields Algorithm 4.
We call this procedure GhostKnockoffs with pseudo-lasso statistic (GK-pseudolasso). We show below that Algorithm 4 controls the FDR of selections at level . Before doing so, we first state a general proposition that includes GK-marginal as a special case.
Proposition 2.
Suppose and are defined as in Algorithm 2, is independent of and , and . Consider a knockoffs feature importance statistic , which is a deterministic function of and an independent random variable . Define . Let (resp. ) be the rejection set obtained from applying the knockoffs filter on (resp. ). Then . Thus, if obeys the flip-sign property, both procedures have equal FDR at most equal to .
Proof.
In Appendix B, we prove that
As a result, . Since the selection set is uniquely determined by the values of (or ), it follows that . Therefore, the procedures have the same FDR. ∎
Set to be a fixed numerical constant. Consider the feature importance statistics defined by where is the solution to
(11) |
and is the Gaussian knockoff data matrix. The feature importance statistic in Algorithm 4 is thus obtained by replacing by in (11). Since is determined by and , it follows from Proposition 2 that the rejection set of Algorithm 4 has the same distribution as that obtained from running the knockoff filter on .
Thus to prove that Algorithm 4 controls the FDR of rejections at level , it suffices to verify the flip-sign property of the feature importance statistic for (see Section 2). This is a consequence of the following lemma:
Lemma 1.
Consider the problem
(12) |
Let be any permutation matrix which swaps the jth and (j+p)th entries of a 2p-dimensional vector for each . Assume that is -swap invariant in the sense that . Then is a solution to (12) if and only if is a solution to the same problem with and swapped. In other words, swapping the entries of has the effect of swapping the corresponding entries of the solution.
Proof.
Consider the objective with problem data :
Set so that . Upon changing variables, the objective takes the form
where the equality follows because and because the 1-norm and 2-norm are invariant under permutation. Now, the objective on the right-hand side is the objective with data . If is the solution with data , it follows that is the solution with data , and vice versa. This proves the lemma. ∎
Corollary 2.
Algorithm 4 with a fixed controls the FDR of rejections at level .
Proof.
In practice, to ensure numerical stability, we add a small positive constant multiple of the identity matrix to
when solving for . This is equivalent to incorporating a small Ridge penalty into the objective function. It is easy to see that the lemma proved above guarantees that this modification does not compromise the FDR control as
is also -swap invariant for any and any
4.2.2 Choice of tuning parameter
Several methods can be used to tune the value of the hyperparameter . We here consider two approaches.
Method 1 (lasso-min)
Pretend a homogeneous Gaussian linear model holds, i.e. for some , and .
Focus on (11) first and imagine that we have a method for computing that depends on data only through , and . Note that the objective in Algorithm 4 only substitutes in (11) with . Therefore, by Proposition 2 if we set via the same functional and work with in lieu of , we shall achieve FDR control with this data-driven value of the hyperparameter . This holds of course with the proviso that our selection of hyperparameter is symmetric in the sense that it produces feature importance statistic obeying the flip-sign property.
To set the tuning parameter in (11), we use the common choice of taking a constant multiple of the expected value of the minimum value such that under the null model . By the Karush–Kuhn–Tucker (KKT) conditions (Boyd and Vandenberghe, 2004), this results in a tuning parameter of the form
where is a hyperparameter between 0 and 1. Since is a data matrix whose rows are iid samples from
is a numerical constant, which can be estimated arbitrarily well via Monte Carlo simulations. We use the approach from Dicker (2014) to give an estimate of , which crucially requires knowing only , and . Dicker (2014) showed that the estimator is consistent and asymptotic normal in the high-dimensional regime. Specifically, in our setting, we estimate by
In sum, a choice for in Algorithm 4 is this:
-
1.
Approximate via Monte Carlo simulations, where has iid rows, is independent.
-
2.
Compute
where is independent of everything else.
-
3.
Output where the approximation sign reminds us that the expectation is only approximate.
Method 2 (pseudo-sum)
An alternative way of choosing is to adapt the pseudo-summary statistics approach proposed by Zhang et al. (2021). Set and . The main idea of Zhang et al. (2021) is to generate training summary statistics and validation summary statistics from and based on the training and validation sample sizes and respectively (in this paper we take and ). Following Zhang et al. (2021), we generate the training summary statistics
where
and the validation summary statistics
Given a sequence of candidate values, we choose that which maximizes an approximation of the correlation between the predicted values and the true values on the pseudo-validation set.***Unlike the previous approach, this tuning parameter choice will not induce the exact flip-sign property. However, we observe empirically that our method is robust to this issue, and no FDR inflation occurred. In theory, one could randomly swap all the variables with their corresponding knockoffs and compute the average of all the values obtained. In the limit, the average will give a data-driven value of that is invariant to swapping variables with their knockoffs due to symmetry. Specifically, Zhang et al. (2021) considered the approximation
(13) |
where
(14) |
Therefore, we choose the value that maximizes (13) among a set of candidate values. Since the objective function (11) is convex in , we may employ the BASIL framework proposed by Qian et al. (2020), which implements a batch version of the strong rules introduced in Tibshirani et al. (2012). BASIL can be directly applied to compute the solution path of (14) efficiently.
Note that there exist other ways to choose the penalty level using and (for example, the Lassosum by Mak et al. (2017)). We do not attempt to claim an optimal strategy.
Connection with the scout procedure
It turns out that step 3 of Algorithm 4 is closely related to the scout procedure (Witten and Tibshirani, 2009). The scout procedure defines a family of covariance-regularized regression methods that achieve superior prediction via shrinking the inverse covariance matrix. It includes the Lasso, Ridge and Elastic Net as special cases. In Appendix F, we show that the solution of objective function (11) is proportional to that of the scout procedure (with known precision matrix ). This connection provides a justification on why the objective function (11) is effective.
4.2.3 GhostKnockoffs with other feature importance statistics
In the previous sections, we presented a feature importance statistic based on summary statistics that leads to better power than the marginal correlation difference statistic. By Proposition 2, GhostKnockoffs techniques can be combined with any other feature importance statistics that i) are based on the summary statistics , and the sample size and ii) satisfy the flip-sign property. The procedures generated will still guarantee FDR control. In our simulation studies, we found that using the posterior inclusion probability (PIP) produced by the SuSiE-RSS model (Zou et al., 2022) as the feature importance statistic also results in consistent power improvement over GK-marginal. SuSiE-RSS is based on the Sum of Single Effects (SuSiE) model proposed by Wang et al. (2020), which assumes a Bayesian linear model with true coefficients represented as the sum of multiple one-hot (random) individual effect vectors. Zou et al. (2022) combines SuSiE with a modified likelihood function to accommodate applications in which only summary statistics are available (see Zou et al. (2022) for details).***We used the susie_rss function inside the R package susieR in our simulations. We call the resulting procedure GhostKnockoffs with SuSiE-RSS statistic and denote it by GK-susie-rss. We include this method in the simulation section below.
4.3 Variants of GhostKnockoffs
The methods we presented so far can be adapted to work with various related procedures. We give three examples below for illustration.
4.3.1 Multi-knockoffs
The knockoffs procedure is a randomized procedure which could produce very different selection sets on different runs. This is especially true when the knockoffs rejection set is small. In fact, the offset on the numerator in (3) implies that knockoffs either rejects more than hypotheses, where is the target FDR level, or rejects nothing. To improve the stability of the knockoffs procedure, Gimenez and Zou (2019) proposed simultaneous multi-knockoffs, which is substantially more stable and powerful than knockoffs when the rejection set is small and maintains FDR control in general.
The idea of Gimenez and Zou (2019) is to create (instead of one) knockoff copies for every feature so that they jointly satisfy an extended exchangeability condition.***Specifically, the extended exchangeability condition says that if we permute variables with their corresponding (multiple) knockoffs arbitrarily, the joint distribution remains unchanged. If , Gimenez and Zou (2019) showed that is a valid multi-knockoff for if , where
Here, , and is obtained by solving a more restrictive convex optimization problem than in (15) which guarantees that is positive semi-definite (see Gimenez and Zou (2019) for details). In data matrix form, we generate valid multi-knockoffs by
where has i.i.d. standard normal entries, and
Gimenez and Zou (2019) generalized the knockoffs threshold (3) and the flip-sign property to produce FDR-controlling rejection sets after generating multiple knockoffs via this procedure.
In the summary statistics settings, upon redefining , and as above and replacing the standard knockoffs filter by the multi-knockoffs filter, Algorithms 2 and 3 produce rejection sets that have the same distribution as those produced by their corresponding versions with individual-level data. For Algorithm 4, we simply need to further replace
by .
4.3.2 Group knockoffs
When variables are highly correlated, selection procedures become conservative. For example, if a non-null variable is highly correlated with a null variable , it becomes difficult to reject . This is an important practical concern because highly correlated features are ubiquitous in many settings, particularly GWAS datasets. To overcome this challenge, group knockoffs (Dai and Barber, 2016) can be useful; please see Chu et al. (2023), whose algorithms we employ in the data analyses of Section 5. In group knockoffs, the object of inference is shifted from single variables to groups of highly correlated variables. Specifically, suppose we partition features into groups and reorder all features such that features of the same group are in adjacent columns of . The objective is to test group conditional independence hypothesis:
where denotes a group and is the vector of features in group . When these groups have strong correlation, single-variable knockoffs may struggle to identify signals, but group knockoffs retain power to identify significant groups. As in Section 4.3.1, all methods described in this paper apply to group knockoffs after redefining to the equivalent version in group knockoffs. In Appendix G, we detail the construction of group knockoffs and examples of importance scores at the group level for inference.
4.3.3 Conditional randomization test
The conditional randomization test (CRT) (Candès et al., 2018) is an alternative method to test the conditional independence hypotheses for . By generating a valid ‘CRT -value’ for each hypothesis , existing multiple testing procedures, including the Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995) and the selective SeqStep+ filter (Li and Candès, 2021), can be used to simultaneously test with FDR control.***In general, CRT values may not be independent of each other or satisfy the PRDS property (Benjamini and Yekutieli, 2001). Therefore, applying the Benjamini-Hochberg procedure on CRT values does not guarantee FDR control theoretically. However, as noted in Candès et al. (2018), the FDR is usually under control empirically. As shown in Candès et al. (2018) and Wang and Janson (2021), doing so can improve the power of multiple testing with greater computational complexity.
In Appendix H, we introduce Ghostknockoffs for CRT (GhostCRT), which adopts techniques introduced in this paper to the framework of CRT.
4.4 Numerical simulations
We conduct simulations on synthetic data as well as semi-synthetic data generated from a real-world genetic dataset. Specifically, we apply GhostKnockoffs with pseudo-lasso statistic (GK-pseudolasso, defined in Algorithm 4 with tuning parameter chosen by either lasso-min or pseudo-sum from Section 4.2.2) and GhostKnockoffs with SuSiE-RSS statistic (GK-susie-rss, defined in Section 4.2.3). We compare their performance with GhostKnockoffs with marginal correlation difference statistic (GK-marginal, defined in Section 2) and the knockoffs procedure with (cross-validated) Lasso coefficient difference statistic based on individual-level data (KF-lassocv). We also demonstrate empirically the robustness of our procedures by showing the FDR control when only an estimate of the true covariance matrix is available and when the features are discrete.
4.4.1 Simulations based on real-world genetic data
To mimic the dependency structure among features in real-world applications, we generate synthetic data based on the whole genome sequencing (WGS) data from the Alzheimer’s Disease Sequencing Project (ADSP). The data are obtained from the ADSP consortium following the SNP/Indel Variant Calling Pipeline and data management tool (VCPA) (Leung et al., 2019). The ADSP WGS data records counts of minor alleles of genetic variants over 16,906 individuals. Using reference populations from the 1000 Genomes Consortium (The 1000 Genomes Project Consortium, 2015), we estimate ancestry rates of each individual by SNPWeights v2.1 (Chen et al., 2013) and extract 6,952 individuals with estimated European ancestry rate greater than 80%. We further restrict our simulations to 2,000 randomly selected genetic variants within 0.5Mb distance to the APOE gene (chr19:44909011-45912650; hg38), whose 2 allele and 4 allele are known to be respectively the strongest genetic protective factor and the strongest genetic risk factor for Alzheimer’s disease (Serrano-Pozo et al., 2021; Belloy et al., 2023), and with minor allele frequency (MAF) larger than . Since our simulations focus on performance at identifying relevant clusters of tightly linked variants, we simplify the simulation design by pruning variants to eliminate pairs with absolute correlation greater than . To do so, we first compute the correlation matrix of the 2,000 selected variants over the 6,952 extracted individuals using the shrinkage estimate in the R package corpcor (Schäfer and Strimmer, 2005) and apply hierarchical clustering (single-linkage with cutoff value ) on the distance matrix . As a result, we obtain 512 variant clusters such that pairwise correlation between any pair of variants from different clusters is in . By randomly choosing one representative variant from each cluster, we include tested genetic variants in the simulation study.
For each replicate, we obtain synthetic data by randomly sampling individuals without replacement and collecting the sampled individuals’ records on the tested genetic variants as the covariate matrix . We further sample another individuals without replacement as the reference panel on which we compute the correlation matrix using the shrinkage estimate in the R package corpcor (Schäfer and Strimmer, 2005). Based on the covariate matrix , we generate the response vector from either the linear model (continuous response),
or the mixed-effect logit model (binary response),
Specifically, under the mixed-effect logit model is so that the prevalence (or the expected proportion of ) is . ’s and ’s reflect variation due to unobserved covariates. Only randomly selected coefficients are nonzero, with value , where is the MAF of the -th variant.
With the relevant summary statistics computed, we apply GK-pseudolasso and GK-susie-rss and compare their performances with GK-marginal and KF-lassocv.
Over 1000 replicates under both the linear model and the mixed-effect logit model, average power and FDR of different methods with respect to different target FDR levels are visualized in Figure 3. Under both models, we observe that GK-pseudolasso with both ways of selecting the tuning parameter and GK-susie-rss are uniformly more powerful than GK-marginal. The performance of the proposed methods is very close to that of KF-lassocv. Despite the covariance matrix being estimated using an independent sample and the entries of being discrete, the FDRs of our proposed methods are controlled in both settings, suggesting the robustness of our methods.
GhostKnockoffs with discrete features
We note that discrete covariates do not follow a Gaussian distribution. However, the knockoffs procedure ensures FDR control whenever the feature importance statistics , where is an anti-symmetric function, and is distributionally invariant upon swapping with for each null . Using Lemma 1, we know that Algorithm 4 controls the FDR if swapping the th entry of and the th entry of does not change their joint distribution for each null . In Appendix J, we visually demonstrate the approximate preservation of this distributional invariance. This, along with the robustness of knockoffs (Candès et al., 2018; Barber et al., 2020), helps in explaining why we have not observed FDR inflation with discrete covariates.
4.4.2 Independent features
We revisit the setting from Section 3.5.1 in which . For the pseudo-sum method for GK-pseudolasso, we optimize over using a grid of 100 candidate values interpolating between and linearly in log scale, and
is the minimal value that shrinks all the coefficients to zero. To calculate for the lasso-min parameter method, we use a Monte Carlo estimate averaged over 200 samples. The target FDR is 20%. Each point represents an average over 200 replications.
Note that when , the solution to (15) is . It is easy to see that (11) gives
where the soft-threshold operator is applied coordinate-wise. Therefore, the method in Section 4.2 soft-thresholds the marginal correlation of and .
As shown in Figure 4, all three new methods (GK-pseudolasso with lasso-min/pseudo-sum and GK-susie-rss) consistently outperform GK-marginal, and the FDR is always controlled at the expected level, as theoretically guaranteed. As grows, we see that the three new methods have power closer to KF-lassocv. This is further demonstrated in additional simulations in Appendix I.
4.4.3 AR(1) features
Figure 5 shows the corresponding plots when the covariate matrix is generated from an AR(1) distribution. We found similar patterns to those with independent features. The power of all methods drops when the autocorrelation coefficient increases, as it is then harder to separate true signals from other variables.
5 Application to meta-analysis for Alzheimer’s disease
To illustrate the empirical performance of the methods in detecting genetic variants associated with Alzheimer’s disease (AD), we apply them to a meta-analysis of nine large-scale array-based genome-wide association and whole-exome/-genome sequencing studies for AD. We include the details of the nine studies in Appendix K.
As all studies share the same focus on individuals with European ancestry, we perform a meta-analysis by aggregating their -scores and obtain the meta-analysis -score (see Appendix L for details). In addition, we obtain the block-diagonal covariance matrix with respect to approximately independent linkage disequilibrium blocks provided by Berisa and Pickrell (2016). Within each block, we use the UK Biobank directly genotyped data as the reference panel and compute the covariance matrix via the Pan-UKB consortium (https://pan.ukbb.broadinstitute.org) with details in Appendix M. To improve the power in the presence of tightly linked variants, we apply the group knockoffs construction on top of the GhostKnockoff algorithm, as detailed in Section 4.3.2. Finally, we implement GK-pseudolasso with tuning parameter chosen by the lasso-min method on the meta-analysis -score and the covariance matrix . To stabilize the GhostKnockoffs procedures, we use multi-knockoffs as defined in Section 4.3.1.
Figure 6 presents the result of the meta-analysis of the nine studies via our proposed method with target FDR level 0.1. Here, we specify loci based on variant groups and annotate two loci as different loci if they are 1 Mb away from each other. We adopt the most proximal gene’s name as the locus name.***Specifically, we consider the variant group with the largest group knockoff feature importance statistic within a locus, and then map the locus to the most proximal gene of the variant within the group that has the highest knockoff importance score. As shown by Table 1 in Appendix N, GK-pseudolasso identifies variant groups in 42 and 63 loci when the target FDR level is 0.1 and 0.2 respectively, substantially more than GK-marginal (10 and 17 when the target FDR level is 0.1 and 0.2, respectively). This is consistent with our simulation results in Section 4.4. In addition, we observe from Table 1 that GK-susie-rss identifies fewer loci (35 and 47 when the target FDR level is 0.1 and 0.2, respectively), although it exhibits similar power in simulation studies. In Appendix O, we analogously visualize results of the meta-analysis via conventional marginal association test (with -value cutoff ), GK-marginal (with target FDR level 0.10), and GK-susie-rss (with target FDR level 0.10).
Table 2 in Appendix N shows the top variant with the largest feature importance statistic in each identified group. Most discoveries exhibit relatively strong marginal associations (marginal -value ) in individual studies and the same direction of effects across all studies. Although some loci have an opposite direction of effect in one individual study, such effects are not significant. The consistency across individual studies supports the validity of the proposed method in discovering putative causal variants. In addition, we observe that all top variants of identified groups have small meta-analysis -values (less than 0.05), though some are not smaller than the stringent genome-wide threshold () in marginal association tests with FWER control.
To further investigate whether the identified groups are functionally enriched, we apply a SNP-to-gene linking strategy proposed by (Gazal et al., 2022) to link the top variants of identified groups to the genes that they potentially regulate. Out of 63 top variants, we find that 34 (54.0%) can be mapped with functional evidence (e.g., being an expression quantitative trait locus, in a Hi-C linked enhancer region, near the exon of a gene, etc.), where the proportion is significantly higher than the average percentage of the background genome (28.6%). In summary, the proposed method can identify functional genetic variants with weaker statistical effects missed by conventional association tests.
6 Discussion
This paper introduced novel approaches for performing variable selection with FDR control on the basis of summary statistics. We proposed methods for testing conditional independence hypotheses from summary statistics alone. For the methods from Section 4, all we need are essentially the marginal correlations between and ,***Along with and . which, at first sight, may appear surprising. Our arguments rely on the assumption that the covariates follow a Gaussian distribution, as well as on the linearity and rotational invariance of Gaussian distributions. Since our methods are based on the knockoffs procedure, they do not require any knowledge about the model of given . Our methods extend, and generally give better power than, the work by He et al. (2022) by employing penalized regression to produce the measure of feature importance. The techniques employed in this paper provide a wrapper that can be combined with a variety of feature selection methods, yielding knockoffs versions that guarantee FDR control.
We applied our methods to genetic studies, in which summary statistics are typically available. Due to linkage disequilibrium, the application of our methods to individual genetic variants may yield conservative results. In a parallel work Chu et al. (2023), we have developed tools for constructing group knockoffs efficiently and effectively. When combined, our methods offer a powerful new approach to controlled variable selection in GWAS. This is further supported in our companion work He et al. (2023), where we see the methods in this paper led to significant scientific discoveries.
7 Acknowledgement
Z.C. would like to thank Kevin Guo and Amber Hu for helpful discussions. Z.C. was supported by the Simons Foundation under award 814641. Z.H. was supported by NIH/NIA award AG066206 and AG066515. T.M. was supported by a B.C. and E.J. Eaves Stanford Graduate Fellowship. C.S. was supported by the grants NIH R56HG010812 and NSF DMS2210392. E.J.C. was supported by the Office of Naval Research grant N00014-20-1-2157.
References
- Barber and Candès (2015) R. F. Barber and E. J. Candès. Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5):2055 – 2085, 2015. URL https://doi.org/10.1214/15-AOS1337.
- Barber et al. (2020) R. F. Barber, E. J. Candès, and R. J. Samworth. Robust inference with knockoffs. 2020.
- Bates et al. (2020) S. Bates, M. Sesia, C. Sabatti, and E. Candès. Causal inference in genetic trio studies. Proceedings of the National Academy of Sciences, 117(39):24117–24126, 2020.
- Belloni et al. (2011) A. Belloni, V. Chernozhukov, and L. Wang. Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791–806, 2011.
- Belloy et al. (2022a) M. E. Belloy, S. J. Eger, Y. Le Guen, V. Damotte, S. Ahmad, M. A. Ikram, A. Ramirez, A. C. Tsolaki, G. Rossi, I. E. Jansen, et al. Challenges at the APOE locus: a robust quality control approach for accurate APOE genotyping. Alzheimer’s Research & Therapy, 14:22, 2022a.
- Belloy et al. (2022b) M. E. Belloy, Y. Le Guen, S. J. Eger, V. Napolioni, M. D. Greicius, and Z. He. A Fast and Robust Strategy to Remove Variant-Level Artifacts in Alzheimer Disease Sequencing Project Data. Neurology Genetics, 8(5):e200012, 2022b.
- Belloy et al. (2023) M. E. Belloy, S. J. Andrews, Y. Le Guen, M. Cuccaro, L. A. Farrer, V. Napolioni, and M. D. Greicius. APOE Genotype and Alzheimer Disease Risk Across Age, Sex, and Population Ancestry. JAMA Neurology, 80(12):1284–1294, 2023.
- Benjamini and Hochberg (1995) Y. Benjamini and Y. Hochberg. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289–300, 1995.
- Benjamini and Yekutieli (2001) Y. Benjamini and D. Yekutieli. The control of the false discovery rate in multiple testing under dependency. Annals of statistics, pages 1165–1188, 2001.
- Berisa and Pickrell (2016) T. Berisa and J. K. Pickrell. Approximately independent linkage disequilibrium blocks in human populations. Bioinformatics, 32(2):283–285, 2016.
- Bis et al. (2020) J. C. Bis, X. Jian, B. W. Kunkle, Y. Chen, K. L. Hamilton-Nelson, W. S. Bush, W. J. Salerno, D. Lancour, Y. Ma, A. E. Renton, et al. Whole exome sequencing study identifies novel rare and common Alzheimer’s-Associated variants involved in immune response and transcriptional regulation. Molecular psychiatry, 25:1859–1875, 2020.
- Boyd and Vandenberghe (2004) S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
- Candès et al. (2018) E. Candès, Y. Fan, L. Janson, and J. Lv. Panning for Gold: ‘Model-X’ Knockoffs for High Dimensional Controlled Variable Selection. Journal of the Royal Statistical Society Series B: Statistical Methodology, 80(3):551–577, 2018.
- Chen et al. (2013) C.-Y. Chen, S. Pollack, D. J. Hunter, J. N. Hirschhorn, P. Kraft, and A. L. Price. Improved ancestry inference using weights from external reference panels. Bioinformatics, 29(11):1399–1406, 2013.
- Chu et al. (2023) B. B. Chu, J. Gu, Z. Chen, T. Morrison, E. Candès, Z. He, and C. Sabatti. Second-order group knockoffs with applications to GWAS. arXiv preprint arXiv:2310.15069, 2023.
- Dai and Barber (2016) R. Dai and R. Barber. The knockoff filter for FDR control in group-sparse and multitask regression. In Proceedings of The 33rd International Conference on Machine Learning, volume 48, pages 1851–1859. PMLR, 2016.
- Dicker (2014) L. H. Dicker. Variance estimation in high-dimensional linear models. Biometrika, 101(2):269–284, 2014.
- Gazal et al. (2022) S. Gazal, O. Weissbrod, F. Hormozdiari, K. K. Dey, J. Nasser, K. A. Jagadeesh, D. J. Weiner, H. Shi, C. P. Fulco, L. J. O’Connor, et al. Combining SNP-to-gene linking strategies to identify disease genes and assess disease omnigenicity. Nature Genetics, 54:827–836, 2022.
- Gimenez and Zou (2019) J. R. Gimenez and J. Zou. Improving the Stability of the Knockoff Procedure: Multiple Simultaneous Knockoffs and Entropy Maximization. In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, volume 89, pages 2184–2192. PMLR, 2019.
- He et al. (2021) Z. He, L. Liu, C. Wang, Y. Le Guen, J. Lee, S. Gogarten, F. Lu, S. Montgomery, H. Tang, E. K. Silverman, et al. Identification of putative causal loci in whole-genome sequencing data via knockoff statistics. Nature Communications, 12:3152, 2021.
- He et al. (2022) Z. He, L. Liu, M. E. Belloy, Y. Le Guen, A. Sossin, X. Liu, X. Qi, S. Ma, P. K. Gyawali, T. Wyss-Coray, et al. Ghostknockoff inference empowers identification of putative causal variants in genome-wide association studies. Nature Communications, 13:7209, 2022.
- He et al. (2023) Z. He et al. In silico identification of putative causal genetic variants. 2023.
- Huang et al. (2017) K.-l. Huang, E. Marcora, A. A. Pimenova, A. F. Di Narzo, M. Kapoor, S. C. Jin, O. Harari, S. Bertelsen, B. P. Fairfax, J. Czajkowski, et al. A common haplotype lowers PU.1 expression in myeloid cells and delays onset of Alzheimer’s disease. Nature Neuroscience, 20:1052–1061, 2017.
- Jansen et al. (2019) I. E. Jansen, J. E. Savage, K. Watanabe, J. Bryois, D. M. Williams, S. Steinberg, J. Sealock, I. K. Karlsson, S. Hägg, L. Athanasiu, et al. Genome-wide meta-analysis identifies new loci and functional pathways influencing Alzheimer’s disease risk. Nature Genetics, 51:404–413, 2019.
- Kunkle et al. (2019) B. W. Kunkle, B. Grenier-Boley, R. Sims, J. C. Bis, V. Damotte, A. C. Naj, A. Boland, M. Vronskaya, S. J. Van Der Lee, A. Amlie-Wolf, et al. Genetic meta-analysis of diagnosed Alzheimer’s disease identifies new risk loci and implicates A, tau, immunity and lipid processing. Nature Genetics, 51:414–430, 2019.
- Le Guen et al. (2021) Y. Le Guen, M. E. Belloy, V. Napolioni, S. J. Eger, G. Kennedy, R. Tao, Z. He, and M. D. Greicius. A novel age-informed approach for genetic association analysis in Alzheimer’s disease. Alzheimer’s Research & Therapy, 13:72, 2021.
- Leung et al. (2019) Y. Y. Leung, O. Valladares, Y.-F. Chou, H.-J. Lin, A. B. Kuzma, L. Cantwell, L. Qu, P. Gangadharan, W. J. Salerno, G. D. Schellenberg, et al. VCPA: genomic variant calling pipeline and data management tool for Alzheimer’s Disease Sequencing Project. Bioinformatics, 35(10):1768–1770, 2019.
- Li and Candès (2021) S. Li and E. J. Candès. Deploying the Conditional Randomization Test in High Multiplicity Problems. arXiv preprint arXiv:2110.02422, 2021.
- Mak et al. (2017) T. S. H. Mak, R. M. Porsch, S. W. Choi, X. Zhou, and P. C. Sham. Polygenic scores via penalized regression on summary statistics. Genetic Epidemiology, 41:469–480, 2017.
- Pasaniuc and Price (2017) B. Pasaniuc and A. L. Price. Dissecting the genetics of complex traits using summary association statistics. Nature Reviews Genetics, 18:117–127, 2017.
- Qian et al. (2020) J. Qian, Y. Tanigawa, W. Du, M. Aguirre, C. Chang, R. Tibshirani, M. A. Rivas, and T. Hastie. A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK Biobank. PLoS Genetics, 16(10):e1009141, 2020.
- Schäfer and Strimmer (2005) J. Schäfer and K. Strimmer. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology, 4:32, 2005.
- Schwartzentruber et al. (2021) J. Schwartzentruber, S. Cooper, J. Z. Liu, I. Barrio-Hernandez, E. Bello, N. Kumasaka, A. M. Young, R. J. Franklin, T. Johnson, K. Estrada, et al. Genome-wide meta-analysis, fine-mapping and integrative prioritization implicate new Alzheimer’s disease risk genes. Nature Genetics, 53:392–402, 2021.
- Serrano-Pozo et al. (2021) A. Serrano-Pozo, S. Das, and B. T. Hyman. APOE and Alzheimer’s disease: advances in genetics, pathophysiology, and therapeutic approaches. The Lancet Neurology, 20(1):68–80, 2021.
- Sesia et al. (2021) M. Sesia, S. Bates, E. Candès, J. Marchini, and C. Sabatti. False discovery rate control in genome-wide association studies with population structure. Proceedings of the National Academy of Sciences, 118(40):e2105841118, 2021.
- Spector and Janson (2022) A. Spector and L. Janson. Powerful knockoffs via minimizing reconstructability. The Annals of Statistics, 50(1):252–276, 2022.
- The 1000 Genomes Project Consortium (2015) The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature, 526:68–74, 2015.
- Tian et al. (2018) X. Tian, J. R. Loftus, and J. E. Taylor. Selective inference with unknown variance via the square-root lasso. Biometrika, 105(4):755–768, 2018.
- Tibshirani et al. (2012) R. Tibshirani, J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor, and R. J. Tibshirani. Strong Rules for Discarding Predictors in Lasso-Type Problems. Journal of the Royal Statistical Society Series B: Statistical Methodology, 74(2):245–266, 2012.
- Wang et al. (2020) G. Wang, A. Sarkar, P. Carbonetto, and M. Stephens. A Simple New Approach to Variable Selection in Regression, with Application to Genetic Fine Mapping. Journal of the Royal Statistical Society Series B: Statistical Methodology, 82(5):1273–1300, 2020.
- Wang and Janson (2021) W. Wang and L. Janson. A high-dimensional power analysis of the conditional randomization test and knockoffs. Biometrika, 109(3):631–645, 2021.
- Weinstein et al. (2020) A. Weinstein, W. J. Su, M. Bogdan, R. F. Barber, and E. J. Candès. A Power Analysis for Model-X Knockoffs with -Regularized Statistics. arXiv preprint arXiv:2007.15346, 2020.
- Willer et al. (2010) C. J. Willer, Y. Li, and G. R. Abecasis. METAL: fast and efficient meta-analysis of genomewide association scans. Bioinformatics, 26(17):2190–2191, 2010.
- Witten and Tibshirani (2009) D. M. Witten and R. Tibshirani. Covariance-regularized regression and classification for high dimensional problems. Journal of the Royal Statistical Society Series B: Statistical Methodology, 71(3):615–636, 2009.
- Zhang et al. (2021) Q. Zhang, F. Privé, B. Vilhjálmsson, and D. Speed. Improved genetic prediction of complex traits from individual-level data or summary statistics. Nature Communications, 12:4192, 2021.
- Zou et al. (2022) Y. Zou, P. Carbonetto, G. Wang, and M. Stephens. Fine-mapping from summary data with the “Sum of Single Effects” model. PLoS Genetics, 18(7):e1010299, 2022.
Appendix A Computation of free parameters
In this paper, we use the semidefinite program (SDP) construction of second-order knockoffs Candès et al. [2018]. Without loss of generality, we assume that columns of the data matrix have been standardized with mean 0 and variance 1 such that diagonal entries are 1. As a result, is the solution of the convex optimization problem.
minimize | (15) | |||
subject to | ||||
Appendix B Equivalence of GhostKnockoffs and the Gaussian knockoff sampler in sampling the knockoff -score
In this section, we summarize the proof of He et al. [2022] that computed by (6) satisfies (5) as follows.
Lemma 2.
Proof.
By step 5 of Algorithm 1, we have , where is an by matrix with i.i.d. standard Gaussian entries, independent of . Therefore,
Because , we have
Thus, we have
∎
Appendix C Proof of Proposition 1
Lemma 3.
Let and be two real by matrices. For any and , if , there must exists an orthogonal matrix such that .
Proof.
Suppose , where is an orthogonal matrix such that , is diagonal with positive entries and is the rank of . In other words, we perform eigen-decomposition of and remove all zero eigenvalues and their corresponding eigenvectors. Note that is a projection matrix that projects any vector onto , the column space of .
It is clear that
Because , we also have
As a result, we have .
Thus, for , is a projection matrix that projects any vector onto the column space of . Because , we have
Let , we have and
Thus, we have
Because , there exist such that and are both orthogonal matrices. Thus, we have
Substituting in , we have
and thus
Thus, these exists an orthogonal matrix such that .
∎
We can then prove Proposition 1 as follows. By Lemma 3, since , we know that for some orthogonal matrix .
Let be a matrix with i.i.d. standard Gaussian entries, we have is also a matrix with i.i.d. standard Gaussian entries (i.e. ) and
By the construction of , we have that , and . Therefore, we focus on the third, fifth and sixth arguments of where
Hence,
Appendix D Construction of via eigen-decomposition
In this section, we give details on how to construct such that using eigen-decomposition,
with .
We consider two cases as follows.
Case 1 (): Since , we have and
where , and . Under this case, we let such that is satisfied.
Case 2 (): Under this case, we let
such that
is satisfied.
Appendix E Computation of the tuning parameter for the lasso-min method
Suppose we had access to individual level data such that Gaussian knockoffs can be constructed, we can follow the method of Dicker [2014] to estimate the noise level by
(16) |
We could then compute , where is a data matrix whose rows are i.i.d. samples from , and is independent of . In the summary statistics setting, we replace in (16) by , where and are obtained in Algorithm 4.
The expectation can be computed using Monte Carlo integration. However, when both and are very large, sampling and becomes too time-consuming. Observing that
where and are independent, we have
By Stirling’s formula that
we have
Therefore, we may approximate
where is the Cholesky decomposition of and .
In practice, the simulated usually concentrates around its mean as shown in Figure 7. Thus, only several Monte Carlo samples are needed to accurately estimate , and we draw Monte Carlo samples throughout numerical experiments of this paper.
Next, we prove that Algorithm 4 maintains FDR control when is computed as described in Section 4.2.2. By Proposition 2, it suffices to show that (11) with the computed produces feature importance statistics that satisfy the flip sign property. By , it suffices to show that is invariant to swapping variables with their knockoffs [Candès et al., 2018].
Let be the permutation matrix that swaps the -th column with the -th column of a matrix. Thus, we have . This leads to
suggesting is invariant to swapping variables with their knockoffs [Candès et al., 2018]. Therefore, the FDR of Algorithm 4 is controlled when is computed as described in Section 4.2.2.
Since all variables in are null, in practice we may replace by , by and by in (16) to reduce the dimension when estimating . Although this would, in theory, break the flip-sign property required for FDR control, no FDR inflation is observed in our simulations.
Appendix F Connection with the scout procedure
In this section, we explain the connection of the feature importance statsitic defined in Algorithm 4 and the scout procedure [Witten and Tibshirani, 2009].
For covariates and response , Witten and Tibshirani [2009] assume that . The population linear regression coefficient of on , which induces a linear predictor that achieves the minimal mean squared prediction error, is given by , where is the precision matrix. Let be the empirical covariance matrix of and , they consider the following covariance-regularized regression approach to estimate ,
-
1.
Compute to maximize
-
2.
Compute to maximize subject to obtained from Step 1.
-
3.
Compute .
-
4.
Compute where is the regression coefficient of onto .
Here, and are two penalty functions. The first two steps are to appropriately separate true conditional correlations from those purely due to noise. As shown in Witten and Tibshirani [2009], when (resp. ), the solution to step 3 is proportional to the solution of
where is the inverse of the solution from step 1. In other words, the Lasso corresponds to the setting that and . Witten and Tibshirani [2009] consider various settings in which they demonstrate the superiority of the scout procedure over the Lasso, Ridge and Elastic Net. In the setting of Section 4, we have . Therefore, the objective function (11) corresponds to the case that the true is used in step 1 (here we include both and as explanatory variables).
Appendix G Construction of group knockoffs and examples of importance scores at the group level
For group knockoffs, we test the group conditional independence hypothesis:
where denotes a group and is the vector of features in group .
In addition to the conditional independence (2), group knockoffs must satisfy the group exchangeability condition that
No exchangeability property is required for features within the same group, which allows greater flexibility in the construction knockoffs.
When , the group exchangeability condition allows the diagonal matrix described in Sections 2 and 4 becomes a block-diagonal matrix , where is a symmetric matrix whose dimension equals the number of variables in group (). With the block-diagonal matrix obtained following the SDP construction of Chu et al. [2023] in step 3, Algorithm 1 can construct valid group knockoffs of with respect to feature groups. Analogously, Algorithms 2-4 can also be modified correspondingly to perform inference of ’s.
Although it is conceptually straightforward to modify from a diagonal to a block-diagonal matrix, note that doing so introduces significantly more optimization variables. To reduce computational burden in practice, we exploit a form of conditional independence across groups, described in section 4 of [Chu et al., 2023]. The main idea is to select a few key variables in each group that capture most within-group variations, and perform a reduced optimization problem only on the key variables. In the real data analysis result, we defined groups via average-linkage hierarchical clustering with correlation cutoff , selected representatives within groups via Algorithm A1 of Chu et al. [2023] with , and replaced objective (15) by the maximum entropy (ME) objective, which has improved power over SDP constructions in simulations.
In this paper, we use multi-knockoffs. To define the importance score for group and its knockoffs, we sum the effect for variants in each group. With knockoff copies, we explicitly compute and for , where is the estimated effect sizes from step 4 of Algorithm 4. One may use other choices of feature importance such as the norm. The group-wise Lasso coefficient difference is then defined as
and groups with are selected, where is calculated from the multiple knockoff filter [Gimenez and Zou, 2019]. Note that is the feature importance statistic first introduced in He et al. [2021].
Appendix H Ghostknockoffs for CRT (GhostCRT)
Let and be the covariate matrix and the response vector respectively. Recall that in the conditional randomization test, to test , Candès et al. [2018] draw i.i.d. samples ( is the -th column of the covariate matrix ) and compute the CRT -value as
(17) |
for some feature importance function .
Under the assumption that rows of are i.i.d. samples from , we can generate by
(18) |
where , , and are independent of everything else. Utilizing the analogy between (4) and (18), we develop the GhostCRT with counterparts of Algorithms 2-3 as follows, while the counterpart Algorithm 4 is derived in the similar way.
Appendix I Additional results for Section 4.4.2
To further demonstrate the effect of sample size on the new GhostKnockoffs methods in comparison to the individual level knockoffs with (cross-validated) Lasso coefficient difference, we consider additional experiments with and under the same setting of Section 4.4.2. Note that the noise level scales in the order of such that the signal to noise ratio does not change dramatically. From Figure 8, we observe that as increases, all three new methods proposed in this paper have comparable power with KF-lassocv and outperform GK-marginal [He et al., 2022] consistently, with FDR controlled in all cases.
Appendix J Supplementary plots for Section 4.4.1
Let and be defined as in Algorithm 4. Note that for Algorithm 4 to control the FDR, it suffices to require that swapping the th entry of with the th entry of does not change the joint distribution of for each null .
By the Central Limit Theorem, short sets of entries (e.g. single entries, pairs, and triples etc.) of are approximately Gaussian. Additionally, in Figure 11, we show empirically that the covariance of (approximately) satisfies the required swap-invariance for null positions. These approximations, coupled with the robustness of the knockoff framework, empirically yield the FDR control. This is similar to the robustness of second-order knockoffs observed empirically in Candès et al. [2018].
In the setting from Section 4.4.1, Figure 9 depicts the ordered empirical values of (respectively ) plotted against an equal-size ordered random sample from a Gaussian distribution with matching mean and variance as the empirical mean and variance of (respectively ), for three randomly selected indices. This comparison is based on the 1000 sub-sampled data replications from Section 4.4.1. In Figure 10, we overlay the plots for all indices. We observe that and approximately follow Gaussian distributions. In Figure 11, we present the scatter plots of relevant empirical covariances. We observe that all the points roughly concentrate around the line. This shows the approximate swap-invariance of and (for null indices).
Appendix K Details of the nine studies in Section 5
Section 5 considers the following nine studies for Alzheimer’s disease:
-
1.
The genome-wide association study performed by Huang et al. [2017].
-
2.
The genome-wide meta-analysis of clinically diagnosed AD and AD-by-proxy by Jansen et al. [2019].
-
3.
The genome-wide meta-analysis of clinically diagnosed AD by Kunkle et al. [2019].
-
4.
The genome-wide meta-analysis by Schwartzentruber et al. [2021].
-
5.
In-house genome-wide associations study of 15,209 cases and 14,452 controls aggregating 27 cohorts across 39 SNP array data sets, imputed using the TOPMed reference panels [Belloy et al., 2022a].
-
6.
A whole-exome sequencing analyse of data from ADSP by Bis et al. [2020].
-
7.
A whole-exome sequencing analyse of data from ADSP by Le Guen et al. [2021].
-
8.
In-house whole-exome sequencing analysis of ADSP (6155 cases, 5418 controls).
-
9.
In-house whole-genome sequencing analysis of the 2021 ADSP release (3584 cases, 2949 controls) [Belloy et al., 2022b].
Appendix L Calculation of meta-analysis Z-score
Based on -scores from studies, we adopt the definition in He et al. [2022] that meta-analysis -score with overlapping samples is
Specifically,
-
•
optimal weights are obtained by solving the optimization problem
minimize -
•
for , is a diagonal matrix where if -score of the -th variant is observed in the -th study and otherwise ();
-
•
is a diagonal matrix where ();
-
•
is the study correlation between the -th study and the -th study.
In practice, when calculating , we only use variants whose -scores are bounded in in both the -th study and the -th study to eliminate the impact of polygenic effects. This meta-analysis approach is a generalization of the METAL method proposed by Willer et al. [2010].
Appendix M Obtaining the covariance matrix in meta-analysis for AD
To perform meta-analysis for AD, we need a suitable estimate of the covariance matrix . In this paper, we adopt strategies in He et al. [2023] and Chu et al. [2023] as follows.
We first download the covariance matrix from the Pan-UKB consortium (https://pan.ukbb.broadinstitute.org), which contains about million variants across the human genome derived from about British samples. We then extract variants which satisfy the following three conditions: (a) the variant is recorded in the UK Biobank genotype array, (b) its MAF exceeds 0.01, (c) its reference/alternate allele pair matches with the ones listed in all the nine studies in meta-analysis. Based on the covariance matrix of extracted variants, we further partition extracted variants into 1703 quasi-independent blocks using the partition given by Berisa and Pickrell [2016]. Finally, we compute the block-diagonal covariance matrix
where is the shrinkage estimator of the covariance matrix of variants in the -th block using the R package corpcor [Schäfer and Strimmer, 2005]. To ensure that all blocks are positive definite, we perform eigen-decomposition and increase all their eigenvalues not larger than to .
Appendix N Supplementary tables of meta-analysis for AD
Tables 1 and 2 provide more details of the meta-analysis for AD in Section 5. Specifically, Table 1 presents the number of loci, average signals per locus, standard deviation of the number of signals per locus, average groups per locus, and standard deviation of the number of groups per locus identified by conventional marginal association test, GK-marginal, GK-pseudolasso, and GK-susie-rss. Here, the -value threshold of the conventional marginal association test is , and GK-pseudolasso uses the tuning parameter chosen by the lasso-min method in Section 4.2.2. For GK-marginal, GK-pseudolasso, and GK-susie-rss, we display results with respect to target FDR levels 0.05, 0.1, and 0.2. Table 2 provides details of top variants of identified loci given by GK-pseudolasso (target FDR level: 0.20), including their positions (columns “Chr.” and “SNP”), their reference alleles (column “Ref.”) and alternative alleles (column “Alt.”), genes that they potentially regulate (column “TopS2GGene”), their closest genes, their -scores from different individual studies, their meta-analysis -scores, their feature importance scores (), and the marginal -values obtained from meta-analysis -scores.
Method | Target | Number of | Average signals | SD of signals | Average groups | SD of groups |
---|---|---|---|---|---|---|
FDR level | identified loci | per locus | per locus | per locus | per locus | |
Marginal association test | 29 | 15.517 | 32.231 | 1.000 | 0.000 | |
GK-marginal | 0.05 | 3 | 107.667 | 97.027 | 3.667 | 3.055 |
0.10 | 10 | 95.600 | 214.568 | 2.700 | 3.622 | |
0.20 | 17 | 76.176 | 174.714 | 2.412 | 3.318 | |
GK-pseudolasso | 0.05 | 30 | 21.500 | 44.323 | 2.333 | 4.722 |
0.10 | 42 | 17.214 | 38.019 | 2.024 | 4.015 | |
0.20 | 63 | 15.889 | 47.074 | 1.794 | 3.561 | |
GK-susie-rss | 0.05 | 21 | 14.619 | 27.902 | 1.286 | 0.644 |
0.10 | 35 | 12.000 | 23.506 | 1.257 | 0.657 | |
0.20 | 47 | 10.191 | 20.591 | 1.319 | 0.695 |
Chr. | SNP | Ref. | Alt. | TopS2GGene | Closest gene | -scores from different individual studies | Meta-analysis | Marginal | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Study 1 | Study 2 | Study 3 | Study 4 | Study 5 | Study 6 | Study 7 | Study 8 | Study 9 | -scores | -values | |||||||
1 | 20853688 | C | T | EIF4G3 | EIF4G3 | 2.72 | 3.93 | 3.29 | 4.29 | 2.89 | – | – | 0.68 | 1.65 | 4.95 | 2.516 | 3.697 |
1 | 200984367 | A | G | KIF21B | KIF21B | -2.21 | -3.73 | -4.33 | -3.60 | -2.92 | – | – | – | -0.67 | -4.36 | 1.837 | 6.490 |
1 | 207611623 | A | G | CR1 | CR1 | -4.84 | -8.81 | -7.97 | -9.65 | -6.37 | – | – | – | -3.96 | -10.97 | 1.228 | 2.802 |
2 | 37270395 | G | A | – | NDUFAF7 | 1.50 | 4.02 | 3.73 | 4.03 | 2.05 | – | – | – | – | 4.62 | 1.998 | 1.953 |
2 | 44026309 | T | C | – | LRPPRC | -1.23 | -3.80 | -1.88 | -3.55 | -2.26 | – | – | – | -3.64 | -4.37 | 2.089 | 6.208 |
2 | 65409567 | G | A | – | SPRED2 | – | -3.93 | -2.41 | -4.05 | -0.23 | – | – | – | 0.15 | -4.44 | 1.993 | 4.538 |
2 | 105805908 | T | C | – | NCK2 | 0.10 | -3.94 | -2.80 | -4.67 | -2.08 | – | – | – | – | -4.72 | 2.490 | 1.185 |
2 | 127136908 | A | T | BIN1 | BIN1 | 3.77 | 10.94 | 8.68 | 11.95 | 8.74 | – | – | – | 4.90 | 13.36 | 1.141 | 5.466 |
2 | 233117202 | G | C | NGEF | INPP5D | 2.03 | 6.15 | 5.16 | 6.42 | 2.29 | – | – | – | 2.40 | 7.21 | 4.762 | 2.826 |
3 | 136105288 | G | A | SLC35G2 | PPP2R3A | -1.24 | -3.66 | -2.03 | -4.64 | -1.98 | – | – | – | -3.04 | -4.84 | 2.250 | 6.607 |
4 | 11024404 | A | G | – | CLNK | -2.47 | -6.00 | -4.06 | -6.50 | -4.32 | – | – | – | -2.52 | -7.32 | 5.039 | 1.275 |
4 | 71303158 | G | A | – | SLC4A4 | -2.60 | -3.77 | -3.65 | -3.34 | -1.49 | – | – | – | -1.53 | -4.28 | 1.817 | 9.466 |
4 | 112082387 | A | C | – | FAM241A | 2.47 | 4.82 | 1.76 | 3.36 | 0.40 | – | – | – | -0.71 | 4.68 | 2.478 | 1.418 |
4 | 143428212 | C | T | – | GAB1 | – | -3.68 | -2.84 | -3.95 | -1.56 | – | – | – | -1.36 | -4.37 | 2.017 | 6.081 |
4 | 158808801 | G | A | RAPGEF2 | FNIP2 | -2.43 | -3.95 | -2.39 | -3.82 | -2.40 | – | – | – | -1.91 | -4.66 | 1.894 | 1.553 |
5 | 4068226 | C | T | – | IRX1 | – | 4.53 | 1.20 | 3.62 | 0.34 | – | – | – | -0.91 | 4.50 | 2.144 | 3.323 |
5 | 14707491 | C | T | ANKH | ANKH | -3.92 | -3.20 | -3.95 | -4.36 | -3.60 | – | – | – | 0.60 | -4.66 | 2.480 | 1.602 |
5 | 86923485 | A | G | – | LINC02059 | 2.49 | 4.70 | 2.45 | 3.76 | 3.49 | – | – | – | 2.49 | 5.12 | 3.059 | 1.517 |
5 | 177559423 | G | A | RAB24 | FAM193B | 1.96 | 3.85 | 3.99 | 4.16 | 2.48 | – | – | – | 1.38 | 4.71 | 2.313 | 1.248 |
5 | 179373099 | C | T | – | ADAMTS2 | 1.52 | 2.54 | 4.29 | 4.80 | 3.12 | – | – | – | 2.35 | 4.36 | 1.938 | 6.512 |
6 | 935171 | T | C | – | LINC01622 | -2.80 | -3.20 | -3.33 | -4.55 | -3.37 | – | – | – | -2.17 | -4.75 | 2.380 | 1.040 |
6 | 32686937 | T | C | HLA-DQA2 | HLA-DQB1 | -3.88 | -6.46 | -4.86 | -7.53 | -2.29 | – | – | – | -1.10 | -8.13 | 4.461 | 2.090 |
6 | 41066261 | G | C | OARD1 | OARD1 | 2.69 | 3.78 | 6.91 | 7.12 | 4.06 | – | – | – | – | 6.37 | 5.558 | 9.364 |
6 | 47484147 | C | T | CD2AP | CD2AP | 2.95 | 5.74 | 5.21 | 6.10 | 5.33 | – | – | – | 2.24 | 7.05 | 3.271 | 8.942 |
7 | 1543652 | A | G | TMEM184A | MAFK | 2.33 | 4.06 | 2.93 | 3.64 | 2.36 | – | – | – | 0.33 | 4.54 | 1.810 | 2.868 |
7 | 37842715 | G | A | – | NME8 | 2.95 | 4.15 | 3.81 | 3.74 | 3.20 | – | – | – | 1.13 | 4.79 | 2.045 | 8.230 |
7 | 100406823 | C | T | – | ZCWPW1 | 4.25 | 7.53 | 4.01 | 8.41 | 5.04 | – | – | 3.59 | 1.29 | 9.35 | 8.987 | 4.266 |
7 | 143410495 | G | T | EPHA1-AS1 | EPHA1 | 1.19 | 6.56 | 4.37 | 6.81 | 2.70 | – | – | – | 1.63 | 7.52 | 3.795 | 2.751 |
8 | 27362470 | C | T | PTK2B | PTK2B | 3.84 | 6.79 | 6.12 | 7.94 | 5.19 | – | – | – | 2.12 | 8.70 | 4.345 | 1.668 |
8 | 95041772 | C | T | – | NDUFAF6 | 4.06 | 3.96 | 4.03 | 4.50 | 2.81 | – | – | – | 0.36 | 5.17 | 2.207 | 1.172 |
8 | 97359646 | A | G | – | SNORD3H | 2.70 | 3.01 | 3.70 | 3.99 | 2.42 | – | – | – | 1.30 | 4.25 | 1.767 | 1.067 |
8 | 102564430 | G | A | – | ODF1 | 1.72 | 4.00 | 2.66 | 3.53 | 1.42 | – | – | – | -0.48 | 4.29 | 1.855 | 8.825 |
8 | 111515902 | C | T | – | LINC02237 | – | 4.13 | 0.44 | 3.77 | -0.41 | – | – | – | – | 4.40 | 2.051 | 5.387 |
8 | 144042819 | T | C | PARP10 | SPATC1 | 0.17 | 4.66 | 2.47 | 4.57 | 3.68 | – | – | – | – | 5.16 | 3.389 | 1.210 |
10 | 29966853 | G | A | – | JCAD | – | 3.72 | 2.05 | 4.56 | 1.31 | – | – | – | 0.48 | 4.68 | 2.501 | 1.443 |
10 | 42722997 | T | C | – | LOC283028 | 0.39 | 4.79 | 2.57 | 4.34 | 1.14 | – | – | – | 0.25 | 5.02 | 2.128 | 2.616 |
10 | 59962515 | T | G | – | LINC01553 | 1.43 | 3.63 | 3.30 | 5.14 | 3.48 | – | – | – | 3.17 | 5.18 | 2.031 | 1.130 |
10 | 80494228 | C | T | TSPAN14 | TSPAN14 | 3.23 | 3.22 | 4.17 | 5.83 | 2.03 | – | – | – | – | 5.35 | 2.041 | 4.295 |
11 | 60254475 | G | A | – | MS4A4E | -5.74 | -7.97 | -8.27 | -9.09 | -6.66 | – | – | – | -3.32 | -10.30 | 8.499 | 3.570 |
11 | 65888811 | G | A | FIBP | FIBP | -2.13 | -4.59 | -1.22 | -3.57 | -1.62 | – | – | – | -0.38 | -4.74 | 2.589 | 1.070 |
11 | 86156833 | A | G | PICALM | PICALM | 6.78 | 8.67 | 8.07 | 10.55 | 5.11 | – | – | – | 3.08 | 11.50 | 1.074 | 6.418 |
11 | 121578263 | T | C | – | SORL1 | -3.10 | -4.40 | -3.82 | -5.59 | -3.38 | – | – | – | -0.52 | -5.90 | 3.920 | 1.768 |
13 | 43679792 | C | T | – | ENOX1 | 0.19 | 3.79 | 1.22 | 4.28 | 0.01 | – | – | – | -1.03 | 4.30 | 1.865 | 8.441 |
13 | 93594511 | A | T | – | GPC6-AS2 | – | -0.04 | -1.09 | -0.85 | -2.34 | – | – | – | -0.57 | -0.62 | 7.282 | 2.672 |
14 | 32478306 | T | C | AKAP6 | AKAP6 | -1.45 | -4.35 | -1.77 | -3.63 | -0.28 | – | – | – | 0.77 | -4.44 | 1.869 | 4.449 |
14 | 52924962 | A | G | – | FERMT2 | 4.68 | 4.58 | 4.97 | 6.27 | 2.90 | – | – | – | 1.32 | 6.58 | 4.682 | 2.429 |
14 | 92470949 | C | T | – | SLC24A4 | -3.83 | -6.10 | -5.16 | -6.67 | -2.90 | – | – | – | -2.58 | -7.57 | 4.647 | 1.836 |
15 | 50735410 | C | T | HDC | SPPL2A | -3.16 | -4.81 | -4.09 | -6.02 | -2.45 | – | – | – | 0.09 | -6.29 | 5.133 | 1.547 |
15 | 58753575 | A | G | – | ADAM10 | -2.86 | -5.90 | -4.16 | -5.97 | -2.81 | – | – | – | -2.16 | -6.94 | 3.385 | 1.910 |
15 | 63277703 | C | T | APH1B | APH1B | 1.20 | 5.52 | 3.68 | 5.72 | 2.58 | 2.46 | 1.61 | 0.98 | 2.05 | 6.45 | 3.285 | 5.482 |
16 | 31120929 | A | G | KAT8 | KAT8 | -2.28 | -5.50 | -2.72 | -5.84 | -2.89 | – | – | – | -1.45 | -6.56 | 3.913 | 2.702 |
17 | 5233752 | G | A | SCIMP | SCIMP | 3.30 | 6.04 | 3.82 | 5.48 | 1.93 | – | – | – | 2.40 | 6.79 | 3.297 | 5.560 |
17 | 7581494 | G | A | CD68 | LOC100996842 | -1.82 | -3.60 | -1.57 | -3.49 | -3.37 | -1.95 | -1.61 | -2.72 | -3.18 | -4.42 | 1.933 | 4.941 |
17 | 49219935 | T | C | ABI3 | ABI3 | – | -4.94 | – | – | -4.75 | -2.68 | 0.20 | – | -2.61 | -5.25 | 2.982 | 7.430 |
17 | 58331728 | G | C | BZRAP1 | MIR142 | -1.00 | -4.94 | -5.09 | -5.12 | -3.81 | – | – | – | -1.35 | -5.75 | 3.909 | 4.412 |
17 | 63482562 | C | T | ACE | ACE | 2.73 | 5.07 | 3.54 | 5.25 | 3.92 | 1.93 | 2.67 | 2.09 | 2.45 | 6.32 | 5.299 | 1.268 |
19 | 1058177 | A | G | – | ABCA7 | -0.93 | -4.61 | -2.73 | -4.94 | -3.96 | -1.16 | -1.48 | -0.38 | 0.52 | -5.45 | 4.973 | 2.534 |
19 | 6876985 | T | C | VAV1 | ADGRE1 | 1.05 | 3.04 | 3.58 | 4.42 | 1.59 | – | – | – | 0.42 | 4.21 | 2.119 | 1.254 |
19 | 44888997 | C | T | PVRL2 | NECTIN2 | 20.83 | 51.85 | – | – | – | – | – | – | – | 53.66 | 8.573 | 0.000 |
19 | 51224706 | C | A | CD33 | CD33 | -3.40 | -5.84 | -5.09 | -5.69 | -3.76 | – | – | – | -3.97 | -6.x96 | 4.936 | 1.696 |
19 | 54664811 | A | G | LILRB4 | LILRB4 | -2.61 | -3.61 | -3.13 | -3.89 | -1.05 | – | – | – | 0.54 | -4.37 | 1.958 | 6.300 |
20 | 56409712 | G | T | CASS4 | CASS4 | -3.82 | -5.84 | -4.56 | -6.07 | -5.14 | – | – | – | – | -7.12 | 6.582 | 5.526 |
21 | 26775872 | C | T | ADAMTS1 | ADAMTS1 | -1.60 | -2.90 | -5.17 | -5.54 | -3.39 | – | – | – | -0.22 | -4.87 | 2.469 | 5.668 |
Appendix O Supplementary figures of meta-analysis for AD
Analogous to Figure 6 in Section 5, Figures 12, 13 and 14 respectively present Manhattan plots of the meta-analysis of the nine studies via conventional marginal association test (with -value cutoff ), GK-marginal (with target FDR level 0.10), and GK-susie-rss (with target FDR level 0.10).
The conventional marginal association test selects many feature groups because it focuses on marginal correlations between feature groups and the response while ignoring spurious correlation induced by linkage disequilibrium. This is shown in Figure 12, where the conventional marginal association test tends to select many nearby loci. This issue is alleviated by the GhostKnockoffs approach that tests conditional independence as seen in Figures 6, 13, and 14.
Appendix P Running Lasso on binary responses
In genetic datasets, the response is often binary. Performing Lasso or Lasso-type regressions on binary response may sound unreasonable since it violates the usual linear model assumption. One might assume that utilizing penalized logistic regression to generate feature importance statistics would be much more effective. However, a bit surprisingly, we demonstrate that this intuition may not be correct through the following two simulations.
For the first column of Figure 15, we generate , and, conditional on , and , where and . We create by uniformly randomly selecting 30 coordinates to be non-zero. The signs of these non-zero coordinates are assigned to be either positive or negative with equal probability. The dark curve represents the knockoffs procedure with Lasso coefficient difference statistic (with tuning parameter chosen by cross-validation), i.e., KF-lassocv. The red curve represents the knockoffs procedure with coefficient difference statistic generated by -penalized logistic regression. We vary the signal amplitudes such that we observe relatively complete power profiles below. The target FDR is . Each point on the curves represents an average over 200 replications. For the second column of Figure 15, we show the result for AR(1) features. Here, , and the signal amplitude (i.e., the magnitude of non-zero values) is fixed to be 0.5. Otherwise, the simulation setting is exactly the same as the independent case. We observe that the two methods considered have almost the same power and FDR, so the use of penalized logistic regression does not meaningfully affect the results.