Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Discovering Candidate Genes Regulated by GWAS Signals in Cis and Trans

Samhita Pal1 and Xinge Jessie Jeng1superscript1{}^{1^{*}}start_FLOATSUPERSCRIPT 1 start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_FLOATSUPERSCRIPT
1Department of Statistics, North Carolina State University
(*Corresponding author. Email: xjjeng@ncsu.edu
Contributing author: spal4@ncsu.edu)
Abstract

Understanding the genetic underpinnings of complex traits and diseases has been greatly advanced by genome-wide association studies (GWAS). However, a significant portion of trait heritability remains unexplained, known as “missing heritability”. Most GWAS loci reside in non-coding regions, posing challenges in understanding their functional impact. Integrating GWAS with functional genomic data, such as expression quantitative trait loci (eQTLs), can bridge this gap. This study introduces a novel approach to discover candidate genes regulated by GWAS signals in both cis and trans. Unlike existing eQTL studies that focus solely on cis-eQTLs or consider cis- and trans-QTLs separately, we utilize adaptive statistical metrics that can reflect both the strong, sparse effects of cis-eQTLs and the weak, dense effects of trans-eQTLs. Consequently, candidate genes regulated by the joint effects can be prioritized. We demonstrate the efficiency of our method through theoretical and numerical analyses and apply it to adipose eQTL data from the METabolic Syndrome in Men (METSIM) study, uncovering genes playing important roles in the regulatory networks influencing cardiometabolic traits. Our findings offer new insights into the genetic regulation of complex traits and present a practical framework for identifying key regulatory genes based on joint eQTL effects.

Keywords: Berk-Johns; Cardiometabolic traits; Core genes; eGene discovery; Higher criticism; Trans-eQTL

1 Introduction

A central goal of genetics is to understand how genetic variation, environment factors, and other sources of variations influence traits. Over the past decades, genome-wide association studies (GWASs), facilitated by high-throughput sequencing technologies, have revolutionized human genetics. However, GWAS have also left several outstanding questions: the identified GWAS loci have explained only a small percentage of trait heritability, known as the “missing heritability” problem; as sample sizes for GWAS continue to grow, the effect sizes of newly identified loci are mostly very weak; and the majority of GWAS loci are located in non-coding putatively regulatory regions of the genome, leaving it unclear which genes they regulate and how these genes function in pathways to impact the traits. These difficulties in interpreting GWAS results have hindered our efforts to understand disease etiologies and develop new therapies (Visscher et al., 2017; Cano-Gamez and Trynka, 2020).

To tackle these challenges, emerging studies integrate GWAS results with functional genomic data such as gene expression or chromatin activity profiles assayed across a range of cell types and tissues. Many large consortia, such as GTEx (Ardlie et al., 2015), have generated gene expression data for different tissues and cell lines together with genotype information. Such data have been extensively analyzed to study the regulatory effects of SNPs, i.e., expression quantitative trait loci (eQTLs), on gene expression across human tissues, and the results have proven informative for GWAS analysis. For example, transcriptome-wide association studies (TWAS) and colocalization analysis have been developed in recent years to identify genes causally involved in complex diseases based on GWAS and eQTL results. TWAS leverages information from eQTL catalogs to impute gene expression values and directly infer the association between traits and the imputed gene expression for the gene under investigation (Gusev et al., 2016; Gamazon et al., 2015). In contrast, colocalization analysis integrates association results from GWAS and eQTL mapping to identify instances where both traits share a causal variant. The rationale is that if a GWAS trait and a gene’s expression share the same associated SNP, it may suggest a regulatory (and thus putative causal) role of the SNP mediated through the gene on the GWAS trait (Cannon and Mohlke, 2018; Deng and Pan, 2020).

Integrative analyses provide important observations that for most complex traits, the most significant signals are located near genes that make functional sense, although those signals and genes contribute only a small fraction of the total heritability. In eQTL analysis, such nearby signals are called cis-acting eQTLs (cis-eQTL). Because each cis-eQTL influences only one phenotype (the local gene), they cannot elucidate the co-regulatory landscape of the transcriptome and thus the higher organization of gene regulation (Westra et al., 2013; Bryois et al., 2014; Smemo et al., 2014; Claussnitzer et al., 2015; Brynedal et al., 2017).

It has been observed that heritability is dominated by variation in a large number of non-coding common variants spread broadly across the genome and mediated through a wide range of gene functional categories. Such distant variants are called trans-acting eQTLs (trans-eQTLs) (Price et al., 2008; Zhu et al., 2016; Yao et al., 2017; Civelek et al., 2017). An emerging theory known as the omnigenic model suggests that complex traits or diseases are governed by possibly all genes expressed in relevant cells through network behaviors. Among these genes, there is a set of “core genes” directly involved in the biological pathways for each trait and a larger number of “peripheral genes” that can only affect the trait indirectly by connecting with these core genes via regulatory networks. The omnigenic model indicates that cis and trans effects of trait-associated variation are mediated by core genes, which are directly involved in the biological pathways for the trait. Moreover, it is discussed that for typical genes, the majority of expression heritability is caused by trans-eQTLs (Boyle et al., 2017; Liu et al., 2019).

However, for purely practical reasons, most eQTL mapping studies to date have concentrated on cis-eQTLs. This is because trans-eQTLs often have small regulatory effects in humans, making them notoriously difficult to identify. Analyses involving trans-eQTLs often require substantially large sample sizes, significant computational resources, and extremely stringent multiplicity adjustments.

In this paper, we propose a method to directly identify expression genes (eGenes) whose heritability is influenced by GWAS signals in both cis and trans, bypassing the need to identify weak trans-eQTLs individually. Our approach uses statistical metrics sensitive to both cis- and trans-acting signals – cis signals being relatively strong but sparse, and trans signals being relatively weak but dense. The eGenes prioritized by these metrics can offer valuable insights into the mechanisms underlying disease-related traits and may serve as candidates for core genes in omnigenic models.

The statistical metrics utilized in our method for eGene discovery have been studied in high-dimensional inference and discovered as being optimally adaptive for heterogeneous and heteroscedastic mixture model detection (Donoho and Jin, 2004; Jager and Wellner, 2007; Cai et al., 2011; Jeng et al., 2013, 2020). In the context of eQTL analysis, we investigate their adaptivity to both cis- and trans-acting signals, providing theoretical insights into their efficiency. Numerical analyses are conducted to highlight the advantages of our method in discovering three types of eGenes: those regulated exclusively by cis-eQTLs, those regulated solely by trans-eQTLs, and those influenced by both. We apply this new method to analyze adipose eQTL data from the METabolic Syndrome in Men (METSIM) study. The results provide valuable insights into the genes that play important roles in the regulatory networks for cardiometabolic traits.

2 Methodology

2.1 Formulation

Suppose there are m𝑚mitalic_m SNPs, which are identified through GWAS to be associated with at least one of the q𝑞qitalic_q disease-related traits under consideration. We consider all K𝐾Kitalic_K genes as potentially regulated by one or more of the m𝑚mitalic_m GWAS signals. We obtain the summary statistics Zjksubscript𝑍𝑗𝑘Z_{jk}italic_Z start_POSTSUBSCRIPT italic_j italic_k end_POSTSUBSCRIPT, 1jm1𝑗𝑚1\leq j\leq m1 ≤ italic_j ≤ italic_m, 1kK1𝑘𝐾1\leq k\leq K1 ≤ italic_k ≤ italic_K, from marginal association tests between the m𝑚mitalic_m SNPs and the K𝐾Kitalic_K genes. For a given gene k𝑘kitalic_k, assume that

ZjkFk,01{jIk,0}+Fk,11{jIk,1},1jm,formulae-sequencesimilar-tosubscript𝑍𝑗𝑘subscript𝐹𝑘01𝑗subscript𝐼𝑘0subscript𝐹𝑘11𝑗subscript𝐼𝑘11𝑗𝑚Z_{jk}\sim F_{k,0}\cdot 1\{j\in I_{k,0}\}+F_{k,1}\cdot 1\{j\in I_{k,1}\},% \qquad 1\leq j\leq m,italic_Z start_POSTSUBSCRIPT italic_j italic_k end_POSTSUBSCRIPT ∼ italic_F start_POSTSUBSCRIPT italic_k , 0 end_POSTSUBSCRIPT ⋅ 1 { italic_j ∈ italic_I start_POSTSUBSCRIPT italic_k , 0 end_POSTSUBSCRIPT } + italic_F start_POSTSUBSCRIPT italic_k , 1 end_POSTSUBSCRIPT ⋅ 1 { italic_j ∈ italic_I start_POSTSUBSCRIPT italic_k , 1 end_POSTSUBSCRIPT } , 1 ≤ italic_j ≤ italic_m ,

where Ik,0subscript𝐼𝑘0I_{k,0}italic_I start_POSTSUBSCRIPT italic_k , 0 end_POSTSUBSCRIPT and Ik,1subscript𝐼𝑘1I_{k,1}italic_I start_POSTSUBSCRIPT italic_k , 1 end_POSTSUBSCRIPT represent the set of indices for noise variants and eQTL signals, respectively, which vary from gene to gene. Moreover, Fk,0subscript𝐹𝑘0F_{k,0}italic_F start_POSTSUBSCRIPT italic_k , 0 end_POSTSUBSCRIPT and Fk,1subscript𝐹𝑘1F_{k,1}italic_F start_POSTSUBSCRIPT italic_k , 1 end_POSTSUBSCRIPT represent the noise and signal distribution of Zjksubscript𝑍𝑗𝑘Z_{jk}italic_Z start_POSTSUBSCRIPT italic_j italic_k end_POSTSUBSCRIPT, respectively, which also vary from gene to gene. In this model, only the null distribution Fk,0subscript𝐹𝑘0F_{k,0}italic_F start_POSTSUBSCRIPT italic_k , 0 end_POSTSUBSCRIPT is specified. The other components, Ik,0subscript𝐼𝑘0I_{k,0}italic_I start_POSTSUBSCRIPT italic_k , 0 end_POSTSUBSCRIPT, Ik,1subscript𝐼𝑘1I_{k,1}italic_I start_POSTSUBSCRIPT italic_k , 1 end_POSTSUBSCRIPT, and Fk,1subscript𝐹𝑘1F_{k,1}italic_F start_POSTSUBSCRIPT italic_k , 1 end_POSTSUBSCRIPT, are unknown.

Specifically, we expect that for a gene k𝑘kitalic_k regulated solely by cis-eQTLs, its Ik,1subscript𝐼𝑘1I_{k,1}italic_I start_POSTSUBSCRIPT italic_k , 1 end_POSTSUBSCRIPT set should be small, and its signal distribution Fk,1subscript𝐹𝑘1F_{k,1}italic_F start_POSTSUBSCRIPT italic_k , 1 end_POSTSUBSCRIPT would concentrate more on strong intensities. On the other hand, for a gene ksuperscript𝑘k^{\prime}italic_k start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT regulated by multiple trans-eQTLs, its Ik,1subscript𝐼superscript𝑘1I_{k^{\prime},1}italic_I start_POSTSUBSCRIPT italic_k start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , 1 end_POSTSUBSCRIPT set should be relatively larger, and its signal distribution would concentrate more on weak intensities. For a gene influenced by both cis- and trans-eQTLs, its signal distribution could span both strong and weak intensities. Our goal is to discover all these different types of eGenes by prioritizing them over irrelevant genes and identify them with statistical guarantees.

2.2 Adaptive Metrics and Algorithm

A key component of our proposed eGene discovery method is the use of a statistical metric that has been identified as optimally adaptive for detecting various types of mixture models. This metric belongs to a general family of goodness-of-fit (GOF) tests, known for demonstrating asymptotic optimality in detecting weak and rare signals amidst a large number of noise variables (Donoho and Jin, 2004; Jager and Wellner, 2007; Donoho and Jin, 2008; Cai et al., 2011; Jeng et al., 2013, 2020). Two members in the GOF family, the Higher Criticism (HC) test and the Berk-Jones (BJ) test, have also been extensively studied in the literature. Specifically, the HC test is renowned for its effectiveness in identifying rare signals, while the BJ test is recognized for its robustness across different signal patterns in finite samples (Li and Siegmund, 2015; Barnett et al., 2017; Zhang et al., 2020; Sun and Lin, 2020).

The HC and BJ statistics are defined as follows: For a gene k𝑘kitalic_k, derive the p𝑝pitalic_p-values {pjk}j=1,,msubscriptsubscript𝑝𝑗𝑘𝑗1𝑚\{p_{jk}\}_{j=1,\ldots,m}{ italic_p start_POSTSUBSCRIPT italic_j italic_k end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_j = 1 , … , italic_m end_POSTSUBSCRIPT of {Zjk}j=1,,msubscriptsubscript𝑍𝑗𝑘𝑗1𝑚\{Z_{jk}\}_{j=1,\ldots,m}{ italic_Z start_POSTSUBSCRIPT italic_j italic_k end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_j = 1 , … , italic_m end_POSTSUBSCRIPT; order the p𝑝pitalic_p-values increasingly so that p(1),kp(m),k,p(m),kp_{(1),k}\leq p_{(m),k}\leq\ldots,\leq p_{(m),k}italic_p start_POSTSUBSCRIPT ( 1 ) , italic_k end_POSTSUBSCRIPT ≤ italic_p start_POSTSUBSCRIPT ( italic_m ) , italic_k end_POSTSUBSCRIPT ≤ … , ≤ italic_p start_POSTSUBSCRIPT ( italic_m ) , italic_k end_POSTSUBSCRIPT; and define

HCk=sup1jmmj/mp(j),mp(j),m(1p(j),m)𝐻subscript𝐶𝑘subscriptsupremum1𝑗𝑚𝑚𝑗𝑚subscript𝑝𝑗𝑚subscript𝑝𝑗𝑚1subscript𝑝𝑗𝑚HC_{k}=\sup_{1\leq j\leq m}\sqrt{m}\frac{j/m-p_{(j),m}}{\sqrt{p_{(j),m}(1-p_{(% j),m})}}italic_H italic_C start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = roman_sup start_POSTSUBSCRIPT 1 ≤ italic_j ≤ italic_m end_POSTSUBSCRIPT square-root start_ARG italic_m end_ARG divide start_ARG italic_j / italic_m - italic_p start_POSTSUBSCRIPT ( italic_j ) , italic_m end_POSTSUBSCRIPT end_ARG start_ARG square-root start_ARG italic_p start_POSTSUBSCRIPT ( italic_j ) , italic_m end_POSTSUBSCRIPT ( 1 - italic_p start_POSTSUBSCRIPT ( italic_j ) , italic_m end_POSTSUBSCRIPT ) end_ARG end_ARG (1)

and

BJk=sup1jm/2{jmlog(j/mp(j),k)+(1jm)log(1j/m1p(j),k)}.𝐵subscript𝐽𝑘subscriptsupremum1𝑗𝑚2𝑗𝑚𝑗𝑚subscript𝑝𝑗𝑘1𝑗𝑚1𝑗𝑚1subscript𝑝𝑗𝑘BJ_{k}=\sup_{1\leq j\leq m/2}\left\{{j\over m}\log({j/m\over p_{(j),k}})+(1-{j% \over m})\log({1-j/m\over 1-p_{(j),k}})\right\}.italic_B italic_J start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = roman_sup start_POSTSUBSCRIPT 1 ≤ italic_j ≤ italic_m / 2 end_POSTSUBSCRIPT { divide start_ARG italic_j end_ARG start_ARG italic_m end_ARG roman_log ( divide start_ARG italic_j / italic_m end_ARG start_ARG italic_p start_POSTSUBSCRIPT ( italic_j ) , italic_k end_POSTSUBSCRIPT end_ARG ) + ( 1 - divide start_ARG italic_j end_ARG start_ARG italic_m end_ARG ) roman_log ( divide start_ARG 1 - italic_j / italic_m end_ARG start_ARG 1 - italic_p start_POSTSUBSCRIPT ( italic_j ) , italic_k end_POSTSUBSCRIPT end_ARG ) } . (2)

Both HC and BJ statistics were designed to detect departures from a specified null hypothesis that all the data points {pjk}j=1,,msubscriptsubscript𝑝𝑗𝑘𝑗1𝑚\{p_{jk}\}_{j=1,\ldots,m}{ italic_p start_POSTSUBSCRIPT italic_j italic_k end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_j = 1 , … , italic_m end_POSTSUBSCRIPT follow the null distribution U(0,1)01(0,1)( 0 , 1 ) (Berk and Jones, 1979; Donoho and Jin, 2004). These statistics compare the ordered p𝑝pitalic_p-values p(j),ksubscript𝑝𝑗𝑘p_{(j),k}italic_p start_POSTSUBSCRIPT ( italic_j ) , italic_k end_POSTSUBSCRIPT to their expected values j/m𝑗𝑚j/mitalic_j / italic_m under the null hypothesis and assess the maximum deviation of a weighted function of the differences between the observed and expected p𝑝pitalic_p-values. The weights are chosen to give more importance to the regions where the differences are most significant. Larger values of the statistics indicate stronger evidence that the data come from a mixture of noise and signal distributions rather than just the noise distribution.

Next, we outline the detailed steps for the proposed eGene discovery method, which utilizes either HC or BJ statistic as the adaptive metric (Tksubscript𝑇𝑘T_{k}italic_T start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT) to capture the joint regulatory effects of the GWAS signals in cis and trans. The genes are ranked based on this adaptive metric, and candidate eGenes are identified with statistical significance.

Step-by-Step Procedure:

  1. 1.

    Identify the m𝑚mitalic_m GWAS signals for a set of q𝑞qitalic_q traits related to a specific disease.

  2. 2.

    Obtain the p𝑝pitalic_p-values for the marginal association tests between the m𝑚mitalic_m GWAS signals and all the K𝐾Kitalic_K genes under consideration.

  3. 3.

    Calculate the adaptive metrics Tk,k=1,,Kformulae-sequencesubscript𝑇𝑘𝑘1𝐾T_{k},k=1,\ldots,Kitalic_T start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_k = 1 , … , italic_K, for all the genes using either (1) or (2).

  4. 4.

    Rank the genes according to the values of Tksubscript𝑇𝑘T_{k}italic_T start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT’s from largest to smallest.

  5. 5.

    Calculate the p𝑝pitalic_p-values of Tksubscript𝑇𝑘T_{k}italic_T start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT for all k=1,,K𝑘1𝐾k=1,\ldots,Kitalic_k = 1 , … , italic_K.

  6. 6.

    Select candidate eGenes with significant Tksubscript𝑇𝑘T_{k}italic_T start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT p𝑝pitalic_p-values through multiple testing.

Remark: The above procedure is completely data-driven. We utilize the R package SetTest from Zhang et al. (2020) to calculate HCk𝐻subscript𝐶𝑘HC_{k}italic_H italic_C start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT and BJk𝐵subscript𝐽𝑘BJ_{k}italic_B italic_J start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT as defined in (1) - (2) and to derive their p𝑝pitalic_p-values. This computation is fast and scalable for large-scale eGene discovery.

The effectiveness of the proposed method hinges on the adaptivity of the Tksubscript𝑇𝑘T_{k}italic_T start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT statistic to both cis and trans-acting signals, for which we provide some theoretical insight in Section 2.3 and numerical justifications in Section 3.

2.3 Theoretical Insight

For a given gene k𝑘kitalic_k, the problem of detecting the existence of any eQTL signals can be formulated as a global hypothesis testing problem. Let Fk,0=N(0,1)subscript𝐹𝑘0𝑁01F_{k,0}=N(0,1)italic_F start_POSTSUBSCRIPT italic_k , 0 end_POSTSUBSCRIPT = italic_N ( 0 , 1 ), representing the noise distribution after data standardization, and Fk,1=N(μk,τk2)subscript𝐹𝑘1𝑁subscript𝜇𝑘superscriptsubscript𝜏𝑘2F_{k,1}=N(\mu_{k},\tau_{k}^{2})italic_F start_POSTSUBSCRIPT italic_k , 1 end_POSTSUBSCRIPT = italic_N ( italic_μ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_τ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ), representing the signal distribution, where both μksubscript𝜇𝑘\mu_{k}italic_μ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT and τk2superscriptsubscript𝜏𝑘2\tau_{k}^{2}italic_τ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT are unknown. Then, an irrelevant gene that is not influenced by any of the SNPs corresponds to the global null hypothesis

Hk,0:ZjkN(0,1),1jm;:subscript𝐻𝑘0formulae-sequencesimilar-tosubscript𝑍𝑗𝑘𝑁011𝑗𝑚H_{k,0}:Z_{jk}\sim N(0,1),\qquad 1\leq j\leq m;italic_H start_POSTSUBSCRIPT italic_k , 0 end_POSTSUBSCRIPT : italic_Z start_POSTSUBSCRIPT italic_j italic_k end_POSTSUBSCRIPT ∼ italic_N ( 0 , 1 ) , 1 ≤ italic_j ≤ italic_m ; (3)

and a candidate gene regulated by one or more of the SNPs corresponds to the alternative hypothesis

Hk,1:Zjk(1πk)N(0,1)+πkN(μk,τk2),1jm,:subscript𝐻𝑘1formulae-sequencesimilar-tosubscript𝑍𝑗𝑘1subscript𝜋𝑘𝑁01subscript𝜋𝑘𝑁subscript𝜇𝑘superscriptsubscript𝜏𝑘21𝑗𝑚H_{k,1}:Z_{jk}\sim(1-\pi_{k})N(0,1)+\pi_{k}N(\mu_{k},\tau_{k}^{2}),\qquad 1% \leq j\leq m,italic_H start_POSTSUBSCRIPT italic_k , 1 end_POSTSUBSCRIPT : italic_Z start_POSTSUBSCRIPT italic_j italic_k end_POSTSUBSCRIPT ∼ ( 1 - italic_π start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) italic_N ( 0 , 1 ) + italic_π start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_N ( italic_μ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_τ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) , 1 ≤ italic_j ≤ italic_m , (4)

where πksubscript𝜋𝑘\pi_{k}italic_π start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT represents the proportion of eQTL signals among all the SNPs, which varies from gene to gene.

To gain theoretical insight into the effect of eQTL signal proportion, as well as the impact of signal mean and variance on the global testing problem, we re-parameterize some model parameters as in Cai et al. (2011). This re-parameterization allows their effects to be represented on comparable scales and analyzed jointly. Specifically, let

πk=mγk,γk(0,1),formulae-sequencesubscript𝜋𝑘superscript𝑚subscript𝛾𝑘subscript𝛾𝑘01\pi_{k}=m^{-\gamma_{k}},\qquad\gamma_{k}\in(0,1),italic_π start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = italic_m start_POSTSUPERSCRIPT - italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUPERSCRIPT , italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ ( 0 , 1 ) , (5)

so that the signal sparsity is represented through the parameter γksubscript𝛾𝑘\gamma_{k}italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT in a constant scale with γk=0subscript𝛾𝑘0\gamma_{k}=0italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = 0 representing the extremely dense case where all SNPs are eQTL signals and γk=1subscript𝛾𝑘1\gamma_{k}=1italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = 1 representing the extremely sparse case where there is only one eQTL signal. Then, depending on the value of γksubscript𝛾𝑘\gamma_{k}italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT, we consider two scenarios: the sparse scenario with γk(1/2,1]subscript𝛾𝑘121\gamma_{k}\in(1/2,1]italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ ( 1 / 2 , 1 ] and the dense scenario with γk(0,1/2]subscript𝛾𝑘012\gamma_{k}\in(0,1/2]italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ ( 0 , 1 / 2 ]. In these two scenarios, the signal mean value is re-parameterized differently:

μk={2hklogm,γk(1/2,1),mhk,γk(0,1/2],subscript𝜇𝑘cases2subscript𝑘𝑚subscript𝛾𝑘121superscript𝑚subscript𝑘subscript𝛾𝑘012\mu_{k}=\begin{cases}\sqrt{2h_{k}\log m},&\gamma_{k}\in(1/2,1),\\ m^{-h_{k}},&\gamma_{k}\in(0,1/2],\end{cases}italic_μ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = { start_ROW start_CELL square-root start_ARG 2 italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT roman_log italic_m end_ARG , end_CELL start_CELL italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ ( 1 / 2 , 1 ) , end_CELL end_ROW start_ROW start_CELL italic_m start_POSTSUPERSCRIPT - italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUPERSCRIPT , end_CELL start_CELL italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ ( 0 , 1 / 2 ] , end_CELL end_ROW (6)

where hk>0subscript𝑘0h_{k}>0italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT > 0 is at the constant scale and represents the signal mean effect. The above re-parameterization reflects two different types of signals, one being sparse (γk(1/2,1)subscript𝛾𝑘121\gamma_{k}\in(1/2,1)italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ ( 1 / 2 , 1 )) and relatively strong (μk=2hklogmsubscript𝜇𝑘2subscript𝑘𝑚\mu_{k}=\sqrt{2h_{k}\log m}italic_μ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = square-root start_ARG 2 italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT roman_log italic_m end_ARG), the other being dense (γk(0,1/2)subscript𝛾𝑘012\gamma_{k}\in(0,1/2)italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ ( 0 , 1 / 2 )) and relatively weak (μk=mhksubscript𝜇𝑘superscript𝑚subscript𝑘\mu_{k}=m^{-h_{k}}italic_μ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = italic_m start_POSTSUPERSCRIPT - italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUPERSCRIPT).

After the re-parameterization, all the parameters γksubscript𝛾𝑘\gamma_{k}italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT, hksubscript𝑘h_{k}italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT, and τksubscript𝜏𝑘\tau_{k}italic_τ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT are at a constant scale, which facilitates the analysis of their joint effects. Notably, the detection boundary ρ(γk,τk)𝜌subscript𝛾𝑘subscript𝜏𝑘\rho(\gamma_{k},\tau_{k})italic_ρ ( italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_τ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) for the global testing problem has been found in Cai et al. (2011) and Jeng et al. (2013), assuming independence among Zjk,1jmsubscript𝑍𝑗𝑘1𝑗𝑚Z_{jk},1\leq j\leq mitalic_Z start_POSTSUBSCRIPT italic_j italic_k end_POSTSUBSCRIPT , 1 ≤ italic_j ≤ italic_m. Its specific form varies across different scenarios as follows:

When signals are relatively sparse with γk(1/2,1)subscript𝛾𝑘121\gamma_{k}\in(1/2,1)italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ ( 1 / 2 , 1 ),

ρ(γk,τk)={(2τk2)(γk1/2),τk2[1,2) and γk(1/2,1τk2/4],(1τk1γk)2,τk2[1,2) and γk(1τk2/4,1),0,τk2>2 and γk(1/2,11/τk2],(1τk1γk)2,τk2>2 and γk(11/τk2,1).𝜌subscript𝛾𝑘subscript𝜏𝑘cases2subscriptsuperscript𝜏2𝑘subscript𝛾𝑘12subscriptsuperscript𝜏2𝑘12 and subscript𝛾𝑘121subscriptsuperscript𝜏2𝑘4superscript1subscript𝜏𝑘1subscript𝛾𝑘2subscriptsuperscript𝜏2𝑘12 and subscript𝛾𝑘1subscriptsuperscript𝜏2𝑘410subscriptsuperscript𝜏2𝑘2 and subscript𝛾𝑘1211subscriptsuperscript𝜏2𝑘superscript1subscript𝜏𝑘1subscript𝛾𝑘2subscriptsuperscript𝜏2𝑘2 and subscript𝛾𝑘11subscriptsuperscript𝜏2𝑘1\rho(\gamma_{k},\tau_{k})=\begin{cases}(2-\tau^{2}_{k})\left(\gamma_{k}-1/2% \right),&\tau^{2}_{k}\in[1,~{}2)\text{~{}~{}and~{}~{}}\gamma_{k}\in(1/2,~{}1-% \tau^{2}_{k}/4],\\ \left(1-\tau_{k}\sqrt{1-\gamma_{k}}\right)^{2},&\tau^{2}_{k}\in[1,~{}2)\text{~% {}~{}and~{}~{}}\gamma_{k}\in(1-\tau^{2}_{k}/4,~{}1),\\ 0,&\tau^{2}_{k}>2\text{~{}~{}and~{}~{}}\gamma_{k}\in(1/2,~{}1-1/\tau^{2}_{k}],% \\ \left(1-\tau_{k}\sqrt{1-\gamma_{k}}\right)^{2},&\tau^{2}_{k}>2\text{~{}~{}and~% {}~{}}\gamma_{k}\in(1-1/\tau^{2}_{k},~{}1).\end{cases}italic_ρ ( italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_τ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) = { start_ROW start_CELL ( 2 - italic_τ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ( italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT - 1 / 2 ) , end_CELL start_CELL italic_τ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ [ 1 , 2 ) and italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ ( 1 / 2 , 1 - italic_τ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT / 4 ] , end_CELL end_ROW start_ROW start_CELL ( 1 - italic_τ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT square-root start_ARG 1 - italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_ARG ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , end_CELL start_CELL italic_τ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ [ 1 , 2 ) and italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ ( 1 - italic_τ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT / 4 , 1 ) , end_CELL end_ROW start_ROW start_CELL 0 , end_CELL start_CELL italic_τ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT > 2 and italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ ( 1 / 2 , 1 - 1 / italic_τ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ] , end_CELL end_ROW start_ROW start_CELL ( 1 - italic_τ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT square-root start_ARG 1 - italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_ARG ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , end_CELL start_CELL italic_τ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT > 2 and italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ ( 1 - 1 / italic_τ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , 1 ) . end_CELL end_ROW (7)

When signals are relatively dense with γk(0,1/2]subscript𝛾𝑘012\gamma_{k}\in(0,1/2]italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ ( 0 , 1 / 2 ],

ρ(γk,τk)={,τk2>1 and γk(0,1/2],1/2γk,τk2=1 and γk(0,1/2].𝜌subscript𝛾𝑘subscript𝜏𝑘casessubscriptsuperscript𝜏2𝑘1 and subscript𝛾𝑘01212subscript𝛾𝑘subscriptsuperscript𝜏2𝑘1 and subscript𝛾𝑘012\rho(\gamma_{k},\tau_{k})=\begin{cases}\infty,&\tau^{2}_{k}>1\text{~{}~{}and~{% }~{}}\gamma_{k}\in(0,1/2],\\ 1/2-\gamma_{k},&\tau^{2}_{k}=1\text{~{}~{}and~{}~{}}\gamma_{k}\in(0,1/2].\end{cases}italic_ρ ( italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_τ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) = { start_ROW start_CELL ∞ , end_CELL start_CELL italic_τ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT > 1 and italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ ( 0 , 1 / 2 ] , end_CELL end_ROW start_ROW start_CELL 1 / 2 - italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , end_CELL start_CELL italic_τ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = 1 and italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ ( 0 , 1 / 2 ] . end_CELL end_ROW (8)

The detection boundary results can serve as a benchmark for method evaluation as it implies the following three aspects:

  1. 1.

    When the signal mean value does not reach the detection boundary, i.e.,

    h{<ρ(γk,τk),γk(1/2,1),>ρ(γk,τk),γk(0,1/2],casesabsent𝜌subscript𝛾𝑘subscript𝜏𝑘subscript𝛾𝑘121absent𝜌subscript𝛾𝑘subscript𝜏𝑘subscript𝛾𝑘012h\begin{cases}<\rho(\gamma_{k},\tau_{k}),&\gamma_{k}\in(1/2,1),\\ >\rho(\gamma_{k},\tau_{k}),&\gamma_{k}\in(0,1/2],\end{cases}italic_h { start_ROW start_CELL < italic_ρ ( italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_τ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) , end_CELL start_CELL italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ ( 1 / 2 , 1 ) , end_CELL end_ROW start_ROW start_CELL > italic_ρ ( italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_τ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) , end_CELL start_CELL italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ ( 0 , 1 / 2 ] , end_CELL end_ROW (9)

    no methods can consistently separate Hk,1subscript𝐻𝑘1H_{k,1}italic_H start_POSTSUBSCRIPT italic_k , 1 end_POSTSUBSCRIPT from Hk,0subscript𝐻𝑘0H_{k,0}italic_H start_POSTSUBSCRIPT italic_k , 0 end_POSTSUBSCRIPT, i.e.,

    Hk,0(Reject Hk,0)+Hk,1(Fail to reject Hk,0)1 as m.formulae-sequencesubscriptsubscript𝐻𝑘0Reject subscript𝐻𝑘0subscriptsubscript𝐻𝑘1Fail to reject subscript𝐻𝑘01 as 𝑚\mathbb{P}_{H_{k,0}}\left(\text{Reject }H_{k,0}\right)+\mathbb{P}_{H_{k,1}}% \left(\text{Fail to reject }H_{k,0}\right)\rightarrow 1\quad\text{ as }\quad m% \rightarrow\infty.blackboard_P start_POSTSUBSCRIPT italic_H start_POSTSUBSCRIPT italic_k , 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( Reject italic_H start_POSTSUBSCRIPT italic_k , 0 end_POSTSUBSCRIPT ) + blackboard_P start_POSTSUBSCRIPT italic_H start_POSTSUBSCRIPT italic_k , 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( Fail to reject italic_H start_POSTSUBSCRIPT italic_k , 0 end_POSTSUBSCRIPT ) → 1 as italic_m → ∞ .
  2. 2.

    If a method can consistently separate Hk,1subscript𝐻𝑘1H_{k,1}italic_H start_POSTSUBSCRIPT italic_k , 1 end_POSTSUBSCRIPT from Hk,0subscript𝐻𝑘0H_{k,0}italic_H start_POSTSUBSCRIPT italic_k , 0 end_POSTSUBSCRIPT with

    Hk,0(Reject Hk,0)+H1(Fail to reject Hk,0)0 as mformulae-sequencesubscriptsubscript𝐻𝑘0Reject subscript𝐻𝑘0subscriptsubscript𝐻1Fail to reject subscript𝐻𝑘00 as 𝑚\mathbb{P}_{H_{k,0}}\left(\text{Reject }H_{k,0}\right)+\mathbb{P}_{H_{1}}\left% (\text{Fail to reject }H_{k,0}\right)\rightarrow 0\quad\text{ as }\quad m\rightarrow\inftyblackboard_P start_POSTSUBSCRIPT italic_H start_POSTSUBSCRIPT italic_k , 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( Reject italic_H start_POSTSUBSCRIPT italic_k , 0 end_POSTSUBSCRIPT ) + blackboard_P start_POSTSUBSCRIPT italic_H start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( Fail to reject italic_H start_POSTSUBSCRIPT italic_k , 0 end_POSTSUBSCRIPT ) → 0 as italic_m → ∞

    whenever the signal mean value passes the detection boundary, i.e.,

    h{>ρ(γk,τk),γk(1/2,1],<ρ(γk,τk),γk(0,1/2],casesabsent𝜌subscript𝛾𝑘subscript𝜏𝑘subscript𝛾𝑘121absent𝜌subscript𝛾𝑘subscript𝜏𝑘subscript𝛾𝑘012h\begin{cases}>\rho(\gamma_{k},\tau_{k}),&\gamma_{k}\in(1/2,1],\\ <\rho(\gamma_{k},\tau_{k}),&\gamma_{k}\in(0,1/2],\end{cases}italic_h { start_ROW start_CELL > italic_ρ ( italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_τ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) , end_CELL start_CELL italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ ( 1 / 2 , 1 ] , end_CELL end_ROW start_ROW start_CELL < italic_ρ ( italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_τ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) , end_CELL start_CELL italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ ( 0 , 1 / 2 ] , end_CELL end_ROW (10)

    then this method is considered an optimal method.

  3. 3.

    If a method can achieve optimality and its implementation does not require specified values of the model parameters, then the method is considered an optimally adaptive method.

In the literature, it has been proved that all the members in the aforementioned GOF family are optimally adaptive for the global testing problem in the sparse scenario (γk(1/2,1)subscript𝛾𝑘121\gamma_{k}\in(1/2,1)italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ ( 1 / 2 , 1 )) with equal variance (τk2=1superscriptsubscript𝜏𝑘21\tau_{k}^{2}=1italic_τ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = 1)(Jager and Wellner, 2007). Moreover, the optimal adaptivity of HC test has also been proved in the extended scenarios with (γk(0,1]subscript𝛾𝑘01\gamma_{k}\in(0,1]italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ ( 0 , 1 ]) and possibly unequal variance (τk21superscriptsubscript𝜏𝑘21\tau_{k}^{2}\geq 1italic_τ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ≥ 1) (Cai et al., 2011; Jeng et al., 2013). We expect that the BJ test performs comparatively to the HC test in the extended scenarios, and this conjecture is supported in the following numerical studies.

3 Simulation

3.1 Simulation Setup

The simulation study aims at checking the competence and efficiency of the proposed eGene discovery method. In the study, we consider 100100100100 genes and 8657865786578657 SNPs. We design three types of eGenes as depicted in Figure 1:
Cis-eGene: illustrated by Gene 1, which is only regulated by one cis-eQTL (SNP 1).
Mixed-eGene: illustrated by Gene 2, which is co-regulated by one cis-eQTL (SNP 2) and a set of trans-eQTLs (a random subset from SNPs 3 - 1002).
Trans-eGene: illustrated by Genes 3 - 20, each of which are regulated by a set of trans-eQTLs (random subsets from SNPs 3 - 1002).
The remaining genes (Genes 21 - 100) are not associated with any of the SNPs and are not depicted in Figure 1.

Refer to caption
Figure 1: Three types of eGenes and their regulatory eQTLs: The solid lines represent cis effect and the dotted lines represent trans effect of the corresponding SNPs. Gene 1 is a cis-eGene regulated by SNP 1 in cis; Gene 2 is a mixed-eGene regulated by SNP 2 in cis and a subset of SNPs in trans; Genes 3 - 20 are trans-eGenes regulated by various subsets of SNPs in trans.

We generate the summary statistics Zjksubscript𝑍𝑗𝑘Z_{jk}italic_Z start_POSTSUBSCRIPT italic_j italic_k end_POSTSUBSCRIPT for the association tests between the 100 genes and 8657865786578657 SNPs as follows: For k=1,,100𝑘1100k=1,\ldots,100italic_k = 1 , … , 100, let

ZjkN(0,1)1{jIk,0}+N(μk,τk2)1{jIk,1},j=1,,8657.formulae-sequencesimilar-tosubscript𝑍𝑗𝑘𝑁011𝑗subscript𝐼𝑘0𝑁subscript𝜇𝑘subscriptsuperscript𝜏2𝑘1𝑗subscript𝐼𝑘1𝑗18657Z_{jk}\sim N(0,1)\cdot 1\{j\in I_{k,0}\}+N(\mu_{k},\tau^{2}_{k})\cdot 1\{j\in I% _{k,1}\},\qquad j=1,\ldots,8657.italic_Z start_POSTSUBSCRIPT italic_j italic_k end_POSTSUBSCRIPT ∼ italic_N ( 0 , 1 ) ⋅ 1 { italic_j ∈ italic_I start_POSTSUBSCRIPT italic_k , 0 end_POSTSUBSCRIPT } + italic_N ( italic_μ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_τ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ⋅ 1 { italic_j ∈ italic_I start_POSTSUBSCRIPT italic_k , 1 end_POSTSUBSCRIPT } , italic_j = 1 , … , 8657 .

For k=1𝑘1k=1italic_k = 1 (Gene 1), I1,1={1}subscript𝐼111I_{1,1}=\{1\}italic_I start_POSTSUBSCRIPT 1 , 1 end_POSTSUBSCRIPT = { 1 } and μ1=Asubscript𝜇1𝐴\mu_{1}=Aitalic_μ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = italic_A, representing the effect of one strong cis-eQTL. For k=2𝑘2k=2italic_k = 2 (Gene 2), I2,1={2}S2subscript𝐼212subscript𝑆2I_{2,1}=\{2\}\cup S_{2}italic_I start_POSTSUBSCRIPT 2 , 1 end_POSTSUBSCRIPT = { 2 } ∪ italic_S start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT and μ2=A1{j=2}+B1{jS2}subscript𝜇2𝐴1𝑗2𝐵1𝑗subscript𝑆2\mu_{2}=A\cdot 1\{j=2\}+B\cdot 1\{j\in S_{2}\}italic_μ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = italic_A ⋅ 1 { italic_j = 2 } + italic_B ⋅ 1 { italic_j ∈ italic_S start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT }, representing the mixed effect of one strong cis-eQTL and many weak trans-eQTLs. For k=320𝑘320k=3-20italic_k = 3 - 20 (Genes 3 - 20), Ik,1=Sksubscript𝐼𝑘1subscript𝑆𝑘I_{k,1}=S_{k}italic_I start_POSTSUBSCRIPT italic_k , 1 end_POSTSUBSCRIPT = italic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT and μk=B1{jSk}subscript𝜇𝑘𝐵1𝑗subscript𝑆𝑘\mu_{k}=B\cdot 1\{j\in S_{k}\}italic_μ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = italic_B ⋅ 1 { italic_j ∈ italic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT }, representing the effect of many weak trans-eQTLs. Note that S2,S3,,S20subscript𝑆2subscript𝑆3subscript𝑆20S_{2},S_{3},\ldots,S_{20}italic_S start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_S start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT , … , italic_S start_POSTSUBSCRIPT 20 end_POSTSUBSCRIPT are randomly selected subsets of SNPs from SNPs 3 - 1002 with a common cardinality s=|Sk|𝑠subscript𝑆𝑘s=|S_{k}|italic_s = | italic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT |. For k=21100𝑘21100k=21-100italic_k = 21 - 100 (Genes 21 - 100), Ik,1=subscript𝐼𝑘1I_{k,1}=\emptysetitalic_I start_POSTSUBSCRIPT italic_k , 1 end_POSTSUBSCRIPT = ∅, so that they are irrelevant genes. More specific values of the model parameters in the simulation examples include A=4𝐴4A=4italic_A = 4, B=0𝐵0B=0italic_B = 0 or 0.2, τk2=1.05subscriptsuperscript𝜏2𝑘1.05\tau^{2}_{k}=1.05italic_τ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = 1.05 or 1.5, and s=500𝑠500s=500italic_s = 500 or 50. Note that B=0𝐵0B=0italic_B = 0 represents the case where the weak trans-eQTLs barely show a non-zero mean effect, and their effects are solely demonstrated through increased variability (τk2>1superscriptsubscript𝜏𝑘21\tau_{k}^{2}>1italic_τ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT > 1).

3.2 Methods Comparison

We apply the proposed eGene discovery procedure, as described in Section 2.2, to the simulated data. Here, we focus on evaluating the efficiency of the proposed method in prioritizing true eGenes over irrelevant ones. We compare both the HC version and the BJ version of our procedure with other existing methods that use different information pooling strategies to summarize the eQTL effects received by a gene. These methods are summarized as follows:

  1. (a)

    Mean-based method: For a given gene k, summarize the eQTL effects it receives from all the SNPs through the mean value j=1mZjk/msuperscriptsubscript𝑗1𝑚subscript𝑍𝑗𝑘𝑚\sum_{j=1}^{m}Z_{jk}/m∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT italic_Z start_POSTSUBSCRIPT italic_j italic_k end_POSTSUBSCRIPT / italic_m.

  2. (b)

    MinimumP-based method: For a given gene k, summarize the eQTL effects through the minimum p𝑝pitalic_p-value of Zjksubscript𝑍𝑗𝑘Z_{jk}italic_Z start_POSTSUBSCRIPT italic_j italic_k end_POSTSUBSCRIPT: min1jmpjksubscript1𝑗𝑚subscript𝑝𝑗𝑘\min_{1\leq j\leq m}p_{jk}roman_min start_POSTSUBSCRIPT 1 ≤ italic_j ≤ italic_m end_POSTSUBSCRIPT italic_p start_POSTSUBSCRIPT italic_j italic_k end_POSTSUBSCRIPT.

  3. (c)

    HC-based method: Use the HC statistic in (1) to summarize the eQTL effects for each gene.

  4. (d)

    BJ-based method: Use the BJ Statistic in (2) to summarize the eQTL effects for each gene.

Both the Mean- and MinimumP-based methods are commonly used approaches for information pooling. We compare the finite-sample performances of these methods in simulation settings designed to reflect different types of eGenes. The comparisons focus on the ability of these methods to prioritize different types of eGenes (Genes 1–20) over the irrelevant genes (Genes 21–100).

The results are presented using the Precision-Recall (PR) curve. The PR curve is particularly effective for evaluating the performance of a method in prioritizing relevant variables over irrelevant variables or, in other words, the ranking efficiency of a method. Here, the four methods in (a) - (d) rank the genes differently according to different information pooling approaches. To evaluate the ranking efficiency of these methods, a series of two metrics: Precision and Recall, are calculated along the ranked genes for each method. That is, for a list of ranked genes by a method, calculate

Precisionr=true eGenes in the top r genesr,Recallr=true eGenes in the top r genesall true eGenes,formulae-sequencesubscriptPrecision𝑟true eGenes in the top r genesrsubscriptRecall𝑟true eGenes in the top r genesall true eGenes\text{Precision}_{r}=\frac{\text{true eGenes in the top r genes}}{\text{r}},% \quad\text{Recall}_{r}=\frac{\text{true eGenes in the top r genes}}{\text{all % true eGenes}},Precision start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT = divide start_ARG true eGenes in the top r genes end_ARG start_ARG r end_ARG , Recall start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT = divide start_ARG true eGenes in the top r genes end_ARG start_ARG all true eGenes end_ARG ,

for all r=1,,K(=100)𝑟1annotated𝐾absent100r=1,\ldots,K(=100)italic_r = 1 , … , italic_K ( = 100 ). As the rank r increasing from 1 to 100, Recallr increases to reach 1 eventually while Precisionr generally decreases. The PR curve is plotted with Recallr on the x-axis and Precisionr on the y-axis, and the area under the PR curve (AUC-PR) is a single scalar value that summarizes the overall performance of the method in prioritizing true eGenes over irrelevant genes. Compared to the commonly used ROC curve, PR curve is more appropriate when dealing with imbalanced datasets where the number of signals is expected to be much less than that of the noise variables.

3.3 Simulation Results

Figure 2 shows that in the settings where different types of eGenes co-exist and the trans-eGenes are the majority, the MinimumP method performs the worst among the four methods in the left panel, where trans-eQTL signals are relatively weak and dense (τk2=1.05,s=500formulae-sequencesuperscriptsubscript𝜏𝑘21.05𝑠500\tau_{k}^{2}=1.05,s=500italic_τ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = 1.05 , italic_s = 500). Conversely, in the right panel where trans-eQTL signals are stronger and sparser (τk2=1.5,s=50formulae-sequencesuperscriptsubscript𝜏𝑘21.5𝑠50\tau_{k}^{2}=1.5,s=50italic_τ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = 1.5 , italic_s = 50), the Mean-based method performs the worst. In both scenarios, the BJ and HC methods exhibit more stable results and outperform the other two methods, with BJ and HC performing comparably.

Figure 2 presents results with a zero mean value for trans-eQTLs (B=0𝐵0B=0italic_B = 0). Similar results are observed with B=0.2𝐵0.2B=0.2italic_B = 0.2 and are thus omitted to save space.

Refer to caption
Refer to caption
Figure 2: PR Curves of all the methods. The Precision and Recall metrics are presented in mean values from 100 replications. There are 20 eGenes, including one cis-eGene, one mixed-eGene, and 18 trans-eGenes. The trans-eQTLs of an eGene include s𝑠sitalic_s SNPs randomly selected from 1000 SNPs with possible overlapping across eGenes. The left panel has weak and dense trans-eQTL signals (τk2=1.05,s=500formulae-sequencesuperscriptsubscript𝜏𝑘21.05𝑠500\tau_{k}^{2}=1.05,s=500italic_τ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = 1.05 , italic_s = 500), whereas the right panel has stronger and sparser trans-eQTL signals (τk2=1.5,s=50formulae-sequencesuperscriptsubscript𝜏𝑘21.5𝑠50\tau_{k}^{2}=1.5,s=50italic_τ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = 1.5 , italic_s = 50)

.

4 Real Data Analysis

We applied the proposed eGene discovery method to adipose eQTL data from the METSIM study (Laakso et al., 2017; Raulerson et al., 2019). This dataset consisted of RNA-seq data generated on subcutaneous adipose tissue biopsies from 426 males from Kuopio, Finland. While Raulerson et al. (2019) primarily investigated adipose cis-eQTL (\leq 1Mb) colocalized with GWAS signals, we aimed here to discover eGenes regulated by GWAS variants in both cis and trans (>>> 1Mb). The RNA-seq data quality control, normalization, and covariates were described in Raulerson et al. (2019). Trans-eQTL analyses excluded pseudogenes and genes with low mappability, and we used 3 PEER factors as covariates for both cis and trans analyses. We obtained the p𝑝pitalic_p-values for the marginal association tests between the 2,879 GWAS signals (minor allele frequency 0.01absent0.01\geq 0.01≥ 0.01 and imputation R20.5𝑅20.5R2\geq 0.5italic_R 2 ≥ 0.5) and 24,371 genes described in that study. We found that a small number of genes with only one insignificant eQTL data point can cause errors when deriving their BJ p-values using the SetTest package. Consequently, we excluded these genes and conducted our analysis with the remaining 21,593 genes.

As detailed in Section 2.2, we calculated the BJ statistics and ranked the genes accordingly. We further derived the BJ p𝑝pitalic_p-values of all the 21593 genes and applied multiple testing (Bonferroni <0.01absent0.01<0.01< 0.01) to select statistically significant eGenes. This resulted in the selection of 424 genes. The complete list of selected genes, ranked by their BJ statistics, is provided in the Appendix.

Figure 3 presents the histogram of the BJ p𝑝pitalic_p-values of the 424 selected genes in the first panel. It can been seen that most of the selected genes have their BJ p-values much lower than the Bonferroni threshold (4.103237e-07). The results are supported by the outcomes of the HC method as shown in the second panel of Figure 3, where the HC p𝑝pitalic_p-values of the selected 424 genes are also generally very small.

Refer to caption
Refer to caption
Refer to caption
Figure 3: The first panel displays the histogram of BJ p𝑝pitalic_p-values for the 424 selected genes. The second panel presents the histogram of HC p𝑝pitalic_p-values of these genes. The third panel illustrates the proportion of matches between our ranked genes and those identified in Raulerson et al. (2019).

We further compared our results with the eGenes identified in Raulerson et al. (2019), where 318 eGenes (287 primary and 31 secondary) are identified using only the cis-eQTL data. We found 134 common genes between the two sets of results. As illustrated in the third panel of Figure 3, the matching proportion along our list of ranked selected genes generally decreases from 0.67 to 0.32, indicating that our method effectively prioritize true eGenes over irrelevant ones.

It is noteworthy that, different from Raulerson et al. (2019), our study simultaneously analyzes both cis and trans-eQTL data and captures the joint cis- and trans-acting effects on the genes. This approach could enhance the discovery of eGenes that play significant roles in encoding multifunctional proteins or regulating various cellular processes. The top 5 genes, as presented in Table 1, illustrate such examples.

Rank Gene Name Function Interpretation
1 C1QTNF4 Serum C1q/TNF-Related Protein 4 levels have been shown to be
lower in type 2 diabetes patients with nonalcoholic fatty liver
disease (NAFLD) than those without NAFLD.
2 WDR6 Upregulation of WD40 repeat-containing protein 6 levels
has been reported to drive hepatic de novo lipogenesis
during insulin resistance.
3 AC034243.1 This long noncoding RNA is transcribed in the antisense direction
from the transcription start site of CTNNA1, which encodes
catenin alpha 1 and has been implicated in cancer.
4 HLA-DQA1 This component of the major histocompatibility complex plays a
keyrole in the immune system.
5 PEX6 Peroxisomal biogenesis factor 6 plays a role in peroxisomal
protein import.
Table 1: Top 5 genes ranked by the significance of their BJ statistics. Gene names and functions were obtained by searching Gene resources at the National Center for Biotechnology Information; additional references include Han et al. (2023), Yao et al. (2023), and Lobo et al. (2021).

5 Conclusion and Discussion

In this paper, we present a new approach to discovering candidate eGenes by detecting the joint effects of cis- and trans-eQTLs. Leveraging both cis and trans-eQTL data, our method offers a more comprehensive understanding of gene regulation mechanisms compared to traditional methods that consider these effects separately. The novel application of the statistically adaptive metrics enhances the prioritization of key regulatory genes involved in complex traits and diseases.

Applying our approach to adipose eQTL data from the METSIM study, we discovered eGenes regulated by GWAS signals in both cis and trans across a broad spectrum of 93 cardiometabolic traits. This analysis identified 424 significant eGenes, exceeding the number of genes identified by methods that focused solely on cis-eQTLs. The top-ranked eGenes selected by our method play significant roles in encoding multifunctional proteins and regulating various cellular processes. These findings underscore the potential of our eGene discovery method in uncovering complex genetic regulatory mechanisms. The identified eGenes provide valuable insights into the genetic architecture of complex traits and offer promising targets for further functional studies and therapeutic interventions.

Acknowledgments. The authors would like to thank Dr. Sarah Brotman and Dr. Karen Mohlke for their assistance with accessing eQTL data from the METSIM study. The authors also express their gratitude to Dr. Zheyang Wu and Dr. Hong Zhang for their support with the SetTest software.

Declarations.

Data availability The adipose eQTL data used in this paper is sourced from Raulerson et al. (2019) and is publicly accessible.

Appendix A: Complete list of selected genes

 [1] C1QTNF4             WDR6                AC034243.1          HLA-DQA1            PEX6
 [6] UHRF1BP1            ERAP2               HLA-C               HSD17B12            RBM6
[11] CDK2AP1             HLA-DQB1            LINC00680           WNT6                CERS4
[16] CDC37P1             AC016683.6          AP006621.1          PAX8                JMJD7
[21] DALRD3              HLA-DQB1-AS1        XXbac-BPG254F23.6   MLXIPL              PM20D1
[26] SNX10               CTD-2651B20.3       XXbac-BPGBPG55C20.1 RP11-282O18.3       NBPF3
[31] NCKIPSD             RHD                 XKR9                ADORA2B             HLA-DRB1
[36] NPC1L1              MIR1307             AP006621.5          XXbac-BPG299F13.17  ITGB6
[41] AS3MT               GNL3                SPTBN5              MAN2C1              DISP2
[46] C2orf74             C10orf32-ASMT       LITAF               RP11-217B7.2        RP11-378A13.1
[51] NPC1                KIAA1841            KIAA1875            FN3KRP              GBAP1
[56] RP11-65I12.1        JMJD7-PLA2G4B       XXbac-BPG248L24.12  RP11-73M18.9        ADH1A
[61] ZSWIM7              C15orf38-AP3S2      ANAPC4              RP11-296I10.6       MED24
[66] NPIPB7              RHCE                AC034220.3          RP11-109L13.1       ZNF600
[71] CEP192              LILRA3              SERBP1P3            C12orf65            AC068039.4
[76] RP11-1348G14.4      GNMT                RPL36P4             TIPARP              GTF2H1
[81] SMIM19              AC145343.2          AOC1                EXOSC6              RP11-755F10.1
[86] DGKQ                MAST4-AS1           RP11-624M8.1        STAG3L1             LILRB2
[91] RFT1                AC018720.10         PLA2G4B             GSDMA               EIF3KP1
[96] NR1H3               NT5DC2              NPIPA5              INO80E              PPID
[101] RP11-57H14.2        SH2B1               MEGF9               ERAP1               NDUFAF1
[106] CDK5RAP3            RP11-673E1.1        AP3S2               EIF3C               PIDD
[111] AMT                 PLCE1               CDH13               PSMD5-AS1           MAP1LC3A
[116] TRIP4               RP11-179B2.2        HLA-DRB6            SLC47A1             RP11-252K23.2
[121] STRA13              CSNK2B              XXbac-BPGBPG55C20.2 ITIH4               RP11-529H20.6
[126] SLC5A11             VEGFB               KIAA1683            ATXN2L              RPS26
[131] C10orf32            AL049840.1          CTD-2303H24.2       RP4-712E4.1         RP11-244F12.3
[136] USP32P2             MYEOV               AP006621.6          CTC-228N24.3        CLEC18A
[141] RP11-69E11.4        TMEM180             SLC28A2             CRYZ                CEACAMP3
[146] CCDC12              CHST8               NPIPA3              NMRAL1              RP11-148O21.6
[151] APOB                NPIPB6              RP11-211G23.2       RP1-140A9.1         EMC1
[156] MST1L               CCDC116             SRR                 PEPD                C15orf57
[161] BMP8A               ARHGAP1             MAP3K11             SH3PXD2B            G6PC2
[166] SCAPER              SPPL3               RP11-397E7.4        SULT1A2             RP11-17E13.2
[171] TCP11               HLA-B               BTN3A2              KLF14               XXbac-BPG300A18.13
[176] AP3B2               LINC00310           TRIM73              CLUAP1              RP11-419C5.2
[181] HSD17B13            KLHL31              RP1-313I6.12        OPRL1               SLC35G5
[186] LRRC37A             SDHDP6              POM121C             RP11-378J18.8       SYPL2
[191] HEY2                RNF5                KAT8                TMBIM1              TCEA3
[196] HERPUD1             HLA-DRA             PRMT6               RP13-753N3.1        CCDC163P
[201] AFF1                HMGN1               LACTB               PACS1               GUSBP4
[206] RP11-64K12.4        LINC00202-1         CCDC36              SERPINC1            ARL14EP
[211] NPIPA1              AMIGO1              LGALS9C             ROM1                MMP16
[216] TIGD7               CNTN2               RPAP1               PDLIM4              PEMT
[221] CWF19L1             INVS                CELSR2              MARVELD3            RP11-107F6.3
[226] EYA1                TRMT61B             RP11-680G24.5       RP11-324E6.9        SUZ12P
[231] PKD1                WARS2               LHCGR               AHSA2               LRRC37A2
[236] RP5-874C20.3        AC138783.12         ANK1                SLC7A9              CCHCR1
[241] TMEM60              ZNF589              LMOD1               MICA                RP11-502I4.3
[246] NUDCD3              RNF157              ADAM1A              ADCY10P1            VPS53
[251] TRIM66              MRPS18AP1           MYBPC3              MARK3               KIAA1161
[256] GYPE                CTD-2260A17.3       ALMS1P              RPP25               RP11-680G24.4
[261] TOM1L2              PLEC                NBPF1               TMEM9B-AS1          CPNE1
[266] HLA-DQA2            NFKBIL1             RP3-465N24.5        IL20RB              UGT1A7
[271] AL022393.7          ZSCAN31             UGT1A9              UGT1A8              UGT1A3
[276] UGT1A5              RP11-114H24.5       HCAR2               ZC3H12C             IRS1
[281] ELP6                RAN                 C18orf8             SLC9B1              LONRF1
[286] FAM106A             OXCT2               SMG5                RP11-493E12.2       TTC12
[291] PHC1                NEURL2              UGT1A1              TRIM24              RMDN1
[296] CEACAM21            TLK1                UGT1A10             PDXDC2P             GS1-259H13.2
[301] LRRC36              SFI1                SPATA5L1            DHRS11              EIF3CL
[306] OXCT2P1             TSPY26P             MRPL45P2            RP5-966M1.6         LPIN3
[311] RP11-593F23.1       RP11-339B21.14      LINC00674           UGT1A6              REEP2
[316] UGT1A4              SIL1                LRRC45              ABLIM3              NDUFS1
[321] GPIHBP1             APOC1P1             PLTP                RP11-1348G14.1      BEND6
[326] SAA3P               SULT1A1             HLA-DRB5            SURF1               RAC1
[331] ADIPOQ-AS1          EFCAB13             FIGNL1              HCAR1               STAG3L2
[336] AC131056.3          KNOP1               CYP17A1             CTNNAL1             CTC-510F12.4
[341] NPPA-AS1            GPR180              TEX264              RP11-304L19.3       MT1E
[346] MST1R               RP11-817O13.8       MKX                 GOSR1               RPL8
[351] ANXA5               GALNT2              RP1-265C24.5        PARP6               TCF19
[356] SSC5D               ENDOG               RP11-196G11.2       PCNX                AF131215.9
[361] GATSL3              PRPH2               FGF9                PTH1R               ZNF577
[366] FAM182A             TTYH1               KHK                 RP11-457M11.5       RMI2
[371] SPC25               AC007390.5          RP1-130H16.18       CAMK2G              PDE3A
[376] TMEM116             TNXA                RP11-573D15.1       CTA-85E5.10         AC138430.4
[381] DNAH17              RP11-408A13.4       C4B                 OLFM2               NUP85
[386] PTPN12              CYP17A1-AS1         B3GALNT2            RP1-18D14.7         RSPO3
[391] RP11-24N18.1        LRRC37A6P           GATM                NAP1L4              RP11-7F17.5
[396] RP11-433P17.3       DAP3                CYBRD1              TMEM110             PABPC4
[401] RP5-968P14.2        C1orf167            GFPT1               RP11-345M22.2       AC010883.5
[406] AC012146.7          RP11-10L12.4        CROCCP2             RP11-111A22.1       LINC00886
[411] AC006129.4          GRK4                TBKBP1              CMIP                ZNF781
[416] MST1                CCL3L3              YPEL3               HAPLN4              SKIV2L
[421] ZNF703              RFC4                PEG3                UFSP1

References

  • Ardlie et al. (2015) Ardlie, K. G., D. S. Deluca, A. V. Segrè, T. J. Sullivan, T. R. Young, E. T. Gelfand, C. A. Trowbridge, J. B. Maller, T. Tukiainen, et al. (2015). The genotype-tissue expression (gtex) pilot analysis: Multitissue gene regulation in humans. Science 348(6235), 648–660. PMCID: PMC4547484.
  • Barnett et al. (2017) Barnett, I., R. Mukherjee, and X. Lin (2017). The generalized higher criticism for testing snp-set effects in genetic association studies. Journal of the American Statistical Association 112(517), 64–76.
  • Berk and Jones (1979) Berk, R. H. and D. H. Jones (1979). Goodness-of-fit test statistics that dominate the kolmogorov statistics. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 47(1), 47–59.
  • Boyle et al. (2017) Boyle, E. A., Y. I. Li, and J. K. Pritchard (2017). An expanded view of complex traits: from polygenic to omnigenic. Cell 169(7), 1177–1186.
  • Brynedal et al. (2017) Brynedal, B., J. Choi, T. Raj, R. Bjornson, B. E. Stranger, B. M. Neale, B. F. Voight, and C. Cotsapas (2017). Large-scale trans-eqtls affect hundreds of transcripts and mediate patterns of transcriptional co-regulation. The American Journal of Human Genetics 100(4), 581–591.
  • Bryois et al. (2014) Bryois, J., A. Buil, D. M. Evans, J. P. Kemp, S. B. Montgomery, D. F. Conrad, K. M. Ho, S. Ring, M. Hurles, P. Deloukas, et al. (2014). Cis and trans effects of human genomic variants on gene expression. PLoS genetics 10(7), e1004461.
  • Cai et al. (2011) Cai, T. T., X. J. Jeng, and J. Jin (2011). Optimal detection of heterogeneous and heteroscedastic mixtures. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73(5), 629–662.
  • Cannon and Mohlke (2018) Cannon, M. E. and K. L. Mohlke (2018). Deciphering the emerging complexities of molecular mechanisms at gwas loci. The American Journal of Human Genetics 103(5), 637–653.
  • Cano-Gamez and Trynka (2020) Cano-Gamez, E. and G. Trynka (2020). From gwas to function: using functional genomics to identify the mechanisms underlying complex diseases. Frontiers in genetics 11, 424.
  • Civelek et al. (2017) Civelek, M., Y. Wu, C. Pan, C. K. Raulerson, A. Ko, A. He, C. Tilford, N. K. Saleem, A. Stančáková, L. J. Scott, et al. (2017). Genetic regulation of adipose gene expression and cardio-metabolic traits. The American Journal of Human Genetics 100(3), 428–443.
  • Claussnitzer et al. (2015) Claussnitzer, M., S. N. Dankel, K.-H. Kim, G. Quon, W. Meuleman, C. Haugen, V. Glunk, I. S. Sousa, J. L. Beaudry, V. Puviindran, et al. (2015). Fto obesity variant circuitry and adipocyte browning in humans. New England Journal of Medicine 373(10), 895–907.
  • Deng and Pan (2020) Deng, Y. and W. Pan (2020). A powerful and versatile colocalization test. PLoS computational biology 16(4), e1007778.
  • Donoho and Jin (2004) Donoho, D. and J. Jin (2004). Higher criticism for detecting sparse heterogeneous mixtures. Annals of Statistics 32(3), 962–994.
  • Donoho and Jin (2008) Donoho, D. and J. Jin (2008). Higher criticism thresholding: Optimal feature selection when useful features are rare and weak. Proceedings of the National Academy of Sciences 105(39), 14790–14795.
  • Gamazon et al. (2015) Gamazon, E. R., H. E. Wheeler, K. P. Shah, S. V. Mozaffari, K. Aquino-Michaels, R. J. Carroll, A. E. Eyler, J. C. Denny, G. Consortium, D. L. Nicolae, et al. (2015). A gene-based association method for mapping traits using reference transcriptome data. Nature genetics 47(9), 1091–1098.
  • Gusev et al. (2016) Gusev, A., A. Ko, H. Shi, G. Bhatia, W. Chung, B. W. Penninx, R. Jansen, E. J. De Geus, D. I. Boomsma, F. A. Wright, et al. (2016). Integrative approaches for large-scale transcriptome-wide association studies. Nature genetics 48(3), 245–252.
  • Han et al. (2023) Han, J., H. Fan, Y. Dai, and X. Cheng (2023). Serum c1q/tnf-related protein 4 levels are associated with nonalcoholic fatty liver disease in type 2 diabetic patients. Metabolic Syndrome and Related Disorders 21(3), 163–168.
  • Jager and Wellner (2007) Jager, L. and J. A. Wellner (2007). Goodness-of-fit tests via phi-divergences.
  • Jeng et al. (2013) Jeng, X. J., T. T. Cai, and H. Li (2013). Simultaneous discovery of rare and common segment variants. Biometrika 100(1), 157–172. PMCID:PMC3696347.
  • Jeng et al. (2020) Jeng, X. J., J. Rhyne, T. Zhang, and J.-Y. Tzeng (2020). Effective snp ranking improves the performance of eqtl mapping. Genetic epidemiology 44(6), 611–619.
  • Laakso et al. (2017) Laakso, M., J. Kuusisto, A. Stančáková, T. Kuulasmaa, P. Pajukanta, A. J. Lusis, F. S. Collins, K. L. Mohlke, and M. Boehnke (2017). The metabolic syndrome in men study: a resource for studies of metabolic and cardiovascular diseases. Journal of lipid research 58(3), 481–493. PMID:28119442.
  • Li and Siegmund (2015) Li, J. and D. Siegmund (2015). Higher criticism: p-values and criticism. Annals of Statistics 43(3), 1323–1350.
  • Liu et al. (2019) Liu, X., Y. I. Li, and J. K. Pritchard (2019). Trans effects on gene expression can drive omnigenic inheritance. Cell 177(4), 1022–1034.
  • Lobo et al. (2021) Lobo, S., P. R. Benusiglio, F. Coulet, L. Boussemart, L. Golmard, I. Spier, R. Hüneburg, S. Aretz, C. Colas, and C. Oliveira (2021). Cancer predisposition and germline ctnna1 variants. European journal of medical genetics 64(10), 104316.
  • Price et al. (2008) Price, A. L., N. Patterson, D. C. Hancks, S. Myers, D. Reich, V. G. Cheung, and R. S. Spielman (2008). Effects of cis and trans genetic ancestry on gene expression in african americans. PLoS genetics 4(12), e1000294.
  • Raulerson et al. (2019) Raulerson, C. K., A. Ko, J. C. Kidd, K. W. Currin, S. M. Brotman, M. E. Cannon, Y. Wu, C. N. Spracklen, A. U. Jackson, H. M. Stringham, et al. (2019). Adipose tissue gene expression associations reveal hundreds of candidate genes for cardiometabolic traits. The American Journal of Human Genetics 105(4), 773–787.
  • Smemo et al. (2014) Smemo, S., J. J. Tena, K.-H. Kim, E. R. Gamazon, N. J. Sakabe, C. Gómez-Marín, I. Aneas, F. L. Credidio, D. R. Sobreira, N. F. Wasserman, et al. (2014). Obesity-associated variants within fto form long-range functional connections with irx3. Nature 507(7492), 371–375. PMCID:PMC4113484.
  • Sun and Lin (2020) Sun, R. and X. Lin (2020). Genetic variant set-based tests using the generalized berk–jones statistic with application to a genome-wide association study of breast cancer. Journal of the American Statistical Association 115(531), 1079–1091.
  • Visscher et al. (2017) Visscher, P. M., N. R. Wray, Q. Zhang, P. Sklar, M. I. McCarthy, M. A. Brown, and J. Yang (2017). 10 years of gwas discovery: biology, function, and translation. The American Journal of Human Genetics 101(1), 5–22. PMCID: PMC5501872.
  • Westra et al. (2013) Westra, H.-J., M. J. Peters, T. Esko, H. Yaghootkar, C. Schurmann, J. Kettunen, M. W. Christiansen, B. P. Fairfax, K. Schramm, J. E. Powell, et al. (2013). Systematic identification of trans eqtls as putative drivers of known disease associations. Nature genetics 45(10), 1238–1243.
  • Yao et al. (2017) Yao, C., R. Joehanes, A. D. Johnson, T. Huan, C. Liu, J. E. Freedman, P. J. Munson, D. E. Hill, M. Vidal, and D. Levy (2017). Dynamic role of trans regulation of gene expression in relation to complex traits. The American Journal of Human Genetics 100(4), 571–580. PMCID:PMC5384035.
  • Yao et al. (2023) Yao, Z., Y. Gong, W. Chen, S. Shao, Y. Song, H. Guo, Q. Li, S. Liu, X. Wang, Z. Zhang, et al. (2023). Upregulation of wdr6 drives hepatic de novo lipogenesis in insulin resistance in mice. Nature metabolism 5(10), 1706–1725.
  • Zhang et al. (2020) Zhang, H., J. Jin, and Z. Wu (2020). Distributions and power of optimal signal-detection statistics in finite case. IEEE Transactions on Signal Processing 68, 1021–1033.
  • Zhu et al. (2016) Zhu, Z., F. Zhang, H. Hu, A. Bakshi, M. R. Robinson, J. E. Powell, G. W. Montgomery, M. E. Goddard, N. R. Wray, P. M. Visscher, et al. (2016). Integration of summary data from gwas and eqtl studies predicts complex trait gene targets. Nature genetics 48(5), 481. PMID:27019110.