Kernel canonical correlation analysis for assessing geneâgene interactions and application to ovarian cancer

Larson, Nicholas B; Jenkins, Gregory D; Larson, Melissa C; Vierkant, Robert A; Sellers, Thomas A; Phelan, Catherine M; Schildkraut, Joellen M; Sutphen, Rebecca; Pharoah, Paul P D; Gayther, Simon A; Wentzensen, Nicolas; Goode, Ellen L; Fridley, Brooke L

doi:10.1038/ejhg.2013.69

Download PDF

Article
Open access
Published: 17 April 2013

Kernel canonical correlation analysis for assessing geneâgene interactions and application to ovarian cancer

Nicholas B Larson¹,
Gregory D Jenkins¹,
Melissa C Larson¹,
Robert A Vierkant¹,
Thomas A Sellers²,
Catherine M Phelan²,
Joellen M Schildkraut³,
Rebecca Sutphen⁴,
Paul P D Pharoah⁵,
Simon A Gayther⁶,
Nicolas Wentzensen⁷,
Ovarian Cancer Association Consortium,
Ellen L Goode¹ &
â¦
Brooke L Fridley^1,8Â

European Journal of Human Genetics volumeÂ 22,Â pages 126â131 (2014)Cite this article

3240 Accesses
33 Citations
1 Altmetric
Metrics details

Subjects

Abstract

Although single-locus approaches have been widely applied to identify disease-associated single-nucleotide polymorphisms (SNPs), complex diseases are thought to be the product of multiple interactions between loci. This has led to the recent development of statistical methods for detecting statistical interactions between two loci. Canonical correlation analysis (CCA) has previously been proposed to detect geneâgene coassociation. However, this approach is limited to detecting linear relations and can only be applied when the number of observations exceeds the number of SNPs in a gene. This limitation is particularly important for next-generation sequencing, which could yield a large number of novel variants on a limited number of subjects. To overcome these limitations, we propose an approach to detect geneâgene interactions on the basis of a kernelized version of CCA (KCCA). Our simulation studies showed that KCCA controls the Type-I error, and is more powerful than leading gene-based approaches under a disease model with negligible marginal effects. To demonstrate the utility of our approach, we also applied KCCA to assess interactions between 200 genes in the NF-ÎºB pathway in relation to ovarian cancer risk in 3869 cases and 3276 controls. We identified 13 significant gene pairs relevant to ovarian cancer risk (local false discovery rate <0.05). Finally, we discuss the advantages of KCCA in geneâgene interaction analysis and its future role in genetic association studies.

Family-based gene-environment interaction using sequence kernel association test (FGE-SKAT) for complex quantitative traits

Article Open access 01 April 2021

Aggregative trans-eQTL analysis detects trait-specific target gene sets in whole blood

Article Open access 26 July 2022

Estimation of causal effects of genes on complex traits using a Bayesian-network-based framework applied to GWAS data

Article 17 October 2024

Introduction

Genome-wide association studies (GWAS) have identified hundreds of loci that harbor genetic variants that influence predisposition to a particular phenotype. Such studies involve the characterization of a large number of single-nucleotide polymorphisms (SNPs) across the genome and comparison of the frequency of variants across disease states. Initial GWAS analysis strategies involved single locus models, whereby individual markers were tested independently for association with a given phenotype. Although this approach has successfully identified regions of disease susceptibility, some contend it has failed to fully explain the heritability of complex phenotypes.^{1, 2} As common complex diseases and traits are thought to be a result of complex interactions and multiple low-penetrance variants,^{3, 4} multi-locus SNP models, as opposed to single SNP models, may better capture the true underlying genotypicâphenotypic relationship.

One strategy for multi-locus modeling is to jointly model the effects all SNPs within a given gene (eg, multivariable logistic regression models). However, this approach may lack power as the degrees of freedom of the model could be large and may require filtering or shrinkage approaches. Another drawback to the joint modeling of multiple SNPs within a gene is possible model fitting issues due to multicollinearity between SNPs (ie, linkage disequilibrium (LD)), as well as the lack of inclusion of LD information in the analysis. Recently, this idea of gene-level analysis has led to the concept of geneâgene interaction analysis, as opposed to SNPâSNP interaction approaches. Geneâgene interactions are not only hypothesized to have a large role in explaining missing heritability,⁵ they can also serve to provide biological information through construction of novel gene pathway topologies. Although classically geneâgene interactions have been defined statistically as deviance from additive marginal effects, such as in the case of logistic regression model, this type of model is limiting with respect to statistical power. Moreover, the results of such SNPâSNP interaction analyses lack clear biological interpretability.

Zhao et al⁶ proposed testing for interactions between two unlinked loci using measures of LD, which can be extended to case-control design by comparing such measures across case or disease status. This concept was adapted to gene-level analysis by Peng et al,⁷ which used a Wald-type U-statistic based on canonical correlation analysis⁸ (CCA) to detect geneâgene coassociation in case-control studies. As an LD-based procedure, the CCA approach obtains the maximal correlation of the linear combinations of the SNPs, coded as 0, 1 or 2 in terms of the minor allele, between two genes across case-control status, and tests whether the difference in the first canonical correlations is statistically significant. Although there are many benefits to this approach, it is limited to analyses where the number of observations exceeds the number of markers. Moreover, the use of CCA can only be used to detect linear relationships, which may limit power in the presence of nonlinear correlations between genes. Finally, CCA generally requires a large sample-to-feature ratio to avoid issues with overfitting the data, which raises questions of model regularization.

One solution to the above limitations involving the use of CCA for assessing geneâgene interactions is the use of kernels. Kernel methods are generally defined as algorithms that analyze data represented by similarity matrices, which are derived through the use of positive definite kernel functions.⁹ By mapping the original data to a nonlinear feature space, traditional linear methods involving dot products have been extended to nonlinear applications through the use of the âkernel trickâ.¹⁰ The application of kernel machines is quite popular as a method to derive metrics of genomic similarity,¹¹ and their use has been successful in the area of gene-level association analyses such as SKAT^{12, 13} and SPA-3G.¹⁴ Kernelized version of CCA (KCCA) provides a straightforward generalization of CCA to nonlinear correlations by applying CCA to kernel-generated feature spaces.

In this article, we develop a KCCA procedure for identifying coassociation between genes using genome-wide SNP data from a case-control study of complex phenotypes. We briefly discuss sample CCA and its kernelized version. We further outline and address the statistical and computational issues that accompany this approach, including concerns of regularization. To evaluate the properties of this method, we examine control of Type-I error rate and the power of our KCCA method compared with other existing methods for geneâgene interaction detection by a simulation study. Finally, we apply our KCCA approach to a study case-control study of invasive epithelial ovarian cancer to determine geneâgene interactions between genes within the NF-ÎºB gene pathway.

Materials and methods

Data definition

Let be the number of SNPs corresponding to the gene in a given gene list of size G. Define to be the genotype value for the SNP in gene g in the case subject, for , , and , where is the number of copies of the minor allele for SNP j in gene g under an assumed joint additive-additive genetic model. Similarly, we define for the control subjects, where . This genotypic information can in turn be represented by the respective and matrices and . These matrices may also contain adjusted genotype values that have been corrected for various covariates and population stratification.

Hypothesis testing

To test whether there is a statistical interaction between two genes across case-control status, we use KCCA to generate measures of genetic coassociation for both case and control status. For given genes l,mâ{1,...,G}, such that lâ m, consider the genotype matrices , , , and , with corresponding reduced kernel representations , , , and . Define and to be the respective maximal kernel canonical correlations for cases and controls between genes l and m (see Appendices IâIII in Supplementary Material for details). We then define a statistic based upon an analog of the Fisher variance stabilizing transformation of the Pearsonâs correlation coefficient,¹⁵ given as

The transformation of the canonical correlation is written as , which is approximately distributed as standard normal. A Wald-type statistic for assessing the statistical significance of the difference in geneâgene coassociation between cases and controls for genes l and m is defined by Peng et al⁷ as

which is asymptotically distributed as N(0, 1) under the null hypothesis that , and the cases and controls are independent. For estimating the transform variances and , we apply a robust resampling procedure, the trimmed jackknife¹⁶ (Appendix IV in Supplementary Materials).

Multiple testing

For applications involving exhaustive hypothesis testing of all pairwise gene comparisons in a given gene list, multiple testing and test statistic correlation become problematic issues. To address both of these directly, we apply Efronâs empirical null method¹⁷ for estimating the local false discovery rate (IFDR) for each hypothesis test conducted.

Simulation study

To assess the properties of our KCCA procedure for geneâgene interaction testing, we consider simulation studies that evaluate type-I error control and power. We generated a population of haplotypes for two genes using real genotype data from our case study. fastPHASE¹⁸ was applied to the genotypes from the controls to estimate haplotypes for two genes of comparable size (25.9 and 30.6âkb), followed by use of HapSim¹⁹ to simulate 10â000 haplotypes for each gene. The respective numbers of polymorphic sites for each gene were 79 and 92. Genotype data for a hypothetical individual were simulated by combining pairs of randomly selected haplotypes for each gene.

Let represent the derived minor allele frequency (MAF) from our simulated haplotype populations for marker j in gene i. Next, we randomly selected a fixed number of common () markers to be causal for each gene. We then used a similar approach to effect size definition used by Wu et al,¹² in which effects are a function of the MAF. Let for an interaction effect, such that . Here defines the maximum possible interaction effect. For example, for two markers with MAF=0.20 and Ï=5, the interaction effect is .

Given the difficulty in genome-wide detection of geneâgene interactions without the presence of marginal effects, we only considered disease models with solely epistatic effects. We defined the probability of being a case conditional on genotype via a logistic regression framework, such that

where Î©₁ and Î©₂ define the subsets of markers which are causal, and represents the disease prevalence. Sampling of cases and controls was then completed from a sufficiently large number of simulated genotype-phenotype pairs.

To comparatively evaluate the performance of our KCCA method, we included additional methods on the basis of similar analysis principles: the original CCA-based approach, PC-based logistic regression²⁰ (PC-LR), and a composite-LD method²¹ (CLD). PC-LR obtains principal components from SNP measurements for each gene, which are then fit in simple logistic regression. The CLD method is a covariance-based approach, which evaluates the difference between block interactions across case-control status. Additional details for each approach can be found in their respective publications. For PC-LR, we evaluate the significance of the interaction coefficient between the first principal component of each gene, and for the CLD approach we use 5000 permutations to characterize the reference distribution of the test statistic. All declarations of statistical significance are made at an Î±-level of 0.05. For both Type-I error and power simulations, we consider whether or not explicit marginal effects are included in the disease model. Each simulation scenario is conducted with case-controls status sample sizes of 500, 1000, and 1500, with a total of 1000 iterations each.

Ovarian cancer study

We applied the KCCA approach to detect geneâgene interaction within the NF-ÎºB pathway, using data from a case-control study of invasive epithelial ovarian cancer as part of the GAME-ON Follow-up Ovarian Cancer Genetic Association and Interaction Studies collaboration (described elsewhere^{22, 23}). Participants were enrolled in the Mayo Clinic Ovarian Cancer Study, North Carolina Ovarian Cancer Study, the Tampa Bay Ovarian Cancer Study, the Toronto Ovarian Cancer Study, the National Cancer Institute Ovarian Case-Control Study in Poland, the UK Ovarian Cancer Population Study, the Studies of Epidemiology and Risk Factors in Cancer Heredity Ovarian Cancer Study, the Familial Ovarian Cancer Registry Study, and the Royal Marsden Hospital Ovarian Cancer Study.^{24, 25} Study protocols were approved by the appropriate institutional review board or ethics panel, and all patients provided written informed consent. Genotypes were from the Illumina (San Diego, CA, USA) 610-Quad SNP arrays, with imputation to HapMap v 26 using MACH.²⁶ For 200 autosomal genes within the NF-ÎºB pathway, this resulted in â¼13â000 observed or imputed markers available on 3869 cases and 3276 controls of European descent. For genotyped markers, we coded genotypes as 0, 1, or 2 in terms of the number of observed minor alleles; for imputed markers, we used the expected genotype or âdosageâ. Marker assignment to genes was determined on the basis of NCBI-build 36 gene location data, using a 20-kb buffer region on both the 5â² and 3â² ends of the defined gene location. Information on location, size, and number of SNPs in each of the 200 autosomal genes can be found in Supplementary Table S1.

To address the effects of possible confounding variables, we adjusted the genotypes for age, study site, and the first five principal components from an eigen analysis.²⁷ Each unique gene pair was tested using the KCCA procedure, resulting in total hypothesis tests. For purposes of comparison, we also applied the CCA-based procedure defined by Peng et al⁷ to the data, using 1000 bootstraps for variance estimation. Comparisons involving gene pairs with overlapping regions were removed from our analysis to avoid complications involving shared marker data.

Results

Type-I error

For our simulations, we considered six levels of trimming within the jackknife procedure for SE estimation to determine which level was appropriate. A plot of the empirical Type-I error rate for all trim levels across each sample size are found in Figure 1. The KCCA results derived from Ï=0.15 yield near nominal Type-I error rate levels across all sample sizes. Detailed statistics on the empirical distributions of the test statistics can be found in Table 1. These results indicate that the test statistic follows the assumed standard normal distribution under the null.

Table 1 KCCA null simulation results for various trimming values Ï, which includes the P-value for the Kolmogorov-Smirnov test for normality (KS), the empirical SD and mean of the simulated test statistic distribution, as well as the realized Type-I error rate rejecting at an Î±-level of 0.05

Full size table

Power

For our power simulations, we set and randomly selected five markers from each gene to be causal at each iteration, performing all KCCA tests with Ï=0.15. Bar graphs of the results are found in Figure 2. From this plot, we observe that KCCA outperforms the other methods across all sample sizes, particularly the CCA approach. As the poor performance of CCA is likely due to issues with overfitting, we considered additional simulations where sample sizes were fixed at 1000 and the number of markers per gene was set to values of 10, 20, 30, 40, or 50, randomly subsetted from the total number of markers used in our simulations. Keeping all the other settings of our simulation design the same, the empirical power for each method across number of markers per gene is found in Table 2. Although the resulting power is much closer for low marker numbers, the KCCA approach is still consistently more powerful than the remaining methods.

Table 2 Simulation result for empirical power, using Î± =0.05 level significance testing, for the KCCA, CCA, PC-LR, and CLD testing procedures across varying numbers of markers per gene

Full size table

Application to ovarian cancer study

For the ovarian cancer study of the NF-ÎºB pathway, we identified 13 statistically significant (lFDR<0.05) gene pairs of interest (Table 3) in applying our KCCA method, using the trimmed jackknife SE estimate with Ï=0.15, with the top coassociated gene pair with case-control status occurring between CASP8 and MAP3K3. Application of the CCA-based procedure resulted in 37 significant gene pairs; however, none of these overlapped with those detected by the KCCA method.

Table 3 Detailed results of the significant KCCA geneâgene coassociations for analysis of ovarian cancer risk of significant (lFDRâ¤0.05) geneâgene interactions from FOCI analysis ranked by estimated lFDR

Full size table

To explore one of the top coassociation hits at the SNP level, we analyzed the CASP8âMAP3K3 interaction by generating pairwise marker Pearsonâs correlations. These are presented by case-control status in Figure 3 as gradient colorized images of the correlation coefficients. The figure demonstrates that although the Pearsonâs correlation coefficients themselves are small (|Ï|<0.10), there exist distinct differences between the two correlation structures. Of note is the discrepancy across case-control status with the correlation between marker rs12940055 in MAP3K3 (located at bp 59â075â874) and a large number of SNPs toward the 5â² end of CASP8, indicated by the horizontal band in the lower half of the correlation plots.

Discussion

The identification of interactions between genes and their impact on complex diseases is adding to our understanding the genetic component. Although SNP-based interaction analysis methods are relatively well developed, the large number of SNPs in association studies makes exhaustive pairwise SNP-SNP analyses increasingly infeasible. By addressing the problem at the gene level, the scope of the analyses is not only computationally tenable, but also reduced to a biologically interpretable unit of interest.²⁸ Peng et al⁷ presented a novel statistical method that allows for such analyses in their CCA statistic. However, as the number of genotyped markers increases, the application of CCA may be inappropriate, because the number of genotyped SNPs may approach or exceed the number of observations, especially for smaller experimental designs. This is of particular concern with post-GWAS whole exome and genome sequencing association studies, which will afford the characterization of additional variants unmeasured by current genotyping platforms.

In this article, we have presented a KCCA for geneâgene interaction analysis that not only addresses concerns of dimensionality, but also allows the flexibility to detect nonlinear correlations between genes. We have demonstrated that, using the appropriate SE estimation procedure, the KCCA test statistic exhibits near nominal levels of type-I error rate control and competitive power performance in our data simulations. We have also outlined basic procedures for large-scale application that take into account issues of regularization, computational burden, and multiple testing. As a result, the KCCA procedure is a powerful tool for exploratory geneâgene interaction analysis using SNP data.

It is important to note that although we have shown via our simulation results that the performance of the gene-level interaction analyses using KCCA is more powerful than other current methods under an interaction-only model, additional simulations have shown that the PC-LR approach performs best in the presence of marginal effects (Figure 4). Thus, if evidence suggests the existence of such effects, we recommend the use of the PC-based procedure instead of KCCA. Also, if we collapse the signal into a single interaction between two markers with no LD present, individual SNPâSNP interaction analysis via logistic regression easily bests the gene-level analyses in interaction detection. Thus, although we have demonstrated that our KCCA method performs well as a genome-wide exploratory methodology, alternative methods may perform better under specific circumstances.

Although our data application analysis was conducted on genes within a specific pathway, we envision genome-wide exploratory applications, such as GWAS, to be completed in a similar fashion. However, owing to combinatorial scaling, this will require special computational considerations such as parallelization. For example, exhaustive pairwise analysis of a 20â000-gene genome requires nearly 200 million unique tests, although this total pales in comparison with the number of possible SNPâSNP interactions.

Comparability with CCA statistic

Contrary to the findings of Peng et al,⁷ we have found that hypothesis testing using their CCA statistic with bootstrap variance estimation can be quite conservative, regardless of sample or feature size used, resulting in possible issues with reduced statistical power. This is evidenced in our power simulations where CCA power results for certain simulation conditions results in power below even nominal Type-I error rates. Upon investigation of this discrepancy, we found that the bootstrap variance estimates were quite large in comparison with their KCCA counterparts.

An additional benefit of our procedure is that the assumptions of CCA include multivariate normality of the observations, which is clearly violated by the discrete nature of genotype calls if no adjustments are made. Our method, however, involves low dimensional projections of kernelized observations, which has been shown to be approximately Gaussian.²⁹ Thus, our KCCA procedure is also more consistent with the distributional assumptions of CCA.

Ovarian cancer findings

The analysis of the FOCI data using the KCCA procedure yielded biologically interesting results, with many of the top gene pairs sharing some functional basis. CASP8 and MAP3K3 are integral members of the tumor necrosis factor pathway, and IL1A and IL1B both code for proinflammatory cytokines and have been jointly associated with lung cancer.³⁰ With 13 statistically significant findings, there are also several novel interactions that may warrant further investigation.

We argue the lack of congruency between the results of the CCA-based procedure and our kernelized version is due, in large part, to the lack of any dimensional reduction in the CCA. Although the sample sizes used in our ovarian application analyses are large enough to satisfy most conventional notions of appropriate observation-to-feature ratios for accurate canonical correlation estimation, our simulations indicate that the CCA method suffers greatly from overfitting when there are a relatively large number of genotyped markers in given genes.

One significant gene pair of concern in the NF-ÎºBâKCCA interaction analysis is IL1A-IL1B, which includes genes that are positionally adjacent to one another. Taking into account their respective buffer regions, only 4.4âkb separates the two genes. The significance of this gene pair could be evidence of instability of the method for gene pairs that are in high LD with each other because of locational proximity. However, gene pairs of this nature are also often closely linked functionally. Moreover, there are counter examples in our analysis of neighboring gene pairs that are statistically not significant (eg, LTBR-TNFRSF1A). Regardless, we recommend caution in the interpretation of such results.

Conclusions and future development

Our KCCA algorithm simultaneously supplies dimensionality reduction and nonlinear coassociation analysis for high-dimensional SNP data, providing a powerful framework for detecting statistical epistasis at the gene level. Moreover, this type of analysis can isolate gene pairs of interest for follow-up analysis without being burdened by the multiple testing corrections necessary for genome-wide SNPâSNP pairwise interaction analysis. This is particularly relevant for next-generation sequencing applications, which may interrogate all possible SNPs through whole genome sequencing.

Although we have argued that the KCCA procedure for detecting geneâgene interaction possesses many advantages over the previously proposed CCA statistic, there is also room for improvement and generalizability of our approach. The use of the Gaussian kernel function is a robust selection; however, other kernel functions may be more appropriate for the specific data type, particularly if there are no adjustments for covariate data.³¹ The procedure itself may also be modified in a variety of ways, including the use of sparse canonical correlation^{32, 33} and multigene interaction analysis with generalized canonical correlation,³⁴ and further exploring the resampling procedures used in the SE estimation. Finally, a less computationally intensive alternative to KCCA may be a kernelized variant of principal correlation,³⁵ which could be considered for more demanding analyses such as genome-wide interrogation.

References

Manolio TA, Collins FS, Cox NJ et al: Finding the missing heritability of complex diseases. Nature 2009; 461: 747â753.
ArticleÂ CASÂ Google ScholarÂ
Moore JH : The ubiquitous nature of epistasis in determining susceptibility to common human diseases. Hum Hered 2003; 56: 73â82.
ArticleÂ Google ScholarÂ
Thomas DC, Haile RW, Duggan D : Recent developments in genomewide association scans: a workshop summary and review. Am J Hum Genet 2005; 77: 337â345.
ArticleÂ CASÂ Google ScholarÂ
Wang WYS, Barratt BJ, Clayton DG, Todd JA : Genome-wide association studies: theoretical and practical concerns. Nat Rev Genet 2005; 6: 109â118.
ArticleÂ CASÂ Google ScholarÂ
Thornton-Wells TA, Moore JH, Haines JL : Genetics, statistics and human disease: analytical retooling for complexity. Trends Genet 2004; 20: 640â647.
ArticleÂ CASÂ Google ScholarÂ
Zhao J, Jin L, Xiong M : Test for interaction between two unlinked loci. Am J Hum Genet 2006; 79: 831â845.
ArticleÂ CASÂ Google ScholarÂ
Peng Q, Zhao J, Xue F : A gene-based method for detecting gene-gene co-association in a case-control association study. Eur J Hum Genet 2010; 18: 582â587.
ArticleÂ CASÂ Google ScholarÂ
Hotelling H : Relations between two sets of variates. Biometrika 1936; 28: 321â377.
ArticleÂ Google ScholarÂ
SchÃ¶lkopf B, Smola A : Learning with kernels : support vector machines, regularization, optimization, and beyond. The MIT Press, 2002.
Google ScholarÂ
Scholkopf B, Smola A, Muller KR : Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput 1998; 10: 1299â1319.
ArticleÂ Google ScholarÂ
Schaid DJ : Genomic similarity and kernel methods I: advancements by building on mathematical and statistical foundations. Hum Hered 2010; 70: 109â131.
ArticleÂ Google ScholarÂ
Wu MC, Lee S, Cai TX, Li Y, Boehnke M, Lin XH : Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet 2011; 89: 82â93.
ArticleÂ CASÂ Google ScholarÂ
Wu MC, Kraft P, Epstein MP et al: Powerful SNP-set analysis for case-control genome-wide association studies. Am J Hum Genet 2010; 86: 929â942.
ArticleÂ CASÂ Google ScholarÂ
Li S, Cui Y : Gene-centric gene-gene interaction: a model-based kernel machine method. Ann Appl Stat 2012; 6: 1134â1161.
ArticleÂ Google ScholarÂ
Lawley DN : Tests of significance in canonical analysis. Biometrika 1959; 46: 59â66.
ArticleÂ Google ScholarÂ
Hinkley D, Wang H-L : A trimmed jackknife. J Roy Stat Soc Ser B 1980; 42: 347â356.
Google ScholarÂ
Efron B : Large-scale simultaneous hypothesis testing: the choice of a null hypothesis. J Am Stat Assoc 2004; 99: 96â104.
ArticleÂ Google ScholarÂ
Scheet P, Stephens M : A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet 2006; 78: 629â644.
ArticleÂ CASÂ Google ScholarÂ
Montana G : HapSim: a simulation tool for generating haplotype data with pre-specified allele frequencies and LD coefficients. Bioinformatics 2005; 21: 4309â4311.
ArticleÂ CASÂ Google ScholarÂ
Zhang F, Wagener D : An approach to incorporate linkage disequilibrium structure into genomic association analysis. J Genet Genomics 2008; 35: 381â385.
ArticleÂ Google ScholarÂ
Rajapakse I, Perlman MD, Martin PJ, Hansen JA, Kooperberg C : Multivariate detection of gene-gene interactions. Genet Epidemiol 2012; 36: 622â630.
ArticleÂ Google ScholarÂ
White KL, Schildkraut JM, Palmieri RT et al: Ovarian cancer risk associated with inherited inflammation-related variants. Cancer Res 2012; 72: 1064â1069.
ArticleÂ CASÂ Google ScholarÂ
Fridley BL, Jenkins GD, Tsai YY et al: Gene set analysis of survival following ovarian cancer implicates macrolide binding and intracellular signaling genes. Cancer Epidemiol Biomarkers Prev 2012; 21: 529â536.
ArticleÂ CASÂ Google ScholarÂ
Permuth-Wey J, Chen YA, Tsai YY et al: Inherited variants in mitochondrial biogenesis genes may influence epithelial ovarian cancer risk. Cancer Epidemiol Biomarkers Prev 2011; 20: 1131â1145.
ArticleÂ CASÂ Google ScholarÂ
Permuth-Wey J, Kim D, Tsai YY et al: LIN28B polymorphisms influence susceptibility to epithelial ovarian cancer. Cancer Res 2011; 71: 3896â3903.
ArticleÂ CASÂ Google ScholarÂ
Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR : MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet Epidemiol 2010; 34: 816â834.
ArticleÂ Google ScholarÂ
Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D : Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 2006; 38: 904â909.
ArticleÂ CASÂ Google ScholarÂ
Neale BM, Sham PC : The future of association studies: gene-based analysis and replication. Am J Hum Genet 2004; 75: 353â362.
ArticleÂ CASÂ Google ScholarÂ
Diaconis P, Freedman D : Asymptotics of graphical projection pursuit. Ann Stat 1984; 12: 793â815.
ArticleÂ Google ScholarÂ
Engels EA, Wu X, Gu J, Dong Q, Liu J, Spitz MR : Systematic evaluation of genetic variants in the inflammation pathway and risk of lung cancer. Cancer Res 2007; 67: 6520â6527.
ArticleÂ CASÂ Google ScholarÂ
Schaid DJ, McDonnell SK, Hebbring SJ, Cunningham JM, Thibodeau SN : Nonparametric tests of association of multiple genes with human disease. Am J Hum Genet 2005; 76: 780â793.
ArticleÂ CASÂ Google ScholarÂ
Witten DM, Tibshirani RJ : Extensions of sparse canonical correlation analysis with applications to genomic data. Stat Appl Genet Mol Biol 2009; 8: Article28.
ArticleÂ Google ScholarÂ
Waaijenborg S, Zwinderman AH : Sparse canonical correlation analysis for identifying, connecting and completing gene-expression networks. BMC Bioinform 2009; 10: 315.
ArticleÂ Google ScholarÂ
Yamanishi Y, Vert JP, Nakaya A, Kanehisa M : Extraction of correlated gene clusters from multiple genomic data by generalized kernel canonical correlation analysis. Bioinformatics 2003; 19 ((Suppl 1):): i323âi330.
ArticleÂ Google ScholarÂ
Yamamoto M, Sugiyama T, Murakmi H, Sakaori F : Correlation analysis of principal components from two populations. Comput Stat Data Anal 2007; 51: 4707â4716.
ArticleÂ Google ScholarÂ

Download references

Acknowledgements

This study was funded by the Minnesota Partnership for Biotechnology and Medical Genomics; the Mayo Foundation; the Fred C and Katherine B Andersen Foundation, the Cambridge Biomedical Research Centre; Cancer Research UK (C490/A10119); the Celma Mastry Ovarian Cancer Foundation; and the US National Institutes of Health (CA168524, CA106414, CA136393, CA122443, CA114343, CA140879, and GM86689). The scientific development and funding for this project were partly supported by Genetic Associations and Mechanisms in Oncology (GAME-ON), a NCI Cancer Post-GWAS Initiative (U19-CA148112).

Author information

Authors and Affiliations

Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
Nicholas B Larson,Â Gregory D Jenkins,Â Melissa C Larson,Â Robert A Vierkant,Â Ellen L GoodeÂ &Â Brooke L Fridley
Cancer Epidemiology, Moffitt Cancer Center, Tampa, FL, USA
Thomas A SellersÂ &Â Catherine M Phelan
Duke Comprehensive Cancer Center, Duke University, Durham, NC, USA
Joellen M Schildkraut
Department of Pediatrics, Universty of South Florida College of Medicine, Tampa, FL, USA
Rebecca Sutphen
Department of Oncology, University of Cambridge, Cambridge, UK
Paul P D Pharoah
Department of Preventative Medicine, University of Southern California, Los Angeles, CA, USA
Simon A Gayther
Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD, USA
Nicolas Wentzensen
Department of Biostatistics, University of Kansas Medical Center, Kansas City, KS, USA
Brooke L Fridley

Authors

Nicholas B Larson
View author publications
You can also search for this author in PubMedÂ Google Scholar
Gregory D Jenkins
View author publications
You can also search for this author in PubMedÂ Google Scholar
Melissa C Larson
View author publications
You can also search for this author in PubMedÂ Google Scholar
Robert A Vierkant
View author publications
You can also search for this author in PubMedÂ Google Scholar
Thomas A Sellers
View author publications
You can also search for this author in PubMedÂ Google Scholar
Catherine M Phelan
View author publications
You can also search for this author in PubMedÂ Google Scholar
Joellen M Schildkraut
View author publications
You can also search for this author in PubMedÂ Google Scholar
Rebecca Sutphen
View author publications
You can also search for this author in PubMedÂ Google Scholar
Paul P D Pharoah
View author publications
You can also search for this author in PubMedÂ Google Scholar
Simon A Gayther
View author publications
You can also search for this author in PubMedÂ Google Scholar
Nicolas Wentzensen
View author publications
You can also search for this author in PubMedÂ Google Scholar
Ellen L Goode
View author publications
You can also search for this author in PubMedÂ Google Scholar
Brooke L Fridley
View author publications
You can also search for this author in PubMedÂ Google Scholar

Consortia

Ovarian Cancer Association Consortium

Corresponding author

Correspondence to Brooke L Fridley.

Ethics declarations

Competing interests

The authors declare no conflict of interest.

Additional information

Supplementary Information accompanies this paper on European Journal of Human Genetics website

Supplementary information

Supplementary Table S1 (DOC 311 kb)

Rights and permissions

This work is licensed under a Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0/

Reprints and permissions

About this article

Cite this article

Larson, N., Jenkins, G., Larson, M. et al. Kernel canonical correlation analysis for assessing geneâgene interactions and application to ovarian cancer. Eur J Hum Genet 22, 126â131 (2014). https://doi.org/10.1038/ejhg.2013.69

Download citation

Received: 21 August 2012
Revised: 11 January 2013
Accepted: 16 January 2013
Published: 17 April 2013
Issue Date: January 2014
DOI: https://doi.org/10.1038/ejhg.2013.69

Keywords

This article is cited by

Genetic interactions effects for cancer disease identification using computational models: a review
- R. Manavalan
- S. Priya
Medical & Biological Engineering & Computing (2021)
Multi-group analysis using generalized additive kernel canonical correlation analysis
- Eunseong Bae
- Ji-Won Hur
- Chae Young Lim
Scientific Reports (2020)
Eigen-Epistasis for detecting gene-gene interactions
- Virginie Stanislas
- Cyril Dalmasso
- Christophe Ambroise
BMC Bioinformatics (2017)
A label embedding kernel method for multi-view canonical correlation analysis
- Shuzhi Su
- Hongwei Ge
- Yun-Hao Yuan
Multimedia Tools and Applications (2017)
A gene-based information gain method for detecting geneâgene interactions in caseâcontrol studies
- Jin Li
- Dongli Huang
- Limei Wang
European Journal of Human Genetics (2015)

Subjects

Abstract

Similar content being viewed by others

Introduction

Materials and methods

Data definition

Hypothesis testing

Multiple testing

Simulation study

Ovarian cancer study

Results

Type-I error

Power

Application to ovarian cancer study

Discussion

Comparability with CCA statistic

Ovarian cancer findings

Conclusions and future development

References

Acknowledgements

Author information

Authors and Affiliations

Consortia

Ovarian Cancer Association Consortium

Corresponding author

Ethics declarations

Competing interests

Additional information

Supplementary information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

This article is cited by

Search

Quick links