Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Skip to main content

epiACO - a method for identifying epistasis based on ant Colony optimization algorithm

Abstract

Background

Identifying epistasis or epistatic interactions, which refer to nonlinear interaction effects of single nucleotide polymorphisms (SNPs), is essential to understand disease susceptibility and to detect genetic architectures underlying complex diseases. Though many works have been done for identifying epistatic interactions, due to their methodological and computational challenges, the algorithmic development is still ongoing.

Results

In this study, a method epiACO is proposed to identify epistatic interactions, which based on ant colony optimization algorithm. Highlights of epiACO are the introduced fitness function Svalue, path selection strategies, and a memory based strategy. The Svalue leverages the advantages of both mutual information and Bayesian network to effectively and efficiently measure associations between SNP combinations and the phenotype. Two path selection strategies, i.e., probabilistic path selection strategy and stochastic path selection strategy, are provided to adaptively guide ant behaviors of exploration and exploitation. The memory based strategy is designed to retain candidate solutions found in the previous iterations, and compare them to solutions of the current iteration to generate new candidate solutions, yielding a more accurate way for identifying epistasis.

Conclusions

Experiments of epiACO and its comparison with other recent methods epiMODE, TEAM, BOOST, SNPRuler, AntEpiSeeker, AntMiner, MACOED, and IACO are performed on both simulation data sets and a real data set of age-related macular degeneration. Results show that epiACO is promising in identifying epistasis and might be an alternative to existing methods.

Peer Review reports

Background

It has been widely accepted that genome-wide association studies (GWAS) play a great role on understanding disease susceptibility and genetic architectures underlying complex diseases, and lots of single nucleotide polymorphisms (SNPs) speculated to associate with diseases have been identified. Nevertheless, most of them seem to be responsible for the causation of Mendelian diseases [1], and they have poor ability in explaining non-Mendelian diseases, i.e., complex diseases, such as cancer, heart disease, cardiopathy, Alzheimer’s disease, hypertension and many others [1, 2]. Recent advances in GWAS confirm that nonlinear interaction effects of SNPs, namely, epistasis or epistatic interactions, could unveil a large portion of unexplained heritability of complex diseases [2]. Therefore, many algorithms have been proposed for identifying epistasis. However, due to their methodological and computational challenges, the algorithmic development is still ongoing [1, 3, 4].

Current epistasis detection methods can be mainly classified into three categories based on their search strategies: exhaustive search [5,6,7], stochastic search [8,9,10], and heuristic search [2, 11,12,13,14,15]. Exhaustive search evaluates the associations of all SNP combinations with the phenotype. Wan et al. [5] developed BOOST (BOolean Operation-based Screening and Testing) which has two stages to analyze all pairwise epistatic interactions in genome-wide case-control studies. Zhang et al. [6] proposed TEAM (Tree-based Epistasis Association Mapping) which updates contingency tables by utilizing the structure of a minimum spanning tree to search two-SNP epistatic interactions. Ritchie et al. [7] used MDR (Multifactor Dimensionality Reduction) to identify epistasis, which divides genotypes into low-risk and high-risk groups to reduce search space. As far as we know, MDR is one of the most popular methods in this field. Though these methods show great performance on identifying epistatic interactions in small scale data sets, they have poor ability on large scale data sets due to their exponential time complexities [1, 3]. Performance of stochastic search methods depend on their sampling numbers, and therefore they are not suitable for genome-wide scale data sets [13]. Tang et al. [8] used epiMODE (epistatic MOdule DEtection) to detect multiple-SNP epistatic interactions by introducing the concept of epistatic module and designing a Gibbs sampling strategy. Jiang et al. [9] proposed epiForest (detection of epistatic interactions using random Forest) which uses random forest to detect epistatic interactions in case-control studies. Zhang et al. [10] developed BEAM (Bayesian Epistasis Association Mapping) to identify epistasis which categorizes SNPs into three non-overlapping groups based on their posterior probabilities. Heuristic search methods generally obtain solutions at substantially reduced time costs, based on their introducing heuristic information and prior knowledge of biological data. For instance, Wan et al. [11] proposed SNPRuler to identify epistatic interactions using predictive rules. Though heuristic search methods sometimes miss global optimal solutions and only obtain several local optimal solutions, they are still promising methods so far.

More recently, many ant colony optimization (ACO) based methods, which belong to the family of heuristic search methods, have been reported for identifying epistatic interactions. Wang et al. [12, 15] proposed AntEpiSeeker and AntEpiSeeker2.0 based on ACO with the two-stage design to detect epistatic interactions. Both of them perform well in their respective experiments though they require large amounts of ants over numerous iterations to obtain acceptable solutions [15]. Shang et al. [2] developed AntMiner to detect epistatic interactions, which is a generalized method of AntEpiSeeker by incorporating heuristic information into ant-decision rule. Although in terms of detection power, it wins most of compared methods, AntMiner needs unaffordable running time to obtain results, which hinders its widely application. Jing et al. [13] proposed MACOED for detecting epistatic interactions, which is a multi-objective ACO supervised heuristic search method, combining both logistical regression and Bayesian network. Though experiments show that MACOED outperforms other compared methods in both detection power and computational feasibility for large data sets, its pheromone updating strategy is not as effective as it claims to be. Our previously proposed method IACO [16] is an alternative to existing ACO based methods, but it is sensitive to SNPs displaying strong marginal effects.

In this paper, we develop a method epiACO to identify epistatic interactions, which based on ACO algorithm. Highlights of epiACO are the introduced fitness function Svalue, path selection strategies, and a memory based strategy. The Svalue leverages the advantages of both mutual information and Bayesian network to effectively and efficiently measure associations between SNP combinations and the phenotype. Two path selection strategies, i.e., probabilistic path selection strategy and stochastic path selection strategy, are provided to adaptively guide ant behaviors of exploration and exploitation. The memory based strategy is designed to retain candidate solutions found in the previous iterations, and compare them to solutions of the current iteration to generate new candidate solutions, yielding a more accurate way for identifying epistasis. Experiments of epiACO and its comparison with other recent methods epiMODE, TEAM, BOOST, SNPRuler, AntEpiSeeker, AntMiner, MACOED, and IACO are performed on both simulation data sets and an age-related macular degeneration real data set. Results show that epiACO is promising in identifying epistatic interactions and might be an alternative to existing methods. The Matlab version of epiACO running on Microsoft Windows is available online at: https://sourceforge.net/projects/epiaco1/files/epiACO.rar/download.

Methods

General ant Colony optimization (ACO)

The general ACO takes inspiration from the foraging behavior of some ant species. Ants walking to and from a food source communicate with each other indirectly by secreting pheromones on the ground, which will gradually evaporate as time passes. The subsequent ants perceive the presence of pheromones and tend to follow the paths with higher pheromone concentration [17]. Such a mechanism forms a positive feedback and eventually most of the ants, if not all, are able to transport food to their nest in the shortest path.

Mathematically, each ant choose its next position through a probability density function (PDF), which is updated by pheromones and heuristic information. The PDF of the general ACO is defined as

$$ {P}_k^{ij}(t)=\left\{\begin{array}{c}\hfill \frac{\tau_{ij}{(t)}^{\alpha}{\eta_{ij}}^{\beta}}{\sum_{u\in {U}_k(t)}{\tau}_{iu}{(t)}^{\alpha}{\eta_{iu}}^{\beta}}\kern3em i\in {U}_k(t)\hfill \\ {}\hfill 0\kern9.5em otherwise\hfill \end{array}\right. $$
(1)

where \( {P}_k^{ij}(t) \) is the probability of ant k that chooses the next position j from the current position i at iteration t. τ ij (t) is the pheromones of the path from position i to position j at iteration t. The heuristic information of the path from position i to position j is denoted as η ij . α and β are parameters that control the importances of pheromones and heuristic information respectively. The positions that are not selected so far by ant k at iteration t are stored in U k (t).

In ACO, ants find the optimal solutions through continuous updating pheromones. Pheromones of the path from position i to position j at iteration t+1 are updated according to the formula

$$ {\tau}_{ij}\left( t+1\right)=\left(1-\rho \right){\tau}_{ij}(t)+\varDelta {\tau}_{ij}(t) $$
(2)

where ρ is a evaporation coefficient, and Δτ ij (t) is the increment of pheromones of the path from position i to position j at iteration t, which is defined as

$$ \varDelta {\tau}_{ij}(t)=\sum_{k=1}^m\varDelta {\tau}_{ij}^k(t) $$
(3)
$$ \varDelta {\tau}_{ij}^k(t)=\left\{\begin{array}{c}\hfill \frac{Q}{S_k(t)}\kern1.5em if\ ant\ k\ via\ t he\ path\ of\ ij\ at\ iteration\ t\hfill \\ {}\hfill 0\kern15.25em otherwise\hfill \end{array}\right. $$
(4)

where m is the number of ants. The increment of pheromones of the path from position i to position j for ant k at iteration t is denoted as \( \Delta {\tau}_{ij}^k(t) \). The path length for ant k at iteration t is denoted as S k (t). Q is a user-specified positive constant.

Fitness function Svalue

Mutual information

In probability theory and information theory, the mutual information of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the amount of information obtained about one random variable, through the other random variable. The mutual information is usually used as an association measure in feature selection problems [2, 18, 19]. In epiACO, mutual information is used to measure the associations between SNP combinations and the phenotype, which can be written as

$$ MI\left( S; Y\right)= H(S)+ H(Y)- H\left( S, Y\right) $$
(5)

where H(S) is the entropy of S, H(Y) is the entropy of Y, H(S,Y) is the joint entropy of both S and Y. Here, S represents a SNP combination, and Y represents the phenotype.

The entropy and the joint entropy are defined as

$$ H(S)=-\sum_{j_1=1}^3\cdots \sum_{j_K=1}^3\left( p\left({s}_{j_1},,\cdots, {s}_{j_K}\right)\cdot \log\ p\left({s}_{j_1},,\cdots, {s}_{j_K}\right)\right) $$
(6)
$$ H(Y)=-\sum_{j=0}^1\left( p\left({y}_j\right)\cdot \log\ p\left({y}_j\right)\right) $$
(7)
$$ H\left( S, Y\right)=-\sum_{j_1=1}^3\cdots \sum_{j_K=1}^3\sum_{j=0}^1\left( p\left({s}_{j_1},,\cdots, {s}_{j_K},{y}_j\right)\cdot \log\ p\left({s}_{j_1},,\cdots, {s}_{j_K},{y}_j\right)\right) $$
(8)

where s is the genotype of a SNP, coded as {1, 2, 3} corresponding to homozygous common genotype, heterozygous genotype, and homozygous minor genotype; y is the label of a sample, coded as {0, 1} corresponding to control and case; and p() is the PDF.

It is seen that a greater mutual information value means a stronger association between the SNP combination and the phenotype.

Bayesian network

The Bayesian network is a probabilistic graphical model that represents a set of random variables and their conditional dependencies via a directed acyclic graph (DAG) [20]. In general, the Bayes theorem can be written as

$$ P\left( N| D\right)=\frac{P\left( D| N\right) P(N)}{P(D)} $$
(9)

where P(N|D) is the posterior probability of the Bayesian network model N given the data D, P(D|N) is the class-conditional density, P(D) is the probability of D, and P(N) is the prior probability of N.

In the DAG, the joint probability distribution of n nodes, e.g., x 1 , x 2 ,    , x n , is defined as

$$ p\left({x}_1,{x}_2,,\cdots, {x}_n\right)=\prod_{i=1}^n p\left({x}_i| parent\left({x}_i\right)\right) $$
(10)

where parent(x i ) represents the parent node of x i . In this study, SNPs and the phenotype are denoted as nodes of DAG respectively [21,22,23,24]. The directed edges going from SNP nodes to the phenotype node implies that this SNP combination shows correlation with the phenotype [25]. According to Eq. (9), P(D|N) can be computed while all variables in the DAG are discrete values,

$$ P\left( D| N\right)=\prod_{i=1}^I\left(\frac{\Gamma \left(\sum_{j=1}^J{\alpha}_{i j}\right)}{\Gamma \left({r}_i+\sum_{j=1}^J{\alpha}_{i j}\right)}\prod_{j=1}^J\frac{\Gamma \left({r}_{i j}+{\alpha}_{i j}\right)}{\Gamma \left({\alpha}_{i j}\right)}\right) $$
(11)

where I is the combinatorial number of SNP nodes with different values. Specifically, if l-SNP nodes link to the phenotype node, the combinatorial number of SNP nodes is 3l since one SNP has three genotypes. J is the state number of the phenotype node, which is equal to 2. r i is the number of cases with SNP nodes taking the i th combination. r ij is the number of cases with the phenotype node taking the j th state and its parents taking the i th combination. α ij is a parameter that refers to the prior belief about the number of cases while the nodes taking the corresponding i th combination and j th state [13]. More often, in order to take an equal likelihood for all possible distributions in each Bayesian network model, P(N), P(D) are usually set to constants and α ij  = 1. Then we obtain the following formula,

$$ P\left( N| D\right)\propto \prod_{i=1}^I\left(\frac{\left( J-1\right)!}{\left({r}_i+ J-1\right)!}\prod_{j=1}^J{r}_{i j}!\right) $$
(12)

On the basis of previous studies [22, 26, 27], K2 score is introduced and defined as

$$ \mathrm{K}2\kern0.4em \mathrm{score}=\prod_{i=1}^I\left(\frac{\left( J-1\right)!}{\left({r}_i+ J-1\right)!}\prod_{j=1}^J{r}_{i j}!\right) $$
(13)

Besides, logarithmic form of the K2 score can be written as

$$ \mathrm{K}2{\ \mathrm{score}}_{\log }=\sum_{i=1}^I\left(\sum_{b=1}^{r_i+1} \log (b)\hbox{-} \sum_{j=1}^J\sum_{d=1}^{r_{i j}} \log (d)\right) $$
(14)

It is seen that a lower K2 score implies a greater correlation between the SNP combination and the phenotype.

Svalue

In epiACO, the fitness function plays an important role on deciding which SNP combinations are selected as the optimal solutions. Consequently, a novel fitness function Svalue is introduced, which combines both mutual information and Bayesian Network. The mutual information can effectively measure the nonlinear relationships between SNP combinations and the phenotype without a complex modeling [28]. The Bayesian network is widely used as a promising measure for measuring dependence of several variables [27]. In order to leverage the advantages of these two measures, Svalue is defined as

$$ Svalue(A)=\frac{MI}{{\mathrm{K}2\ \mathrm{score}}_{\log }} $$
(15)

It is seen that the higher the Svalue score, the stronger the association between the SNP combination and the phenotype.

ACO based method epiACO for identifying epistasis

The pseudo code of epiACO is given in Fig. 1, which is mainly consist three strategies, namely, the path selection strategy, the pheromone updating strategy, and the memory based strategy.

Fig. 1
figure 1

The pseudo code of epiACO

Path selection strategy

In epiACO, m ants are represented by {m 1,   , m k ,   , m m } respectively, each of which is used to evaluate a K-SNP combination at each iteration, where K is a user-specified order. The probability of ant k selecting SNP i at iteration t is denoted as \( {P}_k^i(t) \), which is defined as

$$ {P}_k^i(t)=\left\{\begin{array}{c}\hfill R\kern2em q\le {q}_0\hfill \\ {}\hfill S\kern2em q>{q}_0\hfill \end{array}\right. $$
(16)

In order to control the rate of convergence, avoid falling into local optimal solution, two path selection strategies, that is, probabilistic path selection strategy and stochastic path selection strategy, are provided to adaptively guide ant behaviors of exploration and exploitation. q 0 is a threshold, which defined as the ratio of current iteration number to the total iteration number. q is a number that randomly generated from the uniform distribution of [0, 1] while Eq. (16) is employed.

The probabilistic path selection strategy is defined as

$$ R=\left\{\begin{array}{c}\hfill \frac{\tau_i{(t)}^{\alpha}{\eta_i}^{\beta}}{\sum_{u\in {U}_k(t)}{\tau}_u{(t)}^{\alpha}{\eta_u}^{\beta}}\kern3em i\in {U}_k(t)\hfill \\ {}\hfill 0\kern9.5em otherwise\hfill \end{array}\right. $$
(17)

where τ i (t) is the amount of pheromones for SNP i at iteration t and η i is the heuristic information of SNP i. U k (t) is a set of SNPs that are not selected by ant k at iteration t.

The stochastic path selection strategy is defined as

$$ S=\left\{\begin{array}{l}\hfill 1\kern2.25em i= rand\left({V}_k(t)\right)\ \hfill \\ {}\hfill 0\kern4.25em otherwise\kern2.75em \hfill \end{array}\right. $$
(18)

where all SNPs at iteration t is sorted with the descending pheromones, and the latter half is represented as V k (t).

This strategy allows epiACO to cover a wider search space while the iteration number is small and to converge on promising regions of the search space while the iteration number turns to large.

Pheromone updating strategy

In epiACO, the pheromones of all SNPs are updated according to the following formula

$$ {\tau}_i\left( t+1\right)=\left(1-\rho \right){\tau}_i(t)+\varDelta {\tau}_i(t)+\Delta {\tau}_i^{\ast }(t) $$
(19)

where Δτ i (t) is the increment of pheromones of SNP i at iteration t. Besides, additional pheromone increment Δτ i (t) is adopted to reward the SNPs that belong to the candidate solutions.

The Δτ i (t) and the Δτ i (t) are respectively defined as

$$ \Delta {\tau}_i(t)=\sum_{k=1}^m\Delta {\tau}_i^k(t) $$
(20)
$$ \varDelta {\tau_i}^k(t)=\left\{\begin{array}{c}\hfill Svalue(A)\kern2em k\in {M}_i(t)\hfill \\ {}\hfill 0\kern5.25em otherwise\hfill \end{array}\right. $$
(21)
$$ \Delta {\tau}_i^{\ast }(t)=\sum_{k=1}^m\Delta {\tau_i^k}^{\ast }(t) $$
(22)
$$ \varDelta {{\tau_i}^k}^{\ast }(t)=\left\{\begin{array}{c}\hfill \xi Svalue(A)\kern2em k\in {N}_i(t)\hfill \\ {}\hfill 0\kern5.25em otherwise\hfill \end{array}\right. $$
(23)

where M i (t) is the set of ants that select SNP i at iteration t. A is the SNP combination that has been selected. Svalue(A) is the Svalue of SNP combination A. N i (t) is the set of ants that select SNP i which belongs to candidate solutions. ξ is a specified parameter that used to control the additional increment of pheromones, and the recommended range is [0.2, 0.5].

Memory based strategy

In epiACO, SNP combinations that selected by ants at each iteration are evaluated by the fitness function Svalue. In order to retain candidate solutions with high Svalue scores found in the previous iterations, instead of traditional ACO completely discarding the solutions from previous iterations due to each iteration is independent of each other, a memory based strategy is introduced. First, all SNP combinations selected at iteration t are ranked with the descending Svalue scores. Second, an inflection point of these descending Svalue scores is computed using the formula

$$ f= arc\underset{g=3}{\overset{m}{ \max }}\left(\left( Svalue\left({A}_g\right)- Svalue\left({A}_{g-1}\right)\right)-\left( Svalue\left({A}_{g-1}\right)- Svalue\left({A}_{g-2}\right)\right)\right) $$
(24)

where A g is the SNP combination and Svalue(A g ) is the Svalue score of the SNP combination A g . Third, the SNP combinations with top f Svalue scores are viewed as the candidate solutions. Fourth, candidate solutions of previous iterations are compared to solutions of the current iteration to generate new candidate solutions, which will continue to be used in the next iteration. The merit of this strategy is that good solutions generated in any of the iterations will not be lost, yielding a more accurate way for epistasis searching.

Results and discussion

Evaluation measures

Detection power is one of the generally accepted and widely used evaluation measure in the field of identifying epistasis [2, 8, 10, 11, 29,30,31]. In this study, we directly use the detection power proposed by previous studies [2, 8, 10, 11, 29,30,31], which is defined as the proportion of data sets in which the epistasis models are perfectly identified with no false positives. That is,

$$ \mathrm{Power}=\frac{R}{G} $$
(25)

where R is the number of data sets that epistasis models in them are successfully detected, G is the number of all experimental data sets.

Computational complexity is also analyzed. We measure running time in the same computational environment to assess realistic applicability of compared methods (Intel G640 2.80GHz CPUs, 32GB RAM, Microsoft Windows, MATLAB).

Simulation and real data sets

We exemplify 5 commonly used benchmark models of 2-SNP epistatic interactions for the study [8, 10, 31,32,33,34]. Details of the models are given in Table 1. Specifically, Model 1 is the model that displays both marginal effects and interaction effect (ME model), the penetrance of Model 1 increases only when both SNPs have at least one minor allele [8, 10]; Model 2 is also the ME model that display both marginal effects and interaction effect, the additional minor allele at each SNP of which does not further increase the penetrance [10]; Model 3 is randomly chosen from references [33, 34] that show no marginal effects but interaction effect (NME model); Model 4 is directly cited from the reference [33], which is also a NME model; Model 5 is a ZZ NME model [32].

Table 1 Details of 5 commonly used benchmark models of 2-SNP epistatic interactions

For each model, 100 data sets are simulated by epiSIM [35], each containing 2000 cases and 2000 controls. In the first 50 data sets, 100 SNPs are genotyped, while in other 50 data sets the number of genotyped SNPs is increased to 1000. For each data set, random SNPs are set with their minor allele frequencies chosen from [0.05, 0.5] uniformly.

A real data set of age-related macular degeneration (AMD) is also used for testing epiACO. AMD, refers to pathological changes in the central area of the retina, is the most important cause of irreversible visual loss in elderly populations, and is considered as a complex disease having multiple epistatic interactions [8]. The AMD data set contains 103,611 SNPs genotyped by 96 cases and 50 controls, which has been widely used as a benchmark data set [2, 8, 9, 12, 13, 15, 16, 29, 36, 37].

Experiments on simulation data sets

In the study, performance of epiACO is analyzed by comparison with 8 typical 2-SNP epistasis detection methods, i.e., epiMODE, TEAM, BOOST, SNPRuler, AntEpiSeeker, AntMiner, MACOED, and IACO. Among them, BOOST and TEAM are exhaustive search methods, epiMODE is a stochastic search method, others are heuristic search methods. In particular, AntEpiSeeker, AntMiner, MACOED, and IACO are all ACO based methods, which will be discussed in more detail than others since they belong to the same family, as well as epiACO.

The first experiment is performed on 100-SNP data sets. The parameters of each compared method are generally set as default. Only a few are modified according to suggestions in their respective user manual to balance result accuracy and computational cost. For epiMODE, the iteration number is set to 100. For TEAM, the permutation number is set to 100. For BOOST, the iteration threshold is set to 10. For a fair comparison, parameter settings of the ACO based methods are same. Specifically, the iteration number m and the ant number T are set to 25 and 200 respectively. The initial pheromone τ 0, the heuristic information η, and the parameters α and β are all set to 1. The evaporation coefficient ρ is set to 0.2. The constant ξ is set to 0.3. For obtaining accurate results, each method runs 20 times with different random seeds on each data set of each model, which can ensure that the method has not been biased by its initial starting conditions. The average detection power and the average running time with their respective standard deviations on 100-SNP data sets are recorded in Table 2. It is seen that epiACO is comparable and sometimes superior to compared methods, especially ACO based methods.

Table 2 The average detection power and the average running time with their respective standard deviations of compared methods on 100-SNP data sets

For ACO based compared methods, epiACO outperforms AntEpiSeeker and IACO in terms of detection power on all models while they have the similar running time. epiACO has higher detection power than those of AntMiner and MACOED on ME models, but lower detection power than those of them on NME models. Since ME models are the primary models of epistatic interactions in real genetic architectures underlying complex diseases and NME models are special cases, epiACO is more suitable for real applications than other two methods. AntMiner has higher detection power than that of epiACO on NME models due to heuristic information of SNPs being incorporated into its ant-decision rules, which on one hand improves detection power, and on the other hand significantly increases running time, hindering its widely application on large scale data sets, like those for GWAS. For MACOED, it is in fact the stochastic search method since its pheromone updating strategy is only used for more frequently detecting the epistatic interactions that have been detected in the previous iterations, rather than to identify better epistatic interactions. As it is known to all, stochastic search methods are good at identifying NME models, and hence detection power of MACOED on NME models are higher than that of epiACO.

For other compared methods, their results are consistent with and complementary to previous reported results [31]. Both BOOST and SNPRuler have perfect detection power on NME models but detect nothing on ME models since they only focus on identifying NME models. TEAM performs well on some models and worse on others, implying that it is model sensitive. epiMODE detects nothing in four models and only has moderate detection power on Model 1. On the contrary, epiACO performs well on all models, either NME models or ME models, shows that epiACO is model-free and has high stability and reliability.

The second experiment is performed on 1000-SNP data sets. The purpose of this experiment is to demonstrate that epiACO has a place for larger data sets and performs better than compared methods. AntEpiSeeker, MACOED, IACO and epiACO are compared on these data sets with the parameters of the iteration number being 100 and the ant number being 500. Other parameters are the same as those of the first experiment. TEAM, epiMODE, AntMiner and SNPRuler are not considered here due to their unaffordable computational cost or memory on high-dimensional data sets [31]. Since BOOST is an exhaustive search method only focusing on NME models, its results on 1000-SNP data sets can be inferred that it detects nothing on ME models and has perfect detection power on NME models. Each compared method also runs 20 times with different random seeds on each data set of each model to ensure that the method has not been biased by its initial starting conditions. The average detection power and the average running time with their respective standard deviations on 1000-SNP data sets are recorded in Table 3. It is seen that epiACO indeed performs better on larger data sets than compared methods. Specifically, it is the fastest one among the methods; though detection power does not reach a perfect level due to the small iteration number and the ant number, epiACO is still the winner.

Table 3 The average detection power and the average running time with their respective standard deviations of compared methods on 1000-SNP data sets

The third experiment is performed on 100-SNP data sets to test the convergence and contribution of epiACO with the parameter settings being the same as those of the first experiment, and results of which are shown in Fig. 2. From Fig. 2a, it is seen that detection power of all models increase as the iteration numbers increase, and gradually tend to be stable while the iteration numbers come to 25, which might be the evidence that epiACO has converged after 25 iterations in the first experiment. The Fig. 2b shows the contributions of different path selection strategies in epiACO. It is clear that the probabilistic path selection strategy plays a decisive role on ME models while the stochastic selection strategy plays a decisive role on NME models. Therefore, they supplement each other and both of them are indispensable for epiACO. In order to test the contribution of the ant system, both epiACO and a simple random search are run on all models with 200 ants. Fig. 2c records the detection power of them after 25 iterations, and Fig. 2d is a box plot of iteration numbers of them while obtaining the global optimal solutions. Results in these two subfigures show that the detection power of epiACO after 25 iterations is higher than that of the random search on all models, which implies that the introduced ACO system improves the performance of the algorithm.

Fig. 2
figure 2

The convergence and contribution of epiACO with the parameter settings being the same as those of the first experiment

Application to real data set

Potential of the epiACO can also be verified by analyzing a real AMD data set with different parameter settings (m, T, ρ) being (10,000, 500, 0.2) and (20,000, 250, 0.2). The captured 2-SNP epistatic interactions that might be associated with the AMD are reported in Table 4 with ascending Chi-square p-values.

Table 4 Top 10 captured epistatic interactions associated with AMD. CFH: complement factor H. N/A: no gene is available. MED27: mediator complex subunit 27. KDM4C: lysine (K)-specific demethylase 4C. NCALD: neurocalcin delta. ISCA1: iron-sulfur cluster assembly 1. NEDD9: neural precursor cell expressed developmentally down-regulated 9

It is seen that most reported epistatic interactions contain either rs380390 or rs1329428, which is due to these two SNPs having strongest main effects among all SNPs, leading to their combinations with other SNPs usually displaying strong interaction effects, and hence being identified. SNPs rs380390 and rs1329428 reside in the CFH gene, mutations in this gene have been associated with hemolytic-uremic syndrome (HUS) and chronic hypocomplementemic nephropathy. These two SNPs are believed to be significantly associated with AMD [8]. There are only two detected epistatic interactions in the table not containing either rs380390 or rs1329428, that is, (rs1740752, rs1368863) and (rs1740752, rs943008), which are the first time being identified that might be associated with the AMD. Both SNPs of the former locate in the noncoding regions, might lead to AMD through regulating gene expression levels. Obviously this is a bold speculation, needs further studies in depth with the use of more case-control samples, and also the biological experiments, though they are beyond the scope of this study. SNP rs1740752 appears in these two epistatic interactions, but its main effect is moderate, implying not only that it influences the phenotype mainly through its interaction effects with other SNPs, but also that epiACO is promising in identifying epistatic interactions displaying no marginal effect but interaction effects. SNP rs2224762 resides in the intron of KDM4C gene, and chromosomal aberrations and changes in expression of this gene may be found in tumor cells. SNP rs943008 resides in the intron of NEDD9 gene, altered expression of which is strongly associated with cancer. The NEDD9 overexpression is documented to occur and in some cases linked the process of tumorigenesis of many different malignances. Besides, NEDD9 has been studies for a possible association with late onset Alzheimer’s disease, and may be important for recovery from stroke. Therefore, mutations in NEDD9 gene might involve in AMD. We hope that, from these results, some clues could be provided for the exploration of causative factors of AMD.

Conclusions

It has been widely accepted that complex diseases are mainly caused by epistatic interactions. In the study, a method epiACO based ACO algorithm is presented for detecting epistatic interactions. Highlights of epiACO are the introduced fitness function Svalue, path selection strategies, and a memory based strategy. The Svalue leverages the advantages of both mutual information and Bayesian network to effectively and efficiently measure associations between SNP combinations and the phenotype. Two path selection strategies, i.e., probabilistic path selection strategy and stochastic path selection strategy, are provided to adaptively guide ant behaviors of exploration and exploitation. The threshold of q 0 allows epiACO to cover a wider search space while the iteration number is small and to converge on promising regions of the search space while the iteration number turns to large, resulting in high detection power not only in ME models but also in NME models. The memory based strategy is designed to retain candidate solutions found in the previous iterations, and compare them to solutions of the current iteration to generate new candidate solutions, yielding a more accurate way for identifying epistasis. Experiments of epiACO and its comparison with other recent methods epiMODE, TEAM, BOOST, SNPRuler, AntEpiSeeker, AntMiner, MACOED, and IACO are performed on both simulation data sets and a real data set of age-related macular degeneration. Results show that epiACO is promising in identifying epistasis and might be an alternative to existing methods.

The aim of this study is to develop an epistasis detection method based on ACO algorithm, which is comparable and sometimes superior to existing methods, especially ACO based methods for identifying epistasis. This does not mean that methods based on other principles are not suitable for identifying epistasis, and their performance are worse than epiACO. We believe that each method has its own merits and limitations, and the more the methods based on different principles being proposed, the faster this problem being solved. For instance, the common standard conjugate gradient algorithm [38] and the genetic algorithm [39] are important optimization algorithms, and can undoubtedly deal with this problem.

Although the results demonstrate that epiACO performs well on both simulation data sets and a real AMD data set, several limitations remain. First, although detection power is the generally accepted and widely used evaluation measure for detecting epistatic interactions [2, 8, 10, 11, 29,30,31], and we directly use it in this study, other evaluation measures, for example, sensitivity, specificity, accuracy, balanced accuracy, and so on, should be used to carry out a broader performance analysis. Second, several important parameters of epiACO, including the number of ants, the number of iterations and the evaporation coefficient, should be discussed in detail and give their recommended settings respectively. Though how to set parameters appropriately is a great challenge for the family of swarm intelligence algorithms, like ACO algorithm. These limitations inspire us to continue working in the future.

Abbreviations

ACO:

Ant colony optimization

AMD:

Age-related macular degeneration

BEAM:

Bayesian epistasis association mapping

BOOST:

Boolean operation-based screening and testing

DAG:

Directed acyclic graph

epiForest:

Detection of epistatic interactions using random forest

epiMODE:

Epistatic module detection

GWAS:

Genome-wide association studies

MDR:

Multifactor-dimensionality reduction

ME models:

Models that display both marginal effects and interaction effect

NME models:

Models that shows only interaction effects

PDF:

Probability density function

SNP:

Single-nucleotide polymorphism

TEAM:

Tree-based epistasis association mapping

References

  1. Moore JH, Asselbergs FW, Williams SM. Bioinformatics challenges for genome-wide association studies. Bioinformatics. 2010;26(4):445–55.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  2. Shang J, Zhang J, Lei X, Zhang Y, Chen B. Incorporating heuristic information into ant colony optimization for epistasis detection. Genes Genomics. 2012;34(3):321–7.

    Article  Google Scholar 

  3. Hahn LW, Ritchie MD, Moore JH. Multifactor dimensionality reduction software for detecting gene–gene and gene–environment interactions. Bioinformatics. 2003;19(3):376–82.

    Article  CAS  PubMed  Google Scholar 

  4. Ritchie MD, Hahn LW, Moore JH. Power of multifactor dimensionality reduction for detecting gene-gene interactions in the presence of genotyping error, missing data, phenocopy, and genetic heterogeneity. Genet Epidemiol. 2003;24(2):150–7.

    Article  PubMed  Google Scholar 

  5. Wan X, Yang C, Yang Q, Xue H, Fan X, Tang NLS, Yu W. BOOST: a fast approach to detecting gene-gene interactions in genome-wide case-control studies. Am J Hum Genet. 2010;87(3):325.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  6. Zhang X, Huang S, Zou F, Wang W. TEAM: efficient two-locus epistasis tests in human genome-wide association study. Bioinformatics. 2010;26(12):i217–27.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  7. Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet. 2001;69(1):138.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  8. Tang W, Wu X, Jiang R, Li Y. Epistatic module detection for case-control studies: a Bayesian model with a Gibbs sampling strategy. PLoS Genet. 2009;5(5):e1000464.

    Article  PubMed  PubMed Central  Google Scholar 

  9. Jiang R, Tang W, Wu X, Fu W. A random forest approach to the detection of epistatic interactions in case-control studies. BMC bioinformatics. 2009;10(Suppl 1):S65.

    Article  PubMed  PubMed Central  Google Scholar 

  10. Zhang Y, Liu JS. Bayesian inference of epistatic interactions in case-control studies. Nat Genet. 2007;39(9):1167–73.

    Article  CAS  PubMed  Google Scholar 

  11. Wan X, Yang C, Yang Q, Xue H, Tang NL, Yu W. Predictive rule inference for epistatic interaction detection in genome-wide association studies. Bioinformatics. 2010;26(1):30–7.

    Article  CAS  PubMed  Google Scholar 

  12. Wang Y, Liu X, Robbins K, Rekaya R. AntEpiSeeker: detecting epistatic interactions for case-control studies using a two-stage ant colony optimization algorithm. BMC Research Notes. 2010;3(1):117.

    Article  PubMed  PubMed Central  Google Scholar 

  13. Jing P J, Shen H B. MACOED: a multi-objective ant colony optimization algorithm for SNP epistasis detection in genome-wide association studies. Bioinformatics. 2015; 31(5):634–41.

    Article  CAS  PubMed  Google Scholar 

  14. Christmas J, Keedwell E, Frayling TM, Perry JR. Ant colony optimisation to identify genetic variant association with type 2 diabetes. Inf Sci. 2011;181(9):1609–22.

    Article  Google Scholar 

  15. Wang Y, Liu X, Rekaya R. AntEpiSeeker2. 0: extending epistasis detection to epistasis-associated pathway inference using ant colony optimization. London: Nature Publishing Group; 2012.

    Google Scholar 

  16. Sun Y, Shang J, Liu J X. An improved ant colony optimization algorithm for the detection of SNP-Interactions. In: Huang DS, Han K, Hussain A, editors. Lecture Notes in Computer Science. Switzerland: Springer; 2016. p. 21-32.

  17. Martens D, Baesens B, Fawcett T. Editorial survey: swarm intelligence for data mining. Mach Learn. 2011;82(1):1–42.

    Article  Google Scholar 

  18. Yang J-B, Ong C-J. An effective feature selection method via mutual information estimation. IEEE Trans Syst Man Cybern B Cybern. 2012;42(6):1550–9.

    Article  PubMed  Google Scholar 

  19. Peng H, Long F, Ding C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell. 2005;27(8):1226–38.

    Article  PubMed  Google Scholar 

  20. Gopnik A, Tenenbaum JB. Bayesian networks, Bayesian learning and cognitive development. Dev Sci. 2007;10(3):281–7.

    Article  PubMed  Google Scholar 

  21. Han B, Chen XW, Talebizadeh Z, Xu H. Genetic studies of complex human diseases: characterizing SNP-disease associations using Bayesian networks. BMC Syst Biol. 2012;6(3):1–12.

    Google Scholar 

  22. Jiang X, Barmada MS. Identifying genetic interactions in genome-wide data using Bayesian networks. Genet Epidemiol. 2010;34(6):575–81.

    Article  PubMed  PubMed Central  Google Scholar 

  23. Xia J, Neapolitan RE, Barmada MM, Visweswaran S. Learning genetic epistasis using Bayesian network scoring criteria. Bmc Bioinf. 2011;12(12):1–12.

    Google Scholar 

  24. Visweswaran S, Wong AKI, Barmada MM: A Bayesian method for identifying genetic interactions. AMIA Annual Symposium proceedings / AMIA Symposium AMIA Symposium 2008, 2009(2009):673-677.

    Google Scholar 

  25. Cooper GF, Herskovits E. A Bayesian method for the induction of probabilistic networks from data. Mach Learn. 1992;9(4):309–47.

    Google Scholar 

  26. Han B, Chen X-W, Talebizadeh Z, Xu H. Genetic studies of complex human diseases: characterizing SNP-disease associations using Bayesian networks. BMC Syst Biol. 2012;6(Suppl 3):S14.

    Article  PubMed  PubMed Central  Google Scholar 

  27. Jiang X, Neapolitan RE, Barmada MM, Visweswaran S. Learning genetic epistasis using Bayesian network scoring criteria. BMC Bioinf. 2011;12(1):1.

    Article  Google Scholar 

  28. Szymczak S, Igl BW, Ziegler A. Detecting SNP-expression associations: a comparison of mutual information and median test with standard statistical approaches. Stat Med. 2009;28(29):3581–96.

    Article  CAS  PubMed  Google Scholar 

  29. Jiang X, Neapolitan RE, Barmada MM, Visweswaran S, Cooper GF: A fast algorithm for learning epistatic genomic relationships. In: AMIA annual Symposium proceedings: 2010. Am Med Inform Assoc: 341.

  30. Shang J, Zhang J, Sun Y, Zhang Y. EpiMiner: a three-stage co-information based method for detecting and visualizing epistatic interactions. Digital Signal Process. 2014;24:1–13.

    Article  Google Scholar 

  31. Shang J, Zhang J, Sun Y, Liu D, Ye D, Yin Y. Performance analysis of novel methods for detecting epistasis. BMC Bioinf. 2011;12(1):475.

    Article  Google Scholar 

  32. Frankel WN, Schork NJ. Who's afraid of epistasis? Nat Genet. 1996;14(4):371–3.

    Article  CAS  PubMed  Google Scholar 

  33. Li W, Reich J. A complete enumeration and classification of two-locus disease models. Hum Hered. 2000;50(6):334–49.

    Article  CAS  PubMed  Google Scholar 

  34. Greene CS, Penrod NM, Williams SM, Moore JH. Failure to replicate a genetic association may provide important clues about genetic architecture. PLoS One. 2009;4(6):e5639.

    Article  PubMed  PubMed Central  Google Scholar 

  35. Shang J, Zhang J, Lei X, Zhao W, Dong Y. EpiSIM: simulation of multiple epistasis, linkage disequilibrium patterns and haplotype blocks for genome-wide interaction analysis. Genes Genomics. 2013:1–12.

  36. Yu L, Maxwell S, Tao F, et al. Gene, pathway and network frameworks to identify epistatic interactions of single nucleotide polymorphisms derived from GWAS data. Bmc Syst Biol. 2012;6(Suppl 3):S15.

    Article  PubMed  PubMed Central  Google Scholar 

  37. Yang C, Xiang W, Qiang Y, Hong X, Yu W. Identifying main effects and epistatic interactions from large-scale SNP data via adaptive group lasso. Bmc Bioinf. 2010;11(Suppl 1):1–11.

    Google Scholar 

  38. Venna J, Peltonen J, Nybo K, Aidos H, Kaski S. Information retrieval perspective to nonlinear dimensionality reduction for data visualization. J Mach Learn Res. 2010;11(1):451–90.

    Google Scholar 

  39. Pyka M, Heider D, Hauke S, Kircher T, Jansen A. Dynamic causal modeling with genetic algorithms. J Neurosci Methods. 2011;194(2):402–6.

    Article  CAS  PubMed  Google Scholar 

Download references

Acknowledgments

Not applicable.

Funding

We are grateful to the anonymous reviewers whose suggestions and comments contributed to the significant improvement of this paper. This work was in part supported by the National Natural Science Foundation of China (61502272, 61572284), the Science and Technology Planning Project of Qufu Normal University (xkj201410), the Scientific Research Foundation of Qufu Normal University (BSQD20130119).

Availability of data and materials

The data and MATLAB source codes are available online at.

https://sourceforge.net/projects/epiaco1/files/epiACO.rar/download.

Author information

Authors and Affiliations

Authors

Contributions

YS and JS jointly contributed to the design of the study. YS designed and implemented the epiACO method, performed the experiments, and drafted the manuscript. JXL participated in the design of the study and performed the statistical analysis. SL and CHZ contributed to the data analysis. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Junliang Shang or Jin-Xing Liu.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Sun, Y., Shang, J., Liu, JX. et al. epiACO - a method for identifying epistasis based on ant Colony optimization algorithm. BioData Mining 10, 23 (2017). https://doi.org/10.1186/s13040-017-0143-7

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s13040-017-0143-7

Keywords