Abstract
Multi-omics analysis offers a promising avenue to a better understanding of complex biological phenomena. In particular, untangling the pathophysiology of multifactorial health conditions such as inflammatory bowel disease (IBD) could benefit from simultaneous consideration of several omics levels. However, taking full advantage of multi-omics data requires the adoption of suitable new tools. Multi-view learning, a machine learning technique that natively joins together heterogeneous data, is a natural source for such methods. Here we present a new approach to variable selection in unsupervised multi-view learning by applying stability selection to canonical correlation analysis (CCA). We apply our method, StabilityCCA, to simulated and real multi-omics data, and demonstrate its ability to find relevant variables and improve the stability of variable selection. In a case study on an IBD microbiome data set, we link together metagenomics and metabolomics, revealing a connection between their joint structure and the disease, and identifying potential biomarkers. Our results showcase the usefulness of multi-view learning in multi-omics analysis and demonstrate StabilityCCA as a powerful tool for biomarker discovery.
Citation: Pusa T, Rousu J (2024) Stable biomarker discovery in multi-omics data via canonical correlation analysis. PLoS ONE 19(9): e0309921. https://doi.org/10.1371/journal.pone.0309921
Editor: Kai Wang, Institute of Apicultural Research, CHINA
Received: February 27, 2024; Accepted: August 20, 2024; Published: September 9, 2024
Copyright: © 2024 Pusa, Rousu. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The IBD data set is available as Supplementary Information to Franzosa et al. (2019), at doi.org/10.1038/s41564-018-0306-4.
Funding: The authors wish to acknowledge the financial support by Academy of Finland through the grants 334790 (MAGITICS), 339421 (MASF) and 345802 (AIB), as well as the Global Programme by Finnish Ministry of Education and Culture. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Multi-omics analyses have emerged as the next generation of high-throughput methods. By simultaneously considering multiple levels of the biological system under study, multi-omics offers a more comprehensive view of it, and a better chance of understanding the underlying processes [1–3].
However, along with promise come challenges: issues typical of biological data sets such as small sample sizes combined with a large number of variables are further compounded in multi-omics data [4]. Heterogeneous data types and large discrepancies in the number of variables between different omics are also inherent problems [5, 6]. Taking full advantage of multi-omics necessitates developing and adopting new analytical methods [7, 8].
The large number of variables encountered in biological data necessitates feature selection, and the primary goal of omics studies is often precisely the identification of a core set of variables of interest: biomarker discovery. This can be achieved by regularising models and imposing sparsity. However, picking the correct level of sparsity is a non-trivial task. In addition, there exists a known trade-off between sparsity and stability, that is, how sensitive the choice of variables is to changes in the data [9, 10].
Ideally, multi-omics analyses will go beyond simply concatenating different data types, preserving their complementary nature. Multi-view learning, the umbrella term for machine learning methods that consider the integration of multiple feature sets, is well-suited to accomplish this task [3, 11].
One potential solution to the sparsity-stability trade-off is offered by the stability selection framework [12, 13]: by simulating data perturbation with a subsampling procedure, more stable feature sets can be found. Here we show how stability selection can be efficiently applied in a multi-view context.
Our approach, named StabilityCCA, combines canonical correlation analysis (CCA) with stability selection to find subsets of connected variables in a multi-omics data set. CCA is an unsupervised multi-view method that finds a pair of maximally correlated projections for two sets of variables [14]. By applying the stability selection framework to CCA, we offer an alternative approach to variable selection in high-dimensional multi-view analysis. Instead of setting a fixed level of model sparsity by optimising regularisation parameters, StabilityCCA constructs a multitude of models with varying levels of sparsity, giving a more holistic view of variable importance.
We tested StabilityCCA with simulated data and two different types of multi-omics from an inflammatory bowel disease (IBD) gut microbiome data set. The results show improvements in variable selection performance, in particular selection stability. Importantly, we saw significant improvements in stability in real data. We further demonstrate the method by analysing the canonical correlations found in the IBD data, uncovering a connection between the metagenome-metabolome structure and IBD status. With StabilityCCA, we were able to identify the important variables behind this connection, many of which are known IBD biomarkers.
This article is organised as follows. In a theory section, we introduce CCA, sparse CCA and stability selection, and show how stability selection can be applied to sparse CCA, leading to StabilityCCA. In a results section, we show results for two types of simulated data and a real multi-omics data set, comparing two different sparse CCA methods to their StabilityCCA extensions. The article ends with a discussion of the results.
Methods
Canonical correlation analysis
Canonical correlation analysis is an unsupervised dimensionality reduction method that can be applied when the variables under study naturally group into two disjoint sets. Suppose we have two corresponding sets of observations $X \in \mathbb{R}^{n \times p_x}$ and $Y \in \mathbb{R}^{n \times p_y}$, where $n$ stands for the number of observations, and $p_x$ and $p_y$ for the number of variables in the X and Y views respectively. We assume that the columns of X and Y have been standardised to have mean zero. CCA seeks projections $Xa$ and $Yb$ such that the correlation between them is maximised:

$$\rho = \max_{a, b} \frac{\langle Xa, Yb \rangle}{\lVert Xa \rVert_2 \, \lVert Yb \rVert_2} \qquad (1)$$

where 〈⋅, ⋅〉 denotes the dot product.
The solutions to Eq (1), $a^*$ and $b^*$, are the first pair of canonical coefficients, and the projections $Xa^*$ and $Yb^*$ are the first pair of canonical variables or scores. Further pairs can then be found by requiring that they are uncorrelated with the previous ones, up to min({rank(X), rank(Y)}) pairs.
The CCA problem can be solved analytically, for example using the singular value decomposition [15]. However, when the number of variables exceeds that of observations, this solution is no longer unique. A classic remedy is to introduce some form of regularisation, typically inducing sparsity in the coefficient vectors a and b, and maximise the CCA objective via optimisation [16–19], resulting in sparse CCA (SCCA).
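As an illustration of the analytical solution mentioned above, the following is a minimal NumPy sketch of classical (unregularised) CCA. It assumes centred data with full column rank and more observations than variables, which is exactly the setting where the solution is unique. The function name `cca_svd` and the toy data are ours, not the authors'.

```python
import numpy as np

def cca_svd(X, Y):
    """Classical CCA for centred X (n x px) and Y (n x py).

    Assumes n > px, py and full column rank (illustrative sketch only).
    Returns coefficient matrices A, B and the canonical correlations s.
    """
    # Orthonormal bases for the column spaces of X and Y.
    Qx, Rx = np.linalg.qr(X)
    Qy, Ry = np.linalg.qr(Y)
    # The singular values of Qx^T Qy are the canonical correlations.
    U, s, Vt = np.linalg.svd(Qx.T @ Qy)
    # Map the singular vectors back to coefficients on the original variables.
    A = np.linalg.solve(Rx, U)
    B = np.linalg.solve(Ry, Vt.T)
    return A, B, s

# Toy data: one shared latent factor z links X[:, 0] and Y[:, 1].
rng = np.random.default_rng(0)
n = 200
z = rng.standard_normal(n)
X = np.column_stack([z + 0.5 * rng.standard_normal(n),
                     rng.standard_normal(n),
                     rng.standard_normal(n)])
Y = np.column_stack([rng.standard_normal(n),
                     z + 0.5 * rng.standard_normal(n)])
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)

A, B, s = cca_svd(X, Y)
# Correlation of the first pair of canonical variables.
rho = np.corrcoef(X @ A[:, 0], Y @ B[:, 0])[0, 1]
```

Here `rho` coincides with the largest singular value `s[0]`. When the number of variables exceeds the number of observations the QR factors are no longer invertible in this way, which is where the regularised variants come in.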
Rather than adding a penalty term to the objective, as is commonly done in regularised regression methods, many SCCA approaches instead formulate the regularisation into a set of constraints on the coefficient vectors a and b. An optimal solution can then be found via alternating optimisation in which the objective is iteratively improved from one side at a time, imposing the regularisation constraints at each step. Such an approach has been previously employed by [18], in a closely related formulation by [20], and in the context of kernel CCA by [19].
We will adopt and test two different SCCA approaches to be used as the base procedure with StabilityCCA. The first one, due to Witten et al., is based on a penalised matrix decomposition, and makes the simplifying assumption that the covariance matrices $X^T X$ and $Y^T Y$ can be treated as diagonal, effectively maximising the numerator in Eq (1). A combination of L1- and L2-norm constraints is used to induce sparsity, giving the following objective:

$$\max_{a, b} \; \langle Xa, Yb \rangle \qquad (2)$$
$$\text{subject to } \lVert a \rVert_2 \le 1, \; \lVert a \rVert_1 \le c_x \qquad (3)$$
$$\lVert b \rVert_2 \le 1, \; \lVert b \rVert_1 \le c_y \qquad (4)$$

where $c_x$ and $c_y$ are regularisation parameters for $a$ and $b$ respectively. This can be optimised by an alternating algorithm where at each step one of the coefficient vectors is kept fixed and Eq (2) is maximised with respect to the other, which can be done exactly (for full details of the algorithm, see [18]). We will refer to this method as penalised matrix decomposition CCA (PMD-CCA).
PMD-CCA has several attractive qualities for our purposes. Firstly, when $1 \le c_x \le \sqrt{p_x}$, the L1 and L2 constraints interact, setting some coefficients in $a$ to zero. As a result, the coefficient vector is maximally sparse at $c_x \le 1$, and no sparsity is imposed at $c_x \ge \sqrt{p_x}$. This will prove very convenient for StabilityCCA. Secondly, the optimisation algorithm has low time complexity and appears to converge fast when a good initialisation is given (as proposed by [18], the first singular vectors of the cross-covariance matrix appear to be a good choice in practice). This is important since the stability selection framework will necessarily add some further computational burden.
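The alternating scheme can be sketched as follows. This is a minimal illustration of our reading of the approach in [18], not the authors' implementation: each update soft-thresholds the cross-covariance direction and rescales it to unit L2 norm, with the threshold found by bisection so that the L1 constraint holds. The helper names (`soft_threshold`, `l1_l2_project`, `pmd_cca`), the iteration counts and the toy data are assumptions.

```python
import numpy as np

def soft_threshold(w, delta):
    """Shrink every entry of w towards zero by delta."""
    return np.sign(w) * np.maximum(np.abs(w) - delta, 0.0)

def l1_l2_project(w, c, n_iter=50):
    """Unit-L2 vector S(w, delta)/||S(w, delta)||_2 with L1 norm <= c.

    delta is 0 if the constraint already holds, otherwise found by bisection.
    """
    u = w / np.linalg.norm(w)
    if np.abs(u).sum() <= c:
        return u
    lo, hi = 0.0, np.abs(w).max()
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        s = soft_threshold(w, mid)
        if np.abs(s).sum() <= c * np.linalg.norm(s):
            hi = mid   # feasible: try a smaller threshold
        else:
            lo = mid   # infeasible: threshold more
    s = soft_threshold(w, hi)
    if not s.any():    # degenerate case: keep only the largest entry
        j = np.argmax(np.abs(w))
        s[j] = np.sign(w[j])
    return s / np.linalg.norm(s)

def pmd_cca(X, Y, cx, cy, n_iter=100, tol=1e-8):
    """First pair of sparse coefficients under the diagonal-covariance assumption."""
    K = X.T @ Y                                  # cross-covariance up to scaling
    b = np.linalg.svd(K, full_matrices=False)[2][0]  # initialise from first right singular vector
    a = np.zeros(X.shape[1])
    for _ in range(n_iter):
        a_new = l1_l2_project(K @ b, cx)         # update a with b fixed
        b_new = l1_l2_project(K.T @ a_new, cy)   # update b with a fixed
        change = np.linalg.norm(a_new - a) + np.linalg.norm(b_new - b)
        a, b = a_new, b_new
        if change < tol:
            break
    return a, b

# Toy data with one shared signal between the first variables of each view.
rng = np.random.default_rng(0)
z = rng.standard_normal(50)
X = rng.standard_normal((50, 20))
Y = rng.standard_normal((50, 10))
X[:, 0] += 3 * z
Y[:, 0] += 3 * z
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)
a, b = pmd_cca(X, Y, 1.5, 1.5)
```

Setting `cx = 1` forces essentially a single non-zero coefficient, while `cx = np.sqrt(px)` leaves the vector dense, which mirrors the behaviour of the constraints described above.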
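Our second SCCA method is essentially a combination of the approach used by [19] and PMD-CCA. That is, we retain the L1-L2 constraints but solve for the full CCA objective:

$$\max_{a, b} \; \frac{\langle Xa, Yb \rangle}{\lVert Xa \rVert_2 \, \lVert Yb \rVert_2} \qquad (5)$$
$$\text{subject to } \lVert a \rVert_2 \le 1, \; \lVert a \rVert_1 \le c_x \qquad (6)$$
$$\lVert b \rVert_2 \le 1, \; \lVert b \rVert_1 \le c_y \qquad (7)$$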
Optimisation is done using an alternating projected gradient algorithm (for details, see [19]). We will refer to this method as SCCA-EC for “SCCA with Elastic-net Constraints”.
The SCCA-EC method has the same convenient qualities regarding the regularisation and low time complexity. However, contrary to PMD-CCA, it does not ignore the within-view covariances. It is not exactly clear when the diagonal-covariance assumption made by PMD-CCA is appropriate, and so it is interesting to see if and when it makes a difference. Furthermore, previous results suggest that different SCCA methods might be better suited for different types of data [21], so it makes sense to try different approaches with the StabilityCCA framework.
Stability selection
One potential solution to the sparsity-stability trade-off is offered by the stability selection framework, described in detail in [12, 13]. The idea is to simulate data perturbation by repeatedly subsampling the data. For each subsample, a regularisation path is calculated, forming a sequence of models with varying levels of sparsity. By recording the frequency with which a variable was selected at a given level of sparsity, we can estimate the probability of each variable to be selected as a function of regularisation.
Here we show how to apply stability selection to CCA. As introduced in the previous Section, we will test two different SCCA approaches as base procedures but the framework can in principle be applied with any SCCA method.
Since we have two sparsity parameters, we sample the parameter space using pairs of parameter values $(c_{x_i}, c_{y_i})$. To capture the full range of regularisation, we run the parameters from $(1, 1)$ to $(\sqrt{p_x}, \sqrt{p_y})$, populating the interval using a logarithmic sequence. This will sample the lower end of the parameter space more densely, effectively giving it more weight, since we are more interested in relatively sparse models where only a few variables are being selected.
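A logarithmic sequence of parameter pairs of this kind can be generated with `numpy.geomspace`. The sketch below assumes the upper end of the range is (√px, √py), the point at which the constraints no longer impose sparsity. The number of pairs, `n_values`, is a hypothetical choice for illustration, and `parameter_pairs` is our own helper name.

```python
import numpy as np

def parameter_pairs(px, py, n_values=20):
    """Log-spaced pairs (cx_i, cy_i) from (1, 1) to (sqrt(px), sqrt(py)).

    n_values is an illustrative default, not a value taken from the paper.
    """
    cx = np.geomspace(1.0, np.sqrt(px), n_values)
    cy = np.geomspace(1.0, np.sqrt(py), n_values)
    return list(zip(cx, cy))

# Example with the species (201) and metabolite (466) view sizes.
pairs = parameter_pairs(201, 466, n_values=10)
```

Because consecutive values grow by a constant factor, the gaps between them widen towards the upper end, so the sparse end of the range is sampled more densely.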
The procedure, StabilityCCA, is described in pseudocode in Algorithm 1 and illustrated schematically in Fig 1 which shows a visual representation of the output: a stability path. It shows for each variable its selection probability, as estimated by StabilityCCA, as a function of the sparsity parameters.
Fig 1. Top: regularisation paths, that is, the SCCA model as a function of sparsity. Bottom: stability paths, derived from the regularisation paths for 100 subsamples of size n/2. The stability score of a variable is the area-under-curve (AUC) of its stability path.
We define the stability score of a variable as the area-under-curve of its stability path. In other words, it is the average selection probability across all (cx, cy) values sampled. It can be seen as a measure of both relative and absolute variable importance: variables with higher stability scores are selected more often and thus more likely to be true signal variables. Values close to one indicate that a variable is selected almost always, even in very sparse models, and is likely to be very important for the model.
We tested two variable selection strategies based on the stability score, both using a single hyperparameter. In “Top-k” selection, variables are ranked and the top-k variables are selected, k being the hyperparameter. In threshold selection (Thr), variables with a stability score above a threshold τ are selected, and τ is the hyperparameter. With both strategies, a CCA model can then be formed by running the base procedure with only the selected variables and regularisation parameters set so that no further sparsity is induced.
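The stability score and the two selection strategies can be sketched as below. The sketch assumes selection probabilities are stored as arrays with one row per parameter pair and one column per variable, and that Top-k ranks the variables of both views jointly (suggested by k ranging up to px + py, but an assumption on our part). All function names are hypothetical.

```python
import numpy as np

def stability_scores(P):
    """Average selection probability over all sampled parameter pairs.

    P has shape (n_parameter_pairs, n_variables).
    """
    return P.mean(axis=0)

def select_top_k(scores_x, scores_y, k):
    """Indices of the k highest-scoring variables, ranked jointly over both views."""
    scores = np.concatenate([scores_x, scores_y])
    top = np.argsort(-scores, kind="stable")[:k]
    px = len(scores_x)
    idx_x = np.sort(top[top < px])
    idx_y = np.sort(top[top >= px] - px)
    return idx_x, idx_y

def select_threshold(scores_x, scores_y, tau):
    """Indices of variables whose stability score exceeds tau."""
    return np.flatnonzero(scores_x > tau), np.flatnonzero(scores_y > tau)

# Small worked example: 2 parameter pairs, 3 X-variables, 2 Y-variables.
sx = stability_scores(np.array([[1.0, 0.0, 0.5], [1.0, 0.2, 0.5]]))
sy = stability_scores(np.array([[0.8, 0.0], [0.6, 0.0]]))
ix, iy = select_top_k(sx, sy, 3)
tx, ty = select_threshold(sx, sy, 0.4)
```

Note that a joint ranking does not guarantee at least one variable per view for very small k, so a practical implementation would need to handle that case.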
Algorithm 1 StabilityCCA
Input: X, Y: Data matrices
Output: PX, PY: Selection probabilities
S ← set of subsamples
C ← sequence of parameter value pairs
countX ← 0, countY ← 0
for (Xs, Ys) ∈ S do
    for (cxi, cyi) ∈ C do
        (a, b) ← SCCA(Xs, Ys, cxi, cyi)
        countXij ← countXij + 1(aj ≠ 0) for each variable j in X
        countYij ← countYij + 1(bj ≠ 0) for each variable j in Y
    end for
end for
PXij ← countXij/|S|
PYij ← countYij/|S|
⊳ S is a set of 100 subsamples derived by randomly splitting X and Y in half 50 times.
⊳ C is a logarithmic sequence of pairs (cxi, cyi) running from (1, 1) to (√px, √py).
⊳ SCCA is SCCA-EC or PMD-CCA.
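The pseudocode translates fairly directly into NumPy. The sketch below takes the base procedure as a callable, so any SCCA method with the signature `scca(X, Y, cx, cy) -> (a, b)` could be plugged in. To keep the example self-contained it uses `toy_scca`, a deliberately simple stand-in that keeps the largest entries of the leading singular vectors of the cross-covariance matrix. It is not SCCA-EC or PMD-CCA, and all names and defaults here are assumptions.

```python
import numpy as np

def stability_cca(X, Y, scca, params, n_splits=50, seed=0):
    """Estimate selection probabilities PX, PY (rows: parameter pairs, columns: variables)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    count_x = np.zeros((len(params), X.shape[1]))
    count_y = np.zeros((len(params), Y.shape[1]))
    n_subsamples = 0
    for _ in range(n_splits):
        perm = rng.permutation(n)
        # Each random split yields two complementary halves of the samples.
        for idx in (perm[: n // 2], perm[n // 2:]):
            for i, (cx, cy) in enumerate(params):
                a, b = scca(X[idx], Y[idx], cx, cy)
                count_x[i] += (a != 0)
                count_y[i] += (b != 0)
            n_subsamples += 1
    return count_x / n_subsamples, count_y / n_subsamples

def toy_scca(X, Y, cx, cy):
    """Stand-in base procedure (hypothetical, for illustration only).

    Keeps roughly c^2 of the largest entries of the leading singular vectors,
    so c = 1 gives one variable and c = sqrt(p) keeps all p.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y, full_matrices=False)

    def keep_top(w, c):
        m = min(len(w), max(1, int(round(c ** 2))))
        out = np.zeros_like(w)
        top = np.argsort(-np.abs(w))[:m]
        out[top] = w[top]
        return out

    return keep_top(U[:, 0], cx), keep_top(Vt[0], cy)

# Toy data: the first variable of each view shares a latent signal z.
rng = np.random.default_rng(1)
n, px, py = 60, 12, 8
z = rng.standard_normal(n)
X = rng.standard_normal((n, px))
Y = rng.standard_normal((n, py))
X[:, 0] += 2 * z
Y[:, 0] += 2 * z
params = list(zip(np.geomspace(1, np.sqrt(px), 10),
                  np.geomspace(1, np.sqrt(py), 10)))
PX, PY = stability_cca(X, Y, toy_scca, params)
scores_x = PX.mean(axis=0)  # stability scores of the X-view variables
```

Each column of `PX` is a stability path. The cost is the number of subsamples times the number of parameter pairs base-procedure fits, which is why a fast base procedure matters.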
Results
Simulated data
Data.
We generated simulated data with a known ground truth using two different data models. The first we will refer to as “single latent data”. Let atrue and btrue be ground truth coefficient vectors for the first (X) and second (Y) view respectively, each with a fixed number of true variables. We randomly select a set of indices Atrue in atrue to be the true variables (and likewise for btrue). The entries of atrue are then drawn from a uniform distribution if j ∈ Atrue, and set to zero otherwise (and likewise for btrue). A latent factor is drawn for each sample i, and the data for the two views are generated from this latent factor and the ground truth coefficient vectors.
In the second data model, “multivariate data”, atrue and btrue are generated in the same manner. Data is then drawn from a multivariate Gaussian distribution where (8) and equivalently for ΣYY and ΣYX.
Let now sx be the true sparsity of the X-view, that is, the proportion of variables in X that are true variables (and equivalently for Y). For all simulations, we set sx = 2sy, and run simulations for sx values 1/2, 1/4 and 1/10. We set n = 50 in all cases. This set-up is meant to capture, to varying degrees of severity, many of the challenges encountered in multi-omics data: the curse of dimensionality, and asymmetry between the two views in the number of true variables, the level of true sparsity and the number of variables.
Metrics.
We evaluated the methods’ ability to identify true variables using two metrics. The area under the receiver operating characteristic curve (AUC) was calculated using the absolute values of canonical coefficients as the score for SCCA-EC and PMD-CCA, and the stability scores for their StabilityCCA equivalents. Balanced accuracy (BA) was calculated for binarised canonical coefficients or for the selections performed with the Top-k and the Thr procedure for StabilityCCA.
The stability of selected variable sets was evaluated using the stability estimator introduced by [22]. The stability estimator is a value between $-1/(M-1)$ and 1, where M is the number of variable sets. The maximum value 1 is achieved only when all selected sets are exactly equal and the minimum when their intersection is empty. The expected value of a random selection where each subset is equally likely to be drawn is zero.
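For concreteness, here is a sketch of a frequency-based stability estimator with the properties listed above, in the form we understand Nogueira et al. to define it. Whether this matches the estimator of [22] in every detail is an assumption. The input is a binary matrix with one row per selected variable set, and `stability_estimator` is our own name.

```python
import numpy as np

def stability_estimator(Z):
    """Stability of M variable sets over d variables.

    Z is an (M, d) binary matrix; Z[m, f] = 1 if variable f was selected in set m.
    Undefined when every set is empty or every set contains all variables.
    """
    Z = np.asarray(Z, dtype=float)
    M, d = Z.shape
    p_hat = Z.mean(axis=0)                     # selection frequency per variable
    k_bar = Z.sum(axis=1).mean()               # average number of selected variables
    var_f = M / (M - 1) * p_hat * (1 - p_hat)  # unbiased per-variable sample variance
    denom = (k_bar / d) * (1 - k_bar / d)      # variance expected under random selection
    return 1.0 - var_f.mean() / denom

# Identical selections are perfectly stable.
same = stability_estimator([[1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0]])
# Two complementary selections reach the lower bound -1/(M-1) for M = 2.
opposite = stability_estimator([[1, 1, 0, 0], [0, 0, 1, 1]])
```

In this example `same` evaluates to 1 and `opposite` to −1, the two extremes for M = 2.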
Finally, model performance was evaluated by applying the canonical coefficients to a test set of size 100 and calculating the resulting correlation. For each scenario, we generated ten different random ground truths atrue,btrue, and ten different training and test set pairs for each atrue,btrue. Results are reported as averages over these 100 training/test sets for AUC, BA and test set correlation. The stability estimator was calculated over each of the sets of ten training sets with the same ground truth and we report the average over these.
Hyperparameter tuning.
All hyperparameter tuning was done within the training set using ten rounds of 3-fold cross-validation and test set canonical correlation as the metric of success. For the base procedures alone, SCCA-EC and PMD-CCA, a grid search over evenly distributed values of $c_x$ and $c_y$ in the ranges $[1, \sqrt{p_x}]$ and $[1, \sqrt{p_y}]$ was performed with grid size 15. For the Thr selection procedure, 100 values evenly distributed between the highest and lowest stability score were tested. For Top-k selection, since the number of potential k values can be very large, a binary search strategy was employed between 1 and $p_x + p_y$. Note that in practice, one would likely aim for low values of k. However, we chose this agnostic tuning strategy so as to not artificially bias the model towards sparse models.
The results for single latent data are shown in Fig 2A. For SCCA-EC, we observe a substantial improvement from StabilityCCA across all three variable selection metrics in all scenarios with not much difference between the Top-k and Thr models. For test set correlation there is little difference with the base procedure performing slightly better in the sparsest case.
Fig 2. The base procedures are plotted with solid lines, while the dashed lines represent their different StabilityCCA extensions.
Similar, if not as pronounced, improvements are found for PMD-CCA. Notably, PMD-CCA alone clearly outperforms SCCA-EC, and even its StabilityCCA extension in some cases. All in all, we observe clear improvements from StabilityCCA in variable selection, in particular stability, with not much difference in model performance, independent of the SCCA method used.
The multivariate data results, shown in Fig 2B, are more mixed. For SCCA-EC, while stability is still improved, AUC and BA are in some cases higher for the base procedure alone. It appears that SCCA-EC benefits from stability selection more in sparser scenarios. Test set correlation on the other hand is always higher for the base procedure model.
For PMD-CCA, there is still a clear improvement in AUC. The Top-k and Thr selections diverge somewhat: the Top-k model has much better BA in the sparse cases, and is the most stable in the densest case while dropping even below the base procedure for sparser cases. The Thr model has more even stability performance and outperforms Top-k for test set correlation. Interestingly, the results also differ from the single latent data in that SCCA-EC now outperforms PMD-CCA for AUC and test set correlation.
In summary, in our simulated data, adding the stability selection framework to SCCA seems to improve variable selection and in particular selection stability with not much difference in model performance. It appears that PMD-CCA is more stable than SCCA-EC to begin with, and does not benefit as much from StabilityCCA.
Real data
We further evaluated StabilityCCA on two different omics combinations from an IBD data set. IBD is a group of gastrointestinal conditions, the most common forms being Crohn’s disease (CD) and ulcerative colitis (UC). The data comprises fecal metagenomics and metabolomics of individuals with CD (88), UC (76) and healthy controls (56), and was published in [23]. There are two alternative feature sets for metagenomics: species (201 variables) or enzymes (2113 variables), and 466 metabolite variables.
We downloaded abundance tables from the electronic supplementary materials to [23]. All data was transformed using the generalised log transformation [24, 25]. Variables which had zero variance were removed. Finally, each column (variable) was normalised to have mean zero and variance one.
Since here we obviously do not have a ground truth to compare against, only stability and model performance were evaluated. Non-overlapping training and test sets of size 100 were randomly drawn 25 times. Stability was measured across the selections in these 25 sets and test set correlation is reported as the average over them as before. Hyperparameter tuning was the same as in the simulation experiments.
The results for the species-metabolites case are shown in Fig 3A. For SCCA-EC, there is again a very notable improvement in stability from StabilityCCA, with the Top-k model being slightly more stable than Thr. There is little change in model performance. For PMD-CCA, the base procedure is now again much more stable than SCCA-EC. The Thr-extension underperforms, having worse stability than the base procedure alone while PMD-CCA-Top-k has the highest stability of all methods. However, for model performance, SCCA-EC outperforms across the board.
Fig 3. The whiskers display one standard deviation for the test set correlations.
In the enzymes-metabolites case in Fig 3B, SCCA-EC performance is very similar to the previous case. For PMD-CCA, the base procedure selection is no longer stable. While stability is again improved by the Top-k selection, both SCCA-EC extensions are now more stable. In terms of model performance, SCCA-EC is now clearly better for both the base procedure and the StabilityCCA extensions.
All in all, our experiments on real data show that StabilityCCA greatly improves variable selection stability with no difference or even small improvements in model performance. The Top-k model appears to be the better selection strategy. The results also suggest that of the two SCCA methods tested, SCCA-EC might be better suited for these types of multi-omics data.
Canonical correlations are linked to disease status
We further illustrate the use of StabilityCCA by taking a closer look at the IBD data set. Based on the results presented above, we choose SCCA-EC as the base procedure and Top-k as the variable selection method. However, as mentioned before, in practice we would likely not wish to attempt to optimise k over all possible values but rather choose a relatively low value based on practical considerations such as ease of interpretability.
Indeed, when we investigated how increasing k will affect model performance, we found that while the average test set correlation does keep increasing up to k = 150, very little is gained beyond k = 50, and most of the canonical correlation is already captured with just ten variables (see Fig 4A). Moreover, as we see in Fig 5A and 5B, the top-50 and top-10 models show qualitatively the same behaviour.
Fig 4. The constant lines correspond to the average performance when an optimal value k* was selected through hyperparameter tuning.
Fig 5. (A) Top-50 model canonical variables. The canonical correlation (CC) is shown above the plot. (B) Top-10 model canonical variables. (C) Canonical coefficients and pairwise contributions to the canonical correlation for the top-10 model. (*) match to a standard with isomeric forms that could not be differentiated.
The plots show the canonical variables for the two views, species composition of the metagenome and metabolomics. Each point corresponds to a sample, and the IBD status of the individual is highlighted. The samples cluster, separating IBD samples from controls. To quantify this pattern, we calculated the receiver operating characteristic curves for disease status using the sum of the canonical variables, X a + Y b, as the score: the AUC was 0.88 (IBD v Control, 0.94 for CD v Control and 0.81 for UC v Control) for the top-10 model and 0.86 for the top-50 model (0.93 for CD v Control and 0.79 for UC v Control). This suggests several things: the uncovered connections between the metagenome and the metabolome are strongly linked to disease status in the IBD data, and while this connection is potentially very high-dimensional, its core can be captured by the top variables highlighted by StabilityCCA.
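The AUC computation described here can be sketched with a rank-based (Mann-Whitney) formula, treating the sum of the two canonical variables as a one-dimensional score. The function name `roc_auc` and the toy numbers are ours. In a real analysis the score would be `X @ a + Y @ b` and the labels the disease status.

```python
import numpy as np

def roc_auc(scores, labels):
    """Probability that a positive sample scores higher than a negative one.

    Ties count as one half. labels are truthy for the positive class.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]  # all positive-negative score differences
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

# Toy example: four samples, the last two belong to the positive class.
auc = roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

Three of the four positive-negative pairs are ordered correctly here, so `auc` is 0.75. Since the sign of a canonical variable is arbitrary, an AUC below 0.5 simply means the score should be negated.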
Table 1 shows the top-10 variables and their stability scores. The highest ranked variable, species Subdoligranulum unspecified, has a score close to one, indicating that it is almost always selected. Another prominent variable is the metabolite urobilin. Fig 5C displays the canonical coefficients and the variables’ pairwise contributions: the contribution from variable j of the X-view (species) and variable k from the Y-view (metabolites) is (9) and it is the proportion of the total canonical correlation (Eq (1)) due to the correlation between j and k. Accordingly, the highest contributions come from Subdoligranulum unspecified, and highest pairwise contributions are between Subdoligranulum and urobilin.
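To see how pairwise contributions of this kind arise, the following sketch expands the numerator of Eq (1) in our own notation, where $X_j$ and $Y_k$ denote columns of X and Y. The normalisation shown is one natural choice and is an assumption; it may differ from the one used in Eq (9).

```latex
% The inner product of the canonical variables splits into pairwise terms:
\langle Xa, Yb \rangle
  = \Big\langle \sum_{j} a_j X_j, \; \sum_{k} b_k Y_k \Big\rangle
  = \sum_{j} \sum_{k} a_j b_k \langle X_j, Y_k \rangle .

% Dividing each term by the total gives shares that sum to one
% (hypothetical normalisation):
r_{jk} = \frac{a_j b_k \langle X_j, Y_k \rangle}{\langle Xa, Yb \rangle},
\qquad \sum_{j} \sum_{k} r_{jk} = 1 .
```

Under this reading, a pair contributes strongly when both coefficients are large and the two variables are themselves strongly correlated.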
Many of the top-10 variables have already been previously connected to IBD. More specifically, previous studies have shown a connection between IBD and decreases in the abundance or levels of Subdoligranulum species [26–29], urobilin [30, 31], hydrocinnamic acid [31], the genus Coprococcus [32–36], Ruminococcus bromii [27, 37], and Oscillibacter [36, 38].
A similar pattern is found when the enzymes feature set is used as the metagenomics view. Fig 6 shows the canonical variables for a top-100 and a top-10 model. Again using the sum of the variables as a score, the AUC is 0.90 (IBD v Control, 0.95 for CD v Control and 0.84 for UC v Control) for the top-100 model and 0.88 for the top-10 model (0.94 for CD v Control and 0.81 for UC v Control), showing that while in this case more top variables are needed to achieve the highest test correlations (see Fig 4B), the underlying biological pattern is already captured by just ten.
Fig 6. (A) Top-100 model canonical variables. The canonical correlation (CC) is shown above the plot. (B) Top-10 model canonical variables. (C) Canonical coefficients and pairwise contributions to the canonical correlation for the top-10 model. (*) match to a standard with isomeric forms that could not be differentiated.
Table 2 shows the top-10 variables and their stability scores. Many of the same metabolites are present. Of the new ones, lithocholate potentially inhibits inflammation [39, 40] and alpha-hydroxybutyrate has been found to be elevated in IBD [41, 42]. The canonical coefficients and pairwise contributions are shown in Fig 6C. The highest contribution is due to lithocholate and the enzyme 1.8.4.2: Protein-disulfide reductase (glutathione)—the highest ranked variable and the highest ranked enzyme variable.
Given that the same patterns were observed with both the species and the enzymes representation of the metagenome, it would be interesting to relate the two to each other—after all, they are alternative depictions of the same underlying system. We can do this directly by taking the canonical coefficients from the species-metabolites model and the enzymes-metabolites model. The result is shown in Fig 7. The resulting canonical correlation is 0.68. The highest pairwise contribution is between Subdoligranulum and the protein-disulfide reductase. Both in turn had high interactions with the same metabolites, most notably the two features matched to urobilin.
Discussion
We have presented a novel approach to variable selection in multi-view data: StabilityCCA. The method combines stability selection with sparse CCA to assess variable importance and select stable sets of variables. We tested the method with two different sparse CCA approaches, using both simulated and real data. The results show that by applying the stability selection framework, we can improve the ability to select true variables as well as the stability of the selection, regardless of the sparse CCA approach used. Importantly, in real multi-omics data, stability of selected feature sets was greatly improved by StabilityCCA.
We applied StabilityCCA to an IBD gut microbiome data set. Suggested causes for IBD include interactions between genetic and environmental factors [43]. In particular, the microbiome has been heavily implicated, with changes in microbiota composition and diversity being routinely observed. However, the exact causative mechanisms remain unknown [44, 45]. Since the metabolome acts as the interface mediating host and microbiota interaction, a metagenome-metabolome multi-omics analysis is a promising avenue to a better understanding of IBD pathology.
We found the canonical correlations between the gut metagenome and metabolome to be strongly linked to the disease. The connection was stronger for Crohn’s disease than for ulcerative colitis, in line with previous analysis of the same data [23]. With StabilityCCA, we were able to pinpoint a relatively small set of variables most responsible for this observed structure, many of which have already been linked to IBD.
While the exact aetiology of IBD remains unclear, the role of butyrate and other short-chain fatty acids (SCFAs) in reducing inflammation and maintaining the integrity of the epithelial barrier has been extensively investigated [46]. Accordingly, at least two of the microbial species highlighted by our analysis, Subdoligranulum unspecified and Ruminococcus bromii, have been identified as SCFA producers [26, 37]. The hypothesis is that a reduced abundance of these beneficial bacteria leads to decreased levels of SCFAs, which in turn lead to inflammation and epithelium breakdown.
In a similar vein, lithocholate, the top variable in the enzymes-metabolites scenario, has been proposed to inhibit epithelial apoptosis and promote barrier function [39, 40]. Another metabolite within the enzymes-metabolites top-10, alpha-hydroxybutyrate, is a potential biomarker for autoimmune disease [42]. Furthermore, it is part of propanoate metabolism: changes in propanoate and butanoate metabolism within the microbiome have been linked to IBD [45].
There are some important limitations to our analysis. As an unsupervised method, CCA, and by extension StabilityCCA, may not optimally fit settings that require supervised prediction. Also, the canonical correlations cannot be directly interpreted as causal dependencies, and hence further study is needed to provide a full picture of the underlying biological significance of the findings.
Additionally, we have focussed here on extending stability selection to CCA which is designed for two views only. There are several alternative ways to extend CCA to three or more views [47, 48], and further research is needed to understand whether our approach could be used with these methods as well.
The fact that the connection between the microbiome and IBD arises “serendipitously” from the data—meaning that disease status was not part of the analysis—arguably lends further credence both to the hypothesis that IBD is connected to the microbiota and the theory that this influence is mediated by the metabolome. By revealing a joint structure of the metagenome and the metabolome, the model points to a good starting point for future study which will hopefully shine more light on the exact mechanisms behind IBD, demonstrating the promise in multi-view analysis.
References
- 1. Hasin Y, Seldin M, Lusis A. Multi-omics approaches to disease. Genome biology. 2017;18(1):1–15. pmid:28476144
- 2. Pinu FR, Beale DJ, Paten AM, Kouremenos K, Swarup S, Schirra HJ, et al. Systems biology and multi-omics integration: viewpoints from the metabolomics research community. Metabolites. 2019;9(4):76. pmid:31003499
- 3. Nguyen ND, Wang D. Multiview learning for understanding functional multiomics. PLoS computational biology. 2020;16(4):e1007677. pmid:32240163
- 4. Reel PS, Reel S, Pearson E, Trucco E, Jefferson E. Using machine learning approaches for multi-omics data analysis: A review. Biotechnology Advances. 2021;49:107739. pmid:33794304
- 5. Picard M, Scott-Boyer MP, Bodein A, Périn O, Droit A. Integration strategies of multi-omics data for machine learning analysis. Computational and Structural Biotechnology Journal. 2021;19:3735–3746. pmid:34285775
- 6. Chong J, Xia J. Computational approaches for integrative analysis of the metabolome and microbiome. Metabolites. 2017;7(4):62. pmid:29156542
- 7. Krassowski M, Das V, Sahu SK, Misra BB. State of the field in multi-omics research: From computational needs to data mining and sharing. Frontiers in Genetics. 2020;11:610798. pmid:33362867
- 8. Bersanelli M, Mosca E, Remondini D, Giampieri E, Sala C, Castellani G, et al. Methods for the integration of multi-omics data: mathematical aspects. BMC bioinformatics. 2016;17(2):167–177. pmid:26821531
- 9. Xu H, Caramanis C, Mannor S. Sparse algorithms are not stable: A no-free-lunch theorem. IEEE transactions on pattern analysis and machine intelligence. 2011;34(1):187–193.
- 10. Bousquet O, Elisseeff A. Stability and generalization. The Journal of Machine Learning Research. 2002;2:499–526.
- 11. Zhao J, Xie X, Xu X, Sun S. Multi-view learning overview: Recent progress and new challenges. Information Fusion. 2017;38:43–54.
- 12. Meinshausen N, Bühlmann P. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2010;72(4):417–473.
- 13. Shah RD, Samworth RJ. Variable selection with error control: another look at stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2013;75(1):55–80.
- 14. Hotelling H. Relations between two sets of variates. In: Breakthroughs in statistics. Springer; 1992. p. 162–190.
- 15. Uurtio V, Monteiro JM, Kandola J, Shawe-Taylor J, Fernandez-Reyes D, Rousu J. A tutorial on canonical correlation methods. ACM Computing Surveys (CSUR). 2017;50(6):1–33.
- 16. Parkhomenko E, Tritchler D, Beyene J. Genome-wide sparse canonical correlation of gene expression with genotypes. In: BMC proceedings. vol. 1. Springer; 2007. p. 1–5.
- 17. González I, Déjean S, Martin PG, Baccini A. CCA: An R package to extend canonical correlation analysis. Journal of Statistical Software. 2008;23(12):1–14.
- 18. Witten DM, Tibshirani R, Hastie T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics. 2009;10(3):515–534. pmid:19377034
- 19. Uurtio V, Bhadra S, Rousu J. Large-scale sparse kernel canonical correlation analysis. In: International Conference on Machine Learning. PMLR; 2019. p. 6383–6391.
- 20. Parkhomenko E, Tritchler D, Beyene J. Sparse canonical correlation analysis with application to genomic data integration. Statistical applications in genetics and molecular biology. 2009;8(1). pmid:19222376
- 21. Rodosthenous T, Shahrezaei V, Evangelou M. Integrating multi-OMICS data through sparse canonical correlation analysis for the prediction of complex traits: a comparison study. Bioinformatics. 2020;36(17):4616–4625. pmid:32437529
- 22. Nogueira S, Sechidis K, Brown G. On the Stability of Feature Selection Algorithms. Journal of Machine Learning Research. 2018;18(174):1–54.
- 23. Franzosa EA, Sirota-Madi A, Avila-Pacheco J, Fornelos N, Haiser HJ, Reinker S, et al. Gut microbiome structure and metabolic activity in inflammatory bowel disease. Nature microbiology. 2019;4(2):293–305. pmid:30531976
- 24. Durbin BP, Hardin JS, Hawkins DM, Rocke DM. A variance-stabilizing transformation for gene-expression microarray data. Bioinformatics. 2002;18(suppl_1):S105–S110. pmid:12169537
- 25. Huber W, Von Heydebreck A, Sültmann H, Poustka A, Vingron M. Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics. 2002;18(suppl_1):S96–S104. pmid:12169536
- 26. Xia Y, Wang J, Fang X, Dou T, Han L, Yang C. Combined analysis of metagenomic data revealed consistent changes of gut microbiome structure and function in inflammatory bowel disease. Journal of Applied Microbiology. 2021;131(6):3018–3031. pmid:34008889
- 27. Mondot S, Kang S, Furet JP, Aguirre de Cárcer D, McSweeney C, Morrison M, et al. Highlighting new phylogenetic specificities of Crohn’s disease microbiota. Inflammatory bowel diseases. 2011;17(1):185–192. pmid:20722058
- 28. Chen D, Li Y, Sun H, Xiao M, Lv N, Liang S, et al. P854 Insights into alteration of gut microbiota in inflammatory bowel disease patients with and without Clostridium difficile infection. Journal of Crohn’s and Colitis. 2019;13(Supplement_1):S551–S552.
- 29. Pisani A, Rausch P, Ellul S, Bang C, Tabone T, Marantidis Cordina C, et al. P685 Gut microbiota in patients with Inflammatory Bowel Disease during remission. Journal of Crohn’s and Colitis. 2021;15(Supplement_1):S604–S605.
- 30. Lloyd-Price J, Arze C, Ananthakrishnan AN, Schirmer M, Avila-Pacheco J, Poon TW, et al. Multi-omics of the gut microbial ecosystem in inflammatory bowel diseases. Nature. 2019;569(7758):655–662.
- 31. Santoru ML, Piras C, Murgia A, Palmas V, Camboni T, Liggi S, et al. Cross sectional evaluation of the gut-microbiome metabolome axis in an Italian cohort of IBD patients. Scientific reports. 2017;7(1):1–14.
- 32. Turpin W, Goethel A, Bedrani L, Croitoru K. Determinants of IBD heritability: genes, bugs, and more. Inflammatory bowel diseases. 2018;24(6):1133–1148. pmid:29701818
- 33. Kong L, Lloyd-Price J, Vatanen T, Seksik P, Beaugerie L, Simon T, et al. Linking strain engraftment in fecal microbiota transplantation with maintenance of remission in Crohn’s disease. Gastroenterology. 2020;159(6):2193–2202. pmid:32860788
- 34. Nishino K, Nishida A, Inoue R, Kawada Y, Ohno M, Sakai S, et al. Analysis of endoscopic brush samples identified mucosa-associated dysbiosis in inflammatory bowel disease. Journal of gastroenterology. 2018;53(1):95–106. pmid:28852861
- 35. Shaw KA, Bertha M, Hofmekler T, Chopra P, Vatanen T, Srivatsa A, et al. Dysbiosis, inflammation, and response to treatment: a longitudinal study of pediatric subjects with newly diagnosed inflammatory bowel disease. Genome medicine. 2016;8(1):1–13. pmid:27412252
- 36. Pisani A, Rausch P, Bang C, Ellul S, Tabone T, Marantidis Cordina C, et al. Dysbiosis in the Gut Microbiota in Patients with Inflammatory Bowel Disease during Remission. Microbiology Spectrum. 2022; p. e00616–22. pmid:35532243
- 37. Rajilić-Stojanović M, Shanahan F, Guarner F, de Vos WM. Phylogenetic analysis of dysbiosis in ulcerative colitis during remission. Inflammatory bowel diseases. 2013;19(3):481–488. pmid:23385241
- 38. Papa E, Docktor M, Smillie C, Weber S, Preheim SP, Gevers D, et al. Non-invasive mapping of the gastrointestinal microbiota identifies children with inflammatory bowel disease. PloS one. 2012;7(6):e39242. pmid:22768065
- 39. Ward JB, Lajczak NK, Kelly OB, O’Dwyer AM, Giddam AK, Ní Gabhann J, et al. Ursodeoxycholic acid and lithocholic acid exert anti-inflammatory actions in the colon. American Journal of Physiology-Gastrointestinal and Liver Physiology. 2017;312(6):G550–G558. pmid:28360029
- 40. Lajczak-McGinley NK, Porru E, Fallon CM, Smyth J, Curley C, McCarron PA, et al. The secondary bile acids, ursodeoxycholic acid and lithocholic acid, protect against intestinal inflammation by inhibition of epithelial apoptosis. Physiological reports. 2020;8(12):e14456. pmid:32562381
- 41. Santoru ML, Piras C, Murgia F, Leoni VP, Spada M, Murgia A, et al. Metabolic Alteration in Plasma and Biopsies from Patients with IBD. Inflammatory Bowel Diseases. 2021;27(8):1335–1345. pmid:33512485
- 42. Tsoukalas D, Fragoulakis V, Papakonstantinou E, Antonaki M, Vozikis A, Tsatsakis A, et al. Prediction of autoimmune diseases by targeted metabolomic assay of urinary organic acids. Metabolites. 2020;10(12):502. pmid:33302528
- 43. Glassner KL, Abraham BP, Quigley EM. The microbiome and inflammatory bowel disease. Journal of Allergy and Clinical Immunology. 2020;145(1):16–27. pmid:31910984
- 44. Nagalingam NA, Lynch SV. Role of the microbiota in inflammatory bowel diseases. Inflammatory bowel diseases. 2012;18(5):968–984. pmid:21936031
- 45. Kostic AD, Xavier RJ, Gevers D. The microbiome in inflammatory bowel disease: current status and the future ahead. Gastroenterology. 2014;146(6):1489–1499. pmid:24560869
- 46. Zhuang X, Li T, Li M, Huang S, Qiu Y, Feng R, et al. Systematic review and meta-analysis: short-chain fatty acid characterization in patients with inflammatory bowel disease. Inflammatory bowel diseases. 2019;25(11):1751–1763. pmid:31498864
- 47. Kettenring JR. Canonical analysis of several sets of variables. Biometrika. 1971;58(3):433–451.
- 48. Luo Y, Tao D, Ramamohanarao K, Xu C, Wen Y. Tensor canonical correlation analysis for multi-view dimension reduction. IEEE transactions on Knowledge and Data Engineering. 2015;27(11):3111–3124.