\copyrightclause

Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

\conference

Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy

[email=christopher.irwin@uniupo.it, ] \cormark[1] \fnmark[2] [email=flavio.mignone@smartseq.it, ] [email=stefania.montani@uniupo.it, ] [email=luigi.portinale@uniupo.it, ]

\cortext

[1]Corresponding author. \fntext[1]Authors are listed in alphabetical order. \fntext[2]Phd student enrolled in the National PhD in Artificial Intelligence for Health and Life Sciences, XXXVIII cycle, Università Campus Bio-Medico, Roma.

Graph Neural Networks for Gut Microbiome Metaomic data: A preliminary work

Christopher Irwin Flavio Mignone Stefania Montani Luigi Portinale DISIT, Computer Science Institute, University of Piemonte Orientale, Alessandria, Italy SmartSeq s.r.l., Alessandria, Italy Interdepartmental Research Center on Artificial Intelligence (AI@UPO), University of Piemonte Orientale, 15121 Alessandria, Italy

(2024)

Abstract

The gut microbiome, crucial for human health, presents challenges in analyzing its complex metaomic data due to high dimensionality and sparsity. Traditional methods struggle to capture its intricate relationships. We investigate graph neural networks (GNNs) for this task, aiming to derive meaningful representations of individual gut microbiomes. Unlike methods relying solely on taxa abundance, we directly leverage phylogenetic relationships, in order to obtain a generalized encoder for taxa networks. The representation learnt from the encoder are then used to train a model for phenotype prediction such as Inflammatory Bowel Disease (IBD).

keywords:

Multiomics \sepMicrobiome \sepGraph Representation Learning \sep

1 Introduction

The gut microbiome, a community of microorganisms residing within the human gut, plays a significant role in health and disease. Advancements in sequencing have yielded vast amounts of metaomic data, detailing the gut microbiome’s composition and function. However, effectively analyzing this high-dimensional data remains a challenge. Traditional methods often fail to capture the intricate relationships between the diverse microbial species within the microbiome.

This work explores the potential of graph neural networks (GNNs) for analyzing gut microbiome metaomic data. GNNs, a powerful class of deep learning models adept at handling graph-structured data, hold promise for unraveling the complex relationships within the microbiome. Our primary objective is to develop a method that utilizes metaomic data to obtain a meaningful representation of a patient’s gut microbiome. In particular, we begin by constructing a network that captures the relationships between genes, species, and genus. This network is built using the gene expression levels obtained from all patient samples. By leveraging GNN techniques, we then learn latent representations (embeddings) for each entity within the network. These embeddings capture the underlying functional relationships between genes, species, and genus. Finally, we aggregate these embeddings based on a patient’s specific gene expression profile to obtain a global representation of their unique gut microbiome. This patient-specific microbiome representation can then be used by a classifier to predict a particular phenotype, such as the presence or absence of Inflammatory Bowel Disease (IBD) as done in the present work.

Recently, a wide rage of methods have been proposed to analyse microbiome data and use it to obtain predictive models for biomarkers discovery and classifying patients based on some phenotype. Some of these methods rely on the use of the microbiome features like the relative abundance of taxa. However, such approaches can be limited by the inherent nature of microbiome datasets: sparsity and high dimensionality (e.g., the presence of many features). This may lead to overfitting, with models failing to generalize to other datasets [1].

In order to address these kind of limitations, many works try to leverage the intrinsic phylogenetic information between taxa [2]. For example PopPhy-CNN [3] uses a CNN-based model that works on the 2D matrix derived from the phylogenetic tree representation. MIOSTONE [1] uses a gated phylogenetic encoded neural network. TaxoNN [4] leverages an ensemble of CNNs each specializing to particular taxonomic phylum.

Our method directly builds upon the concept of leveraging phylogenetic relationships to create a meaningful representation of the microbiome. We achieve this by employing a GNN-based method to learn a generalized encoder specifically for the network of taxa. This encoder is trained independently from the downstream classification task, which is handled by a separate model.

2 Dataset

The dataset in use is the IBDMDB which is described in [5] related to a IBD cohort of patients. The database contains observations coming from 2 different omic levels: microbiome metagenomics (MGX) and metatranscriptomics (MTX) gene expression levels (expressed in count per milion - CPMs) along with some other information related to the patient state and the presence of absence of IBD.

Different omic levels have different number of samples and different number of features, and in particular:

•

Metagenomics: 1635 samples with 108433 features.
•

Metatranscriptomics: 736 samples with 70711 features.

Regarding the patient labels, these are related to the presence of absence of IDB (binary). Notably, there are 1594 patients with labels. The dataset has a quite substantial skew, since there are 83% negative and 17% positive samples.

The dataset has a large dimensionality and a different rate of missing samples. To this end, our approach is to use methods which can deal with missing samples for a particular omic-level and a different number of features. Moreover, as already expressed in the previous section, when dealing with metaomics, it can potentially be useful to being able to model an interaction between the different microorganisms.

Refer to caption — Figure 1: Overview of the proposed method: input data characterized by microbiome gene expression across different omic levels (left). Extracted graph connecting genes based on species and genus (center). Application of the GNN module to contextualize node representations: the set of node embeddings representing a patient are aggregated by taking in to account the relative abundance expressed in CPMs and fed to the final classifier predicting IBD presence (bottom)

3 Proposed Method

The main task we want to carry out is to obtain a meaningful representation of a patient using the meta-omic data coming from the analysis of gut microbiome samples. To do this, the idea is to establish relationships among the various microorganisms present in the microbiome, whose presence is determined by the abundance of enzymes/genes that characterize them, using the relational information of species and genus encoded within a graph. Moreover, since it is common to have incomplete datasets (either patients may have different gene expression or missing data may come from a particular omic level), we model our task and the way of treating the data in a way that allows transparency from this point of view. Figure 1 gives an overview of the proposed pipeline.

The following subsections will describe how to get the representation of a patient starting from the metagenomic level. The same technique can be applied in the same way to other omic levels and seamlessly integrated in this pipeline.

3.1 Graph Construction

Given a set of patients with gene expression levels of gut microbiome microorganisms, and associated data on enzymes, species, and genus (phylogenetic information) for each microorganism, our aim is to construct a graph $\mathcal{G}=(U,V)$ where:

•

$U$ denotes the set of nodes, encompassing enzymes, species, and genus.
•

$V$ represents the set of edges, categorized into two types: (enzyme, species) and (species, genus).

Once the phylogenetic graph is constructed, we can represent each patient $P_{i}$ as a subset of nodes $U_{i}\subseteq U$ , where the patient exhibits a gene expression level greater than 0.

3.2 Graph Representation Learning Module

Given a graph $\mathcal{G}$ that encodes the relationships between genes, we aim to learn a representation (embedding) of the nodes that reflects their connectivity within the network. To achieve this, we leverage established methods from the field of graph representation learning, specifically: Graph Laplacian Eigenvector Positional Encoding, Random Walk Positional Encoding, and Node2Vec.

Here, we provide a concise description of each method.

Graph Laplacian Eigenvector Positional Encoding (LPE): this technique relies on the factorization of the graph’s Laplacian matrix. The embedding for a node is defined by the k-smallest, non-trivial eigenvectors associated with that node [6].

Random Walk Positional Encoding (RWPE): This approach, as described in [6], utilizes the random walk matrix to generate node embeddings.

Node2Vec (N2V): This method, presented in [7], is a semi-supervised algorithm that leverages random walks on the graph to learn meaningful features for the nodes.

3.3 Aggregation Function

Having obtained embeddings for all the nodes in the graph, we now require a methodology to aggregate a subset of these embeddings into a single vector that represents a patient. Two aggregation levels are employed for this purpose.

The first level of aggregation aims to generate a flattened representation for each gene. Here, a gene’s embedding is calculated as the mean of the embeddings associated with its corresponding phylogenetic subgraph.

The second level of aggregation focuses on converting the set of genes expressed by a patient into a single embedding representation. This is achieved by utilizing a hyperparameter that determines the number of most highly expressed genes considered for the aggregation process. The optimal value of this hyperparameter will be investigated in Section 4.3.

3.4 Different Omic Levels Integration

Previous sections have outlined the methodology for transforming gene abundance data (obtained from a metagenomic analysis) of a patient’s microbiome into a single patient representation. However, incorporating metatranscriptomic features is straightforward. We simply include genes identified at the transcript level during the construction of the phylogenetic graph, these additional nodes are then considered when generating the patient representation.

4 Experimental results

To evaluate the performance of our proposed method, we conducted preliminary experiments using the IBD dataset described in Section 2. These experiments aimed to investigate the method’s effectiveness in predicting the presence or absence of IBD under various parameter settings. We addressed the following key questions:

Q1.

Node Embedding Comparison: which node embedding technique (LPE, RWPE, N2V) yields the best performance?
Q2.

Impact of Multi-Omics Integration: how does the model’s performance vary when using different omic levels as input: metagenomics only versus metagenomics combined with metatranscriptomics?
Q3.

Gene Selection Strategy: how does the number of genes considered during patient representation generation impact the final model performance?

For the final classification task utilizing patient representations as input features, we employed a Support Vector Machine (SVM) classifier using a Radial Basis Function (RBF). Our choice of a relatively simple model reflects our primary focus on evaluating the efficacy of the graph representation techniques.

4.1 Evaluation of Node Embedding Techniques (Q1)

To assess the performance of different node embedding techniques (LPE, RWPE, N2V), we employed an 80/10/10 train-validation-test split. For each technique, we optimized model hyperparameters using the Optuna framework [8]. Following hyperparameter tuning, each optimal model configuration was evaluated over independent runs, utilizing randomized train-validation-test splits and random initialization. Table 1 reports the results related to Q1. It is clear from such results that the best choices are LPE and N2V, which perform very similarly on both datasets. While RWPE significantly underperforms with repsect to other methods.

4.2 Impact of Omic Levels (Q2)

To assess the influence of data derived from different omic levels on model performance (Q2), we trained two distinct models. The first model used only data from the MGX level, while the second incorporated data from both metagenomics (MGX) and metatrascriptomics (MTX) levels. Importantly, the evaluation for Q2 employed the same experimental setup established for Q1 (detailed in the previous section). The corresponding results for both models are presented in Table 1. From the results it is possible to notice that the model trained using only MGX data sligthly underperforms with respect to the model that includes MTX data.

Table 1: Results obtained by the models using different methods for encoding the phylogenetic graph

Dataset	F1 score			ROC AUC
Dataset	LPE	RWPE	N2V	LPE	RWPE	N2V
MGX only	0.839	0.664	0.837	0.908	0.709	0.914
MGX+MTX	0.871	0.681	0.873	0.929	0.721	0.923

4.3 Gene Selection Strategy (Q3)

Relatively to Q3, we focused on the model trained using the LPE encoding for the phylogenetic graph. To obtain the representation of a patient, we trained the model using as features a different number $k$ of genes. In particular, we considered the top- $k$ most expressed genes. Moreover, we also compared the results between the two datasets (MGX-only, MGX+MTX). As shown in Figure 2, the model’s ROC-AUC initially increases with the number of features, peaking at a certain point before a slight decrease and stabilization. This likely reflects the importance of highly expressed genes for class separation. Genes with lower expression levels are likely to carry less information. Interestingly, despite a small difference, the optimal number of genes differs between the datasets. The MGX-only model achieves its peak ROC-AUC with 3568 genes, while the MGX+MTX model peaks at 3366. This might suggest that the MTX data provides additional useful information, resulting in a smaller number of necessary genes.

5 Conclusion and future work

This paper proposes a method leveraging Graph Neural Networks (GNNs) to extract phylogenetic information from microbiome metagenomics data. The method aims to create a generalizable encoder for the microbiome, enabling its application to phenotype classification tasks. The approach achieves good performance on real-world datasets even with simple classification models, although it falls short of models specifically designed for classification tasks [9]. Future work will involve:

•

Expanding comparisons with other methods.
•

Evaluating performance on additional datasets.
•

Exploring the use of clustering for phenotype classification using the learned encoder.

Acknowledgements.

We acknowledge the use of the Chameleon Cloud platform [10] for providing the essential computational resources that facilitated the execution of our experiments and model training.

References

Jiang et al. [2024] Y. Jiang, M. Aton, Q. Zhu, Y. Y. Lu, Miostone: Modeling microbiome-trait associations with taxonomy-adaptive neural networks, bioRxiv (2024). URL: https://www.biorxiv.org/content/early/2024/02/04/2023.11.04.565596. doi:10.1101/2023.11.04.565596.
Hernández Medina et al. [2022] R. Hernández Medina, S. Kutuzova, K. N. Nielsen, J. Johansen, L. H. Hansen, M. Nielsen, S. Rasmussen, Machine learning and deep learning applications in microbiome research, ISME Communications 2 (2022) 98. URL: https://doi.org/10.1038/s43705-022-00182-9. doi:10.1038/s43705-022-00182-9.
Reiman et al. [2020] D. Reiman, A. A. Metwally, J. Sun, Y. Dai, Popphy-cnn: A phylogenetic tree embedded architecture for convolutional neural networks to predict host phenotype from metagenomic data, IEEE Journal of Biomedical and Health Informatics 24 (2020) 2993–3001. doi:10.1109/JBHI.2020.2993761.
Sharma et al. [2020] D. Sharma, A. D. Paterson, W. Xu, TaxoNN: ensemble of neural networks on stratified microbiome data for disease prediction, Bioinformatics 36 (2020) 4544–4550. URL: https://doi.org/10.1093/bioinformatics/btaa542. doi:10.1093/bioinformatics/btaa542.
Lloyd-Price et al. [2019] J. Lloyd-Price, C. Arze, A. N. Ananthakrishnan, M. Schirmer, J. Avila-Pacheco, T. W. Poon, E. Andrews, N. J. Ajami, K. S. Bonham, C. J. Brislawn, et al., Multi-omics of the gut microbial ecosystem in inflammatory bowel diseases, Nature 569 (2019) 655–662.
Dwivedi et al. [2021] V. P. Dwivedi, A. T. Luu, T. Laurent, Y. Bengio, X. Bresson, Graph neural networks with learnable structural and positional representations, arXiv preprint arXiv:2110.07875 (2021).
Grover and Leskovec [2016] A. Grover, J. Leskovec, node2vec: Scalable feature learning for networks, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, Association for Computing Machinery, New York, NY, USA, 2016, p. 855–864. URL: https://doi.org/10.1145/2939672.2939754. doi:10.1145/2939672.2939754.
Akiba et al. [2019] T. Akiba, S. Sano, T. Yanase, T. Ohta, M. Koyama, Optuna: A next-generation hyperparameter optimization framework, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2019.
Hu et al. [2023] M. Hu, J. Zhu, G. Peng, W. Lu, H. Wang, Z. Xie, IMOVNN: incomplete multi-omics data integration variational neural networks for gut microbiome disease prediction and biomarker identification, Briefings in Bioinformatics 24 (2023) bbad394. URL: https://doi.org/10.1093/bib/bbad394. doi:10.1093/bib/bbad394.
Keahey et al. [2020] K. Keahey, J. Anderson, Z. Zhen, P. Riteau, P. Ruth, D. Stanzione, M. Cevik, J. Colleran, H. S. Gunawi, C. Hammock, J. Mambretti, A. Barnes, F. Halbach, A. Rocha, J. Stubbs, Lessons learned from the chameleon testbed, in: Proceedings of the 2020 USENIX Annual Technical Conference (USENIX ATC ’20), USENIX Association, 2020.