-
Generating Multi-Modal and Multi-Attribute Single-Cell Counts with CFGen
Authors:
Alessandro Palma,
Till Richter,
Hanyi Zhang,
Manuel Lubetzki,
Alexander Tong,
Andrea Dittadi,
Fabian Theis
Abstract:
Generative modeling of single-cell RNA-seq data has shown invaluable potential in community-driven tasks such as trajectory inference, batch effect removal and gene expression generation. However, most recent deep models generating synthetic single cells from noise operate on pre-processed continuous gene expression approximations, ignoring the inherently discrete and over-dispersed nature of sing…
▽ More
Generative modeling of single-cell RNA-seq data has shown invaluable potential in community-driven tasks such as trajectory inference, batch effect removal and gene expression generation. However, most recent deep models generating synthetic single cells from noise operate on pre-processed continuous gene expression approximations, ignoring the inherently discrete and over-dispersed nature of single-cell data, which limits downstream applications and hinders the incorporation of robust noise models. Moreover, crucial aspects of deep-learning-based synthetic single-cell generation remain underexplored, such as controllable multi-modal and multi-label generation and its role in the performance enhancement of downstream tasks. This work presents Cell Flow for Generation (CFGen), a flow-based conditional generative model for multi-modal single-cell counts, which explicitly accounts for the discrete nature of the data. Our results suggest improved recovery of crucial biological data characteristics while accounting for novel generative tasks such as conditioning on multiple attributes and boosting rare cell type classification via data augmentation. By showcasing CFGen on a diverse set of biological datasets and settings, we provide evidence of its value to the fields of computational biology and deep generative models.
△ Less
Submitted 16 July, 2024;
originally announced July 2024.
-
Disentangled Representation Learning through Geometry Preservation with the Gromov-Monge Gap
Authors:
Théo Uscidda,
Luca Eyring,
Karsten Roth,
Fabian Theis,
Zeynep Akata,
Marco Cuturi
Abstract:
Learning disentangled representations in an unsupervised manner is a fundamental challenge in machine learning. Solving it may unlock other problems, such as generalization, interpretability, or fairness. While remarkably difficult to solve in general, recent works have shown that disentanglement is provably achievable under additional assumptions that can leverage geometrical constraints, such as…
▽ More
Learning disentangled representations in an unsupervised manner is a fundamental challenge in machine learning. Solving it may unlock other problems, such as generalization, interpretability, or fairness. While remarkably difficult to solve in general, recent works have shown that disentanglement is provably achievable under additional assumptions that can leverage geometrical constraints, such as local isometry. To use these insights, we propose a novel perspective on disentangled representation learning built on quadratic optimal transport. Specifically, we formulate the problem in the Gromov-Monge setting, which seeks isometric mappings between distributions supported on different spaces. We propose the Gromov-Monge-Gap (GMG), a regularizer that quantifies the geometry-preservation of an arbitrary push-forward map between two distributions supported on different spaces. We demonstrate the effectiveness of GMG regularization for disentanglement on four standard benchmarks. Moreover, we show that geometry preservation can even encourage unsupervised disentanglement without the standard reconstruction objective - making the underlying model decoder-free, and promising a more practically viable and scalable perspective on unsupervised disentanglement.
△ Less
Submitted 10 July, 2024;
originally announced July 2024.
-
B-Cos Aligned Transformers Learn Human-Interpretable Features
Authors:
Manuel Tran,
Amal Lahiani,
Yashin Dicente Cid,
Melanie Boxberg,
Peter Lienemann,
Christian Matek,
Sophia J. Wagner,
Fabian J. Theis,
Eldad Klaiman,
Tingying Peng
Abstract:
Vision Transformers (ViTs) and Swin Transformers (Swin) are currently state-of-the-art in computational pathology. However, domain experts are still reluctant to use these models due to their lack of interpretability. This is not surprising, as critical decisions need to be transparent and understandable. The most common approach to understanding transformers is to visualize their attention. Howev…
▽ More
Vision Transformers (ViTs) and Swin Transformers (Swin) are currently state-of-the-art in computational pathology. However, domain experts are still reluctant to use these models due to their lack of interpretability. This is not surprising, as critical decisions need to be transparent and understandable. The most common approach to understanding transformers is to visualize their attention. However, attention maps of ViTs are often fragmented, leading to unsatisfactory explanations. Here, we introduce a novel architecture called the B-cos Vision Transformer (BvT) that is designed to be more interpretable. It replaces all linear transformations with the B-cos transform to promote weight-input alignment. In a blinded study, medical experts clearly ranked BvTs above ViTs, suggesting that our network is better at capturing biomedically relevant structures. This is also true for the B-cos Swin Transformer (Bwin). Compared to the Swin Transformer, it even improves the F1-score by up to 4.7% on two public datasets.
△ Less
Submitted 18 January, 2024; v1 submitted 16 January, 2024;
originally announced January 2024.
-
Unbalancedness in Neural Monge Maps Improves Unpaired Domain Translation
Authors:
Luca Eyring,
Dominik Klein,
Théo Uscidda,
Giovanni Palla,
Niki Kilbertus,
Zeynep Akata,
Fabian Theis
Abstract:
In optimal transport (OT), a Monge map is known as a mapping that transports a source distribution to a target distribution in the most cost-efficient way. Recently, multiple neural estimators for Monge maps have been developed and applied in diverse unpaired domain translation tasks, e.g. in single-cell biology and computer vision. However, the classic OT framework enforces mass conservation, whi…
▽ More
In optimal transport (OT), a Monge map is known as a mapping that transports a source distribution to a target distribution in the most cost-efficient way. Recently, multiple neural estimators for Monge maps have been developed and applied in diverse unpaired domain translation tasks, e.g. in single-cell biology and computer vision. However, the classic OT framework enforces mass conservation, which makes it prone to outliers and limits its applicability in real-world scenarios. The latter can be particularly harmful in OT domain translation tasks, where the relative position of a sample within a distribution is explicitly taken into account. While unbalanced OT tackles this challenge in the discrete setting, its integration into neural Monge map estimators has received limited attention. We propose a theoretically grounded method to incorporate unbalancedness into any Monge map estimator. We improve existing estimators to model cell trajectories over time and to predict cellular responses to perturbations. Moreover, our approach seamlessly integrates with the OT flow matching (OT-FM) framework. While we show that OT-FM performs competitively in image translation, we further improve performance by incorporating unbalancedness (UOT-FM), which better preserves relevant features. We hence establish UOT-FM as a principled method for unpaired image translation.
△ Less
Submitted 11 March, 2024; v1 submitted 25 November, 2023;
originally announced November 2023.
-
To Transformers and Beyond: Large Language Models for the Genome
Authors:
Micaela E. Consens,
Cameron Dufault,
Michael Wainberg,
Duncan Forster,
Mehran Karimzadeh,
Hani Goodarzi,
Fabian J. Theis,
Alan Moses,
Bo Wang
Abstract:
In the rapidly evolving landscape of genomics, deep learning has emerged as a useful tool for tackling complex computational challenges. This review focuses on the transformative role of Large Language Models (LLMs), which are mostly based on the transformer architecture, in genomics. Building on the foundation of traditional convolutional neural networks and recurrent neural networks, we explore…
▽ More
In the rapidly evolving landscape of genomics, deep learning has emerged as a useful tool for tackling complex computational challenges. This review focuses on the transformative role of Large Language Models (LLMs), which are mostly based on the transformer architecture, in genomics. Building on the foundation of traditional convolutional neural networks and recurrent neural networks, we explore both the strengths and limitations of transformers and other LLMs for genomics. Additionally, we contemplate the future of genomic modeling beyond the transformer architecture based on current trends in research. The paper aims to serve as a guide for computational biologists and computer scientists interested in LLMs for genomic data. We hope the paper can also serve as an educational introduction and discussion for biologists to a fundamental shift in how we will be analyzing genomic data in the future.
△ Less
Submitted 12 November, 2023;
originally announced November 2023.
-
Mixed Models with Multiple Instance Learning
Authors:
Jan P. Engelmann,
Alessandro Palma,
Jakub M. Tomczak,
Fabian J. Theis,
Francesco Paolo Casale
Abstract:
Predicting patient features from single-cell data can help identify cellular states implicated in health and disease. Linear models and average cell type expressions are typically favored for this task for their efficiency and robustness, but they overlook the rich cell heterogeneity inherent in single-cell data. To address this gap, we introduce MixMIL, a framework integrating Generalized Linear…
▽ More
Predicting patient features from single-cell data can help identify cellular states implicated in health and disease. Linear models and average cell type expressions are typically favored for this task for their efficiency and robustness, but they overlook the rich cell heterogeneity inherent in single-cell data. To address this gap, we introduce MixMIL, a framework integrating Generalized Linear Mixed Models (GLMM) and Multiple Instance Learning (MIL), upholding the advantages of linear models while modeling cell state heterogeneity. By leveraging predefined cell embeddings, MixMIL enhances computational efficiency and aligns with recent advancements in single-cell representation learning. Our empirical results reveal that MixMIL outperforms existing MIL models in single-cell datasets, uncovering new associations and elucidating biological mechanisms across different domains.
△ Less
Submitted 8 March, 2024; v1 submitted 4 November, 2023;
originally announced November 2023.
-
Causal machine learning for single-cell genomics
Authors:
Alejandro Tejada-Lapuerta,
Paul Bertin,
Stefan Bauer,
Hananeh Aliee,
Yoshua Bengio,
Fabian J. Theis
Abstract:
Advances in single-cell omics allow for unprecedented insights into the transcription profiles of individual cells. When combined with large-scale perturbation screens, through which specific biological mechanisms can be targeted, these technologies allow for measuring the effect of targeted perturbations on the whole transcriptome. These advances provide an opportunity to better understand the ca…
▽ More
Advances in single-cell omics allow for unprecedented insights into the transcription profiles of individual cells. When combined with large-scale perturbation screens, through which specific biological mechanisms can be targeted, these technologies allow for measuring the effect of targeted perturbations on the whole transcriptome. These advances provide an opportunity to better understand the causative role of genes in complex biological processes such as gene regulation, disease progression or cellular development. However, the high-dimensional nature of the data, coupled with the intricate complexity of biological systems renders this task nontrivial. Within the machine learning community, there has been a recent increase of interest in causality, with a focus on adapting established causal techniques and algorithms to handle high-dimensional data. In this perspective, we delineate the application of these methodologies within the realm of single-cell genomics and their challenges. We first present the model that underlies most of current causal approaches to single-cell biology and discuss and challenge the assumptions it entails from the biological point of view. We then identify open problems in the application of causal approaches to single-cell data: generalising to unseen environments, learning interpretable models, and learning causal models of dynamics. For each problem, we discuss how various research directions - including the development of computational approaches and the adaptation of experimental protocols - may offer ways forward, or on the contrary pose some difficulties. With the advent of single cell atlases and increasing perturbation data, we expect causal models to become a crucial tool for informed experimental design.
△ Less
Submitted 23 October, 2023;
originally announced October 2023.
-
Entropic (Gromov) Wasserstein Flow Matching with GENOT
Authors:
Dominik Klein,
Théo Uscidda,
Fabian Theis,
Marco Cuturi
Abstract:
Optimal transport (OT) theory has reshaped the field of generative modeling: Combined with neural networks, recent \textit{Neural OT} (N-OT) solvers use OT as an inductive bias, to focus on ``thrifty'' mappings that minimize average displacement costs. This core principle has fueled the successful application of N-OT solvers to high-stakes scientific challenges, notably single-cell genomics. N-OT…
▽ More
Optimal transport (OT) theory has reshaped the field of generative modeling: Combined with neural networks, recent \textit{Neural OT} (N-OT) solvers use OT as an inductive bias, to focus on ``thrifty'' mappings that minimize average displacement costs. This core principle has fueled the successful application of N-OT solvers to high-stakes scientific challenges, notably single-cell genomics. N-OT solvers are, however, increasingly confronted with practical challenges: while most N-OT solvers can handle squared-Euclidean costs, they must be repurposed to handle more general costs; their reliance on deterministic Monge maps as well as mass conservation constraints can easily go awry in the presence of outliers; mapping points \textit{across} heterogeneous spaces is out of their reach. While each of these challenges has been explored independently, we propose a new framework that can handle, natively, all of these needs. The \textit{generative entropic neural OT} (GENOT) framework models the conditional distribution $π_\varepsilon(\*y|\*x)$ of an optimal \textit{entropic} coupling $π_\varepsilon$, using conditional flow matching. GENOT is generative, and can transport points \textit{across} spaces, guided by sample-based, unbalanced solutions to the Gromov-Wasserstein problem, that can use any cost. We showcase our approach on both synthetic and single-cell datasets, using GENOT to model cell development, predict cellular responses, and translate between data modalities.
△ Less
Submitted 12 March, 2024; v1 submitted 13 October, 2023;
originally announced October 2023.
-
Conditionally Invariant Representation Learning for Disentangling Cellular Heterogeneity
Authors:
Hananeh Aliee,
Ferdinand Kapl,
Soroor Hediyeh-Zadeh,
Fabian J. Theis
Abstract:
This paper presents a novel approach that leverages domain variability to learn representations that are conditionally invariant to unwanted variability or distractors. Our approach identifies both spurious and invariant latent features necessary for achieving accurate reconstruction by placing distinct conditional priors on latent features. The invariant signals are disentangled from noise by enf…
▽ More
This paper presents a novel approach that leverages domain variability to learn representations that are conditionally invariant to unwanted variability or distractors. Our approach identifies both spurious and invariant latent features necessary for achieving accurate reconstruction by placing distinct conditional priors on latent features. The invariant signals are disentangled from noise by enforcing independence which facilitates the construction of an interpretable model with a causal semantic. By exploiting the interplay between data domains and labels, our method simultaneously identifies invariant features and builds invariant predictors. We apply our method to grand biological challenges, such as data integration in single-cell genomics with the aim of capturing biological variations across datasets with many samples, obtained from different conditions or multiple laboratories. Our approach allows for the incorporation of specific biological mechanisms, including gene programs, disease states, or treatment conditions into the data integration process, bridging the gap between the theoretical assumptions and real biological applications. Specifically, the proposed approach helps to disentangle biological signals from data biases that are unrelated to the target task or the causal explanation of interest. Through extensive benchmarking using large-scale human hematopoiesis and human lung cancer data, we validate the superiority of our approach over existing methods and demonstrate that it can empower deeper insights into cellular heterogeneity and the identification of disease cell states.
△ Less
Submitted 2 July, 2023;
originally announced July 2023.
-
The power of motifs as inductive bias for learning molecular distributions
Authors:
Johanna Sommer,
Leon Hetzel,
David Lüdke,
Fabian Theis,
Stephan Günnemann
Abstract:
Machine learning for molecules holds great potential for efficiently exploring the vast chemical space and thus streamlining the drug discovery process by facilitating the design of new therapeutic molecules. Deep generative models have shown promising results for molecule generation, but the benefits of specific inductive biases for learning distributions over small graphs are unclear. Our study…
▽ More
Machine learning for molecules holds great potential for efficiently exploring the vast chemical space and thus streamlining the drug discovery process by facilitating the design of new therapeutic molecules. Deep generative models have shown promising results for molecule generation, but the benefits of specific inductive biases for learning distributions over small graphs are unclear. Our study aims to investigate the impact of subgraph structures and vocabulary design on distribution learning, using small drug molecules as a case study. To this end, we introduce Subcover, a new subgraph-based fragmentation scheme, and evaluate it through a two-step variational auto-encoder. Our results show that Subcover's improved identification of chemically meaningful subgraphs leads to a relative improvement of the FCD score by 30%, outperforming previous methods. Our findings highlight the potential of Subcover to enhance the performance and scalability of existing methods, contributing to the advancement of drug discovery.
△ Less
Submitted 4 April, 2023;
originally announced June 2023.
-
MAGNet: Motif-Agnostic Generation of Molecules from Shapes
Authors:
Leon Hetzel,
Johanna Sommer,
Bastian Rieck,
Fabian Theis,
Stephan Günnemann
Abstract:
Recent advances in machine learning for molecules exhibit great potential for facilitating drug discovery from in silico predictions. Most models for molecule generation rely on the decomposition of molecules into frequently occurring substructures (motifs), from which they generate novel compounds. While motif representations greatly aid in learning molecular distributions, such methods struggle…
▽ More
Recent advances in machine learning for molecules exhibit great potential for facilitating drug discovery from in silico predictions. Most models for molecule generation rely on the decomposition of molecules into frequently occurring substructures (motifs), from which they generate novel compounds. While motif representations greatly aid in learning molecular distributions, such methods struggle to represent substructures beyond their known motif set. To alleviate this issue and increase flexibility across datasets, we propose MAGNet, a graph-based model that generates abstract shapes before allocating atom and bond types. To this end, we introduce a novel factorisation of the molecules' data distribution that accounts for the molecules' global context and facilitates learning adequate assignments of atoms and bonds onto shapes. Despite the added complexity of shape abstractions, MAGNet outperforms most other graph-based approaches on standard benchmarks. Importantly, we demonstrate that MAGNet's improved expressivity leads to molecules with more topologically distinct structures and, at the same time, diverse atom and bond assignments.
△ Less
Submitted 7 November, 2023; v1 submitted 30 May, 2023;
originally announced May 2023.
-
Training Transitive and Commutative Multimodal Transformers with LoReTTa
Authors:
Manuel Tran,
Yashin Dicente Cid,
Amal Lahiani,
Fabian J. Theis,
Tingying Peng,
Eldad Klaiman
Abstract:
Training multimodal foundation models is challenging due to the limited availability of multimodal datasets. While many public datasets pair images with text, few combine images with audio or text with audio. Even rarer are datasets that align all three modalities at once. Critical domains such as healthcare, infrastructure, or transportation are particularly affected by missing modalities. This m…
▽ More
Training multimodal foundation models is challenging due to the limited availability of multimodal datasets. While many public datasets pair images with text, few combine images with audio or text with audio. Even rarer are datasets that align all three modalities at once. Critical domains such as healthcare, infrastructure, or transportation are particularly affected by missing modalities. This makes it difficult to integrate all modalities into a large pre-trained neural network that can be used out-of-the-box or fine-tuned for different downstream tasks. We introduce LoReTTa (Linking mOdalities with a tRansitive and commutativE pre-Training sTrAtegy) to address this understudied problem. Our self-supervised framework unifies causal modeling and masked modeling with the rules of commutativity and transitivity. This allows us to transition within and between modalities. As a result, our pre-trained models are better at exploring the true underlying joint probability distribution. Given a dataset containing only the disjoint combinations (A, B) and (B, C), LoReTTa can model the relation A <-> C with A <-> B <-> C. In particular, we show that a transformer pre-trained with LoReTTa can handle any mixture of modalities at inference time, including the never-seen pair (A, C) and the triplet (A, B, C). We extensively evaluate our approach on a synthetic, medical, and reinforcement learning dataset. Across different domains, our universal multimodal transformer consistently outperforms strong baselines such as GPT, BERT, and CLIP on tasks involving the missing modality tuple.
△ Less
Submitted 16 January, 2024; v1 submitted 23 May, 2023;
originally announced May 2023.
-
Uncertainty Quantification for Atlas-Level Cell Type Transfer
Authors:
Jan Engelmann,
Leon Hetzel,
Giovanni Palla,
Lisa Sikkema,
Malte Luecken,
Fabian Theis
Abstract:
Single-cell reference atlases are large-scale, cell-level maps that capture cellular heterogeneity within an organ using single cell genomics. Given their size and cellular diversity, these atlases serve as high-quality training data for the transfer of cell type labels to new datasets. Such label transfer, however, must be robust to domain shifts in gene expression due to measurement technique, l…
▽ More
Single-cell reference atlases are large-scale, cell-level maps that capture cellular heterogeneity within an organ using single cell genomics. Given their size and cellular diversity, these atlases serve as high-quality training data for the transfer of cell type labels to new datasets. Such label transfer, however, must be robust to domain shifts in gene expression due to measurement technique, lab specifics and more general batch effects. This requires methods that provide uncertainty estimates on the cell type predictions to ensure correct interpretation. Here, for the first time, we introduce uncertainty quantification methods for cell type classification on single-cell reference atlases. We benchmark four model classes and show that currently used models lack calibration, robustness, and actionable uncertainty scores. Furthermore, we demonstrate how models that quantify uncertainty are better suited to detect unseen cell types in the setting of atlas-level cell type transfer.
△ Less
Submitted 7 November, 2022;
originally announced November 2022.
-
Sparsity in Continuous-Depth Neural Networks
Authors:
Hananeh Aliee,
Till Richter,
Mikhail Solonin,
Ignacio Ibarra,
Fabian Theis,
Niki Kilbertus
Abstract:
Neural Ordinary Differential Equations (NODEs) have proven successful in learning dynamical systems in terms of accurately recovering the observed trajectories. While different types of sparsity have been proposed to improve robustness, the generalization properties of NODEs for dynamical systems beyond the observed data are underexplored. We systematically study the influence of weight and featur…
▽ More
Neural Ordinary Differential Equations (NODEs) have proven successful in learning dynamical systems in terms of accurately recovering the observed trajectories. While different types of sparsity have been proposed to improve robustness, the generalization properties of NODEs for dynamical systems beyond the observed data are underexplored. We systematically study the influence of weight and feature sparsity on forecasting as well as on identifying the underlying dynamical laws. Besides assessing existing methods, we propose a regularization technique to sparsify "input-output connections" and extract relevant features during training. Moreover, we curate real-world datasets consisting of human motion capture and human hematopoiesis single-cell RNA-seq data to realistically analyze different levels of out-of-distribution (OOD) generalization in forecasting and dynamics identification respectively. Our extensive empirical evaluation on these challenging benchmarks suggests that weight sparsity improves generalization in the presence of noise or irregular sampling. However, it does not prevent learning spurious feature dependencies in the inferred dynamics, rendering them impractical for predictions under interventions, or for inferring the true underlying dynamics. Instead, feature sparsity can indeed help with recovering sparse ground-truth dynamics compared to unregularized NODEs.
△ Less
Submitted 26 October, 2022;
originally announced October 2022.
-
Noise transfer for unsupervised domain adaptation of retinal OCT images
Authors:
Valentin Koch,
Olle Holmberg,
Hannah Spitzer,
Johannes Schiefelbein,
Ben Asani,
Michael Hafner,
Fabian J Theis
Abstract:
Optical coherence tomography (OCT) imaging from different camera devices causes challenging domain shifts and can cause a severe drop in accuracy for machine learning models. In this work, we introduce a minimal noise adaptation method based on a singular value decomposition (SVDNA) to overcome the domain gap between target domains from three different device manufacturers in retinal OCT imaging.…
▽ More
Optical coherence tomography (OCT) imaging from different camera devices causes challenging domain shifts and can cause a severe drop in accuracy for machine learning models. In this work, we introduce a minimal noise adaptation method based on a singular value decomposition (SVDNA) to overcome the domain gap between target domains from three different device manufacturers in retinal OCT imaging. Our method utilizes the difference in noise structure to successfully bridge the domain gap between different OCT devices and transfer the style from unlabeled target domain images to source images for which manual annotations are available. We demonstrate how this method, despite its simplicity, compares or even outperforms state-of-the-art unsupervised domain adaptation methods for semantic segmentation on a public OCT dataset. SVDNA can be integrated with just a few lines of code into the augmentation pipeline of any network which is in contrast to many state-of-the-art domain adaptation methods which often need to change the underlying model architecture or train a separate style transfer model. The full code implementation for SVDNA is available at https://github.com/ValentinKoch/SVDNA.
△ Less
Submitted 16 September, 2022;
originally announced September 2022.
-
FedNorm: Modality-Based Normalization in Federated Learning for Multi-Modal Liver Segmentation
Authors:
Tobias Bernecker,
Annette Peters,
Christopher L. Schlett,
Fabian Bamberg,
Fabian Theis,
Daniel Rueckert,
Jakob Weiß,
Shadi Albarqouni
Abstract:
Given the high incidence and effective treatment options for liver diseases, they are of great socioeconomic importance. One of the most common methods for analyzing CT and MRI images for diagnosis and follow-up treatment is liver segmentation. Recent advances in deep learning have demonstrated encouraging results for automatic liver segmentation. Despite this, their success depends primarily on t…
▽ More
Given the high incidence and effective treatment options for liver diseases, they are of great socioeconomic importance. One of the most common methods for analyzing CT and MRI images for diagnosis and follow-up treatment is liver segmentation. Recent advances in deep learning have demonstrated encouraging results for automatic liver segmentation. Despite this, their success depends primarily on the availability of an annotated database, which is often not available because of privacy concerns. Federated Learning has been recently proposed as a solution to alleviate these challenges by training a shared global model on distributed clients without access to their local databases. Nevertheless, Federated Learning does not perform well when it is trained on a high degree of heterogeneity of image data due to multi-modal imaging, such as CT and MRI, and multiple scanner types. To this end, we propose Fednorm and its extension \fednormp, two Federated Learning algorithms that use a modality-based normalization technique. Specifically, Fednorm normalizes the features on a client-level, while Fednorm+ employs the modality information of single slices in the feature normalization. Our methods were validated using 428 patients from six publicly available databases and compared to state-of-the-art Federated Learning algorithms and baseline models in heterogeneous settings (multi-institutional, multi-modal data). The experimental results demonstrate that our methods show an overall acceptable performance, achieve Dice per patient scores up to 0.961, consistently outperform locally trained models, and are on par or slightly better than centralized models.
△ Less
Submitted 23 May, 2022;
originally announced May 2022.
-
SystemMatch: optimizing preclinical drug models to human clinical outcomes via generative latent-space matching
Authors:
Scott Gigante,
Varsha G. Raghavan,
Amanda M. Robinson,
Robert A. Barton,
Adeeb H. Rahman,
Drausin F. Wulsin,
Jacques Banchereau,
Noam Solomon,
Luis F. Voloch,
Fabian J. Theis
Abstract:
Translating the relevance of preclinical models ($\textit{in vitro}$, animal models, or organoids) to their relevance in humans presents an important challenge during drug development. The rising abundance of single-cell genomic data from human tumors and tissue offers a new opportunity to optimize model systems by their similarity to targeted human cell types in disease. In this work, we introduc…
▽ More
Translating the relevance of preclinical models ($\textit{in vitro}$, animal models, or organoids) to their relevance in humans presents an important challenge during drug development. The rising abundance of single-cell genomic data from human tumors and tissue offers a new opportunity to optimize model systems by their similarity to targeted human cell types in disease. In this work, we introduce SystemMatch to assess the fit of preclinical model systems to an $\textit{in sapiens}$ target population and to recommend experimental changes to further optimize these systems. We demonstrate this through an application to developing $\textit{in vitro}$ systems to model human tumor-derived suppressive macrophages. We show with held-out $\textit{in vivo}$ controls that our pipeline successfully ranks macrophage subpopulations by their biological similarity to the target population, and apply this analysis to rank a series of 18 $\textit{in vitro}$ macrophage systems perturbed with a variety of cytokine stimulations. We extend this analysis to predict the behavior of 66 $\textit{in silico}$ model systems generated using a perturbational autoencoder and apply a $k$-medoids approach to recommend a subset of these model systems for further experimental development in order to fully explore the space of possible perturbations. Through this use case, we demonstrate a novel approach to model system development to generate a system more similar to human biology.
△ Less
Submitted 14 May, 2022;
originally announced May 2022.
-
Predicting Cellular Responses to Novel Drug Perturbations at a Single-Cell Resolution
Authors:
Leon Hetzel,
Simon Böhm,
Niki Kilbertus,
Stephan Günnemann,
Mohammad Lotfollahi,
Fabian Theis
Abstract:
Single-cell transcriptomics enabled the study of cellular heterogeneity in response to perturbations at the resolution of individual cells. However, scaling high-throughput screens (HTSs) to measure cellular responses for many drugs remains a challenge due to technical limitations and, more importantly, the cost of such multiplexed experiments. Thus, transferring information from routinely perform…
▽ More
Single-cell transcriptomics enabled the study of cellular heterogeneity in response to perturbations at the resolution of individual cells. However, scaling high-throughput screens (HTSs) to measure cellular responses for many drugs remains a challenge due to technical limitations and, more importantly, the cost of such multiplexed experiments. Thus, transferring information from routinely performed bulk RNA HTS is required to enrich single-cell data meaningfully. We introduce chemCPA, a new encoder-decoder architecture to study the perturbational effects of unseen drugs. We combine the model with an architecture surgery for transfer learning and demonstrate how training on existing bulk RNA HTS datasets can improve generalisation performance. Better generalisation reduces the need for extensive and costly screens at single-cell resolution. We envision that our proposed method will facilitate more efficient experiment designs through its ability to generate in-silico hypotheses, ultimately accelerating drug discovery.
△ Less
Submitted 30 December, 2022; v1 submitted 28 April, 2022;
originally announced April 2022.
-
Beyond Predictions in Neural ODEs: Identification and Interventions
Authors:
Hananeh Aliee,
Fabian J. Theis,
Niki Kilbertus
Abstract:
Spurred by tremendous success in pattern matching and prediction tasks, researchers increasingly resort to machine learning to aid original scientific discovery. Given large amounts of observational data about a system, can we uncover the rules that govern its evolution? Solving this task holds the great promise of fully understanding the causal interactions and being able to make reliable predict…
▽ More
Spurred by tremendous success in pattern matching and prediction tasks, researchers increasingly resort to machine learning to aid original scientific discovery. Given large amounts of observational data about a system, can we uncover the rules that govern its evolution? Solving this task holds the great promise of fully understanding the causal interactions and being able to make reliable predictions about the system's behavior under interventions. We take a step towards answering this question for time-series data generated from systems of ordinary differential equations (ODEs). While the governing ODEs might not be identifiable from data alone, we show that combining simple regularization schemes with flexible neural ODEs can robustly recover the dynamics and causal structures from time-series data. Our results on a variety of (non)-linear first and second order systems as well as real data validate our method. We conclude by showing that we can also make accurate predictions under interventions on variables or the system itself.
△ Less
Submitted 23 June, 2021;
originally announced June 2021.
-
A field guide to cultivating computational biology
Authors:
Anne E Carpenter,
Casey S Greene,
Piero Carnici,
Benilton S Carvalho,
Michiel de Hoon,
Stacey Finley,
Kim-Anh Le Cao,
Jerry SH Lee,
Luigi Marchionni,
Suzanne Sindi,
Fabian J Theis,
Gregory P Way,
Jean YH Yang,
Elana J Fertig
Abstract:
Biomedical research centers can empower basic discovery and novel therapeutic strategies by leveraging their large-scale datasets from experiments and patients. This data, together with new technologies to create and analyze it, has ushered in an era of data-driven discovery which requires moving beyond the traditional individual, single-discipline investigator research model. This interdisciplina…
▽ More
Biomedical research centers can empower basic discovery and novel therapeutic strategies by leveraging their large-scale datasets from experiments and patients. This data, together with new technologies to create and analyze it, has ushered in an era of data-driven discovery which requires moving beyond the traditional individual, single-discipline investigator research model. This interdisciplinary niche is where computational biology thrives. It has matured over the past three decades and made major contributions to scientific knowledge and human health, yet researchers in the field often languish in career advancement, publication, and grant review. We propose solutions for individual scientists, institutions, journal publishers, funding agencies, and educators.
△ Less
Submitted 22 April, 2021;
originally announced April 2021.
-
Conditional out-of-sample generation for unpaired data using trVAE
Authors:
Mohammad Lotfollahi,
Mohsen Naghipourfar,
Fabian J. Theis,
F. Alexander Wolf
Abstract:
While generative models have shown great success in generating high-dimensional samples conditional on low-dimensional descriptors (learning e.g. stroke thickness in MNIST, hair color in CelebA, or speaker identity in Wavenet), their generation out-of-sample poses fundamental problems. The conditional variational autoencoder (CVAE) as a simple conditional generative model does not explicitly relat…
▽ More
While generative models have shown great success in generating high-dimensional samples conditional on low-dimensional descriptors (learning e.g. stroke thickness in MNIST, hair color in CelebA, or speaker identity in Wavenet), their generation out-of-sample poses fundamental problems. The conditional variational autoencoder (CVAE) as a simple conditional generative model does not explicitly relate conditions during training and, hence, has no incentive of learning a compact joint distribution across conditions. We overcome this limitation by matching their distributions using maximum mean discrepancy (MMD) in the decoder layer that follows the bottleneck. This introduces a strong regularization both for reconstructing samples within the same condition and for transforming samples across conditions, resulting in much improved generalization. We refer to the architecture as \emph{transformer} VAE (trVAE). Benchmarking trVAE on high-dimensional image and tabular data, we demonstrate higher robustness and higher accuracy than existing approaches. In particular, we show qualitatively improved predictions for cellular perturbation response to treatment and disease based on high-dimensional single-cell gene expression data, by tackling previously problematic minority classes and multiple conditions. For generic tasks, we improve Pearson correlations of high-dimensional estimated means and variances with their ground truths from 0.89 to 0.97 and 0.75 to 0.87, respectively.
△ Less
Submitted 30 October, 2019; v1 submitted 3 October, 2019;
originally announced October 2019.
-
Single-cell eQTLGen Consortium: a personalized understanding of disease
Authors:
Monique G. P. van der Wijst,
Dylan H. de Vries,
Hilde E. Groot,
Gosia Trynka,
Chung-Chau Hon,
Martijn C. Nawijn,
Youssef Idaghdour,
Pim van der Harst,
Chun J. Ye,
Joseph Powell,
Fabian J. Theis,
Ahmed Mahfouz,
Matthias Heinig,
Lude Franke
Abstract:
In recent years, functional genomics approaches combining genetic information with bulk RNA-sequencing data have identified the downstream expression effects of disease-associated genetic risk factors through so-called expression quantitative trait locus (eQTL) analysis. Single-cell RNA-sequencing creates enormous opportunities for mapping eQTLs across different cell types and in dynamic processes…
▽ More
In recent years, functional genomics approaches combining genetic information with bulk RNA-sequencing data have identified the downstream expression effects of disease-associated genetic risk factors through so-called expression quantitative trait locus (eQTL) analysis. Single-cell RNA-sequencing creates enormous opportunities for mapping eQTLs across different cell types and in dynamic processes, many of which are obscured when using bulk methods. The enormous increase in throughput and reduction in cost per cell now allow this technology to be applied to large-scale population genetics studies. Therefore, we have founded the single-cell eQTLGen consortium (sc-eQTLGen), aimed at pinpointing disease-causing genetic variants and identifying the cellular contexts in which they affect gene expression. Ultimately, this information can enable development of personalized medicine. Here, we outline the goals, approach, potential utility and early proofs-of-concept of the sc-eQTLGen consortium. We also provide a set of study design considerations for future single-cell eQTL studies.
△ Less
Submitted 27 September, 2019;
originally announced September 2019.
-
Fully integrative data analysis of NMR metabolic fingerprints with comprehensive patient data: a case report based on the German Chronic Kidney Disease (GCKD) study
Authors:
Helena U. Zacharias,
Michael Altenbuchinger,
Stefan Solbrig,
Andreas Schäfer,
Mustafa Buyukozkan,
Ulla T. Schultheiß,
Fruzsina Kotsis,
Anna Köttgen,
Jan Krumsiek,
Fabian J. Theis,
Rainer Spang,
Peter J. Oefner,
Wolfram Gronwald,
GCKD study investigators
Abstract:
Omics data facilitate the gain of novel insights into the pathophysiology of diseases and, consequently, their diagnosis, treatment, and prevention. To that end, it is necessary to integrate omics data with other data types such as clinical, phenotypic, and demographic parameters of categorical or continuous nature. Here, we exemplify this data integration issue for a study on chronic kidney disea…
▽ More
Omics data facilitate the gain of novel insights into the pathophysiology of diseases and, consequently, their diagnosis, treatment, and prevention. To that end, it is necessary to integrate omics data with other data types such as clinical, phenotypic, and demographic parameters of categorical or continuous nature. Here, we exemplify this data integration issue for a study on chronic kidney disease (CKD), where complex clinical and demographic parameters were assessed together with one-dimensional (1D) 1H NMR metabolic fingerprints. Routine analysis screens for associations of single metabolic features with clinical parameters, which requires confounding variables typically chosen by expert knowledge to be taken into account. This knowledge can be incomplete or unavailable. The results of this article are manifold. We introduce a framework for data integration that intrinsically adjusts for confounding variables. We give its mathematical and algorithmic foundation, provide a state-of-the-art implementation, and give several sanity checks. In particular, we show that the discovered associations remain significant after variable adjustment based on expert knowledge. In contrast, we illustrate that the discovery of associations in routine analysis can be biased by incorrect or incomplete expert knowledge in univariate screening approaches. Finally, we exemplify how our data integration approach reveals important associations between CKD comorbidities and metabolites. Moreover, we evaluate the predictive performance of the estimated models on independent validation data and contrast the results with a naive screening approach.
△ Less
Submitted 8 October, 2018;
originally announced October 2018.
-
Mitosis Detection in Intestinal Crypt Images with Hough Forest and Conditional Random Fields
Authors:
Gerda Bortsova,
Michael Sterr,
Lichao Wang,
Fausto Milletari,
Nassir Navab,
Anika Böttcher,
Heiko Lickert,
Fabian Theis,
Tingying Peng
Abstract:
Intestinal enteroendocrine cells secrete hormones that are vital for the regulation of glucose metabolism but their differentiation from intestinal stem cells is not fully understood. Asymmetric stem cell divisions have been linked to intestinal stem cell homeostasis and secretory fate commitment. We monitored cell divisions using 4D live cell imaging of cultured intestinal crypts to characterize…
▽ More
Intestinal enteroendocrine cells secrete hormones that are vital for the regulation of glucose metabolism but their differentiation from intestinal stem cells is not fully understood. Asymmetric stem cell divisions have been linked to intestinal stem cell homeostasis and secretory fate commitment. We monitored cell divisions using 4D live cell imaging of cultured intestinal crypts to characterize division modes by means of measurable features such as orientation or shape. A statistical analysis of these measurements requires annotation of mitosis events, which is currently a tedious and time-consuming task that has to be performed manually. To assist data processing, we developed a learning based method to automatically detect mitosis events. The method contains a dual-phase framework for joint detection of dividing cells (mothers) and their progeny (daughters). In the first phase we detect mother and daughters independently using Hough Forest whilst in the second phase we associate mother and daughters by modelling their joint probability as Conditional Random Field (CRF). The method has been evaluated on 32 movies and has achieved an AUC of 72%, which can be used in conjunction with manual correction and dramatically speed up the processing pipeline.
△ Less
Submitted 26 August, 2016;
originally announced August 2016.
-
A simulation-based approach for solving optimisation problems with ODE-type steady state constraints
Authors:
Anna Fiedler,
Fabian J. Theis,
Jan Hasenauer
Abstract:
Ordinary differential equations (ODEs) are widely used to model biological, (bio-)chemical and technical processes. The parameters of these ODEs are often estimated from experimental data using ODE-constrained optimisation. This article proposes a simple simulation-based approach for solving optimisation problems with steady state constraints relying on an ODE. This simulation-based optimisation m…
▽ More
Ordinary differential equations (ODEs) are widely used to model biological, (bio-)chemical and technical processes. The parameters of these ODEs are often estimated from experimental data using ODE-constrained optimisation. This article proposes a simple simulation-based approach for solving optimisation problems with steady state constraints relying on an ODE. This simulation-based optimisation method is tailored to the problem structure and exploits the local geometry of the steady state manifold and its stability properties. A parameterisation of the steady state manifold is not required. We prove local convergence of the method for locally strictly convex objective functions. Effciency and reliability of the proposed method are demonstrated in two examples. The proposed method demonstrated better convergence properties than existing general purpose methods and a significantly higher number of converged starts per time.
△ Less
Submitted 5 November, 2015;
originally announced November 2015.
-
Data-driven modelling of biological multi-scale processes
Authors:
Jan Hasenauer,
Nick Jagiella,
Sabrina Hross,
Fabian J. Theis
Abstract:
Biological processes involve a variety of spatial and temporal scales. A holistic understanding of many biological processes therefore requires multi-scale models which capture the relevant properties on all these scales. In this manuscript we review mathematical modelling approaches used to describe the individual spatial scales and how they are integrated into holistic models. We discuss the rel…
▽ More
Biological processes involve a variety of spatial and temporal scales. A holistic understanding of many biological processes therefore requires multi-scale models which capture the relevant properties on all these scales. In this manuscript we review mathematical modelling approaches used to describe the individual spatial scales and how they are integrated into holistic models. We discuss the relation between spatial and temporal scales and the implication of that on multi-scale modelling. Based upon this overview over state-of-the-art modelling approaches, we formulate key challenges in mathematical and computational modelling of biological multi-scale and multi-physics processes. In particular, we considered the availability of analysis tools for multi-scale models and model-based multi-scale data integration. We provide a compact review of methods for model-based data integration and model-based hypothesis testing. Furthermore, novel approaches and recent trends are discussed, including computation time reduction using reduced order and surrogate models, which contribute to the solution of inference problems. We conclude the manuscript by providing a few ideas for the development of tailored multi-scale inference methods.
△ Less
Submitted 21 June, 2015;
originally announced June 2015.
-
MCA: Multiresolution Correlation Analysis, a graphical tool for subpopulation identification in single-cell gene expression data
Authors:
Justin Feigelman,
Fabian J. Theis,
Carsten Marr
Abstract:
Background: Biological data often originate from samples containing mixtures of subpopulations, corresponding e.g. to distinct cellular phenotypes. However, identification of distinct subpopulations may be difficult if biological measurements yield distributions that are not easily separable. Results: We present Multiresolution Correlation Analysis (MCA), a method for visually identifying subpopul…
▽ More
Background: Biological data often originate from samples containing mixtures of subpopulations, corresponding e.g. to distinct cellular phenotypes. However, identification of distinct subpopulations may be difficult if biological measurements yield distributions that are not easily separable. Results: We present Multiresolution Correlation Analysis (MCA), a method for visually identifying subpopulations based on the local pairwise correlation between covariates, without needing to define an a priori interaction scale. We demonstrate that MCA facilitates the identification of differentially regulated subpopulations in simulated data from a small gene regulatory network, followed by application to previously published single-cell qPCR data from mouse embryonic stem cells. We show that MCA recovers previously identified subpopulations, provides additional insight into the underlying correlation structure, reveals potentially spurious compartmentalizations, and provides insight into novel subpopulations. Conclusions: MCA is a useful method for the identification of subpopulations in low-dimensional expression data, as emerging from qPCR or FACS measurements. With MCA it is possible to investigate the robustness of covariate correlations with respect subpopulations, graphically identify outliers, and identify factors contributing to differential regulation between pairs of covariates. MCA thus provides a framework for investigation of expression correlations for genes of interests and biological hypothesis generation.
△ Less
Submitted 8 July, 2014;
originally announced July 2014.
-
Separation of uncorrelated stationary time series using autocovariance matrices
Authors:
Jari Miettinen,
Katrin Illner,
Klaus Nordhausen,
Hannu Oja,
Sara Taskinen,
Fabian J. Theis
Abstract:
Blind source separation (BSS) is a signal processing tool, which is widely used in various fields. Examples include biomedical signal separation, brain imaging and economic time series applications. In BSS, one assumes that the observed $p$ time series are linear combinations of $p$ latent uncorrelated weakly stationary time series. The aim is then to find an estimate for an unmixing matrix, which…
▽ More
Blind source separation (BSS) is a signal processing tool, which is widely used in various fields. Examples include biomedical signal separation, brain imaging and economic time series applications. In BSS, one assumes that the observed $p$ time series are linear combinations of $p$ latent uncorrelated weakly stationary time series. The aim is then to find an estimate for an unmixing matrix, which transforms the observed time series back to uncorrelated latent time series. In SOBI (Second Order Blind Identification) joint diagonalization of the covariance matrix and autocovariance matrices with several lags is used to estimate the unmixing matrix. The rows of an unmixing matrix can be derived either one by one (deflation-based approach) or simultaneously (symmetric approach). The latter of these approaches is well-known especially in signal processing literature, however, the rigorous analysis of its statistical properties has been missing so far. In this paper, we fill this gap and investigate the statistical properties of the symmetric SOBI estimate in detail and find its limiting distribution under general conditions. The asymptotical efficiencies of symmetric SOBI estimate are compared to those of recently introduced deflation-based SOBI estimate under general multivariate MA$(\infty)$ processes. The theory is illustrated by some finite-sample simulation studies as well as a real EEG data example.
△ Less
Submitted 14 May, 2014;
originally announced May 2014.
-
Joining Forces of Bayesian and Frequentist Methodology: A Study for Inference in the Presence of Non-Identifiability
Authors:
Andreas Raue,
Clemens Kreutz,
Fabian Joachim Theis,
Jens Timmer
Abstract:
Increasingly complex applications involve large datasets in combination with non-linear and high dimensional mathematical models. In this context, statistical inference is a challenging issue that calls for pragmatic approaches that take advantage of both Bayesian and frequentist methods. The elegance of Bayesian methodology is founded in the propagation of information content provided by experime…
▽ More
Increasingly complex applications involve large datasets in combination with non-linear and high dimensional mathematical models. In this context, statistical inference is a challenging issue that calls for pragmatic approaches that take advantage of both Bayesian and frequentist methods. The elegance of Bayesian methodology is founded in the propagation of information content provided by experimental data and prior assumptions to the posterior probability distribution of model predictions. However, for complex applications experimental data and prior assumptions potentially constrain the posterior probability distribution insufficiently. In these situations Bayesian Markov chain Monte Carlo sampling can be infeasible. From a frequentist point of view insufficient experimental data and prior assumptions can be interpreted as non-identifiability. The profile likelihood approach offers to detect and to resolve non-identifiability by experimental design iteratively. Therefore, it allows one to better constrain the posterior probability distribution until Markov chain Monte Carlo sampling can be used securely. Using an application from cell biology we compare both methods and show that a successive application of both methods facilitates a realistic assessment of uncertainty in model predictions.
△ Less
Submitted 21 February, 2012;
originally announced February 2012.
-
Stability and multi-attractor dynamics of a toggle switch based on a two-stage model of stochastic gene expression
Authors:
Michael K. Strasser,
Fabian J. Theis,
Carsten Marr
Abstract:
A toggle switch consists of two genes that mutually repress each other. This regulatory motif is active during cell differentiation and is thought to act as a memory device, being able to choose and maintain cell fate decisions. In this contribution, we study the stability and dynamics of a two-stage gene expression switch within a probabilistic framework inspired by the properties of the Pu/Gata…
▽ More
A toggle switch consists of two genes that mutually repress each other. This regulatory motif is active during cell differentiation and is thought to act as a memory device, being able to choose and maintain cell fate decisions. In this contribution, we study the stability and dynamics of a two-stage gene expression switch within a probabilistic framework inspired by the properties of the Pu/Gata toggle switch in myeloid progenitor cells. We focus on low mRNA numbers, high protein abundance and monomeric transcription factor binding. Contrary to the expectation from a deterministic description, this switch shows complex multi-attractor dy- namics without autoactivation and cooperativity. Most importantly, the four attractors of the system, which only emerge in a probabilistic two-stage description, can be identified with committed and primed states in cell differentiation. We first study the dynamics of the system and infer the mechanisms that move the system between attractors using both the quasi-potential and the probability flux of the system. Second, we show that the residence times of the system in one of the committed attractors are geometrically distributed and provide an analytical expression of the distribution. Most importantly we find that the mean residence time increases linearly with the mean protein level. Finally, we study the implications of this distribution for the stability of a switch and discuss the influence of the stability on a specific cell differentiation mechanism. Our model explains lineage priming and proposes the need of either high protein numbers or long term modifications such as chromatin remodeling to achieve stable cell fate decisions. Notably we present a system with high protein abundance that nevertheless requires a probabilistic description to exhibit multistability, complex switching dynamics and lineage priming.
△ Less
Submitted 1 December, 2011;
originally announced December 2011.
-
Modularity maximization and tree clustering: Novel ways to determine effective geographic borders
Authors:
Daniel Grady,
Rafael Brune,
Christian Thiemann,
Fabian Theis,
Dirk Brockmann
Abstract:
Territorial subdivisions and geographic borders are essential for understanding phenomena in sociology, political science, history, and economics. They influence the interregional flow of information and cross-border trade and affect the diffusion of innovation and technology. However, most existing administrative borders were determined by a variety of historic and political circumstances along w…
▽ More
Territorial subdivisions and geographic borders are essential for understanding phenomena in sociology, political science, history, and economics. They influence the interregional flow of information and cross-border trade and affect the diffusion of innovation and technology. However, most existing administrative borders were determined by a variety of historic and political circumstances along with some degree of arbitrariness. Societies have changed drastically, and it is doubtful that currently existing borders reflect the most logical divisions. Fortunately, at this point in history we are in a position to actually measure some aspects of the geographic structure of society through human mobility. Large-scale transportation systems such as trains and airlines provide data about the number of people traveling between geographic locations, and many promising human mobility proxies are being discovered, such as cell phones, bank notes, and various online social networks. In this chapter we apply two optimization techniques to a human mobility proxy (bank note circulation) to investigate the effective geographic borders that emerge from a direct analysis of human mobility.
△ Less
Submitted 6 April, 2011;
originally announced April 2011.
-
Patterns of subnet usage reveal distinct scales of regulation in the transcriptional regulatory network of Escherichia coli
Authors:
Carsten Marr,
Fabian J. Theis,
Larry S. Liebovitch,
Marc-Thorsten Hütt
Abstract:
The set of regulatory interactions between genes, mediated by transcription factors, forms a species' transcriptional regulatory network (TRN). By comparing this network with measured gene expression data one can identify functional properties of the TRN and gain general insight into transcriptional control. We define the subnet of a node as the subgraph consisting of all nodes topologically downs…
▽ More
The set of regulatory interactions between genes, mediated by transcription factors, forms a species' transcriptional regulatory network (TRN). By comparing this network with measured gene expression data one can identify functional properties of the TRN and gain general insight into transcriptional control. We define the subnet of a node as the subgraph consisting of all nodes topologically downstream of the node, including itself. Using a large set of microarray expression data of the bacterium Escherichia coli, we find that the gene expression in different subnets exhibits a structured pattern in response to environmental changes and genotypic mutation. Subnets with less changes in their expression pattern have a higher fraction of feed-forward loop motifs and a lower fraction of small RNA targets within them. Our study implies that the TRN consists of several scales of regulatory organization: 1) subnets with more varying gene expression controlled by both transcription factors and post-transcriptional RNA regulation, and 2) subnets with less varying gene expression having more feed-forward loops and less post-transcriptional RNA regulation.
△ Less
Submitted 24 May, 2010;
originally announced May 2010.
-
The structure of borders in a small world
Authors:
C. Thiemann,
F. Theis,
D. Grady,
R. Brune,
D. Brockmann
Abstract:
Geographic borders are not only essential for the effective functioning of government, the distribution of administrative responsibilities and the allocation of public resources, they also influence the interregional flow of information, cross-border trade operations, the diffusion of innovation and technology, and the spatial spread of infectious diseases. However, as growing interactions and m…
▽ More
Geographic borders are not only essential for the effective functioning of government, the distribution of administrative responsibilities and the allocation of public resources, they also influence the interregional flow of information, cross-border trade operations, the diffusion of innovation and technology, and the spatial spread of infectious diseases. However, as growing interactions and mobility across long distances, cultural, and political borders continue to amplify the small world effect and effectively decrease the relative importance of local interactions, it is difficult to assess the location and structure of effective borders that may play the most significant role in mobility-driven processes. The paradigm of spatially coherent communities may no longer be a plausible one, and it is unclear what structures emerge from the interplay of interactions and activities across spatial scales. Here we analyse a multi-scale proxy network for human mobility that incorporates travel across a few to a few thousand kilometres. We determine an effective system of geographically continuous borders implicitly encoded in multi-scale mobility patterns. We find that effective large scale boundaries define spatially coherent subdivisions and only partially coincide with administrative borders. We find that spatial coherence is partially lost if only long range traffic is taken into account and show that prevalent models for multi-scale mobility networks cannot account for the observed patterns. These results will allow for new types of quantitative, comparative analyses of multi-scale interaction networks in general and may provide insight into a multitude of spatiotemporal phenomena generated by human activity.
△ Less
Submitted 6 January, 2010;
originally announced January 2010.