-
Graph Representation Learning Strategies for Omics Data: A Case Study on Parkinson's Disease
Authors:
Elisa Gómez de Lope,
Saurabh Deshpande,
Ramón Viñas Torné,
Pietro Liò,
Enrico Glaab,
Stéphane P. A. Bordas
Abstract:
Omics data analysis is crucial for studying complex diseases, but its high dimensionality and heterogeneity challenge classical statistical and machine learning methods. Graph neural networks have emerged as promising alternatives, yet the optimal strategies for their design and optimization in real-world biomedical challenges remain unclear. This study evaluates various graph representation learning models for case-control classification using high-throughput biological data from Parkinson's disease and control samples. We compare topologies derived from sample similarity networks and molecular interaction networks, including protein-protein and metabolite-metabolite interactions (PPI, MMI). Graph Convolutional Networks (GCNs), Chebyshev spectral graph convolution (ChebyNet), and Graph Attention Networks (GATs) are evaluated alongside advanced architectures like graph transformers and the graph U-net, and simpler models like the multilayer perceptron (MLP).
These models are systematically applied to transcriptomics and metabolomics data independently. Our comparative analysis highlights the benefits and limitations of various architectures in extracting patterns from omics data, paving the way for more accurate and interpretable models in biomedical research.
Submitted 20 June, 2024;
originally announced June 2024.
-
Evaluating representation learning on the protein structure universe
Authors:
Arian R. Jamasb,
Alex Morehead,
Chaitanya K. Joshi,
Zuobai Zhang,
Kieran Didi,
Simon V. Mathis,
Charles Harris,
Jian Tang,
Jianlin Cheng,
Pietro Lio,
Tom L. Blundell
Abstract:
We introduce ProteinWorkshop, a comprehensive benchmark suite for representation learning on protein structures with Geometric Graph Neural Networks. We consider large-scale pre-training and downstream tasks on both experimental and predicted structures to enable the systematic evaluation of the quality of the learned structural representation and their usefulness in capturing functional relationships for downstream tasks. We find that: (1) large-scale pretraining on AlphaFold structures and auxiliary tasks consistently improve the performance of both rotation-invariant and equivariant GNNs, and (2) more expressive equivariant GNNs benefit from pretraining to a greater extent compared to invariant models. We aim to establish a common ground for the machine learning and computational biology communities to rigorously compare and advance protein structure representation learning. Our open-source codebase reduces the barrier to entry for working with large protein structure datasets by providing: (1) storage-efficient dataloaders for large-scale structural databases including AlphaFoldDB and ESM Atlas, as well as (2) utilities for constructing new tasks from the entire PDB. ProteinWorkshop is available at: github.com/a-r-j/ProteinWorkshop.
Submitted 19 June, 2024;
originally announced June 2024.
-
RNA-FrameFlow: Flow Matching for de novo 3D RNA Backbone Design
Authors:
Rishabh Anand,
Chaitanya K. Joshi,
Alex Morehead,
Arian R. Jamasb,
Charles Harris,
Simon V. Mathis,
Kieran Didi,
Bryan Hooi,
Pietro Liò
Abstract:
We introduce RNA-FrameFlow, the first generative model for 3D RNA backbone design. We build upon SE(3) flow matching for protein backbone generation and establish protocols for data preparation and evaluation to address unique challenges posed by RNA modeling. We formulate RNA structures as a set of rigid-body frames and associated loss functions which account for larger, more conformationally flexible RNA backbones (13 atoms per nucleotide) vs. proteins (4 atoms per residue). Toward tackling the lack of diversity in 3D RNA datasets, we explore training with structural clustering and cropping augmentations. Additionally, we define a suite of evaluation metrics to measure whether the generated RNA structures are globally self-consistent (via inverse folding followed by forward folding) and locally recover RNA-specific structural descriptors. The most performant version of RNA-FrameFlow generates locally realistic RNA backbones of 40-150 nucleotides, over 40% of which pass our validity criteria as measured by a self-consistency TM-score >= 0.45, at which two RNAs have the same global fold. Open-source code: https://github.com/rish-16/rna-backbone-design
Submitted 19 June, 2024;
originally announced June 2024.
-
Improving Antibody Design with Force-Guided Sampling in Diffusion Models
Authors:
Paulina Kulytė,
Francisco Vargas,
Simon Valentin Mathis,
Yu Guang Wang,
José Miguel Hernández-Lobato,
Pietro Liò
Abstract:
Antibodies, crucial for immune defense, primarily rely on complementarity-determining regions (CDRs) to bind and neutralize antigens, such as viruses. The design of these CDRs determines the antibody's affinity and specificity towards its target. Generative models, particularly denoising diffusion probabilistic models (DDPMs), have shown potential to advance the structure-based design of CDR regions. However, only a limited dataset of bound antibody-antigen structures is available, and generalization to out-of-distribution interfaces remains a challenge. Physics-based force fields, which approximate atomic interactions, offer a coarse but universal source of information to better mold designs to target interfaces. Integrating this foundational information into diffusion models is, therefore, highly desirable. Here, we propose a novel approach to enhance the sampling process of diffusion models by integrating force field energy-based feedback. Our model, DiffForce, employs forces to guide the diffusion sampling process, effectively blending the two distributions. Through extensive experiments, we demonstrate that our method guides the model to sample CDRs with lower energy, enhancing both the structure and sequence of the generated antibodies.
Submitted 9 June, 2024;
originally announced June 2024.
-
SynFlowNet: Towards Molecule Design with Guaranteed Synthesis Pathways
Authors:
Miruna Cretu,
Charles Harris,
Julien Roy,
Emmanuel Bengio,
Pietro Liò
Abstract:
Recent breakthroughs in generative modelling have led to a number of works proposing molecular generation models for drug discovery. While these models perform well at capturing drug-like motifs, they are known to often produce synthetically inaccessible molecules. This is because they are trained to compose atoms or fragments in a way that approximates the training distribution, but they are not explicitly aware of the synthesis constraints that come with making molecules in the lab. To address this issue, we introduce SynFlowNet, a GFlowNet model whose action space uses chemically validated reactions and reactants to sequentially build new molecules. We evaluate our approach using synthetic accessibility scores and an independent retrosynthesis tool. SynFlowNet consistently samples synthetically feasible molecules, while still being able to find diverse and high-utility candidates. Furthermore, we compare molecules designed with SynFlowNet to experimentally validated actives, and find that they show comparable properties of interest, such as molecular weight, SA score and predicted protein binding affinity.
Submitted 2 May, 2024;
originally announced May 2024.
-
A framework for conditional diffusion modelling with applications in motif scaffolding for protein design
Authors:
Kieran Didi,
Francisco Vargas,
Simon V Mathis,
Vincent Dutordoir,
Emile Mathieu,
Urszula J Komorowska,
Pietro Lio
Abstract:
Many protein design applications, such as binder or enzyme design, require scaffolding a structural motif with high precision. Generative modelling paradigms based on denoising diffusion processes emerged as a leading candidate to address this motif scaffolding problem and have shown early experimental success in some cases. In the diffusion paradigm, motif scaffolding is treated as a conditional generation task, and several conditional generation protocols were proposed or imported from the Computer Vision literature. However, most of these protocols are motivated heuristically, e.g. via analogies to Langevin dynamics, and lack a unifying framework, obscuring connections between the different approaches. In this work, we unify conditional training and conditional sampling procedures under one common framework based on the mathematically well-understood Doob's h-transform. This new perspective allows us to draw connections between existing methods and propose a new variation on existing conditional training protocols. We illustrate the effectiveness of this new protocol in both image outpainting and motif scaffolding, and find that it outperforms standard methods.
Submitted 13 March, 2024; v1 submitted 14 December, 2023;
originally announced December 2023.
-
A Hitchhiker's Guide to Geometric GNNs for 3D Atomic Systems
Authors:
Alexandre Duval,
Simon V. Mathis,
Chaitanya K. Joshi,
Victor Schmidt,
Santiago Miret,
Fragkiskos D. Malliaros,
Taco Cohen,
Pietro Liò,
Yoshua Bengio,
Michael Bronstein
Abstract:
Recent advances in computational modelling of atomic systems, spanning molecules, proteins, and materials, represent them as geometric graphs with atoms embedded as nodes in 3D Euclidean space. In these graphs, the geometric attributes transform according to the inherent physical symmetries of 3D atomic systems, including rotations and translations in Euclidean space, as well as node permutations. In recent years, Geometric Graph Neural Networks have emerged as the preferred machine learning architecture powering applications ranging from protein structure prediction to molecular simulations and material generation. Their specificity lies in the inductive biases they leverage - such as physical symmetries and chemical properties - to learn informative representations of these geometric graphs.
In this opinionated paper, we provide a comprehensive and self-contained overview of the field of Geometric GNNs for 3D atomic systems. We cover fundamental background material and introduce a pedagogical taxonomy of Geometric GNN architectures: (1) invariant networks, (2) equivariant networks in Cartesian basis, (3) equivariant networks in spherical basis, and (4) unconstrained networks. Additionally, we outline key datasets and application areas and suggest future research directions. The objective of this work is to present a structured perspective on the field, making it accessible to newcomers and aiding practitioners in gaining an intuition for its mathematical abstractions.
Submitted 13 March, 2024; v1 submitted 12 December, 2023;
originally announced December 2023.
-
Score-Based Generative Models for Designing Binding Peptide Backbones
Authors:
John D Boom,
Matthew Greenig,
Pietro Sormanni,
Pietro Liò
Abstract:
Score-based generative models (SGMs) have proven to be powerful tools for designing new proteins. Designing proteins that bind a pre-specified target is highly relevant to a range of medical and industrial applications. Despite the flurry of new SGMs in the last year, there has been little systematic exploration of the impact of design choices in SGMs for protein design. Here we present LoopGen, a flexible SGM framework for the design of short binding peptide structures. We apply our framework to design antibody binding loop structures conditional on a target epitope and evaluate a variety of modelling choices in SGM-based protein design. We demonstrate that modelling residue orientations in addition to positions improves not only the quality of the output structures but also their diversity. Additionally, we identify variance schedules that result in significant performance improvements and observe patterns that may motivate the development of better schedules for protein design. Finally, we develop three novel tests to evaluate whether the model generates structures that are appropriately conditioned on an epitope, demonstrating that LoopGen's generated structures are dependent on the structure, sequence, and position of the epitope. Our findings will help guide future development and evaluation of generative models for binding proteins.
Submitted 2 November, 2023; v1 submitted 10 October, 2023;
originally announced October 2023.
-
Will More Expressive Graph Neural Networks do Better on Generative Tasks?
Authors:
Xiandong Zou,
Xiangyu Zhao,
Pietro Liò,
Yiren Zhao
Abstract:
Graph generation poses a significant challenge as it involves predicting a complete graph with multiple nodes and edges based on simply a given label. This task also carries fundamental importance to numerous real-world applications, including de-novo drug and molecular design. In recent years, several successful methods have emerged in the field of graph generation. However, these approaches suffer from two significant shortcomings: (1) the underlying Graph Neural Network (GNN) architectures used in these methods are often underexplored; and (2) these methods are often evaluated on only a limited number of metrics. To fill this gap, we investigate the expressiveness of GNNs under the context of the molecular graph generation task, by replacing the underlying GNNs of graph generative models with more expressive GNNs. Specifically, we analyse the performance of six GNNs in two different generative frameworks -- autoregressive generation models, such as GCPN and GraphAF, and one-shot generation models, such as GraphEBM -- on six different molecular generative objectives on the ZINC-250k dataset. Through our extensive experiments, we demonstrate that advanced GNNs can indeed improve the performance of GCPN, GraphAF, and GraphEBM on molecular generation tasks, but GNN expressiveness is not a necessary condition for a good GNN-based generative model. Moreover, we show that GCPN and GraphAF with advanced GNNs can achieve state-of-the-art results across 17 other non-GNN-based graph generative approaches, such as variational autoencoders and Bayesian optimisation models, on the proposed molecular generative objectives (DRD2, Median1, Median2), which are important metrics for de-novo molecular design.
Submitted 20 February, 2024; v1 submitted 23 August, 2023;
originally announced August 2023.
-
DiffHopp: A Graph Diffusion Model for Novel Drug Design via Scaffold Hopping
Authors:
Jos Torge,
Charles Harris,
Simon V. Mathis,
Pietro Lio
Abstract:
Scaffold hopping is a drug discovery strategy to generate new chemical entities by modifying the core structure, the scaffold, of a known active compound. This approach preserves the essential molecular features of the original scaffold while introducing novel chemical elements or structural features to enhance potency, selectivity, or bioavailability. However, there is currently a lack of generative models specifically tailored for this task, especially in the pocket-conditioned context. In this work, we present DiffHopp, a conditional E(3)-equivariant graph diffusion model tailored for scaffold hopping given a known protein-ligand complex.
Submitted 14 August, 2023;
originally announced August 2023.
-
Benchmarking Generated Poses: How Rational is Structure-based Drug Design with Generative Models?
Authors:
Charles Harris,
Kieran Didi,
Arian R. Jamasb,
Chaitanya K. Joshi,
Simon V. Mathis,
Pietro Lio,
Tom Blundell
Abstract:
Deep generative models for structure-based drug design (SBDD), where molecule generation is conditioned on a 3D protein pocket, have received considerable interest in recent years. These methods offer the promise of higher-quality molecule generation by explicitly modelling the 3D interaction between a potential drug and a protein receptor. However, previous work has primarily focused on the quality of the generated molecules themselves, with limited evaluation of the 3D molecule poses that these methods produce, with most work simply discarding the generated pose and only reporting a "corrected" pose after redocking with traditional methods. Little is known about whether generated molecules satisfy known physical constraints for binding and the extent to which redocking alters the generated interactions. We introduce PoseCheck, an extensive analysis of multiple state-of-the-art methods and find that generated molecules have significantly more physical violations and fewer key interactions compared to baselines, calling into question the implicit assumption that providing rich 3D structure information improves molecule complementarity. We make recommendations for future research tackling identified failure modes and hope our benchmark can serve as a springboard for future SBDD generative modelling work to have a real-world impact.
Submitted 14 August, 2023;
originally announced August 2023.
-
Graph Denoising Diffusion for Inverse Protein Folding
Authors:
Kai Yi,
Bingxin Zhou,
Yiqing Shen,
Pietro Liò,
Yu Guang Wang
Abstract:
Inverse protein folding is challenging due to its inherent one-to-many mapping characteristic, where numerous possible amino acid sequences can fold into a single, identical protein backbone. This task involves not only identifying viable sequences but also representing the sheer diversity of potential solutions. However, existing discriminative models, such as transformer-based auto-regressive models, struggle to encapsulate the diverse range of plausible solutions. In contrast, diffusion probabilistic models, as an emerging genre of generative approaches, offer the potential to generate a diverse set of sequence candidates for determined protein backbones. We propose a novel graph denoising diffusion model for inverse protein folding, where a given protein backbone guides the diffusion process on the corresponding amino acid residue types. The model infers the joint distribution of amino acids conditioned on the nodes' physicochemical properties and local environment. Moreover, we utilize amino acid replacement matrices for the diffusion forward process, encoding the biologically meaningful prior knowledge of amino acids from their spatial and sequential neighbors as well as themselves, which reduces the sampling space of the generative process. Our model achieves state-of-the-art performance over a set of popular baseline methods in sequence recovery and exhibits great potential in generating diverse protein sequences for a determined protein backbone structure.
Submitted 7 November, 2023; v1 submitted 29 June, 2023;
originally announced June 2023.
-
Neural Embeddings for Protein Graphs
Authors:
Francesco Ceccarelli,
Lorenzo Giusti,
Sean B. Holden,
Pietro Liò
Abstract:
Proteins perform much of the work in living organisms, and consequently the development of efficient computational methods for protein representation is essential for advancing large-scale biological research. Most current approaches struggle to efficiently integrate the wealth of information contained in the protein sequence and structure. In this paper, we propose a novel framework for embedding protein graphs in geometric vector spaces, by learning an encoder function that preserves the structural distance between protein graphs. Utilizing Graph Neural Networks (GNNs) and Large Language Models (LLMs), the proposed framework generates structure- and sequence-aware protein representations. We demonstrate that our embeddings are successful in the task of comparing protein structures, while providing a significant speed-up compared to traditional approaches based on structural alignment. Our framework achieves remarkable results in the task of protein structure classification; in particular, when compared to other work, the proposed method shows an average F1-Score improvement of 26% on out-of-distribution (OOD) samples and of 32% when tested on samples coming from the same distribution as the training data. Our approach finds applications in areas such as drug prioritization, drug re-purposing, disease sub-type analysis and elsewhere.
Submitted 7 June, 2023;
originally announced June 2023.
-
gRNAde: Geometric Deep Learning for 3D RNA inverse design
Authors:
Chaitanya K. Joshi,
Arian R. Jamasb,
Ramon Viñas,
Charles Harris,
Simon V. Mathis,
Alex Morehead,
Rishabh Anand,
Pietro Liò
Abstract:
Computational RNA design tasks are often posed as inverse problems, where sequences are designed based on adopting a single desired secondary structure without considering 3D geometry and conformational diversity. We introduce gRNAde, a geometric RNA design pipeline operating on 3D RNA backbones to design sequences that explicitly account for structure and dynamics. Under the hood, gRNAde is a multi-state Graph Neural Network that generates candidate RNA sequences conditioned on one or more 3D backbone structures where the identities of the bases are unknown. On a single-state fixed backbone re-design benchmark of 14 RNA structures from the PDB identified by Das et al. [2010], gRNAde obtains higher native sequence recovery rates (56% on average) compared to Rosetta (45% on average), taking under a second to produce designs compared to the reported hours for Rosetta. We further demonstrate the utility of gRNAde on a new benchmark of multi-state design for structurally flexible RNAs, as well as zero-shot ranking of mutational fitness landscapes in a retrospective analysis of a recent RNA polymerase ribozyme structure. Open source code: https://github.com/chaitjo/geometric-rna-design
Submitted 25 May, 2024; v1 submitted 24 May, 2023;
originally announced May 2023.
-
Assisting clinical practice with fuzzy probabilistic decision trees
Authors:
Emma L. Ambags,
Giulia Capitoli,
Vincenzo L' Imperio,
Michele Provenzano,
Marco S. Nobile,
Pietro Liò
Abstract:
The need for fully human-understandable models is increasingly being recognised as a central theme in AI research. The acceptance of AI models to assist in decision making in sensitive domains will grow when these models are interpretable, and this trend towards interpretable models will be amplified by upcoming regulations. One of the killer applications of interpretable AI is medical practice, which can benefit from accurate decision support methodologies that inherently generate trust. In this work, we propose FPT (MedFP), a novel method that combines probabilistic trees and fuzzy logic to assist clinical practice. This approach is fully interpretable as it allows clinicians to generate, control and verify the entire diagnosis procedure; one of the methodology's strengths is the capability to decrease the frequency of misdiagnoses by providing an estimate of uncertainties and counterfactuals. Our approach is applied as a proof-of-concept to two real medical scenarios: classifying malignant thyroid nodules and predicting the risk of progression in chronic kidney disease patients. Our results show that probabilistic fuzzy decision trees can provide interpretable support to clinicians; furthermore, introducing fuzzy variables into the probabilistic model brings significant nuances that are lost when using the crisp thresholds set by traditional probabilistic decision trees. We show that FPT and its predictions can assist clinical practice in an intuitive manner, with the use of a user-friendly interface specifically designed for this purpose. Moreover, we discuss the interpretability of the FPT model.
Submitted 26 April, 2023; v1 submitted 16 April, 2023;
originally announced April 2023.
-
GCondNet: A Novel Method for Improving Neural Networks on Small High-Dimensional Tabular Data
Authors:
Andrei Margeloiu,
Nikola Simidjievski,
Pietro Lio,
Mateja Jamnik
Abstract:
Neural network models often struggle with high-dimensional but small sample-size tabular datasets. One reason is that current weight initialisation methods assume independence between weights, which can be problematic when there are insufficient samples to estimate the model's parameters accurately. In such small data scenarios, leveraging additional structures can improve the model's performance and training stability. To address this, we propose GCondNet, a general approach to enhance neural networks by leveraging implicit structures present in tabular data. We create a graph between samples for each data dimension, and utilise Graph Neural Networks (GNNs) for extracting this implicit structure, and for conditioning the parameters of the first layer of an underlying predictor network. By creating many small graphs, GCondNet exploits the data's high-dimensionality, and thus improves the performance of an underlying predictor network. We demonstrate the effectiveness of our method on 9 real-world datasets, where GCondNet outperforms 15 standard and state-of-the-art methods. The results show that GCondNet is a versatile framework for injecting graph-regularisation into various types of neural networks, including MLPs and tabular Transformers.
Submitted 17 November, 2023; v1 submitted 11 November, 2022;
originally announced November 2022.
-
Structure-based Drug Design with Equivariant Diffusion Models
Authors:
Arne Schneuing,
Yuanqi Du,
Charles Harris,
Arian Jamasb,
Ilia Igashov,
Weitao Du,
Tom Blundell,
Pietro Lió,
Carla Gomes,
Max Welling,
Michael Bronstein,
Bruno Correia
Abstract:
Structure-based drug design (SBDD) aims to design small-molecule ligands that bind with high affinity and specificity to pre-determined protein targets. In this paper, we formulate SBDD as a 3D-conditional generation problem and present DiffSBDD, an SE(3)-equivariant 3D-conditional diffusion model that generates novel ligands conditioned on protein pockets. Comprehensive in silico experiments demonstrate the efficiency and effectiveness of DiffSBDD in generating novel and diverse drug-like ligands with competitive docking scores. We further explore the flexibility of the diffusion framework for a broader range of tasks in drug design campaigns, such as off-the-shelf property optimization and partial molecular design with inpainting.
Submitted 30 June, 2023; v1 submitted 24 October, 2022;
originally announced October 2022.
-
Distributed representations of graphs for drug pair scoring
Authors:
Paul Scherer,
Pietro Liò,
Mateja Jamnik
Abstract:
In this paper we study the practicality and usefulness of incorporating distributed representations of graphs into models within the context of drug pair scoring. We argue that the real world growth and update cycles of drug pair scoring datasets subvert the limitations of transductive learning associated with distributed representations. Furthermore, we argue that the vocabulary of discrete substructure patterns induced over drug sets is not dramatically large due to the limited set of atom types and constraints on bonding patterns enforced by chemistry. Under this pretext, we explore the effectiveness of distributed representations of the molecular graphs of drugs in drug pair scoring tasks such as drug synergy, polypharmacy, and drug-drug interaction prediction. To achieve this, we present a methodology for learning and incorporating distributed representations of graphs within a unified framework for drug pair scoring. Subsequently, we augment a number of recent and state-of-the-art models to utilise our embeddings. We empirically show that the incorporation of these embeddings improves downstream performance of almost every model across different drug pair scoring tasks, even those the original model was not designed for. We publicly release all of our drug embeddings for the DrugCombDB, DrugComb, DrugbankDDI, and TwoSides datasets.
Submitted 24 November, 2022; v1 submitted 19 September, 2022;
originally announced September 2022.
-
Modular multi-source prediction of drug side-effects with DruGNN
Authors:
Pietro Bongini,
Franco Scarselli,
Monica Bianchini,
Giovanna Maria Dimitri,
Niccolò Pancino,
Pietro Liò
Abstract:
Drug Side-Effects (DSEs) have a high impact on public health, care system costs, and drug discovery processes. Predicting the probability of side-effects, before their occurrence, is fundamental to reduce this impact, in particular on drug discovery. Candidate molecules could be screened before undergoing clinical trials, reducing the costs in time, money, and health of the participants. Drug side-effects are triggered by complex biological processes involving many different entities, from drug structures to protein-protein interactions. To predict their occurrence, it is necessary to integrate data from heterogeneous sources. In this work, such heterogeneous data is integrated into a graph dataset, expressively representing the relational information between different entities, such as drug molecules and genes. The relational nature of the dataset represents an important novelty for drug side-effect predictors. Graph Neural Networks (GNNs) are exploited to predict DSEs on our dataset with very promising results. GNNs are deep learning models that can process graph-structured data, with minimal information loss, and have been applied on a wide variety of biological tasks. Our experimental results confirm the advantage of using relationships between data entities, suggesting interesting future developments in this scope. The experimentation also shows the importance of specific subsets of data in determining associations between drugs and side-effects.
Submitted 15 February, 2022;
originally announced February 2022.
-
Structure-aware generation of drug-like molecules
Authors:
Pavol Drotár,
Arian Rokkum Jamasb,
Ben Day,
Cătălina Cangea,
Pietro Liò
Abstract:
Structure-based drug design involves finding ligand molecules that exhibit structural and chemical complementarity to protein pockets. Deep generative methods have shown promise in proposing novel molecules from scratch (de-novo design), avoiding exhaustive virtual screening of chemical space. Most generative de-novo models fail to incorporate detailed ligand-protein interactions and 3D pocket structures. We propose a novel supervised model that generates molecular graphs jointly with 3D pose in a discretised molecular space. Molecules are built atom-by-atom inside pockets, guided by structural information from crystallographic data. We evaluate our model using a docking benchmark and find that guided generation improves predicted binding affinities by 8% and drug-likeness scores by 10% over the baseline. Furthermore, our model proposes molecules with binding scores exceeding some known ligands, which could be useful in future wet-lab studies.
Submitted 7 November, 2021;
originally announced November 2021.
-
3D Infomax improves GNNs for Molecular Property Prediction
Authors:
Hannes Stärk,
Dominique Beaini,
Gabriele Corso,
Prudencio Tossou,
Christian Dallago,
Stephan Günnemann,
Pietro Liò
Abstract:
Molecular property prediction is one of the fastest-growing applications of deep learning with critical real-world impacts. Including 3D molecular structure as input to learned models improves their performance for many molecular tasks. However, this information is infeasible to compute at the scale required by several real-world applications. We propose pre-training a model to reason about the geometry of molecules given only their 2D molecular graphs. Using methods from self-supervised learning, we maximize the mutual information between 3D summary vectors and the representations of a Graph Neural Network (GNN) such that they contain latent 3D information. During fine-tuning on molecules with unknown geometry, the GNN still generates implicit 3D information and can use it to improve downstream tasks. We show that 3D pre-training provides significant improvements for a wide range of properties, such as a 22% average MAE reduction on eight quantum mechanical properties. Moreover, the learned representations can be effectively transferred between datasets in different molecular spaces.
Submitted 4 June, 2022; v1 submitted 8 October, 2021;
originally announced October 2021.
-
Neural Distance Embeddings for Biological Sequences
Authors:
Gabriele Corso,
Rex Ying,
Michal Pándy,
Petar Veličković,
Jure Leskovec,
Pietro Liò
Abstract:
The development of data-dependent heuristics and representations for biological sequences that reflect their evolutionary distance is critical for large-scale biological research. However, popular machine learning approaches, based on continuous Euclidean spaces, have struggled with the discrete combinatorial formulation of the edit distance that models evolution and the hierarchical relationship that characterises real-world datasets. We present Neural Distance Embeddings (NeuroSEED), a general framework to embed sequences in geometric vector spaces, and illustrate the effectiveness of the hyperbolic space that captures the hierarchical structure and provides an average 22% reduction in embedding RMSE against the best competing geometry. The capacity of the framework and the significance of these improvements are then demonstrated by devising supervised and unsupervised NeuroSEED approaches to multiple core tasks in bioinformatics. Benchmarked with common baselines, the proposed approaches display significant accuracy and/or runtime improvements on real-world datasets. As an example for hierarchical clustering, the proposed pretrained and from-scratch methods match the quality of competing baselines with 30x and 15x runtime reduction, respectively.
Submitted 11 October, 2021; v1 submitted 20 September, 2021;
originally announced September 2021.
-
Neural message passing for joint paratope-epitope prediction
Authors:
Alice Del Vecchio,
Andreea Deac,
Pietro Liò,
Petar Veličković
Abstract:
Antibodies are proteins in the immune system which bind to antigens to detect and neutralise them. The binding sites in an antibody-antigen interaction are known as the paratope and epitope, respectively, and the prediction of these regions is key to vaccine and synthetic antibody development. Contrary to prior art, we argue that paratope and epitope predictors require asymmetric treatment, and propose distinct neural message passing architectures that are geared towards the specific aspects of paratope and epitope prediction, respectively. We obtain significant improvements on both tasks, setting the new state-of-the-art and recovering favourable qualitative predictions on antigens of relevance to COVID-19.
Submitted 25 July, 2021; v1 submitted 31 May, 2021;
originally announced June 2021.
-
Using ontology embeddings for structural inductive bias in gene expression data analysis
Authors:
Maja Trębacz,
Zohreh Shams,
Mateja Jamnik,
Paul Scherer,
Nikola Simidjievski,
Helena Andres Terre,
Pietro Liò
Abstract:
Stratifying cancer patients based on their gene expression levels can improve diagnosis, survival analysis and treatment planning. However, such data is extremely high-dimensional, as it contains expression values for over 20000 genes per patient, and the number of samples in the datasets is low. To deal with such settings, we propose to incorporate prior biological knowledge about genes from ontologies into the machine learning system for the task of patient classification given their gene expression data. We use ontology embeddings that capture the semantic similarities between the genes to direct a Graph Convolutional Network, and therefore sparsify the network connections. We show this approach provides an advantage for predicting clinical targets from high-dimensional low-sample data.
Submitted 22 November, 2020;
originally announced November 2020.
-
Gene Regulatory Network Inference with Latent Force Models
Authors:
Jacob Moss,
Pietro Lió
Abstract:
Delays in protein synthesis cause a confounding effect when constructing Gene Regulatory Networks (GRNs) from RNA-sequencing time-series data. Accurate GRNs can be very insightful when modelling development, disease pathways, and drug side-effects. We present a model which incorporates translation delays by combining mechanistic equations and Bayesian approaches to fit to experimental data. This enables greater biological interpretability, and the use of Gaussian processes enables non-linear expressivity through kernels as well as naturally accounting for biological variation.
Submitted 6 October, 2020;
originally announced October 2020.
-
Incorporating network based protein complex discovery into automated model construction
Authors:
Paul Scherer,
Maja Trębacz,
Nikola Simidjievski,
Zohreh Shams,
Helena Andres Terre,
Pietro Liò,
Mateja Jamnik
Abstract:
We propose a method for gene expression based analysis of cancer phenotypes incorporating network biology knowledge through unsupervised construction of computational graphs. The structural construction of the computational graphs is driven by the use of topological clustering algorithms on protein-protein networks which incorporate inductive biases stemming from network biology research in protein complex discovery. This structurally constrains the hypothesis space over the possible computational graph factorisation whose parameters can then be learned through supervised or unsupervised task settings. The sparse construction of the computational graph enables the differential protein complex activity analysis whilst also interpreting the individual contributions of genes/proteins involved in each individual protein complex. In our experiments analysing a variety of cancer phenotypes, we show that the proposed methods outperform SVM, Fully-Connected MLP, and Randomly-Connected MLPs in all tasks. Our work introduces a scalable method for incorporating large interaction networks as prior knowledge to drive the construction of powerful computational models amenable to introspective study.
Submitted 29 September, 2020;
originally announced October 2020.
-
Signal metrics analysis of oscillatory patterns in bacterial multi-omic networks
Authors:
Francesco Bardozzo,
Pietro Liò,
Roberto Tagliaferri
Abstract:
Motivation: One of the branches of Systems Biology is focused on a deep understanding of underlying regulatory networks through the analysis of biomolecular oscillations and their interplay. Synthetic Biology exploits gene and/or protein regulatory networks towards the design of oscillatory networks for producing useful compounds. Therefore, at different levels of application and for different purposes, the study of biomolecular oscillations can lead to different clues about the mechanisms underlying living cells. It is known that network-level interactions involve more than one type of biomolecule as well as biological processes operating at multiple omic levels. By combining network/pathway-level information with genetic information, it is possible to describe well-understood or unknown bacterial mechanisms and organism-specific dynamics. Results: Network multi-omic integration has led to the discovery of interesting oscillatory signals. Following the methodologies used in signal processing and communication engineering, a new methodology is introduced to identify and quantify the extent of the multi-omic oscillations of the signal. New signal metrics are designed to allow further biotechnological explanations and provide important clues about the oscillatory nature of the pathways and their regulatory circuits. Our algorithms designed for the analysis of multi-omic signals are tested and validated on 11 different bacteria for thousands of multi-omic signals perturbed at the network level by different experimental conditions. Information on the order of genes, codon usage, gene expression, and protein molecular weight is integrated at three different functional levels. Oscillations show interesting evidence that network-level multi-omic signals present a synchronized response to perturbations and evolutionary relations along with taxa.
Submitted 1 August, 2020;
originally announced August 2020.
-
Computational Logic for Biomedicine and Neurosciences
Authors:
Elisabetta de Maria,
Joelle Despeyroux,
Amy Felty,
Pietro Liò,
Carlos Olarte,
Abdorrahim Bahrami
Abstract:
We advocate here the use of computational logic for systems biology, as a unified and safe framework well suited for modeling the dynamic behaviour of biological systems, expressing properties of them, and verifying these properties. The potential candidate logics should have a traditional proof theoretic pedigree (including either induction, or a sequent calculus presentation enjoying cut-elimination and focusing), and should come with certified proof tools. Beyond providing a reliable framework, this allows the correct encodings of our biological systems. For systems biology in general and biomedicine in particular, we have so far, for the modeling part, three candidate logics, all based on linear logic. The studied properties and their proofs are formalized in a very expressive (non-linear) inductive logic: the Calculus of Inductive Constructions (CIC). The examples we have considered so far are relatively simple ones; however, all come with formal semi-automatic proofs in the Coq system, which implements CIC. In neuroscience, we are directly using CIC and Coq to model neurons and some simple neuronal circuits and prove some of their dynamic properties. In biomedicine, the study of multi-omic pathway interactions, together with clinical and electronic health record data, should help in drug discovery and disease diagnosis. Future work includes using more automatic provers. This should enable us to specify and study more realistic examples, and in the long term to provide a system for disease diagnosis and therapy prognosis.
Submitted 6 October, 2020; v1 submitted 15 July, 2020;
originally announced July 2020.
-
The Computational Patient has Diabetes and a COVID
Authors:
Pietro Barbiero,
Pietro Lió
Abstract:
Medicine is moving from a curative discipline to a preventative discipline relying on personalised and precise treatment plans. The complex and multi-level pathophysiological patterns of most diseases require a systemic medicine approach and are challenging current medical therapies. On the other hand, computational medicine is a vibrant interdisciplinary field that could help move from an organ-centered approach to a process-oriented approach. The ideal computational patient would require an international interdisciplinary effort, of larger scientific and technological interdisciplinarity than the Human Genome Project. When deployed, such a patient would have a profound impact on how healthcare is delivered to patients. Here we present a computational patient model that integrates, refines and extends recent mechanistic or phenomenological models of cardiovascular, RAS and diabetic processes. Our aim is twofold: to analyse the modularity and composability of the model-building blocks of the computational patient, and to study the dynamical properties of well-being and disease states in a broader functional context. We present results from a number of experiments, among which we characterise the dynamic impact of COVID-19 and type-2 diabetes (T2D) on cardiovascular and inflammation conditions. We tested these experiments under different exercise, meal and drug regimens. We report results showing the striking importance of transient dynamical responses to acute state conditions, and we provide guidelines for system design principles for the inter-relationship between modules and components in systemic medicine. Finally, this initial computational patient can be used as a toolbox for further modifications and extensions.
Submitted 18 July, 2020; v1 submitted 9 June, 2020;
originally announced June 2020.
-
Towards a predictive spatio-temporal representation of brain data
Authors:
Tiago Azevedo,
Luca Passamonti,
Pietro Liò,
Nicola Toschi
Abstract:
The characterisation of the brain as a "connectome", in which the connections are represented by correlational values across timeseries and as summary measures derived from graph theory analyses, has been very popular in recent years. However, although this representation has advanced our understanding of brain function, it may represent an oversimplified model. This is because typical fMRI datasets consist of complex and highly heterogeneous timeseries that vary across space (i.e., location of brain regions). We compare various modelling techniques from deep learning and geometric deep learning to pave the way for future research in effectively leveraging the rich spatial and temporal domains of typical fMRI datasets, as well as of other similar datasets. As a proof-of-concept, we compare our approaches in the homogeneous and publicly available Human Connectome Project (HCP) dataset on a supervised binary classification task. We hope that our methodological advances relative to previous "connectomic" measures can ultimately be clinically and computationally relevant by leading to a more nuanced understanding of brain dynamics in health and disease. Such understanding of the brain can fundamentally reduce the constant need for specialised clinical expertise to accurately understand brain variability.
Submitted 29 February, 2020;
originally announced March 2020.
-
Co-Attentive Cross-Modal Deep Learning for Medical Evidence Synthesis and Decision Making
Authors:
Devin Taylor,
Simeon Spasov,
Pietro Liò
Abstract:
Modern medicine requires generalised approaches to the synthesis and integration of multimodal data, often at different biological scales, that can be applied to a variety of evidence structures, such as complex disease analyses and epidemiological models. However, current methods are either slow and expensive, or ineffective due to the inability to model the complex relationships between data modes which differ in scale and format. We address these issues by proposing a cross-modal deep learning architecture and co-attention mechanism to accurately model the relationships between the different data modes, while further reducing patient diagnosis time. Differentiating Parkinson's Disease (PD) patients from healthy patients forms the basis of the evaluation. The model outperforms the previous state-of-the-art unimodal analysis by 2.35%, while also being 53% more parameter efficient than the industry standard cross-modal model. Furthermore, the evaluation of the attention coefficients allows for qualitative insights to be obtained. Through the coupling with bioinformatics, a novel link between the interferon-gamma-mediated pathway, DNA methylation and PD was identified. We believe that our approach is general and could optimise the process of medical evidence synthesis and decision making in an actionable way.
Submitted 8 November, 2019; v1 submitted 13 September, 2019;
originally announced September 2019.
-
ncRNA Classification with Graph Convolutional Networks
Authors:
Emanuele Rossi,
Federico Monti,
Michael Bronstein,
Pietro Liò
Abstract:
Non-coding RNA (ncRNA) are RNA sequences which don't code for a gene but instead carry important biological functions. The task of ncRNA classification consists in classifying a given ncRNA sequence into its family. While it has been shown that the graph structure of an ncRNA sequence folding is of great importance for the prediction of its family, current methods make use of machine learning classifiers on hand-crafted graph features. We improve on the state-of-the-art for this task with a graph convolutional network model which achieves an accuracy of 85.73% and an F1-score of 85.61% over 13 classes. Moreover, our model learns in an end-to-end fashion from the raw RNA graphs and removes the need for expensive feature extraction. To the best of our knowledge, this also represents the first successful application of graph convolutional networks to RNA folding data.
Submitted 15 May, 2019;
originally announced May 2019.
-
Drug-Drug Adverse Effect Prediction with Graph Co-Attention
Authors:
Andreea Deac,
Yu-Hsiang Huang,
Petar Veličković,
Pietro Liò,
Jian Tang
Abstract:
Complex or co-existing diseases are commonly treated using drug combinations, which can lead to higher risk of adverse side effects. The detection of polypharmacy side effects is usually done in Phase IV clinical trials, but there are still plenty which remain undiscovered when the drugs are put on the market. Such accidents have been affecting an increasing proportion of the population (15% in the US now) and it is thus of high interest to be able to predict the potential side effects as early as possible. Systematic combinatorial screening of possible drug-drug interactions (DDI) is challenging and expensive. However, the recent significant increases in data availability from pharmaceutical research and development efforts offer a novel paradigm for recovering relevant insights for DDI prediction. Accordingly, several recent approaches focus on curating massive DDI datasets (with millions of examples) and training machine learning models on them. Here we propose a neural network architecture able to set state-of-the-art results on this task (using the type of the side-effect and the molecular structure of the drugs alone) by leveraging a co-attentional mechanism. In particular, we show the importance of integrating joint information from the drug pairs early on when learning each drug's representation.
Submitted 1 May, 2019;
originally announced May 2019.
-
GenHap: A Novel Computational Method Based on Genetic Algorithms for Haplotype Assembly
Authors:
Andrea Tangherloni,
Simone Spolaor,
Leonardo Rundo,
Marco S. Nobile,
Paolo Cazzaniga,
Giancarlo Mauri,
Pietro Liò,
Ivan Merelli,
Daniela Besozzi
Abstract:
The computational problem of inferring the full haplotype of a cell starting from read sequencing data is known as haplotype assembly, and consists in assigning all heterozygous Single Nucleotide Polymorphisms (SNPs) to exactly one of the two chromosomes. Indeed, the knowledge of complete haplotypes is generally more informative than analyzing single SNPs and plays a fundamental role in many medical applications. To reconstruct the two haplotypes, we addressed the weighted Minimum Error Correction (wMEC) problem, which is a successful approach for haplotype assembly. This NP-hard problem consists in computing the two haplotypes that partition the sequencing reads into two disjoint sub-sets, with the least number of corrections to the SNP values. To this aim, we propose here GenHap, a novel computational method for haplotype assembly based on Genetic Algorithms, yielding optimal solutions by means of a global search process. In order to evaluate the effectiveness of our approach, we ran GenHap on two synthetic (yet realistic) datasets, based on the Roche/454 and PacBio RS II sequencing technologies. We compared the performance of GenHap against HapCol, an efficient state-of-the-art algorithm for haplotype phasing. Our results show that GenHap always obtains high accuracy solutions (in terms of haplotype error rate), and is up to 4x faster than HapCol in the case of Roche/454 instances and up to 20x faster when compared on the PacBio RS II dataset. Finally, we assessed the performance of GenHap on two different real datasets. Future-generation sequencing technologies, producing longer reads with higher coverage, can highly benefit from GenHap, thanks to its capability of efficiently solving large instances of the haplotype assembly problem.
Submitted 18 December, 2018;
originally announced December 2018.
-
Modelling trait dependent speciation with Approximate Bayesian Computation
Authors:
Krzysztof Bartoszek,
Pietro Liò
Abstract:
Phylogeny is the field of modelling the temporal discrete dynamics of speciation. Complex models can nowadays be studied using the Approximate Bayesian Computation approach which avoids likelihood calculations. The field's progression is hampered by the lack of robust software to estimate the numerous parameters of the speciation process. In this work we present an R package, pcmabc, based on Approximate Bayesian Computations, that implements three novel phylogenetic algorithms for trait-dependent speciation modelling. Our phylogenetic comparative methodology takes into account both the simulated traits and phylogeny, attempting to estimate the parameters of the processes generating the phenotype and the trait. The user is not restricted to a predefined set of models and can specify a variety of evolutionary and branching models. We illustrate the software with a simulation-reestimation study focused around the branching Ornstein-Uhlenbeck process, where the branching rate depends non-linearly on the value of the driving Ornstein-Uhlenbeck process. Included in this work is a tutorial on how to use the software.
Submitted 10 December, 2018;
originally announced December 2018.
-
Structure-Based Networks for Drug Validation
Authors:
Cătălina Cangea,
Arturas Grauslys,
Pietro Liò,
Francesco Falciani
Abstract:
Classifying chemicals according to putative modes of action (MOAs) is of paramount importance in the context of risk assessment. However, current methods are only able to handle a very small proportion of the existing chemicals. We address this issue by proposing an integrative deep learning architecture that learns a joint representation from molecular structures of drugs and their effects on human cells. Our choice of architecture is motivated by the significant influence of a drug's chemical structure on its MOA. We improve on the strong ability of a unimodal architecture (F1 score of 0.803) to classify drugs by their toxic MOAs (Verhaar scheme) by adding another learning stream that processes transcriptional responses of human cells affected by drugs. Our integrative model achieves an even higher classification performance on the LINCS L1000 dataset: the error is reduced by 4.6%. We believe that our method can be used to extend the current Verhaar scheme and constitute a basis for fast drug validation and risk assessment.
Submitted 21 November, 2018;
originally announced November 2018.
-
Multi-omic Network Regression: Methodology, Tool and Case Study
Authors:
Vandan Parmar,
Pietro Lio
Abstract:
The analysis of biological networks is characterized by the definition of precise linear constraints used to cumulatively reduce the solution space of the computed states of a multi-omic (for instance metabolic, transcriptomic and proteomic) model. In this paper, we attempt, for the first time, to combine metabolic modelling and networked Cox regression, using the metabolic model of the bacterium Helicobacter pylori. This provides a platform both for quantitative analysis of networked regression and for testing the findings from network regression (a list of significant vectors and their networked relationships) on in vivo transcriptomic data. Data generated from the model, using flux balance analysis to construct a Pareto front (specifically, a trade-off between oxygen exchange and growth rate and a trade-off between carbon dioxide exchange and growth rate), are analysed, and then the model is used to quantify the success of the analysis. It was found that, using the analysis, reconstruction of the initial data was considerably more successful than a pure noise alternative. Our methodological approach is quite general and could be of interest to the wider community of complex networks researchers; it is implemented in a software tool, MoNeRe, which is freely available through the GitHub platform.
Submitted 12 October, 2018;
originally announced October 2018.
-
Seeing the wood for the trees: a forest of methods for optimisation and omic-network integration in metabolic modelling
Authors:
Supreeta Vijayakumar,
Max Conway,
Pietro Lió,
Claudio Angione
Abstract:
Metabolic modelling has entered a mature phase with dozens of methods and software implementations available to the practitioner and the theoretician. It is not easy for a modeller to be able to see the wood (or the forest) for the trees. Driven by this analogy, we here present a "forest" of principal methods used for constraint-based modelling in systems biology. This provides a tree-based view of methods available to prospective modellers, also available in interactive version at http://modellingmetabolism.net, where it will be kept updated with new methods after the publication of the present manuscript. Our updated classification of existing methods and tools highlights the most promising in the different branches, with the aim to develop a vision of how existing methods could hybridise and become more complex. We then provide the first hands-on tutorial for multi-objective optimisation of metabolic models in R. We finally discuss the implementation of multi-view machine-learning approaches in poly-omic integration. Throughout this work, we demonstrate the optimisation of trade-offs between multiple metabolic objectives, with a focus on omic data integration through machine learning. We anticipate that the combination of a survey, a perspective on multi-view machine learning, and a step-by-step R tutorial should be of interest for both the beginner and the advanced user.
Submitted 21 September, 2018;
originally announced September 2018.
-
Cross-modal Recurrent Models for Weight Objective Prediction from Multimodal Time-series Data
Authors:
Petar Veličković,
Laurynas Karazija,
Nicholas D. Lane,
Sourav Bhattacharya,
Edgar Liberis,
Pietro Liò,
Angela Chieh,
Otmane Bellahsen,
Matthieu Vegreville
Abstract:
We analyse multimodal time-series data corresponding to weight, sleep and steps measurements. We focus on predicting whether a user will successfully achieve his/her weight objective. For this, we design several deep long short-term memory (LSTM) architectures, including a novel cross-modal LSTM (X-LSTM), and demonstrate their superiority over baseline approaches. The X-LSTM improves parameter efficiency by processing each modality separately and allowing for information flow between them by way of recurrent cross-connections. We present a general hyperparameter optimisation technique for X-LSTMs, which allows us to significantly improve on the LSTM and a prior state-of-the-art cross-modal approach, using a comparable number of parameters. Finally, we visualise the model's predictions, revealing implications about latent variables in this task.
Submitted 29 November, 2017; v1 submitted 23 September, 2017;
originally announced September 2017.
-
Modelling the order of driver mutations and metabolic mutations as structures in cancer dynamics
Authors:
Gianluca Ascolani,
Pietro Lió
Abstract:
Recent works have stressed the important role that random mutations have in the development of the cancer phenotype. We challenge this current view by means of bioinformatic data analysis and computational modelling approaches. Not all mutations are equally important for the development of metastasis. The survival of cancer cells from the primary tumour site to the secondary seeding sites depends on the occurrence of very few driver mutations promoting oncogenic cell behaviours and on the order in which these mutations occur. We introduce a model in the framework of Cellular Automata to investigate the effects of metabolic mutations and mutation order on cancer stemness and tumour cell migration in bone metastasised breast cancers. The metabolism of the cancer cell is a key factor in its proliferation rate. Bioinformatics analysis on a cancer mutation database shows that metabolism-modifying alterations constitute an important class of key cancer mutations. Our approach models three types of mutations: driver mutations, the order of which is relevant for the dynamics; metabolic mutations, which support cancer growth and are estimated from existing databases; and non-driver mutations. Our results provide a quantitative basis for how the order of driver mutations and the metabolic mutations in different cancer clones could impact proliferation of therapy-resistant clonal populations and patient survival. Further mathematical modelling of the order of mutations is presented in terms of operators. We believe our work is novel because it quantifies two important factors in cancer spreading models: the order of driver mutations and the effects of metabolic mutations.
Submitted 26 June, 2017; v1 submitted 30 May, 2017;
originally announced May 2017.
-
Estimation and Modelling of PCBs Bioaccumulation in the Adriatic Sea Ecosystem
Authors:
Marianna Taffi,
Nicola Paoletti,
Pietro Liò,
Luca Tesei,
Sandra Pucciarelli,
Mauro Marini
Abstract:
Persistent Organic Pollutants represent a global ecological concern due to their ability to accumulate in organisms and to spread species-by-species via feeding connections. In this work we focus on the estimation and simulation of the bioaccumulation dynamics of persistent pollutants in the marine ecosystem, and we apply the approach to reconstruct a model of PCBs bioaccumulation in the Adriatic Sea, estimated after an extensive review of trophic and PCBs concentration data on Adriatic species. Our estimates provide evidence of PCBs biomagnification in the Adriatic food web, together with a strong dependence of bioaccumulation on trophic dynamics and external factors like fishing activity.
Submitted 25 May, 2014;
originally announced May 2014.
-
Disease processes as hybrid dynamical systems
Authors:
Pietro Liò,
Emanuela Merelli,
Nicola Paoletti
Abstract:
We investigate the use of hybrid techniques in complex processes of infectious diseases. Since predictive disease models in biomedicine require a multiscale approach for understanding the molecule-cell-tissue-organ-body interactions, heterogeneous methodologies are often employed for describing the different biological scales. Hybrid models provide effective means for complex disease modelling where the action and dosage of a drug or a therapy could be meaningfully investigated: the infection dynamics can be classically described in a continuous fashion, while the scheduling of multiple treatments is described discretely. We define an algebraic language for specifying general disease processes and multiple treatments, from which a semantics in terms of a hybrid dynamical system can be derived. Then, the application of control-theoretic tools is proposed in order to compute the optimal scheduling of multiple therapies. The potential of our approach is shown in the case study of the SIR epidemic model, and we discuss its applicability to osteomyelitis, a bacterial infection affecting the bone remodelling system in a specific and multiscale manner. We report that formal languages are helpful in giving a general homogeneous formulation for the different scales involved in a multiscale disease process, and that the combination of hybrid modelling and control theory provides solid grounds for computational medicine.
Submitted 19 August, 2012;
originally announced August 2012.
-
Multiple verification in computational modeling of bone pathologies
Authors:
Pietro Liò,
Emanuela Merelli,
Nicola Paoletti
Abstract:
We introduce a model checking approach to diagnose the emergence of bone pathologies. The implementation of a new model of bone remodeling in PRISM has led to an interesting characterization of osteoporosis as a defective bone remodeling dynamics with respect to other bone pathologies. Our approach allows us to derive three types of model checking-based diagnostic estimators. The first diagnostic measure focuses on the level of bone mineral density, which is currently used in medical practice. In addition, we have introduced a novel diagnostic estimator which uses the full patient clinical record, here simulated using the modeling framework. This estimator detects rapid (months) negative changes in bone mineral density. Independently of the actual bone mineral density, when the decrease occurs rapidly it is important to alert the patient and monitor him/her more closely to detect the onset of other bone co-morbidities. A third estimator takes into account the variance of the bone density, which could address the investigation of metabolic syndromes, diabetes and cancer. Our implementation could make use of different logical combinations of these statistical estimators and could incorporate other biomarkers for other systemic co-morbidities (for example diabetes and thalassemia). We are delighted to report that the combination of stochastic modeling with formal methods motivates a new diagnostic framework for complex pathologies. In particular, our approach takes into consideration important properties of biosystems such as multiscale behaviour and self-adaptiveness. The multi-diagnosis could be further expanded, inching towards the complexity of human diseases. Finally, we briefly introduce self-adaptiveness in formal methods, which is a key property in the regulative mechanisms of biological systems and well known in other mathematical and engineering areas.
Submitted 7 September, 2011;
originally announced September 2011.
-
Stochastic analysis of a miRNA-protein toggle switch
Authors:
E. Giampieri,
D. Remondini,
L. de Oliveira,
G. Castellani,
P. Lió
Abstract:
Within systems biology there is an increasing interest in the stochastic behavior of genetic and biochemical reaction networks. An appropriate stochastic description is provided by the chemical master equation, which represents a continuous time Markov chain (CTMC). In this paper we consider the stochastic properties of a biochemical circuit, known to control eukaryotic cell cycle and possibly involved in oncogenesis, recently proposed in the literature within a deterministic framework. Due to the inherent stochasticity of biochemical processes and the small number of molecules involved, the stochastic approach should be more correct in describing the real system: we study the agreement between the two approaches by exploring the system parameter space. We address the problem by proposing a simplified version of the model that allows analytical treatment, and by performing numerical simulations for the full model. We observed optimal agreement between the stochastic and the deterministic description of the circuit in a large range of parameters, but some substantial differences arise in at least two cases: 1) when the deterministic system is in the proximity of a transition from a monostable to a bistable configuration, and 2) when bistability (in the deterministic system) is "masked" in the stochastic system by the distribution tails. The approach provides interesting estimates of the optimal number of molecules involved in the toggle. Our discussion of the points of strengths, potentiality and weakness of the chemical master equation in systems biology and the differences with respect to deterministic modeling are leveraged in order to provide useful advice for both the bioinformatician practitioner and the theoretical scientist.
Submitted 22 June, 2011;
originally announced June 2011.
-
StochKit-FF: Efficient Systems Biology on Multicore Architectures
Authors:
Marco Aldinucci,
Andrea Bracciali,
Pietro Liò,
Anil Sorathiya,
Massimo Torquati
Abstract:
The stochastic modelling of biological systems is an informative, and in some cases, very adequate technique, which may however be more expensive than other modelling approaches, such as differential equations. We present StochKit-FF, a parallel version of StochKit, a reference toolkit for stochastic simulations. StochKit-FF is based on the FastFlow programming toolkit for multicores and exploits the novel concept of selective memory. We test StochKit-FF on a model of HIV infection dynamics, with the aim of extracting information from efficiently run experiments, here in terms of average and variance and, in the longer term, of more structured data.
Submitted 11 July, 2010;
originally announced July 2010.
-
Noise and nonlinearities in high-throughput data
Authors:
Viet-Anh Nguyen,
Zdena Koukolikova-Nicola,
Franco Bagnoli,
Pietro Lio
Abstract:
High-throughput data analyses are becoming common in biology, communications, economics and sociology. The vast amounts of data are usually represented in the form of matrices and can be considered as knowledge networks. Spectra-based approaches have proved useful in extracting hidden information within such networks and for estimating missing data, but these methods are based essentially on linear assumptions. The physical models of matching, when applicable, often suggest non-linear mechanisms, that may sometimes be identified as noise. The use of non-linear models in data analysis, however, may require the introduction of many parameters, which lowers the statistical weight of the model. According to the quality of data, a simpler linear analysis may be more convenient than more complex approaches.
In this paper, we show how a simple non-parametric Bayesian model may be used to explore the role of non-linearities and noise in synthetic and experimental data sets.
Submitted 5 January, 2010;
originally announced January 2010.
-
Risk perception in epidemic modeling
Authors:
Franco Bagnoli,
Pietro Lio,
Luca Sguanci
Abstract:
We investigate the effects of risk perception in a simple model of epidemic spreading. We assume that the perception of the risk of being infected depends on the fraction of neighbors that are ill. The effect of this factor is to decrease the infectivity, that therefore becomes a dynamical component of the model. We study the problem in the mean-field approximation and by numerical simulations for regular, random and scale-free networks.
We show that for homogeneous and random networks, there is always a value of perception that stops the epidemics. In the "worst-case" scenario of a scale-free network with diverging input connectivity, a linear perception cannot stop the epidemics; however we show that a non-linear increase of the perception risk may lead to the extinction of the disease. This transition is discontinuous, and is not predicted by the mean-field analysis.
Submitted 23 August, 2007; v1 submitted 14 May, 2007;
originally announced May 2007.
-
Mathematical Model of HIV superinfection dynamics and R5 to X4 switch
Authors:
Luca Sguanci,
Franco Bagnoli,
Pietro Lio
Abstract:
During the HIV infection several quasispecies of the virus arise, which are able to use different coreceptors, in particular the CCR5 and CXCR4 coreceptors (R5 and X4 phenotypes, respectively). The switch in coreceptor usage has been correlated with a faster progression of the disease to the AIDS phase. As several pharmaceutical companies are starting large phase III trials for R5 and X4 drugs, models are needed to predict the co-evolutionary and competitive dynamics of virus strains. We present a model of HIV early infection which describes the dynamics of R5 quasispecies and a model of HIV late infection which describes the R5 to X4 switch. We report the following findings: after superinfection or coinfection, quasispecies dynamics has time scales of several months and becomes even slower at low number of CD4+ T cells. The curve of CD4+ T cells decreases, during AIDS late stage, and can be described taking into account the X4 related Tumor Necrosis Factor dynamics. Phylogenetic inference of chemokine receptors suggests that viral mutational pathway may generate R5 variants able to interact with chemokine receptors different from CXCR4. This may explain the massive signaling disruptions in the immune system observed during AIDS late stages and may have relevance for vaccination and therapy.
Submitted 1 March, 2006;
originally announced March 2006.
-
Statistical analysis of simple repeats in the human genome
Authors:
Francesco Piazza,
Pietro Lio
Abstract:
The human genome contains repetitive DNA at different levels of sequence length, number and dispersion. Highly repetitive DNA is particularly rich in homo- and di-nucleotide repeats, while middle repetitive DNA is rich in families of interspersed, mobile elements hundreds of base pairs (bp) long, among which are the Alu families. A link between homo- and di-polymeric tracts and mobile elements has been recently highlighted. In particular, the mobility of Alu repeats, which form 10% of the human genome, has been correlated with the length of poly(A) tracts located at one end of the Alu. These tracts have a rigid and non-bendable structure and have an inhibitory effect on nucleosomes, which normally compact the DNA. We performed a statistical analysis of the genome-wide distribution of lengths and inter-tract separations of poly(X) and poly(XY) tracts in the human genome. Our study shows that in humans the length distributions of these sequences reflect the dynamics of their expansion and DNA replication. By means of general tools from linguistics, we show that the latter play the role of highly-significant content-bearing terms in the DNA text. Furthermore, we find that such tracts are positioned in a non-random fashion, with an apparent periodicity of 150 bases. This allows us to extend the link between repetitive, highly mobile elements such as Alus and low-complexity words in human DNA. More precisely, we show that Alus are sources of poly(X) tracts, which in turn affect in a subtle way the combination and diversification of gene expression and the fixation of multigene families.
Submitted 11 February, 2005;
originally announced February 2005.