-
Protein property prediction with uncertainties
Authors:
Peter Mørch Groth,
Mads Herbert Kerrn,
Lars Olsen,
Jesper Salomon,
Wouter Boomsma
Abstract:
Reliable prediction of variant effects in proteins has seen considerable progress in recent years. The increasing availability of data in this regime has improved both the prediction performance and our ability to track progress in the field, measured in terms of prediction accuracy averaged over many datasets. For practical use in protein engineering, it is important that we can also provide reliable uncertainty estimates for our predictions, but such metrics are rarely reported. We here provide a Gaussian process regression model, Kermut, which obtains state-of-the-art performance for protein property prediction while also offering estimates of uncertainty through its posterior. We proceed by assessing the quality of these uncertainty estimates. Our results show that the model provides meaningful overall calibration, but that accurate instance-specific uncertainty quantification remains challenging. We hope that this will encourage future work in this promising direction.
Submitted 9 April, 2024;
originally announced July 2024.
-
BEND: Benchmarking DNA Language Models on biologically meaningful tasks
Authors:
Frederikke Isa Marin,
Felix Teufel,
Marc Horlacher,
Dennis Madsen,
Dennis Pultz,
Ole Winther,
Wouter Boomsma
Abstract:
The genome sequence contains the blueprint for governing cellular processes. While the availability of genomes has vastly increased over the last decades, experimental annotation of the various functional, non-coding and regulatory elements encoded in the DNA sequence remains both expensive and challenging. This has sparked interest in unsupervised language modeling of genomic DNA, a paradigm that has seen great success for protein sequence data. Although various DNA language models have been proposed, evaluation tasks often differ between individual works, and might not fully recapitulate the fundamental challenges of genome annotation, including the length, scale and sparsity of the data. In this study, we introduce BEND, a Benchmark for DNA language models, featuring a collection of realistic and biologically meaningful downstream tasks defined on the human genome. We find that embeddings from current DNA LMs can approach performance of expert methods on some tasks, but only capture limited information about long-range features. BEND is available at https://github.com/frederikkemarin/BEND.
Submitted 9 April, 2024; v1 submitted 21 November, 2023;
originally announced November 2023.
-
Internal-Coordinate Density Modelling of Protein Structure: Covariance Matters
Authors:
Marloes Arts,
Jes Frellsen,
Wouter Boomsma
Abstract:
After the recent ground-breaking advances in protein structure prediction, one of the remaining challenges in protein machine learning is to reliably predict distributions of structural states. Parametric models of fluctuations are difficult to fit due to complex covariance structures between degrees of freedom in the protein chain, often causing models to either violate local or global structural constraints. In this paper, we present a new strategy for modelling protein densities in internal coordinates, which uses constraints in 3D space to induce covariance structure between the internal degrees of freedom. We illustrate the potential of the procedure by constructing a variational autoencoder with full covariance output induced by the constraints implied by the conditional mean in 3D, and demonstrate that our approach makes it possible to scale density models of internal coordinates to full protein backbones in two settings: 1) a unimodal setting for proteins exhibiting small fluctuations and limited amounts of available data, and 2) a multimodal setting for larger conformational changes in a high data regime.
Submitted 24 January, 2024; v1 submitted 27 February, 2023;
originally announced February 2023.
-
What is a meaningful representation of protein sequences?
Authors:
Nicki Skafte Detlefsen,
Søren Hauberg,
Wouter Boomsma
Abstract:
How we choose to represent our data has a fundamental impact on our ability to subsequently extract information from them. Machine learning promises to automatically determine efficient representations from large unstructured datasets, such as those arising in biology. However, empirical evidence suggests that seemingly minor changes to these machine learning models yield drastically different data representations that result in different biological interpretations of data. This begs the question of what even constitutes the most meaningful representation. Here, we approach this question for representations of protein sequences, which have received considerable attention in the recent literature. We explore two key contexts in which representations naturally arise: transfer learning and interpretable learning. In the first context, we demonstrate that several contemporary practices yield suboptimal performance, and in the latter we demonstrate that taking representation geometry into account significantly improves interpretability and lets the models reveal biological information that is otherwise obscured.
Submitted 7 March, 2022; v1 submitted 28 November, 2020;
originally announced December 2020.
-
Protein structure validation and refinement using amide proton chemical shifts derived from quantum mechanics
Authors:
Anders S. Christensen,
Troels E. Linnet,
Mikael Borg,
Wouter Boomsma,
Kresten Lindorff-Larsen,
Thomas Hamelryck,
Jan H. Jensen
Abstract:
We present the ProCS method for the rapid and accurate prediction of protein backbone amide proton chemical shifts - sensitive probes of the geometry of key hydrogen bonds that determine protein structure. ProCS is parameterized against quantum mechanical (QM) calculations and reproduces high level QM results obtained for a small protein with an RMSD of 0.25 ppm (r = 0.94). ProCS is interfaced with the PHAISTOS protein simulation program and is used to infer statistical protein ensembles that reflect experimentally measured amide proton chemical shift values. Such chemical shift-based structural refinements, starting from high-resolution X-ray structures of Protein G, ubiquitin, and SMN Tudor Domain, result in average chemical shifts, hydrogen bond geometries, and trans-hydrogen bond ($^{h3}J_{NC'}$) spin-spin coupling constants that are in excellent agreement with experiment. We show that the structural sensitivity of the QM-based amide proton chemical shift predictions is needed to refine protein structures to this agreement. The ProCS method thus offers a powerful new tool for refining the structures of hydrogen bonding networks to high accuracy with many potential applications such as protein flexibility in ligand binding.
Submitted 24 November, 2013; v1 submitted 9 May, 2013;
originally announced May 2013.
-
Potentials of Mean Force for Protein Structure Prediction Vindicated, Formalized and Generalized
Authors:
Thomas Hamelryck,
Mikael Borg,
Martin Paluszewski,
Jonas Paulsen,
Jes Frellsen,
Christian Andreetta,
Wouter Boomsma,
Sandro Bottaro,
Jesper Ferkinghoff-Borg
Abstract:
Understanding protein structure is of crucial importance in science, medicine and biotechnology. For about two decades, knowledge based potentials based on pairwise distances -- so-called "potentials of mean force" (PMFs) -- have been center stage in the prediction and design of protein structure and the simulation of protein folding. However, the validity, scope and limitations of these potentials are still vigorously debated and disputed, and the optimal choice of the reference state -- a necessary component of these potentials -- is an unsolved problem. PMFs are loosely justified by analogy to the reversible work theorem in statistical physics, or by a statistical argument based on a likelihood function. Both justifications are insightful but leave many questions unanswered. Here, we show for the first time that PMFs can be seen as approximations to quantities that do have a rigorous probabilistic justification: they naturally arise when probability distributions over different features of proteins need to be combined. We call these quantities reference ratio distributions deriving from the application of the reference ratio method. This new view is not only of theoretical relevance, but leads to many insights that are of direct practical use: the reference state is uniquely defined and does not require external physical insights; the approach can be generalized beyond pairwise distances to arbitrary features of protein structure; and it becomes clear for which purposes the use of these quantities is justified. We illustrate these insights with two applications, involving the radius of gyration and hydrogen bonding. In the latter case, we also show how the reference ratio method can be iteratively applied to sculpt an energy funnel. Our results considerably increase the understanding and scope of energy functions derived from known biomolecular structures.
Submitted 23 November, 2010; v1 submitted 24 August, 2010;
originally announced August 2010.