
CARE: a Benchmark Suite for the Classification and Retrieval of Enzymes

Authors: Jason Yang, Ariane Mora, Shengchao Liu, Bruce J. Wittmann, Anima Anandkumar, Frances H. Arnold, Yisong Yue

Abstract: Enzymes are important proteins that catalyze chemical reactions. In recent years, machine learning methods have emerged to predict enzyme function from sequence; however, there are no standardized benchmarks to evaluate these methods. We introduce CARE, a benchmark and dataset suite for the Classification And Retrieval of Enzymes (CARE). CARE centers on two tasks: (1) classification of a protein sequence by its enzyme commission (EC) number and (2) retrieval of an EC number given a chemical reaction. For each task, we design train-test splits to evaluate different kinds of out-of-distribution generalization that are relevant to real use cases. For the classification task, we provide baselines for state-of-the-art methods. Because the retrieval task has not been previously formalized, we propose a method called Contrastive Reaction-EnzymE Pretraining (CREEP) as one of the first baselines for this task. CARE is available at https://github.com/jsunn-y/CARE/.

Submitted 21 June, 2024; originally announced June 2024.

arXiv:2104.04457 [pdf, other]

doi 10.1016/j.cbpa.2021.04.004

Protein sequence design with deep generative models

Authors: Zachary Wu, Kadina E. Johnston, Frances H. Arnold, Kevin K. Yang

Abstract: Protein engineering seeks to identify protein sequences with optimized properties. When guided by machine learning, protein sequence generation methods can draw on prior knowledge and experimental efforts to improve this process. In this review, we highlight recent applications of machine learning to generate protein sequences, focusing on the emerging field of deep generative methods.

Submitted 9 April, 2021; originally announced April 2021.

Comments: 11 pages, 2 figures

arXiv:1902.07231 [pdf]

doi 10.1073/pnas.1901979116

Machine learning-assisted directed protein evolution with combinatorial libraries

Authors: Zachary Wu, S. B. Jennifer Kan, Russell D. Lewis, Bruce J. Wittmann, Frances H. Arnold

Abstract: To reduce experimental effort associated with directed protein evolution and to explore the sequence space encoded by mutating multiple positions simultaneously, we incorporate machine learning in the directed evolution workflow. Combinatorial sequence space can be quite expensive to sample experimentally, but machine learning models trained on tested variants provide a fast method for testing sequence space computationally. We validate this approach on a large published empirical fitness landscape for human GB1 binding protein, demonstrating that machine learning-guided directed evolution finds variants with higher fitness than those found by other directed evolution approaches. We then provide an example application in evolving an enzyme to produce each of the two possible product enantiomers (stereodivergence) of a new-to-nature carbene Si-H insertion reaction. The approach predicted libraries enriched in functional enzymes and fixed seven mutations in two rounds of evolution to identify variants for selective catalysis with 93% and 79% ee. By greatly increasing throughput with in silico modeling, machine learning enhances the quality and diversity of sequence solutions for a protein engineering problem.

Submitted 4 January, 2020; v1 submitted 19 February, 2019; originally announced February 2019.

Comments: Corrected best S-selective variant sequence in Figure 4. Corrected less R-selective variant sequences from Round II Input library in Table 2 and Supp Table 4. Corrections may also be found on PNAS version https://www.pnas.org/content/early/2019/12/26/1921770117

Journal ref: PNAS April 30, 2019 116 (18) 8852-8858

arXiv:1811.10775 [pdf, other]

Machine learning-guided directed evolution for protein engineering

Authors: Kevin K. Yang, Zachary Wu, Frances H. Arnold

Abstract: Machine learning (ML)-guided directed evolution is a new paradigm for biological design that enables optimization of complex functions. ML methods use data to predict how sequence maps to function without requiring a detailed model of the underlying physics or biological pathways. To demonstrate ML-guided directed evolution, we introduce the steps required to build ML sequence-function models and use them to guide engineering, making recommendations at each stage. This review covers basic concepts relevant to using ML for protein engineering as well as the current literature and applications of this new engineering paradigm. ML methods accelerate directed evolution by learning from information contained in all measured variants and using that information to select sequences that are likely to be improved. We then provide two case studies that demonstrate the ML-guided directed evolution process. We also look to future opportunities where ML will enable discovery of new protein functions and uncover the relationship between protein sequence and function.

Submitted 19 April, 2019; v1 submitted 26 November, 2018; originally announced November 2018.

Comments: Made significant revisions to focus on aspects most relevant to applying machine learning to speed up directed evolution

arXiv:0705.0201 [pdf, ps, other]

doi 10.1186/1745-6150-2-17

Neutral genetic drift can aid functional protein evolution

Authors: Jesse D Bloom, Philip A Romero, Zhongyi Lu, Frances H Arnold

Abstract: BACKGROUND: Many of the mutations accumulated by naturally evolving proteins are neutral in the sense that they do not significantly alter a protein's ability to perform its primary biological function. However, new protein functions evolve when selection begins to favor other, "promiscuous" functions that are incidental to a protein's biological role. If mutations that are neutral with respect to a protein's primary biological function cause substantial changes in promiscuous functions, these mutations could enable future functional evolution. RESULTS: Here we investigate this possibility experimentally by examining how cytochrome P450 enzymes that have evolved neutrally with respect to activity on a single substrate have changed in their abilities to catalyze reactions on five other substrates. We find that the enzymes have sometimes changed as much as four-fold in the promiscuous activities. The changes in promiscuous activities tend to increase with the number of mutations, and can be largely rationalized in terms of the chemical structures of the substrates. The activities on chemically similar substrates tend to change in a coordinated fashion, potentially providing a route for systematically predicting the change in one function based on the measurement of several others. CONCLUSIONS: Our work suggests that initially neutral genetic drift can lead to substantial changes in protein functions that are not currently under selection, in effect poising the proteins to more readily undergo functional evolution should selection "ask new questions" in the future.

Submitted 2 May, 2007; originally announced May 2007.

Journal ref: Biology Direct 2:17 (2007)

arXiv:0704.1885 [pdf, ps, other]

doi 10.1186/1741-7007-5-29

Evolution favors protein mutational robustness in sufficiently large populations

Authors: Jesse D. Bloom, Zhongyi Lu, David Chen, Alpan Raval, Ophelia S. Venturelli, Frances H. Arnold

Abstract: BACKGROUND: An important question is whether evolution favors properties such as mutational robustness or evolvability that do not directly benefit any individual, but can influence the course of future evolution. Functionally similar proteins can differ substantially in their robustness to mutations and capacity to evolve new functions, but it has remained unclear whether any of these differences might be due to evolutionary selection for these properties. RESULTS: Here we use laboratory experiments to demonstrate that evolution favors protein mutational robustness if the evolving population is sufficiently large. We neutrally evolve cytochrome P450 proteins under identical selection pressures and mutation rates in populations of different sizes, and show that proteins from the larger and thus more polymorphic population tend towards higher mutational robustness. Proteins from the larger population also evolve greater stability, a biophysical property that is known to enhance both mutational robustness and evolvability. The excess mutational robustness and stability is well described by existing mathematical theories, and can be quantitatively related to the way that the proteins occupy their neutral network. CONCLUSIONS: Our work is the first experimental demonstration of the general tendency of evolution to favor mutational robustness and protein stability in highly polymorphic populations. We suggest that this phenomenon may contribute to the mutational robustness and evolvability of viruses and bacteria that exist in large populations.

Submitted 14 April, 2007; originally announced April 2007.

Journal ref: BMC Biology 5:29 (2007)

arXiv:q-bio/0506002 [pdf]

doi 10.1073/pnas.0504070102

Why highly expressed proteins evolve slowly

Authors: D. Allan Drummond, Jesse D. Bloom, Christoph Adami, Claus O. Wilke, Frances H. Arnold

Abstract: Much recent work has explored molecular and population-genetic constraints on the rate of protein sequence evolution. The best predictor of evolutionary rate is expression level, for reasons which have remained unexplained. Here, we hypothesize that selection to reduce the burden of protein misfolding will favor protein sequences with increased robustness to translational missense errors. Pressure for translational robustness increases with expression level and constrains sequence evolution. Using several sequenced yeast genomes, global expression and protein abundance data, and sets of paralogs traceable to an ancient whole-genome duplication in yeast, we rule out several confounding effects and show that expression level explains roughly half the variation in Saccharomyces cerevisiae protein evolutionary rates. We examine causes for expression's dominant role and find that genome-wide tests favor the translational robustness explanation over existing hypotheses that invoke constraints on function or translational efficiency. Our results suggest that proteins evolve at rates largely unrelated to their functions, and can explain why highly expressed proteins evolve slowly across the tree of life.

Submitted 12 August, 2005; v1 submitted 2 June, 2005; originally announced June 2005.

Comments: 40 pages, 3 figures, with supporting information

Journal ref: Proc. Nat'l. Acad. Sci. USA 102(40):14338-14343 (2005)

arXiv:q-bio/0505018 [pdf]

Inferring interactions from combinatorial protein libraries

Authors: Jeffrey B. Endelman, Jesse D. Bloom, Christopher R. Otey, Marco Landwehr, Frances H. Arnold

Abstract: Proteins created by combinatorial methods in vitro are an important source of information for understanding sequence-structure-function relationships. Alignments of folded proteins from combinatorial libraries can be analyzed using methods developed for naturally occurring proteins, but this neglects the information contained in the unfolded sequences of the library. We introduce two algorithms, logistic regression and excess information analysis, that use both the folded and unfolded sequences and compare them against contingency table and statistical coupling analysis, which only use the former. The test set for this benchmark study is a library of fictitious proteins that fold according to a hypothetical energy model. Of the four methods studied, only logistic regression is able to correctly recapitulate the energy model from the sequence alignment. The other algorithms predict spurious interactions between alignment positions with strong but individual influences on protein stability. When present in the same protein, stabilizing amino acids tend to lower the energy below the threshold needed for folding. As a result, their frequencies in the alignment can be correlated even if the positions do not interact. We believe any algorithm that neglects the nonlinear relationship between folding and energy is susceptible to this error.

Submitted 6 February, 2006; v1 submitted 9 May, 2005; originally announced May 2005.

Comments: 21 pages, 2 figures

arXiv:q-bio/0411041 [pdf]

doi 10.1016/j.jmb.2005.05.023

Why high-error-rate random mutagenesis libraries are enriched in functional and improved proteins

Authors: D. Allan Drummond, Brent L. Iverson, George Georgiou, Frances H. Arnold

Abstract: Recently, several groups have used error-prone polymerase chain reactions to construct mutant libraries containing up to 27 nucleotide mutations per gene on average, and reported a striking observation: although retention of protein function initially declines exponentially with mutations as has previously been observed, orders of magnitude more proteins remain viable at the highest mutation rates than this trend would predict. Mutant proteins having improved or novel activity were isolated disproportionately from these heavily mutated libraries, leading to the suggestion that distant regions of sequence space are enriched in useful cooperative mutations and that optimal mutagenesis should target these regions. If true, these claims have profound implications for laboratory evolution and for evolutionary theory. Here, we demonstrate that properties of the polymerase chain reaction can explain these results and, consequently, that average protein viability indeed decreases exponentially with mutational distance at all error rates. We show that high-error-rate mutagenesis may be useful in certain cases, though for very different reasons than originally proposed, and that optimal mutation rates are inherently protocol-dependent. Our results allow optimal mutation rates to be found given mutagenesis conditions and a protein of known mutational robustness.

Submitted 18 February, 2005; v1 submitted 22 November, 2004; originally announced November 2004.

Comments: Optimality results improved. 26 pages, 4 figures, 3 tables

Journal ref: Journal of Molecular Biology 350(4):806-816 (2005).

arXiv:q-bio/0409013 [pdf, ps, other]

doi 10.1073/pnas.0406744102

Thermodynamic Prediction of Protein Neutrality

Authors: Jesse D. Bloom, Jonathan J. Silberg, Claus O. Wilke, D. Allan Drummond, Christoph Adami, Frances H. Arnold

Abstract: We present a simple theory that uses thermodynamic parameters to predict the probability that a protein retains the wildtype structure after one or more random amino acid substitutions. Our theory predicts that for large numbers of substitutions the probability that a protein retains its structure will decline exponentially with the number of substitutions, with the severity of this decline determined by properties of the structure. Our theory also predicts that a protein can gain extra robustness to the first few substitutions by increasing its thermodynamic stability. We validate our theory with simulations on lattice protein models and by showing that it quantitatively predicts previously published experimental measurements on subtilisin and our own measurements on variants of TEM1 beta-lactamase. Our work unifies observations about the clustering of functional proteins in sequence space, and provides a basis for interpreting the response of proteins to substitutions in protein engineering applications.

Submitted 4 December, 2004; v1 submitted 13 September, 2004; originally announced September 2004.

Journal ref: Proc. Natl. Acad. Sci. USA, 102:606-611, 2005

arXiv:q-bio/0401038 [pdf, ps, other]

doi 10.1016/S0006-3495(04)74329-5

Stability and the Evolvability of Function in a Model Protein

Authors: Jesse D Bloom, Claus O Wilke, Frances H Arnold, Christoph Adami

Abstract: Functional proteins must fold with some minimal stability to a structure that can perform a biochemical task. Here we use a simple model to investigate the relationship between the stability requirement and the capacity of a protein to evolve the function of binding to a ligand. Although our model contains no built-in tradeoff between stability and function, proteins evolved function more efficiently when the stability requirement was relaxed. Proteins with both high stability and high function evolved more efficiently when the stability requirement was gradually increased than when there was constant selection for high stability. These results show that in our model, the evolution of function is enhanced by allowing proteins to explore sequences corresponding to marginally stable structures, and that it is easier to improve stability while maintaining high function than to improve function while maintaining high stability. Our model also demonstrates that even in the absence of a fundamental biophysical tradeoff between stability and function, the speed with which function can evolve is limited by the stability requirement imposed on the protein.

Submitted 27 January, 2004; originally announced January 2004.

Comments: Biophysical Journal in press

Journal ref: Biophysical Journal, 86:2758-2764 (2004)

Showing 1–11 of 11 results for author: Arnold, F H