Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
0% found this document useful (0 votes)
46 views4 pages

Machine Learning For Chemistry

Download as pdf or txt
Download as pdf or txt
Download as pdf or txt
You are on page 1/ 4

comment

Best practices in machine learning for chemistry


Statistical tools based on machine learning are becoming integrated into chemistry research workflows. We discuss
the elements necessary to train reliable, repeatable and reproducible models, and recommend a set of guidelines
for machine learning reports.

Nongnuch Artrith, Keith T. Butler, François-Xavier Coudert, Seungwu Han, Olexandr Isayev,
Anubhav Jain and Aron Walsh

C
hemistry has long-benefited from history1. Algorithmic innovation, improved firmly established. As with any scientific
the use of models to interpret data availability, and increases in computer report, it is essential that sufficient
patterns in data, from the Eyring power have led to an unprecedented growth information and data is made available for
equation in chemical kinetics, the scales in the field2,3. Extending the previous a machine learning study to be critically
of electronegativity to describe chemical generation of high-throughput methods, and assessed and repeatable. As a community,
stability and reactivity, to the ligand-field building on the many extensive and curated we must work together to significantly
approaches that connect molecular structure databases available, the ability to map improve the efficiency, effectiveness, and
and spectroscopy. Such models are typically between the chemical structure of molecules reproducibility of machine learning models
in the form of reproducible closed-form and materials and their physical properties and datasets by adhering to the FAIR
equations and remain relevant over the has been widely demonstrated using (findable, accessible, interoperable, reusable)
course of decades. However, the rules of supervised learning for both regression guiding principles for scientific data
chemistry are often limited to specific (for example, reaction rate) and classification management and stewardship12.
classes of systems (for example, electron (for example, reaction outcome) problems. Below, we outline a set of guidelines
counting for polyhedral boranes) and Notably, molecular modelling has benefited to consider when building and applying
conditions (for example, thermodynamic from interatomic potentials based on machine learning models. These should
equilibrium or a steady state). Gaussian processes4 and artificial neural assist in the development of robust models,
Beyond the limits where simple analytical networks5 that can reproduce structural providing clarity for manuscripts, and
expressions are applicable or sophisticated transformations at a fraction of the cost building the credibility needed for statistical
numerical models can be computed, statistical required by standard first-principles tools to gain widespread acceptance and
modelling and analysis are becoming valuable simulation techniques. The research utility in chemistry.
research tools in chemistry. These present literature itself has become a valuable
an opportunity to discover new or more resource for mining latent knowledge using Guidelines when using machine
generalized relationships that have previously natural language processing, as recently learning models
escaped human intuition. Yet, practitioners of applied to extract synthesis recipes for 1. Data sources. The quality, quantity
these techniques must follow careful protocols inorganic crystals6. Beyond data-mining, the and diversity of available data impose an
to achieve levels of validity, reproducibility, efficient exploration of chemical hyperspace, upper limit on the accuracy and generality
and longevity similar to those of established including the solution of inverse-design of any derived model. The use of static
methods. problems, is becoming tractable through datasets (for example, from established
The purpose of this Comment is to suggest the application of autoencoders and chemical databases) leads to a linear
a standard of ‘best practices’ to ensure that generative models7. Unfortunately, the lack model construction process from data
the models developed through statistical of transparency surrounding data-driven collection → model training. In contrast,
learning are robust and observed effects are methods has led some scientists to question dynamic datasets (for example, from guided
reproducible. We hope that the associated the validity of results and argue that the field experiments or calculations) lead to an
checklist (Fig. 1 and Supplementary Data 1) faces a “reproducibility crisis”8. iterative model-construction process that
will be useful to authors, referees, and readers The transition to an open-science is sometimes referred to as active learning,
to guide the critical evaluation of, and provide ecosystem that includes reproducible with data collection → model training →
a degree of standardization to, the training workflows and the publication of use model to identify missing data → repeat.
and reporting of machine learning models. supporting data in machine-readable Care must be taken with data selection in
We propose that publishers can create formats is ongoing within chemistry9. In both regimes.
submission guidelines and reproducibility computational chemistry, reproducibility Most data sources are biased. Bias
policy for machine-learning manuscripts and the implementation of mainstream can originate from the method used
assisted by the provided checklist. We hope methods, such as density functional theory, to generate or acquire the data, for
that many scientists will spearhead this have been investigated10. This, and other example, an experimental technique that
campaign and voluntarily provide a machine studies11, proposed open standards that is more sensitive to heavier elements, or
learning checklist to support their papers. are complemented by the availability of simulation-based datasets that favour
online databases. The same must be done materials with small crystallographic unit
The growth of machine learning and for data-driven methods. Machine learning cells due to limits on the computational
making it FAIR for chemistry represents a developing power available. Bias can also arise from the
The application of statistical machine area where data is a vital commodity, but context of a dataset compiled for a specific
learning techniques in chemistry has a long protocols and standards have not been purpose or by a specific sub-community,
Nature Chemistry | VOL 13 | June 2021 | 505–508 | www.nature.com/naturechemistry 505
comment

Checklist for reporting and evaluating machine learning models

1. Data sources

1a. Are all data sources listed and publicly available?

1b. If using an external database, is an access date or version number provided?

1c. Are any potential biases in the source dataset reported and/or mitigated?

2. Data cleaning

2a. Are the data cleaning steps clearly and fully described, either in text or as a code pipeline?

2b. Is an evaluation of the amount of removed source data presented?

2c. Are instances of combining data from multiple sources clearly identified, and potential issues mitigated?

3. Data representations

3a. Are methods for representing data as features or descriptors clearly articulated, ideally with software implementations?

3b. Are comparisons against standard feature sets provided?

4. Model choice

4a. Is a software implementation of the model provided such that it can be trained and tested with new data?

4b. Are baseline comparisons to simple/trivial models (for example, 1-nearest neighbour, random forest, most frequent class) provided?

4c. Are baseline comparisons to current state-of-the-art provided?

5. Model training and validation

5a. Does the model clearly split data into different sets for training (model selection), validation (hyperparameter opimization), and testing (final evaluation)?

5b. Is the method of data splitting (for example, random, cluster- or time-based splitting, forward cross-validation) clearly stated?
Does it mimic anticipated real-world application?

5c. Does the data splitting procedure avoid data leakage (for example, is the same composition present in training and test sets)?

6. Code and reproducibility

6a. Is the code or workflow available in a public repository?

6b. Are scripts to reproduce the findings in the paper provided?

Fig. 1 | A suggested author and reviewer checklist for reporting and evaluating machine learning models. This proposed checklist is also provided as
Supplementary Data 1.

as recently explored for reagent choice potentials from regions of a potential energy 10% of erroneous data. Indeed, one study
and reaction conditions used in inorganic surface that are most relevant14, but any found that 14% of the data describing the
synthesis13. A classic example of the perils bias, or attempts at its mitigation, should be elastic properties of crystals in the Materials
of a biased dataset came on 3 November discussed. Project is unphysical15. Cleaning steps
1948, when The Chicago Tribune headline Databases often evolve over time, with include removing duplicates, entries with
declared ‘Dewey Defeats Truman’ based on new data added (continuously or by batch missing values, incoherent or unphysical
projecting results from the previous day’s releases). For reasons of reproducibility, values, or data type conversions. Data
U. S. presidential election. In truth, Truman it is essential that these databases use curation may also have been performed
defeated Dewey (303–189 in the Electoral some mechanism for version control (for before publication of the original dataset.
College). The source of the error? The use example, release numbers, Git versioning, This cleaning of the data can also include
of phone-based polls at a time when mostly or timestamps) as part of the metadata and normalization and homogenization, where
wealthy (and Republican-leaning) citizens maintain long-term availability to previous several sources are combined. Attention
owned phones. One can imagine analogous versions of the database. should be given to the characterization of
sampling errors regarding chemical datasets, We recommend listing all data sources, possible discrepancies between sources,
where particular classes of ‘fashionable’ documenting the strategy for data selection, and the impact of homogenization on
compounds such as metal dichalcogenides and including access dates or version derived machine learning models. The
or halide perovskites may feature widely, numbers. If data is protected or proprietary, dramatic effect of data quality on model
but do not represent the diversity of a minimally reproducible example using a performance and the importance of careful
all materials. public dataset can be an alternative. data curation has been highlighted in the
It is important to identify and discuss closely related field of cheminformatics16,17.
the sources and limitations of a dataset. 2. Data cleaning and curation. Raw datasets One seminal study showed examples of
Bias may be intended and desirable, for often contain errors, omissions, or outliers. how accumulation of database errors and
example, in the construction of interatomic It is common for databases to contain over incorrect processing of chemical structures

506 Nature Chemistry | VOL 13 | June 2021 | 505–508 | www.nature.com/naturechemistry


comment

could lead to significant losses in the 4. Model choice. Many flavours of machine detect overfitting during training of the
predictive ability of machine learning learning exist, from classical algorithms such parameters. The model hyperparameters are
models18. When errors are identified as the ‘support-vector machines’, ensemble optimized against the performance on the
in public databases, it is important to methods like ‘random forests’, to deep validation set. A test set of unseen data is
communicate these to the dataset maintainer learning methods involving complex neural then used to assess the accuracy of the final
as part of the research process. network architectures. High accuracy in model and again to detect overfitting. These
The ability of a statistical model to be tasks involving chemical problems has been three sets can be formed from random splits
‘right for the wrong reasons’ can occur when reported for graph-based neural networks of the original dataset, or by first clustering
the true signal is correlated with a false designed to represent bonding interactions the data into similar types to ensure a
one in the data. In one notable example, a between elements22,23. Transfer-learning diverse split is achieved. In estimating the
high-accuracy model was trained to predict techniques make it possible to train superior training accuracy, the mean-squared errors
the performance of Buchwald−Hartwig models from the smaller datasets that are are usually inspected and reported, but it
cross-coupling19. The findings prompted the common in chemistry, with one success case should be confirmed that the accuracy is
suggestion that almost the same accuracy being the retraining of a general-purpose achieved uniformly over the whole dataset.
could be achieved if all features in the dataset interatomic potential based on a small The computational intensiveness of the
are replaced with random strings of digits20. dataset of high-quality quantum mechanical training process should also be reported
We recommend describing all cleaning calculations24. as the utility of the approach to others
steps applied to the original data, while also However, the sophistication of a model will depend on the data and resource
providing an evaluation of the extent of data is unrelated to the appropriateness for a required. For example, sequence-based
removed and modified through this process. given problem: higher complexity is not generative models are a powerful approach
As it is impossible to check large databases always better. In fact, model complexity for molecular de novo design but training
manually, the implementation and sharing often comes with the cost of reduced them using recurrent neural networks is
of semi-automated workflows integrating transparency and interpretability. The use currently only feasible if one has access to
data curation pipelines is crucial. of a six-layer neural network to predict state-of-the-art graphics processing units
earthquake aftershocks25 was the subject of and millions of training samples27. Following
3. Data representation. The same type of vigorous online debate, as well as a formal conventional terminology, the validation set
chemical information can be represented rebuttal26 demonstrating that a single neuron is only used during training, whereas the
in many ways. The choice of representation with only two free parameters (as opposed independent test set is used for assessing a
(or encoding) is critical in model building to the 13,451 of the original model) could trained model prior to application. However,
and can be as important for determining provide the same level of accuracy. This the accuracy of a trained model on an
model performance as the choice of machine case highlights the importance of baselines arbitrary test set is not a universal metric for
learning method. It is therefore essential that include selecting the most frequent evaluating performance.
to evaluate different representations class (classification), always predicting the The test set must be representative of the
when constructing a new model. For the mean (regression), or comparing results intended application range. For example,
representation of molecules and extended against a model with no extrapolative a model trained on solvation structures
crystals, various approaches have been power, such as a 1-nearest-neighbour, which and energies under acidic conditions may
developed. Some capture the global features essentially ‘looks up’ the closest known be accurate on similar data, but not be
of the entire molecule or crystallographic data point when making a prediction. In transferable to basic conditions. Reliable
unit cell, while others represent local cases where machine learning alternatives measures of test accuracy can be difficult to
features such as bonding environments or for conventional techniques are proposed, formulate. One study assessed the accuracy
fragments, and some combine both aspects. a comparison with the state-of-the-art of machine learning models trained to
Both hand-crafted descriptors, which make is another important baseline test and a predict steel fatigue strength or critical
use of prior knowledge (and are often general measure of the success of the model. temperature of superconductivity using
computationally efficient), and general We recommend justifying your model random cross-validation or clustered by a
learned descriptors (unbiased but usually choice by including baseline comparisons diversity splitting strategy28. In the latter
computationally demanding) can be used. to simpler — even trivial — models, as well scenario, the model accuracies dropped
In chemistry, it is beneficial if the chosen as the current state-of-the-art. A software substantially (2–4× performance reduction).
representation obeys physical invariants implementation should be provided so that The models were extremely fragile to the
of the system, such as symmetry21. While the model can be trained and tested with introduction of new and slightly different
there is merit in developing new approaches, new data. data, to the point of losing any predictive
comparison with established methods (both power.
in accuracy and cost) is advisable so that 5. Model training and validation. Training Methods of validation that aim to
advantages and disadvantages are clear. a robust model must balance underfitting test extrapolative (versus interpolative)
We recommend that the methods and overfitting, which is important for both performance are being developed either
used for representing data are stated and the model parameters (for example, weights by excluding entire classes of compounds
compared with standard feature sets. It is in a neural network) and hyperparameters (known as leave-class-out selection or
advisable to draw from the experience of (for example, kernel parameters, activation scaffold split) for testing28, or by excluding
published chemical representation schemes, functions, as well as the choice and settings the extreme values in the dataset for
and their reference implementations in of the training algorithm). Three datasets testing29. Another industry standard
standard open libraries such as RDKit are involved in model construction and approach is time-split cross-validation30,
(https://www.rdkit.org), DScribe (https:// selection. A training set is used as an where a model is trained on historical data
singroup.github.io/dscribe), and Matminer optimization target for models to learn available at a certain date and tested on data
(https://hackingmaterials.lbl.gov/matminer) from for a given choice of hyperparameters. that is generated later, simulating the process
before attempting to design new ones. An independent validation set is used to of prospective validation.

Nature Chemistry | VOL 13 | June 2021 | 505–508 | www.nature.com/naturechemistry 507


comment

We recommend stating how the training, We recommend that the full code or Mellon University, Pittsburgh, PA, USA. 8Energy
validation, and test sets were obtained, as workflow is made available in a public Technologies Area, Lawrence Berkeley National
well as the sensitivity of model performance repository that guarantees long-term Laboratory, Berkeley, California, USA. 9Department
with respect to the parameters of the archiving (for example, an online repository of Materials, Imperial College London, London, UK.
training method, for example, when training archived with a permanent DOI). Providing 10
Department of Materials Science and Engineering,
is repeated with different random seeds or the code not only allows the study to be Yonsei University, Seoul, Korea.
ordering of the dataset. Validation should be exactly replicated by others, but to be Twitter: @nartrith; @keeeto2000; @fxcoudert;
performed on data related to the intended challenged, critiqued and further improved. @olexandr; @jainpapers; @lonepair
application. At the minimum, a script or electronic ✉e-mail: na2782@columbia.edu; keith.butler@stfc.
notebook should be provided that contains ac.uk; fx.coudert@chimieparistech.psl.eu;
6. Code and reproducibility. There is a all parameters to reproduce the results hansw@snu.ac.kr; olexandr@olexandrisayev.com;
reproducibility crisis across all fields of reported. ajain@lbl.gov; a.walsh@imperial.ac.uk
research. If we set aside cases of outright
misconduct and data fabrication, the Maintaining high digital standards Published online: 31 May 2021
selective reporting of positive results These new adventures in chemical research https://doi.org/10.1038/s41557-021-00716-z
is widespread. Going deeper, data are only possible thanks to those who
dredging (p-hacking) is a manipulation have contributed to the underpinning References
1. Gasteiger, J. & Zupan, J. Angew. Chem. Int. Ed. 32,
technique to find outcomes that can be techniques, algorithms, codes, and packages. 503–527 (1993).
presented as statistically significant, thus Developments in this field are supported by 2. Aspuru-Guzik, A. et al. Nat. Chem. 11, 286–294 (2019).
dramatically increasing the observed effect. an open-source philosophy that includes the 3. Butler, K. T. et al. Nature 559, 547–555 (2018).
4. Deringer, V. L. et al. J. Phys. Chem. Lett. 9, 2879–2885 (2018).
‘Hypothesizing after the results are known’ posting of preprints and making software 5. Behler, J. Angew. Chem. Int. Ed. 56, 12828–12840 (2017).
(HARKing) involves presenting a post-hoc openly and freely available. Future progress 6. Kononova, O. et al. Sci. Data 6, 203 (2019).
hypothesis in a research report as if it were, critically depends on these researchers being 7. Sanchez-Lengeling, B. & Aspuru-Guzik, A. Science 361,
360–365 (2018).
in fact, an a priori hypothesis. To strengthen able to demonstrate the impact of their 8. Hutson, M. Science 359, 725–726 (2018).
public trust in science and improve the contributions. In all reports, remember to 9. Coudert, F. X. Chem. Mater. 29, 2615–2617 (2017).
reproducibility of published research, it is cite the methods and packages employed 10. Lejaeghere, K. et al. Science 351, aad3000 (2016).
11. Smith, D. G. A. et al. WIREs Comp. Mater. Sci. 11, e1491 (2021).
important for authors to make their data and to ensure that the development community 12. Wilkinson, M. D. et al. Sci. Data 3, 160018 (2016).
code publicly available. This goes beyond receives the recognition they deserve. 13. Jia, X. et al. Nature 573, 251–255 (2019).
purely computational studies and initiatives The suggestions put forward in this 14. Artrith, N. et al. J. Chem. Phys. 148, 241711 (2018).
like the ‘dark reactions project’ to show the Comment have emerged from interactions 15. Chibani, S. & Coudert, F.-X. Chem. Sci. 10, 8589–8599 (2019).
16. Tropsha, A. Mol. Inform. 29, 476–488 (2010).
unique value of failed experiments that have with many researchers, and are in line with 17. Gramatica, P. et al. Mol. Inform. 31, 817–835 (2012).
never been reported in literature31. other perspectives on this topic33,34. While 18. Young, D., Martin, T., Venkatapathy, R. & Harten, P. QSAR Comb.
The first five steps require many there is great power and potential in the Sci. 27, 1337–1345 (2008).
19. Ahneman, D. T., Estrada, J. G., Lin, S., Dreher, S. D. & Doyle, A. G.
choices to be made by researchers to application and development of machine Science 360, 186–190 (2018).
train meaningful machine learning learning for chemistry, it is up to us to 20. Chuang, K. V. & Keiser, M. J. Science 362, eaat8603 (2018).
models. While the reasoning behind these establish and maintain a high standard of 21. Braams, B. J. & Bowman, J. M. Int. Rev. Phys. Chem. 28,
577 (2009).
choices should be reported, this alone research and reporting. ❐ 22. Chen, C. et al. Chem. Mater. 31, 3564–3572 (2019).
is not sufficient to meet the burden of 23. Xie, T. & Grossman, J. C. Phys. Rev. Lett. 120, 145301 (2018).
reproducibility32. Many variables that are Editor’s note: This article has been 24. Smith, J. S. et al. Nat. Commun. 10, 2903 (2019).
25. DeVries, P. M. R. et al. Nature 560, 632–634 (2018).
not typically listed in the methods section peer-reviewed. 26. Mignan, A. & Broccardo, M. Nature 574, E1–E3 (2019).
of a publication can play a role in the final 27. Olivecrona, M., Blaschke, T., Engkvist, O. & Chen, H. J.
result – the devil is in the hyperparameters. Nongnuch Artrith   1,2 ✉, Keith T. Butler3 ✉, Cheminformatics 9, 48 (2017).
Even software versions are important as François-Xavier Coudert   4 ✉, 28. Meredig, B. et al. Mol. Syst. Des. Eng. 3, 819–825 (2018).

Seungwu Han5 ✉, Olexandr Isayev   6,7 ✉,


29. Xiong, Z. et al. Comp. Mater. Sci. 171, 109203 (2020).
default variables often change. For large 30. Sheridan, R. P. J. Chem. Inf. Model 53, 783–790 (2013).
developments, the report of a standalone Anubhav Jain   8 ✉ and Aron Walsh   9,10 ✉ 31. Raccuglia, P. et al. Nature 533, 73–76 (2016).
code, for example in the Journal of Open 1
Department of Chemical Engineering, Columbia 32. Reproducibility and replicability in science. The National
Academies of Sciences, Engineering, and Medicine https://www.
Source Software, may be appropriate. It University, New York, NY, USA. 2Columbia Center nationalacademies.org/our-work/reproducibility-and-
is desirable to report auxiliary software for Computational Electrochemistry (CCCE), replicability-in-science (accessed 13 May 2021).
packages and versions required to run the Columbia University, New York, NY, USA. 3SciML, 33. Wang, A. Y.-T. et al. Chem. Mater. 32, 4954–4965 (2020).
34. Riley, P. Nature 572, 27–29 (2019).
reported workflows, which can be achieved Scientific Computing Department, STFC Rutherford
by listing all dependencies, by exporting Appleton Laboratory, Harwell Campus, Didcot, Competing interests
the software environment (for example, UK. 4Chimie ParisTech, PSL University, CNRS, The authors declare no competing interests.
conda environments) or by providing Institut de Recherche de Chimie Paris, Paris,
standalone containers for running the code. France. 5Department of Materials Science and Additional information
Initiatives are being developed to support Engineering, Seoul National University, Seoul, Supplementary information The online version
contains supplementary material available at https://doi.
the reporting of reproducible workflows, Korea. 6Computational Biology Department, School
org/10.1038/s41557-021-00716-z.
including https://www.commonwl.org, of Computer Science, Carnegie Mellon University, Peer review information Nature Chemistry thanks Joshua
https://www.researchobject.org and https:// Pittsburgh, Pennsylvania, PA, USA. 7Department Schrier and the other, anonymous, reviewer(s) for their
www.dlhub.org. of Chemistry, Mellon College of Science, Carnegie contribution to the peer review of this work.

508 Nature Chemistry | VOL 13 | June 2021 | 505–508 | www.nature.com/naturechemistry

You might also like