Abstract
Recently, we published an article in this journal that explored physics-based representations in combination with kernel models for predicting reaction properties (i.e. TS barrier heights). In an anonymous comment on our contribution, the authors argue, amongst other points, that deep learning models relying on atom-mapped reaction SMILES are more appropriate for the same task. This raises the question: are deep learning models sounding the death knell for kernel-based models? By studying several datasets that vary in the type of chemical (i.e. high-quality atom-mapping) and structural information (i.e. Cartesian coordinates of reactants and products) they contain, we illustrate that physics-based representations combined with kernel models are competitive with deep learning models. Indeed, in some cases, such as when reaction barriers are sensitive to the geometry, physics-based models represent the only viable candidate. Furthermore, we illustrate that the good performance of deep learning models relies on high-quality atom-mapping, which comes with significant human time-cost and, in some cases, is impossible. As such, both physics-based and graph models offer their own relative benefits for predicting the reaction barriers of differing datasets.
Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 license. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
1. Introduction
In 2022 we published an article [1] focused on extending the ability of quantum machine learning [2–5] (i.e. physics-based representations in combination with kernel models), to predict reaction (as opposed to only molecular) properties. State-of-the-art molecular representations [6–12] were modified for reactions, and benchmarked on four diverse 'reaction property' datasets (SN2-20 [1, 13], Proparg-21-TS [14], Hydroform-22-TS [1] and GDB7-20-TS [15]) that spanned both thermodynamic and kinetic properties across varying molecular sizes, dataset sizes, and diverse reaction classes. By studying the performance of different representations on the four datasets, key design principles for physics-based reaction representations were extracted and subsequently used to design a lightweight bond-based reaction representation.
Our physics-based reaction representations use nuclear charges and Cartesian coordinates to define each molecular species involved in a reaction. This is inspired from the origin of the computed labels [2–5], where a molecular representation mimics the role of the Hamiltonian. For neutral molecules, nuclear charges and coordinates are required to solve the Schrödinger equation for molecular properties. Accordingly, the datasets studied involve Cartesian coordinate (xyz) structure files of reactants and products along with the associated reaction properties (reaction energies or barriers). While other machine learning models may also be suitable, physics-based or quantum machine learning (QML) representations have often been used with kernel-ridge regression (KRR) models [2–12] due to their ability to work with high-dimensional input vectors (like state-of-the-art SLATM [9] and SOAP [12]) and the typically small (several hundred to several thousand) size of the datasets used in quantum chemistry. In a similar spirit, physics-based deep learning models [16–24] learn their own representations from the structural input data (i.e. nuclear charges and coordinates) and encode known physical priors, such as symmetries, into the network architecture. While these deep learning models are highly competitive with physics-based representations + KRR models in predicting computed molecular properties [16–24], they have not yet been adapted to predicting reaction properties, and therefore were not considered in our previous publication [1].
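To make the idea of a difference-based, physics-based reaction representation concrete, the toy sketch below builds a vector for each species from its nuclear charges and Cartesian coordinates (here, simple sorted Coulomb-like pair terms rather than the actual SLATM spectra) and represents the reaction as the products' sum minus the reactants' sum. The function names and the pairwise form are illustrative assumptions, not the representation used in this work:

```python
import numpy as np

def pairwise_rep(charges, coords, size=16):
    """Toy physics-based molecular representation: sorted Coulomb-like
    pair terms Z_i * Z_j / r_ij, zero-padded to a fixed length so that
    molecules of different sizes yield comparable vectors."""
    charges = np.asarray(charges, dtype=float)
    coords = np.asarray(coords, dtype=float)
    n = len(charges)
    terms = []
    for i in range(n):
        for j in range(i + 1, n):
            r = np.linalg.norm(coords[i] - coords[j])
            terms.append(charges[i] * charges[j] / r)
    vec = np.zeros(size)
    vals = np.sort(terms)[::-1][:size]  # largest interactions first
    vec[: len(vals)] = vals
    return vec

def reaction_rep(reactants, products, size=16):
    """Difference-based reaction representation: sum of product vectors
    minus sum of reactant vectors; each species is (charges, coords)."""
    total = lambda species: sum(pairwise_rep(z, xyz, size) for z, xyz in species)
    return total(products) - total(reactants)
```

The real SLATM representation encodes one-, two- and three-body spectra rather than bare pair terms, but the difference construction (products minus reactants) is the same in spirit.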
Published deep learning models for reaction property prediction [25–28], in contrast to physics-based models, rely on atom-mapped reaction SMILES strings and (in some cases) computed density functional theory (DFT) features [27, 28] to make predictions. Atom-mapping refers to the process of identifying which atoms in the reactants transform into which atoms in the products. This information is used, for example, in the condensed graph of reaction (CGR) model [25, 26] to build the reaction graph before passing it to chemprop [29] for training. In the case of DFT-feature augmented networks, reaction graphs [30, 31] constructed with atom-mapped SMILES are augmented with additional features obtained from quantum-chemical computations [27, 28]. Critically, atom-mapping provides key information to these networks, since the nodes of the reaction graph encode which transformations take place at which atoms. Atom-mapping, however, is no simple task [32–35]. State-of-the-art algorithms, which have only been adapted for organic chemistry and are likely to struggle with 'unseen' chemistry, either explicitly encode chemical heuristics [35] or learn them from millions of examples [34]. To avoid the noise associated with these automatic reaction-mapping tools, most published reaction data is preferentially atom-mapped by hand [15, 26, 36], a process that comes with significant human time cost. In short, using deep learning models [25–28] for reaction property prediction relies heavily on the feasibility of atom-mapping the reactions.
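As a hypothetical minimal illustration of what an atom-mapped reaction SMILES encodes, the snippet below extracts the reactant-to-product atom correspondence from the `:n` map numbers in bracket atoms. The regex-based parser and the example reaction are toy assumptions (a real pipeline would use a cheminformatics toolkit such as RDKit):

```python
import re

def atom_map_pairs(rxn_smiles):
    """Extract the atom-map correspondence from an atom-mapped reaction
    SMILES of the form 'reactants>>products'. Returns a dict
    {map_number: (reactant element, product element)} for atoms mapped
    on both sides of the reaction arrow."""
    def mapped_atoms(smiles):
        # match bracket atoms like [CH3:1], capturing element symbol and map number
        return {int(m.group(2)): m.group(1)
                for m in re.finditer(r"\[([A-Za-z][a-z]?)[^\]:]*:(\d+)\]", smiles)}
    reactants, products = rxn_smiles.split(">>")
    r, p = mapped_atoms(reactants), mapped_atoms(products)
    return {k: (r[k], p[k]) for k in r if k in p}

# toy hydration-like example: every heavy atom carries a map number
pairs = atom_map_pairs("[CH2:1]=[CH2:2].[OH2:3]>>[CH3:1][CH2:2][OH:3]")
```

It is exactly this correspondence that CGR-type models consume when constructing the reaction graph, and that is lost when the mapping is missing or noisy.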
In their contribution, the anonymous authors (henceforth, Anon et al) primarily emphasize their concern that our work solely focused on physics-based representations/kernel models and neglected baselines of recent deep learning models trained for similar tasks. While the emphasis of our original paper was clearly on physics-based models for reaction property learning, we agree that there is merit in including baselines of other model types (where available). Since the format of the datasets we studied was Cartesian coordinate structural files in three out of four cases (Proparg-21-TS, Hydroform-22-TS and SN2-20), we could not train available deep learning models that rely on atom-mapped reaction SMILES [25, 26] and, in some cases, computed DFT features [27, 28]. Anon et al compare the CGR deep learning model [26] from chemprop [29] with a SLATM+KRR model (see footnote 3) on the GDB7-20-TS dataset [15], which was the sole dataset of the four we studied that has input formats for both deep learning and physics-based models available. Based on the superior performance of CGR on this lone dataset, they question the relevance of QML models as a whole.
Here, in order to more broadly assess the performance of physics-based representations/kernel models and atom-mapping based deep learning models, we study two datasets (GDB7-22-TS [37] and Cyclo-23-TS [36]) where both atom-mapping and three-dimensional structures of reactants and products are provided. By varying the manner (and therefore the estimated time cost) in which the reactions are atom-mapped, we illustrate the direct and significant impact this key process has on the quality of the predictions. We further use the Proparg-21-TS dataset to highlight when geometric encoding is essential for making good predictions.
2. Computational details
2.1. Datasets
2.1.1. GDB7-22-TS
The GDB7-22-TS dataset [37] consists of close to 12 000 diverse organic reactions constructed from the GDB7 dataset [38–40] along with corresponding energy barriers computed at the CCSD(T)-F12a/cc-pVDZ-F12//ωB97X-D3/def2-TZVP level. This is an updated version of the GDB7-20-TS set [15] used in our previous work [1] and the corresponding comment.
2.1.2. Cyclo-23-TS
The Cyclo-23-TS dataset [36] encompasses 5269 reaction profiles for [3 + 2] cycloaddition reactions with barriers computed at B3LYP-D3(BJ)/def2-TZVP//B3LYP-D3(BJ)/def2-SVP level of theory.
2.1.3. Proparg-21-TS
The Proparg-21-TS dataset [14] contains 760 structures of intermediates before and after the enantioselective transition state of the propargylation of benzaldehyde, with barriers computed at B97D/TZV(2p,2d) level. We have studied this dataset on several occasions [1, 41] due to its sensitivity to the three-dimensional structure of reactants and products.
2.2. Data splits
In accordance with community standards, cross-validated random splits are used throughout. However, as suggested by Anon et al, we rigorously separate a validation set to optimize the hyperparameters of the KRR models. For all datasets, we use splits of 80% train, 10% validation and 10% test and cross-validate over 10 splits. As is done in chemprop [29], we ensure that every reaction in the dataset appears in the test set of exactly one fold.
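One possible realisation of this splitting scheme is sketched below, assuming a single shuffle cut into ten contiguous blocks, with fold k taking block k as test and the next block as validation; the exact split code used here may differ in detail:

```python
import numpy as np

def cv_splits(n_samples, n_folds=10, seed=0):
    """Yield (train, valid, test) index arrays for an 80/10/10 scheme in
    which each sample lands in the test set of exactly one fold."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    blocks = np.array_split(perm, n_folds)
    for k in range(n_folds):
        test = blocks[k]
        valid = blocks[(k + 1) % n_folds]  # neighbouring block as validation
        train = np.concatenate([b for i, b in enumerate(blocks)
                                if i not in (k, (k + 1) % n_folds)])
        yield train, valid, test
```

Rotating the test block through all folds guarantees that the reported test errors together cover the entire dataset.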
While open questions undoubtedly exist surrounding interpolation vs. extrapolation as well as designing challenging out-of-sample test sets such as those containing new chemistry [26, 28, 42–49], larger [9, 50, 51] and/or outlier [52] molecules/materials, or new sets appearing over time [53–55], this topic remains an ongoing community-wide discussion with no clear best practices, and will not be discussed further here.
2.3. Machine learning models
2.3.1. Physics-based kernel models
Throughout the results and discussion, the SLATMd representation is employed [1, 9, 41], which we found to have the best performance of the physics-based representations tested in our initial work [1]. Note that this differs from the SLATM representation used by Anon et al in their comment. While there is a small performance loss in moving from full SLATMd (including one-, two-, and three-body interactions) [9] to SLATM (including one- and two-body interactions only), it is only natural to compare the best-performing physics-based KRR model with the best-performing deep learning models.
Throughout, we refer to SLATMd as SLATM, but note it is the difference-based reaction representation rather than the molecular variation. As in our initial work, Gaussian kernels are used, where the hyperparameters (kernel width and regularization parameter) are optimized on each validation set (in a ten-fold 80%/10%/10% cross-validation scheme).
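A minimal numpy sketch of Gaussian-kernel ridge regression with validation-based hyperparameter selection is given below. The function names and the candidate grids for the kernel width and regularization strength are illustrative assumptions, not the production code behind our results:

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """K[i, j] = exp(-||A_i - B_j||^2 / (2 sigma^2))."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(d2, 0.0) / (2 * sigma**2))

def krr_fit(X, y, sigma, lam):
    """Solve (K + lam * I) alpha = y for the regression weights."""
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_new, sigma):
    return gaussian_kernel(X_new, X_train, sigma) @ alpha

def select_hypers(X_tr, y_tr, X_val, y_val,
                  sigmas=(1.0, 10.0, 100.0), lams=(1e-10, 1e-6, 1e-3)):
    """Grid search picking the (sigma, lam) pair with lowest validation MAE."""
    best = None
    for s in sigmas:
        for l in lams:
            alpha = krr_fit(X_tr, y_tr, s, l)
            mae = np.mean(np.abs(krr_predict(X_tr, alpha, X_val, s) - y_val))
            if best is None or mae < best[0]:
                best = (mae, s, l)
    return best[1], best[2]
```

For datasets of a few thousand points, the cubic-scaling linear solve remains cheap, which is one reason KRR pairs naturally with high-dimensional representations like SLATM.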
2.3.2. Deep learning models
2.4. SMILES generation
The Proparg-21-TS dataset [14] is distributed as files containing Cartesian coordinates of reactants and products. To benchmark deep learning models on this set, the Cartesian coordinates were converted to SMILES using the xyz2mol [56] function of cell2mol [57]. The corresponding reaction SMILES were saved to a CSV file for this dataset.
2.5. Atom mapping
The CGR model requires atom-mapped reaction SMILES as input. To assess the sensitivity of the trained models to the quality of the atom-mapping, three versions of each CGR model were tested (where possible): (i) with 'true' atom-mapping, as provided by the authors of the dataset (typically mapped by hand); (ii) with 'automatic' atom-mapping, performed by the open-source tool RXNMapper (version 0.2.9) [34]; and (iii) with 'random' atom-mapping, which randomly shuffles the atom-mapping indices of reactants and/or products. This latter option evaluates the efficacy of the CGR models without atom-mapping information.
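Option (iii) can be sketched as follows: permute the map numbers on one side of the reaction arrow so that both molecular graphs remain valid but the reactant-product correspondence is destroyed. This stdlib-only snippet is an illustrative assumption about the shuffling step, not the exact script used here:

```python
import random
import re

def shuffle_atom_maps(rxn_smiles, seed=0):
    """Randomly permute the atom-map numbers on the product side of an
    atom-mapped reaction SMILES ('reactants>>products'), destroying the
    atom correspondence while keeping both molecular graphs intact."""
    reactants, products = rxn_smiles.split(">>")
    maps = [int(m) for m in re.findall(r":(\d+)\]", products)]
    shuffled = maps[:]
    random.Random(seed).shuffle(shuffled)
    mapping = dict(zip(maps, shuffled))
    # rewrite each ':n]' suffix with its permuted map number
    out = re.sub(r":(\d+)\]", lambda m: f":{mapping[int(m.group(1))]}]", products)
    return reactants + ">>" + out
```

Because the set of map numbers is preserved, the shuffled input is still syntactically valid for CGR construction; only the mapping information content is removed.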
For the Proparg-21-TS dataset, no 'true' atom-mapping is provided, so we were unable to evaluate option (i). Option (ii) failed due to internal errors of RXNMapper (presumably arising from the foreign nature of the chemistry of the molecules in the dataset). Thus, only option (iii) is included.
2.6. Geometries
As in our original work [1], we use the optimized geometries (obtained from DFT computations) available in the published datasets to construct physics-based representations. While this is a commonly employed approach for various models describing molecules using their 3D structure [16, 18, 22–24, 58–61], the practice is a focus of criticism for physics-based models by Anon and co-workers. Specifically, they emphasize that obtaining 3D molecular geometries requires expensive pre-processing steps that are avoided when using atom-mapped reaction SMILES. While the 3D structure generation task from SMILES is by no means solved [62–69], structural estimates can be rapidly generated using distance geometry algorithms and quickly refined using force-field [63, 64], semi-empirical [70, 71] or tight-binding methods [72, 73]. Pipelines of this type facilitate the use of physics-based representations. Indeed, it has been shown on numerous occasions [54, 55, 74–76] that lower-quality geometries can be used successfully. This is particularly true for out-of-sample validation [74, 75], where higher-quality geometries are often not available.
3. Results and discussion
Figure 1 compares the performance of the various CGR models and the SLATM+KRR model on the three datasets discussed above.
3.1. Performance on the GDB7-22-TS dataset
Figure 1(a) shows the stark performance hierarchy of CGR models depending on the quality of the atom-mapping. With 'true' atom maps (CGR True, blue), the CGR model outperforms the SLATM+KRR model (red) by a significant margin (4.9 ± 0.2 kcal mol−1 vs 9.5 ± 0.9 kcal mol−1). Anon et al observed similar results comparing this 'true' CGR with two-body only SLATM+KRR on the GDB7-20-TS dataset, leading to their conclusion that kernel models are necessarily worse-performing than deep learning models for reaction property prediction. While the CGR model using RXNMapper (CGR RXNMapper, orange) still outperforms the SLATM+KRR model with a prediction error of 6.3 ± 0.3 kcal mol−1, completely removing atom-mapping information (CGR Random, green) results in a model with predictive errors of 11.2 ± 0.4 kcal mol−1, almost 2 kcal mol−1 worse than the SLATM+KRR model. These findings illustrate that the good performance of the default CGR model depends critically on the use of (high-quality) atom-mapping information, which is not used in physics-based models. This fact does not make SLATM+KRR an inherently poor model, but simply a model that uses different information (the three-dimensional structures of reactants and products) than CGR (atom-mapping), which renders it less competitive for this specific dataset. As will be seen, this is not the case for other datasets (vide infra).
3.2. Performance on the Cyclo-23-TS dataset
Figure 1(b) shows a smaller performance difference between the various atom-mapped CGR models compared to the GDB7-22-TS set. In particular, the results are almost identical for the RXNMapper (orange) and true atom maps (blue). Since this dataset contains only cycloaddition reactions, it is likely that the RXNMapper has been pre-trained on reactions of this type and, therefore, is able to correctly map this dataset. Removing atom-mapping from the CGR model (green) increases the predictive error from 2.78 ± 0.15 kcal mol−1 to 3.96 ± 0.15 kcal mol−1. Stuyver and Coley [27] reported errors of 2.96 kcal mol−1 using the augmented QM-GNN [28] (QM-GNN True, purple) or 3.09 kcal mol−1 for a regular GNN [30, 31] (WLN-GNN True, brown) for the same dataset (see footnote 4). Both of these models rely on True atom-mapping, though seemingly utilize the information less effectively than the CGR model. Our SLATM+KRR model achieves a predictive error of 3.13 ± 0.14 kcal mol−1, thus reaching similar performance to the WLN-GNN [27, 30, 31], and outperforming the CGR version without atom-mapping information.
The figure 1(b) results illustrate that the Cyclo-23-TS dataset is more sensitive to the 3D structure of reactants and products, and less sensitive to chemical information regarding transformation sites. Since the dataset consists of a single reaction class, the chemical information is likely less relevant than in the GDB7-22-TS, which contains mixed reaction classes. The SLATM+KRR model is, therefore, a competitive candidate for this dataset.
3.3. Performance on the Proparg-21-TS dataset
The Proparg-21-TS dataset fundamentally differs from the GDB7-22-TS and Cyclo-23-TS sets, in that the original data format is Cartesian coordinates of reactant and product structures rather than (atom-mapped) reaction SMILES. This naturally made it well-suited to learning from physics-based reaction representations, as demonstrated in several publications [1, 41]. As mentioned in section 2.5 and shown in figure 1(c), no original atom-mapping information exists (blue), and attempts to atom-map using RXNMapper failed for this dataset (orange). Thus, we could only obtain CGR results without atom-mapping information. Here, the difference between the CGR Random (green) and SLATM+KRR (red) is drastic, with prediction errors of 2.9 ± 0.4 kcal mol−1 and 0.39 ± 0.08 kcal mol−1 respectively. This dataset, being highly sensitive to three-dimensional geometry (containing both R and S enantiomers), illustrates that situations exist where the CGR is not only non-competitive, but totally inappropriate. The SLATM+KRR model, on the other hand, achieves excellent predictive capabilities.
3.4. Different datasets, different challenges
The different behavior of the various CGR models and the SLATM+KRR model across the three datasets illustrates the great diversity of challenges posed by different reaction datasets. The GDB7-22-TS (as well as its related GDB7-20-TS [15] discussed by Anon et al) contains multiple reaction classes, making it more sensitive to the chemistry of these reactions than to the associated changes in geometry when moving from reactants to products. The CGR model performs well for this set only when using high-quality atom-mapping information that, while available for this specific dataset, may not be available for others. Comparing the CGR model without atom-mapping information to the physics-based SLATM+KRR shows better predictive performance by the latter, since it encodes additional (structural) information that CGR Random lacks. The Cyclo-23-TS dataset, on the other hand, focuses on a single reaction class, making it less sensitive to atom-mapping information. Correspondingly, the SLATM+KRR model comes closer in predictive accuracy to the True atom-mapped model baselines. Finally, the Proparg-21-TS dataset comprises different enantiomers and is, therefore, highly sensitive to encoding the reaction geometry, making the SLATM+KRR model the only viable candidate for accurate barrier prediction.
4. Conclusion
In conclusion, we have illustrated that existing models for predicting reaction barriers rely either on chemical information (in the form of atom-mapped reactions) or structural information (in the form of Cartesian coordinates of reactants and products). By varying the quality of atom-mapping from hand mapping ('True') to automated tool mapping ('RXNMapper') to no mapping ('Random'), we have shown how CGR model performance varies on different datasets. Comparing these results to those obtained using the physics-based SLATM+KRR model reveals the relative importance of a chemical or structural prior. Only the GDB7-22-TS dataset, composed of multiple reaction classes, is highly sensitive to a chemical prior. Physics-based models, on the other hand, are more competitive (and in some cases, the only predictive option) for datasets consisting of fewer or single reaction types, where structural priors become more important. Since 'True' atom-mapping relies on expert knowledge and detailed understanding of the underlying reaction mechanism, while physics-based models are built using 3D geometries of reactants and products, we argue that the CGR model and SLATM+KRR each offer their own relative advantages for predicting barriers of datasets of varying composition. In response to the question: 'Are deep learning models really the death knell for physics-based kernel representations?' we contend the answer is a resounding 'no'.
Acknowledgment
Malte Franke (related work on the Proparg-21-TS dataset) as well as Alexandre Schoepfer and Ksenia Briling (helpful discussions) are acknowledged. The authors are grateful to the EPFL and the National Centre of Competence in Research (NCCR) 'Sustainable chemical process through catalysis' (Catalysis) of the Swiss National Science Foundation (SNSF, Grant No. 180544) for financial support.
Data availability statement
All code and data required to reproduce these results are available at https://github.com/lcmd-epfl/reply-physics-reactions.
Footnotes
- 3: A two-body-only version of the full SLATMd (one-, two- and three-body interactions).
- 4: Errors reported for the non-ensembled GNN, 90/5/5 splits, in the supplementary material of the publication.