Abstract
Computer-driven molecular design combines the principles of chemistry, physics, and artificial intelligence to identify chemical compounds with tailored properties. While quantum-mechanical (QM) methods, coupled with machine learning, already offer a direct mapping from 3D molecular structures to their properties, effective methodologies for the inverse mapping in chemical space remain elusive. We address this challenge by demonstrating the possibility of parametrizing a chemical space with a finite set of QM properties. Our proof-of-concept implementation achieves an approximate property-to-structure mapping, the QIM model (which stands for âQuantum Inverse Mappingâ), by forcing a variational auto-encoder with a property encoder to obtain a common internal representation for both structures and properties. After validating this mapping for small drug-like molecules, we illustrate its capabilities with an explainability study as well as by the generation of de novo molecular structures with targeted properties and transition pathways between conformational isomers. Our findings thus provide a proof-of-principle demonstration aiming to enable the inverse property-to-structure design in diverse chemical spaces.
Similar content being viewed by others
Introduction
The discovery and optimization of chemical compounds can be accelerated thanks to the marked advancements in quantum and statistical methods, their implementation in advanced software, as well as the seemingly never-ending improvement in computer hardware1,2. Unlike the traditional painstaking trial-and-error process which heavily relied on experimental work (known as the Edisonian approach), we can now compute a wide range of accurate physicochemical properties of a given compound using quantum-mechanical (QM) methods with only the respective atomic coordinates and atom types. However, rationally exploring the incredibly vast chemical compound space (CCS, estimated to contain 1060 molecular structures even for small organic molecules)3via highly accurate QM methods is still unfeasible due to their prohibitive computational cost. In this regard, machine learning (ML) techniques have revolutionized the field of molecular design by offering a fast but just as accurate method for obtaining properties from 3D molecular structures (a.k.a., direct mapping)4,5,6,7,8, positioning them as an indispensable resource in high-throughput screening pipelines9,10,11,12,13,14,15. Although these approximate mappings have undoubtedly enhanced our understanding of the CCS, it is the possibility to invert them that has the potential to truly disrupt and transform the field. Addressing this challenge would allow us to predict the 3D molecular structures from their inherent properties, representing a paradigm shift in the design and discovery of chemical compounds with specific functionalities.
The quest to establish an inverse mapping has emerged as a formidable challenge, captivating the interest and dedication of researchers from various disciplines such as organic chemistry, materials science, and molecular docking16,17,18,19,20,21,22,23,24,25. A hint for the existence of such a property-to-structure relationship is given by considering the demonstrated capability of generative models to selectively generate random structures conditioned on a predefined set of desired properties. Indeed, generative modeling has yielded numerous groundbreaking research outcomes26,27,28, particularly in the field of cheminformatics, in which there is extensive literature delving into diverse generative architectures with research focusing mostly on language models and autoregressive generation employing a text-based representation like SMILES29,30,31,32,33,34,35,36,37,38,39,40. Models dealing with the intricate 3D structure of molecules have been recently developed, leading to promising results across different design tasks33,41,42,43,44,45,46,47. In the same breath, multi-objective optimization problems have been tackled using generative models and genetic algorithms, e.g., the design of functional organic molecules for optoelectronics applications and of candidate structures for dielectric organic materials48,49,50,51. These compelling examples underscore the relevance of tackling the inverse design problem, yet the potential of parametrizing the CCS using properties as coordinates remains unexplored. Overcoming this challenge would lay the groundwork for an alternative and multifaceted approach to understanding and manipulating the intricate relationship between the properties and structures of organic molecules.
Stemming from the âfreedom of designâ conjecture in the molecular property space espoused in our previous works52,53, this study aims to investigate three main aspects of generative models for molecular design: (i) the definition of a differentiable parameterization of CCS by leveraging QM data of equilibrium molecules, (ii) the explainability of the resulting property-to-structure mapping, and (iii) applicability range of the learned inverse mapping. At present, not all three of these points can be addressed using state-of-the-art models in conditional 3D molecular generation42,43. The primary reason for this lies in the inherent design of each model, which hinders the establishment of a differentiable property-to-structure mapping with respect to the molecular size, see âMethodsâ section for an extended explanation. Hence, we propose a more flexible and tailored approach, the QIM model (which stands for âQuantum Inverse Mappingâ), that combines a Variational Auto-Encoder (VAE) architecture to encode the molecular structures (represented as Coulomb matrices) with a property encoder to encode the associated QM properties, see Fig. 1. The joint training of the VAE and property encoders yields a low-dimensional internal representation that is common for both the molecular structures and QM properties. This enables us to combine the property encoder with the decoder component of the VAE and, hence, to successfully approximate the CCS parameterization using QM properties as intrinsic coordinates for navigating the chemical space of small drug-like molecules contained in the QM7-X dataset54. The QIM model accurately predicts the heavy atom composition of molecules, reconstructs their geometries with reasonable accuracy, and meets the criteria of being one-shot and differentiable, making it explainable and flexible. Thanks to the differentiability of our CCS parameterization, we can identify the most relevant properties in the molecular reconstruction process as well as substructures in the molecular property space covered by QM7-X. The conventional multi-objective generative modeling paradigm is also retrieved by conditionally sampling in the input space of properties to accomplish two distinct design tasks with well-defined objectives, producing results comparable to a specialized model such as cG-Schnet42 trained on the same tasks. As a final showcase of the capabilities of learning a CCS parameterization, we have implemented a geodesic search algorithm that uses the latent space representation as internal coordinates, enabling the definition of transition structures and the estimation of rotational energy profiles by learning only from equilibrium geometries. Although our proof-of-concept implementation is currently assessed only for small molecules, our findings suggest that a CCS parameterization based on QM properties is feasible and holds promise for a wide range of applications, including interpretability, generation of de novo molecules, transition path interpolation, and reaction barrier estimation. Thus, this work highlights the remarkable opportunities that arise from defining an inverse mapping approach to advance our understanding of the chemical compound space by connecting QM properties and molecular structures.
Results
Property-to-structure mapping
To train the VAE and the property encoder (see âMethodsâ section), we have considered a subset from QM7-X dataset of 40,988 equilibrium conformations of molecules with up to seven heavy atoms including C, N, and O. Here, 17 QM global extensive and intensive properties (listed in Table 1) per conformation were selected to define the property encoder. The technical details of the input data and training procedure are specified in the âMethodsâ section and Supplementary Note 3 of the Supplementary Information (SI), respectively.
We now assess the capability of the trained QIM model to establish an approximate parameterization of the chemical compound space (CCS) spanned by QM7-X based on the predefined set of QM properties. In doing so, we will make use of the molecules in the test set together with their corresponding properties, i.e., the property set of these molecules will be used to construct the model and, then, the generated molecule will be compared with the original one. Figure 2a shows the boxplots of the relative error on Coulomb matrix (CM) reconstruction for an increasing number of properties used to train the model. Here we have analyzed the relative error on CM instead of the root-mean-squared deviation (RMSD) between structures because the latter is only defined for molecules for which the composition is correctly predicted and, consequently, it will present more fluctuations across different number of properties especially when the number of properties used is low and hence the errors in composition are high. Accordingly, one can see that our model allows us to reconstruct the representation with an average error that converges to ~5% when considering more than seven properties during the training, after which the distribution becomes increasingly skewed as can be seen by the median value. This relative error is defined as \(\Delta=\frac{| \tilde{{{{{{{\bf{C}}}}}}}}-{{{{{{\bf{C}}}}}}}| }{| {{{{{{\bf{C}}}}}}}| }\times 100\), where C is the original CM and \(\tilde{{{{{{{\bf{C}}}}}}}}\) is the reconstructed one (both to be considered vectors). While we provide the mean and deviation of the CM reconstruction error, it is important to note that this indicator does not completely reflect the quality of the mapping due to its noisy and nonlinear correlation with RMSD (see Supplementary Fig. 1 of SI).
To have a better understanding of this finding, we also report the distribution and the cumulative distribution over the test set for the RMSD in the case of the full set of properties (see Fig. 2b). The performance of the QIM model deteriorates when considering a lower number of properties, as it is shown in Supplementary Fig. 2 of the SI. Indeed, the mode of the RMSD distribution, when considering the full set of properties, is close to 0.5 Ã . Overall, despite obtaining a wide RMSD spectrum (\(\left[0.05,1.6\right]\) Ã ), we found thatâ~â70% of the molecules in the test set were reconstructed within RMSDâ=â0.7 Ã . We empirically found this threshold to be adequate to separate molecules with an acceptable reconstruction of the heavy atom structure in terms of topology and orientation. To further motivate the selection of this threshold, we provide several illustrative examples of both original and reconstructed molecules in Fig. 2c, along with a quantitative analysis of topological errors in Supplementary Fig. 3 of the SI. Concerning the chemical composition reconstruction, the model exhibits excellent performance, correctly predicting it for 99.96% of the molecules in the test set. Additionally, we investigated the impact on the performance of the QIM model when considering extensive and intensive properties separately in the training procedure, see Supplementary Fig. 4 of the SI. The models trained on separated properties present a higher number of structures with large RMSD compared to the final model, indicating a lower performance in structure reconstruction. In fact, the model trained on intensive properties was only able to reconstruct with the right chemical composition approximately 8000 from 10,000 molecules of the test set. This decrease from 99.96 to 75.85% in molecular reconstruction comes together with an increment of the ãRMSDã up to 0.7 Ã as well as a reduction up to 55% of tested molecules reconstructed with an RMSD below 0.7 Ã . In brief, these results have confirmed the need to use both types of properties in the training of the QIM model to have a better reconstruction of molecular structures.
Interpretability and performance of our model
Then, the established CCS parameterization is analyzed by implementing a gradient attribution map, which enables us to assess the individual contributions of each property to the output structures. The calculation of an attribution map A for each property is explained in âMethodsâ section. Figure 3 shows the A values per property which have been normalized over the maximum value and sorted in decreasing order for the full representation. Overall, we have found that extensive properties are more informative than intensive ones for the task of molecular reconstruction. This can be explained by the fact that these extensive properties depend on crucial molecular features that are also considered in a 3D representation like CM, e.g., number of atoms, number of electrons (related to the chemical composition), and geometry. Moreover, when comparing CMs, even a slight difference of one atom can significantly increase the loss, leading to a larger sensitivity of the model to variations in system size and composition. Thus, A values for the components of the total energy and the molecular polarizability are higher compared with these for the molecular orbital energies and dipole moment; in particular, EXX and EKIN present the largest A values. Interestingly, this finding is in agreement with the identification of molecular clusters in the two-dimensional principal component analysis (PCA) of the latent space of the VAE encoder, i.e., the higher the A value, the more correlated the property with respect to the PCA (see insets in Fig. 3). A similar observation holds when using a more complex dimensionality reduction technique such as t-distributed stochastic neighbor embedding (t-SNE)55, see Supplementary Fig. 6 of the SI. The fact that a linear method such as PCA shows good organization in terms of energy-related properties is a non-trivial finding that will be later exploited in the interpolation of transition structures.
Furthermore, we have studied how the revealed hierarchy of QM properties organizes the QM7-X CCS. Starting from the properties with the highest A value, in Fig. 4a, one can see the two-dimensional projection of the QM7-X molecular property space defined by EKIN and EXX, i.e., \(\left({E}_{{{{{{{\rm{KIN}}}}}}}},{E}_{{{{{{{\rm{XX}}}}}}}}\right)\)-space. Despite these two properties having a high inverse correlation (Pearson coefficientâ=âââ0.92), it is noticeable how the molecules in the dataset seem to organize as linear-shape clusters containing molecules with the same heavy atom composition. In particular, upon examining those clusters, it becomes evident that EKIN is mostly influenced by the heavy atom composition within a molecule. On the other hand, EXX is highly sensitive to the number of H atoms, thereby indicating a dependence on the particular bond types that are present. This is further analyzed in Fig. 4b, where we provide a closer examination of one of the highly populated clusters (see yellow ellipse in plot) and highlight the molecules with respect to their number of H atoms. Here, we uncovered a finer local structure with almost perfect inverse correlation (Pearson coefficientâ=âââ0.99) as well as very compact clusters formed by isomers. Overall, this behavior can be understood by considering the qualitative aspects of EKIN and EXX: the dominant contribution to EKIN stems from the inner shell electrons (trivial consequence of the virial theorem) and the primary influence on EXX arises from the valence electrons. Also, exchange-related quantities have been found to play a significant role in characterizing bonds56,57, explaining their sensitivity to the number of H atoms in a molecule. These considerations account for the efficacy of these two QM properties in identifying clustered molecular isomer subspaces and accurately predicting heavy atom composition.
Next, we pick a molecular isomer subspace (see blue circle in plot) and show how different pairs of QM properties with high and low A values distribute the molecular structures (see Fig. 4c, d). As expected, high A values identify properties that are better local coordinates for exploring this subspace (vide supra). Indeed, these properties present relatively smaller changes in their values across related structures in comparison to properties with low A valuesâan example of their efficiency in identifying molecular structures within a specific molecular isomer subspace while effectively distinguishing them from other structures spanning the entire property spectrum. Notice that the same behaviors can be found throughout the entire set of QM7-X equilibrium molecules. These findings already provide compelling evidence for the potential of our proof-of-concept implementation in furthering our understanding of the molecular property space and unraveling the intricate relationship between QM properties and molecular structures.
The successful property-based parameterization of CCS achieved by the model is grounded in the remarkable similarity attained among the latent representations of CMs and respective properties. A visualization that demonstrates this observation can be seen in Fig. 5a, which showcases the overlap between the PCA of both latent representations. Similar result was obtained by using the t-SNE technique, see Supplementary Fig. 7 of the SI. This verifies the initial assumption we stated about the joint training procedure, see âMethodsâ section. Moreover, we have examined how the differences in latent space representation correlate with the quality of the reconstructed molecules. This analysis aims at obtaining a self-consistent method for error estimation. To this end, if z is the latent representation from the property encoder, we can take the reconstructed representations and encode them again with the VAE encoder, producing a new latent representation \(\tilde{{{{{{{\bf{z}}}}}}}}\). Then, we look at the correlation of the quantity \(| | \Delta {{{{{{\bf{z}}}}}}}| |=| | {{{{{{\bf{z}}}}}}}-\tilde{{{{{{{\bf{z}}}}}}}}| |\) with the RMSD between the original structures and the one reconstructed from properties considering the molecules in test set. In Fig. 5(b), one can see that there seem to be a bulk (approximately linear) correlation with the presence of numerous outliers. To further investigate this behavior, we also look at the boxplots of the RMSD for varying values of Îz by splitting the test set into subsets to get an adequate statistics for each point. The results are plotted in Fig. 5(c) in which we show that there is a nonlinear but monotonic behavior for the relationships between these quantities, finding a minimum for RMSD in the region of low â£â£Îzâ£â£ values (\(\in \left[0,0.4\right]\)). Since for reconstructing the actual molecular structure, we are mostly interested in having a low RMSD, we will filter out the generated structures for which â£â£Îzâ£â£ falls outside this predetermined interval in the following sections. This screening approach based on error estimation will be employed in the next application tests to enhance the quality of generated structures with a targeted set of QM properties.
Acting as a conditional generative model
The learned CCS parameterization based on QM properties can provide a versatile solution for multi-objective targeted molecular generation. Here, we show the capability of the QIM model to target predefined pairs of QM properties by using a multivariate Gaussian regression approach to model the distribution of the QM7-X property space (see âMethodsâ section). Based on this procedure, we first navigate through the \(\left(\alpha,{E}_{{{{{{{\rm{MBD}}}}}}}}\right)\)-space that encompasses two relevant properties for molecular reconstruction (high attribution value A), see Fig. 6a. The moderate degree of correlation between these properties of QM7-X molecules (i.e., Pearson coefficientâ=â0.60, gray dots) also grants us access to the intrinsic âfreedom of designâ in CCS when searching for molecules with desired functionality52. Subsequently, we transition to a weakly correlated 2D property space given by \(\left(\alpha,\zeta \right)\)-space (i.e., Pearson coefficientâ=â0.44, gray dots) by replacing EMBD with a dipole moment ζ (low A), see Fig. 6b. In the (EEMBD,âα) property space, fifteen samples (or 15-molecule set) were generated per targeted pair, while ten samples (or 10-molecule set) were generated per target in the (ζ,âα) one. The targets are represented as colored crosses with capital letters. The targeted α values were here selected to generate medium-to large-sized molecules, since α is an extensive property that mostly depends on the elemental composition as well as the connectivity of atoms within the molecule. Whereas, the targeted EMBD and ζ values were chosen to tune the topological effects (i.e., extended vs. compact, packed/globular vs. void space) and elemental composition (homogeneous vs. heterogeneous) in the generated molecules, respectively. The top 5 generated molecules (i.e., five optimal molecules) per target have been highlighted with colored circles in both 2D property spaces. In Fig. 6, we also show the generated structure of a selected molecule per target. Overall, these results demonstrate that our model is capable of generating diverse molecular structures with similar scaffolds as in QM7-X molecules by using only QM properties as input.
Given the discrete nature of CCS and particularly the reduced coverage of the employed QM7-X dataset, it is clear that the inverse mapping generates molecules that can deviate from the respective targeted values. The performance per target can be quantitatively measured by defining \(\epsilon=\frac{| {y}_{{{{{{{\rm{calc}}}}}}}}-{y}_{{{{{{{\rm{t}}}}}}}}| }{\Delta y}\times 100\) with ycalc and yt as the calculated and target values of the property y. Îy represents the extent of the property spectrum across the entire dataset. Accordingly, for each of the 15-molecule set depicted in Fig. 6a, the ϵ value is circa 9.2% for EMBD and circa 3.5% for α. These values are reduced to circa 3.2% for EMBD and circa 1.3% for α by only considering the top 5-molecule set. On the other hand, the molecules generated in \(\left(\alpha,\zeta \right)\)-space displayed ϵâââ2.8% for the prediction of α but the ϵ corresponding to ζ was circa 5.9% (see top 5-molecule set per target in Fig. 6b). Despite preventing the establishment of a differentiable property-to-structure mapping with respect to molecular size, we have used the state-of-the-art architecture cG-SchNet42 to rigorously examine our results in the two multi-property targeted molecular design tasks. Although cG-SchNet models achieve better performances, the QIM model remains comparable, as the difference in the values of ϵ is not substantial, ranging from 0.6 to 5.3% in favor of cg-Schnet (see Supplementary Note 5 and Supplementary Tables 1â4 of the SI). The only exception was the prediction of EMBD for the top 15 molecules in the \(\left(\alpha,{E}_{{{{{{{\rm{MBD}}}}}}}}\right)\)-space where the difference of ϵ values wasâââ7.3%. This result can be related to the explicit treatment of H atoms during the training of the cG-SchNet model, while in our model H atoms are added after the molecule is generated (see âMethodsâ section). Notice that, in terms of computational efficiency, our model achieves convergence in three hours, which is faster than the time required to train the cG-SchNet model on the same hardware (i.e., four days).
Furthermore, we have found that the spread of the generated molecules (i.e., flexibility of the model in molecular generation) can be rationalized by analyzing the relative variance of the conditional multi-Gaussian distributions of the non-targeted QM properties. In fact, molecules generated with the highest precision (i.e., targets A and D) are characterized by lower negative log likelihood values and small relative variances in the extensive non-targeted properties (see Supplementary Fig. 4 of the SI). This outcome could be a key factor for controlling the degree of flexibility when designing molecules in targeted regions of a given property space since a larger variance in extensive properties may result in a more diverse set of molecules as these properties are the most relevant in defining the molecular structure (see Fig. 3). In this regard, taking a closer look at the generated molecules, one can see that the molecules in the \(\left({E}_{{{{{{{\rm{MBD}}}}}}}},\alpha \right)\)-space display a greater diversity in heavy atom composition compared to the ones generated in the \(\left(\zeta,\alpha \right)\)-space, see Supplementary Fig. 8 of the SI. Importantly, the presence of diverse chemical compositions and structures within the larger sets for each target is another evidence that our approach does not follow a conventional chemical exploration based on fixed molecular scaffolds. Instead, our inverse design procedure can fully utilize the diversity of property-to-structure relations as embodied in the recently proposed âfreedom of designâ conjecture in CCS52.
Generating compounds in uncharted CCS regions
To demonstrate that the QIM model is capable of generating de novo molecules with desired QM properties beyond the QM7-X dataset, we have modified the training procedure to reduce the bias of the model towards treating existing molecular fragments (see âMethodsâ section). This action slightly reduces the model performance in molecular reconstruction (ãRMSDã is only decreased by 0.05 à ), but it enables the generation of scaffolds beyond QM7-X, mostly composed of unseen molecules containing eight heavy atoms. While using this new procedure across multiple targets in the \(\left(\alpha,{E}_{{{{{{{\rm{MBD}}}}}}}}\right)\)-space, we found that the generation of unseen compositions is confined to a region of low coverage defined by high α and large â£EMBD⣠(see Fig. 7). Certainly, the model here exhibits a higher degree of flexibility and it is capable of generating molecules with diverse compositions of eight heavy atoms. In Fig. 7, one can also see that the spread of generated molecules for target T1 and T2 is broader than that of corresponding targets discussed in the preceding section. In term of the relative error ϵ, the samples in target T1 presented an error circa 10% for EMBD and 8% for α, while, for samples in T2, the observed errors were circa 20% for EMBD and 16% for α. Despite the lower prediction accuracy of QM properties, the errors for target T1 are notably comparable to the results shown in Fig. 6 and those obtained by the cG-SchNet model (see Supplementary Note 5 of the SI). Note that, unlike the QIM model, the cG-SchNet model can generate molecules with up to nine heavy atoms without requiring any additional modifications. Moreover, by analyzing the generated molecules using the QIM model, we found that the novelty in heavy atom compositions can vary depending on the specific target location, e.g., T2 samples consider only (C,N)-based molecules while T1 samples are more chemically diverse and cover (C,N,O)-based molecules. For both sets of samples, EMBD and α values are in the range of expected values for molecules with eight heavy atoms, i.e.,âââ0.6 eVâ<âEMBDâ<âââ0.09 eV and 66 a.u.â<âαâ<â160 a.u. Likewise, all generated molecules were well-aligned with chemical and physical intuition, e.g., molecules generated in target T1 exhibit a larger â£EMBD⣠than those in target B (see Fig. 6a) due to the more complex heavy atom composition and compact structure.
Interpolating transition geometries between isomers
Throughout the preceding sections, we have showcased the versatility of an approximate CCS parameterization in a range of contexts. Indeed, it was demonstrated that the latent space of the QIM model exhibits a discernible structure characterized by energetic properties with high A values when undergoing a linear transformation such as PCA, see Fig. 3. This particular behavior assures that a linear interpolation in latent space will generate structures for which the energetic properties smoothly changeâa consequence of the fact that linear transformations do not affect convexity properties up to a sign. Accordingly, we here aim to explore the potential of utilizing this latent representation as an intrinsic coordinate system to generate reasonable guesses for the transition geometries between conformational isomers in QM7-X. To do so, we use the geodesic interpolation algorithm58 for VAEs but, in this instance, the objective is to find curves in the property space that are geodesics with respect to the metric induced by the latent space encoding, see details of implementation in the âMethodsâ section. Following the selection criteria described in âMethodsâ section, we have investigated three different pairs of conformational isomers C4H9NO2 (case I), C4H5NO2 (case II), and C5H5NO (case III) whose structures are achiral and were reconstructed with an RMSD â¤0.2 à , see initial (iâ=â0) and final (iâ=â6) geometries per case in Fig. 8aâc.
Overall, the interpolated geometries displayed in Fig. 8aâc (see yellow balls) effectively demonstrate the capability of the QIM model to produce plausible geometries for the transition path of the studied isomerizations. However, by analyzing sample by sample, we detected a few abrupt/unphysical changes in the geometries for the cases II and III between steps 1 and 2. These artifacts primarily arise from two key factors: (i) the sensitivity of the CM representation to small changes which could produce large mirror-like transformations in the resulting geometry and (ii) the fact that model performance is degraded across unknown sectors of the latent space associated to the transition geometries. The relative energy component ÎEiâ=âEiâââE0 (E is the sum of atomization and MBD dispersion energies) of the corresponding geodesic in property space for the three isomerizations is reported in Fig. 8d. Unexpectedly, with no request other than minimizing the distance in latent space, a barrier-like behavior is retrieved in all cases with energy barriers between 0.6 eV and 1.4 eVâanother evidence of the potential application of our model for generating guesses of transition geometries as well as energy profiles. The obtained results are remarkable since the model was trained exclusively on equilibrium geometries.
To examine how close the estimated energy profile is to the true minimum energy path and the quality of the generated transition geometries, we have used them as initial guesses for a nudged elastic band (NEB) calculation following the ODE method59. Before proceeding with the NEB calculations, H atoms were added to all molecules with OpenBabel and, subsequently, optimizations of initial and final geometries were carried out employing an ML force field trained on PBE0+MBD energies/forces of equilibrium and non-equilibrium molecular conformations contained in QM7-X dataset (model taken from ref. 60). This accurate ML force field was also used to perform the NEB calculations at PBE0+MBD level of theory61. The relative energies calculated following this direct procedure are presented in Fig. 8d and compared to their corresponding geodesic energies, showing how the later were consistently overestimating the energy barrier. As for the updated geometries, the RMSD of their heavy atoms structures with respect to the initially interpolated ones was found to be between 0.14 Ã and 0.35 Ã , see colored balls in Fig. 8aâc.
Discussion
In the present work, we have presented a machine-learning approach, the QIM model, to the inverse property-to-structure design process by learning a parameterization of the chemical compound space (CCS) of small organic molecules that uses QM properties as intrinsic coordinates to navigate the space of molecular structures. This challenging task was successfully accomplished through the development of a proof-of-concept implementation that jointly trains a variational autoencoder and a property encoder. The trained model, utilizing the equilibrium molecules contained in QM7-X dataset, is able to reconstruct the molecules (heavy atom composition and 3D structure) within the test set from their properties with reasonably good accuracy. Moreover, the differentiability of the learned CCS parameterization not only enabled us to identify the most relevant properties in the molecular reconstruction process but also revealed intriguing insights. Our findings indicate that the combination of exchange energy and kinetic energy plays a pivotal role in clustering the molecules in the dataset based on chemical composition and bond types. This information can be further utilized to conditionally generate molecules belonging to specific isomer subspaces within property spaces defined by more pertinent QM properties. In contrast, we observed that intensive properties exhibit limitations as local coordinates for conformational space navigation when compared to extensive onesâa result primarily attributed to the purely geometrical/chemical definition of the Coulomb Matrix (CM) representation. Nevertheless, the QIM model exhibited its best performance when both extensive and intensive properties were considered in the training procedure, highlighting the significance of intensive properties in achieving efficient molecular reconstruction.
The QIM model has also proven its applicability in diverse molecular design tasks by allowing conditional sampling for multi-objective molecular generation and the interpolation of transition pathways of chemical transformations. Unlike the traditional generative paradigm in machine-learned latent spaces, our approach allows to determine and interpret the input conditional distribution since every coordinate in our model is a QM property with a clear physical meaning. The modified training procedure applied to our model provided an initial assessment for using inverse design to generate de novo molecules with desired QM properties, highlighting the promise of rational exploration of CCS, but also uncovering the relative limitation of generative models in extrapolating to molecules larger than the training samples. Future studies in this direction should investigate the use of both a more refined molecular representation to effectively capture both geometric/electronic features and a dataset spanning a much larger chemical space than the one systematically covered by QM7-X dataset62. Our work also suggests that the common latent representation of properties and structures can capture essential aspects of the underlying physics governing structure-property relationships, namely, the existence of transition pathways in chemical space. In this regard, the successful execution of a NEB calculation using the generated transition geometries is a compelling evidence of the meaningful nature of the latent representation as intrinsic coordinates in chemical space. This establishes a connection between our interpolation method and other studies focused on geodesic transition path interpolation63. Indeed, we have demonstrated that aiming to parameterize the CCS through the utilization of QM properties can lead to the development of more interpretable models capable of performing a wide range of chemical tasks.
While the QIM model is multitasking and customizable, it still exhibits certain shortcomings in terms of molecular representation and coverage of chemical space that require further investigation to develop a more robust and generalizable model suitable for ML-based screening pipelines within the field of drug discovery (vide supra). Once these limitations are relaxed, one could devise the development of an active learning-like method that uses the self-consistency metric based on latent space overlap. This would allow going beyond the chemical and property space spanned by the training set51. Another interesting avenue is the development of a QM-informed generative model for molecules with pharmaceutical and biological relevance (e.g., Aquamarine64, MoleculeNet65, and solubility66 datasets) by coupling in-silico (gas-phase and solvated) QM properties with experimental data to exploit unknown property correlations in more realistic scenarios67. Hence, we expect this study to serve as a motivation for future research endeavors focused on advancing the field of generative models by leveraging physical and chemical design rules obtained from structure-property/property-property relationships, fostering the development of enhanced models that offer increased accuracy for practical chemical tasks while preserving the high interpretability of the presented approach.
Methods
Need for a tailored implementation
To achieve the goals defined in the âIntroductionâ section, our idea was to look for a generative model that provides a direct differentiable mapping from the space of QM properties to the space of chemical structures. To this end, we first considered the state-of-the-art models in conditional 3D molecular generation that were recently developed such as the equivariant diffusion model (EDM)43 and the conditional generative neural network (cG-SchNet)42. The methodology put forth in EDM, which has been adapted in multiple landmark works33,44,46,47, is a diffusion-based model that acts on a graph-based representation of molecules. Despite showing good performance in specific applications, the use of a graph-based representation poses a significant limitation for our work. The need to define the number of nodes externally to the model makes any possible property-structure mapping inherently non-differentiable with respect to the number of atoms parameter. On the other hand, cG-SchNet model works in an autoregressive fashion by learning the conditional distribution of the position and type of each atom in a molecule given the position and type of the previously generated ones. The generation process keeps adding atoms in euclidian 3D space until a stop token is encountered. This method has been successfully used51 and allows for the generation of molecules with a varying number of atoms. However, the autoregressive nature together with the numerous adjustments in the final cG-SchNet model prevent us from obtaining a one-shot differentiable mapping. Accordingly, we opted for implementing a more flexible and tailored model that combines a VAE architecture to encode the molecular structures (represented as Coulomb matrices) with a property encoder to encode the associated QM properties, see Fig. 1. The QIM model effectively addresses the discussed challenges in terms of molecular size and differentiability. It enables a direct and differentiable parameterization of the chemical compound space (CCS) based on global properties, and it is also highly explainable.
Variational auto-encoder framework for inverse design
A Variational Auto-Encoder (VAE) is in general comprised of two neural networks called encoder and decoder. The encoder encodes the input in a lower-dimensional latent space representation and the decoder decodes that representation trying to recover the encoded input. The model thus learns a continuous low-dimensional representation capturing the most salient statistical features of the input during the compression process. Whereas, in the decoding process, the decoder learns how to generate samples coherent with the ones in the dataset. Both networks are probabilistic, meaning that they parameterize a probability distribution. This, together with a regularization term forcing the latent distribution to be close to a reference distribution, helps obtain a less sparse internal representation and avoids large portions of the latent space from decoding into invalid outputs. The loss function used to train these models comes from the well-known evidence lower bound (ELBO)68. If x is a sample from a dataset with probability distribution p(x), then the loss reads:
where qÏ(zâ£x) and pθ(xâ£z) are the probability distributions parameterized by the encoder and the decoder, respectively. DKL is the Kullback-Leibler divergence and p(z) is the prior probability distribution of the latent space representation z. Usually, all distributions are chosen to be identically multivariate Gaussian distributed and \(p({{{{{{\bf{z}}}}}}})={{{{{{\mathcal{N}}}}}}}({{{{{{\bf{0}}}}}}},{\mathbb{I}})\). In the training procedure, the encoder learns to encode x in the probability distribution qÏ(zâ£x) from which its latent representation z is sampled using the well-known reparameterization trick68. The divergence term here ensures that the representation stays compact, since the distribution will be as close as possible to the prior distribution \({{{{{{\mathcal{N}}}}}}}({{\mathbf{0}}},{\mathbb{I}})\). Then, z is fed to the decoder which learns to turn this random variable back to x. After training is over, one can sample z from \({{{{{{\mathcal{N}}}}}}}({{{{{{\bf{0}}}}}}},{\mathbb{I}})\), feed it to the decoder, and obtain samples x with a distribution close to p(x).
In our implementation, we start from a dataset Dâ=â(X,âY) where X are the molecular structures and Y are their QM properties. To obtain an inverse mapping f:âYâââX, we modified the standard VAE framework by adding a third network parameterizing the probability distribution pÏ(zâ£y). Moreover, we modify the loss function by including a likelihood term (e.g., \(-\log ({p}_{\psi }({{{{{{\bf{z}}}}}}}| {{{{{{\bf{y}}}}}}}))\)) in the ELBO loss function, which is now expressed as:
where β and Ï are adjustable coefficients introduced as hyperparameters. The training procedure is here similar as described for VAE but, in this case, the variable z sampled from qÏ(zâ£x) will also maximize the likelihood term \({{\mathbb{E}}}_{{q}_{\phi }}[\log ({p}_{\psi }({{{{{{\bf{z}}}}}}}| {{{{{{\bf{y}}}}}}}))]\). This modification ensures the emergence of a common latent representation for both the VAE and the property encoders. After training, one takes samples of y from the property space and gets a value for z as the mean of pÏ(zâ£y) that serves as the maximum-likelihood estimator for the latent space representation of the corresponding molecular structure. Subsequently, a molecule x can be generated, which is expected to possess properties similar to y. A schematic representation of our proposed implementation is shown in Fig. 1.
Molecular representation and structure retrieval
In recent years, there has been a surge of groundbreaking research in generative modeling with a predominant focus on text-based representations like SMILES26,27,28,29,30,31,32,33,34,35,36,37,38,39,40, which encode information on the atom composition and connectivity within a molecule. However, despite its success, this representation limits our ability to study the QM properties of molecular systems. This issue arises because the electronic density, a key quantity in most QM methods, relies on the atom composition and three-dimensional atomic coordinates as input. Accordingly, since we are addressing the inverse mapping between QM properties and 3D structures, we require a different representation that enables us to address both structural and electronic properties. In particular, we need a representation that fulfills the following criteria: firstly, it has to encode the atomic positions and atomic species of a molecule with as many invariances as possible (translational, rotational and permutational). Secondly, it has to show a strong correlation with QM properties. Finally, and most importantly, it has to be invertible, namely that one should be able to retrieve the Cartesian coordinates and atomic species. Neglecting the permutational invariance requirement, we find that a simple yet effective 3D molecular representation is the Coulomb matrix (CM)69,70. CM is an elegant and physically-inspired descriptor, which has shown great success in a wide variety of investigations related to molecular property prediction. CM is invariant to translations and rotations, allowing for the retrieval of atomic positions and species of a molecule up to a chirality transformation. This representation is defined as:
where the indices i and j run over the atoms in the molecule and Zi indicates the atomic species. We have chosen to treat the hydrogen (H) atoms implicitly in our work due to several reasons. Primarily, the associated terms in the Coulomb matrix representation pertaining to H atoms are significantly smaller compared to the other atoms in a given system. As a consequence, even small errors in the reconstruction process can lead to disproportionately large changes in the positions of H atoms. By treating the hydrogen atoms implicitly, we can mitigate the impact of these potential errors and minimize the distortion caused by inaccuracies in their positions. This approach allows us to focus our method on accurately representing the molecular scaffold (i.e., positions of heavy atoms). For what concerns the network input standardization, we adapted the matrix to the maximum possible number of atoms per species found in the dataset, padding the rest of the entries with zeros (see Supplementary Note 1 of the Supplementary Information (SI)).
The retrieval of Cartesian coordinates and chemical composition from CM follows two steps. We first obtain the composition from the diagonal elements by applying the inverse function \(g={(2(\cdot ))}^{\frac{1}{2.4}}\) and rounding the outcomes to the closest integer values. Accordingly, we use the set {Z} of atomic numbers obtained and get the distances dij as \({{{{{{{\bf{d}}}}}}}}_{ij}={\left(\frac{{{{{{{{\bf{M}}}}}}}}_{ij}}{{{{{{{{\bf{Z}}}}}}}}_{i}{{{{{{{\bf{Z}}}}}}}}_{j}}\right)}^{-1}\). Lastly, we apply classical multidimensional scaling (CMDS) to the resulting Euclidian distance matrix (EDM)71,72 to get the Cartesian coordinates of atoms up to a chirality transformation (see Supplementary Note 1 of the SI). To reconstruct H atoms, we use OpenBabel software73 that restores connectivity and bond order based on interatomic distances. After this, the positions of the H atoms are optimized with third-order self-consistent charge density-functional tight binding (DFTB3)74,75 that also accounts for many-body dispersion/van der Waals (vdW) interactions via the MBD approach76,77 (DFTB3+MBD, same level of theory used for optimizing the geometries in QM7-X dataset), while freezing the position of the rest of the atoms in the molecule.
As a final remark concerning the choice of representation, a graph-based representation would be beneficial for generative purposes as it treats molecular fragments robustly and avoids the need for truncation. However, in this work, we are aiming at defining a CCS parameterization based on a set of global QM properties of molecules. Consequently, it becomes crucial to account for the potential variation in the number of atoms as we navigate through the property coordinates. Given the challenges associated with graph-based approaches due to the requirement of specifying the number of nodes in advance, the utilization of (appropriately padded) CM proves to be an efficient and scalable representation for inverse design purposes at this stage of model development.
Reference chemical space
Out of theâ~â4.2M molecules in the QM7-X dataset54, we have selected a subset of 40,â988 equilibrium conformations for molecules containing up to seven heavy atoms, including C, N, and O. For training our model, we used the following data splitting: 28k and 2k molecules for the training and validation sets, respectively, while the remaining molecules were used for testing the model. Moreover, since the QM7-X dataset contains a plethora of physicochemical properties, computed by means of non-empirical hybrid density-functional theory (DFT) and a many-body treatment of vdW/dispersion interactions (i.e., PBE0+MBD)76,78,79,80 in conjunction with the tightly converged numeric atom-centered basis sets81, we have opted to consider 17 global (extensive and intensive) properties during the training process.
Gradient attribution map
Similar to the standard approach employed in machine learning applications that involve image inputs82, we compute the gradient of the Coulomb matrix components \({{{{{{{\bf{CM}}}}}}}}_{i}^{k}\) of a given molecule k with respect to a property pj, yielding the Jacobian matrix \({{{{{{{\bf{J}}}}}}}}_{ij}^{k}=\frac{\partial {{{{{{{\bf{CM}}}}}}}}_{i}^{k}}{\partial {{{{{{{\bf{p}}}}}}}}_{j}}\). By taking the norm of \({{{{{{{\bf{J}}}}}}}}_{ij}^{k}\) over the output dimension of CM and, then, averaging over the subset \({\mathbb{B}}\) of best reconstructed molecules (150 molecules with RMSD â¤0.2 à ), we obtain an attribution map for each property expressed as:
where N is the total number of molecules included in \({\mathbb{B}}\). The idea behind this calculation is to understand which changes in the input produce the largest changes in the output.
Multi-Gaussian fitting for conditional molecular design
Here, we have used a multi-Gaussian fitting procedure to show the capability of the QIM model to target predefined pairs of QM properties, denoted as {m}. In doing so, we first acknowledge that our model has a fixed-dimensional input space for properties, ensuring the attainment of the random sampling associated with typical generative models. The conditional sampling is then achieved by sampling the non-targeted properties, denoted as {n}, while conditioning on {m} values. For this purpose, we need a modeled version of the distribution of the property space spanned by QM7-X that facilitates the determination of the posterior distribution of {n}, given fixed {m} values. Indeed, we use a multivariate Gaussian regression approach to model this property space. Specifically, we constructed a model with 91 multivariate Gaussian distributions \({\left\{{{{{{{\mathcal{N}}}}}}}\left({{{{{{{\boldsymbol{\mu }}}}}}}}_{k},{{{{{{{\boldsymbol{\Sigma }}}}}}}}_{k}\right)\right\}}_{k\in \{1[\ldots ]91\}}\) with μk and Σk as the mean value and the covariance matrix of the Gaussian k. This choice was made following a Bayesian information criterion (see Supplementary Note 4 of the SI).
With fixed target values, denoted as \(\{\bar{{{{{{{\bf{m}}}}}}}}\}\), the Gaussian \(\bar{k}\) from which it is more likely to sample the targeted properties is then selected following the maximum likelihood criterion \(\bar{k}={{{{{{\rm{argmax}}}}}}}k\left({{{{{{{\mathcal{N}}}}}}}}_{{{{{{{\bf{m}}}}}}}}({{{{{{\bf{m}}}}}}}=\bar{{{{{{{\bf{m}}}}}}}}| {{{{{{{\boldsymbol{\mu }}}}}}}}_{k},{{{{{{{\boldsymbol{\Sigma }}}}}}}}_{k})\right)\), where \({{{{{{{\mathcal{N}}}}}}}}_{{{{{{{\bf{m}}}}}}}}\) is the marginal distribution relative to the chosen set of properties {m}. Next, the conditional probability formula for multivariate Gaussian distributions is applied, obtaining the distribution of the non-targeted properties {n} for \({{{{{{\bf{m}}}}}}}=\bar{{{{{{{\bf{m}}}}}}}}\). This can be written as,
with \(\tilde{{{{{{{\boldsymbol{\mu }}}}}}}}\) and \(\tilde{{{{{{{\boldsymbol{\Sigma }}}}}}}}\) defined as,
To proceed with the generation of molecular structures, multiple values for {n} are sampled from \({{{{{{\mathcal{N}}}}}}}({{{{{{\bf{n}}}}}}}| \tilde{{{{{{{\boldsymbol{\mu }}}}}}}}({{{{{{\bf{m}}}}}}}),\tilde{{{{{{{\boldsymbol{\Sigma }}}}}}}}({{{{{{\bf{m}}}}}}}))\) for a fixed values of the targeted properties \({{{{{{\bf{m}}}}}}}=\bar{{{{{{{\bf{m}}}}}}}}\) (the variables with âï¿£â on top correspond to the target values throughout the text). Each sample s will correspond to the total set of (targeted and non-targeted) properties \(\bar{{{{{{{\bf{m}}}}}}}}\bigoplus {{{{{{{\bf{n}}}}}}}}_{s}\) which is then fed to the model that will generate a Coulomb matrix sample CMs. At this stage, the VAE encoder is used to encode the samples and, consequently, to obtain the latent representation encoding zs. This is used to filter the samples based on the self-consistency criterium from defined when analyzing the modelâs performance. Namely we consider \(\Delta {{{{{{\bf{z}}}}}}}=| | {{{{{{\bf{z}}}}}}}-{{{{{{{\bf{z}}}}}}}}_{s}| | \in \left[0,0.4\right]\), where z is the encoding from the property encoder. Note that this specific interval is adjusted for each target, taking into consideration the possibility of it being either too loose or excessively stringent. Cartesian coordinates, atomic species of molecules and H atoms are posteriorly retrieved from the generated Coulomb matrices as described above. After checking for broken molecules, we optimize the position of H atoms and then screen for overlapping atoms by putting a threshold of 10 a.u. to the maximum force component at with DFTB3+MBD level of theory. As a final step, the generated structures undergo geometry optimizations using DFTB3+MBD and, subsequently, their QM properties are calculated at the PBE0+MBD level of theory to enable a comparison with the target values \(\{\bar{{{{{{{\bf{m}}}}}}}}\}\) in the QM7-X dataset.
Masking method for representation
Stemming from the consideration that the padded Coulomb matrix representation presents issues in treating molecular fragments independently, we decided to address the problem during the training procedure with the following modifications:
-
Randomly masking atoms: This is done by taking a random binary vector of length l, where the padded Coulomb Matrix is an lâÃâl matrix, and by taking the outer product of the said vector. The distribution of zeros and ones is a Bernoulli distribution with pâ=â0.5. To maintain the same expected value for the reconstruction loss, we also multiply the mask by a factor 2. The resulting mask \({{{{{{\mathcal{M}}}}}}}\) is then applied to the reconstruction loss function before taking the sum over the Coulomb matrix, namely:
$$los{s}_{{{{{{{\mathcal{M}}}}}}}}={\Sigma }_{i,j}{{{{{{{\mathcal{M}}}}}}}}_{ij}(-\log ({p}_{\theta }({{{{{{{\bf{x}}}}}}}}_{ij}| {{{{{{\bf{z}}}}}}}))),$$(8)where \({{{{{{{\rm{loss}}}}}}}}_{{{{{{{\mathcal{M}}}}}}}}\) is the new reconstruction loss.
-
Leaky masking of padding: this is done in a similar way by setting \({{{{{{{\mathcal{M}}}}}}}}_{ij}\) to 0.1 if xijâ=â0.
The application of this masking significantly enhances the capability of QIM model to treat fragments independently. Additionally, due to the leaky masking of the padding, the model can consider molecules with more heavy atoms than the largest molecule in the dataset. To generate the unseen compounds, we have repeated the initial procedure without using the self-consistency filter and generating samples until a new heavy atom composition is found.
Geodesic interpolation algorithm
While it would be relatively straightforward to perform a linear interpolation in the latent space and generate associated geometries using the learned CCS parameterization, we are also interested in exploring the feasibility of obtaining an energy profile estimation along with this task. To do so, we use the geodesic interpolation algorithm58 for VAEs but, in this instance, the objective is to find curves in the property space that are geodesics with respect to the metric induced by the latent space encoding. This enables us to see how a close-to-linear interpolation in the latent space is reflected in both the property input space and the structure output space. The procedure to get the transition geometries consists in optimizing with a gradient descent algorithm an initial linear interpolation in property space between two structures, by minimizing the distance of the path in latent space.
The geodesic interpolation procedure starts with considering the initial and final conformational isomers, namely x0 and xN. We then linearly interpolate Nâââ1 points between the corresponding coordinates in property space defined by the seventeen properties considered during the training procedure. Applying the property-to-structure relationship defined by our model, we obtain an initial configuration given by the set {y0,ây1,ây2,ââ¦,âyN} of property coordinates, the set {z0,âz1,âz2,ââ¦,âzN} of latent space coordinates and the set {x0,âx1,âx2,ââ¦,âxN} of molecular structures. After this, we use a gradient descent algorithm to minimize the loss function \(L={\Sigma }_{i=1}^{N}| | {{{{{{{\bf{z}}}}}}}}_{i+1}-{{{{{{{\bf{z}}}}}}}}_{i}| {| }^{2}+\epsilon {\Sigma }_{i=1}^{N}| | {{{{{{{\bf{y}}}}}}}}_{i+1}-{{{{{{{\bf{y}}}}}}}}_{i}| {| }^{2}\), where the first term enforces the minimum distance in the latent space, while the second term is a regularization term with ϵâ<â<â1 to enforce continuity in the property space. To find a stationary path, the optimization runs until a convergence criterion on the norm of the gradient is met (i.e.,â<â10â4). Concerning the choice of molecules for this evaluation test, we also took into account that chiral molecules are indistinguishable when using the CM representation. We hence select the initial and final states ensuring that these are not mirror images of one another, as the model would interpret them as identical molecules.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The QM7-X data54 that support the findings of this study are available in ZENODO with the identifier https://doi.org/10.5281/zenodo.4288677. The preprocessed data to make our implementation work are provided in the Github repository Repo-AleFalla83. Source data are provided with this paper. The authors declare that all data supporting the findings of this study are available within the paper and its supplementary information files. Source data are provided with this paper.
Code availability
The code corresponding to our implementation is available in the Github repository Repo-AleFalla83. Moreover, we provide a Jupyter notebook to reproduce the main results.
References
Kulik, H. J. et al. Roadmap on machine learning in electronic structure. Electron. Struct. 4, 023004 (2022).
Sadybekov, A. V. & Katritch, V. Computational approaches streamlining drug discovery. Nature 616, 673â685 (2023).
von Lilienfeld, O., Müller, K. & Tkatchenko, A. Exploring chemical compound space with quantum-based machine learning. Nat. Rev. Chem. 4, 347â358 (2020).
Schütt, K. T., Arbabzadah, F., Chmiela, S., Müller, K. R. & Tkatchenko, A. Quantum-chemical insights from deep tensor neural networks. Nat. Commun. 8, 13890 (2017).
Gao, X., Ramezanghorbani, F., Isayev, O., Smith, J. S. & Roitberg, A. E. Torchani: A free and open source pytorch-based deep learning implementation of the ani neural network potentials. J. Chem. Inf. Model. 60, 3408â3415 (2020).
Bigi, F., Pozdnyakov, S. N. & Ceriotti, M. Wigner kernels: body-ordered equivariant machine learning without a basis. Preprint at https://arxiv.org/abs/2303.04124 (2023).
Batzner, S. et al. E(3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials. Nat. Commun. 13, 2453 (2022).
Steinmann, S. N., Wang, Q. & Seh, Z. W. How machine learning can accelerate electrocatalysis discovery and optimization. Mater. Horiz. 10, 393â406 (2023).
Dreiman, G. H. S., Bictash, M., Fish, P., Griffin, L. D. & Svensson, F. Changing the hts paradigm: Ai-driven iterative screening for hit finding. Slas Discov. 26, 257â262 (2020).
Jansen, J. et al. Biased complement diversity selection for effective exploration of chemical space in hit-finding campaigns. J. Chem. Inf. Model. 59, 1709â1714 (2019).
Paricharak, S. et al. Data-driven approaches used for compound library design, hit triage and bioactivity modeling in high-throughput screening. Brief. Bioinforma. 19, 277â285 (2016).
Riniker, S., Wang, Y., Jenkins, J. & Landrum, G. Using information from historical high-throughput screens to predict active compounds. J. Chem. Inf. Model. 54, 1880â91 (2014).
Ahmed, L. et al. Efficient iterative virtual screening with apache spark and conformal prediction. J. Cheminformatics 10, 8 (2018).
Helal, K. Y., Maciejewski, M., Gregori-Puigjané, E., Glick, M. & Wassermann, A. Public domain hts fingerprints: Design and evaluation of compound bioactivity profiles from pubchemâs bioassay repository. J. Chem. Inf. Model. 56 2, 390â398 (2016).
Beresini, M. et al. Small-molecule library subset screening as an aid for accelerating lead identification. J. Biomol. Screen. 19, 758â770 (2014).
Sanchez-Lengeling, B. & Aspuru-Guzik, A. Inverse molecular design using machine learning: Generative models for matter engineering. Science 361, 360â365 (2018).
Zunger, A. Inverse design in search of materials with target functionalities. Nat. Rev. Chem. 2, 0121 (2018).
Kim, K. et al. Deep-learning-based inverse design model for intelligent discovery of organic molecules. npj Comput. Mater. 4, 67 (2018).
Chen, Y. et al. Deep generative model for drug design from protein target sequence. J. Cheminformatics 15, 38 (2023).
Lee, J. et al. Machine learning-based inverse design methods considering data characteristics and design space size in materials design and manufacturing: a review. Mater. Horiz. 10, 5436â5456 (2023).
Moret, M. et al. Leveraging molecular structure and bioactivity with chemical language models for de novo drug design. Nat. Commun. 14, 114 (2023).
Lin, J. et al. Machine learning accelerates the investigation of targeted mofs: Performance prediction, rational design and intelligent synthesis. Nano Today 49, 101802 (2023).
Noh, J., Gu, G. H., Kim, S. & Jung, Y. Machine-enabled inverse design of inorganic solid materials: Promises and challenges. Chem. Sci. 11, 4871â4881 (2020).
Nigam, A., Pollice, R., Krenn, M., Gomes, Gd. P. & Aspuru-Guzik, A. Beyond generative models: superfast traversal, optimization, novelty, exploration and discovery (stoned) algorithm for molecules using selfies. Chem. Sci. 12, 7079â7090 (2021).
Nigam, A., Pollice, R. & Aspuru-Guzik, A. Parallel tempered genetic algorithm guided by deep neural networks for inverse molecular design. Digital Discov. 1, 390â404 (2022).
Anstine, D. M. & Isayev, O. Generative models as an emerging paradigm in the chemical sciences. J. Am. Chem. Soc. 145, 8736â8750 (2023).
Seo, S., Lim, J. & Kim, W. Y. Molecular generative model via retrosynthetically prepared chemical building block assembly. Adv. Sci. 10, 2206674 (2023).
Dollar, O., Joshi, N., Beck, D. A. C. & Pfaendtner, J. Attention-based generative models for de novo molecular design. Chem. Sci. 12, 8362â8372 (2021).
Gómez-Bombarelli, R. et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 4, 268â276 (2018).
De Cao, N. & Kipf, T. MolGAN: an implicit generative model for small molecular graphs. Preprint at https://arxiv.org/abs/1805.11973 (2018).
Olivecrona, M., Blaschke, T., Engkvist, O. & Chen, H. Molecular de novo design through deep reinforcement learning. J. Cheminformatics 9, 48 (2017).
Kang, S. & Cho, K. Conditional molecular design with deep generative models. J. Chem. Inf. Model. 59, 43â52 (2018).
Corso, G., Stärk, H., Jing, B., Barzilay, R. & Jaakkola, T. S. DiffDock: diffusion steps, twists, and turns for molecular docking. In Proc. 11th International Conference on Learning Representations https://openreview.net/forum?id=kKF8_K-mBbS (2023).
Guimaraes, G. L., Sanchez-Lengeling, B., Outeiral, C., Farias, P. L. C. & Aspuru-Guzik, A. Objective-reinforced generative adversarial networks (organ) for sequence generation models. Preprint at https://arXiv.org/abs/1705.10843 (2018).
Samanta, B. et al. Nevae: A deep generative model for molecular graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, 33, 1110â1117 (2019).
Li, Y., Zhang, L. & ming Liu, Z. Multi-objective de novo drug design with conditional graph generative model. J. Cheminformatics 10, 33 (2018).
Maziarka, Å. et al. Mol-cyclegan: a generative model for molecular optimization. J. Cheminformatics 12, 2 (2019).
Zang, C. & Wang, F. Moflow: an invertible flow model for generating molecular graphs. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 617â626 (2020).
Jin, W., Barzilay, R. & Jaakkola, T. Junction tree variational autoencoder for molecular graph generation. Preprint at https://arXiv.org/abs/1802.04364 (2019).
Grover, A., Zweig, A. & Ermon, S. Graphite: Iterative generative modeling of graphs. Preprint at https://arXiv.org/abs/1803.10459 (2019).
Xue, D. et al. Advances and challenges in deep generative models for de novo molecule generation. WIREs Comput. Mol. Sci. 9, e1395 (2019).
Gebauer, N. W. A., Gastegger, M., Hessmann, S. S. P., Müller, K.-R. & Schütt, K. T. Inverse design of 3d molecular structures with conditional generative neural networks. Nat. Commun. 13, 973 (2022).
Hoogeboom, E., Satorras, V. G., Vignac, C. & Welling, M. Equivariant diffusion for molecule generation in 3d. Preprint at https://arXiv.org/abs/2203.17003 (2022).
Xie, T., Fu, X., Ganea, O.-E., Barzilay, R. & Jaakkola, T. S. Crystal diffusion variational autoencoder for periodic material generation. In International Conference on Learning Representations https://openreview.net/forum?id=03RLpj-tc_ (2022).
Wu, L., Gong, C., Liu, X., Ye, M. & Liu, Q. Diffusion-based molecule generation with informative prior bridges. In Advances in Neural Information Processing Systems https://openreview.net/forum?id=TJUNtiZiTKE (2022).
Guan, J.et al. 3d equivariant diffusion for target-aware molecule generation and affinity prediction. In The Eleventh International Conference on Learning Representations https://openreview.net/forum?id=kJqXEPXMsE0 (2023).
Xu, M. et al. Geodiff: A geometric diffusion model for molecular conformation generation. In International Conference on Learning Representations https://openreview.net/forum?id=PzcvxEMzvQC (2022).
Hiener, D. C. & Hutchison, G. R. Pareto optimization of oligomer polarizability and dipole moment using a genetic algorithm. J. Phys. Chem. A 126, 2750â2760 (2022).
Mannodi-Kanakkithodi, A., Pilania, G., Huan, T. D., Lookman, T. & Ramprasad, R. Machine learning strategy for accelerated design of polymer dielectrics. Sci. Rep. 6, 20952 (2016).
Yuan, Q., Santana-Bonilla, A., Zwijnenburg, M. A. & Jelfs, K. E. Molecular generation targeting desired electronic properties via deep generative models. Nanoscale 12, 6744â6758 (2020).
Westermayr, J., Gilkes, J., Barrett, R. & Maurer, R. J. High-throughput property-driven generative design of functional organic molecules. Nat. Comput. Sci. 3, 139â148 (2023).
Medrano Sandonas, L. et al. "Freedom of designâ in chemical compound space: towards rational in silico design of molecules with targeted quantum-mechanical properties. Chem. Sci. 14, 10702â10717 (2023).
Góger, S., Medrano Sandonas, L., Müller, C. & Tkatchenko, A. Data-driven tailoring of molecular dipole polarizability and frontier orbital energies in chemical compound space. Phys. Chem. Chem. Phys. 25, 22211â22222 (2023).
Hoja, J. et al. QM7-X, a comprehensive dataset of quantum-mechanical properties spanning the chemical space of small organic molecules. Sci. Data 8, 43 (2021).
van der Maaten, L. & Hinton, G. Visualizing data using t-sne. J. Mach. Learn. Res. 9, 2579â2605 (2008).
Rincón, L., Alvarellos, J. E. & Almeida, R. Electron density, exchange-correlation density, and bond characterization from the perspective of the valence-bond theory. II. Numerical results. J. Chem. Phys. 122, 214103 (2005).
Collins, T. C., Euwema, R. N., Stukel, D. J. & Wepfer, G. G. Valence electron density of states of znse obtained from an energy dependent exchange approximation. Int. J. Quantum Chem. 5, 77â85 (1970).
Shao, H., Kumar, A. & Fletcher, P. T. The riemannian geometry of deep generative models. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 428â4288 (2018).
Makri, S., Ortner, C. & Kermode, J. R. A preconditioning scheme for minimum energy path finding methods. J. Chem. Phys. 150, 094109 (2019).
Unke, O. et al. Spookynet: Learning force fields with electronic degrees of freedom and nonlocal effects. Nat. Commun. 12, 7273 (2021).
Schreiner, M., Bhowmik, A., Vegge, T., Jørgensen, P. B. & Winther, O. Neuralnebâneural networks can find reaction paths fast. Mach. Learn.: Sci. Technol. 3, 045022 (2022).
Vignac, C. & Frossard, P. Top-n: Equivariant set and graph generation without exchangeability. In International Conference on Learning Representations https://openreview.net/forum?id=-Gk_IPJWvk (2022).
Zhu, X., Thompson, K. & Martinez, T. Geodesic interpolation for reaction pathways. J. Chem. Phys. 150, 164103 (2019).
Medrano Sandonas, L. et al. Dataset for quantum-mechanical exploration of conformers and solvent effects in large drug-like molecules. Sci. Data 11, 742 (2024).
Wu, Z. et al. Moleculenet: a benchmark for molecular machine learning. Chem. Sci. 9, 513â530 (2018).
Sorkun, M. C., Khetan, A. & Er, S. Aqsoldb, a curated reference set of aqueous solubility and 2d descriptors for a diverse set of compounds. Sci. Data 6, 143 (2019).
Cremer, J., Medrano Sandonas, L., Tkatchenko, A., Clevert, D.-A. & De Fabritiis, G. Equivariant graph neural networks for toxicity prediction. Chem. Res. Toxicol. 36, 1561â1573 (2023).
Kingma, D. P. & Welling, M. Auto-encoding variational bayes. Preprint at https://arXiv.org/abs/1312.6114 (2022).
Rupp, M., Tkatchenko, A., Müller, K.-R. & von Lilienfeld, O. A. Fast and accurate modeling of molecular atomization energies with machine learning. Phys. Rev. Lett. 108, 058301 (2012).
Montavon, G. et al. Machine learning of molecular electronic properties in chemical compound space. New J. Phys. 15, 095003 (2013).
Dokmanic, I., Parhizkar, R., Ranieri, J. & Vetterli, M. Euclidean distance matrices: Essential theory, algorithms, and applications. IEEE Signal Process. Mag. 32, 12â30 (2015).
Hoffmann, M. & Noé, F. Generating valid euclidean distance matrices. Preprint at https://arXiv.org/abs/1910.03131 (2019).
OâBoyle, N. M. et al. Open babel: An open chemical toolbox. J. Cheminformatics 3, 1â14 (2011).
Seifert, G., Porezag, D. & Frauenheim, T. Calculations of molecules, clusters, and solids with a simplified LCAO-DFT-LDA scheme. Int. J. Quantum Chem. 58, 185â192 (1996).
Gaus, M., Cui, Q. & Elstner, M. DFTB3: Extension of the self-consistent-charge density-functional tight-binding method (SCC-DFTB). J. Chem. Theory Comput. 7, 931â948 (2011).
Tkatchenko, A., DiStasio Jr, R. A., Car, R. & Scheffler, M. Accurate and efficient method for many-body van der waals interactions. Phys. Rev. Lett. 108, 236402 (2012).
Stöhr, M., Michelitsch, G. S., Tully, J. C., Reuter, K. & Maurer, R. J. Communication: Charge-population based dispersion interactions for molecules and materials. J. Chem. Phys. 144, 151101 (2016).
Perdew, J. P., Ernzerhof, M. & Burke, K. Rationale for mixing exact exchange with density functional approximations. J. Chem. Phys. 105, 9982â9985 (1996).
Adamo, C. & Barone, V. Toward reliable density functional methods without adjustable parameters: The PBE0 model. J. Chem. Phys. 110, 6158â6170 (1999).
Ambrosetti, A., Reilly, A. M., DiStasio Jr, R. A. & Tkatchenko, A. Long-range correlation energy calculated from coupled atomic response functions. J. Chem. Phys. 140, 18A508 (2014).
Havu, V., Blum, V., Havu, P. & Scheffler, M. Efficient O(N) integration for all-electron electronic structure calculation using numeric basis functions. J. Comput. Phys. 228, 8367â8379 (2009).
Simonyan, K., Vedaldi, A. & Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. Preprint at https://arxiv.org/abs/1312.6034 (2014).
Fallani, A., Medrano Sandonas, L. & Tkatchenko, A. Inverse mapping of quantum properties to structures for chemical space of small organic molecules. ZENODO https://doi.org/10.5281/zenodo.11537048 (2024).
Acknowledgements
We thank Dr. Szabolcs Goger and Mr. Kyunghoon Han for the helpful discussions and comments to improve the manuscript. This research was financially supported by the European Unionâs Horizon 2020 research and innovation program under the Marie SkÅodowska-Curie Innovative Training Network - European Industrial Doctorate grant agreement No 956832, âAdvanced Machine learning for Innovative Drug Discoveryâ (AIDD). The results discussed in this work were obtained using the computational resources provided by the High Performance Computing (HPC) at the University of Luxembourg.
Author information
Authors and Affiliations
Contributions
The work was initially conceived by A.F. and L.M.S., and designed with contributions from A.T. A.F. developed the machine learning code, performed the model training, and analyzed the model performance in diverse applications together with L.M.S. A.T. supervised and revised all stages of the work. All authors discussed the results and contributed to the final manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks the anonymous, reviewer(s) for their contribution to the peer review of this work. AÂ peer review file is available.
Additional information
Publisherâs note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Source data
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the articleâs Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the articleâs Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Fallani, A., Medrano Sandonas, L. & Tkatchenko, A. Inverse mapping of quantum properties to structures for chemical space of small organic molecules. Nat Commun 15, 6061 (2024). https://doi.org/10.1038/s41467-024-50401-1
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41467-024-50401-1