Introduction

The discovery and optimization of chemical compounds can be accelerated thanks to the marked advancements in quantum and statistical methods, their implementation in advanced software, as well as the seemingly never-ending improvement in computer hardware1,2. Unlike the traditional painstaking trial-and-error process which heavily relied on experimental work (known as the Edisonian approach), we can now compute a wide range of accurate physicochemical properties of a given compound using quantum-mechanical (QM) methods with only the respective atomic coordinates and atom types. However, rationally exploring the incredibly vast chemical compound space (CCS, estimated to contain 10^60 molecular structures even for small organic molecules)3 via highly accurate QM methods is still unfeasible due to their prohibitive computational cost. In this regard, machine learning (ML) techniques have revolutionized the field of molecular design by offering a fast yet comparably accurate route for obtaining properties from 3D molecular structures (a.k.a. direct mapping)4,5,6,7,8, positioning them as an indispensable resource in high-throughput screening pipelines9,10,11,12,13,14,15. Although these approximate mappings have undoubtedly enhanced our understanding of the CCS, it is the possibility to invert them that has the potential to truly disrupt and transform the field. Addressing this challenge would allow us to predict 3D molecular structures from their inherent properties, representing a paradigm shift in the design and discovery of chemical compounds with specific functionalities.

The quest to establish an inverse mapping has emerged as a formidable challenge, captivating the interest and dedication of researchers from various disciplines such as organic chemistry, materials science, and molecular docking16,17,18,19,20,21,22,23,24,25. A hint of the existence of such a property-to-structure relationship is given by the demonstrated capability of generative models to selectively generate random structures conditioned on a predefined set of desired properties. Indeed, generative modeling has yielded numerous groundbreaking research outcomes26,27,28, particularly in the field of cheminformatics, in which there is extensive literature delving into diverse generative architectures with research focusing mostly on language models and autoregressive generation employing a text-based representation like SMILES29,30,31,32,33,34,35,36,37,38,39,40. Models dealing with the intricate 3D structure of molecules have been developed recently, leading to promising results across different design tasks33,41,42,43,44,45,46,47. Along the same lines, multi-objective optimization problems have been tackled using generative models and genetic algorithms, e.g., the design of functional organic molecules for optoelectronics applications and of candidate structures for dielectric organic materials48,49,50,51. These compelling examples underscore the relevance of tackling the inverse design problem, yet the potential of parametrizing the CCS using properties as coordinates remains unexplored. Overcoming this challenge would lay the groundwork for an alternative and multifaceted approach to understanding and manipulating the intricate relationship between the properties and structures of organic molecules.

Stemming from the “freedom of design” conjecture in the molecular property space espoused in our previous works52,53, this study aims to investigate three main aspects of generative models for molecular design: (i) the definition of a differentiable parameterization of CCS by leveraging QM data of equilibrium molecules, (ii) the explainability of the resulting property-to-structure mapping, and (iii) the applicability range of the learned inverse mapping. At present, not all three of these points can be addressed using state-of-the-art models in conditional 3D molecular generation42,43. The primary reason for this lies in the inherent design of each model, which hinders the establishment of a differentiable property-to-structure mapping with respect to the molecular size, see “Methods” section for an extended explanation. Hence, we propose a more flexible and tailored approach, the QIM model (which stands for “Quantum Inverse Mapping”), that combines a Variational Auto-Encoder (VAE) architecture to encode the molecular structures (represented as Coulomb matrices) with a property encoder to encode the associated QM properties, see Fig. 1. The joint training of the VAE and property encoders yields a low-dimensional internal representation that is common for both the molecular structures and QM properties. This enables us to combine the property encoder with the decoder component of the VAE and, hence, to successfully approximate the CCS parameterization using QM properties as intrinsic coordinates for navigating the chemical space of small drug-like molecules contained in the QM7-X dataset54. The QIM model accurately predicts the heavy atom composition of molecules, reconstructs their geometries with reasonable accuracy, and meets the criteria of being one-shot and differentiable, making it explainable and flexible. Thanks to the differentiability of our CCS parameterization, we can identify the most relevant properties in the molecular reconstruction process as well as substructures in the molecular property space covered by QM7-X. The conventional multi-objective generative modeling paradigm is also retrieved by conditionally sampling in the input space of properties to accomplish two distinct design tasks with well-defined objectives, producing results comparable to a specialized model such as cG-SchNet42 trained on the same tasks. As a final showcase of the capabilities of learning a CCS parameterization, we have implemented a geodesic search algorithm that uses the latent space representation as internal coordinates, enabling the definition of transition structures and the estimation of rotational energy profiles by learning only from equilibrium geometries. Although our proof-of-concept implementation is currently assessed only for small molecules, our findings suggest that a CCS parameterization based on QM properties is feasible and holds promise for a wide range of applications, including interpretability, generation of de novo molecules, transition path interpolation, and reaction barrier estimation. Thus, this work highlights the remarkable opportunities that arise from defining an inverse mapping approach to advance our understanding of the chemical compound space by connecting QM properties and molecular structures.

Fig. 1: Scheme of our approach to inverse property–structure design.

a For training the QIM model (which stands for “Quantum Inverse Mapping”), we used a Variational Auto-Encoder (VAE) architecture which encodes each molecular structure x (atoms are represented as colored balls) in a distribution qϕ(z∣x), from which the latent z is sampled and passed to a decoder to reconstruct the original structure. On the other hand, a property encoder is used to encode the associated quantum-mechanical (QM) properties y into a distribution pψ(z∣y). These networks are jointly trained using the Evidence Lower Bound (ELBO) loss function that encompasses a molecular reconstruction term \(-\log ({p}_{\theta }({\bf{x}}\,|\,{\bf{z}}))\) and a Kullback-Leibler regularization term DKL(qϕ(z∣x) ∣∣ N(0, 1)), together with the additional loss component \(-\log ({p}_{\psi }({\bf{z}}\,|\,{\bf{y}}))\) for predicting the latent z from the properties y. The result of this joint training is a common latent representation for both properties and structures. b At inference time, one can combine the property encoder with the decoder component of the VAE and, hence, successfully approximate the Chemical Compound Space (CCS) parameterization using QM properties as intrinsic coordinates. The differentiability of our CCS parameterization enables us to identify the most relevant properties in the molecular reconstruction process as well as to perform a series of molecular design tasks.

Results

Property-to-structure mapping

To train the VAE and the property encoder (see “Methods” section), we have considered a subset of the QM7-X dataset comprising 40,988 equilibrium conformations of molecules with up to seven heavy atoms, including C, N, and O. Here, 17 global extensive and intensive QM properties (listed in Table 1) per conformation were selected to define the property encoder. The technical details of the input data and training procedure are specified in the “Methods” section and Supplementary Note 3 of the Supplementary Information (SI), respectively.

Table 1 Quantum mechanical (QM) properties

We now assess the capability of the trained QIM model to establish an approximate parameterization of the chemical compound space (CCS) spanned by QM7-X based on the predefined set of QM properties. In doing so, we use the molecules in the test set together with their corresponding properties, i.e., the property set of each molecule is fed to the model and the generated molecule is then compared with the original one. Figure 2a shows the boxplots of the relative error on Coulomb matrix (CM) reconstruction for an increasing number of properties used to train the model. Here we have analyzed the relative error on the CM instead of the root-mean-squared deviation (RMSD) between structures because the latter is only defined for molecules whose composition is correctly predicted; consequently, it fluctuates more across different numbers of properties, especially when few properties are used and the errors in composition are therefore high. Accordingly, one can see that our model allows us to reconstruct the representation with an average error that converges to ~5% when more than seven properties are considered during the training, after which the distribution becomes increasingly skewed, as indicated by the median value. This relative error is defined as \(\Delta=\frac{|\tilde{{\bf{C}}}-{\bf{C}}|}{|{\bf{C}}|}\times 100\), where C is the original CM and \(\tilde{{\bf{C}}}\) is the reconstructed one (both treated as vectors). While we provide the mean and deviation of the CM reconstruction error, it is important to note that this indicator does not completely reflect the quality of the mapping due to its noisy and nonlinear correlation with the RMSD (see Supplementary Fig. 1 of the SI).
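As a minimal illustration (not part of the original workflow), the relative CM error Δ can be computed as follows, assuming the original and reconstructed Coulomb matrices are available as NumPy arrays:

```python
import numpy as np

def cm_relative_error(cm_true: np.ndarray, cm_pred: np.ndarray) -> float:
    """Relative Coulomb-matrix reconstruction error Delta (in percent).

    Both matrices are flattened and treated as vectors, following
    Delta = |C_rec - C| / |C| * 100 as defined in the text.
    """
    c_true = cm_true.ravel()
    c_pred = cm_pred.ravel()
    return float(np.linalg.norm(c_pred - c_true) / np.linalg.norm(c_true) * 100.0)
```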

Fig. 2: Performance of the QIM model in the molecular reconstruction process.

a Boxplots of the relative error on the reconstruction of the Coulomb matrix (CM) representation of structures from QM properties as a function of the number of molecular properties considered during the training. See Table 1 for the definition of the property symbols. The data considered per box is the full test set (n = 10,988 molecules). For better visibility, the whiskers extend from the 15th to the 85th percentile. b Frequency plot of the root-mean-squared deviation (RMSD) between the geometries retrieved from the reconstructed CMs and the corresponding original representations. The cumulative function of this plot is also presented (orange curve). c Selected examples for different RMSD values comparing the original molecule with the one reconstructed from a predefined set of QM properties. We empirically found that RMSD < 0.7 Å is an adequate threshold to separate molecules with an acceptable reconstruction of the heavy atom structure in terms of topology and orientation. The chosen colors to represent the atoms are gray for carbon, red for oxygen, blue for nitrogen, and white for hydrogen. Source data are provided as a Source Data file.

To gain a better understanding of this finding, we also report the distribution and the cumulative distribution of the RMSD over the test set for the case of the full set of properties (see Fig. 2b). The performance of the QIM model deteriorates when a lower number of properties is considered, as shown in Supplementary Fig. 2 of the SI. Indeed, when considering the full set of properties, the mode of the RMSD distribution is close to 0.5 Å. Overall, despite obtaining a wide RMSD spectrum ([0.05, 1.6] Å), we found that ~70% of the molecules in the test set were reconstructed within an RMSD of 0.7 Å. We empirically found this threshold to be adequate for separating molecules with an acceptable reconstruction of the heavy atom structure in terms of topology and orientation. To further motivate the selection of this threshold, we provide several illustrative examples of both original and reconstructed molecules in Fig. 2c, along with a quantitative analysis of topological errors in Supplementary Fig. 3 of the SI. Concerning the reconstruction of the chemical composition, the model exhibits excellent performance, correctly predicting it for 99.96% of the molecules in the test set. Additionally, we investigated the impact on the performance of the QIM model when extensive and intensive properties are considered separately in the training procedure, see Supplementary Fig. 4 of the SI. The models trained on separated property sets present a higher number of structures with large RMSD compared to the final model, indicating a lower performance in structure reconstruction. In fact, the model trained on intensive properties was only able to reconstruct the right chemical composition for approximately 8000 out of 10,000 molecules of the test set. This decrease from 99.96 to 75.85% in the reconstruction of the chemical composition comes together with an increase of 〈RMSD〉 to 0.7 Å as well as a reduction to 55% in the fraction of test molecules reconstructed with an RMSD below 0.7 Å. In brief, these results confirm the need to use both types of properties in the training of the QIM model to achieve a better reconstruction of molecular structures.
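The RMSD values discussed above compare the retrieved heavy-atom geometry with the original one. As a reference for how such a comparison can be made (the exact alignment protocol is not detailed in this section), a common choice is to superimpose the two geometries with the Kabsch algorithm before computing the deviation; a minimal sketch assuming identical atom ordering:

```python
import numpy as np

def kabsch_rmsd(ref: np.ndarray, mol: np.ndarray) -> float:
    """RMSD between two (N, 3) coordinate sets after optimal superposition.

    Assumes both geometries contain the same atoms in the same order;
    `mol` is rotated onto `ref` with the Kabsch algorithm.
    """
    a = ref - ref.mean(axis=0)                    # center both geometries
    b = mol - mol.mean(axis=0)
    h = b.T @ a                                   # 3x3 covariance matrix
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(u) * np.linalg.det(vt))    # avoid improper rotations
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T             # optimal rotation matrix
    b_aligned = b @ rot.T
    return float(np.sqrt(np.mean(np.sum((a - b_aligned) ** 2, axis=1))))
```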

Interpretability and performance of our model

Next, the established CCS parameterization is analyzed by implementing a gradient attribution map, which enables us to assess the individual contributions of each property to the output structures. The calculation of an attribution map A for each property is explained in the “Methods” section. Figure 3 shows the A values per property, normalized by the maximum value and sorted in decreasing order for the full representation. Overall, we have found that extensive properties are more informative than intensive ones for the task of molecular reconstruction. This can be explained by the fact that these extensive properties depend on crucial molecular features that are also captured in a 3D representation like the CM, e.g., number of atoms, number of electrons (related to the chemical composition), and geometry. Moreover, when comparing CMs, even a difference of a single atom can significantly increase the loss, leading to a larger sensitivity of the model to variations in system size and composition. Thus, the A values for the components of the total energy and the molecular polarizability are higher than those for the molecular orbital energies and dipole moment; in particular, EXX and EKIN present the largest A values. Interestingly, this finding is in agreement with the identification of molecular clusters in the two-dimensional principal component analysis (PCA) of the latent space of the VAE encoder, i.e., the higher the A value, the more correlated the property with respect to the PCA (see insets in Fig. 3). A similar observation holds when using a more complex dimensionality reduction technique such as t-distributed stochastic neighbor embedding (t-SNE)55, see Supplementary Fig. 6 of the SI. The fact that a linear method such as PCA shows good organization in terms of energy-related properties is a non-trivial finding that will later be exploited in the interpolation of transition structures.

Fig. 3: Gradient attribution map for molecular properties.

Two-dimensional principal component analysis (PCA) of the latent space of the VAE (Variational Auto-Encoder) encoder, colored respectively with the values of (a) EXX, (b) α, and (c) \({E}_{{\rm{LUMO}}}^{1}\). For the color scheme, we rescaled the property values and applied a hyperbolic tangent function to avoid obscuration effects from outliers. d The contribution of each property to the output structure is evaluated by defining the variable A, which considers the partial derivative of the reconstructed Coulomb matrix (CM) with respect to a given property (see Eq. (4)). Here, one can see that extensive properties are more relevant than intensive ones for molecular reconstruction. See Table 1 for the definition of the property symbols. Source data are provided as a Source Data file.

Furthermore, we have studied how the revealed hierarchy of QM properties organizes the QM7-X CCS. Starting from the properties with the highest A values, Fig. 4a shows the two-dimensional projection of the QM7-X molecular property space defined by EKIN and EXX, i.e., the \(({E}_{{\rm{KIN}}},{E}_{{\rm{XX}}})\)-space. Despite these two properties having a high inverse correlation (Pearson coefficient = −0.92), it is noticeable how the molecules in the dataset organize into linear-shaped clusters containing molecules with the same heavy atom composition. In particular, upon examining those clusters, it becomes evident that EKIN is mostly influenced by the heavy atom composition within a molecule. On the other hand, EXX is highly sensitive to the number of H atoms, thereby indicating a dependence on the particular bond types that are present. This is further analyzed in Fig. 4b, where we provide a closer examination of one of the highly populated clusters (see yellow ellipse in the plot) and highlight the molecules with respect to their number of H atoms. Here, we uncovered a finer local structure with an almost perfect inverse correlation (Pearson coefficient = −0.99) as well as very compact clusters formed by isomers. Overall, this behavior can be understood by considering the qualitative aspects of EKIN and EXX: the dominant contribution to EKIN stems from the inner-shell electrons (a trivial consequence of the virial theorem), whereas the primary influence on EXX arises from the valence electrons. Also, exchange-related quantities have been found to play a significant role in characterizing bonds56,57, explaining their sensitivity to the number of H atoms in a molecule. These considerations account for the efficacy of these two QM properties in identifying clustered molecular isomer subspaces and accurately predicting heavy atom composition.

Fig. 4: Two-dimensional (2D) projections of the high-dimensional QM7-X molecular property space.

We have studied how the revealed hierarchy of QM properties (see Fig. 3) organizes the QM7-X chemical compound space. a 2D property space defined by the two properties with the highest A values, i.e., the \(({E}_{{\rm{XX}}},{E}_{{\rm{KIN}}})\)-space. Here, the molecules in the dataset organize into linear-shaped clusters containing molecules with the same heavy atom composition. The color code used here is obtained by assigning an integer number to each heavy atom composition present in the dataset. b A closer examination of one of the highly populated clusters in the \(({E}_{{\rm{XX}}},{E}_{{\rm{KIN}}})\)-space (see yellow ellipse in panel (a)). Molecules are highlighted with respect to their number of H atoms. We then picked a molecular isomer subspace (see blue ellipse in (b)) and plotted how different pairs of QM properties with (c) high and (d) low A values distribute the molecular structures. This analysis demonstrated that high A values identify properties that are better local coordinates for exploring this subspace, i.e., these properties can be used to distinguish molecular structures within a specific molecular isomer subspace. See Table 1 for the definition of the property symbols. Source data are provided as a Source Data file.

Next, we pick a molecular isomer subspace (see blue ellipse in Fig. 4b) and show how different pairs of QM properties with high and low A values distribute the molecular structures (see Fig. 4c, d). As expected, high A values identify properties that are better local coordinates for exploring this subspace (vide supra). Indeed, these properties present relatively smaller changes in their values across related structures in comparison to properties with low A values, an example of their efficiency in identifying molecular structures within a specific molecular isomer subspace while effectively distinguishing them from other structures spanning the entire property spectrum. Notice that the same behavior can be found throughout the entire set of QM7-X equilibrium molecules. These findings already provide compelling evidence for the potential of our proof-of-concept implementation in furthering our understanding of the molecular property space and unraveling the intricate relationship between QM properties and molecular structures.

The successful property-based parameterization of the CCS achieved by the model is grounded in the remarkable similarity attained between the latent representations of the CMs and of the respective properties. A visualization of this observation is given in Fig. 5a, which showcases the overlap between the PCA projections of both latent representations. A similar result was obtained using the t-SNE technique, see Supplementary Fig. 7 of the SI. This verifies the initial assumption we stated about the joint training procedure, see “Methods” section. Moreover, we have examined how the differences in the latent space representation correlate with the quality of the reconstructed molecules. This analysis aims at obtaining a self-consistent method for error estimation. To this end, if z is the latent representation from the property encoder, we can take the reconstructed representations and encode them again with the VAE encoder, producing a new latent representation \(\tilde{{\bf{z}}}\). Then, we look at the correlation of the quantity \(||\Delta {\bf{z}}||=||{\bf{z}}-\tilde{{\bf{z}}}||\) with the RMSD between the original structures and those reconstructed from properties for the molecules in the test set. In Fig. 5b, one can see that there seems to be a bulk (approximately linear) correlation, together with numerous outliers. To further investigate this behavior, we also look at the boxplots of the RMSD for varying values of ∣∣Δz∣∣, splitting the test set into subsets to obtain adequate statistics for each point. The results are plotted in Fig. 5c, which shows a nonlinear but monotonic relationship between these quantities, with a minimum of the RMSD in the region of low ∣∣Δz∣∣ values (∈ [0, 0.4]). Since, for reconstructing the actual molecular structure, we are mostly interested in obtaining a low RMSD, in the following sections we filter out the generated structures for which ∣∣Δz∣∣ falls outside this predetermined interval. This screening approach based on error estimation will be employed in the next application tests to enhance the quality of generated structures with a targeted set of QM properties.
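A minimal sketch of this self-consistency screening, assuming `property_encoder`, `vae_encoder`, and `decoder` are the trained modules (hypothetical names) and that each encoder returns the mean of its latent distribution:

```python
import torch

def screen_by_latent_consistency(props, property_encoder, vae_encoder, decoder,
                                 threshold: float = 0.4):
    """Self-consistent error estimate used to screen generated structures.

    z       : latent mean predicted from the QM properties
    x_rec   : Coulomb matrix decoded from z
    z_tilde : latent mean obtained by re-encoding the reconstruction
    A structure is kept only if ||z - z_tilde|| lies within [0, threshold].
    """
    with torch.no_grad():
        z = property_encoder(props)          # mean of p_psi(z | y)
        x_rec = decoder(z)                   # reconstructed Coulomb matrix
        z_tilde = vae_encoder(x_rec)         # mean of q_phi(z | x_rec)
        delta = torch.linalg.vector_norm(z - z_tilde, dim=-1)
    return x_rec, delta, delta <= threshold
```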

Fig. 5: Similarity analysis between the latent space representation from the VAE and property encoders.

a Overlap between the two-dimensional principal component analysis (PCA) of both latent space representations. b Correlation plot of the root-mean-squared deviation (RMSD) between the original and the reconstructed structures versus the latent space difference \(||\Delta {\bf{z}}||=||{\bf{z}}-\tilde{{\bf{z}}}||\) for molecules in the test set. Here, z is the latent representation from the property encoder, while \(\tilde{{\bf{z}}}\) is a new latent representation produced by taking the reconstructed representations and encoding them again with the VAE encoder. c Boxplots of the RMSD corresponding to different values of the latent space difference ∣∣Δz∣∣. The data points considered per boxplot are all the ones in the test set with RMSD in an interval of ±0.1 Å around the indicated label. For better visibility, the whiskers extend from the 15th to the 85th percentile. Subsequently, since we are looking for high-quality generated molecules with low RMSD, we define the interval ∣∣Δz∣∣ ∈ [0, 0.4] to filter out molecules during the molecular generation process. Source data are provided as a Source Data file.

Acting as a conditional generative model

The learned CCS parameterization based on QM properties can provide a versatile solution for multi-objective targeted molecular generation. Here, we show the capability of the QIM model to target predefined pairs of QM properties by using a multivariate Gaussian regression approach to model the distribution of the QM7-X property space (see “Methods” section). Based on this procedure, we first navigate through the \((\alpha,{E}_{{\rm{MBD}}})\)-space, which encompasses two relevant properties for molecular reconstruction (high attribution value A), see Fig. 6a. The moderate degree of correlation between these properties of QM7-X molecules (i.e., Pearson coefficient = 0.60, gray dots) also grants us access to the intrinsic “freedom of design” in the CCS when searching for molecules with desired functionality52. Subsequently, we transition to a weakly correlated 2D property space, the \((\alpha,\zeta)\)-space (i.e., Pearson coefficient = 0.44, gray dots), by replacing EMBD with the dipole moment ζ (low A), see Fig. 6b. In the (EMBD, α) property space, fifteen samples (or 15-molecule set) were generated per targeted pair, while ten samples (or 10-molecule set) were generated per target in the (ζ, α) one. The targets are represented as colored crosses with capital letters. The targeted α values were selected to generate medium- to large-sized molecules, since α is an extensive property that mostly depends on the elemental composition as well as the connectivity of atoms within the molecule. In contrast, the targeted EMBD and ζ values were chosen to tune the topological effects (i.e., extended vs. compact, packed/globular vs. void space) and elemental composition (homogeneous vs. heterogeneous) in the generated molecules, respectively. The top 5 generated molecules (i.e., five optimal molecules) per target have been highlighted with colored circles in both 2D property spaces. In Fig. 6, we also show the generated structure of a selected molecule per target. Overall, these results demonstrate that our model is capable of generating diverse molecular structures with similar scaffolds to those of QM7-X molecules by using only QM properties as input.

Fig. 6: Generation of molecules with a targeted pair of quantum-mechanical (QM) properties.

Here, we show the distribution of the generated molecules per targeted pair of QM properties. The targets A, B, C, D, E, and F are identified by colored crosses, and the distribution of the generated samples per target is represented as a density fit, with a darker shade indicating higher sample density. This is done in the 2D property space defined by (a) molecular polarizability and MBD energy (i.e., the \((\alpha,{E}_{{\rm{MBD}}})\)-space) and (b) molecular polarizability and total dipole moment (i.e., the \((\alpha,\zeta)\)-space). To reconstruct these molecules from their QM properties, we have followed the molecular generation procedure explained in the “Methods” section. The top 5-molecule set (i.e., five optimal molecules) per target has been highlighted with colored circles in each panel. For reference, the values corresponding to QM7-X molecules are shown in the background (gray dots). Panels (a) and (b) contain a single structure per target that represents the top 5-molecule set (the other structures are depicted in Supplementary Fig. 8 of the SI). These results demonstrate that the QIM model (which stands for “Quantum Inverse Mapping”) displays a better performance for molecular reconstruction when only targeting extensive properties (see the spread of colored circles with respect to the targeted values), which can be primarily attributed to the purely geometrical/chemical definition of the Coulomb matrix representation. The chosen colors to represent the atoms are gray for carbon, red for oxygen, blue for nitrogen, and white for hydrogen. Source data are provided as a Source Data file.

Given the discrete nature of the CCS and, particularly, the reduced coverage of the employed QM7-X dataset, it is clear that the inverse mapping generates molecules that can deviate from the respective targeted values. The performance per target can be quantitatively measured by defining \(\epsilon=\frac{|{y}_{{\rm{calc}}}-{y}_{{\rm{t}}}|}{\Delta y}\times 100\), with ycalc and yt as the calculated and target values of the property y, and Δy as the extent of the property spectrum across the entire dataset. Accordingly, for each of the 15-molecule sets depicted in Fig. 6a, the ϵ value is circa 9.2% for EMBD and circa 3.5% for α. These values are reduced to circa 3.2% for EMBD and circa 1.3% for α when only the top 5-molecule set is considered. On the other hand, the molecules generated in the \((\alpha,\zeta)\)-space displayed ϵ ≈ 2.8% for the prediction of α, but the ϵ corresponding to ζ was circa 5.9% (see top 5-molecule set per target in Fig. 6b). Although it does not allow the establishment of a differentiable property-to-structure mapping with respect to molecular size, we have used the state-of-the-art cG-SchNet architecture42 to rigorously examine our results in the two multi-property targeted molecular design tasks. Although the cG-SchNet models achieve better performances, the QIM model remains comparable, as the difference in the values of ϵ is not substantial, ranging from 0.6 to 5.3% in favor of cG-SchNet (see Supplementary Note 5 and Supplementary Tables 1–4 of the SI). The only exception was the prediction of EMBD for the top 15 molecules in the \((\alpha,{E}_{{\rm{MBD}}})\)-space, where the difference in ϵ values was ≈7.3%. This result can be related to the explicit treatment of H atoms during the training of the cG-SchNet model, while in our model H atoms are added after the molecule is generated (see “Methods” section). Notice that, in terms of computational efficiency, our model achieves convergence in three hours, which is faster than the four days required to train the cG-SchNet model on the same hardware.
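For concreteness, a small helper for this per-target error, interpreting Δy as the max-min range of the property over the dataset (one plausible reading of "extent of the property spectrum"):

```python
import numpy as np

def target_relative_error(y_calc: float, y_target: float, y_dataset) -> float:
    """Relative deviation epsilon (in percent) of a generated molecule's property
    from its targeted value, normalized by the property range of the dataset."""
    span = float(np.max(y_dataset) - np.min(y_dataset))   # assumed meaning of Delta y
    return abs(y_calc - y_target) / span * 100.0
```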

Furthermore, we have found that the spread of the generated molecules (i.e., the flexibility of the model in molecular generation) can be rationalized by analyzing the relative variance of the conditional multi-Gaussian distributions of the non-targeted QM properties. In fact, the molecules generated with the highest precision (i.e., targets A and D) are characterized by lower negative log-likelihood values and small relative variances in the extensive non-targeted properties (see Supplementary Fig. 4 of the SI). This outcome could be a key factor for controlling the degree of flexibility when designing molecules in targeted regions of a given property space, since a larger variance in extensive properties may result in a more diverse set of molecules, as these properties are the most relevant in defining the molecular structure (see Fig. 3). In this regard, taking a closer look at the generated molecules, one can see that the molecules in the \(({E}_{{\rm{MBD}}},\alpha)\)-space display a greater diversity in heavy atom composition compared to the ones generated in the \((\zeta,\alpha)\)-space, see Supplementary Fig. 8 of the SI. Importantly, the presence of diverse chemical compositions and structures within the larger sets for each target is further evidence that our approach does not follow a conventional chemical exploration based on fixed molecular scaffolds. Instead, our inverse design procedure can fully utilize the diversity of property-to-structure relations as embodied in the recently proposed “freedom of design” conjecture in CCS52.

Generating compounds in uncharted CCS regions

To demonstrate that the QIM model is capable of generating de novo molecules with desired QM properties beyond the QM7-X dataset, we have modified the training procedure to reduce the bias of the model towards treating existing molecular fragments (see “Methods” section). This action slightly reduces the model performance in molecular reconstruction (〈RMSD〉 changes by only 0.05 Å), but it enables the generation of scaffolds beyond QM7-X, mostly composed of unseen molecules containing eight heavy atoms. While using this new procedure across multiple targets in the \((\alpha,{E}_{{\rm{MBD}}})\)-space, we found that the generation of unseen compositions is confined to a region of low coverage defined by high α and large ∣EMBD∣ (see Fig. 7). Certainly, the model here exhibits a higher degree of flexibility and is capable of generating molecules with diverse compositions of eight heavy atoms. In Fig. 7, one can also see that the spread of the generated molecules for targets T1 and T2 is broader than that of the corresponding targets discussed in the preceding section. In terms of the relative error ϵ, the samples for target T1 presented errors of circa 10% for EMBD and 8% for α, while, for the samples in T2, the observed errors were circa 20% for EMBD and 16% for α. Despite the lower prediction accuracy of QM properties, the errors for target T1 are notably comparable to the results shown in Fig. 6 and those obtained by the cG-SchNet model (see Supplementary Note 5 of the SI). Note that, unlike the QIM model, the cG-SchNet model can generate molecules with up to nine heavy atoms without requiring any additional modifications. Moreover, by analyzing the molecules generated using the QIM model, we found that the novelty in heavy atom compositions can vary depending on the specific target location, e.g., T2 samples comprise only (C,N)-based molecules while T1 samples are more chemically diverse and cover (C,N,O)-based molecules. For both sets of samples, the EMBD and α values are in the range expected for molecules with eight heavy atoms, i.e., −0.6 eV < EMBD < −0.09 eV and 66 a.u. < α < 160 a.u. Likewise, all generated molecules were well-aligned with chemical and physical intuition, e.g., the molecules generated for target T1 exhibit a larger ∣EMBD∣ than those for target B (see Fig. 6a) due to their more complex heavy atom composition and compact structure.

Fig. 7: Targeting unseen compounds with a given pair of molecular properties.

Evaluation of the impact of the modified molecular generation procedure on multiple targets across the 2D property space defined by molecular polarizability and MBD energy, i.e., the \((\alpha,{E}_{{\rm{MBD}}})\)-space. To do so, we have retrained the QIM model (which stands for “Quantum Inverse Mapping”) with a bias towards treating molecular fragments more independently (see more details in the “Methods” section). In the graph, we plot the targets T1 (blue star) and T2 (red star) and their associated molecules with eight-heavy-atom compositions that were generated using the new model (colored circles). For reference, the values corresponding to QM7-X molecules are shown in the background (gray dots). Also depicted are selected novel molecular structures per target. The chosen colors to represent the atoms are gray for carbon, red for oxygen, blue for nitrogen, and white for hydrogen. Source data are provided as a Source Data file.

Interpolating transition geometries between isomers

Throughout the preceding sections, we have showcased the versatility of an approximate CCS parameterization in a range of contexts. Indeed, it was demonstrated that the latent space of the QIM model exhibits a discernible structure characterized by energetic properties with high A values when undergoing a linear transformation such as PCA, see Fig. 3. This particular behavior ensures that a linear interpolation in latent space will generate structures for which the energetic properties change smoothly, a consequence of the fact that linear transformations do not affect convexity properties up to a sign. Accordingly, we here aim to explore the potential of utilizing this latent representation as an intrinsic coordinate system to generate reasonable guesses for the transition geometries between conformational isomers in QM7-X. To do so, we use the geodesic interpolation algorithm58 for VAEs but, in this instance, the objective is to find curves in the property space that are geodesics with respect to the metric induced by the latent space encoding, see the details of the implementation in the “Methods” section. Following the selection criteria described in the “Methods” section, we have investigated three different pairs of conformational isomers, C4H9NO2 (case I), C4H5NO2 (case II), and C5H5NO (case III), whose structures are achiral and were reconstructed with an RMSD ≤ 0.2 Å, see the initial (i = 0) and final (i = 6) geometries per case in Fig. 8a–c.

Fig. 8: Predicting transition path between conformational isomers.

By using the geodesic interpolation algorithm for VAEs (Variational Auto-Encoders), we were able to generate interpolated geometries for three different pairs of conformational isomers (a) C4H9NO2 (case I), (b) C4H5NO2 (case II), and (c) C5H5NO (case III). The molecular structures generated by the QIM model (which stands for “Quantum Inverse Mapping”) are represented by yellow balls. For comparison, we show the new transition structures (solid colored balls) obtained by running machine learning-based nudged elastic band (ML-NEB) calculations using the previous interpolated geometries as initial guesses. The chosen colors to represent the atoms are cyan for carbon, red for oxygen, blue for nitrogen, and white for hydrogen. d Variation of the relative energy component ΔEi = Ei − E0 (E is the sum of atomization and many-body dispersion energies) as a function of the interpolation step i for the three isomerizations shown in (a–c). We here show the results obtained for the corresponding geodesic in property space (blue curve) and the updated transition structures computed using the ML-NEB method (red curve). These findings are compelling evidence that the latent representation can be used as an intrinsic coordinate system to generate reasonable guesses for the transition geometries between conformational isomers in QM7-X. Source data are provided as a Source Data file.

Overall, the interpolated geometries displayed in Fig. 8a–c (see yellow balls) effectively demonstrate the capability of the QIM model to produce plausible geometries for the transition path of the studied isomerizations. However, by analyzing the samples one by one, we detected a few abrupt/unphysical changes in the geometries for cases II and III between steps 1 and 2. These artifacts primarily arise from two key factors: (i) the sensitivity of the CM representation to small changes, which can produce large mirror-like transformations in the resulting geometry, and (ii) the fact that the model performance is degraded across unknown sectors of the latent space associated with the transition geometries. The relative energy component ΔEi = Ei − E0 (E is the sum of atomization and MBD dispersion energies) along the corresponding geodesic in property space for the three isomerizations is reported in Fig. 8d. Unexpectedly, with no constraint other than minimizing the distance in latent space, a barrier-like behavior is retrieved in all cases, with energy barriers between 0.6 eV and 1.4 eV, which is further evidence of the potential application of our model for generating guesses of transition geometries as well as energy profiles. The obtained results are remarkable since the model was trained exclusively on equilibrium geometries.
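As a simplified illustration of the underlying idea (the actual algorithm searches for a geodesic with respect to the metric induced by the latent encoding, see “Methods”, rather than a straight line), a plain linear interpolation between the latent codes of two isomers, decoded step by step, could look like the following sketch with hypothetical module names:

```python
import torch

def interpolate_latent(props_start, props_end, property_encoder, decoder, n_steps=7):
    """Decode structures along a straight line in latent space.

    Simplified stand-in for the geodesic search described in the text: the
    latter optimizes the path under the latent-space-induced metric instead
    of taking a straight line between the two end points.
    """
    with torch.no_grad():
        z0 = property_encoder(props_start)   # latent code of the initial isomer
        z1 = property_encoder(props_end)     # latent code of the final isomer
        frames = []
        for i in range(n_steps):
            t = i / (n_steps - 1)
            z_t = (1.0 - t) * z0 + t * z1    # linear path in latent space
            frames.append(decoder(z_t))      # reconstructed Coulomb matrix at step i
    return frames
```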

To examine how close the estimated energy profile is to the true minimum energy path and to assess the quality of the generated transition geometries, we have used them as initial guesses for a nudged elastic band (NEB) calculation following the ODE method59. Before proceeding with the NEB calculations, H atoms were added to all molecules with OpenBabel and, subsequently, optimizations of the initial and final geometries were carried out employing an ML force field trained on PBE0+MBD energies/forces of equilibrium and non-equilibrium molecular conformations contained in the QM7-X dataset (model taken from ref. 60). This accurate ML force field was also used to perform the NEB calculations at the PBE0+MBD level of theory61. The relative energies calculated following this direct procedure are presented in Fig. 8d and compared to their corresponding geodesic energies, showing that the latter consistently overestimated the energy barrier. As for the updated geometries, the RMSD of their heavy-atom structures with respect to the initially interpolated ones was found to be between 0.14 Å and 0.35 Å, see colored balls in Fig. 8a–c.

Discussion

In this work, we have presented a machine-learning approach, the QIM model, to the inverse property-to-structure design process by learning a parameterization of the chemical compound space (CCS) of small organic molecules that uses QM properties as intrinsic coordinates to navigate the space of molecular structures. This challenging task was successfully accomplished through the development of a proof-of-concept implementation that jointly trains a variational auto-encoder and a property encoder. The trained model, utilizing the equilibrium molecules contained in the QM7-X dataset, is able to reconstruct the molecules (heavy atom composition and 3D structure) within the test set from their properties with reasonably good accuracy. Moreover, the differentiability of the learned CCS parameterization not only enabled us to identify the most relevant properties in the molecular reconstruction process but also revealed intriguing insights. Our findings indicate that the combination of exchange energy and kinetic energy plays a pivotal role in clustering the molecules in the dataset based on chemical composition and bond types. This information can be further utilized to conditionally generate molecules belonging to specific isomer subspaces within property spaces defined by the more pertinent QM properties. In contrast, we observed that intensive properties exhibit limitations as local coordinates for conformational space navigation when compared to extensive ones, a result primarily attributed to the purely geometrical/chemical definition of the Coulomb matrix (CM) representation. Nevertheless, the QIM model exhibited its best performance when both extensive and intensive properties were considered in the training procedure, highlighting the significance of intensive properties in achieving efficient molecular reconstruction.

The QIM model has also proven its applicability in diverse molecular design tasks by allowing conditional sampling for multi-objective molecular generation and the interpolation of transition pathways of chemical transformations. Unlike the traditional generative paradigm in machine-learned latent spaces, our approach allows us to determine and interpret the input conditional distribution, since every coordinate in our model is a QM property with a clear physical meaning. The modified training procedure applied to our model provided an initial assessment of using inverse design to generate de novo molecules with desired QM properties, highlighting the promise of rational exploration of the CCS, but also uncovering the relative limitation of generative models in extrapolating to molecules larger than the training samples. Future studies in this direction should investigate the use of both a more refined molecular representation that effectively captures geometric and electronic features and a dataset spanning a much larger chemical space than the one systematically covered by the QM7-X dataset62. Our work also suggests that the common latent representation of properties and structures can capture essential aspects of the underlying physics governing structure-property relationships, namely, the existence of transition pathways in chemical space. In this regard, the successful execution of an NEB calculation using the generated transition geometries is compelling evidence of the meaningful nature of the latent representation as intrinsic coordinates in chemical space. This establishes a connection between our interpolation method and other studies focused on geodesic transition path interpolation63. Indeed, we have demonstrated that aiming to parameterize the CCS through the utilization of QM properties can lead to the development of more interpretable models capable of performing a wide range of chemical tasks.

While the QIM model is multitasking and customizable, it still exhibits certain shortcomings in terms of molecular representation and coverage of chemical space that require further investigation in order to develop a more robust and generalizable model suitable for ML-based screening pipelines within the field of drug discovery (vide supra). Once these limitations are relaxed, one could envision the development of an active-learning-like method that uses the self-consistency metric based on latent space overlap, which would allow going beyond the chemical and property space spanned by the training set51. Another interesting avenue is the development of a QM-informed generative model for molecules with pharmaceutical and biological relevance (e.g., the Aquamarine64, MoleculeNet65, and solubility66 datasets) by coupling in-silico (gas-phase and solvated) QM properties with experimental data to exploit unknown property correlations in more realistic scenarios67. Hence, we expect this study to serve as a motivation for future research endeavors focused on advancing the field of generative models by leveraging physical and chemical design rules obtained from structure-property/property-property relationships, fostering the development of enhanced models that offer increased accuracy for practical chemical tasks while preserving the high interpretability of the presented approach.

Methods

Need for a tailored implementation

To achieve the goals defined in the “Introduction” section, our idea was to look for a generative model that provides a direct differentiable mapping from the space of QM properties to the space of chemical structures. To this end, we first considered the state-of-the-art models in conditional 3D molecular generation that were recently developed, such as the equivariant diffusion model (EDM)43 and the conditional generative neural network cG-SchNet42. The methodology put forth in EDM, which has been adapted in multiple landmark works33,44,46,47, is a diffusion-based model that acts on a graph-based representation of molecules. Despite showing good performance in specific applications, the use of a graph-based representation poses a significant limitation for our work: the need to define the number of nodes externally to the model makes any possible property-structure mapping inherently non-differentiable with respect to the number of atoms. On the other hand, the cG-SchNet model works in an autoregressive fashion by learning the conditional distribution of the position and type of each atom in a molecule given the positions and types of the previously generated ones. The generation process keeps adding atoms in Euclidean 3D space until a stop token is encountered. This method has been successfully used51 and allows for the generation of molecules with a varying number of atoms. However, the autoregressive nature together with the numerous adjustments in the final cG-SchNet model prevent us from obtaining a one-shot differentiable mapping. Accordingly, we opted for implementing a more flexible and tailored model that combines a VAE architecture to encode the molecular structures (represented as Coulomb matrices) with a property encoder to encode the associated QM properties, see Fig. 1. The QIM model effectively addresses the discussed challenges in terms of molecular size and differentiability. It enables a direct and differentiable parameterization of the chemical compound space (CCS) based on global properties, and it is also highly explainable.

Variational auto-encoder framework for inverse design

A Variational Auto-Encoder (VAE) generally comprises two neural networks called the encoder and the decoder. The encoder encodes the input in a lower-dimensional latent space representation, and the decoder decodes that representation trying to recover the encoded input. The model thus learns a continuous low-dimensional representation capturing the most salient statistical features of the input during the compression process. In the decoding process, the decoder learns how to generate samples consistent with the ones in the dataset. Both networks are probabilistic, meaning that they parameterize a probability distribution. This, together with a regularization term forcing the latent distribution to be close to a reference distribution, helps obtain a less sparse internal representation and prevents large portions of the latent space from decoding into invalid outputs. The loss function used to train these models comes from the well-known evidence lower bound (ELBO)68. If x is a sample from a dataset with probability distribution p(x), then the loss reads:

$${{\rm{loss}}}_{{\rm{ELBO}}}={D}_{{\rm{KL}}}[{q}_{\phi }({\bf{z}}\,|\,{\bf{x}})\,||\,p({\bf{z}})]-{{\mathbb{E}}}_{{q}_{\phi }}[\log ({p}_{\theta }({\bf{x}}\,|\,{\bf{z}}))],}$$
(1)

where qϕ(z∣x) and pθ(x∣z) are the probability distributions parameterized by the encoder and the decoder, respectively. DKL is the Kullback-Leibler divergence and p(z) is the prior probability distribution of the latent space representation z. Usually, all distributions are chosen to be multivariate Gaussians and \(p({\bf{z}})={\mathcal{N}}({\bf{0}},{\mathbb{I}})\). In the training procedure, the encoder learns to encode x in the probability distribution qϕ(z∣x), from which its latent representation z is sampled using the well-known reparameterization trick68. The divergence term here ensures that the representation stays compact, since the distribution will be as close as possible to the prior distribution \({\mathcal{N}}({\bf{0}},{\mathbb{I}})\). Then, z is fed to the decoder, which learns to turn this random variable back into x. After training is over, one can sample z from \({\mathcal{N}}({\bf{0}},{\mathbb{I}})\), feed it to the decoder, and obtain samples x with a distribution close to p(x).

In our implementation, we start from a dataset D = (X, Y), where X are the molecular structures and Y are their QM properties. To obtain an inverse mapping f: Y → X, we modified the standard VAE framework by adding a third network parameterizing the probability distribution pψ(z∣y). Moreover, we modified the loss function by including a likelihood term (i.e., \(-\log ({p}_{\psi }({\bf{z}}\,|\,{\bf{y}}))\)) in the ELBO loss function, which is now expressed as:

$${{\rm{loss}}=\beta {D}_{{\rm{KL}}}[{q}_{\phi }({\bf{z}}\,|\,{\bf{x}})\,||\,p({\bf{z}})]-{{\mathbb{E}}}_{{q}_{\phi }}[\log ({p}_{\theta }({\bf{x}}\,|\,{\bf{z}}))]-\tau {{\mathbb{E}}}_{{q}_{\phi }}[\log ({p}_{\psi }({\bf{z}}\,|\,{\bf{y}}))],}$$
(2)

where β and τ are adjustable coefficients introduced as hyperparameters. The training procedure is similar to the one described above for the VAE, but, in this case, the variable z sampled from qϕ(z∣x) will also maximize the likelihood term \({{\mathbb{E}}}_{{q}_{\phi }}[\log ({p}_{\psi }({\bf{z}}\,|\,{\bf{y}}))]\). This modification ensures the emergence of a common latent representation for both the VAE and the property encoders. After training, one takes a sample y from the property space and obtains a value for z as the mean of pψ(z∣y), which serves as the maximum-likelihood estimator for the latent space representation of the corresponding molecular structure. Subsequently, a molecule x can be generated, which is expected to possess properties similar to y. A schematic representation of our proposed implementation is shown in Fig. 1.
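A condensed PyTorch-style sketch of one evaluation of this joint objective; the module names, the diagonal-Gaussian form of both encoders, and the unit-variance Gaussian decoder likelihood (which turns the reconstruction term into a mean-squared error up to constants) are assumptions made here for illustration:

```python
import torch
import torch.nn.functional as F

def joint_loss(x, y, vae_encoder, decoder, property_encoder, beta=1.0, tau=1.0):
    """One evaluation of the modified ELBO of Eq. (2) (illustrative sketch)."""
    mu_x, logvar_x = vae_encoder(x)                       # q_phi(z | x)
    std_x = torch.exp(0.5 * logvar_x)
    z = mu_x + std_x * torch.randn_like(std_x)            # reparameterization trick

    x_rec = decoder(z)                                    # p_theta(x | z)
    recon = F.mse_loss(x_rec, x)                          # -E[log p_theta(x|z)] up to constants

    # KL divergence between q_phi(z|x) and the standard normal prior,
    # averaged over batch and latent dimensions
    kl = -0.5 * torch.mean(1 + logvar_x - mu_x.pow(2) - logvar_x.exp())

    mu_y, logvar_y = property_encoder(y)                  # p_psi(z | y)
    # -log p_psi(z|y) for a diagonal Gaussian, evaluated at the sampled z
    nll_prop = 0.5 * torch.mean(logvar_y + (z - mu_y).pow(2) / logvar_y.exp())

    return recon + beta * kl + tau * nll_prop
```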

Molecular representation and structure retrieval

In recent years, there has been a surge of groundbreaking research in generative modeling with a predominant focus on text-based representations like SMILES26,27,28,29,30,31,32,33,34,35,36,37,38,39,40, which encode information on the atom composition and connectivity within a molecule. However, despite its success, this representation limits our ability to study the QM properties of molecular systems. This issue arises because the electronic density, a key quantity in most QM methods, relies on the atom composition and three-dimensional atomic coordinates as input. Accordingly, since we are addressing the inverse mapping between QM properties and 3D structures, we require a different representation that enables us to address both structural and electronic properties. In particular, we need a representation that fulfills the following criteria: firstly, it has to encode the atomic positions and atomic species of a molecule with as many invariances as possible (translational, rotational, and permutational). Secondly, it has to show a strong correlation with QM properties. Finally, and most importantly, it has to be invertible, namely, one should be able to retrieve the Cartesian coordinates and atomic species. Neglecting the permutational invariance requirement, we find that a simple yet effective 3D molecular representation is the Coulomb matrix (CM)69,70. The CM is an elegant and physically inspired descriptor, which has shown great success in a wide variety of investigations related to molecular property prediction. The CM is invariant to translations and rotations, allowing for the retrieval of the atomic positions and species of a molecule up to a chirality transformation. This representation is defined as:

$${{\bf{M}}}_{ij}=\left\{\begin{array}{ll}0.5\,{Z}_{i}^{2.4}\quad &{\rm{if}}\ i=j,\\ \frac{{Z}_{i}{Z}_{j}}{|{{\bf{r}}}_{i}-{{\bf{r}}}_{j}|},\quad &{\rm{otherwise}},\end{array}\right.$$
(3)

where the indices i and j run over the atoms in the molecule and Zi indicates the atomic number of atom i. We have chosen to treat the hydrogen (H) atoms implicitly in our work for several reasons. Primarily, the terms in the Coulomb matrix representation pertaining to H atoms are significantly smaller than those involving heavier atoms. As a consequence, even small errors in the reconstruction process can lead to disproportionately large changes in the positions of H atoms. By treating the hydrogen atoms implicitly, we can mitigate the impact of these potential errors and minimize the distortion caused by inaccuracies in their positions. This approach allows us to focus our method on accurately representing the molecular scaffold (i.e., the positions of the heavy atoms). Concerning the standardization of the network input, we adapted the matrix to the maximum possible number of atoms per species found in the dataset, padding the rest of the entries with zeros (see Supplementary Note 1 of the Supplementary Information (SI)).
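A minimal NumPy sketch of the padded Coulomb matrix in Eq. (3); the species-wise padding layout actually used by the model (Supplementary Note 1) is simplified here to plain zero-padding up to a fixed size:

```python
import numpy as np

def coulomb_matrix(Z, coords, max_atoms: int) -> np.ndarray:
    """Coulomb matrix of Eq. (3), zero-padded to a fixed dimension.

    Z         : (N,) atomic numbers (heavy atoms only, H treated implicitly)
    coords    : (N, 3) Cartesian coordinates
    max_atoms : padded size; the species-wise padding of the paper is simplified
    """
    Z = np.asarray(Z, dtype=float)
    coords = np.asarray(coords, dtype=float)
    n = len(Z)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    off_diag = np.divide(np.outer(Z, Z), dist,
                         out=np.zeros((n, n)), where=dist > 0)
    m = np.zeros((max_atoms, max_atoms))
    m[:n, :n] = off_diag
    m[np.arange(n), np.arange(n)] = 0.5 * Z ** 2.4        # diagonal terms
    return m
```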

The retrieval of the Cartesian coordinates and chemical composition from the CM follows two steps. We first obtain the composition from the diagonal elements by applying the inverse function \(g={(2(\cdot ))}^{\frac{1}{2.4}}\) and rounding the outcomes to the closest integer values. Accordingly, we use the obtained set {Z} of atomic numbers and get the distances dij as \({d}_{ij}={\left(\frac{{M}_{ij}}{{Z}_{i}{Z}_{j}}\right)}^{-1}\). Lastly, we apply classical multidimensional scaling (CMDS) to the resulting Euclidean distance matrix71,72 to get the Cartesian coordinates of the atoms up to a chirality transformation (see Supplementary Note 1 of the SI). To reconstruct the H atoms, we use the OpenBabel software73, which restores connectivity and bond order based on interatomic distances. After this, the positions of the H atoms are optimized with third-order self-consistent charge density-functional tight binding (DFTB3)74,75 that also accounts for many-body dispersion/van der Waals (vdW) interactions via the MBD approach76,77 (DFTB3+MBD, the same level of theory used for optimizing the geometries in the QM7-X dataset), while freezing the positions of the rest of the atoms in the molecule.
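A sketch of this retrieval step under the definitions above (species from the diagonal, distances from the off-diagonal entries, classical multidimensional scaling for the coordinates); the hydrogen reconstruction with OpenBabel/DFTB3+MBD is omitted:

```python
import numpy as np

def cm_to_structure(cm: np.ndarray):
    """Recover atomic numbers and 3D coordinates (up to chirality) from a Coulomb matrix."""
    mask = np.diag(cm) > 1e-6                        # drop zero-padded entries
    c = cm[np.ix_(mask, mask)]
    Z = np.rint((2.0 * np.diag(c)) ** (1.0 / 2.4)).astype(int)   # invert 0.5 * Z^2.4

    # Interatomic distances: d_ij = Z_i Z_j / M_ij for the off-diagonal entries
    zz = np.outer(Z, Z).astype(float)
    d = np.zeros_like(c)
    off = ~np.eye(len(Z), dtype=bool)
    d[off] = zz[off] / c[off]

    # Classical multidimensional scaling (CMDS) on the squared-distance matrix
    n = len(Z)
    j = np.eye(n) - np.ones((n, n)) / n              # centering matrix
    b = -0.5 * j @ (d ** 2) @ j                      # double-centered Gram matrix
    w, v = np.linalg.eigh(b)
    idx = np.argsort(w)[::-1][:3]                    # three largest eigenvalues
    coords = v[:, idx] * np.sqrt(np.clip(w[idx], 0.0, None))
    return Z, coords
```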

As a final remark concerning the choice of representation, a graph-based representation would be beneficial for generative purposes as it treats molecular fragments robustly and avoids the need for truncation. However, in this work, we aim to define a CCS parameterization based on a set of global QM properties of molecules. Consequently, it becomes crucial to account for the potential variation in the number of atoms as we navigate through the property coordinates. Given that graph-based approaches require the number of nodes to be specified in advance, the (appropriately padded) CM proves to be an efficient and scalable representation for inverse design purposes at this stage of model development.

Reference chemical space

Out of the ~4.2M molecules in the QM7-X dataset54, we have selected a subset of 40,988 equilibrium conformations for molecules containing up to seven heavy atoms (C, N, and O). For training our model, we used the following data splitting: 28k and 2k molecules for the training and validation sets, respectively, while the remaining molecules were used for testing the model. Moreover, since the QM7-X dataset contains a plethora of physicochemical properties, computed by means of non-empirical hybrid density-functional theory (DFT) and a many-body treatment of vdW/dispersion interactions (i.e., PBE0+MBD)76,78,79,80 in conjunction with tightly converged numeric atom-centered basis sets81, we have opted to consider 17 global (extensive and intensive) properties during the training process.

Gradient attribution map

Similar to the standard approach employed in machine learning applications that involve image inputs82, we compute the gradient of the Coulomb matrix components \(\mathbf{CM}_{i}^{k}\) of a given molecule k with respect to a property \(\mathbf{p}_{j}\), yielding the Jacobian matrix \(\mathbf{J}_{ij}^{k}=\frac{\partial \mathbf{CM}_{i}^{k}}{\partial \mathbf{p}_{j}}\). By taking the norm of \(\mathbf{J}_{ij}^{k}\) over the output dimension of the CM and then averaging over the subset \(\mathbb{B}\) of best-reconstructed molecules (150 molecules with RMSD ≤ 0.2 Å), we obtain an attribution map for each property expressed as:

$$\mathbf{A}_{j}=\frac{1}{N}\sum\limits_{k\in\mathbb{B}}\left\Vert \frac{\partial \mathbf{CM}_{i}^{k}}{\partial \mathbf{p}_{j}}\right\Vert,$$
(4)

where N is the total number of molecules included in \(\mathbb{B}\). The idea behind this calculation is to understand which changes in the input produce the largest changes in the output.
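In practice, Eq. (4) can be evaluated with automatic differentiation; the sketch below assumes a callable `property_to_cm` (a hypothetical name for the composed property-encoder/VAE-decoder) that maps a property vector to a flattened Coulomb matrix:

```python
import torch

def attribution_map(property_to_cm, P):
    """Gradient attribution map of Eq. (4).

    property_to_cm : callable mapping a property vector of shape (n_props,)
                     to a flattened Coulomb matrix of shape (n_cm,)
    P              : (N, n_props) tensor with the property vectors of the
                     best-reconstructed molecules
    Returns a (n_props,) tensor with one attribution value per property.
    """
    norms = []
    for p in P:
        # Jacobian of the CM entries with respect to the properties, shape (n_cm, n_props)
        J = torch.autograd.functional.jacobian(property_to_cm, p)
        norms.append(J.norm(dim=0))               # norm over the CM (output) dimension
    return torch.stack(norms).mean(dim=0)         # average over the molecules in B
```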

Multi-Gaussian fitting for conditional molecular design

Here, we have used a multi-Gaussian fitting procedure to show the capability of the QIM model to target predefined pairs of QM properties, denoted as {m}. In doing so, we first acknowledge that our model has a fixed-dimensional input space for properties, which preserves the random-sampling capability of typical generative models. The conditional sampling is then achieved by sampling the non-targeted properties, denoted as {n}, while conditioning on the {m} values. For this purpose, we need a modeled version of the distribution of the property space spanned by QM7-X that facilitates the determination of the posterior distribution of {n} given fixed {m} values. We model this property space with a mixture of multivariate Gaussians. Specifically, we constructed a model with 91 multivariate Gaussian distributions \(\{\mathcal{N}(\boldsymbol{\mu}_{k},\boldsymbol{\Sigma}_{k})\}_{k\in\{1,\ldots,91\}}\), with \(\boldsymbol{\mu}_{k}\) and \(\boldsymbol{\Sigma}_{k}\) the mean vector and covariance matrix of Gaussian k. The number of components was chosen following a Bayesian information criterion (see Supplementary Note 4 of the SI).

With fixed target values, denoted as \(\{\bar{\mathbf{m}}\}\), the Gaussian \(\bar{k}\) from which the targeted properties are most likely to be sampled is selected following the maximum-likelihood criterion \(\bar{k}={\rm{argmax}}_{k}\,\mathcal{N}_{\mathbf{m}}(\mathbf{m}=\bar{\mathbf{m}}\,|\,\boldsymbol{\mu}_{k},\boldsymbol{\Sigma}_{k})\), where \(\mathcal{N}_{\mathbf{m}}\) is the marginal distribution relative to the chosen set of properties {m}. Next, the conditional probability formula for multivariate Gaussian distributions is applied to obtain the distribution of the non-targeted properties {n} for \(\mathbf{m}=\bar{\mathbf{m}}\). This can be written as,

$$p(\mathbf{n}\,|\,\mathbf{m}=\bar{\mathbf{m}})=\mathcal{N}(\mathbf{n}\,|\,\tilde{\boldsymbol{\mu}},\tilde{\boldsymbol{\Sigma}}),$$
(5)

with \(\tilde{\boldsymbol{\mu}}\) and \(\tilde{\boldsymbol{\Sigma}}\) defined as,

$$\tilde{\boldsymbol{\mu}}=\boldsymbol{\mu}_{n}+\boldsymbol{\Sigma}_{nm}\boldsymbol{\Sigma}_{mm}^{-1}(\bar{\mathbf{m}}-\boldsymbol{\mu}_{m}),$$
(6)
$$\tilde{\boldsymbol{\Sigma}}=\boldsymbol{\Sigma}_{nn}-\boldsymbol{\Sigma}_{nm}\boldsymbol{\Sigma}_{mm}^{-1}\boldsymbol{\Sigma}_{mn}.$$
(7)

To proceed with the generation of molecular structures, multiple values for {n} are sampled from \(\mathcal{N}(\mathbf{n}\,|\,\tilde{\boldsymbol{\mu}}(\mathbf{m}),\tilde{\boldsymbol{\Sigma}}(\mathbf{m}))\) for fixed values of the targeted properties \(\mathbf{m}=\bar{\mathbf{m}}\) (throughout the text, variables with a bar on top correspond to target values). Each sample s corresponds to the total set of (targeted and non-targeted) properties \(\bar{\mathbf{m}}\oplus\mathbf{n}_{s}\), which is then fed to the model to generate a Coulomb matrix sample \({\rm{CM}}_{s}\). At this stage, the VAE encoder is used to encode the samples and, consequently, to obtain the latent representation \(\mathbf{z}_{s}\). This is used to filter the samples based on the self-consistency criterion defined when analyzing the model's performance; namely, we require \(\Delta\mathbf{z}=||\mathbf{z}-\mathbf{z}_{s}||\in[0,0.4]\), where z is the encoding from the property encoder. Note that this specific interval is adjusted for each target, taking into consideration the possibility of it being either too loose or excessively stringent. The Cartesian coordinates, atomic species, and H atoms are subsequently retrieved from the generated Coulomb matrices as described above. After checking for broken molecules, we optimize the positions of the H atoms and then screen for overlapping atoms by imposing a threshold of 10 a.u. on the maximum force component at the DFTB3+MBD level of theory. As a final step, the generated structures undergo geometry optimizations using DFTB3+MBD and, subsequently, their QM properties are calculated at the PBE0+MBD level of theory to enable a comparison with the target values \(\{\bar{\mathbf{m}}\}\) in the QM7-X dataset.
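A minimal NumPy/SciPy sketch of this conditional sampling (Eqs. (5)–(7)) is given below, assuming the mixture parameters (means and covariances) have already been fitted to the property space, e.g., with scikit-learn's GaussianMixture; the argument names are illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

def conditional_property_samples(means, covs, m_idx, n_idx, m_bar, n_samples=100, rng=None):
    """Sample full property vectors with targeted properties fixed at m_bar (Eqs. (5)-(7)).

    means, covs : component means (K, d) and covariances (K, d, d) of the Gaussian
                  mixture fitted to the 17-dimensional property space
    m_idx, n_idx: integer index arrays of the targeted / non-targeted properties
    m_bar       : target values for the properties in m_idx
    """
    if rng is None:
        rng = np.random.default_rng()
    # Pick the component whose marginal over m maximizes the likelihood of m_bar.
    lik = [multivariate_normal(mu[m_idx], S[np.ix_(m_idx, m_idx)]).pdf(m_bar)
           for mu, S in zip(means, covs)]
    k = int(np.argmax(lik))
    mu, S = means[k], covs[k]
    S_mm, S_nm, S_nn = S[np.ix_(m_idx, m_idx)], S[np.ix_(n_idx, m_idx)], S[np.ix_(n_idx, n_idx)]
    # Conditional mean and covariance of the non-targeted properties, Eqs. (6)-(7).
    mu_cond = mu[n_idx] + S_nm @ np.linalg.solve(S_mm, m_bar - mu[m_idx])
    S_cond = S_nn - S_nm @ np.linalg.solve(S_mm, S_nm.T)
    # Reassemble full property vectors (targeted values fixed, non-targeted sampled).
    P = np.empty((n_samples, means.shape[1]))
    P[:, m_idx] = m_bar
    P[:, n_idx] = rng.multivariate_normal(mu_cond, S_cond, size=n_samples)
    return P
```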

Masking method for representation

Stemming from the consideration that the padded Coulomb matrix representation has difficulty treating molecular fragments independently, we addressed the problem during the training procedure with the following modifications:

  • Randomly masking atoms: This is done by taking a random binary vector of length l, where the padded Coulomb matrix is an l × l matrix, and computing the outer product of this vector with itself. The zeros and ones are drawn from a Bernoulli distribution with p = 0.5. To maintain the same expected value of the reconstruction loss, we also multiply the mask by a factor of 2. The resulting mask \(\mathcal{M}\) is then applied to the reconstruction loss before summing over the Coulomb matrix entries, namely:

    $${\rm{loss}}_{\mathcal{M}}=\sum\limits_{i,j}\mathcal{M}_{ij}\left(-\log p_{\theta}(\mathbf{x}_{ij}\,|\,\mathbf{z})\right),$$
    (8)

    where \({\rm{loss}}_{\mathcal{M}}\) is the new reconstruction loss.

  • Leaky masking of padding: This is done similarly by setting \(\mathcal{M}_{ij}\) to 0.1 whenever \(\mathbf{x}_{ij}=0\).

The application of this masking significantly enhances the capability of the QIM model to treat fragments independently. Additionally, due to the leaky masking of the padding, the model can consider molecules with more heavy atoms than the largest molecule in the dataset. To generate such unseen compounds, we repeated the initial procedure without the self-consistency filter, generating samples until a new heavy-atom composition was found.
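A minimal PyTorch-style sketch of the two masking schemes is given below, assuming the reconstruction loss is already available per CM entry; the way the two masks are combined on padded entries is our simplification of the description above:

```python
import torch

def masked_reconstruction_loss(nll, x, p=0.5, leak=0.1):
    """Apply random atom masking and leaky padding masking to a per-entry loss (Eq. (8)).

    nll : (l, l) tensor of per-entry terms -log p_theta(x_ij | z)
    x   : (l, l) padded Coulomb matrix, where zero entries indicate padding
    """
    l = x.shape[0]
    keep = torch.bernoulli(torch.full((l,), p))                    # random binary vector
    mask = 2.0 * torch.outer(keep, keep)                           # outer product, rescaled by 2
    mask = torch.where(x == 0, torch.full_like(mask, leak), mask)  # leaky masking of padding
    return (mask * nll).sum()
```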

Geodesic interpolation algorithm

While it would be relatively straightforward to perform a linear interpolation in the latent space and generate the associated geometries using the learned CCS parameterization, we are also interested in exploring the feasibility of obtaining an energy-profile estimation along this path. To do so, we use the geodesic interpolation algorithm58 for VAEs, but, in this instance, the objective is to find curves in the property space that are geodesics with respect to the metric induced by the latent-space encoding. This enables us to see how a close-to-linear interpolation in the latent space is reflected in both the property input space and the structure output space. The procedure to obtain the transition geometries consists in taking an initial linear interpolation in property space between two structures and optimizing it with a gradient descent algorithm, minimizing the length of the corresponding path in latent space.

The geodesic interpolation procedure starts by considering the initial and final conformational isomers, namely x0 and xN. We then linearly interpolate N − 1 points between the corresponding coordinates in property space, defined by the seventeen properties considered during the training procedure. Applying the property-to-structure relationship defined by our model, we obtain an initial configuration given by the set {y0, y1, y2, …, yN} of property coordinates, the set {z0, z1, z2, …, zN} of latent-space coordinates, and the set {x0, x1, x2, …, xN} of molecular structures. After this, we use a gradient descent algorithm to minimize the loss function \(L=\sum_{i=0}^{N-1}||\mathbf{z}_{i+1}-\mathbf{z}_{i}||^{2}+\epsilon\sum_{i=0}^{N-1}||\mathbf{y}_{i+1}-\mathbf{y}_{i}||^{2}\), where the first term enforces the minimum distance in the latent space, while the second term is a regularization term with \(\epsilon\ll 1\) that enforces continuity in the property space. To find a stationary path, the optimization runs until a convergence criterion on the norm of the gradient is met (i.e., \(<10^{-4}\)). Concerning the choice of molecules for this evaluation test, we also took into account that chiral molecules are indistinguishable in the CM representation. We hence selected the initial and final states ensuring that they are not mirror images of one another, as the model would interpret them as identical molecules.
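A minimal PyTorch sketch of this optimization is given below; `property_encoder` is an illustrative name for the mapping from property coordinates to latent coordinates, Adam is used as a stand-in for the plain gradient-descent optimizer described above, and the endpoints are kept fixed:

```python
import torch

def geodesic_path(property_encoder, y0, yN, N=20, eps=1e-3, lr=1e-2, tol=1e-4, max_iter=10000):
    """Optimize a path of property coordinates so that its latent-space image is
    (approximately) a geodesic; the endpoints y0 and yN are kept fixed.

    property_encoder : callable mapping a batch of property vectors to latent vectors
    y0, yN           : (n_props,) property vectors of the initial and final structures
    """
    t = torch.linspace(0.0, 1.0, N + 1).unsqueeze(1)
    Y = (1 - t) * y0 + t * yN                          # initial linear interpolation
    inner = Y[1:-1].clone().requires_grad_(True)       # only interior points are optimized
    opt = torch.optim.Adam([inner], lr=lr)
    for _ in range(max_iter):
        opt.zero_grad()
        path = torch.cat([y0.unsqueeze(0), inner, yN.unsqueeze(0)], dim=0)
        Z = property_encoder(path)                     # latent coordinates z_0 ... z_N
        loss = ((Z[1:] - Z[:-1]) ** 2).sum() + eps * ((path[1:] - path[:-1]) ** 2).sum()
        loss.backward()
        if inner.grad.norm() < tol:                    # convergence on the gradient norm
            break
        opt.step()
    return torch.cat([y0.unsqueeze(0), inner.detach(), yN.unsqueeze(0)], dim=0)
```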

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.