¹ School of Electronic and Computer Engineering, Peking University, Shenzhen, China
² AI for Science (AI4S)-Preferred Program, Peking University Shenzhen Graduate School, Shenzhen, China
³ State Key Laboratory of Physical Chemistry of Solid Surfaces, School of Electronic Science and Engineering, Innovation Laboratory for Science and Technologies of Energy Materials of Fujian Province (IKKEM) and College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, China
⁴ Peng Cheng Laboratory, Shenzhen, China
⁵ School of Materials Science and Engineering, Peking University, Beijing, China
* These authors contributed equally to this work
† Corresponding authors: yuanli-ece@pku.edu.cn, xcwang@xmu.edu.cn, fmo@pku.edu.cn

Deep peak property learning for efficient chiral molecules ECD spectra prediction

Hao Li, Da Long, Li Yuan, Yonghong Tian, Xinchang Wang, Fanyang Mo

Abstract

Chiral molecule assignation is crucial for asymmetric catalysis, functional materials, and the drug industry. The conventional approach requires theoretical calculations of electronic circular dichroism (ECD) spectra, which are time-consuming and costly. To speed up this process, we incorporate deep learning techniques into ECD prediction. We first set up a large-scale dataset of Chiral Molecular ECD spectra (CMCDS) with calculated ECD spectra. We further develop ECDFormer, a Transformer-based model that learns chiral molecular representations and predicts the corresponding ECD spectra with improved efficiency and accuracy. Unlike other models for spectrum prediction, ECDFormer focuses on peak properties rather than the whole spectrum sequence, inspired by the scenario of chiral molecule assignation. Specifically, ECDFormer predicts the peak properties, including number, position, and symbol, and then renders the ECD spectra from these peak properties, which significantly outperforms other models in ECD prediction. ECDFormer reduces the time to acquire ECD spectra from 1-100 hours per molecule to 1.5 s.

Keywords

Chiral Molecule Assignation, ECD Spectra Prediction, Deep Learning.

A chiral molecule has a spatial arrangement that cannot be superimposed onto its mirror image, resulting in non-identical left-handed and right-handed forms. Chirality is ubiquitous in chemistry and biology and plays a crucial role in various fields such as asymmetric catalysis [1, 2], functional materials [3, 4], drug discovery [5], and other related areas [6, 7]. In drug discovery in particular, the activity of a drug often depends on its absolute configuration. A well-known chiral drug is thalidomide, shown in Fig. 1(a), which was previously used as an antiemetic drug for morning sickness [8] in the form of enantiomeric pairs. However, one of its chiral configurations (R-type) is safe, while the other (S-type) induces severe teratogenic effects. Thus, assigning the absolute configuration of chiral molecules has always been at the center of chirality-related research.

There are traditional approaches for discerning the chiral configuration of a molecule with a single chiral carbon, including electronic circular dichroism (ECD) spectroscopy, nuclear magnetic resonance spectroscopy, and X-ray single-crystal diffraction methods [9, 10]. Among these methods, ECD spectroscopy is the most efficient and reliable for determining the absolute configuration of chiral molecules. However, the procedure is still laborious and time-consuming. It includes chiral separation of isomers, acquisition of experimental CD spectra, computation of the theoretical ECD spectra through quantum chemical calculations, and comparison of the experimental and theoretical ECD spectra to achieve conclusive identification of the absolute configuration. Specifically, this comparison focuses on the wavelength, the signs of the Cotton effects (positive or negative peaks), the intensity of the peaks, and their agreement between experimental and calculated spectra [11].

Figure 1: The scheme for ECD prediction and chiral molecule assignation. a Thalidomide has two configurations (R/S). R-Thalidomide induces sedative effects, whereas S-Thalidomide is associated with teratogenic effects. b ECD comparison is most frequently employed for assigning the absolute configuration. However, the theoretical calculation of ECD is time-consuming, involving steps such as conformational searching, conformational optimization, excited-state property calculation, and Boltzmann weighting, so we employ deep learning for acceleration. c As molecules become more complex, the computation time increases. The CPU used is an Intel Xeon E5-2640 v4 @ 2.40 GHz.

For experimental chemists, the theoretical calculation of ECD spectra stands out as the most time-consuming and technically demanding of the aforementioned steps. As shown in Fig. 1(b), the computation of ECD spectra for a chiral molecule entails multiple stages. Initially, a molecular structure model is drawn, followed by molecular dynamics simulations to explore various energetically favorable conformations. Subsequently, these conformations undergo individual structure optimization and energy calculations at the density functional theory (DFT) level of precision. Then the ECD spectra of the molecules are computed with time-dependent DFT (TD-DFT) calculations. The final calculated ECD spectrum is generated by combining the individual ECD spectra of the different conformations, weighted by their Boltzmann probabilities. This requires experimental chemists to be proficient with specialized tools, such as molecular dynamics and DFT calculations. Moreover, the computational demands and time requirements of this process are substantial, making it the rate-determining step in the assignment of chiral absolute configurations. It raises an open question: “Can we speed up the theoretical calculation of ECD spectra?”

In recent years, statistical tools based on machine learning have been integrated into chemistry research workflows [12]. This integration is enabling researchers to analyze vast datasets with greater precision and discover intricate patterns and relationships that were previously undetectable, significantly enhancing the efficiency and effectiveness of chemical research and innovation [13, 14]. Large and high-quality datasets are essential for the effectiveness of machine learning methods. We first need to have a library of chiral molecules. Fortunately, we have constructed a library of 25000+ chiral molecules (Chiral Molecules Retention Time Dataset, CMRT) in our previous work, which introduced a machine learning framework to enhance the efficiency of chromatographic enantioseparation in experimental chemistry [15]. Based on the CMRT dataset, a Chiral Molecular CD Spectra Dataset (CMCDS) was generated by selecting chiral molecules from CMRT and calculating their ECD spectra. To the best of our knowledge, CMCDS is the first large-scale dataset for ECD spectra prediction.

With the CMCDS dataset, we further construct the ECDFormer, a deep-learning model to speed up the prediction of the ECD spectra for chiral molecules. Inspired by the chemical assignation scenario that focuses on peak properties in the ECD spectra, our ECDFormer creatively proposes a peak property prediction module to render the ECD spectra from peak properties rather than predict the ECD spectra directly. For the input molecule, our ECDFormer applies its atom, bond, angle features, and molecular descriptors as the description information into the GeoGNN structure [16] to learn the molecular representation. For the peak property learning module, we apply the transformer encoder [17] to learn the peak property features from molecular representations. Then we respectively predict the peak number, position (wavelength), and symbol (the sign of Cotton effect) from property features and render them into the ECD spectra as the prediction of theoretical ECD spectra.

The quantitative experimental results demonstrate the accuracy and efficiency of our ECDFormer compared with other baselines that directly predict the whole ECD spectra. The visualizations show that ECDFormer predicts correct ECD spectra for molecules in CMCDS as well as the natural molecules with pharmaceutical effects. Our model not only advances research in chiral chemistry but also has potential applications in asymmetric synthesis and facilitates high-throughput screening of chiral drug molecules in the pharmaceutical development field. Our contribution can be summarized as follows:

• The ECD spectra calculation for chiral molecular assignation is crucial yet time-consuming for chemists. A deep-learning model, ECDFormer, was proposed to predict the ECD spectra and improve the assignation efficiency. Inspired by the assignation procedure in chemistry, ECDFormer focuses on peak prediction and renders peaks into the ECD spectra.
• We proposed a large-scale dataset, CMCDS, for the ECD prediction task. CMCDS, containing ECD spectra for 22,190 chiral molecules, was produced utilizing substantial computational power.
• Experimental results demonstrate the accuracy and efficiency of ECDFormer on the CMCDS dataset. ECDFormer also predicts correct ECD spectra for the natural product molecules that have pharmaceutical effects.

1 Results

1.1 Construction of the CMCDS dataset

As shown in Fig. 2, the CMCDS dataset is built mainly through large-scale theoretical calculations. It consists of the ECD spectra and SMILES sequences of 22,190 chiral molecules, and the ECD spectral data of all molecules were calculated with the Gaussian16 A.03 package [18]. Our chiral molecules were mainly collected from the literature on asymmetric catalysis. We transformed the SMILES files of the molecules into MOL files with the RDKit package to obtain the 3D atomic coordinates of the molecules, and then converted the MOL files into Gaussian input (gjf) files in batches with Python. The molecular structures were then optimized at the B3LYP [19]/6-31G level. Furthermore, we conducted the electronic circular dichroism calculation at the CAM-B3LYP [20]/6-31G(d) level, setting the number of states (nstates) to 20. We fix the half-peak width at 0.3 and apply Gaussian broadening, utilizing the energies and wavelengths derived from these 20 excited states. The ECD spectra of all molecules were acquired in the same way, and we used Python for batch data processing.
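The Gaussian broadening step described above can be sketched as follows. This is a minimal illustration, not the authors' processing script: the function name `broaden_ecd`, the use of rotatory strengths as per-state weights, the reading of 0.3 as a width in eV, and the omission of physical prefactors are all assumptions of this sketch.

```python
import math

EV_NM = 1239.84  # approximate conversion: wavelength (nm) = 1239.84 / energy (eV)

def broaden_ecd(energies_ev, rot_strengths, wavelengths_nm, sigma_ev=0.3):
    """Sum one Gaussian per excited state, evaluated on a wavelength grid.

    sigma_ev is treated as the Gaussian width in eV (an assumption; the text
    only states a half-peak width of 0.3). Physical prefactors are omitted,
    so the result is in arbitrary units.
    """
    spectrum = []
    for lam in wavelengths_nm:
        e = EV_NM / lam  # grid point converted to energy
        value = sum(
            r * math.exp(-((e - e_i) / sigma_ev) ** 2)
            for e_i, r in zip(energies_ev, rot_strengths)
        )
        spectrum.append(value)
    return spectrum

# Hypothetical excited states: energies in eV, rotatory strengths in arbitrary units
energies = [4.5, 5.2, 6.0]
strengths = [10.0, -25.0, 8.0]
grid = [180 + 2 * i for i in range(86)]  # 180 nm to 350 nm in 2 nm steps
curve = broaden_ecd(energies, strengths, grid)
```

Negating every rotatory strength negates the whole curve, which mirrors the relationship between the spectra of two enantiomers.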

1.2 Construction of the ECDFormer model

Fig. 3 shows the computational workflow of our ECDFormer model. The workflow takes the atom-bond-angle features and molecular descriptors as the features of the target molecule. ECDFormer contains four modules for ECD prediction: (i) the molecular feature extraction module, which obtains the chiral molecular representation with a geometry-enhanced graph neural network; (ii) the peak property learning module, which extracts the peak property features from the chiral molecular representation using a Transformer Encoder structure; (iii) the peak property prediction module, which predicts the peak properties, including number, position, and symbol, from the learned peak property features; and (iv) the ECD rendering module, which reconstructs the ECD spectra from the predicted peak properties.

Molecular Electronic Circular Dichroism (ECD) spectra are characterized by the presence of positive and negative peaks as a result of the Cotton effect [21]. Compared to other spectra including protein ECD spectra [22] and molecular infrared spectra [21], molecular ECD spectra reveal significant morphological variations. This distinct feature makes traditional sequence prediction models (LSTM, GRU) less effective for ECD prediction by directly predicting the whole spectra. Chemists often concentrate on the symbols of peaks (indicating the direction of the Cotton effect) and their positions (related to the wavelengths of the peaks) in ECD spectra for determining chirality in molecules. To streamline the ECD prediction process, we focus on predicting essential ECD information such as the number of peaks, their positions, and symbols. Accordingly, the peak-focused loss function to support this approach is:

	$\displaystyle L(y^{true},y^{pred})=L_{ce}^{Num}(y^{true},y^{pred})+(L_{ce}^{Pos}(y^{true},y^{pred})+2*L_{ce}^{Sym}(y^{true},y^{pred}))$		(1)

where the $L_{ce}$ terms for peak number, position, and symbol are cross-entropy losses [23]. Because ECD spectra prediction emphasizes the positive and negative peaks, we weight the peak-symbol loss by a factor of 2 to strengthen the model's symbol prediction.
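A minimal sketch of Eq. (1) in plain Python may help make the weighting concrete. The helper names `cross_entropy` and `peak_loss` are hypothetical, each property is treated as a classification over discrete classes, and averaging the position and symbol terms over the peaks of a molecule is an assumption of this sketch rather than a detail stated in the text.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of one sample: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def peak_loss(num_logits, num_true, pos_logits, pos_true,
              sym_logits, sym_true, sym_weight=2.0):
    """Eq. (1): number loss + (position loss + 2 * symbol loss).

    pos_logits and sym_logits hold one logit vector per peak. Averaging
    over peaks is an assumption of this sketch.
    """
    l_num = cross_entropy(num_logits, num_true)
    l_pos = sum(cross_entropy(l, t) for l, t in zip(pos_logits, pos_true)) / len(pos_true)
    l_sym = sum(cross_entropy(l, t) for l, t in zip(sym_logits, sym_true)) / len(sym_true)
    return l_num + (l_pos + sym_weight * l_sym)

# Hypothetical molecule with two peaks, uniform (uninformative) logits
loss = peak_loss(
    num_logits=[0.0, 0.0, 0.0, 0.0], num_true=1,
    pos_logits=[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], pos_true=[0, 2],
    sym_logits=[[0.0, 0.0], [0.0, 0.0]], sym_true=[1, 0],
)
```

With uniform logits each term reduces to the logarithm of its class count, so the symbol term contributes twice its unweighted value to the total.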

1.3 Peak-specific Evaluation Metrics for the ECD Prediction Task

The ECD spectra of chemical molecules exhibit two distinct characteristics: (i) a high degree of shape diversity, and (ii) a strong reliance on peak attributes for chiral molecule identification. These characteristics differ significantly from those of protein ECD spectra, which makes the whole-spectrum Root Mean Square Error (RMSE) metric used in protein ECD spectrum prediction tasks [24, 25, 26] inappropriate here. To better evaluate the quality of the ECD spectrum for the chiral molecular assignation task, we establish three evaluation metrics based on the peak attributes of ECD spectra: (1) Number-RMSE, the RMSE of the peak number between ground-truth and predicted ECD spectra; (2) Position-RMSE, the RMSE of each peak’s position between ground-truth and predicted ECD spectra; and (3) Symbol-Acc, the matching accuracy of the peaks’ symbols between ground-truth and predicted ECD spectra. These metrics provide a reasonable and comprehensive assessment of ECD spectrum prediction quality from different perspectives.
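The three metrics can be sketched as below. The function names are hypothetical, and the way peaks are matched is an assumption of this sketch: peaks of the same molecule are paired by index, and any extra unmatched peaks are ignored. The text does not specify the matching rule.

```python
import math

def number_rmse(true_counts, pred_counts):
    """RMSE between true and predicted peak counts, one count per molecule."""
    sq = [(t - p) ** 2 for t, p in zip(true_counts, pred_counts)]
    return math.sqrt(sum(sq) / len(sq))

def _paired(true_peaks, pred_peaks):
    """Pair the peaks of each molecule by index; unmatched extras are dropped."""
    return [(t, p) for ts, ps in zip(true_peaks, pred_peaks) for t, p in zip(ts, ps)]

def position_rmse(true_positions, pred_positions):
    """RMSE of peak positions (nm) over all paired peaks."""
    pairs = _paired(true_positions, pred_positions)
    return math.sqrt(sum((t - p) ** 2 for t, p in pairs) / len(pairs))

def symbol_acc(true_symbols, pred_symbols):
    """Fraction of paired peaks whose sign (+1 or -1) matches."""
    pairs = _paired(true_symbols, pred_symbols)
    return sum(t == p for t, p in pairs) / len(pairs)

# Two hypothetical molecules: the second prediction has one spurious peak
n_rmse = number_rmse([2, 1], [2, 2])
p_rmse = position_rmse([[210, 250], [230]], [[212, 250], [227, 300]])
s_acc = symbol_acc([[1, -1], [1]], [[1, 1], [1, -1]])
```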

1.4 Performance comparison on the CMCDS dataset

To comprehensively evaluate the performance of our ECDFormer, we implemented two categories of baselines: machine learning models and deep learning models. Table 1 shows that our model outperforms all of these baselines. The specific experimental analysis is provided below.

1.4.1 Comparison with machine learning baselines.

Machine learning models are commonly used as analytical tools in the fields of chemistry and materials science [27, 28]. We select three common models, the SGD Regressor, the Passive Aggressive Regressor, and the Logistic Regressor, as baselines. Comparing rows 1-3 with row 10 in Table 1, the machine learning baselines perform unsatisfactorily, which is mainly attributed to the models’ inability to decouple spectral sequences from complex molecular structural features. This emphasizes the necessity of employing deep learning models to tackle the task of predicting ECD spectra for chiral molecules.

1.4.2 Comparison with deep learning baselines.

In the context of abundant data, deep learning models have shown excellent performance in complex task settings. With the CMCDS dataset, we implement sequence prediction deep learning models as our baselines, including LSTM [29], GRU [30], and Transformer Decoder [17]. Comparing rows 5-8 with row 10 in Table 1, our ECDFormer, which predicts the peak properties of ECD spectra, significantly outperforms the other baselines. The results demonstrate the effectiveness of the peak property prediction module in ECDFormer. Comparing rows 6 and 8 with rows 7 and 9, pretraining has little influence on the ECD prediction task, due to the lack of chiral molecular information during the pretraining stage.

#	Method	Init: Rand	Init: Pretrain	Position-RMSE (nm) $\downarrow$	Number-RMSE $\downarrow$	Symbol-Acc. (%) $\uparrow$
Machine Learning Methods
1	Logistic-Regressor	✓	-	7.81	7.22	47.8
2	SGD-Regressor	✓	-	6.44	6.36	47.1
3	Aggr-Regressor	✓	-	5.97	4.39	48.5
Deep Learning Methods
4	GeoGNN+Linear	✓	-	8.62	2.87	51.9
5	GeoGNN+GRU	✓	-	6.47	1.72	39.5
6	GeoGNN+LSTM	✓	-	5.91	1.76	43.7
7	GeoGNN+LSTM	-	✓	4.68	1.45	46.4
8	GeoGNN+Transformer	✓	-	4.69	1.36	49.2
9	GeoGNN+Transformer	-	✓	5.82	1.64	37.3
10	ECDFormer (ours)	✓	-	2.29	1.24	72.7

Table 1: Performance on the ECD prediction task. We report the experimental results of our ECDFormer framework and the corresponding baselines, including machine learning models and deep learning models. Focusing on peak property prediction, our ECDFormer model surpasses the baselines under all evaluation metrics.

1.5 The Analysis Visualization on Peak-specific ECD Evaluation Metrics

To better analyze the models’ performance, including our ECDFormer and other baselines, under three peak-specific evaluation metrics, we draw the analysis graphs for each evaluation metric in Fig. 4. The detailed analysis is as follows:

1.5.1 Peak Number Analysis

In Fig. 4(a), we analyze ECDFormer’s predictive capability regarding the peak number and demonstrate its excellent performance in predicting peak number for complex spectra (Peak-Number $>5$ ) compared to baseline models. The X-axis represents the ground truth values of the peak number, while the Y-axis represents the predicted values of the peak number. Therefore, the closer the data points are to the $y=x$ line, the better the predictive performance. The density of the data points is indicated by the size of the red circles, where a larger red circle represents a higher concentration of data points. Fig. 4(a) shows that in ECDFormer, the largest red circles all appear on the $y=x$ line, even when predicting hard samples (Peak-Number $>5$ ). The RMSE of peak number is 1.01, indicating the good performance of peak number prediction for our ECDFormer.

1.5.2 Peak Position Analysis

In Fig. 4(b), we analyze the model’s peak position predictive capability. Specifically, we visualize violin plots of the position differences between predicted peaks and ground-truth peaks. To further visualize the performance in easy-to-hard cases, we split the test dataset based on the peak number $N_{v}$ of a molecule. Compared with the baselines, for all cases from easy to hard, most ECDFormer predictions show zero position difference from the ground truth, demonstrating its effectiveness.

1.5.3 Peak Symbol Analysis

In Fig. 4(c), similar to the peak position analysis, we further analyze the model’s peak symbol predictive capability. We visualize violin plots of the symbol differences between predicted peaks and ground-truth peaks. Compared with the baselines, for all cases from easy to hard, most ECDFormer predictions have the same symbols as the ground truth, demonstrating its effectiveness.

1.6 The Visualization of ECD Spectra Prediction Cases

Our visualization contains two parts: (a) the ECD spectra of molecules in the test split of the CMCDS dataset, and (b) the ECD spectra of existing pharmaceutical molecules. Fig. 5 presents the visualization for the CMCDS test split, demonstrating that our model makes good predictions even for complex molecules with various structures. Fig. 6 shows the ECD predictions for existing pharmaceutical molecules. We first visualize the ECD predictions for the R/S types of hydroxybrevianamide [31], a natural product from an Aspergillus sp. fungus. Fig. 6 (top) shows that ECDFormer can successfully predict the ECD spectra for R-type and S-type molecule pairs. We also visualize the ECD predictions for other pharmaceutical molecules, including Wulfenioidins.L [32] (Anti-Zika Virus Effect), Purpurascenines.B [33] (Antagonist Effect), and Alkaloids [34] (Anti-inflammatory Effect). The ECDFormer predictions also match the theoretical ECD spectra of these complex natural products with pharmaceutical effects.

2 Discussion

This study proposes a research framework for integrating deep learning techniques into the field of chemistry to improve the efficiency with which researchers acquire the ECD spectra of chiral molecules. The proposed ECDFormer addresses several core issues, including data collection, 3D characterization of chiral molecules, and understanding of chirality. Firstly, as the ECD spectra of all molecules are calculated consistently, this study mainly employs Python scripts for batch processing and generation of the data, thus providing a standardized CMCDS dataset. Secondly, a specialized neural network, ECDFormer, was established, and experimental results showed that it can obtain ECD spectra directly from the SMILES of chiral molecules.

The experimental validation of the ECDFormer model demonstrates its capability and generalization ability in predicting ECD spectra for small organic molecules, including single-chiral-centered molecules and multi-chiral-centered molecules. However, there are areas for improvement in this study that could be addressed in future research. First, in compiling the extensive ECD spectral data, we bypassed the conformational search for each molecule to minimize time and cost, which may have introduced some inaccuracies in the spectral data. Additionally, the choice of basis set in the DFT calculations limits the range of chiral molecules we can study, particularly excluding those containing elements heavier than iodine. Moreover, our focus was solely on molecules with a single chiral center, intentionally excluding those with multiple chiral centers. Despite these constraints, we remain optimistic about the ECDFormer model’s potential in accurately determining the absolute configuration of chiral molecules. The model offers a rapid way to acquire ECD spectra directly from the SMILES notation of the molecules.

3 Methods

3.1 Problem Definition and Preliminary for Electronic Circular Dichroism Prediction

We first briefly introduce the problem definition of the ECD prediction task for the convenience of description and discussion.

3.1.1 Electronic Circular Dichroism Prediction Task.

Generally, each chemical molecule has its own electronic circular dichroism (ECD) spectrum. For molecule $M_{1}$ , we represent the ECD of $M_{1}$ as $\{\mathcal{S}_{1:i}\}_{i=1}^{N_{w}}$ , where $\mathcal{S}_{1:i}$ is the ECD value at the $i$ -th sampled wavelength and $N_{w}$ is the number of sampled wavelengths. The input light wavelength ranges from 80 to 450 nm, and the ECD values range from -200 mdeg to 200 mdeg. For the enantiomer $\widetilde{M}_{1}$ of $M_{1}$ , we represent its ECD as $\{-\mathcal{S}_{1:i}\}_{i=1}^{N_{w}}$ . When applying deep learning models to ECD prediction, a direct approach is to build a site-level sequence prediction model that predicts every $\mathcal{S}_{1:i}$ of the ECD. In practice, however, the molecular representation lacks the knowledge needed to reconstruct the site-level ECD sequence. Thus, we simplify the ECD prediction task from the chemical perspective, focusing on the peak features in the ECD sequence. Specifically, we represent molecule $M_{1}$ ’s ECD sequence as $\{\mathcal{P}_{1:j}\}_{j=1}^{N_{p}}$ , where $\mathcal{P}_{1:j}$ is the $j$ -th peak in the ECD sequence, and $N_{p}$ is the peak number. The ECD prediction task aims to predict the peak number $N_{p}$ and the position and height of each peak $\mathcal{P}_{1:j}$ . Under the new task setting, deep learning models achieve better performance on ECD prediction and chiral molecule discrimination.
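To illustrate the move from a site-level sequence to a peak representation, the sketch below extracts peaks from a sampled spectrum and checks the mirror-image relationship. The function `find_peaks` and its local-extremum rule are assumptions of this sketch; the text does not describe how peaks are detected in the ground-truth spectra.

```python
def find_peaks(values):
    """Return (index, sign) for each positive local maximum or negative local minimum."""
    peaks = []
    for i in range(1, len(values) - 1):
        prev, cur, nxt = values[i - 1], values[i], values[i + 1]
        if cur > 0 and cur > prev and cur >= nxt:
            peaks.append((i, 1))    # positive Cotton effect
        elif cur < 0 and cur < prev and cur <= nxt:
            peaks.append((i, -1))   # negative Cotton effect
    return peaks

# Hypothetical spectrum sampled at ten wavelengths (arbitrary units)
spectrum = [0.0, 3.0, 8.0, 4.0, -2.0, -9.0, -5.0, 0.0, 2.0, 1.0]
peaks = find_peaks(spectrum)

# The enantiomer has the negated spectrum: same peak positions, opposite signs
mirror_peaks = find_peaks([-s for s in spectrum])
```

Here the peak number $N_{p}$ is `len(peaks)`, and each tuple carries the position index and the symbol of one peak.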

3.1.2 Deep Learning Models: Graph Neural Network (GNN) and Transformer Network.

The Graph Neural Network (GNN) [35] is an outstanding model for graph representation learning. For molecules, the atoms and chemical bonds are naturally interpreted as a graph. Thus, the GNN has become the standard model for extracting molecular representations [15, 36, 37]. For an input molecular graph, the GNN takes the weighted average of each node's features and its neighbors' features, resulting in new node representations. The GNN repeats this process over multiple layers to progressively propagate and aggregate information across the nodes, leading to richer molecule representations.

The Transformer [38] is a seminal deep-learning model for the sequence processing task. It utilizes stacked attention [17] modules to fuse sequence features from different positions, thereby achieving improved sequence prediction performance. The Transformer contains two parts: the Encoder and the Decoder. The Encoder employs bidirectional attention modules to better integrate input features, while the Decoder employs unidirectional attention modules for sequence prediction. In this work, we have redefined the ECD prediction task as peak information prediction, and therefore, we apply the Transformer Encoder structure for its enhanced feature fusion capability.

3.2 The Framework of the proposed ECDFormer

The overview of our ECDFormer is illustrated in Fig. 3. Our ECDFormer contains four major modules: (1). Feature Extraction Module with GeoGNN [16, 39], (2). Peak Property Learning Module with Transformer Encoder, (3). Peak Property Prediction Module, (4). ECD Rendering Module. The workflow of our ECDFormer is described below.

The Feature Extraction Module utilizes GeoGNN containing two graph convolutional networks to extract the molecule’s geometric and descriptor information from the molecule’s atom-bond graph and bond-angle graph. Then, the molecule representation features are input into the Peak Property Learning Module together with empty query tokens. With the transformer encoder structure, the Peak Property Learning Module extracts the peak-related features from the molecule features to the empty query tokens. In the Peak Property Prediction Module, the resulting peak-related features are simultaneously fed into three specific task heads: the peak-number head, the peak-position head, and the peak-height head to predict the peak properties. Finally, the ECD Rendering Module reconstructs the ECD spectra from the peak properties employing mathematical simulation methods. We further introduce more details about the Feature Extraction Module, Peak Property Learning Module, Peak Property Prediction Module, and ECD Rendering Module in the following subsections.

3.3 Molecular Feature Extraction Module

As shown in Fig. 3, for the molecular feature extraction module, we apply the GeoGNN structure to encode molecular geometric features by modeling the atom-bond-angle corresponding relations. Compared with the traditional GNNs that only consider the atom-bond relationship, GeoGNN [16, 39] has a stronger ability in molecular representation modeling.

Specifically, for an input molecule $M$ , we denote its atom set as $\mathcal{V}$ , its bond set as $\mathcal{E}$ , and its bond-angle set as $\mathcal{A}$ . Then we introduce $M$ ’s atom-bond graph $G$ and bond-angle graph $H$ . The atom-bond graph is defined as $G=(\mathcal{V},\mathcal{E})$ , where atom $u\in\mathcal{V}$ is regarded as the node of $G$ and bond $(u,v)\in\mathcal{E}$ as the edge of $G$ . Similarly, the bond-angle graph is defined as $H=(\mathcal{E},\mathcal{A})$ , where bond $(u,v)\in\mathcal{E}$ is regarded as the node of $H$ and bond angle $(u,v,w)\in\mathcal{A}$ as the edge of $H$ . Both the atom-bond graph and the bond-angle graph are input into the GeoGNN for further feature extraction.
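The two graphs can be built from a plain bond list, as in the sketch below. The function `build_graphs` and the dictionary layout are hypothetical, and the features attached to atoms, bonds, and angles are omitted.

```python
from itertools import combinations

def build_graphs(num_atoms, bonds):
    """Build the atom-bond graph G and the bond-angle graph H from a bond list."""
    bonds = [tuple(sorted(b)) for b in bonds]
    G = {"nodes": list(range(num_atoms)), "edges": bonds}

    # Collect the neighbors of every atom
    neighbors = {u: [] for u in range(num_atoms)}
    for u, v in bonds:
        neighbors[u].append(v)
        neighbors[v].append(u)

    # Every pair of bonds sharing a central atom defines one bond angle,
    # which becomes an edge of H between those two bonds
    angles = []
    for center, nbrs in neighbors.items():
        for a, b in combinations(sorted(nbrs), 2):
            angles.append((a, center, b))

    H = {"nodes": bonds, "edges": angles}
    return G, H

# Hypothetical skeleton: atom 1 is bonded to atoms 0, 2, and 3
G, H = build_graphs(4, [(0, 1), (1, 2), (1, 3)])
```

In this example the three bonds around atom 1 give three bond angles, so $H$ has three nodes and three edges.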

Then, our feature extraction module learns the representation of atoms and bonds iteratively. For the k-th iteration, we use ${\rm\textbf{h}}_{u}$ and ${\rm\textbf{h}}_{uv}$ as the representation of atom $u$ and bond $(u,v)$ . To achieve information aggregation between the atom-bond graph $G$ and the bond-angle graph $H$ , the representation vectors of the bonds are taken as the information link between $G$ and $H$ . Specifically, the iteration of our feature extraction module contains two stages:

In the first stage, the bonds’ representation vectors are learned by aggregating messages from the neighboring bonds and corresponding bond angles in the bond–angle graph $H$ . Given bond $(u,v)$ , in $k$ -th iteration, its representation ${\rm\textbf{h}}_{uv}^{(k)}$ is formalized by:

	$\displaystyle{\rm\textbf{a}}_{uv}^{(k)}=\mathcal{F}_{bond-angle}^{(k)}(\{({\rm\textbf{h}}_{uv}^{(k-1)},{\rm\textbf{h}}_{uw}^{(k-1)},{\rm\textbf{x}}_{wuv}):w\in\mathcal{N}(u)\}\cup\{({\rm\textbf{h}}_{uv}^{(k-1)},{\rm\textbf{h}}_{vw}^{(k-1)},{\rm\textbf{x}}_{uvw}):w\in\mathcal{N}(v)\}),$		(2)
	$\displaystyle{\rm\textbf{h}}_{uv}^{(k)}=\mathcal{W}_{s}*{\rm\textbf{h}}_{uv}^{(k-1)}+{\rm\textbf{a}}_{uv}^{(k)},$		(3)

where $\mathcal{N}(u)$ and $\mathcal{N}(v)$ are the neighbor atoms of $u$ and $v$ . $\{(u,w):w\in\mathcal{N}(u)\}\cup\{(v,w):w\in\mathcal{N}(v)\}$ are the neighbor bonds of bond $(u,v)$ . $\mathcal{F}_{bond-angle}$ is an MLP with two linear layers, acting as the message aggregation function. ${\rm\textbf{a}}_{uv}$ is bond $(u,v)$ ’s aggregated feature from neighbor bonds, computed in Eq. 2. Then, the bond $(u,v)$ ’s representation vector is updated from ${\rm\textbf{a}}_{uv}$ according to Eq. 3.

In the second stage, with the updated bond representation in $H$ , we further learn the atoms’ representation by aggregating messages from the neighboring atoms and the corresponding bond representations from $H$ . Given an atom $u$ , its representation ${\rm\textbf{h}}_{u}^{(k)}$ in the $k$ -th iteration is formalized as:

	$\displaystyle{\rm\textbf{a}}_{u}^{(k)}=\mathcal{F}_{atom-bond}^{(k)}(\{({\rm\textbf{h}}_{u}^{(k-1)},{\rm\textbf{h}}_{v}^{(k-1)},{\rm\textbf{h}}_{uv}^{(k-1)}):v\in\mathcal{N}(u)\}),$		(4)
	$\displaystyle{\rm\textbf{h}}_{u}^{(k)}=\mathcal{W}_{s}*{\rm\textbf{h}}_{u}^{(k-1)}+{\rm\textbf{a}}_{u}^{(k)},$		(5)

where $\mathcal{N}(u)$ represents the neighbor atoms of atom $u$ . $\mathcal{F}_{atom-bond}$ is an MLP with two linear layers, acting as the message aggregation function. ${\rm\textbf{a}}_{u}$ is the atom $u$ ’s aggregated feature from neighbor atoms, computed in Eq. 4. The representation of $u$ is updated from ${\rm\textbf{a}}_{u}$ according to Eq. 5.

After all iterations, we calculate the molecular global representation ${\rm\textbf{h}}_{G}$ by summarizing and pooling over the atoms’ representation $\{{\rm\textbf{h}}_{u}\},\forall u\in\mathcal{V}$ . We further take the input molecule $M$ ’s global representation ${\rm\textbf{h}}_{G}$ and atom representations $\{{\rm\textbf{h}}_{u}\}$ as the input of further modules.
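One iteration of the two-stage update followed by the sum-pooling readout can be sketched in plain Python. This is an illustration under loud assumptions: the MLPs $\mathcal{F}_{bond-angle}$ and $\mathcal{F}_{atom-bond}$ are replaced by simple sums, $\mathcal{W}_{s}$ is a scalar, the angle feature is a single number, and all function names are hypothetical.

```python
def add(a, b):
    return [x + y for x, y in zip(a, b)]

def scale(a, s):
    return [s * x for x in a]

def update_bonds(h_bond, angles, w_s=1.0):
    """Stage 1, Eqs. (2)-(3): bonds aggregate over the bond-angle graph H.

    angles maps a pair of bonds sharing an atom to a scalar angle feature.
    A sum of angle-weighted neighbor features stands in for F_bond-angle.
    """
    updated = {}
    for bond, h in h_bond.items():
        a = [0.0] * len(h)
        for (b1, b2), x in angles.items():
            if bond in (b1, b2):
                other = b2 if bond == b1 else b1
                a = add(a, scale(h_bond[other], x))
        updated[bond] = add(scale(h, w_s), a)
    return updated

def update_atoms(h_atom, h_bond, w_s=1.0):
    """Stage 2, Eqs. (4)-(5): atoms aggregate neighbor atoms and connecting bonds.

    A plain sum stands in for F_atom-bond.
    """
    updated = {}
    for u, h in h_atom.items():
        a = [0.0] * len(h)
        for (p, q), hb in h_bond.items():
            if u in (p, q):
                v = q if u == p else p
                a = add(a, add(h_atom[v], hb))
        updated[u] = add(scale(h, w_s), a)
    return updated

def readout(h_atom):
    """Sum pooling over atoms gives the global representation h_G."""
    total = [0.0] * len(next(iter(h_atom.values())))
    for h in h_atom.values():
        total = add(total, h)
    return total

# Hypothetical three-atom chain 0-1-2 with 2-dimensional features
h_atom = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
h_bond = {(0, 1): [0.5, 0.5], (1, 2): [1.0, 0.0]}
angles = {((0, 1), (1, 2)): 2.0}

new_bond = update_bonds(h_bond, angles)
new_atom = update_atoms(h_atom, h_bond)  # Eq. (4) as written uses the (k-1) bond features
h_G = readout(new_atom)
```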

3.4 Peak Property Learning Module

Our peak property learning module aims to fuse the atom representation features and extract the key features with peak property information. We apply the transformer encoder as the fusion model due to its powerful feature fusion capability enabled by its cross-attention structure. Specifically, for molecule $M$ , we first randomly initialize a set of tokens $\{{\rm\textbf{Q}}_{i}\}_{i=1}^{n}$ as the peak tokens for $M$ . Then we combine the peak tokens $\{{\rm\textbf{Q}}_{i}\}$ with $M$ ’s global feature ${\rm\textbf{h}}_{G}$ and $M$ ’s atom features $\{{\rm\textbf{h}}_{u}\}$ as the input tokens for the transformer encoder, which is formalized as:

	$\displaystyle[{\rm\textbf{Q}}_{j},{\rm\textbf{h}}_{G,j},{\rm\textbf{h}}_{u,j}]={Layer}_{j}({\rm\textbf{Q}}_{j-1},{\rm\textbf{h}}_{G,j-1},{\rm\textbf{h}}_{u,j-1})$		(6)

where $Layer_{j}$ represents the $j$ -th transformer encoder layer. After the cross attention in $N$ transformer encoder layers, ${\rm\textbf{Q}}_{N}$ denotes the final representations of the peak tokens. We get the peak number $N_{p}$ from the molecular ground-truth ECD spectra $\{\mathcal{P}_{1:j}\}_{j=1}^{N_{p}}$ . Thus, we extract the first $N_{p}$ peak token features from $\{{\rm\textbf{Q}}_{i}\}_{i=1}^{n}$ as the peak property information of molecule $M$ .

3.5 Peak Property Prediction Module

Our peak property prediction module aims to reconstruct the peak property, including the peak number, peak symbol, and peak position, from the output features of the peak property learning module. For the peak number prediction, we apply the two-layer MLP to predict the peak number from the molecule global feature ${\rm\textbf{h}}_{G}$ , which is formalized as:

	$\displaystyle\mathcal{P}_{num}={\rm Linear}({\rm ReLU}({\rm Linear}({\rm\textbf{h}}_{G})))$		(7)

where $\mathcal{P}_{num}$ represents the peak number for the ECD spectrum of molecule $M$. For the peak symbol $\mathcal{P}_{symbol}$ and the peak position $\mathcal{P}_{pos}$, we apply two separate two-layer MLPs to predict $\mathcal{P}_{symbol}$ and $\mathcal{P}_{pos}$ from the corresponding peak token ${\rm\textbf{Q}}_{i}$ in $\{{\rm\textbf{Q}}_{i}\}_{i=1}^{n}$:

\displaystyle\mathcal{P}_{pos}={\rm Linear}({\rm ReLU}({\rm Linear}({\rm\textbf{Q}}_{i}))),\quad\mathcal{P}_{symbol}={\rm Linear}({\rm ReLU}({\rm Linear}({\rm\textbf{Q}}_{i})))

(8)

Here we predict all three peak properties that are vital for ECD spectra prediction.
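The three prediction heads in Eqs. (7) and (8) share the same Linear, ReLU, Linear structure and differ only in their input and output. The sketch below illustrates that structure in plain Python. It rests on several assumptions that are not taken from the text: each head is treated as a classifier whose output is a score vector, the sizes `dim`, `hidden`, `max_peaks`, and `n_bins` are toy values, and the weights are random rather than trained.

```python
import random

def make_linear(in_dim, out_dim, rng):
    """Return (weights, bias) for a dense layer with small random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim

def linear(x, layer):
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def mlp2(x, layer1, layer2):
    """Two-layer MLP: Linear -> ReLU -> Linear."""
    return linear(relu(linear(x, layer1)), layer2)

def argmax(scores):
    return max(range(len(scores)), key=scores.__getitem__)

rng = random.Random(0)
dim, hidden = 8, 16          # toy feature sizes (assumption)
max_peaks, n_bins = 9, 20    # hypothetical numbers of output classes

num_head = (make_linear(dim, hidden, rng), make_linear(hidden, max_peaks, rng))
pos_head = (make_linear(dim, hidden, rng), make_linear(hidden, n_bins, rng))
sym_head = (make_linear(dim, hidden, rng), make_linear(hidden, 2, rng))

h_G = [rng.gauss(0.0, 1.0) for _ in range(dim)]                      # global feature
Q = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(4)]    # peak tokens

n_peaks = argmax(mlp2(h_G, *num_head))                 # peak number from h_G
positions = [argmax(mlp2(q, *pos_head)) for q in Q]    # one position per peak token
symbols = [argmax(mlp2(q, *sym_head)) for q in Q]      # one symbol per peak token
```

The peak number is read from the global feature alone, whereas position and symbol are predicted once per peak token, so the two per-peak lists always have the same length as the token set.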

3.6 ECD Spectra Rendering Module

The final module, the ECD spectra rendering module, renders the predicted ECD spectra from the abstract peak properties. We employ a Gaussian distribution model to fit the spectral curve. Specifically, given the position $l_{p}$ and the corresponding height $l_{h}$ of a peak, we define a Gaussian distribution with mean $\mu=l_{p}$ and standard deviation $\sigma=l_{h}$. We then take the curve over the range $[\mu-6\sigma,\mu+6\sigma]$ as the fitted curve for the peak. We render the predicted ECD spectrum for molecule $M$ by combining the fitted curves of all predicted peaks.
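The rendering step can be illustrated with a short Python sketch. It follows the description above (one Gaussian per peak, truncated to $[\mu-6\sigma,\mu+6\sigma]$), but several details are assumptions of this sketch rather than statements from the text: each curve has unit amplitude, the peak symbol enters as a multiplicative sign, "combining" is implemented as summation on a shared wavelength grid, and the grid and peak values are toy numbers.

```python
import math

def render_peak(mu, sigma, sign, wavelengths):
    """Gaussian curve centered at mu, truncated to [mu - 6*sigma, mu + 6*sigma]."""
    curve = []
    for w in wavelengths:
        if abs(w - mu) <= 6 * sigma:
            curve.append(sign * math.exp(-0.5 * ((w - mu) / sigma) ** 2))
        else:
            curve.append(0.0)
    return curve

def render_spectrum(peaks, wavelengths):
    """Sum the truncated Gaussian curves of all peaks (summation is an assumption)."""
    spectrum = [0.0] * len(wavelengths)
    for mu, sigma, sign in peaks:
        for i, value in enumerate(render_peak(mu, sigma, sign, wavelengths)):
            spectrum[i] += value
    return spectrum

wavelengths = [float(w) for w in range(180, 401)]   # hypothetical 1 nm grid
peaks = [(220.0, 5.0, +1), (260.0, 8.0, -1)]        # (position, sigma, sign), toy values
spectrum = render_spectrum(peaks, wavelengths)
```

With these toy values the rendered curve has a positive lobe near 220 nm and a negative lobe near 260 nm, and it is exactly zero wherever no peak's $6\sigma$ window reaches.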

3.7 Experimental Settings and Training Hyperparameters

In the molecular feature extraction module, we set the number of GINConv layers in GeoGNN to $5$ and use $summation$ as the graph pooling strategy. The embedding dimension of molecular features is $128$ and the batch size for ECDFormer is $256$. We apply the AdamW [40] optimizer implemented in PyTorch with a learning rate of $1\times 10^{-3}$. For better convergence, we apply the $StepLR$ scheduler with a decay factor of $0.25$ to adjust the learning rate. For training, the CMCDS dataset is randomly divided into train/valid/test splits with a $90/5/5$ ratio. ECDFormer is trained for 1000 epochs, and the checkpoint with the best validation performance is used for testing. For the other deep-learning baselines, we use a learning rate of $5\times 10^{-4}$ and 1500 epochs, while the other parameters are the same as for ECDFormer.
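To make the schedule and the data split concrete, the following sketch reproduces the step-decay rule used by StepLR-style schedulers, $lr = lr_{0}\cdot\gamma^{\lfloor epoch/step\rfloor}$, and a random 90/5/5 split. The step size of 200 epochs and the random seed are hypothetical, since the text does not state them. The helper names are illustrative only.

```python
import random

def step_lr(base_lr, gamma, step_size, epoch):
    """Learning rate at `epoch` under a step-decay schedule."""
    return base_lr * gamma ** (epoch // step_size)

def split_indices(n_items, seed=0):
    """Randomly split indices into 90/5/5 train/valid/test subsets."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_train = n_items * 90 // 100
    n_valid = n_items * 5 // 100
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]
    return train, valid, test

# Base learning rate 1e-3 and decay factor 0.25 come from the text;
# step_size=200 is an assumption made only for this illustration.
lr_start = step_lr(1e-3, 0.25, 200, 0)
lr_later = step_lr(1e-3, 0.25, 200, 400)
train, valid, test = split_indices(1000)
```

Under these assumed settings the learning rate falls from $1\times 10^{-3}$ to $6.25\times 10^{-5}$ after two decay steps, and a dataset of 1000 items is split into 900, 50, and 50 items.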

Data Availability

The first large-scale ECD spectra dataset for chiral molecules, the CMCDS dataset, has been deposited in the GitHub repository, my-dataset-link.

Code Availability

All code used in data analysis and preparation of the manuscript, alongside a description of necessary steps for reproducing results, can be found in a GitHub repository accompanying this manuscript: my-github-link.

References

[1] Noyori, R. Asymmetric catalysis: science and opportunities (nobel lecture). Angewandte Chemie International Edition 41, 2008–2022 (2002).
[2] List, B. & MacMillan, D. The 2021 nobel prize in chemistry: asymmetric catalysis with small organic molecules. Current Science 121, 1148 (2021).
[3] Amabilino, D. B. & Veciana, J. Supramolecular chiral functional materials. Supramolecular Chirality 253–302 (2006).
[4] Shen, B., Kim, Y. & Lee, M. Supramolecular chiral 2d materials and emerging functions. Advanced Materials 32, 1905669 (2020).
[5] Teng, Y. et al. Advances and applications of chiral resolution in pharmaceutical field. Chirality 34, 1094–1119 (2022).
[6] Lininger, A. et al. Chirality in light–matter interaction. Advanced Materials 35, 2107325 (2023).
[7] Evers, F. et al. Theory of chirality induced spin selectivity: Progress and challenges. Advanced Materials 34, 2106629 (2022).
[8] Zhang, W. et al. Great concern for chiral pharmaceuticals from the thalidomide tragedy. Univ. Chem 34, 1–12 (2019).
[9] Ebeling, D. et al. Assigning the absolute configuration of single aliphatic molecules by visual inspection. Nature communications 9, 2420 (2018).
[10] Menna, M., Imperatore, C., Mangoni, A., Della Sala, G. & Taglialatela-Scafati, O. Challenges in the configuration assignment of natural products. a case-selective perspective. Natural product reports 36, 476–489 (2019).
[11] Junior, F. M. d. S. & Junior, J. M. B. Absolute configuration from chiroptical spectroscopy. Chiral Separations and Stereochemical Elucidation: Fundamentals, Methods, and Applications 551–591 (2023).
[12] Janet, J. P. & Kulik, H. J. Machine Learning in chemistry, vol. 1 (American Chemical Society, 2020).
[13] de Almeida, A. F., Moreira, R. & Rodrigues, T. Synthetic organic chemistry driven by artificial intelligence. Nature Reviews Chemistry 3, 589–604 (2019).
[14] Hermann, J. et al. Ab-initio quantum chemistry with neural-network wavefunctions. arXiv preprint arXiv:2208.12590 (2022).
[15] Xu, H., Lin, J., Zhang, D. & Mo, F. Retention time prediction for chromatographic enantioseparation by quantile geometry-enhanced graph neural network. Nature Communications 14, 3095 (2023).
[16] Fang, X. et al. Geometry-enhanced molecular representation learning for property prediction. Nature Machine Intelligence 4, 127–134 (2022).
[17] Vaswani, A. et al. Attention is all you need. Advances in neural information processing systems 30 (2017).
[18] Frisch, M. J. et al. Gaussian 16 Revision B.01 (2016). Gaussian Inc. Wallingford CT.
[19] Stephens, P. J., Devlin, F. J., Chabalowski, C. F. & Frisch, M. J. Ab initio calculation of vibrational absorption and circular dichroism spectra using density functional force fields. The Journal of physical chemistry 98, 11623–11627 (1994).
[20] Yanai, T., Tew, D. P. & Handy, N. C. A new hybrid exchange-correlation functional using the coulomb-attenuating method (cam-b3lyp). Chemical physics letters 393, 51–57 (2004).
[21] Zou, Z. et al. A deep learning model for predicting selected organic molecular spectra. Nature Computational Science 1–8 (2023).
[22] Rogers, D. M. et al. Electronic circular dichroism spectroscopy of proteins. Chem 5, 2751–2774 (2019).
[23] Mao, A., Mohri, M. & Zhong, Y. Cross-entropy loss functions: Theoretical analysis and applications. arXiv preprint arXiv:2304.07288 (2023).
[24] Nagy, G., Igaev, M., Jones, N. C., Hoffmann, S. V. & Grubmuller, H. Sesca: predicting circular dichroism spectra from protein molecular structures. Journal of chemical theory and computation 15, 5087–5102 (2019).
[25] Micsonai, A., Bulyáki, É. & Kardos, J. Bestsel: from secondary structure analysis to protein fold prediction by circular dichroism spectroscopy. Structural Genomics: General Applications 175–189 (2021).
[26] Zhao, L. et al. Accurate machine learning prediction of protein circular dichroism spectra with embedded density descriptors. JACS Au 1, 2377–2384 (2021).
[27] Artrith, N. et al. Best practices in machine learning for chemistry. Nature chemistry 13, 505–508 (2021).
[28] Wei, J. et al. Machine learning in materials science. InfoMat 1, 338–358 (2019).
[29] Shi, X. et al. Convolutional lstm network: A machine learning approach for precipitation nowcasting. Advances in neural information processing systems 28 (2015).
[30] Chung, J., Gulcehre, C., Cho, K. & Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014).
[31] Xu, W.-F. et al. 17-hydroxybrevianamide n and its n1-methyl derivative, quinazolinones from a soft-coral-derived aspergillus sp. fungus: 13 s enantiomers as the true natural products. Journal of Natural Products 84, 1353–1358 (2021).
[32] Tu, W.-C. et al. Wulfenioidins d–n, structurally diverse diterpenoids with anti-zika virus activity isolated from orthosiphon wulfenioides. Journal of Natural Products 86, 2348–2359 (2023).
[33] Lam, Y. T. et al. Purpurascenines a–c, azepino-indole alkaloids from cortinarius purpurascens: Isolation, biosynthesis, and activity studies on the 5-ht2a receptor. Journal of Natural Products (2023).
[34] Liu, F. et al. Anti-inflammatory quinoline alkaloids from the roots of waltheria indica. Journal of Natural Products 86, 276–289 (2023).
[35] Zhang, S., Tong, H., Xu, J. & Maciejewski, R. Graph convolutional networks: a comprehensive review. Computational Social Networks 6, 1–23 (2019).
[36] Mahmood, O., Mansimov, E., Bonneau, R. & Cho, K. Masked graph modeling for molecule generation. Nature communications 12, 3156 (2021).
[37] Zhong, W., Yang, Z. & Chen, C. Y.-C. Retrosynthesis prediction using an end-to-end graph generative architecture for molecular graph editing. Nature Communications 14, 3009 (2023).
[38] Han, K. et al. A survey on vision transformer. IEEE transactions on pattern analysis and machine intelligence 45, 87–110 (2022).
[39] Peng, Y. et al. Enhanced graph isomorphism network for molecular admet properties prediction. IEEE Access 8, 168344–168360 (2020).
[40] Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[41] O'Boyle, N., Vandermeersch, T. & Hutchison, G. Confab-generation of diverse low energy conformers. Journal of Cheminformatics (2011).

Author contributions statement

H.L., D.L., X.W., and F.M. conceived the basic idea and designed the research study. D.L. and F.M. generated the ECD spectra dataset using DFT calculation. H.L. developed the method. L.Y. and F.M. further modified the method. D.L. and X.W. conceived the evaluation metric in the experiment. H.L. and D.L. evaluated the performance on the CMCDS dataset and natural product molecules. H.L. and D.L. wrote the manuscript. L.Y., Y.T., X.W., and F.M. revised the manuscript. Y.L. and Y.T. provided the deep learning computing platform.

Supporting information

S1 Statistical results for the CMCDS dataset

For all molecules in the CMCDS dataset, we visualized their property distributions by counting the number of atoms in each molecule, the number of chemical bonds, and the number of peaks in the corresponding ECD spectra. All chiral molecules in the CMCDS dataset have a single chiral center, which limits their complexity. Most of these molecules consist of approximately 60 atoms, and the largest molecule does not exceed 200 atoms (Fig. 7(a)). Most of them possess around 25 chemical bonds, with a maximum of 65 bonds (Fig. 7(b)). The ECD spectra of these molecules typically exhibit 3 to 4 peaks, with a maximum of 8 peaks (Fig. 7(c)).

#	Method	Molecule-Type	Position-RMSE (nm) $\downarrow$	Number-RMSE $\downarrow$	Symbol-Acc. (%) $\uparrow$
1	ECDFormer	Single-Chiral-Center	2.29	1.24	72.7
2	ECDFormer	Multi-Chiral-Center	2.88	1.76	63.1

Table 2: ECD prediction performance on multi-chiral-centered molecules, compared with the performance on single-chiral-centered molecules. ECDFormer suffers only a slight performance decrease when predicting multi-chiral-centered molecules, demonstrating its generalization ability.

S2 The Generalization Ability on Multi-Chiral-Centered Molecules

Our ECDFormer is trained on the CMCDS dataset, where all molecules have a single chiral carbon center. In Table 1, our ECDFormer achieves outstanding ECD spectra prediction performance for single-chiral-centered molecules. To evaluate the generalization ability of our ECDFormer, we further test its performance on multi-chiral-centered molecules, which are more complex in both molecular structure and ECD spectra. Specifically, we gather a small group of multi-chiral-centered molecules with their ECD spectra as our test dataset and evaluate ECDFormer on it. As shown in Table 2, ECDFormer suffers only a slight performance decrease when predicting multi-chiral-centered molecules, demonstrating its generalization ability. The good performance of ECDFormer on both single-chiral-centered and multi-chiral-centered molecules further validates its strong applicability.
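As a rough guide to reading the columns of Table 2, the sketch below computes a root-mean-square error and a percentage accuracy on toy values. It is only an illustration under assumptions: predicted and ground-truth peaks are assumed to be already paired index by index, symbols are encoded as $+1$ and $-1$, and all numbers are invented for the example. The evaluation protocol actually used for the reported results may differ.

```python
import math

def rmse(pred, true):
    """Root-mean-square error between paired predictions and targets."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))

def accuracy_percent(pred, true):
    """Share of paired predictions that match the target exactly, in percent."""
    return 100.0 * sum(p == t for p, t in zip(pred, true)) / len(pred)

# Toy values only: peak counts per molecule, peak positions in nm, peak symbols.
pred_counts, true_counts = [3, 4, 2], [3, 5, 4]
pred_pos, true_pos = [210.0, 250.0], [212.0, 249.0]
pred_sym, true_sym = [1, -1, 1, 1], [1, -1, -1, 1]

number_rmse = rmse(pred_counts, true_counts)
position_rmse = rmse(pred_pos, true_pos)
symbol_acc = accuracy_percent(pred_sym, true_sym)
```

Lower is better for both RMSE columns and higher is better for the accuracy column, which matches the arrows in the table header.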

S3 The Chemical Interpretability Analysis of ECDFormer

To comprehensively assess the performance of the entire deep learning model, we analyzed the chemical interpretability of the ECDFormer model. By visualizing all the predicted cases in the test split of CMCDS, we found that the spectral similarity among the conformations of a molecule can affect the predictions of ECDFormer.

Specifically, we selected molecules whose predictions were perfect matches and molecules whose predictions were completely wrong, ran conformational searches [41] on them, and then calculated the ECD spectrum of each conformation. As shown in Fig. 8, in the well-predicted cases, the ECD spectra of the individual conformations showed only minor differences. In contrast, in most of the poorly predicted cases, the ECD spectra of the individual conformations differed significantly in peak shape and wavelength. This suggests that when the ECD spectra of the different conformations of a molecule are highly similar, the trained model predicts the ECD spectrum accurately, and vice versa. This phenomenon can be explained from the deep-learning perspective. For a molecule whose conformations have differently shaped ECD spectra, it is hard for deep-learning models to learn the latent features needed for prediction. For a molecule whose conformations have similarly shaped ECD spectra, the latent pattern is easier to learn, which improves the prediction performance. However, a small number of molecules (Excellent_4 and Bad_4 in Fig. 8) do not fit this pattern, owing to the uncertainty of the deep-learning method and the complexity of the chiral assignation field. We are currently investigating these exceptional cases further.