License: arXiv.org perpetual non-exclusive license
arXiv:2401.03403v1 [cs.CE] 07 Jan 2024
1 School of Electronic and Computer Engineering, Peking University, Shenzhen, China
2 AI for Science (AI4S)-Preferred Program, Peking University Shenzhen Graduate School, Shenzhen, China
3 State Key Laboratory of Physical Chemistry of Solid Surfaces, School of Electronic Science and Engineering, Innovation Laboratory for Science and Technologies of Energy Materials of Fujian Province (IKKEM) and College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, China
4 Peng Cheng Laboratory, Shenzhen, China
5 School of Materials Science and Engineering, Peking University, Beijing, China
* These authors contributed equally to this work
† Corresponding authors: yuanli-ece@pku.edu.cn, xcwang@xmu.edu.cn, fmo@pku.edu.cn

Deep peak property learning for efficient chiral molecules ECD spectra prediction

Hao Li Da Long Li Yuan Yonghong Tian Xinchang Wang Fanyang Mo
Abstract

Chiral molecule assignation is crucial for asymmetric catalysis, functional materials, and the drug industry. The conventional approach requires theoretical calculations of electronic circular dichroism (ECD) spectra, which are time-consuming and costly. To speed up this process, we incorporate deep learning techniques into ECD prediction. We first set up a large-scale dataset of Chiral Molecular ECD spectra (CMCDS) with calculated ECD spectra. We further develop ECDFormer, a Transformer-based model that learns chiral molecular representations and predicts the corresponding ECD spectra with improved efficiency and accuracy. Unlike other models for spectrum prediction, ECDFormer focuses on peak properties rather than the whole spectrum sequence, inspired by the scenario of chiral molecule assignation. Specifically, ECDFormer predicts the peak properties, including number, position, and symbol, and then renders the ECD spectra from these peak properties. This approach significantly outperforms other models in ECD prediction. ECDFormer reduces the time needed to acquire ECD spectra from 1-100 hours per molecule to 1.5 s.

Keywords

Chiral Molecule Assignation, ECD Spectra Prediction, Deep Learning.

A chiral molecule has a spatial arrangement that cannot be superimposed onto its mirror image, resulting in non-identical left-handed and right-handed forms. Chirality is ubiquitous in chemistry and biology and plays a crucial role in various fields such as asymmetric catalysis [1, 2], functional materials [3, 4], drug discovery [5], and other related areas [6, 7]. In drug discovery in particular, the activity of a drug often depends on its absolute configuration. A well-known chiral drug is thalidomide (Fig. 1(a)), which was previously used as an antiemetic drug for morning sickness [8] in the form of enantiomeric pairs. However, one of its chiral configurations (R-type) is safe, while the other (S-type) induces severe teratogenic effects. Thus, assigning the absolute configuration of chiral molecules has always been at the center of chirality-related research.

There are traditional approaches for discerning the chiral configuration of a molecule with a single chiral carbon, including electronic circular dichroism (ECD) spectroscopy, nuclear magnetic resonance spectroscopy, and X-ray single-crystal diffraction methods [9, 10]. Among these methods, ECD spectroscopy is the most efficient and reliable for determining the absolute configuration of chiral molecules. However, the procedure is still laborious and time-consuming: it includes chiral separation of isomers, measurement of experimental CD spectra, computation of the theoretical ECD spectra through quantum chemical calculations, and comparison of the experimental and theoretical ECD spectra to reach a conclusive identification of the absolute configuration. Specifically, this comparison focuses on the wavelength, the signs of the Cotton effects (positive or negative peaks), the intensity of peaks, and their agreement between experimental and calculated spectra [11].

Figure 1: The scheme for ECD prediction and chiral molecule assignation. a Thalidomide has two configurations (R/S). R-Thalidomide induces sedative effects, whereas S-Thalidomide is associated with teratogenic effects. b ECD comparison is most frequently employed for assigning the absolute configuration. However, the theoretical calculation of ECD is time-consuming, involving steps such as conformational searching, conformational optimization, excited-state property calculation, and Boltzmann weighting, so we employ deep learning for acceleration. c As molecules become more complex, the computation time increases. The CPU used is an Intel Xeon E5-2640 v4 @ 2.40 GHz.

For experimental chemists, the theoretical calculation of ECD spectra stands out as the most time-consuming and technically demanding of the aforementioned steps. As shown in Fig. 1(b), the computation of ECD spectra for a chiral molecule entails multiple stages. Initially, a molecular structure model is drawn, followed by molecular dynamics simulations to explore various energetically favorable conformations. Subsequently, these conformations undergo individual structure optimization and energy calculations at the density functional theory (DFT) level of precision. Then the ECD spectra of the molecules are computed with time-dependent DFT (TD-DFT) calculations. The final calculated ECD spectrum is generated by combining the individual ECD spectra of different conformations, weighted by their Boltzmann probabilities. This requires experimental chemists to be proficient with specialized tools, such as molecular dynamics and DFT calculations. Moreover, the computational demands and time requirements of this process are substantial, making it the rate-determining step in the assignment of chiral absolute configurations. This raises an open question: “Can we speed up the theoretical calculation of ECD spectra?”
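The Boltzmann-weighting step described above can be illustrated with a minimal sketch. It assumes relative conformer energies in kcal/mol, a temperature of 298.15 K, and per-conformer spectra already sampled on a shared wavelength grid; the function names are hypothetical and are not taken from the authors' pipeline.

```python
import math

R_KCAL = 0.0019872041  # gas constant in kcal/(mol*K)

def boltzmann_weights(energies_kcal, temperature=298.15):
    """Return normalized Boltzmann populations for conformer energies."""
    e_min = min(energies_kcal)  # shift by the minimum for numerical stability
    factors = [math.exp(-(e - e_min) / (R_KCAL * temperature))
               for e in energies_kcal]
    total = sum(factors)
    return [f / total for f in factors]

def average_spectrum(spectra, energies_kcal, temperature=298.15):
    """Population-weighted average of per-conformer spectra on a shared grid."""
    weights = boltzmann_weights(energies_kcal, temperature)
    n_points = len(spectra[0])
    return [sum(w * s[i] for w, s in zip(weights, spectra))
            for i in range(n_points)]
```

Conformers with equal energies receive equal weights, and a conformer a few kcal/mol above the minimum contributes very little to the averaged spectrum.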

In recent years, statistical tools based on machine learning have been integrated into chemistry research workflows [12]. This integration is enabling researchers to analyze vast datasets with greater precision and discover intricate patterns and relationships that were previously undetectable, significantly enhancing the efficiency and effectiveness of chemical research and innovation [13, 14]. Large and high-quality datasets are essential for the effectiveness of machine learning methods. We first need to have a library of chiral molecules. Fortunately, we have constructed a library of 25000+ chiral molecules (Chiral Molecules Retention Time Dataset, CMRT) in our previous work, which introduced a machine learning framework to enhance the efficiency of chromatographic enantioseparation in experimental chemistry [15]. Based on the CMRT dataset, a Chiral Molecular CD Spectra Dataset (CMCDS) was generated by selecting chiral molecules from CMRT and calculating their ECD spectra. To the best of our knowledge, CMCDS is the first large-scale dataset for ECD spectra prediction.

With the CMCDS dataset, we further construct ECDFormer, a deep-learning model to speed up the prediction of ECD spectra for chiral molecules. Inspired by the chemical assignation scenario, which focuses on peak properties in the ECD spectra, ECDFormer introduces a peak property prediction module and renders the ECD spectra from peak properties rather than predicting the spectra directly. For the input molecule, ECDFormer feeds its atom, bond, and angle features, together with molecular descriptors, into the GeoGNN structure [16] to learn the molecular representation. For the peak property learning module, we apply the Transformer encoder [17] to learn the peak property features from the molecular representation. We then predict the peak number, position (wavelength), and symbol (the sign of the Cotton effect) from the property features and render them into ECD spectra as the prediction of the theoretical ECD spectra.

The quantitative experimental results demonstrate the accuracy and efficiency of our ECDFormer compared with other baselines that directly predict the whole ECD spectra. The visualizations show that ECDFormer predicts correct ECD spectra for molecules in CMCDS as well as the natural molecules with pharmaceutical effects. Our model not only advances research in chiral chemistry but also has potential applications in asymmetric synthesis and facilitates high-throughput screening of chiral drug molecules in the pharmaceutical development field. Our contribution can be summarized as follows:

  • The ECD spectra calculation for chiral molecular assignation is crucial yet time-consuming for chemists. A deep-learning model, ECDFormer, was proposed to predict the ECD spectra and improve the assignation efficiency. Inspired by the assignation procedure in chemistry, ECDFormer focuses on peak prediction and renders peaks into the ECD spectra.

  • We proposed a large-scale dataset, CMCDS, for the ECD prediction task. CMCDS containing ECD spectra for 22,190 chiral molecules was produced utilizing substantial computational power.

  • Experimental results demonstrate the accuracy and efficiency of ECDFormer on the CMCDS dataset. ECDFormer also predicts correct ECD spectra for the natural product molecules that have pharmaceutical effects.

1 Results

1.1 Construction of the CMCDS dataset

As shown in Fig. 2, the CMCDS dataset is mainly built by large-scale theoretical calculations and consists of the ECD spectra and SMILES sequences of 22,190 chiral molecules. The ECD spectral data of all molecules were calculated with the Gaussian 16 A.03 package [18]. Our chiral molecules were mainly crawled from the literature on asymmetric catalysis, and we transformed the SMILES files of the molecules into MOL files with the help of the RDKit package to obtain the 3D atomic coordinates of the molecules. These MOL files were converted into Gaussian input (gjf) files in batches through Python. Then the molecular structure was optimized at the B3LYP [19]/6-31G level. Furthermore, we conducted the electronic circular dichroism calculation at the CAM-B3LYP [20]/6-31G(d) level, setting the number of states (nstates) to 20. We fix the half-peak width at 0.3 and apply Gaussian broadening, utilizing the energies and wavelengths derived from these 20 excited states. The ECD spectra of all molecules were acquired in the same way, and we used Python for batch data processing.
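The Gaussian broadening step can be sketched as follows. This is only an illustration under stated assumptions: the text does not give the unit of the 0.3 half-peak width, so the sketch assumes eV and uses it directly as the Gaussian width parameter; intensities are in arbitrary units with no physical prefactor; and `broaden` is a hypothetical helper, not the authors' processing script.

```python
import math

EV_NM = 1239.84193  # hc in eV*nm: wavelength_nm = EV_NM / energy_eV

def broaden(energies_ev, rot_strengths, wavelengths_nm, sigma_ev=0.3):
    """Sum one Gaussian band per excited state on a wavelength grid.

    energies_ev: excitation energies of the states (eV)
    rot_strengths: signed rotatory strengths of the states (arbitrary units)
    sigma_ev: assumed width parameter (the 0.3 in the text, unit assumed eV)
    """
    spectrum = []
    for lam in wavelengths_nm:
        e = EV_NM / lam  # convert the grid point from nm to eV
        value = sum(
            r * math.exp(-((e - e_k) / sigma_ev) ** 2)
            for e_k, r in zip(energies_ev, rot_strengths)
        )
        spectrum.append(value)
    return spectrum
```

A state with a negative rotatory strength produces a negative band, which is how the positive and negative Cotton effects arise in the broadened spectrum.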

1.2 Construction of the ECDFormer model

Fig. 3 shows the computational workflow of our ECDFormer model. The workflow takes the atom-bond-angle features and molecular descriptors as the features of the target molecule. ECDFormer contains four modules for ECD prediction: (i) the molecular feature extraction module, which obtains the chiral molecular representation with a geometry-enhanced graph neural network; (ii) the peak property learning module, which extracts the peak property features from the chiral molecular representation using a Transformer Encoder structure; (iii) the peak property prediction module, which predicts the peak properties, including number, position, and symbol, from the learned peak property features; and (iv) the ECD rendering module, which reconstructs the ECD spectra from the predicted peak properties.

Figure 2: The generation pipeline for our chiral molecular CD spectra dataset (CMCDS) for ECD prediction task.
Figure 3: The General Pipeline of our ECDFormer model. The design of the peak property learning and prediction modules is inspired by the chemical chiral assignation procedure. By predicting peak properties and rendering ECD spectra, ECDFormer outperforms baselines in the ECD spectra prediction task.

Molecular Electronic Circular Dichroism (ECD) spectra are characterized by the presence of positive and negative peaks as a result of the Cotton effect [21]. Compared to other spectra including protein ECD spectra [22] and molecular infrared spectra [21], molecular ECD spectra reveal significant morphological variations. This distinct feature makes traditional sequence prediction models (LSTM, GRU) less effective for ECD prediction by directly predicting the whole spectra. Chemists often concentrate on the symbols of peaks (indicating the direction of the Cotton effect) and their positions (related to the wavelengths of the peaks) in ECD spectra for determining chirality in molecules. To streamline the ECD prediction process, we focus on predicting essential ECD information such as the number of peaks, their positions, and symbols. Accordingly, the peak-focused loss function to support this approach is:

$$L(y^{true}, y^{pred}) = L_{ce}^{Num}(y^{true}, y^{pred}) + \left( L_{ce}^{Pos}(y^{true}, y^{pred}) + 2 \cdot L_{ce}^{Sym}(y^{true}, y^{pred}) \right) \tag{1}$$

where the $L_{ce}$ terms for peak number, position, and symbol are cross-entropy losses [23]. Because ECD spectra prediction emphasizes the positive and negative peaks, we slightly increase the loss weight for peak symbols to strengthen the model's symbol prediction.
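As a minimal sketch of how Eq. (1) combines its three terms, the snippet below computes the weighted sum from raw logits. It assumes that peak positions are discretized into classes (implied by the use of cross-entropy) and that the per-peak position and symbol losses are averaged over peaks; both choices, and the function names, are illustrative assumptions rather than the authors' implementation.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of one sample: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def peak_loss(num_logits, num_true, pos_logits, pos_true,
              sym_logits, sym_true, sym_weight=2.0):
    """Eq. (1): L = L_num + (L_pos + 2 * L_sym).

    pos_logits / sym_logits hold one logit vector per peak;
    averaging over peaks is an assumption of this sketch.
    """
    l_num = cross_entropy(num_logits, num_true)
    l_pos = sum(cross_entropy(l, t)
                for l, t in zip(pos_logits, pos_true)) / len(pos_true)
    l_sym = sum(cross_entropy(l, t)
                for l, t in zip(sym_logits, sym_true)) / len(sym_true)
    return l_num + (l_pos + sym_weight * l_sym)
```

With `sym_weight=2.0`, a wrong peak sign costs twice as much as an equally confident error in peak position, which matches the emphasis on Cotton-effect signs described in the text.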

1.3 Peak-specific Evaluation Metrics for the ECD Prediction Task

The ECD spectra of chemical molecules exhibit two distinct characteristics: (i) a high degree of shape diversity, and (ii) a strong reliance on peak attributes for chiral molecule identification. These characteristics differ significantly from those of protein ECD spectra, which makes the Root Mean Square Error (RMSE) evaluation metric used in protein ECD spectrum prediction tasks [24, 25, 26] inappropriate here. To better evaluate the quality of the ECD spectrum for the chiral molecular assignation task, we establish three evaluation metrics based on peak attributes of ECD spectra: (1) Number-RMSE: the RMSE of the peak number between ground-truth and predicted ECD spectra; (2) Position-RMSE: the RMSE of each peak’s position between ground-truth and predicted ECD spectra; (3) Symbol-Acc: the matching accuracy of the peaks’ symbols between ground-truth and predicted ECD spectra. These metrics provide a reasonable and comprehensive assessment of ECD spectrum prediction quality from different perspectives.
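The three metrics can be sketched in a few lines. The text does not say how predicted and ground-truth peaks are paired when their counts differ, so this sketch pairs peaks in order and truncates to the shorter list; that pairing rule and the function names are assumptions made for illustration.

```python
import math

def rmse(true_vals, pred_vals):
    """Root mean square error between two equally long sequences."""
    n = len(true_vals)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(true_vals, pred_vals)) / n)

def number_rmse(true_peaks, pred_peaks):
    """RMSE of peak counts; each molecule is a list of (position_nm, sign)."""
    return rmse([len(t) for t in true_peaks], [len(p) for p in pred_peaks])

def position_rmse(true_peaks, pred_peaks):
    """RMSE of peak positions over in-order peak pairs (assumed pairing)."""
    t_pos, p_pos = [], []
    for t, p in zip(true_peaks, pred_peaks):
        for (tp, _), (pp, _) in zip(t, p):
            t_pos.append(tp)
            p_pos.append(pp)
    return rmse(t_pos, p_pos)

def symbol_acc(true_peaks, pred_peaks):
    """Percentage of paired peaks whose signs agree."""
    pairs = [(ts, ps)
             for t, p in zip(true_peaks, pred_peaks)
             for (_, ts), (_, ps) in zip(t, p)]
    return 100.0 * sum(ts == ps for ts, ps in pairs) / len(pairs)
```

For example, a molecule with true peaks at 200 nm (+) and 250 nm (-) and predicted peaks at 203 nm (+) and 246 nm (+) has a peak-count error of 0, position errors of 3 nm and 4 nm, and one of two signs correct.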

1.4 Performance comparison on the CMCDS dataset

To comprehensively evaluate the performance of ECDFormer, we implemented two categories of baseline models: machine learning models and deep learning models. Table 1 shows that our model outperforms all of these baselines. The specific experimental analysis is provided below.

1.4.1 Comparison with machine learning baselines.

Machine learning models are commonly used as analytical tools in the fields of chemistry and materials science [27, 28]. We select three common models, including SGD Regressor, Passive Aggressive Regressor, and Logistic Regressor, as the baselines. Comparing rows 1-3 with row 10 in Table 1, the machine learning baselines perform unsatisfactorily, which is mainly attributed to the models’ inability to decouple spectral sequences from complex molecular structural features. This emphasizes the necessity of employing deep learning models to tackle the task of predicting ECD spectra for chiral molecules.

1.4.2 Comparison with deep learning baselines.

When data are abundant, deep learning models have shown excellent performance in complex task settings. With the CMCDS dataset, we implement sequence prediction deep learning models as our baselines, including LSTM [29], GRU [30], and Transformer Decoder [17]. Comparing rows 5-8 with row 10 in Table 1, ECDFormer, which predicts the peak properties of ECD spectra, significantly outperforms the other baselines. The results demonstrate the effectiveness of the peak property prediction module in ECDFormer. Comparing rows 6/8 with rows 7/9 shows that pretraining has little influence on the ECD prediction task, due to the lack of chiral molecular information during the pretraining stage.

#    Method               Initialization      Position-RMSE (nm) ↓   Number-RMSE ↓   Symbol-Acc. (%) ↑
                          (Rand / Pretrain)
Machine Learning Methods
1    Logistic-Regressor   -                   7.81                   7.22            47.8
2    SGD-Regressor        -                   6.44                   6.36            47.1
3    Aggr-Regressor       -                   5.97                   4.39            48.5
Deep Learning Methods
4    GeoGNN+Linear        -                   8.62                   2.87            51.9
5    GeoGNN+GRU           -                   6.47                   1.72            39.5
6    GeoGNN+LSTM          -                   5.91                   1.76            43.7
7    GeoGNN+LSTM          -                   4.68                   1.45            46.4
8    GeoGNN+Transformer   -                   4.69                   1.36            49.2
9    GeoGNN+Transformer   -                   5.82                   1.64            37.3
10   ECDFormer (ours)     -                   2.29                   1.24            72.7
Table 1: Performance on the ECD prediction task. We report the experimental results of our ECDFormer framework and the corresponding baselines, including machine learning models and deep learning models. By focusing on peak property prediction, ECDFormer surpasses the baselines under all evaluation metrics.
Figure 4: The performance comparison between ECDFormer and baselines for ECD prediction. a The data distribution plot for the ground-truth peak number and the predicted peak number. b The violin plot of the discrepancies in peak positions between ground-truth ECD and predicted ECD from ECDFormer and baselines. $N_v$ is the peak number, representing the difficulty of cases. c The violin plot of the discrepancies in peak symbols between ground-truth ECD and predicted ECD from ECDFormer and baselines.

1.5 The Analysis Visualization on Peak-specific ECD Evaluation Metrics

To better analyze the models’ performance, including our ECDFormer and other baselines, under three peak-specific evaluation metrics, we draw the analysis graphs for each evaluation metric in Fig. 4. The detailed analysis is as follows:

1.5.1 Peak Number Analysis

In Fig. 4(a), we analyze ECDFormer’s predictive capability regarding the peak number and demonstrate its excellent performance in predicting the peak number for complex spectra (Peak-Number $> 5$) compared to baseline models. The X-axis represents the ground-truth values of the peak number, while the Y-axis represents the predicted values. Therefore, the closer the data points are to the $y = x$ line, the better the predictive performance. The density of the data points is indicated by the size of the red circles, where a larger red circle represents a higher concentration of data points. Fig. 4(a) shows that for ECDFormer, the largest red circles all appear on the $y = x$ line, even for hard samples (Peak-Number $> 5$). The RMSE of peak number is 1.01, indicating the good performance of ECDFormer in peak number prediction.

1.5.2 Peak Position Analysis

In Fig. 4(b), we analyze the model’s peak position predictive capability. Specifically, we visualize violin plots of the position differences between predicted peaks and ground-truth peaks. To further visualize the performance from easy to hard cases, we split the test dataset based on the peak number $N_v$ of a molecule. Compared with baselines, for all cases from easy to hard, most ECDFormer predictions have zero difference from the ground truth, demonstrating its effectiveness.

1.5.3 Peak Symbol Analysis

In Fig. 4(c), similar to the peak position analysis, we further analyze the model’s peak symbol predictive capability. We visualize violin plots of the symbol differences between predicted peaks and ground-truth peaks. Compared with baselines, for all cases from easy to hard, most ECDFormer predictions have the same symbols as the ground truth, demonstrating its effectiveness.

1.6 The Visualization of ECD Spectra Prediction Cases

Our visualization contains two parts: (a) the ECD spectra of molecules in the test split of the CMCDS dataset, and (b) the ECD spectra of existing pharmaceutical molecules. Fig. 5 presents the visualization for the CMCDS test split, demonstrating that our model makes accurate predictions even for complex molecules of various structures. Fig. 6 shows the ECD predictions for existing pharmaceutical molecules. We first visualize the ECD predictions for the R/S types of hydroxybrevianamide [31], a natural product of an Aspergillus sp. fungus. Fig. 6 (top) shows that ECDFormer can successfully predict the ECD spectra for R-type and S-type molecule pairs. We also visualize the ECD predictions for other pharmaceutical molecules, including Wulfenioidins.L [32] (Anti-Zika Virus Effect), Purpurascenines.B [33] (Antagonist Effect), and Alkaloids [34] (Anti-inflammatory Effect). The ECDFormer predictions also match the theoretical ECD spectra of these complex natural products with pharmaceutical effects.

Figure 5: Visualization of ECD spectra predictions from ECDFormer. We visualize the ground-truth spectra and ECDFormer’s prediction spectra of the selected molecules from the test split of the CMCDS dataset.
Figure 6: ECD predictions on natural products with pharmaceutical effects. We select pharmaceutical products from recent journals. Visualizations show that ECDFormer can produce correct predictions for natural products and their R/S types.

2 Discussion

This study proposes a research framework for integrating deep learning techniques into the field of chemistry to improve the efficiency of researchers in acquiring the ECD spectra of chiral molecules. The proposed ECDFormer addresses several core issues, including data collection, 3D characterization of chiral molecules, and understanding of chirality. Firstly, because the ECD spectra of all molecules are calculated consistently, this study mainly employs Python scripts for batch processing and data generation, thus providing a standardized CMCDS dataset. Secondly, a specialized neural network, ECDFormer, was established, and experimental results showed that it can directly obtain ECD spectra from the SMILES of chiral molecules.

The experimental validation of the ECDFormer model demonstrates its capability and generalization ability in predicting ECD spectra for small organic molecules, including single-chiral-centered molecules and multi-chiral-centered molecules. However, there are areas for improvement in this study that could be addressed in future research. First, in compiling the extensive ECD spectral data, we bypassed the conformational search for each molecule to minimize time and cost, which may have introduced some inaccuracies in the spectral data. Additionally, the choice of basis set in DFT calculations limits the range of chiral molecules we can study, particularly excluding those containing elements heavier than iodine. Moreover, our focus was solely on molecules with a single chiral center, intentionally excluding those with multiple chiral centers. Despite these constraints, we remain optimistic about the ECDFormer model’s potential in accurately determining the absolute configuration of chiral molecules. The model offers a rapid way to acquire ECD spectra directly from the SMILES notation of the molecules.

3 Methods

3.1 Problem Definition and Preliminary for Electronic Circular Dichroism Prediction

We first briefly introduce the problem definition of the ECD prediction task for the convenience of description and discussion.

3.1.1 Electronic Circular Dichroism Prediction Task.

Generally, each chemical molecule has its electronic circular dichroism (ECD). For molecule $M_1$, we represent the ECD of $M_1$ as $\{\mathcal{S}_{1:i}\}_{i=1}^{N_w}$, where $\mathcal{S}_{1:i}$ is the input light wavelength from 80 to 450 nm, and $N_w$ is the ECD range from -200 mdeg to 200 mdeg. For $M_1$’s chiral-form molecule $\widetilde{M}_1$, we represent its ECD as $\{-\mathcal{S}_{1:i}\}_{i=1}^{N_w}$. When applying deep learning models to ECD prediction, a direct approach is to establish a site-level sequence prediction model that predicts every $\mathcal{S}_{1:i}$ of the ECD. However, in practice, the molecular representation lacks the knowledge to reconstruct the site-level ECD sequence. Thus, we simplify the ECD prediction task from the chemical perspective, focusing on the peak features in the ECD sequence.
Specifically, we represent molecule $M_1$’s ECD sequence as $\{\mathcal{P}_{1:j}\}_{j=1}^{N_p}$, where $\mathcal{P}_{1:j}$ is the $j$-th peak in the ECD sequence, and $N_p$ is the peak number. The ECD prediction task aims to predict the peak number $N_p$ and the peak position and height for each $\mathcal{P}_{1:j}$. Under the new task setting, deep learning models achieve better performance on ECD prediction and chiral molecule discrimination.
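To make the peak-level representation concrete, the sketch below extracts peaks from a sampled spectrum and mirrors a spectrum to obtain that of the chiral-form molecule. The extraction rule (strict interior local maxima above zero and local minima below zero) is a simplifying assumption; the text does not describe how the authors extract peaks, and the function names are hypothetical.

```python
def find_peaks(wavelengths, values):
    """Return (position, height, sign) for each interior local extremum.

    Only positive maxima and negative minima are kept, which is a
    simplification chosen for this sketch.
    """
    peaks = []
    for i in range(1, len(values) - 1):
        left, mid, right = values[i - 1], values[i], values[i + 1]
        is_max = mid > left and mid > right and mid > 0
        is_min = mid < left and mid < right and mid < 0
        if is_max or is_min:
            peaks.append((wavelengths[i], mid, 1 if mid > 0 else -1))
    return peaks

def mirror(values):
    """Spectrum of the chiral-form molecule: same grid, opposite sign."""
    return [-v for v in values]
```

Mirroring leaves the peak number and positions unchanged and flips every peak sign, so the two forms of a molecule differ only in the symbol property.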

3.1.2 Deep Learning Models: Graph Neural Network (GNN) and Transformer Network.

The Graph Neural Network (GNN) [35] is an outstanding model for graph representation learning. For molecules, the atoms and chemical bonds are easy to interpret as a graph. Thus, the GNN has become the standard model for extracting molecular representations [15, 36, 37]. For an input molecular graph, a GNN takes the weighted average of each node's features and its neighbors' features, resulting in new node representations. The GNN repeats this process over multiple layers to progressively propagate and aggregate information from the nodes, leading to richer molecular representations.
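The neighbor-averaging step described above can be sketched without any learned parameters. Real GNN layers apply trainable transformations before and after aggregation; this sketch omits them and uses a fixed blending weight, and the function name is hypothetical.

```python
def message_passing_layer(node_feats, edges, self_weight=0.5):
    """One layer: blend each node's features with the mean of its neighbors'.

    node_feats: dict mapping node id -> list of floats
    edges: list of undirected (u, v) pairs, e.g. chemical bonds
    """
    neighbors = {u: [] for u in node_feats}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    updated = {}
    for u, feat in node_feats.items():
        if neighbors[u]:
            count = len(neighbors[u])
            mean = [sum(node_feats[v][d] for v in neighbors[u]) / count
                    for d in range(len(feat))]
        else:
            mean = list(feat)  # isolated node keeps its own features
        updated[u] = [self_weight * f + (1 - self_weight) * m
                      for f, m in zip(feat, mean)]
    return updated
```

Applying the function repeatedly lets information travel one bond further per layer, which is the progressive propagation the paragraph refers to.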

The Transformer [38] is a seminal deep-learning model for the sequence processing task. It utilizes stacked attention [17] modules to fuse sequence features from different positions, thereby achieving improved sequence prediction performance. The Transformer contains two parts: the Encoder and the Decoder. The Encoder employs bidirectional attention modules to better integrate input features, while the Decoder employs unidirectional attention modules for sequence prediction. In this work, we have redefined the ECD prediction task as peak information prediction, and therefore, we apply the Transformer Encoder structure for its enhanced feature fusion capability.
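The difference between bidirectional (Encoder) and unidirectional (Decoder) attention can be shown with a single-head sketch. It omits the learned query, key, and value projections of a real Transformer layer, and the `causal` flag is a name chosen here for illustration.

```python
import math

def attention(queries, keys, values, causal=False):
    """Scaled dot-product attention over lists of float vectors.

    causal=False: every position attends to all positions (Encoder style).
    causal=True: position i attends only to positions <= i (Decoder style).
    """
    d = len(keys[0])
    out = []
    for i, q in enumerate(queries):
        limit = i + 1 if causal else len(keys)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys[:limit]]
        m = max(scores)  # stabilize the softmax
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, values[:limit]))
                    for j in range(len(values[0]))])
    return out
```

In the bidirectional case each output mixes information from the whole input, which is the feature fusion property the text cites as the reason for choosing the Encoder structure.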

3.2 The Framework of the proposed ECDFormer

The overview of our ECDFormer is illustrated in Fig. 3. Our ECDFormer contains four major modules: (1). Feature Extraction Module with GeoGNN [16, 39], (2). Peak Property Learning Module with Transformer Encoder, (3). Peak Property Prediction Module, (4). ECD Rendering Module. The workflow of our ECDFormer is described below.

The Feature Extraction Module utilizes GeoGNN containing two graph convolutional networks to extract the molecule’s geometric and descriptor information from the molecule’s atom-bond graph and bond-angle graph. Then, the molecule representation features are input into the Peak Property Learning Module together with empty query tokens. With the transformer encoder structure, the Peak Property Learning Module extracts the peak-related features from the molecule features to the empty query tokens. In the Peak Property Prediction Module, the resulting peak-related features are simultaneously fed into three specific task heads: the peak-number head, the peak-position head, and the peak-height head to predict the peak properties. Finally, the ECD Rendering Module reconstructs the ECD spectra from the peak properties employing mathematical simulation methods. We further introduce more details about the Feature Extraction Module, Peak Property Learning Module, Peak Property Prediction Module, and ECD Rendering Module in the following subsections.
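The rendering step can be pictured with a short sketch. The text only says that the spectra are reconstructed with mathematical simulation methods, so the choice here of one signed Gaussian per peak, the 10 nm width, and the function name are all assumptions made for illustration.

```python
import math

def render_spectrum(peaks, wavelengths_nm, width_nm=10.0):
    """Sum one signed Gaussian per (position_nm, height) peak on the grid.

    The height carries the peak sign: positive for a positive Cotton
    effect, negative for a negative one.
    """
    return [
        sum(h * math.exp(-((lam - pos) / width_nm) ** 2) for pos, h in peaks)
        for lam in wavelengths_nm
    ]
```

For instance, `render_spectrum([(220.0, 1.0), (260.0, -0.5)], range(180, 320))` would produce a curve with a positive band near 220 nm and a weaker negative band near 260 nm.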

3.3 Molecular Feature Extraction Module

As shown in Fig. 3, for the molecular feature extraction module, we apply the GeoGNN structure to encode molecular geometric features by modeling the atom-bond-angle corresponding relations. Compared with the traditional GNNs that only consider the atom-bond relationship, GeoGNN [16, 39] has a stronger ability in molecular representation modeling.

Specifically, for an input molecule $M$, we denote its atom set as $\mathcal{V}$, its bond set as $\mathcal{E}$, and its bond-angle set as $\mathcal{A}$. Then we introduce $M$’s atom-bond graph $G$ and bond-angle graph $H$. The atom-bond graph is defined as $G = (\mathcal{V}, \mathcal{E})$, where atom $u \in \mathcal{V}$ is regarded as the node of $G$ and bond $(u, v) \in \mathcal{E}$ as the edge of $G$. Similarly, the bond-angle graph is defined as $H = (\mathcal{E}, \mathcal{A})$, where bond $(u, v) \in \mathcal{E}$ is regarded as the node of $H$ and bond angle $(u, v, w) \in \mathcal{A}$ as the edge of $H$. Both the atom-bond graph and the bond-angle graph are input into the GeoGNN for further feature extraction.
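The two graphs defined above can be built from a plain bond list, as in the following sketch. It assumes atoms are identified by integer indices and treats every pair of bonds sharing an atom as a bond angle; the dictionary layout and the function name are illustrative and do not reflect the GeoGNN code.

```python
from itertools import combinations

def build_graphs(bonds):
    """Return the atom-bond graph G and the bond-angle graph H.

    G: nodes are atoms, edges are bonds (u, v).
    H: nodes are bonds, edges are angles (u, v, w) formed by two
       bonds that share the central atom v.
    """
    bonds = [tuple(sorted(b)) for b in bonds]
    atoms = sorted({a for b in bonds for a in b})
    incident = {a: [] for a in atoms}
    for u, v in bonds:
        incident[u].append((u, v))
        incident[v].append((u, v))
    angles = []
    for center, attached in incident.items():
        for b1, b2 in combinations(attached, 2):
            u = b1[0] if b1[1] == center else b1[1]  # far end of first bond
            w = b2[0] if b2[1] == center else b2[1]  # far end of second bond
            angles.append((u, center, w))
    G = {"nodes": atoms, "edges": bonds}
    H = {"nodes": bonds, "edges": angles}
    return G, H
```

A three-atom chain with bonds (0, 1) and (1, 2) yields a single angle (0, 1, 2), so $H$ has two nodes connected by one edge.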

Then, our feature extraction module learns the representation of atoms and bonds iteratively. For the $k$-th iteration, we use $\mathbf{h}_{u}$ and $\mathbf{h}_{uv}$ as the representation of atom $u$ and bond $(u,v)$. To achieve information aggregation between the atom-bond graph $G$ and the bond-angle graph $H$, the representation vectors of the bonds are taken as the information link between $G$ and $H$. Specifically, the iteration of our feature extraction module contains two stages:

In the first stage, the bonds’ representation vectors are learned by aggregating messages from the neighboring bonds and corresponding bond angles in the bond-angle graph $H$. Given bond $(u,v)$, in the $k$-th iteration, its representation $\mathbf{h}_{uv}^{(k)}$ is formalized by:

$$\mathbf{a}_{uv}^{(k)}=\mathcal{F}_{bond-angle}^{(k)}\big(\{(\mathbf{h}_{uv}^{(k-1)},\mathbf{h}_{uw}^{(k-1)},\mathbf{x}_{wuv}):w\in\mathcal{N}(u)\}\cup\{(\mathbf{h}_{uv}^{(k-1)},\mathbf{h}_{vw}^{(k-1)},\mathbf{x}_{uvw}):w\in\mathcal{N}(v)\}\big), \tag{2}$$
$$\mathbf{h}_{uv}^{(k)}=\mathcal{W}_{s}*\mathbf{h}_{uv}^{(k-1)}+\mathbf{a}_{uv}^{(k)}, \tag{3}$$

where $\mathcal{N}(u)$ and $\mathcal{N}(v)$ are the neighbor atoms of $u$ and $v$, and $\{(u,w):w\in\mathcal{N}(u)\}\cup\{(v,w):w\in\mathcal{N}(v)\}$ are the neighbor bonds of bond $(u,v)$. $\mathbf{x}_{wuv}$ and $\mathbf{x}_{uvw}$ denote the features of the corresponding bond angles. $\mathcal{F}_{bond-angle}$ is an MLP with two linear layers, acting as the message aggregation function. $\mathbf{a}_{uv}$ is bond $(u,v)$’s aggregated feature from neighbor bonds. The representation vector of bond $(u,v)$ is then updated from $\mathbf{a}_{uv}$ according to Eq. 3.

In the second stage, with the updated bond representation in $H$, we further learn the atoms’ representation by aggregating messages from the neighboring atoms and the corresponding bond representations from $H$. Given an atom $u$, its representation $\mathbf{h}_{u}^{(k)}$ in the $k$-th iteration is formalized as:

$$\mathbf{a}_{u}^{(k)}=\mathcal{F}_{atom-bond}^{(k)}\big(\{(\mathbf{h}_{u}^{(k-1)},\mathbf{h}_{v}^{(k-1)},\mathbf{h}_{uv}^{(k-1)}):v\in\mathcal{N}(u)\}\big), \tag{4}$$
$$\mathbf{h}_{u}^{(k)}=\mathcal{W}_{s}*\mathbf{h}_{u}^{(k-1)}+\mathbf{a}_{u}^{(k)}, \tag{5}$$

where $\mathcal{N}(u)$ represents the neighbor atoms of atom $u$. $\mathcal{F}_{atom-bond}$ is an MLP with two linear layers, acting as the message aggregation function. $\mathbf{a}_{u}$ is atom $u$’s aggregated feature from neighbor atoms. The representation of $u$ is then updated from $\mathbf{a}_{u}$ according to Eq. 5.
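The two-stage iteration can be illustrated with a simplified, dependency-free sketch. It assumes that a plain element-wise sum stands in for the two MLP aggregators, that $\mathcal{W}_{s}$ is a scalar weight, and that features are stored in dictionaries; the names `geo_iteration`, `nbrs`, and `x_angle` are hypothetical. It is not the GeoGNN implementation.

```python
def vsum(vectors, dim):
    """Element-wise sum of a list of equal-length vectors."""
    total = [0.0] * dim
    for vec in vectors:
        total = [t + x for t, x in zip(total, vec)]
    return total

def key(a, b):
    """Canonical (sorted) key for the bond between atoms a and b."""
    return (a, b) if a < b else (b, a)

def geo_iteration(h_atom, h_bond, x_angle, nbrs, w_s=1.0):
    """One simplified iteration of Eqs. (2)-(5) on dict-based features."""
    dim = len(next(iter(h_atom.values())))

    # Stage 1, Eqs. (2)-(3): update every bond from its neighbouring bonds
    # and the angle feature shared with each of them.
    new_bond = {}
    for (u, v), h_uv in h_bond.items():
        messages = []
        for centre, other in ((u, v), (v, u)):
            for w in nbrs[centre]:
                if w != other:
                    h_nb = h_bond[key(centre, w)]
                    x = x_angle[frozenset((key(u, v), key(centre, w)))]
                    messages.append(vsum([h_uv, h_nb, x], dim))
        a_uv = vsum(messages, dim)  # sum stands in for F_bond-angle
        new_bond[(u, v)] = [w_s * h + a for h, a in zip(h_uv, a_uv)]

    # Stage 2, Eqs. (4)-(5): update every atom from its neighbouring atoms.
    # Eq. (4) as printed reads the bond vectors of iteration k-1.
    new_atom = {}
    for u, h_u in h_atom.items():
        messages = [vsum([h_u, h_atom[v], h_bond[key(u, v)]], dim)
                    for v in nbrs[u]]
        a_u = vsum(messages, dim)  # sum stands in for F_atom-bond
        new_atom[u] = [w_s * h + a for h, a in zip(h_u, a_u)]
    return new_atom, new_bond

# Toy chain 0-1-2 with one-dimensional features (hypothetical values).
nbrs = {0: [1], 1: [0, 2], 2: [1]}
h_atom = {0: [1.0], 1: [2.0], 2: [3.0]}
h_bond = {(0, 1): [0.5], (1, 2): [1.5]}
x_angle = {frozenset(((0, 1), (1, 2))): [0.1]}
new_atom, new_bond = geo_iteration(h_atom, h_bond, x_angle, nbrs)
```

The residual form of Eqs. (3) and (5) is what the last line of each stage reproduces: the previous representation is scaled and the aggregated message is added to it.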

After all iterations, we calculate the molecular global representation $\mathbf{h}_{G}$ by summarizing and pooling over the atoms’ representation $\{\mathbf{h}_{u}\},\forall u\in\mathcal{V}$. We further take the input molecule $M$’s global representation $\mathbf{h}_{G}$ and atom representations $\{\mathbf{h}_{u}\}$ as the input of further modules.

3.4 Peak Property Learning Module

Our peak property learning module aims to fuse the atom representation features and extract the key features with peak property information. We apply the transformer encoder as the fusion model due to its powerful feature fusion capability enabled by its cross-attention structure. Specifically, for molecule $M$, we first randomly initialize a set of tokens $\{\mathbf{Q}_{i}\}_{i=1}^{n}$ as the peak tokens for $M$. Then we combine the peak tokens $\{\mathbf{Q}_{i}\}$ with $M$’s global feature $\mathbf{h}_{G}$ and $M$’s atom features $\{\mathbf{h}_{u}\}$ as the input tokens for the transformer encoder, which is formalized as:

$$[\mathbf{Q}_{j},\mathbf{h}_{G,j},\mathbf{h}_{u,j}]=Layer_{j}(\mathbf{Q}_{j-1},\mathbf{h}_{G,j-1},\mathbf{h}_{u,j-1}) \tag{6}$$

where $Layer_{j}$ represents the $j$-th transformer encoder layer. After the cross attention in $N$ transformer encoder layers, $\mathbf{Q}_{N}$ denotes the final representations of the peak tokens. We get the peak number $N_{p}$ from the molecular ground-truth ECD spectra $\{\mathcal{P}_{1:j}\}_{j=1}^{N_{p}}$. Thus, we extract the first $N_{p}$ peak token features from $\{\mathbf{Q}_{i}\}_{i=1}^{n}$ as the peak property information of molecule $M$.
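The token flow of Eq. (6) can be sketched without any deep-learning library. The example below is a deliberate simplification: it uses parameter-free scaled dot-product self-attention with a residual connection and omits the learned projections, feed-forward block, and normalization of a real transformer encoder layer. The function names and the choice of `n_tokens=8` are assumptions made for illustration.

```python
import math
import random

def self_attention(tokens):
    """Parameter-free scaled dot-product self-attention with a residual."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in tokens]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        mixed = [sum(w * v[d] for w, v in zip(weights, tokens)) / norm
                 for d in range(dim)]
        out.append([x + m for x, m in zip(q, mixed)])
    return out

def peak_features(atom_feats, global_feat, n_tokens, n_layers, n_peaks, seed=0):
    """Run [Q, h_G, {h_u}] through n_layers and keep the first n_peaks tokens."""
    rng = random.Random(seed)
    dim = len(global_feat)
    # Randomly initialised peak tokens Q_1..Q_n.
    queries = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
               for _ in range(n_tokens)]
    tokens = queries + [global_feat] + atom_feats
    for _ in range(n_layers):
        tokens = self_attention(tokens)
    final_queries = tokens[:n_tokens]   # corresponds to Q_N
    return final_queries[:n_peaks]      # first N_p peak tokens

# Hypothetical two-dimensional features for a three-atom molecule.
atom_feats = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]
global_feat = [0.4, 0.6]
feats = peak_features(atom_feats, global_feat, n_tokens=8, n_layers=2, n_peaks=3)
```

Because every token attends to every other token, the peak tokens absorb information from both the global feature and the atom features before the first $N_{p}$ of them are sliced off.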

3.5 Peak Property Prediction Module

Our peak property prediction module aims to reconstruct the peak properties, including the peak number, peak symbol, and peak position, from the output features of the peak property learning module. For the peak number prediction, we apply a two-layer MLP to predict the peak number from the molecule global feature $\mathbf{h}_{G}$, which is formalized as:

$$\mathcal{P}_{num}=\mathrm{Linear}(\mathrm{ReLU}(\mathrm{Linear}(\mathbf{h}_{G}))) \tag{7}$$

where $\mathcal{P}_{num}$ represents the peak number for the ECD spectra of molecule $M$. For the peak symbol $\mathcal{P}_{symbol}$ and the peak position $\mathcal{P}_{pos}$, we also apply two separate two-layer MLPs, which predict $\mathcal{P}_{symbol}$ and $\mathcal{P}_{pos}$ from the corresponding peak token $\mathbf{Q}_{i}$ in $\{\mathbf{Q}_{i}\}_{i=1}^{n}$:

$$\mathcal{P}_{pos}=\mathrm{Linear}(\mathrm{ReLU}(\mathrm{Linear}(\mathbf{Q}_{i}))),\quad\mathcal{P}_{symbol}=\mathrm{Linear}(\mathrm{ReLU}(\mathrm{Linear}(\mathbf{Q}_{i}))), \tag{8}$$

Here we predict all three peak properties that are vital for ECD spectra prediction.
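The three heads share the Linear-ReLU-Linear form of Eqs. (7) and (8). A minimal standard-library sketch follows; the hidden width and the output sizes (9 for the number head, 1 for the position head, 2 for the symbol head) are hypothetical choices, since the text does not state them, and the weights are random rather than trained.

```python
import random

def make_linear(n_in, n_out, rng):
    """Randomly initialised linear layer stored as (weight rows, bias)."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
              for _ in range(n_out)]
    return weight, [0.0] * n_out

def linear(layer, x):
    weight, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def relu(x):
    return [max(0.0, xi) for xi in x]

def make_head(n_in, n_hidden, n_out, rng):
    """Two-layer MLP parameters: Linear -> ReLU -> Linear."""
    return make_linear(n_in, n_hidden, rng), make_linear(n_hidden, n_out, rng)

def head(params, x):
    first, second = params
    return linear(second, relu(linear(first, x)))

rng = random.Random(0)
dim = 128  # embedding dimension of the molecular features
number_head = make_head(dim, dim, 9, rng)    # output size is an assumption
position_head = make_head(dim, dim, 1, rng)  # output size is an assumption
symbol_head = make_head(dim, dim, 2, rng)    # output size is an assumption

h_G = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # stand-in global feature
Q_i = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # stand-in peak token

p_num = head(number_head, h_G)      # Eq. (7): predicted from h_G
p_pos = head(position_head, Q_i)    # Eq. (8): predicted from Q_i
p_symbol = head(symbol_head, Q_i)   # Eq. (8): predicted from Q_i
```

Note the asymmetry the equations imply: the number head reads the global feature, while the position and symbol heads read an individual peak token.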

3.6 ECD Spectra Rendering Module

The final module, the ECD spectra rendering module, aims to render the predicted ECD spectra from the abstract peak properties. We employ the Gaussian noise distribution model to fit the spectral curve. Specifically, given the position $l_{p}$ and the corresponding height $l_{h}$ of a peak, we set a Gaussian noise distribution with mean $\mu=l_{p}$ and standard deviation $\sigma=l_{h}$. We then extract the distribution over the range $[\mu-6\sigma,\mu+6\sigma]$ as the fitting curve for the peak. We render the predicted ECD spectra for molecule $M$ by combining the fitting curves of all predicted peaks.
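A minimal rendering sketch that follows the description literally ($\mu=l_{p}$, $\sigma=l_{h}$, window $[\mu-6\sigma,\mu+6\sigma]$). Several details are assumptions, because the text does not specify them: the absolute value of the height is used as $\sigma$, the sign of the height sets the sign of the peak, the curves are combined by summation, and the 180 to 400 nm grid and peak values are made up for the example.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def render_spectrum(peaks, grid):
    """Render a curve on `grid` from (position l_p, signed height l_h) pairs."""
    curve = [0.0] * len(grid)
    for l_p, l_h in peaks:
        mu, sigma = l_p, abs(l_h)   # assumption: sigma must be positive
        if sigma == 0.0:
            continue                # a zero-height peak contributes nothing
        sign = 1.0 if l_h > 0 else -1.0
        lo, hi = mu - 6.0 * sigma, mu + 6.0 * sigma
        for i, x in enumerate(grid):
            if lo <= x <= hi:       # keep only the 6-sigma window
                curve[i] += sign * gaussian_pdf(x, mu, sigma)
    return curve

# Hypothetical wavelength grid (nm) and two hypothetical peaks.
grid = [float(x) for x in range(180, 401)]
spectrum = render_spectrum([(220.0, 5.0), (300.0, -8.0)], grid)
```

With these inputs the curve is positive around 220 nm and negative around 300 nm, and the two windows do not overlap.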

3.7 Experimental Settings and Training Hyperparameters

In the molecular feature extraction module, we set the number of GINConv layers in GeoGNN to $5$ and the graph pooling strategy to summation. The embedding dimension of molecular features is $128$ and the batch size for ECDFormer is $256$. We apply the AdamW [40] optimizer implemented in PyTorch with a learning rate of $1\times10^{-3}$. For better convergence, we apply the StepLR scheduler with a decreasing rate of $0.25$ to adjust the learning rate. During training, the CMCDS dataset is randomly divided into $90/5/5$ train/valid/test splits. ECDFormer is trained for $1000$ epochs, and the checkpoint with the best validation performance is selected for testing. For the other deep-learning baselines, we use a learning rate of $5\times10^{-4}$ and $1500$ epochs, while the other parameters are the same as for ECDFormer.
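The settings above can be collected into a small configuration sketch. The step-decay rule below is the standard one (the learning rate is multiplied by a factor every fixed number of epochs). Reading the stated decreasing rate of 0.25 as that factor is an assumption, and the step size of 100 epochs is a hypothetical value, since the text does not give one. The dataset size passed to the split helper is also made up.

```python
import random

CONFIG = {
    "num_gnn_layers": 5,
    "pooling": "sum",
    "embed_dim": 128,
    "batch_size": 256,
    "base_lr": 1e-3,
    "lr_gamma": 0.25,      # assumption: the "decreasing rate" is the decay factor
    "lr_step_size": 100,   # hypothetical: not stated in the text
    "epochs": 1000,
    "split": (0.90, 0.05, 0.05),
}

def step_lr(epoch, base_lr, gamma, step_size):
    """Step decay: multiply the learning rate by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)

def random_split(n_items, fractions, seed=0):
    """Shuffle indices and cut them into train/valid/test parts."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_train = round(fractions[0] * n_items)
    n_valid = round(fractions[1] * n_items)
    return (indices[:n_train],
            indices[n_train:n_train + n_valid],
            indices[n_train + n_valid:])

train_idx, valid_idx, test_idx = random_split(1000, CONFIG["split"])
lr_start = step_lr(0, CONFIG["base_lr"], CONFIG["lr_gamma"], CONFIG["lr_step_size"])
lr_later = step_lr(100, CONFIG["base_lr"], CONFIG["lr_gamma"], CONFIG["lr_step_size"])
```

Under these assumed values the learning rate drops to a quarter of its initial value after the first 100 epochs.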

Data Availability

The first large-scale ECD spectra dataset for chiral molecules, the CMCDS dataset, has been deposited in the Github repository, my-dataset-link.

Code Availability

All code used in data analysis and preparation of the manuscript, alongside a description of necessary steps for reproducing results, can be found in a GitHub repository accompanying this manuscript: my-github-link.

References

  • [1] Noyori, R. Asymmetric catalysis: science and opportunities (nobel lecture). Angewandte Chemie International Edition 41, 2008–2022 (2002).
  • [2] List, B. & MacMillan, D. The 2021 nobel prize in chemistry: asymmetric catalysis with small organic molecules. Current Science 121, 1148 (2021).
  • [3] Amabilino, D. B. & Veciana, J. Supramolecular chiral functional materials. Supramolecular Chirality 253–302 (2006).
  • [4] Shen, B., Kim, Y. & Lee, M. Supramolecular chiral 2d materials and emerging functions. Advanced Materials 32, 1905669 (2020).
  • [5] Teng, Y. et al. Advances and applications of chiral resolution in pharmaceutical field. Chirality 34, 1094–1119 (2022).
  • [6] Lininger, A. et al. Chirality in light–matter interaction. Advanced Materials 35, 2107325 (2023).
  • [7] Evers, F. et al. Theory of chirality induced spin selectivity: Progress and challenges. Advanced Materials 34, 2106629 (2022).
  • [8] Zhang, W. et al. Great concern for chiral pharmaceuticals from the thalidomide tragedy. Univ. Chem 34, 1–12 (2019).
  • [9] Ebeling, D. et al. Assigning the absolute configuration of single aliphatic molecules by visual inspection. Nature communications 9, 2420 (2018).
  • [10] Menna, M., Imperatore, C., Mangoni, A., Della Sala, G. & Taglialatela-Scafati, O. Challenges in the configuration assignment of natural products. a case-selective perspective. Natural product reports 36, 476–489 (2019).
  • [11] Junior, F. M. d. S. & Junior, J. M. B. Absolute configuration from chiroptical spectroscopy. Chiral Separations and Stereochemical Elucidation: Fundamentals, Methods, and Applications 551–591 (2023).
  • [12] Janet, J. P. & Kulik, H. J. Machine Learning in chemistry, vol. 1 (American Chemical Society, 2020).
  • [13] de Almeida, A. F., Moreira, R. & Rodrigues, T. Synthetic organic chemistry driven by artificial intelligence. Nature Reviews Chemistry 3, 589–604 (2019).
  • [14] Hermann, J. et al. Ab-initio quantum chemistry with neural-network wavefunctions. arXiv preprint arXiv:2208.12590 (2022).
  • [15] Xu, H., Lin, J., Zhang, D. & Mo, F. Retention time prediction for chromatographic enantioseparation by quantile geometry-enhanced graph neural network. Nature Communications 14, 3095 (2023).
  • [16] Fang, X. et al. Geometry-enhanced molecular representation learning for property prediction. Nature Machine Intelligence 4, 127–134 (2022).
  • [17] Vaswani, A. et al. Attention is all you need. Advances in neural information processing systems 30 (2017).
  • [18] Frisch, M. J. et al. Gaussian 16 Revision B.01 (2016). Gaussian Inc. Wallingford CT.
  • [19] Stephens, P. J., Devlin, F. J., Chabalowski, C. F. & Frisch, M. J. Ab initio calculation of vibrational absorption and circular dichroism spectra using density functional force fields. The Journal of physical chemistry 98, 11623–11627 (1994).
  • [20] Yanai, T., Tew, D. P. & Handy, N. C. A new hybrid exchange-correlation functional using the coulomb-attenuating method (cam-b3lyp). Chemical physics letters 393, 51–57 (2004).
  • [21] Zou, Z. et al. A deep learning model for predicting selected organic molecular spectra. Nature Computational Science 1–8 (2023).
  • [22] Rogers, D. M. et al. Electronic circular dichroism spectroscopy of proteins. Chem 5, 2751–2774 (2019).
  • [23] Mao, A., Mohri, M. & Zhong, Y. Cross-entropy loss functions: Theoretical analysis and applications. arXiv preprint arXiv:2304.07288 (2023).
  • [24] Nagy, G., Igaev, M., Jones, N. C., Hoffmann, S. V. & Grubmuller, H. Sesca: predicting circular dichroism spectra from protein molecular structures. Journal of chemical theory and computation 15, 5087–5102 (2019).
  • [25] Micsonai, A., Bulyáki, É. & Kardos, J. Bestsel: from secondary structure analysis to protein fold prediction by circular dichroism spectroscopy. Structural Genomics: General Applications 175–189 (2021).
  • [26] Zhao, L. et al. Accurate machine learning prediction of protein circular dichroism spectra with embedded density descriptors. JACS Au 1, 2377–2384 (2021).
  • [27] Artrith, N. et al. Best practices in machine learning for chemistry. Nature chemistry 13, 505–508 (2021).
  • [28] Wei, J. et al. Machine learning in materials science. InfoMat 1, 338–358 (2019).
  • [29] Shi, X. et al. Convolutional lstm network: A machine learning approach for precipitation nowcasting. Advances in neural information processing systems 28 (2015).
  • [30] Chung, J., Gulcehre, C., Cho, K. & Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014).
  • [31] Xu, W.-F. et al. 17-hydroxybrevianamide n and its n1-methyl derivative, quinazolinones from a soft-coral-derived aspergillus sp. fungus: 13 s enantiomers as the true natural products. Journal of Natural Products 84, 1353–1358 (2021).
  • [32] Tu, W.-C. et al. Wulfenioidins d–n, structurally diverse diterpenoids with anti-zika virus activity isolated from orthosiphon wulfenioides. Journal of Natural Products 86, 2348–2359 (2023).
  • [33] Lam, Y. T. et al. Purpurascenines a–c, azepino-indole alkaloids from cortinarius purpurascens: Isolation, biosynthesis, and activity studies on the 5-ht2a receptor. Journal of Natural Products (2023).
  • [34] Liu, F. et al. Anti-inflammatory quinoline alkaloids from the roots of waltheria indica. Journal of Natural Products 86, 276–289 (2023).
  • [35] Zhang, S., Tong, H., Xu, J. & Maciejewski, R. Graph convolutional networks: a comprehensive review. Computational Social Networks 6, 1–23 (2019).
  • [36] Mahmood, O., Mansimov, E., Bonneau, R. & Cho, K. Masked graph modeling for molecule generation. Nature communications 12, 3156 (2021).
  • [37] Zhong, W., Yang, Z. & Chen, C. Y.-C. Retrosynthesis prediction using an end-to-end graph generative architecture for molecular graph editing. Nature Communications 14, 3009 (2023).
  • [38] Han, K. et al. A survey on vision transformer. IEEE transactions on pattern analysis and machine intelligence 45, 87–110 (2022).
  • [39] Peng, Y. et al. Enhanced graph isomorphism network for molecular admet properties prediction. IEEE Access 8, 168344–168360 (2020).
  • [40] Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
  • [41] O’Boyle, N., Vandermeersch, T. & Hutchison, G. Confab-generation of diverse low energy conformers. Journal of Cheminformatics (2011).

Author contributions statement

H.L., D.L., X.W., and F.M. conceived the basic idea and designed the research study. D.L. and F.M. generated the ECD spectra dataset using DFT calculation. H.L. developed the method. L.Y. and F.M. further modified the method. D.L. and X.W. conceived the evaluation metric in the experiment. H.L. and D.L. evaluated the performance on the CMCDS dataset and natural product molecules. H.L. and D.L. wrote the manuscript. L.Y., Y.T., X.W., and F.M. revised the manuscript. Y.L. and Y.T. provided the deep learning computing platform.

Support information

S1 Statistical results for the CMCDS dataset

For all molecules in the CMCDS dataset, we visualized their property distribution by counting the number of atoms in each molecule, the number of peaks of the corresponding ECD spectra, and the number of chemical bonds. In the CMCDS dataset, all chiral molecules are single-chiral-centered, resulting in finite complexity. It is observable that the majority of these molecules consist of approximately 60 atoms, with the largest molecule not exceeding 200 atoms (Fig. 7(a)). Furthermore, most of these molecules possess around 25 chemical bonds, with a maximum of 65 bonds (Fig. 7(b)). Additionally, the ECD spectra of these molecules typically exhibit 3 to 4 peaks, with a maximum of 8 peaks (Fig. 7(c)).

Figure 7: The visualization of the molecular properties in the CMCDS dataset. a All molecules contain fewer than 200 atoms, and most molecules have about 75 atoms. b All molecules contain fewer than 65 chemical bonds, and most molecules have about 25 chemical bonds. c The number of peaks ranges from 0 to 8 in the ECD spectrum, and most have 4 peaks.
| # | Method | Molecule-Type | Position-RMSE (nm) ↓ | Number-RMSE ↓ | Symbol-Acc. (%) ↑ |
|---|-----------|----------------------|------|------|------|
| 1 | ECDFormer | Single-Chiral-Center | 2.29 | 1.24 | 72.7 |
| 2 | ECDFormer | Multi-Chiral-Center | 2.88 | 1.76 | 63.1 |
Table 2: Performance for ECD prediction for multi-chiral-centered molecules. We also propose a comparison between the performance of single-chiral-centered molecules and multi-chiral-centered molecules. ECDFormer suffers a slight performance decrease when predicting multi-chiral-centered molecules, demonstrating the generalization ability of ECDFormer.

S2 The Generalization Ability on Multi-Chiral-Centered Molecules

Our ECDFormer is trained on the CMCDS dataset, where every molecule has a single chiral carbon center. As shown in Table 1, ECDFormer achieves outstanding ECD spectra prediction performance for single-chiral-centered molecules. To evaluate its generalization ability, we further test ECDFormer on multi-chiral-centered molecules, which are more complex in both molecular structure and ECD spectra. Specifically, we gather a small group of multi-chiral-centered molecules with their ECD spectra as a test dataset and evaluate ECDFormer on it. As shown in Table 2, ECDFormer suffers only a slight performance decrease on multi-chiral-centered molecules (Position-RMSE rises from 2.29 to 2.88 nm, Number-RMSE from 1.24 to 1.76, and Symbol-Acc. falls from 72.7% to 63.1%), demonstrating its generalization ability. The good performance on both single-chiral-centered and multi-chiral-centered molecules further validates its applicability.

S3 The Chemical Interpretability Analysis of ECDFormer

To comprehensively assess the performance of the entire deep learning model, we focused on analyzing the chemical interpretability of the ECDFormer model. By visualizing all the predicted cases in the test split of CMCDS, we found that the spectral similarity of each conformation within a molecule can impact the predictive results of ECDFormer.

Specifically, we selected molecules whose predictions were perfect matches and molecules whose predictions were completely wrong, ran conformational searches [41] on them, and then calculated the ECD spectrum of each conformation. As shown in Fig. 8, in the well-predicted cases, the ECD spectra of the conformations showed only minor differences. In contrast, in most of the badly predicted cases, the ECD spectra of the conformations differed significantly in peak shape and wavelength. This suggests that when the ECD spectra of the different conformations of a molecule are highly similar, the trained model predicts the ECD spectra accurately, and vice versa. This phenomenon can be explained from the deep-learning perspective. For a molecule whose conformations yield different ECD spectral shapes, it is hard for a deep-learning model to learn the latent features needed for prediction. In contrast, for a molecule whose conformations yield similar spectral shapes, the model can easily learn the latent pattern from the ECD spectra, which improves the prediction performance. However, a small number of molecules (Excellent_4 and Bad_4 in Fig. 8) do not fit this pattern, owing to the uncertainty of the deep-learning method and the complexity of the chiral assignation field. We are currently investigating these exceptional cases further.

Figure 8: The first column is the calculated ECD spectrum of molecules in the excellent class, obtained after a conformational search with different conformations using the same calculation method. In the second column are the calculated ECD spectra of molecules in the bad class, obtained in the same way as in the first column, for different conformations.