arXiv:2403.12478v1 [cond-mat.mtrl-sci] 19 Mar 2024

Con-CDVAE: A method for the conditional generation of crystal structures

Cai-Yuan Ye^{1,2}, Hong-Ming Weng^{1,2,3,*}, and Quan-Sheng Wu^{1,2,†}
^{1}Beijing National Laboratory for Condensed Matter Physics and Institute of Physics,
Chinese Academy of Sciences, Beijing 100190, China
^{2}University of Chinese Academy of Sciences, Beijing 100049, China
^{3}Songshan Lake Materials Laboratory, Dongguan, Guangdong 523808, China
^{*}Corresponding author. E-mail: hmweng@iphy.ac.cn
^{†}Corresponding author. E-mail: quansheng.wu@iphy.ac.cn
(March 19, 2024)
Abstract

In recent years, progress has been made in generating new crystalline materials using generative machine learning models, though gaps remain in efficiently generating crystals based on target properties. This paper proposes the Con-CDVAE model, an extension of the Crystal Diffusion Variational Autoencoder (CDVAE), for conditional crystal generation. We introduce innovative components, design a two-step training method, and develop three unique generation strategies to enhance model performance. The effectiveness of Con-CDVAE is demonstrated through extensive testing under various conditions, including both single and combined property targets. Ablation studies further underscore the critical role of the new components in achieving our model’s performance. Additionally, we validate the physical credibility of the generated crystals through Density Functional Theory (DFT) calculations, confirming Con-CDVAE’s potential in material science research.

Keywords: Crystal structure generation, Machine learning, Conditional generation, Diffusion model

1 Introduction

Material science plays a vital role in the development of modern technology and industry, and materials with excellent properties are the basis for manufacturing a variety of advanced equipment. Crystals, as a kind of material with periodicity, are used in many important fields, such as solar cells, batteries, and catalysis[1].

Traditional approaches, such as experiments and first-principles calculations, have catalogued millions of crystals and produced a series of databases, such as the Inorganic Crystal Structure Database (ICSD)[1, 2], the Materials Project (MP)[3], and the Open Quantum Materials Database (OQMD)[4]. However, these methods are gradually reaching their limits in searching the complex chemical and structural space[5].

Fortunately, thanks to open material databases, new methods based on machine learning (ML) have demonstrated their potential to overcome these limits. Some ML models have achieved fast prediction of material properties[6, 7] and reduced the computational cost of density functional theory (DFT) or molecular dynamics (MD)[8, 9], making it possible to combine ML with first-principles calculations and enabling an efficient search for new materials. A remarkable recent example was reported by DeepMind, whose GNoME model discovers new materials at an astonishing rate[10].

However, combining discriminative ML models with traditional methods has not completely transcended the limitations of those methods, because they still rely on atomic substitution to explore new materials[10]. In the fields of natural language and imagery, generative models have shown unexpected creativity while maintaining effectiveness, such as the ChatGPT series[11, 12] and the DALL-E series[13, 14]. Applying generative modeling to materials search may fundamentally break through the limitations of traditional methods. At present, models capable of material generation can be broadly divided into three types. Zekun Ren et al. used 2D tables, called the FTCP representation, to represent crystals and generated new crystals by generating these tables[15]. Kristof T. Schütt et al. generated molecular materials by placing atoms one by one[16, 17]. Tian Xie et al. combined a graph neural network (GNN), a variational autoencoder (VAE), and a diffusion model to propose the CDVAE model for crystal generation[18].

Although some progress has been made in crystal generation, there is still no method that can efficiently generate crystals based on the properties people need. Table-like representations and atom-by-atom generation may not be the best fit for crystals, so in this article, inspired by DALL-E2, we propose a model based on CDVAE, named Con-CDVAE, which is capable of conditionally generating crystals according to desired properties. We design a two-step method to train the model, generating the crystals' latent variables conditioned on given properties before generating the crystals themselves. We test the model under different conditions (e.g., formation energy, band gap, crystal system, and a combination of formation energy and band gap) and try different strategies. Finally, we perform an ablation experiment on the Predictor block to demonstrate that it helps structure the latent variable space.

2 Method

2.1 Diffusion model and crystal generative model

The diffusion model is a class of generative models that has been widely used in recent years. It was proposed by Sohl-Dickstein et al. based on nonequilibrium thermodynamics[19]. Song and Ho respectively improved it, proposing noise conditioned score networks (NCSN[20]) and denoising diffusion probabilistic models (DDPM[21]). The general idea of the diffusion model is to use a diffusion process (also called the forward process) to gradually add noise to data samples, transforming the data distribution into a prior distribution that is easy to sample, such as a Gaussian. In this process, an ML model capable of denoising is trained. In the reverse process, a sample is randomly initialized from the prior distribution, and the trained model is used to denoise it and generate a new sample conforming to the real data distribution.

In NCSN[20], one sets a sequence of increasing standard deviations $\sigma_1 < \sigma_2 < \dots < \sigma_T$ and adds Gaussian noise to the data:

$$q(x_t|x_{t-1}) = x_{t-1} + \mathcal{N}\left(0, (\sigma_t^2 - \sigma_{t-1}^2)I\right), \quad q(x_t|x_0) = x_0 + \mathcal{N}\left(0, \sigma_t^2 I\right) \tag{1}$$

In the reverse process, an annealed Langevin dynamics algorithm is used with an ML model called the score network $s_\theta(x_t)$:

$$x_{t-1} = x_t + \frac{\epsilon}{2}\, s_\theta(x_t) + \sqrt{\epsilon}\, \mathcal{N}(0, I) \tag{2}$$

where $\epsilon$ is the step size and the score network $s_\theta(x_t)$ predicts the gradient of the log of the perturbed data distribution, $\nabla_x \log q(x_t)$.

In DDPM[21], one defines a sequence of positive noise scales $0 < \beta_1, \beta_2, \dots, \beta_T$ and obtains the perturbed data with:

$$q(x_t|x_{t-1}) = \sqrt{1-\beta_t}\, x_{t-1} + \mathcal{N}(0, \beta_t I), \quad q(x_t|x_0) = \sqrt{\bar{\alpha}_t}\, x_0 + \mathcal{N}\left(0, (1-\bar{\alpha}_t)I\right) \tag{3}$$

where $\bar{\alpha}_t = \prod_{i=1}^{t}(1-\beta_i)$, and denoises with:

$$x_{t-1} = \frac{1}{\sqrt{1-\beta_t}}\left(x_t + \beta_t\, s_\theta(x_t)\right) + \sqrt{\beta_t}\, \mathcal{N}(0, I) \tag{4}$$
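To make Eqs. (3)-(4) concrete, here is a minimal PyTorch sketch of the DDPM forward perturbation and one reverse denoising step; the linear noise schedule and the `score_net` callable are illustrative assumptions, not the actual Con-CDVAE implementation.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # illustrative linear noise schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)   # \bar{alpha}_t = prod_i (1 - beta_i)

def perturb(x0, t):
    """Sample x_t ~ q(x_t | x_0) as in Eq. (3); also return the injected noise."""
    noise = torch.randn_like(x0)
    return alpha_bars[t].sqrt() * x0 + (1.0 - alpha_bars[t]).sqrt() * noise, noise

def denoise_step(x_t, t, score_net):
    """One reverse step following Eq. (4), with s_theta given by score_net."""
    beta_t = betas[t]
    mean = (x_t + beta_t * score_net(x_t, t)) / (1.0 - beta_t).sqrt()
    if t == 0:
        return mean                               # common practice: no noise at the last step
    return mean + beta_t.sqrt() * torch.randn_like(x_t)
```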

Both diffusion models are used in this article: NCSN is used to generate crystals, as in CDVAE[18], and DDPM is used to generate the latent variables of crystals. When applying NCSN to crystal generation, we represent a crystal as $M = (A, X, L)$, where $A$, $X$, and $L$ are the one-hot vectors of the atomic types, the atomic coordinates, and the lattice vectors, respectively. In the VAE structure, $Encoder(M)$ yields the latent variable $z$ for every crystal, and $Decoder(M_t, z, t)$ acts as the score network, where $M_t = (X_t, A_t, L)$ denotes a crystal perturbed by noise at level $t$. We follow the noise-addition scheme of CDVAE[18]:

$$X_t = X + \mathcal{N}(0, \sigma_t^2 I) \tag{5}$$
$$A_t = \frac{1}{1+\sigma_t}\, A + \frac{\sigma_t}{1+\sigma_t}\, A_{comp} \tag{6}$$

where $A_{comp}$ is the one-hot-like vector representing the chemical formula of the crystal. During generation we have

$$X_{t-1} = X_t + \eta_t\, \epsilon_x + \mathcal{N}(0, \eta_t^2 I) \tag{7}$$

where $\eta_t$ is the step size, and $\epsilon_x$ and $A_{t-1}$ are the outputs of the score network $Decoder(M_t, z, t)$.
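The following sketch, again with illustrative tensors, shows the CDVAE-style noising of Eqs. (5)-(6) and the Langevin-type coordinate update of Eq. (7); `eps_x` stands for the coordinate score predicted by the trained $Decoder$.

```python
import torch

def perturb_crystal(X, A, A_comp, sigma_t):
    """Add noise to fractional coordinates (Eq. 5) and atom-type vectors (Eq. 6)."""
    X_t = X + sigma_t * torch.randn_like(X)
    A_t = A / (1.0 + sigma_t) + A_comp * sigma_t / (1.0 + sigma_t)
    return X_t, A_t

def coord_update(X_t, eps_x, eta_t):
    """One generation step for coordinates following Eq. (7);
    eps_x is the coordinate score output by Decoder(M_t, z, t)."""
    return X_t + eta_t * eps_x + eta_t * torch.randn_like(X_t)
```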

Figure 1: Training and generation flow chart of Con-CDVAE.

2.2 Con-CDVAE

To achieve conditional generation, we build two new blocks on top of CDVAE[18] and introduce the $Prior$ block, inspired by DALL-E2[14], to generate the latent variable according to given properties. The training and generation flow chart of Con-CDVAE is shown in Figure 1; it is worth noting that we train the model in two steps. We also train several CGCNN models[6], a kind of GNN, to quickly verify whether the generated crystals meet our needs. Although ML-based verification has errors, it is sufficient for preliminary validation.

2.2.1 Condition.

As shown in Figure 1, we build the $Prop_{Emb}$ and $Predictor$ blocks on the VAE structure of CDVAE[18]. The $Prop_{Emb}$ block embeds the properties, with continuous and discrete properties embedded differently. Continuous properties are expanded in a Gaussian basis and embedded by an MLP:

$$C_{con} = \mathrm{MLP}\left(\left[ e^{-\frac{(a_{con}-a_{min})^2}{2\sigma^2}},\, e^{-\frac{(a_{con}-a_{min}-\sigma)^2}{2\sigma^2}},\, \dots,\, e^{-\frac{(a_{con}-a_{max})^2}{2\sigma^2}} \right]\right) \tag{8}$$

where $a_{con}$ is the property value, $a_{min}$ is the minimum, $a_{max}$ is the maximum, and $\sigma$ is the grid spacing. Each class of a discrete property is represented by a learnable vector and then processed with an MLP:

$$C_{dis} = \mathrm{MLP}\left(a_{dis}^{k}\right) \tag{9}$$

where $a_{dis}^{k}$ is the learnable vector of the $k$-th class of the discrete property. When the model is trained with a combination condition, an MLP mixes the embedding vectors of the different properties. The latent variable $z$ and the property embedding are then mixed by an MLP before being fed into the $Decoder$:

$$C_{prop} = \mathrm{MLP}\left(C_{a_1} \oplus C_{a_2} \oplus \dots \oplus C_{a_n}\right) \tag{10}$$
$$z_{con} = \mathrm{MLP}\left(C_{prop} \oplus z\right) \tag{11}$$

where $\oplus$ denotes concatenation of vectors.
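As a minimal sketch of how Eqs. (8)-(11) fit together, the following PyTorch module embeds one continuous and one discrete property; all layer widths and the two-layer MLPs are illustrative assumptions rather than the published architecture.

```python
import torch
import torch.nn as nn

class PropEmb(nn.Module):
    """Gaussian-basis expansion for a continuous property (Eq. 8), a learnable
    table for a discrete one (Eq. 9), and an MLP that mixes them (Eq. 10)."""
    def __init__(self, a_min, a_max, sigma, n_classes, dim=128):
        super().__init__()
        self.centers = torch.arange(a_min, a_max + sigma, sigma)  # Gaussian centers
        self.sigma = sigma
        self.mlp_con = nn.Sequential(nn.Linear(len(self.centers), dim),
                                     nn.ReLU(), nn.Linear(dim, dim))
        self.dis_table = nn.Embedding(n_classes, dim)             # learnable class vectors
        self.mlp_dis = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.mix = nn.Linear(2 * dim, dim)                        # Eq. (10)

    def forward(self, a_con, a_dis):
        g = torch.exp(-(a_con.unsqueeze(-1) - self.centers) ** 2 / (2 * self.sigma ** 2))
        c_con = self.mlp_con(g)                                   # Eq. (8)
        c_dis = self.mlp_dis(self.dis_table(a_dis))               # Eq. (9)
        return self.mix(torch.cat([c_con, c_dis], dim=-1))        # Eq. (10)

def condition_latent(z, c_prop, mix_mlp):
    """Eq. (11): concatenate the property embedding with z before the Decoder."""
    return mix_mlp(torch.cat([c_prop, z], dim=-1))
```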

The $Predictor$ block is an improvement on the property predictor in CDVAE[18], which could only handle one continuous property. The $Predictor$ block can be applied to multiple properties at the same time, including discrete ones. It takes the latent variable $z$ as input so that crystals with similar properties are close in the latent space, which helps when the $Prior$ is later used to generate new latent variables with given properties. A normalized mean squared error (MSE) loss is used for continuous properties and a cross-entropy loss for discrete properties:

$$\mathfrak{L}_{pre} = \left(\hat{a}_{con} - \tilde{a}_{con}\right)^2 + \mathrm{CrossEntropy}\left(\hat{a}_{dis}, a_{dis}\right), \quad \tilde{a}_{con} = \frac{a_{con} - a_{min}}{a_{max} - a_{min}} \tag{12}$$

where $\hat{a}_{con}$ and $\hat{a}_{dis}$ are the predicted values. Using the joint loss function $\mathfrak{L}_{joint} = \mathfrak{L}_{CDVAE} + \alpha \cdot \mathfrak{L}_{pre}$, we first train the $Encoder$, $Predictor$, $Prop_{Emb}$, and $Decoder$, which we call "Step-one" training. The hyperparameter $\alpha$ is usually set to 3.
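A sketch of the $Predictor$ loss of Eq. (12) and of how it enters "Step-one" training; the function signature is assumed for illustration.

```python
import torch
import torch.nn.functional as F

def predictor_loss(a_con_hat, a_con, a_min, a_max, a_dis_logits, a_dis):
    """Eq. (12): normalized MSE for the continuous target plus
    cross-entropy for the discrete one."""
    a_con_tilde = (a_con - a_min) / (a_max - a_min)   # min-max normalization
    mse = (a_con_hat - a_con_tilde).pow(2).mean()
    ce = F.cross_entropy(a_dis_logits, a_dis)
    return mse + ce

# "Step-one" training then minimizes L_joint = L_CDVAE + alpha * L_pre, alpha ~ 3:
# loss = cdvae_loss + 3.0 * predictor_loss(...)
```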

2.2.2 Prior.

After "Step-one", we build the $Prior$ block, which is a DDPM-type diffusion model. The task of the $Prior$ is to generate latent variables with the given properties, so we use the output of the $Encoder$ as the label. The $Prior$ can be divided into two sub-blocks: $Prior_{Emb}$ and $Prior_{Decode}$. $Prior_{Emb}$ is similar to $Prop_{Emb}$, but can also embed the chemical formula when needed:

$$C_{che} = \mathrm{MLP}\left(n_H \cdot v_H \odot n_{He} \cdot v_{He} \odot \dots\right) \tag{13}$$

where $n_H$ is the fraction of hydrogen, $v_H$ is the 92-dimensional embedding vector of hydrogen, and $\odot$ denotes element-wise addition of two vectors. The 92-dimensional element embeddings are taken from CGCNN[6]. $Prior_{Decode}$ consists of six layers of neurons: the width decreases layer by layer over the first three layers and increases layer by layer over the last three, with residual connections between layers of equal width. We add Gaussian noise via $z^t = \sqrt{\bar{\alpha}_t}\, z + \sqrt{1-\bar{\alpha}_t}\, \epsilon_z$, where $\epsilon_z \sim \mathcal{N}(0, I)$, and train the $Prior$ with the loss function $\mathfrak{L}_{prior}$, where $\hat{\epsilon}_z = Prior_{Decode}(z^t, C_{prop}, t)$:

$$\mathfrak{L}_{prior} = \mathbb{E}_{\epsilon_z \sim \mathcal{N}(0,I),\, t \sim \mathcal{U}(0,T)} \left\| \hat{\epsilon}_z - \epsilon_z \right\|^2 \tag{14}$$
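A minimal sketch of one evaluation of $\mathfrak{L}_{prior}$ (Eq. 14), assuming a `prior_decode` callable and a precomputed $\bar{\alpha}$ schedule:

```python
import torch

def prior_loss(z, c_prop, prior_decode, alpha_bars, T):
    """Noise a batch of encoder latents z at random timesteps and ask
    Prior_Decode to predict the injected noise (Eq. 14)."""
    t = torch.randint(0, T, (z.shape[0],))
    eps = torch.randn_like(z)
    a_bar = alpha_bars[t].unsqueeze(-1)
    z_t = a_bar.sqrt() * z + (1.0 - a_bar).sqrt() * eps   # z^t, DDPM forward process
    eps_hat = prior_decode(z_t, c_prop, t)
    return (eps_hat - eps).pow(2).sum(-1).mean()
```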

We train the $Prior$ block (the "Step-two" training process) in one of two ways. The first trains with properties identical to those input to the $Prop_{Emb}$ block, referred to as the default condition $Prior$. The second trains with properties such as band gap, formation energy, crystal system, space group, convex hull energy, atomic number, and chemical formula, termed the full condition $Prior$. For example, to train a Con-CDVAE that generates crystals conditioned on the band gap, the input to the $Prop_{Emb}$ block includes only the band gap. The input to the default condition $Prior$ also includes only the band gap, while the input to the full condition $Prior$ includes all the properties mentioned above.

Using these two kinds of $Prior$, we propose three strategies for the generation process. The first generates latent variables using the default condition $Prior$, referred to as the $default$ strategy. The second uses the full condition $Prior$ and provides all the required properties as input, referred to as the $full$ strategy. The third uses the full condition $Prior$ but inputs only a subset of properties, referred to as the $less$ strategy; the missing properties are randomly sampled from the training set based on the provided ones to complete the model's input. These three strategies are labeled "D", "F", and "L" in the following figures and tables.

After generating latent variables with the $Prior$, we employ the $Predictor$ block to filter them, selecting those that best meet our predefined criteria to feed into the $Decoder$ for crystal generation. The filtering ratio is 20000:200, meaning that for each test mentioned in Section 3, two hundred crystals are generated. A similar method is used in DALL-E2[14].
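A sketch of this filter step under assumed interfaces (`prior_sample`, `predictor`, and `decoder` are placeholders): sample many latents, rank them by predicted property error, and decode only the best 200 of 20000.

```python
import torch

def generate_filtered(prior_sample, predictor, decoder, target,
                      n_total=20000, n_keep=200):
    """Keep the latents whose predicted property is closest to the target
    before decoding them into crystals (filtering ratio 20000:200)."""
    zs = prior_sample(n_total)                              # latents from the Prior
    errors = (predictor(zs) - target).abs().squeeze(-1)     # property error per latent
    keep = errors.topk(n_keep, largest=False).indices       # smallest errors win
    return [decoder(z) for z in zs[keep]]
```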

Table 1: Performance of the CGCNN models used for validation, evaluated on the test set of each sub-dataset.
Dataset   Regression BG    Regression FE     Classification of BG   Classification of FE
          MAE (eV)↓        MAE (eV/atom)↓    Accuracy (%)↑          Accuracy (%)↑
MP20      0.362            0.095             86.9                   97.0
MP40      0.359            0.081             86.9                   98.4
OQMD      0.124            0.059             95.6                   98.5

2.3 Evaluation of the generated crystals

We use the pymatgen Python package[22] and several trained CGCNN models to quickly validate the generation performance of the model. We analyze the crystal system of the generated crystals via the SpacegroupAnalyzer utility from pymatgen with the parameters symprec=0.2 and angle_tolerance=5. Two regression models were trained separately on band gap and formation energy. Two classification models were trained with thresholds based on whether the band gap is 0 eV and whether the formation energy is greater than -0.9 eV/atom; these two thresholds were selected based on the data distribution of the MP database. The performance of CGCNN on the test set is shown in Table 1. Although ML-based verification has errors, it is sufficient to reflect the ability of Con-CDVAE to generate crystals according to target properties, and it provides a preliminary check before carrying out a large number of DFT calculations.
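For instance, the crystal-system check can be reproduced with pymatgen's SpacegroupAnalyzer using the stated tolerances:

```python
from pymatgen.core import Structure
from pymatgen.symmetry.analyzer import SpacegroupAnalyzer

def crystal_system(cif_path):
    """Determine the crystal system of a generated structure with the
    tolerances used in this work (symprec=0.2, angle_tolerance=5)."""
    structure = Structure.from_file(cif_path)
    sga = SpacegroupAnalyzer(structure, symprec=0.2, angle_tolerance=5)
    return sga.get_crystal_system()   # e.g. "cubic", "triclinic", ...
```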

3 Experiments

Figure 2: Predicted band gap distributions of crystals generated by Con-CDVAE. The first to third rows are the results obtained by training with the MP20, MP40, and OQMD datasets, respectively. The first to third columns are crystals generated using the $default$, $full$, and $less$ strategies, respectively.

To train Con-CDVAE, we use data obtained from MP and OQMD and filter it into three sub-datasets. We first filter for atomic number density greater than 0.01 atom/Å³, energy above hull less than 0.5 eV/atom, and band gap less than 8 eV. Subsequently, selecting cells with no more than 20, 40, and 20 atoms, we obtain three sub-datasets called MP20, MP40, and OQMD20 (for simplicity, referred to as OQMD in the following text). There are 71665 crystals in MP20, 108039 in MP40, and 616412 in OQMD. The distributions of band gap, formation energy, and crystal system in the three sub-datasets are shown in Figure A1. In all three sub-datasets, materials with zero band gap constitute a significant proportion. The formation energy distributions of MP20 and MP40 exhibit two peaks, around 0 eV/atom and -2.5 eV/atom, while OQMD shows a unimodal distribution around 0 eV/atom. The crystal system distribution of MP20 and MP40 is more uniform than that of OQMD, where cubic structures are more prevalent. We split each sub-dataset into training, validation, and test sets with a ratio of 8:1:1.
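A sketch of these filters, assuming pymatgen Structure objects and per-entry `e_hull` and `band_gap` values:

```python
from pymatgen.core import Structure

def keep_entry(structure: Structure, e_hull: float, band_gap: float,
               max_atoms: int = 20) -> bool:
    """Apply the dataset filters: atomic number density > 0.01 atom/Å^3,
    energy above hull < 0.5 eV/atom, band gap < 8 eV, at most max_atoms per cell."""
    density = len(structure) / structure.volume   # atoms per Å^3
    return (density > 0.01 and e_hull < 0.5
            and band_gap < 8.0 and len(structure) <= max_atoms)
```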

3.1 Generating with single condition

Figure 3: Predicted formation energy distributions of crystals generated by Con-CDVAE. The first to third rows are the results obtained by training with the MP20, MP40, and OQMD datasets, respectively. The first to third columns are crystals generated using the $default$, $full$, and $less$ strategies, respectively.

We initially trained Con-CDVAE separately using band gap (BG), formation energy (FE), and crystal system (CS) as a single condition. Three independent models were trained on the three sub-datasets, and all were tested using the three strategies. When using the $less$ strategy for crystal generation conditioned on the crystal system, we input only the crystal system into the full condition $Prior$. When generating conditioned on band gap or formation energy, we input only these two properties, with the missing conditions randomly sampled from the training set. Figure 2 illustrates crystal generation for target band gaps of 0 eV, 3 eV, and 6 eV (blue, yellow, and green bars, respectively), with the resulting band gap distributions shown as histograms. Different subplots represent different datasets and strategies: the first to third rows correspond to MP20, MP40, and OQMD, and the first to third columns correspond to the $default$, $full$, and $less$ strategies. Figure 3 shows crystal generation for target formation energies of 0 eV/atom, -2 eV/atom, and -4 eV/atom in the same format.

Figure 4: Predicted band gap versus predicted formation energy of crystals generated by Con-CDVAE. The first to third rows are the results obtained by training with the MP20, MP40, and OQMD datasets, respectively. The first to third columns are crystals generated using the $default$, $full$, and $less$ strategies, respectively.

All three single-condition tests indicate that the results generated using the $full$ strategy best match the set condition, achieving the intended purpose of the $full$ strategy's design. However, the $full$ strategy limits the model's applicability: in practice, we often have requirements for only one or a few properties and cannot know all the information about an unknown crystal in advance. We therefore designed the $less$ strategy. However, the test results show that the $less$ strategy, based on random sampling, cannot maintain the same capability as the $full$ strategy and is not significantly better than the $default$ strategy. We believe that increasing the amount of training data can enhance the model's performance, as seen in the $default$ strategy for the crystal system and the $full$ strategy for the formation energy. The distribution and quality of the data also affect training. For instance, the higher proportion of cubic materials in the OQMD dataset leads to better performance when generating cubic crystals, compared with models trained on the other two databases. In addition, OQMD has a lower proportion of materials with formation energy below -2 eV/atom, resulting in slightly poorer performance when generating crystals with a formation energy of -4 eV/atom.

Comparing the results from the three types of tests, we find that the generation results based on the condition of formation energy are superior to those based on band gap, which, in turn, are better than those based on crystal system. This is because the mapping from chemical and structural information to formation energy is simpler than the mapping to band gap, and the corresponding inverse mapping is also simpler. The inferior results in crystal system generation are attributed to our use of the pymatgen library for crystal system determination, which imposes stricter tolerance conditions. Crystal structures generated directly using the diffusion model without undergoing DFT relaxation may introduce errors that exceed this tolerance threshold.

Although Con-CDVAE generates better results in regions with sufficient training data, we can still see that it has the ability to perform conditional generation in regions with less data, such as the result generated with a band gap of 3 eV.

We also try to generate crystals with the crystal system as the condition. The results are shown in Appendix B, Table B1, which shows that Con-CDVAE is not very good at generating structures of fixed space groups, because CDVAE cannot well guarantee the symmetry of the generated crystals. Further relaxation with DFT may be needed to better reveal the symmetry of the generated crystals, which is an area to be improved in the future.

3.2 Generating with combination condition

Then, we use band gap and formation energy as combination conditions to train Con-CDVAE to test the generation ability of the model under multiple conditions. The results are shown in Figure 4, where each point represents a generated crystal, and the horizontal and vertical coordinates of the points represent the formation energy and band gap predicted by CGCNN respectively. Points of different colors represent different conditions set during generation. Red indicates that the band gap is set to 3 eV and the formation energy is set to -2 eV/atom. Yellow indicates that the band gap is set to 0 eV and the formation energy is set to -2 eV/atom. Blue indicates that the band gap is set to 3 eV and the formation energy is set to 0 eV/atom. Green indicates that the band gap is set to 0 eV and the formation energy is set to 0 eV/atom.

As can be seen from the figure, Con-CDVAE is better at generating crystals with a specific formation energy than with a specific band gap (the data points are distributed longitudinally). Comparing Figure 2 and Figure 4, we find that combination-condition generation degrades the model's performance, but the model still retains some ability to generate under combined conditions, especially with the $full$ strategy.

To reduce the difficulty of generation, we designed a joint conditional generation test with discrete attributes based on formation energy and band gap. Crystals with zero band gap were labeled 'zero' and the rest 'non-zero', while crystals with formation energy greater than -0.9 eV/atom were labeled 'high' and the rest 'low'. The model was trained on these labels, and the final results are shown in Table 2.

Figure 5: t-SNE of the crystals' latent variables obtained by Con-CDVAE. The first row is obtained by training with the $Predictor$ block, the second row without it. The first to third columns are the results obtained by training with formation energy, band gap, and crystal system as the condition, respectively.
Table 2: Generation success rates of models trained on different datasets with different strategies, using classifications of band gap (BG) and formation energy (FE) as generation conditions.
Target            BG: zero      BG: zero     BG: non-zero   BG: non-zero
classification    FE: high (%)  FE: low (%)  FE: high (%)   FE: low (%)
MP20-D            0.0           0.0          53.5           99.5
MP20-F            100.0         74.0         99.5           22.0
MP20-L            92.5          5.0          96.0           99.5
MP40-D            90.5          11.0         90.0           99.0
MP40-F            100.0         51.5         43.5           100.0
MP40-L            90.0          60.0         99.0           98.5
OQMD-D            88.5          37.0         27.0           92.0
OQMD-F            96.5          90.5         46.0           39.0
OQMD-L            88.5          16.0         60.0           62.0

3.3 Ablation experiment for Predictor block

Using MP20, we train three Con-CDVAE models without the $Predictor$ block, with formation energy, band gap, and crystal system as the single condition, i.e., setting the hyperparameter $\alpha$ in the joint loss function $\mathfrak{L}_{joint}$ to 0. After training, the $Encoder$ is used to obtain the latent variables of the crystals in the test set, and t-SNE[23] is used for dimensionality reduction to draw a two-dimensional scatter plot. Corresponding plots are then drawn using the three MP20-trained models mentioned in Section 3, with the results shown in Figure 5.
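A minimal sketch of this visualization with scikit-learn's t-SNE, assuming the encoder latents and property values are already collected as arrays:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def tsne_plot(z, props):
    """Project encoder latents (N, latent_dim) to 2D with t-SNE and
    color the points by their property values."""
    z2 = TSNE(n_components=2).fit_transform(np.asarray(z))
    plt.scatter(z2[:, 0], z2[:, 1], c=props, s=5, cmap="viridis")
    plt.colorbar(label="property value")
    plt.show()
```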

The three subplots in the first row of the figure depict the results without the $Predictor$ block, while the second row shows the results with it. In the first column, the closer the color is to blue and the larger the point, the higher the formation energy of the corresponding crystal. In the second column, the closer the color is to blue and the larger the point, the larger the band gap; here the logarithm of the band gap is taken, since most crystals have a band gap of 0. In the third column, points of different colors represent crystals of different crystal systems. As the figure shows, the $Predictor$ effectively aggregates crystals with similar properties in the latent space, which is important for structuring the latent space.

3.4 Validation by DFT

Figure 6: The predicted formation energy versus the calculated formation energy of 135 crystals generated by Con-CDVAE. The blue, yellow, and green dots represent crystals generated using the $default$, $full$, and $less$ strategies, respectively. Circles, triangles, and squares represent target formation energies of 0, -2, and -4 eV/atom, respectively, set during generation.

To verify the soundness of the crystals generated by Con-CDVAE, we randomly selected 135 model-generated materials for DFT relaxation from the test with formation energy as the single condition (three formation energy conditions and three strategies give nine cases; 15 crystals were randomly selected in each case). We also calculated their formation energies.

The result is shown in Figure 6. The vertical coordinate is the formation energy predicted by CGCNN, the horizontal coordinate is the formation energy calculated by DFT, and the eight crystals that failed to relax are labeled "Err". Dots of different shapes represent crystals generated with different formation energy conditions, and dots of different colors represent different generation strategies. The DFT calculations further verify the ability of Con-CDVAE to generate crystals according to the conditions.

As shown in Table 3, we calculate the mean absolute change (MAC) of the lattice constants and lattice angles between the initially generated structures and the DFT-relaxed structures, and use StructureMatcher from pymatgen to calculate the root mean squared displacement (RMSD).

Table 3: The MAC of the lattice constants and lattice angles, and the RMSD of the crystals before and after relaxation.
MAC_a (Å)   MAC_b (Å)   MAC_c (Å)   MAC_α (°)   MAC_β (°)   MAC_γ (°)   RMSD (Å)
0.376       0.397       0.740       4.694       4.727       3.818       0.307
Figure 7: Five structures generated by Con-CDVAE that are similar to crystals from MP[3]. The second row of crystals is taken from MP, labeled with their IDs in the MP database.

For the 127 crystals that relaxed successfully, we use StructureMatcher with the default parameters ltol=0.2, stol=0.3, and angle_tol=5 to check whether similar materials already exist in MP. We found 22 pairs in total, and Figure 7 shows 5 of them. Notably, mp-981 does not appear in the training set, which shows that our model can generate crystals outside of the training set. In Figure 8 we show five generated crystals for which no matching material was found in MP.
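The matching step can be sketched with pymatgen's StructureMatcher and the parameters quoted above; the brute-force double loop is for illustration only:

```python
from pymatgen.analysis.structure_matcher import StructureMatcher

matcher = StructureMatcher(ltol=0.2, stol=0.3, angle_tol=5)

def find_matches(generated, mp_structures):
    """Return (generated, reference) pairs that StructureMatcher considers
    the same structure; matcher.get_rms_dist(s1, s2) gives the RMSD
    reported in Table 3."""
    return [(g, ref) for g in generated for ref in mp_structures
            if matcher.fit(g, ref)]
```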

Figure 8: Five structures generated by Con-CDVAE for which no similar crystals were found in the MP database.

4 Conclusion

In this paper, we propose a diffusion model, Con-CDVAE, that can effectively generate crystals according to target properties. We design a two-step model training method and three generation strategies. We test the model under a variety of conditions, including single and combination conditions, and check the results with ML models. We find that although the model performs best in regions with sufficient data, it can also generate crystals in regions with insufficient data. We then perform preliminary ablation experiments and demonstrate the importance of the $Predictor$ block for building the spatial structure of the latent variables. Finally, we perform a simple DFT validation, further confirming the soundness of the crystals generated by Con-CDVAE, as well as the model's ability to generate crystals based on target properties and to generate crystals outside the training set.

Of course, there are areas of our work that can be further improved. The model needs to better recognize the symmetry information of crystals so as to generate crystals with specific space groups. How to improve the model to reduce the MAC and RMSD of the generated crystals before and after relaxation also requires further investigation. Other physical properties with more practical applications are worth trying as well, such as the critical temperature of superconducting materials.

Appendix A: Data distribution

Refer to caption
Figure A1: Distribution of band gap, formation energy and crystal system in three sub-datasets.

Figure A1 shows the distributions of the three datasets used in this paper over the three attributes: band gap, formation energy, and crystal system. In all three datasets, crystals with zero band gap occupy a large fraction, while the data distribution over formation energy is relatively broad. We think this is one of the important reasons why conditional generation on the band gap looks worse than conditional generation on the formation energy.

Comparing the OQMD and MP datasets, we find that the formation energies in OQMD are mostly concentrated near 0 eV/atom, while the MP formation energies show two peaks. This indicates that the average stability of the crystals in MP is higher than in OQMD, and also reflects that the vast material space is still not fully explored. In OQMD, the cubic crystal system accounts for a large share of the materials, while in MP the crystal systems are distributed more evenly.

Appendix B: Generate crystals with crystal system as condition

Table B1: Generation success rates of models trained on different datasets with different strategies, using different crystal systems as the generation condition.
target CS tric. (%) mono. (%) orth. (%) tetra. (%) trig. (%) hexag. (%) cubic (%)
MP20-D 100.0 0.0 0.0 2.0 0.0 0.0 0.0
MP20-F 100.0 29.0 83.0 74.0 81.0 0.0 93.0
MP20-L 100.0 0.0 0.0 3.0 4.5 0.0 31.0
MP40-D 100.0 0.0 0.0 0.0 3.0 0.0 0.0
MP40-F 100.0 0.0 3.0 24.0 23.5 0.0 68.0
MP40-L 92.0 3.0 1.0 0.0 1.0 0.0 14.5
OQMD-D 86.0 0.0 22.0 27.0 25.5 7.5 93.5
OQMD-F 48.0 57.5 3.0 88.5 87.0 0.0 100.0
OQMD-L 100.0 0.5 37.5 9.5 21.0 22.0 2.0

Table B1 shows the results of crystal generation with the crystal system as the condition. The generation of triclinic systems seems good, but this is because many of the generated crystals cannot be recognized as having any higher symmetry. An important reason for this is that the tolerance we set in pymatgen when identifying crystal symmetry is 0.2 Å (twenty times the default value), while the RMSD before and after DFT relaxation already exceeds this tolerance, reaching 0.307 Å. This suggests that the atomic positions of the crystals generated by our model need to be more precise. Another reason may be that generating specific crystal systems and symmetries requires a specially designed model in order to perform well at the current data volume and parameter count.

Appendix C: DFT parameter

In this study, the relaxation of crystal structures is performed using the Vienna Ab initio Simulation Package (VASP-5.4.4)[24]. The lattice shape, unit cell volume, and atomic positions are allowed to change. The PBE exchange-correlation functional[25] is employed, and the KPOINTS file is generated using vaspkit[26] with the Kmesh-Resolved Value set to 0.03 (2π/Å). The cutoff energy for the plane-wave basis set is set to 1.5 times the largest ENMAX value in the POTCAR file. The relaxation is terminated when the norms of all atomic forces are smaller than 0.001 eV/Å. A conjugate gradient algorithm is used to reach the instantaneous ground state of the atoms.
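An illustrative pymatgen snippet encoding these settings as an INCAR; the specific tags (ISIF=3 for cell-plus-position relaxation, IBRION=2 for conjugate gradient, EDIFFG=-0.001 for the force criterion) are our reading of the text, not the authors' published input files:

```python
from pymatgen.io.vasp.inputs import Incar

incar = Incar({
    "ISIF": 3,         # relax lattice shape, cell volume, and atomic positions
    "IBRION": 2,       # conjugate gradient relaxation
    "EDIFFG": -0.001,  # stop when all force norms fall below 0.001 eV/Å
    "ENCUT": 520,      # placeholder; set to 1.5x the largest ENMAX in POTCAR
})
incar.write_file("INCAR")
```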

Code availability

Our code is available at https://github.com/cyye001/Con-CDVAE.

Note

During the development of the Con-CDVAE model, we became aware of a preprint by Tian Xie et al. that proposes a diffusion model[27] with the same aim as ours, but with a different method for realizing conditional generation.

Acknowledgment

This work was supported by the National Key Research and Development Program of China (Grant Nos. 2023YFA1607400 and 2022YFA1403800), the National Natural Science Foundation of China (Grant Nos. 12274436, 11925408, and 11921004), and the Science Center of the National Natural Science Foundation of China (Grant No. 12188101). H.W. acknowledges support from the Informatization Plan of the Chinese Academy of Sciences (CASWX2021SF-0102) and the New Cornerstone Science Foundation through the XPLORER PRIZE.

References

  • [1] Alec Belsky, Mariette Hellenbrandt, Vicky Lynn Karen, and Peter Luksch. New developments in the inorganic crystal structure database (icsd): accessibility in support of materials research and design. Acta Crystallographica Section B: Structural Science, 58(3):364–369, 2002.
  • [2] Mariette Hellenbrandt. The inorganic crystal structure database (icsd)—present and future. Crystallography Reviews, 10(1):17–22, 2004.
  • [3] Anubhav Jain, Shyue Ping Ong, Geoffroy Hautier, Wei Chen, William Davidson Richards, Stephen Dacek, Shreyas Cholia, Dan Gunter, David Skinner, Gerbrand Ceder, et al. Commentary: The materials project: A materials genome approach to accelerating materials innovation. APL materials, 1(1), 2013.
  • [4] James E Saal, Scott Kirklin, Muratahan Aykol, Bryce Meredig, and Christopher Wolverton. Materials design and discovery with high-throughput density functional theory: the open quantum materials database (oqmd). Jom, 65:1501–1509, 2013.
  • [5] Ankit Agrawal and Alok Choudhary. Perspective: Materials informatics and big data: Realization of the “fourth paradigm” of science in materials science. Apl Materials, 4(5):053208, 2016.
  • [6] Tian Xie and Jeffrey C Grossman. Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties. Physical review letters, 120(14):145301, 2018.
  • [7] Chao Liang, Yilimiranmu Rouzhahong, Caiyuan Ye, Chong Li, Biao Wang, and Huashan Li. Material symmetry recognition and property prediction accomplished by crystal capsule representation. Nature Communications, 14(1):5198, 2023.
  • [8] Oliver T Unke, Stefan Chmiela, Huziel E Sauceda, Michael Gastegger, Igor Poltavsky, Kristof T Schutt, Alexandre Tkatchenko, and Klaus-Robert Muller. Machine learning force fields. Chemical Reviews, 121(16):10142–10186, 2021.
  • [9] Chi Chen and Shyue Ping Ong. A universal graph deep learning interatomic potential for the periodic table. Nature Computational Science, 2(11):718–728, 2022.
  • [10] Amil Merchant, Simon Batzner, Samuel S Schoenholz, Muratahan Aykol, Gowoon Cheon, and Ekin Dogus Cubuk. Scaling deep learning for materials discovery. Nature, pages 1–6, 2023.
  • [11] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • [12] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • [13] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2:3, 2023.
  • [14] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
  • [15] Zekun Ren, Siyu Isaac Parker Tian, Juhwan Noh, Felipe Oviedo, Guangzong Xing, Jiali Li, Qiaohao Liang, Ruiming Zhu, Armin G Aberle, Shijing Sun, et al. An invertible crystallographic representation for general inverse design of inorganic crystals with targeted properties. Matter, 5(1):314–335, 2022.
  • [16] Niklas Gebauer, Michael Gastegger, and Kristof Schütt. Symmetry-adapted generation of 3d point sets for the targeted discovery of molecules. Advances in neural information processing systems, 32, 2019.
  • [17] Niklas WA Gebauer, Michael Gastegger, Stefaan SP Hessmann, Klaus-Robert Müller, and Kristof T Schütt. Inverse design of 3d molecular structures with conditional generative neural networks. Nature communications, 13(1):973, 2022.
  • [18] Tian Xie, Xiang Fu, Octavian-Eugen Ganea, Regina Barzilay, and Tommi Jaakkola. Crystal diffusion variational autoencoder for periodic material generation. arXiv preprint arXiv:2110.06197, 2021.
  • [19] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015.
  • [20] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019.
  • [21] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  • [22] Shyue Ping Ong, William Davidson Richards, Anubhav Jain, Geoffroy Hautier, Michael Kocher, Shreyas Cholia, Dan Gunter, Vincent L Chevrier, Kristin A Persson, and Gerbrand Ceder. Python materials genomics (pymatgen): A robust, open-source python library for materials analysis. Computational Materials Science, 68:314–319, 2013.
  • [23] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008.
  • [24] Georg Kresse and Jürgen Furthmüller. Efficiency of ab-initio total energy calculations for metals and semiconductors using a plane-wave basis set. Computational materials science, 6(1):15–50, 1996.
  • [25] John P Perdew, Kieron Burke, and Matthias Ernzerhof. Generalized gradient approximation made simple. Physical review letters, 77(18):3865, 1996.
  • [26] Vei Wang, Nan Xu, Jin-Cheng Liu, Gang Tang, and Wen-Tong Geng. Vaspkit: A user-friendly interface facilitating high-throughput computing and analysis using vasp code. Computer Physics Communications, 267:108033, 2021.
  • [27] Claudio Zeni, Robert Pinsler, Daniel Zügner, Andrew Fowler, Matthew Horton, Xiang Fu, Sasha Shysheya, Jonathan Crabbé, Lixin Sun, Jake Smith, et al. Mattergen: a generative model for inorganic materials design. arXiv preprint arXiv:2312.03687, 2023.