License: CC BY-NC-SA 4.0
arXiv:2310.04925v2 [cs.LG] 13 Dec 2023

Crystal-GFN: sampling crystals with desirable properties and constraints

Mila AI4Science: Alex Hernandez-Garcia (Mila, Quebec AI Institute; Université de Montréal), Alexandre Duval (Mila, Quebec AI Institute; CentraleSupélec, Université Paris-Saclay), Alexandra Volokhova (Mila, Quebec AI Institute; Université de Montréal), Yoshua Bengio (Mila, Quebec AI Institute; Université de Montréal), Divya Sharma (Johns Hopkins University), Pierre Luc Carrier (Mila, Quebec AI Institute), Yasmine Benabed (Université de Montréal; Centre d’excellence en électrification des transports et stockage d’énergie, Hydro-Québec), Michał Koziarski (Mila, Quebec AI Institute; Université de Montréal), Victor Schmidt (Mila, Quebec AI Institute; Université de Montréal)
Abstract

Accelerating material discovery holds the potential to greatly help mitigate the climate crisis. Discovering new solid-state materials such as electrocatalysts, super-ionic conductors or photovoltaic materials can have a crucial impact, for instance, in improving the efficiency of renewable energy production and storage. In this paper, we introduce Crystal-GFN, a generative model of crystal structures that sequentially samples structural properties of crystalline materials, namely the space group, composition and lattice parameters. This domain-inspired approach enables the flexible incorporation of physical and structural hard constraints, as well as the use of any available predictive model of a desired physicochemical property as an objective function. To design stable materials, one must target the candidates with the lowest formation energy. Here, we use as objective the formation energy per atom of a crystal structure predicted by a new proxy machine learning model trained on MatBench. The results demonstrate that Crystal-GFN is able to sample highly diverse crystals with low (median -3.1 eV/atom) predicted formation energy.

1  Introduction


[Figure 1: schematic of the generation pipeline — crystal-lattice system and point symmetry → space group → composition → lattice parameters → proxy model reward R(x); see caption below.]

Figure 1: A schematic of the crystal generation process of Crystal-GFN. First, the space group is selected, in turn decomposed into the selection of a crystal-lattice system and a point symmetry. Then, a composition is generated by iteratively selecting an element type and its quantity. Finally, the lattice parameters of the unit cell are sampled. We introduce hard constraints, denoted by C_i, within and between these components, as described in Section 4.

Materials discovery plays a vital role in transforming numerous industries that are currently responsible for a significant fraction of the global greenhouse gas emissions. From developing cutting-edge photovoltaic panels or new-generation solid-state batteries to tackling carbon capture or catalysis, the quest for innovative materials with targeted properties has the power to reshape the technological landscape.

However, the discovery of new materials is an extremely challenging task at various levels of innovation: crystal structure generation, property prediction and the successful development of synthesis protocols. In practice, given the vast space of conceivable materials, scientists aiming to design a material with a desirable property must use their domain expertise and intuition to restrict the search space to just a few promising structures. This traditional trial-and-error process is extremely costly in terms of both time and effort, providing machine learning (ML) with an opportunity to greatly accelerate the generation and evaluation of promising candidates.

While ML has already made significant progress with respect to property prediction and generation of small molecules (Long et al., 2021; Ren et al., 2022; Pakornchote et al., 2023; Zheng et al., 2023), its impact in the field of solid-state energy materials lags behind. At the root of this discrepancy lies the crystalline nature of their structure, which imposes periodic boundary conditions that make it difficult to use finite graphs for crystal representation. As a result, the sole task of generating a stable candidate is highly complex. It involves exploring the combinatorial space of chemical compositions, considering all possible periodic arrangements of atoms as well as respecting the complex physics that make them viable, stable and metastable structures.

Indeed, the generation of crystal 3D structures involves a multitude of factors, including specific bonding preferences between different atom types, geometric and chemical constraints, and atomic positions correlated with local energy minima defined by quantum mechanics. Overall, current methods often struggle to capture these complexities in the generation process and fail to consistently generate valid structures. Besides, most such approaches do not enable an efficient search through the crystal space, or a search for crystals with a specific property.

In this work, we introduce Crystal-GFN (Fig. 1), a generative model designed to sequentially sample crystal structures in a space inspired by theoretical crystallography. Instead of directly generating atomic positions, Crystal-GFN samples the space group, the composition and the lattice parameters of a crystal structure (Section 4). This domain-inspired approach facilitates the incorporation of physicochemical and structural constraints, and allows the possibility of extending the sampling space, for example to obtain atomic positions. The GFlowNet framework offers the ability to sample proportionally to a reward function, which enables diverse sampling. Here, we first trained a predictive model of the formation energy per atom and used it as the reward function to train a Crystal-GFN for 12 hours on a CPU-only machine. In our evaluation in Section 5, we sampled 10,000 compounds from the trained Crystal-GFN to demonstrate that it is able to sample diverse candidates with low predicted formation energy: 95 % of the sampled structures have a predicted energy lower than -2 eV/atom and the overall median is -3.1 eV/atom.

2  Related Work

A substantial body of generative methods exists for molecular and crystal structure generation, generally falling into one of these categories: Variational Auto-Encoders (VAEs) (Ren et al., 2022; Pakornchote et al., 2023), Generative Adversarial Networks (GANs) (Cao & Kipf, 2018; Nouira et al., 2018; Long et al., 2021), normalising flows (Satorras et al., 2021; Ahmad & Cai, 2022) and diffusion models (Zheng et al., 2023; Xu et al., 2022). In general, these methods are trained to reconstruct the training data from a latent distribution by maximising its likelihood or minimising the discrepancy between generated samples and the real distribution. Once trained, they can generate new instances by sampling from the learned latent space.

Hoffmann et al. (2019); Long et al. (2021); Court et al. (2020) treat materials as 3D voxel images, but the process of decoding images back to atom types and coordinates often results in low validity. Several methods (Ren et al., 2020; Kim et al., 2020; Zhao et al., 2021) encode unit-cell-related information to generate crystals. GANCSP (Kim et al., 2020) and CubicGAN (Zhao et al., 2021) use fractional coordinates, element properties and lattice parameters as input to build models that generate crystals conditioned on composition, or on both composition and space group. Such models are generally not invariant to Euclidean transformations.

CDVAE (Xie et al., 2021) employs a diffusion model in conjunction with a VAE to sample new materials, leveraging lattice parameters. However, it does not allow imposing domain-based constraints on the generation process. Overall, directly sampling 3D positions without considering energy minimisation can lead to physically unrealistic and unstable structures. Cheng et al. (2022) leveraged physics- and symmetry-based constraints to solve an optimisation problem over crystal graphs to obtain atomic coordinates, but their method requires a composition and lattice parameters as input.

Different from the above, autoregressive methods like G-SchNet (Gebauer et al., 2019) or G-SphereNet generate 3D molecules sequentially, placing atoms one by one. They are unaware of periodicity and cannot generate the lattice, making them ill-suited for crystals. This also applies to flow models (Shi et al., 2020).

As most methods in the literature directly sample 3D positions, they struggle to embed essential physicochemical constraints such as energy minimisation, geometric constraints on dihedral angles or bonding preferences between different atom types, often leading to physically unrealistic and unstable structures. In our proposed model, we incorporate domain knowledge from materials science and crystallography to sequentially build materials by selecting the space group, composition and lattice parameters while imposing compatibility and validity constraints. Therefore, the generated crystals adhere to additional fundamental principles and constraints, resulting in more physically viable structures.

3  Background

In this section, we briefly review the necessary background on GFlowNets and crystallography, before describing our proposed method, Crystal-GFN. Appendix A provides additional background about materials and crystal structures.

3.1  GFlowNets

Generative Flow Networks (GFlowNets, or GFN for short) were introduced by Bengio et al. (2021) as an amortised inference method to sample from high-dimensional distributions where traditional methods, such as MCMC and reinforcement learning, are inefficient in terms of mode mixing. In essence, GFlowNets are designed to learn a sampling policy π to generate objects x ∈ X proportionally to a non-negative reward function R(x), that is, π(x) ∝ R(x). This facilitates the discovery of multiple modes of a given reward function, which is a desirable objective in scientific discovery (Jain et al., 2023). For instance, GFlowNets have been successfully applied to biological sequence design and molecular property prediction as part of active learning algorithms (Jain et al., 2022; Hernandez-Garcia et al., 2023).

A key property of GFlowNets is that objects are generated sequentially. Starting from a special state s_0, transitions s_t → s_{t+1} ∈ A are applied between states s ∈ S, forming trajectories τ = (s_0 → s_1 → … → x), where A is a predefined action space and S the state space. This sequential construction of objects, together with the fact that the transition policy P_θ(s_{t+1} | s_t) is parameterised by a neural network (with parameters θ), provides amortisation and the potential of systematic generalisation. Generalisation is possible if the decomposition into partially constructed objects provides a structure that can be learned by a machine learning model.
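The sequential construction described above can be sketched as a generic rollout loop. Note that the uniform policy, the callback names and the toy state representation below are illustrative stand-ins of ours, not the actual Crystal-GFN implementation, where the policy is a learned network P_θ(s_{t+1} | s_t):

```python
import random

def sample_trajectory(actions_fn, step_fn, is_terminal_fn, s0):
    """Roll out one trajectory s0 -> s1 -> ... -> x.

    actions_fn(s) returns the valid actions at state s (a mask over
    actions is how hard constraints can be enforced), step_fn(s, a)
    applies a transition, and is_terminal_fn(s) signals that the
    object x is complete. The policy here is uniform over valid
    actions, purely for illustration.
    """
    trajectory = [s0]
    s = s0
    while not is_terminal_fn(s):
        a = random.choice(actions_fn(s))
        s = step_fn(s, a)
        trajectory.append(s)
    return trajectory
```

In a trained GFlowNet, `random.choice` would be replaced by sampling from the learned forward policy, and the reward R(x) of the terminal state would enter the training objective (e.g. Trajectory Balance).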

In order to obtain a policy that samples proportionally to the reward distribution, several training objectives have been proposed in the literature. Here, we will make use of the Trajectory Balance objective (Malkin et al., 2022), which has proven effective in various tasks (Jain et al., 2022; Hernandez-Garcia et al., 2023; Malkin et al., 2022). While GFlowNets were initially introduced for discrete probabilistic modelling, Lahlou et al. (2023) recently proposed a generalisation of the framework for continuous or hybrid state spaces. This work empirically demonstrates this generalisation, as we propose a GFlowNet that operates in a hybrid—mixture of discrete and continuous—state space.

3.2  Crystals

Crystals are highly structured solid materials defined by a repeated arrangement of atoms in space. The elementary pattern that periodically repeats itself is called a unit cell. This unique ordering constitutes the material’s signature and greatly influences its properties. Therefore, understanding crystal structures and the structure-property relationship is crucial in the development of new materials. While there exists an infinite number of crystal structures, the field of theoretical crystallography has developed ways to systematically parameterise and classify crystals according to their symmetry.

Lattice

An n-dimensional lattice Λ can be defined as the set of integral combinations of the linearly independent, non-coplanar lattice basis vectors a_i ∈ R^n: Λ ≐ {Σ_{i=1}^n m_i a_i | m_i ∈ Z}, given a reference origin. In 3D, the 3 lattice basis vectors can be converted into 6 parameters: a, b, c determine the lengths of each dimension and α, β, γ the angles between dimensions.
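As a small illustration of the definition above, the following sketch (a hypothetical helper of ours, not from the paper) enumerates the integral combinations of the basis vectors for a small range of coefficients m_i:

```python
import itertools
import numpy as np

def lattice_points(basis, m_range=range(-1, 2)):
    """Enumerate lattice points sum_i m_i * a_i for m_i in m_range.

    `basis` is an (n, n) array whose rows are the lattice basis
    vectors a_i; the full lattice is the infinite set obtained by
    letting m_range extend over all of Z.
    """
    basis = np.asarray(basis, dtype=float)
    n = len(basis)
    return np.array([sum(m[i] * basis[i] for i in range(n))
                     for m in itertools.product(m_range, repeat=n)])

# Simple cubic lattice with a = b = c = 1: the 27 points of [-1, 1]^3
pts = lattice_points(np.eye(3))
```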

Composition

We call composition the set of all atom types comprised in the crystal’s unit cell, together with their respective quantities.

Space group

The specific arrangement of atoms in a crystal generates some degree of symmetry. Symmetry is achieved when an operation applied to an initial arrangement of atoms leaves it unchanged. The combination of all symmetry elements present in a crystal defines its space group. A space group element can be described as a tuple (W, t), where W is an orthogonal matrix and t a translation vector. An element maps a vector x ∈ R^n to Wx + t. In three dimensions, space groups are classified into 230 types.
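The action of a space group element can be illustrated directly. The specific operation below, a two-fold rotation about z combined with a half-lattice translation, is just an example of ours, not tied to any particular group discussed in the paper:

```python
import numpy as np

def apply_symmetry(W, t, x):
    """Apply a space-group element (W, t): x -> W x + t."""
    return np.asarray(W) @ np.asarray(x) + np.asarray(t)

# Two-fold rotation about the z axis plus a half translation along z
W = np.diag([-1.0, -1.0, 1.0])
t = np.array([0.0, 0.0, 0.5])
y = apply_symmetry(W, t, np.array([0.25, 0.1, 0.0]))  # -> [-0.25, -0.1, 0.5]
```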

4  Crystal-GFNs

In this section, we describe the details of Crystal-GFN, the main method we introduce in this paper to explore the space of crystal structures and generate crystals with desirable properties and domain constraints.

Drawing inspiration from crystallography, we propose to represent crystals as the concatenation of three distinct components or subspaces: the space group (SG), composition (C) and lattice parameters (LP) of the unit cell of a material. The Cartesian product of the spaces of these three components would make the state space S = S_SG × S_C × S_LP and sample space X = X_SG × X_C × X_LP of a naive, unrestricted implementation with GFlowNets. An important advantage of the GFlowNet framework and of the domain-inspired data representation that we propose is that it allows us to flexibly introduce domain knowledge, such as well-studied geometrical constraints from crystallography as well as physicochemical constraints, described below in this section. These constraints are mostly introduced by restricting the action space A of the Crystal-GFN, both within and across subspaces.

In this work, we design the Crystal-GFN such that trajectories first select the space group, then the composition and finally the lattice parameters. Figure 1 offers a graphical summary of the sequential generation in a Crystal-GFN. Below we describe the details of the three subspaces and the constraints we have introduced.

4.1  Space group

In our Crystal-GFN, the sample space for the space group subspace is X_SG = {1, 2, …, 230}, corresponding to the 230 symmetry groups in three dimensions. As discussed in Section 3.1, GFlowNets rely on the decomposition of objects into multiple steps to facilitate generalisation. Therefore, we draw inspiration from theoretical crystallography to incorporate additional structure into the space group subspace.

In three dimensions, lattices admit 14 possible arrangements, known as the Bravais lattices, which are categorised into seven crystal systems: triclinic, monoclinic, orthorhombic, tetragonal, trigonal, hexagonal and cubic. Alternatively, a related but slightly different categorisation is the lattice system, also with seven classes (see Fig. 2). The lattice system category is convenient for our purposes because it imposes specific constraints on the lattice parameters (see C_4 in Fig. 1). Here, we use a derived categorisation resulting from the combination of a crystal system and a lattice system. We refer to it as the crystal-lattice system; it has eight categories: triclinic, monoclinic, orthorhombic, tetragonal, trigonal-rhombohedral, trigonal-hexagonal, hexagonal-hexagonal and cubic.

Besides the crystal-lattice system, we introduce another category in the Crystal-GFN space group subspace: the point symmetry (also known as site symmetry), which defines the type of symmetry of a point group. We use the following five categories of point symmetries: centrosymmetric, non-centrosymmetric, enantiomorphic, polar and enantiomorphic-polar. We choose these categories because the combination of point symmetry and crystal system gives rise to one of the 32 crystal classes or crystallographic point groups.

In sum, our space group subspace S_SG is a three-dimensional set whose entries correspond to the crystal-lattice system, the point symmetry and the space group. We define the action space such that, starting from the source state s_0, the transitions can set any of the three dimensions. However, the selection of an option along any dimension restricts the valid options for the remaining dimensions (constraint C_1 in Fig. 1). For example, if the point symmetry “non-centrosymmetric” is selected from the source state, only three crystal-lattice systems remain valid and the number of valid space groups is reduced to 25. The “stop” action is valid if and only if the space group has been selected.

4.2  Composition

We represent a composition as the number of atoms of each element present in the unit cell. Specifically, for a vocabulary of D elements, where each element can have up to K atoms, we construct D-dimensional vectors whose entries indicate the number of atoms of element d. This yields the space S_C = {(k_1, …, k_D) | k_d ∈ {0, 1, …, K}, d = 1, …, D}. The action space of the composition subspace consists of the choice of element and number of atoms, that is, A_C = {1, …, D} × {1, …, K}, plus the special “stop” action to finish a trajectory.
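A minimal sketch of one composition step follows. The dict-based state and the function name are hypothetical, and we assume, for illustration only, that each element's count is set at most once per trajectory:

```python
def apply_composition_action(state, element, count, K=16):
    """One composition step: set `count` atoms of element `element`.

    `state` is a dict mapping element index d -> number of atoms k_d,
    a sparse view of the D-dimensional vector (k_1, ..., k_D).
    Counts are bounded by K, mirroring the discrete action space
    A_C = {1..D} x {1..K}. Assumption (ours): an element's count is
    set only once per trajectory.
    """
    if element in state:
        raise ValueError("element already set in this trajectory")
    if not 1 <= count <= K:
        raise ValueError(f"count must be in 1..{K}")
    new_state = dict(state)  # states are immutable along a trajectory
    new_state[element] = count
    return new_state
```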

To meet the electroneutrality requirement, the positive and negative charges in a material must balance. Since we have control over the action space of the GFlowNet, we incorporate a hard constraint to ensure that all generated compositions can carry a neutral charge (constraint C_3 in Fig. 1). We argue that this is an advantage over methods that aim to learn such properties implicitly from the data (Xie et al., 2021).
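One way to check whether a composition admits a neutral charge assignment is a brute-force search over oxidation states. This sketch is illustrative only: the oxidation-state lists are our assumptions, and the actual Crystal-GFN enforces the constraint through the action space rather than by post-hoc filtering:

```python
from itertools import product

def can_be_neutral(composition, oxidation_states):
    """Check whether some oxidation-state assignment balances charge.

    `composition` maps element -> atom count; `oxidation_states` maps
    element -> list of allowed oxidation states. The composition can
    be electroneutral iff some choice of one state per element makes
    the total charge zero (brute force over all combinations).
    """
    elements = list(composition)
    for states in product(*(oxidation_states[e] for e in elements)):
        if sum(composition[e] * s for e, s in zip(elements, states)) == 0:
            return True
    return False

# Li2O: 2 * (+1) + 1 * (-2) = 0, so it can be neutral
assert can_be_neutral({"Li": 2, "O": 1}, {"Li": [1], "O": [-2]})
```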

Finally, we also have an opportunity to incorporate inter-subspace constraints between the space group and the composition. Because of the symmetry imposed by each space group, not all compositions are possible given a space group, and given a composition not all space groups are valid. By way of illustration, the most restrictive space group, international number 230, cannot accommodate compositions with fewer than 16 atoms per element. Since trajectories in Crystal-GFN sample the space group first, we restrict the action space of the GFlowNet to only allow the sampling of compositions that are compatible with the Wyckoff positions (see Appendix A) of the selected space group (constraint C_2 in Fig. 1).

4.3  Lattice parameters

The lattice parameters (a, b, c, α, β, γ), in combination with the lattice system (or crystal system), determine the shape of the unit cell in three dimensions. While both the composition and the space group subspaces are discrete, the lattice parameters are real-valued and thus yield a continuous sample space X_LP = (ℓ_min, ℓ_max)^3 × (θ_min, θ_max)^3, where ℓ_min and ℓ_max are the minimum and maximum lengths and θ_min and θ_max are the minimum and maximum angles, respectively. In order to sample sets of continuous lattice parameters, the GFlowNet policy model outputs the parameters of a six-dimensional mixture of Beta distributions, a continuous probability distribution defined on the interval (0, 1).
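A sketch of how Beta samples in (0, 1) can be rescaled to lattice lengths and angles. For brevity it uses a single Beta per dimension instead of a mixture, and the ranges are illustrative assumptions rather than the paper's actual bounds:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_lattice_params(alpha, beta, l_min=1.0, l_max=10.0,
                          theta_min=30.0, theta_max=150.0):
    """Sample (a, b, c, alpha, beta, gamma) via Beta distributions.

    A single Beta(alpha, beta) per dimension stands in for the
    six-dimensional mixture of Betas output by the policy. Samples
    in (0, 1) are rescaled to (l_min, l_max) for lengths a, b, c
    and to (theta_min, theta_max) degrees for the three angles.
    """
    u = rng.beta(alpha, beta, size=6)          # six values in (0, 1)
    lengths = l_min + u[:3] * (l_max - l_min)  # a, b, c in Angstrom
    angles = theta_min + u[3:] * (theta_max - theta_min)
    return np.concatenate([lengths, angles])
```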

Here, we incorporate an inter-subspace constraint from the space group (C_4 in Fig. 1). Every space group is classified into a lattice system which, in turn, constrains the values of the lattice parameters. For instance, for the cubic lattice system, a = b = c and α = β = γ = 90°. The remaining constraints are illustrated in Fig. 2. This drastically reduces the state space and removes the need to learn such relationships and constraints from the data.
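The lattice-system constraint can be pictured as projecting the free parameters onto the allowed subset. This partial sketch (a hypothetical function of ours, covering only three of the seven systems) follows the conventions stated in the text and Fig. 2:

```python
def apply_lattice_system(params, system):
    """Project (a, b, c, alpha, beta, gamma) onto a lattice system.

    Only three of the seven systems are shown, for illustration:
    cubic (a = b = c, all angles 90), tetragonal (a = b, all angles
    90) and monoclinic (alpha = gamma = 90, beta free). Triclinic is
    the unconstrained fallback.
    """
    a, b, c, al, be, ga = params
    if system == "cubic":
        return (a, a, a, 90.0, 90.0, 90.0)
    if system == "tetragonal":
        return (a, a, c, 90.0, 90.0, 90.0)
    if system == "monoclinic":
        return (a, b, c, 90.0, be, 90.0)
    return params  # triclinic: no constraints
```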

[Figure 2: panels (a)–(g), one per lattice system; see caption below.]
Figure 2: The seven lattice systems and the constraints they impose on the lattice parameters. In order a–g: triclinic, monoclinic, orthorhombic, tetragonal, rhombohedral, hexagonal and cubic. Source of the figures: Wikimedia Commons, licensed under the Creative Commons Attribution-Share Alike 3.0 Unported license.

4.4  Reward function

As explained in Section 3, our Crystal-GFN is trained to sample proportionally to a reward function. This offers the flexibility of using as reward any quantity of interest, such as a desired property of materials. As an initial proof of concept for the method, we choose the reward to be based on the formation energy (FE) per atom of the sampled crystal structure, in order to generate materials with higher likelihood of being thermodynamically stable.

The formation energy of a material is defined as the amount of energy needed to form it from its constituent elements in their standard states. If a compound C is the combination of elements A and B, that is, A + B → C, then the formation energy E_f of C is defined as E_f(C) = E(C) − E(A) − E(B). A negative formation energy indicates that the formation reaction of a material from its constituent elements is spontaneous, or favourable. Conversely, a positive formation energy indicates that the formation reaction is not thermodynamically favoured.
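In code, this definition is simple arithmetic. The energies below are hypothetical illustrative numbers, not DFT values:

```python
def formation_energy(e_compound, e_elements):
    """E_f(C) = E(C) - sum of the constituent element energies (eV)."""
    return e_compound - sum(e_elements)

# Hypothetical A + B -> C with E(C) = -12.0, E(A) = -4.0, E(B) = -5.0:
# E_f(C) = -12.0 - (-4.0) - (-5.0) = -3.0 eV, i.e. favourable
ef = formation_energy(-12.0, [-4.0, -5.0])
```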

Because crystal structures sampled by the Crystal-GFN may be unknown or not characterised in existing databases, their true FE may be unknown too. This is why we train a proxy machine learning model to predict the FE given a crystal x ∈ X, parameterised as the output of Crystal-GFN.

In particular, we train a physics-informed Multi-Layer Perceptron (MLP) on the MatBench data set (Dunn et al., 2020), a snapshot of the Materials Project (Jain et al., 2013). For a given crystal structure, the input to the proxy MLP is the concatenation of the following (refer to Appendix C for more details):

  • A physical embedding of the crystal’s elements, obtained using PhAST (Duval et al., 2022), that is from the atomic number, the period, the group and other relevant atomic properties.

  • A learned embedding for its space group.

  • The standardised lattice parameters.

We stratify the MatBench data set into train, validation and test sets, controlling for the distribution of the target FE. Our training and validation sets contain 65,048 and 23,232 data points, respectively. The final reward function is a Boltzmann transformation of the proxy output: R(x) = exp(−MLP(x)/T), where T is a temperature hyper-parameter. This ensures that lower FE yields higher positive reward, and we can control our preference for lower energies with the temperature T.
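The reward transformation can be sketched directly from the formula; here the proxy output MLP(x) is replaced by a precomputed predicted energy, and the temperature value is illustrative:

```python
import math

def reward(pred_energy_ev, temperature=1.0):
    """Boltzmann-style reward R(x) = exp(-MLP(x) / T).

    Lower predicted formation energy yields higher reward, and a
    smaller T sharpens the preference for low energies. The proxy
    output MLP(x) is stood in for by `pred_energy_ev` (eV/atom).
    """
    return math.exp(-pred_energy_ev / temperature)

# A lower predicted energy always outscores a higher one for T > 0
assert reward(-3.1) > reward(-2.0)
```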

5  Empirical evaluation

In this section, we present the results of experiments designed to evaluate the potential of Crystal-GFN to explore the space of materials. The motivation for using a GFlowNet to generate crystal structures is twofold: first, we are interested in discovering structures with high scores on a property of interest, in this case low formation energy; second, we want to discover not just one but multiple, diverse structures. The main reason to seek diverse samples is that the target function is often underspecified and the true objective is multifaceted or unknown. A natural way of dealing with underspecification is to try multiple candidates to increase the likelihood of finding successful structures for the downstream applications (Jain et al., 2023; Hernandez-Garcia et al., 2023). Thus, we here evaluate two aspects of Crystal-GFN: the distribution of the predicted formation energy of the generated samples, and their diversity.

5.1  Experimental setup

Exploring the infinitely large space of crystals to find the ones with certain properties is a daunting “needle-in-a-haystack” task. In order to make the haystack a little smaller—though still infinitely large—we restrict the search task to a subset of 113 space groups and compositions with up to 5 unique elements from a set of 12 common elements. We train the Crystal-GFN for 50,000 iterations, which amounts to 500,000 queries to the proxy model and about 12 hours on a CPU-only machine. Further details about the experimental setup are provided in Appendix B.

5.2  Results

In this section, we present results related to our proxy model, to the distribution of the formation energy in Crystal-GFN samples and to the diversity of these samples.

Proxy model of the formation energy

As presented in Section 4.4, our reward function is based on a proxy MLP trained on MatBench (Materials Project). Since part of our performance analysis of the proposed Crystal-GFN depends on the formation energy predicted by this proxy model, it is important to first verify its accuracy. In contrast to other methods, it does not rely on atom positions to predict the FE of a crystal, but rather on a higher-level description of the crystal, namely its space group, composition and lattice parameters, akin to the outputs of Crystal-GFN. Nonetheless, it achieves a mean absolute error (MAE) of 0.10 ± 0.005 eV/atom on the validation set. We further describe its hyper-parameters and detailed performance in Appendix C.

Predicted formation energy of samples

Evaluating generative models is well known to be a difficult, task-dependent problem. Here, in order to gain insights about the sampling policy of the trained Crystal-GFN, we sample 10,000 crystals and compare the distribution with 1) the validation set from MatBench and 2) 10,000 samples from a randomly initialised and untrained Crystal-GFN. Figure 3 shows the kernel density estimate of the formation energies predicted by our proxy model for the aforementioned sets of samples. The main conclusion is that training the Crystal-GFN shifts the formation energy of the sampled crystals towards lower values, with a median of -3.1 eV/atom, lower even than the median energy in the validation set. Given the vast search space, it is remarkable that with only 12 hours of training, Crystal-GFN learns to sample from a distribution in which 95% of the structures have a formation energy lower than -2 eV/atom (as predicted by the proxy model).
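The summary statistics quoted above (the median and the fraction of samples below -2 eV/atom) can be computed with a short helper. This is a minimal sketch; the energies used below are synthetic stand-ins, not actual model outputs.

```python
import numpy as np

def energy_summary(energies, threshold=-2.0):
    """Return the median predicted formation energy (eV/atom) and the
    fraction of samples below `threshold`, the two statistics reported
    for Crystal-GFN samples."""
    energies = np.asarray(energies, dtype=float)
    return float(np.median(energies)), float(np.mean(energies < threshold))

# Synthetic energies for illustration only (not actual proxy predictions)
rng = np.random.default_rng(0)
median, frac_below = energy_summary(rng.normal(-3.1, 0.5, size=10_000))
```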

Figure 3: Distributions of the formation energy predicted by our proxy model for three sets of samples: in blue, samples from Crystal-GFN after training; in orange, the validation set, representative of the MatBench database; in pink, samples from an untrained Crystal-GFN. After training, Crystal-GFN samples crystals with predicted formation energies in the range of the validation set or lower.

Diversity of samples

As discussed above, an important goal in many materials discovery applications is to find diverse candidates. In other words, sampling crystals with low formation energy would be of little use for most applications if all the crystals were identical or highly similar. To assess diversity, we analyse the crystal structures sampled by the trained Crystal-GFN. The main conclusion is that Crystal-GFN samples highly diverse candidates. Regarding composition, all 12 elements appear in the samples from the Crystal-GFN, and 10 of them appear in the 100 samples with the lowest predicted formation energy. Regarding crystal categories and space groups, 5 out of the 8 crystal-lattice systems and all 5 point symmetries are found in the top-100 samples alone. While not all 113 space groups are found in the 10,000 samples, 80 of them are present (70%). Finally, in terms of lattice parameters, we observe distributions of lengths and angles relatively similar to those of the MatBench data set. Further details and visualisations of these results are provided in Appendix D.
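Coverage counts of the kind reported above can be computed with a minimal sketch, assuming (hypothetically) that each sample is represented as a dict with a composition mapping and a space-group number; the sample data below is illustrative:

```python
from collections import Counter

def diversity_report(samples):
    """Count element and space-group coverage in a list of sampled crystals.
    Each sample is assumed to be a dict with keys 'composition'
    (element -> atom count) and 'space_group' (international number)."""
    elements = set()
    space_groups = Counter()
    for s in samples:
        elements.update(s["composition"].keys())
        space_groups[s["space_group"]] += 1
    return {"n_elements": len(elements), "n_space_groups": len(space_groups)}

# Toy samples for illustration
samples = [
    {"composition": {"Li": 2, "O": 1}, "space_group": 225},
    {"composition": {"Fe": 1, "O": 3}, "space_group": 194},
    {"composition": {"Li": 1, "Fe": 1, "O": 2}, "space_group": 225},
]
report = diversity_report(samples)
```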

6  Conclusions and future work

In this paper, we have introduced a new generative model, Crystal-GFN, which samples inorganic crystal structures proportionally to a property of interest. A key feature of Crystal-GFN is its flexibility: it can be trained with any available reward function and can incorporate domain constraints. The latter is possible because Crystal-GFN constructs crystals sequentially in the space of compositions, space groups and lattice parameters of the unit cell. This parameterisation, inspired by theoretical crystallography, allowed us to flexibly incorporate constraints on the charge neutrality of the composition, the compatibility between composition and space group, and the compatibility between space group and lattice parameters.

By training a Crystal-GFN with a reward function based on the formation energy predicted by a proxy model trained on the MatBench database, we have shown that it learns to sample candidates with low predicted formation energy (lower mean and median than the distribution of the MatBench data set). Importantly, it does so while sampling highly diverse crystals, in terms of coverage of space groups, compositions and lattice parameters.

Interesting directions for future work include extending Crystal-GFN with more domain-inspired constraints as well as with additional subspaces to sample the atomic positions in the unit cell. Alternatively, crystal structure prediction from the samples could be performed using pre-trained machine learning models that generate atomic coordinates. Furthermore, it would be interesting to explore other properties of interest beyond the formation energy of the sampled crystals.

Ethics Statement

Our work is motivated by the climate crisis, the need to develop sustainable technology and the ambition to apply our approach to materials discovery. However, as with other related work, there is a potential risk of dual use of the technology by nefarious actors (Urbina et al., 2022). The authors strongly oppose any uses or derivations of this work intended to cause harm to humans or the environment.

Code availability

The code of the Crystal-GFN algorithm presented in this paper is open source and available at https://github.com/alexhernandezgarcia/gflownet.

Acknowledgements

We thank Bruno Rousseau, Simon Blackburn, Mickael Dollé and Homin Shin for their valuable insights from the materials science perspective. Sékou-Oumar Kaba, Michael Kilgour, Harry Guo and Alain Tchagang have also participated in helpful discussions. We thank Mila’s IDT team for their support. Finally, we thank the Quebec’s Ministry of Economy, Innovation and Energy for their financial support.

This research was enabled in part by compute resources provided by Mila (mila.quebec).

References

  • Ahmad & Cai (2022) Ahmad, R. and Cai, W. Free energy calculation of crystalline solids using normalizing flows. Modelling and Simulation in Materials Science and Engineering, 30(6):065007, 2022.
  • Bengio et al. (2021) Bengio, E., Jain, M., Korablyov, M., Precup, D., and Bengio, Y. Flow network based generative models for non-iterative diverse candidate generation. In Advances in Neural Information Processing Systems (NeurIPS), volume 34, 2021.
  • Cao & Kipf (2018) Cao, N. D. and Kipf, T. Molgan: An implicit generative model for small molecular graphs. arXiv preprint arXiv: 1805.11973, 2018.
  • Cheng et al. (2022) Cheng, G., Gong, X.-G., and Yin, W.-J. Crystal structure prediction by combining graph network and optimization algorithm. Nature Communications, 13(1), March 2022. doi: 10.1038/s41467-022-29241-4. URL https://doi.org/10.1038/s41467-022-29241-4.
  • Court et al. (2020) Court, C. J., Yildirim, B., Jain, A., and Cole, J. M. 3-d inorganic crystal structure generation and property prediction via representation learning. Journal of Chemical Information and Modeling, 60(10):4518–4535, 2020.
  • Dunn et al. (2020) Dunn, A., Wang, Q., Ganose, A., Dopp, D., and Jain, A. Benchmarking materials property prediction methods: the Matbench test set and Automatminer reference algorithm, 2020. URL https://matbench.materialsproject.org/#citing-matbench. Accessed 2023-09-28.
  • Duval et al. (2022) Duval, A., Schmidt, V., Miret, S., Bengio, Y., Hernández-García, A., and Rolnick, D. PhAST: Physics-aware, scalable, and task-specific gnns for accelerated catalyst design. arXiv preprint arXiv: 2211.12020, 2022. URL https://arxiv.org/abs/2211.12020v3.
  • Gebauer et al. (2019) Gebauer, N., Gastegger, M., and Schütt, K. Symmetry-adapted generation of 3d point sets for the targeted discovery of molecules. Advances in neural information processing systems, 32, 2019.
  • Hernandez-Garcia et al. (2023) Hernandez-Garcia, A., Saxena, N., Jain, M., Liu, C.-H., and Bengio, Y. Multi-fidelity active learning with gflownets. arXiv preprint arXiv: 2306.11715, 2023. URL https://arxiv.org/abs/2306.11715v1.
  • Hoffmann et al. (2019) Hoffmann, J., Maestrati, L., Sawada, Y., Tang, J., Sellier, J. M., and Bengio, Y. Data-driven approach to encoding and decoding 3-d crystal structures. arXiv preprint arXiv:1909.00949, 2019.
  • Jain et al. (2013) Jain, A., Ong, S. P., Hautier, G., Chen, W., Richards, W. D., Dacek, S., Cholia, S., Gunter, D., Skinner, D., Ceder, G., and Persson, K. A. Commentary: The Materials Project: A materials genome approach to accelerating materials innovation. APL Materials, 1(1):011002, 07 2013. ISSN 2166-532X. doi: 10.1063/1.4812323. URL https://doi.org/10.1063/1.4812323.
  • Jain et al. (2022) Jain, M., Bengio, E., Hernandez-Garcia, A., Rector-Brooks, J., Dossou, B. F., Ekbote, C. A., Fu, J., Zhang, T., Kilgour, M., Zhang, D., et al. Biological sequence design with GFlowNets. In International Conference on Machine Learning (ICML), volume 162. PMLR, 2022.
  • Jain et al. (2023) Jain, M., Deleu, T., Hartford, J., Liu, C.-H., Hernandez-Garcia, A., and Bengio, Y. GFlowNets for AI-driven scientific discovery. Digital Discovery, 2023.
  • Kim et al. (2020) Kim, S., Noh, J., Gu, G. H., Aspuru-Guzik, A., and Jung, Y. Generative adversarial networks for crystal structure prediction. ACS central science, 6(8):1412–1420, 2020.
  • Lahlou et al. (2023) Lahlou, S., Deleu, T., Lemos, P., Zhang, D., Volokhova, A., Hernández-García, A., Ezzine, L. N., Bengio, Y., and Malkin, N. A theory of continuous generative flow networks. In International Conference on Machine Learning (ICML), 2023.
  • Long et al. (2021) Long, T., Fortunato, N. M., Opahle, I., Zhang, Y., Samathrakis, I., Shen, C., Gutfleisch, O., and Zhang, H. Constrained crystals deep convolutional generative adversarial network for the inverse design of crystal structures. npj Computational Materials, 7(1):66, 2021.
  • Malkin et al. (2022) Malkin, N., Jain, M., Bengio, E., Sun, C., and Bengio, Y. Trajectory balance: Improved credit assignment in GFlowNets. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, 2022.
  • Nouira et al. (2018) Nouira, A., Crivello, J., and Sokolovska, N. Crystalgan: Learning to discover crystallographic structures with generative adversarial networks. AAAI Spring Symposium Combining Machine Learning with Knowledge Engineering, 2018.
  • Pakornchote et al. (2023) Pakornchote, T., Choomphon-anomakhun, N., Arrerut, S., Atthapak, C., Khamkaeo, S., Chotibut, T., and Bovornratanaraks, T. Diffusion probabilistic models enhance variational autoencoder for crystal structure generative modeling. arXiv preprint arXiv:2308.02165, 2023.
  • Ren et al. (2020) Ren, Z., Noh, J., Tian, S., Oviedo, F., Xing, G., Liang, Q., Aberle, A., Liu, Y., Li, Q., Jayavelu, S., et al. Inverse design of crystals using generalized invertible crystallographic representation. arXiv preprint arXiv:2005.07609, 3(6):7, 2020.
  • Ren et al. (2022) Ren, Z., Tian, S. I. P., Noh, J., Oviedo, F., Xing, G., Li, J., Liang, Q., Zhu, R., Aberle, A. G., Sun, S., et al. An invertible crystallographic representation for general inverse design of inorganic crystals with targeted properties. Matter, 5(1):314–335, 2022.
  • Satorras et al. (2021) Satorras, V. G., Hoogeboom, E., Fuchs, F., Posner, I., and Welling, M. E(n) equivariant normalizing flows. Neural Information Processing Systems, 2021.
  • Shi et al. (2020) Shi, C., Xu, M., Zhu, Z., Zhang, W., Zhang, M., and Tang, J. Graphaf: a flow-based autoregressive model for molecular graph generation. arXiv preprint arXiv:2001.09382, 2020.
  • Urbina et al. (2022) Urbina, F., Lentzos, F., Invernizzi, C., and Ekins, S. Dual use of artificial-intelligence-powered drug discovery. Nature Machine Intelligence, 4(3):189–191, 2022.
  • Xie et al. (2021) Xie, T., Fu, X., Ganea, O., Barzilay, R., and Jaakkola, T. Crystal diffusion variational autoencoder for periodic material generation. International Conference On Learning Representations, 2021.
  • Xu et al. (2022) Xu, M., Yu, L., Song, Y., Shi, C., Ermon, S., and Tang, J. Geodiff: A geometric diffusion model for molecular conformation generation. arXiv preprint arXiv:2203.02923, 2022.
  • Zhao et al. (2021) Zhao, Y., Al-Fahdi, M., Hu, M., Siriwardane, E. M., Song, Y., Nasiri, A., and Hu, J. High-throughput discovery of novel cubic crystal materials using deep generative neural networks. Advanced Science, 8(20):2100566, 2021.
  • Zheng et al. (2023) Zheng, S., He, J., Liu, C., Shi, Y., Lu, Z., Feng, W., Ju, F., Wang, J., Zhu, J., Min, Y., et al. Towards predicting equilibrium distributions for molecular systems with deep learning. arXiv preprint arXiv:2306.05445, 2023.

Appendix A Materials and crystal structures

Materials

Materials encompass the substances that make up objects. Their distinctive characteristics arise from a combination of factors, including their composition, structure and manufacturing processes. While materials come in various forms, our primary focus is on solid-state materials, specifically crystals, owing to their significance in many machine learning applications.

Solid-state materials

Solid-state materials consist of a multitude of smaller building blocks, such as atoms or molecules, securely positioned in fixed locations. They exhibit high-density packing and strong mutual attraction, resulting in a stable structure with a well-defined volume. Materials that possess a well-ordered, infinitely repetitive structure are categorised as crystals, while those lacking such long-range positional order are termed amorphous.

Crystals

Also known as crystalline materials, crystals represent a category of solid substances characterised by their periodic structure. This entails a consistent and predictable repetition of atom or molecule placement within the crystal across all spatial dimensions. It is this inherent periodicity that sets crystals apart from smaller molecules and confers upon them unique properties. Crystals manifest in diverse forms and showcase exceptional attributes like magnetism, transparency and elevated melting points. The study of crystals offers a pathway to precise characterisation and controlled manipulation, opening up avenues for engineering materials with targeted properties to fulfill a wide range of needs. The subsequent section delves into how machine learning models address and harness this periodicity.

A crystal structure includes the type and position of every atom as well as the translational axes that allow the structure to repeat. Since crystal structures extend infinitely, we need to define a small section that can be repeated: the unit cell. Unit cells are made of a lattice and a basis: the lattice specifies how the structure is repeated, and the basis specifies what is repeated. A unit cell has six parameters: three edge lengths $a$, $b$, $c$ and the angles between the edges $\alpha$, $\beta$, $\gamma$. The edges of a unit cell may or may not be perpendicular to each other. A crystal structure combines the lattice (how atoms are translated) with a basis (which atoms are translated), which theoretically describes every atom in the crystal. The lattice is a purely mathematical construct, while the basis is the "real" component that is repeated according to the lattice.

Lattice

The lattice is fully specified by its origin and its basis vectors $\mathbf{a}_i$, and it is associated with a group of translations $T_\Lambda$. The lattice $\Lambda$ is the underlying structure of the unit cells, which have the property of tiling the space $\mathbb{R}^n$ when translated by lattice vectors. Of particular prominence is the primitive cell $U$ for the basis $\mathbf{a}_i$: $U \doteq \left\{\sum_{i}^{n} x_i \mathbf{a}_i \mid 0 \leq x_i < 1\right\}$.
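The definition of the primitive cell above can be illustrated with a short sketch: fractional coordinates lie in the cell when each component is in [0, 1), and the basis vectors map them to Cartesian positions. The basis values below are illustrative.

```python
import numpy as np

def in_primitive_cell(frac):
    """True if fractional coordinates lie in the primitive cell
    U = { sum_i x_i a_i | 0 <= x_i < 1 }."""
    frac = np.asarray(frac)
    return bool(np.all((frac >= 0.0) & (frac < 1.0)))

def to_cartesian(frac, basis):
    """Map fractional coordinates to Cartesian positions using the lattice
    basis vectors a_i, stored as the rows of `basis`."""
    return np.asarray(frac) @ np.asarray(basis)

# Toy cubic lattice with edge length 2 angstroms (illustrative values)
basis = 2.0 * np.eye(3)
cart = to_cartesian([0.5, 0.5, 0.5], basis)  # centre of the cell
```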

In a material, the unit cell comprises a set of atomic positions $S = \{(Z_i, \mathbf{x}_i) \mid \mathbf{x}_i \in U\}$, where $\mathbf{x}_i$ is the atom position (called the basis vector) and $Z_i$ its atomic number. $S$ can contain an arbitrary number of atoms and is not constrained to a particular structure.

Periodic boundary conditions

Periodic boundary conditions (pbc) are commonly used to represent the infinite repeating nature of crystals and other periodic systems. Rather than confining the analysis to a single unit cell housing all atoms, pbc envision an infinite array of identical unit cells extending indefinitely in all directions. This means that if an atom or molecule exits the cell on one side, it effectively re-enters from the opposite side, as if traversing a continuous lattice. This approach enables scientists to simulate the behavior of an infinitely extending crystal using a finite computational domain.
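The re-entry behaviour described above amounts to taking fractional coordinates modulo 1, which can be sketched in a few lines:

```python
def wrap_pbc(frac):
    """Wrap fractional coordinates back into the primitive cell under
    periodic boundary conditions: a coordinate leaving through one face
    re-enters through the opposite one."""
    return [x % 1.0 for x in frac]

# A position past the cell boundary (and one before it) gets wrapped back
wrapped = wrap_pbc([1.2, -0.3, 0.5])
```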

Wyckoff positions

Also defined as classes of crystallographic orbits, Wyckoff positions categorise the positions in a lattice of atoms that sit on symmetry elements. A Wyckoff position describes the set of positions generated from an atom's position by the symmetry operations of the space group; the number of distinct equivalent positions is the Wyckoff multiplicity.
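The notion of multiplicity can be illustrated with a toy sketch: apply a set of symmetry operations (each acting on fractional coordinates as $\mathbf{x} \mapsto W\mathbf{x} + t \bmod 1$) to a position and count the distinct images. The two-operation group below (identity and inversion through the origin) is a deliberately minimal, hypothetical example.

```python
import numpy as np

def multiplicity(frac_pos, sym_ops, tol=1e-6):
    """Count the distinct images of a fractional position under a list of
    symmetry operations (W, t), each acting as x -> W x + t mod 1."""
    images = []
    for W, t in sym_ops:
        img = (np.asarray(W) @ np.asarray(frac_pos) + np.asarray(t)) % 1.0
        if not any(np.allclose(img, other, atol=tol) for other in images):
            images.append(img)
    return len(images)

# Toy group: identity and inversion through the origin
ops = [(np.eye(3), np.zeros(3)), (-np.eye(3), np.zeros(3))]
general = multiplicity([0.1, 0.2, 0.3], ops)  # general position
special = multiplicity([0.0, 0.0, 0.0], ops)  # on the inversion centre
```

An atom at a general position has two images, while an atom sitting exactly on the inversion centre is mapped to itself, hence the lower multiplicity of special positions.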

Appendix B Experimental setup

In this section, we provide additional details about the hyper-parameters and experimental setup used to train the Crystal-GFN to obtain the results presented in Section 5.

In order to reduce the search space in our experiments, we apply the following restrictions:

  • Compositions consist of up to 5 different elements from the subset of these 12 elements: H, Li, C, N, O, F, Mg, Si, P, S, Cl and Fe. These are the most common elements in the MatBench data set used to train the proxy model. Note that even this reduced set yields a combinatorially large search space.

  • Compositions can contain up to 50 atoms in total and up to 16 atoms per element. The limit of 16 is obtained by finding the lowest Wyckoff position multiplicity for each space group, and then taking the maximum of these values across all space groups. That is, 16 is the lowest value that still allows the Crystal-GFN to generate samples from all space groups while respecting their symmetry constraints.

  • Space groups are restricted to the 113 groups present in both the train and validation splits of the proxy model's data set.

  • The minimum and maximum lengths of the unit cell are 0.9 and 100 angstroms, respectively; the minimum and maximum angles are 50° and 150°, respectively. These values are set so as to include the bulk of the training and validation sets, excluding outliers.
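The restrictions above can be summarised as a simple validity check. This is a hypothetical helper for illustration, not the actual Crystal-GFN code, which enforces these constraints during sampling.

```python
ALLOWED_ELEMENTS = frozenset(
    {"H", "Li", "C", "N", "O", "F", "Mg", "Si", "P", "S", "Cl", "Fe"}
)

def valid_sample(composition, lengths, angles):
    """Check a candidate against the search-space restrictions listed above:
    up to 5 distinct elements from the 12-element subset, at most 50 atoms
    in total and 16 per element, lengths in [0.9, 100] angstroms and angles
    in [50, 150] degrees."""
    if not set(composition) <= ALLOWED_ELEMENTS or len(composition) > 5:
        return False
    if sum(composition.values()) > 50 or max(composition.values()) > 16:
        return False
    if not all(0.9 <= length <= 100.0 for length in lengths):
        return False
    return all(50.0 <= angle <= 150.0 for angle in angles)
```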

We train the Crystal-GFN by sampling 10 trajectories per iteration from the current forward policy. In order to encourage further exploration, 10 % of the steps in the trajectories are sampled at random from a uniform distribution. In total, we train for 50,000 iterations, which amounts to 500,000 queries to the proxy model. This took about 12 hours on a CPU-only machine.

The architecture of both the forward and the backward GFlowNet policy models is a 3-layer multi-layer perceptron with 256 units per layer. We trained with the Adam optimiser and a learning rate of 0.0001. As is common with the Trajectory Balance objective (Malkin et al., 2022), we set a higher learning rate (0.01) for the 16 learnable weights used to parameterise the partition function. In order to sample structures with lower (more negative) formation energies, we set a temperature $T = 8$ in the reward function defined in Section 4.4.
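The two-learning-rate setup can be sketched with optimiser parameter groups. The module shapes below are illustrative; only the learning rates and the 16 partition-function weights come from the text.

```python
import torch

# Hypothetical policy network: a 3-layer MLP with 256 units per layer
# (input/output dimensions here are placeholders)
policy = torch.nn.Sequential(
    torch.nn.Linear(64, 256), torch.nn.LeakyReLU(),
    torch.nn.Linear(256, 256), torch.nn.LeakyReLU(),
    torch.nn.Linear(256, 64),
)
# 16 learnable weights parameterising the partition function log Z
log_z = torch.nn.Parameter(torch.zeros(16))

# Base learning rate for the policy, a 100x higher one for log Z,
# as is common with the Trajectory Balance objective
optimizer = torch.optim.Adam([
    {"params": policy.parameters(), "lr": 1e-4},
    {"params": [log_z], "lr": 1e-2},
])
```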

The distribution used to sample increments in the lattice-parameter subspace is a mixture of 5 Beta distributions. In order to ensure numerical stability during training, we restrict the values of the coefficients of the Beta distributions to the range $[0.1, 100]$. One of the conditions that must be satisfied by generalised GFlowNets is that trajectories must have finite length. To this end, we set the minimum increment to 10% of the range of each dimension.
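A minimal, standard-library sketch of such a sampler is shown below. The actual model learns the mixture weights and concentration parameters; here they are fixed, and the minimum-increment rule is implemented (as one possible choice) by rescaling the Beta draw into [0.1, 1] of the normalised range.

```python
import random

def sample_increment(weights, alphas, betas, min_frac=0.1, rng=random):
    """Sample a lattice-parameter increment (normalised to [0, 1]) from a
    mixture of Beta distributions. Concentration parameters are clamped to
    [0.1, 100] for numerical stability, and the increment is rescaled so it
    is at least `min_frac` of the dimension's range, ensuring finite
    trajectories."""
    def clamp(v):
        return min(max(v, 0.1), 100.0)
    k = rng.choices(range(len(weights)), weights=weights)[0]
    u = rng.betavariate(clamp(alphas[k]), clamp(betas[k]))
    return min_frac + (1.0 - min_frac) * u

random.seed(0)
increments = [sample_increment([0.2] * 5, [2.0] * 5, [5.0] * 5) for _ in range(100)]
```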

Appendix C Proxy MLP

C.1  Architecture

For a given crystal structure, the input to the proxy MLP is the concatenation of: a physical embedding of the crystal's elements using PhAST (Duval et al., 2022), a learned embedding of its space group and the standardised lattice parameters. We stratify the MatBench data set into train, validation and test sets, matching their distributions of the target FE. The final reward function is a Boltzmann version of the proxy: $R(x) = \exp\big(-\frac{\text{MLP}(x)}{T}\big)$, where $T$ is a temperature hyper-parameter. This ensures that a lower FE yields a higher positive reward, and we can control our preference for lower energies through the temperature $T$.
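The Boltzmann reward above translates directly into code; the default temperature is the $T = 8$ used in our experiments (Appendix B).

```python
import math

def reward(predicted_fe, temperature=8.0):
    """Boltzmann reward R(x) = exp(-MLP(x) / T): a lower (more negative)
    predicted formation energy yields a higher positive reward, and T
    controls how strongly low energies are preferred."""
    return math.exp(-predicted_fe / temperature)
```

For example, a crystal with a predicted FE of -3.1 eV/atom receives a strictly higher reward than one at -1.0 eV/atom, and all rewards are positive.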

Overall, the architecture consists of concatenated representations for the composition ($h_C$), the space group ($h_{SG}$) and the lattice parameters ($h_{LP}$), which are used as input to a prediction MLP, $\hat{y} = \text{MLP}_{out}([h_C, h_{SG}, h_{LP}])$. In the following, we describe how each $h$ is obtained from the data:

  • $\mathbf{h_C}$: Each element $Z_i$ in the composition is embedded with PhAST (Duval et al., 2022) as the concatenation of (1) static physical properties $P_Z$ projected into a latent space with a small MLP and (2) learned embeddings, noted $E$, for the atomic number $Z$, period $P$ and group $G$ of the element:

    $h_C = \text{MLP}_C\big(\big[\text{MLP}_P(P_Z), E_Z, E_P, E_G\big]\big)$ (1)
  • $\mathbf{h_{SG}}$: Each space group is embedded via a lookup in a learnable embedding matrix.

  • $\mathbf{h_{LP}}$: Lattice parameters are first standardised element-wise using the training set statistics. The resulting 6-dimensional vector is then passed through a small MLP:

    $h_{LP} = \text{MLP}_{LP}\big(\frac{\text{LP}(x) - \mu_{train}(\text{LP})}{\sigma_{train}(\text{LP})}\big)$ (2)
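The overall forward pass can be sketched as follows. All shapes and the random-weight MLPs are illustrative stand-ins (the actual sizes are listed in Table 1, and the real model is trained); the sketch only shows how the three representations are built and concatenated.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(sizes):
    """Random-weight stand-in for a small MLP (illustrative only)."""
    Ws = [rng.standard_normal((m, n)) * 0.1 for m, n in zip(sizes[:-1], sizes[1:])]
    def forward(x):
        for W in Ws[:-1]:
            x = np.tanh(x @ W)
        return x @ Ws[-1]
    return forward

# Hypothetical input/embedding sizes: 12 elements, 113 space groups,
# 6 lattice parameters, 32-dimensional representations each
mlp_c, mlp_sg, mlp_lp = mlp([12, 32]), mlp([113, 32]), mlp([6, 32])
mlp_out = mlp([96, 64, 1])

def proxy_forward(comp_feat, sg_onehot, lattice_params, mu, sigma):
    """Sketch of the proxy: build h_C, h_SG and h_LP (standardising the
    lattice parameters with training-set statistics), concatenate them and
    predict the formation energy."""
    h_c = mlp_c(comp_feat)
    h_sg = mlp_sg(sg_onehot)
    h_lp = mlp_lp((lattice_params - mu) / sigma)
    return mlp_out(np.concatenate([h_c, h_sg, h_lp]))

y = proxy_forward(rng.random(12), rng.random(113), rng.random(6),
                  np.zeros(6), np.ones(6))
```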

C.2  Performance

Our proxy MLP model achieves an overall mean absolute error of 0.10 ± 0.005 eV/atom (95% confidence interval, computed by modelling the MAE as a log-normal distribution; this choice is further justified by Fig. 5). For reference, the best-performing methods on the MatBench leaderboard (https://matbench.materialsproject.org/Leaderboards%20Per-Task/matbench_v0.1_matbench_mp_e_form/) achieve up to 10× better performance. However, they all use the 3D atom positions of the crystals, and, remarkably, our approach outperforms some of them, indicating that the composition, space group and lattice parameters form a viable representation for the formation energy estimation task.

In Figures 4 and 5 we further detail the performance of our proxy MLP on the MatBench validation set. In particular, we observe that it maintains good performance for structures with FE up to ∼0.3 eV/atom. Beyond that level the MAE becomes high and unstable, but this is expected, as those are the regions of the target space with the least data available. Additionally, Figure 4 verifies that our stratification algorithm works as expected, yielding similar FE distributions in the train and validation splits.
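One common way to stratify a continuous target such as the FE is to bin its values by quantile and sample the validation set uniformly within each bin. The sketch below illustrates this idea under that assumption; the actual stratification code may differ.

```python
import numpy as np

def stratified_split(targets, val_frac=0.2, n_bins=10, seed=0):
    """Split indices into train/validation so that the two splits have
    similar target distributions: bin the continuous target by quantile,
    then draw `val_frac` of each bin into the validation set."""
    targets = np.asarray(targets)
    rng = np.random.default_rng(seed)
    edges = np.quantile(targets, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, targets, side="right") - 1,
                   0, n_bins - 1)
    is_val = np.zeros(len(targets), dtype=bool)
    for b in range(n_bins):
        idx = np.flatnonzero(bins == b)
        rng.shuffle(idx)
        is_val[idx[: int(round(val_frac * len(idx)))]] = True
    return is_val  # True -> validation, False -> train

# Synthetic FE values for illustration
fe = np.random.default_rng(1).normal(-1.5, 1.0, size=1000)
is_val = stratified_split(fe)
```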

Figure 4: Proxy MLP performance on the validation set and data split FE distributions. The average MAE is 0.10 eV/atom. We can also see the effect of the stratification algorithm, which yields similar FE distributions between the train and validation data set splits.
Figure 5: Distribution of FE values in the validation set and associated proxy MLP MAE, with 25%, 50% and 75% MAE quantiles.

C.3  Hyper-parameters

We present the hyper-parameters of our Proxy MLP in Table 1 along with a description of their role in the architecture.

Hyper-parameter | Value | Description
properties_proj_size | 64 | Projection size of atomic properties $h = \mathbf{W} p_Z$
group_emb_size | 16 | Embedding size of the element's group
period_emb_size | 256 | Embedding size of the element's period
z_emb_size | 128 | Embedding size of the element's atomic number
sg_emb_size | 128 | Crystal space group embedding size
lat_hidden_channels | 284 | Hidden channels of the layers processing the lattice parameters
lat_num_layers | 1 | Number of hidden layers of the lattice parameters MLP
num_layers | 5 | Number of layers of the final MLP processing the composition, space group and lattice parameter hidden representations
hidden_channels | 576 | Size of the final MLP hidden layers
optimiser | Adam | Optimiser
lr | 0.0017 | Learning rate
batch_size | 448 | Batch size
scheduler | ReduceLROnPlateau | Divide the learning rate by 2 when the validation MAE does not improve for 4 epochs
es_patience | 11 | Early stopping patience (in epochs)
epochs | 100 | Total epochs (effective after early stopping: 97)

Table 1: Hyper-parameters of the MLP proxy and their values used in our model.

Appendix D Additional results

In this section, we provide additional results demonstrating the diversity of the samples generated by the Crystal-GFN. They are displayed in Figures 6 to 10.

Figure 6: Distribution of total element prevalence per data set (10k Crystal-GFlowNet samples, MatBench val, MatBench train). Each element is counted as per its stoichiometry in the crystal.
Figure 7: Distribution of binary element prevalence per data set (10k Crystal-GFlowNet samples, MatBench val, MatBench train). Each element is counted only once per crystal.
Figure 8: Distribution of binarised element co-occurrences per data set (10k Crystal-GFN samples, MatBench val, MatBench train). A binarised co-occurrence is defined as element i𝑖iitalic_i and element j𝑗jitalic_j both being present in a crystal, regardless of their respective stoichiometry in that crystal.
Figure 9: Distribution of space groups per data set (10k Crystal-GFN samples, MatBench val, MatBench train)
Figure 10: Distribution of lattice parameters