License: CC BY-NC-SA 4.0
arXiv:2310.04925v2 [cs.LG] 13 Dec 2023

Crystal-GFN: sampling crystals with desirable properties and constraints

Mila AI4Science: Alex Hernandez-Garcia (Mila, Quebec AI Institute; Université de Montréal), Alexandre Duval (Mila, Quebec AI Institute; CentraleSupélec, Université Paris-Saclay), Alexandra Volokhova (Mila, Quebec AI Institute; Université de Montréal), Yoshua Bengio (Mila, Quebec AI Institute; Université de Montréal), Divya Sharma (Johns Hopkins University), Pierre Luc Carrier (Mila, Quebec AI Institute), Yasmine Benabed (Université de Montréal; Centre d’excellence en électrification des transports et stockage d’énergie, Hydro-Québec), Michał Koziarski (Mila, Quebec AI Institute; Université de Montréal), Victor Schmidt (Mila, Quebec AI Institute; Université de Montréal)
Abstract

Accelerating material discovery holds the potential to greatly help mitigate the climate crisis. Discovering new solid-state materials such as electrocatalysts, super-ionic conductors or photovoltaic materials can have a crucial impact, for instance, in improving the efficiency of renewable energy production and storage. In this paper, we introduce Crystal-GFN, a generative model of crystal structures that sequentially samples structural properties of crystalline materials, namely the space group, composition and lattice parameters. This domain-inspired approach enables the flexible incorporation of physical and structural hard constraints, as well as the use of any available predictive model of a desired physicochemical property as an objective function. To design stable materials, one must target the candidates with the lowest formation energy. Here, we use as objective the formation energy per atom of a crystal structure predicted by a new proxy machine learning model trained on MatBench. The results demonstrate that Crystal-GFN is able to sample highly diverse crystals with low (median -3.1 eV/atom) predicted formation energy.

1  Introduction


[Figure 1: schematic of the generation pipeline — crystal-lattice system and point symmetry → space group → composition → lattice parameters → proxy model reward R(x); see caption below.]

Figure 1: A schematic of the crystal generation process of Crystal-GFN. First, the space group is selected, in turn decomposed into the selection of a crystal-lattice system and a point symmetry. Then, a composition is generated by iteratively selecting an element type and its quantity. Finally, the lattice parameters of the unit cell are sampled. We introduce hard constraints, denoted by C_i, within and between these components, as described in Section 4.

Materials discovery plays a vital role in transforming numerous industries that are currently responsible for a significant fraction of the global greenhouse gas emissions. From developing cutting-edge photovoltaic panels or new-generation solid-state batteries to tackling carbon capture or catalysis, the quest for innovative materials with targeted properties has the power to reshape the technological landscape.

However, the discovery of new materials is an extremely challenging task at various levels of innovation: crystal structure generation, property prediction and the successful development of synthesis protocols. In practice, given the vast space of conceivable materials, scientists aiming to design a material with a desirable property must use their domain expertise and intuition to restrict the search space to just a few promising structures. This traditional trial-and-error process is extremely costly in terms of both time and effort, providing machine learning (ML) with an opportunity to greatly accelerate the generation and evaluation of promising candidates.

While ML has already made significant progress with respect to property prediction and generation of small molecules (Long et al., 2021; Ren et al., 2022; Pakornchote et al., 2023; Zheng et al., 2023), its impact in the field of solid-state energy materials lags behind. At the root of this discrepancy lies the crystalline nature of their structure, which imposes periodic boundary conditions that make it difficult to use finite graphs for crystal representation. As a result, the sole task of generating a stable candidate is highly complex. It involves exploring the combinatorial space of chemical compositions, considering all possible periodic arrangements of atoms as well as respecting the complex physics that make them viable, stable and metastable structures.

Indeed, the generation of crystal 3D structures involves a multitude of factors, including specific bonding preferences between different atom types, geometric and chemical constraints, and atomic positions correlated with local energy minima defined by quantum mechanics. Overall, current methods often struggle to capture these complexities in the generation process and fail to consistently generate valid structures. Besides, most such approaches do not enable an efficient search through the crystal space, or a search for crystals with a specific property.

In this work, we introduce Crystal-GFN (Fig. 1), a generative model designed to sequentially sample crystal structures in a space inspired by theoretical crystallography. Instead of directly generating atomic positions, Crystal-GFN samples the space group, the composition and the lattice parameters of a crystal structure (Section 4). This domain-inspired approach facilitates the incorporation of physicochemical and structural constraints, and allows the possibility of extending the sampling space, for example to obtain atomic positions. The GFlowNet framework offers the ability to sample proportionally to a reward function, which enables diverse sampling. Here, we first trained a predictive model of the formation energy per atom and used it as the reward function to train a Crystal-GFN for 12 hours on a CPU-only machine. In our evaluation in Section 5, we sampled 10,000 compounds from the trained Crystal-GFN to demonstrate that it is able to sample diverse candidates with low predicted formation energy: 95 % of the sampled structures have a predicted energy lower than -2 eV/atom and the overall median is -3.1 eV/atom.

2  Related Work

A substantial body of generative methods exists for molecular and crystal structure generation, generally falling into one of these categories: Variational Auto-Encoders (VAEs) (Ren et al., 2022; Pakornchote et al., 2023), Generative Adversarial Networks (GANs) (Cao & Kipf, 2018; Nouira et al., 2018; Long et al., 2021), normalising flows (Satorras et al., 2021; Ahmad & Cai, 2022) and diffusion models (Zheng et al., 2023; Xu et al., 2022). In general, these methods are trained to reconstruct the training data from a latent distribution by maximising its likelihood or minimising the discrepancy between generated samples and the real distribution. Once trained, they can generate new instances by sampling from the learned latent space.

Hoffmann et al. (2019); Long et al. (2021); Court et al. (2020) treat materials as 3D voxel images, but the process of decoding images back to atom types and coordinates often results in low validity. Several methods (Ren et al., 2020; Kim et al., 2020; Zhao et al., 2021) encode unit-cell-related information to generate crystals. GANCSP (Kim et al., 2020) and CubicGAN (Zhao et al., 2021) use fractional coordinates, element properties and lattice parameters as input to build models that generate crystals conditioned on composition, or on both composition and space group. Such models are generally not invariant to Euclidean transformations.

CDVAE (Xie et al., 2021) employs a diffusion model in conjunction with a VAE to sample new materials, leveraging lattice parameters. However, it does not allow imposing domain-based constraints on the generation process. Overall, directly sampling 3D positions without considering energy minimisation can lead to physically unrealistic and unstable structures. Cheng et al. (2022) leveraged physics- and symmetry-based constraints to solve an optimisation problem over crystal graphs to obtain atomic coordinates, but their method requires a composition and lattice parameters as input.

Different from the above, autoregressive methods like G-SchNet (Gebauer et al., 2019) or G-SphereNet generate 3D molecules sequentially, placing atoms one by one. They are unaware of periodicity and cannot generate the lattice, making them ill-suited for crystals. This also applies to flow models (Shi et al., 2020).

As most methods in the literature directly sample 3D positions, they struggle to embed essential physicochemical constraints such as energy minimisation, geometric constraints on dihedral angles or bonding preferences between different atom types, often leading to physically unrealistic and unstable structures. In our proposed model, we incorporate domain knowledge from materials science and crystallography to sequentially build materials by selecting the space group, composition and lattice parameters while imposing compatibility and validity constraints. Therefore, the generated crystals adhere to additional fundamental principles and constraints, resulting in more physically viable structures.

3  Background

In this section, we briefly review the necessary background on GFlowNets and crystallography, before describing our proposed method, Crystal-GFN. Appendix A provides additional background about materials and crystal structures.

3.1  GFlowNets

Generative Flow Networks (GFlowNets, or GFN for short) were introduced by Bengio et al. (2021) as an amortised inference method to sample from high-dimensional distributions where traditional methods, such as MCMC and reinforcement learning, are inefficient in terms of mode mixing. In essence, GFlowNets are designed to learn a sampling policy π to generate objects x ∈ X proportionally to a non-negative reward function R(x), that is, π(x) ∝ R(x). This facilitates the discovery of multiple modes of a given reward function, which is a desirable objective in scientific discovery (Jain et al., 2023). For instance, GFlowNets have been successfully applied to biological sequence design and molecular property prediction as part of active learning algorithms (Jain et al., 2022; Hernandez-Garcia et al., 2023).

A key property of GFlowNets is that objects are generated sequentially. Starting from a special state s_0, transitions s_t → s_{t+1} ∈ A are applied between states s ∈ S, forming trajectories τ = (s_0 → s_1 → … → x), where A is a predefined action space and S the state space. This sequential construction of objects, together with the fact that the transition policy P_θ(s_{t+1} | s_t) is parameterised by a neural network (with parameters θ), provides amortisation and the potential of systematic generalisation. Generalisation is possible if the decomposition into partially constructed objects provides a structure that can be learned by a machine learning model.
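The sequential construction described above can be sketched as a generic rollout loop. Note that the uniform policy, the callback names and the toy state representation below are illustrative stand-ins of ours, not the actual Crystal-GFN implementation, where the policy is a learned network P_θ(s_{t+1} | s_t):

```python
import random

def sample_trajectory(actions_fn, step_fn, is_terminal_fn, s0):
    """Roll out one trajectory s0 -> s1 -> ... -> x.

    actions_fn(s) returns the valid actions at state s (a mask over
    actions is how hard constraints can be enforced), step_fn(s, a)
    applies a transition, and is_terminal_fn(s) signals that the
    object x is complete. The policy here is uniform over valid
    actions, purely for illustration.
    """
    trajectory = [s0]
    s = s0
    while not is_terminal_fn(s):
        a = random.choice(actions_fn(s))
        s = step_fn(s, a)
        trajectory.append(s)
    return trajectory
```

In a trained GFlowNet, `random.choice` would be replaced by sampling from the learned forward policy, and the reward R(x) of the terminal state would enter the training objective (e.g. Trajectory Balance).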

In order to obtain a policy that samples proportionally to the reward distribution, several training objectives have been proposed in the literature. Here, we will make use of the Trajectory Balance objective (Malkin et al., 2022), which has proven effective in various tasks (Jain et al., 2022; Hernandez-Garcia et al., 2023; Malkin et al., 2022). While GFlowNets were initially introduced for discrete probabilistic modelling, Lahlou et al. (2023) recently proposed a generalisation of the framework for continuous or hybrid state spaces. This work empirically demonstrates this generalisation, as we propose a GFlowNet that operates in a hybrid—mixture of discrete and continuous—state space.

3.2  Crystals

Crystals are highly structured solid materials defined by a repeated arrangement of atoms in space. The elementary pattern that periodically repeats itself is called a unit cell. This unique ordering constitutes the material’s signature and greatly influences its properties. Therefore, understanding crystal structures and the structure-property relationship is crucial in the development of new materials. While there exists an infinite number of crystal structures, the field of theoretical crystallography has developed ways to systematically parameterise and classify crystals according to their symmetry.

Lattice

An n-dimensional lattice Λ can be defined as the set of integral combinations of the linearly independent, non-coplanar lattice basis vectors a_i ∈ R^n: Λ ≐ {Σ_{i=1}^n m_i a_i | m_i ∈ Z}, given a reference origin. In 3D, the 3 lattice basis vectors can be converted into 6 parameters: a, b, c determine the lengths of each dimension and α, β, γ the angles between dimensions.
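As a small illustration of the definition above, the following sketch (a hypothetical helper of ours, not from the paper) enumerates the integral combinations of the basis vectors for a small range of coefficients m_i:

```python
import itertools
import numpy as np

def lattice_points(basis, m_range=range(-1, 2)):
    """Enumerate lattice points sum_i m_i * a_i for m_i in m_range.

    `basis` is an (n, n) array whose rows are the lattice basis
    vectors a_i; the full lattice is the infinite set obtained by
    letting m_range extend over all of Z.
    """
    basis = np.asarray(basis, dtype=float)
    n = len(basis)
    return np.array([sum(m[i] * basis[i] for i in range(n))
                     for m in itertools.product(m_range, repeat=n)])

# Simple cubic lattice with a = b = c = 1: the 27 points of [-1, 1]^3
pts = lattice_points(np.eye(3))
```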

Composition

We call composition the set of all atom types comprised in the crystal’s unit cell, together with their respective quantities.

Space group

The specific arrangement of atoms in a crystal generates some degree of symmetry. Symmetry is achieved when an operation applied to an initial arrangement of atoms leaves it unchanged. The combination of all symmetry elements present in a crystal defines its space group. A space group element can be described as a tuple (W, t), where W is an orthogonal matrix and t a translation vector. An element maps a vector x ∈ R^n to Wx + t. In three dimensions, space groups are classified into 230 types.
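The action of a space group element can be illustrated directly. The specific operation below, a two-fold rotation about z combined with a half-lattice translation, is just an example of ours, not tied to any particular group discussed in the paper:

```python
import numpy as np

def apply_symmetry(W, t, x):
    """Apply a space-group element (W, t): x -> W x + t."""
    return np.asarray(W) @ np.asarray(x) + np.asarray(t)

# Two-fold rotation about the z axis plus a half translation along z
W = np.diag([-1.0, -1.0, 1.0])
t = np.array([0.0, 0.0, 0.5])
y = apply_symmetry(W, t, np.array([0.25, 0.1, 0.0]))  # -> [-0.25, -0.1, 0.5]
```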

4  Crystal-GFNs

In this section, we describe the details of Crystal-GFN, the main method we introduce in this paper to explore the space of crystal structures and generate crystals with desirable properties and domain constraints.

Drawing inspiration from crystallography, we propose to represent crystals as the concatenation of three distinct components or subspaces: the space group (SG), composition (C) and lattice parameters (LP) of the unit cell of a material. The Cartesian product of the spaces of these three components would make the state space S = S_SG × S_C × S_LP and sample space X = X_SG × X_C × X_LP of a naive, unrestricted implementation with GFlowNets. An important advantage of the GFlowNet framework and of the domain-inspired data representation that we propose is that it allows us to flexibly introduce domain knowledge, such as well-studied geometrical constraints from crystallography as well as physicochemical constraints, described below in this section. These constraints are mostly introduced by restricting the action space A of the Crystal-GFN, both within and across subspaces.

In this work, we design the Crystal-GFN such that trajectories first select the space group, then the composition and finally the lattice parameters. Figure 1 offers a graphical summary of the sequential generation in a Crystal-GFN. Below we describe the details of the three subspaces and the constraints we have introduced.

4.1  Space group

In our Crystal-GFN, the sample space for the space group subspace is X_SG = {1, 2, …, 230}, corresponding to the 230 symmetry groups in three dimensions. As discussed in Section 3.1, GFlowNets rely on the decomposition of objects into multiple steps to facilitate generalisation. Therefore, we draw inspiration from theoretical crystallography to incorporate additional structure into the space group subspace.

In three dimensions, lattices admit 14 possible arrangements, known as the Bravais lattices, which are categorised into seven crystal systems: triclinic, monoclinic, orthorhombic, tetragonal, trigonal, hexagonal and cubic. Alternatively, a related but slightly different categorisation is the lattice system, also with seven classes (see Fig. 2). The lattice system category is convenient for our purposes because it imposes specific constraints on the lattice parameters (see C_4 in Fig. 1). Here, we use a derived categorisation resulting from the combination of a crystal system and a lattice system. We refer to it as the crystal-lattice system; it has eight categories: triclinic, monoclinic, orthorhombic, tetragonal, trigonal-rhombohedral, trigonal-hexagonal, hexagonal-hexagonal and cubic.

Besides the crystal-lattice system, we introduce another category in the Crystal-GFN space group subspace: the point symmetry (also known as site symmetry), which defines the type of symmetry of a point group. We use the following five categories of point symmetries: centrosymmetric, non-centrosymmetric, enantiomorphic, polar and enantiomorphic-polar. We choose these categories because the combination of point symmetry and crystal system gives rise to one of the 32 crystal classes or crystallographic point groups.

In sum, our space group subspace S_SG is a three-dimensional set whose entries correspond to the crystal-lattice system, the point symmetry and the space group. We define the action space such that, starting from the source state s_0, the transitions can set any of the three dimensions. However, the selection of an option along any dimension restricts the valid options for the remaining dimensions (constraint C_1 in Fig. 1). For example, if the point symmetry “non-centrosymmetric” is selected from the source state, only three crystal-lattice systems remain valid and the number of valid space groups is reduced to 25. The “stop” action is valid if and only if the space group has been selected.

4.2  Composition

We represent a composition as the number of atoms of each element present in the unit cell. Specifically, for a vocabulary of D elements, where each element can have up to K atoms, we construct D-dimensional vectors whose entries indicate the number of atoms of element d. This yields the space S_C = {(k_1, …, k_D) | k_d ∈ {0, 1, …, K}, d = 1, …, D}. The action space of the composition subspace consists of the choice of element and number of atoms, that is, A_C = {1, …, D} × {1, …, K}, plus the special “stop” action to finish a trajectory.
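A minimal sketch of one composition step follows. The dict-based state and the function name are hypothetical, and we assume, for illustration only, that each element's count is set at most once per trajectory:

```python
def apply_composition_action(state, element, count, K=16):
    """One composition step: set `count` atoms of element `element`.

    `state` is a dict mapping element index d -> number of atoms k_d,
    a sparse view of the D-dimensional vector (k_1, ..., k_D).
    Counts are bounded by K, mirroring the discrete action space
    A_C = {1..D} x {1..K}. Assumption (ours): an element's count is
    set only once per trajectory.
    """
    if element in state:
        raise ValueError("element already set in this trajectory")
    if not 1 <= count <= K:
        raise ValueError(f"count must be in 1..{K}")
    new_state = dict(state)  # states are immutable along a trajectory
    new_state[element] = count
    return new_state
```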

To meet the electroneutrality requirement, the positive and negative charges in a material must balance. Since we have control over the action space of the GFlowNet, we incorporate a hard constraint to ensure that all generated compositions can carry a neutral charge (constraint C_3 in Fig. 1). We argue that this is an advantage over methods that aim to learn such properties implicitly from the data (Xie et al., 2021).
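One way to check whether a composition admits a neutral charge assignment is a brute-force search over oxidation states. This sketch is illustrative only: the oxidation-state lists are our assumptions, and the actual Crystal-GFN enforces the constraint through the action space rather than by post-hoc filtering:

```python
from itertools import product

def can_be_neutral(composition, oxidation_states):
    """Check whether some oxidation-state assignment balances charge.

    `composition` maps element -> atom count; `oxidation_states` maps
    element -> list of allowed oxidation states. The composition can
    be electroneutral iff some choice of one state per element makes
    the total charge zero (brute force over all combinations).
    """
    elements = list(composition)
    for states in product(*(oxidation_states[e] for e in elements)):
        if sum(composition[e] * s for e, s in zip(elements, states)) == 0:
            return True
    return False

# Li2O: 2 * (+1) + 1 * (-2) = 0, so it can be neutral
assert can_be_neutral({"Li": 2, "O": 1}, {"Li": [1], "O": [-2]})
```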

Finally, we also have an opportunity to incorporate inter-subspace constraints between the space group and the composition. Because of the symmetry imposed by each space group, not all compositions are possible given a space group, and given a composition not all space groups are valid. By way of illustration, the most restrictive space group, international number 230, cannot accommodate compositions with fewer than 16 atoms per element. Since trajectories in Crystal-GFN sample the space group first, we restrict the action space of the GFlowNet to only allow the sampling of compositions that are compatible with the Wyckoff positions (see Appendix A) of the selected space group (constraint C_2 in Fig. 1).

4.3  Lattice parameters

The lattice parameters (a, b, c, α, β, γ), in combination with the lattice system (or crystal system), determine the shape of the unit cell in three dimensions. While both the composition and the space group subspaces are discrete, the lattice parameters are real-valued and thus yield a continuous sample space X_LP = (ℓ_min, ℓ_max)^3 × (θ_min, θ_max)^3, where ℓ_min and ℓ_max are the minimum and maximum lengths and θ_min and θ_max are the minimum and maximum angles, respectively. In order to sample sets of continuous lattice parameters, the GFlowNet policy model outputs the parameters of a six-dimensional mixture of Beta distributions, a continuous probability distribution defined on the interval (0, 1).
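A sketch of how Beta samples in (0, 1) can be rescaled to lattice lengths and angles. For brevity it uses a single Beta per dimension instead of a mixture, and the ranges are illustrative assumptions rather than the paper's actual bounds:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_lattice_params(alpha, beta, l_min=1.0, l_max=10.0,
                          theta_min=30.0, theta_max=150.0):
    """Sample (a, b, c, alpha, beta, gamma) via Beta distributions.

    A single Beta(alpha, beta) per dimension stands in for the
    six-dimensional mixture of Betas output by the policy. Samples
    in (0, 1) are rescaled to (l_min, l_max) for lengths a, b, c
    and to (theta_min, theta_max) degrees for the three angles.
    """
    u = rng.beta(alpha, beta, size=6)          # six values in (0, 1)
    lengths = l_min + u[:3] * (l_max - l_min)  # a, b, c in Angstrom
    angles = theta_min + u[3:] * (theta_max - theta_min)
    return np.concatenate([lengths, angles])
```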

Here, we incorporate an inter-subspace constraint from the space group (C_4 in Fig. 1). Every space group is classified into a lattice system which, in turn, constrains the values of the lattice parameters. For instance, for the cubic lattice system, a = b = c and α = β = γ = 90°. The remaining constraints are illustrated in Fig. 2. This drastically reduces the state space and removes the need to learn such relationships and constraints from the data.
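The lattice-system constraint can be pictured as projecting the free parameters onto the allowed subset. This partial sketch (a hypothetical function of ours, covering only three of the seven systems) follows the conventions stated in the text and Fig. 2:

```python
def apply_lattice_system(params, system):
    """Project (a, b, c, alpha, beta, gamma) onto a lattice system.

    Only three of the seven systems are shown, for illustration:
    cubic (a = b = c, all angles 90), tetragonal (a = b, all angles
    90) and monoclinic (alpha = gamma = 90, beta free). Triclinic is
    the unconstrained fallback.
    """
    a, b, c, al, be, ga = params
    if system == "cubic":
        return (a, a, a, 90.0, 90.0, 90.0)
    if system == "tetragonal":
        return (a, a, c, 90.0, 90.0, 90.0)
    if system == "monoclinic":
        return (a, b, c, 90.0, be, 90.0)
    return params  # triclinic: no constraints
```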

[Figure 2: panels (a)–(g), one per lattice system; see caption below.]
Figure 2: The seven lattice systems and the constraints they impose on the lattice parameters. In order a–g: triclinic, monoclinic, orthorhombic, tetragonal, rhombohedral, hexagonal and cubic. Source of the figures: Wikimedia Commons, licensed under the Creative Commons Attribution-Share Alike 3.0 Unported license.

4.4  Reward function

As explained in Section 3, our Crystal-GFN is trained to sample proportionally to a reward function. This offers the flexibility of using as reward any quantity of interest, such as a desired property of materials. As an initial proof of concept for the method, we choose the reward to be based on the formation energy (FE) per atom of the sampled crystal structure, in order to generate materials with higher likelihood of being thermodynamically stable.

The formation energy of a material is defined as the amount of energy needed to form it from its constituent elements in their standard states. If a compound C is the combination of elements A and B, that is, A + B → C, then the formation energy E_f of C is defined as E_f(C) = E(C) − E(A) − E(B). A negative formation energy indicates that the formation reaction of a material from its constituent elements is spontaneous, or favourable. Conversely, a positive formation energy indicates that the formation reaction is not thermodynamically favoured.
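In code, this definition is simple arithmetic. The energies below are hypothetical illustrative numbers, not DFT values:

```python
def formation_energy(e_compound, e_elements):
    """E_f(C) = E(C) - sum of the constituent element energies (eV)."""
    return e_compound - sum(e_elements)

# Hypothetical A + B -> C with E(C) = -12.0, E(A) = -4.0, E(B) = -5.0:
# E_f(C) = -12.0 - (-4.0) - (-5.0) = -3.0 eV, i.e. favourable
ef = formation_energy(-12.0, [-4.0, -5.0])
```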

Because crystal structures sampled by the Crystal-GFN may be unknown or not characterised in existing databases, their true FE may be unknown too. This is why we train a proxy machine learning model to predict the FE given a crystal x ∈ X, parameterised as the output of Crystal-GFN.

In particular, we train a physics-informed Multi-Layer Perceptron (MLP) on the MatBench data set (Dunn et al., 2020), a snapshot of the Materials Project (Jain et al., 2013). For a given crystal structure, the input to the proxy MLP is the concatenation of the following (refer to Appendix C for more details):

  • A physical embedding of the crystal’s elements, obtained using PhAST (Duval et al., 2022), that is from the atomic number, the period, the group and other relevant atomic properties.

  • A learned embedding for its space group.

  • The standardised lattice parameters.

We stratify the MatBench data set into train, validation and test sets, controlling for the distribution of the target FE. Our training and validation sets contain 65,048 and 23,232 data points, respectively. The final reward function is a Boltzmann transformation of the proxy output: R(x) = exp(−MLP(x)/T), where T is a temperature hyper-parameter. This ensures that lower FE yields higher positive reward, and we can control our preference for lower energies with the temperature T.
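The reward transformation can be sketched directly from the formula; here the proxy output MLP(x) is replaced by a precomputed predicted energy, and the temperature value is illustrative:

```python
import math

def reward(pred_energy_ev, temperature=1.0):
    """Boltzmann-style reward R(x) = exp(-MLP(x) / T).

    Lower predicted formation energy yields higher reward, and a
    smaller T sharpens the preference for low energies. The proxy
    output MLP(x) is stood in for by `pred_energy_ev` (eV/atom).
    """
    return math.exp(-pred_energy_ev / temperature)

# A lower predicted energy always outscores a higher one for T > 0
assert reward(-3.1) > reward(-2.0)
```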

5  Empirical evaluation

In this section, we present the results of experiments designed to evaluate the potential of Crystal-GFN to explore the space of materials. The motivation for using a GFlowNet to generate crystal structures is twofold: first, we are interested in discovering structures with high scores on a property of interest, in this case low formation energy; second, we want to discover not just one but multiple, diverse structures. The main reason to seek diverse samples is that the target function is often underspecified and the true objective is multifaceted or unknown. A natural way of dealing with underspecification is to try multiple candidates to increase the likelihood of finding successful structures for the downstream applications (Jain et al., 2023; Hernandez-Garcia et al., 2023). Thus, we here evaluate two aspects of Crystal-GFN: the distribution of the predicted formation energy of the generated samples, and their diversity.

5.1  Experimental setup

Exploring the infinitely large space of crystals to find the ones with certain properties is a daunting “needle-in-a-haystack” task. In order to make the haystack a little smaller—though still infinitely large—we restrict the search task to a subset of 113 space groups and compositions with up to 5 unique elements from a set of 12 common elements. We train the Crystal-GFN for 50,000 iterations, which amounts to 500,000 queries to the proxy model and about 12 hours on a CPU-only machine. Further details about the experimental setup are provided in Appendix B.

5.2  Results

In this section, we present results related to our proxy model, to the distribution of the formation energy in Crystal-GFN samples and to the diversity of these samples.

Proxy model of the formation energy

As presented in Section 4.4, our reward function is based on a proxy MLP trained on MatBench (Materials Project). Since part of our performance analysis of the proposed Crystal-GFN depends on the formation energy predicted by this proxy model, it is important to first verify its accuracy. In contrast to other methods, it does not rely on atom positions to predict the FE of a crystal, but rather on a higher-level description of the crystal, namely its space group, composition and lattice parameters, akin to the outputs of Crystal-GFN. Nonetheless, it achieves a mean absolute error (MAE) of 0.10 ± 0.005 eV/atom on the validation set. We further describe its hyper-parameters and detailed performance in Appendix C.

Predicted formation energy of samples

Evaluating generative models is well known to be a difficult, task-dependent problem. Here, in order to gain insights about the sampling policy of the trained Crystal-GFN, we sample 10,000 crystals and compare the distribution with 1) the validation set from MatBench and 2) 10,000 samples from a randomly initialised and untrained Crystal-GFN. Figure 3 shows the kernel density estimate of the formation energies predicted by our proxy model for the aforementioned sets of samples. The main conclusion is that training the Crystal-GFN shifts the formation energy of the sampled crystals towards lower values, with a median of -3.1 eV/atom, lower even than the median energy in the validation set. Given the vast search space, it is remarkable that with only 12 hours of training, Crystal-GFN learns to sample from a distribution in which 95% of the structures have a formation energy lower than -2 eV/atom (as predicted by the proxy model).
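The summary statistics quoted above (the median and the fraction of samples below -2 eV/atom) can be computed with a short helper. This is a minimal sketch; the energies used below are synthetic stand-ins, not actual model outputs.

```python
import numpy as np

def energy_summary(energies, threshold=-2.0):
    """Return the median predicted formation energy (eV/atom) and the
    fraction of samples below `threshold`, the two statistics reported
    for Crystal-GFN samples."""
    energies = np.asarray(energies, dtype=float)
    return float(np.median(energies)), float(np.mean(energies < threshold))

# Synthetic energies for illustration only (not actual proxy predictions)
rng = np.random.default_rng(0)
median, frac_below = energy_summary(rng.normal(-3.1, 0.5, size=10_000))
```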

Figure 3: Distributions of the formation energy predicted by our proxy model for three sets of samples: in blue, samples from Crystal-GFN after training; in orange, the validation set, representative of the MatBench database; in pink, samples from an untrained Crystal-GFN. After training, Crystal-GFN samples crystals with predicted formation energies in the range of the validation set or lower.

Diversity of samples

As discussed above, an important goal in many materials discovery applications is to find diverse candidates. In other words, sampling crystals with low formation energy would be of little use for most applications if all the crystals were identical or highly similar. To assess diversity, we analyse the crystal structures sampled by the trained Crystal-GFN. The main conclusion is that Crystal-GFN samples highly diverse candidates. Regarding composition, all 12 elements appear in the samples from the Crystal-GFN, and 10 of them appear in the 100 samples with the lowest predicted formation energy. Regarding crystal categories and space groups, 5 out of the 8 crystal-lattice systems and all 5 point symmetries are found in the top-100 samples alone. While not all 113 space groups are found in the 10,000 samples, 80 of them are present (70%). Finally, in terms of lattice parameters, we observe distributions of lengths and angles relatively similar to those of the MatBench data set. Further details and visualisations of these results are provided in Appendix D.
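Coverage counts of the kind reported above can be computed with a minimal sketch, assuming (hypothetically) that each sample is represented as a dict with a composition mapping and a space-group number; the sample data below is illustrative:

```python
from collections import Counter

def diversity_report(samples):
    """Count element and space-group coverage in a list of sampled crystals.
    Each sample is assumed to be a dict with keys 'composition'
    (element -> atom count) and 'space_group' (international number)."""
    elements = set()
    space_groups = Counter()
    for s in samples:
        elements.update(s["composition"].keys())
        space_groups[s["space_group"]] += 1
    return {"n_elements": len(elements), "n_space_groups": len(space_groups)}

# Toy samples for illustration
samples = [
    {"composition": {"Li": 2, "O": 1}, "space_group": 225},
    {"composition": {"Fe": 1, "O": 3}, "space_group": 194},
    {"composition": {"Li": 1, "Fe": 1, "O": 2}, "space_group": 225},
]
report = diversity_report(samples)
```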

6  Conclusions and future work

In this paper, we have introduced a new generative model, Crystal-GFN, which samples inorganic crystal structures proportionally to a property of interest. A key feature of Crystal-GFN is its flexibility: it can be trained with any available reward function and can incorporate domain constraints. The latter is possible because Crystal-GFN constructs crystals sequentially in the space of compositions, space groups and lattice parameters of the unit cell. This parameterisation, inspired by theoretical crystallography, allowed us to flexibly incorporate constraints on the charge neutrality of the composition, the compatibility between composition and space group, and the compatibility between space group and lattice parameters.

By training a Crystal-GFN with a reward function based on the formation energy predicted by a proxy model trained on the MatBench database, we have shown that it learns to sample candidates with low predicted formation energy (lower mean and median than the distribution of the MatBench data set). Importantly, it does so while sampling highly diverse crystals, in terms of coverage of space groups, compositions and lattice parameters.

Interesting directions for future work include extending Crystal-GFN with more domain-inspired constraints as well as with additional subspaces to sample the atomic positions in the unit cell. Alternatively, crystal structure prediction from the samples could be performed using pre-trained machine learning models that generate atomic coordinates. Furthermore, it would be interesting to explore other properties of interest beyond the formation energy of the sampled crystals.

Ethics Statement

Our work is motivated by the climate crisis, the need to develop sustainable technology and the ambition to apply our approach to materials discovery. However, as with other related work, there is a potential risk of dual use of the technology by nefarious actors (Urbina et al., 2022). The authors strongly oppose any uses or derivations of this work intended to cause harm to humans or the environment.

Code availability

The code of the Crystal-GFN algorithm presented in this paper is open source and available at https://github.com/alexhernandezgarcia/gflownet.

Acknowledgements

We thank Bruno Rousseau, Simon Blackburn, Mickael Dollé and Homin Shin for their valuable insights from the materials science perspective. Sékou-Oumar Kaba, Michael Kilgour, Harry Guo and Alain Tchagang have also participated in helpful discussions. We thank Mila’s IDT team for their support. Finally, we thank the Quebec’s Ministry of Economy, Innovation and Energy for their financial support.

This research was enabled in part by compute resources provided by Mila (mila.quebec).

References

  • Ahmad & Cai (2022) Ahmad, R. and Cai, W. Free energy calculation of crystalline solids using normalizing flows. Modelling and Simulation in Materials Science and Engineering, 30(6):065007, 2022.
  • Bengio et al. (2021) Bengio, E., Jain, M., Korablyov, M., Precup, D., and Bengio, Y. Flow network based generative models for non-iterative diverse candidate generation. In Advances in Neural Information Processing Systems (NeurIPS), volume 34, 2021.
  • Cao & Kipf (2018) Cao, N. D. and Kipf, T. Molgan: An implicit generative model for small molecular graphs. arXiv preprint arXiv: 1805.11973, 2018.
  • Cheng et al. (2022) Cheng, G., Gong, X.-G., and Yin, W.-J. Crystal structure prediction by combining graph network and optimization algorithm. Nature Communications, 13(1), March 2022. doi: 10.1038/s41467-022-29241-4. URL https://doi.org/10.1038/s41467-022-29241-4.
  • Court et al. (2020) Court, C. J., Yildirim, B., Jain, A., and Cole, J. M. 3-d inorganic crystal structure generation and property prediction via representation learning. Journal of Chemical Information and Modeling, 60(10):4518–4535, 2020.
  • Dunn et al. (2020) Dunn, A., Wang, Q., Ganose, A., Dopp, D., and Jain, A. Benchmarking materials property prediction methods: the Matbench test set and Automatminer reference algorithm, 2020. URL https://matbench.materialsproject.org/#citing-matbench. Accessed 2023-09-28.
  • Duval et al. (2022) Duval, A., Schmidt, V., Miret, S., Bengio, Y., Hernández-García, A., and Rolnick, D. PhAST: Physics-aware, scalable, and task-specific gnns for accelerated catalyst design. arXiv preprint arXiv: 2211.12020, 2022. URL https://arxiv.org/abs/2211.12020v3.
  • Gebauer et al. (2019) Gebauer, N., Gastegger, M., and Schütt, K. Symmetry-adapted generation of 3d point sets for the targeted discovery of molecules. Advances in neural information processing systems, 32, 2019.
  • Hernandez-Garcia et al. (2023) Hernandez-Garcia, A., Saxena, N., Jain, M., Liu, C.-H., and Bengio, Y. Multi-fidelity active learning with gflownets. arXiv preprint arXiv: 2306.11715, 2023. URL https://arxiv.org/abs/2306.11715v1.
  • Hoffmann et al. (2019) Hoffmann, J., Maestrati, L., Sawada, Y., Tang, J., Sellier, J. M., and Bengio, Y. Data-driven approach to encoding and decoding 3-d crystal structures. arXiv preprint arXiv:1909.00949, 2019.
  • Jain et al. (2013) Jain, A., Ong, S. P., Hautier, G., Chen, W., Richards, W. D., Dacek, S., Cholia, S., Gunter, D., Skinner, D., Ceder, G., and Persson, K. A. Commentary: The Materials Project: A materials genome approach to accelerating materials innovation. APL Materials, 1(1):011002, 07 2013. ISSN 2166-532X. doi: 10.1063/1.4812323. URL https://doi.org/10.1063/1.4812323.
  • Jain et al. (2022) Jain, M., Bengio, E., Hernandez-Garcia, A., Rector-Brooks, J., Dossou, B. F., Ekbote, C. A., Fu, J., Zhang, T., Kilgour, M., Zhang, D., et al. Biological sequence design with GFlowNets. In International Conference on Machine Learning (ICML), volume 162. PMLR, 2022.
  • Jain et al. (2023) Jain, M., Deleu, T., Hartford, J., Liu, C.-H., Hernandez-Garcia, A., and Bengio, Y. GFlowNets for AI-driven scientific discovery. Digital Discovery, 2023.
  • Kim et al. (2020) Kim, S., Noh, J., Gu, G. H., Aspuru-Guzik, A., and Jung, Y. Generative adversarial networks for crystal structure prediction. ACS central science, 6(8):1412–1420, 2020.
  • Lahlou et al. (2023) Lahlou, S., Deleu, T., Lemos, P., Zhang, D., Volokhova, A., Hernández-García, A., Ezzine, L. N., Bengio, Y., and Malkin, N. A theory of continuous generative flow networks. In International Conference on Machine Learning (ICML), 2023.
  • Long et al. (2021) Long, T., Fortunato, N. M., Opahle, I., Zhang, Y., Samathrakis, I., Shen, C., Gutfleisch, O., and Zhang, H. Constrained crystals deep convolutional generative adversarial network for the inverse design of crystal structures. npj Computational Materials, 7(1):66, 2021.
  • Malkin et al. (2022) Malkin, N., Jain, M., Bengio, E., Sun, C., and Bengio, Y. Trajectory balance: Improved credit assignment in GFlowNets. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, 2022.
  • Nouira et al. (2018) Nouira, A., Crivello, J., and Sokolovska, N. Crystalgan: Learning to discover crystallographic structures with generative adversarial networks. AAAI Spring Symposium Combining Machine Learning with Knowledge Engineering, 2018.
  • Pakornchote et al. (2023) Pakornchote, T., Choomphon-anomakhun, N., Arrerut, S., Atthapak, C., Khamkaeo, S., Chotibut, T., and Bovornratanaraks, T. Diffusion probabilistic models enhance variational autoencoder for crystal structure generative modeling. arXiv preprint arXiv:2308.02165, 2023.
  • Ren et al. (2020) Ren, Z., Noh, J., Tian, S., Oviedo, F., Xing, G., Liang, Q., Aberle, A., Liu, Y., Li, Q., Jayavelu, S., et al. Inverse design of crystals using generalized invertible crystallographic representation. arXiv preprint arXiv:2005.07609, 3(6):7, 2020.
  • Ren et al. (2022) Ren, Z., Tian, S. I. P., Noh, J., Oviedo, F., Xing, G., Li, J., Liang, Q., Zhu, R., Aberle, A. G., Sun, S., et al. An invertible crystallographic representation for general inverse design of inorganic crystals with targeted properties. Matter, 5(1):314–335, 2022.
  • Satorras et al. (2021) Satorras, V. G., Hoogeboom, E., Fuchs, F., Posner, I., and Welling, M. E(n) equivariant normalizing flows. Neural Information Processing Systems, 2021.
  • Shi et al. (2020) Shi, C., Xu, M., Zhu, Z., Zhang, W., Zhang, M., and Tang, J. Graphaf: a flow-based autoregressive model for molecular graph generation. arXiv preprint arXiv:2001.09382, 2020.
  • Urbina et al. (2022) Urbina, F., Lentzos, F., Invernizzi, C., and Ekins, S. Dual use of artificial-intelligence-powered drug discovery. Nature Machine Intelligence, 4(3):189–191, 2022.
  • Xie et al. (2021) Xie, T., Fu, X., Ganea, O., Barzilay, R., and Jaakkola, T. Crystal diffusion variational autoencoder for periodic material generation. International Conference On Learning Representations, 2021.
  • Xu et al. (2022) Xu, M., Yu, L., Song, Y., Shi, C., Ermon, S., and Tang, J. Geodiff: A geometric diffusion model for molecular conformation generation. arXiv preprint arXiv:2203.02923, 2022.
  • Zhao et al. (2021) Zhao, Y., Al-Fahdi, M., Hu, M., Siriwardane, E. M., Song, Y., Nasiri, A., and Hu, J. High-throughput discovery of novel cubic crystal materials using deep generative neural networks. Advanced Science, 8(20):2100566, 2021.
  • Zheng et al. (2023) Zheng, S., He, J., Liu, C., Shi, Y., Lu, Z., Feng, W., Ju, F., Wang, J., Zhu, J., Min, Y., et al. Towards predicting equilibrium distributions for molecular systems with deep learning. arXiv preprint arXiv:2306.05445, 2023.

Appendix A Materials and crystal structures

Materials

Materials encompass the substances that make up objects. Their distinctive characteristics arise from a combination of factors, including their composition, structure and manufacturing processes. While materials come in various forms, our primary focus is on solid-state materials, specifically crystals, owing to their significance in many machine learning applications.

Solid-state materials

Solid-state materials consist of a multitude of smaller building blocks, such as atoms or molecules, securely positioned in fixed locations. They exhibit high-density packing and strong mutual attraction, resulting in a stable structure with a well-defined volume. Materials that possess a well-ordered, infinitely repetitive structure are categorised as crystals, while those lacking such long-range positional order are termed amorphous.

Crystals

Also known as crystalline materials, crystals represent a category of solid substances characterised by their periodic structure. This entails a consistent and predictable repetition of atom or molecule placement within the crystal across all spatial dimensions. It is this inherent periodicity that sets crystals apart from smaller molecules and confers upon them unique properties. Crystals manifest in diverse forms and showcase exceptional attributes like magnetism, transparency and elevated melting points. The study of crystals offers a pathway to precise characterisation and controlled manipulation, opening up avenues for engineering materials with targeted properties to fulfill a wide range of needs. The subsequent section delves into how machine learning models address and harness this periodicity.

A crystal structure includes the type and position of every atom as well as the translational axes that allow the structure to repeat. Since crystal structures extend infinitely, we need to define a small section that can be repeated: the unit cell. Unit cells are made of a lattice and a basis: the lattice specifies how the structure is repeated, and the basis specifies what is repeated. A unit cell has six parameters: three edge lengths $a$, $b$, $c$ and the angles between the edges $\alpha$, $\beta$, $\gamma$. The edges of a unit cell may or may not be perpendicular to each other. A crystal structure combines the lattice (how atoms are translated) with a basis (which atoms are translated), which theoretically describes every atom in the crystal. The lattice is a purely mathematical construct, while the basis is the "real" component that is repeated according to the lattice.

Lattice

The lattice is fully specified by its origin and its basis vectors $\mathbf{a}_i$, and it is associated with a group of translations $T_\Lambda$. The lattice $\Lambda$ is the underlying structure of the unit cells, which have the property of tiling the space $\mathbb{R}^n$ when translated by lattice vectors. Of particular prominence is the primitive cell $U$ for the basis $\mathbf{a}_i$: $U \doteq \left\{\sum_{i}^{n} x_i \mathbf{a}_i \mid 0 \leq x_i < 1\right\}$.
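The definition of the primitive cell above can be illustrated with a short sketch: fractional coordinates lie in the cell when each component is in [0, 1), and the basis vectors map them to Cartesian positions. The basis values below are illustrative.

```python
import numpy as np

def in_primitive_cell(frac):
    """True if fractional coordinates lie in the primitive cell
    U = { sum_i x_i a_i | 0 <= x_i < 1 }."""
    frac = np.asarray(frac)
    return bool(np.all((frac >= 0.0) & (frac < 1.0)))

def to_cartesian(frac, basis):
    """Map fractional coordinates to Cartesian positions using the lattice
    basis vectors a_i, stored as the rows of `basis`."""
    return np.asarray(frac) @ np.asarray(basis)

# Toy cubic lattice with edge length 2 angstroms (illustrative values)
basis = 2.0 * np.eye(3)
cart = to_cartesian([0.5, 0.5, 0.5], basis)  # centre of the cell
```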

In a material, the unit cell comprises a set of atomic positions $S = \{(Z_i, \mathbf{x}_i) \mid \mathbf{x}_i \in U\}$, where $\mathbf{x}_i$ is the atom position (called the basis vector) and $Z_i$ its atomic number. $S$ can contain an arbitrary number of atoms and is not constrained to a particular structure.

Periodic boundary conditions

Periodic boundary conditions (pbc) are commonly used to represent the infinite repeating nature of crystals and other periodic systems. Rather than confining the analysis to a single unit cell housing all atoms, pbc envision an infinite array of identical unit cells extending indefinitely in all directions. This means that if an atom or molecule exits the cell on one side, it effectively re-enters from the opposite side, as if traversing a continuous lattice. This approach enables scientists to simulate the behavior of an infinitely extending crystal using a finite computational domain.
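The re-entry behaviour described above amounts to taking fractional coordinates modulo 1, which can be sketched in a few lines:

```python
def wrap_pbc(frac):
    """Wrap fractional coordinates back into the primitive cell under
    periodic boundary conditions: a coordinate leaving through one face
    re-enters through the opposite one."""
    return [x % 1.0 for x in frac]

# A position past the cell boundary (and one before it) gets wrapped back
wrapped = wrap_pbc([1.2, -0.3, 0.5])
```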

Wyckoff positions

Also defined as classes of crystallographic orbits, Wyckoff positions categorise the positions in a lattice of atoms that sit on symmetry elements. A Wyckoff position describes the set of positions generated from an atom's position by the symmetry operations of the space group; the number of distinct equivalent positions is the Wyckoff multiplicity.
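The notion of multiplicity can be illustrated with a toy sketch: apply a set of symmetry operations (each acting on fractional coordinates as $\mathbf{x} \mapsto W\mathbf{x} + t \bmod 1$) to a position and count the distinct images. The two-operation group below (identity and inversion through the origin) is a deliberately minimal, hypothetical example.

```python
import numpy as np

def multiplicity(frac_pos, sym_ops, tol=1e-6):
    """Count the distinct images of a fractional position under a list of
    symmetry operations (W, t), each acting as x -> W x + t mod 1."""
    images = []
    for W, t in sym_ops:
        img = (np.asarray(W) @ np.asarray(frac_pos) + np.asarray(t)) % 1.0
        if not any(np.allclose(img, other, atol=tol) for other in images):
            images.append(img)
    return len(images)

# Toy group: identity and inversion through the origin
ops = [(np.eye(3), np.zeros(3)), (-np.eye(3), np.zeros(3))]
general = multiplicity([0.1, 0.2, 0.3], ops)  # general position
special = multiplicity([0.0, 0.0, 0.0], ops)  # on the inversion centre
```

An atom at a general position has two images, while an atom sitting exactly on the inversion centre is mapped to itself, hence the lower multiplicity of special positions.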

Appendix B Experimental setup

In this section, we provide additional details about the hyper-parameters and experimental setup used to train the Crystal-GFN to obtain the results presented in Section 5.

In order to reduce the search space in our experiments, we apply the following restrictions:

  • Compositions consist of up to 5 different elements from the subset of these 12 elements: H, Li, C, N, O, F, Mg, Si, P, S, Cl and Fe. These are the most common elements in the MatBench data set used to train the proxy model. Note that even this reduced set yields a combinatorially large search space.

  • Compositions can contain up to 50 atoms in total and up to 16 atoms per element. The limit of 16 is obtained by finding the lowest Wyckoff position multiplicity for each space group, and then taking the maximum of these values across all space groups. That is, 16 is the lowest value that still allows the Crystal-GFN to generate samples from all space groups while respecting their symmetry constraints.

  • Space groups are restricted to the 113 groups present in both the train and validation splits of the proxy model's data set.

  • The minimum and maximum lengths of the unit cell are 0.9 and 100 angstroms, respectively; the minimum and maximum angles are 50° and 150°, respectively. These values are set so as to include the bulk of the training and validation sets, excluding outliers.
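The restrictions above can be summarised as a simple validity check. This is a hypothetical helper for illustration, not the actual Crystal-GFN code, which enforces these constraints during sampling.

```python
ALLOWED_ELEMENTS = frozenset(
    {"H", "Li", "C", "N", "O", "F", "Mg", "Si", "P", "S", "Cl", "Fe"}
)

def valid_sample(composition, lengths, angles):
    """Check a candidate against the search-space restrictions listed above:
    up to 5 distinct elements from the 12-element subset, at most 50 atoms
    in total and 16 per element, lengths in [0.9, 100] angstroms and angles
    in [50, 150] degrees."""
    if not set(composition) <= ALLOWED_ELEMENTS or len(composition) > 5:
        return False
    if sum(composition.values()) > 50 or max(composition.values()) > 16:
        return False
    if not all(0.9 <= length <= 100.0 for length in lengths):
        return False
    return all(50.0 <= angle <= 150.0 for angle in angles)
```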

We train the Crystal-GFN by sampling 10 trajectories per iteration from the current forward policy. In order to encourage further exploration, 10 % of the steps in the trajectories are sampled at random from a uniform distribution. In total, we train for 50,000 iterations, which amounts to 500,000 queries to the proxy model. This took about 12 hours on a CPU-only machine.

The architecture of both the forward and the backward GFlowNet policy models is a 3-layer multi-layer perceptron with 256 units per layer. We trained with the Adam optimiser and a learning rate of 0.0001. As is common with the Trajectory Balance objective (Malkin et al., 2022), we set a higher learning rate (0.01) for the 16 learnable weights used to parameterise the partition function. In order to sample structures with lower (more negative) formation energies, we set a temperature $T = 8$ in the reward function defined in Section 4.4.
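The two-learning-rate setup can be sketched with optimiser parameter groups. The module shapes below are illustrative; only the learning rates and the 16 partition-function weights come from the text.

```python
import torch

# Hypothetical policy network: a 3-layer MLP with 256 units per layer
# (input/output dimensions here are placeholders)
policy = torch.nn.Sequential(
    torch.nn.Linear(64, 256), torch.nn.LeakyReLU(),
    torch.nn.Linear(256, 256), torch.nn.LeakyReLU(),
    torch.nn.Linear(256, 64),
)
# 16 learnable weights parameterising the partition function log Z
log_z = torch.nn.Parameter(torch.zeros(16))

# Base learning rate for the policy, a 100x higher one for log Z,
# as is common with the Trajectory Balance objective
optimizer = torch.optim.Adam([
    {"params": policy.parameters(), "lr": 1e-4},
    {"params": [log_z], "lr": 1e-2},
])
```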

The distribution used to sample increments in the lattice-parameter subspace is a mixture of 5 Beta distributions. In order to ensure numerical stability during training, we restrict the values of the coefficients of the Beta distributions to the range $[0.1, 100]$. One of the conditions that must be satisfied by generalised GFlowNets is that trajectories must have finite length. To this end, we set the minimum increment to 10% of the range of each dimension.
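A minimal, standard-library sketch of such a sampler is shown below. The actual model learns the mixture weights and concentration parameters; here they are fixed, and the minimum-increment rule is implemented (as one possible choice) by rescaling the Beta draw into [0.1, 1] of the normalised range.

```python
import random

def sample_increment(weights, alphas, betas, min_frac=0.1, rng=random):
    """Sample a lattice-parameter increment (normalised to [0, 1]) from a
    mixture of Beta distributions. Concentration parameters are clamped to
    [0.1, 100] for numerical stability, and the increment is rescaled so it
    is at least `min_frac` of the dimension's range, ensuring finite
    trajectories."""
    def clamp(v):
        return min(max(v, 0.1), 100.0)
    k = rng.choices(range(len(weights)), weights=weights)[0]
    u = rng.betavariate(clamp(alphas[k]), clamp(betas[k]))
    return min_frac + (1.0 - min_frac) * u

random.seed(0)
increments = [sample_increment([0.2] * 5, [2.0] * 5, [5.0] * 5) for _ in range(100)]
```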

Appendix C Proxy MLP

C.1  Architecture

For a given crystal structure, the input to the proxy MLP is the concatenation of: a physical embedding of the crystal's elements using PhAST (Duval et al., 2022), a learned embedding of its space group and the standardised lattice parameters. We stratify the MatBench data set into train, validation and test sets, matching their distributions of the target FE. The final reward function is a Boltzmann version of the proxy: $R(x) = \exp\big(-\frac{\text{MLP}(x)}{T}\big)$, where $T$ is a temperature hyper-parameter. This ensures that a lower FE yields a higher positive reward, and we can control our preference for lower energies through the temperature $T$.
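The Boltzmann reward above translates directly into code; the default temperature is the $T = 8$ used in our experiments (Appendix B).

```python
import math

def reward(predicted_fe, temperature=8.0):
    """Boltzmann reward R(x) = exp(-MLP(x) / T): a lower (more negative)
    predicted formation energy yields a higher positive reward, and T
    controls how strongly low energies are preferred."""
    return math.exp(-predicted_fe / temperature)
```

For example, a crystal with a predicted FE of -3.1 eV/atom receives a strictly higher reward than one at -1.0 eV/atom, and all rewards are positive.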

Overall, the architecture consists of concatenated representations for the composition ($h_C$), the space group ($h_{SG}$) and the lattice parameters ($h_{LP}$), which are used as input to a prediction MLP, $\hat{y} = \text{MLP}_{out}([h_C, h_{SG}, h_{LP}])$. In the following, we describe how each $h$ is obtained from the data:

  • $\mathbf{h_C}$: Each element $Z_i$ in the composition is embedded with PhAST (Duval et al., 2022) as the concatenation of (1) static physical properties $P_Z$ projected into a latent space with a small MLP and (2) learned embeddings, noted $E$, for the atomic number $Z$, period $P$ and group $G$ of the element:

    $h_C = \text{MLP}_C\big(\big[\text{MLP}_P(P_Z), E_Z, E_P, E_G\big]\big)$ (1)
  • $\mathbf{h_{SG}}$: Each space group is embedded via a lookup in a learnable embedding matrix.

  • $\mathbf{h_{LP}}$: Lattice parameters are first standardised element-wise using the training set statistics. The resulting 6-dimensional vector is then passed through a small MLP:

    $h_{LP} = \text{MLP}_{LP}\big(\frac{\text{LP}(x) - \mu_{train}(\text{LP})}{\sigma_{train}(\text{LP})}\big)$ (2)
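The overall forward pass can be sketched as follows. All shapes and the random-weight MLPs are illustrative stand-ins (the actual sizes are listed in Table 1, and the real model is trained); the sketch only shows how the three representations are built and concatenated.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(sizes):
    """Random-weight stand-in for a small MLP (illustrative only)."""
    Ws = [rng.standard_normal((m, n)) * 0.1 for m, n in zip(sizes[:-1], sizes[1:])]
    def forward(x):
        for W in Ws[:-1]:
            x = np.tanh(x @ W)
        return x @ Ws[-1]
    return forward

# Hypothetical input/embedding sizes: 12 elements, 113 space groups,
# 6 lattice parameters, 32-dimensional representations each
mlp_c, mlp_sg, mlp_lp = mlp([12, 32]), mlp([113, 32]), mlp([6, 32])
mlp_out = mlp([96, 64, 1])

def proxy_forward(comp_feat, sg_onehot, lattice_params, mu, sigma):
    """Sketch of the proxy: build h_C, h_SG and h_LP (standardising the
    lattice parameters with training-set statistics), concatenate them and
    predict the formation energy."""
    h_c = mlp_c(comp_feat)
    h_sg = mlp_sg(sg_onehot)
    h_lp = mlp_lp((lattice_params - mu) / sigma)
    return mlp_out(np.concatenate([h_c, h_sg, h_lp]))

y = proxy_forward(rng.random(12), rng.random(113), rng.random(6),
                  np.zeros(6), np.ones(6))
```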

C.2  Performance

Our proxy MLP model achieves an overall mean absolute error of 0.10 ± 0.005 eV/atom (95% confidence interval, computed by modelling the MAE as a log-normal distribution; this choice is further justified by Fig. 5). For reference, the best-performing methods on the MatBench leaderboard (https://matbench.materialsproject.org/Leaderboards%20Per-Task/matbench_v0.1_matbench_mp_e_form/) achieve up to 10× better performance. However, they all use the 3D atom positions of the crystals, and, remarkably, our approach outperforms some of them, indicating that the composition, space group and lattice parameters form a viable representation for the formation energy estimation task.

In Figures 4 and 5 we further detail the performance of our proxy MLP on the MatBench validation set. In particular, we observe that it maintains good performance for structures with FE up to ∼0.3 eV/atom. Beyond that level the MAE becomes high and unstable, but this is expected, as those are the regions of the target space with the least data available. Additionally, Figure 4 verifies that our stratification algorithm works as expected, yielding similar FE distributions in the train and validation splits.
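One common way to stratify a continuous target such as the FE is to bin its values by quantile and sample the validation set uniformly within each bin. The sketch below illustrates this idea under that assumption; the actual stratification code may differ.

```python
import numpy as np

def stratified_split(targets, val_frac=0.2, n_bins=10, seed=0):
    """Split indices into train/validation so that the two splits have
    similar target distributions: bin the continuous target by quantile,
    then draw `val_frac` of each bin into the validation set."""
    targets = np.asarray(targets)
    rng = np.random.default_rng(seed)
    edges = np.quantile(targets, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, targets, side="right") - 1,
                   0, n_bins - 1)
    is_val = np.zeros(len(targets), dtype=bool)
    for b in range(n_bins):
        idx = np.flatnonzero(bins == b)
        rng.shuffle(idx)
        is_val[idx[: int(round(val_frac * len(idx)))]] = True
    return is_val  # True -> validation, False -> train

# Synthetic FE values for illustration
fe = np.random.default_rng(1).normal(-1.5, 1.0, size=1000)
is_val = stratified_split(fe)
```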

Figure 4: Proxy MLP performance on the validation set and data split FE distributions. The average MAE is 0.10 eV/atom. We can also see the effect of the stratification algorithm, which yields similar FE distributions between the train and validation data set splits.
Figure 5: Distribution of FE values in the validation set and associated proxy MLP MAE, with 25%, 50% and 75% MAE quantiles.

C.3  Hyper-parameters

We present the hyper-parameters of our Proxy MLP in Table 1 along with a description of their role in the architecture.

Hyper-parameter | Value | Description
properties_proj_size | 64 | Projection size of atomic properties $h = \mathbf{W} p_Z$
group_emb_size | 16 | Embedding size of the element's group
period_emb_size | 256 | Embedding size of the element's period
z_emb_size | 128 | Embedding size of the element's atomic number
sg_emb_size | 128 | Crystal space group embedding size
lat_hidden_channels | 284 | Hidden channels of the layers processing the lattice parameters
lat_num_layers | 1 | Number of hidden layers of the lattice parameters MLP
num_layers | 5 | Number of layers of the final MLP processing the composition, space group and lattice parameter hidden representations
hidden_channels | 576 | Size of the final MLP hidden layers
optimiser | Adam | Optimiser
lr | 0.0017 | Learning rate
batch_size | 448 | Batch size
scheduler | ReduceLROnPlateau | Divide the learning rate by 2 when the validation MAE does not improve for 4 epochs
es_patience | 11 | Early stopping patience (in epochs)
epochs | 100 | Total epochs (effective after early stopping: 97)

Table 1: Hyper-parameters of the MLP proxy and their values used in our model.

Appendix D Additional results

In this section, we provide additional results demonstrating the diversity of the samples generated by the Crystal-GFN. They are displayed in Figures 6 to 10.

Figure 6: Distribution of total element prevalence per data set (10k Crystal-GFlowNet samples, MatBench val, MatBench train). Each element is counted as per its stoichiometry in the crystal.
Figure 7: Distribution of binary element prevalence per data set (10k Crystal-GFlowNet samples, MatBench val, MatBench train). Each element is counted only once per crystal.
Figure 8: Distribution of binarised element co-occurrences per data set (10k Crystal-GFN samples, MatBench val, MatBench train). A binarised co-occurrence is defined as element i𝑖iitalic_i and element j𝑗jitalic_j both being present in a crystal, regardless of their respective stoichiometry in that crystal.
Figure 9: Distribution of space groups per data set (10k Crystal-GFN samples, MatBench val, MatBench train)
Figure 10: Distribution of lattice parameters