
The generative quantum eigensolver (GQE) and its application for ground state search

Kouhei Nakaji, Lasse Bjørn Kristensen, Jorge A. Campos-Gonzalez-Angulo¹, Mohammad Ghazi Vakili¹, Haozhe Huang¹, Mohsen Bagherimehrab², Christoph Gorgulla², FuTe Wong, Alex McCaskey, Jin-Sung Kim, Thien Nguyen, Pooja Rao, Alan Aspuru-Guzik

¹ These authors contributed equally. ² These authors contributed equally.

Abstract

We introduce the generative quantum eigensolver (GQE), a novel method for applying classical generative models to quantum simulation. The GQE algorithm optimizes a classical generative model to produce quantum circuits with desired properties. Here, we develop a transformer-based implementation, which we name the generative pre-trained transformer-based (GPT) quantum eigensolver (GPT-QE), leveraging both pre-training on existing datasets and training without any prior knowledge. We demonstrate the effectiveness of training and pre-training GPT-QE in the search for ground states of electronic structure Hamiltonians. GQE strategies can extend beyond the problem of Hamiltonian simulation into other application areas of quantum computing.

1 Introduction

The field of quantum computing has experienced a remarkable surge, characterized by rapid advancements in the development of quantum devices. Notably, recent research reports the experimental realization of quantum computing with 48 logical qubits [1], marking the onset of the early fault-tolerant quantum computing regime. However, despite these advancements, this regime’s operational number of gates remains limited. Consequently, it is still unclear how these hardware leaps can be effectively translated into practical advantages in the coming decades.

A decade has passed since some of us introduced the variational quantum eigensolver (VQE) [2], which arguably marked a pivotal moment in the field of quantum computing. In VQE, a cost function is minimized by optimizing parameters embedded in a quantum circuit. The variational nature of the algorithm helps reduce the circuit depth so that the circuits can be implemented on near-term devices. Since its introduction, many quantum algorithms employing variational techniques (variational quantum algorithms, VQAs) have been proposed [3, 4]. However, it has been demonstrated that VQAs encounter several issues, particularly with regard to their trainability for large problem instances [5, 6]. This limitation hinders their competitiveness against classical computers when dealing with problems above a certain size. In this work, we aim to circumvent these shortcomings by constructing an approach that is orthogonal, rather than complementary, to VQAs.

Figure 1: Comparison between GQE and VQE.

During the same tumultuous decade, modern machine-learning techniques with deep neural networks have revolutionized numerous areas. In particular, there has been significant advancement in generative models for natural language processing. The advent of the Generative Pre-trained Transformer (GPT) [7] marks a milestone in the evolution of artificial intelligence. Forming the basis of Large Language Models (LLMs), GPT-like transformer models have demonstrated exceptional capabilities in understanding and generating human language. Through the simplicity and inherent efficiency of the attention mechanism [8], transformer models have demonstrated extraordinary performance across a wide array of tasks, showcasing their flexibility and expressivity in a variety of domains (e.g., [7, 8, 9, 10]). Recent achievements, highlighted by models like Chinchilla [11], demonstrate how scaling laws in machine learning can inform the efficient allocation of model size for optimized performance, hinting at even greater potential.

Given those significant achievements in classical generative models, incorporating them into quantum computing algorithms could be a pivotal step in overcoming the enduring challenges faced in practical quantum computing applications. Therefore, we propose the generative quantum eigensolver (GQE), which takes advantage of classical generative models for quantum circuits. Specifically, we employ a classical generative model, denoted as $p_{\vec{\theta}}(U)$ with $\vec{\theta}$ as parameters and $U$ as a unitary operator, to define the probability distribution for generating quantum circuits. In simpler terms, we sample quantum circuits according to this distribution. We train $p_{\vec{\theta}}(U)$ so that generated quantum circuits are likely to have desirable properties. We emphasize that, unlike in VQAs and their variants, no parameters are embedded in the quantum circuit in GQE; importantly for scalability, all the optimizable parameters are in the classical generative model (Fig. 1). We note that some previous works utilize a generative model for generating parameters in VQE [12, 13, 14], but in GQE, the whole circuit structure is determined by a generative model.

In designing the generative model for quantum circuits, we focus on the transformer architecture [8], which has achieved significant success as the backbone of large language models. We can describe GQE with a transformer by using an analogy between natural language documents and quantum circuits (Fig. 2). For a given operator pool, defined by a set of unitary operations $\{U_{j}\}_{j=1}^{L}$ (vocabulary), the transformer generates the sequence of indices $j_{1}\dots j_{N}$ corresponding to the unitary operations $U_{j_{1}}\dots U_{j_{N}}$ (words) and constructs the quantum circuit $U_{N}(\vec{j})=U_{j_{N}}\dots U_{j_{1}}$ (document). The rules for generating indices (grammar) are trained so that a cost value calculated by quantum devices decreases. GQE with a transformer can also be pre-trained. If we have a dataset given as pairs of index sequences and cost values, $\{\vec{j}_{m},C(\vec{j}_{m})\}_{m=1}^{M}$ (document dataset), we can pre-train the transformer without running quantum devices, as shown in Section 2. Hence, we name GQE with a transformer the generative pre-trained transformer-based quantum eigensolver (GPT-QE).

We expect three advantages in training with a transformer instead of a parameterized quantum circuit: ease of optimization, quantum resource efficiency, and customizability. We have been witnessing the impressive optimizability of deep neural networks (DNNs) [15, 16, 17], and in this work, all the developments in this field are readily applicable to quantum computing. By using the cost function landscape of DNNs, GPT-QE is potentially unaffected by the core optimization issues of several VQAs. Regarding quantum resource efficiency, we show in Section 2 that the number of quantum circuits run for each step of the optimization in GPT-QE depends explicitly neither on the number of parameters nor on the size of the operator pool, since it replaces quantum gradient evaluation with sampling and backpropagation. From this feature, we expect to significantly reduce the number of quantum circuit runs compared to conventional VQAs. Additionally, pre-training confines the use of quantum devices to dataset generation, as we note above. As for the customizability of our approach, we can append additional conditioned input (e.g., domain knowledge) to the transformer.

To demonstrate a proof of concept of the GPT-QE approach, which can be readily adapted to several families of VQAs, this paper focuses on the ground state search problem. We will explore these other applications of the method in future work. Accurate molecular electronic ground states have significant utility in applications as diverse as drug discovery [18, 19], materials science [20], and environmental solutions [21]. These ground states enable precise simulations of complex molecular structures, accelerating drug development by identifying effective candidates and properties. In materials science, they aid in designing tailored materials [22], from superconductors to catalysts [23]. Additionally, they address environmental challenges by improving energy solutions and enhancing our understanding of relevant chemical processes [24]. It must be acknowledged that ground state search by time-independent means belongs to the complexity class QMA, and it has become uncertain whether this task is a feasible candidate for finding quantum advantage [25]. However, the potential benefits of conceptual understanding and practical applications continue to drive research and development in this field [26].

We note that previous literature [27, 28, 29] proposes methods for training the structure of quantum circuits using machine learning, especially reinforcement learning. Reinforcement learning approaches tend to require a large number of intermediate quantum states in the circuit to determine each action (the next quantum gate to be generated), which leads to an increase in the number of required measurements as the number of gates increases. The Adaptive Derivative-Assembled Pseudo-Trotter ansatz VQE (ADAPT-VQE) [30] also provides a method to adaptively construct the ansatz structure. The method requires running VQE, hence many measurements, to determine the gates to be added as in the case of the reinforcement learning approaches. Conversely, GPT-QE does not require any intermediate measurements, thus potentially significantly reducing the measurement cost when running the algorithm.

The rest of the paper is organized as follows. In Section 2, we describe the details of GQE. Particularly, we construct GPT-QE and describe its training and pre-training scheme. Section 3 is dedicated to demonstrating the performance of the training and pre-training schemes by using the ground state search for the electronic structure Hamiltonians. In Section 4, we summarize what we know of the algorithm so far and suggest future directions of exploration.

2 Methods

2.1 Generative Quantum Eigensolver

The generative quantum eigensolver (GQE) is an algorithm to search for the ground state of a given Hamiltonian $\hat{H}$. In particular, we focus on the electronic structure problem, where the Hamiltonian is written as the weighted sum of the tensor products of Pauli operators $\hat{P}_{\ell}$: $\hat{H}=\sum_{\ell}h_{\ell}\hat{P}_{\ell}$. To construct the GQE approach, we first illustrate our formulation of the generative model of quantum circuits.

We prepare the operator pool $\mathcal{G}=\{U_{j}\}_{j=1}^{L}$, where $U_{j}$ is a unitary operator and $L$ is the size of the operator pool. One choice for the operator pool is a set of time evolution operators, $\{e^{i\hat{P}_{j}t_{j}}\}_{j=1}^{L}$, which we use in our numerical experiment. Given a sequence length $N$, we sample the sequence $\vec{j}=\{j_{1},\ldots,j_{N}\}$ according to the parameterized probability distribution $p_{N}(\vec{\theta},\vec{j})$, where $\vec{\theta}=\{\theta_{p}\}_{p=1}^{P}$ are optimizable parameters. Using the sequence $\vec{j}$, we construct the quantum circuit $U_{N}(\vec{j})=U_{j_{N}}\cdots U_{j_{1}}$. We call $p_{N}(\vec{\theta},\vec{j})$ the generative model of quantum circuits. In the rest of the paper, we omit the variable $\vec{\theta}$ for simplicity. The process of sampling the sequence $\vec{j}$ according to $p_{N}(\vec{j})$ and constructing the quantum circuit $U_{N}(\vec{j})$ is simply referred to as “sampling the quantum circuit $U_{N}(\vec{j})$ according to $p_{N}(\vec{j})$”.

We construct GQE to search for the ground state of $\hat{H}$ with the generative model of quantum circuits $p_{N}(\vec{j})$ . The objective of the problem we target is finding $\vec{j}^{\ast}:=\operatorname*{arg\,min}_{\vec{j}}E_{N}(\hat{H},\vec{j})$ with

$$E_{N}(\hat{H},\vec{j}):=\textrm{Tr}\left(\hat{H}U_{N}(\vec{j})\rho_{\textrm{0}}U_{N}(\vec{j})^{\dagger}\right),\tag{1}$$

where $\rho_{\textrm{0}}$ is a fixed initial quantum state. In the following, we omit the variable $\hat{H}$ depending on the context for simplicity.
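To make Eq. (1) concrete, the following is a minimal NumPy sketch of building $U_{N}(\vec{j})$ from a pool of Pauli time evolutions and evaluating the energy by dense matrix algebra. The two-qubit Hamiltonian, the pool entries, and the uniform random sampler are illustrative assumptions rather than the choices used in the paper, and the code uses 0-based token indices instead of $1,\ldots,L$.

```python
import numpy as np

# Single-qubit Pauli matrices.
PAULI = {
    "I": np.eye(2, dtype=complex),
    "X": np.array([[0, 1], [1, 0]], dtype=complex),
    "Y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "Z": np.array([[1, 0], [0, -1]], dtype=complex),
}

def pauli_string(label):
    """Tensor product of single-qubit Paulis, e.g. 'XY' -> X (x) Y."""
    out = np.array([[1.0 + 0j]])
    for ch in label:
        out = np.kron(out, PAULI[ch])
    return out

def pauli_evolution(label, t):
    """exp(i P t) = cos(t) I + i sin(t) P, valid because P squared is the identity."""
    P = pauli_string(label)
    return np.cos(t) * np.eye(P.shape[0]) + 1j * np.sin(t) * P

def energy(H, pool, seq, rho0):
    """E_N(H, j) = Tr(H U rho0 U^dagger) with U = U_{j_N} ... U_{j_1}."""
    U = np.eye(H.shape[0], dtype=complex)
    for j in seq:          # j_1 acts first, so later gates multiply from the left
        U = pool[j] @ U
    return float(np.real(np.trace(H @ U @ rho0 @ U.conj().T)))

# Hypothetical toy Hamiltonian and operator pool (illustrative values only).
H = 0.5 * pauli_string("ZI") + 0.5 * pauli_string("IZ") + 0.25 * pauli_string("XX")
pool = [pauli_evolution(lbl, t) for lbl in ("II", "XY", "YX") for t in (0.1, -0.1)]

psi0 = np.zeros(4, dtype=complex)
psi0[0] = 1.0                                  # fixed initial state |00>
rho0 = np.outer(psi0, psi0.conj())

rng = np.random.default_rng(0)
seq = rng.integers(0, len(pool), size=5)       # uniform sampling stands in for p_N
E = energy(H, pool, seq, rho0)
ground = np.linalg.eigvalsh(H).min()           # exact reference for this toy problem
print(f"E_N = {E:.4f}, ground state energy = {ground:.4f}")
```

On a quantum device the trace would be estimated from measurements of the Pauli terms instead of being computed exactly.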

We note that we need to select the operator pool $\mathcal{G}$ to be expressive enough, so that $E_{N}(\vec{j}^{\ast})$ is close enough to the ground state energy. It should also be noted that we can select $\mathcal{G}$ to accommodate the native operations and topology of each quantum device. We illustrate the specific choice for the operator pool in our numerical experiment. We train $p_{N}(\vec{j})$ so that the generated quantum circuit $U_{N}(\vec{j})$ is likely to produce a low energy quantum state. We call this approach to optimizing $p_{N}(\vec{j})$ the generative quantum eigensolver.

We now emphasize the difference between VQE and GQE. As shown in Fig. 1, in VQE, we embed parameters in the quantum circuit and optimize them to minimize the energy associated with the generated quantum state. In contrast, all parameters in GQE are embedded in the generative model $p_{N}(\vec{j})$. Consequently, the cost function landscapes in GQE and VQE are different; considering the success of training large models with DNNs [15, 16, 17], we expect that GQE can potentially address the issue of trainability in VQE by exploiting this different landscape, which now resides on the classical computer.

An advantage of the GQE approach is that we are free to choose the generative model $p_{N}(\vec{j})$ from a very rich set of model families stemming from the field of machine learning, such as autoencoders, generative adversarial networks, diffusion models, and flow models. This paper focuses on the case where the generative model is implemented by a transformer [8], which has achieved significant success as the cornerstone of large language models. In the following subsection, we describe the details of GQE implemented in this transformer setting.

2.2 GPT Quantum Eigensolver

We construct the specific GQE algorithm using the transformer architecture and provide its training scheme. As we will show later, the approach also involves pre-training; therefore, we call the method generative pre-trained transformer-based quantum eigensolver (GPT-QE). In the following, we describe how the transformer generates quantum circuits. Then, we construct the training/pre-training scheme of GPT-QE.

Quantum circuit generation in GPT-QE

The original transformer, introduced in [8], targets neural machine translation, where the model consists of an encoder for the input language and a decoder for the targeted language. In quantum circuit generation, we focus on the decoder-only transformer inspired by GPT-2 [7], developed for more general generative tasks. In the following, we refer to a decoder-only transformer simply as the transformer.

The sequence generation using the transformer can be written as the repetitions of (i) calculating the logit (logarithmic probability) with which each token is generated and (ii) sampling a token according to the corresponding probability distribution. We write the function for the probability calculation as GPT and the function for sampling as Sample.

The function GPT takes the variable-length inputs $\vec{j}^{(k)}=\{j_{1},\ldots,j_{k}\}$, where each element takes an integer value between $1$ and $L$ (the size of the operator pool $\mathcal{G}$). Then it outputs the sequence of logits $W^{(k)}=\{\vec{w}^{(1)},\ldots,\vec{w}^{(k)}\}$ with the same length, where each logit $\vec{w}^{(r)}$ is a real vector of size $L$. We note that GPT has optimizable parameters $\vec{\theta}$ included in the transformer’s architecture, which are not explicitly written in the notation GPT for simplicity.

The function GPT follows the methodology outlined in [7]. Here, we present a concise overview of GPT’s operation: Initially, GPT converts each token in the input sequence into a unique embedding, represented as a vector. These embeddings are transformed by means of multiple attention layers. Each attention layer takes a sequence of vectors as input, with the first layer’s input being the embeddings themselves. The output of each attention layer is also a sequence of vectors, maintaining the same sequence length and vector dimension as the input. This consistency allows the output of one layer to serve as the input for the subsequent layer. The final attention layer’s output is converted into a sequence of logits, constituting the output of GPT. Given an attention layer’s input $\{\vec{v}_{1},\cdots,\vec{v}_{k}\}$, the output can be expressed as $\{\vec{a}_{1},\cdots,\vec{a}_{k}\}$, where each $\vec{a}_{r}$ represents the attention with the same dimension as $\vec{v}_{r}$. The attention $\vec{a}_{r}$ encapsulates the relationship between $\vec{v}_{r}$ and other elements of the input. Notably, through the process of causal masking, $\vec{a}_{r}$ depends solely on the current and preceding elements, i.e., $\{\vec{v}_{r^{\prime}}\}_{r^{\prime}\leq r}$. The standard implementation of attention computation involves running multiple attention mechanisms in parallel, and their outputs are combined into $\{\vec{a}_{1},\cdots,\vec{a}_{k}\}$ through a weighted average (multi-head attention). For a more comprehensive explanation, see [7].
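As an illustration of causal masking, here is a minimal single-head attention sketch in NumPy. It is a simplified stand-in, not the GPT-2 layer itself: the projection matrices `Wq`, `Wk`, and `Wv` are hypothetical random weights, and layer normalization, residual connections, and multiple heads are omitted. In this sketch each output row uses only the current and earlier input rows.

```python
import numpy as np

def causal_attention(V, Wq, Wk, Wv):
    """Single-head scaled dot-product attention with a causal mask.

    V has shape (k, d). Row r of the result depends only on rows 0..r of V.
    """
    Q, K, vals = V @ Wq, V @ Wk, V @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    k = V.shape[0]
    future = np.triu(np.ones((k, k), dtype=bool), k=1)   # True above the diagonal
    scores = np.where(future, -np.inf, scores)           # hide future positions
    scores -= scores.max(axis=1, keepdims=True)          # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ vals

rng = np.random.default_rng(0)
V = rng.normal(size=(5, 4))                               # five input vectors of dimension 4
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))  # hypothetical weights
A = causal_attention(V, Wq, Wk, Wv)

# Changing the last input leaves every earlier output row unchanged.
V_changed = V.copy()
V_changed[-1] += 1.0
A_changed = causal_attention(V_changed, Wq, Wk, Wv)
print(np.allclose(A[:-1], A_changed[:-1]))
```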

The function Sample is a stochastic function that takes a logit $\vec{w}=\{w_{j}\}_{j=1}^{L}$ as its input and returns one of the tokens $j\in\{1,\ldots,L\}$ . The probability that the token $j$ is sampled is proportional to $e^{-\beta w_{j}}$ , with $\beta>0$ a hyper-parameter we can choose. It is customary to sample from the logits, which can be understood as each $j$ being sampled according to the energy $w_{j}$ and the inverse temperature $\beta$ as in statistical mechanics. For simplicity, we omit the variable $\beta$ from the function Sample.
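A minimal sketch of such a sampler, assuming NumPy and 0-based token indices (the text uses $1,\ldots,L$); the function name `sample` and the example logits are illustrative:

```python
import numpy as np

def sample(w, beta, rng):
    """Draw a token index with probability proportional to exp(-beta * w_j)."""
    z = -beta * np.asarray(w, dtype=float)
    z -= z.max()               # constant shift for stability; probabilities are unchanged
    p = np.exp(z)
    p /= p.sum()
    return int(rng.choice(len(p), p=p)), p

rng = np.random.default_rng(0)
logits = [0.3, -1.2, 0.5]      # illustrative values; the lowest logit is the most likely token
token, probs = sample(logits, beta=2.0, rng=rng)
_, probs_cold = sample(logits, beta=50.0, rng=rng)
print(probs.round(3), probs_cold.round(3))
```

A larger `beta` concentrates the distribution on the token with the smallest logit, which is the statistical-mechanics picture of lowering the temperature.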

We define the generative model of the quantum circuits in GPT-QE by the procedure to obtain a sequence $\vec{j}$ using GPT and Sample:

• In the first step, we sample the token $j_{1}$ with the fixed input $\{0\}$:

$$W^{(1)}=\texttt{GPT}(\{0\}),\quad\vec{w}^{(1)}=W^{(1)}_{1},\quad j_{1}=\texttt{Sample}(\vec{w}^{(1)}),$$

where $W^{(1)}_{1}$ denotes the first element of $W^{(1)}$.

• In the second step, we sample the token $j_{2}$ with the input $\{0,j_{1}\}$:

$$W^{(2)}=\texttt{GPT}(\{0,j_{1}\}),\quad\vec{w}^{(2)}=W^{(2)}_{2},\quad j_{2}=\texttt{Sample}(\vec{w}^{(2)}).$$

• In the $k$-th step, we sample the token $j_{k}$ with the input $\{0,j_{1},\ldots,j_{k-1}\}$:

$$W^{(k)}=\texttt{GPT}(\{0,j_{1},\ldots,j_{k-1}\}),\quad\vec{w}^{(k)}=W^{(k)}_{k},\quad j_{k}=\texttt{Sample}(\vec{w}^{(k)}).$$

After $N$ steps, we obtain a sequence of tokens $\vec{j}=\{j_{1},\ldots,j_{N}\}$ with length $N$ .
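The stepwise procedure above can be sketched as a short loop. The function `toy_gpt` below is a hypothetical stand-in for the transformer (a causal running mean of random embeddings), used only so that the example runs; the pool size, sequence length, and inverse temperature are illustrative. The loop also accumulates the logits of the chosen tokens, which is the quantity the training scheme later compares with the energy.

```python
import numpy as np

L, N, BETA = 6, 4, 1.0                    # pool size, sequence length, inverse temperature
_rng = np.random.default_rng(1)
EMB = _rng.normal(size=(L + 1, 8))        # row 0 embeds the fixed start token 0
OUT = _rng.normal(size=(8, L))

def toy_gpt(tokens):
    """Hypothetical stand-in for GPT: length-k input -> k logit vectors of size L.

    Row r uses only tokens[0..r] (a running mean), mimicking causal masking.
    """
    h = EMB[np.asarray(tokens)]
    h = np.cumsum(h, axis=0) / np.arange(1, len(tokens) + 1)[:, None]
    return h @ OUT

def sample(w, beta, rng):
    """Index drawn with probability proportional to exp(-beta * w_j)."""
    p = np.exp(-beta * (w - w.min()))
    p /= p.sum()
    return int(rng.choice(len(w), p=p))

def generate(n_tokens, beta, rng):
    tokens, w_sum = [0], 0.0
    for _ in range(n_tokens):
        W = toy_gpt(tokens)               # W^(k) for the input {0, j_1, ..., j_{k-1}}
        w = W[-1]                         # w^(k) = W^(k)_k, the last logit vector
        j = sample(w, beta, rng) + 1      # labels 1..L; 0 is reserved for the start token
        w_sum += w[j - 1]                 # accumulate the logit of the chosen token
        tokens.append(j)
    return tokens[1:], w_sum

seq, w_sum = generate(N, BETA, np.random.default_rng(2))
print(seq, round(w_sum, 4))
```

Because the stand-in is causal, the same accumulated logit can be recovered from a single forward pass over the full sequence, which is how a stored sequence can be scored without resampling.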

We can readily show that the probability that $\vec{j}$ is sampled is proportional to $\exp\left(-\beta w_{\textrm{sum}}(\vec{j})\right)$ , where

$$w_{\textrm{sum}}(\vec{j}):=\sum_{k=1}^{N}w_{j_{k}}^{(k)}.\tag{2}$$

Therefore, the generative model in GPT-QE is

$$p_{N}(\beta,\vec{j})=\frac{\exp\left(-\beta w_{\textrm{sum}}(\vec{j})\right)}{\mathcal{Z}},\tag{3}$$

where $\mathcal{Z}=\sum_{\vec{j}}\exp\left(-\beta w_{\textrm{sum}}(\vec{j})\right)$ and we write the hyper-parameter $\beta$ explicitly.

Training

To construct the training scheme for GPT-QE, let us consider the process of sampling $U_{N}(\vec{j})$ according to $p_{N}(\beta,\vec{j})$ and applying it to $\rho_{0}$. Let $\rho(\beta)$ be the quantum state generated by this stochastic process. With $\mathcal{E}_{N}(\vec{j},\rho):=U_{N}(\vec{j})\rho U_{N}(\vec{j})^{\dagger}$, it can be written as

$$\begin{split}\rho(\beta)&=\sum_{\vec{j}}p_{N}(\beta,\vec{j})\mathcal{E}_{N}(\vec{j},\rho_{0})\\ &=\frac{1}{\mathcal{Z}}\sum_{\vec{j}}\exp\left(-\beta w_{\textrm{sum}}(\vec{j})\right)\mathcal{E}_{N}(\vec{j},\rho_{0}).\end{split}\tag{4}$$

We observe that if $w_{\textrm{sum}}(\vec{j})=E_{N}(\vec{j})$ is satisfied, $\rho(\beta)$ gives a pseudo thermal state with the inverse temperature $\beta$ in the sense that the quantum state is generated according to the probability $\exp\left(-\beta E_{N}(\vec{j})\right)$. Therefore, increasing the value of $\beta$ creates a bias towards generating lower energy quantum states. We note that $\rho(\beta)$ is not, in general, the exact thermal state since the quantum state $\mathcal{E}_{N}(\vec{j},\rho_{0})$ cannot represent all the eigenstates in the whole Hilbert space when $N$ and $L$ are constrained.

From this observation, we design our scheme for the training/pre-training so that $w_{\textrm{sum}}(\vec{j})\simeq E_{N}(\vec{j})$ is satisfied. More specifically, in each iteration, we sample $\{\vec{j}_{m}\}_{m=1}^{M}$ , calculate the cost function

$$\begin{split}&C\left(\{w_{\textrm{sum}}(\vec{j}_{m})\}_{m=1}^{M},\{E_{N}(\vec{j}_{m})\}_{m=1}^{M}\right)=\frac{1}{M}\sum_{m=1}^{M}\tilde{C}\left(w_{\textrm{sum}}(\vec{j}_{m}),E_{N}(\vec{j}_{m})\right),\\ &\tilde{C}\left(w_{\textrm{sum}}(\vec{j}),E_{N}(\vec{j})\right):=\left(e^{-w_{\textrm{sum}}(\vec{j})}-e^{-E_{N}(\vec{j})}\right)^{2},\end{split}\tag{5}$$

and update the parameters in GPT by backpropagation. Since we match the sum of logits with the energy function, we call this technique logit-matching. We overview the training process in Fig. 3. It should be noted that $M$, which corresponds to the batch size in the context of machine learning, is a hyper-parameter we can choose, and it does not explicitly depend on $N$ or $L$. Therefore, in principle, we can freely choose the number of quantum circuit runs in each iteration. However, we must also be aware that a small $M$ leads to a significant statistical error in the cost function estimation. The best strategy for choosing $M$ should be studied in future work. It is also possible to optimize $M$ by using hyper-parameter optimization, which itself is an active area of exploration in the field of machine learning. We also note that the parameter $\beta$ can be used to control the trade-off between exploitation and exploration. In other words, by adjusting $\beta$, we can influence how the algorithm balances between intensively searching within a particular area (exploitation) and exploring new, potentially promising areas (exploration). A higher value of $\beta$ typically encourages more exploitation, focusing the search on areas already identified as having high potential. Conversely, a lower value of $\beta$ promotes exploration, allowing the algorithm to investigate a broader range of solutions that may lead to discovering better-performing options not yet explored. In our numerical experiment, we initially set $\beta$ to a small value and then gradually increase it over time.
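The logit-matching cost of Eq. (5) is simple to evaluate once the summed logits and the measured energies of a batch are available. The sketch below assumes NumPy; the function name `logit_matching_cost` and the sample values are illustrative, and in an actual training loop the same expression would be written in an automatic-differentiation framework so that gradients flow back to the transformer parameters.

```python
import numpy as np

def logit_matching_cost(w_sums, energies):
    """Mean over the batch of (exp(-w_sum) - exp(-E_N))^2, as in Eq. (5)."""
    w_sums = np.asarray(w_sums, dtype=float)
    energies = np.asarray(energies, dtype=float)
    return float(np.mean((np.exp(-w_sums) - np.exp(-energies)) ** 2))

# Illustrative batch of M = 3 sequences: summed logits and measured energies.
w_sums = [-1.00, -0.40, 0.20]
energies = [-1.10, -0.35, 0.20]
print(logit_matching_cost(w_sums, energies))
```

Because the cost compares exponentials, the same absolute mismatch is weighted more heavily at low energies than at high energies.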

Pre-Training

By applying the logit-matching technique, we can easily incorporate a pre-training scheme in GPT-QE (see Fig. 3). Suppose we have the dataset $\mathcal{D}=\{\vec{j}_{m},E_{N}(\vec{j}_{m})\}_{m=1}^{M_{D}}$ . In pre-training, we input each $\vec{j}_{m}$ to GPT and obtain $w_{\textrm{sum}}(\vec{j}_{m})$ . Then, we update parameters in GPT so that $w_{\textrm{sum}}(\vec{j}_{m})$ becomes close to $E_{N}(\vec{j}_{m})$ from the dataset by using the cost function (5). We note that we can split the dataset $\mathcal{D}$ into $B$ batches $\{\mathcal{D}_{b}\}_{b=1}^{B}$ and train GPT with each batch.
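A minimal sketch of batched pre-training on a stored dataset is given below. To stay self-contained it replaces the transformer with a hypothetical context-free lookup table of logits and uses a hand-written gradient instead of backpropagation; the dataset is synthetic, and the learning rate and sizes are arbitrary. Only the structure (score each stored sequence, compare with its stored energy through the cost (5), update per batch) mirrors the description above.

```python
import numpy as np

def w_sum(table, seq):
    """Toy logit model: w^(k)_j = table[k, j], ignoring context (an assumption for brevity)."""
    return sum(table[k, j] for k, j in enumerate(seq))

def batch_cost_and_grad(table, batch):
    """Cost (5) on one batch and its gradient with respect to the table entries."""
    grad = np.zeros_like(table)
    cost = 0.0
    for seq, e in batch:
        a, b = np.exp(-w_sum(table, seq)), np.exp(-e)
        cost += (a - b) ** 2
        for k, j in enumerate(seq):
            grad[k, j] += 2.0 * (a - b) * (-a)    # derivative of (e^{-w} - e^{-E})^2 in w
    return cost / len(batch), grad / len(batch)

rng = np.random.default_rng(0)
N, L, M_D, B = 3, 4, 64, 4                         # length, pool size, dataset size, batches
true_table = rng.normal(scale=0.3, size=(N, L))   # generates synthetic "energies"
seqs = rng.integers(0, L, size=(M_D, N))
dataset = [(tuple(s), w_sum(true_table, s)) for s in seqs]
batches = [dataset[b::B] for b in range(B)]       # split D into B batches

table = np.zeros((N, L))
initial = batch_cost_and_grad(table, dataset)[0]
for epoch in range(200):
    for batch in batches:
        _, g = batch_cost_and_grad(table, batch)
        table -= 0.2 * g                           # plain gradient-descent update
final = batch_cost_and_grad(table, dataset)[0]
print(f"cost before: {initial:.4f}, after: {final:.6f}")
```

No quantum evaluation appears anywhere in the loop, which is the point of the pre-training stage.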

The pre-training process is entirely classical, eliminating the need to use quantum devices as long as datasets are available. The question then arises: How do we obtain these datasets? Primarily, they can be sourced from previous GPT-QE training. Specifically, previous quantum evaluations performed as part of solving related tasks can be used to construct such a dataset. This opens up the possibility of a large-scale effort to simulate several Hamiltonians (e.g., molecules and materials) in the cloud using high-performance quantum simulators or on actual quantum devices, with the performance of GQE improving as the pre-training dataset becomes more comprehensive over time.

We propose three scenarios to leverage the pre-training scheme effectively: (i) model-to-model transfer, (ii) config-to-config transfer, and (iii) molecule-to-molecule transfer. Below, we describe these scenarios in detail.

(i) Model-to-model transfer scenario

In the model-to-model transfer scenario, we assume that the Hamiltonian used for data generation and the one we want to get the ground state of are the same. We first prepare a model represented as $\texttt{GPT}_{A}$ and train it by using the Hamiltonian $\hat{H}$ . While training $\texttt{GPT}_{A}$ , we obtain $\mathcal{D}=\{\vec{j}_{m},E_{N}(\vec{j}_{m})\}_{m=1}^{M}$ . The result of the training may be unsatisfactory, and we try another model, represented as $\texttt{GPT}_{B}$ . For example, $\texttt{GPT}_{B}$ may have more attention layers than $\texttt{GPT}_{A}$ . Then, the dataset $\mathcal{D}$ can be utilized for the pre-training of $\texttt{GPT}_{B}$ and we can then run the training algorithm with $\texttt{GPT}_{B}$ after the pre-training. It is also possible that the dataset $\mathcal{D}$ is used for the training of $\texttt{GPT}_{A}$ itself to obtain a better initialization (Experience replay).

Above, we describe the case where GPT-QE obtains the dataset. However, it is also possible that the dataset could be generated from other algorithms, e.g., tensor network calculations and VQE, if we can (approximately) convert the data obtained in those algorithms to a data format acceptable for GPT-QE.

(ii) Config-to-config transfer scenario

In the config-to-config transfer scenario, we use the data obtained while training on a Hamiltonian $\hat{H}$, corresponding to one spatial nuclear configuration, to find the ground state of another Hamiltonian $\hat{H}^{\prime}$ corresponding to a different nuclear configuration. We still assume that $\hat{H}$ and $\hat{H}^{\prime}$ are Hamiltonians of the same molecule, albeit at different geometries. In the context of VQA, similar approaches have already been proposed [31, 32]. Let us first assume that the Hamiltonian can be parameterized as follows:

$$\hat{H}(\vec{\Delta})=\sum_{a=1}^{N_{H}}h_{a}(\vec{\Delta})\hat{P}_{a},\tag{6}$$

where $\vec{\Delta}$ is a set of parameters corresponding to a configuration, $N_{H}$ is the number of terms in the Hamiltonian, each $h_{a}(\vec{\Delta})$ is a real-valued function of the parameters $\vec{\Delta}$ , and each $\hat{P}_{a}$ is a tensor product of Pauli operators. An example of the parameters in $\vec{\Delta}$ is the bond length; when we change the bond length, the set of $\{\hat{P}_{a}\}_{a=1}^{N_{H}}$ does not change, and only the coefficients in the Hamiltonian change. With this parameterization scheme, we can transfer the dataset obtained in the training of $\hat{H}(\vec{\Delta})$ to a dataset available for the pre-training with the Hamiltonian $\hat{H}(\vec{\Delta}^{\prime})$ .

We propose two methods to realize the config-to-config transfer, which can be effectively combined with each other. One method is the following coefficient re-weighting (see also [31]). In the training with $\hat{H}(\vec{\Delta})$ , we estimate $\{q_{a}(\vec{j})\}_{a=1}^{N_{H}}$ defined by $q_{a}(\vec{j}):=\textrm{Tr}\left(\hat{P}_{a}\mathcal{E}_{N}(\vec{j},\rho_{0})\right)$ and combine them to estimate the energy as

$$E_{N}\left(\vec{\Delta},\vec{j}\right)=\sum_{a=1}^{N_{H}}h_{a}(\vec{\Delta})q_{a}(\vec{j}),\tag{7}$$

where we simply write $E_{N}\left(\hat{H}(\vec{\Delta}),\vec{j}\right)$ as $E_{N}\left(\vec{\Delta},\vec{j}\right)$ .

The estimated values $\{q_{a}(\vec{j})\}_{a=1}^{N_{H}}$ can also be used to estimate $E_{N}(\vec{\Delta}^{\prime},\vec{j})$ for a different configuration, i.e., we can simply combine the measured Pauli expectation values using the different coefficients $h_{a}(\vec{\Delta}^{\prime})$. By this process, we obtain the dataset $\mathcal{D}=\left\{\vec{j}_{m},E_{N}\left(\vec{\Delta}^{\prime},\vec{j}_{m}\right)\right\}_{m=1}^{M}$, which can then be used for pre-training when searching for the ground state of the Hamiltonian $\hat{H}(\vec{\Delta}^{\prime})$.
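Coefficient re-weighting amounts to one matrix-vector product per configuration. In the sketch below, the Pauli expectation values `q` and the coefficient vectors are made-up illustrative numbers, not data from the paper; each row of `q` holds the $q_{a}(\vec{j}_{m})$ measured once for a sequence $\vec{j}_{m}$.

```python
import numpy as np

# Hypothetical measured Pauli expectations q_a(j_m): M = 3 sequences, N_H = 4 terms.
q = np.array([[1.0,  0.8, -0.2,  0.1],
              [1.0,  0.5,  0.3, -0.4],
              [1.0, -0.1,  0.9,  0.2]])

h_delta = np.array([-1.05, 0.40, -0.40, 0.18])        # h_a(Delta), illustrative
h_delta_prime = np.array([-0.95, 0.35, -0.35, 0.20])  # h_a(Delta'), illustrative

E_delta = q @ h_delta                # Eq. (7) at the configuration used for training
E_delta_prime = q @ h_delta_prime    # same measurements, new coefficients

# Re-weighted dataset of (sequence index, energy) pairs for pre-training at Delta'.
dataset_prime = list(zip(range(len(q)), E_delta_prime))
print(E_delta.round(3), E_delta_prime.round(3))
```

The second product reuses the stored expectation values, so building the dataset for the new configuration needs no additional circuit runs.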

Another method to achieve config-to-config transfer is adding extra inputs $\vec{\Delta}$ to the GPT function. More specifically, we extend GPT so that it takes the configuration $\vec{\Delta}$ as its input in addition to the variable inputs $\{j_{1},\cdots,j_{k}\}$ and outputs the logits $W^{(k)}$. Let us write the extended function as $\texttt{GPT}_{+}$.

By training $\texttt{GPT}_{+}$ with the different configurations $\vec{\Delta}_{1},\cdots,\vec{\Delta}_{R}$, we obtain the datasets $\mathcal{D}=\{\mathcal{D}_{r}\}_{r=1}^{R}$, where $\mathcal{D}_{r}:=\{\vec{\Delta}_{r},\vec{j}_{m_{r}},E_{N}(\vec{\Delta}_{r},\vec{j}_{m_{r}})\}_{m_{r}=1}^{M_{r}}$ and each $M_{r}$ is the number of sequences generated in the training with each configuration. Then, the dataset $\mathcal{D}$ can be used for the pre-training when we search for the ground state of $\hat{H}(\vec{\Delta}^{\prime})$ for a new configuration $\vec{\Delta}^{\prime}$. In the pre-training, $\vec{\Delta}_{r}$ is used as an input of $\texttt{GPT}_{+}$ together with $\vec{j}_{m_{r}}$, whereas $\vec{\Delta}^{\prime}$ is used as the input in the main training.

(iii) Molecule-to-molecule transfer scenario

In the molecule-to-molecule transfer scenario, we propose utilizing the dataset generated from the training of one molecule for the pre-training of another. Implementing this approach requires a significant extension of the model. At this stage, we focus on outlining the concept rather than delving into the specifics of the implementation. In this scenario, it is crucial to use the information of the molecule, such as the Hamiltonian, as an input, similar to our approach in the $\texttt{GPT}_{+}$ model for config-to-config transfer.

The primary challenge lies in uniformly treating molecules with varying numbers and types of molecular orbitals. For instance, there is no straightforward mapping between the molecular orbitals of $\texttt{H}_{2}$ and those of $\texttt{H}_{2}\texttt{O}$ as determined by Hartree-Fock calculations. Therefore, the gate sequence in the calculation of $\texttt{H}_{2}$ does not have an a priori correspondence with that in $\texttt{H}_{2}\texttt{O}$. Finding transferable representations of electronic structure information that are useful for machine learning applications is a vibrant research field on its own [33, 34]. One potential solution to this challenge in our setting is to write the molecular Hamiltonian in the basis of atomic orbitals instead of that of molecular orbitals. More specifically, we define the creation and annihilation operators for each atomic orbital and map them to qubits. Then, a gate sequence for $\texttt{H}_{2}$ has an analogous physical meaning when applied to the portion corresponding to hydrogen atoms in $\texttt{H}_{2}\texttt{O}$. This approach facilitates effective molecule-to-molecule transfer, particularly between molecules with the same atomic composition. Alternatively, one could devise a quantum circuit that couples molecular fragments encoded initially in independent qubits. In this manner, the optimized gates for the fragments can be invoked as a starting point for the full molecular circuit.

3 Results

In this section, we showcase the effectiveness of training and pre-training in GPT-QE for approximating ground states using electronic structure Hamiltonians. We use the molecular Hamiltonians of $\texttt{H}_{2}$, LiH, $\texttt{BeH}_{2}$, and $\texttt{N}_{2}$ in the STO-3G basis for this purpose.

The configuration of the GPT-QE model is as follows. Our operator pool is a set of Pauli time evolutions: $\mathcal{G}=\{e^{i\hat{P}_{j}t_{j}}\}_{j}$, where $\hat{P}_{j}$ represents a tensor product of Pauli operators, and $t_{j}$ is a real value. Defining $\mathcal{P}$ as the set of $\hat{P}_{j}$ and $\mathcal{T}$ as the set of $t_{j}$, the size of the operator pool is $|\mathcal{P}|\times|\mathcal{T}|$. For $\mathcal{P}$, we derive chemically inspired choices from the unitary coupled-cluster single and double excitations (UCCSD). Letting $T$ denote the sum of all fermionic excitation operators included in UCCSD, $\mathcal{P}$ is selected such that $e^{iP_{\ell}\theta_{\ell}}$ $(P_{\ell}\in\mathcal{P})$, with an angle $\theta_{\ell}$, is part of the decomposed operators when $e^{T-T^{\dagger}}$ is broken down into Pauli time evolutions by the Trotter decomposition [35]. The identity operator is also included in $\mathcal{P}$. For $\mathcal{T}$, we choose $\mathcal{T}=\left\{\pm 2^{k}/160\right\}_{k=1}^{4}$. Regarding the transformer model, we employ a configuration identical to that of GPT-2 [7], featuring 12 attention layers, 12 attention heads, and 768 embedding dimensions. The initial state, $\rho_{\textrm{0}}$, is set to the Hartree-Fock state. For numerical stability, we add an offset to the output of the quantum device. More specifically, when calculating the cost function (5), we substitute $E_{N}(\vec{j})+E_{\textrm{offset}}$ for the original $E_{N}(\vec{j})$. The value of $E_{\textrm{offset}}$ is chosen to be $0$, $7$, $14$, and $106$ for $\texttt{H}_{2}$, LiH, $\texttt{BeH}_{2}$, and $\texttt{N}_{2}$, respectively.
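The time set $\mathcal{T}$, the pool size $|\mathcal{P}|\times|\mathcal{T}|$, and the energy offsets can be laid out as in the following sketch. The Pauli labels in `P` are hypothetical placeholders, not the UCCSD-derived set used in the experiments, and `shifted_energy` is an illustrative helper name.

```python
from itertools import product

# Time-step set T = {+-2^k / 160 : k = 1, ..., 4} from the text.
T = sorted(sign * 2**k / 160 for sign in (1, -1) for k in range(1, 5))

# Hypothetical Pauli labels standing in for the UCCSD-derived set P (identity included).
P = ["IIII", "XYII", "YXII", "XXXY"]

# Each token indexes one (Pauli string, time) pair, i.e. the gate exp(i * P_j * t_j).
pool = list(product(P, T))

# Offsets quoted in the text, added to the measured energy before evaluating the cost (5).
E_OFFSET = {"H2": 0, "LiH": 7, "BeH2": 14, "N2": 106}

def shifted_energy(energy, molecule):
    """Return E_N + E_offset for the given molecule."""
    return energy + E_OFFSET[molecule]

print(T)
print(len(pool))
```

With eight time values, the vocabulary of the transformer grows by a factor of eight relative to the number of Pauli strings.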

In this study, we used CUDA Quantum [36] to execute the quantum chemistry experiments. CUDA Quantum is an open-source programming model and platform that integrates quantum processing units (QPUs), CPUs, and GPUs. This integration makes it well suited to workflows that require diverse computing capabilities, such as the GPT-QE application considered here. CUDA Quantum supports kernel-based programming and is compatible with both C++ and Python, which we employed in our research. The algorithm was executed using NVIDIA A100 GPUs on NERSC’s Perlmutter, an HPE Cray Shasta-based heterogeneous system with 1,792 GPU-accelerated nodes.

3.1 Training Performance

In Fig. 4, we present the results of GPT-QE training for each electronic structure Hamiltonian. The horizontal axis represents the bond length, while the vertical axis shows the energy value. The results of GPT-QE are depicted as green points (gpt-qe). The Hartree-Fock energy is indicated by a gray dotted line (hf), and the exact full configuration interaction energy, calculated through diagonalization of the full Hamiltonian, is represented by a black line (exact).

For $\texttt{H}_{2}$, we run 50 steps with $M=25$ samples per step. For the other molecules, we run 500 steps with $M=50$. The inverse temperature $\beta$ starts at $5$ and is increased by $0.1$ at each step. The number of tokens, i.e., the length of the output sequence, is fixed at 10 for $\texttt{H}_{2}$ and 40 for the other molecules. We conducted three trials of GPT-QE and recorded the minimum energy in each. The plotted points represent the best results from these trials.
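To make these settings concrete, the following sketch tabulates the inverse-temperature schedule and the number of circuits sampled per trial. It assumes that the first step uses $\beta=5$ and that the increment is applied before each subsequent step; the text does not specify this indexing, and the function names are ours:

```python
def beta_schedule(n_steps, beta_start=5.0, beta_increment=0.1):
    """Inverse temperature at each training step (step 0 uses beta_start)."""
    return [beta_start + beta_increment * s for s in range(n_steps)]


def circuits_per_trial(n_steps, samples_per_step):
    """Number of sequences (circuits) sampled in one training trial."""
    return n_steps * samples_per_step


# Settings quoted in the text: (steps, samples per step M, sequence length).
SETTINGS = {
    "H2": (50, 25, 10),
    "other molecules": (500, 50, 40),
}

for name, (steps, m, length) in SETTINGS.items():
    betas = beta_schedule(steps)
    print(name, circuits_per_trial(steps, m), length, round(betas[-1], 1))
```

One trial therefore evaluates $50\times 25=1{,}250$ circuits for $\texttt{H}_{2}$ and $500\times 50=25{,}000$ circuits for each of the other molecules.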

We observe that GPT-QE effectively identifies low-energy states that closely approximate the ground state. In Appendix A, we also compare the average performance of GPT-QE with the case in which quantum gates are randomly generated from the operator pool $\mathcal{G}$, and we verify the effectiveness of GPT-QE’s training scheme, which employs logit-matching to steer the transformer towards desired outcomes. Some of the results have errors larger than chemical accuracy. Such errors can be further reduced by choosing suitable hyper-parameters and by pre-training, as described in the next subsection. The accuracy can also be enhanced by applying VQE as a post-processing step.

For demonstration purposes, Fig. 5 displays the quantum circuit corresponding to the data point for a bond length of 2.0 Å, drawn using an open-source quantum circuit visualization tool [37]. To our knowledge, this is the first instance of a transformer-generated quantum circuit reported in a scientific publication.

3.2 The Effect of Pre-Training

We also conduct a pre-training experiment, focusing specifically on the config-to-config transfer scenario in $\texttt{N}_{2}$ . Let $\hat{H}(d)$ represent the $\texttt{N}_{2}$ Hamiltonian at bond length $d$ . $\hat{H}(d)$ can be expressed as a weighted sum of Pauli operators:

$$\hat{H}(d)=\sum_{a=1}^{N_{H}}h_{a}(d)\hat{P}_{a}, \tag{8}$$

where $h_{a}(d)$ denotes the weights. We convert the dataset generated during training with $\hat{H}(1.2)$ into a dataset for pre-training with $\hat{H}(1.4)$. The initial training for $\hat{H}(1.2)$ comprises 500 steps, and at each step, we sample 50 sequences. Consequently, from the training, we compile the dataset $\mathcal{D}^{(1.4)}_{\textrm{original}}:=\{\vec{j}_{m},E^{(1.4)}(\vec{j}_{m})\}_{m=1}^{M}$, where $E^{(d)}(\vec{j})$ estimates $\textrm{Tr}\left(\hat{H}(d)U(\vec{j})\rho_{0}U(\vec{j})^{\dagger}\right)$ and $M=500\times 50=25{,}000$. From $\mathcal{D}^{(1.4)}_{\textrm{original}}$, we omit relatively high-energy data with $E^{(1.4)}(\vec{j}_{m})>-107.45$ and construct the dataset $\mathcal{D}^{(1.4)}$, which contains $\sim 14{,}700$ data points.
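By linearity of the trace, Eq. (8) gives $E^{(d)}(\vec{j})=\sum_{a}h_{a}(d)\,\textrm{Tr}\left(\hat{P}_{a}U(\vec{j})\rho_{0}U(\vec{j})^{\dagger}\right)$, so the energy of a stored sequence can be re-evaluated classically for a new bond length, provided the Pauli expectation values were recorded. The sketch below illustrates this re-weighting followed by the energy cut. It assumes that such expectation values are available and that the same Pauli operators appear at both bond lengths; the function names and toy numbers are hypothetical:

```python
def transferred_energy(weights, pauli_expectations):
    """E^(d)(j) = sum_a h_a(d) * <P_a>_j, by linearity of the trace."""
    if len(weights) != len(pauli_expectations):
        raise ValueError("weights and expectation values must align")
    return sum(h * p for h, p in zip(weights, pauli_expectations))


def build_pretraining_dataset(records, target_weights, energy_threshold):
    """Re-evaluate stored sequences for the target Hamiltonian and keep
    only those whose energy does not exceed the threshold."""
    dataset = []
    for sequence, pauli_expectations in records:
        energy = transferred_energy(target_weights, pauli_expectations)
        if energy <= energy_threshold:
            dataset.append((sequence, energy))
    return dataset


# Toy example: two Pauli terms (the first is the identity, whose
# expectation value is always 1) and three stored token sequences.
records = [
    ((3, 7, 1), [1.0, 0.9]),
    ((2, 2, 5), [1.0, 0.1]),
    ((0, 4, 6), [1.0, -0.5]),
]
target_weights = [-1.0, -0.5]  # hypothetical h_a(d) at the target geometry
dataset = build_pretraining_dataset(records, target_weights, -1.0)
print(dataset)
```

In the experiment described above, the threshold is $-107.45$ and the cut reduces the $25{,}000$ stored sequences to $\sim 14{,}700$.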

In the pre-training phase, $\mathcal{D}^{(1.4)}$ is divided into 294 batches. The transformer is then pre-trained using each batch once. After this pre-training, the transformer undergoes further training for 500 steps. The hyper-parameters are set to be the same as in the experiment in Section 3.1.
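The batch count is consistent with a batch size equal to the 50 sequences sampled per training step, since $294\times 50=14{,}700$. The batch size itself is our inference and is not stated explicitly; under that assumption, the split can be sketched as:

```python
def make_batches(dataset, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [dataset[i:i + batch_size] for i in range(0, len(dataset), batch_size)]


# Placeholder dataset of the approximate size quoted in the text.
placeholder_dataset = list(range(14_700))
batches = make_batches(placeholder_dataset, batch_size=50)  # assumed batch size
print(len(batches))
```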

In Figure 6, we display training performance following pre-training. We conducted ten trials of the experiment. In each trial, the transformer’s parameters are randomly initialized with different seeds, but the same dataset $\mathcal{D}^{(1.4)}$ is used for pre-training. Let $E_{\min}(s)$ be the minimum energy found by step $s$ in each trial. The left figure presents the mean of $E_{\min}(s)$ and its standard deviation across the ten training runs. Let $E_{\text{avg}}(s)$ be the average of energies generated at step $s$ in each trial. The right figure illustrates the mean of $E_{\text{avg}}(s)$ and its standard deviation. In each figure, the result is depicted by a green circle. As a baseline, we also include results from training conducted without pre-training. Ten trials of this experiment are conducted, which include three trials from the experiment described in Section 3.1. In each figure, the benchmark result is represented by a gray triangle.
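The two plotted quantities can be computed from the per-step energies of each trial as in the following sketch. The nested-list data layout and the function names are our own choices, and the use of the sample standard deviation is an assumption, since the estimator is not specified:

```python
from statistics import mean, stdev


def running_minimum(step_energies):
    """E_min(s): lowest energy found up to and including step s."""
    best, out = float("inf"), []
    for energies in step_energies:
        best = min(best, min(energies))
        out.append(best)
    return out


def step_average(step_energies):
    """E_avg(s): mean of the energies sampled at step s."""
    return [mean(energies) for energies in step_energies]


def across_trials(curves):
    """Mean and sample standard deviation over trials, step by step."""
    return [(mean(vals), stdev(vals)) for vals in zip(*curves)]


# Toy data: 2 trials, 3 steps, 2 sampled energies per step.
trials = [
    [[-1.0, -0.5], [-0.8, -1.2], [-1.1, -0.9]],
    [[-0.7, -0.6], [-1.0, -0.9], [-1.3, -1.0]],
]
e_min = across_trials([running_minimum(t) for t in trials])
e_avg = across_trials([step_average(t) for t in trials])
print(e_min)
print(e_avg)
```

By construction, $E_{\min}(s)$ is non-increasing in $s$ within each trial, whereas $E_{\text{avg}}(s)$ can fluctuate from step to step.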

The left figure indicates that the pre-training helps to find a lower energy in this example. The minimum energy found at the final step is $\sim 0.01$ Hartree lower on average in the training after pre-training than without pre-training, which corresponds to a $\sim 25\%$ improvement in terms of the deviation from the exact energy ($\simeq-107.591$ Hartree). On the other hand, we observe that the average energy is almost unchanged by the pre-training, as shown in the right figure. This can be interpreted as the pre-training encouraging the search over a wider variety of sequences; the fluctuation of the energy value between steps 50 and 100 in the right figure may indicate this. This diversity in sequences aids in finding lower-energy configurations but also results in generating sequences with higher energy. Consequently, the average energy remains similar to that observed in training without pre-training.
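As a rough consistency check on these numbers (our own back-of-the-envelope estimate, assuming the percentage is taken relative to the deviation obtained without pre-training), let $\Delta_{\text{w/o}}$ and $\Delta_{\text{pre}}$ denote the deviations from the exact energy without and with pre-training:

$$\frac{\Delta_{\text{w/o}}-\Delta_{\text{pre}}}{\Delta_{\text{w/o}}}\approx 0.25,\qquad \Delta_{\text{w/o}}-\Delta_{\text{pre}}\approx 0.01\ \text{Hartree}\quad\Longrightarrow\quad \Delta_{\text{w/o}}\approx 0.04\ \text{Hartree},\quad \Delta_{\text{pre}}\approx 0.03\ \text{Hartree}.$$

On this estimate, both deviations remain well above chemical accuracy ($\approx 1.6\times 10^{-3}$ Hartree), in line with the remark in Section 3.1 that some results have errors larger than chemical accuracy.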

4 Conclusion and Discussion

We propose the GQE algorithm, a novel method that applies a generative model to obtain quantum circuits with desired properties. In particular, we introduce GPT-QE, which is based on the transformer architecture, and we design its training and pre-training schemes using the logit-matching technique. We also introduce and discuss scenarios for obtaining datasets usable in pre-training, as well as proposals for extending the architecture’s capabilities by including inputs. In our numerical experiment, we address the problem of searching for the ground state of the electronic structure Hamiltonians of several molecules: $\texttt{H}_{2}$, LiH, $\texttt{BeH}_{2}$, and $\texttt{N}_{2}$. We demonstrate that GPT-QE finds a quantum state with energy close to that of the ground state. The efficacy of the training scheme that utilizes logit-matching is confirmed by comparing its average performance with that of random circuit generation. Additionally, we validate the effectiveness of pre-training; our results show that, by utilizing pre-training, we can train the transformer successfully without operating a quantum device and significantly reduce the total number of quantum circuit runs.

Several research directions must be explored before GPT-QE models can progress beyond proof of concept. First, it is essential to validate the performance of GPT-QE on an actual quantum device and analyze its robustness to noise. Additionally, experiments with larger molecules are required to assess the optimization behavior. Another consideration is how to effectively integrate GPT-QE with the VQE framework. A straightforward approach might involve using VQE as a post-processing step, but there are many opportunities for more integrated hybridization. Among existing methods, the ADAPT-VQE strategy is particularly interesting for hybridization: it is known to achieve high accuracy [30], though it necessitates numerous quantum circuit runs. Combining ADAPT-VQE with GPT-QE is therefore another potential direction for future research.

Each component of GPT-QE is also open to updates and improvements. In our numerical experiment, we employ a chemically inspired operator pool, but, in principle, any kind of operator could be included. The transformer’s design could be modified to generate both a sequence of tokens and the parameters embedded in quantum gates. Such an extension would facilitate easier hybridization with VQE. To fully leverage the pre-training feature, exploring how to design suitable inputs for the transformer is crucial.

As mentioned earlier in the paper, applying and extending the GQE framework to problems beyond ground-state approximation is also feasible. For instance, if GQE can accept classical data inputs, it could be applied to supervised machine-learning problems. The critical question is determining which types of machine learning problems are best suited for the GQE framework. This inquiry requires careful study and identification of suitable problems.

The significance of pre-training underscores the need for data storage and sharing as common resources. We can take inspiration from the machine learning community, which shares multiple datasets for training and testing purposes. Through such efforts, we anticipate that practical quantum applications using generative models will soon be achievable.

We hope that, a decade after some of us introduced the VQE, the community will embrace GQE as a parallel track and work together on the many extensions briefly proposed above, helping to achieve the goal of near-term quantum utility.

Acknowledgements

This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231 using NERSC award NERSC DDR-ERCAP0027330. K.N. acknowledges the support of Grant-in-Aid for JSPS Research Fellow 22J01501. L.B.K. acknowledges support from the Carlsberg Foundation. A.A.-G. acknowledges support from the Canada 150 Research Chairs program and CIFAR as well as the generous support of Anders G. Frøseth. We are deeply grateful to the Defense Advanced Research Projects Agency (DARPA) for their generous support and funding of this project, under the grant number HR0011-23-3-0020.

References

[1] D. Bluvstein, S. J. Evered, A. A. Geim, S. H. Li, H. Zhou, T. Manovitz, S. Ebadi, M. Cain, M. Kalinowski, D. Hangleiter, et al., “Logical quantum processor based on reconfigurable atom arrays,” Nature, pp. 1–3, 2023.
[2] A. Peruzzo, J. McClean, P. Shadbolt, M.-H. Yung, X.-Q. Zhou, P. J. Love, A. Aspuru-Guzik, and J. L. O’Brien, “A variational eigenvalue solver on a photonic quantum processor,” Nature Communications, vol. 5, p. 4213, July 2014.
[3] M. Cerezo, A. Arrasmith, R. Babbush, S. C. Benjamin, S. Endo, K. Fujii, J. R. McClean, K. Mitarai, X. Yuan, L. Cincio, and P. J. Coles, “Variational quantum algorithms,” Nature Reviews Physics, vol. 3, pp. 625–644, Sept. 2021.
[4] K. Bharti, A. Cervera-Lierta, T. H. Kyaw, T. Haug, S. Alperin-Lea, A. Anand, M. Degroote, H. Heimonen, J. S. Kottmann, T. Menke, et al., “Noisy intermediate-scale quantum algorithms,” Reviews of Modern Physics, vol. 94, no. 1, p. 015004, 2022.
[5] A. B. Magann, S. E. Economou, and C. Arenz, “Randomized adaptive quantum state preparation,” Jan. 2023. arXiv:2301.04201 [quant-ph].
[6] S. Wang, E. Fontana, M. Cerezo, K. Sharma, A. Sone, L. Cincio, and P. J. Coles, “Noise-induced barren plateaus in variational quantum algorithms,” Nature Communications, vol. 12, no. 1, p. 6961, 2021.
[7] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language Models are Unsupervised Multitask Learners,” 2019.
[8] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is All you Need,” in Advances in Neural Information Processing Systems (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds.), vol. 30, Curran Associates, Inc., 2017.
[9] H. Zhao, L. Jiang, J. Jia, P. H. S. Torr, and V. Koltun, “Point Transformer,”
[10] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” Oct. 2020.
[11] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. v. d. Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, J. W. Rae, O. Vinyals, and L. Sifre, “Training Compute-Optimal Large Language Models,” Mar. 2022. arXiv:2203.15556 [cs].
[12] G. Verdon, M. Broughton, J. R. McClean, K. J. Sung, R. Babbush, Z. Jiang, H. Neven, and M. Mohseni, “Learning to learn with quantum neural networks via classical neural networks,” arXiv preprint arXiv:1907.05415, 2019.
[13] D. Kim and E.-G. Moon, “Preparation of entangled many-body states with machine learning,” arXiv preprint arXiv:2307.14627, 2023.
[14] Y. Yang, Z. Zhang, A. Wang, X. Xu, X. Wang, and Y. Li, “Maximising quantum-computing expressive power through randomised circuits,” arXiv preprint arXiv:2312.01947, 2023.
[15] T. Garipov, P. Izmailov, D. Podoprikhin, D. P. Vetrov, and A. G. Wilson, “Loss surfaces, mode connectivity, and fast ensembling of DNNs,” Advances in Neural Information Processing Systems, vol. 31, 2018.
[16] Z. Allen-Zhu, Y. Li, and Z. Song, “A convergence theory for deep learning via over-parameterization,” in International Conference on Machine Learning, pp. 242–252, PMLR, 2019.
[17] Y. Zhou, J. Yang, H. Zhang, Y. Liang, and V. Tarokh, “SGD converges to global minimum in deep learning via star-convex path,” arXiv preprint arXiv:1901.00451, 2019.
[18] A. Heifetz, ed., Quantum Mechanics in Drug Discovery, vol. 2114 of Methods in Molecular Biology. New York, NY: Springer US, 2020.
[19] Y.-h. Lam, Y. Abramov, R. S. Ananthula, J. M. Elward, L. R. Hilden, S. O. Nilsson Lill, P.-O. Norrby, A. Ramirez, E. C. Sherer, J. Mustakis, and G. J. Tanoury, “Applications of Quantum Chemistry in Pharmaceutical Process Development: Current State and Opportunities,” Organic Process Research & Development, vol. 24, pp. 1496–1507, Aug. 2020.
[20] G. Agarwal, H. A. Doan, L. A. Robertson, L. Zhang, and R. S. Assary, “Discovery of Energy Storage Molecular Materials Using Quantum Chemistry-Guided Multiobjective Bayesian Optimization,” Chemistry of Materials, vol. 33, pp. 8133–8144, Oct. 2021.
[21] V. Zaytsev, M. Groshev, I. Maltsev, A. Durova, and V. Shabaev, “Calculation of the moscovium ground-state energy by quantum algorithms,” arXiv preprint arXiv:2207.08255, 2022.
[22] B. T. Gard, L. Zhu, G. S. Barron, N. J. Mayhall, S. E. Economou, and E. Barnes, “Efficient symmetry-preserving state preparation circuits for the variational quantum eigensolver algorithm,” npj Quantum Information, vol. 6, no. 1, p. 10, 2020.
[23] K. M. Pelzer, L. Cheng, and L. A. Curtiss, “Effects of Functional Groups in Redox-Active Organic Molecules: A High-Throughput Screening Approach,” The Journal of Physical Chemistry C, vol. 121, pp. 237–245, Jan. 2017.
[24] C. d. l. Cruz, A. Molina, N. Patil, E. Ventosa, R. Marcilla, and A. Mavrandonakis, “New insights into phenazine-based organic redox flow batteries by using high-throughput DFT modelling,” Sustainable Energy & Fuels, vol. 4, pp. 5513–5521, Oct. 2020.
[25] S. Lee, J. Lee, H. Zhai, Y. Tong, A. M. Dalzell, A. Kumar, P. Helms, J. Gray, Z.-H. Cui, W. Liu, M. Kastoryano, R. Babbush, J. Preskill, D. R. Reichman, E. T. Campbell, E. F. Valeev, L. Lin, and G. K.-L. Chan, “Evaluating the evidence for exponential quantum advantage in ground-state quantum chemistry,” Nature Communications, vol. 14, p. 1952, Apr. 2023.
[26] J. Ceroni, T. F. Stetina, M. Kieferova, C. O. Marrero, J. M. Arrazola, and N. Wiebe, “Generating Approximate Ground States of Molecules Using Quantum Machine Learning,” Jan. 2023. arXiv:2210.05489 [quant-ph].
[27] Z. Liang, J. Cheng, R. Yang, H. Ren, Z. Song, D. Wu, X. Qian, T. Li, and Y. Shi, “Unleashing the potential of LLMs for quantum computing: A study in quantum architecture design,” arXiv preprint arXiv:2307.08191, 2023.
[28] M. Krenn, J. Landgraf, T. Foesel, and F. Marquardt, “Artificial intelligence and machine learning for quantum technologies,” Physical Review A, vol. 107, no. 1, p. 010101, 2023.
[29] T. Jaouni, S. Arlt, C. Ruiz-Gonzalez, E. Karimi, X. Gu, and M. Krenn, “Deep quantum graph dreaming: Deciphering neural network insights into quantum experiments,” arXiv preprint arXiv:2309.07056, 2023.
[30] H. R. Grimsley, S. E. Economou, E. Barnes, and N. J. Mayhall, “An adaptive variational algorithm for exact molecular simulations on a quantum computer,” Nature Communications, vol. 10, no. 1, p. 3007, 2019.
[31] C. N. Self, K. E. Khosla, A. W. R. Smith, F. Sauvage, P. D. Haynes, J. Knolle, F. Mintert, and M. S. Kim, “Variational quantum algorithm with information sharing,” npj Quantum Information, vol. 7, pp. 1–7, July 2021.
[32] A. Cervera-Lierta, J. S. Kottmann, and A. Aspuru-Guzik, “Meta-variational quantum eigensolver: Learning energy profiles of parameterized Hamiltonians for quantum simulation,” PRX Quantum, vol. 2, no. 2, p. 020329, 2021.
[33] Z. Qiao, A. S. Christensen, M. Welborn, F. R. Manby, A. Anandkumar, and T. F. Miller, “Informing geometric deep learning with electronic interactions to accelerate quantum chemistry,” Proceedings of the National Academy of Sciences, vol. 119, p. e2205221119, Aug. 2022.
[34] D. Khan, S. Heinen, and O. A. von Lilienfeld, “Kernel based quantum machine learning at record rate: Many-body distribution functionals as compact representations,” The Journal of Chemical Physics, vol. 159, p. 034106, July 2023.
[35] H. F. Trotter, “On the product of semi-groups of operators,” Proceedings of the American Mathematical Society, vol. 10, no. 4, pp. 545–551, 1959.
[36] The CUDA Quantum development team, “CUDA Quantum: An open-source programming model for heterogeneous quantum-classical workflows,” 2023. Version 0.5.0.
[37] Qiskit contributors, “Qiskit: An open-source framework for quantum computing,” 2023.

Appendix A The benchmark experiment

In Section 3.1, we show that GPT-QE successfully finds a low-energy state. However, to verify that the training works correctly, we need to compare the performance with the case where no training is involved. For this purpose, we perform a benchmark experiment that randomly samples 10 (for $\texttt{H}_{2}$) or 40 (for the other molecules) tokens from $\mathcal{G}$ and constructs quantum circuits. In each trial, we generate the same number of sequences as those generated in GPT-QE. We conduct three trials.
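The random benchmark can be sketched as follows. The sketch assumes that tokens are drawn uniformly and with replacement, which the description above does not state explicitly. The `energy_fn` argument is a hypothetical stand-in for the energy estimate returned by the quantum device or simulator, and `toy_energy` is a placeholder with no physical meaning:

```python
import random


def random_baseline(pool_size, seq_length, n_sequences, energy_fn, seed=0):
    """Sample token sequences uniformly from the pool and track the minimum.

    energy_fn is a stand-in for the quantum-device (or simulator) energy
    estimate; it is a hypothetical callable, not part of any library.
    """
    rng = random.Random(seed)
    best_energy, best_sequence = float("inf"), None
    for _ in range(n_sequences):
        sequence = [rng.randrange(pool_size) for _ in range(seq_length)]
        energy = energy_fn(sequence)
        if energy < best_energy:
            best_energy, best_sequence = energy, sequence
    return best_energy, best_sequence


def toy_energy(sequence):
    """Placeholder cost: the number of distinct tokens in the sequence."""
    return float(len(set(sequence)))


# H2-like setting from the text: 10 tokens per sequence and 50 x 25
# sequences per trial. The pool size of 32 is an arbitrary example.
best_energy, best_sequence = random_baseline(
    pool_size=32, seq_length=10, n_sequences=50 * 25, energy_fn=toy_energy
)
print(best_energy, best_sequence)
```

Repeating the call with different seeds gives the independent trials whose minimum energies are then averaged.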

In Figure 7, we provide the mean of the minimum energy found in each trial and its standard deviation for each bond length in each molecule. The benchmark result is shown by the gray cross mark, and the GPT-QE result is shown with the same symbol as in Fig. 4. The comparison confirms that the training scheme guides the transformer toward lower energies, most clearly for the $\texttt{N}_{2}$ molecule, which supports the validity of the logit-matching technique.