License: arXiv.org perpetual non-exclusive license
arXiv:2401.09253v1 [quant-ph] 17 Jan 2024

The generative quantum eigensolver (GQE) and its application for ground state search

Kouhei Nakaji, Lasse Bjørn Kristensen, Jorge A. Campos-Gonzalez-Angulo*, Mohammad Ghazi Vakili*, Haozhe Huang*, Mohsen Bagherimehrab†, Christoph Gorgulla†, FuTe Wong, Alex McCaskey, Jin-Sung Kim, Thien Nguyen, Pooja Rao, Alan Aspuru-Guzik

* These authors contributed equally.
† These authors contributed equally.

Abstract

We introduce the generative quantum eigensolver (GQE), a novel method for applying classical generative models for quantum simulation. The GQE algorithm optimizes a classical generative model to produce quantum circuits with desired properties. Here, we develop a transformer-based implementation, which we name the generative pre-trained transformer-based (GPT) quantum eigensolver (GPT-QE), leveraging both pre-training on existing datasets and training without any prior knowledge. We demonstrate the effectiveness of training and pre-training GPT-QE in the search for ground states of electronic structure Hamiltonians. GQE strategies can extend beyond the problem of Hamiltonian simulation into other application areas of quantum computing.

1 Introduction

The field of quantum computing has experienced a remarkable surge, characterized by rapid advancements in the development of quantum devices. Notably, recent research reports the experimental realization of quantum computing with 48 logical qubits [1], marking the onset of the early fault-tolerant quantum computing regime. However, despite these advancements, the number of gates that can be executed in this regime remains limited. Consequently, it is still unclear how these hardware leaps can be effectively translated into practical advantages in the coming decades.

A decade has passed since some of us introduced the variational quantum eigensolver (VQE) [2], which arguably marked a pivotal moment in the field of quantum computing. In VQE, a cost function is minimized by optimizing parameters embedded in a quantum circuit. The variational nature of the algorithm helps reduce the circuit depth so that the circuits can be implemented on near-term devices. Since its introduction, many quantum algorithms employing variational techniques (variational quantum algorithms: VQA) have been proposed [3, 4]. However, it has been demonstrated that VQAs encounter several issues, particularly with regard to their trainability for large problem instances [5, 6]. This limitation hinders their competitiveness against classical computers when dealing with problems above a certain size. In this work, we aim to circumvent these shortcomings by constructing an orthogonal, rather than complementary, approach to VQAs.

Figure 1: Comparison between GQE and VQE.

During the same tumultuous decade, modern machine-learning techniques with deep neural networks have revolutionized numerous areas. In particular, there has been significant advancement in generative models for natural language processing. The advent of the Generative Pre-trained Transformer (GPT) [7] marks a milestone in the evolution of artificial intelligence. Forming the basis of Large Language Models (LLMs), GPT-like transformer models have demonstrated exceptional capabilities in understanding and generating human language. Through the simplicity and inherent efficiency of the attention mechanism [8], transformer models have demonstrated extraordinary performance across a wide array of tasks, showcasing their flexibility and expressivity in a variety of domains (e.g., [7, 8, 9, 10]). Recent achievements, highlighted by models like Chinchilla [11], demonstrate how scaling laws in machine learning can inform the efficient allocation of model size for optimized performance, hinting at even greater potential.

Figure 2: Quantum circuit generation in GPT-QE (GQE, which employs a transformer) is explored. We also show the analogy between document generation in Large Language Models (LLMs) and quantum circuit generation in GPT-QE. The details of quantum circuit generation are described in Section 2.2.

Given those significant achievements in classical generative models, incorporating them into quantum computing algorithms could be a pivotal step in overcoming the enduring challenges faced in practical quantum computing applications. Therefore, we propose the generative quantum eigensolver (GQE), which takes advantage of classical generative models for quantum circuits. Specifically, we employ a classical generative model, denoted as $p_{\vec{\theta}}(U)$ with $\vec{\theta}$ as parameters and $U$ as a unitary operator, to define the probability distribution for generating quantum circuits. In simpler terms, we sample quantum circuits according to this distribution. We train $p_{\vec{\theta}}(U)$ so that generated quantum circuits are likely to have desirable properties. We emphasize that, unlike VQA and its variants, no parameters are embedded in the quantum circuit in GQE; importantly for scalability, all the optimizable parameters are in the classical generative model (Fig. 1). We note that some previous works utilize a generative model for generating parameters in VQE [12, 13, 14], but in GQE, the whole circuit structure is determined by a generative model.

In designing the generative model for quantum circuits, we focus on the transformer architecture [8], which has achieved significant success as the backbone of large language models. We can describe GQE with a transformer by using an analogy between natural language documents and quantum circuits (Fig. 2). For a given operator pool, defined by a set of unitary operations $\{U_j\}_{j=1}^{L}$ (vocabulary), the transformer generates the sequence of indices $j_1 \dots j_N$ corresponding to the unitary operations $U_{j_1} \dots U_{j_N}$ (words) and constructs the quantum circuit $U_N(\vec{j}) = U_{j_N} \dots U_{j_1}$ (document). The rules for generating indices (grammar) are trained so that a cost value calculated by quantum devices decreases. GQE with a transformer can also be pre-trained. If we have a dataset given as pairs of index sequences and cost values, $\{\vec{j}_m, C(\vec{j}_m)\}_{m=1}^{M}$ (document dataset), we can pre-train the transformer without running quantum devices, as shown in Section 2. Hence, we name GQE with a transformer the generative pre-trained transformer-based quantum eigensolver (GPT-QE).

We expect three advantages in training with a transformer instead of a parameterized quantum circuit: ease of optimization, quantum resource efficiency, and customizability. We have been witnessing the impressive optimizability of deep neural networks (DNN) [15, 16, 17], and in this work, all the developments in this field are readily applicable to quantum computing. By using the cost function landscape of DNNs, GPT-QE is potentially unaffected by the core optimization issues of several VQAs. Regarding quantum resource efficiency, we show in Section 2 that the number of quantum circuits run for each step of the optimization in GPT-QE does not explicitly depend on either the number of parameters or the size of the operator pool, since it replaces quantum gradient evaluation with sampling and backpropagation. From this feature, we expect to significantly reduce the number of quantum circuit runs compared to conventional VQAs. Additionally, pre-training constrains the process of running quantum devices for dataset generation, as we note above. As for the customizability of our approach, we can append additional conditioned input (e.g., domain knowledge) to the transformer.

To demonstrate a proof of concept of the GPT-QE approach, which can be readily adapted to several families of VQAs, this paper focuses on the ground state search problem. We will explore these other applications of the method in future work. Accurate molecular electronic ground states have significant utility in applications as diverse as drug discovery [18, 19], materials science [20], and environmental solutions [21]. These ground states enable precise simulations of complex molecular structures, accelerating drug development by identifying effective candidates and properties. In materials science, they aid in designing tailored materials [22], from superconductors to catalysts [23]. Additionally, they address environmental challenges by improving energy solutions and enhancing our understanding of relevant chemical processes [24]. It must be acknowledged that ground state search by time-independent means belongs to the complexity class QMA, and it has become uncertain whether this task is a feasible candidate for finding quantum advantage [25]. However, the potential benefits of conceptual understanding and practical applications continue to drive research and development in this field [26].

We note that previous literature [27, 28, 29] proposes methods for training the structure of quantum circuits using machine learning, especially reinforcement learning. Reinforcement learning approaches tend to require a large number of intermediate quantum states in the circuit to determine each action (the next quantum gate to be generated), which leads to an increase in the number of required measurements as the number of gates increases. The Adaptive Derivative-Assembled Pseudo-Trotter ansatz VQE (ADAPT-VQE) [30] also provides a method to adaptively construct the ansatz structure. The method requires running VQE, hence many measurements, to determine the gates to be added as in the case of the reinforcement learning approaches. Conversely, GPT-QE does not require any intermediate measurements, thus potentially significantly reducing the measurement cost when running the algorithm.

The rest of the paper is organized as follows. In Section 2, we describe the details of GQE. Particularly, we construct GPT-QE and describe its training and pre-training scheme. Section 3 is dedicated to demonstrating the performance of the training and pre-training schemes by using the ground state search for the electronic structure Hamiltonians. In Section 4, we summarize what we know of the algorithm so far and suggest future directions of exploration.

2 Methods

2.1 Generative Quantum Eigensolver

The generative quantum eigensolver (GQE) is an algorithm to search for the ground state of a given Hamiltonian $\hat{H}$. In particular, we focus on the electronic structure problem, where the Hamiltonian is written as the weighted sum of the tensor products of Pauli operators $\hat{P}_{\ell}$: $\hat{H} = \sum_{\ell} h_{\ell} \hat{P}_{\ell}$. To construct the approach of the GQE, we first illustrate our formulation of the generative model of quantum circuits.

We prepare the operator pool $\mathcal{G} = \{U_j\}_{j=1}^{L}$, where $U_j$ is a unitary operator and $L$ is the size of the operator pool. One of the choices for the operator pool is a set of time evolution operators: $\{e^{i \hat{P}_j t_j}\}_{j=1}^{L}$, which we use in our numerical experiment. Given a sequence length $N$, we sample the sequence $\vec{j} = \{j_1, \ldots, j_N\}$ according to the parameterized probability distribution $p_N(\vec{\theta}, \vec{j})$, where $\vec{\theta} = \{\theta_p\}_{p=1}^{P}$ are optimizable parameters. Using the sequence $\vec{j}$, we construct the quantum circuit $U_N(\vec{j}) = U_{j_N} \cdots U_{j_1}$. We call $p_N(\vec{\theta}, \vec{j})$ the generative model of quantum circuits. In the rest of the paper, we omit the variable $\vec{\theta}$ for simplicity. The process of sampling the sequence $\vec{j}$ according to $p_N(\vec{j})$ and constructing the quantum circuit $U_N(\vec{j})$ is simply referred to as “sampling the quantum circuit $U_N(\vec{j})$ according to $p_N(\vec{j})$”.

We construct GQE to search for the ground state of $\hat{H}$ with the generative model of quantum circuits $p_N(\vec{j})$. The objective of the problem we target is finding $\vec{j}^{\ast} := \operatorname*{arg\,min}_{\vec{j}} E_N(\hat{H}, \vec{j})$ with

$$E_N(\hat{H}, \vec{j}) := \mathrm{Tr}\left(\hat{H}\, U_N(\vec{j})\, \rho_0\, U_N(\vec{j})^{\dagger}\right), \tag{1}$$

where $\rho_0$ is a fixed initial quantum state. In the following, we omit the variable $\hat{H}$ depending on the context for simplicity.
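As a minimal numerical sketch of Eq. (1), the following NumPy code builds a small pool of Pauli time-evolution operators, composes a circuit from an index sequence, and evaluates the energy by dense matrix algebra. The two-qubit Hamiltonian, the Pauli strings, the evolution times, and all function names are hypothetical choices made for illustration, and indices are 0-based here whereas the text uses 1 to $L$.

```python
import numpy as np

# Single-qubit Pauli matrices.
PAULI = {
    "I": np.eye(2, dtype=complex),
    "X": np.array([[0, 1], [1, 0]], dtype=complex),
    "Y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "Z": np.array([[1, 0], [0, -1]], dtype=complex),
}

def pauli_string(label):
    """Tensor product of single-qubit Paulis, e.g. "XZ" -> X (x) Z."""
    op = np.array([[1.0 + 0j]])
    for ch in label:
        op = np.kron(op, PAULI[ch])
    return op

def pauli_evolution(label, t):
    """exp(i P t) for a Pauli string P, using the identity P @ P = I."""
    P = pauli_string(label)
    return np.cos(t) * np.eye(P.shape[0]) + 1j * np.sin(t) * P

def energy(H, pool, seq, rho0):
    """E_N(j) = Tr(H U rho0 U^dagger) with U = U_{j_N} ... U_{j_1}."""
    U = np.eye(H.shape[0], dtype=complex)
    for j in seq:
        U = pool[j] @ U  # the first sampled operator acts first (rightmost)
    return float(np.real(np.trace(H @ U @ rho0 @ U.conj().T)))

# Toy two-qubit Hamiltonian with hypothetical coefficients h_l.
H = 0.5 * pauli_string("ZI") + 0.5 * pauli_string("IZ") + 0.25 * pauli_string("XX")

# Toy operator pool: hypothetical Pauli strings and evolution times.
pool = [pauli_evolution(lbl, t)
        for lbl in ("XY", "YX", "XI", "IX")
        for t in (0.1, -0.1)]

# Fixed initial state rho0 = |00><00|.
rho0 = np.zeros((4, 4), dtype=complex)
rho0[0, 0] = 1.0

E_empty = energy(H, pool, [], rho0)        # energy of rho0 itself
E_seq = energy(H, pool, [0, 3, 5], rho0)   # energy of a three-gate circuit
ground = float(np.linalg.eigvalsh(H)[0])   # exact ground energy for reference
```

Dense matrices are used only because the example has two qubits; on a quantum device the trace would be estimated from measurements of the Pauli terms.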

We note that we need to select the operator pool $\mathcal{G}$ to be expressive enough that $E_N(\vec{j}^{\ast})$ is close enough to the ground state energy. It should also be noted that we can select $\mathcal{G}$ to accommodate the native operations and topology of each quantum device. We illustrate the specific choice for the operator pool in our numerical experiment. We train $p_N(\vec{j})$ so that the generated quantum circuit $U_N(\vec{j})$ is likely to produce a low-energy quantum state. We call this approach of optimizing $p_N(\vec{j})$ the generative quantum eigensolver.

We now emphasize the difference between VQE and GQE. As shown in Fig. 1, in VQE, we embed parameters in the quantum circuit and optimize them to minimize the energy associated with the generated quantum state. In contrast, all parameters in GQE are embedded in the generative model $p_N(\vec{j})$. Consequently, the cost function landscapes in GQE and VQE are different; considering the success of training large models with DNN [15, 16, 17], we expect that GQE potentially addresses the issue of trainability in VQE by exploiting the different landscape, which has now been moved to the classical computer.

An advantage of the GQE approach is that we are free to choose the generative model $p_N(\vec{j})$ from a very rich set of families of generative models stemming from the field of machine learning, such as autoencoders, generative adversarial networks, diffusion models, flow models, etc. This paper focuses on the case where the generative model is implemented by the transformer [8], which has achieved significant success as the cornerstone of large language models. In the following subsection, we describe the details of GQE implemented in this transformer setting.

2.2 GPT Quantum Eigensolver

We construct the specific GQE algorithm using the transformer architecture and provide its training scheme. As we will show later, the approach also involves pre-training; therefore, we call the method generative pre-trained transformer-based quantum eigensolver (GPT-QE). In the following, we describe how the transformer generates quantum circuits. Then, we construct the training and pre-training scheme of GPT-QE.

Quantum circuit generation in GPT-QE

The original transformer, introduced in [8], targets neural machine translation, where the model consists of an encoder for the input language and a decoder for the targeted language. In quantum circuit generation, we focus on the decoder-only transformer inspired by GPT-2 [7], developed for more general generative tasks. In the following, we refer to a decoder-only transformer simply as the transformer.

The sequence generation using the transformer can be written as the repetitions of (i) calculating the logit (logarithmic probability) with which each token is generated and (ii) sampling a token according to the corresponding probability distribution. We write the function for the probability calculation as GPT and the function for sampling as Sample.

The function GPT takes the variable-length inputs $\vec{j}^{(k)} = \{j_1, \ldots, j_k\}$, where each element takes an integer value between $1$ and $L$ (the size of the operator pool $\mathcal{G}$). Then it outputs the sequence of logits $W^{(k)} = \{\vec{w}^{(1)}, \ldots, \vec{w}^{(k)}\}$ with the same length, where each logit $\vec{w}^{(r)}$ is a real vector of size $L$. We note that GPT has optimizable parameters $\vec{\theta}$ included in the transformer’s architecture, which are not explicitly written in the notation GPT for simplicity.

The function GPT follows the methodology outlined in [7]. Here, we present a concise overview of GPT’s operation: Initially, GPT converts each token in the input sequence into a unique embedding, represented as a vector. These embeddings are transformed by means of multiple attention layers. Each attention layer takes a sequence of vectors as input, with the first layer’s input being the embeddings themselves. The output of each attention layer is also a sequence of vectors, maintaining the same sequence length and vector dimension as the input. This consistency allows the output of one layer to serve as the input for the subsequent layer. The final attention layer’s output is converted into a sequence of logits, constituting the output of GPT. Given an attention layer’s input $\{\vec{v}_1, \cdots, \vec{v}_k\}$, the output can be expressed as $\{\vec{a}_1, \cdots, \vec{a}_k\}$, where each $\vec{a}_r$ represents the attention with the same dimension as $\vec{v}_r$. The attention $\vec{a}_r$ encapsulates the relationship between $\vec{v}_r$ and other elements of the input.
Notably, through the process of causal masking, $\vec{a}_r$ depends solely on preceding elements, i.e., $\{\vec{v}_{r'}\}_{r' < r}$. The standard implementation of attention computation involves running multiple attention mechanisms in parallel, and their outputs are combined into $\{\vec{a}_1, \cdots, \vec{a}_k\}$ through a weighted average (multi-head attention). For a more comprehensive explanation, see [7].
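The masking idea can be illustrated with a single attention head in NumPy. This is a simplified sketch, not the architecture of [7]: it omits multiple heads, layer normalization, residual connections, and feed-forward blocks, and the projection matrices `Wq`, `Wk`, `Wv` are hypothetical random parameters. It also assumes the common convention in which position $r$ attends to positions $r' \le r$, that is, including itself.

```python
import numpy as np

def causal_self_attention(V, Wq, Wk, Wv):
    """Single-head causal self-attention over a sequence V of shape (n, d)."""
    Q, K, Vals = V @ Wq, V @ Wk, V @ Wv
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                      # (n, n) pairwise scores
    # Mask future positions: row r may only see columns r' <= r.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ Vals                              # (n, d) outputs a_1..a_n

rng = np.random.default_rng(0)
n, d = 5, 4
V = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

A = causal_self_attention(V, Wq, Wk, Wv)

# Perturb only the last input vector and recompute.
V_changed = V.copy()
V_changed[-1] += 1.0
A_changed = causal_self_attention(V_changed, Wq, Wk, Wv)
```

Because of the mask, perturbing the last input vector leaves every earlier output row unchanged, which is the property that lets the model generate tokens one at a time.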

Figure 3: Overview of the training and pre-training scheme in GPT-QE.

The function Sample is a stochastic function that takes a logit $\vec{w} = \{w_j\}_{j=1}^{L}$ as its input and returns one of the tokens $j \in \{1, \ldots, L\}$. The probability that the token $j$ is sampled is proportional to $e^{-\beta w_j}$, with $\beta > 0$ a hyper-parameter we can choose. It is customary to sample from the logits, which can be understood as each $j$ being sampled according to the energy $w_j$ and the inverse temperature $\beta$ as in statistical mechanics. For simplicity, we omit the variable $\beta$ from the function Sample.
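A minimal sketch of such a sampling function is given here, assuming NumPy's random generator; the names `token_probs` and `sample` are illustrative rather than taken from the paper's code. Note the sign convention: a lower logit means a higher probability.

```python
import numpy as np

def token_probs(w, beta=1.0):
    """Boltzmann probabilities proportional to exp(-beta * w_j)."""
    z = -beta * np.asarray(w, dtype=float)
    z = z - z.max()              # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

def sample(w, beta=1.0, rng=None):
    """Draw one token in {1, ..., L} according to token_probs(w, beta)."""
    rng = np.random.default_rng() if rng is None else rng
    p = token_probs(w, beta)
    return int(rng.choice(len(p), p=p)) + 1   # shift to 1-based tokens

p_example = token_probs([0.0, 1.0, 2.0], beta=1.0)
# At large beta (low temperature) the draw concentrates on the smallest logit.
cold_token = sample([3.0, 0.1, 2.0], beta=50.0, rng=np.random.default_rng(0))
```

Raising $\beta$ sharpens the distribution toward the token with the smallest logit, while $\beta \to 0$ approaches uniform sampling.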

We define the generative model of the quantum circuits in GPT-QE by the procedure to obtain a sequence $\vec{j}$ using GPT and Sample:

  • In the first step, we sample the token $j_1$ with the fixed input $\{0\}$:

    $$\begin{aligned}
    W^{(1)} &= \texttt{GPT}(\{0\}),\\
    \vec{w}^{(1)} &= W^{(1)}_{1},\\
    j_1 &= \texttt{Sample}(\vec{w}^{(1)}),
    \end{aligned}$$

    where $W^{(1)}_{1}$ denotes the first element of $W^{(1)}$.

  • In the second step, we sample the token $j_2$ with the input $\{0, j_1\}$:

    $$\begin{aligned}
    W^{(2)} &= \texttt{GPT}(\{0, j_1\}),\\
    \vec{w}^{(2)} &= W^{(2)}_{2},\\
    j_2 &= \texttt{Sample}(\vec{w}^{(2)}).
    \end{aligned}$$

  • In the $k$-th step, we sample the token $j_k$ with the input $\{0, j_1, \ldots, j_{k-1}\}$:

    $$\begin{aligned}
    W^{(k)} &= \texttt{GPT}(\{0, j_1, \ldots, j_{k-1}\}),\\
    \vec{w}^{(k)} &= W^{(k)}_{k},\\
    j_k &= \texttt{Sample}(\vec{w}^{(k)}).
    \end{aligned}$$

After $N$ steps, we obtain a sequence of tokens $\vec{j} = \{j_1, \ldots, j_N\}$ with length $N$.
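The step-by-step procedure above can be sketched as a short loop. The function `toy_gpt` is a hypothetical stand-in, not a transformer: it returns one logit vector per input position computed from a running mean of token embeddings, so that row $r$ depends only on the tokens up to position $r$. The tables `emb` and `out` are random placeholder parameters.

```python
import numpy as np

def toy_gpt(tokens, emb, out):
    """Stand-in for GPT: one logit vector of size L per input position.

    Row r depends only on tokens[0..r], mimicking causal masking.
    """
    x = emb[np.asarray(tokens)]                              # (k, d) embeddings
    counts = np.arange(1, len(tokens) + 1)[:, None]
    prefix_mean = np.cumsum(x, axis=0) / counts              # causal summary
    return prefix_mean @ out                                 # (k, L) logits

def sample(w, beta, rng):
    """Draw a token in {1, ..., L} with probability ~ exp(-beta * w_j)."""
    z = -beta * w
    p = np.exp(z - z.max())
    return int(rng.choice(len(w), p=p / p.sum())) + 1

def generate(N, emb, out, beta, rng):
    """Autoregressively sample j_1, ..., j_N starting from the fixed token 0."""
    tokens = [0]
    for _ in range(N):
        W = toy_gpt(tokens, emb, out)   # W^(k) for the current input
        w_k = W[-1]                     # its k-th (last) element
        tokens.append(sample(w_k, beta, rng))
    return tokens[1:]                   # drop the start token

L, d, N = 6, 8, 5
init = np.random.default_rng(1)
emb = init.normal(size=(L + 1, d))      # row 0 embeds the start token
out = init.normal(size=(d, L))

seq = generate(N, emb, out, beta=1.0, rng=np.random.default_rng(7))
```

Each iteration reruns the model on the whole prefix for clarity; practical implementations usually cache intermediate results instead.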

We can readily show that the probability that j𝑗\vec{j}over→ start_ARG italic_j end_ARG is sampled is proportional to exp(βwsum(j))𝛽subscript𝑤sum𝑗\exp\left(-\beta w_{\textrm{sum}}(\vec{j})\right)roman_exp ( - italic_β italic_w start_POSTSUBSCRIPT sum end_POSTSUBSCRIPT ( over→ start_ARG italic_j end_ARG ) ), where

wsum(j):=k=1Nwjk(k).assignsubscript𝑤sum𝑗superscriptsubscript𝑘1𝑁superscriptsubscript𝑤subscript𝑗𝑘𝑘w_{\textrm{sum}}(\vec{j}):=\sum_{k=1}^{N}w_{j_{k}}^{(k)}.italic_w start_POSTSUBSCRIPT sum end_POSTSUBSCRIPT ( over→ start_ARG italic_j end_ARG ) := ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_w start_POSTSUBSCRIPT italic_j start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT . (2)

Therefore, the generative model in GPT-QE is

pN(β,j)=exp(βwsum(j))𝒵,subscript𝑝𝑁𝛽𝑗𝛽subscript𝑤sum𝑗𝒵p_{N}(\beta,\vec{j})=\frac{\exp\left(-\beta w_{\textrm{sum}}(\vec{j})\right)}{% \mathcal{Z}},italic_p start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ( italic_β , over→ start_ARG italic_j end_ARG ) = divide start_ARG roman_exp ( - italic_β italic_w start_POSTSUBSCRIPT sum end_POSTSUBSCRIPT ( over→ start_ARG italic_j end_ARG ) ) end_ARG start_ARG caligraphic_Z end_ARG , (3)

where $\mathcal{Z}=\sum_{\vec{j}}\exp\left(-\beta w_{\textrm{sum}}(\vec{j})\right)$ and we write the hyper-parameter $\beta$ explicitly.
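The token-by-token sampling and the accumulated logit sum of Eq. (2) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `toy_logits` is a hypothetical stand-in for the GPT forward pass, and we assume that `Sample` draws token $j$ with probability proportional to $\exp(-\beta w^{(k)}_{j})$.

```python
import numpy as np

def toy_logits(prefix, vocab_size):
    """Hypothetical stand-in for the GPT forward pass: returns the logit vector w^(k)."""
    # Deterministic function of the prefix, only so that the sketch is reproducible.
    seed = sum((i + 1) * (t + 1) for i, t in enumerate(prefix))
    return np.random.default_rng(seed).normal(size=vocab_size)

def sample_sequence(n_tokens, vocab_size, beta, rng):
    """Sample j = (j_1, ..., j_N) one token at a time and accumulate w_sum(j)."""
    tokens, w_sum = [], 0.0
    for _ in range(n_tokens):
        w = toy_logits(tokens, vocab_size)
        # Assumed form of Sample: p(j) proportional to exp(-beta * w_j).
        z = -beta * w
        p = np.exp(z - z.max())  # subtracting the max only improves numerical stability
        p /= p.sum()
        j = int(rng.choice(vocab_size, p=p))
        tokens.append(j)
        w_sum += float(w[j])
    return tokens, w_sum

tokens, w_sum = sample_sequence(n_tokens=5, vocab_size=4, beta=2.0,
                                rng=np.random.default_rng(0))
```

The returned `w_sum` is the quantity that the training scheme later matches to the measured energy of the circuit encoded by `tokens`.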

Training

To construct the training scheme for GPT-QE, let us consider the process of sampling $U_{N}(\vec{j})$ according to $p_{N}(\beta,\vec{j})$ and applying it to $\rho_{0}$. Let $\rho(\beta)$ be the quantum state generated by this stochastic process. With $\mathcal{E}_{N}(\vec{j},\rho):=U_{N}(\vec{j})\rho U_{N}(\vec{j})^{\dagger}$, it can be written as

$$\begin{split}\rho(\beta)&=\sum_{\vec{j}}p_{N}(\beta,\vec{j})\mathcal{E}_{N}(\vec{j},\rho_{0})\\ &=\frac{1}{\mathcal{Z}}\sum_{\vec{j}}\exp\left(-\beta w_{\textrm{sum}}(\vec{j})\right)\mathcal{E}_{N}(\vec{j},\rho_{0}).\end{split} \tag{4}$$

We observe that if $w_{\textrm{sum}}(\vec{j})=E_{N}(\vec{j})$ is satisfied, $\rho(\beta)$ gives a pseudo-thermal state with inverse temperature $\beta$, in the sense that the quantum state is generated according to the probability $\exp\left(-\beta E_{N}(\vec{j})\right)$. Therefore, increasing the value of $\beta$ creates a bias towards generating lower-energy quantum states. We note that $\rho(\beta)$ is not, in general, the exact thermal state, since the quantum state $\mathcal{E}_{N}(\vec{j},\rho_{0})$ cannot represent all the eigenstates in the whole Hilbert space when $N$ and $L$ are constrained.
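The bias towards low energies can be illustrated numerically. The sketch below assumes a small, made-up list of circuit energies and evaluates the normalized weights $\exp(-\beta E_{N}(\vec{j}))/\mathcal{Z}$ for two values of $\beta$; the function name and the energies are hypothetical.

```python
import numpy as np

def boltzmann_weights(energies, beta):
    """Normalized weights exp(-beta * E) / Z over a finite set of circuit energies."""
    z = -beta * np.asarray(energies, dtype=float)
    w = np.exp(z - z.max())  # shifting by the max leaves the normalized weights unchanged
    return w / w.sum()

# Hypothetical energies E_N(j) of four candidate circuits (arbitrary units).
energies = [-1.10, -1.00, -0.80, -0.50]
p_low_beta = boltzmann_weights(energies, beta=1.0)
p_high_beta = boltzmann_weights(energies, beta=10.0)
```

With the larger $\beta$, a larger share of the probability mass sits on the lowest-energy circuit, which is the behavior described above.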

From this observation, we design our scheme for the training/pre-training so that $w_{\textrm{sum}}(\vec{j})\simeq E_{N}(\vec{j})$ is satisfied. More specifically, in each iteration, we sample $\{\vec{j}_{m}\}_{m=1}^{M}$, calculate the cost function

$$\begin{split}&C\left(\{w_{\textrm{sum}}(\vec{j}_{m})\}_{m=1}^{M},\{E_{N}(\vec{j}_{m})\}_{m=1}^{M}\right)\\ &=\frac{1}{M}\sum_{m=1}^{M}\tilde{C}\left(w_{\textrm{sum}}(\vec{j}_{m}),E_{N}(\vec{j}_{m})\right),\\ &\tilde{C}(w_{\textrm{sum}}(\vec{j}),E_{N}(\vec{j})):=\left(e^{-w_{\textrm{sum}}(\vec{j})}-e^{-E_{N}(\vec{j})}\right)^{2},\end{split} \tag{5}$$

and update the parameters in GPT by backpropagation. Since we match the sum of logits with the energy function, we call this technique logit-matching. We overview the training process in Fig. 3. It should be noted that $M$, which corresponds to the batch size in the context of machine learning, is a hyper-parameter we can choose, and it does not explicitly depend on $N$ and $L$. Therefore, in principle, we can freely choose the number of quantum circuit runs in each iteration. However, we must also be aware that a small $M$ leads to a significant statistical error in the cost function estimation. The best strategy for choosing $M$ should be studied in future work. It is also possible to optimize $M$ by using hyper-parameter optimization, which itself is an active area of exploration in the field of machine learning. We also note that the parameter $\beta$ can be used to control the trade-off between exploitation and exploration. In other words, by adjusting $\beta$, we can influence how the algorithm balances between intensively searching within a particular area (exploitation) and exploring new, potentially promising areas (exploration). A higher value of $\beta$ typically encourages more exploitation, focusing the search on areas already identified as having high potential. Conversely, a lower value of $\beta$ promotes exploration, allowing the algorithm to investigate a broader range of solutions that may lead to discovering better-performing options not yet explored. In our numerical experiment, we initially set $\beta$ to a small value and then gradually increase it over time.
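As a concrete reading of Eq. (5), the cost over a batch of $M$ sampled sequences can be written in a few lines. This is a sketch in NumPy with made-up numbers; the function name is ours, and in an actual training loop the same expression would be evaluated in an automatic-differentiation framework so that gradients can flow back into the GPT parameters.

```python
import numpy as np

def logit_matching_cost(w_sum, energies):
    """Eq. (5): batch mean of (exp(-w_sum) - exp(-E_N))^2."""
    w_sum = np.asarray(w_sum, dtype=float)
    energies = np.asarray(energies, dtype=float)
    return float(np.mean((np.exp(-w_sum) - np.exp(-energies)) ** 2))

# Hypothetical batch of M = 3 logit sums and measured energies.
cost = logit_matching_cost(w_sum=[-1.00, -0.90, -0.70],
                           energies=[-1.10, -0.80, -0.70])
```

The cost vanishes exactly when every logit sum equals the corresponding energy, which is the condition $w_{\textrm{sum}}(\vec{j})=E_{N}(\vec{j})$ used above.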

Pre-Training

By applying the logit-matching technique, we can easily incorporate a pre-training scheme in GPT-QE (see Fig. 3). Suppose we have the dataset $\mathcal{D}=\{\vec{j}_{m},E_{N}(\vec{j}_{m})\}_{m=1}^{M_{D}}$. In pre-training, we input each $\vec{j}_{m}$ to GPT and obtain $w_{\textrm{sum}}(\vec{j}_{m})$. Then, we update parameters in GPT so that $w_{\textrm{sum}}(\vec{j}_{m})$ becomes close to $E_{N}(\vec{j}_{m})$ from the dataset by using the cost function (5). We note that we can split the dataset $\mathcal{D}$ into $B$ batches $\{\mathcal{D}_{b}\}_{b=1}^{B}$ and train GPT with each batch.
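The batched pre-training loop can be sketched with a deliberately simple model. Here a lookup table `theta[k, j_k]` plays the role of GPT, so that $w_{\textrm{sum}}(\vec{j})=\sum_{k}\theta_{k,j_{k}}$; the table model, the synthetic dataset, the learning rate, and the number of epochs are all assumptions made for illustration, not choices taken from the paper.

```python
import numpy as np

def cost(w_sum, energies):
    """Logit-matching cost of Eq. (5), averaged over the given samples."""
    return float(np.mean((np.exp(-w_sum) - np.exp(-energies)) ** 2))

def w_sum_of(theta, seqs):
    """Toy model: w_sum(j) = sum_k theta[k, j_k] for each row j of seqs."""
    positions = np.arange(seqs.shape[1])
    return theta[positions, seqs].sum(axis=1)

def pretrain(theta, seqs, energies, n_batches, n_epochs, lr):
    """Mini-batch gradient descent on Eq. (5) over a fixed dataset (purely classical)."""
    for _ in range(n_epochs):
        for idx in np.array_split(np.arange(len(seqs)), n_batches):
            s, e = seqs[idx], energies[idx]
            w = w_sum_of(theta, s)
            # Derivative of the batch cost with respect to each sample's w_sum.
            g = -2.0 * (np.exp(-w) - np.exp(-e)) * np.exp(-w) / len(idx)
            grad = np.zeros_like(theta)
            pos = np.broadcast_to(np.arange(s.shape[1]), s.shape)
            # Each sample adds its derivative to the table entries it uses.
            np.add.at(grad, (pos, s), np.broadcast_to(g[:, None], s.shape))
            theta = theta - lr * grad
    return theta

rng = np.random.default_rng(0)
n_pos, vocab, m_d = 3, 4, 64
seqs = rng.integers(0, vocab, size=(m_d, n_pos))
# Hypothetical energies generated from a hidden table, so that an exact fit exists.
hidden = rng.uniform(-0.2, 0.2, size=(n_pos, vocab))
energies = w_sum_of(hidden, seqs)

theta0 = np.zeros((n_pos, vocab))
theta = pretrain(theta0, seqs, energies, n_batches=4, n_epochs=200, lr=0.05)
cost_before = cost(w_sum_of(theta0, seqs), energies)
cost_after = cost(w_sum_of(theta, seqs), energies)
```

No circuit is executed anywhere in this loop: the energies come from the stored dataset, which is the sense in which pre-training requires no quantum device.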

The pre-training process is entirely classical, eliminating the need to use quantum devices as long as datasets are available. The question then arises: How do we obtain these datasets? Primarily, they can be sourced from previous GPT-QE training. Specifically, previous quantum evaluations performed as part of solving related tasks can be used to construct such a dataset. This opens up the possibility of a large-scale effort to simulate several Hamiltonians (e.g., molecules and materials) in the cloud using high-performance quantum simulators or on actual quantum devices, with GQE gaining performance as the pre-training dataset becomes more comprehensive over time.

We propose three scenarios to leverage the pre-training scheme effectively: (i) model-to-model transfer, (ii) config-to-config transfer, and (iii) molecule-to-molecule transfer. Below, we describe these scenarios in detail.

(i) Model-to-model transfer scenario

In the model-to-model transfer scenario, we assume that the Hamiltonian used for data generation and the one whose ground state we want to find are the same. We first prepare a model, denoted $\texttt{GPT}_{A}$, and train it by using the Hamiltonian $\hat{H}$. While training $\texttt{GPT}_{A}$, we obtain $\mathcal{D}=\{\vec{j}_{m},E_{N}(\vec{j}_{m})\}_{m=1}^{M}$. If the result of the training is unsatisfactory, we may try another model, denoted $\texttt{GPT}_{B}$. For example, $\texttt{GPT}_{B}$ may have more attention layers than $\texttt{GPT}_{A}$. Then, the dataset $\mathcal{D}$ can be utilized for the pre-training of $\texttt{GPT}_{B}$, and we can then run the training algorithm with $\texttt{GPT}_{B}$ after the pre-training. It is also possible to use the dataset $\mathcal{D}$ for the training of $\texttt{GPT}_{A}$ itself to obtain a better initialization (experience replay).

Above, we describe the case where GPT-QE obtains the dataset. However, it is also possible that the dataset could be generated from other algorithms, e.g., tensor network calculations and VQE, if we can (approximately) convert the data obtained in those algorithms to a data format acceptable for GPT-QE.

(ii) Config-to-config transfer scenario

In the config-to-config transfer scenario, we reuse the data obtained while training to find the ground state of a Hamiltonian $\hat{H}$, corresponding to one spatial nuclear configuration, in the search for the ground state of another Hamiltonian $\hat{H}^{\prime}$ corresponding to a different nuclear configuration. We still assume that $\hat{H}$ and $\hat{H}^{\prime}$ are Hamiltonians of the same molecule, albeit at different geometries. In the context of VQA, similar approaches have already been proposed [31, 32]. Let us first assume that the Hamiltonian can be parameterized as follows:

$$\hat{H}(\vec{\Delta})=\sum_{a=1}^{N_{H}}h_{a}(\vec{\Delta})\hat{P}_{a}, \tag{6}$$

where $\vec{\Delta}$ is a set of parameters corresponding to a configuration, $N_{H}$ is the number of terms in the Hamiltonian, each $h_{a}(\vec{\Delta})$ is a real-valued function of the parameters $\vec{\Delta}$, and each $\hat{P}_{a}$ is a tensor product of Pauli operators. An example of the parameters in $\vec{\Delta}$ is the bond length; when we change the bond length, the set $\{\hat{P}_{a}\}_{a=1}^{N_{H}}$ does not change, and only the coefficients in the Hamiltonian change. With this parameterization scheme, we can transfer the dataset obtained in the training with $\hat{H}(\vec{\Delta})$ to a dataset available for the pre-training with the Hamiltonian $\hat{H}(\vec{\Delta}^{\prime})$.

We propose two methods to realize the config-to-config transfer, which can be effectively combined with each other. One method is the following coefficient re-weighting (see also [31]). In the training with $\hat{H}(\vec{\Delta})$, we estimate $\{q_{a}(\vec{j})\}_{a=1}^{N_{H}}$ defined by $q_{a}(\vec{j}):=\textrm{Tr}\left(\hat{P}_{a}\mathcal{E}_{N}(\vec{j},\rho_{0})\right)$ and combine them to estimate the energy as

$$E_{N}\left(\vec{\Delta},\vec{j}\right)=\sum_{a=1}^{N_{H}}h_{a}(\vec{\Delta})q_{a}(\vec{j}), \tag{7}$$

where we simply write $E_{N}\left(\hat{H}(\vec{\Delta}),\vec{j}\right)$ as $E_{N}\left(\vec{\Delta},\vec{j}\right)$.

The estimated values $\{q_{a}(\vec{j})\}_{a=1}^{N_{H}}$ can also be used to estimate $E_{N}(\vec{\Delta}^{\prime},\vec{j})$ for a different configuration, i.e., we can simply combine the measured Pauli expectation values using the different coefficients $h_{a}(\vec{\Delta}^{\prime})$. By this process, we obtain the dataset $\mathcal{D}=\left\{\vec{j}_{m},E_{N}\left(\vec{\Delta}^{\prime},\vec{j}_{m}\right)\right\}_{m=1}^{M}$, which can then be used for the pre-training when searching for the ground state of the Hamiltonian $\hat{H}(\vec{\Delta}^{\prime})$.
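Coefficient re-weighting amounts to a single matrix-vector product per configuration. The sketch below assumes that the Pauli expectation values $q_{a}(\vec{j}_{m})$ were stored as a matrix with one row per sequence; all numbers and names are made up for illustration.

```python
import numpy as np

def reweight_energies(q, h):
    """Eq. (7): E_N(Delta, j_m) = sum_a h_a(Delta) * q_a(j_m) for every stored sequence."""
    return np.asarray(q, dtype=float) @ np.asarray(h, dtype=float)

# Hypothetical stored expectation values: rows = sequences j_m, columns = Pauli terms P_a.
q = np.array([[1.0,  0.5, -0.2],
              [1.0, -0.3,  0.4]])
h_old = np.array([-0.8, 0.2, 0.10])  # made-up coefficients h_a(Delta)
h_new = np.array([-0.7, 0.3, 0.05])  # made-up coefficients h_a(Delta')

e_old = reweight_energies(q, h_old)  # energies used in the original training
e_new = reweight_energies(q, h_new)  # energies for pre-training at Delta', no new circuit runs
```

Because only the coefficients change between configurations, the same stored `q` yields a pre-training dataset for any $\vec{\Delta}^{\prime}$ whose Hamiltonian uses the same Pauli terms.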

Another method to achieve config-to-config transfer is adding extra inputs $\vec{\Delta}$ to the GPT function. More specifically, we would imagine extending GPT so that it takes the configuration $\vec{\Delta}$ as its input in addition to the variable inputs $\{j_{1},\cdots,j_{k}\}$ and outputs the logits $W^{(k)}$. Let us write the extended function as $\texttt{GPT}_{+}$.

By training $\texttt{GPT}_{+}$ with the different configurations $\vec{\Delta}_{1},\cdots,\vec{\Delta}_{R}$, we obtain the datasets $\mathcal{D}=\{\mathcal{D}_{r}\}_{r=1}^{R}$, where $\mathcal{D}_{r}:=\{\vec{\Delta}_{r},\vec{j}_{m_{r}},E_{N}(\vec{\Delta}_{r},\vec{j}_{m_{r}})\}_{m_{r}=1}^{M_{r}}$ and each $M_{r}$ is the number of sequences generated in the training with the corresponding configuration. Then, the dataset $\mathcal{D}$ can be used for the pre-training when we search for the ground state of $\hat{H}(\vec{\Delta}^{\prime})$ for a new configuration $\vec{\Delta}^{\prime}$. In the pre-training, $\vec{\Delta}_{r}$ is used as an input of $\texttt{GPT}_{+}$ together with $\vec{j}_{m_{r}}$, whereas $\vec{\Delta}^{\prime}$ is used as the input in the main training.

(iii) Molecule-to-molecule transfer scenario

In the molecule-to-molecule transfer scenario, we propose utilizing the dataset generated from the training of one molecule for the pre-training of another. Implementing this approach requires a significant extension of the model. At this stage, we focus on outlining the concept rather than delving into the specifics of the implementation. In this scenario, it is crucial to use the information of the molecule, such as the Hamiltonian, as an input, similar to our approach in the $\texttt{GPT}_{+}$ model for config-to-config transfer.

Figure 4: The training results with the electronic structure Hamiltonians: $\texttt{H}_{2}$ (top left), LiH (top right), $\texttt{BeH}_{2}$ (bottom left), and $\texttt{N}_{2}$ (bottom right) in sto-3g basis. The results of GPT-QE are depicted as green points (gpt-qe). The Hartree-Fock energy is indicated by a gray dotted line (hf), and the exact full configuration interaction energy, calculated through diagonalization, is represented by a black line (exact). For GPT-QE, the best result in three trials is shown.
Figure 5: The corresponding circuit for the GPT-QE result of the $\texttt{H}_{2}$ Hamiltonian at a bond length of 2.0 angstroms. The initial two X gates serve to prepare the Hartree-Fock state. The circuit is depicted across two lines for clarity. Each number positioned above a circuit component indicates its respective token number. Although there are a total of 10 tokens, one of these corresponds to the identity operator and is not shown in the diagram.
Figure 6: Training performance with the $\texttt{N}_{2}$ Hamiltonian, comparing scenarios with and without pre-training. The left figure shows the minimum energy found up to each step during the trial, while the right figure illustrates the mean energy of the quantum states generated at each step. The results from the training with pre-training are indicated by a green line, and those from the training without pre-training are shown by a gray line. In both figures, the mean and its standard deviation across ten random initializations are drawn.

The primary challenge lies in uniformly treating molecules with varying numbers and types of molecular orbitals. For instance, there is no straightforward mapping between the molecular orbitals of $\texttt{H}_{2}$ and those of $\texttt{H}_{2}\texttt{O}$ as determined by Hartree-Fock calculations. Therefore, the gate sequence in the calculation of $\texttt{H}_{2}$ does not have an a priori correspondence with that in $\texttt{H}_{2}\texttt{O}$. Finding transferable representations of electronic structure information that are useful for machine learning applications is a vibrant research field on its own [33, 34]. One potential solution to this challenge in our setting is to write the molecular Hamiltonian in the basis of atomic orbitals instead of that of molecular orbitals. More specifically, we define the creation and annihilation operators for each atomic orbital and map them to qubits. Then, a gate sequence for $\texttt{H}_{2}$ has an analogous physical meaning when applied to the portion corresponding to hydrogen atoms in $\texttt{H}_{2}\texttt{O}$. This approach facilitates effective molecule-to-molecule transfer, particularly between molecules with the same atomic composition. Alternatively, one could devise a quantum circuit that couples molecular fragments encoded initially in independent qubits. In this manner, the optimized gates for the fragments can be invoked as a starting point for the full molecular circuit.

3 Results

In this section, we showcase the effectiveness of training and pre-training in GPT-QE for approximating ground states using electronic structure Hamiltonians. We use the molecular Hamiltonians of $\texttt{H}_{2}$, LiH, $\texttt{BeH}_{2}$, and $\texttt{N}_{2}$ in the sto-3g basis for this purpose.

The configuration of the GPT-QE model is as follows. Our operator pool is a set of Pauli time evolutions, $\mathcal{G}=\{e^{i\hat{P}_{j}t_{j}}\}_{j}$, where $\hat{P}_{j}$ is a tensor product of Pauli operators and $t_{j}$ is a real value. Defining $\mathcal{P}$ as the set of $\hat{P}_{j}$ and $\mathcal{T}$ as the set of $t_{j}$, the size of the operator pool is $|\mathcal{P}|\times|\mathcal{T}|$. For $\mathcal{P}$, we derive chemically inspired choices from the unitary coupled-cluster single and double excitations (UCCSD). 
Letting $T$ denote the sum of all fermionic excitation operators included in UCCSD, $\mathcal{P}$ is selected such that $e^{iP_{\ell}\theta_{\ell}}$ ($P_{\ell}\in\mathcal{P}$), with an angle $\theta_{\ell}$, is part of the decomposed operators when $e^{T-T^{\dagger}}$ is broken down into Pauli time evolutions by the Trotter decomposition [35]. The identity operator is also included in $\mathcal{P}$. For $\mathcal{T}$, we choose $\mathcal{T}=\left\{\pm 2^{k}/160\right\}_{k=1}^{4}$. Regarding the transformer model, we employ a configuration identical to that of GPT-2 [7], featuring 12 attention layers, 12 attention heads, and 768 embedding dimensions. The initial state, $\rho_{0}$, is set to the Hartree-Fock state. For numerical stability, we add an offset to the output of the quantum device. 
More specifically, when calculating the cost function (5), we use $E_{N}(\vec{j})+E_{\textrm{offset}}$ in place of the original $E_{N}(\vec{j})$. The value of $E_{\textrm{offset}}$ is chosen to be $0$, $7$, $14$, and $106$ for $\texttt{H}_{2}$, LiH, $\texttt{BeH}_{2}$, and $\texttt{N}_{2}$, respectively.
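As a minimal sketch of how such an operator pool can be enumerated, the following Python snippet builds the time set $\mathcal{T}=\{\pm 2^{k}/160\}_{k=1}^{4}$ and pairs every time with a Pauli string. The Pauli strings listed here are hypothetical placeholders, not the UCCSD-derived set used in the experiments, and `pauli_matrix` and `pauli_evolution` are illustrative helpers. The evolution uses the identity $e^{iPt}=\cos(t)\,I+i\sin(t)\,P$, which holds for any Pauli string because $P^{2}=I$.

```python
import itertools
import numpy as np

# Single-qubit Pauli matrices.
PAULI = {
    "I": np.eye(2, dtype=complex),
    "X": np.array([[0, 1], [1, 0]], dtype=complex),
    "Y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "Z": np.array([[1, 0], [0, -1]], dtype=complex),
}

def pauli_matrix(label):
    """Dense matrix of a Pauli string such as 'XY' (tensor product, left to right)."""
    mat = np.array([[1.0 + 0j]])
    for ch in label:
        mat = np.kron(mat, PAULI[ch])
    return mat

def pauli_evolution(label, t):
    """exp(i P t) = cos(t) I + i sin(t) P, which holds because P @ P = I."""
    p = pauli_matrix(label)
    return np.cos(t) * np.eye(p.shape[0]) + 1j * np.sin(t) * p

# Time set T = {+-2^k / 160 : k = 1..4}, as specified in the text.
times = sorted(sign * 2**k / 160 for k in range(1, 5) for sign in (+1, -1))

# Hypothetical Pauli set for illustration only (identity included, as in the text).
paulis = ["II", "XY", "YX", "ZZ"]

# Each token of the vocabulary is one (Pauli string, time) pair.
pool = list(itertools.product(paulis, times))
pool_size = len(pool)  # |P| * |T|
```

With eight time values, the vocabulary size grows as $8|\mathcal{P}|$, so the choice of $\mathcal{P}$ dominates the size of the token set.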

In this study, we utilized CUDA Quantum [36] to execute quantum chemistry experiments. CUDA Quantum is distinguished as an open-source programming model and platform, integrating quantum processing units (QPUs), CPUs, and GPUs seamlessly. This integration makes it an ideal choice for workflows that require diverse computing capabilities, as demonstrated in the GPT-QE application we considered. CUDA Quantum facilitates kernel-based programming and is compatible with both C++ and Python, which we employed in our research. The algorithm was executed using NVIDIA A100 GPUs on NERSC’s Perlmutter, an HPE Cray Shasta-based heterogeneous system with 1,792 GPU-accelerated nodes.

3.1 Training Performance

In Fig. 4, we present the results of GPT-QE training for each electronic structure Hamiltonian. The horizontal axis represents the bond length, while the vertical axis shows the energy value. The results of GPT-QE are depicted as green points (gpt-qe). The Hartree-Fock energy is indicated by a gray dotted line (hf), and the exact full configuration interaction energy, calculated through diagonalization of the full Hamiltonian, is represented by a black line (exact).

For $\texttt{H}_{2}$, we run 50 steps with the number of samples per step set to $M=25$. For the other molecules, we run 500 steps with $M=50$. The inverse temperature $\beta$ starts at $5$ and is increased by $0.1$ at each step. The number of tokens, i.e., the length of the output sequence, is fixed at 10 for $\texttt{H}_{2}$ and 40 for the other molecules. We conducted three trials of GPT-QE and recorded the minimum energy in each. The plotted points represent the best results from these trials.
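The schedule above fixes both the number of sampled circuits per trial and the range of the inverse temperature. The short sketch below tabulates them; it assumes that the first step uses $\beta=5$ and that the increment is applied after each step, which the text does not state explicitly. The function name `schedule` is illustrative.

```python
def schedule(n_steps, samples_per_step, beta0=5.0, dbeta=0.1):
    """Return (total sampled circuits, list of beta values, one per step)."""
    # Assumption: step 0 uses beta0, and beta grows by dbeta after every step.
    betas = [beta0 + dbeta * s for s in range(n_steps)]
    return n_steps * samples_per_step, betas

h2_total, h2_betas = schedule(50, 25)          # H2 settings
other_total, other_betas = schedule(500, 50)   # LiH, BeH2, N2 settings
```

Under these assumptions, one trial samples 1,250 circuits for $\texttt{H}_{2}$ and 25,000 for each of the other molecules, and $\beta$ ends at about 9.9 and 54.9, respectively.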

We observe that GPT-QE effectively identifies low-energy states closely approximating the ground state. In Appendix A, we also compare the average performance of GPT-QE with the case in which quantum gates are randomly generated from the operator pool $\mathcal{G}$, and verify the effectiveness of GPT-QE’s training scheme, which employs logit-matching to steer the transformer towards desired outcomes. Some of the results include errors larger than chemical accuracy. Such errors can be further reduced by choosing suitable hyper-parameters and by pre-training, as described in the upcoming subsection. It should also be noted that we can enhance the accuracy by applying VQE as a post-processing step.
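For reference, chemical accuracy is conventionally taken to be 1 kcal/mol, which is about $1.6\times 10^{-3}$ Hartree. A minimal check against that threshold, with hypothetical function and constant names, could look like this:

```python
HARTREE_TO_KCAL_PER_MOL = 627.509  # standard conversion factor
CHEMICAL_ACCURACY_HARTREE = 1.0 / HARTREE_TO_KCAL_PER_MOL  # about 1.6e-3 Hartree

def within_chemical_accuracy(energy, exact_energy):
    """True if |energy - exact_energy| is at most 1 kcal/mol (both in Hartree)."""
    return abs(energy - exact_energy) <= CHEMICAL_ACCURACY_HARTREE
```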

For demonstration purposes, Fig. 5 displays the quantum circuit corresponding to the data point for a bond length of 2.0 Å, which is drawn using an open-source quantum circuit visualization tool [37]. To our knowledge, this is the first instance of a transformer-generated quantum circuit reported in a scientific publication.

3.2 The Effect of Pre-Training

We also conduct a pre-training experiment, focusing specifically on the config-to-config transfer scenario in $\texttt{N}_{2}$. Let $\hat{H}(d)$ represent the $\texttt{N}_{2}$ Hamiltonian at bond length $d$. $\hat{H}(d)$ can be expressed as a weighted sum of Pauli operators:

$$\hat{H}(d)=\sum_{a=1}^{N_{H}}h_{a}(d)\,\hat{P}_{a}, \tag{8}$$

where $h_{a}(d)$ denotes the weights. We transfer the dataset generated during the training with $\hat{H}(1.2)$ to a dataset for pre-training with $\hat{H}(1.4)$. The initial training for $\hat{H}(1.2)$ comprises 500 steps, and at each step, we sample 50 sequences. Consequently, from the training, we compile the dataset $\mathcal{D}^{(1.4)}_{\textrm{original}}:=\{\vec{j}_{m},E^{(1.4)}(\vec{j}_{m})\}_{m=1}^{M}$, where $E^{(d)}(\vec{j})$ estimates $\textrm{Tr}\left(\hat{H}(d)U(\vec{j})\rho_{0}U(\vec{j})^{\dagger}\right)$ and $M=500\times 50=25{,}000$. 
From $\mathcal{D}^{(1.4)}_{\textrm{original}}$, we omit relatively high-energy data having $E^{(1.4)}(\vec{j}_{m})>-107.45$ and construct the dataset $\mathcal{D}^{(1.4)}$, which contains $\sim 14{,}700$ entries.
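The transfer step amounts to re-labelling each stored token sequence with its energy under the target Hamiltonian and then dropping the high-energy entries. The sketch below illustrates this under stated assumptions: `transfer_dataset` is a hypothetical helper, and `toy_energy` is an arbitrary stand-in for the circuit evaluation of $E^{(1.4)}(\vec{j})$, not a physical energy.

```python
import numpy as np

def transfer_dataset(sequences, energy_fn, threshold):
    """Re-label token sequences with a target Hamiltonian's energies and keep low-energy ones.

    sequences : token-index sequences sampled while training on the source Hamiltonian
    energy_fn : callable returning the energy estimate of a sequence under the target Hamiltonian
    threshold : entries with energy above this value are dropped
    """
    relabelled = [(seq, energy_fn(seq)) for seq in sequences]
    return [(seq, e) for seq, e in relabelled if e <= threshold]

# Toy stand-in for circuit evaluation: NOT a physical energy, for illustration only.
rng = np.random.default_rng(0)
toy_sequences = [tuple(rng.integers(0, 8, size=40)) for _ in range(1000)]
toy_energy = lambda seq: -107.45 + 0.01 * (np.mean(seq) - 3.5)

# Keep only entries at or below the threshold quoted in the text.
dataset = transfer_dataset(toy_sequences, toy_energy, threshold=-107.45)
```

In an actual run, `energy_fn` would evaluate each stored circuit against the target Hamiltonian on a simulator or quantum device.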

In the pre-training phase, $\mathcal{D}^{(1.4)}$ is divided into 294 batches. The transformer is then pre-trained using each batch once. Following this pre-training, the transformer undergoes further training for 500 steps. The hyper-parameters are set to be the same as in the experiment in Section 3.1.
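With roughly 14,700 entries split into 294 batches, each batch holds about 50 entries, which matches the $M=50$ samples per step used during training. A sketch of such a single-pass batching follows; the exact dataset size and the shuffling are assumptions made for illustration.

```python
import numpy as np

n_entries, n_batches = 14_700, 294   # the dataset size is only approximate in the text
indices = np.random.default_rng(0).permutation(n_entries)  # shuffling is an assumption
batches = np.array_split(indices, n_batches)  # each batch is visited once in pre-training
batch_size = len(batches[0])
```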

In Figure 6, we display training performance following pre-training. We conducted ten trials of the experiment. In each trial, the transformer’s parameters are randomly initialized with different seeds, but the same dataset $\mathcal{D}^{(1.4)}$ is used for pre-training. Let $E_{\min}(s)$ be the minimum energy found by step $s$ in each trial. The left figure presents the mean of $E_{\min}(s)$ and its standard deviation across the ten training runs. Let $E_{\text{avg}}(s)$ be the average of energies generated at step $s$ in each trial. The right figure illustrates the mean of $E_{\text{avg}}(s)$ and its standard deviation. In each figure, the result is depicted by a green circle. As a baseline, we also include results from training conducted without pre-training. Ten trials of this experiment are conducted, which include three trials from the experiment described in Section 3.1. In each figure, the benchmark result is represented by a gray triangle.
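The two plotted quantities can be computed from the raw sampled energies as sketched below. The array layout and the function name `training_curves` are assumptions made for illustration, and the population standard deviation (NumPy's default) is used because the text does not specify which estimator was applied.

```python
import numpy as np

def training_curves(energies):
    """energies: array of shape (n_trials, n_steps, n_samples) of sampled circuit energies.

    Returns the mean and standard deviation across trials of
    E_min(s) (best energy found up to step s) and E_avg(s) (mean energy at step s).
    """
    energies = np.asarray(energies, dtype=float)
    step_min = energies.min(axis=2)                    # best energy within each step
    e_min = np.minimum.accumulate(step_min, axis=1)    # running minimum over steps
    e_avg = energies.mean(axis=2)                      # per-step average energy
    return (e_min.mean(axis=0), e_min.std(axis=0),
            e_avg.mean(axis=0), e_avg.std(axis=0))
```

Because $E_{\min}(s)$ is a running minimum, its curve is non-increasing in $s$ for every trial, whereas $E_{\text{avg}}(s)$ can fluctuate from step to step.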

The left figure indicates that the pre-training helps to find a lower energy in this example. The minimum energy found at the final step is $\sim 0.01$ Hartree lower on average in the training after pre-training than without pre-training, which corresponds to a $\sim 25\%$ improvement in terms of the deviation from the exact energy ($\simeq -107.591$ Hartree). On the other hand, we observe that the average energy is almost unchanged with the pre-training, as shown in the right figure. This phenomenon can be interpreted as the pre-training encouraging the search for a wider variety of sequences; the fluctuation of the energy value around steps $50 \sim 100$ in the right figure may indicate this. This diversity in sequences aids in finding lower energy configurations but also results in generating sequences with higher energy. Consequently, the average energy remains similar to that observed in training without pre-training.
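The two quoted figures can be related by a short calculation. Writing $\Delta_{\text{w/o}}$ and $\Delta_{\text{pre}}$ for the mean deviation from the exact energy without and with pre-training (symbols introduced here only for illustration), and assuming the percentage is taken relative to the deviation without pre-training, the rounded values in the text imply approximately:

```latex
\Delta_{\text{w/o}} - \Delta_{\text{pre}} \approx 0.01\ \text{Hartree},
\qquad
\frac{\Delta_{\text{w/o}} - \Delta_{\text{pre}}}{\Delta_{\text{w/o}}} \approx 0.25
\;\Longrightarrow\;
\Delta_{\text{w/o}} \approx 0.04\ \text{Hartree},
\quad
\Delta_{\text{pre}} \approx 0.03\ \text{Hartree}.
```

These are rough magnitudes inferred from the rounded numbers above, not separately reported results.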

4 Conclusion and Discussion

We propose the GQE algorithm, a novel method that applies a generative model to obtain quantum circuits with desired properties. In particular, we introduce GPT-QE, which is based on the transformer architecture, and we design its training and pre-training schemes using the logit-matching technique. We also introduce and discuss scenarios for obtaining datasets usable in pre-training, as well as proposals for extending the architecture’s capabilities by including inputs. In our numerical experiment, we address the problem of searching for the ground state of the electronic structure Hamiltonians of several molecules: $\texttt{H}_{2}$, LiH, $\texttt{BeH}_{2}$, and $\texttt{N}_{2}$. We demonstrate that GPT-QE finds a quantum state with energy close to that of the ground state. The efficacy of the training scheme that utilizes logit-matching is confirmed through the average deviation from random circuit generation. Additionally, we validate the effectiveness of pre-training; our results show that, by utilizing pre-training, we can train the transformer successfully without operating a quantum device and significantly reduce the total number of quantum circuit runs.

Many research directions should be explored to fully enable the GPT-QE models to progress beyond proof of concept. First, it is essential to validate the performance of GPT-QE on an actual quantum device and analyze its robustness to noise. Additionally, experiments with larger molecules are required to assess the optimization behavior. Another consideration is how to effectively integrate GPT-QE with the VQE framework. A straightforward approach might involve using VQE as a post-processing step, but there are many opportunities for more integrated hybridization. As a particularly interesting existing method for hybridization, the ADAPT-VQE strategy is known to achieve high accuracy [30], though it necessitates numerous quantum circuit runs. Combining ADAPT-VQE with GPT-QE is another potential direction for future research.

Each component of GPT-QE is also open to updates and improvements. In our numerical experiment, we employ a chemically inspired operator pool, but, theoretically, any kind of operators could be included. The transformer’s design could be modified to generate a sequence of tokens and the parameters embedded in quantum gates. Such an extension would facilitate easier hybridization with VQE. To fully leverage the pre-training feature, exploring how to design suitable inputs for the transformer is crucial.

As mentioned earlier in the paper, applying and extending the GQE framework to problems beyond ground-state approximation is also feasible. For instance, if GQE can accept classical data inputs, it could be applied to supervised machine-learning problems. The critical question is determining which types of machine learning problems are best suited for the GQE framework. This inquiry requires careful study and identification of suitable problems.

The significance of pre-training underscores the need for data storage and sharing as common resources. We can take inspiration from the machine learning community, which shares multiple datasets for training and testing purposes. Through such efforts, we anticipate that practical quantum applications using the generative models will soon be achievable.

We hope that, in a parallel track, and a decade after some of us introduced the VQE, the community will embrace GQE and work together to enable many of the extensions briefly proposed above, helping to achieve the goal of near-term quantum utility.

Acknowledgements

This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231 using NERSC award NERSC DDR-ERCAP0027330. K.N. acknowledges the support of Grant-in-Aid for JSPS Research Fellow 22J01501. L.B.K. acknowledges support from the Carlsberg Foundation. A.A.-G. acknowledges support from the Canada 150 Research Chairs program and CIFAR as well as the generous support of Anders G. Frøseth. We are deeply grateful to the Defense Advanced Research Projects Agency (DARPA) for their generous support and funding of this project, under the grant number HR0011-23-3-0020.

References

  • [1] D. Bluvstein, S. J. Evered, A. A. Geim, S. H. Li, H. Zhou, T. Manovitz, S. Ebadi, M. Cain, M. Kalinowski, D. Hangleiter, et al., “Logical quantum processor based on reconfigurable atom arrays,” Nature, pp. 1–3, 2023.
  • [2] A. Peruzzo, J. McClean, P. Shadbolt, M.-H. Yung, X.-Q. Zhou, P. J. Love, A. Aspuru-Guzik, and J. L. O’Brien, “A variational eigenvalue solver on a photonic quantum processor,” Nature Communications, vol. 5, p. 4213, July 2014.
  • [3] M. Cerezo, A. Arrasmith, R. Babbush, S. C. Benjamin, S. Endo, K. Fujii, J. R. McClean, K. Mitarai, X. Yuan, L. Cincio, and P. J. Coles, “Variational quantum algorithms,” Nature Reviews Physics, vol. 3, pp. 625–644, Sept. 2021.
  • [4] K. Bharti, A. Cervera-Lierta, T. H. Kyaw, T. Haug, S. Alperin-Lea, A. Anand, M. Degroote, H. Heimonen, J. S. Kottmann, T. Menke, et al., “Noisy intermediate-scale quantum algorithms,” Reviews of Modern Physics, vol. 94, no. 1, p. 015004, 2022.
  • [5] A. B. Magann, S. E. Economou, and C. Arenz, “Randomized adaptive quantum state preparation,” Jan. 2023. arXiv:2301.04201 [quant-ph].
  • [6] S. Wang, E. Fontana, M. Cerezo, K. Sharma, A. Sone, L. Cincio, and P. J. Coles, “Noise-induced barren plateaus in variational quantum algorithms,” Nature communications, vol. 12, no. 1, p. 6961, 2021.
  • [7] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language Models are Unsupervised Multitask Learners,” 2019.
  • [8] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is All you Need,” in Advances in Neural Information Processing Systems (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds.), vol. 30, Curran Associates, Inc., 2017.
  • [9] H. Zhao, L. Jiang, J. Jia, P. H. S. Torr, and V. Koltun, “Point Transformer,”
  • [10] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” Oct. 2020.
  • [11] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. v. d. Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, J. W. Rae, O. Vinyals, and L. Sifre, “Training Compute-Optimal Large Language Models,” Mar. 2022. arXiv:2203.15556 [cs].
  • [12] G. Verdon, M. Broughton, J. R. McClean, K. J. Sung, R. Babbush, Z. Jiang, H. Neven, and M. Mohseni, “Learning to learn with quantum neural networks via classical neural networks,” arXiv preprint arXiv:1907.05415, 2019.
  • [13] D. Kim and E.-G. Moon, “Preparation of entangled many-body states with machine learning,” arXiv preprint arXiv:2307.14627, 2023.
  • [14] Y. Yang, Z. Zhang, A. Wang, X. Xu, X. Wang, and Y. Li, “Maximising quantum-computing expressive power through randomised circuits,” arXiv preprint arXiv:2312.01947, 2023.
  • [15] T. Garipov, P. Izmailov, D. Podoprikhin, D. P. Vetrov, and A. G. Wilson, “Loss surfaces, mode connectivity, and fast ensembling of dnns,” Advances in neural information processing systems, vol. 31, 2018.
  • [16] Z. Allen-Zhu, Y. Li, and Z. Song, “A convergence theory for deep learning via over-parameterization,” in International conference on machine learning, pp. 242–252, PMLR, 2019.
  • [17] Y. Zhou, J. Yang, H. Zhang, Y. Liang, and V. Tarokh, “Sgd converges to global minimum in deep learning via star-convex path,” arXiv preprint arXiv:1901.00451, 2019.
  • [18] A. Heifetz, ed., Quantum Mechanics in Drug Discovery, vol. 2114 of Methods in Molecular Biology. New York, NY: Springer US, 2020.
  • [19] Y.-h. Lam, Y. Abramov, R. S. Ananthula, J. M. Elward, L. R. Hilden, S. O. Nilsson Lill, P.-O. Norrby, A. Ramirez, E. C. Sherer, J. Mustakis, and G. J. Tanoury, “Applications of Quantum Chemistry in Pharmaceutical Process Development: Current State and Opportunities,” Organic Process Research & Development, vol. 24, pp. 1496–1507, Aug. 2020.
  • [20] G. Agarwal, H. A. Doan, L. A. Robertson, L. Zhang, and R. S. Assary, “Discovery of Energy Storage Molecular Materials Using Quantum Chemistry-Guided Multiobjective Bayesian Optimization,” Chemistry of Materials, vol. 33, pp. 8133–8144, Oct. 2021.
  • [21] V. Zaytsev, M. Groshev, I. Maltsev, A. Durova, and V. Shabaev, “Calculation of the moscovium ground-state energy by quantum algorithms,” arXiv preprint arXiv:2207.08255, 2022.
  • [22] B. T. Gard, L. Zhu, G. S. Barron, N. J. Mayhall, S. E. Economou, and E. Barnes, “Efficient symmetry-preserving state preparation circuits for the variational quantum eigensolver algorithm,” npj Quantum Information, vol. 6, no. 1, p. 10, 2020.
  • [23] K. M. Pelzer, L. Cheng, and L. A. Curtiss, “Effects of Functional Groups in Redox-Active Organic Molecules: A High-Throughput Screening Approach,” The Journal of Physical Chemistry C, vol. 121, pp. 237–245, Jan. 2017.
  • [24] C. d. l. Cruz, A. Molina, N. Patil, E. Ventosa, R. Marcilla, and A. Mavrandonakis, “New insights into phenazine-based organic redox flow batteries by using high-throughput DFT modelling,” Sustainable Energy & Fuels, vol. 4, pp. 5513–5521, Oct. 2020.
  • [25] S. Lee, J. Lee, H. Zhai, Y. Tong, A. M. Dalzell, A. Kumar, P. Helms, J. Gray, Z.-H. Cui, W. Liu, M. Kastoryano, R. Babbush, J. Preskill, D. R. Reichman, E. T. Campbell, E. F. Valeev, L. Lin, and G. K.-L. Chan, “Evaluating the evidence for exponential quantum advantage in ground-state quantum chemistry,” Nature Communications, vol. 14, p. 1952, Apr. 2023.
  • [26] J. Ceroni, T. F. Stetina, M. Kieferova, C. O. Marrero, J. M. Arrazola, and N. Wiebe, “Generating Approximate Ground States of Molecules Using Quantum Machine Learning,” Jan. 2023. arXiv:2210.05489 [quant-ph].
  • [27] Z. Liang, J. Cheng, R. Yang, H. Ren, Z. Song, D. Wu, X. Qian, T. Li, and Y. Shi, “Unleashing the potential of llms for quantum computing: A study in quantum architecture design,” arXiv preprint arXiv:2307.08191, 2023.
  • [28] M. Krenn, J. Landgraf, T. Foesel, and F. Marquardt, “Artificial intelligence and machine learning for quantum technologies,” Physical Review A, vol. 107, no. 1, p. 010101, 2023.
  • [29] T. Jaouni, S. Arlt, C. Ruiz-Gonzalez, E. Karimi, X. Gu, and M. Krenn, “Deep quantum graph dreaming: Deciphering neural network insights into quantum experiments,” arXiv preprint arXiv:2309.07056, 2023.
  • [30] H. R. Grimsley, S. E. Economou, E. Barnes, and N. J. Mayhall, “An adaptive variational algorithm for exact molecular simulations on a quantum computer,” Nature communications, vol. 10, no. 1, p. 3007, 2019.
  • [31] C. N. Self, K. E. Khosla, A. W. R. Smith, F. Sauvage, P. D. Haynes, J. Knolle, F. Mintert, and M. S. Kim, “Variational quantum algorithm with information sharing,” npj Quantum Information, vol. 7, pp. 1–7, July 2021.
  • [32] A. Cervera-Lierta, J. S. Kottmann, and A. Aspuru-Guzik, “Meta-variational quantum eigensolver: Learning energy profiles of parameterized hamiltonians for quantum simulation,” PRX Quantum, vol. 2, no. 2, p. 020329, 2021.
  • [33] Z. Qiao, A. S. Christensen, M. Welborn, F. R. Manby, A. Anandkumar, and T. F. Miller, “Informing geometric deep learning with electronic interactions to accelerate quantum chemistry,” Proceedings of the National Academy of Sciences, vol. 119, p. e2205221119, Aug. 2022.
  • [34] D. Khan, S. Heinen, and O. A. von Lilienfeld, “Kernel based quantum machine learning at record rate: Many-body distribution functionals as compact representations,” The Journal of Chemical Physics, vol. 159, p. 034106, July 2023.
  • [35] H. F. Trotter, “On the product of semi-groups of operators,” Proceedings of the American Mathematical Society, vol. 10, no. 4, pp. 545–551, 1959.
  • [36] The CUDA Quantum development team, “CUDA Quantum: An open-source programming model for heterogeneous quantum-classical workflows,” 2023. Version 0.5.0.
  • [37] Qiskit contributors, “Qiskit: An open-source framework for quantum computing,” 2023.

Appendix A The benchmark experiment

In Section 3.1, we show that GPT-QE successfully finds a low-energy state. However, to verify that the training works correctly, we need to compare the performance with the case where training is not involved. For this purpose, we perform a benchmark experiment that randomly samples 10 (for $\texttt{H}_{2}$) or 40 (for other molecules) tokens from $\mathcal{G}$ and constructs quantum circuits. In each trial, we generate the same number of sequences as those generated in GPT-QE. We conduct three trials.
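A minimal sketch of such a random baseline is given below. It assumes that tokens are drawn uniformly and independently from the pool, which the text does not specify; the pool size of 32 and the function name are hypothetical.

```python
import numpy as np

def sample_random_sequences(pool_size, n_tokens, n_sequences, seed=None):
    """Draw token sequences uniformly at random from an operator pool of the given size."""
    rng = np.random.default_rng(seed)
    # Each row is one circuit, encoded as n_tokens indices into the operator pool.
    return rng.integers(0, pool_size, size=(n_sequences, n_tokens))

# Example: 40 tokens per sequence and 500 steps x 50 samples, matching the
# number of sequences in one GPT-QE trial; the pool of 32 operators is hypothetical.
baseline = sample_random_sequences(pool_size=32, n_tokens=40, n_sequences=500 * 50, seed=0)
```

Each row would then be converted into a circuit and evaluated in the same way as a GPT-QE sample, so that both methods use the same number of circuit evaluations.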

In Figure 7, we provide the mean of the minimum energy found in each trial and its standard deviation for each bond length in each molecule. The benchmark result is shown by the gray cross mark and the other result is shown with the same symbol as in Fig. 4. This confirms that the training scheme successfully guides the transformer to find lower energy, especially for the $\texttt{N}_{2}$ molecule. This verifies the validity of the logit-matching technique.

Figure 7: The average training results with the electronic structure Hamiltonians: $\texttt{H}_{2}$ (top left), LiH (top right), $\texttt{BeH}_{2}$ (bottom left), and $\texttt{N}_{2}$ (bottom right) in the sto-3g basis. The results of GPT-QE are depicted as green points (gpt-qe). The Hartree-Fock energy is indicated by a gray dotted line (hf), and the exact full configuration interaction (CI) energy, calculated through diagonalization, is represented by a black line (exact). For comparison, we also include results from randomly sampling gates from $\mathcal{G}$, which are indicated by gray cross marks (benchmark). For each data point, we run three trials and show the mean value and its deviation. When the error bar is not shown, the standard deviation is within the size of the data point.