
Managing Classical Processing Requirements for Quantum Error Correction

Satvik Maurya, Swamit Tannu University of Wisconsin-Madison
Abstract.

Quantum Error Correction requires decoders to process syndromes generated by the error-correction circuits. These decoders must process syndromes faster than they are generated to prevent a backlog of undecoded syndromes that can exponentially increase the memory and time required to execute the program. This has resulted in the development of fast hardware decoders that accelerate decoding. Applications utilizing error-corrected quantum computers will require hundreds to thousands of logical qubits, and provisioning a hardware decoder for every logical qubit can be very costly. In this work, we present a framework to reduce the number of hardware decoders and navigate the compute-memory trade-offs without sacrificing the performance or reliability of program execution. Through workload-centric characterizations, we propose efficient decoder scheduling policies which can reduce the number of hardware decoders required to run a program by up to 10× while consuming less than 100 MB of memory.

1. Introduction

As quantum computing enters a phase of rapid scaling to enable Fault-Tolerant Quantum Computing (FTQC), the classical processing resources required to support Quantum Error Correcting (QEC) codes must be scaled proportionally. QEC codes generate a stream of syndromes by repeatedly measuring parity qubits every cycle, and a decoder algorithm running on the classical control computer processes the stream of syndrome bits to detect and correct errors. Recent demonstrations have shown how the Surface Code (Fowler2012, ) can be deployed experimentally to suppress logical error rates (Google2023, ), how neutral atoms can be used to realize up to 48 logical qubits (Bluvstein2023, ), and how four logical qubits could be created with thirty physical qubits to achieve an 800× reduction in the error rate (quantinuum2024, ). These demonstrations are precursors to complex systems with more logical qubits requiring significant classical processing resources to enable fault-tolerant architectures.

Building a universal fault-tolerant quantum computer requires support for both Clifford and non-Clifford gates. For the Surface Code, applying a non-Clifford T-gate requires decoding prior errors so that an appropriate correction can be applied (Fowler2012, ; Horsman2012, ; Litinski2019, ). The decoding cannot be deferred, so it must be performed in real-time. Moreover, there is an even broader constraint on decoding throughput – if syndromes are generated faster than they can be processed, computation can be slowed down exponentially due to the backlog problem (Terhal2015, ). Qubit technologies such as superconducting qubits have fast syndrome cycle times on the order of 1 μs (Google2023, ), which require decoder latencies to be smaller than the syndrome cycle time.

Applications that can benefit from FTQC will require hundreds to thousands of logical qubits to function (Blunt2024, ). Depending on the error-correcting code used, replicating decoders for every logical qubit in the system can become very expensive and intractable in terms of cost and complexity. To reduce this cost, fast, hardware-efficient decoders have been proposed which sacrifice some accuracy for speed and scalability by making approximations in the decoding process (Ravi2023, ; Riverlane2023, ; Vittal2023, ; Alavisamani2024, ). However, catering to hundreds to thousands of logical qubits with these specialized decoders will still result in complex and costly systems – in this work, we aim to show how the total number of decoders can be reduced without affecting the performance or reliability of the quantum computer, thus allowing for more scalable classical processing for QEC.

Figure 1. (a) The compute and memory trade-offs for the classical processing required to implement QEC – fewer decoders result in an exponential increase in the memory required to store undecoded syndromes, and we aim to avoid this exponential growth in memory while reducing the number of decoders; (b) Virtual decoders – a scheduling policy will allocate decoders to the logical qubits.

In this paper, we present VQD: Virtual Quantum Decoding, a framework that aims to provide the illusion that there are decoders for every logical qubit while using significantly fewer hardware decoders to enable scalable and efficient classical processing necessary for fault-tolerant quantum computers. As shown in Fig. 1(a), the objective of VQD is to reduce the classical computing and memory resources needed to execute quantum programs on a fault-tolerant quantum computer without causing a performance slowdown or an increase in the logical error rate. This is challenging because when we reduce the number of physical decoders, the memory required to store undecoded syndromes can grow exponentially if the syndrome generation and syndrome processing rates are not matched. More importantly, this exponential increase in memory due to syndrome backlog will lead to an exponential increase in the time it takes to process all the undecoded syndromes, thereby significantly increasing the execution time of the program (Terhal2015, ). Our experimental evaluations also show that if a logical qubit is not decoded for extended periods (which will occur if there are fewer decoders than qubits), the decoder latency can increase due to the accumulation of undecoded errors. This increase in the decoder latency can affect the application of non-Clifford states – if decoding is delayed for a logical qubit before applying a non-Clifford state, the application of the non-Clifford state could be delayed since the decoder might take more time than usual to decode all prior rounds.

Given the challenges in sharing decoder hardware and to understand how it can be enabled, we characterized representative FTQC workloads to understand the decoding requirements from a performance and reliability perspective. Our characterization using a lattice surgery compiler (watkins2023high, ) revealed that there is a limited amount of operational parallelism due to long sequences of T and H gates resulting from Clifford + T decomposition, which is necessary for universality. This is crucial, as non-Clifford gates are the reason real-time decoding is necessary for FTQC. Non-Clifford operations are the only operations where the decoding is in the critical path, and fortunately, they occur in a highly serialized manner. Therefore, a physical decoder per logical qubit is unnecessary and will lead to severe underutilization.

Armed with this insight, we propose a system architecture with significantly fewer physical decoders than the number of logical qubits. Furthermore, we design efficient decoder scheduling policies for such systems. Such a system can be visualized in Fig. 1(b). We propose a scheduling policy that minimizes the Longest Undecoded Sequence, termed the MLS policy. We compare it with the Round Robin (RR) and Most Frequently Decoded (MFD) policies – our evaluations show that the MLS policy can reduce the number of hardware decoders by up to 10× while ensuring that no logical qubit remains undecoded for a significantly long period.

We also propose a noise-adaptive scheduling policy that can prioritize decoding of logical qubits that incur a sharp increase in the physical error rate due to phenomena such as cosmic rays (McEwen2021, ) and leakage due to heating (Miao2023, ). This involves a simple detector that can schedule decoding for a logical qubit in case the syndromes for that logical qubit show a sudden increase in bit-flips. Next, we show how some decoding tasks can be offloaded to software to further improve the efficacy of decoder scheduling policies.

Balancing compute and memory is a classic architectural problem, and we use VQD to explore these trade-offs. Prior research on decoders and classical processing required for fault-tolerant quantum computers has focused on reducing the hardware cost of implementing decoders by making approximations in the decoding algorithm, sometimes at the cost of accuracy (Vittal2023, ; Riverlane2023, ; Alavisamani2024, ). With VQD, we show that even if individual decoders have a high hardware cost, the overall cost can be reduced significantly by virtualizing decoders.

2. Quantum Error Correction and Decoding

In this section, we cover high-level details of Quantum Error Correction and the role of decoders.

2.1. Quantum Error Correction

Quantum Error Correction (QEC) improves the reliability of a system by utilizing many physical qubits to encode a single logical qubit (Shor1995, ; Knill1997, ). Most QEC codes can be categorized as stabilizer codes (Terhal2015, ) – some promising stabilizer codes include quantum Low-Density Parity-Check (qLDPC) codes (Bravyi2024, ) and Surface Codes (Fowler2012, ). Owing to their relatively relaxed connectivity requirements that can be realized with hardware available today, we focus specifically on the Surface Code. Note that this work can be extended to other QEC codes apart from the Surface Code.

Figure 2. (a) A logical qubit (d = 3); (b) Syndrome generation and measurements; (c) A typical procedure for detecting errors by decoding syndromes.

A single rotated Surface Code patch of distance d = 3 is shown in Fig. 2(a) (Horsman2012, ). Black circles denote data qubits and each data qubit is connected to X and Z parity qubits. The Surface Code works by repeatedly measuring syndromes, which correspond to the measurements performed on all X and Z measure qubits in a patch after performing the sequence of gates shown in Fig. 2(b). By repeatedly measuring these syndromes, both bit-flip and phase-flip errors occurring within a patch can be detected by the decoder.

2.2. Decoding Errors

Fig. 2(c) shows the general procedure of how any generic QEC code works – syndromes are constantly being generated, which are then fed to a decoder. Syndromes contain information about which parity qubits have flipped in every round of syndrome measurements, and these flips allow the decoder to determine what errors on the data qubits caused them. Since errors can always be expressed as Pauli gates, they can be corrected in software without executing any physical operations on the logical qubits. This is achieved by updating the Pauli frames for all data qubits that make up that logical qubit, which adjusts the interpretation of future measurements by accounting for the error that was detected (Terhal2015, ). For the Surface Code, the decoding problem is commonly formulated as a Minimum Weight Perfect Matching (MWPM) problem, which leverages a graph representation of the syndrome measurements (Fowler2015, ; Wu2022, ).
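The Pauli-frame bookkeeping described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the class name `PauliFrame` and its methods are hypothetical, and the frame is assumed to be stored as one pending-X bit and one pending-Z bit per data qubit.

```python
class PauliFrame:
    """Hypothetical per-patch Pauli frame: one (x, z) bit pair per data qubit."""

    def __init__(self, num_data_qubits):
        self.x = [0] * num_data_qubits  # pending X corrections
        self.z = [0] * num_data_qubits  # pending Z corrections

    def apply_correction(self, qubit, pauli):
        """Record a decoder-proposed correction by XOR-ing it into the frame."""
        if pauli in ("X", "Y"):
            self.x[qubit] ^= 1
        if pauli in ("Z", "Y"):
            self.z[qubit] ^= 1

    def interpret_measurement(self, qubit, basis, raw_bit):
        """Reinterpret a raw outcome in light of the frame.

        A pending X flips Z-basis outcomes; a pending Z flips X-basis outcomes.
        """
        flip = self.x[qubit] if basis == "Z" else self.z[qubit]
        return raw_bit ^ flip
```

No physical gate is issued at any point: a correction only changes how later measurement outcomes on that data qubit are read, and applying the same correction twice cancels out.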

Figure 3. Consumption of a magic (T) state with LS.

2.3. Non-Clifford Gates

On a Surface Code error-corrected quantum computer, all Clifford gates can be performed reliably either in software or via logical operations performed via Braiding (Fowler2012, ) or Lattice Surgery (Horsman2012, ). However, non-Clifford gates such as the T gate cannot be applied in a fault-tolerant manner directly. This is because logical qubits initialized with the T gate will have an error probability equal to the underlying physical error rate of the system, p, thus making them impure (Litinski2019magic, ). However, multiple impure states can be used to distill fewer, purer logical qubits with a T state (known as a magic state |m⟩) – this process is known as magic state distillation (Bravyi2012, ; Gupta2024, ).

Fig. 3 shows how a T gate can be applied to a logical qubit P in a fault-tolerant manner by using Lattice Surgery (LS) (Litinski2019, ). The magic state |m⟩ is a purified T-state. Since a non-Clifford gate is being applied, all prior errors that affected P must be known before |m⟩ is applied to prevent errors from spreading (Terhal2015, ). Lattice Surgery can be used to perform a Z ⊗ Z operation on P and |m⟩ to apply the magic state (Litinski2019, ; Litinski2019magic, ). Once the Z ⊗ Z operation is performed, the decoding result of P prior to Lattice Surgery is combined with the decoding result of the Lattice Surgery multi-body measurement and the measurement of the patch containing the magic state |m⟩ to determine an appropriate Clifford correction. [Footnote 1: The auto-corrected π/8 gate in (Litinski2019, ; Litinski2019magic, ) uses an additional ancillary qubit that has not been shown.] This correction needs to be known before the next logical operation involving a non-Clifford gate.

Figure 4. (a) Exponential increase in the memory required to store undecoded syndromes for different numbers of available decoders for the wstate-60 benchmark – 200 decoders correspond to assigning a decoder to every logical qubit; (b) Number of T-gates required for different workloads; (c) Exponential increase in the decoder latency per round (in nanoseconds) as the number of rounds is increased from d to 20d – the increase is higher for larger code distances (p = 10^{-4}); (d) Slow increase in the logical error rate as the number of rounds of error correction increases for p = 10^{-3} and p = 10^{-4}; (e) Histogram of the number of concurrent critical decodes for different workloads with the EDPC layout – there is limited parallelism as far as critical decodes are concerned; (f) Number of FPGAs required for decoders to implement different workloads with (i) a single T-factory and (ii) an optimal number of T-factories.

2.4. Critical Decodes

For a logical qubit that is only executing Clifford gates, errors can be decoded at any point of time (even after the experiment has ended). This is because all Clifford corrections can be commuted to the end of the circuit, essentially allowing the syndromes to be post-processed rather than decoded in real-time (Terhal2015, ). However, universal fault-tolerant quantum computers require non-Clifford gates such as the T-gate – this makes real-time decoding a necessity since syndromes for a patch must be decoded before the application of a non-Clifford gate. We call decodes that must happen before the application of a non-Clifford gate critical decodes since all syndromes generated up to that point for that logical qubit must be decoded before computation can proceed.

3. Classical Processing Requirements

Having explained the relevant details about QEC and the role of decoders and critical decodes, we now cover the classical processing requirements for FTQC.

3.1. Syndrome Generation and Processing Rates

As discussed in Section 2, applying non-Clifford gates requires decoders to be up to date with the latest syndrome for a logical qubit before computation can proceed. As shown by Terhal (Terhal2015, , p. 20), if the rate at which syndromes are generated, r_gen, is faster than the rate at which they are processed, r_proc, the memory required for storing syndromes that are yet to be decoded increases exponentially (referred to as the backlog problem). This exponential increase in the memory required also leads to an exponential increase in the runtime of the workload. Fig. 4(a) shows the exponential increase in the memory required to store undecoded syndromes for the wstate-60 benchmark with the number of rounds for various numbers of available decoders. For larger and longer running workloads, the memory requirements will be much higher. By reducing the number of available decoders, r_proc is effectively reduced, leading to the exponential growth in memory. Decoding is necessary at runtime when consuming T-states, as discussed in Section 2, and Fig. 4(b) shows that it will be a frequent operation for most workloads.

r_proc must be consistently greater than r_gen to prevent an exponential increase in memory requirements.
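The exponential character of the backlog can be seen in a toy model. The sketch below is a simplification in the spirit of the backlog argument, not the model used for Fig. 4(a): it assumes that each critical decode must first clear every syndrome round generated while the previous critical decode was being served. The function name and parameters are illustrative.

```python
def backlog_wait_times(r_gen, r_proc, first_wait, num_critical):
    """Wait time before each successive critical decode in a toy backlog model.

    While the decoder spends `wait` seconds catching up, r_gen * wait new
    syndrome rounds arrive; clearing them takes (r_gen / r_proc) * wait.
    """
    ratio = r_gen / r_proc
    waits = []
    wait = first_wait
    for _ in range(num_critical):
        waits.append(wait)
        wait *= ratio  # the next wait is scaled by r_gen / r_proc
    return waits
```

Under these assumptions the wait before the k-th critical decode scales as (r_gen / r_proc)^k: it grows geometrically when r_gen > r_proc and shrinks when r_proc > r_gen.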

3.2. Why is Real-Time Decoding Needed?

This requirement for syndromes to be processed faster than they are generated has motivated research to build fast and accurate hardware decoders for the Surface Code (Smith2023, ; Ravi2023, ; Riverlane2023, ; Vittal2023, ; Alavisamani2024, ), especially for systems using superconducting qubit architectures due to their fast gate times. While the syndrome processing rates achieved by these decoders are far higher than typical syndrome generation rates achieved today (Google2023, ; Bluvstein2023, ), leaving syndromes undecoded for many successive rounds can be problematic. Fig. 4(c) shows how the decoder latency normalized to the number of rounds (latency per round) processed by the decoder can increase exponentially with the number of rounds of undecoded syndromes, especially for d = 7, 9 (circuit-level noise p = 10^{-4}). [Footnote 2: Note that the worst-case increase in decoding latencies for parallel window decoders (Skoric2023, ) in this scenario would be similar (best-case would be a linear slowdown while slowing down the logical clock).] The number of rounds was chosen as a multiple of d since it represents the shortest period required for executing a logical operation (Horsman2012, ). This slowdown arises because errors accumulate the longer a logical qubit remains undecoded, thus requiring more corrections to be performed. This can result in a significant slowdown, and hence a decrease in r_proc, leading to higher memory requirements to store undecoded syndromes. Note also that a slowdown in r_proc means that more rounds of error correction are required to complete the computation. Fig. 4(d) shows how the logical error rate grows slowly with the number of rounds for p = 10^{-3} and p = 10^{-4}. Code distances are selected to achieve a target logical error rate after the N rounds it takes to complete a program (Beverland2022, ; Blunt2024, ) – critical decodes delayed exponentially due to a slow r_proc will exacerbate the logical error rate, since more rounds would be needed to complete the computation.

Leaving syndromes undecoded for even tens of rounds can result in an exponential slowdown in the decoder processing rate.

3.3. Concurrency and Delayed Decoding

Critical decodes must be serviced during the execution of a program to avoid the increase in memory requirements and processing times discussed above. If there are many concurrent critical decodes that occur frequently during the execution of a program, an appropriate number of decoders will be needed to process these critical decodes. The level of concurrency depends entirely on the layout used to build the quantum computer – layouts such as the Compact (Litinski2019, , p. 7), Fast (Litinski2019, , p. 9), and the Edge-Disjoint Path (EDPC) (Beverland2022edpc, ) are some proposed layouts that can be used to build fault-tolerant quantum computers using the Surface Code. Layouts determine the number of physical qubits required – for example, the Compact layout requires the fewest physical qubits since there is a single routing lane at the cost of completely serializing operations. The Fast and EDPC layouts allow more concurrency at the cost of more physical qubits.

What is the average level of concurrency when executing a quantum program on a fault-tolerant quantum computer? Fig. 4(e) shows a histogram of the number of critical decodes for select workloads generated by the Lattice Surgery Compiler (watkins2023high, ). [Footnote 3: For the Surface Code, we assume every logical qubit requires two decoders – one each for the X and Z observables.] This histogram shows that the peak concurrency is attained very infrequently. This implies that most logical qubits function as memory qubits or execute Clifford gates more often than T-gates, and this can allow some qubits to not be decoded in real-time.

Quantum programs are serial in terms of T-gates applied – not every qubit always requires access to a fast decoder.

3.4. Goal: Make Classical Processing Efficient

Provisioning a hardware decoder for every logical qubit in the system can be resource intensive – Fig. 4(f) shows an estimate of the number of FPGAs required just for decoding when (i) a single distillation factory is used, and (ii) an optimal number of distillation factories are used for different workloads (~10% FPGA LUTs/decoder (Vittal2023, ; Riverlane2023, )). This estimation was done using the Azure QRE (Beverland2022, ) that uses the Fast layout. Note that the total hardware requirement will be significantly higher because of control and readout components. Having shown how the syndrome processing rate is crucial in ensuring that computation does not require excessive memory and time and how quantum programs are inherently serial in terms of critical decodes, the question we seek to answer in this work is:

How can we minimize the use of hardware decoders and lower classical processing costs without sacrificing the performance and reliability?
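To give a sense of scale for the FPGA estimate in Fig. 4(f), the arithmetic can be sketched as below. This is a back-of-the-envelope illustration, not the authors' estimation flow: it assumes LUTs are the only limiting resource, takes the ~10% LUTs per decoder figure quoted above, and assumes two decoders per logical qubit as in Footnote 3. The function and parameter names are hypothetical.

```python
def fpgas_for_decoders(num_logical_qubits, decoders_per_qubit=2,
                       lut_percent_per_decoder=10):
    """Rough FPGA count when every logical qubit gets dedicated decoders."""
    # Assumption: LUT usage is the only constraint on packing decoders.
    decoders_per_fpga = 100 // lut_percent_per_decoder
    total_decoders = num_logical_qubits * decoders_per_qubit
    # Integer ceiling division.
    return (total_decoders + decoders_per_fpga - 1) // decoders_per_fpga
```

For instance, under these assumptions 100 logical qubits need 200 decoders, which fit ten to a device and therefore occupy 20 FPGAs before any control or readout logic is counted.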

4. Virtual Quantum Decoders

We now show how the number of hardware decoders can be reduced to be less than the number of logical qubits in the system, and how decoding can be scheduled to prevent excessive accumulation of undecoded syndromes.

Figure 5. (Left) Decoders for every logical qubit; (Right) Time-division multiplexing of decoders between qubits.

4.1. Working with Fewer Hardware Decoders

Reducing the number of available decoders implies that qubits will share hardware resources, resulting in time-division multiplexing of decoder instances among logical qubits. Fig. 5 shows how, compared to a system with decoders for every logical qubit in the system (N qubits, N decoders), a system with fewer decoders (M decoders, N > M) will require resources to be shared over time.

Time-division multiplexing of hardware resources will require the following considerations: (i) If the number of critical decodes at a given time step exceeds the number of hardware decoders, the overflowing critical decodes will have to be deferred to the next available time step, and (ii) Qubits cannot be left undecoded for extended periods of time. For the first consideration, deferring critical decodes will increase serialization in the program – this offsets the benefits offered by the Fast and EDPC layouts. For the second consideration, leaving a qubit undecoded for too long will result in an exponential increase in the syndrome processing latency and memory required to store undecoded syndromes.

Since not all qubits will be involved in critical decodes at every time step, there will be some decoders available at a given point of time which will not be decoding a logical qubit involved in the consumption of a T-state. Allocating these free hardware decoders to logical qubits at every time step thus becomes a scheduling problem.

Before we discuss different decoder scheduling policies, it is important to understand the time granularity at which any scheduling policy will operate. Since logical operations (Clifford or non-Clifford) in a Surface Code error-corrected quantum computer will require at least d rounds before the next operation, we define a slice (watkins2023high, ) as the smallest time step between logical operations that a decoder scheduling policy can work on. Every slice consists of d rounds of syndrome measurements, thus making the scheduling policy agnostic of the actual code distance used.

Figure 6. (a) Illustration of the longest undecoded sequence – Q_1 has the longest undecoded sequence before the last decode; (b) MFD policy – undecoded qubits are sorted according to the number of critical decodes they are involved in after time slice t and M − C qubits are selected from this sorted list; (c) RR policy – the M − C qubits decoded in time slice t are not decoded in time slice t + 1; (d) MLS policy – undecoded qubits are sorted according to their undecoded sequence lengths and M − C qubits are selected from this sorted list.

4.2. Decoder Scheduling Policies

Static decoder scheduling refers to decoder scheduling that can be performed at compile time. Since most quantum programs do not have any control-flow instructions, scheduling can be performed statically. Scheduling decoders is similar to CPU scheduling performed by all operating systems today, where the number of processes is more than the number of available processor cores (Liu1973, ).

Longest Undecoded Sequence: To quantify the fairness of a decoder scheduling policy, we use ‘Longest Undecoded Sequence’, which measures how well the decoders are servicing all logical qubits. A large undecoded sequence length implies that a qubit has been left undecoded for a long time – increasing the memory consumed to store undecoded syndromes. Fig. 6(a) shows an example of determining the longest undecoded sequence length.
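As a concrete reading of this metric, the sketch below computes the longest undecoded sequence from a per-slice decode log. The data layout (a dictionary mapping each qubit to a list of booleans, one per slice, where `True` means the qubit was decoded in that slice) and the function name are assumptions made for illustration.

```python
def longest_undecoded_sequence(decoded_per_slice):
    """Return (qubit, length) of the longest run of consecutive undecoded slices."""
    worst_qubit, worst_run = None, 0
    for qubit, slices in decoded_per_slice.items():
        run = 0
        for decoded in slices:
            # A decode resets the run; an undecoded slice extends it.
            run = 0 if decoded else run + 1
            if run > worst_run:
                worst_qubit, worst_run = qubit, run
    return worst_qubit, worst_run
```

A schedule in which one qubit waits three slices while all others wait at most two has a longest undecoded sequence of three, attributed to that qubit.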

Consider an arbitrary time slice t in the execution of a quantum program. There are N logical qubits and M hardware decoders (N > M). All decoder scheduling policies will have two components: The first will assign the decoders necessary for all critical decodes C in the time slice t. The second will assign all the remaining M − C hardware decoders to the N − C qubits based on the scheduling policy used. We now discuss three decoder scheduling policies (all policies are illustrated in Fig. 6(b) – Fig. 6(d)):

4.2.1. Most Frequently Critically Decoded (MFD)

A logical qubit that consumes a significant number of T-states during the execution of a program would have a frequent requirement of critical decodes – leaving such a logical qubit undecoded for more than a few slices would make subsequent critical decodes take longer, thus slowing down computation. This motivates the MFD scheduling policy that prioritizes decoding of logical qubits that have numerous critical decodes in the future at any given time slice. The MFD policy will ensure that future critical decodes have a minimized number of undecoded syndromes for the qubits that have frequent critical decodes.

Caveats: Because the MFD policy prioritizes logical qubits with frequent critical decodes, it will likely starve other qubits of decoding, leading to longer undecoded sequences.
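One slice of the MFD policy could be sketched as follows. This is an illustrative reading of the description above, not the authors' code: the function name, the argument layout, and the tie-breaking rule (ties keep their order in `qubits`) are assumptions, and critical decodes that exceed the number of decoders are simply left for a later slice.

```python
def mfd_schedule(qubits, critical, future_critical, num_decoders):
    """Pick the qubits to decode in one slice under a sketch of the MFD policy.

    qubits: ordered list of logical qubit ids.
    critical: set of ids that have a critical decode in this slice.
    future_critical: id -> number of critical decodes after this slice.
    """
    # Component 1: critical decodes are served first (overflow is deferred).
    chosen = [q for q in qubits if q in critical][:num_decoders]
    spare = num_decoders - len(chosen)
    # Component 2: the remaining M - C decoders go to the qubits with the
    # most upcoming critical decodes. The sort is stable, so ties keep
    # their original order.
    others = [q for q in qubits if q not in critical]
    others.sort(key=lambda q: future_critical.get(q, 0), reverse=True)
    return chosen + others[:spare]
```

Qubits with no upcoming critical decodes always sort last here, which is exactly the starvation risk noted in the caveat.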

4.2.2. Round Robin (RR)

Derived from CPU scheduling policies used by operating systems, the RR policy does not prioritize any specific logical qubits – rather, it chooses M − C qubits in a round-robin manner in every time slice to ensure fairness for all qubits in the system.

Caveats: For regions of a program where there are many critical decodes, the RR policy could still starve some logical qubits since M − C will be much smaller, yielding a smaller window for decoders to be assigned. Since there is no prioritization, the RR policy will not be able to rectify this until the round-robin window reaches the qubits being starved.
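One slice of the RR policy could be sketched as below, again as an assumption-laden illustration rather than the authors' implementation. The round-robin window is modeled as an index into the qubit list that is carried from slice to slice; the function name and return convention are hypothetical.

```python
def rr_schedule(qubits, critical, num_decoders, start):
    """One slice of a round-robin sketch.

    `start` is the index in `qubits` where the round-robin window begins.
    Returns (chosen qubits, start index for the next slice).
    """
    if not qubits:
        return [], 0
    # Critical decodes are served first (overflow is deferred).
    chosen = [q for q in qubits if q in critical][:num_decoders]
    spare = num_decoders - len(chosen)
    n = len(qubits)
    index = start % n
    examined = 0
    # Walk the window, skipping qubits already served as critical decodes.
    while spare > 0 and examined < n:
        q = qubits[index]
        if q not in critical:
            chosen.append(q)
            spare -= 1
        index = (index + 1) % n
        examined += 1
    return chosen, index
```

When many decodes are critical, `spare` is small and the window advances by only a few qubits per slice, which illustrates how the starvation described in the caveat arises.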

4.2.3. Minimize Longest Undecoded Sequence (MLS)

The longest undecoded sequence length at any given time slice is an indicator of how well the decoder scheduling policy is servicing all qubits in the system. We use this as a motivator for the MLS policy, which tries to minimize the longest undecoded sequence at every time slice. The MLS policy works as follows: at any time slice t, qubits are sorted on the basis of their current undecoded sequence lengths. Then, M − C qubits with the largest undecoded sequence lengths are assigned hardware decoders.

Caveats: In cases where the number of logical qubits is far greater than the number of decoders (N ≫ M), the MLS policy will not be able to work effectively.
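The MLS selection step can be sketched as follows. The function name, the dictionary layout, and the tie-breaking rule (ties keep dictionary order) are assumptions for illustration, and the counter update assumes that every qubit not decoded in a slice accrues one more undecoded slice.

```python
def mls_schedule(undecoded_len, critical, num_decoders):
    """One slice of an MLS sketch.

    undecoded_len: qubit id -> slices since that qubit was last decoded.
    Returns (chosen qubits, updated undecoded lengths).
    """
    # Critical decodes are served first (overflow is deferred).
    chosen = [q for q in undecoded_len if q in critical][:num_decoders]
    spare = num_decoders - len(chosen)
    # The remaining M - C decoders go to the longest-waiting qubits.
    others = sorted((q for q in undecoded_len if q not in critical),
                    key=lambda q: undecoded_len[q], reverse=True)
    chosen += others[:spare]
    # Decoded qubits reset to zero; all others wait one more slice.
    updated = {q: 0 if q in chosen else length + 1
               for q, length in undecoded_len.items()}
    return chosen, updated
```

Feeding `updated` back in as `undecoded_len` for the next slice gives a simple loop for replaying a program's critical-decode trace.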

4.3. Noise-Adaptive Decoder Scheduling

While control-flow instructions would necessitate runtime scheduling of decoders, events such as cosmic rays (McEwen2021, ) and leakage (Miao2023, ) can result in a temporary burst of errors for some physical qubits in the lattice that can impact some logical qubits. Scheduling after a control-flow instruction can be performed using any static scheduling policies for the program after the control-flow instruction. However, since the static scheduling policies do not account for the error-rate, spikes in errors due to cosmic rays and leakage cannot be factored at runtime without hardware support.

Figure 7. (a) Increase in the average number of bit-flips in syndromes for different code distances as the physical error rate is increased; (b) An error burst results in increased bit-flips in the syndromes of patch P which can be detected and used to prioritize the decoding of P in the next time step; (c) Decoding can be offloaded to software provided there is enough time before a critical decode occurs.

Detecting a spike in the physical error rate will either require errors to be decoded or additional hardware modules to detect the spike. While error-correcting codes can tolerate temporary increases in the error-rate (Google2023, ), the increase in errors can result in longer decoding latencies since the decoding task becomes harder with more errors. If a logical qubit affected by these events is not scheduled for decoding immediately after the event, decoding it before applying a non-Clifford gate could take longer, thus delaying the operation and causing a slowdown. As shown in Fig. 7(a), an increase in the physical error rate results in a higher number of bit-flips (especially for larger code distances), which can be detected with simple components in the control hardware. Fig. 7(b) shows how additional flips can be detected and used to dynamically prioritize the decoding of an arbitrary patch P, which suffers from a temporary burst of errors. Note that the detection is different from decoding – we are merely predicting that there are more errors due to higher bit-flips in the syndromes.
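A detector of this kind could be as simple as counting how many syndrome bits change between consecutive rounds and comparing the average against a calibrated baseline. The sketch below is one hypothetical realization: the threshold `factor`, the baseline value, and the function names are assumptions, not parameters taken from the paper.

```python
def count_flips(rounds):
    """Total number of syndrome bits that change between consecutive rounds."""
    return sum(a != b
               for prev, curr in zip(rounds, rounds[1:])
               for a, b in zip(prev, curr))


def burst_detected(rounds, baseline_flips_per_round, factor=3.0):
    """Flag a patch whose average flips per round exceed factor x baseline."""
    transitions = len(rounds) - 1
    if transitions < 1:
        return False  # not enough rounds to compare
    return count_flips(rounds) / transitions > factor * baseline_flips_per_round
```

A patch flagged this way would be moved to the front of the queue in the next time slice. No matching or correction is computed, which keeps the check far cheaper than decoding.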

4.4. Offloading to Software Decoders

Software decoders are slow and also have a higher variance in decoding latencies (Delfosse2023, ). However, when scheduling decoding tasks for logical qubits, software decoders can be leveraged to further reduce the undecoded sequence lengths. As shown in Fig. 7(c), some syndromes for a logical qubit can be offloaded to software while the hardware decoders are busy elsewhere. To prevent scheduled hardware decoding from being delayed, a buffer (three slices in this example) must be used to ensure that the software offloading completes before the next hardware decode.
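The timing constraint can be written as a simple feasibility check. The following Python sketch is illustrative only: the function name and parameters are hypothetical, the buffer is treated as a safety margin added on top of the estimated software decoding time (an assumption), and the default of three slices of software time per slice of syndromes mirrors the pessimistic assumption used later in the evaluation.

```python
def can_offload(current_slice: int, next_hw_decode_slice: int,
                slices_to_offload: int, sw_slices_per_slice: int = 3,
                buffer_slices: int = 3) -> bool:
    """Return True if software decoding finishes, with margin, before the next hardware decode."""
    # Estimated slice at which the software decoder finishes the offloaded syndromes.
    sw_finish = current_slice + slices_to_offload * sw_slices_per_slice
    # Require the buffer to fit between software completion and the hardware decode.
    return sw_finish + buffer_slices <= next_hw_decode_slice
```

For example, offloading two slices of syndromes at slice 0 is accepted when the next hardware decode is at slice 10, and rejected when it is at slice 8.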

5. Decoding for Distillation Factories

The decoder scheduling policies in the previous sections catered only to the decoding of algorithmic logical qubits (data logical qubits, magic state storage, ancillary logical qubits required for Lattice Surgery). In this section, we discuss decoding for distillation factories.

5.1. Distillation Factories

Magic state distillation factories generate few low-error logical qubits with non-Clifford states from many high-error logical qubits. Distillation factories run for very short periods at a time – this allows for smaller code distances to be used for creating the logical qubits for distillation (Litinski2019magic, ). As shown in Fig. 8, the error probability of a magic state is low enough to be useful even with $d=7$ ($d$ refers to $d_X$ in (Litinski2019magic, )).

Figure 8. Output error probabilities of magic states after 15-to-1 distillation for different $d$ (Litinski2019magic, ) ($p=10^{-4}$).

5.2. Using Fast, Low-Footprint Decoders

Smaller code distances for distillation factories provide two main benefits: the number of physical qubits required is much lower and, more importantly, both hardware (Vittal2023, ; Smith2023, ; Delfosse2020, ; Caune2023, ) and software (Delfosse2023, ) decoders are faster and less complex. For example, LUT-based decoders (Das2022lilliput, ) have been shown to be effective up to $d=5$ without requiring significant hardware resources. Predecoders (Delfosse2020, ; Ravi2023, ; Smith2023, ) reduce the complexity and decoding effort required for lower code distances as well. Compared to algorithmic logical qubits, which require a large code distance to survive millions of error correction cycles (Beverland2022, ; Blunt2024, ), the decoding requirements of distillation factories are far more relaxed, which reduces the hardware resource requirements as well.

The decoding overhead of magic state distillation is significantly lower than that of algorithmic logical qubits – it can thus leverage lightweight decoders, considerably reducing the hardware cost for distillation factories.

6. Methodology

We now describe the methodology used to evaluate different decoder scheduling policies and to estimate the classical resources required for executing workloads on a Surface Code error-corrected quantum computer.

6.1. Compiler

We use the Lattice Surgery Compiler (LSC) (watkins2023high, ) to generate Intermediate Representations (IR) of workloads that can be executed on an error-corrected quantum computer using the Surface Code with Lattice Surgery. LSC can generate IR that denotes Lattice Surgery instructions from the QASM (qasm, ) representation of a workload. LSC handles mapping and routing based on the layout provided to the compiler. We configure LSC to use ‘wave’ scheduling, which maximizes the number of concurrent instructions executed in every time slice. LSC also uses Gridsynth (gridsynth, ) to deal with arbitrary rotations. However, since it is still under development, LSC has some limitations:

  • LSC is limited to multi-body measurements between only two logical qubits.

  • LSC abstracts away distillation factories; only magic state storage sites are considered.

  • LSC works for a limited set of layouts and is extremely slow for large workloads like shor.

6.2. Simulation Framework

Using the IR generated by LSC, we build a framework that can parse the IR and determine the critical decodes in every slice, generate a timeline of all operations, and assign decoders to all logical qubits depending on the scheduling policy. If the number of critical decodes in a particular slice is greater than the number of hardware decoders configured, decoder-resources can rewrite the IR to defer critical decodes to the next slice (potentially increasing the execution time of the program).
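The deferral step can be sketched in a few lines of Python. This is a simplified illustration, not the framework's implementation: the function name is hypothetical, critical decodes are represented as plain qubit labels per slice, and data dependencies between instructions are ignored, so deferred decodes only lengthen the program when they spill past its last slice.

```python
from __future__ import annotations

from collections import deque


def defer_critical_decodes(critical_per_slice: list[list[str]],
                           num_decoders: int) -> list[list[str]]:
    """Cap critical decodes per slice at num_decoders, pushing the excess to later slices."""
    if num_decoders < 1:
        raise ValueError("need at least one hardware decoder")
    pending: deque = deque()   # deferred decodes, oldest first
    schedule = []
    for decodes in critical_per_slice:
        pending.extend(decodes)
        take = min(num_decoders, len(pending))
        schedule.append([pending.popleft() for _ in range(take)])
    while pending:             # extra slices appended after the original program
        take = min(num_decoders, len(pending))
        schedule.append([pending.popleft() for _ in range(take)])
    return schedule
```

With three critical decodes in a single slice and one hardware decoder, the sketch returns three slices, which corresponds to the potential increase in execution time mentioned above.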

Layouts: We use three layouts for our evaluations – Fast and Compact layouts (Litinski2019, ), and the EDPC layout (Beverland2022edpc, ). The Compact layout uses the fewest logical qubits and, due to a single routing lane, allows only one magic state to be consumed per time slice – we thus use it only to compare total execution times with the Fast and EDPC layouts.

Benchmarks: We use benchmarks from MQT Bench (quetschlich2023mqtbench, ) and QASMBench (Li2023, ). We use shor-15, a chemistry workload gndstate-14, a NISQ workload qaoa-14 (used for its arbitrary rotations and similarity to chemistry workloads), random, wstate, and arithmetic workloads adder-28, multiplier-45, and the Quantum Fourier Transform qft-20 – which can be used as building blocks for other algorithms.

Other Software: Stim (Gidney2021, ) was used for simulating stabilizer circuits to generate syndromes and error rates. Azure QRE (Beverland2022, ) was used for resource estimations.

7. Evaluations

In this section, we present some results for different scheduling policies and savings in decoder hardware.

Figure 9. (a) Max. concurrent critical decodes for all layouts and workloads; (b) Total execution time (in code cycles) needed to execute all workloads with different layouts; (c) Estimated number of logical qubits for all workloads (code distances estimated by Azure QRE (Beverland2022, ) also annotated).

7.1. Research Questions

We aim to answer the following questions:

  1. How many decoders can we virtualize in the system without affecting performance?

  2. How long do qubits go undecoded when using different scheduling policies?

  3. How do scheduling policies affect memory usage when using virtualized decoders?

7.2. Baseline Statistics

We consider the baseline to have a decoder for every logical qubit in the system. Fig. 9(a) shows the maximum number of critical decodes that occur during the execution of different workloads for all selected layouts. The Compact layout has a maximum of two critical decodes per slice between two logical qubits. Fig. 9(b) shows the total time required to finish all workloads (shor-15 was stopped after 100,000$d$ code cycles) – the EDPC and Fast layouts are significantly faster than Compact, and the EDPC layout has a very slight advantage over Fast for most workloads and also uses fewer qubits (Fig. 9(c); only algorithmic logical qubits are considered). Due to these advantages, we consider only the EDPC layout for all further evaluations.

Decoder Latency: For all evaluations, we assume that the decoder latency is significantly smaller than the syndrome cycle time. This is a reasonable assumption, since most hardware decoders (Riverlane2023, ; Vittal2023, ; Alavisamani2024, ) have latencies far less than 1 μs. This assumption allows a decoder to process multiple slices worth of syndromes in a single slice. For example, for $d=11$, a slice will consist of 11 rounds corresponding to a duration of roughly 11 μs (1 μs per round (Google2023, )) – a decoder latency of $\sim$150 ns (Riverlane2023, ) can allow the decoder to process 10 slices worth of syndromes in a single slice.

7.3. Decoder Scheduling Efficacy

Figure 10. Hardware decoders used for different configurations (EDPC layout).

Figure 11. Longest undecoded sequence when using the (a) Max. Concurrency; (b) Midpoint configurations.

Fig. 10 shows the three decoder configurations selected for this work.

  • The All Qubits configuration denotes the baseline where all qubits have a decoder.

  • Max. Concurrency refers to the configuration where the number of hardware decoders in the system corresponds to the peak concurrent critical decodes for every workload shown in Fig. 9(a).

  • Midpoint refers to a configuration where the number of hardware decoders is the midpoint between the max. and min. concurrent critical decodes ($= \frac{\text{Max.} + \text{Min.}}{2}$).

The minimum number of concurrent critical decodes corresponds to two critical decodes between two logical qubits. All evaluations are for algorithmic logical qubits; logical qubits required for distillation are not considered. For the All Qubits configuration, the longest undecoded sequence length will be zero, since every logical qubit has an assigned decoder.
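The three configurations can be derived from a per-slice count of critical decodes, as in the following Python sketch. The function name is hypothetical, the minimum is passed in as a parameter (defaulting to the two critical decodes stated above), and rounding the midpoint up to a whole decoder is an assumption, since the text does not specify a rounding rule.

```python
from __future__ import annotations

import math


def decoder_configurations(num_logical_qubits: int, critical_per_slice: list[int],
                           min_concurrent: int = 2) -> dict[str, int]:
    """Number of hardware decoders for each configuration (illustrative only)."""
    max_concurrent = max(critical_per_slice)  # peak concurrent critical decodes
    return {
        "All Qubits": num_logical_qubits,     # baseline: one decoder per logical qubit
        "Max. Concurrency": max_concurrent,
        # Midpoint = (Max. + Min.) / 2, rounded up (rounding is an assumption).
        "Midpoint": math.ceil((max_concurrent + min_concurrent) / 2),
    }
```

For a made-up workload with 100 logical qubits and a peak of 9 concurrent critical decodes, this gives 100, 9, and 6 decoders respectively.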

7.3.1. Longest Undecoded Sequences

To evaluate the performance of the decoder scheduling policies described in Section 4, we determine the longest undecoded sequence lengths for all workloads when using the Max. Concurrency and Midpoint configurations. Since these configurations use far fewer hardware decoders than qubits, the longest undecoded sequence is a good measure of whether qubits are being starved of decoding. Fig. 11 shows the longest undecoded sequence lengths for the (a) Max. Concurrency and (b) Midpoint configurations. The MFD policy leads to qubits being starved of decoding, since it prioritizes qubits that have frequent critical decodes. For the Max. Concurrency configuration, the gndstate and qaoa workloads do relatively well with the MFD policy, signifying that almost all logical qubits have a similar number of critical decodes, which leads to fairer scheduling. While the RR policy performs significantly better than the MFD policy, the MLS policy consistently performs better than both – MLS reduces the longest undecoded sequence lengths for almost every workload to $\sim$10 slices for both the Max. Concurrency and Midpoint configurations.
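To make the metric concrete, the Python sketch below tracks undecoded sequence lengths under a simple longest-backlog-first heuristic. It does not reproduce the MFD, RR, or MLS policies defined in Section 4; the function names and the tie-breaking rule are hypothetical, and the heuristic only illustrates how serving the longest backlog first keeps the longest undecoded sequence short.

```python
from __future__ import annotations


def schedule_longest_first(undecoded: dict[str, int], critical: set[str],
                           num_decoders: int) -> list[str]:
    """Pick qubits to decode this slice: critical decodes first, then longest backlog."""
    # Sort key: critical qubits first, then descending backlog, then name for determinism.
    order = sorted(undecoded, key=lambda q: (q not in critical, -undecoded[q], q))
    return order[:num_decoders]


def step(undecoded: dict[str, int], chosen: list[str]) -> dict[str, int]:
    """Advance one slice: decoded qubits reset to 0, all others accumulate one more slice."""
    return {q: 0 if q in chosen else n + 1 for q, n in undecoded.items()}
```

The longest undecoded sequence for a run is then the maximum backlog value observed across all slices.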

Figure 12. Peak memory required for storing undecoded syndromes for different scheduling policies when using the (a) Max. Concurrency; (b) Midpoint configurations.

7.3.2. Memory Usage

The reduction in the longest undecoded sequence lengths also corresponds to lower memory usage for storing undecoded syndromes. This is crucial since reducing the number of hardware decoders will require more memory to store syndromes for qubits that have not been decoded. Fig. 12 shows the memory required for different workloads with the (a) Max. Concurrency and (b) Midpoint configurations (the code distances used to determine the memory requirements were estimated with Azure QRE). Due to longer undecoded sequences, the MFD policy can require up to 100 GB of memory for some workloads, while the MLS policy rarely requires more than 100 MB of memory, which is orders of magnitude better than the MFD policy and 2-4x better than the RR policy.
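The link between undecoded sequence length and memory can be illustrated with a back-of-the-envelope estimate. The Python sketch below is an assumption, not the accounting used for Fig. 12: it assumes a rotated surface code patch with $d^2 - 1$ syndrome bits per round, $d$ rounds per slice (as in the $d=11$ example above), and one stored bit per syndrome measurement with no compression or metadata.

```python
def undecoded_syndrome_bytes(distance: int, undecoded_slices: int) -> float:
    """Bytes needed to buffer the undecoded syndromes of one logical qubit (rough estimate)."""
    bits_per_round = distance * distance - 1  # assumed: one bit per stabilizer, rotated patch
    rounds = distance * undecoded_slices      # d rounds per slice
    return bits_per_round * rounds / 8
```

Under these assumptions, a $d=11$ logical qubit left undecoded for 10 slices needs 1650 bytes; summing this over all logical qubits at the slice where the backlog peaks gives a peak-memory estimate.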

Figure 13. Total slices required by workloads with different decoder configurations normalized to the slices required by the baseline (MLS policy).

7.3.3. Slowdown due to Fewer Decoders

Fewer hardware decoders imply that some critical decodes in an arbitrary slice have to be deferred to subsequent slices, resulting in potentially more slices for completing the program. Fig. 13 shows the number of slices required for all workloads normalized with respect to the baseline number of slices shown in Fig. 9(b) for the Min. Concurrency, Max. Concurrency, and Midpoint configurations. Since the Min. Concurrency configuration allocates only four decoders for two critical decodes per slice, some workloads are slowed down by >10%. The Midpoint configuration, however, does not cause any slowdown except in the gndstate workload, and there is no slowdown for the Max. Concurrency configuration.

Figure 14. Increase in the LER (lower is better) with the Midpoint configuration.

7.3.4. Impact on Logical Error Rate

As discussed in Section 3, delaying decoding does not affect the logical error rate (LER) by itself. However, leaving a qubit undecoded for a long time before a critical decode can slow the decoder down, thus requiring additional rounds to complete the computation. In this evaluation, we would thus like to show how the longest undecoded sequence length can impact the LER. Fig. 14 shows how the MFD policy can increase the final LER of the computation due to longer undecoded sequence lengths. The RR policy increases the LER slightly for adder-28 and random-40, and the MLS policy does not incur any degradation in the LER. Note that this estimation is optimistic since we assumed the number of rounds increases linearly with the undecoded sequence length – in reality, it could be worse since the decoder latency can increase exponentially with the undecoded sequence length.

Figure 15. Reduction in the peak memory usage when using the Midpoint configuration with software offloading.

7.3.5. Software Offloading

The longest undecoded sequences (and consequently the memory requirements) can be reduced further by leveraging software decoders. The only constraint while doing so is that, due to longer software decoding latencies, critical decodes should not be delayed because prior software decodes for a logical qubit have not yet finished. For evaluating the effect of software decoding, we set the number of hardware decoders to the Midpoint configuration and make a pessimistic assumption that a single slice worth of syndromes takes three slices (about $3 \times d$ microseconds) to be decoded in software (in reality, it could be much lower with optimized software decoders (Higott2023, )). Fig. 15 shows the reduction in the peak memory usage for all scheduling policies when software offloading is performed – software offloading can achieve a reduction of up to 3x.

7.4. Discussion

The results in the previous sections show that VQD reduces the number of hardware decoders for algorithmic logical qubits by nearly one order of magnitude for most workloads with the Midpoint configuration, which, when combined with the MLS scheduling policy, results in significantly reduced memory requirements and low undecoded sequence lengths. Memory requirements and undecoded sequence lengths can be further reduced by offloading some non-critical decodes to software.

Why is the 100 GB → 100 MB reduction important?: Compared to the cost of building FTQC systems, 100 GB of memory is immaterial. However, it is worth noting that the benchmarks used in this paper are quite small – real applications will run for far longer and use far more logical qubits, thus potentially requiring orders of magnitude more memory.

Applicability to other codes: While our evaluations have focused on the Surface Code, other error-correcting codes such as quantum LDPC codes (Bravyi2024, ) also require decoding with a significantly higher complexity than Surface Code. qLDPC codes use Belief Propagation for decoding errors (Old2023, ) along with Ordered Statistics Decoding (OSD) (Roffe2020, ; Fuentes2021, ) to enable accurate decoding. These algorithms are complex and highly resource intensive (Gong2024, ). Building fast, accurate decoders for such codes will likely require significant hardware resources. Our work enables amortizing the cost of expensive decoders via virtualization to build efficient and scalable quantum memory using qLDPC codes.

Better Capacity Planning: We envision large-scale quantum computers will be closely integrated with HPC-style systems, where scientific applications can leverage quantum subroutines using QPUs (humbleQPU, ). In this setting, non-critical software decoders can run on traditional HPC platforms to alleviate the pressure on hardware decoders. Moreover, the virtualization of decoders can help us harness shot-level parallelism – all quantum programs, even on FTQC, must be executed multiple times. We can concurrently run the copies of quantum programs on multiple QPUs. However, quantum resources increase linearly for running $k$ copies concurrently. Our work, VQD, shows that with decoder virtualization, we can enable effective sharing of classical resources, dramatically reducing overall costs and improving resource utilization.

8. Related Work

This is one of the first works to perform a workload-oriented study of the classical processing requirements and system-level scheduling policies for error-corrected quantum computers. Prior to this work, Bombín et al. (Bombin2023, ) introduced modular decoding, which is the closest work and divides the global decoding task into sub-tasks without sacrificing decoder accuracy. However, modular decoding and other works such as parallelized window decoding (Skoric2023, ; Tan2022, ) always assume a decoder for every logical qubit. Our work shows that not all qubits require access to fast decoders at all times, thus allowing decoders to be virtualized. Other works that are broadly connected to this work are summarized below.

System-level Studies: Delfosse et al. (Delfosse2023, ) studied the speed vs. accuracy tradeoff for decoders used in FTQC. XQSim (Byun2022, ) is a full-system FTQC simulator. Stein et al. (Stein2023, ) proposed a heterogeneous architecture for FTQC; virtual logical qubits were proposed in (Baker2021, ); Lin et al. (Lin2023, ) explored modular architectures for error-correcting codes; and scheduling for distillation factories was proposed in (Ding2018, ). Kim et al. (Kim2024, ) described a blueprint of a fault-tolerant quantum computer.

Decoder Designs: Neural network based decoders (googleRNN, ; ueno2022, ; Overwater2022, ; Meinerz2022, ; Gicev2023, ; Varsamopoulos2017, ), LUT-based decoders (Das2022lilliput, ; Tomita2014, ), decoders based on the union-find algorithm (Das2022afs, ; Riverlane2023, ), and optimized MWPM decoders (Vittal2023, ; Alavisamani2024, ) have been proposed. Neural network decoders are generally far slower and therefore not ideal for fast qubit technologies such as superconducting qubits. Other predecoders (Delfosse2020, ; Smith2023, ) and partial decoders (Caune2023, ) have also been proposed. Decoders based on superconducting logic (Ueno2021, ; Ravi2023, ) target cryogenic implementations.

9. Conclusions

Scaling quantum computers to enable Quantum Error Correction will require specialized hardware for decoding errors. Prior work has focused on reducing the hardware resources required to build decoders. In this work, we take a full-system view and show that with the right decoder scheduling policy, it is not necessary for an error-corrected quantum computer to provide every logical qubit with a dedicated hardware decoder. The MLS policy enables the reduction of hardware decoders by up to 10x while requiring $\sim$100 MB or less of memory for storing undecoded syndromes without increasing the program execution time or the target logical error rate. The efficacy of the MLS policy is enhanced with software offloading of some decoding tasks. We also propose a noise-adaptive scheduling mechanism that can prioritize the decoding of logical qubits that incur a temporary increase in the physical error rate.

References

  • [1] Rajeev Acharya, Igor Aleiner, Richard Allen, Trond I. Andersen, Markus Ansmann, Frank Arute, Kunal Arya, Abraham Asfaw, Juan Atalaya, Ryan Babbush, Dave Bacon, Joseph C. Bardin, Joao Basso, Andreas Bengtsson, Sergio Boixo, Gina Bortoli, Alexandre Bourassa, Jenna Bovaird, Leon Brill, Michael Broughton, Bob B. Buckley, David A. Buell, Tim Burger, Brian Burkett, Nicholas Bushnell, Yu Chen, Zijun Chen, Ben Chiaro, Josh Cogan, Roberto Collins, Paul Conner, William Courtney, Alexander L. Crook, Ben Curtin, Dripto M. Debroy, Alexander Del Toro Barba, Sean Demura, Andrew Dunsworth, Daniel Eppens, Catherine Erickson, Lara Faoro, Edward Farhi, Reza Fatemi, Leslie Flores Burgos, Ebrahim Forati, Austin G. Fowler, Brooks Foxen, William Giang, Craig Gidney, Dar Gilboa, Marissa Giustina, Alejandro Grajales Dau, Jonathan A. Gross, Steve Habegger, Michael C. Hamilton, Matthew P. Harrigan, Sean D. Harrington, Oscar Higgott, Jeremy Hilton, Markus Hoffmann, Sabrina Hong, Trent Huang, Ashley Huff, William J. Huggins, Lev B. Ioffe, Sergei V. Isakov, Justin Iveland, Evan Jeffrey, Zhang Jiang, Cody Jones, Pavol Juhas, Dvir Kafri, Kostyantyn Kechedzhi, Julian Kelly, Tanuj Khattar, Mostafa Khezri, Mária Kieferová, Seon Kim, Alexei Kitaev, Paul V. Klimov, Andrey R. Klots, Alexander N. Korotkov, Fedor Kostritsa, John Mark Kreikebaum, David Landhuis, Pavel Laptev, Kim-Ming Lau, Lily Laws, Joonho Lee, Kenny Lee, Brian J. Lester, Alexander Lill, Wayne Liu, Aditya Locharla, Erik Lucero, Fionn D. Malone, Jeffrey Marshall, Orion Martin, Jarrod R. McClean, Trevor McCourt, Matt McEwen, Anthony Megrant, Bernardo Meurer Costa, Xiao Mi, Kevin C. Miao, Masoud Mohseni, Shirin Montazeri, Alexis Morvan, Emily Mount, Wojciech Mruczkiewicz, Ofer Naaman, Matthew Neeley, Charles Neill, Ani Nersisyan, Hartmut Neven, Michael Newman, Jiun How Ng, Anthony Nguyen, Murray Nguyen, Murphy Yuezhen Niu, Thomas E. O’Brien, Alex Opremcak, John Platt, Andre Petukhov, Rebecca Potter, Leonid P. Pryadko, Chris Quintana, Pedram Roushan, Nicholas C. Rubin, Negar Saei, Daniel Sank, Kannan Sankaragomathi, Kevin J. Satzinger, Henry F. Schurkus, Christopher Schuster, Michael J. Shearn, Aaron Shorter, Vladimir Shvarts, Jindra Skruzny, Vadim Smelyanskiy, W. Clarke Smith, George Sterling, Doug Strain, Marco Szalay, Alfredo Torres, Guifre Vidal, Benjamin Villalonga, Catherine Vollgraff Heidweiller, Theodore White, Cheng Xing, Z. Jamie Yao, Ping Yeh, Juhwan Yoo, Grayson Young, Adam Zalcman, Yaxing Zhang, and Ningfeng Zhu. Suppressing quantum errors by scaling a surface code logical qubit. Nature, 614(7949):676–681, February 2023.
  • [2] Narges Alavisamani, Suhas Vittal, Ramin Ayanzadeh, Poulami Das, and Moinuddin Qureshi. Promatch: Extending the reach of real-time quantum error correction with adaptive predecoding, 2024.
  • [3] Jonathan M. Baker, Casey Duckering, David I. Schuster, and Frederic T. Chong. Virtual logical qubits: A compact architecture for fault-tolerant quantum computing. IEEE Micro, 41(3):95–101, May 2021.
  • [4] Ben Barber, Kenton M. Barnes, Tomasz Bialas, Okan Buğdaycı, Earl T. Campbell, Neil I. Gillespie, Kauser Johar, Ram Rajan, Adam W. Richardson, Luka Skoric, Canberk Topal, Mark L. Turner, and Abbas B. Ziad. A real-time, scalable, fast and highly resource efficient decoder for a quantum computer, 2023.
  • [5] Johannes Bausch, Andrew W Senior, Francisco J H Heras, Thomas Edlich, Alex Davies, Michael Newman, Cody Jones, Kevin Satzinger, Murphy Yuezhen Niu, Sam Blackwell, George Holland, Dvir Kafri, Juan Atalaya, Craig Gidney, Demis Hassabis, Sergio Boixo, Hartmut Neven, and Pushmeet Kohli. Learning to decode the surface code with a recurrent, transformer-based neural network, 2023.
  • [6] Michael Beverland, Vadym Kliuchnikov, and Eddie Schoute. Surface code compilation via edge-disjoint paths. PRX Quantum, 3(2), May 2022.
  • [7] Michael E. Beverland, Prakash Murali, Matthias Troyer, Krysta M. Svore, Torsten Hoefler, Vadym Kliuchnikov, Guang Hao Low, Mathias Soeken, Aarthi Sundaram, and Alexander Vaschillo. Assessing requirements to scale to practical quantum advantage, 2022.
  • [8] Nick S. Blunt, György P. Gehér, and Alexandra E. Moylett. Compilation of a simple chemistry application to quantum error correction primitives. Physical Review Research, 6(1), March 2024.
  • [9] Dolev Bluvstein, Simon J. Evered, Alexandra A. Geim, Sophie H. Li, Hengyun Zhou, Tom Manovitz, Sepehr Ebadi, Madelyn Cain, Marcin Kalinowski, Dominik Hangleiter, J. Pablo Bonilla Ataides, Nishad Maskara, Iris Cong, Xun Gao, Pedro Sales Rodriguez, Thomas Karolyshyn, Giulia Semeghini, Michael J. Gullans, Markus Greiner, Vladan Vuletić, and Mikhail D. Lukin. Logical quantum processor based on reconfigurable atom arrays. Nature, 626(7997):58–65, December 2023.
  • [10] Héctor Bombín, Chris Dawson, Ye-Hua Liu, Naomi Nickerson, Fernando Pastawski, and Sam Roberts. Modular decoding: parallelizable real-time decoding for quantum computers, 2023.
  • [11] Sergey Bravyi, Andrew W. Cross, Jay M. Gambetta, Dmitri Maslov, Patrick Rall, and Theodore J. Yoder. High-threshold and low-overhead fault-tolerant quantum memory. Nature, 627(8005):778–782, March 2024.
  • [12] Sergey Bravyi and Jeongwan Haah. Magic-state distillation with low overhead. Physical Review A, 86(5), November 2012.
  • [13] Keith A. Britt and Travis S. Humble. High-performance computing with quantum processing units. ACM Journal on Emerging Technologies in Computing Systems, 13(3):1–13, March 2017.
  • [14] Ilkwon Byun, Junpyo Kim, Dongmoon Min, Ikki Nagaoka, Kosuke Fukumitsu, Iori Ishikawa, Teruo Tanimoto, Masamitsu Tanaka, Koji Inoue, and Jangwoo Kim. Xqsim: modeling cross-technology control processors for 10+k qubit quantum computers. In Proceedings of the 49th Annual International Symposium on Computer Architecture, ISCA ’22. ACM, June 2022.
  • [15] Laura Caune, Brendan Reid, Joan Camps, and Earl Campbell. Belief propagation as a partial decoder, 2023.
  • [16] Andrew W. Cross, Lev S. Bishop, John A. Smolin, and Jay M. Gambetta. Open quantum assembly language, 2017.
  • [17] M. P. da Silva, C. Ryan-Anderson, J. M. Bello-Rivas, A. Chernoguzov, J. M. Dreiling, C. Foltz, F. Frachon, J. P. Gaebler, T. M. Gatterman, L. Grans-Samuelsson, D. Hayes, N. Hewitt, J. Johansen, D. Lucchetti, M. Mills, S. A. Moses, B. Neyenhuis, A. Paz, J. Pino, P. Siegfried, J. Strabley, A. Sundaram, D. Tom, S. J. Wernli, M. Zanner, R. P. Stutz, and K. M. Svore. Demonstration of logical qubits and repeated error correction with better-than-physical error rates, 2024.
  • [18] Poulami Das, Aditya Locharla, and Cody Jones. Lilliput: a lightweight low-latency lookup-table decoder for near-term quantum error correction. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’22. ACM, February 2022.
  • [19] Poulami Das, Christopher A. Pattison, Srilatha Manne, Douglas M. Carmean, Krysta M. Svore, Moinuddin Qureshi, and Nicolas Delfosse. Afs: Accurate, fast, and scalable error-decoding for fault-tolerant quantum computers. In 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, April 2022.
  • [20] Nicolas Delfosse. Hierarchical decoding to reduce hardware requirements for quantum computing, 2020.
  • [21] Nicolas Delfosse, Andres Paz, Alexander Vaschillo, and Krysta M. Svore. How to choose a decoder for a fault-tolerant quantum computer? the speed vs accuracy trade-off, 2023.
  • [22] Yongshan Ding, Adam Holmes, Ali Javadi-Abhari, Diana Franklin, Margaret Martonosi, and Frederic Chong. Magic-state functional units: Mapping and scheduling multi-level distillation circuits for fault-tolerant quantum architectures. In 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, October 2018.
  • [23] Austin G. Fowler. Minimum weight perfect matching of fault-tolerant topological quantum error correction in average o(1) parallel time. Quantum Info. Comput., 15(1–2):145–158, January 2015.
  • [24] Austin G. Fowler, Matteo Mariantoni, John M. Martinis, and Andrew N. Cleland. Surface codes: Towards practical large-scale quantum computation. Physical Review A, 86(3), September 2012.
  • [25] Patricio Fuentes, Josu Etxezarreta Martinez, Pedro M. Crespo, and Javier Garcia-Frias. Degeneracy and its impact on the decoding of sparse quantum codes. IEEE Access, 9:89093–89119, 2021.
  • [26] Spiro Gicev, Lloyd C. L. Hollenberg, and Muhammad Usman. A scalable and fast artificial neural network syndrome decoder for surface codes. Quantum, 7:1058, July 2023.
  • [27] Craig Gidney. Stim: a fast stabilizer circuit simulator. Quantum, 5:497, July 2021.
  • [28] Anqi Gong, Sebastian Cammerer, and Joseph M. Renes. Toward low-latency iterative decoding of qldpc codes under circuit-level noise, 2024.
  • [29] Riddhi S. Gupta, Neereja Sundaresan, Thomas Alexander, Christopher J. Wood, Seth T. Merkel, Michael B. Healy, Marius Hillenbrand, Tomas Jochym-O’Connor, James R. Wootton, Theodore J. Yoder, Andrew W. Cross, Maika Takita, and Benjamin J. Brown. Encoding a magic state with beyond break-even fidelity. Nature, 625(7994):259–263, January 2024.
  • [30] Oscar Higgott and Craig Gidney. Sparse blossom: correcting a million errors per core second with minimum-weight matching, 2023.
  • [31] Dominic Horsman, Austin G Fowler, Simon Devitt, and Rodney Van Meter. Surface code quantum computing by lattice surgery. New Journal of Physics, 14(12):123011, December 2012.
  • [32] Junpyo Kim, Dongmoon Min, Jungmin Cho, Hyeonseong Jeong, Ilkwon Byun, Junhyuk Choi, Juwon Hong, and Jangwoo Kim. A fault-tolerant million qubit-scale distributed quantum computer. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, April 2024.
  • [33] Emanuel Knill and Raymond Laflamme. Theory of quantum error-correcting codes. Physical Review A, 55(2):900–911, February 1997.
  • [34] Ang Li, Samuel Stein, Sriram Krishnamoorthy, and James Ang. Qasmbench: A low-level quantum benchmark suite for nisq evaluation and simulation. ACM Transactions on Quantum Computing, 4(2):1–26, February 2023.
  • [35] Sophia Fuhui Lin, Joshua Viszlai, Kaitlin N. Smith, Gokul Subramanian Ravi, Charles Yuan, Frederic T. Chong, and Benjamin J. Brown. Codesign of quantum error-correcting codes and modular chiplets in the presence of defects, 2023.
  • [36] Daniel Litinski. A game of surface codes: Large-scale quantum computing with lattice surgery. Quantum, 3:128, March 2019.
  • [37] Daniel Litinski. Magic state distillation: Not as costly as you think. Quantum, 3:205, December 2019.
  • [38] C. L. Liu and James W. Layland. Scheduling algorithms for multiprogramming in a hard-real-time environment. Journal of the ACM, 20(1):46–61, January 1973.
  • [39] Matt McEwen, Lara Faoro, Kunal Arya, Andrew Dunsworth, Trent Huang, Seon Kim, Brian Burkett, Austin Fowler, Frank Arute, Joseph C. Bardin, Andreas Bengtsson, Alexander Bilmes, Bob B. Buckley, Nicholas Bushnell, Zijun Chen, Roberto Collins, Sean Demura, Alan R. Derk, Catherine Erickson, Marissa Giustina, Sean D. Harrington, Sabrina Hong, Evan Jeffrey, Julian Kelly, Paul V. Klimov, Fedor Kostritsa, Pavel Laptev, Aditya Locharla, Xiao Mi, Kevin C. Miao, Shirin Montazeri, Josh Mutus, Ofer Naaman, Matthew Neeley, Charles Neill, Alex Opremcak, Chris Quintana, Nicholas Redd, Pedram Roushan, Daniel Sank, Kevin J. Satzinger, Vladimir Shvarts, Theodore White, Z. Jamie Yao, Ping Yeh, Juhwan Yoo, Yu Chen, Vadim Smelyanskiy, John M. Martinis, Hartmut Neven, Anthony Megrant, Lev Ioffe, and Rami Barends. Resolving catastrophic error bursts from cosmic rays in large arrays of superconducting qubits. Nature Physics, 18(1):107–111, December 2021.
  • [40] Kai Meinerz, Chae-Yeun Park, and Simon Trebst. Scalable neural decoder for topological surface codes. Physical Review Letters, 128(8), February 2022.
  • [41] Kevin C. Miao, Matt McEwen, Juan Atalaya, Dvir Kafri, Leonid P. Pryadko, Andreas Bengtsson, Alex Opremcak, Kevin J. Satzinger, Zijun Chen, Paul V. Klimov, Chris Quintana, Rajeev Acharya, Kyle Anderson, Markus Ansmann, Frank Arute, Kunal Arya, Abraham Asfaw, Joseph C. Bardin, Alexandre Bourassa, Jenna Bovaird, Leon Brill, Bob B. Buckley, David A. Buell, Tim Burger, Brian Burkett, Nicholas Bushnell, Juan Campero, Ben Chiaro, Roberto Collins, Paul Conner, Alexander L. Crook, Ben Curtin, Dripto M. Debroy, Sean Demura, Andrew Dunsworth, Catherine Erickson, Reza Fatemi, Vinicius S. Ferreira, Leslie Flores Burgos, Ebrahim Forati, Austin G. Fowler, Brooks Foxen, Gonzalo Garcia, William Giang, Craig Gidney, Marissa Giustina, Raja Gosula, Alejandro Grajales Dau, Jonathan A. Gross, Michael C. Hamilton, Sean D. Harrington, Paula Heu, Jeremy Hilton, Markus R. Hoffmann, Sabrina Hong, Trent Huang, Ashley Huff, Justin Iveland, Evan Jeffrey, Zhang Jiang, Cody Jones, Julian Kelly, Seon Kim, Fedor Kostritsa, John Mark Kreikebaum, David Landhuis, Pavel Laptev, Lily Laws, Kenny Lee, Brian J. Lester, Alexander T. Lill, Wayne Liu, Aditya Locharla, Erik Lucero, Steven Martin, Anthony Megrant, Xiao Mi, Shirin Montazeri, Alexis Morvan, Ofer Naaman, Matthew Neeley, Charles Neill, Ani Nersisyan, Michael Newman, Jiun How Ng, Anthony Nguyen, Murray Nguyen, Rebecca Potter, Charles Rocque, Pedram Roushan, Kannan Sankaragomathi, Henry F. Schurkus, Christopher Schuster, Michael J. Shearn, Aaron Shorter, Noah Shutty, Vladimir Shvarts, Jindra Skruzny, W. Clarke Smith, George Sterling, Marco Szalay, Douglas Thor, Alfredo Torres, Theodore White, Bryan W. K. Woo, Z. Jamie Yao, Ping Yeh, Juhwan Yoo, Grayson Young, Adam Zalcman, Ningfeng Zhu, Nicholas Zobrist, Hartmut Neven, Vadim Smelyanskiy, Andre Petukhov, Alexander N. Korotkov, Daniel Sank, and Yu Chen. Overcoming leakage in quantum error correction. Nature Physics, 19(12):1780–1786, October 2023.
  • [42] Josias Old and Manuel Rispler. Generalized belief propagation algorithms for decoding of surface codes. Quantum, 7:1037, June 2023.
  • [43] Ramon W. J. Overwater, Masoud Babaie, and Fabio Sebastiano. Neural-network decoders for quantum error correction using surface codes: A space exploration of the hardware cost-performance tradeoffs. IEEE Transactions on Quantum Engineering, 3:1–19, 2022.
  • [44] Nils Quetschlich, Lukas Burgholzer, and Robert Wille. MQT Bench: Benchmarking software and design automation tools for quantum computing. Quantum, 2023. MQT Bench is available at https://www.cda.cit.tum.de/mqtbench/.
  • [45] Gokul Subramanian Ravi, Jonathan M. Baker, Arash Fayyazi, Sophia Fuhui Lin, Ali Javadi-Abhari, Massoud Pedram, and Frederic T. Chong. Better than worst-case decoding for quantum error correction. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS ’23. ACM, January 2023.
  • [46] Joschka Roffe, David R. White, Simon Burton, and Earl Campbell. Decoding across the quantum low-density parity-check code landscape. Physical Review Research, 2(4), December 2020.
  • [47] Neil J. Ross and Peter Selinger. Optimal ancilla-free Clifford+T approximation of z-rotations, 2014.
  • [48] Peter W. Shor. Scheme for reducing decoherence in quantum computer memory. Physical Review A, 52(4):R2493–R2496, October 1995.
  • [49] Luka Skoric, Dan E. Browne, Kenton M. Barnes, Neil I. Gillespie, and Earl T. Campbell. Parallel window decoding enables scalable fault tolerant quantum computation. Nature Communications, 14(1), November 2023.
  • [50] Samuel C. Smith, Benjamin J. Brown, and Stephen D. Bartlett. Local predecoder to reduce the bandwidth and latency of quantum error correction. Physical Review Applied, 19(3), March 2023.
  • [51] Samuel Stein, Sara Sussman, Teague Tomesh, Charles Guinn, Esin Tureci, Sophia Fuhui Lin, Wei Tang, James Ang, Srivatsan Chakram, Ang Li, Margaret Martonosi, Fred Chong, Andrew A. Houck, Isaac L. Chuang, and Michael Demarco. HetArch: Heterogeneous microarchitectures for superconducting quantum systems. In 56th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO ’23. ACM, October 2023.
  • [52] Xinyu Tan, Fang Zhang, Rui Chao, Yaoyun Shi, and Jianxin Chen. Scalable surface code decoders with parallelization in time, 2022.
  • [53] Barbara M. Terhal. Quantum error correction for quantum memories. Reviews of Modern Physics, 87(2):307–346, April 2015.
  • [54] Yu Tomita and Krysta M. Svore. Low-distance surface codes under realistic quantum noise. Physical Review A, 90(6), December 2014.
  • [55] Yosuke Ueno, Masaaki Kondo, Masamitsu Tanaka, Yasunari Suzuki, and Yutaka Tabuchi. QECOOL: On-line quantum error correction with a superconducting decoder for surface code. In 2021 58th ACM/IEEE Design Automation Conference (DAC). IEEE, December 2021.
  • [56] Yosuke Ueno, Masaaki Kondo, Masamitsu Tanaka, Yasunari Suzuki, and Yutaka Tabuchi. NEO-QEC: Neural network enhanced online superconducting decoder for surface codes, 2022.
  • [57] Savvas Varsamopoulos, Ben Criger, and Koen Bertels. Decoding small surface codes with feedforward neural networks. Quantum Science and Technology, 3(1):015004, November 2017.
  • [58] Suhas Vittal, Poulami Das, and Moinuddin Qureshi. Astrea: Accurate quantum error-decoding via practical minimum-weight perfect-matching. In Proceedings of the 50th Annual International Symposium on Computer Architecture, ISCA ’23. ACM, June 2023.
  • [59] George Watkins, Hoang Minh Nguyen, Keelan Watkins, Steven Pearce, Hoi-Kwan Lau, and Alexandru Paler. A high performance compiler for very large scale surface code computations, 2023.
  • [60] Yue Wu, Namitha Liyanage, and Lin Zhong. An interpretation of union-find decoder on weighted graphs, 2022.