-
Graph Neural Networks for Parameterized Quantum Circuits Expressibility Estimation
Authors:
Shamminuj Aktar,
Andreas Bärtschi,
Diane Oyen,
Stephan Eidenbenz,
Abdel-Hameed A. Badawy
Abstract:
Parameterized quantum circuits (PQCs) are fundamental to quantum machine learning (QML), quantum optimization, and variational quantum algorithms (VQAs). The expressibility of PQCs is a measure that determines their capability to harness the full potential of the quantum state space. It is thus a crucial guidepost to know when selecting a particular PQC ansatz. However, the existing technique for expressibility computation through statistical estimation requires a large number of samples, which poses significant challenges due to time and computational resource constraints. This paper introduces a novel approach for expressibility estimation of PQCs using Graph Neural Networks (GNNs). We demonstrate the predictive power of our GNN model with a dataset consisting of 25,000 samples from the noiseless IBM QASM Simulator and 12,000 samples from three distinct noisy quantum backends. The model accurately estimates expressibility, with root mean square errors (RMSE) of 0.05 and 0.06 for the noiseless and noisy backends, respectively. We compare our model's predictions with reference circuits [Sim and others, QuTe'2019] and IBM Qiskit's hardware-efficient ansatz sets to further evaluate our model's performance. Our experimental evaluation in noiseless and noisy scenarios reveals a close alignment with ground truth expressibility values, highlighting the model's efficacy. Moreover, our model exhibits promising extrapolation capabilities, predicting expressibility values with low RMSE for out-of-range qubit circuits when trained solely on circuit sets of up to 5 qubits. This work thus provides a reliable means of efficiently evaluating the expressibility of diverse PQCs on noiseless simulators and hardware.
Submitted 13 May, 2024;
originally announced May 2024.
-
Guarantees on Warm-Started QAOA: Single-Round Approximation Ratios for 3-Regular MAXCUT and Higher-Round Scaling Limits
Authors:
Reuben Tate,
Stephan Eidenbenz
Abstract:
We generalize the technique behind Farhi et al.'s 0.6924-approximation result for the Max-Cut Quantum Approximate Optimization Algorithm (QAOA) on 3-regular graphs to obtain provable lower bounds on the approximation ratio for warm-started QAOA. Given an initialization angle $θ$, we consider warm-starts where the initial state is a product state in which each qubit is at angle $θ$ from either the north or south pole of the Bloch sphere; which of the two positions each qubit takes is decided by a classically obtained cut encoded as a bitstring $b$. We illustrate through plots how the properties of $b$ and the initialization angle $θ$ influence the bound on the approximation ratios of warm-started QAOA. We consider various classical algorithms and use the cuts they produce to generate the warm-start. Our results strongly suggest that there does not exist any choice of initialization angle that yields a (worst-case) approximation ratio that simultaneously beats standard QAOA and the classical algorithm used to create the warm-start.
Additionally, we show that at $θ=60^\circ$, warm-started QAOA is able to (effectively) recover the cut used to generate the warm-start, thus suggesting that in practice, this value could be a promising starting angle to explore alternate solutions in a heuristic fashion. Finally, for any combinatorial optimization problem with integer-valued objective values, we provide bounds on the required circuit depth needed for warm-started QAOA to achieve some change in approximation ratio; more specifically, we show that for small $θ$, the bound is roughly proportional to $1/θ$.
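The warm-start initial state described in this abstract can be illustrated with a short sketch. It is not taken from the paper: the function names are hypothetical, and the convention that bit 0 corresponds to the north pole (the state |0>) and bit 1 to the south pole is an assumption made here for illustration.

```python
from math import cos, sin, pi, radians


def warm_start_state(bits, theta_deg):
    """Return per-qubit amplitudes (amp0, amp1) of the warm-start product state.

    Assumed convention (not stated in the abstract): a qubit whose cut bit is
    '0' sits at polar angle theta from the north pole |0>, and a qubit whose
    cut bit is '1' sits at polar angle theta from the south pole |1>.
    """
    theta = radians(theta_deg)
    qubits = []
    for b in bits:
        polar = theta if b == "0" else pi - theta
        # A Bloch-sphere polar angle phi corresponds to cos(phi/2)|0> + sin(phi/2)|1>.
        qubits.append((cos(polar / 2), sin(polar / 2)))
    return qubits


def prob_of_bit(qubit, b):
    """Probability that measuring the qubit yields bit b."""
    amp0, amp1 = qubit
    return amp0 ** 2 if b == "0" else amp1 ** 2


cut = "0110"  # hypothetical classically obtained cut
state = warm_start_state(cut, 60)
print([round(prob_of_bit(q, b), 3) for q, b in zip(state, cut)])
```

Under this convention, measuring the initial state directly returns each bit of $b$ with probability $\cos^2(θ/2)$, so $θ=0$ reproduces the classical cut exactly and $θ=90^\circ$ gives the uniform superposition of standard QAOA.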
Submitted 19 February, 2024;
originally announced February 2024.
-
Trainability Barriers in Low-Depth QAOA Landscapes
Authors:
Joel Rajakumar,
John Golden,
Andreas Bärtschi,
Stephan Eidenbenz
Abstract:
The Quantum Alternating Operator Ansatz (QAOA) is a prominent variational quantum algorithm for solving combinatorial optimization problems. Its effectiveness depends on identifying input parameters that yield high-quality solutions. However, understanding the complexity of training QAOA remains an under-explored area. Previous results have given analytical performance guarantees for a small, fixed number of parameters. At the opposite end of the spectrum, barren plateaus are likely to emerge at $Ω(n)$ parameters for $n$ qubits. In this work, we study the difficulty of training in the intermediate regime, which is the focus of most current numerical studies and near-term hardware implementations. Through extensive numerical analysis of the quality and quantity of local minima, we argue that QAOA landscapes can exhibit a superpolynomial growth in the number of low-quality local minima even when the number of parameters scales logarithmically with $n$. This means that the common technique of gradient descent from randomly initialized parameters is doomed to fail beyond small $n$, and emphasizes the need for good initial guesses of the optimal parameters.
Submitted 15 February, 2024;
originally announced February 2024.
-
Hierarchical Multigrid Ansatz for Variational Quantum Algorithms
Authors:
Christo Meriwether Keller,
Stephan Eidenbenz,
Andreas Bärtschi,
Daniel O'Malley,
John Golden,
Satyajayant Misra
Abstract:
Quantum computing is an emerging topic in engineering that promises to enhance supercomputing using fundamental physics. In the near term, the best candidate algorithms for achieving this advantage are variational quantum algorithms (VQAs). We design and numerically evaluate a novel ansatz for VQAs, focusing in particular on the variational quantum eigensolver (VQE). As our ansatz is inspired by classical multigrid hierarchy methods, we call it the "multigrid" ansatz. The multigrid ansatz creates a parameterized quantum circuit for a quantum problem on $n$ qubits by successively building and optimizing circuits for smaller qubit counts $j < n$, reusing optimized parameter values as initial solutions for the next hierarchy level at $j+1$. We show through numerical simulation that the multigrid ansatz outperforms the standard hardware-efficient ansatz in terms of solution quality for the Laplacian eigensolver as well as for a large class of combinatorial optimization problems, with specific examples for MaxCut and Maximum $k$-Satisfiability. Our studies establish the multigrid ansatz as a viable candidate for many VQAs and in particular present a promising alternative to the QAOA approach for combinatorial optimization problems.
Submitted 22 December, 2023;
originally announced December 2023.
-
Scaling Whole-Chip QAOA for Higher-Order Ising Spin Glass Models on Heavy-Hex Graphs
Authors:
Elijah Pelofske,
Andreas Bärtschi,
Lukasz Cincio,
John Golden,
Stephan Eidenbenz
Abstract:
We show through numerical simulation that the Quantum Alternating Operator Ansatz (QAOA) for higher-order, random-coefficient, heavy-hex compatible spin glass Ising models has strong parameter concentration across problem sizes from $16$ up to $127$ qubits for $p=1$ up to $p=5$, which allows for straightforward transfer learning of QAOA angles on instance sizes where exhaustive grid-search is prohibitive even for $p>1$. We use Matrix Product State (MPS) simulation at different bond dimensions to obtain confidence in these results, and we obtain the optimal solutions to these combinatorial optimization problems using CPLEX. In order to assess the ability of current noisy quantum hardware to exploit such parameter concentration, we execute short-depth QAOA circuits (with a CNOT depth of 6 per $p$, resulting in circuits which contain $1420$ two-qubit gates for $127$-qubit $p=5$ QAOA) on $100$ higher-order (cubic term) Ising models on IBM quantum superconducting processors with $16, 27, 127$ qubits using QAOA angles learned from a single $16$-qubit instance. We show that (i) the best quantum processors generally find lower energy solutions up to $p=3$ for 27-qubit systems and up to $p=2$ for 127-qubit systems and are overcome by noise at higher values of $p$, (ii) the best quantum processors find mean energies that are about a factor of two off from the noise-free numerical simulation results. Additional insights from our experiments are that large performance differences exist among different quantum processors even of the same generation and that dynamical decoupling significantly improves performance for some quantum processors but decreases it for others. Lastly, we show $p=1$ QAOA angle mean energy landscapes computed using up to a $414$-qubit quantum computer, showing that the mean QAOA energy landscapes remain very similar as the problem size changes.
Submitted 1 December, 2023;
originally announced December 2023.
-
Provable bounds for noise-free expectation values computed from noisy samples
Authors:
Samantha V. Barron,
Daniel J. Egger,
Elijah Pelofske,
Andreas Bärtschi,
Stephan Eidenbenz,
Matthis Lehmkuehler,
Stefan Woerner
Abstract:
In this paper, we explore the impact of noise on quantum computing, particularly focusing on the challenges when sampling bit strings from noisy quantum computers as well as the implications for optimization and machine learning applications. We formally quantify the sampling overhead to extract good samples from noisy quantum computers and relate it to the layer fidelity, a metric to determine the performance of noisy quantum processors. Further, we show how this allows us to use the Conditional Value at Risk of noisy samples to determine provable bounds on noise-free expectation values. We discuss how to leverage these bounds for different algorithms and demonstrate our findings through experiments on a real quantum computer involving up to 127 qubits. The results show a strong alignment with theoretical predictions.
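As a rough illustration of the Conditional Value at Risk mentioned in this abstract, the sketch below computes the CVaR of a list of sampled objective values for a minimization problem. The function name, the sample values, and the choice of rounding the tail size up are assumptions made here; the paper's exact convention and its bound derivation are not given in the abstract.

```python
from math import ceil


def cvar(values, alpha):
    """Mean of the best (lowest) alpha-fraction of sampled objective values.

    Assumes a minimization problem and 0 < alpha <= 1. The tail size is
    rounded up so that at least one sample is always kept.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    k = max(1, ceil(alpha * len(values)))
    best = sorted(values)[:k]
    return sum(best) / k


# Hypothetical energies of bit strings sampled from a noisy device.
samples = [3.0, 1.0, 2.0, 4.0]
print(cvar(samples, 0.5))  # mean of the two lowest values
print(cvar(samples, 1.0))  # ordinary sample mean
```

With `alpha = 1.0` the CVaR reduces to the plain sample mean, while smaller `alpha` values keep only the best samples and discard the rest of the distribution.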
Submitted 1 December, 2023;
originally announced December 2023.
-
LLVM Static Analysis for Program Characterization and Memory Reuse Profile Estimation
Authors:
Atanu Barai,
Nandakishore Santhi,
Abdur Razzak,
Stephan Eidenbenz,
Abdel-Hameed A. Badawy
Abstract:
Profiling various application characteristics, including the number of different arithmetic operations performed, memory footprint, etc., dynamically is time- and space-consuming. On the other hand, static analysis methods, although fast, can be less accurate. This paper presents an LLVM-based probabilistic static analysis method that accurately predicts different program characteristics and estimates the reuse distance profile of a program by analyzing the LLVM IR file in constant time, regardless of program input size. We generate the basic-block-level control flow graph of the target application kernel and determine basic-block execution counts by solving the linear balance equation involving the adjacent basic blocks' transition probabilities. Finally, we represent the kernel memory accesses in a bracketed format and employ a recursive algorithm to calculate the reuse distance profile. The results show that our approach can predict application characteristics accurately compared to another LLVM-based dynamic code analysis tool, Byfl.
Submitted 20 November, 2023;
originally announced November 2023.
-
Probing Quantum Telecloning on Superconducting Quantum Processors
Authors:
Elijah Pelofske,
Andreas Bärtschi,
Stephan Eidenbenz,
Bryan Garcia,
Boris Kiefer
Abstract:
Quantum information cannot be perfectly cloned, but approximate copies of quantum information can be generated. Quantum telecloning combines approximate quantum cloning, more typically referred to as quantum cloning, and quantum teleportation. Quantum telecloning allows approximate copies of quantum information to be constructed by separate parties, using the classical results of a Bell measurement made on a prepared quantum telecloning state. Quantum telecloning can be implemented as a circuit on quantum computers using a classical co-processor to compute classical feed forward instructions using if statements based on the results of a mid-circuit Bell measurement in real time. We present universal, symmetric, optimal $1 \rightarrow M$ telecloning circuits, and experimentally demonstrate these quantum telecloning circuits for $M=2$ up to $M=10$, natively executed with real time classical control systems on IBM Quantum superconducting processors, known as dynamic circuits. We perform the cloning procedure on many different message states across the Bloch sphere, on $7$ IBM Quantum processors, optionally using the error suppression technique X-X sequence digital dynamical decoupling. Two circuit optimizations are utilized, one which removes ancilla qubits for $M=2, 3$, and one which reduces the total number of gates in the circuit but still uses ancilla qubits. Parallel single qubit tomography with MLE density matrix reconstruction is used in order to compute the mixed state density matrices of the clone qubits, and clone quality is measured using quantum fidelity. These results present one of the largest and most comprehensive NISQ computer experimental analyses on (single qubit) quantum telecloning to date. The clone fidelity sharply decreases to $0.5$ for $M > 5$, but for $M=2$ we are able to achieve a mean clone fidelity of up to $0.79$ using dynamical decoupling.
Submitted 8 May, 2024; v1 submitted 29 August, 2023;
originally announced August 2023.
-
Lower Bounds on Number of QAOA Rounds Required for Guaranteed Approximation Ratios
Authors:
Naphan Benchasattabuse,
Andreas Bärtschi,
Luis Pedro García-Pintos,
John Golden,
Nathan Lemons,
Stephan Eidenbenz
Abstract:
The quantum alternating operator ansatz (QAOA) is a heuristic hybrid quantum-classical algorithm for finding high-quality approximate solutions to combinatorial optimization problems, such as Maximum Satisfiability. While QAOA is well-studied, theoretical results as to its runtime or approximation ratio guarantees are still relatively sparse. We provide some of the first lower bounds for the number of rounds (the dominant component of QAOA runtimes) required for QAOA. For our main result, (i) we leverage a connection between quantum annealing times and the angles of QAOA to derive a lower bound on the number of rounds of QAOA with respect to the guaranteed approximation ratio. We apply and calculate this bound with Grover-style mixing unitaries and (ii) show that this type of QAOA requires at least a polynomial number of rounds to guarantee any constant approximation ratios for most problems. We also (iii) show that the bound depends only on the statistical values of the objective functions, and when the problem can be modeled as a $k$-local Hamiltonian, can be easily estimated from the coefficients of the Hamiltonians. For the conventional transverse field mixer, (iv) our framework gives a trivial lower bound to all bounded occurrence local cost problems and all strictly $k$-local cost Hamiltonians matching known results that constant approximation ratio is obtainable with constant round QAOA for a few optimization problems from these classes. Using our novel proof framework, (v) we recover the Grover lower bound for unstructured search and -- with small modification -- show that our bound applies to any QAOA-style search protocol that starts in the ground state of the mixing unitaries.
Submitted 3 September, 2023; v1 submitted 29 August, 2023;
originally announced August 2023.
-
Increasing the Measured Effective Quantum Volume with Zero Noise Extrapolation
Authors:
Elijah Pelofske,
Vincent Russo,
Ryan LaRose,
Andrea Mari,
Dan Strano,
Andreas Bärtschi,
Stephan Eidenbenz,
William J. Zeng
Abstract:
Quantum Volume is a full-stack benchmark for near-term quantum computers. It quantifies the largest size of a square circuit which can be executed on the target device with reasonable fidelity. Error mitigation is a set of techniques intended to remove the effects of noise present in the computation of noisy quantum computers when computing an expectation value of interest. Effective quantum volume is a proposed metric that applies error mitigation to the quantum volume protocol in order to evaluate the effectiveness not only of the target device but also of the error mitigation algorithm. Digital Zero-Noise Extrapolation (ZNE) is an error mitigation technique that estimates the noiseless expectation value using circuit folding to amplify errors by known scale factors and extrapolating to the zero-noise limit. Here we demonstrate that ZNE, with global and local unitary folding with fractional scale factors, in conjunction with dynamical decoupling, can increase the effective quantum volume over the vendor-measured quantum volume. Specifically, we measure the effective quantum volume of four IBM Quantum superconducting processor units, obtaining values that are larger than the vendor-measured quantum volume on each device. This is the first such increase reported.
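The extrapolation step of zero-noise extrapolation can be sketched in a few lines. The example below is a minimal illustration, not the authors' implementation: it fits a straight line by least squares to expectation values measured at several noise scale factors and reads off the value at scale factor zero. The linear model, the function name, and the synthetic data are all assumptions; the abstract does not state which extrapolation model was used.

```python
def linear_zne(scale_factors, expectations):
    """Least-squares linear fit, evaluated at noise scale factor 0.

    Assumes the expectation value decays roughly linearly with the noise
    scale factor, which is only one of several possible extrapolation models.
    """
    n = len(scale_factors)
    mean_x = sum(scale_factors) / n
    mean_y = sum(expectations) / n
    sxx = sum((x - mean_x) ** 2 for x in scale_factors)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(scale_factors, expectations))
    slope = sxy / sxx
    # The intercept of the fitted line is the zero-noise estimate.
    return mean_y - slope * mean_x


# Synthetic data: fractional scale factors such as those produced by circuit
# folding, with a made-up linear decay away from a noiseless value of 0.9.
scales = [1.0, 1.5, 2.0, 3.0]
noisy = [0.9 - 0.1 * s for s in scales]
print(linear_zne(scales, noisy))
```

Scale factor 1 corresponds to the unmodified circuit; larger factors correspond to folded circuits with amplified noise, and the fit is evaluated outside the measured range at zero.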
Submitted 2 July, 2024; v1 submitted 27 June, 2023;
originally announced June 2023.
-
High-Round QAOA for MAX $k$-SAT on Trapped Ion NISQ Devices
Authors:
Elijah Pelofske,
Andreas Bärtschi,
John Golden,
Stephan Eidenbenz
Abstract:
The Quantum Alternating Operator Ansatz (QAOA) is a hybrid classical-quantum algorithm that aims to sample the optimal solution(s) of discrete combinatorial optimization problems. We present optimized QAOA circuit constructions for sampling MAX $k$-SAT problems, specifically for $k=3$ and $k=4$. The novel $4$-SAT QAOA circuit construction we present uses measurement-based uncomputation, followed by classical feed forward conditional operations. The QAOA circuit parameters for $3$-SAT are optimized via exact classical (noise-free) simulation, using HPC resources to simulate up to $20$ rounds on $10$ qubits. In order to explore the limits of current NISQ devices we execute these optimized QAOA circuits for random $3$-SAT test instances with clause-to-variable ratio $4$ on four trapped ion quantum computers: Quantinuum H1-1 (20 qubits), IonQ Harmony (11 qubits), IonQ Aria 1 (25 qubits), and IonQ Forte (30 qubits). The QAOA circuits that are executed include $n=10$ up to $p=20$, and $n=22$ for $p=1$ and $p=2$. The high round circuits use upwards of 9,000 individual gate instructions, making these some of the largest QAOA circuits executed on NISQ devices. Our main finding is that current NISQ devices perform best at low round counts (i.e., $p = 1,\ldots, 5$) and then -- as expected due to noise -- gradually start returning satisfiability truth assignments that are no better than randomly picked solutions as the number of QAOA rounds is further increased.
Submitted 10 August, 2023; v1 submitted 5 June, 2023;
originally announced June 2023.
-
The Quantum Alternating Operator Ansatz for Satisfiability Problems
Authors:
John Golden,
Andreas Bärtschi,
Daniel O'Malley,
Stephan Eidenbenz
Abstract:
We comparatively study, through large-scale numerical simulation, the performance across a large set of Quantum Alternating Operator Ansatz (QAOA) implementations for finding approximate and optimum solutions to unconstrained combinatorial optimization problems. Our survey includes over 100 different mixing unitaries, and we combine each mixer with both the standard phase separator unitary representing the objective function and a thresholded version. Our numerical tests for randomly chosen instances of the unconstrained optimization problems Max 2-SAT and Max 3-SAT reveal that the traditional transverse-field mixer with the standard phase separator performs best for problem sizes of 8 through 14 variables, while the recently introduced Grover mixer with thresholding wins at problems of size 6. This result (i) corrects earlier work suggesting that the Grover mixer is a superior mixer based only on results from problems of size 6, thus illustrating the need to push numerical simulation to larger problem sizes to more accurately predict performance; and (ii) it suggests that more complicated mixers and phase separators may not improve QAOA performance.
Submitted 26 January, 2023;
originally announced January 2023.
-
Quantum Annealing vs. QAOA: 127 Qubit Higher-Order Ising Problems on NISQ Computers
Authors:
Elijah Pelofske,
Andreas Bärtschi,
Stephan Eidenbenz
Abstract:
Quantum annealing (QA) and Quantum Alternating Operator Ansatz (QAOA) are both heuristic quantum algorithms intended for sampling optimal solutions of combinatorial optimization problems. In this article we implement a rigorous direct comparison between QA on D-Wave hardware and QAOA on IBMQ hardware. These two quantum algorithms are also compared against classical simulated annealing. The studied problems are instances of a class of Ising models, with variable assignments of $+1$ or $-1$, that contain cubic $ZZZ$ interactions (higher order terms) and match both the native connectivity of the Pegasus topology D-Wave chips and the heavy hexagonal lattice of the IBMQ chips. The novel QAOA implementation on the heavy hexagonal lattice has a CNOT depth of $6$ per round and allows for usage of an entire heavy hexagonal lattice. Experimentally, QAOA is executed on an ensemble of randomly generated Ising instances with a grid search over $1$ and $2$ round angles using all 127 programmable superconducting transmon qubits of ibm_washington. The error suppression technique digital dynamical decoupling is also tested on all QAOA circuits. QA is executed on the same Ising instances with the programmable superconducting flux qubit devices D-Wave Advantage_system4.1 and Advantage_system6.1 using modified annealing schedules with pauses. We find that QA outperforms QAOA on all problem instances. We also find that dynamical decoupling enables 2-round QAOA to marginally outperform 1-round QAOA, which is not the case without dynamical decoupling.
Submitted 18 March, 2023; v1 submitted 1 January, 2023;
originally announced January 2023.
-
Optimized Telecloning Circuits: Theory and Practice of Nine NISQ Clones
Authors:
Elijah Pelofske,
Andreas Bärtschi,
Stephan Eidenbenz
Abstract:
Although perfect copying of an unknown quantum state is not possible, approximate cloning is possible in quantum mechanics. Quantum telecloning is a variant of approximate quantum cloning which uses quantum teleportation to allow for the use of classical communication to create physically separate clones of a quantum state. We present results of a $1 \rightarrow 9$ universal, symmetric, optimal quantum telecloning implementation on a cloud-accessible quantum computer, the Quantinuum H1-1 device. The H1-1 device allows direct creation of the telecloning protocol due to real time classical if-statements that are conditional on the mid-circuit measurement outcome of a Bell measurement. In this implementation, we also provide an improvement over previous work for the circuit model description of quantum telecloning, which reduces the required gate depth and gate count for an all-to-all connectivity. The demonstration of creating $9$ approximate clones on a quantum processor is the largest number of clones that has been generated, telecloning or otherwise.
Submitted 30 November, 2022; v1 submitted 18 October, 2022;
originally announced October 2022.
-
Scalable Experimental Bounds for Dicke and GHZ States Fidelities
Authors:
Shamminuj Aktar,
Andreas Bärtschi,
Abdel-Hameed A. Badawy,
Stephan Eidenbenz
Abstract:
Estimating the state preparation fidelity of highly entangled states on noisy intermediate-scale quantum (NISQ) devices is an important task for benchmarking and application considerations. Unfortunately, exact fidelity measurements quickly become prohibitively expensive, as they scale exponentially as $O(3^N)$ for $N$-qubit states, using full state tomography with measurements in all combinations of Pauli bases. However, Somma and others [PhysRevA.74.052302] established that the complexity could be drastically reduced when looking at fidelity lower bounds for states that exhibit symmetries, such as Dicke States and GHZ States. For larger states, these bounds still need to be tight enough to provide reasonable estimations on NISQ devices.
For the first time and more than 15 years after the theoretical introduction, we report meaningful lower bounds for the state preparation fidelity of all Dicke States up to $N=10$, and all GHZ states up to $N=20$ on Quantinuum H1 ion-trap systems using efficient implementations of recently proposed scalable circuits for these states. Our achieved lower bounds match or exceed previously reported exact fidelities on superconducting systems for much smaller states. This work provides a path forward to benchmarking entanglement as NISQ devices improve in size and quality.
Submitted 31 August, 2023; v1 submitted 6 October, 2022;
originally announced October 2022.
-
Generating Hidden Markov Models from Process Models Through Nonnegative Tensor Factorization
Authors:
Erik Skau,
Andrew Hollis,
Stephan Eidenbenz,
Kim Rasmussen,
Boian Alexandrov
Abstract:
Monitoring of industrial processes is a critical capability in industry and in government to ensure reliability of production cycles, quick emergency response, and national security. Process monitoring allows users to gauge the progress of an organization in an industrial process or predict the degradation or aging of machine parts in processes taking place at a remote location. Similar to many data science applications, we usually only have access to limited raw data, such as satellite imagery, short video clips, event logs, and signatures captured by a small set of sensors. To combat data scarcity, we leverage the knowledge of Subject Matter Experts (SMEs) who are familiar with the actions of interest. SMEs provide expert knowledge of the essential activities required for task completion and the resources necessary to carry out each of these activities. Various process mining techniques have been developed for this type of analysis; typically such approaches combine theoretical process models built based on domain expert insights with ad-hoc integration of available pieces of raw data. Here, we introduce a novel mathematically sound method that integrates theoretical process models (as proposed by SMEs) with interrelated minimal Hidden Markov Models (HMM), built via nonnegative tensor factorization. Our method consolidates: (a) theoretical process models, (b) HMMs, (c) coupled nonnegative matrix-tensor factorizations, and (d) custom model selection. To demonstrate our methodology and its abilities, we apply it on simple synthetic and real world process models.
Submitted 26 April, 2024; v1 submitted 3 October, 2022;
originally announced October 2022.
-
Short-Depth Circuits for Dicke State Preparation
Authors:
Andreas Bärtschi,
Stephan Eidenbenz
Abstract:
We present short-depth circuits to deterministically prepare any Dicke state |Dn,k>, which is the equal-amplitude superposition of all n-qubit computational basis states with Hamming Weight k. Dicke states are an important class of entangled quantum states with a large variety of applications, and a long history of experimental creation in physical systems. On the other hand, not much is known regarding efficient scalable quantum circuits for Dicke state preparation on realistic quantum computing hardware connectivities.
Here we present preparation circuits for Dicke states |Dn,k> with (i) a depth of O(k log(n/k)) for All-to-All connectivity (such as on current ion trap devices); (ii) a depth of O(k sqrt(n/k)) = O(sqrt(nk)) for Grid connectivity on grids of size Omega(sqrt(n/s)) x O(sqrt(ns)) with s<=k (such as on current superconducting qubit devices).
Both approaches have a total gate count of O(kn), need no ancilla qubits, and generalize to both the preparation and compression of symmetric pure states in which all non-zero amplitudes correspond to states with Hamming weight at most k. Thus our work significantly improves and expands previous state-of-the-art circuits, which had depth O(n) on a Linear Nearest Neighbor connectivity for arbitrary k (Fundamentals of Computation Theory 2019) and depth O(log n) on All-to-All connectivity for k=1 (Advanced Quantum Technologies 2019).
Submitted 20 July, 2022;
originally announced July 2022.
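The definition of a Dicke state given in the abstract above (an equal-amplitude superposition of all n-qubit basis states with Hamming weight k) can be illustrated classically. The following is a minimal sketch of the state vector only, not of the preparation circuits from the paper; the function name `dicke_state` and the dictionary representation are assumptions made for illustration.

```python
from itertools import combinations
from math import comb, sqrt


def dicke_state(n, k):
    """Return {bitstring: amplitude} for the Dicke state on n qubits with Hamming weight k.

    Hypothetical helper for illustration; it enumerates the basis states
    classically and does not build a quantum circuit.
    """
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    # All C(n, k) basis states share the same amplitude 1 / sqrt(C(n, k)).
    amplitude = 1 / sqrt(comb(n, k))
    state = {}
    for ones in combinations(range(n), k):
        bits = ["0"] * n
        for position in ones:
            bits[position] = "1"
        state["".join(bits)] = amplitude
    return state


if __name__ == "__main__":
    state = dicke_state(4, 2)
    # Six basis states, each with exactly two ones.
    print(sorted(state))
    # The squared amplitudes sum to one (up to floating-point rounding).
    print(sum(a * a for a in state.values()))
```

The number of basis states grows as C(n, k), so this enumeration is only practical for small n; it is meant as a check on the definition, not as a scalable method.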
-
Quantum Telecloning on NISQ Computers
Authors:
Elijah Pelofske,
Andreas Bärtschi,
Bryan Garcia,
Boris Kiefer,
Stephan Eidenbenz
Abstract:
Due to the no-cloning theorem, generating perfect quantum clones of an arbitrary unknown quantum state is not possible; however, approximate quantum clones can be constructed. Quantum telecloning is a protocol that originates from a combination of quantum teleportation and quantum cloning. Here we present $1 \rightarrow 2$ and $1 \rightarrow 3$ quantum telecloning circuits, with and without ancilla, that are theoretically optimal (meaning the clones have the highest fidelity allowed by quantum mechanics), universal (meaning the clone fidelity is independent of the state being cloned), and symmetric (meaning the clones all have the same fidelity). We implement these circuits on gate model IBMQ and Quantinuum NISQ hardware and quantify the clone fidelities using parallel single qubit state tomography. Quantum telecloning using mid-circuit measurement with classical feed-forward control (i.e. real time if statements) is demonstrated on the Quantinuum H1-2 device. Two alternative implementations of quantum telecloning, deferred measurement and post selection, are demonstrated on ibmq_montreal, where mid-circuit measurements with real time if statements are not available. Our results show that NISQ devices can achieve near-optimal quantum telecloning fidelity; for example the Quantinuum H1-2 device running the telecloning circuits without ancilla achieved a mean clone fidelity of $0.824$ with standard deviation of $0.024$ for two clone circuits and $0.765$ with standard deviation of $0.022$ for three clone circuits. The theoretical fidelity limits are $0.8\overline{3}$ for two clones and $0.\overline{7}$ for three clones. This demonstrates the viability of performing experimental analysis of quantum information networks and quantum cryptography protocols on NISQ computers.
Submitted 1 August, 2022; v1 submitted 29 April, 2022;
originally announced May 2022.
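The theoretical fidelity limits quoted in the telecloning abstract above agree with the standard closed form for optimal universal symmetric cloning of one qubit into M clones, F = (2M + 1) / (3M). The abstract itself does not state this formula, so it is supplied here as an assumption; the function name `optimal_clone_fidelity` is hypothetical. The sketch compares the limits with the mean fidelities reported for the Quantinuum H1-2 device.

```python
from fractions import Fraction


def optimal_clone_fidelity(num_clones):
    """Fidelity limit for universal symmetric 1 -> M qubit cloning, (2M + 1) / (3M).

    The closed form is an assumption brought in for illustration; the
    abstract only quotes the resulting decimal values.
    """
    if num_clones < 1:
        raise ValueError("num_clones must be at least 1")
    return Fraction(2 * num_clones + 1, 3 * num_clones)


if __name__ == "__main__":
    # Mean clone fidelities reported in the abstract for circuits without ancilla.
    reported = {2: 0.824, 3: 0.765}
    for num_clones, measured in reported.items():
        limit = optimal_clone_fidelity(num_clones)
        gap = float(limit) - measured
        print(f"M={num_clones}: limit={limit} measured={measured} gap={gap:.3f}")
```

Using exact fractions makes the repeating decimals explicit: 5/6 for two clones and 7/9 for three clones.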
-
Quantum Volume in Practice: What Users Can Expect from NISQ Devices
Authors:
Elijah Pelofske,
Andreas Bärtschi,
Stephan Eidenbenz
Abstract:
Quantum volume (QV) has become the de-facto standard benchmark to quantify the capability of Noisy Intermediate-Scale Quantum (NISQ) devices. While QV values are often reported by NISQ providers for their systems, we perform our own series of QV calculations on 24 NISQ devices currently offered by IBM Q, IonQ, Rigetti, Oxford Quantum Circuits, and Quantinuum (formerly Honeywell). Our approach characterizes the performances that an advanced user of these NISQ devices can expect to achieve with a reasonable amount of optimization, but without white-box access to the device. In particular, we compile QV circuits to standard gate sets of the vendor using compiler optimization routines where available, and we perform experiments across different qubit subsets. We find that running QV tests requires very significant compilation cycles, and that QV values achieved in our tests typically lag behind officially reported results and depend significantly on the classical compilation effort invested.
Submitted 21 August, 2023; v1 submitted 7 March, 2022;
originally announced March 2022.
-
Distributed Out-of-Memory NMF on CPU/GPU Architectures
Authors:
Ismael Boureima,
Manish Bhattarai,
Maksim Eren,
Erik Skau,
Philip Romero,
Stephan Eidenbenz,
Boian Alexandrov
Abstract:
We propose an efficient distributed out-of-memory implementation of the Non-negative Matrix Factorization (NMF) algorithm for heterogeneous high-performance-computing (HPC) systems. The proposed implementation is based on prior work on NMFk, which can perform automatic model selection and extract latent variables and patterns from data. In this work, we extend NMFk by adding support for dense and sparse matrix operations on multi-node, multi-GPU systems. The resulting algorithm is optimized for out-of-memory (OOM) problems where the memory required to factorize a given matrix is greater than the available GPU memory. Memory complexity is reduced by batching/tiling strategies, and sparse and dense matrix operations are significantly accelerated with GPU cores (or tensor cores when available). Input/Output (I/O) latency associated with batch copies between host and device is hidden using CUDA streams to overlap data transfers and compute asynchronously, and latency associated with collective communications (both intra-node and inter-node) is reduced using optimized NVIDIA Collective Communication Library (NCCL) based communicators. Benchmark results show significant improvement, from 32x to 76x speedup, with the new implementation using GPUs over the CPU-based NMFk. Good weak scaling was demonstrated on up to 4096 multi-GPU cluster nodes with approximately 25,000 GPUs when decomposing a dense 340 Terabyte-size matrix and an 11 Exabyte-size sparse matrix of density 10e-6.
Submitted 12 September, 2023; v1 submitted 18 February, 2022;
originally announced February 2022.
-
BB-ML: Basic Block Performance Prediction using Machine Learning Techniques
Authors:
Hamdy Abdelkhalik,
Shamminuj Aktar,
Yehia Arafa,
Atanu Barai,
Gopinath Chennupati,
Nandakishore Santhi,
Nishant Panda,
Nirmal Prajapati,
Nazmul Haque Turja,
Stephan Eidenbenz,
Abdel-Hameed Badawy
Abstract:
Recent years have seen the adoption of Machine Learning (ML) techniques to predict the performance of large-scale applications, mostly at a coarse level. In contrast, we propose to use ML techniques for performance prediction at a much finer granularity, namely at the level of Basic Blocks (BBs), which are single-entry, single-exit code blocks that compilers use during analysis to break down a large code into manageable pieces. We extrapolate the basic block execution counts of GPU applications and use them for predicting the performance for large input sizes from the counts of smaller input sizes. We train a Poisson Neural Network (PNN) model using random input values as well as the lowest input values of the application to learn the relationship between inputs and basic block counts. Experimental results show that the model can accurately predict the basic block execution counts of 16 GPU benchmarks. We achieve an accuracy of 93.5% in extrapolating the basic block counts for large input sets when trained on smaller input sets and an accuracy of 97.7% in predicting basic block counts on random instances. In a case study, we apply the ML model to CUDA GPU benchmarks for performance prediction across a spectrum of applications. We use a variety of metrics for evaluation, including global memory requests and the active cycles of tensor cores, ALU, and FMA units. Results demonstrate the model's capability of predicting the performance of large datasets with an average error rate of 0.85% and 0.17% for global and shared memory requests, respectively. Additionally, to address the utilization of the main functional units in Ampere architecture GPUs, we calculate the active cycles for tensor cores, ALU, FMA, and FP64 units and achieve an average error of 2.3% and 10.66% for ALU and FMA units, respectively, while the maximum observed error across all tested applications and units reaches 18.5%.
Submitted 11 November, 2023; v1 submitted 15 February, 2022;
originally announced February 2022.
-
Numerical Evidence for Exponential Speed-up of QAOA over Unstructured Search for Approximate Constrained Optimization
Authors:
John Golden,
Andreas Bärtschi,
Stephan Eidenbenz,
Daniel O'Malley
Abstract:
Despite much recent work, the true promise and limitations of the Quantum Alternating Operator Ansatz (QAOA) are unclear. A critical question regarding QAOA is to what extent its performance scales with the input size of the problem instance, in particular the necessary growth in the number of QAOA rounds to reach a high approximation ratio. We present numerical evidence for an exponential speed-up of QAOA over Grover-style unstructured search in finding approximate solutions to constrained optimization problems. Our result provides a strong hint that QAOA is able to exploit the structure of an optimization problem and thus overcome the lower bound for unstructured search.
To this end, we conduct a comprehensive numerical study on several Hamming-weight constrained optimization problems for which we include combinations of all standardly studied mixer and phase separator Hamiltonians (Ring mixer, Clique mixer, Objective Value phase separator) as well as quantum minimum-finding inspired Hamiltonians (Grover mixer, Threshold-based phase separator). We identify Clique-Objective-QAOA with an exponential speed-up over Grover-Threshold-QAOA and tie the latter's scaling to that of unstructured search, with all other QAOA combinations coming in at a distant third. Our result suggests that maximizing QAOA performance requires a judicious choice of mixer and phase separator, and should trigger further research into other QAOA variations.
Submitted 9 May, 2023; v1 submitted 1 February, 2022;
originally announced February 2022.
-
A Divide-and-Conquer Approach to Dicke State Preparation
Authors:
Shamminuj Aktar,
Andreas Bärtschi,
Abdel-Hameed A. Badawy,
Stephan Eidenbenz
Abstract:
We present a divide-and-conquer approach to deterministically prepare Dicke states $\lvert D_k^n\rangle$ (i.e., equal-weight superpositions of all $n$-qubit states with Hamming Weight $k$) on quantum computers. In an experimental evaluation for up to $n=6$ qubits on IBM Quantum Sydney and Montreal devices, we achieve significantly higher state fidelity compared to previous results [Mukherjee and others, TQE'2020], [Cruz and others, QuTe'2019]. The fidelity gains are achieved through several techniques: Our circuits first "divide" the Hamming weight between blocks of $n/2$ qubits, and then "conquer" those blocks with improved versions of Dicke state unitaries [Bärtschi and others, FCT'2019]. Due to the sparse connectivity on IBM's heavy-hex-architectures, these circuits are implemented for linear nearest neighbor topologies. Further gains in (estimating) the state fidelity are due to our use of measurement error mitigation and hardware progress.
Submitted 9 June, 2022; v1 submitted 23 December, 2021;
originally announced December 2021.
-
Sampling on NISQ Devices: "Who's the Fairest One of All?"
Authors:
Elijah Pelofske,
John Golden,
Andreas Bärtschi,
Daniel O'Malley,
Stephan Eidenbenz
Abstract:
Modern NISQ devices are subject to a variety of biases and sources of noise that degrade the solution quality of computations carried out on these devices. A natural question that arises in the NISQ era is how fairly these devices sample ground state solutions. To this end, we run five fair sampling problems (each with at least three ground state solutions) that are based both on quantum annealing and on the Grover Mixer-QAOA algorithm for gate-based NISQ hardware. In particular, we use seven IBM Q devices, the Aspen-9 Rigetti device, the IonQ device, and three D-Wave quantum annealers. For each of the fair sampling problems, we measure the ground state probability, the relative fairness of the frequency of each ground state solution with respect to the other ground state solutions, and the aggregate error as given by each hardware provider. Overall, our results show that NISQ devices do not achieve fair sampling yet. We also observe differences in the software stack with a particular focus on compilation techniques that illustrate what work will still need to be done to achieve a seamless integration of frontend (i.e. quantum circuit description) and backend compilation.
Submitted 13 July, 2021;
originally announced July 2021.
-
Threshold-Based Quantum Optimization
Authors:
John Golden,
Andreas Bärtschi,
Daniel O'Malley,
Stephan Eidenbenz
Abstract:
We propose and study Th-QAOA (pronounced Threshold QAOA), a variation of the Quantum Alternating Operator Ansatz (QAOA) that replaces the standard phase separator operator, which encodes the objective function, with a threshold function that returns a value $1$ for solutions with an objective value above the threshold and a $0$ otherwise. We vary the threshold value to arrive at a quantum optimization algorithm. We focus on a combination with the Grover Mixer operator; the resulting GM-Th-QAOA can be viewed as a generalization of Grover's quantum search algorithm and its minimum/maximum finding cousin to approximate optimization. Our main findings include: (i) we provide intuitive arguments and show empirically that the optimum parameter values of GM-Th-QAOA (angles and threshold value) can be found with $O(\log(p) \times \log M)$ iterations of the classical outer loop, where $p$ is the number of QAOA rounds and $M$ is an upper bound on the solution value (often the number of vertices or edges in an input graph), thus eliminating the notorious outer-loop parameter finding issue of other QAOA algorithms; (ii) GM-Th-QAOA can be simulated classically with little effort up to 100 qubits through a set of tricks that cut down memory requirements; (iii) somewhat surprisingly, GM-Th-QAOA outperforms non-thresholded GM-QAOA in terms of approximation ratios achieved. This third result holds across a range of optimization problems (MaxCut, Max k-VertexCover, Max k-DensestSubgraph, MaxBisection) and various experimental design parameters, such as different input edge densities and constraint sizes.
Submitted 23 August, 2021; v1 submitted 25 June, 2021;
originally announced June 2021.
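The threshold function described in the Th-QAOA abstract above returns 1 for solutions whose objective value lies above the threshold and 0 otherwise. A minimal classical sketch for MaxCut follows; it only evaluates the indicator on bit assignments and does not implement the quantum phase separator or the Grover Mixer. The names `cut_value` and `threshold_indicator` and the 4-cycle example graph are assumptions made for illustration.

```python
from itertools import product


def cut_value(edges, assignment):
    """Number of edges whose endpoints fall on different sides of the cut."""
    return sum(1 for u, v in edges if assignment[u] != assignment[v])


def threshold_indicator(objective_value, threshold):
    """Return 1 if the objective value is strictly above the threshold, else 0."""
    return 1 if objective_value > threshold else 0


if __name__ == "__main__":
    # Hypothetical example graph: a cycle on four vertices.
    edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
    threshold = 3
    for assignment in product((0, 1), repeat=4):
        value = cut_value(edges, assignment)
        marked = threshold_indicator(value, threshold)
        if marked:
            print(assignment, value)
```

Varying `threshold` changes which assignments are marked, which is the knob the abstract describes turning to obtain an optimization algorithm.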
-
PPT-Multicore: Performance Prediction of OpenMP applications using Reuse Profiles and Analytical Modeling
Authors:
Atanu Barai,
Yehia Arafa,
Abdel-Hameed Badawy,
Gopinath Chennupati,
Nandakishore Santhi,
Stephan Eidenbenz
Abstract:
We present PPT-Multicore, an analytical model embedded in the Performance Prediction Toolkit (PPT) to predict parallel application performance running on a multicore processor. PPT-Multicore builds upon our previous work towards a multicore cache model. We extract an LLVM basic-block-labeled memory trace using an architecture-independent LLVM-based instrumentation tool only once in an application's lifetime. The model uses the memory trace and other parameters from an instrumented sequentially executed binary. We use a probabilistic and computationally efficient reuse profile to predict the cache hit rates and runtimes of OpenMP programs' parallel sections. We model Intel's Broadwell, Haswell, and AMD's Zen2 architectures and validate our framework using different applications from PolyBench and PARSEC benchmark suites. The results show that PPT-Multicore can predict cache hit rates with an overall average error rate of 1.23% while predicting the runtime with an error rate of 9.08%.
Submitted 11 April, 2021;
originally announced April 2021.
-
PPT-SASMM: Scalable Analytical Shared Memory Model: Predicting the Performance of Multicore Caches from a Single-Threaded Execution Trace
Authors:
Atanu Barai,
Gopinath Chennupati,
Nandakishore Santhi,
Abdel-Hameed Badawy,
Yehia Arafa,
Stephan Eidenbenz
Abstract:
Performance modeling of parallel applications on multicore processors remains a challenge in computational co-design due to multicore processors' complex design. Multicores include complex private and shared memory hierarchies. We present a Scalable Analytical Shared Memory Model (SASMM). SASMM can predict the performance of parallel applications running on a multicore. SASMM uses a probabilistic and computationally-efficient method to predict the reuse distance profiles of caches in multicores. SASMM relies on a stochastic, static basic block-level analysis of reuse profiles. The profiles are calculated from the memory traces of applications that run sequentially rather than using multi-threaded traces. The experiments show that our model can predict private L1 cache hit rates with 2.12% and shared L2 cache hit rates with about 1.50% error rate.
Submitted 19 March, 2021;
originally announced March 2021.
-
Fair Sampling Error Analysis on NISQ Devices
Authors:
John Golden,
Andreas Bärtschi,
Daniel O'Malley,
Stephan Eidenbenz
Abstract:
We study the status of fair sampling on Noisy Intermediate Scale Quantum (NISQ) devices, in particular the IBM Q family of backends. Using the recently introduced Grover Mixer-QAOA algorithm for discrete optimization, we generate fair sampling circuits to solve six problems of varying difficulty, each with several optimal solutions, which we then run on twenty backends across the IBM Q system. For a given circuit evaluated on a specific set of qubits, we evaluate: how frequently the qubits return an optimal solution to the problem, the fairness with which the qubits sample from all optimal solutions, and the reported hardware error rate of the qubits. To quantify fairness, we define a novel metric based on Pearson's $χ^2$ test. We find that fairness is relatively high for circuits with small and large error rates, but drops for circuits with medium error rates. This indicates that structured errors dominate in this regime, while unstructured errors, which are random and thus inherently fair, dominate in noisier qubits and longer circuits. Our results show that fairness can be a powerful tool for understanding the intricate web of errors affecting current NISQ hardware.
Submitted 11 June, 2022; v1 submitted 8 January, 2021;
originally announced January 2021.
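The fair sampling abstract above says its fairness metric is based on Pearson's χ² test but does not give the definition. As a rough illustration of the underlying idea only, the sketch below computes the plain Pearson χ² statistic of observed counts of optimal solutions against a uniform expectation. This is an assumption standing in for the paper's metric, not a reproduction of it; the function name `pearson_chi_squared` and the example counts are hypothetical.

```python
def pearson_chi_squared(counts):
    """Pearson chi-squared statistic of observed counts against a uniform expectation.

    A value of 0 means every optimal solution was sampled equally often;
    larger values indicate a less fair sample.
    """
    if not counts:
        raise ValueError("counts must not be empty")
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one observation")
    # Under perfectly fair sampling each solution is expected equally often.
    expected = total / len(counts)
    return sum((observed - expected) ** 2 / expected for observed in counts)


if __name__ == "__main__":
    # Hypothetical counts for three optimal solutions of one problem.
    print(pearson_chi_squared([20, 20, 20]))
    print(pearson_chi_squared([30, 10, 20]))
```

The statistic depends on the total number of samples, so comparing backends this way would require equal shot counts or a normalization step that this sketch leaves out.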
-
Machine Learning Enabled Scalable Performance Prediction of Scientific Codes
Authors:
Gopinath Chennupati,
Nandakishore Santhi,
Phill Romero,
Stephan Eidenbenz
Abstract:
We present the Analytical Memory Model with Pipelines (AMMP) of the Performance Prediction Toolkit (PPT). PPT-AMMP takes high-level source code and hardware architecture parameters as input and predicts the runtime of that code on the target hardware platform, which is defined in the input parameters. PPT-AMMP transforms the code to an (architecture-independent) intermediate representation, then (i) analyzes the basic block structure of the code, (ii) processes architecture-independent virtual memory access patterns that it uses to build memory reuse distance distribution models for each basic block, and (iii) runs detailed basic-block level simulations to determine hardware pipeline usage.
PPT-AMMP uses machine learning and regression techniques to build the prediction models based on small instances of the input code, then integrates into a higher-order discrete-event simulation model of PPT running on the Simian PDES engine. We validate PPT-AMMP on four standard computational physics benchmarks and finally present a use case of hardware parameter sensitivity analysis to identify bottleneck hardware resources on different code inputs. We further extend PPT-AMMP to predict the performance of a scientific (radiation transport) application, SNAP. We analyze the application of multi-variate regression models that accurately predict the reuse profiles and the basic block counts. The predicted runtimes of SNAP are accurate when compared with the actual runtimes.
Submitted 12 November, 2020; v1 submitted 8 October, 2020;
originally announced October 2020.
-
Grover Mixers for QAOA: Shifting Complexity from Mixer Design to State Preparation
Authors:
Andreas Bärtschi,
Stephan Eidenbenz
Abstract:
We propose GM-QAOA, a variation of the Quantum Alternating Operator Ansatz (QAOA) that uses Grover-like selective phase shift mixing operators. GM-QAOA works on any NP optimization problem for which it is possible to efficiently prepare an equal superposition of all feasible solutions; it is designed to perform particularly well for constraint optimization problems, where not all possible variable assignments are feasible solutions. GM-QAOA has the following features: (i) It is not susceptible to Hamiltonian Simulation error (such as Trotterization errors) as its operators can be implemented exactly using standard gate sets and (ii) Solutions with the same objective value are always sampled with the same amplitude.
We illustrate the potential of GM-QAOA on several optimization problem classes: for permutation-based optimization problems such as the Traveling Salesperson Problem, we present an efficient algorithm to prepare a superposition of all possible permutations of $n$ numbers, defined on $O(n^2)$ qubits; for the hard constraint $k$-Vertex-Cover problem, and for an application to Discrete Portfolio Rebalancing, we show that GM-QAOA outperforms existing QAOA approaches.
Submitted 2 October, 2020; v1 submitted 30 May, 2020;
originally announced June 2020.
-
Verified Instruction-Level Energy Consumption Measurement for NVIDIA GPUs
Authors:
Yehia Arafa,
Ammar ElWazir,
Abdelrahman ElKanishy,
Youssef Aly,
Ayatelrahman Elsayed,
Abdel-Hameed Badawy,
Gopinath Chennupati,
Stephan Eidenbenz,
Nandakishore Santhi
Abstract:
GPUs are prevalent in modern computing systems at all scales. They consume a significant fraction of the energy in these systems. However, vendors do not publish the actual cost of the power/energy overhead of their internal microarchitecture. In this paper, we accurately measure the energy consumption of various PTX instructions found in modern NVIDIA GPUs. We provide an exhaustive comparison of more than 40 instructions for four high-end NVIDIA GPUs from four different generations (Maxwell, Pascal, Volta, and Turing). Furthermore, we show the effect of the CUDA compiler optimizations on the energy consumption of each instruction. We use three different software techniques to read the GPU on-chip power sensors, all of which use NVIDIA's NVML API, and provide an in-depth comparison between these techniques. Additionally, we verified the software measurement techniques against a custom-designed hardware power measurement. The results show that Volta GPUs have the best energy efficiency among the four generations for the different categories of instructions. This work should aid in understanding NVIDIA GPUs' microarchitecture. It should also make energy measurements of any GPU kernel both efficient and accurate.
Submitted 2 June, 2020; v1 submitted 18 February, 2020;
originally announced February 2020.
-
Modeling Shared Cache Performance of OpenMP Programs using Reuse Distance
Authors:
Atanu Barai,
Gopinath Chennupati,
Nandakishore Santhi,
Abdel-Hameed A. Badawy,
Stephan Eidenbenz
Abstract:
Performance modeling of parallel applications on multicore computers remains a challenge in computational co-design due to the complex design of multicore processors including private and shared memory hierarchies. We present a Scalable Analytical Shared Memory Model to predict the performance of parallel applications that run on a multicore computer and share the same level of cache in the hierarchy. This model uses a computationally efficient, probabilistic method to predict the reuse distance profiles, where reuse distance is a hardware architecture-independent measure of the patterns of virtual memory accesses. It relies on a stochastic, static basic block-level analysis of reuse profiles measured from the memory traces of applications run sequentially on small instances rather than using a multi-threaded trace. The results indicate that the hit-rate predictions on the shared cache are accurate.
Submitted 29 July, 2019;
originally announced July 2019.
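The shared cache abstract above relies on reuse distance, which is commonly defined as the number of distinct addresses touched between two consecutive accesses to the same address. The sketch below computes exact reuse distances from a small trace and derives the hit rate of a fully associative LRU cache from them. It is a direct, quadratic-time illustration of the concept under that common definition, not the probabilistic model from the paper; the function names and the example trace are assumptions.

```python
def reuse_distances(trace):
    """Return the reuse distance of every access in the trace.

    The distance is the number of distinct addresses seen since the previous
    access to the same address, or None for a first (cold) access.
    """
    last_seen = {}
    distances = []
    for index, address in enumerate(trace):
        if address in last_seen:
            between = trace[last_seen[address] + 1:index]
            distances.append(len(set(between)))
        else:
            distances.append(None)
        last_seen[address] = index
    return distances


def lru_hit_rate(trace, capacity):
    """Hit rate of a fully associative LRU cache holding `capacity` addresses.

    An access hits exactly when its reuse distance is smaller than the capacity.
    """
    if not trace:
        return 0.0
    distances = reuse_distances(trace)
    hits = sum(1 for d in distances if d is not None and d < capacity)
    return hits / len(trace)


if __name__ == "__main__":
    # Hypothetical trace of symbolic addresses.
    trace = ["a", "b", "c", "a", "b", "b"]
    print(reuse_distances(trace))
    print(lru_hit_rate(trace, capacity=3))
```

Because the distances depend only on the access pattern, the same profile can be reused to estimate hit rates for several cache sizes, which is what makes the measure architecture-independent.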
-
Low Overhead Instruction Latency Characterization for NVIDIA GPGPUs
Authors:
Yehia Arafa,
Abdel-Hameed Badawy,
Gopinath Chennupati,
Nandakishore Santhi,
Stephan Eidenbenz
Abstract:
The last decade has seen a shift in the computer systems industry where heterogeneous computing has become prevalent. Graphics Processing Units (GPUs) are now present in devices ranging from supercomputers to mobile phones and tablets. GPUs are used for graphics operations as well as general-purpose computing (GPGPUs) to boost the performance of compute-intensive applications. However, the percentage of undisclosed characteristics beyond what vendors provide is not small. In this paper, we introduce a very low overhead and portable analysis for exposing the latency of each instruction executing in the GPU pipeline(s) and the access overhead of the various memory hierarchies found in GPUs at the micro-architecture level. Furthermore, we show the impact of the various optimizations the CUDA compiler can perform on these latencies. We perform our evaluation on seven different high-end NVIDIA GPUs from five different generations/architectures: Kepler, Maxwell, Pascal, Volta, and Turing. The results in this paper can help architects obtain an accurate characterization of the latencies of these GPUs, which will help in modeling the hardware accurately. Software developers can also use them to make informed optimizations to their applications.
Submitted 1 September, 2019; v1 submitted 21 May, 2019;
originally announced May 2019.
-
Deterministic Preparation of Dicke States
Authors:
Andreas Bärtschi,
Stephan Eidenbenz
Abstract:
The Dicke state $|D_k^n\rangle$ is an equal-weight superposition of all $n$-qubit states with Hamming Weight $k$ (i.e. all strings of length $n$ with exactly $k$ ones over a binary alphabet). Dicke states are an important class of entangled quantum states that among other things serve as starting states for combinatorial optimization quantum algorithms.
We present a deterministic quantum algorithm for the preparation of Dicke states. Implemented as a quantum circuit, our scheme uses $O(kn)$ gates, has depth $O(n)$ and needs no ancilla qubits. The inductive nature of our approach allows for linear-depth preparation of arbitrary symmetric pure states and -- used in reverse -- yields a quasilinear-depth circuit for efficient compression of quantum information in the form of symmetric pure states, improving on existing work requiring quadratic depth. All of these properties even hold for Linear Nearest Neighbor architectures.
Submitted 15 April, 2019;
originally announced April 2019.
-
The ISTI Rapid Response on Exploring Cloud Computing 2018
Authors:
Carleton Coffrin,
James Arnold,
Stephan Eidenbenz,
Derek Aberle,
John Ambrosiano,
Zachary Baker,
Sara Brambilla,
Michael Brown,
K. Nolan Carter,
Pinghan Chu,
Patrick Conry,
Keeley Costigan,
Ariane Eberhardt,
David M. Fobes,
Adam Gausmann,
Sean Harris,
Donovan Heimer,
Marlin Holmes,
Bill Junor,
Csaba Kiss,
Steve Linger,
Rodman Linn,
Li-Ta Lo,
Jonathan MacCarthy,
Omar Marcillo
, et al. (23 additional authors not shown)
Abstract:
This report describes eighteen projects that explored how commercial cloud computing services can be utilized for scientific computation at national laboratories. These demonstrations ranged from deploying proprietary software in a cloud environment to leveraging established cloud-based analytics workflows for processing scientific datasets. By and large, the projects were successful and collectively they suggest that cloud computing can be a valuable computational resource for scientific computation at national laboratories.
Submitted 4 January, 2019;
originally announced January 2019.
-
Quantum Algorithm Implementations for Beginners
Authors:
Abhijith J.,
Adetokunbo Adedoyin,
John Ambrosiano,
Petr Anisimov,
William Casper,
Gopinath Chennupati,
Carleton Coffrin,
Hristo Djidjev,
David Gunter,
Satish Karra,
Nathan Lemons,
Shizeng Lin,
Alexander Malyzhenkov,
David Mascarenas,
Susan Mniszewski,
Balu Nadiga,
Daniel O'Malley,
Diane Oyen,
Scott Pakin,
Lakshman Prasad,
Randy Roberts,
Phillip Romero,
Nandakishore Santhi,
Nikolai Sinitsyn,
Pieter J. Swart
, et al. (9 additional authors not shown)
Abstract:
As quantum computers become available to the general public, the need has arisen to train a cohort of quantum programmers, many of whom have been developing classical computer programs for most of their careers. While currently available quantum computers have fewer than 100 qubits, quantum computing hardware is widely expected to grow in terms of qubit count, quality, and connectivity. This review aims to explain the principles of quantum programming, which are quite different from classical programming, with straightforward algebra that makes understanding of the underlying fascinating quantum mechanical principles optional. We give an introduction to quantum computing algorithms and their implementation on real quantum hardware. We survey 20 different quantum algorithms, attempting to describe each in a succinct and self-contained fashion. We show how these algorithms can be implemented on IBM's quantum computer, and in each case, we discuss the results of the implementation with respect to differences between the simulator and the actual hardware runs. This article introduces computer scientists, physicists, and engineers to quantum algorithms and provides a blueprint for their implementations.
Submitted 26 June, 2022; v1 submitted 10 April, 2018;
originally announced April 2018.
-
Online Dominating Set
Authors:
Joan Boyar,
Stephan J. Eidenbenz,
Lene M. Favrholdt,
Michal Kotrbčík,
Kim S. Larsen
Abstract:
This paper is devoted to the online dominating set problem and its variants. We believe the paper represents the first systematic study of the effect of two limitations of online algorithms: making irrevocable decisions while not knowing the future, and being incremental, i.e., having to maintain solutions to all prefixes of the input. This is quantified through competitive analyses of online algorithms against two optimal algorithms, both knowing the entire input, but only one having to be incremental. We also consider the competitive ratio of the weaker of the two optimal algorithms against the other.
We consider important graph classes, distinguishing between connected and not necessarily connected graphs. For the classic graph classes of trees, bipartite, planar, and general graphs, we obtain tight results in almost all cases. We also derive upper and lower bounds for the class of bounded-degree graphs. From these analyses, we get detailed information regarding the significance of the necessary requirement that online algorithms be incremental. In some cases, having to be incremental fully accounts for the online algorithm's disadvantage.
Submitted 13 September, 2018; v1 submitted 18 April, 2016;
originally announced April 2016.
-
Hierarchical and Matrix Structures in a Large Organizational Email Network: Visualization and Modeling Approaches
Authors:
Benjamin H. Sims,
Nikolai Sinitsyn,
Stephan J. Eidenbenz
Abstract:
This paper presents findings from a study of the email network of a large scientific research organization, focusing on methods for visualizing and modeling organizational hierarchies within large, complex network datasets. In the first part of the paper, we find that visualization and interpretation of complex organizational network data are facilitated by integration of network data with information on formal organizational divisions and levels. By aggregating and visualizing email traffic between organizational units at various levels, we derive several insights into how large subdivisions of the organization interact with each other and with outside organizations. Our analysis shows that line and program management interactions in this organization systematically deviate from the idealized pattern of interaction prescribed by "matrix management." In the second part of the paper, we propose a power law model for predicting the degree distribution of organizational email traffic based on hierarchical relationships between managers and employees. This model considers the influence of global email announcements sent from managers to all employees under their supervision, and the role support staff play in generating email traffic, acting as agents for managers. We also analyze patterns in email traffic volume over the course of a work week.
Submitted 16 November, 2014;
originally announced November 2014.