In this section, we present implementations of the examples of Section 2.2 using the differentiable transforms implemented in PennyLane. For each example, we detail a specific scenario in which differentiability yields significant advantages, provides new insights, or enables novel functionality. Code snippets are provided for each example, and the full implementations are available in a GitHub repository. Unless otherwise noted, all examples can be implemented in the most recent release of PennyLane (v0.27).
3.3.1 Optimizing Gradient Computation in a Noisy Setting.
While the parameter-shift rule applies to a large variety of gates in variational quantum algorithms, we occasionally encounter unitaries that we wish to train on hardware but that do not admit a parameter-shift rule. This could be for a variety of reasons: perhaps the unitary is not of the form \(e^{iGx}\), or the eigenvalue spectrum of its generator is unknown.
In such cases, we typically must fall back to numerical methods of differentiation on hardware, such as the method of finite differences. Previous work exploring finite differences in a noisy setting has shown that the optimal finite-difference step size for a first-order forward difference is of the form [24]
\[
h^{*} = \left(\frac{8\,\sigma_0}{N\,[f^{\prime\prime}(x)]^{2}}\right)^{1/4}, \tag{8}
\]
where \(N\) is the number of shots (samples) used to estimate expectation values, \(\sigma_0\) is the single-shot variance of the estimates, and \(f^{\prime\prime}(x)\) is the second derivative of the quantum function at the evaluation point. While for large \(N\) we can make the approximation \(h^*\approx N^{-0.25}\), for small \(N\) on hardware we must explicitly compute the second derivative of the quantum function to obtain a decent estimate of the step size; this introduces further error and adds a prohibitive number of additional quantum evaluations per optimization step. Instead, we can wrap the gradient computation in a quantum transform that learns an optimal finite-difference step size in the presence of noise.
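A short derivation of this expression, under the assumption that the two circuit estimates entering the forward difference are statistically independent, balances the squared bias of the estimator against its shot-noise variance:
\[
\hat{g}(h) = \frac{\hat{f}(x+h)-\hat{f}(x)}{h}, \qquad
\mathrm{MSE}(h) \approx \underbrace{\left(\tfrac{h}{2}\,f^{\prime\prime}(x)\right)^{2}}_{\text{bias}^2} + \underbrace{\frac{2\sigma_0}{N h^{2}}}_{\text{shot noise}},
\]
\[
\frac{d\,\mathrm{MSE}}{dh} = 0 \;\;\Longrightarrow\;\; h^{*} = \left(\frac{8\,\sigma_0}{N\,[f^{\prime\prime}(x)]^{2}}\right)^{1/4} \;\propto\; N^{-1/4}.
\]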
Consider a variational quantum function circuit that we would like to optimize on a noisy device using first-order forward finite differences and \(N=1,\!000\) shots. Rather than hard-coding a constant finite-difference step size, we can include the variance of the single-shot gradient as a quantity to minimize in the cost function by using the qml.gradients.finite_diff transform.
An implementation is presented in Figure 9; the full code example, including the circuit definition, is available on our GitHub repository. Starting with \(x=0.1\) and \(h=10^{-7}\) (the default step size of the finite_diff transform), we can now write an optimization loop that
(1) computes the cost value \(f(x, h)\) and an estimate of the quantum gradient \(\partial _x f(x,h)\) using single-shot finite differences with step size \(h\) (implemented together in cost_and_grad);
(2) using autodifferentiation, computes the partial derivative of the cost value with respect to the step size, \(\partial _h f(x, h)\); and
(3) applies a gradient descent step to both parameters \(x\) and \(h\), as sketched below.
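The following is a minimal sketch of such a loop (ours, not the code of Figure 9): the single-qubit circuit, the number of gradient samples, and the helper cost_and_grad are illustrative stand-ins, and the forward differences are computed explicitly rather than through qml.gradients.finite_diff to keep the example short.

```python
import pennylane as qml
from pennylane import numpy as np  # autograd-compatible NumPy

dev = qml.device("default.qubit", wires=1, shots=1)  # single-shot estimates

@qml.qnode(dev, diff_method="finite-diff")
def circuit(x):
    # Illustrative stand-in for the variational circuit of Figure 9
    qml.RX(x, wires=0)
    return qml.expval(qml.PauliZ(0))

def cost_and_grad(x, h, n_samples=100):
    """Estimate f(x), the forward finite-difference gradient with step size h,
    and the variance of the single-shot gradient estimates."""
    value, g_mean, g_sq_mean = 0.0, 0.0, 0.0
    for _ in range(n_samples):
        value = value + circuit(x) / n_samples
        g = (circuit(x + h) - circuit(x)) / h  # single-shot forward difference
        g_mean = g_mean + g / n_samples
        g_sq_mean = g_sq_mean + g**2 / n_samples
    return value, g_mean, g_sq_mean - g_mean**2

x = np.array(0.1, requires_grad=True)
h = np.array(1e-7, requires_grad=True)
eta = 0.1  # learning rates for x and h would need separate tuning in practice

for step in range(20):
    value, grad_x, grad_var = cost_and_grad(x, h)                   # step (1)
    # Step (2): f(x) does not depend on h, so the derivative of the cost
    # (expectation value plus gradient variance) w.r.t. h is that of the variance.
    grad_h = qml.grad(lambda h_: cost_and_grad(x, h_)[2], argnum=0)(h)
    # Step (3): gradient descent updates for both the parameter and the step size
    x, h = x - eta * grad_x, h - eta * grad_h
```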
The results of this “adaptive step size” finite-difference optimization are compared to both a naïve finite-difference optimization (using the default step size of \(h=10^{-7}\)) and the optimal finite-difference optimization (computing Equation (8) at every step) in Figure 10. The naïve finite-difference optimization fails to converge to the minimum at all; a step size of \(10^{-7}\) results in a very large quantum gradient variance (as can be verified from Equation (8)). By contrast, both the adaptive and the optimal finite-difference optimizations converge to the minimum, with the optimal variant converging particularly quickly, as expected.
Nevertheless, the results show that knowledge of the underlying theoretical characteristics of a system (as in the optimal case) is not required. By simply encoding the quantities we wish to minimize—the expectation value and the gradient variance—the use of differentiable quantum transforms allows us to train the model hyperparameters to minimize error during gradient descent. Such approaches may also be viable in more complex models, where optimal or error-minimizing hyperparameter values are not known in advance.
3.3.2 Augmenting Differentiable Compilation Transforms with JIT Compilation.
The core difference between circuit compilation in PennyLane and other quantum software libraries is that nearly all of its compilation routines are quantum transforms that preserve differentiability. This is accomplished by applying only mathematical operations on the parameters through which gradients can be propagated in the underlying autodifferentiation frameworks. This occasionally requires framework-specific considerations, such as typecasting or the addition of very small numbers to avoid undefined gradients around critical values. As the framework-specific handling of operations is done within the transforms using qml.math, a user need only pass variables of the desired framework's type to a QNode and request that the circuit be compiled; the framework can be interchanged without any modification to the quantum functions themselves.
All compilation transforms are implemented as qfunc_transforms and, as such, manipulate a circuit at the level of its tape. Transforms include rotation merging, single-qubit gate fusion, inverse cancellation, and moving single-qubit gates through the control/target qubits of controlled operations. A top-level qml.compile transform is made available to the user to facilitate the creation of custom compilation pipelines. The right panel of Figure 11 displays an example pipeline that consists of pushing commuting gates left through the controls and targets of two-qubit gates, and then fusing all sequences of adjacent single-qubit gates into a single qml.Rot (general parametrized unitary) operation. This yields the reduced-size circuit in the left panel of Figure 11. The compiled circuit remains fully differentiable with respect to the input arguments, even though they may not appear directly as arguments of any gate in the compiled circuit.
At the end of the code listing of Figure 11, the method for obtaining the gradients with respect to the input parameters is shown. Differentiable compilation can reduce the quantum resources required to compute gradients, as the autodifferentiation framework keeps track of the changes in variables and can produce circuits with a reduced number of parameters. A disadvantage of applying transforms in this way is that every circuit execution involves feeding the quantum function through the transform pipeline. For large circuits and multi-step pipelines that involve more mathematically complex operations, such as full fusion of single-qubit gates, this can lead to significant temporal overhead. This is especially undesirable for gradient computations, as these by nature involve multiple executions of a quantum circuit.
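As a minimal sketch of this pattern (ours, not the listing of Figure 11; the circuit and weights below are illustrative), a compilation pipeline can be attached to a QNode as a decorator and the result differentiated with respect to the original input parameters:

```python
import pennylane as qml
from pennylane import numpy as np  # autograd-compatible NumPy

dev = qml.device("default.qubit", wires=2)

pipeline = [
    qml.transforms.commute_controlled(direction="left"),  # push commuting gates left
    qml.transforms.single_qubit_fusion,  # fuse adjacent single-qubit gates into qml.Rot
]

@qml.qnode(dev)
@qml.compile(pipeline=pipeline)
def circuit(weights):
    # Illustrative circuit; the tape is rebuilt (and re-compiled) on every call,
    # so gradients flow through the compilation pipeline.
    qml.RZ(weights[0], wires=0)
    qml.RY(weights[1], wires=0)
    qml.CNOT(wires=[0, 1])
    qml.RZ(weights[2], wires=1)
    qml.RX(weights[3], wires=1)
    return qml.expval(qml.PauliZ(0) @ qml.PauliZ(1))

weights = np.array([0.1, 0.2, 0.3, 0.4], requires_grad=True)
grads = qml.grad(circuit)(weights)  # gradients w.r.t. the original input parameters
```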
Most of PennyLane’s circuit compilation transforms remain differentiable even after applying JIT compilation. As opposed to traditional (classical) compilation, which is performed prior to runtime, JIT compiles a program dynamically at runtime. For instance, in JAX [6], this is accomplished for jax.jit using tracing: On the first execution, a “tracer” object is sent through the code, keeping track of the operations performed on it. This is then compiled into a function in an intermediate representation and can be run with real inputs. The first execution of JIT-compiled code takes substantially longer due to the time required to trace; however, subsequent executions are much faster, as they use the optimized version. Other frameworks have analogous functionality, such as tf.function in TensorFlow, which first constructs a computational graph using a traced object.
Here we show an explicit example using the JAX interface and jax.jit. Enabling compatibility of quantum transforms with JIT requires different considerations than preserving differentiability with respect to arguments. In particular, conditional statements present a challenge for the tracer. For example, suppose that in the process of merging a sequence of \(RZ\) gates, a cumulative angle of 0 is obtained. When optimizing by hand, or in non-jitted code, whether to apply that gate can be decided using a conditional statement that checks the value of the angle. However, this leads to two branches of the code with different structure: one in which a gate is applied and one in which it is not. In JAX, one can use a special type of conditional (jax.lax.cond) that is recognized by the tracer; however, the output of both branches must have the same type and shape, and furthermore, it must be a JAX type such as an array (i.e., not a quantum gate). Thus, when jitting, one is limited to using the conditional branches to compute and return some function of a gate’s numerical parameters, and that gate runs even if the parameter is 0. PennyLane’s transforms contain separate conditional statements that first check whether a gate’s parameter is a tracer object prior to compiling. Then, jitted compilation can be done at the cost of occasionally applying a gate with a trivial parameter, whereas non-jitted compilation will always remove such gates. However, more sophisticated routines such as the two-qubit unitary decomposition are currently non-jittable: The structure of the circuit is computed as a function of the unitary and has four conditional branches with template circuits that depend on the computed number of CNOTs.
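The following is a minimal sketch (ours, not PennyLane's internal implementation) of the kind of tracer-aware conditional described above; the helper name is hypothetical, and qml.math.is_abstract is used to detect traced parameters:

```python
import pennylane as qml
from pennylane import math

def apply_merged_rz(theta, wire):
    """Hypothetical helper illustrating a tracer-aware conditional for a
    rotation-merging transform."""
    if not math.is_abstract(theta) and math.allclose(theta, 0.0):
        # Concrete zero angle (non-jitted path): safe to drop the gate entirely.
        return
    # Traced (jitted) parameter, or a nonzero concrete angle: always apply the
    # gate, possibly with a trivial parameter.
    qml.RZ(theta, wires=wire)
```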
Two-qubit decompositions notwithstanding, JIT compilation yields benefits in both the classical and quantum aspects of running an algorithm. The classical preprocessing, which involves applying the quantum transforms, becomes significantly faster after the first jitted evaluation, as it has been compiled to machine code that is then executed directly, bypassing the Python interpreter. We can then run the optimized circuit on quantum devices, which will typically have lower depth and fewer operations, without additional overhead. Furthermore, the same can be done for the computation of gradients: Not only may the number of quantum evaluations be reduced as a result of compiling the circuit, but the jitted gradient, after the first execution, will run significantly faster than the original.
Consider the circuit in Figure 12, which has 20 parameters. Using jax.grad to evaluate the gradient with the parameter-shift rule leads to 41 device executions (two per parameter, plus one for the initial forward pass). Running the same circuit but compiling with the qml.transforms.single_qubit_fusion transform merges all adjacent single-qubit gates. The compiled circuit has 15 effective parameters, as well as a lower circuit depth (expanding the rotations gives an original depth of 10 and a compiled depth of 8). Table 1 presents sample runtimes for computing the gradients of this circuit. Even though the number of quantum evaluations is lower, the application of the transform adds significant time overhead. However, jit can be applied to both the original and compiled gradient functions to speed up the computation substantially. New values of the parameters can be passed to the jitted and compiled gradient function without needing to run jit again. In variational algorithms, where a circuit and its gradient are evaluated on the order of thousands of times with different parameters over the course of the optimization process, this leads to a substantial speedup. While this is only a small example, it demonstrates that ensuring compilation transforms preserve differentiability of the input parameters can lead to benefits both when executing on a quantum device and on a simulator.
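A minimal sketch of this workflow (ours; the small circuit below stands in for the 20-parameter circuit of Figure 12, and runtimes will differ from Table 1) is:

```python
import jax
import jax.numpy as jnp
import pennylane as qml

dev = qml.device("default.qubit", wires=2)

def ansatz(params):
    # Small stand-in for the circuit of Figure 12
    qml.RZ(params[0], wires=0)
    qml.RY(params[1], wires=0)
    qml.RZ(params[2], wires=0)
    qml.CNOT(wires=[0, 1])
    qml.RY(params[3], wires=1)
    return qml.expval(qml.PauliZ(1))

# Original and compiled QNodes built from the same quantum function
original = qml.QNode(ansatz, dev, interface="jax")
compiled_qfunc = qml.compile(pipeline=[qml.transforms.single_qubit_fusion])(ansatz)
compiled = qml.QNode(compiled_qfunc, dev, interface="jax")

params = jnp.array([0.1, 0.2, 0.3, 0.4])

# Jitted gradient functions: the first call traces and compiles; subsequent
# calls with new parameter values reuse the compiled version.
grad_original = jax.jit(jax.grad(original))
grad_compiled = jax.jit(jax.grad(compiled))

print(grad_original(params))
print(grad_compiled(params))
```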
3.3.3 Differentiable Error Mitigation.
In this section, we will implement fully differentiable ZNE (i.e., the extrapolated value is itself differentiable with respect to the input circuit parameters) with unitary folding [9]. In this method, a circuit U is first applied, followed by repetitions of \(U^{\dagger }U\). Following the example of the error-mitigation library mitiq [19], the number of such folds \(n_{f}\) is computed from a scale factor \(\lambda\) according to the expression \(n_{f} = (\lambda - 1)/2\), rounded to the nearest integer. A qfunc transform implementing this is shown in the left panel of Figure 13.
ZNE is naturally implemented as a batch transform, shown in the right panel of Figure 13. The transform creates multiple versions of the initial tape with different amounts of folding. It then returns those tapes, along with a processing function that performs the noise extrapolation on the results of executing the tapes (implemented in fit_zne).
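A rough sketch of this structure (ours, not the code of Figure 13) is shown below; it uses the built-in qml.transforms.fold_global for the folding and a closed-form linear fit, written with qml.math, so that the extrapolated value remains differentiable:

```python
import pennylane as qml

@qml.batch_transform
def zne(tape, scale_factors):
    """Return folded copies of `tape` and a processing function that
    extrapolates their results to zero noise (illustrative sketch)."""
    tapes = [qml.transforms.fold_global(tape, s) for s in scale_factors]

    def fit_zne(results):
        # Differentiable linear fit of result vs. scale factor; the intercept
        # is the zero-noise estimate.
        s = qml.math.stack(scale_factors)
        y = qml.math.stack([qml.math.squeeze(r) for r in results])
        s_mean, y_mean = qml.math.mean(s), qml.math.mean(y)
        slope = qml.math.sum((s - s_mean) * (y - y_mean)) / qml.math.sum((s - s_mean) ** 2)
        return y_mean - slope * s_mean

    return tapes, fit_zne
```

Decorating a QNode with @zne([1.0, 2.0, 3.0]) would then return the extrapolated value directly.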
The zne batch transform can be applied either to a quantum tape, to obtain the transformed tapes and processing function, or to a QNode, to directly obtain the mitigated value. In fact, once defined, a user can obtain error-mitigated results simply by adding a single line to their code: the @zne decorator. Pseudocode for this process is shown in Figure 14 (the full example is included in our GitHub repository). Furthermore, if the fit_zne function is implemented in a differentiable manner (using, e.g., the PennyLane math module), then the extrapolated value itself will be differentiable with respect to the circuit input parameters.
In addition to using the qml.transforms module to build differentiable ZNE methods from scratch, as of v0.25 PennyLane includes a built-in batch transform, mitigate_with_zne. The transform is accompanied by custom folding functions, as well as fitting options such as Richardson and polynomial extrapolation.
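As an illustration (a sketch with our own choice of circuit and scale factors; the default.mixed device below is a noiseless stand-in for noisy hardware), the built-in transform can be used as a decorator:

```python
import pennylane as qml
from pennylane import numpy as np
from pennylane.transforms import mitigate_with_zne, fold_global, richardson_extrapolate

# Stand-in device; in practice this would be noisy hardware or a noisy simulator
dev = qml.device("default.mixed", wires=2)

@mitigate_with_zne([1.0, 2.0, 3.0], fold_global, richardson_extrapolate)
@qml.qnode(dev)
def circuit(w):
    qml.RY(w[0], wires=0)
    qml.CNOT(wires=[0, 1])
    qml.RY(w[1], wires=1)
    return qml.expval(qml.PauliZ(0) @ qml.PauliZ(1))

w = np.array([0.5, 0.8], requires_grad=True)
mitigated = circuit(w)        # error-mitigated expectation value
grads = qml.grad(circuit)(w)  # the mitigated value remains differentiable in w
```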
3.3.4 Learning Noise Parameters with Transforms.
The differentiability of transform parameters can be leveraged for characterization tasks such as learning noise parameters. Consider a simple noisy device where every single-qubit gate is depolarized by the same qubit-dependent amount. We can simulate this noise using a transform that applies an appropriate depolarization channel after every single-qubit gate, as shown in Figure 15.
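A minimal sketch of such a transform (ours, not the code of Figure 15) is given below; it assumes integer wire labels so that the per-wire probabilities p can be indexed directly, and the transformed circuit must run on a mixed-state simulator such as default.mixed:

```python
import pennylane as qml

@qml.qfunc_transform
def add_depolarizing_noise(tape, p):
    """Apply a depolarizing channel with per-wire probability p[w] after every
    single-qubit gate acting on wire w (illustrative sketch)."""
    for op in tape.operations:
        qml.apply(op)
        if len(op.wires) == 1:
            wire = op.wires[0]
            qml.DepolarizingChannel(p[wire], wires=wire)
    for mp in tape.measurements:
        qml.apply(mp)
```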
Suppose the set of true depolarization parameters is \(\mathbf {p} = [0.05, 0.02]\), i.e., \(p = 0.05\) for the first qubit and \(p = 0.02\) for the second. In the left panel of Figure 16, we choose a suitable circuit for experimentation and use a transform to learn both depolarization parameters. We first set up a representation of our noisy device and bind to it a QNode that applies the transform using the true depolarization parameters. Note that this step is merely for simulation purposes; it could be substituted with a noisy hardware device that does not apply any transforms at all.
Next, we optimize to learn these depolarization parameters, using a simple least-squares loss as the cost. It computes the difference between the output of a transformed QNode whose transform parameters we are trying to learn (which runs on an ideal, simulated device) and that of the noisy QNode (which runs on a device with the true amount of depolarization noise). Optimization was performed using gradient descent, and the results are plotted in the right panel of Figure 16 (the full code can be found in the GitHub repository). The learned parameters are close to the true values, with variability due to shot noise. While this is a simple example, it could be generalized to more complex noise models, such as gate- and qubit-dependent noise, other types of noise such as gate over-rotation, and compositions of noise channels.