Reversible logic circuit synthesis

2002, Proceedings of the 2002 …

Vivek V. Shende, Aditya K. Prasad, Igor L. Markov, and John P. Hayes
Advanced Computer Architecture Laboratory, University of Michigan, Ann Arbor, MI 48109-2122
{vshende,akprasad,imarkov,jhayes}@umich.edu

ABSTRACT

Reversible or information-lossless circuits have applications in digital signal processing, communication, computer graphics and cryptography. They are also a fundamental requirement in the emerging field of quantum computation. We investigate the synthesis of reversible circuits that employ a minimum number of gates and contain no redundant input-output line-pairs (temporary storage channels). We prove constructively that every even permutation can be implemented without temporary storage using NOT, CNOT and TOFFOLI gates. We describe an algorithm for the synthesis of optimal circuits and study the reversible functions on three wires, reporting distributions of circuit sizes. Finally, in an application important to quantum computing, we synthesize oracle circuits for Grover's search algorithm, and show a significant improvement over a previously proposed synthesis algorithm.

1. INTRODUCTION

In most computing tasks, the number of output bits is relatively small compared to the number of input bits. For example, in a decision problem, the output is only one bit (yes or no) and the input can be as large as desired. However, computational tasks in digital signal processing, communication, computer graphics and cryptography require that all of the information encoded in the input be preserved in the output. Some of those tasks are important enough to justify adding new microprocessor instructions to the HP PA-RISC (MAX and MAX-2), Sun SPARC (VIS), PowerPC (AltiVec), IA-32 and IA-64 (MMX) instruction sets [13, 8].
In particular, new bit-permutation instructions were shown to vastly improve performance of several standard algorithms, including matrix transposition and DES, as well as two recent cryptographic algorithms, Twofish and Serpent [8]. Bit permutations are a special case of reversible functions, that is, functions that permute the set of possible input values. For example, the butterfly operation $(x, y) \to (x + y, x - y)$ is reversible but is not a bit permutation. It is a key element of Fast Fourier Transform algorithms and has been used in application-specific processors from Tensilica. One might expect to get further speed-ups by adding instructions to allow computation of an arbitrary reversible function. The problem of chaining such instructions together provides one motivation for studying reversible logic circuits, that is, logic circuits composed of gates computing reversible functions.

(Footnote: This work was partially supported by the Undergraduate Summer Research Program at the University of Michigan and by the DARPA QuIST program. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing official policies or endorsements, either expressed or implied, of the Defense Advanced Research Projects Agency (DARPA) or the U.S. Government. 0-7803-7607-2/02/$17.00 ©2002 IEEE)

Reversible circuits are also interesting because the loss of information implies energy loss [2]. Younis and Knight [16] showed that some reversible circuits can be made asymptotically energy-lossless if their delay is allowed to be arbitrarily large. Currently, energy losses due to irreversibility are dwarfed by the overall power dissipation, but this may change if power dissipation improves. In particular, reversibility is important for nanotechnologies where switching devices with gain are difficult to build. Finally, reversible circuits can be viewed as a special case of quantum circuits because quantum evolution must be reversible [9].
Classical (non-quantum) reversible gates are subject to the same "circuit rules", whether they operate on classical bits or quantum states. In fact, popular universal gate libraries for quantum computation often contain as subsets universal gate libraries for classical reversible computation. While the speed-ups which make quantum computing attractive are not available without purely quantum gates, logic synthesis for classical reversible circuits is a first step toward synthesis of quantum circuits. Moreover, algorithms for quantum communications and cryptography often do not have classical counterparts because they act on quantum states, even if their action in a given computational basis corresponds to classical reversible functions on bit-strings. Another connection between classical and quantum computing comes from Grover's search algorithm. Circuits for Grover's algorithm contain large parts consisting of NOT, CNOT and TOFFOLI gates only [9].

We review existing work on classical reversible circuits [10]. Toffoli [14] gives constructions for an arbitrary reversible or irreversible function in terms of a certain gate library. However, his method makes use of a large number of temporary storage channels, i.e., input-output wire-pairs other than those on which the function is computed. Sasao and Kinoshita show that any conservative function ($f(x)$ is conservative if, for all $x$, $x$ and $f(x)$ contain the same number of 1s in their binary expansions) has an implementation with only three temporary storage channels using a certain fixed library of conservative gates, although no explicit construction is given [11]. Kerntopf uses exhaustive search methods to examine small-scale synthesis problems and related theoretical questions about reversible circuit synthesis [5]. Our work pursues synthesis of optimal reversible circuits which can be implemented without temporary storage channels.
In Section 3 we show by explicit construction that any reversible function which performs an even permutation on the input values can be synthesized using the CNTS (CNOT, NOT, TOFFOLI, and SWAP) gate library under such constraints. In Section 4 we present synthesis algorithms for decomposing such a function into a circuit with a minimal number of gates. Besides branch-and-bound, we use a dynamic programming technique that exploits reversibility. Applications to quantum computing are examined in Section 5.

(a) [circuit diagram with inputs x, y, z and outputs x', y', z']

(b)
    1 0 0
    0 1 1
    0 0 1

(c)
    x y z   x' y' z'
    0 0 0   0  0  0
    0 0 1   0  0  1
    0 1 0   0  1  1
    0 1 1   0  1  0
    1 0 0   1  0  0
    1 0 1   1  0  1
    1 1 0   1  1  1
    1 1 1   1  1  0

Figure 1: (a) A $3 \times 3$ reversible circuit computing CNOT, (b) the corresponding matrix, and (c) its truth table.

2. BACKGROUND

In conventional (irreversible) circuit synthesis, one typically starts with a universal gate library and some specification of a Boolean function. The goal is to find a logic circuit that implements the Boolean function and minimizes a given cost metric, e.g., the number of gates or the circuit depth. At a high level, reversible circuit synthesis is just a special case in which no fanout is allowed and all gates must be reversible.

DEFINITION 1. A gate is reversible if the (Boolean) function it computes is bijective.

A necessary condition is that the gate have the same number of input and output wires. If it has $k$, it is called a $k \times k$ gate, or a gate on $k$ wires. We will think of the $m$th input wire and the $m$th output wire as really being the same wire. Many gates satisfying these conditions have been examined. We will consider a specific set defined by Toffoli [14].

DEFINITION 2. A k-CNOT is a $(k + 1) \times (k + 1)$ gate. It leaves the first $k$ inputs unchanged, and inverts the last iff all others are 1.

Clearly the k-CNOT gates are all reversible. The first three of these have special names. The 0-CNOT is just an inverter, referred to as a NOT gate, and denoted N.
It performs the operation $(x) \to (x \oplus 1)$, where $\oplus$ denotes XOR. The 1-CNOT, which performs the operation $(y, x) \to (y, x \oplus y)$, is referred to as a Controlled-NOT, or CNOT (C). The 2-CNOT is called a TOFFOLI (T) gate, and performs the operation $(z, y, x) \to (z, y, x \oplus yz)$. We will also be using another reversible gate, called the SWAP (S) gate. It is a $2 \times 2$ gate which exchanges the inputs; that is, $(x, y) \to (y, x)$. One reason for choosing these particular gates is that they appear often in the quantum computing context [9].

We will be working with circuits from a given, limited gate library. Usually, this will be the CNTS gate library, consisting of the CNOT, NOT, TOFFOLI, and SWAP gates defined above.

DEFINITION 3. A well-formed reversible logic circuit is an acyclic combinational logic circuit in which all gates are reversible, and are interconnected without fanout.

As with reversible gates, a reversible circuit has the same number of input and output wires; again we will call a reversible circuit with $n$ inputs an $n \times n$ circuit, or a circuit on $n$ wires. We can also think of an $n \times n$ circuit as the inner workings of an $n \times n$ reversible gate. This also allows us to draw reversible circuits as arrays of horizontal lines representing wires, in which gates are represented by vertically-oriented symbols. For example, in Figure 1a, we see a reversible circuit drawn in standard notation [9]. The $\oplus$ symbols represent inverters and the $\bullet$ symbols represent controls. A vertical line connecting a control to an inverter means that the inverter is only applied if the wire on which the control is set carries a 1 signal. Thus, the gates used are, from left to right, TOFFOLI, NOT, TOFFOLI, and NOT.

Figure 2: Reversible circuit equivalences: (a) $T(1,2;3) \cdot N(1) \cdot T(1,2;3) \cdot N(1) = C(2;3)$, and (b) $C(3;2) \cdot C(2;3) \cdot C(3;2) = S(2,3)$.
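The gate definitions and the two equivalences in the Figure 2 caption can be checked by brute force on three wires. The following is a minimal sketch, assuming wire 1 is the most significant bit of the state (which matches the truth table of Figure 1); the helper names `k_cnot`, `swap_gate` and `run` are illustrative and do not come from the paper.

```python
def k_cnot(n, controls, target):
    """Truth table of a k-CNOT on n wires, as a tuple indexed by input value.

    Assumption: wire 1 is the most significant bit of the state."""
    cmask = sum(1 << (n - c) for c in controls)
    tbit = 1 << (n - target)
    return tuple(x ^ tbit if x & cmask == cmask else x for x in range(1 << n))

def swap_gate(n, a, b):
    """Truth table of an S gate exchanging wires a and b."""
    table = []
    for x in range(1 << n):
        if ((x >> (n - a)) & 1) != ((x >> (n - b)) & 1):
            x ^= (1 << (n - a)) | (1 << (n - b))
        table.append(x)
    return tuple(table)

def run(*gates):
    """Compose gates left to right (the first argument is applied first)."""
    perm = tuple(range(len(gates[0])))
    for g in gates:
        perm = tuple(g[y] for y in perm)
    return perm

N1 = k_cnot(3, (), 1)          # NOT on wire 1
C23 = k_cnot(3, (2,), 3)       # CNOT: control wire 2, target wire 3
C32 = k_cnot(3, (3,), 2)       # CNOT: control wire 3, target wire 2
T123 = k_cnot(3, (1, 2), 3)    # TOFFOLI: controls 1 and 2, target 3

fig2a_holds = run(T123, N1, T123, N1) == C23
fig2b_holds = run(C32, C23, C32) == swap_gate(3, 2, 3)
print(fig2a_holds, fig2b_holds)
```

Representing each gate by its full truth table is only practical for a handful of wires, but it makes circuit equivalence a plain tuple comparison.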
Since we will be dealing only with bijective functions, we represent them using the cycle notation from elementary algebra, where a permutation is represented by disjoint cycles of variables. For example, the truth table in Figure 1c is represented by $(2, 3)(6, 7)$ because the corresponding function swaps 010 (2) and 011 (3), and 110 (6) and 111 (7). The set of all permutations of $n$ marks is denoted $S_n$, so the set of bijective functions with $n$ binary inputs is $S_{2^n}$. We will call $(2, 3)(6, 7)$ CNT-constructible since it can be computed by a circuit with gates from the CNT gate library. More generally:

DEFINITION 4. Let L be a (reversible) gate library. An L-circuit is a circuit with only gates from L. A permutation $\pi \in S_{2^n}$ is L-constructible if it can be computed by an $n \times n$ L-circuit.

In Figure 2a we see that the circuit in Figure 1a is equivalent to one consisting of a single C gate. Pairs of circuits computing the same function are very useful, since we can substitute one for another. On the right, we see similarly that three C gates can be used to replace an S gate. Figure 2 therefore shows us that the C and S gates in the CNTS gate library can be removed without losing computational power. We will still use the CNTS gate library in synthesis to reduce gate counts and potentially speed up synthesis. This is motivated by Figure 2, which shows how to replace four gates with one C gate, and thus up to 12 gates with one S.

DEFINITION 5. Two reversible circuits are equivalent if they compute the same function.

Figure 2a illustrates the use of "temporary storage". A C gate only needs two wires, but if we simulate it with two N gates and two T gates, we need a third wire. The value of the third wire emerges unaltered. More generally, consider the general reversible circuit of Figure 3. The top $n - k$ lines transfer $n - k$ signals $Y$ to the corresponding wires on the other side of the circuit. The bottom $k$ wires enter as the input value $X$ and emerge as the output value $f(X)$.
These wires usually serve as an essential workspace for computing $f(X)$. Following Toffoli, we say this circuit computes $f(X)$ using $n - k$ lines of temporary storage [14].

Figure 3: Circuit C with $n - k$ wires $Y$ of temporary storage.

DEFINITION 6. Let L be a reversible gate library. Then L is universal if for all $k$ and all permutations $\pi \in S_{2^k}$, there exists some $l$ such that some L-constructible circuit computes $\pi$ using $l$ wires of temporary storage. (Note that we do not assume that fixed inputs are available.)

It is a result of Toffoli's that the CNT gate library is universal; he also showed that one can bound the amount of temporary storage required to compute a permutation in $S_{2^n}$ by $n - 3$. We are interested in trying to synthesize permutations using no extra storage. For an example of what limitations this puts on the set of computable permutations, suppose we were working with only the C gate library. Then the following is true:

PROPOSITION 1. Every C-constructible permutation computes an invertible linear transformation. Moreover, every invertible linear transformation is computable by a C-constructible circuit. Finally, $S_{2^n}$ has $\prod_{i=0}^{n-1} (2^n - 2^i)$ C-constructible permutations.

Proof: A function $f$ is linear iff $f(x \oplus y) = f(x) \oplus f(y)$, where $\oplus$ denotes bitwise XOR. The composition of two linear functions is a linear function. Therefore, to show that all C-circuits are linear, it suffices to prove that each C gate computes a linear transformation. Indeed, $C(x_1 \oplus x_2, y_1 \oplus y_2) = (x_1 \oplus x_2, x_1 \oplus x_2 \oplus y_1 \oplus y_2) = (x_1, x_1 \oplus y_1) \oplus (x_2, x_2 \oplus y_2) = C(x_1, y_1) \oplus C(x_2, y_2)$. On the other hand, observe that linearity in terms of bitwise $\oplus$ matches linearity in vector spaces over the two-element field $\mathbb{F}_2$. In the basis $10\ldots0$, $01\ldots0$, $\ldots$, $0\ldots01$, the matrices corresponding to individual C gates account for all the elementary row-addition matrices. An example is given in Figure 1.
Because there is only one nonzero scalar in $\mathbb{F}_2$, any invertible matrix in $GL(\mathbb{F}_2)$ can be written as a product of these. Thus, any invertible linear transformation can be computed by a C-circuit. Finally, a linear mapping is fully defined by its values on basis vectors. There are $2^n - 1$ ways of mapping the bit-string $10\ldots0$. Once we have fixed its image, there are $2^n - 2$ ways of mapping $010\ldots0$, and so on. Each time we map one of these basis bit-strings, it cannot map to the subspace spanned by the images of the previous bit-strings. There are $2^n - 2^i$ choices for the $i$-th basis bit-string (counting from $i = 0$). Once all basis bit-strings are mapped, the mapping of the rest is specified by linearity.

A similar result for CNT-constructible permutations requires:

DEFINITION 7. A permutation is called even if it can be written as the product of an even number of transpositions. The set of even permutations in $S_n$ is denoted $A_n$.

It is well-known that if a permutation can be written as the product of an even number of transpositions, then it may not be written as the product of an odd number of transpositions. Moreover, half the permutations in $S_n$ are even for $n > 1$.

PROPOSITION 2. Any $n \times n$ circuit with no $n \times n$ gates computes an even permutation [14].

To illustrate this proposition, consider the following example. A $2 \times 2$ circuit consisting of a single S gate performs the permutation $(1, 2)$, as the inputs 01 and 10 are interchanged, and the inputs 00 and 11 remain fixed. This permutation consists of one transposition, and is therefore odd. On the other hand, in a $3 \times 3$ circuit, one can check that a swap gate on the bottom two wires performs the permutation $(1, 2)(5, 6)$, which is even.

3. THEORETICAL RESULTS

Since the CNTS gate library contains no gates of size greater than three, Proposition 2 implies that every CNTS-constructible (without temporary storage) permutation is even for $n \geq 4$. The converse is true as well:

PROPOSITION 3. Every even permutation is CNT-constructible.

Proof: It follows from a result of Toffoli's [14] that every permutation in $S_{2^n}$ is CNT-constructible for $n < 4$. This is explicitly verified in Table 1. Suppose $n \geq 4$. Any permutation $\pi \in A_{2^n}$ can be written as the product of pairs of disjoint transpositions; for a proof, see Proposition 13 in the Appendix. It therefore suffices to show that pairs of disjoint transpositions are constructible, as we can chain together their circuits to obtain the circuit for $\pi$. First, we observe that the permutation with cycle decomposition $(0, 1)(2, 3)$ can be computed by a circuit consisting of an $(n - 2)$-CNOT gate with the controls on the top $n - 2$ wires and the inverter on the bottom wire, and with an N gate on each side of each control. We can replace the $(n - 2)$-CNOT gate with $8(n - 5)$ T gates [1, Corollary 7.4]. Let $S = \{a, b, c, d\}$ with $a, b, c, d$ distinct. In Proposition 14 in the Appendix, we explicitly construct a circuit computing a permutation $\pi_S$ such that $\pi_S(a) = 0$, $\pi_S(b) = 1$, $\pi_S(c) = 2$, and $\pi_S(d) = 3$. Because $(a, b)(c, d) = \pi_S (0, 1)(2, 3) \pi_S^{-1}$, we can sequence the circuits for $\pi_S$, $(0, 1)(2, 3)$, and $\pi_S^{-1}$ to obtain a circuit for $(a, b)(c, d)$.

Figure 5: Flowchart for algorithm from proof of Proposition 3. [Flowchart steps: decompose the permutation into a list of cycles, LC; while LC is not empty, remove a cycle C, decompose C into a list LT1 of transpositions, and append LT1 to LT; then, while LT is not empty, remove two transpositions (partitioning as necessary), create the subcircuit corresponding to the transposition pair, and append it to the circuit; finally, output the circuit.]

The following two corollaries give a way to synthesize circuits computing odd permutations using temporary storage, and also extend Proposition 3 to an arbitrary universal gate library.

PROPOSITION 4. Every permutation is CNT-constructible with at most one wire of temporary storage.

Proof: Suppose we have an $n \times n$ gate G computing $\pi \in S_{2^n}$, and we place it on the bottom $n$ wires of an $(n + 1) \times (n + 1)$ reversible circuit; let $\tilde{\pi}$ be the permutation computed by this new circuit. Then by Proposition 2, $\tilde{\pi}$ is even. Another way of seeing this is to observe that each cycle in $\pi$ appears "twice" in $\tilde{\pi}$, once when the top wire carries 0 and once when it carries 1. By Proposition 3, $\tilde{\pi}$ is CNT-constructible. Let C be a CNT-circuit computing $\tilde{\pi}$. Then C computes $\pi$ with one line of temporary storage.

PROPOSITION 5. For any universal gate library L and sufficiently large $n$, permutations in $A_{2^n}$ are L-constructible, and those in $S_{2^n}$ are realizable with at most one wire of temporary storage.

Proof: Since L is universal, there is some number $k$ such that we can compute the permutations corresponding to the NOT, CNOT, and TOFFOLI gates using $k$ total wires. Let $n > k$, and let $\pi \in A_{2^n}$. By Proposition 3, we can find a CNT-circuit C computing $\pi$, and can replace every occurrence of an N, C, or T gate with a circuit computing it. The second claim follows from Propositions 3 and 4.

Figure 4: Equivalences between reversible circuits used in our constructions (panels (a) and (b)).
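The example given for Proposition 2, and the "each cycle appears twice" remark in the proof of Proposition 4, can be checked numerically. This is a small sketch under the assumption that wire 1 is the most significant bit; `parity`, `swap_gate` and `lift` are hypothetical helper names, not part of the paper.

```python
def parity(perm):
    """Return 0 for an even permutation and 1 for an odd one.

    A cycle of length L is a product of L - 1 transpositions."""
    seen = [False] * len(perm)
    transpositions = 0
    for start in range(len(perm)):
        length, x = 0, start
        while not seen[x]:
            seen[x] = True
            x = perm[x]
            length += 1
        if length:
            transpositions += length - 1
    return transpositions % 2

def swap_gate(n, a, b):
    """Truth table of an S gate on wires a and b of n wires (wire 1 = MSB)."""
    table = []
    for x in range(1 << n):
        if ((x >> (n - a)) & 1) != ((x >> (n - b)) & 1):
            x ^= (1 << (n - a)) | (1 << (n - b))
        table.append(x)
    return tuple(table)

def lift(perm):
    """Place a gate on the bottom wires of a circuit with one extra top wire."""
    size = len(perm)
    return perm + tuple(size + y for y in perm)

s2 = swap_gate(2, 1, 2)   # the single transposition (1, 2): odd
s3 = swap_gate(3, 2, 3)   # (1, 2)(5, 6): even
print(parity(s2), parity(s3), parity(lift(s2)))
```

Lifting duplicates every cycle, so the number of transpositions doubles and the lifted permutation is always even, which is the observation behind Proposition 4.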
Proposition 3 is proven by an explicit construction, which constitutes a (non-optimal) circuit synthesis heuristic; see Figure 5. For permutations in $A_{2^n}$, the runtime and the length of the circuits produced are both $\Theta(n 2^n)$ in the worst case. In general, the complexity is $\Theta(ns)$, where $s$ is the number of indices moved by the permutation we are trying to synthesize. This agrees with the above estimate, as at most $2^n$ indices may be moved.

Later, we describe an algorithm which synthesizes optimal circuits using an arbitrary gate library. Roughly speaking, the performance of this algorithm is improved by using a smaller gate library, as long as the average circuit length is not significantly increased.

DEFINITION 8. For any gate libraries $L_1, \ldots, L_k$, an $L_1|\ldots|L_k$-circuit is an $L_1$-circuit followed by an $L_2$-circuit, ..., followed by an $L_k$-circuit. A permutation computed by an $L_1|\ldots|L_k$-circuit is $L_1|\ldots|L_k$-constructible.

PROPOSITION 6. For every CNT-circuit there is an equivalent CT|N-circuit.

Proof: First, we move all the N gates toward the outputs of the circuit. Each box in Figure 4a indicates a way of replacing an N|CT-circuit with a CT|N-circuit. Moreover, every possible way for an N gate to appear to the immediate left of a C or a T is accounted for, up to permuting the input and output wires. Now, number the non-N gates in the circuit in a reverse topological order starting from the outputs. In particular, if two gates appear at the same level in a circuit diagram, they must be independent, and one can order them arbitrarily. Let $d$ be the number of the highest-numbered gate with an N gate to its left. All N gates past the $d$-th gate G can be reordered with the G gate without introducing new N gates on the other side of G. In any event, as there are no remaining N gates to the left of G anymore, $d$ decreases. This process terminates when all the N gates are clustered together at the circuit outputs.
If we always cancel redundant N gates, then no more than two new gates will be introduced for each non-inverter originally in the circuit; additionally, there will be no more than $n$ total N gates when the process is complete. Thus if the original circuit had $l$ gates, then the new circuit has at most $3(l - 1) + n$ gates.

PROPOSITION 7. The permutation $\pi$ computed by a CT|N-circuit determines the $\pi_{CT}$ and $\pi_N$ computed by the CT and N sub-circuits.

Proof: C and T gates (and hence CT-circuits) fix 0. Thus $\pi(0) = \pi_N(0)$. But the image of 0 (or anything else) under an N-circuit completely determines $\pi_N$. Hence $\pi_{CT} = \pi\pi_N^{-1} = \pi\pi_N$.

Thus, if we want a CNT-circuit computing a permutation $\pi$, we can quickly compute $\pi_N$ and then simplify the problem to that of finding a CT-circuit for $\pi\pi_N$. By Proposition 6, we know that a minimal-gate circuit of this form has at most about three times as many gates as the gate-minimal circuit computing $\pi$.

Figure 4b shows how to move a C gate past a T gate, and accounts for every possible way a C can appear to the left of a T, up to permuting wires. From this, one might expect every CT-circuit to be equivalent to a T|C-circuit. This is not the case, however. We note that the proof of Proposition 6 in fact requires the ability to move an arbitrary number of N gates past any other given gate, while Figure 4b only allows us to move one C gate past a given T gate. However, many CT-circuits are equivalent to T|C-circuits, and the following result holds:

PROPOSITION 8. The permutation $\pi$ computed by a T|C-circuit determines permutations $\pi_T$ and $\pi_C$ computed by the sub-circuits.

Proof: By Proposition 1, any C-circuit is linear, so it suffices to check its values on the basis elements (binary expansions of $2^i$). As any T-circuit fixes these, $\pi(2^i) = \pi_C \circ \pi_T(2^i) = \pi_C(2^i)$, so the permutation $\pi$ uniquely determines $\pi_C$. Then $\pi_T = \pi\pi_C^{-1}$.
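The decomposition in Proposition 7 is easy to carry out in practice: since an N-circuit XORs its input with a fixed bit mask, the mask is simply $\pi(0)$. The sketch below assumes truth tables stored as tuples and wire 1 as the most significant bit; `k_cnot`, `compose` and `split_ct_n` are illustrative names, and the sample circuit is made up for the demonstration.

```python
def k_cnot(n, controls, target):
    """Truth table of a k-CNOT on n wires (assumption: wire 1 = MSB)."""
    cmask = sum(1 << (n - c) for c in controls)
    tbit = 1 << (n - target)
    return tuple(x ^ tbit if x & cmask == cmask else x for x in range(1 << n))

def compose(n, gates):
    """Permutation computed by applying the gates left to right."""
    perm = tuple(range(1 << n))
    for g in gates:
        perm = tuple(g[y] for y in perm)
    return perm

def split_ct_n(perm):
    """Split the permutation of a CT|N-circuit into (pi_CT, mask).

    CT-circuits fix 0, so perm[0] is the XOR mask applied by the N gates;
    XOR-ing the mask away undoes the N sub-circuit."""
    mask = perm[0]
    return tuple(y ^ mask for y in perm), mask

n = 3
# A made-up CT|N-circuit: three C/T gates followed by N gates on wires 1 and 3.
ct_part = [k_cnot(n, (1,), 2), k_cnot(n, (1, 2), 3), k_cnot(n, (3,), 1)]
n_part = [k_cnot(n, (), 1), k_cnot(n, (), 3)]
pi = compose(n, ct_part + n_part)
pi_ct, mask = split_ct_n(pi)
print(bin(mask), pi_ct == compose(n, ct_part))
```

Because every N-circuit is its own inverse, recovering $\pi_{CT}$ costs one XOR per table entry, which is why the reduction to CT-circuit synthesis is cheap.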
Proposition 8 implies that the number of T|C-constructible permutations in $S_{2^k}$ is equal to the number that are C-constructible times the number that are T-constructible. In Section 4, we use this to show that there exist CT-constructible permutations which are not T|C-constructible.

4. OPTIMAL SYNTHESIS

Now that we know which permutations admit circuit realizations without extra storage, we seek optimal realizations of this type. A circuit is optimal if no equivalent circuit has smaller cost; in our case, the cost function will be the number of gates in the circuit.

PROPOSITION 9. (Property of Optimality) If B is a sub-circuit of an optimal circuit A, then B is optimal.

Proof: Suppose not. Then let B' be a circuit with fewer gates than B, but computing the same function. If we replace B by B', we get another circuit A' which computes the same function as A. But since we have only modified B, A' must be as much smaller than A as B' is smaller than B. However, A was assumed to be optimal, hence this is a contradiction.

Note: equivalent, optimal circuits can have the same number of gates.

Proposition 9 allows us to build a library of small optimal circuits by dynamic programming because the first $m$ gates of an optimal $(m + 1)$-gate circuit form an optimal subcircuit. Therefore, to examine all optimal $(m + 1)$-gate circuits, we iterate through optimal $m$-gate circuits and add single gates at the end in all possible ways. Some of the $(m + 1)$-gate circuits found may compute permutations that have already been synthesized with fewer gates. Those which do not are optimal. In fact, instead of storing a library of all optimal circuits, we store one optimal circuit per synthesized permutation and also store optimal circuits of a given size together.

One way to find an optimal circuit for a given permutation $\pi$ is to generate all optimal $k$-gate circuits for increasing values of $k$ until a circuit computing $\pi$ is found. This procedure requires $\Theta((2^n)!)$
memory in the worst case ($n$ is the number of wires) and may require more memory than is available. Therefore, we stop growing the circuit library at $m$-gate circuits, when hardware limitations become an issue.

    CIRCUIT find_circ(COST, PERM, CURR_CCT)
      // assumes circuit library stored in LIB
      if (COST ≤ k)
        // If PERM can be computed by a circuit with ≤ k gates,
        // such a circuit must be in the library
        return CURR_CCT * LIB[COST].find(PERM)
      else
        // Try building the goal circuit from k-gate circuits
        for each C in LIB[k]
          // Divide PERM by permutation computed by C
          PERM2 ← PERM * INVERSE(C.perm)
          // and try to synthesize the result
          TEMP_CCT ← find_circ(COST - k, PERM2, CURR_CCT * C)
          if (TEMP_CCT != NIL)
            return TEMP_CCT
        return NIL

Figure 6: Finding a circuit of cost COST that computes permutation PERM (NIL returned if no such circuit exists). CURR_CCT, TEMP_CCT and records in LIB represent circuits, and include a field "perm" storing the permutation computed. The * character means concatenation of circuits, and NIL*<anything>=NIL. Here k denotes the largest circuit size stored in the library (called $m$ in the text).

The second stage of the algorithm uses the computed library of optimal circuits and, in our implementation, starts by reading the library from a file. Since little additional memory is available, we trade off runtime for memory. We use a technique known as depth-first search with iterative deepening (DFID) [6]. After a given permutation is checked against the circuit library, we seek circuits with $j = m + 1$ gates that implement this permutation. If none are found, we seek circuits with $j = m + 2$ gates, etc. This algorithm, in general, needs an additional termination condition to prevent infinite looping for inputs which cannot be synthesized with a given gate library. For each $j$, we consider all permutations optimally synthesizable in $m$ gates. For each such permutation $\rho$, we multiply $\pi$ by $\rho^{-1}$ and recursively try to synthesize the result using $j - m$ gates. When $j - m \leq m$, this can be done by checking against the existing library. Otherwise, the recursion depth increases.
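The two stages described above (a dynamic-programming library of optimal circuits up to $m$ gates, followed by iterative deepening that peels off optimal $m$-gate prefixes) can be sketched for three wires as follows. This is an illustrative reimplementation rather than the authors' code: truth tables are stored as tuples, wire 1 is assumed to be the most significant bit, and all function names, as well as the `max_cost` termination bound, are assumptions.

```python
from itertools import combinations

N_WIRES = 3
SIZE = 1 << N_WIRES
IDENTITY = tuple(range(SIZE))

def make_gate(controls, target):
    """Truth table of a k-CNOT; wire 1 is the most significant bit."""
    cmask = sum(1 << (N_WIRES - c) for c in controls)
    tbit = 1 << (N_WIRES - target)
    return tuple(x ^ tbit if x & cmask == cmask else x for x in range(SIZE))

def gate_library(kinds):
    """All N, C and T gates on N_WIRES wires whose letter appears in kinds."""
    wires = range(1, N_WIRES + 1)
    gates = {}
    for t in wires:
        others = [w for w in wires if w != t]
        if "N" in kinds:
            gates[f"N({t})"] = make_gate((), t)
        if "C" in kinds:
            for c in others:
                gates[f"C({c};{t})"] = make_gate((c,), t)
        if "T" in kinds:
            for c1, c2 in combinations(others, 2):
                gates[f"T({c1},{c2};{t})"] = make_gate((c1, c2), t)
    return gates

def build_library(gates, max_size=None):
    """best[p] is one optimal circuit for p; levels[d] lists the
    permutations whose optimal circuits have exactly d gates."""
    best = {IDENTITY: ()}
    levels = [[IDENTITY]]
    while levels[-1] and (max_size is None or len(levels) <= max_size):
        frontier = []
        for p in levels[-1]:
            for name, g in gates.items():
                q = tuple(g[y] for y in p)      # append gate g to circuit p
                if q not in best:               # not reachable with fewer gates
                    best[q] = best[p] + (name,)
                    frontier.append(q)
        levels.append(frontier)
    if not levels[-1]:
        levels.pop()
    return best, levels

def inverse(p):
    inv = [0] * len(p)
    for x, y in enumerate(p):
        inv[y] = x
    return tuple(inv)

def find_circ(cost, perm, best, levels):
    """Circuit with at most cost gates computing perm, or None (cf. Figure 6)."""
    m = len(levels) - 1
    if cost <= m:
        circ = best.get(perm)
        return circ if circ is not None and len(circ) <= cost else None
    for head in levels[m]:                      # optimal m-gate prefixes
        head_inv = inverse(head)
        rest = tuple(perm[x] for x in head_inv) # perm "divided by" head
        tail = find_circ(cost - m, rest, best, levels)
        if tail is not None:
            return best[head] + tail
    return None

def synthesize(perm, best, levels, max_cost):
    """Iterative deepening on gate count; max_cost is the termination bound."""
    for cost in range(max_cost + 1):
        circ = find_circ(cost, perm, best, levels)
        if circ is not None:
            return circ
    return None

def evaluate(circuit, gates):
    perm = IDENTITY
    for name in circuit:
        perm = tuple(gates[name][y] for y in perm)
    return perm

gates = gate_library("CNT")
best, levels = build_library(gates, max_size=2)
swap23 = (0, 2, 1, 3, 4, 6, 5, 7)               # exchanges wires 2 and 3
circuit = synthesize(swap23, best, levels, max_cost=8)
print(len(circuit), circuit)
```

Calling `build_library` without `max_size` enumerates every constructible permutation, which on three wires is small enough to cross-check the truncated-library search against the exhaustive one.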
Pseudocode for this stage of our algorithm is given in Figure 6.

In addition to being more memory-efficient than straightforward dynamic programming, our algorithm is faster than branching over all possible circuits. To quantify these improvements, consider a library of circuits of size $m$ or less, containing $l_m$ circuits of size $m$. We analyze the efficiency of the algorithms discussed by simulating them on an input permutation of cost $k$. Our algorithm requires $l_m^{\lfloor (k-1)/m \rfloor}$ references to the circuit library. Simple branching is no better than our algorithm with $m = 1$, and thus takes at least $l_1^k$ steps, which is $l_1^k / l_m^{\lfloor (k-1)/m \rfloor}$ times more than our algorithm. A speed-up can be expected because $l_m \ll l_1^m$, but specific numerical values of that expression depend on the numbers of suboptimal and redundant optimal circuits of length $m$. Indeed, Table 1 lists values of $l_m$ for various subsets of the CNTS gate library and $m = 3$. For example, for the NT gate library, $k = 12$, $\lfloor (k-1)/m \rfloor = 3$, $l_1 = 6$ and $l_m = 88$. Therefore the performance ratio is $l_1^k / l_m^{\lfloor (k-1)/m \rfloor} = 6^{12}/88^3 \approx 3194.2$. Yet, this comparison is incomplete because it does not account for time spent building circuit libraries. We point out, however, that this charge is amortized over multiple synthesis operations.

    Size      N     C     T    NC    CT     NT    CNT   CNTS
    12        0     0     0     0     0     47      0      0
    11        0     0     0     0     0   1690      0      0
    10        0     0     0     0     0   8363      0      0
    9         0     0     0     0     0  12237      0      0
    8         0     0     0     0     6   9339    577     32
    7         0     0     0    14   386   5097  10253   6817
    6         0     2     0   215  1688   2262  17049  17531
    5         0    24     0   474  1784    870   8921  11194
    4         0    60     5   393   845    296   2780   3752
    3         1    51     9   187   261     88    625    844
    2         3    24     6    51    60     24    102    134
    1         3     6     3     9     9      6     12     15
    0         1     1     1     1     1      1      1      1
    Total     8   168    24  1344  5040  40320  40320  40320
    Time      1     1     1    30   215     97     40     15

Table 1: Number of permutations computable in an optimal L-circuit using a given number of gates, for $L \subseteq \text{CNTS}$. Runtimes are given in seconds for a 2GHz Pentium-4 Xeon workstation.
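The performance ratio quoted for the NT library follows directly from the two step counts in the text. A short check, with `library_references` and `branching_steps` as hypothetical helper names:

```python
def library_references(l_m, k, m):
    """Library look-ups made by the two-stage algorithm: l_m ** floor((k-1)/m)."""
    return l_m ** ((k - 1) // m)

def branching_steps(l_1, k):
    """Lower bound on the steps of simple branching: l_1 ** k."""
    return l_1 ** k

# NT gate library on three wires, values taken from the text: l_1 = 6, l_3 = 88.
ratio = branching_steps(6, 12) / library_references(88, 12, 3)
print(f"{ratio:.1f}")
```

The exponent uses integer division, so for $k = 12$ and $m = 3$ it evaluates to 3, as stated in the text.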
In our experiments, generating a circuit library on three wires of up to three gates ($m = 3$) from the CNTS gate library takes less than a minute on a 2-GHz Pentium-4 Xeon. Using such libraries, all of Table 1 can be generated in minutes (footnote 1), but cannot be generated even in several hours using branching.

Let us now see what additional information we can glean from Table 1. Adding the C gate to the NT library appears to significantly reduce circuit size, but further adding the S gate does not help as much. To illustrate this, we show sample worst-case circuits on three wires for the NT, CNT, and CNTS gate libraries in Figure 7.

The totals in Table 1 can be independently determined by the following arguments. Every reversible function on three wires can be synthesized using the CNT gate library [14], and there are $8! = 40{,}320$ of these. All can be synthesized with the NT library because the C gate is redundant in the CNT library; see Figure 2a. On the other hand, adding the S gate to the library cannot decrease the number of synthesizable functions. Therefore, the totals in the NT and CNTS columns must be 40,320 as well. On the other side of the table, the number of possible N circuits is just $2^3 = 8$ since there are three wires, and there can be at most one N gate per wire in an optimal circuit (else we can cancel redundant pairs). By Propositions 6 and 7, the number of CN-constructible permutations should be the product of the number of N-constructible permutations and the number of C-constructible permutations, since any CN-constructible permutation can be written uniquely as a product of an N-constructible and a C-constructible permutation. So the total in the NC column should be the product of the totals in the C and N columns, which it is. Similarly, the total in the CNT column should be the product of the totals in the CT and N columns; this allows one to deduce the total number of CT-constructible permutations from values we know.
Finally, Proposition 1 states that the number of permutations implementable on $n$ wires with C gates is $\prod_{i=0}^{n-1} (2^n - 2^i)$. For $n = 3$ this yields $7 \cdot 6 \cdot 4 = 168$ and agrees with Table 1.

Footnote 1: Although complete statistics for all 16! 4-wire functions are beyond our reach, average synthesis times are less than one second when the input function can be implemented with eight gates or less. Functions requiring nine or more gates tend to take more than 1.5 hours to synthesize. In this case memory constraints limit our circuit library to 4-gate circuits, and the large jump in runtime after the 8-gate mark is due to an extra level of recursion.

Figure 7: Worst-case L-circuits where L is NT, CNT and CNTS.

We can also add to the discussion of T|C-constructible circuits we began in Section 3. By Proposition 8, the number of T|C-constructible permutations can be computed as the product of the numbers of T-constructible and C-constructible permutations. Table 1 mentions 24 T-circuits and 168 C-circuits on three wires. The product, 4032, is less than 5040, the number of CT-constructible permutations on three wires. Therefore:

PROPOSITION 10. There exist CT-constructible permutations in $S_8$ which are not T|C-constructible.

Finally, we observe that the longest optimal C-circuits on 3, 4 and 5 wires merely permute the wires. Our experimental data supports the conjecture that no optimal C-circuit on $n$ wires has more than $3(n - 1)$ gates, and that the ones with $3(n - 1)$ gates represent wire permutations that leave no wire fixed. However, an information-theoretic counting argument shows that the worst-case gate count of an optimal C-circuit is $\Omega(n^2/\log n)$. This asymptotic bound is produced by comparing the number of unique C-circuits on $n$ wires and the number of circuits formed by chains of up to $d$ C gates [12]. Identifying specific worst-case circuits and describing families with worst-case asymptotics remains a challenge.

5. QUANTUM SEARCH APPLICATIONS

Quantum computation is necessarily reversible, and quantum circuits generalize their reversible counterparts in the classical domain [9]. Instead of wires, information is stored on qubits, whose states we write as $|0\rangle$ and $|1\rangle$ instead of 0 and 1. There is an added complexity: a qubit can be in a superposition state that combines $|0\rangle$ and $|1\rangle$. Specifically, $|0\rangle$ and $|1\rangle$ are thought of as vectors of the computational basis, and the value of a qubit can be any unit vector in the space they span. The scenario is similar when considering many qubits at once: the possible configurations of the corresponding classical system are now the computational basis, and any unit vector in the linear space they span is a valid configuration of the quantum system.

Just as the classical configurations of the circuit persist as basis vectors of the space of quantum configurations, so too classical reversible gates persist in the quantum context. Non-classical gates are allowed; in fact, any (invertible) norm-preserving linear operator is allowed as a quantum gate. However, quantum gate libraries often have very few non-classical gates [9]. An important example of a non-classical gate (and the only one used in this paper) is the Hadamard gate H. It operates on one qubit, and is defined as follows: $H|0\rangle = \frac{1}{\sqrt{2}}(|0\rangle + |1\rangle)$, and $H|1\rangle = \frac{1}{\sqrt{2}}(|0\rangle - |1\rangle)$. Note that because H is linear, giving the images of the computational basis elements defines it completely.

During the course of the computation, the quantum state can be anywhere in the linear space spanned by the computational basis. However, a serious limitation is imposed by quantum measurement, performed after a quantum circuit is executed. A measurement nondeterministically collapses the state onto some vector in a basis corresponding to the measurement being performed.
The probabilities of outcomes depend on the measured state: basis vectors [nearly] orthogonal to the measured state are least likely to appear as outcomes of measurement. For example, if $H|0\rangle$ were measured in the computational basis, it would be seen as $|0\rangle$ half the time, and $|1\rangle$ the other half.

Despite this limitation, quantum circuits have significantly more computational power than classical circuits. In this work, we consider Grover's search algorithm, which is provably faster than any non-quantum algorithm for the same problem [4]. Grover's algorithm finds, among $N$ unordered items, one that satisfies a given predicate. No structural information about the predicate is used; it is treated as a black box. Grover's algorithm completes in $\Theta(\sqrt{N})$ time, not counting the evaluation of the predicate, thereby achieving a quadratic speedup over the best possible classical algorithms if no structural information about the predicate is used in search.

Grover's algorithm presupposes that the items are indexed from 0 to $2^n - 1$ (padding is required when $N$ is not a power of two). Its first step is to use H gates to bring the system into a superposition of all the computational basis states. Then, a transformation called the Grover operator iteratively changes the state of the system so that subsequent measurement will, with high probability, yield the index of a desired item. Since the result can be easily verified and since the input is classical, the procedure can be repeated until successful. If the procedure guarantees success with probability $> 0.5$, relatively few (poly($n$)) repetitions are required to decrease the overall probability of failure below that of classical computers.

To implement the Grover operator, one needs an oracle circuit that represents the search predicate $f(x)$. This circuit transforms an arbitrary basis state $|x\rangle$ to the state $(-1)^{f(x)}|x\rangle$.
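The effect of repeated Grover iterations on the measurement probability can be illustrated with a small state-vector simulation. This is a sketch under stated assumptions, not a circuit: amplitudes are stored in a plain list, the oracle is applied as the sign flip $(-1)^{f(x)}$ described above, and the rest of the Grover operator is taken in its standard form as inversion about the mean amplitude. The function name and the choice of marked index are illustrative.

```python
import math

def grover_success_probability(n_qubits, marked, iterations):
    """Probability of measuring a marked index after the given number of
    Grover iterations, starting from the uniform superposition."""
    size = 2 ** n_qubits
    # H gates applied to |0...0> give equal amplitudes on all basis states.
    amps = [1 / math.sqrt(size)] * size
    for _ in range(iterations):
        # Oracle: |x> -> (-1)^f(x) |x>, with f(x) = 1 exactly on marked indices.
        amps = [-a if x in marked else a for x, a in enumerate(amps)]
        # Remaining part of the Grover operator: inversion about the mean.
        mean = sum(amps) / size
        amps = [2 * mean - a for a in amps]
    return sum(amps[x] ** 2 for x in marked)

if __name__ == "__main__":
    for k in range(4):
        print(k, grover_success_probability(3, {5}, k))
```

With one marked item among eight, the success probability rises from 1/8 with no iterations to 121/128 after two, and then starts to fall again, which is why the number of iterations has to be chosen in advance.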
The oracle is followed by (i) several Hadamard gates, (ii) a subcircuit which flips the sign on all computational basis states other than $|0\rangle$, and (iii) more Hadamard gates. A sample Grover-operator circuit for a search on 2 qubits is shown in Figure 8 and uses one qubit of temporary storage [9]. The search space here is $\{0, 1, 2, 3\}$, and the desired indices are 0 and 3. The oracle circuit is highlighted by a dashed line. While the portion following the oracle is fixed, the oracle may vary depending on the search criterion. Unfortunately, most works on Grover's algorithm do not address the synthesis of oracle circuits and their complexity. According to Bettelli et al. [3], this is a major obstacle for automatic compilation of high-level quantum programs, and little help is available.

Figure 8: A Grover-operator circuit with oracle highlighted.

Size    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
XOR     1  4  6  4  4 12 18 12  6 12 19 16 10  8 10 16 19 12  6 12 18 12  4  4  6  4  1
OPT T   1  4  6  4  4 12 21 24 29 33 44 46 22  5  1  0  0  0  0  0  0  0  0  0  0  0  0
OPT     1  7 21 35 36 28 28 36 35 21  7  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

Table 2: Circuit size distribution of 3+2 ROM-based circuits synthesized using various algorithms.

Circuit size      0  1  2  3  4  5  6  7  Total
No. of circuits   1  7 21 35 35 24  4  1  128

Table 3: Optimal 3+1 oracle circuits for Grover's search.

PROPOSITION 11. With one temporary storage qubit, the problem of synthesizing a quantum circuit that transforms computational basis states $|x\rangle$ to $(-1)^{f(x)}|x\rangle$ can be reduced to a problem in the synthesis of classical reversible circuits [9].

Proof: Define the permutation $\pi_f$ by $\pi_f(x, y) = (x, y \oplus f(x))$, and define a unitary operator $U_f$ by letting it permute the states of the computational basis according to $\pi_f$. The additional qubit is initialized to $|-\rangle = H|1\rangle$ so that $U_f|x, -\rangle = (-1)^{f(x)}|x, -\rangle$. If we now ignore the value of the last qubit, the system is in the state $(-1)^{f(x)}|x\rangle$, which is exactly the state needed for Grover's algorithm. Since a quantum operator is completely determined by its behavior on a given computational basis, any circuit implementing $\pi_f$ implements $U_f$. In particular, since reversible gates may be implemented with quantum technology, we can synthesize $U_f$ as a reversible logic circuit.

Quantum computers implemented so far are severely limited by the number of simultaneously available qubits. While $n$ qubits are necessary for Grover's search, one should try to minimize the number of additional temporary storage qubits. One such qubit is entailed by Proposition 11, which uses it to turn a classical reversible circuit into one that alters the phase of quantum states. Another qubit is required to synthesize circuits for odd $\pi_f$, according to Proposition 4.

Constructively, given $\pi_f$, we can use the algorithm of Section 4 to find an optimal circuit for it. Table 3 gives the optimal circuit sizes of functions $\pi_f$ corresponding to 3-input 1-output functions $f$ ("3+1 oracles") which can be synthesized on four wires. These circuits are significantly smaller than many optimal circuits on four wires. This is not surprising, as they perform less computation.

In Grover oracle circuits, the main input lines preserve their input values and only the temporary storage lines can change their values. Therefore, Travaglione et al. [15] studied circuits where some lines cannot be changed even at intermediate stages of computation. In their terminology, a circuit with $k$ lines that we are allowed to modify and an arbitrary number of read-only lines is called a $k$-bit ROM-based circuit. They show how to compute the permutation $\pi_f$ arising from a Boolean function $f$ using a 1-bit quantum ROM-based circuit, and prove that if only classical gates are allowed, two writable bits are necessary. Two bits are sufficient if the CNT gate library is used.
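The reduction in Proposition 11 can be checked numerically for small cases. The sketch below assumes the state of the extra qubit is stored as a two-entry list of amplitudes and applies the classical permutation $\pi_f(x, y) = (x, y \oplus f(x))$ to it for each basis value $x$; it then reads off the resulting phase. The function name `oracle_phases` and the example predicate are illustrative, the latter chosen to match the desired indices 0 and 3 of the Figure 8 example.

```python
import math

def oracle_phases(f, n_bits):
    """For each basis value x, apply pi_f(x, y) = (x, y XOR f(x)) to the state
    |x>|-> with |-> = H|1>, and return the phase picked up by |x>|->."""
    s = 1 / math.sqrt(2)
    minus = [s, -s]  # amplitudes of H|1> on |0> and |1>
    phases = []
    for x in range(2 ** n_bits):
        out = [0.0, 0.0]
        for y in (0, 1):
            out[y ^ f(x)] += minus[y]  # the permutation moves each amplitude
        phase = out[0] / minus[0]
        # The output must be a multiple of |->: check the second amplitude too.
        assert math.isclose(out[1], phase * minus[1])
        phases.append(round(phase))
    return phases

if __name__ == "__main__":
    def desired(x):
        return int(x in (0, 3))
    print(oracle_phases(desired, 2))
```

For this predicate the phases come out as -1 on indices 0 and 3 and +1 elsewhere, so a purely classical circuit for $\pi_f$ acts as the phase oracle once the extra qubit is prepared in $H|1\rangle$.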
Their synthesis algorithms rely on XOR sum-of-products decompositions of $f$. We outline their method in the Appendix, in a proof of the following result.

PROPOSITION 12. There exists a reversible 2-bit ROM-based CNT-circuit computing $(x, a, b) \to (x, a, b \oplus f(x))$, where $x$ is a $k$-bit input. If a function's XOR decomposition consists of only one term, let $k$ be the number of literals appearing (without complementation). If $k > 0$ then there will be $3 \cdot 2^{k-1} - 2$ gates [15].

We apply the construction given in the Appendix to all 256 functions implementable in 2-bit ROM-based circuits with 3 bits of ROM. The circuit size distribution is given in the line labeled XOR in Table 2. That is compared with optimal circuit sizes produced by the algorithm from Section 4. The line OPT T gives the size distribution of circuits synthesized under the restriction [15] that at most one control bit per gate be a ROM bit, which is observed by the heuristic based on XOR decomposition. This is why, for all $j$, the sum of the first $j$ numbers in the OPT T line is greater than or equal to that in the XOR line. Travaglione et al. [15] mention that their results do not depend on the above restriction, and the OPT line of Table 2 relaxes it (see footnote 2).

Most functions computable by a 2-bit ROM-based circuit actually require two writable bits [15]. Whether or not a given function can be computed by a 1-bit ROM-based CNT-circuit can be determined by the following constructive procedure. Observe that gates in 1-bit ROM circuits can be reordered arbitrarily, as no gate affects the control bits of any other gate. Thus, whether or not a C or T gate flips the controlled bit depends only on the circuit inputs. Furthermore, multiple copies of the same gate on the same wires cancel out, and we can assume that at most one is present in an optimal circuit.
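Because the gates commute and each appears at most once, a 1-bit ROM-based CNT-circuit corresponds to an XOR of products of at most two input bits: an N gate contributes the constant 1, a C gate one input, and a T gate a product of two inputs. The sketch below uses that view, which is our framing rather than the paper's stated procedure: it computes the XOR-of-products expansion of $f$ over uncomplemented literals and reports the gates, or `None` if some term has three or more literals. The names `xor_expansion` and `one_bit_rom_gates`, and the numbering of ROM wires from 0, are assumptions made for illustration.

```python
def xor_expansion(f, k):
    """Coefficients of the XOR-of-products expansion of f over uncomplemented
    literals; entry `mask` belongs to the product of the inputs in `mask`."""
    coeff = [f(x) & 1 for x in range(2 ** k)]
    for i in range(k):
        for mask in range(2 ** k):
            if mask & (1 << i):
                coeff[mask] ^= coeff[mask ^ (1 << i)]
    return coeff

def one_bit_rom_gates(f, k):
    """Gates of a 1-bit ROM CNT-circuit for f, each given as a tuple of control
    wires: () is an N gate, (i,) a C gate, (i, j) a T gate, all targeting the
    single writable bit. Returns None if no such circuit exists."""
    gates = []
    for mask, c in enumerate(xor_expansion(f, k)):
        if c:
            controls = tuple(i for i in range(k) if (mask >> i) & 1)
            if len(controls) > 2:
                return None  # would need a gate with three or more controls
            gates.append(controls)
    return gates

if __name__ == "__main__":
    def example(x):
        # Hypothetical predicate: x0 XOR (x1 AND x2).
        return (x & 1) ^ (((x >> 1) & 1) & ((x >> 2) & 1))
    print(one_bit_rom_gates(example, 3))
```

Since the expansion is unique, the returned gate list is the only candidate up to reordering, so its length is also the optimal size within the 1-bit ROM model.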
A synthesis procedure can then check which gates are present by applying the permutation on every possible input combination with zero, one, or two 1s in its binary expansion. If the value of the function is 1, the circuit must have an N, C or T gate controlled by those bits.

Observe that adding the S gate to the gate library during $k + 1$ ROM synthesis will never decrease circuit sizes: no two wires can be swapped, since at least one of them is a ROM wire. In the case of $k + 2$ ROM synthesis, only the two non-ROM wires can be swapped, and one of them must be returned to its initial value by the end of the computation. We ran an experiment comparing circuit lengths in the 3+2 ROM-based case and found no improvement in circuit sizes upon adding the S gate; however, we have been unable to prove this in the general case.

6. CONCLUSIONS

We have explored a number of promising techniques for synthesizing optimal and near-optimal reversible circuits that require little or no temporary storage. In particular, we have proven constructively that every even permutation function can be synthesized without temporary storage using the CNT gate library, and our proof is the basis of a reasonably efficient heuristic synthesis algorithm. We have also derived various equivalences among CNT-circuits that are useful for synthesis purposes. Our experimental data for optimal reversible circuits on three wires using various subsets of the CNTS library reveal some interesting characteristics of these circuits. Finally, we have applied our approach to the design of oracle circuits for a key quantum computing application, Grover's search algorithm, and obtained much smaller circuits than previous methods. While the algorithm to synthesize optimal circuits scales better than its counterparts for irreversible computation [7], it is still limited by an exponentially growing search space.
In ongoing work, we are attempting to extend the proposed methods to handle larger and more general reversible circuits, with the eventual goal of synthesizing quantum circuits containing dozens of qubits.

Footnote 2: Using a circuit library with $\le 6$ gates (191Mb file, 1.5 min to generate), the OPT line takes 5 min to generate. For the OPT T line, we first find the 250 optimal circuits of size $\le 12$ (15 min) using a 6-gate library (61Mb, 5 min). The remaining 6 functions were synthesized in 5 min with a 7-gate library (376Mb, 10 min).

7. REFERENCES

[1] A. Barenco et al., "Elementary Gates For Quantum Computation," Physical Review A, 52, 1995, pp. 3457-3467.
[2] C. Bennett, "Logical Reversibility of Computation," IBM J. of Research and Development, 17, 1973, pp. 525-532.
[3] S. Bettelli, L. Serafini and T. Calarco, "Toward an Architecture for Quantum Programming," Nov. 2001, http://arxiv.org/abs/cs.PL/0103009.
[4] L. K. Grover, "A Framework For Fast Quantum Mechanical Algorithms," Proc. Symp. on Theory of Computing, 1998.
[5] P. Kerntopf, "A Comparison of Logical Efficiency of Reversible and Conventional Gates," IWLS 2000, pp. 261-9.
[6] R. Korf, "Artificial Intelligence Search Algorithms," Algorithms and Theory of Computation Handbook, CRC Press, 1999.
[7] E. Lawler, "An Approach to Multilevel Boolean Minimization," JACM, 11, July 1964, pp. 283-295.
[8] J. P. McGregor and R. B. Lee, "Architectural Enhancements for Fast Subword Permutations with Repetitions in Cryptographic Applications," ICCD, 2001, pp. 453-461.
[9] M. Nielsen and I. Chuang, Quantum Computation and Quantum Information, Cambridge Univ. Press, 2000.
[10] M. Perkowski et al., "A General Decomposition For Reversible Logic," Reed-Muller Workshop, Aug. 2001.
[11] T. Sasao and K. Kinoshita, "Conservative Logic Elements and Their Universality," IEEE Trans. on Computers, 28, 1979, pp. 682-685.
[12] T. Silke, "PROBLEM: register swap," December 1995, http://www.mathematik.uni-bielefeld.de/~silke/PROBLEMS/bit_swap
[13] Z. Shi and R. Lee, "Bit Permutation Instructions for Accelerating Software Cryptography," IEEE Intl. Conf. on Application-specific Systems, Architectures, and Processors, July 2000, pp. 138-148.
[14] T. Toffoli, "Reversible Computing," Tech. Memo MIT/LCS/TM-151, MIT Lab for Comp. Sci, 1980.
[15] B. Travaglione et al., "ROM-based computation: Quantum Versus Classical," 2001, http://arxiv.org/abs/quant-ph/0109016
[16] S. Younis and T. Knight, "Asymptotically Zero Energy Split-Level Charge Recovery Logic," Workshop on Low Power Design, 1994.

Appendix

Below we state and prove technical results used in Section 3 and then detail a proof of Proposition 12.

PROPOSITION 13. For $n \ge 5$, we can write any permutation in $A_n$ as the product of no more than $n$ pairs of disjoint transpositions.

Proof: Fix $\pi \in A_n$. Then take the cycle decomposition of $\pi$ and decompose each cycle into transpositions to write $\pi$ as a product of $c(\pi) \le n$ transpositions. Since $\pi$ is even, we know $c(\pi) = 2k$ for some $k$. Pair up the $2i$-th and $(2i+1)$-st transpositions. Some of these pairs may not be disjoint, but since $n \ge 5$ we can write $(a, b)(a, c) = (a, b)(d, e)(d, e)(a, c)$ where $d \ne e$ are distinct from $a, b, c$. Thus, breaking up non-disjoint pairs, we write $\pi$ as a product of at most $2k = c(\pi) \le n$ pairs of disjoint transpositions.

PROPOSITION 14. Let $n \ge 4$, and let $a, b, c, d$ be distinct integers between 0 and $n - 1$. Then there exists a constructible permutation $\pi \in A_{2^n}$ such that $\pi(a) = 0$, $\pi(b) = 1$, $\pi(c) = 2$, and $\pi(d) = 3$. It takes at most $2n$ N gates, $4(n+1)$ C gates, and $2(n-2)$ T gates.

Proof: Start with an empty circuit and place N gates on every line corresponding to a 1 in the binary expansion of $a$. Let $\pi_0$ be the permutation performed by the circuit so far; $\pi_0(a) = 0$. Since $b \ne a$, we have $\pi_0(b) \ne 0$, and therefore $\pi_0(b)$ has at least one 1 in its binary expansion. Say it is on the $k$-th line; then using C gates controlled on the $k$-th line, flip any other non-zero bits of $b' = \pi_0(b)$. Finally, if $k \ne 1$, swap the $k$-th bit and the 0-th bit.
This can always be done using 3 C gates, as in Figure 2. In this case, since we know that the bottom bit is 0 and the $k$-th bit is 1, we need only 2. Let $\pi_1$ be the permutation performed by the circuit so far. By construction, $\pi_1(b) = 1$, and since C gates fix 0, we have $\pi_1(a) = \pi_0(a) = 0$.

As before, $c \ne a, b$ implies $\pi_1(c) \ne 0, 1$; hence $\pi_1(c)$ has a 1 somewhere in its binary expansion other than the lowest bit, say in the $p$-th bit. Using the algorithm of the previous paragraph, flip every other bit to 0 and then swap the $p$-th and 2-nd bit. We note that again we have not affected 0, and since none of our C gates has been controlled on the bottom line, we have not moved 1. The permutation $\pi_2$ performed by the circuit thus far has the property that $\pi_2(c) = 2$, $\pi_2(b) = 1$, $\pi_2(a) = 0$.

Finally, observe that $\pi_2(d) \ge 3$. If it is in fact 3 then we are done; if not, then $\pi_2(d) \ge 4$, and some bit in the binary expansion of $\pi_2(d)$ other than the lowest two bits must be 1; let it be the $q$-th bit. Then using C gates controlled on the $q$-th bit, flip the bottom two wires to 1 if necessary, and use T gates controlled on these bottom two bits to clear off the rest of the wires. We are now done, as none of these gates affect 0, 1, 2, and this subcircuit sends $\pi_2(d) \mapsto 3$. A careful count of the gates used verifies the final claim of the proposition.

PROPOSITION 12. There exists a reversible 2-bit ROM-based CNT-circuit computing $(x, a, b) \to (x, a, b \oplus f(x))$, where $x$ is a $k$-bit input. If a function's XOR decomposition consists of only one term, let $k$ be the number of literals appearing (without complementation). If $k > 0$ then there will be $3 \cdot 2^{k-1} - 2$ gates [15].

Proof: Assume we are given an XOR sum-of-products decomposition of $f$. Then it suffices to know how to transform $(x, a, b) \to (x, a, b \oplus p)$ for an arbitrary product of uncomplemented literals $p$, because then we can add the terms in an XOR decomposition term by term. So, without loss of generality, let $p = x_1 \cdots x_m$.
Denote by $T(a, b, c)$ a T gate with controls on $a, b$ and inverter on $c$. Similarly, denote by $C(a, b)$ a C gate with control on $a$ and inverter on $b$. Number the ROM wires $1, \ldots, k$, and the non-ROM wires $k+1$ and $k+2$. Let us first suppose that there is at least one uncomplemented literal, and put a $C(1, k+2)$ on the circuit; note that $C(1, k+2)$ applied to the input $(x, a, b)$ gives $(x, a, b \oplus x_1)$. We will write this as $C(1, k+2) : (x, a, b) \to (x, a, b \oplus x_1)$, and denote this operation by $V_1$.

Then, we define the circuit $V_2'$ as the sequence of gates $T(2, k+2, k+1)\, V_1\, T(2, k+2, k+1)\, V_1$, and one can check that $V_2' : (x, a, b) \to (x, a \oplus x_1 x_2, b)$. We define $V_2$ by exchanging the wires $k+1$ and $k+2$; clearly $V_2 : (x, a, b) \to (x, a, b \oplus x_1 x_2)$.

In general, given a circuit $V_l : (x, a, b) \to (x, a, b \oplus x_1 \cdots x_l)$, we define $V_{l+1}' := T(l+1, k+2, k+1)\, V_l\, T(l+1, k+2, k+1)\, V_l$; one can check that $V_{l+1}' : (x, a, b) \to (x, a \oplus x_1 \cdots x_{l+1}, b)$. Define $V_{l+1}$ by exchanging the wires $k+1$ and $k+2$; then clearly $V_{l+1} : (x, a, b) \to (x, a, b \oplus x_1 \cdots x_{l+1})$. By induction, we can get as many uncomplemented literals in this product as we like.
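The inductive construction above can be written as a short recursive routine and checked exhaustively for small $k$. The following is a sketch under our own representation, not code from [15]: a gate is a tuple `("C", control, target)` or `("T", control1, control2, target)`, wires are numbered as in the proof (ROM wires $1, \ldots, k$, writable wires $k+1$ and $k+2$), and exchanging the two writable wires is expressed by swapping the `target` and `scratch` arguments. The function names are illustrative.

```python
from itertools import product

def xor_term_circuit(m, target, scratch):
    """Gate list that XORs x_1 * ... * x_m into `target`, using `scratch` as
    workspace that is restored at the end."""
    if m == 1:
        return [("C", 1, target)]
    # V_{m-1} with the two writable wires exchanged: it XORs the shorter
    # product into `scratch` and leaves `target` unchanged.
    inner = xor_term_circuit(m - 1, scratch, target)
    toffoli = ("T", m, scratch, target)
    return [toffoli] + inner + [toffoli] + inner

def run(circuit, wires):
    """Apply the gates in order to a dict mapping wire number to bit."""
    wires = dict(wires)
    for gate in circuit:
        if gate[0] == "C":
            wires[gate[2]] ^= wires[gate[1]]
        else:
            wires[gate[3]] ^= wires[gate[1]] & wires[gate[2]]
    return wires

def check(k):
    """Verify (x, a, b) -> (x, a, b XOR x_1...x_k) on all inputs."""
    circuit = xor_term_circuit(k, target=k + 2, scratch=k + 1)
    for bits in product((0, 1), repeat=k + 2):
        wires = {i + 1: bit for i, bit in enumerate(bits)}
        expected = dict(wires)
        expected[k + 2] ^= int(all(bits[:k]))
        if run(circuit, wires) != expected:
            return False
    return True

if __name__ == "__main__":
    for k in range(1, 5):
        print(k, len(xor_term_circuit(k, k + 2, k + 1)), check(k))
```

The recursion doubles the previous circuit and adds two T gates, so the gate count satisfies $g(1) = 1$ and $g(l+1) = 2g(l) + 2$, which gives the $3 \cdot 2^{k-1} - 2$ of Proposition 12. Each T gate has exactly one ROM control, in line with the restriction behind the XOR and OPT T lines of Table 2.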