
Depth-optimized reversible circuit synthesis

2013, Quantum Information Processing

arXiv:1208.5425v1 [quant-ph] 27 Aug 2012

Mona Arabzadeh, Morteza Saheb Zamani, Mehdi Sedighi, Mehdi Saeedi

Abstract In this paper, simultaneous reduction of circuit depth and synthesis cost of reversible circuits in quantum technologies with limited interaction is addressed. We developed a cycle-based synthesis algorithm which uses negative controls and limited distance between gate lines. To improve circuit depth, a new parallel structure is introduced in which, before synthesis, a set of disjoint cycles is extracted from the input specification and distributed into several subsets. The cycles of each subset are synthesized independently on different sets of ancillae. Accordingly, each disjoint set can be synthesized by a different synthesis method. Our analysis shows that the best worst-case synthesis cost of reversible circuits in the linear nearest neighbor architecture is improved by the proposed approach. Our experimental results reveal the effectiveness of the proposed approach in reducing cost and circuit depth for several benchmarks.

Keywords Reversible logic · Synthesis · Linear nearest neighbor architecture · Circuit depth

1 Introduction

Boolean reversible circuits have attracted attention as components in several quantum algorithms including Shor’s quantum factoring [1] and stabilizer circuits [2]. In recent years, considerable efforts have been made to synthesize a Boolean reversible function by a set of quantum gates [3]. The proposed technologies for quantum computing suffer from practical limitations for implementation. For example, popular quantum technologies allow computation on a few qubits in a linear nearest neighbor (LNN) architecture where only adjacent qubits can interact [4]. Additionally, physical qubits are fragile and can hold their states only for a limited time, called the coherence time [5].
A preliminary and partial version of this paper was presented at the 2011 International Workshop on Logic and Synthesis, San Diego, USA.
M. Arabzadeh, M. Saheb Zamani, M. Sedighi, M. Saeedi: Computer Engineering Department, Amirkabir University of Technology, Tehran, Iran. E-mail: {m.arabzadeh, szamani, msedighi, msaeedi}@aut.ac.ir
M. Saeedi is currently with the Department of Electrical Engineering, University of Southern California, Los Angeles, CA, USA 90089-2562. E-mail: msaeedi@usc.edu

To reflect technological constraints in the synthesis stage, different technology-specific cost metrics have been introduced.

– Two-qubit cost is the number of two-qubit gates of any type and the number of one-qubit gates (reported separately) in a given circuit. The number of two-qubit gates for an n-qubit Toffoli gate (for n ≥ 3) is estimated as 10n − 25 [6].
– Quantum cost (QC) is the number of NOT, CNOT, controlled-V and controlled-V† gates required to implement a given reversible function.
– Interaction cost is the distance between gate qubits for any two-qubit gate. Quantum circuit technologies with 1D, 2D and 3D interactions exist [4]. Interaction cost for a circuit is calculated by a summation over the interaction costs of its gates.
– Number of ancillae and garbage qubits reflects the limited number of qubits in the current quantum technologies.
– Depth is the largest number of elementary gates on any path from inputs to outputs in a circuit. Reducing circuit depth helps a computation finish within the coherence time.

Synthesis of reversible Boolean circuits has an exponential search space. Consequently, many heuristic algorithms have been proposed to consider the effects of quantum cost and two-qubit cost in the synthesis stage [7-10]. Additionally, several post-process optimization methods have been developed to improve quantum cost [8, 11, 6], interaction cost [12, 13], and depth [14].
However, the number of algorithms that consider different parameters simultaneously, which is the focus of this work, is very limited. Besides technological limitations, studying theoretical aspects of circuits with either limited interactions among qubits of gates or limited depth attracts interest in complexity theory. For example, $NC^i$ is the class of decision problems solvable by a uniform family of Boolean circuits with polynomial size, depth $O(\log^i n)$ and fan-in 2. QNC is the class of constant-depth quantum circuits without fanout gates [15].

In this paper, a synthesis algorithm for Boolean reversible circuits is proposed which uses a cycle-based strategy to synthesize circuits for the LNN architecture. The proposed technique leads to improved synthesis costs as compared to the best prior methods for several benchmarks. Moreover, a parallel structure for reversible Boolean circuits is presented which significantly reduces circuit depth with 2n ancillae. Overall, our circuits can be considered as depth-optimized reversible circuits for the LNN architecture.

This paper is organized as follows. Basic concepts are introduced in Section 2. Related synthesis and post-process optimization methods are reviewed in Section 3. The proposed cycle-based synthesis algorithm for the LNN architecture is described in Section 4. Section 5 presents a parallel structure to reduce circuit depth. Experimental results are reported in Section 6, and Section 7 concludes the paper.

2 Basic Concepts

In this section, preliminary concepts are briefly introduced. Further background can be found in [3].

Permutation Function. Let B be any set and define f : B → B as a one-to-one and onto transition function. The function f is a permutation function, as applying f to B leads to a set with the same elements as B, possibly in a different order. If B = {1, 2, 3, ..., m}, there exist two elements $b_i$ and $b_j$ belonging to B such that $f(b_i) = b_j$.
A k-cycle with length k is denoted as $(b_1, b_2, ..., b_k)$, which means that $f(b_1) = b_2$, $f(b_2) = b_3$, ..., and $f(b_k) = b_1$. A given k-cycle $(b_1, b_2, ..., b_k)$ can be written in different ways, such as $(b_2, b_3, ..., b_k, b_1)$. Cycles $c_1$ and $c_2$ are called disjoint if they have no common members. Any permutation can be written uniquely, except for the order, as a product of disjoint cycles. If two cycles $c_1$ and $c_2$ are disjoint, they commute, i.e., $c_1 c_2 = c_2 c_1$. A cycle with length two is called a transposition. A cycle or a permutation is called even (odd) if it can be written as an even (odd) number of transpositions. A k-cycle is even (odd) when k is odd (even).

Reversible Function. An n-input, n-output, fully specified Boolean function $f : B^n → B^n$ over variables $X = \{x_0, ..., x_{n-1}\}$ is called reversible if it maps each input pattern to a unique output pattern. Each reversible function can be considered as a permutation function. The lines added to a circuit are called ancillae and typically start out with a 0 or 1.

Reversible Gate. An n-input, n-output gate is reversible if it realizes a reversible function. A multiple-control Toffoli gate can be written as $C^m$NOT(C; t), where $C = \{i_1, ..., i_m\}$ is the set of control lines, t = {j} with C ∩ t = ∅ is the target line and 0 ≤ i, j ≤ n − 1. A control line may be positive (negative), which means that if its value is one (zero), the value of the target is inverted. For m=0 and m=1, the gates are called NOT (N) and CNOT (C), respectively. For m=2, the gate is called $C^2$NOT or Toffoli (T). The SWAP(a,b) gate exchanges the values of two qubits a and b, and can be constructed by three CNOT gates C(a,b)C(b,a)C(a,b). The controlled-V (controlled-V†) gate changes the value of its target line using the transformation given by the matrix V (V†) if the control line has the value 1.
$$V = \frac{1+i}{2}\begin{pmatrix} 1 & -i \\ -i & 1 \end{pmatrix}, \qquad V^{\dagger} = \frac{1-i}{2}\begin{pmatrix} 1 & i \\ i & 1 \end{pmatrix}$$

3 Related Work

In this section, we review prior synthesis and optimization techniques that are used in this paper.

In [16], an NCT-based synthesis method is proposed which decomposes a given cycle into a set of transpositions. To implement an arbitrary pair of transpositions (a, b)(c, d) for distinct $a, b, c, d \neq 0, 2^i$, the authors introduced three subcircuits, namely π, κ0 and π⁻¹ (the inverse of π), where the κ0 circuit, $C^{n-2}$NOT($a_2, ..., a_{n-1}$; $a_0$), implements the fixed pair of transpositions $(2^n - 4, 2^n - 3)(2^n - 2, 2^n - 1)$. Accordingly, a synthesis algorithm was proposed to transform a, b, c and d to $2^n - 4$, $2^n - 3$, $2^n - 2$ and $2^n - 1$, respectively. By cascading π, κ0 and π⁻¹, an arbitrary pair of transpositions can be implemented with quantum cost 34n − 64.

Fig. 1 (a) 3-input reversible full adder with optimal depth 4 [14], (b) the circuit in (a) after inserting SWAP gates and (c) reducing the number of SWAP gates by [12].

The NCT-based synthesis method in [16] was extensively improved in [10], called the k-cycle method hereafter. In the k-cycle method, a given cycle of length ≥ 6 is decomposed into a set of cycles of lengths < 6, called elementary cycles. Next, a set of synthesis algorithms was proposed to synthesize different elementary cycles, i.e., a pair of 2-cycles, a single 3-cycle, a pair of 3-cycles, a single 5-cycle, a pair of 5-cycles, a single 2-cycle (4-cycle) followed by a single 4-cycle (2-cycle) and a pair of 4-cycles. Similar to [16], the 0 and $2^i$ terms are fixed before synthesis because their effect on the synthesis results is negligible [10]. NCT gates with positive controls are used in both [16] and [10]. The effect of decomposition on the result of [10] was considered in [17], where a cycle-assignment technique based on graph matching was proposed.
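The methods reviewed above all start from the cycle form of a reversible specification. As a minimal sketch (not the decomposition procedure of [16] or [10]), the following assumes the specification is given as a list `perm` with `perm[x]` equal to the output for input `x`; the function names `disjoint_cycles` and `is_even` are hypothetical. It extracts the disjoint cycles and classifies the permutation as even or odd using the fact that a k-cycle is a product of k − 1 transpositions.

```python
from typing import List, Sequence, Tuple


def disjoint_cycles(perm: Sequence[int]) -> List[Tuple[int, ...]]:
    """Return the disjoint cycles of length > 1 of a permutation.

    perm[x] is the image of x; fixed points are omitted.
    """
    seen = [False] * len(perm)
    cycles: List[Tuple[int, ...]] = []
    for start in range(len(perm)):
        if seen[start]:
            continue
        cycle = []
        x = start
        while not seen[x]:          # follow start -> f(start) -> ... until we return
            seen[x] = True
            cycle.append(x)
            x = perm[x]
        if len(cycle) > 1:
            cycles.append(tuple(cycle))
    return cycles


def is_even(cycles: Sequence[Tuple[int, ...]]) -> bool:
    """A k-cycle equals k - 1 transpositions, so sum the (k - 1) values."""
    return sum(len(c) - 1 for c in cycles) % 2 == 0


# Illustrative 3-line specification: f(1)=3, f(3)=1, f(4)=5, f(5)=6, f(6)=4.
perm = [0, 3, 2, 1, 5, 6, 4, 7]
cycles = disjoint_cycles(perm)      # [(1, 3), (4, 5, 6)]
even = is_even(cycles)              # one 2-cycle (odd) times one 3-cycle (even): odd
```

Here the 2-cycle makes the whole permutation odd, which is the case that needs one ancilla in the NCT library according to [16].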
The worst-case quantum cost for synthesizing an arbitrary reversible function on n lines is $8.5n2^n + o(2^n)$ in [10].

In [14], the authors introduced a post-process optimization algorithm to reduce the depth of a given quantum circuit. To achieve this, a set of circuit templates (circuit identities) was proposed to reduce quantum cost and circuit depth. The suggested templates are applied to change either gate locations or control/target positions in a subcircuit to parallelize more gates. The introduced templates were used by a greedy algorithm which starts from gate i and traverses the gates afterwards. At each step, the algorithm moves gates to the left whenever possible and applies templates to check whether other gates can be moved to the left. If no change is possible, it starts the same process from gate i + 1.

In [12], a synthesis flow was proposed to improve the interaction cost of a given quantum circuit. The authors studied the exact synthesis of some small gates for the LNN architecture. The proposed optimal circuits are used to simplify larger circuits. In addition, some circuit templates are introduced to reduce the number of SWAP gates. Finally, local and global reordering of input qubits are considered to reorder gate qubits for improving the interaction cost. The proposed techniques were consolidated in a unified design flow to implement a given circuit with arbitrary interactions for architectures with limited interactions.

Fig. 1-a shows a 3-input full adder with depth 4 [14] and six elementary gates. Depth 4 is optimal since four gates act on the fourth qubit [14]. Fig. 1-b shows the same circuit after inserting SWAP gates to make the gate qubits adjacent, with QC=24 and depth=23. Fig. 1-c illustrates the same circuit after applying the method in [12] for reducing the number of SWAP gates, where QC=18 and depth=17.
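The two metrics that these post-process methods target, depth and interaction cost, can be made concrete with a small sketch. It rests on several assumptions that are not taken from [12] or [14]: a circuit is a list of gates, each gate is a tuple of the line indices it touches, depth is obtained by moving every gate as far left as its own lines allow without reordering gates (no templates are applied), and the interaction cost of a two-qubit gate is the number of lines lying between its two qubits. The names `asap_depth` and `interaction_cost` are hypothetical.

```python
from typing import List, Sequence, Tuple

Gate = Tuple[int, ...]  # line indices a gate acts on (controls and target)


def asap_depth(gates: Sequence[Gate], n_lines: int) -> int:
    """Depth after pushing each gate left until it meets a gate sharing a line."""
    level = [0] * n_lines                 # last occupied time step per line
    for gate in gates:
        step = 1 + max(level[q] for q in gate)
        for q in gate:
            level[q] = step
    return max(level, default=0)


def interaction_cost(gates: Sequence[Gate]) -> int:
    """Sum over two-qubit gates of the number of lines between their qubits.

    Assumption: adjacent qubits cost 0; other conventions shift this by a constant.
    """
    return sum(abs(g[0] - g[1]) - 1 for g in gates if len(g) == 2)


# Four two-qubit gates on four lines; only the last one is not nearest-neighbor.
example: List[Gate] = [(0, 1), (2, 3), (1, 2), (0, 3)]
depth = asap_depth(example, 4)      # gates (0,1),(2,3) share step 1; (1,2),(0,3) share step 2
cost = interaction_cost(example)    # only (0,3) contributes: two lines in between
```

Under these assumptions the example has depth 2 and interaction cost 2; making the last gate adjacent with SWAP gates would lower the interaction cost while adding gates and depth, which is the trade-off visible between Fig. 1-a and Fig. 1-b.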
4 The Proposed Cycle-Based Synthesis Method for Interaction Cost

The main contribution of [10] is a cycle-based synthesis approach with quantum cost as the sole metric considered. In this section, another important implementation constraint, namely interaction cost, is considered besides the quantum cost in our proposed cycle-based method. To do that, we improve the k-cycle method by using negative controls and adapting the synthesis algorithms of elementary cycles to the LNN architecture. In particular, two new elementary odd cycles, a 2-cycle and a 4-cycle, are included to improve quantum cost. These odd cycles are synthesized as a pair of 2-cycles and a pair of 4-cycles in [10] with one ancilla. Odd cycles need one ancilla in the NCT library for the implementation [16]. In our experiments, we used this ancilla for the decomposition of complex gates into elementary gates. Additionally, the 0 and $2^i$ terms are not fixed before synthesis so that they can be used in the proposed parallel structure, as discussed in Section 5.

Negative controls can reduce the number of elementary gates in the κ0, π and π⁻¹ circuits both with and without considering the nearest neighbor restriction. Multiple-control Toffoli gates with at least one positive control can be simulated as efficiently as complex Toffoli gates with only positive controls [14]. By using CNOT and Toffoli gates with negative controls, the 0 and $2^i$ terms need not be fixed before synthesis, in contrast to the methods in [16, 10].

Fig. 2 (a) The κ0(2,2) circuit in [16, 10]. (b) The proposed κ0(2,2) circuit. Each control at position i, 0 ≤ i ≤ n − 1, i ≠ k + 2, is negative. (c) An example of the π circuit in [10]. $a_0$ is used to control CNOTs in the first part. The second subcircuit is the circuit in [10, Theorem 3.1]. (d) An example of the π circuit in the proposed method. Here, k=3. Refer to Table 1.

Algorithm 1: Gate selection in the π circuit
Input: L n-bit input terms. The bit value at position j of the i-th input term is $b_{(i,j)}$.
       L n-bit κ0 terms. The bit value at position j of the i-th κ0 term is $b^{\kappa_0}_{(i,j)}$.
       Pivot is the boldfaced position in the intermediate terms in Table 1.
Output: The π circuit.
for i in 0 to L do
    if $b_{(i,Pivot)} \neq 1$ then
        Set $b_{(i,Pivot)} = 1$ by either a CNOT or a Toffoli gate;
    end
    for j in 0 to Pivot do
        if $b_{(i,j)} \neq b^{\kappa_0}_{(i,j)}$ then
            Find a position p: $b_{(i,p)} = 1$, $b_{(k,p)} \neq 1$ (k < i), |p − j| is the minimum possible value, and p ≤ Pivot;
            Apply CNOT(p; j);
        end
    end
    for j in n−1 to Pivot+1 do
        if $b_{(i,j)} \neq b^{\kappa_0}_{(i,j)}$ then
            Find a position p: $b_{(i,p)} = 1$, $b_{(k,p)} \neq 1$ (k < i), |p − j| is the minimum possible value, and p ≥ Pivot;
            Apply CNOT(p; j);
        end
    end
end

Cycle Construction Length (CCL) is defined as the number of lines required to implement a given cycle of length L. In theory, the minimum CCL is $\log_2 L$. To implement the elementary cycles by NCT gates, at most two more lines are required in the proposed approach: one to avoid Toffoli gates without any positive control in the κ0 circuit, and one to improve circuit cost in the π and π⁻¹ circuits. Accordingly, we set CCL(2)=2, CCL(2,2)=4, CCL(3)=3, CCL(3,3)=5, CCL(4)=4, CCL(4,2)=5, CCL(4,4)=5, CCL(5)=5 and CCL(5,5)=6. For an n-line circuit, the CCL lines required to construct a given cycle can be selected in n × (n − 1) × ... × (n − CCL + 1) different ways. To improve interaction cost and depth, we place the selected lines close to each other in the middle of the κ0 circuit at positions k, k ± 1 and k ± 2 for k = ⌊n/2⌋. Details are discussed later.
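The line-selection step described above can be illustrated with a short sketch. It only tabulates the CCL values listed in the text, counts ordered selections of CCL distinct lines out of n (an assumption about how the selections are counted), and returns the middle positions k, k ± 1 and k ± 2 for k = ⌊n/2⌋. The names `CCL`, `ordered_line_choices` and `middle_positions` are hypothetical, and the exact positions used per cycle type are those in Table 1, not the generic window returned here.

```python
import math
from typing import Dict, List, Tuple

# CCL values as listed in the text, keyed by the elementary cycle type.
CCL: Dict[Tuple[int, ...], int] = {
    (2,): 2, (2, 2): 4, (3,): 3, (3, 3): 5, (4,): 4,
    (4, 2): 5, (4, 4): 5, (5,): 5, (5, 5): 6,
}


def ordered_line_choices(n: int, ccl: int) -> int:
    """Number of ordered selections of ccl distinct lines out of n lines,
    i.e. n * (n - 1) * ... * (n - ccl + 1)."""
    return math.perm(n, ccl)


def middle_positions(n: int) -> List[int]:
    """Candidate line positions k, k +/- 1 and k +/- 2 around k = floor(n / 2)."""
    k = n // 2
    return [k - 2, k - 1, k, k + 1, k + 2]


n = 8
choices = ordered_line_choices(n, CCL[(2, 2)])   # 8 * 7 * 6 * 5
window = middle_positions(n)                     # positions around k = 4
```

Fixing the lines to the middle window replaces a search over all selections by a single placement, which is what keeps the distance from any other line to the κ0 lines at roughly half the circuit width.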
To synthesize a given elementary cycle, one needs to change the input terms into the terms specified by the κ0 circuit. This is done by converting the input terms into intermediate terms specified by the π circuit. Afterwards, the intermediate terms are transformed into κ0 terms by a few specific gates, called static gates. In the proposed method, the control and target lines in the π circuit are selected such that interaction cost can be reduced. Since κ0 cycles are constructed in the middle of the circuit and the intermediate terms are designed with at least one “1”, as boldfaced in column Int. Terms in Table 1, it is possible to select control and target lines of each gate with length ≤ ⌈(n − CCL)/2 + CCL⌉. Considering two SWAP gates with cost 6 leads to $QC_{LNN} \leq 3(n + CCL)$ for each gate. To reduce circuit depth, the gates required to fix bit positions at the first half and the second half are applied in parallel. Algorithm 1 provides the details.

The κ0(2,2) circuit in [16, 10] and the proposed κ0(2,2) circuit are shown in Fig. 2-a and Fig. 2-b, respectively. Fig. 2-c illustrates one example of the π circuit in [10]. The input term is “11110111”, which should be changed to the second term in the κ0(2,2) circuit in [10], i.e., “11111101”. This is done by a circuit with QC=16 and depth=11. In contrast, “11110111” should be changed to “00100100” in the proposed method. Fig. 2-d shows the π circuit with QC=5 and depth=3 based on Algorithm 1.

4.1 Building Blocks

In this section, direct synthesis of the suggested elementary cycles, i.e., (2), (2,2), (3), (3,3), (4), (4,2), (4,4), (5), (5,5), is discussed. Fig. 3 illustrates the κ0 circuits of all elementary cycles. We give a full description of the synthesis method for a pair of 2-cycles first.

(2,2)-synthesis: To change (a, b)(c, d) to κ0(2,2) terms:

– At most n NOT gates can be used to convert a to “0...1000...0”. Other terms b, c, and d may be changed to new terms b′, c′ and d′, respectively.
– At most one CNOT gate conditioned on either the i-th line, i ≠ k + 2 (positive), or i = k + 2 (negative) can be used to set the (k − 1)-th bit of b′. Next, at most n − 1 CNOT gates conditioned on the (k − 1)-th bit can be applied to change the j-th bit of b′ (0 ≤ j ≤ n − 1, j ≠ k − 1) to “0...1001...0”. c′ and d′ may be changed to new terms c′′ and d′′.
– At most one CNOT gate conditioned on either the i-th line, i ≠ k + 2 (positive), or i = k + 2 (negative) can be used to set the k-th bit of c′′. Next, at most n − 1 CNOT gates with positive control conditioned on the k-th bit can be applied to change the j-th bit of c′′ (0 ≤ j ≤ n − 1, j ≠ k) to “0...1010...0”. The last term d′′ may be changed to a new term d′′′.
– At most one CNOT gate conditioned on either the i-th line, i ≠ k + 2 (positive), or i = k + 2 (negative) can be used to set the (k + 1)-th bit of d′′′. Next, at most n − 1 CNOT gates with positive control conditioned on the (k + 1)-th bit can be applied to change the j-th bit of d′′′ (0 ≤ j ≤ n − 1, j ≠ k + 2) to “0...1111...0”.
– A Toffoli gate conditioned on the (k − 1)-th and the k-th lines can be used to set the (k + 1)-th line. Therefore, it changes “0...1111...0” to “0...1011...0”.

Note that converting each term does not corrupt the previously fixed terms. The same number of gates is needed for the π⁻¹ circuit. Accordingly, a total of 8n + 22 elementary gates are required for the π and π⁻¹ circuits. The κ0 circuit in Fig. 3-b implements $(2^{k+2}, 2^{k+2}+2^{k-1})(2^{k+2}+2^{k}, 2^{k+2}+2^{k}+2^{k-1})$ with cost 24n − 88. Therefore, an arbitrary pair of 2-cycles (a, b)(c, d) can be implemented by at most 32n − 66 elementary gates.

Following the above discussion for the (2,2)-synthesis method, details for the synthesis of other elementary cycles are given in Table 1. In this table, subscripts in column Input Cycle(s) denote the order in which the terms are considered.
Intermediate terms are represented by binary expansions with the LSB on the right and the underlined bit in the k-th position (k = ⌊n/2⌋). The boldfaced “1” is Pivot in Algorithm 1 for each term. The parenthesized pairs in column Max. Cost represent CNOT counts with negative and positive controls, respectively. The numbers given in column Terms for the κ0 circuit are bit positions with value “1” in the binary representation. Table 2 reports the resulting quantum cost of each elementary cycle. As can be seen, the total number of elementary gates is improved by a linear factor in most cases. Considering the worst-case cost of 3(n + CCL) for each gate in the π and π⁻¹ circuits in the LNN architecture and 6n − 12 elementary gates (i.e., two chains of n − 2 SWAP gates) for the κ0 circuits leads to the results given in the Total Cost (LNN) column in Table 2.

4.2 Worst-Case Analysis

In this section, an upper bound on the number of gates in the proposed cycle-based method is calculated. To achieve this, let all terms of a truth table be involved in the input cycles to have a cycle with the maximum length $2^n$ for an n-input/n-output function. To convert a cycle with length > 5 to a set of elementary cycles, we may have some repeated terms in non-disjoint cycles. As such, $2^n + a_r$ is the maximum number of terms, where $a_r$ is the maximum number of repeated terms and can be estimated as

$$a_r = \frac{a_{r-1} + 4}{5}, \qquad a_0 = \frac{2^n}{5},$$

which results in

$$a_r = \frac{2^n}{5} + \sum_{i=2}^{\log_5\left(\frac{2^n - 5}{4}\right)} \frac{2^n + 5^i - 5}{5^i} = 2^{n-2} + \log_5\left(\frac{2^n - 5}{4}\right) - \frac{9}{4}.$$

Theorem 1 discusses the maximum number of elementary gates in our approach.

Theorem 1 The maximum number of elementary gates for any permutation in the proposed approach is $9.4n2^n - 18.8 \cdot 2^n + o(n^2)$ and $42.4n^2 2^n + o(n^3)$ without and with considering interaction cost, respectively.

Table 1 Direct synthesis of elementary cycles. Subscripts in the input cycles denote the orders in considering each term.
The underlined bit is in the k-th position (k = ⌊n/2⌋). The boldfaced “1” is Pivot in Algorithm 1. Numbers given for κ0 terms are bit positions with “1” in the binary expansion. Columns Max. Cost and Static Gates belong to the π or π⁻¹ circuit; columns Terms and Fig. belong to the κ0 circuit.

| Input Cycle(s) | Int. Terms | Max. Cost | Static Gates | κ0 Terms | Fig. |
|---|---|---|---|---|---|
| (a1, b2) | (0...10...0) | n N | | (k + 1) | 3-a |
| | (0...11...0) | n(1,n−1) C | | (k − 1)(k + 1) | |
| (a1, b2)(c3, d4) | (0...1000...0) | n N | | (k + 2) | 3-b |
| | (0...1001...0) | n(1,n−1) C | | (k + 2)(k − 1) | |
| | (0...1010...0) | n(1,n−1) C | | (k + 2)(k) | |
| | (0...1111...0) | n(1,n−1) C | T(k − 1, k; k + 1) | (k + 2)(k)(k − 1) | |
| (a1, b2, c3) | (0...001...0) | n N | | (k − 1) | 3-c |
| | (0...101...0) | n(1,n−1) C | | (k + 1)(k − 1) | |
| | (0...111...0) | n(1,n−1) C | | (k + 1)(k)(k − 1) | |
| (a1, b2, c3)(d4, e5, f6) | (0...00001...0) | n N | | (k − 2) | 3-d |
| | (0...00011...0) | n(1,n−1) C | | (k − 1)(k − 2) | |
| | (0...00111...0) | n(1,n−1) C | | (k)(k − 1)(k − 2) | |
| | (0...10001...0) | 1 T, n−1 C | | (k + 2)(k − 2) | |
| | (0...11011...0) | 1 T, n−1 C | T(k − 1, k + 2; k + 1) | (k + 2)(k − 1)(k − 2) | |
| | (0...11111...0) | 1 T, n−1 C | T(k, k + 2; k + 1) | (k + 2)(k)(k − 1)(k − 2) | |
| (a1, b2, c3, d4) | (0...1000...0) | n N | | (k + 2) | 3-e |
| | (0...1001...0) | n(1,n−1) C | | (k − 1)(k + 2) | |
| | (0...1010...0) | n(1,n−1) C | | (k)(k + 2) | |
| | (0...1111...0) | n(1,n−1) C | T(k − 1, k; k + 1) | (k − 1)(k)(k + 2) | |
| (a1, b2, c3, d4)(e5, f6) | (0...10000...0) | n N | | (k + 2) | 3-f |
| | (0...10010...0) | n(1,n−1) C | | (k + 2)(k − 1) | |
| | (0...10100...0) | n(1,n−1) C | | (k + 2)(k) | |
| | (0...11110...0) | n(1,n−1) C | T(k − 1, k; k + 1) | (k + 2)(k)(k − 1) | |
| | (0...10111...0) | n(1,n−1) C | | (k + 2)(k)(k − 1)(k − 2) | |
| | (0...11011...0) | 1 T, n−1 C | T(k − 2, k′; k + 1) | (k + 2)(k − 1)(k − 2) | |
| (a1, b2, c3, d4)(e5, f6, g7, h8) | (0...10000...0) | n N | | (k + 2) | 3-g |
| | (0...10001...0) | n(1,n−1) C | | (k + 2)(k − 2) | |
| | (0...10010...0) | n(1,n−1) C | | (k + 2)(k − 1) | |
| | (0...10111...0) | n(1,n−1) C | | (k + 2)(k − 1)(k − 2) | |
| | (0...10100...0) | n(1,n−1) C | T(k − 2, k − 1; k) | (k + 2)(k) | |
| | (0...11101...0) | 1 T, n−1 C | T(k − 2, k; k + 1) | (k + 2)(k)(k − 2) | |
| | (0...11110...0) | 1 T, n−1 C | T(k − 1, k; k + 1) | (k + 2)(k)(k − 1) | |
| | (0...11111...0) | n(1,n−1) C | T(k − 2, k − 1, k; k + 1) | (k + 2)(k)(k − 1)(k − 2) | |
| (a1, b4, c2, d3, e5) | (0...10000...0) | n N | | (k + 2) | 3-h |
| | (0...10001...0) | n(1,n−1) C | | (k + 2)(k − 2) | |
| | (0...10010...0) | n(1,n−1) C | | (k + 2)(k − 1) | |
| | (0...11011...0) | n(1,n−1) C | T(k − 2, k − 1; k + 1) | (k + 2)(k − 1)(k − 2) | |
| | (0...10111...0) | n(1,n−1) C | | (k + 2)(k)(k − 1)(k − 2) | |
| (a1, b4, c2, d3, e10)(f5, g8, h6, i7, j9) | (0...100000...0) | n N | | (k + 2) | 3-i |
| | (0...100001...0) | n(1,n−1) C | | (k + 2)(k − 3) | |
| | (0...100010...0) | n(1,n−1) C | | (k + 2)(k − 2) | |
| | (0...100111...0) | n(1,n−1) C | T(k − 3, k − 2; k − 1) | (k + 2)(k − 2)(k − 3) | |
| | (0...101000...0) | n(1,n−1) C | | (k + 2)(k) | |
| | (0...111001...0) | 1 T, n−1 C | T(k − 3, k; k + 1) | (k + 2)(k)(k − 3) | |
| | (0...111010...0) | 1 T, n−1 C | T(k − 2, k; k + 1) | (k + 2)(k)(k − 2) | |
| | (0...111011...0) | n(1,n−1) C | T(k − 2, k − 1, k; k + 1) | (k + 2)(k)(k − 2)(k − 3) | |
| | (0...101111...0) | n(1,n−1) C | | (k + 2)(k)(k − 1)(k − 2)(k − 3) | |
| | (0...110111...0) | 1 T, n−1 C | T(k − 1, k′; k + 1) | (k + 2)(k − 1)(k − 2)(k − 3) | |

Table 2 Worst-case costs for elementary cycles.
| EC | Length | Proposed: κ0 | Proposed: π, π⁻¹ | Proposed: Total Cost | Proposed: Cost/Length | Proposed: Total Cost (LNN) | [10]: Total Cost | [10]: Cost/Length |
|---|---|---|---|---|---|---|---|---|
| (2) | 2 | 24n−64 | 2n+2 | 28n−60 | 14n−30 | 145n^2−666n+772 | 34n−30 | 17n−15 |
| (2,2) | 4 | 24n−88 | 4n+11 | 32n−66 | 8n−16.5 | 147n^2−791n+1100 | 34n−64 | 8.5n−16 |
| (3) | 3 | 24n−88 | 3n+4 | 30n−80 | 10n−26.7 | 146n^2−804n+1068 | 32n−82 | 10.7n−27.3 |
| (3,3) | 6 | 24n−112 | 6n+26 | 36n−60 | 6n−10 | 149n^2−907n+1474 | 38n−46 | 6.3n−15.3 |
| (4) | 4 | 48n−152 | 4n+11 | 56n−130 | 14n−32.5 | 291n^2−1463n+1868 | 50n−84 | 12.5n−21 |
| (4,2) | 6 | 36n−204 | 6n+14 | 48n−176 | 8n−29.4 | 221n^2−1615n+2483 | 50n−122 | 8.3n−20.3 |
| (4,4) | 8 | 36n−204 | 8n+46 | 52n−112 | 6.5n−14 | 223n^2−1573n+2678 | 56n−126 | 7n−15.7 |
| (5) | 5 | 48n−166 | 5n+13 | 58n−140 | 11.6n−28 | 292n^2−1537n+2057 | 60n−130 | 12n−26 |
| (5,5) | 10 | 36n−204 | 10n+57 | 56n−90 | 5.6n−9 | 225n^2−319n+2790 | 64n−54 | 6.4n−5.4 |

Fig. 3 The κ0 circuit structures for different elementary cycles, panels (a) to (i). The circuit structures for cycles (2,2), (3), (3,3), (4,2), (4,4), (5), and (5,5) are similar to those proposed in [10]. The new circuits for (2) and (4), together with the application of negative controls and the revised terms in the κ0 circuits, improve quantum cost and interaction cost.

Proof In Table 2, the column Cost/Length gives the cost needed for setting one term in each elementary cycle. To calculate the maximum cost, suppose that at most one 3-cycle, one 4-cycle and one 5-cycle are included, which can be synthesized by the related synthesis algorithms. All other terms are supposed to be synthesized as pairs of 2-cycles.
Note that the number of elementary gates for fixing terms in a pair of 2-cycles is greater than for any other pair (see Table 2). The repeated terms in non-disjoint 5-cycles are synthesized by the (5,5)-cycle synthesis method. Accordingly, we will have

$$3 \times \mathrm{Cost/Length}_{3} + 4 \times \mathrm{Cost/Length}_{4} + 5 \times \mathrm{Cost/Length}_{5} + (2^n - 12) \times \mathrm{Cost/Length}_{2,2} + a_r \times \mathrm{Cost/Length}_{5,5}$$

which leads to $9.4n2^n - 18.8 \times 2^n + 2.8n^2 + 43.5n - 152.1$ elementary gates in the worst case with arbitrary interaction and $42.4n^2 2^n + 11.3n^3 + 288.2n^2$ with limited interaction. The worst-case quantum cost of [10] is $51n^2 2^n$ for architectures with limited interaction.

5 Synthesis with Parallel Structure

In this section, a parallel circuit structure is introduced for reversible logic that can be used to considerably reduce the depth of reversible circuits in most cases. The general idea is to copy the input lines into k sets of zero-initialized ancillae, divide the input specification into k sets of disjoint cycles and then synthesize each set independently by using the prepared ancillae. The final results can be recovered by several CNOTs. It should be mentioned that adding ancillae has been previously used for quantum cost reduction in synthesis and optimization methods [9, 6]. In the proposed method, ancillae are used for the purpose of depth reduction without considerable overhead on quantum cost, thanks to the specific form of input representation, i.e., cycles. Note that each cycle can be synthesized by a different synthesis method.

Fig. 4 (a) The input storing block with linear depth. (b) An alternative circuit structure with improved interaction cost and linear depth.
(c) A logarithmic-depth circuit structure.

Input Storing Block. Copying an arbitrary quantum state is not possible in general, but a Boolean value can be copied into a zero-initialized ancilla by a CNOT gate conditioned on the main line and targeted on the ancilla. For m n-line zero-initialized ancillae, the input storing block includes mn CNOT gates with constant depth m. Fig. 4-a shows the input storing block for a circuit with n main lines and m n-line ancillae. The interaction cost can be calculated as n(n − 1)(1 + 2 + ... + m − 1) = (1/2)nm(n − 1)(m − 1). Fig. 4-b illustrates another circuit structure with improved interaction cost, mn(n − 1). Circuit depth in Fig. 4-a can be improved from a linear factor to a logarithmic factor O(log m) [15], as shown in Fig. 4-c. Thus, interaction cost can be calculated as

$$n(n-1) \sum_{i=0}^{\log_2 m - 1} 2^{2i} = \frac{1}{2} n(n-1)(m^2 - 2).$$

Output Restoring Block. Since each subcircuit implements a set of disjoint cycles, for a given input combination, only one subcircuit (active) produces the results and the outputs of the other subcircuits (inactive) are the same as the inputs. The number of inactive subcircuits is equal to the number of n-line ancillae registers, which is even. As such, XORing (by CNOT) the outputs of all subcircuits on the main lines cancels the inputs and restores the correct outputs at the main lines. Overall, for m n-line ancillae and m+1 sets of disjoint cycles, mn CNOTs with depth m are sufficient. Fig. 5-a illustrates the output restoring block for m n-line ancillae with interaction cost nm(n − 1)(m − 1). A CNOT circuit with a common target can be implemented with logarithmic depth [15], as illustrated in Fig. 5-b for n=4 and m=4. In this case, interaction cost is

$$n(n-1) \sum_{i=0}^{\log_2 m - 1} \left(2^{i} - 1\right) 2^{i+1} = \frac{1}{2} nm(n-1)(2m + \log_2 m + 2).$$

Theorem 2 Consider a given specification F on n lines written as a set of disjoint cycles $C_1 C_2 ... C_m$ for an odd m. Assume that subcircuit $L_i$ implements $C_i$.
The specification F can be implemented with depth $O(\mathrm{depth}_{max}(L_i))$ in the presence of m n-line ancillae.

Proof Copying the input lines to m − 1 n-line zero-initialized ancillae replicates the inputs at the ancillae. Disjoint cycles commute. Hence, each subcircuit can be implemented on one register independently. The input storing/output restoring blocks have constant depth m. Therefore, circuit depth is dominated by the maximum depth of all subcircuits.

Fig. 5 (a) The output restoring block with linear depth. (b) The output restoring block with logarithmic depth for four main lines and four 4-line ancillae.

Fig. 6 An example of the proposed parallel cycle-based structure for a 4-line function.

A given specification may contain a set of disjoint cycles with exponential lengths, i.e., $O(2^n)$. In such cases, circuit depth cannot be further improved by Theorem 2. However, as will be shown in Section 6, circuit depth can be reduced considerably even with a small number of n-line ancillae. To efficiently employ the result of Theorem 2, one needs to determine disjoint cycle sets.

Example 1 Assume that the input cycles (1,3) (7,10) (0,4) (6,15) (2,8) (5,13) are given for a circuit with 4 lines. All cycles are elementary and no decomposition is required. Let two 4-line ancillae be available and each pair of 2-cycles be assigned to one set, i.e., (1,3) (7,10) to set #1, (0,4) (6,15) to set #2 and (2,8) (5,13) to set #3. Applying the input storing block provides the input data on the added zero-initialized ancillae. Now, the proposed method in Section 4 can be applied for each cycle pair, which leads to three subcircuits. To combine the results, one needs to add the output restoring block.
Accordingly, the total depth is equal to the maximum depth of the synthesized subcircuits (i.e., 33) plus 4 (2 for each of the input storing and output restoring blocks). Fig. 6 illustrates the result.

Cycle Distribution. Consider c elementary cycles and m register sets, including the input register. The problem is to assign disjoint cycles to different registers such that the total depth of the circuit in each register is minimized and the depths of the registers are almost equal. To achieve this goal, we modeled the cycle distribution problem as the bin packing problem (footnote 1) with a few exceptions. In our modeling, registers are bins and cycles are objects. Each cycle is decomposed into a set of elementary cycles, and the cost values in Table 2 are used as the weights of the elementary cycles. If the input permutation is odd, the permutation in one bin should be odd. Many heuristic algorithms have been developed to solve different variants of the bin packing problem. Examples include the first fit and best fit algorithms.

Footnote 1: The bin packing problem is an NP-hard combinatorial problem in which objects of different weights must be packed into a finite number of bins of capacity W such that the number of used bins is minimized. Given a bin size W and a list w_1, ..., w_n of item sizes, one should find an integer B and a B-partition S_1 ∪ ... ∪ S_B of {1, ..., n} such that Σ_{i ∈ S_k} w_i ≤ W for all k = 1, ..., B. A solution is optimal if it has minimal B.

Table 3 Benchmark specifications before and after decomposition.
Function      Before DCM  After DCM & DIST    Depth
              nEC         set1  set2  set3    circ1   circ2   circ3   total
hwb8          16          26    28    26      1923    1995    1953    1999
hwb9          38          43    43    44      4344    4347    3988    4351
hwb10         26          101   103   102     9929    10058   9898    10062
hwb11         186         186   186   186     23862   23826   23827   23866
nth prime7    5           19    6     8       1519    390     734     1523
nth prime8    1           63    -     -       5852    -       -       5852
nth prime9    3           122   4     2       13783   393     346     13787
nth prime10   3           96    85    75      12115   10947   9329    12119
nth prime11   6           315   36    159     46888   5470    23765   46892

Before DCM, EC: 48 54 228 1 1 -
After DCM, # of cycles of length (2): 36 152 1 2 1 2
After DCM, # of cycles of length (3) (4) (5): 28 16 54 76 154 186 372 3 1 28 1 62 1 125 2 253 1 507

To solve the problem, a best fit algorithm is developed which sorts the c elementary cycles according to their maximum synthesis costs and proceeds one cycle at a time. To distribute the cycles, the first cycle is selected and temporarily assigned to bin i for each 1 ≤ i ≤ m. Then, the total cost is calculated among all the bins, and the cycle is permanently assigned to the bin that results in the lowest total cost. In the case of a tie, the bins are selected in sequence. The algorithm continues until all the cycles are assigned. Therefore, the total time complexity is O(c log c) + O(cm^2).

At the end, the algorithm checks the permutation of each bin to make sure that at most one bin has an odd permutation. Odd permutations need one ancilla in the NCT library [16]. If more than one bin with an odd permutation (called an odd bin) is found, the algorithm moves the smallest odd cycle of the odd bin with the maximum depth to the odd bin with the minimum depth. This can take O(m) time. After the change, both involved bins have even permutations. This process continues until at most one bin with an odd permutation exists; this occurs when the input permutation is odd and at least m even permutations exist to fill all the bins. Altogether, the whole process has a time complexity of O(c log c) + O(cm^2).
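The distribution procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the weights in `HYPOTHETICAL_COST` stand in for the Table 2 cost values, the cycles are sorted in decreasing order of cost, and "lowest total cost" is read as placing each cycle in the currently least-loaded bin (which minimizes the largest bin cost after placement). All function names are hypothetical.

```python
# Hypothetical weights per cycle length; the paper uses its Table 2 costs.
HYPOTHETICAL_COST = {2: 10, 3: 14, 4: 20, 5: 26}


def cost(cycle):
    return HYPOTHETICAL_COST[len(cycle)]


def load(bin_):
    return sum(cost(c) for c in bin_)


def is_odd_cycle(cycle):
    # A cycle of even length is an odd permutation.
    return len(cycle) % 2 == 0


def is_odd_bin(bin_):
    # A bin is odd when it holds an odd number of odd cycles.
    return sum(is_odd_cycle(c) for c in bin_) % 2 == 1


def distribute(cycles, m):
    bins = [[] for _ in range(m)]
    # Sort by decreasing cost (assumption), then place one cycle at a time.
    for cyc in sorted(cycles, key=cost, reverse=True):
        # min() returns the first minimum, so ties go to bins in sequence.
        min(bins, key=load).append(cyc)
    # Parity repair: leave at most one bin with an odd permutation.
    while True:
        odd_bins = sorted((b for b in bins if is_odd_bin(b)), key=load)
        if len(odd_bins) <= 1:
            return bins
        dst, src = odd_bins[0], odd_bins[-1]  # lightest and heaviest odd bin
        mover = min((c for c in src if is_odd_cycle(c)), key=cost)
        src.remove(mover)
        dst.append(mover)  # both bins become even after the move


cycles = [(1, 3, 5), (0, 4), (6, 15), (2, 8, 9), (7, 10), (11, 12, 13, 14)]
bins = distribute(cycles, 3)
for i, b in enumerate(bins):
    print(i, load(b), b)
```

Each repair step moves one odd cycle between two odd bins, so the number of odd bins drops by two per step and the loop terminates. The sketch trades some balance for parity in that step, as the procedure in the text does.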
6 Experimental Results

The proposed cycle-based synthesis method for the LNN architecture and the suggested parallel structure for reversible logic synthesis were implemented in C++, and all experiments were performed on an Intel Pentium IV 2.5GHz computer with 4GB memory. To evaluate the proposed synthesis method, some of the reversible benchmark functions from [18] were synthesized. The selection criteria for these benchmarks are discussed later in this section, and their specifications before and after decomposition are given in Table 3. The decomposition approach of [10] is used in our method to decompose the input cycles into the proposed elementary cycles. The numbers of elementary cycles (EC) and non-elementary cycles (nEC) of each benchmark are reported in this table. After decomposition, all cycles are elementary with length < 6. Note that [10] proposes the best prior synthesis algorithm for medium-size hwbN and N-th prime functions if no ancilla is available [18]. While hwbN functions can be implemented with a polynomial cost O(n log^2 n) if a logarithmic number of garbage bits, ⌈log n⌉ + 1, is available [18], the proposed approach is more general and can be applied to many reversible functions.

To evaluate the proposed parallel structure, the cycle-based algorithm of Section 4 was used for synthesizing each subset. Since the number of signals is limited in current quantum technologies, the minimum number of ancillae (2 n-line registers) was used. Therefore, the number of input cycles should be >3 to have at least one cycle in each subset. In our experiments, the results of [14] were used for decomposing multiple-control Toffoli gates and calculating the quantum cost of gates with negative controls. In addition, the two-qubit cost model of [6] is used for evaluating the results. A naive SWAP insertion method and the method of [12] were used to evaluate the results for the LNN architecture.
For the naive method, move and delete rules were applied to the synthesized circuits to remove redundant gates. To estimate circuit depth, the greedy level compaction algorithm of [14] was implemented without applying the templates.

Table 4 and Table 5 report the quantum cost (QC), the two-qubit cost (2-qubit) and the depth (Depth) for the synthesized circuits without and with limited interaction, respectively. Since [10] does not target the limited interaction of the LNN architecture, we used the method of [12] on the results of [10] and on ours to insert SWAP gates. The runtimes of [10] and of our method are less than one minute for the selected benchmarks. In the proposed method, this time includes the time required for applying the distribution procedure in the parallel structure and the time required for synthesis and for applying the move and delete rules. In the parallel structure, due to the qubit reordering in [12], at most 3n(3n − 1) SWAP gates are used between the input storing block, the subsets and the output restoring block to order the lines.

Table 4 Comparison of the proposed approach and prior best results. #A is the number of ancillae. R and P are used for regular and parallel structures, respectively. The resulting circuits are available at http://ceit.aut.ac.ir/~arabzadeh/results/, and may be viewed with RCViewer+ [19].
Function      n   R/P  #A   Proposed method            [10]                       Improvement (%)
                            QC      2-qubit  Depth     QC      2-qubit  Depth     QC     2-qubit  Depth
hwb8          8   R    -    6686    4468     5622      6940    5348     5442      3.6    16.4     -3.3
hwb8          8   P    16   6964    4730     1999      6940    5348     5442      -0.3   11.5     63.2
hwb9          9   R    -    14474   10382    12054     16173   12479    12472     10.5   16.8     3.3
hwb9          9   P    18   15262   10764    4351      16173   12479    12472     5.6    13.7     65.1
hwb10         10  R    -    35298   23584    29751     35618   25453    27812     0.8    7.3      -6.9
hwb10         10  P    20   35890   23874    10062     35618   25453    27812     -0.7   6.2      63.8
hwb11         11  R    -    86864   65260    71418     90745   71175    69763     4.2    8.3      -2.3
hwb11         11  P    22   87234   65442    23866     90745   71175    69763     3.8    8.0      65.7
nth prime7    7   R    -    2888    2296     2473      3172    2841     2514      8.9    19.1     1.6
nth prime7    7   P    14   3100    2398     1523      3172    2841     2514      2.2    15.5     39.4
nth prime8    8   R    -    7016    5624     5852      7618    6622     5793      7.9    15.0     -1.0
nth prime9    9   R    -    16820   11907    14285     17975   14076    13941     6.4    15.4     -2.4
nth prime9    9   P    18   17507   12053    13787     17975   14076    13941     2.6    14.3     1.1
nth prime10   10  R    -    38843   27743    31924     40301   31841    31254     3.6    12.8     -2.1
nth prime10   10  P    20   39317   27933    12119     40301   31841    31254     2.4    12.2     61.2
nth prime11   11  R    -    92863   67401    75668     95433   75474    72934     2.6    10.6     -3.7
nth prime11   11  P    22   93389   67677    46892     95433   75474    72934     2.1    10.3     35.7

Average improvement (%), R: QC 5.4, 2-qubit 13.6, Depth -1.9. Average improvement (%), P: QC 2.2, 2-qubit 11.5, Depth 49.4.

Table 5 Comparison of the proposed approach and the one in [10] with the nearest neighbor limitation. The improvement column compares the results after applying [12] on both methods. The resulting circuits are available at http://ceit.aut.ac.ir/~arabzadeh/results/, and may be viewed with RCViewer+ [19].
Function      n   R/P  #A   Proposed+Naive    Proposed+[12]      [10]+[12]          Improvement (%)
                            QC       Depth    QC       Depth     QC       Depth     QC     Depth
hwb8          8   R    -    36684    32313    31553    20940     36732    22720     14.0   7.8
hwb8          8   P    16   46788    14758    36045    9248      36732    22720     1.8    59.2
hwb9          9   R    -    87310    74676    77860    46958     91805    51181     15.1   8.2
hwb9          9   P    18   100228   31810    87389    19597     91805    51181     4.8    61.7
hwb10         10  R    -    279496   248524   202903   112623    228240   117893    11.1   4.4
hwb10         10  P    20   291014   89021    212616   41479     228240   117893    6.8    64.8
hwb11         11  R    -    682182   605294   562817   297986    611843   307114    8.0    2.9
hwb11         11  P    22   685944   205472   569876   104372    611843   307114    6.8    66.0
nth prime7    7   R    -    12264    10649    10922    9799      15356    10130     28.8   3.2
nth prime7    7   P    14   15106    7734     15897    6930      15356    10130     -3.5   31.5
nth prime8    8   R    -    35976    29975    30796    26920     42059    24574     26.7   -9.5
nth prime9    9   R    -    91984    76910    90511    54457     99003    55737     8.5    2.2
nth prime9    9   P    18   98686    76020    95362    54850     99003    55737     3.6    1.5
nth prime10   10  R    -    241538   199996   222865   124122    248901   137091    10.4   9.4
nth prime10   10  P    20   250526   79165    228777   49613     248901   137091    8.0    63.8
nth prime11   11  R    -    654910   577721   576047   308413    625320   324005    7.8    4.8
nth prime11   11  P    22   665132   361756   585165   195500    625320   324005    6.4    39.6

Average improvement (%), R: QC 14.6, Depth 3.8. Average improvement (%), P: QC 4.4, Depth 48.6.

As can be seen in Table 5, the effect of the post-processing method is more significant for [10], but altogether the results of the proposed LNN-based method are better than those of [10] after applying [12] on both methods. Note that using negative controls does not increase the quantum cost. For odd permutations, one more ancilla should be added. The two-qubit costs are compared in Table 4, and the results show 13.6% and 11.5% improvement on average for the regular and parallel structures, respectively. In the parallel structure, the average depth improvement for the N-th prime benchmarks is less than that for the hwbN functions, since the input cycles of the N-th prime functions are unstructured with different cycle lengths, which results in unbalanced subsets after distribution. Input cycle distributions after decomposition (DCM) and distribution (DIST) are reported in Table 3.
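The improvement percentages in Table 4 and Table 5 are consistent with a relative reduction with respect to the reference result of [10]. The paper does not state the formula explicitly, so the following short sketch is an assumption that was checked only against the tabulated values; the function name `improvement` is hypothetical.

```python
def improvement(reference, proposed):
    """Relative reduction of `proposed` with respect to `reference`, in percent."""
    return 100.0 * (reference - proposed) / reference


# hwb8 depth values from Table 4: [10], regular (R) and parallel (P) structures.
depth_ref, depth_r, depth_p = 5442, 5622, 1999

print(improvement(depth_ref, depth_r))  # Table 4 lists -3.3 for this entry
print(improvement(depth_ref, depth_p))  # Table 4 lists 63.2 for this entry
```

A negative value means that the proposed circuit is worse than the reference for that metric, which is the case for the depth of several regular-structure circuits. The computed values agree with the tabulated ones up to the last printed digit.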
For hwbN functions, applying the distribution method leads to 3 sets with almost the same numbers of elementary cycles. We report the circuit depth for each set along with the total depth after considering the effect of the input storing and output restoring blocks in this table. As reported in Table 3, function nth prime8 has one disjoint input cycle. Accordingly, the resulting elementary cycles should be assigned to one set by the proposed method.

In choosing the benchmark functions considered in this paper, the general guidelines presented in [10] and [3] were followed. These guidelines stipulate that one of the scenarios in which cycle-based methods render significantly superior results is when the input function contains permutations without regular patterns, such as the hwbN and N-th prime functions [10]. For this reason, only the results of these functions are reported in this paper. As for other functions in [18], some are reported in [10] along with a discussion on their suitability for the cycle-based approach (like Permanent). To avoid being repetitive, we did not include this set in this paper. There are yet other benchmarks that include important arithmetic functions like adders, multipliers and group arithmetic (e.g., in Galois fields). Since the proposed cycle-based synthesis method is a general synthesis approach, it may not produce interesting results compared to other approaches specifically developed for those benchmark functions.

7 Conclusion

In this paper, a cycle-based synthesis approach is proposed to reduce logical depth for architectures with limited interactions. The proposed method focuses on the interaction cost and depth in addition to the traditional quantum cost metric, as part of a multi-objective view.
To achieve this, we redesigned the elementary cycles in [10] with negative controls and limited interaction between gate lines. Moreover, a new parallel circuit structure was proposed for reversible logic in the presence of several ancilla registers. Altogether, this structure, which can also be used with other synthesis methods, combined with the proposed cycle-based synthesis method for interaction cost, forms our complete flow for depth-optimized reversible circuit synthesis. A given permutation is written as a set of disjoint cycles to be used in the proposed parallel circuit structure. Then, the resulting cycles are distributed among the available n-line registers based on the bin packing problem. The cycles are then synthesized on the assigned registers independently. Our experiments and analysis show the effectiveness of the proposed approach, with and without the interaction cost limitations, for the attempted benchmarks and in the worst case.

References

1. I. L. Markov and M. Saeedi. Constant-optimized quantum circuits for modular multiplication and exponentiation. Quant. Inf. and Comput., 12(5&6):0361–0394, 2012.
2. S. Aaronson and D. Gottesman. Improved simulation of stabilizer circuits. Phys. Rev. A, 70:052328, 2004.
3. M. Saeedi and I. L. Markov. Synthesis and optimization of reversible circuits - a survey. ACM Computing Surveys, e-print, arXiv:1110.2574, 2012.
4. D. Cheung, D. Maslov, and S. Severini. Translation techniques between quantum circuit architectures. In Workshop on Quantum Information Processing, 2007.
5. R. Van Meter and M. Oskin. Architectural implications of quantum computing technologies. J. Emerg. Technol. Comput. Syst., 2(1):31–63, 2006.
6. D. Maslov and M. Saeedi. Reversible circuit optimization via leaving the Boolean domain. IEEE Trans. on CAD, 30(6):806–816, 2011.
7. P. Gupta, A. Agrawal, and N. K. Jha. An algorithm for synthesis of reversible logic circuits. IEEE Trans. on CAD, 25(11):2317–2330, 2006.
8. D. Maslov, G. W. Dueck, and D. M. Miller. Techniques for the synthesis of reversible Toffoli networks. ACM Trans. Des. Autom. Electron. Syst., 12(4):42, 2007.
9. R. Wille and R. Drechsler. BDD-based synthesis of reversible logic for large functions. Design Autom. Conf., pages 270–275, 2009.
10. M. Saeedi, M. Saheb Zamani, M. Sedighi, and Z. Sasanian. Reversible circuit synthesis using a cycle-based approach. J. Emerg. Technol. in Comput. Syst., 6(4):1–26, December 2010.
11. D. M. Miller, R. Wille, and R. Drechsler. Reducing reversible circuit cost by adding lines. Int'l Symp. on Multiple-Valued Logic, pages 217–222, 2010.
12. M. Saeedi, R. Wille, and R. Drechsler. Synthesis of quantum circuits for nearest neighbor architectures. Quant. Inf. Proc., 10(3):355–377, 2011.
13. Y. Hirata, M. Nakanishi, S. Yamashita, and Y. Nakashima. An efficient conversion of quantum circuits to a linear nearest neighbor architecture. Quant. Inf. and Comput., 11(1&2):0142–0166, 2011.
14. D. Maslov, G. W. Dueck, D. M. Miller, and C. Negrevergne. Quantum circuit simplification and level compaction. IEEE Trans. on CAD, 27(3):436–444, March 2008.
15. C. Moore and M. Nilsson. Parallel quantum computation and quantum codes. SIAM Journal on Computing, 31:799–815, 2001.
16. V. V. Shende, A. K. Prasad, I. L. Markov, and J. P. Hayes. Synthesis of reversible logic circuits. IEEE Trans. on CAD, 22(6):710–722, June 2003.
17. M. Saeedi, M. Sedighi, and M. Saheb Zamani. A library-based synthesis methodology for reversible logic. Microelectron. J., 41(4):185–194, Apr 2010.
18. D. Maslov. Reversible logic synthesis benchmarks page. http://webhome.cs.uvic.ca/~dmaslov, 2011.
19. M. Arabzadeh and M. Saeedi. RCviewer+, A viewer/analyzer for reversible and quantum circuits, version 2.41. Available at http://ceit.aut.ac.ir/QDA/RCV.htm, 2011.