Gate-Diffusion Input (GDI): A Power-Efficient Method for Digital Combinatorial Circuits
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 10, NO. 5, OCTOBER 2002
I. INTRODUCTION
With the rapid development of portable digital applications, the demand for increasing speed, compact implementation, and low power dissipation triggers numerous research efforts [1]-[3]. The wish to improve the performance of logic circuits, once based on traditional CMOS technology, resulted in the development of many logic design techniques during the last two decades [17]. One form of logic that is popular in low-power digital circuits is pass-transistor logic (PTL). Formal methods for deriving pass-transistor logic have been presented for nMOS. They are based on a model in which a set of control signals is applied to the gates of nMOS transistors and another set of data signals is applied to the sources of the n-transistors [1]. Many PTL circuit implementations have been proposed in the literature [1], [2], [4]-[6], [14]. Some of the main advantages of PTL over standard CMOS design are 1) high speed, due to the small node capacitances; 2) low power dissipation, as a result of the reduced number of transistors; and 3) lower interconnection effects [7], [8], due to a small area. However, most of the PTL implementations have two basic problems. First, the threshold drop across the single-channel pass transistors results in reduced current drive and hence slower operation at reduced supply voltages; this is particularly important for low-power design since it is desirable to operate at the lowest possible voltage level. Second, since the high input voltage level at the regenerative inverters is not $V_{DD}$, the PMOS device in the inverter is not fully turned off, and hence direct-path static power dissipation could be significant [4].
There are many sorts of PTL techniques that intend to solve the problems mentioned above [5].
1) Transmission gate CMOS (TG) uses transmission gate logic to realize complex logic functions using a small number of complementary transistors. It solves the problem of low logic level swing by using pMOS as well as nMOS [1].
2) Complementary pass-transistor logic (CPL) features complementary inputs/outputs using nMOS pass-transistor logic with CMOS output inverters. CPL's most important features are the small stack height and the low swing of the internal nodes, which contribute to lowering the power consumption. CPL suffers from static power consumption due to the low swing at the gates of the output inverters. To lower the power consumption of CPL circuits, LCPL and SRPL circuit styles are used. Those styles contain pMOS restoration transistors or cross-coupled inverters (respectively).
3) Double pass-transistor logic (DPL) uses complementary transistors to keep full swing operation and reduce the dc power consumption. This eliminates the need for restoration circuitry. One disadvantage of DPL is the large area used due to the presence of pMOS transistors.
An additional problem of existing PTL is top-down logic design complexity, which prevents the pass transistors from capturing a major role in real logic LSIs [6]. One of the main reasons for this is that no simple and universal cell library is available for PTL-based design. This paper proposes a new low-power design technique that allows solving most of the problems mentioned above: the gate-diffusion input (GDI) technique.
Manuscript received May 1, 2001; revised January 26, 2002. A. Morgenshtein is with the Biomedical Engineering Department, Technion-Israel Institute of Technology, Haifa 32000, Israel (e-mail: arkadiy@tx.technion.ac.il). A. Fish is with the Electrical Engineering Department, Ben-Gurion University, Israel (e-mail: afish@ee.bgu.ac.il). I. A. Wagner is with IBM Haifa Labs, Haifa University, Mount Carmel, Israel (e-mail: wagner@il.ibm.com). Digital Object Identifier 10.1109/TVLSI.2002.801578
The GDI approach allows implementation of a wide range of complex logic functions using only two transistors. This method is suitable for the design of fast, low-power circuits using a reduced number of transistors (as compared to CMOS and existing PTL techniques), while improving logic level swing and static power characteristics and allowing simple top-down design by using a small cell library. Section II presents basic GDI functions and their circuit principle. In Section III, a detailed analysis of the GDI cell is presented. Section IV shows a design methodology for GDI circuitry. Comparisons of some basic logic functions and high-level combinatorial circuits designed in CMOS, PTL, and GDI are discussed
in Section V. Section VI presents measurements of a test chip, fabricated in GDI and CMOS. Conclusions and future work are discussed in Section VII.
II. BASIC GDI FUNCTIONS
The GDI method is based on the use of a simple cell as shown in Fig. 1. At first glance, the basic cell reminds one of the standard CMOS inverter, but there are some important differences.
1) The GDI cell contains three inputs: G (common gate input of nMOS and pMOS), P (input to the source/drain of pMOS), and N (input to the source/drain of nMOS).
2) Bulks of both nMOS and pMOS are connected to N or P (respectively), so they can be arbitrarily biased, in contrast with a CMOS inverter.
It must be remarked that not all of the functions are possible in a standard p-well CMOS process, but they can be successfully implemented in twin-well CMOS or silicon-on-insulator (SOI) technologies. This issue will be discussed in Section VII. Table I shows how a simple change of the input configuration of the simple GDI cell corresponds to very different Boolean functions. Most of these functions are complex (6-12 transistors) in CMOS, as well as in standard PTL implementations, but very simple (only two transistors per function) in the GDI design method. In this paper, most of the designed circuits were based on the F1 and F2 functions. The reasons for this are as follows.
1) Both F1 and F2 are complete logic families (they allow realization of any possible two-input logic function).
2) F1 is the only GDI function that can be realized in a standard p-well CMOS process, because the bulk of any nMOS is constantly and equally biased.
3) When the N input is driven at high logic level and the P input is at low logic level, the diodes between NMOS and PMOS
bulks to Out are directly polarized and there is a short between N and P, resulting in static power dissipation and [...]. This causes a drawback for OR, AND, and MUX implementations in regular CMOS with [...] configuration. The effect can be reduced if the design is performed in floating-bulk SOI technologies [22], where a full GDI library can be implemented. In that case, floating-bulk effects have to be considered.
As can be seen, the GDI cell structure is different from the existing PTL techniques reviewed in Section I and has some important features that allow improvements in design complexity level, transistor count, and power dissipation (all of these will be discussed in Sections IV-VI). Understanding GDI cell properties demands a deeper operational analysis of the basic cell in different cases and configurations.
III. ANALYSIS OF GDI CIRCUITS
In this section, we analyze GDI circuits. First we explain their operation and analyze their transient behavior. Then we consider swing restoration issues and switching characteristics.
A. Operational Analysis of GDI Cell
As mentioned in Section I, one of the common problems of PTL design methods is the low swing of output signals because of the threshold drop across the single-channel pass transistors. In existing PTL techniques, additional buffering circuitry is used to overcome this problem. To understand the effects of the low swing problem in a GDI cell, we suggest the following analysis, which is based on the example of the F1 function and can be easily extended to other GDI functions. Table II presents a full set of logic states and related functionality modes of F1. As can be seen from Table II, the only state where low swing occurs in the output value is $A = 0$, $B = 0$. In this case, the voltage level of F1 is $V_{Tp}$ (instead of the expected 0 V) because of the poor high-to-low transition characteristics of the pMOS pass transistor [4].
It is obvious that the only case (among all the possible transitions) where the effect occurs is the transition from $A = 0$, $B = 1$ to $A = 0$, $B = 0$. The fact that demands special emphasis is that in about 50% of the cases (for $B = 1$), the GDI cell operates as a regular CMOS inverter, which is widely used as a digital buffer for logic-level restoration. In some of these cases, when
TABLE II INPUT LOGIC STATES VERSUS FUNCTIONALITY AND OUTPUT SWING OF F1 FUNCTION
Fig. 2.
[...] without a swing drop from the previous stages, a GDI cell functions as an inverter buffer and recovers the voltage swing. Although this feature allows a self-swing restoration in certain cases, in this paper the worst case is assumed and additional circuitry is used for swing restoration in the implemented circuits.
B. Transient Analysis
The exact transient analysis for a basic GDI cell is, in most cases, similar to that of a standard CMOS inverter, widely presented in the literature [9], [10]. This classic analysis is based on the Shockley model, where the drain current $I_D$ is expressed as follows:
[Eq. (1) not recoverable: piecewise expression for $I_D$ in the subthreshold, linear, and saturation regions]
where $K$ is the drivability factor, $V_{TH}$ is the threshold voltage, $W$ is the channel width, and $L$ is the channel length.
In contrast with CMOS inverter analysis [8], where [...] was considered as an input voltage, in most GDI circuits [...] must be taken as a variable of the input voltage in the Shockley model. In this paper, we shall only discuss the aspects in which GDI differs from CMOS.
The case of most interest is when a step signal is supplied to the diffusion of the nMOS transistor and causes a swing drop in the output. Fig. 2 shows the schematic and a transient response for this case. During this response, the nMOS transistor passes from the saturation to the subthreshold region. Assuming a fast transition in the input, the linear region can be neglected in our analysis. Analytical expressions that describe the transient response can be derived from (1), while considering the capacitive load at the output. The capacitive current is
$$I_C = C\,\frac{dV_C}{dt} = I_D \quad (2)$$
where $C$ is the output capacitance, $V_C$ is the voltage across the capacitance, $I_C$ is the current charging the capacitor, and $I_D$ is the drain current through the n-channel device. The expression for $V_C$ as a function of time is derived as follows. In the saturation region,
[Eq. (3) not recoverable]
where, in the case of GDI cells linked through diffusion inputs, the capacitance $C$ includes both diffusion and well capacitances of the driven cell. The integral form of (3) is
[Eq. (4) not recoverable]
The same expression can be written as
[Eqs. (5) and (6) not recoverable]
where [...], [...], and [...] are constants of the process or the given circuit. The final expression of the transient response in the saturation region is
[Eq. (7) not recoverable]
where $t$ is time in the saturation region and [...] is a constant of integration, calculated for the initial conditions ([...]). The solution of (7) is done numerically (e.g., in MATLAB) for specific values of ([...]).
After entering the subthreshold region, the output voltage continues rising according to (1) while the output capacitance is charged by the subthreshold current. In the subthreshold region,
[Eqs. (8) and (9) not recoverable]
where $T$ is the temperature in K, $k$ is Boltzmann's constant, $q$ is the charge of an electron, and [...] is a constant
[Eq. (10) not recoverable]
The expression for the response in the subthreshold region is
[Eqs. (11) and (12) not recoverable]
where [...] is a constant of integration defined by the initial conditions, [...] is from (10), and $V_{TH}$ is the threshold voltage.
It must be noted that the analysis of propagation delay of a basic GDI cell given by (2)-(7) can be refined by taking into account the effect of the diode between the NMOS source and body. This diode is forward biased during the transient (Fig. 2). By conducting an additional current, it contributes to charging the output capacitance $C$. This current contribution can be calculated to be
[Eq. (13) not recoverable: diode current expression]
where [...] is the diode current, [...] is the reverse current, and [...] is a factor between one and two. This current should be added in (2) to derive an improved propagation delay, resulting in a faster transient operation of the GDI cell.
C. Analysis of Swing-Restoring Buffers
As mentioned above, an important concern in PTL circuit design is the problem of swing degradation. This section presents a methodology for swing restoration in GDI circuits under constraints of area (power) and circuit frequency (delay). The simplest method of swing restoration is to add a buffer stage after every GDI cell. This will certainly prevent the voltage drop, but the cost is additional area, delay, and power dissipation, which makes this method highly inefficient. Note that our approach to swing restoration is rather simple; various buffering techniques are presented in the literature, e.g., [6] and [14].
Given a clocked logic circuit with known $T_{cycle}$ and $T_{setup}$, buffering of cascaded GDI cells will be optimal if the following conditions are preserved.
1) Successive Swing Restoration: While cascading GDI cells, each cell contributes a voltage drop in the output that is equal to $V_{drop}$. Assuming $0.3V_{DD}$ as the maximal allowed voltage drop of the whole cascade, the number of linked GDI cells between two buffers, $N_1$, is limited by
$$N_1 \le \frac{0.3\,V_{DD}}{V_{drop}} \quad (14)$$
As shown above, after exiting the saturation region, the value of $V_{drop}$ is equal to $V_{TH}$ and decreases with time as follows, using (9):
[Eq. (15) not recoverable]
Equation (15) applies for the subthreshold region only, namely, for [...]. According to (15), remaining in the subthreshold region for [...] will assure a significant decrease of $V_{drop}$ and, as a result, an increase in the number of linked cells $N_1$. This allows achieving successive swing restoration while using a lower number of buffers. Fig. 3 presents Cadence Spectre simulation results of operation in the subthreshold region in an AND gate implemented in GDI. If interconnection effects are essential, a signal potential loss over long interconnects has to be treated. In this case, (15) will be extended with respect to IR drop.
Fig. 3. Decrease of $V_{drop}$ in the subthreshold region with $V_{DD}$ = 3.3 V.
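As a rough numeric illustration of the successive swing restoration constraint in (14), the sketch below assumes that the bound has the form "number of cells times per-cell drop must not exceed 0.3 of the supply", which is our reading of the surrounding text rather than a verbatim copy of the equation. The function name, the parameter names, and the sample drop values are hypothetical and are not measurements from the paper.

```python
import math

def max_cells_between_buffers(v_dd, v_drop, allowed_fraction=0.3):
    """Largest number of cascaded GDI cells whose accumulated drop stays in budget.

    Assumption: every diffusion-linked cell contributes the same drop v_drop
    (volts), and the whole cascade may lose at most allowed_fraction * v_dd.
    """
    if not (v_dd > 0 and v_drop > 0):
        raise ValueError("v_dd and v_drop must be positive")
    # The small epsilon guards against floating-point results such as 2.9999999.
    return math.floor(allowed_fraction * v_dd / v_drop + 1e-9)

# Illustrative values only.
print(max_cells_between_buffers(3.3, 0.45))  # -> 2
print(max_cells_between_buffers(3.3, 0.20))  # -> 4
```

A smaller per-cell drop, such as the one obtained by staying longer in the subthreshold region, directly raises the number of cells that can be linked before a buffer is required.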
Suppose that the voltage [...] has to be applied to the drain input of the NMOS (Fig. 2) through a long wire. For given $W$ and $L$ dimensions, the resistance of the interconnect is defined by
$$R = R_{\square}\,\frac{L}{W} \quad (16)$$
where $R_{\square}$ is the metal sheet resistance per square. The current flowing through the wire and causing the voltage drop is given by
[Eq. (17) not recoverable]
[...] can be determined by the equalization between the wire and NMOS transistor currents as follows:
[Eq. (18) not recoverable]
where [...] is found from (1) according to the operation region of the transistor. Equation (18) can be solved numerically, and its contribution to the final [...] expression is represented by
[Eq. (19) not recoverable]
with [...] from (15).
The operation in the subthreshold region causes an increase of delay. Therefore, this method can be efficiently used mostly in low-frequency design.
Scaling, namely, $V_{DD}$ reduction and threshold nonscalability, influences the number of required buffers in GDI design (14). As a result, when operation with lower supply voltages is performed while the same technology and $V_{TH}$ remain, insertion of additional buffers has to be considered. The direct impact of this is on the area and number of gates.
Finally, several points have to be emphasized concerning the buffer insertion topology in GDI.
1) Buffer insertion has to be considered only in the case of linking GDI cells through diffusion inputs. No buffers are needed before gate inputs of GDI cells.
2) Due to this feature, the mixed path topology can be used as an efficient method for buffer insertion. It allows one to reduce the number of buffers by intermittently involving diffusion and gate inputs in a given signal path.
Fig. 4.
3) The designer should check the tradeoff between buffer insertion and delay, area, and power consumption to achieve an efficient swing restoration.
2) Impacts of Process Variation on Swing Restoration: In every VLSI process, there are variations in parameters like threshold tracking, [...] variations, etc. The process dependence of [...] and [...] influences the value of [...] and the swing restoration in GDI. This effect can be best described by defining a sensitivity of [...] to the mentioned parameter variations as follows:
Current sensitivity of [...]: [Eq. (20) not recoverable]
Threshold sensitivity of [...]: [Eq. (21) not recoverable]
Fig. 5.
where [...] is given by (19).
3) Constraint of Maximal Delay of Cascade: A signal path in a cascade can be represented by a single-branch RC tree [11], [12], where [...] are effective resistances of conducting transistors and [...] are capacitive loads caused by the following devices. An example of an RC tree can be seen in Fig. 4. [...] is defined as the resistance of the path between the input and output (for an RC tree without side branches). [...] is the resistance between the input and node [...]. [...] is the capacitance at node [...]. The following times are defined in order to derive the bounds of delay in an RC tree:
[Eq. (22) not recoverable]
Fig. 6.
[Eq. (23) not recoverable]
The maximal delay of the RC tree can be derived numerically from the bound for the time, given in the following equation:
[Eq. (24) not recoverable]
The number of stages $N_2$ in a GDI cascade can be found for the maximal total delay time [...], while using the condition
[Eq. (25) not recoverable]
Notice that (25) can be checked only after [...] was assumed and a suitable RC tree was built. The maximal number of stages in cascade between two buffers is therefore the minimal value between $N_1$ and $N_2$, the bounds obtained from (14) and (25).
D. Switching Characteristics: GDI Versus CMOS
Due to the complexity of the logic function that can be implemented in a GDI cell by using only two transistors, it is important to compare its switching characteristics with those of a CMOS gate whose logic function is of the same order of complexity. This comparison can be used as a basis for delay estimation in the early stages of circuit design, if GDI or CMOS design techniques are considered. While a GDI cell's characteristics are close to those of a standard inverter, the gate with equivalent functional complexity in CMOS will be NAND. The switching behavior of the inverter can be generalized by examining the parasitic capacitances and resistances associated with the inverter [15], [16]. Consider the inverter shown in Fig. 5 with its equivalent digital model. The propagation delay for an inverter driving a capacitive load is
[Eq. (26) not recoverable]
where [...] is the total capacitance on the output of the inverter, that is, the sum of the output capacitance of the inverter, any capacitance of interconnecting lines, and the input capacitance of the following gate(s). A NAND gate with a series connection of identical n-channel MOSFETs is shown in Fig. 6. We can estimate the intrinsic switching time of series-connected MOSFETs with an external load capacitance by [16]
[Eq. (27) not recoverable]
The first term in this equation represents the intrinsic switching time of the series connection of MOSFETs, while the second term represents the RC delay caused by charging the load capacitance.
For [...] equal to 3/2 and assuming two serial n-MOS transistors, the propagation delay in NAND is
[Eq. (28) not recoverable]
Therefore, the delay of a NAND gate compared to a GDI gate is approximated by
[Eq. (29) not recoverable]
where the high bound is for low load capacitance and the low bound is for high load capacitance. Note that this ratio will become better if the effect of the body-source diode in a GDI cell is considered (13) and the delay formula in (7) is used in its improved form.
E. Fan-in and Fanout Bounds in GDI
1) Fanout: Following the analysis performed in Section III-D, GDI cells with a two-transistor structure can be compared with CMOS gates with equivalent functional complexity. This approach allows definition of fanout bounds by using the logic-effort concept [23]. The logic effort is directly related to the fanout when the effort delay of a logic structure is analyzed. The effort delay of the logic gate is the product of these two factors
$$f = g\,h \quad (30)$$
where $f$ is the effort delay, $g$ is the logic effort, and $h$ represents the fanout of the gate. For a desired delay, reducing the logic effort results in an improved fanout by the same ratio. The values of logic effort are given by Sutherland in [23] for inputs of various static CMOS gates, normalized to the logic effort of an inverter. While a GDI cell's logic effort is close to that of a standard inverter, the equivalent logic functions in CMOS will be NAND, NOR, or MUX, depending on the GDI input configuration (Table I). It can be seen from [23] that the following improvement factors in the fanout of GDI are derived relative to CMOS:
a) for F1, F2 versus NAND, the factor is 4/3;
b) for F1, F2 versus NOR, the factor is 5/3; and
c) for MUX in GDI versus CMOS, the factor is 2.
The presented values are correct for the gate input of a GDI cell, which makes its characteristics similar to those of the CMOS inverter. If the diffusion input is considered, an additional factor has to be applied to represent the capacitance ratio between the gate and diffusion inputs: the factors given above have to be multiplied by [...]. Both parameters are defined by the design technology.
2) Fan-in: As will be shown in Section IV, an ($n$ + 2)-input GDI cell can be implemented by extension of any $n$-input CMOS structure. While the stack of serial MOSFET devices and fan-in in CMOS gates are limited by body-effect considerations, the addition of diffusion inputs in GDI for the same structure results in an improved fan-in, defined by
$$\text{Fan-in}_{GDI} = \text{Fan-in}_{CMOS} + 2 \quad (31)$$
Fig. 7. General GDI cell implementation.
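The fanout factors quoted in Section III-E can be checked with a few lines of arithmetic. The sketch below uses the effort-delay product from (30) and the standard per-input logic-effort values for static CMOS gates from [23] (inverter 1, two-input NAND 4/3, two-input NOR 5/3, two-input MUX 2). Treating the gate input of a GDI cell as having the logic effort of an inverter follows the text; the dictionary and function names are hypothetical.

```python
from fractions import Fraction

# Logic effort per input for static CMOS gates, normalized to an inverter.
LOGIC_EFFORT = {
    "INV": Fraction(1),
    "NAND2": Fraction(4, 3),
    "NOR2": Fraction(5, 3),
    "MUX2": Fraction(2),
}

def effort_delay(gate, fanout):
    """Effort delay f = g * h for a gate type and electrical fanout h."""
    return LOGIC_EFFORT[gate] * fanout

def fanout_for_delay(gate, target_delay):
    """Fanout h that a gate can drive while meeting a target effort delay."""
    return Fraction(target_delay) / LOGIC_EFFORT[gate]

# Assumption from the text: a GDI gate input behaves like an inverter (g = 1).
for cmos_gate in ("NAND2", "NOR2", "MUX2"):
    gain = fanout_for_delay("INV", 4) / fanout_for_delay(cmos_gate, 4)
    print(cmos_gate, gain)
```

For an equal effort delay, the ratio printed for each CMOS gate is the fanout improvement factor, and it equals that gate's logic effort. The diffusion-input correction mentioned in the text is not modeled here.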
Note that for the F1 and F2 functions, where only one additional input is applied to diffusion, the fan-in will increase by one compared to CMOS.
F. Discussion
In this section, the analysis of a basic GDI cell was presented. The operational and transient analysis was performed, as well as a comparison of the switching characteristics of CMOS and GDI, showing the advantages of GDI in terms of delay, number of transistors, and area and power consumption. Several drawbacks, mostly related to the connection of inputs to MOSFET wells, have to be mentioned: 1) the threshold drop and, in some cases, an increased diffusion input capacitance (both exist also in PTL techniques and were considered in simulations and analysis) and 2) the relative increase of circuit area because of separated MOSFET wells (comparisons based on real layouts will be presented in Section V). However, as we shall show in Sections V and VI, those drawbacks are mostly compensated by the advantages of GDI circuits.
IV. A DESIGN METHODOLOGY FOR COMBINATORIAL CIRCUITS USING GDI CELLS
A. Designing Leaf Cells in GDI
The examples of GDI functions given in Table I refer only to the extension of a single-input CMOS inverter structure to a triple-input GDI cell in order to achieve implementation of complicated logic functions with a minimal number of transistors. Actually, this approach can be defined in a more general form. Extension of any $n$-input CMOS structure to an ($n$ + 2)-input GDI cell can be done by introducing an input P instead of the supply voltage $V_{DD}$ in the pMOS block of a CMOS structure and an input N instead of $V_{SS}$ in the nMOS block (see Fig. 7). This extended implementation can be represented by the following logic expression:
$$\mathrm{Out} = \bar{F}(x_1, \ldots, x_n)\,P + F(x_1, \ldots, x_n)\,N \quad (32)$$
where $F(x_1, \ldots, x_n)$ is a logic function of the nMOS block (not of the whole original $n$-input CMOS structure). An example of this extension can be seen in Fig. 8, where a three-input CMOS structure is converted to a five-input GDI cell.
Fig. 8. Example of five-input GDI cell.
Equation (32) can be used to implement a Shannon expansion [18], writing a function $Z$ with inputs $\{x_1, \ldots, x_n\}$ as
$$Z(x_1, \ldots, x_n) = x_1 H(x_2, \ldots, x_n) + \bar{x}_1 J(x_2, \ldots, x_n) \quad (33)$$
where $H$ and $J$ are
$$H = Z|_{x_1 = 1}, \qquad J = Z|_{x_1 = 0} \quad (34)$$
Shannon expansion is a very useful technique for precomputation-based low-power design in sequential logic circuits, due to its multiplexing properties [13]. In multiplexer-based precomputation, input $x_1$ can be used as an enable line of functions $H$ and $J$ and as the select line of a multiplexer that chooses between the data of $H$ and $J$, so that for a given value of $x_1$, only one of the $H$ or $J$ blocks will operate, which significantly reduces the power dissipation of the circuit. Due to their special properties, GDI cells can be successfully used for low-power design of combinatorial circuits, while combining two approaches: 1) Shannon expansion and 2) combinational logic precomputation, where transitions of logic values are prevented from propagating through the circuit if the final result does not change as a result of those transitions. Fig. 9 shows an architecture based on (32). We implement the functions $H$ and $J$. Depending on the value of $x_1$, only one of the functions will drive the data computed as a result of its input transitions, while the data transitions from the other function are prevented from propagating to the next logic block C. The applicability of Shannon expansion to any logic function allows GDI implementation of any digital circuit in order to achieve low power dissipation.
B. Buffer Insertion
By using (14) and (25), values of $N_1$ and $N_2$ can be derived. As mentioned, the maximal number of stages in cascade between two buffers is the minimal value between $N_1$ and $N_2$. The calculation of $N_1$ and $N_2$ depends on process parameters, frequency demand, and output loads. For example, given a 0.35-µm technology process with [...] V, a frequency demand of 40 MHz, and a load capacitance of 100 fF, the maximal number of stages is dictated by (14), where [...] is calculated with [...]. The derived value will be two stages of GDI cells between the buffers.
C. High-Level Design
As was mentioned in Section I, one of the most significant problems of existing PTL is that no simple and universal cell library is available for PTL-based design. The result is a difficulty in developing synthesis tools. This section contains a simple algorithm that allows realization of any logic function by using only the basic GDI cell. It is based on Shannon expansion (33), where any function $F$ can be written as follows:
$$F(x_1, \ldots, x_n) = x_1 H(x_2, \ldots, x_n) + \bar{x}_1 J(x_2, \ldots, x_n) \quad (35)$$
It was shown in Section II (Table I) that the output function of a basic GDI cell (where A, B, and C are inputs to G, P, and N, respectively) is
$$\mathrm{Out} = \bar{A}B + AC \quad (36)$$
This fact makes a standard GDI cell very suitable for implementation of any logic function that is written by Shannon expansion. For example,
$$\text{If } A = x_1,\; B = J,\; C = H, \text{ then } \mathrm{Out} = x_1 H + \bar{x}_1 J = F \quad (37)$$
Fig. 10 shows an algorithm based on Shannon expansion that allows implementation of any function by GDI cells. Algorithm steps:
1) Given a function $F$ with $n$ variables.
2) Check whether the function is equal to 1, 0, or a noninverted single variable.
3) If it is, no additional hardware is needed.
4) If it is not, use Shannon expansion (35) for the given function.
5) Use a GDI cell for the function implementation, using the products of the Shannon expansion ($x_1$, $H$, and $J$).
6) Go back to step 2) for both functions $H$ and $J$.
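The algorithm steps above can be sketched in a few lines of Python. This is only an idealized logic-level illustration, assuming the cell function Out = (not A)·B + A·C from (36) and truth tables stored as dictionaries; it ignores swing degradation, buffers, and the inverted GDI cells used for signal integrity. All function names and the netlist encoding are hypothetical.

```python
from itertools import product

def gdi_cell(g, p, n):
    """Ideal logic model of the basic GDI cell: Out = (not G and P) or (G and N)."""
    return n if g else p

def synthesize(truth, variables):
    """Recursively build a netlist (nested tuples) for a truth table.

    truth maps each input tuple (ordered like `variables`) to 0 or 1.
    Returns 0, 1, a variable name, or ("GDI", g, p, n).
    """
    values = set(truth.values())
    if values == {0}:
        return 0
    if values == {1}:
        return 1
    # A function equal to one noninverted variable needs no hardware.
    for i, name in enumerate(variables):
        if all(out == bits[i] for bits, out in truth.items()):
            return name
    # Shannon expansion around the first variable x: f = x*f|x=1 + x'*f|x=0.
    x, rest = variables[0], variables[1:]
    f1 = {bits[1:]: out for bits, out in truth.items() if bits[0] == 1}
    f0 = {bits[1:]: out for bits, out in truth.items() if bits[0] == 0}
    # G takes x, P takes the x=0 cofactor, N takes the x=1 cofactor.
    return ("GDI", x, synthesize(f0, rest), synthesize(f1, rest))

def evaluate(net, env):
    """Evaluate a synthesized netlist for an assignment env: name -> 0/1."""
    if isinstance(net, tuple):
        _, g, p, n = net
        return gdi_cell(env[g], evaluate(p, env), evaluate(n, env))
    if isinstance(net, str):
        return env[net]
    return net

def count_cells(net):
    """Number of GDI cells (two transistors each) in a netlist."""
    if not isinstance(net, tuple):
        return 0
    return 1 + count_cells(net[2]) + count_cells(net[3])

names = ("x1", "x2")
xor = {bits: bits[0] ^ bits[1] for bits in product((0, 1), repeat=2)}
net = synthesize(xor, names)
print(net)               # -> ('GDI', 'x1', 'x2', ('GDI', 'x2', 1, 0))
print(count_cells(net))  # -> 2
```

In this model the two-input XOR needs two cells: one acting as a multiplexer selected by x1 and one acting as an inverter of x2. Stopping the recursion at single noninverted variables (step 2) is what keeps the count below the worst case of a full expansion tree.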
Fig. 10. Algorithm for implementing a given logic function using GDI cells, based on Shannon expansion.
One advantage of this algorithm is the ability to calculate the maximal count of transistors needed for an $n$-input function implementation in the predesign stage. This can be calculated by
$$M = \sum_{i=0}^{n-1} 2^i = 2^n - 1, \qquad N = 2M = 2^{n+1} - 2 \quad (38)$$
where $N$ is the maximal number of transistors that are needed to implement the function, $M$ is the maximal count of GDI cells, and $n$ is the number of variables in the given function. Knowledge of the maximal number of GDI cells fixes the final maximal area of the circuit.
The following pseudocode shows how any combinatorial function can be synthesized by means of three-input GDI cells, where input [...] not [...]:
Algorithm SyntGDI(f, n)
/* recursively synthesize an n-input function f with GDI cells */
  If (f = 1) then return(1)
  else if (f = 0) then return(0)
  else return(G(SyntGDI(f|x_n=1), x_n, SyntGDI(f|x_n=0)));
As an example, if f = XOR(x_1, x_2), the above procedure will return [...], where G stands for a GDI cell and [...] stands for an inverted GDI cell that is inserted as a postprocess in order to keep signal integrity. This approach can be used in combination with existing cell library-based synthesis tools to achieve an optimized design.
It must be noticed that, as has been shown before, using Shannon expansion in regular logic circuits results in a lower power dissipation but requires significant area overhead. This overhead is caused by the additional precomputation circuitry. On the other hand, a Shannon-based GDI design does not require special precomputation circuitry because of its particular MUX-like nature, so that most area overhead is eliminated. The presented approach can be used in combination with existing cell library-based synthesis tools to achieve an optimized design.
V. COMPARISONS WITH OTHER LOGIC STYLES
A. Leaf Cells Comparisons
1) Cells and Simulation Conditions: Five sets of comparisons were carried out on various logic gates. Circuits were designed at the transistor level in a 0.35-µm twin-well CMOS process technology ([...] V, [...] V). The circuits were simulated using Cadence Spectre at 3.3 V, 40 MHz, and 27 °C, with a load capacitance of 100 fF. In our simulations, the well capacitance and other parasitic parameters were taken into account. Each set includes a logic cell implemented in four different techniques: GDI, CMOS, transmission gate, and nMOS pass gate. Cells were designed for a minimal number of transistors in each technique as shown in Table III, while in nMOS pass-gate cells a buffer was added because of the low swing of the output voltage ([...]). Most circuits were implemented with a [...] ratio of three to achieve the best power-delay performance.
The same transitions of logic values were supplied to the inputs of the test circuits in each technique. Measured values apply to transitions in inputs connected to the gates of transistors, in order to achieve a consistent comparison. Measurements were performed on test circuits that were placed between two blocks, which contain circuits similar to the device under test (DUT). The measured power is that of the DUT, including the power consumed by driving the next stage, thus accounting for the input power consumption and not just the power directly consumed from the supply. This allows more realistic environment conditions for the test circuit, instead of the ideal input transitions of the simulator's voltage sources [24].
The fact that no GDI cell contains a full supply implies that the only power consumed is through the inputs, as GDI cells are fed only by the previous circuits. A similar phenomenon is partially observed in most PTL circuits, but there the power consumption from the source is caused by CMOS buffers, which are included in every regular PTL. Yet, in real circuits and simulations, current flow from the sources can be measured in GDI. It is caused by buffers that are connected between cascaded cells. Hence, a fair comparison between the techniques must be based on measurements carried out on a series of cells with buffers and not on a single cell. GDI and TG test circuits contain two basic cells with one output buffer. N-PG contains two buffers: one after each cell. CMOS has no buffers in test circuits.
2) Comparisons and Results: For each technique, average power, maximal delay, and number of transistors were measured. The results are given in Table IV.
a) Number of transistors comparison: Among all the design techniques, GDI proves to have the minimal number of transistors. Each GDI gate was implemented using only two transistors. The worst case, with respect to transistor count, is for the CMOS MUX gate (multiplexers are the well-known domain of pass-transistor logic). In this sense, the PTL techniques prove to be inferior compared to GDI.
TABLE III AND, OR, AND XOR CELLS USING GDI, CMOS, AND PTL DESIGN TECHNIQUES FOR TWIN-WELL PROCESS
TABLE IV LOGIC GATE COMPARISONS (GDI, CMOS, TRANSMISSION GATE, AND nMOS PASS GATE) USING THE CIRCUIT TOPOLOGIES FROM TABLE III
b) Power dissipation comparison: Results are given for power dissipation in different gates. Consistently for all design techniques, the MUX gate has the largest power consumption because of its complicated implementation (CMOS) and the presence of an additional input. On the other hand, the AND gate's power dissipation is the lowest among all the gates. Still, most GDI logic gates prove to be the most power efficient among the four compared design techniques (only for the F2 gate does the nMOS pass gate show an advantage over GDI).
c) Delay comparison: The best performance with respect to circuit delay was measured in GDI and TG circuits. The advantage of TG in some circuits can be explained by the fact that one nMOS and one pMOS transistor are conducting at once for each logic state in TG gates. It should be noted that the CMOS delays relative to GDI are, in most cases, bounded according to (29), as expected. Circuits implemented in N-PG are the slowest, because of the need for additional buffer circuitry in each gate.
TABLE V POWER AND DELAY RESULTS FOR DIFFERENT LOAD CONDITIONS OF OR AND AND CELLS USING GDI (F1 CONFIGURATION), CMOS, AND PTL TECHNIQUES
3) Discussion: Among the presented design techniques, GDI proves to have the best performance values and the lowest transistor count. Even in the cases where the power or delay parameters of some GDI gates are inferior to those of TG or N-PG, the power-delay products and transistor count of GDI are lower. Only the TG design method is a viable alternative to GDI if high-frequency operation is of concern.

B. Cell Comparisons for Different Load Conditions

A fair comparison of the properties of the techniques mentioned above should involve measuring delay and power consumption under different load conditions of the cell. In this section, the results of parametric simulations for power and delay measurement are presented. The simulations were carried out in SPECTRE to compare NOR and AND GDI cells, using the F1 function, with CMOS, N-PG, and TG techniques (as presented in Table V) in 0.24-μm CMOS technology. A regular CMOS inverter was used as a load for the DUT, with dimensions of 2.4 μm/0.24 μm for the PFET and 0.9 μm/0.24 μm for the NFET. In this technology, the given load size applies a load capacitance of about 1 fF. To make the simulations depend on load conditions, the load size was multiplied by the PS parameter (changing from one to three). The results of power and delay as a function of the PS parameter are presented in Table V, showing the consistent advantage of GDI.

C. High-Level Circuit Comparisons

1) Circuits and Design Methods: To cover a wide range of possible circuits, design methods, and property comparisons, several digital combinatorial circuits were implemented using various methods (GDI, PTL, and CMOS), design techniques, and technology processes. Table VI contains a full list of circuits implemented during the research with respect to design methods and processes.

G: GDI, C: CMOS, P: PTL, * fabricated circuits, ** research in progress (0.35-μm twin-well technology).

a) AND, OR, and XOR Cells: Table III contains AND, OR, and XOR cells using GDI, CMOS, and PTL design techniques. It must be noted that use of the full GDI library is not possible in a regular p-well CMOS process. As a result, only function F1 and its expansions could be implemented. Table VII lists the GDI basic functions implemented for a regular p-well process, together with their layouts.

b) 8-bit CLA adder: The carry-lookahead adder (CLA) structure is well known and widely used thanks to its high-speed operation, which comes from calculating the carries in parallel [1]. The carry of the $i$th stage may be expressed as

$$C_i = G_i + P_i C_{i-1} \tag{39}$$

where

$$G_i = A_i B_i \quad \text{(generate signal)} \tag{40}$$

$$P_i = A_i + B_i \quad \text{(propagate signal)} \tag{41}$$
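The carry relation with generate and propagate signals can be checked with a short Python sketch. This is only an illustration under stated assumptions: the function names are hypothetical, stages are indexed from 0 rather than 1, the propagate signal is taken as the OR of the operand bits, and the sum bit is formed as the XOR of the two operand bits and the incoming carry. The sketch computes each carry both from the stage-by-stage recurrence and from the fully expanded lookahead sum of products, and confirms that the two agree.

```python
def generate_propagate(a, b, width):
    """Per-stage generate (AND) and propagate (OR) signals of operands a and b."""
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]
    p = [((a >> i) | (b >> i)) & 1 for i in range(width)]
    return g, p


def ripple_carries(g, p, c0):
    """Carries from the recurrence: next carry = g + p * previous carry."""
    carries = [c0]
    for gi, pi in zip(g, p):
        carries.append(gi | (pi & carries[-1]))
    return carries


def lookahead_carry(g, p, c0, i):
    """Carry out of stage i from the expanded form, with no ripple dependency."""
    carry = 0
    prefix = 1                        # running product p[i] * p[i-1] * ...
    for j in range(i, -1, -1):
        carry |= prefix & g[j]
        prefix &= p[j]
    return carry | (prefix & c0)


def cla_add(a, b, c0=0, width=4):
    """Add two width-bit numbers using lookahead carries."""
    g, p = generate_propagate(a, b, width)
    carries = [c0] + [lookahead_carry(g, p, c0, i) for i in range(width)]
    total = 0
    for i in range(width):
        s = ((a >> i) ^ (b >> i) ^ carries[i]) & 1
        total |= s << i
    return total | (carries[width] << width)


if __name__ == "__main__":
    print(cla_add(9, 7))
```

In hardware the expanded terms are evaluated in parallel, which is where the speed advantage comes from; the loop in `lookahead_carry` only enumerates those terms.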
TABLE VII AND, OR, AND XOR CELLS USING GDI FOR REGULAR P-WELL PROCESS
Fig. 11. Generic CLA: (a) basic scheme and (b) carry generator (3-bit only).

Fig. 12. Structure of a 4-bit ripple comparator.

TABLE VIII TWO-SIGNAL REPRESENTATION OF COMPARISON RESULT

The sum of the $i$th stage is generated by

$$S_i = C_{i-1} \oplus A_i \oplus B_i \quad \text{or} \quad S_i = C_{i-1} \oplus P_i \;\; \text{if } P_i = A_i \oplus B_i \tag{43}$$

For four stages of lookahead, the appropriate carry terms, written with the generate signals $G_i$ and propagate signals $P_i$, are

$$C_1 = G_1 + P_1 C_0 \tag{44}$$

$$C_2 = G_2 + P_2 G_1 + P_2 P_1 C_0 \tag{45}$$

$$C_3 = G_3 + P_3 G_2 + P_3 P_2 G_1 + P_3 P_2 P_1 C_0 \tag{46}$$

$$C_4 = G_4 + P_4 G_3 + P_4 P_3 G_2 + P_4 P_3 P_2 G_1 + P_4 P_3 P_2 P_1 C_0 \tag{47}$$

Fig. 11 shows a generic carry-lookahead adder. The PG generation and SUM generation circuits surround a carry-generate block. Because of fan-in and size limitations of the gates, the circuit presented is a 4-bit adder, which can be replicated in order to create an 8-bit adder.

c) 4-bit ripple comparator: A 4-bit ripple comparator consists of a cascade of four identical basic units, with the comparison data transmitted through the units (Fig. 12). Comparison of the most significant bit is done first, proceeding down to the least significant bit. The outcome of the comparison in every unit is represented by two signals, C and D, according to Table VIII. Every basic unit includes two inputs of comparison data from previous units. The logic implementation of each unit is based on expressions (48) and (49).

d) 4-bit Multiplier: The multiplier circuit is based on the generation of partial products and their addition, creating a final product. The following equations represent both the multiplied numbers and the product:

$$X = \sum_{i=0}^{m-1} x_i 2^{i}, \qquad Y = \sum_{j=0}^{n-1} y_j 2^{j} \tag{50}$$

$$P = X \cdot Y = \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} x_i y_j 2^{i+j} \tag{51}$$

The multiplier contains an array of interconnected basic cells [5], as shown in Fig. 13.
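The idea of forming the product from partial products and adding them in an array of cells can be sketched in Python as follows. This is a behavioral illustration only, not the circuit of Fig. 13: the names `full_adder` and `array_multiply` are hypothetical, each cell is modeled as an AND gate feeding a full adder, and the carries are assumed to ripple along each row, which may differ from the carry arrangement used in the paper's array.

```python
def full_adder(a, b, cin):
    """One-bit full adder returning (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def array_multiply(x, y, width=4):
    """Multiply two width-bit numbers with an array of AND + adder cells."""
    acc = [0] * (2 * width)           # cumulative sum bits, LSB first
    for j in range(width):            # row j uses multiplier bit y_j
        carry = 0
        for i in range(width):        # cell (i, j) in the array
            partial = ((x >> i) & 1) & ((y >> j) & 1)   # AND gate
            # Add the partial-product bit to the sum from the row above
            acc[i + j], carry = full_adder(acc[i + j], partial, carry)
        acc[j + width] = carry        # carry out of the row
    return sum(bit << k for k, bit in enumerate(acc))


if __name__ == "__main__":
    print(array_multiply(13, 11))
```

Each inner-loop iteration corresponds to one cell: it generates one partial-product bit and adds it to the cumulative sum received from the previous row.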
Fig. 15.
Fig. 13.
Each multiplier cell (Fig. 14) represents one bit of a partial product and is responsible for: 1) generating a bit of the correct partial product in response to the input signals; 2) adding this bit to the cumulative sum propagated from the row above. The cell consists of two components: an AND gate to generate the partial product bit and an adder to add this bit to the previous sum.

e) Sequential logic circuits: Although this paper covers mostly combinatorial digital circuits, some implementations of sequential logic circuits were also performed. Fig. 15 presents the basic scheme of an n-bit counter based on toggle flip-flop (TFF) cells. The circuit was implemented in a 0.35-μm twin-well CMOS process, and its research is currently in progress. Layouts of basic TFF cells can be seen in Fig. 16, together with the number of transistors and the area of each cell. Fig. 16(a) and (b) presents layouts of GDI TFFs based on the F1 and F2 functions, respectively. Fig. 16(c) shows the layout of the CMOS TFF.
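As a behavioral illustration of a counter built from toggle flip-flops, the following Python sketch models an n-bit synchronous up-counter. The structure is an assumption, since the scheme of Fig. 15 is not reproduced here: bit 0 toggles on every clock, and each higher bit toggles only when all lower bits are 1. The class `ToggleFlipFlop` and the function `run_counter` are hypothetical names.

```python
class ToggleFlipFlop:
    """Behavioral TFF: the output flips on a clock edge when T is 1."""

    def __init__(self):
        self.q = 0

    def clock(self, t):
        if t:
            self.q ^= 1
        return self.q


def run_counter(n_bits, n_clocks):
    """Return the counter value after each of n_clocks clock edges."""
    flops = [ToggleFlipFlop() for _ in range(n_bits)]
    values = []
    for _ in range(n_clocks):
        # Toggle enables are derived from the state before the clock edge
        enables = []
        enable = 1
        for ff in flops:
            enables.append(enable)
            enable &= ff.q
        for ff, t in zip(flops, enables):
            ff.clock(t)
        values.append(sum(ff.q << i for i, ff in enumerate(flops)))
    return values


if __name__ == "__main__":
    print(run_counter(3, 9))
```

Computing all enables before clocking any flip-flop mirrors synchronous operation, where every cell sees the state from the previous cycle.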
Fig. 16.
2) Simulated Results and Comparisons: This section presents the results of performance comparisons of some of the digital circuits mentioned above. All given measurements were carried out on a representative pattern of possible input transitions, with the worst case assumption used to find a maximal delay of the circuit. Power dissipation was calculated as an average over the pattern.

a) 8-bit CLA adder: GDI versus CMOS and TG: An 8-bit adder was realized in a 1.6-μm CMOS process. Two chips were designed, and their layouts can be seen in Fig. 17. Each chip
Fig. 17. Layouts of two 8-bit adder chips: (a) GDI and CMOS and (b) GDI2 and TG.

Fig. 19. Layout of 8-bit comparator chip.

TABLE IX SIMULATED RESULTS OF GDI, CMOS, AND TG 8-BIT ADDER
Fig. 20.
Fig. 18.
contains a GDI circuit and a compared circuit implemented in either CMOS or TG. Performance comparisons were done by simulating in Cadence Spectre at a fixed supply voltage and frequency, and at 27 °C. Several parameters were measured: average power, maximal delay, power-delay product, number of transistors, and circuit area. The results are assembled in Table IX and Fig. 18. As can be seen, the GDI adder proves to be the most power-efficient circuit. Power dissipation in GDI is less than in CMOS and in TG, yet the delay of TG is less than that of GDI (as expected). The CMOS circuit had the highest delay, 44.9% more than GDI. In spite of the inferior speed of GDI relative to TG, the power-delay product of GDI is lower than those of both TG and CMOS. Because of the use of a limited GDI cell library in the p-well CMOS process, the number of transistors and the area of the CMOS and GDI circuits are close, but much smaller than in the TG adder implementation.

b) 8-bit comparator: GDI versus CMOS and N-PG: The implementation of an 8-bit comparator was carried out in the same 1.6-μm CMOS process, again at a fixed supply voltage and frequency, and at 27 °C. The layout of a chip that contains the three compared circuits can be seen in Fig. 19. GDI proves to have the best performance among the tested design methods, as can be seen in Fig. 20 and Table X. The power, delay, and power-delay product of GDI are the best among the compared circuits, while N-PG has the worst performance results. Here, as in the adder circuit, the limited GDI library was used because of process constraints. As a result, the final area of the GDI comparator is greater than that of CMOS and N-PG, while the number of transistors in all three circuits is the same.

c) 4-bit multiplier: GDI versus CMOS: The multiplier was implemented in 0.5-μm CMOS technology with a 3.3-V supply, at 50 MHz and 27 °C. To achieve a robust measure of the power-delay product, we ran our simulations on CMOS and GDI circuits that were parametric in size; e.g., running with an area parameter of two means that the transistor widths are twice those used when running with an area parameter of one. Spectre simulations were done on schematic circuits, while changing the area parameter from one to eight. Fig. 21 describes the change in power, delay, and power-delay product as a function of the area parameter. As can be seen, GDI
Fig. 22.
Fig. 21. (a) Power, (b) delay, and (c) power-delay product as a function of the area parameter.

TABLE XI SIMULATED RESULTS OF GDI AND CMOS 4-BIT MULTIPLIER
shows better results in all parameters for all area coefficients. Twenty-six transistors were used in the GDI multiplier, compared with 44 transistors in CMOS. An additional comparison was done for circuits with the same delay value (1.03 ns). The results for area, power dissipation, and power-delay product can be seen in Table XI.

VI. MEASUREMENTS OF A TEST CHIP

An 8-bit adder designed in GDI and CMOS [Fig. 17(a)] was fabricated in 1.6-μm CMOS technology (MOSIS). The voltage supplies of the two circuits were separated in order to enable separate power measurements. After the postprocessing, three types of chips were available: GDI adder, CMOS adder, and chips that contain both circuits, connected. This allowed carrying out measurements of the dynamic power of the circuits while eliminating the static power dissipation and the power dissipation of output
pads, which contain buffers and additional circuitry. The test chip is shown in Fig. 22. Several sets of measurements and tests were applied to the test chips, using the EXCELL 100+ testing system of IMS. To demonstrate the influence of scaling on a given GDI circuit, the measurements were performed with various supply voltages.

1) Operational Tests: Both circuits were checked for proper operation using two scripts, which generated patterns of input values. The first set of values was generated according to the binary order of the input numbers. The second set included more than 20 000 random transitions, which were used in the delay and power measurements.

2) Delay Measurements: The maximal delay of both circuits was measured by increasing the frequency of the input signal and checking the results of addition. The frequency at which the first error appears defines the delay of the circuit. Table XII presents the delays of the GDI and CMOS adders for various supply voltage levels. It can be seen that for the given implementation and the output load defined by the testing system, both circuits have equal delays.

3) Dynamic Power Measurements: To eliminate the influence of the circuitry in the output pads, which causes high additional power dissipation, a set of measurements at low frequencies was performed for various supply voltages. Those results represent the static power dissipation of the test chip. Then, power measurements at high frequencies were performed, and the static power values were subtracted from those results to obtain the dynamic power at the given frequency. Final results of dynamic power dissipation are shown in Table XIII. Dynamic power measurements were performed at various frequencies, depending on the supply voltage level. The measurements were performed for a 5-V supply at 12.5 MHz, for 4.5 V at 10 MHz, and for the rest of the values at 4 MHz.

TABLE XIII MEASURED POWER DISSIPATION OF GDI AND CMOS 8-BIT ADDERS

4) Power-Delay Product: Because the delay values of both circuits are equal, the normalized power-delay product has about the same values as the power measurements. For power and power-delay product, improvements in the range of 11–45% were measured. It must be noted that there is a difference between the simulations and the measured data. This is because in all the presented circuits the simulations were performed with the DUT placed in an environment of logic circuits designed in the same technique, while in the test chip measurements the single DUT was connected directly to the output pads, causing a significantly higher load capacitance. Still, in both measured and simulated results, the relative advantage of GDI is preserved.

VII. CONCLUSION AND FUTURE RESEARCH

A novel GDI technique for low-power design was presented. An 8-bit CLA adder was fabricated using GDI and CMOS and used as a test vehicle. Numerous logic gates and high-level digital circuits were implemented in various methods and process technologies, and their simulation results were discussed. Comparisons with existing TG and N-PG techniques were carried out. They show a reduction of up to 45% in the power-delay product of GDI over CMOS in the test chip, and significant improvements in performance, as well as a decreased number of transistors and area, in most simulated GDI circuits over CMOS and PTL. An operational analysis and a design methodology were also presented. The GDI technique allows the use of a simple and efficient design algorithm, based on the Shannon expansion. This makes GDI suitable for synthesis and realization of combinatorial logic in real LSI chips, while using a single-cell library. This proves to be an additional advantage of GDI over CMOS and PTL. Most of the circuits were implemented in regular p-well CMOS processes, which imposes a limitation on the GDI cell library. Still, even in limited-library-based GDI circuits, significant improvements in performance are observed. Implementations of GDI circuits in SOI or twin-well CMOS processes are expected to supply more power-delay efficient designs, due to the use of a complete cell library with reduced transistor count.

The advantages of the GDI technique, namely, the Shannon-based design algorithm, two-transistor implementation of complex logic functions, and in-cell swing restoration under certain operating conditions, are unique among existing low-power design techniques. This, together with positive measurement and simulation results, provides evidence that GDI design might enrich the toolbox of VLSI circuit designers. We hope that the presented results will encourage further research activities on the GDI technique. Implementations of different kinds of digital and mixed circuits have to be carried out in order to determine the fields of circuitry where GDI is superior to other styles. The issue of sequential logic design is currently being explored, as well as technology compatibility with the twin-well CMOS process. More work is required on the automation of a logic design methodology based on GDI cells.

ACKNOWLEDGMENT

The authors thank Prof. E. G. Friedman for his constructive comments and suggestions. They also thank G. Samuel and the staff of the Technion Research Center of Microelectronic Systems for their support during the research. The authors also thank M. Feldman, A. Panush, and other students for participating in projects in different stages of the research. Finally, they thank the anonymous referees for their thorough review and useful comments.

REFERENCES

[1] N. Weste and K. Eshraghian, Principles of CMOS digital design. Reading, MA: Addison-Wesley, pp. 304–307.
[2] A. P. Chandrakasan, S. Sheng, and R. W. Brodersen, "Low-power CMOS digital design," IEEE J. Solid-State Circuits, vol. 27, pp. 473–484, Apr. 1992.
[3] A. P. Chandrakasan and R. W. Brodersen, "Minimizing power consumption in digital CMOS circuits," Proc. IEEE, vol. 83, pp. 498–523, Apr. 1995.
[4] W. Al-Assadi, A. P. Jayasumana, and Y. K. Malaiya, "Pass-transistor logic design," Int. J. Electron., vol. 70, pp. 739–749, 1991.
[5] I. S. Abu-Khater, A. Bellaouar, and M. I. Elmasry, "Circuit techniques for CMOS low-power high-performance multipliers," IEEE J. Solid-State Circuits, vol. 31, pp. 1535–1546, Oct. 1996.
[6] K. Yano, Y. Sasaki, K. Rikino, and K. Seki, "Top-down pass-transistor logic design," IEEE J. Solid-State Circuits, vol. 31, pp. 792–803, June 1996.
[7] T. Sakurai, "Closed-form expressions for interconnection delay, coupling, and crosstalk in VLSIs," IEEE Trans. Electron Devices, vol. 40, pp. 118–124, Jan. 1993.
[8] V. Adler and E. G. Friedman, "Delay and power expressions for a CMOS inverter driving a resistive-capacitive load," Analog Integrat. Circuits Signal Process., vol. 14, pp. 29–39, 1997.
[9] J. R. Burns, "Switching response of complementary symmetry MOS transistor logic circuits," RCA Rev., vol. 25, pp. 627–661, Dec. 1964.
[10] T. Sakurai and A. R. Newton, "Alpha-power law MOSFET model and its applications to CMOS inverter delay and other formulas," IEEE J. Solid-State Circuits, vol. 25, pp. 584–593, Apr. 1990.
[11] W. C. Elmore, "The transient response of damped linear networks with particular regard to wideband amplifiers," J. Appl. Phys., vol. 19, pp. 55–63, Jan. 1948.
[12] J. Rubinstein, P. Penfield, and M. A. Horowitz, "Signal delay in RC tree networks," IEEE Trans. Computer-Aided Design, vol. CAD-2, pp. 202–211, July 1983.
[13] M. Alidina, J. Monteiro, S. Devadas, A. Ghosh, and M. Papaefthymiou, "Precomputation-based sequential logic optimization for low power," IEEE Trans. VLSI Syst., vol. 2, pp. 426–435, Dec. 1994.
[14] R. Zimmermann and W. Fichtner, "Low-power logic styles: CMOS versus pass-transistor logic," IEEE J. Solid-State Circuits, vol. 32, pp. 1079–1090, June 1997.
[15] J. P. Uyemura, Circuit Design for CMOS VLSI. Norwell, MA: Kluwer Academic, 1992, pp. 88–129.
[16] R. J. Baker, H. W. Li, and D. E. Boyce, CMOS circuit design, layout, and simulation, IEEE Press Series on Microelectronic Systems, pp. 205–242.
[17] SIA Roadmap [Online]. Available: http://www.sematech.org/public/roadmap/doc
[18] E. Shannon and W. Weaver, The Mathematical Theory of Information. Urbana-Champaign: University of Illinois Press, 1969.
[19] A. Morgenshtein, A. Fish, and I. A. Wagner, "Gate-diffusion input (GDI) – A technique for low power design of digital circuits: Analysis and characterization," submitted for publication.
[20] A. Morgenshtein, A. Fish, and I. A. Wagner, "Gate-diffusion input (GDI) – A novel power efficient method for digital circuits: A design methodology," presented at the 14th Int. ASIC/SOC Conf., Washington, DC, Sept. 2001.
[21] A. Chandrakasan, W. J. Bowhill, and F. Fox, Design of High Performance Microprocessor Circuits, 2000, ch. 5, pp. 80–97.
[22] I. Sutherland, B. Sproull, and D. Harris, Logical Effort: Designing Fast CMOS Circuits. San Mateo, CA: Morgan Kaufmann, p. 7.
[23] V. Stojanovic and V. G. Oklobdzija, "Comparative analysis of master-slave latches and flip-flops for high-performance and low-power systems," IEEE J. Solid-State Circuits, vol. 34, pp. 536–547, Apr. 1999.
[24] K. Bernstein, L. M. Carrig, C. M. Durham, and P. A. Hansen, High Speed CMOS Design Styles. Norwell, MA: Kluwer Academic, 1998.
Alexander Fish was born in Kharkov, Ukraine, in 1976. He received the B.Sc. degree in electrical engineering from the Technion-Israel Institute of Technology, Haifa, and the M.Sc. degree in electrical engineering from the Ben-Gurion University, Israel, in 1999 and 2002, respectively. He has been a Teaching and Research Assistant in the Electrical Engineering Department, Ben-Gurion University, since 1999. His research interests include CMOS active pixel sensor design, image processing, pattern recognition, and low-power design techniques for digital circuits.
Arkadiy Morgenshtein was born in Cishinev, Moldova, in 1977. He received the B.Sc. degree in electrical engineering from the Technion-Israel Institute of Technology, Haifa, in 1999, where he is currently pursuing the M.Sc. degree in biomedical engineering. He has been a Teaching and Research Assistant in the Electrical Engineering Department, Technion, since 1999. His research interests include low-power design techniques for digital circuits, biosensor microsystems for brain monitoring, and digital camera design in CMOS technology.
Israel A. Wagner received the B.Sc. degree (cum laude) in computer engineering from the Technion-Israel Institute of Technology, Haifa, in 1987, the M.Sc. degree (cum laude) in computer science from the Hebrew University, Jerusalem, Israel, in 1990, and the Ph.D. degree in computer science from The Technion in 1999. He was a Research Engineer with General Microwave, Jerusalem, from 1987 until 1990, when he joined the IBM Haifa Laboratories as a Staff Member. He is currently an Adjunct Lecturer in the Computer Science Department at The Technion. His research interests include manual and automatic VLSI design, multiagent robotics, computational geometry, and graph theory. Dr. Wagner is a member of MAA and AMS.