Module-1 Introduction to ASICs
The physical size of a silicon die varies from a few millimeters on a side to over
1 inch on a side, but instead of physical dimensions we often measure the size of an
IC by the number of logic gates or the number of transistors that the IC contains. As
a unit of measure a gate equivalent corresponds to a two-input NAND gate (a circuit
that performs the logic function F = (A · B)'). Often we just use the term gates
instead of gate equivalents when we are measuring chip size; this is not to be
confused with the gate terminal of a transistor. For example, a 100 k-gate IC
contains the equivalent of 100,000 two-input NAND gates.
The semiconductor industry has evolved from the first ICs of the early 1970s and
matured rapidly since then. Early small-scale integration (SSI) ICs contained a
few (1 to 10) logic gates (NAND gates, NOR gates, and so on) amounting to a few
tens of transistors. The era of medium-scale integration (MSI) increased the
range of integrated logic available to counters and similar, larger scale, logic
functions. The era of large-scale integration (LSI) packed even larger logic
functions, such as the first microprocessors, into a single chip. The era of very
large-scale integration (VLSI) now offers 64-bit microprocessors, complete with
cache memory and floating-point arithmetic units, with well over a million transistors
on a single piece of silicon. As CMOS process technology improves, transistors
continue to get smaller and ICs hold more and more transistors. Some people
(especially in Japan) use the term ultralarge scale integration (ULSI), but most
people stop at the term VLSI; otherwise we have to start inventing new words.
The earliest ICs used bipolar technology and the majority of logic ICs used either
transistor-transistor logic (TTL) or emitter-coupled logic (ECL). Although
invented before the bipolar transistor, the metal-oxide-silicon ( MOS ) transistor
was initially difficult to manufacture because of problems with the oxide
interface. As these problems were gradually solved, metal-gate n -channel MOS (
nMOS or NMOS ) technology developed in the 1970s. At that time MOS
technology required fewer masking steps, was denser, and consumed less power
than equivalent bipolar ICs. This meant that, for a given performance, an MOS
IC was cheaper than a bipolar IC and led to investment and growth of the MOS
IC market.
By the early 1980s the aluminum gates of the transistors were replaced by
polysilicon gates, but the name MOS remained. The introduction of polysilicon
as a gate material was a major improvement in CMOS technology, making it
easier to make two types of transistors, n-channel MOS and p-channel MOS
transistors, on the same IC: a complementary MOS (CMOS, never cMOS)
technology. The principal advantage of CMOS over NMOS is lower power
consumption. Another advantage of a polysilicon gate was a simplification of the
fabrication process, allowing devices to be scaled down in size.
There are four CMOS transistors in a two-input NAND gate (and in a two-input
NOR gate too), so to convert between gates and transistors, you multiply the
number of gates by 4 to obtain the number of transistors. We can also measure an
IC by the smallest feature size (roughly half the length of the smallest transistor)
imprinted on the IC. Transistor dimensions are measured in microns (a micron,
1 µm, is a millionth of a meter). Thus we talk about a 0.5 µm IC or say an IC is
built in (or with) a 0.5 µm process, meaning that the smallest transistors are 0.5
µm in length. We give a special label, λ or lambda, to this smallest feature size.
Since lambda is equal to half of the smallest transistor length, λ ≈ 0.25 µm in a
0.5 µm process. Many of the drawings in this book use a scale marked with
lambda for the same reason we place a scale on a map.
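As a rough sketch of these two conversions, here is a short Python fragment (the function names are our own, chosen for illustration; they are not part of any tool or library):

    def transistors_from_gates(gate_equivalents):
        # A gate equivalent is a two-input NAND gate, which uses four
        # CMOS transistors (two n-channel and two p-channel).
        return 4 * gate_equivalents

    def lambda_from_process(feature_size_um):
        # Lambda is roughly half the smallest transistor length, so a
        # 0.5 micron process has lambda of about 0.25 micron.
        return feature_size_um / 2

    print(transistors_from_gates(100_000))  # 400000 transistors in a 100 k-gate IC
    print(lambda_from_process(0.5))         # 0.25 (microns)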
A modern submicron CMOS process is now just as complicated as a submicron
bipolar or BiCMOS (a combination of bipolar and CMOS) process. However,
CMOS ICs have established a dominant position, are manufactured in much
greater volume than any other technology, and therefore, because of the economy
of scale, the cost of CMOS ICs is less than a bipolar or BiCMOS IC for the same
function. Bipolar and BiCMOS ICs are still used for special needs. For example,
bipolar technology is generally capable of handling higher voltages than CMOS.
This makes bipolar and BiCMOS ICs useful in power electronics, cars, telephone
circuits, and so on.
Some digital logic ICs and their analog counterparts (analog/digital converters,
for example) are standard parts , or standard ICs. You can select standard ICs
from catalogs and data books and buy them from distributors. Systems
manufacturers and designers can use the same standard part in a variety of
different microelectronic systems (systems that use microelectronics or ICs).
With the advent of VLSI in the 1980s engineers began to realize the advantages
of designing an IC that was customized or tailored to a particular system or
application rather than using standard ICs alone. Microelectronic system design
then becomes a matter of defining the functions that you can implement using
standard ICs and then implementing the remaining logic functions (sometimes
called glue logic ) with one or more custom ICs . As VLSI became possible you
could build a system from a smaller number of components by combining many
standard ICs into a few custom ICs. Building a microelectronic system with
fewer ICs allows you to reduce cost and improve reliability.
Of course, there are many situations in which it is not appropriate to use a custom
IC for each and every part of a microelectronic system. If you need a large
amount of memory, for example, it is still best to use standard memory ICs,
either dynamic random-access memory ( DRAM or dRAM), or static RAM (
SRAM or sRAM), in conjunction with custom ICs.
One of the first conferences to be devoted to this rapidly emerging segment of the
IC industry was the IEEE Custom Integrated Circuits Conference (CICC), and
the proceedings of this annual conference form a useful reference to the
development of custom ICs. As different types of custom ICs began to evolve for
different types of applications, these new ICs gave rise to a new term:
application-specific IC, or ASIC. Now we have the IEEE International ASIC
Conference , which tracks advances in ASICs separately from other types of
custom ICs. Although the exact definition of an ASIC is difficult, we shall look at
some examples to help clarify what people in the IC industry understand by the
term.
Examples of ICs that are not ASICs include standard parts such as: memory chips
sold as commodity items (ROMs, DRAM, and SRAM); microprocessors; and TTL or
TTL-equivalent ICs at SSI, MSI, and LSI levels.
Examples of ICs that are ASICs include: a chip for a toy bear that talks; a chip
for a satellite; a chip designed to handle the interface between memory and a
microprocessor for a workstation CPU; and a chip containing a microprocessor as
a cell together with other logic.
As a general rule, if you can find it in a data book, then it is probably not an
ASIC, but there are some exceptions. For example, two ICs that might or might
not be considered ASICs are a controller chip for a PC and a chip for a modem.
Both of these examples are specific to an application (shades of an ASIC) but are
sold to many different system vendors (shades of a standard part). ASICs such as
these are sometimes called application-specific standard products ( ASSPs ).
Trying to decide which members of the huge IC family are application-specific is
tricky; after all, every IC has an application. For example, people do not usually
consider an application-specific microprocessor to be an ASIC. I shall describe
how to design an ASIC that may include large cells such as microprocessors, but
I shall not describe the design of the microprocessors themselves. Defining an
ASIC by looking at the application can be confusing, so we shall look at a
different way to categorize the IC family. The easiest way to recognize people is
by their faces and physical characteristics: tall, short, thin. The easiest
characteristics of ASICs to understand are physical ones too, and we shall look at
these next. It is important to understand these differences because they affect
such factors as the price of an ASIC and the way you design an ASIC.
1.1 Types of ASICs
ICs are made on a thin (a few hundred microns thick), circular silicon wafer ,
with each wafer holding hundreds of die (sometimes people use dies or dice for
the plural of die). The transistors and wiring are made from many layers (usually
between 10 and 15 distinct layers) built on top of one another. Each successive
mask layer has a pattern that is defined using a mask similar to a glass
photographic slide. The first half-dozen or so layers define the transistors. The
last half-dozen or so layers define the metal wires between the transistors (the
interconnect ).
A full-custom IC includes some (possibly all) logic cells that are customized and
all mask layers that are customized. A microprocessor is an example of a
full-custom IC: designers spend many hours squeezing the most out of every last
square micron of microprocessor chip space by hand. Customizing all of the IC
features in this way allows designers to include analog circuits, optimized
memory cells, or mechanical structures on an IC, for example. Full-custom ICs
are the most expensive to manufacture and to design. The manufacturing lead
time (the time it takes just to make an IC, not including design time) is typically
eight weeks for a full-custom IC. These specialized full-custom ICs are often
intended for a specific application, so we might call some of them full-custom
ASICs.
We shall discuss full-custom ASICs briefly next, but the members of the IC
family that we are more interested in are semicustom ASICs , for which all of the
logic cells are predesigned and some (possibly all) of the mask layers are
customized. Using predesigned cells from a cell library makes our lives as
designers much, much easier. There are two types of semicustom ASICs that we
shall cover: standard-cell-based ASICs and gate-array-based ASICs. Following
this we shall describe the programmable ASICs , for which all of the logic cells
are predesigned and none of the mask layers are customized. There are two types
of programmable ASICs: the programmable logic device and, the newest member
of the ASIC family, the field-programmable gate array.
For many analog designs the close matching of transistors is crucial to circuit
operation. For these circuit designs pairs of transistors are used, located adjacent
to each other. Device physics dictates that a pair of bipolar transistors will always
match more precisely than CMOS transistors of a comparable size. Bipolar
technology has historically been more widely used for full-custom analog design
because of its improved precision. Despite its poorer analog properties, the use of
CMOS technology for analog functions is increasing. There are two reasons for
this. The first reason is that CMOS is now by far the most widely available IC
technology. Many more CMOS ASICs and CMOS standard products are now
being manufactured than bipolar ICs. The second reason is that increased levels
of integration require mixing analog and digital functions on the same IC: this
has forced designers to find ways to use CMOS technology to implement analog
functions. Circuit designers, using clever new techniques, have been very
successful in finding new ways to design analog CMOS circuits that can
approach the accuracy of bipolar analog designs.
Each standard cell in the library is constructed using full-custom design methods,
but you can use these predesigned and precharacterized circuits without having to
do any full-custom design yourself. This design style gives you the same
performance and flexibility advantages of a full-custom ASIC but reduces design
time and reduces risk.
Standard cells are designed to fit together like bricks in a wall. Figure 1.3 shows
an example of a simple standard cell (it is simple in the sense that it is not optimized
for density, but it is ideal for showing internal construction). Power and ground
buses (VDD and GND or VSS) run horizontally on metal lines inside the cells.
FIGURE 1.3 Looking down on the layout of a standard cell. This cell would be
approximately 25 microns wide on an ASIC with λ (lambda) = 0.25 microns (a
micron is 10⁻⁶ m). Standard cells are stacked like bricks in a wall; the abutment
box (AB) defines the edges of the brick. The difference between the bounding
box (BB) and the AB is the area of overlap between the bricks. Power supplies
(labeled VDD and GND) run horizontally inside a standard cell on a metal layer
that lies above the transistor layers. Each different shaded and labeled pattern
represents a different layer. This standard cell has center connectors (the three
squares, labeled A1, B1, and Z) that allow the cell to connect to others. The
layout was drawn using ROSE, a symbolic layout editor developed by Rockwell
and Compass, and then imported into Tanner Research's L-Edit.
FIGURE 1.4 Routing the CBIC (cell-based IC) shown in Figure 1.2. The use of
regularly shaped standard cells, such as the one in Figure 1.3, from a library
allows ASICs like this to be designed automatically. This ASIC uses two
separate layers of metal interconnect (metal1 and metal2) running at right angles
to each other (like traces on a printed-circuit board). Interconnections between
logic cells use the spaces (called channels) between the rows of cells. ASICs may
have three (or more) layers of metal allowing the cell rows to touch with the
interconnect running over the top of the cells.
All the mask layers of a CBIC are customized. This allows megacells (SRAM, a
SCSI controller, or an MPEG decoder, for example) to be placed on the same IC
with standard cells. Megacells are usually supplied by an ASIC or library
company complete with behavioral models and some way to test them (a test
strategy). ASIC library companies also supply compilers to generate flexible
DRAM, SRAM, and ROM blocks. Since all mask layers on a standard-cell
design are customized, memory design is more efficient and denser than for gate
arrays.
For logic that operates on multiple signals across a data bus (a datapath, or DP) the
use of standard cells may not be the most efficient ASIC design style. Some
ASIC library companies provide a datapath compiler that automatically generates
datapath logic . A datapath library typically contains cells such as adders,
subtracters, multipliers, and simple arithmetic and logical units ( ALUs ). The
connectors of datapath library cells are pitch-matched to each other so that they
fit together. Connecting datapath cells to form a datapath usually, but not always,
results in faster and denser layout than using standard cells or a gate array.
Standard-cell and gate-array libraries may contain hundreds of different logic
cells, including combinational functions (NAND, NOR, AND, OR gates) with
multiple inputs, as well as latches and flip-flops with different combinations of
reset, preset and clocking options. The ASIC library company provides designers
with a data book in paper or electronic form with all of the functional
descriptions and timing information for each library element.
The key difference between a channelless gate array and channeled gate array is
that there are no predefined areas set aside for routing between cells on a
channelless gate array. Instead we route over the top of the gate-array devices.
We can do this because we customize the contact layer that defines the
connections between metal1, the first layer of metal, and the transistors. When
we use an area of transistors for routing in a channelless array, we do not make
any contacts to the devices lying underneath; we simply leave the transistors
unused.
The logic density (the amount of logic that can be implemented in a given silicon
area) is higher for channelless gate arrays than for channeled gate arrays. This is
usually attributed to the difference in structure between the two types of array. In
fact, the difference occurs because the contact mask is customized in a
channelless gate array, but is not usually customized in a channeled gate array.
This leads to denser cells in the channelless architectures. Customizing the
contact layer in a channelless gate array allows us to increase the density of
gate-array cells because we can route over the top of unused contact sites.
An embedded gate array gives the improved area efficiency and increased
performance of a CBIC but with the lower cost and faster turnaround of an MGA.
One disadvantage of an embedded gate array is that the embedded function is
fixed. For example, if an embedded gate array contains an area set aside for a 32
k-bit memory, but we only need a 16 k-bit memory, then we may have to waste
half of the embedded memory function. However, this may still be more efficient
and cheaper than implementing a 32 k-bit memory using macros on a SOG array.
ASIC vendors may offer several embedded gate array structures containing
different memory types and sizes as well as a variety of embedded functions.
ASIC companies wishing to offer a wide range of embedded functions must
ensure that enough customers use each different embedded gate array to give the
cost advantages over a custom gate array or CBIC (the Sun Microsystems
SPARCstation 1 described in Section 1.3 made use of LSI Logic embedded gate
arrays; the 10K and 100K series of embedded gate arrays were two of LSI
Logic's most successful products).
● The core is a regular array of programmable basic logic cells that can
implement combinational as well as sequential logic (flip-flops).
● A matrix of programmable interconnect surrounds the basic logic cells.
● A method for programming the basic logic cells and the interconnect.
● A behavioral model
● A Verilog/VHDL model
● A test strategy
● A circuit schematic
● A cell icon
● A wire-load model
● A routing model
For MGA and CBIC cell libraries we need to complete cell design and cell layout
and shall discuss this in Chapter 2. The ASIC designer may not actually see the
layout if it is hidden inside a phantom, but the layout will be needed eventually.
In a programmable ASIC the cell layout is part of the programmable ASIC
design (see Chapter 4).
The ASIC designer needs a high-level, behavioral model for each cell because
simulation at the detailed timing level takes too long for a complete ASIC design.
For a NAND gate a behavioral model is simple. A multiport RAM model can be
very complex. We shall discuss behavioral models when we describe Verilog and
VHDL in Chapter 10 and Chapter 11. The designer may require Verilog and
VHDL models in addition to the models for a particular logic simulator.
ASIC designers also need a detailed timing model for each cell to determine the
performance of the critical pieces of an ASIC. It is too difficult, too
time-consuming, and too expensive to build every cell in silicon and measure the
cell delays. Instead library engineers simulate the delay of each cell, a process
known as characterization . Characterizing a standard-cell or gate-array library
involves circuit extraction from the full-custom cell layout for each cell. The
extracted schematic includes all the parasitic resistance and capacitance elements.
Then library engineers perform a simulation of each cell including the parasitic
elements to determine the switching delays. The simulation models for the
transistors are derived from measurements on special chips included on a wafer
called process control monitors ( PCMs ) or drop-ins . Library engineers then use
the results of the circuit simulation to generate detailed timing models for logic
simulation. We shall cover timing models in Chapter 13.
All ASICs need to be production tested (programmable ASICs may be tested by
the manufacturer before they are customized, but they still need to be tested).
Simple cells in small or medium-size blocks can be tested using automated
techniques, but large blocks such as RAM or multipliers need a planned strategy.
We shall discuss test in Chapter 14.
The cell schematic (a netlist description) describes each cell so that the cell
designer can perform simulation for complex cells. You may not need the
detailed cell schematic for all cells, but you need enough information to compare
what you think is on the silicon (the schematic) with what is actually on the
silicon (the layout); this is a layout versus schematic (LVS) check.
If the ASIC designer uses schematic entry, each cell needs a cell icon together
with connector and naming information that can be used by design tools from
different vendors. We shall cover ASIC design using schematic entry in
Chapter 9. One of the advantages of using logic synthesis (Chapter 12) rather
than schematic design entry is eliminating the problems with icons, connectors,
and cell names. Logic synthesis also makes moving an ASIC between different
cell libraries, or retargeting , much easier.
In order to estimate the parasitic capacitance of wires before we actually
complete any routing, we need a statistical estimate of the capacitance for a net in
a given size circuit block. This usually takes the form of a look-up table known as
a wire-load model . We also need a routing model for each cell. Large cells are
too complex for the physical design or layout tools to handle directly and we
need a simpler representation, a phantom of the physical layout that still contains
all the necessary information. The phantom may include information that tells the
automated routing tool where it can and cannot place wires over the cell, as well
as the location and types of the connections to the cell.
1.6 Summary
In this chapter we have looked at the difference between full-custom ASICs,
semi-custom ASICs, and programmable ASICs. Table 1.3 summarizes their
different features. ASICs use a library of predesigned and precharacterized logic
cells. In fact, we could define an ASIC as a design style that uses a cell library
rather than in terms of what an ASIC is or what an ASIC does.
TABLE 1.3 Types of ASIC.

ASIC type      Family member                          Custom mask layers   Custom logic cells
Full-custom    Analog/digital                         All                  Some
Semicustom     Cell-based (CBIC)                      All                  None
Semicustom     Masked gate array (MGA)                Some                 None
Programmable   Field-programmable gate array (FPGA)   None                 None
Programmable   Programmable logic device (PLD)        None                 None
You can think of ICs like pizza. A full-custom pizza is built from scratch. You
can customize all the layers of a CBIC pizza, but from a predefined selection, and
it takes a while to cook. An MGA pizza uses precooked crusts with fixed sizes
and you choose only from a few different standard types on a menu. This makes
MGA pizza a little faster to cook and a little cheaper. An FPGA is rather like a
frozen pizza: you buy it at the supermarket in a limited selection of sizes and
types, but you can put it in the microwave at home and it will be ready in a few
minutes.
In each chapter we shall indicate the key concepts. In this chapter they are
● The difference between full-custom and semicustom ASICs
Next, in Chapter 2, we shall take a closer look at the semicustom ASICs that
were introduced in this chapter.
CMOS LOGIC
A CMOS transistor (or device) has four terminals: gate, source, drain, and a
fourth terminal that we shall ignore until the next section. A CMOS transistor is a
switch. The switch must be conducting or on to allow current to flow between the
source and drain terminals (using open and closed for switches is confusing; for
the same reason we say a tap is on and not that it is closed). The transistor source
and drain terminals are equivalent as far as digital signals are concerned; we do
not worry about labeling an electrical switch with two terminals.
● V AB is the potential difference, or voltage, between nodes A and B in a circuit.
The sum uses the parity function ('1' if there are an odd number of '1's in the inputs).
The carry out, COUT, uses the 2-of-3 majority function ('1' if the majority of the inputs
are '1'). We can combine these two functions in a single FA logic cell, ADD(A[i], B[i],
CIN, S[i], COUT), shown in Figure 2.20(a), where

S[i] = SUM (A[i], B[i], CIN) , (2.40)
COUT = MAJ (A[i], B[i], CIN) . (2.41)
Now we can build a 4-bit ripple-carry adder (RCA) by connecting four of these ADD
cells together as shown in Figure 2.20(b). The i th ADD cell is arranged with the
following: two bus inputs A[i], B[i]; one bus output S[i]; an input, CIN, that is the
carry in from stage (i - 1) below and is also passed up to the cell above as an output;
and an output, COUT, that is the carry out to stage (i + 1) above. In the 4-bit adder
shown in Figure 2.20(b) we connect the carry input, CIN[0], to VSS and use COUT[3]
and COUT[2] to indicate arithmetic overflow (in Section 2.6.1 we shall see why we
may need both signals). Notice that we build the ADD cell so that COUT[2] is
available at the top of the datapath when we need it.
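To make Eqs. 2.40 and 2.41 concrete, here is a minimal bit-level sketch in Python (a model of our own for illustration, not the layout of Figure 2.20):

    def full_adder(a, b, cin):
        # One ADD cell: the sum is the parity of the inputs and the
        # carry out is the 2-of-3 majority (Eqs. 2.40 and 2.41).
        s = a ^ b ^ cin
        cout = (a & b) | (b & cin) | (a & cin)
        return s, cout

    def ripple_carry_add(a_bits, b_bits):
        # 4-bit RCA with CIN[0] tied to '0' (VSS); bit 0 is the LSB.
        carry, s = 0, []
        for a, b in zip(a_bits, b_bits):
            bit, carry = full_adder(a, b, carry)
            s.append(bit)
        return s, carry  # COUT[3] is the final carry out

    # 9 + 3 = 12: A = 1001, B = 0011, written LSB first below
    print(ripple_carry_add([1, 0, 0, 1], [1, 1, 0, 0]))  # ([0, 0, 1, 1], 0)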
Figure 2.20(c) shows a layout of the ADD cell. The A inputs, B inputs, and S outputs
all use m1 interconnect running in the horizontal direction; we call these data signals.
Other signals can enter or exit from the top or bottom and run vertically across the
datapath in m2; we call these control signals. We can also use m1 for control and m2 for
data, but we normally do not mix these approaches in the same structure. Control
signals are typically clocks and other signals common to elements. For example, in
Figure 2.20(c) the carry signals, CIN and COUT, run vertically in m2 between cells. To
build a 4-bit adder we stack four ADD cells creating the array structure shown in
Figure 2.20(d). In this case the A and B data bus inputs enter from the left and bus S,
the sum, exits at the right, but we can connect A, B, and S to either side if we want.
The layout of buswide logic that operates on data signals in this fashion is called a
datapath . The module ADD is a datapath cell or datapath element . Just as we do for
standard cells we make all the datapath cells in a library the same height so we can abut
other datapath cells on either side of the adder to create a more complex datapath.
When people talk about a datapath they always assume that it is oriented so that
increasing the size in bits makes the datapath grow in height, upwards in the vertical
direction, and adding different datapath elements to increase the function makes the
datapath grow in width, in the horizontal direction, but we can rotate and position a
completed datapath in any direction we want on a chip.
FIGURE 2.20 A datapath adder. (a) A full-adder (FA) cell with inputs (A and B), a
carry in, CIN, sum output, S, and carry out, COUT. (b) A 4-bit adder. (c) The layout,
using two-level metal, with data in m1 and control in m2. In this example the wiring is
completed outside the cell; it is also possible to design the datapath cells to contain the
wiring. Using three levels of metal, it is possible to wire over the top of the datapath
cells. (d) The datapath layout.
What is the difference between using a datapath, standard cells, or gate arrays? Cells
are placed together in rows on a CBIC or an MGA, but there is generally no
regularity to the arrangement of the cells within the rows; we let software arrange the
cells and complete the interconnect. Datapath layout automatically takes care of most
of the interconnect between the cells with the following advantages:
● Regular layout produces predictable and equal delay for each bit.
FIGURE 2.21 Symbols for a datapath adder. (a) A data bus is shown by a heavy line
(1.5 point) and a bus symbol. If the bus is n bits wide then MSB = n - 1. (b) An
alternative symbol for an adder. (c) Control signals are shown as lightweight (0.5
point) lines.
Some schematic datapath symbols include only data signals and omit the control
signals, but we must not forget them. In Figure 2.21, for example, we may need to
explicitly tie CIN[0] to VSS and use COUT[MSB] and COUT[MSB-1] to detect
overflow. Why might we need both of these control signals? Table 2.11 shows the
process of simple arithmetic for the different binary number representations, including
unsigned, signed magnitude, ones complement, and twos complement.
TABLE 2.11 Binary arithmetic.

Operation: representation of a negative number
  Unsigned: no change (negative numbers are not representable)
  Signed magnitude: if positive then MSB = 0, else MSB = 1
  Ones complement: if negative then flip bits
  Twos complement: if negative then {flip bits; add 1}

3 =
  Unsigned: 0011   Signed magnitude: 0011   Ones complement: 0011   Twos complement: 0011

-3 =
  Unsigned: NA   Signed magnitude: 1011   Ones complement: 1100   Twos complement: 1101

zero =
  Unsigned: 0000   Signed magnitude: 0000 or 1000   Ones complement: 1111 or 0000   Twos complement: 0000

max. positive =
  Unsigned: 1111 = 15   Signed magnitude: 0111 = 7   Ones complement: 0111 = 7   Twos complement: 0111 = 7

max. negative =
  Unsigned: 0000 = 0   Signed magnitude: 1111 = -7   Ones complement: 1000 = -7   Twos complement: 1000 = -8

addition: S = A + B = addend + augend (SG(A) = sign of A)
  Unsigned: S = A + B
  Signed magnitude: if SG(A) = SG(B) then S = A + B, else {if B < A then S = A - B, else S = B - A}
  Ones complement: S = A + B + COUT[MSB], where COUT is the carry out
  Twos complement: S = A + B

addition result (OV = overflow, OR = out of range)
  Unsigned: OR = COUT[MSB], where COUT is the carry out
  Signed magnitude: if SG(A) = SG(B) then OV = COUT[MSB], else OV = 0 (overflow is impossible)
  Ones complement: OV = XOR(COUT[MSB], COUT[MSB-1])
  Twos complement: OV = XOR(COUT[MSB], COUT[MSB-1])

SG(S) = sign of S, S = A + B
  Unsigned: NA
  Signed magnitude: if SG(A) = SG(B) then SG(S) = SG(A), else {if B < A then SG(S) = SG(A), else SG(S) = SG(B)}
  Ones complement: NA
  Twos complement: NA

subtraction: D = A - B = minuend - subtrahend
  Unsigned: D = A - B
  Signed magnitude: SG(B) = NOT(SG(B)); D = A + B (add as above)
  Ones complement: Z = -B (negate); D = A + Z
  Twos complement: Z = -B (negate); D = A + Z

subtraction result (OV = overflow, OR = out of range)
  Unsigned: OR = BOUT[MSB], where BOUT is the borrow out
  Signed magnitude: as in addition
  Ones complement: as in addition
  Twos complement: as in addition

negation: Z = -A (negate)
  Unsigned: NA
  Signed magnitude: Z = A; SG(Z) = NOT(SG(A))
  Ones complement: Z = NOT(A)
  Twos complement: Z = NOT(A) + 1
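The twos complement overflow rule from Table 2.11, OV = XOR(COUT[MSB], COUT[MSB-1]), is easy to check with a short Python sketch (an illustrative model of our own; bit 0 is the LSB):

    def add_twos_complement(a_bits, b_bits):
        # n-bit twos complement addition, returning the sum bits and
        # the overflow flag OV = XOR(COUT[MSB], COUT[MSB-1]).
        carry, s, carries = 0, [], []
        for a, b in zip(a_bits, b_bits):
            total = a + b + carry
            s.append(total & 1)
            carry = total >> 1
            carries.append(carry)
        ov = carries[-1] ^ carries[-2]
        return s, ov

    # 0111 (7) + 0001 (1) overflows in 4 bits: the result is 1000 (-8)
    print(add_twos_complement([1, 1, 1, 0], [1, 0, 0, 0]))  # ([0, 0, 0, 1], 1)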
2.6.2 Adders
We can view addition in terms of generate , G[ i ], and propagate , P[ i ], signals.
method 1                              method 2
G[i] = A[i] · B[i]                    G[i] = A[i] · B[i]                  (2.42)
P[i] = A[i] ⊕ B[i]                    P[i] = A[i] + B[i]                  (2.43)
C[i] = G[i] + P[i] · C[i-1]           C[i] = G[i] + P[i] · C[i-1]         (2.44)
S[i] = P[i] ⊕ C[i-1]                  S[i] = A[i] ⊕ B[i] ⊕ C[i-1]         (2.45)
where C[i] is the carry-out signal from stage i, equal to the carry in of stage (i + 1).
Thus, C[i] = COUT[i] = CIN[i + 1]. We need to be careful because C[0] might
represent either the carry in or the carry out of the LSB stage. For an adder we set the
carry in to the first stage (stage zero), C[-1] or CIN[0], to '0'. Some people use delete
(D) or kill (K) in various ways for the complements of G[i] and P[i], but unfortunately
others use C for COUT and D for CIN, so I avoid using any of these. Do not confuse the
two different methods (both of which are used) in Eqs. 2.42-2.45 when forming the
sum, since the propagate signal, P[i], is different for each method.
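The following sketch (illustrative Python of our own, not library code) ripples Eq. 2.44 through the bits and shows that the two methods give the same sum even though the propagate signals differ:

    def add_gp(a_bits, b_bits, method=1):
        # Generate/propagate ripple adder (Eqs. 2.42-2.45).
        # Method 1: P = A XOR B; method 2: P = A OR B.
        c, s = 0, []
        for a, b in zip(a_bits, b_bits):
            g = a & b
            p = (a ^ b) if method == 1 else (a | b)
            s.append((a ^ b) ^ c)  # the sum is the same either way
            c = g | (p & c)        # Eq. 2.44
        return s, c

    x, y = [1, 0, 1, 0], [1, 1, 0, 0]
    print(add_gp(x, y, 1) == add_gp(x, y, 2))  # True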
Figure 2.22(a) shows a conventional RCA. The delay of an n -bit RCA is proportional
to n and is limited by the propagation of the carry signal through all of the stages. We
can reduce delay by using pairs of go-faster bubbles to change AND and OR gates to
fast two-input NAND gates as shown in Figure 2.22(a). Alternatively, we can write the
equations for the carry signal in two different ways:
either C[i] = A[i] · B[i] + P[i] · C[i-1] (2.46)
or C[i] = (A[i] + B[i]) · (P[i]' + C[i-1]) , (2.47)
where P[i]' = NOT(P[i]). Equations 2.46 and 2.47 allow us to build the carry chain
from two-input NAND gates, one per cell, using different logic in even and odd stages
(Figure 2.22b):

even stages                              odd stages
C1[i]' = P[i] · C3[i-1] · C4[i-1]        C3[i]' = P[i] · C1[i-1] · C2[i-1]   (2.48)
C2[i] = A[i] + B[i]                      C4[i]' = A[i] · B[i]                (2.49)
C[i] = C1[i] · C2[i]                     C[i] = C3[i]' + C4[i]'              (2.50)

(the carry inputs to stage zero are C3[-1] = C4[-1] = '0'). We can use the RCA of
Figure 2.22(b) in a datapath, with standard cells, or on a gate array.
Instead of propagating the carries through each stage of an RCA, Figure 2.23 shows a
different approach. A carry-save adder (CSA) cell CSA(A1[i], A2[i], A3[i], CIN,
S1[i], S2[i], COUT) has three outputs:

S1[i] = CIN , (2.51)
S2[i] = A1[i] ⊕ A2[i] ⊕ A3[i] = PARITY(A1[i], A2[i], A3[i]) , (2.52)
COUT = A1[i] · A2[i] + [(A1[i] + A2[i]) · A3[i]] = MAJ(A1[i], A2[i], A3[i]) . (2.53)

The inputs, A1, A2, and A3, and the outputs, S1 and S2, are buses. The input, CIN, is the
carry from stage (i - 1). The carry in, CIN, is connected directly to the output bus S1,
as indicated by the schematic symbol (Figure 2.23a). We connect CIN[0] to VSS. The
output, COUT, is the carry out to stage (i + 1).
A 4-bit CSA is shown in Figure 2.23(b). The arithmetic overflow signal for ones
complement or twos complement arithmetic, OV, is XOR(COUT[MSB], COUT[MSB-1])
as shown in Figure 2.23(c). In a CSA the carries are saved at each stage and
shifted left onto the bus S1. There is thus no carry propagation and the delay of a CSA
is constant. At the output of a CSA we still need to add the S1 bus (all the saved
carries) and the S2 bus (all the sums) to get an n -bit result using a final stage that is not
shown in Figure 2.23(c). We might regard the n -bit sum as being encoded in the two
buses, S1 and S2, in the form of the parity and majority functions.
We can use a CSA to add multiple inputs; as an example, an adder with four 4-bit inputs
is shown in Figure 2.23(d). The last stage sums two input buses using a carry-propagate
adder ( CPA ). We have used an RCA as the CPA in Figure 2.23(d) and (e), but we can
use any type of adder. Notice in Figure 2.23(e) how the two CSA cells and the RCA
cell abut together horizontally to form a bit slice (or slice) and then the slices are
stacked vertically to form the datapath.
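A CSA stage is easy to model on whole words: bitwise parity gives the sum word and the shifted bitwise majority gives the saved carries, so that a1 + a2 + a3 == s + c (modulo 2**width). The sketch below (illustrative Python of our own) reduces four inputs with two CSA stages and resolves the saved carries with a final carry-propagate add:

    def carry_save(a1, a2, a3, width=8):
        # Eqs. 2.52 and 2.53 applied bitwise to integer operands; the
        # majority carries are saved and shifted left one position.
        mask = (1 << width) - 1
        s = (a1 ^ a2 ^ a3) & mask
        c = (((a1 & a2) | (a2 & a3) | (a1 & a3)) << 1) & mask
        return s, c

    s, c = carry_save(3, 5, 6)
    s, c = carry_save(s, c, 7)
    print(s + c)  # 21: the final CPA resolves the saved carries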
FIGURE 2.23 The carry-save adder (CSA). (a) A CSA cell. (b) A 4-bit CSA.
(c) Symbol for a CSA. (d) A four-input CSA. (e) The datapath for a four-input, 4-bit
adder using CSAs with a ripple-carry adder (RCA) as the final stage. (f) A pipelined
adder. (g) The datapath for the pipelined version showing the pipeline registers as well
as the clock control lines that use m2.
Adders based on this principle are called carry-bypass adders (CBA) [Sato et al.,
1992]. Large, custom adders employ Manchester-carry chains to compute the carries
and the bypass operation using TGs or just pass transistors [Weste and Eshraghian,
1993, pp. 530-531]. These types of carry chains may be part of a predesigned ASIC
adder cell, but are not used by ASIC designers.
Instead of checking the propagate signals we can check the inputs. For example we can
compute SKIP = (A[i-1] ⊕ B[i-1]) + (A[i] ⊕ B[i]) and then use a 2:1 MUX to
select C[i]. Thus,

CSKIP[i] = (G[i] + P[i] · C[i-1]) · SKIP' + C[i-2] · SKIP . (2.55)
This is a carry-skip adder [Keutzer, Malik, and Saldanha, 1991; Lehman, 1961].
Carry-bypass and carry-skip adders may include redundant logic (since the carry is
computed in two different wayswe just take the first signal to arrive). We must be
careful that the redundant logic is not optimized away during logic synthesis.
If we evaluate Eq. 2.44 recursively for i = 1, we get the following:

C[1] = G[1] + P[1] · C[0]
     = G[1] + P[1] · (G[0] + P[0] · C[-1])
     = G[1] + P[1] · G[0] (since C[-1] = '0') . (2.56)
This result means that we can look ahead by two stages and calculate the carry into
the third stage (bit 2), which is C[1], using only the first-stage inputs (to calculate G[0])
and the second-stage inputs. This is a carry-lookahead adder ( CLA ) [MacSorley,
1961]. If we continue expanding Eq. 2.44, we find:
C[2] = G[2] + P[2] · G[1] + P[2] · P[1] · G[0] ,
C[3] = G[3] + P[3] · G[2] + P[3] · P[2] · G[1] + P[3] · P[2] · P[1] · G[0] . (2.57)
As we look ahead further these equations become more complex, take longer to
calculate, and the logic becomes less regular when implemented using cells with a
limited number of inputs. Datapath layout must fit in a bit slice, so the physical and
logical structure of each bit must be similar. In a standard cell or gate array we are not
so concerned about a regular physical structure, but a regular logical structure
simplifies design. The Brent-Kung adder reduces the delay and increases the regularity
of the carry-lookahead scheme [Brent and Kung, 1982]. Figure 2.24(a) shows a regular
4-bit CLA, using the carry-lookahead generator cell (CLG) shown in Figure 2.24(b).
FIGURE 2.24 The Brent-Kung carry-lookahead adder (CLA). (a) Carry generation in a
4-bit CLA. (b) A cell to generate the lookahead terms, C[0]-C[3]. (c) Cells L1, L2, and
L3 are rearranged into a tree that has less delay. Cell L4 is added to calculate C[2] that
is lost in the translation. (d) and (e) Simplified representations of parts a and c. (f) The
lookahead logic for an 8-bit adder. The inputs, 0-7, are the propagate and carry terms
formed from the inputs to the adder. (g) An 8-bit Brent-Kung CLA. The outputs of the
advantage of this adder is that delays from the inputs to the outputs are more nearly
equal than in other adders. This tends to reduce the number of unwanted and
unnecessary switching events and thus reduces power dissipation.
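A sketch of the fully unrolled lookahead equations (illustrative Python of our own, with the carry in C[-1] taken as '0') shows how every carry becomes a two-level AND-OR function of the G and P signals:

    def lookahead_carries(a_bits, b_bits):
        # C[i] = G[i] + P[i]G[i-1] + P[i]P[i-1]G[i-2] + ... (Eq. 2.57)
        g = [a & b for a, b in zip(a_bits, b_bits)]
        p = [a | b for a, b in zip(a_bits, b_bits)]
        carries = []
        for i in range(len(g)):
            c = 0
            for j in range(i, -1, -1):
                term = g[j]
                for k in range(j + 1, i + 1):
                    term &= p[k]
                c |= term
            carries.append(c)
        return carries

    print(lookahead_carries([1, 0, 1, 0], [1, 1, 0, 0]))  # [1, 1, 1, 0]

The price of removing the ripple is gates with more and more inputs; the Brent-Kung tree restores regularity by sharing the intermediate terms.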
In a carry-select adder we duplicate two small adders (usually 4-bit or 8-bit adders,
often CLAs) for the cases CIN = '0' and CIN = '1' and then use a MUX to select the
case that we need: wasteful, but fast [Bedrij, 1962]. A carry-select adder is often used as
the fast adder in a datapath library because its layout is regular.
We can use the carry-select, carry-bypass, and carry-skip architectures to split a 12-bit
adder, for example, into three blocks. The delay of the adder is then partly dependent
on the delays of the MUX between each block. Suppose the delay due to 1 bit in an
adder block (we shall call this a bit delay) is approximately equal to the MUX delay. In
this case it may be faster to make the blocks 3, 4, and 5 bits long instead of equal in
size. Now the delays into the final MUX are equal: 3 bit delays plus 2 MUX delays for
the carry signal from bits 0-6, and 5 bit delays for the carry from bits 7-11. Adjusting
the block size reduces the delay of large adders (more than 16 bits).
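A minimal carry-select sketch (our own Python illustration) with the uneven 3, 4, 5 block sizes from the example above:

    def carry_select(a_bits, b_bits, block_sizes=(3, 4, 5)):
        # Each block precomputes two sums, for CIN = '0' and CIN = '1';
        # the real carry then picks one through a MUX.
        def ripple(a, b, cin):
            s, c = [], cin
            for x, y in zip(a, b):
                s.append(x ^ y ^ c)
                c = (x & y) | (c & (x ^ y))
            return s, c

        pos, carry, out = 0, 0, []
        for size in block_sizes:
            a, b = a_bits[pos:pos + size], b_bits[pos:pos + size]
            s0, c0 = ripple(a, b, 0)
            s1, c1 = ripple(a, b, 1)
            s, carry = (s1, c1) if carry else (s0, c0)  # the MUX
            out += s
            pos += size
        return out, carry

    # 4095 + 1 in 12 bits (LSB first): all sum bits 0, carry out 1
    print(carry_select([1] * 12, [1] + [0] * 11))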
We can extend the idea behind a carry-select adder as follows. Suppose we have an n
-bit adder that generates two sums: One sum assumes a carry-in condition of '0', the
other sum assumes a carry-in condition of '1'. We can split this n-bit adder into an i-bit
adder for the i LSBs and an (n - i)-bit adder for the n - i MSBs. Both of the smaller
adders generate two conditional sums as well as true and complement carry signals.
The two (true and complement) carry signals from the LSB adder are used to select
between the two (n - i + 1)-bit conditional sums from the MSB adder using 2(n - i + 1)
two-input MUXes. This is a conditional-sum adder (also often abbreviated to CSA)
[Sklansky, 1960]. We can recursively apply this technique. For example, we can split a
16-bit adder using i = 8 and n = 16; then we can split one or both 8-bit adders again,
and so on.
Figure 2.25 shows the simplest form of an n -bit conditional-sum adder that uses n
single-bit conditional adders, H (each with four outputs: two conditional sums, true
carry, and complement carry), together with a tree of 2:1 MUXes (Qi_j). The
conditional-sum adder is usually the fastest of all the adders we have discussed (it is the
fastest when logic cell delay increases with the number of inputs; this is true for all
ASICs except FPGAs).
FIGURE 2.25 The conditional-sum adder. (a) A 1-bit conditional adder that calculates
the sum and carry out assuming the carry in is either '1' or '0'. (b) The multiplexer that
selects between sums and carries. (c) A 4-bit conditional-sum adder with carry input,
C[0].
Figure 2.26 shows the normalized delay and area figures for a set of predesigned
datapath adders. The data in Figure 2.26 is from a series of ASIC datapath cell libraries
(Compass Passport) that may be synthesized together with test vectors and simulation
models. We can combine the different adder techniques, but the adders then lose
regularity and become less suited to a datapath implementation.
FIGURE 2.26 Datapath adders. This data is from a series of submicron datapath
libraries. (a) Delay normalized to a two-input NAND logic cell delay (approximately
equal to 250 ps in a 0.5 µm process). For example, a 64-bit ripple-carry adder (RCA)
has a delay of approximately 30 ns in a 0.5 µm process. The spread in delay is due to
variation in delays between different inputs and outputs. An n-bit RCA has a delay
proportional to n. The delay of an n-bit carry-select adder is approximately
proportional to log 2 n. The carry-save adder delay is constant (but requires a
carry-propagate adder to complete an addition). (b) In a datapath library the area of all
adders is proportional to the bit size.
There are other adders that are not used in datapaths, but are occasionally useful in
ASIC design. A serial adder is smaller but slower than the parallel adders we have
described [Denyer and Renshaw, 1985]. The carry-completion adder is a variable delay
adder and rarely used in synchronous designs [Sklansky, 1960].
2.6.4 Multipliers
Figure 2.27 shows a symmetric 6-bit array multiplier (an n-bit multiplier multiplies
two n-bit numbers; we shall say n-bit by m-bit multiplier if the lengths are different).
Adders a0-f0 may be eliminated, which then eliminates adders a1-a6, leaving an
asymmetric CSA array of 30 (5 × 6) adders (including one half adder). An n-bit array
multiplier has a delay proportional to n plus the delay of the CPA (adders b6-f6 in
Figure 2.27). There are two items we can attack to improve the performance of a
multiplier: the number of partial products and the addition of the partial products.
FIGURE 2.27 Multiplication. A 6-bit array multiplier using a final carry-propagate
adder (full-adder cells a6-f6, a ripple-carry adder). Apart from the generation of the
summands this multiplier uses the same structure as the carry-save adder of
Figure 2.23(d).
where each 3-bit group overlaps by one bit. We pad B with a zero, B n . . . B 1 B 0 0, to
match the first term in Eq. 2.61. If B has an odd number of bits, then we extend the
sign: B n B n . . . B 1 B 0 0. For example, B = 01011 (eleven) encodes to E = 11̄1̄
(16 - 4 - 1), where 1̄ denotes a digit of -1; and B = 101 encodes to E = 1̄1. This is
called Booth encoding and reduces the number of partial products by a factor of two
and thus considerably reduces the area as well as increasing the speed of our
multiplier [Booth, 1951].
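A small sketch of radix-4 Booth encoding (illustrative Python of our own; the digit table is the standard one for overlapping 3-bit groups):

    def booth_encode(b, n_bits):
        # Pad B with a trailing zero, then scan overlapping 3-bit
        # groups two bits at a time; each group becomes a digit in
        # {-2, -1, 0, +1, +2}, halving the number of partial products.
        table = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
                 0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}
        padded = b << 1  # B[n] ... B[1] B[0] 0
        return [table[(padded >> i) & 0b111] for i in range(0, n_bits, 2)]

    # B = 01011 (eleven) encodes, LSB digit first, to [-1, -1, 1]:
    # -1 - 1*4 + 1*16 = 11
    print(booth_encode(0b01011, 6))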
Next we turn our attention to improving the speed of addition in the CSA array.
Figure 2.28(a) shows a section of the 6-bit array multiplier from Figure 2.27. We can
collapse the chain of adders a0-f5 (5 adder delays) to the Wallace tree consisting of
adders 5.1-5.4 (4 adder delays) shown in Figure 2.28(b).
FIGURE 2.28 Tree-based multiplication. (a) The portion of Figure 2.27 that calculates
the sum bit, P5, using a chain of adders (cells a0-f5). (b) We can collapse this chain to
a Wallace tree (cells 5.1-5.5). (c) The stages of multiplication.
Figure 2.28(c) pictorially represents multiplication as a sort of golf course. Each link
corresponds to an adder. The holes or dots are the outputs of one stage (and the inputs
of the next). At each stage we have the following three choices: (1) sum three outputs
using a full adder (denoted by a box enclosing three dots); (2) sum two outputs using a
half adder (a box with two dots); (3) pass the outputs directly to the next stage. The two
outputs of an adder are joined by a diagonal line (full adders use black dots, half adders
white dots). The object of the game is to choose (1), (2), or (3) at each stage to
maximize the performance of the multiplier. In tree-based multipliers there are two
ways to do this: working forward and working backward.
In a Wallace-tree multiplier we work forward from the multiplier inputs, compressing
the number of signals to be added at each stage [Wallace, 1960]. We can view an FA as
a 3:2 compressor or (3, 2) counter: it counts the number of '1's on the inputs. Thus, for
example, an input of '101' (two '1's) results in an output '10' (2). A half adder is a (2, 2)
counter. To form P5 in Figure 2.29 we must add 6 summands (S05, S14, S23, S32,
S41, and S50) and 4 carries from the P4 column. We add these in stages 1-7,
compressing from 6:3:2:2:3:1:1. Notice that we wait until stage 5 to add the last carry
from column P4, and this means we expand (rather than compress) the number of
signals (from 2 to 3) between stages 3 and 5. The maximum delay through the CSA
array of Figure 2.29 is 6 adder delays. To this we must add the delay of the 4-bit (9
inputs) CPA (stage 7). There are 26 adders (6 half adders) plus the 4 adders in the CPA.
FIGURE 2.29 A 6-bit Wallace-tree multiplier. The carry-save adder (CSA) requires 26
adders (cells 1-26, six of which are half adders). The final carry-propagate adder (CPA)
consists of 4 adder cells (27-30). The delay of the CSA is 6 adders. The delay of the
CPA is 4 adders.
In a Dadda multiplier (Figure 2.30) we work backward from the final product [Dadda,
1965]. Each stage has a maximum of 2, 3, 4, 6, 9, 13, 19, . . . outputs (each successive
stage is 3/2 times larger, rounded down to an integer). Thus, for example, in
Figure 2.28(d) we require 3 stages (with 3 adder delays, plus the delay of a 10-bit output
CPA) for a 6-bit Dadda multiplier. There are 19 adders (4 half adders) in the CSA plus
the 10 adders (2 half adders) in the CPA. A Dadda multiplier is usually faster and
smaller than a Wallace-tree multiplier.
FIGURE 2.30 The 6-bit Dadda multiplier. The carry-save adder (CSA) requires 20
adders (cells 1-20, four of which are half adders). The carry-propagate adder (CPA,
cells 21-30) is a ripple-carry adder (RCA). The CSA is smaller (20 versus 26 adders),
faster (3 adder delays versus 6 adder delays), and more regular than the Wallace-tree
CSA of Figure 2.29. The overall speed of this implementation is approximately the
same as the Wallace-tree multiplier of Figure 2.29; however, the speed may be
increased by substituting a faster CPA.
In general, the number of stages and thus delay (in units of an FA delay, excluding the
CPA) for an n-bit tree-based multiplier using (3, 2) counters is

log1.5 n = log10 n / log10 1.5 = log10 n / 0.176 . (2.64)
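The Dadda stage limits and the log-base-1.5 estimate of Eq. 2.64 are easy to tabulate (an illustrative Python sketch of our own; the closed form is only an approximation to the exact stage count):

    import math

    def tree_stages(n):
        # Walk the Dadda sequence 2, 3, 4, 6, 9, 13, 19, ... (each
        # stage is 3/2 times larger, rounded down) until it reaches n.
        stages, limit = 0, 2
        while limit < n:
            limit = limit * 3 // 2
            stages += 1
        return stages

    for n in (6, 16, 32):
        print(n, tree_stages(n), round(math.log(n) / math.log(1.5), 1))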
The redundant binary representation is not unique. We can represent 101 (decimal), for
example, by 1100101 (binary and CSD vector) or 11̄100111̄, where 1̄ denotes a digit of
-1. As another example, 188 (decimal) can be represented by 10111100 (binary),
11̄10001̄00, 101̄001̄100, or 101̄0001̄00 (CSD vector). Redundant binary addition of
binary, redundant binary, or CSD vectors does not result in a unique sum, and addition
of two CSD vectors does not result in a CSD vector. Each n-bit redundant binary
number requires a rather wasteful 2n-bit binary number for storage. Thus 101̄ is
represented as 010010, for example (using sign magnitude). The other disadvantage of
redundant binary arithmetic is the need to convert to and from binary representation.
Table 2.14 shows the (5, 3) residue number system. As an example, 11 (decimal) is
represented as [1, 2] residue (5, 3) since 11 mod 5 = 1 and 11 mod 3 = 2. The size of
this system is thus 3 × 5 = 15. We add, subtract, or multiply residue numbers using the
modulus of each digit position, without any carry. Thus:
   4 [4, 1]       12 [2, 0]        3 [3, 0]
 + 7 [2, 1]      -  4 [4, 1]     ×  4 [4, 1]
= 11 [1, 2]      =  8 [3, 2]     = 12 [2, 0]
TABLE 2.14 The (5, 3) residue number system.

n   residue 5   residue 3     n   residue 5   residue 3     n    residue 5   residue 3
0   0           0             5   0           2             10   0           1
1   1           1             6   1           0             11   1           2
2   2           2             7   2           1             12   2           0
3   3           0             8   3           2             13   3           1
4   4           1             9   4           0             14   4           2
The choice of moduli determines the system size and the computing complexity. The
most useful choices are relative primes (such as 3 and 5). With p prime, numbers of the
form 2^p and 2^p - 1 are particularly useful (numbers of the form 2^p - 1 are Mersenne
numbers) [Waser and Flynn, 1982].
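Residue arithmetic is a few lines of Python (a sketch of our own for the (5, 3) system of Table 2.14, not production code):

    MODULI = (5, 3)

    def to_residue(n, moduli=MODULI):
        # Encode n as one residue digit per modulus.
        return tuple(n % m for m in moduli)

    def rns_add(x, y, moduli=MODULI):
        # Digit-by-digit addition under each modulus: no carries.
        return tuple((a + b) % m for a, b, m in zip(x, y, moduli))

    def rns_mul(x, y, moduli=MODULI):
        return tuple((a * b) % m for a, b, m in zip(x, y, moduli))

    print(rns_add(to_residue(4), to_residue(7)))  # (1, 2), i.e., 11
    print(rns_mul(to_residue(3), to_residue(4)))  # (2, 0), i.e., 12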
These equations are the same as those for the FA (Eqs. 2.38 and 2.39) except that the B
input is inverted and the sense of the carry chain is inverted. To build a subtracter that
calculates (A - B) we invert the entire B input bus and connect the BIN[0] input to
VDD (not to VSS as we did for CIN[0] in an adder). As an example, to subtract B =
'0011' from A = '1001' we calculate '1001' + '1100' + '1' = '0110'. As with an adder, the
true overflow is XOR(BOUT[MSB], BOUT[MSB-1]).
We can build a ripple-borrow subtracter (a type of borrow-propagate subtracter), a
borrow-save subtracter, and a borrow-select subtracter in the same way we built these
adder architectures. An adder/subtracter has a control signal that gates the A input with
an exclusive-OR cell (forming a programmable inversion) to switch between an adder
or subtracter. Some adder/subtracters gate both inputs to allow us to compute (B - A).
We must be careful to connect the LSB input of the carry chain (CIN[0] or
BIN[0]) correctly when changing between addition (connect to VSS) and subtraction
(connect to VDD).
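The worked example above checks out in a short sketch (illustrative Python of our own for a 4-bit twos complement subtracter):

    def subtract(a, b, n_bits=4):
        # A - B: invert the B bus and set the LSB carry in to '1'
        # (connect BIN[0] to VDD), then add as usual.
        mask = (1 << n_bits) - 1
        return (a + (~b & mask) + 1) & mask

    # A = '1001' (9), B = '0011' (3): '1001' + '1100' + '1' = '0110' (6)
    print(format(subtract(0b1001, 0b0011), '04b'))  # 0110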
A barrel shifter rotates or shifts an input bus by a specified amount. For example, if we
have an eight-input barrel shifter with input '1111 0000' and we specify a shift of
'0001 0000' (3, coded by bit position), the right-shifted 8-bit output is '0001 1110'. A
barrel shifter may rotate left or right (or switch between the two under a separate
control). A barrel shifter may also have an output width that is smaller than the input.
To use a simple example, we may have an 8-bit input and a 4-bit output. This situation
is equivalent to having a barrel shifter with two 4-bit inputs and a 4-bit output. Barrel
shifters are used extensively in floating-point arithmetic to align (we call this normalize
and denormalize ) floating-point numbers (with sign, exponent, and mantissa).
A leading-one detector is used with a normalizing (left-shift) barrel shifter to align
mantissas in floating-point numbers. The input is an n -bit bus A, the output is an n -bit
bus, S, with a single '1' in the bit position corresponding to the most significant '1' in
the input. Thus, for example, if the input is A = '0000 0101' the leading-one detector
output is S = '0000 0100', indicating the leading one in A is in bit position 2 (bit 7 is the
MSB, bit zero is the LSB). If we feed the output, S, of the leading-one detector to the
shift select input of a normalizing (left-shift) barrel shifter, the shifter will normalize
the input A. In our example, with an input of A = '0000 0101', and a left-shift of S =
'0000 0100', the barrel shifter will shift A left by five bits and the output of the shifter is
Z = '1010 0000'. Now that Z is aligned (with the MSB equal to '1') we can multiply Z
with another normalized number.
The output of a priority encoder is the binary-encoded position of the leading one in an
input. For example, with an input A = '0000 0101' the leading 1 is in bit position 2
(MSB is bit position 7), so the output of a 4-bit priority encoder would be Z = '0010' (2).
In some cell libraries the encoding is reversed so that the MSB has an output code of
zero; in this case Z = '0101' (5). This second, reversed, encoding scheme is useful in
floating-point arithmetic. If A is a mantissa and we normalize A to '1010 0000' we have
to subtract 5 from the exponent; this exponent correction is equal to the output of the
priority encoder.
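Both cells are easy to model (an illustrative Python sketch of our own, using the same 8-bit example):

    def leading_one(a, n_bits=8):
        # One-hot output marking the most significant '1' of the input.
        for i in range(n_bits - 1, -1, -1):
            if a & (1 << i):
                return 1 << i
        return 0

    def priority_encode_reversed(a, n_bits=8):
        # Reversed encoding: the MSB encodes to zero, so the output is
        # the left-shift needed to normalize (the exponent correction).
        for i in range(n_bits - 1, -1, -1):
            if a & (1 << i):
                return (n_bits - 1) - i
        return n_bits

    a = 0b00000101
    print(format(leading_one(a), '08b'))  # 00000100
    print(priority_encode_reversed(a))    # 5: shift left by 5 to normalize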
An accumulator is an adder/subtracter and a register. Sometimes these are combined
with a multiplier to form a multiplier-accumulator (MAC). An incrementer adds 1 to
the input bus, Z = A + 1, so we can use this function, together with a register, to negate
a twos complement number, for example. The implementation is Z[i] = XOR(A[i],
CIN[i]), and COUT[i] = AND(A[i], CIN[i]). The carry-in control input, CIN[0],
thus acts as an enable: If it is set to '0' the output is the same as the input.
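A sketch of the incrementer carry chain (illustrative Python of our own, bits LSB first):

    def increment(a_bits, enable=1):
        # Z[i] = XOR(A[i], CIN[i]); COUT[i] = AND(A[i], CIN[i]).
        # CIN[0] acts as an enable.
        c, z = enable, []
        for a in a_bits:
            z.append(a ^ c)
            c = a & c
        return z

    print(increment([1, 1, 0, 0]))            # [0, 0, 1, 0]: 3 + 1 = 4
    print(increment([1, 1, 0, 0], enable=0))  # unchanged when CIN[0] = '0'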
The implementation of arithmetic cells is often a little more complicated than we have
explained. CMOS logic is naturally inverting, so that it is faster to implement an
incrementer as
Z[i(even)] = XOR(A[i], CIN[i]) and COUT[i(even)] = NAND(A[i], CIN[i]).

This inverts COUT, so that in the following stage we must invert it again. If we push an
inverting bubble to the input CIN we find that:

Z[i(odd)] = XNOR(A[i], CIN[i]) and COUT[i(odd)] = NOR(NOT(A[i]), CIN[i]).
In many datapath implementations all odd-bit cells operate on inverted carry signals,
and thus the odd-bit and even-bit datapath elements are different. In fact, all the adder
and subtracter datapath elements we have described may use this technique. Normally
this is completely hidden from the designer in the datapath assembly and any output
control signals are inverted, if necessary, by inserting buffers.
A decrementer subtracts 1 from the input bus; the logical implementation is Z[i] =
XOR(A[i], CIN[i]) and COUT[i] = AND(NOT(A[i]), CIN[i]). The
implementation may invert the odd carry signals, with CIN[0] again acting as an
enable.
An incrementer/decrementer has a second control input that gates the input, inverting
the input to the carry chain. This has the effect of selecting either the increment or
decrement function.
When using all-zeros detectors and all-ones detectors, remember that, for a 4-bit
number, for example, zero in ones complement arithmetic is '1111' or '0000', and that
zero in signed magnitude arithmetic is '1000' or '0000'.
A register file (or scratchpad memory) is a bank of flip-flops arranged across the bus;
sometimes these have the option of multiple ports (multiport register files) for read and
write. Normally these register files are the densest logic and hardest to fit in a datapath.
For large register files it may be more appropriate to use a multiport memory. We can
add control logic to a register file to create a first-in first-out register ( FIFO ), or last-in
first-out register ( LIFO ).
In Section 2.5 we saw that the standard-cell version and gate-array macro version of the
sequential cells (latches and flip-flops) each contain their own clock buffers. The
reason for this is that (without intelligent placement software) we do not know where a
standard cell or a gate-array macro will be placed on a chip. We also have no idea of
the condition of the clock signal coming into a sequential cell. The ability to place the
clock buffers outside the sequential cells in a datapath gives us more flexibility and
saves space. For example, we can place the clock buffers for all the clocked elements at
the top of the datapath (together with the buffers for the control signals) and river route
(in river routing the interconnect lines all flow in the same direction on the same layer)
the connections to the clock lines. This saves space and allows us to guarantee the
clock skew and timing. It may mean, however, that there is a fixed overhead associated
with a datapath. For example, it might make no sense to build a 4-bit datapath if the
clock and control buffers take up twice the space of the datapath logic. Some tools
allow us to design logic using a portable netlist . After we complete the design we can
decide whether to implement the portable netlist in a datapath, standard cells, or even a
gate array, based on area, speed, or power considerations.
2.7 I/O Cells
Figure 2.33 shows a three-state bidirectional output buffer (Tri-State ® is a
registered trademark of National Semiconductor). When the output enable (OE)
signal is high, the circuit functions as a noninverting buffer driving the value of
DATAin onto the I/O pad. When OE is low, the output transistors or drivers , M1
and M2, are disconnected. This allows multiple drivers to be connected on a bus.
It is up to the designer to make sure that a bus never has two drivers, a problem
known as contention.
In order to prevent the problem opposite to contention (a bus floating to an
intermediate voltage when there are no bus drivers) we can use a bus keeper or
bus-hold cell (TI calls this Bus-Friendly logic). A bus keeper normally acts like
two weak (low drive-strength) cross-coupled inverters that act as a latch to retain
the last logic state on the bus, but the latch is weak enough that it may be driven
easily to the opposite state. Even though bus keepers act like latches, and will
simulate like latches, they should not be used as latches, since their drive strength
is weak.
Transistors M1 and M2 in Figure 2.33 have to drive large off-chip loads. If we
wish to change the voltage on a C = 200 pF load by 5 V in 5 ns (a slew rate of
1 V ns⁻¹) we will require a current in the output transistors of I DS = C(dV/dt) =
(200 × 10⁻¹²)(5 / (5 × 10⁻⁹)) = 0.2 A or 200 mA.
Such large currents flowing in the output transistors must also flow in the power
supply bus and can cause problems. There is always some inductance in series
with the power supply, between the point at which the supply enters the ASIC
package and reaches the power bus on the chip. The inductance is due to the bond
wire, lead frame, and package pin. If we have a power-supply inductance of 2 nH
and a current changing from zero to 1 A (32 I/O cells on a bus switching at 30
mA each) in 5 ns, we will have a voltage spike on the power supply (called
power-supply bounce) of L(dI/dt) = (2 × 10^-9)(1/(5 × 10^-9)) = 0.4 V.
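The corresponding check in Python:

L = 2e-9      # power-supply inductance, 2 nH
dI = 1.0      # current step: 32 drivers at roughly 30 mA each, about 1 A
dt = 5e-9     # switching time, 5 ns
print(L * dI / dt)   # 0.4 V of power-supply bounce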
We do several things to alleviate this problem: We can limit the number of
simultaneously switching outputs (SSOs), we can limit the number of I/O drivers
that can be attached to any one VDD and GND pad, and we can design the output
buffer to limit the slew rate of the output (we call these slew-rate limited I/O
pads). Quiet-I/O cells also use two separate power supplies and two sets of I/O
drivers: an AC supply (clean or quiet supply) with small AC drivers for the I/O
circuits that start and stop the output slewing at the beginning and end of an
output transition, and a DC supply (noisy or dirty supply) for the transistors
that handle large currents as they slew the output.
The three-state buffer allows us to employ the same pad for input and output,
bidirectional I/O. When we want to use the pad as an input, we set OE low and
take the data from DATAin. Of course, it is not necessary to have all these
features on every pad: We can build output-only or input-only pads.
We can also use many of these output cell features for input cells that have to
drive large on-chip loads (a clock pad cell, for example). Some gate arrays
simply turn an output buffer around to drive a grid of interconnect that supplies a
clock signal internally. With a typical interconnect capacitance of
0.2 pF cm^-1, a grid of 100 cm (consisting of 10 by 10 lines running all the way
across a 1 cm chip) presents a load of 20 pF to the clock buffer.
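The load figure follows directly; in Python:

c_per_cm = 0.2e-12             # interconnect capacitance, 0.2 pF per cm
grid_length = 100              # total grid wire length in cm
print(c_per_cm * grid_length)  # 2e-11 F, i.e. 20 pF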
Some libraries include I/O cells that have passive pull-ups or pull-downs
(resistors) instead of the transistors, M1 and M2 (the resistors are normally still
constructed from transistors with long gate lengths). We can also omit one of the
driver transistors, M1 or M2, to form open-drain outputs that require an external
pull-up or pull-down. We can design the output driver to produce TTL output
levels rather than CMOS logic levels. We may also add input hysteresis (using a
Schmitt trigger) to the input buffer, I1 in Figure 2.33, to accept input data signals
that contain glitches (from bouncing switch contacts, for example) or that are
slow rising. The input buffer can also include a level shifter to accept TTL input
levels and shift the input signal to CMOS levels.
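To see why hysteresis rejects glitches, here is a minimal Python sketch of a Schmitt-trigger input (the 1.7 V and 0.9 V thresholds are illustrative assumptions, not values from any particular library):

class SchmittTrigger:
    def __init__(self, v_high=1.7, v_low=0.9):
        self.v_high = v_high  # a rising input must exceed this to switch the output high
        self.v_low = v_low    # a falling input must drop below this to switch it low
        self.out = 0

    def sample(self, v_in):
        if self.out == 0 and v_in > self.v_high:
            self.out = 1
        elif self.out == 1 and v_in < self.v_low:
            self.out = 0
        return self.out  # glitches between the two thresholds leave the output unchanged

st = SchmittTrigger()
print([st.sample(v) for v in (0.0, 1.2, 1.8, 1.2, 0.5)])  # [0, 0, 1, 1, 0]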
The gate oxide in CMOS transistors is extremely thin (100 Å or less). This leaves
the gate oxide of the I/O cell input transistors susceptible to breakdown from
static electricity ( electrostatic discharge , or ESD ). ESD arises when we or
machines handle the package leads (like the shock I sometimes get when I touch
a doorknob after walking across the carpet at work). Sometimes this problem is
called electrical overstress (EOS) since most ESD-related failures are caused not
by gate-oxide breakdown, but by the thermal stress (melting) that occurs when
the n -channel transistor in an output driver overheats (melts) due to the large
current that can flow in the drain diffusion connected to a pad during an ESD
event.
To protect the I/O cells from ESD, the input pads are normally tied to device
structures that clamp the input voltage to below the gate breakdown voltage
(which can be as low as 10 V with a 100 Å gate oxide). Some I/O cells use
transistors with a special ESD implant that increases breakdown voltage and
provides protection. I/O driver transistors can also use elongated drain structures
(ladder structures) and large drain-to-gate spacing to help limit current, but in a
salicide process that lowers the drain resistance this is difficult. One solution is to
mask the I/O cells during the salicide step. Another solution is to use pnpn and
npnp diffusion structures called silicon-controlled rectifiers (SCRs) to clamp
voltages and divert current to protect the I/O circuits from ESD.
There are several ways to model the capability of an I/O cell to withstand EOS.
The human-body model ( HBM ) represents ESD by a 100 pF capacitor
discharging through a 1.5 kΩ resistor (this is an International Electrotechnical
Commission, IEC, specification). Typical voltages generated by the human body
are in the range of 2-4 kV, and we often see an I/O pad cell rated by the voltage
it can withstand using the HBM. The machine model ( MM ) represents an ESD
event generated by automated machine handlers. Typical MM parameters use a
200 pF capacitor (typically charged to 200 V) discharged through a 25 Ω
resistor, corresponding to a peak initial current of nearly 10 A. The charge-device
model ( CDM , also called device charge-discharge) represents the problem when
an IC package is charged, in a shipping tube for example, and then grounded. If
the maximum charge on a package is 3 nC (a typical measured figure) and the
package capacitance to ground is 1.5 pF, we can simulate this event by charging a
1.5 pF capacitor to 2 kV and discharging it through a 1 Ω resistor.
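These models reduce to simple circuit arithmetic; a quick Python check of the figures above:

# Human-body model: peak current if charged to 2 kV (within the 2-4 kV range above).
print(2000 / 1500)      # about 1.3 A through the 1.5 kilohm resistor

# Machine model: peak current of a 200 pF capacitor at 200 V into 25 ohms.
print(200 / 25)         # 8 A, "nearly 10 A"

# Charge-device model: voltage from 3 nC on a 1.5 pF package capacitance.
print(3e-9 / 1.5e-12)   # 2000 V, i.e. 2 kV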
If the diffusion structures in the I/O cells are not designed with care, it is possible
to construct an SCR structure unwittingly, and instead of protecting the
transistors the SCR can enter a mode where it is latched on and conducting large
enough currents to destroy the chip. This failure mode is called latch-up .
Latch-up can occur if the pn -diodes on a chip become forward-biased and inject
minority carriers (electrons in p -type material, holes in n -type material) into the
substrate. The source-substrate and drain-substrate diodes can become
forward-biased due to power-supply bounce or output undershoot (the cell
outputs fall below V SS ) or overshoot (outputs rise to greater than V DD ), for
example. These injected minority carriers can travel fairly large distances and
interact with nearby transistors, causing latch-up. I/O cells normally surround the
I/O transistors with guard rings (a continuous ring of n -diffusion in an n -well
connected to VDD, and a ring of p -diffusion in a p -well connected to VSS) to
collect these minority carriers. This is a problem that can also occur in the logic
core and this is one reason that we normally include substrate and well
connections to the power supplies in every cell.
2.8 Cell Compilers
The process of hand crafting circuits and layout for a full-custom IC is a tedious,
time-consuming, and error-prone task. There are two types of automated layout
assembly tools, often known as silicon compilers . The first type produces a
specific kind of circuit, a RAM compiler or multiplier compiler , for example.
The second type of compiler is more flexible, usually providing a programming
language that assembles or tiles layout from an input command file, but this is
full-custom IC design.
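As a rough illustration of the second type of compiler, here is a minimal Python sketch that tiles a leaf cell into an array, in the spirit of a layout-assembly language (the cell name, sizes, and interface are invented for illustration, not any real tool's input format):

def tile_array(cell, rows, cols, cell_w, cell_h):
    """Place copies of 'cell' on a regular grid, returning (name, x, y) placements."""
    return [(cell, col * cell_w, row * cell_h)
            for row in range(rows)
            for col in range(cols)]

# e.g. a 4 x 8 array of a hypothetical SRAM bit cell, 2 um x 3 um
placements = tile_array("sram_bit", rows=4, cols=8, cell_w=2, cell_h=3)
print(len(placements), placements[0], placements[-1])
# 32 ('sram_bit', 0, 0) ('sram_bit', 14, 9)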
We can build a register file from latches or flip-flops, but, at 4.5 to 6.5 gates
(18 to 26 transistors) per bit, this is an expensive way to build memory. Dynamic RAM
(DRAM) can use a cell with only one transistor, storing charge on a capacitor
that has to be periodically refreshed as the charge leaks away. ASIC RAM is
invariably static (SRAM), so we do not need to refresh the bits. When we refer to
RAM in an ASIC environment we almost always mean SRAM. Most ASIC
RAMs use a six-transistor cell (four transistors to form two cross-coupled
inverters that form the storage loop, and two more transistors to allow us to read
from and write to the cell). RAM compilers are available that produce single-port
RAM (a single shared bus for read and write) as well as dual-port RAMs , and
multiport RAMs . In a multiport RAM the compiler may or may not handle the
problem of address contention (attempts to read and write to the same RAM
address simultaneously). RAM can be asynchronous (the read and write cycles
are triggered by control and/or address transitions asynchronous to a clock) or
synchronous (using the system clock).
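A minimal Python sketch of a synchronous single-port SRAM's behavior, where reads and writes take effect only on a clock edge (the port and method names are illustrative assumptions):

class SyncSRAM:
    def __init__(self, words=16):
        self.mem = [0] * words
        self.dout = 0

    def clock_edge(self, addr, din, we):
        """On each rising clock edge: write if we is set, otherwise read."""
        if we:
            self.mem[addr] = din        # write through the single shared port
        else:
            self.dout = self.mem[addr]  # read, with a registered output
        return self.dout

An asynchronous RAM would instead respond directly to address and control transitions, with no clock involved.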
In addition to producing layout we also need a model compiler so that we can
verify the circuit at the behavioral level, and we need a netlist from a netlist
compiler so that we can simulate the circuit and verify that it works correctly at
the structural level. Silicon compilers are thus complex pieces of software. We
assume that a silicon compiler will produce working silicon even if every
configuration has not been tested. This is still ASIC design, but now we are
relying on the fact that the tool works correctly and therefore the compiled blocks
are correct by construction .
2.9 Summary
The most important concepts that we covered in this chapter are the following:
● The use of transistors as switches
● Pushing bubbles
● Ratio of logic