21 Scheme Advanced VLSI, Module 1 (uploaded by gadagtrupti)

Module-1: Introduction to ASICs


INTRODUCTION TO ASICs
An ASIC (pronounced "a-sick"; bold typeface defines a new term) is an
application-specific integrated circuit, or at least that is what the acronym stands for.
Before we answer the question of what that means, we first look at the evolution
of the silicon chip or integrated circuit ( IC ).
Figure 1.1(a) shows an IC package (this is a pin-grid array, or PGA, shown
upside down; the pins will go through holes in a printed-circuit board). People
often call the package a chip, but, as you can see in Figure 1.1(b), the silicon chip
itself (more properly called a die ) is mounted in the cavity under the sealed lid.
A PGA package is usually made from a ceramic material, but plastic packages
are also common.

FIGURE 1.1 An integrated circuit (IC). (a) A pin-grid array (PGA) package. (b) The silicon die or chip is under the package lid.

The physical size of a silicon die varies from a few millimeters on a side to over
1 inch on a side, but we more often measure the size of an IC by the number of
logic gates or the number of transistors that the IC contains. As a unit of measure,
a gate equivalent corresponds to a two-input NAND gate (a circuit that performs
the logic function F = (A · B)′). Often we just use the term gates instead of gate
equivalents when we are measuring chip size (not to be confused with the gate
terminal of a transistor). For example, a 100 k-gate IC contains the equivalent of
100,000 two-input NAND gates.
The semiconductor industry has evolved from the first ICs of the early 1970s and
matured rapidly since then. Early small-scale integration (SSI) ICs contained a
few (1 to 10) logic gates (NAND gates, NOR gates, and so on) amounting to a few
tens of transistors. The era of medium-scale integration (MSI) increased the
range of integrated logic available to counters and similar, larger scale, logic
functions. The era of large-scale integration ( LSI ) packed even larger logic
functions, such as the first microprocessors, into a single chip. The era of very
large-scale integration (VLSI) now offers 64-bit microprocessors, complete with
cache memory and floating-point arithmetic units, with well over a million transistors
on a single piece of silicon. As CMOS process technology improves, transistors
continue to get smaller and ICs hold more and more transistors. Some people
(especially in Japan) use the term ultralarge scale integration ( ULSI ), but most
people stop at the term VLSI; otherwise we have to start inventing new words.
The earliest ICs used bipolar technology and the majority of logic ICs used either
transistor-transistor logic (TTL) or emitter-coupled logic (ECL). Although
invented before the bipolar transistor, the metal-oxide-silicon ( MOS ) transistor
was initially difficult to manufacture because of problems with the oxide
interface. As these problems were gradually solved, metal-gate n -channel MOS (
nMOS or NMOS ) technology developed in the 1970s. At that time MOS
technology required fewer masking steps, was denser, and consumed less power
than equivalent bipolar ICs. This meant that, for a given performance, an MOS
IC was cheaper than a bipolar IC and led to investment and growth of the MOS
IC market.
By the early 1980s the aluminum gates of the transistors were replaced by
polysilicon gates, but the name MOS remained. The introduction of polysilicon
as a gate material was a major improvement in CMOS technology, making it
easier to make two types of transistors, n -channel MOS and p -channel MOS
transistors, on the same IC: a complementary MOS (CMOS, never cMOS)
technology. The principal advantage of CMOS over NMOS is lower power
consumption. Another advantage of a polysilicon gate was a simplification of the
fabrication process, allowing devices to be scaled down in size.
There are four CMOS transistors in a two-input NAND gate (and a two-input
NOR gate too), so to convert between gates and transistors, you multiply the
number of gates by 4 to obtain the number of transistors. We can also measure an
IC by the smallest feature size (roughly half the length of the smallest transistor)
imprinted on the IC. Transistor dimensions are measured in microns (a micron, 1 µm, is a millionth of a meter). Thus we talk about a 0.5 µm IC or say an IC is built in (or with) a 0.5 µm process, meaning that the smallest transistors are 0.5 µm in length. We give a special label, λ (lambda), to this smallest feature size. Since lambda is equal to half of the smallest transistor length, λ ≈ 0.25 µm in a 0.5 µm process. Many of the drawings in this book use a scale marked with lambda for the same reason we place a scale on a map.
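The conversions above (four transistors per gate equivalent, lambda as half the minimum transistor length) can be sketched as two small helpers; the figures used are simply the examples from the text:

```python
def gates_to_transistors(gate_equivalents: int) -> int:
    """A CMOS two-input NAND gate (one gate equivalent) uses 4 transistors."""
    return gate_equivalents * 4

def lambda_from_feature_size(feature_size_um: float) -> float:
    """Lambda is half the smallest transistor length, in microns."""
    return feature_size_um / 2

# A 100 k-gate IC contains the equivalent of 100,000 two-input NAND gates,
# so roughly 400,000 transistors; a 0.5 micron process has lambda = 0.25 micron.
assert gates_to_transistors(100_000) == 400_000
assert lambda_from_feature_size(0.5) == 0.25
```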
A modern submicron CMOS process is now just as complicated as a submicron
bipolar or BiCMOS (a combination of bipolar and CMOS) process. However,
CMOS ICs have established a dominant position, are manufactured in much
greater volume than any other technology, and therefore, because of the economy
of scale, the cost of a CMOS IC is less than that of a bipolar or BiCMOS IC for the same
function. Bipolar and BiCMOS ICs are still used for special needs. For example,
bipolar technology is generally capable of handling higher voltages than CMOS.
This makes bipolar and BiCMOS ICs useful in power electronics, cars, telephone
circuits, and so on.
Some digital logic ICs and their analog counterparts (analog/digital converters,
for example) are standard parts , or standard ICs. You can select standard ICs
from catalogs and data books and buy them from distributors. Systems
manufacturers and designers can use the same standard part in a variety of
different microelectronic systems (systems that use microelectronics or ICs).
With the advent of VLSI in the 1980s engineers began to realize the advantages
of designing an IC that was customized or tailored to a particular system or
application rather than using standard ICs alone. Microelectronic system design
then becomes a matter of defining the functions that you can implement using
standard ICs and then implementing the remaining logic functions (sometimes
called glue logic) with one or more custom ICs. As VLSI became possible you
could build a system from a smaller number of components by combining many
standard ICs into a few custom ICs. Building a microelectronic system with
fewer ICs allows you to reduce cost and improve reliability.
Of course, there are many situations in which it is not appropriate to use a custom
IC for each and every part of a microelectronic system. If you need a large
amount of memory, for example, it is still best to use standard memory ICs,
either dynamic random-access memory ( DRAM or dRAM), or static RAM (
SRAM or sRAM), in conjunction with custom ICs.
One of the first conferences to be devoted to this rapidly emerging segment of the
IC industry was the IEEE Custom Integrated Circuits Conference (CICC), and
the proceedings of this annual conference form a useful reference to the
development of custom ICs. As different types of custom ICs began to evolve for
different types of applications, these new ICs gave rise to a new term:
application-specific IC, or ASIC. Now we have the IEEE International ASIC
Conference , which tracks advances in ASICs separately from other types of
custom ICs. Although the exact definition of an ASIC is difficult, we shall look at
some examples to help clarify what people in the IC industry understand by the
term.
Examples of ICs that are not ASICs include standard parts such as: memory chips
sold as commodity items (ROMs, DRAM, and SRAM); microprocessors; and TTL or
TTL-equivalent ICs at SSI, MSI, and LSI levels.
Examples of ICs that are ASICs include: a chip for a toy bear that talks; a chip
for a satellite; a chip designed to handle the interface between memory and a
microprocessor for a workstation CPU; and a chip containing a microprocessor as
a cell together with other logic.
As a general rule, if you can find it in a data book, then it is probably not an
ASIC, but there are some exceptions. For example, two ICs that might or might
not be considered ASICs are a controller chip for a PC and a chip for a modem.
Both of these examples are specific to an application (shades of an ASIC) but are
sold to many different system vendors (shades of a standard part). ASICs such as
these are sometimes called application-specific standard products ( ASSPs ).
Trying to decide which members of the huge IC family are application-specific is
tricky; after all, every IC has an application. For example, people do not usually
consider an application-specific microprocessor to be an ASIC. I shall describe
how to design an ASIC that may include large cells such as microprocessors, but
I shall not describe the design of the microprocessors themselves. Defining an
ASIC by looking at the application can be confusing, so we shall look at a
different way to categorize the IC family. The easiest way to recognize people is
by their faces and physical characteristics: tall, short, thin. The easiest
characteristics of ASICs to understand are physical ones too, and we shall look at
these next. It is important to understand these differences because they affect
such factors as the price of an ASIC and the way you design an ASIC.
1.1 Types of ASICs
ICs are made on a thin (a few hundred microns thick), circular silicon wafer ,
with each wafer holding hundreds of die (sometimes people use dies or dice for
the plural of die). The transistors and wiring are made from many layers (usually
between 10 and 15 distinct layers) built on top of one another. Each successive
mask layer has a pattern that is defined using a mask similar to a glass
photographic slide. The first half-dozen or so layers define the transistors. The
last half-dozen or so layers define the metal wires between the transistors (the
interconnect ).
A full-custom IC includes some (possibly all) logic cells that are customized and
all mask layers that are customized. A microprocessor is an example of a
full-custom IC; designers spend many hours squeezing the most out of every last
square micron of microprocessor chip space by hand. Customizing all of the IC
features in this way allows designers to include analog circuits, optimized
memory cells, or mechanical structures on an IC, for example. Full-custom ICs
are the most expensive to manufacture and to design. The manufacturing lead
time (the time it takes just to make an IC, not including design time) is typically
eight weeks for a full-custom IC. These specialized full-custom ICs are often
intended for a specific application, so we might call some of them full-custom
ASICs.
We shall discuss full-custom ASICs briefly next, but the members of the IC
family that we are more interested in are semicustom ASICs , for which all of the
logic cells are predesigned and some (possibly all) of the mask layers are
customized. Using predesigned cells from a cell library makes our lives as
designers much, much easier. There are two types of semicustom ASICs that we
shall cover: standard-cell-based ASICs and gate-array-based ASICs. Following
this we shall describe the programmable ASICs , for which all of the logic cells
are predesigned and none of the mask layers are customized. There are two types
of programmable ASICs: the programmable logic device and, the newest member
of the ASIC family, the field-programmable gate array.

1.1.1 Full-Custom ASICs


In a full-custom ASIC an engineer designs some or all of the logic cells, circuits,
or layout specifically for one ASIC. This means the designer abandons the
approach of using pretested and precharacterized cells for all or part of that
design. It makes sense to take this approach only if there are no suitable existing
cell libraries available that can be used for the entire design. This might be
because existing cell libraries are not fast enough, or the logic cells are not small
enough or consume too much power. You may need to use full-custom design if
the ASIC technology is new or so specialized that there are no existing cell
libraries or because the ASIC is so specialized that some circuits must be custom
designed. Fewer and fewer full-custom ICs are being designed because of the
problems with these special parts of the ASIC. There is one growing member of
this family, though, the mixed analog/digital ASIC, which we shall discuss next.
Bipolar technology has historically been used for precision analog functions.
There are some fundamental reasons for this. In all integrated circuits the
matching of component characteristics between chips is very poor, while the
matching of characteristics between components on the same chip is excellent.
Suppose we have transistors T1, T2, and T3 on an analog/digital ASIC. The three
transistors are all the same size and are constructed in an identical fashion.
Transistors T1 and T2 are located adjacent to each other and have the same
orientation. Transistor T3 is the same size as T1 and T2 but is located on the
other side of the chip from T1 and T2 and has a different orientation. ICs are
made in batches called wafer lots. A wafer lot is a group of silicon wafers that are
all processed together. Usually there are between 5 and 30 wafers in a lot. Each
wafer can contain tens or hundreds of chips depending on the size of the IC and
the wafer.
If we were to make measurements of the characteristics of transistors T1, T2, and
T3 we would find the following:
● Transistor T1 will have virtually identical characteristics to T2 on the
same IC. We say that the transistors match well or the tracking between
devices is excellent.
● Transistor T3 will match transistors T1 and T2 on the same IC very well,
but not as closely as T1 matches T2 on the same IC.
● Transistors T1, T2, and T3 will match fairly well with transistors T1, T2,
and T3 on a different IC on the same wafer. The matching will depend on
how far apart the two ICs are on the wafer.
● Transistors on ICs from different wafers in the same wafer lot will not
match very well.
● Transistors on ICs from different wafer lots will match very poorly.
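The hierarchy in the list above can be captured as a simple ordering. The numeric levels below are purely illustrative placeholders for "how much worse", not measured mismatch data:

```python
# Relative matching quality, following the list above.
# Smaller number = better matching (illustrative ranks only, not real sigmas).
mismatch_rank = {
    "adjacent, same orientation, same IC (T1 vs T2)": 1,
    "far apart on the same IC (T3 vs T1/T2)":          2,
    "different IC, same wafer":                        3,
    "different wafer, same wafer lot":                 4,
    "different wafer lot":                             5,
}

# Matching degrades monotonically as devices get "farther apart" in processing.
ranks = list(mismatch_rank.values())
assert ranks == sorted(ranks)
```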

For many analog designs the close matching of transistors is crucial to circuit
operation. For these circuit designs pairs of transistors are used, located adjacent
to each other. Device physics dictates that a pair of bipolar transistors will always
match more precisely than CMOS transistors of a comparable size. Bipolar
technology has historically been more widely used for full-custom analog design
because of its improved precision. Despite its poorer analog properties, the use of
CMOS technology for analog functions is increasing. There are two reasons for
this. The first reason is that CMOS is now by far the most widely available IC
technology. Many more CMOS ASICs and CMOS standard products are now
being manufactured than bipolar ICs. The second reason is that increased levels
of integration require mixing analog and digital functions on the same IC: this
has forced designers to find ways to use CMOS technology to implement analog
functions. Circuit designers, using clever new techniques, have been very
successful in finding new ways to design analog CMOS circuits that can
approach the accuracy of bipolar analog designs.

1.1.2 Standard-Cell-Based ASICs


A cell-based ASIC (cell-based IC, or CBIC, a common term in Japan,
pronounced sea-bick) uses predesigned logic cells (AND gates, OR gates,
multiplexers, and flip-flops, for example) known as standard cells . We could
apply the term CBIC to any IC that uses cells, but it is generally accepted that a
cell-based ASIC or CBIC means a standard-cell-based ASIC.
The standard-cell areas (also called flexible blocks) in a CBIC are built of rows
of standard cells, like a wall built of bricks. The standard-cell areas may be used
in combination with larger predesigned cells, perhaps microcontrollers or even
microprocessors, known as megacells . Megacells are also called megafunctions,
full-custom blocks, system-level macros (SLMs), fixed blocks, cores, or
Functional Standard Blocks (FSBs).
The ASIC designer defines only the placement of the standard cells and the
interconnect in a CBIC. However, the standard cells can be placed anywhere on
the silicon; this means that all the mask layers of a CBIC are customized and are
unique to a particular customer. The advantage of CBICs is that designers save
time and money and reduce risk by using a predesigned, pretested, and
precharacterized standard-cell library . In addition each standard cell can be
optimized individually. During the design of the cell library each and every
transistor in every standard cell can be chosen to maximize speed or minimize
area, for example. The disadvantages are the time or expense of designing or
buying the standard-cell library and the time needed to fabricate all layers of the
ASIC for each new design.
Figure 1.2 shows a CBIC (looking down on the die shown in Figure 1.1b, for
example). The important features of this type of ASIC are as follows:
● All mask layers are customized (transistors and interconnect).

● Custom blocks can be embedded.

● Manufacturing lead time is about eight weeks.


FIGURE 1.2 A cell-based ASIC
(CBIC) die with a single
standard-cell area (a flexible
block) together with four fixed
blocks. The flexible block
contains rows of standard cells.
This is what you might see
through a low-powered
microscope looking down on the
die of Figure 1.1(b). The small
squares around the edge of the die
are bonding pads that are
connected to the pins of the ASIC
package.

Each standard cell in the library is constructed using full-custom design methods,
but you can use these predesigned and precharacterized circuits without having to
do any full-custom design yourself. This design style gives you the same
performance and flexibility advantages of a full-custom ASIC but reduces design
time and reduces risk.
Standard cells are designed to fit together like bricks in a wall. Figure 1.3 shows
an example of a simple standard cell (it is simple in the sense that it is not maximized
for density, but it is ideal for showing you its internal construction). Power and ground
buses (VDD and GND or VSS) run horizontally on metal lines inside the cells.
FIGURE 1.3 Looking down on the layout of a standard cell. This cell would be
approximately 25 microns wide on an ASIC with λ (lambda) = 0.25 microns (a
micron is 10⁻⁶ m). Standard cells are stacked like bricks in a wall; the abutment
box (AB) defines the edges of the brick. The difference between the bounding
box (BB) and the AB is the area of overlap between the bricks. Power supplies
(labeled VDD and GND) run horizontally inside a standard cell on a metal layer
that lies above the transistor layers. Each different shaded and labeled pattern
represents a different layer. This standard cell has center connectors (the three
squares, labeled A1, B1, and Z) that allow the cell to connect to others. The
layout was drawn using ROSE, a symbolic layout editor developed by Rockwell
and Compass, and then imported into Tanner Research's L-Edit.

Standard-cell design allows the automation of the process of assembling an
ASIC. Groups of standard cells fit horizontally together to form rows. The rows
stack vertically to form flexible rectangular blocks (which you can reshape
during design). You may then connect a flexible block built from several rows of
standard cells to other standard-cell blocks or other full-custom logic blocks. For
example, you might want to include a custom interface to a standard, predesigned
microcontroller together with some memory. The microcontroller block may be a
fixed-size megacell, you might generate the memory using a memory compiler,
and the custom logic and memory controller will be built from flexible
standard-cell blocks, shaped to fit in the empty spaces on the chip.
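The row-assembly idea above can be sketched as a toy next-fit placement: given a list of cell widths and a target row width for the flexible block, fill each row in turn. This is only an illustration (a real placer also optimizes wire length and row balance); all names and numbers here are hypothetical:

```python
def pack_into_rows(cell_widths, row_width):
    """Greedy next-fit: append each standard cell to the current row,
    starting a new row when the cell no longer fits."""
    rows, current, used = [], [], 0
    for w in cell_widths:
        if used + w > row_width and current:
            rows.append(current)
            current, used = [], 0
        current.append(w)
        used += w
    if current:
        rows.append(current)
    return rows

# Six cells packed into a flexible block 60 units wide.
rows = pack_into_rows([25, 10, 30, 20, 15, 25], row_width=60)
assert all(sum(r) <= 60 for r in rows)  # no row overflows the block width
```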
Both cell-based and gate-array ASICs use predefined cells, but there is a
difference: we can change the transistor sizes in a standard cell to optimize speed
and performance, but the device sizes in a gate array are fixed. This results in a
trade-off in performance and area in a gate array at the silicon level. The trade-off
between area and performance is made at the library level for a standard-cell
ASIC.
Modern CMOS ASICs use two, three, or more levels (or layers) of metal for
interconnect. This allows wires to cross over different layers in the same way that
we use copper traces on different layers on a printed-circuit board. In a two-level
metal CMOS technology, connections to the standard-cell inputs and outputs are
usually made using the second level of metal ( metal2 , the upper level of metal)
at the tops and bottoms of the cells. In a three-level metal technology,
connections may be internal to the logic cell (as they are in Figure 1.3). This
allows for more sophisticated routing programs to take advantage of the extra
metal layer to route interconnect over the top of the logic cells. We shall cover
the details of routing ASICs in Chapter 17.
A connection that needs to cross over a row of standard cells uses a feedthrough.
The term feedthrough can refer either to the piece of metal that is used to pass a
signal through a cell or to a space in a cell waiting to be used as a feedthrough
(very confusing). Figure 1.4 shows two feedthroughs: one in cell A.14 and one in
cell A.23.
In both two-level and three-level metal technology, the power buses (VDD and
GND) inside the standard cells normally use the lowest (closest to the transistors)
layer of metal ( metal1 ). The width of each row of standard cells is adjusted so
that they may be aligned using spacer cells . The power buses, or rails, are then
connected to additional vertical power rails using row-end cells at the aligned
ends of each standard-cell block. If the rows of standard cells are long, then
vertical power rails can also be run in metal2 through the cell rows using special
power cells that just connect to VDD and GND. Usually the designer manually
controls the number and width of the vertical power rails connected to the
standard-cell blocks during physical design. A diagram of the power distribution
scheme for a CBIC is shown in Figure 1.4.

FIGURE 1.4 Routing the CBIC (cell-based IC) shown in Figure 1.2. The use of
regularly shaped standard cells, such as the one in Figure 1.3, from a library
allows ASICs like this to be designed automatically. This ASIC uses two
separate layers of metal interconnect (metal1 and metal2) running at right angles
to each other (like traces on a printed-circuit board). Interconnections between
logic cells use spaces (called channels) between the rows of cells. ASICs may
have three (or more) layers of metal allowing the cell rows to touch with the
interconnect running over the top of the cells.

All the mask layers of a CBIC are customized. This allows megacells (SRAM, a
SCSI controller, or an MPEG decoder, for example) to be placed on the same IC
with standard cells. Megacells are usually supplied by an ASIC or library
company complete with behavioral models and some way to test them (a test
strategy). ASIC library companies also supply compilers to generate flexible
DRAM, SRAM, and ROM blocks. Since all mask layers on a standard-cell
design are customized, memory design is more efficient and denser than for gate
arrays.
For logic that operates on multiple signals across a data bus (a datapath, or DP), the
use of standard cells may not be the most efficient ASIC design style. Some
ASIC library companies provide a datapath compiler that automatically generates
datapath logic . A datapath library typically contains cells such as adders,
subtracters, multipliers, and simple arithmetic and logical units ( ALUs ). The
connectors of datapath library cells are pitch-matched to each other so that they
fit together. Connecting datapath cells to form a datapath usually, but not always,
results in faster and denser layout than using standard cells or a gate array.
Standard-cell and gate-array libraries may contain hundreds of different logic
cells, including combinational functions (NAND, NOR, AND, OR gates) with
multiple inputs, as well as latches and flip-flops with different combinations of
reset, preset and clocking options. The ASIC library company provides designers
with a data book in paper or electronic form with all of the functional
descriptions and timing information for each library element.
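An electronic data-book entry of the kind described above might be modeled as a small record holding a cell's function and timing. The fields and figures below are hypothetical illustrations, not any vendor's actual library format:

```python
from dataclasses import dataclass

@dataclass
class LibraryCell:
    name: str        # e.g. "NAND2"
    function: str    # Boolean function, as a human-readable string
    inputs: int
    area_gate_equivalents: float
    delay_ns: float  # one nominal delay; real data books give full tables

# A two-input NAND: the unit of the gate-equivalent measure.
nand2 = LibraryCell("NAND2", "Z = (A & B)'", inputs=2,
                    area_gate_equivalents=1.0, delay_ns=0.2)
assert nand2.inputs == 2
```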

1.1.3 Gate-Array-Based ASICs


In a gate array (sometimes abbreviated to GA) or gate-array-based ASIC the
transistors are predefined on the silicon wafer. The predefined pattern of
transistors on a gate array is the base array , and the smallest element that is
replicated to make the base array (like an M. C. Escher drawing, or tiles on a
floor) is the base cell (sometimes called a primitive cell ). Only the top few layers
of metal, which define the interconnect between transistors, are defined by the
designer using custom masks. To distinguish this type of gate array from other
types of gate array, it is often called a masked gate array ( MGA ). The designer
chooses from a gate-array library of predesigned and precharacterized logic cells.
The logic cells in a gate-array library are often called macros . The reason for this
is that the base-cell layout is the same for each logic cell, and only the
interconnect (inside cells and between cells) is customized, so that there is a
similarity between gate-array macros and a software macro. Inside IBM,
gate-array macros are known as books (so that books are part of a library), but
unfortunately this descriptive term is not very widely used outside IBM.
We can complete the diffusion steps that form the transistors and then stockpile
wafers (sometimes we call a gate array a prediffused array for this reason). Since
only the metal interconnections are unique to an MGA, we can use the stockpiled
wafers for different customers as needed. Using wafers prefabricated up to the
metallization steps reduces the time needed to make an MGA, the turnaround
time , to a few days or at most a couple of weeks. The costs for all the initial
fabrication steps for an MGA are shared for each customer and this reduces the
cost of an MGA compared to a full-custom or standard-cell ASIC design.
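The cost-sharing argument above can be made concrete with a toy amortization model: per-part cost is one-time setup cost (NRE) spread over volume, plus a unit cost. All figures below are hypothetical, chosen only to show the shape of the trade-off:

```python
def per_part_cost(nre, unit_cost, volume):
    """Amortize one-time setup (masks, base-wafer fabrication) over volume."""
    return nre / volume + unit_cost

# Hypothetical figures: an MGA customizes only the metal masks, so its NRE
# is lower than a design that customizes all mask layers (CBIC/full-custom).
mga  = per_part_cost(nre=50_000,  unit_cost=5.0, volume=10_000)
cbic = per_part_cost(nre=250_000, unit_cost=4.0, volume=10_000)
assert mga < cbic  # at this volume the shared base array wins
```

At very high volumes the amortized NRE shrinks and the lower unit cost of the fully customized part can win back the advantage, which is why the choice of ASIC style depends on production volume.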
There are three types of MGA or gate-array-based ASICs:
● Channeled gate arrays.

● Channelless gate arrays.


● Structured gate arrays.
The hyphenation of these terms when they are used as adjectives explains their
construction. For example, in the term channeled gate-array architecture, the
gate array is channeled , as will be explained. There are two common ways of
arranging (or arraying) the transistors on an MGA: in a channeled gate array we
leave space between the rows of transistors for wiring; the routing on a
channelless gate array uses rows of unused transistors. The channeled gate array
was the first to be developed, but the channelless gate-array architecture is now
more widely used. A structured (or embedded) gate array can be either channeled
or channelless but it includes (or embeds) a custom block.

1.1.4 Channeled Gate Array


Figure 1.5 shows a channeled gate array . The important features of this type of
MGA are:
● Only the interconnect is customized.

● The interconnect uses predefined spaces between rows of base cells.

● Manufacturing lead time is between two days and two weeks.

FIGURE 1.5 A channeled gate-array die.


The spaces between rows of the base cells
are set aside for interconnect.

A channeled gate array is similar to a CBIC; both use rows of cells separated by
channels used for interconnect. One difference is that the space for interconnect
between rows of cells is fixed in height in a channeled gate array, whereas the
space between rows of cells may be adjusted in a CBIC.

1.1.5 Channelless Gate Array


Figure 1.6 shows a channelless gate array (also known as a channel-free gate
array , sea-of-gates array , or SOG array). The important features of this type of
MGA are as follows:
● Only some (the top few) mask layers are customized (the interconnect).

● Manufacturing lead time is between two days and two weeks.


FIGURE 1.6 A channelless gate-array or
sea-of-gates (SOG) array die. The core
area of the die is completely filled with an
array of base cells (the base array).

The key difference between a channelless gate array and channeled gate array is
that there are no predefined areas set aside for routing between cells on a
channelless gate array. Instead we route over the top of the gate-array devices.
We can do this because we customize the contact layer that defines the
connections between metal1, the first layer of metal, and the transistors. When
we use an area of transistors for routing in a channelless array, we do not make
any contacts to the devices lying underneath; we simply leave the transistors
unused.
The logic density (the amount of logic that can be implemented in a given silicon
area) is higher for channelless gate arrays than for channeled gate arrays. This is
usually attributed to the difference in structure between the two types of array. In
fact, the difference occurs because the contact mask is customized in a
channelless gate array, but is not usually customized in a channeled gate array.
This leads to denser cells in the channelless architectures. Customizing the
contact layer in a channelless gate array allows us to increase the density of
gate-array cells because we can route over the top of unused contact sites.

1.1.6 Structured Gate Array


An embedded gate array or structured gate array (also known as masterslice or
masterimage ) combines some of the features of CBICs and MGAs. One of the
disadvantages of the MGA is the fixed gate-array base cell. This makes the
implementation of memory, for example, difficult and inefficient. In an
embedded gate array we set aside some of the IC area and dedicate it to a specific
function. This embedded area either can contain a different base cell that is more
suitable for building memory cells, or it can contain a complete circuit block,
such as a microcontroller.
Figure 1.7 shows an embedded gate array. The important features of this type of
MGA are the following:
● Only the interconnect is customized.

● Custom blocks (the same for each design) can be embedded.

● Manufacturing lead time is between two days and two weeks.


FIGURE 1.7 A structured or
embedded gate-array die showing
an embedded block in the upper
left corner (a static random-access
memory, for example). The rest of
the die is filled with an array of
base cells.

An embedded gate array gives the improved area efficiency and increased
performance of a CBIC but with the lower cost and faster turnaround of an MGA.
One disadvantage of an embedded gate array is that the embedded function is
fixed. For example, if an embedded gate array contains an area set aside for a 32
k-bit memory, but we only need a 16 k-bit memory, then we may have to waste
half of the embedded memory function. However, this may still be more efficient
and cheaper than implementing a 32 k-bit memory using macros on a SOG array.
ASIC vendors may offer several embedded gate array structures containing
different memory types and sizes as well as a variety of embedded functions.
ASIC companies wishing to offer a wide range of embedded functions must
ensure that enough customers use each different embedded gate array to give the
cost advantages over a custom gate array or CBIC (the Sun Microsystems
SPARCstation 1 described in Section 1.3 made use of LSI Logic embedded gate
arrays, and the 10K and 100K series of embedded gate arrays were two of LSI
Logic's most successful products).

1.1.7 Programmable Logic Devices


Programmable logic devices ( PLDs ) are standard ICs that are available in
standard configurations from a catalog of parts and are sold in very high volume
to many different customers. However, PLDs may be configured or programmed
to create a part customized to a specific application, and so they also belong to
the family of ASICs. PLDs use different technologies to allow programming of
the device. Figure 1.8 shows a PLD and the following important features that all
PLDs have in common:
● No customized mask layers or logic cells

● Fast design turnaround

● A single large block of programmable interconnect

● A matrix of logic macrocells that usually consist of programmable array
logic followed by a flip-flop or latch
FIGURE 1.8 A programmable
logic device (PLD) die. The
macrocells typically consist of
programmable array logic
followed by a flip-flop or latch.
The macrocells are connected
using a large programmable
interconnect block.

The simplest type of programmable IC is a read-only memory ( ROM ). The most
common types of ROM use a metal fuse that can be blown permanently (a
programmable ROM or PROM ). An electrically programmable ROM , or
EPROM , uses programmable MOS transistors whose characteristics are altered
by applying a high voltage. You can erase an EPROM either by using another
high voltage (an electrically erasable PROM , or EEPROM ) or by exposing the
device to ultraviolet light ( UV-erasable PROM , or UVPROM ).
There is another type of ROM that can be placed on any ASIC: a
mask-programmable ROM (mask-programmed ROM or masked ROM). A
masked ROM is a regular array of transistors permanently programmed using
custom mask patterns. An embedded masked ROM is thus a large, specialized,
logic cell.
The same programmable technologies used to make ROMs can be applied to
more flexible logic structures. By using the programmable devices in a large
array of AND gates and an array of OR gates, we create a family of flexible and
programmable logic devices called logic arrays . The company Monolithic
Memories (bought by AMD) was the first to produce Programmable Array Logic
(PAL ® , a registered trademark of AMD) devices that you can use, for example,
as transition decoders for state machines. A PAL can also include registers
(flip-flops) to store the current state information so that you can use a PAL to
make a complete state machine.
Just as we have a mask-programmable ROM, we could place a logic array as a
cell on a custom ASIC. This type of logic array is called a programmable logic
array (PLA). There is a difference between a PAL and a PLA: a PLA has a
programmable AND logic array, or AND plane , followed by a programmable
OR logic array, or OR plane ; a PAL has a programmable AND plane and, in
contrast to a PLA, a fixed OR plane.
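The AND-plane/OR-plane distinction is easy to sketch in software. The following Python fragment is purely illustrative (the plane encodings and function names are our own, not from any vendor's tools): it programs both planes of a tiny PLA; fixing the or_plane connections at manufacture would turn it into a PAL.

```python
# Sketch of two-level (AND-OR) programmable logic, as in a PLA.
# Each product term is a dict mapping input name -> required value (0 or 1);
# the OR plane selects which product terms feed each output.

def eval_product(term, inputs):
    """AND plane: a product term is true when every programmed literal matches."""
    return all(inputs[name] == val for name, val in term.items())

def eval_pla(and_plane, or_plane, inputs):
    """OR plane: each output ORs its selected product terms."""
    products = [eval_product(t, inputs) for t in and_plane]
    return {out: any(products[i] for i in rows) for out, rows in or_plane.items()}

# Program F = A.B + A'.C. Both planes are programmable, so this is a PLA;
# a PAL would have the or_plane connections fixed at manufacture.
and_plane = [{'A': 1, 'B': 1}, {'A': 0, 'C': 1}]
or_plane = {'F': [0, 1]}

print(eval_pla(and_plane, or_plane, {'A': 1, 'B': 1, 'C': 0}))  # {'F': True}
```
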
Depending on how the PLD is programmed, we can have an erasable PLD
(EPLD), or mask-programmed PLD (sometimes called a masked PLD but usually
just PLD). The first PALs, PLAs, and PLDs were based on bipolar technology
and used programmable fuses or links. CMOS PLDs usually employ
floating-gate transistors (see Section 4.3, EPROM and EEPROM Technology).
1.1.8 Field-Programmable Gate Arrays
A step above the PLD in complexity is the field-programmable gate array (
FPGA ). There is very little difference between an FPGA and a PLD; an FPGA is
usually just larger and more complex than a PLD. In fact, some companies that
manufacture programmable ASICs call their products FPGAs and some call them
complex PLDs . FPGAs are the newest member of the ASIC family and are
rapidly growing in importance, replacing TTL in microelectronic systems. Even
though an FPGA is a type of gate array, we do not consider the term gate-array
based ASICs to include FPGAs. This may change as FPGAs and MGAs start to
look more alike.
Figure 1.9 illustrates the essential characteristics of an FPGA:
● None of the mask layers are customized.

● A method for programming the basic logic cells and the interconnect.

● The core is a regular array of programmable basic logic cells that can
implement combinational as well as sequential logic (flip-flops).
● A matrix of programmable interconnect surrounds the basic logic cells.

● Programmable I/O cells surround the core.

● Design turnaround is a few hours.

We shall examine these features in detail in Chapters 4 to 8.

FIGURE 1.9 A field-programmable
gate array (FPGA) die. All FPGAs
contain a regular structure of
programmable basic logic cells
surrounded by programmable
interconnect. The exact type, size,
and number of the programmable
basic logic cells varies
tremendously.
1.2 Design Flow
Figure 1.10 shows the sequence of steps to design an ASIC; we call this a design
flow . The steps are listed below (numbered to correspond to the labels in
Figure 1.10) with a brief description of the function of each step.

FIGURE 1.10 ASIC design flow.


1. Design entry. Enter the design into an ASIC design system, either using a
hardware description language ( HDL ) or schematic entry .
2. Logic synthesis. Use an HDL (VHDL or Verilog) and a logic synthesis
tool to produce a netlist, a description of the logic cells and their
connections.
3. System partitioning. Divide a large system into ASIC-sized pieces.
4. Prelayout simulation. Check to see if the design functions correctly.
5. Floorplanning. Arrange the blocks of the netlist on the chip.
6. Placement. Decide the locations of cells in a block.
7. Routing. Make the connections between cells and blocks.
8. Extraction. Determine the resistance and capacitance of the interconnect.
9. Postlayout simulation. Check to see the design still works with the added loads of the interconnect.
1.5 ASIC Cell Libraries
The cell library is the key part of ASIC design. For a programmable ASIC the
FPGA company supplies you with a library of logic cells in the form of a design
kit; you normally do not have a choice, and the cost is usually a few thousand
dollars. For MGAs and CBICs you have three choices: the ASIC vendor (the
company that will build your ASIC) will supply a cell library, or you can buy a
cell library from a third-party library vendor , or you can build your own cell
library.
The first choice, using an ASIC-vendor library , requires you to use a set of
design tools approved by the ASIC vendor to enter and simulate your design.
You have to buy the tools, and the cost of the cell library is folded into the NRE.
Some ASIC vendors (especially for MGAs) supply tools that they have
developed in-house. For some reason the more common model in Japan is to use
tools supplied by the ASIC vendor, but in the United States, Europe, and
elsewhere designers want to choose their own tools. Perhaps this has to do with
the relationship between customer and supplier being a lot closer in Japan than it
is elsewhere.
An ASIC vendor library is normally a phantom library: the cells are empty boxes,
or phantoms, but they contain enough information for layout (for example, you would
only see the bounding box or abutment box in a phantom version of the cell in
Figure 1.3). After you complete layout you hand off a netlist to the ASIC vendor,
who fills in the empty boxes ( phantom instantiation ) before manufacturing your
chip.
The second and third choices require you to make a buy-or-build decision . If you
complete an ASIC design using a cell library that you bought, you also own the
masks (the tooling ) that are used to manufacture your ASIC. This is called
customer-owned tooling ( COT , pronounced see-oh-tee). A library vendor
normally develops a cell library using information about a process supplied by an
ASIC foundry . An ASIC foundry (in contrast to an ASIC vendor) only provides
manufacturing, with no design help. If the cell library meets the foundry
specifications, we call this a qualified cell library . These cell libraries are
normally expensive (possibly several hundred thousand dollars), but if a library is
qualified at several foundries this allows you to shop around for the most
attractive terms. This means that buying an expensive library can be cheaper in
the long run than the other solutions for high-volume production.
The third choice is to develop a cell library in-house. Many large computer and
electronics companies make this choice. Most of the cell libraries designed today
are still developed in-house despite the fact that the process of library
development is complex and very expensive.
However created, each cell in an ASIC cell library must contain the following:
● A physical layout

● A behavioral model

● A Verilog/VHDL model

● A detailed timing model

● A test strategy

● A circuit schematic

● A cell icon

● A wire-load model

● A routing model

For MGA and CBIC cell libraries we need to complete cell design and cell layout
and shall discuss this in Chapter 2. The ASIC designer may not actually see the
layout if it is hidden inside a phantom, but the layout will be needed eventually.
In a programmable ASIC the cell layout is part of the programmable ASIC
design (see Chapter 4).
The ASIC designer needs a high-level, behavioral model for each cell because
simulation at the detailed timing level takes too long for a complete ASIC design.
For a NAND gate a behavioral model is simple. A multiport RAM model can be
very complex. We shall discuss behavioral models when we describe Verilog and
VHDL in Chapter 10 and Chapter 11. The designer may require Verilog and
VHDL models in addition to the models for a particular logic simulator.
ASIC designers also need a detailed timing model for each cell to determine the
performance of the critical pieces of an ASIC. It is too difficult, too
time-consuming, and too expensive to build every cell in silicon and measure the
cell delays. Instead library engineers simulate the delay of each cell, a process
known as characterization . Characterizing a standard-cell or gate-array library
involves circuit extraction from the full-custom cell layout for each cell. The
extracted schematic includes all the parasitic resistance and capacitance elements.
Then library engineers perform a simulation of each cell including the parasitic
elements to determine the switching delays. The simulation models for the
transistors are derived from measurements on special chips included on a wafer
called process control monitors ( PCMs ) or drop-ins . Library engineers then use
the results of the circuit simulation to generate detailed timing models for logic
simulation. We shall cover timing models in Chapter 13.
All ASICs need to be production tested (programmable ASICs may be tested by
the manufacturer before they are customized, but they still need to be tested).
Simple cells in small or medium-size blocks can be tested using automated
techniques, but large blocks such as RAM or multipliers need a planned strategy.
We shall discuss test in Chapter 14.
The cell schematic (a netlist description) describes each cell so that the cell
designer can perform simulation for complex cells. You may not need the
detailed cell schematic for all cells, but you need enough information to compare
what you think is on the silicon (the schematic) with what is actually on the
silicon (the layout); this is a layout versus schematic ( LVS ) check.
If the ASIC designer uses schematic entry, each cell needs a cell icon together
with connector and naming information that can be used by design tools from
different vendors. We shall cover ASIC design using schematic entry in
Chapter 9. One of the advantages of using logic synthesis (Chapter 12) rather
than schematic design entry is eliminating the problems with icons, connectors,
and cell names. Logic synthesis also makes moving an ASIC between different
cell libraries, or retargeting , much easier.
In order to estimate the parasitic capacitance of wires before we actually
complete any routing, we need a statistical estimate of the capacitance for a net in
a given size circuit block. This usually takes the form of a look-up table known as
a wire-load model . We also need a routing model for each cell. Large cells are
too complex for the physical design or layout tools to handle directly and we
need a simpler representation: a phantom of the physical layout that still contains
all the necessary information. The phantom may include information that tells the
automated routing tool where it can and cannot place wires over the cell, as well
as the location and types of the connections to the cell.
1.6 Summary
In this chapter we have looked at the difference between full-custom ASICs,
semi-custom ASICs, and programmable ASICs. Table 1.3 summarizes their
different features. ASICs use a library of predesigned and precharacterized logic
cells. In fact, we could define an ASIC as a design style that uses a cell library
rather than in terms of what an ASIC is or what an ASIC does.
TABLE 1.3 Types of ASIC.

ASIC type      Family member                           Custom mask layers   Custom logic cells
Full-custom    Analog/digital                          All                  Some
Semicustom     Cell-based (CBIC)                       All                  None
Semicustom     Masked gate array (MGA)                 Some                 None
Programmable   Field-programmable gate array (FPGA)    None                 None
Programmable   Programmable logic device (PLD)         None                 None

You can think of ICs like pizza. A full-custom pizza is built from scratch. You
can customize all the layers of a CBIC pizza, but from a predefined selection, and
it takes a while to cook. An MGA pizza uses precooked crusts with fixed sizes
and you choose only from a few different standard types on a menu. This makes
MGA pizza a little faster to cook and a little cheaper. An FPGA is rather like a
frozen pizza: you buy it at the supermarket in a limited selection of sizes and
types, but you can put it in the microwave at home and it will be ready in a few
minutes.
In each chapter we shall indicate the key concepts. In this chapter they are
● The difference between full-custom and semicustom ASICs

● The difference between standard-cell, gate-array, and programmable ASICs
● The ASIC design flow

● Design economics including part cost, NRE, and breakeven volume

● The contents and use of an ASIC cell library

Next, in Chapter 2, we shall take a closer look at the semicustom ASICs that
were introduced in this chapter.

CMOS LOGIC
A CMOS transistor (or device) has four terminals: gate , source , drain , and a
fourth terminal that we shall ignore until the next section. A CMOS transistor is a
switch. The switch must be conducting or on to allow current to flow between the
source and drain terminals (using open and closed for switches is confusing; for
the same reason we say a tap is on and not that it is closed ). The transistor source
and drain terminals are equivalent as far as digital signals are concerned; we do
not worry about labeling an electrical switch with two terminals.
● V_AB is the potential difference, or voltage, between nodes A and B in a
circuit; V_AB is positive if node A is more positive than node B.


● Italics denote variables; constants are set in roman (upright) type.
Uppercase letters denote DC, large-signal, or steady-state voltages.
● For TTL the positive power supply is called VCC (V_CC). The 'C'
denotes that the supply is connected indirectly to the collectors of the npn
bipolar transistors (a bipolar transistor has a collector, base, and emitter
corresponding roughly to the drain, gate, and source of an MOS
transistor).
● Following the example of TTL we used VDD (V_DD) to denote
the positive supply in an NMOS chip where the devices are all n -channel
transistors and the drains of these devices are connected indirectly to the
positive supply. The supply nomenclature for NMOS chips has stuck for
CMOS.
● VDD is the name of the power supply node or net; V_DD represents the
value (uppercase since V_DD is a DC quantity). Since V_DD is a variable, it
is italic (words and multiletter abbreviations use roman; thus it is V_DD, but
V_drain).
● Logic designers often call the CMOS negative supply VSS or V_SS even if
it is actually ground or GND. I shall use VSS for the node and V_SS for the
value.
● CMOS uses positive logic: VDD is logic '1' and VSS is logic '0'.
We turn a transistor on or off using the gate terminal. There are two kinds of
CMOS transistors: n -channel transistors and p -channel transistors. An n
-channel transistor requires a logic '1' (from now on I'll just say a '1') on the gate
to make the switch conducting (to turn the transistor on ). A p -channel transistor
requires a logic '0' (again, from now on I'll just say a '0') on the gate to make the
switch conducting (to turn the transistor on ). The p -channel transistor
symbol has a bubble on its gate to remind us that the gate has to be a '0' to turn
the transistor on . All this is shown in Figure 2.1(a) and (b).

FIGURE 2.1 CMOS transistors as switches. (a) An n -channel transistor. (b) A p
-channel transistor. (c) A CMOS inverter and its symbol (an equilateral triangle
and a circle ).

If we connect an n -channel transistor in series with a p -channel transistor, as
shown in Figure 2.1(c), we form an inverter . With four transistors we can form a
two-input NAND gate (Figure 2.2a). We can also make a two-input NOR gate
(Figure 2.2b). Logic designers normally use the terms NAND gate and logic gate
(or just gate), but I shall try to use the terms NAND cell and logic cell rather than
NAND gate or logic gate in this chapter to avoid any possible confusion with the
gate terminal of a transistor.
FIGURE 2.2 CMOS logic. (a) A two-input NAND logic cell. (b) A two-input
NOR logic cell. The n -channel and p -channel transistor switches implement the
'1's and '0's of a Karnaugh map.
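We can model the pull-up and pull-down switch networks of Figure 2.2 directly in software. The Python sketch below is our own illustration (not from the book): it evaluates each network as a set of series or parallel switches and checks that exactly one network conducts for every input combination.

```python
# Switch-level sketch of CMOS logic cells (illustrative only).
# An n-channel switch conducts when its gate is '1'; a p-channel switch
# conducts when its gate is '0'.

def nand2(a, b):
    pull_down = (a == 1) and (b == 1)   # two n-channel switches in series to VSS
    pull_up   = (a == 0) or (b == 0)    # two p-channel switches in parallel to VDD
    assert pull_up != pull_down          # exactly one network conducts: no contention
    return 1 if pull_up else 0

def nor2(a, b):
    pull_down = (a == 1) or (b == 1)    # n-channel switches in parallel
    pull_up   = (a == 0) and (b == 0)   # p-channel switches in series
    assert pull_up != pull_down
    return 1 if pull_up else 0

for a in (0, 1):
    for b in (0, 1):
        print(a, b, nand2(a, b), nor2(a, b))
```

The complementary structure of the two networks is what guarantees that the output is always driven to VDD or VSS, never both.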

2.1 CMOS Transistors


2.2 The CMOS Process
2.3 CMOS Design Rules
2.4 Combinational Logic Cells
2.5 Sequential Logic Cells
2.6 Datapath Logic Cells
2.7 I/O Cells
2.8 Cell Compilers
2.9 Summary
2.10 Problems
2.11 Bibliography
2.12 References
2.6 Datapath Logic Cells
Suppose we wish to build an n -bit adder (that adds two n -bit numbers) and to exploit
the regularity of this function in the layout. We can do so using a datapath structure.
The following two functions, SUM and COUT, implement the sum and carry out for a
full adder ( FA ) with two data inputs (A, B) and a carry in, CIN:
SUM = A ⊕ B ⊕ CIN = SUM(A, B, CIN) = PARITY(A, B, CIN) , (2.38)

COUT = A · B + A · CIN + B · CIN = MAJ(A, B, CIN). (2.39)

The sum uses the parity function ('1' if there is an odd number of '1's in the inputs).
The carry out, COUT, uses the 2-of-3 majority function ('1' if the majority of the inputs
are '1'). We can combine these two functions in a single FA logic cell, ADD(A[ i ], B[ i
], CIN, S[ i ], COUT), shown in Figure 2.20(a), where
S[ i ] = SUM (A[ i ], B[ i ], CIN) , (2.40)

COUT = MAJ (A[ i ], B[ i ], CIN) . (2.41)

Now we can build a 4-bit ripple-carry adder ( RCA ) by connecting four of these ADD
cells together as shown in Figure 2.20(b). The i th ADD cell is arranged with the
following: two bus inputs A[ i ], B[ i ]; one bus output S[ i ]; an input, CIN, that is the
carry in from stage ( i - 1) below and is also passed up to the cell above as an output;
and an output, COUT, that is the carry out to stage ( i + 1) above. In the 4-bit adder
shown in Figure 2.20(b) we connect the carry input, CIN[0], to VSS and use COUT[3]
and COUT[2] to indicate arithmetic overflow (in Section 2.6.1 we shall see why we
may need both signals). Notice that we build the ADD cell so that COUT[2] is
available at the top of the datapath when we need it.
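A bit-level sketch of the ADD cell and the 4-bit RCA may help. The Python below is illustrative only; the function names follow Eqs. 2.40 and 2.41, but the code is not from any cell library.

```python
# Bit-level sketch of the ADD cell (Eqs. 2.40-2.41) and a 4-bit ripple-carry
# adder built from it.

def parity(*bits):
    """SUM: '1' if there is an odd number of '1's among the inputs."""
    return sum(bits) % 2

def maj(a, b, cin):
    """COUT: the 2-of-3 majority function."""
    return (a & b) | (a & cin) | (b & cin)

def ripple_carry_add(a, b):
    """Add two bit-vectors (lists, LSB first); CIN[0] is tied to 0 (VSS)."""
    s, cout, carry = [], [], 0
    for ai, bi in zip(a, b):
        s.append(parity(ai, bi, carry))
        carry = maj(ai, bi, carry)
        cout.append(carry)
    return s, cout

s, cout = ripple_carry_add([1, 0, 1, 0], [1, 1, 0, 0])   # 5 + 3, LSB first
print(s)      # [0, 0, 0, 1] -- the sum, 8, LSB first
print(cout)   # [1, 1, 1, 0] -- COUT[3] and COUT[2] feed the overflow check
```
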
Figure 2.20(c) shows a layout of the ADD cell. The A inputs, B inputs, and S outputs
all use m1 interconnect running in the horizontal direction; we call these data signals.
Other signals can enter or exit from the top or bottom and run vertically across the
datapath in m2; we call these control signals. We can also use m1 for control and m2 for
data, but we normally do not mix these approaches in the same structure. Control
signals are typically clocks and other signals common to elements. For example, in
Figure 2.20(c) the carry signals, CIN and COUT, run vertically in m2 between cells. To
build a 4-bit adder we stack four ADD cells creating the array structure shown in
Figure 2.20(d). In this case the A and B data bus inputs enter from the left and bus S,
the sum, exits at the right, but we can connect A, B, and S to either side if we want.
The layout of buswide logic that operates on data signals in this fashion is called a
datapath . The module ADD is a datapath cell or datapath element . Just as we do for
standard cells we make all the datapath cells in a library the same height so we can abut
other datapath cells on either side of the adder to create a more complex datapath.
When people talk about a datapath they always assume that it is oriented so that
increasing the size in bits makes the datapath grow in height, upwards in the vertical
direction, and adding different datapath elements to increase the function makes the
datapath grow in width, in the horizontal direction; but we can rotate and position a
completed datapath in any direction we want on a chip.

FIGURE 2.20 A datapath adder. (a) A full-adder (FA) cell with inputs (A and B), a
carry in, CIN, sum output, S, and carry out, COUT. (b) A 4-bit adder. (c) The layout,
using two-level metal, with data in m1 and control in m2. In this example the wiring is
completed outside the cell; it is also possible to design the datapath cells to contain the
wiring. Using three levels of metal, it is possible to wire over the top of the datapath
cells. (d) The datapath layout.

What is the difference between using a datapath, standard cells, or gate arrays? Cells
are placed together in rows on a CBIC or an MGA, but there is generally no
regularity to the arrangement of the cells within the rows; we let software arrange the
cells and complete the interconnect. Datapath layout automatically takes care of most
of the interconnect between the cells with the following advantages:
● Regular layout produces predictable and equal delay for each bit.

● Interconnect between cells can be built into each cell.

There are some disadvantages of using a datapath:


● The overhead (buffering and routing the control signals, for example) can make a
narrow (small number of bits) datapath larger and slower than a standard-cell (or
even gate-array) implementation.
● Datapath cells have to be predesigned (otherwise we are using full-custom
design) for use in a wide range of datapath sizes. Datapath cell design can be
harder than designing gate-array macros or standard cells.
● Software to assemble a datapath is more complex and not as widely used as
software for assembling standard cells or gate arrays.
There are some newer standard-cell and gate-array tools that can take advantage of
regularity in a design and position cells carefully. The problem is in finding the
regularity if it is not specified. Using a datapath is one way to specify regularity to
ASIC design tools.
2.6.1 Datapath Elements
Figure 2.21 shows some typical datapath symbols for an adder (people rarely use the
IEEE standards in ASIC datapath libraries). I use heavy lines (they are 1.5 point wide)
with a stroke to denote a data bus (that flows in the horizontal direction in a datapath),
and regular lines (0.5 point) to denote the control signals (that flow vertically in a
datapath). At the risk of adding confusion where there is none, this stroke to indicate a
data bus has nothing to do with mixed-logic conventions. For a bus, A[31:0] denotes a
32-bit bus with A[31] as the leftmost or most-significant bit or MSB , and A[0] as the
least-significant bit or LSB . Sometimes we shall use A[MSB] or A[LSB] to refer to
these bits. Notice that if we have an n -bit bus and LSB = 0, then MSB = n - 1. Also, for
example, A[4] is the fifth bit on the bus (from the LSB). We use an 'S' or 'ADD' inside
the symbol to denote an adder instead of '+', so we can attach '-' or '+/-' to the inputs for
a subtracter or adder/subtracter.

FIGURE 2.21 Symbols for a datapath adder. (a) A data bus is shown by a heavy line
(1.5 point) and a bus symbol. If the bus is n -bits wide then MSB = n - 1. (b) An
alternative symbol for an adder. (c) Control signals are shown as lightweight (0.5
point) lines.

Some schematic datapath symbols include only data signals and omit the control
signals, but we must not forget them. In Figure 2.21, for example, we may need to
explicitly tie CIN[0] to VSS and use COUT[MSB] and COUT[MSB - 1] to detect
overflow. Why might we need both of these control signals? Table 2.11 shows the
process of simple arithmetic for the different binary number representations, including
unsigned, signed magnitude, one's complement, and two's complement.
TABLE 2.11 Binary arithmetic.

Representation rule:
  Unsigned: no change
  Signed magnitude: if positive then MSB = 0 else MSB = 1
  One's complement: if negative then flip bits
  Two's complement: if negative then {flip bits; add 1}

3 =
  Unsigned: 0011; Signed magnitude: 0011; One's complement: 0011; Two's complement: 0011

-3 =
  Unsigned: NA; Signed magnitude: 1011; One's complement: 1100; Two's complement: 1101

zero =
  Unsigned: 0000; Signed magnitude: 0000 or 1000; One's complement: 1111 or 0000; Two's complement: 0000

max. positive =
  Unsigned: 1111 = 15; Signed magnitude: 0111 = 7; One's complement: 0111 = 7; Two's complement: 0111 = 7

max. negative =
  Unsigned: 0000 = 0; Signed magnitude: 1111 = -7; One's complement: 1000 = -7; Two's complement: 1000 = -8

addition, S = A + B = addend + augend (SG(A) = sign of A):
  Unsigned: S = A + B
  Signed magnitude: if SG(A) = SG(B) then S = A + B else {if B < A then S = A - B else S = B - A}
  One's complement: S = A + B + COUT[MSB], where COUT is carry out
  Two's complement: S = A + B

addition result (OV = overflow, OR = out of range):
  Unsigned: OR = COUT[MSB], where COUT is carry out
  Signed magnitude: if SG(A) = SG(B) then OV = COUT[MSB] else OV = 0 (overflow is impossible)
  One's complement: OV = XOR(COUT[MSB], COUT[MSB - 1])
  Two's complement: OV = XOR(COUT[MSB], COUT[MSB - 1])

sign of S after addition:
  Unsigned: NA
  Signed magnitude: if SG(A) = SG(B) then SG(S) = SG(A) else {if B < A then SG(S) = SG(A) else SG(S) = SG(B)}
  One's complement: NA
  Two's complement: NA

subtraction, D = A - B = minuend - subtrahend:
  Unsigned: D = A - B
  Signed magnitude: SG(B) = NOT(SG(B)); D = A + B
  One's complement: Z = -B (negate); D = A + Z
  Two's complement: Z = -B (negate); D = A + Z

subtraction result (OV = overflow, OR = out of range):
  Unsigned: OR = BOUT[MSB], where BOUT is borrow out
  Signed magnitude: as in addition
  One's complement: as in addition
  Two's complement: as in addition

negation, Z = -A (negate):
  Unsigned: NA
  Signed magnitude: Z = A; SG(Z) = NOT(SG(A))
  One's complement: Z = NOT(A)
  Two's complement: Z = NOT(A) + 1

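The two's-complement overflow rule in the last column of Table 2.11 is easy to check in software. The Python sketch below is our own illustration: it adds two 4-bit two's-complement values and forms OV = XOR(COUT[MSB], COUT[MSB - 1]).

```python
# Illustrative check of the two's-complement overflow rule from Table 2.11:
# OV = XOR(COUT[MSB], COUT[MSB-1]) for a 4-bit addition.

def add4_tc(a, b):
    """Add two 4-bit two's-complement values; return (sum bits LSB first, OV)."""
    carry, couts, s = 0, [], []
    for i in range(4):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        s.append(ai ^ bi ^ carry)
        carry = (ai & bi) | (ai & carry) | (bi & carry)
        couts.append(carry)
    ov = couts[3] ^ couts[2]            # XOR of the two top carry-outs
    return s, ov

# 7 + 1 overflows in 4 bits (max positive is 7)...
print(add4_tc(7, 1))        # ([0, 0, 0, 1], 1): result bits read as -8, OV = 1
# ...while 3 + 2 does not.
print(add4_tc(3, 2))        # ([1, 0, 1, 0], 0): result is 5, OV = 0
```
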
2.6.2 Adders
We can view addition in terms of generate , G[ i ], and propagate , P[ i ], signals.
method 1                            method 2
G[i] = A[i] · B[i]                  G[i] = A[i] · B[i]                  (2.42)
P[i] = A[i] ⊕ B[i]                  P[i] = A[i] + B[i]                  (2.43)
C[i] = G[i] + P[i] · C[i - 1]       C[i] = G[i] + P[i] · C[i - 1]       (2.44)
S[i] = P[i] ⊕ C[i - 1]              S[i] = A[i] ⊕ B[i] ⊕ C[i - 1]       (2.45)

where C[ i ] is the carry-out signal from stage i , equal to the carry in of stage ( i + 1).
Thus, C[ i ] = COUT[ i ] = CIN[ i + 1]. We need to be careful because C[0] might
represent either the carry in or the carry out of the LSB stage. For an adder we set the
carry in to the first stage (stage zero), C[-1] or CIN[0], to '0'. Some people use delete
(D) or kill (K) in various ways for the complements of G[i] and P[i], but unfortunately
others use C for COUT and D for CIN, so I avoid using any of these. Do not confuse the
two different methods (both of which are used) in Eqs. 2.42-2.45 when forming the
sum, since the propagate signal, P[ i ] , is different for each method.
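We can confirm that the two methods give identical carries and sums even though P[i] differs. The Python check below is illustrative only:

```python
# Illustrative check that the two generate/propagate methods of Eqs. 2.42-2.45
# produce the same carry and sum, even though P[i] differs between them.

def step_method1(a, b, c_prev):
    g, p = a & b, a ^ b                 # method 1: P is XOR
    c = g | (p & c_prev)
    s = p ^ c_prev                      # sum may reuse this P
    return s, c

def step_method2(a, b, c_prev):
    g, p = a & b, a | b                 # method 2: P is OR
    c = g | (p & c_prev)
    s = a ^ b ^ c_prev                  # sum must NOT use this P
    return s, c

for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            assert step_method1(a, b, c) == step_method2(a, b, c)
print("both methods agree on S and C")
```
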
Figure 2.22(a) shows a conventional RCA. The delay of an n -bit RCA is proportional
to n and is limited by the propagation of the carry signal through all of the stages. We
can reduce delay by using pairs of go-faster bubbles to change AND and OR gates to
fast two-input NAND gates as shown in Figure 2.22(a). Alternatively, we can write the
equations for the carry signal in two different ways:
either C[ i ] = A[ i ] · B[ i ] + P[ i ] · C[ i - 1] (2.46)
or C[ i ] = (A[ i ] + B[ i ]) · (P[ i ]' + C[ i - 1]), (2.47)

where P[ i ]'= NOT(P[ i ]). Equations 2.46 and 2.47 allow us to build the carry chain
from two-input NAND gates, one per cell, using different logic in even and odd stages
(Figure 2.22b):
even stages                              odd stages
C1[i]' = P[i] · C3[i - 1] · C4[i - 1]    C3[i]' = P[i] · C1[i - 1] · C2[i - 1]   (2.48)
C2[i] = A[i] + B[i]                      C4[i]' = A[i] · B[i]                    (2.49)
C[i] = C1[i] · C2[i]                     C[i] = C3[i]' + C4[i]'                  (2.50)

(the carry inputs to stage zero are C3[-1] = C4[-1] = '0'). We can use the RCA of
Figure 2.22(b) in a datapath, with standard cells, or on a gate array.
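A quick exhaustive check (in Python, purely illustrative) confirms that the two carry forms of Eqs. 2.46 and 2.47 are logically equivalent:

```python
# Check that C = A.B + P.C_prev (Eq. 2.46) equals (A + B).(P' + C_prev)
# (Eq. 2.47), where P = A XOR B. Bits are represented as the ints 0 and 1.

def carry_and_or(a, b, c_prev):
    p = a ^ b
    return (a & b) | (p & c_prev)

def carry_or_and(a, b, c_prev):
    p = a ^ b
    return (a | b) & ((1 - p) | c_prev)   # (1 - p) is P' for 0/1-valued bits

for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            assert carry_and_or(a, b, c) == carry_or_and(a, b, c)
print("Eq. 2.46 == Eq. 2.47 for all inputs")
```
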
Instead of propagating the carries through each stage of an RCA, Figure 2.23 shows a
different approach. A carry-save adder ( CSA ) cell CSA(A1[ i ], A2[ i ], A3[ i ], CIN,
S1[ i ], S2[ i ], COUT) has three outputs:

S1[ i ] = CIN , (2.51)
S2[ i ] = A1[ i ] ⊕ A2[ i ] ⊕ A3[ i ] = PARITY(A1[ i ], A2[ i ], A3[ i ]) , (2.52)
COUT = A1[ i ] · A2[ i ] + [(A1[ i ] + A2[ i ]) · A3[ i ]] = MAJ(A1[ i ], A2[ i ], A3[ i ]) . (2.53)

The inputs, A1, A2, and A3, and the outputs, S1 and S2, are buses. The input, CIN, is the
carry from stage ( i - 1). The carry in, CIN, is connected directly to the output bus S1,
as indicated by the schematic symbol (Figure 2.23a). We connect CIN[0] to VSS. The
output, COUT, is the carry out to stage ( i + 1).
A 4-bit CSA is shown in Figure 2.23(b). The arithmetic overflow signal for one's
complement or two's complement arithmetic, OV, is XOR(COUT[MSB], COUT[MSB
- 1]) as shown in Figure 2.23(c). In a CSA the carries are saved at each stage and
shifted left onto the bus S1. There is thus no carry propagation and the delay of a CSA
is constant. At the output of a CSA we still need to add the S1 bus (all the saved
carries) and the S2 bus (all the sums) to get an n -bit result using a final stage that is not
shown in Figure 2.23(c). We might regard the n -bit sum as being encoded in the two
buses, S1 and S2, in the form of the parity and majority functions.
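A bit-level sketch of one CSA stage shows the saved carries shifting left onto S1. The Python below is our own illustration of Eqs. 2.51-2.53, not library code:

```python
# Sketch of a carry-save stage: each bit position reduces three input bits to a
# sum bit (PARITY) and a carry bit (MAJ), with no carry propagation along
# the word; the carries are simply shifted left onto S1.

def csa(a1, a2, a3, n=4):
    """Inputs are n-bit integers; returns (s2, s1): the sums and shifted carries."""
    s2 = s1 = 0
    for i in range(n):
        b1, b2, b3 = (a1 >> i) & 1, (a2 >> i) & 1, (a3 >> i) & 1
        s2 |= (b1 ^ b2 ^ b3) << i                        # PARITY: the saved sums
        s1 |= ((b1 & b2) | ((b1 | b2) & b3)) << (i + 1)  # MAJ: carries, shifted left
    return s2, s1

s2, s1 = csa(5, 3, 6)
print(s2, s1, s2 + s1)   # a final carry-propagate add gives 5 + 3 + 6 = 14
```

The constant delay comes from the fact that no bit position waits on any other; the cost is that the result stays encoded in two buses until the final carry-propagate adder.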
We can use a CSA to add multiple inputs; as an example, an adder with four 4-bit inputs
is shown in Figure 2.23(d). The last stage sums two input buses using a carry-propagate
adder ( CPA ). We have used an RCA as the CPA in Figure 2.23(d) and (e), but we can
use any type of adder. Notice in Figure 2.23(e) how the two CSA cells and the RCA
cell abut together horizontally to form a bit slice (or slice) and then the slices are
stacked vertically to form the datapath.
FIGURE 2.23 The carry-save adder (CSA). (a) A CSA cell. (b) A 4-bit CSA.
(c) Symbol for a CSA. (d) A four-input CSA. (e) The datapath for a four-input, 4-bit
adder using CSAs with a ripple-carry adder (RCA) as the final stage. (f) A pipelined
adder. (g) The datapath for the pipelined version showing the pipeline registers as well
as the clock control lines that use m2.

We can register the CSA stages by adding vectors of flip-flops as shown in


Figure 2.23(f). This reduces the adder delay to that of the slowest adder stage, usually
the CPA. By using registers between stages of combinational logic we use pipelining to
increase the speed, paying a price of increased area (for the registers) and introducing
latency. It takes a few clock cycles (the latency, equal to n clock cycles for an n-stage
pipeline) to fill the pipeline, but once it is filled, the answers emerge every clock cycle.
Ferris wheels work much the same way. When the fair opens it takes a while (latency)
to fill the wheel, but once it is full the people can get on and off every few seconds.
(We can also pipeline the RCA of Figure 2.20. We add i registers on the A and B
inputs before ADD[i] and add (n − i) registers after the output S[i], with a single
register before each C[i].)
The problem with an RCA is that every stage has to wait to make its carry decision, C[i],
until the previous stage has calculated C[i − 1]. If we examine the propagate signals
we can bypass this critical path. Thus, for example, to bypass the carries for bits 4–7
(stages 5–8) of an adder we can compute BYPASS = P[4]·P[5]·P[6]·P[7] and then use a
MUX as follows:
C[7] = (G[7] + P[7] · C[6]) · BYPASS' + C[3] · BYPASS . (2.54)

Adders based on this principle are called carry-bypass adders (CBA) [Sato et al.,
1992]. Large, custom adders employ Manchester-carry chains to compute the carries
and the bypass operation using TGs or just pass transistors [Weste and Eshraghian,
1993, pp. 530–531]. These types of carry chains may be part of a predesigned ASIC
adder cell, but are not used by ASIC designers.
Instead of checking the propagate signals we can check the inputs. For example, we can
compute SKIP = (A[i − 1] ⊕ B[i − 1]) + (A[i] ⊕ B[i]) and then use a 2:1 MUX to
select C[i]. Thus,
CSKIP[i] = (G[i] + P[i] · C[i − 1]) · SKIP' + C[i − 2] · SKIP . (2.55)

This is a carry-skip adder [Keutzer, Malik, and Saldanha, 1991; Lehman, 1961].
Carry-bypass and carry-skip adders may include redundant logic (since the carry is
computed in two different wayswe just take the first signal to arrive). We must be
careful that the redundant logic is not optimized away during logic synthesis.
If we evaluate Eq. 2.44 recursively for i = 1, we get the following:
C[1] = G[1] + P[1] · C[0]
= G[1] + P[1] · (G[0] + P[0] · C[−1])
= G[1] + P[1] · G[0] (since C[−1] = 0) . (2.56)

This result means that we can look ahead by two stages and calculate the carry into
the third stage (bit 2), which is C[1], using only the first-stage inputs (to calculate G[0])
and the second-stage inputs. This is a carry-lookahead adder ( CLA ) [MacSorley,
1961]. If we continue expanding Eq. 2.44, we find:
C[2] = G[2] + P[2] · G[1] + P[2] · P[1] · G[0] ,

C[3] = G[3] + P[3] · G[2] + P[3] · P[2] · G[1] + P[3] · P[2] · P[1] · G[0] . (2.57)

As we look ahead further these equations become more complex, take longer to
calculate, and the logic becomes less regular when implemented using cells with a
limited number of inputs. Datapath layout must fit in a bit slice, so the physical and
logical structure of each bit must be similar. In a standard cell or gate array we are not
so concerned about a regular physical structure, but a regular logical structure
simplifies design. The Brent-Kung adder reduces the delay and increases the regularity
of the carry-lookahead scheme [Brent and Kung, 1982]. Figure 2.24(a) shows a regular
4-bit CLA, using the carry-lookahead generator cell (CLG) shown in Figure 2.24(b).
FIGURE 2.24 The Brent-Kung carry-lookahead adder (CLA). (a) Carry generation in a
4-bit CLA. (b) A cell to generate the lookahead terms, C[0]–C[3]. (c) Cells L1, L2, and
L3 are rearranged into a tree that has less delay. Cell L4 is added to calculate C[2],
which is lost in the translation. (d) and (e) Simplified representations of parts a and c. (f) The
lookahead logic for an 8-bit adder. The inputs, 0–7, are the propagate and carry terms
formed from the inputs to the adder. (g) An 8-bit Brent-Kung CLA. The outputs of the
lookahead logic are the carry bits that (together with the inputs) form the sum. One
advantage of this adder is that delays from the inputs to the outputs are more nearly
equal than in other adders. This tends to reduce the number of unwanted and
unnecessary switching events and thus reduces power dissipation.

In a carry-select adder we duplicate two small adders (usually 4-bit or 8-bit adders,
often CLAs) for the cases CIN = '0' and CIN = '1' and then use a MUX to select the
case that we need (wasteful, but fast) [Bedrij, 1962]. A carry-select adder is often used as
the fast adder in a datapath library because its layout is regular.
We can use the carry-select, carry-bypass, and carry-skip architectures to split a 12-bit
adder, for example, into three blocks. The delay of the adder is then partly dependent
on the delays of the MUX between each block. Suppose the delay due to 1 bit in an
adder block (we shall call this a bit delay) is approximately equal to the MUX delay. In
this case it may be faster to make the blocks 3, 4, and 5 bits long instead of equal in
size. Now the delays into the final MUX are equal: 3 bit delays plus 2 MUX delays for
the carry signal from bits 0–6, and 5 bit delays for the carry from bits 7–11. Adjusting
the block size reduces the delay of large adders (more than 16 bits).
We can extend the idea behind a carry-select adder as follows. Suppose we have an n-bit
adder that generates two sums: One sum assumes a carry-in condition of '0', the
other sum assumes a carry-in condition of '1'. We can split this n-bit adder into an i-bit
adder for the i LSBs and an (n − i)-bit adder for the n − i MSBs. Both of the smaller
adders generate two conditional sums as well as true and complement carry signals.
The two (true and complement) carry signals from the LSB adder are used to select
between the two (n − i + 1)-bit conditional sums from the MSB adder using 2(n − i + 1)
two-input MUXes. This is a conditional-sum adder (also often abbreviated to CSA)
[Sklansky, 1960]. We can recursively apply this technique. For example, we can split a
16-bit adder using i = 8 and n = 16; then we can split one or both 8-bit adders again,
and so on.
Figure 2.25 shows the simplest form of an n-bit conditional-sum adder that uses n
single-bit conditional adders, H (each with four outputs: two conditional sums, true
carry, and complement carry), together with a tree of 2:1 MUXes (Qi_j). The
conditional-sum adder is usually the fastest of all the adders we have discussed (it is the
fastest when logic cell delay increases with the number of inputs; this is true for all
ASICs except FPGAs).
FIGURE 2.25 The conditional-sum adder. (a) A 1-bit conditional adder that calculates
the sum and carry out assuming the carry in is either '1' or '0'. (b) The multiplexer that
selects between sums and carries. (c) A 4-bit conditional-sum adder with carry input,
C[0].

2.6.3 A Simple Example
How do we make and use datapath elements? What does a design look like? We may
use predesigned cells from a library or build the elements ourselves from logic cells
using a schematic or a design language. Table 2.12 shows an 8-bit conditional-sum
adder intended for an FPGA. This Verilog implementation uses the same structure as
Figure 2.25, but the equations are collapsed to use four or five variables. A basic logic
cell in certain Xilinx FPGAs, for example, can implement two equations of the same
four variables or one equation with five variables. The equations shown in Table 2.12
require three levels of FPGA logic cells (so, for example, if each FPGA logic cell has
a 5 ns delay, the 8-bit conditional-sum adder delay is 15 ns).
TABLE 2.12 An 8-bit conditional-sum adder (the notation is described in Figure 2.25).
module m8bitCSum (C0, a, b, s, C8); // Verilog conditional-sum adder for an FPGA
input C0; input [7:0] a, b; output [7:0] s; output C8;
wire
A7,A6,A5,A4,A3,A2,A1,A0,B7,B6,B5,B4,B3,B2,B1,B0,S8,S7,S6,S5,S4,S3,S2,S1,S0;
wire C0, C2, C4_2_0, C4_2_1, S5_4_0, S5_4_1, C6, C6_4_0, C6_4_1, C8;
assign {A7,A6,A5,A4,A3,A2,A1,A0} = a; assign {B7,B6,B5,B4,B3,B2,B1,B0} = b;
assign s = { S7,S6,S5,S4,S3,S2,S1,S0 };
assign S0 = A0^B0^C0 ; // start of level 1: & = AND, ^ = XOR, | = OR, ! = NOT
assign S1 = A1^B1^(A0&B0|(A0|B0)&C0) ;
assign C2 = A1&B1|(A1|B1)&(A0&B0|(A0|B0)&C0) ;
assign C4_2_0 = A3&B3|(A3|B3)&(A2&B2) ; assign C4_2_1 =
A3&B3|(A3|B3)&(A2|B2) ;
assign S5_4_0 = A5^B5^(A4&B4) ; assign S5_4_1 = A5^B5^(A4|B4) ;
assign C6_4_0 = A5&B5|(A5|B5)&(A4&B4) ; assign C6_4_1 =
A5&B5|(A5|B5)&(A4|B4) ;
assign S2 = A2^B2^C2 ; // start of level 2
assign S3 = A3^B3^(A2&B2|(A2|B2)&C2) ;
assign S4 = A4^B4^(C4_2_0|C4_2_1&C2) ;
assign S5 = S5_4_0& !(C4_2_0|C4_2_1&C2)|S5_4_1&(C4_2_0|C4_2_1&C2) ;
assign C6 = C6_4_0|C6_4_1&(C4_2_0|C4_2_1&C2) ;
assign S6 = A6^B6^C6 ; // start of level 3
assign S7 = A7^B7^(A6&B6|(A6|B6)&C6) ;
assign C8 = A7&B7|(A7|B7)&(A6&B6|(A6|B6)&C6) ;
endmodule
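A Python transcription of the collapsed equations in Table 2.12 (a verification sketch only; `csum8` is a hypothetical name, not part of any library) can be checked against ordinary integer addition:

```python
# Level-by-level transcription of the Verilog conditional-sum equations,
# returning the 9-bit result {C8, S[7:0]}.
def csum8(a, b, c0):
    A = [(a >> i) & 1 for i in range(8)]; B = [(b >> i) & 1 for i in range(8)]
    S = [0] * 8
    S[0] = A[0] ^ B[0] ^ c0                                 # level 1
    C1 = A[0] & B[0] | (A[0] | B[0]) & c0
    S[1] = A[1] ^ B[1] ^ C1
    C2 = A[1] & B[1] | (A[1] | B[1]) & C1
    C4_0 = A[3] & B[3] | (A[3] | B[3]) & (A[2] & B[2])      # conditional carries
    C4_1 = A[3] & B[3] | (A[3] | B[3]) & (A[2] | B[2])
    S5_0 = A[5] ^ B[5] ^ (A[4] & B[4]); S5_1 = A[5] ^ B[5] ^ (A[4] | B[4])
    C6_0 = A[5] & B[5] | (A[5] | B[5]) & (A[4] & B[4])
    C6_1 = A[5] & B[5] | (A[5] | B[5]) & (A[4] | B[4])
    S[2] = A[2] ^ B[2] ^ C2                                 # level 2
    S[3] = A[3] ^ B[3] ^ (A[2] & B[2] | (A[2] | B[2]) & C2)
    C4 = C4_0 | C4_1 & C2
    S[4] = A[4] ^ B[4] ^ C4
    S[5] = S5_0 & ~C4 & 1 | S5_1 & C4                       # MUX on C4
    C6 = C6_0 | C6_1 & C4
    S[6] = A[6] ^ B[6] ^ C6                                 # level 3
    S[7] = A[7] ^ B[7] ^ (A[6] & B[6] | (A[6] | B[6]) & C6)
    C8 = A[7] & B[7] | (A[7] | B[7]) & (A[6] & B[6] | (A[6] | B[6]) & C6)
    return sum(S[i] << i for i in range(8)) | C8 << 8

for a, b, c0 in [(23, 200, 1), (255, 255, 1), (0, 0, 0), (170, 85, 0)]:
    assert csum8(a, b, c0) == a + b + c0
```

The three comment markers mirror the three levels of FPGA logic cells, which is where the 15 ns estimate in the text comes from.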

Figure 2.26 shows the normalized delay and area figures for a set of predesigned
datapath adders. The data in Figure 2.26 is from a series of ASIC datapath cell libraries
(Compass Passport) that may be synthesized together with test vectors and simulation
models. We can combine the different adder techniques, but the adders then lose
regularity and become less suited to a datapath implementation.
FIGURE 2.26 Datapath adders. This data is from a series of submicron datapath
libraries. (a) Delay normalized to a two-input NAND logic cell delay (approximately
equal to 250 ps in a 0.5 µm process). For example, a 64-bit ripple-carry adder (RCA)
has a delay of approximately 30 ns in a 0.5 µm process. The spread in delay is due to
variation in delays between different inputs and outputs. An n-bit RCA has a delay
proportional to n. The delay of an n-bit carry-select adder is approximately
proportional to log2 n. The carry-save adder delay is constant (but requires a
carry-propagate adder to complete an addition). (b) In a datapath library the area of all
adders is proportional to the bit size.

There are other adders that are not used in datapaths but are occasionally useful in
ASIC design. A serial adder is smaller but slower than the parallel adders we have
described [Denyer and Renshaw, 1985]. The carry-completion adder is a variable-delay
adder and is rarely used in synchronous designs [Sklansky, 1960].

2.6.4 Multipliers
Figure 2.27 shows a symmetric 6-bit array multiplier (an n-bit multiplier multiplies
two n-bit numbers; we shall say n-bit by m-bit multiplier if the lengths are different).
Adders a0–f0 may be eliminated, which then eliminates adders a1–a6, leaving an
asymmetric CSA array of 30 (5 × 6) adders (including one half adder). An n-bit array
multiplier has a delay proportional to n plus the delay of the CPA (adders b6–f6 in
Figure 2.27). There are two items we can attack to improve the performance of a
multiplier: the number of partial products and the addition of the partial products.
FIGURE 2.27 Multiplication. A 6-bit array multiplier using a final carry-propagate
adder (full-adder cells a6–f6, a ripple-carry adder). Apart from the generation of the
summands this multiplier uses the same structure as the carry-save adder of
Figure 2.23(d).

Suppose we wish to multiply 15 (the multiplicand) by 19 (the multiplier) mentally. It
is easier to calculate 15 × 20 and subtract 15. In effect we complete the multiplication
as 15 × (20 − 1) and we could write this as 15 × 2 1̄, with the overbar representing a
minus sign. Now suppose we wish to multiply an 8-bit binary number, A, by B =
00010111 (decimal 16 + 4 + 2 + 1 = 23). It is easier to multiply A by the canonical
signed-digit vector (CSD vector) D = 00101̄001̄ (decimal 32 − 8 − 1 = 23) since this
requires only three add or subtract operations (and a subtraction is as easy as an
addition). We say B has a weight of 4 and D has a weight of 3. By using D instead of B
we have reduced the number of partial products by 1 (= 4 − 3).
We can recode (or encode) any binary number, B, as a CSD vector, D, as follows
(canonical means there is only one CSD vector for any number):
D i = B i + C i − 2C i+1 , (2.58)
where C i+1 is the carry from the sum of B i+1 + B i + C i (we start with C 0 = 0).

As another example, if B = 011 (B 2 = 0, B 1 = 1, B 0 = 1; decimal 3), then, using
Eq. 2.58,
D 0 = B 0 + C 0 − 2C 1 = 1 + 0 − 2 = −1 ,
D 1 = B 1 + C 1 − 2C 2 = 1 + 1 − 2 = 0 ,
D 2 = B 2 + C 2 − 2C 3 = 0 + 1 − 0 = 1 , (2.59)
so that D = 101̄ (decimal 4 − 1 = 3). CSD vectors are useful to represent fixed
coefficients in digital filters, for example.
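Eq. 2.58 translates directly into a small recoding routine (a sketch; `csd` is an illustrative name, and the carry is simply a majority vote of B i+1, B i, and C i):

```python
# Recode binary b (n bits) into CSD digits D[i] = B[i] + C[i] - 2*C[i+1],
# where C[i+1] is the carry out of B[i+1] + B[i] + C[i].
def csd(b, n):
    B = [(b >> i) & 1 for i in range(n)] + [0, 0]   # pad high bits with zeros
    C, D = 0, []
    for i in range(n + 1):
        Cnext = 1 if B[i + 1] + B[i] + C >= 2 else 0
        D.append(B[i] + C - 2 * Cnext)
        C = Cnext
    return D                                        # D[0] is the LSB digit

D = csd(23, 8)                                      # B = 00010111
assert sum(d << i for i, d in enumerate(D)) == 23   # same value
assert sum(abs(d) for d in D) == 3                  # weight reduced from 4 to 3
```

The weight assertion confirms the partial-product saving claimed in the text for B = 23.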
We can recode using a radix other than 2. Suppose B is an (n + 1)-digit two's
complement number,
B = B0 + B1·2 + B2·2^2 + . . . + Bi·2^i + . . . + Bn−1·2^(n−1) − Bn·2^n . (2.60)

We can rewrite the expression for B using the following sleight of hand (B = 2B − B):

B = −B0 + (B0 − B1)2 + . . . + (B i−1 − B i)2^i + . . . + (B n−1 − B n)2^n
= (−2B1 + B0)2^0 + (−2B3 + B2 + B1)2^2 + . . .
+ (−2B i + B i−1 + B i−2)2^(i−1) + (−2B i+2 + B i+1 + B i)2^(i+1) + . . .
+ (−2B n + B n−1 + B n−2)2^(n−1) . (2.61)

This is very useful. Consider B = 101001 (decimal 9 − 32 = −23, n = 5),

B = 101001
= (−2B1 + B0)2^0 + (−2B3 + B2 + B1)2^2 + (−2B5 + B4 + B3)2^4
= ((−2 × 0) + 1)2^0 + ((−2 × 1) + 0 + 0)2^2 + ((−2 × 1) + 0 + 1)2^4 . (2.62)

Equation 2.61 tells us how to encode B as a radix-4 signed digit, E = 1̄2̄1 (decimal
−16 − 8 + 1 = −23). To multiply by B encoded as E we only have to perform a multiplication
by 2 (a shift) and three add/subtract operations.
Using Eq. 2.61 we can encode any number by taking groups of three bits at a time and
calculating
E j = −2B i + B i−1 + B i−2 ,
E j+1 = −2B i+2 + B i+1 + B i , . . . , (2.63)

where each 3-bit group overlaps by one bit. We pad B with a zero, B n . . . B 1 B 0 0, to
match the first term in Eq. 2.61. If B has an odd number of bits, then we extend the
sign: B n B n . . . B 1 B 0 0. For example, B = 01011 (eleven) encodes to E = 11̄1̄ (16
− 4 − 1); and B = 101 is E = 1̄1. This is called Booth encoding and reduces the number of
partial products by a factor of two and thus considerably reduces the area as well as
increasing the speed of our multiplier [Booth, 1951].
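A sketch of this recoding (illustrative Python, digit list most significant first): pad with a zero on the right, sign-extend odd-length inputs, and take overlapping 3-bit groups per Eq. 2.63:

```python
# Radix-4 Booth recoding of a two's complement bit string (MSB first).
def booth4(bits):
    if len(bits) % 2:                    # odd number of bits: extend the sign
        bits = bits[0] + bits
    b = [int(x) for x in (bits + '0')]   # append B[-1] = 0 on the right
    digits = []                          # radix-4 digits, most significant first
    for i in range(0, len(b) - 2, 2):    # overlapping 3-bit groups
        digits.append(-2 * b[i] + b[i + 1] + b[i + 2])
    return digits

assert booth4('01011') == [1, -1, -1]    # eleven = 16 - 4 - 1
assert booth4('101') == [-1, 1]          # -3 = -4 + 1
```

Each radix-4 digit lies in {−2, −1, 0, 1, 2}, so every partial product is a shift or a shift-and-negate of the multiplicand, halving the number of rows in the CSA array.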
Next we turn our attention to improving the speed of addition in the CSA array.
Figure 2.28(a) shows a section of the 6-bit array multiplier from Figure 2.27. We can
collapse the chain of adders a0–f5 (5 adder delays) to the Wallace tree consisting of
adders 5.1–5.4 (4 adder delays) shown in Figure 2.28(b).

FIGURE 2.28 Tree-based multiplication. (a) The portion of Figure 2.27 that calculates
the sum bit, P 5, using a chain of adders (cells a0–f5). (b) We can collapse this chain to
a Wallace tree (cells 5.1–5.5). (c) The stages of multiplication.

Figure 2.28(c) pictorially represents multiplication as a sort of golf course. Each link
corresponds to an adder. The holes or dots are the outputs of one stage (and the inputs
of the next). At each stage we have the following three choices: (1) sum three outputs
using a full adder (denoted by a box enclosing three dots); (2) sum two outputs using a
half adder (a box with two dots); (3) pass the outputs directly to the next stage. The two
outputs of an adder are joined by a diagonal line (full adders use black dots, half adders
white dots). The object of the game is to choose (1), (2), or (3) at each stage to
maximize the performance of the multiplier. In tree-based multipliers there are two
ways to do this: working forward and working backward.
In a Wallace-tree multiplier we work forward from the multiplier inputs, compressing
the number of signals to be added at each stage [Wallace, 1960]. We can view an FA as
a 3:2 compressor or (3, 2) counter: it counts the number of '1's on its inputs. Thus, for
example, an input of '101' (two '1's) results in an output of '10' (2). A half adder is a (2, 2)
counter. To form P 5 in Figure 2.29 we must add 6 summands (S05, S14, S23, S32,
S41, and S50) and 4 carries from the P 4 column. We add these in stages 1–7,
compressing from 6:3:2:2:3:1:1. Notice that we wait until stage 5 to add the last carry
from column P 4, and this means we expand (rather than compress) the number of
signals (from 2 to 3) between stages 3 and 5. The maximum delay through the CSA
array of Figure 2.29 is 6 adder delays. To this we must add the delay of the 4-bit (9
inputs) CPA (stage 7). There are 26 adders (6 half adders) plus the 4 adders in the CPA.

FIGURE 2.29 A 6-bit Wallace-tree multiplier. The carry-save adder (CSA) requires 26
adders (cells 1–26, six are half adders). The final carry-propagate adder (CPA) consists
of 4 adder cells (27–30). The delay of the CSA is 6 adders. The delay of the CPA is 4
adders.

In a Dadda multiplier (Figure 2.30) we work backward from the final product [Dadda,
1965]. Each stage has a maximum of 2, 3, 4, 6, 9, 13, 19, . . . outputs (each successive
stage is 3/2 times larger, rounded down to an integer). Thus, for example, in
Figure 2.30 we require 3 stages (with 3 adder delays, plus the delay of a 10-bit output
CPA) for a 6-bit Dadda multiplier. There are 19 adders (4 half adders) in the CSA plus
the 10 adders (2 half adders) in the CPA. A Dadda multiplier is usually faster and
smaller than a Wallace-tree multiplier.
FIGURE 2.30 The 6-bit Dadda multiplier. The carry-save adder (CSA) requires 20
adders (cells 1–20, four are half adders). The carry-propagate adder (CPA, cells 21–30)
is a ripple-carry adder (RCA). The CSA is smaller (20 versus 26 adders), faster (3
adder delays versus 6 adder delays), and more regular than the Wallace-tree CSA of
Figure 2.29. The overall speed of this implementation is approximately the same as the
Wallace-tree multiplier of Figure 2.29; however, the speed may be increased by
substituting a faster CPA.

In general, the number of stages, and thus the delay (in units of an FA delay, excluding the
CPA), for an n-bit tree-based multiplier using (3, 2) counters is
log1.5 n = log10 n / log10 1.5 = log10 n / 0.176 . (2.64)
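The same growth shows up if we walk the Dadda height sequence 2, 3, 4, 6, 9, 13, 19, . . . backward (a sketch, not a cell-level model; `tree_stages` is an illustrative name):

```python
# Count the CSA stages needed to compress n summands down to 2 using
# (3, 2) counters: each stage's maximum height is 3/2 the previous one,
# rounded down, giving the Dadda sequence 2, 3, 4, 6, 9, 13, 19, ...
def tree_stages(n):
    heights = [2]
    while heights[-1] < n:
        heights.append(heights[-1] * 3 // 2)
    return len(heights) - 1          # stages to reach a height of 2

assert tree_stages(6) == 3           # the 6-bit Dadda multiplier: 3 CSA stages
assert tree_stages(3) == 1
```

The stage count grows as log base 1.5 of n, matching Eq. 2.64.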

Figure 2.31(a) shows how the partial-product array is constructed in a conventional
4-bit multiplier. The Ferrari-Stefanelli multiplier (Figure 2.31b) nests multipliers: the
2-bit submultipliers reduce the number of partial products [Ferrari and Stefanelli,
1969].

FIGURE 2.31 Ferrari-Stefanelli multiplier. (a) A conventional 4-bit array multiplier
using AND gates to calculate the summands with (2, 2) and (3, 2) counters to sum the
partial products. (b) A 4-bit Ferrari-Stefanelli multiplier using 2-bit submultipliers to
construct the partial-product array. (c) A circuit implementation for an inverting 2-bit
submultiplier.
There are several issues in deciding between parallel multiplier architectures:
1. Since it is easier to fold triangles rather than trapezoids into squares, a
Wallace-tree multiplier is more suited to full-custom layout, but is slightly larger,
than a Dadda multiplier; both are less regular than an array multiplier. For
cell-based ASICs, a Dadda multiplier is smaller than a Wallace-tree multiplier.
2. The overall multiplier speed does depend on the size and architecture of the final
CPA, but this may be optimized independently of the CSA array. This means a
Dadda multiplier is always at least as fast as the Wallace-tree version.
3. The low-order bits of any parallel multiplier settle first and can be added in the
CPA before the remaining bits settle. This allows multiplication and the final
addition to be overlapped in time.
4. Any of the parallel multiplier architectures may be pipelined. We may also use a
variably pipelined approach that tailors the register locations to the size of the
multiplier.
5. Using (4, 2), (5, 3), (7, 3), or (15, 4) counters increases the stage compression
and permits the size of the stages to be tuned. Some ASIC cell libraries contain a
(7, 3) counter, a 2-bit full adder. A (15, 4) counter is a 3-bit full adder. There is a
trade-off in using these counters between the speed and size of the logic cells and
the delay, as well as the area, of the interconnect.
6. Power dissipation is reduced by the tree-based structures. The simplified
carry-save logic produces fewer signal transitions and the tree structures produce
fewer glitches than a chain.
7. None of the multiplier structures we have discussed take into account the
possibility of staggered arrival times for different bits of the multiplicand or the
multiplier. Optimization then requires a logic-synthesis tool.

2.6.5 Other Arithmetic Systems
There are other schemes for addition and multiplication that are useful in special
circumstances. Addition of numbers using redundant binary encoding avoids carry
propagation and is thus potentially very fast. Table 2.13 shows the rules for addition
using an intermediate carry and sum that are added without the need for carry. For
example (overbars mark negative digits):

                       binary      decimal   redundant binary   CSD vector
addend                 1010111        87        101̄01̄001̄          101̄01̄001̄
augend               + 1100101     + 101      + 11̄100111̄        + 01100101
intermediate sum                                01001̄1̄10          1̄1̄001̄100
intermediate carry                              11̄000101̄          11000000
sum                  = 10111100    = 188        11̄10001̄00         101̄001̄100
TABLE 2.13 Redundant binary addition.
A[i]  B[i]  A[i−1], B[i−1]               Intermediate sum  Intermediate carry
 1     1    x                                   0                 1
 1     0    A[i−1]=0/1 and B[i−1]=0/1           1̄                 1
 0     1    A[i−1]=1̄ or B[i−1]=1̄               1                 0
 1     1̄    x                                   0                 0
 1̄     1    x                                   0                 0
 0     0    x                                   0                 0
 0     1̄    A[i−1]=0/1 and B[i−1]=0/1           1̄                 0
 1̄     0    A[i−1]=1̄ or B[i−1]=1̄               1                 1̄
 1̄     1̄    x                                   0                 1̄

The redundant binary representation is not unique. We can represent 101 (decimal), for
example, by 1100101 (binary and CSD vector) or 11̄100111̄. As another example, 188
(decimal) can be represented by 10111100 (binary), 11̄10001̄00, 101̄001̄100, or
101̄0001̄00 (CSD vector). Redundant binary addition of binary, redundant binary, or
CSD vectors does not result in a unique sum, and addition of two CSD vectors does not
result in a CSD vector. Each n-bit redundant binary number requires a rather wasteful
2n-bit binary number for storage. Thus 101̄ is represented as 010010, for example
(using sign magnitude). The other disadvantage of redundant binary arithmetic is the
need to convert to and from the binary representation.
Table 2.14 shows the (5, 3) residue number system. As an example, 11 (decimal) is
represented as [1, 2] residue (5, 3) since 11R5 = 11 mod 5 = 1 and 11R3 = 11 mod 3 =
2. The size of this system is thus 3 × 5 = 15. We add, subtract, or multiply residue
numbers using the modulus of each digit position, without any carry. Thus:

     4   [4, 1]       12   [2, 0]        3   [3, 0]
  +  7 + [2, 1]     −  4 − [4, 1]     ×  4 × [4, 1]
  = 11 = [1, 2]     =  8 = [3, 2]     = 12 = [2, 0]
TABLE 2.14 The (5, 3) residue number system.
 n   residue 5  residue 3 |  n   residue 5  residue 3 |  n   residue 5  residue 3
 0       0          0     |  5       0          2     | 10       0          1
 1       1          1     |  6       1          0     | 11       1          2
 2       2          2     |  7       2          1     | 12       2          0
 3       3          0     |  8       3          2     | 13       3          1
 4       4          1     |  9       4          0     | 14       4          2

The choice of moduli determines the system size and the computing complexity. The
most useful choices are relative primes (such as 3 and 5). With p prime, numbers of the
form 2^p and 2^p − 1 are particularly useful (2^p − 1 are Mersenne numbers) [Waser and
Flynn, 1982].
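A sketch of (5, 3) residue arithmetic in Python (illustrative names; converting back to binary would use a lookup table such as Table 2.14 or the Chinese remainder theorem):

```python
# Residue (5, 3) arithmetic: each operation is applied digitwise under its
# own modulus, with no carries between digit positions.
def to_res(n):
    return (n % 5, n % 3)

def res_op(x, y, op):
    return (op(x[0], y[0]) % 5, op(x[1], y[1]) % 3)

add = res_op(to_res(4), to_res(7), lambda a, b: a + b)
mul = res_op(to_res(3), to_res(4), lambda a, b: a * b)
assert add == to_res(11)     # [4, 1] + [2, 1] = [1, 2]
assert mul == to_res(12)     # [3, 0] * [4, 1] = [2, 0]
```

Results are only valid while they stay inside the system size (here 15), which is why the choice of moduli matters.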

2.6.6 Other Datapath Operators
Figure 2.32 shows symbols for some other datapath elements. The combinational
datapath cells, NAND, NOR, and so on, and sequential datapath cells (flip-flops and
latches) have standard-cell equivalents and function identically. I use a bold outline (1
point) for datapath cells instead of the regular (0.5 point) line I use for scalar symbols.
We call a set of identical cells a vector of datapath elements in the same way that a bold
symbol, A , represents a vector and A represents a scalar.
FIGURE 2.32 Symbols for datapath elements. (a) An array or vector of flip-flops (a
register). (b) A two-input NAND cell with databus inputs. (c) A two-input NAND cell
with a control input. (d) A buswide MUX. (e) An incrementer/decrementer. (f) An
all-zeros detector. (g) An all-ones detector. (h) An adder/subtracter.

A subtracter is similar to an adder, except in a full subtracter we have a borrow-in
signal, BIN; a borrow-out signal, BOUT; and a difference signal, DIFF:
DIFF = A ⊕ NOT(B) ⊕ NOT(BIN)
= SUM(A, NOT(B), NOT(BIN)) , (2.65)
NOT(BOUT) = A · NOT(B) + A · NOT(BIN) + NOT(B) · NOT(BIN)
= MAJ(A, NOT(B), NOT(BIN)) . (2.66)

These equations are the same as those for the FA (Eqs. 2.38 and 2.39) except that the B
input is inverted and the sense of the carry chain is inverted. To build a subtracter that
calculates (A − B) we invert the entire B input bus and connect the BIN[0] input to
VDD (not to VSS as we did for CIN[0] in an adder). As an example, to subtract B =
'0011' from A = '1001' we calculate '1001' + '1100' + '1' = '0110'. As with an adder, the
true overflow is XOR(BOUT[MSB], BOUT[MSB − 1]).
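The example can be checked with a few lines of Python (a sketch of the invert-and-carry-in trick, not the cell implementation; `sub4` is an illustrative name):

```python
# Subtract by addition: A - B = A + ~B + 1, with the inverted-B bus and the
# borrow (carry) input tied to '1', just as the text describes.
def sub4(a, b):
    mask = 0xF                         # 4-bit bus
    total = a + (~b & mask) + 1        # '1001' + '1100' + '1' in the example
    return total & mask                # keep the 4-bit difference

assert sub4(0b1001, 0b0011) == 0b0110  # 9 - 3 = 6
```

Discarding the carry out of the top bit is what makes the result a correct two's complement difference.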
We can build a ripple-borrow subtracter (a type of borrow-propagate subtracter), a
borrow-save subtracter, and a borrow-select subtracter in the same way we built these
adder architectures. An adder/subtracter has a control signal that gates the A input with
an exclusive-OR cell (forming a programmable inversion) to switch between an adder
and a subtracter. Some adder/subtracters gate both inputs to allow us to compute (−A − B).
We must be careful to connect the input to the LSB of the carry chain (CIN[0] or
BIN[0]) when changing between addition (connect to VSS) and subtraction (connect to
VDD).
A barrel shifter rotates or shifts an input bus by a specified amount. For example if we
have an eight-input barrel shifter with input '1111 0000' and we specify a shift of
'0001 0000' (3, coded by bit position) the right-shifted 8-bit output is '0001 1110'. A
barrel shifter may rotate left or right (or switch between the two under a separate
control). A barrel shifter may also have an output width that is smaller than the input.
To use a simple example, we may have an 8-bit input and a 4-bit output. This situation
is equivalent to having a barrel shifter with two 4-bit inputs and a 4-bit output. Barrel
shifters are used extensively in floating-point arithmetic to align (we call this normalize
and denormalize ) floating-point numbers (with sign, exponent, and mantissa).
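A rotate-right barrel shifter reduces to two shifts and an OR in software (a sketch; `ror8` is an illustrative name, not a library call):

```python
# 8-bit rotate right: bits shifted off the LSB end reappear at the MSB end.
def ror8(x, n):
    n &= 7                                        # rotation amount mod 8
    return ((x >> n) | (x << (8 - n))) & 0xFF

assert ror8(0b11110000, 3) == 0b00011110          # the example in the text
```

In hardware the same function is built from log2(8) = 3 levels of 2:1 MUXes, one per bit of the shift amount, rather than from shifts.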
A leading-one detector is used with a normalizing (left-shift) barrel shifter to align
mantissas in floating-point numbers. The input is an n -bit bus A, the output is an n -bit
bus, S, with a single '1' in the bit position corresponding to the most significant '1' in
the input. Thus, for example, if the input is A = '0000 0101' the leading-one detector
output is S = '0000 0100', indicating the leading one in A is in bit position 2 (bit 7 is the
MSB, bit zero is the LSB). If we feed the output, S, of the leading-one detector to the
shift select input of a normalizing (left-shift) barrel shifter, the shifter will normalize
the input A. In our example, with an input of A = '0000 0101', and a left-shift of S =
'0000 0100', the barrel shifter will shift A left by five bits and the output of the shifter is
Z = '1010 0000'. Now that Z is aligned (with the MSB equal to '1') we can multiply Z
with another normalized number.
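A sketch of the leading-one detector and the normalization it drives (plain Python, one-hot output as in the text; the names are illustrative):

```python
# One-hot leading-one detector: a single '1' marks the most significant '1'.
def leading_one(a, n=8):
    for i in range(n - 1, -1, -1):        # scan from the MSB down
        if (a >> i) & 1:
            return 1 << i
    return 0                              # no '1' found (input is zero)

a = 0b00000101
s = leading_one(a)
assert s == 0b00000100                    # leading one in bit position 2
z = (a << 5) & 0xFF                       # normalize: shift left by 5
assert z == 0b10100000                    # MSB is now '1'
```

Feeding the one-hot output S straight to the shift-select input of the barrel shifter avoids any decoding between the two cells.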
The output of a priority encoder is the binary-encoded position of the leading one in an
input. For example, with an input A = '0000 0101' the leading 1 is in bit position 2
(MSB is bit position 7), so the output of a 4-bit priority encoder would be Z = '0010' (2).
In some cell libraries the encoding is reversed so that the MSB has an output code of
zero; in this case Z = '0101' (5). This second, reversed, encoding scheme is useful in
floating-point arithmetic. If A is a mantissa and we normalize A to '1010 0000' we have
to subtract 5 from the exponent; this exponent correction is equal to the output of the
priority encoder.
An accumulator is an adder/subtracter and a register. Sometimes these are combined
with a multiplier to form a multiplier-accumulator (MAC). An incrementer adds 1 to
the input bus, Z = A + 1, so we can use this function, together with a register, to negate
a two's complement number, for example. The implementation is Z[i] = XOR(A[i],
CIN[i]) and COUT[i] = AND(A[i], CIN[i]). The carry-in control input, CIN[0],
thus acts as an enable: If it is set to '0' the output is the same as the input.
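The incrementer slice translates directly (a sketch; the ripple loop mirrors Z[i] = XOR(A[i], CIN[i]) and COUT[i] = AND(A[i], CIN[i])):

```python
# Ripple incrementer: each slice XORs the carry into the bit and ANDs it
# forward; cin0 = 0 disables the increment so the output equals the input.
def increment(a, n, cin0=1):
    z, cin = 0, cin0
    for i in range(n):
        ai = (a >> i) & 1
        z |= (ai ^ cin) << i        # Z[i] = XOR(A[i], CIN[i])
        cin = ai & cin              # COUT[i] = AND(A[i], CIN[i])
    return z

assert increment(0b0111, 4) == 0b1000
assert increment(0b0111, 4, cin0=0) == 0b0111   # enable '0': pass-through
```

Note the carry chain needs only an AND gate per bit, which is why incrementers are much smaller than adders.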
The implementation of arithmetic cells is often a little more complicated than we have
explained. CMOS logic is naturally inverting, so that it is faster to implement an
incrementer as
Z[i (even)] = XOR(A[i], CIN[i]) and COUT[i (even)] = NAND(A[i], CIN[i]).
This inverts COUT, so that in the following stage we must invert it again. If we push an
inverting bubble to the input CIN we find that:
Z[i (odd)] = XNOR(A[i], CIN[i]) and COUT[i (odd)] = NOR(NOT(A[i]), CIN[i]).
In many datapath implementations all odd-bit cells operate on inverted carry signals,
and thus the odd-bit and even-bit datapath elements are different. In fact, all the adder
and subtracter datapath elements we have described may use this technique. Normally
this is completely hidden from the designer in the datapath assembly and any output
control signals are inverted, if necessary, by inserting buffers.
A decrementer subtracts 1 from the input bus; the logical implementation is Z[i] =
XOR(A[i], CIN[i]) and COUT[i] = AND(NOT(A[i]), CIN[i]). The
implementation may invert the odd carry signals, with CIN[0] again acting as an
enable.
An incrementer/decrementer has a second control input that gates the input, inverting
the input to the carry chain. This has the effect of selecting either the increment or
decrement function.
When using the all-zeros detectors and all-ones detectors, remember that, for a 4-bit number,
for example, zero in one's complement arithmetic is '1111' or '0000', and that zero in
signed magnitude arithmetic is '1000' or '0000'.
A register file (or scratchpad memory) is a bank of flip-flops arranged across the bus;
sometimes these have the option of multiple ports (multiport register files) for read and
write. Normally these register files are the densest logic and hardest to fit in a datapath.
For large register files it may be more appropriate to use a multiport memory. We can
add control logic to a register file to create a first-in first-out register ( FIFO ), or last-in
first-out register ( LIFO ).
In Section 2.5 we saw that the standard-cell version and gate-array macro version of the
sequential cells (latches and flip-flops) each contain their own clock buffers. The
reason for this is that (without intelligent placement software) we do not know where a
standard cell or a gate-array macro will be placed on a chip. We also have no idea of
the condition of the clock signal coming into a sequential cell. The ability to place the
clock buffers outside the sequential cells in a datapath gives us more flexibility and
saves space. For example, we can place the clock buffers for all the clocked elements at
the top of the datapath (together with the buffers for the control signals) and river route
(in river routing the interconnect lines all flow in the same direction on the same layer)
the connections to the clock lines. This saves space and allows us to guarantee the
clock skew and timing. It may mean, however, that there is a fixed overhead associated
with a datapath. For example, it might make no sense to build a 4-bit datapath if the
clock and control buffers take up twice the space of the datapath logic. Some tools
allow us to design logic using a portable netlist . After we complete the design we can
decide whether to implement the portable netlist in a datapath, standard cells, or even a
gate array, based on area, speed, or power considerations.
2.7 I/O Cells
Figure 2.33 shows a three-state bidirectional output buffer (Tri-State® is a
registered trademark of National Semiconductor). When the output enable (OE)
signal is high, the circuit functions as a noninverting buffer driving the value of
DATAin onto the I/O pad. When OE is low, the output transistors or drivers, M1
and M2, are disconnected. This allows multiple drivers to be connected on a bus.
It is up to the designer to make sure that a bus never has two drivers, a problem
known as contention.
In order to prevent the problem opposite to contention, a bus floating to an
intermediate voltage when there are no bus drivers, we can use a bus keeper or
bus-hold cell (TI calls this Bus-Friendly logic). A bus keeper normally acts like
two weak (low drive-strength) cross-coupled inverters that act as a latch to retain
the last logic state on the bus, but the latch is weak enough that it may be driven
easily to the opposite state. Even though bus keepers act like latches, and will
simulate like latches, they should not be used as latches, since their drive strength
is weak.
Transistors M1 and M2 in Figure 2.33 have to drive large off-chip loads. If we
wish to change the voltage on a C = 200 pF load by 5 V in 5 ns (a slew rate of
1 V ns^-1) we will require a current in the output transistors of I_DS = C(dV/dt) =
(200 × 10^-12)(5/(5 × 10^-9)) = 0.2 A or 200 mA.
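The drive-current figure above is just I = C(dV/dt); a minimal sketch of the calculation (the 200 pF load, 5 V swing, and 5 ns edge are the values used in the text):

```python
def output_drive_current(c_load_f, dv_volts, dt_seconds):
    """Return I = C * dV/dt, the current needed to slew a capacitive load."""
    return c_load_f * dv_volts / dt_seconds

# 200 pF load, 5 V swing in 5 ns (a slew rate of 1 V/ns)
i_ds = output_drive_current(200e-12, 5.0, 5e-9)
print(i_ds)  # 0.2 A, i.e. 200 mA
```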
Such large currents flowing in the output transistors must also flow in the power
supply bus and can cause problems. There is always some inductance in series
with the power supply, between the point at which the supply enters the ASIC
package and reaches the power bus on the chip. The inductance is due to the bond
wire, lead frame, and package pin. If we have a power-supply inductance of 2 nH
and a current changing from zero to 1 A (32 I/O cells on a bus switching at 30
mA each) in 5 ns, we will have a voltage spike on the power supply (called
power-supply bounce) of V = L(dI/dt) = (2 × 10^-9)(1/(5 × 10^-9)) = 0.4 V.
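The bounce estimate follows directly from V = L(dI/dt); a quick check using the text's numbers (2 nH inductance, 32 drivers at roughly 30 mA each switching in 5 ns):

```python
def supply_bounce(l_henry, di_amps, dt_seconds):
    """Return V = L * dI/dt, the voltage spike across the supply inductance."""
    return l_henry * di_amps / dt_seconds

# 2 nH bond-wire/lead-frame inductance, ~1 A total swing in 5 ns
v_spike = supply_bounce(2e-9, 1.0, 5e-9)
print(v_spike)  # 0.4 V of power-supply bounce
```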
We do several things to alleviate this problem: We can limit the number of
simultaneously switching outputs (SSOs), we can limit the number of I/O drivers
that can be attached to any one VDD and GND pad, and we can design the output
buffer to limit the slew rate of the output (we call these slew-rate limited I/O
pads). Quiet-I/O cells also use two separate power supplies and two sets of I/O
drivers: an AC supply (clean or quiet supply) with small AC drivers for the I/O
circuits that start and stop the output slewing at the beginning and end of an output
transition, and a DC supply (noisy or dirty supply) for the transistors that handle
large currents as they slew the output.
The three-state buffer allows us to employ the same pad for both input and output:
bidirectional I/O. When we want to use the pad as an input, we set OE low and
take the data from DATAin. Of course, it is not necessary to have all these
features on every pad: We can build output-only or input-only pads.
FIGURE 2.33 A three-state bidirectional
output buffer. When the output enable,
OE, is '1' the output section is enabled
and drives the I/O pad. When OE is '0'
the output buffer is placed in a
high-impedance state.
We can also use many of these output cell features for input cells that have to
drive large on-chip loads (a clock pad cell, for example). Some gate arrays
simply turn an output buffer around to drive a grid of interconnect that supplies a
clock signal internally. With a typical interconnect capacitance of 0.2 pF cm^-1, a
grid of 100 cm (consisting of 10 by 10 lines running all the way across a 1 cm
chip) presents a load of 20 pF to the clock buffer.
Some libraries include I/O cells that have passive pull-ups or pull-downs
(resistors) instead of the transistors, M1 and M2 (the resistors are normally still
constructed from transistors with long gate lengths). We can also omit one of the
driver transistors, M1 or M2, to form open-drain outputs that require an external
pull-up or pull-down. We can design the output driver to produce TTL output
levels rather than CMOS logic levels. We may also add input hysteresis (using a
Schmitt trigger) to the input buffer, I1 in Figure 2.33, to accept input data signals
that contain glitches (from bouncing switch contacts, for example) or that are
slow rising. The input buffer can also include a level shifter to accept TTL input
levels and shift the input signal to CMOS levels.
The gate oxide in CMOS transistors is extremely thin (100 Å or less). This leaves
the gate oxide of the I/O cell input transistors susceptible to breakdown from
static electricity ( electrostatic discharge , or ESD ). ESD arises when we or
machines handle the package leads (like the shock I sometimes get when I touch
a doorknob after walking across the carpet at work). Sometimes this problem is
called electrical overstress (EOS) since most ESD-related failures are caused not
by gate-oxide breakdown, but by the thermal stress (melting) that occurs when
the n -channel transistor in an output driver overheats (melts) due to the large
current that can flow in the drain diffusion connected to a pad during an ESD
event.
To protect the I/O cells from ESD, the input pads are normally tied to device
structures that clamp the input voltage to below the gate breakdown voltage
(which can be as low as 10 V with a 100 Å gate oxide). Some I/O cells use
transistors with a special ESD implant that increases breakdown voltage and
provides protection. I/O driver transistors can also use elongated drain structures
(ladder structures) and large drain-to-gate spacing to help limit current, but in a
salicide process that lowers the drain resistance this is difficult. One solution is to
mask the I/O cells during the salicide step. Another solution is to use pnpn and
npnp diffusion structures called silicon-controlled rectifiers (SCRs) to clamp
voltages and divert current to protect the I/O circuits from ESD.
There are several ways to model the capability of an I/O cell to withstand EOS.
The human-body model ( HBM ) represents ESD by a 100 pF capacitor
discharging through a 1.5 kΩ resistor (this is an International Electrotechnical
Commission, IEC, specification). Typical voltages generated by the human body
are in the range of 2 to 4 kV, and we often see an I/O pad cell rated by the voltage it
can withstand using the HBM. The machine model ( MM ) represents an ESD
event generated by automated machine handlers. Typical MM parameters use a
200 pF capacitor (typically charged to 200 V) discharged through a 25 Ω
resistor, corresponding to a peak initial current of nearly 10 A. The charged-device
model ( CDM , also called device charge/discharge) represents the problem when
an IC package is charged, in a shipping tube for example, and then grounded. If
the maximum charge on a package is 3 nC (a typical measured figure) and the
package capacitance to ground is 1.5 pF, we can simulate this event by charging a
1.5 pF capacitor to 2 kV and discharging it through a 1 Ω resistor.
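The MM and CDM figures above follow from Ohm's law and Q = CV; a quick sketch checking both (all parameter values are the ones quoted in the text):

```python
def mm_peak_current(v_initial, r_ohms):
    """Peak initial current of the machine-model RC discharge, I = V/R."""
    return v_initial / r_ohms

def cdm_precharge_voltage(charge_coulombs, cap_farads):
    """Charged-device model: equivalent precharge voltage, V = Q/C."""
    return charge_coulombs / cap_farads

# MM: 200 pF capacitor charged to 200 V through a 25-ohm resistor
print(mm_peak_current(200.0, 25.0))        # 8.0 A ("nearly 10 A")
# CDM: 3 nC package charge on 1.5 pF capacitance to ground
print(cdm_precharge_voltage(3e-9, 1.5e-12))  # 2000 V, i.e. the 2 kV precharge
```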
If the diffusion structures in the I/O cells are not designed with care, it is possible
to construct an SCR structure unwittingly, and instead of protecting the
transistors the SCR can enter a mode where it is latched on and conducting large
enough currents to destroy the chip. This failure mode is called latch-up .
Latch-up can occur if the pn -diodes on a chip become forward-biased and inject
minority carriers (electrons in p -type material, holes in n -type material) into the
substrate. The source-substrate and drain-substrate diodes can become
forward-biased due to power-supply bounce or output undershoot (the cell
outputs fall below VSS) or overshoot (outputs rise to greater than VDD), for
example. These injected minority carriers can travel fairly large distances and
interact with nearby transistors causing latch-up. I/O cells normally surround the
I/O transistors with guard rings (a continuous ring of n -diffusion in an n -well
connected to VDD, and a ring of p -diffusion in a p -well connected to VSS) to
collect these minority carriers. This is a problem that can also occur in the logic
core and this is one reason that we normally include substrate and well
connections to the power supplies in every cell.
2.8 Cell Compilers
The process of hand crafting circuits and layout for a full-custom IC is a tedious,
time-consuming, and error-prone task. There are two types of automated layout
assembly tools, often known as silicon compilers . The first type produces a
specific kind of circuit, a RAM compiler or multiplier compiler , for example.
The second type of compiler is more flexible, usually providing a programming
language that assembles or tiles layout from an input command file, but this is
full-custom IC design.
We can build a register file from latches or flip-flops, but, at 4.5 to 6.5 gates (18 to 26
transistors) per bit, this is an expensive way to build memory. Dynamic RAM
(DRAM) can use a cell with only one transistor, storing charge on a capacitor
that has to be periodically refreshed as the charge leaks away. ASIC RAM is
invariably static (SRAM), so we do not need to refresh the bits. When we refer to
RAM in an ASIC environment we almost always mean SRAM. Most ASIC
RAMs use a six-transistor cell (four transistors to form two cross-coupled
inverters that form the storage loop, and two more transistors to allow us to read
from and write to the cell). RAM compilers are available that produce single-port
RAM (a single shared bus for read and write) as well as dual-port RAMs , and
multiport RAMs . In a multi-port RAM the compiler may or may not handle the
problem of address contention (attempts to read and write to the same RAM
address simultaneously). RAM can be asynchronous (the read and write cycles
are triggered by control and/or address transitions asynchronous to a clock) or
synchronous (using the system clock).
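The per-bit transistor counts above make the cost difference concrete; a sketch comparing a hypothetical 32-word by 32-bit memory built from flip-flops (18 to 26 transistors per bit, per the text) against the same capacity in six-transistor SRAM cells:

```python
def memory_transistors(words, bits_per_word, transistors_per_bit):
    """Total transistor count for a memory array of the given geometry."""
    return words * bits_per_word * transistors_per_bit

words, bits = 32, 32  # hypothetical 32 x 32-bit register file
ff_low = memory_transistors(words, bits, 18)   # flip-flop storage, best case
ff_high = memory_transistors(words, bits, 26)  # flip-flop storage, worst case
sram = memory_transistors(words, bits, 6)      # six-transistor SRAM cell
print(ff_low, ff_high, sram)  # 18432 26624 6144
```

Even in the best case the flip-flop version needs roughly three times as many transistors, which is why compiled SRAM is used for anything beyond a small register file.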
In addition to producing layout we also need a model compiler so that we can
verify the circuit at the behavioral level, and we need a netlist from a netlist
compiler so that we can simulate the circuit and verify that it works correctly at
the structural level. Silicon compilers are thus complex pieces of software. We
assume that a silicon compiler will produce working silicon even if every
configuration has not been tested. This is still ASIC design, but now we are
relying on the fact that the tool works correctly and therefore the compiled blocks
are correct by construction .
2.9 Summary
The most important concepts that we covered in this chapter are the following:
● The use of transistors as switches
● The difference between a flip-flop and a latch
● The meaning of setup time and hold time
● Pipelines and latency
● The difference between datapath, standard-cell, and gate-array logic cells
● Strong and weak logic levels
● Pushing bubbles
● Ratio of logic
● Resistance per square of layers and their relative values in CMOS
● Design rules and λ