Asic Smith

ASICs... the website
INDEX
• Chapter 1: Introduction to ASICs
• Chapter 2: CMOS Logic
• Chapter 3: ASIC Library Design
• Chapter 4: Programmable ASICs
• Chapter 5: Programmable ASIC Logic Cells
• Chapter 6: Programmable ASIC I/O Cells
• Chapter 7: Programmable ASIC Interconnect
• Chapter 8: Programmable ASIC Design Software
• Chapter 9: Low-Level Design Entry
• Chapter 10: VHDL (the links to the IEEE VHDL LRM are protected)
• Chapter 11: Verilog HDL
• Chapter 12: Logic Synthesis
• Chapter 13: Simulation
• Chapter 14: Test
• Chapter 15: System Partitioning
• Chapter 16: Floorplanning and Placement
• Chapter 17: Routing
• Appendix A: VHDL Resources (the complex style sheets used in these files will work with Microsoft Internet Explorer 4.0, but not very well with Netscape 4.0 or earlier versions of either browser)
• Appendix B: Verilog HDL Resources
INTRODUCTION TO ASICs
Portions from Application-Specific Integrated Circuits Copyright 1997 by Addison Wesley Longman, Inc.
An ASIC (pronounced a-sick; bold typeface defines a new term) is an application-specific integrated circuit; at least, that is what the acronym stands for. Before we answer the question of what that means, we first look at the evolution of the silicon chip or integrated circuit (IC). Figure 1.1(a) shows an IC package (this is a pin-grid array, or PGA, shown upside down; the pins will go through holes in a printed-circuit board). People often call the package a chip, but, as you can see in Figure 1.1(b), the silicon chip itself (more properly called a die) is mounted in the cavity under the sealed lid. A PGA package is usually made from a ceramic material, but plastic packages are also common.
FIGURE 1.1 An integrated circuit (IC). (a) A pin-grid array (PGA) package. (b) The silicon die or chip is under the package lid.
The physical size of a silicon die varies from a few millimeters on a side to over 1 inch on a side, but instead we often measure the size of an IC by the number of logic gates or the number of transistors that the IC contains. As a unit of measure a gate equivalent corresponds to a two-input NAND gate (a circuit that performs the logic function $F = \overline{A \cdot B}$). Often we just use the term gates instead of gate equivalents when we are measuring chip size (not to be confused with the gate terminal of a transistor). For example, a 100 k-gate IC contains the equivalent of 100,000 two-input NAND gates.
The semiconductor industry has evolved from the first ICs of the early 1970s and matured rapidly since then. Early small-scale integration (SSI) ICs contained a few (1 to 10) logic gates (NAND gates, NOR gates, and so on), amounting to a few tens of transistors. The era of medium-scale integration (MSI) increased the range of integrated logic available to counters and similar, larger scale, logic functions. The era of large-scale integration (LSI) packed even larger logic functions, such as the first microprocessors, into a single chip. The era of very large-scale integration (VLSI) now offers 64-bit microprocessors, complete with cache memory and floating-point arithmetic units (well over a million transistors), on a single piece of silicon. As CMOS process technology improves, transistors continue to get smaller and ICs hold more and more transistors. Some people (especially in Japan) use the term ultralarge scale integration (ULSI), but most people stop at the term VLSI; otherwise we have to start inventing new words.
The earliest ICs used bipolar technology and the majority of logic ICs used either transistor-transistor logic (TTL) or emitter-coupled logic (ECL). Although invented before the bipolar transistor, the metal-oxide-silicon (MOS) transistor was initially difficult to manufacture because of problems with the oxide interface. As these problems were gradually solved, metal-gate n-channel MOS (nMOS or NMOS) technology developed in the 1970s. At that time MOS technology required fewer masking steps, was denser, and consumed less power than equivalent bipolar ICs.
This meant that, for a given performance, an MOS IC was cheaper than a bipolar IC, which led to investment in and growth of the MOS IC market. By the early 1980s the aluminum gates of the transistors were replaced by polysilicon gates, but the name MOS remained. The introduction of polysilicon as a gate material was a major improvement in CMOS technology, making it easier to make two types of transistors, n-channel MOS and p-channel MOS transistors, on the same IC: a complementary MOS (CMOS, never cMOS) technology. The principal advantage of CMOS over NMOS is lower power consumption. Another advantage of a polysilicon gate was a simplification of the fabrication process, allowing devices to be scaled down in size.
There are four CMOS transistors in a two-input NAND gate (and a two-input NOR gate too), so to convert between gates and transistors, you multiply the number of gates by 4 to obtain the number of transistors. We can also measure an IC by the smallest feature size (roughly half the length of the smallest transistor) imprinted on the IC. Transistor dimensions are measured in microns (a micron, 1 µm, is a millionth of a meter). Thus we talk about a 0.5 µm IC or say an IC is built in (or with) a 0.5 µm process, meaning that the smallest transistors are 0.5 µm in length. We give a special label, λ or lambda, to this smallest feature size. Since lambda is equal to half of the smallest transistor length, λ ≈ 0.25 µm in a 0.5 µm process. Many of the drawings in this book use a scale marked with lambda for the same reason we place a scale on a map.
A modern submicron CMOS process is now just as complicated as a submicron bipolar or BiCMOS (a combination of bipolar and CMOS) process. However, CMOS ICs have established a dominant position and are manufactured in much greater volume than any other technology; therefore, because of the economy of scale, a CMOS IC costs less than a bipolar or BiCMOS IC for the same function. Bipolar and BiCMOS ICs are still used for special needs. For example, bipolar technology is generally capable of handling higher voltages than CMOS. This makes bipolar and BiCMOS ICs useful in power electronics, cars, telephone circuits, and so on.
Some digital logic ICs and their analog counterparts (analog/digital converters, for example) are standard parts, or standard ICs. You can select standard ICs from catalogs and data books and buy them from distributors. Systems manufacturers and designers can use the same standard part in a variety of different microelectronic systems (systems that use microelectronics or ICs).
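The two rules of thumb given earlier (four transistors per gate equivalent, and lambda as half the smallest transistor length) can be sketched in a few lines of Python. The function names are hypothetical and chosen only for illustration.

```python
# Illustrative helpers; the names are assumptions, not taken from the text.

TRANSISTORS_PER_GATE = 4  # a two-input CMOS NAND (or NOR) gate uses four transistors


def gates_to_transistors(gate_equivalents: int) -> int:
    """Convert a size in gate equivalents to an approximate transistor count."""
    return gate_equivalents * TRANSISTORS_PER_GATE


def lambda_microns(min_transistor_length_um: float) -> float:
    """Lambda is half the smallest transistor length of the process."""
    return min_transistor_length_um / 2


print(gates_to_transistors(100_000))  # a 100 k-gate IC -> 400000 transistors
print(lambda_microns(0.5))            # a 0.5 micron process -> 0.25
```

Both conversions are approximations, since a gate equivalent is only a unit of measure and real cells vary in transistor count.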
With the advent of VLSI in the 1980s engineers began to realize the advantages of designing an IC that was customized or tailored to a particular system or application rather than using standard ICs alone. Microelectronic system design then becomes a matter of defining the functions that you can implement using standard ICs and then implementing the remaining logic functions (sometimes called glue logic) with one or more custom ICs. As VLSI became possible you could build a system from a smaller number of components by combining many standard ICs into a few custom ICs. Building a microelectronic system with fewer ICs allows you to reduce cost and improve reliability.
Of course, there are many situations in which it is not appropriate to use a custom IC for each and every part of a microelectronic system. If you need a large amount of memory, for example, it is still best to use standard memory ICs, either dynamic random-access memory (DRAM or dRAM), or static RAM (SRAM or sRAM), in conjunction with custom ICs.
One of the first conferences to be devoted to this rapidly emerging segment of the IC industry was the IEEE Custom Integrated Circuits Conference (CICC), and the proceedings of this annual conference form a useful reference to the development of custom ICs. As different types of custom ICs began to evolve for different types of applications, these new ICs gave rise to a new term: application-specific IC, or ASIC. Now we have the IEEE International ASIC Conference, which tracks advances in ASICs separately from other types of custom ICs.
Although the exact definition of an ASIC is difficult, we shall look at some examples to help clarify what people in the IC industry understand by the term. Examples of ICs that are not ASICs include standard parts such as: memory chips sold as a commodity item (ROMs, DRAM, and SRAM); microprocessors; TTL or TTL-equivalent ICs at SSI, MSI, and LSI levels. Examples of ICs that are ASICs include: a chip for a toy bear that talks; a chip for a satellite; a chip designed to handle the interface between memory and a microprocessor for a workstation CPU; and a chip containing a microprocessor as a cell together with other logic. As a general rule, if you can find it in a data book, then it is probably not an ASIC, but there are some exceptions. For example, two ICs that might or might not be considered ASICs are a controller chip for a PC and a chip for a modem.
Both of these examples are specific to an application (shades of an ASIC) but are sold to many different system vendors (shades of a standard part). ASICs such as these are sometimes called application-specific standard products (ASSPs).
Trying to decide which members of the huge IC family are application-specific is tricky; after all, every IC has an application. For example, people do not usually consider an application-specific microprocessor to be an ASIC. I shall describe how to design an ASIC that may include large cells such as microprocessors, but I shall not describe the design of the microprocessors themselves. Defining an ASIC by looking at the application can be confusing, so we shall look at a different way to categorize the IC family. The easiest way to recognize people is by their faces and physical characteristics: tall, short, thin. The easiest characteristics of ASICs to understand are physical ones too, and we shall look at these next. It is important to understand these differences because they affect such factors as the price of an ASIC and the way you design an ASIC.
1.1 Types of ASICs
1.2 Design Flow
1.3 Case Study
1.4 Economics of ASICs
1.5 ASIC Cell Libraries
1.6 Summary
1.7 Problems
1.8 Bibliography
1.9 References
1.1 Types of ASICs

ICs are made on a thin (a few hundred microns thick), circular silicon wafer, with each wafer holding hundreds of die (sometimes people use dies or dice for the plural of die). The transistors and wiring are made from many layers (usually between 10 and 15 distinct layers) built on top of one another. Each successive mask layer has a pattern that is defined using a mask similar to a glass photographic slide. The first half-dozen or so layers define the transistors. The last half-dozen or so layers define the metal wires between the transistors (the interconnect).
A full-custom IC includes some (possibly all) logic cells that are customized and all mask layers that are customized. A microprocessor is an example of a full-custom IC; designers spend many hours squeezing the most out of every last square micron of microprocessor chip space by hand. Customizing all of the IC features in this way allows designers to include analog circuits, optimized memory cells, or mechanical structures on an IC, for example. Full-custom ICs are the most expensive to manufacture and to design. The manufacturing lead time (the time it takes just to make an IC, not including design time) is typically eight weeks for a full-custom IC. These specialized full-custom ICs are often intended for a specific application, so we might call some of them full-custom ASICs.
We shall discuss full-custom ASICs briefly next, but the members of the IC family that we are more interested in are semicustom ASICs, for which all of the logic cells are predesigned and some (possibly all) of the mask layers are customized. Using predesigned cells from a cell library makes our lives as designers much, much easier. There are two types of semicustom ASICs that we shall cover: standard-cell-based ASICs and gate-array-based ASICs. Following this we shall describe the programmable ASICs, for which all of the logic cells are predesigned and none of the mask layers are customized. There are two types of programmable ASICs: the programmable logic device and, the newest member of the ASIC family, the field-programmable gate array.
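The three families above are distinguished by two questions: are all logic cells predesigned, and how many mask layers are customized? A minimal sketch of that decision, with a hypothetical function name and string labels chosen only for illustration:

```python
def classify_asic(all_cells_predesigned: bool, customized_mask_layers: str) -> str:
    """Classify an IC using the two criteria from the text.

    customized_mask_layers must be "all", "some", or "none".
    The function name and labels are illustrative assumptions.
    """
    if customized_mask_layers not in ("all", "some", "none"):
        raise ValueError("customized_mask_layers must be 'all', 'some', or 'none'")

    if not all_cells_predesigned:
        # Full-custom: some (possibly all) cells customized, all mask layers customized.
        if customized_mask_layers == "all":
            return "full-custom"
        return "unclassified"  # the text does not describe this combination

    if customized_mask_layers == "none":
        return "programmable"  # PLDs and field-programmable gate arrays
    return "semicustom"        # standard-cell-based or gate-array-based


print(classify_asic(False, "all"))   # full-custom
print(classify_asic(True, "some"))   # semicustom (for example, a gate array)
print(classify_asic(True, "none"))   # programmable
```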
1.1.1 Full-Custom ASICs

In a full-custom ASIC an engineer designs some or all of the logic cells, circuits, or layout specifically for one ASIC. This means the designer abandons the approach of using pretested and precharacterized cells for all or part of that design. It makes sense to take this approach only if there are no suitable existing cell libraries available that can be used for the entire design. This might be because existing cell libraries are not fast enough, or the logic cells are not small enough or consume too much power. You may need to use full-custom design if the ASIC technology is new or so specialized that there are no existing cell libraries or because the ASIC is so specialized that some circuits must be custom designed. Fewer and fewer full-custom ICs are being designed because of the problems with these special parts of the ASIC. There is one growing member of this family, though, the mixed analog/digital ASIC, which we shall discuss next. Bipolar technology has historically been used for precision analog functions. There are some fundamental reasons for this. In all integrated circuits the matching of component characteristics between chips is very poor, while the matching of characteristics between components on the same chip is excellent. Suppose we have transistors T1, T2, and T3 on an analog/digital ASIC. The three transistors are all the same size and are constructed in an identical fashion. Transistors T1 and T2 are located adjacent to each other and have the same orientation. Transistor T3 is the same size as T1 and T2 but is located on the other side of the chip from T1 and T2 and has a different orientation. ICs are made in batches called wafer lots. A wafer lot is a group of silicon wafers that are all processed together. Usually there are between 5 and 30 wafers in a lot. Each wafer can contain tens or hundreds of chips depending on the size of the IC and the wafer. 
If we were to make measurements of the characteristics of transistors T1, T2, and T3 we would find the following:
• Transistor T1 will have virtually identical characteristics to T2 on the same IC. We say that the transistors match well or the tracking between devices is excellent.
• Transistor T3 will match transistors T1 and T2 on the same IC very well, but not as closely as T1 matches T2 on the same IC.
• Transistors T1, T2, and T3 will match fairly well with transistors T1, T2, and T3 on a different IC on the same wafer. The matching will depend on how far apart the two ICs are on the wafer.
• Transistors on ICs from different wafers in the same wafer lot will not match very well.
• Transistors on ICs from different wafer lots will match very poorly.
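The list above is an ordered hierarchy: the more manufacturing context two transistors share, the better they match. A minimal sketch of that ordering follows; the `Site` record and its field names are hypothetical, and the returned labels are qualitative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    """Where a transistor sits. All field names are illustrative assumptions."""
    lot: int     # wafer lot
    wafer: int   # wafer within the lot
    die: int     # IC (die) on the wafer
    group: int   # transistors with the same group are adjacent and share an orientation


def matching(a: Site, b: Site) -> str:
    """Return the qualitative matching level described in the list above."""
    if a.lot != b.lot:
        return "very poor"            # different wafer lots
    if a.wafer != b.wafer:
        return "not very good"        # same lot, different wafers
    if a.die != b.die:
        return "fairly good"          # same wafer, different ICs
    if a.group != b.group:
        return "very good"            # same IC, but apart or rotated (T3 versus T1)
    return "virtually identical"      # adjacent, same orientation (T1 versus T2)


t1 = Site(lot=1, wafer=1, die=1, group=1)
t2 = Site(lot=1, wafer=1, die=1, group=1)
t3 = Site(lot=1, wafer=1, die=1, group=2)
print(matching(t1, t2))
print(matching(t1, t3))
```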
For many analog designs the close matching of transistors is crucial to circuit operation. For these circuit designs pairs of transistors are used, located adjacent to each other. Device physics dictates that a pair of bipolar transistors will always match more precisely than CMOS transistors of a comparable size. Bipolar technology has historically been more widely used for full-custom analog design because of its improved precision. Despite its poorer analog properties, the use of CMOS technology for analog functions is increasing. There are two reasons for this. The first reason is that CMOS is now by far the most widely available IC technology. Many more CMOS ASICs and CMOS standard products are now being manufactured than bipolar ICs. The second reason is that increased levels of integration require mixing analog and digital functions on the same IC: this has forced designers to find ways to use CMOS technology to implement analog functions. Circuit designers, using clever new techniques, have been very successful in finding new ways to design analog CMOS circuits that can approach the accuracy of bipolar analog designs.
1.1.2 Standard-Cell-Based ASICs

A cell-based ASIC (cell-based IC, or CBIC; a common term in Japan, pronounced sea-bick) uses predesigned logic cells (AND gates, OR gates, multiplexers, and flip-flops, for example) known as standard cells. We could apply the term CBIC to any IC that uses cells, but it is generally accepted that a cell-based ASIC or CBIC means a standard-cell-based ASIC.
The standard-cell areas (also called flexible blocks) in a CBIC are built of rows of standard cells, like a wall built of bricks. The standard-cell areas may be used in combination with larger predesigned cells, perhaps microcontrollers or even microprocessors, known as megacells. Megacells are also called megafunctions, full-custom blocks, system-level macros (SLMs), fixed blocks, cores, or Functional Standard Blocks (FSBs). The ASIC designer defines only the placement of the standard cells and the interconnect in a CBIC. However, the standard cells can be placed anywhere on the silicon; this means that all the mask layers of a CBIC are customized and are unique to a particular customer. The advantage of CBICs is that designers save time and money and reduce risk by using a predesigned, pretested, and precharacterized standard-cell library. In addition each standard cell can be optimized individually. During the design of the cell library each and every transistor in every standard cell can be chosen to maximize speed or minimize area, for example. The disadvantages are the time or expense of designing or buying the standard-cell library and the time needed to fabricate all layers of the ASIC for each new design. Figure 1.2 shows a CBIC (looking down on the die shown in Figure 1.1b, for example). The important features of this type of ASIC are as follows:
• All mask layers are customized: transistors and interconnect.
• Custom blocks can be embedded.
• Manufacturing lead time is about eight weeks.
FIGURE 1.2 A cell-based ASIC (CBIC) die with a single standard-cell area (a flexible block) together with four fixed blocks. The flexible block contains rows of standard cells. This is what you might see through a low-powered microscope looking down on the die of Figure 1.1(b). The small squares around the edge of the die are bonding pads that are connected to the pins of the ASIC package.
Each standard cell in the library is constructed using full-custom design methods, but you can use these predesigned and precharacterized circuits without having to do any full-custom design yourself. This design style gives you the same performance and flexibility advantages of a full-custom ASIC but reduces design time and reduces risk. Standard cells are designed to fit together like bricks in a wall. Figure 1.3 shows an example of a simple standard cell (it is simple in the sense that it is not maximized for density, but it is ideal for showing you its internal construction). Power and ground buses (VDD and GND or VSS) run horizontally on metal lines inside the cells.
FIGURE 1.3 Looking down on the layout of a standard cell. This cell would be approximately 25 microns wide on an ASIC with λ (lambda) = 0.25 microns (a micron is $10^{-6}$ m). Standard cells are stacked like bricks in a wall; the abutment box (AB) defines the edges of the brick. The difference between the bounding box (BB) and the AB is the area of overlap between the bricks. Power supplies (labeled VDD and GND) run horizontally inside a standard cell on a metal layer that lies above the transistor layers. Each different shaded and labeled pattern represents a different layer. This standard cell has center connectors (the three squares, labeled A1, B1, and Z) that allow the cell to connect to others. The layout was drawn using ROSE, a symbolic layout editor developed by Rockwell and Compass, and then imported into Tanner Research's L-Edit.
Standard-cell design allows the automation of the process of assembling an ASIC. Groups of standard cells fit horizontally together to form rows. The rows stack vertically to form flexible rectangular blocks (which you can reshape during design). You may then connect a flexible block built from several rows of standard cells to other standard-cell blocks or other full-custom logic blocks. For example, you might want to include a custom interface to a standard, predesigned microcontroller together with some memory. The microcontroller block may be a fixed-size megacell, you might generate the memory using a memory compiler, and the custom logic and memory controller will be built from flexible standard-cell blocks, shaped to fit in the empty spaces on the chip.
Both cell-based and gate-array ASICs use predefined cells, but there is a difference: we can change the transistor sizes in a standard cell to optimize speed and performance, but the device sizes in a gate array are fixed. This results in a trade-off in performance and area in a gate array at the silicon level. The trade-off between area and performance is made at the library level for a standard-cell ASIC.
Modern CMOS ASICs use two, three, or more levels (or layers) of metal for interconnect. This allows wires to cross over different layers in the same way that we use copper traces on different layers on a printed-circuit board. In a two-level metal CMOS technology, connections to the standard-cell inputs and outputs are usually made using the second level of metal (metal2, the upper level of metal) at the tops and bottoms of the cells. In a three-level metal technology, connections may be internal to the logic cell (as they are in Figure 1.3).
This allows for more sophisticated routing programs to take advantage of the extra metal layer to route interconnect over the top of the logic cells. We shall cover the details of routing ASICs in Chapter 17.
A connection that needs to cross over a row of standard cells uses a feedthrough. The term feedthrough can refer either to the piece of metal that is used to pass a signal through a cell or to a space in a cell waiting to be used as a feedthrough, which is very confusing. Figure 1.4 shows two feedthroughs: one in cell A.14 and one in cell A.23.
In both two-level and three-level metal technology, the power buses (VDD and GND) inside the standard cells normally use the lowest (closest to the transistors) layer of metal (metal1). The width of each row of standard cells is adjusted so that they may be aligned using spacer cells. The power buses, or rails, are then connected to additional vertical power rails using row-end cells at the aligned ends of each standard-cell block. If the rows of standard cells are long, then vertical power rails can also be run in metal2 through the cell rows using special power cells that just connect to VDD and GND. Usually the designer manually controls the number and width of the vertical power rails connected to the standard-cell blocks during physical design. A diagram of the power distribution scheme for a CBIC is shown in Figure 1.4.
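The row-alignment step described above (padding shorter rows with spacer cells so that all row ends line up) can be sketched as follows. This is a simplified illustration under stated assumptions: cell widths are given in arbitrary integer grid units, a single spacer width is available, and the function name is hypothetical. It is not the algorithm of any particular layout tool.

```python
from typing import List


def spacers_needed(rows: List[List[int]], spacer_width: int = 1) -> List[int]:
    """Return how many spacer cells each row needs to match the widest row.

    rows holds the widths of the standard cells in each row, in grid units
    (an assumed unit). Every row is padded up to the width of the widest row.
    """
    if not rows:
        return []
    row_widths = [sum(cells) for cells in rows]
    target = max(row_widths)
    counts = []
    for width in row_widths:
        gap = target - width
        if gap % spacer_width != 0:
            raise ValueError("gap is not a whole number of spacer cells")
        counts.append(gap // spacer_width)
    return counts


# Three rows with total widths 20, 16, and 18 grid units.
rows = [[4, 6, 10], [8, 8], [5, 5, 5, 3]]
print(spacers_needed(rows))  # [0, 4, 2]
```

Once the rows have equal width, row-end cells can connect the horizontal power rails at the aligned ends, as the text describes.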
FIGURE 1.4 Routing the CBIC (cell-based IC) shown in Figure 1.2. The use of regularly shaped standard cells, such as the one in Figure 1.3, from a library allows ASICs like this to be designed automatically. This ASIC uses two separate layers of metal interconnect (metal1 and metal2) running at right angles to each other (like traces on a printed-circuit board). Interconnections between logic cells use spaces (called channels) between the rows of cells. ASICs may have three (or more) layers of metal allowing the cell rows to touch with the interconnect running over the top of the cells.
All the mask layers of a CBIC are customized. This allows megacells (SRAM, a SCSI controller, or an MPEG decoder, for example) to be placed on the same IC with standard cells. Megacells are usually supplied by an ASIC or library company complete with behavioral models and some way to test them (a test strategy). ASIC library companies also supply compilers to generate flexible DRAM, SRAM, and ROM blocks. Since all mask layers on a standard-cell design are customized, memory design is more efficient and denser than for gate arrays.
For logic that operates on multiple signals across a data bus (a datapath, or DP), the use of standard cells may not be the most efficient ASIC design style. Some ASIC library companies provide a datapath compiler that automatically generates datapath logic. A datapath library typically contains cells such as adders, subtracters, multipliers, and simple arithmetic and logical units (ALUs). The connectors of datapath library cells are pitch-matched to each other so that they fit together. Connecting datapath cells to form a datapath usually, but not always, results in faster and denser layout than using standard cells or a gate array.
Standard-cell and gate-array libraries may contain hundreds of different logic cells, including combinational functions (NAND, NOR, AND, OR gates) with multiple inputs, as well as latches and flip-flops with different combinations of reset, preset and clocking options. The ASIC library company provides designers with a data book in paper or electronic form with all of the functional descriptions and timing information for each library element.
1.1.3 Gate-Array-Based ASICs

In a gate array (sometimes abbreviated to GA) or gate-array-based ASIC the transistors are predefined on the silicon wafer. The predefined pattern of transistors on a gate array is the base array, and the smallest element that is replicated to make the base array (like an M. C. Escher drawing, or tiles on a floor) is the base cell (sometimes called a primitive cell). Only the top few layers of metal, which define the interconnect between transistors, are defined by the designer using custom masks. To distinguish this type of gate array from other types of gate array, it is often called a masked gate array (MGA). The designer chooses from a gate-array library of predesigned and precharacterized logic cells. The logic cells in a gate-array library are often called macros. The reason for this is that the base-cell layout is the same for each logic cell, and only the interconnect (inside cells and between cells) is customized, so that there is a similarity between gate-array macros and a software macro. Inside IBM, gate-array macros are known as books (so that books are part of a library), but unfortunately this descriptive term is not very widely used outside IBM.
We can complete the diffusion steps that form the transistors and then stockpile wafers (sometimes we call a gate array a prediffused array for this reason). Since only the metal interconnections are unique to an MGA, we can use the stockpiled wafers for different customers as needed. Using wafers prefabricated up to the metallization steps reduces the time needed to make an MGA, the turnaround time, to a few days or at most a couple of weeks. The costs for all the initial fabrication steps for an MGA are shared among customers, and this reduces the cost of an MGA compared to a full-custom or standard-cell ASIC design. There are the following different types of MGA or gate-array-based ASICs:
• Channeled gate arrays.
• Channelless gate arrays.
• Structured gate arrays.
The hyphenation of these terms when they are used as adjectives explains their construction. For example, in the term channeled gate-array architecture, the gate array is channeled, as will be explained. There are two common ways of arranging (or arraying) the transistors on an MGA: in a channeled gate array we leave space between the rows of transistors for wiring; the routing on a channelless gate array uses rows of unused transistors. The channeled gate array was the first to be developed, but the channelless gate-array architecture is now more widely used. A structured (or embedded) gate array can be either channeled or channelless but it includes (or embeds) a custom block.
1.1.4 Channeled Gate Array

Figure 1.5 shows a channeled gate array. The important features of this type of MGA are:
• Only the interconnect is customized.
• The interconnect uses predefined spaces between rows of base cells.
• Manufacturing lead time is between two days and two weeks.
FIGURE 1.5 A channeled gate-array die. The spaces between rows of the base cells are set aside for interconnect.
A channeled gate array is similar to a CBIC: both use rows of cells separated by channels used for interconnect. One difference is that the space for interconnect between rows of cells is fixed in height in a channeled gate array, whereas the space between rows of cells may be adjusted in a CBIC.
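One way to picture the difference in channel height is to count routing tracks. The sketch below is an illustration under assumptions that are not in the text: channel height is measured in whole routing tracks, and a fixed channel that is too small simply fails to route. The function name is hypothetical.

```python
from typing import Optional


def channel_height(tracks_needed: int, fixed_capacity: Optional[int] = None) -> Optional[int]:
    """Return the channel height in tracks, or None if the wiring does not fit.

    fixed_capacity=None models a CBIC, where the channel is sized to the wiring.
    An integer fixed_capacity models a channeled gate array, where the channel
    height is set by the prefabricated base array.
    """
    if fixed_capacity is None:
        return tracks_needed       # CBIC: the channel is adjusted to fit
    if tracks_needed <= fixed_capacity:
        return fixed_capacity      # gate array: height is fixed, spare tracks go unused
    return None                    # gate array: the wiring does not fit this channel


print(channel_height(12))        # CBIC channel sized to 12 tracks
print(channel_height(12, 16))    # channeled gate array: 16 tracks, 4 unused
print(channel_height(20, 16))    # channeled gate array: None, does not fit
```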
1.1.5 Channelless Gate Array
Figure 1.6 shows a channelless gate array (also known as a channel-free gate array , sea-of-gates array , or SOG array). The important features of this type of MGA are as follows:
• Only some (the top few) mask layers are customized: the interconnect.
• Manufacturing lead time is between two days and two weeks.
FIGURE 1.6 A channelless gate-array or sea-of-gates (SOG) array die. The core area of the die is completely filled with an array of base cells (the base array).
The key difference between a channelless gate array and channeled gate array is that there are no predefined areas set aside for routing between cells on a channelless gate array. Instead we route over the top of the gate-array devices. We can do this because we customize the contact layer that defines the connections between metal1, the first layer of metal, and the transistors. When we use an area of transistors for routing in a channelless array, we do not make any contacts to the devices lying underneath; we simply leave the transistors unused. The logic density (the amount of logic that can be implemented in a given silicon area) is higher for channelless gate arrays than for channeled gate arrays. This is usually attributed to the difference in structure between the two types of array. In fact, the difference occurs because the contact mask is customized in a channelless gate array, but is not usually customized in a channeled gate array. This leads to denser cells in the channelless architectures. Customizing the contact layer in a channelless gate array allows us to increase the density of gate-array cells because we can route over the top of unused contact sites.
1.1.6 Structured Gate Array
An embedded gate array or structured gate array (also known as masterslice or masterimage ) combines some of the features of CBICs and MGAs. One of the disadvantages of the MGA is the fixed gate-array base cell. This makes the implementation of memory, for example, difficult and inefficient. In an embedded gate array we set aside some of the IC area and dedicate it to a specific function. This embedded area either can contain a different base cell that is more suitable for building memory cells, or it can contain a complete circuit block, such as a microcontroller. Figure 1.7 shows an embedded gate array. The important features of this type of MGA are the following:
• Only the interconnect is customized.
• Custom blocks (the same for each design) can be embedded.
• Manufacturing lead time is between two days and two weeks.
FIGURE 1.7 A structured or embedded gate-array die showing an embedded block in the upper left corner (a static random-access memory, for example). The rest of the die is filled with an array of base cells.
An embedded gate array gives the improved area efficiency and increased performance of a CBIC but with the lower cost and faster turnaround of an MGA. One disadvantage of an embedded gate array is that the embedded function is fixed. For example, if an embedded gate array contains an area set aside for a 32 k-bit memory, but we only need a 16 k-bit memory, then we may have to waste half of the embedded memory function. However, this may still be more efficient and cheaper than implementing a 32 k-bit memory using macros on a SOG array. ASIC vendors may offer several embedded gate array structures containing different
memory types and sizes as well as a variety of embedded functions. ASIC companies wishing to offer a wide range of embedded functions must ensure that enough customers use each different embedded gate array to give the cost advantages over a custom gate array or CBIC (the Sun Microsystems SPARCstation 1 described in Section 1.3 made use of LSI Logic embedded gate arrays, and the 10K and 100K series of embedded gate arrays were two of LSI Logic's most successful products).
1.1.7 Programmable Logic Devices

Programmable logic devices ( PLDs ) are standard ICs that are available in standard configurations from a catalog of parts and are sold in very high volume to many different customers. However, PLDs may be configured or programmed to create a part customized to a specific application, and so they also belong to the family of ASICs. PLDs use different technologies to allow programming of the device. Figure 1.8 shows a PLD and the following important features that all PLDs have in common:
• No customized mask layers or logic cells
• Fast design turnaround
• A single large block of programmable interconnect
• A matrix of logic macrocells that usually consist of programmable array logic followed by a flip-flop or latch
FIGURE 1.8 A programmable logic device (PLD) die. The macrocells typically consist of programmable array logic followed by a flip-flop or latch. The macrocells are connected using a large programmable interconnect block.
The simplest type of programmable IC is a read-only memory ( ROM ). The most
common types of ROM use a metal fuse that can be blown permanently (a programmable ROM or PROM ). An electrically programmable ROM , or EPROM , uses programmable MOS transistors whose characteristics are altered by applying a high voltage. You can erase an EPROM either by using another high voltage (an electrically erasable PROM , or EEPROM ) or by exposing the device to ultraviolet light ( UV-erasable PROM , or UVPROM ).
There is another type of ROM that can be placed on any ASIC: a mask-programmable ROM (mask-programmed ROM or masked ROM). A masked ROM is a regular array of transistors permanently programmed using custom mask patterns. An embedded masked ROM is thus a large, specialized, logic cell.
The same programmable technologies used to make ROMs can be applied to more flexible logic structures. By using the programmable devices in a large array of AND gates and an array of OR gates, we create a family of flexible and programmable logic devices called logic arrays . The company Monolithic Memories (bought by AMD) was the first to produce Programmable Array Logic (PAL , a registered trademark of AMD) devices that you can use, for example, as transition decoders for state machines. A PAL can also include registers (flip-flops) to store the current state information so that you can use a PAL to make a complete state machine.
Just as we have a mask-programmable ROM, we could place a logic array as a cell on a custom ASIC. This type of logic array is called a programmable logic array (PLA). There is a difference between a PAL and a PLA: a PLA has a programmable AND logic array, or AND plane , followed by a programmable OR logic array, or OR plane ; a PAL has a programmable AND plane and, in contrast to a PLA, a fixed OR plane. Depending on how the PLD is programmed, we can have an erasable PLD (EPLD), or mask-programmed PLD (sometimes called a masked PLD but usually just PLD).
The first PALs, PLAs, and PLDs were based on bipolar technology and used programmable fuses or links. CMOS PLDs usually employ floating-gate transistors (see Section 4.3, EPROM and EEPROM Technology).
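The AND-plane/OR-plane structure described above can be illustrated with a small software model. This is only a sketch: the function names, the two-input example, and the way product terms are encoded are all assumptions made for illustration and do not correspond to any real device or programming format.

```python
from itertools import product

def and_plane(inputs, product_terms):
    """Evaluate each product term (one AND gate per term).

    A term maps an input name to the value it must have; inputs that are
    absent from the term are not connected to that AND gate.
    """
    return [all(inputs[name] == want for name, want in term.items())
            for term in product_terms]

def or_plane(term_values, or_connections):
    """Each output ORs together the product terms it is connected to."""
    return [any(term_values[i] for i in terms) for terms in or_connections]

def logic_array(inputs, product_terms, or_connections):
    """AND plane followed by OR plane."""
    return or_plane(and_plane(inputs, product_terms), or_connections)

# Hypothetical programming: output 0 = A XOR B, output 1 = A AND B.
PRODUCT_TERMS = [
    {"A": True, "B": False},   # term 0: A AND (NOT B)
    {"A": False, "B": True},   # term 1: (NOT A) AND B
    {"A": True, "B": True},    # term 2: A AND B
]
# In a PLA this connection table is programmable; in a PAL it is fixed.
OR_CONNECTIONS = [[0, 1], [2]]

for a, b in product([False, True], repeat=2):
    outputs = logic_array({"A": a, "B": b}, PRODUCT_TERMS, OR_CONNECTIONS)
    print(a, b, outputs)
```

In this model a PLA lets the designer change both PRODUCT_TERMS and OR_CONNECTIONS, whereas a PAL would allow only PRODUCT_TERMS to be changed.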
1.1.8 Field-Programmable Gate Arrays
A step above the PLD in complexity is the field-programmable gate array ( FPGA ). There is very little difference between an FPGA and a PLD; an FPGA is usually just larger and more complex than a PLD. In fact, some companies that manufacture programmable ASICs call their products FPGAs and some call them complex PLDs . FPGAs are the newest member of the ASIC family and are rapidly growing in importance, replacing TTL in microelectronic systems. Even though an FPGA is a type of gate array, we do not consider the term gate-array-based ASICs to include FPGAs. This may change as FPGAs and MGAs start to look more alike. Figure 1.9 illustrates the essential characteristics of an FPGA:
• None of the mask layers are customized.
• A method for programming the basic logic cells and the interconnect.
• The core is a regular array of programmable basic logic cells that can implement combinational as well as sequential logic (flip-flops).
• A matrix of programmable interconnect surrounds the basic logic cells.
• Programmable I/O cells surround the core.
• Design turnaround is a few hours.
We shall examine these features in detail in Chapters 4–8.
FIGURE 1.9 A field-programmable gate array (FPGA) die. All FPGAs contain a regular structure of programmable basic logic cells surrounded by programmable interconnect. The exact type, size, and number of the programmable basic logic cells varies tremendously.
1.2 Design Flow

Figure 1.10 shows the sequence of steps to design an ASIC; we call this a design flow . The steps are listed below (numbered to correspond to the labels in Figure 1.10) with a brief description of the function of each step.
FIGURE 1.10 ASIC design flow.
1. Design entry. Enter the design into an ASIC design system, either using a hardware description language ( HDL ) or schematic entry .
2. Logic synthesis. Use an HDL (VHDL or Verilog) and a logic synthesis tool to produce a netlist : a description of the logic cells and their connections.
3. System partitioning. Divide a large system into ASIC-sized pieces.
4. Prelayout simulation. Check to see if the design functions correctly.
5. Floorplanning. Arrange the blocks of the netlist on the chip.
6. Placement. Decide the locations of cells in a block.
7. Routing. Make the connections between cells and blocks.
8. Extraction. Determine the resistance and capacitance of the interconnect.
9. Postlayout simulation. Check to see the design still works with the added loads of the interconnect.
Steps 1–4 are part of logical design , and steps 5–9 are part of physical design . There is some overlap. For example, system partitioning might be considered as either logical or physical design. To put it another way, when we are performing system partitioning we have to consider both logical and physical factors. Chapters 9–14 of this book are largely about logical design and Chapters 15–17 largely about physical design.
1.3 Case Study

Sun Microsystems released the SPARCstation 1 in April 1989. It is now an old design but a very important example because it was one of the first workstations to make extensive use of ASICs to achieve the following:
• Better performance at lower cost
• Compact size, reduced power, and quiet operation
• Reduced number of parts, easier assembly, and improved reliability
The SPARCstation 1 contains about 50 ICs on the system motherboard, excluding the DRAM used for the system memory (standard parts). The SPARCstation 1 designers partitioned the system into the nine ASICs shown in Table 1.1 and wrote specifications for each ASIC; this took about three months (see note 1). LSI Logic and Fujitsu designed the SPARC integer unit (IU) and floating-point unit ( FPU ) to these specifications. The clock ASIC is a fairly straightforward design and, of the six remaining ASICs, the video controller/data buffer, the RAM controller, and the direct memory access ( DMA ) controller are defined by the 32-bit system bus ( SBus ) and the other ASICs that they connect to. The rest of the system is partitioned into three more ASICs: the cache controller , memory-management unit (MMU), and the data buffer. These three ASICs, with the IU and FPU, have the most critical timing paths and determine the system partitioning. The design of ASICs 3–8 in Table 1.1 took five Sun engineers six months after the specifications were complete. During the design process, the Sun engineers simulated the entire SPARCstation 1, including execution of the Sun operating system (SunOS).
TABLE 1.1 The ASICs in the Sun Microsystems SPARCstation 1.
     SPARCstation 1 ASIC                        Gates (k-gates)
1    SPARC integer unit (IU)                    20
2    SPARC floating-point unit (FPU)            50
3    Cache controller                           9
4    Memory-management unit (MMU)               5
5    Data buffer                                3
6    Direct memory access (DMA) controller      9
7    Video controller/data buffer               4
8    RAM controller                             1
9    Clock generator                            1

Table 1.2 shows the software tools used to design the SPARCstation 1, many of which are now obsolete. The important point to notice, though, is that there is a lot more to microelectronic system design than designing the ASICs: less than one-third of the tools listed in Table 1.2 were ASIC design tools.

TABLE 1.2 The CAD tools used in the design of the Sun Microsystems SPARCstation 1.
Design level         Function                Tool (see note 2)
ASIC design          ASIC physical design    LSI Logic
                     ASIC logic synthesis    Internal tools and UC Berkeley tools
                     ASIC simulation         LSI Logic
Board design         Schematic capture       Valid Logic
                     PCB layout              Valid Logic Allegro
                     Timing verification     Quad Design Motive and internal tools
Mechanical design    Case and enclosure      Autocad
                     Thermal analysis        Pacific Numerix
                     Structural analysis     Cosmos
Management           Scheduling              Suntrac
                     Documentation           Interleaf and FrameMaker
The SPARCstation 1 cost about $9000 in 1989 or, since it has an execution rate of approximately 12 million instructions per second (MIPS), $750/MIPS. Using ASIC technology reduces the motherboard to about the size of a piece of paper (8.5 inches by 11 inches) with a power consumption of about 12 W. The SPARCstation 1 pizza box is 16 inches across and 3 inches high, smaller than a typical IBM-compatible personal computer in 1989. This speed, power, and size performance is (there are still SPARCstation 1s in use) made possible by using ASICs. We shall return to the SPARCstation 1, to look more closely at the partitioning step, in Section 15.3, System Partitioning.

1. Some information in Section 1.3 and Section 15.3 is from the SPARCstation 10 Architecture Guide, May 1992, p. 2 and pp. 27–28 and from two publicity brochures (known as sparkle sheets). The first is Concept to System: How Sun Microsystems Created SPARCstation 1 Using LSI Logic's ASIC System Technology, A. Bechtolsheim, T. Westberg, M. Insley, and J. Ludemann of Sun Microsystems; J-H. Huang and D. Boyle of LSI Logic. This is an LSI Logic publication. The second paper is SPARCstation 1: Beyond the 3M Horizon, A. Bechtolsheim and E. Frank, a Sun Microsystems publication. I did not include these as references since they are impossible to obtain now, but I would like to give credit to Andy Bechtolsheim and the Sun Microsystems and LSI Logic engineers.
2. Names are trademarks of their respective companies.
1.4 Economics of ASICs

In this section we shall discuss the economics of using ASICs in a product and compare the most popular types of ASICs: an FPGA, an MGA, and a CBIC. To make an economic comparison between these alternatives, we consider the ASIC itself as a product and examine the components of product cost: fixed costs and variable costs. Making cost comparisons is dangerous: costs change rapidly and the semiconductor industry is notorious for keeping its costs, prices, and pricing strategy closely guarded secrets. The figures in the following sections are approximate and used to illustrate the different components of cost.
1.4.1 Comparison Between ASIC Technologies

The most obvious economic factor in making a choice between the different ASIC types is the part cost . Part costs vary enormously: you can pay anywhere from a few dollars to several hundreds of dollars for an ASIC. In general, however, FPGAs are more expensive per gate than MGAs, which are, in turn, more expensive than CBICs. For example, a 0.5 µm, 20 k-gate array might cost 0.01–0.02 cents/gate (for more than 10,000 parts) or $2–$4 per part, but an equivalent FPGA might be $20. The price per gate for an FPGA to implement the same function is typically 2–5 times the cost of an MGA or CBIC. Given that an FPGA is more expensive than an MGA, which is more expensive than a CBIC, when and why does it make sense to choose a more expensive part? Is the increased flexibility of an FPGA worth the extra cost per part? Given that an MGA or CBIC is specially tailored for each customer, there are extra hidden costs associated with this step that we should consider. To make a true comparison between the different ASIC technologies, we shall quantify some of these costs.
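The conversion from a price per gate to a part cost in the gate-array example above is simple arithmetic, sketched here for checking. The function name is an assumption made for illustration; the gate count and the range of cents per gate are the ones quoted in the text.

```python
def part_cost_dollars(gates, cents_per_gate):
    """Part cost in dollars for a given gate count and price per gate."""
    return gates * cents_per_gate / 100.0

# 20 k-gate array at 0.01 to 0.02 cents per gate (figures from the text).
low = part_cost_dollars(20_000, 0.01)
high = part_cost_dollars(20_000, 0.02)
print(f"${low:.2f} to ${high:.2f} per part")
```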
1.4.2 Product Cost

The total cost of any product can be separated into fixed costs and variable costs :

total product cost = fixed product cost + variable product cost × products sold     (1.1)

Fixed costs are independent of sales volume (the number of products sold). However, the fixed costs amortized per product sold (fixed costs divided by products sold) decrease as sales volume increases. Variable costs include the cost of the parts used in the product, assembly costs, and other manufacturing costs. Let us look more closely at the parts in a product. If we want to buy ASICs to assemble our product, the total part cost is

total part cost = fixed part cost + variable cost per part × volume of parts     (1.2)

Our fixed cost when we use an FPGA is low: we just have to buy the software and any programming equipment. The fixed part costs for an MGA or CBIC are higher and include the costs of the masks, simulation, and test program development. We shall discuss these extra costs in more detail in Sections 1.4.3 and 1.4.4. Figure 1.11 shows a break-even graph that compares the total part cost for an FPGA, MGA, and a CBIC with the following assumptions:
• FPGA fixed cost is $21,800, part cost is $39.
• MGA fixed cost is $86,000, part cost is $10.
• CBIC fixed cost is $146,000, part cost is $8.
At low volumes, the MGA and the CBIC are more expensive because of their higher fixed costs. The total part costs of two alternative types of ASIC are equal at the break-even volume . In Figure 1.11 the break-even volume for the FPGA and the MGA is about 2000 parts. The break-even volume between the FPGA and the CBIC is about 4000 parts. The break-even volume between the MGA and the CBIC is higher, at about 20,000 parts.
FIGURE 1.11 A break-even analysis for an FPGA, a masked gate array (MGA) and a custom cell-based ASIC (CBIC). The break-even volume between two technologies is the point at which the total costs of parts are equal. These numbers are very approximate.
We shall describe how to calculate the fixed part costs next. Following that we shall discuss how we came up with cost per part of $39, $10, and $8 for the FPGA, MGA, and CBIC.
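The break-even volumes can be estimated directly from the total part cost relation (fixed part cost plus cost per part times volume) by setting the totals of two technologies equal. The sketch below does this with the fixed and per-part costs listed in the assumptions for Figure 1.11; the function and variable names are illustrative only.

```python
from itertools import combinations

# (fixed part cost in dollars, cost per part in dollars), from the text.
TECHNOLOGIES = {
    "FPGA": (21_800, 39),
    "MGA": (86_000, 10),
    "CBIC": (146_000, 8),
}

def total_part_cost(name, volume):
    """Total part cost = fixed part cost + cost per part * volume."""
    fixed, per_part = TECHNOLOGIES[name]
    return fixed + per_part * volume

def break_even_volume(a, b):
    """Volume at which technologies a and b have the same total part cost."""
    fixed_a, part_a = TECHNOLOGIES[a]
    fixed_b, part_b = TECHNOLOGIES[b]
    return (fixed_b - fixed_a) / (part_a - part_b)

for a, b in combinations(TECHNOLOGIES, 2):
    print(f"{a} vs {b}: about {break_even_volume(a, b):,.0f} parts")
```

With these inputs the straight-line model gives roughly 2200 parts (FPGA versus MGA), 4000 parts (FPGA versus CBIC), and 30,000 parts (MGA versus CBIC). The first two agree with the values read from Figure 1.11; the last is higher than the figure of about 20,000 quoted in the text, which is in keeping with the caption's warning that the numbers are very approximate.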
1.4.3 ASIC Fixed Costs

Figure 1.12 shows a spreadsheet, Fixed Costs, that calculates the fixed part costs associated with ASIC design.
FIGURE 1.12 A spreadsheet, Fixed Costs, for a field-programmable gate array (FPGA), a masked gate array (MGA), and a cell-based ASIC (CBIC). These costs can vary wildly.
The training cost includes the cost of the time to learn any new electronic design automation ( EDA ) system. For example, a new FPGA design system might require a few days to learn; a new gate-array or cell-based design system might require taking a course. Figure 1.12 assumes that the cost of an engineer (including overhead, benefits, infrastructure, and so on) is between $100,000 and $200,000 per year or $2000 to $4000 per week (in the United States in 1990s dollars).
Next we consider the hardware and software cost for ASIC design. Figure 1.12 shows some typical figures, but you can spend anywhere from $1000 to $1 million (and more) on ASIC design software and the necessary infrastructure.
We try to measure productivity of an ASIC designer in gates (or transistors) per day. This is like trying to predict how long it takes to dig a hole, and the number of gates per day an engineer averages varies wildly. ASIC design productivity must increase as ASIC sizes increase and will depend on experience, design tools, and the ASIC complexity. If we are using similar design methods, design productivity ought to be independent of the type of ASIC, but FPGA design software is usually available as a complete bundle on a PC. This means that it is often easier to learn and use than semicustom ASIC design tools.
Every ASIC has to pass a production test to make sure that it works. With modern test tools the generation of any test circuits on each ASIC that are needed for production testing can be automatic, but it still involves a cost for design for test . An FPGA is tested by the manufacturer before it is sold to you and before you program it. You are still paying for testing an FPGA, but it is a hidden cost folded into the part cost of the FPGA. You do have to pay for any programming costs for an FPGA, but we can include these in the hardware and software cost.
The nonrecurring-engineering ( NRE ) charge includes the cost of work done by the ASIC vendor and the cost of the masks. The production test uses sets of test inputs called test vectors , often many thousands of them. Most ASIC vendors require simulation to generate test vectors and test programs for production testing, and will charge for a test-program development cost . The number of masks required by an ASIC during fabrication can range from three or four (for a gate array) to 15 or more (for a CBIC). Total mask costs can range from $5000 to $50,000 or more. The total NRE charge can range from $10,000 to $300,000 or more and will vary with volume and the size of the ASIC. If you commit to high volumes (above 100,000 parts), the vendor may waive the NRE charge. The NRE charge may also include the costs of software tools, design verification, and prototype samples. If your design does not work the first time, you have to complete a further design pass ( turn or spin ) that requires additional NRE charges.
Normally you sign a contract (sign off a design) with an ASIC vendor that guarantees first-pass success; this means that if you designed your ASIC according to rules specified by the vendor, then the vendor guarantees that the silicon will perform according to the simulation or you get your money back. This is why the difference between semicustom and full-custom design styles is so important: the ASIC vendor will not (and cannot) guarantee your design will work if you use any full-custom design techniques. Nowadays it is almost routine to have an ASIC work on the first pass. However, if your design does fail, it is little consolation to have a second pass for free if your company goes bankrupt in the meantime. Figure 1.13 shows a profit model that represents the profit flow during the product lifetime . Using this model, we can estimate the lost profit due to any delay.
FIGURE 1.13 A profit model. If a product is introduced on time, the total sales are $60 million (the area of the higher triangle). With a three-month (one fiscal quarter) delay the sales decline to $25 million. The difference is shown as the shaded area between the two triangles and amounts to a lost revenue of $35 million.
Suppose we have the following situation:
• The product lifetime is 18 months (6 fiscal quarters).
• The product sales increase (linearly) at $10 million per quarter independently of when the product is introduced (we suppose this is because we can increase production and sales only at a fixed rate).
• The product reaches its peak sales at a point in time that is independent of when we introduce a product (because of external market factors that we cannot control).
• The product declines in sales (linearly) to the end of its life, a point in time that is also independent of when we introduce the product (again due to external market forces).
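The assumptions above describe a triangular sales curve whose area is the total sales. The sketch below reproduces the $60 million and $25 million figures quoted for Figure 1.13. One parameter is an inference rather than a number stated in the text: the peak is placed two quarters after the on-time introduction, because that is the peak time for which the on-time triangle has an area of $60 million. The function name and parameters are illustrative only.

```python
def total_sales(delay_quarters, ramp=10.0, peak_quarter=2.0, end_quarter=6.0):
    """Total sales in $ million for a product introduced delay_quarters late.

    ramp         : sales growth in $ million per quarter, per quarter (from the text)
    peak_quarter : time of peak sales, fixed by the market (inferred, see lead-in)
    end_quarter  : end of product life, fixed by the market (6 quarters, from the text)
    """
    if delay_quarters >= peak_quarter:
        return 0.0  # the product misses the growth phase entirely in this model
    peak_sales = ramp * (peak_quarter - delay_quarters)  # height of the triangle
    selling_time = end_quarter - delay_quarters          # base of the triangle
    return 0.5 * selling_time * peak_sales

on_time = total_sales(0)
one_quarter_late = total_sales(1)
print(on_time, one_quarter_late, on_time - one_quarter_late)
```

The delayed product loses sales twice over: it sells for a shorter time and it never reaches the same peak, which is why a one-quarter slip costs more than half of the total sales in this model.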
The simple profit and revenue model of Figure 1.13 shows us that we would lose $35 million in sales in this situation due to a 3-month delay. Despite the obvious problems with such a simple model (how can we introduce the same product twice to compare the performance?), it is widely used in marketing. In the electronics industry product lifetimes continue to shrink. In the PC industry it is not unusual to have a product lifetime of 18 months or less. This means that it is critical to achieve a rapid design time (or high product velocity ) with no delays.
The last fixed cost shown in Figure 1.12 corresponds to an insurance policy. When a company buys an ASIC part, it needs to be assured that it will always have a backup source, or second source , in case something happens to its first or primary source. Established FPGA companies have a second source that produces equivalent parts. With a custom ASIC you may have to do some redesign to transfer your ASIC to the second source. However, for all ASIC types, switching production to a second source will involve some cost. Figure 1.12 assumes a second-source cost of $2000 for all types of ASIC (the amount may be substantially more than this).
1.4.4 ASIC Variable Costs

Figure 1.14 shows a spreadsheet, Variable Costs, that calculates some example part costs. This spreadsheet uses the terms and parameters defined below the figure.
FIGURE 1.14 A spreadsheet, Variable Costs, to calculate the part cost (that is the variable cost for a product using ASICs) for different ASIC technologies.
• The wafer size increases every few years. From 1985 to 1990, 4-inch to 6-inch diameter wafers were common; equipment using 6-inch to 8-inch wafers was introduced between 1990 and 1995; the next step is the 300 mm or 12-inch wafer. The 12-inch wafer will probably take us to 2005.
• The wafer cost depends on the equipment costs, process costs, and overhead in the fabrication line. A typical wafer cost is between $1000 and $5000, with $2000 being average; the cost declines slightly during the life of a process and increases only slightly from one process generation to the next.
• Moore's Law (after Gordon Moore of Intel) models the observation that the number of transistors on a chip roughly doubles every 18 months. Not all designs follow this law, but a large ASIC design seems to grow by a factor of 10 every 5 years (close to Moore's Law). In 1990 a large ASIC design size was 10 k-gate, in 1995 a large design was about 100 k-gate, in 2000 it will be 1 M-gate, in 2005 it will be 10 M-gate.
• The gate density is the number of gate equivalents per unit area (remember: a gate equivalent, or gate, corresponds to a two-input NAND gate).
• The gate utilization is the percentage of gates that are on a die that we can use (on a gate array we waste some gate space for interconnect).
• The die size is determined by the design size (in gates), the gate density, and the utilization of the die.
• The number of die per wafer depends on the die size and the wafer size (we have to pack rectangular or square die, together with some test chips, on to a circular wafer so some space is wasted).
• The defect density is a measure of the quality of the fabrication process. The smaller the defect density the less likely there is to be a flaw on any one die. A single defect on a die is almost always fatal for that die. Defect density usually increases with the number of steps in a process. A defect density of less than 1 cm⁻² (one defect per square centimeter) is typical and required for a submicron CMOS process.
• The yield of a process is the key to a profitable ASIC company. The yield is the fraction of die on a wafer that are good (expressed as a percentage). Yield depends on the complexity and maturity of a process. A process may start out with a yield of close to zero for complex chips, which then climbs to above 50 percent within the first few months of production. Within a year the yield has to be brought to around 80 percent for the average complexity ASIC for the process to be profitable. Yields of 90 percent or more are not uncommon.
• The die cost is determined by wafer cost, number of die per wafer, and the yield. Of these parameters, the most variable and the most critical to control is the yield.
• The profit margin (what you sell a product for, less what it costs you to make it, divided by the cost) is determined by the ASIC company's fixed and variable costs. ASIC vendors that make and sell custom ASICs have huge fixed and variable costs associated with building and running fabrication facilities (a fabrication plant is a fab ). FPGA companies are typically fabless (they do not own a fab); they must pass on the costs of the chip manufacture (plus the profit margin of the chip manufacturer) and the development cost of the FPGA structure in the FPGA part cost. The profitability of any company in the ASIC business varies greatly.
• The price per gate (usually measured in cents per gate) is determined by die costs and design size. It varies with design size and declines over time.
• The part cost is determined by all of the preceding factors. As such it will vary widely with time, process, yield, economic climate, ASIC size and complexity, and many other factors.
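The way these parameters combine can be sketched in a few lines. The relations below follow the definitions in the list (die size from design size, gate density, and utilization; die cost from wafer cost, die per wafer, and yield), but the specific formulas and all example numbers are assumptions made for illustration. In particular, the gate density of 100,000 gates per square centimeter and the usable_fraction allowance for wasted wafer area are hypothetical, and this is not the spreadsheet of Figure 1.14. The last function checks the remark that a tenfold growth every 5 years is close to Moore's Law.

```python
import math

def die_area_cm2(design_gates, gates_per_cm2, utilization):
    """Die area needed for a design, given gate density and utilization."""
    return design_gates / (gates_per_cm2 * utilization)

def dies_per_wafer(wafer_diameter_cm, die_area, usable_fraction=0.85):
    """Rough die count: usable wafer area divided by die area.

    usable_fraction is a hypothetical allowance for edge loss and test chips.
    """
    wafer_area = math.pi * (wafer_diameter_cm / 2) ** 2
    return math.floor(usable_fraction * wafer_area / die_area)

def die_cost(wafer_cost, n_die, yield_fraction):
    """Cost of each good die: wafer cost spread over the good die only."""
    return wafer_cost / (n_die * yield_fraction)

def moore_growth(years, doubling_months=18):
    """Growth factor after a number of years if size doubles every 18 months."""
    return 2 ** (12 * years / doubling_months)

# Hypothetical example: 100 k-gate design on a 6-inch (15.24 cm) wafer.
gates = 100_000
area = die_area_cm2(gates, gates_per_cm2=100_000, utilization=0.8)
n_die = dies_per_wafer(15.24, area)
cost = die_cost(wafer_cost=2000, n_die=n_die, yield_fraction=0.8)
print(f"die area {area:.2f} cm^2, {n_die} die per wafer")
print(f"die cost ${cost:.2f}, {100 * cost / gates:.4f} cents per gate")
print(f"growth over 5 years at Moore's Law rate: {moore_growth(5):.1f}x")
```

Because the yield divides the wafer cost, halving the yield doubles the die cost, which is why yield is described as the most critical parameter to control.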
As an estimate you can assume that the price per gate for any process technology falls at about 20 % per year during its life (the average life of a CMOS process is 2–4 years, and can vary widely). Beyond the life of a process, prices can increase as demand falls and the fabrication equipment becomes harder to maintain. Figure 1.15 shows the price per gate for the different ASICs and process technologies using the following assumptions:
• For any new process technology the price per gate decreases by 40 % in the first year, 30 % in the second year, and then remains constant.
• A new process technology is introduced approximately every 2 years, with feature size decreasing by a factor of two every 5 years as follows: 2 µm in 1985, 1.5 µm in 1987, 1 µm in 1989, 0.8–0.6 µm in 1991–1993, 0.5–0.35 µm in 1996–1997, 0.25–0.18 µm in 1998–2000.
• CBICs and MGAs are introduced at approximately the same time and price.
• The price of a new process technology is initially 10 % above the process that it replaces.
• FPGAs are introduced one year after CBICs that use the same process technology.
• The initial FPGA price (per gate) is 10 percent higher than the initial price for CBICs or MGAs using the same process technology.
From Figure 1.15 you can see that the successive introduction of new process technologies every 2 years drives the price per gate down at a rate close to 30 percent per year. The cost figures that we have used in this section are very approximate and can vary widely (this means they may be off by a factor of 2 but probably are correct within a factor of 10). ASIC companies do use spreadsheet models like these to calculate their costs.
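The rate of close to 30 percent per year can be checked against the stated assumptions with a short calculation. The sketch below makes one interpretation explicit: a new process is taken to start 10 % above the price that the process it replaces has reached after its two years of decline. The names and structure are illustrative and are not taken from the spreadsheet behind Figure 1.15.

```python
YEAR_1_DROP = 0.40           # price decrease in the first year (from the text)
YEAR_2_DROP = 0.30           # price decrease in the second year (from the text)
INTRO_PREMIUM = 0.10         # new process starts 10 % above the one it replaces
YEARS_BETWEEN_PROCESSES = 2  # a new process roughly every 2 years

def price_after(years_since_intro, intro_price):
    """Price per gate of one process some whole years after its introduction."""
    price = intro_price
    if years_since_intro >= 1:
        price *= 1 - YEAR_1_DROP
    if years_since_intro >= 2:
        price *= 1 - YEAR_2_DROP
    return price  # constant after the second year

def intro_prices(first_intro_price, generations):
    """Introductory price per gate for successive process generations."""
    prices = [first_intro_price]
    for _ in range(generations - 1):
        mature = price_after(YEARS_BETWEEN_PROCESSES, prices[-1])
        prices.append(mature * (1 + INTRO_PREMIUM))
    return prices

prices = intro_prices(1.0, 4)
ratio = prices[1] / prices[0]
annual_decline = 1 - ratio ** (1 / YEARS_BETWEEN_PROCESSES)
print(f"generation-to-generation price ratio: {ratio:.3f}")
print(f"equivalent annual decline: {100 * annual_decline:.0f} % per year")
```

Under this interpretation each generation is introduced at about 46 % of the introductory price of the previous one, which corresponds to a compound decline of a little over 30 percent per year.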
FIGURE 1.15 Example price per gate figures.
Having decided if, and then which, ASIC technology is appropriate, you need to choose the appropriate cell library. Next we shall discuss the issues surrounding ASIC cell libraries: the different types, their sources, and their contents.
1.5 ASIC Cell Libraries

The cell library is the key part of ASIC design. For a programmable ASIC the FPGA company supplies you with a library of logic cells in the form of a design kit ; you normally do not have a choice, and the cost is usually a few thousand dollars. For MGAs and CBICs you have three choices: the ASIC vendor (the company that will build your ASIC) will supply a cell library, or you can buy a cell library from a third-party library vendor , or you can build your own cell library.
The first choice, using an ASIC-vendor library , requires you to use a set of design tools approved by the ASIC vendor to enter and simulate your design. You have to buy the tools, and the cost of the cell library is folded into the NRE. Some ASIC vendors (especially for MGAs) supply tools that they have developed in-house. For some reason the more common model in Japan is to use tools supplied by the ASIC vendor, but in the United States, Europe, and elsewhere designers want to choose their own tools. Perhaps this has to do with the relationship between customer and supplier being a lot closer in Japan than it is elsewhere.
An ASIC vendor library is normally a phantom library : the cells are empty boxes, or phantoms , but contain enough information for layout (for example, you would only see the bounding box or abutment box in a phantom version of the cell in Figure 1.3). After you complete layout you hand off a netlist to the ASIC vendor, who fills in the empty boxes ( phantom instantiation ) before manufacturing your chip.
The second and third choices require you to make a buy-or-build decision . If you complete an ASIC design using a cell library that you bought, you also own the masks (the tooling ) that are used to manufacture your ASIC. This is called customer-owned tooling ( COT , pronounced see-oh-tee). A library vendor normally develops a cell library using information about a process supplied by an ASIC foundry . An ASIC foundry (in contrast to an ASIC vendor) only provides manufacturing, with no design help. If the cell library meets the foundry specifications, we call this a qualified cell library . These cell libraries are normally expensive (possibly several hundred thousand dollars), but if a library is qualified at several foundries this allows you to shop around for the most attractive terms. This means that buying an expensive library can be cheaper in the long run than the other solutions for high-volume production.
The third choice is to develop a cell library in-house. Many large computer and electronics companies make this choice. Most of the cell libraries designed today are still developed in-house despite the fact that the process of library development is complex and very expensive. However created, each cell in an ASIC cell library must contain the following:
• A physical layout
• A behavioral model
• A Verilog/VHDL model
• A detailed timing model
• A test strategy
• A circuit schematic
• A cell icon
• A wire-load model
• A routing model
For MGA and CBIC cell libraries we need to complete cell design and cell layout and shall discuss this in Chapter 2. The ASIC designer may not actually see the layout if it is hidden inside a phantom, but the layout will be needed eventually. In a programmable ASIC the cell layout is part of the programmable ASIC design (see Chapter 4).
The ASIC designer needs a high-level, behavioral model for each cell because simulation at the detailed timing level takes too long for a complete ASIC design. For a NAND gate a behavioral model is simple. A multiport RAM model can be very complex. We shall discuss behavioral models when we describe Verilog and VHDL in Chapter 10 and Chapter 11. The designer may require Verilog and VHDL models in addition to the models for a particular logic simulator.
ASIC designers also need a detailed timing model for each cell to determine the performance of the critical pieces of an ASIC. It is too difficult, too time-consuming, and too expensive to build every cell in silicon and measure the cell delays. Instead library engineers simulate the delay of each cell, a process known as characterization . Characterizing a standard-cell or gate-array library involves circuit extraction from the full-custom cell layout for each cell. The extracted schematic includes all the parasitic resistance and capacitance elements. Then library engineers perform a simulation of each cell including the parasitic elements to determine the switching delays. The simulation models for the transistors are derived from measurements on special chips included on a wafer called process control monitors ( PCMs ) or dropins . Library engineers then use the results of the circuit simulation to generate detailed timing models for logic simulation. We shall cover timing models in Chapter 13.
All ASICs need to be production tested (programmable ASICs may be tested by the manufacturer before they are customized, but they still need to be tested). Simple cells in small or medium-size blocks can be tested using automated techniques, but large blocks such as RAM or multipliers need a planned strategy. We shall discuss test in Chapter 14.
The cell schematic (a netlist description) describes each cell so that the cell designer can perform simulation for complex cells. You may not need the detailed cell schematic for all cells, but you need enough information to compare what you think is on the silicon (the schematic) with what is actually on the silicon (the layout); this is a layout versus schematic ( LVS ) check.
If the ASIC designer uses schematic entry, each cell needs a cell icon together with connector and naming information that can be used by design tools from different vendors. We shall cover ASIC design using schematic entry in Chapter 9. One of the advantages of using logic synthesis (Chapter 12) rather than schematic design entry is eliminating the problems with icons, connectors, and cell names. Logic synthesis also makes moving an ASIC between different cell libraries, or retargeting , much easier. In order to estimate the parasitic capacitance of wires before we actually complete any
routing, we need a statistical estimate of the capacitance for a net in a given size circuit block. This usually takes the form of a look-up table known as a wire-load model. We also need a routing model for each cell. Large cells are too complex for the physical design or layout tools to handle directly and we need a simpler representation, a phantom, of the physical layout that still contains all the necessary information. The phantom may include information that tells the automated routing tool where it can and cannot place wires over the cell, as well as the location and types of the connections to the cell.
CMOS LOGIC
A CMOS transistor (or device) has four terminals: gate, source, drain, and a fourth terminal that we shall ignore until the next section. A CMOS transistor is a switch. The switch must be conducting or on to allow current to flow between the source and drain terminals (using open and closed for switches is confusing, for the same reason we say a tap is on and not that it is closed). The transistor source and drain terminals are equivalent as far as digital signals are concerned; we do not worry about labeling an electrical switch with two terminals.
• $V_{AB}$ is the potential difference, or voltage, between nodes A and B in a circuit; $V_{AB}$ is positive if node A is more positive than node B. Italics denote variables; constants are set in roman (upright) type. Uppercase letters denote DC, large-signal, or steady-state voltages. For TTL the positive power supply is called VCC ($V_{CC}$). The 'C' denotes that the supply is connected indirectly to the collectors of the npn bipolar transistors (a bipolar transistor has a collector, base, and emitter, corresponding roughly to the drain, gate, and source of an MOS transistor). Following the example of TTL we used VDD ($V_{DD}$) to denote the positive supply in an NMOS chip where the devices are all n-channel transistors and the drains of these devices are connected indirectly to the positive supply. The supply nomenclature for NMOS chips has stuck for CMOS. VDD is the name of the power supply node or net; $V_{DD}$ represents the value (uppercase since $V_{DD}$ is a DC quantity). Since $V_{DD}$ is a variable, it is italic (words and multiletter abbreviations use roman; thus it is $V_{DD}$, but $V_{\mathrm{drain}}$).
• Logic designers often call the CMOS negative supply VSS or $V_{SS}$ even if it is actually ground or GND. I shall use VSS for the node and $V_{SS}$ for the value. CMOS uses positive logic: VDD is logic '1' and VSS is logic '0'.
We turn a transistor on or off using the gate terminal. There are two kinds of CMOS transistors: n-channel transistors and p-channel transistors. An n-channel transistor requires a logic '1' (from now on I'll just say a '1') on the gate to make the switch conducting (to turn the transistor on). A p-channel transistor requires a logic '0' (again from now on, I'll just say a '0') on the gate to make the switch conducting (to turn the transistor on). The p-channel transistor symbol has a bubble on its gate to remind us that the gate has to be a '0' to turn the transistor on. All this is shown in Figure 2.1(a) and (b).
FIGURE 2.1 CMOS transistors as switches. (a) An n-channel transistor. (b) A p-channel transistor. (c) A CMOS inverter and its symbol (an equilateral triangle and a circle).
If we connect an n-channel transistor in series with a p-channel transistor, as shown in Figure 2.1(c), we form an inverter. With four transistors we can form a two-input NAND gate (Figure 2.2a). We can also make a two-input NOR gate (Figure 2.2b). Logic designers normally use the terms NAND gate and logic gate (or just gate), but I shall try to use the terms NAND cell and logic cell rather than NAND gate or logic gate in this chapter to avoid any possible confusion with the gate terminal of a transistor.
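The switch picture above can be turned into a small truth-table model. The following is a minimal sketch, not a circuit simulator: the function names are made up for illustration, and the series/parallel arrangement of the four transistors in the NAND and NOR cells is the usual static CMOS arrangement, assumed here because Figure 2.2 is not reproduced in this text.

```python
def nmos_on(gate: int) -> bool:
    """An n-channel switch conducts when its gate is '1'."""
    return gate == 1


def pmos_on(gate: int) -> bool:
    """A p-channel switch conducts when its gate is '0' (the bubble)."""
    return gate == 0


def resolve(pull_up: bool, pull_down: bool) -> int:
    """Return the output level given which switch network conducts."""
    # In a static CMOS cell exactly one of the two networks conducts.
    if pull_up == pull_down:
        raise ValueError("output is floating or shorted")
    return 1 if pull_up else 0


def inverter(a: int) -> int:
    # The p-channel switch connects the output to VDD, the n-channel to VSS.
    return resolve(pmos_on(a), nmos_on(a))


def nand2(a: int, b: int) -> int:
    # Assumed topology: p-channel pair in parallel, n-channel pair in series.
    return resolve(pmos_on(a) or pmos_on(b), nmos_on(a) and nmos_on(b))


def nor2(a: int, b: int) -> int:
    # Assumed topology: p-channel pair in series, n-channel pair in parallel.
    return resolve(pmos_on(a) and pmos_on(b), nmos_on(a) or nmos_on(b))


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, nand2(a, b), nor2(a, b))
```

For every input combination exactly one network conducts, which is why `resolve` never raises for these three cells.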
FIGURE 2.2 CMOS logic. (a) A two-input NAND logic cell. (b) A two-input NOR logic cell. The n-channel and p-channel transistor switches implement the '1's and '0's of a Karnaugh map.
2.1 CMOS Transistors
2.2 The CMOS Process
2.3 CMOS Design Rules
2.4 Combinational Logic Cells
2.5 Sequential Logic Cells
2.6 Datapath Logic Cells
2.7 I/O Cells
2.8 Cell Compilers
2.9 Summary
2.10 Problems
2.11 Bibliography
2.12 References
2.1 CMOS Transistors

Figure 2.3 illustrates how electrons and holes abandon their dopant atoms leaving a depletion region around a transistor's source and drain. The region between source and drain is normally nonconducting. To make an n-channel transistor conducting, we must apply a positive voltage $V_{GS}$ (the gate voltage with respect to the source) that is greater than the n-channel transistor threshold voltage, $V_{tn}$ (a typical value is 0.5 V and, as far as we are presently concerned, is a constant). This establishes a thin (about 50 Å) conducting channel of electrons under the gate. MOS transistors can carry a very small current (the subthreshold current, a few microamperes or less) with $V_{GS} < V_{tn}$, but we shall ignore this. A transistor can be conducting ($V_{GS} > V_{tn}$) without any current flowing. To make current flow in an n-channel transistor we must also apply a positive voltage, $V_{DS}$, to the drain with respect to the source. Figure 2.3 shows these connections and the connection to the fourth terminal of an MOS transistor: the bulk (well, tub, or substrate) terminal. For an n-channel transistor we must connect the bulk to the most negative potential, GND or VSS, to reverse bias the bulk-to-drain and bulk-to-source pn-diodes. The arrow in the four-terminal n-channel transistor symbol in Figure 2.3 reflects the polarity of these pn-diodes.
FIGURE 2.3 An n-channel MOS transistor. The gate-oxide thickness, $T_{ox}$, is approximately 100 angstroms (0.01 μm). A typical transistor length, $L = 2\lambda$. The bulk may be either the substrate or a well. The diodes represent pn-junctions that must be reverse-biased.
The current flowing in the transistor is
current (amperes) = charge (coulombs) per unit time (second). (2.1)
We can express the current in terms of the total charge in the channel, $Q$ (imagine taking a picture and counting the number of electrons in the channel at that instant). If $t_f$ (for time of flight, sometimes called the transit time) is the time that it takes an electron to cross between source and drain, the drain-to-source current, $I_{DSn}$, is
$$ I_{DSn} = Q / t_f . \tag{2.2} $$
We need to find $Q$ and $t_f$. The velocity of the electrons $\mathbf{v}$ (a vector) is given by the equation that forms the basis of Ohm's law:
$$ \mathbf{v} = -\mu_n \mathbf{E} , \tag{2.3} $$
where $\mu_n$ is the electron mobility ($\mu_p$ is the hole mobility) and $\mathbf{E}$ is the electric field (with units V m⁻¹). Typical carrier mobility values are $\mu_n$ = 500–1000 cm² V⁻¹ s⁻¹ and $\mu_p$ = 100–400 cm² V⁻¹ s⁻¹. Equation 2.3 is a vector equation, but we shall ignore the vertical electric field and concentrate on the horizontal electric field, $E_x$, that moves the electrons between source and drain. The horizontal component of the electric field is $E_x = -V_{DS}/L$, directed from the drain to the source, where $L$ is the channel length (see Figure 2.3). The electrons travel a distance $L$ with horizontal velocity $v_x = -\mu_n E_x$, so that
$$ t_f = \frac{L}{v_x} = \frac{L^2}{\mu_n V_{DS}} . \tag{2.4} $$
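As a quick numerical check of the transit-time expression in Eq. 2.4, here is a minimal sketch. The mobility (550 cm² V⁻¹ s⁻¹, inside the typical range quoted above), the 6 μm channel length, and the 3 V drain-source voltage are illustrative assumptions, and `transit_time` is a hypothetical helper name.

```python
def transit_time(length_m: float, mobility_m2_per_vs: float, v_ds: float) -> float:
    """Time of flight t_f = L^2 / (mu_n * V_DS), Eq. 2.4, in seconds."""
    if v_ds <= 0.0:
        raise ValueError("Eq. 2.4 needs a positive drain-source voltage")
    return length_m ** 2 / (mobility_m2_per_vs * v_ds)


# Assumed example values (not taken from a specific process):
L_CHANNEL = 6e-6      # 6 um channel length, in metres
MU_N = 550 * 1e-4     # 550 cm^2/(V s) converted to m^2/(V s)
V_DS = 3.0            # volts

t_f = transit_time(L_CHANNEL, MU_N, V_DS)
print(f"t_f = {t_f * 1e12:.0f} ps")
```

Because $t_f$ scales with $L^2$, halving the channel length in this simple model cuts the transit time by a factor of four.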
Next we find the channel charge, $Q$. The channel and the gate form the plates of a capacitor, separated by an insulator, the gate oxide. We know that the charge on a linear capacitor, $C$, is $Q = CV$. Our lower plate, the channel, is not a linear conductor. Charge only appears on the lower plate when the voltage between the gate and the channel, $V_{GC}$, exceeds the n-channel threshold voltage. For our nonlinear capacitor we need to modify the equation for a linear capacitor to the following:
$$ Q = C ( V_{GC} - V_{tn} ) . \tag{2.5} $$
The lower plate of our capacitor is resistive and conducting current, so that the potential in the channel, $V_{GC}$, varies. In fact, $V_{GC} = V_{GS}$ at the source and $V_{GC} = V_{GS} - V_{DS}$ at the drain. What we really should do is find an expression for the channel charge as a function of channel voltage and sum (integrate) the charge all the way across the channel, from $x = 0$ (at the source) to $x = L$ (at the drain). Instead we shall assume that the channel voltage, $V_{GC}(x)$, is a linear function of distance from the source and take the average value of the charge, which is thus
$$ Q = C [ ( V_{GS} - V_{tn} ) - 0.5 V_{DS} ] . \tag{2.6} $$
The gate capacitance, $C$, is given by the formula for a parallel-plate capacitor with length $L$, width $W$, and plate separation equal to the gate-oxide thickness, $T_{ox}$. Thus the gate capacitance is
$$ C = \frac{W L \epsilon_{ox}}{T_{ox}} = W L C_{ox} , \tag{2.7} $$
where $\epsilon_{ox}$ is the gate-oxide dielectric permittivity. For silicon dioxide, SiO₂, $\epsilon_{ox} \approx 3.45 \times 10^{-11}$ F m⁻¹, so that, for a typical gate-oxide thickness of 100 Å (1 Å = 1 angstrom = 0.1 nm), the gate capacitance per unit area, $C_{ox} \approx 3$ fF μm⁻². Now we can express the channel charge in terms of the transistor parameters,
$$ Q = W L C_{ox} [ ( V_{GS} - V_{tn} ) - 0.5 V_{DS} ] . \tag{2.8} $$
Finally, the drain-source current is
$$ I_{DSn} = Q / t_f = (W/L) \mu_n C_{ox} [ ( V_{GS} - V_{tn} ) - 0.5 V_{DS} ] V_{DS} = (W/L) k'_n [ ( V_{GS} - V_{tn} ) - 0.5 V_{DS} ] V_{DS} . \tag{2.9} $$
The constant $k'_n$ is the process transconductance parameter (or intrinsic transconductance):
$$ k'_n = \mu_n C_{ox} . \tag{2.10} $$
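The chain from oxide thickness to drain current (Eqs. 2.7, 2.9, and 2.10) can be followed numerically. This is a sketch under stated assumptions: the permittivity and 100 Å oxide thickness come from the text, while the mobility value, the W/L ratio of 10, and the bias point are illustrative choices, and the function names are hypothetical.

```python
EPS_OX = 3.45e-11  # F/m, gate-oxide permittivity quoted in the text


def c_ox(t_ox_m: float) -> float:
    """Gate capacitance per unit area in F/m^2 (Eq. 2.7 divided by W*L)."""
    return EPS_OX / t_ox_m


def ids_linear(w_over_l: float, k_prime: float,
               v_gs: float, v_tn: float, v_ds: float) -> float:
    """Linear-region drain current in amperes (Eq. 2.9)."""
    if v_gs <= v_tn:
        return 0.0  # transistor off; subthreshold current is ignored
    if not 0.0 <= v_ds <= v_gs - v_tn:
        raise ValueError("Eq. 2.9 only holds for 0 <= V_DS <= V_GS - V_tn")
    return w_over_l * k_prime * ((v_gs - v_tn) - 0.5 * v_ds) * v_ds


T_OX = 100e-10          # 100 angstroms, in metres
MU_N = 550 * 1e-4       # assumed mobility: 550 cm^2/(V s) in m^2/(V s)

cox = c_ox(T_OX)        # 1 F/m^2 = 1000 fF/um^2, so this is about 3.45 fF/um^2
k_prime_n = MU_N * cox  # Eq. 2.10, in A/V^2
i_ds = ids_linear(10.0, k_prime_n, v_gs=3.0, v_tn=0.65, v_ds=1.0)

print(f"C_ox = {cox * 1e3:.2f} fF/um^2")
print(f"k'_n = {k_prime_n * 1e6:.0f} uA/V^2")
print(f"I_DS = {i_ds * 1e3:.2f} mA")
```

The range check in `ids_linear` reflects the limit of validity of the linear-region expression; beyond $V_{DS} = V_{GS} - V_{tn}$ a saturation model is needed.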
We also define $\beta_n$, the transistor gain factor (or just gain factor) as
$$ \beta_n = k'_n (W/L) . \tag{2.11} $$
The factor $W/L$ (transistor width divided by length) is the transistor shape factor. Equation 2.9 describes the linear region (or triode region) of operation. This equation is valid until $V_{DS} = V_{GS} - V_{tn}$ and then predicts that $I_{DS}$ decreases with increasing $V_{DS}$, which does not make physical sense. At $V_{DS} = V_{GS} - V_{tn} = V_{DS(sat)}$ (the saturation voltage) there is no longer enough voltage between the gate and the drain end of the channel to support any channel charge. Clearly a small amount of charge remains or the current would go to zero, but with very little free charge the channel resistance in a small region close to the drain increases rapidly and any further increase in $V_{DS}$ is dropped over this region. Thus for $V_{DS} > V_{GS} - V_{tn}$ (the saturation region, or pentode region, of operation) the drain current $I_{DS}$ remains approximately constant at the saturation current, $I_{DSn(sat)}$, where
$$ I_{DSn(sat)} = ( \beta_n / 2 ) ( V_{GS} - V_{tn} )^2 ; \quad V_{GS} > V_{tn} . \tag{2.12} $$
Figure 2.4 shows the n-channel transistor $I_{DS}$–$V_{DS}$ characteristics for a generic 0.5 μm CMOS process that we shall call G5. We can fit Eq. 2.12 to the long-channel transistor characteristics (W = 60 μm, L = 6 μm) in Figure 2.4(a). If $I_{DSn(sat)}$ = 2.5 mA (with $V_{DS}$ = 3.0 V, $V_{GS}$ = 3.0 V, $V_{tn}$ = 0.65 V, $T_{ox}$ = 100 Å), the intrinsic transconductance is
$$ k'_n = \frac{2 (L/W) I_{DSn(sat)}}{( V_{GS} - V_{tn} )^2} = \frac{2 (6/60) (2.5 \times 10^{-3})}{(3.0 - 0.65)^2} = 9.05 \times 10^{-5}\ \mathrm{A\,V^{-2}} \tag{2.13} $$
or approximately 90 μA V⁻². This value of $k'_n$, calculated in the saturation region, will be different (typically lower by a factor of 2 or more) from the value of $k'_n$ measured in the linear region. We assumed the mobility, $\mu_n$, and the threshold voltage, $V_{tn}$, are constants, neither of which is true, as we shall see in Section 2.1.2.
For the p-channel transistor in the G5 process, $I_{DSp(sat)}$ = −850 μA ($V_{DS}$ = −3.0 V, $V_{GS}$ = −3.0 V, $V_{tp}$ = −0.85 V, W = 60 μm, L = 6 μm). Then
$$ k'_p = \frac{2 (L/W) ( -I_{DSp(sat)} )}{( V_{GS} - V_{tp} )^2} = \frac{2 (6/60) (850 \times 10^{-6})}{(-3.0 - (-0.85))^2} = 3.68 \times 10^{-5}\ \mathrm{A\,V^{-2}} . \tag{2.14} $$
The next section explains the signs in Eq. 2.14.
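The two transconductance extractions (Eqs. 2.13 and 2.14) are easy to reproduce. The sketch below uses the G5 values quoted in the text; `k_prime_from_sat` is a hypothetical helper, and taking the magnitude of the current is one way to cover the negative p-channel current with the same function.

```python
def k_prime_from_sat(i_sat: float, w: float, l: float,
                     v_gs: float, v_t: float) -> float:
    """Intrinsic transconductance k' = 2 (L/W) |I_sat| / (V_GS - V_t)^2.

    Rearranged from the square-law saturation current (Eq. 2.12). The
    magnitude of i_sat is used so that the same function handles the
    negative p-channel current (the -I_DSp(sat) term in Eq. 2.14).
    """
    return 2.0 * (l / w) * abs(i_sat) / (v_gs - v_t) ** 2


# n-channel, long-channel G5 transistor (values from the text)
k_n = k_prime_from_sat(2.5e-3, w=60e-6, l=6e-6, v_gs=3.0, v_t=0.65)

# p-channel G5 transistor: current and voltages are negative
k_p = k_prime_from_sat(-850e-6, w=60e-6, l=6e-6, v_gs=-3.0, v_t=-0.85)

print(f"k'_n = {k_n:.3e} A/V^2")
print(f"k'_p = {k_p:.3e} A/V^2")
```

The ratio `k_n / k_p` comes out near 2.5, which reflects the lower hole mobility noted earlier.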
FIGURE 2.4 MOS n-channel transistor characteristics for a generic 0.5 μm process (G5). (a) A short-channel transistor, with W = 6 μm and L = 0.6 μm (drawn), and a long-channel transistor (W = 60 μm, L = 6 μm). (b) The 6/0.6 characteristics represented as a surface. (c) A long-channel transistor obeys a square-law characteristic between $I_{DS}$ and $V_{GS}$ in the saturation region ($V_{DS}$ = 3 V). A short-channel transistor shows a more linear characteristic due to velocity saturation. Normally, all of the transistors used on an ASIC have short channels.
2.1.1 P-Channel Transistors

The source and drain of CMOS transistors look identical; we have to know which way the current is flowing to distinguish them. The source of an n-channel transistor is lower in potential than the drain and vice versa for a p-channel transistor. In an n-channel transistor the threshold voltage, $V_{tn}$, is normally positive, and the terminal voltages $V_{DS}$ and $V_{GS}$ are also usually positive. In a p-channel transistor $V_{tp}$ is normally negative and we have a choice: We can write everything in terms of the magnitudes of the voltages and currents or we can use negative signs in a consistent fashion. Here are the equations for a p-channel transistor using negative signs:
$$ I_{DSp} = -k'_p (W/L) [ ( V_{GS} - V_{tp} ) - 0.5 V_{DS} ] V_{DS} ; \quad V_{DS} > V_{GS} - V_{tp} $$
$$ I_{DSp(sat)} = -( \beta_p / 2 ) ( V_{GS} - V_{tp} )^2 ; \quad V_{DS} < V_{GS} - V_{tp} . \tag{2.15} $$
In these two equations $V_{tp}$ is negative, and the terminal voltages $V_{DS}$ and $V_{GS}$ are also normally negative (and −3 V < −2 V, for example). The current $I_{DSp}$ is then negative, corresponding to conventional current flowing from source to drain of a p-channel transistor (and hence the negative sign for $I_{DSp(sat)}$ in Eq. 2.14).
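The sign convention of Eq. 2.15 is easy to get wrong, so a small sketch may help. It assumes the simple square-law model only, with $V_{DS} \le 0$, and reuses the G5 p-channel numbers from Eq. 2.14 ($k'_p$ = 3.68 × 10⁻⁵ A V⁻², W/L = 10); `ids_p` is a hypothetical name.

```python
def ids_p(w_over_l: float, k_prime_p: float,
          v_gs: float, v_tp: float, v_ds: float) -> float:
    """p-channel drain current (Eq. 2.15) using negative signs throughout.

    v_gs, v_tp and v_ds are all expected to be zero or negative.
    """
    v_ov = v_gs - v_tp            # negative when the transistor is on
    if v_ov >= 0.0:
        return 0.0                # off; subthreshold current is ignored
    if v_ds > v_ov:               # linear region: V_DS > V_GS - V_tp
        return -k_prime_p * w_over_l * (v_ov - 0.5 * v_ds) * v_ds
    # saturation region: V_DS < V_GS - V_tp
    return -0.5 * k_prime_p * w_over_l * v_ov ** 2


K_PRIME_P = 3.68e-5   # A/V^2, from Eq. 2.14
i_sat = ids_p(10.0, K_PRIME_P, v_gs=-3.0, v_tp=-0.85, v_ds=-3.0)
i_lin = ids_p(10.0, K_PRIME_P, v_gs=-3.0, v_tp=-0.85, v_ds=-1.0)

print(f"I_DSp(sat) = {i_sat * 1e6:.0f} uA")
print(f"I_DSp(lin) = {i_lin * 1e6:.0f} uA")
```

Both currents come out negative, and the saturation value is close to the −850 μA used to derive $k'_p$, as expected from running Eq. 2.14 in reverse.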
2.1.2 Velocity Saturation

For a deep submicron transistor, Eq. 2.12 may overestimate the drain-source current by a factor of 2 or more. There are three reasons for this error. First, the threshold voltage is not constant. Second, the actual length of the channel (the electrical or effective length, often written as $L_{eff}$) is less than the drawn (mask) length. The third reason is that Eq. 2.3 is not valid for high electric fields. The electrons cannot move any faster than about $v_{max\,n} = 10^5$ m s⁻¹ when the electric field is above 10⁶ V m⁻¹ (reached when 1 V is dropped across 1 μm); the electrons become velocity saturated. In this case $t_f = L_{eff} / v_{max\,n}$, the drain-source saturation current is independent of the transistor length, and Eq. 2.12 becomes
$$ I_{DSn(sat)} = W v_{max\,n} C_{ox} ( V_{GS} - V_{tn} ) ; \quad V_{DS} > V_{DS(sat)} \ \text{(velocity saturated)} . \tag{2.16} $$
We can see this behavior for the short-channel transistor characteristics in Figure 2.4(a) and (c). Transistor current is often specified per micron of gate width because of the form of Eq. 2.16. As an example, suppose $I_{DSn(sat)} / W$ = 300 μA μm⁻¹ for the n-channel transistors in our G5 process (with $V_{DS}$ = 3.0 V, $V_{GS}$ = 3.0 V, $V_{tn}$ = 0.65 V, $L_{eff}$ = 0.5 μm, and $T_{ox}$ = 100 Å). Then $E_x \approx -(3 - 0.65)$ V / 0.5 μm ≈ −5 V μm⁻¹,
$$ v_{max\,n} = \frac{I_{DSn(sat)} / W}{C_{ox} ( V_{GS} - V_{tn} )} = \frac{(300 \times 10^{-6}) (1 \times 10^{6})}{(3.45 \times 10^{-3}) (3 - 0.65)} = 37{,}000\ \mathrm{m\,s^{-1}} \tag{2.17} $$
and $t_f \approx$ 0.5 μm / 37,000 m s⁻¹ ≈ 13 ps. The value for $v_{max\,n}$ is lower than the 10⁵ m s⁻¹ we expected because the carrier velocity is also lowered by mobility degradation due to the vertical electric field, which we have ignored. This vertical field forces the carriers to keep bumping into the interface between the silicon and the gate oxide, slowing them down.
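The velocity-saturation example (Eq. 2.17) involves a unit conversion that is worth spelling out: a current of 300 μA per μm of gate width is 300 A per metre. A minimal sketch with the values from the text follows; the function name is hypothetical.

```python
def v_max_from_current(i_per_width: float, c_ox: float,
                       v_gs: float, v_tn: float) -> float:
    """Saturated carrier velocity in m/s, rearranged from Eq. 2.16.

    i_per_width is I_DSn(sat)/W in A/m and c_ox is in F/m^2.
    """
    return i_per_width / (c_ox * (v_gs - v_tn))


I_PER_WIDTH = 300e-6 / 1e-6   # 300 uA per um of width, expressed in A/m
C_OX = 3.45e-3                # F/m^2 for a 100 angstrom gate oxide
L_EFF = 0.5e-6                # effective channel length in metres

v_max = v_max_from_current(I_PER_WIDTH, C_OX, v_gs=3.0, v_tn=0.65)
t_f = L_EFF / v_max           # time of flight when velocity saturated

print(f"v_max = {v_max:.0f} m/s")
print(f"t_f   = {t_f * 1e12:.1f} ps")
```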
2.1.3 SPICE Models
The simulation program SPICE (which stands for Simulation Program with Integrated Circuit Emphasis) is often used to characterize logic cells. Table 2.1 shows a typical set of model parameters for our G5 process. The SPICE parameter KP (given in A V⁻²) corresponds to $k'_n$ (and $k'_p$). SPICE parameters VTO and TOX correspond to $V_{tn}$ (and $V_{tp}$), and $T_{ox}$. SPICE parameter UO (given in cm² V⁻¹ s⁻¹) corresponds to the ideal bulk mobility values, $\mu_n$ (and $\mu_p$). Many of the other parameters model velocity saturation and mobility degradation (and thus the effective value of $k'_n$ and $k'_p$).
TABLE 2.1 SPICE parameters for a generic 0.5 μm process, G5 (0.6 μm drawn gate length). The n-channel transistor characteristics are shown in Figure 2.4.
.MODEL CMOSN NMOS LEVEL=3 PHI=0.7 TOX=10E-09 XJ=0.2U TPG=1 VTO=0.65 DELTA=0.7
+ LD=5E-08 KP=2E-04 UO=550 THETA=0.27 RSH=2 GAMMA=0.6 NSUB=1.4E+17 NFS=6E+11
+ VMAX=2E+05 ETA=3.7E-02 KAPPA=2.9E-02 CGDO=3.0E-10 CGSO=3.0E-10 CGBO=4.0E-10
+ CJ=5.6E-04 MJ=0.56 CJSW=5E-11 MJSW=0.52 PB=1
.MODEL CMOSP PMOS LEVEL=3 PHI=0.7 TOX=10E-09 XJ=0.2U TPG=-1 VTO=-0.92 DELTA=0.29
+ LD=3.5E-08 KP=4.9E-05 UO=135 THETA=0.18 RSH=2 GAMMA=0.47 NSUB=8.5E+16 NFS=6.5E+11
+ VMAX=2.5E+05 ETA=2.45E-02 KAPPA=7.96 CGDO=2.4E-10 CGSO=2.4E-10 CGBO=3.8E-10
+ CJ=9.3E-04 MJ=0.47 CJSW=2.9E-10 MJSW=0.505 PB=1
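Since KP corresponds to $k'$ and UO to the bulk mobility, Eq. 2.10 suggests a cross-check between the Table 2.1 parameters: KP should be close to UO multiplied by the oxide capacitance per unit area derived from TOX. The sketch below performs that check in Python; the permittivity value is the one quoted earlier in the chapter, and the helper name is made up for illustration. It says nothing about how any particular simulator treats these parameters internally.

```python
EPS_OX = 3.45e-11  # F/m, gate-oxide permittivity from the text


def kp_from_mobility(uo_cm2_per_vs: float, tox_m: float) -> float:
    """Estimate KP in A/V^2 as mobility times C_ox (Eq. 2.10)."""
    mobility = uo_cm2_per_vs * 1e-4   # cm^2/(V s) -> m^2/(V s)
    return mobility * EPS_OX / tox_m


# Parameters copied from the Table 2.1 model cards
kp_n_est = kp_from_mobility(550.0, 10e-9)   # CMOSN: UO=550, TOX=10E-09
kp_p_est = kp_from_mobility(135.0, 10e-9)   # CMOSP: UO=135, TOX=10E-09

for name, est, listed in (("CMOSN", kp_n_est, 2e-4),
                          ("CMOSP", kp_p_est, 4.9e-5)):
    error = abs(est - listed) / listed
    print(f"{name}: estimated KP = {est:.3e}, listed KP = {listed:.1e}, "
          f"difference = {error:.1%}")
```

Both estimates land within about 5 percent of the listed KP values, which suggests the table entries are mutually consistent.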
2.1.4 Logic Levels

Figure 2.5 shows how to use transistors as logic switches. The bulk connection for the n-channel transistor in Figure 2.5(a–b) is a p-well. The bulk connection for the p-channel transistor is an n-well. The remaining connections show what happens when we try to pass a logic signal between the drain and source terminals.
FIGURE 2.5 CMOS logic levels. (a) A strong '0'. (b) A weak '1'. (c) A weak '0'. (d) A strong '1'. ($V_{tn}$ is positive and $V_{tp}$ is negative.) The depth of the channels is greatly exaggerated.
In Figure 2.5(a) we apply a logic '1' (or $V_{DD}$; I shall use these interchangeably) to the gate and a logic '0' ($V_{SS}$) to the source (we know it is the source since electrons must flow from this point, since $V_{SS}$ is the lowest voltage on the chip). The application of these voltages makes the n-channel transistor conduct current, and electrons flow from source to drain.
Suppose the drain is initially at logic '1'; then the n-channel transistor will begin to discharge any capacitance that is connected to its drain (due to another logic cell, for example). This will continue until the drain terminal reaches a logic '0', and at that time $V_{GD}$ and $V_{GS}$ are both equal to $V_{DD}$, a full logic '1'. The transistor is strongly conducting now (with a large channel charge, $Q$, but there is no current flowing since $V_{DS}$ = 0 V). The transistor will strongly object to attempts to change its drain terminal from a logic '0'. We say that the logic level at the drain is a strong '0'.
In Figure 2.5(b) we apply a logic '1' to the drain (it must now be the drain since electrons have to flow toward a logic '1'). The situation is now quite different: the transistor is still on but $V_{GS}$ is decreasing as the source voltage approaches its final value. In fact, the source terminal never gets to a logic '1'; the source will stop increasing in voltage when $V_{GS}$ reaches $V_{tn}$. At this point the transistor is very nearly off and the source voltage creeps slowly up to $V_{DD} - V_{tn}$. Because the transistor is very nearly off, it would be easy for a logic cell connected to the source to change the potential there, since there is so little channel charge. The logic level at the source is a weak '1'. Figure 2.5(c–d) shows that the state of affairs for a p-channel transistor is the exact reverse or complement of the n-channel transistor situation. In summary, we have the following logic levels:
• An n-channel transistor provides a strong '0', but a weak '1'.
• A p-channel transistor provides a strong '1', but a weak '0'.
Sometimes we refer to the weak versions of '0' and '1' as degraded logic levels. In CMOS technology we can use both types of transistor together to produce strong '0' logic levels as well as strong '1' logic levels.
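The strong and weak levels can be expressed as simple limits on the voltage a single pass transistor can deliver. The sketch below is a first-order model only: it assumes constant threshold voltages (no body effect), takes the p-channel limit as the complement of the n-channel case described above (output no lower than $V_{SS} - V_{tp}$, with $V_{tp}$ negative), and uses hypothetical function names and the G5-like example values 3.0 V, 0.65 V, and −0.85 V.

```python
def n_pass(v_in: float, v_dd: float, v_tn: float) -> float:
    """Final output of an n-channel pass transistor with its gate at VDD.

    The output follows the input but stops rising at V_DD - V_tn.
    """
    return min(v_in, v_dd - v_tn)


def p_pass(v_in: float, v_ss: float, v_tp: float) -> float:
    """Final output of a p-channel pass transistor with its gate at VSS.

    The output follows the input but stops falling at V_SS - V_tp
    (V_tp is negative, so this limit is above V_SS).
    """
    return max(v_in, v_ss - v_tp)


V_DD, V_SS = 3.0, 0.0
V_TN, V_TP = 0.65, -0.85   # assumed constant thresholds

print("n-channel passing '0':", n_pass(V_SS, V_DD, V_TN))   # strong '0'
print("n-channel passing '1':", n_pass(V_DD, V_DD, V_TN))   # weak '1'
print("p-channel passing '1':", p_pass(V_DD, V_SS, V_TP))   # strong '1'
print("p-channel passing '0':", p_pass(V_SS, V_SS, V_TP))   # weak '0'
```

Using one transistor of each type in parallel lets each cover the other's weak level, which is the point made in the closing sentence above.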
2.2 The CMOS Process

Figure 2.6 outlines the steps to create an integrated circuit. The starting material is silicon, Si, refined from quartzite (with less than 1 impurity in 10¹⁰ silicon atoms). We draw a single-crystal silicon boule (or ingot) from a crucible containing a melt at approximately 1500 °C (the melting point of silicon at 1 atm pressure is 1414 °C). This method is known as Czochralski growth. Acceptor (p-type) or donor (n-type) dopants may be introduced into the melt to alter the type of silicon grown. The boule is sawn to form thin circular wafers (6, 8, or 12 inches in diameter, and typically 600 μm thick), and a flat is ground (the primary flat), perpendicular to the <110> crystal axis, as a "this edge down" indication. The boule is drawn so that the wafer surface is either in the (111) or (100) crystal planes. A smaller secondary flat indicates the wafer crystalline orientation and doping type. A typical submicron CMOS process uses p-type (100) wafers with a resistivity of approximately 10 Ω cm; this type of wafer has two flats, 90° apart. Wafers are made by chemical companies and sold to the IC manufacturers. A blank 8-inch wafer costs about $100.
To begin IC fabrication we place a batch of wafers (a wafer lot) on a boat and grow a layer (typically a few thousand angstroms) of silicon dioxide, SiO₂, using a furnace. Silicon is used in the semiconductor industry not so much for the properties of silicon, but because of the physical, chemical, and electrical properties of its native oxide, SiO₂. An IC fabrication process contains a series of masking steps (that in turn contain other steps) to create the layers that define the transistors and metal interconnect.
FIGURE 2.6 IC fabrication. Grow crystalline silicon (1); make a wafer (2–3); grow a silicon dioxide (oxide) layer in a furnace (4); apply liquid photoresist (resist) (5); mask exposure (6); a cross-section through a wafer showing the developed resist (7); etch the oxide layer (8); ion implantation (9–10); strip the resist (11); strip the oxide (12). Steps similar to 4–12 are repeated for each layer (typically 12–20 times for a CMOS process).
Each masking step starts by spinning a thin layer (approximately 1 μm) of liquid photoresist (resist) onto each wafer. The wafers are baked at about 100 °C to remove the solvent and harden the resist before being exposed to ultraviolet (UV) light (typically less than 200 nm wavelength) through a mask. The UV light alters the structure of the resist, allowing it to be removed by developing. The exposed oxide may then be etched (removed). Dry plasma etching etches in the vertical direction much faster than it does horizontally (an anisotropic etch). Wet etch techniques are usually isotropic. The resist functions as a mask during the etch step and transfers the desired pattern to the oxide layer.
Dopant ions are then introduced into the exposed silicon areas. Figure 2.6 illustrates the use of ion implantation. An ion implanter is a cross between a TV and a mass spectrometer and fires dopant ions into the silicon wafer. Ions can only penetrate materials to a depth (the range, normally a few microns) that depends on the closely controlled implant energy (measured in keV, usually between 10 and 100 keV; an
electron volt, 1 eV, is 1.6 × 10⁻¹⁹ J). By using layers of resist, oxide, and polysilicon we can prevent dopant ions from reaching the silicon surface and thus block the silicon from receiving an implant. We control the doping level by counting the number of ions we implant (by integrating the ion-beam current). The implant dose is measured in atoms/cm² (typical doses are from 10¹³ to 10¹⁵ cm⁻²). As an alternative to ion implantation we may instead strip the resist and introduce dopants by diffusion from a gaseous source in a furnace.
Once we have completed the transistor diffusion layers we can deposit layers of other materials. Layers of polycrystalline silicon (polysilicon or poly), SiO₂, and silicon nitride (Si₃N₄), for example, may be deposited using chemical vapor deposition (CVD). Metal layers can be deposited using sputtering. All these layers are patterned using masks and similar photolithography steps to those shown in Figure 2.6.
TABLE 2.2 CMOS process layers.
Mask/layer name | Derivation from drawn layers | Alternative names for mask/layer | MOSIS mask label
n-well | = nwell ¹ | bulk, substrate, tub, n-tub, moat | CWN
p-well | = pwell ¹ | bulk, substrate, tub, p-tub, moat | CWP
active | = pdiff + ndiff | thin oxide, thinox, island, gate oxide | CAA
polysilicon | = poly | poly, gate | CPG
n-diffusion implant ² | = grow (ndiff) | ndiff, n-select, nplus, n+ | CSN
p-diffusion implant ² | = grow (pdiff) | pdiff, p-select, pplus, p+ | CSP
contact | = contact | contact cut, poly contact, diffusion contact | CCP and CCA ³
metal1 | = m1 | first-level metal | CMF
metal2 | = m2 | second-level metal | CMS
via2 | = via2 | metal2/metal3 via, m2/m3 via | CVS
metal3 | = m3 | third-level metal | CMT
glass | = glass | passivation, overglass, pad | COG
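The dose control described before Table 2.2 (counting ions by integrating the ion-beam current) amounts to dividing the collected charge by the charge per ion and the implanted area. The following is a minimal sketch under simplifying assumptions that are not from the text: singly charged ions, a uniformly scanned area, and beam current sampled at a fixed interval. The function name and the example numbers are hypothetical.

```python
Q_ELECTRON = 1.6e-19  # coulombs; same magnitude as 1 eV in joules


def implant_dose(beam_current_samples_a, dt_s: float, area_cm2: float,
                 charge_state: int = 1) -> float:
    """Implant dose in ions/cm^2 from sampled beam current.

    The integral of current over time is approximated by a rectangle-rule
    sum of the samples times the sampling interval dt_s.
    """
    total_charge = sum(beam_current_samples_a) * dt_s   # coulombs
    ions = total_charge / (charge_state * Q_ELECTRON)   # number of ions
    return ions / area_cm2


# Assumed example: a steady 1 uA beam for 160 s over a 100 cm^2 scanned area
samples = [1e-6] * 160          # one sample per second, in amperes
dose = implant_dose(samples, dt_s=1.0, area_cm2=100.0)

print(f"dose = {dose:.2e} ions/cm^2")
```

With these assumed numbers the dose falls at the low end of the typical 10¹³ to 10¹⁵ cm⁻² range quoted above.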
Table 2.2 shows the mask layers (and their relation to the drawn layers) for a submicron, silicon-gate, three-level metal, self-aligned, CMOS process. A process in which the effective gate length is less than 1 μm is referred to as a submicron process. Gate lengths below 0.35 μm are considered in the deep-submicron regime. Figure 2.7 shows the layers that we draw to define the masks for the logic cell of Figure 1.3. Potential confusion arises because we like to keep layout simple but maintain a "what you see is what you get" (WYSIWYG) approach. This means that the drawn layers do not correspond directly to the masks in all cases.
[Figure 2.7 panels: (a) nwell, (b) pwell, (c) ndiff, (d) pdiff, (e) poly, (f) contact, (g) m1, (h) via, (i) m2, (j) cell, (k) phantom]
FIGURE 2.7 The standard cell shown in Figure 1.3. (a)–(i) The drawn layers that define the masks. The active mask is the union of the ndiff and pdiff drawn layers. The n-diffusion implant and p-diffusion implant masks are bloated versions of the ndiff and pdiff drawn layers. (j) The complete cell layout. (k) The phantom cell layout. Often an ASIC vendor hides the details of the internal cell construction. The phantom cell is used for layout by the customer and then instantiated by the ASIC vendor after layout is complete. This layout uses grayscale stipple patterns to distinguish between layers.
We can construct wells in a CMOS process in several ways. In an n-well process, the substrate is p-type (the wafer itself) and we use an n-well mask to build the n-well. We do not need a p-well mask because there are no p-wells in an n-well process (the n-channel transistors all sit in the substrate, the wafer), but we often draw the p-well layer as though it existed. In a p-well process we use a p-well mask to make the p-wells and the n-wells are the substrate. In a twin-tub (or twin-well) process, we create individual wells for both types of transistors, and neither well is the substrate (which may be either n-type or p-type). There are even triple-well processes used to achieve even more control over the transistor performance. Whatever process we use, we must connect all the n-wells to the most positive potential on the chip, normally VDD, and all the p-wells to VSS; otherwise we may forward bias the bulk to source/drain pn-junctions. The bulk connections for CMOS transistors are not usually drawn in digital circuit schematics, but these substrate contacts (well contacts or tub ties) are very important. After we make the well(s),
we grow a layer (approximately 1500 Å) of Si₃N₄ over the wafer. The active mask (CAA) leaves this nitride layer only in the active areas that will later become transistors or substrate contacts. Thus
$$ \text{CAA (mask)} = \text{ndiff (drawn)} \lor \text{pdiff (drawn)} , \tag{2.18} $$
where the symbol ∨ represents OR (union) of the two drawn layers, ndiff and pdiff. Everything outside the active areas is known as the field region, or just field. Next we implant the substrate to prevent unwanted transistors from forming in the field region; this is the field implant or channel-stop implant. The nitride over the active areas acts as an implant mask and we may use another field-implant mask at this step also. Following this we grow a thick (approximately 5000 Å) layer of SiO₂, the field oxide (FOX). The FOX will not grow over the nitride areas. When we strip the nitride we are left with FOX in the areas we do not want to dope the silicon. Following this we deposit, dope, mask, and etch the poly gate material, CPG (mask) = poly (drawn). Next we create the doped regions that form the sources, drains, and substrate contacts using ion implantation. The poly gate functions like masking tape in these steps. One implant (using phosphorus or arsenic ions) forms the n-type source/drain for the n-channel transistors and n-type substrate contacts (CSN). A second implant (using boron ions) forms the p-type source/drain for the p-channel transistors and p-type substrate contacts (CSP). These implants are masked as follows:
$$ \text{CSN (mask)} = \text{grow (ndiff (drawn))} , \tag{2.19} $$
$$ \text{CSP (mask)} = \text{grow (pdiff (drawn))} , \tag{2.20} $$
where grow means that we expand or bloat the drawn ndiff and drawn pdiff layers slightly (usually by a few λ). During implantation the dopant ions are blocked by the resist pattern defined by the CSN and CSP masks. The CSN mask thus prevents the n-type regions being implanted with p-type dopants (and vice versa for the CSP mask).
As we shall see, the CSN and CSP masks are not intended to define the edges of the n-type and p-type regions. Instead these two masks function more like newspaper that prevents paint from spraying everywhere. The dopant ions are also blocked from reaching the silicon surface by the poly gates and this aligns the edge of the source and drain regions to the edges of the gates (we call this a self-aligned process). In addition, the implants are blocked by the FOX and this defines the outside edges of the source, drain, and substrate contact regions. The only areas of the silicon surface that are doped n-type are
$$ \text{n-diffusion (silicon)} = ( \text{CAA (mask)} \land \text{CSN (mask)} ) \land ( \lnot \text{CPG (mask)} ) ; \tag{2.21} $$
where the symbol ∧ represents AND (the intersection of two layers); and the symbol ¬ represents NOT. Similarly, the only regions that are doped p-type are
$$ \text{p-diffusion (silicon)} = ( \text{CAA (mask)} \land \text{CSP (mask)} ) \land ( \lnot \text{CPG (mask)} ) . \tag{2.22} $$
If the CSN and CSP masks do not overlap, it is possible to save a mask by using one implant mask (CSN or CSP) for the other type (CSP or CSN). We can do this by using a positive resist (the pattern of resist remaining after developing is the same as the dark areas on the mask) for one implant step and a negative resist (vice versa) for the other step. However, because of the poor resolution of negative resist and because of difficulties in generating the implant masks automatically from the drawn diffusions (especially when opposite diffusion types are drawn close to each other or touching), it is now common to draw both implant masks as well as the two diffusion layers.
It is important to remember that, even though poly is above diffusion, the polysilicon is deposited first and acts like masking tape. It is rather like airbrushing a stripe: you use masking tape and spray everywhere without worrying about making straight lines. The edges of the pattern will align to the edge of the tape. Here the analogy ends because the poly is left in place. Thus,
n-diffusion (silicon) = (ndiff (drawn)) ∧ (¬ poly (drawn)) (2.23)
and
p-diffusion (silicon) = (pdiff (drawn)) ∧ (¬ poly (drawn)) . (2.24)
In the ASIC industry the names nplus, n+, and n-diffusion (as well as the p-type equivalents) are used in various ways. These names may refer to the drawn diffusion layer (that we call ndiff), the mask (CSN), or the doped region on the silicon (the intersection of the active and implant mask that we call n-diffusion), which is very confusing. The source and drain are often formed from two separate implants. The first is a light implant close to the edge of the gate, the second a heavier implant that forms the rest of the source or drain region. The separate diffusions reduce the electric field near the drain end of the channel. Tailoring the device characteristics in this fashion is known as drain engineering, and a process including these steps is referred to as an LDD process, for lightly doped drain; the first light implant is known as an LDD diffusion or LDD implant.
FIGURE 2.8 Drawn layers and an example set of black-and-white stipple patterns for a CMOS process. On top are the patterns as they appear in layout. Underneath are the magnified 8-by-8 pixel patterns. If we are trying to simplify layout we may use solid black or white for contacts and vias. If we have contacts and vias placed on top of one another we may use stipple patterns or other means to help distinguish between them. Each stipple pattern is transparent, so that black shows through from underneath when layers are superimposed. There are no standards for these patterns.
Figure 2.8 shows a stipple-pattern matrix for a CMOS process. When we draw layout you can see through the layers: all the stipple patterns are ORed together. Figure 2.9 shows the transistor layers as they appear in layout (drawn using the patterns from Figure 2.8) and as they appear on the silicon. Figure 2.10 shows the same thing for the interconnect layers.
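The layer equations (2.21) to (2.24) are just set operations, so we can try them on a toy layout. The sketch below is only an illustration: the grid, the rectangle coordinates, and the helper name rect are all made up, and each layer is modeled as a Python set of grid cells.

```python
# Toy model: each mask layer is a set of (x, y) grid cells.
# All coordinates here are hypothetical, chosen only for illustration.

def rect(x0, y0, x1, y1):
    """Cells covered by a rectangle, from (x0, y0) up to but not including (x1, y1)."""
    return {(x, y) for x in range(x0, x1) for y in range(y0, y1)}

active = rect(0, 2, 10, 6)     # CAA mask: a horizontal strip of active
n_select = rect(0, 0, 10, 8)   # CSN mask: covers all of the active area
poly = rect(4, 0, 6, 8)        # CPG mask: a vertical poly gate crossing the active

# Eq. 2.21: n-diffusion = (CAA AND CSN) AND (NOT CPG).
# Set difference implements "AND NOT".
n_diffusion = (active & n_select) - poly

# The poly gate splits the doped region into a source and a drain,
# each aligned to a gate edge (the self-aligned process).
source = n_diffusion & rect(0, 0, 4, 8)
drain = n_diffusion & rect(6, 0, 10, 8)

print(len(n_diffusion), len(source), len(drain))
```

With these made-up shapes the doped region falls into two pieces, one on each side of the gate, and no doped cell lies under the poly.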
FIGURE 2.9 The transistor layers. (a) A p-channel transistor as drawn in layout. (b) The corresponding silicon cross section (the heavy lines in part a show the cuts). This is how a p-channel transistor would look just after completing the source and drain implant steps.
FIGURE 2.10 The interconnect layers. (a) Metal layers as drawn in layout. (b) The corresponding structure (as it might appear in a scanning-electron micrograph). The insulating layers between the metal layers are not shown. Contact is made to the underlying silicon through a platinum barrier layer. Each via consists of a tungsten plug. Each metal layer consists of a titanium-tungsten and aluminum-copper sandwich. Most deep submicron CMOS processes use metal structures similar to this. The scale, rounding, and irregularity of the features are realistic.
2.2.1 Sheet Resistance

Tables 2.3 and 2.4 show the sheet resistance for each conducting layer (in decreasing order of resistance) for two different generations of CMOS process.

TABLE 2.3 Sheet resistance (1 µm CMOS).
Layer          Sheet resistance    Units
n-well         1.15 ± 0.25         kΩ/square
poly           3.5 ± 2.0           Ω/square
n-diffusion    75 ± 20             Ω/square
p-diffusion    140 ± 40            Ω/square
m1/2           70 ± 6              mΩ/square
m3             30 ± 3              mΩ/square

TABLE 2.4 Sheet resistance (0.35 µm CMOS).
Layer          Sheet resistance    Units
n-well         1 ± 0.4             kΩ/square
poly           10 ± 4.0            Ω/square
n-diffusion    3.5 ± 2.0           Ω/square
p-diffusion    2.5 ± 1.5           Ω/square
m1/2/3         60 ± 6              mΩ/square
metal4         30 ± 3              mΩ/square
The diffusion layers, n-diffusion and p-diffusion, both have a high resistivity, typically from 1 to 100 Ω/square. We measure resistance in Ω/square (ohms per square) because for a fixed thickness of material it does not matter what the size of a square is: the resistance is the same. Thus the resistance of a rectangular shape of a sheet of material may be calculated from the number of squares it contains times the sheet resistance in Ω/square. We can use diffusion for very short connections inside a logic cell, but not for interconnect between logic cells. Poly has the next highest resistance to diffusion. Most submicron CMOS processes use a silicide material (a metallic compound of silicon) that has much lower resistivity (at several Ω/square) than the poly or diffusion layers alone. Examples are tantalum silicide, TaSi2; tungsten silicide, WSi2; or titanium silicide, TiSi2. The stoichiometry of these deposited silicides varies. For example, for tungsten silicide W:Si ≈ 1:2.6. There are two types of silicide process. In a polycide process only the gate is silicided. This reduces the poly sheet resistance, but not that of the source-drain. In a self-aligned silicide (salicide) process, both the gate and the source-drain regions are silicided. In some processes silicide can be used to connect adjacent poly and diffusion (we call this feature LI, white metal, local interconnect, metal0, or m0). LI is useful to reduce the area of ASIC RAM cells, for example.
Interconnect uses metal layers with resistivities of tens of mΩ/square, several orders of magnitude less than the other layers. There are usually several layers of metal in a CMOS ASIC process, each separated by an insulating layer. The metal layer above the poly gate layer is the first-level metal (m1 or metal1), the next is the second-level metal (m2 or metal2), and so on. We can make connections from m1 to diffusion using diffusion contacts or to the poly using polysilicon contacts. After we etch the contact holes a thin barrier metal (typically platinum) is deposited over the silicon and poly. Next we form contact plugs (via plugs for connections between metal layers) to reduce contact resistance and the likelihood of breaks in the contacts. Tungsten is commonly used for these plugs. Following this we form the metal layers as sandwiches. The middle of the sandwich is a layer (usually from 3000 to 10,000 Å) of aluminum and copper. The top and bottom layers are normally titanium-tungsten (TiW, pronounced tie-tungsten).
Submicron processes use chemical-mechanical polishing (CMP) to smooth the wafers flat before each metal deposition step to help with step coverage. An insulating glass, often sputtered quartz (SiO2), though other materials are also used, is deposited between metal layers to help create a smooth surface for the deposition of the metal. Design rules may refer to this insulator as an intermetal oxide (IMO), whether it is in fact an oxide or not, or as an interlevel dielectric (ILD). The IMO may be a spin-on polymer; boron-doped phosphosilicate glass (BPSG); Si3N4; or sandwiches of these materials (oxynitrides, for example). We make the connections between m1 and m2 using metal vias, cuts, or just vias.
We cannot connect m2 directly to diffusion or poly; instead we must make these connections through m1 using a via. Most processes allow contacts and vias to be placed directly above each other without restriction, arrangements known as stacked vias and stacked contacts. We call a process with m1 and m2 a two-level metal (2LM) technology. A 3LM process includes a third-level metal layer (m3 or metal3), and some processes include more metal layers. In this case a connection between m1 and m2 will use an m1/m2 via, or via1; a connection between m2 and m3 will use an m2/m3 via, or via2, and so on. The minimum spacing of interconnects, the metal pitch, may increase with successive metal layers. The minimum metal pitch is the minimum spacing between the centers of adjacent interconnects and is equal to the minimum metal width plus the minimum metal spacing.
Aluminum interconnect tends to break when carrying a high current density. Collisions between high-energy electrons and atoms move the metal atoms over a long period of time in a process known as electromigration. Copper is added to the aluminum to help reduce the problem. The other solution is to reduce the current density by using wider than minimum-width metal lines.
Tables 2.5 and 2.6 show maximum specified contact resistance and via resistance for two generations of CMOS processes. Notice that an m1 contact in either process is equal in resistance to several hundred squares of metal.

TABLE 2.5 Contact resistance (1 µm CMOS).
Contact/via type            Resistance (maximum)
m2/m3 via (via2)            5 Ω
m1/m2 via (via1)            2 Ω
m1/p-diffusion contact      20 Ω
m1/n-diffusion contact      20 Ω
m1/poly contact             20 Ω

TABLE 2.6 Contact resistance (0.35 µm CMOS).
Contact/via type            Resistance (maximum)
m2/m3 via (via2)            6 Ω
m1/m2 via (via1)            6 Ω
m1/p-diffusion contact      20 Ω
m1/n-diffusion contact      20 Ω
m1/poly contact             20 Ω
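To make the ohms-per-square idea concrete, here is a minimal sketch that counts squares in a wire and converts a contact resistance into an equivalent number of metal squares. It assumes the nominal m1 figures in Tables 2.3 and 2.4 are 70 and 60 mΩ/square and that the contact resistances in Tables 2.5 and 2.6 are in ohms; the function names and the 1000 µm example wire are hypothetical.

```python
def wire_resistance(length, width, sheet_resistance):
    """Resistance (ohms) of a rectangular wire: number of squares times ohms/square."""
    squares = length / width          # length and width must use the same unit
    return squares * sheet_resistance

def equivalent_squares(contact_resistance, sheet_resistance):
    """How many squares of a layer have the same resistance as one contact."""
    return contact_resistance / sheet_resistance

# Assumed nominal values read from the tables (tolerances ignored).
M1_SHEET_1UM = 0.070      # ohms/square, m1 in the 1 um process
M1_SHEET_035UM = 0.060    # ohms/square, m1 in the 0.35 um process
M1_CONTACT_MAX = 20.0     # ohms, maximum m1 contact resistance in both processes

# Hypothetical wire: 1000 um long and 1 um wide in m1 (1000 squares).
r_wire = wire_resistance(1000.0, 1.0, M1_SHEET_1UM)

squares_1um = equivalent_squares(M1_CONTACT_MAX, M1_SHEET_1UM)
squares_035um = equivalent_squares(M1_CONTACT_MAX, M1_SHEET_035UM)

print(f"1000-square m1 wire: {r_wire:.1f} ohms")
print(f"one m1 contact = {squares_1um:.0f} squares (1 um), "
      f"{squares_035um:.0f} squares (0.35 um)")
```

Under these assumptions one contact is worth roughly 300 squares of m1, which matches the "several hundred squares" remark in the text.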
1. If only one well layer is drawn, the other mask may be derived from the drawn layer. For example, p-well (mask) = not (nwell (drawn)). A single-well process requires only one well mask.
2. The implant masks may be derived or drawn.
3. Largely for historical reasons the contacts to poly and contacts to active have different layer names. In the past this allowed a different sizing or process bias to be applied to each contact type when the mask was made.
2.3 CMOS Design Rules

Figure 2.11 defines the design rules for a CMOS process using pictures. Arrows between objects denote a minimum spacing, and arrows showing the size of an object denote a minimum width. Rule 3.1, for example, is the minimum width of poly (2 λ). Each of the rule numbers may have different values for different manufacturers; there are no standards for design rules. Tables 2.7 to 2.9 show the MOSIS scalable CMOS rules. Table 2.7 shows the layer rules for the process front end, which is the front end of the line (as in production line) or FEOL. Table 2.8 shows the rules for the process back end (BEOL), the metal interconnect, and Table 2.9 shows the rules for the pad layer and glass layer.
FIGURE 2.11 The MOSIS scalable CMOS design rules (rev. 7). Dimensions are in λ. Rule numbers are in parentheses (missing rule sets 11 to 13 are extensions to this basic process).

TABLE 2.7 MOSIS scalable CMOS rules version 7: the process front end.
Layer                  Rule      Explanation                                               Value/λ
well (CWN, CWP)        1.1       minimum width                                             10
                       1.2       minimum space (different potential, a hot well)           9
                       1.3       minimum space (same potential)                            0 or 6
                       1.4       minimum space (different well type)                       0
active (CAA)           2.1/2.2   minimum width/space                                       3
                       2.3       source/drain active to well edge space                    5
                       2.4       substrate/well contact active to well edge space          3
                       2.5       minimum space between active (different implant type)     0 or 4
poly (CPG)             3.1/3.2   minimum width/space                                       2
                       3.3       minimum gate extension of active                          2
                       3.4       minimum active extension of poly                          3
                       3.5       minimum field poly to active space                        1
select (CSN, CSP)      4.1       minimum select spacing to channel of transistor [note 1]  3
                       4.2       minimum select overlap of active                          2
                       4.3       minimum select overlap of contact                         1
                       4.4       minimum select width and spacing [note 2]                 2
poly contact (CCP)     5.1.a     exact contact size                                        2 × 2
                       5.2.a     minimum poly overlap                                      1.5
                       5.3.a     minimum contact spacing                                   2
active contact (CCA)   6.1.a     exact contact size                                        2 × 2
                       6.2.a     minimum active overlap                                    1.5
                       6.3.a     minimum contact spacing                                   2
                       6.4.a     minimum space to gate of transistor                       2
TABLE 2.8 MOSIS scalable CMOS rules version 7: the process back end.
Layer           Rule     Explanation                                       Value/λ
metal1 (CMF)    7.1      minimum width                                     3
                7.2.a    minimum space                                     3
                7.2.b    minimum space (for minimum-width wires only)      2
                7.3      minimum overlap of poly contact                   1
                7.4      minimum overlap of active contact                 1
via1 (CVA)      8.1      exact size                                        2 × 2
                8.2      minimum via spacing                               3
                8.3      minimum overlap by metal1                         1
                8.4      minimum spacing to contact                        2
                8.5      minimum spacing to poly or active edge            2
metal2 (CMS)    9.1      minimum width                                     3
                9.2.a    minimum space                                     4
                9.2.b    minimum space (for minimum-width wires only)      3
                9.3      minimum overlap of via1                           1
via2 (CVS)      14.1     exact size                                        2 × 2
                14.2     minimum space                                     3
                14.3     minimum overlap by metal2                         1
                14.4     minimum spacing to via1                           2
metal3 (CMT)    15.1     minimum width                                     6
                15.2     minimum space                                     4
                15.3     minimum overlap of via2                           2
TABLE 2.9 MOSIS scalable CMOS rules version 7: the pads and overglass (passivation).
Layer          Rule    Explanation                                                  Value
glass (COG)    10.1    minimum bonding-pad width                                    100 µm × 100 µm
               10.2    minimum probe-pad width                                      75 µm × 75 µm
               10.3    pad overlap of glass opening                                 6 µm
               10.4    minimum pad spacing to unrelated metal2 (or metal3)          30 µm
               10.5    minimum pad spacing to unrelated metal1, poly, or active     15 µm
The rules in Table 2.7 and Table 2.8 are given as multiples of λ. If we use lambda-based rules we can move between successive process generations just by changing the value of λ. For example, we can scale 0.5 µm layouts (λ = 0.25 µm) by a factor of 0.175/0.25 for a 0.35 µm process (λ = 0.175 µm), at least in theory. You may get an inkling of the practical problems from the fact that the values for pad dimensions and spacing in Table 2.9 are given in microns and not in λ. This is because bonding to the pads is an operation that does not scale well. Often companies have two sets of design rules: one in λ (with fractional λ rules) and the other in microns. Ideally we would like to express all of the design rules in integer multiples of λ. This was true for revisions 4 to 6, but not revision 7 of the MOSIS rules. In revision 7 rules 5.2a/6.2a are noninteger. The original Mead-Conway NMOS rules include a noninteger 1.5 λ rule for the implant layer.
1. To ensure source and drain width.
2. Different select types may touch but not overlap.
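As a quick illustration of lambda-based scaling, the sketch below converts a handful of rules from Tables 2.7 and 2.8 into microns for the two values of λ quoted above. The dictionary keys and the function name are our own labels, not MOSIS names, and the pad rules of Table 2.9 are deliberately left out because they are specified in microns and do not scale.

```python
# A few rules from Tables 2.7 and 2.8, in units of lambda.
# The descriptive keys are informal labels chosen for this example.
RULES_LAMBDA = {
    "1.1 well minimum width": 10,
    "3.1 poly minimum width": 2,
    "5.2.a minimum poly overlap of contact": 1.5,   # noninteger in revision 7
    "7.1 metal1 minimum width": 3,
}

def rules_in_microns(rules, lambda_um):
    """Convert lambda-based rule values to microns for a given lambda."""
    return {name: value * lambda_um for name, value in rules.items()}

old = rules_in_microns(RULES_LAMBDA, 0.25)    # 0.5 um process, lambda = 0.25 um
new = rules_in_microns(RULES_LAMBDA, 0.175)   # 0.35 um process, lambda = 0.175 um
scale = 0.175 / 0.25                          # linear shrink factor between the two

for name in RULES_LAMBDA:
    print(f"{name}: {old[name]:.3f} um -> {new[name]:.3f} um")
print(f"scale factor = {scale:.2f}")
```

The minimum poly width comes out as 0.5 µm and 0.35 µm, which is where the process names come from; every λ rule shrinks by the same factor of 0.7.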
2.4 Combinational Logic Cells

The AND-OR-INVERT (AOI) and the OR-AND-INVERT (OAI) logic cells are particularly efficient in CMOS. Figure 2.12 shows an AOI221 and an OAI321 logic cell (the logic symbols in Figure 2.12 are not standards, but are widely used). All indices (the indices are the numbers after AOI or OAI) in the logic cell name greater than 1 correspond to the inputs to the first level or stage (the AND gate(s) in an AOI cell, for example). An index of '1' corresponds to a direct input to the second-stage cell. We write indices in descending order; so it is AOI221 and not AOI122 (but both are equivalent cells), and AOI32 not AOI23. If we have more than one direct input to the second stage we repeat the '1'; thus an AOI211 cell performs the function Z = (A · B + C + D)'. A three-input NAND cell is an OAI111, but calling it that would be very confusing. These rules are not standard, but form a convention that we shall adopt and one that is widely used in the ASIC industry.
There are many ways to represent the logical operator, AND. I shall use the middle dot and write A · B (rather than AB, A.B, or A ∧ B); occasionally I may use AND(A, B). Similarly I shall write A + B as well as OR(A, B). I shall use an apostrophe like this, A', to denote the complement of A rather than Ā, since sometimes it is difficult or inappropriate to use an overbar (vinculum) or diacritical mark (macron). It is possible to misinterpret AB' as A · B̄ rather than as the complement of the whole product AB (but the former alternative would be A · B' in my convention). I shall be careful in these situations.
FIGURE 2.12 Naming and numbering complex CMOS combinational cells. (a) An AND-OR-INVERT cell, an AOI221. (b) An OR-AND-INVERT cell, an OAI321. Numbering is always in descending order.
We can express the function of the AOI221 cell in Figure 2.12(a) as
Z = (A · B + C · D + E)' . (2.25)
We can also write this equation unambiguously as Z = AOI221(A, B, C, D, E), just as we might write X = NAND(I, J, K) to describe the logic function X = (I · J · K)'. This notation is useful because, for example, if we write OAI321(P, Q, R, S, T, U) we immediately know that U (the sixth input) is the (only) direct input connected to the second stage. Sometimes we need to refer to particular inputs without listing them all. We can adopt another convention that letters of the input names change with the index position. Now we can refer to input B2 of an AOI321 cell, for example, and know which input we are talking about without writing
Z = AOI321(A1, A2, A3, B1, B2, C) . (2.26)
Table 2.10 shows the AOI family of logic cells with three indices (with branches in the family for AOI, OAI, AO, and OA cells). There are 5 types and 14 separate members of each branch of this family. There are thus 4 × 14 = 56 cells of the type Xabc where X = {OAI, AOI, OA, AO} and each of the indexes a, b, and c can range from 1 to 3. We form the AND-OR (AO) and OR-AND (OA) cells by adding an inverter to the output of an AOI or OAI cell.

TABLE 2.10 The AOI family of cells with three index numbers or less.
Cell type [note 1]    Cells                      Number of unique cells
Xa1                   X21, X31                   2
Xa11                  X211, X311                 2
Xab                   X22, X33, X32              3
Xab1                  X221, X331, X321           3
Xabc                  X222, X333, X332, X322     4
Total                                            14
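The counts in Table 2.10 can be checked by enumeration. The following sketch assumes our reading of the naming convention: two or three indices, each from 1 to 3, written in descending order, with at least one index greater than 1. The function name and the list of drive-strength suffixes are illustrative choices, not names from any real library.

```python
from itertools import combinations_with_replacement

def index_strings(max_indices=3, max_value=3):
    """Index strings such as '221', in descending order, with two or three indices."""
    names = []
    for n in range(2, max_indices + 1):
        # Drawing from 3, 2, 1 keeps every combination in descending order.
        for combo in combinations_with_replacement(range(max_value, 0, -1), n):
            if combo[0] == 1:
                continue  # all indices are 1: no first-stage gate, so not in the table
            names.append("".join(str(i) for i in combo))
    return names

BRANCHES = ["AOI", "OAI", "AO", "OA"]
DRIVES = ["X1", "X2", "X4", "X8"]   # hypothetical drive-strength suffixes

INDEXES = index_strings()
CELLS = [branch + idx for branch in BRANCHES for idx in INDEXES]
LIBRARY = [cell + drive for cell in CELLS for drive in DRIVES]

print(len(INDEXES), "members per branch:", " ".join(INDEXES))
print(len(CELLS), "cells in the family")
print(len(LIBRARY), "cells with four drive strengths each")
```

The enumeration gives 14 members per branch and 56 cells in all; with four drive strengths per cell the total grows to 224.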
2.4.1 Pushing Bubbles

The AOI and OAI logic cells can be built using a single stage in CMOS using series-parallel networks of transistors called stacks. Figure 2.13 illustrates the procedure to build the n-channel and p-channel stacks, using the AOI221 cell as an example.
FIGURE 2.13 Constructing a CMOS logic cell, an AOI221. (a) First build the dual icon by using de Morgan's theorem to push inversion bubbles to the inputs. (b) Next build the n-channel and p-channel stacks from series and parallel combinations of transistors. (c) Adjust transistor sizes so that the n-channel and p-channel stacks have equal strengths.
Here are the steps to construct any single-stage combinational CMOS logic cell:
1. Draw a schematic icon with an inversion (bubble) on the last cell (the bubble-out schematic). Use de Morgan's theorems (a NAND is an OR with inverted inputs and a NOR is an AND with inverted inputs) to push the output bubble back to the inputs (this is the dual icon or bubble-in schematic).
2. Form the n-channel stack working from the inputs on the bubble-out schematic: OR translates to a parallel connection, AND translates to a series connection. If you have a bubble at an input, you need an inverter.
3. Form the p-channel stack using the bubble-in schematic (ignore the inversions at the inputs; the bubbles on the gate terminals of the p-channel transistors take care of these). If you do not have a bubble at the input gate terminals, you need an inverter (these will be the same input gate terminals that had bubbles in the bubble-out schematic).
The two stacks are network duals (they can be derived from each other by swapping series connections for parallel, and parallel for series connections). The n-channel stack implements the strong '0's of the function and the p-channel stack provides the strong '1's. The final step is to adjust the drive strength of the logic cell by sizing the transistors.
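The construction above can be sketched in a few lines for the AOI221 cell. This is only an illustrative model: the nested-tuple representation of a stack and the names dual, conducts, and aoi221 are our own. The n-channel stack is written down from Z = (A · B + C · D + E)', the p-channel stack is derived as its network dual, and the code checks that exactly one stack conducts for every input combination.

```python
from itertools import product

# n-channel stack for AOI221: OR becomes parallel ("par"), AND becomes series ("ser").
N_STACK = ("par", ("ser", "A", "B"), ("ser", "C", "D"), "E")

def dual(net):
    """Swap series for parallel (and vice versa) throughout a network."""
    if isinstance(net, str):
        return net
    kind, *branches = net
    swapped = "ser" if kind == "par" else "par"
    return (swapped, *[dual(b) for b in branches])

def conducts(net, on):
    """True if the network conducts, given which transistors are on."""
    if isinstance(net, str):
        return on[net]
    kind, *branches = net
    states = [conducts(b, on) for b in branches]
    return any(states) if kind == "par" else all(states)

P_STACK = dual(N_STACK)

def aoi221(a, b, c, d, e):
    """Evaluate the cell from its two stacks; returns 0 or 1."""
    inputs = dict(zip("ABCDE", (a, b, c, d, e)))
    n_on = {k: bool(v) for k, v in inputs.items()}       # n-channel on when gate is '1'
    p_on = {k: not bool(v) for k, v in inputs.items()}   # p-channel on when gate is '0'
    pull_down = conducts(N_STACK, n_on)
    pull_up = conducts(P_STACK, p_on)
    assert pull_down != pull_up   # never both on, and never floating
    return 1 if pull_up else 0

# Compare against the logic equation for all 32 input combinations.
ALL_MATCH = all(
    aoi221(a, b, c, d, e) == 1 - ((a & b) | (c & d) | e)
    for a, b, c, d, e in product((0, 1), repeat=5)
)
print(P_STACK)
print("stacks agree with Z = (A.B + C.D + E)':", ALL_MATCH)
```

The derived p-channel stack is three groups in series (A parallel with B, C parallel with D, and E), the series-parallel dual of the n-channel stack.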
2.4.2 Drive Strength

Normally we ratio the sizes of the n-channel and p-channel transistors in an inverter so that both types of transistors have the same resistance, or drive strength. That is, we make βn = βp. At low dopant concentrations and low electric fields µn is about twice µp. To compensate we make the shape factor, W/L, of the p-channel transistor in an inverter about twice that of the n-channel transistor (we say the logic has a ratio of 2). Since the transistor lengths are normally equal to the minimum poly width for both types of transistors, the ratio of the transistor widths is also equal to 2. With the high dopant concentrations and high electric fields in submicron transistors the difference in mobilities is less, typically between 1 and 1.5. Logic cells in a library have a range of drive strengths. We normally call the minimum-size inverter a 1X inverter. The drive strength of a logic cell is often used as a suffix; thus a 1X inverter has a cell name such as INVX1 or INVD1. An inverter with transistors that are twice the size will be an INVX2. Drive strengths are normally scaled in a geometric ratio, so we have 1X, 2X, 4X, and (sometimes) 8X or even higher, drive-strength cells. We can size a logic cell using these basic rules:
• Any string of transistors connected between a power supply and the output in a cell with 1X drive should have the same resistance as the n-channel transistor in a 1X inverter.
• A transistor with shape factor W1/L1 has a resistance proportional to L1/W1 (so the larger W1 is, the smaller the resistance).
• Two transistors in parallel with shape factors W1/L1 and W2/L2 are equivalent to a single transistor (W1/L1 + W2/L2)/1. For example, a 2/1 in parallel with a 3/1 is a 5/1.
• Two transistors, with shape factors W1/L1 and W2/L2, in series are equivalent to a single 1/(L1/W1 + L2/W2) transistor.
For example, a transistor with shape factor 3/1 (we shall call this a 3/1) in series with another 3/1 is equivalent to a 1/((1/3) + (1/3)) or a 3/2. We can use the following method to calculate equivalent transistor sizes:
• To add transistors in parallel, make all the lengths 1 and add the widths.
• To add transistors in series, make all the widths 1 and add the lengths.
We have to be careful to keep W and L reasonable. For example, a 3/1 in series with a 2/1 is equivalent to a 1/((1/3) + (1/2)) or 1/0.83. Since we cannot make a device 2 λ wide and 1.66 λ long, a 1/0.83 is more naturally written as 3/2.5. We like to keep both W and L as integer multiples of 0.5 (equivalent to making W and L integer multiples of λ), but W and L must be greater than 1.
In Figure 2.13(c) the transistors in the AOI221 cell are sized so that any string through the p-channel stack has a drive strength equivalent to a 2/1 p-channel transistor (we choose the worst case; if more than one transistor in parallel is conducting then the drive strength will be higher). The n-channel stack is sized so that it has a drive strength of a 1/1 n-channel transistor. The ratio in this library is thus 2.
If we were to use four drive strengths for each of the AOI family of cells shown in Table 2.10, we would have a total of 224 combinational library cells, just for the AOI family. The synthesis tools can handle this number of cells, but we may not be able to design this many cells in a reasonable amount of time. Section 3.3, Logical Effort, will help us choose the most logically efficient cells.
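The series and parallel rules for shape factors are easy to mechanize. Here is a minimal sketch using exact fractions; the function names parallel and series are ours. The last two lines show one possible sizing for the AOI221 stacks that satisfies the stated targets (a 2/1 p-channel equivalent and a 1/1 n-channel equivalent); these sizes are worked out from the rules, not read from Figure 2.13(c).

```python
from fractions import Fraction

def parallel(*shapes):
    """Equivalent W/L of transistors in parallel: the W/L values add."""
    return sum((Fraction(s) for s in shapes), Fraction(0))

def series(*shapes):
    """Equivalent W/L of transistors in series: the L/W values add."""
    return 1 / sum((1 / Fraction(s) for s in shapes), Fraction(0))

print(parallel(2, 3))    # a 2/1 in parallel with a 3/1
print(series(3, 3))      # a 3/1 in series with a 3/1
print(series(3, 2))      # a 3/1 in series with a 2/1 (6/5 = 3/2.5 = 1/0.83)

# AOI221 worst-case strings (assumed sizing, derived from the rules above):
# the p-channel stack has three transistors in series, the n-channel stack two.
print(series(6, 6, 6))   # three 6/1 p-channel devices give a 2/1 equivalent
print(series(2, 2))      # two 2/1 n-channel devices give a 1/1 equivalent
```

Working in fractions avoids the rounding that turns 6/5 into the awkward 1/0.83 in the text.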
2.4.3 Transmission Gates

Figure 2.14(a) and (b) shows a CMOS transmission gate ( TG , TX gate, pass gate, coupler). We connect a p -channel transistor (to transmit a strong '1') in parallel with an n -channel transistor (to transmit a strong '0').
FIGURE 2.14 CMOS transmission gate (TG). (a) An n-channel and p-channel transistor in parallel form a TG. (b) A common symbol for a TG. (c) The charge-sharing problem.
We can express the function of a TG as
Z = TG(A, S) , (2.27)
but this is ambiguous: if we write TG(X, Y), how do we know if X is connected to the gates or sources/drains of the TG? We shall always define TG(X, Y) when we use it. It is tempting to write TG(A, S) = A · S, but what is the value of Z when S = '0' in Figure 2.14(a), since Z is then left floating? A TG is a switch, not an AND logic cell.
There is a potential problem if we use a TG as a switch connecting a node Z that has a large capacitance, C_BIG, to an input node A that has only a small capacitance C_SMALL (see Figure 2.14c). If the initial voltage at A is V_SMALL and the initial voltage at Z is V_BIG, when we close the TG (by setting S = '1') the final voltage on both nodes A and Z is
$$V_F = \frac{C_{BIG} V_{BIG} + C_{SMALL} V_{SMALL}}{C_{BIG} + C_{SMALL}} . \tag{2.28}$$
Imagine we want to drive a '0' onto node Z from node A. Suppose C_BIG = 0.2 pF (about 10 standard loads in a 0.5 µm process) and C_SMALL = 0.02 pF, V_BIG = 0 V and V_SMALL = 5 V; then
$$V_F = \frac{(0.2 \times 10^{-12})(0) + (0.02 \times 10^{-12})(5)}{(0.2 \times 10^{-12}) + (0.02 \times 10^{-12})} = 0.45\ \text{V} . \tag{2.29}$$
This is not what we want at all: the big capacitor has forced node A to a voltage close to a '0'. This type of problem is known as charge sharing. We should either (1) make sure that node A is strong enough to overcome the big capacitor, or (2) insulate node A from node Z by including a buffer (an inverter, for example) between node A and node Z. We must not use charge to drive another logic cell; only a logic cell can drive a logic cell.
If we omit one of the transistors in a TG (usually the p-channel transistor) we have a pass transistor. There is a branch of full-custom VLSI design that uses pass-transistor logic. Much of this is based on relay-based logic, since a single transistor switch looks like a relay contact. There are many problems associated with pass-transistor logic related to charge sharing, reduced noise margins, and the difficulty of predicting delays. Though pass transistors may appear in an ASIC cell inside a library, they are not used by ASIC designers.
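The charge-sharing result follows from conservation of charge on the two capacitors, and it is easy to reproduce numerically. The function name below is our own; the component values are the ones used in the example above.

```python
def charge_share(c_big, v_big, c_small, v_small):
    """Final voltage when two charged capacitors are connected together.

    Total charge is conserved: (c_big + c_small) * v_final equals
    c_big * v_big + c_small * v_small.
    """
    return (c_big * v_big + c_small * v_small) / (c_big + c_small)

# Values from the text: node Z (0.2 pF at 0 V) and node A (0.02 pF at 5 V).
v_final = charge_share(0.2e-12, 0.0, 0.02e-12, 5.0)
print(f"V_F = {v_final:.2f} V")

# Swapping the initial voltages shows the same effect in the other direction:
# the big capacitor still decides the final value.
v_swapped = charge_share(0.2e-12, 5.0, 0.02e-12, 0.0)
print(f"V_F with the voltages swapped = {v_swapped:.2f} V")
```

Because C_BIG is ten times C_SMALL, the final voltage always lands within one-eleventh of the supply of wherever the big node started.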
FIGURE 2.15 The CMOS multiplexer (MUX). (a) A noninverting 2:1 MUX using transmission gates without buffering. (b) A symbol for a MUX (note how the inputs are labeled). (c) An IEEE standard symbol for a MUX. (d) A nonstandard, but very common, IEEE symbol for a MUX. (e) An inverting MUX with output buffer. (f) A noninverting buffered MUX.
We can use two TGs to form a multiplexer (or multiplexor; people use both orthographies) as shown in Figure 2.15(a). We often shorten multiplexer to MUX. The MUX function for two data inputs, A and B, with a select signal S, is
Z = TG(A, S') + TG(B, S) . (2.30)
We can write this as Z = A · S' + B · S, since node Z is always connected to one or other of the inputs (and we assume both are driven). This is a two-input MUX (2-to-1 MUX or 2:1 MUX). Unfortunately, we can also write the MUX function as Z = A · S + B · S', so it is difficult to write the MUX function unambiguously as Z = MUX(X, Y, Z). For example, is the select input X, Y, or Z? We shall define the function MUX(X, Y, Z) each time we use it. We must also be careful to label a MUX if we use the symbol shown in Figure 2.15(b).
Symbols for a MUX are shown in Figure 2.15(b) to (d). In the IEEE notation 'G' specifies an AND dependency. Thus, in Figure 2.15(c), G = '1' selects the input labeled '1'. Figure 2.15(d) uses the common control block symbol (the notched rectangle). Here, G1 = '1' selects the input '1', and G1 = '0' selects the input labeled '1' with an overbar. Strictly this form of IEEE symbol should be used only for elements with more than one section controlled by common signals, but the symbol of Figure 2.15(d) is used often for a 2:1 MUX.
The MUX shown in Figure 2.15(a) works, but there is a potential charge-sharing problem if we cascade MUXes (connect them in series). Instead most ASIC libraries use MUX cells built with a more conservative approach. We could buffer the output using an inverter (Figure 2.15e), but then the MUX becomes inverting. To build a safe, noninverting MUX we can buffer the inputs and output (Figure 2.15f), requiring 12 transistors, or 3 gate equivalents (only the gate equivalent counts are shown from now on).
Figure 2.16 shows how to use an OAI22 logic cell (and an inverter) to implement an inverting MUX. The implementation in equation form (2.5 gates) is
ZN = A' · S' + B' · S
   = [(A' · S')' · (B' · S)']'
   = [(A + S) · (B + S')]'
   = OAI22[A, S, B, NOT(S)] (2.31)
(both A' and NOT(A) represent an inverter, depending on which representation is most convenient; they are equivalent). I often use an equation to describe a cell implementation.
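Equation 2.31 can be checked exhaustively, since there are only eight input combinations. The sketch below models the cells as small Python functions on 0/1 values; the function names are ours, and mux uses the definition Z = A · S' + B · S given after Eq. 2.30.

```python
from itertools import product

def NOT(a):
    return 1 - a

def OAI22(a1, a2, b1, b2):
    """OR-AND-INVERT with two two-input OR gates: ((a1 + a2) . (b1 + b2))'."""
    return NOT((a1 | a2) & (b1 | b2))

def mux(a, b, s):
    """Noninverting 2:1 MUX, Z = A.S' + B.S (S = 0 selects A)."""
    return (a & NOT(s)) | (b & s)

def check_oai22_mux():
    """True if OAI22[A, S, B, NOT(S)] is the inverting MUX for all inputs."""
    for a, b, s in product((0, 1), repeat=3):
        zn = OAI22(a, s, b, NOT(s))
        sum_of_products = (NOT(a) & NOT(s)) | (NOT(b) & s)   # first line of Eq. 2.31
        if zn != NOT(mux(a, b, s)) or zn != sum_of_products:
            return False
    return True

print("Eq. 2.31 holds for all inputs:", check_oai22_mux())
```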
FIGURE 2.16 An inverting 2:1 MUX based on an OAI22 cell.
The following factors will determine which MUX implementation is best:
1. Do we want to minimize the delay between the select input and the output or between the data inputs and the output?
2. Do we want an inverting or noninverting MUX?
3. Do we object to having any logic cell inputs tied directly to the source/drain diffusions of a transmission gate? (Some companies forbid such transmission-gate inputs since some simulation tools cannot handle them.)
4. Do we object to any logic cell outputs being tied to the source/drain of a transmission gate? (Some companies will not allow this because of the dangers of charge sharing.)
5. What drive strength do we require (and is size or speed more important)?
A minimum-size TG is a little slower than a minimum-size inverter, so there is not much difference between the implementations shown in Figure 2.15 and Figure 2.16, but the difference can become important for 4:1 and larger MUXes.
2.4.4 Exclusive-OR Cell

The two-input exclusive-OR (XOR, EXOR, not-equivalence, ring-OR) function is
A1 ⊕ A2 = XOR(A1, A2) = A1 · A2' + A1' · A2 . (2.32)
We are now using multiletter symbols, but there should be no doubt that A1' means NOT(A1). We can implement a two-input XOR using a MUX and an inverter as follows (2 gates):
XOR(A1, A2) = MUX[NOT(A1), A1, A2] , (2.33)
where
MUX(A, B, S) = A · S + B · S' . (2.34)
This implementation only buffers one input and does not buffer the MUX output. We can use inverter buffers (3.5 gates total) or an inverting MUX so that the XOR cell does not have any external connections to source/drain diffusions as follows (3 gates total):
XOR(A1, A2) = NOT[MUX(NOT[NOT(A1)], NOT(A1), A2)] . (2.35)
We can also implement a two-input XOR using an AOI21 (and a NOR cell), since
XOR(A1, A2) = A1 · A2' + A1' · A2
            = [(A1 · A2) + (A1 + A2)']'
            = AOI21[A1, A2, NOR(A1, A2)] (2.36)
(2.5 gates). Similarly we can implement an exclusive-NOR (XNOR, equivalence) logic cell using an inverting MUX (and two inverters, total 3.5 gates) or an OAI21 logic cell (and a NAND cell, total 2.5 gates) as follows (using the MUX function of Eq. 2.34):
XNOR(A1, A2) = A1 · A2 + NOT(A1) · NOT(A2)
             = NOT[NOT[MUX(A1, NOT(A1), A2)]]
             = OAI21[A1, A2, NAND(A1, A2)] . (2.37)
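Each of the XOR and XNOR implementations in Eqs. 2.33 to 2.37 can be confirmed with a four-row truth table. The sketch below models the cells as Python functions on 0/1 values (the function names are ours) and uses the MUX definition of Eq. 2.34.

```python
from itertools import product

def NOT(a):
    return 1 - a

def NAND(a, b):
    return NOT(a & b)

def NOR(a, b):
    return NOT(a | b)

def MUX(a, b, s):
    """Eq. 2.34: MUX(A, B, S) = A.S + B.S' (S = 1 selects A)."""
    return (a & s) | (b & NOT(s))

def AOI21(a1, a2, b):
    return NOT((a1 & a2) | b)

def OAI21(a1, a2, b):
    return NOT((a1 | a2) & b)

def check_xor_identities():
    """True if every implementation matches XOR (or XNOR) for all inputs."""
    for a1, a2 in product((0, 1), repeat=2):
        xor = a1 ^ a2
        xnor = NOT(xor)
        checks = [
            MUX(NOT(a1), a1, a2) == xor,                 # Eq. 2.33
            NOT(MUX(NOT(NOT(a1)), NOT(a1), a2)) == xor,  # Eq. 2.35
            AOI21(a1, a2, NOR(a1, a2)) == xor,           # Eq. 2.36
            NOT(NOT(MUX(a1, NOT(a1), a2))) == xnor,      # Eq. 2.37, MUX form
            OAI21(a1, a2, NAND(a1, a2)) == xnor,         # Eq. 2.37, OAI21 form
        ]
        if not all(checks):
            return False
    return True

print("all XOR/XNOR implementations agree:", check_xor_identities())
```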
1. Xabc: X = {AOI, AO, OAI, OA}; a, b, c = {2, 3}; { } means choose one.
2.5 Sequential Logic Cells

There are two main approaches to clocking in VLSI design: multiphase clocks or a single clock and synchronous design . The second approach has the following key advantages: (1) it allows automated design, (2) it is safe, and (3) it permits vendor signoff (a guarantee that the ASIC will work as simulated). These advantages of synchronous design (especially the last one) usually outweigh every other consideration in the choice of a clocking scheme. The vast majority of ASICs use a rigid synchronous design style.
2.5.1 Latch
Figure 2.17(a) shows a sequential logic cell, a latch. The internal clock signals, CLKN (N for negative) and CLKP (P for positive), are generated from the system clock, CLK, by two inverters (I4 and I5) that are part of every latch cell. It is usually too dangerous to have these signals supplied externally, even though it would save space.
FIGURE 2.17 CMOS latch. (a) A positive-enable latch using transmission gates without output buffering; the enable (clock) signal is buffered inside the latch. (b) A positive-enable latch is transparent while the enable is high. (c) The latch stores the last value at D when the enable goes low.
To emphasize the difference between a latch and flip-flop, sometimes people refer to the clock input of a latch as an enable. This makes sense when we look at Figure 2.17(b), which shows the operation of a latch. When the clock input is high, the latch is transparent: changes at the D input appear at the output Q (quite different from a flip-flop as we shall see). When the enable (clock) goes low (Figure 2.17c), inverters I2 and I3 are connected together, forming a storage loop that holds the last value on D until the enable goes high again. The storage loop will hold its state as long as power is on; we call this a static latch. A sequential logic cell is different from a combinational cell because it has this feature of storage or memory.
Notice that the output Q is unbuffered and connected directly to the output of I2 (and the input of I3), which is a storage node. In an ASIC library we are conservative and add an inverter to buffer the output, isolate the sensitive storage node, and thus invert the sense of Q. If we want both Q and QN we have to add two inverters to the circuit of Figure 2.17(a). This means that a latch requires seven inverters and two TGs (4.5 gates).
The latch of Figure 2.17(a) is a positive-enable D latch, active-high D latch, or transparent-high D latch (sometimes people also call this a D-type latch). A negative-enable (active-low) D latch can be built by inverting all the clock polarities in Figure 2.17(a) (swap CLKN for CLKP and vice versa).
2.5.2 Flip-Flop
Figure 2.18(a) shows a flip-flop constructed from two D latches: a master latch (the first one) and a slave latch. This flip-flop contains a total of nine inverters and four TGs, or 6.5 gates. In this flip-flop design the storage node S is buffered and the clock-to-Q delay will be one inverter delay less than the clock-to-QN delay.
FIGURE 2.18 CMOS flip-flop. (a) This negative-edge-triggered flip-flop consists of two latches: master and slave. (b) While the clock is high, the master latch is loaded. (c) As the clock goes low, the slave latch loads the value of the master latch. (d) Waveforms illustrating the definition of the flip-flop setup time t_SU, hold time t_H, and propagation delay from clock to Q, t_PD.
In Figure 2.18(b) the clock input is high, the master latch is transparent, and node M (for master) will follow the D input. Meanwhile the slave latch is disconnected from the master latch and is storing whatever the previous value of Q was. As the clock goes low (the negative edge) the slave latch is enabled and will update its state (and the output Q) to the value of node M at the negative edge of the clock. The slave latch will then keep this value of M at the output Q, despite any changes at the D input while the clock is low (Figure 2.18c). When the clock goes high again, the slave latch will store the captured value of M (and we are back where we started our explanation).
The combination of the master and slave latches acts to capture or sample the D input at the negative clock edge, the active clock edge. This type of flip-flop is a negative-edge-triggered flip-flop and its behavior is quite different from a latch. The behavior is shown on the IEEE symbol by using a triangular notch to denote an edge-sensitive input. A bubble shows the input is sensitive to the negative edge. To build a positive-edge-triggered flip-flop we invert the polarity of all the clocks, as we did for a latch.
The waveforms in Figure 2.18(d) show the operation of the flip-flop as we have described it, and illustrate the definition of setup time (t_SU), hold time (t_H), and clock-to-Q propagation delay (t_PD). We must keep the data stable (a fixed logic '1' or '0') for a time t_SU prior to the active clock edge, and stable for a time t_H after the active clock edge (during the decision window shown).
In Figure 2.18(d) times are measured from the points at which the waveforms cross 50 percent of V_DD. We say the trip point is 50 percent or 0.5. Common choices are 0.5 or 0.65/0.35 (a signal has to reach 0.65 V_DD to be a '1', and reach 0.35 V_DD to be a '0'), or 0.1/0.9 (there is no standard way to write a trip point). Some vendors use different trip points for the input and output waveforms (especially in I/O cells).
The flip-flop in Figure 2.18(a) is a D flip-flop and is by far the most widely used type of flip-flop in ASIC design.
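The difference between a latch and a flip-flop is easiest to see in a small behavioral model. The sketch below is an idealized, zero-delay model (no setup or hold time, no inverters or transmission gates); the class names and the stimulus are invented for illustration. It builds a negative-edge-triggered flip-flop from two latches, as described above, and compares it with a single positive-enable latch driven by the same inputs.

```python
class Latch:
    """Ideal transparent D latch; transparent while clk equals enable_level."""

    def __init__(self, enable_level):
        self.enable_level = enable_level
        self.q = 0

    def update(self, d, clk):
        if clk == self.enable_level:
            self.q = d          # transparent: output follows D
        return self.q           # otherwise hold the stored value


class NegEdgeFlipFlop:
    """Master-slave flip-flop: master loads while clk is high, slave while clk is low."""

    def __init__(self):
        self.master = Latch(enable_level=1)
        self.slave = Latch(enable_level=0)

    def update(self, d, clk):
        m = self.master.update(d, clk)      # node M
        return self.slave.update(m, clk)    # output Q


# Hypothetical stimulus as (clk, d) pairs; D changes while the clock is high and low.
STIMULUS = [(1, 1), (1, 0), (1, 1), (0, 1), (0, 0), (1, 0), (0, 0)]

latch = Latch(enable_level=1)
flip_flop = NegEdgeFlipFlop()
LATCH_Q = [latch.update(d, clk) for clk, d in STIMULUS]
FF_Q = [flip_flop.update(d, clk) for clk, d in STIMULUS]

print("clk  :", [clk for clk, _ in STIMULUS])
print("d    :", [d for _, d in STIMULUS])
print("latch:", LATCH_Q)
print("ff   :", FF_Q)
```

In this run the latch output follows every change at D while the clock is high, whereas the flip-flop output changes only at the two falling clock edges, taking the value that node M held just before each edge.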
There are other types of flip-flops (J-K, T (toggle), and S-R flip-flops) that are provided in some ASIC cell libraries mainly for compatibility with TTL design. Some people use the term register to mean an array (more than one) of flip-flops or latches (on a data bus, for example), but some people use register to mean a single flip-flop or a latch. This is confusing since flip-flops and latches are quite different in their behavior. When I am talking about logic cells, I use the term register to mean more than one flip-flop.
To add an asynchronous set (Q to '1') or asynchronous reset (Q to '0') to the flip-flop of Figure 2.18(a), we replace one inverter in both the master and slave latches with two-input NAND cells. Thus, for an active-low set, we replace I2 and I7 with two-input NAND cells, and, for an active-low reset, we replace I3 and I6. For both set and reset we replace all four inverters: I2, I3, I6, and I7. Some TTL flip-flops have dominant reset or dominant set, but this is difficult (and dangerous) to do in ASIC design. An input that forces Q to '1' is sometimes also called preset. The IEEE logic symbols use 'P' to denote an input with a presetting action. An input that forces Q to '0' is often also called clear. The IEEE symbols use 'R' to denote an input with a resetting action.
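The master and slave behavior described for Figure 2.18 can be sketched as a small behavioral model. This is only an illustration in Python, not a circuit description: the class names Latch and NegEdgeDFF and the helper simulate are hypothetical, and setup time, hold time, and propagation delay are ignored.

```python
class Latch:
    """Level-sensitive latch: transparent while enabled, otherwise holds its value."""

    def __init__(self):
        self.q = 0

    def update(self, d, enable):
        if enable:
            self.q = d
        return self.q


class NegEdgeDFF:
    """Master latch is transparent while CLK = 1, slave latch while CLK = 0."""

    def __init__(self):
        self.master = Latch()
        self.slave = Latch()

    def step(self, d, clk):
        m = self.master.update(d, clk == 1)    # node M follows D while CLK is high
        return self.slave.update(m, clk == 0)  # Q takes the value of M once CLK is low


def simulate(samples):
    """Apply a list of (D, CLK) pairs and return the Q value after each one."""
    ff = NegEdgeDFF()
    return [ff.step(d, clk) for d, clk in samples]


if __name__ == "__main__":
    # D = 1 is captured at the first falling edge; later changes of D while
    # the clock is low do not reach Q until the next falling edge.
    print(simulate([(1, 1), (1, 0), (0, 0), (0, 1), (0, 0)]))
```

Q changes only in the samples where the clock has just gone low, which is the sampling behavior the text describes for the active (negative) clock edge.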
2.5.3 Clocked Inverter

Figure 2.19 shows how we can derive the structure of a clocked inverter from the series combination of an inverter and a TG. The arrows in Figure 2.19(b) represent the flow of current when the inverter is charging ( I R ) or discharging ( I F ) a load capacitance through the TG. We can break the connection between the inverter cells and use the circuit of Figure 2.19(c) without substantially affecting the operation of the circuit. The symbol for the clocked inverter shown in Figure 2.19(d) is common, but by no means a standard.
FIGURE 2.19 Clocked inverter. (a) An inverter plus transmission gate (TG). (b) The current flow in the inverter and TG allows us to break the connection between the transistors in the inverter. (c) Breaking the connection forms a clocked inverter. (d) A common symbol.
We can use the clocked inverter to replace the inverter–TG pairs in latches and flip-flops. For example, we can replace one or both of the inverters I1 and I3 (together with the TGs that follow them) in Figure 2.17(a) by clocked inverters. There is not much to choose between the different implementations in this case, except that layout may be easier for the clocked inverter versions (since there is one less connection to make).
More interesting is the flip-flop design: We can only replace inverters I1, I3, and I7 (and the TGs that follow them) in Figure 2.18(a) by clocked inverters. We cannot replace inverter I6 because it is not directly connected to a TG. We can replace the TG attached to node M with a clocked inverter, and this will invert the sense of the output Q, which thus becomes QN. Now the clock-to-Q delay will be slower than clock-to-QN, since Q (which was QN) now comes one inverter later than QN.
If we wish to build a flip-flop with a fast clock-to-QN delay it may be better to build it using clocked inverters and use inverters with TGs for a flip-flop with a fast clock-to-Q delay. In fact, since we do not always use both Q and QN outputs of a flip-flop, some libraries include Q-only or QN-only flip-flops that are slightly smaller than those with both polarity outputs. It is slightly easier to lay out clocked inverters than an inverter plus a TG, so flip-flops in commercial libraries include a mixture of clocked-inverter and TG implementations.
2.6 Datapath Logic Cells

Suppose we wish to build an n-bit adder (that adds two n-bit numbers) and to exploit the regularity of this function in the layout. We can do so using a datapath structure. The following two functions, SUM and COUT, implement the sum and carry out for a full adder (FA) with two data inputs (A, B) and a carry in, CIN:
$\mathrm{SUM} = A \oplus B \oplus \mathrm{CIN} = \mathrm{SUM}(A, B, \mathrm{CIN}) = \mathrm{PARITY}(A, B, \mathrm{CIN})$ , (2.38)
$\mathrm{COUT} = A \cdot B + A \cdot \mathrm{CIN} + B \cdot \mathrm{CIN} = \mathrm{MAJ}(A, B, \mathrm{CIN})$ . (2.39)
The sum uses the parity function ('1' if there is an odd number of '1's in the inputs). The carry out, COUT, uses the 2-of-3 majority function ('1' if the majority of the inputs are '1'). We can combine these two functions in a single FA logic cell, ADD(A[i], B[i], CIN, S[i], COUT), shown in Figure 2.20(a), where
$S[i] = \mathrm{SUM}(A[i], B[i], \mathrm{CIN})$ , (2.40)
$\mathrm{COUT} = \mathrm{MAJ}(A[i], B[i], \mathrm{CIN})$ . (2.41)
Now we can build a 4-bit ripple-carry adder (RCA) by connecting four of these ADD cells together as shown in Figure 2.20(b). The i th ADD cell is arranged with the following: two bus inputs A[i], B[i]; one bus output S[i]; an input, CIN, that is the carry in from stage (i − 1) below and is also passed up to the cell above as an output; and an output, COUT, that is the carry out to stage (i + 1) above. In the 4-bit adder shown in Figure 2.20(b) we connect the carry input, CIN[0], to VSS and use COUT[3] and COUT[2] to indicate arithmetic overflow (in Section 2.6.1 we shall see why we may need both signals). Notice that we build the ADD cell so that COUT[2] is available at the top of the datapath when we need it.
Figure 2.20(c) shows a layout of the ADD cell. The A inputs, B inputs, and S outputs all use m1 interconnect running in the horizontal direction; we call these data signals. Other signals can enter or exit from the top or bottom and run vertically across the datapath in m2; we call these control signals. We can also use m1 for control and m2 for data, but we normally do not mix these approaches in the same structure. Control signals are typically clocks and other signals common to elements. For example, in Figure 2.20(c) the carry signals, CIN and COUT, run vertically in m2 between cells. To build a 4-bit adder we stack four ADD cells creating the array structure shown in Figure 2.20(d). In this case the A and B data bus inputs enter from the left and bus S, the sum, exits at the right, but we can connect A, B, and S to either side if we want.
The layout of buswide logic that operates on data signals in this fashion is called a datapath. The module ADD is a datapath cell or datapath element. Just as we do for standard cells we make all the datapath cells in a library the same height so we can abut other datapath cells on either side of the adder to create a more complex datapath. When people talk about a datapath they always assume that it is oriented so that increasing the size in bits makes the datapath grow in height, upwards in the vertical direction, and adding different datapath elements to increase the function makes the datapath grow in width, in the horizontal direction. We can, however, rotate and position a completed datapath in any direction we want on a chip.
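As a quick check of Eqs. 2.38–2.41, the FA cell and the ripple-carry connection of Figure 2.20(b) can be modeled in a few lines of Python. This is a behavioral sketch only; the function names (parity, maj, add_cell, ripple_carry_add) are chosen here for illustration and say nothing about layout or delay.

```python
def parity(a, b, cin):
    """SUM of Eq. 2.38: '1' when an odd number of the inputs are '1'."""
    return a ^ b ^ cin


def maj(a, b, cin):
    """COUT of Eq. 2.39: '1' when at least two of the inputs are '1'."""
    return (a & b) | (a & cin) | (b & cin)


def add_cell(a, b, cin):
    """One ADD cell: returns (S[i], COUT) as in Eqs. 2.40 and 2.41."""
    return parity(a, b, cin), maj(a, b, cin)


def to_bits(x, n):
    """n-bit value as a list of bits, LSB first."""
    return [(x >> i) & 1 for i in range(n)]


def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))


def ripple_carry_add(a, b, n, cin=0):
    """Chain n ADD cells; returns (n-bit sum, list of COUT[0..n-1])."""
    s_bits, couts = [], []
    carry = cin  # CIN[0] is tied to VSS ('0') by default
    for a_i, b_i in zip(to_bits(a, n), to_bits(b, n)):
        s_i, carry = add_cell(a_i, b_i, carry)
        s_bits.append(s_i)
        couts.append(carry)
    return from_bits(s_bits), couts


if __name__ == "__main__":
    s, couts = ripple_carry_add(0b0111, 0b0011, 4)
    print(format(s, "04b"), couts)
```

The list of carries is returned alongside the sum because the text uses both COUT[3] and COUT[2] to indicate overflow.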
FIGURE 2.20 A datapath adder. (a) A full-adder (FA) cell with inputs (A and B), a carry in, CIN, sum output, S, and carry out, COUT. (b) A 4-bit adder. (c) The layout, using two-level metal, with data in m1 and control in m2. In this example the wiring is completed outside the cell; it is also possible to design the datapath cells to contain the wiring. Using three levels of metal, it is possible to wire over the top of the datapath cells. (d) The datapath layout.
What is the difference between using a datapath, standard cells, or gate arrays? Cells are placed together in rows on a CBIC or an MGA, but there is generally no regularity to the arrangement of the cells within the rows; we let software arrange the cells and complete the interconnect. Datapath layout automatically takes care of most of the interconnect between the cells with the following advantages:
- Regular layout produces predictable and equal delay for each bit.
- Interconnect between cells can be built into each cell.
There are some disadvantages of using a datapath:
- The overhead (buffering and routing the control signals, for example) can make a narrow (small number of bits) datapath larger and slower than a standard-cell (or even gate-array) implementation.
- Datapath cells have to be predesigned (otherwise we are using full-custom design) for use in a wide range of datapath sizes.
- Datapath cell design can be harder than designing gate-array macros or standard cells.
- Software to assemble a datapath is more complex and not as widely used as software for assembling standard cells or gate arrays.
There are some newer standard-cell and gate-array tools that can take advantage of regularity in a design and
position cells carefully. The problem is in finding the regularity if it is not specified. Using a datapath is one way to specify regularity to ASIC design tools.
2.6.1 Datapath Elements

Figure 2.21 shows some typical datapath symbols for an adder (people rarely use the IEEE standards in ASIC datapath libraries). I use heavy lines (they are 1.5 point wide) with a stroke to denote a data bus (that flows in the horizontal direction in a datapath), and regular lines (0.5 point) to denote the control signals (that flow vertically in a datapath). At the risk of adding confusion where there is none, this stroke to indicate a data bus has nothing to do with mixed-logic conventions. For a bus, A[31:0] denotes a 32-bit bus with A[31] as the leftmost or most-significant bit or MSB, and A[0] as the least-significant bit or LSB. Sometimes we shall use A[MSB] or A[LSB] to refer to these bits. Notice that if we have an n-bit bus and LSB = 0, then MSB = n − 1. Also, for example, A[4] is the fifth bit on the bus (from the LSB). We use a 'Σ' or 'ADD' inside the symbol to denote an adder instead of '+', so we can attach '−' or '+/−' to the inputs for a subtracter or adder/subtracter.
FIGURE 2.21 Symbols for a datapath adder. (a) A data bus is shown by a heavy line (1.5 point) and a bus symbol. If the bus is n bits wide then MSB = n − 1. (b) An alternative symbol for an adder. (c) Control signals are shown as lightweight (0.5 point) lines.
Some schematic datapath symbols include only data signals and omit the control signals, but we must not forget them. In Figure 2.21, for example, we may need to explicitly tie CIN[0] to VSS and use COUT[MSB] and COUT[MSB − 1] to detect overflow. Why might we need both of these control signals? Table 2.11 shows the process of simple arithmetic for the different binary number representations, including unsigned, signed magnitude, ones complement, and twos complement.
TABLE 2.11 Binary arithmetic.
| Operation | Unsigned | Signed magnitude | Ones complement | Twos complement |
|---|---|---|---|---|
| | no change | if positive then MSB = 0 else MSB = 1 | if negative then flip bits | if negative then {flip bits; add 1} |
| 3 = | 0011 | 0011 | 0011 | 0011 |
| −3 = | NA | 1011 | 1100 | 1101 |
| zero = | 0000 | 0000 or 1000 | 1111 or 0000 | 0000 |
| max. positive = | 1111 = 15 | 0111 = 7 | 0111 = 7 | 0111 = 7 |
| max. negative = | 0000 = 0 | 1111 = −7 | 1000 = −7 | 1000 = −8 |
| addition = S = A + B = addend + augend; SG(A) = sign of A | S = A + B | if SG(A) = SG(B) then S = A + B else {if B < A then S = A − B else S = B − A} | S = A + B + COUT[MSB]; COUT is carry out | S = A + B |
| addition result: OV = overflow, OR = out of range | OR = COUT[MSB]; COUT is carry out | if SG(A) = SG(B) then OV = COUT[MSB] else OV = 0 (impossible) | OV = XOR(COUT[MSB], COUT[MSB − 1]) | OV = XOR(COUT[MSB], COUT[MSB − 1]) |
| SG(S) = sign of S; S = A + B | NA | if SG(A) = SG(B) then SG(S) = SG(A) else {if B < A then SG(S) = SG(A) else SG(S) = SG(B)} | NA | NA |
| subtraction = D = A − B = minuend − subtrahend | D = A − B | SG(B) = NOT(SG(B)); D = A + B | Z = −B (negate); D = A + Z | Z = −B (negate); D = A + Z |
| subtraction result: OV = overflow, OR = out of range | OR = BOUT[MSB]; BOUT is borrow out | as in addition | as in addition | as in addition |
| negation: Z = −A (negate) | NA | Z = A; SG(Z) = NOT(SG(A)) | Z = NOT(A) | Z = NOT(A) + 1 |
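The addition-result row of Table 2.11 can be checked exhaustively for a small word size. The sketch below, in Python, assumes a 4-bit adder and uses hypothetical helper names (add_with_carries, twos_overflow, unsigned_out_of_range); it compares the carry-based flags with the true arithmetic result.

```python
def add_with_carries(a, b, n):
    """Add two n-bit patterns bit by bit; return (n-bit sum, [COUT[0], ..., COUT[n-1]])."""
    carry, couts, s = 0, [], 0
    for i in range(n):
        a_i, b_i = (a >> i) & 1, (b >> i) & 1
        s |= (a_i ^ b_i ^ carry) << i
        carry = (a_i & b_i) | (a_i & carry) | (b_i & carry)
        couts.append(carry)
    return s, couts


def to_signed(x, n):
    """Interpret an n-bit pattern as a twos complement number."""
    return x - (1 << n) if x & (1 << (n - 1)) else x


def twos_overflow(a, b, n):
    """OV = XOR(COUT[MSB], COUT[MSB - 1]) for twos complement addition."""
    _, couts = add_with_carries(a, b, n)
    return couts[n - 1] ^ couts[n - 2]


def unsigned_out_of_range(a, b, n):
    """OR = COUT[MSB] for unsigned addition."""
    _, couts = add_with_carries(a, b, n)
    return couts[n - 1]


if __name__ == "__main__":
    n = 4
    for a in range(1 << n):
        for b in range(1 << n):
            true_sum = to_signed(a, n) + to_signed(b, n)
            fits = -(1 << (n - 1)) <= true_sum <= (1 << (n - 1)) - 1
            assert twos_overflow(a, b, n) == (0 if fits else 1)
            assert unsigned_out_of_range(a, b, n) == (1 if a + b >= (1 << n) else 0)
    print("overflow flags agree with Table 2.11 for all 4-bit operands")
```

The twos complement flag needs both COUT[MSB] and COUT[MSB − 1], which is why the adder makes both carries available.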
2.6.2 Adders
We can view addition in terms of generate, G[i], and propagate, P[i], signals.
| method 1 | method 2 | |
|---|---|---|
| $G[i] = A[i] \cdot B[i]$ | $G[i] = A[i] \cdot B[i]$ | (2.42) |
| $P[i] = A[i] \oplus B[i]$ | $P[i] = A[i] + B[i]$ | (2.43) |
| $C[i] = G[i] + P[i] \cdot C[i-1]$ | $C[i] = G[i] + P[i] \cdot C[i-1]$ | (2.44) |
| $S[i] = P[i] \oplus C[i-1]$ | $S[i] = A[i] \oplus B[i] \oplus C[i-1]$ | (2.45) |
where C[i] is the carry-out signal from stage i, equal to the carry in of stage (i + 1). Thus, C[i] = COUT[i] = CIN[i + 1]. We need to be careful because C[0] might represent either the carry in or the carry out of the LSB stage. For an adder we set the carry in to the first stage (stage zero), C[−1] or CIN[0], to '0'. Some people use delete (D) or kill (K) in various ways for the complements of G[i] and P[i], but unfortunately others use C for COUT and D for CIN, so I avoid using any of these. Do not confuse the two different methods (both of which are used) in Eqs. 2.42–2.45 when forming the sum, since the propagate signal, P[i], is different for each method.
Figure 2.22(a) shows a conventional RCA. The delay of an n-bit RCA is proportional to n and is limited by the propagation of the carry signal through all of the stages. We can reduce delay by using pairs of go-faster bubbles to change AND and OR gates to fast two-input NAND gates as shown in Figure 2.22(a). Alternatively, we can write the equations for the carry signal in two different ways: either
$C[i] = A[i] \cdot B[i] + P[i] \cdot C[i-1]$ (2.46)
or
$C[i] = (A[i] + B[i]) \cdot (P[i]' + C[i-1])$ , (2.47)
where P[i]' = NOT(P[i]). Equations 2.46 and 2.47 allow us to build the carry chain from two-input NAND gates, one per cell, using different logic in even and odd stages (Figure 2.22b):
| even stages | odd stages | |
|---|---|---|
| $C1[i]' = P[i] \cdot C3[i-1] \cdot C4[i-1]$ | $C3[i]' = P[i] \cdot C1[i-1] \cdot C2[i-1]$ | (2.48) |
| $C2[i] = A[i] + B[i]$ | $C4[i]' = A[i] \cdot B[i]$ | (2.49) |
| $C[i] = C1[i] \cdot C2[i]$ | $C[i] = C3[i]' + C4[i]'$ | (2.50) |
(the carry inputs to stage zero are C3[−1] = C4[−1] = '0'). We can use the RCA of Figure 2.22(b) in a datapath, with standard cells, or on a gate array.
Instead of propagating the carries through each stage of an RCA, Figure 2.23 shows a different approach. A carry-save adder (CSA) cell CSA(A1[i], A2[i], A3[i], CIN, S1[i], S2[i], COUT) has three outputs:
$S1[i] = \mathrm{CIN}$ , (2.51)
$S2[i] = A1[i] \oplus A2[i] \oplus A3[i] = \mathrm{PARITY}(A1[i], A2[i], A3[i])$ , (2.52)
$\mathrm{COUT} = A1[i] \cdot A2[i] + [(A1[i] + A2[i]) \cdot A3[i]] = \mathrm{MAJ}(A1[i], A2[i], A3[i])$ . (2.53)
The inputs, A1, A2, and A3; and outputs, S1 and S2, are buses. The input, CIN, is the carry from stage (i − 1).
The carry in, CIN, is connected directly to the output bus S1, as indicated by the schematic symbol (Figure 2.23a). We connect CIN[0] to VSS. The output, COUT, is the carry out to stage (i + 1). A 4-bit CSA is shown in Figure 2.23(b). The arithmetic overflow signal for ones complement or twos complement arithmetic, OV, is XOR(COUT[MSB], COUT[MSB − 1]) as shown in Figure 2.23(c). In a CSA the carries are saved at each stage and shifted left onto the bus S1. There is thus no carry propagation and the delay of a CSA is constant. At the output of a CSA we still need to add the S1 bus (all the saved carries) and the S2 bus (all the sums) to get an n-bit result using a final stage that is not shown in Figure 2.23(c). We might regard the n-bit sum as being encoded in the two buses, S1 and S2, in the form of the parity and majority functions.
We can use a CSA to add multiple inputs. As an example, an adder with four 4-bit inputs is shown in Figure 2.23(d). The last stage sums two input buses using a carry-propagate adder (CPA). We have used an RCA as the CPA in Figure 2.23(d) and (e), but we can use any type of adder. Notice in Figure 2.23(e) how the two CSA cells and the RCA cell abut together horizontally to form a bit slice (or slice) and then the slices are stacked vertically to form the datapath.
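The carry-save idea of Eqs. 2.51–2.53 can be sketched on whole buses at once, treating each bus as a Python integer. The names csa and add4 are illustrative, bus widths are left unbounded for simplicity, and Python's built-in addition stands in for the final CPA.

```python
def csa(a1, a2, a3):
    """One carry-save stage applied to every bit position in parallel.

    Returns (S1, S2): S2 is the bitwise parity (Eq. 2.52) and S1 holds the
    majority carries (Eq. 2.53) shifted left by one position, so that
    S1[i] = COUT[i - 1] (Eq. 2.51) and S1[0] = 0 (CIN[0] tied to VSS).
    """
    s2 = a1 ^ a2 ^ a3
    cout = (a1 & a2) | ((a1 | a2) & a3)
    s1 = cout << 1
    return s1, s2


def add4(a, b, c, d):
    """Four-input adder: two CSA stages followed by a carry-propagate add."""
    s1, s2 = csa(a, b, c)
    t1, t2 = csa(s1, s2, d)
    return t1 + t2  # final CPA; any adder type would do here


if __name__ == "__main__":
    print(add4(0b1011, 0b0110, 0b1111, 0b0001))
```

No carry ripples inside csa; the only carry propagation happens in the final addition of the saved-carry bus and the sum bus.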
FIGURE 2.23 The carry-save adder (CSA). (a) A CSA cell. (b) A 4-bit CSA. (c) Symbol for a CSA. (d) A four-input CSA. (e) The datapath for a four-input, 4-bit adder using CSAs with a ripple-carry adder (RCA) as the final stage. (f) A pipelined adder. (g) The datapath for the pipelined version showing the pipeline registers as well as the clock control lines that use m2.
We can register the CSA stages by adding vectors of flip-flops as shown in Figure 2.23(f). This reduces the adder delay to that of the slowest adder stage, usually the CPA. By using registers between stages of combinational logic we use pipelining to increase the speed and pay a price of increased area (for the registers) and introduce latency. It takes a few clock cycles (the latency, equal to n clock cycles for an n-stage pipeline) to fill the pipeline, but once it is filled, the answers emerge every clock cycle. Ferris wheels work much the same way. When the fair opens it takes a while (latency) to fill the wheel, but once it is full the people can get on and off every few seconds. (We can also pipeline the RCA of Figure 2.20. We add i registers on the A and B inputs before ADD[i] and add (n − i) registers after the output S[i], with a single register before each C[i].)
The problem with an RCA is that every stage has to wait to make its carry decision, C[i], until the previous stage has calculated C[i − 1]. If we examine the propagate signals we can bypass this critical path. Thus, for example, to bypass the carries for bits 4–7 (stages 5–8) of an adder we can compute BYPASS = P[4]·P[5]·P[6]·P[7] and then use a MUX as follows:
$C[7] = (G[7] + P[7] \cdot C[6]) \cdot \mathrm{BYPASS}' + C[3] \cdot \mathrm{BYPASS}$ . (2.54)
Adders based on this principle are called carry-bypass adders (CBA) [Sato et al., 1992]. Large, custom adders employ Manchester-carry chains to compute the carries and the bypass operation using TGs or just pass transistors [Weste and Eshraghian, 1993, pp. 530–531]. These types of carry chains may be part of a predesigned ASIC adder cell, but are not used by ASIC designers.
Instead of checking the propagate signals we can check the inputs. For example we can compute SKIP = (A[i − 1] ⊕ B[i − 1]) + (A[i] ⊕ B[i]) and then use a 2:1 MUX to select C[i]. Thus,
$\mathrm{CSKIP}[i] = (G[i] + P[i] \cdot C[i-1]) \cdot \mathrm{SKIP}' + C[i-2] \cdot \mathrm{SKIP}$ . (2.55)
This is a carry-skip adder [Keutzer, Malik, and Saldanha, 1991; Lehman, 1961]. Carry-bypass and carry-skip adders may include redundant logic (since the carry is computed in two different ways, we just take the first signal to arrive). We must be careful that the redundant logic is not optimized away during logic synthesis.
If we evaluate Eq. 2.44 recursively for i = 1, we get the following (the last step uses C[−1] = '0'):
$C[1] = G[1] + P[1] \cdot C[0] = G[1] + P[1] \cdot (G[0] + P[0] \cdot C[-1]) = G[1] + P[1] \cdot G[0]$ . (2.56)
This result means that we can look ahead by two stages and calculate the carry into the third stage (bit 2), which is C[1], using only the first-stage inputs (to calculate G[0]) and the second-stage inputs. This is a carry-lookahead adder (CLA) [MacSorley, 1961]. If we continue expanding Eq. 2.44, we find:
$C[2] = G[2] + P[2] \cdot G[1] + P[2] \cdot P[1] \cdot G[0]$ ,
$C[3] = G[3] + P[3] \cdot G[2] + P[3] \cdot P[2] \cdot G[1] + P[3] \cdot P[2] \cdot P[1] \cdot G[0]$ . (2.57)
As we look ahead further these equations become more complex, take longer to calculate, and the logic becomes less regular when implemented using cells with a limited number of inputs. Datapath layout must fit in a bit slice, so the physical and logical structure of each bit must be similar. In a standard cell or gate array we are not so concerned about a regular physical structure, but a regular logical structure simplifies design.
The Brent–Kung adder reduces the delay and increases the regularity of the carry-lookahead scheme [Brent and Kung, 1982]. Figure 2.24(a) shows a regular 4-bit CLA, using the carry-lookahead generator cell (CLG) shown in Figure 2.24(b).
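The lookahead expansion of Eq. 2.44 can be compared with the rippled carries in a short Python sketch. The function names are hypothetical, the carry in C[−1] is assumed to be '0', and both definitions of the propagate signal (method 1 and method 2 of Eqs. 2.42–2.45) are tried, since either gives the same carries.

```python
def generate_propagate(a, b, n, method=1):
    """G[i] and P[i] for two n-bit operands (method 1: P = XOR, method 2: P = OR)."""
    G, P = [], []
    for i in range(n):
        a_i, b_i = (a >> i) & 1, (b >> i) & 1
        G.append(a_i & b_i)
        P.append(a_i ^ b_i if method == 1 else a_i | b_i)
    return G, P


def ripple_carries(G, P, cin=0):
    """Eq. 2.44 evaluated stage by stage: C[i] = G[i] + P[i].C[i-1]."""
    carries, c = [], cin
    for g, p in zip(G, P):
        c = g | (p & c)
        carries.append(c)
    return carries


def lookahead_carries(G, P):
    """Eq. 2.44 fully expanded (C[-1] = 0), as in Eqs. 2.56 and 2.57:
    C[i] = G[i] + P[i].G[i-1] + ... + P[i].P[i-1]...P[1].G[0]."""
    carries = []
    for i in range(len(G)):
        c = 0
        for j in range(i + 1):
            term = G[j]
            for k in range(j + 1, i + 1):
                term &= P[k]
            c |= term
        carries.append(c)
    return carries


if __name__ == "__main__":
    G, P = generate_propagate(0b1011, 0b0110, 4)
    print(ripple_carries(G, P), lookahead_carries(G, P))
```

Each lookahead carry depends only on G and P terms, never on another carry, which is what removes the ripple path at the cost of wider product terms.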
FIGURE 2.24 The Brent–Kung carry-lookahead adder (CLA). (a) Carry generation in a 4-bit CLA. (b) A cell to generate the lookahead terms, C[0]–C[3]. (c) Cells L1, L2, and L3 are rearranged into a tree that has less delay. Cell L4 is added to calculate C[2] that is lost in the translation. (d) and (e) Simplified representations of parts a and c. (f) The lookahead logic for an 8-bit adder. The inputs, 0–7, are the propagate and carry terms formed from the inputs to the adder. (g) An 8-bit Brent–Kung CLA. The outputs of the lookahead logic are the carry bits that (together with the inputs) form the sum. One advantage of this adder is that delays from the inputs to the outputs are more nearly equal than in other adders. This tends to reduce the number of unwanted and unnecessary switching events and thus reduces power dissipation.
In a carry-select adder we duplicate two small adders (usually 4-bit or 8-bit adders, often CLAs) for the cases CIN = '0' and CIN = '1' and then use a MUX to select the case that we need. This is wasteful, but fast [Bedrij, 1962]. A carry-select adder is often used as the fast adder in a datapath library because its layout is regular.
We can use the carry-select, carry-bypass, and carry-skip architectures to split a 12-bit adder, for example, into three blocks. The delay of the adder is then partly dependent on the delays of the MUX between each block. Suppose the delay due to 1 bit in an adder block (we shall call this a bit delay) is approximately equal to the MUX delay. In this case it may be faster to make the blocks 3, 4, and 5 bits long instead of being equal in size. Now the delays into the final MUX are equal: 3 bit-delays plus 2 MUX delays for the carry signal from bits 0–6 and 5 bit-delays for the carry from bits 7–11. Adjusting the block size reduces the delay of large adders (more than 16 bits).
We can extend the idea behind a carry-select adder as follows. Suppose we have an n-bit adder that generates two sums: One sum assumes a carry-in condition of '0', the other sum assumes a carry-in condition of '1'. We can split this n-bit adder into an i-bit adder for the i LSBs and an (n − i)-bit adder for the n − i MSBs. Both of the smaller adders generate two conditional sums as well as true and complement carry signals. The two (true and complement) carry signals from the LSB adder are used to select between the two (n − i + 1)-bit conditional sums from the MSB adder using 2(n − i + 1) two-input MUXes. This is a conditional-sum adder (also often abbreviated to CSA) [Sklansky, 1960]. We can recursively apply this technique. For example, we can split a 16-bit adder into two 8-bit adders (i = 8 and n − i = 8); then we can split one or both 8-bit adders again, and so on.
Figure 2.25 shows the simplest form of an n-bit conditional-sum adder that uses n single-bit conditional adders, H (each with four outputs: two conditional sums, true carry, and complement carry), together with a tree of 2:1 MUXes (Qi_j). The conditional-sum adder is usually the fastest of all the adders we have discussed (it is the fastest when logic cell delay increases with the number of inputs; this is true for all ASICs except FPGAs).
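The recursive splitting described above can be sketched as follows. This Python model is only a functional illustration (the names cond_add and conditional_sum_add are made up here, and bit lists are LSB first); it does not reproduce the exact MUX tree of Figure 2.25.

```python
def cond_add(a_bits, b_bits):
    """Return ((sum0, carry0), (sum1, carry1)): the conditional results for
    carry in '0' and carry in '1'. Sums are bit lists, LSB first."""
    n = len(a_bits)
    if n == 1:
        a, b = a_bits[0], b_bits[0]
        # Single-bit conditional adder: both answers computed from the inputs alone.
        return ([a ^ b], a & b), ([1 ^ a ^ b], a | b)
    i = n // 2  # i LSBs in one adder, n - i MSBs in the other
    (lo_s0, lo_c0), (lo_s1, lo_c1) = cond_add(a_bits[:i], b_bits[:i])
    hi = cond_add(a_bits[i:], b_bits[i:])
    # 2:1 MUXes: each carry from the LSB adder selects one MSB result.
    hi_s0, hi_c0 = hi[lo_c0]
    hi_s1, hi_c1 = hi[lo_c1]
    return (lo_s0 + hi_s0, hi_c0), (lo_s1 + hi_s1, hi_c1)


def conditional_sum_add(a, b, n, cin=0):
    """Add two n-bit numbers; returns (n-bit sum, carry out)."""
    a_bits = [(a >> k) & 1 for k in range(n)]
    b_bits = [(b >> k) & 1 for k in range(n)]
    s_bits, cout = cond_add(a_bits, b_bits)[cin]
    return sum(bit << k for k, bit in enumerate(s_bits)), cout


if __name__ == "__main__":
    print(conditional_sum_add(0b10110101, 0b01101110, 8, cin=1))
```

The true carry in is only used at the very last selection, so no block ever waits for a carry to ripple in from below.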
FIGURE 2.25 The conditional-sum adder. (a) A 1-bit conditional adder that calculates the sum and carry out assuming the carry in is either '1' or '0'. (b) The multiplexer that selects between sums and carries. (c) A 4-bit conditional-sum adder with carry input, C[0].
2.6.3 A Simple Example

How do we make and use datapath elements? What does a design look like? We may use predesigned cells from a library or build the elements ourselves from logic cells using a schematic or a design language. Table 2.12 shows an 8-bit conditional-sum adder intended for an FPGA. This Verilog implementation uses the same structure as Figure 2.25, but the equations are collapsed to use four or five variables. A basic logic cell in certain Xilinx FPGAs, for example, can implement two equations of the same four variables or one equation with five variables. The equations shown in Table 2.12 require three levels of FPGA logic cells (so, for example, if each FPGA logic cell has a 5 ns delay, the 8-bit conditional-sum adder delay is 15 ns).
TABLE 2.12 An 8-bit conditional-sum adder (the notation is described in Figure 2.25).
module m8bitCSum (C0, a, b, s, C8); // Verilog conditional-sum adder for an FPGA
  input C0;            // carry in (a single bit)
  input [7:0] a, b;    // the two 8-bit operands
  output [7:0] s;      // the 8-bit sum
  output C8;           // carry out
  wire A7,A6,A5,A4,A3,A2,A1,A0,B7,B6,B5,B4,B3,B2,B1,B0,S8,S7,S6,S5,S4,S3,S2,S1,S0;
  wire C0, C2, C4_2_0, C4_2_1, S5_4_0, S5_4_1, C6, C6_4_0, C6_4_1, C8;
  assign {A7,A6,A5,A4,A3,A2,A1,A0} = a;
  assign {B7,B6,B5,B4,B3,B2,B1,B0} = b;
  assign s = { S7,S6,S5,S4,S3,S2,S1,S0 };
  // start of level 1: & = AND, ^ = XOR, | = OR, ! = NOT
  assign S0 = A0^B0^C0 ;
  assign S1 = A1^B1^(A0&B0|(A0|B0)&C0) ;
  assign C2 = A1&B1|(A1|B1)&(A0&B0|(A0|B0)&C0) ;
  assign C4_2_0 = A3&B3|(A3|B3)&(A2&B2) ;
  assign C4_2_1 = A3&B3|(A3|B3)&(A2|B2) ;
  assign S5_4_0 = A5^B5^(A4&B4) ;
  assign S5_4_1 = A5^B5^(A4|B4) ;
  assign C6_4_0 = A5&B5|(A5|B5)&(A4&B4) ;
  assign C6_4_1 = A5&B5|(A5|B5)&(A4|B4) ;
  // start of level 2
  assign S2 = A2^B2^C2 ;
  assign S3 = A3^B3^(A2&B2|(A2|B2)&C2) ;
  assign S4 = A4^B4^(C4_2_0|C4_2_1&C2) ;
  assign S5 = S5_4_0& !(C4_2_0|C4_2_1&C2)|S5_4_1&(C4_2_0|C4_2_1&C2) ;
  assign C6 = C6_4_0|C6_4_1&(C4_2_0|C4_2_1&C2) ;
  // start of level 3
  assign S6 = A6^B6^C6 ;
  assign S7 = A7^B7^(A6&B6|(A6|B6)&C6) ;
  assign C8 = A7&B7|(A7|B7)&(A6&B6|(A6|B6)&C6) ;
endmodule
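One way to gain confidence in the collapsed equations of Table 2.12 is to transcribe them into software and compare the result with ordinary addition for every input. The Python port below is such a sketch; the function name m8bit_csum and the intermediate names c1 and c4 are introduced here for readability (the Verilog inlines those expressions), and the carry in is assumed to be a single bit.

```python
def m8bit_csum(c0, a, b):
    """Python transcription of the Table 2.12 equations; returns (s, C8)."""
    A = [(a >> i) & 1 for i in range(8)]
    B = [(b >> i) & 1 for i in range(8)]
    S = [0] * 8
    # level 1
    c1 = A[0] & B[0] | (A[0] | B[0]) & c0
    S[0] = A[0] ^ B[0] ^ c0
    S[1] = A[1] ^ B[1] ^ c1
    C2 = A[1] & B[1] | (A[1] | B[1]) & c1
    C4_2_0 = A[3] & B[3] | (A[3] | B[3]) & (A[2] & B[2])  # C4 if C2 = 0
    C4_2_1 = A[3] & B[3] | (A[3] | B[3]) & (A[2] | B[2])  # C4 if C2 = 1
    S5_4_0 = A[5] ^ B[5] ^ (A[4] & B[4])                  # S5 if C4 = 0
    S5_4_1 = A[5] ^ B[5] ^ (A[4] | B[4])                  # S5 if C4 = 1
    C6_4_0 = A[5] & B[5] | (A[5] | B[5]) & (A[4] & B[4])  # C6 if C4 = 0
    C6_4_1 = A[5] & B[5] | (A[5] | B[5]) & (A[4] | B[4])  # C6 if C4 = 1
    # level 2
    c4 = C4_2_0 | C4_2_1 & C2
    S[2] = A[2] ^ B[2] ^ C2
    S[3] = A[3] ^ B[3] ^ (A[2] & B[2] | (A[2] | B[2]) & C2)
    S[4] = A[4] ^ B[4] ^ c4
    S[5] = S5_4_0 & (1 ^ c4) | S5_4_1 & c4
    C6 = C6_4_0 | C6_4_1 & c4
    # level 3
    c7 = A[6] & B[6] | (A[6] | B[6]) & C6
    S[6] = A[6] ^ B[6] ^ C6
    S[7] = A[7] ^ B[7] ^ c7
    C8 = A[7] & B[7] | (A[7] | B[7]) & c7
    return sum(bit << i for i, bit in enumerate(S)), C8


if __name__ == "__main__":
    for c0 in (0, 1):
        for a in range(256):
            for b in range(256):
                s, c8 = m8bit_csum(c0, a, b)
                assert s + (c8 << 8) == a + b + c0
    print("Table 2.12 equations match a + b + C0 for all inputs")
```

Python and Verilog give &, ^, and | the same relative precedence, so the expressions can be copied almost character for character.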
Figure 2.26 shows the normalized delay and area figures for a set of predesigned datapath adders. The data in Figure 2.26 is from a series of ASIC datapath cell libraries (Compass Passport) that may be synthesized together with test vectors and simulation models. We can combine the different adder techniques, but the adders then lose regularity and become less suited to a datapath implementation.
FIGURE 2.26 Datapath adders. This data is from a series of submicron datapath libraries. (a) Delay normalized to a two-input NAND logic cell delay (approximately equal to 250 ps in a 0.5 µm process). For example, a 64-bit ripple-carry adder (RCA) has a delay of approximately 30 ns in a 0.5 µm process. The spread in delay is due to variation in delays between different inputs and outputs. An n-bit RCA has a delay proportional to n. The delay of an n-bit carry-select adder is approximately proportional to log₂ n. The carry-save adder delay is constant (but requires a carry-propagate adder to complete an addition). (b) In a datapath library the area of all adders is proportional to the bit size.
There are other adders that are not used in datapaths, but are occasionally useful in ASIC design. A serial adder is smaller but slower than the parallel adders we have described [Denyer and Renshaw, 1985]. The carry-completion adder is a variable-delay adder and rarely used in synchronous designs [Sklansky, 1960].
2.6.4 Multipliers
Figure 2.27 shows a symmetric 6-bit array multiplier (an n-bit multiplier multiplies two n-bit numbers; we shall use n-bit by m-bit multiplier if the lengths are different). Adders a0–f0 may be eliminated, which then eliminates adders a1–a6, leaving an asymmetric CSA array of 30 (5 × 6) adders (including one half adder). An n-bit array multiplier has a delay proportional to n plus the delay of the CPA (adders b6–f6 in Figure 2.27). There are two items we can attack to improve the performance of a multiplier: the number of partial products and the addition of the partial products.
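A behavioral view of an array multiplier is a row of summands (each an AND of one multiplicand bit and one multiplier bit) fed into carry-save stages, followed by a single carry-propagate addition. The Python sketch below models that idea on whole rows at once; it is an assumption-laden illustration (the name array_multiply is invented, and the cell-by-cell arrangement of Figure 2.27 is not reproduced).

```python
def array_multiply(a, b, n):
    """Multiply two n-bit unsigned numbers using carry-save accumulation."""
    s1, s2 = 0, 0  # saved carries and saved sums
    for j in range(n):
        # Row j of summands: A[i] AND B[j] for every i, weighted by 2**(i + j).
        row = (a if (b >> j) & 1 else 0) << j
        carry = (s1 & s2) | ((s1 | s2) & row)  # majority, bit by bit
        s2 = s1 ^ s2 ^ row                     # parity, bit by bit
        s1 = carry << 1                        # carries move one bit to the left
    return s1 + s2  # final carry-propagate adder (CPA)


if __name__ == "__main__":
    print(array_multiply(0b101101, 0b110011, 6))
```

Each pass through the loop costs one adder delay regardless of word length, so the delay of the carry-save part grows with the number of rows, and the CPA is paid for only once.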
FIGURE 2.27 Multiplication. A 6-bit array multiplier using a final carry-propagate adder (full-adder cells a6–f6, a ripple-carry adder). Apart from the generation of the summands this multiplier uses the same structure as the carry-save adder of Figure 2.23(d).
Suppose we wish to multiply 15 (the multiplicand) by 19 (the multiplier) mentally. It is easier to calculate 15 × 20 and subtract 15. In effect we complete the multiplication as 15 × (20 − 1) and we could write this as $15 \times 2\bar{1}$, with the overbar representing a minus sign. Now suppose we wish to multiply an 8-bit binary number, A, by B = 00010111 (decimal 16 + 4 + 2 + 1 = 23). It is easier to multiply A by the canonical signed-digit vector (CSD vector) $D = 0010\bar{1}00\bar{1}$ (decimal 32 − 8 − 1 = 23) since this requires only three add or subtract operations (and a subtraction is as easy as an addition). We say B has a weight of 4 and D has a weight of 3. By using D instead of B we have reduced the number of partial products by 1 (= 4 − 3).
We can recode (or encode) any binary number, B, as a CSD vector, D, as follows (canonical means there is only one CSD vector for any number):
$D_i = B_i + C_i - 2C_{i+1}$ , (2.58)
where $C_{i+1}$ is the carry from the sum of $B_{i+1} + B_i + C_i$ (we start with $C_0 = 0$). As another example, if B = 011 ($B_2 = 0$, $B_1 = 1$, $B_0 = 1$; decimal 3), then, using Eq. 2.58,
$D_0 = B_0 + C_0 - 2C_1 = 1 + 0 - 2 = \bar{1}$ ,
$D_1 = B_1 + C_1 - 2C_2 = 1 + 1 - 2 = 0$ ,
$D_2 = B_2 + C_2 - 2C_3 = 0 + 1 - 0 = 1$ , (2.59)
so that $D = 10\bar{1}$ (decimal 4 − 1 = 3). CSD vectors are useful to represent fixed coefficients in digital filters, for example.
We can recode using a radix other than 2. Suppose B is an (n + 1)-digit twos complement number,
$B = B_0 + B_1 2 + B_2 2^2 + \dots + B_i 2^i + \dots + B_{n-1} 2^{n-1} - B_n 2^n$ . (2.60)
We can rewrite the expression for B using the following sleight-of-hand:
$2B - B = B = -B_0 + (B_0 - B_1)2 + \dots + (B_{i-1} - B_i)2^i + \dots + (B_{n-1} - B_n)2^n$
$= (-2B_1 + B_0)2^0 + (-2B_3 + B_2 + B_1)2^2 + \dots + (-2B_i + B_{i-1} + B_{i-2})2^{i-1} + (-2B_{i+2} + B_{i+1} + B_i)2^{i+1} + \dots + (-2B_n + B_{n-1} + B_{n-2})2^{n-1}$ . (2.61)
This is very useful. Consider B = 101001 (decimal 9 − 32 = −23, n = 5),
$B = 101001 = (-2B_1 + B_0)2^0 + (-2B_3 + B_2 + B_1)2^2 + (-2B_5 + B_4 + B_3)2^4$
$= ((-2 \times 0) + 1)2^0 + ((-2 \times 1) + 0 + 0)2^2 + ((-2 \times 1) + 0 + 1)2^4$ . (2.62)
Equation 2.61 tells us how to encode B as a radix-4 signed digit, $E = \bar{1}\bar{2}1$ (decimal −16 − 8 + 1 = −23). To multiply by B encoded as E we only have to perform a multiplication by 2 (a shift) and three add/subtract operations. Using Eq. 2.61 we can encode any number by taking groups of three bits at a time and calculating
$E_j = -2B_i + B_{i-1} + B_{i-2}$ , $E_{j+1} = -2B_{i+2} + B_{i+1} + B_i$ , . . . , (2.63)
where each 3-bit group overlaps by one bit. We pad B with a zero, $B_n \dots B_1 B_0 0$, to match the first term in Eq. 2.61. If B has an odd number of bits, then we extend the sign: $B_n B_n \dots B_1 B_0 0$. For example, B = 01011 (eleven), encodes to $E = 1\bar{1}\bar{1}$ (16 − 4 − 1); and B = 101 is $E = \bar{1}1$. This is called Booth encoding and reduces the number of partial products by a factor of two and thus considerably reduces the area as well as increasing the speed of our multiplier [Booth, 1951].
Next we turn our attention to improving the speed of addition in the CSA array. Figure 2.28(a) shows a section of the 6-bit array multiplier from Figure 2.27. We can collapse the chain of adders a0–f5 (5 adder delays) to the Wallace tree consisting of adders 5.1–5.4 (4 adder delays) shown in Figure 2.28(b).
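Both recodings can be tried out in software. The Python sketch below follows Eq. 2.58 for the CSD vector and Eq. 2.63 for radix-4 Booth encoding; the function names (csd_recode, booth_radix4, digits_value) are invented for this illustration, digit lists are returned least-significant digit first, and csd_recode is assumed to be given an unsigned value.

```python
def csd_recode(b, n):
    """CSD digits D[0..n] (each -1, 0, or 1) of an n-bit unsigned number, Eq. 2.58."""
    bits = [(b >> i) & 1 for i in range(n)] + [0, 0]
    digits, carry = [], 0  # C_0 = 0
    for i in range(n + 1):
        # C_(i+1) is the carry from B_(i+1) + B_i + C_i
        next_carry = 1 if bits[i + 1] + bits[i] + carry >= 2 else 0
        digits.append(bits[i] + carry - 2 * next_carry)
        carry = next_carry
    return digits


def booth_radix4(b, nbits):
    """Radix-4 Booth digits (each in -2..2) of an nbits-wide twos complement pattern."""
    if nbits % 2:  # odd number of bits: extend the sign by one bit
        b |= ((b >> (nbits - 1)) & 1) << nbits
        nbits += 1
    padded = b << 1  # pad with a zero to the right of B_0
    digits = []
    for i in range(0, nbits, 2):  # overlapping 3-bit groups, Eq. 2.63
        group = (padded >> i) & 0b111
        hi, mid, lo = (group >> 2) & 1, (group >> 1) & 1, group & 1
        digits.append(-2 * hi + mid + lo)
    return digits


def digits_value(digits, radix):
    return sum(d * radix ** k for k, d in enumerate(digits))


if __name__ == "__main__":
    print(csd_recode(0b00010111, 8))   # the CSD vector for 23
    print(booth_radix4(0b101001, 6))   # the radix-4 digits for -23
```

Reading the printed lists from right to left gives the vectors D and E quoted in the text.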
FIGURE 2.28 Tree-based multiplication. (a) The portion of Figure 2.27 that calculates the sum bit, $P_5$, using a chain of adders (cells a0–f5). (b) We can collapse this chain to a Wallace tree (cells 5.1–5.5). (c) The stages of multiplication.
Figure 2.28(c) pictorially represents multiplication as a sort of golf course. Each link corresponds to an adder. The holes or dots are the outputs of one stage (and the inputs of the next). At each stage we have the following three choices: (1) sum three outputs using a full adder (denoted by a box enclosing three dots); (2) sum two outputs using a half adder (a box with two dots); (3) pass the outputs directly to the next stage. The two outputs of an adder are joined by a diagonal line (full adders use black dots, half adders white dots). The object of the game is to choose (1), (2), or (3) at each stage to maximize the performance of the multiplier. In tree-based multipliers there are two ways to do this: working forward and working backward.
In a Wallace-tree multiplier we work forward from the multiplier inputs, compressing the number of signals to be added at each stage [Wallace, 1960]. We can view an FA as a 3:2 compressor or (3, 2) counter; it counts the number of '1's on the inputs. Thus, for example, an input of '101' (two '1's) results in an output '10' (2). A half adder is a (2, 2) counter. To form $P_5$ in Figure 2.29 we must add 6 summands ($S_{05}$, $S_{14}$, $S_{23}$, $S_{32}$, $S_{41}$, and $S_{50}$) and 4 carries from the $P_4$ column. We add these in stages 1–7, compressing from 6:3:2:2:3:1:1. Notice that we wait until stage 5 to add the last carry from column $P_4$, and this means we expand (rather than compress) the number of signals (from 2 to 3) between stages 3 and 5. The maximum delay through the CSA array of Figure 2.29 is 6 adder delays. To this we must add the delay of the 4-bit (9 inputs) CPA (stage 7). There are 26 adders (6 half adders) plus the 4 adders in the CPA.
FIGURE 2.29 A 6-bit Wallace-tree multiplier. The carry-save adder (CSA) requires 26 adders (cells 1–26, six are half adders). The final carry-propagate adder (CPA) consists of 4 adder cells (27–30). The delay of the CSA is 6 adders. The delay of the CPA is 4 adders.

In a Dadda multiplier (Figure 2.30) we work backward from the final product [Dadda, 1965]. Each stage has a maximum of 2, 3, 4, 6, 9, 13, 19, . . . outputs (each successive stage is 3/2 times larger, rounded down to an integer). Thus, for example, in Figure 2.28(d) we require 3 stages (with 3 adder delays, plus the delay of a 10-bit output CPA) for a 6-bit Dadda multiplier. There are 19 adders (4 half adders) in the CSA plus the 10 adders (2 half adders) in the CPA. A Dadda multiplier is usually faster and smaller than a Wallace-tree multiplier.
FIGURE 2.30 The 6-bit Dadda multiplier. The carry-save adder (CSA) requires 20 adders (cells 1–20, four are half adders). The carry-propagate adder (CPA, cells 21–30) is a ripple-carry adder (RCA). The CSA is smaller (20 versus 26 adders), faster (3 adder delays versus 6 adder delays), and more regular than the Wallace-tree CSA of Figure 2.29. The overall speed of this implementation is approximately the same as the Wallace-tree multiplier of Figure 2.29; however, the speed may be increased by substituting a faster CPA.
In general, the number of stages and thus delay (in units of an FA delay, excluding the CPA) for an n-bit tree-based multiplier using (3, 2) counters is

$$\log_{1.5} n = \frac{\log_{10} n}{\log_{10} 1.5} = \frac{\log_{10} n}{0.176} . \tag{2.64}$$

Figure 2.31(a) shows how the partial-product array is constructed in a conventional 4-bit multiplier. The Ferrari–Stefanelli multiplier (Figure 2.31b) nests multipliers: the 2-bit submultipliers reduce the number of partial products [Ferrari and Stefanelli, 1969].
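The stage sizes quoted for the Dadda multiplier (2, 3, 4, 6, 9, 13, 19, . . .) and the logarithmic estimate of Eq. 2.64 can be compared with a short sketch. The function names are hypothetical, and counting one stage per stage size smaller than n is an assumption that matches the 3 stages quoted for the 6-bit example.

```python
import math


def dadda_heights(n: int) -> list[int]:
    """Stage sizes 2, 3, 4, 6, 9, ... that are smaller than n."""
    heights, h = [], 2
    while h < n:
        heights.append(h)
        h = h * 3 // 2          # 3/2 times larger, rounded down to an integer
    return heights


def dadda_stages(n: int) -> int:
    """Number of (3, 2)-counter stages, assuming one stage per size below n."""
    return len(dadda_heights(n))


def log_estimate(n: int) -> float:
    """Eq. 2.64: log base 1.5 of n."""
    return math.log10(n) / math.log10(1.5)


if __name__ == "__main__":
    for n in (6, 8, 16, 32):
        print(n, dadda_stages(n), round(log_estimate(n), 2))
```

The counted value and the estimate grow at the same rate but are not identical for small n, so Eq. 2.64 is best read as a scaling rule rather than an exact stage count.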
FIGURE 2.31 Ferrari–Stefanelli multiplier. (a) A conventional 4-bit array multiplier using AND gates to calculate the summands with (2, 2) and (3, 2) counters to sum the partial products. (b) A 4-bit Ferrari–Stefanelli multiplier using 2-bit submultipliers to construct the partial product array. (c) A circuit implementation for an inverting 2-bit submultiplier.

There are several issues in deciding between parallel multiplier architectures:

1. Since it is easier to fold triangles rather than trapezoids into squares, a Wallace-tree multiplier is more suited to full-custom layout, but is slightly larger, than a Dadda multiplier; both are less regular than an array multiplier. For cell-based ASICs, a Dadda multiplier is smaller than a Wallace-tree multiplier.
2. The overall multiplier speed does depend on the size and architecture of the final CPA, but this may be optimized independently of the CSA array. This means a Dadda multiplier is always at least as fast as the Wallace-tree version.
3. The low-order bits of any parallel multiplier settle first and can be added in the CPA before the remaining bits settle. This allows multiplication and the final addition to be overlapped in time.
4. Any of the parallel multiplier architectures may be pipelined. We may also use a variably pipelined approach that tailors the register locations to the size of the multiplier.
5. Using (4, 2), (5, 3), (7, 3), or (15, 4) counters increases the stage compression and permits the size of the stages to be tuned. Some ASIC cell libraries contain a (7, 3) counter (a 2-bit full adder). A (15, 4) counter is a 3-bit full adder. There is a trade-off in using these counters between the speed and size of the logic cells and the delay as well as area of the interconnect.
6. Power dissipation is reduced by the tree-based structures. The simplified carry-save logic produces fewer signal transitions and the tree structures produce fewer glitches than a chain.
7. None of the multiplier structures we have discussed take into account the possibility of staggered arrival times for different bits of the multiplicand or the multiplier. Optimization then requires a logic-synthesis tool.
2.6.5 Other Arithmetic Systems

There are other schemes for addition and multiplication that are useful in special circumstances. Addition of numbers using redundant binary encoding avoids carry propagation and is thus potentially very fast. Table 2.13 shows the rules for addition using an intermediate carry and sum that are added without the need for carry. For example (an overbar, as in 1̄, marks a digit of −1):

                                          binary       decimal   redundant binary      CSD vector
addend                                    1010111      87        1 0 1̄ 0 1̄ 0 0 1̄       1 0 1̄ 0 1̄ 0 0 1̄
augend                                  + 1100101    + 101     + 1 1̄ 1 0 0 1 1 1̄     + 0 1 1 0 0 1 0 1
intermediate sum                                                 0 1 0 0 1̄ 1̄ 1 0       1̄ 1̄ 0 0 1̄ 1 0 0
intermediate carry (shifted left one place)                      1 1̄ 0 0 0 1 0 1̄       1 1 0 0 0 0 0 0
sum                                     = 10111100   = 188       1 1̄ 1 0 0 0 1̄ 0 0     1 0 1̄ 0 0 1̄ 1 0 0

TABLE 2.13 Redundant binary addition.

A[i] B[i]        A[i−1], B[i−1]                       Intermediate sum   Intermediate carry
1̄ 1̄              x x                                  0                  1̄
1̄ 0 or 0 1̄       A[i−1] = 0/1 and B[i−1] = 0/1        1̄                  0
1̄ 0 or 0 1̄       A[i−1] = 1̄ or B[i−1] = 1̄             1                  1̄
1̄ 1              x x                                  0                  0
0 0              x x                                  0                  0
1 1̄              x x                                  0                  0
0 1 or 1 0       A[i−1] = 0/1 and B[i−1] = 0/1        1̄                  1
0 1 or 1 0       A[i−1] = 1̄ or B[i−1] = 1̄             1                  0
1 1              x x                                  0                  1
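The rules of Table 2.13 can be exercised with a small sketch. It assumes signed digits are held as Python integers −1, 0, and 1 in MSB-first lists of equal length; the names `rb_add` and `to_int` are hypothetical.

```python
def rb_add(a: list[int], b: list[int]) -> list[int]:
    """Add two equal-length signed-digit vectors (MSB first, digits in {-1, 0, 1})."""
    a, b = a[::-1], b[::-1]                    # work LSB first
    s, c = [], []
    for i in range(len(a)):
        t = a[i] + b[i]
        # True if either next-lower digit is -1 (a carry of -1 may arrive).
        neg_below = i > 0 and (a[i - 1] == -1 or b[i - 1] == -1)
        if t in (2, -2):
            si, ci = 0, t // 2
        elif t == 1:
            si, ci = (1, 0) if neg_below else (-1, 1)
        elif t == -1:
            si, ci = (1, -1) if neg_below else (-1, 0)
        else:
            si, ci = 0, 0
        s.append(si)
        c.append(ci)
    # Each final digit is an intermediate sum plus the carry from one place below.
    total = [s[0]] + [s[i] + c[i - 1] for i in range(1, len(a))] + [c[-1]]
    return total[::-1]


def to_int(digits: list[int]) -> int:
    """Decimal value of an MSB-first signed-digit vector."""
    return sum(d * 2**k for k, d in enumerate(reversed(digits)))


if __name__ == "__main__":
    addend = [1, 0, -1, 0, -1, 0, 0, -1]       # 87
    augend = [1, -1, 1, 0, 0, 1, 1, -1]        # 101
    print(rb_add(addend, augend), to_int(rb_add(addend, augend)))
```

Because the choice between the two encodings of +1 and −1 looks only one digit to the right, every output digit depends on a fixed number of input digits, which is why no carry ripples along the word.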
The redundant binary representation is not unique. We can represent 101 (decimal), for example, by 1100101 (binary and CSD vector) or 1 1̄ 1 0 0 1 1 1̄. As another example, 188 (decimal) can be represented by 10111100 (binary), 1 1̄ 1 0 0 0 1̄ 0 0, 1 0 1̄ 0 0 1̄ 1 0 0, or 1 0 1̄ 0 0 0 1̄ 0 0 (CSD vector). Redundant binary addition of binary, redundant binary, or CSD vectors does not result in a unique sum, and addition of two CSD vectors does not result in a CSD vector. Each n-bit redundant binary number requires a rather wasteful 2n-bit binary number for storage. Thus 1 0 1̄ is represented as 010010, for example (using sign magnitude). The other disadvantage of redundant binary arithmetic is the need to convert to and from binary representation.

Table 2.14 shows the (5, 3) residue number system. As an example, 11 (decimal) is represented as [1, 2] residue (5, 3) since $11R_5$ = 11 mod 5 = 1 and $11R_3$ = 11 mod 3 = 2. The size of this system is thus 3 × 5 = 15. We add, subtract, or multiply residue numbers using the modulus of each bit position, without any carry. Thus:
    4       [4, 1]         12       [2, 0]          3       [3, 0]
  + 7     + [2, 1]        − 4     − [4, 1]        × 4     × [4, 1]
 = 11     = [1, 2]        = 8     = [3, 2]       = 12     = [2, 0]

TABLE 2.14 The (5, 3) residue number system.

n     residue 5   residue 3
0     0           0
1     1           1
2     2           2
3     3           0
4     4           1
5     0           2
6     1           0
7     2           1
8     3           2
9     4           0
10    0           1
11    1           2
12    2           0
13    3           1
14    4           2
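The three worked examples above can be checked with a minimal sketch of (5, 3) residue arithmetic. The helper names are hypothetical, and decoding by exhaustive search is a simplification that is only reasonable for a system this small.

```python
import operator

MODULI = (5, 3)                 # the (5, 3) system of Table 2.14


def to_residue(n: int) -> tuple[int, ...]:
    """Residues of n with respect to each modulus."""
    return tuple(n % m for m in MODULI)


def rns_op(x, y, op) -> tuple[int, ...]:
    """Apply op position by position, reducing each result by its own modulus."""
    return tuple(op(a, b) % m for a, b, m in zip(x, y, MODULI))


def from_residue(r) -> int:
    """Decode by searching the 15 values the system can represent."""
    return next(n for n in range(5 * 3) if to_residue(n) == tuple(r))


if __name__ == "__main__":
    print(rns_op(to_residue(4), to_residue(7), operator.add))    # 4 + 7
    print(rns_op(to_residue(12), to_residue(4), operator.sub))   # 12 - 4
    print(rns_op(to_residue(3), to_residue(4), operator.mul))    # 3 * 4
```

No position ever passes information to another, which is the sense in which the arithmetic is carry free.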
The choice of moduli determines the system size and the computing complexity. The most useful choices are relative primes (such as 3 and 5). With p prime, numbers of the form $2^p$ and $2^p - 1$ are particularly useful ($2^p - 1$ are Mersenne numbers) [Waser and Flynn, 1982].
2.6.6 Other Datapath Operators

Figure 2.32 shows symbols for some other datapath elements. The combinational datapath cells, NAND, NOR, and so on, and sequential datapath cells (flip-flops and latches) have standard-cell equivalents and function identically. I use a bold outline (1 point) for datapath cells instead of the regular (0.5 point) line I use for scalar symbols. We call a set of identical cells a vector of datapath elements in the same way that a bold symbol, A , represents a vector and A represents a scalar.
FIGURE 2.32 Symbols for datapath elements. (a) An array or vector of flip-flops (a register). (b) A two-input NAND cell with databus inputs. (c) A two-input NAND cell with a control input. (d) A buswide MUX. (e) An incrementer/decrementer. (f) An all-zeros detector. (g) An all-ones detector. (h) An adder/subtracter.

A subtracter is similar to an adder, except in a full subtracter we have a borrow-in signal, BIN; a borrow-out signal, BOUT; and a difference signal, DIFF:
DIFF = A ⊕ NOT(B) ⊕ NOT(BIN) ≡ SUM(A, NOT(B), NOT(BIN)) (2.65)

NOT(BOUT) = A · NOT(B) + A · NOT(BIN) + NOT(B) · NOT(BIN) ≡ MAJ(A, NOT(B), NOT(BIN)) (2.66)

These equations are the same as those for the FA (Eqs. 2.38 and 2.39) except that the B input is inverted and the sense of the carry chain is inverted. To build a subtracter that calculates (A − B) we invert the entire B input bus and connect the BIN[0] input to VDD (not to VSS as we did for CIN[0] in an adder). As an example, to subtract B = '0011' from A = '1001' we calculate '1001' + '1100' + '1' = '0110'. As with an adder, the true overflow is XOR(BOUT[MSB], BOUT[MSB − 1]). We can build a ripple-borrow subtracter (a type of borrow-propagate subtracter), a borrow-save subtracter, and a borrow-select subtracter in the same way we built these adder architectures.

An adder/subtracter has a control signal that gates the A input with an exclusive-OR cell (forming a programmable inversion) to switch between an adder or subtracter. Some adder/subtracters gate both inputs to allow us to compute (−A − B). We must be careful to connect the input to the LSB of the carry chain (CIN[0] or BIN[0]) when changing between addition (connect to VSS) and subtraction (connect to VDD).

A barrel shifter rotates or shifts an input bus by a specified amount. For example, if we have an eight-input barrel shifter with input '1111 0000' and we specify a shift of '0001 0000' (3, coded by bit position) the right-shifted 8-bit output is '0001 1110'. A barrel shifter may rotate left or right (or switch between the two under a separate control). A barrel shifter may also have an output width that is smaller than the input. To use a simple example, we may have an 8-bit input and a 4-bit output. This situation is equivalent to having a barrel shifter with two 4-bit inputs and a 4-bit output. Barrel shifters are used extensively in floating-point arithmetic to align (we call this normalize and denormalize) floating-point numbers (with sign, exponent, and mantissa).
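The subtracter recipe (invert the B bus, tie the LSB of the chain high, and reuse the full-adder equations) can be sketched bit by bit. The function names are hypothetical, and the variable `c` stands for the signal that ripples along the chain, which is an assumption about naming rather than the cell library's pin name.

```python
def subtract(a: int, b: int, width: int = 4) -> int:
    """Compute (a - b) modulo 2**width with a ripple chain of full-adder cells."""
    c = 1                                        # LSB of the chain tied high (VDD)
    result = 0
    for i in range(width):
        ai = (a >> i) & 1
        nbi = 1 - ((b >> i) & 1)                 # inverted B bus
        result |= (ai ^ nbi ^ c) << i            # SUM(A, NOT(B), chain signal)
        c = (ai & nbi) | (ai & c) | (nbi & c)    # MAJ(A, NOT(B), chain signal)
    return result


def shift_right(a: int, amount: int, width: int = 8) -> int:
    """Logical right shift, as in the barrel-shifter example."""
    return (a & ((1 << width) - 1)) >> amount


if __name__ == "__main__":
    print(format(subtract(0b1001, 0b0011), "04b"))        # '1001' - '0011'
    print(format(shift_right(0b11110000, 3), "08b"))      # '1111 0000' shifted by 3
```

Replacing the initial `c = 1` with `c = 0` and dropping the inversion of B turns the same loop back into a ripple-carry adder, which is the idea behind the adder/subtracter cell.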
A leading-one detector is used with a normalizing (left-shift) barrel shifter to align mantissas in floating-point numbers. The input is an n-bit bus A, the output is an n-bit bus, S, with a single '1' in the bit position corresponding to the most significant '1' in the input. Thus, for example, if the input is A = '0000 0101' the leading-one detector output is S = '0000 0100', indicating the leading one in A is in bit position 2 (bit 7 is the MSB, bit zero is the LSB). If we feed the output, S, of the leading-one detector to the shift select input of a normalizing (left-shift) barrel shifter, the shifter will normalize the input A. In our example, with an input of A = '0000 0101', and a left-shift of S = '0000 0100', the barrel shifter will shift A left by five bits and the output of the shifter is Z = '1010 0000'. Now that Z is aligned (with the MSB equal to '1') we can multiply Z with another normalized number.

The output of a priority encoder is the binary-encoded position of the leading one in an input. For example, with an input A = '0000 0101' the leading 1 is in bit position 3 (MSB is bit position 7) so the output of a 4-bit priority encoder would be Z = '0011' (3). In some cell libraries the encoding is reversed so that the MSB has an output code of zero, in this case Z = '0101' (5). This second, reversed, encoding scheme is useful in floating-point arithmetic. If A is a mantissa and we normalize A to '1010 0000' we have to subtract 5 from the exponent; this exponent correction is equal to the output of the priority encoder.

An accumulator is an adder/subtracter and a register. Sometimes these are combined with a multiplier to form a multiplier–accumulator (MAC). An incrementer adds 1 to the input bus, Z = A + 1, so we can use this function, together with a register, to negate a two's complement number, for example. The implementation is Z[i] = XOR(A[i], CIN[i]), and COUT[i] = AND(A[i], CIN[i]). The carry-in control input, CIN[0], thus acts as an enable: If it is set to '0' the output is the same as the input.

The implementation of arithmetic cells is often a little more complicated than we have explained. CMOS logic is naturally inverting, so that it is faster to implement an incrementer as Z[i(even)] = XOR(A[i], CIN[i]) and COUT[i(even)] = NAND(A[i], CIN[i]). This inverts COUT, so that in the following stage we must invert it again. If we push an inverting bubble to the input CIN we find that: Z[i(odd)] = XNOR(A[i], CIN[i]) and COUT[i(odd)] = NOR(NOT(A[i]), CIN[i]). In many datapath implementations all odd-bit cells operate on inverted carry signals, and thus the odd-bit and even-bit datapath elements are different. In fact, all the adder and subtracter datapath elements we have described may use this technique. Normally this is completely hidden from the designer in the datapath assembly and any output control signals are inverted, if necessary, by inserting buffers.

A decrementer subtracts 1 from the input bus; the logical implementation is Z[i] = XOR(A[i], CIN[i]) and COUT[i] = AND(NOT(A[i]), CIN[i]). The implementation may invert the odd carry signals, with CIN[0] again acting as an enable.

An incrementer/decrementer has a second control input that gates the input, inverting the input to the carry chain. This has the effect of selecting either the increment or decrement function.

Using the all-zeros detectors and all-ones detectors, remember that, for a 4-bit number, for example, zero in ones' complement arithmetic is '1111' or '0000', and that zero in signed magnitude arithmetic is '1000' or '0000'.

A register file (or scratchpad memory) is a bank of flip-flops arranged across the bus; sometimes these have the option of multiple ports (multiport register files) for read and write. Normally these register files are the densest logic and hardest to fit in a datapath. For large register files it may be more appropriate to use a multiport memory.
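Several of the operators described above are easy to model at the bit level. The sketch below covers the leading-one detector feeding a normalizing shift, and the ripple incrementer and decrementer equations; all function names are hypothetical and the 8-bit default width is an assumption chosen to match the examples.

```python
def leading_one(a: int, width: int = 8) -> int:
    """One-hot word marking the most significant '1' of a (0 if a is zero)."""
    for i in reversed(range(width)):
        if (a >> i) & 1:
            return 1 << i
    return 0


def normalize(a: int, width: int = 8) -> tuple[int, int]:
    """Left-shift a until its MSB is '1'; return (shifted value, shift amount)."""
    s = leading_one(a, width)
    if s == 0:
        return 0, 0
    shift = width - s.bit_length()
    return (a << shift) & ((1 << width) - 1), shift


def increment(a: int, cin0: int = 1, width: int = 8) -> int:
    """Z[i] = XOR(A[i], CIN[i]); COUT[i] = AND(A[i], CIN[i])."""
    z, c = 0, cin0
    for i in range(width):
        ai = (a >> i) & 1
        z |= (ai ^ c) << i
        c = ai & c
    return z


def decrement(a: int, cin0: int = 1, width: int = 8) -> int:
    """Z[i] = XOR(A[i], CIN[i]); COUT[i] = AND(NOT(A[i]), CIN[i])."""
    z, c = 0, cin0
    for i in range(width):
        ai = (a >> i) & 1
        z |= (ai ^ c) << i
        c = (1 - ai) & c
    return z


if __name__ == "__main__":
    print(format(leading_one(0b00000101), "08b"))
    print(normalize(0b00000101))
```

Passing `cin0=0` to either function returns the input unchanged, which is the enable behavior of CIN[0].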
We can add control logic to a register file to create a first-in first-out register (FIFO), or last-in first-out register (LIFO).

In Section 2.5 we saw that the standard-cell version and gate-array macro version of the sequential cells (latches and flip-flops) each contain their own clock buffers. The reason for this is that (without intelligent placement software) we do not know where a standard cell or a gate-array macro will be placed on a chip. We also have no idea of the condition of the clock signal coming into a sequential cell. The ability to place the clock buffers outside the sequential cells in a datapath gives us more flexibility and saves space. For example, we can place the clock buffers for all the clocked elements at the top of the datapath (together with the buffers for the control signals) and river route (in river routing the interconnect lines all flow in the same direction on the same layer) the connections to the clock lines. This saves space and allows us to guarantee the clock skew and timing. It may mean, however, that there is a fixed overhead associated with a datapath. For example, it might make no sense to build a 4-bit datapath if the clock and control buffers take up twice the space of the datapath logic.

Some tools allow us to design logic using a portable netlist. After we complete the design we can decide whether to implement the portable netlist in a datapath, standard cells, or even a gate array, based on area, speed, or power considerations.
2.7 I/O Cells

Figure 2.33 shows a three-state bidirectional output buffer (Tri-State is a registered trademark of National Semiconductor). When the output enable (OE) signal is high, the circuit functions as a noninverting buffer driving the value of DATAin onto the I/O pad. When OE is low, the output transistors or drivers, M1 and M2, are disconnected. This allows multiple drivers to be connected on a bus. It is up to the designer to make sure that a bus never has two drivers, a problem known as contention. In order to prevent the problem opposite to contention (a bus floating to an intermediate voltage when there are no bus drivers) we can use a bus keeper or bus-hold cell (TI calls this Bus-Friendly logic). A bus keeper normally acts like two weak (low drive-strength) cross-coupled inverters that act as a latch to retain the last logic state on the bus, but the latch is weak enough that it may be driven easily to the opposite state. Even though bus keepers act like latches, and will simulate like latches, they should not be used as latches, since their drive strength is weak.

Transistors M1 and M2 in Figure 2.33 have to drive large off-chip loads. If we wish to change the voltage on a C = 200 pF load by 5 V in 5 ns (a slew rate of 1 V ns$^{-1}$) we will require a current in the output transistors of $I_{DS} = C\,(dV/dt) = (200 \times 10^{-12})(5/(5 \times 10^{-9})) = 0.2$ A or 200 mA.
Such large currents flowing in the output transistors must also flow in the power supply bus and can cause problems. There is always some inductance in series with the power supply, between the point at which the supply enters the ASIC package and reaches the power bus on the chip. The inductance is due to the bond wire, lead frame, and package pin. If we have a power-supply inductance of 2 nH and a current changing from zero to 1 A (32 I/O cells on a bus switching at 30 mA each) in 5 ns, we will have a voltage spike on the power supply (called power-supply bounce) of $L\,(dI/dt) = (2 \times 10^{-9})(1/(5 \times 10^{-9})) = 0.4$ V.

We do several things to alleviate this problem: We can limit the number of simultaneously switching outputs (SSOs), we can limit the number of I/O drivers that can be attached to any one VDD and GND pad, and we can design the output buffer to limit the slew rate of the output (we call these slew-rate limited I/O pads). Quiet-I/O cells also use two separate power supplies and two sets of I/O drivers: an AC supply (clean or quiet supply) with small AC drivers for the I/O circuits that start and stop the output slewing at the beginning and end of an output transition, and a DC supply (noisy or dirty supply) for the transistors that handle large currents as they slew the output.

The three-state buffer allows us to employ the same pad for input and output (bidirectional I/O). When we want to use the pad as an input, we set OE low and take the data from DATAin. Of course, it is not necessary to have all these features on every pad: We can build output-only or input-only pads.
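The two estimates used in this section, the output-driver current C(dV/dt) and the power-supply bounce L(dI/dt), can be reproduced with a short sketch. The function names are hypothetical, and both formulas assume the voltage or current changes at a constant rate over the interval.

```python
def slew_current(c_load: float, delta_v: float, delta_t: float) -> float:
    """Current (A) needed to change the voltage on c_load (F) by delta_v (V) in delta_t (s)."""
    return c_load * delta_v / delta_t


def supply_bounce(l_supply: float, delta_i: float, delta_t: float) -> float:
    """Voltage spike (V) across l_supply (H) when the current changes by delta_i (A) in delta_t (s)."""
    return l_supply * delta_i / delta_t


if __name__ == "__main__":
    print(slew_current(200e-12, 5.0, 5e-9))       # 200 pF, 5 V in 5 ns
    print(32 * 30e-3)                             # 32 I/O cells at 30 mA each
    print(supply_bounce(2e-9, 1.0, 5e-9))         # 2 nH, 1 A in 5 ns
```

The same two functions show why slew-rate limiting helps: doubling `delta_t` halves both the driver current and the bounce.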
FIGURE 2.33 A three-state bidirectional output buffer. When the output enable, OE, is '1' the output section is enabled and drives the I/O pad. When OE is '0' the output buffer is placed in a high-impedance state.
We can also use many of these output cell features for input cells that have to drive large on-chip loads (a clock pad cell, for example). Some gate arrays simply turn an output buffer around to drive a grid of interconnect that supplies a clock signal internally. With a typical interconnect capacitance of 0.2 pF cm$^{-1}$, a grid of 100 cm (consisting of 10 by 10 lines running all the way across a 1 cm chip) presents a load of 20 pF to the clock buffer.
Some libraries include I/O cells that have passive pull-ups or pull-downs (resistors) instead of the transistors, M1 and M2 (the resistors are normally still constructed from transistors with long gate lengths). We can also omit one of the driver transistors, M1 or M2, to form open-drain outputs that require an external pull-up or pull-down. We can design the output driver to produce TTL output levels rather than CMOS logic levels. We may also add input hysteresis (using a Schmitt trigger) to the input buffer, I1 in Figure 2.33, to accept input data signals that contain glitches (from bouncing switch contacts, for example) or that are slow rising. The input buffer can also include a level shifter to accept TTL input levels and shift the input signal to CMOS levels.

The gate oxide in CMOS transistors is extremely thin (100 Å or less). This leaves the gate oxide of the I/O cell input transistors susceptible to breakdown from static electricity (electrostatic discharge, or ESD). ESD arises when we or machines handle the package leads (like the shock I sometimes get when I touch a doorknob after walking across the carpet at work). Sometimes this problem is called electrical overstress (EOS) since most ESD-related failures are caused not by gate-oxide breakdown, but by the thermal stress (melting) that occurs when the n-channel transistor in an output driver overheats (melts) due to the large current that can flow in the drain diffusion connected to a pad during an ESD event.

To protect the I/O cells from ESD, the input pads are normally tied to device structures that clamp the input voltage to below the gate breakdown voltage (which can be as low as 10 V with a 100 Å gate oxide). Some I/O cells use transistors with a special ESD implant that increases breakdown voltage and provides protection.
I/O driver transistors can also use elongated drain structures (ladder structures) and large drain-to-gate spacing to help limit current, but in a salicide process that lowers the drain resistance this is difficult. One solution is to mask the I/O cells during the salicide step. Another solution is to use pnpn and npnp diffusion structures called silicon-controlled rectifiers (SCRs) to clamp voltages and divert current to protect the I/O circuits from ESD.

There are several ways to model the capability of an I/O cell to withstand EOS. The human-body model (HBM) represents ESD by a 100 pF capacitor discharging through a 1.5 kΩ resistor (this is an International Electrotechnical Commission, IEC, specification). Typical voltages generated by the human body are in the range of 2–4 kV, and we often see an I/O pad cell rated by the voltage it can withstand using the HBM. The machine model (MM) represents an ESD event generated by automated machine handlers. Typical MM parameters use a 200 pF capacitor (typically charged to 200 V) discharged through a 25 Ω resistor, corresponding to a peak initial current of nearly 10 A. The charge-device model (CDM, also called device charge–discharge) represents the problem when an IC package is charged, in a shipping tube for example, and then grounded. If the maximum charge on a package is 3 nC (a typical measured figure) and the package capacitance to ground is 1.5 pF, we can simulate this event by charging a 1.5 pF capacitor to 2 kV and discharging it through a 1 Ω resistor.

If the diffusion structures in the I/O cells are not designed with care, it is possible to construct an SCR structure unwittingly, and instead of protecting the transistors the SCR can enter a mode where it is latched on and conducting large enough currents to destroy the chip. This failure mode is called latch-up. Latch-up can occur if the pn-diodes on a chip become forward-biased and inject minority carriers (electrons in p-type material, holes in n-type material) into the substrate. The source–substrate and drain–substrate diodes can become forward-biased due to power-supply bounce or output undershoot (the cell outputs fall below $V_{SS}$) or overshoot (outputs rise to greater than $V_{DD}$), for example. These injected minority carriers can travel fairly large distances and interact with nearby transistors causing latch-up. I/O cells normally surround the I/O transistors with guard rings (a continuous ring of n-diffusion in an n-well connected to VDD, and a ring of p-diffusion in a p-well connected to VSS) to collect these minority carriers. This is a problem that can also occur in the logic core and this is one reason that we normally include substrate and well connections to the power supplies in every cell.
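The numbers quoted for the three ESD models follow from two elementary relations: the peak initial current of a capacitor discharging through a resistor is V/R, and the voltage on a charged capacitor is Q/C. The sketch below is a back-of-the-envelope check only; the function names are hypothetical and the calculation ignores series inductance and the resistance of the device under test.

```python
def peak_current(v_charge: float, r_series: float) -> float:
    """Initial discharge current (A) of a capacitor at v_charge (V) through r_series (ohms)."""
    return v_charge / r_series


def charged_voltage(q: float, c: float) -> float:
    """Voltage (V) on a capacitance c (F) holding charge q (C)."""
    return q / c


if __name__ == "__main__":
    # Human-body model: 2 kV and 4 kV through 1.5 kilohms.
    print([round(peak_current(v, 1.5e3), 2) for v in (2e3, 4e3)])
    # Machine model: 200 V through 25 ohms.
    print(peak_current(200.0, 25.0))
    # Charge-device model: 3 nC on 1.5 pF.
    print(round(charged_voltage(3e-9, 1.5e-12)))
```

With these idealized values the machine-model current evaluates to 8 A, which is the order of magnitude the text describes as nearly 10 A.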
2.8 Cell Compilers

The process of hand crafting circuits and layout for a full-custom IC is a tedious, time-consuming, and error-prone task. There are two types of automated layout assembly tools, often known as silicon compilers. The first type produces a specific kind of circuit, a RAM compiler or multiplier compiler, for example. The second type of compiler is more flexible, usually providing a programming language that assembles or tiles layout from an input command file, but this is full-custom IC design.

We can build a register file from latches or flip-flops, but, at 4.5–6.5 gates (18–26 transistors) per bit, this is an expensive way to build memory. Dynamic RAM (DRAM) can use a cell with only one transistor, storing charge on a capacitor that has to be periodically refreshed as the charge leaks away. ASIC RAM is invariably static (SRAM), so we do not need to refresh the bits. When we refer to RAM in an ASIC environment we almost always mean SRAM. Most ASIC RAMs use a six-transistor cell (four transistors to form two cross-coupled inverters that form the storage loop, and two more transistors to allow us to read from and write to the cell). RAM compilers are available that produce single-port RAM (a single shared bus for read and write) as well as dual-port RAMs and multiport RAMs. In a multiport RAM the compiler may or may not handle the problem of address contention (attempts to read and write to the same RAM address simultaneously). RAM can be asynchronous (the read and write cycles are triggered by control and/or address transitions asynchronous to a clock) or synchronous (using the system clock).

In addition to producing layout we also need a model compiler so that we can verify the circuit at the behavioral level, and we need a netlist from a netlist compiler so that we can simulate the circuit and verify that it works correctly at the structural level. Silicon compilers are thus complex pieces of software. We assume that a silicon compiler will produce working silicon even if every configuration has not been tested. This is still ASIC design, but now we are relying on the fact that the tool works correctly and therefore the compiled blocks are correct by construction.
2.9 Summary
The most important concepts that we covered in this chapter are the following:
The use of transistors as switches
The difference between a flip-flop and a latch
The meaning of setup time and hold time
Pipelines and latency
The difference between datapath, standard-cell, and gate-array logic cells
Strong and weak logic levels
Pushing bubbles
Ratio of logic
Resistance per square of layers and their relative values in CMOS
Design rules and λ
ASIC LIBRARY DESIGN

Once we have decided to use an ASIC design style (using predefined and precharacterized cells from a library) we need to design or buy a cell library. Even though it is not necessary, a knowledge of ASIC library design makes it easier to use library cells effectively.

3.1 Transistors as Resistors
3.2 Transistor Parasitic Capacitance
3.3 Logical Effort
3.4 Library-Cell Design
3.5 Library Architecture
3.6 Gate-Array Design
3.7 Standard-Cell Design
3.8 Datapath-Cell Design
3.9 Summary
3.10 Problems
3.11 Bibliography
3.12 References
3.1 Transistors as Resistors

In Section 2.1, CMOS Transistors, we modeled transistors using ideal switches. If this model were accurate, logic cells would have no delay.
FIGURE 3.1 A model for CMOS logic delay. (a) A CMOS inverter with a load capacitance, $C_{out}$. (b) Input, v(in1), and output, v(out1), waveforms showing the definition of the falling propagation delay, $t_{PDf}$. In this case delay is measured from the input trip point of 0.5. The output trip points are 0.35 (falling) and 0.65 (rising). The model predicts $t_{PDf} \approx R_{pd}(C_p + C_{out})$. (c) The model for the inverter includes: the input capacitance, C; the pull-up resistance ($R_{pu}$) and pull-down resistance ($R_{pd}$); and the parasitic output capacitance, $C_p$.

The ramp input, v(in1), to the inverter in Figure 3.1(a) rises quickly from zero to $V_{DD}$. In response the output, v(out1), falls from $V_{DD}$ to zero. In Figure 3.1(b) we measure the propagation delay of the inverter, $t_{PD}$, using an input trip point of 0.5 and output trip points of 0.35 (falling, $t_{PDf}$) and 0.65 (rising, $t_{PDr}$). Initially the n-channel transistor, m1, is off. As the input rises, m1 turns on in the saturation region ($V_{DS} > V_{GS} - V_{tn}$) before entering the linear region ($V_{DS} < V_{GS} - V_{tn}$). We model transistor m1 with a resistor, $R_{pd}$ (Figure 3.1c); this is the pull-down resistance. The equivalent resistance of m2 is the pull-up resistance, $R_{pu}$.

Delay is created by the pull-up and pull-down resistances, $R_{pd}$ and $R_{pu}$, together with the parasitic capacitance at the output of the cell, $C_p$ (the intrinsic output capacitance) and the load capacitance (or extrinsic output capacitance), $C_{out}$ (Figure 3.1c). If we assume a constant value for $R_{pd}$, the output reaches a lower trip point of 0.35 when (Figure 3.1b),

$$0.35\,V_{DD} = V_{DD} \exp\left(\frac{-t_{PDf}}{R_{pd}(C_{out} + C_p)}\right) . \tag{3.1}$$

An output trip point of 0.35 is convenient because ln(1/0.35) = 1.04 ≈ 1 and thus

$$t_{PDf} = R_{pd}(C_{out} + C_p)\ln(1/0.35) \approx R_{pd}(C_{out} + C_p) . \tag{3.2}$$

The expression for the rising delay (with a 0.65 output trip point) is identical in form. Delay thus increases linearly with the load capacitance. We often measure load capacitance in terms of a standard load: the input capacitance presented by a particular cell (often an inverter or two-input NAND cell).

We may adjust the delay for different trip points. For example, for output trip points of 0.1/0.9 we multiply Eq. 3.2 by −ln(0.1) = 2.3, because exp(−2.3) = 0.100.

Figure 3.2 shows the DC characteristics of a CMOS inverter. To form Figure 3.2(b) we take the n-channel transistor surface (Figure 2.4b) and add that for a p-channel transistor (rotated to account for the connections). Seen from above, the intersection of the two surfaces is the static transfer curve of Figure 3.2(a); along this path the transistor currents are equal and there is no output current to change the output voltage. Seen from one side, the intersection is the curve of Figure 3.2(c).
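The RC delay model of Eqs. 3.1 and 3.2 can be evaluated for any output trip point with a few lines of Python. This is a sketch under the model's own assumption of a constant pull resistance; the function names and the example values (1000 Ω, 0.1 pF) are illustrative and do not come from the text. Resistance in ohms multiplied by capacitance in picofarads gives time in picoseconds.

```python
import math


def trip_factor(trip: float) -> float:
    """Multiplier ln(1/trip) for an output trip point given as a fraction of V_DD."""
    return math.log(1.0 / trip)


def falling_delay(r_pd: float, c_out: float, c_p: float, trip: float = 0.35) -> float:
    """Time (ps) for the output to fall to trip * V_DD; r_pd in ohms, capacitances in pF."""
    return r_pd * (c_out + c_p) * trip_factor(trip)


if __name__ == "__main__":
    print(round(trip_factor(0.35), 3))     # close to 1, hence the choice of 0.35
    print(round(trip_factor(0.1), 3))      # multiplier for a 0.1 trip point
    print(falling_delay(1000.0, 0.08, 0.02))          # illustrative values only
    print(falling_delay(1000.0, 0.08, 0.02, 0.1))
```

Because the delay is a product of a resistance, a total capacitance, and a trip-point factor, doubling the load on a given cell changes only the capacitance term and the delay grows linearly with it.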
FIGURE 3.2 CMOS inverter characteristics. (a) This static inverter transfer curve is traced as the inverter switches slowly enough to be in equilibrium at all times ($I_{DSn} = -I_{DSp}$). (b) This surface corresponds to the current flowing in the n-channel transistor (falling delay) and p-channel transistor (rising delay) for any trajectory. (c) The current that flows through both transistors as the inverter switches along the equilibrium path.
The input waveform, v(in1), and the output load (which determines the transistor currents) dictate the path we take on the surface of Figure 3.2(b) as the inverter switches. We can thus see that the currents through the transistors (and thus the pull-up and pull-down resistance values) will vary in a nonlinear way during switching. Deriving theoretical values for the pull-up and pull-down resistance values is difficult; instead we work the problem backward by picking the trip points, simulating the propagation delays, and then calculating resistance values that fit the model.
FIGURE 3.3 Delay. (a) LogicWorks schematic for inverters driving 1, 2, 4, and 8 standard loads (1 standard load = 0.034 pF in this case). (b) Transient response (falling delay only) from PSpice. The postprocessor Probe was used to mark each waveform as it crosses its trip point (0.5 for the input, 0.35 for the outputs). For example v(out1_4) (4 standard loads) crosses 1.0467 V (≈ 0.35 $V_{DD}$) at t = 169.93 ps. (c) Falling and rising delays as a function of load. The slopes in ps pF$^{-1}$ correspond to the pull-up resistance (1281 Ω) and pull-down resistance (817 Ω). (d) Comparison of the delay model (valid for t > 20 ps) and simulation (4 standard loads). Both are equal at the 0.35 trip point.

Figure 3.3 shows a simulation experiment (using the G5 process SPICE parameters from Table 2.1). From the results in Figure 3.3(c) we can see that $R_{pd}$ = 817 Ω and $R_{pu}$ = 1281 Ω for this inverter (with shape factors of 6/0.6 for the n-channel transistor and 12/0.6 for the p-channel) using 0.5 (input) and 0.35/0.65 (output) trip points. Changing the trip points would give different resistance values.
We can check that 817 Ω is a reasonable value for the pull-down resistance. In the saturation region $I_{DS(sat)}$ is (to first order) independent of $V_{DS}$. For an n-channel transistor from our generic 0.5 µm process (G5 from Section 2.1) with shape factor W/L = 6/0.6, $I_{DSn(sat)} = 2.5$ mA (at $V_{GS} = 3$ V and $V_{DS} = 3$ V). The pull-down resistance, $R_1$, that would give the same drain-source current is

$$R_1 = 3.0\ \mathrm{V} / (2.5 \times 10^{-3}\ \mathrm{A}) = 1200\ \Omega . \tag{3.3}$$

This value is greater than, but not too different from, our measured pull-down resistance of 817 Ω. We might expect this result since Figure 3.2(b) shows that the pull-down resistance reaches its maximum value at $V_{GS} = 3$ V, $V_{DS} = 3$ V. We could adjust the ratio of the logic so that the rising and falling delays were equal; then $R = R_{pd} = R_{pu}$ is the pull resistance.
Next, we check our model against the simulation results. The model predicts

$$v(\mathrm{out1}) \approx V_{DD} \exp\left( \frac{-t'}{R_{pd}(C_{out} + C_p)} \right) \quad \text{for } t' > 0 \tag{3.4}$$

($t'$ is measured from the point at which the input crosses the 0.5 trip point, $t' = 0$ at $t = 20$ ps). With $C_{out}$ = 4 standard loads = 4 × 0.034 pF = 0.136 pF,

$$R_{pd}(C_{out} + C_p) = (38 + 817(0.136))\ \mathrm{ps} = 149.112\ \mathrm{ps} . \tag{3.5}$$

To make a comparison with the simulation we need to use ln(1/0.35) = 1.04 and not approximately 1 as we have assumed, so that (with all times in ps)

$$v(\mathrm{out1}) \approx 3.0 \exp\left( \frac{-t'}{149.112/1.04} \right) \mathrm{V} = 3.0 \exp\left( \frac{-(t - 20)}{143.4} \right) \mathrm{V} \quad \text{for } t > 20\ \mathrm{ps} . \tag{3.6}$$
Equation 3.6 is plotted in Figure 3.3(d). For v(out1) = 1.05 V (equal to the 0.35 output trip point), Eq. 3.6 predicts $t = 20 + 149.112 \approx 169$ ps and agrees with Figure 3.3(b); it should, because we derived the model from these results!

Now we find $C_p$. From Figure 3.3(c) and Eq. 3.2,

$$t_{PDr} = (52 + 1281\,C_{out})\ \mathrm{ps} \quad \text{thus} \quad C_{pr} = 52/1281 = 0.041\ \mathrm{pF\ (rising)} ,$$
$$t_{PDf} = (38 + 817\,C_{out})\ \mathrm{ps} \quad \text{thus} \quad C_{pf} = 38/817 = 0.047\ \mathrm{pF\ (falling)} . \tag{3.7}$$
These intrinsic parasitic capacitance values depend on the choice of output trip points, even though $C_{pf} R_{pdf}$ and $C_{pr} R_{pdr}$ are constant for a given input trip point and waveform, because the pull-up and pull-down resistances depend on the choice of output trip points. We take a closer look at parasitic capacitance next.
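As a rough numerical check of the delay model above, the following Python sketch recomputes the time constant of Eq. 3.5, the time at which the modeled output reaches the 0.35 trip point, and the intrinsic capacitances of Eq. 3.7. The function and variable names are illustrative assumptions; the numbers are the ones quoted in the text.

```python
import math

STANDARD_LOAD_PF = 0.034  # one standard load in pF (Figure 3.3)

def time_constant_ps(r_ohm, intercept_ps, c_out_pf):
    """R*(C_out + C_p) in ps, using the fitted intercept as R*C_p.

    ohm * pF = ps, so no unit conversion is needed.
    """
    return intercept_ps + r_ohm * c_out_pf

def intrinsic_cap_pf(intercept_ps, r_ohm):
    """C_p = (R*C_p) / R, in pF."""
    return intercept_ps / r_ohm

# Falling delay with 4 standard loads (Eq. 3.5)
c_out = 4 * STANDARD_LOAD_PF
tau_f = time_constant_ps(817, 38, c_out)

# Decay constant used in Eq. 3.6 and the resulting trip-point crossing time
decay_ps = tau_f / 1.04
t_cross = 20 + decay_ps * math.log(1 / 0.35)

print(f"R_pd(C_out + C_p) = {tau_f:.3f} ps")
print(f"trip-point crossing at about {t_cross:.0f} ps")
print(f"C_pf = {intrinsic_cap_pf(38, 817):.3f} pF")
print(f"C_pr = {intrinsic_cap_pf(52, 1281):.3f} pF")
```

The crossing time lands within a picosecond or two of the simulated 169.93 ps; the small difference comes from rounding ln(1/0.35) to 1.04.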
3.2 Transistor Parasitic Capacitance

Logic-cell delay results from transistor resistance, transistor (intrinsic) parasitic capacitance, and load (extrinsic) capacitance. When one logic cell drives another, the parasitic input capacitance of the driven cell becomes the load capacitance of the driving cell and this will determine the delay of the driving cell. Figure 3.4 shows the components of transistor parasitic capacitance. SPICE prints all of the MOS parameter values for each transistor at the DC operating point. The following values were printed by PSpice (v5.4) for the simulation of Figure 3.3 :
NAME     m1          m2
MODEL    CMOSN       CMOSP
ID       7.49E-11    -7.49E-11
VGS      0.00E+00    -3.00E+00
VDS      3.00E+00    -4.40E-08
VBS      0.00E+00    0.00E+00
VTH      4.14E-01    -8.96E-01
VDSAT    3.51E-02    -1.78E+00
GM       1.75E-09    2.52E-11
GDS      1.24E-10    1.72E-03
GMB      6.02E-10    7.02E-12
CBD      2.06E-15    1.71E-14
CBS      4.45E-15    1.71E-14
CGSOV    1.80E-15    2.88E-15
CGDOV    1.80E-15    2.88E-15
CGBOV    2.00E-16    2.01E-16
CGS      0.00E+00    1.10E-14
CGD      0.00E+00    1.10E-14
CGB      3.88E-15    0.00E+00

The parameters ID ($I_{DS}$), VGS, VDS, VBS, VTH ($V_t$), and VDSAT ($V_{DS(sat)}$) are DC parameters. The parameters GM, GDS, and GMB are small-signal conductances (corresponding to $\partial I_{DS} / \partial V_{GS}$, $\partial I_{DS} / \partial V_{DS}$, and $\partial I_{DS} / \partial V_{BS}$, respectively). The remaining parameters are the parasitic capacitances. Table 3.1 shows the calculation of these capacitance values for the n-channel transistor m1 (with W = 6 µm and L = 0.6 µm) in Figure 3.3(a).

FIGURE 3.4 Transistor parasitic capacitance. (a) An n-channel MOS transistor with (drawn) gate length L and width W. (b) The gate capacitance is split into: the constant overlap capacitances $C_{GSOV}$, $C_{GDOV}$, and $C_{GBOV}$ and the variable capacitances $C_{GS}$, $C_{GB}$, and $C_{GD}$, which depend on the operating region. (c) A view showing how the different capacitances are approximated by planar components ($T_{FOX}$ is the field-oxide thickness). (d) $C_{BS}$ and $C_{BD}$ are the sum of the area ($C_{BSJ}$, $C_{BDJ}$), sidewall ($C_{BSSW}$, $C_{BDSW}$), and channel edge ($C_{BSJ,GATE}$, $C_{BDJ,GATE}$) capacitances. (e)-(f) The dimensions of the gate, overlap, and sidewall capacitances ($L_D$ is the lateral diffusion).

TABLE 3.1 Calculations of parasitic capacitances for an n-channel MOS transistor.

PSpice | Equation | Values for VGS = 0 V, VDS = 3 V, VSB = 0 V (from the Input listing below)
CBD | $C_{BD} = C_{BDJ} + C_{BDSW}$ | $C_{BD} = 1.855 \times 10^{-15} + 2.04 \times 10^{-16} = 2.06 \times 10^{-15}$ F
 | $C_{BDJ} = A_D C_J (1 + V_{DB}/\phi_B)^{-m_J}$ ($\phi_B$ = PB) | $C_{BDJ} = (4.032 \times 10^{-15})(1 + (3/1))^{-0.56} = 1.86 \times 10^{-15}$ F
 | $C_{BDSW} = P_D C_{JSW} (1 + V_{DB}/\phi_B)^{-m_{JSW}}$ ($P_D$ may or may not include channel edge) | $C_{BDSW} = (4.2 \times 10^{-16})(1 + (3/1))^{-0.52} = 2.04 \times 10^{-16}$ F
CBS | $C_{BS} = C_{BSJ} + C_{BSSW}$ | $C_{BS} = 4.032 \times 10^{-15} + 4.2 \times 10^{-16} = 4.45 \times 10^{-15}$ F
 | $C_{BSJ} = A_S C_J (1 + V_{SB}/\phi_B)^{-m_J}$ | $A_S C_J = (7.2 \times 10^{-12})(5.6 \times 10^{-4}) = 4.03 \times 10^{-15}$ F
 | $C_{BSSW} = P_S C_{JSW} (1 + V_{SB}/\phi_B)^{-m_{JSW}}$ | $P_S C_{JSW} = (8.4 \times 10^{-6})(5 \times 10^{-11}) = 4.2 \times 10^{-16}$ F
CGSOV | $C_{GSOV} = W_{EFF} C_{GSO}$; $W_{EFF} = W - 2W_D$ | $C_{GSOV} = (6 \times 10^{-6})(3 \times 10^{-10}) = 1.8 \times 10^{-15}$ F
CGDOV | $C_{GDOV} = W_{EFF} C_{GDO}$ | $C_{GDOV} = (6 \times 10^{-6})(3 \times 10^{-10}) = 1.8 \times 10^{-15}$ F
CGBOV | $C_{GBOV} = L_{EFF} C_{GBO}$; $L_{EFF} = L - 2L_D$ | $C_{GBOV} = (0.5 \times 10^{-6})(4 \times 10^{-10}) = 2 \times 10^{-16}$ F
CGS | $C_{GS}/C_O$ = 0 (off), 0.5 (lin.), 0.66 (sat.); $C_O$ (oxide capacitance) $= W_{EFF} L_{EFF} \epsilon_{ox} / T_{ox}$ | $C_O = (6 \times 10^{-6})(0.5 \times 10^{-6})(0.00345) = 1.03 \times 10^{-14}$ F; $C_{GS} = 0.0$ F
CGD | $C_{GD}/C_O$ = 0 (off), 0.5 (lin.), 0 (sat.) | $C_{GD} = 0.0$ F
CGB | $C_{GB}$ = 0 (on), = $C_O$ in series with $C_S$ (off) | $C_{GB} = 3.88 \times 10^{-15}$ F; $C_S$ = depletion capacitance

Input:
.MODEL CMOSN NMOS LEVEL=3 PHI=0.7 TOX=10E-09 XJ=0.2U TPG=1 VTO=0.65 DELTA=0.7
+ LD=5E-08 KP=2E-04 UO=550 THETA=0.27 RSH=2 GAMMA=0.6 NSUB=1.4E+17 NFS=6E+11
+ VMAX=2E+05 ETA=3.7E-02 KAPPA=2.9E-02 CGDO=3.0E-10 CGSO=3.0E-10 CGBO=4.0E-10
+ CJ=5.6E-04 MJ=0.56 CJSW=5E-11 MJSW=0.52 PB=1
m1 out1 in1 0 0 cmosn W=6U L=0.6U AS=7.2P AD=7.2P PS=8.4U PD=8.4U
3.2.1 Junction Capacitance

The junction capacitances, $C_{BD}$ and $C_{BS}$, consist of two parts: junction area and sidewall. The two parts have different physical characteristics, with parameters CJ and MJ for the junction area, CJSW and MJSW for the sidewall, and PB common to both. These capacitances depend on the voltage across the junction ($V_{DB}$ and $V_{SB}$). The calculations in Table 3.1 assume both source and drain regions are 6 µm × 1.2 µm rectangles, so that $A_D = A_S = 7.2\ \mu\mathrm{m}^2$, and the perimeters (excluding the 6 µm channel edge) are $P_D = P_S = 6 + 1.2 + 1.2 = 8.4$ µm. We exclude the channel edge because the sidewalls facing the channel (corresponding to $C_{BSJ,GATE}$ and $C_{BDJ,GATE}$ in Figure 3.4) are different from the sidewalls that face the field. There is no standard method to allow for this. It is a mistake to exclude the gate edge assuming it is accounted for in the rest of the model; it is not. A pessimistic simulation includes the channel edge in $P_D$ and $P_S$ (but a true worst-case analysis would use more accurate models and worst-case model parameters). In HSPICE there is a separate mechanism to account for the channel edge capacitance (using parameters ACM and CJGATE). In Table 3.1 we have neglected $C_{J,GATE}$. For the p-channel transistor m2 (W = 12 µm and L = 0.6 µm) the source and drain regions are 12 µm × 1.2 µm rectangles, so that $A_D = A_S \approx 14\ \mu\mathrm{m}^2$, and the perimeters are $P_D = P_S = 12 + 1.2 + 1.2 \approx 14$ µm (these parameters are rounded to two significant figures solely to simplify the figures and tables).

In passing, notice that a 1.2 µm strip of diffusion in a 0.6 µm process ($\lambda$ = 0.3 µm) is only $4\lambda$ wide, which is wide enough to place a contact only with aggressive spacing rules. The conservative rules in Figure 2.11 would require a diffusion width of at least $2\lambda$ (rule 6.4a) + $2\lambda$ (rule 6.3a) + $1.5\lambda$ (rule 6.2a) = $5.5\lambda$.
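The junction-capacitance arithmetic of Table 3.1 can be sketched in a few lines of Python. This is only an illustration of the area-plus-sidewall formula using the CMOSN model values (CJ, MJ, CJSW, MJSW, PB); the function name and argument names are assumptions, and the channel-edge term is neglected as in the table.

```python
def junction_cap(area_m2, perim_m, v_rev, cj=5.6e-4, mj=0.56,
                 cjsw=5e-11, mjsw=0.52, pb=1.0):
    """Return (area part, sidewall part, total) junction capacitance in F.

    v_rev is the reverse bias across the junction (V_DB or V_SB) in volts.
    """
    c_area = area_m2 * cj * (1 + v_rev / pb) ** (-mj)
    c_side = perim_m * cjsw * (1 + v_rev / pb) ** (-mjsw)
    return c_area, c_side, c_area + c_side

# n-channel transistor m1: AD = AS = 7.2 square microns, PD = PS = 8.4 microns
AREA = 7.2e-12   # m^2
PERIM = 8.4e-6   # m

cbd = junction_cap(AREA, PERIM, v_rev=3.0)  # drain junction reverse-biased by 3 V
cbs = junction_cap(AREA, PERIM, v_rev=0.0)  # source junction at zero bias
print("CBD = %.3g F" % cbd[2])
print("CBS = %.3g F" % cbs[2])
```

Both totals agree with the CBD and CBS values that PSpice printed for m1.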
3.2.2 Overlap Capacitance

The overlap capacitance calculations for $C_{GSOV}$ and $C_{GDOV}$ in Table 3.1 account for lateral diffusion (the amount the source and drain extend under the gate) using SPICE parameter LD = 5E-08 or $L_D$ = 0.05 µm. Not all versions of SPICE use the equivalent parameter for width reduction, WD (assumed zero in Table 3.1), in calculating $C_{GDOV}$, and not all versions subtract $W_D$ to form $W_{EFF}$.
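A minimal sketch of the overlap-capacitance calculation follows, assuming the CMOSN parameters (CGSO, CGDO, CGBO, LD) and WD = 0 as in Table 3.1. The function name and dictionary keys are illustrative only; as noted above, a particular SPICE version may treat WD differently.

```python
def overlap_caps(w, l, ld=5e-8, wd=0.0,
                 cgso=3.0e-10, cgdo=3.0e-10, cgbo=4.0e-10):
    """Overlap capacitances in F for drawn width w and length l in meters."""
    w_eff = w - 2 * wd   # effective width (WD assumed zero here)
    l_eff = l - 2 * ld   # effective length after lateral diffusion
    return {
        "CGSOV": w_eff * cgso,
        "CGDOV": w_eff * cgdo,
        "CGBOV": l_eff * cgbo,
    }

caps = overlap_caps(w=6e-6, l=0.6e-6)  # n-channel transistor m1
for name, value in caps.items():
    print("%s = %.3g F" % (name, value))
```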
3.2.3 Gate Capacitance

The gate capacitance calculations in Table 3.1 depend on the operating region. The gate-source capacitance $C_{GS}$ varies from zero when the transistor is off to $0.5\,C_O$ ($0.5 \times 1.035 \times 10^{-14} = 5.18 \times 10^{-15}$ F) in the linear region to $(2/3)\,C_O$ in the saturation region ($6.9 \times 10^{-15}$ F). The gate-drain capacitance $C_{GD}$ varies from zero (off) to $0.5\,C_O$ (linear region) and back to zero (saturation region). The gate-bulk capacitance $C_{GB}$ may be viewed as two capacitors in series: the fixed gate-oxide capacitance, $C_O = W_{EFF} L_{EFF} \epsilon_{ox} / T_{ox}$, and the variable depletion capacitance, $C_S = W_{EFF} L_{EFF} \epsilon_{Si} / x_d$, formed by the depletion region that extends under the gate (with varying depth $x_d$). As the transistor turns on, the conducting channel appears and shields the bulk from the gate, and at this point $C_{GB}$ falls to zero. Even with $V_{GS}$ = 0 V, the depletion width under the gate is finite and thus $C_{GB} \approx 4 \times 10^{-15}$ F is less than $C_O \approx 10^{-14}$ F. In fact, since $C_{GB} \approx 0.5\,C_O$, we can tell that at $V_{GS}$ = 0 V, $C_S \approx C_O$. Figure 3.5 shows the variation of the parasitic capacitance values.
FIGURE 3.5 The variation of n -channel transistor parasitic capacitance. Values were obtained from a series of DC simulations using PSpice v5.4, the parameters shown in Table 3.1 ( LEVEL=3 ), and by varying the input voltage, v(in1) , of the inverter in Figure 3.3 (a). Data points are joined by straight lines. Note that CGSOV = CGDOV .
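The gate-capacitance relations of Section 3.2.3 can be explored with the short sketch below. It assumes an oxide permittivity of 3.45e-11 F/m (about 3.9 times the permittivity of free space), and the helper names are hypothetical. The last function simply inverts the series combination of $C_O$ and $C_S$ to estimate the depletion capacitance from the printed CGB value.

```python
EPS_OX = 3.45e-11  # F/m, permittivity of SiO2 (assumed, about 3.9 * eps0)

def oxide_cap(w_eff, l_eff, t_ox):
    """C_O = W_EFF * L_EFF * eps_ox / T_ox, with all lengths in meters."""
    return w_eff * l_eff * EPS_OX / t_ox

def gate_caps(c_o, region):
    """Return (C_GS, C_GD) for region 'off', 'linear', or 'saturation'."""
    fractions = {
        "off": (0.0, 0.0),
        "linear": (0.5, 0.5),
        "saturation": (2 / 3, 0.0),
    }
    f_gs, f_gd = fractions[region]
    return f_gs * c_o, f_gd * c_o

def depletion_cap(c_o, c_gb):
    """Solve C_GB = C_O * C_S / (C_O + C_S) for C_S (transistor off)."""
    return c_o * c_gb / (c_o - c_gb)

c_o = oxide_cap(6e-6, 0.5e-6, 10e-9)   # W_EFF = 6 um, L_EFF = 0.5 um, T_ox = 10 nm
c_s = depletion_cap(c_o, 3.88e-15)     # CGB taken from the PSpice listing
print("C_O = %.3g F" % c_o)
print("C_GS (linear) = %.3g F" % gate_caps(c_o, "linear")[0])
print("C_GS (saturation) = %.3g F" % gate_caps(c_o, "saturation")[0])
print("C_S = %.3g F" % c_s)
```

With these numbers $C_S$ comes out somewhat smaller than $C_O$ but of the same order, in line with the rough estimate $C_S \approx C_O$.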
3.2.4 Input Slew Rate

Figure 3.6 shows an experiment to monitor the input capacitance of an inverter as it switches. We have introduced another variable: the delay of the input ramp or the slew rate of the input. In Figure 3.6(b) the input ramp is 40 ps long with a slew rate of 3 V / 40 ps or 75 GV s$^{-1}$ (as in our previous experiments), and the output of the inverter hardly moves before the input has changed. The input capacitance varies from 20 to 40 fF with an average value of approximately 34 fF for both transitions; we can measure the average value in Probe by plotting AVG(-i(Vin)).
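The conversion between input current and equivalent input capacitance used in Figure 3.6, $C = i\,\Delta t / \Delta V$, is sketched below. The function names are assumptions, and the current value is not a simulation result: it is simply the current that a 34 fF linear capacitor would draw from the 3 V, 40 ps ramp.

```python
def slew_rate(v_swing, ramp_time):
    """Input slew rate in V/s."""
    return v_swing / ramp_time

def ramp_current(c_in, v_swing, ramp_time):
    """Current drawn by a linear capacitor c_in during a linear ramp."""
    return c_in * v_swing / ramp_time

def equivalent_input_cap(i_in, ramp_time, v_swing):
    """C = i * dt / dV, the equivalent input capacitance."""
    return i_in * ramp_time / v_swing

rate = slew_rate(3.0, 40e-12)
i_avg = ramp_current(34e-15, 3.0, 40e-12)       # illustrative value only
c_back = equivalent_input_cap(i_avg, 40e-12, 3.0)
print("slew rate = %.3g V/s" % rate)
print("current for 34 fF = %.3g A" % i_avg)
print("recovered capacitance = %.3g F" % c_back)
```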
FIGURE 3.6 The input capacitance of an inverter. (a) Input capacitance is measured by monitoring the input current to the inverter, i(Vin). (b) Very fast switching. The current, i(Vin), is multiplied by the input ramp delay ($\Delta t$ = 0.04 ns) and divided by the voltage swing ($\Delta V = V_{DD}$ = 3 V) to give the equivalent input capacitance, $C = i\,\Delta t / \Delta V$. Thus an adjusted input current of 40 fA corresponds to an input capacitance of 40 fF. The current, i(Vin), is positive for the rising edge of the input and negative for the falling edge. (c) Very slow switching. The input capacitance is now equal for both transitions.

In Figure 3.6(c) the input ramp is slow enough (300 ns) that we are switching under almost equilibrium conditions: at each voltage we allow the output to find its level on the static transfer curve of Figure 3.2(a). The switching waveforms are quite different. The average input capacitance is now approximately 0.04 pF (a 20 percent difference). The propagation delay (using an input trip point of 0.5 and an output trip point of 0.35) is negative, with a magnitude of approximately 150 − 127 = 23 ns. By changing the input slew rate we have broken our model. For the moment we shall ignore this problem and proceed.

The calculations in Table 3.1 and the behavior of Figures 3.5 and 3.6 are very complex. How can we find the value of the parasitic capacitance, $C$, to fit the model of Figure 3.1? Once again, as we did for pull resistance and the intrinsic output capacitance, instead of trying to derive a theoretical value for $C$, we adjust the value to fit the model. Before we formulate another experiment we should bear in mind the following questions that the experiment of Figure 3.6 raises:

- Is it valid to replace the nonlinear input capacitance with a linear component?
- Is it valid to use a linear input ramp when the normal waveforms are so nonlinear?

Figure 3.7 shows an experiment crafted to answer these questions. The experiment has the following two steps:

1. Adjust c2 to model the input capacitance of m5/6; then $C$ = c2 = 0.0335 pF.
2. Remove all the parasitic capacitances for inverter m9/10 except for the gate capacitances $C_{GS}$, $C_{GD}$, and $C_{GB}$, and then adjust c3 (0.01 pF) and c4 (0.025 pF) to model the effect of these missing parasitics.
FIGURE 3.7 Parasitic capacitance. (a) All devices in this circuit include parasitic capacitance. (b) This circuit uses linear capacitors to model the parasitic capacitance of m9/10. The load formed by the inverter (m5 and m6) is modeled by a 0.0335 pF capacitor (c2); the parasitic capacitance due to the overlap of the gates of m3 and m4 with their source, drain, and bulk terminals is modeled by a 0.01 pF capacitor (c3); and the effect of the parasitic capacitance at the drain terminals of m3 and m4 is modeled by a 0.025 pF capacitor (c4). (c) The two circuits compared. The delay shown (1.22 − 1.135 = 0.085 ns) is equal to $t_{PDf}$ for the inverter m3/4. (d) An exact match would have both waveforms equal at the 0.35 trip point (1.05 V).

We can summarize our findings from this and previous experiments as follows:

1. Since the waveforms in Figure 3.7 match, we can model the input capacitance of a logic cell with a linear capacitor. However, we know the input capacitance may vary (by up to 20 percent in our example) with the input slew rate.
2. The input waveform to the inverter m3/m4 in Figure 3.7 is from another inverter, not a linear ramp. The difference in slew rate causes an error. The measured delay is 85 ps (0.085 ns), whereas our model (Eq. 3.7) predicts
$$t_{PDf} = (38 + 817\,C_{out})\ \mathrm{ps} = (38 + (817)(0.0335))\ \mathrm{ps} = 65\ \mathrm{ps} . \tag{3.8}$$
3. The total gate-oxide capacitance in our inverter with $T_{ox}$ = 100 Å is
$$C_O = (W_n L_n + W_p L_p)\,\epsilon_{ox} / T_{ox} = (34.5 \times 10^{-4})\big((6)(0.6) + (12)(0.6)\big)\ \mathrm{pF} = 0.037\ \mathrm{pF} . \tag{3.9}$$
4. All the transistor parasitic capacitances excluding the gate capacitance contribute 0.01 pF of the 0.0335 pF input capacitance, about 30 percent. The gate capacitances contribute the rest, 0.025 pF (about 70 percent).

The last two observations are useful. Since the gate capacitances are nonlinear, we only see about 0.025/0.037 or 70 percent of the 0.037 pF gate-oxide capacitance, $C_O$, in the input capacitance, $C$. This means that it happens by chance that the total gate-oxide capacitance is also a rough estimate of the gate input capacitance, $C \approx C_O$. Using $L$ and $W$ rather than $L_{EFF}$ and $W_{EFF}$ in Eq. 3.9 helps this estimate. The accuracy of this estimate depends on the fact that the junction capacitances are approximately one-third of the gate-oxide capacitance, which happens to be true for many CMOS processes for the shapes of transistors that normally occur in logic cells. In the next section we shall use this estimate to help us design logic cells.
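The arithmetic behind these findings can be reproduced with a small sketch. It assumes the oxide capacitance per unit area quoted in Eq. 3.9 (34.5e-4 pF per square micron for a 100 Å oxide) and the fitted falling-delay coefficients of Eq. 3.7; the names are illustrative.

```python
COX_PF_PER_UM2 = 34.5e-4  # eps_ox / T_ox in pF per square micron (Eq. 3.9)

def gate_oxide_cap_pf(transistors):
    """Total gate-oxide capacitance in pF for a list of (W, L) in microns."""
    return COX_PF_PER_UM2 * sum(w * l for w, l in transistors)

def falling_delay_ps(c_out_pf, intercept_ps=38.0, r_pd_ohm=817.0):
    """Linear falling-delay model of Eq. 3.7 (ohm * pF = ps)."""
    return intercept_ps + r_pd_ohm * c_out_pf

c_in = 0.0335                                    # fitted input capacitance, pF
c_o = gate_oxide_cap_pf([(6, 0.6), (12, 0.6)])   # n- and p-channel devices
t_model = falling_delay_ps(c_in)

print("C_O = %.3f pF" % c_o)
print("model delay = %.1f ps (measured: 85 ps)" % t_model)
print("non-gate share of C = %.0f%%" % (100 * 0.01 / c_in))
print("gate share of C_O seen at the input = %.0f%%" % (100 * 0.025 / c_o))
```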
3.3 Logical Effort

In this section we explore a delay model based on logical effort, a term coined by Ivan Sutherland and Robert Sproull [1991], that has as its basis the time-constant analysis of Carver Mead, Chuck Seitz, and others. We add a catch-all nonideal component of delay, $t_q$, to Eq. 3.2 that includes: (1) delay due to internal parasitic capacitance; (2) the time for the input to reach the switching threshold of the cell; and (3) the dependence of the delay on the slew rate of the input waveform. With these assumptions we can express the delay as follows:

$$t_{PD} = R(C_{out} + C_p) + t_q . \tag{3.10}$$

(The input capacitance of the logic cell is $C$, but we do not need it yet.) We will use a standard-cell library for a 3.3 V, 0.5 µm (0.6 µm drawn) technology (from Compass) to illustrate our model. We call this technology C5; it is almost identical to the G5 process from Section 2.1 (the Compass library uses a more accurate and more complicated SPICE model than the generic process). The equation for the delay of a 1X drive, two-input NAND cell is in the form of Eq. 3.10 ($C_{out}$ is in pF):

$$t_{PD} = (0.07 + 1.46\,C_{out} + 0.15)\ \mathrm{ns} . \tag{3.11}$$

The delay due to the intrinsic output capacitance (0.07 ns, equal to $RC_p$) and the nonideal delay ($t_q$ = 0.15 ns) are specified separately. The nonideal delay is a considerable fraction of the total delay, so we can hardly ignore it. If data books do not specify these components of delay separately, we have to estimate the fractions of the constant part of a delay equation to assign to $RC_p$ and $t_q$ (here the ratio $t_q / RC_p$ is approximately 2).

The data book tells us the input trip point is 0.5 and the output trip points are 0.35 and 0.65. We can use Eq. 3.11 to estimate the pull resistance for this cell as $R \approx 1.46$ ns pF$^{-1}$ or about 1.5 kΩ. Equation 3.11 is for the falling delay; the data book equation for the rising delay gives slightly different values (but within 10 percent of the falling delay values).

We can scale any logic cell by a scaling factor $s$ (transistor gates become $s$ times wider, but the gate lengths stay the same), and as a result the pull resistance $R$ will decrease to $R/s$ and the parasitic capacitance $C_p$ will increase to $sC_p$. Since $t_q$ is nonideal, by definition it is hard to predict how it will scale. We shall assume that $t_q$ scales linearly with $s$ for all cells. The total cell delay then scales as follows:

$$t_{PD} = (R/s)(C_{out} + sC_p) + st_q . \tag{3.12}$$

For example, the delay equation for a 2X drive ($s$ = 2), two-input NAND cell is

$$t_{PD} = (0.03 + 0.75\,C_{out} + 0.51)\ \mathrm{ns} . \tag{3.13}$$

Compared to the 1X version (Eq. 3.11), the output parasitic delay has decreased to 0.03 ns (from 0.07 ns), whereas we predicted it would remain constant (the difference is because of the layout); the pull resistance has decreased by a factor of 2 from 1.5 kΩ to 0.75 kΩ, as we would expect; and the nonideal delay has increased to 0.51 ns (from 0.15 ns). The differences between our predictions and the actual values give us a measure of the model accuracy.

We rewrite Eq. 3.12 using the input capacitance of the scaled logic cell, $C_{in} = sC$,

$$t_{PD} = RC\,\frac{C_{out}}{C_{in}} + RC_p + st_q . \tag{3.14}$$

Finally we normalize the delay using the time constant formed from the pull resistance $R_{inv}$ and the input capacitance $C_{inv}$ of a minimum-size inverter:

$$d = \frac{(RC)(C_{out}/C_{in}) + RC_p + st_q}{\tau} = f + p + q . \tag{3.15}$$

The time constant tau,

$$\tau = R_{inv} C_{inv} , \tag{3.16}$$

is a basic property of any CMOS technology. We shall measure delays in terms of $\tau$. The delay equation for a 1X (minimum-size) inverter in the C5 library is

$$t_{PDf} = (0.06 + 1.60\,C_{out} + 0.10)\ \mathrm{ns} . \tag{3.17}$$

Thus $t_{q\,inv}$ = 0.1 ns and $R_{inv}$ = 1.60 kΩ. The input capacitance of the 1X inverter (the standard load for this library) is specified in the data book as $C_{inv}$ = 0.036 pF; thus $\tau$ = (0.036 pF)(1.60 kΩ) = 0.06 ns for the C5 technology.

The use of logical effort consists of rearranging and understanding the meaning of the various terms in Eq. 3.15. The delay equation is the sum of three terms,

$$d = f + p + q . \tag{3.18}$$

We give these terms special names as follows:
delay = effort delay + parasitic delay + nonideal delay . (3.19)

The effort delay $f$ we write as a product of logical effort, $g$, and electrical effort, $h$:

$$f = gh . \tag{3.20}$$

So we can further partition delay into the following terms:

delay = logical effort × electrical effort + parasitic delay + nonideal delay . (3.21)

The logical effort $g$ is a function of the type of logic cell,

$$g = RC/\tau . \tag{3.22}$$

What size of logic cell do the $R$ and $C$ refer to? It does not matter because the $R$ and $C$ will change as we scale a logic cell, but the $RC$ product stays the same: the logical effort is independent of the size of a logic cell. We can find the logical effort by scaling down the logic cell so that it has the same drive capability as the 1X minimum-size inverter. Then the logical effort, $g$, is the ratio of the input capacitance, $C_{in}$, of the 1X version of the logic cell to $C_{inv}$ (see Figure 3.8).
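To make the scaling assumptions concrete, the sketch below applies the scaled-delay model (pull resistance divided by $s$, $RC_p$ unchanged, $t_q$ multiplied by $s$) to the 1X two-input NAND data-book coefficients and compares the result with the 2X data-book coefficients. It also forms $\tau$ from the inverter values. The tuple layout and function names are assumptions made for this illustration.

```python
def scaled_delay_terms(s, rcp_ns, r_kohm, tq_ns):
    """Model coefficients (R*C_p, R/s, s*t_q) for a cell scaled by s."""
    return rcp_ns, r_kohm / s, s * tq_ns

def delay_ns(terms, c_out_pf):
    """t_PD = R*C_p + (R/s)*C_out + s*t_q, with kOhm * pF = ns."""
    rcp, r, tq = terms
    return rcp + r * c_out_pf + tq

NAND2_1X = (0.07, 1.46, 0.15)       # 1X data-book coefficients
NAND2_2X_BOOK = (0.03, 0.75, 0.51)  # 2X data-book coefficients
nand2_2x_model = scaled_delay_terms(2, *NAND2_1X)

TAU_NS = 1.60 * 0.036               # R_inv (kOhm) * C_inv (pF), about 0.06 ns

print("2X model coefficients:", nand2_2x_model)
print("2X data book coefficients:", NAND2_2X_BOOK)
print("tau = %.4f ns" % TAU_NS)
print("1X NAND2 driving 0.1 pF: %.3f ns = %.1f tau"
      % (delay_ns(NAND2_1X, 0.1), delay_ns(NAND2_1X, 0.1) / TAU_NS))
```

The resistance term scales as predicted, while the parasitic and nonideal terms show how rough the linear-scaling assumption is.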
FIGURE 3.8 Logical effort. (a) The input capacitance, $C_{inv}$, looking into the input of a minimum-size inverter in terms of the gate capacitance of a minimum-size device. (b) Sizing a logic cell to have the same drive strength as a minimum-size inverter (assuming a logic ratio of 2). The input capacitance looking into one of the logic-cell terminals is then $C_{in}$. (c) The logical effort of a cell is $C_{in}/C_{inv}$. For a two-input NAND cell, the logical effort, $g$ = 4/3.

The electrical effort $h$ depends only on the load capacitance $C_{out}$ connected to the output of the logic cell and the input capacitance of the logic cell, $C_{in}$; thus

$$h = C_{out}/C_{in} . \tag{3.23}$$

The parasitic delay $p$ depends on the intrinsic parasitic capacitance $C_p$ of the logic cell, so that

$$p = RC_p/\tau . \tag{3.24}$$

Table 3.2 shows the logical efforts for single-stage logic cells. Suppose the minimum-size inverter has an n-channel transistor with W/L = 1 and a p-channel transistor with W/L = 2 (logic ratio, $r$, of 2). Then each two-input NAND logic cell input is connected to an n-channel transistor with W/L = 2 and a p-channel transistor with W/L = 2. The input capacitance of the two-input NAND logic cell divided by that of the inverter is thus 4/3. This is the logical effort of a two-input NAND when $r$ = 2. Logical effort depends on the ratio of the logic. For an $n$-input NAND cell with ratio $r$, the p-channel transistors are W/L = $r$/1, and the n-channel transistors are W/L = $n$/1. For a NOR cell the n-channel transistors are 1/1 and the p-channel transistors are $nr$/1.
TABLE 3.2 Cell effort, parasitic delay, and nonideal delay (in units of $\tau$) for single-stage CMOS cells.

Cell | Cell effort (logic ratio = 2) | Cell effort (logic ratio = r) | Parasitic delay/$\tau$ | Nonideal delay/$\tau$
inverter | 1 (by definition) | 1 (by definition) | $p_{inv}$ (by definition) | $q_{inv}$ (by definition)
n-input NAND | $(n + 2)/3$ | $(n + r)/(r + 1)$ | $n\,p_{inv}$ | $n\,q_{inv}$
n-input NOR | $(2n + 1)/3$ | $(nr + 1)/(r + 1)$ | $n\,p_{inv}$ | $n\,q_{inv}$
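The cell-effort columns of Table 3.2 follow directly from the transistor widths seen at one input, divided by the inverter input width $1 + r$. A minimal sketch, with hypothetical function names and exact rational arithmetic:

```python
from fractions import Fraction

def g_nand(n, r=2):
    """Logical effort of an n-input NAND: (n + r) / (r + 1)."""
    r = Fraction(r)
    return (n + r) / (r + 1)

def g_nor(n, r=2):
    """Logical effort of an n-input NOR: (n*r + 1) / (r + 1)."""
    r = Fraction(r)
    return (n * r + 1) / (r + 1)

print("NAND2, r = 2:  ", g_nand(2))
print("NOR2,  r = 2:  ", g_nor(2))
print("NOR3,  r = 1.5:", g_nor(3, 1.5))
print("NAND2, r = 1.5:", g_nand(2, 1.5))
```

A one-input NAND or NOR reduces to the inverter, so both functions return 1 for `n = 1`.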
The parasitic delay arises from parasitic capacitance at the output node of a single-stage logic cell and most (but not all) of this is due to the source and drain capacitance. The parasitic delay of a minimum-size inverter is

$$p_{inv} = C_p / C_{inv} . \tag{3.25}$$

The parasitic delay is a constant for any technology. For our C5 technology we know $RC_p$ = 0.06 ns and, using Eq. 3.17 for a minimum-size inverter, we can calculate $p_{inv} = RC_p/\tau$ = 0.06/0.06 = 1 (this is purely a coincidence). Thus $C_p$ is about equal to $C_{inv}$ and is approximately 0.036 pF. There is a large error in calculating $p_{inv}$ from extracted delay values that are so small. Often we can calculate $p_{inv}$ more accurately from estimating the parasitic capacitance from layout.
Because $RC_p$ is constant, the parasitic delay is equal to the ratio of parasitic capacitance of a logic cell to the parasitic capacitance of a minimum-size inverter. In practice this ratio is very difficult to calculate because it depends on the layout. We can approximate the parasitic delay by assuming it is proportional to the sum of the widths of the n-channel and p-channel transistors connected to the output. Table 3.2 shows the parasitic delay for different cells in terms of $p_{inv}$.
The nonideal delay $q$ is hard to predict and depends mainly on the physical size of the logic cell (proportional to the cell area in general, or width in the case of a standard cell or a gate-array macro),

$$q = st_q/\tau . \tag{3.26}$$

We define $q_{inv}$ in the same way we defined $p_{inv}$. An $n$-input cell is approximately $n$ times larger than an inverter, giving the values for nonideal delay shown in Table 3.2. For our C5 technology, from Eq. 3.17, $q_{inv} = t_{q\,inv}/\tau$ = 0.1 ns / 0.06 ns = 1.7.
3.3.1 Predicting Delay

As an example, let us predict the delay of a three-input NOR logic cell with 2X drive, driving a net with a fanout of four, with a total load capacitance (comprising the input capacitance of the four cells we are driving plus the interconnect) of 0.3 pF. From Table 3.2 we see $p = 3p_{inv}$ and $q = 3q_{inv}$ for this cell. We can calculate $C_{in}$ from the fact that the input gate capacitance of a 1X drive, three-input NOR logic cell is equal to $gC_{inv}$, and for a 2X logic cell, $C_{in} = 2gC_{inv}$. Thus,

$$gh = g\,\frac{C_{out}}{C_{in}} = \frac{g\,(0.3\ \mathrm{pF})}{2g\,C_{inv}} = \frac{(0.3\ \mathrm{pF})}{(2)(0.036\ \mathrm{pF})} . \tag{3.27}$$

(Notice that $g$ cancels out in this equation; we shall discuss this in the next section.) The delay of the NOR logic cell, in units of $\tau$, is thus

$$d = gh + p + q = \frac{0.3 \times 10^{-12}}{(2)(0.036 \times 10^{-12})} + (3)(1) + (3)(1.7) = 4.1666667 + 3 + 5.1 = 12.266667\,\tau , \tag{3.28}$$

equivalent to an absolute delay, $t_{PD} \approx 12.3 \times 0.06$ ns = 0.74 ns.

The delay for a 2X drive, three-input NOR logic cell in the C5 library is

$$t_{PD} = (0.03 + 0.72\,C_{out} + 0.60)\ \mathrm{ns} . \tag{3.29}$$

With $C_{out}$ = 0.3 pF,

$$t_{PD} = 0.03 + (0.72)(0.3) + 0.60 = 0.846\ \mathrm{ns} , \tag{3.30}$$

compared to our prediction of 0.74 ns. Almost all of the error here comes from the inaccuracy in predicting the nonideal delay. Logical effort gives us a method to examine relative delays and not accurately calculate absolute delays. More important is that logical effort gives us an insight into why logic has the delay it does.
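The prediction above is easy to reproduce. The sketch below uses the C5 values quoted in the text ($C_{inv}$ = 0.036 pF, $\tau$ = 0.06 ns, $p_{inv}$ = 1, $q_{inv}$ = 1.7) and compares the result with the data-book equation for the 2X three-input NOR; the function name is an assumption.

```python
C_INV_PF = 0.036   # standard load (1X inverter input capacitance), pF
TAU_NS = 0.06      # time constant for the C5 technology, ns
P_INV = 1.0        # parasitic delay of the inverter, in units of tau
Q_INV = 1.7        # nonideal delay of the inverter, in units of tau

def predict_delay(n_inputs, drive, c_out_pf):
    """Return (delay in tau, delay in ns) for an n-input single-stage cell.

    C_in = drive * g * C_inv, so g cancels in the effort delay g*h.
    """
    gh = c_out_pf / (drive * C_INV_PF)
    d = gh + n_inputs * P_INV + n_inputs * Q_INV
    return d, d * TAU_NS

d_tau, t_model = predict_delay(n_inputs=3, drive=2, c_out_pf=0.3)
t_book = 0.03 + 0.72 * 0.3 + 0.60   # data-book equation for the 2X NOR3

print("predicted: %.2f tau = %.2f ns" % (d_tau, t_model))
print("data book: %.3f ns" % t_book)
print("difference: %.0f%%" % (100 * (t_book - t_model) / t_book))
```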
3.3.2 Logical Area and Logical Efficiency

Figure 3.9 shows a single-stage OR-AND-INVERT cell that has different logical efforts at each input. The logical effort for the OAI221 is the logical-effort vector g = (7/3, 7/3, 5/3). For example, the first element of this vector, 7/3, is the logical effort of inputs A and B in Figure 3.9 .
FIGURE 3.9 An OAI221 logic cell with different logical efforts at each input. In this case g = (7/3, 7/3, 5/3). The logical effort for inputs A and B is 7/3, the logical effort for inputs C and D is also 7/3, and for input E the logical effort is 5/3. The logical area is the sum of the transistor areas, 33 logical squares.
We can calculate the area of the transistors in a logic cell (ignoring the routing area, drain area, and source area) in units of a minimum-size n-channel transistor; we call these units logical squares. We call the transistor area the logical area. For example, the logical area of a 1X drive cell, OAI221X1, is calculated as follows:

- n-channel transistor sizes: 3/1 + 4 × (3/1)
- p-channel transistor sizes: 2/1 + 4 × (4/1)
- total logical area = 2 + (4 × 4) + (5 × 3) = 33 logical squares
Figure 3.10 shows a single-stage AOI221 cell, with g = (8/3, 8/3, 7/3). The calculation of the logical area (for an AOI221X1) is as follows:

- n-channel transistor sizes: 1/1 + 4 × (2/1)
- p-channel transistor sizes: 6/1 + 4 × (6/1)
- logical area = 1 + (4 × 2) + (5 × 6) = 39 logical squares
FIGURE 3.10 An AND-OR-INVERT cell, an AOI221, with logical-effort vector, g = (8/3, 8/3, 7/3). The logical area is 39 logical squares.
These calculations show us that the single-stage OAI221, with an area of 33 logical squares and logical effort of (7/3, 7/3, 5/3), is more logically efficient than the single-stage AOI221 logic cell with a larger area of 39 logical squares and larger logical effort of (8/3, 8/3, 7/3).
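The logical-area and logical-effort bookkeeping for these two cells can be sketched as follows. Transistor sizes are W/L values in units of a minimum-size n-channel device, and the logical effort of an input is taken as the total width connected to that input divided by the inverter input width (3 for a logic ratio of 2). The function names and list layout are assumptions.

```python
from fractions import Fraction

INVERTER_INPUT = 3  # 1/1 n-channel + 2/1 p-channel (logic ratio of 2)

def logical_area(n_widths, p_widths):
    """Sum of transistor sizes, in logical squares."""
    return sum(n_widths) + sum(p_widths)

def logical_effort(n_width, p_width):
    """Input capacitance of one input relative to the inverter input."""
    return Fraction(n_width + p_width, INVERTER_INPUT)

# OAI221X1: five 3/1 n-channel devices; four 4/1 and one 2/1 p-channel devices
oai221_area = logical_area([3] * 5, [4] * 4 + [2])
oai221_g = (logical_effort(3, 4), logical_effort(3, 4), logical_effort(3, 2))

# AOI221X1: four 2/1 and one 1/1 n-channel devices; five 6/1 p-channel devices
aoi221_area = logical_area([2] * 4 + [1], [6] * 5)
aoi221_g = (logical_effort(2, 6), logical_effort(2, 6), logical_effort(1, 6))

print("OAI221: area", oai221_area, "g =", [str(g) for g in oai221_g])
print("AOI221: area", aoi221_area, "g =", [str(g) for g in aoi221_g])
```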
3.3.3 Logical Paths

When we calculated the delay of the NOR logic cell in Section 3.3.1, the answer did not depend on the logical effort of the cell, $g$ (it cancelled out in Eqs. 3.27 and 3.28). This is because $g$ is a measure of the input capacitance of a 1X drive logic cell. Since we were not driving the NOR logic cell with another logic cell, the input capacitance of the NOR logic cell had no effect on the delay. This is what we do in a data book: we measure logic-cell delay using an ideal input waveform that is the same no matter what the input capacitance of the cell. Instead let us calculate the delay of a logic cell when it is driven by a minimum-size inverter. To do this we need to extend the notion of logical effort.

So far we have only considered a single-stage logic cell, but we can extend the idea of logical effort to a chain of logic cells or logical path. Consider the logic path when we use a minimum-size inverter ($g_0$ = 1, $p_0$ = 1, $q_0$ = 1.7) to drive one input of a 2X drive, three-input NOR logic cell with $g_1 = (nr + 1)/(r + 1)$, $p_1 = 3p_{inv} = 3$, $q_1 = 3q_{inv} = 5.1$, and a load equal to four standard loads. If the logic ratio is $r$ = 1.5, then $g_1$ = 5.5/2.5 = 2.2. The delay of the inverter is

$$d_0 = g_0 h_0 + p_0 + q_0 = (1)(2g_1)(C_{inv}/C_{inv}) + 1 + 1.7 = (1)(2)(2.2) + 1 + 1.7 = 7.1\,\tau . \tag{3.31}$$

Of this $7.1\tau$ delay we can attribute $4.4\tau$ to the loading of the NOR logic cell input capacitance, which is $2g_1 C_{inv}$. The delay of the NOR logic cell is, as before, $d_1 = g_1 h_1 + p_1 + q_1 = 12.3\tau$, making the total delay $7.1 + 12.3 = 19.4\tau$, so the absolute delay is (19.4)(0.06 ns) = 1.164 ns, or about 1.2 ns.

We can see that the path delay $D$ is the sum of the effort delay, parasitic delay, and nonideal delay at each stage. In general, we can write the path delay as

$$D = \sum_{i \in \text{path}} g_i h_i + \sum_{i \in \text{path}} (p_i + q_i) . \tag{3.32}$$
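A path delay of this form is just a sum over stages. The sketch below represents each stage as a tuple (g, h, p, q), an assumed layout, and reproduces the inverter-plus-NOR example with r = 1.5 and the 0.3 pF load used in Section 3.3.1.

```python
C_INV_PF = 0.036   # standard load, pF
TAU_NS = 0.06      # C5 time constant, ns

def path_delay(stages):
    """Sum of g*h + p + q over the stages, in units of tau."""
    return sum(g * h + p + q for g, h, p, q in stages)

g1 = (3 * 1.5 + 1) / (1.5 + 1)          # three-input NOR, r = 1.5
c_in_nor = 2 * g1 * C_INV_PF            # 2X drive NOR input capacitance, pF

inverter = (1.0, c_in_nor / C_INV_PF, 1.0, 1.7)
nor3_2x = (g1, 0.3 / c_in_nor, 3 * 1.0, 3 * 1.7)

d0 = path_delay([inverter])
d1 = path_delay([nor3_2x])
total = path_delay([inverter, nor3_2x])
print("inverter: %.1f tau, NOR: %.1f tau, path: %.1f tau = %.2f ns"
      % (d0, d1, total, total * TAU_NS))
```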
3.3.4 Multistage Cells

Consider the following function (a multistage AOI221 logic cell):

ZN(A1, A2, B1, B2, C) = NOT(NAND(NAND(A1, A2), AOI21(B1, B2, C)))
= (((A1·A2)' · (B1·B2 + C)')')'
= (A1·A2 + B1·B2 + C)'
= AOI221(A1, A2, B1, B2, C) . (3.33)

Figure 3.11(a) shows this implementation with each input driven by a minimum-size inverter so we can measure the effect of the cell input capacitance.

FIGURE 3.11 Logical paths. (a) An AOI221 logic cell constructed as a multistage cell from smaller cells. (b) A single-stage AOI221 logic cell.

The logical efforts of each of the logic cells in Figure 3.11(a) are as follows:

$$g_0 = g_4 = g(\mathrm{NOT}) = 1 ,$$
$$g_1 = g(\mathrm{AOI21}) = (2, (2r + 1)/(r + 1)) = (2, 4/2.5) = (2, 1.6) ,$$
$$g_2 = g_3 = g(\mathrm{NAND2}) = (r + 2)/(r + 1) = (3.5)/(2.5) = 1.4 . \tag{3.34}$$
Each of the logic cells in Figure 3.11 has a 1X drive strength. This means that the input capacitance of each logic cell is given, as shown in the figure, by $gC_{inv}$. Using Eq. 3.32 we can calculate the delay from the input of the inverter driving A1 to the output ZN as

$$d_1 = (1)(1.4) + 1 + 1.7 + (1.4)(1) + 2 + 3.4 + (1.4)(0.7) + 2 + 3.4 + (1)C_L + 1 + 1.7 = (20 + C_L)\,\tau . \tag{3.35}$$

In Eq. 3.35 we have normalized the output load, $C_L$, by dividing it by a standard load (equal to $C_{inv}$). We can calculate the delays of the other paths similarly.

More interesting is to compare the multistage implementation with the single-stage version. In our C5 technology, with a logic ratio, $r$ = 1.5, we can calculate the logical effort for a single-stage AOI221 logic cell as

$$g(\mathrm{AOI221}) = ((3r + 2)/(r + 1), (3r + 2)/(r + 1), (3r + 1)/(r + 1)) = (6.5/2.5, 6.5/2.5, 5.5/2.5) = (2.6, 2.6, 2.2) . \tag{3.36}$$

This gives the delay from an inverter driving the A input to the output ZN of the single-stage logic cell as

$$d_1 = ((1)(2.6) + 1 + 1.7 + (1)C_L + 5 + 8.5)\,\tau = (18.8 + C_L)\,\tau . \tag{3.37}$$
The single-stage delay is very close to the delay for the multistage version of this logic cell. In some ASIC libraries the AOI221 is implemented as a multistage logic cell instead of using a single stage. It raises the question: Can we make the multistage logic cell any faster by adjusting the scale of the intermediate logic cells?
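The comparison between the two implementations can be written out stage by stage. In this sketch each stage is a tuple (g, h, p, q), the load c_l is in standard loads, and the electrical effort of the last NAND is kept as the exact ratio 1/1.4 rather than the rounded 0.7, so the multistage constant comes out as 20.0. The names are illustrative.

```python
def path_delay(stages):
    """Sum of g*h + p + q over the stages, in units of tau."""
    return sum(g * h + p + q for g, h, p, q in stages)

def multistage_delay(c_l):
    """Inverter -> NAND2 -> NAND2 -> inverter, from input A1 to ZN."""
    stages = [
        (1.0, 1.4, 1.0, 1.7),        # input inverter driving NAND cell 2
        (1.4, 1.0, 2.0, 3.4),        # NAND cell 2 driving NAND cell 3
        (1.4, 1.0 / 1.4, 2.0, 3.4),  # NAND cell 3 driving the output inverter
        (1.0, c_l, 1.0, 1.7),        # output inverter driving the load
    ]
    return path_delay(stages)

def single_stage_delay(c_l):
    """Inverter -> single-stage AOI221 (g = 2.6 at the A input)."""
    stages = [
        (1.0, 2.6, 1.0, 1.7),        # input inverter driving the AOI221
        (2.6, c_l / 2.6, 5.0, 8.5),  # AOI221 driving the load
    ]
    return path_delay(stages)

for c_l in (1, 4, 16):
    print("C_L = %2d: multistage %.1f tau, single stage %.1f tau"
          % (c_l, multistage_delay(c_l), single_stage_delay(c_l)))
```

Under this model the difference between the two stays at about 1.2 τ regardless of the load, since both final stages contribute an effort delay equal to the normalized load.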
3.3.5 Optimum Delay

Before we can attack the question of how to optimize delay in a logic path, we shall need some more definitions. The path logical effort G is the product of logical efforts on a path:
3.3 Logical Effort
G =
i path
g i . (3.38)
The path electrical effort H is the product of the electrical efforts on the path, C out H =
i path
h i , (3.39) C in
where C out is the last output capacitance on the path (the load) and C in is the first input capacitance on the path. The path effort F is the product of the path electrical effort and logical efforts, F = GH . (3.40) The optimum effort delay for each stage is found by minimizing the path delay D by varying the electrical efforts of each stage h i , while keeping H , the path electrical effort fixed. The optimum effort delay is achieved when each stage operates with equal effort, f^ i = g i h i = F 1/ N . (3.41) This a useful result. The optimum path delay is then D^ = NF 1/ N = N ( GH ) 1/ N + P + Q , (3.42) where P + Q is the sum of path parasitic delay and nonideal delay,
$$P + Q = \sum_{i \in \text{path}} (p_i + q_i) . \qquad (3.43)$$
We can use these results to improve the AOI221 multistage implementation of Figure 3.11 (a). Assume that we need a 1X cell, so the output inverter (cell 4) must have 1X drive strength. This fixes the capacitance we must drive as $C_{out} = C_{inv}$ (the capacitance at the input of this inverter). The input inverters are included to measure the effect of the cell input capacitance, so we cannot cheat by altering these. This fixes the input capacitance as $C_{in} = C_{inv}$. In this case H = 1. The logic cells that we can scale on the path from the A input to the output are NAND logic cells labeled as 2 and 3. In this case
$$G = g_0 g_2 g_3 = 1 \times 1.4 \times 1.4 = 1.96 . \qquad (3.44)$$
Thus F = GH = 1.96 and the optimum stage effort is $1.96^{1/3} = 1.25$, so that the optimum effort delay $N F^{1/N} = 3.75$. From Figure 3.11 (a) we see that
$$g_0 h_0 + g_2 h_2 + g_3 h_3 = 1.4 + 1.3 + 1 = 3.8 . \qquad (3.45)$$
This means that even if we scale the sizes of the cells to their optimum values, we only save a fraction of a τ (3.8 - 3.75 = 0.05). This is a useful result (and one that is true in general): the delay is not very sensitive to the scale of the cells. In this case it means that we can reduce the size of the two NAND cells in the multicell implementation of an AOI221 without sacrificing speed. We can use logical effort to predict what the change in delay will be for any given cell sizes. We can use logical effort in the design of logic cells and in the design of logic that uses logic cells. If we do have the flexibility to continuously size each logic cell (which in ASIC design we normally do not; we usually have to choose from 1X, 2X, 4X drive strengths), each logic stage can be sized using the equation for the individual
stage electrical efforts,
$$\hat{h}_i = \frac{F^{1/N}}{g_i} . \qquad (3.46)$$
For example, even though we know that it will not improve the delay by much, let us size the cells in Figure 3.11 (a). We shall work backward starting at the fixed load capacitance at the input of the last inverter. For NAND cell 3, gh = 1.25; thus (since g = 1.4), $h = C_{out}/C_{in} = 0.893$. The output capacitance, $C_{out}$, for this NAND cell is the input capacitance of the inverter, fixed as 1 standard load, $C_{inv}$. This fixes the input capacitance, $C_{in}$, of NAND cell 3 at 1/0.893 = 1.12 standard loads. Thus, the scale of NAND cell 3 is 1.12/1.4 or 0.8X. Now for NAND cell 2, gh = 1.25; $C_{out}$ for NAND cell 2 is the $C_{in}$ of NAND cell 3. Thus $C_{in}$ for NAND cell 2 is 1.12/0.893 = 1.254 standard loads. This means the scale of NAND cell 2 is 1.254/1.4 or 0.9X. The optimum sizes of the NAND cells are not very different from 1X in this case because H = 1 and we are only driving a load no bigger than the input capacitance. This raises the question: What is the optimum stage effort if we have to drive a large load, H >> 1? Notice that, so far, we have only calculated the optimum stage effort when we have a fixed number of stages, N. We have said nothing about the situation in which we are free to choose N, the number of stages.
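The backward sizing procedure just described can be sketched in a few lines of Python. This is only an illustration under the same assumptions as the worked example (an inverter with g = 1 followed by two NAND cells with g = 1.4, H = 1, and a load of one standard load); the function names are made up for this sketch.

```python
import math


def optimum_stage_effort(logical_efforts, path_electrical_effort):
    """Equal stage effort F**(1/N) from Eqs. 3.38, 3.40, and 3.41."""
    path_logical_effort = math.prod(logical_efforts)          # G
    path_effort = path_logical_effort * path_electrical_effort  # F = GH
    return path_effort ** (1.0 / len(logical_efforts))


def size_path(logical_efforts, c_out, path_electrical_effort):
    """Work backward from the load, applying Eq. 3.46 at each stage.

    Returns the stage effort and the input capacitance of every stage
    (first stage first), in standard loads.
    """
    f_hat = optimum_stage_effort(logical_efforts, path_electrical_effort)
    input_caps = []
    load = c_out
    for g in reversed(logical_efforts):
        h = f_hat / g            # optimum electrical effort of this stage
        c_in = load / h          # since h = C_out / C_in
        input_caps.append(c_in)
        load = c_in              # this stage is the load of the previous one
    input_caps.reverse()
    return f_hat, input_caps


if __name__ == "__main__":
    efforts = [1.0, 1.4, 1.4]    # inverter 0, NAND cell 2, NAND cell 3
    f_hat, caps = size_path(efforts, c_out=1.0, path_electrical_effort=1.0)
    print("stage effort:", round(f_hat, 2))
    for g, c_in in zip(efforts, caps):
        # A cell's scale is its input capacitance divided by its logical effort.
        print("C_in =", round(c_in, 3), "scale =", round(c_in / g, 2))
```

A useful sanity check is that the input capacitance computed for the first stage comes back as one standard load, which is the value of $C_{in}$ that was fixed at the start.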
3.3.6 Optimum Number of Stages

Suppose we have a chain of N inverters each with equal stage effort, f = gh. Neglecting parasitic and nonideal delay, the total path delay is Nf = Ngh = Nh, since g = 1 for an inverter. Suppose we need to drive a path electrical effort H; then $h^N = H$, or N ln h = ln H. Thus the delay is Nh = h ln H / ln h. Since ln H is fixed, we can only vary h / ln(h). Figure 3.12 shows that this is a very shallow function with a minimum at h = e ≈ 2.718. At this point ln h = 1 and the total delay is Ne = e ln H. This result
is particularly useful in driving large loads either on-chip (the clock, for example) or off-chip (I/O pad drivers, for example).
FIGURE 3.12 Stage effort.
h          1.5   2     2.7   3     4     5     10
h/(ln h)   3.7   2.9   2.7   2.7   2.9   3.1   4.3
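The h/(ln h) values listed in Figure 3.12 can be reproduced with a short Python sketch, which also shows how a stage count might be chosen for a given H. Rounding N to the nearest integer is an assumption made here (a real chain needs a whole number of stages); parasitic and nonideal delays are neglected, as in the derivation. The function names are hypothetical.

```python
import math


def delay_factor(h):
    """The quantity h / ln(h) that multiplies ln(H) in the chain delay."""
    return h / math.log(h)


def inverter_chain(path_electrical_effort):
    """Pick a stage count near the ideal N = ln(H), where h = e.

    Returns (N, h, delay) with the delay N*h in units of tau.
    """
    ideal_stages = math.log(path_electrical_effort)
    stages = max(1, round(ideal_stages))        # assumption: nearest integer
    h = path_electrical_effort ** (1.0 / stages)
    return stages, h, stages * h


if __name__ == "__main__":
    for h in (1.5, 2, 2.7, 3, 4, 5, 10):
        print(h, round(delay_factor(h), 1))
    print(inverter_chain(1000.0))
```

Because the function is so shallow near its minimum, the stage effort that results from rounding N stays close to e, and the delay penalty is small.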
Figure 3.12 shows us how to minimize delay regardless of area or power and neglecting parasitic and nonideal delays. More complicated equations can be derived, including nonideal effects, when we wish to trade off delay for smaller area or reduced power.
1. For the Compass 0.5 μm technology (C5): p_inv = 1.0, q_inv = 1.7, R_inv = 1.5 kΩ, C_inv = 0.036 pF.
3.4 Library-Cell Design

The optimum cell layout for each process generation changes because the design rules for each ASIC vendor's process are always slightly different, even for the same generation of technology. For example, two companies may have very similar 0.35 μm CMOS process technologies, but the third-level metal spacing might be slightly different. If a cell library is to be used with both processes, we could construct the library by adopting the most stringent rules from each process. A library constructed in this fashion may not be competitive with one that is constructed specifically for each process. Even though ASIC vendors prize their design rules as secret, it turns out that they are similar, except for a few details. Unfortunately, it is the details that stop us moving designs from one process to another. Unless we are a very large customer it is difficult to have an ASIC vendor change or waive design rules for us. We would like all vendors to agree on a common set of design rules. This is, in fact, easier than it sounds. The reason that most vendors have similar rules is that most vendors use the same manufacturing equipment and a similar process. It is possible to construct a highest common denominator library that extracts the most from the current manufacturing capability. Some library companies and the large Japanese ASIC vendors are adopting this approach.
Layout of library cells is either hand-crafted or uses some form of symbolic layout. Symbolic layout is usually performed in one of two ways: using either interactive graphics or a text layout language. Shapes are represented by simple lines or rectangles, known as sticks or logs, in symbolic layout. The actual dimensions of the sticks or logs are determined after layout is completed in a postprocessing step. An alternative to graphical symbolic layout uses a text layout language, similar to a programming language such as C, that directs a program to assemble layout. The spacing and dimensions of the layout shapes are defined in terms of variables rather than constants. These variables can be changed after symbolic layout is complete to adjust the layout spacing to a specific process.
Mapping symbolic layout to a specific process technology uses 10-20 percent more area than hand-crafted layout (though this can then be further reduced to 5-10 percent with compaction). Most symbolic layout systems do not allow 45° layout and this introduces a further area penalty (my experience shows this is about 5-15 percent). As libraries get larger, and the capability to quickly move libraries and ASIC designs between different generations of process technologies becomes more important, the advantages of symbolic layout may outweigh the disadvantages.
3.5 Library Architecture

Figure 3.13 (a) shows cell use data from over 150 CMOS gate array designs. These results are remarkably similar to those from other ASIC designs using different libraries and different technologies and show that typically 80 percent of an ASIC uses less than 20 percent of the cell library.
FIGURE 3.13 Cell library statistics, panels (a) to (e).
We can use the data in Figure 3.13 (a) to derive some useful conclusions about the number and types of cells to be included in a library. Before we do this, a few words of caution are in order. First, the data shown in Figure 3.13 (a) tells us about cells that are included in a library. This data cannot tell us anything about cells that are not (and perhaps should be) included in a library. Second, the type of design entry we use, and the type of ASIC we are designing, can dramatically affect the profile of the use of different cell types. For example, if we use a high-level design language, together with logic synthesis, to enter an ASIC design, this will favor the use of the complex combinational cells (cells of the AOI family that are particularly area efficient in CMOS, but are difficult to work with when we design by hand).
Figure 3.13 (a) tells us which cells we use most often, but does not take into account the cell area. What we really want to know is which cells are most important in determining the area of an ASIC. Figure 3.13 (b) shows the area of the cells, normalized to the area of a minimum-size inverter. If we take the data in Figure 3.13 (a) and multiply by the cell areas, we can derive a new measure of the contribution of each cell in a library (Figure 3.13c). This new measure, cell importance, is a measure of how much area each cell in a library contributes to a typical ASIC. For example, we can see from Figure 3.13 (c) that a D flip-flop (with a cell importance of 3.5) contributes 3.5 times as much area on a typical ASIC as does an inverter (with a cell importance of 1).
Figure 3.13 (c) shows cell importance ordered by the cell frequency of use and normalized to an inverter. We can rearrange this data in terms of cell importance, as shown in Figure 3.13 (d), and normalized so that now the most important cell, a D flip-flop, has a cell importance of 1. Figure 3.13 (e) includes the cell use data on the same scale as the cell importance data. Both show roughly the same shape, reflecting that both measures obey an 80-20 rule. Roughly 20 percent of the cells in a library correspond to 80 percent of the ASIC area and 80 percent of the cells we use (but not the same 20 percent; that is why cell importance is useful). Figure 3.13 (e) shows us that the most important cells, measured by their contribution to the area of an ASIC, are not necessarily the cells that we use most often. If we wish to build or buy a dense library, we must concentrate on the area of those cells that have the highest cell importance, not the most common cells.
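The cell-importance calculation (use multiplied by area, then normalized) can be sketched in Python as follows. The use fractions and relative areas below are invented purely for illustration and are not the data behind Figure 3.13; they were chosen only so that the D flip-flop comes out at 3.5 relative to an inverter, as in the example in the text.

```python
def cell_importance(use_fraction, relative_area, reference=None):
    """Cell importance = frequency of use x cell area, then normalized.

    If reference names a cell, values are normalized to that cell
    (as in Figure 3.13c); otherwise to the largest value (Figure 3.13d).
    """
    raw = {cell: use_fraction[cell] * relative_area[cell]
           for cell in use_fraction}
    scale = raw[reference] if reference is not None else max(raw.values())
    return {cell: value / scale for cell, value in raw.items()}


# Hypothetical data: fraction of cell instances, and area relative to
# a minimum-size inverter. These numbers are assumptions, not measurements.
USE = {"inverter": 0.20, "nand2": 0.25, "dff": 0.10}
AREA = {"inverter": 1.0, "nand2": 1.5, "dff": 7.0}

if __name__ == "__main__":
    print(cell_importance(USE, AREA, reference="inverter"))
    print(cell_importance(USE, AREA))
    print("most used:", max(USE, key=USE.get))
```

With these made-up numbers the most frequently used cell is the two-input NAND, while the cell with the highest importance is the D flip-flop, which is the distinction the text draws between the two measures.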
3.6 Gate-Array Design

Each logic cell or macro in a gate-array library is predesigned using fixed tiles of transistors known as the gate-array base cell (or just base cell). We call the arrangement of base cells across a whole chip in a complete gate array the gate-array base (or just base). ASIC vendors offer a selection of bases, with a different total number of transistors on each base. For example, if our ASIC design uses 48k equivalent gates and the ASIC vendor offers gate-array bases with 50k, 75k, and 100k gates, we will probably have to use the 75k-gate base (because it is unlikely that we can use 48/50 or 96 percent of the transistors on the 50k-gate base).
We isolate the transistors on a gate array from one another either with thick field oxide (in the case of oxide-isolated gate arrays) or by using other transistors that are wired permanently off (in gate-isolated gate arrays). Channeled and channelless gate arrays may use either gate isolation or oxide isolation.
Figure 3.14 (a) shows a base cell for a gate-isolated gate array. This base cell has two transistors: one p-channel and one n-channel. When these base cells are placed next to each other, the n-diffusion and p-diffusion layers form continuous strips that run across the entire chip broken only at the poly gates that cross at regularly spaced intervals (Figure 3.14b). The metal interconnect spacing determines the separation of the transistors. The metal spacing is determined by the design rules for the metal and contacts. In Figure 3.14 (c) we have shown all possible locations for a contact in the base cell. There is room for 21 contacts in this cell and thus room for 21 interconnect lines running in a horizontal direction (we use m1 running horizontally). We say that there are 21 horizontal tracks in this cell or that the cell is 21 tracks high. In a similar fashion the space that we need for a vertical interconnect (m2) is called a vertical track.
The horizontal and vertical track widths are not necessarily equal, because the design rules for m1 and m2 are not always equal. We isolate logic cells from each other in gate-isolated gate arrays by connecting transistor gates to the supply bus, hence the name, gate isolation. If we connect the gate of an n-channel transistor to V_SS, we isolate the regions of n-diffusion on each side of that transistor (we call this an isolator transistor or device, or just isolator). Similarly if we connect the gate of a p-channel transistor to V_DD, we isolate adjacent p-diffusion regions.
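The base-selection example earlier in this section (a 48k-gate design against 50k-, 75k-, and 100k-gate bases) amounts to picking the smallest base whose utilization stays below some achievable limit. The sketch below assumes a 90 percent utilization limit, which is a made-up figure for illustration; the text says only that 96 percent is unlikely to be achievable. The function name is hypothetical.

```python
def pick_base(design_gates, base_sizes, max_utilization=0.9):
    """Return the smallest base that fits the design, or None.

    max_utilization is an assumed limit on the fraction of a base
    that can actually be used; it is not a figure from the text.
    """
    for size in sorted(base_sizes):
        if design_gates / size <= max_utilization:
            return size
    return None


if __name__ == "__main__":
    bases = [50_000, 75_000, 100_000]
    chosen = pick_base(48_000, bases)
    print(chosen, "utilization:", 48_000 / chosen)
```

Under this assumption the 50k-gate base is rejected at 96 percent utilization and the 75k-gate base is chosen at 64 percent.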
FIGURE 3.14 The construction of a gate-isolated gate array. (a) The one-track-wide base cell containing one p -channel and one n -channel transistor. (b) Three base cells: the center base cell is being used to isolate the base cells on either side from each other. (c) A base cell including all possible contact positions (there is room for 21 contacts in the vertical direction, showing the base cell has a height of 21 tracks).
Oxide-isolated gate arrays often contain four transistors in the base cell: the two n-channel transistors share an n-diffusion strip and the two p-channel transistors share a p-diffusion strip. This means that the two n-channel transistors in each base cell are electrically connected in series, as are the p-channel transistors. The base cells are isolated from each other using oxide isolation. During the fabrication process a layer of the thick field oxide is left in place between each base cell and this separates the p-diffusion and n-diffusion regions of adjacent base cells. Figure 3.15 shows an oxide-isolated gate array. This cell contains eight transistors (which occupy six vertical tracks) plus one-half of a single track that contains the well contacts and substrate connections that we can consider to be shared by each base cell.
FIGURE 3.15 An oxide-isolated gate-array base cell. The figure shows two base cells, each containing eight transistors and two well contacts. The p -channel and n -channel transistors are each 4 tracks high (corresponding to the width of the transistor). The leftmost vertical track of the left base cell includes all 12 possible contact positions (the height of the cell is 12 tracks). As outlined here, the base cell is 7 tracks wide (we could also consider the base cell to be half this width).
Figure 3.16 shows a base cell in which the gates of the n -channel and p -channel transistors are connected on the polysilicon layer. Connecting the gates in poly saves contacts and a metal interconnect in the center of the cell where interconnect is most congested. The drawback of the preconnected gates is a loss in flexibility in cell design. Implementing memory and logic based on transmission gates will be less efficient using this type of base cell, for example.
FIGURE 3.16 This oxide-isolated gate-array base cell is 14 tracks high and 4 tracks wide. VDD (tracks 3 and 4) and GND (tracks 11 and 12) are each 2 tracks wide. The metal lines to the left of the cell indicate the 10 horizontal routing tracks (tracks 1, 2, 5-10, 13, 14). Notice that the p-channel and n-channel polysilicon gates are tied together in the center of the cell. The well contacts are short, leaving room for a poly crossunder in each base cell.
Figure 3.17 shows the metal personalization for a D flip-flop macro in a gate-isolated gate array using a base cell similar to that shown in Figure 3.14 (a). This macro uses 20 base cells, for a total of 40 transistors, equivalent to 10 gates.
FIGURE 3.17 An example of a flip-flop macro in a gate-isolated gate-array library. Only the first-level metallization and contact pattern (the personalization) is shown on the right, but this is enough information to derive the schematic. The base cell is shown on the left. This macro is 20 tracks wide.
The gates of the base cells shown in Figures 3.14 to 3.16 are bent. The bent gate allows contacts to the gates to be placed on the same grid as the contacts to diffusion. The polysilicon gates run in the space between adjacent metal interconnect lines. This saves space and also simplifies the routing software.
There are many trade-offs that determine the gate-array base cell height. One factor is the number of wires that can be run horizontally through the base cell. This will determine the capacity of the routing channel formed from an unused row of base cells. The base cell height also determines how easy it is to wire the logic macros since it determines how much space for wiring is available inside the macros. There are other factors that determine the width of the base-cell transistors. The widths of the p-channel and n-channel transistors are slightly different in Figure 3.14 (a). The p-channel transistors are 6 tracks wide and the n-channel transistors are 5 tracks wide. The ratio for this gate-array library is thus approximately 1.2. Most gate-array libraries are approaching a ratio of 1.
ASIC designers are using ever-increasing amounts of RAM on gate arrays. It is inefficient to use the normal base cell for a static RAM cell and the size of RAM on an embedded gate array is fixed. As an alternative we can change the design of the base cell. A base cell designed for use as RAM has extra transistors (either four, two n-channel and two p-channel, or two n-channel; usually minimum width) allowing a six-transistor RAM cell to be built using one base cell instead of the two or three that we would normally need. This is one of the advantages of the CBA (cell-based array) base cell shown in Figure 3.18.
FIGURE 3.18 The SiARC/Synopsys cell-based array (CBA) basic cell.
3.7 Standard-Cell Design

Figure 3.19 shows the components of the standard cell from Figure 1.3. Each standard cell in a library is rectangular with the same height but different widths. The bounding box (BB) of a logic cell is the smallest rectangle that encloses all of the geometry of the cell. The cell BB is normally determined by the well layers. Cell connectors or terminals (the logical connectors) must be placed on the cell abutment box (AB). The physical connector (the piece of metal to which we connect wires) must normally overlap the abutment box slightly, usually by at least 1 λ, to assure connection without leaving a tiny space between the ends of two wires. The standard cells are constructed so they can all be placed next to each other horizontally with the cell ABs touching (we abut two cells).
FIGURE 3.19 (a) The standard cell shown in Figure 1.3. (b) Diffusion, poly, and contact layers. (c) m1 and contact layers. (d) The equivalent schematic.
A standard cell (a D flip-flop with clear) is shown in Figure 3.20 and illustrates the following features of standard-cell layout:
• Layout using 45° angles. This can save 10%-20% in area compared to a cell that uses only Manhattan or 90° geometry. Some ASIC vendors do not allow transistors with 45° angles; others do not allow 45° angles at all.
• Connectors are at the top and bottom of the cell on m2 on a routing grid equal to the vertical (m2) track spacing. This is a double-entry cell intended for a two-level metal process. A standard cell designed for a three-level metal process has connectors in the center of the cell.
• Transistor sizes vary to optimize the area and performance but maintain a fixed ratio to balance rise times and fall times.
• The cell height is 64 λ (all cells in the library are the same height) with a horizontal (m1) track spacing of 8 λ. This is close to the minimum height that can accommodate the most complex cells in a library.
• The power rails are placed at the top and bottom, maintaining a certain width inside the cell, and abut with the power rails in adjacent cells.
• The well contacts (substrate connections) are placed inside the cell at regular intervals. Additional well contacts may be placed in spacers between cells.
• In this case both wells are drawn. Some libraries minimize the well or moat area to reduce leakage and parasitic capacitance.
• Most commercial standard cells use m1 for the power rails, m1 for internal connections, and avoid using m2 where possible except for cell connectors.
FIGURE 3.20 A D flip-flop standard cell. The wide power buses and transistors show this is a performance-optimized cell. This double-entry cell is intended for a two-level metal process and channel routing. The five connectors run vertically through the cell on m2 (the extra short vertical metal line is an internal crossover).
When a library developer creates a gate-array, standard-cell, or datapath library, there is a trade-off between using wide, high-drive transistors that result in large cells with high-speed performance and using smaller transistors that result in smaller cells that consume less power. A performance-optimized library with large cells might be used for ASICs in a high-performance workstation, for example. An area-optimized library might be used in an ASIC for a battery-powered portable computer.
3.8 Datapath-Cell Design

Figure 3.21 shows a datapath flip-flop. The primary, thicker, power buses run vertically on m2 with thinner, internal power running horizontally on m1. The control signals (clock in this case) run vertically through the cell on m2. The control signals that are common to the cells above and below are connected directly in m2. The other signals (data, q, and qbar in this example) are brought out to the wiring channel between the rows of datapath cells.
FIGURE 3.21 A datapath D flip-flop cell.
Figure 3.22 is the schematic for Figure 3.21. This flip-flop uses a pair of cross-coupled inverters for storage in both the master and slave latches. This leads to a smaller and potentially faster layout than the flip-flop circuits that we use in gate-array and standard-cell ASIC libraries. The device sizes of the inverters in the datapath flip-flops are adjusted so that the state of the latches may be changed. Normally using this type of circuit is dangerous in an uncontrolled environment. However, because the datapath structure is regular and known, the parasitic capacitances that affect the operation of the logic cell are also known. This is another advantage of the datapath structure.
FIGURE 3.22 The schematic of the datapath D flip-flop cell shown in Figure 3.21.
Figure 3.23 shows an example of a datapath. Figure 3.23 (a) depicts a two-level metal version showing the space between rows or slices of the datapath. In this case there are many connections to be brought out to the right of the datapath, and this causes the routing channel to be larger than normal and thus easily seen. Figure 3.23 (b) shows a three-level metal version of the same datapath. In this case more of the routing is completed over the top of the datapath slices, reducing the size of the routing channel.
FIGURE 3.23 A datapath. (a) Implemented in a two-level metal process. (b) Implemented in a three-level metal process.
3.9 Summary
In this chapter we covered ASIC libraries: cell design, layout, and characterization. The most important concepts that we covered in this chapter were
• Tau, logical effort, and the prediction of delay
• Sizes of cells, and their drive strengths
• Cell importance
• The difference between gate-array macros, standard cells, and datapath cells
PROGRAMMABLE ASICs
There are two types of programmable ASICs: programmable logic devices (PLDs) and field-programmable gate arrays (FPGAs). The distinction between the two is blurred. The only real difference is their heritage. PLDs started as small devices that could replace a handful of TTL parts, and they have grown to look very much like their younger relations, the FPGAs. We shall group both types of programmable ASICs together as FPGAs.
An FPGA is a chip that you, as a systems designer, can program yourself. An IC foundry produces FPGAs with some connections missing. You perform design entry and simulation. Next, special software creates a string of bits describing the extra connections required to make your design: the configuration file. You then connect a computer to the chip and program the chip to make the necessary connections according to the configuration file. There is no customization of any mask level for an FPGA, allowing the FPGA to be manufactured as a standard part in high volume.
FPGAs are popular with microsystems designers because they fill a gap between TTL and PLD design and modern, complex, and often expensive ASICs. FPGAs are ideal for prototyping systems or for low-volume production. FPGA vendors do not need an IC fabrication facility to produce the chips; instead they contract IC foundries to produce their parts. Being fabless relieves the FPGA vendors of the huge burden of building and running a fabrication plant (a new submicron fab costs hundreds of millions of dollars). Instead FPGA companies put their effort into the FPGA architecture and the software, where it is much easier to make a profit than building chips. They often sell the chips through distributors, but sell design software and any necessary programming hardware directly.
All FPGAs have certain key elements in common. All FPGAs have a regular array of basic logic cells that are configured using a programming technology . The chip inputs and outputs use special I/O logic cells that are different from the basic logic cells. A programmable interconnect scheme forms the wiring between the two types of logic cells. Finally, the designer uses custom software, tailored to each programming technology and FPGA architecture, to design and implement the programmable connections. The programming technology in an FPGA determines the type of basic logic cell and the interconnect scheme. The logic cells and interconnection scheme, in turn, determine the design of the input and output circuits as well as the programming scheme. The programming technology may or may not be permanent. You cannot undo the permanent programming in one-time programmable ( OTP ) FPGAs. Reprogrammable or erasable devices may be reused many times. We shall discuss the different programming technologies in the following sections.
4.1 The Antifuse
4.2 Static RAM
4.3 EPROM and EEPROM Technology
4.4 Practical Issues
4.5 Specifications
4.6 PREP Benchmarks
4.7 FPGA Economics
4.8 Summary
4.9 Problems
4.10 Bibliography
4.11 References
4.1 The Antifuse

An antifuse is the opposite of a regular fuse: an antifuse is normally an open circuit until you force a programming current through it (about 5 mA). In a poly-diffusion antifuse the high current density causes a large power dissipation in a small area, which melts a thin insulating dielectric between polysilicon and diffusion electrodes and forms a thin (about 20 nm in diameter), permanent, and resistive silicon link. The programming process also drives dopant atoms from the poly and diffusion electrodes into the link, and the final level of doping determines the resistance value of the link. Actel calls its antifuse a programmable low-impedance circuit element (PLICE).
Figure 4.1 shows a poly-diffusion antifuse with an oxide-nitride-oxide (ONO) dielectric sandwich of: silicon dioxide (SiO2) grown over the n-type antifuse diffusion, a silicon nitride (Si3N4) layer, and another thin SiO2 layer. The layered ONO dielectric results in a tighter spread of blown antifuse resistance values than using a single-oxide dielectric. The effective electrical thickness is equivalent to 10 nm of SiO2 (Si3N4 has a higher dielectric constant than SiO2, so the actual thickness is less than 10 nm). Sometimes this device is called a fuse even though it is an antifuse, and both terms are often used interchangeably.
FIGURE 4.1 Actel antifuse. (a) A cross section. (b) A simplified drawing. The ONO (oxide-nitride-oxide) dielectric is less than 10 nm thick, so this diagram is not to scale. (c) From above, an antifuse is approximately the same size as a contact.
The fabrication process and the programming current control the average resistance of a blown antifuse, but values vary as shown in Figure 4.2. In a particular technology a programming current of 5 mA may result in an average blown antifuse resistance of about 500 Ω. Increasing the programming current to 15 mA might reduce the average antifuse resistance to 100 Ω. Antifuses separate interconnect wires on the FPGA chip and the programmer blows an antifuse to make a permanent connection. Once an antifuse is programmed, the process cannot be reversed. This is an OTP technology (and radiation hard). An Actel 1010, for example, contains 112,000 antifuses (see Table 4.1), but we typically only need to program about 2 percent of the fuses on an Actel chip.
TABLE 4.1 Number of antifuses on Actel FPGAs.
Device   Antifuses
A1010    112,000
A1020    186,000
A1225    250,000
A1240    400,000
A1280    750,000
FIGURE 4.2 The resistance of blown Actel antifuses. The average antifuse resistance depends on the programming current. The resistance values shown here are typical for a programming current of 5 mA.
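As a rough illustration of the numbers in Table 4.1, the following Python sketch estimates how many antifuses are actually blown on each device if about 2 percent are programmed, as stated in the text. The 2 percent figure is only typical, so the results are order-of-magnitude estimates; the names used are hypothetical.

```python
# Antifuse counts from Table 4.1.
ANTIFUSES = {
    "A1010": 112_000,
    "A1020": 186_000,
    "A1225": 250_000,
    "A1240": 400_000,
    "A1280": 750_000,
}


def programmed_antifuses(device, fraction=0.02):
    """Estimate the number of antifuses blown on a device.

    fraction is the typical share of antifuses that are programmed
    (about 2 percent according to the text).
    """
    return round(ANTIFUSES[device] * fraction)


if __name__ == "__main__":
    for name in ANTIFUSES:
        print(name, programmed_antifuses(name))
```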
To design and program an Actel FPGA, designers iterate between design entry and simulation. When they are satisfied the design is correct they plug the chip into a socket on a special programming box, called an Activator, that generates the programming voltage. A PC downloads the configuration file to the Activator instructing it to blow the necessary antifuses on the chip. When the chip is programmed it may be removed from the Activator without harming the configuration data and the chip assembled into a system. One disadvantage of this procedure is that modern packages with hundreds of thin metal leads are susceptible to damage when they are inserted and removed from sockets. The advantage of other programming technologies is that chips may be programmed after they have been assembled on a printed-circuit board, a feature known as in-system programming (ISP).
The Actel antifuse technology uses a modified CMOS process. A double-metal, single-poly CMOS process typically uses about 12 masks; the Actel process requires an additional three masks. The n-type antifuse diffusion and antifuse polysilicon require an extra two masks and a 40 nm (thicker than normal) gate oxide (for the high-voltage transistors that handle 18 V to program the antifuses) uses one more masking step. Actel and Data General performed the initial experiments to develop the PLICE technology and Actel has licensed the technology to Texas Instruments (TI).
The programming time for an ACT 1 device is 5 to 10 minutes. Improvements in programming make the programming time for the ACT 2 and ACT 3 devices about the same as the ACT 1. A 5-day work week, with 8-hour days, contains about 2400 minutes. This is enough time to program 240 to 480 Actel parts per week with 100 percent efficiency and no hardware down time. A production schedule of more than
1000 parts per month requires multiple or gang programmers.
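The throughput estimate above can be written out as a small Python sketch. It assumes a single-socket programmer, the 5-day, 8-hour week from the text, and (as an extra assumption made here) four working weeks per month; the function names are made up for illustration.

```python
import math

MINUTES_PER_WEEK = 5 * 8 * 60   # 5 days x 8 hours x 60 minutes = 2400


def parts_per_week(minutes_per_part, efficiency=1.0):
    """Parts one programmer can handle per week at a given efficiency."""
    return int(MINUTES_PER_WEEK * efficiency // minutes_per_part)


def programmers_needed(parts_per_month, minutes_per_part, weeks_per_month=4):
    """Number of single-socket programmers needed for a monthly schedule.

    weeks_per_month = 4 is an assumption, not a figure from the text.
    """
    monthly_capacity = parts_per_week(minutes_per_part) * weeks_per_month
    return math.ceil(parts_per_month / monthly_capacity)


if __name__ == "__main__":
    print(parts_per_week(10), parts_per_week(5))
    print(programmers_needed(1000, minutes_per_part=10))
```

At 10 minutes per part one programmer handles 960 parts in a four-week month, which is why a schedule above roughly 1000 parts per month calls for more than one programmer.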
4.1.1 Metal-Metal Antifuse

Figure 4.3 shows a QuickLogic metal-metal antifuse (ViaLink). The link is an alloy of tungsten, titanium, and silicon with a bulk resistance of about 500 μΩ·cm.
FIGURE 4.3 Metal-metal antifuse. (a) An idealized (but to scale) cross section of a QuickLogic metal-metal antifuse in a two-level metal process. (b) A metal-metal antifuse in a three-level metal process that uses contact plugs. The conductive link usually forms at the corner of the via where the electric field is highest during programming.
There are two advantages of a metal-metal antifuse over a poly-diffusion antifuse. The first is that connections to a metal-metal antifuse are direct to metal, that is, to the wiring layers. Connections from a poly-diffusion antifuse to the wiring layers require extra space and create additional parasitic capacitance. The second advantage is that the direct connection to the low-resistance metal layers makes it easier to use larger programming currents to reduce the antifuse resistance. For example, the antifuse resistance R ≈ 0.8/I, with the programming current I in mA and R in kΩ, for the QuickLogic antifuse. Figure 4.4 shows that the average QuickLogic metal-metal antifuse resistance is approximately 80 Ω (with a standard deviation of about 10 Ω) using a programming current of 15 mA as opposed to an average antifuse resistance of 500 Ω (with a programming current of 5 mA) for a poly-diffusion antifuse.
FIGURE 4.4 Resistance values for the QuickLogic metal-metal antifuse. A higher programming current (about 15 mA), made possible partly by the direct connections to metal, has reduced the antifuse resistance from the poly-diffusion antifuse resistance values shown in Figure 4.2.

The size of an antifuse is limited by the resolution of the lithography equipment used to make ICs. The Actel antifuse connects diffusion and polysilicon, and both these materials are too resistive for use as signal interconnects. To connect the antifuse to the metal layers requires contacts that take up more space than the antifuse itself, reducing the advantage of the small antifuse size. However, the antifuse is so small that it is normally the contact and metal spacing design rules that limit how closely the antifuses may be packed rather than the size of the antifuse itself.

An antifuse is resistive and the addition of contacts adds parasitic capacitance. The intrinsic parasitic capacitance of an antifuse is small (approximately 1 to 2 fF in a 1 µm CMOS process), but to this we must add the extrinsic parasitic capacitance that includes the capacitance of the diffusion and poly electrodes (in a poly-diffusion antifuse) and connecting metal wires (approximately 10 fF). These unwanted parasitic elements can add considerable RC interconnect delay if the number of antifuses connected in series is not kept to an absolute minimum. Clever routing techniques are therefore crucial to antifuse-based FPGAs.

The long-term reliability of antifuses is an important issue since there is a tendency for the antifuse properties to change over time. There have been some problems in this area, but as a result we now know an enormous amount about this failure mechanism. There are many failure mechanisms in ICs (electromigration is a classic example) and engineers have learned to deal with these problems. Engineers design the circuits to keep the failure rate below acceptable limits and systems designers accept the statistics. All the FPGA vendors that use antifuse technology have extensive information on long-term reliability in their data books.
4.2 Static RAM

An example of static RAM (SRAM) programming technology is shown in Figure 4.5. This Xilinx SRAM configuration cell is constructed from two cross-coupled inverters and uses a standard CMOS process. The configuration cell drives the gates of other transistors on the chip, either turning pass transistors or transmission gates on to make a connection or off to break a connection.

FIGURE 4.5 The Xilinx SRAM (static RAM) configuration cell. The outputs of the cross-coupled inverter (configuration control) are connected to the gates of pass transistors or transmission gates. The cell is programmed using the WRITE and DATA lines.

The advantages of SRAM programming technology are that designers can reuse chips during prototyping and a system can be manufactured using ISP. This programming technology is also useful for upgrades: a customer can be sent a new configuration file to reprogram a chip, not a new chip. Designers can also update or change a system on the fly in reconfigurable hardware.

The disadvantage of using SRAM programming technology is that you need to keep power supplied to the programmable ASIC (at a low level) for the volatile SRAM to retain the connection information. Alternatively you can load the configuration data from a permanently programmed memory (typically a programmable read-only memory or PROM) every time you turn the system on. The total size of an SRAM configuration cell plus the transistor switch that the SRAM cell drives is also larger than the programming devices used in the antifuse technologies.
4.3 EPROM and EEPROM Technology

Altera MAX 5000 EPLDs and Xilinx EPLDs both use UV-erasable electrically programmable read-only memory (EPROM) cells as their programming technology. Altera's EPROM cell is shown in Figure 4.6. The EPROM cell is almost as small as an antifuse. An EPROM transistor looks like a normal MOS transistor except it has a second, floating, gate (gate1 in Figure 4.6). Applying a programming voltage VPP (usually greater than 12 V) to the drain of the n-channel EPROM transistor programs the EPROM cell. A high electric field causes electrons flowing toward the drain to move so fast they jump across the insulating gate oxide where they are trapped on the bottom, floating, gate. We say these energetic electrons are hot and the effect is known as hot-electron injection or avalanche injection. EPROM technology is sometimes called floating-gate avalanche MOS (FAMOS).

FIGURE 4.6 An EPROM transistor. (a) With a high (> 12 V) programming voltage, VPP, applied to the drain, electrons gain enough energy to jump onto the floating gate (gate1). (b) Electrons stuck on gate1 raise the threshold voltage so that the transistor is always off for normal operating voltages. (c) Ultraviolet light provides enough energy for the electrons stuck on gate1 to jump back to the bulk, allowing the transistor to operate normally.

Electrons trapped on the floating gate raise the threshold voltage of the n-channel EPROM transistor (Figure 4.6b). Once programmed, an n-channel EPROM device remains off even with VDD applied to the top gate. An unprogrammed n-channel device will turn on as normal with a top-gate voltage of VDD. The programming voltage is applied either from a special programming box or by using on-chip charge pumps. Exposure to an ultraviolet (UV) lamp will erase the EPROM cell (Figure 4.6c). An absorbed light quantum gives an electron enough energy to jump from the floating gate. To erase a part we place it under a UV lamp (Xilinx specifies one hour within 1 inch of a 12,000 µW/cm² source for its EPLDs). The manufacturer provides a software program that checks to see if a part is erased. You can buy an EPLD part in a windowed package for development, erase it, and use it again, or buy it in a nonwindowed package and program (or burn) the part once only for production. The packages get hot while they are being erased, so the windowed option is available only with ceramic packages, which are more expensive than plastic packages.

Programming an EEPROM transistor is similar to programming a UV-erasable EPROM transistor, but the erase mechanism is different. In an EEPROM transistor an electric field is also used to remove electrons from the floating gate of a programmed transistor. This is faster than using a UV lamp and the chip does not have to be removed from the system. If the part contains circuits to generate both program and erase voltages, it may use ISP.
4.4 Practical Issues

System companies often select an ASIC technology first, which narrows the choice of software design tools. The software then influences the choice of computer. Most computer-aided engineering (CAE) software for FPGA design uses some type of security. For workstations this usually means floating licenses (any of n users on a network can use the tools) or node-locked licenses (only n particular computers can use the tools) using the hostid (or host I.D., a serial number unique to each computer) in the boot EPROM (a chip containing start-up instructions). For PCs this is a hardware key, similar to the Viewlogic key illustrated in Figure 4.7. Some keys use the serial port (requiring extra cables and adapters); most now use the parallel port. There are often conflicts between keys and other hardware/software. For example, for a while some security keys did not work with the serial-port driver on Intel motherboards, and users had to buy another serial-port I/O card.

FIGURE 4.7 CAE companies use hardware security keys that fit at the back of a PC (this one is shown at about one-half the real size). Each piece of software requires a separate key, so that a typical design system may have a half dozen or more keys daisy-chained on one socket. This presents both mechanical and software conflict problems. Software will not run without a key, so it is easily possible to have $60,000 worth of keys attached to a single PC.

Most FPGA vendors offer software on multiple platforms. The performance difference between workstations and PCs is becoming blurred, but the time taken for the place-and-route step for Actel and Xilinx designs seems to remain constant (typically taking tens of minutes to over an hour for a large design), bounded by designers' tolerances.

A great deal of time during FPGA design is spent in schematic entry, editing files, and documentation. This often requires moving between programs and this is difficult on IBM-compatible PC platforms. Currently most large CAD and CAE programs completely take over the PC; for example, you cannot always run third-party design entry and the FPGA vendor design systems simultaneously.

There are many other factors to be considered in choosing hardware:

• Software packages are normally less expensive on a PC.
• Peripherals are less expensive and easier to configure on a PC.
• Maintenance contracts are usually necessary and expensive for workstations.
• There is a much larger network of users to provide support for PC users.
• It is easier to upgrade a PC than a workstation.
4.4.1 FPGAs in Use

I once placed an order for a small number of FPGAs for prototyping and received a sales receipt with a scheduled shipping date three months away. Apparently, two customers had recently disrupted the vendor's product planning by placing large orders. Companies buying parts from suppliers often keep an inventory to cover emergencies such as a defective lot or manufacturing problems. For example, assume that a company keeps two months of inventory to ensure that it has parts in case of unforeseen problems. This risk inventory or safety supply, at a sales volume of 2000 parts per month, is 4000 parts, which, at an ASIC price of $5 per part, costs the company $20,000. FPGAs are normally sold through distributors, and, instead of keeping a risk inventory, a company can order parts as it needs them using a just-in-time (JIT) inventory system. This means that the distributors rather than the customer carry inventory (though the distributors wish to minimize inventory as well). The downside is that other customers may change their demands, causing unpredictable supply difficulties.
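The risk-inventory figure above can be reproduced with a few lines. This is only a sketch of the arithmetic in the example; the function and parameter names are made up for illustration.

```python
# Cost of holding a risk inventory (safety supply), using the example figures
# from the text: two months of stock, 2000 parts per month, $5 per part.

def risk_inventory(months_held, parts_per_month, price_per_part):
    """Return (number of parts held, cost of those parts in dollars)."""
    parts = months_held * parts_per_month
    return parts, parts * price_per_part

parts, cost = risk_inventory(months_held=2, parts_per_month=2000, price_per_part=5.00)
print(parts, cost)
```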
There are no standards for FPGAs equivalent to those in the TTL and PLD worlds; there are no standard pin assignments for VDD or GND, and each FPGA vendor uses different power and signal I/O pin arrangements. Most FPGA packages are intended for surface-mount printed-circuit boards (PCBs). However, surface mounting requires more expensive PCB test equipment and vapor soldering rather than bed-of-nails testers and surface-wave soldering. An alternative is to use socketed parts. Several FPGA vendors publish socket-reliability tests in their data books.

Using sockets raises its own set of problems. First, it is difficult to find wire-wrap sockets for surface-mount parts. Second, sockets may change the pin configuration. For example, when you use an FPGA in a PLCC package and plug it into a socket that has a PGA footprint, the resulting arrangement of pins is different from the same FPGA in a PGA package. This means you cannot use the same board layout for a prototype PCB (which uses the socketed PLCC part) as for the production PCB (which uses the PGA part). The same problem occurs when you use through-hole mounted parts for prototyping and surface-mount parts for production. To deal with this you can add a small piece to your prototype board that you use as a converter. This can be sawn off on the production boards, saving a board iteration.

Pin assignment can also cause a problem if you plan to convert an FPGA design to an MGA or CBIC. In most cases it is desirable to keep the same pin assignment as the FPGA (this is known as pin locking or I/O locking), so that the same PCB can be used in production for both types of devices. There are often restrictions for custom gate arrays on the number and location of power pads and package pins. Systems designers must consider these problems before designing the FPGA and PCB.
4.5 Specifications
All FPGA manufacturers are continually improving their products to increase performance and reduce price. Often this means changing the design of an FPGA or moving a part from one process generation to the next without changing the part number (and often without changing the specifications). FPGA companies usually explain their part history in their data books [1]. The following history of Actel FPGA ACT 1 part numbers illustrates changes typical throughout the IC industry as products develop and mature:

• The Actel ACT 1 A1010/A1020 used a 2 µm process.
• The Actel A1010A/A1020A used a 1.2 µm process.
• The Actel A1020B was a die revision (including a shrink to a 1.0 µm process). At this time the A1020, A1020A, and A1020B all had different speeds.
• Actel graded parts into three speed bins as they phased in new processes, dropping the distinction between the different die suffixes.
• At the same time as the transition to die rev. 'B', Actel began specifying timing at worst-case commercial conditions rather than at typical conditions.
From this history we can see that it is often possible to have parts from the same family that use different circuit designs, processes, and die sizes, are manufactured in different locations, and operate at very different speeds. FPGA companies ensure that their products always meet the current published worst-case specifications, but there is no guarantee that the average performance follows the typical specifications, and there are usually no best-case specifications.
There are also situations in which two parts with identical part numbers can have different performance: when different ASIC foundries produce the same parts. Since FPGA companies are fabless, second sourcing is very common. For example, TI began making the TPC1010A/1020A to be equivalent to the original Actel ACT 1 parts produced elsewhere. The TI timing information for the TPC1010A/1020A was the same as the 2 µm Actel specifications, but TI used a faster 1.2 µm process. This meant that equivalent parts with the same part numbers were much faster than a designer expected. Often this type of information can only be obtained by large customers in the form of a qualification kit from FPGA vendors. A similar situation arises when the FPGA manufacturer adjusts its product mix by selling fast parts under a slower part number in a procedure known as down-binning. This is not a problem for synchronous designs that always work when parts are faster than expected, but is another reason to avoid asynchronous designs that may not always work when parts are much faster than expected.
1. See, for example, p.1-8 of the Xilinx 1994 data book.
4.6 PREP Benchmarks

Which type of FPGA is best? This is an impossible question to answer. The Programmable Electronics Performance Company (PREP) is a nonprofit organization that organized a series of benchmarks for programmable ASICs. The nine PREP benchmark circuits in the version 1.3 suite are:

1. An 8-bit datapath consisting of 4:1 MUX, register, and shift-register
2. An 8-bit timer-counter consisting of two registers, a 4:1 MUX, a counter and a comparator
3. A small state machine (8 states, 8 inputs, and 8 outputs)
4. A larger state machine (16 states, 8 inputs, and 8 outputs)
5. An ALU consisting of a 4 × 4 multiplier, an 8-bit adder, and an 8-bit register
6. A 16-bit accumulator
7. A 16-bit counter with synchronous load and enable
8. A 16-bit prescaled counter with load and enable
9. A 16-bit address decoder

The data for these benchmarks is archived at http://www.prep.org . PREP's online information includes Verilog and VHDL source code and test benches (provided by Synplicity) as well as additional synthesis benchmarks including a bit-slice processor, multiplier, and R4000 MIPS RISC microprocessor.

One problem with the FPGA benchmark suite is that the examples are small, allowing FPGA vendors to replicate multiple instances of the same circuit on an FPGA. This does not reflect the way an FPGA is used in practice. Another problem is that the FPGA vendors badly misused the results. PREP made the data available in a spreadsheet form and thus inadvertently challenged the marketing department of each FPGA vendor to find a way that company could claim to win the benchmarks (usually by manipulating the data using a complicated weighting scheme). The PREP benchmarks do demonstrate the large variation in performance between different FPGA architectures that results from differences in the type and mix of logic. This shows that designers should be careful in evaluating others' results and should perform their own experiments.
4.7 FPGA Economics

FPGA vendors offer a wide variety of packaging, speed, and qualification (military, industrial, or commercial) options in each family. For example, there are several hundred possible part combinations for the Xilinx LCA series. Figure 4.8 shows the Xilinx part-naming convention, which is similar to that used by other FPGA vendors.
FIGURE 4.8 Xilinx part-naming convention.
Table 4.2 shows the various codes used by manufacturers in their FPGA part numbers. Not all possible part combinations are available, not all packaging combinations are available, and not all I/O options are available in all packages. For example, it is quite common for an FPGA vendor to offer a chip that has more I/O cells than pins on the package. This allows the use of cheaper plastic packages without having to produce separate chip designs for each different package. Thus a customer can buy an Actel A1020 that has 69 I/O cells in an inexpensive 44-pin PLCC package but uses only 34 pins for I/O; the other 10 (= 44 - 34) pins are required for programming and power: three for GND, four for VDD, one for MODE (a pin that controls four other multifunction pins), and one for VPP (the programming voltage). A designer who needs all 69 I/Os can buy the A1020 in a bigger package. Tables in the FPGA manufacturers' data books show the availability, and these matrices change constantly.
TABLE 4.2 Programmable ASIC part codes.

| Item | Code | Description | Code | Description |
|------|------|-------------|------|-------------|
| Manufacturer's code | A | Actel | ATT | AT&T (Lucent) |
| | XC | Xilinx | isp | Lattice Logic |
| | EPM | Altera MAX | M5 | AMD MACH 5 is on the device |
| | EPF | Altera FLEX | QL | QuickLogic |
| | CY7C | Cypress | | |
| Package type | PL or PC | plastic J-leaded chip carrier, PLCC | VQ | very thin quad flatpack, VQFP |
| | PQ | plastic quad flatpack, PQFP | TQ | thin plastic flatpack, TQFP |
| | CQ or CB | ceramic quad flatpack, CQFP | PP | plastic pin-grid array, PPGA |
| | PG | ceramic pin-grid array, PGA | WB, PB | ball-grid array, BGA |
| Application | C | commercial | B | MIL-STD-883 |
| | I | industrial | E | extended |
| | M | military | | |
TABLE 4.3 1992 base Actel FPGA prices.

| Actel part | 1H92 base price |
|------------|-----------------|
| A1010A-PL44C | $23.25 |
| A1020A-PL44C | $43.30 |
| A1225-PQ100C | $105.00 |
| A1240-PQ144C | $175.00 |
| A1280-PQ160C | $305.00 |

TABLE 4.4 1992 base Xilinx XC3000 FPGA prices.

| Xilinx part | 1H92 base price |
|-------------|-----------------|
| XC3020-50PC68C | $26.00 |
| XC3030-50PC44C | $34.20 |
| XC3042-50PC84C | $52.00 |
| XC3064-50PC84C | $87.00 |
| XC3090-50PC84C | $133.30 |
4.7.1 FPGA Pricing

Asking "How much do FPGAs cost?" is rather like asking "How much does a car cost?" Prices of cars are published, but pricing schemes used by semiconductor manufacturers are closely guarded secrets. Many FPGA companies use a pricing strategy based on a cost model that uses a series of multipliers or adders for each part option to calculate the suggested price for their distributors. Although the FPGA companies will not divulge their methods, it is possible to reverse engineer these factors to create a pricing matrix.

TABLE 4.5 Actel price adjustment factors.

| Factor | Option | Adjustment |
|--------|--------|------------|
| Purchase quantity, all types | 1-9 | 100 % |
| | 10-99 | 96 % |
| | 100-999 | 84 % |
| Purchase time, in (100-999) quantity | 1H92 | 100 % |
| | 2H92 | 80-95 % |
| | 93 | 60-80 % |
| Qualification type, same package | Commercial | 100 % |
| | Industrial | 120 % |
| | Military | 150 % |
| | 883-B | 230-300 % |
| Speed bin [1] | ACT 1-Std | 100 % |
| | ACT 1-1 | 115 % |
| | ACT 1-2 | 140 % |
| | ACT 2-Std | 100 % |
| | ACT 2-1 | 120 % |
| Package type, A1010 | PL44, 64, 84 | 100 % |
| | PQ100 | 125 % |
| | PG84 | 400 % |
| Package type, A1020 | PL44, 64, 84 | 100 % |
| | PQ100 | 125 % |
| | JQ44, 68, 84 | 270 % |
| | PG84 | 275 % |
| | CQ84 | 400 % |
| Package type, A1225 | PQ100 | 100 % |
| | PG100 | 175 % |
| Package type, A1240 | PQ144 | 100 % |
| | PG132 | 140 % |
| Package type, A1280 | PQ160 | 100 % |
| | PG176 | 145 % |
| | CQ172 | 160 % |

Many FPGA vendors sell parts through distributors. This can introduce some problems for the designer. For example, in 1992 the Xilinx XC3000 series offered the following part options:

• Five different size parts: XC30{20, 30, 42, 64, 90}
• Three different speed grades or bins: {50, 70, 100}
• Ten different packages: {PC68, PC84, PG84, PQ100, CQ100, PP132, PG132, CQ184, PP175, PG175}
• Four application ranges or qualification types: {C, I, M, B}
where {} means "Choose one." This range of options gave a total of 600 possible XC3000 products, of which 127 were actually available from Xilinx, each with a different part code. If a designer is uncertain as to the exact size, speed, or package required, then they might easily need price information on several dozen different part numbers. Distributors know the price information; it is given to each distributor by the FPGA vendors. Sometimes the distributors are reluctant to give pricing information out, for the same reason car salespeople do not always like to advertise the pricing scheme for cars. However, pricing of the components of a microelectronics system is a vital factor in making decisions such as whether to use FPGAs or some alternative technology. Designers would like to know how FPGAs are priced and how prices may change.
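The figure of 600 possible XC3000 products is just the product of the option counts (5 × 3 × 10 × 4). A minimal sketch that enumerates them is shown below; the `part_code` helper and the way it joins the fields are assumptions modeled on the part numbers in Table 4.4, not an official Xilinx naming routine, and the sketch says nothing about which 127 combinations were really offered.

```python
from itertools import product

# Option sets quoted in the text for the 1992 Xilinx XC3000 series.
sizes = ["XC3020", "XC3030", "XC3042", "XC3064", "XC3090"]
speeds = ["50", "70", "100"]
packages = ["PC68", "PC84", "PG84", "PQ100", "CQ100",
            "PP132", "PG132", "CQ184", "PP175", "PG175"]
qualifications = ["C", "I", "M", "B"]

def part_code(size, speed, package, qualification):
    """Join the fields in the style of Table 4.4 (illustrative assumption)."""
    return f"{size}-{speed}{package}{qualification}"

# "Choose one" from each set: every combination is one candidate part code.
all_parts = [part_code(*choice)
             for choice in product(sizes, speeds, packages, qualifications)]

print(len(all_parts))  # number of theoretically possible part codes
print(all_parts[0])    # first combination in enumeration order
```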
4.7.2 Pricing Examples

Table 4.3 shows the prices of the least-expensive version of the Actel ACT 1 and ACT 2 FPGA families, the base prices, in the first half of 1992 (1H92). Table 4.4 shows the 1H92 base prices for the Xilinx XC3000 FPGA family. Current FPGA prices are much lower. As an example, the least-expensive XC3000 part, the XC3020A-7PC68C, was $13.75 in 1996, nearly half the 1992 price. Using historical prices helps prevent accusations of bias or distortion, but still realistically illustrates the pricing schemes that are used. We shall use these base prices to illustrate how to estimate the sticker price of an FPGA by adding options, as we might for a car. To estimate the price of any part, multiply the base prices by the adjustment factors (shown in Table 4.5 for the Actel parts). The adjustment factors in Table 4.5 were calculated by taking averages across a matrix of prices. Not all combinations of product types are available (for example, there was no military version of an A1280-1 in 1H92). The dependence of price over time is especially variable. An example price calculation for an Actel part is shown in Table 4.6. Many FPGA vendors use similar pricing models.

TABLE 4.6 Example Actel part-price calculation using the base prices of Table 4.3 and the adjustment factors of Table 4.5. Example: A1020A-2-PQ100I in (100-999) quantity, purchased 1H92.

| Factor | Example | Value |
|--------|---------|-------|
| Base price | A1020A | $43.30 |
| Quantity | 100-999 | 84 % |
| Time | 1H92 | 100 % |
| Qualification type | Industrial (I) | 120 % |
| Speed bin [2] | 2 | 140 % |
| Package | PQ100 | 125 % |
| Estimated price (1H92) | | $76.38 |
| Actual Actel price (1H92) | | $75.60 |

Some distributors now include FPGA prices and availability online (for example, Marshall at http://marshall.com for Xilinx parts) so that it is possible to complete an up-to-date analysis at any time. Most distributors carry only one FPGA vendor; not all of the distributors publish prices; and not all FPGA vendors sell through distributors. Currently Hamilton-Avnet, at http://www.hh.avnet.com , carries Xilinx; and Wyle, at http://www.wyle.com , carries Actel and Altera.

1. Actel speed bins are: Std = standard speed grade; 1 = medium speed grade; 2 = fastest speed grade.
2. The speed bin is a manufacturer's code (usually a number) that follows the family part number and indicates the maximum operating speed of the device.
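The estimate in Table 4.6 is a chain of multiplications, which the following sketch reproduces. Only the base price and the factors that appear in Table 4.6 are used; the function name `estimate_price` and the dictionary layout are illustrative choices, not a vendor pricing tool.

```python
# Estimate the 1H92 price of an A1020A-2-PQ100I bought in 100-999 quantity,
# using the base price and adjustment factors listed in Table 4.6.

BASE_PRICE_1H92 = 43.30  # A1020A base price in dollars

FACTORS = {
    "quantity 100-999": 0.84,
    "time 1H92": 1.00,
    "qualification Industrial": 1.20,
    "speed bin 2": 1.40,
    "package PQ100": 1.25,
}

def estimate_price(base_price, factors):
    """Multiply the base price by every adjustment factor in turn."""
    price = base_price
    for factor in factors.values():
        price *= factor
    return round(price, 2)

estimated = estimate_price(BASE_PRICE_1H92, FACTORS)
actual = 75.60  # actual Actel price quoted in Table 4.6
error_percent = 100 * (estimated - actual) / actual
print(estimated, round(error_percent, 1))
```

The estimate lands within about 1 percent of the actual price in this one example; since the factors are averages across a price matrix, other parts may not agree as closely.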
4.8 Summary
In this chapter we have covered FPGA programming technologies including antifuse, SRAM, and EPROM technologies; the programming technology is linked to all the other aspects of a programmable ASIC. Table 4.7 summarizes the programming technologies and the fabrication processes used by programmable ASIC vendors.

TABLE 4.7 Programmable ASIC technologies.

| | Actel | Xilinx LCA [1] | Altera EPLD | Xilinx EPLD |
|---|---|---|---|---|
| Programming technology | Poly-diffusion antifuse, PLICE | Erasable SRAM, ISP | UV-erasable EPROM (MAX 5k), EEPROM (MAX 7/9k) | UV-erasable EPROM |
| Size of programming element | Small but requires contacts to metal | Two inverters plus pass and switch devices. Largest. | One n-channel EPROM device. Medium. | One n-channel EPROM device. Medium. |
| Process | Special: CMOS plus three extra masks. | Standard CMOS | Standard EPROM and EEPROM | Standard EPROM |
| Programming method | Special hardware | PC card, PROM, or serial port | ISP (MAX 9k) or EPROM programmer | EPROM programmer |

| | QuickLogic | Crosspoint | Atmel | Altera FLEX |
|---|---|---|---|---|
| Programming technology | Metal-metal antifuse, ViaLink | Metal-polysilicon antifuse | Erasable SRAM. ISP. | Erasable SRAM. ISP. |
| Size of programming element | Smallest | Small | Two inverters plus pass and switch devices. Largest. | Two inverters plus pass and switch devices. Largest. |
| Process | Special, CMOS plus ViaLink | Special, CMOS plus antifuse | Standard CMOS | Standard CMOS |
| Programming method | Special hardware | Special hardware | PC card, PROM, or serial port | PC card, PROM, or serial port |
All FPGAs have the following key elements:

• The programming technology
• The basic logic cells
• The I/O logic cells
• Programmable interconnect
• Software to design and program the FPGA
1. Lucent (formerly AT&T) FPGAs have almost identical properties to the Xilinx LCA family.
PROGRAMMABLE ASIC LOGIC CELLS

All programmable ASICs or FPGAs contain a basic logic cell replicated in a regular array across the chip (analogous to a base cell in an MGA). There are the following three different types of basic logic cells: (1) multiplexer based, (2) look-up table based, and (3) programmable array logic. The choice among these depends on the programming technology. We shall see examples of each in this chapter.
5.1 Actel ACT
5.2 Xilinx LCA
5.3 Altera FLEX
5.4 Altera MAX
5.5 Summary
5.6 Problems
5.7 Bibliography
5.8 References
5.1 Actel ACT

The basic logic cells in the Actel ACT family of FPGAs are called Logic Modules . The ACT 1 family uses just one type of Logic Module and the ACT 2 and ACT 3 FPGA families both use two different types of Logic Module.
5.1.1 ACT 1 Logic Module

The functional behavior of the Actel ACT 1 Logic Module is shown in Figure 5.1(b). Figure 5.1(c) represents a possible circuit-level implementation. We can build a logic function using an Actel Logic Module by connecting logic signals to some or all of the Logic Module inputs, and by connecting any remaining Logic Module inputs to VDD or GND. As an example, Figure 5.1(d) shows the connections to implement the function F = A · B + B' · C + D. How did we know what connections to make? To understand how the Actel Logic Module works, we take a detour via multiplexer logic and some theory.
FIGURE 5.1 The Actel ACT architecture. (a) Organization of the basic logic cells. (b) The ACT 1 Logic Module. (c) An implementation using pass transistors (without any buffering). (d) An example logic macro. (Source: Actel.)
5.1.2 Shannon's Expansion Theorem

In logic design we often have to deal with functions of many variables. We need a method to break down these large functions into smaller pieces. Using the Shannon expansion theorem, we can expand a Boolean logic function F in terms of (or with respect to) a Boolean variable A,

F = A · F (A = '1') + A' · F (A = '0'), (5.1)

where F (A = '1') represents the function F evaluated with A set equal to '1'. For example, we can expand the following function F with respect to (I shall use the abbreviation wrt) A,

F = A' · B + A · B · C' + A' · B' · C = A · (B · C') + A' · (B + B' · C). (5.2)

We have split F into two smaller functions. We call F (A = '1') = B · C' the cofactor of F wrt A in Eq. 5.2. I shall sometimes write the cofactor of F wrt A as F_A (the cofactor of F wrt A' is F_A'). We may expand a function wrt any of its variables. For example, if we expand F wrt B instead of A,

F = A' · B + A · B · C' + A' · B' · C = B · (A' + A · C') + B' · (A' · C). (5.3)

We can continue to expand a function as many times as it has variables until we reach the canonical form (a unique representation for any Boolean function that uses only minterms; a minterm is a product term that contains all the variables of F, such as A · B' · C). Expanding Eq. 5.3 again, this time wrt C, gives

F = C · (A' · B + A' · B') + C' · (A · B + A' · B). (5.4)

As another example, we will use the Shannon expansion theorem to implement the following function using the ACT 1 Logic Module:

F = (A · B) + (B' · C) + D. (5.5)

First we expand F wrt B:

F = B · (A + D) + B' · (C + D) = B · F2 + B' · F1. (5.6)

Equation 5.6 describes a 2:1 MUX, with B selecting between two inputs: F (B = '1') and F (B = '0'). In fact Eq. 5.6 also describes the output of the ACT 1 Logic Module in Figure 5.1! Now we need to split up F1 and F2 in Eq. 5.6. Suppose we expand F2 = F_B wrt A, and F1 = F_B' wrt C:

F2 = A + D = (A · 1) + (A' · D), (5.7)

F1 = C + D = (C · 1) + (C' · D). (5.8)

From Eqs. 5.6 to 5.8 we see that we may implement F by arranging for A, B, C to appear on the select lines and '1' and D to be the data inputs of the MUXes in the ACT 1 Logic Module. This is the implementation shown in Figure 5.1(d), with connections: A0 = D, A1 = '1', B0 = D, B1 = '1', SA = C, SB = A, S0 = '0', and S1 = B. Now that we know that we can implement Boolean functions using MUXes, how do we know which functions we can implement and how to implement them?
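The expansion and the resulting connections can be checked by brute force over all 16 input combinations. The sketch below is a behavioral model only: `act1` is a hypothetical helper that assumes the Logic Module is two 2:1 MUXes feeding a third 2:1 MUX whose select is S0 OR S1, with the (A0, A1, SA) MUX passed when that select is '0'. This structure is inferred from the connection list in the text, not taken from Actel documentation.

```python
from itertools import product

def mux(d0, d1, sel):
    """2:1 MUX on 0/1 integers: returns d0 when sel is 0, d1 when sel is 1."""
    return d1 if sel else d0

def act1(a0, a1, sa, b0, b1, sb, s0, s1):
    """Assumed behavioral model of the ACT 1 Logic Module (see lead-in)."""
    return mux(mux(a0, a1, sa), mux(b0, b1, sb), s0 | s1)

def f_target(a, b, c, d):
    """Eq. 5.5: F = (A.B) + (B'.C) + D."""
    return (a & b) | ((1 - b) & c) | d

def f_shannon(a, b, c, d):
    """Eqs. 5.6 to 5.8: F = B.F2 + B'.F1 with F2 and F1 expanded again."""
    f2 = (a & 1) | ((1 - a) & d)  # Eq. 5.7, F2 = A + D
    f1 = (c & 1) | ((1 - c) & d)  # Eq. 5.8, F1 = C + D
    return (b & f2) | ((1 - b) & f1)

def f_module(a, b, c, d):
    """Connections quoted in the text for the Logic Module."""
    return act1(a0=d, a1=1, sa=c, b0=d, b1=1, sb=a, s0=0, s1=b)

# Collect every input vector on which the three descriptions disagree.
mismatches = [v for v in product((0, 1), repeat=4)
              if not (f_target(*v) == f_shannon(*v) == f_module(*v))]
print(mismatches)
```

An empty list means the Shannon expansion and the MUX connections both reproduce Eq. 5.5 for every input, under the structural assumption stated above.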
5.1.3 Multiplexer Logic as Function Generators

Figure 5.2 illustrates the 16 different ways to arrange 1s on a Karnaugh map corresponding to the 16 logic functions, F (A, B), of two variables. Two of these functions are not very interesting (F = '0', and F = '1'). Of the 16 functions, Table 5.1 shows the 10 that we can implement using just one 2:1 MUX. Of these 10 functions, the following six are useful:
• INV. The MUX acts as an inverter for one input only.
• BUF. The MUX just passes one of the MUX inputs directly to the output.
• AND. A two-input AND.
• OR. A two-input OR.
• AND1-1. A two-input AND gate with inverted input, equivalent to a NOR-11.
• NOR1-1. A two-input NOR gate with inverted input, equivalent to an AND-11.
FIGURE 5.2 The logic functions of two variables.
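TABLE 5.1 Boolean functions using a 2:1 MUX. The last three columns give the inputs of MUX M1.

| | Function, F | F = | Canonical form | Minterms | Minterm code | Function number | A0 | A1 | SA |
|---|---|---|---|---|---|---|---|---|---|
| 1 | '0' | '0' | '0' | none | 0000 | 0 | 0 | 0 | 0 |
| 2 | NOR1-1(A, B) | (A + B')' | A' · B | 1 | 0010 | 2 | B | 0 | A |
| 3 | NOT(A) | A' | A' · B' + A' · B | 0, 1 | 0011 | 3 | 1 | 0 | A |
| 4 | AND1-1(A, B) | A · B' | A · B' | 2 | 0100 | 4 | A | 0 | B |
| 5 | NOT(B) | B' | A' · B' + A · B' | 0, 2 | 0101 | 5 | 1 | 0 | B |
| 6 | BUF(B) | B | A' · B + A · B | 1, 3 | 1010 | 6 | 0 | B | 1 |
| 7 | AND(A, B) | A · B | A · B | 3 | 1000 | 8 | 0 | B | A |
| 8 | BUF(A) | A | A · B' + A · B | 2, 3 | 1100 | 9 | 0 | A | 1 |
| 9 | OR(A, B) | A + B | A' · B + A · B' + A · B | 1, 2, 3 | 1110 | 13 | B | 1 | A |
| 10 | '1' | '1' | A' · B' + A' · B + A · B' + A · B | 0, 1, 2, 3 | 1111 | 15 | | | |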
Figure 5.3(a) shows how we might view a 2:1 MUX as a function wheel, a three-input black box that can generate any one of the six functions of two-input variables: BUF, INV, AND-11, AND1-1, OR, AND. We can write the output of a function wheel as

F1 = WHEEL1 (A, B), (5.9)

where I define the wheel function as follows:

WHEEL1 (A, B) = MUX (A0, A1, SA). (5.10)

The MUX function is not unique; we shall define it as

MUX (A0, A1, SA) = A0 · SA' + A1 · SA. (5.11)

The inputs (A0, A1, SA) are described using the notation

A0, A1, SA = {A, B, '0', '1'} (5.12)

to mean that each of the inputs (A0, A1, and SA) may be any of the values: A, B, '0', or '1'. I chose the name of the wheel function because it is rather like a dial that you set to your choice of function. Figure 5.3(b) shows that the ACT 1 Logic Module is a function generator built from two function wheels, a 2:1 MUX, and a two-input OR gate.
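The function-wheel idea can be tried out directly by coding Eq. 5.11 and "dialing" the inputs. In the sketch below the names `WHEEL_SETTINGS` and `EXPECTED`, and the particular input assignments, are illustrative: each assignment is one valid choice of (A0, A1, SA) from {A, B, '0', '1'} under Eq. 5.11, and other choices can give the same function.

```python
from itertools import product

def mux(a0, a1, sa):
    """Eq. 5.11 on 0/1 integers: MUX(A0, A1, SA) = A0.SA' + A1.SA."""
    return (a0 & (1 - sa)) | (a1 & sa)

# One (A0, A1, SA) setting per function, written as a function of (a, b).
WHEEL_SETTINGS = {
    "BUF(A)":    lambda a, b: (0, 1, a),
    "INV(A)":    lambda a, b: (1, 0, a),
    "AND(A, B)": lambda a, b: (0, b, a),
    "OR(A, B)":  lambda a, b: (b, 1, a),
    "A.B'":      lambda a, b: (a, 0, b),  # AND with B inverted
    "A'.B":      lambda a, b: (b, 0, a),  # AND with A inverted
}

# Reference definitions used to check each wheel setting.
EXPECTED = {
    "BUF(A)":    lambda a, b: a,
    "INV(A)":    lambda a, b: 1 - a,
    "AND(A, B)": lambda a, b: a & b,
    "OR(A, B)":  lambda a, b: a | b,
    "A.B'":      lambda a, b: a & (1 - b),
    "A'.B":      lambda a, b: (1 - a) & b,
}

def wheel_ok(name):
    """True if the wheel setting matches its reference on all four inputs."""
    return all(mux(*WHEEL_SETTINGS[name](a, b)) == EXPECTED[name](a, b)
               for a, b in product((0, 1), repeat=2))

results = {name: wheel_ok(name) for name in WHEEL_SETTINGS}
print(results)
```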
FIGURE 5.3 The ACT 1 Logic Module as a Boolean function generator. (a) A 2:1 MUX viewed as a function wheel. (b) The ACT 1 Logic Module viewed as two function wheels, an OR gate, and a 2:1 MUX.
We can describe the ACT 1 Logic Module in terms of two WHEEL functions:

F = MUX [ WHEEL1, WHEEL2, OR (S0, S1) ] .   (5.13)

Now, for example, to implement a two-input NAND gate, F = NAND (A, B) = (A · B)', using an ACT 1 Logic Module we first express F as the output of a 2:1 MUX. To split up F we expand it with respect to A (or with respect to B, since F is symmetric in A and B):

F = A · (B') + A' · ('1') .   (5.14)

Thus to make a two-input NAND gate we must set the select input of the MUX connecting WHEEL1 and WHEEL2 to A, that is S0 + S1 = A; we can do this with S0 = A, S1 = '0'. With the MUX defined as in Eq. 5.11 the first data input is passed when the select is '0', so we assign WHEEL1 to implement '1' (the A' term) and WHEEL2 to implement INV (B) (the A term).

Before we get too carried away, we need to realize that we do not have to worry about how to use Logic Modules to construct combinational logic functions; this has already been done for us. For example, if we need a two-input NAND gate, we just use a NAND gate symbol and software takes care of connecting the inputs in the right way to the Logic Module.

How did Actel design its Logic Modules? One of Actel's engineers wrote a program that calculates how many functions of two, three, and four variables a given circuit would provide. The engineers tested many different circuits and chose the best one: a small, logically efficient circuit that implemented many functions. For example, the ACT 1 Logic Module can implement all two-input functions, most functions with three inputs, and many with four inputs.

Apart from being able to implement a wide variety of combinational logic functions, the ACT 1 module can implement sequential logic cells in a flexible and efficient manner. For example, you can use one ACT 1 Logic Module for a transparent latch or two Logic Modules for a flip-flop. The use of latches rather than flip-flops does require a shift to a two-phase clocking scheme using two nonoverlapping clocks and two clock trees.
Two-phase synchronous design using latches is efficient and fast, but handling the timing complexities of two clocks requires changes to synthesis and simulation software that have not occurred. This means that most people still use flip-flops in their designs, and these require two Logic Modules.
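The NAND mapping described above can be checked with a short Python sketch. It assumes the MUX convention of Eq. 5.11 (the first data input is passed when the select is '0') and the module structure of Eq. 5.13; the function names are hypothetical, and the wheel that feeds the '0' data input of the output MUX is the one tied to '1'.

```python
from itertools import product

def mux(a0, a1, sa):
    """2:1 MUX of Eq. 5.11: A0 when SA = 0, A1 when SA = 1."""
    return a1 if sa else a0

def act1_module(wheel_sel0, wheel_sel1, s0, s1):
    """Eq. 5.13: the output MUX is steered by OR(S0, S1)."""
    return mux(wheel_sel0, wheel_sel1, s0 | s1)

def nand_on_act1(a, b):
    const_one = mux(1, 1, 0)   # a function wheel tied off to give '1'
    inv_b = mux(1, 0, b)       # a function wheel set to INV(B)
    # Select = S0 + S1 = A, using S0 = A and S1 = '0'.
    return act1_module(const_one, inv_b, s0=a, s1=0)

for a, b in product((0, 1), repeat=2):
    assert nand_on_act1(a, b) == 1 - (a & b)
    print(a, b, nand_on_act1(a, b))
```

Setting S1 = '1' instead would force the OR gate output to '1' for every input, so the module would always pass the same wheel; that is why S1 has to be '0' here.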
5.1.4 ACT 2 and ACT 3 Logic Modules

Using two ACT 1 Logic Modules for a flip-flop also requires added interconnect and associated parasitic capacitance to connect the two Logic Modules. To produce an efficient two-module flip-flop macro we could use extra antifuses in the Logic Module to cut down on the parasitic connections. However, the extra antifuses would have an adverse impact on the performance of the Logic Module in other macros. The alternative is to use a separate flip-flop module, reducing flexibility and increasing layout complexity. In the ACT 1 family Actel chose to use just one type of Logic Module. The ACT 2 and ACT 3 architectures use two different types of Logic Modules, and one of them does include the equivalent of a D flip-flop.

Figure 5.4 shows the ACT 2 and ACT 3 Logic Modules. The ACT 2 C-Module is similar to the ACT 1 Logic Module but is capable of implementing five-input logic functions. Actel calls its C-Module a combinatorial module even though the module implements combinational logic. John Wakerly blames MMI for the introduction of the term combinatorial [Wakerly, 1994, p. 404].

The use of MUXes in the Actel Logic Modules (and in other places) can cause confusion in using and creating logic macros. For the Actel library, setting S = '0' selects input A of a two-input MUX. For other libraries setting S = '1' selects input A. This can lead to some very hard-to-find errors when moving schematics between libraries. Similar problems arise in flip-flops and latches with MUX inputs. A safer way to label the inputs of a two-input MUX is with '0' and '1', corresponding to the input selected when the select input is '0' or '1', respectively. This notation can be extended to bigger MUXes, but in Figure 5.4, does the input combination S0 = '1' and S1 = '0' select input D10 or input D01? These problems are not caused by Actel, but by failure to use the IEEE standard symbols in this area.

The S-Module (sequential module) contains the same combinational function capability as the C-Module together with a sequential element that can be configured as a flip-flop. Figure 5.4(d) shows the sequential element implementation in the ACT 2 and ACT 3 architectures.
FIGURE 5.4 The Actel ACT 2 and ACT 3 Logic Modules. (a) The C-Module for combinational logic. (b) The ACT 2 S-Module. (c) The ACT 3 S-Module. (d) The equivalent circuit (without buffering) of the SE (sequential element). (e) The sequential element configured as a positive-edge-triggered D flip-flop. (Source: Actel.)
5.1.5 Timing Model and Critical Path

Figure 5.5(a) shows the timing model for the ACT family. This is a simple timing model since it deals only with logic buried inside a chip and allows us only to estimate delays. We cannot predict the exact delays on an Actel chip until we have performed the place-and-route step and know how much delay is contributed by the interconnect. Since we cannot determine the exact delay before physical layout is complete, we call the Actel architecture nondeterministic.

Even though we cannot determine the preroute delays exactly, it is still important to estimate the delay on a logic path. For example, Figure 5.5(a) shows a typical situation deep inside an ASIC. Internal signal I1 may be from the output of a register (flip-flop). We then pass through some combinational logic, C1, through a register, S1, and then another register, S2. The register-to-register delay consists of a clock–Q delay, plus any combinational delay between registers, and the setup time for the next flip-flop. The speed of our system will depend on the slowest register–register delay, or critical path, between registers. We cannot make our clock period any shorter than this or the signal will not reach the second register in time to be clocked.

Figure 5.5(a) shows an internal logic signal, I1, that is an input to a C-Module, C1. C1 is drawn in Figure 5.5(a) as a box with a symbol comprising the overlapping letters C and L (borrowed from carpenters who use this symbol to mark the centerline on a piece of wood). We use this symbol to describe combinational logic. For the standard-speed grade ACT 3 (we shall look at speed grading in Section 5.1.6) the delay between the input of a C-Module and the output is specified in the data book as a parameter, t_PD, with a maximum value of 3.0 ns.

The output of C1 is an input to an S-Module, S1, configured to implement combinational logic and a D flip-flop. The Actel data book specifies the minimum setup time for this D flip-flop as t_SUD = 0.8 ns. This means we need to get the data to the input of S1 at least 0.8 ns before the rising clock edge (for a positive-edge-triggered flip-flop). If we do this, then there is still enough time for the data to go through the combinational logic inside S1 and reach the input of the flip-flop inside S1 in time to be clocked. We can guarantee that this will work because the combinational logic delay inside S1 is fixed.
FIGURE 5.5 The Actel ACT timing model. (a) Timing parameters for a 'Std' speed grade ACT 3. (Source: Actel.) (b) Flip-flop timing. (c) An example of flip-flop timing based on ACT 3 parameters.

The S-Module seems like good value: do we get all the combinational logic functions of a C-Module (with delay t_PD of 3 ns) as well as the setup time for a flip-flop for only 0.8 ns? Not really. Next I will explain why not.

Figure 5.5(b) shows what is happening inside an S-Module. The setup and hold times, as measured inside (not outside) the S-Module, of the flip-flop are t'_SUD and t'_H (a prime denotes parameters that are measured inside the S-Module). The clock–Q propagation delay is t'_CO. The parameters t'_SUD, t'_H, and t'_CO are measured using the internal clock signal CLKi. The propagation delay of the combinational logic inside the S-Module is t'_PD. The delay of the combinational logic that drives the flip-flop clock signal (Figure 5.4 d) is t'_CLKD. From outside the S-Module, with reference to the outside clock signal CLK1:

t_SUD = t'_SUD + (t'_PD − t'_CLKD) ,
t_H = t'_H + (t'_PD − t'_CLKD) ,
t_CO = t'_CO + t'_CLKD .   (5.15)

Figure 5.5(c) shows an example of flip-flop timing. We have no way of knowing what the internal flip-flop parameters t'_SUD, t'_H, and t'_CO actually are, but we can assume some reasonable values (just for illustration purposes):

t'_SUD = 0.4 ns, t'_H = 0.1 ns, t'_CO = 0.4 ns .   (5.16)

We do know the delay, t'_PD, of the combinational logic inside the S-Module. It is exactly the same as the C-Module delay, so t'_PD = 3 ns for the ACT 3. We do not know t'_CLKD; we shall assume a reasonable value of t'_CLKD = 2.6 ns (the exact value does not matter in the following argument). Next we calculate the external S-Module parameters from Eq. 5.15 as follows:

t_SUD = 0.8 ns, t_H = 0.5 ns, t_CO = 3.0 ns .   (5.17)

These are the same as the ACT 3 S-Module parameters shown in Figure 5.5(a), and I chose t'_CLKD and the values in Eq. 5.16 so that they would be the same. So now we see where the combinational logic delay of 3.0 ns has gone: 0.4 ns went into increasing the setup time and 2.6 ns went into increasing the clock–output delay, t_CO.

From the outside we can say that the combinational logic delay is buried in the flip-flop setup time. FPGA vendors will point this out as an advantage that they have. Of course, we are not getting something for nothing here. It is like borrowing money: you have to pay it back.
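The arithmetic of Eqs. 5.15 to 5.17 can be reproduced with a few lines of Python. This is only a sketch: the function and argument names are hypothetical, and the internal values are the illustrative ones assumed in the text, not data-book figures.

```python
def external_timing(t_sud_int, t_h_int, t_co_int, t_pd_int, t_clkd_int):
    """Apply Eq. 5.15; all times are in nanoseconds."""
    shift = t_pd_int - t_clkd_int   # logic delay not cancelled by the clock delay
    return {
        "t_SUD": t_sud_int + shift,
        "t_H": t_h_int + shift,
        "t_CO": t_co_int + t_clkd_int,
    }

# Illustrative internal values from Eq. 5.16, with t'_PD = 3.0 and t'_CLKD = 2.6.
params = external_timing(t_sud_int=0.4, t_h_int=0.1, t_co_int=0.4,
                         t_pd_int=3.0, t_clkd_int=2.6)
for name, value in params.items():
    print(f"{name} = {value:.1f} ns")
```

Changing `t_clkd_int` moves delay between the external setup time and the clock–output delay while their sum stays the same, which is the point of the borrowing analogy.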
5.1.6 Speed Grading

Most FPGA vendors sort chips according to their speed (the sorting is known as speed grading or speed binning, because parts are automatically sorted into plastic bins by the production tester). You pay more for the faster parts. In the case of the ACT family of FPGAs, Actel measures performance with a special binning circuit, included on every chip, that consists of an input buffer driving a string of buffers or inverters followed by an output buffer. The parts are sorted from measurements on the binning circuit according to Logic Module propagation delay. The propagation delay, t_PD, is defined as the average of the rising (t_PLH) and falling (t_PHL) propagation delays of a Logic Module:

t_PD = (t_PLH + t_PHL)/2 .   (5.18)

Since the transistor properties match so well across a chip, measurements on the binning circuit closely correlate with the speed of the rest of the Logic Modules on the die. Since the speeds of die on the same wafer also match well, most of the good die on a wafer fall into the same speed bin. Actel speed grades are: a 'Std' speed grade, a '1' speed grade that is approximately 15 percent faster, a '2' speed grade that is approximately 25 percent faster than 'Std', and a '3' speed grade that is approximately 35 percent faster than 'Std'.
5.1.7 Worst-Case Timing

If you use fully synchronous design techniques you only have to worry about how slow your circuit may be, not how fast. Designers thus need to know the maximum delays they may encounter, which we call the worst-case timing. Maximum delays in CMOS logic occur when operating under minimum voltage, maximum temperature, and slow-slow process conditions. (A slow-slow process refers to a process variation, or process corner, which results in slow p-channel transistors and slow n-channel transistors; we can also have fast-fast, slow-fast, and fast-slow process corners.) Electronic equipment has to survive in a variety of environments and ASIC manufacturers offer several classes of qualification for different applications:
• Commercial. VDD = 5 V ± 5 %, T_A (ambient) = 0 to +70 °C.
• Industrial. VDD = 5 V ± 10 %, T_A (ambient) = −40 to +85 °C.
• Military: VDD = 5 V ± 10 %, T_C (case) = −55 to +125 °C.
• Military: Standard MIL-STD-883C Class B.
• Military extended: Unmanned spacecraft.
ASICs for commercial application are cheapest; ASICs for the Cruise missile are very, very expensive. Notice that commercial and industrial application parts are specified with respect to the ambient temperature T_A (room temperature or the temperature inside the box containing the ASIC). Military specifications are relative to the package case temperature, T_C. What is really important is the temperature of the transistors on the chip, the junction temperature, T_J, which is always higher than T_A (unless we dissipate zero power). For most applications that dissipate a few hundred mW, T_J is only 5–10 °C higher than T_A. To calculate the value of T_J we need to know the power dissipated by the chip and the thermal properties of the package; we shall return to this in Section 6.6.1, Power Dissipation.

Manufacturers have to specify their operating conditions with respect to T_J and not T_A, since they have no idea how much power purchasers will dissipate in their designs or which package they will use. Actel used to specify timing under nominal operating conditions: VDD = 5.0 V, and T_J = 25 °C. Actel and most other manufacturers now specify parameters under worst-case commercial conditions: VDD = 4.75 V, and T_J = +70 °C.

Table 5.2 shows the ACT 3 commercial worst-case timing. In this table Actel has included some estimates of the variable routing delay shown in Figure 5.5(a). These delay estimates depend on the number of gates connected to a gate output (the fanout).

When you design microelectronic systems (or design anything) you must use worst-case figures (just as you would design a bridge for the worst-case load). To convert nominal or typical timing figures to the worst case (or best case), we use measured, or empirically derived, constants called derating factors that are expressed either as a table or a graph. For example, Table 5.3 shows the ACT 3 derating factors from commercial worst-case to industrial worst-case and military worst-case conditions (assuming T_J = T_A). The ACT 1 and ACT 2 derating factors are approximately the same.
TABLE 5.2 ACT 3 timing parameters (note 6).

Family | Delay (note 8) | Fanout 1 | Fanout 2 | Fanout 3 | Fanout 4 | Fanout 8
ACT 3-3 (data book) | t_PD | 2.9 | 3.2 | 3.4 | 3.7 | 4.8
ACT3-2 (calculated, note 9) | t_PD/0.85 | 3.41 | 3.76 | 4.00 | 4.35 | 5.65
ACT3-1 (calculated, note 9) | t_PD/0.75 | 3.87 | 4.27 | 4.53 | 4.93 | 6.40
ACT3-Std (calculated, note 9) | t_PD/0.65 | 4.46 | 4.92 | 5.23 | 5.69 | 7.38

Source: Actel.

TABLE 5.3 ACT 3 derating factors (notes 7 and 10).

V_DD / V | T_J (junction) / °C: −55 | −40 | 0 | 25 | 70 | 85 | 125
4.5 | 0.72 | 0.76 | 0.85 | 0.90 | 1.04 | 1.07 | 1.17
4.75 | 0.70 | 0.73 | 0.82 | 0.87 | 1.00 | 1.03 | 1.12
5.00 | 0.68 | 0.71 | 0.79 | 0.84 | 0.97 | 1.00 | 1.09
5.25 | 0.66 | 0.69 | 0.77 | 0.82 | 0.94 | 0.97 | 1.06
5.5 | 0.63 | 0.66 | 0.74 | 0.79 | 0.90 | 0.93 | 1.01

Source: Actel.
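The calculated rows of Table 5.2 follow from the data-book row by a simple division. The Python sketch below regenerates them; the divisors are copied from the Delay column of the table, and the variable names are illustrative.

```python
FANOUTS = (1, 2, 3, 4, 8)
T_PD_DATA_BOOK = (2.9, 3.2, 3.4, 3.7, 4.8)   # ACT 3-3 row of Table 5.2, in ns
DIVISORS = {"ACT3-2": 0.85, "ACT3-1": 0.75, "ACT3-Std": 0.65}

def scaled_row(divisor):
    """Scale the data-book delays and round to two decimals, as in the table."""
    return [round(t / divisor, 2) for t in T_PD_DATA_BOOK]

print("fanout   ", FANOUTS)
for family, divisor in DIVISORS.items():
    print(f"{family:9s}", scaled_row(divisor))
```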
As an example of a timing calculation, suppose we have a Logic Module on a 'Std' speed grade A1415A (an ACT 3 part) that drives four other Logic Modules and we wish to estimate the delay under worst-case industrial conditions. From the data in Table 5.2 we see that the Logic Module delay for an ACT 3 'Std' part with a fanout of four is t_PD = 5.7 ns (commercial worst-case conditions, assuming T_J = T_A).

If this were the slowest path between flip-flops (very unlikely since we have only one stage of combinational logic in this path), our estimated critical path delay between registers, t_CRIT, would be the combinational logic delay plus the flip-flop setup time plus the clock–output delay:

t_CRIT (w-c commercial) = t_PD + t_SUD + t_CO = 5.7 ns + 0.8 ns + 3.0 ns = 9.5 ns .   (5.19)

(I use w-c as an abbreviation for worst-case.) Next we need to adjust the timing to worst-case industrial conditions. The appropriate derating factor is 1.07 (from Table 5.3); so the estimated delay is

t_CRIT (w-c industrial) = 1.07 × 9.5 ns = 10.2 ns .   (5.20)

Let us jump ahead a little and assume that we can calculate that T_J = T_A + 20 °C = 105 °C in our application. To find the derating factor at 105 °C we linearly interpolate between the values for 85 °C (1.07) and 125 °C (1.17) from Table 5.3. The interpolated derating factor is 1.12 and thus

t_CRIT (w-c industrial, T_J = 105 °C) = 1.12 × 9.5 ns = 10.6 ns ,   (5.21)

giving us an operating frequency of just less than 100 MHz.

It may seem unfair to calculate the worst-case performance for the slowest speed grade under the harshest industrial conditions, but the examples in the data books are always for the fastest speed grades under less stringent commercial conditions. If we want to illustrate the use of derating, then the delays can only get worse than the data book values! The ultimate word on logic delays for all FPGAs is the timing analysis provided by the FPGA design tools. However, you should be able to calculate whether or not the answer that you get from such a tool is reasonable.
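The derating calculation above is easy to script, which is a convenient way to sanity-check the output of a timing tool. The following Python sketch uses the 4.5 V row of Table 5.3 and linear interpolation between adjacent table temperatures; the names `DERATING_4V5` and `derating` are hypothetical.

```python
# Derating factors for V_DD = 4.5 V from Table 5.3, keyed by T_J in degrees C.
DERATING_4V5 = {-55: 0.72, -40: 0.76, 0: 0.85, 25: 0.90,
                70: 1.04, 85: 1.07, 125: 1.17}

def derating(t_j):
    """Linearly interpolate between the two nearest table temperatures."""
    temps = sorted(DERATING_4V5)
    if not temps[0] <= t_j <= temps[-1]:
        raise ValueError("temperature outside the range of the table")
    for lo, hi in zip(temps, temps[1:]):
        if lo <= t_j <= hi:
            frac = (t_j - lo) / (hi - lo)
            return DERATING_4V5[lo] + frac * (DERATING_4V5[hi] - DERATING_4V5[lo])

# Worst-case commercial critical path from Eq. 5.19, in ns.
t_crit = 5.7 + 0.8 + 3.0

for t_j in (85, 105):
    t = derating(t_j) * t_crit
    print(f"T_J = {t_j} C: factor {derating(t_j):.2f}, "
          f"t_CRIT = {t:.1f} ns, f_max = {1e3 / t:.0f} MHz")
```

Linear interpolation is the method used in the text; whether the real derating curve is linear between table points is not stated by the source.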
5.1.8 Actel Logic Module Analysis

The sizes of the ACT family Logic Modules are close to the size of the base cell of an MGA. We say that the Actel ACT FPGAs use a fine-grain architecture. An advantage of a fine-grain architecture is that, whatever the mix of combinational logic to flip-flops in your application, you can probably still use 90 percent of an Actel FPGA. Another advantage is that synthesis software has an easier time mapping logic efficiently to the simple Actel modules.

The physical symmetry of the ACT Logic Modules greatly simplifies the place-and-route step. In many cases the router can swap equivalent pins on opposite sides of the module to ease channel routing. The design of the Actel Logic Modules is a balance between efficiency of implementation and efficiency of utilization. A simple Logic Module may reduce performance in some areas, as I have pointed out, but allows the use of fast and robust place-and-route software. Fast, robust routing is an important part of Actel FPGAs (see Section 7.1, Actel ACT).

1. The minterm numbers are formed from the product terms of the canonical form. For example, A · B' = 10 = 2.
2. The minterm code is formed from the minterms. A '1' denotes the presence of that minterm.
3. The function number is the decimal version of the minterm code.
4. Connections to a two-input MUX: A0 and A1 are the data inputs and SA is the select input (see Eq. 5.11).
5. 1994 data book, p. 1-101.
6. ACT 3: May 1995 data sheet, p. 1-173. ACT 2: 1994 data book, p. 1-51.
7. 1994 data book, p. 1-12 (ACT 1), p. 1-52 (ACT 2), May 1995 data sheet, p. 1-174 (ACT 3).
8. V_DD = 4.75 V, T_J (junction) = 70 °C. Logic module plus routing delay. All propagation delays in nanoseconds.
9. The Actel '1' speed grade is 15 % faster than 'Std'; '2' is 25 % faster than 'Std'; '3' is 35 % faster than 'Std'.
10. Worst-case commercial: V_DD = 4.75 V, T_A (ambient) = +70 °C. Commercial: V_DD = 5 V ± 5 %, T_A (ambient) = 0 to +70 °C. Industrial: V_DD = 5 V ± 10 %, T_A (ambient) = −40 to +85 °C. Military: V_DD = 5 V ± 10 %, T_C (case) = −55 to +125 °C.
5.2 Xilinx LCA

Xilinx LCA (a trademark, denoting logic cell array) basic logic cells, configurable logic blocks or CLBs , are bigger and more complex than the Actel or QuickLogic cells. The Xilinx LCA basic logic cell is an example of a coarse-grain architecture . The Xilinx CLBs contain both combinational logic and flip-flops.
5.2.1 XC3000 CLB

The XC3000 CLB, shown in Figure 5.6, has five logic inputs (A–E), a common clock input (K), an asynchronous direct-reset input (RD), and an enable (EC). Using programmable MUXes connected to the SRAM programming cells, you can independently connect each of the two CLB outputs (X and Y) to the output of the flip-flops (QX and QY) or to the output of the combinational logic (F and G).
FIGURE 5.6 The Xilinx XC3000 CLB (configurable logic block). (Source: Xilinx.)

A 32-bit look-up table (LUT), stored in 32 bits of SRAM, provides the ability to implement combinational logic. Suppose you need to implement the function F = A · B · C · D · E (a five-input AND). You set the contents of LUT cell number 31 (with address '11111') in the 32-bit SRAM to a '1'; all the other SRAM cells are set to '0'. When you apply the input variables as an address to the 32-bit SRAM, only when ABCDE = '11111' will the output F be a '1'. This means that the CLB propagation delay is fixed, equal to the LUT access time, and independent of the logic function you implement.

There are seven inputs for the combinational logic in the XC3000 CLB: the five CLB inputs (A–E), and the flip-flop outputs (QX and QY). There are two outputs from the LUT (F and G). Since a 32-bit LUT requires only five variables to form a unique address (32 = 2^5), there are several ways to use the LUT:

• You can use five of the seven possible inputs (A–E, QX, QY) with the entire 32-bit LUT. The CLB outputs (F and G) are then identical.
• You can split the 32-bit LUT in half to implement two functions of four variables each. You can choose four input variables from the seven inputs (A–E, QX, QY). You have to choose two of the inputs from the five CLB inputs (A–E); then one function output connects to F and the other output connects to G.
• You can split the 32-bit LUT in half, using one of the seven input variables as a select input to a 2:1 MUX that switches between F and G. This allows you to implement some functions of six and seven variables.
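The look-up table idea described above can be sketched in a few lines of Python. The helper names `make_lut` and `lut_read` are hypothetical, and the address ordering (A as the most significant bit, so that ABCDE = '11111' is cell 31) is chosen to match the five-input AND example in the text.

```python
def make_lut(func, n_inputs=5):
    """Fill a 2**n entry table by evaluating func at every address."""
    table = []
    for addr in range(2 ** n_inputs):
        # Unpack the address into input bits, first input = most significant bit.
        bits = [(addr >> (n_inputs - 1 - i)) & 1 for i in range(n_inputs)]
        table.append(int(bool(func(*bits))))
    return table

def lut_read(table, *bits):
    """Use the input bits as an address into the table."""
    addr = 0
    for bit in bits:
        addr = (addr << 1) | bit
    return table[addr]

# Five-input AND: only cell 31 (address '11111') holds a '1'.
and5 = make_lut(lambda a, b, c, d, e: a and b and c and d and e)
print(and5.index(1), sum(and5))
print(lut_read(and5, 1, 1, 1, 1, 1), lut_read(and5, 1, 1, 0, 1, 1))
```

Because every function is just a different table content, the read path (and hence the delay) is the same whichever function is stored.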
5.2.2 XC4000 Logic Block

Figure 5.7 shows the CLB used in the XC4000 series of Xilinx FPGAs. This is a fairly complicated basic logic cell containing two four-input LUTs that feed a three-input LUT. The XC4000 CLB also has special fast carry logic hard-wired between CLBs. MUX control logic maps four control inputs (C1–C4) into the four inputs: LUT input H1, direct in (DIN), enable clock (EC), and a set/reset control (S/R) for the flip-flops. The control inputs (C1–C4) can also be used to control the use of the F' and G' LUTs as 32 bits of SRAM.
FIGURE 5.7 The Xilinx XC4000 family CLB (configurable logic block). ( Source: Xilinx.)
5.2.3 XC5200 Logic Block

Figure 5.8 shows the basic logic cell, a Logic Cell or LC (note 1), used in the XC5200 family of Xilinx LCA FPGAs. The LC is similar to the CLBs in the XC2000/3000/4000 families, but simpler. Xilinx retained the term CLB in the XC5200 to mean a group of four LCs (LC0–LC3). The XC5200 LC contains a four-input LUT, a flip-flop, and MUXes to handle signal switching. The arithmetic carry logic is separate from the LUTs. A limited capability to cascade functions is provided (using the MUX labeled F5_MUX in logic cells LC0 and LC2 in Figure 5.8) to gang two LCs in parallel to provide the equivalent of a five-input LUT.
FIGURE 5.8 The Xilinx XC5200 family LC (Logic Cell) and CLB (configurable logic block). (Source: Xilinx.)
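Ganging two four-input LUTs with a 2:1 MUX to obtain a five-input function is an application of the Shannon expansion: one LUT holds the function with the fifth input at '0', the other holds it with the fifth input at '1', and the MUX picks between them. The Python sketch below illustrates only this principle; which signal drives the F5_MUX select in a real XC5200, and the example function `f5`, are assumptions made for illustration.

```python
from itertools import product

def lut4(func):
    """16-entry table for a four-input function, addressed as 8a + 4b + 2c + d."""
    return [int(bool(func(*bits))) for bits in product((0, 1), repeat=4)]

def read4(table, a, b, c, d):
    return table[(a << 3) | (b << 2) | (c << 1) | d]

def f5(a, b, c, d, e):
    """An arbitrary five-input function used only as a test case."""
    return ((a & b) ^ (c | d)) ^ e

# One LUT per cofactor of f5 with respect to the fifth input, e.
lut_e0 = lut4(lambda a, b, c, d: f5(a, b, c, d, 0))
lut_e1 = lut4(lambda a, b, c, d: f5(a, b, c, d, 1))

def f5_from_two_luts(a, b, c, d, e):
    """2:1 MUX selecting between the two four-input LUT outputs."""
    return read4(lut_e1, a, b, c, d) if e else read4(lut_e0, a, b, c, d)

assert all(f5_from_two_luts(*bits) == f5(*bits)
           for bits in product((0, 1), repeat=5))
print("two 4-input LUTs plus a MUX reproduce all 32 rows of f5")
```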
5.2.4 Xilinx CLB Analysis

The use of a LUT in a Xilinx CLB to implement combinational logic is both an advantage and a disadvantage. It means, for example, that an inverter is as slow as a five-input NAND. On the other hand a LUT simplifies timing of synchronous logic, simplifies the basic logic cell, and matches the Xilinx SRAM programming technology well. A LUT also provides the possibility, used in the XC4000, of using the LUT directly as SRAM. You can configure the XC4000 CLB as a memory, either two 16 × 1 SRAMs or a 32 × 1 SRAM, but this is expensive RAM.

Figure 5.9 shows the timing model for Xilinx LCA FPGAs. Xilinx uses two speed-grade systems. The first uses the maximum guaranteed toggle rate of a CLB flip-flop measured in MHz as a suffix, so higher is faster. For example a Xilinx XC3020-125 has a toggle frequency of 125 MHz. The other Xilinx naming system (which supersedes the old scheme, since toggle frequency is rather meaningless) uses the approximate delay time of the combinational logic in a CLB in nanoseconds, so lower is faster in this case. Thus, for example, an XC4010-6 has t_ILO = 6.0 ns (the correspondence between speed grade and t_ILO is fairly accurate for the XC2000, XC4000, and XC5200 but is less accurate for the XC3000).
FIGURE 5.9 The Xilinx LCA timing model. The paths show different uses of CLBs (configurable logic blocks). The parameters shown are for an XC5210-6 (note 2). (Source: Xilinx.)
The inclusion of flip-flops and combinational logic inside the basic logic cell leads to efficient implementation of state machines, for example. The coarse-grain architecture of the Xilinx CLBs maximizes performance given the size of the SRAM programming technology element. As a result of the increased complexity of the basic logic cell we shall see (in Section 7.2, Xilinx LCA) that the routing between cells is more complex than in other FPGAs that use a simpler basic logic cell.

1. Xilinx decided to use Logic Cell as a trademark in 1995, rather as if IBM were to use Computer as a trademark today. Thus we should now only talk of a Xilinx Logic Cell (with capital letters) and not Xilinx logic cells.
2. October 1995 (Version 3.0) data sheet.
5.3 Altera FLEX

Figure 5.10 shows the basic logic cell, a Logic Element ( LE ), that Altera uses in its FLEX 8000 series of FPGAs. Apart from the cascade logic (which is slightly simpler in the FLEX LE) the FLEX cell resembles the XC5200 LC architecture shown in Figure 5.8 . This is not surprising since both architectures are based on the same SRAM programming technology. The FLEX LE uses a four-input LUT, a flip-flop, cascade logic, and carry logic. Eight LEs are stacked to form a Logic Array Block (the same term as used in the MAX series, but with a different meaning).
FIGURE 5.10 The Altera FLEX architecture. (a) Chip floorplan. (b) LAB (Logic Array Block). (c) Details of the LE (Logic Element). ( Source: Altera (adapted with permission).)
5.4 Altera MAX

Suppose we have a simple two-level logic circuit that implements a sum of products as shown in Figure 5.11 (a). We may redraw any two-level circuit using a regular structure ( Figure 5.11 b): a vector of buffers, followed by a vector of AND gates (which construct the product terms) that feed OR gates (which form the sums of the product terms). We can simplify this representation still further ( Figure 5.11 c), by drawing the input lines to a multiple-input AND gate as if they were one horizontal wire, which we call a product-term line . A structure such as Figure 5.11 (c) is called programmable array logic , first introduced by Monolithic Memories as the PAL series of devices.
FIGURE 5.11 Logic arrays. (a) Two-level logic. (b) Organized sum of products. (c) A programmable-AND plane. (d) EPROM logic array. (e) Wired logic.

Because the arrangement of Figure 5.11(c) is very similar to a ROM, we sometimes call a horizontal product-term line, which would be the bit output from a ROM, the bit line. The vertical input line is the word line. Figure 5.11(d) and (e) show how to build the programmable-AND array (or product-term array) from EPROM transistors. The horizontal product-term lines connect to the vertical input lines using the EPROM transistors as pull-downs at each possible connection. Applying a '1' to the gate of an unprogrammed EPROM transistor pulls the product-term line low to a '0'. A programmed n-channel transistor has a threshold voltage higher than V_DD and is therefore always off. Thus a programmed transistor has no effect on the product-term line.
Notice that connecting the n-channel EPROM transistors to a pull-up resistor as shown in Figure 5.11(e) produces a wired-logic function: the output is high only if all of the transistor outputs are high, resulting in a wired-AND function of the outputs. The product-term line is low when any of the inputs are high. Thus, to convert the wired-logic array into a programmable-AND array, we need to invert the sense of the inputs. We often conveniently omit these details when we draw the schematics of logic arrays, usually implemented as NOR–NOR arrays (so we need to invert the outputs as well). They are not minor details when you implement the layout, however.

Figure 5.12 shows how a programmable-AND array can be combined with other logic into a macrocell that contains a flip-flop. For example, the widely used 22V10 PLD, also called a registered PAL, essentially contains 10 of the macrocells shown in Figure 5.12. The part number, 22V10, denotes that there are 22 inputs (44 vertical input lines for both true and complement forms of the inputs) to the programmable-AND array and 10 macrocells. The PLD or registered PAL shown in Figure 5.12 has a 2i × jk programmable-AND array.
FIGURE 5.12 A registered PAL with i inputs, j product terms, and k macrocells.
5.4.1 Logic Expanders

The basic logic cell for the Altera MAX architecture, a macrocell, is a descendant of the PAL. Using the logic expander, shown in Figure 5.13, to generate extra logic terms, it is possible to implement functions that require more product terms than are available in a simple PAL macrocell. As an example, consider the following function:

F = A' · C · D + B' · C · D + A · B + B · C' .   (5.22)

This function has four product terms and thus we cannot implement F using a macrocell that has only a three-wide OR array (such as the one shown in Figure 5.13). If we rewrite F as a sum of (products of products) like this:

F = (A' + B') · C · D + (A + C') · B = (A · B)' · (C · D) + (A' · C)' · B ;   (5.23)

we can use logic expanders to form the expander terms (A · B)' and (A' · C)' (see Figure 5.13). We can even share these extra product terms with other macrocells if we need to. We call the extra logic gates that form these shareable product terms a shared logic expander, or just shared expander.
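The rewriting of Eq. 5.22 as Eq. 5.23 can be confirmed by exhaustive comparison over the 16 input combinations. The short Python sketch below does this; the function names are illustrative and the code models only the Boolean algebra, not any particular Altera device.

```python
from itertools import product

def f_sum_of_products(a, b, c, d):
    """Eq. 5.22: four product terms."""
    na, nb, nc = 1 - a, 1 - b, 1 - c
    return (na & c & d) | (nb & c & d) | (a & b) | (b & nc)

def f_with_expanders(a, b, c, d):
    """Eq. 5.23: two product terms built on the expander terms."""
    exp1 = 1 - (a & b)          # expander term (A B)'
    exp2 = 1 - ((1 - a) & c)    # expander term (A' C)'
    return (exp1 & c & d) | (exp2 & b)

mismatches = [bits for bits in product((0, 1), repeat=4)
              if f_sum_of_products(*bits) != f_with_expanders(*bits)]
print("mismatches:", mismatches)
```

With the expander terms available as inputs, the macrocell needs only two inputs of its OR array instead of four.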
FIGURE 5.13 Expander logic and programmable inversion. An expander increases the number of product terms available and programmable inversion allows you to reduce the number of product terms you need.

The disadvantage of the shared expanders is the extra logic delay incurred because of the second pass that you need to take through the product-term array. We usually do not know before the logic tools assign logic to macrocells (logic assignment) whether we need to use the logic expanders. Since we cannot predict the exact timing, the Altera MAX architecture is not strictly deterministic. However, once we do know whether a signal has to go through the array once or twice, we can simply and accurately predict the delay. This is a very important and useful feature of the Altera MAX architecture.

The expander terms are sometimes called helper terms when you use a PAL. If you use helper terms in a 22V10, for example, you have to go out to the chip I/O pad and then back into the programmable array again, using two-pass logic.
FIGURE 5.14 Use of programmed inversion to simplify logic: (a) The function F = A · B' + A · C' + A · D' + A' · C · D requires four product terms (P1–P4) to implement while (b) the complement, F' = A · B · C · D + A' · D' + A' · C', requires only three product terms (P1–P3).

Another common feature in complex PLDs, also used in some PLDs, is shown in Figure 5.13. Programming one input of the XOR gate at the macrocell output allows you to choose whether or not to invert the output (a '1' for inversion or a '0' for no inversion). This programmable inversion can reduce the required number of product terms by using a de Morgan equivalent representation instead of a conventional sum-of-products form, as shown in Figure 5.14.

As an example of using programmable inversion, consider the function

F = A · B' + A · C' + A · D' + A' · C · D ,   (5.24)

which requires four product terms, one too many for a three-wide OR array. If we generate the complement of F instead,

F' = A · B · C · D + A' · D' + A' · C' ,   (5.25)

this has only three product terms. To create F we invert F', using programmable inversion.

Figure 5.15 shows an Altera MAX macrocell and illustrates the architectures of several different product families. The implementation details vary among the families, but the basic features (wide programmable-AND array, narrow fixed-OR array, logic expanders, and programmable inversion) are very similar. Each family has the following individual characteristics:
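A brief Python sketch can confirm that Eq. 5.25 really is the complement of Eq. 5.24, and that an XOR gate with one input programmed to '1' recovers F from F'. The function `macrocell_output` is a hypothetical stand-in for the XOR gate at the macrocell output, not a model of a specific device.

```python
from itertools import product

def f(a, b, c, d):
    """Eq. 5.24: four product terms."""
    na, nb, nc, nd = 1 - a, 1 - b, 1 - c, 1 - d
    return (a & nb) | (a & nc) | (a & nd) | (na & c & d)

def f_complement(a, b, c, d):
    """Eq. 5.25: three product terms."""
    na, nc, nd = 1 - a, 1 - c, 1 - d
    return (a & b & c & d) | (na & nd) | (na & nc)

def macrocell_output(or_array_out, invert):
    """Programmable inversion: XOR with a programmed bit ('1' inverts)."""
    return or_array_out ^ invert

ok = all(macrocell_output(f_complement(*bits), 1) == f(*bits)
         for bits in product((0, 1), repeat=4))
print("F recovered from F' by programmable inversion:", ok)
```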
• A typical MAX 5000 chip has: 8 dedicated inputs (with both true and complement forms); 24 inputs from the chipwide interconnect (true and complement); and either 32 or 64 shared expander terms (single polarity). The MAX 5000 LAB looks like a 32V16 PLD (ignoring the expander terms).
• The MAX 7000 LAB has 36 inputs from the chipwide interconnect and 16 shared expander terms; the MAX 7000 LAB looks like a 36V16 PLD.
• The MAX 9000 LAB has 33 inputs from the chipwide interconnect and 16 local feedback inputs (as well as 16 shared expander terms); the MAX 9000 LAB looks like a 49V16 PLD.
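The programmable-inversion example in Eqs. 5.24 and 5.25 can be checked by brute force. The following is a minimal sketch, not part of the original text: it enumerates all 16 input combinations and confirms that inverting the three-term function F' reproduces the four-term function F. The function names are illustrative only.

```python
from itertools import product

def f(a, b, c, d):
    # Eq. 5.24: F = A B' + A C' + A D' + A' C D (four product terms)
    return (a and not b) or (a and not c) or (a and not d) or (not a and c and d)

def f_complement(a, b, c, d):
    # Eq. 5.25: F' = A B C D + A' D' + A' C' (three product terms)
    return (a and b and c and d) or (not a and not d) or (not a and not c)

def inversion_is_valid():
    # Programmable inversion is correct if NOT(F') equals F for every input.
    return all(
        bool(f(*bits)) == (not f_complement(*bits))
        for bits in product([False, True], repeat=4)
    )

print(inversion_is_valid())
```

If the check passes, the three-term form fits a three-wide OR array and the XOR gate at the macrocell output supplies the final inversion.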
FIGURE 5.15 The Altera MAX architecture. (a) Organization of logic and interconnect. (b) A MAX family LAB (Logic Array Block). (c) A MAX family macrocell. The macrocell details vary between the MAX families; the functions shown here are closest to those of the MAX 9000 family macrocells.
FIGURE 5.16 The timing model for the Altera MAX architecture. (a) A direct path through the logic array and a register. (b) Timing for the direct path. (c) Using a parallel expander. (d) Parallel expander timing. (e) Making two passes through the logic array to use a shared expander. (f) Timing for the shared expander (there is no register in this path). All timing values are in nanoseconds for the MAX 9000 series, '15' speed grade. ( Source: Altera.)
5.4.2 Timing Model

Figure 5.16 shows the Altera MAX timing model for local signals. For example, in
Figure 5.16(a) an internal signal, I1, enters the local array (the LAB interconnect with a fixed delay t_1 = t_LOCAL = 0.5 ns), passes through the AND array (delay t_2 = t_LAD = 4.0 ns), and to the macrocell flip-flop (with setup time, t_3 = t_SU = 3.0 ns, and clock-to-Q or register delay, t_4 = t_RD = 1.0 ns). The path delay is thus: 0.5 + 4 + 3 + 1 = 8.5 ns.

Figure 5.16(c) illustrates the use of a parallel logic expander. This is different from the case of the shared expander (Figure 5.13), which required two passes in series through the product-term array. Using a parallel logic expander, the extra product term is generated in an adjacent macrocell in parallel with other product terms (not in series, as in a shared expander).

We can illustrate the difference between a parallel expander and a shared expander using an example function that we have used before (Eq. 5.22),

    F = A' C D + B' C D + A B + B C' .    (5.26)

This time we shall use macrocell M1 in Figure 5.16(d) to implement F1 equal to the sum of the first three product terms in Eq. 5.26. We use F1 (using the parallel expander connection between adjacent macrocells shown in Figure 5.15) as an input to macrocell M2. Now we can form F = F1 + B C' without using more than three inputs of an OR gate (the MAX 5000 has a three-wide OR array in the macrocell; the MAX 9000, as shown in Figure 5.15, is capable of handling five product terms in one macrocell, but the principle is the same). The total delay is the same as before, except that we add the delay of a parallel expander, t_PEXP = 1.0 ns. Total delay is then 8.5 + 1 = 9.5 ns.

Figure 5.16(e) and (f) show the use of a shared expander, similar to Figure 5.13.

The Altera MAX macrocell is more like a PLD than the other FPGA architectures discussed here; that is why Altera calls the MAX architecture a complex PLD. This means that the MAX architecture works well in applications for which PLDs are most useful: simple, fast logic with many inputs or variables.
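The path-delay arithmetic above can be written out as a short calculation. This is only a sketch that uses the MAX 9000 delay values quoted in the text; the constant and function names are illustrative and do not come from any Altera tool.

```python
# Delay parameters quoted in the text (MAX 9000), all in nanoseconds.
T_LOCAL = 0.5  # local array (LAB interconnect)
T_LAD = 4.0    # logic (AND) array delay
T_SU = 3.0     # flip-flop setup time
T_RD = 1.0     # register (clock-to-Q) delay
T_PEXP = 1.0   # parallel expander delay

def direct_path_delay_ns():
    # One pass through the local array, the AND array, and the register.
    return T_LOCAL + T_LAD + T_SU + T_RD

def parallel_expander_delay_ns():
    # Same path, plus the delay of one parallel expander.
    return direct_path_delay_ns() + T_PEXP

print(direct_path_delay_ns())        # 8.5
print(parallel_expander_delay_ns())  # 9.5
```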
5.4.3 Power Dissipation in Complex PLDs

A programmable-AND array in any PLD built using EPROM or EEPROM transistors uses a passive pull-up (a resistor or current source), and these macrocells consume static power. Altera uses a switch called the Turbo Bit to control the current in the programmable-AND array in each macrocell. For the MAX 7000, static current varies between 1.4 mA and 2.2 mA per macrocell in high-power mode (the current depends on the part; generally, but not always, the larger 7000 parts have lower operating currents) and between 0.6 mA and 0.8 mA in low-power mode. For the MAX 9000, the static current is 0.6 mA per macrocell in high-power mode and 0.3 mA in low-power mode, independent of the part size. Since there are 16 macrocells in a LAB and up to 35 LABs on the largest MAX 9000 chip (16 × 35 = 560 macrocells), just the static power dissipation in low-power mode can be substantial (560 × 0.3 mA × 5 V = 840 mW). If all the macrocells are in high-power mode, the static power will double. This is the price you pay for having an (up to) 114-wide AND gate delay of a few nanoseconds (t_LAD = 4.0 ns) in the MAX 9000. For any MAX 9000 macrocell in the low-power mode it is necessary to add a delay of between 15 ns and 20 ns to any signal path through the local interconnect and logic array (including t_LAD and t_PEXP).

1. 1995 data book p. 274 (5000), p. 160 (7000), p. 126 (9000).
2. March 1995 data sheet, v2.
3. 1995 data book, p. 1-47.
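As a quick check of the static-power estimate for the largest MAX 9000 part, the sketch below repeats the multiplication using the per-macrocell currents and the 5 V supply quoted in the text. The names are illustrative only.

```python
MACROCELLS_PER_LAB = 16
LABS_LARGEST_MAX9000 = 35
VDD_V = 5.0
I_LOW_POWER_A = 0.3e-3   # static current per macrocell, low-power mode
I_HIGH_POWER_A = 0.6e-3  # static current per macrocell, high-power mode

def static_power_w(current_per_macrocell_a, labs=LABS_LARGEST_MAX9000):
    # Static power = number of macrocells x current per macrocell x supply.
    macrocells = labs * MACROCELLS_PER_LAB
    return macrocells * current_per_macrocell_a * VDD_V

print(static_power_w(I_LOW_POWER_A))   # about 0.84 W
print(static_power_w(I_HIGH_POWER_A))  # about 1.68 W
```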
5.5 Summary
Table 5.4 is a look-up table to Tables 5.5–5.9, which summarize the features of the logic cells used by the various FPGA vendors.

TABLE 5.4 Logic cell tables.

Table        Programmable ASIC family
Table 5.5    Actel (ACT 1); Xilinx (XC3000); Actel (ACT 2); Xilinx (XC4000)
Table 5.6    Altera MAX (EPM 5000); Xilinx EPLD (XC7200/7300); QuickLogic (pASIC 1)
Table 5.7    Crosspoint (CP20K); Altera MAX (EPM 7000); Atmel (AT6000)
Table 5.8    Actel (ACT 3); Xilinx LCA (XC5200); Altera FLEX (8000/10k)
Table 5.9    AMD MACH 5; Actel 3200DX; Altera MAX (EPM 9000)
TABLE 5.5 Logic cells used by programmable ASICs.

Basic logic cell
  Actel ACT 1: Logic Module (LM)
  Xilinx XC3000: CLB (Configurable Logic Block)
  Actel ACT 2: C-Module (combinatorial module) and S-Module (sequential module)
  Xilinx XC4000: CLB (Configurable Logic Block)

Logic cell contents
  Actel ACT 1: Three 2:1 MUXes plus OR gate
  Xilinx XC3000: 32-bit LUT, 2 D flip-flops, 9 MUXes
  Actel ACT 2: C-Module: 4:1 MUX, 2-input OR, 2-input AND. S-Module: 4-input MUX, 2-input OR, latch or D flip-flop
  Xilinx XC4000: 32-bit LUT, 2 D flip-flops, 10 MUXes, including fast carry logic. E-suffix parts contain dual-port RAM.

Logic path delay
  Actel ACT 1: Fixed
  Xilinx XC3000: Fixed with ability to bypass FF
  Actel ACT 2: Fixed
  Xilinx XC4000: Fixed with ability to bypass FF

Combinational logic functions
  Actel ACT 1: Most 3-input, many 4-input functions (total 702 macros)
  Xilinx XC3000: All 5-input functions plus 2 D flip-flops
  Actel ACT 2: Most 3- and 4-input functions (total 766 macros)
  Xilinx XC4000: Two 4-input LUTs plus combiner with ninth input. CLB as 32-bit SRAM (except D-suffix parts)

Flip-flop (FF) implementation
  Actel ACT 1: 1 LM required for latch, 2 LMs required for flip-flops
  Xilinx XC3000: 2 D flip-flops per CLB, latches can be built from pre-FF logic.
  Actel ACT 2: 1 S-Module per D flip-flop; some FFs require 2 modules.
  Xilinx XC4000: 2 D flip-flops per CLB

Basic logic cells in each chip
  Actel ACT 1: LMs: A1010: 352 (8 R × 44 C) = 295 + 57 I/O; A1020: 616 (14 R × 44 C) = 547 + 69 I/O
  Xilinx XC3000: 64 (XC3020/A/L, XC3120/A); 100 (XC3030/A/L, XC3130/A); 144 (XC3042/A/L, XC3142/A); 224 (XC3064/A/L, XC3164/A); 320 (XC3090/A/L, XC3190/A); 484 (XC3195/A)
  Actel ACT 2: A1225: 451 = 231 S + 220 C; A1240: 684 = 348 S + 336 C; A1280: 1232 = 624 S + 608 C
  Xilinx XC4000: 64 (XC4002A); 100 (XC4003/A/E/H); 144 (XC4004A); 196 (XC4005/A/E/H); 256 (XC4006/E); 324 (XC4008/E); 400 (XC4010/D/E); 576 (XC4013/D/E); 784 (XC4020/E); 1024 (XC4025/E)
TABLE 5.6 Logic cells used by programmable ASICs.

Basic logic cell
  Altera MAX 5000: 16 macrocells in a LAB (Logic Array Block) except EPM5032, which has 32 macrocells in a single LAB
  Xilinx XC7200/7300: 9 macrocells within a FB (Functional Block), fast FBs (FFBs) omit ALU
  QuickLogic pASIC 1: Logic Cell (LC)

Logic cell contents
  Altera MAX 5000: Macrocell: 64–106-wide AND, 3-wide OR array, 1 flip-flop, 2 MUXes, programmable inversion. 32–64 shared logic expander OR terms. LAB looks like a 32V16 PLD.
  Xilinx XC7200/7300: Macrocell: 21-wide AND, 16-wide OR array, 1 flip-flop, 1 ALU. FB looks like 21V9 PLD.
  QuickLogic pASIC 1: Four 2-input and two 6-input AND, three 2:1 MUXes and one D flip-flop

Logic path delay
  Altera MAX 5000: Fixed (unless using shared logic expanders)
  Xilinx XC7200/7300: Fixed
  QuickLogic pASIC 1: Fixed

Combinational logic functions per logic cell
  Altera MAX 5000: Wide input functions with ability to share product terms
  Xilinx XC7200/7300: Wide input functions with added 2-input ALU
  QuickLogic pASIC 1: All 3-input functions

Flip-flop (FF) implementation
  Altera MAX 5000: 1 D flip-flop or latch per macrocell. More can be constructed in arrays.
  Xilinx XC7200/7300: 1 D flip-flop or latch per macrocell
  QuickLogic pASIC 1: 1 D flip-flop per LC. LCs for other flip-flops not specified.

Basic logic cells in each chip
  Altera MAX 5000: LABs: 32 (EPM5032); 64 (EPM5064); 128 (EPM5128); 128 (EPM5130); 192 (EPM5192)
  Xilinx XC7200/7300: FBs: 4 (XC7236A); 8 (XC7272A); 2 (XC7318); 4 (XC7336); 6 (XC7354); 8 (XC7372); 12 (XC73108); 16 (XC73144)
  QuickLogic pASIC 1: 48 (QL6X8); 96 (QL8X12); 192 (QL12X16); 384 (QL16X24)
TABLE 5.7 Logic cells used by programmable ASICs.

Basic logic cell
  Crosspoint CP20K: Transistor-pair tile (TPT), RAM-logic Tile (RLT)
  Altera MAX 7k: 16 macrocells in a LAB (Logic Array Block)
  Atmel AT6000: Cell

Logic cell contents
  Crosspoint CP20K: TPT: 2 transistors (0.5 gate). RLT: 3 inverters, two 3-input NANDs, 2-input NAND, 2-input AND.
  Altera MAX 7k: Macrocell: wide AND, 5-wide OR array, 1 flip-flop, 3 MUXes, programmable inversion. 16 shared logic expander OR terms, plus parallel logic expander. LAB looks like a 36V16 PLD.
  Atmel AT6000: Two 5:1 MUXes, two 4:1 MUXes, 3:1 MUX, three 2:1 MUXes, 6 pass gates, four 2-input gates, 1 D flip-flop

Logic path delay
  Crosspoint CP20K: Variable
  Altera MAX 7k: Fixed (unless using shared logic expanders)
  Atmel AT6000: Variable

Combinational functions per logic cell
  Crosspoint CP20K: TPT is smaller than a gate, approx. 2 TPTs = 1 gate.
  Altera MAX 7k: Wide input functions with ability to share product terms
  Atmel AT6000: 1-, 2-, and 3-input combinational configurations: 44 logical states and 72 physical states

Flip-flop (FF) implementation
  Crosspoint CP20K: D flip-flop requires 2 RLTs and 9 TPTs
  Altera MAX 7k: 1 D flip-flop or latch per macrocell. More can be constructed in arrays.
  Atmel AT6000: 1 D flip-flop per cell

Basic logic cells in each chip
  Crosspoint CP20K: TPTs: 1760 (20220); 15,876 (22000). RLTs: 440 (20220); 3969 (22000)
  Altera MAX 7k: Macrocells: 32 (EPM7032/V); 64 (EPM7064); 96 (EPM7096); 128 (EPM70128E); 160 (EPM70160E); 192 (EPM70192E); 256 (EPM70256E)
  Atmel AT6000: 1024 (AT6002); 1600 (AT6003); 3136 (AT6005); 6400 (AT6010)
TABLE 5.8 Logic cells used by programmable ASICs.

Basic logic cell
  Actel ACT 3: 2 types of Logic Module: C-Module and S-Module (similar but not identical to ACT 2)
  Xilinx XC5200: 4 Logic Cells (LC) in a CLB (Configurable Logic Block)
  Altera FLEX 8000/10k: 8 Logic Elements (LE) in a Logic Array Block (LAB)

Logic cell contents (LUT = look-up table)
  Actel ACT 3: C-Module: 4:1 MUX, 2-input OR, 2-input AND. S-Module: 4:1 MUX, 2-input OR, latch or D flip-flop.
  Xilinx XC5200: LC has 16-bit LUT, 1 flip-flop (or latch), 4 MUXes
  Altera FLEX 8000/10k: 16-bit LUT, 1 programmable flip-flop or latch, MUX logic for control, carry logic, cascade logic

Logic path delay
  Actel ACT 3: Fixed
  Xilinx XC5200: Fixed
  Altera FLEX 8000/10k: Fixed with ability to bypass FF

Combinational functions per logic cell
  Actel ACT 3: Most 3- and 4-input functions (total 766 macros)
  Xilinx XC5200: One 4-input LUT per LC may be combined with adjacent LC to form 5-input LUT
  Altera FLEX 8000/10k: 4-input LUT may be cascaded with adjacent LE

Flip-flop (FF) implementation
  Actel ACT 3: 1 D flip-flop (or latch) per S-Module; some FFs require 2 modules.
  Xilinx XC5200: 1 D flip-flop (or latch) per LC (4 per CLB)
  Altera FLEX 8000/10k: 1 D flip-flop (or latch) per LE

Basic logic cells in each chip
  Actel ACT 3: A1415: 104 S + 96 C; A1425: 160 S + 150 C; A1440: 288 S + 276 C; A1460: 432 S + 416 C; A14100: 697 S + 680 C
  Xilinx XC5200: 64 CLB (XC5202); 120 CLB (XC5204); 196 CLB (XC5206); 324 CLB (XC5210); 484 CLB (XC5215)
  Altera FLEX 8000/10k: LEs: 208 (EPF8282/V/A/AV); 336 (EPF8452/A); 504 (EPF8636A); 672 (EPF8820/A); 1008 (EPF81188/A); 1296 (EPF81500/A); 576 (EPF10K10); 1152 (EPF10K20); 1728 (EPF10K30); 2304 (EPF10K40); 2880 (EPF10K50); 3744 (EPF10K70); 4992 (EPF10K100)

TABLE 5.9 Logic cells used by programmable ASICs.

Basic logic cell
  AMD MACH 5: 4 PAL Blocks in a Segment, 16 macrocells in a PAL Block
  Actel 3200DX: Based on ACT 2, plus D-module (decode) and dual-port SRAM
  Altera MAX 9000: 16 macrocells in a LAB (Logic Array Block)

Logic cell contents
  AMD MACH 5: 20-bit to 32-bit wide OR array, switching logic, XOR gate, programmable flip-flop
  Actel 3200DX: C-Module: 4:1 MUX, 2-input OR, 2-input AND. S-Module: 4-input MUX, 2-input OR, latch or D flip-flop. D-module: 7-input AND, 2-input XOR
  Altera MAX 9000: Macrocell: 114-wide AND, 5-wide OR array, 1 flip-flop, 5 MUXes, programmable inversion. 16 shared logic expander OR terms, plus parallel logic expander. LAB looks like a 49V16 PLD.

Logic path delay
  AMD MACH 5: Fixed
  Actel 3200DX: Fixed
  Altera MAX 9000: Fixed (unless using expanders)

Combinational functions per logic cell
  AMD MACH 5: Wide input functions
  Actel 3200DX: Most 3- and 4-input functions (total 766 macros)
  Altera MAX 9000: Wide input functions with ability to share product terms

Flip-flop (FF) implementation
  AMD MACH 5: 1 D flip-flop or latch per macrocell
  Actel 3200DX: 1 D flip-flop or latch per S-Module; some FFs require 2 modules.
  Altera MAX 9000: 1 D flip-flop or latch per macrocell. More can be constructed in arrays.

Basic logic cells in each chip
  AMD MACH 5: 128 (M5-128); 192 (M5-192); 256 (M5-256); 320 (M5-320); 384 (M5-384); 512 (M5-512)
  Actel 3200DX: A3265DX: 510 S + 475 C + 20 D; A32100DX: 700 S + 662 C + 20 D + 2 k SRAM; A32140DX: 954 S + 912 C + 24 D; A32200DX: 1230 S + 1184 C + 24 D + 2.5 k SRAM; A32300DX: 1888 S + 1833 C + 28 D + 3 k SRAM; A32400DX: 2526 S + 2466 C + 28 D + 4 k SRAM
  Altera MAX 9000: Macrocells: 320 (EPM9320), 4 × 5 LABs; 400 (EPM9400), 5 × 5 LABs; 480 (EPM9480), 6 × 5 LABs; 560 (EPM9560), 7 × 5 LABs
The key points in this chapter are:

• The use of multiplexers, look-up tables, and programmable logic arrays
• The difference between fine-grain and coarse-grain FPGA architectures
• Worst-case timing design
• Flip-flop timing
• Timing models
• Components of power dissipation in programmable ASICs
• Deterministic and nondeterministic FPGA architectures

Next, in Chapter 6, we shall examine the I/O cells used by the various programmable ASIC families.
PROGRAMMABLE ASIC I/O CELLS

All programmable ASICs contain some type of input/output cell ( I/O cell ). These I/O cells handle driving logic signals off-chip, receiving and conditioning external inputs, as well as handling such things as electrostatic protection. This chapter explains the different types of I/O cells that are used in programmable ASICs and their functions. The following are different types of I/O requirements.
• DC output. Driving a resistive load at DC or low frequency (less than 1 MHz). Example loads are light-emitting diodes (LEDs), relays, small motors, and such. Can we supply an output signal with enough voltage, current, power, or energy?
• AC output. Driving a capacitive load with a high-speed (greater than 1 MHz) logic signal off-chip. Example loads are other logic chips, a data or address bus, ribbon cable. Can we supply a valid signal fast enough?
• DC input. Example sources are a switch, sensor, or another logic chip. Can we correctly interpret the digital value of the input?
• AC input. Example sources are high-speed logic signals (higher than 1 MHz) from another chip. Can we correctly interpret the input quickly enough?
• Clock input. Examples are system clocks or signals on a synchronous bus. Can we transfer the timing information from the input to the appropriate places on the chip correctly and quickly enough?
• Power input. We need to supply power to the I/O cells and the logic in the core, without introducing voltage drops or noise. We may also need a separate power supply to program the chip.
These issues are common to all FPGAs (and all ICs) so that the design of FPGA I/O cells is driven by the I/O requirements as well as the programming technology.
6.1 DC Output
6.2 AC Output
6.3 DC Input
6.4 AC Input
6.5 Clock Input
6.6 Power Input
6.7 Xilinx I/O Block
6.8 Other I/O Cells
6.9 Summary
6.10 Problems
6.11 Bibliography
6.12 References

6.1 DC Output
Figure 6.1 shows a robot arm driven by three small motors together with switches to control the motors. The motor armature current varies between 50 mA and nearly 0.5 A when the motor is stalled. Can we replace the switches with an FPGA and drive the motors directly?
FIGURE 6.1 A robot arm. (a) Three small DC motors drive the arm. (b) Switches control each motor.
Figure 6.2 shows a CMOS complementary output buffer used in many FPGA I/O cells and its DC characteristics. Data books typically specify the output characteristics at two points, A (V_OHmin, I_OHmax) and B (V_OLmax, I_OLmax), as shown in Figure 6.2(d). As an example, values for the Xilinx XC5200 are as follows (footnote 1):

• V_OLmax = 0.4 V, low-level output voltage at I_OLmax = 8.0 mA.
• V_OHmin = 4.0 V, high-level output voltage at I_OHmax = −8.0 mA.
By convention the output current , I O , is positive if it flows into the output. Input
currents, if there are any, are positive if they flow into the inputs. The Xilinx XC5200 specifications show that the output buffer can force the output pad to 0.4 V or lower while sinking as much as 8 mA, if the load requires it. CMOS logic inputs that may be connected to the pad draw minute amounts of current, but bipolar TTL inputs can require several milliamperes. Similarly, when the output is 4 V, the buffer can source 8 mA. It is common to say that V_OLmax = 0.4 V and V_OHmin = 4.0 V for a technology without referring to the current values at which these are measured; strictly this is incorrect.
FIGURE 6.2 (a) A CMOS complementary output buffer. (b) Pull-down transistor M2 (M1 is off) sinks (to GND) a current I_OL through a pull-up resistor, R_1. (c) Pull-up transistor M1 (M2 is off) sources (from VDD) current I_OH (I_OH is negative) through a pull-down resistor, R_2. (d) Output characteristics.

If we force the output voltage, V_O, of an output buffer, using a voltage supply, and measure the output current, I_O, that results, we find that a buffer is capable of sourcing and sinking far more than the specified I_OHmax and I_OLmax values. Most vendors do not specify output characteristics because they are difficult to measure in production. Thus we normally do not know the value of I_OLpeak or I_OHpeak; typical values range from 50 to 200 mA.

Can we drive the motors by connecting several output buffers in parallel to reach a peak drive current of 0.5 A? Some FPGA vendors do specifically allow you to connect adjacent output cells in parallel to increase the output drive. If the output
cells are not adjacent or are on different chips, there is a risk of contention. Contention will occur if, due to delays in the signal arriving at two output cells, one output buffer tries to drive an output high while the other output buffer is trying to drive the same output low. If this happens we essentially short VDD to GND for a brief period. Although contention for short periods may not be destructive, it increases power dissipation and should be avoided (footnote 2).
It is thus possible to parallel outputs to increase the DC drive capability, but it is not a good idea to do so because we may damage or destroy the chip (by exceeding the maximum metal electromigration limits). Figure 6.3 shows an alternative: a simple circuit to boost the drive capability of the output buffers. If we need more power we could use two operational amplifiers (op-amps) connected as voltage followers in a bridge configuration. For even more power we could use discrete power MOSFETs or power op-amps.

FIGURE 6.3 A circuit to drive a small electric motor (0.5 A) using ASIC I/O buffers. Any npn transistors with a reasonable gain (about 100) that are capable of handling the peak current (0.5 A) will work with an output buffer that is capable of sourcing more than 5 mA. The 470 Ω resistors drop up to 5 V if an output buffer current approaches 10 mA, reducing the drive to the output transistors.
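The numbers in this section can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes the 8 mA XC5200 rating quoted earlier as the per-buffer current when estimating how many paralleled buffers the 0.5 A stall current would need, and it uses the gain of about 100 and the 470 Ω resistor from the Figure 6.3 caption. The function names are not from the original text.

```python
import math

def buffers_needed(load_current_a, per_buffer_current_a):
    # Number of paralleled output buffers needed to supply the load
    # (assumes each buffer is held to its specified DC rating).
    return math.ceil(load_current_a / per_buffer_current_a)

def base_current_a(collector_current_a, gain):
    # Base current an npn transistor needs from the output buffer.
    return collector_current_a / gain

def resistor_drop_v(current_a, resistance_ohm):
    # Voltage dropped across the series base resistor (Ohm's law).
    return current_a * resistance_ohm

print(buffers_needed(0.5, 8e-3))     # 63 buffers at 8 mA each
print(base_current_a(0.5, 100))      # 0.005 A, i.e. 5 mA of base drive
print(resistor_drop_v(10e-3, 470))   # about 4.7 V across 470 ohms at 10 mA
```

The first figure suggests why paralleling outputs is unattractive here, while the transistor circuit needs only about 5 mA from a single buffer.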
6.1.1 Totem-Pole Output

Figure 6.4 (a) and (b) shows a totem-pole output buffer and its DC characteristics. It is similar to the TTL totem-pole output from which it gets its name (the totem-pole circuit has two stacked transistors of the same type, whereas a complementary output uses transistors of opposite types). The high-level voltage, V OHmin , for a totem pole
is lower than VDD. Typically V_OHmin is in the range of 3.5 V to 4.0 V (with VDD = 5 V), which makes rising and falling delays more symmetrical and more closely matches TTL voltage levels. The disadvantage is that the totem pole will typically only drive the output as high as 3 V to 4 V, so this would not be a good choice of FPGA output buffer to work with the circuit shown in Figure 6.3.
FIGURE 6.4 Output buffer characteristics. (a) A CMOS totem-pole output stage (both M1 and M2 are n-channel transistors). (b) Totem-pole output characteristics. (c) Clamp diodes, D1 and D2, in an output buffer (these diodes are present in all output buffers, totem-pole or complementary). (d) The clamp diodes start to conduct as the output voltage exceeds the supply voltage bounds.
6.1.2 Clamp Diodes

Figure 6.4(c) shows the connection of clamp diodes (D1 and D2) that prevent voltage excursions at the I/O pad greater than V_DD and less than V_SS. Figure 6.4(d) shows the resulting characteristics.

1. XC5200 data sheet, October 1995 (v. 3.0).
2. Actel specifies a maximum I/O current of 20 mA for ACT3 family (1994 data book, p. 1-93) and its ES family. Altera specifies the maximum DC output current per pin, for example 25 mA for the FLEX 10k (July 1995, v. 1 data sheet, p. 42).
6.2 AC Output
Figure 6.5 shows an example of an off-chip three-state bus. Chips that have inputs and outputs connected to a bus are called bus transceivers . Can we use FPGAs to perform the role of bus transceivers? We will focus on one bit, B1, on bus BUSA, and we shall call it BUSA.B1. We need unique names to refer to signals on each chip; thus CHIP1.OE means the signal OE inside CHIP1. Notice that CHIP1.OE is not connected to CHIP2.OE.
FIGURE 6.5 A three-state bus. (a) Bus parasitic capacitance. (b) The output buffers in each chip. The ASIC CHIP1 contains a bus keeper, BK1.
Figure 6.6 shows the timing of part of a bus transaction (a sequence of signals on a bus):

1. Initially CHIP2 drives BUSA.B1 high (CHIP2.D1 is '1' and CHIP2.OE is '1').
2. The buffer output enable on CHIP2 (CHIP2.OE) goes low, floating the bus. The bus will stay high because we have a bus keeper, BK1.
3. The buffer output enable on CHIP3 (CHIP3.OE) goes high and the buffer drives a low onto the bus (CHIP3.D1 is '0').

We wish to calculate the delays involved in driving the off-chip bus in Figure 6.6. In order to find t_float, we need to understand how Actel specifies the delays for its I/O cells. Figure 6.7(a) shows the circuit used for measuring I/O delays for the ACT FPGAs. These measurements do not use the same trip points that are used to characterize the internal logic (Actel uses input and output trip points of 0.5 for internal logic delays).
FIGURE 6.6 Three-state bus timing for Figure 6.5 . The on-chip delays, t 2OE and t 3OE , for the logic that generates signals CHIP2.E1 and CHIP3.E1 are derived from the timing models described in Chapter 5 (the minimum values for each chip would be the clock-to-Q delay times).
FIGURE 6.7 (a) The test circuit for characterizing the ACT 2 and ACT 3 I/O delay parameters. (b) Output buffer propagation delays from the data input to PAD (output enable, E, is high). (c) Three-state delay with D low. (d) Three-state delay with D high. Delays are shown for ACT 2 'Std' speed grade, worst-case commercial conditions (R_L = 1 kΩ, C_L = 50 pF, V_OHmin = 2.4 V, V_OLmax = 0.5 V). (The Actel three-state buffer is named TRIBUFF, an input buffer INBUF, and the output buffer, OUTBUF.)

Notice in Figure 6.7(a) that when the output enable E is '0' the output is three-stated (high-impedance or hi-Z). Different companies use different polarity and naming conventions for the output enable signal on a three-state buffer. To measure the buffer delay (measured from the change in the enable signal, E) Actel uses a resistor load (R_L = 1 kΩ for ACT 2). The resistor pulls the buffer output high or low depending on whether we are measuring:
• t_ENZL, when the output switches from hi-Z to '0'.
• t_ENLZ, when the output switches from '0' to hi-Z.
• t_ENZH, when the output switches from hi-Z to '1'.
• t_ENHZ, when the output switches from '1' to hi-Z.
Other vendors specify the time to float a three-state output buffer directly (t_fr and t_ff in Figure 6.7c and d). This delay time has different names (and definitions): disable time, time to begin hi-Z, or time to turn off. Actel does not specify the time to float but, since R_L C_L = 50 ns, we know t_RC = −R_L C_L ln 0.9, or approximately 5.3 ns. Now we can estimate that t_fr = t_ENLZ − t_RC = 11.1 − 5.3 = 5.8 ns, and t_ff = 9.4 − 5.3 = 4.1 ns, and thus the Actel buffer can float the bus in t_float = 4.1 ns (Figure 6.6).

The Xilinx FPGA is responsible for the second part of the bus transaction. The time to make the buffer CHIP2.B1 active is t_active. Once the buffer is active, the output transistors turn on, conducting a current I_peak. The output voltage V_O across the load capacitance, C_BUS, will slew or change at a steady rate, dV_O/dt = I_peak / C_BUS; thus t_slew = C_BUS ΔV_O / I_peak, where ΔV_O is the change in output voltage.

Vendors do not always provide enough information to calculate t_active and t_slew separately, but we can usually estimate their sum. Xilinx specifies the time from the three-state input switching to the time the pad is active and valid for an XC3000-125 switching with a 50 pF load, to be t_active = t_TSON = 11 ns (fast option), and 27 ns (slew-rate limited option). If we need to drive the bus in less than one clock cycle (30 ns), we will definitely need to use the fast option.

A supplement to the XC3000 timing data specifies the additional delay for switching large capacitive loads (above 50 pF) as R_fall = 0.06 ns pF⁻¹ (falling) and R_rise = 0.12 ns pF⁻¹ (rising) using the fast output option. We can thus estimate that I_peak ≈ (5 V)/(0.06 × 10³ s F⁻¹) ≈ 84 mA (falling) and I_peak ≈ (5 V)/(0.12 × 10³ s F⁻¹) ≈ 42 mA (rising). Now we can calculate t_slew = R_fall (C_BUS − 50 pF) = (90 pF − 50 pF)(0.06 ns pF⁻¹), or 2.4 ns, for a total falling delay of 11 + 2.4 = 13.4 ns. The rising delay is slower at 11 + (40 pF)(0.12 ns pF⁻¹), or 15.8 ns. This leaves (30 − 15.8) ns, or about 14 ns worst-case, to generate the output enable signal CHIP2.OE (t_3OE in Figure 6.6) and still leave time t_spare before the bus data is latched on the next clock edge. We can thus probably use an XC3000 part for a 30 MHz bus transceiver, but only if we use the fast slew-rate option.

An aside: Our example looks a little like the PCI bus used on Pentium and PowerPC systems, but the bus transactions are simplified. PCI buses use a sustained three-state system (s/t/s). On the PCI bus an s/t/s driver must drive the bus high for at least one clock cycle before letting it float. A new driver may not start driving the bus until a clock edge after the previous driver floats it. After such a turnaround cycle a new driver will always find the bus parked high.
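The bus-timing estimates in this section follow from a handful of formulas, sketched below with the values quoted in the text (R_L = 1 kΩ, C_L = 50 pF, a 90 pF bus, t_TSON = 11 ns, and the 0.06 and 0.12 ns/pF slopes). The function names are illustrative; the peak-current function simply inverts the slope (1 ns/pF = 10³ s/F), so it returns values close to, but not exactly, the rounded 84 mA and 42 mA figures above.

```python
import math

def rc_delay_ns(r_ohm=1e3, c_farad=50e-12, fraction=0.9):
    # Time for the RC load to move the pad by 10%: t_RC = -R C ln(0.9).
    return -r_ohm * c_farad * math.log(fraction) * 1e9

def float_time_ns(t_enable_to_hiz_ns):
    # Time to float = specified disable delay minus the RC load delay.
    return t_enable_to_hiz_ns - rc_delay_ns()

def slew_delay_ns(c_bus_pf, slope_ns_per_pf, c_ref_pf=50.0):
    # Extra delay for capacitance above the 50 pF reference load.
    return (c_bus_pf - c_ref_pf) * slope_ns_per_pf

def peak_current_a(slope_ns_per_pf, delta_v=5.0):
    # I_peak = delta_V / slope, with the slope converted to s/F.
    return delta_v / (slope_ns_per_pf * 1e3)

T_TSON_NS = 11.0  # fast-option three-state enable time quoted in the text
print(round(rc_delay_ns(), 1))                         # 5.3
print(round(float_time_ns(11.1), 1))                   # 5.8
print(round(float_time_ns(9.4), 1))                    # 4.1
print(round(T_TSON_NS + slew_delay_ns(90, 0.06), 1))   # 13.4
print(round(T_TSON_NS + slew_delay_ns(90, 0.12), 1))   # 15.8
```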
6.2.1 Supply Bounce

Figure 6.8 (a) shows an n -channel transistor, M1, that is part of an output buffer driving an output pad, OUT1; M2 and M3 form an inverter connected to an input pad, IN1; and M4 and M5 are part of another output buffer connected to an output pad, OUT2. As M1 sinks current pulling OUT1 low ( V o 1 in Figure 6.8 b), a substantial current I OL may flow in the resistance, R S , and inductance, L S , that are between the on-chip GND net and the off-chip, external ground connection.
FIGURE 6.8 Supply bounce. (a) As the pull-down device, M1, switches, it causes the GND net (value V_SS) to bounce. (b) The supply bounce is dependent on the output slew rate. (c) Ground bounce can cause other output buffers to generate a logic glitch. (d) Bounce can also cause errors on other inputs.

The voltage drop across R_S and L_S causes a spike (or transient) on the GND net, changing the value of V_SS, leading to a problem known as supply bounce. The situation is illustrated in Figure 6.8(a), with V_SS bouncing to a maximum of V_OLP. This ground bounce causes the voltage at the output, V_o2, to bounce also. If the threshold of the gate that OUT2 is driving is a TTL level at 1.4 V, for example, a ground bounce of more than 1.4 V will cause a logic high glitch (a momentary transition from one logic level to the opposite logic level and back again).

Ground bounce may also cause problems at chip inputs. Suppose the inverter M2/M3 is set to have a TTL threshold of 1.4 V and the input, IN1, is at a fixed voltage equal to 3 V (a respectable logic high for bipolar TTL). In this case a ground bounce of greater than 1.6 V will cause the input, IN1, to see a logic low instead of a high and a glitch will be generated on the inverter output, I1. Supply bounce can also occur on the VDD net, but this is usually less severe because the pull-up transistors in an output buffer are usually weaker than the pull-down transistors. The risk of
generating a glitch is also greater at the low logic level for TTL-threshold inputs and TTL-level outputs because the low-level noise margins are smaller than the high-level noise margins in TTL.

Sixteen simultaneously switching outputs (SSOs), with each output driving 150 pF on a bus, can generate a ground bounce of 1.5 V or more. We cannot simulate this problem easily with FPGAs because we are not normally given the characteristics of the output devices. As a rule of thumb we wish to keep ground bounce below 1 V. To help do this we can limit the maximum number of SSOs, and we can limit the number of I/O buffers that share GND and VDD pads.

To further reduce the problem, FPGAs now provide options to limit the current flowing in the output buffers, reducing the slew rate and slowing them down. Some FPGAs also have quiet I/O circuits that sense when the input to an output buffer changes. The quiet I/O then starts to change the output using small transistors; shortly afterwards the large output transistors drop in. As the output approaches its final value, the large transistors kick out, reducing the supply bounce.
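The two ground-bounce failure conditions described in this section reduce to simple threshold comparisons. The sketch below is a simplified model, not a circuit simulation: it assumes a quiet low output sits at 0 V before the bounce (an assumption, since the text does not give the value) and uses the 1.4 V TTL threshold and 3 V input level from the text. The function names are illustrative.

```python
TTL_THRESHOLD_V = 1.4  # TTL switching threshold used in the text

def output_glitches(bounce_v, v_out_low=0.0, threshold_v=TTL_THRESHOLD_V):
    # A quiet low output rides up with the on-chip ground; the gate it
    # drives sees a false high once the pad exceeds that gate's threshold.
    return v_out_low + bounce_v > threshold_v

def input_misread(bounce_v, v_in=3.0, threshold_v=TTL_THRESHOLD_V):
    # The input inverter sees the pad voltage relative to the bounced
    # ground, so a high input looks low once the difference drops below
    # the threshold.
    return v_in - bounce_v < threshold_v

print(output_glitches(1.5))  # True: 1.5 V of bounce exceeds 1.4 V
print(input_misread(1.5))    # False: 3 V - 1.5 V is still above 1.4 V
print(input_misread(1.7))    # True: bounce greater than 1.6 V
```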
6.2.2 Transmission Lines

Most of the problems with driving large capacitive loads at high speed occur on a bus, and in this case we may have to consider the bus as a transmission line. Figure 6.9(a) shows how a transmission line appears to a driver, D1, and receiver, R1, as a constant impedance, the characteristic impedance of the line, Z_0. For a typical PCB trace, Z_0 is between 50 Ω and 100 Ω.
FIGURE 6.9 Transmission lines. (a) A printed-circuit board (PCB) trace is a transmission (TX) line. (b) A driver launches an incident wave, which is reflected at the end of the line. (c) A connection starts to look like a transmission line when the signal rise time is about equal to twice the line delay (2 t_f).

The voltages on a transmission line are determined by the value of the driver source resistance, R_0, and the way that we terminate the end of the transmission line. In Figure 6.9(a) the termination is just the capacitance of the receiver, C_in. As the driver switches between 5 V and 0 V, it launches a voltage wave down the line, as shown in Figure 6.9(b). The wave will be Z_0 / (R_0 + Z_0) times 5 V in magnitude, so that if R_0 is equal to Z_0, the wave will be 2.5 V.

Notice that it does not matter what is at the far end of the line. The bus driver sees only Z_0 and not C_in. Imagine the transmission line as a tunnel; all the bus driver can see at the entrance is a little way into the tunnel, which could be 500 m or 5 km long. To find out, we have to go with the wave to the end, turn around, come back, and tell the bus driver. The final result will be the same whether the transmission line is there or not, but with a transmission line it takes a little longer for the voltages and currents to settle down. This is rather like the difference between having a conversation by telephone or by post.

The propagation delay (or time of flight), t_f, for a typical PCB trace is approximately 1 ns for every 15 cm of trace (the signal velocity is about one-half the speed of light).
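The incident-wave magnitude and the time-of-flight rule of thumb can be expressed as a short sketch. It uses only the relations stated above (the Z_0 / (R_0 + Z_0) divider and 1 ns per 15 cm of trace); the function names and the 30 cm example trace are illustrative assumptions.

```python
def incident_wave_v(v_step, r0_ohm, z0_ohm):
    # The driver sees a divider formed by its source resistance and Z0.
    return v_step * z0_ohm / (r0_ohm + z0_ohm)

def time_of_flight_ns(length_cm, cm_per_ns=15.0):
    # Rule of thumb from the text: about 1 ns for every 15 cm of PCB trace.
    return length_cm / cm_per_ns

def looks_like_transmission_line(rise_time_ns, length_cm):
    # Transmission-line behavior appears when the driver rise time is
    # shorter than the round-trip delay, 2 * t_f.
    return rise_time_ns < 2.0 * time_of_flight_ns(length_cm)

print(incident_wave_v(5.0, 50.0, 50.0))         # 2.5 (V) when R0 equals Z0
print(time_of_flight_ns(30.0))                  # 2.0 (ns) for a 30 cm trace
print(looks_like_transmission_line(1.0, 30.0))  # True: 1 ns < 4 ns
```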
A voltage wave launched on a transmission line takes a time t f to get to the end of the line, where it finds the load capacitance, C in . Since no current can flow at this point, there must be a reflection that exactly cancels the incident wave so that the voltage at the input to the receiver, at V 2 , becomes exactly zero at time t f . The reflected wave travels back down the line and finally causes the voltage at the output of the driver, at V 1 , to be exactly zero at time 2 t f . In practice the nonidealities of the driver and the line cause the waves to have finite rise times. We start to see transmission line behavior if the rise time of the driver is less than 2 t f , as shown in Figure 6.9 (c). There are several ways to terminate a transmission line. Figure 6.10 illustrates the following methods:
• Open-circuit or capacitive termination. The bus termination is the input capacitance of the receivers (usually less than 20 pF). The PCI bus uses this method.
• Parallel resistive termination. This requires substantial DC current (5 V / 100 Ω = 50 mA for a 100 Ω line). It is used by bipolar logic, for example emitter-coupled logic (ECL), where we typically do not care how much power we use.
• Thévenin termination. Connecting 300 Ω in parallel with 150 Ω across a 5 V supply is equivalent to a 100 Ω termination connected to a 1.6 V source. This reduces the DC current drain on the drivers but adds a resistance directly across the supply.
• Series termination at the source. Adding a resistor in series with the driver so that the sum of the driver source resistance (which is usually 50 Ω or even less) and the termination resistor matches the line impedance (usually around 100 Ω). The disadvantage is that it generates reflections that may be close to the switching threshold.
• Parallel termination with a voltage bias. This is awkward because it requires a third supply and is normally used only for a specialized high-speed bus.
• Parallel termination with a series capacitance. This removes the requirement for DC current but introduces other problems.
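The resistor arithmetic in the termination list can be reproduced with a short sketch. We assume here that the 300 Ω resistor goes to the 5 V rail and the 150 Ω resistor to ground, which is the arrangement that gives an equivalent source of roughly 1.6 V; the function name is ours.

```python
# Thevenin equivalent of a resistive divider used as a line termination.

def thevenin(v_supply, r_to_supply, r_to_ground):
    """Return (equivalent voltage, equivalent resistance) of the divider."""
    total = r_to_supply + r_to_ground
    v_th = v_supply * r_to_ground / total
    r_th = r_to_supply * r_to_ground / total
    return v_th, r_th

v_th, r_th = thevenin(5.0, 300.0, 150.0)      # assumed orientation, see above
i_parallel_ma = 5.0 / 100.0 * 1e3             # simple parallel termination
i_standing_ma = 5.0 / (300.0 + 150.0) * 1e3   # standing current in the divider

print(f"Thevenin: {v_th:.2f} V behind {r_th:.0f} ohms")
print(f"parallel termination draws {i_parallel_ma:.0f} mA at 5 V")
print(f"divider draws {i_standing_ma:.1f} mA from the supply")
```

The divider presents 100 Ω to the line while drawing a standing current of about 11 mA from the supply, which is the resistance directly across the supply that the text mentions.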
FIGURE 6.10 Transmission line termination. (a) Open-circuit or capacitive termination. (b) Parallel resistive termination. (c) Thévenin termination. (d) Series termination at the source. (e) Parallel termination using a voltage bias. (f) Parallel termination with a series capacitor.
Until recently most bus protocols required strong bipolar or BiCMOS output buffers capable of driving all the way between logic levels. The PCI standard uses weaker CMOS drivers that rely on reflection from the end of the bus to allow the intermediate receivers to see the full logic value. Many FPGA vendors now offer complete PCI functions that the ASIC designer can drop in to an FPGA [PCI, 1995].
An alternative to using a transmission line that operates across the full swing of the supply voltage is to use current-mode signaling or differential signals with low-voltage swings. These and other techniques are used in specialized bus structures and in high-speed DRAM. Examples are Rambus and Gunning transistor logic ( GTL ). These are analog rather than digital circuits, but ASIC methods apply if the interface circuits are available as cells, hiding some of the complexity from the designer. For example, Rambus offers a Rambus access cell ( RAC ) for standard-cell design (but not yet for an FPGA). Directions to more information on these topics are in the bibliography at the end of this chapter.
1. 1994 data book, p. 2-159.
2. Application Note XAPP 024.000, Additional XC3000 Data, 1994 data book p. 815.
6.3 DC Input
Suppose we have a pushbutton switch connected to the input of an FPGA as shown in Figure 6.11 (a). Most FPGA input pads are directly connected to a buffer. We need to ensure that the input of this buffer never floats to a voltage between valid logic levels (which could cause both n -channel and p -channel transistors in the buffer to turn on, leading to oscillation or excessive power dissipation) and so we use the optional pull-up resistor (usually about 100 kΩ) that is available on many FPGAs (we could also connect a 1 kΩ pull-up or pull-down resistor externally). Contacts may bounce as a switch is operated ( Figure 6.11 b). In the case of a Xilinx XC4000 the effective pull-up resistance is 5 to 50 kΩ (since the specified pull-up current is between 0.2 and 2.0 mA) and forms an RC time constant with the parasitic capacitance of the input pad and the external circuit. This time constant (typically hundreds of nanoseconds) will normally be much less than the time over which the contacts bounce (typically many milliseconds). The buffer output may thus be a series of pulses extending for several milliseconds. It is up to you to deal with this in your logic. For example, you may want to debounce the waveform in Figure 6.11 (b) using an SR flip-flop.
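To see why the pull-up cannot filter out contact bounce, we can compare the two time scales. The sketch below assumes a 50 kΩ pull-up, 10 pF of parasitic capacitance, and 5 ms of bounce; all three numbers are illustrative assumptions chosen to fall in the ranges the text describes.

```python
# Compare the pull-up RC time constant with the duration of contact bounce.

def rc_time_constant_ns(r_ohm, c_farad):
    """RC time constant in nanoseconds."""
    return r_ohm * c_farad * 1e9

tau_ns = rc_time_constant_ns(50e3, 10e-12)  # assumed 50 kohm and 10 pF
bounce_ns = 5e-3 * 1e9                      # assumed 5 ms of bounce
ratio = bounce_ns / tau_ns

print(f"RC time constant = {tau_ns:.0f} ns")
print(f"bounce lasts about {ratio:.0f} time constants")
```

Because the bounce lasts thousands of time constants, the input follows every bounce and the cleanup has to happen in logic.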
FIGURE 6.11 A switch input. (a) A pushbutton switch connected to an input buffer with a pull-up resistor. (b) As the switch bounces several pulses may be generated.
A bouncing switch may create a noisy waveform in the time domain; we may also have noise in the voltage level of our input signal. The Schmitt-trigger inverter in Figure 6.12 (a) has a lower switching threshold of 2 V and an upper switching threshold of 3 V. The difference between these thresholds is the hysteresis , equal to 1 V in this case. If we apply the noisy waveform shown in Figure 6.12 (b) to an inverter with no hysteresis, there will be a glitch at the output, as shown in Figure 6.12 (c). As long as the noise on the waveform does not exceed the hysteresis, the Schmitt-trigger inverter will produce the glitch-free output of Figure 6.12 (d). Most FPGA input buffers have a small hysteresis (the 200 mV that Xilinx uses is a typical figure) centered around 1.4 V (for compatibility with TTL), as shown in Figure 6.12 (e). Notice that the drawing inside the symbol for a Schmitt trigger looks like the transfer characteristic for a buffer, but is backward for an inverter. Hysteresis in the input buffer also helps prevent oscillation and noise problems with inputs that have slow rise times, though most FPGA manufacturers still have a restriction that input signals must have a rise time faster than several hundred nanoseconds.
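A small behavioral model shows the effect of hysteresis. The sketch uses the 2 V and 3 V thresholds from the text; the single 2.5 V threshold for the plain inverter and the sample waveform are our own assumptions, and the model ignores all analog detail.

```python
# Behavioral comparison of a plain inverter and a Schmitt-trigger inverter.

def plain_inverter(samples, threshold=2.5):
    """Inverter with a single switching threshold (assumed 2.5 V)."""
    return [0 if v > threshold else 1 for v in samples]

def schmitt_inverter(samples, lower=2.0, upper=3.0, initial_output=1):
    """Inverter that switches low above `upper` and high again below `lower`."""
    state = initial_output
    outputs = []
    for v in samples:
        if state == 1 and v > upper:
            state = 0
        elif state == 0 and v < lower:
            state = 1
        outputs.append(state)
    return outputs

def count_transitions(bits):
    """Number of output changes; more than one on a single edge means glitches."""
    return sum(1 for a, b in zip(bits, bits[1:]) if a != b)

# A slowly rising input with noise that stays inside the 1 V hysteresis band.
noisy_rise = [0.0, 1.0, 2.0, 2.6, 2.4, 2.7, 2.3, 2.8, 3.2, 4.0, 5.0]

plain_edges = count_transitions(plain_inverter(noisy_rise))
schmitt_edges = count_transitions(schmitt_inverter(noisy_rise))
print("plain inverter transitions:  ", plain_edges)
print("Schmitt inverter transitions:", schmitt_edges)
```

The plain inverter toggles each time the noise crosses its threshold, while the Schmitt-trigger model changes state once.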
FIGURE 6.12 DC input. (a) A Schmitt-trigger inverter. (b) A noisy input signal. (c) Output from an inverter with no hysteresis. (d) Hysteresis helps prevent glitches. (e) A typical FPGA input buffer with a hysteresis of 200 mV centered around a threshold of 1.4 V.
6.3.1 Noise Margins

Figure 6.13 (a) and (b) show the worst-case DC transfer characteristics of a CMOS inverter. Figure 6.13 (a) shows a situation in which the process and device sizes create the lowest possible switching threshold. We define the maximum voltage that will be recognized as a '0' as the point at which the gain ( V out / V in ) of the inverter is −1. This point is V ILmax = 1 V in the example shown in Figure 6.13 (a). This means that any input voltage that is lower than 1 V will definitely be recognized as a '0', even with the most unfavorable inverter characteristics. At the other worst-case extreme we define the minimum voltage that will be recognized as a '1' as V IHmin = 3.5 V (for the example in Figure 6.13 b).
FIGURE 6.13 Noise margins. (a) Transfer characteristics of a CMOS inverter with the lowest switching threshold. (b) The highest switching threshold. (c) A graphical representation of CMOS logic thresholds. (d) Logic thresholds at the inputs and outputs of a logic gate or an ASIC. (e) The switching thresholds viewed as a plug and socket. (f) CMOS plugs fit CMOS sockets and the clearances are the noise margins.
Figure 6.13 (c) depicts the following relationships between the various voltage levels at the inputs and outputs of a logic gate:
• A logic '1' output must be between V OHmin and V DD .
• A logic '0' output must be between V SS and V OLmax .
• A logic '1' input must be above the high-level input voltage , V IHmin .
• A logic '0' input must be below the low-level input voltage , V ILmax .
• Clamp diodes prevent an input exceeding V DD or going lower than V SS .
The voltages, V OHmin , V OLmax , V IHmin , and V ILmax , are the logic thresholds for a technology. A logic signal outside the areas bounded by these logic thresholds is bad: an unrecognizable logic level in an electronic no-man's land. Figure 6.13 (d) shows typical logic thresholds for a CMOS-compatible FPGA. The V IHmin and V ILmax logic thresholds come from measurements in Figure 6.13 (a) and (b) and V OHmin and V OLmax come from the measurements shown in Figure 6.2 (c). Figure 6.13 (d) illustrates how logic thresholds form a plug and socket for any gate, group of gates, or even a chip. If a plug fits a socket, we can connect the two components together and they will have compatible logic levels. For example, Figure 6.13 (e) shows that we can connect two CMOS gates or chips together.
FIGURE 6.14 TTL and CMOS logic thresholds. (a) TTL logic thresholds. (b) Typical CMOS logic thresholds. (c) A TTL plug will not fit in a CMOS socket. (d) Raising V OHmin solves the problem.
Figure 6.13 (f) shows that we can even add some noise that shifts the input levels and the plug will still fit into the socket. In fact, we can shift the plug down by exactly V OHmin − V IHmin (4.5 − 3.5 = 1 V) and still maintain a valid '1'. We can shift the plug up by V ILmax − V OLmax (1.0 − 0.5 = 0.5 V) and still maintain a valid '0'. These clearances between plug and socket are the noise margins :
\[ V_{NMH} = V_{OHmin} - V_{IHmin} \quad \text{and} \quad V_{NML} = V_{ILmax} - V_{OLmax} . \tag{6.1} \]
For two logic systems to be compatible, the plug must fit the socket. This requires both the high-level noise margin (V NMH ) and the low-level noise margin (V NML ) to be positive. We also want both noise margins to be as large as possible to give us maximum immunity from noise and other problems at an interface.
Figure 6.14 (a) and (b) show the logic thresholds for TTL together with typical CMOS logic thresholds. Figure 6.14 (c) shows the problem with trying to plug a TTL chip into a CMOS input level: the lowest permissible TTL output level, V OHmin = 2.7 V, is too low to be recognized as a logic '1' by the CMOS input. This is fixed by most FPGA manufacturers by raising V OHmin to around 3.8 to 4.0 V ( Figure 6.14 d). Table 6.1 lists the logic thresholds for several FPGAs.
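The noise-margin definitions of Eq. 6.1 are easy to evaluate for the cases discussed above. In this sketch the function names are ours; the threshold values are the CMOS example values and the TTL V OHmin quoted in the text, and 3.8 V is the lower end of the raised V OHmin range.

```python
# Noise margins from Eq. 6.1: V_NMH = V_OHmin - V_IHmin, V_NML = V_ILmax - V_OLmax.

def high_noise_margin(v_oh_min, v_ih_min):
    """High-level noise margin of a driver (V_OHmin) into a receiver (V_IHmin)."""
    return v_oh_min - v_ih_min

def low_noise_margin(v_il_max, v_ol_max):
    """Low-level noise margin of a driver (V_OLmax) into a receiver (V_ILmax)."""
    return v_il_max - v_ol_max

# CMOS driving CMOS, using the example thresholds in the text.
cmos_nmh = high_noise_margin(4.5, 3.5)
cmos_nml = low_noise_margin(1.0, 0.5)

# TTL output (V_OHmin = 2.7 V) driving a CMOS input (V_IHmin = 3.5 V).
ttl_to_cmos_nmh = high_noise_margin(2.7, 3.5)

# The same interface after the driver's V_OHmin is raised to 3.8 V.
raised_nmh = high_noise_margin(3.8, 3.5)

print(f"CMOS to CMOS: NMH = {cmos_nmh:.1f} V, NML = {cmos_nml:.1f} V")
print(f"TTL to CMOS:  NMH = {ttl_to_cmos_nmh:.1f} V (negative, plug does not fit)")
print(f"Raised V_OHmin: NMH = {raised_nmh:.1f} V")
```

A negative margin is the numerical form of the plug not fitting the socket in Figure 6.14 (c).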
6.3.2 Mixed-Voltage Systems

To reduce power consumption and allow CMOS logic to be scaled below 0.5 µm it is necessary to reduce the power supply voltage below 5 V. The JEDEC 8 [ JEDEC I/O] series of standards sets the next lower supply voltage as 3.3 ± 0.3 V. Figure 6.15 (a) and (b) shows that the 3 V CMOS I/O logic-thresholds can be made compatible with 5 V systems. Some FPGAs can operate on both 3 V and 5 V supplies, typically using one voltage for internal (or core) logic, V DDint and another for the I/O circuits, V DDI/O ( Figure 6.15 c).

TABLE 6.1 FPGA logic thresholds.

                                   Input levels       Output levels (high current)       Output levels (low current)
Device        I/O options          V IH      V IL     V OH      I OH    V OL    I OL      V OH      I OH    V OL    I OL
              (input, output)      (min)     (max)    (min)     (max)   (max)   (max)     (min)     (max)   (max)   (max)
XC3000 [1]    TTL                  2.0       0.8      3.86      −4.0    0.40    4.0
              CMOS                 3.85 [2]  0.9 [3]  3.86      −4.0    0.40    4.0
XC3000L                            2.0       0.8      2.40      −4.0    0.40    4.0       2.80 [4]  −0.1    0.2     0.1
XC4000 [5]                         2.0       0.8      2.40      −4.0    0.40    12.0
XC4000H [6]   TTL, TTL             2.0       0.8      2.40      −4.0    0.50    24.0
              CMOS, CMOS           3.85 [2]  0.9 [3]  4.00 [7]  −1.0    0.50    24.0
XC8100 [8]    TTL, R               2.0       0.8      3.86      −4.0    0.50    24.0
              CMOS, C              3.85 [2]  0.9 [3]  3.86      −4.0    0.40    4.0
ACT 2/3                            2.0       0.8      2.4       −8.0    0.50    12.0      3.84      −4.0    0.33    6.0
FLEX10k [9]   3V/5V                2.0       0.8      2.4       −4.0    0.45    12.0

There is one problem when we mix 3 V and 5 V supplies that is shown in Figure 6.15 (d). If we apply a voltage to a chip input that exceeds the power supply of a chip, it is possible to power a chip inadvertently through the clamp diodes. In the worst case this may cause a voltage as high as 2.5 V (= 5.5 V − 3.0 V) to appear across the clamp diode, which will cause a very large current (several hundred milliamperes) to flow. One way to prevent damage is to include a series resistor between the chips, typically around 1 kΩ. This solution does not work for all chips in all systems. A difficult problem in ASIC I/O design is constructing 5 V-tolerant I/O . Most solutions may never surface (there is little point in patenting a solution to a problem that will go away before the patent is granted). Similar problems can arise in several other situations:
• when you connect two ASICs with different 5 V supplies;
• when you power down one ASIC in a system but not another, or one ASIC powers down faster than another;
• on system power-up or system reset.
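The worst-case clamp-diode voltage and the effect of the series resistor can be estimated as follows. This is a rough sketch: the 0.7 V diode forward drop is our assumption (the text does not give one), and the diode is modeled as a fixed drop rather than a real exponential characteristic.

```python
# Worst-case overdrive across an input clamp diode in a mixed 3 V / 5 V system.

def clamp_overdrive(v_driver_max, v_dd_receiver_min):
    """Voltage by which the driven input can exceed the receiver supply."""
    return v_driver_max - v_dd_receiver_min

def series_limited_current_ma(v_overdrive, r_series_ohm, v_diode=0.7):
    """Current through the clamp diode with a series resistor (fixed-drop model)."""
    return max(0.0, v_overdrive - v_diode) / r_series_ohm * 1e3

v_over = clamp_overdrive(5.5, 3.0)               # 5 V supply at max, 3 V at min
i_ma = series_limited_current_ma(v_over, 1000.0)  # 1 kohm series resistor

print(f"worst-case overdrive = {v_over:.1f} V")
print(f"current with 1 kohm in series = {i_ma:.1f} mA")
```

Under these assumptions the resistor limits the current to a couple of milliamperes instead of several hundred, at the cost of a slower edge at the input.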
FIGURE 6.15 Mixed-voltage systems. (a) TTL levels. (b) Low-voltage CMOS levels. (c) A mixed-voltage ASIC. (d) A problem when connecting two chips with different supply voltages, caused by the input clamp diodes.
1. XC2000, XC3000/A have identical thresholds. XC3100/A thresholds are identical to XC3000 except for 8 mA source-sink current. XC5200 thresholds are identical to XC3100A.
2. Defined as 0.7 V DD , calculated with V DD max = 5.5 V.
3. Defined as 0.2 V DD , calculated with V DD min = 4.5 V.
4. Defined as V DD − 0.2 V, calculated with V DD min = 3.0 V.
5. XC4000, XC4000A have identical I/O thresholds except XC4000A has 24 mA sink current.
6. XC4000H/E have identical I/O thresholds except XC4000E has 12 mA sink current. Options are independent.
7. Defined as V DD − 0.5 V, calculated with V DD min = 4.5 V.
8. Input and output options are independent.
9. MAX 9000 has identical thresholds to FLEX 10k.
Note: All voltages in volts, all currents in milliamperes.
6.4 AC Input
Suppose we wish to connect an input bus containing sampled data from an analog-to-digital converter ( A/D ) that is running at a clock frequency of 100 kHz to an FPGA that is running from a system clock on a bus at 10 MHz (a NuBus). We are to perform some filtering and calculations on the sampled data before placing it on the NuBus. We cannot just connect the A/D output bus to our FPGA, because we have no idea when the A/D data will change. Even though the A/D data rate (a sample every 10 µs or every 100 NuBus clock cycles) is much lower than the NuBus clock, if the data happens to arrive just before we are due to place an output on the NuBus, we have no time to perform any calculations. Instead we want to register the data at the input to give us a whole NuBus clock cycle (100 ns) to perform the calculations. We know that we should have the A/D data at the flip-flop input for at least the flip-flop setup time before the NuBus clock edge. Unfortunately there is no way to guarantee this; the A/D converter clock and the NuBus clock are completely independent. Thus it is entirely possible that every now and again the A/D data will change just before the NuBus clock edge.
6.4.1 Metastability
If we change the data input to a flip-flop (or a latch) too close to the clock edge (called a setup or hold-time violation ), we run into a problem called metastability , illustrated in Figure 6.16. In this situation the flip-flop cannot decide whether its output should be a '1' or a '0' for a long time. If the flip-flop makes a decision, at a time t r after the clock edge, as to whether its output is a '1' or a '0', there is a small, but finite, probability that the flip-flop will decide the output is a '1' when it should have been a '0' or vice versa. This situation, called an upset , can happen when the data is coming from the outside world and the flip-flop cannot determine when it will arrive; this is an asynchronous signal , because it is not synchronized to the chip clock.
FIGURE 6.16 Metastability. (a) Data coming from one system is an asynchronous input to another. (b) A flip-flop has a very narrow decision window bounded by the setup and hold times. If the data input changes inside this decision window, the output may be metastable: neither '1' nor '0'.
Experimentally we find that the probability of upset , p , is
\[ p = T_0 \exp(-t_r / \tau_c) , \tag{6.2} \]
(per data event, per clock edge, in one second, with units Hz⁻¹ · Hz⁻¹ · s⁻¹) where t r is the time a sampler (flip-flop or latch) has to resolve the sampler output; T 0 and τ c are constants of the sampler circuit design. Let us see how serious this problem is in practice. If t r = 5 ns, τ c = 0.1 ns, and T 0 = 0.1 s, Eq. 6.2 gives the upset probability as
\[ p = 0.1 \exp\left( \frac{-5 \times 10^{-9}}{0.1 \times 10^{-9}} \right) = 2 \times 10^{-23}\ \text{s} , \tag{6.3} \]
which is very small, but the data and clock may be running at several MHz, causing the sampler plenty of opportunities for upset. The mean time between upsets ( MTBU , similar to MTBF, mean time between failures) is
\[ \text{MTBU} = \frac{1}{p \, f_{clock} \, f_{data}} = \frac{\exp(t_r / \tau_c)}{f_{clock} \, f_{data} \, T_0} , \tag{6.4} \]
where f clock is the clock frequency and f data is the data frequency. If t r = 5 ns, τ c = 0.1 ns, T 0 = 0.1 s (as in the previous example), f clock = 100 MHz, and f data = 1 MHz, then
\[ \text{MTBU} = \frac{\exp(5 \times 10^{-9} / 0.1 \times 10^{-9})}{(100 \times 10^{6})(1 \times 10^{6})(0.1)} = 5.2 \times 10^{8}\ \text{seconds} , \tag{6.5} \]
or about 16 years (10⁸ seconds is three years, and a day is 10⁵ seconds). An MTBU of 16 years may seem safe, but suppose we have a 64-bit input bus using 64 flip-flops. If each flip-flop has an MTBU of 16 years, our system-level MTBF is three months. If we ship 1000 systems we would have an average of 10 systems failing every day. What can we do?
The parameter τ c is the inverse of the gain-bandwidth product , GB , of the sampler at the instant of sampling. It is a constant that is independent of whether we are sampling a positive or negative data edge. It may be determined by a small-signal analysis of the sampler at the sampling instant or by measurement. It cannot be determined by simulating the transient response of the flip-flop to a metastable event since the gain and bandwidth both normally change as a function of time. We cannot change τ c .
The parameter T 0 (units of time) is a function of the process technology and the circuit design. It may be different for sampling a positive or negative data edge, but normally only one value of T 0 is given. Attempts have been made to calculate T 0 and to relate it to a physical quantity. The best method is by measurement or simulation of metastable events. We cannot change T 0 .
Given a good flip-flop or latch design, τ c and T 0 should be similar for comparable CMOS processes (so, for example, all 0.5 µm processes should have approximately the same τ c and T 0 ). The only parameter we can change when using a flip-flop or latch from a cell library is t r , and we should allow as much resolution time as we can after the output of a latch before the signal is clocked again. If we use a flip-flop constructed from two latches in series (a master-slave design), then we are sampling the data twice. The resolution time for the first sample t r is fixed; it is half the clock cycle (if the clock is high and low for equal times, we say the clock has a 50 percent duty cycle , or equal mark-space ratio ). Using such a flip-flop we need to allow as much time as we can before we clock the second sample by connecting two flip-flops in series, without any combinational logic between them, if possible. If you are really in trouble, the next step is to divide the clock so you can extend the resolution time even further.

TABLE 6.2 Metastability parameters for FPGA flip-flops. These figures are not guaranteed by the vendors.

FPGA                          T 0 /s      τ c /s
Actel ACT 1                   1.0E−09     2.17E−10
Xilinx XC3020-70              1.5E−10     2.71E−10
QuickLogic QL12x16-0          2.94E−11    2.91E−10
QuickLogic QL12x16-1          8.38E−11    2.09E−10
QuickLogic QL12x16-2          1.23E−10    1.85E−10
Xilinx XC8100                 2.15E−12    4.65E−10
Xilinx XC8100 synchronizer    1.59E−17    2.07E−10
Altera MAX 7000               2.98E−17    2.00E−10
Altera FLEX 8000              1.01E−13    7.89E−11

Sources: Actel April 1992 data book, p. 5-1, gives C1 = T 0 = 10⁻⁹ Hz⁻¹ , C2 = 1/ τ c = 4.6052 ns⁻¹ , or τ c = 2.17E−10 s and T 0 = 1.0E−09 s. Xilinx gives K1 = T 0 = 1.5E−10 s and K2 = 1/ τ c = 3.69E9 s⁻¹, τ c = 2.71E−10 s, for the XC3020-70 (p. 8-20 of 1994 data book). QuickLogic pASIC 1 QL12X16: τ c = 0.2 ns to 0.3 ns, T 0 = 0.3E−10 s to 1.2E−10 s (1994 data book, p. 5-25, Fig. 2). Xilinx XC8100 data, τ c = 4.65E−10 s and T 0 = 2.15E−12 s, is from October 1995 (v. 1.0) data sheet, Fig.17 (the XC8100 was discontinued in August 1996). Altera 1995 data book p. 437, Table 1.
Table 6.2 shows flip-flop metastability parameters and Figure 6.17 graphs the metastability data for f clock = 10 MHz and f data = 1 MHz. From this graph we can see the enormous variation in MTBF caused by small variations in τ c . For example, in the QuickLogic pASIC 1 series the range of T 0 from 0.3 to 1.2 × 10⁻¹⁰ s is 4:1, but it is the range of τ c = 0.2 to 0.3 ns (a variation of only 1:1.5) that is responsible for the enormous variation in MTBF (nearly four orders of magnitude at t r = 5 ns). The variation in τ c is caused by the variation in GB between the QuickLogic speed grades. Variation in the other vendors' parts will be similar, but most vendors do not show this information. To be safe, build a large safety margin for MTBF into any design; it is not unreasonable to use a margin of four orders of magnitude.
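The upset probability and MTBU expressions (Eqs. 6.2 and 6.4) can be evaluated directly. The sketch below reproduces the worked example (t r = 5 ns, τ c = 0.1 ns, T 0 = 0.1 s) and then compares two QuickLogic speed grades using the Table 6.2 parameters at f clock = 10 MHz and f data = 1 MHz. The function names are ours, and the 64-flip-flop estimate simply divides the single flip-flop MTBU by 64, as the text does.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600

def upset_probability(t_r, tau_c, t0):
    """Eq. 6.2: p = T0 * exp(-t_r / tau_c)."""
    return t0 * math.exp(-t_r / tau_c)

def mtbu_seconds(t_r, tau_c, t0, f_clock, f_data):
    """Eq. 6.4: MTBU = exp(t_r / tau_c) / (f_clock * f_data * T0)."""
    return math.exp(t_r / tau_c) / (f_clock * f_data * t0)

# Worked example from the text.
p = upset_probability(5e-9, 0.1e-9, 0.1)
mtbu = mtbu_seconds(5e-9, 0.1e-9, 0.1, 100e6, 1e6)
mtbu_years = mtbu / SECONDS_PER_YEAR
bus_mtbf_months = mtbu_years / 64 * 12  # 64 independent flip-flops

print(f"p = {p:.1e} s, MTBU = {mtbu:.2e} s = {mtbu_years:.1f} years")
print(f"64-bit bus MTBF = {bus_mtbf_months:.1f} months")

# QuickLogic speed grades from Table 6.2, at t_r = 5 ns, 10 MHz clock, 1 MHz data.
slow = mtbu_seconds(5e-9, 2.91e-10, 2.94e-11, 10e6, 1e6)  # QL12x16-0
fast = mtbu_seconds(5e-9, 1.85e-10, 1.23e-10, 10e6, 1e6)  # QL12x16-2
print(f"MTBF ratio between speed grades = {fast / slow:.0f}")
```

The ratio between the two speed grades comes out at a few thousand, which is the "nearly four orders of magnitude" spread described above, even though τ c differs by only about a factor of 1.5.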
FIGURE 6.17 Mean time between failures (MTBF) as a function of resolution time. The data is from FPGA vendors' data books for a single flip-flop with clock frequency of 10 MHz and a data input frequency of 1 MHz (see Table 6.2 ).
Some cell libraries include a synchronizer , built from two flip-flops in cascade, that greatly reduces the effective values of τ c and T 0 over a single flip-flop. The penalty is an extra clock cycle of latency. To compare discrete TTL parts with ASIC flip-flops, the 74AS4374 TTL metastable-hardened dual flip-flops , from TI, have τ c = 0.42 ns and T 0 = 4 ns. The parameter T 0 ranges from about 10 s for the 74LS74 (a regular flip-flop) to 4 ns for the 74AS4374 (over nine orders of magnitude different); τ c only varies from 0.42 ns (74AS4374) to 1.3 ns (74LS74), but this small variation in τ c is just as important.
6.5 Clock Input

When we bring the clock signal onto a chip, we may need to adjust the logic level (clock signals are often driven by TTL drivers with a high current output capability) and then we need to distribute the clock signal around the chip as it is needed. FPGAs normally provide special clock buffers and clock networks. We need to minimize the clock delay (or latency), but we also need to minimize the clock skew.
6.5.1 Registered Inputs

Some FPGAs provide a flip-flop or latch that you can use as part of the I/O circuit (registered I/O). For other FPGAs you have to use a flip-flop or latch using the basic logic cell in the core. In either case the important parameter is the input setup time. We can measure the setup with respect to the clock signal at the flip-flop or the clock signal at the clock input pad. The difference between these two parameters is the clock delay.
FIGURE 6.18 Clock input. (a) Timing model with values for a Xilinx XC4005-6. (b) A simplified view of clock distribution. (c) Timing diagram. Xilinx eliminates the variable internal delay t PG , by specifying a pin-to-pin setup time, t PSUFmin = 2 ns.
Figure 6.18 shows part of the I/O timing model for a Xilinx XC4005-6.
• t PICK is the fixed setup time for a flip-flop relative to the flip-flop clock.
• t skew is the variable clock skew , the signed delay between two clock edges.
• t PG is the variable clock delay or latency .
To calculate the flip-flop setup time ( t PSUFmin ) relative to the clock pad (which is the parameter system designers need to know), we subtract the clock delay, so that
\[ t_{PSUF} = t_{PICK} - t_{PG} . \tag{6.6} \]
The problem is that we cannot easily calculate t PG , since it depends on the clock distribution scheme and where the flip-flop is on the chip. Instead Xilinx specifies t PSUFmin directly, measured from the data pad to the clock pad; this time is called a pin-to-pin timing parameter . Notice that t PSUFmin = 2 ns, whereas t PICK − t PGmax = 1 ns. Figure 6.19 shows that the hold time for a XC4005-6 flip-flop ( t CKI ) with respect to the flip-flop clock is zero. However, the pin-to-pin hold time including the clock delay is t PHF = 5.5 ns. We can remove this inconvenient hold-time restriction by delaying the input signal. Including a programmable delay allows Xilinx to guarantee the pin-to-pin hold time ( t PH ) as zero. The penalty is an increase in the pin-to-pin setup time ( t PSU ) to 21 ns (from 2 ns) for the XC4005-6, for example.
FIGURE 6.19 Programmable input delay. (a) Pin-to-pin timing model with values from an XC4005-6. (b) Timing diagrams with and without programmable delay. We also have to account for clock delay when we register an output. Figure 6.20 shows the timing model diagram for the clock-to-output delay.
FIGURE 6.20 Registered output. (a) Timing model with values for an XC4005-6 programmed with the fast slew-rate option. (b) Timing diagram.
1. The Xilinx XC4005-6 timing parameters are from the 1994 data book p. 2-50 to p. 2-53.
6.6 Power Input

The last item that we need to bring onto an FPGA is the power. We may need multiple VDD and GND power pads to reduce supply bounce or separate VDD pads for mixed-voltage supplies. We may also need to provide power for on-chip programming (in the case of antifuse or EPROM programming technology). The package type and number of pins will determine the number of power pins, which, in turn, affects the number of SSOs you can have in a design.
6.6.1 Power Dissipation

As a general rule a plastic package can dissipate about 1 W, and more expensive ceramic packages can dissipate up to about 2 W. Table 6.3 shows the thermal characteristics of common packages.

TABLE 6.3 Thermal characteristics of ASIC packages.

Package [2]    Pin count    θ JA /°C W⁻¹ (still air) [3,4]
CPGA           84           33
CPGA           100          35
CPGA           132          30
CPGA           175          25
CPGA           207          22
CPGA           257          15
CQFP           84           40
CQFP           172          25
PQFP           100          55
PQFP           160          33
PQFP           208          33
VQFP           80           68
PLCC           44           52
PLCC           68           45
PLCC           84           44
PPGA           132

Max. power P max /W (column entries in source order): 1.0, 1.75, 2.0, 1.5
θ JA /°C W⁻¹ (still air) [5] (column entries in source order): 32 to 38, 16, 56 to 75, 30 to 33, 27 to 32, 44, 28 to 35, 33 to 34

In a high-speed (high-power) design the ASIC power consumption may dictate your choice of packages. Actel provides a formula for calculating typical dynamic chip power consumption of their FPGAs. The formulas for the ACT 2 and ACT 3 FPGAs are complex; therefore we shall use the simpler formula for the ACT 1 FPGAs as an example:
\[ \text{Total chip power} = 0.2\,(N \times F1) + 0.085\,(M \times F2) + 0.8\,(P \times F3)\ \text{mW} \tag{6.7} \]
where
F1 = average logic module switching rate in MHz
F2 = average clock pin switching rate in MHz
F3 = average I/O switching rate in MHz
M = number of logic modules connected to the clock pin
N = number of logic modules used on the chip
P = number of I/O pairs used (input + output), with 50 pF load
As an example of a power-dissipation calculation, consider an Actel 1020B-2 with a 20 MHz clock. We shall initially assume 100 percent utilization of the 547 Logic Modules and assume that each switches at an average speed of 5 MHz. We shall also initially assume that we use all of the 69 I/O Modules and that each switches at an average speed of 5 MHz. Using Eq. 6.7 , the Logic Modules dissipate
\[ P_{LM} = (0.2)(547)(5) = 547\ \text{mW} , \tag{6.8} \]
and the I/O Module dissipation is
\[ P_{IO} = (0.8)(69)(5) = 276\ \text{mW} . \tag{6.9} \]
If we assume the clock buffer drives 20 percent of the Logic Modules, then the additional power dissipation due to the clock buffer is
\[ P_{CLK} = (0.085)(547)(0.2)(5) = 46.495\ \text{mW} . \tag{6.10} \]
The total power dissipation is thus
\[ P_{D} = (547 + 276 + 46.5) = 869.5\ \text{mW} , \tag{6.11} \]
or about 900 mW (with an accuracy of certainly no better than ±100 mW). Suppose we intend to use a very thin quad flatpack ( VQFP ) with no cooling (because we are trying to save area and board height). From Table 6.3 the thermal resistance, θ JA , is approximately 68 °C W⁻¹ for an 80-pin VQFP. Thus the maximum junction temperature under industrial worst-case conditions (T A = 85 °C) will be
\[ T_{J} = (85 + (0.87)(68)) = 144.16\ {}^{\circ}\text{C} , \tag{6.12} \]
(with an accuracy of no better than ±10 °C). Actel specifies the maximum junction temperature for its devices as T Jmax = 150 °C (T Jmax for Altera is also 150 °C, for Xilinx T Jmax = 125 °C). Our calculated value is much too close to the rated maximum for comfort; therefore we need to go back and check our assumptions for power dissipation. At or near 100 percent module utilization is not unreasonable for an Actel device, but more questionable is that all nodes and I/Os switch at 5 MHz. Our real mistake is trying to use a VQFP package with a high θ JA for a high-speed design. Suppose we use an 84-pin PLCC package instead. From Table 6.3 the thermal resistance, θ JA , for this alternative package is approximately 44 °C W⁻¹ . Now the worst-case junction temperature will be a more reasonable
\[ T_{J} = (85 + (0.87)(44)) = 123.28\ {}^{\circ}\text{C} . \tag{6.13} \]
It is possible to estimate the power dissipation of the Actel architecture because the routing is regular and the interconnect capacitance is well controlled (it has to be since we must minimize the number of series antifuses we use). For most other architectures it is much more difficult to estimate power dissipation. The exception, as we saw in Section 5.4, Altera MAX, are the programmable ASICs based on programmable logic arrays with passive pull-ups where a substantial part of the power dissipation is static.
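The power and junction-temperature arithmetic of Eqs. 6.7 to 6.13 can be reproduced with a short script. The function names are ours; the coefficients, module counts, and switching rates are those of the worked example (including the 5 MHz rate used for the clock term in Eq. 6.10), and the temperature step uses the rounded 0.87 W figure, as the text does.

```python
# ACT 1 power estimate (Eq. 6.7) and junction temperature for two packages.

def act1_power_mw(n_modules, f1_mhz, m_clocked, f2_mhz, p_io, f3_mhz):
    """Return (logic, clock, I/O) power terms of Eq. 6.7 in milliwatts."""
    logic = 0.2 * n_modules * f1_mhz
    clock = 0.085 * m_clocked * f2_mhz
    io = 0.8 * p_io * f3_mhz
    return logic, clock, io

def junction_temp_c(t_ambient_c, power_w, theta_ja_c_per_w):
    """T_J = T_A + P * theta_JA."""
    return t_ambient_c + power_w * theta_ja_c_per_w

# 547 Logic Modules and 69 I/O Modules at 5 MHz; clock drives 20% of the modules.
p_lm, p_clk, p_io = act1_power_mw(547, 5.0, 0.2 * 547, 5.0, 69, 5.0)
p_total_mw = p_lm + p_clk + p_io

t_j_vqfp = junction_temp_c(85.0, 0.87, 68.0)  # 80-pin VQFP
t_j_plcc = junction_temp_c(85.0, 0.87, 44.0)  # 84-pin PLCC

print(f"P_LM = {p_lm:.0f} mW, P_IO = {p_io:.0f} mW, P_CLK = {p_clk:.1f} mW")
print(f"total = {p_total_mw:.1f} mW")
print(f"T_J (VQFP) = {t_j_vqfp:.2f} C, T_J (PLCC) = {t_j_plcc:.2f} C")
```

Changing the assumed switching rates in the call to `act1_power_mw` is a quick way to see how sensitive the junction temperature is to the 5 MHz assumption that the text questions.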
6.6.2 Power-On Reset

Each FPGA has its own power-on reset sequence. For example, a Xilinx FPGA configures all flip-flops (in either the CLBs or IOBs) as either SET or RESET. After chip programming is complete, the global SET/RESET signal forces all flip-flops on the chip to a known state. This is important since it may determine the initial state of a state machine, for example.
1. 1994 data book, p.1-9
2. CPGA = ceramic pin-grid array; CQFP = ceramic quad flatpack; PQFP = plastic quad flatpack; VQFP = very thin quad flatpack; PLCC = plastic leaded chip carrier; PPGA = plastic pin-grid array.
3. θ JA varies with die size.
4. Data from Actel 1994 data book p. 1-9, p. 1-45, and p. 1-94.
5. Data from Xilinx 1994 data book p. 4-26 and p. 4-27.
6.7 Xilinx I/O Block

The Xilinx I/O cell is the input/output block ( IOB ) . Figure 6.21 shows the Xilinx XC4000 IOB, which is similar to the IOB in the XC2000, XC3000, and XC5200 but performs a superset of the options in these other Xilinx FPGAs.
FIGURE 6.21 The Xilinx XC4000 family IOB (input/output block). ( Source: Xilinx.)
The outputs contain features that allow you to do the following:
• Switch between a totem-pole and a complementary output (XC4000H).
• Include a passive pull-up or pull-down (both n -channel devices) with a typical resistance of about 50 kΩ.
• Invert the three-state control (output enable OE or three-state, TS).
• Include a flip-flop, or latch, or a direct connection in the output path.
• Control the slew rate of the output.
The features on the inputs allow you to do the following:

• Configure the input buffer with TTL or CMOS thresholds.
• Include a flip-flop, or latch, or a direct connection in the input path.
• Switch in a delay to eliminate an input hold time.
FIGURE 6.22 The Xilinx LCA (Logic Cell Array) timing model. The paths show different uses of CLBs (Configurable Logic Blocks) and IOBs (Input/Output Blocks). The parameters shown are for an XC5210-6. (Source: Xilinx.)
Figure 6.22 shows the timing model for the XC5200 family. It is similar to the timing model for all the other Xilinx LCA FPGAs with one exception: the XC5200 does not have registers in the I/O cell; you go directly to the core CLBs to include a flip-flop or latch on an input or output.
6.7.1 Boundary Scan

Testing PCBs can be done using a bed-of-nails tester. This approach becomes very difficult with closer IC pin spacing and more sophisticated assembly methods using surface-mount technology and multilayer boards. The IEEE implemented boundary-scan standard 1149.1 to simplify the problem of testing at the board level. The Joint Test Action Group (JTAG) developed the standard; thus the terms JTAG boundary scan or just JTAG are commonly used. Many FPGAs contain a standard boundary-scan test logic structure with a four-pin interface. By using these four signals, you can program the chip using ISP, as well as serially load commands and data into the chips to control the outputs and check the inputs. This is a great improvement over bed-of-nails testing. We shall cover boundary scan in detail in Section 14.6, Scan Test.
1. October 1995 (v. 3.0) data sheet.
6.8 Other I/O Cells

The Altera MAX 5000 and 7000 use the I/O Control Block (IOC) shown in Figure 6.23. In the MAX 5000, all inputs pass through the chipwide interconnect. The MAX 7000E has special fast inputs that are connected directly to macrocell registers in order to reduce the setup time for registered inputs.

FIGURE 6.23 A simplified block diagram of the Altera I/O Control Block (IOC) used in the MAX 5000 and MAX 7000 series. The I/O pin feedback allows the I/O pad to be isolated from the macrocell. It is thus possible to use a LAB without using up an I/O pad (as you often have to do using a PLD such as a 22V10). The PIA is the chipwide interconnect.

The FLEX 8000 and 10k use the I/O Element (IOE) shown in Figure 6.24 (the MAX 9000 IOC is similar). The interface to the IOE is directly to the chipwide interconnect rather than the core logic. There is a separate bus, the Peripheral Control Bus, for the IOE control signals: clock, preset, clear, and output enable.
FIGURE 6.24 A simplified block diagram of the Altera I/O Element (IOE), used in the FLEX 8000 and 10k series. The MAX 9000 IOC (I/O Cell) is similar. The FastTrack Interconnect bus is the chipwide interconnect. The PCB is used for control signals common to each IOE.
The AMD MACH 5 family has some I/O features not currently found on other programmable ASICs. The MACH 5 family has 3.3 V and 5 V versions that are both suitable for mixed-voltage designs. The 3 V versions accept 5 V inputs, and the outputs of the 3 V versions do not drive above 3.3 V. You can apply a voltage up to 5.5 V to device inputs before you connect VDD (this is known as hot insertion or hot switching, allowing you to swap cards with power still applied without causing latchup). During power-up and power-down, all I/Os are three-state, and there is no I/O current during power-down, allowing power-down while connected to an active bus. All MACH 5 devices in the same package have the same pin configuration, so you can increase or reduce the size of the device after completing the board layout.
6.9 Summary
Among the options available in I/O cells are: different drive strengths, TTL compatibility, registered or direct inputs, registered or direct outputs, pull-up resistors, over-voltage protection, slew-rate control, and boundary scan. Table 6.4 shows a list of features. Interfacing an ASIC with a system starts at the outputs, where you check the voltage levels first, then the current levels. Table 6.5 is a look-up table for Tables 6.6 and 6.7, which show the I/O resources present in each type of programmable ASIC (using the abbreviations of Table 6.4).

TABLE 6.4 I/O options for programmable ASICs.

| Code ¹ | I/O option | Function |
|---|---|---|
| IT/C | TTL/CMOS input | Programmable input buffer threshold |
| OT/C | TTL/CMOS output | Complementary or totem-pole output |
| nSNK | Sink capability | Maximum current sink ability (e.g., 12SNK is I_O = 12 mA sink) |
| nSRC | Source capability | Maximum current source ability (e.g., 12SRC is I_O = 12 mA source) |
| 5/3 | 5V/3V | Separate I/O and core voltage supplies |
| OD | Open drain/collector | Programmable open-drain at the output buffer |
| TS | Three-state | Output buffer with three-state control |
| SR | Slew-rate control | Fast or slew-rate limited output buffer to reduce ground bounce |
| PD | Pull-down | Programmable pull-down device or resistor at the I/O pad |
| PU | Pull-up | Programmable pull-up device or resistor at the I/O pad |
| EP | Enable polarity | Driver control can be positive (three-state) or negative (enable). |
| RI | Registered input | Inputs may be registered in I/O cell. |
| RO | Registered output | Outputs may be registered in I/O cell. |
| RIO | Registered I/O | Both inputs and outputs may be registered in I/O cell. |
| ID | Input delay | Input delay to eliminate input hold time |
| JTAG | JTAG | Boundary-scan test |
| SCH | Schmitt trigger | Schmitt trigger or input hysteresis |
| HOT | Hot insertion | Inputs protected from hot insertion |
| PCI | PCI compliant | Output buffer characteristics comply with PCI specifications. |
Important points that we covered in this chapter are the following:

Outputs can typically source or sink 5–10 mA continuously into a DC load, and 50–200 mA transiently into an AC load.
Input buffers can be CMOS (threshold at 0.5 V_DD) or TTL (1.4 V).
Input buffers normally have a small hysteresis (100–200 mV).
CMOS inputs must never be left floating.
Clamp diodes to GND and VDD are present on every pin.
Inputs and outputs can be registered or direct.
I/O registers can be in the I/O cell or in the core.
Metastability is a problem when working with asynchronous inputs.
1. These codes are used in Tables 6.6 and 6.7 .
PROGRAMMABLE ASIC INTERCONNECT

All FPGAs contain some type of programmable interconnect. The structure and complexity of the interconnect is largely determined by the programming technology and the architecture of the basic logic cell. The raw material that we have to work with in building the interconnect is aluminum-based metallization, which has a sheet resistance of approximately 50 mΩ/square and a line capacitance of 0.2 pF cm⁻¹. The first programmable ASICs were constructed using two layers of metal; newer programmable ASICs use three or more layers of metal interconnect.
7.1 Actel ACT 7.2 Xilinx LCA 7.3 Xilinx EPLD 7.4 Altera MAX 5000 and 7000 7.5 Altera MAX 9000 7.6 Altera FLEX
7.7 Summary 7.8 Problems 7.9 Bibliography 7.10 References
7.1 Actel ACT

The Actel ACT family interconnect scheme shown in Figure 7.1 is similar to a channeled gate array. The channel routing uses dedicated rectangular areas of fixed size within the chip called wiring channels (or just channels). The horizontal channels run across the chip in the horizontal direction. In the vertical direction there are similar vertical channels that run over the top of the basic logic cells, the Logic Modules. Within the horizontal or vertical channels wires run horizontally or vertically, respectively, within tracks. Each track holds one wire. The capacity of a fixed wiring channel is equal to the number of tracks it contains. Figure 7.2 shows a detailed view of the channel and the connections to each Logic Module: the input stubs and output stubs.
FIGURE 7.1 The interconnect architecture used in an Actel ACT family FPGA. ( Source: Actel.)
FIGURE 7.2 ACT 1 horizontal and vertical channel architecture. (Source: Actel.)

In a channeled gate array the designer decides the location and length of the interconnect within a channel. In an FPGA the interconnect is fixed at the time of manufacture. To allow programming of the interconnect, Actel divides the fixed interconnect wires within each channel into various lengths or wire segments. We call this segmented channel routing, a variation on channel routing. Antifuses join the wire segments. The designer then programs the interconnections by blowing antifuses and making connections between wire segments; unwanted connections are left unprogrammed. A statistical analysis of many different layouts determines the optimum number and the lengths of the wire segments.
7.1.1 Routing Resources

The ACT 1 interconnection architecture uses 22 horizontal tracks per channel for signal routing with three tracks dedicated to VDD, GND, and the global clock (GCLK), making a total of 25 tracks per channel. Horizontal segments vary in length from four columns of Logic Modules to the entire row of modules (Actel calls these long segments long lines).

Four Logic Module inputs are available to the channel below the Logic Module and four inputs to the channel above the Logic Module. Thus eight vertical tracks per Logic Module are available for inputs (four from the Logic Module above the channel and four from the Logic Module below). These connections are the input stubs. The single Logic Module output connects to a vertical track that extends across the two channels above the module and across the two channels below the module. This is the output stub. Thus module outputs use four vertical tracks per module (counting two tracks from the modules below, and two tracks from the modules above each channel). One vertical track per column is a long vertical track (LVT) that spans the entire height of the chip (the 1020 contains some segmented LVTs). There are thus a total of 13 vertical tracks per column in the ACT 1 architecture (eight for inputs, four for outputs, and one for an LVT).

Table 7.1 shows the routing resources for both the ACT 1 and ACT 2 families. The last two columns show the total number of antifuses (including antifuses in the I/O cells) on each chip and the total number of antifuses assuming the wiring channels are fully populated with antifuses (an antifuse at every horizontal and vertical interconnect intersection). The ACT 1 devices are very nearly fully populated.

TABLE 7.1 Actel FPGA routing resources.

| Device | Horizontal tracks per channel, H | Vertical tracks per column, V | Rows, R | Columns, C | Total antifuses on each chip | H × V × R × C |
|---|---|---|---|---|---|---|
| A1010 | 22 | 13 | 8 | 44 | 112,000 | 100,672 |
| A1020 | 22 | 13 | 14 | 44 | 186,000 | 176,176 |
| A1225A | 36 | 15 | 13 | 46 | 250,000 | 322,920 |
| A1240A | 36 | 15 | 14 | 62 | 400,000 | 468,720 |
| A1280A | 36 | 15 | 18 | 82 | 750,000 | 797,040 |
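The last column of Table 7.1 is simply the product H × V × R × C, so it can be checked in a few lines. The Python sketch below recomputes it and compares it with the quoted total for each device; the dictionary layout and the function names are illustrative choices, not anything defined by Actel.

```python
# Routing resources from Table 7.1: (H, V, R, C, total antifuses on chip).
TABLE_7_1 = {
    "A1010":  (22, 13,  8, 44, 112_000),
    "A1020":  (22, 13, 14, 44, 186_000),
    "A1225A": (36, 15, 13, 46, 250_000),
    "A1240A": (36, 15, 14, 62, 400_000),
    "A1280A": (36, 15, 18, 82, 750_000),
}

def fully_populated(h, v, r, c):
    """Antifuse count with one antifuse at every track intersection."""
    return h * v * r * c

def population_ratio(device):
    """Quoted total antifuses divided by the fully populated count."""
    h, v, r, c, total = TABLE_7_1[device]
    return total / fully_populated(h, v, r, c)

for name, (h, v, r, c, total) in TABLE_7_1.items():
    print(f"{name}: HVRC = {fully_populated(h, v, r, c):,}, "
          f"ratio = {population_ratio(name):.2f}")
```

The quoted totals include the antifuses in the I/O cells, which is why the ratio can exceed 1 for the ACT 1 parts, while the ACT 2 parts come out below 1.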
If the Logic Module at the end of a net is less than two rows away from the driver module, a connection requires two antifuses, a vertical track, and two horizontal segments. If the modules are more than two rows apart, a connection between them will require a long vertical track together with another vertical track (the output stub) and two horizontal tracks. To connect these tracks will require a total of four antifuses in series and this will add delay due to the resistance of the antifuses. To examine the extent of this delay problem we need some help from the analysis of RC networks.
7.1.2 Elmore's Constant

Figure 7.3 shows an RC tree representing a net with a fanout of two. We shall assume that all nodes are initially charged to V DD = 1 V, and that we short node 0 to ground, so V 0 = 0 V, at time t = 0 sec. We need to find the node voltages, V 1 to V 4 , as a function of time. A similar problem arose in the design of wideband vacuum tube distributed amplifiers in the 1940s. Elmore found a measure of delay that we can use today [ Rubenstein, Penfield, and Horowitz, 1983].
FIGURE 7.3 Measuring the delay of a net. (a) An RC tree. (b) The waveforms as a result of closing the switch at t = 0.
The current in branch $k$ of the network is

$$ i_k = -C_k \frac{dV_k}{dt} . \tag{7.1} $$

The linear superposition of the branch currents gives the voltage at node $i$ as

$$ V_i = -\sum_{k=1}^{n} R_{ki} C_k \frac{dV_k}{dt} , \tag{7.2} $$

where $R_{ki}$ is the resistance of the path to $V_0$ (ground in this case) shared by node $k$ and node $i$. So, for example, $R_{24} = R_1$, $R_{22} = R_1 + R_2$, and $R_{31} = R_1$. Unfortunately, Eq. 7.2 is a complicated set of coupled equations that we cannot easily solve. We know the node voltages have different values at each point in time, but, since the waveforms are similar, let us assume the slopes (the time derivatives) of the waveforms are related to each other. Suppose we express the slope of node voltage $V_k$ as a constant, $a_k$, times the slope of $V_i$,

$$ \frac{dV_k}{dt} = a_k \frac{dV_i}{dt} . \tag{7.3} $$

Consider the following measure of the error, $E$, of our approximation:

$$ E = \sum_{k=1}^{n} R_{ki} C_k . \tag{7.4} $$

The error, $E$, is a minimum when $a_k = 1$ since initially $V_i(t = 0) = V_k(t = 0) = 1$ V (we normalized the voltages) and $V_i(t = \infty) = V_k(t = \infty) = 0$. Now we can rewrite Eq. 7.2, setting $a_k = 1$, as follows:

$$ V_i = -\sum_{k=1}^{n} R_{ki} C_k \frac{dV_i}{dt} . \tag{7.5} $$

This is a linear first-order differential equation with the following solution:

$$ V_i(t) = \exp(-t/\tau_{Di}) ; \qquad \tau_{Di} = \sum_{k=1}^{n} R_{ki} C_k . \tag{7.6} $$

The time constant $\tau_{Di}$ is often called the Elmore delay and is different for each node. We shall refer to $\tau_{Di}$ as the Elmore time constant to remind us that, if we approximate $V_i$ by an exponential waveform, the delay of the RC tree using 0.35/0.65 trip points is approximately $\tau_{Di}$ seconds.
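As a rough illustration of Eq. 7.6, the Python sketch below computes the Elmore time constant of every node in an RC tree. The representation (a `parent` map, with node 0 as the driven node) and the function name are assumptions made for this example. The topology is also an assumption: node 4 is placed below node 3, which agrees with the shared-path examples in the text ($R_{24} = R_1$, $R_{22} = R_1 + R_2$, $R_{31} = R_1$) but cannot be confirmed without Figure 7.3.

```python
def elmore_time_constants(parent, R, C):
    """Return {node: tau} with tau_i = sum over k of R_ki * C_k (Eq. 7.6).

    parent[k] is the node upstream of k (0 is the driven node), R[k] is the
    resistance of the branch from parent[k] to k, and C[k] is the capacitance
    at node k. R_ki is the resistance shared by the paths from node 0 to
    node k and from node 0 to node i.
    """
    def branches_to_root(node):
        branches = set()
        while node != 0:
            branches.add(node)
            node = parent[node]
        return branches

    paths = {k: branches_to_root(k) for k in parent}
    return {
        i: sum(sum(R[b] for b in paths[i] & paths[k]) * C[k] for k in parent)
        for i in parent
    }

# Assumed topology: node 1 hangs off node 0; nodes 2 and 3 branch from node 1;
# node 4 follows node 3. Unit values (1 kOhm, 1 pF) give results in ns.
parent = {1: 0, 2: 1, 3: 1, 4: 3}
R = {k: 1.0 for k in parent}
C = {k: 1.0 for k in parent}
tau = elmore_time_constants(parent, R, C)
print(tau)
```

With every branch set to 1 kΩ and every node to 1 pF, each time constant is just a count of shared resistors weighted by the capacitors, so the nodes further from the driver come out slower.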
7.1.3 RC Delay in Antifuse Connections

Suppose a single antifuse, with resistance R_1, connects to a wire segment with parasitic capacitance C_1. Then a connection employing a single antifuse will delay the signal passing along that connection by approximately one time constant, or R_1 C_1 seconds. If we have more than one antifuse, we need to use the Elmore time constant to estimate the interconnect delay.
FIGURE 7.4 Actel routing model. (a) A four-antifuse connection. L0 is an output stub, L1 and L3 are horizontal tracks, L2 is a long vertical track (LVT), and L4 is an input stub. (b) An RC-tree model. Each antifuse is modeled by a resistance and each interconnect segment is modeled by a capacitance.

For example, suppose we have the four-antifuse connection shown in Figure 7.4. Then, from Eq. 7.6,

$$ \tau_{D4} = R_{14} C_1 + R_{24} C_2 + R_{34} C_3 + R_{44} C_4 = (R_1 + R_2 + R_3 + R_4) C_4 + (R_1 + R_2 + R_3) C_3 + (R_1 + R_2) C_2 + R_1 C_1 . $$

If all the antifuse resistances are approximately equal (a reasonably good assumption) and the antifuse resistance is much larger than the resistance of any of the metal lines, L1–L5, shown in Figure 7.4 (a very good assumption), then $R_1 = R_2 = R_3 = R_4 = R$, and the Elmore time constant is

$$ \tau_{D4} = 4RC_4 + 3RC_3 + 2RC_2 + RC_1 . \tag{7.7} $$

Suppose now that the capacitance of each interconnect segment (including all the antifuses and programming transistors that may be attached) is approximately constant, and equal to $C$. A connection with two antifuses will generate a $3RC$ time constant, a connection with three antifuses a $6RC$ time constant, and a connection with four antifuses gives a $10RC$ time constant. This analysis is disturbing: it says that the interconnect delay grows quadratically ($\propto n^2$) as we increase the interconnect length and the number of antifuses, $n$. The situation is worse when the intermediate wire segments have larger capacitance than that of the short input stubs and output stubs. Unfortunately, this is the situation in an Actel FPGA where the horizontal and vertical segments in a connection may be quite long.
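The 3RC, 6RC, and 10RC figures are the partial sums 1 + 2 + ... + n, that is, n(n + 1)/2. A minimal Python sketch of that growth follows, assuming n identical antifuse-plus-segment sections; the function name is made up for this illustration.

```python
def chain_time_constant(n, r=1.0, c=1.0):
    """Elmore time constant at the far end of n identical R-C sections.

    Section k (counting from the driver) contributes k*r*c, following the
    pattern of Eq. 7.7 with every capacitance set to the same value c.
    """
    return sum(k * r * c for k in range(1, n + 1))

for n in range(1, 5):
    # The closed form n(n+1)/2 shows the quadratic growth directly.
    print(n, chain_time_constant(n), n * (n + 1) / 2)
```

Doubling the number of antifuses from two to four therefore more than triples the time constant, from 3RC to 10RC.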
7.1.4 Antifuse Parasitic Capacitance

We can determine the number of antifuses connected to the horizontal and vertical lines for the Actel architecture. Each column contains 13 vertical signal tracks and each channel contains 25 horizontal tracks (22 of these are used for signals). Thus, assuming the channels are fully populated with antifuses,
An input stub (1 channel) connects to 25 antifuses.
An output stub (4 channels) connects to 100 (25 × 4) antifuses.
An LVT (1010, 8 channels) connects to 200 (25 × 8) antifuses.
An LVT (1020, 14 channels) connects to 350 (25 × 14) antifuses.
A four-column horizontal track connects to 52 (13 × 4) antifuses.
A 44-column horizontal track connects to 572 (13 × 44) antifuses.
A connection to the diffusion of an Actel antifuse has a parasitic capacitance due to the diffusion junction. The polysilicon of the antifuse has a parasitic capacitance due to the thin oxide. These capacitances are approximately equal. For a 2 µm CMOS process the capacitance to ground of the diffusion is 200 to 300 aF µm⁻² (area component) and 400 to 550 aF µm⁻¹ (perimeter component). Thus, including both area and perimeter effects, a 16 µm² diffusion contact (consisting of a 2 µm by 2 µm opening plus the required overlap) has a parasitic capacitance of 10–14 fF. If we assume an antifuse has a parasitic capacitance of approximately 10 fF in a 1.0 or 1.2 µm process, we can calculate the parasitic capacitances shown in Table 7.2.
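The 10–14 fF estimate can be reproduced by adding the area and perimeter components. The sketch below assumes the 16 µm² contact is a 4 µm by 4 µm square (so the perimeter is 16 µm); that shape is an assumption made here, since the text gives only the area. The function name is illustrative.

```python
AF_TO_FF = 1e-3  # 1 aF = 0.001 fF

def diffusion_capacitance_ff(area_um2, perimeter_um, c_area_af, c_perim_af):
    """Junction capacitance in fF from area (aF/um^2) and perimeter (aF/um) terms."""
    return (area_um2 * c_area_af + perimeter_um * c_perim_af) * AF_TO_FF

SIDE_UM = 4.0  # assumed square contact: 4 um x 4 um = 16 um^2
low = diffusion_capacitance_ff(SIDE_UM**2, 4 * SIDE_UM, 200, 400)
high = diffusion_capacitance_ff(SIDE_UM**2, 4 * SIDE_UM, 300, 550)
print(f"{low:.1f} fF to {high:.1f} fF")
```

Under these assumptions the perimeter term is roughly twice the area term, and the result rounds to the 10–14 fF quoted in the text.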
TABLE 7.2 Actel interconnect parameters.

| Parameter | A1010/A1020 | A1010B/A1020B |
|---|---|---|
| Technology | 2.0 µm, λ = 1.0 µm | 1.2 µm, λ = 0.6 µm |
| Die height (A1010) | 240 mil | 144 mil |
| Die width (A1010) | 360 mil | 216 mil |
| Die area (A1010) | 86,400 mil² = 56 Mλ² | 31,104 mil² = 56 Mλ² |
| Logic Module (LM) height (Y1) | 180 µm = 180 λ | 108 µm = 180 λ |
| LM width (X) | 150 µm = 150 λ | 90 µm = 150 λ |
| LM area (X × Y1) | 27,000 µm² = 27 kλ² | 9,720 µm² = 27 kλ² |
| Channel height (Y2) | 25 tracks = 287 µm | 25 tracks = 170 µm |
| Channel area per LM (X × Y2) | 43,050 µm² = 43 kλ² | 15,300 µm² = 43 kλ² |
| LM and routing area (X × Y1 + X × Y2) | 70,000 µm² = 70 kλ² | 25,000 µm² = 70 kλ² |
| Antifuse capacitance | | 10 fF |
| Metal capacitance | 0.2 pF mm⁻¹ | 0.2 pF mm⁻¹ |
| Output stub length (spans 3 LMs + 4 channels) | 4 channels = 1688 µm | 4 channels = 1012 µm |
| Output stub metal capacitance | 0.34 pF | 0.20 pF |
| Output stub antifuse connections | 100 | 100 |
| Output stub antifuse capacitance | | 1.0 pF |
| Horiz. track length | 4–44 cols. = 600–6600 µm | 4–44 cols. = 360–3960 µm |
| Horiz. track metal capacitance | 0.1–1.3 pF | 0.07–0.8 pF |
| Horiz. track antifuse connections | 52–572 antifuses | 52–572 antifuses |
| Horiz. track antifuse capacitance | | 0.52–5.72 pF |
| Long vertical track (LVT) | 8–14 channels = 3760–6580 µm | 8–14 channels = 2240–3920 µm |
| LVT metal capacitance | 0.08–0.13 pF | 0.45–0.8 pF |
| LVT track antifuse connections | 200–350 antifuses | 200–350 antifuses |
| LVT track antifuse capacitance | | 2–3.5 pF |
| Antifuse resistance (ACT 1) | | 0.5 kΩ (typ.), 0.7 kΩ (max.) |
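The antifuse counts and antifuse capacitances in Table 7.2 follow from the track counts and the assumed 10 fF per antifuse. The Python sketch below repeats that arithmetic for a fully populated channel; the function and constant names are illustrative only.

```python
ANTIFUSE_CAP_PF = 0.010    # assumed 10 fF per antifuse, as in the text
TRACKS_PER_CHANNEL = 25    # horizontal tracks crossed by a vertical wire
TRACKS_PER_COLUMN = 13     # vertical tracks crossed by a horizontal wire

def vertical_antifuses(channels):
    """Antifuses on a vertical wire (stub or LVT) spanning `channels` channels."""
    return TRACKS_PER_CHANNEL * channels

def horizontal_antifuses(columns):
    """Antifuses on a horizontal track spanning `columns` columns."""
    return TRACKS_PER_COLUMN * columns

def antifuse_capacitance_pf(count):
    return count * ANTIFUSE_CAP_PF

segments = {
    "input stub (1 channel)": vertical_antifuses(1),
    "output stub (4 channels)": vertical_antifuses(4),
    "LVT, 1010 (8 channels)": vertical_antifuses(8),
    "LVT, 1020 (14 channels)": vertical_antifuses(14),
    "horizontal track (4 columns)": horizontal_antifuses(4),
    "horizontal track (44 columns)": horizontal_antifuses(44),
}
for name, count in segments.items():
    print(f"{name}: {count} antifuses, {antifuse_capacitance_pf(count):.2f} pF")
```

In every case the antifuse loading is larger than the metal capacitance listed for the same wire in the A1010B/A1020B column.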
We can use the figures from Table 7.2 to estimate the interconnect delays. First we calculate the following resistance and capacitance values:

1. The antifuse resistance is assumed to be $R$ = 0.5 kΩ.
2. $C_0$ = 1.2 pF is the sum of the gate output capacitance (which we shall neglect) and the output stub capacitance (1.0 pF due to antifuses, 0.2 pF due to metal). The contribution from this term is zero in our calculation because we have neglected the pull resistance of the driving gate.
3. $C_1 = C_3$ = 0.59 pF (0.52 pF due to antifuses, 0.07 pF due to metal), corresponding to a minimum-length horizontal track.
4. $C_2$ = 4.3 pF (3.5 pF due to antifuses, 0.8 pF due to metal), corresponding to an LVT in a 1020B.
5. The estimated input capacitance of a gate is $C_4$ = 0.02 pF (the exact value will depend on which input of a Logic Module we connect to).

From Eq. 7.7, the Elmore time constant for a four-antifuse connection is

$$ \tau_{D4} = 4(0.5)(0.02) + 3(0.5)(0.59) + 2(0.5)(4.3) + (0.5)(0.59) = 5.52 \text{ ns} . \tag{7.8} $$

This matches delays obtained from the Actel delay calculator. For example, an LVT adds between 5–10 ns delay in an ACT 1 FPGA (6–12 ns for ACT 2, and 4–14 ns for ACT 3). The LVT connection is about the slowest connection that we can make in an ACT array. Normally less than 10 percent of all connections need to use an LVT and we see why Actel takes great care to make sure that this is the case.
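Eq. 7.8 is easy to reproduce numerically, which also shows how much of the delay comes from the LVT. The sketch below takes the resistance in kΩ and the capacitances in pF, so the products are in ns; the function name is an illustrative choice.

```python
def chain_elmore_ns(r_kohm, caps_pf):
    """Elmore time constant (ns) at the end of a chain of equal resistances.

    caps_pf[k-1] is C_k; segment k sees k antifuse resistances upstream,
    which gives the R*C1 + 2R*C2 + 3R*C3 + 4R*C4 pattern of Eq. 7.7.
    """
    return sum((k + 1) * r_kohm * c for k, c in enumerate(caps_pf))

R_KOHM = 0.5
CAPS_PF = [0.59, 4.3, 0.59, 0.02]   # C1 to C4 from the list above

tau_d4 = chain_elmore_ns(R_KOHM, CAPS_PF)
lvt_term = 2 * R_KOHM * CAPS_PF[1]  # contribution of the LVT capacitance C2
print(f"tau_D4 = {tau_d4:.2f} ns, of which {lvt_term:.2f} ns is the LVT term")
```

The C_2 term alone accounts for 4.3 ns of the 5.52 ns total, which is consistent with the LVT being the slowest kind of connection.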
7.1.5 ACT 2 and ACT 3 Interconnect

The ACT 1 architecture uses two antifuses for routing nearby modules, three antifuses to join horizontal segments, and four antifuses to use a horizontal or vertical long track. The ACT 2 and ACT 3 architectures use increased interconnect resources over the ACT 1 device that we have described. This reduces further the number of connections that need more than two antifuses. Delay is also reduced by decreasing the population of antifuses in the channels, and by decreasing the antifuse resistance of certain critical antifuses (by increasing the programming current).

The channel density is the absolute minimum number of tracks needed in a channel to make a given set of connections (see Section 17.2.2, Measurement of Channel Density). Software to route connections using channeled routing is so efficient that, given complete freedom in location of wires, a channel router can usually complete the connections with the number of tracks equal or close to the theoretical minimum, the channel density. Actel's studies on segmented channel routing have shown that increasing the number of horizontal tracks slightly (by approximately 10 percent) above density can lead to very high routing completion rates. The ACT 2 devices have 36 horizontal tracks per channel rather than the 22 available in the ACT 1 architecture.

Horizontal track segments in an ACT 3 device range from a module pair to the full channel length. Vertical tracks are: input (with a two-channel span: one up, one down); output (with a four-channel span: two up, two down); and long (LVT). Four LVTs are shared by each column pair. The ACT 2/3 Logic Modules can accept five inputs, rather than four inputs for the ACT 1 modules, and thus the ACT 2/3 Logic Modules need an extra two vertical tracks per channel. The number of tracks per column thus increases from 13 to 15 in the ACT 2/3 architecture.

The greatest challenge facing the Actel FPGA architects is the resistance of the polysilicon-diffusion antifuse. The nominal antifuse resistance in the ACT 1–2 1–2 µm processes (with a 5 mA programming current) is approximately 500 Ω and, in the worst case, may be as high as 700 Ω. The high resistance severely limits the number of antifuses in a connection. The ACT 2/3 devices assign a special antifuse to each output allowing a direct connection to an LVT. This reduces the number of antifuses in a connection using an LVT to three. This type of antifuse (a fast fuse) is blown at a higher current than the other antifuses to give it about half the nominal resistance (about 0.25 kΩ for ACT 2) of a normal antifuse. The nominal antifuse resistance is reduced further in the ACT 3 (using a 0.8 µm process) to 200 Ω (Actel does not state whether this value is for a normal or fast fuse). However, it is the worst-case antifuse resistance that will determine the worst-case performance.
7.2 Xilinx LCA

Figure 7.5 shows the hierarchical Xilinx LCA interconnect architecture.
The vertical lines and horizontal lines run between CLBs.
The general-purpose interconnect joins switch boxes (also known as magic boxes or switching matrices).
The long lines run across the entire chip. It is possible to form internal buses using long lines and the three-state buffers that are next to each CLB.
The direct connections (not used on the XC4000) bypass the switch matrices and directly connect adjacent CLBs.
The Programmable Interconnection Points (PIPs) are programmable pass transistors that connect the CLB inputs and outputs to the routing network.
The bidirectional (BIDI) interconnect buffers restore the logic level and logic strength on long interconnect paths.
FIGURE 7.5 Xilinx LCA interconnect. (a) The LCA architecture (notice the matrix element size is larger than a CLB). (b) A simplified representation of the interconnect resources. Each of the lines is a bus.

Table 7.3 shows the interconnect data for an XC3020, a typical Xilinx LCA FPGA that uses two-level metal interconnect. Figure 7.6 shows the switching matrix. Programming a switch matrix allows a number of different connections between the general-purpose interconnect.

TABLE 7.3 XC3000 interconnect parameters.

| Parameter | XC3020 |
|---|---|
| Technology | 1.0 µm, λ = 0.5 µm |
| Die height | 220 mil |
| Die width | 180 mil |
| Die area | 39,600 mil² = 102 Mλ² |
| CLB matrix height (Y) | 480 µm = 960 λ |
| CLB matrix width (X) | 370 µm = 740 λ |
| CLB matrix area (X × Y) | 177,600 µm² = 710 kλ² |
| Matrix transistor resistance, R_P1 | 0.5–1 kΩ |
| Matrix transistor parasitic capacitance, C_P1 | 0.01–0.02 pF |
| PIP transistor resistance, R_P2 | 0.5–1 kΩ |
| PIP transistor parasitic capacitance, C_P2 | 0.01–0.02 pF |
| Single-length line (X, Y) | 370 µm, 480 µm |
| Single-length line capacitance: C_LX, C_LY | 0.075 pF, 0.1 pF |
| Horizontal Longline (8X) | 8 cols. = 2960 µm |
| Horizontal Longline metal capacitance, C_LL | 0.6 pF |

FIGURE 7.6 Components of interconnect delay in a Xilinx LCA array. (a) A portion of the interconnect around the CLBs. (b) A switching matrix. (c) A detailed view inside the switching matrix showing the pass-transistor arrangement. (d) The equivalent circuit for the connection between nets 6 and 20 using the matrix. (e) A view of the interconnect at a Programmable Interconnection Point (PIP). (f) and (g) The equivalent schematic of a PIP connection. (h) The complete RC delay path.

In Figure 7.6 (d), (g), and (h):

$C_1 = 3C_{P1} + 3C_{P2} + 0.5C_{LX}$ is the parasitic capacitance due to the switch matrix and PIPs (F4, C4, G4) for CLB1, and half of the line capacitance for the double-length line adjacent to CLB1.
$C_{P1}$ and $R_{P1}$ are the switching-matrix parasitic capacitance and resistance.
$C_{P2}$ and $R_{P2}$ are the parasitic capacitance and resistance for the PIP connecting YQ of CLB1 and F4 of CLB3.
$C_2 = 0.5C_{LX} + C_{LX}$ accounts for half of the line adjacent to CLB1 and the line adjacent to CLB2.
$C_3 = 0.5C_{LX}$ accounts for half of the line adjacent to CLB3.
$C_4 = 0.5C_{LX} + 3C_{P2} + C_{LX} + 3C_{P1}$ accounts for half of the line adjacent to CLB3, the PIPs of CLB3 (C4, G4, YQ), and the rest of the line and switch matrix capacitance following CLB3.
We can determine Elmore's time constant for the connection shown in Figure 7.6 as

$$ \tau_D = R_{P2}(C_{P2} + C_2 + 3C_{P1}) + (R_{P2} + R_{P1})(3C_{P1} + C_3 + C_{P2}) + (2R_{P2} + R_{P1})(C_{P2} + C_4) . \tag{7.9} $$

If $R_{P1} = R_{P2} = R_P$ and $C_{P1} = C_{P2} = C_P$, then

$$ \tau_D = (15 + 21) R_P C_P + (1.5 + 1 + 4.5) R_P C_{LX} . \tag{7.10} $$

We need to know the pass-transistor resistance $R_P$. For example, suppose $R_P$ = 1 kΩ. If $k'_n$ = 50 µA V⁻², then (with $V_{tn}$ = 0.65 V and $V_{DD}$ = 3.3 V)

$$ \frac{W}{L} = \frac{1}{k'_n R_P (V_{DD} - V_{tn})} = \frac{1}{(50 \times 10^{-6})(1 \times 10^{3})(3.3 - 0.65)} = 7.5 . \tag{7.11} $$

If $L$ = 1 µm, both source and drain areas are 7.5 µm long and approximately 3 µm wide (determined by diffusion overlap of contact, contact width, and contact-to-gate spacing, rules 6.1a + 6.2a + 6.4a = 5.5 λ in Table 2.7). Both drain and source areas are thus 23 µm² and the sidewall perimeters are 14 µm (excluding the sidewall facing the channel). If we have a diffusion capacitance of 140 aF µm⁻² (area) and 500 aF µm⁻¹ (perimeter), typical values for a 1.0 µm process, the parasitic source and drain capacitance is

$$ C_P = (140 \times 10^{-18})(23) + (500 \times 10^{-18})(14) = 1.022 \times 10^{-14} \text{ F} . \tag{7.12} $$

If we assume $C_P$ = 0.01 pF and $C_{LX}$ = 0.075 pF (Table 7.3),

$$ \tau_D = (36)(1)(0.01) + (7)(1)(0.075) = 0.885 \text{ ns} . \tag{7.13} $$

A delay of approximately 1 ns agrees with the typical values from the XACT delay calculator and is about the fastest connection we can make between two CLBs.
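The three numerical steps above (Eqs. 7.11 to 7.13) can be checked with a short Python sketch. It uses the coefficients 36 and 7 exactly as they appear in Eqs. 7.10 and 7.13 rather than rederiving them; the function names are illustrative, and resistances in kΩ multiplied by capacitances in pF give ns.

```python
def pass_transistor_w_over_l(k_n, r_p, v_dd, v_tn):
    """W/L needed for an on-resistance r_p (Eq. 7.11), all in SI units."""
    return 1.0 / (k_n * r_p * (v_dd - v_tn))

def diffusion_cap_farads(area_um2, perim_um, c_area_af=140, c_perim_af=500):
    """Source/drain parasitic capacitance (Eq. 7.12); coefficients in aF."""
    return (c_area_af * area_um2 + c_perim_af * perim_um) * 1e-18

def lca_time_constant_ns(r_p_kohm, c_p_pf, c_lx_pf):
    """Eq. 7.13, using the coefficients quoted in Eq. 7.10."""
    return 36 * r_p_kohm * c_p_pf + 7 * r_p_kohm * c_lx_pf

w_over_l = pass_transistor_w_over_l(50e-6, 1e3, 3.3, 0.65)
c_p = diffusion_cap_farads(23, 14)
tau_d = lca_time_constant_ns(1.0, 0.01, 0.075)
print(f"W/L = {w_over_l:.2f}, C_P = {c_p:.3e} F, tau_D = {tau_d:.3f} ns")
```

The computed C_P of about 0.01 pF is what justifies the value used in Eq. 7.13, and the line-capacitance term (0.525 ns) is the larger of the two contributions.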
FIGURE 7.7 The Xilinx EPLD UIM (Universal Interconnection Module). (a) A simplified block diagram of the UIM. The UIM bus width, n , varies from 68 (XC7236) to 198 (XC73108). (b) The UIM is actually a large programmable AND array. (c) The parasitic capacitance of the EPROM cell.
7.3 Xilinx EPLD

The Xilinx EPLD family uses an interconnect bus known as Universal Interconnection Module ( UIM ) to distribute signals within the FPGA. The UIM, shown in Figure 7.7 , is a programmable AND array with constant delay from any input to any output. In Figure 7.7 :
C_G is the fixed gate capacitance of the EPROM device.
C_D is the fixed drain parasitic capacitance of the EPROM device.
C_B is the variable horizontal bus (bit line) capacitance.
C_W is the variable vertical bus (word line) capacitance.
Figure 7.7 shows the UIM has 21 output connections to each FB.¹ Thus the XC7272 UIM (with a 4 × 2 array of eight FBs as shown in Figure 7.7) has 168 (8 × 21) output connections. Most (but not all) of the nine I/O cells attached to each FB have two input connections to the UIM, one from a chip input and one feedback from the macrocell output. For example, the XC7272 has 18 I/O cells that are outputs only and thus have only one connection to the UIM, so n = (18 × 8) − 18 = 126 input connections. Now we can calculate the number of tracks in the UIM: the XC7272, for example, has H = 126 tracks and V = 168/2 = 84 tracks. The actual physical height, V, of the UIM is determined by the size of the FBs, and is close to the die height.

The UIM ranges in size with the number of FBs. For the smallest XC7236 (with a 2 × 2 array of four FBs), the UIM has n = 68 inputs and 84 outputs. For the XC73108 (with a 6 × 2 array of 12 FBs), the UIM has n = 198 inputs. The UIM is a large array with large parasitic capacitance; it employs a highly optimized structure that uses EPROM devices and a sense amplifier at each output. The signal swing on the UIM uses less than the full V_DD = 5 V to reduce the interconnect delay.

1. 1994 data book p. 3-62 and p. 3-78.
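The UIM sizing for the XC7272 is simple counting, sketched below in Python. The function names are hypothetical, and the number of output-only I/O cells is an input taken from the text (18 for the XC7272); the text does not give that figure for the other devices, so only the output count is computed for the XC7236.

```python
OUTPUTS_PER_FB = 21    # UIM output connections to each FB
IO_CELLS_PER_FB = 9    # I/O cells attached to each FB
INPUTS_PER_IO_CELL = 2  # chip input plus macrocell feedback

def uim_outputs(num_fbs):
    return num_fbs * OUTPUTS_PER_FB

def uim_inputs(num_fbs, output_only_cells):
    """Two UIM inputs per I/O cell, less one for each output-only cell."""
    return INPUTS_PER_IO_CELL * IO_CELLS_PER_FB * num_fbs - output_only_cells

xc7272_outputs = uim_outputs(8)        # 4 x 2 array of FBs
xc7272_inputs = uim_inputs(8, 18)      # H tracks
xc7272_v_tracks = xc7272_outputs // 2  # V tracks
xc7236_outputs = uim_outputs(4)        # 2 x 2 array of FBs
print(xc7272_outputs, xc7272_inputs, xc7272_v_tracks, xc7236_outputs)
```

The XC7236 output count from this formula agrees with the 84 outputs stated in the text.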
7.4 Altera MAX 5000 and 7000

Altera MAX 5000 devices (except the EPM5032, which has only one LAB) and all MAX 7000 devices use a Programmable Interconnect Array (PIA), shown in Figure 7.8. The PIA is a cross-point switch for logic signals traveling between LABs. The advantage of this architecture (which uses a fixed number of connections) over programmable interconnection schemes (which use a variable number of connections) is the fixed routing delay. An additional benefit of the simpler nature of a large regular interconnect structure is the simplification and improved speed of the placement and routing software.

FIGURE 7.8 A simplified block diagram of the Altera MAX interconnect scheme. (a) The PIA (Programmable Interconnect Array) is deterministic: delay is independent of the path length. (b) Each LAB (Logic Array Block) contains a programmable AND array. (c) Interconnect timing within a LAB is also fixed.

Figure 7.8(a) illustrates that the delay between any two LABs, t_PIA, is fixed. The delay between LAB1 and LAB2 (which are adjacent) is the same as the delay between LAB1 and LAB6 (on opposite corners of the die). It may seem rather strange to slow down all connections to the speed of the longest possible connection, a large penalty to pay to achieve a deterministic architecture. However, it gives Altera the opportunity to highly optimize all of the connections since they are completely fixed.
7.5 Altera MAX 9000

Figure 7.9 shows the Altera MAX 9000 interconnect architecture. The size of the MAX 9000 LAB arrays varies between 4 × 5 (rows × columns) for the EPM9320 and 7 × 5 for the EPM9560. The MAX 9000 is an extremely coarse-grained architecture, typical of complex PLDs, but the LABs themselves have a finer structure. Sometimes we say that complex PLDs with arrays (LABs in the Altera MAX family) that are themselves arrays (of macrocells) have a dual-grain architecture.
FIGURE 7.9 The Altera MAX 9000 interconnect scheme. (a) A 4 × 5 array of Logic Array Blocks (LABs), the same size as the EPM9400 chip. (b) A simplified block diagram of the interconnect architecture showing the connection of the FastTrack buses to a LAB.
In Figure 7.9 (b), boxes A, B, and C represent the interconnection between the FastTrack buses and the 16 macrocells in each LAB:
Box A connects a macrocell to one row channel.
Box B connects three column channels to two row channels.
Box C connects a macrocell to three column channels.
7.6 Altera FLEX

Figure 7.10 shows the interconnect used in the Altera FLEX family of complex PLDs. Altera refers to the FLEX interconnect and MAX 9000 interconnect by the same name, FastTrack, but the two are different because the granularity of the logic cell arrays is different. The FLEX architecture is of finer grain than the MAX arrays because of the difference in programming technology. The FLEX horizontal interconnect is much denser (at 168 channels per row) than the vertical interconnect (16 channels per column), creating an aspect ratio for the interconnect of over 10:1 (168:16). This imbalance is partly due to the aspect ratio of the die, the array, and the aspect ratio of the basic logic cell, the LAB.

FIGURE 7.10 The Altera FLEX interconnect scheme. (a) The row and column FastTrack interconnect. The chip shown, with 4 rows × 21 columns, is the same size as the EPF8820. (b) A simplified diagram of the interconnect architecture showing the connections between the FastTrack buses and a LAB. Boxes A, B, and C represent the bus-to-bus connections.

As an example, the EPF8820 has 4 rows and 21 columns of LABs (Figure 7.10a). Ignoring, for simplicity's sake, what happens at the edge of the die, we can total the routing channels as follows:

Horizontal channels = 4 rows × 168 channels/row = 672 channels.
Vertical channels = 21 columns × 16 channels/column = 336 channels.
It appears that there is still approximately twice (672:336) as much interconnect capacity in the horizontal direction as the vertical. If we look inside the boxes A, B, and C in Figure 7.10 (b) we see that for individual lines on each bus:
• Box A connects an LE to two row channels.
• Box B connects two column channels to a row channel.
• Box C connects an LE to two column channels.
There is some dependence between boxes A and B since they contain MUXes rather than direct connections, but essentially there are twice as many connections to the column FastTrack as the row FastTrack, thus restoring the balance in interconnect capacity.
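The channel arithmetic for the EPF8820 can be checked with a few lines of Python. This is only a sketch of the calculation in the text: the row, column, and channel counts are the figures quoted above, while the function and variable names are illustrative and edge effects are ignored, as in the text.

```python
# Figures quoted in the text for the EPF8820 (edge effects ignored).
ROWS, COLUMNS = 4, 21
CHANNELS_PER_ROW = 168      # horizontal FastTrack channels per row
CHANNELS_PER_COLUMN = 16    # vertical FastTrack channels per column


def channel_totals(rows, columns, per_row, per_column):
    """Return (horizontal, vertical) routing-channel totals for a LAB array."""
    return rows * per_row, columns * per_column


horizontal, vertical = channel_totals(
    ROWS, COLUMNS, CHANNELS_PER_ROW, CHANNELS_PER_COLUMN
)

# Ratio of channels in a single row to channels in a single column ("over 10:1").
per_channel_ratio = CHANNELS_PER_ROW / CHANNELS_PER_COLUMN
# Ratio of the chip-wide totals ("approximately twice").
total_ratio = horizontal / vertical

print(horizontal, vertical, per_channel_ratio, total_ratio)
```

The per-channel ratio is large, but once the 21 columns are counted against only 4 rows the chip-wide imbalance shrinks to a factor of two.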
7.7 Summary
The RC products of the parasitic elements of an antifuse and of a pass transistor are not too different. However, an SRAM cell is much larger than an antifuse, which leads to coarser interconnect architectures for SRAM-based programmable ASICs. The EPROM device lends itself to large wired-logic structures. These differences in programming technology lead to different architectures:
• The antifuse FPGA architectures are dense and regular.
• The SRAM architectures contain nested structures of interconnect resources.
• The complex PLD architectures use long interconnect lines but achieve deterministic routing.
Table 7.4 is a look-up table for Tables 7.5 and 7.6, which summarize the features of the logic cells used by the various FPGA vendors.

TABLE 7.4 I/O Cell Tables.

| Programmable ASIC family | Table |
|---|---|
| Actel (ACT 1) | Table 7.5 |
| Xilinx (XC3000) | Table 7.5 |
| Actel (ACT 2) | Table 7.5 |
| Xilinx (XC4000) | Table 7.5 |
| Altera MAX (EPM 5000) | Table 7.5 |
| Xilinx EPLD (XC7200/7300) | Table 7.5 |
| Actel (ACT 3) | Table 7.5 |
| QuickLogic (pASIC 1) | Table 7.5 |
| Crosspoint (CP20K) | Table 7.5 |
| Altera MAX (EPM 7000) | Table 7.5 |
| Atmel (AT6000) | Table 7.5 |
| Xilinx LCA (XC5200) | Table 7.5 |
| Xilinx (XC8100) | Table 7.6 |
| Lucent ORCA (2C) | Table 7.6 |
| Altera FLEX (8000/10k) | Table 7.6 |
| AMD MACH 5 | Table 7.6 |
| Actel 3200DX | Table 7.6 |
| Altera MAX (EPM 9000) | Table 7.6 |

TABLE 7.5 Programmable ASIC interconnect.

| Programmable ASIC family | Interconnect between logic cells (tracks = trks) | Interconnect delay | Interconnect inside logic cells |
|---|---|---|---|
| Actel (ACT 1) | Channeled array with segmented routing, long lines: 25 trks/ch. (horiz.); 13 trks/ch. (vert.); <4 antifuses/path | Variable | Poly-diffusion antifuse |
| Xilinx (XC3000) | Switch box, PIPs (Programmable Interconnect Points), 3-state internal bus, and long lines | Variable | 32-bit SRAM LUT |
| Actel (ACT 2) | Channeled array with segmented routing, long lines: 36 trks/ch. (horiz.); 15 trks/ch. (vert.); <4 antifuses/path | Variable | Poly-diffusion antifuse |
| Xilinx (XC4000) | Switch box, PIPs (Programmable Interconnect Points), 3-state internal bus, and long lines | Variable | 32-bit SRAM LUT |
| Altera (MAX 5000) | Cross-bar PIA (Programmable Interconnect Architecture) using EPROM programmable-AND array | Fixed | EPROM |
| Xilinx EPLD | UIM (Universal Interconnect Matrix) using EPROM programmable-AND array | Fixed | EPROM |
| QuickLogic (pASIC 1) | Programmable fully populated antifuse matrix | Variable | Metal-metal antifuse |
| Actel (ACT 3) | Channeled array with segmented routing, long lines: <4 antifuses/path | Variable | Poly-diffusion antifuse |
| Crosspoint (CP20K) | Programmable highly interconnected matrix | Variable | Metal-metal antifuse |
| Altera MAX (MAX 7000) | Fixed cross-bar PIA (Programmable Interconnect Architecture) | Fixed | EEPROM |
| Atmel (AT6000) | Programmable regular, local, and express bus scheme with line repeaters | Variable | SRAM |
| Xilinx LCA (XC5200) | Switch box, PIPs (Programmable Interconnect Points), 3-state internal bus, and long lines | Variable | 16-bit SRAM LUT |

TABLE 7.6 Programmable ASIC interconnect (continued).

| Programmable ASIC family | Interconnect between logic cells | Interconnect delay | Interconnect inside logic cells |
|---|---|---|---|
| Xilinx (XC8100) | Channeled array with segmented routing, long lines. Programmable fully populated antifuse matrix. | Variable | Antifuse |
| Lucent ORCA 2C | Switch box, SRAM programmable interconnect, 3-state internal bus, and long lines | Variable | SRAM LUTs and MUXs |
| Altera FLEX 8000/10k | Row and column FastTrack between LABs | Fixed with small variation in delay in row FastTrack | LAB local interconnect between LEs. 16-bit SRAM LUT in LE. |
| AMD MACH 5 | EPROM programmable array | Fixed | EPROM |
| Actel 3200DX | Channeled gate array with segmented routing, long lines | Variable | Poly-diffusion antifuse |
| Altera MAX 9000 | Row and column FastTrack between LABs | Fixed | Programmable AND array inside LAB, EEPROM MUXes |
The key points covered in this chapter are:

• The difference between deterministic and nondeterministic interconnect
• Estimating interconnect delay
• Elmore's constant
Next, in Chapter 8, we shall cover the software you need to design with the various FPGA families and explain how FPGAs are programmed.
PROGRAMMABLE ASIC DESIGN SOFTWARE

There are five components of a programmable ASIC or FPGA: (1) the programming technology, (2) the basic logic cell, (3) the I/O cell, (4) the interconnect, and (5) the design software that allows you to program the ASIC. The design software is much more closely tied to the FPGA architecture than is the case for other types of ASICs.
8.1 Design Systems 8.2 Logic Synthesis 8.3 The Halfgate ASIC 8.4 Summary 8.5 Problems 8.6 Bibliography 8.7 References
8.1 Design Systems

The sequence of steps for FPGA design is similar to the sequence discussed in Section 1.2, Design Flow. As for any ASIC, a designer needs design-entry software, a cell library, and physical-design software. Each of the FPGA vendors sells design kits that include all the software and hardware that a designer needs. Many of these kits use design-entry software produced by a different company. Often designers buy that software from the FPGA vendor. This is called an original equipment manufacturer (OEM) arrangement, similar to buying a car with a stereo manufactured by an electronics company but labeled with the automobile company's name. Design entry uses cell libraries that are unique to each FPGA vendor. All of the FPGA vendors produce their own physical-design software so they can tune the algorithms to their own architecture.

Unfortunately, there are no standards in FPGA design. Thus, for example, Xilinx calls its 2:1 MUX an M2_1 with inputs labeled D0, D1, and S0 with output O. Actel calls a 2:1 MUX an MX2 with inputs A, B, and S with output Y. This problem is not peculiar to Xilinx and Actel; each ASIC vendor names its logic cells, buffers, pads, and so on in a different manner. Consequently designers may not be able to transfer a netlist using one ASIC vendor library to another. Worse than this, designers may not even be able to transfer a design between two FPGA families made by the same FPGA vendor!

One solution to the lack of standards for cell libraries is to use a generic cell library, independent from any particular FPGA vendor. For example, most of the FPGA libraries include symbols that are equivalent to TTL 7400 logic series parts. The FPGA vendor's own software automatically handles the conversion from schematic symbols to the logic cells of the FPGA.
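To make the naming problem concrete, here is a minimal Python sketch of what a netlist translator has to do for just one cell. The cell and pin names (M2_1, D0, D1, S0, O and MX2, A, B, S, Y) are the ones quoted above. Everything else is an assumption made for illustration: the dictionary layout, the role names, the `retarget` function, and in particular the pairing of D0 with A and D1 with B, which assumes both vendors select the first data input when the select line is 0.

```python
# Pin names come from the text; the role names and the D0<->A, D1<->B pairing
# are assumptions for this sketch.
VENDOR_MUX = {
    "xilinx": {
        "cell": "M2_1",
        "pins": {"data0": "D0", "data1": "D1", "select": "S0", "out": "O"},
    },
    "actel": {
        "cell": "MX2",
        "pins": {"data0": "A", "data1": "B", "select": "S", "out": "Y"},
    },
}


def retarget(instance, source, target):
    """Rename one 2:1 MUX instance from the source library to the target library."""
    src, dst = VENDOR_MUX[source], VENDOR_MUX[target]
    if instance["cell"] != src["cell"]:
        raise ValueError(f"expected a {src['cell']} cell, got {instance['cell']}")
    # Invert the source pin table so each pin name maps back to its role.
    role_of = {pin: role for role, pin in src["pins"].items()}
    return {
        "cell": dst["cell"],
        "name": instance["name"],
        "connections": {
            dst["pins"][role_of[pin]]: net
            for pin, net in instance["connections"].items()
        },
    }


# A hypothetical Xilinx MUX instance and its Actel equivalent.
xilinx_mux = {
    "cell": "M2_1",
    "name": "u1",
    "connections": {"D0": "n1", "D1": "n2", "S0": "sel", "O": "q"},
}
actel_mux = retarget(xilinx_mux, "xilinx", "actel")
print(actel_mux)
```

A real library holds hundreds of cells, and many have no one-to-one equivalent, which is why a generic cell library or an HDL is usually a more practical route than a renaming table.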
Schematic entry is not the only method of design entry for FPGAs. Some designers are happier describing control logic and state machines in terms of state diagrams and logic equations. A solution to some of the problems with schematic entry for FPGA design is to use one of several hardware description languages (HDLs) for which there are some standards. There are two sets of languages in common use. One set has evolved from the design of programmable logic devices (PLDs). The ABEL (pronounced "able"), CUPL ("cupple"), and PALASM ("pal-azzam") languages are simple and easy to learn. These languages are useful for describing state machines and combinational logic. The other set of HDLs includes VHDL and Verilog, which are higher-level and are more complex but are capable of describing complete ASICs and systems.

After completing design entry and generating a netlist, the next step is simulation. Two types of simulators are normally used for FPGA design. The first is a logic simulator for behavioral, functional, and timing simulation. This tool can catch any design errors. The designer provides input waveforms to the simulator and checks to see that the outputs are as expected. At this point, using a nondeterministic architecture, logic path delays are only estimates, since the wiring delays will not be known until after physical design (place-and-route) is complete. Designers then add or back-annotate the postlayout timing information to the postlayout netlist (also called a back-annotated netlist). This is followed by a postlayout timing simulation.

The second type of simulator, the type most often used in FPGA design, is a timing-analysis tool. A timing analyzer is a static simulator and removes the need for input waveforms. Instead the timing analyzer checks for critical paths that limit the speed of operation: signal paths that have large delays caused, say, by a high fanout net.
Designers can set a certain delay restriction on a net or path as a timing constraint; if the actual delay is longer, this is a timing violation. In most design systems we can return to design entry and tag critical paths with attributes before completing the place-and-route step again. The next time we use the place-and-route software it will pay special attention to those signals we have labeled as critical in order to minimize the routing delays associated with those signals. The problem is that this iterative process can be lengthy and sometimes nonconvergent. Each time timing violations are fixed, others appear. This is especially a problem with place-and-route software that uses random algorithms (and forms a chaotic system). More complex (and expensive) logic synthesizers can automate this iterative stage of the design process. The critical path information is calculated in the logic synthesizer, and timing constraints are created in a feedforward path (this is called forward-annotation) to direct the place-and-route software.

Although some FPGAs are reprogrammable, it is not a good idea to rely on this fact. It is very tempting to program the FPGA, test it, make changes to the netlist, and then keep programming the device until it works. This process is much more time consuming and much less reliable than performing thorough simulation. It is quite possible, for example, to get a chip working in an experimental fashion without really knowing why. The danger here is that the design may fail under some other set of operating conditions or circumstances. Simulation is the proper way to catch and correct these potential disasters.
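The idea behind a timing analyzer, as described in this section, can be illustrated with a small Python sketch. It is not how any particular vendor tool works: the netlist, node names, delays, and the 8 ns constraint are all hypothetical, and the circuit is assumed to be purely combinational with no loops. The sketch finds the latest arrival time at every node without any input waveforms and then reports which constrained nodes have a timing violation.

```python
from functools import lru_cache

# Hypothetical netlist: each edge is (driver, load, delay in ns).
EDGES = [
    ("in_a", "g1", 2.0),
    ("in_b", "g1", 2.5),
    ("g1", "g2", 3.0),
    ("g1", "g3", 6.5),   # imagine a high-fanout net with a large wiring delay
    ("g2", "out", 1.5),
    ("g3", "out", 1.0),
]


def arrival_times(edges):
    """Latest signal arrival time at every node of an acyclic netlist."""
    fanin = {}
    nodes = set()
    for src, dst, delay in edges:
        fanin.setdefault(dst, []).append((src, delay))
        nodes.update((src, dst))

    @lru_cache(maxsize=None)
    def arrival(node):
        # Primary inputs have no fan-in and are taken to arrive at time 0.
        return max((arrival(s) + d for s, d in fanin.get(node, [])), default=0.0)

    return {node: arrival(node) for node in nodes}


def violations(edges, constraints):
    """Map each constrained node that is too slow to the amount it misses by."""
    times = arrival_times(edges)
    return {
        node: times[node] - limit
        for node, limit in constraints.items()
        if times[node] > limit
    }


print(arrival_times(EDGES)["out"])          # delay of the critical path to 'out'
print(violations(EDGES, {"out": 8.0}))      # a constraint of 8 ns is violated
```

In this example the critical path runs through g3, so tagging that net as critical is what a designer (or a forward-annotating synthesizer) would ask the place-and-route software to prioritize.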
8.1.1 Xilinx
Figure 8.1 shows the Xilinx design system. Using third-party design-entry software, the designer creates a netlist that forms the input to the Xilinx software. Utility software (pin2xnf for FutureNet DASH and wir2xnf for Viewlogic, for example) translates the netlist into a Xilinx netlist format (XNF) file. In the next step the Xilinx program xnfmap takes the XNF netlist and maps the logic into the Xilinx Logic Cell Array (LCA) architecture. The output from the mapping step is a MAP file. The schematic MAP file may then be merged with other MAP files using xnfmerge. This technique is useful to merge different pieces of a design, some created using schematic entry and others created, for example, using logic synthesis. A translator program map2lca translates from the logic gates (NAND gates, NOR gates, and so on) to the required CLB configurations and produces an unrouted LCA file. The Xilinx place-and-route software (apr or ppr) takes the unrouted LCA file and performs the allocation of CLBs and completes the routing. The result is a routed LCA file. A control program xmake (that works like the make program used to build C programs) can automatically handle the mapping, merging, and place-and-route steps. Following the place-and-route step, the logic and wiring delays are known and the postlayout netlist may be generated. After a postlayout simulation the download file or BIT file used to program the FPGA (or a PROM that will load the FPGA) is generated using the Xilinx makebits program.
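The chain of programs and file types described above can be written down as a small table and checked for consistency. The Python sketch below is only a bookkeeping aid: the tool names and file types are those named in the text, but the tuple layout and the `check_flow` helper are illustrative, and no real command lines or options are implied.

```python
# (tool, input file type, output file type), as described in the text.
XILINX_FLOW = [
    ("xnfmap", "XNF", "MAP"),
    ("xnfmerge", "MAP", "MAP"),                 # merges several MAP files into one
    ("map2lca", "MAP", "unrouted LCA"),
    ("apr or ppr", "unrouted LCA", "routed LCA"),
    ("makebits", "routed LCA", "BIT"),
]


def check_flow(flow):
    """Confirm each step consumes what the previous step produced; return the tool order."""
    for (_, _, produced), (tool, consumed, _) in zip(flow, flow[1:]):
        if produced != consumed:
            raise ValueError(
                f"{tool} expects {consumed}, but the previous step wrote {produced}"
            )
    return [tool for tool, _, _ in flow]


print(" -> ".join(check_flow(XILINX_FLOW)))
```

This is essentially the dependency information that a control program such as xmake needs in order to decide which steps to run and in what order.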
FIGURE 8.1 The Xilinx FPGA design flow. The numbers next to the steps in the flow correspond to those in the general ASIC design flow of Figure 1.10.

Xilinx also provides a software program (Xilinx design editor, XDE) that permits manual control over the placement and routing of a Xilinx FPGA. The designer views a graphical representation of the FPGA, showing all the CLBs and interconnect, and can make or alter connections by pointing and clicking. This program is useful to check an automatically generated layout, or to explore critical routing paths, or to change and hand tune a critical connection, for example.

Xilinx uses a system called X-BLOX for creating regular structures such as vectored instances and datapaths. This system works with the Xilinx XNF netlist format. Other vendors, notably Actel and Altera, use a standard called Relationally Placed Modules (RPM), based on the EDIF standard, that ensures that the pieces of an 8-bit adder, for example, are treated as a macro and stay together during placement.
8.1.2 Actel
Actel FPGA design uses third-party design entry and simulators. After creating a netlist, a designer uses the Actel software for the place-and-route step. The Actel design software, like other FPGA and ASIC design systems, employs a large number of file formats with associated filename extensions. Table 8.1 shows some of the Actel file extensions and their meanings.

TABLE 8.1 File types used by Actel design software.

| Extension | Contents |
|---|---|
| ADL | Main design netlist |
| IPF | Partial or complete pin assignment for the design |
| CRT | Net criticality |
| VALIDATED | Audit information |
| COB | List of macros removed from design |
| VLD | Information, warning, and error messages |
| PIN | Complete pin assignment for the design |
| DFR | Information about routability and I/O assignment quality |
| LOC | Placement of non-I/O macros, pin swapping, and freeway assignment |
| PLI | Feedback from placement step |
| SEG | Assignment of horizontal routing segments |
| STF | Back-annotation timing |
| RTI | Feedback from routing step |
| FUS | Fuse coordinates (column-track, row-track) |
| DEL | Delays for input pins, nets, and I/O modules |
| AVI | Fuse programming times and currents for last chip programmed |
Actel software can also map hardware description files from other programmable logic design software into the Actel FPGA architecture. As an example, Table 8.2 shows a text description of a state machine using an HDL from a company called LOG/iC. You can then convert the LOG/iC code to the PALASM code shown in Table 8.2. The Actel software can take the PALASM code and merge it with other PALASM files or netlists.

TABLE 8.2 FPGA state-machine language.

LOG/iC state-machine language

    *IDENTIFICATION
     sequence detector
     LOG/iC code
    *X-NAMES
     X; !input
    *Y-NAMES
     D; !output, D = 1 when three 1's appear on X
    *FLOW-TABLE
     ;State, X input, Y output, next state
     S1, X1, Y0, F2;
     S1, X0, Y0, F1;
     S2, X1, Y0, F3;
     S2, X0, Y0, F1;
     S3, X1, Y0, F4;
     S3, X0, Y0, F1;
     S4, X1, Y1, F4;
     S4, X0, Y0, F1;
    *STATE-ASSIGNMENT
     BINARY;
    *RUN-CONTROL
     PROGFORMAT = P-EQUATIONS;
    *END

PALASM version

    TITLE sequence detector
    CHIP MEALY USER
    CLK Z QQ2 QQ1 X
    EQUATIONS
    Z = X * QQ2 * QQ1
    QQ2 := X * QQ1 + X * QQ2
    QQ1 := X * QQ2 + X * /QQ1
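One way to convince yourself that the PALASM equations implement the LOG/iC flow table is to evaluate both for every state and input. The Python sketch below does this. The flow table and the three equations are transcribed from Table 8.2 (reading the next-state entries F1 to F4 as states S1 to S4); the binary state encoding S1 = 00, S2 = 01, S3 = 10, S4 = 11 on (QQ2, QQ1) is an assumption, as are the function names.

```python
# Flow table from the LOG/iC listing: (state, X) -> (Z, next state).
FLOW_TABLE = {
    ("S1", 1): (0, "S2"), ("S1", 0): (0, "S1"),
    ("S2", 1): (0, "S3"), ("S2", 0): (0, "S1"),
    ("S3", 1): (0, "S4"), ("S3", 0): (0, "S1"),
    ("S4", 1): (1, "S4"), ("S4", 0): (0, "S1"),
}

# Assumed binary state assignment, written as (QQ2, QQ1).
ENCODING = {"S1": (0, 0), "S2": (0, 1), "S3": (1, 0), "S4": (1, 1)}


def palasm_step(qq2, qq1, x):
    """Evaluate the three PALASM equations for one clock cycle."""
    z = x & qq2 & qq1                        # Z   = X * QQ2 * QQ1
    next_qq2 = (x & qq1) | (x & qq2)         # QQ2 := X * QQ1 + X * QQ2
    next_qq1 = (x & qq2) | (x & (1 - qq1))   # QQ1 := X * QQ2 + X * /QQ1
    return z, next_qq2, next_qq1


def equations_match_table():
    """True if the equations agree with the flow table for all states and inputs."""
    for (state, x), (z, next_state) in FLOW_TABLE.items():
        qq2, qq1 = ENCODING[state]
        if palasm_step(qq2, qq1, x) != (z, *ENCODING[next_state]):
            return False
    return True


def run(inputs, state="S1"):
    """Step the flow table through a list of X values and collect the outputs."""
    outputs = []
    for x in inputs:
        z, state = FLOW_TABLE[(state, x)]
        outputs.append(z)
    return outputs


print(equations_match_table())
print(run([1, 1, 1, 1, 0, 1]))
```

Under the assumed encoding the equations and the flow table agree in all eight cases, which is the kind of check a translator between the two formats must pass.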
8.1.3 Altera
Altera uses a self-contained design system for its complex PLDs that performs design entry, simulation, and programming of the parts. Altera also provides an input and output interface to EDIF so that designers may use third-party schematic entry or a logic synthesizer. We have seen that the interconnect scheme in the Altera complex PLDs is nearly deterministic, simplifying the physical-design software as well as eliminating the need for back-annotation and a postlayout simulation. As Altera FPGAs become larger and more complex, there are some exceptions to this rule. Some special cases require signals to make more than one pass through the routing structures or travel large distances across the Altera FastTrack interconnect. It is possible to tell if this will be the case only by trying to place and route an Altera device.
8.2 Logic Synthesis

Designers are increasingly using logic synthesis as a replacement for schematic entry. As microelectronic systems and their ASICs become more complex, the use of schematics becomes less practical. For example, a complex ASIC that contains over 10,000 gates might require hundreds of pages of schematics at the gate level. As another example, it is easier to write A = B + C than to draw a schematic for a 32-bit adder at the gate level.

The term logic synthesis is used to cover a broad range of software and software capabilities. Many logic synthesizers are based on logic minimization. Logic minimization is usually performed in one of two ways, either using a set of rules or using algorithms. Early logic-minimization software was designed using algorithms for two-level logic minimization and developed into multilevel logic-optimization software. Two-level and multilevel logic minimization is well suited to random logic that is to be implemented using a CBIC, MGA, or PLD. In these technologies, two-level logic can be implemented very efficiently. Logic minimization for FPGAs, including complex PLDs, is more difficult than for other types of ASICs, because of the complex basic logic cells in FPGAs.

There are two ways to use logic synthesis in the design of FPGAs. The first and simplest method takes a hardware description, optimizes the logic, and then produces a netlist. The netlist is then passed to software that maps the netlist to an FPGA architecture. The disadvantage of this method is the inefficiency of decoupling the logic optimization from the mapping step. The second, more complicated, but more efficient method, takes the hardware description and directly optimizes the logic for a specific FPGA architecture.

Some logic synthesizers produce files in PALASM, ABEL, or CUPL formats. Software provided by the FPGA vendor then takes these files and maps the logic to the FPGA architecture. The FPGA mapping software requires detailed knowledge of the FPGA architecture. This makes it difficult for third-party companies to create logic synthesis software that can map directly to the FPGA.

A problem with design-entry systems is the difficulty of moving netlists between different FPGA vendors. Once you have completed a design using an FPGA cell library, for example, you are committed to using that type of FPGA unless you repeat design entry using a different cell library. ASIC designers do not like this approach since it exposes them to the mercy of a single ASIC vendor. Logic synthesizers offer a degree of independence from FPGA vendors (universally referred to as vendor independence, but this should, perhaps, be designer independence) by delaying the point in the design cycle at which designers need to make a decision on which FPGA to use. Of course, now designers become dependent on the synthesis software company.
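The gap between writing A = B + C and drawing a 32-bit adder at the gate level can be illustrated in Python. This is a sketch, not a synthesis algorithm: the gate-level version is a ripple-carry adder, which is just one of several possible adder structures, and all function names are illustrative.

```python
def full_adder(a, b, carry_in):
    """One-bit full adder written only with AND, OR, and XOR operations."""
    partial = a ^ b
    return partial ^ carry_in, (a & b) | (partial & carry_in)


def ripple_carry_add(b, c, width=32):
    """Gate-level view: chain `width` full adders, with the carry rippling up from bit 0."""
    result, carry = 0, 0
    for i in range(width):
        sum_bit, carry = full_adder((b >> i) & 1, (c >> i) & 1, carry)
        result |= sum_bit << i
    return result


def behavioral_add(b, c, width=32):
    """HDL-style view: the single expression A = B + C, truncated to `width` bits."""
    return (b + c) & ((1 << width) - 1)


# Both descriptions compute the same function; only the level of detail differs.
for b, c in [(0, 0), (1, 1), (123456789, 987654321), (0xFFFFFFFF, 1)]:
    assert ripple_carry_add(b, c) == behavioral_add(b, c)
print(ripple_carry_add(123456789, 987654321))
```

The behavioral description is one line; the gate-level description needs 32 copies of the full adder, and a schematic would need every one of them drawn and wired. Logic synthesis generates the second form from the first.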
8.2.1 FPGA Synthesis

For low-level logic synthesis, PALASM is a de facto standard as the lowest-common-denominator interchange format. Most FPGA design systems are capable of converting their own native formats into a PALASM file. The most common programmable logic design systems are ABEL from Data I/O, CUPL from P-CAD, LOG/iC from IsData, PALASM2 from AMD, and PGA-Designer from Minc. At a higher level, CAD companies (Cadence, Compass, Mentor, and Synopsys are examples) support most FPGA cell libraries. This allows you to map from a VHDL or Verilog description to an EDIF netlist that is compatible with FPGA design software. Sometimes you have to buy the cell library from the software company, sometimes from the FPGA vendor.

TABLE 8.3 The VHDL code for the sequence detector of Table 8.2.
    entity detector is
      port (X, CLK: in BIT; Z : out BIT);
    end;

    architecture behave of detector is
      type STATES is (S1, S2, S3, S4);
      -- "next" is a reserved word in VHDL, so the signal is named next_state.
      signal current, next_state: STATES;
    begin
      -- Next-state and output logic, following the flow table of Table 8.2.
      combinational: process (current, X) begin
        case current is
          when S1 =>
            if X = '1' then Z <= '0'; next_state <= S2;
            else Z <= '0'; next_state <= S1;
            end if;
          when S2 =>
            if X = '1' then Z <= '0'; next_state <= S3;
            else Z <= '0'; next_state <= S1;
            end if;
          when S3 =>
            if X = '1' then Z <= '0'; next_state <= S4;
            else Z <= '0'; next_state <= S1;
            end if;
          when S4 =>
            if X = '1' then Z <= '1'; next_state <= S4;
            else Z <= '0'; next_state <= S1;
            end if;
        end case;
      end process;
      -- State register, updated on the rising edge of CLK.
      sequential: process begin
        wait until CLK'event and CLK = '1';
        current <= next_state;
      end process;
    end behave;

As an example, Table 8.3 shows a VHDL model for a pattern detector to check for a sequence of three '1's (excluding the code for the I/O pads). Table 8.4 shows a script or command file that runs the Synopsys software to generate an EDIF netlist from this VHDL that targets the TI version of the Actel FPGA parts. A script is a recipe that tells the software what to do. If we wanted to retarget this design to another type of FPGA or an MGA or CBIC ASIC, for example, we may only need a new set of cell libraries and to change the script (if we are lucky). In practice, we shall probably find we need to make a few changes in the VHDL code (in the areas of I/O pads, for example, that are different for each kind of ASIC). We now have a portable design and a measure of vendor independence. We have also introduced some dependence on the Synopsys software since the code in Table 8.3 might be portable, but the script (which is just as important a part of the design) in Table 8.4 may only be used with the Synopsys software. Nevertheless, using logic synthesis results in a more portable design than using schematic entry.

TABLE 8.4 The Synopsys script for the VHDL code of Table 8.3.

    /* design checking */
    search_path = .
    /* use the TI cell libraries */
    link_library = tpc10.db
    target_library = tpc10.db
    symbol_library = tpc10.sdb
    read -f vhdl detector.vhd
    current_design = detector
    write -n -f db -hierarchy 0 detector.db
    check_design > detector.rpt
    report_design > detector.rpt
    /* optimize for area */
    max_area 0.0
    compile
    write -h -f db -o detector_opt.db
    report -area -cell -timing > detector.rpt
    free -all
    /* write EDIF netlist */
    write -h -f edif -0
    exit
8.3 The Halfgate ASIC

This section illustrates FPGA design using a very simple ASIC: a single inverter. The hidden details of the design and construction of this halfgate FPGA are quite complicated. Fortunately, most of the inner workings of the design software are normally hidden from the designer. However, when software breaks, as it sometimes does, it is important to know how things work in order to fix the problem. The formats, filenames, and flow will change, but the information needed at each stage and the order in which it is conveyed will stay much the same.
8.3.1 Xilinx
Table 8.5 shows an FPGA design flow using Compass and Xilinx software. On the left of Table 8.5 is a script for the Compass programs; scripts for Cadence, Mentor, and Synopsys software are similar, but not all design software has the capability to be run on autopilot using scripts and a command language. The diagrams in Table 8.5 illustrate what is happening at each of the design steps. The following numbered comments, corresponding to the labels in Table 8.5, highlight the important steps:

TABLE 8.5 Design flow for the Xilinx implementation of the halfgate ASIC. (Script column; the design-flow diagrams are not reproduced here.)

    # halfgate.xilinx.inp
    shell setdef
    path working xc4000d xblox cmosch000x
    quit
    asic
    open [v]halfgate
    synthesize
    save [nls]halfgate_p
    quit
    fpga
    set tag xc4000
    set opt area
    optimize [nls]halfgate_p
    quit
    qtv
    open [nls]halfgate_p
    trace critical
    print trace [txt]halfgate_p
    quit
    shell vuterm
    exec xnfmerge -p 4003PC84 halfgate_p > /dev/null
    exec xnfprep halfgate_p > /dev/null
    exec ppr halfgate_p > /dev/null
    exec makebits -w halfgate_p > /dev/null
    exec lca2xnf -g -v halfgate_p halfgate_b > /dev/null
    quit
    manager notice
    utility netlist
    open [xnf]halfgate_b
    save [nls]halfgate_b
    save [edf]halfgate_b
    quit
    qtv
    open [nls]halfgate_b
    trace critical
    print trace [txt]halfgate_b
    quit
TABLE 8.6 The Xilinx files for the halfgate ASIC. Verilog file (halfgate.v)
Preroute XNF file (halfgate_p.xnf)
LCA file (halfgate_p.lca)
Postroute XNF file (halfgate_b.xnf)
1. The Verilog code, in halfgate.v, describes a single inverter.
2. The script runs the logic synthesizer that converts the Verilog description to an inverter (using elements from the Xilinx XC4000 library) and saves the result in a netlist, halfgate_p.nls (a Compass internal format).
3. The script next runs the logic optimizer for FPGAs. This program also adds the I/O pads. In this case, logic optimization implements the inverter by using an inverting output pad. The software writes out the netlist as halfgate_p.xnf.
4. A timing simulation is run on the netlist halfgate_p.nls (the Compass format netlist). This netlist uses the default delays: every gate has a delay of 1 ns.
5. At this point the script has run all of the Xilinx programs required to complete the place-and-route step. The Xilinx programs have created several files, the most important of which is halfgate_p.lca, which describes the FPGA layout. This postroute netlist is converted to halfgate_b.nls (the added suffix 'b' stands for back-annotation). Next a timing simulation is performed on the postroute netlist, which now includes delays, to find the delay from the input (myInput) to the output (myOutput). This is the critical (and only) path. The simulation (not shown) reveals that the delay is 2.8 ns (for the input buffer) plus 11.6 ns (for the output buffer), for a total delay of 14.4 ns (this is for a XC4003 in a PC84 package, and default speed grade '4').

Table 8.6 shows the key Xilinx files that are created. The preroute file, halfgate_p.xnf, describes the IBUF and OBUF library cells but does not contain any delays. The LCA file, halfgate_p.lca, contains all the physical design information, including the locations of the pads and I/O cells on the FPGA (PAD61 for myInput and PAD1 for myOutput), as well as the details of the programmable connections between these I/O cells. The postroute file, halfgate_b.xnf, is similar to the preroute version except that now the delays are included. Xilinx assigns delays to a pin (connector or terminal of a cell). In this case 2.8 ns is assigned to the output of the input buffer, 8.6 ns is assigned to the input of the output buffer, and finally 3.0 ns is assigned to the output of the output buffer.
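The pin delays in the postroute file can be reconciled with the simulated path delay by simple addition. The short Python sketch below uses the three delays quoted in the text, held in picoseconds so that the sums are exact integers; the variable names and labels are illustrative.

```python
# Pin delays from halfgate_b.xnf as quoted in the text, in picoseconds.
PIN_DELAYS_PS = [
    ("output of the input buffer", 2800),
    ("input of the output buffer", 8600),
    ("output of the output buffer", 3000),
]

input_buffer_ps = PIN_DELAYS_PS[0][1]
output_buffer_ps = PIN_DELAYS_PS[1][1] + PIN_DELAYS_PS[2][1]
total_ps = sum(delay for _, delay in PIN_DELAYS_PS)

print(f"input buffer:  {input_buffer_ps / 1000} ns")
print(f"output buffer: {output_buffer_ps / 1000} ns")
print(f"myInput to myOutput: {total_ps / 1000} ns")
```

The two pin delays on the output buffer add up to the 11.6 ns reported for that buffer, and all three together give the 14.4 ns critical-path delay.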
8.3.2 Actel
The key Actel files for the halfgate design are the netlist file, halfgate_io.adl, and the STF delay file for back-annotation, halfgate_io.stf. Both of these files are shown in Table 8.7 (the STF file is large and only the last few lines, which contain the delay information, are shown in the table).

TABLE 8.7 The Actel files for the halfgate ASIC.

ADL file

    ; HEADER
    ; FILEID ADL ./halfgate_io.adl 85e8053b
    ; CHECKSUM 85e8053b
    ; PROGRAM certify
    ; VERSION 23/1
    ; ALSMAJORREV 2
    ; ALSMINORREV 3
    ; ALSPATCHREV .1
    ; NODEID 72705192
    ; VAR FAMILY 1400
    ; ENDHEADER
    DEF halfgate_io; myInput, myOutput.
    USE ADLIB:INBUF; INBUF_2.
    USE ADLIB:OUTBUF; OUTBUF_3.
    USE ADLIB:INV; u2.
    NET DEF_NET_8; u2:A, INBUF_2:Y.
    NET DEF_NET_9; myInput, INBUF_2:PAD.
    NET DEF_NET_11; OUTBUF_3:D, u2:Y.
    NET DEF_NET_12; myOutput, OUTBUF_3:PAD.
    END.

STF file

    ; HEADER
    ; FILEID STF ./halfgate_io.stf c96ef4d8
    ... lines omitted ... (126 lines total)
    DEF halfgate_io.
    USE ; INBUF_2/U0;
    TPADH:'11:26:37',
    TPADL:'13:30:41',
    TPADE:'12:29:41',
    TPADD:'20:48:70',
    TYH:'8:20:27',
    TYL:'12:28:39'.
    PIN u2:A;
    RDEL:'13:31:42',
    FDEL:'11:26:37'.
    USE ; OUTBUF_3/U0;
    TPADH:'11:26:37',
    TPADL:'13:30:41',
    TPADE:'12:29:41',
    TPADD:'20:48:70',
    TYH:'8:20:27',
    TYL:'12:28:39'.
    PIN OUTBUF_3/U0:D;
    RDEL:'14:32:45',
    FDEL:'11:26:37'.
    END.
8.3.3 Altera
Because Altera complex PLDs use a deterministic routing structure, they can be designed more easily using a self-contained software package: an all-in-one software package using a single interface. We shall assume that we can generate a netlist that the Altera software can accept using Cadence, Mentor, or Compass software with an Altera design kit (the most convenient format is EDIF). Table 8.8 shows the EDIF preroute netlist in a format that the Altera software can accept. This netlist file describes a single inverter (the line 'cellRef not'). The majority of the EDIF code in Table 8.8 is a standard template to pass information about how the VDD and VSS nodes are named, which libraries are used, the name of the design, and so on. We shall cover EDIF in Chapter 9.

TABLE 8.8 EDIF netlist in Altera format for the halfgate ASIC.
Table 8.9 shows a small part of the reports generated by the Altera software after completion of the placeand-route step. This report tells us how the software has used the basic logic cells, interconnect, and I/O cells
to implement our design. With practice it is possible to read the information from reports such as Table 8.9 directly, but it is a little easier if we also look at the netlist. The EDIF version of postroute netlist for this example is large. Fortunately, the Altera software can also generate a Verilog version of the postroute netlist. Here is the generated Verilog postroute netlist, halfgate_p.vo (not '.v' ), for the halfgate design: TABLE 8.9 Report for the halfgate ASIC fitted to an Altera MAX 7000 complex PLD. ** INPUTS ** Shareable Expanders Fan-In Fan-Out Pin LC LAB Primitive Code Total Shared n/a INP FBK OUT FBK Name 43 - - INPUT 0 0 0 0 0 0 1 myInput ** OUTPUTS ** Shareable Expanders Fan-In Fan-Out Pin LC LAB Primitive Code Total Shared n/a INP FBK OUT FBK Name 41 17 B OUTPUT t 0 0 0 1 0 0 0 myOutput ** LOGIC CELL INTERCONNECTIONS ** Logic Array Block 'B': +- LC17 myOutput | LC | | A B | Name Pin 43 -> * | - * | myInput * = The logic cell or pin is an input to the logic cell (or LAB) through the PIA. - = The logic cell or pin is not an input to the logic cell (or LAB). // halfgate_p (EPM7032LC44) MAX+plus II Version 5.1 RC6 10/03/94 // Wed Jul 17 04:07:10 1996 `timescale 100 ps / 100 ps module TRI_halfgate_p( IN, OE, OUT ); input IN; input OE; output OUT; bufif1 ( OUT, IN, OE ); specify specparam TTRI = 40; specparam TTXZ = 60; specparam TTZX = 60; (IN => OUT) = (TTRI,TTRI); (OE => OUT) = (0,0, TTXZ, TTZX, TTXZ, TTZX); endspecify endmodule module halfgate_p (myInput, myOutput);
input myInput; output myOutput; supply0 gnd; supply1 vcc; wire B1_i1, myInput, myOutput, N_8, N_10, N_11, N_12, N_14; TRI_halfgate_p tri_2 ( .OUT(myOutput), .IN(N_8), .OE(vcc) ); TRANSPORT transport_3 ( N_8, N_8_A ); defparam transport_3.DELAY = 10; and delay_3 ( N_8_A, B1_i1 ); xor xor2_4 ( B1_i1, N_10, N_14 ); or or1_5 ( N_10, N_11 ); TRANSPORT transport_6 ( N_11, N_11_A ); defparam transport_6.DELAY = 60; and and1_6 ( N_11_A, N_12 ); TRANSPORT transport_7 ( N_12, N_12_A ); defparam transport_7.DELAY = 40; not not_7 ( N_12_A, myInput ); TRANSPORT transport_8 ( N_14, N_14_A ); defparam transport_8.DELAY = 60; and and1_8 ( N_14_A, gnd ); endmodule The Verilog model for our ASIC, halfgate_p , is written in terms of other models: and , xor , or , not , TRI_halfgate_p , TRANSPORT . The first four of these are primitive models for basic logic cells and are built into the Verilog simulator. The model for TRI_halfgate_p is generated together with the rest of the code. We also need the following model for TRANSPORT, which contains the delay information for the Altera MAX complex PLD. This code is part of a file ( alt_max2.vo ) that is generated automatically. // MAX+plus II Version 5.1 RC6 10/03/94 Wed Jul 17 04:07:10 1996 `timescale 100 ps / 100 ps module TRANSPORT( OUT, IN ); input IN; output OUT; reg OUTR; wire OUT = OUTR; parameter DELAY = 0; ìfdef ZeroDelaySim always @IN OUTR <= IN; èlse always @IN OUTR <= #DELAY IN; èndif ìfdef Silos initial #0 OUTR = IN; èndif endmodule The Altera software can also write the following VHDL postroute netlist: -- halfgate_p (EPM7032LC44) MAX+plus II Version 5.1 RC6 10/03/94 -- Wed Jul 17 04:07:10 1996 LIBRARY IEEE; USE IEEE.std_logic_1164.all; ENTITY n_tri_halfgate_p IS GENERIC (ttri: TIME := 1 ns; ttxz: TIME := 1 ns; ttzx: TIME := 1 ns);
  PORT (in0 : IN X01Z; oe : IN X01Z; out0: OUT X01Z);
END n_tri_halfgate_p;
ARCHITECTURE behavior OF n_tri_halfgate_p IS
BEGIN
  PROCESS (in0, oe) BEGIN
    IF oe'EVENT THEN
      IF oe = '0' THEN out0 <= TRANSPORT 'Z' AFTER ttxz;
      ELSIF oe = '1' THEN out0 <= TRANSPORT in0 AFTER ttzx;
      END IF;
    ELSIF oe = '1' THEN out0 <= TRANSPORT in0 AFTER ttri;
    END IF;
  END PROCESS;
END behavior;
LIBRARY IEEE; USE IEEE.std_logic_1164.all; USE work.n_tri_halfgate_p;
ENTITY n_halfgate_p IS
  PORT ( myInput : IN X01Z; myOutput : OUT X01Z);
END n_halfgate_p;
ARCHITECTURE EPM7032LC44 OF n_halfgate_p IS
  SIGNAL gnd : X01Z := '0'; SIGNAL vcc : X01Z := '1';
  SIGNAL n_8, B1_i1, n_10, n_11, n_12, n_14 : X01Z;
  COMPONENT n_tri_halfgate_p
    GENERIC (ttri, ttxz, ttzx: TIME);
    PORT (in0, oe : IN X01Z; out0 : OUT X01Z);
  END COMPONENT;
BEGIN
  PROCESS(myInput) BEGIN
    ASSERT myInput /= 'X' OR Now = 0 ns
      REPORT "Unknown value on myInput" SEVERITY Warning;
  END PROCESS;
  n_tri_2: n_tri_halfgate_p
    GENERIC MAP (ttri => 4 ns, ttxz => 6 ns, ttzx => 6 ns)
    PORT MAP (in0 => n_8, oe => vcc, out0 => myOutput);
  n_delay_3: n_8 <= TRANSPORT B1_i1 AFTER 1 ns;
  n_xor_4: B1_i1 <= n_10 XOR n_14;
  n_or_5: n_10 <= n_11;
  n_and_6: n_11 <= TRANSPORT n_12 AFTER 6 ns;
  n_not_7: n_12 <= TRANSPORT NOT myInput AFTER 4 ns;
  n_and_8: n_14 <= TRANSPORT gnd AFTER 6 ns;
END EPM7032LC44;

LIBRARY IEEE; USE IEEE.std_logic_1164.all; USE work.n_halfgate_p;
ENTITY halfgate_p IS
  PORT ( myInput : IN std_logic; myOutput : OUT std_logic);
END halfgate_p;
ARCHITECTURE EPM7032LC44 OF halfgate_p IS
  COMPONENT n_halfgate_p
    PORT (myInput : IN X01Z; myOutput : OUT X01Z);
  END COMPONENT;
BEGIN
  n_0: n_halfgate_p PORT MAP ( myInput => TO_X01Z(myInput), myOutput => myOutput);
END EPM7032LC44;

The VHDL is a little harder to decipher than the Verilog, so the schematic for the VHDL postroute netlist is shown in Figure 8.2. This VHDL netlist is identical in function to the Verilog netlist, but the net names and component names are different. Compare Figure 8.2 with Figure 5.15(c) in Section 5.4, Altera MAX, which shows the Altera basic logic cell, and with Figure 6.23 in Section 6.8, Other I/O Cells, which describes the Altera I/O cell. The software has fixed the inputs to the various elements in the Altera MAX device to implement a single inverter.
FIGURE 8.2 The VHDL version of the postroute Altera MAX 7000 schematic for the halfgate ASIC. Compare this with Figure 5.15(c) and Figure 6.23.
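As a rough cross-check of the two netlists, the delays along the single signal path from myInput to myOutput can be added by hand. The following is a minimal illustrative sketch in Python, not part of the Altera software; the dictionary and variable names are assumptions, and the delay values are copied from the defparam and specparam statements of the Verilog netlist, whose time unit is 100 ps.

```python
# Delay values copied from the Verilog postroute netlist, in units of the
# `timescale (100 ps). The dictionary and variable names are illustrative.
TIME_UNIT_PS = 100

path_delays = {
    "transport_7 (output of not_7)": 40,    # inverter driven by myInput
    "transport_6 (output of and1_6)": 60,   # AND element
    "transport_3 (output of delay_3)": 10,  # delay element before the buffer
    "tri_2 TTRI (IN => OUT)": 40,           # three-state output buffer
}

# The or and xor elements carry no delay in the netlist, and transport_8
# drives the constant (gnd) input of the xor, so it is not on the signal path.
total_units = sum(path_delays.values())
total_ns = total_units * TIME_UNIT_PS / 1000.0
print(f"myInput -> myOutput: {total_units} units = {total_ns} ns")
```

Under these assumptions the sum agrees with the VHDL netlist, where the same path is 4 ns + 6 ns + 1 ns + 4 ns.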
8.3.4 Comparison
The halfgate ASIC design illustrates the differences between a nondeterministic coarse-grained FPGA (Xilinx XC4000), a nondeterministic fine-grained FPGA (Actel ACT 3), and a deterministic complex PLD (Altera MAX 7000). These differences, summarized as follows, were apparent even in the halfgate design:

1. The Xilinx LCA architecture does not permit an accurate timing analysis until after place and route. This is because of the coarse-grained nondeterministic architecture.
2. The Actel ACT architecture is nondeterministic, but the fine-grained structure allows fairly accurate preroute timing prediction.
3. The Altera MAX complex PLD requires logic to be fitted to the product steering and programmable array logic. The Altera MAX 7000 has an almost deterministic architecture, which allows accurate preroute timing.
8.4 Summary
The important concepts covered in this chapter are:
- FPGA design flow: design entry, simulation, physical design, and programming
- Schematic entry, hardware design languages, logic synthesis
- PALASM as a common low-level hardware description
- EDIF, Verilog, and VHDL as vendor-independent netlist standards
LOW-LEVEL DESIGN ENTRY

The purpose of design entry is to describe a microelectronic system to a set of electronic-design automation (EDA) tools. Electronic systems used to be, and many still are, constructed from off-the-shelf components, such as TTL ICs. Design entry for these systems now usually consists of drawing a picture, a schematic. The schematic shows how all the components are connected together, the connectivity of an ASIC. This type of design-entry process is called schematic entry, or schematic capture. A circuit schematic describes an ASIC in the same way an architect's plan describes a building.

The circuit schematic is a picture, an easy format for us to understand and use, but computers need to work with an ASCII or binary version of the schematic that we call a netlist. The output of a schematic-entry tool is thus a netlist file that contains a description of all the components in a design and their interconnections.

Not all the design information may be conveyed in a circuit schematic or netlist, because not all of the functions of an ASIC are described by the connectivity information. For example, suppose we use a programmable ASIC for some random logic functions. Part of the ASIC might be designed using a text language. In this case design entry also includes writing the code. What if an ASIC in our system contains a programmable memory (PROM)? Is the PROM microcode, the '1's and '0's, part of design entry? The operation of our system is certainly dependent on the correct programming of the PROM. So perhaps the PROM code ought to be considered part of design entry. On the other hand nobody would consider the operating-system code that is loaded into a RAM on an ASIC to be a part of design entry. Obviously, then, there are several different forms of design entry. In each case it is important to make sure that you have completely specified the system, not only so that it can be correctly constructed, but so that someone else can understand how the system is put together. Design entry is thus an important part of documentation.

Until recently most ASIC design entry used schematic entry. As ASICs have become more complex, other design-entry methods are becoming common. Alternative design-entry methods can use graphical methods, such as a schematic, or text files, such as a programming language. Using a hardware description language (HDL) for design entry allows us to generate netlists directly using logic synthesis. We will concentrate on low-level design-entry methods together with their advantages and disadvantages in this chapter.

9.1 Schematic Entry
9.2 Low-Level Design Languages
9.3 PLA Tools
9.4 EDIF
9.5 CFI Design Representation
9.6 Summary
9.7 Problems
9.8 Bibliography
9.9 References
9.1 Schematic Entry

Schematic entry is the most common method of design entry for ASICs and is likely to be useful in one form or another for some time. HDLs are replacing conventional gate-level schematic entry, but new graphical tools based on schematic entry are now being used to create large amounts of HDL code. Circuit schematics are drawn on schematic sheets. Standard schematic sheet sizes (Table 9.1) are ANSI A-E (more common in the United States) and ISO A4-A0 (more common in Europe). Usually a frame or border is drawn around the schematic containing boxes that list the name and number of the schematic page, the designer, the date of the drawing, and a list of any modifications or changes.

TABLE 9.1 ANSI (American National Standards Institute) and ISO (International Standards Organization) schematic sheet sizes.

ANSI sheet   Size (inches)   ISO sheet   Size (cm)
A            8.5 × 11        A5          21.0 × 14.8
B            11 × 17         A4          29.7 × 21.0
C            17 × 22         A3          42.0 × 29.7
D            22 × 34         A2          59.4 × 42.0
E            34 × 44         A1          84.0 × 59.4
                             A0          118.9 × 84.0
Figure 9.1 shows the spades and shovels, the recognized symbols for AND, NAND, OR, and NOR gates. One of the problems with these recommendations is that
the corner points of the shapes do not always lie on a grid point (using a reasonable grid size).
FIGURE 9.1 IEEE-recommended dimensions and their construction for logic-gate symbols. (a) NAND gate (b) exclusive-OR gate (an OR gate is a subset).

Figure 9.2 shows some pictorial definitions of objects you can use in a simple schematic. We shall discuss the different types of objects that might appear in an ASIC schematic first and then discuss the different types of connections.
FIGURE 9.2 Terms used in circuit schematics.

Schematic-entry tools for ASIC design are similar to those for printed-circuit board (PCB) design. The basic object on a PCB schematic is a component or device (a TTL IC or resistor, for example). There may be several hundred components on a typical PCB. If we think of a logic gate on an ASIC as being equivalent to a component on a PCB, then a large ASIC contains hundreds of thousands of components. We can normally draw every component on a few schematic sheets for a PCB, but drawing every component on an ASIC schematic is impractical.
9.1.1 Hierarchical Design

Hierarchy reduces the size and complexity of a schematic. Suppose a building has 10 floors and contains several hundred offices but only three different basic office plans. Furthermore, suppose each of the floors above the ground floor that contains the lobby is identical. Then the plans for the whole building need only show detailed plans for the ground floor and one of the upper floors. The plans for the upper floor need only show the locations of each office and the office type. We can then use a separate set of three detailed plans for each of the different office types. All these different plans together form a nested structure that is a hierarchical design . The plan for the whole building is the top-level plan. The plans for the individual offices are the lowest level. To clarify the relationship between different levels of hierarchy we say that a subschematic (an office) is a child of the parent schematic (the floor containing offices). An electrical schematic can contain subschematics. The subschematic, in turn, may contain other subschematics. Figure 9.3 illustrates the principles of schematic hierarchical design.
FIGURE 9.3 Schematic example showing hierarchical design. (a) The schematic of a half-adder, the subschematic of cell HADD. (b) A schematic symbol for the half adder. (c) A schematic that uses the half-adder cell. (d) The hierarchy of cell HADD.

The alternative to hierarchical design is to draw all of the ASIC components on one giant schematic, with no hierarchy, in a flat design. For a modern ASIC containing thousands or more logic gates, using a flat design or a flat schematic would be hopelessly impractical. Sometimes we do use flat netlists though.
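The saving that hierarchy brings can be made concrete with the office-building analogy. The sketch below is a minimal illustration in Python; the cell names (OfficeA, OfficeB, OfficeC, GroundFloor) and the number of offices per floor are assumptions chosen only for the example, since the text says only that there are three office plans and nine identical upper floors.

```python
# Each cell maps to a list of (instance_name, child_cell_name) pairs.
# Primitive cells have no children. Office counts per floor are assumed.
cells = {
    "OfficeA": [],
    "OfficeB": [],
    "OfficeC": [],
    "Floor": (
        [(f"a{i}", "OfficeA") for i in range(4)]
        + [(f"b{i}", "OfficeB") for i in range(16)]
        + [(f"c{i}", "OfficeC") for i in range(10)]
    ),
    "GroundFloor": [("lobbyOffice", "OfficeB")],
    "Building": [("Ground", "GroundFloor")]
    + [(f"Floor{n}", "Floor") for n in range(2, 11)],
}

def flatten(cell, prefix=""):
    """Return hierarchical names of every primitive instance under a cell."""
    names = []
    for inst, child in cells[cell]:
        path = f"{prefix}{inst}"
        if cells[child]:                      # child has its own subschematic
            names.extend(flatten(child, path + "."))
        else:                                 # child is a primitive cell
            names.append(path)
    return names

flat = flatten("Building")
drawn = sum(len(children) for children in cells.values())
print(len(flat), "primitive instances in the flat design")
print(drawn, "instances drawn across all hierarchical plans")
```

With these assumed counts the flat plan needs every office drawn individually, while the hierarchical plans draw the upper floor only once.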
9.1.2 The Cell Library

Components in an ASIC schematic are chosen from a library of cells. Library elements for all types of ASICs are sometimes also known as modules . Unfortunately the term module will have a very specific meaning when we come to discuss hardware description languages. To avoid any chance of confusion I use the term cell to mean either a cell, a module, a macro, or a book from an ASIC library. Library cells are equivalent to the offices in our office building.
Most ASIC companies provide a schematic library of primitive gates to be used for schematic entry. The first problem with ASIC schematic libraries is that there are no naming conventions. For example, a primitive two-input NAND gate in a Xilinx FPGA library does not have the same name as the two-input NAND gate in an LSI Logic gate-array library. This means that you cannot take a schematic that you used to create a prototype product using a Xilinx FPGA and use that schematic to create an LSI Logic gate array for production (something you might very likely want to do). As soon as you start entering a schematic using a library from an ASIC vendor, you are, to some extent, making a commitment to use that vendor's ASIC. Most ASIC designers are much happier maintaining a large degree of vendor independence.

A second problem with ASIC schematic libraries is that there are no standards for cell behavior. For example, a two-input MUX in an Actel library operates so that the input labeled A is selected when the MUX select input S = '0'. A two-input MUX in a VLSI Technology library operates in the reverse fashion, so that the input labeled B is selected when S = '0'. These types of differences can cause hard-to-find problems when trying to convert a schematic from one vendor to another by hand. These problems make changing or retargeting schematics from one vendor to another difficult. This process is sometimes known as porting a design.

Library cells that represent basic logic gates, such as a NAND gate, are known as primitive cells, usually referred to just as cells. In a hierarchical ASIC design a cell may be a NAND gate, a flip-flop, a multiplier, or even a microprocessor, for example. To use the office building analogy again, each of the three basic office types is a primitive cell. However, the plan for the second floor is also a cell. The second-floor cell is a subschematic of the schematic for the whole building.
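The MUX example above shows why porting by hand is error-prone. The following Python sketch models only the select behavior described in the text; the function names are hypothetical and do not correspond to real library cell names.

```python
def mux_select_a_on_zero(a, b, s):
    """Convention described for the Actel library: A is selected when S = 0."""
    return a if s == 0 else b

def mux_select_b_on_zero(a, b, s):
    """Convention described for the VLSI Technology library: B is selected when S = 0."""
    return b if s == 0 else a

def ported_mux(a, b, s):
    """Reproduce the first convention using the second cell by swapping A and B."""
    return mux_select_b_on_zero(b, a, s)

# Exhaustively compare the two cells, and the ported version, on all inputs.
mismatches = 0
for a in (0, 1):
    for b in (0, 1):
        for s in (0, 1):
            if mux_select_a_on_zero(a, b, s) != mux_select_b_on_zero(a, b, s):
                mismatches += 1
            assert ported_mux(a, b, s) == mux_select_a_on_zero(a, b, s)
print(mismatches, "of 8 input combinations differ if the cell is swapped blindly")
```

A blind substitution is wrong only when A and B differ, which is the kind of fault that can survive a casual check.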
Now we see why the commonly accepted use of the term cell in schematic entry can be so confusing. The term cell is used to represent both primitive cells and subschematics. These are two different, but closely related, things. There are two types of macros for MGAs and programmable ASICs. The most common type of macro is a hard macro that includes placement information. A hard macro can change in position and orientation, but the relative location of the transistors, other layout, and wiring inside the macro is fixed. A soft macro contains only connection information (between transistors for a gate array or between logic cells for a programmable ASIC). Thus the placement and wiring for a soft macro can
vary. This means that the timing parameters for a soft macro can only be determined after you complete the place-and-route step. For this reason the basic library elements for MGAs and programmable ASICs, such as NAND gates, flip-flops, and so on, are hard macros. A standard cell contains layout information on all mask levels. An MGA hard macro contains layout information on just the metal, contact, and via layers. An MGA soft macro or programmable ASIC macro does not contain any layout information at all, just the details of connections to be made inside the macro. We can stretch the office building analogy to explain the difference between hard and soft macros. A hard macro would be an office with fixed walls in which you are not allowed to move the furniture. A soft macro would be an office with partitions in which you can move the furniture around and you can also change the shape of your office by moving the partitions.
9.1.3 Names
Each of the cells, primitive or not, that you place on an ASIC schematic has a cell name . Each use of a cell is a different instance of that cell, and we give each instance a unique instance name . A cell instance is somewhere between a copy and a reference to a cell in a library. An analogy would be the pictures of hamburgers on the wall in a fast-food restaurant. The pictures are somewhere between a copy and a reference to a real hamburger. We represent each cell instance by a picture or icon , also known as a symbol . We can represent primitive cells, such as NAND and NOR gates, with familiar icons that look like spades and shovels. Some schematic editors offer the option of switching between these familiar icons and using the rectangular IEEE standard symbols for logic gates. Unfortunately the term icon is also often used to refer to any of the pictures on a schematic, including those that represent subschematics. There is no accepted way to differentiate between an icon that represents a primitive cell and one that represents a subschematic that may be in turn a collection of primitive cells. In fact, there is usually no easy way to tell by looking at a schematic which icons represent primitive cells and which represent subschematics. We will have three different icons for each of the three different primitive offices in
the imaginary office building example of Section 9.1.1 . We also will have icons to represent the ground floor and the plan for the other floors. We shall call the common plan for the second through tenth floors, Floor . Then we say that the second floor is an instance of the cell name Floor . The third through tenth floors are also instances of the cell name Floor . The same icon will be used to represent the second through tenth floors, but each will have a unique instance name. We shall give them instance names: FloorTwo , FloorThree , ... , FloorTen . We say that FloorTwo through FloorTen are unique instance names of the cell name Floor . At the risk of further confusion I should point out that, strictly speaking, the definition of a primitive cell depends on the type of library being used. Schematic-entry libraries for the ASIC designer stop at the level of NAND gates and other similar low-level logic gates. Then, as far as the ASIC designer is concerned, the primitive cells are these logic gates. However, from the view of the library designer there is another level of hierarchy below the level of logic gates. The library designer needs to work with libraries that contain schematics of the gates themselves, and so at this level the primitive cells are transistors. Let us look at the building analogy again to understand the subtleties of primitive cells. A building contractor need only concern himself with the plans for our office building down to the level of the offices. To the building contractor the primitive cells are the offices. Suppose that the first of the three different office types is a corner office, the second office type has a window, and a third office type is without a window. We shall call these office cells: CornerOffice , WindowOffice , and NoWindowOffice . These cells are primitive cells as far as the contractor is concerned. 
However, when discussing the plans with a client, the architect of our building will also need to see how each office is furnished. The architect needs to see a level of detail of each office that is more complicated than needed by the building contractor. The architect needs to see the cells that represent the tables, chairs, and desks that make up each type of office. To the architect the primitive cells are a library containing cells such as chair, table, and desk.
9.1.4 Schematic Icons and Symbols

Most schematic-entry programs allow the designer to draw special or custom icons. In addition, the schematic-entry tool will also usually create an icon automatically for a
subschematic that is used in a higher-level schematic. This is a derived icon , or derived symbol . The external connections of the subschematic are automatically attached to the icon, usually a rectangle. Figure 9.4 (c) shows what a derived icon for a cell, DLAT , might look like (we could also have drawn this by hand). The subschematic for DLAT is shown in Figure 9.4 (b). We say that the inverter with the instance name inv1 in the subschematic is a subcell (or submodule) of the cell DLAT . Alternatively we say that cell instance inv1 is a child of the cell DLAT , and cell DLAT is a parent of cell instance inv1 .
FIGURE 9.4 A cell and its subschematic. (a) A schematic library containing icons for the primitive cells. (b) A subschematic for a cell, DLAT, showing the instance names for the primitive cells. (c) A symbol for cell DLAT.

Figure 9.5(a) shows a more complex subschematic for a 4-bit latch. Each primitive cell instance in this schematic must have a unique name. This can get very tiresome for large circuits. Instead of creating complex, but repetitive, subschematics for complex cells we can use hierarchy.
FIGURE 9.5 A 4-bit latch: (a) drawn as a flat schematic from gate-level primitives, (b) drawn as four instances of the cell symbol DLAT, (c) drawn using a vectored instance of the DLAT cell symbol with cardinality of 4, (d) drawn using a new cell symbol with cell name FourBit.

Figure 9.5(b) shows a hierarchical subschematic for a cell FourBit, which in turn uses four instances of the cell DLAT. The four instances of DLAT in Figure 9.5(b) have different instance names: L1, L2, L3, and L4. Notice that we cannot use just one name for the four instances of DLAT to indicate that they are all the same cell. If we did, we could not differentiate between L1 and L2, for example.

The vertical row of instances in Figure 9.5(b) looks like a vector of elements. Figure 9.5(c) shows a vectored instance representing four copies of the DLAT cell. We say the cardinality of this instance is 4. Tools normally use bold lines or some other distinguishing feature to represent a vectored instance. The cardinality information is often shown as a vector. Thus L[1:4] represents four instances: L[1], L[2], L[3], L[4]. This is convenient because now we can see that all subcells are identical copies of L, but we have a unique name for each.
Finally, as shown in Figure 9.5 (d) we can create a new symbol for the 4-bit latch, FourBit . The symbol for FourBit has a 4-bit-wide input bus for the four D inputs, and a 4-bit wide output bus for the four Q outputs. The subschematic for FourBit could be either Figure 9.5 (a), (b), or (c) (though the exact naming of the inputs and outputs and their attachment to the buses may be different in each case). We need a convention to distinguish, for example, between the inverter subcells, inv1 , which are children of the cell DLAT , which are in turn children of the cell FourBit . Most schematic-entry tools do this by combining the instance names of the subcells in a hierarchical manner using a special character as a delimiter. For example, if we drew the subschematic as in Figure 9.5 (b), the four inverters in FourBit might be named L1.inv1 , L2.inv1 , L3.inv1 , and L4.inv1 . Once again this makes it clear that the inverters, inv1 , are identical in all four subcells. In our office building example, the offices are subcells of the cell Floor . Suppose you and I both have corner offices. Mine is on the second floor and yours is above mine on the third floor. My office is 211 and your office is 311. Another way to name our offices on a building plan might be FloorTwo.11 for my office and FloorThree.11 for your office. This shows that FloorTwo.11 is a subcell of FloorTwo and also makes it clear that, apart from being on different floors, your office and mine are identical. Both our offices have instance names 11 and are instances of cell name Corner .
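The two naming ideas in this section, vectored instances and delimiter-joined hierarchical names, are both simple string constructions. The Python sketch below shows one way they might be generated; the function names are hypothetical, and the '.' delimiter is the one used in the examples above (real tools differ).

```python
def expand_vector(name):
    """Expand a vectored instance such as 'L[1:4]' into its member names.

    Names without a range are returned unchanged in a one-element list.
    """
    if "[" not in name:
        return [name]
    base, rng = name.rstrip("]").split("[")
    lo, hi = (int(x) for x in rng.split(":"))
    step = 1 if hi >= lo else -1
    return [f"{base}[{i}]" for i in range(lo, hi + step, step)]

def hierarchical_names(parents, child, delimiter="."):
    """Join each parent instance name to a child instance name."""
    return [f"{p}{delimiter}{child}" for p in parents]

vector_members = expand_vector("L[1:4]")
inverters = hierarchical_names(["L1", "L2", "L3", "L4"], "inv1")
print(vector_members)
print(inverters)
```

The cardinality of the vectored instance is simply the length of the expanded list.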
9.1.5 Nets
The schematics shown in Figure 9.4 contain both local nets and external nets. An example of a local net in Figure 9.4(b) is n1, the connection between the output terminal of the AND cell and1 and the OR cell or1. When the four copies of this circuit are placed in the parent cell FourBit in Figure 9.5(d), four copies of net n1 are created. Since the four nets named n1 are not actually electrically connected, even though they have the same name at the lowest hierarchical level, we must somehow find a way to uniquely identify each net. The usual convention for naming nets in a hierarchical schematic uses the parent cell instance name as a prefix to the local net name. A special character (':' '/' '$' '#' for example) that is not allowed to appear in names is used as a delimiter to separate the net name from the cell instance name. Supposing that we drew the subschematic for cell FourBit as shown in Figure 9.5(b), the four different nets labeled n1 might then become:

FourBit.L1:n1
FourBit.L2:n1
FourBit.L3:n1
FourBit.L4:n1

This naming is usually done automatically by the schematic-entry tool. The schematic DLAT also contains three external nets: D, EN, and Q. The terminals on the symbol DLAT connect these nets to other nets in the hierarchical level above. For example, the signal Trigger:flag in Figure 9.4(c) is also Trigger.DLAT:Q. Each schematic tool handles this situation differently, and life becomes especially difficult when we need to refer to these nodes from a simulator outside the schematic tool, for example. HDLs such as VHDL and Verilog have a very precise and well-defined standard for naming nets in hierarchical structures.
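The net-naming convention described above can be sketched in a few lines of Python. The function name and its parameters are hypothetical; the '.' and ':' delimiters follow the FourBit example, and the check that a delimiter does not occur inside a name reflects the rule that the special character is not allowed to appear in names.

```python
def hierarchical_net_names(parent, instances, local_net,
                           inst_delim=".", net_delim=":"):
    """Prefix a local net name with the parent cell and each instance name."""
    for name in [parent, local_net, *instances]:
        if inst_delim in name or net_delim in name:
            raise ValueError(f"delimiter character found in name {name!r}")
    return [f"{parent}{inst_delim}{inst}{net_delim}{local_net}"
            for inst in instances]

n1_nets = hierarchical_net_names("FourBit", ["L1", "L2", "L3", "L4"], "n1")
print(n1_nets)

# Every copy of the local net n1 now has a distinct full name.
all_unique = len(set(n1_nets)) == len(n1_nets)
```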
9.1.6 Schematic Entry for ASICs and PCBs

A symbol on a schematic may represent a component, which may contain component parts. You are more likely to come across the use of components in a PCB schematic. A component is slightly different from an ASIC library cell. A simple example of a component would be a TTL gate, an SN74LS00N, that contains four 2-input NAND gates. We call an SN74LS00N a component and each of the individual NAND gates inside is a component part. Another common example of a component would be a resistor pack, a single package that contains several identical resistors. In PCB design language a component label or name is a reference designator. A reference designator is a unique name attribute, such as R99, attached to each component. A reference designator, such as R99, has two pieces: an alpha prefix R and a numerical suffix 99. To understand the difference between reference designators and instance names, we need to look at the special requirements of PCB design. PCBs usually contain packaged ASICs and other ICs that have pins that are soldered
to a board. For rectangular, dual-in-line (DIP) packages the pins are numbered counterclockwise from the upper-left corner looking down on the package. IC symbols have a pin number for each part in the package. For example, the TTL 74174 hex D flip-flop with clear contains six parts: six identical D flip-flops. The IC symbol representing this device has six PinNumber attribute entries for the D input corresponding to the six possible input pins. They are pins 3, 4, 6, 11, 13, and 14.

When we need a flip-flop in our design, we use a symbol for a 74174 from a schematic library; suppose the symbol name is dffClr. We shall assign a unique instance name to the symbol, CarryFF. Now suppose we need another, identical, flip-flop and we call this BitFF. We do not mind which of the six flip-flop parts in a 74174 we use for CarryFF and BitFF. In fact they do not even have to be in the same package. We shall delay the choice of assigning CarryFF and BitFF to specific packages until we get to the PCB routing step. So at this point on our schematic we do not even know the pin numbers for CarryFF and BitFF. For example, the D input to CarryFF could be pin 3, 4, 6, 11, 13, or 14.

The number of wire crossings on a PCB is minimized by careful assignment of components to packages and choice of parts within a package. So the placement-and-routing software may decide which part of which package to use for CarryFF and BitFF depending on which is easier to route. Then, only after the placement and routing is complete, are unique reference designators assigned to the component parts. Only at this point do we know where CarryFF is actually located on the PCB by referring to the reference designator, which points to a specific part in a specific package. Thus CarryFF might be located in IC4 on our PCB. At this point we also know which pins are used for each symbol. So we now know, for example, that the D input to CarryFF is pin 3 of IC4.
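The late binding of instance names to packages and pins can be illustrated with a small Python sketch. This is only a stand-in for what placement-and-routing software does: the function name is hypothetical, parts are handed out in simple first-come order rather than by ease of routing, and packages are numbered from IC1 as an assumption. The D-input pin numbers are those given in the text for the 74174.

```python
# D-input pins of the six flip-flop parts in a 74174, as listed in the text.
D_PINS = [3, 4, 6, 11, 13, 14]

def assign_parts(instance_names, parts_per_package=6):
    """Bind each flip-flop instance to a package and a part within it.

    A real tool would choose the part that is easiest to route; handing
    parts out in order is an assumption made to keep the sketch short.
    """
    assignment = {}
    for k, name in enumerate(instance_names):
        package, part = divmod(k, parts_per_package)
        assignment[name] = {
            "refdes": f"IC{package + 1}",  # reference designator of the package
            "d_pin": D_PINS[part],         # pin now known for the D input
        }
    return assignment

# Before assignment only the instance names exist; afterward each has a pin.
names = ["CarryFF", "BitFF"] + [f"FF{i}" for i in range(5)]
bound = assign_parts(names)
print(bound["CarryFF"], bound["BitFF"], bound["FF4"])
```

With seven flip-flops the last one spills into a second package, which shows why the pin number cannot be known on the schematic.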
There is no process in ASIC design directly equivalent to the process of part assignment described above and thus no need to use reference designators. The reference-designator naming convention quickly becomes unwieldy if there are a large number of components in a design. For example, how will we find a NAND gate named X3146 in an ASIC schematic with 100 pages? Instead, for ASICs, we use a naming scheme based on hierarchy. In large hierarchical ASIC designs it is difficult to provide a unique reference
designator to each element. For this reason ASIC designs use instance names to identify the individual components. Meaningful names can be assigned to low-level components and also the symbols that represent hierarchy. We derive the component names by joining all of the higher level cell names together. A special character is used as a delimiter and separates each level. Examples of hierarchical instance names are:

cpu.alu.adder.and01
MotherBoard:Cache:RAM4:ReadBit4:Inverter2
9.1.7 Connections
Cell instances have terminals that are the inputs and outputs of the cell. Terminals are also known as pins , connectors , or signals . The term pin is widely used, but we shall try to use terminal, and reserve the term pin for the metal leads on an ASIC package. The term pin is used in schematic entry and routing programs that are primarily intended for PCB design.
FIGURE 9.6 An example of the use of a bus to simplify a schematic. (a) An address decoder without using a bus. (b) A bus with bus rippers simplifies the schematic and reduces the possibility of making a mistake in creating and reading the schematic.
Electrical connections between cell instances use wire segments or nets. We can group closely related nets, such as the 32 bits of a 32-bit digital word, together into a bus or into buses (not busses). If signals on a bus are not closely related, we usually use the term bundle or array instead of bus. An example of a bundle might be a bus for a SCSI disk system, containing not only data bits but handshake and control signals too. Figure 9.6 shows an example of a bus in a schematic. If we need to access individual nets in a bus or a bundle, we use a breakout (also known as a ripper, an EDIF term, or extractor). For example, a breakout is used to access bits 0-7 of a 32-bit bus. If we need to rearrange bits on a bus, some schematic editors offer something called a swizzle. For example, we might use a swizzle to reorder the bits on an 8-bit bus so that the MSB becomes the LSB and so on down to the LSB, which now becomes the MSB. Swizzles can be useful. For example, we can multiply or divide a number by 2 by swizzling all the bits up or down one place on a bus.
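A breakout and a swizzle are both just selections or reorderings of the nets in a bus. The Python sketch below treats a bus as a list of bits with index 0 as the LSB; this representation and all function names are assumptions made for illustration.

```python
def to_bus(value, width):
    """Represent an integer as a list of bits, index 0 = LSB."""
    return [(value >> i) & 1 for i in range(width)]

def to_int(bus):
    """Convert a list of bits (index 0 = LSB) back to an integer."""
    return sum(bit << i for i, bit in enumerate(bus))

def breakout(bus, lo, hi):
    """Rip bits lo..hi (inclusive) out of a bus."""
    return bus[lo:hi + 1]

def swizzle_reverse(bus):
    """Reorder the bus so that the MSB becomes the LSB and vice versa."""
    return bus[::-1]

def swizzle_up(bus):
    """Move every bit up one place (multiply by 2); the old MSB is dropped."""
    return [0] + bus[:-1]

def swizzle_down(bus):
    """Move every bit down one place (divide by 2); the old LSB is dropped."""
    return bus[1:] + [0]

word = to_bus(22, 8)                              # 0b00010110
low_byte = breakout(to_bus(0x12345678, 32), 0, 7)  # bits 0-7 of a 32-bit bus
print(to_int(swizzle_up(word)), to_int(swizzle_down(word)))
print(to_int(swizzle_reverse(word)), hex(to_int(low_byte)))
```

Note that the shift swizzles discard one bit, so the multiply-by-2 result is only correct when the original MSB is 0.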
9.1.8 Vectored Instances and Buses

So far the naming conventions are fairly standard and easy to follow. However, when we start to use vectored instances and buses (as is now common in large ASICs), there are potential areas of difficulty and confusion. Figure 9.7(a) shows a schematic for a 16-bit latch that uses multiple copies of the cell FourBit. The buses are labeled with the appropriate bits. Figure 9.7(b) shows a new cell symbol for the 16-bit latch with 16-bit wide buses for the inputs, D, and outputs, Q.
FIGURE 9.7 A 16-bit latch: (a) drawn as four instances of cell FourBit; (b) drawn as a cell named SixteenBit; (c) drawn as four multiple instances of cell FourBit.

Figure 9.7(c) shows an alternative representation of the 16-bit latch using a vectored instance of FourBit with cardinality 4. Suppose we wish to make a connection to expressly one bit, D1 (we have used D1 as the first bit rather than the more conventional D0 so that numbering is easier to follow). We also wish to make a connection to bits D9-D12, represented as D[9:12]. We do this using a bus ripper. Now we have the rather awkward situation of bus naming shown in Figure 9.7(c). Problems arise when we have buses of buses because the numbers for the bus widths do not match on either side of a ripper. For this reason it is best to use the single-bus approach shown in Figure 9.7(b) rather than the vectored-bus approach of Figure 9.7(c).
9.1.9 Edit-in-Place
Figure 9.7 (b) shows a symbol SixteenBit , which uses the subschematic shown
in Figure 9.7(a) containing four copies of FourBit, named NB1, NB2, NB3, and NB4 (the NB stands for nibble, which is half of a word; a nibble is 4 bits for 8-bit words). Suppose we use the schematic-entry program to edit the subcell NB1.L1, which is an instance of DLAT inside NB1. Perhaps we wish to change the D latch to a D latch with a reset, for example. If the schematic editor supports edit-in-place, we can edit a cell instance directly. After we edit the cell, the program will update all the DLAT subcells in the cell that is currently loaded to reflect the changes that have been made.

To see how edit-in-place works, consider our office building again. Suppose we wish to change some of the offices on each floor from offices without windows to offices with windows. We select the cell instance FloorTwo, that is, an instance of cell Floor. Now we choose the edit mode in the schematic-entry program. But wait! Do we want to edit the cell Floor, or do we want to edit the cell instance FloorTwo? If we edit the cell Floor, we will be making changes to all of the floors that use cell name Floor (that is, instances FloorTwo through FloorTen). If we edit the cell instance FloorTwo, then the second floor will become different from all the other floors. It will no longer be an instance of cell name Floor and we will have to create another cell name for the cell used by instance FloorTwo. This is like the difference between ordering just one hamburger without pickles and changing the picture on the wall that will change all future hamburgers.

Using edit-in-place we can edit the cell Floor. Suppose we change some of the cell instances of cell name NoWindowOffice to instances of cell name WindowOffice. When we finish editing and save the cell Floor, we have effectively changed all of the floors that contain instances of this cell. Instead of editing a cell in place, you may really want to edit just one instance of a cell and leave any other instances unchanged.
In this case you must create a new cell with a new symbol and a new, unique cell name. It might also be wise to change the instance name of the new cell to avoid any confusion. For example, we might change the third-floor plan of our office to be different from the other upper floors. Suppose the third floor is now an instance of cell name FloorVIP instead of Floor. We could continue to call the third floor cell instance FloorThree, but it would be better to give the instance a different name, FloorSpecial for example, to make it clear that it is different from all the other floors.

Some tools have the ability to alias nets. Aliasing creates a net name from the highest level in the design. Local names are net names at the lowest level, such as D and Q in a flip-flop cell. These local names are automatically replaced by the appropriate top-level names, such as Clock1 or Data2, using a dictionary. This greatly speeds tracing of signals through a design containing many levels of hierarchy.
9.1.10 Attributes
You can attach a name, also known as an identifier or label, to a component, cell instance, net, terminal, or connector. You can also attach an attribute, or property, which describes some aspect of the component, cell instance, net, or connector. Each attribute has a name, and some attributes also have values. The most common problems in working with schematics and netlists, especially when you try to exchange schematic information between different tools, are problems in naming. Since cells and their contents have to be stored in a database, a cell name frequently corresponds to (or is mapped to) a filename. This then raises the problems of naming conventions, including: case sensitivity, name-collision resolution, dictionaries, handling of common special characters (such as embedded blanks or underscores), other special characters (such as characters in foreign alphabets), first-character restrictions, name-length problems (only 28 characters are permitted on an NFS compatible filename), and so on.
9.1.11 Netlist Screener

A surprising number of problems can be found by checking a schematic for obviously fatal errors. A program that analyzes a schematic netlist for simple errors is sometimes called a schematic screener or netlist screener. Errors that can be found by a netlist screener include:

• unconnected cell inputs,
• unconnected cell outputs,
• nets not driven by any cells,
• too many nets driven by one cell,
• nets driven by more than one cell.
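
The checks in this list can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary netlist format, the instance names (U1 to U4), the net names, and the function name screen are all hypothetical and do not come from any particular tool. The sketch covers four of the five errors; the fanout-style check (too many nets driven by one cell) needs loading attributes and is left out.

```python
from collections import defaultdict

# Hypothetical netlist: instance -> {terminal: (direction, net)}.
# A net of None means the terminal was left unconnected.
NETLIST = {
    "U1": {"A": ("input", "n1"), "B": ("input", None), "Y": ("output", "n2")},
    "U2": {"A": ("input", "n2"), "Y": ("output", "n3")},
    "U3": {"A": ("input", "n4"), "Y": ("output", "n3")},
    "U4": {"A": ("input", "n3"), "Y": ("output", None)},
}
PRIMARY_INPUTS = {"n1"}  # nets driven from outside this schematic


def screen(netlist, primary_inputs=frozenset()):
    """Return a sorted list of (error, location) pairs."""
    errors = []
    drivers = defaultdict(list)  # net -> output terminals driving it
    loads = defaultdict(list)    # net -> input terminals connected to it
    for inst, terminals in netlist.items():
        for term, (direction, net) in terminals.items():
            where = f"{inst}.{term}"
            if net is None:
                errors.append((f"unconnected cell {direction}", where))
            elif direction == "output":
                drivers[net].append(where)
            else:
                loads[net].append(where)
    for net in loads:
        if not drivers.get(net) and net not in primary_inputs:
            errors.append(("net not driven by any cell", net))
    for net, driving in drivers.items():
        if len(driving) > 1:
            errors.append(("net driven by more than one cell", net))
    return sorted(errors)


for error, location in screen(NETLIST, PRIMARY_INPUTS):
    print(f"{location}: {error}")
```

Reporting each error with a location mirrors the way a screener points the designer back to the offending place on the schematic.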
The screener can work continuously as the designer is creating the schematic or can be run as a separate program independently from schematic entry. Usually the designer provides attributes that give the screener the information necessary to perform the checks. A few of the typical attributes that schematic-entry programs use are described next.

A screener usually generates a list of errors together with the locations of the problem on the schematic where appropriate. Some editors associate an identifier, or handle, with every piece of a schematic, including comments and every net. Normally there is some convention to the assigned names such as a grid on a schematic. This works like the locator codes on a map, so that a net with A1 as part of the name is in the upper-left-hand corner, for example. This allows you to quickly and uniquely find any problems found by a screener. The term handle is a computer programming term that is used in referring to a location in memory. Each piece of information on a schematic is stored in lists in memory. This technique breaks down completely when we move to HDLs.

Most schematic-entry programs work on a grid. The designer can control the size of the grid and whether it is visible or not. When you place components or wires you can instruct the editor to force your drawing to snap to grid. This means that drawing a schematic is like drawing on graph paper. You can only locate symbols, wires, and connections on grid points. This simplifies the internal mechanics of the schematic-entry program. It also makes the transfer of schematics between different EDA systems more manageable. Finally, it allows the designer to produce schematic diagrams that are cleaner in appearance and thus easier to read.

Most schematic-entry programs allow you to find components by instance name or cell name. The editor may either jump to the component location and center the graphic window on the component or highlight the component.
More sophisticated options allow more complex searches, perhaps using wildcard matching. For example, to find all three-input NAND gates (primitive cell name ND3) or three-input NOR gates (primitive cell name NO3), you could search for cell name N*3, where * is a wildcard symbol standing for any character. The editor may generate a list of components, perhaps with page number and coordinate locations. Extensive find
features are useful for large schematics where it quickly becomes impossible to find individual components.

Some schematic editors can automatically assign reference designators or instance names to the schematic symbols, either as the editor is running or as a postprocessing step. A component attribute, called a prefix, defines the prefix for the name for each type of component. For example, the prefix for all resistor component types may be R. Each time a prefix is found or a new instance is placed, the number in the reference designator or name is automatically incremented. Thus if the last resistor component type you placed was R99, the next time you place a resistor it would automatically be named R100. For large schematics it is useful to be able to generate a report on the used and unused reference designators. An example would be:

Reference designator prefix: R
Unused reference designator numbers: 153, 154
Last used reference designator number: 180

If you need this feature, you probably are not using enough hierarchy to simplify your design.

During schematic entry of an ASIC design you will frequently need multiple copies of components. This often occurs during datapath design, where operations are carried out across multiple signals on a bus. A common example would be multiple copies of a latch, one for each signal on a bus. It is tedious and inefficient to have to draw and label the same cell many times on a schematic. To simplify this task, most editors allow you to place a special vectored cell instance of a cell. A vectored cell instance, or vectored instance for short, uses the same icon as a single instance but with a special attribute, the cell cardinality, that denotes the number of copies of the cell. Connections between signals on a bus and vectored instances should be handled automatically.
The width or cardinality of the bus and the cell cardinality must match, and the design-entry tool should issue a warning if this is not the case. A schematic-entry program can use a terminal attribute to determine which cell terminals are output terminals and which terminals are input terminals. This attribute is usually called terminal polarity or terminal direction . Possible values for
terminal polarity might be: input, output, and bidirectional. Checking the terminal polarity of the terminals on a net can help find problems such as a net with all input terminals or all output terminals.

The fanout of a cell measures the driving capability of an output terminal. The fanin of a cell measures the number of input terminals. Fanout is normally measured using a standard load. A standard load is the load presented by one input of a primitive cell, usually a two-input NAND. For example, a library cell Counter may have an input terminal, Clock, that is connected to the input terminals of five primitive cells. The loading at this terminal is then five standard loads. We say that the fanout of Clock is five. In a similar fashion, we say that if a cell Buffer is capable of driving the inputs of three primitive cells, the fanout of Buffer is three. Using the fanin and fanout attributes a netlist screener can check to see if the fanout driving a net is greater than the sum of all loads on that net. (See Figure 9.2 on page 329.)
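
As a rough sketch of the loading check just described, the following Python compares the drive capability of a cell with the sum of the standard loads on the net it drives. The attribute tables and the function name net_is_overloaded are hypothetical; the Counter and Buffer numbers are the ones used in the text, and BigBuffer and ND2.A are invented for the example.

```python
# Hypothetical attribute tables, all measured in standard loads.
DRIVE = {"Buffer": 3, "BigBuffer": 8}           # how many standard loads an output can drive
INPUT_LOAD = {"Counter.Clock": 5, "ND2.A": 1}   # load presented by each input terminal


def net_is_overloaded(driver_cell, load_terminals):
    """Return True if the loads on a net exceed what the driver can drive."""
    total_load = sum(INPUT_LOAD[terminal] for terminal in load_terminals)
    return total_load > DRIVE[driver_cell]


# Buffer can drive three standard loads, but Counter.Clock presents five.
print(net_is_overloaded("Buffer", ["Counter.Clock"]))     # True
print(net_is_overloaded("BigBuffer", ["Counter.Clock"]))  # False
```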
9.1.12 Schematic-Entry Tools

Some editors offer icon edit-in-place in a similar fashion as schematic edit-in-place for cells. Often you have to toggle editing modes in the schematic-entry program to switch between editing cells and editing cell icons. A schematic-entry program must keep track of when cells are edited. Normally this is done by using a timestamp or datestamp for each cell. This is a text field within the data file for each cell that holds the date and time that the cell was last modified. When a new schematic or cell is loaded, the program needs to compare its timestamp with the timestamps of any subcells. If any of the subcell timestamps are more recent, then the designer needs to be alerted. Usually a message appears to inform you that changes have been made to subcells since the last time the cell currently loaded was saved. This may be what you expect or it may be a warning that somehow a subcell has been changed inadvertently (perhaps someone else changed it) since you last loaded that cell. Normally the primitive cells in a library are locked and cannot be edited. If you can edit a primitive cell, you have to make a copy, edit the copy, and rename it. Normally the ASIC designer cannot do this and does not want to. For example, to edit a primitive NAND gate stored in an ASIC schematic library would require that the subschematic of the primitive cell be available (usually not the case) and also that the next lower level primitives (symbols for the transistors making up the NAND gate)
also be available to the designer (also usually not the case).

What do you do if somehow changes were made to a cell by mistake, perhaps by someone else, and you don't want the new cell, you want the old version? Most schematic-entry and other EDA tools keep old versions of files as a back-up in case this kind of problem occurs. Most EDA software automatically keeps track of the different versions of a file by appending a version number to each file. Usually this is transparent to the designer. Thus when you edit a cell named Floor, the file on disk might be called Floor.6. When you save the changes, the software will not overwrite Floor.6, but write out a new file and automatically name it Floor.7.

Some design-entry tools are more sophisticated and allow users to create their own libraries as they complete an ASIC design. Designers can then control access to libraries and the cells that they build during a design. This normally requires that a schematic editor, for example, be part of a larger EDA system or framework rather than work as a stand-alone tool. Sometimes the process of library control operates as a separate tool, as a design manager or library manager. Often there is a program similar to the UNIX make command that keeps track of all files, their dependencies, and the tools that are necessary to create and update each file.

You can normally set the number of back-up versions of files that EDA software keeps. The version history controls the number of files the software will keep. If you accidentally update, overwrite, or delete a file, there is usually an option to select and revert to an earlier version. More advanced systems have check-out services (which work just as in source control systems in computer programming databases) that prevent these kinds of problems when many people are working on the same design.
Whenever possible, the management of design files and different versions should be left under software control because the process can become very complicated. Reverting to an earlier version of a cell can have drastic consequences for other cells that reference the cell you are working with. Attempts to manually edit files by changing version numbers and timestamps can quickly lead to chaos. Most schematic-entry programs allow you to undo commands. This feature may be restricted to simply undoing the last command that you entered, or may be an unlimited undo and redo, allowing you to back up as many commands as you want in the current editing session.
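
The version-numbering scheme described above (saving Floor.7 rather than overwriting Floor.6, and keeping a limited version history) can be sketched as follows. This is an illustration only, not how any particular EDA tool stores its files; the function names versions and save_cell and the keep parameter are hypothetical.

```python
import re
from pathlib import Path


def versions(directory, cell_name):
    """Return the sorted version numbers of files named <cell_name>.<n>."""
    pattern = re.compile(re.escape(cell_name) + r"\.(\d+)")
    found = []
    for path in Path(directory).iterdir():
        match = pattern.fullmatch(path.name)
        if match:
            found.append(int(match.group(1)))
    return sorted(found)


def save_cell(directory, cell_name, data, keep=3):
    """Write a new version without overwriting; keep only the newest `keep` files."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    existing = versions(directory, cell_name)
    new_version = existing[-1] + 1 if existing else 1
    new_path = Path(directory) / f"{cell_name}.{new_version}"
    new_path.write_text(data)
    # Prune the oldest versions beyond the configured history length.
    for old in (existing + [new_version])[:-keep]:
        (Path(directory) / f"{cell_name}.{old}").unlink()
    return new_path
```

Reverting to an earlier version then amounts to selecting one of the remaining numbered files, which is one reason this bookkeeping is best left to the software.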
You can spend a lot of time in a schematic editor placing components and drawing the connections between them. Features that simplify initial entry and allow modifications to be made easily can make an enormous difference to the efficiency of the schematic-entry process. Most schematic editors allow you to make connections by dragging the cursor with the wire following behind, in a process known as rubber banding . The connection snaps to a right angle when the connection is completed. For wire connections that require more than two line segments, an automatic wiring feature is useful. This allows you to define the wire path roughly using mouse clicks and have the editor complete the connection. It is exceedingly painful to move components if you have to rewire connections each time. Most schematic editors allow you to move the components and drag any wires along with them. One of the most annoying problems that can arise in schematic entry is to think that you have joined two wires on a schematic but find that in reality they do not quite meet. This error can be almost impossible to find. A good editing program will have a way of avoiding this problem. Some editors provide a visual (flash) or audible (beep) feedback when the designer draws a wire that makes an electrical connection with another. Some editors will also automatically insert a dot at a T connection to show that an electrical connection is present. Other editors refuse to allow four-way connections to be made, so there can be no ambiguity when wires cross each other if an electrical connection is present or not. A cell library or a collection of libraries is a key part of the schematic-entry process. The ability to handle and control these libraries is an important feature of any schematic editor. It should be easy to select components from the library to be placed on a schematic. In large schematics it is necessary to continue large nets and signals across several pages of schematics. 
Signals such as power and ground, VDD and GND, can be connected using global nets or special connectors . Global nets allow the designer to label a net with the same name at different places on a schematic page or on different pages without having to draw a connection explicitly. The schematic editor treats these nets as though they were electrically connected. Special connector symbols can
be used for connections that cross schematic pages. An off-page connector or multipage connector is a special symbol that will show and label a connection to different schematic pages. More sophisticated editors can automatically label these connectors with the page numbers of the destination connectors.
9.1.13 Back-Annotation
After you enter a schematic you simulate the design to make sure it works as expected. This completes the logical design. Next you move to ASIC physical design and complete the layout. Only after you complete the layout do you know the parasitic capacitance and therefore the delay associated with the interconnect. This postroute delay information must be returned to the schematic in a process known as back-annotation. Then you can complete a final, postlayout simulation to make sure that the specifications for the ASIC are met. Chapter 13 covers simulation, and the physical design steps are covered in Chapters 15 to 17.
9.2 Low-Level Design Languages

Schematics can be a very effective way to convey design information because pictures are such a powerful medium. There are two major problems with schematic entry, however. The first problem is that making changes to a schematic can be difficult. When you need to include an extra few gates in the middle of a schematic sheet, you may have to redraw the whole sheet. The second problem is that for many years there were no standards on how symbols should be drawn or how the schematic information should be stored in a netlist. These problems led to the development of design-entry tools based on text rather than graphics. As TTL gave way to PLDs, these text-based design tools became increasingly popular as de facto standards began to emerge for the format of the design files. PLDs are closely related to FPGAs. The major advantage of PLD tools is their low cost, their ease of use, and the tremendous amount of knowledge and number of designs, application notes, textbooks, and examples that have been built up over years of their use. It is natural then that designers would want to use PLD development systems and languages to design FPGAs and other ASICs. For example, there is a tremendous amount of PLD design expertise and working designs that can be reused. In the case of ASIC design it is important to use the right tool for the job. This may mean that you need to convert from a low-level design medium you have used for PLD design to one more appropriate for ASIC design. Often this is because you are merging several PLDs into a single, much larger, ASIC. The reason for covering the PLD design languages here is not to try and teach you how to use them, but to allow you to read and understand a PLD language and, if necessary, convert it to a form that you can use in another ASIC design system.
9.2.1 ABEL
ABEL is a PLD programming language from Data I/O. Table 9.2 shows some examples of the ABEL statements.

TABLE 9.2 ABEL.
Statement        Example                                Comment
Module           module MyModule                        You can have multiple modules.
Title            title 'Title in a Title String'        A string is a character series between quotes.
Device           MYDEV device '22V10' ;                 MYDEV is Device ID for documentation. 22V10 is checked by the compiler.
Comment          "comments go between double quotes"
                 "end of line is end of comment         The end of a line signifies the end of a comment; there is no need for an end quote.
@ALTERNATE       @ALTERNATE "use alternate symbols      Operator symbols are listed after this table.
Pin declaration  MYINPUT pin 2;                         Pin 22 is the IO for input on pin 2 for a 22V10.
                 I3, I4 pin 3, 4 ;
                 /MYOUTPUT pin 22;                      MYOUTPUT is active-low at the chip pin.
                 IO3,IO4 pin 21,20 ;                    Signal names must start with a letter.
Equations        equations                              Defines combinational logic.
                 IO4 = HELPER ; HELPER = /I4 ;          Two-pass logic
Assignments      MYOUTPUT = /MYINPUT ;                  Equals '=' is unclocked assignment.
                 IO3 := I4 ;                            Clocked assignment operator (registered IO)
Signal sets      D = [D0, D1, D2, D3] ;                 A signal set, an ABEL bus
                 Q = [Q0, Q1, Q2, Q3];
                 Q := D ;                               4-bit-wide register
Suffix           MYOUTPUT.RE = CLR ;                    Register reset
                 MYOUTPUT.PR = PRE ;                    Register preset
Addition         COUNT = [D0, D1, D2];
                 COUNT := COUNT + 1;                    Can't use @ALTERNATE if you use '+' to add.
Enable           ENABLE IO3 = IO2;                      Three-state enable (ENABLE is a keyword).
                 IO3 = MYINPUT;                         IO3 must be a three-state pin.
Constants        K = [1, 0, 1] ;                        K is 5.
Relational       IO# = D == K5 ;                        Operators: == != < > <= >=
End              end MyModule                           Last statement in module

Operator symbols (default, and alternate when @ALTERNATE is used):
operator    AND   OR   NOT   XOR   XNOR
alternate   *     +    /     :+:   :*:
default     &     #    !     $     !$

The following example code describes a 4:1 MUX (equivalent to the LS153 TTL part):
module MUX4
title '4:1 MUX'
MyDevice device 'P16L8' ;
@ALTERNATE
"inputs
A, B, /P1G, /P2G pin 17,18,1,6 "LS153 pins 14,2,1,15
P1C0, P1C1, P1C2, P1C3 pin 2,3,4,5 "LS153 pins 6,5,4,3
P2C0, P2C1, P2C2, P2C3 pin 7,8,9,11 "LS153 pins 10,11,12,13
"outputs
P1Y, P2Y pin 19, 12 "LS153 pins 7,9
equations
P1Y = P1G*(/B*/A*P1C0 + /B*A*P1C1 + B*/A*P1C2 + B*A*P1C3);
P2Y = P2G*(/B*/A*P2C0 + /B*A*P2C1 + B*/A*P2C2 + B*A*P2C3);
end MUX4
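
To read the MUX equation without an ABEL compiler, it can help to model one half of the part in software. The following Python sketch is an assumption-laden illustration: the function name mux4 is hypothetical, and g stands for the asserted (logical) value of the enable, not the voltage level on the active-low pin.

```python
def mux4(g, b, a, c0, c1, c2, c3):
    """Model Y = G*(/B*/A*C0 + /B*A*C1 + B*/A*C2 + B*A*C3) for one MUX half.

    All arguments are 0 or 1. B is the more significant select bit.
    """
    nb, na = not b, not a
    selected = (
        (nb and na and c0)
        or (nb and a and c1)
        or (b and na and c2)
        or (b and a and c3)
    )
    return bool(g and selected)


# With the enable asserted, select inputs B,A = 1,0 route data input C2 to Y.
print(mux4(1, 1, 0, 0, 0, 1, 0))  # True
print(mux4(0, 1, 0, 0, 0, 1, 0))  # False: enable not asserted
```

Each product term in the ABEL equation corresponds to one line of the Python expression, so the sum-of-products form is preserved.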
9.2.2 CUPL
CUPL is a PLD design language from Logical Devices. We shall review the CUPL 4.0 language here. The following code is a simple CUPL example describing sequential logic:
SEQUENCE BayBridgeTollPlaza {
PRESENT red
  IF car NEXT green OUT go;   /* conditional synchronous output */
  DEFAULT NEXT red;           /* default next state */
PRESENT green
  NEXT red; }                 /* unconditional next state */

This code describes a state machine with two states. Table 9.3 shows the different state machine assignment statements.

TABLE 9.3 CUPL statements for state-machine entry.
Statement          Description
IF NEXT            Conditional next state transition
IF NEXT OUT        Conditional next state transition with synchronous output
NEXT               Unconditional next state transition
NEXT OUT           Unconditional next state transition with asynchronous output
OUT                Unconditional asynchronous output
IF OUT             Conditional asynchronous output
DEFAULT NEXT       Default next state transition
DEFAULT OUT        Default asynchronous output
DEFAULT NEXT OUT   Default next state transition with synchronous output

You may also encode state machines as truth tables in CUPL. Here is another simple example:

FIELD input = [in1..0];
FIELD output = [out3..0];
TABLE input => output {00 => 01; 01 => 02; 10 => 04; 11 => 08; }
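
The two-state machine in the SEQUENCE block above can be mirrored in a few lines of Python, which may help when converting a CUPL state machine to another format. This is a behavioral sketch only: the function name step is hypothetical, and the go output is modeled simply as the value registered along with the state transition, which glosses over the exact output timing of a real device.

```python
def step(state, car):
    """Return (next_state, go) for one clock edge of the toll-plaza machine."""
    if state == "red":
        if car:                      # IF car NEXT green OUT go
            return "green", True
        return "red", False          # DEFAULT NEXT red
    if state == "green":
        return "red", False          # NEXT red (unconditional)
    raise ValueError(f"unknown state: {state}")


# Apply a sequence of car inputs, one per clock edge, starting in state red.
state = "red"
for car in [False, True, True, False]:
    state, go = step(state, car)
    print(state, go)
```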
The advantage of the CUPL language, and text-based PLD languages in general, is now apparent. First, we do not have to enter the detailed logic for the state decoding ourselves; the software does it for us. Second, making changes requires only simple text editing, which is fast and convenient. Table 9.4 shows some examples of CUPL statements.

TABLE 9.4 CUPL.
Statement               Example                        Comment
Boolean expression      A = !B;                        Logical negation
                        A = B & C;                     Logical AND
                        A = B # C;                     Logical OR
                        A = B $ C;                     Logical exclusive-OR
Comment                 A = B & C /* comment */
Pin declaration         PIN 1 = CLK;                   Device dependent
                        PIN = CLK;                     Device independent
Node declaration        NODE A;                        Number automatically assigned
                        NODE [B0..7];                  Array of buried nodes
Pinnode declaration     PINNODE 99 = A;                Node assigned by designer
                        PINNODE [10..17] = [B0..7];    Array of pinnodes
Bit-field declaration   FIELD Address = [B0..7];       8-bit address field
Bit-field operations    add_one = Address:FF;          True if Address = 0xFF
                        add_zero = !(Address:&);       True if Address = 0x00
                        add_range = Address:[0F..FF];  True if 0F.LE.Address.LE.FF

In CUPL, Boolean equations may use variables that contain a suffix, or an extension, as in the following example:

output.ext = (Boolean expression);
The extensions steer the software, known as a fitter, in assigning the logic. For example, a signal-name suffix of .OE marks that signal as an output enable. Here is an example of a CUPL file for a 4-bit counter placed in an ATMEL PLD part that illustrates the use of some common extensions:

Name 4BIT; Device V2500B;
/* inputs */
pin 1 = CLK; pin 3 = LD_; pin 17 = RST_;
pin [18,19,20,21] = [I0,I1,I2,I3];
/* outputs */
pin [4,5,6,7] = [Q0,Q1,Q2,Q3];
field CNT = [Q3,Q2,Q1,Q0];
/* equations */
Q3.T = (!Q2 & !Q1 & !Q0) & LD_ & RST_   /* count down */
     # Q3 & !RST_                       /* ReSeT */
     # (Q3 $ I3) & !LD_;                /* LoaD */
Q2.T = (!Q1 & !Q0) & LD_ & RST_ # Q2 & !RST_ # (Q2 $ I2) & !LD_;
Q1.T = !Q0 & LD_ & RST_ # Q1 & !RST_ # (Q1 $ I1) & !LD_;
Q0.T = LD_ & RST_ # Q0 & !RST_ # (Q0 $ I0) & !LD_;
CNT.CK = CLK; CNT.OE = 'h'F; CNT.AR = 'h'0; CNT.SP = 'h'0;

In this example the suffix extensions have the following effects: .CK marks the clock; .T configures sequential logic as T flip-flops; .OE (wired high) is the output enable; .AR (wired low) is the asynchronous reset; and .SP (wired low) is the synchronous preset. Table 9.5 shows the different CUPL extensions.
TABLE 9.5 CUPL 4.0 extensions.
Extension      L/R [1]   Explanation
D              L         D input to a D register
L                        L input to a latch
J, K                     J-K-input to a J-K register
S, R           L         S-R input to an SR register
T              L         T input to a T register
DQ                       D output of an input D register
LQ                       Q output of an input latch
AP, AR                   Asynchronous preset/reset
SP, SR                   Synchronous preset/reset
CK                       Product clock term (async.)
OE                       Product-term output enable
CA                       Complement array
PR                       Programmable preload
CE                       CE input of a DCE register
LE                       Product-term latch enable
OBS            L         Programmable observability of buried nodes
BYP            L         Programmable register bypass
DFB            R         D register feedback of combinational output
LFB                      Latched feedback of combinational output
TFB                      T register feedback of combinational output
INT            R         Internal feedback
IO             R         Pin feedback of registered output
IOD/T                    D/T register on pin feedback path selection
IOL                      Latch on pin feedback path selection
IOAP, IOAR               Asynchronous preset/reset of register on feedback path
IOSP, IOSR               Synchronous preset/reset of register on feedback path
IOCK                     Clock for pin feedback register
APMUX, ARMUX             Asynchronous preset/reset multiplexor selection
CKMUX                    Clock multiplexor selector
LEMUX                    Latch enable multiplexor selector
OEMUX                    Output enable multiplexor selector
IMUX                     Input multiplexor selector of two pins
TEC            L         Technology-dependent fuse selection
T1             L         T1 input of 2-T register
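
The .T equations of the 4-bit counter example can be checked with a short behavioral model. The sketch below is an illustration, not a fitter or a simulator: the function names bits and next_count are hypothetical, ld_n and rst_n stand for the active-low LD_ and RST_ pins, and each T flip-flop is modeled by the usual rule that Q toggles on the clock edge when T is 1.

```python
def bits(value):
    """Split a 4-bit integer into [Q0, Q1, Q2, Q3] (least significant bit first)."""
    return [(value >> k) & 1 for k in range(4)]


def next_count(count, data=0, ld_n=1, rst_n=1):
    """Return the counter value after one clock edge."""
    q, i = bits(count), bits(data)
    result = 0
    for k in range(4):
        lower_bits_zero = all(q[j] == 0 for j in range(k))
        t = (
            (lower_bits_zero and ld_n and rst_n)  # count down
            or (q[k] and not rst_n)               # reset: toggle every set bit to 0
            or ((q[k] ^ i[k]) and not ld_n)       # load: toggle bits that differ from I
        )
        new_bit = q[k] ^ (1 if t else 0)          # T flip-flop: toggle when T = 1
        result |= new_bit << k
    return result


value = next_count(0, data=5, ld_n=0)  # load 5
for _ in range(7):
    print(value, end=" ")
    value = next_count(value)          # count down, wrapping from 0 to 15
```

Writing the toggle conditions this way shows why T flip-flops suit counters: a bit toggles exactly when all the lower bits are zero.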
The 4-bit counter is a very simple example of the use of the Atmel ATV2500B. This PLD is quite complex and has many extra buried features. In order to use these features in CUPL (and ABEL) you need to refer to special pin numbers and node numbers that are given in tables in the manufacturer's data sheets. You may need the pin-number tables to reverse engineer or convert a complicated CUPL (or ABEL) design from one format to another. Atmel also gives skeleton headers and pin declarations for their parts in their data sheets. Table 9.6 shows the headers and pin declarations in ABEL and CUPL format for the ATMEL ATV2500B.

TABLE 9.6 ABEL and CUPL pin declarations for an ATMEL ATV2500B.

ABEL:
device_id device 'P2500B'; "device_id used for JEDEC filename
I1,I2,I3,I17,I18 pin 1,2,3,17,18;
O4,O5 pin 4,5 istype 'reg_d,buffer';
O6,O7 pin 6,7 istype 'com';
O4Q2,O7Q2 node 41,44 istype 'reg_d';
O6F2 node 43 istype 'com';
O7Q1 node 220 istype 'reg_d';

CUPL:
device V2500B;
pin [1,2,3,17,18] = [I1,I2,I3,I17,I18];
pin [7,6,5,4] = [O7,O6,O5,O4];
pinnode [41,65,44] = [O4Q2,O4Q1,O7Q2];
pinnode [43,68] = [O6Q2,O7Q1];
9.2.3 PALASM
PALASM is a PLD design language from AMD/MMI. Table 9.7 shows the format of PALASM statements.

TABLE 9.7 PALASM 2.
Statement             Example                  Comment
Chip                  CHIP abc 22V10           Specific PAL type
                      CHIP xyz USER            Free-form equation entry
Pinlist               CLK /LD D0 D1 D2 D3 D4 GND NC Q4 Q3 Q2 Q1 Q0 /RST VCC
                                               Part of CHIP statement; PAL pins in numerical order starting with pin 1
String                STRING string_name 'text'
                                               Before EQUATIONS statement
Equations             EQUATIONS                After CHIP statement
                      A = /B                   Logical negation
                      A = B * C                Logical AND
                      A = B + C                Logical OR
                      A = B :+: C              Logical exclusive-OR
                      A = B :*: C              Logical exclusive-NOR
Polarity inversion    /A = /(B + C)            Same as A = B + C
Assignment            A = B + C                Combinational assignment
                      A := B + C               Registered assignment
Comment               A = B + C ; comment      Comment
Functional equation   name.TRST                Output enable control
                      name.CLKF                Register clock control
                      name.RSTF                Register reset control
                      name.SETF                Register set control

The following simple example (a video shift register) shows the most basic features of the PALASM 2 language:
TITLE video ; shift register
CHIP video PAL20X8
CK /LD D0 D1 D2 D3 D4 D5 D6 D7 CURS GND NC REV Q7 Q6 Q5 Q4 Q3 Q2 Q1 Q0 /RST VCC
STRING Load 'LD*/REV*/CURS*RST' ; load data
STRING LoadInv 'LD*REV*/CURS*RST' ; load inverted of data
STRING Shift '/LD*/CURS*/RST' ; shift data from MSB to LSB
EQUATIONS
/Q0 := /D0*Load+D0*LoadInv:+:/Q1*Shift+RST
/Q1 := /D1*Load+D1*LoadInv:+:/Q2*Shift+RST
/Q2 := /D2*Load+D2*LoadInv:+:/Q3*Shift+RST
/Q3 := /D3*Load+D3*LoadInv:+:/Q4*Shift+RST
/Q4 := /D4*Load+D4*LoadInv:+:/Q5*Shift+RST
/Q5 := /D5*Load+D5*LoadInv:+:/Q6*Shift+RST
/Q6 := /D6*Load+D6*LoadInv:+:/Q7*Shift+RST
/Q7 := /D7*Load+D7*LoadInv:+:Shift+RST;
The order of the pin numbers in the previous example is important; the order must correspond to the order of pins for the DEVICE. This means that you probably need the device data sheet in order to be able to translate a design from PALASM to another format by hand. The alternative is to use utilities that many PLD and FPGA companies offer that automatically translate from PALASM to their own formats.

1. In Table 9.5, L means that the extension is used only on the LHS of an equation; R means that the extension is used only on the RHS of an equation.
9.3 PLA Tools

We shall use the Berkeley PLA tools to illustrate logic minimization using an example to minimize the logic required to implement the following three logic functions:

F1 = A|B|!C; F2 = !B&C; F3 = A&B|C;

These equations are in eqntott input format. The eqntott (for equation to truth table) program converts the input equations into a tabular format. Table 9.8 shows the truth table and eqntott output for functions F1, F2, and F3 that use the six minterms: A, B, !C, !B&C, A&B, C.

TABLE 9.8 A PLA tools example.
Input (6 minterms): F1 = A|B|!C; F2 = !B&C; F3 = A&B|C;

A B C   F1 F2 F3     eqntott output   espresso output
0 0 0   1  0  0      .i 3             .i 3
0 0 1   0  1  1      .o 3             .o 3
0 1 0   1  0  0      .p 6             .p 5
0 1 1   1  0  1      --0 100          1-- 100
1 0 0   1  0  0      --1 001          11- 001
1 0 1   1  1  1      -01 010          --0 100
1 1 0   1  0  1      -1- 100          -01 011
1 1 1   1  0  1      1-- 100          -11 101
                     11- 001          .e
                     .e

Output (5 minterms): F1 = A|!C|(B&C); F2 = !B&C; F3 = A&B|(!B&C)|(B&C);

This eqntott output is not really a truth table since each line corresponds to a minterm. The output forms the input to the espresso logic-minimization program. Table 9.9 shows the format for espresso input and output files. Table 9.10 explains the format of the input and output planes of the espresso input and output files. The espresso output in Table 9.8 corresponds to the logic equations that follow Table 9.10.

TABLE 9.9 The format of the input and output files used by the PLA design tool espresso.
Expression               Explanation
# comment                # must be first character on a line.
[d]                      Decimal number
[s]                      Character string
.i [d]                   Number of input variables
.o [d]                   Number of output variables
.p [d]                   Number of product terms
.ilb [s1] [s2]... [sn]   Names of the binary-valued variables; must be after .i and .o.
.ob [s1] [s2]... [sn]    Names of the output functions; must be after .i and .o.
.type f                  Following table describes the ON set; DC set is empty.
.type fd                 Following table describes the ON set and DC set.
.type fr                 Following table describes the ON set and OFF set.
.type fdr                Following table describes the ON set, OFF set, and DC set.
.e                       Optional, marks the end of the PLA description.

TABLE 9.10 The format of the plane part of the input and output files for espresso.
Plane   Character   Explanation
I       1           The input literal appears in the product term.
I       0           The input literal appears complemented in the product term.
I       -           The input literal does not appear in the product term.
O       1 or 4      This product term appears in the ON set.
O       0           This product term appears in the OFF set.
O       2 or -      This product term appears in the don't care set.
O       3 or ~      No meaning for the value of this function.

F1 = A|!C|(B&C); F2 = !B&C; F3 = A&B|(!B&C)|(B&C);

We see that espresso reduced the original six minterms to these five: A, A&B, !C, !B&C, B&C.

The Berkeley PLA tools were widely used in the 1980s. They were important stepping stones to modern logic synthesis tools. There are so many testbenches, examples, and old designs that used these tools that we occasionally need to convert files in the Berkeley PLA format to formats used in new tools.
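
When converting old files in this format, it is reassuring to confirm that a minimized PLA still implements the original functions. The Python sketch below does this for the example above by brute force over all eight input combinations. It is an illustration under stated assumptions: the function names parse_pla and evaluate are hypothetical, the product terms are typed in from Table 9.8 (header lines other than .i, .o, and .e are omitted), and only the plane characters 1, 0, and - are handled.

```python
from itertools import product

EQNTOTT = """.i 3
.o 3
--0 100
--1 001
-01 010
-1- 100
1-- 100
11- 001
.e"""

ESPRESSO = """.i 3
.o 3
1-- 100
11- 001
--0 100
-01 011
-11 101
.e"""


def parse_pla(text):
    """Return a list of (input_plane, output_plane) strings, skipping . and # lines."""
    terms = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in ".#":
            continue
        inputs, outputs = line.split()
        terms.append((inputs, outputs))
    return terms


def evaluate(terms, values):
    """Evaluate every output for one input pattern such as '101' (A, B, C)."""
    result = [0] * len(terms[0][1])
    for cube, outputs in terms:
        # A product term is true if each literal is '-' or matches the input bit.
        if all(c in ("-", v) for c, v in zip(cube, values)):
            for k, flag in enumerate(outputs):
                if flag == "1":
                    result[k] = 1
    return tuple(result)


before, after = parse_pla(EQNTOTT), parse_pla(ESPRESSO)
for a, b, c in product([0, 1], repeat=3):
    pattern = f"{a}{b}{c}"
    assert evaluate(before, pattern) == evaluate(after, pattern)
print(len(before), "product terms reduced to", len(after))
```

An exhaustive comparison like this is only practical for a handful of inputs, but it is a quick sanity check for small PLDs.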
9.4 EDIF
An ASIC designer spends an increasing amount of time forcing different tools to communicate. One standard for exchanging information between EDA tools is the electronic design interchange format (EDIF). We will describe EDIF version 2 0 0. The most important features added in EDIF 3 0 0 were to handle buses, bus rippers, and buses across schematic pages. EDIF 4 0 0 includes new extensions for PCB and multichip module (MCM) data. The Library of Parameterized Modules (LPM) standard is also based on EDIF. The newer versions of EDIF have a richer feature set, but the ASIC industry seems to have standardized on EDIF 2 0 0. Most EDA companies now support EDIF. The FPGA companies Altera and Actel use EDIF as their netlist format, and Xilinx has announced its intention to switch from its own XNF format to EDIF. We only have room for a brief description of the EDIF format here. A complete description of the EDIF standard is contained in the Electronic Industries Association (EIA) publication, Electronic Design Interchange Format Version 2 0 0 (ANSI/EIA Standard 548-1988) [EDIF, 1988].
9.4.1 EDIF Syntax

The structure of EDIF is similar to the Lisp programming language or the PostScript printer language. This makes EDIF a very hard language to read and almost impossible to write by hand. EDIF is intended as an exchange format between tools, not as a design-entry language. Since EDIF is so flexible, each company reads and writes different flavors of EDIF. Inevitably EDIF from one company does not quite work when we try to use it with a tool from another company, though this situation is improving with the gradual adoption of EDIF 3 0 0. We need to know just enough about EDIF to be able to fix these problems.
FIGURE 9.8 The hierarchical nature of an EDIF file.
Figure 9.8 illustrates the hierarchy of the EDIF file. Within an EDIF file are one or more libraries of cell descriptions. Each library contains technology information that is used in describing the characteristics of the cells it contains. Each cell description contains one or more user-named views of the cell. Each view is defined as a particular viewType and contains an interface description that identifies where the cell may be connected to and, possibly, a contents description that identifies the components and related interconnections that make up the cell.

The EDIF syntax consists of a series of statements in the following format:

    (keywordName {form})

A left parenthesis (round bracket) is always followed by a keyword name, followed by one or more EDIF forms (a form is a sequence of identifiers, primitive data, symbolic constants, or EDIF statements), ending with a right parenthesis. If you have programmed in Lisp or PostScript, you may understand that EDIF uses a "define it before you use it" approach and why there are so many parentheses in an EDIF file.

The semantics of EDIF are defined by the EDIF keywords. Keywords are the only types of name that can immediately follow a left parenthesis. Case is not significant in keywords. An EDIF identifier represents the name of an object or group of data. Identifiers are used for name definition, name reference, keywords, and symbolic constants. Valid
EDIF identifiers consist of alphanumeric or underscore characters and must be preceded by an ampersand (&) if the first character is not alphabetic. The ampersand is not considered part of the name. The length of an identifier is from 1 to 255 characters and case is not significant. Thus &clock, Clock, and clock all represent the same EDIF name (very confusing).

Numbers in EDIF are 32-bit signed integers. Real numbers use a special EDIF format. For example, the real number 1.4 is represented as (e 14 -1). The e form requires a mantissa (14) and an exponent (-1). Reals are restricted to the range $\pm 1 \times 10^{\pm 35}$. Numbers in EDIF are dimensionless and the units are determined according to where the number occurs in the file. Coordinates and line widths are units of distance and must be related to meters. Each coordinate value is converted to meters by applying a scale factor. Each EDIF library has a technology section that contains a required numberDefinition. The scale keyword is used with the numberDefinition to relate EDIF numbers to physical units.

Valid EDIF strings consist of sequences of ASCII characters enclosed in double quotes. Any alphanumeric character is allowed as well as any of the following characters: ! # $ & ' ( ) * + , . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~. Special characters, such as " and %, are entered as escape sequences: %number%, where number is the integer value of the ASCII character. For example, "A quote is %34%" is a string with an embedded double-quote character. Blank, tab, line feed, and carriage-return characters (white space) are used as delimiters in EDIF. Blank and tab characters are also significant when they appear in strings.

The rename keyword can be used to create a new EDIF identifier as follows:

    (cell (rename TEST_1 "test$1") ...

In this example the EDIF string contains the original name, test$1, and a new name, TEST_1, is created as an EDIF identifier.
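The lexical rules above are small enough to try out directly. The following Python sketch is an illustration only and is not taken from the EDIF standard or from any EDA tool: the function names (edif_real, normalize_identifier, decode_string) are invented here, and the escape handling assumes a single integer between the percent signs.

```python
import re

def edif_real(mantissa: int, exponent: int) -> float:
    """Value of the EDIF real-number form (e mantissa exponent)."""
    # Building the literal text "14e-1" lets Python round the value once.
    return float(f"{mantissa}e{exponent}")

def normalize_identifier(name: str) -> str:
    """Compare-ready form of an EDIF identifier (hypothetical helper).

    The leading '&' is not part of the name, and case is not significant.
    """
    if name.startswith("&"):
        name = name[1:]
    return name.lower()

def decode_string(body: str) -> str:
    """Expand %number% escapes in the body of an EDIF string (quotes removed).

    Assumption: one integer per escape; optional blanks inside are tolerated.
    """
    return re.sub(r"%\s*(\d+)\s*%", lambda m: chr(int(m.group(1))), body)

print(edif_real(14, -1))
print(normalize_identifier("&clock"), normalize_identifier("Clock"))
print(decode_string("A quote is %34%"))
```

The three calls correspond to the examples in the text: the real number (e 14 -1), the equivalent names &clock and Clock, and the string with an embedded double quote.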
9.4.2 An EDIF Netlist Example

Table 9.11 shows an EDIF netlist. This EDIF description corresponds to the halfgate
example in Chapter 8 and describes an inverter. We shall explain the functions of the EDIF in Table 9.11 by showing a piece of the code at a time followed by an explanation. TABLE 9.11 EDIF file for the halfgate netlist from Chapter 8.
    (edif halfgate_p
      (edifVersion 2 0 0)
      (edifLevel 0)
      (keywordMap (keywordLevel 0))
      (status
        (written
          (timeStamp 1996 7 10 22 5 10)
          (program "COMPASS Design Automation -- EDIF Interface"
            (version "v9r1.2 last updated 26-Mar-96"))
          (author "mikes")))

Every EDIF file must have an edif form. The edif form must have a name, an edifVersion, an edifLevel, and a keywordMap. The edifVersion consists of three integers describing the major (first number) and minor version of EDIF. The keywordMap must have a keywordLevel. The optional status can contain a written form that must have a timeStamp and, optionally, author or
program forms.

    (library xc4000d
      (edifLevel 0)
      (technology

(The unbalanced parentheses are deliberate since we are showing segments of the EDIF code.) The library form must have a name, edifLevel, and technology. The edifLevel is normally 0. The xc4000d library contains the cells we are using in our schematic.

        (numberDefinition )
        (simulationInfo (logicValue H) (logicValue L)))

The simulationInfo form is used by simulation tools; we do not need that information for netlist purposes for this cell. We shall discuss numberDefinition in the next example. It is not needed in a netlist.

    (cell (rename INV "inv") (cellType GENERIC)

This cell form defines the name and type of a cell inv that we are going to use in the schematic.

      (view COMPASS_mde_view (viewType NETLIST)
        (interface
          (port I (direction INPUT))
          (port O (direction OUTPUT))
          (designator "@@Label")))))

The NETLIST view of this inverter cell has an input port I and an output port O. There is also a placeholder "@@Label" for the instance name of the cell.

    (library working...

This begins the description of our schematic that is in our library working. The lines that follow this library form are similar to the preamble for the cell library xc4000d that we just explained.
    (cell (rename HALFGATE_P "halfgate_p") (cellType GENERIC)
      (view COMPASS_nls_view (viewType NETLIST)

This cell form is for our schematic named halfgate_p.

        (interface
          (port myInput (direction INPUT))
          (port myOutput (direction OUTPUT))

The interface form defines the names of the ports that were used in our schematic, myInput and myOutput. At this point we have not associated these ports with the ports of the cell INV in the cell library.

          (designator "@@Label"))
        (contents
          (instance B1_i1

This gives an instance name B1_i1 to the cell in our schematic.

            (viewRef COMPASS_mde_view (cellRef INV (libraryRef xc4000d))))

The cellRef form links the cell instance name B1_i1 in our schematic to the cell INV in the library xc4000d.

          (net myInput
            (joined (portRef myInput) (portRef I (instanceRef B1_i1))))

The net form for myInput (and the one that follows it for myOutput) ties the net names in our schematic to the ports I and O of the library cell INV.

          (net VDD (joined ))
          (net VSS (joined ))))))

These forms for the global VDD and VSS nets are often handled differently by different tools (one company might call the negative supply GND instead of VSS, for example). This section is where you most often have to edit the EDIF.

    (design HALFGATE_P (cellRef HALFGATE_P (libraryRef working))))
The design form names and places our design in library working, and completes the EDIF description.
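Because every EDIF form is a parenthesized list that starts with a keyword, a very small reader is enough to pull connectivity out of a netlist such as Table 9.11. The Python sketch below is a minimal illustration under stated assumptions: parse, find, and net_connections are hypothetical helper names, the tokenizer does not handle %number% escapes, and FRAGMENT holds only the forms of the contents section that are printed above (the myOutput net is not shown in the text, so it is not invented here).

```python
import re

# Strings, parentheses, or bare atoms (keywords, identifiers, integers).
TOKEN = re.compile(r'"[^"]*"|[()]|[^\s()"]+')

def parse(text):
    """Parse one parenthesized EDIF form into nested Python lists."""
    tokens = TOKEN.findall(text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return tok
        form = []
        while tokens[pos] != ")":
            form.append(read())
        pos += 1  # consume the closing ")"
        return form

    return read()

def find(form, keyword):
    """Yield every nested form whose keyword matches (case is not significant)."""
    if isinstance(form, list) and form:
        if isinstance(form[0], str) and form[0].lower() == keyword.lower():
            yield form
        for child in form[1:]:
            yield from find(child, keyword)

def net_connections(tree):
    """Map each net name to a list of (port, instance or None) pairs."""
    nets = {}
    for net in find(tree, "net"):
        pins = []
        for ref in find(net, "portRef"):
            inst = next(find(ref, "instanceRef"), None)
            pins.append((ref[1], inst[1] if inst else None))
        nets[net[1]] = pins
    return nets

FRAGMENT = """
(contents
  (instance B1_i1
    (viewRef COMPASS_mde_view (cellRef INV (libraryRef xc4000d))))
  (net myInput
    (joined (portRef myInput) (portRef I (instanceRef B1_i1))))
  (net VDD (joined ))
  (net VSS (joined )))
"""

for name, pins in net_connections(parse(FRAGMENT)).items():
    print(name, pins)
```

A reader like this makes the empty joined forms of the VDD and VSS nets easy to spot, which is the part of the file the text says most often needs editing by hand.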
9.4.3 An EDIF Schematic Icon

EDIF is capable of handling many different representations. The next EDIF example is another view of an inverter that describes how to draw the icon (the picture that appears on the printed schematic or on the screen) shown in Figure 9.9. We shall examine the EDIF created by the CAD/CAM Group's Engineering Capture System (ECS) schematic editor.
FIGURE 9.9 An EDIF view of an inverter icon. The coordinates shown are in EDIF units. The crosses that show the text location origins and the dotted bounding box do not print as part of the icon. This time we shall give more detailed explanations after each piece of EDIF code. We shall also maintain balanced parentheses to make the structure easier to follow. To shorten the often lengthy EDIF code, we shall use an ellipsis ( ... ) to indicate any code that has been left out. (edif ECS (edifVersion 2 0 0) (edifLevel 0) (keywordMap (keywordLevel 0)) (status (written
(timeStamp 1987 8 20 0 50 23) (program "CAD/CAM Group, Inc. ECS" (Version "1")))) (library USER ... ) ... ) This preamble is virtually identical to the previous netlist example (and demonstrates that EDIF is useful to store design information as software tools come and go over many years). The first line of the file defines the name of the file. This is followed by lines that identify the version of EDIF being used and the highest EDIF level used in the file (each library may use its own level up to this maximum). EDIF level 0 supports only literal constants and basic constructs. Higher EDIF levels support parameters, expressions, and flow control constructs. EDIF keywords may be mapped to aliases, and keyword macros may be defined within the keywordMap form. These features are not often used in ASIC design because of a lack of standardization. The keywordLevel 0 indicates these capabilities are not used here. The status construct is used for administration: when the file was created, the software used to create the file, and so on. Following this preamble is the main section of the file, which contains design information. (library USER (edifLevel 0) (technology (numberDefinition (scale 4 (e 254 -5) (unit distance))) (figureGroup NORMAL (pathWidth 0) (borderWidth 0) (textHeight 5)) (figureGroup WIDE (pathWidth 1) (borderWidth 1) (textHeight 5))) (cell 7404 ... ) ) The technology form has a numberDefinition that defines the scaling
information (we did not use this form for a netlist, but the form must be present). The first numberValue after scale represents EDIF numbers and the second numberValue represents the units specified by the unit form. The EDIF unit for distance is the meter. The numberValue can be an integer or an exponential number. The e form has a mantissa and an exponent. In this example, within the USER library, a distance of 4 EDIF units equals $254 \times 10^{-5}$ meters (or 4 EDIF units equals 0.1 inch).

After the numberDefinition in the technology form there are one or more figureGroup definitions. A figureGroup defines drawing information such as pathWidth, borderWidth, color, fillPattern, borderPattern, and textHeight. The figureGroup form must have a name, which will be used later in the library to refer back to these definitions. In this example the USER library has one figureGroup (NORMAL) for lines and paths of zero width (the actual width will be implementation dependent) and another figureGroup (WIDE) that will be used for buses with a wider width (for bold lines). The borderWidth is used for drawing filled areas such as rectangles, circles, and polygons. The pathWidth is used for open figures such as lines (paths) and open arcs.

Following the technology section the cell forms each represent a symbol. The cell form has a name that will appear in the names of any files produced. The cellType form GENERIC type is required by this schematic editor. The property form is used to list properties of the cell.

    (cell 7404 (cellType GENERIC)
      (property SymbolType (string "GATE"))
      (view PCB_Symbol (viewType SCHEMATIC)
        (interface ... ) ) )

The SymbolType property is used to distinguish between purely graphical symbols that do not occur in the parts list (a ground connection, for example), gate or component symbols, and block or cell symbols (for hierarchical schematics). The
SymbolType property is a string that may be COMPONENT , GATE , CELL , BLOCK , or GRAPHIC . Each cell may contain view forms and each view must have a name. Following the name of the view must be a viewType that is either GRAPHIC or SCHEMATIC . Following the viewType is the interface form, which contains the symbol and terminal information. The interface form contains the actual symbol data. (interface (port Pin_1 (designator "2") (direction OUTPUT) (dcMaxFanout 50)) (port Pin_2 (designator "1") (direction INPUT) (dcFanoutLoad 8) (property Cap (string "22"))) (property Value (string "45")) (symbol ... ) If the symbol has terminals, they are listed before the symbol form. The port form defines each terminal. The required port name is used later in the symbol form to refer back to the port. Since this example is from a PCB design, the terminals have pin numbers that correspond to the IC package leads. The pin numbers are defined in the designator form with the pin number as a string. The polarity of the pin is indicated by the direction form, which may be INPUT , OUTPUT , or INOUT . If the pin is an output pin, its Drive can be represented by dcMaxFanout and if it is an input pin its Load can be represented by dcFanoutLoad . The port form can also contain forms unused , dcMaxFanin , dcFaninLoad , acLoad , and portDelay . All other attributes for pins besides PinNumber , Polarity , Load , and Drive are contained in the property form. An attribute string follows the name of the property in the string form. In this
example port Pin_2 has a property Cap whose value is 22. This is the input capacitance of the inverter, but the interpretation and use of this value depends on the tools. In ASIC design pins do not have pin numbers, so designator is not used. Instead, the pin names use the property form. So (property NetName (string "1")) would replace the (designator "1") in this example on Pin_2 . The interface form may also contain attributes of the symbol. Symbol attributes are similar to pin attributes. In this example the property name Value has an attribute string "45" . The names occurring in the property form may be referenced later in the interface under the symbol form to refer back to the property . (symbol (boundingBox (rectangle (pt 0 0) (pt 76 -32))) (portImplementation Pin_1 (connectLocation (figure NORMAL (dot (pt 60 -16))))) (keywordDisplay designator (display NORMAL (justify LOWERCENTER) (origin (pt 60 -14))))) (portImplementation Pin_2 (connectLocation (figure NORMAL (dot (pt 0 -16))))) (keywordDisplay designator (display NORMAL (justify LOWERCENTER) (origin (pt 0 -14))))) (keywordDisplay cell (display NORMAL (justify CENTERLEFT) (origin (pt 25 5)))) (keywordDisplay instance (display NORMAL (justify CENTERLEFT) (origin (pt 36 -28)))) (keywordDisplay designator (display (figureGroupOverride NORMAL (textHeight 7)) (justify CENTERLEFT) (origin (pt 13 -16)))) (propertyDisplay Value (display (figureGroupOverride NORMAL (textHeight 9)) (justify CENTERRIGHT) (origin (pt 76 -24))))
(figure ... ) ) The interface contains a symbol that contains the pin locations and graphical information about the icon. The optional boundingBox form encloses all the graphical data. The x- and y-locations of two opposite corners of the bounding rectangle use the pt form. The scale section of the numberDefinition from the technology section of the library determines the units of these coordinates. The pt construct is used to specify coordinate locations in EDIF. The keyword pt must be followed by the x-location and the y-location. For example: (pt 100 200) is at x = 100, y = 200.
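To make the role of the scale form concrete, the following Python sketch converts pt coordinates to physical distances using the values from this example: (scale 4 (e 254 -5) (unit distance)) and the bounding-box corner (pt 76 -32). It is only an illustration; the function names meters_per_edif_unit and pt_to_meters are invented here, and exact fractions are used simply to avoid rounding noise.

```python
from fractions import Fraction

def meters_per_edif_unit(edif_units, mantissa, exponent):
    """Meters per EDIF unit for (scale edif_units (e mantissa exponent) (unit distance))."""
    return Fraction(mantissa) * Fraction(10) ** exponent / edif_units

def pt_to_meters(x, y, scale):
    """Convert the coordinates of a (pt x y) form to meters."""
    return (x * scale, y * scale)

INCH = Fraction(254, 10000)  # one inch expressed in meters

# Values taken from the USER library and the symbol boundingBox above.
scale = meters_per_edif_unit(4, 254, -5)
corner = pt_to_meters(76, -32, scale)

print(float(4 * scale / INCH))             # 4 EDIF units, in inches
print([float(c / INCH) for c in corner])   # bounding-box corner, in inches
```

With this scale the bounding box of the inverter icon works out to 1.9 inches wide and 0.8 inch high, consistent with 4 EDIF units per 0.1 inch.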
Each pin in the symbol is given a location using a portImplementation . The portImplementation refers back to the port defined in the interface . The connectLocation defines the point to connect to the pin. The connectLocation is specified as a figure , a dot with a single pt for its location.
(symbol ( ... (figure WIDE (path (pointList (pt 12 0) (pt 12 -32))) (path (pointList (pt 12 -32) (pt 44 -16))) (path (pointList (pt 12 0) (pt 44 -16)))) (figure NORMAL (path (pointList (pt 48 -16) (pt 60 -16))) (circle (pt 44 -16) (pt 48 -16)) (path (pointList (pt 0 -16) (pt 12 -16)))) (annotate (stringDisplay "INV" (display NORMAL (justify CENTERLEFT) (origin (pt 12 -12))))) ) The figure form has either a name, previously defined as a figureGroup in the
technology section, or a figureGroupOverride form. The figure has all the attributes ( pathWidth , borderWidth , and so on) that were defined in the figureGroup unless they are specifically overridden with a figureGroupOverride . Other objects that may appear in a figure are: circle , openShape , path , polygon , rectangle , and shape . Most schematic editors use a grid, and the pins are only allowed to occur on grid . A portImplementation can contain a keywordDisplay or a propertyDisplay for the location to display the pin number or pin name. For a GATE or COMPONENT , keywordDisplay will display the designator (pin number), and designator is the only keyword that can be displayed. For a BLOCK or CELL , propertyDisplay will display the NetName . The display form displays text in the same way that the figure displays graphics. The display must have either a name previously defined as a figureGroup in the technology section or a figureGroupOverride form. The display will have all the attributes ( textHeight for example) defined in the figureGroup unless they are overridden with a figureGroupOverride . A symbolic constant is an EDIF name with a predefined meaning. For example, LOWERLEFT is used to specify text justification. The display form can contain a justify to override the default LOWERLEFT . The display can also contain an orientation that overrides the default R0 (zero rotation). The choices for orientation are rotations ( R0, R90, R180, R270 ), mirror about axis ( MX, MY ), and mirror with rotation ( MXR90, MYR90 ). The display can contain an origin to override the default (pt 0 0) . The symbol itself can have either keywordDisplay or propertyDisplay forms such as the ones in the portImplementation . The choices for keywordDisplay are: cell for attribute Type , instance for attribute InstName , and designator for attribute RefDes . 
In the preceding example an attribute window currently mapped to attribute Value is displayed at location (76, -24) using right-justified text, and a font size is set with (textHeight 9). The graphical data in the symbol are contained in figure forms. The path form must contain pointList with two or more points. The figure may also contain a
rectangle or circle. Two points in a rectangle define the opposite corners. Two points in a circle represent opposite ends of the diameter. In this example a figure from figureGroup WIDE has three lines representing the triangle of the inverter symbol. Arcs use the openShape form. The openShape must contain a curve that contains an arc with three points. The three points in an arc correspond to the starting point, any point on the arc, and the end point. For example, (openShape (curve (arc (pt -5 0) (pt 0 5) (pt 5 0)))) is an arc with a radius of 5, centered at the origin. Arcs and lines use the pathWidth from the figureGroup or figureGroupOverride; circles and rectangles use borderWidth. The fixed text for a symbol uses annotate forms. The stringDisplay in annotate contains the text as a string. The stringDisplay contains a display with the textHeight, justification, and location. The symbol form can contain multiple figure and annotate forms.
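The geometry behind the arc and circle forms can be checked with a few lines of arithmetic. The Python sketch below is an illustration rather than anything a particular tool does: arc_circle and circle_from_diameter are hypothetical names, and the method is the standard circumcircle formula for three points.

```python
import math

def arc_circle(p1, p2, p3):
    """Center and radius of the circle through the three points of an arc form."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    d = 2 * (x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2))
    if d == 0:
        raise ValueError("points are collinear; they do not define an arc")
    s1, s2, s3 = x1 * x1 + y1 * y1, x2 * x2 + y2 * y2, x3 * x3 + y3 * y3
    cx = (s1 * (y2 - y3) + s2 * (y3 - y1) + s3 * (y1 - y2)) / d
    cy = (s1 * (x3 - x2) + s2 * (x1 - x3) + s3 * (x2 - x1)) / d
    return (cx, cy), math.hypot(x1 - cx, y1 - cy)

def circle_from_diameter(a, b):
    """Center and radius of a circle form given two ends of a diameter."""
    center = ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)
    return center, math.hypot(a[0] - b[0], a[1] - b[1]) / 2

# (arc (pt -5 0) (pt 0 5) (pt 5 0)) from the text
center, radius = arc_circle((-5, 0), (0, 5), (5, 0))
print(center, radius)

# (circle (pt 44 -16) (pt 48 -16)), the inversion bubble of the icon
print(circle_from_diameter((44, -16), (48, -16)))
```

The first call confirms the statement in the text that the example arc has a radius of 5 and is centered at the origin; the second shows that the bubble on the inverter output has a radius of 2 EDIF units.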
9.4.4 An EDIF Example

In this section we shall illustrate the use of EDIF in translating a cell library from one set of tools to another: from a Compass Design Automation cell library to the Cadence schematic-entry tools. The code in Table 9.12 shows the EDIF description of the symbol for a two-input AND gate, an02d1, from the Compass cell library.

TABLE 9.12 EDIF file for a Compass standard-cell schematic icon.
The Cadence schematic tools do contain a procedure, EDIFIN, that reads the Compass EDIF files. This procedure works, but, as we shall see, results in some problems when you use the icons in the Cadence schematic-entry tool. Instead we shall make some changes to the original files before we use EDIFIN to transfer the information to the Cadence database, cdba . The original Compass EDIF file contains a figureGroup for each of the following four EDIF cell symbols: connector_FG icon_FG instance_FG net_FG bus_FG
The EDIFIN application translates each figureGroup to a Cadence layer-purpose pair definition that must be defined in the Cadence technology file associated with the library. If we use the original EDIF file with EDIFIN this results in the automatic modification of the Cadence technology file to define layer names, purposes, and the required properties to enable use of the figureGroup names. This results in non-Cadence layer names in the Cadence database. First then, we need to modify the EDIF file to use the standard Cadence layer names shown in Table 9.13. These layer names and their associated purposes and properties are defined in the default Cadence technology file, default.tf. There is one more layer name in the Compass files (bus_FG figureGroup), but since this is not used in the library we can remove this definition from the EDIF input file.

TABLE 9.13 Compass and corresponding Cadence figureGroup names.

Compass name    Cadence name
connector_FG    pin
icon_FG         device
instance_FG     instance
net_FG          wire
bus_FG          not used

Internal scaling differences lead to giant characters in the Cadence tools if we use the textHeight of 30 defined in the EDIF file. Reducing the textHeight to 5 results in a reasonable text height.

The EDIF numberDefinition construct, together with the scale construct, defines measurement scaling in an EDIF file. In a Cadence schematic EDIF file the numberDefinition and scale construct is determined by an entry in the associated library technology file that defines the edifUnit to userUnit ratio. This ratio affects the printed size of an icon. For example, the distance defined by the following path construct is 10 EDIF units:

    (path (pointlist (pt 0 0) (pt 0 10)))
What is the length of 10 EDIF units? The numberDefinition and scale construct associates EDIF units with a physical dimension. The following construct

    (numberDefinition (scale 100 (e 25400 -6) (unit DISTANCE)))

specifies that 100 EDIF units equal $25400 \times 10^{-6}$ m, or 1 inch. Cadence defines schematic measurements in inches by defining the userUnit property of the affected viewType or viewName as inch in the Cadence technology file. The Compass EDIF files do not provide values for the numberDefinition and scale construct, and the Cadence tools default to a value of 160 EDIF units to 1 user unit. We thus need to add a numberDefinition and scale construct to the Compass EDIF file to control the printed size of icons.

The EDIF file defines blank label placeholders for each cell using the EDIF property construct. Cadence EDIFIN does recognize and translate EDIF properties, but to attach a label property to a cellview object it must be defined (not blank) and identified as a property using the EDIF owner construct in the EDIF file. Since the intent of a placeholder is to hold an empty spot for later use and since Cadence Composer (the schematic-entry tool) supports label additions to instantiated icons, we can remove the EDIF label property construct in each cell and the associated propertyDisplay construct from the Compass file.

There is a problem that we need to resolve with naming. This is a problem that sooner or later everyone must tackle in ASIC design: case sensitivity. In EDIF, input and output pins are called ports and they are identified using portImplementation constructs. In order that the ports of a particular cell icon_view are correctly associated with the ports in the related functional, layout, and abstract views, they must all have the same name. The Cadence tools are case sensitive in this respect.
The Verilog and CIF files corresponding to each cell in the Compass library use lowercase names for each port of a given cell, whereas the EDIF file uses uppercase. The EDIFIN translator allows the case of cell, view, and port names to be automatically changed on translation. Thus pin names such as ' A1 ' become ' a1 ' and the original view name ' Icon_view ' becomes ' icon_view '.
The boundingBox construct defines a bounding box around a symbol (icon). Schematic-capture tools use this to implement various functions. The Cadence Composer tool, for example, uses the bounding box to control the wiring between cells and as a highlight box when selecting components of a schematic. Compass uses a large boundingBox definition for the cells to allow space for long hierarchical names. Figure 9.10 (a) shows the original an02d1 cell bounding box that is larger than the cell icon.
FIGURE 9.10 The bounding box problem. (a) The original bounding box for the an02d1 icon. (b) Problems in Cadence Composer due to overlapping bounding boxes. (c) A shrinkwrapped bounding box created using SKILL. Icons with large bounding boxes create two problems in Composer. Highlighting all or part of a complex design consisting of many closely spaced cells results in a confusion of overlapped highlight boxes. Also, large boxes force strange wiring patterns between cells that are placed too closely together when Composer's automatic routing algorithm is used. Figure 9.10 (b) shows an example of this problem. There are two solutions to the bounding-box problem. We could modify each boundingBox definition in the original EDIF file before translation to conform to the outline of the icon. This involves identifying the outline of each icon in the EDIF file and is difficult. A simpler approach is to use the Cadence tool programming language, SKILL. SKILL provides direct access to the Cadence database, cdba , in order to modify and create objects. Using SKILL you can use a batch file to call functions normally accessed interactively. The solution to the bounding box problem is:
1. Use EDIFIN to create the views in the Cadence database, cdba.
2. Use the schCreateInstBox() command on each icon_view object to eliminate the original bounding box and create a new, minimum-sized, bounding box that is shrink-wrapped to each icon.

Figure 9.10(c) shows the results of this process. This modification fixes the problems with highlighting and wiring in Cadence Composer. This completes the steps required to translate the schematic icons from one set of tools to another. The process can be automated in three ways:
- Write UNIX sed and awk scripts to make the changes to the EDIF file before using EDIFIN and SKILL.
- Write custom C programs to make the changes to the EDIF file and then proceed as in the first option.
- Perform all the work using SKILL.
The last approach is the most elegant and most easily maintained but is the most difficult to implement (mostly because of the time required to learn SKILL). The whole project took several weeks (including the time it took to learn how to use each of the tools). This is typical of the problems you face when trying to convert data from one system to another.
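As an illustration of the scripted options, two of the textual changes described in this section (renaming the figureGroup names according to Table 9.13 and reducing the textHeight from 30 to 5) can be expressed as simple substitutions. The sketch below uses Python in place of sed, awk, or C purely for illustration, and it is not the script used in the project described here. The function name translate and the SAMPLE text are hypothetical; a real file would also need the other edits discussed above (removing the bus_FG definition, adding a numberDefinition, and removing the blank label properties).

```python
import re

# Compass figureGroup name -> Cadence layer name (Table 9.13).
FIGURE_GROUP_MAP = {
    "connector_FG": "pin",
    "icon_FG": "device",
    "instance_FG": "instance",
    "net_FG": "wire",
}

NAME_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in FIGURE_GROUP_MAP) + r")\b"
)

def translate(edif_text: str) -> str:
    """Rename figureGroups and shrink the text height in one pass each."""
    text = NAME_PATTERN.sub(lambda m: FIGURE_GROUP_MAP[m.group(1)], edif_text)
    # Assumption: the height appears literally as (textHeight 30).
    return re.sub(r"\(textHeight\s+30\)", "(textHeight 5)", text)

# Hypothetical input lines, written in the style of the examples above.
SAMPLE = (
    "(figureGroup icon_FG (textHeight 30))\n"
    "(figure net_FG (path (pointList (pt 0 0) (pt 0 10))))\n"
)

print(translate(SAMPLE))
```

A purely textual approach like this is quick to write, but it knows nothing about the structure of the file, which is one reason the text prefers doing all the work in SKILL.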
9.5 CFI Design Representation

The CAD Framework Initiative ( CFI ) is an independent nonprofit organization working on the creation of standards for the electronic CAD industry. One of the areas in which CFI is working is the definition of standards for design representation ( DR ). The CFI 1.0 standard [ CFI, 1992] has tackled the problems of ambiguity in the area of definitions and terms for schematics by defining an information model ( IM ) for electrical connectivity information. What this means is that a group of engineers got together and proposed a standard way of using the terms and definitions that we have discussed. There are good things and bad things about standards, and one aspect of the CFI 1.0 DR standard illustrates this point. A good thing about the CFI 1.0 DR standard is that it precisely defines what we mean by terms and definitions in schematics, for example. A bad thing about the CFI DR standard is that in order to be precise it introduces yet more terms that are difficult to understand. A very brief discussion of the CFI 1.0 DR standard is included here, at the end of this chapter, for several reasons:
- It helps to solidify the concepts of the terms and definitions such as cell, net, and instance that we have already discussed. However, there are additional new concepts and terms to define in order to present the standard model, so this is not a good way to introduce schematic terminology.
- The ASIC design engineer is becoming more of a programmer and less of a circuit designer. This trend shows no sign of stopping as ASICs grow larger and systems more complex. A precise understanding of how tools operate and interact is becoming increasingly important.
9.5.1 CFI Connectivity Model

The CFI connectivity model is defined using the EXPRESS language and its graphical equivalent EXPRESS-G . EXPRESS is an International Standards Organization (ISO) standard [ EXPRESS, 1991]. EDIF 3 0 0 and higher also use EXPRESS as the internal formal description of the language. EXPRESS is used to define objects and their relationships. Figure 9.11 shows some simple examples of the EXPRESS-G notation.
FIGURE 9.11 Examples of EXPRESS-G. (a) Each day in January has a number from 1 to 31. (b) A shopping list may contain a list of items. (c) An EXPRESS-G model for a family.

The following EXPRESS code (a schema) is equivalent to the EXPRESS-G family model shown in Figure 9.11(c):

    SCHEMA family_model;
    ENTITY person
      ABSTRACT SUPERTYPE OF (ONEOF (man, woman, child));
      name: STRING;
      date_of_birth: STRING;
    END_ENTITY;
    ENTITY man
      SUBTYPE OF (person);
      wife: SET[0:1] OF woman;
      children: SET[0:?] OF child;
    END_ENTITY;
    ENTITY woman
      SUBTYPE OF (person);
      husband: SET[0:1] OF man;
      children: SET[0:?] OF child;
    END_ENTITY;
    ENTITY child
      SUBTYPE OF (person);
      father: man;
      mother: woman;
    END_ENTITY;
    END_SCHEMA;

This EXPRESS description is a formal way of saying the following:
- Men, women, and children are people.
- A man can have one woman as a wife, but does not have to.
- A woman can have one man as a husband, but does not have to.
- A man or a woman can have several children.
- A child has one father and one mother.
Computers can deal more easily with the formal language version of these statements. The formal language and graphical forms are more precise for very complex models. Figure 9.12 shows the basic structure of the CFI 1.0.0 Base Connectivity Model (BCM). The actual EXPRESS-G diagram for the BCM defined in the CFI 1.0.0 standard is only a little more complicated than Figure 9.12 (containing 21 boxes or types rather than just six). The extra types are used for bundles (a group of nets) and different views of cells (other than the netlist view).
FIGURE 9.12 The original five-box model of electrical connectivity. There are actually six boxes or types in this figure; the Library type was added later.

Figure 9.12 says the following ("presents" as used in Figure 9.12 is the EXPRESS jargon for "have"):
- A library contains cells.
- Cells have ports, contain nets, and can contain other cells.
- Cell instances are copies of a cell and have port instances.
- A port instance is a copy of the port in the library cell.
- You connect to a port using a net.
- Nets connect port instances together.
Once you understand Figure 9.12 you will see that it replaces the first half of this chapter. Unfortunately you have to read the first half of this chapter to understand Figure 9.12 .
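For readers who think more readily in code than in EXPRESS-G, the six statements above can be written as a handful of record types. The Python sketch below is an informal reading of Figure 9.12, not the CFI data model itself: the class and field names are chosen here for illustration, and the example data reuse the halfgate netlist of Table 9.11.

```python
from dataclasses import dataclass, field

@dataclass
class Port:
    name: str

@dataclass
class Net:
    name: str
    # A net joins ports of its own cell and port instances of the cells inside it.
    connections: list = field(default_factory=list)

@dataclass
class Cell:
    name: str
    ports: dict = field(default_factory=dict)      # port name -> Port
    nets: dict = field(default_factory=dict)       # net name -> Net
    instances: dict = field(default_factory=dict)  # instance name -> CellInstance

@dataclass
class PortInstance:
    port: Port           # the library port that this is a copy of
    instance_name: str   # the cell instance that owns this copy

@dataclass
class CellInstance:
    name: str
    cell: Cell           # the library cell that this is a copy of
    port_instances: dict = field(init=False)

    def __post_init__(self):
        # Each instance receives its own copy of every port of its cell.
        self.port_instances = {
            n: PortInstance(p, self.name) for n, p in self.cell.ports.items()
        }

@dataclass
class Library:
    name: str
    cells: dict = field(default_factory=dict)      # cell name -> Cell

# The halfgate example: library xc4000d holds INV, library working holds the schematic.
inv = Cell("INV", ports={"I": Port("I"), "O": Port("O")})
xc4000d = Library("xc4000d", cells={"INV": inv})

halfgate = Cell(
    "HALFGATE_P",
    ports={"myInput": Port("myInput"), "myOutput": Port("myOutput")},
)
b1 = CellInstance("B1_i1", inv)
halfgate.instances[b1.name] = b1
halfgate.nets["myInput"] = Net(
    "myInput", [halfgate.ports["myInput"], b1.port_instances["I"]]
)
working = Library("working", cells={"HALFGATE_P": halfgate})

print(halfgate.nets["myInput"])
```

The net myInput ends up joined to one port and one port instance, which mirrors the (joined (portRef myInput) (portRef I (instanceRef B1_i1))) form in the EDIF netlist.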
9.6 Summary
The important concepts that we covered in this chapter are:
- Schematic entry using a cell library
- Cells and cell instances, nets and ports
- Bus naming, vectored instances in datapath
- Hierarchy
- Editing cells
- PLD languages: ABEL, PALASM, and CUPL
- Logic minimization
- The functions of EDIF
- CFI representation of design information
LOGIC SYNTHESIS
Logic synthesis provides a link between an HDL (Verilog or VHDL) and a netlist similarly to the way that a C compiler provides a link between C code and machine language. However, the parallel is not exact. C was developed for use with compilers, but HDLs were not developed for use with logic-synthesis tools. Verilog was designed as a simulation language and VHDL was designed as a documentation and description language. Both Verilog and VHDL were developed in the early 1980s, well before the introduction of commercial logic-synthesis software. Because these HDLs are now being used for purposes for which they were not intended, the state of the art in logic synthesis falls far short of that for computer-language compilers. Logic synthesis forces designers to use a subset of both Verilog and VHDL. This makes using logic synthesis more difficult rather than less difficult. The current state of synthesis software is rather like learning a foreign language, and then having to talk to a five-year-old. When talking to a logic-synthesis tool using an HDL, it is necessary to think like hardware, anticipating the netlist that logic synthesis will produce. This situation should improve in the next five years, as logic synthesizers mature.

Designers use graphic or text design entry to create an HDL behavioral model, which does not contain any references to logic cells. State diagrams, graphical datapath descriptions, truth tables, RAM/ROM templates, and gate-level schematics may be used together with an HDL description. Once a behavioral HDL model is complete, two items are required to proceed: a logic synthesizer (software and documentation) and a cell library (the logic cells: NAND gates and such) that is called the target library. Most synthesis software companies produce only software.
Most ASIC vendors produce only cell libraries. The behavioral model is simulated to check that the design meets the specifications and then the logic synthesizer is used to generate a netlist, a structural model, which contains only references to logic cells. There is no standard format for the netlists that logic synthesis produces, but EDIF is widely used. Some logic-synthesis tools can also create structural HDL (Verilog, VHDL, or both). Following logic synthesis the design is simulated again, and the results are compared with the earlier behavioral simulation. Layout for any type of ASIC may be generated from the structural model produced by logic synthesis.

12.1 A Logic-Synthesis Example
12.2 A Comparator/MUX
12.3 Inside a Logic Synthesizer
12.4 Synthesis of the Viterbi Decoder
12.5 Verilog and Logic Synthesis
12.6 VHDL and Logic Synthesis
12.7 Finite-State Machine Synthesis
12.8 Memory Synthesis
12.9 The Multiplier
12.10 The Engine Controller
12.11 Performance-Driven Synthesis
12.12 Optimization of the Viterbi Decoder
12.13 Summary
12.14 Problems
12.15 Bibliography
12.16 References
12.1 A Logic-Synthesis Example

As an example of logic synthesis, we will compare two implementations of the Viterbi decoder described in Chapter 11. Both versions used logic cells from a VLSI Technology cell library. The first ASIC was designed by hand using schematic entry and a data book. The second version of the ASIC (the one that was fabricated) used Verilog for design entry and a logic synthesizer. Table 12.1 compares the two versions. The synthesized ASIC is 16 percent smaller and 13 percent faster than the hand-designed version.

How does logic synthesis generate smaller and faster circuits? Figure 12.1 shows the schematic for a hand-designed comparator and MUX used in the Viterbi decoder ASIC, called here the comparator/MUX example. The Verilog code and the schematic in Figure 12.1 describe the same function. The comparison, in Table 12.2, of the two design approaches shows that the synthesized version is smaller and faster than the hand design, even though the synthesized design uses more cells.

TABLE 12.1 A comparison of hand design with synthesis (using a 1.0 µm VLSI Technology cell library).

                     Path delay   No. of           No. of        Chip area
                     /ns (1)      standard cells   transistors   /mils² (2)
Hand design          41.6         1,359            16,545        21,877
Synthesized design   36.3         1,493            11,946        18,322
// comp_mux.v
module comp_mux(a, b, outp);
input [2:0] a, b;
output [2:0] outp;
function [2:0] compare;
  input [2:0] ina, inb;
  begin
    if (ina <= inb) compare = ina;
    else compare = inb;
  end
endfunction
assign outp = compare(a, b);
endmodule

FIGURE 12.1 Schematic and HDL design entry.

TABLE 12.2 Comparison of the comparator/MUX designs using a 1.0 µm standard-cell library.

              Delay   No. of           No. of        Area
              /ns     standard cells   transistors   /mils²
Hand design   4.3     12               116           68.68
Synthesized   2.9     15               66            46.43
1. These delays are under nominal operating conditions with no wiring capacitance. This is the only stage at which a comparison could be made because the hand design was not completed. 2. Both figures are initial layout estimates using default power-bus and signal routing widths.
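As a quick cross-check of the "16 percent smaller and 13 percent faster" claim, the following Python sketch recomputes the relative changes from the Table 12.1 figures. The dictionary keys and the function name percent_change are names chosen here for illustration only.

```python
# Figures copied from Table 12.1 (hand design versus synthesized design).
hand = {"delay_ns": 41.6, "cells": 1359, "transistors": 16545, "area_mils2": 21877}
synth = {"delay_ns": 36.3, "cells": 1493, "transistors": 11946, "area_mils2": 18322}


def percent_change(old, new):
    """Signed change from old to new, as a percentage of old."""
    return 100.0 * (new - old) / old


changes = {key: percent_change(hand[key], synth[key]) for key in hand}
for key, pct in changes.items():
    print(f"{key:12s} {pct:+6.1f} %")
```

A negative number means the synthesized design is smaller (or faster) than the hand design; the cell count is the one figure that goes up.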
12.2 A Comparator/MUX
With the Verilog behavioral model of Figure 12.1 as the input, logic-synthesis software generates logic that performs the same function as the Verilog. The software then optimizes the logic to produce a structural model, which references logic cells from the cell library and details their connections.
`timescale 1ns / 10ps
module comp_mux_u (a, b, outp);
input [2:0] a; input [2:0] b;
output [2:0] outp;
supply1 VDD; supply0 VSS;
in01d0 u2 (.I(b[1]), .ZN(u2_ZN));
nd02d0 u3 (.A1(a[1]), .A2(u2_ZN), .ZN(u3_ZN));
in01d0 u4 (.I(a[1]), .ZN(u4_ZN));
nd02d0 u5 (.A1(u4_ZN), .A2(b[1]), .ZN(u5_ZN));
in01d0 u6 (.I(a[0]), .ZN(u6_ZN));
nd02d0 u7 (.A1(u6_ZN), .A2(u3_ZN), .ZN(u7_ZN));
nd02d0 u8 (.A1(b[0]), .A2(u3_ZN), .ZN(u8_ZN));
nd03d0 u9 (.A1(u5_ZN), .A2(u7_ZN), .A3(u8_ZN), .ZN(u9_ZN));
in01d0 u10 (.I(a[2]), .ZN(u10_ZN));
nd02d0 u11 (.A1(u10_ZN), .A2(u9_ZN), .ZN(u11_ZN));
nd02d0 u12 (.A1(b[2]), .A2(u9_ZN), .ZN(u12_ZN));
nd02d0 u13 (.A1(u10_ZN), .A2(b[2]), .ZN(u13_ZN));
nd03d0 u14 (.A1(u11_ZN), .A2(u12_ZN), .A3(u13_ZN), .ZN(u14_ZN));
nd02d0 u15 (.A1(a[2]), .A2(u14_ZN), .ZN(u15_ZN));
in01d0 u16 (.I(u14_ZN), .ZN(u16_ZN));
nd02d0 u17 (.A1(b[2]), .A2(u16_ZN), .ZN(u17_ZN));
nd02d0 u18 (.A1(u15_ZN), .A2(u17_ZN), .ZN(outp[2]));
nd02d0 u19 (.A1(a[1]), .A2(u14_ZN), .ZN(u19_ZN));
nd02d0 u20 (.A1(b[1]), .A2(u16_ZN), .ZN(u20_ZN));
nd02d0 u21 (.A1(u19_ZN), .A2(u20_ZN), .ZN(outp[1]));
nd02d0 u22 (.A1(a[0]), .A2(u14_ZN), .ZN(u22_ZN));
nd02d0 u23 (.A1(b[0]), .A2(u16_ZN), .ZN(u23_ZN));
nd02d0 u24 (.A1(u22_ZN), .A2(u23_ZN), .ZN(outp[0]));
endmodule
FIGURE 12.2 The comparator/MUX after logic synthesis, but before logic optimization. This figure shows the structural netlist, comp_mux_u.v , and its derived schematic.
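One way to convince ourselves that the unoptimized netlist of Figure 12.2 performs the same function as the Verilog of Figure 12.1 is to evaluate it for all 64 input combinations. The Python sketch below transcribes the NAND/inverter netlist by hand; it assumes that in01d0 is a plain inverter and that nd02d0 and nd03d0 are two- and three-input NAND cells, as the text describes, and the helper names (inv, nand, bits) are invented for this illustration.

```python
def inv(x):
    """in01d0: inverter (assumed behavior)."""
    return 1 - x


def nand(*inputs):
    """nd02d0 / nd03d0: NAND of two or three inputs (assumed behavior)."""
    return 0 if all(inputs) else 1


def bits(value):
    """Split a 3-bit integer into [bit0, bit1, bit2]."""
    return [(value >> i) & 1 for i in range(3)]


def comp_mux_u(a_val, b_val):
    """Evaluate the structural netlist comp_mux_u instance by instance."""
    a, b = bits(a_val), bits(b_val)
    u2 = inv(b[1])
    u3 = nand(a[1], u2)
    u4 = inv(a[1])
    u5 = nand(u4, b[1])
    u6 = inv(a[0])
    u7 = nand(u6, u3)
    u8 = nand(b[0], u3)
    u9 = nand(u5, u7, u8)
    u10 = inv(a[2])
    u11 = nand(u10, u9)
    u12 = nand(b[2], u9)
    u13 = nand(u10, b[2])
    u14 = nand(u11, u12, u13)   # 1 when a <= b: steer a to the output
    u15 = nand(a[2], u14)
    u16 = inv(u14)              # 1 when a > b: steer b to the output
    u17 = nand(b[2], u16)
    outp2 = nand(u15, u17)
    u19 = nand(a[1], u14)
    u20 = nand(b[1], u16)
    outp1 = nand(u19, u20)
    u22 = nand(a[0], u14)
    u23 = nand(b[0], u16)
    outp0 = nand(u22, u23)
    return (outp2 << 2) | (outp1 << 1) | outp0


# The behavioral model returns the smaller of a and b.
mismatches = [(a, b) for a in range(8) for b in range(8)
              if comp_mux_u(a, b) != min(a, b)]
print("mismatches:", mismatches)
```

This is the same kind of comparison that the post-synthesis simulation performs, reduced to a few lines because the design is so small.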
`timescale 1ns / 10ps
module comp_mux_o (a, b, outp);
input [2:0] a; input [2:0] b;
output [2:0] outp;
supply1 VDD; supply0 VSS;
in01d0 B1_i1 (.I(a[2]), .ZN(B1_i1_ZN));
in01d0 B1_i2 (.I(b[1]), .ZN(B1_i2_ZN));
oa01d1 B1_i3 (.A1(a[0]), .A2(B1_i4_ZN), .B1(B1_i2_ZN), .B2(a[1]), .ZN(B1_i3_ZN));
fn05d1 B1_i4 (.A1(a[1]), .B1(b[1]), .ZN(B1_i4_ZN));
fn02d1 B1_i5 (.A(B1_i3_ZN), .B(B1_i1_ZN), .C(b[2]), .ZN(B1_i5_ZN));
mx21d1 B1_i6 (.I0(a[0]), .I1(b[0]), .S(B1_i5_ZN), .Z(outp[0]));
mx21d1 B1_i7 (.I0(a[1]), .I1(b[1]), .S(B1_i5_ZN), .Z(outp[1]));
mx21d1 B1_i8 (.I0(a[2]), .I1(b[2]), .S(B1_i5_ZN), .Z(outp[2]));
endmodule
FIGURE 12.3 The comparator/MUX after logic synthesis and logic optimization with the default settings. This figure shows the structural netlist, comp_mux_o.v, and its derived schematic.

Before running a logic synthesizer, it is necessary to set up paths and startup files (synopsys_dc.setup, compass.boo, view.ini, or similar). These files set the target library and directory locations. Normally it is easier to run logic synthesis in text mode using a script. A script is a text file that directs a software tool to execute a series of synthesis commands (we call this a synthesis run). Figure 12.2 shows a structural netlist, comp_mux_u.v, and the derived schematic after logic synthesis, but before any logic optimization. A derived schematic is created by software from a structural netlist (as opposed to a schematic drawn by hand). Figure 12.3 shows the structural netlist, comp_mux_o.v, and the derived schematic after logic optimization is performed (with the default settings). Figures 12.2 and 12.3 show the results of the two separate steps: logic synthesis and logic optimization. Confusingly, the whole process, which includes synthesis and optimization (and other steps as well), is referred to as logic synthesis. We also refer to the software that performs all of these steps (even if the software consists of more than one program) as a logic synthesizer.

Logic synthesis parses (in a process sometimes called analysis) and translates (sometimes called elaboration) the input HDL to a data structure. This data structure is then converted to a network of generic logic cells. For example, the network in Figure 12.2 uses NAND gates (each with three or fewer inputs in this case) and inverters. This network of generic logic cells is technology-independent since cell libraries in any technology normally contain NAND gates and inverters. The next step, logic optimization, attempts to improve this technology-independent network under the control of the designer.
The output of the optimization step is an optimized, but still technology-independent, network. Finally, in the logic-mapping step, the synthesizer maps the optimized logic to a specified technology-dependent target cell library. Figure 12.3 shows the results of using a standard-cell library as the target.

Text reports such as the one shown in Table 12.3 may be the only output that the designer sees from the logic-synthesis tool. Often, synthesized ASIC netlists and the derived schematics containing thousands of logic cells are far too large to follow. To make things even more difficult, the net names and instance names in synthesized netlists are automatically generated. This makes it hard to see which lines of code in the HDL generated which logic cells in the synthesized netlist or derived schematic.

TABLE 12.3 Reports from the logic synthesizer for the Verilog version of the comparator/MUX. Each command is followed by the synthesizer output (1).

> synthesize

           Num    Gate Count  Tot Gate  Width     Total
Cell Name  Insts  Per Cell    Count     Per Cell  Width
---------  -----  ----------  --------  --------  --------
in01d0     5      .8          3.8       7.2       36.0
nd02d0     16     1.0         16.0      9.6       153.6
nd03d0     2      1.3         2.5       12.0      24.0
---------  -----  ----------  --------  --------  --------
Totals:    23                 22.2                213.6

> optimize

           Num    Gate Count  Tot Gate  Width     Total
Cell Name  Insts  Per Cell    Count     Per Cell  Width
---------  -----  ----------  --------  --------  --------
fn02d1     1      1.8         1.8       16.8      16.8
fn05d1     1      1.3         1.3       12.0      12.0
in01d0     2      .8          1.5       7.2       14.4
mx21d1     3      2.2         6.8       21.6      64.8
oa01d1     1      1.5         1.5       14.4      14.4
---------  -----  ----------  --------  --------  --------
Totals:    8                  12.8                122.4

> report timing

instance  inPin --> outPin  incr  arrival  trs  rampDel  cap   cell
name                        (ns)  (ns)          (ns)     (pf)
--------  ----------------  ----  -------  ---  -------  ----  ---------
a[1]                        .00   .00      R    .00      .04   comp_m...
B1_i4     A1 --> ZN         .33   .33      R    .17      .03   fn05d1
B1_i3     A2 --> ZN         .39   .72      F    .33      .06   oa01d1
B1_i5     A --> ZN          1.03  1.75     R    .67      .11   fn02d1
B1_i6     S --> Z           .68   2.43     R    .09      .02   mx21d1
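The totals in the cell reports of Table 12.3 are simple bookkeeping: instance counts, and instance counts multiplied by cell widths. The Python sketch below recomputes them from the per-cell rows; it also multiplies the total width by the 21.6 µm row height quoted in the table footnote to give a rough cell-area figure (cells only, ignoring routing, which is an assumption of this sketch). The names totals, after_synthesize, and after_optimize are chosen here for illustration.

```python
ROW_HEIGHT_UM = 21.6  # standard-cell height quoted in the Table 12.3 footnote

# (cell name, number of instances, width per cell in micrometers)
after_synthesize = [
    ("in01d0", 5, 7.2),
    ("nd02d0", 16, 9.6),
    ("nd03d0", 2, 12.0),
]
after_optimize = [
    ("fn02d1", 1, 16.8),
    ("fn05d1", 1, 12.0),
    ("in01d0", 2, 7.2),
    ("mx21d1", 3, 21.6),
    ("oa01d1", 1, 14.4),
]


def totals(report):
    """Return (instances, total width in um, cell-only area in um^2)."""
    insts = sum(n for _, n, _ in report)
    width = sum(n * w for _, n, w in report)
    return insts, round(width, 1), round(width * ROW_HEIGHT_UM, 1)


for label, report in (("synthesize", after_synthesize), ("optimize", after_optimize)):
    insts, width, area = totals(report)
    print(f"{label:10s} {insts:3d} cells, width {width} um, cell area {area} um^2")
```

Because every cell in the row has the same height, comparing total widths is equivalent to comparing cell areas.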
In the comparator/MUX example the derived schematics are simple enough that, with hindsight, it is clear that the XOR logic cell used in the hand design is logically inefficient. Using XOR logic cells does, however, result in the simple schematic of Figure 12.1. The synthesized version of the comparator/MUX in Figure 12.3 uses complex combinational logic cells that are logically efficient, but the schematic is not as easy to read. Of course, the computer does not care about this, and neither do we, since we usually never see the schematic.

Which version is best: the hand-designed or the synthesized version? Table 12.3 shows statistics generated by the logic synthesizer for the comparator/MUX. To calculate the performance of each circuit that it evaluates during synthesis, there is a timing-analysis tool (also known as a timing engine) built into the logic synthesizer. The timing-analysis tool reports that the critical path in the optimized comparator/MUX is 2.43 ns. This critical path is highlighted on the derived schematic of Figure 12.3 and consists of the following delays:
- 0.33 ns due to cell fn05d1, instance name B1_i4, a two-input NOR cell with an inverted input. We might call this a NOR1-1 or (A + B')' logic cell.
- 0.39 ns due to cell oa01d1, instance name B1_i3, an OAI22 logic cell.
- 1.03 ns due to logic cell fn02d1, instance name B1_i5, a three-input majority function, MAJ3 (A, B, C).
- 0.68 ns due to logic cell mx21d1, instance name B1_i6, a 2:1 MUX.
(In this cell library the 'd1' suffix indicates normal drive strength.)

TABLE 12.4 Logic cell comparisons between the two comparator/MUX designs. (Gate eq. = gate equivalents; hand = hand design; synth. = synthesized design; all widths are in µm.)

                      Library     tPLH  tPHL  Gate eq.  Cells   Cells   Gate eq.  Gate eq.  Width    Width   Width
                      cell        /ns   /ns   in cell   used,   used,   used,     used,     of cell  used,   used,
Cell type             name (2)    (3)   (3)   (4)       hand    synth.  hand      synth.    /µm (5)  hand    synth.
--------------------  ----------  ----  ----  --------  ------  ------  --------  --------  -------  ------  ------
Inverter              in01d0      0.37  0.36  0.8       2       2       1.6       1.6       7.2      14.4    14.4
2-input XOR           xo02d1      0.93  0.62  1.8       3               5.3                 16.8     50.4
2-input AND           an02d1      0.34  0.46  1.3       1               1.3                 12.0     12.0
3-input AND           an03d1      0.38  0.52  1.5       1               1.5                 14.4     14.4
4-input AND           an04d1      0.41  0.98  1.8       1               1.8                 16.8     16.8
3-input OR            or03d1      0.60  0.44  1.8       1               1.8                 16.8     16.8
2-input MUX           mx21d1      0.69  0.68  2.2       3       3       6.6       6.6       21.6     64.8    64.8
OAI22                 oa01d1      0.51  0.42  1.5               1                 1.5       14.4             14.4
MAJ3                  fn02d1      0.84  0.81  1.8               1                 1.8       16.8             16.8
NOR1-1 = (A' + B)'    fn05d1 (6)  0.42  0.46  1.3               1                 1.3       12.0             12.0
Totals                                                  12      8       19.8      12.8               189.6   122.4
Table 12.4 lists the name, type, the number of transistors, the area, and the delay of each logic cell used in the hand-designed and synthesized comparator/MUX. We could have performed this analysis by hand using the cell-library data book and a calculator or spreadsheet, but it would have been tedious work, especially calculating the delays. The computer is excellent at this type of bookkeeping. We can think of the timing engine of a logic synthesizer as a logic calculator. We see from Table 12.4 that the sum of the widths of all the cells used in the synthesized design (122.4 µm) is less than for the hand design (189.6 µm). All the standard cells in a library are the same height, 72 λ or 21.6 µm in this case. Thus the synthesized design is smaller. We could estimate the critical path of the hand design using the information from the cell-library data book (summarized in Table 12.4). Instead we will use the timing engine in the logic synthesizer as a logic calculator to extract the critical path for the hand-designed comparator/MUX.
Table 12.5 shows a timing analysis obtained by loading the hand-designed schematic netlist into the logic synthesizer. Table 12.5 shows that the hand-designed (critical path 2.42 ns) and synthesized versions (critical path 2.43 ns) of the comparator/MUX are approximately the same speed. Remember, though, that we used the default settings during logic optimization. Section 12.11 shows that the logic synthesizer can do much better.

TABLE 12.5 Timing report for the hand-designed version of the comparator/MUX using the logic synthesizer to calculate the critical path (compare with Table 12.3). The command is followed by the synthesizer output (7).

> report timing

instance  inPin --> outPin  incr  arrival  trs  rampDel  cap   cell
name                        (ns)  (ns)          (ns)     (pf)
--------  ----------------  ----  -------  ---  -------  ----  --------
a[1]                        .00   .00      F    .00      .04   comp_mux
B1_i4     A1 --> ZN         .61   .61      F    .14      .03   xo02d1
B1_i3     A2 --> ZN         .85   1.46     F    .19      .05   an04d1
B1_i5     A --> ZN          .42   1.88     F    .23      .09   or03d1
B1_i6     S --> Z           .54   2.42     R    .09      .02   mx21d1
outp[0]                     .00   2.42     R    .00      .00   comp_mux
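The arrival column of a timing report is just the running sum of the incremental delays along the path. The short Python sketch below reproduces the arrival times of both critical paths from the incr values reported in Tables 12.3 and 12.5; the list and function names are chosen here for illustration and are not part of any synthesis tool.

```python
# (instance, cell, incremental delay in ns) along each reported critical path.
synthesized_path = [
    ("B1_i4", "fn05d1", 0.33),
    ("B1_i3", "oa01d1", 0.39),
    ("B1_i5", "fn02d1", 1.03),
    ("B1_i6", "mx21d1", 0.68),
]
hand_path = [
    ("B1_i4", "xo02d1", 0.61),
    ("B1_i3", "an04d1", 0.85),
    ("B1_i5", "or03d1", 0.42),
    ("B1_i6", "mx21d1", 0.54),
]


def arrival_times(path):
    """Accumulate incremental delays into arrival times (ns)."""
    arrivals, t = [], 0.0
    for _, _, incr in path:
        t += incr
        arrivals.append(round(t, 2))
    return arrivals


synth_arrivals = arrival_times(synthesized_path)
hand_arrivals = arrival_times(hand_path)
print("synthesized:", synth_arrivals)
print("hand design:", hand_arrivals)
```

The last entry of each list is the critical-path delay quoted in the text for that design.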
12.2.1 An Actel Version of the Comparator/MUX

Figure 12.4 shows the results of targeting the comparator/MUX design to the Actel ACT 2/3 FPGA architecture. (The EDIF converter prefixes all internal nodes in this netlist with 'block_0_DEF_NET_' . This prefix was replaced with 'n_' in the Verilog file, comp_mux_actel_o_adl_e.v , derived from the .adl netlist.) As can be seen by comparing the netlists and schematics in Figures 12.3 and 12.4 , the results are very different between a standard-cell library and the Actel library. Each of the symbols in the schematic in Figure 12.4 represents the eight-input ACT 2/3 C-Module (see Figure 5.4 a). The logic synthesizer, during the technology-mapping step, has decided which connections should be made to the inputs to the combinational logic macro, CM8 . The CM8 names and the ACT2/3 C-Module names (in parentheses) correspond as follows: S00(A0) , S01(B0) , S10(A1) , S11(A2) , D0(D00) , D1(D01) , D2(D10) , D3(D11) , and Y(Y) .
`timescale 1 ns/100 ps
module comp_mux_actel_o (a, b, outp);
input [2:0] a, b; output [2:0] outp;
wire n_13, n_17, n_19, n_21, n_23, n_27, n_29, n_31, n_62;
CM8 I_5_CM8(.D0(n_31), .D1(n_62), .D2(a[0]), .D3(n_62),
  .S00(n_62), .S01(n_13), .S10(n_23), .S11(n_21), .Y(outp[0]));
CM8 I_2_CM8(.D0(n_31), .D1(n_19), .D2(n_62), .D3(n_62),
  .S00(n_62), .S01(b[1]), .S10(n_31), .S11(n_17), .Y(outp[1]));
CM8 I_1_CM8(.D0(n_31), .D1(n_31), .D2(b[2]), .D3(n_31),
  .S00(n_62), .S01(n_31), .S10(n_31), .S11(a[2]), .Y(outp[2]));
VCC VCC_I(.Y(n_62));
CM8 I_4_CM8(.D0(a[2]), .D1(n_31), .D2(n_62), .D3(n_62),
  .S00(n_62), .S01(b[2]), .S10(n_31), .S11(a[1]), .Y(n_19));
CM8 I_7_CM8(.D0(b[1]), .D1(b[2]), .D2(n_31), .D3(n_31),
  .S00(a[2]), .S01(b[1]), .S10(n_31), .S11(a[1]), .Y(n_23));
CM8 I_9_CM8(.D0(n_31), .D1(n_31), .D2(a[1]), .D3(n_31),
  .S00(n_62), .S01(b[1]), .S10(n_31), .S11(b[0]), .Y(n_27));
CM8 I_8_CM8(.D0(n_29), .D1(n_62), .D2(n_31), .D3(a[2]),
  .S00(n_62), .S01(n_27), .S10(n_31), .S11(b[2]), .Y(n_13));
CM8 I_3_CM8(.D0(n_31), .D1(n_31), .D2(a[1]), .D3(n_31),
  .S00(n_62), .S01(a[2]), .S10(n_31), .S11(b[2]), .Y(n_17));
CM8 I_6_CM8(.D0(b[2]), .D1(n_31), .D2(n_62), .D3(n_62),
  .S00(n_62), .S01(a[2]), .S10(n_31), .S11(b[0]), .Y(n_21));
CM8 I_10_CM8(.D0(n_31), .D1(n_31), .D2(b[0]), .D3(n_31),
  .S00(n_62), .S01(n_31), .S10(n_31), .S11(a[2]), .Y(n_29));
GND GND_I(.Y(n_31));
endmodule

FIGURE 12.4 The Actel version of the comparator/MUX after logic optimization. This figure shows the structural netlist, comp_mux_actel_o_adl_e.v, and its derived schematic.
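The Actel netlist is harder to read than the standard-cell version, but it can be checked in the same exhaustive way. The Python sketch below assumes a behavioral model for the CM8 macro that is not given in this section: a 4:1 MUX whose low select bit is S00 AND S01, whose high select bit is S10 OR S11, and whose data inputs D0 to D3 are chosen by those two bits. Under that assumption, the netlist of Figure 12.4 can be evaluated for every pair of 3-bit inputs; the function names cm8 and comp_mux_actel are invented for this illustration.

```python
def cm8(d0, d1, d2, d3, s00, s01, s10, s11):
    """Assumed CM8 model: 4:1 MUX with AND-ed and OR-ed select inputs."""
    s0 = s00 & s01          # low select bit
    s1 = s10 | s11          # high select bit
    return (d0, d1, d2, d3)[2 * s1 + s0]


def comp_mux_actel(a_val, b_val):
    """Evaluate the netlist of Figure 12.4 (n_62 = VCC, n_31 = GND)."""
    a0, a1, a2 = [(a_val >> i) & 1 for i in range(3)]
    b0, b1, b2 = [(b_val >> i) & 1 for i in range(3)]
    vcc, gnd = 1, 0
    n_29 = cm8(gnd, gnd, b0, gnd, vcc, gnd, gnd, a2)     # I_10_CM8
    n_21 = cm8(b2, gnd, vcc, vcc, vcc, a2, gnd, b0)      # I_6_CM8
    n_17 = cm8(gnd, gnd, a1, gnd, vcc, a2, gnd, b2)      # I_3_CM8
    n_27 = cm8(gnd, gnd, a1, gnd, vcc, b1, gnd, b0)      # I_9_CM8
    n_23 = cm8(b1, b2, gnd, gnd, a2, b1, gnd, a1)        # I_7_CM8
    n_19 = cm8(a2, gnd, vcc, vcc, vcc, b2, gnd, a1)      # I_4_CM8
    n_13 = cm8(n_29, vcc, gnd, a2, vcc, n_27, gnd, b2)   # I_8_CM8
    outp2 = cm8(gnd, gnd, b2, gnd, vcc, gnd, gnd, a2)    # I_1_CM8
    outp1 = cm8(gnd, n_19, vcc, vcc, vcc, b1, gnd, n_17) # I_2_CM8
    outp0 = cm8(gnd, vcc, a0, vcc, vcc, n_13, n_23, n_21)  # I_5_CM8
    return (outp2 << 2) | (outp1 << 1) | outp0


actel_mismatches = [(a, b) for a in range(8) for b in range(8)
                    if comp_mux_actel(a, b) != min(a, b)]
print("mismatches:", actel_mismatches)
```

Notice that there is no single select signal in this version: the mapper has folded the comparison into each output bit separately, which is why the schematic looks so different from Figure 12.3.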
1. Cell Name = cell name from the ASIC library (Compass Passport, 0.6 µm high-density, 5 V standard-cell library, cb60hd230); Num Insts = number of cell instances; Gate Count Per Cell = equivalent gates with two-input NAND = 1 gate (with number of transistors ≈ equivalent gates × 4); Width Per Cell = width in µm (cell height in this library is 72 λ or 21.6 µm); incr = incremental delay time due to logic cell delay; trs = transition; R = rising; F = falling; rampDel = ramp delay; cap = capacitance at node or cell output pin.
2. 0.6 µm, 5 V, high-density Compass standard-cell library, cb60hd230.
3. Average over all inputs with load capacitance equal to two standard loads (one standard load = 0.016 pF).
4. 2-input NAND = 1 gate equivalent.
5. Cell height is 72 λ (21.6 µm).
6. Rise and fall delays are different for the two inputs, A and B, of this cell: tPLHA = 0.48 ns; tPLHB = 0.36 ns; tPHLA = 0.59 ns; tPHLB = 0.33 ns.
7. See footnote 1 in Table 12.3 for explanations of the abbreviations used in this table.
12.3 Inside a Logic Synthesizer

The logic synthesizer parses the Verilog of Figure 12.1 and builds an internal data structure (usually a graph represented by linked lists). Such an abstract representation is not easy to visualize, so we shall use pictures instead. The first Karnaugh map in Figure 12.5(a) is a picture that represents the sel signal (labeled as the input to the three MUXes in the schematic of Figure 12.1) for the case when the inputs are such that a[2]b[2] = 00. The signal sel is responsible for steering the smaller input, a or b, to the output of the comparator/MUX. We insert a '1' in the Karnaugh map (which will select the input b to be the output) whenever b is smaller than a. When a = b we do not care whether we select a or b (since a and b are equal), so we insert an 'x', a don't-care logic value, in the Karnaugh map of Figure 12.5(a). There are four Karnaugh maps for the signal sel, one each for the values a[2]b[2] = 00, a[2]b[2] = 01, a[2]b[2] = 10, and a[2]b[2] = 11.
FIGURE 12.5 Logic maps for the comparator/MUX. (a) If the input b is less than a, then sel is '1'. If a = b, then sel = 'x' (don't care). (b) A cover for sel.

Next, logic minimization tries to find a minimum cover for the Karnaugh maps: the smallest number of the largest possible circles to cover all the '1's. One possible cover is shown in Figure 12.5(b). In order to understand the steps that follow we shall use some notation from the Berkeley Logic Interchange Format (BLIF) and from the Berkeley tools misII and sis. We shall use the logic operators (in decreasing order of their precedence):
'!' (negation), '*' (AND), '+' (OR). We shall also abbreviate Verilog signal names, writing a[2] as a2, for example. We can write equations for sel and the output signals of the comparator/MUX in the format that is produced by sis, as follows (this is the same format as the input file for the Berkeley tool eqntott):

sel = a1*!b1*!b2 + a0*!b1*!b2 + a0*a1*!b2 + a1*!b1*a2 + a0*!b1*a2 + a0*a1*a2 + a2*!b2;    [12.1]
outp2 = !sel*a2 + sel*b2;    [12.2]
outp1 = !sel*a1 + sel*b1;    [12.3]
outp0 = !sel*a0 + sel*b0;    [12.4]

Equations 12.1 to 12.4 describe the synthesized network. There are seven product terms in Eq. 12.1, the logic equation for sel (numbered and labeled in the drawing of the cover for sel in Figure 12.5). We shall keep track of the sel signal separately even though this is not exactly the way the logic synthesizer works; the synthesizer looks at all the signals at once.

Logic optimization uses a series of factoring, substitution, and elimination steps to simplify the equations that represent the synthesized network. A simple analogy would be the simplification of arithmetic expressions. Thus, for example, we can simplify 189 / 315 to 0.6 by factoring the top and bottom lines and eliminating common factors as follows: (3 × 7 × 9) / (5 × 7 × 9) = 3 / 5. Boolean algebra is more complicated than ordinary algebra. To make logic optimization tractable, most tools use algorithms based on algebraic factors rather than Boolean factors.

Logic optimization attempts to simplify the equations in the hope that this will also minimize area and maximize speed. In the synthesis results presented in Table 12.3, we accepted the default optimization settings without setting any constraints. Thus only a minimal amount of logic optimization was attempted, and it did not alter the synthesized network in this case.

The technology-decomposition step builds a generic network from the optimized logic network. The generic network is usually simple NAND gates (sis uses either AND, or NOR gates, or both). This generic network is in a technology-independent form. To build this generic network involves creating intermediate nodes. The program sis labels these intermediate nodes [n], starting at n = 100.

sel = [100] * [101] * [102];    [12.5]
[100] = !( !a2 * [103] );
[101] = !( b2 * [103] );
[102] = !( !a2 * b2 );
[103] = !( [104] * [105] * [106] );
[104] = !( !a1 * b1 );
[105] = !( b0 * [107] );
[106] = !( !a0 * [107] );
[107] = !( a1 * !b1 );
outp2 = !( [108] * [109] );    [12.6]
[108] = !( a2 * !sel );
[109] = !( sel * b2 );

There are two other sets of equations, similar to Eq. 12.6, for outp1 and outp0. Notice the polarity of the sel signal in Eq. 12.5 is correct and represents an AND gate (a consequence of labeling sel as the MUX select input in Figure 12.1).

Next, the technology-mapping step (or logic-mapping step) implements the technology-independent network by matching pieces of the network with the logic cells that are available in a technology-dependent cell library (an FPGA or standard-cell library, for example). While performing the logic mapping, the algorithms attempt to minimize area (the default constraint) while meeting any other user constraints (timing or power constraints, for example).

Working backward from the outputs the logic mapper recognizes that each of the three output nodes (outp2, outp1, and outp0) may be mapped to a MUX. (We are using the term node mapping to a logic cell rather loosely here; an exact parallel is a compiler mapping patterns of source code to object code.) Here is the equation that shows the mapping for outp2:

outp2 = MUX(a, b, c) = ac + b!c    [12.7]
a = b2; b = a2; c = sel

The equations for outp1 and outp0 are similar. The node sel can be mapped to the three-input majority function as follows:

sel = MAJ3(w, x, y) = !(wx + wy + xy)    [12.8]
w = !a2; x = b2; y = [103];

Next node [103] is mapped to an OAI22 cell,

[103] = OAI22(w, x, y, z) = !((w + x)(y + z)) = (!w!x + !y!z)    [12.9]
w = a0; x = [107]; y = !b1; z = a1;

Finally, node [107] is mapped to a two-input NOR with one inverted input,

[107] = !(b1 + !a1);    [12.10]

Putting Equations 12.7 to 12.10 together describes the following optimized logic network (corresponding to the structural netlist and schematic shown in Figure 12.3):

sel = !((( !a0 * !(a1*!b1) | (b1*!a1) ) * (!a2|b2) ) | (!a2*b2));    [12.11]
outp2 = !sel * a2 | sel * b2;
outp1 = !sel * a1 | sel * b1;
outp0 = !sel * a0 | sel * b0;

The comparator/MUX example illustrates how logic synthesis takes the behavioral model (the HDL input) and, in a series of steps, converts this to a structural model describing the connections of logic cells from a cell library.

When we write a C program we almost never think of the object code that will result. When we write HDL it is always necessary to consider the hardware. In C there is not much difference between i*j and i/j. In an HDL, if i and j are 32-bit numbers, i*j will take up a large amount of silicon. If j is a constant, equal to 2, then i*j will take up hardly any space at all. Most logic synthesizers cannot even produce logic to implement i/j.

In the following sections we shall examine the Verilog and VHDL languages as a way to communicate with a logic synthesizer. Using one of these HDLs we have to tell the logic synthesizer what hardware we want; we imply A. The logic synthesizer then has to figure out what we want; it has to infer B. The problem is making sure that we write the HDL code such that A = B. As will become apparent, the more clearly we imply what we mean, the easier the logic synthesizer can infer what we want.
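Because the comparator/MUX has only six input bits, the claim that the seven-term cover of Eq. 12.1 and the mapped network of Eq. 12.11 describe the same sel signal can be checked exhaustively. The Python sketch below evaluates both equations for all 64 input combinations and confirms that the MUX equations then deliver the smaller of a and b. It is an illustration only; the function names are chosen here and have nothing to do with sis or with any synthesis tool.

```python
from itertools import product


def n(x):
    """Logical negation of a 0/1 value (the '!' operator in the text)."""
    return 1 - x


def sel_eq_12_1(a0, a1, a2, b1, b2):
    """The seven product terms of Eq. 12.1 (note that b0 does not appear)."""
    return ((a1 & n(b1) & n(b2)) | (a0 & n(b1) & n(b2)) | (a0 & a1 & n(b2))
            | (a1 & n(b1) & a2) | (a0 & n(b1) & a2) | (a0 & a1 & a2)
            | (a2 & n(b2)))


def sel_eq_12_11(a0, a1, a2, b1, b2):
    """The optimized, mapped network of Eq. 12.11."""
    inner = (n(a0) & n(a1 & n(b1))) | (b1 & n(a1))
    return n((inner & (n(a2) | b2)) | (n(a2) & b2))


def mux(sel, a_bit, b_bit):
    """Eqs. 12.2 to 12.4: sel = 1 steers b to the output."""
    return (n(sel) & a_bit) | (sel & b_bit)


def check_all_inputs():
    for a0, a1, a2, b0, b1, b2 in product((0, 1), repeat=6):
        s_cover = sel_eq_12_1(a0, a1, a2, b1, b2)
        s_mapped = sel_eq_12_11(a0, a1, a2, b1, b2)
        if s_cover != s_mapped:
            return False
        a = a0 | (a1 << 1) | (a2 << 2)
        b = b0 | (b1 << 1) | (b2 << 2)
        out = (mux(s_cover, a0, b0) | (mux(s_cover, a1, b1) << 1)
               | (mux(s_cover, a2, b2) << 2))
        if out != min(a, b):
            return False
    return True


equivalent = check_all_inputs()
print("Eq. 12.1 and Eq. 12.11 agree and select min(a, b):", equivalent)
```

The absence of b0 from both forms of sel is a direct use of the don't-care entries in Figure 12.5: whenever changing b0 would change the answer to "is b smaller than a?", the two inputs differ only in whether they are equal, and either choice is acceptable.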
12.4 Synthesis of the Viterbi Decoder

In this section we return to the Viterbi decoder from Chapter 11. After an initial synthesis run that shows how logic synthesis works with a real example, we step back and study some of the issues and problems of using HDLs for logic synthesis.
12.4.1 ASIC I/O

Some logic synthesizers can include I/O cells automatically, but the designer may have to use directives to designate special pads (clock buffers, for example). It may also be necessary to use commands to set I/O cell features such as selection of pull-up resistor, slew rate, and so on. Unfortunately there are no standards in this area. Worse, there is currently no accepted way to set these parameters from an HDL. Designers may also use either generic technology-independent I/O models or instantiate I/O cells directly from an I/O cell library. Thus, for example, in the Compass tools the statement

asPadIn #(3,"1,2,3") u0 (in0, padin0);

uses a generic I/O cell model, asPadIn. This statement will generate three input pads (with pin numbers "1", "2", and "3") if in0 is a 3-bit-wide bus.

The next example illustrates the use of generic I/O cells from a standard-component library. These components are technology independent (so they may equally well be used with a 0.6 µm or 0.35 µm technology).

module allPads(padTri, padOut, clkOut, padBidir, padIn, padClk);
output padTri, padOut, clkOut; inout padBidir;
input [3:0] padIn; input padClk; wire [3:0] in;
//compass dontTouch u*
// asPadIn #(W, N, L, P) I (toCore, Pad) also asPadInInv
// asPadOut #(W, N, L, P) I (Pad, frCore)
// asPadTri #(W, N, S, L, P) I (Pad, frCore, OEN)
// asPadBidir #(W, N, S, L, P) I (Pad, toCore, frCore, OEN)
// asPadClk #(N, S, L) I (Clk, Pad) also asPadClkInv
// asPadVxx #(N, subnet) I (Vxx)
// W = width, integer (default=1)
// N = pin number string, e.g. "1:3,5:8"
// S = strength = {2, 4, 8, 16} in mA drive
// L = level = {cmos, ttl, schmitt} (default = cmos)
// P = pull-up resistor = {down, float, none, up}
// Vxx = {Vss, Vdd}
// subnet = connect supply to {pad, core, both}
asPadIn #(4,"1:4","","none") u1 (in, padIn);
asPadOut #(1,"5",13) u2 (padOut, d);
asPadTri #(1,"6",11) u3 (padTri, in[1], in[0]);
asPadBidir #(1,"7",2,"","") u4 (d, padBidir, in[3], in[2]);
asPadClk #(8) u5 (clk, padClk);
asPadOut #(1, "9") u6 (clkOut, clk);
asPadVdd #("10:11","pads") u7 (vddr);
asPadVss #("12,13","pads") u8 (vssr);
asPadVdd #("14","core") u9 (vddc);
asPadVss #("15","core") u10 (vssc);
asPadVdd #("16","both") u11 (vddb);
asPadVss #("17","both") u12 (vssb);
endmodule
The following code is an example of the contents of a generic model for a three-state I/O cell (provided in a standard-component library or in an I/O cell library):

module PadTri (Pad, I, Oen); // active-low output enable
parameter width = 1, pinNumbers = "", \strength = 1,
  level = "CMOS", externalVdd = 5;
output [width-1:0] Pad; input [width-1:0] I; input Oen;
assign #1 Pad = (Oen ? {width{1'bz}} : I);
endmodule

The module PadTri can be used for simulation and as the basis for synthesizing an I/O cell. However, the synthesizer also has to be told to synthesize an I/O cell connected to a bonding pad and the outside world and not just an internal three-state buffer. There is currently no standard mechanism for doing this, and every tool and every ASIC company handles it differently.

The following model is a generic model for a bidirectional pad. We could use this model as a basis for input-only and output-only I/O cell models.

module PadBidir (C, Pad, I, Oen); // active-low output enable
parameter width = 1, pinNumbers = "", \strength = 1,
  level = "CMOS", pull = "none", externalVdd = 5;
output [width-1:0] C; inout [width-1:0] Pad;
input [width-1:0] I; input Oen;
assign #1 Pad = Oen ? {width{1'bz}} : I;
assign #1 C = Pad;
endmodule

In Chapter 8 we used the halfgate example to demonstrate an FPGA design flow, including I/O. If the synthesis tool is not capable of synthesizing I/O cells, then we may have to instantiate them by hand; the following code is a hand-instantiated version of lines 19 to 22 in module allPads:
pc5o05 u2_2 (.PAD(padOut), .I(d));
pc5t04r u3_2 (.PAD(padTri), .I(in[1]), .OEN(in[0]));
pc5b01r u4_3 (.PAD(padBidir), .I(in[3]), .CIN(d), .OEN(in[2]));
pc5d01r u5_in_1 (.PAD(padClk), .CIN(u5toClkBuf[0]));

The designer must find the names of the I/O cells (pc5o05 and so on), and the names, positions, meanings, and defaults for the parameters from the cell-library documentation.

I/O cell models allow us to simulate the behavior of the synthesized logic inside an ASIC all the way to the pads. To simulate outside the pads at a system level, we should use these same I/O cell models. This is important in ASIC design. For example, the designers forgot to put pull-up resistors on the outputs of some of the SparcStation ASICs. This was one of the very few errors in a complex project, but an error that could have been caught if a system-level simulation had included complete I/O cell models for the ASICs.
12.4.2 Flip-Flops
In Chapter 11 we used this D flip-flop model to simulate the Viterbi decoder:

module dff(D,Q,Clock,Reset); // N.B. reset is active-low
output Q; input D,Clock,Reset;
parameter CARDINALITY = 1; reg [CARDINALITY-1:0] Q;
wire [CARDINALITY-1:0] D;
always @( posedge Clock) if (Reset!==0) #1 Q=D;
always begin wait (Reset==0); Q=0; wait (Reset==1); end
endmodule

Most synthesizers cannot synthesize this model because there are two wait statements in one always statement (line 6). We could change the code to use flip-flops from the synthesizer standard-component library by using the following code:

asDff ff1 (.Q(y), .D(x), .Clk(clk), .Rst(vdd));
Unfortunately we would have to change all the flip-flop models from 'dff' to 'asDff' and the code would become dependent on a particular synthesis tool. Instead, to maintain independence from vendors, we shall use the following D flip-flop model for synthesis and simulation:

module dff(D, Q, Clk, Rst); // new flip-flop for Viterbi decoder
parameter width = 1, reset_value = 0;
input [width - 1 : 0] D; output [width - 1 : 0] Q; reg [width - 1 : 0] Q;
input Clk, Rst;
initial Q <= {width{1'bx}};
always @ (posedge Clk or negedge Rst)
if (Rst == 0) Q <= #1 reset_value; else Q <= #1 D;
endmodule
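A quick way to gain confidence in the new flip-flop model is to exercise its asynchronous reset and its clocked load in simulation. The following is a minimal sketch: the dff module is copied from the text so that the example is self-contained, while the testbench (tb_dff), its signal names, and the chosen width of 3 are hypothetical.

```verilog
// Copy of the synthesizable dff model from the text.
module dff(D, Q, Clk, Rst);
  parameter width = 1, reset_value = 0;
  input  [width-1:0] D;
  output [width-1:0] Q;
  reg    [width-1:0] Q;
  input  Clk, Rst;
  initial Q <= {width{1'bx}};
  always @(posedge Clk or negedge Rst)
    if (Rst == 0) Q <= #1 reset_value;
    else          Q <= #1 D;
endmodule

// Hypothetical testbench; names are illustrative only.
module tb_dff;
  reg  [2:0] d;
  reg  clk, rst;
  wire [2:0] q;

  dff #(3, 0) u1 (.D(d), .Q(q), .Clk(clk), .Rst(rst));

  always #5 clk = ~clk;            // clock period of 10 time units

  initial begin
    clk = 0; rst = 1; d = 3'b101;
    #2 rst = 0;                    // asynchronous, active-low reset
    #2 if (q !== 3'b000) $display("FAIL: reset did not clear Q");
    rst = 1;
    #10 if (q !== 3'b101) $display("FAIL: Q did not load D on the clock edge");
    $display("final q = %b", q);
    $finish;
  end
endmodule
```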
12.4.3 The Top-Level Model

The following code models the top-level Viterbi decoder and instantiates (with instance name v_1) a copy of the Verilog module viterbi from Chapter 11. The model uses generic input, output, power, and clock I/O cells from the standard-component library supplied with the synthesis software. The synthesizer will take these generic I/O cells and map them to I/O cells from a technology-specific library. We do not need three-state I/O cells or bidirectional I/O cells for the Viterbi ASIC.

/* This is the top-level module, viterbi_ASIC.v */
module viterbi_ASIC (padin0, padin1, padin2, padin3, padin4, padin5, padin6, padin7, padOut, padClk, padRes, padError);
input [2:0] padin0, padin1, padin2, padin3, padin4, padin5, padin6, padin7;
input padRes, padClk; output padError; output [2:0] padOut;
wire Error, Clk, Res; wire [2:0] Out; // core
wire padError, padClk, padRes; wire [2:0] padOut;
wire [2:0] in0,in1,in2,in3,in4,in5,in6,in7; // core
wire [2:0] padin0, padin1,padin2,padin3,padin4,padin5,padin6,padin7;
// Do not let the software mess with the pads.
//compass dontTouch u*
asPadIn #(3,"1,2,3") u0 (in0, padin0);
asPadIn #(3,"4,5,6") u1 (in1, padin1);
asPadIn #(3,"7,8,9") u2 (in2, padin2);
asPadIn #(3,"10,11,12") u3 (in3, padin3);
asPadIn #(3,"13,14,15") u4 (in4, padin4);
asPadIn #(3,"16,17,18") u5 (in5, padin5);
asPadIn #(3,"19,20,21") u6 (in6, padin6);
asPadIn #(3,"22,23,24") u7 (in7, padin7);
asPadVdd #("25","both") u25 (vddb);
asPadVss #("26","both") u26 (vssb);
asPadClk #("27") u27 (Clk, padClk);
asPadOut #(1,"28") u28 (padError, Error);
asPadIn #(1,"29") u29 (Res, padRes);
asPadOut #(3,"30,31,32") u30 (padOut, Out);
// Here is the core module:
viterbi v_1 (in0,in1,in2,in3,in4,in5,in6,in7,Out,Clk,Res,Error);
endmodule

At this point we are ready to begin synthesis. In order to demonstrate how synthesis works, I am cheating here. The code that was presented in Chapter 11 has already been simulated and synthesized (requiring several iterations to produce error-free code). What I am doing is a little like the Galloping Gourmet's television presentation: "And then we put the soufflé in the oven . . . and look at the soufflé that I prepared earlier." The synthesis results for the Viterbi decoder are shown in Table 12.6. Normally the worst thing we can do is prepare a large amount of code, put it in the synthesis oven, close the door, push the "synthesize and optimize" button, and wait. Unfortunately, it is easy to do. In our case it works (at least we may think so at this point) because this is a small ASIC by today's standards, only a few thousand gates. I made the bus widths small and chose this example so that the code was of a reasonable size. Modern ASICs may be over one million gates, hundreds of times more complicated than our Viterbi decoder example.

TABLE 12.6 Initial synthesis results of the Viterbi decoder ASIC.
Command: > optimize
Synthesizer output (see footnotes 1 and 2):

Cell Name   Num Insts   Gate Count Per Cell   Tot Gate Count   Width Per Cell   Total Width
---------   ---------   -------------------   --------------   --------------   -----------
pc5c01          1             315.4                315.4            100.8            100.8
pc5d01r        26             315.4               8200.4            100.8           2620.8
pc5o06          4             315.4               1261.6            100.8            403.2
pv0f            1             315.4                315.4            100.8            100.8
pvdf            1             315.4                315.4            100.8            100.8
viterbi_p       1            1880.0               1880.0          18048.0          18048.0

The derived schematic for the synthesized core logic is shown in Figure 12.6. There are eight boxes in Figure 12.6 that represent the eight modules in the Verilog code. The schematics for each of these eight blocks are too complex to be useful. With practice it is possible to see the synthesized logic from reports such as Table 12.6. First we check the following cells at the top level:
FIGURE 12.6 The core logic of the Viterbi decoder ASIC. Bus names are abbreviated in this figure for clarity. For example the label m_out0-3 denotes the four buses: m_out0, m_out1, m_out2, and m_out3.
- pc5c01 is an I/O cell that drives the clock node into the logic core. ASIC designers also call an I/O cell a pad cell, and often refer to the pad cells (the bonding pads and associated logic) as just the pads. From the library data book we find this is a core-driven, noninverting clock buffer capable of driving 125 pF. This is a large logic cell and does not have a bonding pad, but is placed in a pad site (a slot in the ring of pads around the perimeter of the die) as if it were an I/O cell with a bonding pad.
- pc5d01r is a 5V CMOS input-only I/O cell with a bus repeater. Twenty-four of these I/O cells are used for the 24 inputs (in0 to in7). Two more are used for Res and Clk. The I/O cell for Clk receives the clock signal from the bonding pad and drives the clock buffer cell (pc5c01). The pc5c01 cell then buffers and drives the clock back into the core. The power-hungry clock buffer is placed in the pad ring near the VDD and VSS pads.
- pc5o06 is a CMOS output-only I/O cell with 6X drive strength (6 mA AC drive and 4 mA DC drive). There are four output pads: three pads for the signal outputs, outp[2:0], and one pad for the output signal, error.
- pv0f is a power pad that connects all VSS power buses on the chip.
- pvdf is a power pad that connects all VDD power buses on the chip.
- viterbi_p is the core logic. This cell takes its name from the top-level Verilog module (viterbi). The software has appended a "_p" suffix (the default) to prevent input files being accidentally overwritten.
The software does not tell us any of this directly. We learn what is going on by looking at the names and number of the synthesized cells, reading the synthesis tool documentation, and from experience. We shall learn more about I/O pads and the layout of power supply buses in Chapter 16. Next we examine the cells used in the logic core. Most synthesis tools can produce reports, such as that shown in Table 12.7 , which lists all the synthesized cells. The most important types of cells to check are the sequential elements: flip-flops and latches (I have omitted all but the sequential logic cells in Table 12.7 ). One of the most common mistakes in synthesis is to accidentally leave variables unassigned in all situations in the HDL. Unassigned variables require memory and will generate unnecessary sequential logic. In the Viterbi decoder it is easy to identify the sequential logic cells that should be present in the synthesized logic because we used the module dff explicitly whenever we required a flip-flop. By scanning the code in Chapter 11 and counting the references to the dff model, we can see that the only flip-flops that should be inferred are the following:
- 24 (3 × 8) D flip-flops in instance subset_decode
- 132 (11 × 12) D flip-flops in instance path_memory, which contains 11 instances of path (12 D flip-flops in each instance of path)
- 12 D flip-flops in instance pathin
- 20 (5 × 4) D flip-flops in instance metric
The total is 24 + 132 + 12 + 20 = 188 D flip-flops, which is the same as the number of dfctnb cell instances in Table 12.7.

TABLE 12.7 Number of synthesized flip-flops in the Viterbi ASIC.
Command: > report area -flat
Synthesizer output (see footnote 3):

Cell Name   Num Insts   Gate Count Per Cell   Tot Gate Count   Width Per Cell   Total Width
---------   ---------   -------------------   --------------   --------------   -----------
...
dfctnb        188               5.8               1081.0             55.2          10377.6
...
---------   ---------   -------------------   --------------   --------------   -----------
Totals:      1383                                12716.5                           25485.6

Table 12.6 gives the total width of the standard cells in the logic core after logic optimization as 18,048 μm. Since the standard-cell height for this library is 72 λ (21.6 μm), we can make a first estimate of the total logic cell area as

$$
(18{,}048\ \mu\mathrm{m}) \times (21.6\ \mu\mathrm{m}) \approx 390\ \mathrm{k}(\mu\mathrm{m})^2 \approx 390\ \mathrm{k}(\mu\mathrm{m})^2 \times \frac{\mathrm{mil}^2}{(25.4\ \mu\mathrm{m})^2} \approx 600\ \mathrm{mil}^2 \qquad (12.12)
$$

In the physical layout we shall need additional space for routing. The ratio of routing to logic cell area is called the routing factor. The routing factor depends primarily on whether we use two levels or three levels of metal. With two levels of metal the routing factor is typically between 1 and 2. With three levels of metal, where we may use over-the-cell routing, the routing factor is usually between 0 and 1. We thus expect a logic core area of 600 to 1000 mil² for the Viterbi decoder using this cell library. From Table 12.6 we see the I/O cells in this library are 100.8 μm wide, or approximately 4 mil (the width of a single pad site). From the I/O cell data book we find the I/O cell height is 650 μm (actually 648.825 μm), or approximately 26 mil. Each I/O cell thus occupies 104 mil². Our 33 pad sites will thus require approximately 3400 mil², which is larger than the estimated core logic area. Let us go back and take a closer look at what it usually takes to get to this point. Remember we used an already prepared Verilog model for the Viterbi decoder.

1. See footnote 1 in Table 12.3 for explanations of the abbreviations used in this table.
2. I/O cell height (I/O cells have prefixes pc5 and pv) is approximately 650 μm in this cell library.
3. See footnote 1 in Table 12.3 for explanations of the abbreviations used in this table. Logic cell dfctnb is a D flip-flop with clear in this standard-cell library.
12.5 Verilog and Logic Synthesis

A top-down design approach using Verilog begins with a single module at the top of the hierarchy to model the input and output response of the ASIC:

module MyChip_ASIC();
... (code to model ASIC I/O) ...
endmodule

This top-level Verilog module is used to simulate the ASIC I/O connections and any bus I/O during the earliest stages of design. Often the reason that designs fail is lack of attention to the connection between the ASIC and the rest of the system. As a designer, you proceed down through the hierarchy as you add lower-level modules to the top-level Verilog module. Initially the lower-level modules are just empty placeholders, or stubs, containing a minimum of code. For example, you might start by using inverters just to connect inputs directly to the outputs. You expand these stubs before moving down to the next level of modules.

module MyChip_ASIC();
// behavioral "always", etc. ...
SecondLevelStub1 port mapping
SecondLevelStub2 port mapping
...
endmodule
module SecondLevelStub1() ... assign Output1 = ~Input1; endmodule
module SecondLevelStub2() ... assign Output2 = ~Input2; endmodule
Eventually the Verilog modules will correspond to the various component pieces of the ASIC.
12.5.1 Verilog Modeling

Before we could start synthesis of the Viterbi decoder we had to alter the model for the D flip-flop. This was because the original flip-flop model contained syntax (multiple wait statements in an always statement) that was acceptable to the simulation tool but not to the synthesis tool. This example was artificial because we had already prepared and tested the Verilog code so that it was acceptable to the synthesis software (we say we created synthesizable code). However, we frequently find ourselves with nonsynthesizable code in logic synthesis. The original OVI LRM included a synthesis policy, a set of guidelines that outline which parts of the Verilog language a synthesis tool should support and which parts are optional. Some EDA vendors call their synthesis policy a modeling style. There is no current standard on which parts of an HDL (either Verilog or VHDL) a synthesis tool should support. It is essential that the structural model created by a synthesis tool is functionally identical, or functionally equivalent, to your behavioral model. Hopefully, we know this is true if the synthesis tool is working properly. In this case the logic is "correct by construction." If you use different HDL code for simulation and for synthesis, you have a problem. The process of formal verification can prove that two logic descriptions (perhaps structural and behavioral HDL descriptions) are identical in their behavior. We shall return to this issue in Chapter 13. Next we shall examine Verilog and VHDL from the following viewpoint: "How do I write synthesizable code?"
12.5.2 Delays in Verilog

Synthesis tools ignore delay values. They must: how can a synthesis tool guarantee that logic will have a certain delay? For example, a synthesizer cannot generate hardware to implement the following Verilog code:

module Step_Time(clk, phase);
input clk; output [2:0] phase; reg [2:0] phase;
always @(posedge clk) begin
phase <= 3'b000;
phase <= #1 3'b001; phase <= #2 3'b010;
phase <= #3 3'b011; phase <= #4 3'b100;
end
endmodule

We can avoid this type of timing problem by dividing a clock as follows:

module Step_Count (clk_5x, phase);
input clk_5x; output [2:0] phase; reg [2:0] phase;
always @(posedge clk_5x)
case (phase)
0:phase = #1 1; 1:phase = #1 2; 2:phase = #1 3; 3:phase = #1 4;
default : phase = #1 0;
endcase
endmodule
12.5.3 Blocking and Nonblocking Assignments

There are some synthesis limitations that arise from the different types of Verilog assignment statements. Consider the following shift-register model:

module race(clk, q0); input clk, q0; reg q1, q2;
always @(posedge clk) q1 = #1 q0;
always @(posedge clk) q2 = #1 q1;
endmodule

This example has a race condition (or a race) that occurs as follows. The synthesizer ignores delays and the two always statements are procedures that execute concurrently. So, do we update q1 first and then assign the new value of q1 to q2? Or do we update q2 first (with the old value of q1), and then update q1? In real hardware two signals would be racing each other, and the winner is unclear. We must think like the hardware to guide the synthesis tool. Combining the assignment statements into a single always statement, as follows, is one way to solve this problem:

module no_race_1(clk, q0, q2); input clk, q0; output q2; reg q1, q2;
always @(posedge clk) begin q2 = q1; q1 = q0; end
endmodule

Evaluation is sequential within an always statement, and the order of the assignment statements now ensures q2 gets the old value of q1 before we update q1. We can also avoid the problem if we use nonblocking assignment statements,

module no_race_2(clk, q0, q2); input clk, q0; output q2; reg q1, q2;
always @(posedge clk) q1 <= #1 q0;
always @(posedge clk) q2 <= #1 q1;
endmodule

This code updates all the registers together, at the end of a time step, so q2 always gets the old value of q1.
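The two-stage behavior of the nonblocking version can be checked in simulation: a value applied to q0 should take two clock edges to reach q2. The following is a minimal sketch. The module no_race_2 is copied from the text, while the testbench name (tb_shift) and the timing values are hypothetical.

```verilog
// Copy of the nonblocking shift-register model from the text.
module no_race_2(clk, q0, q2);
  input clk, q0;
  output q2;
  reg q1, q2;
  always @(posedge clk) q1 <= #1 q0;
  always @(posedge clk) q2 <= #1 q1;
endmodule

// Hypothetical testbench: q2 must lag q0 by two clock edges.
module tb_shift;
  reg  clk, q0;
  wire q2;

  no_race_2 u1 (.clk(clk), .q0(q0), .q2(q2));

  always #5 clk = ~clk;            // rising edges at t = 5, 15, 25, ...

  initial begin
    clk = 0; q0 = 1;
    #7;                            // just after the first rising edge
    if (q2 === 1'b1) $display("FAIL: q2 changed one clock early");
    #10;                           // just after the second rising edge
    if (q2 !== 1'b1) $display("FAIL: q2 did not follow after two clocks");
    $display("q2 = %b at time %0t", q2, $time);
    $finish;
  end
endmodule
```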
12.5.4 Combinational Logic in Verilog

To model combinational logic, the sensitivity list of a Verilog always statement must contain only signals with no edges (no reference to keywords posedge or negedge). This is a level-sensitive sensitivity list, as in the following example that implies a two-input AND gate:

module And_Always(x, y, z); input x,y; output z; reg z;
always @(x or y) z <= x & y; // combinational logic method 1
endmodule

Continuous assignment statements also imply combinational logic (notice that z is now a wire rather than a reg),

module And_Assign(x, y, z); input x,y; output z; wire z;
assign z = x & y; // combinational logic method 2 = method 1
endmodule

We may also use concatenation or bit reduction to synthesize combinational logic functions,

module And_Or (a,b,c,z); input a,b,c; output [1:0] z; reg [1:0] z;
always @(a or b or c) begin z[0] <= &{a,b,c}; z[1] <= |{a,b,c}; end
endmodule

module Parity (BusIn, outp); input [7:0] BusIn; output outp; reg outp;
always @(BusIn) if (^BusIn == 0) outp = 1; else outp = 0;
endmodule

The number of inputs, the types, and the drive strengths of the synthesized combinational logic cells will depend on the speed, area, and load requirements that you set as constraints. You must be careful if you reference a signal (reg or wire) in a level-sensitive always statement and do not include that signal in the sensitivity list. In the following example, signal b is missing from the sensitivity list, and so this code should be flagged with a warning or an error by the synthesis tool, even though the code is perfectly legal and acceptable to the Verilog simulator:

module And_Bad(a, b, c); input a, b; output c; reg c;
always @(a) c <= a & b; // b is missing from this sensitivity list
endmodule

It is easy to write Verilog code that will simulate, but that does not make sense to the synthesis software. You must think like the hardware. To avoid this type of problem with combinational logic inside an always statement you should either:
- include all variables in the event expression, or
- assign to the variables before you use them
For example, consider the following two models:

module CL_good(a, b, c); input a, b; output c; reg c, d, e;
always @(a or b)
begin c = a + b; d = a & b; e = c + d; end // c, d: LHS before RHS
endmodule

module CL_bad(a, b, c); input a, b; output c; reg c, d, e;
always @(a or b)
begin e = c + d; c = a + b; d = a & b; end // c, d: RHS before LHS
endmodule

In CL_bad, the signals c and d are used on the right-hand side (RHS) of an assignment statement before they are defined on the left-hand side (LHS) of an assignment statement. If the logic synthesizer produces combinational logic for CL_bad, it should warn us that the synthesized logic may not match the simulation results. When you are describing combinational logic you should be aware of the complexity of logic optimization. Some combinational logic functions are too difficult for the optimization algorithms to handle. The following module, Achilles, and large parity functions are examples of hard-to-synthesize functions. This is because most logic-optimization algorithms calculate the complement of the functions at some point. The complements of certain functions grow exponentially in the number of their product terms.

// The complement of this function is too big for synthesis.
module Achilles (out, in); output out; input [30:1] in;
assign out = in[30]&in[29]&in[28] | in[27]&in[26]&in[25]
| in[24]&in[23]&in[22] | in[21]&in[20]&in[19]
| in[18]&in[17]&in[16] | in[15]&in[14]&in[13]
| in[12]&in[11]&in[10] | in[9] & in[8]&in[7]
| in[6] & in[5]&in[4] | in[3] & in[2]&in[1];
endmodule

In a case like this you can isolate the problem function in a separate module. Then, after synthesis, you can use directives to tell the synthesizer not to try to optimize the problem function.
12.5.5 Multiplexers In Verilog

We imply a MUX using a case statement, as in the following example:

module Mux_21a(sel, a, b, z); input sel, a, b; output z; reg z;
always @(a or b or sel)
begin case (sel) 1'b0: z <= a; 1'b1: z <= b; endcase end
endmodule

Be careful using 'x' in a case statement. Metalogical values (such as 'x') are not "real" and are only valid in simulation (and they are sometimes known as simbits for that reason). For example, a synthesizer cannot make logic to model the following and will usually issue a warning to that effect:

module Mux_x(sel, a, b, z); input sel, a, b; output z; reg z;
always @(a or b or sel)
begin case (sel) 1'b0: z <= 0; 1'b1: z <= 1; 1'bx: z <= 1'bx; endcase end
endmodule

For the same reason you should avoid using casex and casez statements. An if statement can also be used to imply a MUX as follows:

module Mux_21b(sel, a, b, z); input sel, a, b; output z; reg z;
always @(a or b or sel) begin if (sel) z <= a; else z <= b; end
endmodule

However, if you do not always assign to an output, as in the following code, you will get a latch:

module Mux_Latch(sel, a, b, z); input sel, a, b; output z; reg z;
always @(a or sel) begin if (sel) z <= a; end
endmodule

It is important to understand why this code implies a sequential latch and not a combinational MUX. Think like the hardware and you will see the problem. When sel is zero, you can pass through the always statement whenever a change occurs on the input a without updating the value of the output z. In this situation you need to "remember" the value of z when a changes. This implies sequential logic using a as the latch input, sel as the active-high latch enable, and z as the latch output. The following code implies an 8:1 MUX with a three-state output:

module Mux_81(InBus, sel, OE, OutBit);
input [7:0] InBus; input [2:0] sel;
input OE; output OutBit; reg OutBit;
always @(OE or sel or InBus)
begin
if (OE == 1) OutBit = InBus[sel]; else OutBit = 1'bz;
end
endmodule

When you synthesize a large MUX, the required speed and area, the output load, and the cells that are available in the cell library will determine whether the synthesizer uses a large MUX cell, several smaller MUX cells, or equivalent random logic cells. The synthesized logic may also use different logic cells depending on whether you want the fastest path from the select input to the MUX output or from the data inputs to the MUX output.
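Returning to the Mux_Latch example above, one common way to avoid the unintended latch is to give the output a default value before the if statement, so that z is assigned on every pass through the always statement. The following is a minimal sketch of that idea. The module name Mux_NoLatch is hypothetical, and a given synthesis tool's style guide may recommend a different template.

```verilog
// Hypothetical latch-free variant of Mux_Latch: z is assigned on every pass.
module Mux_NoLatch(sel, a, b, z);
  input sel, a, b;
  output z;
  reg z;
  always @(a or b or sel) begin
    z = b;            // default assignment, so no value has to be remembered
    if (sel) z = a;   // override the default when sel is high
  end
endmodule
```

Because every input that is read (a, b, and sel) appears in the sensitivity list and z is always assigned, the description is purely combinational.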
12.5.6 The Verilog Case Statement

Consider the following model:

module case8_oneHot(oneHot, a, b, c, z);
input a, b, c; input [2:0] oneHot; output z; reg z;
always @(oneHot or a or b or c)
begin case (oneHot) //synopsys full_case
3'b001: z <= a; 3'b010: z <= b; 3'b100: z <= c;
default: z <= 1'bx; endcase
end
endmodule

By including the default choice, the case statement is exhaustive. This means that every possible value of the select variable (oneHot) is accounted for in the arms of the case statement. In some synthesizers (Synopsys, for example) you may indicate the arms are exhaustive and imply a MUX by using a compiler directive or synthesis directive. A compiler directive is also called a pseudocomment if it uses the comment format (such as //synopsys full_case). The format of pseudocomments is very specific. Thus, for example, //synopsys may be recognized but // synopsys (with an extra space) or //SynopSys (uppercase) may not. The use of pseudocomments shows the problems of using an HDL for a purpose for which it was not intended. When we start "extending" the language we lose the advantages of a standard and sacrifice portability. A compiler directive in module case8_oneHot is unnecessary if the default choice is included. If you omit the default choice and you do not have the ability to use the full_case directive (or you use a different tool), the synthesizer will infer latches for the output z. If the default in a case statement is 'x' (signifying a synthesis don't care value), this gives the synthesizer flexibility in optimizing the logic. It does not mean that the synthesized logic output will be unknown when the default applies. The combinational logic that results from a case statement when a don't care ('x') is included as a default may or may not include a MUX, depending on how the logic is optimized. In case8_oneHot the choices in the arms of the case statement are exhaustive and also mutually exclusive. Consider the following alternative model:

module case8_priority(oneHot, a, b, c, z);
input a, b, c; input [2:0] oneHot; output z; reg z;
always @(oneHot or a or b or c)
begin case (1'b1) //synopsys parallel_case
oneHot[0]: z <= a;
oneHot[1]: z <= b;
oneHot[2]: z <= c;
default: z <= 1'bx; endcase
end
endmodule

In this version of the case statement the choices are not necessarily mutually exclusive (oneHot[0] and oneHot[2] may both be equal to 1'b1, for example). Thus the code implies a priority encoder. This may not be what you intended. Some logic synthesizers allow you to indicate mutually exclusive choices by using a directive (//synopsys parallel_case, for example). It is probably wiser not to use these "outside-the-language" directives if they can be avoided.
12.5.7 Decoders In Verilog

The following code models a 4:16 decoder with enable and three-state output:

module Decoder_4To16(enable, In_4, Out_16); // 4-to-16 decoder
input enable; input [3:0] In_4; output [15:0] Out_16;
reg [15:0] Out_16;
always @(enable or In_4)
begin Out_16 = 16'hzzzz;
if (enable == 1)
begin Out_16 = 16'h0000; Out_16[In_4] = 1; end
end
endmodule

In line 7 the binary-encoded 4-bit input sets the corresponding bit of the 16-bit output to '1'. The synthesizer infers a three-state buffer from the assignment in line 5. Using the equality operator, '==', rather than the case equality operator, '===', makes sense in line 6, because the synthesizer cannot generate logic that will check for enable being 'x' or 'z'. So, for example, do not write the following (though some synthesis tools will still accept it):

if (enable === 1) // can't make logic to check for enable = x or z
12.5.8 Priority Encoder in Verilog

The following Verilog code models a priority encoder with three-state output:

module Pri_Encoder32 (InBus, Clk, OE, OutBus);
input [31:0]InBus; input OE, Clk; output [4:0]OutBus;
integer j; reg [4:0]OutBus;
always @(posedge Clk)
begin
if (OE == 0) OutBus = 5'bz;
else
begin OutBus = 0;
for (j = 31; j >= 0; j = j - 1)
begin if (InBus[j] == 1) OutBus = j; end
end
end
endmodule

In lines 9-11 the binary-encoded output is set to the position of the lowest-indexed '1' in the input bus. The loop index j is declared as an integer so that it can hold the values 31 down to 0 and the test j >= 0 can eventually fail. The logic synthesizer must be able to unroll the loop in a for statement. Normally the synthesizer will check for fixed (or static) bounds on the loop limits, as in line 9 above.
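The claim that the loop leaves the index of the lowest-numbered '1' in the output can be checked with a short simulation. The following is only a sketch: the encoder is restated with the loop index declared as an integer (so that the loop can count down from 31 and terminate), and the testbench name (tb_pri) and the stimulus value are hypothetical.

```verilog
// Priority encoder restated for a self-contained example.
module Pri_Encoder32 (InBus, Clk, OE, OutBus);
  input  [31:0] InBus;
  input  OE, Clk;
  output [4:0] OutBus;
  reg    [4:0] OutBus;
  integer j;                              // loop index; must be able to go below 0
  always @(posedge Clk)
    if (OE == 0) OutBus = 5'bz;
    else begin
      OutBus = 0;
      for (j = 31; j >= 0; j = j - 1)
        if (InBus[j] == 1) OutBus = j;    // the last match (lowest index) wins
    end
endmodule

// Hypothetical testbench: bits 7 and 4 are set, so the output should be 4.
module tb_pri;
  reg  [31:0] in;
  reg  clk, oe;
  wire [4:0] out;

  Pri_Encoder32 u1 (.InBus(in), .Clk(clk), .OE(oe), .OutBus(out));

  initial begin
    clk = 0; oe = 1; in = 32'h0000_0090;
    #5 clk = 1;                           // rising clock edge
    #1 if (out !== 5'd4) $display("FAIL: expected 4, got %0d", out);
    $display("OutBus = %0d", out);
  end
endmodule
```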
12.5.9 Arithmetic in Verilog

You need to make room for the carry bit when you add two numbers in Verilog. You may do this using concatenation on the LHS of an assignment as follows: module Adder_8 (A, B, Z, Cin, Cout); input [7:0] A, B; input Cin; output [7:0] Z; output Cout; assign {Cout, Z} = A + B + Cin; endmodule In the following example, the synthesizer should recognize '1' as a carry-in bit of an adder and should synthesize one adder and not two: module Adder_16 (A, B, Sum, Cout); input [15:0] A, B; output [15:0] Sum; output Cout; reg [15:0] Sum; reg Cout; always @(A or B) {Cout, Sum} = A + B + 1; endmodule
It is always possible to synthesize adders (and other arithmetic functions) using random logic, but they may not be as efficient as using datapath synthesis (see Section 12.5.12). A logic synthesizer may infer two adders from the following description rather than sharing a single adder.

module Add_A (sel, a, b, c, d, y);
input a, b, c, d, sel; output y; reg y;
always @(sel or a or b or c or d)
begin if (sel == 0) y <= a + b; else y <= c + d; end
endmodule

To imply the presence of a MUX before a single adder we can use temporary variables. For example, the synthesizer should use only one adder for the following code:

module Add_B (sel, a, b, c, d, y);
input a, b, c, d, sel; output y; reg t1, t2, y;
always @(sel or a or b or c or d) begin
if (sel == 0) begin t1 = a; t2 = b; end // Temporary
else begin t1 = c; t2 = d; end // variables.
y = t1 + t2;
end
endmodule

If a synthesis tool is capable of performing resource allocation and resource sharing in these situations, the coding style may not matter. However, we may want to use a different tool, which may not be as advanced, at a later date, so it is better to use Add_B rather than Add_A if we wish to conserve area. This example shows that the simplest code (Add_A) does not always result in the simplest logic (Add_B). Multiplication in Verilog assumes nets are unsigned numbers:

module Multiply_unsigned (A, B, Z);
input [1:0] A, B; output [3:0] Z;
assign Z = A * B;
endmodule

To multiply signed numbers we need to extend the multiplicands with their sign bits as follows (some simulators have trouble with the concatenation '{}' structures, in which case we have to write them out "long hand"):

module Multiply_signed (A, B, Z);
input [1:0] A, B; output [3:0] Z;
// 00 -> 00_00 01 -> 00_01 10 -> 11_10 11 -> 11_11
assign Z = { { 2{A[1]} }, A} * { { 2{B[1]} }, B};
endmodule

How the logic synthesizer implements the multiplication depends on the software.
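The effect of sign extension can be seen by treating the 2-bit inputs as two's-complement numbers and checking a few products. The following is a minimal sketch. Multiply_signed is restated from the text, and the testbench (tb_mult) and the chosen operand values are hypothetical.

```verilog
// Signed multiplier restated from the text (operands sign-extended to 4 bits).
module Multiply_signed (A, B, Z);
  input  [1:0] A, B;
  output [3:0] Z;
  assign Z = { {2{A[1]}}, A } * { {2{B[1]}}, B };
endmodule

// Hypothetical testbench: operands are read as 2-bit two's-complement values.
module tb_mult;
  reg  [1:0] a, b;
  wire [3:0] z;

  Multiply_signed u1 (.A(a), .B(b), .Z(z));

  initial begin
    a = 2'b11; b = 2'b10;               // (-1) * (-2) = +2
    #1 if (z !== 4'b0010) $display("FAIL: -1 * -2, z = %b", z);
    a = 2'b10; b = 2'b01;               // (-2) * (+1) = -2
    #1 if (z !== 4'b1110) $display("FAIL: -2 * 1, z = %b", z);
    a = 2'b11; b = 2'b11;               // (-1) * (-1) = +1
    #1 if (z !== 4'b0001) $display("FAIL: -1 * -1, z = %b", z);
    $display("signed multiply checks complete");
  end
endmodule
```

An unsigned multiplier would instead read 2'b11 and 2'b10 as 3 and 2 and produce 6 (4'b0110) for the first case.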
12.5.10 Sequential Logic in Verilog

The following statement implies a positive-edge-triggered D flip-flop:

always @(posedge clock) Q_flipflop = D; // A flip-flop.

When you use edges (posedge or negedge) in the sensitivity list of an always statement, you imply a clocked storage element. However, an always statement does not have to be edge-sensitive to imply sequential logic. As another example of sequential logic, the following statement implies a level-sensitive transparent latch:

always @(clock or D) if (clock) Q_latch = D; // A latch.

On the negative edge of the clock the always statement is executed, but no assignment is made to Q_latch. These last two code examples concisely illustrate the difference between a flip-flop and a latch. Any sequential logic cell or memory element must be initialized. Although you could use an initial statement to simulate power-up, generating logic to mimic an initial statement is hard. Instead use a reset as follows:

always @(posedge clock or negedge reset)

A problem now arises. When we use two edges, the synthesizer must infer which edge is the clock, and which is the reset. Synthesis tools cannot read any significance into the names we have chosen. For example, we could have written

always @(posedge day or negedge year)

but which is the clock and which is the reset in this case? For most synthesis tools you must solve this problem by writing HDL code in a certain format or pattern so that the logic synthesizer may correctly infer the clock and reset signals. The following examples show one possible pattern or template. These templates and their use are usually described in a synthesis style guide that is part of the synthesis software documentation.

always @(posedge clk or negedge reset) begin // template for reset:
if (reset == 0) Q = 0; // initialize,
else Q = D; // normal clocking
end

module Counter_With_Reset (count, clock, reset);
input clock, reset; output [7:0] count; reg [7:0] count;
always @ (posedge clock or negedge reset)
if (reset == 0) count = 0; else count = count + 1;
endmodule

module DFF_MasterSlave (D, clock, reset, Q); // D type flip-flop
input D, clock, reset; output Q; reg Q, latch;
always @(posedge clock or posedge reset)
if (reset == 1) latch = 0; else latch = D; // the master.
always @(latch) Q = latch; // the slave.
endmodule

The synthesis tool can now infer that, in these templates, the signal that is tested in the if statement is the reset, and that the other signal must therefore be the clock.
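The reset template can be exercised in simulation before synthesis. The following is a minimal sketch. It restates the counter with an 8-bit output declaration so that the port width matches the register, and the testbench name (tb_counter) and timing values are hypothetical.

```verilog
// Counter following the reset template, with an 8-bit output port.
module Counter_With_Reset (count, clock, reset);
  input clock, reset;
  output [7:0] count;
  reg    [7:0] count;
  always @(posedge clock or negedge reset)
    if (reset == 0) count = 0;          // the signal tested in the if is the reset
    else            count = count + 1;  // the other edge is the clock
endmodule

// Hypothetical testbench: reset, then count three rising clock edges.
module tb_counter;
  reg  clock, reset;
  wire [7:0] count;

  Counter_With_Reset u1 (.count(count), .clock(clock), .reset(reset));

  always #5 clock = ~clock;             // rising edges at t = 5, 15, 25, 35, ...

  initial begin
    clock = 0; reset = 1;
    #1 reset = 0;                       // falling edge of reset clears the counter
    #1 reset = 1;
    #30;                                // t = 32: three rising edges have occurred
    if (count !== 8'd3) $display("FAIL: expected 3, got %0d", count);
    $display("count = %0d", count);
    $finish;
  end
endmodule
```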
12.5.11 Component Instantiation in Verilog

When we give an HDL description to a synthesis tool, it will synthesize a netlist that contains generic logic gates. By generic we mean the logic is technology-independent (it could be CMOS standard cell, FPGA, TTL, GaAs, or something else; we have not decided yet). Only after logic optimization and mapping to a specific ASIC cell library do the speed or area constraints determine the cell choices from a cell library: NAND gates, OAI gates, and so on.

The only way to ensure that the synthesizer uses a particular cell, 'special' for example, from a specific library is to write structural Verilog and instantiate the cell, 'special', in the Verilog. We call this hand instantiation. We must then decide whether to allow logic optimization to replace or change 'special'. If we insist on using logic cell 'special' and do not want it changed, we flag the cell with a synthesizer command. Most logic synthesizers currently use a pseudocomment statement or set an attribute to do this. For example, we might include the following statement to tell the Compass synthesizer "Do not change cell instance my_inv_8x." This is not a standard construct, and it is not portable from tool to tool either.

//Compass dontTouch my_inv_8x or // synopsys dont_touch
INVD8 my_inv_8x(.I(a), .ZN(b) );

(Some compiler directives are trademarks.) Notice that, in this example, instantiation involves declaring the instance name and defining a structural port mapping.

There is no standard name for technology-independent models or components; we shall call them soft models or standard components. We can use the standard components for synthesis or for behavioral Verilog simulation. Here is an example of using standard components for flip-flops (remember there are no primitive Verilog flip-flop models, only primitives for the elementary logic cells):

module Count4(clk, reset, Q0, Q1, Q2, Q3);
input clk, reset; output Q0, Q1, Q2, Q3;
wire Q0, Q1, Q2, Q3;
//          Q ,  D , clk, reset
asDff dff0( Q0, ~Q0, clk, reset); // The asDff is a
asDff dff1( Q1, ~Q1, Q0, reset); // standard component,
asDff dff2( Q2, ~Q2, Q1, reset); // unique to one set of tools.
asDff dff3( Q3, ~Q3, Q2, reset);
endmodule

The asDff and other standard components are provided with the synthesis tool. The standard components have specific names and interfaces that are part of the software documentation. When we use a standard component such as asDff we are saying: "I want a D flip-flop, but I do not know which ASIC technology I want to use; give me a generic version. I do not want to write a Verilog model for the D flip-flop myself because I do not want to bother to synthesize each and every instance of a flip-flop. When the time comes, just map this generic flip-flop to whatever is available in the technology-dependent (vendor-specific) library."

If we try to simulate Count4 we will get an error,

:Count4.v: L5: error: Module 'asDff' not defined

(and three more like this) because asDff is not a primitive Verilog model. The synthesis tool should provide us with a model for the standard component. For example, the following code models the behavior of the standard component, asDff:

module asDff (Q, D, Clk, Rst); // positional order used by the Count4 instances
parameter width = 1, reset_value = 0;
output [width-1:0] Q; reg [width-1:0] Q;
input [width-1:0] D; input Clk, Rst;
initial Q = {width{1'bx}};
always @(posedge Clk or negedge Rst)
  if (Rst == 0) Q <= #1 reset_value; else Q <= #1 D;
endmodule

When the synthesizer compiles the HDL code in Count4, it does not parse the asDff model. The software recognizes asDff and says "I see you want a flip-flop." The first steps that the synthesis software and the simulation software take are often referred to as compilation, but the two steps are different for each of these tools.

Synopsys has an extensive set of libraries, called DesignWare, that contains standard components not only for flip-flops but for arithmetic and other complex logic elements. These standard components are kept protected from optimization until it is time to map to a vendor technology. ASIC or EDA companies that produce design software and cell libraries can tune the synthesizer to the silicon and achieve a more efficient mapping. Even though we call them standard components, there are no standards that cover their names, use, interfaces, or models.
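To make the simulation side concrete, here is a small self-contained sketch of a testbench that supplies its own stand-in flip-flop model and clocks a two-bit version of the ripple counter. The names asDff_sketch and Count2_tb are hypothetical, and the port order (Q, D, Clk, Rst) is an assumption taken from the comment in Count4; a real tool would ship its own model with a documented interface.

```verilog
`timescale 1ns/1ns
// Hypothetical stand-in for a tool-supplied standard component model.
module asDff_sketch (Q, D, Clk, Rst);
  parameter width = 1, reset_value = 0;
  output [width-1:0] Q;
  reg    [width-1:0] Q;
  input  [width-1:0] D;
  input  Clk, Rst;
  always @(posedge Clk or negedge Rst)
    if (Rst == 0) Q <= #1 reset_value;  // asynchronous, active-low reset
    else          Q <= #1 D;
endmodule

module Count2_tb;
  reg  clk, reset;
  wire Q0, Q1;
  // Positional connections: Q, D, Clk, Rst.
  asDff_sketch dff0 (Q0, ~Q0, clk, reset);
  asDff_sketch dff1 (Q1, ~Q1, Q0,  reset);
  // Free-running clock with a 10 ns period.
  initial begin clk = 0; forever #5 clk = ~clk; end
  // Hold reset low briefly, release it, then stop.
  initial begin
    reset = 0;
    #12 reset = 1;
    #100 $finish;
  end
  initial $monitor("t=%0d reset=%b Q1=%b Q0=%b", $time, reset, Q1, Q0);
endmodule
```

With a model like this compiled alongside the design, the simulator no longer reports the undefined-module error, while the synthesizer would still ignore the model and map the component itself.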
12.5.12 Datapath Synthesis in Verilog

Datapath synthesis is used for bus-wide arithmetic and other bus-wide operations. For example, synthesis of a 32-bit multiplier in random logic is much less efficient than using datapath synthesis. There are several approaches to datapath synthesis:
• Synopsys VHDL DesignWare. This models generic arithmetic and other large functions (counters, shift registers, and so on) using standard components. We can either let the synthesis tool map operators (such as '+') to VHDL DesignWare components, or we can hand instantiate them in the code. Many ASIC vendors support the DesignWare libraries. Thus, for example, we can instantiate a DesignWare counter in VHDL and map that to a cell predesigned and preoptimized by Actel for an Actel FPGA.

• Compiler directives. This approach uses synthesis directives in the code either to steer the mapping of datapath operators to specific components (a two-port RAM or a register file, for example) or to flag certain operators to be implemented using a certain style ('+' to be implemented using a ripple-carry adder or a carry-lookahead adder, for example).

• X-BLOX is a system from Xilinx that allows us to keep the logic of certain functions (counters, arithmetic elements) together. This is so that the layout tool does not splatter the synthesized CLBs all over your FPGA, reducing the performance of the logic.

• LPM (library of parameterized modules) and RPM (relationally placed modules) are other techniques used principally by FPGA companies to keep logic that operates on related data close together. This approach is based on the use of the EDIF language to describe the modules.
In all cases the disadvantage is that the code becomes specific to a certain piece of software. Here are two examples of datapath synthesis directives:

module DP_csum(A1,B1,Z1); input [3:0] A1,B1; output [3:0] Z1; reg [3:0] Z1;
always @(A1 or B1) Z1 <= A1 + B1; //Compass adder_arch cond_sum_add
endmodule

module DP_ripp(A2,B2,Z2); input [3:0] A2,B2; output [3:0] Z2; reg [3:0] Z2;
always @(A2 or B2) Z2 <= A2 + B2; //Compass adder_arch ripple_add
endmodule

These directives steer the synthesis of a conditional-sum adder (usually the fastest adder implementation) or a ripple-carry adder (small but slow).

There are some limitations to datapath synthesis. Sometimes, complex operations are not synthesized as we might expect. For example, a datapath library may contain a subtracter that has a carry input; however, the following code may synthesize to random logic, because the synthesizer may not be able to infer that the signal CarryIn is a subtracter carry:

module DP_sub_A(A,B,OutBus,CarryIn);
input [3:0] A, B; input CarryIn;
output [3:0] OutBus; reg [3:0] OutBus;
always @(A or B or CarryIn) OutBus <= A - B - CarryIn;
endmodule

If we rewrite the code and subtract the carry as a constant, the synthesizer can more easily infer that it should use the carry-in of a datapath subtracter:

module DP_sub_B (A, B, CarryIn, Z);
input [3:0] A, B; input CarryIn; output [3:0] Z; reg [3:0] Z;
always @(A or B or CarryIn) begin
  case (CarryIn)
    1'b1 : Z <= A - B - 1'b1;
    default : Z <= A - B - 1'b0;
  endcase
end
endmodule

This is another example of thinking like the hardware in order to help the synthesis tool infer what we are trying to imply.
12.6 VHDL and Logic Synthesis

Most logic synthesizers insist we follow a set of rules when we use a logic system to ensure that what we synthesize matches the behavioral description. Here is a typical set of rules for use with the IEEE VHDL nine-value system:
• You can use logic values corresponding to states '1', 'H', '0', and 'L' in any manner.
• Some synthesis tools do not accept the uninitialized logic state 'U'.
• You can use logic states 'Z', 'X', 'W', and '-' in signal and variable assignments in any manner. 'Z' is synthesized to three-state logic. The states 'X', 'W', and '-' are treated as unknown or don't care values.
The values 'Z' , 'X' , 'W' , and '-' may be used in conditional clauses such as the comparison in an if or case statement. However, some synthesis tools will ignore them and only match surrounding '1' and '0' bits. Consequently, a synthesized design may behave differently from the simulation if a stimulus uses 'Z' , 'X' , 'W' or '-' . The IEEE synthesis packages provide the STD_MATCH function for comparisons.
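The same kind of simulation and synthesis mismatch can arise in Verilog. As a rough, simulation-only sketch (the module name X_Compare_Sketch is hypothetical), the following code shows how the ordinary equality operator behaves when a stimulus contains an unknown bit, compared with the case-equality operator that matches 'x' literally:

```verilog
// Simulation-only sketch: comparisons against a value containing x.
module X_Compare_Sketch;
  reg [1:0] a;
  initial begin
    a = 2'b1x;
    // '==' yields x when a compared bit is unknown; 'if' treats x as false,
    // so the else branch is taken here.
    if (a == 2'b10) $display("== : match");
    else            $display("== : no match (comparison result is x)");
    // '===' compares x and z bits literally; it is intended for testbenches
    // and is generally not accepted for synthesis.
    if (a === 2'b1x) $display("===: literal match on x");
  end
endmodule
```

Synthesized hardware has no 'x' value, so a testbench that drives unknowns through such comparisons may see behavior that the gates cannot reproduce.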
12.6.1 Initialization and Reset

You can use a VHDL process with a sensitivity list to synthesize clocked logic with a reset, as in the following code:

process (signal_1, signal_2) begin
  if (signal_2'EVENT and signal_2 = '0') then
    -- Insert initialization and reset statements.
  elsif (signal_1'EVENT and signal_1 = '1') then
    -- Insert clocking statements.
  end if;
end process;

Using a specific pattern the synthesizer can infer that you are implying a positive-edge clock (signal_1) and a negative-edge reset (signal_2). In order to be able to recognize sequential logic in this way, most synthesizers restrict you to using a maximum of two edges in a sensitivity list.
12.6.2 Combinational Logic Synthesis in VHDL

In VHDL a level-sensitive process is a process statement that has a sensitivity list with signals that are not tested for event attributes ('EVENT or 'STABLE, for example) within the process. To synthesize combinational logic we use a VHDL level-sensitive process or a concurrent assignment statement. Some synthesizers do not allow reference to a signal inside a level-sensitive process unless that signal is in the sensitivity list. In this example, signal b is missing from the sensitivity list:

entity And_Bad is port (a, b: in BIT; c: out BIT); end And_Bad;
architecture Synthesis_Bad of And_Bad is
begin process (a) -- this should be process (a, b)
  begin c <= a and b;
end process;
end Synthesis_Bad;

This situation is similar to, but not exactly the same as, omitting a variable from an event control in a Verilog always statement. Some logic synthesizers accept the VHDL version of And_Bad but not the Verilog version, or vice versa. To ensure that the VHDL simulation will match the behavior of the synthesized logic, the logic synthesizer usually checks the sensitivity list of a level-sensitive process and issues a warning if signals seem to be missing.
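For comparison, a minimal sketch of the Verilog situation mentioned above might look like the following. The module names And_Bad_V and And_Good_V are hypothetical; how a given synthesizer reacts to the incomplete event control (warning, error, or silent acceptance) is tool-dependent.

```verilog
// b is read but missing from the event control: in simulation c is
// recomputed only when a changes, so a change on b alone is missed.
module And_Bad_V (a, b, c);
  input a, b;
  output c;
  reg c;
  always @(a)
    c = a & b;
endmodule

// Complete event control: simulation matches a plain AND gate.
module And_Good_V (a, b, c);
  input a, b;
  output c;
  reg c;
  always @(a or b)
    c = a & b;
endmodule
```

A synthesizer would typically build the same AND gate for both modules, which is exactly why the first one risks a mismatch between simulation and hardware.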
12.6.3 Multiplexers in VHDL

Multiplexers can be synthesized using a case statement (avoiding the VHDL reserved word 'select'), as the following example illustrates:

entity Mux4 is port
  (i: BIT_VECTOR(3 downto 0); sel: BIT_VECTOR(1 downto 0); s: out BIT);
end Mux4;
architecture Synthesis_1 of Mux4 is
begin process (sel, i) begin
  case sel is
    when "00" => s <= i(0); when "01" => s <= i(1);
    when "10" => s <= i(2); when "11" => s <= i(3);
  end case;
end process;
end Synthesis_1;

The following code, using a concurrent signal assignment, is equivalent:

architecture Synthesis_2 of Mux4 is
begin with sel select s <=
  i(0) when "00", i(1) when "01", i(2) when "10", i(3) when "11";
end Synthesis_2;

In VHDL the case statement must be exhaustive in either form, so there is no question of any priority in the choices as there may be in Verilog. For larger MUXes we can use an array, as in the following example:

library IEEE; use IEEE.STD_LOGIC_1164.all;
entity Mux8 is port
  (InBus : in STD_LOGIC_VECTOR(7 downto 0);
  Sel : in INTEGER range 0 to 7;
  OutBit : out STD_LOGIC);
end Mux8;
architecture Synthesis_1 of Mux8 is
begin process (InBus, Sel)
  begin OutBit <= InBus(Sel);
end process;
end Synthesis_1;

Most synthesis tools can infer that, in this case, Sel requires three bits. If not, you have to declare the signal as a STD_LOGIC_VECTOR,

Sel : in STD_LOGIC_VECTOR(2 downto 0);

and use a conversion routine from the NUMERIC_STD package like this:

OutBit <= InBus(TO_INTEGER ( UNSIGNED (Sel) ) ) ;

At some point you have to convert from an INTEGER to BIT logic anyway, since you cannot connect an INTEGER to the input of a chip! The VHDL case, if, and select statements produce similar results. Assigning don't care bits ('X') in these statements will make it easier for the synthesizer to optimize the logic.
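As a point of comparison, an eight-input multiplexer indexed by a select bus can be sketched in Verilog as follows. The module name Mux8_V is hypothetical, and the select input is declared as a three-bit vector from the start, so no integer conversion is needed.

```verilog
// Verilog counterpart of an array-indexed 8:1 multiplexer.
module Mux8_V (InBus, Sel, OutBit);
  input  [7:0] InBus;
  input  [2:0] Sel;     // three bits select one of eight inputs
  output OutBit;
  // A bit-select with a variable index implies a multiplexer.
  assign OutBit = InBus[Sel];
endmodule
```

Because the index is a plain vector, the question of how many bits an INTEGER range needs does not arise in this form.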
12.6.4 Decoders in VHDL

The following code implies a decoder:

library IEEE;
use IEEE.STD_LOGIC_1164.all; use IEEE.NUMERIC_STD.all;
entity Decoder is port (enable : in BIT;
  Din: STD_LOGIC_VECTOR (2 downto 0);
  Dout: out STD_LOGIC_VECTOR (7 downto 0));
end Decoder;
architecture Synthesis_1 of Decoder is
begin
  with enable select Dout <=
    STD_LOGIC_VECTOR
      (UNSIGNED'
        (shift_left
          ("00000001", TO_INTEGER (UNSIGNED(Din))
          )
        )
      )
    when '1',
  "11111111" when '0', "00000000" when others;
end Synthesis_1;

There are reasons for this seemingly complex code:
• Line 1 declares the IEEE library. The synthesizer does not parse the VHDL code inside the library packages, but the synthesis company should be able to guarantee that the logic will behave exactly the same way as a simulation that uses the IEEE libraries and does parse the code.
• Line 2 declares the STD_LOGIC_1164 package, for STD_LOGIC types, and the NUMERIC_STD package for conversion and shift functions. The shift operators (sll and so on, the infix operators) were introduced in VHDL-93; they are not defined for STD_LOGIC types in the 1164 standard. The shift functions defined in NUMERIC_STD are not operators and are called shift_left and so on. Some synthesis tools support NUMERIC_STD, but not VHDL-93.
• Line 10 performs a type conversion to STD_LOGIC_VECTOR from UNSIGNED.
• Line 11 is a type qualification to tell the software that the argument to the type conversion function is type UNSIGNED.
• Line 12 is the shift function, shift_left, from the NUMERIC_STD package.
• Line 13 converts the STD_LOGIC_VECTOR, Din, to UNSIGNED before converting to INTEGER. We cannot convert directly from STD_LOGIC_VECTOR to INTEGER.
• The others clause in line 18 is required by the logic synthesizer even though type BIT may only be '0' or '1'.
If we model a decoder using a process, we can use a case statement inside the process. A MUX model may be used as a decoder if the input bits are set at '1' (active-high decoder) or at '0' (active-low decoder), as in the following example:

library IEEE;
use IEEE.NUMERIC_STD.all; use IEEE.STD_LOGIC_1164.all;
entity Concurrent_Decoder is port (
  enable : in BIT;
  Din : in STD_LOGIC_VECTOR (2 downto 0);
  Dout : out STD_LOGIC_VECTOR (7 downto 0));
end Concurrent_Decoder;
architecture Synthesis_1 of Concurrent_Decoder is
begin process (Din, enable)
  variable T : STD_LOGIC_VECTOR(7 downto 0);
  begin
  if (enable = '1') then
    T := "00000000"; T( TO_INTEGER (UNSIGNED(Din))) := '1';
    Dout <= T ;
  else Dout <= (others => 'Z');
  end if;
end process;
end Synthesis_1;

Notice that T must be a variable for proper timing of the update to the output. The else clause in the if statement is necessary to avoid inferring latches.
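For readers who want to compare the two languages, a 3-to-8 decoder with the same enable behavior as the first VHDL model (all outputs high when disabled) might be sketched in Verilog like this. The module name Decoder_V is hypothetical.

```verilog
// Verilog sketch of a 3-to-8 decoder with an enable input.
module Decoder_V (enable, Din, Dout);
  input        enable;
  input  [2:0] Din;
  output [7:0] Dout;
  // When enabled, shift a single 1 into the position selected by Din;
  // when disabled, drive all ones.
  assign Dout = enable ? (8'b00000001 << Din) : 8'b11111111;
endmodule
```

No type conversions are needed here because Verilog vectors can be used directly as shift amounts.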
12.6.5 Adders in VHDL

To add two n-bit numbers and keep the overflow bit, we need to assign to a signal with more bits, as follows:

library IEEE;
use IEEE.NUMERIC_STD.all; use IEEE.STD_LOGIC_1164.all;
entity Adder_1 is
port (A, B: in UNSIGNED(3 downto 0); C: out UNSIGNED(4 downto 0));
end Adder_1;
architecture Synthesis_1 of Adder_1 is
  begin C <= ('0' & A) + ('0' & B);
end Synthesis_1;

Notice that both A and B have to be SIGNED or UNSIGNED, as we cannot add STD_LOGIC_VECTOR types directly using the IEEE packages. You will get an error if a result is a different length from the target of an assignment, as in the following example (in which the arguments are not resized):

adder_1: begin C <= A + B;
Error: Width mis-match: right expression is 4 bits wide, c is 5 bits wide

The following code may generate three adders stacked three deep:

z <= a + b + c + d;

Depending on how the expression is parsed, the first adder may perform x = a + b, a second adder y = x + c, and a third adder z = y + d. The following code should generate faster logic with three adders stacked only two deep:

z <= (a + b) + (c + d);
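The same two ideas, widening the operands to keep the carry and grouping the additions to shorten the adder chain, can be sketched in Verilog. The module and port names below are hypothetical, and whether a given synthesizer actually builds a balanced tree from the parentheses is an assumption; some tools restructure arithmetic on their own.

```verilog
// Verilog sketch: keeping the carry bit and grouping additions.
module Adder_Tree_V (a, b, c, d, sum5, z_chain, z_tree);
  input  [3:0] a, b, c, d;
  output [4:0] sum5;
  output [5:0] z_chain, z_tree;
  // Extend both operands by one bit so the carry out is kept,
  // mirroring ('0' & A) + ('0' & B) in the VHDL model.
  assign sum5 = {1'b0, a} + {1'b0, b};
  // Written left to right: may be built as three adders in series.
  assign z_chain = a + b + c + d;
  // Grouped: two adders in parallel feeding a third.
  assign z_tree = (a + b) + (c + d);
endmodule
```

Both z_chain and z_tree compute the same value; only the suggested structure, and therefore the likely delay, differs.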
12.6.6 Sequential Logic in VHDL

Sensitivity to an edge implies sequential logic in VHDL. A synthesis tool can locate edges in VHDL by finding a process statement that has either:
• no sensitivity list with a wait until statement
• a sensitivity list and test for 'EVENT plus a specific level
Any signal assigned in an edge-sensitive process statement should also be reset, but be careful to distinguish between asynchronous and synchronous resets. The following example illustrates these points:

library IEEE; use IEEE.STD_LOGIC_1164.all;
entity DFF_With_Reset is
  port (D, Clk, Reset : in STD_LOGIC; Q : out STD_LOGIC);
end DFF_With_Reset;

architecture Synthesis_1 of DFF_With_Reset is
begin process (Clk, Reset) begin
  if (Reset = '0') then Q <= '0'; -- asynchronous reset
  elsif rising_edge(Clk) then Q <= D;
  end if;
end process;
end Synthesis_1;

architecture Synthesis_2 of DFF_With_Reset is
begin process begin
  wait until rising_edge(Clk);
  -- This reset is gated with the clock and is synchronous:
  if (Reset = '0') then Q <= '0'; else Q <= D; end if;
end process;
end Synthesis_2;

Sequential logic results when we have to remember something between successive executions of a process statement. This occurs when a process statement contains one or more of the following situations:
• A signal is read but is not in the sensitivity list of a process statement.
• A signal or variable is read before it is updated.
• A signal is not always updated.
• There are multiple wait statements.
Not all of the models that we could write using the above constructs will be synthesizable. Any models that do use one or more of these constructs and that are synthesizable will result in sequential logic.
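The third situation in the list above, a signal that is not always updated, has a direct Verilog analogue. The following sketch (module names are hypothetical) contrasts an incomplete if statement, which most synthesizers would turn into a latch, with a complete one that remains combinational.

```verilog
// q is assigned only when en is 1, so its old value must be
// remembered when en is 0: a level-sensitive latch is implied.
module Latch_Inferred (en, d, q);
  input en, d;
  output q;
  reg q;
  always @(en or d)
    if (en) q = d;
endmodule

// q is assigned on every path, so nothing needs to be stored
// and the logic is purely combinational.
module Comb_Only (en, d, q);
  input en, d;
  output q;
  reg q;
  always @(en or d)
    if (en) q = d;
    else    q = 1'b0;
endmodule
```

This is the same reasoning that makes the else clause necessary in the process-based decoder model.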
12.6.7 Instantiation in VHDL

The easiest way to find out how to hand instantiate a component is to generate a structural netlist from a simple HDL input, for example, the following Verilog behavioral description (VHDL could have been used, but the Verilog is shorter):

`timescale 1ns/1ns
module halfgate (myInput, myOutput);
input myInput; output myOutput; wire myOutput;
assign myOutput = ~myInput;
endmodule

We synthesize this module and generate the following VHDL structural netlist:

library IEEE; use IEEE.STD_LOGIC_1164.all;
library COMPASS_LIB; use COMPASS_LIB.COMPASS.all;
--compass compile_off -- synopsys etc.
use COMPASS_LIB.COMPASS_ETC.all;
--compass compile_on -- synopsys etc.
entity halfgate_u is
--compass compile_off -- synopsys etc.
  generic (
    myOutput_cap : Real := 0.01;
    INSTANCE_NAME : string := "halfgate_u" );
--compass compile_on -- synopsys etc.
  port ( myInput : in Std_Logic := 'U';
    myOutput : out Std_Logic := 'U' );
end halfgate_u;
architecture halfgate_u of halfgate_u is
  component in01d0
    port ( I : in Std_Logic; ZN : out Std_Logic ); end component;
begin
  u2: in01d0 port map ( I => myInput, ZN => myOutput );
end halfgate_u;
--compass compile_off -- synopsys etc.
library cb60hd230d;
configuration halfgate_u_CON of halfgate_u is
  for halfgate_u
    for u2 : in01d0 use configuration cb60hd230d.in01d0_CON
      generic map (
        ZN_cap => 0.0100 + myOutput_cap,
        INSTANCE_NAME => INSTANCE_NAME&"/u2" )
      port map ( I => I, ZN => ZN);
    end for;
  end for;
end halfgate_u_CON;
--compass compile_on -- synopsys etc.
This gives a template to follow when hand instantiating logic cells. Instantiating a standard component requires the name of the component and its parameters:

component ASDFF
  generic (WIDTH : POSITIVE := 1;
    RESET_VALUE : STD_LOGIC_VECTOR := "0" );
  port (Q : out STD_LOGIC_VECTOR (WIDTH-1 downto 0);
    D : in STD_LOGIC_VECTOR (WIDTH-1 downto 0);
    CLK : in STD_LOGIC;
    RST : in STD_LOGIC );
end component;

Now you have enough information to be able to instantiate both logic cells from a cell library and standard components. The following model illustrates instantiation:

library IEEE, COMPASS_LIB;
use IEEE.STD_LOGIC_1164.all; use COMPASS_LIB.STDCOMP.all;
entity Ripple_4 is
port (Trig, Reset: STD_LOGIC; QN0_5x: out STD_LOGIC;
  Q : inout STD_LOGIC_VECTOR(0 to 3));
end Ripple_4;
architecture structure of Ripple_4 is
signal QN : STD_LOGIC_VECTOR(0 to 3);
component in01d1
  port ( I : in Std_Logic; ZN : out Std_Logic ); end component;
component in01d5
  port ( I : in Std_Logic; ZN : out Std_Logic ); end component;
begin
--compass dontTouch inv5x -- synopsys dont_touch etc.
-- Named association for hand-instantiated library cells:
inv5x: IN01D5 port map ( I=>Q(0), ZN=>QN0_5x );
inv0 : IN01D1 port map ( I=>Q(0), ZN=>QN(0) );
inv1 : IN01D1 port map ( I=>Q(1), ZN=>QN(1) );
inv2 : IN01D1 port map ( I=>Q(2), ZN=>QN(2) );
inv3 : IN01D1 port map ( I=>Q(3), ZN=>QN(3) );
-- Positional association for standard components:
--                  Q           D           Clk   Rst
d0: asDFF port map (Q (0 to 0), QN(0 to 0), Trig, Reset);
d1: asDFF port map (Q (1 to 1), QN(1 to 1), Q(0), Reset);
d2: asDFF port map (Q (2 to 2), QN(2 to 2), Q(1), Reset);
d3: asDFF port map (Q (3 to 3), QN(3 to 3), Q(2), Reset);
end structure;

• Lines 5 and 8. Type STD_LOGIC_VECTOR must be used for standard component ports, because the standard components are defined using this type.
• Line 5. Mode inout has to be used for Q since it has to be read/write and this is a structural model. You cannot use mode buffer since the formal outputs of the standard components are declared to be of mode out.
• Line 14. This synthesis directive prevents the synthesis tool from removing the 5X drive strength inverter inv5x. This statement ties the code to a particular synthesis tool.
• Lines 16-20. Named association for the hand-instantiated library cells. The names (IN01D5 and IN01D1) and port names (I and ZN) come from the cell library data book or from a template (such as the one created for the IN01D1 logic cell). These statements tie the code to a particular cell library.
• Lines 23-26. Positional port mapping of the standard components. The port locations are from the synthesis standard component library documentation. These asDFF standard components will be mapped to D flip-flop library cells. These statements tie the code to a particular synthesis tool.
You would receive the following warning from the logic synthesizer when it synthesizes this input code (entity Ripple_4):

Warning: Net has more than one driver: d3_Q[0]; connected to: ripple_4_p.q[3], inv3.I, d3.Q

There is potentially more than one driver on a net because Q was declared as inout. There are four warnings of this type in total, one for each of the flip-flop outputs. You can check the output netlist to make sure that you have the logic you expected as follows (the Verilog netlist is shorter and easier to read):

`timescale 1ns / 10ps
module ripple_4_u (trig, reset, qn0_5x, q);
input trig; input reset; output qn0_5x; inout [3:0] q;
wire [3:0] qn; supply1 VDD; supply0 VSS;
in01d5 inv5x (.I(q[0]),.ZN(qn0_5x));
in01d1 inv0 (.I(q[0]),.ZN(qn[0]));
in01d1 inv1 (.I(q[1]),.ZN(qn[1]));
in01d1 inv2 (.I(q[2]),.ZN(qn[2]));
in01d1 inv3 (.I(q[3]),.ZN(qn[3]));
dfctnb d0(.D(qn[0]),.CP(trig),.CDN(reset),.Q(q[0]),.QN(\d0.QN ));
dfctnb d1(.D(qn[1]),.CP(q[0]),.CDN(reset),.Q(q[1]),.QN(\d1.QN ));
dfctnb d2(.D(qn[2]),.CP(q[1]),.CDN(reset),.Q(q[2]),.QN(\d2.QN ));
dfctnb d3(.D(qn[3]),.CP(q[2]),.CDN(reset),.Q(q[3]),.QN(\d3.QN ));
endmodule
12.6.8 Shift Registers and Clocking in VHDL

The following code implies a serial-in/parallel-out (SIPO) shift register:

library IEEE;
use IEEE.STD_LOGIC_1164.all; use IEEE.NUMERIC_STD.all;
entity SIPO_1 is port (
  Clk : in STD_LOGIC;
  SI : in STD_LOGIC; -- serial in
  PO : buffer STD_LOGIC_VECTOR(3 downto 0)); -- parallel out
end SIPO_1;
architecture Synthesis_1 of SIPO_1 is
  begin process (Clk) begin
    if (Clk = '1') then PO <= SI & PO(3 downto 1); end if;
  end process;
end Synthesis_1;

Here is the Verilog structural netlist that results (dfntnb is a positive-edge-triggered D flip-flop without clear or reset):

module sipo_1_u (clk, si, po);
input clk; input si; output [3:0] po;
supply1 VDD; supply0 VSS;
dfntnb po_ff_b0 (.D(po[1]),.CP(clk),.Q(po[0]),.QN(\po_ff_b0.QN ));
dfntnb po_ff_b1 (.D(po[2]),.CP(clk),.Q(po[1]),.QN(\po_ff_b1.QN ));
dfntnb po_ff_b2 (.D(po[3]),.CP(clk),.Q(po[2]),.QN(\po_ff_b2.QN ));
dfntnb po_ff_b3 (.D(si),.CP(clk),.Q(po[3]),.QN(\po_ff_b3.QN ));
endmodule

The synthesized design consists of four flip-flops. Notice that (line 6 in the VHDL input) signal PO is of mode buffer because we cannot read a signal of mode out inside a process. This is acceptable for synthesis but not usually a good idea for simulation models. We can modify the code to eliminate the buffer port and at the same time we shall include a reset signal, as follows:

library IEEE;
use IEEE.STD_LOGIC_1164.all; use IEEE.NUMERIC_STD.all;
entity SIPO_R is port (
  clk : in STD_LOGIC ; res : in STD_LOGIC ;
  SI : in STD_LOGIC ; PO : out STD_LOGIC_VECTOR(3 downto 0));
end;
architecture Synthesis_1 of SIPO_R is
  signal PO_t : STD_LOGIC_VECTOR(3 downto 0);
begin
  process (PO_t) begin PO <= PO_t; end process;
  process (clk, res) begin
    if (res = '0') then PO_t <= (others => '0');
    elsif (rising_edge(clk)) then PO_t <= SI & PO_t(3 downto 1);
    end if;
  end process;
end Synthesis_1;

Notice the following:
• Line 10 uses a temporary signal, PO_t, to avoid using a port of mode buffer for the output signal PO. We could have used a variable instead of a signal and the variable would consume less overhead during simulation. However, we must complete an assignment to a variable inside the clocked process (not in a separate process as we can for the signal). Assignment between a variable and a signal inside a single process creates its own set of problems.
• Line 11 is sensitive to the clock, clk, and the reset, res. It is not sensitive to PO_t or SI and this is what indicates the sequential logic.
• Line 13 uses the rising_edge function from the STD_LOGIC_1164 package.
The software synthesizes four positive-edge-triggered D flip-flops for design entity SIPO_R(Synthesis_1) as it did for design entity SIPO_1(Synthesis_1). The difference is that the synthesized flip-flops in SIPO_R have active-low resets. However, the simulation behavior of these two design entities will be different. In SIPO_R, the function rising_edge only evaluates to TRUE for a transition from '0' or 'L' to '1' or 'H'. In SIPO_1 we only tested for Clk = '1'. Since nearly all synthesis tools now accept rising_edge and falling_edge, it is probably wiser to use these functions consistently.
12.6.9 Adders and Arithmetic Functions

If you wish to perform BIT_VECTOR or STD_LOGIC_VECTOR arithmetic you have three choices:
• Use a vendor-supplied package (there are no standard vendor packages, even if a company puts its own package in the IEEE library).
• Convert to SIGNED (or UNSIGNED) and use the IEEE standard synthesis packages (IEEE Std 1076.3-1997).
• Use overloaded functions in packages or functions that you define yourself.
Here is an example of addition using a ripple-carry architecture:

library IEEE;
use IEEE.STD_LOGIC_1164.all; use IEEE.NUMERIC_STD.all;
entity Adder4 is port (
  in1, in2 : in BIT_VECTOR(3 downto 0) ;
  mySum : out BIT_VECTOR(3 downto 0) ) ;
end Adder4;
architecture Behave_A of Adder4 is
  function DIY(L,R: BIT_VECTOR(3 downto 0)) return BIT_VECTOR is
  variable sum:BIT_VECTOR(3 downto 0); variable lt,rt,st,cry: BIT;
  begin cry := '0';
    for i in L'REVERSE_RANGE loop
      lt := L(i); rt := R(i); st := lt xor rt;
      sum(i):= st xor cry; cry:= (lt and rt) or (st and cry);
    end loop;
    return sum;
  end;
begin mySum <= DIY (in1, in2); -- do it yourself (DIY) add
end Behave_A;

This model results in random logic. An alternative is to use UNSIGNED or SIGNED from the IEEE NUMERIC_STD or NUMERIC_BIT packages, as in the following example:

library IEEE;
use IEEE.STD_LOGIC_1164.all; use IEEE.NUMERIC_STD.all;
entity Adder4 is port (
  in1, in2 : in UNSIGNED(3 downto 0) ;
  mySum : out UNSIGNED(3 downto 0) ) ;
end Adder4;
architecture Behave_B of Adder4 is
begin mySum <= in1 + in2; -- This uses an overloaded '+'.
end Behave_B;
In this case, the synthesized logic will depend on the logic synthesizer.
12.6.10 Adder/Subtracter and Don't Cares

The following code models a 16-bit sequential adder and subtracter. The input signal, xin, is added to output signal, result, when signal addsub is high; otherwise result is subtracted from xin. The internal signal addout temporarily stores the result until the next rising edge of the clock:

library IEEE;
use IEEE.STD_LOGIC_1164.all; use IEEE.NUMERIC_STD.all;
entity Adder_Subtracter is port (
  xin : in UNSIGNED(15 downto 0);
  clk, addsub, clr: in STD_LOGIC;
  result : out UNSIGNED(15 downto 0));
end Adder_Subtracter;
architecture Behave_A of Adder_Subtracter is
  signal addout, result_t: UNSIGNED(15 downto 0);
begin
  result <= result_t;
  with addsub select
    addout <= (xin + result_t) when '1',
              (xin - result_t) when '0',
              (others => '-') when others;
  process (clr, clk) begin
    if (clr = '0') then result_t <= (others => '0');
    elsif rising_edge(clk) then result_t <= addout;
    end if;
  end process;
end Behave_A;
Notice the following:

• Line 11 is a concurrent assignment to avoid using a port of mode buffer.
• Lines 12-15 define an exhaustive list of choices for the selected signal assignment statement. The default choice sets the result to '-' (don't care) to allow the synthesizer to optimize the logic.

Line 18 includes a reference to signal addout that could be eliminated by moving the selected signal assignment statement inside the clocked process as follows:

architecture Behave_B of Adder_Subtracter is
  signal result_t: UNSIGNED(15 downto 0);
begin
  result <= result_t;
  process (clr, clk) begin
    if (clr = '0') then result_t <= (others => '0');
    elsif rising_edge(clk) then
      case addsub is
        when '1' => result_t <= (xin + result_t);
        when '0' => result_t <= (xin - result_t);
        when others => result_t <= (others => '-');
      end case;
    end if;
  end process;
end Behave_B;

This code is simpler than architecture Behave_A, but the synthesized logic should be identical for both architectures. Since the logic that results is an adder/subtracter followed by a register (bank of flip-flops), the Behave_A model more clearly reflects the hardware.
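As a rough Verilog counterpart of the single-process form, the accumulator could be sketched as follows. The module name Adder_Subtracter_V is hypothetical. Verilog's two-way if statement already covers every value of addsub that hardware can take, so this sketch has no explicit don't care branch; that difference from the VHDL model is a simplification, not a claim about how any particular tool optimizes.

```verilog
// Verilog sketch of a 16-bit registered adder/subtracter with
// an asynchronous, active-low clear.
module Adder_Subtracter_V (xin, clk, addsub, clr, result);
  input  [15:0] xin;
  input  clk, addsub, clr;
  output [15:0] result;
  reg    [15:0] result;
  always @(posedge clk or negedge clr)
    if (clr == 0)         result <= 16'b0;          // clear the register
    else if (addsub == 1) result <= xin + result;   // accumulate
    else                  result <= xin - result;   // subtract result from xin
endmodule
```

Here the output register is read directly inside the clocked block, so no temporary signal or buffer-mode port is needed.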
12.7 Finite-State Machine Synthesis

There are three ways to synthesize a finite-state machine (FSM):

1. Omit any special synthesis directives and let the logic synthesizer operate on the state machine as though it were random logic. This will prevent any reassignment of states or state machine optimization. It is the easiest method and independent of any particular synthesis tool, but is the most inefficient approach in terms of area and performance.
2. Use directives to guide the logic synthesis tool to improve or modify state assignment. This approach is dependent on the software that you use.
3. Use a special state-machine compiler, separate from the logic synthesizer, to optimize the state machine. You then merge the resulting state machine with the rest of your logic. This method leads to the best results but is harder to use and ties your code to a particular set of software tools, not just the logic synthesizer.

Most synthesis tools require that you write a state machine using a certain style: a special format or template. Synthesis tools may also require that you declare an FSM, the encoding, and the state register using a synthesis directive or special software command. Common FSM encoding options are:
• Adjacent encoding assigns states by the minimum logic difference in the state transition graph. This normally reduces the amount of logic needed to decode each state. The minimum number of bits in the state register for an FSM with n states is log2 n, rounded up to the next integer. In some tools you may increase the state register width up to n to generate encoding based on Gray codes.
• One-hot encoding sets one bit in the state register for each state. This technique seems wasteful. For example, an FSM with 16 states requires 16 flip-flops for one-hot encoding but only four if you use a binary encoding. However, one-hot encoding simplifies the logic and also the interconnect between the logic. One-hot encoding often results in smaller and faster FSMs. This is especially true in programmable ASICs with large amounts of sequential logic relative to combinational logic resources.
• Random encoding assigns a random code for each state.
• User-specified encoding keeps the explicit state assignment from the HDL.
• Moore encoding is useful for FSMs that require fast outputs. A Moore state machine has outputs that depend only on the current state (Mealy state machine outputs depend on the current state and the inputs).
You need to consider how the reset of the state register will be handled in the synthesized hardware. In a programmable ASIC there are often limitations on the polarity of the flip-flop resets. For example, in some FPGAs all flip-flop resets must be of the same polarity (and this restriction may be absent, or different, for the internal flip-flops and the flip-flops in the I/O cells). Thus, for example, if you try to assign the reset state as '0101', it may not be possible to set two flip-flops to '0' and two flip-flops to '1' at the same time in an FPGA. This may be handled by assigning the reset state, resSt, to '0000' or '1111' and inverting the appropriate two bits of the state register wherever they are used.

You also need to consider the initial value of the state register in the synthesized hardware. In some reprogrammable FPGAs, after programming is complete the flip-flops may all be initialized to a value that may not correspond to the reset state. Thus if the flip-flops are all set to '1' at start-up and the reset state is '0000', the initial state is '1111' and not the reset state. For this reason, and also to ensure fail-safe behavior, it is important that the behavior of the FSM is defined for every possible value of the state register.
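The bit-inversion workaround for reset polarity can be illustrated with a short sketch. Everything here is hypothetical: the module name, the MASK parameter, and especially the next-state logic, which is a simple increment used only as a placeholder. The point is that the physical flip-flops all clear to '0' on reset while the rest of the design sees the logical reset state '0101'.

```verilog
// Sketch: logical reset state 0101 stored in flip-flops that all reset to 0.
module ResetPolarity_Sketch (clk, reset, state);
  input  clk, reset;
  output [3:0] state;
  parameter [3:0] MASK = 4'b0101;   // the desired logical reset state
  reg  [3:0] stored;                // physical state register
  wire [3:0] nextState;
  // Invert the masked bits wherever the register is read...
  assign state = stored ^ MASK;
  // Placeholder next-state logic (an assumption for illustration only).
  assign nextState = state + 1;
  always @(posedge clk or posedge reset)
    if (reset == 1) stored <= 4'b0000;            // every flip-flop clears
    else            stored <= nextState ^ MASK;   // ...and wherever it is written
endmodule
```

Immediately after reset, stored is 0000 and state reads as 0101, so only one reset polarity is required of the flip-flops.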
12.7.1 FSM Synthesis in Verilog

The following FSM model uses paired processes. The first process synthesizes to sequential logic and the second process synthesizes to combinational logic:
    `define resSt 0
    `define S1 1
    `define S2 2
    `define S3 3
    module StateMachine_1 (reset, clk, yOutReg);
    input reset, clk; output yOutReg;
    reg yOutReg, yOut; reg [1:0] curSt, nextSt;
    always @(posedge clk or posedge reset)
    begin :Seq //Compass statemachine oneHot curSt
      if (reset == 1)
        begin yOut = 0; yOutReg = yOut; curSt = `resSt; end
      else begin
        case (curSt)
          `resSt:yOut = 0; `S1:yOut = 1; `S2:yOut = 1; `S3:yOut = 1;
          default:yOut = 0;
        endcase
        yOutReg = yOut; curSt = nextSt; // ... update the state.
      end
    end
    always @(curSt or yOut) // Assign the next state:
    begin :Comb
      case (curSt)
        `resSt:nextSt = `S3; `S1:nextSt = `S2;
        `S2:nextSt = `S1; `S3:nextSt = `S1;
        default:nextSt = `resSt;
      endcase
    end
    endmodule
Synopsys uses separate pseudocomments to define the states and state vector as in the following example:
    module StateMachine_2 (reset, clk, yOutReg);
    input reset, clk; output yOutReg; reg yOutReg, yOut;
    parameter [1:0] //synopsys enum states
      resSt = 2'b00, S1 = 2'b01, S2 = 2'b10, S3 = 2'b11;
    reg [1:0] /* synopsys enum states */ curSt, nextSt;
    //synopsys state_vector curSt
    always @(posedge clk or posedge reset) begin
      if (reset == 1)
        begin yOut = 0; yOutReg = yOut; curSt = resSt; end
      else begin
        case (curSt)
          resSt:yOut = 0; S1:yOut = 1; S2:yOut = 1; S3:yOut = 1;
          default:yOut = 0;
        endcase
        yOutReg = yOut; curSt = nextSt;
      end
    end
    always @(curSt or yOut) begin
      case (curSt)
        resSt:nextSt = S3; S1:nextSt = S2; S2:nextSt = S1; S3:nextSt = S1;
        default:nextSt = S1;
      endcase
    end
    endmodule
To change encoding we can assign states explicitly by altering lines 3–4 (the parameter declaration) to the following, for example:
    parameter [3:0] //synopsys enum states
      resSt = 4'b0000, S1 = 4'b0010, S2 = 4'b0100, S3 = 4'b1000;
(The declaration of curSt and nextSt in line 5 must then also be widened to [3:0] so that the state register can hold the 4-bit codes.)
12.7.2 FSM Synthesis in VHDL

The first architecture that follows is a template for a Moore state machine:
    library IEEE; use IEEE.STD_LOGIC_1164.all;
    entity SM1 is
      port (aIn, clk : in Std_logic; yOut: out Std_logic);
    end SM1;
    architecture Moore of SM1 is
      type state is (s1, s2, s3, s4);
      signal pS, nS : state;
    begin
      process (aIn, pS) begin
        case pS is
          when s1 => yOut <= '0'; nS <= s4;
          when s2 => yOut <= '1'; nS <= s3;
          when s3 => yOut <= '1'; nS <= s1;
          when s4 => yOut <= '1'; nS <= s2;
        end case;
      end process;
      process begin
        -- synopsys etc.
        --compass Statemachine adj pS
        wait until clk = '1'; pS <= nS;
      end process;
    end Moore;
An example input, aIn, is included but not used in the next state assignments. A reset is also omitted to further simplify this example. An FSM compiler extracts the state machine. Some companies use FSM compilers that are separate from the logic synthesizers (and priced separately) because the algorithms for FSM optimization are different from those for optimizing combinational logic. We can see what is happening by asking the Compass synthesizer to write out intermediate results. The synthesizer extracts the FSM and produces the following output in a state-machine language used by the tools:
    sm sm1_ps_sm;
    inputs;
    outputs yout_smo;
    clock clk;
    STATE S1 { let yout_smo=0 ; } --> S4;
    STATE S2 { let yout_smo=1 ; } --> S3;
    STATE S3 { let yout_smo=1 ; } --> S1;
    STATE S4 { let yout_smo=1 ; } --> S2;
    end
You can use this language to modify the FSM and then use this modified code as an input to the synthesizer if you wish. In our case, it serves as documentation that explains the FSM behavior.
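As a cross-check of the extracted state machine, the transitions and outputs listed above can be stepped through in a few lines. This is only a behavioral sketch in Python (not tool output); the table names are illustrative.

```python
# Next-state and output tables copied from the Moore template above.
NEXT_STATE = {"s1": "s4", "s2": "s3", "s3": "s1", "s4": "s2"}
OUTPUT = {"s1": 0, "s2": 1, "s3": 1, "s4": 1}

def run_moore(start, n_clocks):
    """Return the (state, yOut) pair seen in each clock cycle.
    In a Moore machine the output depends only on the current state."""
    state = start
    trace = []
    for _ in range(n_clocks):
        trace.append((state, OUTPUT[state]))
        state = NEXT_STATE[state]
    return trace

if __name__ == "__main__":
    for state, y in run_moore("s1", 5):
        print(state, y)
```

Starting from s1 the machine cycles through s4, s2, and s3 before returning to s1, and yOut is '0' only in s1.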
Using one-hot encoding generates the following structural Verilog netlist (dfntnb is a positive-edge-triggered D flip-flop, and nd03d0 is a three-input NAND):
    dfntnb sm_ps4(.D(sm_ps1_Q),.CP(clk),.Q(sm_ps4_Q),.QN(sm_ps4_QN));
    dfntnb sm_ps3(.D(sm_ps2_Q),.CP(clk),.Q(sm_ps3_Q),.QN(sm_ps3_QN));
    dfntnb sm_ps2(.D(sm_ps4_Q),.CP(clk),.Q(sm_ps2_Q),.QN(sm_ps2_QN));
    dfntnb sm_ps1(.D(sm_ps3_Q),.CP(clk),.Q(sm_ps1_Q),.QN(\sm_ps1.QN ));
    nd03d0 i_6(.A1(sm_ps4_QN),.A2(sm_ps3_QN),.A3(sm_ps2_QN), .ZN(yout_smo));
(Each example shows only the logic cells and their interconnection in the Verilog structural netlists.) The synthesizer has assigned one flip-flop to each of the four states to form a 4-bit state register. The FSM output (renamed from yOut to yout_smo by the software) is taken from the output of the three-input NAND gate that decodes the outputs from the flip-flops in the state register. Using adjacent encoding gives a simpler result,
    dfntnb sm_ps2(.D(i_4_ZN),.CP(clk), .Q(\sm_ps2.Q ),.QN(sm_ps2_QN));
    dfntnb sm_ps1(.D(sm_ps1_QN),.CP(clk),.Q(\sm_ps1.Q ),.QN(sm_ps1_QN));
    oa04d1 i_4(.A1(sm_ps1_QN),.A2(sm_ps2_QN),.B(yout_smo),.ZN(i_4_ZN));
    nd02d0 i_5(.A1(sm_ps2_QN), .A2(sm_ps1_QN), .ZN(yout_smo));
(oa04d1 is an OAI21 logic cell, nd02d0 is a two-input NAND). In this case binary encoding for the four states uses only two flip-flops. The two-input NAND gate decodes the states to produce the output. The OAI21 logic cell implements the logic that determines the next state. The combinational logic in this example is only slightly more complex than that for the one-hot encoding, but, in general, combinational logic for one-hot encoding is simpler than the other forms of encoding. Using the option 'moore' for Moore encoding, we receive the following message from
the FSM compiler:
    The states were assigned these codes:
    0?? : S1
    100 : S2
    101 : S3
    110 : S4
The FSM compiler has assigned three bits to the state register. The first bit in the state register is used as the output. We can see more clearly what has happened by looking at the Verilog structural netlist:
    dfntnb sm_ps3(.D(i_6_ZN),.CP(clk),.Q(yout_smo),.QN(sm_ps3_QN));
    dfntnb sm_ps2(.D(sm_ps3_QN),.CP(clk),.Q(sm_ps2_Q),.QN(\sm_ps2.QN ));
    dfntnb sm_ps1(.D(i_5_ZN),.CP(clk),.Q(sm_ps1_Q),.QN(\sm_ps1.QN ));
    nr02d0 i_5(.A1(sm_ps3_QN),.A2(sm_ps2_Q),.ZN(i_5_ZN));
    nd02d0 i_6(.A1(sm_ps1_Q),.A2(yout_smo),.ZN(i_6_ZN));
The output, yout_smo, is now taken directly from a flip-flop. This means that the output appears after the clock edge with no combinational logic delay (only the clock-to-Q delay). This is useful for FSMs that are required to produce outputs as soon as possible after the active clock edge (in PCI bus controllers, for example).
The following code is a template for a Mealy state machine:
    library IEEE; use IEEE.STD_LOGIC_1164.all;
    entity SM2 is
      port (aIn, clk : in Std_logic; yOut: out Std_logic);
    end SM2;
    architecture Mealy of SM2 is
      type state is (s1, s2, s3, s4);
      signal pS, nS : state;
    begin
      process (aIn, pS) begin
        case pS is
          when s1 => if (aIn = '1')
            then yOut <= '0'; nS <= s4;
            else yOut <= '1'; nS <= s3;
            end if;
          when s2 => yOut <= '1'; nS <= s3;
          when s3 => yOut <= '1'; nS <= s1;
          when s4 => if (aIn = '1')
            then yOut <= '1'; nS <= s2;
            else yOut <= '0'; nS <= s1;
            end if;
        end case;
      end process;
      process begin
        wait until clk = '1' ;
        --Compass Statemachine oneHot pS
        pS <= nS;
      end process;
    end Mealy;
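The difference from the Moore template is that yOut now depends on aIn as well as on the present state. A minimal behavioral sketch in Python (an illustration only; the function names are hypothetical) makes this visible:

```python
def mealy_step(state, a_in):
    """Return (yOut, next_state) for the Mealy template above.
    The output is a function of both the present state and the input."""
    if state == "s1":
        return (0, "s4") if a_in else (1, "s3")
    if state == "s2":
        return (1, "s3")
    if state == "s3":
        return (1, "s1")
    if state == "s4":
        return (1, "s2") if a_in else (0, "s1")
    raise ValueError("undefined state: " + str(state))

def run_mealy(start, inputs):
    """Apply one input value per clock; record (state, aIn, yOut)."""
    state = start
    trace = []
    for a_in in inputs:
        y, nxt = mealy_step(state, a_in)
        trace.append((state, a_in, y))
        state = nxt
    return trace

if __name__ == "__main__":
    for row in run_mealy("s1", [1, 1, 0, 0, 0]):
        print(row)
```

In state s1 the output is '0' when aIn is '1' and '1' when aIn is '0', which a Moore machine cannot express without adding states.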
12.8 Memory Synthesis

There are several approaches to memory synthesis:
1. Random logic using flip-flops or latches
2. Register files in datapaths
3. RAM standard components
4. RAM compilers
The first approach uses large vectors or arrays in the HDL code. The synthesizer will map these elements to arrays of flip-flops or latches depending on how the timing of the assignments is handled. This approach is independent of any software or type of ASIC and is the easiest to use but inefficient in terms of area. A flip-flop may take up 10 to 20 times the area of a six-transistor static RAM cell.
The second approach uses a synthesis directive or hand instantiation to synthesize a memory to a datapath component. Usually the datapath components are constructed from latches in a regular array. These are slightly more efficient than a random arrangement of logic cells, but the way we create the memory then depends on the software and the ASIC technology we are using.
The third approach uses standard components supplied by an ASIC vendor. For example, we can instantiate a small RAM using CLBs in a Xilinx FPGA. This approach is very dependent on the technology. For example, we could not easily transfer a design that uses Xilinx CLBs as SRAM to an Actel FPGA.
The last approach, using a custom RAM compiler, is the most area-efficient approach. It depends on having the capability to call a compiler from within the synthesis tool or to instantiate a component that has already been compiled.
12.8.1 Memory Synthesis in Verilog

Most synthesizers implement a Verilog memory array, such as the one shown in the following code, as an array of latches or flip-flops.
    reg [31:0] MyMemory [3:0]; // a 4 x 32-bit register
For example, the following code models a small RAM, and the synthesizer maps the memory array to sequential logic:
    module RAM_1(A, CEB, WEB, OEB, INN, OUTT);
    input [6:0] A; input CEB,WEB,OEB; input [4:0]INN;
    output [4:0] OUTT;
    reg [4:0] OUTT; reg [4:0] int_bus; reg [4:0] memory [127:0];
    always @(negedge CEB) begin
      if (CEB == 0) begin
        if (WEB == 1) int_bus = memory[A];
        else if (WEB == 0) begin memory[A] = INN; int_bus = INN; end
        else int_bus = 5'bxxxxx;
      end
    end
    always @(OEB or int_bus) begin
      case (OEB)
        0 : OUTT = int_bus;
        default : OUTT = 5'bzzzzz;
      endcase
    end
    endmodule
Memory synthesis using random control logic and transparent latches for each bit is reasonable only for small, fast register files, or for local RAM on an MGA or CBIC. For large RAMs synthesized memory becomes very expensive and instead you should normally use a dedicated RAM compiler. Typically there will be restrictions on synthesizing RAM with multiple read/writes:
If you write to the same memory in two different processes, be careful to avoid address contention.
You need a multiport RAM if you read or write to multiple locations simultaneously.
If you write and read the same memory location, you have to be very careful. To mimic hardware you need to read before you write so that you read the old memory value. If you attempt to write before reading, the difference between blocking and nonblocking assignments can lead to trouble.
You cannot make a memory access that depends on another memory access in the same clock cycle. For example, you cannot do this:
    memory[i + 1] = memory[i]; // needs two clock cycles
or this:
    pointer = memory[memory[i]]; // needs two clock cycles
For the same reason (but less obviously) we cannot do this:
    pc = memory[addr1]; memory[addr2] = pc + 1; // not on the same cycle
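The two-cycle restriction can be made concrete with a toy model. The sketch below is Python used for illustration, and it assumes a single-port RAM that allows one access per clock cycle (an assumption of this example, not a property of any particular library cell); the class and function names are hypothetical.

```python
class SinglePortRAM:
    """Toy single-port RAM: at most one read or write per clock cycle."""

    def __init__(self, depth):
        self.mem = [0] * depth
        self.cycle = 0
        self._port_used = False

    def tick(self):
        """Advance to the next clock cycle and free the port."""
        self.cycle += 1
        self._port_used = False

    def _claim_port(self):
        if self._port_used:
            raise RuntimeError("second memory access in the same cycle")
        self._port_used = True

    def read(self, addr):
        self._claim_port()
        return self.mem[addr]

    def write(self, addr, value):
        self._claim_port()
        self.mem[addr] = value

def copy_up(ram, i):
    """memory[i + 1] = memory[i], spread over two clock cycles."""
    value = ram.read(i)      # cycle 1: read the old value
    ram.tick()
    ram.write(i + 1, value)  # cycle 2: write it to the next location
    ram.tick()

if __name__ == "__main__":
    ram = SinglePortRAM(4)
    ram.mem[1] = 7
    copy_up(ram, 1)
    print(ram.mem, ram.cycle)
```

Attempting the read and the write without the intervening tick raises an error in this model, which mirrors the "needs two clock cycles" comments in the code fragments above.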
12.8.2 Memory Synthesis in VHDL

VHDL allows multidimensional arrays so that we can synthesize a memory as an array of latches by declaring a two-dimensional array as follows:
    type memStor is array(3 downto 0) of integer; -- This is OK.
    subtype MemReg is STD_LOGIC_VECTOR(15 downto 0); -- So is this.
    type memStor is array(3 downto 0) of MemReg;
    -- other code...
    signal Mem1 : memStor;
As an example, the following code models a standard-cell RAM:
    library IEEE;
    use IEEE.STD_LOGIC_1164.all;
    package RAM_package is
      constant numOut : INTEGER := 8;
      constant wordDepth: INTEGER := 8;
      constant numAddr : INTEGER := 3;
      subtype MEMV is STD_LOGIC_VECTOR(numOut-1 downto 0);
      type MEM is array (wordDepth-1 downto 0) of MEMV;
    end RAM_package;
    library IEEE;
    use IEEE.STD_LOGIC_1164.all; use IEEE.NUMERIC_STD.all;
    use work.RAM_package.all;
    entity RAM_1 is
      port (signal A : in STD_LOGIC_VECTOR(numAddr-1 downto 0);
        signal CEB, WEB, OEB : in STD_LOGIC;
        signal INN : in MEMV;
        signal OUTT : out MEMV);
    end RAM_1;
    architecture Synthesis_1 of RAM_1 is
      signal i_bus : MEMV; -- RAM internal data latch
      signal mem : MEM; -- RAM data
    begin
      process begin
        wait until CEB = '0';
        if WEB = '1' then i_bus <= mem(TO_INTEGER(UNSIGNED(A)));
        elsif WEB = '0' then
          mem(TO_INTEGER(UNSIGNED(A))) <= INN;
          i_bus <= INN;
        else i_bus <= (others => 'X');
        end if;
      end process;
      process (OEB, i_bus) begin -- control output drivers:
        case (OEB) is
          when '0' => OUTT <= i_bus;
          when '1' => OUTT <= (others => 'Z');
          when others => OUTT <= (others => 'X');
        end case;
      end process;
    end Synthesis_1;
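The intended behavior of this RAM model can be summarized in a short behavioral sketch. The Python below is only an illustration of the read, write, and output-enable behavior described by the code above; the class and method names are hypothetical, and 'X' and 'Z' are represented as strings.

```python
class Ram1Model:
    """Behavioral sketch of RAM_1: CEB falling edge starts an access,
    WEB selects read ('1') or write ('0'), OEB enables the output."""

    def __init__(self, word_depth=8):
        self.mem = [None] * word_depth  # None marks an unwritten word
        self.i_bus = None               # internal data latch

    def ceb_falling_edge(self, addr, web, inn=None):
        if web == 1:            # read: latch the addressed word
            self.i_bus = self.mem[addr]
        elif web == 0:          # write: store and latch the input data
            self.mem[addr] = inn
            self.i_bus = inn
        else:                   # unknown control value
            self.i_bus = "X"

    def outt(self, oeb):
        if oeb == 0:
            return self.i_bus   # output drivers enabled
        if oeb == 1:
            return "Z"          # high impedance
        return "X"

if __name__ == "__main__":
    ram = Ram1Model()
    ram.ceb_falling_edge(addr=3, web=0, inn=0xA5)  # write
    ram.ceb_falling_edge(addr=3, web=1)            # read back
    print(ram.outt(0), ram.outt(1))
```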
12.9 The Multiplier

This section looks at the messages that result from attempting to synthesize the VHDL code from Section 10.2, "A 4-bit Multiplier." The following examples use the line numbers that were assigned in the comments at the end of each line of code in Tables 10.1–10.9. The first problem arises in the following code (line 7 of the full adder in Table 10.1):
    Sum <= X xor Y xor Cin after TS;
    Warning: AFTER clause in a waveform element is not supported
This is not a serious problem if you are using a synchronous design style. If you are, then your logic will work whatever the delays (it may run slowly but it will work).
The next problem is from lines 3–4 of the 8-bit MUX in Table 10.5,
    port (A, B : in BIT_VECTOR (7 downto 0); Sel : in BIT := '0'; Y : out BIT_VECTOR (7 downto 0));
    Warning: Default values on interface signals are not supported
The synthesis tool cannot mimic the behavior of a default value on a port in the software model. The default value is the value given to an input if nothing is connected ('open' in VHDL). In hardware either an input is connected or it is not. If it is connected, there will be a voltage on the wire. If it is not connected, the node
will be floating. Default values are useful in VHDL: without a default value on an input port, an entity-architecture pair will not compile. The default value may be omitted in this model because this input port is connected at the next higher level of hierarchy.
The next problem illustrates what happens when a designer fails to think like the hardware (from line 3 of the zero-detector in Table 10.6),
    port (X:BIT_VECTOR; F:out BIT );
    Error: An index range must be specified for this data type
This code has the advantage of being flexible, but the synthesizer needs to know exactly how wide the bus will be. There are two other similar errors in shiftn, the variable-width shift register (from lines 4–5 in Table 10.7). There are also three more errors generated by the same problem in the component statement for AllZero (from lines 4–5 of package Mult_Components) and the component statement for shiftn (from lines 10–11 of package Mult_Components). All of these index range problems may be fixed by sacrificing the flexible nature of the code and specifying an index range explicitly, as in the following example:
    port (X:BIT_VECTOR(7 downto 0); F:out BIT );
Table 12.8 shows the synthesizable version of the shift-register model. The constrained index ranges in lines 6, 7, 11, 18, 22, and 23 fix the problem, but are rather ugly. It would be better to use generic parameters for the input and output bus widths. However, a shift register with different input and output widths is not that common so, for now, we will leave the code as it is.
TABLE 12.8 A synthesizable version of the shift register shown in Table 10.7.
    entity ShiftN is
    generic (TCQ:TIME := 0.3 ns; TLQ:TIME := 0.5 ns;
      TSQ:TIME := 0.7 ns);
    port (
      CLK, CLR, LD, SH, DIR: in BIT;
      D: in BIT_VECTOR(3 downto 0);
      Q: out BIT_VECTOR(7 downto 0) );
    end ShiftN;
    architecture Behave of ShiftN is
    begin Shift: process (CLR, CLK)
      variable St: BIT_VECTOR(7 downto 0);
    begin
      if CLR = '1' then
        St := (others => '0'); Q <= St after TCQ;
      elsif CLK'EVENT and CLK='1' then
        if LD = '1' then
          St := (others => '0');
          St(3 downto 0) := D;
          Q <= St after TLQ;
        elsif SH = '1' then
          case DIR is
            when '0'=>St:='0' & St(7 downto 1);
            when '1'=>St:=St(6 downto 0) & '0';
          end case;
          Q <= St after TSQ;
        end if;
      end if;
    end process;
    end;
Signals:
    CLK  Clock
    CLR  Clear, active high
    LD   Load, active high
    SH   Shift, active high
    DIR  Direction, 1 = left
    D    Data in
    Q    Data out
Shift register. Input width = 4. Output width = 8. Output is left-shifted or right-shifted under control of DIR. Unused MSBs are zero-padded during load. Clear is asynchronous. Load is synchronous.
Timing: TCQ (CLR to Q) = 0.3 ns, TLQ (LD to Q) = 0.5 ns, TSQ (SH to Q) = 0.7 ns
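The load and shift behavior described in Table 12.8 can be checked against a small behavioral model. The sketch below is Python for illustration only; delays (TCQ, TLQ, TSQ) are ignored and the class name is hypothetical.

```python
class ShiftNModel:
    """Behavioral sketch of ShiftN: 4-bit load, 8-bit shift register."""

    def __init__(self):
        self.st = 0  # 8-bit state, mirrors variable St

    def clear(self):
        """Asynchronous clear (CLR = '1')."""
        self.st = 0
        return self.st

    def clock(self, ld=0, sh=0, direction=0, d=0):
        """One rising clock edge; load has priority over shift."""
        if ld:
            self.st = d & 0x0F                # upper four bits zero-padded
        elif sh:
            if direction == 0:
                self.st = self.st >> 1        # '0' & St(7 downto 1)
            else:
                self.st = (self.st << 1) & 0xFF  # St(6 downto 0) & '0'
        return self.st

if __name__ == "__main__":
    sr = ShiftNModel()
    print(format(sr.clock(ld=1, d=0b1011), "08b"))
    print(format(sr.clock(sh=1, direction=1), "08b"))
    print(format(sr.clock(sh=1, direction=0), "08b"))
```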
The next problem occurs because VHDL is not a synthesis language (from lines 6–7 of the variable-width shift register in Table 10.7),
    begin assert (D'LENGTH <= Q'LENGTH)
      report "D wider than output Q" severity Failure;
    Warning: Assertion statements are ignored
    Error: Statements in entity declarations are not supported
The synthesis tool warns us it does not know how to generate hardware that writes to our screen to implement an assertion statement. The error occurs because a synthesis tool cannot support any of the passive statements (no assignments to signals, for example) that VHDL allows in an entity declaration. Synthesis software usually provides a way around these problems by providing switches to turn the synthesizer on and off. For example, we might be able to write the following:
    --Compass compile_off
    begin assert (D'LENGTH <= Q'LENGTH)
      report "D wider than output Q" severity Failure;
    --Compass compile_on
The disadvantage of this approach is that the code now becomes tied to a particular synthesis tool. The alternative is to move the statement to the architecture to eliminate the error, and ignore the warning.
The next error message is, at first sight, confusing (from lines 15–16 of the variable-width shift register in Table 10.7),
    if CLR = '1' then St := (others => '0'); Q <= St after TCQ;
    Error: Illegal use of aggregate with the choice "others": the derived subtype of an array aggregate that has a choice "others" must be a constrained array subtype
This error message is precise and uses the terminology of the LRM but does not reveal the source of the problem. To discover the problem we work backward through the model. We declared variable St as follows (lines 12–13 of Table 10.7):
    subtype OutB is NATURAL range Q'LENGTH-1 downto 0;
    variable St: BIT_VECTOR(OutB);
(to keep the model flexible). Continuing backward we see Q is declared as type BIT_VECTOR with no index range as follows (lines 4–5 of Table 10.7):
    port(CLK, CLR, LD, SH, DIR: in BIT;
      D: in BIT_VECTOR; Q: out BIT_VECTOR);
The error is thus linked to the previous problem (undeclared bus widths) in this entity-architecture pair. Because the synthesizer does not know the width of Q, it does not know how many '0's to put in St when it has to implement St := (others => '0'). There is one more error like this one in the second assignment to St (line 19 in Table 10.7). Again the problem may be solved by sacrificing flexibility and constraining the width of Q to be a fixed value.
The next warning involves names (line 5 in Table 10.9),
    signal SRA, SRB, ADDout, MUXout, REGout: BIT_VECTOR(7 downto 0);
    Warning: Name is reserved word in VHDL-93: sra
This problem can be fixed by (a) changing the signal name, (b) using an escaped name, or (c) accepting that this code will not work in a VHDL-93 environment. Finally, there is the following warning (line 6 in Table 10.9):
    signal Zero, Init, Shift, Add, Low: BIT := '0'; signal High: BIT := '1';
    Warning: Initial values on signals are only for simulation and setting the value of undriven signals in synthesis. A synthesized circuit can not be guaranteed to be in any known state when the power is turned on.
Signals Low and High are used to tie inputs to a logic '0' and to a logic '1', respectively. This is because VHDL-87 does not allow '1' or '0', which are literals, as actual parameters. Thus one way to solve this problem is to change to a VHDL-93 environment, where this restriction was lifted. Some synthesis systems handle VDD and GND nets in a specific fashion. For example, VDD and GND may be declared as constants in a synthesis package. It does not really matter how inputs are connected to VDD and GND as long as they are connected in the synthesized logic.
12.9.1 Messages During Synthesis

After fixing the error and warning messages, we can synthesize the multiplier. During synthesis we see these messages:
    These unused instances are being removed: in full_adder_p_dup8: u5, u2, u3, u4
    These unused instances are being removed: in dffclr_p_dup1: u2
and seven more similar to this for dffclr_p_dup2: u2 to dffclr_p_dup8: u2. We are suspicious because we did not include any redundant or unused logic in our input code. Let us dig deeper.
Turning to the second set of messages first, we need to discover the locations of dffclr_p_dup1: u2 and the other seven similarly named unused instances. We can ask the synthesizer to produce the following hierarchy map of the design:
    ************* Hierarchy of cell "mult8_p" *************
    mult8_p
      adder8_p
      |  full_adder_p [x8]
      allzero_p
      mux8_p
      register8_p
      |  dffclr_p [x8]
      shiftn_p [x2]
      sm_1_p
The eight unused instances in question are inside the 8-bit register, register8_p. The only models in this register are eight copies of the D flip-flop model, DFFClr. Let us look more closely at the following code:
    architecture Behave of DFFClr is
    signal Qi : BIT;
    begin QB <= not Qi; Q <= Qi;
    process (CLR, CLK) begin
      if CLR = '1' then Qi <= '0' after TRQ;
      elsif CLK'EVENT and CLK = '1' then Qi <= D after TCQ;
      end if;
    end process;
    end;
The synthesizer infers an inverter from the first statement in line 3 (QB <= not Qi). What we meant to imply (A) was: "I am trying to describe the function of a D flip-flop and it has two outputs; one output is the complement of the other." What the synthesizer inferred (B) was: "You described a D flip-flop with an inverter connected to Q." Unfortunately A does not equal B.
Why were four cell instances (u5, u2, u3, u4) removed from inside a cell with instance name full_adder_p_dup8? The top-level cell mult8_p contains cell adder8_p, which in turn contains full_adder_p [x8]. This last entry in the hierarchy map represents eight occurrences or instances of cell full_adder_p. The logic synthesizer appends the suffix '_p' by default to the names of the design units to avoid overwriting any existing netlists (it also converts all names to lowercase). The synthesizer has then added the suffix '_dup8' to create the instance name full_adder_p_dup8 for the eighth copy of cell full_adder_p. What is so special about the eighth instance of full_adder_p inside cell adder8_p? The following (line 13 in Table 10.9) instantiates Adder8:
    A1:Adder8 port map (A=>SRB,B=>REGout,Cin=>Low,Cout=>OFL,Sum=>ADDout);
The signal OFL is declared but not used. This means that the formal port Cout of the entity Adder8 in Table 10.2 drives nothing, and so the carry output of the instance full_adder_p_dup8 is unused. Since the carry-out bit is unused, the synthesizer deletes some logic. Before dismissing this message as harmless, let us look a little closer. In the architecture for the full adder (entity Full_Adder) we wrote:
    Cout <= (X and Y) or (X and Cin) or (Y and Cin) after TC;
In one of the instances of the full adder, named full_adder_p_dup8, this statement is redundant since we never use Cout in that particular cell instance. If we look at the synthesized netlist for full_adder_p_dup8 before optimization, we find four NAND cells that produce the signal Cout. During logic optimization the synthesizer removes these four instances. Their instance names are full_adder_p_dup8:u2, u3, u4, u5.
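The removal of the four NAND cells is an instance of sweeping logic that cannot reach any used output. A minimal sketch of the idea follows, in Python for illustration. The gate-to-net mapping (which of u2 to u5 drives which net, and the single cell u1 for Sum) is an assumption made for this example; the text only gives the instance names.

```python
def sweep_unused(gates, used_outputs):
    """Return the names of gates whose outputs cannot reach a used output.
    gates maps instance name -> (output net, list of input nets)."""
    driver = {out: name for name, (out, _ins) in gates.items()}
    keep = set()
    stack = list(used_outputs)
    while stack:
        net = stack.pop()
        name = driver.get(net)
        if name is None or name in keep:
            continue  # primary input, or gate already visited
        keep.add(name)
        stack.extend(gates[name][1])
    return sorted(set(gates) - keep)

# Hypothetical full-adder netlist: Cout built from four NAND cells.
FULL_ADDER = {
    "u1": ("sum", ["x", "y", "cin"]),   # sum logic, shown as one cell
    "u2": ("n1", ["x", "y"]),           # NAND(X, Y)
    "u3": ("n2", ["x", "cin"]),         # NAND(X, Cin)
    "u4": ("n3", ["y", "cin"]),         # NAND(Y, Cin)
    "u5": ("cout", ["n1", "n2", "n3"]), # NAND of the three terms
}

if __name__ == "__main__":
    print(sweep_unused(FULL_ADDER, ["sum"]))          # Cout unused
    print(sweep_unused(FULL_ADDER, ["sum", "cout"]))  # Cout used
```

When only the sum output is used, the sketch reports u2 to u5 as removable, matching the instance names in the synthesizer message.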
12.10 The Engine Controller

This section returns to the example from Section 10.16, "An Engine Controller." This ASIC gathers sampled temperature measurements from sensors, converts the temperature values from Fahrenheit to Centigrade, averages them, and stores them in a FIFO before passing the values to a microprocessor on a three-state bus. We receive the following message from the logic synthesizer when we use the FIFO-controller code shown in Table 10.25:
    Warning: Made latches to store values on: net d(4), d(5), d(6), d(7), d(8), d(9), d(10), d(11), in module fifo_control
This message often indicates that we forgot to initialize a variable. Here is the part of the code from Table 10.25 that assigns to the vector D (the error message for d is in lowercase; remember VHDL is case insensitive):
    case sel is
      when "01" => D <= D_1 after TPD; r1 <= '1' after TPD;
      when "10" => D <= D_2 after TPD; r2 <= '1' after TPD;
      when "00" => D(3) <= f1 after TPD; D(2) <= f2 after TPD;
        D(1) <= e1 after TPD; D(0) <= e2 after TPD;
      when others => D <= "ZZZZZZZZZZZZ" after TPD;
    end case;
When sel = "00", there is no assignment to D(4) through D(11). This did not matter in the simulation, but to reproduce the exact behavior of the HDL code the
logic synthesizer generates latches to remember the values of D(4) through D(11). This problem may be corrected by replacing the "00" choice with the following:
    when "00" => D(3) <= f1 after TPD; D(2) <= f2 after TPD;
      D(1) <= e1 after TPD; D(0) <= e2 after TPD;
      D(11 downto 4) <= "ZZZZZZZZ" after TPD;
The synthesizer recognizes the assignment of the high-impedance logic value 'Z' to a signal as an indication to implement a three-state buffer. However, there are two kinds of three-state buffers: core logic three-state buffers and three-state I/O cells. We want a three-state I/O cell containing a bonding pad and not a three-state buffer located in the core logic. If we synthesize the code in Table 10.25, we get a three-state buffer in the core. Table 12.9 shows the modified code that will synthesize to three-state I/O cells. The signal OE_b drives the output enable (active-low) of the three-state buffers. Table 12.10 shows the top-level code including all the I/O cells.
TABLE 12.9 A modified version of the FIFO controller to drive three-state I/O cells.
    library IEEE;
    use IEEE.STD_LOGIC_1164.all; use IEEE.NUMERIC_STD.all;
    entity fifo_control is generic (TPD:TIME := 1 ns);
      port (D_1, D_2: in UNSIGNED(11 downto 0);
        sel : in UNSIGNED(1 downto 0);
        read, f1, f2, e1, e2 : in STD_LOGIC;
        r1, r2, w12: out STD_LOGIC;
        D: out UNSIGNED(11 downto 0);
        OE_b: out STD_LOGIC );
    end;
    architecture rtl of fifo_control is
    begin process (read, sel, D_1, D_2, f1, f2, e1, e2)
    begin
      r1 <= '0' after TPD; r2 <= '0' after TPD;
      OE_b <= '0' after TPD;
      if (read = '1') then
        w12 <= '0' after TPD;
        case sel is
          when "01" => D <= D_1 after TPD; r1 <= '1' after TPD;
          when "10" => D <= D_2 after TPD; r2 <= '1' after TPD;
          when "00" => D(3) <= f1 after TPD; D(2) <= f2 after TPD;
            D(1) <= e1 after TPD; D(0) <= e2 after TPD;
            D(11 downto 4) <= "00000000" after TPD;
          when others => OE_b <= '1' after TPD;
        end case;
      elsif (read = '0') then
        OE_b <= '0' after TPD; w12 <= '1' after TPD;
      else OE_b <= '0' after TPD;
      end if;
    end process;
    end rtl;
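The latch warning discussed earlier in this section (for the original code of Table 10.25) amounts to a coverage check: any bit of D that is left unassigned in some branch of the case statement must hold its old value. A rough sketch of that check follows, in Python for illustration; the branch tables are transcribed from the case statement quoted above and the function name is hypothetical.

```python
ALL_BITS = set(range(12))  # D(11 downto 0)

# Bits of D assigned in each choice of the original case statement.
ORIGINAL = {
    "01": ALL_BITS,
    "10": ALL_BITS,
    "00": {0, 1, 2, 3},     # only D(3) downto D(0) are assigned
    "others": ALL_BITS,
}

# After adding D(11 downto 4) <= "ZZZZZZZZ" to the "00" choice.
CORRECTED = dict(ORIGINAL)
CORRECTED["00"] = ALL_BITS

def latched_bits(branches, all_bits):
    """Bits left unassigned in at least one branch: these need storage."""
    missing = set()
    for assigned in branches.values():
        missing |= all_bits - assigned
    return sorted(missing)

if __name__ == "__main__":
    print(latched_bits(ORIGINAL, ALL_BITS))
    print(latched_bits(CORRECTED, ALL_BITS))
```

For the original code the check reports bits 4 through 11, the same nets named in the synthesizer warning.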
TABLE 12.10 The top-level VHDL code for the engine controller ASIC.
    library COMPASS_LIB, IEEE ;
    use IEEE.STD_LOGIC_1164.all; use IEEE.NUMERIC_STD.all;
    use COMPASS_LIB.STDCOMP.all; use COMPASS_LIB.COMPASS.all;
    entity t_control_ASIC is port (
      PadTri : out STD_LOGIC_VECTOR (11 downto 0) ;
      PadClk, PadInreset, PadInreadv : in STD_LOGIC_VECTOR ( 0 downto 0) ;
      PadInp1, PadInp2 : in STD_LOGIC_VECTOR (11 downto 0) ;
      PadInSens : in STD_LOGIC_VECTOR ( 1 downto 0) ) ;
    end t_control_ASIC ;
    architecture structure of t_control_ASIC is
    for all : asPadIn use entity COMPASS_LIB.aspadIn(aspadIn) ;
    for all : asPadClk use entity COMPASS_LIB.aspadClk(aspadClk);
    for all : asPadTri use entity COMPASS_LIB.aspadTri(aspadTri) ;
    for all : asPadVdd use entity COMPASS_LIB.aspadVdd(aspadVdd) ;
    for all : asPadVss use entity COMPASS_LIB.aspadVss(aspadVss) ;
    component pc3c01 port ( cclk : in STD_LOGIC; cp : out STD_LOGIC );
    end component ;
    component t_control port(T_in1, T_in2 : in UNSIGNED(11 downto 0);
      SENSOR: in UNSIGNED( 1 downto 0) ;
      clk, rd, rst : in STD_LOGIC;
      D : out UNSIGNED(11 downto 0); oe_b : out STD_LOGIC );
    end component ;
    signal T_in1_sv, T_in2_sv : STD_LOGIC_VECTOR(11 downto 0) ;
    signal T_in1_un, T_in2_un : UNSIGNED(11 downto 0) ;
    signal sensor_sv : STD_LOGIC_VECTOR(1 downto 0) ;
    signal sensor_un : UNSIGNED(1 downto 0) ;
    signal clk_sv, rd_fifo_sv, reset_sv : STD_LOGIC_VECTOR (0 downto 0) ;
    signal clk_core, oe_b : STD_LOGIC ;
    signal D_un : UNSIGNED(11 downto 0) ;
    signal D_sv : STD_LOGIC_VECTOR(11 downto 0) ;
    begin --compass dontTouch u* -- synopsys dont_touch etc.
    u1 : asPadIn generic map (12,"2:13") port map (t_in1_sv,PadInp1) ;
    u2 : asPadIn generic map (12,"14:25") port map (t_in2_sv,PadInp2) ;
    u3 : asPadIn generic map (2,"26:27") port map (sensor_sv, PadInSens ) ;
    u4 : asPadIn generic map (1,"29") port map (rd_fifo_sv, PadInReadv ) ;
    u5 : asPadIn generic map (1,"30") port map (reset_sv, PadInreset ) ;
    u6 : asPadIn generic map (1,"32") port map (clk_sv, PadClk) ;
    u7 : pc3c01 port map (clk_sv(0), clk_core) ;
    u8 : asPadTri generic map (12,"35:38,41:44,47:50")
      port map (PadTri,D_sv,oe_b);
    u9 : asPadVdd generic map ("1,31,34,40,45,52") port map (Vdd) ;
    u10: asPadVss generic map ("28,33,39,46,51,53") port map (Vss) ;
    T_in1_un <= UNSIGNED(T_in1_sv) ;
    T_in2_un <= UNSIGNED(T_in2_sv) ;
    sensor_un <= UNSIGNED(sensor_sv) ;
    D_sv <= STD_LOGIC_VECTOR(D_un) ;
    v_1 : t_control port map
      (T_in1_un,T_in2_un,sensor_un, Clk_core, rd_fifo_sv(0), reset_sv(0),D_un, oe_b) ;
    end ;
12.11 Performance-Driven Synthesis

Many logic synthesizers allow the use of directives. The pseudocomment in the following code directs the logic synthesizer to minimize the delay of an addition:
    module add_directive (a, b, z); input [3:0] a, b; output [3:0] z;
    //compass maxDelay 2 ns
    //synopsys and so on.
    assign z = a + b;
    endmodule
These directives become complicated when we need to describe complex timing constraints. Figure 12.7(a) shows an example of a more flexible method to measure and specify delay using timing arcs (or timing paths). Suppose we wish to improve the performance of the comparator/MUX example from Section 12.2. First we define a pathcluster (a group of circuit nodes; see Figure 12.7b). Next, we specify the required time for a signal to reach the output nodes (the end set) as 2 ns. Finally, we specify the arrival time of the signals at all the inputs as 0 ns. We have thus constrained the delay of the comparator/MUX to be 2 ns, measured between any input and any output. The logic-optimization step will simplify the logic network and then map it to a cell library while attempting to meet the timing constraints.
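The bookkeeping behind such a constraint is simple: the slack at each end node is the required time minus the time at which the signal actually arrives. The sketch below, in Python for illustration, uses the delays that the synthesizer reports in Table 12.11; the function name and the "VIOLATED" label are assumptions of this example (the tool output only shows "MET").

```python
def slack_report(required_ns, arrival_ns):
    """Slack = required time - actual arrival time at each end node.
    A non-negative slack means the constraint is met."""
    report = {}
    for node, arrival in arrival_ns.items():
        slack = round(required_ns - arrival, 2)
        report[node] = (slack, "MET" if slack >= 0 else "VIOLATED")
    return report

if __name__ == "__main__":
    # Required time of 2 ns at the end set; inputs arrive at 0 ns, so the
    # arrival time at each output equals the path delay in Table 12.11.
    delays = {"outp[1]": 1.64, "outp[0]": 1.64, "outp[2]": 0.31}
    for node, (slack, status) in slack_report(2.0, delays).items():
        print(node, slack, status)
```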
FIGURE 12.7 Timing constraints. (a) A pathcluster. (b) Defining constraints.
Table 12.11 shows the results of a timing-driven logic optimization for the comparator/MUX. Comparing these results with the default optimization results shown in Table 12.3 reveals that the timing has dramatically improved (critical path delay was 2.43 ns with default optimization settings, and the delay varies between 0.31 ns and 1.64 ns for the timing-driven optimization).
TABLE 12.11 Timing-driven synthesis reports for the comparator/MUX example of Section 12.2.
Commands:
    > set pathcluster pc1
    > set requiredTime 2 outp[0] outp[1] outp[2] -pathcluster pc1
    > set arrivalTime 0 * -pathcluster pc1
    > optimize
Synthesizer output:
                Num    Gate Count  Tot Gate  Width     Total
    Cell Name   Insts  Per Cell    Count     Per Cell  Width
    ---------   -----  ----------  --------  --------  -----
    an02d1      1      1.3         1.3       12.0      12.0
    in01d0      2      .8          1.5       7.2       14.4
    mx21d1      2      2.2         4.5       21.6      43.2
    nd02d0      2      1.0         2.0       9.6       19.2
    oa03d1      1      1.8         1.8       16.8      16.8
    oa04d1      1      1.3         1.3       12.0      12.0
    ---------   -----  ----------  --------  --------  -----
    Totals:     9                  12.2                117.6
Command:
    > report timing -allpaths
Synthesizer output:
    path cluster name: pc1
    path type: maximum
    ------------------------------------------
    end node   current  required  slack
    ------------------------------------------
    outp[1]    1.64     2.00      .36   MET
    outp[0]    1.64     2.00      .36   MET
    outp[2]    .31      2.00      1.69  MET
Figure 12.8 shows that timing-driven optimization and the subsequent mapping have simplified the logic considerably. For example, the logic for outp[2] has been reduced to a two-input AND gate. Using sis reveals how optimization works in this case. Table 12.12 shows the equations for the intermediate signal sel and the three comparator/MUX outputs in the BLIF. Thus, for example, the following line of the BLIF code in Table 12.12 (the first line following .names a0 b0 a1 b1 a2 b2 sel ) includes the term a0b0'a1'b1'a2'b2' in the equation for sel :
100000 1 There are six similar lines that describe the six other product terms for sel . These seven product terms form a cover for sel in the Karnaugh maps of Figure 12.5 .
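The way a BLIF cube maps onto a product term can be sketched in a few lines of Python. This is only an illustration, not part of sis: the function names are hypothetical, the seven cubes are copied from Table 12.12, and we assume that a0 and b0 are the least-significant bits of the 3-bit inputs.

```python
# Input order taken from the ".names a0 b0 a1 b1 a2 b2 sel" line of the BLIF file.
INPUTS = ["a0", "b0", "a1", "b1", "a2", "b2"]

# The seven cubes listed for sel in Table 12.12.
SEL_CUBES = ["100000", "101100", "--1000", "----10",
             "100011", "101111", "--1011"]


def cube_to_term(cube, names=INPUTS):
    """Turn one BLIF cube into a product term: '1' -> x, '0' -> x', '-' -> omitted."""
    literals = []
    for bit, name in zip(cube, names):
        if bit == "1":
            literals.append(name)
        elif bit == "0":
            literals.append(name + "'")
    return " ".join(literals)


def cube_covers(cube, values):
    """True if the cube matches a full input assignment (dict of name -> 0 or 1)."""
    return all(bit == "-" or int(bit) == values[name]
               for bit, name in zip(cube, INPUTS))


def sel_from_cover(a, b):
    """Evaluate sel for 3-bit integers a and b as the OR of the seven cubes."""
    values = {}
    for i in range(3):
        values[f"a{i}"] = (a >> i) & 1  # assumption: a0 is the least-significant bit
        values[f"b{i}"] = (b >> i) & 1
    return int(any(cube_covers(cube, values) for cube in SEL_CUBES))


for cube in SEL_CUBES:
    print(cube, "->", cube_to_term(cube))
```

Under the bit-ordering assumption above, evaluating sel_from_cover for all 64 input pairs gives 1 exactly when a is greater than b, which is the condition under which the comparator/MUX must select b.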
`timescale 1ns / 10ps
module comp_mux_o (a, b, outp);
input [2:0] a; input [2:0] b;
output [2:0] outp;
supply1 VDD; supply0 VSS;
mx21d1 B1_i1 (.I0(a[0]), .I1(b[0]), .S(B1_i6_ZN), .Z(outp[0]));
oa03d1 B1_i2 (.A1(B1_i9_ZN), .A2(a[2]), .B1(a[0]), .B2(a[1]), .C(B1_i4_ZN), .ZN(B1_i2_ZN));
nd02d0 B1_i3 (.A1(a[1]), .A2(a[0]), .ZN(B1_i3_ZN));
nd02d0 B1_i4 (.A1(b[1]), .A2(B1_i3_ZN), .ZN(B1_i4_ZN));
mx21d1 B1_i5 (.I0(a[1]), .I1(b[1]), .S(B1_i6_ZN), .Z(outp[1]));
oa04d1 B1_i6 (.A1(b[2]), .A2(B1_i7_ZN), .B(B1_i2_ZN), .ZN(B1_i6_ZN));
in01d0 B1_i7 (.I(a[2]), .ZN(B1_i7_ZN));
an02d1 B1_i8 (.A1(b[2]), .A2(a[2]), .Z(outp[2]));
in01d0 B1_i9 (.I(b[2]), .ZN(B1_i9_ZN));
endmodule
FIGURE 12.8 The comparator/MUX example of Section 12.2 after logic optimization with timing constraints. The figure shows the structural netlist, comp_mux_o2.v, and its derived schematic. Compare this with Figures 12.2 and 12.3.

In addition sis must be informed of the don't care values (called the external don't care set) in these Karnaugh maps. This is the function of the PLA-format input that follows the .exdc line. Now sis can simplify the equations including the don't care values using a standard script, script.rugged, that contains a sequence of sis commands. This particular script uses a series of factoring and substitution steps. The output (Table 12.12) reveals that sis finds the same equation for outp[2] (named outp2 in the sis equations):

{outp2} = a2 b2

The other logic equations in Table 12.12 that sis produces are also equivalent to the logic in Figure 12.8. The technology-mapping step hides the exact details of the conversion between the internal representation and the optimized logic.

TABLE 12.12 Optimizing the comparator/MUX equations using sis.

sis input file (BLIF):

.model comp_mux
.inputs a0 b0 a1 b1 a2 b2
.outputs outp0 outp1 outp2
.names a0 b0 a1 b1 a2 b2 sel
100000 1
101100 1
--1000 1
----10 1
100011 1
101111 1
--1011 1
.names sel a0 b0 outp0
1-1 1
01- 1
.names sel a1 b1 outp1
1-1 1
01- 1
.names sel a2 b2 outp2
1-1 1
01- 1
.exdc
.names a0 b0 a1 b1 a2 b2 sel
000000 1
110000 1
001100 1
111100 1
000011 1
110011 1
001111 1
111111 1
.end

sis results:

</usr/user1/msmith/sis> sis
UC Berkeley, SIS Development Version
(compiled 11-Oct-95 at 11:50 AM)
sis> read_blif comp_mux.blif
sis> print
{outp0} = a0 sel' + b0 sel
{outp1} = a1 sel' + b1 sel
{outp2} = a2 sel' + b2 sel
sel = a0 a1 a2 b0' b1 b2 + a0 a1 a2' b0' b1 b2' + a0 a1' a2 b0' b1' b2 + a0 a1' a2' b0' b1' b2' + a1 a2 b1' b2 + a1 a2' b1' b2' + a2 b2'
sis> source script.rugged
sis> print
{outp0} = a0 sel' + b0 sel
{outp1} = a1 sel' + b1 sel
{outp2} = a2 b2
sel = [9] a2 b0' + [9] b0' b2' + a1 a2 b1' + a1 b1' b2' + a2 b2'
[9] = a1 + b1'
sis> quit
</usr/user1/msmith/sis>
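One way to convince ourselves that the optimized equations still implement the comparator/MUX is to evaluate them exhaustively. The following Python sketch transcribes the equations that sis prints after sourcing the script and compares the result with the smaller of the two inputs. The function names are hypothetical, and we assume a0 and b0 are the least-significant bits.

```python
def bits(n):
    """Split a 3-bit integer into (bit0, bit1, bit2), least-significant bit first."""
    return (n & 1, (n >> 1) & 1, (n >> 2) & 1)


def optimized_sel(a0, a1, a2, b0, b1, b2):
    """sel as printed by sis after optimization (Table 12.12)."""
    n9 = a1 | (1 - b1)                       # [9] = a1 + b1'
    return ((n9 & a2 & (1 - b0))             # [9] a2 b0'
            | (n9 & (1 - b0) & (1 - b2))     # [9] b0' b2'
            | (a1 & a2 & (1 - b1))           # a1 a2 b1'
            | (a1 & (1 - b1) & (1 - b2))     # a1 b1' b2'
            | (a2 & (1 - b2)))               # a2 b2'


def optimized_outputs(a, b):
    """Evaluate outp0, outp1, outp2 and pack them into a 3-bit integer."""
    a0, a1, a2 = bits(a)
    b0, b1, b2 = bits(b)
    sel = optimized_sel(a0, a1, a2, b0, b1, b2)
    outp0 = (a0 & (1 - sel)) | (b0 & sel)    # {outp0} = a0 sel' + b0 sel
    outp1 = (a1 & (1 - sel)) | (b1 & sel)    # {outp1} = a1 sel' + b1 sel
    outp2 = a2 & b2                          # {outp2} = a2 b2
    return outp0 | (outp1 << 1) | (outp2 << 2)


# Collect every input pair for which the optimized logic differs from min(a, b).
mismatches = [(a, b) for a in range(8) for b in range(8)
              if optimized_outputs(a, b) != min(a, b)]
print("mismatches:", mismatches)
```

The value of sel is unconstrained when a equals b (the external don't care set), but the outputs are unaffected there because both MUX inputs are then identical.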
1. See footnote 1 in Table 12.3 for explanations of the abbreviations used in this table.
12.12 Optimization of the Viterbi Decoder

Returning to the Viterbi decoder example (from Section 12.4), we first set the environment for the design using the following worst-case conditions: a die temperature of 25 °C (fastest logic) to 120 °C (slowest logic); a power supply voltage of VDD = 5.5 V (fastest logic) to VDD = 4.5 V (slowest logic); and worst process (slowest logic) to best process (fastest logic). Assume that this ASIC should run at a clock frequency of at least 33 MHz (clock period of 30 ns). An initial synthesis run gives a critical path delay at nominal conditions (the default setting) of about 25 ns and nearly 35 ns under worst-case conditions using a high-density 0.6 µm standard-cell target library.

Estimates (using simulation and calculation) show that data arrives at the input pins 5 ns (worst-case) after the rising edge of the clock. The reset signal arrives 10 ns (worst-case) after the rising edge of the clock. The outputs of the Viterbi decoder must be stable at least 4 ns before the rising edge of the clock. This allows these signals to be driven to another ASIC in time to be clocked. These timing constraints are particularly devastating. Together they effectively reduce the clock period that is available for use by 9 ns. However, these figures are typical for board-level delays.

The initial synthesis runs reveal the critical path is through the following six modules:

subset_decode -> compute_metric -> compare_select -> reduce -> metric -> output_decision
The logic synthesizer can do little or no optimization across these module boundaries. The next step, then, is to rearrange the design hierarchy for synthesis. Flattening (merging or ungrouping) the six modules into a new cell, called critical, allows the synthesizer to reduce the critical path delay by optimizing one large module.

At present the last module in the critical path is output_decision. This combinational logic adds 2 to 3 ns to the output delay requirement of 4 ns (this means the outputs of the module metric must be stable 6 to 7 ns before the rising clock edge). Registering the output reduces this overhead and removes the module output_decision from the critical path. The disadvantage is an increase in latency by one clock cycle, but the latency is already 12 clock cycles in this design. If registering the output decreases the critical path delay by more than a factor of 12 / 13, performance will still improve. To register the output, alter the code (on pages 575-576) as follows:

module viterbi_ASIC
...
wire [2:0] Out, Out_r; // Change: add Out_r.
...
asPadOut #(3,"30,31,32") u30 (padOut, Out_r); // Change: Out_r.
Outreg o_1 (Out, Out_r, Clk, Res); // Change: add output register.
...
endmodule

module Outreg (Out, Out_r, Clk, Res); // Change: add this module.
input [2:0] Out; input Clk, Res; output [2:0] Out_r;
dff #(3) reg1(Out, Out_r, Clk, Res);
endmodule

These changes move the performance closer to the target. Prelayout estimates indicate the die perimeter required for the I/O pads will allow more than enough area to hold the core logic. Since there is unused area in the core, it makes sense to switch to a high-performance standard-cell library with a slightly larger cell height (96 versus 72). This cell library is less dense, but faster.

Typically, at this point, the design is improved by altering the HDL, the hierarchy, and the synthesis controls in an iterative manner until the desired performance is achieved. However, remember there is still no information from the layout. The best that can be done is to estimate the contribution of the interconnect using wire-load models. As soon as possible the netlist should be passed to the floorplanner (or the place-and-route software in the absence of a floorplanner) to generate better estimates of interconnect delays.

TABLE 12.13 Critical-path timing report for the Viterbi decoder.

Instance name        Delay information (1)
                     inPin --> outPin   incr   arrival  trs  rampDel  cap(pF)  cell
v_1.u100
u1.subout5.Q_ff_b0   CP --> QN          1.65   1.65     F    .20      .10      dfctnb
B1_i67               A1 --> ZN          .63    2.27     R    .14      .08      ao01d1
B1_i66               B --> ZN           .84    3.12     F    .15      .08      ao04d1
B1_i64               B2 --> ZN          .91    4.03     F    .35      .17      fn03d1
B1_i68               I --> ZN           .39    4.43     R    .23      .12      in01d1
B1_i316              S --> Z            .91    5.33     F    .34      .17      mx21d1
u3.add_rip1.u4       B0 --> CO          2.20   7.54     F    .24      .14      ad02d1
... 28 other cell instances omitted ...
u5.sub_rip1.u6       B0 --> CO          2.25   23.17    F    .23      .13      ad02d1
u5.sub_rip1.u8       CI --> CO          .53    23.70    F    .21      .09      ad01d1
B1_i301              A1 --> Z           .69    24.39    R    .19      .07      xo02d1
u2.metric3.Q_ff_b4   setup: D --> CP    .17    24.56    R    .00      .00      dfctnb
slack: MET                                     .44

Table 12.13 is a timing report for the Viterbi decoder, which shows the critical path starts at a sequential logic cell (a D flip-flop in the present example), ends at a sequential logic cell (another D flip-flop), with 37 other combinational logic cells in between. The first delay is the clock-to-Q delay of the first flip-flop. The last delay is the setup time of the last flip-flop. The critical path delay is 24.56 ns, which gives a slack of 0.44 ns from the constraint of 25 ns (reduced from 30 ns to give an extra margin). We have met the timing constraint (otherwise we say it is violated). In Table 12.13 all instances in the critical path are inside instance v_1.u100. Instance name u100 is the new cell (cell name critical) formed by merging six blocks in module viterbi (instance name v_1).

The second column in Table 12.13 shows the timing arc of the cell involved on the critical path. For example, CP --> QN represents the path from the clock pin, CP, to the flip-flop output pin, QN, of a D flip-flop (cell name dfctnb). The pin names and their functions come from the library data book. Each company adopts a different naming convention (in this case CP represents a positive clock edge, for example). The conventions are not always explicitly shown in the data books but are normally easy to discover by looking at examples. As another example, B0 --> CO represents the path from the B input to the carry output of a 2-bit full adder (cell name ad02d1).

The third column (incr) represents the incremental delay contribution of the logic cell to the critical path. The fourth column (arrival) shows the arrival time of the signal at the output pin of the logic cell. This is the cumulative delay to that point on the critical path. The fifth column (trs) describes whether the transition at the output node is rising (R) or falling (F). The timing analyzer examines each possible combination of rising and falling delays to find the critical path. The sixth column (rampDel) is a measure of the input slope (ramp delay, or slew rate).
In submicron ASIC design this is an important contribution to delay. The seventh column (cap) is the capacitance at the output node of the logic cell. This determines the logic cell delay and also the signal slew rate at the node. The last column (cell) is the cell name (from the cell-library data book). In this library suffix 'd1' represents normal drive strength with 'd0', 'd2', and 'd5' being the other available strengths.

1. See the text for explanations of the column headings.
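The arithmetic behind the timing constraints, the slack, and the latency trade-off in this section can be collected in a short Python sketch. The function names are hypothetical; the numbers are the ones quoted in the text and in Table 12.13, and the break-even calculation simply restates the 12 / 13 argument (total latency in nanoseconds is the number of cycles times the clock period).

```python
CLOCK_PERIOD_NS = 30.0    # target clock period (about 33 MHz)
INPUT_ARRIVAL_NS = 5.0    # data arrives this long after the rising clock edge
OUTPUT_STABLE_NS = 4.0    # outputs must be stable this long before the next edge


def available_period(period, input_arrival, output_stable):
    """Clock period left for the logic after the board-level delays are removed."""
    return period - input_arrival - output_stable


def slack(required, arrival):
    """Positive slack means the constraint is met; negative means it is violated."""
    return required - arrival


def break_even_period(old_period, old_cycles, new_cycles):
    """Period below which a design with more latency cycles is still faster overall."""
    return old_period * old_cycles / new_cycles


budget = available_period(CLOCK_PERIOD_NS, INPUT_ARRIVAL_NS, OUTPUT_STABLE_NS)
path_slack = slack(25.0, 24.56)                 # constraint and arrival from Table 12.13
limit = break_even_period(25.0, 12, 13)         # 12 cycles of latency become 13

print(f"available period: {budget:.2f} ns")
print(f"slack: {path_slack:.2f} ns")
print(f"break-even period with 13 cycles: {limit:.2f} ns")
```

With these figures the input and output constraints leave 21 ns of the 30 ns period, and a registered design with 13 cycles of latency has a shorter total latency than the 12-cycle design only if its critical path delay falls below 12 / 13 of the original value.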
12.13 Summary
A logic synthesizer may contain over 500,000 lines of code. With such a complex system, complex inputs, and little feedback at the output there is a danger of the "garbage in, garbage out" syndrome. Ask yourself "What do I expect to see at the output?" and "Does the output make sense?" If you cannot answer these questions, you should simplify the input (reduce the width of the buses, simplify or partition the code, and so on). The worst thing you can do is write and simulate a huge amount of code, read it into the synthesis tool, and try to optimize it all at once with the default settings.

With experience it is possible to recognize what the logic synthesizer is doing by looking at the number of cells, their types, and the drive strengths. For example, if there are many minimum drive strength cells on the critical path it is usually an indication that the synthesizer has room to increase speed by substituting cells with stronger drive. This is not always true; sometimes a higher-drive cell may actually slow down the circuit. This is because the larger cell increases the load capacitance on the cell that drives it without adding enough drive to make up for it. This is why logical effort is a useful measure.

Because interconnect delay is increasingly dominant, it is important to begin the physical design steps as early as possible. Ideally floorplanning and logic synthesis should be completed at the same time. This ensures that the estimated interconnect delays are close to the actual delays after routing is complete.
SIMULATION
Engineers used to prototype systems to check their designs, often using a breadboard with connector holes, allowing them to plug in ICs and wires. Breadboarding was feasible when it was possible to construct systems from a few off-the-shelf TTL parts. It is impractical for prototyping an ASIC. Instead most ASIC design engineers turn to simulation as the modern equivalent of breadboarding.

13.1 Types of Simulation
13.2 The Comparator/MUX Example
13.3 Logic Systems
13.4 How Logic Simulation Works
13.5 Cell Models
13.6 Delay Models
13.7 Static Timing Analysis
13.8 Formal Verification
13.9 Switch-Level Simulation
13.10 Transistor-Level Simulation
13.11 Summary
13.12 Problems
13.13 Bibliography
13.14 References

13.1 Types of Simulation

Simulators are usually divided into the following categories or simulation modes:

• Behavioral simulation
• Functional simulation
• Static timing analysis
• Gate-level simulation
• Switch-level simulation
• Transistor-level or circuit-level simulation
This list is ordered from high-level to low-level simulation (high-level being more abstract, and low-level being more detailed). Proceeding from high-level to low-level simulation, the simulations become more accurate, but they also become progressively more complex and take longer to run. While it is just possible to perform a behavioral-level simulation of a complete system, it is impossible to perform a circuit-level simulation of more than a few hundred transistors.

There are several ways to create an imaginary simulation model of a system. One method models large pieces of a system as black boxes with inputs and outputs. This type of simulation (often using VHDL or Verilog) is called behavioral simulation. Functional simulation ignores timing and includes unit-delay simulation, which sets delays to a fixed value (for example, 1 ns). Once a behavioral or functional simulation predicts that a system works correctly, the next step is to check the timing performance. At this point a system is partitioned into ASICs and a timing simulation is performed for each ASIC separately (otherwise the simulation run times become too long).

One class of timing simulators employs timing analysis that analyzes logic in a static manner, computing the delay times for each path. This is called static timing analysis because it does not require the creation of a set of test (or stimulus) vectors (an enormous job for a large ASIC). Timing analysis works best with synchronous systems whose maximum operating frequency is determined by the longest path delay between successive flip-flops. The path with the longest delay is the critical path.

Logic simulation or gate-level simulation can also be used to check the timing performance of an ASIC. In a gate-level simulator a logic gate or logic cell (NAND, NOR, and so on) is treated as a black box modeled by a function whose variables are the input signals. The function may also model the delay through the logic cell. Setting all the delays to unit value is the equivalent of functional simulation.

If the timing simulation provided by a black-box model of a logic gate is not accurate enough, the next, more detailed, level of simulation is switch-level simulation, which models transistors as switches that are either on or off. Switch-level simulation can provide more accurate timing predictions than gate-level simulation, but without the ability to use logic-cell delays as parameters of the models.

The most accurate, but also the most complex and time-consuming, form of simulation is transistor-level simulation. A transistor-level simulator requires models of transistors, describing their nonlinear voltage and current characteristics.

Each type of simulation normally uses a different software tool. A mixed-mode simulator permits different parts of an ASIC simulation to use different simulation modes. For example, a critical part of an ASIC might be simulated at the transistor level while another part is simulated at the functional level. Be careful not to confuse mixed-level simulation with a mixed analog/digital simulator; these are mixed-level simulators.

Simulation is used at many stages during ASIC design. Initial prelayout simulations include logic-cell delays but no interconnect delays. Estimates of capacitance may be included after completing logic synthesis, but only after physical design is it possible to perform an accurate postlayout simulation.
13.2 The Comparator/MUX Example

As an example we borrow the model from Section 12.2, "A Comparator/MUX":

// comp_mux.v
module comp_mux(a, b, outp);
input [2:0] a, b; output [2:0] outp;
function [2:0] compare; input [2:0] ina, inb;
begin
  if (ina <= inb) compare = ina; else compare = inb;
end
endfunction
assign outp = compare(a, b);
endmodule

We can use the following testbench to generate a sequence of input values (we call these input vectors) that test or exercise the behavioral model, comp_mux.v:

// testbench.v
module comp_mux_testbench;
integer i, j;
reg [2:0] x, y, smaller; wire [2:0] z;
always @(x) $display("t x y actual calculated");
initial $monitor("%4g",$time,,x,,y,,z,,,,,,,smaller);
initial $dumpvars; initial #1000 $finish;
initial
begin
  for (i = 0; i <= 7; i = i + 1)
  begin
    for (j = 0; j <= 7; j = j + 1)
    begin
      x = i; y = j; smaller = (x <= y) ? x : y;
      #1 if (z != smaller) $display("error");
    end
  end
end
comp_mux v_1 (x, y, z);
endmodule

The results from the behavioral simulation are as follows:

t x y actual calculated
0 0 0 0 0
1 0 1 0 0
... 60 lines omitted...
62 7 6 6 6
63 7 7 7 7

We included a delay of one Verilog time unit in line 15 of the testbench model (allowing time to progress), but we did not specify the units; they could be nanoseconds or days. Thus, behavioral simulation can only tell us if our design does not work; it cannot tell us that real hardware will work.
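The table of behavioral results above can be reproduced with a few lines of Python. This is a rough software analogue of the testbench, not a Verilog simulation: the function names are hypothetical, and we assume one row per input vector with the time advancing by one unit per vector, as in the listing.

```python
def compare(ina, inb):
    """Python transcription of the Verilog function compare in comp_mux.v."""
    if ina <= inb:
        return ina
    return inb


def run_vectors():
    """Apply all 64 input vectors and return rows of (t, x, y, actual, calculated)."""
    rows = []
    t = 0
    for i in range(8):
        for j in range(8):
            actual = compare(i, j)            # output of the model under test
            calculated = i if i <= j else j   # the testbench's own value, smaller
            rows.append((t, i, j, actual, calculated))
            t += 1                            # one time unit per vector (the #1 delay)
    return rows


rows = run_vectors()
errors = [row for row in rows if row[3] != row[4]]

print("t x y actual calculated")
for row in rows[:2] + rows[-2:]:
    print(*row)
print(len(errors), "errors")
```

As in the Verilog testbench, the check is only as good as the expected value: the model and the calculated value share the same definition of "smaller," so agreement here says nothing about timing or about real hardware.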
13.2.1 Structural Simulation

We use logic synthesis to produce a structural model from a behavioral model. The following comparator/MUX model is adapted from the example in Section 12.11, "Performance-Driven Synthesis" (optimized for a 0.6 µm standard-cell library):

`timescale 1ns / 10ps // comp_mux_o2.v
module comp_mux_o (a, b, outp);
input [2:0] a; input [2:0] b;
output [2:0] outp;
supply1 VDD; supply0 VSS;
mx21d1 b1_i1 (.i0(a[0]), .i1(b[0]), .s(b1_i6_zn), .z(outp[0]));
oa03d1 b1_i2 (.a1(b1_i9_zn), .a2(a[2]), .b1(a[0]), .b2(a[1]), .c(b1_i4_zn), .zn(b1_i2_zn));
nd02d0 b1_i3 (.a1(a[1]), .a2(a[0]), .zn(b1_i3_zn));
nd02d0 b1_i4 (.a1(b[1]), .a2(b1_i3_zn), .zn(b1_i4_zn));
mx21d1 b1_i5 (.i0(a[1]), .i1(b[1]), .s(b1_i6_zn), .z(outp[1]));
oa04d1 b1_i6 (.a1(b[2]), .a2(b1_i7_zn), .b(b1_i2_zn), .zn(b1_i6_zn));
in01d0 b1_i7 (.i(a[2]), .zn(b1_i7_zn));
an02d1 b1_i8 (.a1(b[2]), .a2(a[2]), .z(outp[2]));
in01d0 b1_i9 (.i(b[2]), .zn(b1_i9_zn));
endmodule

Logic simulation requires Verilog models for the following six logic cells: mx21d1 (2:1 MUX), oa03d1 (OAI221), nd02d0 (two-input NAND), oa04d1 (OAI21), in01d0 (inverter), and an02d1 (two-input AND). These models are part of an ASIC library (often encoded so that they cannot be seen) and thus, from this point on, the designer is dependent on a particular ASIC library company. As an example of this dependence, notice that some of the names in the preceding code have changed from uppercase (in Figure 12.8 on p. 624) to lowercase. Verilog is case sensitive and we are using a cell library that uses lowercase. Most unfortunately, there are no standards for names, cell functions, or the use of case in ASIC libraries.

The following code (a simplified model from a 0.8 µm standard-cell library) models a 2:1 MUX and uses fixed delays:

`timescale 1 ns / 10 ps
module mx21d1 (z, i0, i1, s); input i0, i1, s; output z;
not G3(N3, s);
and G4(N4, i0, N3), G5(N5, s, i1), G6(N6, i0, i1);
or G7(z, N4, N5, N6);
specify
(i0*>z) = (0.279:0.504:0.900, 0.276:0.498:0.890);
(i1*>z) = (0.248:0.448:0.800, 0.264:0.476:0.850);
(s*>z) = (0.285:0.515:0.920, 0.298:0.538:0.960);
endspecify
endmodule

This code uses Verilog primitive models (not, and, or) to describe the behavior of a MUX, but this is not how the logic cell is implemented.

To simulate the optimized structural model, module comp_mux_o2.v, we use the library cell models (module mx21d1 and the other five that are not shown here) together with the following new testbench model:

`timescale 1 ps / 1 ps // comp_mux_testbench2.v
module comp_mux_testbench2;
integer i, j; integer error;
reg [2:0] x, y, smaller; wire [2:0] z, ref;
always @(x) $display("t x y derived reference");
// initial $monitor("%8.2f",$time/1e3,,x,,y,,z,,,,,,,,ref);
initial $dumpvars;
initial begin
  error = 0; #1e6 $display("%4g", error, " errors");
  $finish;
end
initial begin
  for (i = 0; i <= 7; i = i + 1) begin
    for (j = 0; j <= 7; j = j + 1) begin
      x = i; y = j; #10e3;
      $display("%8.2f",$time/1e3,,x,,y,,z,,,,,,,,ref);
      if (z != ref)
        begin $display("error"); error = error + 1; end
    end
  end
end
comp_mux_o v_1 (x, y, z); // comp_mux_o2.v
reference v_2 (x, y, ref);
endmodule

// reference.v
module reference(a, b, outp);
input [2:0] a, b; output [2:0] outp;
assign outp = (a <= b) ? a : b; // different from comp_mux
endmodule

In this testbench we have instantiated two models: a reference model (module reference) and a derived model (module comp_mux_o, the optimized structural model). The high-level behavioral model that represents the initial system specification (module reference) may be different from the model that we use as input to the logic-synthesis tool (module comp_mux). Which is the real reference model? We postpone this question until we discuss formal verification in Section 13.8. For the moment, we shall simply perform simulations to check the reference model against the derived model. The simulation results are as follows:

t x y derived reference
10.00 0 0 0 0
20.00 0 1 0 0
... 60 lines omitted...
630.00 7 6 6 6
640.00 7 7 7 7
0 errors

(A summary is printed at the end of the simulation to catch any errors.) The next step is to examine the timing of the structural model (by switching the leading '//' from line 6 to 16 in module comp_mux_testbench2). It is important to simulate using the worst-case delays by using a command-line switch as follows: verilog +maxdelays. We can then find the longest path delay by searching through the simulator output, part of which follows:

t x y derived reference
... lines omitted...
260.00 3 2 1 2
260.80 3 2 3 2
260.85 3 2 2 2
270.00 3 3 2 3
270.80 3 3 3 3
280.00 3 4 3 3
280.85 3 4 0 3
283.17 3 4 3 3
... lines omitted...

At time 280 ns, the input vectors, x and y, switch from (x = 3, y = 3) to (x = 3, y = 4). The output of the derived model (which should be equal to the smaller of x and y) is the same for both of these input vectors and should remain unchanged. In fact there is a glitch at the output of the derived model, as it changes from 3 to 0 and back to 3 again, taking 3.17 ns to settle to its final value (this is the longest delay that occurs using this testbench). The glitch occurs because one of the input vectors (input y) changes from '011' (3 in decimal) to '100' (decimal 4). Changing several input bits simultaneously causes the output to vacillate.

Notice that the nominal and worst-case simulations will not necessarily give the same longest path delay. In addition the longest path delay found using this testbench is not necessarily the critical path delay. For example, the longest, and therefore critical, path delay might result from a transition from x = 3, y = 4 to x = 4, y = 3 (to choose a random but possible candidate set of input vectors). This testbench does not include tests with such transitions. To find the critical path using logic simulation requires simulating all possible input transitions (64 × 64 = 4096) and then sifting through the output to find the critical path.

Vector-based simulation (or dynamic simulation) can show us that our design functions correctly, hence the name functional simulation. However, functional simulation does not work well if we wish to find the critical path. For this we turn to a different type of simulation: static simulation or static timing analysis.

TABLE 13.1 Timing analysis of the comparator/MUX structural model, comp_mux_o2.v, from Figure 12.8.

Command (1)

> report timing

Timing analyzer/logic synthesizer output (1)

instance name  inPin --> outPin  incr  arrival  trs  rampDel  cap   cell
                                 (ns)  (ns)          (ns)     (pf)
--------------------------------------------------------------------------
a[0]                             .00   .00      R    .00      .12   comp_m...
b1_i3          A2 --> ZN         .31   .31      F    .23      .08   nd02d0
b1_i4          A2 --> ZN         .41   .72      R    .26      .07   nd02d0
b1_i2          C --> ZN          1.36  2.08     F    .13      .07   oa03d1
b1_i6          B --> ZN          .94   3.01     R    .24      .14   oa04d1
b1_i5          S --> Z           1.04  4.06     F    .08      .04   mx21d1
outp[0]                          .00   4.06     F    .00      .00   comp_m...
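The functional comparison of the derived and reference models in this section can also be sketched in Python with a zero-delay model of the structural netlist. The cell functions follow the descriptions given in the text (2:1 MUX, OAI221, two-input NAND, OAI21, inverter, two-input AND); the way the pins are grouped inside the OAI cells is our assumption, as are all the Python names. Because the model has no delays, it cannot show the glitch or the 3.17 ns settling time.

```python
def inv(i):
    return 1 - i                                   # in01d0


def nand2(a1, a2):
    return 1 - (a1 & a2)                           # nd02d0


def and2(a1, a2):
    return a1 & a2                                 # an02d1


def mux21(i0, i1, s):
    return i1 if s else i0                         # mx21d1: s selects i1


def oai221(a1, a2, b1, b2, c):
    return 1 - ((a1 | a2) & (b1 | b2) & c)         # oa03d1 (assumed pin grouping)


def oai21(a1, a2, b):
    return 1 - ((a1 | a2) & b)                     # oa04d1 (assumed pin grouping)


def comp_mux_o(a, b):
    """Zero-delay evaluation of the netlist comp_mux_o2.v for 3-bit integers."""
    a0, a1, a2 = a & 1, (a >> 1) & 1, (a >> 2) & 1
    b0, b1, b2 = b & 1, (b >> 1) & 1, (b >> 2) & 1
    n9 = inv(b2)                                   # b1_i9
    n3 = nand2(a1, a0)                             # b1_i3
    n4 = nand2(b1, n3)                             # b1_i4
    n2 = oai221(n9, a2, a0, a1, n4)                # b1_i2
    n7 = inv(a2)                                   # b1_i7
    n6 = oai21(b2, n7, n2)                         # b1_i6, the MUX select
    outp0 = mux21(a0, b0, n6)                      # b1_i1
    outp1 = mux21(a1, b1, n6)                      # b1_i5
    outp2 = and2(b2, a2)                           # b1_i8
    return outp0 | (outp1 << 1) | (outp2 << 2)


def reference(a, b):
    return a if a <= b else b                      # reference.v


error = sum(comp_mux_o(x, y) != reference(x, y)
            for x in range(8) for y in range(8))
print(error, "errors")
```

This checks only the 64 static input vectors; finding the longest delay by simulation would still require a timing model and all 4096 input transitions.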
13.2.2 Static Timing Analysis

A timing analyzer answers the question: "What is the longest delay in my circuit?" Table 13.1 shows the timing analysis of the comparator/MUX structural model, module comp_mux_o2.v. The longest or critical path delay is 4.06 ns under the following worst-case operating conditions: worst-case process, VDD = 4.75 V, and T = 70 °C (the same conditions as used for the library data book delay values). The timing analyzer gives us only the critical path and its delay. A timing analyzer does not give us the input vectors that will activate the critical path. In fact input vectors may not exist to activate the critical path. For example, it may be that the decimal values of the input vectors to the comparator/MUX may never differ by more than four, but the timing-analysis tool cannot use this information. Future timing-analysis tools may consider such factors, called Boolean relations, but at present they do not.

Section 13.2.1 explained why dynamic functional simulation does not necessarily find the critical path delay. Nevertheless, the difference between the longest path delay found using functional simulation, 3.17 ns, and the critical path delay reported by the static timing-analysis tool, 4.06 ns, is surprising. This difference occurs because the timing analysis accounts for the loading of each logic cell by the input capacitance of the logic cells that follow, but the simplified Verilog models used for functional simulation in Section 13.2.1 did not include the effects of capacitive loading. For example, in the model for the logic cell mx21d1, the (rising) delay from the i0 input to the output z was fixed at 0.900 ns worst case (the maximum delay value is the third number in the first triplet in line 7 of module mx21d1). Normally library models include another portion that adjusts the timing of each logic cell; this portion was removed to simplify the model mx21d1 shown in Section 13.2.1.

Most timing analyzers do not consider the function of the logic when they search for the critical path. Thus, for example, the following code models z = NAND(a, NOT(a)), which means that the output, z, is always '1'.

module check_critical_path_1 (a, z);
input a; output z; supply1 VDD; supply0 VSS;
nd02d0 b1_i3 (.a1(a), .a2(b), .zn(z)); // 2-input NAND
in01d0 b1_i7 (.i(a), .zn(b)); // inverter
endmodule

A timing-analyzer report for this model might show the following critical path:

        inPin --> outPin  incr  arrival  trs  rampDel  cap   cell
                          (ns)  (ns)          (ns)     (pf)
------------------------------------------------------------------
a                         .00   .00      R    .00      .08   check_...
b1_i7   I --> ZN          .38   .38      F    .30      .07   in01d0
b1_i3   A2 --> ZN         .28   .66      R    .13      .04   nd02d0
z                         .00   .66      R    .00      .00   check_...

Paths such as this, which are impossible to activate, are known as false paths. Timing analysis is essential to ASIC design but has limitations. A timing-analysis tool is more logic calculator than logic simulator.
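The false-path example can be illustrated with a small Python sketch that treats the circuit the way a simple timing analyzer might: as a graph of delay arcs, with no knowledge of the logic function. The graph encoding and function names are hypothetical. The two arc delays on the reported path come from the report above; the delay we use for the direct arc through the NAND a1 pin is an assumption, since the report does not list it.

```python
# Delay arcs: node -> list of (next node, delay in ns).
ARCS = {
    "a": [("b", 0.38),    # inverter b1_i7, I --> ZN (from the report)
          ("z", 0.28)],   # NAND b1_i3, A1 --> ZN (assumed equal to the A2 arc)
    "b": [("z", 0.28)],   # NAND b1_i3, A2 --> ZN (from the report)
    "z": [],
}


def longest_delay(node, arcs):
    """Longest accumulated delay from node to any endpoint of an acyclic graph."""
    return max((delay + longest_delay(nxt, arcs) for nxt, delay in arcs[node]),
               default=0.0)


def z_value(a):
    """The logic function the timing graph ignores: z = NAND(a, NOT(a))."""
    return 1 - (a & (1 - a))


worst = longest_delay("a", ARCS)
print(f"longest structural path: {worst:.2f} ns")
print("z for a = 0 and a = 1:", z_value(0), z_value(1))
```

The graph search reports the 0.66 ns path through the inverter, yet the function shows that z has the same value for both values of a, so no change on a can ever propagate to z along that path.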
13.2.3 Gate-Level Simulation

To illustrate the differences between functional simulation, timing analysis, and gate-level simulation, we shall simulate the comparator/MUX critical path (the path is shown in Table 13.1). We start by trying to find vectors that activate this critical path by working forward from the beginning of the critical path, the input a[0], toward the end of the critical path, output outp[0], as follows:

1. Input a[0] to the two-input NAND, nd02d0, cell instance b1_i3, changes from a '0' to a '1'. We know this because there is an 'R' (for rising) under the trs (for transition) heading on the first line of the critical path timing analysis report in Table 13.1.
2. Input a[1] to the two-input NAND, nd02d0, cell instance b1_i3, must be a '1'. This allows the change on a[0] to propagate toward the output, outp[0].
3. Similarly, input b[1] to the two-input NAND, cell instance b1_i4, must be a '1'.
4. We skip over the required inputs to cells b1_i2 and b1_i6 for the moment.
5. From the last line of Table 13.1 we know the output of MUX, mx21d1, cell instance b1_i5, changes from '1' to a '0'. From the previous line in Table 13.1 we know that the select input of this MUX changes from '0' to a '1'. This means that the final value of input b[0] (the i1 input, selected when the select input is '1') must be '0' (since this is the final value that must appear at the MUX output). Similarly, the initial value of a[0] must be a '1'.

We have now contradicted ourselves. In step 1 we saw that the initial value of a[0] must be a '0'. The critical path is thus a false path. Nevertheless we shall proceed. We set the initial input vector to (a = '110', b = '111') and then to (a = '111', b = '110'). These vectors allow the change on a[0] to propagate to the select signal of the MUX, mx21d1, cell instance b1_i5. In decimal we are changing a from 6 to 7, and b from 7 to 6; the output should remain unchanged at 6. The simulation results from the gate-level simulator we shall use (CompassSim) can be displayed graphically or in the text form that follows:

...
# The calibration was done at Vdd=4.65V, Vss=0.1V, T=70 degrees C
Time = 0:0 [0 ns]
a = 'D6 [0] (input)(display)
b = 'D7 [0] (input)(display)
outp = 'Buuu ('Du) [0] (display)
outp --> 'B1uu ('Du) [.47]
outp --> 'B11u ('Du) [.97]
outp --> 'D6 [4.08]
a --> 'D7 [10]
b --> 'D6 [10]
outp --> 'D7 [10.97]
outp --> 'D6 [14.15]
Time = 0:0 +20ns [20 ns]

The code 'Buuu denotes that the output is initially, at t = 0 ns, a binary vector of three unknown or unsettled signals. The output bits become valid as follows: outp[2] at 0.47 ns, outp[1] at 0.97 ns, and outp[0] at 4.08 ns. The output is stable at 'D6 (decimal 6) or '110' at t = 10 ns when the input vectors are changed in an attempt to activate the critical path. The output glitches from 'D6 ('110') to 'D7 ('111') at t = 10.97 ns and back to 'D6 again at t = 14.15 ns. Thus, the output bit, outp[0], takes a total of 4.15 ns to settle.

Can we explain this behavior? The data book entry for the mx21d1 logic cell gives the following equation for the rising delay as a function of Cld (the load capacitance, excluding the output capacitance of the logic cell itself, expressed in picofarads):

tI0Z (I0->Z) = 0.90 + 0.07 + (1.76 × Cld) ns    (13.1)
The capacitance, Cld, at the output of each MUX is zero (because nothing is connected to the outputs). From Eq. 13.1, the path delay from the input, a[0], to the output, outp[0], is thus 0.97 ns. This explains why the output, outp[0], changes from '0' to '1' at t = 10.97 ns, 0.97 ns after a change occurs on a[0]. The gate-level simulation predicts that the input, a[0], to the MUX will change before the changes on the inputs have time to propagate to the MUX select. Finally, at t = 14.15 ns, the MUX select will change and switch the output, outp[0], back to '0' again. The total delay for this input vector stimulus is thus 4.15 ns. Even though this path is a false path (as far as timing analysis is concerned), it is a critical path. It is indeed necessary to wait for 4.15 ns before using the output signal of this circuit. A timing analyzer can only offer us a guarantee that there is no other path that is slower than the critical path.
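The numbers in this explanation follow directly from Eq. 13.1 and the simulator output. A minimal Python sketch (the function and variable names are hypothetical; the coefficients are those of Eq. 13.1, and 14.15 ns is read from the simulator output rather than computed):

```python
def mux_rise_delay_ns(c_load_pf):
    """Rising delay from I0 to Z of mx21d1 as given by Eq. 13.1 (load in pF)."""
    return 0.90 + 0.07 + 1.76 * c_load_pf


INPUT_CHANGE_NS = 10.0     # time at which a and b are switched
FINAL_CHANGE_NS = 14.15    # last change on outp reported by the simulator

# With nothing connected to the MUX outputs the load capacitance is zero.
glitch_start_ns = INPUT_CHANGE_NS + mux_rise_delay_ns(0.0)

# The output cannot be used until the glitch has ended.
settle_ns = FINAL_CHANGE_NS - INPUT_CHANGE_NS

print(f"outp[0] glitches at t = {glitch_start_ns:.2f} ns")
print(f"outp[0] settles after {settle_ns:.2f} ns")
```

The same function shows how the delay grows with load: each additional 0.1 pF on the MUX output adds 0.176 ns to the rising delay under this model.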
13.2.4 Net Capacitance

The timing analyzer predicted a critical path delay of 4.06 ns compared to the gate-level simulation prediction of 4.15 ns. We can check our results by using another gate-level simulator (QSim) which uses a slightly different algorithm. Here is the output (with the same input vectors as before):

    @nodes
    a R10 W1; a[2] a[1] a[0]
    b R10 W1; b[2] b[1] b[0]
    outp R10 W1; outp[2] outp[1] outp[0]
    @data
      .00 a -> 'D6
      .00 b -> 'D7
      .00 outp -> 'Du
      .53 outp -> 'Du
      .93 outp -> 'Du
     4.42 outp -> 'D6
    10.00 a -> 'D7
    10.00 b -> 'D6
    11.03 outp -> 'D7
    14.43 outp -> 'D6
    ### END OF SIMULATION TIME = 20 ns
    @end
The output is similar but gives yet another value, 4.43 ns, for the path delay. Can this be explained? The simulator prints the following messages as a clue:

    defCapacitance = .1E-01 pF
    incCapacitance = .1E-01 pF/pin

The simulator is adding capacitance to the outputs of each of the logic cells to model the parasitic net capacitance (interconnect capacitance or wire capacitance) that will be present in the physical layout. The simulator adds 0.01 pF (defCapacitance) on each node and another 0.01 pF (incCapacitance) for each pin (logic cell input) attached to a node. The model that predicts these values is known as a wire-load model, wire-delay model, or interconnect model. Changing the wire-load model parameters to zero and repeating the simulation changes the critical-path delay to 4.06 ns, which agrees exactly with the logic-synthesizer timing analysis. This emphasizes that the net capacitance may contribute a significant delay.

The library data book (VLSI Technology, vsc450) lists the cell input and output capacitances. For example, the values for the nd02d0 logic cell are as follows:

    Cin (inputs, a1 and a2) = 0.042 pF    Cout (output, zn) = 0.038 pF    (13.2)

Armed with this information, let us return to the timing analysis report of Table 13.1 (the part of this table we shall focus on follows) and examine how a timing analyzer handles net capacitance.
    inPin --> outPin    incr   arrival  trs  rampDel  cap    cell
                        (ns)   (ns)          (ns)     (pf)
    ----------------------------------------------------------------
    a[0]                .00    .00      R    .00      .12    comp_m...
    b1_i3 A2 --> ZN     .31    .31      F    .23      .08    nd02d0
    ...

The total capacitance at the output node of logic cell instance b1_i3 is 0.08 pF. This figure is the sum of the logic cell (nd02d0) output capacitance of cell instance b1_i3 (equal to 0.038 pF) and Cld, the input capacitance of the next cell, b1_i2 (also an nd02d0), equal to 0.042 pF.

The capacitance at the input node, a[0], is equal to the sum of the input capacitances of the logic cells connected to that node. These capacitances (and their sources) are as follows:

1. 0.042 pF (the a2 input of the two-input NAND, instance b1_i3, cell nd02d0)
2. 0.038 pF (the i0 input of the 2:1 MUX, instance b1_i1, cell mx21d1)
3. 0.038 pF (the b1 input of the OAI221, instance b1_i2, cell oa03d1)

The sum of these capacitances is 0.118 pF, which is the 0.12 pF shown in the timing-analysis report.

Having explained the capacitance figures in the timing-analysis report, let us turn to the delay figures. The fall-time delay equation for a nd02d0 logic cell (again from the vsc450 library data book) is as follows:

    tD (AX->ZN) = 0.08 + 0.11 + (2.89 × Cld) ns    (13.3)

Notice 0.11 ns = 2.89 ns/pF × 0.038 pF, and this figure in Eq. 13.3 is the part of the cell delay attributed to the cell output capacitance. The ramp delay in the timing analysis (under the heading rampDel in Table 13.1) is the sum of the last two terms in Eq. 13.3. Thus, the ramp delay is 0.11 + (2.89 × 0.042) = 0.231 ns (since Cld is 0.042 pF). The total delay (under incr in Table 13.1) is 0.08 + 0.231 = 0.31 ns.

There are thus the following four figures for the critical path delay:

1. 4.06 ns from a static timing analysis using the logic-synthesizer timing engine (worst-case process, VDD = 4.50 V, and T = 70 °C). No wire capacitance.
2. 4.15 ns from a gate-level functional simulation (worst-case process, VSS = 0.1 V, VDD = 4.65 V, and T = 70 °C). No wire capacitance.
3. 4.43 ns from a gate-level functional simulation. Default wire-capacitance model (0.01 pF + 0.01 pF / pin).
4. 4.06 ns from a gate-level functional simulation. No wire capacitance.

Normally we do not check our simulation results this thoroughly. However, we can only trust the tools if we understand what they are doing, how they work, and their limitations, and if we are able to check that the results are reasonable.

1. Using a 0.8 µm standard-cell library, VLSI Technology vsc450. Worst-case environment: worst-case process, VDD = 4.75 V, and T = 70 °C. No wire capacitance, no input or output capacitance, prop-ramp timing model. The structural model was synthesized and optimized using a 0.6 µm library, but this timing analysis was performed using the 0.8 µm library. This is because the library models are simpler for the 0.8 µm library and thus easier to explain in the text.
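The capacitance and delay figures in this section can be reproduced with a short Python sketch. The function node_capacitance and the constant names are our own; the numerical values are the ones quoted from the vsc450 data book and the QSim messages. We assume here that the wire-load terms simply add to the cell capacitances on a node, which is how the text describes the simulator's model.

```python
# Capacitances quoted in the text (pF).
CIN_ND02D0 = 0.042      # a1 or a2 input of nd02d0
COUT_ND02D0 = 0.038     # zn output of nd02d0
CIN_MX21D1_I0 = 0.038   # i0 input of mx21d1
CIN_OA03D1_B1 = 0.038   # b1 input of oa03d1


def node_capacitance(driver_cout, load_cins, def_cap=0.0, inc_cap=0.0):
    """Total node capacitance in pF.

    def_cap is added once per node and inc_cap once per attached input pin
    (the wire-load model); both default to zero (no wire capacitance).
    """
    return driver_cout + sum(load_cins) + def_cap + inc_cap * len(load_cins)


# Node a[0]: a primary input driving three cell inputs.
cap_a0 = node_capacitance(0.0, [CIN_ND02D0, CIN_MX21D1_I0, CIN_OA03D1_B1])

# Output of b1_i3 (nd02d0) driving one nd02d0 input, without wire capacitance.
cap_b1_i3 = node_capacitance(COUT_ND02D0, [CIN_ND02D0])

# Eq. 13.3: 0.08 ns intrinsic delay plus 2.89 ns/pF times the node capacitance.
ramp_delay = 2.89 * cap_b1_i3
incr_delay = 0.08 + ramp_delay

# The same node with the default wire-load model (0.01 pF + 0.01 pF per pin).
cap_b1_i3_wire = node_capacitance(COUT_ND02D0, [CIN_ND02D0], 0.01, 0.01)

print(f"cap a[0]   = {cap_a0:.2f} pF")
print(f"cap b1_i3  = {cap_b1_i3:.2f} pF (with wire load {cap_b1_i3_wire:.2f} pF)")
print(f"rampDel    = {ramp_delay:.2f} ns, incr = {incr_delay:.2f} ns")
```

With the wire-load model enabled every node in the path carries extra capacitance, which is consistent with the longer path delay that QSim reports.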
13.3 Logic Systems

Digital signals are actually analog voltage (or current) levels that vary continuously as they change. Digital simulation assumes that digital signals may only take on a set of logic values (or logic states; here we will consider the two terms equivalent) from a logic system. A logic system must be chosen carefully. Too many values will make the simulation complicated and slow. With too few values the simulation may not accurately reflect the hardware performance.

A two-value logic system (or two-state logic system) has a logic value '0' corresponding to a logic level 'zero' and a logic value '1' corresponding to a logic level 'one'. However, when the power to a system is initially turned on, we do not immediately know whether the logic value of a flip-flop output is '1' or '0' (it will be one or the other, but we do not know which). To model this situation we introduce a logic value 'X', with an unknown logic level, or unknown. An unknown can propagate through a circuit. For example, if the inputs to a two-input NAND gate are logic values '1' and 'X', the output is logic value 'X' or unknown.

Next, in order to model a three-state bus, we need a high-impedance state. A high-impedance state may have a logic level of 'zero' or 'one', but it is not being driven; we say it is floating. This will occur if none of the gates connected to a three-state bus is driving the bus. A four-value logic system is shown in Table 13.2.

TABLE 13.2 A four-value logic system.

    Logic state    Logic level              Logic value
    0              zero                     zero
    1              one                      one
    X              zero or one              unknown
    Z              zero, one, or neither    high impedance
13.3.1 Signal Resolution

What happens if multiple drivers try to drive different logic values onto a bus? Table 13.3 shows a signal-resolution function for a four-value logic system that will predict the result.

TABLE 13.3 A resolution function R {A, B} that predicts the result of two drivers simultaneously attempting to drive signals with values A and B onto a bus.

    R {A, B}    B=0    B=1    B=X    B=Z
    A=0         0      X      X      0
    A=1         X      1      X      1
    A=X         X      X      X      X
    A=Z         0      1      X      Z

A resolution function, R {A, B}, must be commutative and associative. That is,

    R {A, B} = R {B, A} and R {R {A, B}, C} = R {A, R {B, C}}.    (13.4)

Equation 13.4 ensures that, if we have three (or more) signals to resolve, it does not matter in which order we resolve them. Suppose we have four drivers on a bus driving values '0', '1', 'X', and 'Z'. If we use Table 13.3 three times to resolve these signals, the answer is always 'X' whatever order we use.
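The properties in Eq. 13.4 are easy to check by brute force. The following Python sketch encodes Table 13.3 as a lookup table (the names TABLE, resolve, and resolve_all are our own) and tests every combination of values, then resolves the four drivers '0', '1', 'X', and 'Z' in all 24 possible orders.

```python
from functools import reduce
from itertools import permutations, product

VALUES = "01XZ"

# Table 13.3: rows are A, columns are B, both in the order 0, 1, X, Z.
TABLE = {
    "0": "0XX0",
    "1": "X1X1",
    "X": "XXXX",
    "Z": "01XZ",
}


def resolve(a, b):
    """Resolution function R{A, B} for two drivers."""
    return TABLE[a][VALUES.index(b)]


def resolve_all(drivers):
    """Resolve any number of drivers by applying R{A, B} pairwise."""
    return reduce(resolve, drivers)


commutative = all(
    resolve(a, b) == resolve(b, a) for a, b in product(VALUES, repeat=2)
)
associative = all(
    resolve(resolve(a, b), c) == resolve(a, resolve(b, c))
    for a, b, c in product(VALUES, repeat=3)
)

# Resolve the drivers '0', '1', 'X', 'Z' in every possible order.
results = {resolve_all(order) for order in permutations(VALUES)}

print(commutative, associative, results)
```

Every ordering gives the single result 'X', as the text states.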
13.3.2 Logic Strength

In CMOS logic we use n-channel transistors to produce a logic level 'zero' (with a forcing strength) and we use p-channel transistors to force a logic level 'one'. An n-channel transistor provides a weak logic level 'one'. This is a new logic value, a resistive 'one', which has a logic level of 'one', but with resistive strength. Similarly, a p-channel transistor produces a resistive 'zero'. A resistive strength is not as strong as a forcing strength. At a high-impedance node there is nothing to keep the node at any logic level. We say that the logic strength is high impedance. A high-impedance strength is the weakest strength and we can treat it as either a very high-resistance connection to a power supply or no connection at all.

TABLE 13.4 A 12-state logic system.

                      Logic level
    Logic strength    zero    unknown    one
    strong            S0      SX         S1
    weak              W0      WX         W1
    high impedance    Z0      ZX         Z1
    unknown           U0      UX         U1

With the introduction of logic strength, a logic value may now have two properties: level and strength. Suppose we were to measure a voltage at a node N with a digital voltmeter (with a very high input impedance). Suppose the measured voltage at node N was 4.98 V (and the measured positive supply, VDD = 5.00 V). We can say that node N is a logic level 'one', but we do not know the logic strength. Now suppose you connect one end of a 1 kΩ resistor to node N, the other to GND, and the voltage at N changes to 4.95 V. Now we can say that whatever is driving node N has a strong forcing strength. In fact, we know that whatever is driving N is capable of supplying a current of at least 4.95 V / 1 kΩ ≈ 5 mA. Depending on the logic-value system we are using, we can assign a logic value to N. If we allow all possible combinations of logic level with logic strength, we end up with a matrix of logic values and logic states. Table 13.4 shows the 12 states that result with three logic levels (zero, one, unknown) and four logic strengths (strong, weak, high-impedance, and unknown).

In this logic system, node N has logic value S1: a logic level of 'one' with a logic strength of 'strong'.
The Verilog logic system has three logic levels that are called '1', '0', and 'x'; and the eight logic strengths shown in Table 13.5. The designer does not normally see the logic values that result, only the three logic levels.

TABLE 13.5 Verilog logic strengths.

    Logic strength      Strength number    Models                                     Abbreviation
    supply drive        7                  power supply                               supply    Su
    strong drive        6                  default gate and assign output strength    strong    St
    pull drive          5                  gate and assign output strength            pull      Pu
    large capacitor     4                  size of trireg net capacitor               large     La
    weak drive          3                  gate and assign output strength            weak      We
    medium capacitor    2                  size of trireg net capacitor               medium    Me
    small capacitor     1                  size of trireg net capacitor               small     Sm
    high impedance      0                  not applicable                             highz     Hi
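As an illustration of how strength numbers can be used, here is a Python sketch that resolves two drivers on a net using the strength numbers of Table 13.5. It assumes a deliberately simplified rule: the stronger driver wins, and two drivers of equal strength with different levels produce 'x'. The Verilog standard also defines ambiguous strengths, which this sketch ignores; the names STRENGTH and resolve_drivers are our own.

```python
# Strength numbers from Table 13.5.
STRENGTH = {
    "supply": 7, "strong": 6, "pull": 5, "large": 4,
    "weak": 3, "medium": 2, "small": 1, "highz": 0,
}


def resolve_drivers(d1, d2):
    """Resolve two drivers, each given as (strength_name, level).

    Simplifying assumption: the stronger driver dominates; equal strengths
    with different levels give level 'x' at that strength.
    """
    (s1, v1), (s2, v2) = d1, d2
    if STRENGTH[s1] > STRENGTH[s2]:
        return d1
    if STRENGTH[s2] > STRENGTH[s1]:
        return d2
    return (s1, v1 if v1 == v2 else "x")


# A strong '0' overrides a pull '1' (for example, a gate output against a pullup).
print(resolve_drivers(("strong", "0"), ("pull", "1")))
# Two strong drivers that disagree produce an unknown.
print(resolve_drivers(("strong", "0"), ("strong", "1")))
```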
The IEEE Std 1164-1993 logic system defines a variable type, std_ulogic, with the nine logic values shown in Table 13.6.

TABLE 13.6 The nine-value logic system, IEEE Std 1164-1993.

    Logic state    Logic value     Logic state    Logic value
    '0'            strong low      'X'            strong unknown
    '1'            strong high     'W'            weak unknown
    'L'            weak low        'Z'            high impedance
    'H'            weak high       '-'            don't care
                                   'U'            uninitialized

When we wish to simulate logic cells using this logic system, we must define the primitive-gate operations. We also need to define the process of VHDL signal resolution using VHDL signal-resolution functions. For example, the function in the IEEE Std_Logic_1164 package that defines the and operation is as follows [footnote 1]:
    function "and"(l,r : std_ulogic_vector) return std_ulogic_vector is
      alias lv : std_ulogic_vector (1 to l'LENGTH ) is l;
      alias rv : std_ulogic_vector (1 to r'LENGTH ) is r;
      variable result : std_ulogic_vector (1 to l'LENGTH );
      constant and_table : stdlogic_table := (
      ------------------------------------------------------------
      -- |  U    X    0    1    Z    W    L    H    -         |   |
      ------------------------------------------------------------
          ( 'U', 'U', '0', 'U', 'U', 'U', '0', 'U', 'U' ), -- | U |
          ( 'U', 'X', '0', 'X', 'X', 'X', '0', 'X', 'X' ), -- | X |
          ( '0', '0', '0', '0', '0', '0', '0', '0', '0' ), -- | 0 |
          ( 'U', 'X', '0', '1', 'X', 'X', '0', '1', 'X' ), -- | 1 |
          ( 'U', 'X', '0', 'X', 'X', 'X', '0', 'X', 'X' ), -- | Z |
          ( 'U', 'X', '0', 'X', 'X', 'X', '0', 'X', 'X' ), -- | W |
          ( '0', '0', '0', '0', '0', '0', '0', '0', '0' ), -- | L |
          ( 'U', 'X', '0', '1', 'X', 'X', '0', '1', 'X' ), -- | H |
          ( 'U', 'X', '0', 'X', 'X', 'X', '0', 'X', 'X' )  -- | - |
      );
    begin
      if (l'LENGTH /= r'LENGTH) then
        assert false report
        "arguments of overloaded 'and' operator are not of the same length"
        severity failure;
      else
        for i in result'RANGE loop
          result(i) := and_table ( lv(i), rv(i) );
        end loop;
      end if;
      return result;
    end "and";

If a = 'X' and b = '0', then (a and b) is '0' no matter whether a is, in fact, '0' or '1'.

1. IEEE Std 1164-1993, Copyright 1993 IEEE. All rights reserved.
13.4 How Logic Simulation Works

The most common type of digital simulator is an event-driven simulator. When a circuit node changes in value, the time, the node, and the new value are collectively known as an event. The event is scheduled by putting it in an event queue or event list. When the specified time is reached, the logic value of the node is changed. The change affects logic cells that have this node as an input. All of the affected logic cells must be evaluated, which may add more events to the event list. The simulator keeps track of the current time, the current time step, and the event list that holds future events. For each circuit node the simulator keeps a record of the logic state and the strength of the source or sources driving the node. When a node changes logic state, whether as an input or as an output of a logic cell, this causes an event.

An interpreted-code simulator uses the HDL model as data, compiling an executable model as part of the simulator structure, and then executes the model. This type of simulator usually has a short compile time but a longer execution time compared to other types of simulator. An example is Verilog-XL. A compiled-code simulator converts the HDL model to an intermediate form (usually C) and then uses a separate compiler to create executable binary code (an executable). This results in a longer compile time but shorter execution time than an interpreted-code simulator. A native-code simulator converts the HDL directly to an executable and offers the shortest execution time.

The logic cells for each of these types of event-driven simulator are modeled using a primitive modeling language (primitive in the sense of fundamental). There are no standards for this primitive modeling language. For example, the following code is a primitive model of a two-input NAND logic cell:

    model nd01d1 (a, b, zn)
      function (a, b) !(a & b);
      function end
    model end

The model has three ports: a, b, and zn. These ports are connected to nodes when a NAND gate is instantiated in an input structural netlist,

    nand nd01d1(a2, b3, r7)

An event occurs when one of the circuit nodes a2 or b3 changes, and the function defined in the primitive model is called. For example, when a2 changes, it affects the port a of the model. The function will be called to set zn to the logical NAND of a and b. The implementation of the primitive functions is unique to each simulator and carefully coded to reduce execution time.

The data associated with an event consists of the affected node, a new logic value for the node, a time for the change to take effect, and the node that caused the event. Written in C, the data structure for an event might look like the following:

    struct Event {
      event_ptr fwd_link, back_link; /* event list */
      event_ptr node_link;           /* list of node events */
      node_ptr event_node;           /* node for the event */
      node_ptr cause;                /* node causing event */
      port_ptr port;                 /* port which caused this event */
      long event_time;               /* event time, in units of delta */
      char new_value;                /* new value: '1' '0' etc. */
    };

The event list keeps track of logic cells whose outputs are changing and the new values for each output. The evaluation list keeps track of logic cells whose inputs have changed. Using separate event and evaluation lists avoids any dependence on the order in which events are processed, since the evaluations occur only after all nodes have been updated. The sequence of event-list processing followed by the evaluation-list processing is called a simulation cycle, or an event-evaluation cycle (or event-eval cycle for short).
Delays are tracked using a time wheel divided into ticks or slots, with each slot representing a unit of time. A software pointer marks the current time on the time wheel. As simulation progresses, the pointer moves forward by one slot for each time step. The event list tracks the pending events and, as the pointer moves, the simulator processes the event list for the current time.
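The event-evaluation cycle can be sketched in a few dozen lines of Python. This is a toy model, not the algorithm of any particular simulator: it uses a priority queue (heapq) in place of a time wheel, supports only two-input NAND cells with a fixed delay, and does not cancel pending events the way an inertial-delay model would. The class name Simulator, the node names, and the delays in the example are all invented for illustration.

```python
import heapq
from itertools import count


def nand(a, b):
    """Three-valued NAND over the logic values '0', '1', and 'X'."""
    if a == "0" or b == "0":
        return "1"
    if a == "1" and b == "1":
        return "0"
    return "X"


class Simulator:
    def __init__(self, gates):
        # gates: list of (output_node, (input_a, input_b), delay_in_ticks)
        self.gates = gates
        self.value = {}
        for out, ins, _ in gates:
            for node in (out, *ins):
                self.value[node] = "X"      # every node starts unknown
        self.queue = []                      # event list
        self.seq = count()                   # tie-breaker for equal times
        self.log = []

    def schedule(self, time, node, new_value):
        heapq.heappush(self.queue, (time, next(self.seq), node, new_value))

    def run(self):
        while self.queue:
            now = self.queue[0][0]
            changed = set()
            # Event-list processing: update every node with an event due now.
            while self.queue and self.queue[0][0] == now:
                _, _, node, new_value = heapq.heappop(self.queue)
                if self.value[node] != new_value:
                    self.value[node] = new_value
                    changed.add(node)
                    self.log.append((now, node, new_value))
            # Evaluation-list processing: evaluate cells whose inputs changed.
            for out, ins, delay in self.gates:
                if changed.intersection(ins):
                    result = nand(self.value[ins[0]], self.value[ins[1]])
                    self.schedule(now + delay, out, result)


# Two NAND cells in series: r7 = NAND(a2, b3), q = NAND(r7, en).
sim = Simulator([("r7", ("a2", "b3"), 2), ("q", ("r7", "en"), 3)])
for node in ("a2", "b3", "en"):
    sim.schedule(0, node, "1")
sim.schedule(10, "b3", "0")
sim.run()
for entry in sim.log:
    print(entry)
```

All node updates for a time step are applied before any cell is evaluated, which is the reason for keeping separate event and evaluation lists.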
13.4.1 VHDL Simulation Cycle

We shall use VHDL as an example to illustrate the steps in a simulation cycle (which is precisely defined in the LRM). In VHDL, before simulation begins, the design hierarchy is first elaborated. This means all the pieces of the model code (entities, architectures, and configurations) are put together. Then the nets in the model are initialized just before simulation starts. The simulation cycle is then continuously repeated during which processes are executed and signals are updated. A VHDL simulation cycle consists of the following steps:

1. The current time, t_c, is set equal to t_n.
2. Each active signal in the model is updated and events may occur as a result.
3. For each process P, if P is currently sensitive to a signal S, and an event has occurred on signal S in this simulation cycle, then process P resumes.
4. Each resumed process is executed until it suspends.
5. The time of the next simulation cycle, t_n, is set to the earliest of:
   a. the next time at which a driver becomes active, or
   b. the next time at which a process resumes.
6. If t_n = t_c, then the next simulation cycle is a delta cycle.

Simulation is complete when we run out of time (t_n = TIME'HIGH) and there are no active drivers or process resumptions at t_n (there are some slight modifications to these rules involving postponed processes, which we rarely use in ASIC design).
Time in an event-driven simulator has two dimensions. A delta cycle takes delta time , which does not result in a change in real time. Each event that occurs at the same time step executes in delta time. Only when all events have been completed and signals updated does real time advance to the next time step.
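The two dimensions of time can be made concrete by treating simulation time as a pair (real time, delta count). The Python sketch below is a much-simplified model of the cycle described above: each "process" is reduced to a single assignment sensitive to one signal, a zero-delay assignment takes effect one delta cycle later at the same real time, and the initialization phase is omitted. The function simulate and the signal names are assumptions made for this example.

```python
import heapq


def simulate(stimulus, processes, initial):
    """Run a toy simulation kernel with delta cycles.

    stimulus:  list of (time_ns, signal, value)
    processes: list of (sensitivity_signal, target_signal, function, delay_ns)
    initial:   dict of initial signal values
    Returns a trace of (time_ns, delta, signal, value) for every event.
    """
    value = dict(initial)
    queue = [((t, 0), i, sig, val) for i, (t, sig, val) in enumerate(stimulus)]
    heapq.heapify(queue)
    seq = len(queue)
    trace = []
    while queue:
        now = queue[0][0]                      # (real time, delta count)
        events = set()
        # Update every signal that has a transaction in this cycle.
        while queue and queue[0][0] == now:
            _, _, sig, val = heapq.heappop(queue)
            if value[sig] != val:
                value[sig] = val
                events.add(sig)
                trace.append((now[0], now[1], sig, val))
        # Resume each process that is sensitive to a signal with an event.
        for sens, target, func, delay in processes:
            if sens in events:
                t, d = now
                when = (t, d + 1) if delay == 0 else (t + delay, 0)
                heapq.heappush(queue, (when, seq, target, func(value[sens])))
                seq += 1
    return trace


def inv(v):
    return 1 - v


def buf(v):
    return v


# b <= not a; c <= not b; (both zero delay) and y <= c after 5 ns;
trace = simulate(
    stimulus=[(10, "a", 1)],
    processes=[("a", "b", inv, 0), ("b", "c", inv, 0), ("c", "y", buf, 5)],
    initial={"a": 0, "b": 1, "c": 0, "y": 0},
)
for entry in trace:
    print(entry)
```

The events on b and c occur at the same real time as the change on a (10 ns) but in later delta cycles; only the delayed assignment to y advances real time.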
13.4.2 Delay
In VHDL you may assign a delay mechanism to an assignment statement. Transport delay is characteristic of wires and transmission lines that exhibit nearly infinite frequency response and will transmit any pulse, no matter how short. Inertial delay more closely models the real behavior of logic cells. Typically, a logic cell will not transmit a pulse that is shorter than the switching time of the circuit, and this is the default pulse-rejection limit. If we explicitly specify a pulse-rejection limit, the assignment will not transmit a pulse shorter than the limit. As an example, the following three assignments are equivalent to each other:

    Op <= Ip after 10 ns;
    Op <= inertial Ip after 10 ns;
    Op <= reject 10 ns inertial Ip after 10 ns;

Every assignment that uses transport delay can be written using inertial delay with a pulse-rejection limit, as the following examples illustrate.

    -- Assignments using transport delay:
    Op <= transport Ip after 10 ns;
    Op <= transport Ip after 10 ns, not Ip after 20 ns;
    -- Their equivalent assignments:
    Op <= reject 0 ns inertial Ip after 10 ns;
    Op <= reject 0 ns inertial Ip after 10 ns, not Ip after 10 ns;
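The difference between the two delay mechanisms can be seen by applying both to the same input waveform. The following Python sketch is a simplified pulse filter, not the driver-update algorithm defined in the VHDL LRM: it treats a waveform as a list of (time, value) transitions and drops any value that is held for less than the pulse-rejection limit. The function names and the example waveform are our own.

```python
def transport(wave, delay):
    """Transport delay: every transition is passed on, shifted by delay."""
    return [(t + delay, v) for t, v in wave]


def inertial(wave, delay, reject=None):
    """Simplified inertial delay with a pulse-rejection limit.

    wave is a list of (time_ns, value) transitions in time order. A value
    held for less than `reject` ns is swallowed. By default the rejection
    limit equals the delay, as for `Op <= Ip after 10 ns;`.
    """
    if reject is None:
        reject = delay
    out = []
    for i, (t, v) in enumerate(wave):
        next_t = wave[i + 1][0] if i + 1 < len(wave) else None
        if next_t is not None and next_t - t < reject:
            continue                      # pulse too short: rejected
        if out and out[-1][1] == v:
            continue                      # no change on the output
        out.append((t + delay, v))
    return out


# Input: a 3 ns pulse at t = 20 ns, then a lasting change at t = 50 ns.
ip = [(0, "0"), (20, "1"), (23, "0"), (50, "1")]

print(transport(ip, 10))            # the 3 ns pulse appears on the output
print(inertial(ip, 10))             # the 3 ns pulse is rejected
print(inertial(ip, 10, reject=0))   # same as transport delay
```

With a rejection limit of 0 ns the inertial model passes every pulse, mirroring the equivalence between `transport` and `reject 0 ns inertial` shown above.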

13.5 Cell Models

There are several different kinds of logic cell models:
• Primitive models, which are produced by the ASIC library company and describe the function and properties of each logic cell (NAND, D flip-flop, and so on) using primitive functions.
• Verilog and VHDL models that are produced by an ASIC library company from the primitive models.
• Proprietary models produced by library companies that describe either small logic cells or larger functions such as microprocessors.
A logic cell model is different from the cell delay model, which is used to calculate the delay of the logic cell, from the power model , which is used to calculate power dissipation of the logic cell, and from the interconnect timing model , which is used to calculate the delays between logic cells (we return to these in Section 13.6 ).
13.5.1 Primitive Models

The following is an example of a primitive model from an ASIC library company (Compass Design Automation). This particular model (for a two-input NAND cell) is complex because it is intended for a 0.35 µm process and has some advanced delay modeling features. The contents are not important to an ASIC designer, but almost all of the information about a logic cell is derived from the primitive model. The designer does not normally see this primitive model; it may only be used by an ASIC library company to generate other models (Verilog or VHDL, for example).
    Function
    (timingModel = oneOf("ism","pr"); powerModel = oneOf("pin"); )
    Rec
      Logic = Function (A1; A2; )Rec ZN = not (A1 AND A2); End; End;
      miscInfo = Rec
        Title = "2-Input NAND, 1X Drive";
        freq_fact = 0.5;
        tml = "nd02d1 nand 2 * zn a1 a2";
        MaxParallel = 1; Transistors = 4; power = 0.179018;
        Width = 4.2; Height = 12.6;
        productName = "stdcell35"; libraryName = "cb35sc";
      End;
      Pin = Rec
        A1 = Rec input; cap = 0.010; doc = "Data Input"; End;
        A2 = Rec input; cap = 0.010; doc = "Data Input"; End;
        ZN = Rec output; cap = 0.009; doc = "Data Output"; End;
      End;
      Symbol = Select timingModel
        On pr Do Rec
          tA1D_fr = |( Rec prop = 0.078; ramp = 2.749; End);
          tA1D_rf = |( Rec prop = 0.047; ramp = 2.506; End);
          tA2D_fr = |( Rec prop = 0.063; ramp = 2.750; End);
          tA2D_rf = |( Rec prop = 0.052; ramp = 2.507; End);
        End
        On ism Do Rec
          tA1D_fr = |( Rec A0 = 0.0015; dA = 0.0789; D0 = -0.2828; dD = 4.6642; B = 0.6879; Z = 0.5630; End );
          tA1D_rf = |( Rec A0 = 0.0185; dA = 0.0477; D0 = -0.1380; dD = 4.0678; B = 0.5329; Z = 0.3785; End );
          tA2D_fr = |( Rec A0 = 0.0079; dA = 0.0462; D0 = -0.2819; dD = 4.6646; B = 0.6856; Z = 0.5282; End );
          tA2D_rf = |( Rec A0 = 0.0060; dA = 0.0464; D0 = -0.1408; dD = 4.0731; B = 0.6152; Z = 0.4064; End );
        End;
      End;
      Delay = |(
        Rec from = pin.A1; to = pin.ZN;
          edges = Rec fr = Symbol.tA1D_fr; rf = Symbol.tA1D_rf; End; End,
        Rec from = pin.A2; to = pin.ZN;
          edges = Rec fr = Symbol.tA2D_fr; rf = Symbol.tA2D_rf; End; End );
      MaxRampTime = |(
        Rec check = pin.A1; riseTime = 3.000; fallTime = 3.000; End,
        Rec check = pin.A2; riseTime = 3.000; fallTime = 3.000; End,
        Rec check = pin.ZN; riseTime = 3.000; fallTime = 3.000; End );
      DynamicPower = |( Rec rise = { ZN }; val = 0.003; End);
    End; End

This primitive model contains the following information:
• The logic cell name, the logic cell function expressed using primitive functions, and port names.
• A list of supported delay models (ism stands for input-slope delay model, and pr for prop-ramp delay model; see Section 13.6).
• Miscellaneous data on the logic cell size, the number of transistors, and so on, primarily for use by logic-synthesis tools and for data book generation.
• Information for power dissipation models and timing analysis.
13.5.2 Synopsys Models

The ASIC library company may provide vendor models in formats unique to each CAD tool company. The following is an example of a Synopsys model derived from a primitive model similar to the example in Section 13.5.1. In a Synopsys library, each logic cell is part of a large file that also contains wire-load models and other characterization information for the cell library.

    cell (nd02d1) {
    /* title : 2-Input NAND, 1X Drive */
    /* pmd checksum : 'HBA7EB26C */
      area : 1;
      pin(a1) { direction : input; capacitance : 0.088;
        fanout_load : 0.088; }
      pin(a2) { direction : input; capacitance : 0.087;
        fanout_load : 0.087; }
      pin(zn) { direction : output; max_fanout : 1.786;
        max_transition : 3; function : "(a1 a2)'";
        timing() {
          timing_sense : "negative_unate"
          intrinsic_rise : 0.24 intrinsic_fall : 0.17
          rise_resistance : 1.68 fall_resistance : 1.13
          related_pin : "a1" }
        timing() {
          timing_sense : "negative_unate"
          intrinsic_rise : 0.32 intrinsic_fall : 0.18
          rise_resistance : 1.68 fall_resistance : 1.13
          related_pin : "a2" }
      }
    } /* end of cell */

This file contains the only information the Synopsys logic synthesizer, simulator, and other design tools use. If the information is not in this model, the tools cannot produce it. You can see that not all of the information from a primitive model is necessarily present in a vendor model.
13.5.3 Verilog Models

The following is a Verilog model for an inverter (derived from a primitive model):

    `celldefine
    `delay_mode_path
    `suppress_faults
    `enable_portfaults
    `timescale 1 ns / 1 ps
    module in01d1 (zn, i); input i; output zn; not G2(zn, i);
    specify specparam
    InCap$i = 0.060, OutCap$zn = 0.038, MaxLoad$zn = 1.538,
    R_Ramp$i$zn = 0.542:0.980:1.750, F_Ramp$i$zn = 0.605:1.092:1.950;
    specparam cell_count = 1.000000; specparam Transistors = 4 ;
    specparam Power = 1.400000; specparam MaxLoadedRamp = 3 ;
    (i => zn) = (0.031:0.056:0.100, 0.028:0.050:0.090);
    endspecify
    endmodule
    `nosuppress_faults
    `disable_portfaults
    `endcelldefine

This is very similar in form to the model for the MUX of Section 13.2.1, except that this model includes additional timing parameters (at the beginning of the specify block). These timing parameters were omitted to simplify the model of Section 13.2.1 (see Section 13.6 for an explanation of their function).

There are no standards on writing Verilog logic cell models. In the Verilog model, in01d1, fixed delays (corresponding to zero load capacitance) are embedded in a specify block. The parameters describing the delay equations for the timing model and other logic cell parameters (area, power-model parameters, and so on) are specified using the Verilog specparam feature. Writing the model in this way allows the model information to be accessed using the Verilog PLI routines. It also allows us to back-annotate timing information by overriding the data in the specify block.

The following Verilog code tests the model for logic cell in01d1:

    `timescale 1 ns / 1 ps
    module SDF_b; reg A; in01d1 i1 (B, A);
    initial begin A = 0; #5; A = 1; #5; A = 0; end
    initial $monitor("T=%6g",$realtime," A=",A," B=",B);
    endmodule

    T=     0 A=0 B=x
    T= 0.056 A=0 B=1
    T=     5 A=1 B=1
    T=  5.05 A=1 B=0
    T=    10 A=0 B=0
    T=10.056 A=0 B=1
In this case the simulator has used the fixed, typical timing delays (0.056 ns for the rising delay, and 0.05 ns for the falling delay, both from line 12 in module in01d1). Here is an example SDF file (filename SDF_b.sdf) containing back-annotation timing delays:

    (DELAYFILE
    (SDFVERSION "3.0") (DESIGN "SDF.v") (DATE "Aug-13-96")
    (VENDOR "MJSS") (PROGRAM "MJSS") (VERSION "v0")
    (DIVIDER .) (TIMESCALE 1 ns)
    (CELL (CELLTYPE "in01d1")
      (INSTANCE SDF_b.i1)
      (DELAY (ABSOLUTE
        (IOPATH i zn (1.151:1.151:1.151) (1.363:1.363:1.363))
      ))
    )
    )

(Notice that since Verilog is case sensitive, the instance names and node names in the SDF file are also case sensitive.) This SDF file describes the path delay between input (pin i) and output (pin zn) as 1.151 ns (rising delay; minimum, typical, and maximum are identical in this simple example) and 1.363 ns (falling delay). These delays are calculated by a delay calculator. The delay calculator may be a standalone tool or part of the simulator. This tool calculates the delay values by using the delay parameters in the logic cell model (lines 8 and 9 in module in01d1).

We call a system task, $sdf_annotate, to perform back-annotation,

    `timescale 1 ns / 1 ps
    module SDF_b; reg A; in01d1 i1 (B, A);
    initial begin
    $sdf_annotate ( "SDF_b.sdf", SDF_b, , "sdf_b.log", "minimum", , );
    A = 0; #5; A = 1; #5; A = 0; end
    initial $monitor("T=%6g",$realtime," A=",A," B=",B);
    endmodule
Here is the output (from MTI V-System/Plus) including back-annotated timing:

    T=     0 A=0 B=x
    T= 1.151 A=0 B=1
    T=     5 A=1 B=1
    T= 6.363 A=1 B=0
    T=    10 A=0 B=0
    T=11.151 A=0 B=1

The delay information from the SDF file has been passed to the simulator.

Back-annotation is not part of the IEEE 1364 Verilog standard, although many Verilog-compatible simulators do support the $sdf_annotate system task. Many ASIC vendors require the use of Verilog to complete a back-annotated timing simulation before they will accept a design for manufacture. Used in this way Verilog is referred to as a golden simulator, since an ASIC vendor uses Verilog to judge whether an ASIC design fabricated using its process will work.
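The two simulation logs differ only in the delay values applied to the same inverter. The Python sketch below reproduces the output transition times for both cases; it is an illustration of the arithmetic, not a model of how a Verilog simulator implements back-annotation. The helper names parse_triple and inverter_output are our own, and the delay strings are copied from the specify block and the SDF file above.

```python
def parse_triple(text):
    """Split a 'min:typ:max' delay triple into a dict of floats."""
    minimum, typical, maximum = (float(x) for x in text.split(":"))
    return {"minimum": minimum, "typical": typical, "maximum": maximum}


def inverter_output(stimulus, rise, fall):
    """Return the output transitions (time_ns, value) of an inverter.

    stimulus is a list of (time_ns, input_value); rise and fall are the
    delays applied to rising and falling output transitions.
    """
    out = []
    for t, a in stimulus:
        b = 1 - a
        delay = rise if b == 1 else fall
        out.append((round(t + delay, 3), b))
    return out


stimulus = [(0, 0), (5, 1), (10, 0)]      # A = 0; #5; A = 1; #5; A = 0;

# Fixed delays from the specify block; the first simulation used typical values.
spec_rise = parse_triple("0.031:0.056:0.100")["typical"]
spec_fall = parse_triple("0.028:0.050:0.090")["typical"]
before = inverter_output(stimulus, spec_rise, spec_fall)

# IOPATH delays from SDF_b.sdf; $sdf_annotate was called with "minimum".
sdf_rise = parse_triple("1.151:1.151:1.151")["minimum"]
sdf_fall = parse_triple("1.363:1.363:1.363")["minimum"]
after = inverter_output(stimulus, sdf_rise, sdf_fall)

print(before)
print(after)
```

The transition times on B match the two logs: 0.056, 5.05, and 10.056 ns before back-annotation, and 1.151, 6.363, and 11.151 ns after.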
13.5.4 VHDL Models

Initially VHDL did not offer a standard way to perform back-annotation. Here is an example of a VHDL model for an inverter used to perform a back-annotated timing simulation using an Altera programmable ASIC:

    library IEEE; use IEEE.STD_LOGIC_1164.all;
    library COMPASS_LIB; use COMPASS_LIB.COMPASS_ETC.all;
    entity bknot is
      generic (derating : REAL := 1.0; Z1_cap : REAL := 0.000;
        INSTANCE_NAME : STRING := "bknot");
      port (Z2 : in Std_Logic; Z1 : out STD_LOGIC);
    end bknot;
    architecture bknot of bknot is
      constant tplh_Z2_Z1 : TIME := (1.00 ns + (0.01 ns * Z1_Cap)) * derating;
      constant tphl_Z2_Z1 : TIME := (1.00 ns + (0.01 ns * Z1_Cap)) * derating;
    begin
      process (Z2)
        variable int_Z1 : Std_Logic := 'U';
        variable tplh_Z1, tphl_Z1, Z1_delay : time := 0 ns;
        variable CHANGED : BOOLEAN;
      begin
        int_Z1 := not (Z2);
        if Z2'EVENT then
          tplh_Z1 := tplh_Z2_Z1; tphl_Z1 := tphl_Z2_Z1;
        end if;
        Z1_delay := F_Delay(int_Z1, tplh_Z1, tphl_Z1);
        Z1 <= int_Z1 after Z1_delay;
      end process;
    end bknot;
    configuration bknot_CON of bknot is for bknot end for;
    end bknot_CON;

This model accepts two generic parameters: load capacitance, Z1_cap, and a derating factor, derating, used to adjust postlayout timing delays. The proliferation of different VHDL back-annotation techniques drove the VHDL community to develop a standard method to complete back-annotation: VITAL.
13.5.5 VITAL Models

VITAL is the VHDL Initiative Toward ASIC Libraries, IEEE Std 1076.4 [1995] [footnote 1]. VITAL allows the use of sign-off quality ASIC libraries with VHDL simulators. Sign-off is the transfer of a design from a customer to an ASIC vendor. If the customer has completed simulation of a design using sign-off quality models from an approved cell library and a golden simulator, the customer and ASIC vendor will sign off the design (by signing a contract) and the vendor guarantees that the silicon will match the simulation. VITAL models, like Verilog models, may be generated from primitive models. Here is an example of a VITAL-compliant model for an inverter,

    library IEEE; use IEEE.STD_LOGIC_1164.all;
    use IEEE.VITAL_timing.all; use IEEE.VITAL_primitives.all;
    entity IN01D1 is
      generic (
        tipd_I : VitalDelayType01 := (0 ns, 0 ns);
        tpd_I_ZN : VitalDelayType01 := (0 ns, 0 ns) );
      port (
        I : in STD_LOGIC := 'U';
        ZN : out STD_LOGIC := 'U' );
      attribute VITAL_LEVEL0 of IN01D1 : entity is TRUE;
    end IN01D1;
    architecture IN01D1 of IN01D1 is
      attribute VITAL_LEVEL1 of IN01D1 : architecture is TRUE;
      signal I_ipd : STD_LOGIC := 'X';
    begin
      WIREDELAY: block
      begin VitalWireDelay(I_ipd, I, tipd_I); end block;
      VITALbehavior : process (I_ipd)
        variable ZN_zd : STD_LOGIC;
        variable ZN_GlitchData : VitalGlitchDataType;
      begin
        ZN_zd := VitalINV(I_ipd);
        VitalPathDelay01(
          OutSignal => ZN,
          OutSignalName => "ZN",
          OutTemp => ZN_zd,
          Paths => (0 => (I_ipd'LAST_EVENT, tpd_I_ZN, TRUE)),
          GlitchData => ZN_GlitchData,
          DefaultDelay => VitalZeroDelay01,
          Mode => OnEvent,
          MsgOn => FALSE,
          XOn => TRUE,
          MsgSeverity => ERROR);
      end process;
    end IN01D1;
The following testbench, SDF_testbench, contains an entity, SDF, that in turn instantiates a copy of an inverter, in01d1:

    library IEEE; use IEEE.STD_LOGIC_1164.all;
    entity SDF is port ( A : in STD_LOGIC; B : out STD_LOGIC );
    end SDF;
    architecture SDF of SDF is
    component in01d1 port ( I : in STD_LOGIC; ZN : out STD_LOGIC );
    end component;
    begin i1: in01d1 port map ( I => A, ZN => B);
    end SDF;

    library STD; use STD.TEXTIO.all;
    library IEEE; use IEEE.STD_LOGIC_1164.all;
    entity SDF_testbench is end SDF_testbench;
    architecture SDF_testbench of SDF_testbench is
    component SDF port ( A : in STD_LOGIC; B : out STD_LOGIC );
    end component;
    signal A, B : STD_LOGIC := '0';
    begin
    SDF_b : SDF port map ( A => A, B => B);
    process begin
      A <= '0'; wait for 5 ns; A <= '1';
      wait for 5 ns; A <= '0'; wait;
    end process;
    process (A, B) variable L: LINE; begin
      write(L, now, right, 10, TIME'(ps));
      write(L, STRING'(" A=")); write(L, TO_BIT(A));
      write(L, STRING'(" B=")); write(L, TO_BIT(B));
      writeline(output, L);
    end process;
    end SDF_testbench;

Here is an SDF file (SDF_b.sdf) that contains back-annotation timing information (min/typ/max timing values are identical in this example):

    (DELAYFILE
    (SDFVERSION "3.0") (DESIGN "SDF.vhd") (DATE "Aug-13-96")
    (VENDOR "MJSS") (PROGRAM "MJSS") (VERSION "v0")
    (DIVIDER .) (TIMESCALE 1 ns)
    (CELL (CELLTYPE "in01d1")
      (INSTANCE i1)
      (DELAY (ABSOLUTE
        (IOPATH i zn (1.151:1.151:1.151) (1.363:1.363:1.363))
        (PORT i (0.021:0.021:0.021) (0.025:0.025:0.025))
      ))
    )
    )

(VHDL is case insensitive, but to allow the use of an SDF file with both Verilog and VHDL we must maintain case.) As in the Verilog example in Section 13.5.3 the logic cell delay (from the input pin of the inverter, i, to the output pin, zn) follows the IOPATH keyword. In this example there is also an interconnect delay that follows the PORT keyword. The interconnect delay has been placed, or lumped, at the input of the inverter. In order to include back-annotation timing using the SDF file, SDF_b.sdf, we use a command-line switch to the simulator. In the case of MTI V-System/Plus the command is as follows:

    <msmith/MTI/vital> vsim -c -sdfmax /sdf_b=SDF_b.sdf sdf_testbench
    ...
    # 0 ps A=0 B=0
    # 0 ps A=0 B=0
    # 1176 ps A=0 B=1
    # 5000 ps A=1 B=1
    # 6384 ps A=1 B=0
13.5 Cell Models
# 10000 ps A=0 B=0 # 11176 ps A=0 B=1 We have to explain to the simulator where in the design hierarchy to apply the timing information in the SDF file. The situation is like giving someone directions Go North on the M1 and turn left at the third intersection, but where do we start? London or Birmingham? VHDL needs much more precise directions. Using VITAL we say we back-annotate to a region . The switch /sdf_b=SDF_b.sdf specifies that all instance names in the SDF file, SDF_b.sdf , are relative to the region /sdf_b . The region refers to instance name sdf_b (line 9 in entity SDF_testbench ), which is an instance of component SDF . Component SDF in turn contains an instance of a component, in01d1 , with instance name i1 (line 7 in architecture SDF ). Through this rather (for us) difficult-to-follow set of directions, the simulator knows that ... (CELL (CELLTYPE "in01d1") (INSTANCE i1) ... refers to (SDF) cell or (VHDL) component in01d1 with instance name i1 in instance SDF_b of the compiled model sdf_testbench . Notice that we cannot use an SDF file of the following form (as we did for the Verilog version of this example): ... (CELL (CELLTYPE "in01d1") (INSTANCE SDF_b.i1) ... There is no instance in the VHDL model higher than instance name SDF_b that we can use as a starting point for VITAL back-annotation. In the Verilog SDF file we can refer to the name of the top-level module ( SDF_b in line 2 in module SDF_b ). We cannot do this in VHDL; we must name an instance. The result is that, unless you are careful in constructing the hierarchy of your VHDL design, you may not be able to use the same SDF file for back-annotating both VHDL and Verilog.
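The time stamps in the simulator output above can be cross-checked against the values in SDF_b.sdf. The following Python sketch is only an illustration (the function and dictionary names are invented here, and it is not part of any simulator). It assumes that the PORT delay is selected by the direction of the input edge and the IOPATH delay by the direction of the resulting output edge, a reading that reproduces the 6384 ps and 11176 ps events.

```python
# Values copied from SDF_b.sdf (TIMESCALE 1 ns), held as integer picoseconds.
IOPATH_PS = {"rise": 1151, "fall": 1363}  # inverter i -> zn, keyed by OUTPUT edge
PORT_PS = {"rise": 21, "fall": 25}        # lumped wire delay at pin i, keyed by INPUT edge


def output_event_ps(input_event_ps, input_edge):
    """Time at which B changes after A makes `input_edge` at `input_event_ps`.

    Assumption (not stated by the simulator): total delay is the PORT delay
    for the input edge plus the IOPATH delay for the opposite (inverted) edge.
    """
    output_edge = "fall" if input_edge == "rise" else "rise"  # an inverter flips the edge
    return input_event_ps + PORT_PS[input_edge] + IOPATH_PS[output_edge]


# A rises at 5 ns and falls again at 10 ns in the testbench.
events = [output_event_ps(5000, "rise"), output_event_ps(10000, "fall")]
print(events)
```

Under this assumption the two computed event times agree with the B transitions reported by the simulator after the 5 ns and 10 ns input edges.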
13.5.6 SDF in Simulation

SDF was developed to handle back-annotation, but it is also used to describe forward-annotation of timing constraints from logic synthesis. Here is an example of an SDF file that contains the timing information for the halfgate ASIC design:

    (DELAYFILE
      (SDFVERSION "1.0") (DESIGN "halfgate_ASIC_u") (DATE "Aug-13-96")
      (VENDOR "Compass") (PROGRAM "HDL Asst") (VERSION "v9r1.2")
      (DIVIDER .) (TIMESCALE 1 ns)
      (CELL (CELLTYPE "in01d0") (INSTANCE v_1.B1_i1)
        (DELAY (ABSOLUTE
          (IOPATH I ZN (1.151:1.151:1.151) (1.363:1.363:1.363)) ))
      )
      (CELL (CELLTYPE "pc5o06") (INSTANCE u1_2)
        (DELAY (ABSOLUTE
          (IOPATH I PAD (1.216:1.216:1.216) (1.249:1.249:1.249)) ))
      )
      (CELL (CELLTYPE "pc5d01r") (INSTANCE u0_2)
        (DELAY (ABSOLUTE
          (IOPATH PAD CIN (.169:.169:.169) (.199:.199:.199)) ))
      )
    )

This SDF file describes the delay due to the input pad (cell pc5d01r, instance name u0_2), our inverter (cell in01d0, instance name v_1.B1_i1), and the output pad (cell pc5o06, instance name u1_2). Since this SDF file was produced before any physical layout, there are no estimates for interconnect delay. The following partial SDF file illustrates how interconnect delay can be specified in SDF.

    (DELAYFILE ...
      (PROCESS "FAST-FAST") (TEMPERATURE 0:55:100) (TIMESCALE 100ps)
      (CELL (CELLTYPE "CHIP") (INSTANCE TOP)
        (DELAY (ABSOLUTE
          ( INTERCONNECT A.INV8.OUT B.DFF1.Q (:0.6:) (:0.6:)) )))

This SDF file specifies an interconnect delay (using the keyword INTERCONNECT) of 60 ps (0.6 units with a timescale of 100 ps per unit) between the output port of an inverter with instance name A.INV8 (note that '.' is the hierarchy divider) in block A and the Q input port of a D flip-flop (instance name B.DFF1) in block B.

The triplet notation (min : typ : max) in SDF corresponds to minimum, typical, and maximum values of a parameter. Specifying two triplets corresponds to rising (the first triplet) and falling delays. A single triplet corresponds to both. A third triplet corresponds to turn-off delay (transitions to or from 'Z'). You can also specify six triplets (rising, falling, '0' to 'Z', 'Z' to '1', '1' to 'Z', and 'Z' to '0'). When only the typical value is specified, the minimum and maximum are set equal to the typical value.

Logic cell delays can use several models in SDF. Here is one example:

    (INSTANCE B.DFF1)
    (DELAY (ABSOLUTE
      ( IOPATH (POSEDGE CLK) Q (12:14:15) (11:13:15))))

The IOPATH construct specifies a delay between the input pin and the output pin of a cell. In this example the delay is between the positive edge of the clock (input port) and the flip-flop output.
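The triplet rules described above can be illustrated with a few lines of Python. This is a minimal sketch, not an SDF parser: the helper names parse_triplet and to_ps are invented for this illustration, and the fill-in rule for a missing minimum or maximum simply follows the statement in the text that they are set equal to the typical value.

```python
def parse_triplet(text):
    """Parse an SDF triplet such as '(12:14:15)' or '(:0.6:)' into (min, typ, max)."""
    fields = text.strip().strip("()").split(":")
    if len(fields) != 3:
        raise ValueError(f"expected min:typ:max, got {text!r}")
    lo, typ, hi = [float(f) if f.strip() else None for f in fields]
    # Following the text: when only the typical value is given,
    # the minimum and maximum take the typical value.
    if typ is not None:
        lo = typ if lo is None else lo
        hi = typ if hi is None else hi
    return (lo, typ, hi)


def to_ps(triplet, timescale_ps):
    """Scale a triplet given in TIMESCALE units to picoseconds."""
    return tuple(None if v is None else v * timescale_ps for v in triplet)


# The INTERCONNECT example: (:0.6:) with (TIMESCALE 100ps).
interconnect = to_ps(parse_triplet("(:0.6:)"), 100)
# The clock-to-Q example: rising and falling triplets.
rise = parse_triplet("(12:14:15)")
fall = parse_triplet("(11:13:15)")
print(interconnect, rise, fall)
```

With a timescale of 100 ps per unit, the (:0.6:) triplet works out to about 60 ps for all three corners, which matches the interconnect delay quoted in the text.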
The following example SDF file is for an AO221 logic cell:

    (DELAYFILE
      (DESIGN "MYDESIGN") (DATE "26 AUG 1996") (VENDOR "ASICS_INC")
      (PROGRAM "SDF_GEN") (VERSION "3.0") (DIVIDER .)
      (VOLTAGE 3.6:3.3:3.0) (PROCESS "-3.0:0.0:3.0")
      (TEMPERATURE 0.0:25.0:115.0) (TIMESCALE )
      (CELL (CELLTYPE "AOI221") (INSTANCE X0)
        (DELAY (ABSOLUTE
          (IOPATH A1 Y (1.11:1.42:2.47) (1.39:1.78:3.19))
          (IOPATH A2 Y (0.97:1.30:2.34) (1.53:1.94:3.50))
          (IOPATH B1 Y (1.26:1.59:2.72) (1.52:2.01:3.79))
          (IOPATH B2 Y (1.10:1.45:2.56) (1.66:2.18:4.10))
          (IOPATH C1 Y (0.79:1.04:1.91) (1.36:1.62:2.61))
    ))))
1. IEEE Std 1076.4-1995, 1995 IEEE. All rights reserved.
13.6 Delay Models

We shall use the term timing model to describe delays outside logic cells and the term delay model to describe delays inside logic cells. These terms are not standard and often people use them interchangeably. There are also different terms for various types of delay:
• A pin-to-pin delay is a delay between an input pin and an output pin of a logic cell. This usually represents the delay of the logic cell excluding any delay contributed by interconnect.
• A pin delay is a delay lumped to a certain pin of a logic cell (usually an input). This usually represents the delay of the interconnect, but may also represent the delay of the logic cell.
• A net delay or wire delay is a delay outside a logic cell. This always represents the delay of interconnect.
In this section we shall focus on delay models and logic cell delays. In Chapter 3 we modeled logic cell delay as follows (Eq. 3.10):

$t_{PD} = R (C_{out} + C_p) + t_q$ . (13.5)

A linear delay model is also known as a prop-ramp delay model, because the delay comprises a fixed propagation delay (the intrinsic delay) and a ramp delay (the extrinsic delay). As an example, the data book entry for the inverter, cell in01d0, in a 0.8 μm standard-cell library gives the following delay information (with delay measured in nanoseconds and capacitance in picofarads):

RISE = 0.10 + 0.07 + (1.75 × Cld)
FALL = 0.09 + 0.07 + (1.95 × Cld) (13.6)

The first two terms in each of these equations represent the intrinsic delay, and the last term in each equation represents the extrinsic delay. We see that Cld corresponds to $C_{out}$, $R_{pu}$ = 1.75 kΩ, and $R_{pd}$ = 1.95 kΩ ($R_{pu}$ is the pull-up resistance and $R_{pd}$ is the pull-down resistance). From the data book the pin capacitances for this logic cell are as follows:

pin I (input) = 0.060 pF
pin ZN (output) = 0.038 pF (13.7)

Thus, $C_p$ = 0.038 pF and we can calculate the component of the intrinsic delay due to the output pin capacitance as follows:

$C_p R_{pu}$ = 0.038 × 1.75 = 0.0665 ns and $C_p R_{pd}$ = 0.038 × 1.95 = 0.0741 ns (13.8)

Suppose $t_{qr}$ and $t_{qf}$ are the parasitic delays for the rising and falling waveforms respectively. By comparing the data book equations for the rise and fall delays with Eq. and 13.7, we can identify $t_{qr}$ = 0.10 ns and $t_{qf}$ = 0.09 ns. Now we can explain the timing section of the in01d0 model (Section 13.5.3):

    specify
      specparam InCap$i = 0.060, OutCap$zn = 0.038, MaxLoad$zn = 1.538,
                R_Ramp$i$zn = 0.542:0.980:1.750,
                F_Ramp$i$zn = 0.605:1.092:1.950;
      specparam cell_count = 1.000000;
      specparam Transistors = 4 ;
      specparam Power = 1.400000;
      specparam MaxLoadedRamp = 3 ;
      (i=>zn)=(0.031:0.056:0.100, 0.028:0.050:0.090);

The parameter OutCap$zn is $C_p$. The maximum value of the parameter R_Ramp$i$zn is $R_{pu}$, and the maximum value of parameter F_Ramp$i$zn is $R_{pd}$. Finally, the maximum values of the fixed-delay triplets correspond to $t_{qr}$ and $t_{qf}$.
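As a worked illustration of the linear delay model, the Python sketch below evaluates the rise and fall delay of in01d0 from the worst-case parameters identified above. The function name and the choice of load (a single in01d0 input pin, 0.060 pF) are assumptions made for this example only.

```python
# Worst-case values for in01d0 quoted in the text (ns, pF, kOhm; kOhm * pF = ns).
T_QR, T_QF = 0.10, 0.09   # parasitic delays t_qr and t_qf
R_PU, R_PD = 1.75, 1.95   # pull-up and pull-down resistance
C_P = 0.038               # output pin capacitance


def prop_ramp_delay(c_load_pf):
    """Return (rise, fall) delay in ns for an external load in pF.

    Implements t_PD = R (C_out + C_p) + t_q with separate rising and
    falling parameters.
    """
    rise = T_QR + R_PU * (C_P + c_load_pf)
    fall = T_QF + R_PD * (C_P + c_load_pf)
    return rise, fall


# Hypothetical load: one in01d0 input pin (0.060 pF).
rise, fall = prop_ramp_delay(0.060)
print(f"rise = {rise:.4f} ns, fall = {fall:.4f} ns")
```

The data book equations round the $C_p R$ term to 0.07 ns, so values read from the data book form differ from this calculation by a few picoseconds.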
13.6.1 Using a Library Data Book

ASIC library data books typically contain two types of information for each cell in the library: capacitance loading and delay. Table 13.7 shows the input capacitances for the inverter family for both an area-optimized library (small) and a performance-optimized library (fast). From Table 13.7, the input capacitance of the small library version of the inv1 (a 1X inverter gate) is 0.034 pF. Any logic cell that is driving an inv1 from the small library sees this as a load capacitance. This capacitance consists of the gate capacitance of a p-channel transistor, the gate capacitance of an n-channel transistor, and the internal cell routing. Similarly, 0.145 pF is the input capacitance of a fast inv1. We can deduce that the transistors in the fast library are approximately 0.145 / 0.034 ≈ 4 times larger than those in the small version. The small library and fast library may not have the same cell height (they usually do not), so we cannot mix cells from different libraries in the same standard-cell area.

TABLE 13.7 Input capacitances for an inverter family (pF).1

    Library       inv1    invh    invs    inv8    inv12
    Area          0.034   0.067   0.133   0.265   0.397
    Performance   0.145   0.292   0.584   1.169   1.753

The delay table for a 2:1 MUX is shown in Table 13.8. For example, D0/ to Z/ indicates the path delay from the D0 input rising to the Z output rising. Rising delay is denoted by '/' and falling delay by '\'.
TABLE 13.8 Delay information for a 2:1 MUX.

                               Propagation delay
                               Area                      Performance
    From input 2   To output   Extrinsic    Intrinsic    Extrinsic    Intrinsic
                               /ns pF^-1    /ns          /ns pF^-1    /ns
    D0\            Z\          2.10         1.42         0.5          0.8
    D0/            Z/          3.66         1.23         0.68         0.70
    D1\            Z\          2.10         1.42         0.50         0.80
    D1/            Z/          3.66         1.23         0.68         0.70
    SD\            Z\          2.10         1.42         0.50         0.80
    SD\            Z/          3.66         1.09         0.70         0.73
    SD/            Z\          2.10         2.09         0.5          1.09
    SD/            Z/          3.66         1.23         0.68         0.70

Both intrinsic delay and extrinsic delay values are given in Table 13.8. For example, the delay $t_{PD}$ (from D0\ to Z\) of a 2:1 MUX from the small library is

$t_{PD}$ = 1.42 ns + (2.10 ns/pF) × $C_L$ (pF) . (13.9)

ASIC cell libraries may be characterized and the delay information presented in several ways in a data book. Some manufacturers simulate under worst-case slow conditions (4.5 V, 100 °C, and slow process conditions, for example) and then derate each delay value to convert delays to nominal conditions (5.0 V, 25 °C, and nominal process). This allows nominal delays to be used in the data book while maintaining accurate predictions for worst-case behavior. Other manufacturers characterize using nominal conditions and include worst-case values in the data book. In either case, we always design with worst-case values. Data books normally include process, voltage, and temperature derating factors as tables or graphs such as those shown in Tables 13.9 and 13.10.

For example, suppose we are measuring the performance of an ASIC on the bench and the lab temperature (25 °C) and the power supply voltage (5 V) correspond to nominal operating conditions. We shall assume, in the absence of other information, that we have an ASIC from a nominal process lot. We have data book values given as worst case (worst-case temperature, 100 °C; worst-case voltage, 4.5 V; slow process) and we wish to find nominal values for delay to compare them with our measured results. From Table 13.9 the derating factor from nominal process to slow process is 1.31. From Table 13.10 the derating factor from 100 °C and 4.5 V to nominal (25 °C and 5 V) is 1.60. The derating factor from nominal to worst-case (data book values) is thus:

worst-case = nominal × 1.31 (slow process) × 1.60 (4.5 V, 100 °C). (13.10)

To get from the data book values to nominal operating conditions we use the following equation:

nominal = worst-case / (1.31 × 1.60) = 0.477 × worst-case. (13.11)

TABLE 13.9 Process derating factors.

    Process    Derating factor
    Slow       1.31
    Nominal    1.0
    Fast       0.75

TABLE 13.10 Temperature and voltage derating factors.

                      Supply voltage
    Temperature/°C    4.5 V   4.75 V   5.00 V   5.25 V   5.50 V
    -40               0.77    0.73     0.68     0.64     0.61
    0                 1.00    0.93     0.87     0.82     0.78
    25                1.14    1.07     1.00     0.94     0.90
    85                1.50    1.40     1.33     1.26     1.20
    100               1.60    1.49     1.41     1.34     1.28
    125               1.76    1.65     1.56     1.47     1.41
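The derating calculation above is easy to mechanize. The following Python sketch uses the process factors from Table 13.9 and a few entries from Table 13.10; the function names and the 10 ns worst-case delay are hypothetical values chosen only to show the arithmetic.

```python
# Table 13.9: process derating factors.
PROCESS = {"slow": 1.31, "nominal": 1.00, "fast": 0.75}

# Subset of Table 13.10, keyed by (temperature in deg C, supply voltage in V).
TEMP_VOLT = {
    (25, 5.0): 1.00,
    (100, 4.5): 1.60,
    (125, 4.5): 1.76,
    (-40, 5.5): 0.61,
}


def derating_factor(process, temp_c, vdd):
    """Combined factor that scales a nominal delay to the given corner."""
    return PROCESS[process] * TEMP_VOLT[(temp_c, vdd)]


def to_nominal(worst_case_ns, process="slow", temp_c=100, vdd=4.5):
    """Convert a worst-case data book delay back to nominal conditions."""
    return worst_case_ns / derating_factor(process, temp_c, vdd)


factor = derating_factor("slow", 100, 4.5)   # 1.31 * 1.60
nominal_ns = to_nominal(10.0)                # hypothetical 10 ns worst-case delay
print(f"factor = {factor:.3f}, nominal = {nominal_ns:.2f} ns")
```

The same function gives the best-case corner by selecting the fast process with the low-temperature, high-voltage entry.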
13.6.2 Input-Slope Delay Model

It is increasingly important for submicron technologies to account for the effects of the rise (and fall) time of the input waveforms to a logic cell. The nonlinear delay model described in this section was developed by Mike Misheloff at VLSI Technology and then at Compass. There are, however, no standards in this area: each ASIC company has its own, often proprietary, model. We begin with some definitions:

• $D_{t0}$ is the time from the beginning of the input to the beginning of the output.
• $D_{t1}$ is the time from the beginning of the input to the end of the output.
• $I_R$ is the time from the beginning to the end of the input ramp.

In these definitions "beginning" and "end" refer to the projected intersections of the input waveform or the output waveform with $V_{DD}$ and $V_{SS}$ as appropriate. Then we can calculate the delay, $D$ (measured with 0.5 trip points at input and output), and output ramp, $O_R$, as follows:

$D = (D_{t1} + D_{t0} - I_R) / 2$ (13.12)

and

$O_R = D_{t1} - D_{t0}$ . (13.13)

Experimentally we find that the times, $D_{t0}$ and $D_{t1}$, are accurately modeled by the following equations:

$D_{t0} = A_0 + D_0 C_L + B \min(I_R, C_R) + Z \max(0, I_R - C_R)$ (13.14)

and

$D_{t1} = A_1 + B I_R + D_1 C_L$ . (13.15)

$C_R$ is the critical ramp that separates two regions of operation, which we call slow ramp and fast ramp. A sensible definition for $C_R$ is the point at which the end of the input ramp occurs at the same time the output reaches the 0.5 trip point. This leads to the following equation for $C_R$:

$C_R = \dfrac{A_0 + A_1 + (D_0 + D_1) C_L}{2 (1 - B)}$ (13.16)

It is convenient to define two more parameters:

$d_A = A_1 - A_0$ and $d_D = D_1 - D_0$ . (13.17)

In the region that $C_R > I_R$, we can simplify Eqs. 13.14 and by using the definitions in Eq. 13.17, as follows:

$D = (D_{t1} + D_{t0} - I_R)/2 = A_0 + D_0 C_L + d_A/2 + d_D C_L/2$ (13.18)

and

$O_R = D_{t1} - D_{t0} = d_A + d_D C_L$ . (13.19)
Now we can understand the timing parameters in the primitive model in Section 13.5.1. For example, the following parameter, tA1D_fr, models the falling input to rising output waveform delay for the logic cell (the units are a consistent set: all times are measured in nanoseconds and capacitances in picofarads):

    A0 = 0.0015; dA = 0.0789; D0 = -0.2828; dD = 4.6642; B = 0.6879; Z = 0.5630;

The input-slope model predicts delay in the fast-ramp region, $D_{ISM}$(50 %, FR), as follows (0.5 trip points):

$D_{ISM}$(50 %, FR) $= A_0 + D_0 C_L + 0.5\, O_R$
$= A_0 + D_0 C_L + d_A/2 + d_D C_L/2$
$= 0.0015 + 0.5 \times 0.0789 + (-0.2828 + 0.5 \times 4.6642)\, C_L$
$= 0.041 + 2.05\, C_L$ . (13.20)

We can adjust this delay to 0.35/0.65 trip points as follows:

$D_{ISM}$(65 %, FR) $= A_0 + D_0 C_L + 0.65\, O_R$
$= 0.0015 + 0.65 \times 0.0789 + (-0.2828 + 0.65 \times 4.6642)\, C_L$
$= 0.053 + 2.749\, C_L$ . (13.21)

We can now compare Eq. 13.21 with the prop-ramp model. The prop-ramp parameters for this logic cell (from the primitive model in Section 13.5.1) are:

    tA1D_fr = |( Rec prop = 0.078; ramp = 2.749; End);

These parameters predict the following prop-ramp delay (0.35/0.65 trip points):

$D_{PR}$(65 %) $= 0.078 + 2.749\, C_L$ . (13.22)

The input-slope delay model and the prop-ramp delay model predict similar delays in the fast-ramp region, but for slower inputs the differences can become significant.
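The comparison between the two models can be reproduced numerically. The Python sketch below codes Eqs. 13.16, 13.19, 13.20 to 13.22 directly from the tA1D_fr parameters; the function names and the 0.1 pF load are choices made for this illustration, and the fast-ramp expression follows the simplified form used in the text (it is valid only for input ramps shorter than the critical ramp).

```python
# tA1D_fr input-slope parameters (ns, pF) and prop-ramp parameters from the text.
A0, dA, D0, dD, B, Z = 0.0015, 0.0789, -0.2828, 4.6642, 0.6879, 0.5630
PROP, RAMP = 0.078, 2.749


def output_ramp(c_load):
    """Eq. 13.19: O_R = dA + dD * C_L."""
    return dA + dD * c_load


def d_ism_fast_ramp(c_load, trip=0.5):
    """Eqs. 13.20 / 13.21: fast-ramp delay at the given output trip point."""
    return A0 + D0 * c_load + trip * output_ramp(c_load)


def critical_ramp(c_load):
    """Eq. 13.16, using A1 = A0 + dA and D1 = D0 + dD (Eq. 13.17)."""
    a1, d1 = A0 + dA, D0 + dD
    return (A0 + a1 + (D0 + d1) * c_load) / (2.0 * (1.0 - B))


def d_prop_ramp(c_load):
    """Eq. 13.22: prop-ramp delay (0.35/0.65 trip points)."""
    return PROP + RAMP * c_load


c_l = 0.1  # hypothetical load in pF
print(f"D_ISM(50%) = {d_ism_fast_ramp(c_l):.3f} ns")
print(f"D_ISM(65%) = {d_ism_fast_ramp(c_l, 0.65):.3f} ns")
print(f"D_PR(65%)  = {d_prop_ramp(c_l):.3f} ns")
print(f"C_R        = {critical_ramp(c_l):.3f} ns")
```

For this load the two 65 % predictions differ only in their constant terms (0.053 ns against 0.078 ns), since the load-dependent slopes agree to three decimal places.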
13.6.3 Limitations of Logic Simulation

Table 13.11 shows the switching characteristics of a two-input NAND gate (1X drive) from a commercial 1 μm gate-array family. The difference in propagation delay (with FO = 0) between the inputs A and B is

(0.25 - 0.17) × 2 / (0.25 + 0.17) = 38 %.

This difference is taken into account only by a pin-to-pin delay model.

TABLE 13.11 Switching characteristics of a two-input NAND gate.

                                                 Fanout 3
    Symbol   Parameter                   FO = 0   FO = 1   FO = 2   FO = 4   FO = 8   K
                                         /ns      /ns      /ns      /ns      /ns      /ns pF^-1
    t PLH    Propagation delay, A to X   0.25     0.35     0.45     0.65     1.05     1.25
    t PHL    Propagation delay, B to X   0.17     0.24     0.30     0.42     0.68     0.79
    t r      Output rise time, X         1.01     1.28     1.56     2.10     3.19     3.40
    t f      Output fall time, X         0.54     0.69     0.84     1.13     1.71     1.83
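The K column of Table 13.11 can be related to the fanout columns with a simple linear model: delay = (delay at FO = 0) + K × FO × (one standard load). The Python sketch below checks this for the two propagation-delay rows. The function names are invented here, the 0.08 pF standard load is the value given in the table footnote, and the linear form itself is an interpretation of the table rather than something the data book states.

```python
STANDARD_LOAD_PF = 0.08  # one standard load, from the table footnote

# (delay at FO = 0 in ns, K in ns/pF, tabulated delays in ns keyed by fanout)
ROWS = {
    "t_PLH (A to X)": (0.25, 1.25, {1: 0.35, 2: 0.45, 4: 0.65, 8: 1.05}),
    "t_PHL (B to X)": (0.17, 0.79, {1: 0.24, 2: 0.30, 4: 0.42, 8: 0.68}),
}


def linear_delay_ns(t0_ns, k_ns_per_pf, fanout):
    """Delay predicted from the FO = 0 value and the slope K."""
    return t0_ns + k_ns_per_pf * fanout * STANDARD_LOAD_PF


def worst_mismatch_ns():
    """Largest difference between the linear model and the tabulated values."""
    return max(
        abs(linear_delay_ns(t0, k, fo) - tabulated)
        for t0, k, column in ROWS.values()
        for fo, tabulated in column.items()
    )


print(f"worst mismatch = {worst_mismatch_ns():.4f} ns")
```

The mismatch stays below the 0.01 ns resolution of the table, which suggests the fanout columns were generated from the same linear relation.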
Timing information for most gate-level simulators is calculated once, before simulation, using a delay calculator. This works as long as the logic cell delays and signal ramps do not change. There are some cases in which this is not true. Table 13.12 shows the switching characteristics of a half adder. In addition to pin-to-pin timing differences there is a timing difference depending on state. For example, the pin-to-pin timing from input pin A to the output pin S depends on the state of the input pin B. Depending on whether B = '0' or B = '1' the difference in propagation delay (at FO = 0) is

(0.93 - 0.58) × 2 / (0.93 + 0.58) = 46 %.

This state-dependent timing is not taken into account by simple pin-to-pin delay models and is not accounted for by most gate-level simulators.

TABLE 13.12 Switching characteristics of a half adder.

                                               Fanout 4
    Symbol   Parameter                 FO = 0   FO = 1   FO = 2   FO = 4   FO = 8   K
                                       /ns      /ns      /ns      /ns      /ns      /ns pF^-1
    t PLH    Delay, A to S (B = '0')   0.58     0.68     0.78     0.98     1.38     1.25
    t PHL    Delay, A to S (B = '1')   0.93     0.97     1.00     1.08     1.24     0.48
    t PLH    Delay, B to S (B = '0')   0.89     0.99     1.09     1.29     1.69     1.25
    t PHL    Delay, B to S (B = '1')   1.00     1.04     1.08     1.15     1.31     0.48
    t PLH    Delay, A to CO            0.43     0.53     0.63     0.83     1.23     1.25
    t PHL    Delay, A to CO            0.59     0.63     0.67     0.75     0.90     0.48
    t r      Output rise time, X       1.01     1.28     1.56     2.10     3.19     3.40
    t f      Output fall time, X       0.54     0.69     0.84     1.13     1.71     1.83
1. Suffix '1' denotes normal drive strength, suffix 'h' denotes high-power drive strength (approximately 2×), suffix 's' denotes superpower drive strength (approximately 4×), and a suffix 'm' (m = 8 or 12) denotes inverter blocks containing m inverters.
2. / = rising and \ = falling.
3. FO = fanout in standard loads (one standard load = 0.08 pF). Nominal conditions: V DD = 5 V, T A = 25 °C.
4. FO = fanout in standard loads (one standard load = 0.08 pF). Nominal conditions: V DD = 5 V, T A = 25 °C.
13.7 Static Timing Analysis

We return to the comparator/MUX example to see how timing analysis is applied to sequential logic. We shall use the same input code ( comp_mux.v in Section 13.2 ), but this time we shall target the design to an Actel FPGA. Before routing we obtain the following static timing analysis:
    Instance name        in pin-->out pin   tr   total   incr   cell
    ------------------------------------------------------------------
    END_OF_PATH
    outp_2_                                 R    27.26
    OUT1               : D--->PAD           R    27.26   7.55   OUTBUF
    I_1_CM8            : S11--->Y           R    19.71   4.40   CM8
    I_2_CM8            : S11--->Y           R    15.31   5.20   CM8
    I_3_CM8            : S11--->Y           R    10.11   4.80   CM8
    IN1                : PAD--->Y           R     5.32   5.32   INBUF
    a_2_                                    R     0.00   0.00
    BEGIN_OF_PATH

The estimated prelayout critical path delay is nearly 30 ns including the I/O-cell delays (ACT 3, worst-case, standard speed grade). This limits the operating frequency to 33 MHz (assuming we can get the signals to and from the chip pins with no further delays, which is highly unlikely). The operating frequency can be increased by pipelining the design as follows (by including three register stages: at the inputs, the outputs, and between the comparison and the select functions):
    // comp_mux_rrr.v
    module comp_mux_rrr(a, b, clock, outp);
    input [2:0] a, b; output [2:0] outp; input clock;
    reg [2:0] a_r, a_rr, b_r, b_rr, outp; reg sel_r;
    wire sel = ( a_r <= b_r ) ? 0 : 1;
    always @ (posedge clock) begin a_r <= a; b_r <= b; end
    always @ (posedge clock) begin a_rr <= a_r; b_rr <= b_r; end
    always @ (posedge clock) outp <= sel_r ? b_rr : a_rr;
    always @ (posedge clock) sel_r <= sel;
    endmodule

Following synthesis we optimize module comp_mux_rrr for maximum speed. Static timing analysis gives the following preroute critical paths:
    ---------------------INPAD to SETUP longest path-------------------
    Rise delay, Worst case
    Instance name        in pin-->out pin   tr   total   incr   cell
    ------------------------------------------------------------------
    END_OF_PATH
    D.a_r_ff_b2                             R     4.52   0.00   DF1
    INBUF_24           : PAD--->Y           R     4.52   4.52   INBUF
    a_2_                                    R     0.00   0.00
    BEGIN_OF_PATH

    ---------------------CLOCK to SETUP longest path-------------------
    Rise delay, Worst case
    Instance name        in pin-->out pin   tr   total   incr   cell
    ------------------------------------------------------------------
    END_OF_PATH
    D.sel_r_ff                              R     9.99   0.00   DF1
    I_1_CM8            : S10--->Y           R     9.99   0.00   CM8
    I_3_CM8            : S00--->Y           R     9.99   4.40   CM8
    a_r_ff_b1          : CLK--->Q           R     5.60   5.60   DF1
    BEGIN_OF_PATH

    ---------------------CLOCK to OUTPAD longest path------------------
    Rise delay, Worst case
    Instance name        in pin-->out pin   tr   total   incr   cell
    ------------------------------------------------------------------
    END_OF_PATH
    outp_2_                                 R    11.95
    OUTBUF_31          : D--->PAD           R    11.95   7.55   OUTBUF
    outp_ff_b2         : CLK--->Q           R     4.40   4.40   DF1
    BEGIN_OF_PATH

The timing analyzer has examined the following:

1. Paths that start at an input pad and end on the data input of a sequential logic cell (the D input to a D flip-flop, for example). We might call this an entry path (or input-to-D path) to a pipelined design. The longest entry delay (or input-to-setup delay) is 4.52 ns.
2. Paths that start at a clock input to a sequential logic cell and end at the data input of a sequential logic cell. This is a stage path (register-to-register path or clock-to-D path) in a pipeline stage. The longest stage delay (clock-to-D delay) is 9.99 ns.
3. Paths that start at a sequential logic cell output and end at an output pad. This is an exit path (clock-to-output path) from the pipeline. The longest exit delay (clock-to-output delay) is 11.95 ns.

By pipelining the design we added three clock periods of latency, but we increased the estimated operating speed. The longest prelayout critical path is now an exit delay,
approximately 12 ns, more than doubling the maximum operating frequency. Next, we route the registered version of the design. The Actel software informs us that the postroute maximum stage delay is 11.3 ns (close to the preroute estimate of 9.99 ns). To check this figure we can perform another timing analysis. This time we shall measure the stage delays (the start points are all clock pins, and the end points are all inputs to sequential cells, in our case the D input to a D flip-flop). We need to define the sets of nodes at which to start and end the timing analysis (similar to the path clusters we used to specify timing constraints in logic synthesis). In the Actel timing analyzer we can use predefined sets 'clock' (flip-flop clock pins) and 'gated' (flip-flop inputs) as follows:

    timer> startset clock
    timer> endset gated
    timer> longest
    1st longest path to all endpins
    Rank  Total  Start pin       First Net  End Net       End pin
    0     11.3   a_r_ff_b2:CLK   a_r_2_     block_0_OUT1  sel_r_ff:D
    1      6.6   sel_r_ff:CLK    sel_r      DEF_NET_50    outp_ff_b0:D
    ... 8 similar lines omitted ...

We could try to reduce the long stage delay (11.3 ns), but we have already seen from the preroute timing estimates that an exit delay may be the critical path. Next, we check some other important timing parameters.
13.7.1 Hold Time

Hold-time problems can occur if there is clock skew between adjacent flip-flops, for example. We first need to check for the shortest exit delays using the same sets that we used to check stage delays,

    timer> shortest
    1st shortest path to all endpins
    Rank  Total  Start pin        First Net  End Net     End pin
    0     4.0    b_rr_ff_b1:CLK   b_rr_1_    DEF_NET_48  outp_ff_b1:D
    1     4.1    a_rr_ff_b2:CLK   a_rr_2_    DEF_NET_46  outp_ff_b2:D
    ... 8 similar lines omitted ...

The shortest path delay, 4 ns, is between the clock input of a D flip-flop with instance name b_rr_ff_b1 (call this X) and the D input of flip-flop instance name outp_ff_b1 (Y). Due to clock skew, the clock signal may not arrive at both flip-flops simultaneously. Suppose the clock arrives at flip-flop Y 3 ns later than at flip-flop X. The D input to flip-flop Y is then stable for only (4 - 3) = 1 ns after the clock edge at Y. To check for hold-time violations we thus need to find the clock skew corresponding to each clock-to-D path. This is tedious and normally timing-analysis tools check hold-time requirements automatically, but we shall show the steps to illustrate the process.
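The hold-time check described above amounts to a subtraction per path. The Python sketch below applies it to the two shortest paths reported by the timing analyzer. The function name, the 3 ns skew (the amount by which skew eats into the shortest path in the example above), and the 0.5 ns hold-time requirement are hypothetical values used only for illustration; they are not Actel data.

```python
# Shortest clock-to-D paths reported by the timing analyzer (ns).
SHORTEST_PATHS_NS = {
    ("b_rr_ff_b1:CLK", "outp_ff_b1:D"): 4.0,
    ("a_rr_ff_b2:CLK", "outp_ff_b2:D"): 4.1,
}


def hold_margin_ns(shortest_path_ns, skew_ns, t_hold_ns):
    """Margin left after clock skew and the hold-time requirement.

    The D input stays stable for (shortest_path_ns - skew_ns) after the
    capturing clock edge; a negative margin indicates a hold-time violation.
    """
    stable_ns = shortest_path_ns - skew_ns
    return stable_ns - t_hold_ns


SKEW_NS = 3.0    # hypothetical skew, as in the example in the text
T_HOLD_NS = 0.5  # hypothetical flip-flop hold-time requirement

for (start, end), delay in SHORTEST_PATHS_NS.items():
    margin = hold_margin_ns(delay, SKEW_NS, T_HOLD_NS)
    status = "OK" if margin >= 0 else "VIOLATION"
    print(f"{start} -> {end}: margin {margin:+.2f} ns {status}")
```

In a real flow the skew would be found per path from the clock-network delays, which is the subject of the next sections.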
13.7.2 Entry Delay

Before we can measure clock skew, we need to analyze the entry delays, including the clock tree. The synthesis tools automatically add I/O pads and the clock cells. This means that extra nodes are automatically added to the netlist with automatically generated names. The EDIF conversion tools may then modify these names. Before we can perform an analysis of entry delays and the clock network delay, we need to find the input node names. By looking for the EDIF 'rename' construct in the EDIF netlist we can associate the input and output node names in the behavioral Verilog model, comp_mux_rrr, and the EDIF names,

    piron% grep rename comp_mux_rrr_o.edn
    (port (rename a_2_ "a[2]") (direction INPUT))
    ... 8 similar lines renaming ports omitted ...
    (net (rename a_rr_0_ "a_rr[0]") (joined
    ... 9 similar lines renaming nets omitted ...
    piron%

Thus, for example, the EDIF conversion program has renamed input port a[2] to a_2_ because the design tools do not like the Verilog bus notation using square brackets. Next we find the connections between the ports and the added I/O cells by looking for 'PAD' in the Actel format netlist, which indicates a connection to a pad and the pins of the chip, as follows:

    piron% grep PAD comp_mux_rrr_o.adl
    NET DEF_NET_148; outp_2_, OUTBUF_31:PAD.
    NET DEF_NET_151; outp_1_, OUTBUF_32:PAD.
    NET DEF_NET_154; outp_0_, OUTBUF_33:PAD.
    NET DEF_NET_127; a_2_, INBUF_24:PAD.
    NET DEF_NET_130; a_1_, INBUF_25:PAD.
    NET DEF_NET_133; a_0_, INBUF_26:PAD.
    NET DEF_NET_136; b_2_, INBUF_27:PAD.
    NET DEF_NET_139; b_1_, INBUF_28:PAD.
    NET DEF_NET_142; b_0_, INBUF_29:PAD.
    NET DEF_NET_145; clock, CLKBUF_30:PAD.
    piron%

This tells us, for example, that the node we called clock in our behavioral model has been joined to a node (with automatically generated name) called CLKBUF_30:PAD, using a net (connection) named DEF_NET_145 (again automatically generated). This net is the connection between the node clock that is dangling in the behavioral model and the clock-buffer pad cell that the synthesis tools automatically added.
13.7.3 Exit Delay

We now know that the clock-pad input is CLKBUF_30:PAD, so we can find the exit delays (the longest path between clock-pad input and an output) as follows (using the clock-pad input as the start set):

    timer> startset clockpad
    Working startset 'clockpad' contains 0 pins.
    timer> addstart CLKBUF_30:PAD
    Working startset 'clockpad' contains 2 pins.

I shall explain why this set contains two pins and not just one presently. Next, we define the end set and trace the longest exit paths as follows:

    timer> endset outpad
    Working endset 'outpad' contains 3 pins.
    timer> longest
    1st longest path to all endpins
    Rank  Total  Start pin          First Net    End Net      End pin
    0     16.1   CLKBUF_30/U0:PAD   DEF_NET_144  DEF_NET_154  OUTBUF_33:PAD
    1     16.0   CLKBUF_30/U0:PAD   DEF_NET_144  DEF_NET_151  OUTBUF_32:PAD
    2     16.0   CLKBUF_30/U0:PAD   DEF_NET_144  DEF_NET_148  OUTBUF_31:PAD
    3 pins

This tells us we have three paths from the clock-pad input to the three output pins (outp[0], outp[1], and outp[2]). We can examine the longest exit delay in more detail as follows:

    timer> expand 0
    1st longest path to OUTBUF_33:PAD (rising) (Rank: 0)
    Total  Delay  Typ  Load  Macro     Start pin          Net name
    16.1   3.7    Tpd  0     OUTBUF    OUTBUF_33:D        DEF_NET_154
    12.4   4.5    Tpd  1     DF1       outp_ff_b0:CLK     DEF_NET_1530
     7.9   7.9    Tpd  16    CLKEXT_0  CLKBUF_30/U0:PAD   DEF_NET_144

The input-to-clock delay, $t_{IC}$, due to the clock-buffer cell (or macro) CLKEXT_0, instance name CLKBUF_30/U0, is 7.9 ns. The clock-to-Q delay, $t_{CQ}$, of flip-flop cell DF1, instance name outp_ff_b0, is 4.5 ns. The delay, $t_{QO}$, due to the output buffer cell OUTBUF, instance name OUTBUF_33, is 3.7 ns. The longest path between clock-pad input and the output, $t_{CO}$, is thus

$t_{CO} = t_{IC} + t_{CQ} + t_{QO}$ = 16.1 ns . (13.23)
This is the critical path and limits the operating frequency to (1 / 16.1 ns) ≈ 62 MHz. When we created a start set using CLKBUF_30:PAD, the timing analyzer told us that this set consisted of two pins. We can list the names of the two pins as follows:

    timer> showset clockpad
    Pin name           Net name     Macro name
    CLKBUF_30/U0:PAD   <no net>     CLKEXT_0
    CLKBUF_30/U1:PAD   DEF_NET_145  CLKTRI_0
    2 pins

The clock-buffer instance name, CLKBUF_30/U0, is hierarchical (with a '/' hierarchy separator). This indicates that there is more than one instance inside the clock-buffer cell, CLKBUF_30. Instance CLKBUF_30/U0 is the input driver, instance CLKBUF_30/U1 is the output driver (which is disabled and unused in this case).
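The exit-delay arithmetic of Eq. 13.23 and the resulting frequency limit can be checked with a short Python sketch. The variable names are chosen here for illustration; the delays are the incremental values from the expanded path report.

```python
# Incremental delays along the longest exit path (ns), from the 'expand 0' report.
STAGES_NS = [
    ("CLKEXT_0 (t_IC, clock pad to flip-flop clock)", 7.9),
    ("DF1 (t_CQ, clock to Q)", 4.5),
    ("OUTBUF (t_QO, output buffer)", 3.7),
]

# Eq. 13.23: t_CO = t_IC + t_CQ + t_QO
t_co_ns = sum(delay for _, delay in STAGES_NS)

# A period of t_CO nanoseconds corresponds to 1000 / t_CO megahertz.
f_max_mhz = 1e3 / t_co_ns

print(f"t_CO = {t_co_ns:.1f} ns, f_max = {f_max_mhz:.0f} MHz")
```

The same conversion applied to the unpipelined estimate of about 30 ns gives the 33 MHz figure quoted at the start of this section.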
13.7.4 External Setup Time

Each of the six chip data inputs must satisfy the following set-up equation:

$t_{SU}$(external) > $t_{SU}$(internal) - (clock delay) + (data delay) (13.24)
(where both clock and data delays end at the same flip-flop instance). We find the clock delays in Eq. 13.24 using the clock input pin as the start set and the end set 'clock'. The timing analyzer tells us all 16 clock path delays are the same at 7.9 ns in our design, and the clock skew is thus zero. Actel's clock distribution system minimizes clock skew, but clock skew will not always be zero. From the discussion in Section 13.7.1, we see there is no possibility of internal hold-time violations with a clock skew of zero. Next, we find the data delays in Eq. 13.24 using a start set of all input pads and an end set of 'gated',

    timer> longest
    ... lines omitted ...
    1st longest path to all endpins
    Rank  Total  Start pin      First Net     End Net       End pin
    10    10.0   INBUF_26:PAD   DEF_NET_1320  DEF_NET_1320  a_r_ff_b0:D
    11     9.7   INBUF_28:PAD   DEF_NET_1380  DEF_NET_1380  b_r_ff_b1:D
    12     9.4   INBUF_25:PAD   DEF_NET_1290  DEF_NET_1290  a_r_ff_b1:D
    13     9.3   INBUF_27:PAD   DEF_NET_1350  DEF_NET_1350  b_r_ff_b2:D
    14     9.2   INBUF_29:PAD   DEF_NET_1410  DEF_NET_1410  b_r_ff_b0:D
    15     9.1   INBUF_24:PAD   DEF_NET_1260  DEF_NET_1260  a_r_ff_b2:D
    16 pins

We are only interested in the last six paths of this analysis (ranks 10 to 15) that describe the delays from each data input pad (a[0], a[1], a[2], b[0], b[1], b[2]) to the D input of a flip-flop. The maximum data delay, 10 ns, occurs on input buffer instance name INBUF_26 (pad 26); pin INBUF_26:PAD is node a_0_ in the EDIF file or input a[0] in our behavioral model. The six $t_{SU}$(external) equations corresponding to Eq. 13.24 may be reduced to the following worst-case relation:

$t_{SU}$(external)$_{max}$ > $t_{SU}$(internal) - 7.9 ns + max(9.1 ns, 10.0 ns)
                     > $t_{SU}$(internal) + 2.1 ns (13.25)
We calculated the clock and data delay terms in Eq. 13.24 separately, but timing analyzers can normally perform a single analysis as follows:

$$t_{SU}(\text{external})_{max} > t_{SU}(\text{internal}) - (\text{clock delay} - \text{data delay})_{min} \qquad (13.26)$$
Finally, we check that there is no external hold-time requirement. That is to say, we must check that $t_{SU}(\text{external})$ is never negative, or

$$t_{SU}(\text{external})_{min} > t_{SU}(\text{internal}) - (\text{clock delay} - \text{data delay})_{max} > 0$$
$$t_{SU}(\text{external})_{min} > t_{SU}(\text{internal}) + 1.2\ \text{ns} > 0 \qquad (13.27)$$
Since $t_{SU}(\text{internal})$ is always positive on Actel FPGAs, $t_{SU}(\text{external})_{min}$ is always positive for this design. In large ASICs, with large clock delays, it is possible to have external hold-time requirements on inputs. This is the reason that some FPGAs (Xilinx, for example) have programmable delay elements that deliberately increase the data delay and eliminate irksome external hold-time requirements.
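The arithmetic behind the worst-case and best-case external set-up relations can be checked with a few lines of Python. This is only a sketch: the delays are the ones reported by the timing analyzer above (7.9 ns for every clock path, 9.1 ns to 10.0 ns for the six data paths), while the function and variable names are invented for illustration.

```python
# Sketch: margin added to t_SU(internal) for each data input,
# i.e. -(clock delay) + (data delay), using the delays quoted in the text.
CLOCK_DELAY_NS = 7.9  # all 16 clock paths are equal (zero skew)

# Data delays (ns) from input pad to flip-flop D pin, keyed by end pin.
DATA_DELAYS_NS = {
    "a_r_ff_b0:D": 10.0,
    "b_r_ff_b1:D": 9.7,
    "a_r_ff_b1:D": 9.4,
    "b_r_ff_b2:D": 9.3,
    "b_r_ff_b0:D": 9.2,
    "a_r_ff_b2:D": 9.1,
}


def setup_margin_ns(clock_delay_ns, data_delay_ns):
    """Return the term added to t_SU(internal): data delay minus clock delay."""
    return round(data_delay_ns - clock_delay_ns, 1)


margins = {pin: setup_margin_ns(CLOCK_DELAY_NS, d) for pin, d in DATA_DELAYS_NS.items()}
worst_case_ns = max(margins.values())  # largest external set-up requirement
best_case_ns = min(margins.values())   # smallest; must stay positive (no hold time)

print(f"t_SU(external) max > t_SU(internal) + {worst_case_ns} ns")
print(f"t_SU(external) min > t_SU(internal) + {best_case_ns} ns")
```

Because the smallest margin is still positive, no external hold-time requirement arises as long as the internal set-up time is positive.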
13.8 Formal Verification

Using logic synthesis we move from a behavioral model to a structural model. How are we to know (other than by trusting the logic synthesizer) that the two representations are the same? We have already seen that we may have to alter the original reference model because the HDL acceptable to a synthesis tool is a subset of HDL acceptable to simulators. Formal verification can prove, in the mathematical sense, that two representations are equivalent. If they are not, the software can tell us why and how two representations differ.
13.8.1 An Example
We shall use the following VHDL entity with two architectures as an example (see footnote 1):

    entity Alarm is
      port (Clock, Key, Trip : in bit; Ring : out bit);
    end Alarm;

The following behavioral architecture is the reference model:

    architecture RTL of Alarm is
      type States is (Armed, Off, Ringing);
      signal State : States;
    begin
      process (Clock) begin
        if Clock = '1' and Clock'EVENT then
          case State is
            when Off => if Key = '1' then State <= Armed; end if;
            when Armed => if Key = '0' then State <= Off;
                          elsif Trip = '1' then State <= Ringing;
                          end if;
            when Ringing => if Key = '0' then State <= Off; end if;
          end case;
        end if;
      end process;
      Ring <= '1' when State = Ringing else '0';
    end RTL;

The following synthesized structural architecture is the derived model:

    library cells; use cells.all; -- ...contains logic cell models
    architecture Gates of Alarm is
      component Inverter port (i : in BIT; z : out BIT); end component;
      component NAnd2 port (a, b : in BIT; z : out BIT); end component;
      component NAnd3 port (a, b, c : in BIT; z : out BIT); end component;
      component DFF port (d, c : in BIT; q, qn : out BIT); end component;
      signal State, NextState : BIT_VECTOR(1 downto 0);
      signal s0, s1, s2, s3 : BIT;
    begin
      g2: Inverter port map ( i => State(0), z => s1 );
      g3: NAnd2 port map ( a => s1, b => State(1), z => s2 );
      g4: Inverter port map ( i => s2, z => Ring );
      g5: NAnd2 port map ( a => State(1), b => Key, z => s0 );
      g6: NAnd3 port map ( a => Trip, b => s1, c => Key, z => s3 );
      g7: NAnd2 port map ( a => s0, b => s3, z => NextState(1) );
      g8: Inverter port map ( i => Key, z => NextState(0) );
      state_ff_b0: DFF port map ( d => NextState(0), c => Clock, q => State(0), qn => open );
      state_ff_b1: DFF port map ( d => NextState(1), c => Clock, q => State(1), qn => open );
    end Gates;

To compare the reference and the derived models (two representations), formal verification performs the following steps: (1) the HDL is parsed, (2) a finite-state machine compiler extracts the states present in any sequential logic, (3) a proof generator automatically generates formulas to be proved, (4) the theorem prover attempts to prove the formulas. The results from the last step are as follows:

    formulas to be proved: 8
    formulas proved VALID: 8

By constructing and then proving formulas the software tells us that architecture RTL implies architecture Gates (implication is the default proof mechanism; we could also have asked if the architectures are exactly equivalent). Next, we shall explore what this means and how formal verification works.
13.8.2 Understanding Formal Verification

The formulas to be proved are generated in a separate file of proof statements:

    # axioms
    Let Axiom_ref = Axioms Of alarm-rtl
    Let Axiom_der = Axioms Of alarm-gates
    ProveNotAlwaysFalse (Axiom_ref)
    Prove (Axiom_ref => Axiom_der)
    # assertions
    Let Assert_ref = Asserts Of alarm-rtl
    Let Assert_der = Asserts Of alarm-gates
    Prove (Axiom_ref => (Assert_ref => Assert_der))
    # clocks
    Let ClockEvents_ref = Clocks Of alarm-rtl
    Let ClockEvents_der = Clocks Of alarm-gates
    Let Master__clock_event_ref = Value (master__clock'event Of alarm-rtl)
    Prove (Axiom_ref => (ClockEvents_ref <=> ClockEvents_der))
    # next state of memories
    Prove ((Axiom_ref And Master__clock_event_ref) => (Transition (state(1) Of alarm-rtl) <=> Transition (state_ff_b1.t Of alarm-gates)))
    Prove ((Axiom_ref And Master__clock_event_ref) => (Transition (state(0) Of alarm-rtl) <=> Transition (state_ff_b0.t Of alarm-gates)))
    # validity value of outbuses
    Prove (Axiom_ref => (Domain (ring Of alarm-rtl) <=> Domain (ring Of alarm-gates)))
    Prove (Axiom_ref => (Domain (ring Of alarm-rtl) => (Value (ring Of alarm-rtl) <=> Value (ring Of alarm-gates))))

Formal verification makes strict use of the terms axiom and assertion. An axiom is an explicit or implicit fact. For example, if a VHDL signal is declared to be type BIT, an implicit axiom is that this signal may only take the logic values '0' and '1'. An assertion is derived from a statement placed in the HDL code. For example, the following VHDL statement is an assertion:

    assert Key /= '1' or Trip /= '1' or NextState = Ringing
      report "Alarm on and tripped but not ringing";

A VHDL assert statement prints only if the condition is FALSE. We know from De Morgan's theorem that (A + B + C)' = A'B'C'. Thus, this statement checks for a burglar alarm that does not ring when it is on and we are burgled.

In the proof statements the symbol '=>' means implies. In logic calculus we write A ⇒ B to mean A implies B. The symbol '<=>' means equivalence, and this is stricter than implication. We write A ⇔ B to mean: A is equivalent to B. Table 13.13 shows the truth tables for these two logic operators.

TABLE 13.13 Implication and equivalence.

    A   B   A ⇒ B   A ⇔ B
    F   F     T       T
    F   T     T       F
    T   F     F       F
    T   T     T       T
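The truth tables for implication and equivalence, and the De Morgan identity used to read the assert statement, are easy to confirm by enumeration. The short Python sketch below does this; the function names are chosen for illustration only and have nothing to do with the verification software.

```python
from itertools import product


def implies(a, b):
    """Logical implication: false only when a is true and b is false."""
    return (not a) or b


def equivalent(a, b):
    """Logical equivalence: true when a and b have the same value."""
    return a == b


# Rows are (A, B, A implies B, A equivalent to B), in the order F F, F T, T F, T T.
truth_table = [
    (a, b, implies(a, b), equivalent(a, b))
    for a, b in product([False, True], repeat=2)
]

# De Morgan: (A + B + C)' = A'B'C' for every combination of inputs.
de_morgan_holds = all(
    (not (a or b or c)) == ((not a) and (not b) and (not c))
    for a, b, c in product([False, True], repeat=3)
)

for row in truth_table:
    print(" ".join("T" if value else "F" for value in row))
print("De Morgan holds:", de_morgan_holds)
```

Equivalence is the stricter operator: every row where equivalence is true is also a row where implication is true, but not the other way around.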
13.8.3 Adding an Assertion

If we include the assert statement from the previous section in architecture RTL and repeat formal verification, we get the following message from the FSM compiler:

    <E> Assertion may be violated
    SEVERITY: ERROR
    REPORT: Alarm on and tripped but not ringing
    FILE: .../alarm-rtl3.vhdl
    FSM: alarm-rtl3
    STATEMENT or DECLARATION: line8
    .../alarm-rtl3.vhdl (line 8)
    Context of the message is:
    (key And trip And memoryofdriver__state(0))

This message tells us that the assert statement that we included may be triggered under a certain condition: (key And trip And state(0)). The prefix 'memoryofdriver__' is used by the theorem prover to refer to the memory element used for state(0). The state 'off' in the reference model corresponds to state(0) in the encoding that the finite-state machine compiler has used (and also to state(0) in the derived model). From this message we can isolate the problem to the following case statement (the line numbers follow the original code in architecture RTL):

    case State is
      when Off => if Key = '1' then State <= Armed; end if;
      when Armed => if Key = '0' then State <= Off;
                    elsif Trip = '1' then State <= Ringing;
                    end if;
      when Ringing => if Key = '0' then State <= Off; end if;
    end case;

When we start in state Off and the two inputs are Trip = '1' and Key = '1', we go to state Armed, and not to state Ringing. On the subsequent clock cycle we will go to state Ringing, but only if Trip does not change. Since we have all seen Mission Impossible and the burglar who exits the top-secret computer room at the Pentagon at the exact moment the alarm is set, we know this is perfectly possible and the software is warning us of this fact. Continuing on, we get the following results from the theorem prover:

    Prove (Axiom_ref => (Assert_ref => Assert_der))
    Formula is NOT VALID
    But is VALID under Assert Context of alarm-rtl3

We included the assert statement in the reference model (architecture RTL) but not in the derived model (architecture Gates). Now we are really mixed up: The assertion statement in the reference model says one thing, but the case statement in the reference model describes another. The theorem prover retorts: The axioms of the reference model do not imply that the assertions of the reference model imply the assertions of the derived model. Translation: These two architectures differ in some way. However, if we assume that the assertion is true (despite what the case statement says) then the formula is true. The prover is also saying: Make up your mind, you cannot have it both ways. The prover goes on to explain the differences between the two representations:

    ***Difference is:
    (Not state(1) And key And state(0) And trip)
    There are 1 cubes and 4 literals in the complete equation
    ***Local Variable Assert_der is:
    Not key Or Not state(0) Or Not trip
    There are 3 cubes and 3 literals in the complete equation
    ***Local Variable Assert_ref is:
    1
    ***Local Variable Axiom_ref is:
    Not state(1) Or Not state(0)
    There are 2 cubes and 2 literals in the complete equation
    formulas to be proved: 8
    formulas proved VALID: 7
    formulas VALID under assert context of der.model: 1

Study these messages hard and you will see that the differences between the two models are consistent with our explanation.
13.8.4 Completing a Proof

To fix the problem we change the code as follows:

    ...
    case State is
      when Off => if Key = '1' then
                    if Trip = '1' then NextState <= Ringing;
                    else NextState <= Armed;
                    end if;
                  end if;
      when Armed => if Key = '0' then NextState <= Off;
                    elsif Trip = '1' then NextState <= Ringing;
                    end if;
      when Ringing => if Key = '0' then NextState <= Off; end if;
    end case;
    ...

This results in a minor change in the synthesized netlist,

    g2: Inverter port map ( i => State(0), z => s1 );
    g3: NAnd2 port map ( a => s1, b => State(1), z => s2 );
    g4: Inverter port map ( i => s2, z => Ring );
    g5: NAnd2 port map ( a => State(1), b => Key, z => s0 );
    g6: NAnd3 port map ( a => Trip, b => s1, c => Key, z => s3 );
    g7: NAnd2 port map ( a => s0, b => s3, z => NextState(1) );
    g8: Inverter port map ( i => Key, z => NextState(0) );
    state_ff_b0: DFF port map ( d => NextState(0), c => Clock, q => State(0), qn => open );
    state_ff_b1: DFF port map ( d => NextState(1), c => Clock, q => State(1), qn => open );

Repeating the formal verification confirms and formally proves that the derived model will operate correctly. Strictly, we say that the operation of the derived model is implied by the reference model.

1. By one of the architects of the Compass VFormal software, Erich Marschner.
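The assertion failure and its fix can be reproduced by brute force, because the alarm has only three states and two inputs. The Python sketch below is not how a theorem prover works (it enumerates cases rather than proving formulas), and the function names are invented for illustration. It models the next-state logic of the original and corrected case statements and lists every case where Key and Trip are both '1' but the next state is not Ringing.

```python
from itertools import product

STATES = ("Off", "Armed", "Ringing")


def next_state_original(state, key, trip):
    """Next-state function of the original case statement."""
    if state == "Off":
        return "Armed" if key else "Off"
    if state == "Armed":
        if not key:
            return "Off"
        return "Ringing" if trip else "Armed"
    # state == "Ringing"
    return "Ringing" if key else "Off"


def next_state_fixed(state, key, trip):
    """Corrected version: go straight to Ringing if tripped while arming."""
    if state == "Off" and key:
        return "Ringing" if trip else "Armed"
    return next_state_original(state, key, trip)


def assertion_violations(next_state):
    """Cases where the alarm is on and tripped but the next state is not Ringing."""
    return [
        (state, key, trip)
        for state, key, trip in product(STATES, (0, 1), (0, 1))
        if key and trip and next_state(state, key, trip) != "Ringing"
    ]


print("original:", assertion_violations(next_state_original))
print("fixed:   ", assertion_violations(next_state_fixed))
```

The single violation of the original model is the case in which we start in state Off with both inputs high, which matches the difference reported by the prover.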
13.9 Switch-Level Simulation

The switch-level simulator is a more detailed level of simulation than we have discussed so far. Figure 13.1 shows the circuit schematic of a true single-phase flip-flop using true single-phase clocking (TSPC). TSPC has been used in some full-custom ICs to attempt to save area and power.

FIGURE 13.1 A TSPC (true single-phase clock) flip-flop. (a) The schematic (all devices are W/L = 3/2) created using a Compass schematic-entry tool. (b) The switch-level simulation results (Compass MixSim). The parameter chargeDecayTime sets the time after which the simulator sets an undriven node to an invalid logic level (shown shaded).

In a CMOS logic cell every node is driven to a strong '1' or a strong '0'. This is not true in TSPC; some nodes are left floating, so we ask the switch-level simulator to model charge leakage or charge decay (normally we need not worry about this low-level device issue). Figure 13.1 shows the waveform results. After five clock cycles, or 100 ns, we set the charge decay time to 5 ns. We notice two things. First, some of the node waveforms have values that are between logic '0' and '1'. Second, there are shaded areas on some node waveforms that represent the fact that, during the period of time marked, the logic value of the node is unknown. We can see that initially, before t = 100 ns (while we neglect the effects of charge decay), the circuit functions as a flip-flop. After t = 100 ns (when we begin including the effects of charge decay), the simulator tells us that this circuit may not function correctly. It is unlikely that all the charge would leak from a node in 5 ns, but we could not stop the clock in a design that uses a TSPC flip-flop. In ASIC design we do not use dangerous techniques such as TSPC and therefore do not normally need to use switch-level simulation.

A switch-level simulator keeps track of voltage levels as well as logic levels, and it may do this in several ways. The simulator may use a large possible set of discrete values or the value of a node may be allowed to vary continuously.
13.10 Transistor-Level Simulation

Sometimes we need to simulate a logic circuit with more accuracy than provided by switch-level simulation. In this case we turn to simulators that can solve circuit equations exactly, given models for the nonlinear transistors, and predict the analog behavior of the node voltages and currents in continuous time. This type of transistor-level simulation or circuit-level simulation is costly in computer time. It is impossible to simulate more than a few hundred logic cells using a circuit-level simulator. Virtually all circuit-level simulators used for ASIC design are commercial versions of SPICE (or Spice, Simulation Program with Integrated Circuit Emphasis), developed at UC Berkeley.
FIGURE 13.2 Output buffer (OB.IN) schematic (created using Capilano's DesignWorks).
13.10.1 A PSpice Example

Figure 13.2 shows the schematic for the output section of a CMOS I/O buffer driving a 10 pF output capacitor representing an off-chip load. The PSpice input file that follows is called a deck (from the days of punched cards):
    OB September 5, 1996 17:27
    .TRAN/OP 1ns 20ns
    .PROBE
    cl output Ground 10pF
    VIN input Ground PWL(0us 5V 10ns 5V 12ns 0V 20ns 0V)
    VGround 0 Ground DC 0V
    Vdd +5V 0 DC 5V
    m1 output input Ground Ground NMOS W=100u L=2u
    m2 output input +5V +5V PMOS W=200u L=2u
    .model nmos nmos level=2 vto=0.78 tox=400e-10 nsub=8.0e15 xj=-0.15e-6
    + ld=0.20e-6 uo=650 ucrit=0.62e5 uexp=0.125 vmax=5.1e4 neff=4.0
    + delta=1.4 rsh=37 cgso=2.95e-10 cgdo=2.95e-10 cj=195e-6 cjsw=500e-12
    + mj=0.76 mjsw=0.30 pb=0.80
    .model pmos pmos level=2 vto=-0.8 tox=400e-10 nsub=6.0e15 xj=-0.05e-6
    + ld=0.20e-6 uo=255 ucrit=0.86e5 uexp=0.29 vmax=3.0e4 neff=2.65
    + delta=1 rsh=125 cgso=2.65e-10 cgdo=2.65e-10 cj=250e-6 cjsw=350e-12
    + mj=0.535 mjsw=0.34 pb=0.80
    .end

Figure 13.3 shows the input and output waveforms as well as the current flowing in the devices. We can quickly check our circuit simulation results as follows. The total charge transferred to the 10 pF load capacitor as it charges from 0 V to 5 V is 50 pC (equal to 5 V × 10 pF). This total charge should be very nearly equal to the integral of the drain current of the pull-up (p-channel) transistor, $I_L(m2)$. We can get a quick estimate of the integral of the current by approximating the area under the waveform for id(m2) in Figure 13.3 as a triangle: half the base (about 12 ns) multiplied by the height (about 8 mA), so that

$$\int_{10\,\text{ns}}^{22\,\text{ns}} I_L(m2)\,dt \approx 0.5\,(8\ \text{mA})(12\ \text{ns}) \approx 50\ \text{pC} = (5\ \text{V})(10\ \text{pF}) \qquad (13.28)$$

Notice that the two estimates for the transferred charge are equal.
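As a quick numerical sanity check of the charge estimate, the two quantities can be computed directly. The sketch below uses only the figures quoted in the text (10 pF, 5 V, a triangular current pulse about 8 mA high and 12 ns wide); the variable names are illustrative.

```python
# Charge delivered to the load capacitor, Q = C * V.
C_LOAD_F = 10e-12   # 10 pF off-chip load
V_SWING_V = 5.0     # output swings from 0 V to 5 V
q_capacitor = C_LOAD_F * V_SWING_V

# Triangle approximation to the integral of the pull-up drain current id(m2).
I_PEAK_A = 8e-3     # peak current, read from the waveform (approximate)
T_BASE_S = 12e-9    # width of the current pulse (approximate)
q_triangle = 0.5 * I_PEAK_A * T_BASE_S

relative_difference = abs(q_triangle - q_capacitor) / q_capacitor

print(f"Q from C*V:      {q_capacitor * 1e12:.0f} pC")
print(f"Q from triangle: {q_triangle * 1e12:.0f} pC")
print(f"difference:      {relative_difference:.0%}")
```

The triangle estimate comes out a few percent below the capacitor charge, which is as close as we can expect given that both the peak current and the pulse width are read off a plot.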
FIGURE 13.3 Output Buffer (OB.IN). (Top) The input and output voltage waveforms. (Bottom) The current flowing in the drains of the output devices.
Next, we can check the time derivative of the pull-up current. (We can also do this by using the Probe program and requesting a plot of did(m2); the symbol dn represents the time derivative of quantity n for Probe. The symbol id(m2) requests Probe to plot the drain current of m2.) The maximum derivative should be roughly equal to the maximum change of the drain current ($\Delta I_L(m2)$ = 8 mA) divided by the time taken for that change (about $\Delta t$ = 2 ns from Figure 13.3) or

$$\frac{|\Delta I_L(m2)|}{\Delta t} = \frac{8\ \text{mA}}{2\ \text{ns}} = 4 \times 10^{6}\ \text{A s}^{-1} \qquad (13.29)$$

The large time derivative of the device current, here 4 MA s$^{-1}$, causes problems in high-speed CMOS I/O. This sharp change in current must flow in the supply leads to the chip, and through the inductance associated with the bonding wires to the chip, which may be of the order of 10 nanohenrys. An electromotive force (emf), $V_P$, will be generated in the inductance as follows,

$$V_P = L\,\frac{dI}{dt} = 10\ \text{nH} \times (4 \times 10^{6}\ \text{A s}^{-1}) = 40\ \text{mV} \qquad (13.30)$$

The result is a glitch in the power supply voltage during the buffer output transient. This is known as supply bounce or ground bounce. To limit the amount of bounce we may do one of two things:

1. Limit the power supply lead inductance (minimize L)
2. Reduce the current pulse (minimize dI/dt)

We can work on the first solution by careful design of the packages and by using parallel bonding wires (inductors add in series, reduce in parallel).
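The supply-bounce estimate and the benefit of parallel bonding wires can be sketched numerically. The values are those in the text (8 mA in 2 ns, 10 nH per wire); the helper names are invented, and the parallel-wire formula assumes identical wires with no mutual inductance between them, which real packages only approximate.

```python
def supply_bounce_v(inductance_h, di_dt_a_per_s):
    """Voltage induced across a supply-lead inductance: V = L * dI/dt."""
    return inductance_h * di_dt_a_per_s


def parallel_inductance_h(inductance_h, n_wires):
    """Inductance of n identical wires in parallel (mutual coupling ignored)."""
    return inductance_h / n_wires


DELTA_I_A = 8e-3      # change in drain current
DELTA_T_S = 2e-9      # time taken for that change
L_BOND_WIRE_H = 10e-9  # order-of-magnitude bond-wire inductance

di_dt = DELTA_I_A / DELTA_T_S                    # about 4e6 A/s
bounce_one_wire = supply_bounce_v(L_BOND_WIRE_H, di_dt)
bounce_four_wires = supply_bounce_v(parallel_inductance_h(L_BOND_WIRE_H, 4), di_dt)

print(f"dI/dt = {di_dt:.1e} A/s")
print(f"bounce with 1 wire:  {bounce_one_wire * 1e3:.0f} mV")
print(f"bounce with 4 wires: {bounce_four_wires * 1e3:.0f} mV")
```

The example is for a single output buffer; when many outputs switch at once the current pulses add, which is why both remedies in the list above matter in practice.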
13.10.2 SPICE Models

Table 13.14 shows the SPICE parameters for the typical 0.5 µm CMOS process (0.6 µm drawn gate length), G5, that we used in Section 2.1. These LEVEL = 3 parameters may be used with Spice3, PSpice, and HSPICE (see also Table 2.1 and Figure 2.4).

TABLE 13.14 SPICE transistor model parameters (LEVEL = 3).

| SPICE parameter (note 1) | n-channel value | p-channel value (if different) | Units (note 2) | Explanation |
|---|---|---|---|---|
| CGBO | 4.0E-10 | 3.8E-10 | F m^-1 | Gate-bulk overlap capacitance (CGBoh, not CGBzero) |
| CGDO | 3.0E-10 | 2.4E-10 | F m^-1 | Gate-drain overlap capacitance (CGDoh, not CGDzero) |
| CGSO | 3.0E-10 | 2.4E-10 | F m^-1 | Gate-source overlap capacitance (CGSoh, not CGSzero) |
| CJ | 5.6E-4 | 9.3E-4 | F m^-2 | Junction area capacitance |
| CJSW | 5E-11 | 2.9E-10 | F m^-1 | Junction sidewall capacitance |
| DELTA | 0.7 | 0.29 | m | Narrow-width factor for adjusting threshold voltage |
| ETA | 3.7E-2 | 2.45E-2 | 1 | Static-feedback factor for adjusting threshold voltage |
| GAMMA | 0.6 | 0.47 | V^0.5 | Body-effect factor |
| KAPPA | 2.9E-2 | 8 | V^-1 | Saturation-field factor (channel-length modulation) |
| KP | 2E-4 | 4.9E-5 | A V^-2 | Intrinsic transconductance (µCox, not 0.5 µCox) |
| LD | 5E-8 | 3.5E-8 | m | Lateral diffusion into channel |
| LEVEL | 3 | | none | Empirical model |
| MJ | 0.56 | 0.47 | 1 | Junction area exponent |
| MJSW | 0.52 | 0.50 | 1 | Junction sidewall exponent |
| NFS | 6E11 | 6.5E11 | cm^-2 V^-1 | Fast surface-state density |
| NSUB | 1.4E17 | 8.5E16 | cm^-3 | Bulk surface doping |
| PB | 1 | 1 | V | Junction area contact potential |
| PHI | 0.7 | | V | Surface inversion potential |
| RSH | 2 | | Ω/square | Sheet resistance of source and drain |
| THETA | 0.27 | 0.29 | V^-1 | Mobility-degradation factor |
| TOX | 1E-8 | | m | Gate-oxide thickness |
| TPG | 1 | -1 | none | Type of polysilicon gate |
| U0 | 550 | 135 | cm^2 V^-1 s^-1 | Low-field bulk carrier mobility (Uzero, not Uoh) |
| XJ | 0.2E-6 | | m | Junction depth |
| VMAX | 2E5 | 2.5E5 | m s^-1 | Saturated carrier velocity |
| VTO | 0.65 | -0.92 | V | Zero-bias threshold voltage (VTzero, not VToh) |
There are several levels of the SPICE MOSFET models. The following is a simplified overview (a huge number of confusing variations, fixes, and options have been added to these models; see Meta Software's HSPICE User's Manual, Vol. II, for a comprehensive description [1996]):

1. LEVEL = 1 (Shichman-Hodges model) uses the simple square-law I_DS-V_DS relation we derived in Section 2.1 (Eqs. 2.9 and 2.12).
2. LEVEL = 2 (Grove-Frohman model) uses the 3/2 power equations that result if we include the variation of threshold voltage across the channel.
3. LEVEL = 3 (empirical model) uses empirical equations.
4. The UCB BSIM1 model (~1984, PSpice LEVEL = 4, HSPICE LEVEL = 13) focuses on modeling observed device data rather than on device physics. A commercial derivative (HSPICE LEVEL = 28) is widely used by ASIC vendors.
5. The UCB BSIM2 model (~1991, the commercial derivative is HSPICE LEVEL = 39) improves modeling of subthreshold conduction.
6. The UCB BSIM3 model (~1995, the commercial derivative is HSPICE LEVEL = 49) corrects potential nonphysical behavior of earlier models.

TABLE 13.15 PSpice parameters for process G5 (PSpice LEVEL = 4). (See footnote 3.)
    .MODEL NM1 NMOS LEVEL=4
    + VFB=-0.7, LVFB=-4E-2, WVFB=5E-2
    + PHI=0.84, LPHI=0, WPHI=0
    + K1=0.78, LK1=-8E-4, WK1=-5E-2
    + K2=2.7E-2, LK2=5E-2, WK2=-3E-2
    + ETA=-2E-3, LETA=2E-02, WETA=-5E-3
    + MUZ=600, DL=0.2, DW=0.5
    + U0=0.33, LU0=0.1, WU0=0.1
    + U1=3.3E-2, LU1=3E-2, WU1=-1E-2
    + X2MZ=9.7, LX2MZ=-6, WX2MZ=7
    + X2E=4.4E-4, LX2E=-3E-3, WX2E=9E-4
    + X3E=-5E-5, LX3E=-2E-3, WX3E=-1E-3
    + X2U0=-1E-2, LX2U0=-1E-3, WX2U0=5E-3
    + X2U1=-1E-3, LX2U1=1E-3, WX2U1=-7E-4
    + MUS=700, LMUS=-50, WMUS=7
    + X2MS=-6E-2, LX2MS=1, WX2MS=4
    + X3MS=9, LX3MS=2, WX3MS=6
    + X3U1=9E-3, LX3U1=2E-4, WX3U1=-5E-3
    + TOX=1E-2, TEMP=25, VDD=5
    + CGDO=3E-10, CGSO=3E-10, CGBO=4E-10
    + XPART=1
    + N0=1, LN0=0, WN0=0
    + NB=0, LNB=0, WNB=0
    + ND=0, LND=0, WND=0
    * n+ diffusion
    + RSH=2.1, CJ=3.5E-4, CJSW=2.9E-10
    + JS=1E-8, PB=0.8, PBSW=0.8
    + MJ=0.44, MJSW=0.26, WDF=0
    *, DS=0

    .MODEL PM1 PMOS LEVEL=4
    + VFB=-0.2, LVFB=4E-2, WVFB=-0.1
    + PHI=0.83, LPHI=0, WPHI=0
    + K1=0.35, LK1=-7E-02, WK1=0.2
    + K2=-4.5E-2, LK2=9E-3, WK2=4E-2
    + ETA=-1E-2, LETA=2E-2, WETA=-4E-4
    + MUZ=140, DL=0.2, DW=0.5
    + U0=0.2, LU0=6E-2, WU0=6E-2
    + U1=1E-2, LU1=1E-2, WU1=7E-4
    + X2MZ=7, LX2MZ=-2, WX2MZ=1
    + X2E=5E-5, LX2E=-1E-3, WX2E=-2E-4
    + X3E=8E-4, LX3E=-2E-4, WX3E=-1E-3
    + X2U0=9E-3, LX2U0=-2E-3, WX2U0=2E-3
    + X2U1=6E-4, LX2U1=5E-4, WX2U1=3E-4
    + MUS=150, LMUS=10, WMUS=4
    + X2MS=6, LX2MS=-0.7, WX2MS=2
    + X3MS=-1E-2, LX3MS=2, WX3MS=1
    + X3U1=-1E-3, LX3U1=-5E-4, WX3U1=1E-3
    + TOX=1E-2, TEMP=25, VDD=5
    + CGDO=2.4E-10, CGSO=2.4E-10, CGBO=3.8E-10
    + XPART=1
    + N0=1, LN0=0, WN0=0
    + NB=0, LNB=0, WNB=0
    + ND=0, LND=0, WND=0
    * p+ diffusion
    + RSH=2, CJ=9.5E-4, CJSW=2.5E-10
    + JS=1E-8, PB=0.85, PBSW=0.85
    + MJ=0.44, MJSW=0.24, WDF=0
    *, DS=0
Table 13.15 shows the BSIM1 parameters (in the PSpice LEVEL = 4 format) for the G5 process. The Berkeley short-channel IGFET model (BSIM) family models capacitance in terms of charge. In Sections 2.1 and 3.2 we treated the gate-drain capacitance, $C_{GD}$, for example, as if it were a reciprocal capacitance, and could be written assuming there was charge associated with the gate, $Q_G$, and the drain, $Q_D$, as follows:

$$C_{GD} = -\frac{\partial Q_G}{\partial V_D} = C_{DG} = -\frac{\partial Q_D}{\partial V_G} \qquad (13.31)$$

Equation 13.31 (the Meyer model) would be true if the gate and drain formed a parallel plate capacitor and $Q_G = -Q_D$, but they do not. In general, $Q_G \neq -Q_D$ and Eq. 13.31 is not true. In an MOS transistor we have four regions of charge: $Q_G$ (gate), $Q_D$ (channel charge associated with the drain), $Q_S$ (channel charge associated with the source), and $Q_B$ (charge in the bulk depletion region). These charges are not independent, since

$$Q_G + Q_D + Q_S + Q_B = 0 \qquad (13.32)$$
We can form a 4 × 4 matrix, M, whose entries are $\partial Q_i / \partial V_j$, where $V_j = V_G, V_S, V_D,$ and $V_B$. Then $C_{ii} = M_{ii}$ are the terminal capacitances; and $C_{ij} = -M_{ij}$, where $i \neq j$, is a transcapacitance. Equation 13.32 forces the sum of each column of M to be zero. Since the charges depend on voltage differences, there are only three independent voltages ($V_{GB}$, $V_{DB}$, and $V_{SB}$, for example) and each row of M must sum to zero. Thus, we have nine (= 16 − 7) independent entries in the matrix M. In general, $C_{ij}$ is not necessarily equal to $C_{ji}$. For example, using PSpice and a LEVEL = 4 BSIM model, there are nine independent partial derivatives, printed as follows:

    Derivatives of gate (dQg/dVxy) and bulk (dQb/dVxy) charges
    DQGDVGB  1.04E-14   DQGDVDB -1.99E-15   DQGDVSB -7.33E-15
    DQDDVGB -1.99E-15   DQDDVDB  1.99E-15   DQDDVSB  0.00E+00
    DQBDVGB -7.51E-16   DQBDVDB  0.00E+00   DQBDVSB -2.72E-15

From these derivatives we may compute six nonreciprocal capacitances:

$$\begin{aligned}
C_{GB} &= \partial Q_G/\partial V_{GB} + \partial Q_G/\partial V_{DB} + \partial Q_G/\partial V_{SB} \\
C_{BG} &= -\partial Q_B/\partial V_{GB} \\
C_{GS} &= -\partial Q_G/\partial V_{SB} \\
C_{SG} &= \partial Q_G/\partial V_{GB} + \partial Q_B/\partial V_{GB} + \partial Q_D/\partial V_{GB} \\
C_{GD} &= -\partial Q_G/\partial V_{DB} \\
C_{DG} &= -\partial Q_D/\partial V_{GB}
\end{aligned} \qquad (13.33)$$

and three terminal capacitances:

$$\begin{aligned}
C_{GG} &= \partial Q_G/\partial V_{GB} \\
C_{DD} &= \partial Q_D/\partial V_{DB} \\
C_{SS} &= -(\partial Q_G/\partial V_{SB} + \partial Q_B/\partial V_{SB} + \partial Q_D/\partial V_{SB})
\end{aligned} \qquad (13.34)$$

Nonreciprocal transistor capacitances cast a cloud over our analysis of gate capacitance in Section 3.2, but the error we made in neglecting this effect is small compared to the approximations we made in the sections that followed. Even though we now find the theoretical analysis was simplified, the conclusions in our treatment of logical effort and delay modeling are still sound. Sections 7.3 and 9.2 in the book on transistor modeling by Tsividis [1987] describe nonreciprocal capacitance in detail. Pages 15-42 to 15-44 in Vol. II of Meta Software's HSPICE User's Manual [1996] also give an explanation of transcapacitance.

1. Meta Software's HSPICE User's Manual [1996], p. 15-36 and pp. 16-13 to 16-15, explains these parameters.
2. Note that m or M both represent milli or 10^-3 in SPICE, not mega or 10^6 (u or U = micro or 10^-6 and so on).
3. PSpice LEVEL = 4 is almost exactly equivalent to the UCB BSIM1 model, and closely equivalent to the HSPICE LEVEL = 13 model (see Table 14-1 and pp. 16-86 to 16-89 in Meta Software's HSPICE User's Manual [1996]).
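To see the nonreciprocity in numbers, the nine printed derivatives can be combined as described in Section 13.10.2. The Python sketch below does so under one stated assumption: a transcapacitance $C_{ij}$ (with $i \neq j$) is taken as the negative of $\partial Q_i/\partial V_j$, and the source-charge terms follow from charge conservation (Eq. 13.32). The dictionary layout and names are illustrative only.

```python
# Partial derivatives dQ_x/dV_yB printed by PSpice (in farads),
# keyed by (charge, voltage): for example ("G", "DB") is DQGDVDB.
dq = {
    ("G", "GB"): 1.04e-14, ("G", "DB"): -1.99e-15, ("G", "SB"): -7.33e-15,
    ("D", "GB"): -1.99e-15, ("D", "DB"): 1.99e-15, ("D", "SB"): 0.0,
    ("B", "GB"): -7.51e-16, ("B", "DB"): 0.0, ("B", "SB"): -2.72e-15,
}

# Six nonreciprocal capacitances (assumed sign convention: C_ij = -dQ_i/dV_j).
c_gb = dq["G", "GB"] + dq["G", "DB"] + dq["G", "SB"]
c_bg = -dq["B", "GB"]
c_gs = -dq["G", "SB"]
c_sg = dq["G", "GB"] + dq["B", "GB"] + dq["D", "GB"]
c_gd = -dq["G", "DB"]
c_dg = -dq["D", "GB"]

# Three terminal capacitances.
c_gg = dq["G", "GB"]
c_dd = dq["D", "DB"]
c_ss = -(dq["G", "SB"] + dq["B", "SB"] + dq["D", "SB"])

for name, value in [("C_GB", c_gb), ("C_BG", c_bg), ("C_GS", c_gs),
                    ("C_SG", c_sg), ("C_GD", c_gd), ("C_DG", c_dg),
                    ("C_GG", c_gg), ("C_DD", c_dd), ("C_SS", c_ss)]:
    print(f"{name} = {value * 1e15:.3f} fF")
```

With these particular values the gate-drain pair happens to be reciprocal, while the gate-bulk and gate-source pairs are not, which is the point of the discussion above.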
13.11 Summary
We discussed the following types of simulation (from high level to low level):
- Behavioral simulation includes no timing information and can tell you only if your design will not work.
- Prelayout simulation of a structural model can give you estimates of performance, but finding a critical path is difficult because you need to construct input vectors to exercise the model.
- Static timing analysis is the most widely used form of simulation. It is convenient because you do not need to create input vectors. Its limitations are that it can produce false paths: critical paths that may never be activated.
- Formal verification is a powerful adjunct to simulation to compare two different representations and formally prove if they are equal. It cannot prove your design will work.
- Switch-level simulation is required to check the behavior of circuits that may not always have nodes that are driven or that use logic that is not complementary.
- Transistor-level simulation is used when you need to know the analog, rather than the digital, behavior of circuit voltages.
There is a trade-off in accuracy against run time. The high-level simulators are fast but are less accurate.
TEST
ASICs are tested at two stages during manufacture using production tests . First, the silicon die are tested after fabrication is complete at wafer test or wafer sort . Each wafer is tested, one die at a time, using an array of probes on a probe card that descend onto the bonding pads of a single die. The production tester applies signals generated by a test program and measures the ASIC test response . A test program often generates hundreds of thousands of different test vectors applied at a frequency of several megahertz over several hundred milliseconds. Chips that fail are automatically marked with an ink spot. Production testers are large machines that take up their own room and are very expensive (typically well over $1 million). Either the customer, or the ASIC manufacturer, or both, develops the test program. A diamond saw separates the die, and the good die are bonded to a lead carrier and packaged. A second, final test is carried out on the packaged ASIC (usually with the same test vectors used at wafer sort) before the ASIC is shipped to the customer. The customer may apply a goods-inward test to incoming ASICs if the customer has the resources and the product volume is large enough. Normally, though, parts are directly assembled onto a bare printed-circuit board ( PCB or board ) and then the board is tested. If the board test shows that an ASIC is bad at this point, it is difficult to replace a surface-mounted component soldered on the board, for example. If there are several board failures due to a particular ASIC, the board manufacturer typically ships the defective chips back to the ASIC vendor. ASIC vendors have sophisticated failure analysis departments that take packaged ASICs apart and can often determine the failure mechanism. If the ASIC production tests are adequate, failures are often due to the soldering process, electrostatic damage during handling, or other problems that can occur between the part being shipped and board test. 
If the problem is traced to defective ASIC fabrication, this indicates that the test program may be inadequate.
As we shall see, failure and diagnosis at the board level is very expensive. Finally, ASICs may be tested and replaced (usually by swapping boards) either by a customer who buys the final product or by servicing; this is field repair. Such system-level diagnosis and repair is even more expensive.

Programmable ASICs (including FPGAs) are a special case. Each programmable ASIC is tested to the point that the manufacturer can guarantee with a high degree of confidence that if your design works, and if you program the FPGA correctly, then your ASIC will work. Production testing is easier for some programmable ASIC architectures than others. In a reprogrammable technology the manufacturer can test the programming features. This cannot be done for a one-time programmable antifuse technology, for example. A programmable ASIC is still tested in a similar fashion to any other ASIC and you are still paying for test development and design. Programmable ASICs also have similar test, defect, and manufacturing problems to other members of the ASIC family. Finally, once a programmable ASIC is soldered to a board and part of a system, it looks just like any other ASIC. As you will see in the next section, considering board-level and system-level testing is a very important part of ASIC design.

14.1 The Importance of Test
14.2 Boundary-Scan Test
14.3 Faults
14.4 Fault Simulation
14.5 Automatic Test-Pattern Generation
14.6 Scan Test
14.7 Built-in Self-test
14.8 A Simple Test Example
14.9 The Viterbi Decoder Example
14.10 Summary
14.11 Problems
14.12 Bibliography
14.13 References
14.1 The Importance of Test

One measure of product quality is the defect level. If the ABC Company sells 100,000 copies of a product and 10 of these are defective, then we say the defect level is 0.01 percent or 100 ppm. The average quality level (AQL) is equal to one minus the defect level (ABC's AQL is thus 99.99 percent).

Suppose the semiconductor division of ABC makes an ASIC, the bASIC, for the PC division. The PC division buys 100,000 bASICs, tested by the semiconductor division, at $10 each. The PC division includes one surface-mounted bASIC on each PC motherboard it assembles for the aPC computer division. The aPC division tests the finished motherboards. Rejected boards due to defective bASICs incur an average $200 board repair cost. The board repair cost as a function of the ASIC defect level is shown in Table 14.1. A defect level of 5 percent in bASICs costs $1 million in board repair costs (the same as the total ASIC part cost). Things are even worse at the system level, however.

TABLE 14.1 Defect levels in printed-circuit boards (PCB). (See note 1.)

ASIC defect level    Defective ASICs    Total PCB repair cost
5%                   5000               $1 million
1%                   1000               $200,000
0.1%                 100                $20,000
0.01%                10                 $2,000

Suppose the ABC Company sells its aPC computers for $5,000, with a profit of $500 on each. Unfortunately the aPC division also has a defect level. Suppose that 10 percent of the motherboards that contain defective bASICs that passed the chip test also manage to pass the board tests (10 percent may seem high, but chips that have hard-to-test faults at the chip level may be very hard to find at the board level; catching 90 percent of these rogue chips would be considered good). The system-level repair cost as a function of the bASIC defect level is shown in Table 14.2. In this example a 5 percent defect level in a $10 bASIC part now results in a $5 million cost at the system level. From Table 14.2 we can see it would be worth spending $4 million (i.e., $5 million − $1 million) to reduce the bASIC defect level from 5 percent to 1 percent.

TABLE 14.2 Defect levels in systems. (See note 2.)

ASIC defect level    Defective ASICs    Defective boards    Total repair cost at system level
5%                   5000               500                 $5 million
1%                   1000               100                 $1 million
0.1%                 100                10                  $100,000
0.01%                10                 1                   $10,000

1. Assumptions: The number of parts shipped is 100,000; part price is $10; total part cost is $1 million; the cost of a fault in an assembled PCB is $200.
2. Assumptions: The number of systems shipped is 100,000; system cost is $5,000; total cost of systems shipped is $500 million; the cost of repairing or replacing a system due to failure is $10,000; profit on 100,000 systems is $50 million.
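The arithmetic behind Tables 14.1 and 14.2 can be checked with a short script. This is only a sketch in Python (chosen here for quick execution; it is not part of the book's VHDL models). The constants come from the footnoted assumptions and the 10 percent board-test escape rate in the text; the function and variable names are made up for illustration.

```python
# Figures taken from the notes to Tables 14.1 and 14.2.
PARTS_SHIPPED = 100_000       # bASICs bought by the PC division
BOARD_REPAIR_COST = 200       # dollars per rejected board
SYSTEM_REPAIR_COST = 10_000   # dollars per failed system
BOARD_TEST_ESCAPE = 0.10      # fraction of bad boards that still pass board test


def repair_costs(defect_level):
    """Return (defective ASICs, board repair cost, escaped boards, system repair cost)."""
    defective_asics = round(PARTS_SHIPPED * defect_level)
    board_cost = defective_asics * BOARD_REPAIR_COST
    escaped_boards = round(defective_asics * BOARD_TEST_ESCAPE)
    system_cost = escaped_boards * SYSTEM_REPAIR_COST
    return defective_asics, board_cost, escaped_boards, system_cost


if __name__ == "__main__":
    for level in (0.05, 0.01, 0.001, 0.0001):
        asics, board_cost, boards, system_cost = repair_costs(level)
        print(f"{level:8.2%} {asics:6d} ${board_cost:>10,} {boards:5d} ${system_cost:>10,}")
```

Each printed row should match one row of the two tables: the board cost is the Table 14.1 column and the system cost is the Table 14.2 column.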
14.2 Boundary-Scan Test


It is possible to test ICs in dual-in-line packages (DIPs) with 0.1 inch (2.5 mm) lead spacing on low-density boards using a bed-of-nails tester with probes that contact test points underneath the board. Mechanical testing becomes difficult with board trace widths and separations below 0.1 mm or 100 µm, package-pin separations of 0.3 mm or less, packages with 200 or more pins, surface-mount packages on both sides of the board, and multilayer boards [Scheiber, 1995].

In 1985 a group of European manufacturers formed the Joint European Test Action Group (JETAG) to study board testing. With the addition of North American companies, JETAG became the Joint Test Action Group (JTAG) in 1986. The JTAG 2.0 test standard formed the basis of the IEEE Standard 1149.1 Test Port and Boundary-Scan Architecture [IEEE 1149.1b, 1994], approved in February 1990 and also approved as a standard by the American National Standards Institute (ANSI) in August 1990 [Bleeker, v. d. Eijnden, and de Jong, 1993; Maunder and Tulloss, 1990; Parker, 1992]. The IEEE standard is still often referred to as JTAG, although there are important differences between the last JTAG specification (version 2.0) and the IEEE 1149.1 standard.

Boundary-scan test (BST) is a method for testing boards using a four-wire interface (five wires with an optional master reset signal). A good analogy would be the RS-232 interface for PCs. The BST standard interface was designed to test boards, but it is also useful to test ASICs. The BST interface provides a standard means of communicating with test circuits on-board an ASIC. We do need to include extra circuits on an ASIC in order to use BST. This is an example of increasing the cost and complexity (as well as potentially reducing the performance) of an ASIC to reduce the cost of testing the ASIC and the system.
FIGURE 14.1 IEEE 1149.1 boundary scan. (a) Boundary scan is intended to check for shorts or opens between ICs mounted on a board. (b) Shorts and opens may also occur inside the IC package. (c) The boundary-scan architecture is a long chain of shift registers allowing data to be sent over all the connections between the ICs on a board.

Figure 14.1(a) illustrates failures that may occur on a PCB due to shorts or opens in the copper traces on the board. Less frequently, failures in the ASIC package may also arise from shorts and opens in the wire bonds between the die and the package frame (Figure 14.1b). Failures in an ASIC package that occur during ASIC fabrication are caught by the ASIC production test, but stress during automated handling and board assembly may cause package failures. Figure 14.1(c) shows how a group of ASICs are linked together in boundary-scan testing. To detect the failures shown in Figure 14.1(a) or (b) manufacturers use boundary scan to test every connection between ASICs on a board. During boundary scan, test data is loaded into each ASIC and then driven onto the board traces. Each ASIC monitors its inputs, captures the data received, and then shifts the captured data out. Any defects in the board or ASIC connections will show up as a discrepancy between expected and actual measured continuity data.

In order to include BST on an ASIC, we add a special logic cell to each ASIC I/O pad. These cells are joined together to form a chain and create a boundary-scan shift register that extends around each ASIC. The input to a boundary-scan shift register is the test-data input (TDI). The output of a boundary-scan shift register is the test-data output (TDO). These boundary-scan shift registers are then linked in a serial fashion with the boundary-scan shift registers on other ASICs to form one long boundary-scan shift register.
The boundary-scan shift register in each ASIC is one of several test-data registers (TDR) that may be included in each ASIC. All the TDRs in an ASIC are connected directly between the TDI and TDO ports. A special register that decodes instructions provides a way to select a particular TDR and control operation of the boundary-scan test process. Controlling all of the operations involved in selecting registers, loading data, performing a test, and shifting out results are the test clock (TCK) and test-mode select (TMS). The boundary-scan standard specifies a four-wire test interface using the four signals: TDI, TDO, TCK, and TMS. These four dedicated signals, the test-access port (TAP), are connected to the TAP controller inside each ASIC. The TAP controller is a state machine clocked on the rising edge of TCK, and with state transitions controlled by the TMS signal. The test-reset input signal (TRST* or nTRST; always an active-low signal) is an optional (fifth) dedicated interface pin to reset the TAP controller.

Normally the boundary-scan shift-register cells at each ASIC I/O pad are transparent, allowing signals to pass between the I/O pad and the core logic. When an ASIC is put into boundary-scan test mode, we first tell the TAP controller which TDR to select. The TAP controller then tells each boundary-scan shift register in the appropriate TDR either to capture input data, to shift data to the neighboring cell, or to output data.

There are many acronyms in the IEEE 1149.1 standard (referred to as "dot one"); Table 14.3 provides a list of the most common terms.

TABLE 14.3 Boundary-scan terminology.

Acronym           Meaning                           Explanation
BR                Bypass register                   A TDR, directly connects TDI and TDO, bypassing BSR
BSC               Boundary-scan cell                Each I/O pad has a BSC to monitor signals
BSR               Boundary-scan register            A TDR, a shift register formed from a chain of BSCs
BST               Boundary-scan test                Not to be confused with BIST (built-in self-test)
IDCODE            Device-identification register    Optional TDR, contains manufacturer and part number
IR                Instruction register              Holds a BST instruction, provides control signals
JTAG              Joint Test Action Group           The organization that developed boundary scan
TAP               Test-access port                  Four- (or five-)wire test interface to an ASIC
TCK               Test clock                        A TAP wire, the clock that controls BST operation
TDI               Test-data input                   A TAP wire, the input to the IR and TDRs
TDO               Test-data output                  A TAP wire, the output from the IR and TDRs
TDR               Test-data register                Group of BST registers: IDCODE, BR, BSR
TMS               Test-mode select                  A TAP wire, together with TCK controls the BST state
TRST* or nTRST    Test-reset input signal           Optional TAP wire, resets the TAP controller (active-low)
14.2.1 BST Cells

Figure 14.2 shows a data-register cell ( DR cell ) that may be used to implement any of the TDRs. The most common DR cell is a boundary-scan cell ( BS cell , or BSC ), or boundary-register cell (this last name is not abbreviated to BR cell, since this term is reserved for another type of cell) [ IEEE 1149.1b-1994, p. 10-18, Fig. 10-16]. A BSC contains two sequential elements. The capture flip-flop or capture register is part of a shift register formed by series connection of BSCs. The update flip-flop , or update latch , is normally drawn as an edge-triggered D flip-flop, though it may be a transparent latch. The inputs to a BSC are: scan in ( serial in or SI ); data in ( parallel in or PI ); and a control signal, mode (also called test / normal ). The BSC outputs are: scan out ( serial out or SO ); data out ( parallel out or PO ). The BSC in Figure 14.2 is reversible and can be used for both chip inputs and outputs. Thus data_in may be connected to a pad and data_out to the core logic or vice versa.
entity DR_cell is
  port (mode, data_in, shiftDR, scan_in, clockDR, updateDR : BIT;
        data_out, scan_out : out BIT);
end DR_cell;

architecture behave of DR_cell is
  signal q1, q2 : BIT;
begin
  -- capture flip-flop: loads data_in, or scan_in when shifting
  CAP : process (clockDR) begin
    if clockDR = '1' then
      if shiftDR = '0' then q1 <= data_in; else q1 <= scan_in; end if;
    end if;
  end process;
  -- update flip-flop: holds the value driven in test mode
  UPD : process (updateDR) begin
    if updateDR = '1' then q2 <= q1; end if;
  end process;
  data_out <= data_in when mode = '0' else q2;
  scan_out <= q1;
end behave;

FIGURE 14.2 A DR (data register) cell. The most common use of this cell is as a boundary-scan cell (BSC).

The IEEE 1149.1 standard shows the sequential logic in a BSC controlled by the gated clocks: clockDR (whose positive edge occurs at the positive edge of TCK) and updateDR (whose positive edge occurs at the negative edge of TCK). The IEEE 1149.1 schematics illustrate the standard but do not define how circuits should be implemented. The function of the circuit in Figure 14.2 (and its model) follows the IEEE 1149.1 standard and many other published schematics, but this is not necessarily the best, or even a safe, implementation. For example, as drawn here, signals clockDR and updateDR are gated clocks, which are normally to be avoided if possible. The update sequential element is shown as an edge-triggered D flip-flop but may be implemented using a latch.

Figure 14.3 [IEEE 1149.1b-1994, Chapter 9] shows a bypass-register cell (BR cell). The BR inputs and outputs, scan in (serial in, SI) and scan out (serial out, SO), have the same names as the DR cell ports, but DR cells and BR cells are not directly connected.

entity BR_cell is
  port (clockDR, shiftDR, scan_in : BIT;
        scan_out : out BIT);
end BR_cell;

architecture behave of BR_cell is
  signal t1 : BIT;
begin
  t1 <= shiftDR and scan_in;
  process (clockDR) begin
    if (clockDR = '1') then scan_out <= t1; end if;
  end process;
end behave;

FIGURE 14.3 A BR (bypass register) cell.

Figure 14.4 shows an instruction-register cell (IR cell) [IEEE 1149.1b-1994, Chapter 6]. The IR cell inputs are: scan_in, data_in; as well as clock, shift, and update signals (with names and functions similar to those of the corresponding signals in the BSC). The reset signals are nTRST and reset_bar (active-low signals often use an asterisk, reset* for example, but this is not a legal VHDL name). The two LSBs of data_in must permanently be set to '01' (this helps in checking the integrity of the scan chain during testing). The remaining data_in bits are status bits under the control of the designer. The update sequential element (sometimes called the shadow register) in each IR cell may be set or reset (depending on reset_value). The IR cell outputs are: data_out (the instruction bit passed to the instruction decoder) and scan_out (the data passed to the next IR cell in the IR).
entity IR_cell is
  port (shiftIR, data_in, scan_in, clockIR, updateIR,
        reset_bar, nTRST, reset_value : BIT;
        data_out, scan_out : out BIT);
end IR_cell;

architecture behave of IR_cell is
  signal q1, SR : BIT;
begin
  scan_out <= q1;
  SR <= reset_bar and nTRST;
  -- capture flip-flop
  CAP : process (clockIR) begin
    if (clockIR = '1') then
      if (shiftIR = '0') then q1 <= data_in; else q1 <= scan_in; end if;
    end if;
  end process;
  -- update (shadow) flip-flop with asynchronous set/reset
  UPD : process (updateIR, SR) begin
    if (SR = '0') then data_out <= reset_value;
    elsif ((updateIR = '1') and updateIR'EVENT) then data_out <= q1;
    end if;
  end process;
end behave;

FIGURE 14.4 An IR (instruction register) cell.
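The capture, shift, and update behavior of the DR cell in Figure 14.2 can also be exercised with a small software model. The sketch below is a Python approximation written for illustration, not a translation tool output: the class and method names are invented here, each method call stands in for one rising edge of the corresponding gated clock, and timing is ignored.

```python
class DRCell:
    """Behavioral sketch of the DR cell of Figure 14.2 (names are illustrative)."""

    def __init__(self):
        self.q1 = 0  # capture flip-flop, part of the scan chain
        self.q2 = 0  # update flip-flop, drives data_out in test mode

    def clock_dr(self, data_in, scan_in, shift_dr):
        """Rising edge of clockDR: shift serial data or capture parallel data."""
        self.q1 = scan_in if shift_dr else data_in

    def update_dr(self):
        """Rising edge of updateDR: copy the capture stage to the update stage."""
        self.q2 = self.q1

    def scan_out(self):
        return self.q1

    def data_out(self, data_in, mode):
        """mode = 0: transparent (normal operation); mode = 1: drive the update stage."""
        return self.q2 if mode else data_in


if __name__ == "__main__":
    cell = DRCell()
    print(cell.data_out(data_in=1, mode=0))   # transparent: follows data_in
    cell.clock_dr(data_in=1, scan_in=0, shift_dr=0)
    cell.update_dr()
    print(cell.data_out(data_in=0, mode=1))   # test mode: drives the updated value
```

Note that shifting new data through q1 does not disturb data_out until the next update, which is the reason for having two sequential elements in each cell.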
14.2.2 BST Registers

Figure 14.5 shows a boundary-scan register ( BSR ), which consists of a series connection, or chain, of BSCs. The BSR surrounds the ASIC core logic and is connected to the I/O pad cells. The BSR monitors (and optionally controls) the inputs and outputs of an ASIC. The direction of information flow is shown by an arrow on each of the BSCs in Figure 14.5 . The control signal, mode , is decoded from the IR. Signal mode is drawn as common to all cells for the BSR in Figure 14.5 , but that is not always the case.

entity BSR is
  generic (width : INTEGER := 3);
  port (shiftDR, clockDR, updateDR, mode, scan_in : BIT;
        scan_out : out BIT;
        data_in : BIT_VECTOR(width-1 downto 0);
        data_out : out BIT_VECTOR(width-1 downto 0));
end BSR;

architecture structure of BSR is
  component DR_cell
    port (mode, data_in, shiftDR, scan_in, clockDR, updateDR : BIT;
          data_out, scan_out : out BIT);
  end component;
  for all : DR_cell use entity WORK.DR_cell(behave);
  signal int_scan : BIT_VECTOR (data_in'RANGE);
begin
  BSR : for i in data_in'LOW to data_in'HIGH generate
    -- cell 0 is nearest scan_out
    RIGHT : if (i = 0) generate
      BSR_LSB : DR_cell port map (mode, data_in(i), shiftDR,
        int_scan(i), clockDR, updateDR, data_out(i), scan_out);
    end generate;
    MIDDLE : if ((i > 0) and (i < data_in'HIGH)) generate
      BSR_i : DR_cell port map (mode, data_in(i), shiftDR,
        int_scan(i), clockDR, updateDR, data_out(i), int_scan(i-1));
    end generate;
    -- the MSB cell is nearest scan_in
    LEFT : if (i = data_in'HIGH) generate
      BSR_MSB : DR_cell port map (mode, data_in(i), shiftDR,
        scan_in, clockDR, updateDR, data_out(i), int_scan(i-1));
    end generate;
  end generate;
end structure;

FIGURE 14.5 A BSR (boundary-scan register). An example of the component data-register (DR) cells (used as boundary-scan cells) is shown in Figure 14.2.

Figure 14.6 shows an instruction register (IR), which consists of at least two IR cells connected in series. The IEEE 1149.1 standard specifies that the IR is reset to '00...01' (the optional IDCODE instruction). If there is no IDCODE TDR, then the IDCODE instruction defaults to the BYPASS instruction.
entity IR is
  generic (width : INTEGER := 4);
  port (shiftIR, clockIR, updateIR, reset_bar, nTRST, scan_in : BIT;
        scan_out : out BIT;
        data_in : BIT_VECTOR (width-1 downto 0);
        data_out : out BIT_VECTOR (width-1 downto 0));
end IR;

architecture structure of IR is
  component IR_cell
    port (shiftIR, data_in, scan_in, clockIR, updateIR,
          reset_bar, nTRST, reset_value : BIT;
          data_out, scan_out : out BIT);
  end component;
  for all : IR_cell use entity WORK.IR_cell(behave);
  signal int_scan : BIT_VECTOR (data_in'RANGE);
  signal Vdd : BIT := '1';
  signal GND : BIT := '0';
begin
  IRGEN : for i in data_in'LOW to data_in'HIGH generate
    -- the two LSB cells capture the fixed pattern '01'
    FIRST : if (i = 0) generate
      IR_LSB : IR_cell port map (shiftIR, Vdd, int_scan(i),
        clockIR, updateIR, reset_bar, nTRST, Vdd, data_out(i), scan_out);
    end generate;
    SECOND : if ((i = 1) and (data_in'HIGH > 1)) generate
      IR1 : IR_cell port map (shiftIR, GND, int_scan(i),
        clockIR, updateIR, reset_bar, nTRST, Vdd, data_out(i), int_scan(i-1));
    end generate;
    MIDDLE : if ((i < data_in'HIGH) and (i > 1)) generate
      IRi : IR_cell port map (shiftIR, data_in(i), int_scan(i),
        clockIR, updateIR, reset_bar, nTRST, Vdd, data_out(i), int_scan(i-1));
    end generate;
    LAST : if (i = data_in'HIGH) generate
      IR_MSB : IR_cell port map (shiftIR, data_in(i), scan_in,
        clockIR, updateIR, reset_bar, nTRST, Vdd, data_out(i), int_scan(i-1));
    end generate;
  end generate;
end structure;

FIGURE 14.6 An IR (instruction register).
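The order in which bits enter and leave the BSR of Figure 14.5 is easy to get backward, so a small model may help. The following Python sketch is an illustration only; the class, the helper functions, and the convention that list index 0 is the cell nearest scan_out are assumptions made here. It captures a 9-bit pad pattern and then shifts a new pattern in while the captured one comes out.

```python
def to_bits(text):
    """Convert an MSB-first string such as '010011010' to a list with index 0 = LSB."""
    return [int(ch) for ch in reversed(text)]


def to_text(bits):
    """Inverse of to_bits: list with index 0 = LSB back to an MSB-first string."""
    return "".join(str(b) for b in reversed(bits))


class BoundaryScanRegister:
    """Sketch of the capture and update stages of a chain of DR cells."""

    def __init__(self, width):
        self.capture = [0] * width  # index 0 is the cell nearest scan_out
        self.update = [0] * width

    def clock(self, data_in, scan_in, shift):
        """One clockDR edge; returns the bit that was on scan_out before the edge."""
        out = self.capture[0]
        if shift:
            # every cell takes its neighbor's value; the MSB cell takes scan_in
            self.capture = self.capture[1:] + [scan_in]
        else:
            self.capture = list(data_in)
        return out

    def do_update(self):
        """One updateDR edge: copy the capture stage to the update stage."""
        self.update = list(self.capture)


if __name__ == "__main__":
    bsr = BoundaryScanRegister(9)
    pads = to_bits("010011010")
    bsr.clock(pads, scan_in=0, shift=False)            # capture the pad values
    shifted_out = [bsr.clock(pads, int(b), shift=True) for b in "101111111"]
    bsr.do_update()
    print("out in time order:", "".join(str(b) for b in shifted_out))
    print("new BSR contents :", to_text(bsr.update))
```

The captured pattern leaves LSB first, so it appears reversed in time, and the first bit shifted in ends up in the cell nearest scan_out.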
14.2.3 Instruction Decoder

Table 14.4 shows an instruction decoder. This model is capable of decoding the following minimum set of boundary-scan instructions:

1. EXTEST, external test. Drives a known value onto each output pin to test connections between ASICs.
2. SAMPLE/PRELOAD (often abbreviated to SAMPLE). Performs two functions: first sampling the present input value from the input pad during capture; and then preloading the BSC update register output during update (in preparation for an EXTEST instruction, for example).
3. IDCODE. An optional instruction that allows the device-identification register (IDCODE) to be shifted out. The IDCODE TDR is an optional register that allows the tester to query the ASIC for the manufacturer's name, part number, and other data that is shifted out on TDO. IDCODE defaults to the BYPASS instruction if there is no IDCODE TDR.
4. BYPASS. Selects the single-cell bypass register (instead of the BSR) and allows data to be quickly shifted between ASICs.

The IEEE 1149.1 standard predefines additional optional instructions and also defines the implementation of custom instructions that may use additional TDRs.

TABLE 14.4 An IR (instruction register) decoder.
entity IR_decoder is
  generic (width : INTEGER := 4);
  port (shiftDR, clockDR, updateDR : BIT;
        IR_PO : BIT_VECTOR (width-1 downto 0);
        test_mode, selectBR, shiftBR, clockBR,
        shiftBSR, clockBSR, updateBSR : out BIT);
end IR_decoder;

architecture behave of IR_decoder is
  type INSTRUCTION is (EXTEST, SAMPLE_PRELOAD, IDCODE, BYPASS);
  signal I : INSTRUCTION;
begin
  -- only the two LSBs of the IR are decoded
  process (IR_PO) begin
    case BIT_VECTOR'( IR_PO(1), IR_PO(0) ) is
      when "00" => I <= EXTEST;
      when "01" => I <= SAMPLE_PRELOAD;
      when "10" => I <= IDCODE;
      when "11" => I <= BYPASS;
    end case;
  end process;
  test_mode <= '1' when I = EXTEST else '0';
  -- with no IDCODE TDR, IDCODE selects the bypass register
  selectBR  <= '1' when (I = BYPASS or I = IDCODE) else '0';
  shiftBR   <= shiftDR;
  clockBR   <= clockDR when (I = BYPASS or I = IDCODE) else '1';
  shiftBSR  <= shiftDR;
  clockBSR  <= clockDR when (I = EXTEST or I = SAMPLE_PRELOAD) else '1';
  updateBSR <= updateDR when (I = EXTEST or I = SAMPLE_PRELOAD) else '0';
end behave;
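The decoding rules in Table 14.4 reduce to a small lookup. As a rough cross-check, here is a Python sketch that mirrors the instruction encoding and the register selection of the VHDL model; the function name and the returned dictionary are illustrative choices, and the gated clock outputs (clockBR, clockBSR, updateBSR) are deliberately left out.

```python
# Encoding of the two IR LSBs, as in the case statement of Table 14.4.
INSTRUCTIONS = {
    "00": "EXTEST",
    "01": "SAMPLE_PRELOAD",
    "10": "IDCODE",
    "11": "BYPASS",
}


def decode(ir_po):
    """Decode an MSB-first IR string; only the two LSBs matter in this model."""
    instruction = INSTRUCTIONS[ir_po[-2:]]
    uses_bypass = instruction in ("BYPASS", "IDCODE")  # no IDCODE TDR in this model
    return {
        "instruction": instruction,
        "test_mode": int(instruction == "EXTEST"),
        "selectBR": int(uses_bypass),
        "selected_tdr": "BR" if uses_bypass else "BSR",
    }


if __name__ == "__main__":
    for code in ("00", "01", "10", "11"):
        print(code, decode(code))
```

Only EXTEST sets test_mode, so only EXTEST switches the BSC output multiplexers away from the core logic.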
14.2.4 TAP Controller

Figure 14.7 shows the TAP controller finite-state machine. The 16-state diagram contains some symmetry: states with suffix '_DR' operate on the data registers and those with suffix '_IR' apply to the instruction register. All transitions between states are determined by the TMS (test mode select) signal and occur at the rising edge of TCK, the boundary-scan clock. An optional active-low reset signal, nTRST or TRST*, resets the state machine to the initial state, Reset. If the dedicated nTRST is not used, there must be a power-on reset signal (POR), not an existing system reset signal.

The outputs of the TAP controller are not shown in Figure 14.7 , but are derived from each TAP controller state. The TAP controller operates rather like a four-button digital watch that cycles through several states (alarm, stopwatch, 12 hr / 24 hr, countdown timer, and so on) as you press the buttons. Only the shaded states in Figure 14.7 affect the ASIC core logic; the other states are intermediate steps. The pause states let the controller jog in place while the tester reloads its memory with a new set of test vectors, for example.
use work.TAP.all;
entity TAP_sm_states is
  port (TMS, TCK, nTRST : in BIT; S : out TAP_STATE);
end TAP_sm_states;

architecture behave of TAP_sm_states is
  type STATE_ARRAY is array (TAP_STATE, 0 to 1) of TAP_STATE;
  -- next state for (TMS = 0, TMS = 1), one row per state in TAP_STATE order
  constant T : STATE_ARRAY := (
    (Run_Idle, Reset),      (Run_Idle, Select_DR),
    (Capture_DR, Select_IR), (Shift_DR, Exit1_DR),
    (Shift_DR, Exit1_DR),   (Pause_DR, Update_DR),
    (Pause_DR, Exit2_DR),   (Shift_DR, Update_DR),
    (Run_Idle, Select_DR),  (Capture_IR, Reset),
    (Shift_IR, Exit1_IR),   (Shift_IR, Exit1_IR),
    (Pause_IR, Update_IR),  (Pause_IR, Exit2_IR),
    (Shift_IR, Update_IR),  (Run_Idle, Select_DR) );
begin
  process (TCK, nTRST)
    variable S_i : TAP_STATE;
  begin
    if ( nTRST = '0' ) then S_i := Reset;
    elsif ( TCK = '1' and TCK'EVENT ) then -- transition on +VE clock edge
      if ( TMS = '1' ) then S_i := T(S_i, 1); else S_i := T(S_i, 0); end if;
    end if;
    S <= S_i; -- update signal with already updated internal variable
  end process;
end behave;
FIGURE 14.7 The TAP (test-access port) controller state machine.

Table 14.5 shows the output control signals generated by the TAP state machine. I have taken the unusual step of writing separate entities for the state machine and its outputs. Normally this is bad practice because it makes it difficult for synthesis tools to extract and optimize the logic, for example. This separation of functions reflects the fact that the operation of the TAP controller state machine is precisely defined by the IEEE 1149.1 standard, independent of the implementation of the register cells and the number of instructions supported. The model in Table 14.5 contains the following combinational, registered, and gated output signals and will change with different implementations:
reset_bar. Resets the IR to IDCODE (or BYPASS in absence of IDCODE TDR).
selectIR. Connects a register, the IR or a TDR, to TDO.
enableTDO. Enables the three-state buffer that drives TDO. This allows data to be shifted out of the ASIC on TDO, either from the IR or from the DR, in states shift_IR or shift_DR respectively.
shiftIR. Selects the serial input to the capture flip-flop in the IR cells.
clockIR. Causes data at the input of the IR to be captured or the contents of the IR to be shifted toward TDO (depending on shiftIR) on the negative edge of TCK following the entry to the states shift_IR or capture_IR. This is a dirty signal.
updateIR. Clocks the update sequential element on the positive edge of TCK at the same time as the exit from state update_IR. This is a dirty signal.
shiftDR, clockDR, and updateDR. Same functions as corresponding IR signals applied to the TDRs. These signals may be gated to the appropriate TDR by the instruction decoder.

The signals reset_bar, enableTDO, shiftIR, and shiftDR are registered, that is, clocked by TCK (on the negative edge of TCK in the model of Table 14.5). We say these signals are clean (as opposed to being dirty gated clocks).

TABLE 14.5 The TAP (test-access port) control. 1

Signal       Value    States in which the value applies
reset_bar    0R       Reset
selectIR     1        Reset, Run_Idle, Capture_IR, Shift_IR, Exit1_IR, Pause_IR, Exit2_IR, Update_IR
enableTDO    1R       Shift_DR, Shift_IR
shiftIR      1R       Shift_IR
clockIR      0G       Capture_IR, Shift_IR
updateIR     1G       Update_IR
shiftDR      1R       Shift_DR
clockDR      0G       Capture_DR, Shift_DR
updateDR     1G       Update_DR

R marks a registered output and G a gated output; in all other states each signal takes the opposite value. The state lists are read from the model that follows.

use work.TAP.all;
entity TAP_sm_output is
  port (TCK : in BIT; S : in TAP_STATE;
        reset_bar, selectIR, enableTDO, shiftIR, clockIR, updateIR,
        shiftDR, clockDR, updateDR : out BIT);
end TAP_sm_output;

architecture behave_1 of TAP_sm_output is
begin
  -- registered outputs, clocked on the negative edge of TCK
  process (TCK) begin
    if ( (TCK = '0') and TCK'EVENT ) then
      if S = Reset then reset_bar <= '0'; else reset_bar <= '1'; end if;
      if S = Shift_IR or S = Shift_DR then enableTDO <= '1';
      else enableTDO <= '0'; end if;
      if S = Shift_IR then shiftIR <= '1'; else shiftIR <= '0'; end if;
      if S = Shift_DR then shiftDR <= '1'; else shiftDR <= '0'; end if;
    end if;
  end process;
  -- dirty outputs gated with not(TCK)
  process (TCK) begin
    if (TCK = '0' and (S = Capture_IR or S = Shift_IR))
      then clockIR <= '0'; else clockIR <= '1'; end if;
    if (TCK = '0' and (S = Capture_DR or S = Shift_DR))
      then clockDR <= '0'; else clockDR <= '1'; end if;
    if TCK = '0' and S = Update_IR
      then updateIR <= '1'; else updateIR <= '0'; end if;
    if TCK = '0' and S = Update_DR
      then updateDR <= '1'; else updateDR <= '0'; end if;
  end process;
  -- combinational output
  selectIR <= '1' when (S = Reset or S = Run_Idle or S = Capture_IR
    or S = Shift_IR or S = Exit1_IR or S = Pause_IR
    or S = Exit2_IR or S = Update_IR) else '0';
end behave_1;
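The TAP controller state machine is small enough to check by hand, but a table-driven model makes two properties easy to confirm: holding TMS high for five TCK edges reaches Reset from any state, and a given TMS sequence visits the expected states. The Python sketch below copies the transition table from the VHDL constant T in Figure 14.7; the dictionary layout and the function names are illustrative choices rather than anything defined by the standard.

```python
# Next state for (TMS = 0, TMS = 1), copied from constant T in Figure 14.7.
NEXT = {
    "Reset":      ("Run_Idle",   "Reset"),
    "Run_Idle":   ("Run_Idle",   "Select_DR"),
    "Select_DR":  ("Capture_DR", "Select_IR"),
    "Capture_DR": ("Shift_DR",   "Exit1_DR"),
    "Shift_DR":   ("Shift_DR",   "Exit1_DR"),
    "Exit1_DR":   ("Pause_DR",   "Update_DR"),
    "Pause_DR":   ("Pause_DR",   "Exit2_DR"),
    "Exit2_DR":   ("Shift_DR",   "Update_DR"),
    "Update_DR":  ("Run_Idle",   "Select_DR"),
    "Select_IR":  ("Capture_IR", "Reset"),
    "Capture_IR": ("Shift_IR",   "Exit1_IR"),
    "Shift_IR":   ("Shift_IR",   "Exit1_IR"),
    "Exit1_IR":   ("Pause_IR",   "Update_IR"),
    "Pause_IR":   ("Pause_IR",   "Exit2_IR"),
    "Exit2_IR":   ("Shift_IR",   "Update_IR"),
    "Update_IR":  ("Run_Idle",   "Select_DR"),
}


def step(state, tms):
    """State after one rising edge of TCK with the given TMS value (0 or 1)."""
    return NEXT[state][tms]


def run(tms_bits, state="Reset"):
    """Apply a sequence of TMS values and return the list of states visited."""
    visited = []
    for tms in tms_bits:
        state = step(state, tms)
        visited.append(state)
    return visited


if __name__ == "__main__":
    # Five edges with TMS = 1 reach Reset from every state.
    print(all(run([1] * 5, start)[-1] == "Reset" for start in NEXT))
    # TMS sequence that loads an instruction and then starts a data-register scan.
    print(run([1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0]))
```

The second sequence ends in Shift_DR after passing through Capture_IR, two Shift_IR cycles (for a 2-bit IR), and Update_IR.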
14.2.5 Boundary-Scan Controller

Figure 14.8 shows a boundary-scan controller. It contains the following four parts:

1. Bypass register.
2. TDO output circuit. The data to be shifted out of the ASIC on TDO is selected from the serial outputs of bypass register (BR_SO), instruction register (IR_SO), or boundary-scan register (BSR_SO). Notice the registered output means that data appears on TDO at the negative edge of TCK. This prevents race conditions between ASICs.
3. Instruction register and instruction decoder.
4. TAP controller.

The BSR (and other optional TDRs) are connected to the ASIC core logic outside the BST controller.

library IEEE; use IEEE.std_logic_1164.all;
use work.TAP.all;
entity Control is
  generic (width : INTEGER := 2);
  port (TMS, TCK, TDI, nTRST : BIT;
        TDO : out STD_LOGIC;
        BSR_SO : BIT;
        BSR_PO : BIT_VECTOR (width-1 downto 0);
        shiftBSR, clockBSR, updateBSR, test_mode : out BIT);
end Control;

architecture mixed of Control is
  use work.BST_components.all;
  signal reset_bar, selectIR, enableTDO, shiftIR, clockIR, updateIR,
    shiftDR, clockDR, updateDR, IR_SO, BR_SO, TDO_reg, TDO_data,
    TDR_SO, selectBR, clockBR, shiftBR : BIT;
  signal IR_PI, IR_PO : BIT_VECTOR (1 downto 0);
  signal S : TAP_STATE;
begin
  IR_PI <= "01";
  -- three-state TDO output, registered on the negative edge of TCK
  TDO <= TO_STDULOGIC(TDO_reg) when enableTDO = '1' else 'Z';
  R1 : process (TCK) begin
    if (TCK = '0') then TDO_reg <= TDO_data; end if;
  end process;
  TDO_data <= IR_SO when selectIR = '1' else TDR_SO;
  TDR_SO <= BR_SO when selectBR = '1' else BSR_SO;
  TC1 : TAP_sm_states port map (TMS, TCK, nTRST, S);
  TC2 : TAP_sm_output port map (TCK, S, reset_bar, selectIR,
    enableTDO, shiftIR, clockIR, updateIR, shiftDR, clockDR, updateDR);
  IR1 : IR generic map (width => 2) port map (shiftIR, clockIR,
    updateIR, reset_bar, nTRST, TDI, IR_SO, IR_PI, IR_PO);
  DEC1 : IR_decoder generic map (width => 2) port map (shiftDR,
    clockDR, updateDR, IR_PO, test_mode, selectBR, shiftBR, clockBR,
    shiftBSR, clockBSR, updateBSR);
  BR1 : BR_cell port map (clockBR, shiftBR, TDI, BR_SO);
end mixed;

FIGURE 14.8 A boundary-scan controller.
14.2.6 A Simple Boundary-Scan Example

Figure 14.9 shows an example of a simple ASIC (our comparator/MUX example) containing boundary scan.

entity Core is
  port (a, b : BIT_VECTOR (2 downto 0);
        outp : out BIT_VECTOR (2 downto 0));
end Core;

architecture behave of Core is
begin
  outp <= a when a < b else b;
end behave;

library IEEE; use IEEE.std_logic_1164.all;
entity BST_ASIC is
  port (TMS, TCK, TDI, nTRST : BIT;
        TDO : out STD_LOGIC;
        a_PAD, b_PAD : BIT_VECTOR (2 downto 0);
        z_PAD : out BIT_VECTOR (2 downto 0));
end BST_ASIC;

architecture structure of BST_ASIC is
  use work.BST_components.all;
  component Core
    port (a, b : BIT_VECTOR (2 downto 0);
          outp : out BIT_VECTOR (2 downto 0));
  end component;
  for all : Core use entity work.Core(behave);
  constant BSR_width : INTEGER := 9;
  signal BSR_SO, test_mode, shiftBSR, clockBSR, updateBSR : BIT;
  signal BSR_PI, BSR_PO : BIT_VECTOR (BSR_width-1 downto 0);
  signal a, b, z : BIT_VECTOR (2 downto 0);
begin
  -- the BSR sits between the pads and the core logic
  BSR_PI <= a_PAD & b_PAD & z;
  a <= BSR_PO(8 downto 6);
  b <= BSR_PO(5 downto 3);
  z_PAD <= BSR_PO(2 downto 0);
  CORE1 : Core port map (a, b, z);
  C1 : Control generic map (width => BSR_width) port map (TMS, TCK,
    TDI, nTRST, TDO, BSR_SO, BSR_PO, shiftBSR, clockBSR, updateBSR,
    test_mode);
  BSR1 : BSR generic map (width => BSR_width) port map (shiftBSR,
    clockBSR, updateBSR, test_mode, TDI, BSR_SO, BSR_PI, BSR_PO);
end structure;

FIGURE 14.9 A boundary-scan example.

The following two packages define the TAP states and the components (these are not essential to understanding what follows, but are included so that the code presented here forms a complete BST model):

package TAP is
  type TAP_STATE is (reset, run_idle, select_DR, capture_DR, shift_DR,
    exit1_DR, pause_DR, exit2_DR, update_DR, select_IR, capture_IR,
    shift_IR, exit1_IR, pause_IR, exit2_IR, update_IR);
end TAP;

use work.TAP.all;
library IEEE; use IEEE.std_logic_1164.all;
package BST_Components is
  component DR_cell
    port (mode, data_in, shiftDR, scan_in, clockDR, updateDR : BIT;
          data_out, scan_out : out BIT);
  end component;
  component IR_cell
    port (shiftIR, data_in, scan_in, clockIR, updateIR,
          reset_bar, nTRST, reset_value : BIT;
          data_out, scan_out : out BIT);
  end component;
  component BR_cell
    port (clockDR, shiftDR, scan_in : BIT; scan_out : out BIT);
  end component;
  component BSR
    generic (width : INTEGER := 5);
    port (shiftDR, clockDR, updateDR, mode, scan_in : BIT;
          scan_out : out BIT;
          data_in : BIT_VECTOR(width-1 downto 0);
          data_out : out BIT_VECTOR(width-1 downto 0));

  end component;
  component IR
    generic (width : INTEGER := 4);
    port (shiftIR, clockIR, updateIR, reset_bar, nTRST, scan_in : BIT;
          scan_out : out BIT;
          data_in : BIT_VECTOR (width-1 downto 0);
          data_out : out BIT_VECTOR (width-1 downto 0));
  end component;
  component IR_decoder
    generic (width : INTEGER := 4);
    port (shiftDR, clockDR, updateDR : BIT;
          IR_PO : BIT_VECTOR (width-1 downto 0);
          test_mode, selectBR, shiftBR, clockBR,
          shiftBSR, clockBSR, updateBSR : out BIT);
  end component;
  component TAP_sm_states
    port (TMS, TCK, nTRST : in BIT; S : out TAP_STATE);
  end component;
  component TAP_sm_output
    port (TCK : BIT; S : TAP_STATE;
          reset_bar, selectIR, enableTDO, shiftIR, clockIR, updateIR,
          shiftDR, clockDR, updateDR : out BIT);
  end component;
  component Control
    generic (width : INTEGER := 2);
    port (TMS, TCK, TDI, nTRST : BIT;
          TDO : out STD_LOGIC;
          BSR_SO : BIT;
          BSR_PO : BIT_VECTOR (width-1 downto 0);
          shiftBSR, clockBSR, updateBSR, test_mode : out BIT);
  end component;
  -- nTRST added so that the component matches the BST_ASIC entity
  component BST_ASIC
    port (TMS, TCK, TDI, nTRST : BIT;
          TDO : out STD_LOGIC;
          a_PAD, b_PAD : BIT_VECTOR (2 downto 0);
          z_PAD : out BIT_VECTOR (2 downto 0));
  end component;
end;

The following testbench, Test_BST, performs these functions:

1. Resets the TAP controller at t = 10 ns using nTRST.
2. Continuously clocks the BST clock, TCK, at a frequency of 10 MHz. Rising edges of TCK occur at 100 ns, 200 ns, and so on.
3. Drives a series of values onto the TAP inputs TDI and TMS. The sequence shifts in instruction code '01' (SAMPLE/PRELOAD), followed by '00' (EXTEST).

library IEEE; use IEEE.std_logic_1164.all;
library STD; use STD.TEXTIO.all;
entity Test_BST is end;

architecture behave of Test_BST is
  component BST_ASIC
    port (TMS, TCK, TDI, nTRST : BIT;
          TDO : out STD_LOGIC;
          a_PAD, b_PAD : BIT_VECTOR (2 downto 0);
          z_PAD : out BIT_VECTOR (2 downto 0));
  end component;
  -- the architecture of BST_ASIC is named structure (Figure 14.9)
  for all : BST_ASIC use entity work.BST_ASIC(structure);
  signal TMS, TCK, TDI, nTRST : BIT;
  signal TDO : STD_LOGIC;
  signal TDI_TMS : BIT_VECTOR (1 downto 0);
  signal a_PAD, b_PAD, z_PAD : BIT_VECTOR (2 downto 0);
begin
  TDI <= TDI_TMS(1);
  TMS <= TDI_TMS(0);
  ASIC1 : BST_ASIC port map (TMS, TCK, TDI, nTRST, TDO, a_PAD, b_PAD, z_PAD);

  nTRST_DRIVE : process begin
    nTRST <= '1', '0' after 10 ns, '1' after 20 ns;
    wait;
  end process;
  PAD_DRIVE : process begin
    a_PAD <= ('0', '1', '0');
    b_PAD <= ('0', '1', '1');
    wait;
  end process;
  TCK_DRIVE : process begin -- rising edge at 100 ns
    TCK <= '0' after 50 ns, '1' after 100 ns;
    wait for 100 ns;
    if (now > 3000 ns) then wait; end if;
  end process;
  BST_DRIVE : process begin
    TDI_TMS <=                  -- State after +VE edge:
      ('0', '1') after 0 ns,    -- Reset
      ('0', '0') after 101 ns,  -- Run_Idle
      ('0', '1') after 201 ns,  -- Select_DR
      ('0', '1') after 301 ns,  -- Select_IR
      ('0', '0') after 401 ns,  -- Capture_IR
      ('0', '0') after 501 ns,  -- Shift_IR
      ('1', '0') after 601 ns,  -- Shift_IR
      ('0', '1') after 701 ns,  -- Exit1_IR
      ('0', '1') after 801 ns,  -- Update_IR, 01 = SAMPLE/PRELOAD
      ('0', '1') after 901 ns,  -- Select_DR
      ('0', '0') after 1001 ns, -- Capture_DR
      ('0', '0') after 1101 ns, -- Shift_DR
      -- shift 111111101 into BSR, TDI(time) = 101111111 starting now
      ('1', '0') after 1201 ns, -- Shift_DR
      ('0', '0') after 1301 ns, -- Shift_DR
      ('1', '0') after 1401 ns, -- Shift_DR
      -- shift 4 more 1's
      ('1', '0') after 1901 ns, -- Shift_DR
      -- in-between
      ('1', '1') after 2001 ns, -- Exit1_DR
      ('0', '1') after 2101 ns, -- Update_DR
      ('0', '1') after 2201 ns, -- Select_DR
      ('0', '1') after 2301 ns, -- Select_IR
      ('0', '0') after 2401 ns, -- Capture_IR
      ('0', '0') after 2501 ns, -- Shift_IR
      ('0', '0') after 2601 ns, -- Shift_IR
      ('0', '1') after 2701 ns, -- Exit1_IR
      ('0', '1') after 2801 ns, -- Update_IR, 00 = EXTEST
      ('0', '0') after 2901 ns; -- Run_Idle
    wait;
  end process;
  -- print a line whenever TDO or a pad changes
  process (TDO, a_pad, b_pad, z_pad)
    variable L : LINE;
  begin
    write (L, now, RIGHT, 10);
    write (L, STRING'(" TDO="));
    if TDO = 'Z' then write (L, STRING'("Z"));
    else write (L, TO_BIT(TDO)); end if;
    write (L, STRING'(" PADS="));
    write (L, a_pad & b_pad & z_pad);
    writeline (output, L);
  end process;
end behave;

Here is the output from this testbench:

#       0 ns TDO=0 PADS=000000000
#       0 ns TDO=Z PADS=010011000
#       0 ns TDO=Z PADS=010011010
#     650 ns TDO=1 PADS=010011010
#     750 ns TDO=0 PADS=010011010
#     850 ns TDO=Z PADS=010011010
#    1250 ns TDO=0 PADS=010011010
#    1350 ns TDO=1 PADS=010011010
#    1450 ns TDO=0 PADS=010011010
#    1550 ns TDO=1 PADS=010011010
#    1750 ns TDO=0 PADS=010011010
#    1950 ns TDO=1 PADS=010011010
#    2050 ns TDO=0 PADS=010011010
#    2150 ns TDO=Z PADS=010011010
#    2650 ns TDO=1 PADS=010011010
#    2750 ns TDO=0 PADS=010011010
#    2850 ns TDO=Z PADS=010011010
#    2950 ns TDO=Z PADS=010011101
This trace shows the following activities:

• All changes to TDO and at the pads occur at the negative edge of TCK .
• The core logic output is z_pad = '010' and appears at the I/O pads at t = 0 ns. This is the smaller of the two inputs, a_pad = '010' and b_pad = '011' , and the correct output when the pads are connected to the core logic.
• At t = 650 ns the IDCODE instruction '01' is shifted out on TDO (with '1' appearing first). If we had multiple ASICs in the boundary-scan chain, this would show us that the chain was intact.
• At t = 850 ns the TDO output is floated (to 'Z' ) as we exit the shift_IR state.
• At t = 1200 ns the TAP controller begins shifting the serial data input from TDI ( '111111101' ) into the BSR.
• At t = 1250 ns the BSR data starts shifting out. This is data that was captured during the SAMPLE/PRELOAD instruction from the device input pins, a_pad and b_pad , as well as the driver of the output pins, z_pad . The data appears as the pattern '010011010' . This pattern consists of a_pad = '010' , b_pad = '011' , followed by z_pad = '010' (notice that TDO does not change at t = 1650 ns or 1850 ns).
• At t = 2150 ns, TDO is floated after we exit the shift_DR state.
• At t = 2650 ns the IDCODE instruction '01' (loaded into the IR as we passed through capture_IR the second time) is again shifted out as we shift the EXTEST instruction from TDI into the IR.
• At t = 2850 ns the TDO output is floated after we exit the shift_IR state.
• At t = 2950 ns the output, z_pad , is driven with '101' . The inputs a_pad and b_pad remain unchanged since they are driven from outside the chip. The change on z_pad occurs on the negative edge of TCK because the IR is loaded with the instruction EXTEST on the negative edge of TCK . When this instruction is decoded, the signal mode changes (this signal controls the MUX at the output of the BSCs).
Figure 14.10 shows a signal trace from the MTI simulator for the last four negative edges of TCK . Notice that we shifted in the test pattern on TDI in the order '101111111' . The output z_pad (3 bits wide) is last in the BSR (nearest TDO ) and thus is driven with the first 3 bits of this pattern, '101' . Forcing '101' onto the ASIC output pins would allow us to check that this pattern is correctly received at inputs of other connected ASICs through the bonding wires and board traces. In a later test cycle we can force '010' onto z_pad to check that both logic levels can be transmitted and received. We may also capture other signals (which are similarly being forced onto the outputs of neighboring ASICs) at the inputs.
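The shift order can be checked with a small software model of the BSR. The following Python sketch is only an illustration, not part of the testbench: it assumes a nine-cell register ordered a_pad, b_pad, z_pad from the TDI end to the TDO end (as the captured pattern suggests), and the function name shift_bsr is our own.

```python
def shift_bsr(bsr, tdi_bits):
    """Shift tdi_bits (in time order) through a boundary-scan register.

    bsr lists the cell values from the TDI end to the TDO end.
    Returns (new_bsr, tdo_bits), with tdo_bits in time order.
    """
    bsr = list(bsr)
    tdo_bits = []
    for bit in tdi_bits:
        tdo_bits.append(bsr[-1])   # the cell nearest TDO leaves first
        bsr = [bit] + bsr[:-1]     # every other cell moves one place toward TDO
    return bsr, tdo_bits

# Captured pad values: a_pad = 010, b_pad = 011, z_pad = 010 (TDI end first).
captured = [0, 1, 0,  0, 1, 1,  0, 1, 0]
# TDI pattern in time order, as in the testbench comment.
tdi = [1, 0, 1, 1, 1, 1, 1, 1, 1]

loaded, tdo = shift_bsr(captured, tdi)
z_cells = loaded[-3:]              # the z_pad cells sit nearest TDO
print("BSR after shift :", "".join(map(str, loaded)))
print("z_pad cells     :", "".join(map(str, z_cells)))
print("TDO (time order):", "".join(map(str, tdo)))
```

In this model the first bits presented on TDI travel furthest, so they end up in the z_pad cells, while the captured pad values leave on TDO starting with the cell nearest TDO.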
FIGURE 14.10 Results from the MTI simulator for the boundary-scan testbench.
14.2.7 BSDL
The boundary-scan description language ( BSDL ) is an extension of IEEE 1149.1 but without any overlap. BSDL uses a subset of VHDL. The BSDL for an ASIC is part of an imaginary data sheet; it is not intended for simulation and does not include models for any boundary-scan components. BSDL is a standard way to describe the features and behavior of an ASIC that includes IEEE 1149.1 boundary scan and a standard way to pass information to test-generation software. Using BSDL, test software can also check that the BST features are correct. As an example, test software can use the BSDL to check that the ASIC uses the correct boundary-scan cells for the instructions that claim to be supported. BSDL cannot prove that an implementation works, however. The following example BSDL description corresponds to our halfgate ASIC example with BST (this code was generated automatically by the Compass tools):

entity asic_p is
  generic (PHYSICAL_PIN_MAP : STRING := "DUMMY_PACKAGE");
  port (
    pad_a: in BIT_VECTOR (0 to 0);
    pad_z: buffer BIT_VECTOR (0 to 0);
    TCK: in BIT;
    TDI: in BIT;
    TDO: out BIT;
    TMS: in BIT;
    TRST: in BIT);
  use STD_1149_1_1994.all;
  attribute PIN_MAP of asic_p : entity is PHYSICAL_PIN_MAP;
  -- CUSTOMIZE package pin mapping.
  constant DUMMY_PACKAGE : PIN_MAP_STRING :=
    "pad_a:(1)," &
    "pad_z:(2)," &
    "TCK:3," &
    "TDI:4," &
    "TDO:5," &
    "TMS:6," &
    "TRST:7";
  attribute TAP_SCAN_IN of TDI : signal is TRUE;
  attribute TAP_SCAN_MODE of TMS : signal is TRUE;
  attribute TAP_SCAN_OUT of TDO : signal is TRUE;
  attribute TAP_SCAN_RESET of TRST : signal is TRUE;
  -- CUSTOMIZE TCK max freq and safe stop state.
  attribute TAP_SCAN_CLOCK of TCK : signal is (20.0e6, BOTH);
  attribute INSTRUCTION_LENGTH of asic_p : entity is 3;
  attribute INSTRUCTION_OPCODE of asic_p : entity is
    "IDCODE (001)," &
    "STCTEST (101)," &
    "INTEST (100)," &
    "BYPASS (111)," &
    "SAMPLE (010)," &
    "EXTEST (000)";
  attribute INSTRUCTION_CAPTURE of asic_p : entity is "001";
  -- attribute INSTRUCTION_DISABLE of asic_p : entity is " "
  -- attribute INSTRUCTION_GUARD of asic_p : entity is " "
  -- attribute INSTRUCTION_PRIVATE of asic_p : entity is " "
  attribute IDCODE_REGISTER of asic_p : entity is
    "0000" & -- 4-bit version
    "0000000000000000" & -- 16-bit part number
    "00000101011" & -- 11-bit manufacturer
    "1"; -- mandatory LSB
  -- attribute USERCODE_REGISTER of asic_p : entity is " "
  attribute REGISTER_ACCESS of asic_p : entity is
    "BOUNDARY (STCTEST)";
  attribute BOUNDARY_CELLS of asic_p : entity is
    "BC_1, BC_2";
  attribute BOUNDARY_LENGTH of asic_p : entity is 2;
  attribute BOUNDARY_REGISTER of asic_p : entity is
  -- num cell port function safe [ccell disval rslt]
    " 1 ( BC_2, pad_a(0), input, X)," &
    " 0 ( BC_1, pad_z(0), output2, X)";
  -- " 98 ( BC_1, OE, input, X), " &
  -- " 98 ( BC_1, *, control, 0), " &
  -- " 99 ( BC_1, myport(0), output3, X, 98, 0, Z);
end asic_p;

The functions of the lines of this BSDL description are as follows:
• Line 2 refers to the ASIC package. We can have the same part (with identical pad numbers on the silicon die) in different ASIC packages. We include the name of the ASIC package in line 2 and the pin mapping between bonding pads and ASIC package pins in lines 14–21.
• Lines 3–10 describe the signal names of inputs and outputs, the TAP pins, and the optional fifth TAP reset signal. The BST signals do not have to be given the names used in the standard: TCK, TDI, and so on.
• Line 11 refers to the VHDL package, STD_1149_1_1994 . This is a small VHDL package (just over 100 lines) that contains definitions of the constants, types, and attributes used in a BSDL description. It does not contain any models for simulation.
• Lines 22–25 attach signal names to the required TAP pins and the optional fifth TAP reset signal.
• Lines 26–27 refer to the maximum test clock frequency in hertz, and whether the clock may be stopped in both states or just the low state (just the high state is not valid).
• Line 28 describes a 3-bit IR (in the comparator/MUX example we used a 2-bit IR). Length must be greater than or equal to 2.
• Lines 29–35 describe the three required instruction opcodes and mnemonics ( BYPASS, SAMPLE, EXTEST ) and three optional instructions: IDCODE, STCTEST (which is a scan test mode), and INTEST (which supports internal testing in the same fashion as EXTEST supports external testing). EXTEST must be all zeros; BYPASS must be all ones (as in the opcodes listed here). A mnemonic may have more than one opcode (and opcodes may be specified using 'x' ). Other instructions that may appear here include CLAMP and HIGHZ , both optional instructions that were added to 1149.1 (see Supplement A, 1149.1a). String concatenation is used in BSDL to avoid line-break problems.
• Lines 37–39 include instruction attributes INSTRUCTION_DISABLE (for HIGHZ ), INSTRUCTION_GUARD (for CLAMP ), as well as INSTRUCTION_PRIVATE (for user-defined instructions) that are not used in this example.
• Lines 40–44 describe the IDCODE TDR. The 11-bit manufacturer number is determined from codes assigned by JEDEC Publication 106-A.
• Line 45 describes the USERCODE TDR in a similar fashion to IDCODE, but is not used here.
• Lines 46–47 describe the TDRs for user-defined instructions. In this case the existing BOUNDARY TDR is inserted between TDI and TDO during STCTEST . User-defined instructions listed here may use the other existing IDCODE and BYPASS TDRs or define new TDRs.
• Lines 48–49 list the boundary-scan cells used in the ASIC. These may be any of the following cells defined in the 1149.1 standard and defined in the VHDL package, STD_1149_1_1994 : BC_1 (Figs. 10-18, 10-29, 10-31c, 10-31d, and 10-33c), BC_2 (Figs. 10-14, 10-30, 10-32c, 10-32d, 10-35c), BC_3 (Fig. 10-15), BC_4 (Figs. 10-16, 10-17), BC_5 (Fig. 10-41c), BC_6 (Fig. 10-34d). The figure numbers in parentheses here refer to the IEEE 1149.1 standard [IEEE 1149.1b-1994]. Alternatively the cells may be user-defined (and must then be declared in a package).
• Line 50 must be an integer greater than zero and match the number defined by the following register description.
• Lines 51–54 are an array of records, numbered by cell, with seven fields: four required and three that only appear for certain cells. Field 1 specifies the scan cell name as defined in the STD_1149_1_1994 or user-defined package. Field 2 is the port name, with a subscript if the type is BIT_VECTOR . An '*' denotes no connection. Field 3 is one of the following cell functions (with figure or page numbers from the IEEE standard [IEEE 1149.1b-1994]): input (Fig. 10-18), clock (Fig. 10-17), output2 (two-state output, Fig. 10-29), output3 (three-state, Fig. 10-31d), internal (p. 33, 1149.1b), control (Fig. 10-31c), controlr (Fig. 10-33c), bidir_in (a reversible cell acting as an input, Fig. 10-34d), bidir_out (a reversible cell acting as an output, Fig. 10-34d). Field 4, safe , contains the safe value to be loaded into the update flip-flop when otherwise unspecified, with 'X' as a don't care value. Lines 55–57 illustrate the use of the optional three fields. Field 5, ccell or control cell, refers to the cell number (98 in this example) of the cell that controls an output or bidirectional cell. The control cell number 98 is a merged cell in this example with an input cell, input signal name OE , also labeled as cell number 98. The ASIC input OE (for output enable) thus directly controls (enables) the ASIC three-state output, myport(0) .
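As a quick check of the IDCODE field layout, the four strings concatenated in the IDCODE_REGISTER attribute can be packed into one 32-bit word. This Python fragment is a sketch only; the helper name pack_idcode is hypothetical and is not part of BSDL or of any test tool.

```python
def pack_idcode(version, part_number, manufacturer):
    """Pack IDCODE fields into a 32-bit integer: 4-bit version,
    16-bit part number, 11-bit manufacturer, and a mandatory LSB of 1."""
    assert 0 <= version < 2**4
    assert 0 <= part_number < 2**16
    assert 0 <= manufacturer < 2**11
    return (version << 28) | (part_number << 12) | (manufacturer << 1) | 1

# Field values copied from the IDCODE_REGISTER attribute in the BSDL example.
idcode = pack_idcode(0b0000, 0b0000000000000000, 0b00000101011)
print(format(idcode, "032b"))   # the same 32 bits as the concatenated strings
print(hex(idcode))
```

The printed binary string should match the four BSDL strings read left to right, most significant bit first.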
The boundary-scan standard may seem like a complicated way to test the connections outside an ASIC. However, the IEEE 1149.1 standard also gives us a method to communicate with test circuits inside an ASIC. Next, we turn our attention from problems at the board level to problems that may occur within the ASIC.
1. Outputs: G = gated with TCK, R = registered on falling edge of TCK. Only active levels are shown in the table.
14.3 Faults
Fabrication of an ASIC is a complicated process requiring hundreds of processing steps. Problems may introduce a defect that in turn may introduce a fault (Sabnis [1990] describes defect mechanisms). Any problem during fabrication may prevent a transistor from working and may break or join interconnections. Two common types of defects occur in metallization [Rao, 1993]: either underetching the metal (a problem between long, closely spaced lines), which results in a bridge or short circuit ( shorts ) between adjacent lines, or overetching the metal and causing breaks or open circuits ( opens ). Defects may also arise after chip fabrication is complete: while testing the wafer, cutting the die from the wafer, or mounting the die in a package. Wafer probing, wafer saw, die attach, wire bonding, and the intermediate handling steps each have their own defect and failure mechanisms. Many different materials are involved in the packaging process that have different mechanical, electrical, and thermal properties, and these differences can cause defects due to corrosion, stress, adhesion failure, cracking, and peeling. Yield loss also occurs from human error (using the wrong mask, incorrectly setting the implant dose) as well as from physical sources: contaminated chemicals, dirty etch sinks, or a troublesome process step. It is possible to repeat or rework some of the reversible steps (a lithography step, for example, but not etching) if there are problems. However, reliance on rework indicates a poorly controlled process.
14.3.1 Reliability
It is possible for defects to be nonfatal but to cause failures early in the life of a product. We call this infant mortality . Most products follow the same kind of trend for failures as a function of life. Failure rates decrease rapidly to a low value that remains steady until the end of life when failure rates increase again; this is called a bathtub curve . The end of a product lifetime is determined by various wearout mechanisms (usually these are controlled by an exponential energy process). Some of the most important wearout mechanisms in ASICs are hot-electron wearout, electromigration, and the failure of antifuses in FPGAs. We can catch some of the products that are susceptible to early failure using burn-in . Many failure mechanisms have a failure rate proportional to $\exp(-E_a/kT)$. This is the Arrhenius equation , where $E_a$ is a known activation energy ($k$ is Boltzmann's constant, $8.62 \times 10^{-5}$ eV K$^{-1}$, and $T$ the absolute temperature). Operating an ASIC at an elevated temperature accelerates this type of failure mechanism. Depending on the physics of the failure mechanism, additional stresses, such as elevated current or voltage, may also accelerate failures. The longer and harsher the burn-in conditions, the more likely we are to find problems, but the more costly the process and the more costly the parts. We can measure the overall reliability of any product using the mean time between failures ( MTBF ) for a repairable product or mean time to failure ( MTTF ) for a fatal failure. We also use failures in time ( FITs ) where 1 FIT equals a single failure in $10^9$ hours. We can sum the FITs for all the components in a product to determine an overall measure for the product reliability. Suppose we have a system with the following components:
• Microprocessor (standard part), 5 FITs
• 100 TTL parts, 50 parts at 10 FITs, 50 parts at 15 FITs
• 100 RAM chips, 6 FITs

The overall failure rate for this system is $5 + 50 \times 10 + 50 \times 15 + 100 \times 6 = 1855$ FITs. Suppose we could reduce the component count using ASICs to the following:

• Microprocessor (custom), 7 FITs
• 9 ASICs, 10 FITs
• 5 SIMMs, 15 FITs

The failure rate is now $7 + 9 \times 10 + 5 \times 15 = 172$ FITs, or about an order of magnitude lower. This is the rationale behind the Sun SparcStation 1 design described in Section 1.3 , Case Study .
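The reliability arithmetic in this section is easy to script. The following Python sketch sums FITs for the original system, converts the total to an MTTF (assuming a constant failure rate, which is our simplification), and evaluates the Arrhenius acceleration between two temperatures. The activation energy and the temperatures in the last line are assumed values chosen only for illustration; they do not come from the text.

```python
import math

K_BOLTZMANN_EV = 8.62e-5  # Boltzmann's constant in eV/K, as quoted in the text

def total_fits(parts):
    """Sum FITs over (count, fits_per_part) pairs."""
    return sum(count * fits for count, fits in parts)

def mttf_hours(fits):
    """MTTF in hours for a constant failure rate (1 FIT = 1 failure in 1e9 hours)."""
    return 1e9 / fits

def arrhenius_acceleration(ea_ev, t_use_k, t_stress_k):
    """Ratio of failure rates at the stress and use temperatures
    for a failure rate proportional to exp(-Ea/kT)."""
    return math.exp((ea_ev / K_BOLTZMANN_EV) * (1.0 / t_use_k - 1.0 / t_stress_k))

# The original system: microprocessor, two groups of TTL parts, RAM chips.
original_system = [(1, 5), (50, 10), (50, 15), (100, 6)]
fits = total_fits(original_system)
print(fits, "FITs")
print(round(mttf_hours(fits)), "hours MTTF")

# Hypothetical burn-in: Ea = 0.7 eV, 55 C in use, 125 C under stress (assumed values).
print(arrhenius_acceleration(0.7, 328.15, 398.15))
```

The same total_fits call can be applied to the reduced component list to compare the two designs.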
14.3.2 Fault Models

Table 14.6 shows some of the causes of faults. The first column shows the fault level: whether the fault occurs in the logic gates on the chip or in the package. The second column describes the physical fault . There are too many of these and we need a way to reduce and simplify their effects by using a fault model. There are several types of fault model . First, we simplify things by mapping from a physical fault to a logical fault . Next, we distinguish between those logical faults that degrade the ASIC performance and those faults that are fatal and stop the ASIC from working at all. There are three kinds of logical faults in Table 14.6 : a degradation fault, an open-circuit fault, and a short-circuit fault.

TABLE 14.6 Mapping physical faults to logical faults.
Logical-fault columns: Degradation fault, Open-circuit fault, Short-circuit fault.

Fault level   Physical fault
Chip          Leakage or short between package leads
              Broken, misaligned, or poor wire bonding
              Surface contamination, moisture
              Metal migration, stress, peeling
              Metallization (open or short)
Gate          Contact opens
              Gate to S/D junction short
              Field-oxide parasitic device
              Gate-oxide imperfection, spiking
              Mask misalignment

A degradation fault may be a parametric fault or delay fault ( timing fault ). A parametric fault might lead to an incorrect switching threshold in a TTL/CMOS level converter at an input, for example. We can test for parametric faults using a production tester. A delay fault might lead to a critical path being slower than specification. Delay faults are much harder to test in production. An open-circuit fault results from physical faults such as a bad contact, a piece of metal that is missing or overetched, or a break in a polysilicon line. These physical faults all result in failure to transmit a logic level from one part of a circuit to another: an open circuit. A short-circuit fault results from such physical faults as: underetching of metal; spiking, pinholes or shorts across the gate oxide; and diffusion shorts. These faults result in a circuit being accidentally connected: a short circuit. Most short-circuit faults occur in interconnect; often we call these bridging faults (BF). A BF usually results from metal coverage problems that lead to shorts. You may see reference to feedback bridging faults and nonfeedback bridging faults , a useful distinction when trying to predict the results of faults on logic operation. Bridging faults are a frequent problem in CMOS ICs.
14.3.3 Physical Faults

Figure 14.11 shows the following examples of physical faults in a logic cell:
• F1 is a short between m1 lines and connects node n1 to VSS.
• F2 is an open on the poly layer and disconnects the gate of transistor t1 from the rest of the circuit.
• F3 is an open on the poly layer and disconnects the gate of transistor t3 from the rest of the circuit.
• F4 is a short on the poly layer and connects the gate of transistor t4 to the gate of transistor t5.
• F5 is an open on m1 and disconnects node n4 from the output Z1.
• F6 is a short on m1 and connects nodes p5 and p6.
• F7 is a nonfatal defect that causes necking on m1.

FIGURE 14.11 Defects and physical faults. Many types of defects occur during fabrication. Defects can be of any size and on any layer. Only a few small sample defects are shown here using a typical standard cell as an example. Defect density for a modern CMOS process is of the order of $1\ \mathrm{cm}^{-2}$ or less across a whole wafer. The logic cell shown here is approximately $64 \times 32\ \lambda^2$, or $250\ \mu\mathrm{m}^2$ for a $\lambda = 0.25\ \mu\mathrm{m}$ process. We would thus have to examine approximately $1\ \mathrm{cm}^2 / 250\ \mu\mathrm{m}^2$ or 400,000 such logic cells to find a single defect.

Once we have reduced the large number of physical faults to fewer logical faults, we need a model to predict their effect. The most common model is the stuck-at fault model .
14.3.4 Stuck-at Fault Model

The single stuck-at fault ( SSF ) model assumes that there is just one fault in the logic we are testing. We use a single stuck-at fault model because a multiple stuck-at fault model that could handle several faults in the logic at the same time is too complicated to implement. We hope that any multiple faults are caught by single stuck-at fault tests [Agarwal and Fung, 1981; Hughes and McCluskey, 1986]. In practice this seems to be true. There are other fault models. For example, we can assume that faults are located in the transistors using a stuck-on fault and stuck-open fault (or stuck-off fault ). Fault models such as these are more realistic in that they more closely model the actual physical faults. However, in practice the simple SSF model has been found to work, and work well. We shall concentrate on the SSF model. In the SSF model we further assume that the effect of the physical fault (whatever it may be) is to create only two kinds of logical fault. The two types of logical faults or stuck-at faults are: a stuck-at-1 fault (abbreviated to SA1 or s@1) and a stuck-at-0 fault ( SA0 or s@0). We say that we place faults ( inject faults , seed faults , or apply faults ) on a node (or net), on an input of a circuit, or on an output of a circuit.
The location at which we place the fault is the fault origin . A net fault forces all the logic cell inputs that the net drives to a logic '1' or '0' . An input fault attached to a logic cell input forces the logic cell input to a '1' or '0' , but does not affect other logic cell inputs on the same net. An output fault attached to the output of a logic cell can have different strengths. If an output fault is a supply-strength fault (or rail-strength fault) the logic-cell output node and every other node on that net is forced to a '1' or '0' as if all these nodes were connected to one of the supply rails. An alternative assigns the same strength to the output fault as the drive strength of the logic cell. This allows contention between outputs on a net driving the same node. There is no standard method of handling output-fault strength , and no standard for using types of stuck-at faults. Usually we do not inject net faults; instead we inject only input faults and output faults. Some people use the term node fault but in different ways to mean either a net fault, input fault, or output fault. We usually inject stuck-at faults to the inputs and outputs, the pins, of logic cells (AND gates, OR gates, flip-flops, and so on). We do not inject faults to the internal nodes of a flip-flop, for example. We call this a pin-fault model and say the fault level is at the structural level , gate level, or cell level. We could apply faults to the internal logic of a logic cell (such as a flip-flop), and the fault level would then be at the transistor level or switch level. We do not use transistor-level or switch-level fault models because there is often no need. From experience, but not from any theoretical reason, it turns out that using a fault model that applies faults at the logic-cell level is sufficient to catch the bad chips in a production test. When a fault changes the circuit behavior, the change is called the fault effect .
Fault effects travel through the circuit to other logic cells causing other fault effects. This phenomenon is fault propagation . If the fault level is at the structural level, the phenomenon is structural fault propagation . If we have one or more large functional blocks in a design, we want to apply faults to the functional blocks only at the inputs and outputs of the blocks. We do not want to place (or cannot place) faults inside the blocks, but we do want faults to propagate through the blocks. This is behavioral fault propagation . Designers adjust the fault level to the appropriate level at which they think there may be faults. Suppose we are performing a fault simulation on a board and we have already tested the chips. Then we might set the fault level to the chip level, placing faults only at the chip pins. For ASICs we use the logic-cell level. You have to be careful, though, if you mix behavioral level and structural level models in a mixed-level fault simulation . You need to be sure that the behavioral models propagate faults correctly. In particular, if the behavioral model responds to faults on its inputs by propagating too many unknown 'X' values to its outputs, this will decrease the fault coverage, because the model is hiding the logic beyond it.
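The difference between a net fault and an input (pin) fault is easiest to see on a net with fanout. The following Python sketch uses a made-up netlist, not one from the figures: net n fans out to the A pins of two hypothetical cells, U1 (AND) and U2 (OR). The fault tuples and pin names are our own notation.

```python
def simulate(a, b, c, fault=None):
    """Evaluate a tiny netlist in which net n = a fans out to two cells:
    U1 = AND(n, b) and U2 = OR(n, c).

    fault is None or a (kind, location, value) tuple:
      ("net", "n", v)      forces every cell input driven by net n to v
      ("input", "U1.A", v) forces only the A input pin of U1 to v
      ("input", "U2.A", v) forces only the A input pin of U2 to v
    """
    u1_a = u2_a = a                      # both pins are driven by net n
    if fault is not None:
        kind, where, value = fault
        if kind == "net" and where == "n":
            u1_a = u2_a = value
        elif kind == "input" and where == "U1.A":
            u1_a = value
        elif kind == "input" and where == "U2.A":
            u2_a = value
    return (u1_a & b, u2_a | c)          # (U1 output, U2 output)

vector = (1, 1, 0)                       # a, b, c
print("good      :", simulate(*vector))
print("net n SA0 :", simulate(*vector, fault=("net", "n", 0)))
print("U1.A SA0  :", simulate(*vector, fault=("input", "U1.A", 0)))
```

With this vector the net fault changes both cell outputs, while the input fault on U1 changes only the U1 output.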
14.3.5 Logical Faults

Figure 14.12 and the following list show how the defects and physical faults of Figure 14.11 translate to logical faults (not all physical faults translate to logical faults; most do not):

• F1 translates to node n1 being stuck at 0, equivalent to A1 being stuck at 1.
• F2 will probably result in node n1 remaining high, equivalent to A1 being stuck at 0.
• F3 will affect half of the n -channel pull-down stack and may result in a degradation fault, depending on what happens to the floating gate of T3. The cell will still work, but the fall time at the output will approximately double. A fault such as this in the middle of a chain of logic is extremely hard to detect.
• F4 is a bridging fault whose effect depends on the relative strength of the transistors driving this node. The fault effect is not well modeled by a stuck-at fault model.
• F5 completely disables half of the n -channel pull-down stack and will result in a degradation fault.
• F6 shorts the output node to VDD and is equivalent to Z1 stuck at 1.
• Fault F7 could result in infant mortality. If this line did break due to electromigration the cell could no longer pull Z1 up to VDD. This would translate to a Z1 stuck at 0. This fault would probably be fatal and stop the ASIC working.
FIGURE 14.12 Fault models. (a) Physical faults at the layout level (problems during fabrication) shown in Figure 14.11 translate to electrical problems on the detailed circuit schematic. The location and effect of fault F1 is shown. The locations of the other fault examples from Figure 14.11 (F2–F6) are shown, but not their effect. (b) We can translate some of these faults to the simplified transistor schematic. (c) Only a few of the physical faults still remain in a gate-level fault model of the logic cell. (d) Finally at the functional-level fault model of a logic cell, we abandon the connection between physical and logical faults and model all faults by stuck-at faults. This is a very poor model of the physical reality, but it works well in practice.
14.3.6 IDDQ Test
When they receive a prototype ASIC, experienced designers measure the resistance between VDD and GND pins. Providing there is not a short between VDD and GND, they connect the power supplies and measure the power-supply current. From experience they know that a supply current of more than a few milliamperes indicates a bad chip. This is exactly what we want in production test: Find the bad chips quickly, get them off the tester, and save expensive tester time. An IDDQ (IDD stands for the supply current, and Q stands for quiescent) test is one of the first production tests applied to a chip on the tester, after the chip logic has been initialized [ Gulati and Hawkins, 1993; Rajsuman, 1994]. High supply current can result from bridging faults that we described in Section 14.3.2 . For example, the bridging fault F4 in Figure 14.11 and Figure 14.12 would cause excessive IDDQ if node n1 and input B1 are being driven to opposite values.
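A bridging fault only draws extra quiescent current on vectors that drive the two bridged nodes to opposite values, so an IDDQ test needs at least one such vector. The Python sketch below lists those vectors for a pair of nodes. The node functions are assumptions made for illustration (we take n1 to be the inverse of input A1 and B1 to be a primary input); they are not read from Figure 14.12.

```python
from itertools import product

def iddq_vectors(node_x, node_y, n_inputs):
    """Return the input vectors on which two bridged nodes are driven to
    opposite values in the fault-free circuit (so a bridge would draw current)."""
    return [v for v in product((0, 1), repeat=n_inputs)
            if node_x(v) != node_y(v)]

# Assumed node functions; v = (A1, B1).
n1 = lambda v: 1 - v[0]      # n1 taken to be the inverse of A1
b1 = lambda v: v[1]          # B1 is a primary input

print(iddq_vectors(n1, b1, 2))
```

Any one of the listed vectors, held long enough for the supply current to settle, would expose the assumed bridge in an IDDQ measurement.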
14.3.7 Fault Collapsing

Figure 14.13 (a) shows a test for a stuck-at-1 output of a two-input NAND gate. Figure 14.13 (b) shows tests for other stuck-at faults. We assume that the NAND gate still works correctly in the bad circuit (also called the faulty circuit or faulty machine ) even if we have an input fault. The input fault on a logic cell is presumed to arise either from a fault from a preceding logic cell or a fault on the connection to the input. Stuck-at faults attached to different points in a circuit may produce identical fault effects. Using fault collapsing we can group these equivalent faults (or indistinguishable faults ) into a fault-equivalence class . To save time we need only consider one fault, called the prime fault or representative fault , from a fault-equivalence class. For example, Figure 14.13 (a) and (b) show that a stuck-at-0 input and a stuck-at-1 output are equivalent faults for a two-input NAND gate. We only need to check for one fault, Z1 (output stuck at 1), to catch any of the equivalent faults. Suppose that any of the tests that detect a fault B also detects fault A, but only some of the tests for fault A also detect fault B. We say A is a dominant fault , or that fault A dominates fault B (this is the definition of fault dominance that we shall use; some texts say fault B dominates fault A in this situation). Clearly to reduce the number of tests using dominant fault collapsing we will pick the test for fault B. For example, Figure 14.13 (c) shows that the output stuck at 0 dominates either input stuck at 1 for a two-input NAND. By testing for fault A1, we automatically detect the fault Z0. Confusion over dominance arises because of the difference between focusing on faults ( Figure 14.13 d) or test vectors ( Figure 14.13 e). Figure 14.13 (f) shows the six stuck-at faults for a two-input NAND gate. We can place SA1 or SA0 on each of the two input pins (four faults in total) and SA1 or SA0 on the output pin. Using fault equivalence ( Figure 14.13 g) we can collapse six faults to four: SA1 on each input, and SA1 or SA0 on the output. Using fault dominance ( Figure 14.13 h) we can collapse six faults to three. There is no way to tell the difference between equivalent faults, but if we use dominant fault collapsing we may lose information about the fault location.
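Equivalence and dominance for the two-input NAND gate can be checked by brute force: enumerate the input vectors that detect each of the six stuck-at faults and compare the resulting sets. The Python sketch below does this, using the fault names of Figure 14.13 (A0, A1, B0, B1, Z0, Z1); the function and variable names are our own.

```python
from itertools import product

def nand(a, b):
    return 1 - (a & b)

def faulty_nand(a, b, fault):
    """Two-input NAND with one pin stuck; fault is (pin, value), pin in 'A', 'B', 'Z'."""
    pin, value = fault
    if pin == "A":
        a = value
    elif pin == "B":
        b = value
    z = nand(a, b)
    return value if pin == "Z" else z

faults = [(pin, v) for pin in "ABZ" for v in (0, 1)]
# Test set of a fault: every input vector on which good and faulty outputs differ.
tests = {
    f: frozenset(v for v in product((0, 1), repeat=2)
                 if faulty_nand(*v, f) != nand(*v))
    for f in faults
}

def name(f):
    return f"{f[0]}{f[1]}"

# Equivalent faults have exactly the same test set.
classes = {}
for f, t in tests.items():
    classes.setdefault(t, []).append(name(f))
equivalence = sorted(sorted(c) for c in classes.values())

# x dominates y (in the sense used in the text) if the tests for y
# are a proper subset of the tests for x.
dominance = sorted((name(x), name(y)) for x in faults for y in faults
                   if tests[y] < tests[x])

print(equivalence)   # the fault-equivalence classes
print(dominance)     # (dominant fault, dominated fault) pairs
```

The four equivalence classes correspond to the collapse from six faults to four, and dropping the dominant fault Z0 leaves the three faults of Figure 14.13 (h).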
FIGURE 14.13 Fault dominance and fault equivalence. (a) We can test for fault Z0 (Z stuck at 0) by applying a test vector that makes the bad (faulty) circuit produce a different output than the good circuit. (b) Some test vectors provide tests for more than one fault. (c) A test for A stuck at 1 (A1) will also test for Z stuck at 0; Z0 dominates A1. The fault effects of faults: A0, B0 and Z1 are the same. These faults are equivalent. (d) There are six sets of input vectors that test for the six stuck-at faults. (e) We only need to choose a subset of all test vectors that test for all faults. (f) The six stuck-at faults for a two-input NAND logic cell. (g) Using fault equivalence we can collapse six faults to four. (h) Using fault dominance we can collapse six faults to three.
14.3.8 Fault-Collapsing Example

Figure 14.14 shows an example of fault collapsing. Using the properties of logic cells to reduce the number of faults that we need to consider is called gate collapsing . We can also use node collapsing by examining the effect of faults on the same node. Consider two inverters in series. An output fault on the first inverter collapses with the node fault on the net connecting the inverters. We can collapse the node fault in turn with the input fault of the second inverter. The details of fault collapsing depend on whether the simulator uses net or pin faults, the fanin and fanout of nodes, and the output fault-strength model used.
FIGURE 14.14 Fault collapsing for A'B + BC. (a) A pin-fault model. Each pin has stuck-at-0 and stuck-at-1 faults. (b) Using fault equivalence the pin faults at the input pins and output pins of logic cells are collapsed. This is gate collapsing. (c) We can reduce the number of faults we need to consider further by collapsing equivalent faults on nodes and between logic cells. This is node collapsing. (d) The final circuit has eight stuck-at faults (reduced from the 22 original faults). If we wished to use fault dominance we could also eliminate the stuck-at-0 fault on Z. Notice that in a pin-fault model we cannot collapse the faults U4.A1.SA1 and U3.A2.SA1 even though they are on the same net.
14.4 Fault Simulation

We use fault simulation after we have completed logic simulation to see what happens in a design when we deliberately introduce faults. In a production test we only have access to the package pinsthe primary inputs ( PIs ) and primary outputs ( POs ). To test an ASIC we must devise a series of sets of input patterns that will detect any faults. A stimulus is the application of one such set of inputs (a test vector ) to the PIs of an ASIC. A typical ASIC may have several hundred PIs and therefore each test vector is several hundred bits long. A test program consists of a set of test vectors. Typical ASIC test programs require tens of thousands and sometimes hundreds of thousands of test vectors. The test-cycle time is the period of time the tester requires to apply the stimulus, sense the POs, and check that the actual output is equal to the expected output. Suppose the test cycle time is 100 ns (corresponding to a test frequency of 10 MHz), in which case we might sense (or strobe ) the POs at 90 ns after the beginning of each test cycle. Using fault simulation we mimic the behavior of the production test. The fault simulator deliberately introduces all possible faults into our ASIC, one at a time, to see if the test program will find them. For the moment we dodge the problem of how to create the thousands of test vectors required in a typical test program and focus on fault simulation. As each fault is inserted, the fault simulator runs our test program. If the fault simulation shows that the POs of the faulty circuit are different than the PIs of the good circuit at any strobe time, then we have a detected fault ; otherwise we have an undetected fault . The list of fault origins is collected in a file and as the faults are inserted and simulated, the results are recorded and the faults are marked according to the result. At the end of fault simulation we can find the fault coverage ,
fault coverage = detected faults / detectable faults. (14.1)

Detected faults and detectable faults will be defined in Section 14.4.5, after the description of fault simulation. For now assume that we wish to achieve close to 100 percent fault coverage. How does fault coverage relate to the ASIC defect level? Table 14.7 shows the results of a typical experiment to measure the relationship between single stuck-at fault coverage and AQL. Table 14.7 completes a circle with test and repair costs in Table 14.1 and defect levels in Table 14.2. These experimental results are the only justification (but a good one) for our assumptions in adopting the SSF model. We are not quite sure why this model works so well, but, being engineers, as long as it continues to work we do not worry too much.

TABLE 14.7 Average quality level as a function of single stuck-at fault coverage.

Fault coverage   Average defect level   Average quality level (AQL)
50%              7%                     93%
90%              3%                     97%
95%              1%                     99%
99%              0.1%                   99.9%
99.9%            0.01%                  99.99%

There are several algorithms for fault simulation: serial fault simulation, parallel fault simulation, and concurrent fault simulation. Next, we shall discuss each of these types of fault simulation in turn.
14.4.1 Serial Fault Simulation

Serial fault simulation is the simplest fault-simulation algorithm. We simulate two copies of the circuit; the first copy is a good circuit. We then pick a fault and insert it into the second copy, the faulty circuit. In test terminology, the circuits are called machines, so the two copies are a good machine and a faulty machine. We shall continue to use the term circuit here to show the similarity between logic and fault simulation (the simulators are often the same program used in different modes). We then repeat the process, simulating one faulty circuit at a time. Serial simulation is slow and is impractical for large ASICs.
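The serial algorithm can be sketched in a few lines of Python. This is only an illustration: the NAND-NAND netlist for Z = A'B + BC, the node names (U2, U3, U4), and the uncollapsed list of node stuck-at faults are assumptions made for the example, not the circuit or fault list of any particular tool.

```python
# Hypothetical NAND-NAND netlist for Z = A'B + BC (node names are illustrative).
def simulate(a, b, c, fault=None):
    """Return the PO value Z; fault is (node_name, stuck_value) or None."""
    def node(name, value):
        # Force the node to its stuck value if it is the fault origin.
        return fault[1] if fault and fault[0] == name else value
    a, b, c = node("A", a), node("B", b), node("C", c)
    u2 = node("U2", 1 - a)            # inverter
    u3 = node("U3", 1 - (u2 & b))     # NAND
    u4 = node("U4", 1 - (b & c))      # NAND
    return node("Z", 1 - (u3 & u4))   # NAND

# Test vectors in the order CBA = 000, 001, ..., 111 (each tuple is (A, B, C)).
VECTORS = [(a, b, c) for c in (0, 1) for b in (0, 1) for a in (0, 1)]
# Assumed fault list: stuck-at-0 and stuck-at-1 on every node.
FAULTS = [(n, s) for n in ("A", "B", "C", "U2", "U3", "U4", "Z") for s in (0, 1)]

def serial_fault_sim(vectors, faults):
    good = [simulate(*v) for v in vectors]       # good machine, simulated once
    detected = {}
    for fault in faults:                         # one faulty machine at a time
        for i, v in enumerate(vectors):
            if simulate(*v, fault=fault) != good[i]:
                detected[fault] = i              # index of first detecting vector
                break                            # stop as soon as it is detected
    return detected

detected = serial_fault_sim(VECTORS, FAULTS)
coverage = len(detected) / len(FAULTS)
print(f"fault coverage = {100 * coverage:.1f} %")
```

Each faulty machine is resimulated from the first vector, which is why the cost grows with the product of faults and vectors.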
14.4.2 Parallel Fault Simulation

Parallel fault simulation takes advantage of multiple bits of the words in computer memory. In the simplest case we need only one bit to represent either a '1' or '0' for each node in the circuit. In a computer that uses a 32-bit word memory we can simulate a set of 32 copies of the circuit at the same time. One copy is the good circuit, and we insert different faults into the other copies. When we need to perform a logic operation, to model an AND gate for example, we can perform the operation across all bits in the word simultaneously. In this case, using one bit per node on a 32-bit machine, we would expect parallel fault simulation to be about 32 times faster than serial simulation. The number of bits per node that we need in order to simulate each circuit depends on the number of states in the logic system we are using. Thus, if we use a four-state system with '1', '0', 'X' (unknown), and 'Z' (high-impedance) states, we need two bits per node. Parallel fault simulation is not quite as fast as our simple prediction because we have to simulate all the circuits in parallel until the last fault in the current set is detected. If we use serial simulation we can stop as soon as a fault is detected and then start another fault simulation. Parallel fault simulation is faster than serial fault simulation but not as fast as concurrent fault simulation. It is also difficult to include behavioral models using parallel fault simulation.
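A minimal sketch of the bit-parallel idea follows, using one bit per node (a two-state logic system). Bit 0 of every word is the good circuit and each higher bit is one faulty copy. The netlist for Z = A'B + BC, the node names, and the fault list are assumptions for illustration; Python integers stand in for the machine word.

```python
# Assumed fault list: stuck-at-0/1 on every node of a hypothetical netlist.
FAULTS = [(n, s) for n in ("A", "B", "C", "U2", "U3", "U4", "Z") for s in (0, 1)]
WIDTH = len(FAULTS) + 1            # bit 0 is the good circuit
ALL = (1 << WIDTH) - 1             # mask covering every simulated copy

def inject(name, word):
    """Force the bit of each faulty copy whose fault origin is this node."""
    for bit, (node, stuck) in enumerate(FAULTS, start=1):
        if node == name:
            word = word | (1 << bit) if stuck else word & ~(1 << bit)
    return word & ALL

def simulate_word(a, b, c):
    """Simulate the good circuit and all faulty copies for one test vector."""
    A = inject("A", ALL if a else 0)
    B = inject("B", ALL if b else 0)
    C = inject("C", ALL if c else 0)
    u2 = inject("U2", ~A & ALL)             # inverter, all copies at once
    u3 = inject("U3", ~(u2 & B) & ALL)      # NAND
    u4 = inject("U4", ~(B & C) & ALL)       # NAND
    return inject("Z", ~(u3 & u4) & ALL)    # NAND driving the PO

def detected_mask(a, b, c):
    """Bit i is set if FAULTS[i] is detected by this vector."""
    z = simulate_word(a, b, c)
    good = ALL if z & 1 else 0              # replicate the good bit
    return (z ^ good) >> 1                  # copies whose PO differs

def detected_faults(a, b, c):
    mask = detected_mask(a, b, c)
    return {FAULTS[i] for i in range(len(FAULTS)) if (mask >> i) & 1}

print(sorted(detected_faults(1, 1, 1)))
```

Every gate evaluation is a single bitwise operation on the whole word, which is where the speedup over serial simulation comes from.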
14.4.3 Concurrent Fault Simulation

Concurrent fault simulation is the most widely used fault-simulation algorithm and takes advantage of the fact that a fault does not affect the whole circuit. Thus we do not need to simulate the whole circuit for each new fault. In concurrent simulation we first completely simulate the good circuit. We then inject a fault and resimulate a copy of only that part of the circuit that behaves differently (this is the diverged circuit). For example, if the fault is in an inverter that is at a primary output, only the inverter needs to be simulated; we can remove everything preceding the inverter. Keeping track of exactly which parts of the circuit need to be diverged for each new fault is complicated, but the savings in memory and processing that result allow hundreds of faults to be simulated concurrently. Concurrent simulation is split into several chunks; you can usually control how many faults (usually around 100) are simulated in each chunk or pass. Each pass thus consists of a series of test cycles. Every circuit has a unique fault-activity signature that governs the divergence that occurs with different test vectors. Thus every circuit has a different optimum setting for faults per pass. Too few faults per pass will not use resources efficiently. Too many faults per pass will overflow the memory.
14.4.4 Nondeterministic Fault Simulation

Serial, parallel, and concurrent fault-simulation algorithms are forms of deterministic fault simulation. In each of these algorithms we use a set of test vectors to simulate a circuit and discover which faults we can detect. If the fault coverage is inadequate, we modify the test vectors and repeat the fault simulation. This is a very time-consuming process. As an alternative we give up trying to simulate every possible fault and instead, using probabilistic fault simulation, we simulate a subset or sample of the faults and extrapolate fault coverage from the sample. In statistical fault simulation we perform a fault-free simulation and use the results to predict fault coverage. This is done by computing measures of observability and controllability at every node. We know that a node is not stuck if we can make the node toggle, that is, change from a '0' to '1' or vice versa. A toggle test checks which nodes toggle as a result of applying test vectors and gives a statistical estimate of vector quality, a measure of faults detected per test vector. There is a strong correlation between high-quality test vectors, the vectors that will detect most faults, and the test vectors that have the highest toggle coverage. Testing for nodes toggling simply requires a single logic simulation that is much faster than complete fault simulation. We can obtain a considerable improvement in fault-simulation speed by putting the high-quality test vectors at the beginning of the simulation. The sooner we can detect faults and eliminate them from having to be considered in each simulation, the faster the simulation will progress. We take the same approach when running a production test and initially order the test vectors by their contribution to fault coverage. This assumes that all faults are equally likely. Test engineers can then modify the test program if they discover vectors late in the test program that are efficient in detecting faulty chips.
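A toggle test needs only a fault-free simulation. The sketch below records the values seen at every node and reports the fraction of nodes that took both '0' and '1'. The netlist for Z = A'B + BC and its node names are assumptions for illustration, and the function expects at least one vector.

```python
def node_values(a, b, c):
    """Fault-free simulation of a hypothetical netlist for Z = A'B + BC."""
    u2 = 1 - a                # inverter
    u3 = 1 - (u2 & b)         # NAND
    u4 = 1 - (b & c)          # NAND
    z = 1 - (u3 & u4)         # NAND
    return {"A": a, "B": b, "C": c, "U2": u2, "U3": u3, "U4": u4, "Z": z}

def toggle_coverage(vectors):
    """Return (fraction of nodes that toggled, sorted list of those nodes)."""
    seen = {}
    for v in vectors:
        for name, value in node_values(*v).items():
            seen.setdefault(name, set()).add(value)
    toggled = sorted(n for n, values in seen.items() if values == {0, 1})
    return len(toggled) / len(seen), toggled

# Each tuple is (A, B, C). With these two vectors node U3 never leaves '1'.
print(toggle_coverage([(0, 0, 0), (1, 1, 1)]))
# Adding a vector with A = 0, B = 1 drives U3 to '0' as well.
print(toggle_coverage([(0, 0, 0), (1, 1, 1), (0, 1, 0)]))
```

A node that never toggles cannot have both of its stuck-at faults detected, so low toggle coverage is an inexpensive early warning before a full fault simulation.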
14.4.5 Fault-Simulation Results

The output of a fault simulator separates faults into several fault categories. If we can detect a fault at a location, it is a testable fault. A testable fault must be placed on a controllable net, so that we can change the logic level at that location from '0' to '1' and from '1' to '0'. A testable fault must also be on an observable net, so that we can see the effect of the fault at a PO. This means that uncontrollable nets and unobservable nets result in faults we cannot detect. We call these faults untested faults, untestable faults, or impossible faults. If a PO of the good circuit is the opposite of that of the faulty circuit, we have a detected fault (sometimes called a hard-detected fault or a definitely detected fault). If the POs of the good circuit and faulty circuit are identical, we have an undetected fault. If a PO of the good circuit is a '1' or a '0' but the corresponding PO of the faulty circuit is an 'X' (unknown, either '0' or '1'), we have a possibly detected fault (also called a possible-detected fault, potential fault, or potentially detected fault). If the PO of the good circuit changes between a '1' and a '0' while the faulty circuit remains at 'X', then we have a soft-detected fault. Soft-detected faults are a subset of possibly detected faults. Some simulators keep track of these soft-detected faults separately. Soft-detected faults are likely to be detected on a real tester if this sequence occurs often. Most fault simulators allow you to set a fault-drop threshold so that the simulator will remove faults from further consideration after soft-detecting or possibly detecting them a specified number of times. This is called fault dropping (or fault discarding). The more often a fault is possibly detected, the more likely it is to be detected on a real tester. A redundant fault is a fault that makes no difference to the circuit operation. A combinational circuit with no such faults is irredundant. There are close links between logic-synthesis algorithms and redundancy. Logic-synthesis algorithms can produce combinational logic that is irredundant and 100 percent testable for single stuck-at faults by removing redundant logic as part of logic minimization. If a fault causes a circuit to oscillate, it is an oscillatory fault. Oscillation can occur within feedback loops in combinational circuits with zero-delay models. A fault that affects a larger than normal portion of the circuit is a hyperactive fault. Fault simulators have settings to prevent such faults from using excessive amounts of computation time. It is very annoying to run a fault simulation for several days only to discover that the entire time was taken up by simulating a single fault in an RS flip-flop or on the clock net, for example. Figure 14.15 shows some examples of fault categories.
FIGURE 14.15 Fault categories. (a) A detectable fault requires the ability to control and observe the fault origin. (b) A net that is fixed in value is uncontrollable and therefore will produce one undetected fault. (c) Any net that is unconnected is unobservable and will produce undetected faults. (d) A net that produces an unknown 'X' in the faulty circuit and a '1' or a '0' in the good circuit may be detected (depending on whether the 'X' is in fact a '0' or '1'), but we cannot say for sure. At some point this type of fault is likely to produce a discrepancy between good and bad circuits and will eventually be detected. (e) A redundant fault does not affect the operation of the good circuit. In this case the AND gate is redundant since AB + B' = A + B'.
14.4.6 Fault-Simulator Logic Systems

In addition to the way the fault simulator counts faults in various fault categories, the number of detected faults during fault simulation also depends on the logic system used by the fault simulator. As an example, Cadence's VeriFault concurrent fault simulator uses a logic system with the six logic values: '0', '1', 'Z', 'L', 'H', 'X'. Table 14.8 shows the results of comparing the faulty and the good circuit simulations. From Table 14.8 we can deduce that, in this logic system:
• Fault detection is possible only if the good circuit and the bad circuit both produce either a '1' or a '0'.
• If the good circuit produces a 'Z' at a three-state output, no faults can be detected (not even a fault on the three-state output).
• If the good circuit produces anything other than a '1' or '0', no faults can be detected.
A fault simulator assigns faults to each of the categories we have described. We define the fault coverage as:

fault coverage = detected faults / detectable faults. (14.2)

The number of detectable faults excludes any undetectable fault categories (untestable or redundant faults). Thus,

detectable faults = faults - undetectable faults, (14.3)
undetectable faults = untested faults + redundant faults. (14.4)

The fault simulator may also produce an analysis of fault grading. This is a graph, histogram, or tabular listing showing the cumulative fault coverage as a function of the number of test vectors. This information is useful to remove dead test cycles, which contain vectors that do not add to fault coverage. If you reinitialize the circuit at regular intervals, you can remove vectors up to an initialization without altering the function of any vectors after the initialization. The list of faults that the simulator inserted is the fault list. In addition to the fault list, a fault dictionary lists the faults with their corresponding primary outputs (the faulty output vector). The set of input vectors and faulty output vectors that uniquely identify a fault is the fault signature. This information can be useful to test engineers, allowing them to work backward from production test results and pinpoint the cause of a problem if several ASICs fail on the tester for the same reasons.

TABLE 14.8 The VeriFault concurrent fault simulator logic system. 1

                      Faulty circuit
                  0    1    Z    L    H    X
Good circuit  0   U    D    P    P    P    P
              1   D    U    P    P    P    P
              Z   U    U    U    U    U    U
              L   U    U    U    U    U    U
              H   U    U    U    U    U    U
              X   U    U    U    U    U    U
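The comparison rules of Table 14.8 and the coverage definitions (14.2) to (14.4) are compact enough to write down directly. The sketch below is only an illustration; the function names `classify` and `fault_coverage` are hypothetical and do not correspond to any simulator's interface.

```python
VALUES = ("0", "1", "Z", "L", "H", "X")

def classify(good, faulty):
    """Compare one PO value pair as in Table 14.8.

    Returns 'D' (detected), 'P' (potentially detected), or 'U' (undetected).
    """
    if good not in ("0", "1"):
        return "U"                       # good circuit must produce '0' or '1'
    if faulty in ("0", "1"):
        return "D" if faulty != good else "U"
    return "P"                           # faulty circuit is Z, L, H, or X

def fault_coverage(detected, total, untested, redundant):
    undetectable = untested + redundant  # equation (14.4)
    detectable = total - undetectable    # equation (14.3)
    return detected / detectable         # equation (14.2)

# Regenerate the table: one row per good-circuit value.
for g in VALUES:
    print(g, " ".join(classify(g, f) for f in VALUES))

# Eight collapsed faults, all detected, none untested or redundant.
print(fault_coverage(detected=8, total=8, untested=0, redundant=0))
```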
14.4.7 Hardware Acceleration

Simulation engines or hardware accelerators use computer architectures that are tuned to fault-simulation algorithms. These special computers allow you to add multiple simulation boards in one chassis. Since each board is essentially a workstation produced in relatively low volume and there are between 2 and 10 boards in one accelerator, these machines are between one and two orders of magnitude more expensive than a workstation. There are two ways to use multiple boards for fault simulation. One method runs good circuits on each board in parallel with the same stimulus and generates faulty circuits concurrently with other boards. The acceleration factor is less than the number of boards because of overhead. This method is usually faster than distributing a good circuit across multiple boards. Some fault simulators allow you to use multiple circuits across multiple machines on a network in distributed fault simulation .
Fault   Type   Vectors (hex)    Good output      Bad output
F1      SA1    3                0                1
F2      SA1    0, 4             0, 0             1, 1
F3      SA1    4, 5             0, 0             1, 1
F4      SA1    3                0                1
F5      SA1    2                1                0
F6      SA1    7                1                0
F7      SA1    0, 1, 3, 4, 5    0, 0, 0, 0, 0    1, 1, 1, 1, 1
F8      SA0    2, 6, 7          1, 1, 1          0, 0, 0

Test vector format: 3 = 011, so that CBA = 011: C = '0', B = '1', A = '1'
FIGURE 14.16 Fault simulation of A'B + BC. The simulation results for fault F1 (U2 output stuck at 1) with test vector value hex 3 (shown in bold in the table) are shown on the LogicWorks schematic. Notice that the output of U2 is 0 in the good circuit and stuck at 1 in the bad circuit.
14.4.8 A Fault-Simulation Example

Figure 14.16 illustrates fault simulation using the circuit of Figure 14.14. We have used all possible inputs as a test vector set in the following order: {000, 001, 010, 011, 100, 101, 110, 111}. There are eight collapsed SSFs in this circuit, F1-F8. Since the good circuit is irredundant, we have 100 percent fault coverage. The following fault-simulation results were derived from a logic simulator rather than a fault simulator, but are presented in the same format as output from an automated test system.

Total number of faults: 22
Number of faults in collapsed fault list: 8

Test Vector   Faults detected   Coverage/%   Cumulative/%
-----------   ---------------   ----------   ------------
000           F2, F7            25.0         25.0
001           F7                12.5         25.0
010           F5, F8            25.0         50.0
011           F1, F4, F7        37.5         75.0
100           F2, F3, F7        37.5         87.5
101           F3, F7            25.0         87.5
110           F8                12.5         87.5
111           F6, F8            25.0         100.0

Total number of vectors : 8

                 Noncollapsed   Collapsed
Fault counts:
  Detected       16             8
  Untested       0              0
                 ------         ------
  Detectable     16             8
  Redundant      0              0
  Tied           0              0

FAULT COVERAGE   100.00 %       100.00 %

Fault simulation tells us that, in this order, we need to apply all eight test vectors to achieve full fault coverage, because only the last vector, {111}, detects F6. The highest-quality test vectors are {011} and {100}. For example, test vector {011} detects three faults (F1, F4, and F7) out of eight. This means if we were to reduce the test set to just {011} the fault coverage would be 3/8, or 37.5 percent. Proceeding in this fashion we reorder the test vectors in terms of their contribution to cumulative test coverage as follows: {011, 100, 010, 111, 000, 001, 101, 110}. This is a hard problem for large numbers of test vectors because of the interdependencies between the faults detected by the different vectors. Repeating the fault simulation gives the following fault grading:

Test Vector   Faults detected   Coverage/%   Cumulative/%
-----------   ---------------   ----------   ------------
011           F1, F4, F7        37.5         37.5
100           F2, F3, F7        37.5         62.5
010           F5, F8            25.0         87.5
111           F6, F8            25.0         100.0
000           F2, F7            25.0         100.0
001           F7                12.5         100.0
101           F3, F7            25.0         100.0
110           F8                12.5         100.0

Now, instead of using eight test vectors, we need only apply the first four vectors from this set to achieve 100 percent fault coverage, cutting the expensive production test time in half. Reducing the number of test vectors in this fashion is called test-vector compression or test-vector compaction.

The fault signatures for faults F1-F8 for the last test sequence, {011, 100, 010, 111, 000, 001, 101, 110}, are as follows:

#    fail       good       bad
--   --------   --------   --------
F1   10000000   00110001   10110001
F2   01001000   00110001   01111001
F3   01000010   00110001   01110011
F4   10000000   00110001   10110001
F5   00100000   00110001   00010001
F6   00010000   00110001   00100001
F7   11001110   00110001   11111111
F8   00110001   00110001   00000000
The first pattern for each fault indicates which test vectors will fail on the tester (we say a test vector fails when it successfully detects a faulty circuit during a production test). Thus, for fault F1, pattern '10000000' indicates that only the first test vector will fail if fault F1 is present. The second and third patterns for each fault are the POs of the good and bad circuits for each test vector. Since we only have one PO in our simple example, these patterns do not help further distinguish between faults. Notice that, as far as an external view is concerned, faults F1 and F4 have identical fault signatures and are therefore indistinguishable. Faults F1 and F4 are said to be structurally equivalent. In general, we cannot detect structural equivalence by looking at the circuit. If we apply only the first four test vectors, then faults F2 and F3 also have identical fault signatures. Fault signatures are only useful in diagnosing fault locations if we have one, or a very few, faults. Not all fault simulators give all the information we have described. Most fault simulators drop hard-detected faults from consideration once they are detected to increase the speed of simulation. With dropped hard-detected faults we cannot independently grade each vector and we cannot construct a fault dictionary. This is the reason we used a logic simulator to generate the preceding results.
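The fault grading, the fail patterns, and a simple form of test-vector compaction can all be derived from the per-fault detection data of Figure 14.16. The sketch below is an illustration only: the greedy selection in `compact` is an assumed heuristic (take the vector that detects the most still-undetected faults) and is not necessarily the procedure used to reorder the vectors in the text.

```python
# Vectors (as CBA bit strings) that detect each collapsed fault, from Figure 14.16.
DETECTS = {
    "F1": {"011"}, "F2": {"000", "100"}, "F3": {"100", "101"}, "F4": {"011"},
    "F5": {"010"}, "F6": {"111"},
    "F7": {"000", "001", "011", "100", "101"},
    "F8": {"010", "110", "111"},
}

def grade(sequence):
    """Cumulative fault coverage (percent) after each vector in the sequence."""
    found, cumulative = set(), []
    for v in sequence:
        found |= {f for f, vectors in DETECTS.items() if v in vectors}
        cumulative.append(100.0 * len(found) / len(DETECTS))
    return cumulative

def fail_pattern(fault, sequence):
    """One character per vector: '1' if the vector fails when the fault is present."""
    return "".join("1" if v in DETECTS[fault] else "0" for v in sequence)

def compact(vectors):
    """Greedy compaction: keep picking the vector that detects most new faults."""
    remaining, chosen = set(DETECTS), []
    while remaining:
        best = max(vectors, key=lambda v: sum(v in DETECTS[f] for f in remaining))
        gained = {f for f in remaining if best in DETECTS[f]}
        if not gained:
            break                      # the rest are undetectable by these vectors
        chosen.append(best)
        remaining -= gained
    return chosen

REORDERED = ["011", "100", "010", "111", "000", "001", "101", "110"]
print(grade(REORDERED))
print(fail_pattern("F7", REORDERED))
print(compact([format(i, "03b") for i in range(8)]))
```

Vectors with equal gain can be taken in any order, so a greedy pass may return the same four vectors in a different sequence from the one listed above.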
14.4.9 Fault Simulation in an ASIC Design Flow

At the beginning of this section we dodged the issue of test-vector generation. It is possible to automatically generate test vectors and test programs (with certain restrictions), and we shall discuss these methods in Section 14.5. A by-product of some of these automated systems is a measure of fault coverage. However, fault simulation is still used for the following reasons:
• Test-generation software is expensive, and many designers still create test programs manually and then grade the test vectors using fault simulation.
• Automatic test programs are not yet at the stage where fault simulation can be completely omitted in an ASIC design flow. Usually we need fault simulation to add some vectors to test logic not covered automatically, to check that test logic has been inserted correctly, or to understand and correct fault coverage problems.
• It is far too expensive to use a production tester to debug a production test. One use of a fault simulator is to perform this function off line.
• The reuse and automatic generation of large cells is essential to decrease the complexity of large ASIC designs. Megacells and embedded blocks (an embedded microcontroller, for example) are normally provided with canned test vectors that have already been fault simulated and fault graded. The megacell has to be isolated during test to apply these vectors and measure the response. Cell compilers for RAM, ROM, multipliers, and other regular structures may also generate test vectors. Fault simulation is one way to check that the various embedded blocks and their vectors have been correctly glued together with the rest of the ASIC to produce a complete set of test vectors and a test program.
• Production testers are very expensive. There is a trend away from the use of test vectors to include more of the test function on an ASIC. Some internal test logic structures generate test vectors in a random or pseudorandom fashion. For these structures there is no known way to generate the fault coverage. For these types of test structures we will need some type of fault simulation to measure fault coverage and estimate defect levels.
1. L = 0 or Z; H = 1 or Z; Z = high impedance; X = unknown; D = detected; P = potentially detected; U = undetected.
14.5 Automatic Test-Pattern Generation

In this section we shall describe a widely used algorithm, PODEM, for automatic test-pattern generation ( ATPG ) or automatic test-vector generation ( ATVG ). Before we can explain the PODEM algorithm we need to develop a shorthand notation and explain some terms and definitions using a simpler ATPG algorithm.
FIGURE 14.17 The D-calculus. (a) We need a way to represent the behavior of the good circuit and the bad circuit at the same time. (b) The composite logic value D (for detect) represents a logic '1' in the good circuit and a logic '0' in the bad circuit. We can also write this as D = 1/0. (c) The logic behavior of simple logic cells using the D-calculus. Composite logic values can propagate through simple logic gates if the other inputs are set to their enabling values.
14.5.1 The D-Calculus

Figure 14.17(a) and (b) shows a shorthand notation, the D-calculus, for tracing faults. The D-calculus was developed by Roth [1966] together with an ATPG algorithm, the D-algorithm. The symbol D (for detect) indicates the value of a node is a logic '1' in the good circuit and a logic '0' in the bad circuit. We can also write this as D = 1/0. In general we write g/b, a composite logic value, to indicate a node value in the good circuit is g and b in the bad circuit (by convention we always write the good circuit value first and the faulty circuit value second). The complement of D is D̄ = 0/1 (D̄ is rarely written as D' since D̄ is a logic value just like '1' and '0'). Notice that D̄ does not mean not detected, but simply that we see a '0' in the good circuit and a '1' in the bad circuit. We can apply Boolean algebra to the composite logic values D and D̄ as shown in Figure 14.17(c). The composite values 1/1 and 0/0 are equivalent to '1' and '0' respectively. We use the unknown logic value 'X' to represent a logic value that is one of '0', '1', D, or D̄, but we do not know or care which. If we wish to propagate a signal from one or more inputs of a logic cell to the logic cell output, we set the remaining inputs of that logic cell to what we call the enabling value. The enabling value is '1' for AND and NAND gates and '0' for OR and NOR gates. Figure 14.17(c) illustrates the use of enabling values. In contrast, setting at least one input of a logic gate to the controlling value, the opposite of the enabling value for that gate, forces or justifies the output node of that logic gate to a fixed value. The controlling value of '0' for an AND gate justifies the output to '0' and for a NAND gate justifies the output to '1'. The controlling value of '1' justifies the output of an OR gate to '1' and justifies the output of a NOR gate to '0'. To find controlling and enabling values for more complex logic cells, such as AOI and OAI logic cells, we can use their simpler AND, OR, NAND, and NOR gate representations.
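One way to experiment with composite logic values is to store each one as a (good, bad) pair and apply ordinary Boolean operations to the two halves separately. The sketch below follows the D = 1/0 convention of Figure 14.17(b); the tuple encoding and the function names are assumptions made for illustration, and the unknown value 'X' is left out to keep the example short.

```python
# Composite values as (good circuit, bad circuit) pairs.
ZERO, ONE = (0, 0), (1, 1)
D, D_BAR = (1, 0), (0, 1)        # D = 1/0 and its complement 0/1
NAMES = {ZERO: "0", ONE: "1", D: "D", D_BAR: "D_bar"}

def and_(x, y):
    return (x[0] & y[0], x[1] & y[1])

def or_(x, y):
    return (x[0] | y[0], x[1] | y[1])

def not_(x):
    return (1 - x[0], 1 - x[1])

def nand_(x, y):
    return not_(and_(x, y))

def nor_(x, y):
    return not_(or_(x, y))

# Enabling values let the fault effect through the gate ...
print(NAMES[and_(D, ONE)], NAMES[or_(D, ZERO)], NAMES[nand_(D, ONE)])
# ... while controlling values justify the output to a fixed value.
print(NAMES[and_(D, ZERO)], NAMES[or_(D, ONE)], NAMES[nand_(D, ZERO)])
# D and its complement meeting at one gate cancel the fault effect.
print(NAMES[and_(D, D_BAR)], NAMES[or_(D, D_BAR)])
```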
FIGURE 14.18 A basic ATPG (automatic test-pattern generation) algorithm for A'B + BC. (a) We activate a fault, U2.ZN stuck at 1, by setting the pin or node to '0', the opposite value of the fault. (b) We work backward from the fault origin to the PIs (primary inputs) by recursively justifying signals at the output of logic cells. (c) We then work forward from the fault origin to a PO (primary output), setting inputs to gates on a sensitized path to their enabling values. We propagate the fault until the D-frontier reaches a PO. (d) We then work backward from the PO to the PIs recursively justifying outputs to generate the sensitized path. This simple algorithm always works, providing signals do not branch out and then rejoin again.
14.5.2 A Basic ATPG Algorithm

A basic algorithm to generate test vectors automatically is shown in Figure 14.18. We detect a fault by first activating (or exciting) the fault. To do this we must drive the faulty node to the opposite value of the fault. Figure 14.18(a) shows a stuck-at-1 fault at the output pin, ZN, of the inverter U2 (we call this fault U2.ZN.SA1). To create a test for U2.ZN.SA1 we have to find the values of the PIs that will justify node U2.ZN to '0'. We work backward from node U2.ZN justifying each logic gate output until we reach a PI. In this case we only have to justify U2.ZN to '0', and this is easily done by setting the PI A = '1'. Next we work forward from the fault origin and sensitize a path to a PO (there is only one PO in this example). This propagates the fault effect to the PO so that it may be observed. To propagate the fault effect to the PO Z, we set U3.A2 = '1' and then U5.A2 = '1'. We can visualize fault propagation by supposing that we set all nodes in a circuit to unknown, 'X'. Then, as we successively propagate the fault effect toward the POs, we can imagine a wave of Ds and D̄s, called the D-frontier, that propagates from the fault origin toward the POs. As a value of D or D̄ reaches the inputs of a logic cell whose other inputs are 'X', we add that logic cell to the D-frontier. Then we find values for the other inputs to propagate the D-frontier through the logic cell to continue the process. This basic algorithm of justifying and then propagating a fault works when we can justify nodes without interference from other nodes. This algorithm breaks down when we have reconvergent fanout. Figure 14.19(a) shows another example of justifying and propagating a fault in a circuit with reconvergent fanout. For direct comparison Figure 14.19(b) shows an irredundant circuit, similar to part (a), except the fault signal, B stuck at 1, branches and then reconverges at the inputs to gate U5.
The reconvergent fanout in this new circuit breaks our basic algorithm. We now have two sensitized paths that propagate the fault effect to U5. These paths combine to produce a constant '1' at Z, the PO. We have a multipath sensitization problem.
FIGURE 14.19 Reconvergent fanout. (a) Signal B branches and then reconverges at logic gate U5, but the fault U4.A1 stuck at 1 can still be excited and a path sensitized using the basic algorithm of Figure 14.18 . (b) Fault B stuck at 1 branches and then reconverges at gate U5. When we enable the inputs to both gates U3 and U4 we create two sensitized paths that prevent the fault from propagating to the PO (primary output). We can solve this problem by changing A to '0', but this breaks the rules of the algorithm illustrated in Figure 14.18 . The PODEM algorithm solves this problem.
14.5.3 The PODEM Algorithm

The path-oriented decision making (PODEM) algorithm solves the problem of reconvergent fanout and allows multipath sensitization [Goel, 1981]. The method is similar to the basic algorithm we have already described except PODEM will retry a step, reversing an incorrect decision. There are four basic steps that we label: objective, backtrace, implication, and D-frontier. These steps are as follows:
1. Pick an objective to set a node to a value. Start with the fault origin as an objective and all other nodes set to 'X'.
2. Backtrace to a PI and set it to a value that will help meet the objective.
3. Simulate the network to calculate the effect of fixing the value of the PI (this step is called implication). If there is no possibility of sensitizing a path to a PO, then retry by reversing the value of the PI that was set in step 2 and simulate again.
4. Update the D-frontier and return to step 1. Stop if the D-frontier reaches a PO.
Figure 14.20 shows an example that uses the following iterations of the four steps in the PODEM algorithm:
1. We start with activation of the fault as our objective, U3.A2 = '0'. We backtrace to J. We set J = '1'. Since K is still 'X', implication gives us no further information. We have no D-frontier to update.
2. The objective is unchanged, but this time we backtrace to K. We set K = '1'. Implication gives us U2.ZN = '1' (since now J = '1' and K = '1') and therefore U7.ZN = '1'. We still have no D-frontier to update.
3. We set U3.A1 = '1' as our objective in order to propagate the fault through U3. We backtrace to M. We set M = '1'. Implication gives us U2.ZN = '1' and U3.ZN = D. We update the D-frontier to reflect that U4.A2 = D and U6.A1 = D, so the D-frontier is U4 and U6.
4. We pick U6.A2 = '1' as an objective in order to propagate the fault through U6. We backtrace to N. We set N = '1'. Implication gives us U6.ZN = D̄. We update the D-frontier to reflect that U4.A2 = D and U8.A1 = D̄, so the D-frontier is U4 and U8.
5. We pick U8.A1 = '1' as an objective in order to propagate the fault through U8. We backtrace to L. We set L = '0'. Implication gives us U5.ZN = '0' and therefore U8.ZN = '1' (this node is Z, the PO). There is then no possible sensitized path to the PO Z. We must have made an incorrect decision, so we retry and set L = '1'. Implication now gives us U8.ZN = D and we have propagated the D-frontier to a PO.
Iteration   Objective    Backtrace 1   Implication   D-frontier
1           U3.A2 = 0    J = 1
2           U3.A2 = 0    K = 1         U7.ZN = 1
3           U3.A1 = 1    M = 1         U3.ZN = D     U4, U6
4           U6.A2 = 1    N = 1         U6.ZN = D̄     U4, U8
5a          U8.A1 = 1    L = 0         U8.ZN = 1     U4, U8
5b          Retry        L = 1         U8.ZN = D

1. Backtrace is not the same as retry or backtrack.
FIGURE 14.20 The PODEM (path-oriented decision making) algorithm.

We can see that the PODEM algorithm proceeds in two phases. In the first phase, iterations 1 and 2 in Figure 14.20, the objective is fixed in order to activate the fault. In the second phase, iterations 3-5, the objective changes in order to propagate the fault. In step 3 of the PODEM algorithm there must be at least one path containing unknown values between the gates of the D-frontier and a PO in order to be able to complete a sensitized path to a PO. This is called the X-path check. You may wonder why there has been no explanation of the backtrace mechanism or how to decide a value for a PI in step 2 of the PODEM algorithm. The decision tree shown in Figure 14.20 shows that it does not matter. PODEM conducts an implicit binary search over all the PIs. If we make an incorrect decision and assign the wrong value to a PI at some step, we will simply need to retry that step. Texts, programs, and articles use the term backtrace as we have described it, but then most use the term backtrack to describe what we have called a retry, which can be confusing. I also did not explain how to choose the objective in step 1 of the PODEM algorithm. The initial objective is to activate the fault. Subsequently we select a logic gate from the D-frontier and set one of its inputs to the enabling value in an attempt to propagate the fault. We can use intelligent procedures, based on controllability and observability, to guide PODEM and reduce the number of incorrect decisions. PODEM is a development of the D-algorithm, and there are several other ATPG algorithms that are developments of PODEM. One of these is FAN (fanout-oriented test generation), which removes the need to backtrace all the way to a PI, reducing the search time [Fujiwara and Shimono, 1983; Schulz, Trischler, and Sarfert, 1988]. Algorithms based on the D-algorithm, PODEM, and FAN are the basis of many commercial ATPG systems.
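The implicit binary search over the PIs can be sketched compactly. The code below is a deliberately simplified illustration in the spirit of PODEM, not the published algorithm: it picks the next unassigned PI in a fixed order instead of backtracing from an objective, and it replaces the X-path check with a cruder test (give up on a branch once the PO is known and equal in both circuits, or once the fault can no longer be activated). The netlist for Z = A'B + BC and its node names are assumptions.

```python
X = None                                    # unknown logic value

def inv(a):
    return X if a is X else 1 - a

def nand(a, b):
    if a == 0 or b == 0:
        return 1                            # a controlling '0' fixes the output
    if a == 1 and b == 1:
        return 0
    return X

# Hypothetical netlist: (output node, function, input nodes), in evaluation order.
GATES = [("U2", inv, ["A"]), ("U3", nand, ["U2", "B"]),
         ("U4", nand, ["B", "C"]), ("Z", nand, ["U3", "U4"])]
PIS, PO = ["A", "B", "C"], "Z"

def simulate(pis, fault=None):
    """Three-valued implication; fault is (node, stuck_value) or None."""
    v = dict(pis)
    if fault and fault[0] in v:
        v[fault[0]] = fault[1]
    for name, fn, ins in GATES:
        v[name] = fn(*(v[i] for i in ins))
        if fault and fault[0] == name:
            v[name] = fault[1]
    return v

def podem(fault, assign=None):
    """Return a PI assignment (X = don't care) that detects the fault, or None."""
    assign = assign or {p: X for p in PIS}
    good, bad = simulate(assign), simulate(assign, fault)
    zg, zb = good[PO], bad[PO]
    if zg is not X and zb is not X and zg != zb:
        return assign                       # fault effect has reached the PO
    if good[fault[0]] == fault[1]:
        return None                         # fault can no longer be activated
    if zg is not X and zb is not X:
        return None                         # PO fixed and equal: no path left
    free = [p for p in PIS if assign[p] is X]
    if not free:
        return None
    for value in (0, 1):                    # try a value, then retry with the other
        result = podem(fault, {**assign, free[0]: value})
        if result:
            return result
    return None

print(podem(("B", 1)))     # stem fault with reconvergent fanout
print(podem(("U4", 1)))    # needs several retries before a test is found
```

Because every PI is tried with both values before a branch is abandoned, the search is complete for this small circuit; the heuristics described in the text only change how quickly a test is found.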
14.5.4 Controllability and Observability

In order for an ATPG system to provide a test for a fault on a node it must be possible to both control and observe the behavior of the node. There are both theoretical and practical issues involved in making sure that a design does not contain buried circuits that are impossible to observe and control. A software program that measures the controllability (with three l's) and observability of nodes in a circuit is useful in conjunction with ATPG software. There are several different measures for controllability and observability [Butler and Mercer, 1992]. We shall describe one of the first such systems called SCOAP (Sandia Controllability/Observability Analysis Program) [Goldstein, 1979]. These measures are also used by ATPG algorithms.

Combinational controllability is defined separately from sequential controllability. We also separate zero-controllability and one-controllability. For example, the combinational zero-controllability for a two-input AND gate, Y = AND(X_1, X_2), is recursively defined in terms of the input controllability values as follows:

\[ \mathrm{CC0}(Y) = \min\{ \mathrm{CC0}(X_1), \mathrm{CC0}(X_2) \} + 1 . \tag{14.5} \]

We choose the minimum value of the two input controllability values to reflect the fact that we can justify the output of an AND gate to '0' by setting any input to the control value of '0'. We then add one to this value to reflect the fact that we have passed through an additional level of logic. Incrementing the controllability measures for each level of logic represents a measure of the logic distance between two nodes. We define the combinational one-controllability for a two-input AND gate as

\[ \mathrm{CC1}(Y) = \mathrm{CC1}(X_1) + \mathrm{CC1}(X_2) + 1 . \tag{14.6} \]

This equation reflects the fact that we need to set all inputs of an AND gate to the enabling value of '1' to justify a '1' at the output. Figure 14.21(a) illustrates these definitions.
FIGURE 14.21 Controllability measures. (a) Definition of combinational zero-controllability, CC0, and combinational one-controllability, CC1, for a two-input AND gate. (b) Examples of controllability calculations for simple gates, showing intermediate steps. (c) Controllability in a combinational circuit.

An inverter, Y = NOT(X), reverses the controllability values:

\[ \mathrm{CC1}(Y) = \mathrm{CC0}(X) + 1 \quad \text{and} \quad \mathrm{CC0}(Y) = \mathrm{CC1}(X) + 1 . \tag{14.7} \]

Since we can construct all other logic cells from combinations of two-input AND gates and inverters we can use Eqs. 14.5 to 14.7 to derive their controllability equations. When we do this we only increment the controllability by one for each primitive gate. Thus for a three-input NAND with an inverting input, Y = NAND(X_1, X_2, NOT(X_3)):

\[ \mathrm{CC0}(Y) = \mathrm{CC1}(X_1) + \mathrm{CC1}(X_2) + \mathrm{CC0}(X_3) + 1 , \qquad \mathrm{CC1}(Y) = \min\{ \mathrm{CC0}(X_1), \mathrm{CC0}(X_2), \mathrm{CC1}(X_3) \} + 1 . \tag{14.8} \]

For a two-input OR, Y = OR(X_1, X_2) = NOT(AND(NOT(X_1), NOT(X_2))):

\[ \mathrm{CC1}(Y) = \min\{ \mathrm{CC1}(X_1), \mathrm{CC1}(X_2) \} + 1 , \qquad \mathrm{CC0}(Y) = \mathrm{CC0}(X_1) + \mathrm{CC0}(X_2) + 1 . \tag{14.9} \]
Figure 14.21(b) shows examples of controllability calculations. A bubble on a logic gate at the input or output swaps the values of CC1 and CC0. Figure 14.21(c) shows how controllability values for a combinational circuit are calculated by working forward from each PI, which is defined to have a controllability of one.

We define observability in terms of the controllability measures. The combinational observability, OC(X_1), of input X_1 of a two-input AND gate can be expressed in terms of the controllability of the other input, CC1(X_2), and the combinational observability of the output, OC(Y):

\[ \mathrm{OC}(X_1) = \mathrm{CC1}(X_2) + \mathrm{OC}(Y) + 1 . \tag{14.10} \]

If a node X_1 branches (has fanout) to nodes X_2 and X_3 we choose the most observable of the branches:

\[ \mathrm{OC}(X_1) = \min\{ \mathrm{OC}(X_2), \mathrm{OC}(X_3) \} . \tag{14.11} \]

Figure 14.22(a) and (b) show the definitions of observability. Figure 14.22(c) illustrates calculation of observability at a three-input NAND; notice we sum the CC1 values for the other inputs (since the enabling value for a NAND gate is one, the same as for an AND gate). Figure 14.22(d) shows the calculation of observability working back from the PO which, by definition, has an observability of zero.
FIGURE 14.22 Observability measures. (a) The combinational observability, OC(X 1 ), of an input, X 1 , to a two-input AND gate defined in terms of the controllability of the other input and the observability of the output. (b) The observability of a fanout node is equal to the observability of the most observable branch. (c) Example of an observability calculation at a three-input NAND gate. (d) The observability of a combinational network can be calculated from the controllability measures, CC0:CC1. The observability of a PO (primary output) is defined to be zero. Sequential controllability and observability can be measured using similar equations to the combinational measures except that in the sequential measures (SC1, SC0, and
OS) we measure logic distance in terms of the layers of sequential logic, not the layers of combinational logic.
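As a rough illustration of the combinational measures, the following Python sketch applies Eqs. 14.5 to 14.11 to a small NAND/inverter network computing Z = A'B + BC (the same function used for the Threegates core later in this chapter). The function and variable names are hypothetical, and the rule that an inverter adds one to the observability of its output is an assumption that is not stated in the text.

```python
# Each signal carries a (CC0, CC1) pair; primary inputs are defined as (1, 1).
PI = (1, 1)

def cc_not(x):
    """Inverter, Eq. 14.7: swap the two measures and add one level."""
    cc0, cc1 = x
    return (cc1 + 1, cc0 + 1)

def cc_nand(*ins):
    """NAND: output '0' needs every input at '1'; output '1' needs any input at '0'."""
    cc0 = sum(i[1] for i in ins) + 1
    cc1 = min(i[0] for i in ins) + 1
    return (cc0, cc1)

def oc_input(oc_out, other_inputs):
    """Eq. 14.10 for AND/NAND: sum CC1 of the other inputs, add OC of the output, add one."""
    return sum(o[1] for o in other_inputs) + oc_out + 1

# Controllability, working forward from the primary inputs.
a0 = a1 = a2 = PI
u2 = cc_not(a0)           # A'
u3 = cc_nand(u2, a1)      # NAND(A', B)
u4 = cc_nand(a1, a2)      # NAND(B, C)
outp = cc_nand(u3, u4)    # Z

# Observability, working back from the primary output (OC = 0 by definition).
oc_outp = 0
oc_u3 = oc_input(oc_outp, [u4])
oc_u4 = oc_input(oc_outp, [u3])
oc_u2 = oc_input(oc_u3, [a1])
oc_a0 = oc_u2 + 1         # assumption: an inverter adds one level of logic
oc_a1 = min(oc_input(oc_u3, [u2]), oc_input(oc_u4, [a2]))  # fanout stem, Eq. 14.11
oc_a2 = oc_input(oc_u4, [a1])

print("CC0:CC1 at Z =", outp, " OC(a0, a1, a2) =", (oc_a0, oc_a1, oc_a2))
```

Larger values indicate nodes that are harder to control or observe; here the inverted input is the least observable node.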
14.6 Scan Test

Sequential logic poses a very difficult ATPG problem. Consider the example of a 32-bit counter with a final carry. If the designer included a reset, we have to clock the counter 2^32 (approximately 4 × 10^9) times to check the carry logic. Using a 1 MHz tester clock this requires 4 × 10^3 seconds, about 1 hour, or (at approximately $0.25 per second) $1,000 of tester time. Consider a 16-bit state machine implemented using a one-hot state register with 16 D flip-flops. If the designer did not include a reset we have a very complicated initialization problem. A sequential ATPG algorithm must consider over 2000 states when constructing sequential test vectors. In an ad hoc approach to testing we could construct special reset circuits or create manual test vectors to deal with these special situations, one at a time, as they arise. Instead we can take a structured test approach (also called design for test, though this term covers a wider field). We can automatically generate test vectors for combinational logic, but ATPG is much harder for sequential logic. Therefore the most common sequential structured test approach converts sequential logic to combinational logic. In full-scan design we replace every sequential element with a scan flip-flop. The result is an internal form of boundary scan and, if we wish, we can use the IEEE 1149.1 TAP to access (and the boundary-scan controller to control) an internal-scan chain. Table 14.9 shows a VHDL model and schematic symbols for a scan flip-flop. There is an area and performance penalty to pay for scan design. The scan MUX adds the delay of a 2:1 MUX to the setup time of the flip-flop; this adds directly to the critical-path delay. The 2:1 MUX and any separate driver for the scan output also add approximately 10 percent to the area of the flip-flop (depending on the features present in the original flip-flop). The scan chain must also be routed, and this
complicates physical design and adds to the interconnect area. In ASIC design the benefits of eliminating complex sequential ATPG and the addition of observability and controllability usually outweigh these disadvantages. TABLE 14.9 Scan flip-flop.
library IEEE; use IEEE.STD_LOGIC_1164.all;
entity DFFSCAN is
  generic (reset_value : STD_LOGIC := '0');
  port (
    Q : out STD_LOGIC; D, CLK, RST : in STD_LOGIC;
    SCOUT : out STD_LOGIC; SCIN, SCEN : in STD_LOGIC );
end DFFSCAN;
architecture behave of DFFSCAN is
  signal RST_IN, CLK_IN, SCEN_IN, SCIN_IN, D_IN : STD_LOGIC;
begin
  -- strip drive strengths so that only '0', '1', and 'X' remain
  RST_IN <= to_X01(RST); CLK_IN <= to_X01(CLK);
  SCEN_IN <= to_X01(SCEN); SCIN_IN <= to_X01(SCIN); D_IN <= to_X01(D);
  DFSCAN : process (CLK_IN, RST_IN) begin
    if RST_IN = '0' then                 -- active-low asynchronous reset
      Q <= reset_value; SCOUT <= reset_value;
    elsif RST_IN = '1' and rising_edge(CLK_IN) then
      if SCEN_IN = '1' then              -- scan mode: load the scan input
        Q <= SCIN_IN; SCOUT <= SCIN_IN;
      elsif SCEN_IN = '0' then           -- normal mode: load the data input
        Q <= D_IN; SCOUT <= D_IN;
      else
        Q <= 'X'; SCOUT <= 'X';
      end if;
    elsif RST_IN = 'X' or CLK_IN = 'X' or SCEN_IN = 'X' then
      Q <= 'X'; SCOUT <= 'X';
    end if;
  end process DFSCAN;
end behave;
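The behavior described by Table 14.9 can be sketched in a few lines of Python to show how a chain of scan flip-flops shifts a pattern in and then captures data. This is only an illustrative two-valued model: unknown ('X') values are not modeled, and the helper names (dffscan_next, clock_chain) are hypothetical.

```python
def dffscan_next(d, scin, scen, rst=1, reset_value=0):
    """Value of Q (and SCOUT) after a rising clock edge; all arguments are 0 or 1."""
    if rst == 0:                 # active-low reset overrides the clock
        return reset_value
    return scin if scen == 1 else d

def clock_chain(state, d_inputs, scan_in, scen):
    """Clock every flip-flop in a scan chain once.

    state[0] is the flip-flop nearest the scan input; each SCOUT drives the
    SCIN of the next flip-flop in the chain.
    """
    scins = [scan_in] + state[:-1]
    return [dffscan_next(d, s, scen) for d, s in zip(d_inputs, scins)]

# Scan mode (SCEN = 1): shift the serial pattern 1, 1, 0 into a 3-bit chain.
state = [0, 0, 0]
for bit in [1, 1, 0]:
    state = clock_chain(state, [0, 0, 0], bit, scen=1)
shifted = list(state)            # the first bit shifted in is now at the far end

# Normal mode (SCEN = 0): one clock captures the D inputs instead.
captured = clock_chain(state, [1, 0, 0], scan_in=0, scen=0)
print(shifted, captured)
```

Shifting the captured values out again (with SCEN = 1) is what lets a tester observe the flip-flop inputs as pseudoprimary outputs.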
The highly structured nature of full scan allows test software (usually called a test compiler ) to perform automatic scan insertion . Using scan design we turn the output of each flip-flop into a pseudoprimary input and the input to each flip-flop into a pseudoprimary output . ATPG software can then generate test vectors for the combinational logic between scan flip-flops. There are other approaches to scan design. In partial scan we replace a subset of the sequential elements with scan flip-flops. We can choose this subset using heuristic procedures to allow the remaining sequential logic to be tested using sequential ATPG techniques. In destructive scan we remove the values at the outputs of the flip-flops during the scan process (this is the usual form of scan design). In nondestructive scan we keep the flip-flop outputs intact so that we can shift out the scan chain and then resume where we left off. Level-sensitive scan design ( LSSD ) is a form of scan design developed at IBM that uses separate clock phases to drive scan elements. We shall describe scan design, automated scan insertion, and test-program generation with several examples. First, though, we describe another important structured-test technique.
14.7 Built-in Self-test

The trend to include more test logic on an ASIC has already been mentioned. Built-in self-test ( BIST ) is a set of structured-test techniques for combinational and sequential logic, memories, multipliers, and other embedded logic blocks. In each case the principle is to generate test vectors, apply them to the circuit under test ( CUT ) or device under test ( DUT ), and then check the response.
14.7.1 LFSR
Figure 14.23 shows a linear feedback shift register (LFSR). The exclusive-OR gates and shift register act to produce a pseudorandom binary sequence (PRBS) at each of the flip-flop outputs. By correctly choosing the points at which we take the feedback from an n-bit shift register (see Section 14.7.5), we can produce a PRBS of length \(2^n - 1\), a maximal-length sequence that includes all possible patterns (or vectors) of n bits, excluding the all-zeros pattern.
FIGURE 14.23 A linear feedback shift register (LFSR). A 3-bit maximal-length LFSR produces a repeating string of seven pseudorandom binary numbers: 7, 3, 1, 4, 2, 5, 6.
Table 14.10 shows the maximal-length sequence, with length \(2^3 - 1 = 7\), for the 3-bit LFSR shown in Figure 14.23. Notice that the first (clock tick 1) and last rows (clock tick 8) are identical. Rows following the seventh row repeat rows 1 to 7, so that the length of this 3-bit LFSR sequence is \(7 = 2^3 - 1\), the maximal length. The shaded regions show how bits are shifted from one clock cycle to the next. We assume the register is initialized to the all-ones state, but any initial state will work and produce the same PRBS, as long as the initial state is not all zeros (in which case the LFSR will stay stuck at all zeros).

TABLE 14.10 LFSR example of Figure 14.23.
Update rules: Q0(t+1) = Q1(t) ⊕ Q2(t); Q1(t+1) = Q0(t); Q2(t+1) = Q1(t).

Clock tick, t    Q0    Q1    Q2    Q0Q1Q2
1                1     1     1     7
2                0     1     1     3
3                0     0     1     1
4                1     0     0     4
5                0     1     0     2
6                1     0     1     5
7                1     1     0     6
8                1     1     1     7
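The update rules of Table 14.10 are easy to check with a short simulation. The Python sketch below is a minimal model of the 3-bit LFSR of Figure 14.23 (the function names are hypothetical); it treats Q0 as the most significant bit, as the table does.

```python
def lfsr_step(q0, q1, q2):
    """One clock tick: Q0 <- Q1 xor Q2, Q1 <- Q0, Q2 <- Q1."""
    return (q1 ^ q2, q0, q1)

def lfsr_sequence(state=(1, 1, 1), ticks=8):
    """Return the register contents Q0Q1Q2, as integers, at clock ticks 1..ticks."""
    values = []
    for _ in range(ticks):
        q0, q1, q2 = state
        values.append(q0 * 4 + q1 * 2 + q2)
        state = lfsr_step(*state)
    return values

seq = lfsr_sequence()
stuck = lfsr_sequence(state=(0, 0, 0), ticks=3)
print(seq)    # the eighth value repeats the first
print(stuck)  # the all-zeros state never changes
```

Starting from any other nonzero state gives the same cycle of seven values, entered at a different point.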
14.7.2 Signature Analysis

Figure 14.24 shows the LFSR of Figure 14.23 with an additional XOR gate used in the first stage of the shift register. If we apply a binary input sequence to IN , the shift register will perform data compaction (or compression ) on the input sequence. At the end of the input sequence the shift-register contents, Q0Q1Q2 , will form a pattern that we call a signature . If the input sequence and the serial-input signature register ( SISR ) are long enough, it is unlikely (though possible) that two different input sequences will produce the same signature. If the input sequence comes from logic that we wish to test, a fault in the logic will cause the input sequence to change. This causes the signature to change from a known good value and we shall then know
that the circuit under test is bad. This technique, called signature analysis , was developed by Hewlett-Packard to test equipment in the field in the late 1970s. FIGURE 14.24 A 3-bit serial-input signature register (SISR) using an LFSR (linear feedback shift register). The LFSR is initialized to Q1Q2Q3 = '000' using the common RES (reset) signal. The signature, Q1Q2Q3, is formed from shift-and-add operations on the sequence of input bits (IN).
14.7.3 A Simple BIST Example

We can combine the PRBS generator of Figure 14.23 together with the signature register of Figure 14.24 to form the simple BIST structure shown in Figure 14.25(a). LFSR1 generates a maximal-length (\(2^3 - 1 = 7\) cycles) PRBS. LFSR2 computes the signature ('011' for the good circuit) of the CUT. LFSR1 is initialized to '100' (Q0 = 1, Q1 = 0, Q2 = 0) and LFSR2 is initialized to '000'. The schematic in Figure 14.25(a) shows the bit sequences in the circuit, both for a good circuit and for a bad circuit with a stuck-at-1 fault, F1. Figure 14.25(b) shows how the bit sequences are calculated in the good circuit. The signature is formed as R0R1R2 seven clock edges (on the eighth clock cycle) after the active-low reset is taken high. Figure 14.26 shows the waveforms in the good and bad circuit. The bad-circuit signature, '000', differs from the good-circuit signature. The signature can either be compared with the known good signature on-chip or shifted out and compared off-chip (both approaches are used in practice).
(b) Bit sequences in the good circuit (clock cycles 1 to 8):

Signal                              1  2  3  4  5  6  7  8
Q0(t+1) = Q1(t) ⊕ Q2(t)             1  0  1  1  1  0  0  1
Q1(t+1) = Q0(t)                     0  1  0  1  1  1  0  0
Q2(t+1) = Q1(t)                     0  0  1  0  1  1  1  0
Z = Q0'.Q1 + Q1.Q2                  0  1  0  0  1  1  0  0
R0(t+1) = Z(t) ⊕ R0(t) ⊕ R2(t)      0  0  1  1  1  1  1  0
R1(t+1) = R0(t)                     0  0  0  1  1  1  1  1
R2(t+1) = R1(t)                     0  0  0  0  1  1  1  1
FIGURE 14.25 BIST example. (a) A simple BIST structure showing bit sequences for both good and bad circuits. (b) Bit sequence calculations for the good circuit. The signature appears on the eighth clock cycle (after seven positive clock edges) and is R0 = '0', R1 = '1', R2 = '1'; with R0 as the MSB this is '011' or hex 3.
FIGURE 14.26 The waveforms of the BIST circuit of Figure 14.25. (a) The good-circuit response. The waveforms Q1 and Q2, as well as R1 and R2, are delayed by one clock cycle as they move through each stage of the shift registers. (b) The same good-circuit response with the register outputs Q0 to Q2 and R0 to R2 grouped and their values displayed in hexadecimal (Q0 and R0 are the MSBs). The signature hex 3 or '011' (R0 = 0, R1 = 1, R2 = 1) in R appears seven positive clock edges after the reset signal is taken high. This is one clock cycle after the generator completes its first sequence (hex pattern 4, 2, 5, 6, 7, 3, 1). (c) The response of the bad circuit with fault F1 and fault signature hex 0 (circled).
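The good-circuit calculation of Figure 14.25(b) can be reproduced with a small Python model of the generator, the circuit under test, and the signature register. This is a sketch only: the function names are hypothetical, and the faulty circuit used here (the CUT output stuck at '1') is an illustrative assumption, since the location of fault F1 is only shown in the figure.

```python
def cut_good(q0, q1, q2):
    """Circuit under test: Z = Q0'.Q1 + Q1.Q2."""
    return ((1 - q0) & q1) | (q1 & q2)

def cut_output_stuck_at_1(q0, q1, q2):
    """Hypothetical faulty CUT whose output is stuck at '1'."""
    return 1

def run_bist(cut, edges=7):
    """Apply `edges` clock edges and return the signature (R0, R1, R2)."""
    q = (1, 0, 0)   # LFSR1 initialized to '100'
    r = (0, 0, 0)   # LFSR2 initialized to '000'
    for _ in range(edges):
        z = cut(*q)
        # Both registers update on the same clock edge from the old values.
        q, r = (q[1] ^ q[2], q[0], q[1]), (z ^ r[0] ^ r[2], r[0], r[1])
    return r

good_signature = run_bist(cut_good)
bad_signature = run_bist(cut_output_stuck_at_1)
print(good_signature, bad_signature)
```

A pass/fail decision then reduces to comparing the final register contents with the known good signature.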
14.7.4 Aliasing
In Figure 14.26 the good and bad circuits produced different signatures. There is a small probability that the signature of a bad circuit will be the same as a good circuit. This problem is known as aliasing or error masking. For the example in Figure 14.25, the bit stream input to the signature analysis register is 7 bits long. There are \(2^7\) or 128 possible 7-bit-long bit-stream patterns. We assume that each of these 128 bit-stream patterns is equally likely to produce any of the eight (all-zeros is an allowed pattern in a signature register) possible 3-bit signatures. It turns out that this is a good assumption. Thus there are 128 / 8 or 16 bit-streams that produce the good signature; one of these belongs to the good circuit, and the remaining 15 cause aliasing. Since there are a total of 128 − 1 = 127 bit-streams due to bad circuits, the fraction of bad-circuit bit-streams that cause aliasing is 15 / 127, or 0.118. If all bad-circuit bit-streams are equally likely (and this is a poor assumption) then 0.118 is also the probability of aliasing. In general, if the length of the test sequence is L and the length of the signature register is R, the probability p of aliasing (not detecting an error) is

\[ p = \frac{2^{L-R} - 1}{2^{L} - 1} . \tag{14.12} \]

Thus, for the example in Figure 14.25, L = 7 and R = 3, and the probability of aliasing is \(p = (2^{(7-3)} - 1) / (2^7 - 1) = 15 / 127 = 0.118\), as we have just calculated. This is a very high probability of error and we would not use such a short test sequence and such a short signature register in practice. For L >> R the error probability is

\[ p \approx 2^{-R} . \tag{14.13} \]

For example, if R = 16, p ≈ 0.0000152, corresponding to an error coverage (1 − p) of approximately 99.9984 percent. Unfortunately, these equations for error coverage are rather meaningless since there is no easy way to relate the error coverage to fault coverage. The problem lies in our assumption that all bad-circuit bit-streams are equally likely, and this is not true in practice (for example, bit-stream outputs of all ones or all zeros are more likely to occur as a result of faults). Nevertheless signature analysis with high error-coverage rates is found to produce high fault coverage.
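The counting argument behind Eq. 14.12 can be checked by brute force for the 3-bit example. The Python sketch below (function names are hypothetical) compacts every possible 7-bit stream with the signature register of Figure 14.24, starting from '000', and counts how many streams land on each signature.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def aliasing_probability(L, R):
    """Eq. 14.12: fraction of bad-circuit bit-streams that give the good signature."""
    return Fraction(2 ** (L - R) - 1, 2 ** L - 1)

def sisr_signature(bits):
    """Compact a bit-stream with R0 <- IN xor R0 xor R2, R1 <- R0, R2 <- R1."""
    r = (0, 0, 0)
    for z in bits:
        r = (z ^ r[0] ^ r[2], r[0], r[1])
    return r

# Every one of the 2^7 = 128 input streams, grouped by resulting signature.
counts = Counter(sisr_signature(bits) for bits in product((0, 1), repeat=7))

p = aliasing_probability(7, 3)
print(dict(counts))
print(p, float(p), float(aliasing_probability(64, 16)))
```

Because the register is linear, the 128 streams divide evenly among the eight signatures, which is why the equal-likelihood assumption for signatures holds exactly here.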
14.7.5 LFSR Theory

The operation of LFSRs is related to the mathematics of polynomials and Galois-field theory. The properties and behavior of these polynomials are well known and they are also used extensively in coding theory. Every LFSR has a characteristic polynomial that describes its behavior. The characteristic polynomials that cause an LFSR to generate a maximum-length PRBS are called primitive polynomials. Consider the primitive polynomial

\[ P(x) = 1 \oplus x^1 \oplus x^3 , \tag{14.14} \]

where \(a \oplus b\) represents the exclusive-OR of a and b. The order of this polynomial is three, and the corresponding LFSR will generate a PRBS of length \(2^3 - 1 = 7\). For a primitive polynomial of order n, the length of the PRBS is \(2^n - 1\). Figure 14.27 shows the nonzero coefficients of some primitive polynomials [Golomb et al., 1982].

n     s                Octal    Binary
1     0, 1             3        11
2     0, 1, 2          7        111
3     0, 1, 3          13       1011
4     0, 1, 4          23       10011
5     0, 2, 5          45       100101
6     0, 1, 6          103      1000011
7     0, 1, 7          211      10001001
8     0, 1, 5, 6, 8    435      100011101
9     0, 4, 9          1021     1000010001
10    0, 3, 10         2011     10000001001

For n = 3 and s = 0, 1, 3: c_0 = 1, c_1 = 1, c_2 = 0, c_3 = 1.

FIGURE 14.27 Primitive polynomial coefficients for LFSRs (linear feedback shift registers) that generate a maximal-length PRBS (pseudorandom binary sequence). A schematic for a type 1 LFSR is shown.

Any primitive polynomial can be written as

\[ P(x) = c_0 \oplus c_1 x^1 \oplus \cdots \oplus c_n x^n , \tag{14.15} \]

where \(c_0\) and \(c_n\) are always one. Thus for example, from Figure 14.27 for n = 3, we see s = 0, 1, 3; and thus the nonzero coefficients are \(c_0\), \(c_1\), and \(c_3\). This corresponds to the primitive polynomial \(P(x) = 1 \oplus x^1 \oplus x^3\). There is no easy way to determine the coefficients of primitive polynomials, especially for large n. There are many primitive polynomials for each n, but Figure 14.27 lists the one with the fewest nonzero coefficients. The schematic in Figure 14.27 shows how the feedback taps on an LFSR correspond to the nonzero coefficients of the primitive polynomial. If the i th coefficient \(c_i\) is 1, then we include a feedback connection and an XOR gate in that position. If \(c_i\) is zero, there is no feedback connection and no XOR gate in that position. The reciprocal of a primitive polynomial, P*(x), is also primitive, where

\[ P^*(x) = x^n P(x^{-1}) . \tag{14.16} \]

For example, by taking the reciprocal of the primitive polynomial \(P(x) = 1 \oplus x^1 \oplus x^3\) from Eq. 14.14, we can form

\[ P^*(x) = 1 \oplus x^2 \oplus x^3 , \tag{14.17} \]
which is also a primitive polynomial. This means that there are two possible LFSR implementations for every P(x) . Or, looked at another way, for every LFSR implementation, the characteristic polynomial can be written in terms of two primitive polynomials, P(x) and P*(x) , that are reciprocals of each other.
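A short Python sketch can make the reciprocal relation concrete. Polynomials are represented by their list of nonzero exponents (the s notation of Figure 14.27); the helper names are hypothetical. The primitivity check used here is a brute-force one, testing whether x has multiplicative order \(2^n - 1\) modulo P(x) over GF(2), so it is only practical for small n.

```python
def reciprocal(s):
    """Exponents of P*(x) = x^n P(1/x): each exponent e becomes n - e."""
    n = max(s)
    return sorted(n - e for e in s)

def order_of_x(s):
    """Smallest k > 0 with x^k = 1 (mod P(x)) over GF(2), or None if there is none."""
    n = max(s)
    p = sum(1 << e for e in s)       # polynomial as a bit mask
    v = 1                            # represents x^0
    for k in range(1, 2 ** n):
        v <<= 1                      # multiply by x
        if (v >> n) & 1:
            v ^= p                   # reduce modulo P(x)
        if v == 1:
            return k
    return None

def is_primitive(s):
    """True if the LFSR built from P(x) has the maximal period 2^n - 1."""
    return order_of_x(s) == 2 ** max(s) - 1

p = [0, 1, 3]                        # P(x) = 1 + x + x^3
p_star = reciprocal(p)               # P*(x) = 1 + x^2 + x^3
print(p_star, is_primitive(p), is_primitive(p_star))
print(is_primitive([0, 1, 2, 3, 4]))  # irreducible but not primitive (period 5)
```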
FIGURE 14.28 For every primitive polynomial there are four linear feedback shift registers (LFSRs). There are two types of LFSR; one type uses external XOR gates (type 1) and the other type uses internal XOR gates (type 2). For each type the feedback taps can be constructed either from the polynomial P(x) or from its reciprocal, P*(x). The LFSRs in this figure correspond to \(P(x) = 1 \oplus x \oplus x^3\) and \(P^*(x) = 1 \oplus x^2 \oplus x^3\). Each LFSR produces a different pseudorandom sequence, as shown. The binary values of the LFSR seen as a register, with the bit labeled as zero being the MSB, are shown in hexadecimal. The sequences shown are for each register initialized to '111', hex 7. (a) Type 1, P*(x). (b) Type 1, P(x). (c) Type 2, P(x). (d) Type 2, P*(x).

We may also implement an LFSR by using XOR gates in series with each flip-flop output rather than external to the shift register. The external-XOR LFSR is called a type 1 LFSR and the internal-XOR LFSR is called a type 2 LFSR (this is a nomenclature that most follow). Figure 14.28 shows the four different LFSRs that may be constructed for each primitive polynomial, P(x). There are differences between the four different LFSRs for each polynomial. Each
gives a different output sequence. The outputs for the type 1 LFSRs, taken from the Q outputs of each flip-flop, are identical, but delayed by one clock cycle from the previous output. This is a problem when we use the parallel output from an LFSR to test logic because of the strong correlation between the test signals. The type 2 LFSRs do not have this problem. The type 2 LFSRs also are capable of higher-frequency operation since there are fewer series XOR gates in the signal path than in the corresponding type 1 LFSR. For these reasons, the type 2 LFSRs are usually used in BIST structures. The type 1 LFSR does have the advantage that it can be more easily constructed using register structures that already exist on an ASIC.

Table 14.11 shows primitive polynomial coefficients for higher values of n than Figure 14.27. Test length grows quickly with the size of the LFSR. For example, a 32-bit generator will produce a sequence with \(2^{32} = 4{,}294{,}967{,}296 \approx 4.3 \times 10^9\) bits. With a 100 MHz clock (with 10 ns cycle time), the test time of 43 seconds would be impractical.

TABLE 14.11 Nonzero coefficients of primitive polynomials for LFSRs (linear feedback shift registers) that generate a maximal-length PRBS (pseudorandom binary sequence).

n    s                  n    s                  n    s                  n     s
1    0, 1               11   0, 2, 11           21   0, 2, 21           31    0, 3, 31
2    0, 1, 2            12   0, 3, 4, 7, 12     22   0, 1, 22           32    0, 1, 27, 28, 32
3    0, 1, 3            13   0, 1, 3, 4, 13     23   0, 5, 23           40    0, 2, 19, 21, 40
4    0, 1, 4            14   0, 1, 11, 12, 14   24   0, 1, 3, 4, 24     50    0, 1, 26, 27, 50
5    0, 2, 5            15   0, 1, 15           25   0, 3, 25           60    0, 1, 60
6    0, 1, 6            16   0, 2, 3, 5, 16     26   0, 1, 7, 26        70    0, 1, 15, 16, 70
7    0, 1, 7            17   0, 3, 17           27   0, 1, 7, 27        80    0, 1, 37, 38, 80
8    0, 1, 5, 6, 8      18   0, 7, 18           28   0, 3, 28           90    0, 1, 18, 19, 90
9    0, 4, 9            19   0, 1, 5, 6, 19     29   0, 2, 29           100   0, 37, 100
10   0, 3, 10           20   0, 3, 20           30   0, 1, 15, 16, 30   256   0, 1, 3, 16, 256
There is confusion over naming, labeling, and drawing of LFSRs in texts and test programs. Looking at the schematic in Figure 14.27, we can draw the LFSR with signals flowing from left to right or vice versa (two ways), we can name the leftmost flip-flop output \(Q_0\) or \(Q_n\) (two more ways), and we can name the coefficient that goes with \(Q_0\) either \(c_0\) or \(c_{n-1}\) (two more ways). There are thus at least \(2^3 \times 4\) different ways to draw an LFSR for a given polynomial. Four of these are distinct. You can connect the LFSR feedback in the reverse order and the LFSR will still work; you will, however, get a different sequence. Usually this does not matter.
14.7.6 LFSR Example

We can use a cell compiler to produce LFSR and signature register BIST structures. For example, we might complete a property sheet as follows:

property name          value        property name            value
------------------     -----        ------------------       -----
LFSR_is_bilbo          false        LFSR_configuration       generator
LFSR_length            3            LFSR_init_hex_value      4
LFSR_scan              false        LFSR_mux_data            false
LFSR_mux_output        false        LFSR_xor_hex_function    max_length
LFSR_zero_state        false        LFSR_signature_inputs    1

The Verilog structural netlist for the compiled type 2 LFSR generator is shown in Table 14.12. According to our notation and the primitive polynomials in Figure 14.27, the corresponding primitive polynomial is \(P^*(x) = 1 \oplus x^2 \oplus x^3\). The LFSR has both serial and parallel outputs (taken from the inverted flip-flop outputs
with inverting buffers, cell names in02d1). The clock and reset inputs are buffered (with noninverting buffers, cell names ni01d1) since these inputs would normally have to drive a load of more than 3 bits. Looking in the cell data book we find that the flip-flop corresponding to the MSB, instance FF2 with cell name dfptnb, has an active-low set input SDN. The remaining flip-flops, cell name dfctnb, have active-low clears, CDN. This gives us the initial value '100'. Table 14.13 shows the serial-input signature register compiled using the reciprocal polynomial. Again the compiler has included buffers. All the flip-flops, cell names dfctnb, have active-low clear so that the initial content of the register is '000'.

TABLE 14.12 Compiled LFSR generator, using \(P^*(x) = 1 \oplus x^2 \oplus x^3\).

module lfsr_generator (OUT, SERIAL_OUT, INITN, CP);
output [2:0] OUT; output SERIAL_OUT; input INITN, CP;
// shift register with one internal XOR; FF2 is set and FF1, FF0 are cleared by INITN
dfptnb FF2 (.D(FF0_Q), .CP(u4_Z), .SDN(u2_Z), .Q(FF2_Q), .QN(FF2_QN));
dfctnb FF1 (.D(XOR0_Z), .CP(u4_Z), .CDN(u2_Z), .Q(FF1_Q), .QN(FF1_QN));
dfctnb FF0 (.D(FF1_Q), .CP(u4_Z), .CDN(u2_Z), .Q(FF0_Q), .QN(FF0_QN));
// buffers for the reset (INITN) and clock (CP) inputs
ni01d1 u2 (.I(u3_Z), .Z(u2_Z));
ni01d1 u3 (.I(INITN), .Z(u3_Z));
ni01d1 u4 (.I(u5_Z), .Z(u4_Z));
ni01d1 u5 (.I(CP), .Z(u5_Z));
xo02d1 XOR0 (.A1(FF2_Q), .A2(FF0_Q), .Z(XOR0_Z));
// inverting output buffers driven from the QN outputs
in02d1 INV2X0 (.I(FF0_QN), .ZN(OUT[0]));
in02d1 INV2X1 (.I(FF1_QN), .ZN(OUT[1]));
in02d1 INV2X2 (.I(FF2_QN), .ZN(OUT[2]));
in02d1 INV2X3 (.I(FF0_QN), .ZN(SERIAL_OUT));
endmodule

TABLE 14.13 Compiled serial-input signature register, using \(P(x) = 1 \oplus x \oplus x^3\).
module lfsr_signature (OUT, SERIAL_OUT, INITN, CP, IN);
output [2:0] OUT; output SERIAL_OUT; input INITN, CP; input [0:0] IN;
// all flip-flops are cleared by INITN; the serial input enters through XOR1
dfctnb FF2 (.D(XOR1_Z), .CP(u4_Z), .CDN(u2_Z), .Q(FF2_Q), .QN(FF2_QN));
dfctnb FF1 (.D(FF2_Q), .CP(u4_Z), .CDN(u2_Z), .Q(FF1_Q), .QN(FF1_QN));
dfctnb FF0 (.D(XOR0_Z), .CP(u4_Z), .CDN(u2_Z), .Q(FF0_Q), .QN(FF0_QN));
// buffers for the reset (INITN) and clock (CP) inputs
ni01d1 u2 (.I(u3_Z), .Z(u2_Z));
ni01d1 u3 (.I(INITN), .Z(u3_Z));
ni01d1 u4 (.I(u5_Z), .Z(u4_Z));
ni01d1 u5 (.I(CP), .Z(u5_Z));
xo02d1 XOR1 (.A1(IN[0]), .A2(FF0_Q), .Z(XOR1_Z));
xo02d1 XOR0 (.A1(FF1_Q), .A2(FF0_Q), .Z(XOR0_Z));
// inverting output buffers driven from the QN outputs
in02d1 INV2X1 (.I(FF1_QN), .ZN(OUT[1]));
in02d1 INV2X2 (.I(FF2_QN), .ZN(OUT[2]));
in02d1 INV2X3 (.I(FF0_QN), .ZN(SERIAL_OUT));
in02d1 INV2X0 (.I(FF0_QN), .ZN(OUT[0]));
endmodule
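The connectivity of the compiled generator in Table 14.12 can be cross-checked with a few lines of Python. This sketch models only the three flip-flops and the XOR gate (buffers and output inverters do not change the logic values); the function name is hypothetical, and the state is written as (FF2, FF1, FF0) so that it matches OUT[2:0].

```python
def generator_step(ff2, ff1, ff0):
    """Next state from the D-input connections in the netlist:
    FF2.D = FF0.Q, FF1.D = FF2.Q xor FF0.Q, FF0.D = FF1.Q."""
    return (ff0, ff2 ^ ff0, ff1)

state = (1, 0, 0)            # FF2 set, FF1 and FF0 cleared: initial value '100'
out_values = []
for _ in range(7):
    ff2, ff1, ff0 = state
    out_values.append(ff2 * 4 + ff1 * 2 + ff0)   # OUT[2:0] as an integer
    state = generator_step(*state)

print(out_values, state)
```

The register visits all seven nonzero states and then returns to '100', as expected for a maximal-length 3-bit LFSR.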
14.7.7 MISR
A serial-input signature register can only be used to test logic with a single output. We can extend the idea of a serial-input signature register to the multiple-input signature register (MISR) shown in Figure 14.29. There are several ways to connect the inputs to both types (type 1 and type 2) of LFSRs to form an MISR. Since the XOR operation is linear and associative, so that \((A \oplus B) \oplus C = A \oplus (B \oplus C)\), as long as the result of the additions are the same then the different representations are equivalent. If we have an n-bit long MISR we can accommodate up to n inputs to form the signature. If we use m < n inputs we do not need the extra XOR gates in the last n − m positions of the MISR.
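The linearity that makes the different MISR representations equivalent can be demonstrated with a small Python model. The wiring below is an assumption made for illustration (the feedback of the 3-bit signature register of Figure 14.24, with one extra XOR input per stage); it is not claimed to match the exact circuit of Figure 14.29, and the names are hypothetical.

```python
def misr_step(state, inputs):
    """One clock of a 3-bit MISR: shift with feedback, then XOR in one input per stage."""
    r0, r1, r2 = state
    shifted = (r0 ^ r2, r0, r1)              # assumed feedback: R0 <- R0 xor R2
    return tuple(s ^ i for s, i in zip(shifted, inputs))

def misr_signature(vectors, state=(0, 0, 0)):
    """Compact a sequence of 3-bit input vectors into a 3-bit signature."""
    for v in vectors:
        state = misr_step(state, v)
    return state

def xor_vec(a, b):
    return tuple(x ^ y for x, y in zip(a, b))

stream_a = [(1, 0, 0), (0, 1, 1), (1, 1, 0), (0, 0, 1)]
stream_b = [(0, 0, 1), (1, 1, 1), (0, 1, 0), (1, 0, 0)]
stream_sum = [xor_vec(a, b) for a, b in zip(stream_a, stream_b)]

# Linearity: the signature of the XOR of two streams is the XOR of their signatures.
lhs = misr_signature(stream_sum)
rhs = xor_vec(misr_signature(stream_a), misr_signature(stream_b))

# With only the first input used (m = 1), the MISR behaves as a serial-input register.
z_good = [0, 1, 0, 0, 1, 1, 0]               # good-circuit bit stream of Figure 14.25(b)
serial = misr_signature([(z, 0, 0) for z in z_good])
print(lhs, rhs, serial)
```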
FIGURE 14.29 Multiple-input signature register (MISR). This MISR is formed from the type 2 LFSR (with \(P^*(x) = 1 \oplus x^2 \oplus x^3\)) shown in Figure 14.28(d) by adding XOR gates xor_i1, xor_i2, and xor_i3. This 3-bit MISR can form a signature from logic with three outputs. If we only need to test two outputs then we do not need XOR gate, xor_i3, corresponding to input in[2].

There are several types of BIST architecture based on the MISR. By including extra logic we can reconfigure an MISR to be an LFSR or a signature register; this is called a built-in logic block observer (BILBO). By including the logic that we wish to test in the feedback path of an MISR, we can construct circular BIST structures. One of these is known as the circular self-test path (CSTP). We can test compiled blocks including RAM, ROM, and datapath elements using an LFSR generator and a MISR. To generate all \(2^n\) address values for a RAM or ROM we can modify the LFSR feedback path to force entrance and exit from the all-zeros state. This is known as a complete LFSR. The pattern generator does not have to be an LFSR or exhaustive. For example, if we were to apply an exhaustive test to a 4-bit by 4-bit multiplier this would require \(2^8\) or 256 vectors. An 8-bit by 8-bit multiplier requires 65,536 vectors and, if it were possible to test a 32-bit by 32-bit multiplier exhaustively, it would require \(1.8 \times 10^{19}\) vectors. Table 14.14 shows two sets of nonexhaustive test patterns, {SA} and {SAE}, if A and B are both 4 bits wide. The test sequences {SA} and {SAE} consist of nested sequences of walking 1s and walking 0s (S1 and S1B), walking pairs (S2 and S2B), and triplets (S3, S3B). The sequences are extended for larger inputs, so that, for example, {S2} is a sequence of seven vectors
for an 8-bit input and so on. Intermediate sequences {SX} and {SXB} are concatenated from S1, S2, and S3; and from S1B, S2B, and S3B respectively. These sequences are chosen to exercise as many of the add-and-carry functions within the multiplier as possible.

TABLE 14.14 Multiplier test patterns (see note 1).

Sequence {SA}:
  { AB = {S1, SX} }
  Total = 4 × 9 = 3A(B − 1) = 36

Sequence {SX}:
  S1 = {1000 0100 0010 0001}
  S2 = {1100 0110 0011}
  S3 = {1110 0111}
  SX = { {S1} {S2} {S3} }
  Total = 3(X − 1) = 9, X = 4

Sequence {SXB}:
  S1B = {0111 1011 1101 1110}
  S2B = {0011 1001 1100}
  S3B = {0001 1000}
  SXB = { {S1B} {S2B} {S3B} }
  Total = 3(X − 1) = 9, X = 4

Sequence {SAE}:
  { { AB = {S1, SX} } { AB = {S1B, SXB} }
    { AB = {S2, SX} } { AB = {S2B, SXB} }
    { AB = {S3, SX} } { AB = {S3B, SXB} } }
  Total = 3(2A − 1)(3B − 2) = 3 × 7 × 10 = 210

The sequence length of {SA} is 3A(B − 1), and 3(2A − 1)(3B − 2) for {SAE}, where A and B are the sizes of the multiplier inputs. For example, {SA} is 168 vectors for A = B = 8 and 2976 vectors for A = B = 32; {SAE} is 990 vectors (A = B = 8) and 17,766 vectors (A = B = 32). From fault simulation, the stuck-at fault coverage is 93 percent for sequence {SA} and 97 percent for sequence {SAE}. Figure 14.30 shows an MISR with a scan chain. We can now include the BIST logic as part of a boundary-scan chain; this approach is called scanBIST.
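The walking sequences of Table 14.14 and the length formula for {SA} can be generated for any input width with a short Python sketch. The function names are hypothetical; only {SX}, {SXB}, and {SA} are enumerated here.

```python
def walking(n, width, invert=False):
    """All positions of `width` adjacent 1s in an n-bit word (0s instead if invert)."""
    words = ["0" * pos + "1" * width + "0" * (n - pos - width)
             for pos in range(n - width + 1)]
    if invert:
        words = [w.translate(str.maketrans("01", "10")) for w in words]
    return words

def sx(n, invert=False):
    """{SX} (or {SXB} if invert): walking singles, then pairs, then triplets."""
    return (walking(n, 1, invert) + walking(n, 2, invert) + walking(n, 3, invert))

def sa(a_bits, b_bits):
    """{SA}: for each value of A in {S1}, apply every value of B in {SX}."""
    return [(a, b) for a in walking(a_bits, 1) for b in sx(b_bits)]

print(sx(4))
print(sx(4, invert=True))
print(len(sa(4, 4)), len(sa(8, 8)), len(sa(32, 32)))
```

Each {SX} sequence for an n-bit input has n + (n − 1) + (n − 2) = 3(n − 1) vectors, which gives the 3A(B − 1) total for {SA}.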
FIGURE 14.30 Multiple-input signature register (MISR) with scan generated from the MISR of Figure 14.29 .
1. {AB = {S1, SB} } means for each value of A in the sequence {S1} set B equal to all the values in {SB}.
14.8 A Simple Test Example

As an example, we will describe automatic test generation using boundary scan together with internal scan. We shall use the function Z = A'B + BC for the core logic and register the three inputs using three flip-flops. We shall test the resulting sequential logic using a scan chain. The simple logic will allow us to see how the test vectors are generated.
14.8.1 Test-Logic Insertion

Figure 14.31 shows a structural Verilog model of the logic core. The three flip-flops (cell name dfctnb) implement the input register. The combinational logic implements the function, outp = a_r[0]'.a_r[1] + a_r[1].a_r[2]. This is the same function as Figure 14.14 and Figure 14.16.
module core_p (outp, reset, a, clk);
output outp; input reset, clk; input [2:0] a; wire [2:0] a_r;
dfctnb a_r_ff_b0 (.D(a[0]), .CP(clk), .CDN(reset), .Q(a_r[0]), .QN(\a_r_ff_b0.QN ));
dfctnb a_r_ff_b1 (.D(a[1]), .CP(clk), .CDN(reset), .Q(a_r[1]), .QN(\a_r_ff_b1.QN ));
dfctnb a_r_ff_b2 (.D(a[2]), .CP(clk), .CDN(reset), .Q(a_r[2]), .QN(\a_r_ff_b2.QN ));
in01d0 u2 (.I(a_r[0]), .ZN(u2_ZN));
nd02d0 u3 (.A1(u2_ZN), .A2(a_r[1]), .ZN(u3_ZN));
nd02d0 u4 (.A1(a_r[1]), .A2(a_r[2]), .ZN(u4_ZN));
nd02d0 u5 (.A1(u3_ZN), .A2(u4_ZN), .ZN(outp));
endmodule

FIGURE 14.31 Core of the Threegates ASIC.
Table 14.15 shows the structural Verilog for the top-level logic of the Threegates ASIC including the I/O pads. There are nine pad cells. Three instances (up1_b0 , up1_b1 , and up1_b2 ) are the data-input pads, and one instance, up2_1 , is the output pad. These were vectorized pads (even for the output that had a range of 1), so the synthesizer has added suffixes ( '_1' and so on) to the pad instance names. Two pads are for power, one each for ground and the positive supply, instances up11 and up12 . One pad, instance up3_1 , is for the reset signal. There are two pad cells for the clock. Instance up4_1 is the clock input pad attached to the package pin and instance up6 is the clock input buffer. The next step is to insert the boundary-scan logic and the internal-scan logic. Some synthesis tools can create test logic as they synthesize, but for most tools we need to perform test-logic insertion as a separate step. Normally we complete a parameter sheet specifying the type of test logic (boundary scan with internal scan in this case), as well as the ordering of the scan chain. In our example, we shall include all of the sequential cells in the boundary-scan register and order the boundary-scan cells using the pad numbers (in the original behavioral input). Figure 14.32 shows the modified core logic. The test software has changed all the flip-flops (cell names dfctnb ) to scan flip-flops (with the same instance names, but the cell names are changed to mfctnb ). The test software also adds a noninverting buffer to drive the scanselect signal to all the scan flip-flops. The test software also adds logic to the top level. We do not need a detailed understanding of the automatically generated logic, but later in the design flow we will need to understand what has been done. Figure 14.33 shows a high-level view of the Threegates ASIC before and after test-logic insertion. TABLE 14.15 The top level of the Threegates ASIC before test-logic insertion.
module asic_p (pad_outp, pad_a, pad_reset, pad_clk);
  output [0:0] pad_outp;
  input [2:0] pad_a;
  input [0:0] pad_reset, pad_clk;
  wire [0:0] reset_sv, clk_sv, outp_sv;
  wire [2:0] a_sv;
  supply1 VDD;
  supply0 VSS;
  core_p uc1 (.outp(outp_sv[0]), .reset(reset_sv[0]), .a(a_sv[2:0]), .clk(clk_bit));
  pc3o07 up2_1 (.PAD(pad_outp[0]), .I(outp_sv[0]));
  pc3c01 up6 (.CCLK(clk_sv[0]), .CP(clk_bit));
  pc3d01r up3_1 (.PAD(pad_reset[0]), .CIN(reset_sv[0]));
  pc3d01r up4_1 (.PAD(pad_clk[0]), .CIN(clk_sv[0]));
  pc3d01r up1_b0 (.PAD(pad_a[0]), .CIN(a_sv[0]));
  pc3d01r up1_b1 (.PAD(pad_a[1]), .CIN(a_sv[1]));
  pc3d01r up1_b2 (.PAD(pad_a[2]), .CIN(a_sv[2]));
  pv0f up11 (.VSS(VSS));
  pvdf up12 (.VDD(VDD));
endmodule
module core_p_ta (a_r_2, outp, a_r_ff_b0_DA, taDriver12_I, a, clk, reset);
  output a_r_2, outp;
  input a_r_ff_b0_DA, taDriver12_I;
  input [2:0] a;
  input clk, reset;
  wire [1:0] a_r;
  supply1 VDD;
  supply0 VSS;
  ni01d5 taDriver12 (.I(taDriver12_I), .Z(taDriver12_Z));
  mfctnb a_r_ff_b0 (.DA(a_r_ff_b0_DA), .DB(a[0]), .SA(taDriver12_Z), .CP(clk), .CDN(reset), .Q(a_r[0]), .QN(\a_r_ff_b0.QN ));
  mfctnb a_r_ff_b1 (.DA(a_r[0]), .DB(a[1]), .SA(taDriver12_Z), .CP(clk), .CDN(reset), .Q(a_r[1]), .QN(\a_r_ff_b1.QN ));
  mfctnb a_r_ff_b2 (.DA(a_r[1]), .DB(a[2]), .SA(taDriver12_Z), .CP(clk), .CDN(reset), .Q(a_r_2), .QN(\a_r_ff_b2.QN ));
  in01d0 u2 (.I(a_r[0]), .ZN(u2_ZN));
  nd02d0 u3 (.A1(u2_ZN), .A2(a_r[1]), .ZN(u3_ZN));
  nd02d0 u4 (.A1(a_r[1]), .A2(a_r_2), .ZN(u4_ZN));
  nd02d0 u5 (.A1(u3_ZN), .A2(u4_ZN), .ZN(outp));
endmodule

FIGURE 14.32 The core of the Threegates ASIC after test-logic insertion.
FIGURE 14.33 The Threegates ASIC. (a) Before test-logic insertion. (b) After test-logic insertion.
14.8.2 How the Test Software Works

The structural Verilog for the Threegates ASIC is lengthy, so Figure 14.34 shows only the essential parts. The following main blocks are labeled in Figure 14.34:

1. This block is the logic core shown in Figure 14.32. The Verilog module header shows the local and formal port names. Arrows indicate whether each signal is an input or an output.
2. This is the main body of logic added by the test software. It includes the boundary-scan controller and clock control.
3. This block groups together the buffers that the test software has added at the top level to drive the control signals throughout the boundary-scan logic.
4. This block is the first boundary-scan cell in the BSR. There are six boundary-scan cells: three input cells for the data inputs, one output cell for the data output, one input cell for the reset, and one input cell for the clock. Only the first (the boundary-scan input cell for a[0]) and the last boundary-scan cells are shown. The others are similar.
5. This is the last boundary-scan cell in the BSR, the output cell for the clock.
6. This is the clock pad (with input connected to the ASIC package pin). The cell itself is unchanged by the test software, but the connections have been altered.
7. This is the clock-buffer cell that has not been changed.
8. The test software adds five I/O pads for the TAP. Four are input pad cells for TCK, TMS, TDI, and TRST. One is a three-state output pad cell for TDO.
9. The power pad cells remain unchanged.
10. The remaining I/O pad cells for the three data inputs, the data output, and reset remain unchanged, but the connections to the core logic are broken and the boundary-scan cells inserted.

The numbers in Figure 14.34 link the signals in each block to the following explanations:

1. The control signals for the input BSCs are C_0, C_1, C_2, and C_4 and these are all buffered, together with the test clock TCK. The single output BSC also requires the control signal C_3 and this is driven from the BST controller.
2. The clock enters the ASIC package through the clock pad as .PAD(clk[0]) and exits the clock pad cell as .CIN(up_4_1_CIN1). The test software routes this to the data input of the last boundary-scan cell as .PI(up_4_1_CIN1) and the clock exits as .PO(up_4_1_cin). The clock then passes through the clock buffer, as before.
3. The serial input of the first boundary-scan cell comes from the controller as .bst_control_BST_SI(test_logic_bst_control_BST_SI).
4. The serial output of the last boundary-scan cell goes to the controller as .bst_control_BST(up4_1_bst_SO).
5. The beginning of the BSR is the first scan flip-flop in the core, which is connected to the TDI input as .a_r_ff_b0_DA(ta_TDI_CIN).
6. The end of the scan chain leaves the core as .a_r_2(uc1_a_r_2) and enters the controller as .bst_control_scan_SO(uc1_a_r_2).
7. The scan-enable signal .bst_control_C_9(test_logic_bst_control_C_9) is generated by the boundary-scan controller, and connects to the core as .taDriver12_I(test_logic_bst_control_C_9).
FIGURE 14.34 The top level of the Threegates ASIC after test-logic insertion.

The added test logic is shown in Figure 14.35. The blocks are as follows:

1. This is the module declaration for the test logic in the rest of the diagram; it corresponds to block B in Figure 14.34.
2. This block contains buffers and clock control logic.
3. This is the boundary-scan controller.
4. This is the first of 26 IDR cells. In this implementation the IDCODE register is combined with the BSR. Since there are only six BSR cells we need (32 - 6) or 26 IDR cells to complete the 32-bit IDR.
5. This is the last IDR cell.
FIGURE 14.35 Test logic inserted in the Threegates ASIC.

TABLE 14.16 The TAP (test-access port) control. (See note 1.)

TAP state    C_0      C_1      C_2      C_3      C_4      C_5      C_6      C_7      C_8 (note 2)  C_9
Reset        x        x        xxxx0xx  xxxx0xx  xxxx0xx  xxxx0xx  xxxx1xx  xxxx0xx  xxxx0xx       xxxx0xx
Run_Idle     00x0xxx  11x1xxx  0        1001011  0001011  0000010  1        0000001  0000000       0000000
Select_DR    00x0xxx  11x1xxx  0        1001011  0001011  0000010  1        0000001  0000000       0000000
Capture_DR   00x01xx  00x00xx  0        1001011  0001011  0000010  1        0000001  000000T       0000000
Shift_DR     11x11xx  11x11xx  0        1001011  0001011  0000010  1111101  0000001  000000T       0000001
Exit1_DR     00x00xx  11x11xx  0        1001011  0001011  0000010  1        0000001  0000000       0000000
Pause_DR     00x00xx  11x11xx  0        1001011  0001011  0000010  1        0000001  0000000       0000000
Exit2_DR     00x00xx  11x11xx  0        1001011  0001011  0000010  1        0000001  0000000       0000000
Update_DR    00x0xxx  11x1xxx  110100   1001011  0001011  0        1111101  0000001  0000000       0000000
Select_IR    x        x        0        1001011  0001011  00000x0  11111x1  0000001  0000000       0000000
Capture_IR   x        x        0        1001011  0001011  00000x0  11111x1  0000001  0000000       0000000
Shift_IR     x        x        0        1001011  0001011  00000x0  11111x1  0000001  0000000       0000000
Exit1_IR     x        x        0        1001011  0001011  00000x0  11111x1  0000001  0000000       0000000
Pause_IR     x        x        0        1001011  0001011  00000x0  11111x1  0000001  0000000       0000000
Exit2_IR     x        x        0        1001011  0001011  00000x0  11111x1  0000001  0000000       0000000
Update_IR    x        x        0        1001011  0001011  00000x0  1111101  0000001  0000000       0000000
The numbers in Figure 14.35 refer to the following explanations:

1. The system clock (CLK, not the test clock TCK) from the top level (after passing through the boundary-scan cell) is fed through a MUX so that CLK may be controlled during scan.
2. The signal bst_control_BST is the end (output) of the boundary-scan cells and the start (input) to the ID-register-only cells.
3. The signal id_reg_0_SO is the end (output) of the ID register.
4. The signal bst_control_BST_SI is the start of the boundary-scan chain.

The job of the boundary-scan controller is to produce the control signals (C_0 through C_9) for each of the 16 TAP controller states (reset through update_IR) for each different instruction. In this BST implementation there are seven instructions: the required EXTEST, SAMPLE, and BYPASS; IDCODE; INTEST (which is the equivalent of EXTEST, but for internal test); RUNBIST (which allows on-chip test structures to operate); and SCANM (which controls the internal-scan chains). The boundary-scan controller outputs are shown in Table 14.16.

There are two very important differences between this controller and the one described in Table 14.5. The first, and most obvious, is that the control signals now depend on the instruction. This is primarily because INTEST requires the control signal at the output of the BSCs to be in different states for the input and output cells. The second difference is that the logic for the boundary-scan cell control signals is now purely combinational; we have removed the gated clocks. For example, Figure 14.36 shows the input boundary-scan cell. The clock for the shift flip-flop is now TCK and not a gated clock as it was in Table 14.5. We can do this because the output of the flip-flop, SO, the scan output, is added as an input to the MUX that feeds the flip-flop data input. Thus, when we wish to hold the state of the flip-flop, the control signals select SO to be connected from the output to the input. This is called a polarity-hold flip-flop.

Unfortunately, we have little choice but to gate the system clock if we make the scan chain part of the BSR. We cannot have one clock for part of the BSR and another for the rest. The costly alternative is to change every scan flip-flop to a scanned polarity-hold flip-flop.
module mybs1cela0 (SO, PO, C_0, TCK, SI, C_1, C_2, C_4, PI);
  output SO, PO;
  input C_0, C_1, C_2, C_4, TCK, SI, PI;
  in01d1 inv_0 (.I(C_0), .ZN(iv0_ZN));
  in01d1 inv_1 (.I(C_1), .ZN(iv1_ZN));
  oa03d1 oai221_1 (.A1(C_0), .A2(SO), .B1(iv0_ZN), .B2(SI), .C(C_1), .ZN(oa1_ZN));
  nd02d1 nand2_1 (.A1(na2_ZN), .A2(oa1_ZN), .ZN(na1_ZN));
  nd03d1 nand3_1 (.A1(PO), .A2(iv0_ZN), .A3(iv1_ZN), .ZN(na2_ZN));
  mx21d1 mux21_1 (.I0(PI), .I1(upo), .S(C_4), .Z(PO));
  dfntnb dff_1 (.D(na1_ZN), .CP(TCK), .Q(SO), .QN(\so.QN ));
  lantnb latch_1 (.E(C_2), .D(SO), .Q(upo), .QN(\upo.QN ));
endmodule

FIGURE 14.36 Input boundary-scan cell (BSC) for the Threegates ASIC. Compare this to the generic data-register (DR) cell (used as a BSC) shown in Figure 14.2.
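The polarity-hold behavior of the cell in Figure 14.36 can be checked with a small gate-level model of the logic that feeds the shift flip-flop. The sketch below is in Python rather than Verilog and rests on assumptions about the library cells that the source does not state: in01d1 is taken to be an inverter, nd02d1 and nd03d1 to be NAND gates, and oa03d1 (instance oai221_1) to be an OAI221 gate with ZN = ((A1 + A2)(B1 + B2)C)'. The function name bsc_next_so and the mode descriptions are hypothetical.

```python
def bsc_next_so(c_0, c_1, so, si, po):
    """Return the D input of dff_1, i.e. the value SO takes on the next TCK edge.

    All arguments are 0 or 1. Cell functions are assumed (see the text above).
    """
    iv0_zn = 1 - c_0                                  # inv_0
    iv1_zn = 1 - c_1                                  # inv_1
    oa1_zn = 1 - ((c_0 | so) & (iv0_zn | si) & c_1)   # oai221_1, assumed OAI221
    na2_zn = 1 - (po & iv0_zn & iv1_zn)               # nand3_1
    return 1 - (na2_zn & oa1_zn)                      # nand2_1 drives D


# What the model predicts for each (C_0, C_1) combination.
MODES = {
    (0, 0): "capture PO",
    (0, 1): "hold SO (polarity hold)",
    (1, 1): "shift SI",
    (1, 0): "load 0",
}

if __name__ == "__main__":
    for (c_0, c_1), mode in MODES.items():
        d = bsc_next_so(c_0, c_1, so=1, si=0, po=1)
        print(f"C_0={c_0} C_1={c_1}: D={d}  ({mode})")
```

Under these assumptions the model agrees with Table 14.16: C_0 = C_1 = 1 in Shift_DR selects the serial input, and C_0 = 0 with C_1 = 1 in the Exit and Pause states feeds SO back to the flip-flop, so TCK can run ungated.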
14.8.3 ATVG and Fault Simulation

Table 14.17 shows the results of running the Compass ATVG software on the Threegates ASIC. We might ask: Why so many faults? And why is the fault coverage so poor? First we look at the details of the test software output. We notice the following:

• Line 2. The backtrack limit is 30. We do not have any deep complex combinational logic, so this should not cause a problem.
• Lines 4–6. An uncollapsed fault count of 184 indicates the test software has inserted faults on approximately 100 nodes, or at most 50 gates assuming a fanout of 1, fewer gates with any realistic fanout. Clearly this is less than all of the test logic that we have inserted.

TABLE 14.17 ATVG (automatic test-vector generation) report for the Threegates ASIC.

CREATE: Output vector database cell defaulted to [svf]asic_p_ta
CREATE: Backtrack limit defaulted to 30
CREATE: Minimal compression effort: 10 (default)
Fault list generation/collapsing
Total number of faults: 184
Number of faults in collapsed fault list: 80
Vector generation
#
# VECTORS   FAULTS      FAULT COVER
#           processed
#
# 5         184         60.54%
#
# Total number of backtracks: 0
# Highest backtrack : 0
# Total number of vectors : 5
#
# STAR RESULTS summary
#                         Noncollapsed   Collapsed
# Fault counts:
#   Aborted               0              0
#   Detected              89             43
#   Untested              58             20
#                         ------         ------
#   Total of detectable   147            63
#
#   Redundant             6              2
#   Tied                  31             15
#
# FAULT COVERAGE          60.54 %        68.25 %
#
# Fault coverage = nb of detected faults / nb of detectable faults
Vector/fault list database [svf]asic_p_ta created.

To discover why the fault coverage is 68.25 percent we must examine each of the fault categories. First, Table 14.18 shows the undetected faults.

TABLE 14.18 Untested faults (not observable) for the Threegates ASIC.

Faults                   Explanation
TADRIVER4.ZN sa0         Internal driver for BST control bundle (seven more faults like this).
TA_TRST.1.CIN sa0        BST reset TRST is active-low and tied high during test.
TDI.O sa0 sa1            TDI is BST serial input.
UP1_B0.1.CIN sa0 sa1     Data input pad (two more faults like this one).
UP3_1.1.CIN sa0          System reset is active-low and tied high during test.
UP4_1.1.CIN sa0 sa1      System clock input pad.
# Total number: 20

The ATVG program is generating tests for the core using internal scan. We cannot test the BST logic itself, for example. During the production test we shall test the BST logic first, separately from the core; this is often called a flush test. Thus we can ignore any faults from the BST logic for the purposes of internal-scan testing.

Next we find two redundant faults: TA_TDO.1.I sa0 and sa1. Since TDO is three-stated during the test, it makes no difference to the function of the logic if this node is tied high or low; hence these faults are redundant. Again we should ensure these faults will be caught during the flush test. Finally, Table 14.19 shows the tied faults.

TABLE 14.19 Tied faults.

Fault(s)                           Explanation
TADRIVER1.ZN sa0                   Internal BST buffer (seven more faults like this one).
TA_TMS.1.CIN sa0                   TMS input tied low.
TA_TRST.1.CIN sa1                  TRST input tied high.
TEST_LOGIC.BST_CONTROL.U1.ZN sa1   Internal BST logic.
UP1_B0_BST.U1.A2 sa0               Input pad (two more faults like this).
UP3_1.1.CIN sa1                    Reset input pad tied high.
# Total number: 15

Now that we can explain all of the undetectable faults, we examine the detected faults. Table 14.20 shows only the detected faults in the core logic. Faults F1–F8 in the first part of Table 14.20 correspond to the faults in Figure 14.16. The fault list in the second part of Table 14.20 shows each fault in the core and whether it was detected (D) or collapsed and detected as an equivalent fault (CD). There are no undetected faults (U) in the logic core.

TABLE 14.20 Detected core-logic faults in the Threegates ASIC.

Fault(s)                 Explanation
UC1.U2.ZN sa1            F1
UC1.U3.A2 sa1            F2
UC1.U3.ZN sa1            F5
UC1.U4.A1 sa1            F3
UC1.U4.ZN sa1            F6
UC1.U5.ZN sa0            F8
UC1.U5.ZN sa1            F7
UC1.A_R_FF_B2.Q.O sa1    F4

Fault list                                     Explanation
UC1.A_R_FF_B0.Q: (O) CD CD                     SA0 and SA1 collapsed to U3.A1
UC1.A_R_FF_B1.Q: (O) D D                       SA0 and SA1 detected.
UC1.A_R_FF_B2.Q: (O) CD D                      SA0 collapsed to U2. SA1 is F4.
UC1.U2: (I) CD CD (ZN) CD D                    I.SA1/0 collapsed to O.SA1/0. O.SA1 is F1.
UC1.U3: (A1) CD CD (A2) CD D (ZN) CD D         A1.SA1 collapsed to U2.ZN.SA1.
UC1.U4: (A1) CD D (A2) CD CD (ZN) CD D         A2.SA1 collapsed to A_R_FF_B2.Q.SA1.
UC1.U5: (A1) CD CD (A2) CD CD (ZN) D D         A1.SA1 collapsed to U3.ZN.SA1
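The two coverage figures in Table 14.17 follow directly from the definition printed at the end of the report (detected faults divided by detectable faults). As a quick check, here is a minimal sketch in Python; the function name fault_coverage is ours, not part of the Compass software, and the counts are copied from the report.

```python
def fault_coverage(detected, untested, aborted=0):
    """Fault coverage in percent: detected faults / detectable faults.

    Detectable faults are the detected, untested, and aborted faults;
    redundant and tied faults are excluded, as in the report.
    """
    detectable = detected + untested + aborted
    return 100.0 * detected / detectable


# Counts from Table 14.17 (Threegates ASIC).
noncollapsed = fault_coverage(detected=89, untested=58)
collapsed = fault_coverage(detected=43, untested=20)
print(f"{noncollapsed:.2f} %  {collapsed:.2f} %")  # prints "60.54 %  68.25 %"

# The categories also account for every fault in the two fault lists.
assert 89 + 58 + 6 + 31 == 184   # noncollapsed: detectable + redundant + tied
assert 43 + 20 + 2 + 15 == 80    # collapsed
```

Because redundant and tied faults are left out of the denominator, the only way to raise the reported coverage is to move faults from the untested (or aborted) category to the detected category.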
14.8.4 Test Vectors

Next we generate the test vectors for the Threegates ASIC. There are three types of vectors in scan testing. Serial vectors are the bit patterns that we shift into the scan chain. We have three flip-flops in the scan chain plus six boundary-scan cells, so each serial vector is 9 bits long. There are serial input vectors that we apply as a stimulus and serial output vectors that we expect as a response. Parallel vectors are applied to the pads before we shift the serial vectors into the scan chain. We have nine input pads (three core data, one core clock, one core reset, and four input TAP pads: TMS, TCK, TRST, and TDI) and two outputs (one core data output and TDO). Each parallel vector is thus 11 bits long and contains 9 bits of stimulus and 2 bits of response.

A test program consists of applying the stimulus bits from one parallel vector to the nine input pins for one test cycle. In the next nine test cycles we shift a 9-bit stimulus from a serial vector into the scan chain (and receive a 9-bit response, the result of the previous tests, from the scan chain). We can generate the serial and parallel vectors separately, or we can merge the vectors to give a set of broadside vectors. Each broadside vector corresponds to one test cycle and can be used for simulation. Some testers require broadside vectors; others can generate them from the serial and parallel vectors.

TABLE 14.21 Serial test vectors

Serial-input scan data
#1  1 1 1 0 1 0 1 1 0
#2  1 0 1 1 0 1 0 0 1
#3  1 1 0 1 1 1 0 1 0
#4  0 0 0 1 0 0 0 0 0
#5  0 1 0 0 1 1 1 0 1

Bit  Scan-chain element
1    UC1.A_R_FF_B0.Q
2    UC1.A_R_FF_B1.Q
3    UC1.A_R_FF_B2.Q
4    UP1_B2_BST.SO.Q
5    UP1_B1_BST.SO.Q
6    UP1_B0_BST.SO.Q
7    UP2_1_BST.SO.Q
8    UP3_1_BST.SO.Q
9    UP4_1_BST.SO.Q

Fault                    Fault number   Vector number   Core input
UC1.U2.ZN sa1            F1             3               011
UC1.U3.A2 sa1            F2             4               000
UC1.U3.ZN sa1            F5             5               010
UC1.U4.A1 sa1            F3             2               101
UC1.U4.ZN sa1            F6             1               111
UC1.U5.ZN sa0            F8             1               111
UC1.U5.ZN sa1            F7             2               101
UC1.A_R_FF_B2.Q.O sa1    F4             2               101

Table 14.21 shows the serial test vectors for the Threegates ASIC. The third serial test vector is '110111010'. This test vector is shifted into the BSR, so that the first three bits in this vector end up in the first three bits of the BSR. (The first three bits of the BSR, nearest TDI, are the scan flip-flops; the other six bits are boundary-scan cells.) Since UC1.A_R_FF_B0.Q is a_r[0] and so on, the third test vector will set a_r = 011 where a_r[2] = 0. This is the vector we require to test the function a_r[0]'.a_r[1] + a_r[1].a_r[2] for fault UC1.U2.ZN sa1 in the Threegates ASIC. From Figure 14.31 we see that this is a stuck-at-1 fault on the output of the inverter whose input is connected to a_r[0]. This fault corresponds to fault F1 in the circuit of Figure 14.16. The fault simulation we performed earlier told us the vector ABC = 011 is a test for fault F1 for the function A'B + BC.
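We can repeat this check by simulating the core netlist of Figure 14.32 with and without each fault. The following Python sketch is only an illustration: it assumes in01d0 is an inverter and nd02d0 is a two-input NAND, it models the seven gate faults of Table 14.21 (the flip-flop output fault F4 is not modeled), and the names nand, core_outp, and GATE_FAULTS are hypothetical.

```python
def nand(a, b):
    return 0 if (a and b) else 1


def core_outp(a_r0, a_r1, a_r2, stuck=None):
    """Evaluate outp of the core; `stuck` is an optional (node, value) fault."""
    def node(name, value):
        # Force the node to its stuck value if this is the faulty node.
        if stuck is not None and stuck[0] == name:
            return stuck[1]
        return value

    u2_zn = node("U2.ZN", 1 - a_r0)                      # inverter u2
    u3_zn = node("U3.ZN", nand(u2_zn, node("U3.A2", a_r1)))
    u4_zn = node("U4.ZN", nand(node("U4.A1", a_r1), a_r2))
    return node("U5.ZN", nand(u3_zn, u4_zn))


SERIAL_VECTORS = {1: "111010110", 2: "101101001", 3: "110111010",
                  4: "000100000", 5: "010011101"}

# (node, stuck value, vector number) for the gate faults in Table 14.21.
GATE_FAULTS = [("U2.ZN", 1, 3), ("U3.A2", 1, 4), ("U3.ZN", 1, 5),
               ("U4.A1", 1, 2), ("U4.ZN", 1, 1), ("U5.ZN", 0, 1),
               ("U5.ZN", 1, 2)]


def detects(fault_node, stuck_value, vector_number):
    # The first three serial bits load a_r[0], a_r[1], a_r[2] in that order.
    a_r = [int(bit) for bit in SERIAL_VECTORS[vector_number][:3]]
    good = core_outp(*a_r)
    bad = core_outp(*a_r, stuck=(fault_node, stuck_value))
    return good != bad


if __name__ == "__main__":
    for fault_node, value, vector in GATE_FAULTS:
        print(f"UC1.{fault_node} sa{value}, vector #{vector}:",
              "detected" if detects(fault_node, value, vector) else "missed")
```

For vector #3 the good circuit gives outp = 0 and the circuit with U2.ZN stuck at 1 gives outp = 1, so the fault is observable at the core output.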
14.8.5 Production Tester Vector Formats

The final step in test-program generation is to format the test vectors for the production tester. As an example the following shows the Sentry tester file format for testing a D flip-flop. For an average ASIC there would be thousands of vectors in this file.

# Pin declaration: pin names are separated by semi-colons (all pins
# on a bus must be listed and separated by commas)
pre_; clr_; d; clk; q; q_;
# Pin declarations are separated from test vectors by $
$
# The first number on each line is the time since start in ns,
# followed by space or a tab.
# The symbols following the time are the test vectors
# (in the same order as the pin declaration)
# an "=" means don't do anything
# an "s" means sense the pin at the beginning of this time point
# (before the input changes at this time point have any effect)
#
# pcdcqq
# rlal _
# ertk
# __a
00 1010== # clear the flip-flop
10 1110ss # d=1, clock=0
20 1111ss # d=1, clock=1
30 1110ss # d=1, clock=0
40 1100ss # d=0, clock=0
50 1101ss # d=0, clock=1
60 1100ss # d=0, clock=0
70 ====ss
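The comments in the listing describe the file layout well enough to sketch a reader for it. The Python below is a minimal sketch based only on those comments, not on any Sentry documentation: it assumes "#" starts a comment anywhere on a line, treats commas and semicolons alike as pin-name separators, and requires one symbol per declared pin. The function name parse_tester_file is hypothetical.

```python
def parse_tester_file(text):
    """Return (pins, vectors); vectors is a list of (time_ns, {pin: symbol})."""
    pins, vectors, in_vectors = [], [], False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()       # drop comments and blanks
        if not line:
            continue
        if not in_vectors:
            if line == "$":                        # end of pin declarations
                in_vectors = True
            else:
                names = line.replace(",", ";").split(";")
                pins += [name.strip() for name in names if name.strip()]
            continue
        time_ns, symbols = line.split(None, 1)     # time, then one symbol per pin
        symbols = symbols.strip()
        if len(symbols) != len(pins):
            raise ValueError(f"expected {len(pins)} symbols, got {symbols!r}")
        vectors.append((int(time_ns), dict(zip(pins, symbols))))
    return pins, vectors


SAMPLE = """\
# Pin declaration
pre_; clr_; d; clk; q; q_;
$
00 1010== # clear the flip-flop
10 1110ss # d=1, clock=0
20 1111ss # d=1, clock=1
70 ====ss
"""

if __name__ == "__main__":
    pins, vectors = parse_tester_file(SAMPLE)
    print(pins)
    for time_ns, values in vectors:
        print(time_ns, values)
```

A reader like this is mainly useful for sanity checks before a tester run, for example confirming that every vector has one symbol per declared pin and that the time stamps increase.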
14.8.6 Test Flow

Normally we leave test-vector generation and the production-test program generation until the very last step in ASIC design after physical design is complete. All of the steps have been described before the discussion of physical design, because it is still important to consider test very early in the design flow. Next, as an example of considering test as part of logical design, we shall return to our Viterbi decoder example.

TABLE 14.22 Timing effects of test-logic insertion for the Viterbi decoder.

Timing of critical paths before test-logic insertion
# Slack(ns)   Num Paths
# -3.3826     1     *
# -1.7536     18    *******
# -.1245      4     **
# 1.5045      1     *
# 3.1336      0     *
# 4.7626      0     *
# 6.3916      134   ******************************************
# 8.0207      6     ***
# 9.6497      3     **
# 11.2787     0     *
# 12.9078     24    ********
# instance name
#   inPin --> outPin      incr   arrival  trs  rampDel  cap   cell
#                         (ns)   (ns)          (ns)     (pf)
# v_1.u100.u1.subout6.Q_ff_b0
#   CP --> QN             1.73   1.73     R    .20      .10   dfctnb
...
# v_1.u100.u2.metric0.Q_ff_b4
#   setup: D --> CP       .16    21.75    F    .00      .00   dfctnh

After test-logic insertion
# -4.0034     1     *
# -1.9835     18    *****
# .0365       4     **
# 2.0565      1     *
# 4.0764      0     *
# 6.0964      138   *******************************
# 8.1164      2     *
# 10.1363     3     **
# 12.1563     24    ******
# 14.1763     0     *
# 16.1963     187   ******************************************
# v_1.u100.u1.subout7.Q_ff_b1
#   CP --> Q              1.40   1.40     R    .28      .13   mfctnb
...
# v_1.u100.u2.metric0.Q_ff_b4
#   setup: DB --> CP      .39    21.98    F    .00      .00   mfctnh
Notes to Table 14.16:
1. Outputs are specified for each instruction as 0123456, where: 0 = EXTEST, 1 = SAMPLE, 2 = BYPASS, 3 = INTEST, 4 = IDCODE, 5 = RUNBIST, 6 = SCANM.
2. T denotes gated clock TCK.
14.9 The Viterbi Decoder Example

Table 14.22 shows the timing analysis for the Viterbi decoder before and after test insertion. The Compass test software inserts internal scan and boundary scan exactly as in the Threegates example. The timing analysis is in the form of histograms showing the distributions of the timing delays for all paths. In this analysis we set an aggressive constraint of 20 ns (50 MHz) for the clock. The critical path before test insertion is 21.75 ns (the slack is thus negative, at -1.75 ns). The path starts at u1.subout6.Q_ff_b0 and ends at u2.metric0.Q_ff_b4, both flip-flops inside the flattened block, v_1.u100, that we created during synthesis in an attempt to improve speed. The first flip-flop in the path is a dfctnb; the last flip-flop is a dfctnh. The suffix 'b' denotes 1X drive and suffix 'h' denotes 2X drive.

After test insertion the critical path is 21.98 ns. The end point is identical, but the start point is now subout7.Q_ff_b1. This is not too surprising: there is a set of paths of nearly equal length, and changing the flip-flops to their scan versions (mfctnb and mfctnh) increases the delay slightly. The exact delay depends on the capacitive load at the output, the path (clock-to-Q, clock-to-QN, or setup), and the input signal rise time.

Adding test logic has not increased the critical path delay substantially. Almost as important is that the distribution of delays has not changed substantially. Also very important is the fact that the distributions show that there are only approximately 20 paths with delays close to the critical path delay. This means that we should be able to constrain these paths during physical design and achieve a performance after routing that is close to our preroute predictions.
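The slack figures quoted above are simply the clock constraint minus the arrival time at the end of the critical path. A short Python sketch of the arithmetic (the function name slack is ours; the numbers are the 20 ns constraint and the arrival times from Table 14.22):

```python
CLOCK_CONSTRAINT_NS = 20.0   # 50 MHz clock constraint used in the analysis


def slack(arrival_ns, constraint_ns=CLOCK_CONSTRAINT_NS):
    """Slack is the time to spare; a negative value means the path is too slow."""
    return constraint_ns - arrival_ns


before, after = 21.75, 21.98   # critical-path arrival times, Table 14.22
print(f"before test insertion: slack = {slack(before):+.2f} ns")
print(f"after test insertion:  slack = {slack(after):+.2f} ns")
print(f"added delay: {after - before:.2f} ns")
```

The scan flip-flops cost about 0.23 ns on the critical path, roughly 1 percent of the 20 ns clock period.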
TABLE 14.23 Fault coverage for the Viterbi decoder.

Fault list generation/collapsing
Total number of faults: 8846
Number of faults in collapsed fault list: 3869
Vector generation
#
# VECTORS   FAULTS      FAULT COVER
#           processed
#
# 20        7515        82.92%
# 40        8087        89.39%
# 60        8313        91.74%
# 80        8632        95.29%
# 87        8846        96.06%
# Total number of backtracks: 3000
# Highest backtrack : 30
# Total number of vectors : 87
#
# STAR RESULTS summary
#                         Noncollapsed   Collapsed
# Fault counts:
#   Aborted               178            85
#   Detected              8427           3680
#   Untested              168            60
#                         ------         ------
#   Total of detectable   8773           3825
#
#   Redundant             10             6
#   Tied                  63             38
#
# FAULT COVERAGE          96.06 %        96.21 %
Next we check the logic for fault coverage. Table 14.23 shows that the ATPG software has inserted nearly 9000 faults, which is reasonable for the size of our design. Fault coverage is 96 percent. Most of the untested and tied faults arise from the BST logic exactly as we have already described in the Threegates example. If we had not completed this small test case first, we might not have noticed this. The aborted faults are almost all within the large flattened block, v_1.u100. If we assume the approximately 60 faults due to the BST logic are covered by a flush test, our fault coverage increases to 3740/3825 or 98 percent.

To improve upon this figure, some, but not all, of the aborted faults can be detected by substantially increasing the backtrack limit from the default value of 30. To discover the reasons for the remaining aborted faults, we could use a controllability/observability program. If we wish to increase the fault coverage even further, we either need to change our test approach or change the design architecture. In our case we believe that we can probably obtain close to 99 percent stuck-at fault coverage with the existing architecture and thus we are ready to move on to physical design.
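The coverage estimates in this section can be reproduced from the collapsed fault counts in Table 14.23. The Python sketch below assumes, as the text does, that all 60 untested faults belong to the BST logic and are caught by the flush test; the variable names are ours. It also works out how many of the 85 aborted faults would have to be detected to reach the 99 percent target.

```python
# Collapsed fault counts from Table 14.23 (Viterbi decoder).
detected, untested, aborted = 3680, 60, 85
detectable = detected + untested + aborted          # 3825

reported = 100.0 * detected / detectable
with_flush = detected + untested                    # flush test covers BST faults
adjusted = 100.0 * with_flush / detectable

print(f"reported coverage:        {reported:.2f} %")   # 96.21 %
print(f"with flush-test credit:   {adjusted:.2f} %")   # 97.78 %

# Smallest detected count that gives at least 99 percent (integer ceiling).
target_detected = (99 * detectable + 99) // 100
extra_needed = target_detected - with_flush
print(f"aborted faults to recover for 99 %: {extra_needed} of {aborted}")
```

Under these assumptions a little more than half of the aborted faults (47 of 85) must be recovered, for example by raising the backtrack limit, to reach 99 percent.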
14.10 Summary
The primary reason to consider test early during ASIC design is that it can become very expensive if we do not. The important points we covered in this chapter are:
• Boundary scan
• Single stuck-at fault model
• Controllability and observability
• ATPG using test vectors
• BIST with no test vectors
ASIC CONSTRUCTION
A town planner works out the number, types, and sizes of buildings in a development project. An architect designs each building, including the arrangement of the rooms in each building. Then a builder carries out the construction according to the architect's drawings. Electrical wiring is one of the last steps in the construction of each building. The physical design of ASICs is normally divided into system partitioning, floorplanning, placement, and routing. A microelectronic system is the town and the ASICs are the buildings. System partitioning corresponds to town planning, ASIC floorplanning is the architect's job, placement is done by the builder, and the routing is done by the electrician. We shall design most, but not all, ASICs using these design steps.

15.1 Physical Design
15.2 CAD Tools
15.3 System Partitioning
15.4 Estimating ASIC Size
15.5 Power Dissipation
15.6 FPGA Partitioning
15.7 Partitioning Methods
15.8 Summary
15.9 Problems
15.10 Bibliography
15.11 References
15.1 Physical Design

Figure 15.1 shows part of the design flow, the physical design steps, for an ASIC (omitting simulation, test, and other logical design steps that have already been covered). Some of the steps in Figure 15.1 might be performed in a different order from that shown. For example, we might, depending on the size of the system, perform system partitioning before we do any design entry or synthesis. There may be some iteration between the different steps too.
FIGURE 15.1 Part of an ASIC design flow showing the system partitioning, floorplanning, placement, and routing steps. These steps may be performed in a slightly different order, iterated or omitted depending on the type and size of the system and its ASICs. As the focus shifts from logic to interconnect, floorplanning assumes an increasingly important role. Each of the steps shown in the figure must be performed and each depends on the previous step. However, the trend is toward completing these steps in a parallel fashion and iterating, rather than in a sequential manner.
We must first apply system partitioning to divide a microelectronics system into separate ASICs. In floorplanning we estimate sizes and set the initial relative locations of the various blocks in our ASIC (sometimes we also call this chip planning). At the same time we allocate space for clock and power wiring and decide on the location of the I/O and power pads. Placement defines the location of the logic cells within the flexible blocks and sets aside space for the interconnect to each logic cell. Placement for a gate-array or standard-cell design assigns each logic cell to a position in a row. For an FPGA, placement chooses which of the fixed logic resources on the chip are used for which logic cells. Floorplanning and placement are closely related and are sometimes combined in a single CAD tool.

Routing makes the connections between logic cells. Routing is a hard problem by itself and is normally split into two distinct steps, called global and local routing. Global routing determines where the interconnections between the placed logic cells and blocks will be situated. Only the routes to be used by the interconnections are decided in this step, not the actual locations of the interconnections within the wiring areas. Global routing is sometimes called loose routing for this reason. Local routing joins the logic cells with interconnections. Information on which interconnection areas to use comes from the global router. Only at this stage of layout do we finally decide on the width, mask layer, and exact location of the interconnections. Local routing is also known as detailed routing.
15.2 CAD Tools

In order to develop a CAD tool it is necessary to convert each of the physical design steps to a problem with well-defined goals and objectives. The goals for each physical design step are the things we must achieve. The objectives for each step are things we would like to meet on the way to achieving the goals. Some examples of goals and objectives for each of the ASIC physical design steps are as follows:

System partitioning:
• Goal. Partition a system into a number of ASICs.
• Objectives. Minimize the number of external connections between the ASICs. Keep each ASIC smaller than a maximum size.

Floorplanning:
• Goal. Calculate the sizes of all the blocks and assign them locations.
• Objective. Keep the highly connected blocks physically close to each other.

Placement:
• Goal. Assign the interconnect areas and the location of all the logic cells within the flexible blocks.
• Objectives. Minimize the ASIC area and the interconnect density.

Global routing:
• Goal. Determine the location of all the interconnect.
• Objective. Minimize the total interconnect area used.

Detailed routing:
• Goal. Completely route all the interconnect on the chip.
• Objective. Minimize the total interconnect length used.
There is no magic recipe involved in the choice of the ASIC physical design steps. These steps have been chosen simply because, as tools and techniques have developed historically, these steps proved to be the easiest way to split up the larger problem of ASIC physical design. The boundaries between the steps are not cast in stone. For example, floorplanning and placement are often thought of as one step and in some tools placement and routing are performed together.
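To see what a well-defined goal and objective look like to a CAD tool, consider the system-partitioning entry in the list above. The Python sketch below is only an illustration with a made-up five-net system: the block names, net names, sizes, and the functions external_connections and fits are all hypothetical. The goal is met by any assignment of blocks to ASICs; the objectives are measured by counting the nets that cross between ASICs and checking each ASIC against a maximum size.

```python
def external_connections(nets, assignment):
    """Count nets that connect blocks placed in more than one ASIC."""
    return sum(1 for blocks in nets.values()
               if len({assignment[block] for block in blocks}) > 1)


def fits(assignment, sizes, max_size):
    """Check that the total block size in every ASIC is within max_size."""
    totals = {}
    for block, asic in assignment.items():
        totals[asic] = totals.get(asic, 0) + sizes[block]
    return all(total <= max_size for total in totals.values())


# Hypothetical system: four blocks, five nets, sizes in arbitrary gate units.
nets = {"n1": ["A", "B"], "n2": ["B", "C"], "n3": ["C", "D"],
        "n4": ["A", "D"], "n5": ["A", "C"]}
sizes = {"A": 40, "B": 30, "C": 35, "D": 25}

partition_1 = {"A": 0, "B": 0, "C": 1, "D": 1}
partition_2 = {"A": 0, "C": 0, "B": 1, "D": 1}

for name, partition in (("partition 1", partition_1), ("partition 2", partition_2)):
    print(name, external_connections(nets, partition), fits(partition, sizes, 80))
```

Both partitions satisfy the goal and the size limit, but the first needs three external connections and the second four, so the first is the better solution by this measure.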
15.2.1 Methods and Algorithms

A CAD tool needs methods or algorithms to generate a solution to each problem using a reasonable amount of computer time. Often there is no best solution possible to a particular problem, and the tools must use heuristic algorithms, or rules of thumb, to try and find a good solution. The term algorithm is usually reserved for a method that always gives a solution.

We need to know how practical any algorithm is. We say the complexity of an algorithm is O(f(n)) (read as order f(n)) if there are constants k and n_0 so that the running time of the algorithm T(n) is less than k f(n) for all n > n_0 [Sedgewick, 1988]. Here n is a measure of the size of the problem (number of transistors, number of wires, and so on). In ASIC design n is usually very large.

We have to be careful, though. The notation does not specify the units of time. An algorithm that is O(n^2) nanoseconds might be better than an algorithm that is O(n) seconds, for quite large values of n. The notation O(n) refers to an upper limit on the running time of the algorithm. A practical example may take less running time; it is just that we cannot prove it. We also have to be careful of the constants k and n_0. They can hide overhead present in the implementation and may be large enough to mask the dependence on n, up to large values of n. The function f(n) is usually one of the following kinds:
• f(n) = constant. The algorithm is constant in time. In this case, steps of the algorithm are repeated once or just a few times. It would be nice if our algorithms had this property, but it does not usually happen in ASIC design.
• f(n) = log n. The algorithm is logarithmic in time. This usually happens when a big problem is (possibly recursively) transformed into a smaller one.
• f(n) = n. The algorithm is linear in time. This is a good situation for an ASIC algorithm that works with n objects.
• f(n) = n log n. This type of algorithm arises when a large problem is split into a number of smaller problems, each solved independently.
• f(n) = n^2. The algorithm is quadratic in time and usually only practical for small ASIC problems.
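The warning about units can be made concrete. Suppose, purely for illustration, that one algorithm takes exactly n^2 nanoseconds and another takes exactly n seconds (so k = 1 in both cases; real constants would differ). The short Python sketch below compares the two in integer nanoseconds; the function names are ours.

```python
NS_PER_SECOND = 10**9


def quadratic_ns(n):
    """Running time, in ns, of a hypothetical algorithm taking n^2 nanoseconds."""
    return n * n


def linear_ns(n):
    """Running time, in ns, of a hypothetical algorithm taking n seconds."""
    return n * NS_PER_SECOND


for n in (10**3, 10**6, 10**9, 10**12):
    q, l = quadratic_ns(n), linear_ns(n)
    verdict = "quadratic wins" if q < l else "tie" if q == l else "linear wins"
    print(f"n = {n:.0e}: n^2 ns = {q / NS_PER_SECOND:.3g} s, "
          f"n s = {l / NS_PER_SECOND:.3g} s  ({verdict})")
```

With these constants the quadratic algorithm is faster for every problem smaller than 10^9 objects, which covers most practical ASIC problems. The order notation alone does not tell us which algorithm to choose.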
If the time it takes to solve a problem increases with the size of the problem at a rate that is polynomial but faster than quadratic (or worse, in an exponential fashion), it is usually not appropriate for ASIC design.

Even after subdividing the ASIC physical design problem into smaller steps, each of the steps still results in problems that are hard to solve automatically. In fact, each of the ASIC physical design steps, in general, belongs to a class of mathematical problems known as NP-complete problems. This means that it is unlikely we can find an algorithm to solve the problem exactly in polynomial time.

Suppose we find a practical method to solve our problem. Even if we can find a solution, we now have a dilemma. How shall we know if we have a good solution if, because the problem is NP-complete, we cannot find the optimum or best solution to which to compare it? We need to know how close we are to the optimum solution to a problem, even if that optimum solution cannot be found exactly. We need to make a quantitative measurement of the quality of the solution that we are able to find. Often we combine several parameters or metrics that measure our goals and objectives into a measurement function or objective function. If we are minimizing the measurement function, it is a cost function. If we are maximizing the measurement function, we call the function a gain function (sometimes just gain).
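One common way to combine several metrics into a single measurement function is a weighted sum. The Python sketch below is an assumed form chosen for illustration, not a formula from the text: the metric names, the weights, and the two candidate layouts are all hypothetical.

```python
def cost(metrics, weights):
    """Weighted-sum cost function: smaller is better."""
    return sum(weights[name] * metrics[name] for name in weights)


def gain(metrics, weights):
    """The same measurement expressed as a gain function: larger is better."""
    return -cost(metrics, weights)


# Hypothetical weights and candidate solutions.
weights = {"area_mm2": 1.0, "wire_length_mm": 0.5}
candidates = {
    "layout_a": {"area_mm2": 100, "wire_length_mm": 400},   # cost 300
    "layout_b": {"area_mm2": 120, "wire_length_mm": 300},   # cost 270
}

best = min(candidates, key=lambda name: cost(candidates[name], weights))
print(best, cost(candidates[best], weights))
```

The choice of weights encodes the trade-off between objectives: with these values layout_b is preferred even though it is larger, because its shorter wiring outweighs the extra area.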
15.2 CAD Tools
Now we are ready to solve each of the ASIC physical design steps with the following items in hand: a set of goals and objectives, a way to measure the goals and objectives, and an algorithm or method to find a solution that meets the goals and objectives. As designers attempt to achieve a desired ASIC performance they make a continuous trade-off between speed, area, power, and several other factors. Presently CAD tools are not smart enough to be able to do this alone. In fact, current CAD tools are only capable of finding a solution subject to a few, very simple, objectives.
15.3 System Partitioning

Microelectronic systems typically consist of many functional blocks. If a functional block is too large to fit in one ASIC, we may have to split, or partition, the function into pieces using goals and objectives that we need to specify. For example, we might want to minimize the number of pins for each ASIC to minimize package cost. We can use CAD tools to help us with this type of system partitioning. Figure 15.2 shows the system diagram of the Sun Microsystems SPARCstation 1. The system is partitioned as follows; the numbers refer to the labels in Figure 15.2. (See Section 1.3, Case Study, for the sources of information in this section.)
• Nine custom ASICs (1–9)
• Memory subsystems (SIMMs, single-in-line memory modules): CPU cache (10), RAM (11), memory cache (12, 13)
• Six ASSPs (application-specific standard products) for I/O (14–19)
• An ASSP for time of day (20)
• An EPROM (21)
• Video memory subsystem (22)
• One analog/digital ASSP DAC (digital-to-analog converter) (23)
Table 15.1 shows the details of the nine custom ASICs used in the SPARCstation 1. Some of the partitioning of the system shown in Figure 15.2 is determined by whether to use ASSPs or custom ASICs. Some of these design decisions are based on intangible issues: time to market, previous experience with a technology, the ability to reuse part of a design from a previous product. No CAD tools can help with such decisions. The goals and objectives are too poorly defined and finding a way to
measure these factors is very difficult. CAD tools cannot answer a question such as: What is the cheapest way to build my system? but can help the designer answer the question: How do I split this circuit into pieces that will fit on a chip? Table 15.2 shows the partitioning of the SPARCstation 10 so you can compare it to the SPARCstation 1. Notice that the gate counts of nearly all of the SPARCstation 10 ASICs have increased by a factor of 10, but the pin counts have increased by a smaller factor.
FIGURE 15.2 The Sun Microsystems SPARCstation 1 system block diagram. The acronyms for the various ASICs are listed in Table 15.1.

TABLE 15.1 System partitioning for the Sun Microsystems SPARCstation 1.
     SPARCstation 1 ASIC                        Gates/k-gate   Pins   Package   Type
  1  SPARC IU (integer unit)                    20             179    PGA       CBIC
  2  SPARC FPU (floating-point unit)            50             144    PGA       FC
  3  Cache controller                           9              160    PQFP      GA
  4  MMU (memory-management unit)               5              120    PQFP      GA
  5  Data buffer                                3              120    PQFP      GA
  6  DMA (direct memory access) controller      9              120    PQFP      GA
  7  Video controller/data buffer               4              120    PQFP      GA
  8  RAM controller                             1              100    PQFP      GA
  9  Clock generator                            1              44     PLCC      GA
Abbreviations: PGA = pin-grid array; PQFP = plastic quad flat pack; PLCC = plastic leaded chip carrier; CBIC = LSI Logic cell-based ASIC; GA = LSI Logic channelless gate array; FC = full custom.
15.4 Estimating ASIC Size

Table 15.3 shows some useful numbers for estimating ASIC die size. Suppose we wish to estimate the die size of a 40 k-gate ASIC in a 0.35 µm gate array, three-level metal process with 166 I/O pads. For this ASIC the minimum feature size is 0.35 µm. Thus λ (one-half the minimum feature size) = 0.35 µm/2 = 0.175 µm. Using our data and Table 15.3, we can derive the following information. We know that 0.35 µm standard-cell density is roughly 5 × 10⁻⁴ gate/λ². From this we can calculate the gate density for a 0.35 µm gate array:

$$\text{gate density} = (0.35\ \mu\text{m standard-cell density}) \times (0.8 \text{ to } 0.9) = 4 \times 10^{-4} \text{ to } 4.5 \times 10^{-4}\ \text{gate}/\lambda^2 \qquad (15.1)$$

This gives the core size (logic and routing only) as

$$\frac{4 \times 10^{4}\ \text{gates}}{\text{gate density}} \times \text{routing factor} \times \frac{1}{\text{gate-array utilization}} = \frac{4 \times 10^{4}}{4 \times 10^{-4} \text{ to } 4.5 \times 10^{-4}} \times (1 \text{ to } 2) \times \frac{1}{0.8 \text{ to } 0.9} = 10^{8} \text{ to } 2.5 \times 10^{8}\ \lambda^2 = 4840 \text{ to } 11{,}900\ \text{mil}^2 \qquad (15.2)$$
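The core-size estimate of Eqs. 15.1 and 15.2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the conversion uses 25.4 µm per mil, and the best and worst cases simply pair the extremes of each range, so the results land close to, but not exactly on, the rounded figures quoted in the text.

```python
UM_PER_MIL = 25.4  # assumed conversion: 1 mil = 25.4 micrometers


def core_area_lambda2(gates, std_cell_density, ga_factor,
                      routing_factor, utilization):
    """Core area in lambda^2 for a gate array, following Eqs. 15.1 and 15.2."""
    gate_density = std_cell_density * ga_factor       # Eq. 15.1, gate/lambda^2
    return gates / gate_density * routing_factor / utilization  # Eq. 15.2


def lambda2_to_mil2(area_lambda2, lambda_um):
    """Convert an area in lambda^2 to mil^2 for a given lambda in micrometers."""
    return area_lambda2 * lambda_um ** 2 / UM_PER_MIL ** 2


if __name__ == "__main__":
    lam = 0.35 / 2  # lambda is half the minimum feature size
    best = core_area_lambda2(4e4, 5e-4, 0.9, 1.0, 0.9)
    worst = core_area_lambda2(4e4, 5e-4, 0.8, 2.0, 0.8)
    print(f"core area: {best:.3g} to {worst:.3g} lambda^2")
    print(f"core area: {lambda2_to_mil2(best, lam):.0f} to "
          f"{lambda2_to_mil2(worst, lam):.0f} mil^2")
```

Keeping the ranges as explicit arguments makes it easy to see which assumption (routing factor or utilization) dominates the spread in the estimate.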
TABLE 15.2 System partitioning for the Sun Microsystems SPARCstation 10.
     SPARCstation 10 ASIC                       Gates             Pins   Package   Type
  1  SuperSPARC Superscalar SPARC               3 M-transistors   293    PGA       FC
  2  SuperCache cache controller                2 M-transistors   369    PGA       FC
  3  EMC memory control                         40 k-gate         299    PGA       GA
  4  MSI MBus–SBus interface                    40 k-gate         223    PGA       GA
  5  DMA2 Ethernet, SCSI, parallel port         30 k-gate         160    PQFP      GA
  6  SEC SBus to 8-bit bus                      20 k-gate         160    PQFP      GA
  7  DBRI dual ISDN interface                   72 k-gate         132    PQFP      GA
  8  MMCodec stereo codec                       32 k-gate         44     PLCC      FC
Abbreviations: PGA = pin-grid array; PQFP = plastic quad flat pack; PLCC = plastic leaded chip carrier; GA = channelless gate array; FC = full custom.
We shall need to add (0.175/0.5) × 2 × (15 to 20) = 10.5 to 14 mil (per side) for the pad heights (we included the effects of scaling in this calculation). With a pad pitch of 5 mil and roughly 166/4 = 42 I/Os per side (not counting any power pads), we need a die at least 5 × 42 = 210 mil on a side for the I/Os. Thus the die size must be at least 210 × 210 = 4.4 × 10⁴ mil² to fit 166 I/Os. Of this die area only 1.19 × 10⁴/(4.4 × 10⁴) = 27 % (at most) is used by the core logic. This is a severely pad-limited design and we need to rethink the partitioning of this system.

Table 15.4 shows some typical areas for datapath elements. You would use many of these datapath elements in floating-point arithmetic (these elements are large; you should not use floating-point arithmetic unless you have to):
• A leading-one detector with barrel shifter normalizes a mantissa.
• A priority encoder corrects exponents due to mantissa normalization.
• A denormalizing barrel shifter aligns mantissas.
• A normalizing barrel shifter with a leading-one detector normalizes mantissa subtraction.

TABLE 15.3 Some useful numbers for ASIC estimates, normalized to a 1 µm technology unless noted.
Columns: Parameter; Typical value; Comment (1); Scaling

Lambda, λ
  Typical value: 0.5 µm = 0.5 × (minimum feature size)
  Comment: In a 1 µm technology, λ ≈ 0.5 µm.
  Scaling: NA
CAD pitch
  Typical value: 1 micron = 10⁻⁶ m = 1 µm = minimum feature size
  Comment: Not to be confused with minimum CAD grid size (which is usually less than 0.01 µm).
Effective gate length
  Typical value: 0.25 to 1.0 µm
  Comment: Less than drawn gate length, usually by about 10 percent.
I/O-pad width (pitch)
  Typical value: 5 to 10 mil = 125 to 250 µm
  Comment: For a 1 µm technology, 2LM (λ = 0.5 µm). Scales less than linearly with λ.
I/O-pad height
  Typical value: 15 to 20 mil = 375 to 500 µm
  Comment: For a 1 µm technology, 2LM (λ = 0.5 µm). Scales approximately linearly with λ.
Large die
  Typical value: 1000 mil/side, 10⁶ mil²
  Comment: Approximately constant
  Scaling: 1
Small die
  Typical value: 100 mil/side, 10⁴ mil²
  Comment: Approximately constant
  Scaling: 1
Standard-cell density
  Typical value: 1.5 × 10⁻³ gate/µm² = 1.0 gate/mil²
  Comment: For 1 µm, 2LM, library = 4 × 10⁻⁴ gate/λ² (independent of scaling).
  Scaling: 1/λ²
Standard-cell density
  Typical value: 8 × 10⁻³ gate/µm² = 5.0 gate/mil²
  Comment: For 0.5 µm, 3LM, library = 5 × 10⁻⁴ gate/λ² (independent of scaling).
  Scaling: 1/λ²
Gate-array utilization
  Typical value: 60 to 80 %
  Comment: For 2LM, approximately constant
Gate-array utilization
  Typical value: 80 to 90 %
  Comment: For 3LM, approximately constant
Gate-array density
  Typical value: (0.8 to 0.9) × standard-cell density
  Comment: For the same process as standard cells
Standard-cell routing factor = (cell area + route area)/cell area
  Typical value: 1.5 to 2.5 (2LM); 1.0 to 2.0 (3LM)
  Comment: Approximately constant
Package cost
  Typical value: $0.01/pin, penny per pin
  Comment: Varies widely, figure is for low-cost plastic package, approximately constant
Wafer cost
  Typical value: $1 k to $5 k, average $2 k
  Comment: Varies widely, figure is for a mature, 2LM CMOS process, approximately constant
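The pad-limit check in the die-size example (166 I/O pads on a 5 mil pitch) can be sketched as follows. The helper names are illustrative assumptions, the pads are assumed to be spread evenly over four sides, and power pads and corner effects are ignored, as in the text.

```python
import math


def pad_limited_side_mil(num_io, pad_pitch_mil):
    """Minimum die side (mil) needed to fit num_io pads spread over four sides."""
    pads_per_side = math.ceil(num_io / 4)
    return pads_per_side * pad_pitch_mil


def core_fraction(core_area_mil2, die_side_mil):
    """Fraction of a square die occupied by the core logic."""
    return core_area_mil2 / die_side_mil ** 2


if __name__ == "__main__":
    side = pad_limited_side_mil(166, 5.0)    # 42 pads per side on a 5 mil pitch
    fraction = core_fraction(1.19e4, side)   # largest core estimate from Eq. 15.2
    print(f"die side >= {side:.0f} mil, core uses {fraction:.0%} of the die")
```

A core fraction well below one half is a quick signal that the design is pad-limited and that the partitioning (and hence the pin count) deserves another look.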
TABLE 15.4 Area estimates for datapath functions. (2)
Datapath function                                      Area per bit/λ²                  Area/λ² (32-bit)   Area/λ² (64-bit)
High-speed comparator (4–32 bit)                       24,000                           7.7E+05            1.5E+06
High-speed comparator (32–128 bit)                     28,800                           9.2E+05            1.8E+06
Leading-one detector (n-bit)                           7200 log₂ n                      1.2E+06            2.8E+06
All-ones detector (n-bit)                              6000 + 800 log₂ n                3.2E+05            6.9E+05
Priority encoder (n-bit)                               19,000 + 1400 log₂ (n − 2)       8.4E+05            1.8E+06
Zero detector (n-bit)                                  5500 + 800 log₂ n                3.0E+05            6.6E+05
Barrel shifter/rotator (n- by m-bit)                   19,000 + 1000 n + 1600 m         3.4E+06            1.2E+07
Carry-save adder                                       24,000                           7.7E+05            1.5E+06
Digital delay line (n delay stages, t output taps)     12,000 + 6000 n + 8400 t         1.5E+07            6.0E+07
Synchronous FIFO (n-bit)                               34,000 + 9600 n                  1.1E+07            4.1E+07
Multiplier-accumulator (n-bit)                         190,000 + 18,000 n               2.4E+07            8.5E+07
Unsigned multiplier (n- by m-bit)                      54,000 + 18,000 (n − 2)          1.9E+07            7.4E+07
2:1 MUX                                                7200                             2.3E+05            4.6E+05
8:1 MUX                                                29,000                           9.2E+05            1.8E+06
Low-speed adder                                        28,000                           8.8E+05            1.8E+06
2901 ALU                                               41,000                           1.3E+06            2.6E+06
Low-speed adder/subtracter                             30,000                           9.6E+05            1.9E+06
Sync. up–down counter with sync. load and clear        43,000                           1.4E+06            2.8E+06
Low-speed decrementer                                  14,000                           4.6E+05            9.2E+05
Low-speed incrementer                                  14,000                           4.6E+05            9.2E+05
Low-speed incrementer/decrementer                      20,000                           6.5E+05            1.3E+06
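The per-bit formulas of Table 15.4 can be turned into total-area estimates by multiplying by the datapath width. The sketch below does this for a handful of single-parameter entries; the dictionary keys and the function name datapath_area are illustrative assumptions, and only formulas that appear in the table are used.

```python
import math

# Area per bit in lambda^2 as a function of the datapath width n (Table 15.4).
AREA_PER_BIT = {
    "leading_one_detector": lambda n: 7200 * math.log2(n),
    "all_ones_detector": lambda n: 6000 + 800 * math.log2(n),
    "zero_detector": lambda n: 5500 + 800 * math.log2(n),
    "synchronous_fifo": lambda n: 34000 + 9600 * n,
    "multiplier_accumulator": lambda n: 190000 + 18000 * n,
}


def datapath_area(function, n_bits):
    """Total area in lambda^2: (area per bit) x (number of bits)."""
    return AREA_PER_BIT[function](n_bits) * n_bits


if __name__ == "__main__":
    for name in AREA_PER_BIT:
        print(f"{name:24s} 32-bit: {datapath_area(name, 32):9.2e}   "
              f"64-bit: {datapath_area(name, 64):9.2e}")
```

The printed values can be compared against the 32-bit and 64-bit columns of the table; the logarithmic entries grow only slightly faster than the width, while the linear-per-bit entries grow roughly as the square of the width.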
Most datapath elements have an area per bit that depends on the number of bits in the datapath (the datapath width). Sometimes this dependency is linear (for the multipliers and the barrel shifter, for example); in other elements it depends on the logarithm (to base 2) of the datapath width (the leading-one, all-ones, and zero detectors, for example). In some elements you might expect there to be a dependency on datapath width, but it is small (the comparators are an example). The area estimates given in Table 15.4 can be misleading. The exact size of an adder, for example, depends on the architecture: carry-save, carry-select, carry-lookahead, or ripple-carry (which depends on the speed you require). These area figures also exclude the routing between datapath elements, which is difficult to predict; it will depend on the number and size of the datapath elements, their type, and how much logic is random and how much is datapath.

Figure 15.3(a) shows the typical size of SRAM constructed on an ASIC. These figures are based on the use of a RAM compiler (as opposed to building memory from flip-flops or latches) using a standard CMOS ASIC process, typically using a six-transistor cell. The actual size of a memory will depend on (1) the required access time, (2) the use of synchronous or asynchronous read or write, (3) the number and type of ports (read–write), (4) the use of special design rules, (5) the number of interconnect layers available, (6) the RAM architecture (number of devices in RAM cell), and (7) the process technology (active pull-up devices or pull-up resistors).

FIGURE 15.3 (a) ASIC memory size. These figures are for static RAM constructed using compilers in a 2LM ASIC process, but with no special memory design rules. The actual area of a RAM will depend on the speed and number of read–write ports. (b) Multiplier size for a 2LM process. The actual area will depend on the multiplier architecture and speed.

The maximum size of SRAM in Figure 15.3(a) is 32 k-bit, which occupies approximately 6.0 × 10⁷ λ². In a 0.5 µm process (with λ = 0.25 µm), the area of a 32 k-bit SRAM is 6.0 × 10⁷ × 0.25 × 0.25 = 3.75 × 10⁶ µm² (or about 2 mm on a side, a large piece of silicon). If you need an SRAM that is larger than this, you probably need to consult with your ASIC vendor to determine the best way to implement a large on-chip memory. Figure 15.3(b) shows the typical sizes for multipliers. Again the actual multiplier size will depend on the architecture (Booth encoding, Wallace tree, and so on), the process technology, and design rules.
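The SRAM area conversion from λ² to physical units can be sketched as below. The function name is a hypothetical helper, and the code assumes a square block when quoting the length of a side.

```python
import math


def lambda2_to_um2(area_lambda2, feature_size_um):
    """Convert an area in lambda^2 to um^2, with lambda = feature size / 2."""
    lam = feature_size_um / 2
    return area_lambda2 * lam * lam


if __name__ == "__main__":
    area_um2 = lambda2_to_um2(6.0e7, 0.5)      # 32 k-bit SRAM in a 0.5 um process
    side_mm = math.sqrt(area_um2) / 1000.0     # side of an equivalent square block
    print(f"area = {area_um2:.3g} um^2, about {side_mm:.1f} mm on a side")
```

The same helper can be reused for the datapath areas of Table 15.4, since they are also quoted in λ².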
Table 15.5 shows some estimated gate counts for medium-size functions corresponding to some popular ASSP devices.

TABLE 15.5 Gate size estimates for popular ASSP functions.
ASSP device   Function                                                          Gate estimate
8251A         Universal synchronous/asynchronous receiver/transmitter (USART)   2900
8253          Programmable interval timer                                       5680
8255A         Programmable peripheral interface                                 784–1403
8259          Programmable interrupt controller                                 2205
8237          Programmable DMA controller                                       5100
8284          Clock generator/driver                                            99
8288          Bus controller                                                    250
8254          Programmable interval timer                                       3500
6845          CRT controller                                                    2843
87030         SCSI controller                                                   3600
87012         Ethernet controller                                               3900
2901          4-bit ALU                                                         917
2902          Carry-lookahead ALU                                               33
2904          Status and shift control                                          500
2910          12-bit microprogram controller                                    1100
Source: Fujitsu channelless gate-array data book, AU and CG21 series.
1. 2LM = two-level metal; 3LM = three-level metal. 2. Area estimates are for a two-level metal (2 LM) process. Areas for a three-level metal (3LM) process are approximately 0.75 to 1.0 times these figures.
15.5 Power Dissipation

Power dissipation in CMOS logic arises from the following sources:
• Dynamic power dissipation due to switching current from charging and discharging parasitic capacitance.
• Dynamic power dissipation due to short-circuit current when both n-channel and p-channel transistors are momentarily on at the same time.
• Static power dissipation due to leakage current and subthreshold current.
15.5.1 Switching Current

When the p-channel transistor in an inverter is charging a capacitance, C, at a frequency, f, the current through the transistor is C(dV/dt). The power dissipation is thus CV(dV/dt) for one-half the period of the input, t = 1/(2f). The energy dissipated in the p-channel transistor during this half-period is thus

$$\int_0^{1/(2f)} C V \frac{dV}{dt}\, dt = \int_0^{V_{DD}} C V \, dV = 0.5\, C V_{DD}^2 \qquad (15.3)$$

When the n-channel transistor discharges the capacitor, the energy dissipated is equal, so each full cycle dissipates C V_DD². Multiplying by the frequency, f, gives the total power dissipation

$$P_1 = f C V_{DD}^2 \qquad (15.4)$$
Most of the power dissipation in a CMOS ASIC arises from this source: the switching current. The best way to reduce power is to reduce V_DD (because it appears as a squared term in Eq. 15.4), and to reduce C, the amount of capacitance we have to switch. A rough estimate is that 20 percent of the nodes switch (or toggle) in a circuit per clock cycle. To determine more accurately the power dissipation due to switching, we need to find out how many nodes toggle during typical circuit operation using a dynamic logic simulator. This requires input vectors that correspond to typical operation, which can be difficult to produce. Using a digital simulator also will not take into account the effect of glitches, which can be significant. Power simulators are usually a hybrid between SPICE transistor-level simulators and digital event-driven simulators [Najm, 1994].
15.5.2 Short-Circuit Current

The short-circuit current or crowbar current can be particularly important for output drivers and large clock buffers. For a CMOS inverter (see Problem 15.17) the power dissipation due to the crowbar current is

$$P_2 = \frac{\beta}{12}\, f\, t_{rf}\, (V_{DD} - 2 V_{tn})^3 \qquad (15.5)$$

where we assume the following: We ratio the p-channel and n-channel transistor sizes so that β = (W/L) µ C_ox is the same for both p- and n-channel transistors, the magnitudes of the threshold voltages, V_tn, are assumed equal for both transistor types, and t_rf is the rise and fall time (assumed equal) of the input signal [Veendrick, 1984]. For example, consider an output buffer that is capable of sinking 12 mA at an output voltage of 0.5 V. From Eq. 2.9 we can derive the transistor gain factor that we need as follows:

$$I_{DS} = \beta \left[ (V_{GS} - V_{tn}) - 0.5\, V_{DS} \right] V_{DS}$$

$$\beta = \frac{12 \times 10^{-3}}{\left[ (3.3 - 0.65) - (0.5)(0.5) \right] (0.5)} = 0.01\ \mathrm{A\,V^{-2}} \qquad (15.6)$$

If the output buffer is switching at 100 MHz and the input rise time to the buffer is 2 ns, we can calculate the power dissipation due to short-circuit current as

$$P_2 = \frac{\beta}{12}\, f\, t_{rf}\, (V_{DD} - 2 V_{tn})^3 = \frac{(0.01)(100 \times 10^{6})(2 \times 10^{-9})(3.3 - (2)(0.65))^3}{12} = 0.00133\ \mathrm{W} \qquad (15.7)$$

or about 1 mW. If the output load is 10 pF, the dissipation due to switching current is

$$P_1 = f C V_{DD}^2 = (100 \times 10^{6})(10 \times 10^{-12})(3.3)^2 = 0.01089\ \mathrm{W}$$

or about 10 mW. As a general rule, if we adjust the transistor sizes so that the rise times and fall times through a chain of logic are approximately equal (as they should be), the short-circuit current is typically less than 20 percent of the switching current. For the example output buffer, we can make a rough estimate of the output-node switching time by assuming the buffer output drive current is constant at 12 mA. This current will cause the voltage on the output load capacitance to change between 3.3 V and 0 V at a constant slew rate dV/dt for a time

$$t = \frac{CV}{I} = \frac{(10 \times 10^{-12})(3.3)}{12 \times 10^{-3}} \approx 2.75\ \mathrm{ns} \qquad (15.8)$$

This is close to the input rise time of 2 ns. So our estimate of the short-circuit current being less than 20 percent of the switching current, assuming equal input rise time and output rise time, is valid in this case.
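The output-buffer example above can be reproduced numerically. The following is a minimal sketch using the formulas for the gain factor, the short-circuit power P2, the switching power P1, and the constant-current slew time; the function names are illustrative assumptions, and all quantities are in SI units.

```python
def gain_factor(i_ds, v_gs, v_tn, v_ds):
    """Transistor gain factor (A/V^2) from the linear-region drain current."""
    return i_ds / (((v_gs - v_tn) - 0.5 * v_ds) * v_ds)


def short_circuit_power(beta, f, t_rf, v_dd, v_tn):
    """Short-circuit (crowbar) power P2 in watts."""
    return (beta / 12.0) * f * t_rf * (v_dd - 2.0 * v_tn) ** 3


def switching_power(f, c, v_dd):
    """Switching power P1 = f * C * VDD^2 in watts."""
    return f * c * v_dd ** 2


def slew_time(c, v, i):
    """Time to slew a capacitance c through v volts at constant current i."""
    return c * v / i


if __name__ == "__main__":
    beta = gain_factor(12e-3, 3.3, 0.65, 0.5)
    p2 = short_circuit_power(beta, 100e6, 2e-9, 3.3, 0.65)
    p1 = switching_power(100e6, 10e-12, 3.3)
    t = slew_time(10e-12, 3.3, 12e-3)
    print(f"beta = {beta:.3g} A/V^2")
    print(f"P2 = {p2 * 1e3:.2f} mW, P1 = {p1 * 1e3:.2f} mW, P2/P1 = {p2 / p1:.0%}")
    print(f"output slew time = {t * 1e9:.2f} ns")
```

Because P2 grows with the input rise time while P1 does not, sweeping t_rf in this sketch shows how slow input edges push the short-circuit share above the 20 percent rule of thumb.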
15.5.3 Subthreshold and Leakage Current

Despite the claim made in Section 2.1, a CMOS transistor is never completely off. For example, a typical specification for a 0.5 µm process for the subthreshold current (per micron of gate width for V_GS = 0 V) is less than 5 pA µm⁻¹, but not zero. With 10 million transistors on a large chip and with each transistor 10 µm wide, we will have a total subthreshold current of 0.1 mA; high, but reasonable. The problem is that the subthreshold current does not scale with process technology. When the gate-to-source voltage, V_GS, of an MOS transistor is less than the threshold voltage, V_t, the transistor conducts a very small subthreshold current in the subthreshold region

$$I_{DS} = I_0 \exp\left( \frac{q V_{GS}}{nkT} - 1 \right) \qquad (15.9)$$

where I_0 is a constant, and the constant, n, is normally between 1 and 2. The slope, S, of the transistor current in the subthreshold region is

$$S = \frac{nkT}{q} \ln 10 = 2.3\, \frac{nkT}{q}\ \text{V/decade} \qquad (15.10)$$
For example, at a junction temperature, T = 125 °C (≈ 400 K) and assuming n ≈ 1.5, S = 120 mV/decade (q = 1.6 × 10⁻¹⁹ C, k = 1.38 × 10⁻²³ J K⁻¹), which does not scale. The constant value of S = 120 mV/decade means it takes 120 mV to reduce the subthreshold current by a factor of 10 in any process. If we reduce the threshold voltages to 0.36 V in a deep-submicron process, for example, this means at V_GS = 0 V we can only reduce I_DS to 0.001 times its value at V_GS = V_t. This problem can lead to large static currents.

Transistor leakage is caused by the fact that a reverse-biased diode conducts a very small leakage current. The sources and drains of every transistor, as well as the junctions between the wells and substrate, form parasitic diodes. The parasitic-diode leakage currents are strongly dependent on the type and quality of the process as well as temperature. The parasitic diodes have two components in parallel: an area diode and a perimeter diode. The ideal parasitic diode currents are given by the following equation:

$$I = I_s \left[ \exp\left( \frac{q V_D}{nkT} \right) - 1 \right] \qquad (15.11)$$

Table 15.6 shows specified maximum leakage currents of junction parasitic diodes as well as the leakage currents of the field transistors (the parasitic MOS transistors formed when poly crosses over the thick oxide, or field oxide) in a typical 0.5 µm process.

TABLE 15.6 Diffusion leakage currents (at 25 °C) for a typical 0.5 µm (λ = 0.25 µm) CMOS process.
Junction                    Diode type   Leakage (max.)   Unit
n-diffusion/p-substrate     area         0.6              fA µm⁻² V⁻¹
n-diffusion/p-substrate     perimeter    2.0              fA µm⁻¹ V⁻¹
p-diffusion/n-well          area         0.6              fA µm⁻² V⁻¹
p-diffusion/n-well          perimeter    3.0              fA µm⁻¹ V⁻¹
n-well/p-substrate          area         1.0              fA µm⁻² V⁻¹
Field NMOS transistor                    100              fA µm⁻¹
Field PMOS transistor                    30               fA µm⁻¹
For example, if we have an n-diffusion region at a potential of 3.3 V that is 10 µm by 4 µm in size, the parasitic leakage current due to the area diode would be

$$40\ \mu\text{m}^2 \times 3.3\ \text{V} \times 0.6\ \text{fA}\,\mu\text{m}^{-2}\,\text{V}^{-1} = (40)(3.3)(0.6 \times 10^{-15}) = 7.92 \times 10^{-14}\ \text{A},$$

or approximately 80 fA. The perimeter of this drain region is 28 µm, so that the leakage current due to the perimeter diode is

$$28\ \mu\text{m} \times 3.3\ \text{V} \times 2.0\ \text{fA}\,\mu\text{m}^{-1}\,\text{V}^{-1} = (28)(3.3)(2.0 \times 10^{-15}) = 1.848 \times 10^{-13}\ \text{A},$$

or approximately 0.2 pA, over twice as large as the area-diode leakage current. As a very rough estimate, if we have 10 million transistors each with a source and a drain 10 µm by 4 µm, and half of them are biased at 3.3 V, then the total leakage current would be

$$(100 \times 10^{5})(2)(0.5)(280 \times 10^{-15}) = 2.8 \times 10^{-6}\ \text{A}, \qquad (15.12)$$

or approximately 3 µA. This is the same order of magnitude (a few microamperes) as the quiescent leakage current, I_DDQ, that we expect to measure when we test an ASIC with power applied, but with no signal activity. A measurement of more current than this in a nonactive CMOS ASIC indicates a problem with the chip manufacture or the design. We use this measurement to test an ASIC using an IDDQ test.
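The subthreshold-slope and junction-leakage estimates of this section can be checked with a short sketch. The function names are illustrative assumptions; the constants are the rounded values quoted in the text, and the chip-level total uses the 100 × 10⁵ transistor count that appears in Eq. 15.12. Because the code does not round the per-diffusion leakage up to 280 fA, its total comes out slightly below the 2.8 µA of the text.

```python
import math

K_BOLTZMANN = 1.38e-23   # J/K, rounded as in the text
Q_ELECTRON = 1.6e-19     # C, rounded as in the text


def subthreshold_slope(n, temp_k):
    """Subthreshold slope S in volts per decade (Eq. 15.10)."""
    return math.log(10) * n * K_BOLTZMANN * temp_k / Q_ELECTRON


def junction_leakage(width_um, height_um, volts, area_fa, perim_fa):
    """Return (area, perimeter) diode leakage in amperes for a rectangle.

    area_fa is in fA/um^2/V and perim_fa in fA/um/V, as in Table 15.6.
    """
    area = width_um * height_um
    perimeter = 2 * (width_um + height_um)
    i_area = area * volts * area_fa * 1e-15
    i_perimeter = perimeter * volts * perim_fa * 1e-15
    return i_area, i_perimeter


if __name__ == "__main__":
    s = subthreshold_slope(1.5, 400.0)
    decades = 0.36 / s   # decades of current reduction from V_t down to 0 V
    print(f"S = {s * 1e3:.0f} mV/decade, reduction factor = {10 ** -decades:.1e}")

    i_area, i_perim = junction_leakage(10, 4, 3.3, 0.6, 2.0)
    per_diffusion = i_area + i_perim
    total = 100e5 * 2 * 0.5 * per_diffusion
    print(f"area = {i_area:.3g} A, perimeter = {i_perim:.3g} A, "
          f"chip total = {total:.2g} A")
```

Changing temp_k in the first call shows why the slope is usually quoted at the worst-case junction temperature: the slope, and hence the off-state current, gets worse as the die heats up.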
15.6 FPGA Partitioning

In Section 15.3 we saw how many different issues have to be considered when partitioning a complex system into custom ASICs. There are no commercial tools that can help us with all of these issues; a spreadsheet is the best tool in this case. Things are a little easier if we limit ourselves to partitioning a group of logic cells into FPGAs and restrict the FPGAs to be all of the same type.
15.6.1 ATM Simulator

In this section we shall examine a hardware simulator for Asynchronous Transfer Mode ( ATM ). ATM is a signaling protocol for many different types of traffic including constant bit rates (voice signals) as well as variable bit rates (compressed video). The ATM Connection Simulator is a card that is connected to a computer. Under computer control the card monitors and corrupts the ATM signals to simulate the effects of real networks. An example would be to test different video compression algorithms. Compressed video is very bursty (brief periods of very high activity), has very strict delay constraints, and is susceptible to errors. ATM is based on ATM cells (packets). Each ATM cell has 53 bytes: a 5-byte header and a 48-byte payload; Figure 15.4 shows the format of the ATM packet. The ATM Connection Simulator looks at the entire header as an address.
FIGURE 15.4 The asynchronous transfer mode (ATM) cell format. The ATM protocol uses 53-byte cells or packets of information with a data payload and header information for routing and error control.

Figure 15.5 shows the system block diagram of the ATM simulator designed by Craig Fujikami at the University of Hawaii. Now produced by AdTech, the simulator emulates the characteristics of a single connection in an ATM network and models ATM traffic policing, ATM cell delays, and ATM cell errors. The simulator is partitioned into the three major blocks, shown in Figure 15.5, and connected to an IBM-compatible PC through an Intel 80186 controller board together with an interface board. These three blocks are
• The traffic policer, which regulates the input to the simulator.
• The delay generator, which delays ATM cells, reorders ATM cells, and inserts ATM cells with valid ATM cell headers.
• The error generator, which produces bit errors and four random variables that are needed by the other two blocks.

FIGURE 15.5 An asynchronous transfer mode (ATM) connection simulator.
The error generator performs the following operations on ATM cells:
1. Payload bit error ratio generation. The user specifies the Bernoulli probability, p_BER, of the payload bit error ratio.
2. Random-variable generation for ATM cell loss, misinsertion, reordering, and deletion.
The delay generator delays, misinserts, and reorders the target ATM cells. Finally, the traffic policer performs the following operations:
1. Performs header screening and remapping.
2. Checks ATM cell conformance.
3. Deletes selected ATM cells.
Table 15.7 shows the partitioning of the ATM board into 12 Lattice Logic FPGAs (ispLSI 1048) corresponding to the 12 blocks shown in Figure 15.5. The Lattice Logic ispLSI 1048 has 48 GLBs (generic logic blocks) on each chip. This system was partitioned by hand, with difficulty. Tools for automatic partitioning of systems like this will become increasingly important. In Section 15.6.2 we shall briefly look at some examples of such tools, before examining the partitioning methods that are used in Section 15.7.

TABLE 15.7 Partitioning of the ATM board using Lattice Logic ispLSI 1048 FPGAs. Each FPGA contains 48 generic logic blocks (GLBs).
Chip #   Size                    Chip #   Size
1        42 GLBs                 7        36 GLBs
2        64 k-bit × 8 SRAM       8        22 GLBs
3        38 GLBs                 9        256 k-bit × 16 SRAM
4        38 GLBs                 10       43 GLBs
5        42 GLBs                 11       40 GLBs
6        64 k-bit × 16 SRAM      12       30 GLBs
15.6.2 Automatic Partitioning with FPGAs

Some vendors of programmable ASICs provide partitioning software. For example, Altera uses its own software system for design. You can perform design entry using an HDL, schematic entry, or using the Altera hardware design language (AHDL), which is similar to PALASM or ABEL. In AHDL you can direct the partitioner to automatically partition logic into chips within the same family, using the AUTO keyword:

    DEVICE top_level IS AUTO; % the partitioner assigns logic %

You can use the CLIQUE keyword to keep logic together (this is not quite the same as a clique in a graph; more on this in Section 15.7.3):

    CLIQUE fast_logic
    BEGIN
        |shift_register: MACRO; % keep this in one device %
    END;

An additional option, to reserve space on a device, is very useful for making last-minute additions or changes.
15.7 Partitioning Methods

System partitioning requires goals and objectives, methods and algorithms to find solutions, and ways to evaluate these solutions. We start with measuring connectivity, proceed to an example that illustrates the concepts of system partitioning and then to the algorithms for partitioning. Assume that we have decided which parts of the system will use ASICs. The goal of partitioning is to divide this part of the system so that each partition is a single ASIC. To do this we may need to take into account any or all of the following objectives:
• A maximum size for each ASIC
• A maximum number of ASICs
• A maximum number of connections for each ASIC
• A maximum number of total connections between all ASICs
We know how to measure the first two objectives. Next we shall explain ways to measure the last two.
15.7.1 Measuring Connectivity

To measure connectivity we need some help from the mathematics of graph theory. It turns out that the terms, definitions, and ideas of graph theory are central to ASIC construction, and they are often used in manuals and books that describe the knobs and dials of ASIC design tools.
FIGURE 15.6 Networks, graphs, and partitioning. (a) A network containing circuit logic cells and nets. (b) The equivalent graph with vertexes and edges. For example: logic cell D maps to node D in the graph; net 1 maps to the edge (A, B) in the graph. Net 3 (with three connections) maps to three edges in the graph: (B, C), (B, F), and (C, F). (c) Partitioning a network and its graph. A network with a net cut that cuts two nets. (d) The network graph showing the corresponding edge cut. The net cutset in c contains two nets, but the corresponding edge cutset in d contains four edges. This means a graph is not an exact model of a network for partitioning purposes.

Figure 15.6(a) shows a circuit schematic, netlist, or network. The network consists of circuit modules A–F. Equivalent terms for a circuit module are a cell, logic cell, macro, or a block. A cell or logic cell usually refers to a small logic gate (NAND etc.), but can also be a collection of other cells; macro refers to gate-array cells; a block is usually a collection of gates or cells. We shall use the term logic cell in this chapter to cover all of these. Each logic cell has electrical connections between the terminals (connectors or pins). The network can be represented as the mathematical graph shown in Figure 15.6(b). A graph is like a spider's web: it contains vertexes (or vertices) A–F (also known as graph nodes or points) that are connected by edges. A graph vertex corresponds to a logic cell. An electrical connection (a net or a signal) between two logic cells corresponds to a graph edge.

Figure 15.6(c) shows a network with nine logic cells A–I. A connection, for example between logic cells A and B in Figure 15.6(c), is written as net (A, B). Net (A, B) is represented by the single edge (A, B) in the network graph, shown in Figure 15.6(d). A net with three terminals, for example net (B, C, F), must be modeled with three edges in the network graph: edges (B, C), (B, F), and (C, F). A net with four terminals requires six edges and so on. Figure 15.6 illustrates the differences between the nets of a network and the edges in the network graphs. Notice that a net can have more than two terminals, but a terminal has only one net.

If we divide, or partition, the network shown in Figure 15.6(c) into two parts, corresponding to creating two ASICs, we can divide the network's graph in the same way. Figure 15.6(d) shows a possible division, called a cutset. We say that there is a net cutset (for the network) and an edge cutset (for the graph). The connections between the two ASICs are external connections; the connections inside each ASIC are internal connections. Notice that the number of external connections is not modeled correctly by the network graph. When we divide the network into two by drawing a line across connections, we make net cuts. The resulting set of net cuts is the net cutset. The number of net cuts we make corresponds to the number of external connections between the two partitions. When we divide the network graph into the same partitions we make edge cuts and we create the edge cutset. We have already shown that nets and graph edges are not equivalent when a net has more than two terminals. Thus the number of edge cuts made when we partition a graph into two is not necessarily equal to the number of net cuts in the network. As we shall see presently, the differences between nets and graph edges are important when we consider partitioning a network by partitioning its graph [Schweikert and Kernighan, 1979].
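The difference between a net cutset and an edge cutset is easy to demonstrate in code. The sketch below models each net as a tuple of logic cells, expands every net into the pairwise edges described above, and counts both kinds of cut for a two-way partition. Nets net1 and net3 follow the caption of Figure 15.6; the other two nets, the partition, and all function names are hypothetical, chosen only for illustration.

```python
from itertools import combinations


def net_cutset(nets, side_of):
    """Names of nets whose terminals lie in more than one partition."""
    return [name for name, cells in nets.items()
            if len({side_of[c] for c in cells}) > 1]


def graph_edges(nets):
    """Model each k-terminal net as k(k-1)/2 edges, one per pair of cells."""
    edges = []
    for cells in nets.values():
        edges.extend(combinations(sorted(cells), 2))
    return edges


def edge_cutset(edges, side_of):
    """Edges whose two end vertexes lie in different partitions."""
    return [(a, b) for a, b in edges if side_of[a] != side_of[b]]


if __name__ == "__main__":
    nets = {
        "net1": ("A", "B"),
        "net3": ("B", "C", "F"),
        "netX": ("C", "D"),        # hypothetical net
        "netY": ("D", "E", "F"),   # hypothetical net
    }
    side_of = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1}
    print("net cuts: ", net_cutset(nets, side_of))
    print("edge cuts:", edge_cutset(graph_edges(nets), side_of))
```

In this small case two nets are cut but three graph edges cross the partition, because the three-terminal net net3 contributes two cut edges for a single external connection.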
15.7.2 A Simple Partitioning Example

Figure 15.7(a) shows a simple network we need to partition [Goto and Matsuda, 1986]. There are 12 logic cells, labeled A–L, connected by 12 nets (labeled 1–12). At this level, each logic cell is a large circuit block and might be RAM, ROM, an ALU, and so on. Each net might also be a bus, but, for the moment, we assume that each net is a single connection and all nets are weighted equally. The goal is to partition our simple network into ASICs. Our objectives are the following:
• Use no more than three ASICs.
• Each ASIC is to contain no more than four logic cells.
• Use the minimum number of external connections for each ASIC.
• Use the minimum total number of external connections.
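The four objectives above can be measured for any candidate partition with a short routine. The sketch below counts the ASICs used, the logic cells per ASIC, the pins (external connections) per ASIC, and the total number of external nets. The six-cell ring network, the assignment, and the function name are hypothetical; they are not the network of Figure 15.7.

```python
def evaluate_partition(nets, assignment):
    """Measure a partition: cells per ASIC, pins per ASIC, total external nets.

    nets maps a net name to the logic cells it connects;
    assignment maps each logic cell to an ASIC number.
    """
    cells_per_asic = {}
    for asic in assignment.values():
        cells_per_asic[asic] = cells_per_asic.get(asic, 0) + 1

    pins_per_asic = {asic: 0 for asic in cells_per_asic}
    external_nets = 0
    for cells in nets.values():
        asics = {assignment[c] for c in cells}
        if len(asics) > 1:          # the net leaves at least one ASIC
            external_nets += 1
            for asic in asics:      # one pin on every ASIC the net touches
                pins_per_asic[asic] += 1
    return cells_per_asic, pins_per_asic, external_nets


if __name__ == "__main__":
    # Hypothetical six-cell ring: A-B-C-D-E-F-A.
    nets = {1: ("A", "B"), 2: ("B", "C"), 3: ("C", "D"),
            4: ("D", "E"), 5: ("E", "F"), 6: ("F", "A")}
    assignment = {"A": 0, "B": 0, "C": 1, "D": 1, "E": 2, "F": 2}
    cells, pins, external = evaluate_partition(nets, assignment)
    print("ASICs used:", len(cells), " cells per ASIC:", cells)
    print("pins per ASIC:", pins, " total external nets:", external)
```

When every external net connects exactly two ASICs, the pin counts sum to twice the number of external nets, which is a handy sanity check on any hand-drawn partition.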
Figure 15.7(b) shows a partitioning with five external connections; two of the ASICs have three pins; the third has four pins. We might be able to find this arrangement by hand, but for larger systems we need help.

FIGURE 15.7 Partitioning example. (a) We wish to partition this network into three ASICs with no more than four logic cells per ASIC. (b) A partitioning with five external connections (nets 2, 4, 5, 6, and 8), the minimum number. (c) A constructed partition using logic cell C as a seed. It is difficult to get from this local minimum, with seven external connections (2, 3, 5, 7, 9, 11, 12), to the optimum solution of b.
Splitting a network into several pieces is a network partitioning problem. In the following sections we shall examine two types of algorithms to solve this problem and describe how they are used in system partitioning. Section 15.7.3 describes constructive partitioning, which uses a set of rules to find a solution. Section 15.7.4 describes iterative partitioning improvement (or iterative partitioning refinement), which takes an existing solution and tries to improve it. Often we apply iterative improvement to a constructive partitioning. We also use many of these partitioning algorithms in solving floorplanning and placement problems that we shall discuss in Chapter 16.
15.7.3 Constructive Partitioning

The most common constructive partitioning algorithms use seed growth or cluster growth. A simple seed-growth algorithm for constructive partitioning consists of the following steps:
1. Start a new partition with a seed logic cell.
2. Consider all the logic cells that are not yet in a partition. Select each of these logic cells in turn.
3. Calculate a gain function, g(m), that measures the benefit of adding logic cell m to the current partition. One measure of gain is the number of connections between logic cell m and the current partition.
4. Add the logic cell with the highest gain g(m) to the current partition.
5. Repeat the process from step 2. If you reach the limit of logic cells in a partition, start again at step 1.
We may choose different gain functions according to our objectives (but we have to be careful to distinguish between connections and nets). The algorithm starts with the choice of a seed logic cell (seed module, or just seed). The logic cell with the most nets is a good choice as the seed logic cell. You can also use a set of seed logic cells known as a cluster. Some people also use the term clique, borrowed from graph theory. A clique of a graph is a subset of nodes where each pair of nodes is connected by an edge, like your group of friends at school where everyone knows everyone else in your clique. In some tools you can use schematic pages (at the leaf or lowest hierarchical level) as a starting point for partitioning. If you use a high-level design language, you can use a Verilog module (different from a circuit module) or VHDL entity/architecture as seeds (again at the leaf level).
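The five steps above can be sketched in Python. This is a minimal illustration, not a production partitioner: the function name seed_growth, the six-cell network, and the alphabetical tie-breaking rule are assumptions made for the example, and the seed is chosen by counting connections rather than nets.

```python
def seed_growth(connections, max_cells):
    """Constructive partitioning by seed growth (illustrative sketch).

    connections: dict mapping each logic cell to the set of cells it connects to.
    max_cells:   limit on the number of logic cells per partition.
    """
    unassigned = set(connections)
    partitions = []
    while unassigned:
        # Step 1: seed a new partition with the most-connected unassigned cell
        # (ties are broken alphabetically so the result is repeatable).
        seed = max(sorted(unassigned), key=lambda cell: len(connections[cell]))
        partition = {seed}
        unassigned.remove(seed)
        while unassigned and len(partition) < max_cells:
            # Steps 2-3: gain g(m) = connections between m and the partition.
            def gain(m):
                return len(connections[m] & partition)
            # Step 4: add the cell with the highest gain.
            best = max(sorted(unassigned), key=gain)
            partition.add(best)
            unassigned.remove(best)
        # Step 5: the partition is full (or no cells remain); start a new seed.
        partitions.append(partition)
    return partitions


# Hypothetical six-cell network: two triangles joined by the connection C-D.
network = {
    "A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"},
    "D": {"C", "E", "F"}, "E": {"D", "F"}, "F": {"D", "E"},
}
print(seed_growth(network, max_cells=3))
```

With a limit of three cells per partition, the sketch grows one partition around C and a second around D, leaving only the C-D connection external.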
15.7.4 Iterative Partitioning Improvement

The most common iterative improvement algorithms are based on interchange and group migration. The process of interchanging (swapping) logic cells in an effort to improve the partition is an interchange method. If the swap improves the partition, we accept the trial interchange; otherwise we select a new set of logic cells to swap. There is a limit to what we can achieve with a partitioning algorithm based on simple interchange. For example, Figure 15.7 (c) shows a partitioning of the network of part (a) using a constructed partitioning algorithm with logic cell C as the seed. To get from the solution shown in part (c) to the solution of part (b), which has a minimum number of external connections, requires a complicated swap. The three pairs (D and F, J and K, C and L) need to be swapped, all at the same time. It would take a very long time to consider all possible swaps of this complexity. A simple interchange algorithm considers only one change and rejects it immediately if it is not an improvement. Algorithms of this type are greedy algorithms in the sense that they will accept a move only if it provides immediate benefit. Such shortsightedness leads an algorithm to a local minimum from which it cannot escape. Stuck in a valley, a greedy algorithm is not prepared to walk over a hill to see if there is a better solution in the next valley. This type of problem occurs repeatedly in CAD algorithms.
Group migration consists of swapping groups of logic cells between partitions. The group migration algorithms are better than simple interchange methods at improving a solution but are more complex. Almost all group migration methods are based on the powerful and general Kernighan-Lin algorithm (KL algorithm) that partitions a graph [Kernighan and Lin, 1970]. The problem of dividing a graph into two pieces, minimizing the nets that are cut, is the min-cut problem, a very important one in VLSI design. As the next section shows, the KL algorithm can be applied to many different problems in ASIC design. We shall examine the algorithm next and then see how to apply it to system partitioning.
15.7.5 The Kernighan-Lin Algorithm

Figure 15.8 illustrates some of the terms and definitions needed to describe the KL algorithm. External edges cross between partitions; internal edges are contained inside a partition. Consider a network with $2m$ nodes (where $m$ is an integer) each of equal size. If we assign a cost to each edge of the network graph, we can define a cost matrix $C = c_{ij}$, where $c_{ij} = c_{ji}$ and $c_{ii} = 0$. If all connections are equal in importance, the elements of the cost matrix are 1 or 0, and in this special case we usually call the matrix the connectivity matrix. Costs higher than 1 could represent the number of wires in a bus, multiple connections to a single logic cell, or nets that we need to keep close for timing reasons.
FIGURE 15.8 Terms used by the Kernighan-Lin partitioning algorithm. (a) An example network graph. (b) The connectivity matrix, C; the column and rows are labeled to help you see how the matrix entries correspond to the node numbers in the graph. For example, $C_{17}$ (column 1, row 7) equals 1 because nodes 1 and 7 are connected. In this example all edges have an equal weight of 1, but in general the edges may have different weights.
Suppose we already have split a network into two partitions, A and B, each with $m$ nodes (perhaps using a constructed partitioning). Our goal now is to swap nodes between A and B with the objective of minimizing the number of external edges connecting the two partitions. Each external edge may be weighted by a cost, and our objective corresponds to minimizing a cost function that we shall call the total external cost, cut cost, or cut weight, W:
$W = \sum_{a \in A,\, b \in B} c_{ab}$. (15.13)
In Figure 15.8 (a) the cut weight is 4 (all the edges have weights of 1). In order to simplify the measurement of the change in cut weight when we interchange nodes, we need some more definitions. First, for any node a in partition A, we define an external edge cost, which measures the connections from node a to B,
$E_a = \sum_{y \in B} c_{ay}$. (15.14)
For example, in Figure 15.8 (a) $E_1 = 1$, and $E_3 = 0$. Second, we define the internal edge cost to measure the internal connections to a,
$I_a = \sum_{z \in A} c_{az}$. (15.15)
So, in Figure 15.8 (a), $I_1 = 0$, and $I_3 = 2$. We define the edge costs for partition B in a similar way (so $E_8 = 2$, and $I_8 = 1$). The cost difference is the difference between external edge costs and internal edge costs,
$D_x = E_x - I_x$. (15.16)
Thus, in Figure 15.8 (a) $D_1 = 1$, $D_3 = -2$, and $D_8 = 1$. Now pick any node in A, and any node in B. If we swap these nodes, a and b, we need to measure the reduction in cut weight, which we call the gain, g. We can express g in terms of the edge costs as follows:
$g = D_a + D_b - 2c_{ab}$. (15.17)
The last term accounts for the fact that a and b may be connected. So, in Figure 15.8 (a), if we swap nodes 1 and 6, then $g = D_1 + D_6 - 2c_{16} = 1 + 1$. If we swap nodes 2 and 8, then $g = D_2 + D_8 - 2c_{28} = 1 + 2 - 2$.
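The definitions in Equations 15.13 to 15.17 translate almost directly into code. The following Python sketch uses a small hypothetical four-node graph (not the graph of Figure 15.8) and assumed function names; it checks that the gain formula predicts the actual reduction in cut weight for every possible swap.

```python
def cost_matrix(n, edges):
    """Symmetric cost matrix with c[i][i] = 0; every edge has weight 1."""
    c = [[0] * n for _ in range(n)]
    for i, j in edges:
        c[i][j] = c[j][i] = 1
    return c

def cut_weight(c, A, B):
    return sum(c[a][b] for a in A for b in B)             # Equation 15.13

def cost_difference(c, x, own, other):
    external = sum(c[x][y] for y in other)                # Equation 15.14
    internal = sum(c[x][z] for z in own)                  # Equation 15.15
    return external - internal                            # Equation 15.16

def swap_gain(c, a, b, A, B):
    return (cost_difference(c, a, A, B)
            + cost_difference(c, b, B, A) - 2 * c[a][b])  # Equation 15.17

# Hypothetical graph: nodes 0..3 with edges 0-2, 0-3, and 1-2.
c = cost_matrix(4, [(0, 2), (0, 3), (1, 2)])
A, B = {0, 1}, {2, 3}
for a in sorted(A):
    for b in sorted(B):
        after = cut_weight(c, (A - {a}) | {b}, (B - {b}) | {a})
        # Columns: a, b, predicted gain, measured reduction in cut weight.
        print(a, b, swap_gain(c, a, b, A, B), cut_weight(c, A, B) - after)
```

The two right-hand columns agree for every pair, which is the point of the $-2c_{ab}$ term: without it, a swap of two connected nodes would be credited for an edge that remains cut.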
The KL algorithm finds a group of node pairs to swap that increases the gain even though swapping individual node pairs from that group might decrease the gain. First we pretend to swap all of the nodes a pair at a time. Pretend swaps are like studying chess games when you make a series of trial moves in your head. This is the algorithm:
1. Find two nodes, $a_i$ from A, and $b_i$ from B, so that the gain from swapping them is a maximum. The gain is
$g_i = D_{a_i} + D_{b_i} - 2c_{a_i b_i}$. (15.18)
2. Next pretend swap $a_i$ and $b_i$ even if the gain $g_i$ is zero or negative, and do not consider $a_i$ and $b_i$ eligible for being swapped again.
3. Repeat steps 1 and 2 a total of $m$ times until all the nodes of A and B have been pretend swapped. We are back where we started, but we have ordered pairs of nodes in A and B according to the gain from interchanging those pairs.
4. Now we can choose which nodes we shall actually swap. Suppose we only swap the first $n$ pairs of nodes that we found in the preceding process. In other words we swap nodes $X = a_1, a_2, \ldots, a_n$ from A with nodes $Y = b_1, b_2, \ldots, b_n$ from B. The total gain would be
$G_n = \sum_{i=1}^{n} g_i$. (15.19)
5. We now choose $n$ corresponding to the maximum value of $G_n$. If the maximum value of $G_n > 0$, then we swap the sets of nodes X and Y and thus reduce the cut weight by $G_n$. We use this new partitioning to start the process again at the first step. If the maximum value of $G_n = 0$, then we cannot improve the current partitioning and we stop. We have found a locally optimum solution.
Figure 15.9 shows an example of partitioning a graph using the KL algorithm. Each completion of steps 1 through 5 is a pass through the algorithm. Kernighan and Lin found that typically 2 to 4 passes were required to reach a solution. The most important feature of the KL algorithm is that we are prepared to consider moves even though they seem to make things worse. This is like unraveling a tangled ball of string or solving a Rubik's cube puzzle. Sometimes you need to make things worse so they can get better later. The KL algorithm works well for partitioning graphs. However, there are the following problems that we need to address before we can apply the algorithm to network partitioning:
FIGURE 15.9 Partitioning a graph using the Kernighan-Lin algorithm. (a) Shows how swapping node 1 of partition A with node 6 of partition B results in a gain of g = 1. (b) A graph of the gain resulting from swapping pairs of nodes. (c) The total gain is equal to the sum of the gains obtained at each step.
• It minimizes the number of edges cut, not the number of nets cut.
• It does not allow logic cells to be different sizes.
• It is expensive in computation time.
• It does not allow partitions to be unequal or find the optimum partition size.
• It does not allow for selected logic cells to be fixed in place.
• The results are random.
• It does not directly allow for more than two partitions.
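Steps 1 through 5 of the KL algorithm can be sketched in Python as follows. This is an unoptimized illustration under stated assumptions: the function names, the rule for updating the cost differences after a pretend swap (the standard update from the KL literature, not spelled out in the text above), and the example graph of two triangles joined by one edge are all choices made for the sketch.

```python
def kl_pass(c, A, B):
    """One pass of the KL algorithm on cost matrix c; returns (A, B, G_n)."""
    D = {x: sum(c[x][y] for y in B) - sum(c[x][z] for z in A) for x in A}
    D.update({x: sum(c[x][y] for y in A) - sum(c[x][z] for z in B) for x in B})
    free_a, free_b = set(A), set(B)
    pairs, gains = [], []
    for _ in range(len(A)):
        # Step 1: find the free pair with the maximum gain (Equation 15.18).
        g, a, b = max((D[a] + D[b] - 2 * c[a][b], a, b)
                      for a in free_a for b in free_b)
        # Step 2: pretend swap the pair and lock it, even if g <= 0.
        free_a.remove(a)
        free_b.remove(b)
        pairs.append((a, b))
        gains.append(g)
        # Update the cost differences as if a and b had changed sides.
        for x in free_a:
            D[x] += 2 * c[x][a] - 2 * c[x][b]
        for y in free_b:
            D[y] += 2 * c[y][b] - 2 * c[y][a]
    # Steps 4-5: choose the n that maximizes G_n (Equation 15.19).
    best_n, best_G, G = 0, 0, 0
    for n, g in enumerate(gains, start=1):
        G += g
        if G > best_G:
            best_n, best_G = n, G
    X = {a for a, _ in pairs[:best_n]}
    Y = {b for _, b in pairs[:best_n]}
    return (set(A) - X) | Y, (set(B) - Y) | X, best_G

def kernighan_lin(c, A, B):
    """Repeat passes until the best total gain is no longer positive."""
    while True:
        A, B, G = kl_pass(c, A, B)
        if G <= 0:
            return A, B

# Hypothetical graph: triangles {0,1,2} and {3,4,5} joined by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
c = [[0] * 6 for _ in range(6)]
for i, j in edges:
    c[i][j] = c[j][i] = 1
print(kernighan_lin(c, {0, 1, 3}, {2, 4, 5}))
```

Searching all free pairs in step 1 is what makes this version expensive; the sketch favors clarity over the bookkeeping that a practical implementation would need.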
To implement a net-cut partitioning rather than an edge-cut partitioning, we can just keep track of the nets rather than the edges [ Schweikert and Kernighan, 1979]. We can no longer use a connectivity or cost matrix to represent connections, though. Fortunately, several people have found efficient data structures to handle the bookkeeping tasks. One example is the Fiduccia-Mattheyses algorithm to be described shortly. To represent nets with multiple terminals in a network accurately, we can extend the definition of a network graph. Figure 15.10 shows how a hypergraph with a special type of vertex, a star, and a hyperedge, represents a net with more than two terminals in a network.
FIGURE 15.10 A hypergraph. (a) The network contains a net y with three terminals. (b) In the network hypergraph we can model net y by a single hyperedge (B, C, D) and a star node. Now there is a direct correspondence between wires or nets in the network and hyperedges in the graph.
In the KL algorithm, the internal and external edge costs have to be calculated for all the nodes before we can select the nodes to be swapped. Then we have to find the pair of nodes that give the largest gain when swapped. This requires an amount of computer time that grows as $n^2 \log n$ for a graph with $2n$ nodes. This $n^2$ dependency is a major problem for partitioning large networks. The Fiduccia-Mattheyses algorithm (the FM algorithm) is an extension to the KL algorithm that addresses the differences between nets and edges and also reduces the computational effort [ Fiduccia and Mattheyses, 1982]. The key features of this algorithm are the following:
• Only one logic cell, the base logic cell, moves at a time. In order to stop the algorithm from moving all the logic cells to one large partition, the base logic cell is chosen to maintain balance between partitions. The balance is the ratio of total logic cell size in one partition to the total logic cell size in the other. Altering the balance allows us to vary the sizes of the partitions.
• Critical nets are used to simplify the gain calculations. A net is a critical net if it has an attached logic cell that, when swapped, changes the number of nets cut. It is only necessary to recalculate the gains of logic cells on critical nets that are attached to the base logic cell.
• The logic cells that are free to move are stored in a doubly linked list. The lists are sorted according to gain. This allows the logic cells with maximum gain to be found quickly.
These techniques reduce the computation time so that it increases only slightly more than linearly with the number of logic cells in the network, a very important improvement [Fiduccia and Mattheyses, 1982]. Kernighan and Lin suggested simulating logic cells of different sizes by clumping s logic cells together with highly weighted nets to simulate a logic cell of size s . The FM algorithm takes logic-cell size into account as it selects a logic cell to swap based on maintaining the balance between the total logic-cell size of each of the partitions. To generate unequal partitions using the KL algorithm, we can introduce dummy logic cells with no connections into one of the partitions. The FM algorithm adjusts the partition size according to the balance parameter. Often we need to fix logic cells in place during partitioning. This may be because we need to keep logic cells together or apart for reasons other than connectivity, perhaps due to timing, power, or noise constraints. Another reason to fix logic cells would be to improve a partitioning that you have already partially completed. The FM algorithm allows you to fix logic cells by removing them from consideration as the base logic cells you move. Methods based on the KL algorithm find locally optimum solutions in a random fashion. There are two reasons for this. The first reason is the random starting partition. The second reason is that the choice of nodes
to swap is based on the gain. The choice between moves that have equal gain is arbitrary. Extensions to the KL algorithm address both of these problems. Finding nodes that are naturally grouped or clustered and assigning them to one of the initial partitions improves the results of the KL algorithm. Although these are constructive partitioning methods, they are covered here because they are closely linked with the KL iterative improvement algorithm.
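The net-based gain that the FM algorithm works with can be illustrated with a short Python sketch. It shows only the gain of moving a single logic cell, measured in nets cut rather than edges cut; the balance criterion, critical-net updates, and sorted gain lists are deliberately omitted. The function names and the three-net example are assumptions made for illustration.

```python
def cut_nets(nets, side):
    """Number of nets with logic cells in both partitions."""
    return sum(1 for cells in nets.values()
               if len({side[cell] for cell in cells}) == 2)

def move_gain(nets, side, cell):
    """Reduction in cut nets if `cell` alone moves to the other partition."""
    gain = 0
    for cells in nets.values():
        if cell not in cells:
            continue
        same = sum(1 for other in cells if side[other] == side[cell])
        across = len(cells) - same
        if same == 1 and across > 0:
            gain += 1      # cell is alone on its side: the net becomes uncut
        elif across == 0 and len(cells) > 1:
            gain -= 1      # net lies wholly on the cell's side: the move cuts it
    return gain

# Hypothetical network: net n2 has three terminals, like net y in Figure 15.10.
nets = {"n1": ["A", "B"], "n2": ["B", "C", "D"], "n3": ["C", "D"]}
side = {"A": 0, "B": 1, "C": 1, "D": 1}
for cell in sorted(side):
    print(cell, move_gain(nets, side, cell))
```

Only nets in one of the two states tested inside the loop contribute to the gain, which is why the FM algorithm can restrict its recalculations to critical nets.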
15.7.6 The Ratio-Cut Algorithm

The ratio-cut algorithm removes the restriction of constant partition sizes. The cut weight W for a cut that divides a network into two partitions, A and B, is given by
$W = \sum_{a \in A,\, b \in B} c_{ab}$. (15.20)
The KL algorithm minimizes W while keeping partitions A and B the same size. The ratio of a cut is defined as
$R = \frac{W}{|A|\,|B|}$. (15.21)
In this equation |A| and |B| are the sizes of partitions A and B. The size of a partition is equal to the number of nodes it contains (also known as the set cardinality). The cut that minimizes R is called the ratio cut. The original description of the ratio-cut algorithm uses ratio cuts to partition a network into small, highly connected groups. Then you form a reduced network from these groups; each small group of logic cells forms a node in the reduced network. Finally, you use the FM algorithm to improve the reduced network [ Cheng and Wei, 1991].
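A small Python sketch shows why the ratio of Equation 15.21 can favor unequal partitions. The graph (four fully connected nodes plus a two-node tail) and the function name cut_ratio are hypothetical, chosen so that the natural cut is not an equal split.

```python
def cut_ratio(c, A, B):
    """Ratio R = W / (|A| |B|) of a cut (Equations 15.20 and 15.21)."""
    W = sum(c[a][b] for a in A for b in B)
    return W / (len(A) * len(B))

# Hypothetical graph: nodes 0-3 fully connected, plus the chain 3-4-5.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
c = [[0] * 6 for _ in range(6)]
for i, j in edges:
    c[i][j] = c[j][i] = 1

equal = cut_ratio(c, {0, 1, 2}, {3, 4, 5})      # cuts three edges, 3 / 9
unequal = cut_ratio(c, {0, 1, 2, 3}, {4, 5})    # cuts one edge, 1 / 8
print(equal, unequal)
```

An algorithm constrained to equal partitions must cut through the highly connected group, whereas the ratio measure rewards the 4:2 split that cuts a single edge.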
15.7.7 The Look-ahead Algorithm

Both the KL and FM algorithms consider only the immediate gain to be made by moving a node. When there is a tie between nodes with equal gain (as often happens), there is no mechanism to make the best choice. This is like playing chess looking only one move ahead. Figure 15.11 shows an example of two nodes that have equal gains, but moving one of the nodes will allow a move that has a higher gain later.
FIGURE 15.11 An example of network partitioning that shows the need to look ahead when selecting logic cells to be moved between partitions. Partitionings (a), (b), and (c) show one sequence of moves; partitionings (d), (e), and (f) show a second sequence. The partitioning in (a) can be improved by moving node 2 from A to B with a gain of 1. The result of this move is shown in (b). This partitioning can be improved by moving node 3 to B, again with a gain of 1. The partitioning shown in (d) is the same as (a). We can move node 5 to B with a gain of 1 as shown in (e), but now we can move node 4 to B with a gain of 2.
We call the gain for the initial move the first-level gain. Gains from subsequent moves are then second-level and higher gains. We can define a gain vector that contains these gains. Figure 15.11 shows how the first-level and second-level gains are calculated. Using the gain vector allows us to use a look-ahead algorithm in the choice of nodes to be swapped. This reduces both the mean and variation in the number of cuts in the resulting partitions.
We have described algorithms that are efficient at dividing a network into two pieces. Normally we wish to divide a system into more than two pieces. We can do this by recursively applying the algorithms. For example, if we wish to divide a system network into three pieces, we could apply the FM algorithm first, using a balance of 2:1, to generate two partitions, with one twice as large as the other. Then we apply the algorithm again to the larger of the two partitions, with a balance of 1:1, which will give us three partitions of roughly the same size.
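One simple way to use a gain vector is to compare candidate moves lexicographically, so that the second-level gain breaks ties in the first-level gain. The Python sketch below uses the two candidate moves described in the caption of Figure 15.11; representing the gain vector as a tuple is an assumption made for illustration.

```python
# Gain vectors (first-level gain, second-level gain) for the two candidate
# first moves described in the caption of Figure 15.11.
candidates = {
    "node 2": (1, 1),   # move node 2 (gain 1), then node 3 (gain 1)
    "node 5": (1, 2),   # move node 5 (gain 1), then node 4 (gain 2)
}

# Python compares tuples element by element, so equal first-level gains
# are resolved by the second-level gain.
best = max(candidates, key=candidates.get)
print(best)
```

Without the second entry the two moves are indistinguishable, and the choice between them would be arbitrary.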
15.7.8 Simulated Annealing

A different approach to solving large graph problems (and other types of problems) that arise in VLSI layout, including system partitioning, uses the simulated-annealing algorithm [ Kirkpatrick et al., 1983]. Simulated annealing takes an existing solution and then makes successive changes in a series of random moves. Each move is accepted or rejected based on an energy function, calculated for each new trial configuration. The minimums of the energy function correspond to possible solutions. The best solution is the global minimum. So far the description of simulated annealing is similar to the interchange algorithms, but there is an important difference. In an interchange strategy we accept the new trial configuration only if the energy function decreases, which means the new configuration is an improvement. However, in the simulated-annealing algorithm, we accept the new configuration even if the energy function increases for the new configuration, which means things are getting worse. The probability of accepting a worse configuration is controlled by the exponential expression $\exp(-\Delta E / T)$, where $\Delta E$ is the resulting increase in the energy function. The parameter $T$ is a variable that we control and corresponds to the temperature in the annealing of a metal cooling (this is why the process is called simulated annealing). We accept moves that seemingly take us away from a desirable solution to allow the system to escape from a local minimum and find other, better, solutions. The name for this strategy is hill climbing. As the temperature is slowly decreased, we decrease the probability of making moves that increase the energy function. Finally, as the temperature approaches zero, we refuse to make any moves that increase the energy of the system and the system falls and comes to rest at the nearest local minimum. Hopefully, the solution that corresponds to the minimum we have found is a good one.
The critical parameter governing the behavior of the simulated-annealing algorithm is the rate at which the temperature $T$ is reduced. This rate is known as the cooling schedule. Often we set a parameter $\alpha$ that relates the temperatures, $T_i$ and $T_{i+1}$, at the $i$th and $(i+1)$th iteration:
$T_{i+1} = \alpha T_i$. (15.22)
To find a good solution, a local minimum close to the global minimum, requires a high initial temperature and a slow cooling schedule. This results in many trial moves and very long computer run times [ Rose, Klebsch, and Wolf, 1990]. If we are prepared to wait a long time (forever in the worst case), simulated annealing is useful because we can guarantee that we can find the optimum solution. Simulated annealing is useful in several of the ASIC construction steps and we shall return to it in Section 16.2.7.
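The acceptance rule and the cooling schedule of Equation 15.22 can be sketched in Python for two-way partitioning. This is a minimal illustration under assumptions: the energy function is taken to be the cut weight, a move is a random swap of one node from each partition, and the starting temperature, $\alpha$, number of moves per temperature, and stopping temperature are arbitrary values chosen for a six-node example.

```python
import math
import random

def anneal(c, A, B, T=5.0, alpha=0.9, moves_per_T=20, T_min=0.05, seed=0):
    """Simulated annealing for two-way partitioning (illustrative sketch)."""
    rng = random.Random(seed)            # fixed seed so runs are repeatable
    A, B = list(A), list(B)

    def energy():
        return sum(c[a][b] for a in A for b in B)   # energy = cut weight

    E = energy()
    best = (E, set(A), set(B))
    while T > T_min:
        for _ in range(moves_per_T):
            i, j = rng.randrange(len(A)), rng.randrange(len(B))
            A[i], B[j] = B[j], A[i]                 # trial move: swap a pair
            dE = energy() - E
            # Always accept improvements; accept a worse configuration
            # with probability exp(-dE / T).
            if dE <= 0 or rng.random() < math.exp(-dE / T):
                E += dE
                if E < best[0]:
                    best = (E, set(A), set(B))
            else:
                A[i], B[j] = B[j], A[i]             # reject: undo the swap
        T *= alpha                                  # Equation 15.22
    return best

# Hypothetical graph: triangles {0,1,2} and {3,4,5} joined by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
c = [[0] * 6 for _ in range(6)]
for i, j in edges:
    c[i][j] = c[j][i] = 1
print(anneal(c, {0, 1, 3}, {2, 4, 5}))
```

Recomputing the energy from scratch for every trial move is wasteful; a practical implementation would update it incrementally, but the accept-or-undo structure would be the same.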
15.7.9 Other Partitioning Objectives

In partitioning a real system we need to weight each logic cell according to its area in order to control the total areas of each ASIC. This can be done if the area of each logic cell can either be calculated or estimated. This is usually done as part of floorplanning, so we may need to return to partitioning after floorplanning. There will be many objectives or constraints that we need to take into account during
partitioning. For example, certain logic cells in a system may need to be located on the same ASIC in order to avoid adding the delay of any external interconnections. These timing constraints can be implemented by adding weights to nets to make them more important than others. Some logic cells may consume more power than others and you may need to add power constraints to avoid exceeding the power-handling capability of a single ASIC. It is difficult, though, to assign more than rough estimates of power consumption for each logic cell at the system planning stage, before any simulation has been completed. Certain logic cells may only be available in a certain technology (if you want to include memory on an ASIC, for example). In this case, technology constraints will keep together logic cells requiring similar technologies. We probably want to impose cost constraints to implement certain logic cells in the lowest cost technology available or to keep ASICs below a certain size in order to use a low-cost package. The type of test strategy you adopt will also affect the partitioning of logic. Large RAM blocks may require BIST circuitry; large amounts of sequential logic may require scan testing, possibly with a boundary-scan interface. One of the objects of testability is to maintain controllability and observability of logic inside each ASIC. In order to do this, test constraints may require that we force certain connections to be external. No automated partitioning tools can take into account all of these constraints. The best CAD tool to help you with these decisions is a spreadsheet.
15.8 Summary
The construction or physical design of ASICs in a microelectronics system is a very large and complex problem. To solve the problem we divide it into several steps: system partitioning, floorplanning, placement, and routing. To solve each of these smaller problems we need goals and objectives, measurement metrics, as well as algorithms and methods. System partitioning is the first step in ASIC assembly. An example of the SPARCstation 1 illustrated the various issues involved in partitioning. Presently commercial CAD tools are able to automatically partition systems and chips only at a low level, at the level of a network or netlist. Partitioning for FPGAs is currently the most advanced. Next we discussed the methods to use for system partitioning. We saw how to represent networks as graphs, containing nets and edges, and how the mathematics of graph theory is useful in system partitioning and the other steps of ASIC assembly. We covered methods and algorithms for partitioning and explained that most are based on the Kernighan-Lin min-cut algorithm. The important points in this chapter are
• The goals and objectives of partitioning
• Partitioning as an art not a science
• The simple nature of the algorithms necessary for VLSI-sized problems
• The random nature of the algorithms we use
• The controls for the algorithms used in ASIC design
FLOORPLANNING AND PLACEMENT

The input to the floorplanning step is the output of system partitioning and design entry: a netlist. Floorplanning precedes placement, but we shall cover them together. The output of the placement step is a set of directions for the routing tools. At the start of floorplanning we have a netlist describing circuit blocks, the logic cells within the blocks, and their connections. For example, Figure 16.1 shows the Viterbi decoder example as a collection of standard cells with no room set aside yet for routing. We can think of the standard cells as a hod of bricks to be made into a wall. What we have to do now is set aside spaces (we call these spaces the channels) for interconnect, the mortar, and arrange the cells. Figure 16.2 shows a finished wall, after the floorplanning and placement steps are complete. We still have not completed any routing at this point (that comes later); all we have done is place the logic cells in a fashion that we hope will minimize the total interconnect length, for example.
FIGURE 16.1 The starting point for the floorplanning and placement steps for the Viterbi decoder (containing only standard cells). This is the initial display of the floorplanning and placement tool. The small boxes that look like bricks are the outlines of the standard cells. The largest standard cells, at the bottom of the display (labeled dfctnb) are 188 D flip-flops. The '+' symbols represent the drawing origins of the standard cells; for the D flip-flops they are shifted to the left and below the logic cell bottom left-hand corner. The large box surrounding all the logic cells represents the estimated chip size. (This is a screen shot from Cadence Cell Ensemble.)
FIGURE 16.2 The Viterbi decoder (from Figure 16.1 ) after floorplanning and placement. There are 18 rows of standard cells separated by 17 horizontal channels (labeled 2 to 18). The channels are routed as numbered. In this example, the I/O pads are omitted to show the cell placement more clearly. Figure 17.1 shows the same placement without the channel labels. (A screen shot from Cadence Cell Ensemble.)
16.1 Floorplanning
16.2 Placement
16.3 Physical Design Flow
16.4 Information Formats
16.5 Summary
16.6 Problems
16.7 Bibliography
16.8 References
16.1 Floorplanning
Figure 16.3 shows that both interconnect delay and gate delay decrease as we scale down feature sizes, but at different rates. This is because interconnect capacitance tends to a limit of about 2 pF cm$^{-1}$ for a minimum-width wire while gate delay continues to decrease (see Section 17.4, Circuit Extraction and DRC). Floorplanning allows us to predict this interconnect delay by estimating interconnect length.
FIGURE 16.3 Interconnect and gate delays. As feature sizes decrease, both average interconnect delay and average gate delay decrease, but at different rates. This is because interconnect capacitance tends to a limit that is independent of scaling. Interconnect delay now dominates gate delay.
16.1.1 Floorplanning Goals and Objectives

The input to a floorplanning tool is a hierarchical netlist that describes the interconnection of the blocks (RAM, ROM, ALU, cache controller, and so on); the logic cells (NAND, NOR, D flip-flop, and so on) within the blocks; and the logic cell connectors (the terms terminals , pins , or ports mean the same thing as connectors ). The netlist is a logical description of the ASIC; the floorplan is a physical description of an ASIC. Floorplanning is thus a mapping between the logical description (the
netlist) and the physical description (the floorplan). The goals of floorplanning are to:
• arrange the blocks on a chip,
• decide the location of the I/O pads,
• decide the location and number of the power pads,
• decide the type of power distribution, and
• decide the location and type of clock distribution.
The objectives of floorplanning are to minimize the chip area and minimize delay. Measuring area is straightforward, but measuring delay is more difficult and we shall explore this next.
16.1.2 Measurement of Delay in Floorplanning

Throughout the ASIC design process we need to predict the performance of the final layout. In floorplanning we wish to predict the interconnect delay before we complete any routing. Imagine trying to predict how long it takes to get from Russia to China without knowing where in Russia we are or where our destination is in China. Actually it is worse, because in floorplanning we may move Russia or China. To predict delay we need to know the parasitics associated with interconnect: the interconnect capacitance ( wiring capacitance or routing capacitance ) as well as the interconnect resistance. At the floorplanning stage we know only the fanout ( FO ) of a net (the number of gates driven by a net) and the size of the block that the net belongs to. We cannot predict the resistance of the various pieces of the interconnect path since we do not yet know the shape of the interconnect for a net. However, we can estimate the total length of the interconnect and thus estimate the total capacitance. We estimate interconnect length by collecting statistics from previously routed chips and analyzing the results. From these statistics we create tables that predict the interconnect capacitance as a function of net fanout and block size. A floorplanning tool can then use these predicted-capacitance tables (also known as interconnect-load tables or wire-load tables ). Figure 16.4 shows how we derive and use wire-load tables and illustrates the following facts:
FIGURE 16.4 Predicted capacitance. (a) Interconnect lengths as a function of fanout (FO) and circuit-block size. (b) Wire-load table. There is only one capacitance value for each fanout (typically the average value). (c) The wire-load table predicts the capacitance and delay of a net (with a considerable error). Net A and net B both have a fanout of 1, both have the same predicted net delay, but net B in fact has a much greater delay than net A in the actual layout (of course we shall not know what the actual layout is until much later in the design process).
• Typically between 60 and 70 percent of nets have a FO = 1.
• The distribution for a FO = 1 has a very long tail, stretching to interconnects that run from corner to corner of the chip.
• The distribution for a FO = 1 often has two peaks, corresponding to a distribution for close neighbors in subgroups within a block, superimposed on a distribution corresponding to routing between subgroups.
• We often see a twin-peaked distribution at the chip level also, corresponding to separate distributions for intrablock routing (inside blocks) and interblock routing (between blocks).
• The distributions for FO > 1 are more symmetrical and flatter than for FO = 1.
• The wire-load tables can only contain one number, for example the average net capacitance, for any one distribution. Many tools take a worst-case approach and use the 80- or 90-percentile point instead of the average. Thus a tool may use a predicted capacitance for which we know 90 percent of the nets will have less than the estimated capacitance.
• We need to repeat the statistical analysis for blocks with different sizes. For example, a net with a FO = 1 in a 25 k-gate block will have a different (larger) average length than if the net were in a 5 k-gate block.
• The statistics depend on the shape (aspect ratio) of the block (usually the statistics are only calculated for square blocks).
• The statistics will also depend on the type of netlist. For example, the distributions will be different for a netlist generated by setting a constraint for minimum logic delay during synthesis (which tends to generate large numbers of two-input NAND gates) than for netlists generated using minimum-area constraints.
There are no standards for the wire-load tables themselves, but there are some standards for their use and for presenting the extracted loads (see Section 16.4 ). Wire-load tables often present loads in terms of a standard load that is usually the input capacitance of a two-input NAND gate with a 1X (default) drive strength.
TABLE 16.1 A wire-load table showing average interconnect lengths (mm). 1
Array (available gates)   Chip size (mm)   Fanout 1   Fanout 2   Fanout 4
3 k                       3.45             0.56       0.85       1.46
11 k                      5.11             0.84       1.34       2.25
105 k                     12.50            1.75       2.70       4.92
Table 16.1 shows the estimated metal interconnect lengths, as a function of die size
and fanout, for a series of three-level metal gate arrays. In this case the interconnect capacitance is about 2 pF cm$^{-1}$, a typical figure. Figure 16.5 shows that, because we do not decrease chip size as we scale down feature size, the worst-case interconnect delay increases. One way to measure the worst-case delay uses an interconnect that completely crosses the chip, a coast-to-coast interconnect. In certain cases the worst-case delay of a 0.25 μm process may be worse than a 0.35 μm process, for example.
FIGURE 16.5 Worst-case interconnect delay. As we scale circuits, but avoid scaling the chip size, the worst-case interconnect delay increases.
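A wire-load table is, in effect, a lookup from block size and fanout to a predicted load. The Python sketch below turns the average lengths of Table 16.1 into predicted capacitances using the figure of about 2 pF per cm quoted above. The dictionary layout and function name are assumptions, and the lengths are the Table 16.1 entries read as one row per array size.

```python
# Average interconnect length in mm, indexed by available gates, then fanout
# (Table 16.1, read with one row per array size).
WIRE_LENGTH_MM = {
    3_000:   {1: 0.56, 2: 0.85, 4: 1.46},
    11_000:  {1: 0.84, 2: 1.34, 4: 2.25},
    105_000: {1: 1.75, 2: 2.70, 4: 4.92},
}
CAP_PF_PER_MM = 0.2   # about 2 pF per cm, the typical figure quoted in the text

def predicted_capacitance_pf(gates, fanout):
    """Predicted net capacitance (pF) from the average interconnect length."""
    return WIRE_LENGTH_MM[gates][fanout] * CAP_PF_PER_MM

for gates in WIRE_LENGTH_MM:
    print(gates, [round(predicted_capacitance_pf(gates, fo), 3) for fo in (1, 2, 4)])
```

Every net with the same fanout in the same array receives the same prediction, which is exactly the source of the error illustrated by nets A and B in Figure 16.4 (c).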
16.1.3 Floorplanning Tools

Figure 16.6 (a) shows an initial random floorplan generated by a floorplanning tool. Two of the blocks, A and C in this example, are standard-cell areas (the chip shown in Figure 16.1 is one large standard-cell area). These are flexible blocks (or variable blocks ) because, although their total area is fixed, their shape (aspect ratio) and connector locations may be adjusted during the placement step. The dimensions and connector locations of the other fixed blocks (perhaps RAM, ROM, compiled cells, or megacells) can only be modified when they are created. We may force logic cells to be in selected flexible blocks by seeding . We choose seed cells by name. For example, ram_control* would select all logic cells whose names started with ram_control to be placed in one flexible block. The special symbol, usually ' * ', is a wildcard symbol . Seeding may be hard or soft. A hard seed is fixed and not allowed to move during the remaining floorplanning and placement steps. A soft seed is an initial suggestion only and can be altered if necessary by the floorplanner. We
may also use seed connectors within flexible blocks, forcing certain nets to appear in a specified order or location at the boundary of a flexible block.
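Seeding by name with a wildcard can be sketched in a few lines. The example below uses Python's standard `fnmatch` module to pick seed cells; the cell names and the helper `select_seed_cells` are hypothetical, and a real floorplanner will have its own command syntax.

```python
from fnmatch import fnmatchcase


def select_seed_cells(cell_names, pattern):
    """Return the cells whose names match a wildcard pattern such as 'ram_control*'."""
    return [name for name in cell_names if fnmatchcase(name, pattern)]


# Hypothetical netlist cell names, for illustration only.
cells = ["ram_control_dec0", "ram_control_fsm", "alu_add1", "rom_control_x"]
seeds = select_seed_cells(cells, "ram_control*")

if __name__ == "__main__":
    print(seeds)
```

Whether the selected cells form a hard or a soft seed would be a separate attribute attached to the selection.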
FIGURE 16.6 Floorplanning a cell-based ASIC. (a) Initial floorplan generated by the floorplanning tool. Two of the blocks are flexible (A and C) and contain rows of standard cells (unplaced). A pop-up window shows the status of block A. (b) An estimated placement for flexible blocks A and C. The connector positions are known and a rat's nest display shows the heavy congestion below block B. (c) Moving blocks to improve the floorplan. (d) The updated display shows the reduced congestion after the changes.

The floorplanner can complete an estimated placement to determine the positions of connectors at the boundaries of the flexible blocks. Figure 16.6 (b) illustrates a rat's nest display of the connections between blocks. Connections are shown as bundles between the centers of blocks or as flight lines between connectors. Figure 16.6 (c) and (d) show how we can move the blocks in a floorplanning tool to minimize routing congestion.

We need to control the aspect ratio of our floorplan because we have to fit our chip into the die cavity (a fixed-size hole, usually square) inside a package. Figure 16.7 (a)-(c) show how we can rearrange our chip to achieve a square aspect ratio. Figure 16.7 (c) also shows a congestion map, another form of routability display. There is no standard measure of routability. Generally the interconnect channels (or wiring channels; I shall call them channels from now on) have a certain channel capacity; that is, they can handle only a fixed number of interconnects. One measure of congestion is the difference between the number of interconnects that we actually need, called the channel density, and the channel capacity. Another measure, shown in Figure 16.7 (c), uses the ratio of channel density to the channel capacity. With practice, we can create a good initial placement by floorplanning and a pictorial display. This is one area where the human ability to recognize patterns and spatial relations is currently superior to a computer program's ability.
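The two congestion measures just described (the difference and the ratio of channel density to channel capacity) are easy to compute once both numbers are known for each channel. The sketch below assumes hypothetical channel names and values; it is not taken from any particular tool.

```python
def congestion_measures(channel_density, channel_capacity):
    """Return (overflow, ratio) for one channel.

    overflow > 0, or ratio > 1, means more interconnects are needed
    than the channel can hold.
    """
    overflow = channel_density - channel_capacity
    ratio = channel_density / channel_capacity
    return overflow, ratio


# Hypothetical (density, capacity) pairs for three channels.
channels = {"ch_A": (5, 7), "ch_B": (9, 7), "ch_C": (7, 7)}

if __name__ == "__main__":
    for name, (density, capacity) in channels.items():
        overflow, ratio = congestion_measures(density, capacity)
        print(f"{name}: overflow={overflow:+d}, ratio={ratio:.2f}")
```

A congestion map such as the one in Figure 16.7 (c) would shade each channel according to the ratio.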
FIGURE 16.7 Congestion analysis. (a) The initial floorplan with a 2:1.5 die aspect ratio. (b) Altering the floorplan to give a 1:1 chip aspect ratio. (c) A trial floorplan with a congestion map. Blocks A and C have been placed so that we know the terminal positions in the channels. Shading indicates the ratio of channel density to the channel capacity. Dark areas show regions that cannot be routed because the channel congestion exceeds the estimated capacity. (d) Resizing flexible blocks A and C alleviates congestion.
FIGURE 16.8 Routing a T-junction between two channels in two-level metal. The dots represent logic cell pins. (a) Routing channel A (the stem of the T) first allows us to adjust the width of channel B. (b) If we route channel B first (the top of the T), this fixes the width of channel A. We have to route the stem of a T-junction before we route the top.
16.1.4 Channel Definition

During the floorplanning step we assign the areas between blocks that are to be used for interconnect. This process is known as channel definition or channel allocation . Figure 16.8 shows a T-shaped junction between two rectangular channels and illustrates why we must route the stem (vertical) of the T before the bar. The general problem of choosing the order of rectangular channels to route is channel ordering .
FIGURE 16.9 Defining the channel routing order for a slicing floorplan using a slicing tree. (a) Make a cut all the way across the chip between circuit blocks. Continue slicing until each piece contains just one circuit block. Each cut divides a piece into two without cutting through a circuit block. (b) A sequence of cuts: 1, 2, 3, and 4 that successively slices the chip until only circuit blocks are left. (c) The slicing tree corresponding to the sequence of cuts gives the order in which to route the channels: 4, 3, 2, and finally 1.

Figure 16.9 shows a floorplan of a chip containing several blocks. Suppose we cut along the block boundaries, slicing the chip into two pieces (Figure 16.9a). Then suppose we can slice each of these pieces into two. If we can continue in this fashion until all the blocks are separated, then we have a slicing floorplan (Figure 16.9b). Figure 16.9 (c) shows how the sequence we use to slice the chip defines a hierarchy of the blocks. Reversing the slicing order ensures that we route the stems of all the channel T-junctions first.
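One way to picture the routing order is as a post-order walk of the slicing tree: every cut made inside a piece is routed before the cut that created the piece. The sketch below is a minimal illustration under that assumption. The tree shape and block names are hypothetical (the actual tree of Figure 16.9 is not reproduced here), chosen so that cuts 1 to 4 are nested.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Cut:
    """A slicing cut (one channel) that splits a piece into two sub-pieces."""
    number: int
    first: "Node"
    second: "Node"


Node = Union[Cut, str]  # a leaf is just a block name


def routing_order(node: Node) -> List[int]:
    """Post-order traversal: channels of later (inner) cuts are routed first."""
    if isinstance(node, str):
        return []
    return routing_order(node.first) + routing_order(node.second) + [node.number]


# Hypothetical slicing tree: cut 1 is made first, then 2, 3, and 4.
tree = Cut(1, Cut(2, "A", Cut(3, "B", Cut(4, "C", "D"))), "E")

if __name__ == "__main__":
    print(routing_order(tree))
```

For this nested tree the traversal gives the reverse of the cutting sequence, matching the order described in the text.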
FIGURE 16.10 Cyclic constraints. (a) A nonslicing floorplan with a cyclic constraint that prevents channel routing. (b) In this case it is difficult to find a slicing floorplan without increasing the chip area. (c) This floorplan may be sliced (with initial cuts 1 or 2) and has no cyclic constraints, but it is inefficient in area use and will be very difficult to route.

Figure 16.10 shows a floorplan that is not a slicing structure. We cannot cut the chip all the way across with a knife without chopping a circuit block in two. This means we cannot route any of the channels in this floorplan without routing all of the other channels first. We say there is a cyclic constraint in this floorplan. There are two solutions to this problem. One solution is to move the blocks until we obtain a slicing floorplan. The other solution is to allow the use of L-shaped, rather than rectangular, channels (or areas with fixed connectors on all sides, a switch box). We need an area-based router rather than a channel router to route L-shaped regions or switch boxes (see Section 17.2.6, Area-Routing Algorithms).

Figure 16.11 (a) displays the floorplan of the ASIC shown in Figure 16.7. We can remove the cyclic constraint by moving the blocks again, but this increases the chip size. Figure 16.11 (b) shows an alternative solution. We merge the flexible standard cell areas A and C. We can do this by selective flattening of the netlist. Sometimes flattening can reduce the routing area because routing between blocks is usually less efficient than routing inside the row-based blocks. Figure 16.11 (b) shows the channel definition and routing order for our chip.
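A cyclic constraint can be viewed as a cycle in a "must be routed before" graph of channels. The sketch below, which assumes Python 3.9 or later for the standard `graphlib` module, tries to find a routing order and reports failure when the dependencies are cyclic. The channel names and both dependency graphs are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def channel_route_order(routed_first):
    """routed_first maps each channel to the channels that must be routed before it.

    Returns a valid routing order, or None if there is a cyclic constraint.
    """
    try:
        return list(TopologicalSorter(routed_first).static_order())
    except CycleError:
        return None


# A slicing floorplan: each stem is routed before the channel it abuts.
slicing = {"ch1": {"ch2"}, "ch2": {"ch3"}, "ch3": set()}

# A nonslicing "pinwheel": every channel waits for another one.
pinwheel = {"a": {"b"}, "b": {"c"}, "c": {"d"}, "d": {"a"}}

if __name__ == "__main__":
    print(channel_route_order(slicing))
    print(channel_route_order(pinwheel))
```

Merging blocks or allowing L-shaped channels, as described above, corresponds to removing an edge or collapsing nodes in this graph so that the cycle disappears.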
FIGURE 16.11 Channel definition and ordering. (a) We can eliminate the cyclic constraint by merging the blocks A and C. (b) A slicing structure.
16.1.5 I/O and Power Planning

Every chip communicates with the outside world. Signals flow onto and off the chip and we need to supply power. We need to consider the I/O and power constraints early in the floorplanning process. A silicon chip or die (plural die, dies, or dice) is mounted on a chip carrier inside a chip package. Connections are made by bonding the chip pads to fingers on a metal lead frame that is part of the package. The metal lead-frame fingers connect to the package pins. A die consists of a logic core inside a pad ring. Figure 16.12 (a) shows a pad-limited die and Figure 16.12 (b) shows a core-limited die. On a pad-limited die we use tall, thin pad-limited pads, which maximize the number of pads we can fit around the outside of the chip. On a core-limited die we use short, wide core-limited pads. Figure 16.12 (c) shows how we can use both types of pad to change the aspect ratio of a die to be different from that of the core.
FIGURE 16.12 Pad-limited and core-limited die. (a) A pad-limited die. The number of pads determines the die size. (b) A core-limited die: The core logic determines the die size. (c) Using both pad-limited pads and core-limited pads for a square die.

Special power pads are used for the positive supply, or VDD, power buses (or power rails) and the ground or negative supply, VSS or GND. Usually one set of VDD/VSS pads supplies one power ring that runs around the pad ring and supplies power to the I/O pads only. Another set of VDD/VSS pads connects to a second power ring that supplies the logic core. We sometimes call the I/O power dirty power since it has to supply large transient currents to the output transistors. We keep dirty power separate to avoid injecting noise into the internal-logic power (the clean power). I/O pads also contain special circuits to protect against electrostatic discharge (ESD). These circuits can withstand very short high-voltage (several kilovolt) pulses that can be generated during human or machine handling.

Depending on the type of package and how the foundry attaches the silicon die to the chip cavity in the chip carrier, there may be an electrical connection between the chip carrier and the die substrate. Usually the die is cemented in the chip cavity with a conductive epoxy, making an electrical connection between substrate and the package cavity in the chip carrier. If we make an electrical connection between the substrate and a chip pad, or to a package pin, it must be to VDD (n-type substrate) or VSS (p-type substrate). This substrate connection (for the whole chip) employs a down bond (or drop bond) to the carrier. We have several options:
• We can dedicate one (or more) chip pad(s) to down bond to the chip carrier.
• We can make a connection from a chip pad to the lead frame and down bond from the chip pad to the chip carrier.
• We can make a connection from a chip pad to the lead frame and down bond from the lead frame.
• We can down bond from the lead frame without using a chip pad.
• We can leave the substrate and/or chip carrier unconnected.
Depending on the package design, the type and positioning of down bonds may be fixed. This means we need to fix the position of the chip pad for down bonding using a pad seed . A double bond connects two pads to one chip-carrier finger and one package pin. We can do this to save package pins or reduce the series inductance of bond wires (typically a few nanohenries) by parallel connection of the pads. A multiple-signal pad or pad group is a set of pads. For example, an oscillator pad usually comprises a set of two adjacent pads that we connect to an external crystal. The oscillator circuit and the two signal pads form a single logic cell. Another common example is a clock pad . Some foundries allow a special form of corner pad (normal pads are edge pads ) that squeezes two pads into the area at the corners of a chip using a special two-pad corner cell , to help meet bond-wire angle design rules (see also Figure 16.13 b and c). To reduce the series resistive and inductive impedance of power supply networks, it is normal to use multiple VDD and VSS pads. This is particularly important with the simultaneously switching outputs ( SSOs ) that occur when driving buses off-chip [ Wada, Eino, and Anami, 1990]. The output pads can easily consume most of the power on a CMOS ASIC, because the load on a pad (usually tens of picofarads) is much larger than typical on-chip capacitive loads. Depending on the technology it may be necessary to provide dedicated VDD and VSS pads for every few SSOs. Design rules set how many SSOs can be used per VDD/VSS pad pair. These dedicated VDD/VSS pads must follow groups of output pads as they are seeded or planned on the floorplan. With some chip packages this can become difficult because design rules limit the location of package pins that may be used for supplies (due to the differing series inductance of each pin).
Using a pad mapping we translate the logical pad in a netlist to a physical pad from a pad library . We might control pad seeding and mapping in the floorplanner. The handling of I/O pads can become quite complex; there are several nonobvious factors that must be considered when generating a pad ring:
• Ideally we would only need to design library pad cells for one orientation. For example, an edge pad for the south side of the chip, and a corner pad for the southeast corner. We could then generate other orientations by rotation and flipping (mirroring). Some ASIC vendors will not allow rotation or mirroring of logic cells in the mask file. To avoid these problems we may need to have separate horizontal, vertical, left-handed, and right-handed pad cells in the library with appropriate logical to physical pad mappings.
• If we mix pad-limited and core-limited edge pads in the same pad ring, this complicates the design of corner pads. Usually the two types of edge pad cannot abut. In this case a corner pad also becomes a pad-format changer, or hybrid corner pad.
• In single-supply chips we have one VDD net and one VSS net, both global power nets. It is also possible to use mixed power supplies (for example, 3.3 V and 5 V) or multiple power supplies (digital VDD, analog VDD).
Figure 16.13 (a) and (b) are magnified views of the southeast corner of our example chip and show the different types of I/O cells. Figure 16.13 (c) shows a stagger-bond arrangement using two rows of I/O pads. In this case the design rules for bond wires (the spacing and the angle at which the bond wires leave the pads) become very important.
FIGURE 16.13 Bonding pads. (a) This chip uses both pad-limited and core-limited pads. (b) A hybrid corner pad. (c) A chip with stagger-bonded pads. (d) An area-bump bonded chip (or flip-chip). The chip is turned upside down and solder bumps connect the pads to the lead frame.

Figure 16.13 (d) shows an area-bump bonding arrangement (also known as flip-chip, solder-bump or C4, terms coined by IBM who developed this technology [Masleid, 1991]) used, for example, with ball-grid array (BGA) packages. Even though the bonding pads are located in the center of the chip, the I/O circuits are still often located at the edges of the chip because of difficulties in power supply distribution and integrating I/O circuits together with logic in the center of the die.

In an MGA the pad spacing and I/O-cell spacing is fixed: each pad occupies a fixed pad slot (or pad site). This means that the properties of the pad I/O are also fixed but, if we need to, we can parallel adjacent output cells to increase the drive. To increase flexibility further the I/O cells can use a separation, the I/O-cell pitch, that is smaller than the pad pitch. For example, three 4 mA driver cells can occupy two pad slots. Then we can use two 4 mA output cells in parallel to drive one pad, forming an 8 mA output pad as shown in Figure 16.14. This arrangement also means the I/O pad cells can be changed without changing the base array. This is useful as bonding techniques improve and the pads can be moved closer together.
FIGURE 16.14 Gate-array I/O pads. (a) Cell-based ASICs may contain pad cells of different sizes and widths. (b) A corner of a gate-array base. (c) A gate-array base with different I/O cell and pad pitches.
FIGURE 16.15 Power distribution. (a) Power distributed using m1 for VSS and m2 for VDD. This helps minimize the number of vias and layer crossings needed but causes problems in the routing channels. (b) In this floorplan m1 is run parallel to the longest side of all channels, the channel spine. This can make automatic routing easier but may increase the number of vias and layer crossings. (c) An expanded view of part of a channel (interconnect is shown as lines). If power runs on different layers along the spine of a channel, this forces signals to change layers. (d) A closeup of VDD and VSS buses as they cross. Changing layers requires a large number of via contacts to reduce resistance. Figure 16.15 shows two possible power distribution schemes. The long direction of a rectangular channel is the channel spine . Some automatic routers may require that metal lines parallel to a channel spine use a preferred layer (either m1, m2, or m3). Alternatively we say that a particular metal layer runs in a preferred direction . Since we can have both horizontal and vertical channels, we may have the situation shown
in Figure 16.15 , where we have to decide whether to use a preferred layer or the preferred direction for some channels. This may or may not be handled automatically by the routing software.
16.1.6 Clock Planning

Figure 16.16 (a) shows a clock spine (not to be confused with a channel spine) routing scheme with all clock pins driven directly from the clock driver. MGAs and FPGAs often use this fish bone type of clock distribution scheme. Figure 16.16 (b) shows a clock spine for a cell-based ASIC. Figure 16.16 (c) shows the clock-driver cell, often part of a special clock-pad cell. Figure 16.16 (d) illustrates clock skew and clock latency . Since all clocked elements are driven from one net with a clock spine, skew is caused by differing interconnect lengths and loads. If the clock-driver delay is much larger than the interconnect delays, a clock spine achieves minimum skew but with long latency.
FIGURE 16.16 Clock distribution. (a) A clock spine for a gate array. (b) A clock spine for a cell-based ASIC (typical chips have thousands of clock nets). (c) A clock spine is usually driven from one or more clock-driver cells. Delay in the driver cell is a function of the number of stages and the ratio of output to input capacitance for each stage (taper). (d) Clock latency and clock skew. We would like to minimize both latency and skew.

Clock skew represents a fraction of the clock period that we cannot use for computation. A clock skew of 500 ps with a 200 MHz clock means that we waste 500 ps of every 5 ns clock cycle, or 10 percent of performance. Latency can cause a similar loss of performance at the system level when we need to resynchronize our output signals with a master system clock.

Figure 16.16 (c) illustrates the construction of a clock-driver cell. The delay through a chain of CMOS gates is minimized when the ratio between the input capacitance $C_1$ and the output (load) capacitance $C_2$ is about 3 (exactly $e \approx 2.7$, an exponential ratio, if we neglect the effect of parasitics). This means that the fastest way to drive a large load is to use a chain of buffers with their input and output loads chosen to maintain this ratio, or taper (we use this as a noun and a verb). This is not necessarily the smallest or lowest-power method, though.

Suppose we have an ASIC with the following specifications:
• 40,000 flip-flops
• Input capacitance of the clock input to each flip-flop is 0.025 pF
• Clock frequency is 200 MHz
• $V_{DD}$ = 3.3 V
• Chip size is 20 mm on a side
• Clock spine consists of 200 lines across the chip
• Interconnect capacitance is 2 pF cm⁻¹
In this case the clock-spine capacitance is $C_L = 200 \times 2\ \text{cm} \times 2\ \text{pF cm}^{-1} = 800\ \text{pF}$. If we drive the clock spine with a chain of buffers with taper equal to $e \approx 2.7$, and with a first-stage input capacitance of 0.025 pF (a reasonable value for a 0.5 µm process), we will need

$$\log \frac{800 \times 10^{-12}}{0.025 \times 10^{-12}} \quad \text{or 11 stages.} \tag{16.1}$$

The power dissipated charging the input capacitance of the flip-flop clock is $fCV^2$ or

$$P_1^1 = (4 \times 10^{4})\,(200\ \text{MHz})\,(0.025\ \text{pF})\,(3.3\ \text{V})^2 = 2.178\ \text{W}, \tag{16.2}$$

or approximately 2 W. This is only a little larger than the power dissipated driving the 800 pF clock-spine interconnect that we can calculate as follows:

$$P_2^1 = (200)\,(200\ \text{MHz})\,(20\ \text{mm})\,(2\ \text{pF cm}^{-1})\,(3.3\ \text{V})^2 = 1.7424\ \text{W}. \tag{16.3}$$

All of this power is dissipated in the clock-driver cell. The worst problem, however, is the enormous peak current in the final inverter stage. If we assume the needed rise time is 0.1 ns (with a 200 MHz clock whose period is 5 ns), the peak current would have to approach

$$I = \frac{(800\ \text{pF})\,(3.3\ \text{V})}{0.1\ \text{ns}} \approx 25\ \text{A}. \tag{16.4}$$

Clearly such a current is not possible without extraordinary design techniques. Clock spines are used to drive loads of 100 to 200 pF but, as is apparent from the power dissipation problems of this example, it would be better to find a way to spread the power dissipation more evenly across the chip.
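The arithmetic of this clock-spine example can be checked with a short script. The sketch below simply re-evaluates Eqs. 16.1 to 16.4 from the specification list; the variable names are chosen for this illustration, and the stage count assumes the logarithm in Eq. 16.1 is the natural logarithm (consistent with a taper of e).

```python
import math

# Specifications from the example.
n_ff = 40_000          # number of flip-flops
c_ff = 0.025e-12       # clock input capacitance per flip-flop (F)
f_clk = 200e6          # clock frequency (Hz)
vdd = 3.3              # supply voltage (V)
chip_side_cm = 2.0     # 20 mm on a side
n_lines = 200          # clock-spine lines across the chip
c_per_cm = 2e-12       # interconnect capacitance (F/cm)
t_rise = 0.1e-9        # required rise time (s)

# Clock-spine capacitance: 200 lines x 2 cm x 2 pF/cm.
c_spine = n_lines * chip_side_cm * c_per_cm

# Eq. 16.1: number of buffer stages with taper e.
stages = math.ceil(math.log(c_spine / c_ff))

# Eq. 16.2 and 16.3: power is f * C * V^2.
p_flip_flops = n_ff * f_clk * c_ff * vdd**2
p_spine = f_clk * c_spine * vdd**2

# Eq. 16.4: peak current I = C * V / t.
i_peak = c_spine * vdd / t_rise

if __name__ == "__main__":
    print(f"C_spine = {c_spine * 1e12:.0f} pF, stages = {stages}")
    print(f"P_ff = {p_flip_flops:.3f} W, P_spine = {p_spine:.4f} W")
    print(f"I_peak = {i_peak:.1f} A")
```

The peak current evaluates to about 26 A, in line with the figure of roughly 25 A quoted in Eq. 16.4.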
We can design a tree of clock buffers so that the taper of each stage is $e \approx 2.7$ by using a fanout of three at each node, as shown in Figure 16.17 (a) and (b). The clock tree, shown in Figure 16.17 (c), uses the same number of stages as a clock spine, but with a lower peak current for the inverter buffers. Figure 16.17 (c) illustrates that we now have another problem: we need to balance the delays through the tree carefully to minimize clock skew (see Section 17.3.1, Clock Routing).
FIGURE 16.17 A clock tree. (a) Minimum delay is achieved when the taper of successive stages is about 3. (b) Using a fanout of three at successive nodes. (c) A clock tree for the cell-based ASIC of Figure 16.16b. We have to balance the clock arrival times at all of the leaf nodes to minimize clock skew.

Designing a clock tree that balances the rise and fall times at the leaf nodes has the beneficial side-effect of minimizing the effect of hot-electron wearout. This problem occurs when an electron gains enough energy to become hot and jump out of the channel into the gate oxide (the problem is worse for electrons in n-channel devices because electrons are more mobile than holes). The trapped electrons change the threshold voltage of the device and this alters the delay of the buffers. As the buffer delays change with time, this introduces unpredictable skew. The problem is worst when the n-channel device is carrying maximum current with a high voltage across the channel; this occurs during the rise- and fall-time transitions. Balancing the rise and fall times in each buffer means that they all wear out at the same rate, minimizing any additional skew.

A phase-locked loop (PLL) is an electronic flywheel that locks in frequency to an input clock signal. The input and output frequencies may differ in phase, however. This means that we can, for example, drive a clock network with a PLL in such a way that the output of the clock network is locked in phase to the incoming clock, thus eliminating the latency of the clock network. A PLL can also help to reduce random variation of the input clock frequency, known as jitter, which, since it is unpredictable, must also be discounted from the time available for computation in each clock cycle. Actel was one of the first FPGA vendors to incorporate PLLs, and Actel's online product literature explains their use in ASIC design.

1. Interconnect lengths are derived from interconnect capacitance data. Interconnect capacitance is 2 pF cm⁻¹.
16.2 Placement
After completing a floorplan we can begin placement of the logic cells within the flexible blocks. Placement is much more suited to automation than floorplanning. Thus we shall need measurement techniques and algorithms. After we complete floorplanning and placement, we can predict both intrablock and interblock capacitances. This allows us to return to logic synthesis with more accurate estimates of the capacitive loads that each logic cell must drive.
16.2.1 Placement Terms and Definitions

CBIC, MGA, and FPGA architectures all have rows of logic cells separated by the interconnect; these are row-based ASICs. Figure 16.18 shows an example of the interconnect structure for a CBIC. Interconnect runs in horizontal and vertical directions in the channels and in the vertical direction by crossing through the logic cells. Figure 16.18 (c) illustrates the fact that it is possible to use over-the-cell routing (OTC routing) in areas that are not blocked. However, OTC routing is complicated by the fact that the logic cells themselves may contain metal on the routing layers. We shall return to this topic in Section 17.2.7, Multilevel Routing. Figure 16.19 shows the interconnect structure of a two-level metal MGA.
FIGURE 16.18 Interconnect structure. (a) The two-level metal CBIC floorplan shown in Figure 16.11 b. (b) A channel from the flexible block A. This channel has a channel height equal to the maximum channel density of 7 (there is room for seven interconnects to run horizontally in m1). (c) A channel that uses OTC (over-the-cell) routing in m2. Most ASICs currently use two or three levels of metal for signal routing. With two layers of metal, we route within the rectangular channels using the first metal layer for horizontal routing, parallel to the channel spine, and the second metal layer for the vertical direction (if there is a third metal layer it will normally run in the horizontal direction again). The maximum number of horizontal interconnects that can be placed side by side, parallel to the channel spine, is the channel capacity .
FIGURE 16.19 Gate-array interconnect. (a) A small two-level metal gate array (about 4.6 k-gate). (b) Routing in a block. (c) Channel routing showing channel density and channel capacity. The channel height on a gate array may only be increased in increments of a row. If the interconnect does not use up all of the channel, the rest of the space is wasted. The interconnect in the channel runs in m1 in the horizontal direction with m2 in the vertical direction. Vertical interconnect uses feedthroughs (or feedthrus in the United States) to cross the logic cells. Here are some commonly used terms with explanations (there are no generally accepted definitions):
• An unused vertical track (or just track) in a logic cell is called an uncommitted feedthrough (also built-in feedthrough, implicit feedthrough, or jumper).
• A vertical strip of metal that runs from the top to bottom of a cell (for double-entry cells), but has no connections inside the cell, is also called a feedthrough or jumper.
• Two connectors for the same physical net are electrically equivalent connectors (or equipotential connectors). For double-entry cells these are usually at the top and bottom of the logic cell.
• A dedicated feedthrough cell (or crosser cell) is an empty cell (with no logic) that can hold one or more vertical interconnects. These are used if there are no other feedthroughs available.
• A feedthrough pin or feedthrough terminal is an input or output that has connections at both the top and bottom of the standard cell.
• A spacer cell (usually the same as a feedthrough cell) is used to fill space in rows so that the ends of all rows in a flexible block may be aligned to connect to power buses, for example.
There is no standard terminology for connectors and the terms can be very confusing. There is a difference between connectors that are joined inside the logic cell using a high-resistance material such as polysilicon and connectors that are joined by low-resistance metal. The high-resistance kind are really two separate alternative connectors (that cannot be used as a feedthrough), whereas the low-resistance kind are electrically equivalent connectors. There may be two or more connectors to a logic cell, which are not joined inside the cell, and which must be joined by the router (must-join connectors).

There are also logically equivalent connectors (or functionally equivalent connectors, sometimes also called just equivalent connectors, which is very confusing). The two inputs of a two-input NAND gate may be logically equivalent connectors. The placement tool can swap these without altering the logic (but the two inputs may have different delay properties, so it is not always a good idea to swap them). There can also be logically equivalent connector groups. For example, in an OAI22 (OR-AND-INVERT) gate there are four inputs: A1, A2 are inputs to one OR gate (gate A), and B1, B2 are inputs to the second OR gate (gate B). Then group A = (A1, A2) is logically equivalent to group B = (B1, B2); if we swap one input (A1 or A2) from gate A to gate B, we must swap the other input in the group (A2 or A1).

In the case of channeled gate arrays and FPGAs, the horizontal interconnect areas (the channels, usually on m1) have a fixed capacity (sometimes they are called fixed-resource ASICs for this reason). The channel capacity of CBICs and channelless MGAs can be expanded to hold as many interconnects as are needed. Normally we choose, as an objective, to minimize the number of interconnects that use each channel. In the vertical interconnect direction, usually m2, FPGAs still have fixed resources. In contrast the placement tool can always add vertical feedthroughs to a channeled MGA, channelless MGA, or CBIC. These problems become less important as we move to three and more levels of interconnect.
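The rule for logically equivalent connector groups in the OAI22 example can be checked by brute force over all input combinations. The sketch below is an illustration only; the function names are invented for this example.

```python
from itertools import product


def oai22(a1, a2, b1, b2):
    """OR-AND-INVERT: NOT((A1 OR A2) AND (B1 OR B2))."""
    return not ((a1 or a2) and (b1 or b2))


def group_swap(a1, a2, b1, b2):
    """Swap the whole group A = (A1, A2) with group B = (B1, B2)."""
    return oai22(b1, b2, a1, a2)


def single_swap(a1, a2, b1, b2):
    """Swap only A1 with B1, leaving A2 and B2 in place."""
    return oai22(b1, a2, a1, b2)


def equivalent(f, g):
    """True if f and g agree on all 16 input combinations."""
    return all(f(*bits) == g(*bits) for bits in product([False, True], repeat=4))


if __name__ == "__main__":
    print("group swap preserves logic: ", equivalent(oai22, group_swap))
    print("single swap preserves logic:", equivalent(oai22, single_swap))
```

Swapping a whole group leaves the function unchanged, whereas moving a single input between groups does not, which is why the placement tool must treat the group as a unit.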
16.2.2 Placement Goals and Objectives

The goal of a placement tool is to arrange all the logic cells within the flexible blocks on a chip. Ideally, the objectives of the placement step are to
• Guarantee the router can complete the routing step
• Minimize all the critical net delays
• Make the chip as dense as possible
We may also have the following additional objectives:

• Minimize power dissipation
• Minimize cross talk between signals
Objectives such as these are difficult to define in a way that can be solved with an algorithm and even harder to actually meet. Current placement tools use more specific and achievable criteria. The most commonly used placement objectives are one or more of the following:
• Minimize the total estimated interconnect length
• Meet the timing requirements for critical nets
• Minimize the interconnect congestion
Each of these objectives in some way represents a compromise.
16.2.3 Measurement of Placement Goals and Objectives

In order to determine the quality of a placement, we need to be able to measure it. We need an approximate measure of interconnect length, closely correlated with the final interconnect length, that is easy to calculate. The graph structures that correspond to making all the connections for a net are known as trees on graphs (or just trees). Special classes of trees, Steiner trees, minimize the total length of interconnect and they are central to ASIC routing algorithms. Figure 16.20 shows a minimum Steiner tree. This type of tree uses diagonal connections; we want to solve a restricted version of this problem, using interconnects on a rectangular grid. This is called rectilinear routing or Manhattan routing (because of the east-west and north-south grid of streets in Manhattan). We say that the Euclidean distance between two points is the straight-line distance (as the crow flies). The Manhattan distance (or rectangular distance) between two points is the distance we would have to walk in New York.
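The two distance measures can be written down directly. In this minimal sketch, points are (x, y) tuples in arbitrary units and the function names are chosen for illustration.

```python
import math


def euclidean(p, q):
    """Straight-line distance (as the crow flies)."""
    return math.hypot(q[0] - p[0], q[1] - p[1])


def manhattan(p, q):
    """Rectangular distance along a grid of horizontal and vertical tracks."""
    return abs(q[0] - p[0]) + abs(q[1] - p[1])


if __name__ == "__main__":
    a, b = (0, 0), (3, 4)
    print(euclidean(a, b), manhattan(a, b))
```

The Manhattan distance is never shorter than the Euclidean distance, and the two agree only when the points share an x or a y coordinate.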
FIGURE 16.20 Placement using trees on graphs. (a) The floorplan from Figure 16.11(b). (b) An expanded view of the flexible block A showing four rows of standard cells for placement (typical blocks may contain thousands or tens of thousands of logic cells). We want to find the length of the net shown with four terminals, W through Z, given the placement of four logic cells (labeled: A.211, A.19, A.43, A.25). (c) The problem for net (W, X, Y, Z) drawn as a graph. The shortest connection is the minimum Steiner tree. (d) The minimum rectilinear Steiner tree using Manhattan routing. The rectangular (Manhattan) interconnect-length measures are shown for each tree.
The minimum rectilinear Steiner tree (MRST) is the shortest interconnect using a rectangular grid. The determination of the MRST is in general an NP-complete problem, which means it is hard to solve. For small numbers of terminals heuristic algorithms do exist, but they are expensive to compute. Fortunately we only need to estimate the length of the interconnect. Two approximations to the MRST are shown in Figure 16.21.
The complete graph has connections from each terminal to every other terminal [Hanan, Wolff, and Agule, 1973]. The complete-graph measure adds all the interconnect lengths of the complete-graph connection together and then divides by n/2, where n is the number of terminals. We can justify this since, in a graph with n terminals, (n - 1) interconnects will emanate from each terminal to join the other (n - 1) terminals in a complete-graph connection. That makes n(n - 1) interconnects in total. However, we have then made each connection twice. So there are one-half this many, or n(n - 1)/2, interconnects needed for a complete-graph connection. Now we actually only need (n - 1) interconnects to join n terminals, so we have n/2 times as many interconnects as we really need. Hence we divide the total net length of the complete-graph connection by n/2 to obtain a more reasonable estimate of minimum interconnect length. Figure 16.21(a) shows an example of the complete-graph measure.
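The complete-graph measure described above can be sketched in a few lines of Python. This is only an illustration: the function names and the terminal coordinates are assumptions, and Manhattan distance is assumed for each connection.

```python
from itertools import combinations

def manhattan(p, q):
    """Rectilinear distance between two (x, y) points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def complete_graph_measure(terminals):
    """Sum the lengths of all n(n - 1)/2 pairwise connections, then divide by n/2."""
    n = len(terminals)
    if n < 2:
        return 0.0
    total = sum(manhattan(p, q) for p, q in combinations(terminals, 2))
    return total / (n / 2)

net = [(0, 0), (4, 0), (0, 3), (4, 3)]   # hypothetical four-terminal net
print(complete_graph_measure(net))        # 14.0
```

For a two-terminal net the divisor n/2 is 1, so the measure reduces to the Manhattan distance between the two terminals, as we would expect.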
FIGURE 16.21 Interconnect-length measures. (a) Complete-graph measure. (b) Half-perimeter measure.
The bounding box is the smallest rectangle that encloses all the terminals (not to be confused with a logic cell bounding box, which encloses all the layout in a logic cell). The half-perimeter measure (or bounding-box measure) is one-half the perimeter of the bounding box (Figure 16.21b) [Schweikert, 1976]. For nets with two or three terminals (corresponding to a fanout of one or two, which usually includes over 50 percent of all nets on a chip), the half-perimeter measure is the same as the minimum Steiner tree. For nets with four or five terminals, the minimum Steiner tree is between one and two times the half-perimeter measure [Hanan, 1966]. For a circuit with m nets, using the half-perimeter measure corresponds to minimizing the cost function,
f = \frac{1}{2} \sum_{i=1}^{m} h_i , (16.5)
where h_i is the half-perimeter measure for net i. It does not really matter if our approximations are inaccurate if there is a good correlation between actual interconnect lengths (after routing) and our approximations. Figure 16.22 shows that we can adjust the complete-graph and half-perimeter measures using correction factors [Goto and Matsuda, 1986]. Now our wiring length approximations are functions, not just of the terminal positions, but also of the number of terminals, and the size of the bounding box. One practical example adjusts a Steiner-tree approximation using the number of terminals [Chao, Nequist, and Vuong, 1990]. This technique is used in the Cadence Gate Ensemble placement tool, for example.
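The half-perimeter measure and the cost function of Eq. 16.5 are simple enough to sketch directly. The Python below is an illustration only; the function names and the net coordinates are hypothetical.

```python
def half_perimeter(terminals):
    """Half the perimeter of the smallest rectangle enclosing all terminals."""
    xs = [x for x, _ in terminals]
    ys = [y for _, y in terminals]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def placement_cost(nets):
    """Cost function of Eq. 16.5: one-half the sum of h_i over all nets."""
    return 0.5 * sum(half_perimeter(net) for net in nets)

# Two hypothetical nets, each a list of terminal (x, y) positions.
nets = [[(0, 0), (4, 3)], [(1, 1), (2, 5), (6, 2)]]
print([half_perimeter(net) for net in nets])   # [7, 9]
print(placement_cost(nets))                    # 8.0
```

Only the extreme x- and y-coordinates matter, so the measure costs one pass over the terminals of each net, which is what makes it attractive inside a placement loop.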
FIGURE 16.22 Correlation between total length of chip interconnect and the half-perimeter and complete-graph measures.
One problem with the measurements we have described is that the MRST may only approximate the interconnect that will be completed by the detailed router. Some programs have a meander factor that specifies, on average, the ratio of the interconnect created by the routing tool to the interconnect-length estimate used by the placement tool. Another problem is that we have concentrated on finding estimates to the MRST, but the MRST that minimizes total net length may not minimize net delay (see Section 16.2.8 ). There is no point in minimizing the interconnect length if we create a placement that is too congested to route. If we use minimum interconnect congestion as an additional placement objective, we need some way of measuring it. What we are trying to measure is interconnect density. Unfortunately we always use the term density to mean channel density (which we shall discuss in Section 17.2.2, Measurement of Channel Density). In this chapter, while we are discussing placement, we shall try to use the term congestion , instead of density, to avoid any confusion. One measure of interconnect congestion uses the maximum cut line . Imagine a
horizontal or vertical line drawn anywhere across a chip or block, as shown in Figure 16.23 . The number of interconnects that must cross this line is the cut size (the number of interconnects we cut). The maximum cut line has the highest cut size.
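A minimal sketch of the cut-size measurement follows. It assumes, as a simplification, that a net must cross a cut line whenever it has terminals strictly on both sides of the line; the function names, the nets, and the candidate line positions are all hypothetical.

```python
def cut_size(nets, c, axis=0):
    """Count nets with terminals strictly on both sides of the line coordinate[axis] = c."""
    count = 0
    for terminals in nets:
        coords = [t[axis] for t in terminals]
        if min(coords) < c < max(coords):
            count += 1
    return count

def maximum_cut_line(nets, candidates, axis=0):
    """Return (position, cut size) for the candidate line with the highest cut size."""
    best = max(candidates, key=lambda c: cut_size(nets, c, axis))
    return best, cut_size(nets, best, axis)

nets = [[(0, 0), (3, 0)], [(1, 1), (2, 2)], [(2, 0), (3, 3)]]   # hypothetical
vertical_lines = [0.5, 1.5, 2.5]                                # x positions between cell columns
print([cut_size(nets, c) for c in vertical_lines])              # [1, 2, 2]
print(maximum_cut_line(nets, vertical_lines))                   # (1.5, 2)
```

Setting axis=1 measures horizontal cut lines in the same way.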
FIGURE 16.23 Interconnect congestion for the cell-based ASIC from Figure 16.11(b). (a) Measurement of congestion. (b) An expanded view of flexible block A shows a maximum cut line.
Many placement tools minimize estimated interconnect length or interconnect congestion as objectives. The problem with this approach is that a logic cell may be placed a long way from another logic cell to which it has just one connection. This logic cell with one connection is less important as far as the total wire length is concerned than other logic cells, to which there are many connections. However, the one long connection may be critical as far as timing delay is concerned. As technology is scaled, interconnection delays become larger relative to circuit delays and this problem gets worse.
In timing-driven placement we must estimate delay for every net for every trial placement, possibly for hundreds of thousands of gates. We cannot afford to use anything other than the very simplest estimates of net delay. Unfortunately, the minimum-length Steiner tree does not necessarily correspond to the interconnect path that minimizes delay. To construct a minimum-delay path we may have to route with non-Steiner trees. In the placement phase typically we take a simple interconnect-length approximation to this minimum-delay path (typically the half-perimeter measure).
Even when we can estimate the length of the interconnect, we do not yet have information on which layers and how many vias the interconnect will use or how wide it will be. Some tools allow us to include estimates for these parameters. Often we can specify metal usage, the percentage of routing on the different layers to expect from the router. This allows the placement tool to estimate RC values and delays, and thus minimize delay.
16.2.4 Placement Algorithms

There are two classes of placement algorithms commonly used in commercial CAD tools: constructive placement and iterative placement improvement. A constructive placement method uses a set of rules to arrive at a constructed placement. The most commonly used methods are variations on the min-cut algorithm. The other commonly used constructive placement algorithm is the eigenvalue method. As in system partitioning, placement usually starts with a constructed solution and then improves it using an iterative algorithm. In most tools we can specify the locations and relative placements of certain critical logic cells as seed placements.
The min-cut placement method uses successive application of partitioning [Breuer, 1977]. The following steps are shown in Figure 16.24:
1. Cut the placement area into two pieces.
2. Swap the logic cells to minimize the cut cost.
3. Repeat the process from step 1, cutting smaller pieces until all the logic cells are placed.
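The three min-cut steps can be sketched as a recursive bisection. The Python below is a simplified illustration under stated assumptions, not the algorithm of any particular tool: cells are placed in a single row, the cut cost is the number of nets with cells on both sides of the cut, the swap step is a greedy pairwise exchange, and connections that leave a piece are simply ignored when that piece is cut again. All names are hypothetical.

```python
def cut_cost(left, right, nets):
    """Number of nets with at least one cell on each side of the cut."""
    l, r = set(left), set(right)
    return sum(1 for net in nets if net & l and net & r)

def improve_cut(left, right, nets):
    """Step 2: greedily swap cells across the cut while a swap lowers the cut cost."""
    left, right = list(left), list(right)
    improved = True
    while improved:
        improved = False
        best = cut_cost(left, right, nets)
        for i in range(len(left)):
            for j in range(len(right)):
                left[i], right[j] = right[j], left[i]          # trial swap
                cost = cut_cost(left, right, nets)
                if cost < best:
                    best, improved = cost, True                # keep the swap
                else:
                    left[i], right[j] = right[j], left[i]      # undo the swap
    return left, right

def min_cut_place(cells, nets):
    """Steps 1 and 3: cut the piece in two, improve the cut, then recurse on each half."""
    if len(cells) <= 1:
        return list(cells)
    half = len(cells) // 2
    left, right = improve_cut(cells[:half], cells[half:], nets)
    return min_cut_place(left, nets) + min_cut_place(right, nets)

cells = ["a", "b", "c", "d"]              # hypothetical logic cells
nets = [{"a", "c"}, {"b", "d"}]           # each net is the set of cells it connects
print(min_cut_place(cells, nets))         # ['d', 'b', 'c', 'a']
```

In the result, each pair of connected cells ends up in adjacent positions. A production placer would use a stronger partitioning step (the group-migration methods of Chapter 15, for example) in place of the greedy swap.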
FIGURE 16.24 Min-cut placement. (a) Divide the chip into bins using a grid. (b) Merge all connections to the center of each bin. (c) Make a cut and swap logic cells between bins to minimize the cost of the cut. (d) Take the cut pieces and throw out all the edges that are not inside the piece. (e) Repeat the process with a new cut and continue until we reach the individual bins.
Usually we divide the placement area into bins. The size of a bin can vary, from a bin size equal to the base cell (for a gate array) to a bin size that would hold several logic cells. We can start with a large bin size, to get a rough placement, and then reduce the bin size to get a final placement.
The eigenvalue placement algorithm uses the cost matrix or weighted connectivity matrix (eigenvalue methods are also known as spectral methods) [Hall, 1970]. The measure we use is a cost function f that we shall minimize, given by
f = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} c_{ij} d_{ij}^2 , (16.6)
where C = [c_ij] is the (possibly weighted) connectivity matrix, and d_ij is the Euclidean distance between the centers of logic cell i and logic cell j. Since we are going to minimize a cost function that is the square of the distance between logic cells, these methods are also known as quadratic placement methods. This type of cost function leads to a simple mathematical solution. We can rewrite the cost function f in matrix form:
f = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} c_{ij} \left[ (x_i - x_j)^2 + (y_i - y_j)^2 \right] = x^T B x + y^T B y . (16.7)
In Eq. 16.7, B is a symmetric matrix, the disconnection matrix (also called the Laplacian). We may express the Laplacian B in terms of the connectivity matrix C and a diagonal matrix D (known as the degree matrix), defined as follows:
B = D - C ; \quad d_{ii} = \sum_{j=1}^{n} c_{ij} , \; i = 1, \ldots, n ; \quad d_{ij} = 0 , \; i \neq j . (16.8)
We can simplify the problem by noticing that it is symmetric in the x- and y-coordinates. Let us solve the simpler problem of minimizing the cost function for the placement of logic cells along just the x-axis first. We can then apply this solution to the more general two-dimensional placement problem. Before we solve this simpler problem, we introduce a constraint that the coordinates of the logic cells must correspond to valid positions (the cells do not overlap and they are placed on-grid). We make another simplifying assumption that all logic cells are the same size and we must place them in fixed positions. We can define a vector p consisting of the valid positions:
p = [ p_1 , \ldots , p_n ] . (16.9)
For a valid placement the x-coordinates of the logic cells,
x = [ x_1 , \ldots , x_n ] , (16.10)
must be a permutation of the fixed positions, p. We can show that requiring the logic cells to be in fixed positions in this way leads to a series of n equations restricting the values of the logic cell coordinates [Cheng and Kuh, 1984]. If we impose all of these constraint equations the problem becomes very complex. Instead we choose just one of the equations:
\sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} p_i^2 . (16.11)
Simplifying the problem in this way will lead to an approximate solution to the placement problem. We can write this single constraint on the x-coordinates in matrix form:
x^T x = P ; \quad P = \sum_{i=1}^{n} p_i^2 , (16.12)
where P is a constant. We can now summarize the formulation of the problem, with the simplifications that we have made, for a one-dimensional solution. We must minimize a cost function, g (analogous to the cost function f that we defined for the two-dimensional problem in Eq. 16.7), where
g = x^T B x , (16.13)
subject to the constraint:
x^T x = P . (16.14)
This is a standard problem that we can solve using a Lagrangian multiplier:
\Lambda = x^T B x - \lambda [ x^T x - P ] . (16.15)
To find the value of x that minimizes g we differentiate Λ partially with respect to x and set the result equal to zero. We get the following equation:
[ B - \lambda I ] x = 0 . (16.16)
This last equation is called the characteristic equation for the disconnection matrix B and occurs frequently in matrix algebra (this λ has nothing to do with scaling). The solutions to this equation are the eigenvectors and eigenvalues of B. Multiplying Eq. 16.16 by x^T we get:
\lambda x^T x = x^T B x . (16.17)
However, since we imposed the constraint x^T x = P and x^T B x = g, then
\lambda = g / P . (16.18)
The eigenvectors of the disconnection matrix B are the solutions to our placement problem. It turns out that (because something called the rank of matrix B is n - 1) there is a degenerate solution with all x-coordinates equal (λ = 0); this makes some sense because putting all the logic cells on top of one another certainly minimizes the interconnect. The smallest nonzero eigenvalue and the corresponding eigenvector provide the solution that we want. In the two-dimensional placement problem, the x- and y-coordinates are given by the eigenvectors corresponding to the two smallest nonzero eigenvalues. (In the next section a simple example illustrates this mathematical derivation.)
16.2.5 Eigenvalue Placement Example

Consider the following connectivity matrix C and its disconnection matrix B, calculated from Eq. 16.8 [Hall, 1970]:
C = \begin{bmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \end{bmatrix} ;
B = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 2 \end{bmatrix} - \begin{bmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & -1 \\ 0 & 2 & -1 & -1 \\ 0 & -1 & 1 & 0 \\ -1 & -1 & 0 & 2 \end{bmatrix} . (16.19)
Figure 16.25(a) shows the corresponding network with four logic cells (1-4) and three nets (A-C). Here is a MatLab script to find the eigenvalues and eigenvectors of B:
C=[0 0 0 1; 0 0 1 1; 0 1 0 0; 1 1 0 0]   % connectivity matrix
D=[1 0 0 0; 0 2 0 0; 0 0 1 0; 0 0 0 2]   % degree matrix (row sums of C)
B=D-C                                     % disconnection matrix (Laplacian)
[X,D] = eig(B)                            % columns of X are the eigenvectors; D now holds the eigenvalues
Running this script, we find the eigenvalues of B are 0.5858, 0.0, 2.0, and 3.4142. The corresponding eigenvectors of B are the columns of the following matrix (each eigenvector is determined only up to an overall sign):
\begin{bmatrix} 0.6533 & 0.5000 & 0.5000 & 0.2706 \\ -0.2706 & 0.5000 & -0.5000 & 0.6533 \\ -0.6533 & 0.5000 & 0.5000 & -0.2706 \\ 0.2706 & 0.5000 & -0.5000 & -0.6533 \end{bmatrix} (16.20)
FIGURE 16.25 Eigenvalue placement. (a) An example network. (b) The one-dimensional placement. The small black squares represent the centers of the logic cells. (c) The two-dimensional placement. The eigenvalue method takes no account of the logic cell sizes or actual location of logic cell connectors. (d) A complete layout. We snap the logic cells to valid locations, leaving room for the routing in the channel.
For a one-dimensional placement (Figure 16.25b), we use the eigenvector (0.6533, -0.2706, -0.6533, 0.2706) corresponding to the smallest nonzero eigenvalue (which is 0.5858) to place the logic cells along the x-axis. The two-dimensional placement (Figure 16.25c) uses these same values for the x-coordinates and the eigenvector (0.5, -0.5, 0.5, -0.5) that corresponds to the next largest eigenvalue (which is 2.0) for the y-coordinates. Notice that the placement shown in Figure 16.25(c), which shows logic-cell outlines (the logic-cell abutment boxes), takes no account of the cell sizes, and cells may even overlap at this stage. This is because, in Eq. 16.11, we discarded all but one of the constraints necessary to ensure valid solutions. Often we use the approximate eigenvalue solution as an initial placement for one of the iterative improvement algorithms that we shall discuss in Section 16.2.6.
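As a cross-check on this example, the short Python sketch below rebuilds B from C using Eq. 16.8 and confirms that the two eigenvectors used for placement satisfy the characteristic equation Bx = λx. The signs of the eigenvector components are chosen so that this equation holds (flipping every sign of an eigenvector gives an equally valid solution and simply mirrors the placement). The helper names are hypothetical.

```python
C = [[0, 0, 0, 1],
     [0, 0, 1, 1],
     [0, 1, 0, 0],
     [1, 1, 0, 0]]
n = len(C)

# Eq. 16.8: B = D - C, where d_ii is the sum of row i of C and d_ij = 0 otherwise.
B = [[(sum(C[i]) if i == j else 0) - C[i][j] for j in range(n)] for i in range(n)]

def matvec(m, v):
    """Matrix-vector product for small lists-of-lists matrices."""
    return [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def is_eigenpair(lam, x, tol=1e-3):
    """True if B x equals lam * x to within tol (the text quotes four decimal places)."""
    return all(abs(bx - lam * xi) < tol for bx, xi in zip(matvec(B, x), x))

x = [0.6533, -0.2706, -0.6533, 0.2706]   # eigenvector for lambda = 0.5858
y = [0.5, -0.5, 0.5, -0.5]               # eigenvector for lambda = 2.0

print(B)                                 # [[1, 0, 0, -1], [0, 2, -1, -1], [0, -1, 1, 0], [-1, -1, 0, 2]]
print(is_eigenpair(0.5858, x), is_eigenpair(2.0, y))   # True True

# Eq. 16.18: lambda = g / P, with g = x^T B x and P = x^T x.
print(round(dot(x, matvec(B, x)) / dot(x, x), 3))      # 0.586

# One-dimensional placement: order the logic cells (numbered 1 to 4) by x-coordinate.
print(sorted(range(1, n + 1), key=lambda i: x[i - 1])) # [3, 2, 4, 1]
```

The left-to-right order 3, 2, 4, 1 follows the chain of connections in C (cell 3 to cell 2, cell 2 to cell 4, cell 4 to cell 1), which is what we would hope for from a placement that minimizes interconnect length.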
16.2.6 Iterative Placement Improvement

An iterative placement improvement algorithm takes an existing placement and tries to improve it by moving the logic cells. There are two parts to the algorithm:
• The selection criteria that decide which logic cells to try moving.
• The measurement criteria that decide whether to move the selected cells.
There are several interchange or iterative exchange methods that differ in their selection and measurement criteria:
• pairwise interchange,
• force-directed interchange,
• force-directed relaxation, and
• force-directed pairwise relaxation.
All of these methods usually consider only pairs of logic cells to be exchanged. A source logic cell is picked for trial exchange with a destination logic cell. We have already discussed the use of interchange methods applied to the system partitioning step. The most widely used methods use group migration, especially the Kernighan-Lin algorithm. The pairwise-interchange algorithm is similar to the interchange algorithm used for iterative improvement in the system partitioning step:
1. Select the source logic cell at random.
2. Try all the other logic cells in turn as the destination logic cell.
3. Use any of the measurement methods we have discussed to decide on whether to accept the interchange.
4. The process repeats from step 1, selecting each logic cell in turn as a source logic cell.
Figure 16.26(a) and (b) show how we can extend pairwise interchange to swap more than two logic cells at a time. If we swap λ logic cells at a time and find a locally optimum solution, we say that solution is λ-optimum. The neighborhood exchange algorithm is a modification to pairwise interchange that considers only destination logic cells in a neighborhood: cells within a certain distance, e, of the source logic cell. Limiting the search area for the destination logic cell to the e-neighborhood reduces the search time. Figure 16.26(c) and (d) show the one- and two-neighborhoods (based on Manhattan distance) for a logic cell.
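The pairwise-interchange steps above can be sketched as follows. This is an illustration under stated assumptions: the measurement is the total half-perimeter wire length, a swap is accepted only if it shortens the wiring, and source cells are taken in a fixed order rather than at random so that the example is repeatable. The function names and the small network are hypothetical.

```python
def half_perimeter(points):
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def wire_length(placement, nets):
    """Total half-perimeter length; placement maps each cell name to an (x, y) position."""
    return sum(half_perimeter([placement[c] for c in net]) for net in nets)

def pairwise_interchange(placement, nets):
    """Take each cell in turn as the source and try every other cell as the destination."""
    placement = dict(placement)
    cells = list(placement)
    best = wire_length(placement, nets)
    improved = True
    while improved:                       # repeat until a full pass finds no better swap
        improved = False
        for src in cells:
            for dst in cells:
                if src == dst:
                    continue
                placement[src], placement[dst] = placement[dst], placement[src]
                cost = wire_length(placement, nets)
                if cost < best:
                    best, improved = cost, True      # accept the interchange
                else:
                    placement[src], placement[dst] = placement[dst], placement[src]
    return placement, best

start = {"a": (0, 0), "b": (1, 0), "c": (2, 0), "d": (3, 0)}   # one row of four cells
nets = [("a", "d"), ("b", "c")]
print(wire_length(start, nets))                    # 4
final, length = pairwise_interchange(start, nets)
print(length)                                      # 2
```

Restricting the inner loop to destination cells within a given Manhattan distance of the source would turn this sketch into the neighborhood exchange variant.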
FIGURE 16.26 Interchange. (a) Swapping the source logic cell with a destination logic cell in pairwise interchange. (b) Sometimes we have to swap more than two logic cells at a time to reach an optimum placement, but this is expensive in computation time. Limiting the search to neighborhoods reduces the search time. Logic cells within a distance e of a logic cell form an e-neighborhood. (c) A one-neighborhood. (d) A two-neighborhood.
Neighborhoods are also used in some of the force-directed placement methods. Imagine identical springs connecting all the logic cells we wish to place. The number of springs is equal to the number of connections between logic cells. The effect of the springs is to pull connected logic cells together. The more highly connected the logic cells, the stronger the pull of the springs. The force on a logic cell i due to logic cell j is given by Hooke's law, which says the force of a spring is proportional to its extension:
F_{ij} = c_{ij} x_{ij} . (16.21)
The vector component x_ij is directed from the center of logic cell i to the center of logic cell j. The vector magnitude is calculated as either the Euclidean or Manhattan distance between the logic cell centers. The c_ij form the connectivity or cost matrix (the matrix element c_ij is the number of connections between logic cell i and logic cell j). If we want, we can also weight the c_ij to denote critical connections. Figure 16.27 illustrates the force-directed placement algorithm.
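A minimal sketch of the spring model of Eq. 16.21 follows. It sums the forces on one logic cell from all the cells connected to it, working with x- and y-components. It also computes the position at which that total force would be zero (the connection-weighted average of the neighbors' positions), which is one common way to pick a target location; that second function is an assumption of this sketch rather than something defined in the text. All names and numbers are hypothetical.

```python
def force_on(i, pos, c):
    """Total force on cell i: the sum over j of c[i][j] times the vector from cell i to cell j."""
    fx = sum(c[i][j] * (pos[j][0] - pos[i][0]) for j in range(len(pos)))
    fy = sum(c[i][j] * (pos[j][1] - pos[i][1]) for j in range(len(pos)))
    return fx, fy

def zero_force_target(i, pos, c):
    """Position for cell i at which the spring forces from its neighbors cancel."""
    weight = sum(c[i][j] for j in range(len(pos)) if j != i)
    tx = sum(c[i][j] * pos[j][0] for j in range(len(pos)) if j != i) / weight
    ty = sum(c[i][j] * pos[j][1] for j in range(len(pos)) if j != i) / weight
    return tx, ty

pos = [(0, 0), (4, 0), (0, 2)]           # centers of three logic cells
c = [[0, 1, 2],                          # cell 0 has one connection to cell 1, two to cell 2
     [1, 0, 0],
     [2, 0, 0]]
print(force_on(0, pos, c))               # (4, 4)
print(zero_force_target(0, pos, c))      # about (1.33, 1.33)
```

The two connections to cell 2 pull twice as hard as the single connection to cell 1, so the zero-force position lies closer to cell 2 in proportion.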
FIGURE 16.27 Force-directed placement. (a) A network with nine logic cells. (b) We make a grid (one logic cell per bin). (c) Forces are calculated as if springs were attached to the centers of each logic cell for each connection. The two nets connecting logic cells A and I correspond to two springs. (d) The forces are proportional to the spring extensions. In the definition of connectivity (Section 15.7.1, Measuring Connectivity) it was pointed out that the network graph does not accurately model connections for nets with more than two terminals. Nets such as clock nets, power nets, and global reset lines have a huge number of terminals. The force-directed placement algorithms usually make special allowances for these situations to prevent the largest nets from snapping all the logic cells together. In fact, without external forces to counteract the pull of the springs between logic cells, the network will collapse to a single point as it settles. An important part of force-directed placement is fixing some of the logic cells in position. Normally ASIC designers use the I/O pads or other external connections to act as anchor points or fixed seeds. Figure 16.28 illustrates the different kinds of force-directed placement algorithms. The force-directed interchange algorithm uses the force vector to select a pair of logic cells to swap. In force-directed relaxation a chain of logic cells is moved. The force-directed pairwise relaxation algorithm swaps one pair of logic cells at a time.
FIGURE 16.28 Force-directed iterative placement improvement. (a) Force-directed interchange. (b) Force-directed relaxation. (c) Force-directed pairwise relaxation.
We reach a force-directed solution when we minimize the energy of the system, corresponding to minimizing the sum of the squares of the distances separating logic cells. Force-directed placement algorithms thus also use a quadratic cost function.
16.2.7 Placement Using Simulated Annealing

The principles of simulated annealing were explained in Section 15.7.8, Simulated Annealing. Because simulated annealing requires so many iterations, it is critical that the placement objectives be easy and fast to calculate. The optimum connection pattern, the MRST, is difficult to calculate. Using the half-perimeter measure (Section 16.2.3) corresponds to minimizing the total interconnect length. Applying simulated annealing to placement, the algorithm is as follows:
1. Select logic cells for a trial interchange, usually at random.
2. Evaluate the objective function E for the new placement.
3. If ΔE (the change in E caused by the interchange) is negative or zero, then exchange the logic cells. If ΔE is positive, then exchange the logic cells with a probability of exp(-ΔE/T).
4. Go back to step 1 for a fixed number of times, and then lower the temperature T according to a cooling schedule: T_(n+1) = 0.9 T_n, for example.
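The four steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not a production placer: the objective E is the total half-perimeter wire length, the starting temperature, the number of moves per temperature, and the stopping temperature are arbitrary values, a fixed random seed is used so that runs are repeatable, and the best placement seen is remembered and returned. All names are hypothetical.

```python
import math
import random

def half_perimeter(points):
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def wire_length(placement, nets):
    """Objective E: total half-perimeter length over all nets."""
    return sum(half_perimeter([placement[c] for c in net]) for net in nets)

def anneal(placement, nets, t=10.0, cooling=0.9, moves_per_t=50, t_min=0.01, seed=0):
    rng = random.Random(seed)
    current = dict(placement)
    cells = list(current)
    cost = wire_length(current, nets)
    best, best_cost = dict(current), cost
    while t > t_min:
        for _ in range(moves_per_t):
            a, b = rng.sample(cells, 2)                       # step 1: pick a trial pair
            current[a], current[b] = current[b], current[a]
            new_cost = wire_length(current, nets)             # step 2: evaluate E
            delta = new_cost - cost
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                cost = new_cost                               # step 3: accept the exchange
                if cost < best_cost:
                    best, best_cost = dict(current), cost
            else:
                current[a], current[b] = current[b], current[a]   # reject: undo the swap
        t *= cooling                                          # step 4: T(n+1) = 0.9 T(n)
    return best, best_cost

start = {"a": (0, 0), "b": (1, 0), "c": (2, 0), "d": (0, 1), "e": (1, 1), "f": (2, 1)}
nets = [("a", "f"), ("b", "d"), ("c", "e"), ("a", "c")]
print(wire_length(start, nets))
placed, length = anneal(start, nets)
print(length)
```

Because uphill moves are sometimes accepted, the current placement can get worse before it gets better; keeping a copy of the best placement seen guarantees that the returned result is never worse than the starting point.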
Kirkpatrick, Gelatt, and Vecchi first described the use of simulated annealing applied to VLSI problems [1983]. Experience since that time has shown that simulated annealing normally requires the use of a slow cooling schedule and this means long CPU run times [Sechen, 1988; Wong, Leong, and Liu, 1988]. As a general rule, experiments show that simple min-cut based constructive placement is faster than simulated annealing but that simulated annealing is capable of giving better results at the expense of long computer run times. The iterative improvement methods that we described earlier are capable of giving results as good as simulated annealing, but they use more complex algorithms.
While I am making wild generalizations, I will digress to discuss benchmarks of placement algorithms (or any CAD algorithm that is random). It is important to remember that the results of random methods are themselves random. Suppose the results from two random algorithms, A and B, can each vary by 10 percent for any chip placement, but both algorithms have the same average performance. If we compare single chip placements by both algorithms, they could falsely show algorithm A to be better than B by up to 20 percent or vice versa. Put another way, if we run enough test cases we will eventually find some for which A is better than B by 20 percent, a trick that Ph.D. students and marketing managers both know well. Even single-run evaluations over multiple chips are hardly a fair comparison. The only way to obtain meaningful results is to compare a statistically meaningful number of runs for a statistically meaningful number of chips for each algorithm. This same caution applies to any VLSI algorithm that is random. There was a Design Automation Conference panel session whose theme was "Enough of algorithms claiming improvements of 5 %."
16.2.8 Timing-Driven Placement Methods

Minimizing delay is becoming more and more important as a placement objective. There are two main approaches: net based and path based. We know that we can use net weights in our algorithms. The problem is to calculate the weights. One method finds the n most critical paths (using a timing-analysis engine, possibly in the synthesis tool). The net weights might then be the number of times each net appears in this list. The problem with this approach is that as soon as we fix (for example) the first 100 critical nets, suddenly another 200 become critical. This is rather like trying to put worms in a can: as soon as we open the lid to put one in, two more pop out.
Another method to find the net weights uses the zero-slack algorithm [Hauge et al., 1987]. Figure 16.29 shows how this works (all times are in nanoseconds). Figure 16.29(a) shows a circuit with primary inputs at which we know the arrival times (this is the original definition, some people use the term actual times) of each signal. We also know the required times for the primary outputs, the points in time at which we want the signals to be valid. We can work forward from the primary inputs and backward from the primary outputs to determine arrival and required times at each input pin for each net. The difference between the required and arrival times at each input pin is the slack time (the time we have to spare). The zero-slack algorithm adds delay to each net until the slacks are zero, as shown in Figure 16.29(b). The net delays can then be converted to weights or constraints in the placement. Notice that we have assumed that all the gates on a net switch at the same time so that the net delay can be placed at the output of the gate driving the net, a rather poor timing model but the best we can use without any routing information.
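The forward and backward passes that produce the slack times can be sketched as follows. This Python fragment covers only the slack calculation, not the full zero-slack algorithm, and it makes simplifying assumptions: each signal is named after the gate (or primary input) that drives it, every pin on a net sees the same time, and the gates are supplied in topological order. The circuit, the delays, and all names are hypothetical and are not taken from Figure 16.29.

```python
def slack_times(gates, order, input_arrival, output_required):
    """gates maps a gate name to (delay, list of fan-in signals); order is topological."""
    arrival = dict(input_arrival)
    for g in order:                                   # forward pass from the primary inputs
        delay, fanin = gates[g]
        arrival[g] = max(arrival[f] for f in fanin) + delay
    required = dict(output_required)
    for g in reversed(order):                         # backward pass from the primary outputs
        delay, fanin = gates[g]
        for f in fanin:
            required[f] = min(required.get(f, float("inf")), required[g] - delay)
    return {s: required[s] - arrival[s] for s in arrival if s in required}

# A hypothetical three-gate circuit; all times in nanoseconds.
gates = {"g1": (2, ["a", "b"]), "g2": (1, ["g1"]), "g3": (3, ["g1", "b"])}
order = ["g1", "g2", "g3"]
slack = slack_times(gates, order, {"a": 0, "b": 1}, {"g2": 6, "g3": 7})
print(slack)   # {'a': 2, 'b': 1, 'g1': 1, 'g2': 2, 'g3': 1}
```

In this example the path from b through g1 and g3 has a slack of 1 ns at every point, so it is the path with the least time to spare; a zero-slack style of allocation would then hand out that spare nanosecond as allowed net delay along the path.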
FIGURE 16.29 The zero-slack algorithm. (a) The circuit with no net delays. (b) The zero-slack algorithm adds net delays (at the outputs of each gate, equivalent to increasing the gate delay) to reduce the slack times to zero.
An important point to remember is that adjusting the net weight, even for every net on a chip, does not theoretically make the placement algorithms any more complex; we have to deal with the numbers anyway. It does not matter whether the net weight is 1 or 6.6, for example. The practical problem, however, is getting the weight information for each net (usually in the form of timing constraints) from a synthesis tool or timing verifier. These files can easily be hundreds of megabytes in size (see Section 16.4).
With the zero-slack algorithm we simplify but overconstrain the problem. For example, we might be able to do a better job by making some nets a little longer than the slack indicates if we can tighten up other nets. What we would really like to do is deal with paths such as the critical path shown in Figure 16.29 (a) and not just nets . Path-based algorithms have been proposed to do this, but they are complex and not all commercial tools have this capability (see, for example, [ Youssef, Lin, and Shragowitz, 1992]). There is still the question of how to predict path delays between gates with only placement information. Usually we still do not compute a routing tree but use simple approximations to the total net length (such as the half-perimeter measure) and then use this to estimate a net delay (the same to each pin on a net). It is not until the routing step that we can make accurate estimates of the actual interconnect delays.
16.2.9 A Simple Placement Example

Figure 16.30 shows an example network and placements to illustrate the measures for interconnect length and interconnect congestion. Figure 16.30 (b) and (c) illustrate the meaning of total routing length, the maximum cut line in the x -direction, the maximum cut line in the y -direction, and the maximum density. In this example we have assumed that the logic cells are all the same size, connections can be made to terminals on any side, and the routing channels between each adjacent logic cell have a capacity of 2. Figure 16.30 (d) shows what the completed layout might look like.
FIGURE 16.30 Placement example. (a) An example network. (b) In this placement, the bin size is equal to the logic cell size and all the logic cells are assumed equal size. (c) An alternative placement with a lower total routing length. (d) A layout that might result from the placement shown in b. The channel densities correspond to the cut-line sizes. Notice that the logic cells are not all the same size (which means there are errors in the interconnect-length estimates we made during placement).
16.3 Physical Design Flow

Historically placement was included with routing as a single tool (the term P&R is often used for place and route). Because interconnect delay now dominates gate delay, the trend is to include placement within a floorplanning tool and use a separate router. Figure 16.31 shows a design flow using synthesis and a floorplanning tool that includes placement. This flow consists of the following steps:
1. Design entry. The input is a logical description with no physical information.
2. Synthesis. The initial synthesis contains little or no information on any interconnect loading. The output of the synthesis tool (typically an EDIF netlist) is the input to the floorplanner.
3. Initial floorplan. From the initial floorplan interblock capacitances are input to the synthesis tool as load constraints and intrablock capacitances are input as wire-load tables.
4. Synthesis with load constraints. At this point the synthesis tool is able to resynthesize the logic based on estimates of the interconnect capacitance each gate is driving. The synthesis tool produces a forward annotation file to constrain path delays in the placement step.
5. Timing-driven placement. After placement using constraints from the synthesis tool, the location of every logic cell on the chip is fixed and accurate estimates of interconnect delay can be passed back to the synthesis tool.
6. Synthesis with in-place optimization (IPO). The synthesis tool changes the drive strength of gates based on the accurate interconnect delay estimates from the floorplanner without altering the netlist structure.
7. Detailed placement. The placement information is ready to be input to the routing step.
FIGURE 16.31 Timing-driven floorplanning and placement design flow. Compare with Figure 15.1 on p. 806.
In Figure 16.31 we iterate between floorplanning and synthesis, continuously improving our estimate for the interconnect delay as we do so.
16.4 Information Formats

With the increasing importance of interconnect a great deal of information needs to flow between design tools. There are some de facto standards that we shall look at next. Some of the companies involved are working toward releasing these formats as IEEE standards.
16.4.1 SDF for Floorplanning and Placement

In Section 13.5.6, SDF in Simulation, we discussed the structure and use of the standard delay format (SDF) to describe gate delay and interconnect delay. We may also use SDF with floorplanning and synthesis tools to back-annotate an interconnect delay. A synthesis tool can use this information to improve the logic structure. Here is a fragment of SDF:
(INSTANCE B)
(DELAY (ABSOLUTE
  (INTERCONNECT A.INV8.OUT B.DFF1.Q (:0.6:) (:0.6:))))
In this example the rising and falling delay is 60 ps (equal to 0.6 units multiplied by the time scale of 100 ps per unit specified in a TIMESCALE construct that is not shown). The delay is specified between the output port of an inverter with instance name A.INV8 in block A and the Q input port of a D flip-flop (instance name B.DFF1) in block B. A '.' (period or fullstop) is set to be the hierarchy divider in another construct that is not shown.
There is another way of specifying interconnect delay using NETDELAY (a short form of the INTERCONNECT construct) as follows:
(TIMESCALE 100ps)
(INSTANCE B)
(DELAY (ABSOLUTE
  (NETDELAY net1 (0.6))))
In this case all delays from an output port to, possibly multiple, input ports have the same value (we can also specify the output port name instead of the net name to identify the net). Alternatively we can lump interconnect delay at an input port:
(TIMESCALE 100ps)
(INSTANCE B.DFF1)
(DELAY (ABSOLUTE
  (PORT CLR (16:18:22) (17:20:25))))
This PORT construct specifies an interconnect delay placed at the input port of a logic cell (in this case the CLR pin of a flip-flop). We do not need to specify the start of a path (as we do for INTERCONNECT).
We can also use SDF to forward-annotate path delays using timing constraints (there may be hundreds or thousands of these in a file). A synthesis tool can pass this information to the floorplanning and placement steps to allow them to create better layout. SDF describes timing checks using a range of TIMINGCHECK constructs. Here is an example of a single path constraint:
(TIMESCALE 100ps)
(INSTANCE B)
(TIMINGCHECK
  (PATHCONSTRAINT A.AOI22_1.O B.ND02_34.O (0.8) (0.8)))
This describes a constraint (keyword PATHCONSTRAINT) for the rising and falling delays between two ports at each end of a path (which may consist of several nets) to be less than 80 ps. Using the SUM construct we can constrain the sum of path delays to be less than a specific value as follows:
(TIMESCALE 100ps)
(INSTANCE B)
(TIMINGCHECK
  (SUM (AOI22_1.O ND02_34.I1) (ND02_34.O ND02_35.I1) (0.8)))
We can also constrain skew between two paths (in this case to be less than 10 ps) using the DIFF construct:
(TIMESCALE 100ps)
(INSTANCE B)
(TIMINGCHECK
  (DIFF (A.I_1.O B.ND02_1.I1) (A.I_1.O.O B.ND02_2.I1) (0.1)))
In addition we can constrain the skew between a reference signal (normally the clock) and all other ports in an instance (again in this case to be less than 10 ps) using the SKEWCONSTRAINT construct:
(TIMESCALE 100ps)
(INSTANCE B)
(TIMINGCHECK
  (SKEWCONSTRAINT (posedge clk) (0.1)))
At present there is no easy way in SDF to constrain the skew between a reference signal and other signals to be greater than a specified amount.
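The arithmetic that turns the numbers in these fragments into delays (value multiplied by the time scale) can be sketched in Python. This helper is hypothetical and is not part of any SDF tool: it assumes that a value is written either as a single number or as a minimum:typical:maximum triple in which entries may be left empty, and it takes the 100 ps time scale of the examples as its default.

```python
def sdf_value_to_ps(value, timescale_ps=100.0):
    """Scale an SDF delay value such as '(:0.6:)' or '(16:18:22)' to picoseconds.

    Returns a (min, typ, max) tuple; entries left empty in the triple become None.
    """
    fields = value.strip().strip("()").split(":")
    if len(fields) == 1:
        fields = fields * 3          # a single number is taken to apply to all three
    return tuple(float(f) * timescale_ps if f.strip() else None for f in fields)

print(sdf_value_to_ps("(:0.6:)"))      # only the typical value is given: 0.6 x 100 ps = 60 ps
print(sdf_value_to_ps("(16:18:22)"))   # minimum, typical, and maximum, each scaled by 100 ps
print(sdf_value_to_ps("(0.1)"))        # the 10 ps skew limit used in the DIFF example
```

The same scaling explains the limits quoted for the constraints: 0.8 units is 80 ps and 0.1 units is 10 ps at this time scale.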
16.4.2 PDEF
The physical design exchange format ( PDEF ) is a proprietary file format used by Synopsys to describe placement information and the clustering of logic cells. Here is a simple, but complete, PDEF file:

    (CLUSTERFILE
      (PDEFVERSION "1.0")
      (DESIGN "myDesign")
      (DATE "THU AUG 6 12:00 1995")
      (VENDOR "ASICS_R_US")
      (PROGRAM "PDEF_GEN")
      (VERSION "V2.2")
      (DIVIDER .)
      (CLUSTER (NAME "ROOT")
        (WIRE_LOAD "10mm x 10mm")
        (UTILIZATION 50.0)
        (MAX_UTILIZATION 60.0)
        (X_BOUNDS 100 1000)
        (Y_BOUNDS 100 1000)
        (CLUSTER (NAME "LEAF_1")
          (WIRE_LOAD "50k gates")
          (UTILIZATION 50.0)
          (MAX_UTILIZATION 60.0)
          (X_BOUNDS 100 500)
          (Y_BOUNDS 100 200)
          (CELL (NAME L1.RAM01))
          (CELL (NAME L1.ALU01))
        )
      )
    )

This file describes two clusters:
• ROOT , which is the top level (the whole chip). The file describes the size ( x - and y -bounds), current and maximum area utilization (i.e., leaving space for interconnect), and the name of the wire-load table, ' 10mm x 10mm ', to use for this block, chosen because the chip is expected to be about 10 mm on a side.
• LEAF_1 , a block below the top level in the hierarchy. This block is to use predicted capacitances from a wire-load table named '50k gates' (chosen because we know there are roughly 50 k-gates in this block). The LEAF_1 block contains two logic cells: L1.RAM01 and L1.ALU01 .
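Because a PDEF file is a set of nested parenthesized forms, a few lines of code are enough to recover the cluster hierarchy. The Python sketch below is only an illustration: the helper names (`tokenize`, `parse`, `field`, `walk_clusters`) are hypothetical, the header fields such as DATE and VENDOR are left out for brevity, and it assumes the parentheses in the file are balanced.

```python
import re

PDEF_TEXT = '''
(CLUSTERFILE
  (PDEFVERSION "1.0")
  (DESIGN "myDesign")
  (DIVIDER .)
  (CLUSTER (NAME "ROOT")
    (WIRE_LOAD "10mm x 10mm")
    (UTILIZATION 50.0)
    (MAX_UTILIZATION 60.0)
    (X_BOUNDS 100 1000)
    (Y_BOUNDS 100 1000)
    (CLUSTER (NAME "LEAF_1")
      (WIRE_LOAD "50k gates")
      (UTILIZATION 50.0)
      (MAX_UTILIZATION 60.0)
      (X_BOUNDS 100 500)
      (Y_BOUNDS 100 200)
      (CELL (NAME L1.RAM01))
      (CELL (NAME L1.ALU01)))))
'''


def tokenize(text):
    # Parentheses, double-quoted strings, or runs of other non-space characters.
    return re.findall(r'\(|\)|"[^"]*"|[^\s()"]+', text)


def parse(tokens):
    """Parse one form from the front of the token list into nested lists."""
    token = tokens.pop(0)
    if token != "(":
        return token.strip('"')
    form = []
    while tokens[0] != ")":
        form.append(parse(tokens))
    tokens.pop(0)  # discard the closing parenthesis
    return form


def field(form, key):
    """Return the arguments of the first (key ...) sub-form, or None."""
    for item in form[1:]:
        if isinstance(item, list) and item[0] == key:
            return item[1:]
    return None


def walk_clusters(form, depth=0):
    """Yield (depth, name, wire_load, cell_names) for every CLUSTER."""
    for item in form[1:]:
        if isinstance(item, list) and item[0] == "CLUSTER":
            cells = [field(c, "NAME")[0] for c in item[1:]
                     if isinstance(c, list) and c[0] == "CELL"]
            yield depth, field(item, "NAME")[0], field(item, "WIRE_LOAD")[0], cells
            yield from walk_clusters(item, depth + 1)


clusters = list(walk_clusters(parse(tokenize(PDEF_TEXT))))
for depth, name, wire_load, cells in clusters:
    print("  " * depth + name, "| wire load:", wire_load, "| cells:", cells)
```

The nesting depth returned for each cluster mirrors the hierarchy described above: ROOT at the top level and LEAF_1, with its two logic cells, one level below.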
16.4.3 LEF and DEF

The library exchange format ( LEF ) and design exchange format ( DEF ) are both proprietary formats originated by Tangent in the TanCell and TanGate place-and-route tools, which were bought by Cadence and are now known as Cell3 Ensemble and Gate Ensemble respectively. These tools, and their derivatives, are so widely used that these formats have become a de facto standard. LEF is used to define an IC process and a logic cell library. For example, you would use LEF to describe a gate array: the base cells, the legal sites for base cells, the logic macros with their size and connectivity information, the interconnect layers, and other information to set up the database that the physical design tools need. You would use DEF to describe all the physical aspects of a particular chip design, including the netlist and the physical location of cells on the chip. For example, if you had a complete placement from a floorplanning tool and wanted to exchange this information with Cadence Gate Ensemble or Cell3 Ensemble, you would use DEF.
16.5 Summary
Floorplanning follows the system partitioning step and is the first step in arranging circuit blocks on an ASIC. There are many factors to be considered during floorplanning: minimizing connection length and signal delay between blocks; arranging fixed blocks and reshaping flexible blocks to occupy the minimum die area; organizing the interconnect areas between blocks; and planning the power, clock, and I/O distribution. The handling of some of these factors may be automated using CAD tools, but many still need to be dealt with by hand.

Placement follows the floorplanning step and is more automated. It consists of organizing an array of logic cells within a flexible block. The criterion for optimization may be minimum interconnect area, minimum total interconnect length, or performance. There are two main types of placement algorithms: those based on min-cut methods and those based on eigenvector methods.

Because interconnect delay in a submicron CMOS process dominates logic-cell delay, planning of interconnect will become more and more important. Instead of completing synthesis before starting floorplanning and placement, we will have to use synthesis and floorplanning/placement tools together to achieve an accurate estimate of timing. The key points of this chapter are:
• Interconnect delay now dominates gate delay.
• Floorplanning is a mapping between logical and physical design.
• Floorplanning is the center of ASIC design operations for all types of ASIC.
• Timing-driven floorplanning is becoming an essential ASIC design tool.
• Placement is now an automated function.
ROUTING
Once the designer has floorplanned a chip and the logic cells within the flexible blocks have been placed, it is time to make the connections by routing the chip. This is still a hard problem that is made easier by dividing it into smaller problems. Routing is usually split into global routing followed by detailed routing .

Suppose the ASIC is North America and some travelers in California need advice on how to drive from Stanford (near San Francisco) to Caltech (near Los Angeles). The floorplanner has decided that California is on the left (west) side of the ASIC and the placement tool has put Stanford in Northern California and Caltech in Southern California. Floorplanning and placement have defined the roads and freeways. There are two ways to go: the coastal route (using Highway 101) or the inland route (using Interstate I5, which is usually faster). The global router specifies the coastal route because the travelers are not in a hurry and I5 is congested (the global router knows this because it has already routed onto I5 many other travelers that are in a hurry today). Next, the detailed router looks at a map and gives directions from Stanford onto Highway 101 south through San Jose, Monterey, and Santa Barbara to Los Angeles and then off the freeway to Caltech in Pasadena.

Figure 17.1 shows the core of the Viterbi decoder after the placement step. This implementation consists entirely of standard cells (18 rows). The I/O pads are not included in this example; we can route the I/O pads after we route the core (though this is not always a good idea). Figure 17.2 shows the Viterbi decoder chip after global and detailed routing. The routing runs in the channels between the rows of logic cells, but the individual interconnections are too small to see.
FIGURE 17.1 The core of the Viterbi decoder chip after placement (a screen shot from Cadence Cell Ensemble). This is the same placement as shown in Figure 16.2, but without the channel labels. You can see the rows of standard cells; the widest cells are the D flip-flops.
FIGURE 17.2 The core of the Viterbi decoder chip after the completion of global and detailed routing (a screen shot from Cadence Cell Ensemble). This chip uses two-level metal. Although you cannot see the difference, m1 runs in the horizontal direction and m2 in the vertical direction.

17.1 Global Routing
17.2 Detailed Routing
17.3 Special Routing
17.4 Circuit Extraction and DRC
17.5 Summary
17.6 Problems
17.7 Bibliography
17.8 References
17.1 Global Routing

The details of global routing differ slightly between cell-based ASICs, gate arrays, and FPGAs, but the principles are the same in each case. A global router does not make any connections; it just plans them. We typically global route the whole chip (or large pieces if it is a large chip) before detail routing the whole chip (or the pieces). There are two types of areas to global route: inside the flexible blocks and between blocks (the Viterbi decoder, although a cell-based ASIC, only involved the global routing of one large flexible block).
17.1.1 Goals and Objectives

The input to the global router is a floorplan that includes the locations of all the fixed and flexible blocks; the placement information for flexible blocks; and the locations of all the logic cells. The goal of global routing is to provide complete instructions to the detailed router on where to route every net. The objectives of global routing are one or more of the following:
• Minimize the total interconnect length.
• Maximize the probability that the detailed router can complete the routing.
• Minimize the critical path delay.
In both floorplanning and placement, with minimum interconnect length as an objective, it is necessary to find the shortest total path length connecting a set of terminals . This path is the MRST (minimum rectilinear Steiner tree), which is hard to find. The alternative, for both floorplanning and placement, is to use simple approximations to the length of the MRST (usually the half-perimeter measure). Floorplanning and placement both assume that interconnect may be put anywhere on a rectangular grid, since at this point nets have not been assigned to the channels, but the global router must use the wiring channels and find the actual path. Often the global router needs to find a path that minimizes the delay between two terminals; this is not necessarily the same as finding the shortest total path length for a set of terminals.
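The difference between the placement-time estimate and the path a global router must actually plan can be illustrated with a small Python sketch. The coordinates and the position of the channel are made-up values chosen only for illustration, and the function names are hypothetical.

```python
def half_perimeter(terminals):
    """Half-perimeter of the bounding box around a net's terminals."""
    xs = [x for x, _ in terminals]
    ys = [y for _, y in terminals]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def route_length(points):
    """Length of a rectilinear route given as a list of corner points."""
    return sum(abs(x2 - x1) + abs(y2 - y1)
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical two-terminal net in one row of cells. The only channel
# available to the router is assumed to run along y = 2.
terminals = [(0, 0), (4, 0)]
route_via_channel = [(0, 0), (0, 2), (4, 2), (4, 0)]

estimate = half_perimeter(terminals)      # what placement assumed
actual = route_length(route_via_channel)  # what the global router plans
print(estimate, actual)
```

In this contrived case the half-perimeter estimate is half the length of the route through the channel, which is why the global router cannot rely on the placement approximation.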
17.1.2 Measurement of Interconnect Delay

Floorplanning and placement need a fast and easy way to estimate the interconnect delay in order to evaluate each trial placement; often this is a predefined look-up table. After placement, the logic cell positions are fixed and the global router can afford to use better estimates of the interconnect delay. To illustrate one method, we shall use the Elmore constant to estimate the interconnect delay for the circuit shown in Figure 17.3 .
FIGURE 17.3 Measuring the delay of a net. (a) A simple circuit with an inverter A driving a net with a fanout of two. Voltages $V_1$, $V_2$, $V_3$, and $V_4$ are the voltages at intermediate points along the net. (b) The layout showing the net segments (pieces of interconnect). (c) The RC model with each segment replaced by a capacitance and resistance. The ideal switch and pull-down resistance $R_{pd}$ model the inverter A.

The problem is to find the voltages at the inputs to logic cells B and C taking into account the parasitic resistance and capacitance of the metal interconnect. Figure 17.3 (c) models logic cell A as an ideal switch with a pull-down resistance equal to $R_{pd}$ and models the metal interconnect using resistors and capacitors for each segment of the interconnect. The Elmore constant for node 4 (labeled $V_4$) in the network shown in Figure 17.3 (c) is

$$
D_4 = \sum_{k=1}^{4} R_{k4} C_k = R_{14} C_1 + R_{24} C_2 + R_{34} C_3 + R_{44} C_4 , \tag{17.1}
$$

where

$$
\begin{aligned}
R_{14} &= R_{pd} + R_1 \\
R_{24} &= R_{pd} + R_1 \\
R_{34} &= R_{pd} + R_1 + R_3 \\
R_{44} &= R_{pd} + R_1 + R_3 + R_4
\end{aligned} \tag{17.2}
$$
In Eq. 17.2 notice that $R_{24} = R_{pd} + R_1$ (and not $R_{pd} + R_1 + R_2$) because $R_1$ is the resistance to $V_0$ (ground) shared by node 2 and node 4. Suppose we have the following parameters (from the generic 0.5 µm CMOS process, G5) for the layout shown in Figure 17.3 (b):

• m2 resistance is 50 mΩ/square.
• m2 capacitance (for a minimum-width line) is 0.2 pF mm$^{-1}$.
• 4X inverter delay is 0.02 ns + 0.5 $C_L$ ns ($C_L$ is in picofarads).
• Delay is measured using 0.35/0.65 output trip points.
• m2 minimum width is 3λ = 0.9 µm.
• 1X inverter input capacitance is 0.02 pF (a standard load).
First we need to find the pull-down resistance, $R_{pd}$, of the 4X inverter. If we model the gate with a linear pull-down resistor, $R_{pd}$, driving a load $C_L$, the output waveform is $\exp(-t/(C_L R_{pd}))$ (normalized to 1 V). The output reaches 63 percent of its final value when $t = C_L R_{pd}$, because $1 - \exp(-1) = 0.63$. Then, because the delay is measured with a 0.65 trip point, the constant 0.5 ns pF$^{-1}$ = 0.5 kΩ is very close to the equivalent pull-down resistance. Thus, $R_{pd} \approx 500\ \Omega$.

From the given data, we can calculate the $R$s and $C$s:

$$
\begin{aligned}
R_1 = R_2 &= \frac{(0.1\ \text{mm})(50 \times 10^{-3}\ \Omega)}{0.9\ \mu\text{m}} \approx 6\ \Omega \\
R_3 &= \frac{(1\ \text{mm})(50 \times 10^{-3}\ \Omega)}{0.9\ \mu\text{m}} \approx 56\ \Omega \\
R_4 &= \frac{(2\ \text{mm})(50 \times 10^{-3}\ \Omega)}{0.9\ \mu\text{m}} \approx 112\ \Omega
\end{aligned} \tag{17.3}
$$

$$
\begin{aligned}
C_1 &= (0.1\ \text{mm})(0.2\ \text{pF mm}^{-1}) = 0.02\ \text{pF} \\
C_2 &= (0.1\ \text{mm})(0.2\ \text{pF mm}^{-1}) + 0.02\ \text{pF} = 0.04\ \text{pF} \\
C_3 &= (1\ \text{mm})(0.2\ \text{pF mm}^{-1}) = 0.2\ \text{pF} \\
C_4 &= (2\ \text{mm})(0.2\ \text{pF mm}^{-1}) + 0.02\ \text{pF} = 0.42\ \text{pF}
\end{aligned} \tag{17.4}
$$

Now we can calculate the path resistance, $R_{ki}$, values (notice that $R_{ki} = R_{ik}$):

$$
\begin{aligned}
R_{14} &= 500 + 6 = 506\ \Omega \\
R_{24} &= 500 + 6 = 506\ \Omega \\
R_{34} &= 500 + 6 + 56 = 562\ \Omega \\
R_{44} &= 500 + 6 + 56 + 112 = 674\ \Omega
\end{aligned} \tag{17.5}
$$

Finally, we can calculate Elmore's constants for node 4 and node 2 as follows:

$$
\begin{aligned}
D_4 &= R_{14} C_1 + R_{24} C_2 + R_{34} C_3 + R_{44} C_4 \\
&= (506)(0.02) + (506)(0.04) + (562)(0.2) + (674)(0.42) \\
&\approx 425\ \text{ps} .
\end{aligned} \tag{17.6}
$$

$$
\begin{aligned}
D_2 &= R_{12} C_1 + R_{22} C_2 + R_{32} C_3 + R_{42} C_4 \\
&= (R_{pd} + R_1)(C_1 + C_3 + C_4) + (R_{pd} + R_1 + R_2) C_2 \\
&= (500 + 6)(0.02 + 0.2 + 0.42) + (500 + 6 + 6)(0.04) \\
&\approx 344\ \text{ps} ,
\end{aligned} \tag{17.7}
$$

and $D_4 - D_2 = (425 - 344) = 81$ ps.

A lumped-delay model neglects the effects of interconnect resistance and simply sums all the node capacitances (the lumped capacitance ) as follows:

$$
D = R_{pd}(C_1 + C_2 + C_3 + C_4) = (500)(0.02 + 0.04 + 0.2 + 0.42) = 340\ \text{ps} . \tag{17.8}
$$

Comparing Eqs. 17.6–17.8, we can see that the delay of the inverter can be assigned as follows: 20 ps (the intrinsic delay, 0.02 ns, due to the cell output capacitance), 340 ps (due to the pull-down resistance and the output capacitance), 4 ps (due to the interconnect from A to B), and 85 ps (due to the interconnect from A to C). We can see that the error from neglecting interconnect resistance can be important. Even using the Elmore constant we still made the following assumptions in estimating the path delays:

• A step-function waveform drives the net.
• The delay is measured from when the gate input changes.
• The delay is equal to the time constant of an exponential waveform that approximates the actual output waveform.
• The interconnect is modeled by discrete resistance and capacitance elements.
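The worked example can be checked with a few lines of Python. This is a sketch under the same assumptions as the text: the tree topology is inferred from Eq. 17.2 (nodes 2 and 3 branch from node 1, and node 4 follows node 3), the segment values are the rounded ones from Eqs. 17.3 and 17.4, and the function names are hypothetical. Resistances are in ohms and capacitances in picofarads, so the products are in picoseconds.

```python
R_PD = 500.0                                 # pull-down resistance of inverter A
parent = {1: None, 2: 1, 3: 1, 4: 3}         # upstream node of each node
R = {1: 6.0, 2: 6.0, 3: 56.0, 4: 112.0}      # segment resistance into each node
C = {1: 0.02, 2: 0.04, 3: 0.2, 4: 0.42}      # capacitance at each node


def path_to_driver(node):
    """Set of nodes on the path from the driver to the given node."""
    path = set()
    while node is not None:
        path.add(node)
        node = parent[node]
    return path


def shared_resistance(k, i):
    """R_ki: resistance common to the driver-to-k and driver-to-i paths."""
    common = path_to_driver(k) & path_to_driver(i)
    return R_PD + sum(R[n] for n in common)


def elmore(i):
    """Elmore constant for node i: sum over k of R_ki * C_k."""
    return sum(shared_resistance(k, i) * C[k] for k in C)


d4 = elmore(4)
d2 = elmore(2)
lumped = R_PD * sum(C.values())
print(round(d4, 2), round(d2, 2), round(lumped, 2))
```

Without intermediate rounding the script gives about 425.8 ps for node 4 and 344.3 ps for node 2, in line with the 425 ps and 344 ps quoted in the text, and 340 ps for the lumped model.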
The global router could use more sophisticated estimates that remove some of these assumptions, but there is a limit to the accuracy with which delay can be estimated during global routing. For example, the global router does not know how much of the
routing is on which of the layers, or how many vias will be used and of which type, or how wide the metal lines will be. It may be possible to estimate how much interconnect will be horizontal and how much is vertical. Unfortunately, this knowledge does not help much if horizontal interconnect may be completed in either m1 or m3 and there is a large difference in parasitic capacitance between m1 and m3, for example. When the global router attempts to minimize interconnect delay, there is an important difference between a path and a net. The path that minimizes the delay between two terminals on a net is not necessarily the same as the path that minimizes the total path length of the net. For example, to minimize the path delay (using the Elmore constant as a measure) from the output of inverter A in Figure 17.3 (a) to the input of inverter B requires a rather complicated algorithm to construct the best path. We shall return to this problem in Section 17.1.6 .
17.1.3 Global Routing Methods

Global routing cannot use the interconnect-length approximations, such as the half-perimeter measure, that were used in placement. What is needed now is the actual path and not an approximation to the path length. However, many of the methods used in global routing are still based on the solutions to the tree-on-a-graph problem.

One approach to global routing takes each net in turn and calculates the shortest path using tree-on-graph algorithms, with the added restriction of using the available channels. This process is known as sequential routing . As a sequential routing algorithm proceeds, some channels will become more congested since they hold more interconnects than others. In the case of FPGAs and channeled gate arrays, the channels have a fixed channel capacity and can only hold a certain number of interconnects. There are two different ways that a global router normally handles this problem. Using order-independent routing , a global router proceeds by routing each net, ignoring how crowded the channels are. Whether a particular net is processed first or last does not matter; the channel assignment will be the same. In order-independent routing, after all the interconnects are assigned to channels, the global router returns to those channels that are the most crowded and reassigns some interconnects to other, less crowded, channels. Alternatively, a global router can consider the number of interconnects already placed in various channels as it proceeds. In this case the global routing is order dependent : the routing is still sequential, but now the order of processing the nets will affect the results. Iterative improvement or simulated annealing may be applied to the solutions found from both order-dependent and order-independent algorithms. This is implemented in the same way as for system partitioning and placement: a constructed solution is successively changed, one interconnect path at a time, in a series of random moves.

In contrast to sequential global-routing methods, which handle nets one at a time, hierarchical routing handles all nets at a particular level at once. Rather than handling all of the nets on the chip at the same time, the global-routing problem is made more tractable by dividing the chip area into levels of hierarchy. By considering only one level of hierarchy at a time the size of the problem is reduced at each level. There are two ways to traverse the levels of hierarchy. Starting at the whole chip, or highest level, and proceeding down to the logic cells is the top-down approach. The bottom-up approach starts at the lowest level of hierarchy and globally routes the smallest areas first.
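A toy Python sketch may help to show why order-dependent sequential routing gives different answers for different net orderings. It is a deliberate simplification, not a real router: each net is given a short list of candidate routes (lists of channel names), route length is taken to be the number of channels used, and all names and capacities are invented for the example.

```python
def route_sequentially(nets, candidates, capacity):
    """Assign each net, in the given order, to its shortest candidate route
    whose channels all still have spare capacity."""
    used = {channel: 0 for channel in capacity}
    assignment = {}
    for net in nets:
        for route in sorted(candidates[net], key=len):
            if all(used[ch] < capacity[ch] for ch in route):
                for ch in route:
                    used[ch] += 1
                assignment[net] = route
                break
        else:
            assignment[net] = None  # no candidate route fits
    return assignment


# Hypothetical data: channel c1 is short but can hold only one interconnect.
capacity = {"c1": 1, "c2": 2, "c3": 2}
candidates = {
    "n1": [["c1"], ["c2", "c3"]],
    "n2": [["c1"], ["c2", "c3"]],
}

first = route_sequentially(["n1", "n2"], candidates, capacity)
second = route_sequentially(["n2", "n1"], candidates, capacity)
print(first)
print(second)
```

Whichever net is processed first takes the short channel and the other is pushed onto the longer route, so reversing the order swaps the two assignments. An order-independent router would instead assign both nets to c1 and resolve the overflow afterward.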
17.1.4 Global Routing Between Blocks

Figure 17.4 illustrates the global-routing problem for a cell-based ASIC. Each edge in the channel-intersection graph in Figure 17.4 (c) represents a channel. The global router is restricted to using these channels. The weight of each edge in the graph corresponds to the length of the channel. The global router plans a path for each interconnect using this graph.
FIGURE 17.4 Global routing for a cell-based ASIC formulated as a graph problem. (a) A cell-based ASIC with numbered channels. (b) The channels form the edges of a graph. (c) The channel-intersection graph. Each channel corresponds to an edge on a graph whose weight corresponds to the channel length.

Figure 17.5 shows an example of global routing for a net with five terminals, labeled A1 through F1, for the cell-based ASIC shown in Figure 17.4 . If a designer wishes to use minimum total interconnect path length as an objective, the global router finds the minimum-length tree shown in Figure 17.5 (b). This tree determines the channels the interconnects will use. For example, the shortest connection from A1 to B1 uses channels 2, 1, and 5 (in that order). This is the information the global router passes to the detailed router. Figure 17.5 (c) shows that minimizing the total path length may not correspond to minimizing the path delay between two points.
FIGURE 17.5 Finding paths in global routing. (a) A cell-based ASIC (from Figure 17.4 ) showing a single net with a fanout of four (five terminals). We have to order the numbered channels to complete the interconnect path for terminals A1 through F1. (b) The terminals are projected to the center of the nearest channel, forming a graph. A minimum-length tree for the net that uses the channels and takes into account the channel capacities. (c) The minimum-length tree does not necessarily correspond to minimum delay. If we wish to minimize the delay from terminal A1 to D1, a different tree might be better.

Global routing is very similar for cell-based ASICs and gate arrays, but there is a very important difference between the types of channels in these ASICs. The size of the channels in sea-of-gates arrays, channelless gate arrays, and cell-based ASICs can be varied to make sure there is enough space to complete the wiring. In channeled gate arrays and FPGAs the size, number, and location of channels are fixed. The good news is that the global router can allocate as many interconnects to each channel as it likes, since that space is committed anyway. The bad news is that there is a maximum number of interconnects that each channel can hold. If the global router needs more room, even in just one channel on the whole chip, the designer has to repeat the placement-and-routing steps and try again (or use a bigger chip).
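For a single two-terminal connection, planning a path on a channel-intersection graph reduces to a shortest-path search in which each edge is a channel weighted by its length. The Python sketch below uses Dijkstra's algorithm on a small invented graph; the node names, channel numbers, and lengths are hypothetical and do not correspond to Figure 17.4. A net with more than two terminals needs a tree rather than a path, which is a harder problem that this sketch does not attempt.

```python
import heapq

# Hypothetical channel-intersection graph: (node, node, channel, length).
EDGES = [
    ("p", "q", 1, 4.0),
    ("q", "r", 2, 3.0),
    ("p", "s", 3, 2.0),
    ("s", "r", 4, 2.0),
    ("r", "t", 5, 1.0),
]


def shortest_channel_path(edges, source, target):
    """Return (total length, ordered list of channels) from source to target,
    or None if the target cannot be reached."""
    graph = {}
    for a, b, channel, length in edges:
        graph.setdefault(a, []).append((b, channel, length))
        graph.setdefault(b, []).append((a, channel, length))
    best = {source: 0.0}
    heap = [(0.0, source, [])]
    while heap:
        dist, node, channels = heapq.heappop(heap)
        if node == target:
            return dist, channels
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, channel, length in graph.get(node, []):
            new_dist = dist + length
            if new_dist < best.get(nxt, float("inf")):
                best[nxt] = new_dist
                heapq.heappush(heap, (new_dist, nxt, channels + [channel]))
    return None


result = shortest_channel_path(EDGES, "p", "t")
print(result)
```

The ordered list of channels in the result is the kind of information the global router hands to the detailed router.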
17.1.5 Global Routing Inside Flexible Blocks

We shall illustrate global routing using a gate array. Figure 17.6 (a) shows the routing resources on a sea-of-gates or channelless gate array. The gate array base cells are arranged in 36 blocks, each block containing an array of 8-by-16 gate-array base cells, making a total of 4608 base cells. The horizontal interconnect resources are the routing channels that are formed from unused rows of the gate-array base cells, as shown in Figure 17.6 (b) and (c). The vertical resources are feedthroughs. For example, the logic cell shown in Figure 17.6 (d) is an inverter that contains two types of feedthrough. The inverter logic cell uses a single gate-array base cell with terminals (or connectors ) located at the top and bottom of the logic cell. The inverter input pin has two electrically equivalent terminals that the global router can use as a feedthrough. The output of the inverter is connected to only one terminal. The remaining vertical track is unused by the inverter logic cell, so this track forms an uncommitted feedthrough.

You may see any of the terms landing pad (because we say that we drop a via to a landing pad), pick-up point , connector , terminal , pin , or port used for the connection to a logic cell. The term pick-up point refers to the physical pieces of metal (or sometimes polysilicon) in the logic cell to which the router connects. In a three-level metal process, the global router may be able to connect to anywhere in an area (an area pick-up point ). In this book we use the term connector to refer to the physical pick-up point. The term pin more often refers to the connection on a logic schematic icon (a dot, square box, or whatever symbol is used), rather than layout. Thus the difference between a pin and a connector is that we can have multiple connectors for one pin. Terminal is often used when we talk about routing. The term port is used when we are using text (EDIF netlists or HDLs, for example) to describe circuits.

In a gate array the channel capacity must be a multiple of the number of horizontal tracks in the gate-array base cell. Figure 17.6 (e) shows a gate-array base cell with seven horizontal tracks (see Section 17.2 for the factors that determine the track width and track spacing). Thus, in this gate array, we can have a channel with a capacity of 7, 14, 21, ... horizontal tracks, but not between these values.
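The quantization of channel capacity can be written as a one-line calculation. In this Python sketch the function name is hypothetical, and the default of seven tracks per row is the value for the base cell of Figure 17.6 (e).

```python
import math


def channel_rows_and_capacity(tracks_needed, tracks_per_row=7):
    """Rows of base cells to give up for a channel, and the capacity obtained."""
    rows = math.ceil(tracks_needed / tracks_per_row)
    return rows, rows * tracks_per_row


for needed in (7, 9, 15):
    print(needed, channel_rows_and_capacity(needed))
```

A channel that needs nine tracks therefore costs two rows of base cells and provides 14 tracks, five of which go unused.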
FIGURE 17.6 Gate-array global routing. (a) A small gate array. (b) An enlarged view of the routing. The top channel uses three rows of gate-array base cells; the other channels use only one. (c) A further enlarged view showing how the routing in the channels connects to the logic cells. (d) One of the logic cells, an inverter. (e) There are seven horizontal wiring tracks available in one row of gate-array base cells; the channel capacity is thus 7.

Figure 17.7 shows the inverter macro for the sea-of-gates array shown in Figure 17.6 . Figure 17.7 (a) shows the base cell. Figure 17.7 (b) shows how the internal inverter wiring on m1 leaves one vertical track free as a feedthrough in a two-level metal process (connectors placed at the top and bottom of the cell). In a three-level metal process the connectors may be placed inside the cell abutment box ( Figure 17.7 (c)). Figure 17.8 shows the global routing for the sea-of-gates array. We divide the array into nonoverlapping routing bins (or just bins , also called global routing cells or GRCs ), each containing a number of gate-array base cells.
FIGURE 17.7 The gate-array inverter from Figure 17.6 (d). (a) An oxide-isolated gate-array base cell, showing the diffusion and polysilicon layers. (b) The metal and contact layers for the inverter in a 2LM (two-level metal) process. (c) The router's view of the cell in a 3LM process.

We need an aside to discuss our use of the term cell . Be careful not to confuse the global routing cells with gate-array base cells (the smallest element of a gate array, consisting of a small number of n -type and p -type transistors), or with logic cells (which are NAND gates, NOR gates, and so on).

A large routing bin reduces the size of the routing problem, and a small routing bin allows the router to calculate the wiring capacities more accurately. Some tools permit routing bins of different size in different areas of the chip (with smaller routing bins helping in areas of dense routing). Figure 17.8 (a) shows a routing bin that is 2-by-4 gate-array base cells. The logic cells occupy the lower half of the routing bin. The upper half of the routing bin is the channel area, reserved for wiring. The global router calculates the edge capacities for this routing bin, including the vertical feedthroughs. The global router then determines the shortest path for each net considering these edge capacities. An example of a global-routing calculation is shown in Figure 17.8 (b). The path, described by a series of adjacent routing bins, is passed to the detailed router.
FIGURE 17.8 Global routing a gate array. (a) A single global-routing cell (GRC or routing bin) containing 2-by-4 gate-array base cells. For this choice of routing bin the maximum horizontal track capacity is 14, the maximum vertical track capacity is 12. The routing bin labeled C3 contains three logic cells, two of which have feedthroughs marked 'f'. This results in the edge capacities shown. (b) A view of the top left-hand corner of the gate array showing 28 routing bins. The global router uses the edge capacities to find a sequence of routing bins to connect the nets.
17.1.6 Timing-Driven Methods

Minimizing the total path length using a Steiner tree does not necessarily minimize the interconnect delay of a path. Alternative tree algorithms apply in this situation, most using the Elmore constant as a method to estimate the delay of a path ( Section 17.1.2 ). As in timing-driven placement, there are two main approaches to timing-driven routing: net-based and path-based. Path-based methods are more sophisticated. For example, if there is a critical path from logic cell A to B to C, the global router may increase the delay due to the interconnect between logic cells A and B if it can reduce the delay between logic cells B and C.

Placement and global routing tools may or may not use the same algorithm to estimate net delay. If these tools are from different companies, the algorithms are probably different. The algorithms must be compatible, however. There is no use performing placement to minimize predicted delay if the global router uses completely different measurement methods. Companies that produce floorplanning and placement tools make sure that the output is compatible with different routing tools, often to the extent of using different algorithms to target different routers.
17.1.7 Back-annotation
After global routing is complete it is possible to accurately predict what the length of each interconnect in every net will be after detailed routing, probably to within 5 percent. The global router can give us not just an estimate of the total net length (which was all we knew at the placement stage), but the resistance and capacitance of each path in each net. This RC information is used to calculate net delays. We can back-annotate this net delay information to the synthesis tool for in-place optimization or to a timing verifier to make sure there are no timing surprises. Differences in timing predictions at this point arise due to the different ways in which the placement algorithms estimate the paths and the way the global router actually builds the paths.
17.2 Detailed Routing

The global routing step determines the channels to be used for each interconnect. Using this information the detailed router decides the exact location and layers for each interconnect. Figure 17.9 (a) shows typical metal rules. These rules determine the m1 routing pitch ( track pitch , track spacing , or just pitch ). We can set the m1 pitch to one of three values:

1. via-to-via ( VTV ) pitch (or spacing),
2. via-to-line ( VTL or line-to-via ) pitch, or
3. line-to-line ( LTL ) pitch.

The same choices apply to the m2 and other metal layers if they are present. Via-to-via spacing allows the router to place vias adjacent to each other. Via-to-line spacing is hard to use in practice because it restricts the router to nonadjacent vias. Using line-to-line spacing prevents the router from placing a via at all without using jogs and is rarely used. Via-to-via spacing is the easiest for a router to use and the most common. Using either via-to-line or via-to-via spacing means that the routing pitch is larger than the minimum metal pitch.

Sometimes people draw a distinction between a cut and a via when they talk about large connections such as shown in Figure 17.10 (a). We split or stitch a large via into identically sized cuts (sometimes called a waffle via ). Because of the profile of the metal in a contact and the way current flows into a contact, often the total resistance of several small cuts is less than that of one large cut. Using identically sized cuts also means the processing conditions during contact etching, which may vary with the area and perimeter of a contact, are the same for every cut on the chip.
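The three pitch choices follow from simple geometry: the pitch is the center-to-center distance between adjacent tracks, so it is half the width of each neighboring feature plus the minimum spacing between them. The Python sketch below shows the calculation. The function name is hypothetical, and the rule values (in units of λ) are assumptions picked for illustration, not the rules of Figure 17.9.

```python
def routing_pitches(line_width, line_spacing, via_width):
    """Center-to-center track pitches for one metal layer.

    via_width is the width of the metal around a via (the cut plus its
    overlap on both sides), which is normally wider than a plain line.
    """
    return {
        "line-to-line": line_width / 2 + line_spacing + line_width / 2,
        "via-to-line": via_width / 2 + line_spacing + line_width / 2,
        "via-to-via": via_width / 2 + line_spacing + via_width / 2,
    }


# Assumed rules in lambda: 3-wide lines, 3 spacing, 4-wide metal at a via.
pitches = routing_pitches(line_width=3, line_spacing=3, via_width=4)
for name, pitch in pitches.items():
    print(name, pitch)
```

With these assumed rules the via-to-via pitch is the largest of the three, which is the price paid for letting the router place vias on adjacent tracks.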
In a stacked via the contact cuts all overlap in a layout plot and it is impossible to tell just how many vias on which layers are present. Figure 17.10 (b–f) shows an alternative way to draw contacts and vias. Though this is not a standard, using the diagonal box convention makes it possible to recognize stacked vias and contacts on a layout (in any orientation). I shall use these conventions when it is necessary.
FIGURE 17.9 The metal routing pitch. (a) An example of λ-based metal design rules for m1 and via1 (m1/m2 via). (b) Via-to-via pitch for adjacent vias. (c) Via-to-line (or line-to-via) pitch for nonadjacent vias. (d) Line-to-line pitch with no vias.
FIGURE 17.10 (a) A large m1 to m2 via. The black squares represent the holes (or cuts) that are etched in the insulating material between the m1 and m2 layers. (b) A m1 to m2 via (a via1). (c) A contact from m1 to diffusion or polysilicon (a contact). (d) A via1 placed over (or stacked over) a contact. (e) A m2 to m3 via (a via2). (f) A via2 stacked over a via1 stacked over a contact. Notice that the black squares in parts b–c do not represent the actual location of the cuts. The black squares are offset so you can recognize stacked vias and contacts.

In a two-level metal CMOS ASIC technology we complete the wiring using the two different metal layers for the horizontal and vertical directions, one layer for each direction. This is Manhattan routing , because the results look similar to the rectangular north–south and east–west layout of streets in New York City. Thus, for example, if terminals are on the m2 layer, then we route the horizontal branches in a channel using m2 and the vertical trunks using m1. Figure 17.11 shows that, although we may choose a preferred direction for each metal layer (for example, m1 for horizontal routing and m2 for vertical routing), this may lead to problems in cases that have both horizontal and vertical channels. In these cases we define a preferred metal layer in the direction of the channel spine. In Figure 17.11 , because the logic cell connectors are on m2, any vertical channel has to use vias at every logic cell location. By changing the orientation of the metal directions in vertical channels, we can avoid this, and instead we only need to place vias at the intersection of horizontal and vertical channels.
FIGURE 17.11 An expanded view of part of a cell-based ASIC. (a) Both channel 4 and channel 5 use m1 in the horizontal direction and m2 in the vertical direction. If the logic cell connectors are on m2 this requires vias to be placed at every logic cell connector in channel 4. (b) Channels 4 and 5 are routed with m1 along the direction of the channel spine (the long direction of the channel). Now vias are required only for nets 1 and 2, at the intersection of the channels.

Figure 17.12 shows an imaginary logic cell with connectors. Double-entry logic cells intended for two-level metal routing have connectors at the top and bottom of the logic cell, usually in m2. Logic cells intended for processes with three or more levels of metal have connectors in the center of the cell, again usually on m2. Logic cells may use both m1 and m2 internally, but the use of m2 is usually minimized. The router normally uses a simplified view of the logic cell called a phantom . The phantom contains only the logic cell information that the router needs: the connector locations, types, and names; the abutment and bounding boxes; enough layer information to be able to place cells without violating design rules; and a blockage map , which gives the locations of any metal inside the cell that blocks routing.
FIGURE 17.12 The different types of connections that can be made to a cell. This cell has connectors at the top and bottom of the cell (normal for cells intended for use with a two-level metal process) and internal connectors (normal for logic cells intended for use with a three-level metal process). The interconnect and connections are drawn to scale.
Figure 17.13 illustrates some terms used in the detailed routing of a channel. The channel spine in Figure 17.13 is horizontal with terminals at the top and the bottom, but a channel can also be vertical. In either case terminals are spaced along the longest edges of the channel at given, fixed locations. Terminals are usually located on a grid defined by the routing pitch on that layer (we say terminals are either on-grid or off-grid). We make connections between terminals using interconnects that consist of one or more trunks running parallel to the length of the channel and branches that connect the trunk to the terminals. If more than one trunk is used, the trunks are connected by doglegs. Connections exit the channel at pseudoterminals.
FIGURE 17.13 Terms used in channel routing. (a) A channel with four horizontal tracks. (b) An expanded view of the left-hand portion of the channel showing (approximately to scale) how the m1 and m2 layers connect to the logic cells on either side of the channel. (c) The construction of a via1 (m1/m2 via).
The trunk and branch connections run in tracks (equispaced, like railway tracks). If the trunk connections use m1, the horizontal track spacing (usually just called the track spacing for channel routing) is equal to the m1 routing pitch. The maximum number of interconnects we need in a channel multiplied by the horizontal track spacing gives the minimum height of a channel (see Section 17.2.2 on how to determine the maximum number of interconnects needed). Each terminal occupies a column. If the branches use m2, the column spacing (or vertical track spacing) is equal to the m2 routing pitch.
17.2.1 Goals and Objectives

The goal of detailed routing is to complete all the connections between logic cells.
The most common objective is to minimize one or more of the following:

- The total interconnect length and area
- The number of layer changes that the connections have to make
- The delay of critical paths
Minimizing the number of layer changes corresponds to minimizing the number of vias that add parasitic resistance and capacitance to a connection. In some cases the detailed router may not be able to complete the routing in the area provided. In the case of a cell-based ASIC or sea-of-gates array, it is possible to increase the channel size and try the routing steps again. A channeled gate array or FPGA has fixed routing resources and in these cases we must start all over again with floorplanning and placement, or use a larger chip.
17.2.2 Measurement of Channel Density

We can describe a channel-routing problem by specifying two lists of nets: one for the top edge of the channel and one for the bottom edge. The position of the net number in the list gives the column position. The net number zero represents a vacant or unused terminal. Figure 17.14 shows a channel with the numbered terminals to be connected along the top and the bottom of the channel. We call the number of nets that cross a line drawn vertically anywhere in a channel the local density. We call the maximum local density of the channel the global density or sometimes just channel density. Figure 17.14 has a channel density of 4. Channel density is an important measure in routing: it tells a router the minimum number of horizontal interconnects that it needs at the point where the local density is highest. In two-level routing (all the horizontal interconnects run on one routing layer) the channel density determines the minimum height of the channel. The channel capacity is the maximum number of interconnects that a channel can hold. If the channel density is greater than the channel capacity, that channel definitely cannot be routed (to learn how channel density is calculated, see Section 17.2.5).
FIGURE 17.14 The definitions of local channel density and global channel density. Lines represent the m1 and m2 interconnect in the channel to simplify the drawing.
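The density definitions above can be illustrated with a short sketch. It assumes the two-list description of a channel given in the text (net numbers by column, 0 for a vacant terminal). The function names and the example channel are hypothetical and are not the channel of Figure 17.14. The sketch also assumes that a net confined to a single column needs no trunk and so does not contribute to the density.

```python
def net_spans(top, bottom):
    """Return {net: (leftmost_column, rightmost_column)} for a channel."""
    spans = {}
    for edge in (top, bottom):
        for column, net in enumerate(edge):
            if net == 0:  # 0 marks a vacant terminal
                continue
            left, right = spans.get(net, (column, column))
            spans[net] = (min(left, column), max(right, column))
    return spans


def local_densities(top, bottom):
    """Local density per column: nets whose trunk crosses that column."""
    spans = net_spans(top, bottom)
    columns = max(len(top), len(bottom))
    return [
        sum(1 for left, right in spans.values()
            if left < right and left <= column <= right)
        for column in range(columns)
    ]


def channel_density(top, bottom):
    """Global (channel) density: the maximum local density."""
    return max(local_densities(top, bottom), default=0)


def min_channel_height(density, track_pitch):
    """Minimum channel height for two-level routing: tracks times pitch."""
    return density * track_pitch


# Hypothetical example channel (not taken from the book's figures).
top = [1, 2, 5, 3, 2, 5]
bottom = [0, 1, 3, 0, 4, 4]
print(local_densities(top, bottom))   # one value per column
print(channel_density(top, bottom))   # maximum of the values above
```

Multiplying the channel density by the track spacing, as in `min_channel_height`, gives the lower bound on channel height described in the text for two-level routing.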
17.2.3 Algorithms
We start discussion of routing methods by simplifying the general channel-routing problem. The restricted channel-routing problem limits each net in a channel to use only one horizontal segment. In other words the channel router uses only one trunk for each net. This restriction has the effect of minimizing the number of connections between the routing layers. This is equivalent to minimizing the number of vias used by the channel router in a two-layer metal technology. Minimizing the number of vias is an important objective in routing a channel, but it is not always practical. Sometimes constraints will force a channel router to use jogs or other methods to complete the routing (see Section 17.2.5 ). Next, though, we shall study an algorithm that solves the restricted channel-routing problem.
17.2.4 Left-Edge Algorithm

The left-edge algorithm ( LEA ) is the basis for several routing algorithms [ Hashimoto and Stevens, 1971]. The LEA applies to two-layer channel routing, using one layer for the trunks and the other layer for the branches. For example, m1 may be used in the horizontal direction and m2 in the vertical direction. The LEA proceeds as follows:
1. Sort the nets according to the leftmost edge of each net's horizontal segment.
2. Assign the first net on the list to the first free track.
3. Assign the next net on the list that will fit to the current track.
4. Repeat this process from step 3 until no more nets will fit in the current track.
5. Repeat steps 2–4 until all nets have been assigned to tracks.
6. Connect the net segments to the top and bottom of the channel.
FIGURE 17.15 Left-edge algorithm. (a) Sorted list of segments. (b) Assignment to tracks. (c) Completed channel route (with m1 and m2 interconnect represented by lines).
Figure 17.15 illustrates the LEA. The algorithm works as long as none of the branches touch, which may occur if there are terminals in the same column belonging to different nets. In this situation we have to make sure that the trunk that connects to the top of the channel is placed above the lower trunk. Otherwise two branches will overlap and short the nets together. In the next section we shall examine this situation more closely.
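The track-assignment steps of the LEA can be sketched as follows. This is a minimal illustration, not the published implementation: it assumes each net is already reduced to a single trunk given as a (left column, right column) pair, it ignores vertical constraints entirely, and it treats two trunks that share a column as touching. The function name and the example segments are hypothetical.

```python
def left_edge(segments):
    """Assign one-trunk nets to tracks with the left-edge algorithm.

    segments maps net -> (left_column, right_column).
    Returns a list of tracks; each track is a list of nets, left to right.
    Vertical constraints are NOT considered in this sketch.
    """
    # Step 1: sort nets by the left edge of their horizontal segment.
    remaining = sorted(segments, key=lambda net: segments[net])
    tracks = []
    while remaining:                      # Step 5: repeat until all placed
        track, last_right, leftover = [], None, []
        for net in remaining:             # Steps 2-4: fill the current track
            left, right = segments[net]
            if last_right is None or left > last_right:
                track.append(net)         # net fits to the right of the last one
                last_right = right
            else:
                leftover.append(net)      # try again in a later track
        tracks.append(track)
        remaining = leftover
    return tracks


# Hypothetical example: five nets with overlapping trunks.
segments = {1: (0, 1), 2: (1, 4), 3: (2, 3), 4: (4, 5), 5: (2, 5)}
tracks = left_edge(segments)
for number, track in enumerate(tracks, start=1):
    print("track", number, track)
```

In this example the number of tracks used equals the largest number of trunks crossing any one column, which is what the text leads us to expect when there are no vertical constraints.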
17.2.5 Constraints and Routing Graphs

Two terminals that are in the same column in a channel create a vertical constraint. We say that the terminal at the top of the column imposes a vertical constraint on the lower terminal. We can draw a graph showing the vertical constraints imposed by terminals. The nodes in a vertical-constraint graph represent terminals. A vertical constraint between two terminals is shown by an edge of the graph connecting the two terminals. A graph that contains information in the direction of an edge is a directed graph. The arrow on the graph edge shows the direction of the constraint, pointing to the lower terminal, which is constrained. Figure 17.16(a) shows an example of a channel, and Figure 17.16(b) shows its vertical-constraint graph.
FIGURE 17.16 Routing graphs. (a) Channel with a global density of 4. (b) The vertical-constraint graph. If two nets occupy the same column, the net at the top of the channel imposes a vertical constraint on the net at the bottom. For example, net 2 imposes a vertical constraint on net 4. Thus the interconnect for net 2 must use a track above net 4. (c) Horizontal-constraint graph. If the segments of two nets overlap, they are connected in the horizontal-constraint graph. This graph determines the global channel density.
We can also define a horizontal constraint and a corresponding horizontal-constraint graph. If the trunk for net 1 overlaps the trunk of net 2, then we say there is a horizontal constraint between net 1 and net 2. Unlike a vertical constraint, a horizontal constraint has no direction. Figure 17.16(c) shows an example of a horizontal-constraint graph and shows a group of four nets (numbered 3, 5, 6, and 7) whose trunks must all overlap. Since this is the largest such group, the global channel density is 4. If there are no vertical constraints at all in a channel, we can guarantee that the LEA will find the minimum number of routing tracks. The addition of vertical constraints transforms the restricted routing problem into an NP-complete problem. There is also an arrangement of vertical constraints that none of the algorithms based on the LEA can cope with. In Figure 17.17(a) net 1 is above net 2 in the first column of the channel. Thus net 1 imposes a vertical constraint on net 2. Net 2 is above net 1 in the last column of the channel. Then net 2 also imposes a vertical constraint on net 1. It is impossible to route this arrangement using two routing layers with the restriction of using only one trunk for each net. If we construct the vertical-constraint graph for this situation, shown in Figure 17.17(b), there is a loop or cycle between nets 1 and 2.
If there is any such vertical-constraint cycle (or cyclic constraint ) between two or more nets, the LEA will fail. A dogleg router removes the restriction that each net can use only one track or trunk. Figure 17.17 (c) shows how adding a dogleg permits a channel with a cyclic constraint to be routed.
FIGURE 17.17 The addition of a dogleg, an extra trunk, in the wiring of a net can resolve cyclic vertical constraints.
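A router can detect a cyclic constraint before attempting a one-trunk-per-net solution. The sketch below builds a vertical-constraint graph between nets from the top and bottom terminal lists (0 for a vacant terminal) and checks it for a cycle with a depth-first search. The function names are hypothetical, and the first example only mimics the situation described for Figure 17.17(a); it is not copied from the figure.

```python
def vertical_constraints(top, bottom):
    """Return directed edges (upper_net, lower_net), one per constrained column."""
    edges = set()
    for upper, lower in zip(top, bottom):
        if upper and lower and upper != lower:
            edges.add((upper, lower))  # upper net's trunk must lie above lower's
    return edges


def has_cycle(edges):
    """Depth-first search for a cycle in the directed constraint graph."""
    graph = {}
    for upper, lower in edges:
        graph.setdefault(upper, set()).add(lower)
        graph.setdefault(lower, set())
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on current path, finished
    color = dict.fromkeys(graph, WHITE)

    def visit(node):
        color[node] = GREY
        for successor in graph[node]:
            if color[successor] == GREY:  # back edge: a cyclic constraint
                return True
            if color[successor] == WHITE and visit(successor):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in graph)


# Net 1 above net 2 in the first column, net 2 above net 1 in the last column.
cyclic = vertical_constraints([1, 0, 2], [2, 0, 1])
# A simple chain of constraints with no cycle.
acyclic = vertical_constraints([1, 2, 0], [2, 3, 0])
print(has_cycle(cyclic), has_cycle(acyclic))
```

When `has_cycle` reports a cycle, a router restricted to one trunk per net cannot succeed, and a dogleg (an extra trunk) is needed, as the text explains.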
The channel-routing algorithms we have described so far do not allow interconnects on one layer to run on top of other interconnects on a different layer. These algorithms allow interconnects to cross at right angles to each other on different layers, but not to overlap . When we remove the restriction that horizontal and vertical routing must use different layers, the density of a channel is no longer the lower bound for the number of tracks required. For two routing layers the ultimate lower bound becomes half of the channel density. The practical reasoning for restricting overlap is the parasitic overlap capacitance between signal interconnects. As the dimensions of the metal interconnect are reduced, the capacitance between adjacent interconnects on the same layer ( coupling capacitance ) is comparable to the capacitance of interconnects that overlap on different layers ( overlap capacitance ). Thus, allowing a short overlap between interconnects on different layers may not be as bad as allowing two interconnects to run adjacent to each other for a long distance on the same layer. Some routers allow you to specify that two interconnects must not run adjacent to each other for more than a specified length. The channel height is fixed for channeled gate arrays; it is variable in discrete steps for channelless gate arrays; it is continuously variable for cell-based ASICs. However, for all these types of ASICs, the channel wiring is fully customized and so may be compacted or compressed after a channel router has completed the interconnect. The use of channel-routing compaction for a two-layer channel can reduce the channel height by 15 percent to 20 percent [ Cheng et al., 1992]. Modern channel routers are capable of routing a channel at or near the theoretical minimum density. We can thus consider channel routing a solved problem. 
Most of the difficulty in detailed routing now comes from the need to route more than two layers and to route arbitrary shaped regions. These problems are best handled by area routers.
17.2.6 Area-Routing Algorithms

There are many algorithms used for the detailed routing of general-shaped areas (see the paper by Ohtsuki in [ Ohtsuki, 1986]). Many of these were originally developed for PCB wiring. The first group we shall cover, and the earliest to be used historically, are the grid-expansion or maze-running algorithms. A second group of methods, which are more efficient, are the line-search algorithms.
FIGURE 17.18 The Lee maze-running algorithm. The algorithm finds a path from source (X) to target (Y) by emitting a wave from both the source and the target at the same time. Successive outward moves are marked in each bin. Once the target is reached, the path is found by backtracking (if there is a choice of bins with equal labeled values, we choose the bin that avoids changing direction). (The original form of the Lee algorithm uses a single wave.)
Figure 17.18 illustrates the Lee maze-running algorithm. The goal is to find a path from X to Y, i.e., from the start (or source) to the finish (or target), avoiding any obstacles. The algorithm is often called wave propagation because it sends out waves, which spread out like those created by dropping a stone into a pond.
Algorithms that use lines rather than waves to search for connections are more efficient than algorithms based on the Lee algorithm. Figure 17.19 illustrates the Hightower algorithm, a line-search algorithm (or line-probe algorithm):
1. Extend lines from both the source and target toward each other.
2. When an extended line, known as an escape line, meets an obstacle, choose a point on the escape line from which to project another escape line at right angles to the old one. This point is the escape point.
3. Place an escape point on the line so that the next escape line just misses the edge of the obstacle. Escape lines emanating from the source and target intersect to form the path.
FIGURE 17.19 Hightower area-routing algorithm. (a) Escape lines are constructed from source (X) and target (Y) toward each other until they hit obstacles. (b) An escape point is found on the escape line so that the next escape line perpendicular to the original misses the next obstacle. The path is complete when escape lines from source and target meet.
The Hightower algorithm is faster and requires less memory than methods based on the Lee algorithm.
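For comparison, the wave-propagation idea behind the Lee algorithm can be sketched as a breadth-first search on a grid of bins. This sketch uses a single wave from the source (the original form mentioned in the caption of Figure 17.18), not the two-wave variant drawn in the figure, and its backtracking step takes the first lower-labeled neighbor rather than preferring the bin that avoids a change of direction. The grid encoding ('#' for an obstacle) and all names are assumptions made for illustration.

```python
from collections import deque

MOVES = ((1, 0), (-1, 0), (0, 1), (0, -1))  # down, up, right, left


def lee_route(grid, source, target):
    """Return a shortest list of (row, col) bins from source to target, or None."""
    rows, cols = len(grid), len(grid[0])
    label = {source: 0}                 # wave number written into each bin
    frontier = deque([source])
    while frontier:                     # expand the wave one bin at a time
        row, col = frontier.popleft()
        if (row, col) == target:
            break
        for d_row, d_col in MOVES:
            nxt = (row + d_row, col + d_col)
            inside = 0 <= nxt[0] < rows and 0 <= nxt[1] < cols
            if inside and grid[nxt[0]][nxt[1]] != "#" and nxt not in label:
                label[nxt] = label[(row, col)] + 1
                frontier.append(nxt)
    if target not in label:
        return None                     # the wave never reached the target
    path = [target]                     # backtrack along decreasing labels
    while path[-1] != source:
        row, col = path[-1]
        for d_row, d_col in MOVES:
            previous = (row + d_row, col + d_col)
            if label.get(previous) == label[(row, col)] - 1:
                path.append(previous)
                break
    path.reverse()
    return path


# Hypothetical routing area: '#' marks bins blocked by obstacles.
grid = ["....",
        ".##.",
        "...."]
route = lee_route(grid, source=(1, 0), target=(1, 3))
print(route)
```

Every reachable bin may receive a label before the target is found, which is the memory and run-time cost that line-search methods such as the Hightower algorithm try to avoid.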
17.2.7 Multilevel Routing

Using two-layer routing, if the logic cells do not contain any m2, it is possible to complete some routing in m2 using over-the-cell (OTC) routing. Sometimes poly is used for short connections in the channel in a two-level metal technology; this is known as 2.5-layer routing. Using a third level of metal in three-layer routing, there is a choice of approaches. Reserved-layer routing restricts all the interconnect on each layer to flow in one direction in a given routing area (for example, in a channel, either parallel or perpendicular to the channel spine). Unreserved-layer routing moves in both horizontal and vertical directions on a given layer. Most routers use reserved routing. Reserved three-level metal routing offers another choice: either use m1 and m3 for horizontal routing (parallel to the channel spine), with m2 for vertical routing (HVH routing), or use VHV routing. Since the logic cell interconnect usually blocks most of the area on the m1 layer, HVH routing is normally used. It is also important to consider the pitch of the layers when routing in the same direction on two different layers. Using HVH routing it is preferable for the m3 pitch to be a simple multiple of the m1 pitch (ideally they are the same). Some processes have more than three levels of metal. Sometimes the upper one or two metal layers have a coarser pitch than the lower layers and are used in multilevel routing for power and clock lines rather than for signal interconnect.
Figure 17.20 shows an example of three-layer channel routing. The logic cells are 64 λ high, the m1 routing pitch is 8 λ, and the m2 and m3 routing pitch is 16 λ. The channel in Figure 17.20 is the same as the channel using two-layer metal shown in Figure 17.13, but using three-level metal reduces the channel height from 40 λ (= 5 × 8 λ) to 16 λ. Submicron processes try to use the same metal pitch on all metal layers. This makes routing easier but processing more difficult.
FIGURE 17.20 Three-level channel routing. In this diagram the m2 and m3 routing pitch is set to twice the m1 routing pitch. Routing density can be increased further if all the routing pitches can be made equal, which is a difficult process challenge.
With three or more levels of metal routing it is possible to reduce the channel height in a row-based ASIC to zero. All of the interconnect is then completed over the cell. If all of the channels are eliminated, the core area (logic cells plus routing) is determined solely by the logic-cell area. The point at which this happens depends on not only the number of metal layers and channel density, but also the routing resources (the blockages and feedthroughs) in the logic cell. This is the cell porosity. Designing porous cells that help to minimize routing area is an art. For example, it is quite common to be able to produce a smaller chip using larger logic cells if the larger cells have more routing resources.
17.2.8 Timing-Driven Detailed Routing

In detailed routing the global router has already set the path the interconnect will follow. At this point little can be done to improve timing except to reduce the number of vias, alter the interconnect width to optimize delay, and minimize overlap capacitance. The gains here are relatively small, but for very long branching nets even small gains may be important. For high-frequency clock nets it may be important to shape and chamfer (round) the interconnect to match impedances at branches and control reflections at corners.
17.2.9 Final Routing Steps

If the algorithms to estimate congestion in the floorplanning tool perfectly reflected the algorithms used by the global router and detailed router, routing completion would be guaranteed. Often, however, the detailed router will not be able to completely route all the nets. These problematical nets are known as unroutes. Routers handle this situation in one of two ways. The first method leaves the problematical nets unconnected. The second method completes all interconnects anyway but with some design-rule violations (the problematical nets may be shorted to other nets, for example). Some tools flag these problems as a warning (in fact there can be no more serious error). If there are many unroutes the designer needs to discover the reason and return to the floorplanner and change channel sizes (for a cell-based ASIC) or increase the base-array size (for a gate array). Returning to the global router and changing bin sizes or adjusting the algorithms may also help. In drastic cases it may be necessary to change the floorplan. If just a handful of difficult nets remain to be routed, some tools allow the designer to perform hand edits using a rip-up and reroute router (sometimes this is done automatically by the detailed router as a last phase in the routing procedure anyway). This capability also permits engineering change orders (ECO), corresponding to the little yellow wires on a PCB. One of the last steps in routing is via removal: the detailed router looks to see if it can eliminate any vias (which can contribute a significant amount to the interconnect resistance) by changing layers or making other modifications to the completed routing. Routing compaction can then be performed as the final step.
17.3 Special Routing

The routing of nets that require special attention, clock and power nets for example, is normally done before detailed routing of signal nets. The architecture and structure of these nets are planned as part of floorplanning, but the sizing and topology of these nets are finalized as part of the routing step.
17.3.1 Clock Routing

Gate arrays normally use a clock spine (a regular grid), eliminating the need for special routing (see Section 16.1.6, Clock Planning). The clock distribution grid is designed at the same time as the gate-array base to ensure a minimum clock skew and minimum clock latency, given power dissipation and clock buffer area limitations. Cell-based ASICs may use either a clock spine, a clock tree, or a hybrid approach. Figure 17.21 shows how a clock router may minimize clock skew in a clock spine by making the path lengths, and thus net delays, to every leaf node equal, using jogs in the interconnect paths if necessary. More sophisticated clock routers perform clock-tree synthesis (automatically choosing the depth and structure of the clock tree) and clock-buffer insertion (equalizing the delay to the leaf nodes by balancing interconnect delays and buffer delays).
FIGURE 17.21 Clock routing. (a) A clock network for the cell-based ASIC from Figure 16.11. (b) Equalizing the interconnect segments between CLK and all destinations (by including jogs if necessary) minimizes clock skew.
The clock tree may contain multiply-driven nodes (more than one active element driving a net). The net delay models that we have used break down in this case and we may have to extract the clock network and perform circuit simulation, followed by back-annotation of the clock delays to the netlist (for circuit extraction, see Section 17.4) and the bus currents to the clock router. The sizes of the clock buses depend on the current they must carry. The limits are set by reliability issues to be discussed next.
Clock skew induced by hot-electron wearout was mentioned in Section 16.1.6, Clock Planning. Another factor contributing to unpredictable clock skew is changes in clock-buffer delays with variations in power-supply voltage due to data-dependent activity. This activity-induced clock skew can easily be larger than the skew achievable using a clock router. For example, there is little point in using software capable of reducing clock skew to less than 100 ps if, due to fluctuations in power-supply voltage when part of the chip becomes active, the clock-network delays change by 200 ps.
The power buses supplying the buffers driving the clock spine carry direct current (unidirectional current or DC), but the clock spine itself carries alternating current (bidirectional current or AC). The difference between electromigration failure rates due to AC and DC leads to different rules for sizing clock buses. As we explained in Section 16.1.6, Clock Planning, the fastest way to drive a large load in CMOS is to taper successive stages by a factor of approximately e ≈ 3. This is not necessarily the smallest-area or lowest-power approach, however [ Veendrick, 1984].
17.3.2 Power Routing

Each of the power buses has to be sized according to the current it will carry. Too much current in a power bus can lead to a failure through a mechanism known as electromigration [Young and Christou, 1994]. The required power-bus widths can be estimated automatically from library information, from a separate power simulation tool, or by entering the power-bus widths to the routing software by hand. Many routers use a default power-bus width so that it is quite easy to complete routing of an ASIC without even knowing about this problem. For a direct current (DC) the mean time to failure (MTTF) due to electromigration is experimentally found to obey the following equation:
$$\mathrm{MTTF} = A\,J^{-2}\exp\left(\frac{E}{kT}\right), \qquad (17.9)$$
where $J$ is the current density; $E$ is approximately 0.5 eV; $k$, Boltzmann's constant, is $8.62 \times 10^{-5}$ eV K$^{-1}$; and $T$ is absolute temperature in kelvins. There are a number of different approaches to model the effect of an AC component. A typical expression is
$$\mathrm{MTTF} = \frac{A\,J^{-2}\exp\left(E/kT\right)}{\bar{J}\,|\bar{J}| + k_{AC/DC}\,\overline{|J|}^{\,2}}, \qquad (17.10)$$
where $\bar{J}$ is the average of $J(t)$, and $\overline{|J|}$ is the average of $|J|$. The constant $k_{AC/DC}$ relates the relative effects of AC and DC and is typically between 0.01 and 0.0001. Electromigration problems become serious with a MTTF of less than $10^{5}$ hours (approximately 10 years) for current densities (DC) greater than 0.5 GA m$^{-2}$ at temperatures above 150 °C.
Table 17.1 lists example metallization reliability rules (limits for the current you can pass through a metal layer, contact, or via) for the typical 0.5 μm three-level metal CMOS process, G5. The limit of 1 mA of current per square micron of metal cross section is a good rule-of-thumb to follow for current density in aluminum-based interconnect. Some CMOS processes also have maximum metal-width rules (or fat-metal rules). This is because stress (especially at the corners of the die, which occurs during die attach, that is, mounting the die on the chip carrier) can cause large metal areas to lift. A solution to this problem is to place slots in the wide metal lines. These rules are dependent on the ASIC vendor's level of experience.
To determine the power-bus widths we need to determine the bus currents. The largest problem is emulating the system's operating conditions. Input vectors to test the system are not necessarily representative of actual system operation. Clock-bus sizing depends strongly on the parameter $k_{AC/DC}$ in Eq. 17.10, since the clock spine carries alternating current. (For the sources of power dissipation in CMOS, see Section 15.5, Power Dissipation.)
Gate arrays normally use a regular power grid as part of the gate-array base. The gate-array logic cells contain two fixed-width power buses inside the cell, running horizontally on m1. The horizontal m1 power buses are then strapped in a vertical direction by m2 buses, which run vertically across the chip. The resistance of the power grid is extracted and simulated with SPICE during the base-array design to model the effects of IR drops under worst-case conditions.
TABLE 17.1 Metallization reliability rules for a typical 0.5 micron (λ = 0.25 μm) CMOS process.
Layer/contact/via                       Current limit (note 1)   Metal thickness (note 2)   Resistance (note 3)
m1                                      1 mA μm⁻¹                7000 Å                     95 mΩ/square
m2                                      1 mA μm⁻¹                7000 Å                     95 mΩ/square
m3                                      2 mA μm⁻¹                12,000 Å                   48 mΩ/square
0.8 μm square m1 contact to diffusion   0.7 mA                   -                          11 Ω
0.8 μm square m1 contact to poly        0.7 mA                   -                          16 Ω
0.8 μm square m1/m2 via (via1)          0.7 mA                   -                          3.6 Ω
0.8 μm square m2/m3 via (via2)          0.7 mA                   -                          3.6 Ω
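The reliability limits can be turned into a rough sizing calculation. The sketch below is illustrative only. It assumes the per-micron current limits and the 0.7 mA per-via limit quoted in Table 17.1, takes Eq. 17.9 in the Black's-equation form MTTF = A J^-2 exp(E/kT) with E = 0.5 eV, and compares temperatures at a fixed current density. The function names and the 10 mA example current are hypothetical.

```python
import math

BOLTZMANN_EV_PER_K = 8.62e-5    # value quoted in the text
ACTIVATION_ENERGY_EV = 0.5      # approximate value quoted in the text


def min_bus_width_um(current_ma, limit_ma_per_um):
    """Narrowest bus (in microns) that keeps current within the layer limit."""
    return current_ma / limit_ma_per_um


def min_via_count(current_ma, limit_ma_per_via=0.7):
    """Fewest vias or contacts needed to carry the current."""
    return math.ceil(current_ma / limit_ma_per_via)


def mttf_ratio(temp_c_ref, temp_c_new, energy_ev=ACTIVATION_ENERGY_EV):
    """MTTF(new) / MTTF(ref) at the same current density, from Eq. 17.9."""
    t_ref = temp_c_ref + 273.15     # convert to kelvins
    t_new = temp_c_new + 273.15
    return math.exp(energy_ev / BOLTZMANN_EV_PER_K * (1.0 / t_new - 1.0 / t_ref))


# Hypothetical example: a bus that must carry 10 mA of DC.
print(min_bus_width_um(10, 1.0))   # width on m1 or m2 (1 mA per micron)
print(min_bus_width_um(10, 2.0))   # width on m3 (2 mA per micron)
print(min_via_count(10))           # vias needed where the bus changes layer
print(mttf_ratio(125, 85))         # lifetime gain from running cooler
```

A real flow would take the bus currents from power simulation and apply the vendor's own derating rules, which this sketch does not attempt to reproduce.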
Standard cells are constructed in a similar fashion to gate-array cells, with power buses running horizontally in m1 at the top and bottom of each cell. A row of standard cells uses end-cap cells that connect to the VDD and VSS power buses placed by the power router. Power routing of cell-based ASICs may include the option to add vertical m2 straps at specified intervals. Alternatively the number of standard cells that can be placed in a row may be limited during placement. The power router forms an interdigitated comb structure, minimizing the number of times a VDD or VSS power bus needs to change layers. This is achieved by routing with a routing bias on preferred layers. For example, VDD may be routed with a left-and-down bias on m1, with VSS routed using right-and-up bias on m2.
Three-level metal processes either use a m3 with a thickness and pitch that is comparable to m1 and m2 (which usually have approximately the same thickness and pitch) or they use metal that is much thicker (up to twice as thick as m1 and m2) with a coarser pitch (up to twice as wide as m1 and m2). The factor that determines the m3/4/5 properties is normally the sophistication of the fabrication process.
In a three-level metal process, power routing is similar to two-level metal ASICs. Power buses inside the logic cells are still normally run on m1. Using HVH routing it would be possible to run the power buses on m3 and drop vias all the way down to m1 when power is required in the cells. The problem with this approach is that it creates pillars of blockage across all three layers.
Using three or more layers of metal for routing, it is possible to eliminate some of the channels completely. In these cases we complete all the routing in m2 and m3 on top of the logic cells using connectors placed in the center of the cells on m1. If we can eliminate the channels between cell rows, we can flip rows about a horizontal axis and abut adjacent rows together (a technique known as flip and abut). If the power buses are at the top (VDD) and bottom (VSS) of the cells in m1 we can abut or overlap the power buses (joining VDD to VDD and VSS to VSS in alternate rows).
Power distribution schemes are also a function of process and packaging technology. Recall that flip-chip technology allows pads to be placed anywhere on a chip (see Section 16.1.5, I/O and Power Planning, especially Figure 16.13d). Four-level metal and aggressive stacked-via rules allow I/O pad circuits to be placed in the core. The problems with this approach include placing the ESD and latch-up protection circuits required in the I/O pads (normally kept widely separated from core logic) adjacent to the logic cells in the core.
Notes to Table 17.1:
1. At 125 °C for unidirectional current. Limits for 110 °C are 1.5 times higher. Limits for 85 °C are 3 times higher. Current limits for bidirectional current are 1.5 times higher than the unidirectional limits.
2. 10,000 Å (ten thousand angstroms) = 1 μm.
3. Worst case at 110 °C.
17.4 Circuit Extraction and DRC

After detailed routing is complete, the exact length and position of each interconnect for every net is known. Now the parasitic capacitance and resistance associated with each interconnect, via, and contact can be calculated. This data is generated by a circuit-extraction tool in one of the formats described next. It is important to extract the parasitic values that will be on the silicon wafer. The mask data or CIF widths and dimensions that are drawn in the logic cells are not necessarily the same as the final silicon dimensions. Normally mask dimensions are altered from drawn values to allow for process bias or other effects that occur during the transfer of the pattern from mask to silicon. Since this is a problem that is dealt with by the ASIC vendor and not the design software vendor, ASIC designers normally have to ask very carefully about the details of this problem. Table 17.2 shows values for the parasitic capacitances for a typical 1 μm CMOS process. Notice that the fringing capacitance is greater than the parallel-plate (area) capacitance for all layers except poly. Next, we shall describe how the parasitic information is passed between tools.
17.4.1 SPF, RSPF, and DSPF

The standard parasitic format (SPF) (developed by Cadence [ 1990], now in the hands of OVI) describes interconnect delay and loading due to parasitic resistance and capacitance. There are three different forms of SPF: two of them (regular SPF and reduced SPF) contain the same information, but in different formats, and model the behavior of interconnect; the third form of SPF (detailed SPF) describes the actual parasitic resistance and capacitance components of a net. Figure 17.22 shows the different types of simplified models that regular and reduced SPF support. The load at the output of gate A is represented by one of three models: lumped-C, lumped-RC, or PI segment. The pin-to-pin delays are modeled by RC delays. You can represent the pin-to-pin interconnect delay by an ideal voltage source, V(A_1) in this case, driving an RC network attached to each input pin. The actual pin-to-pin delays may not be calculated this way, however.
TABLE 17.2 Parasitic capacitances for a typical 1 μm (λ = 0.5 μm) three-level metal CMOS process (note 1).
Element                                  Area / fF μm⁻²   Fringing / fF μm⁻¹
poly (over gate oxide) to substrate      1.73             NA (note 2)
poly (over field oxide) to substrate     0.058            0.043
m1 to diffusion or poly                  0.055            0.049
m1 to substrate                          0.031            0.044
m2 to diffusion                          0.019            0.038
m2 to substrate                          0.015            0.035
m2 to poly                               0.022            0.040
m2 to m1                                 0.035            0.046
m3 to diffusion                          0.011            0.034
m3 to substrate                          0.010            0.033
m3 to poly                               0.012            0.034
m3 to m1                                 0.016            0.039
m3 to m2                                 0.035            0.049
n+ junction (at 0 V bias)                0.36             NA
p+ junction (at 0 V bias)                0.46             NA
FIGURE 17.22 The regular and reduced standard parasitic format (SPF) models for interconnect. (a) An example of an interconnect network with fanout. The driving-point admittance of the interconnect network is Y(s). (b) The SPF model of the interconnect. (c) The lumped-capacitance interconnect model. (d) The lumped-RC interconnect model. (e) The PI segment interconnect model (notice the capacitor nearest the output node is labeled C2 rather than C1). The values of C, R, C1, and C2 are calculated so that Y1(s), Y2(s), and Y3(s) are the first-, second-, and third-order Taylor-series approximations to Y(s).

The key features of regular and reduced SPF are as follows:

• The loading effect of a net as seen by the driving gate is represented by choosing one of three different RC networks: lumped-C, lumped-RC, or PI segment (selected when generating the SPF) [O'Brien and Savarino, 1989].
• The pin-to-pin delays of each path in the net are modeled by a simple RC delay (one for each path). This can be the Elmore constant for each path (see Section 17.1.2), but it need not be.
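The caption of Figure 17.22 says that the PI segment values are chosen to match the leading Taylor-series terms of Y(s). The following Python sketch shows one way to see this. The relations in it are not listed in the text; they come from expanding the admittance of a PI circuit (a capacitor at the driver, a series resistor, and a second capacitor) as Y(s) = y1 s + y2 s² + y3 s³ + ..., and the function names are our own. The numbers are those of the CLOCK net example used in this section, with `c_near` playing the role of C2 and `c_far` the role of C1.

```python
def pi_to_moments(c_near, r, c_far):
    """Leading Taylor coefficients (y1, y2, y3) of the PI-segment admittance.

    Y(s) = s*c_near + s*c_far / (1 + s*r*c_far), expanded about s = 0.
    """
    y1 = c_near + c_far
    y2 = -r * c_far ** 2
    y3 = r ** 2 * c_far ** 3
    return y1, y2, y3


def moments_to_pi(y1, y2, y3):
    """Invert the relations above to recover (c_near, r, c_far)."""
    c_far = y2 ** 2 / y3
    c_near = y1 - c_far
    r = -(y3 ** 2) / y2 ** 3
    return c_near, r, c_far


# CLOCK net: C2 = 1.17 pF (nearest the output), RPI = 8.85 ohms, C1 = 2.49 pF.
y1, y2, y3 = pi_to_moments(1.17, 8.85, 2.49)

# A lumped-C model keeps only the first coefficient: the total capacitance.
lumped_c = y1

c_near, r, c_far = moments_to_pi(y1, y2, y3)
print(f"lumped C = {lumped_c:.2f} pF")
print(f"PI segment: C2 = {c_near:.2f} pF, RPI = {r:.2f} ohm, C1 = {c_far:.2f} pF")
```

The first coefficient alone gives the total net capacitance (3.66 pF here), which is why the lumped-C model is the crudest of the three; the PI segment needs all three coefficients.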
Here is an example regular SPF file for just one net that uses the PI segment model shown in Figure 17.22(e):

    #Design Name : EXAMPLE1
    #Date : 6 August 1995
    #Time : 12:00:00
    #Resistance Units : 1 ohms
    #Capacitance Units : 1 pico farads
    #Syntax :
    #N <netName>
    #C <capVal>
    # F <fromCompName> <fromPinName>
    # GC <conductance>
    # |
    # REQ <res>
    # GRC <conductance>
    # T <toCompName> <toPinName> RC <rcConstant> A <value>
    # |
    # RPI <res>
    # C1 <cap>
    # C2 <cap>
    # GPI <conductance>
    # T <toCompName> <toPinName> RC <rcConstant> A <value>
    # TIMING.ADMITTANCE.MODEL = PI
    # TIMING.CAPACITANCE.MODEL = PP
    N CLOCK
    C 3.66
    F ROOT Z
    RPI 8.85
    C1 2.49
    C2 1.17
    GPI = 0.0
    T DF1 G RC 22.20
    T DF2 G RC 13.05

This file describes the following:
• The preamble contains the file format.
• This representation uses the PI segment model (Figure 17.22e).
• This net uses pin-to-pin timing.
• The driving gate of this net is ROOT and the output pin name is Z.
• The PI segment elements have values: C1 = 2.49 pF, C2 = 1.17 pF, RPI = 8.85 Ω. Notice the order of C1 and C2 in Figure 17.22(e). The element GPI is not normally used in SPF files.
• The delay from output pin Z of ROOT to input pin G of DF1 is 22.20 ps (with the units declared in the preamble, an RC constant is in Ω × pF, that is, picoseconds).
• The delay from pin Z of ROOT to pin G of DF2 is 13.05 ps.
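As a rough illustration of how these statements fit together, here is a minimal Python sketch that reads the net statements of the example above. It is not a complete SPF reader: it handles only the PI-segment statements that appear in this one example, and the function name `parse_spf_net` and the dictionary layout are our own choices. It also checks that the total capacitance on the C line equals C1 + C2, as it does in this file.

```python
SPF_NET = """\
N CLOCK
C 3.66
F ROOT Z
RPI 8.85
C1 2.49
C2 1.17
GPI = 0.0
T DF1 G RC 22.20
T DF2 G RC 13.05
"""


def parse_spf_net(text):
    """Parse the statements of one regular-SPF net (PI model only)."""
    net = {"loads": []}
    for line in text.splitlines():
        # 'GPI = 0.0' uses an equals sign; treat it as whitespace.
        fields = line.replace("=", " ").split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and preamble comments
        key, args = fields[0], fields[1:]
        if key == "N":
            net["name"] = args[0]
        elif key == "F":
            net["driver"] = (args[0], args[1])  # (component, pin)
        elif key == "T":
            # T <toCompName> <toPinName> RC <rcConstant>
            net["loads"].append((args[0], args[1], float(args[3])))
        elif key in ("C", "RPI", "C1", "C2", "GPI"):
            net[key] = float(args[0])
    return net


net = parse_spf_net(SPF_NET)
print(net["name"], "driven by", net["driver"])
print("C total matches C1 + C2:",
      abs(net["C"] - (net["C1"] + net["C2"])) < 1e-9)
for comp, pin, rc in net["loads"]:
    print(f"  load {comp}.{pin}: RC = {rc}")
```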
The reduced SPF (RSPF) contains the same information as regular SPF, but uses the SPICE format. Here is an example RSPF file that corresponds to the previous regular SPF example:

    * Design Name : EXAMPLE1
    * Date : 6 August 1995
    * Time : 12:00:00
    * Resistance Units : 1 ohms
    * Capacitance Units : 1 pico farads
    *| RSPF 1.0
    *| DELIMITER "_"
    .SUBCKT EXAMPLE1 OUT IN
    *| GROUND_NET VSS
    * TIMING.CAPACITANCE.MODEL = PP
    *|NET CLOCK 3.66PF
    *|DRIVER ROOT_Z ROOT Z
    *|S (ROOT_Z_OUTP1 0.0 0.0)
    R2 ROOT_Z ROOT_Z_OUTP1 8.85
    C1 ROOT_Z_OUTP1 VSS 2.49PF
    C2 ROOT_Z VSS 1.17PF
    *|LOAD DF1_G DF1 G
    *|S (DF1_G_INP1 0.0 0.0)
    E1 DF1_G_INP1 VSS ROOT_Z VSS 1.0
    R3 DF1_G_INP1 DF1_G 22.20
    C3 DF1_G VSS 1.0PF
    *|LOAD DF2_G DF2 G
    *|S (DF2_G_INP1 0.0 0.0)
    E2 DF2_G_INP1 VSS ROOT_Z VSS 1.0
    R4 DF2_G_INP1 DF2_G 13.05
    C4 DF2_G VSS 1.0PF
    *Instance Section
    XDF1 DF1_Q DF1_QN DF1_D DF1_G DF1_CD DF1_VDD DF1_VSS DFF3
    XDF2 DF2_Q DF2_QN DF2_D DF2_G DF2_CD DF2_VDD DF2_VSS DFF3
    XROOT ROOT_Z ROOT_A ROOT_VDD ROOT_VSS BUF
    .ENDS
    .END

This file has the following features:
• The PI segment elements (C1, C2, and R2) have the same values as the previous example.
• The pin-to-pin delays are modeled at each of the gate inputs with a capacitor of value 1 pF (C3 and C4 here) and a resistor (R3 and R4) adjusted to give the correct RC delay. Since the load on the output gate is modeled by the PI segment it does not matter what value of capacitance is chosen here.
• The RC elements at the gate inputs are driven by ideal voltage sources (E1 and E2) that are equal to the voltage at the output of the driving gate.
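The remark that the capacitor value at the gate inputs is arbitrary can be made concrete with a short Python sketch. It assumes only that the resistor is chosen so that R × C equals the RC constant from the regular SPF file, in the file's own units (ohms × picofarads); the helper name `rspf_resistor` is hypothetical.

```python
def rspf_resistor(rc_constant, cap_pf=1.0):
    """Resistance (ohms) such that R * cap_pf equals the given RC constant.

    rc_constant is in the file's own units: ohms times picofarads.
    """
    if cap_pf <= 0:
        raise ValueError("capacitance must be positive")
    return rc_constant / cap_pf


# RC constants for the two loads of net CLOCK in the example.
for pin, rc in (("DF1 G", 22.20), ("DF2 G", 13.05)):
    r_1pf = rspf_resistor(rc)          # the choice made in the RSPF file
    r_2pf = rspf_resistor(rc, 2.0)     # a different, equally valid choice
    print(f"{pin}: R = {r_1pf:.2f} ohm with 1 pF, "
          f"or {r_2pf:.3f} ohm with 2 pF (same RC = {rc})")
```

With the 1 pF capacitors used in the file, the resistor values are numerically equal to the RC constants, which is why R3 and R4 are 22.20 and 13.05.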
The detailed SPF (DSPF) shows the resistance and capacitance of each segment in a net, again in a SPICE format. There are no models or assumptions on calculating the net delays in this format. Here is an example DSPF file that describes the interconnect shown in Figure 17.23(a):

    .SUBCKT BUFFER OUT IN
    * Net Section
    *|GROUND_NET VSS
    *|NET IN 3.8E-01PF
    *|P (IN I 0.0 0.0 5.0)
    *|I (INV1:A INV1 A I 0.0 10.0 5.0)
    C1 IN VSS 1.1E-01PF
    C2 INV1:A VSS 2.7E-01PF
    R1 IN INV1:A 1.7E00
    *|NET OUT 1.54E-01PF
    *|S (OUT:1 30.0 10.0)
    *|P (OUT O 0.0 30.0 0.0)
    *|I (INV1:OUT INV1 OUT O 0.0 20.0 10.0)
    C3 INV1:OUT VSS 1.4E-01PF
    C4 OUT:1 VSS 6.3E-03PF
    C5 OUT VSS 7.7E-03PF
    R2 INV1:OUT OUT:1 3.11E00
    R3 OUT:1 OUT 3.03E00
    *Instance Section
    XINV1 INV1:A INV1:OUT INV
    .ENDS

The nonstandard SPICE statements in DSPF are comments that start with '*|' and have the following formats:

    *|I(InstancePinName InstanceName PinName PinType PinCap X Y)
    *|P(PinName PinType PinCap X Y)
    *|NET NetName NetCap
    *|S(SubNodeName X Y)
    *|GROUND_NET NetName

Figure 17.23(b) illustrates the meanings of the DSPF terms: InstancePinName, InstanceName, PinName, NetName, and SubNodeName. The PinType is I (for IN) or O (the letter 'O', not zero, for OUT). The NetCap is the total capacitance on each net. Thus for net IN, the net capacitance is 0.38 pF = C1 + C2 = 0.11 pF + 0.27 pF. This particular file does not use the pin capacitances, PinCap. Since the DSPF represents every interconnect segment, DSPF files can be very large in size (hundreds of megabytes).
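The NetCap bookkeeping can be checked mechanically. The Python sketch below is only an illustration under simple assumptions: it looks at the `*|NET` lines and the capacitor lines of the example, assumes every value carries a `PF` suffix as it does here, and uses a function name (`net_capacitances`) of our own invention. For each net it compares the declared NetCap with the sum of the capacitors listed under that net.

```python
DSPF_ELEMENTS = """\
*|NET IN 3.8E-01PF
C1 IN VSS 1.1E-01PF
C2 INV1:A VSS 2.7E-01PF
R1 IN INV1:A 1.7E00
*|NET OUT 1.54E-01PF
C3 INV1:OUT VSS 1.4E-01PF
C4 OUT:1 VSS 6.3E-03PF
C5 OUT VSS 7.7E-03PF
R2 INV1:OUT OUT:1 3.11E00
R3 OUT:1 OUT 3.03E00
"""


def net_capacitances(text):
    """Return {net: (declared NetCap, sum of capacitor elements)} in pF."""
    totals = {}
    current = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "*|NET":
            current = fields[1]
            totals[current] = [float(fields[2].rstrip("PF")), 0.0]
        elif fields[0].startswith("C") and current is not None:
            # C<name> <node> <node> <value>PF
            totals[current][1] += float(fields[3].rstrip("PF"))
    return {net: tuple(values) for net, values in totals.items()}


for net, (declared, summed) in net_capacitances(DSPF_ELEMENTS).items():
    print(f"net {net}: NetCap = {declared:.3f} pF, sum of C = {summed:.3f} pF")
```

For net OUT the same check gives C3 + C4 + C5 = 0.14 pF + 0.0063 pF + 0.0077 pF = 0.154 pF, matching the declared 1.54E-01PF.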
FIGURE 17.23 The detailed standard parasitic format (DSPF) for interconnect representation. (a) An example network with two m2 paths connected to a logic cell, INV1. The grid shows the coordinates. (b) The equivalent DSPF circuit corresponding to the DSPF file in the text.
17.4.2 Design Checks

ASIC designers perform two major checks before fabrication. The first check is a design-rule check (DRC) to ensure that nothing has gone wrong in the process of assembling the logic cells and routing. The DRC may be performed at two levels. Since the detailed router normally works with logic-cell phantoms, the first level of DRC is a phantom-level DRC, which checks for shorts, spacing violations, or other design-rule problems between logic cells. This is principally a check of the detailed router. If we have access to the real library-cell layouts (sometimes called hard layout), we can instantiate the phantom cells and perform a second-level DRC at the transistor level. This is principally a check of the correctness of the library cells. Normally the ASIC vendor will perform this check using its own software as a type of incoming inspection. The Cadence Dracula software is one de facto standard in this area, and you will often hear reference to a Dracula deck that consists of the Dracula code describing an ASIC vendor's design rules. Sometimes ASIC vendors will give their Dracula decks to customers so that the customers can perform the DRCs themselves.

The other check is a layout versus schematic (LVS) check to ensure that what is about to be committed to silicon is what is really wanted. An electrical schematic is extracted from the physical layout and compared to the netlist. This closes a loop between the logical and physical design processes and ensures that both are the same. The LVS check is not as straightforward as it may sound, however.

The first problem with an LVS check is that the transistor-level netlist for a large ASIC forms an enormous graph. LVS software essentially has to match this graph against a reference graph that describes the design. Ensuring that every node in the extracted netlist corresponds exactly to an element in the schematic (or HDL code) is a very difficult task. The first step is normally to match certain key nodes (such as the power supplies, inputs, and outputs), but the process can very quickly become bogged down in the thousands of mismatch errors that are inevitably generated initially.

The second problem with an LVS check is creating a true reference. The starting point may be HDL code or a schematic. However, logic synthesis, test insertion, clock-tree synthesis, logical-to-physical pad mapping, and several other design steps each modify the netlist. The reference netlist may not be what we wish to fabricate. In this case designers increasingly resort to formal verification that extracts a Boolean description of the function of the layout and compares that to a known good HDL description.
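To give a feel for the graph-matching step of LVS described above, here is a small Python sketch. It is not how any particular LVS tool works; it is a simple label-refinement scheme of our own, under these assumptions: each device is a (type, pins) pair, pins are treated as ordered (a real tool would allow a transistor's source and drain to swap), and the key nets (supplies, inputs, outputs) are matched by name first, as the text suggests. Equal signatures are a necessary condition for the two netlists to match, not a proof that they do.

```python
from collections import Counter


def device_signatures(devices, key_nets, rounds=3):
    """Count devices by a label built from their type and surroundings.

    devices: list of (device_type, (net_on_pin0, net_on_pin1, ...)).
    key_nets: net names that must match exactly in both netlists.
    """
    nets = {net for _, pins in devices for net in pins}
    # Key nets keep their names; all other nets start out indistinguishable.
    label = {net: (net if net in key_nets else "?") for net in nets}

    def device_label(kind, pins):
        return f"{kind}({','.join(label[net] for net in pins)})"

    for _ in range(rounds):
        attached = {net: [] for net in nets}
        for kind, pins in devices:
            dev = device_label(kind, pins)
            for pin_index, net in enumerate(pins):
                attached[net].append(f"{pin_index}:{dev}")
        # Relabel each internal net by the devices and pins that touch it.
        for net in nets:
            if net not in key_nets:
                label[net] = "[" + ";".join(sorted(attached[net])) + "]"
    return Counter(device_label(kind, pins) for kind, pins in devices)


KEY_NETS = {"VDD", "VSS", "A", "B", "OUT"}

# Two-input NAND, pins ordered (drain, gate, source). Hypothetical netlists.
schematic = [("pmos", ("OUT", "A", "VDD")), ("pmos", ("OUT", "B", "VDD")),
             ("nmos", ("OUT", "A", "mid")), ("nmos", ("mid", "B", "VSS"))]
# Extracted layout: different device order and internal net name.
layout = [("nmos", ("n17", "B", "VSS")), ("pmos", ("OUT", "B", "VDD")),
          ("nmos", ("OUT", "A", "n17")), ("pmos", ("OUT", "A", "VDD"))]
# Same layout with one gate wired to the wrong input.
bad_layout = [("nmos", ("n17", "A", "VSS"))] + layout[1:]

reference = device_signatures(schematic, KEY_NETS)
print("layout matches:", device_signatures(layout, KEY_NETS) == reference)
print("bad layout matches:", device_signatures(bad_layout, KEY_NETS) == reference)
```

Even in this toy form the sketch shows why matching the key nodes first helps: without named anchors every net starts with the same label, and refinement has much less to work with.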
17.4.3 Mask Preparation

Final preparation for the ASIC artwork includes the addition of a maskwork symbol (M inside a circle), copyright symbol (C inside a circle), and company logos on each mask layer. A bonding editor creates a bonding diagram that will show the connection of pads to the lead carrier as well as checking that there are no design-rule violations (bond wires that are too close to each other or that leave the chip at extreme angles). We also add the kerf (which contains alignment marks, mask identification, and other artifacts required in fabrication), the scribe lines (the area where the die will be separated from each other by a diamond saw), and any special hermetic edge-seal structures (usually metal).

The final output of the design process is normally a magnetic tape written in Caltech Intermediate Format (CIF, a public domain text format) or GDSII Stream (formerly also called Calma Stream, now Cadence Stream), which is a proprietary binary format. The tape is processed by the ASIC vendor or foundry (the fab) before being transferred to the mask shop. If the layout contains drawn n-diffusion and p-diffusion regions, then the fab generates the active (thin-oxide), p-type implant, and n-type implant layers. The fab then runs another polygon-level DRC to check polygon spacing and overlap for all mask levels. A grace value (typically 0.01 µm) is included to prevent false errors stemming from rounding problems and so on. The fab will then adjust the mask dimensions for fabrication by bloating (expanding), shrinking, or merging shapes in a procedure called sizing or mask tooling. The exact procedures are described in a tooling specification. A mask bias is an amount added to a drawn polygon to allow for a difference between the mask size and the feature as it will eventually appear in silicon. The most common adjustment is to the active mask to allow for the bird's beak effect, which causes an active area to be several tenths of a micron smaller on silicon than on the mask.

The mask shop will use e-beam mask equipment to generate metal (usually chromium) on glass masks or reticles. The e-beam spot size determines the resolution of the mask-making equipment and is usually 0.05 µm or 0.025 µm (the smaller the spot size, the more expensive the mask). The spot size is significant when we break the integer-lambda scaling rules in a deep-submicron process. For example, for a 0.35 µm process (λ = 0.175 µm), a 1.5λ separation is 0.525 µm, which requires more expensive mask-making equipment with a 0.025 µm spot size.
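One way to read the spot-size remark is that every drawn dimension must be a whole number of e-beam spots. Under that assumption (ours, not a statement of how any mask shop's equipment actually works), the 0.525 µm separation quoted above can be checked against the two spot sizes with a few lines of Python. Dimensions are held as integer nanometers to avoid floating-point rounding.

```python
def fits_grid(dimension_nm, spot_nm):
    """True if the dimension is a whole number of e-beam spots."""
    return dimension_nm % spot_nm == 0


separation_nm = 525            # the 0.525 um separation from the text
for spot_nm in (50, 25):       # 0.05 um and 0.025 um spot sizes
    verdict = "fits" if fits_grid(separation_nm, spot_nm) else "does not fit"
    print(f"{separation_nm} nm {verdict} a {spot_nm} nm spot grid")
```

The separation is 10.5 spots at 0.05 µm but exactly 21 spots at 0.025 µm, so only the finer (and more expensive) equipment can write it without rounding.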
For critical layers (usually the polysilicon mask) the mask shop may use optical proximity correction (OPC), which adjusts the position of the mask edges to allow for light diffraction and reflection (the deep-UV light used for printing mask images on the wafer has a wavelength comparable to the minimum feature sizes).

1. Fringing capacitances are per isolated line. Closely spaced lines will have reduced fringing capacitance and increased interline capacitance, with increased total capacitance.
2. NA = not applicable.
17.5 Summary
The completion of routing finishes the ASIC physical design process. Routing is a complicated problem best divided into two steps: global and detailed routing. Global routing plans the wiring by finding the channels to be used for each path. There are differences between global routing for different types of ASICs, but the algorithms to find the shortest path are similar. There are two main approaches to global routing: one net at a time, or all nets at once. With the inclusion of timing-driven routing objectives, the routing problem becomes much harder and requires understanding the differences between finding the shortest net and finding the net with the shortest delay. Different types of detailed routing include channel routing and area-based or maze routing. Detailed routing with two layers of metal is a fairly well understood problem.

The most important points in this chapter are:

• Routing is divided into global and detailed routing.
• Routing algorithms should match the placement algorithms.
• Routing is not complete if there are unroutes.
• Clock and power nets are handled as special cases.
• Clock-net widths and power-bus widths must usually be set by hand.
• DRC and LVS checks are needed before a design is complete.
