Assignment
Reconfigurable Computing – Assignment # 3
Draw & Describe FPGA Design Flow. List its benefits over Paper Pencil
Design.
The standard implementation methodology for FPGA designs is borrowed from
the ASIC design flow. The steps of the FPGA design flow are shown in the figure
below.
1. Design Entry
The description of the function is made using either a schematic editor, a
hardware description language (HDL), or a finite state machine (FSM)
editor. A schematic description is made by selecting components from a
given library and connecting them together to build the function circuitry.
This process has the advantage of providing a visual environment that
facilitates a direct mapping of the design functions to selected computing
blocks. The final circuit is built in a structural way. However, designs with
a very large number of functions are not easy to manage graphically.
Instead, a hardware description language (HDL) may be used to capture
the design in either a structural or a behavioral way.
2. Functional Simulation
After the design entry step, the designer can simulate the design to check
the correctness of the functionality. This is done by providing test patterns
to the inputs of the design and observing the outputs. The simulation is
done in software by tools which emulate the behavior of the components
used in the design. During the simulation, the inputs and outputs of the
design are usually shown on a graphical interface, which describes the
signal evolution in time.
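The idea of functional simulation can be sketched in a few lines of Python. This is only an illustration, not an HDL simulator: the design under test (a 1-bit full adder) and the function names `full_adder`, `simulate`, and `check_full_adder` are assumptions chosen for the example. Every test pattern is applied to the inputs and the observed outputs are compared against the specification.

```python
from itertools import product

def full_adder(a, b, cin):
    """Design under test (assumed example): a 1-bit full adder."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def simulate(design, n_inputs):
    """Apply every input test pattern and record the observed outputs."""
    return {bits: design(*bits) for bits in product((0, 1), repeat=n_inputs)}

def check_full_adder(results):
    """Compare observed outputs against the arithmetic specification."""
    return all(2 * cout + s == a + b + cin
               for (a, b, cin), (s, cout) in results.items())

if __name__ == "__main__":
    results = simulate(full_adder, 3)
    for pattern, outputs in results.items():
        print(pattern, "->", outputs)
    print("functionally correct:", check_full_adder(results))
```

A real flow would do the same with an HDL testbench and a waveform viewer; the exhaustive pattern set used here is only practical for very small designs.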
3. Logic Synthesis
After the design description and the functional simulation, the design can
be compiled and optimized. It is first translated into a set of Boolean
equations. Technology mapping is then used to implement the functions
with the modules available in the function library of the target architecture.
In the case of FPGAs, this step is called LUT-based technology mapping,
because LUTs are the modules used in the FPGA to implement the
Boolean operators. The result of the logic synthesis is called the netlist.
A netlist describes the modules used to implement the functions as well
as their interconnections. Different netlist formats exist to help
exchange data between different tools. The best known is the
Electronic Design Interchange Format (EDIF).
Some FPGA manufacturers provide proprietary formats. This is the case
for the Xilinx Netlist Format (XNF) for Xilinx FPGAs.
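A small sketch may help to see why a LUT can implement any Boolean function of its inputs: the LUT simply stores the truth table, and the inputs act as an address. The helper names `lut_init` and `lut_eval`, the bit ordering, and the majority function used as an example are assumptions made for illustration; they do not correspond to any vendor tool or netlist format.

```python
from itertools import product

def lut_init(func, k):
    """Return the 2**k configuration bits of a k-input LUT implementing func.

    Entry i holds the output for the input pattern whose binary value is i,
    with the first argument as the most significant bit (assumed ordering).
    """
    return [func(*bits) for bits in product((0, 1), repeat=k)]

def lut_eval(init, *inputs):
    """Evaluate a configured LUT: the inputs form the address into the table."""
    address = 0
    for bit in inputs:
        address = (address << 1) | bit
    return init[address]

def majority(a, b, c):
    """Example Boolean function to be mapped onto one 3-input LUT."""
    return (a & b) | (a & c) | (b & c)

if __name__ == "__main__":
    init = lut_init(majority, 3)
    print("LUT contents:", init)
    print("majority(1, 0, 1) =", lut_eval(init, 1, 0, 1))
```

Functions with more inputs than one LUT provides have to be decomposed into several LUTs, and that decomposition is the actual job of technology mapping.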
Based on two critical parameters, flexibility and performance, computing techniques
can be classified into the following types:
The general structure of a Von Neumann machine, as shown in the figure, consists of:
• A memory for storing program and data. Harvard architectures contain two memories,
accessible in parallel, for storing program and data separately.
• A control unit (also called control path) featuring a program counter that holds the
address of the next instruction to be executed.
• An arithmetic and logic unit (also called data path) in which instructions are
executed.
In each of those five cycles, only the part of the hardware involved in the computation is
activated. The rest remains idle. For example, if the IR cycle is to be performed, the
program counter is activated to get the address of the instruction, the memory is
addressed, and the instruction register is activated to store the instruction before
decoding. Apart from those three units (program counter, memory and instruction
register), all the other units remain idle. Fortunately, the structure of instructions allows
several of them to occupy the idle parts of the processor at the same time, thus
increasing the computation throughput.
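The throughput gain from overlapping instructions can be estimated with a short calculation. The sketch below assumes an idealized pipeline with one cycle per stage and no stalls or hazards; the function names are hypothetical.

```python
def sequential_cycles(n_instructions, n_stages=5):
    """Each instruction runs through all stages before the next one starts."""
    return n_instructions * n_stages

def pipelined_cycles(n_instructions, n_stages=5):
    """Ideal pipeline: the first instruction needs n_stages cycles,
    and every further instruction completes one cycle later."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)

if __name__ == "__main__":
    n = 100
    print("sequential:", sequential_cycles(n), "cycles")
    print("pipelined: ", pipelined_cycles(n), "cycles")
```

For 100 instructions and five stages this gives 500 cycles without overlap and 104 cycles with it, so the speedup approaches the number of stages as the instruction count grows.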
The data path is tailored for an optimal execution of a common set of operations that
mostly characterizes the algorithms in the given class. Also, memory access is reduced
as much as possible. DSPs (Digital Signal Processors) are among the most widely used
domain-specific processors.
This specialization of DSPs increases the performance of the processor and improves
the device utilization. However, the flexibility is reduced, since the processor can no
longer be used for applications other than those for which it was optimally designed.
A processor designed for only one application is called an Application Specific Processor
(ASIP). ASIPs are usually implemented as single chips called Application Specific
Integrated Circuits (ASICs).
In an ASIP, the instruction cycles (IR, D, R, EX, W) are eliminated. The instruction set of
the application is directly implemented in hardware. Input data streams into the processor
through its inputs, the processor performs the required computation, and the results can
be collected at the outputs of the processor.
ASIPs use a spatial approach to implement only one application. The functional units
needed for the computation of all parts of the application must be available on the surface
of the final processor. This kind of computation is called "Spatial Computing". Once again,
an ASIP that is built to perform a given computation cannot be used for tasks other
than those for which it was originally designed.
Moreover, the five steps (IR, D, R, EX, W) needed to perform one instruction become a
major drawback, in particular if the same instruction has to be executed on huge sets of
data. Flexibility is possible because "the application must always adapt to the hardware"
in order to be executed.
ASIPs deliver high performance because they are optimized for a particular application.
The instruction set required for that application can then be built into a chip. Performance
is possible because "the hardware is always adapted to the application".
We would like to have a device able "to adapt to the application" on the fly. We call such
a hardware device reconfigurable hardware, a reconfigurable device, or a reconfigurable
processing unit (RPU), in analogy to the Central Processing Unit (CPU). Reconfigurable
Computing is defined as the study of computation using reconfigurable devices.
For a given application, at a given time, the spatial structure of the device will be modified
such as to use the best computing approach to speed up that application. If a new
application has to be computed, the device structure will be modified again to match the
new application. Contrary to Von Neumann computers, which are programmed by a
set of instructions to be executed sequentially, the structure of a reconfigurable device is
changed by modifying all or part of the hardware at compile-time or at run-time, usually
by downloading a so-called bitstream into the device.
Draw and Describe CPLD Architecture. Explain the working of its various
blocks.
The FastFLASH XC9500XL family is a 3.3V CPLD family targeted for high-performance,
low-voltage applications in leading-edge communications and computing
systems, where high device reliability and low power dissipation are important. The
XC9500XL architectural features address the requirements of in-system programmability.
Enhanced pin-locking capability avoids costly board rework. Each XC9500XL device is a
subsystem consisting of multiple Function Blocks (FBs) and I/O Blocks (IOBs) fully
interconnected by the FastCONNECT II switch matrix. The IOB provides buffering for
device inputs and outputs. Each FB provides programmable logic capability with an
extra-wide 54 inputs and 18 outputs. The
FastCONNECT II switch matrix connects all FB outputs and input signals to the FB inputs.
For each FB, up to 18 outputs (depending on package pin-count) and associated output
enable signals drive directly to the IOBs.
Functional Block
Macrocell
Each XC9500XL macrocell may be individually configured for a combinatorial or
registered function. The macrocell and associated FB logic are shown in the figure.
Five direct product terms from the AND-array are available for
use as primary data inputs (to the OR and XOR gates) to implement combinatorial
functions, or as control inputs including clock, clock enable, set/reset, and output enable.
The product term allocator associated with each macrocell selects how the five direct
terms are used.
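The sum-of-products behavior described above can be illustrated with a heavily simplified model. The `Macrocell` class below is a hypothetical sketch, not the actual XC9500XL circuit: it assumes that up to five product terms feed an OR gate, that the XOR gate acts as an optional inverter, and that the result is either passed through (combinatorial) or captured in a flip-flop (registered). The product term allocator and the control inputs are not modeled.

```python
class Macrocell:
    """Hypothetical, simplified macrocell model for illustration only."""

    def __init__(self, product_terms, invert=False, registered=False):
        if len(product_terms) > 5:
            raise ValueError("only five direct product terms are available")
        # Each product term is a dict {input_name: required_value}.
        self.product_terms = product_terms
        self.invert = invert          # XOR gate modeled as a programmable inverter
        self.registered = registered  # combinatorial or registered output
        self.q = 0                    # flip-flop state

    def _combinatorial(self, inputs):
        # AND-array: a term is true when every listed input has the required value.
        terms = [all(inputs[name] == value for name, value in term.items())
                 for term in self.product_terms]
        # The OR gate sums the product terms; the XOR optionally inverts the sum.
        return int(any(terms)) ^ int(self.invert)

    def output(self, inputs):
        """Current macrocell output for the given input values."""
        return self.q if self.registered else self._combinatorial(inputs)

    def clock(self, inputs):
        """Clock edge: capture the combinatorial value in the flip-flop."""
        self.q = self._combinatorial(inputs)
        return self.q

if __name__ == "__main__":
    # a XOR b as a sum of two product terms: a*(not b) + (not a)*b
    xor_cell = Macrocell([{"a": 1, "b": 0}, {"a": 0, "b": 1}])
    print(xor_cell.output({"a": 1, "b": 0}))
    print(xor_cell.output({"a": 1, "b": 1}))
```

In the registered configuration the output only changes when `clock` is called, which mirrors the difference between the two macrocell modes.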
I/O Block
The I/O Block (IOB) interfaces between the internal logic and the device user I/O pins.
Each IOB includes an input buffer, output driver, output enable selection multiplexer, and
user programmable ground control. The input buffer is compatible with 5V CMOS, 5V
TTL, 3.3V CMOS, and 2.5V CMOS signals. The input buffer uses the internal 3.3V voltage
supply (VCCINT) to ensure that the input thresholds are constant and do not vary with
the VCCIO voltage. Each input buffer provides input hysteresis (50 mV typical) to help
reduce system noise for input signals with slow rise or fall edges.
Draw and Describe FPGA Architecture. Explain the working of its various
blocks.
FPGAs are prefabricated silicon chips that can be programmed electrically to implement
digital designs. The first static-memory-based FPGA (commonly called an SRAM-based
FPGA) configured both logic and interconnect using a stream of configuration bits. A
modern FPGA contains approximately 330,000 logic blocks and around 1,100 inputs and
outputs.
The configurable logic block provides the basic computation and storage elements used in
digital systems. A basic logic element consists of configurable combinational logic, a
flip-flop, and some fast carry logic to reduce area and delay cost. Modern FPGAs contain a
heterogeneous mixture of different blocks, such as dedicated memory blocks and multiplexers.
Configuration memory is used throughout the logic blocks to control the specific function
of each element.
Programmable Routing
Programmable I/O
The programmable I/O pads are used to interface the logic blocks and routing architecture
to external components. An I/O pad and its surrounding logic circuitry form an I/O
cell. These cells consume a large portion of the FPGA’s area. The design of programmable
I/O blocks is complex because the supported standards differ greatly in supply voltage and
reference voltage. The selection of standards is therefore important in I/O architecture
design: supporting a large number of standards can increase the silicon area required for
I/O cells.
Power Optimization
In a reconfigurable system, power consumption is a very important
metric in new-generation devices. One important way to reduce a gate’s power
consumption is to make it change its output as few times as possible. While the
gate would not be useful if it never changed its output value, it is possible to
design the logic network to reduce the number of unnecessary changes to a
gate’s output as it works to compute the desired value.
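The link between output changes and power can be made concrete with the standard CMOS switching-power relation P = a * C * Vdd^2 * f, where a is the switching activity. The sketch below is only an estimate under that model; the function names, the waveforms, and all numeric values are hypothetical.

```python
def count_transitions(waveform):
    """Number of times a sampled output signal changes value."""
    return sum(1 for prev, cur in zip(waveform, waveform[1:]) if prev != cur)

def dynamic_power(activity, capacitance_f, vdd, freq_hz):
    """Switching power in watts: activity * C * Vdd^2 * f."""
    return activity * capacitance_f * vdd ** 2 * freq_hz

if __name__ == "__main__":
    clean = [0, 0, 0, 1, 1, 1]    # output settles with a single transition
    glitchy = [0, 1, 0, 1, 1, 1]  # same final value, two extra transitions
    print("clean transitions:  ", count_transitions(clean))
    print("glitchy transitions:", count_transitions(glitchy))
    # Hypothetical gate: 10 fF load, 1.0 V supply, 100 MHz clock
    print("power at activity 0.5:", dynamic_power(0.5, 10e-15, 1.0, 100e6), "W")
```

Both waveforms compute the same final value, but the glitchy one charges and discharges the load three times instead of once, which is the unnecessary switching the text refers to. The quadratic dependence on the supply voltage is also what makes voltage scaling effective.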
Many different techniques are used to reduce power consumption. Some of the
main ones are:
a. Eliminating Glitches: Eliminating glitching is one of the most important
techniques for power reduction in CMOS logic. Glitch reduction can often
be applied more effectively in sequential systems than is possible in
combinational logic. Sequential machines can use registers to stop the
c. Blocking Glitch Propagation: Beyond retiming, we can also add extra levels
of registers to keep glitches from propagating. Adding registers can be
useful when there are more glitch-producing segments of logic than there
are ranks of flip-flops to catch the glitches. Such changes, however, will
change the number of cycles required to compute the machine’s outputs
and must be compatible with the rest of the system. Proper state
assignment may also help reduce power consumption. For example, a one-hot
encoding requires only two signal transitions per cycle, one on the old state
signal and one on the new state signal. However, one-hot encoding requires a
large number of memory elements. The power consumption of the logic that
computes the required current-state and next-state functions must also be
taken into account.
d. Transistor sizing: adjusting the size of each gate or transistor for minimum
power.
e. Voltage scaling: lower supply voltages use less power, but go slower.
f. Voltage islands: Different blocks can be run at different voltages, saving
power. This design practice may require the use of level-shifters when two
blocks with different supply voltages communicate with each other.
g. Variable VDD: The voltage for a single block can be varied during operation
- high voltage (and high power) when the block needs to go fast, low voltage
when slow operation is acceptable.
h. Multiple threshold voltages: Modern processes can build transistors with
different thresholds. Power can be saved by using a mixture of CMOS
transistors with two or more different threshold voltages. In the simplest
form there are two different thresholds available, commonly called High-Vt
and Low-Vt, where Vt stands for threshold voltage. High threshold
transistors are slower but leak less, and can be used in non-critical circuits.
i. Power gating: This technique uses high-Vt sleep transistors which cut off a
circuit block when the block is not switching. The sleep transistor sizing is
an important design parameter. This technique, also known as MTCMOS
or Multi-Threshold CMOS, reduces standby or leakage power, and also
enables Iddq testing.
j. Long-Channel transistors: Transistors of more than minimum length leak
less, but are bigger and slower.
k. Stacking and parking states: Logic gates may leak differently during
logically equivalent input states (say 10 on a NAND gate, as opposed to
01). State machines may have less leakage in certain states.
l. Logic styles: dynamic and static logic, for example, have different
speed/power tradeoffs.
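The state assignment remark in item c can be checked with a small experiment that counts how many state bits toggle for a given sequence of states. The encodings and the state sequences below are hypothetical examples, and the sketch counts only state-register transitions, not the power of the next-state logic.

```python
def hamming(a, b):
    """Number of bit positions in which two code words differ."""
    return bin(a ^ b).count("1")

def binary_code(state):
    """Binary encoding: state i is stored as the number i."""
    return state

def one_hot_code(state):
    """One-hot encoding: state i sets only bit i."""
    return 1 << state

def transitions(sequence, encode):
    """Total state-bit toggles over a sequence of visited states."""
    return sum(hamming(encode(s), encode(t))
               for s, t in zip(sequence, sequence[1:]))

if __name__ == "__main__":
    for seq in ([3, 4, 3, 4], [0, 1, 2, 3]):
        print(seq,
              "binary:", transitions(seq, binary_code),
              "one-hot:", transitions(seq, one_hot_code))
```

One-hot always toggles exactly two bits per state change, while binary encoding can toggle more (3 to 4 flips three bits) or fewer (0 to 1 flips one bit), so the better choice depends on which transitions the machine takes most often.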
Energy Optimization
Power and Energy Consumption is calculated as below:
Clock Optimization
Clock trees are a large source of dynamic power because they switch at the
maximum rate and typically have larger capacitive loads. Optimizing the
clock network therefore also helps with power optimization.
The clock can be shielded so that noise is not coupled to other signals, but shielding
increases area by 12 to 15%. Clock optimization is achieved by buffer sizing,
gate sizing, buffer relocation, level adjustment, and high-fanout net (HFN)
synthesis (cloning is the technique used for HFNs). Setup slack is improved in
pre-placement, placement, and post-placement optimization before the clock tree
synthesis (CTS) stage, while hold slack is neglected. Hold slack is improved in
post-placement optimization after CTS. As a result of CTS, many buffers are added.
The different options in clock tree optimization (CTO) to reduce skew are described in the following list:
Buffer and Gate sizing
• Sizes up or down buffers and gates to improve both skew and insertion
delay.
• You can impose a limit on the type of buffers and gates to be used.
• No new clock tree hierarchy will be introduced during this operation.
Level Adjustment
• Adjust the level of the clock pins to a higher or lower part of the clock tree
hierarchy.
• No new clock tree hierarchy will be introduced during this operation.
Reconfiguration
• Clustering of sequential logic.
• Buffer placement is performed after clustering.
• Longer runtimes.
• No new clock tree hierarchy will be introduced during this operation.
Delay Insertion
• Delay is inserted for shortest paths.
• Delay cells can be user defined or can be extracted by the tool.
• By adding new buffers to the clock path the clock tree hierarchy will
change.
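The effect of delay insertion on skew can be sketched numerically. Skew is taken here as the difference between the longest and the shortest clock insertion delay, and delay cells are added only to the faster paths. The sink names, the delays in picoseconds, and the 50 ps delay cell are hypothetical values, and the function names are not those of any CTO tool.

```python
def skew(insertion_delays):
    """Clock skew: longest minus shortest insertion delay (same unit as input)."""
    return max(insertion_delays.values()) - min(insertion_delays.values())

def insert_delay(insertion_delays, delay_cell):
    """Pad each path with as many whole delay cells as fit
    without exceeding the longest path."""
    longest = max(insertion_delays.values())
    return {sink: delay + ((longest - delay) // delay_cell) * delay_cell
            for sink, delay in insertion_delays.items()}

if __name__ == "__main__":
    delays_ps = {"ff1": 400, "ff2": 520, "ff3": 610}  # hypothetical sinks
    print("skew before:", skew(delays_ps), "ps")
    padded = insert_delay(delays_ps, delay_cell=50)
    print("padded delays:", padded)
    print("skew after: ", skew(padded), "ps")
```

With these numbers the skew drops from 210 ps to 40 ps. The remaining skew is bounded by the granularity of the delay cell, which is one reason the choice of delay cells matters.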
Fanout Delay:
• Logic gates that have large fanout (many gates attached to the output)
are prime candidates for slow operation.
• Even if all the fanout gates use minimum-size transistors, presenting the
smallest possible load, they may add up to a large load capacitance.
• Some of the fanout gates may use transistors that are larger than they
need to be, in which case those transistors can be reduced in size to speed up
the previous gate.
• In many cases this fortuitous situation does not occur, leaving two
possible solutions:
• The transistors of the driving gate can be enlarged, in severe cases
using buffer chains.
• The logic can be redesigned to reduce the gate’s fanout.
Path Delay
• In other cases, performance may be limited not by a single gate, but by a
path through a number of gates.
• Combinational network delay is measured over paths through the network.
• We can trace a causality chain from the inputs to the worst-case output.
• Critical path: the path which creates the longest delay.
• We can trace the transitions which cause delays that are elements of the critical
delay path.
Because hardware resources remain static for the life of the application, conventional
design tools provide adequate support for application development.
Delay is generally used to mean the time it takes for a gate’s output to arrive at 50% of
its final value. The following are the different sources of delay through a single gate.
Fanout Delay
o Logic gates that have large fanout (many gates attached to the output) are
prime candidates for slow operation.
o Even if all the fanout gates use minimum-size transistors, presenting the
smallest possible load, they may add up to a large load capacitance.
o Some of the fanout gates may use transistors that are larger than they need to be,
in which case those transistors can be reduced in size to speed up the
previous gate.
o Solution: In many cases this fortuitous situation does not occur, leaving the
following possible solutions:
▪ The transistors of the driving gate can be enlarged, in severe cases
using buffer chains.
▪ The logic can be redesigned to reduce the gate’s fanout.
▪ Increasing the sizes of the driving gate’s transistors.
▪ Reducing the capacitance attached to the driving gate.
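How fanout turns into delay can be estimated with a first-order RC model: the driving gate is treated as a resistance charging the sum of the input capacitances it drives, and the 50% point of such an RC response is reached after ln(2) * R * C. The function name and the component values below are hypothetical, and wire capacitance is included only as an optional parameter.

```python
import math

def gate_delay(r_drive_ohm, c_in_farads, fanout, c_wire_farads=0.0):
    """Time for the output to reach 50% of its final value (first-order RC model)."""
    c_load = fanout * c_in_farads + c_wire_farads
    return math.log(2) * r_drive_ohm * c_load

if __name__ == "__main__":
    r_drive = 10e3  # hypothetical 10 kOhm effective drive resistance
    c_in = 2e-15    # hypothetical 2 fF input capacitance per fanout gate
    for fanout in (1, 4, 8):
        print(f"fanout {fanout}: {gate_delay(r_drive, c_in, fanout) * 1e12:.1f} ps")
    # Enlarging the driver lowers its resistance; halving R halves the delay here.
    print(f"fanout 8, R/2: {gate_delay(r_drive / 2, c_in, 8) * 1e12:.1f} ps")
```

In this model the delay grows linearly with fanout, and the two remedies from the list map directly onto its two factors: a larger driver reduces R, and a lower fanout or smaller load transistors reduce C.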
Path Delay
o In other cases, performance may be limited not by a single gate, but by a
path through a number of gates. Combinational network delay is measured
over paths through network.
o Critical path: the path which creates the longest delay.
o Solution:
▪ Trace a causality chain from the inputs to the worst-case output.
▪ Trace the transitions which cause delays that are elements of the
critical delay path.
▪ Speed up a gate on the critical path. This can be done in a similar way,
by applying the fanout delay solutions.
▪ Use Boolean identities to reduce delay (deep vs. shallow circuit
implementation).
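Finding the critical path amounts to a longest-path search over the gate network. The sketch below assumes the network is an acyclic graph given as a fan-in list per node, with one fixed delay per gate; the function name, the example netlist, and the delay values are hypothetical.

```python
from functools import lru_cache

def critical_path(fanin, delays):
    """Return (delay, path) of the slowest path in a combinational network.

    fanin maps each node to the list of nodes driving it; primary inputs
    have an empty list. delays maps each node to its own delay.
    """
    @lru_cache(maxsize=None)
    def arrival(node):
        drivers = fanin[node]
        if not drivers:
            return delays.get(node, 0), (node,)
        # The latest-arriving driver determines when this node can switch.
        time, path = max(arrival(src) for src in drivers)
        return time + delays[node], path + (node,)

    return max(arrival(node) for node in fanin)

if __name__ == "__main__":
    fanin = {"a": [], "b": [], "c": [],
             "g1": ["a", "b"], "g2": ["b", "c"],
             "g3": ["g1", "g2"], "g4": ["g3", "c"]}
    delays = {"a": 0, "b": 0, "c": 0, "g1": 2, "g2": 3, "g3": 1, "g4": 2}
    total, path = critical_path(fanin, delays)
    print("critical delay:", total)
    print("critical path: ", " -> ".join(path))
```

In this example, speeding up g1 would not help, because the slowest path runs through g2, g3 and g4. Only gates on the reported path are worth optimizing first.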
Wire Delay
• Delay through Resistive Interconnect
• In many modern chips, the delay through wires is larger than the delay
through gates, so studying the delay through wires is as important as
studying delay through gates.
• Delay through RC Trees
• Delay through Inductive Interconnect
• Solutions
o Inserting buffers: a series of equally spaced buffers is placed along the
line to restore the signal.
o Wire sizing: wider wires near the source and narrower wires near
the sinks minimize delay.
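Why equally spaced buffers help can be seen from an Elmore-style estimate, in which the delay of a distributed RC wire grows with the square of its length (about 0.5 * r * c * L^2). Splitting the wire into shorter segments reduces the quadratic term, at the cost of one buffer delay per inserted buffer. The sketch below is a simplified model that ignores buffer input capacitance and drive resistance; all names and values are hypothetical.

```python
def wire_delay(r_per_mm, c_per_mm, length_mm):
    """Elmore-style delay estimate of an unbuffered distributed RC wire."""
    return 0.5 * r_per_mm * c_per_mm * length_mm ** 2

def buffered_delay(r_per_mm, c_per_mm, length_mm, n_buffers, t_buffer):
    """Delay with n equally spaced buffers: segment delays plus buffer delays."""
    segments = n_buffers + 1
    segment = wire_delay(r_per_mm, c_per_mm, length_mm / segments)
    return segments * segment + n_buffers * t_buffer

def best_buffer_count(r_per_mm, c_per_mm, length_mm, t_buffer, max_buffers=20):
    """Buffer count with the smallest total delay, found by exhaustive search."""
    return min(range(max_buffers + 1),
               key=lambda k: buffered_delay(r_per_mm, c_per_mm,
                                            length_mm, k, t_buffer))

if __name__ == "__main__":
    r, c, length = 100.0, 0.2e-12, 10.0  # ohm/mm, F/mm, mm (hypothetical)
    t_buf = 40e-12                       # hypothetical 40 ps buffer delay
    k = best_buffer_count(r, c, length, t_buf)
    print("unbuffered:", wire_delay(r, c, length))
    print("best buffer count:", k)
    print("buffered:  ", buffered_delay(r, c, length, k, t_buf))
```

Too few buffers leave the quadratic wire term dominant and too many add more buffer delay than they save, which is why the search has an optimum in between.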