Assignment

Reconfigurable Computing – Assignment # 3
Balakrishnan Arumugam Bits ID : 2018PA01046
Differentiate between PLA, PAL, SPLD, CPLD and FPGA?
| PLA | PAL | CPLD | FPGA |
|---|---|---|---|
| A programmable logic array (PLA) consists of a plane of AND-gates connected to a plane of OR-gates. | A programmable array logic (PAL) device consists of a plane of AND-gates connected to a plane of OR-gates. | A CPLD consists of a set of macrocells, input/output blocks and an interconnection network. | A Field Programmable Gate Array (FPGA) is a programmable device consisting of three main parts: a set of programmable logic cells (also called logic blocks or configurable logic blocks), a programmable interconnection network, and a set of input and output cells around the device. |
| The input signals as well as their negations are connected to the inputs of the AND-gates in the AND-plane. The outputs of the AND-gates are used as inputs for the OR-gates in the OR-plane, whose outputs correspond to those of the PLA. | The input signals as well as their negations are connected to the inputs of the AND-gates in the AND-plane. The outputs of the AND-gates are used as inputs for the OR-gates in the OR-plane, whose outputs correspond to those of the PAL. | The connections between the input/output blocks and the macrocells, and those between macrocells, are made through the programmable interconnection network. | A function to be implemented in an FPGA is partitioned into modules, each of which can be implemented in a logic block. The logic blocks are then connected together using the programmable interconnection. |
| In PLAs, both planes can be programmed by the user. | In PALs, the OR-plane is fixed and only the AND-plane is programmable. | Not Applicable | FPGAs can be programmed once or several times, depending on the technology used (anti-fuse or memory). |
| PALs and PLAs are well suited to implementing two-level circuits, that is, circuits built as a sum of products. | PALs and PLAs are well suited to implementing two-level circuits, that is, circuits built as a sum of products. | They are usually used as glue logic, or to implement small functions. | |
| Not Applicable | Not Applicable | CPLDs use macrocells and are only able to connect signals to neighboring logic blocks, making them less flexible and less suited to complicated applications. | FPGA logic chips can be considered to be a number of logic blocks consisting of gate arrays which are connected through programmable interconnects. |
| Not Applicable | Not Applicable | A CPLD contains only a limited number of logic blocks, with a maximum of about 100 blocks. | The logic block count of an FPGA can reach up to 100,000. |
| Not Applicable | Not Applicable | CPLDs use EEPROMs and hence can operate as soon as they are powered up. | FPGAs are RAM based, meaning they have to download the configuration data from an external memory source and set it up before they can begin to operate; the FPGA goes blank again after power down. |
| Not Applicable | Not Applicable | CPLD chips are non-volatile and retain the programmed data internally. | FPGAs are volatile, and their RAM-based configuration data is available to and readable by an external source. |
| Not Applicable | Not Applicable | In order to change or modify design functionality, a CPLD device must be powered down and reprogrammed. | Circuit modification is simpler and more convenient with FPGAs, as the circuit can be changed even while the device is running through a process called partial reconfiguration. |
| A PLA is constructed from a programmable collection of AND gates and a programmable collection of OR gates. | A PAL is constructed from a programmable collection of AND gates and a fixed collection of OR gates. | Not Applicable | Not Applicable |
| The availability of PLAs is less prolific. | The availability of PALs is more prolific. | Not Applicable | Not Applicable |
| The flexibility of PLA programming is higher. | The flexibility of PAL programming is lower. | Not Applicable | Not Applicable |
| The cost of a PLA is high. | The cost of a PAL is middle range. | Not Applicable | Not Applicable |
| The number of functions that can be implemented in a PLA is large. | The number of functions that can be implemented in a PAL is limited. | Not Applicable | Not Applicable |
| The speed of a PLA is low. | The speed of a PAL is high. | Not Applicable | Not Applicable |
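The two-level AND-plane / OR-plane structure shared by PLAs and PALs can be illustrated with a small software model. The sketch below is only an illustration under assumptions: the function name `evaluate_pla`, the data layout of the two planes and the full-adder example are hypothetical and not taken from any vendor tool.

```python
from itertools import product

def evaluate_pla(inputs, and_plane, or_plane):
    """Evaluate a two-level AND/OR array.

    inputs    : tuple of 0/1 input values
    and_plane : list of product terms; each term maps an input index to the
                value it requires (1 = true input, 0 = negated input)
    or_plane  : list of outputs; each output lists the indices of the
                product terms that feed its OR gate
    """
    terms = [all(inputs[i] == v for i, v in term.items()) for term in and_plane]
    return tuple(int(any(terms[t] for t in out)) for out in or_plane)

# Hypothetical programming: a 1-bit full adder with inputs (a, b, cin).
AND_PLANE = [
    {0: 1, 1: 0, 2: 0},  # a  b' cin'
    {0: 0, 1: 1, 2: 0},  # a' b  cin'
    {0: 0, 1: 0, 2: 1},  # a' b' cin
    {0: 1, 1: 1, 2: 1},  # a  b  cin
    {0: 1, 1: 1},        # a b
    {0: 1, 2: 1},        # a cin
    {1: 1, 2: 1},        # b cin
]
OR_PLANE = [
    [0, 1, 2, 3],  # sum   = OR of the four odd-parity product terms
    [4, 5, 6],     # carry = ab + a cin + b cin
]

def full_adder_table():
    """Truth table of the programmed array: inputs -> (sum, carry)."""
    return {bits: evaluate_pla(bits, AND_PLANE, OR_PLANE)
            for bits in product((0, 1), repeat=3)}

for bits, outs in full_adder_table().items():
    print(bits, "->", outs)
```

In this model a PLA corresponds to the case where both `AND_PLANE` and `OR_PLANE` may be chosen by the user, while a PAL corresponds to the case where `OR_PLANE` is fixed in advance and only `AND_PLANE` can be programmed.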
Draw & Describe FPGA Design Flow. List its benefits over Paper Pencil
Design.
The standard implementation methodology for FPGA designs is borrowed from
the ASIC design flow. The steps of the FPGA design flow are shown in the figure
below.
1. Design Entry
The description of the function is made using either a schematic editor, a
hardware description language (HDL), or a finite state machine (FSM)
editor. A schematic description is made by selecting components from a
given library and connecting them together to build the function circuitry.
This process has the advantage of providing a visual environment that
facilitates a direct mapping of the design functions to selected computing
blocks. The final circuit is built in a structural way. However, designs with
a very large number of functions are not easy to manage graphically.
Instead, a hardware description language (HDL) may be used to capture
the design either in a structural or in a behavioral way.
2. Functional Simulation
After the design entry step, the designer can simulate the design to check
the correctness of the functionality. This is done by providing test patterns
to the inputs of the design and observing the outputs. The simulation is
done in software by tools which emulate the behavior of the components
used in the design. During the simulation, the inputs and outputs of the
design are usually shown on a graphical interface, which describes the
signal evolution in time.
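The idea of applying test patterns and observing the outputs can be sketched in a few lines of software. This is a minimal illustration, not the behavior of any particular simulator: the design under test (`mux2`, a 2-to-1 multiplexer), the `reference` model and the `simulate` helper are all hypothetical names chosen for the example.

```python
from itertools import product

def mux2(a, b, sel):
    """Design under test: gate-level 2-to-1 multiplexer (hypothetical example)."""
    not_sel = 1 - sel
    return (a & not_sel) | (b & sel)

def reference(a, b, sel):
    """Behavioral model that states what the design is supposed to do."""
    return b if sel else a

def simulate(design, patterns):
    """Apply every test pattern to the design and record the observed output."""
    return [(p, design(*p)) for p in patterns]

# Exhaustive test patterns for three 1-bit inputs.
patterns = list(product((0, 1), repeat=3))
trace = simulate(mux2, patterns)
mismatches = [(p, out) for p, out in trace if out != reference(*p)]

for p, out in trace:
    print(f"a={p[0]} b={p[1]} sel={p[2]} -> out={out}")
print("PASS" if not mismatches else f"FAIL: {mismatches}")
```

For a design this small the patterns can be exhaustive; for larger designs the pattern set would normally be a chosen subset.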
3. Logic Synthesis
After the design description and the functional simulation, the design can
be compiled and optimized. It is first translated into a set of Boolean
equations. Technology mapping is then used to implement the functions
with the available modules in the function library of the target architecture. In
the case of FPGAs, this step is called LUT-based technology mapping,
because LUTs are the modules used in the FPGA to implement the
Boolean operators. The result of the logic synthesis is called the netlist.
A netlist describes the modules used to implement the functions as well
as their interconnections. Different netlist formats exist to help
exchange data between different tools. The best known is the
Electronic Design Interchange Format (EDIF).
Some FPGA manufacturers provide proprietary formats. This is the case
with the Xilinx Netlist Format (XNF) for Xilinx FPGAs.
4. Place & Route

For the netlist generated in the logic synthesis process, the operators (LUTs,
flip-flops, multiplexers, etc.) must be placed on the FPGA and
connected together via routing. These two steps are normally carried out
by CAD tools provided by the FPGA vendors. After the placement and
routing of a netlist, the CAD tools generate a file called a bitstream. A
bitstream provides the description of all the bits used to configure the
LUTs, the interconnect matrices, the state of the multiplexers and the I/O of
the FPGA. The full and partial bitstreams can now be stored in a database
to be downloaded later.
5. Design Tools Used

The design entry, the functional simulation and the logic synthesis are
done using CAD tools from Xilinx, Synopsys, Synplicity, Cadence,
Altera and Mentor Graphics. The place and route as well as the
generation of configuration data are done by the corresponding vendor’s
tools. The table below provides some information on the tool capabilities of
some vendors.
Advantages of FPGA Design Flow over Paper Pencil Design.

• Enhanced Scalability
• Reduced Complexity
• Skill
• Reduced Inventory
• Reduced Power
• Rapid Prototype
• Faster Time to Market
• Better Performance
• Reprogrammability
Describe and Differentiate between Computing Techniques?
Based on two critical parameters, flexibility and performance, the following
types of computing techniques can be distinguished:
General Purpose Computing (GPC)

As the name suggests, this computing technique is devised with
general-purpose use in mind. The Von Neumann architecture serves the purpose of GPC. This
architecture offers benefits such as
• Simplicity of Programming – Follows Sequential way of Human Thinking
• Fixed structure
• Able to execute any kind of computation, given a properly programmed
control
• Hardware modification is not required
The general structure of a Von Neumann machine as shown in figure consists of:
• A memory for storing program and data. Harvard-architectures contain two parallel
accessible memories for storing program and data separately
• A control unit (also called control path) featuring a program counter that holds the
address of the next instruction to be executed.
• An arithmetic and logic unit (also called data path) in which instructions are
executed.
A program is coded as a set of instructions to be executed sequentially, instruction after
instruction. At each step of the program execution, the next instruction is fetched from the
memory at the address specified in the program counter and decoded. The required
operands are then collected from the memory before the instruction can be executed.
After execution, the result is written back into the memory. In this process, the control
path is in charge of setting all signals necessary to read from and write to the memory,
and to allow the data path to perform the right computation. The data path is controlled
by the control path, which interprets the instructions and sets the data path’s signals
accordingly to execute the desired operation.
In general the execution of an instruction on a Von Neumann computer can be done in
five cycles:
• Instruction Read (IR) in which an instruction is fetched from the memory;
• Decoding (D) in which the meaning of the instruction is determined and the
operands are localized;
• Read Operands (R) in which the operands are read from the memory;
• Execute (EX) in which the instruction is executed with the read operands;
• Write Result (W) in which the result of the execution is stored back to the memory.
In each of those five cycles only the part of the hardware involved in the computation is
activated. The rest remains idle. For example if the IR cycle is to be performed, the
program counter will be activated to get the address of the instruction, the memory will be
addressed and the instruction register to store the instruction before decoding will be also
activated. Apart from those three units (program counter, memory and instruction
register), all the other units remain idle. Fortunately, the structure of instructions allows
several of them to occupy the idle part of the processor, thus increasing the computation
throughput.
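The throughput gain from letting several instructions occupy the otherwise idle units can be estimated with a short calculation. The sketch below assumes an idealized machine in which every stage takes exactly one cycle and no instruction ever has to wait for another (no hazards or stalls); the function names are hypothetical.

```python
STAGES = ["IR", "D", "R", "EX", "W"]

def sequential_cycles(n):
    """Cycles needed when each instruction finishes all five stages before the next starts."""
    return len(STAGES) * n

def pipelined_cycles(n):
    """Cycles needed when a new instruction enters the IR stage every cycle (idealized)."""
    return len(STAGES) + (n - 1) if n > 0 else 0

def pipeline_schedule(n):
    """Row i lists, cycle by cycle, the stage that instruction i occupies ('--' = not active)."""
    total = pipelined_cycles(n)
    rows = []
    for i in range(n):
        row = ["--"] * total
        for s, name in enumerate(STAGES):
            row[i + s] = name
        rows.append(row)
    return rows

for row in pipeline_schedule(4):
    print(" ".join(f"{cell:>2}" for cell in row))
print(sequential_cycles(4), "cycles sequential,", pipelined_cycles(4), "cycles overlapped")
```

Under these assumptions four instructions need 20 cycles when executed one after another and 8 cycles when overlapped; real processors fall somewhere in between because of dependencies between instructions.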
Domain Specific Processors (DSP)

A domain specific processor is a processor tailored for a class of algorithms.
The data path is tailored for an optimal execution of a common set of operations that
mostly characterizes the algorithms in the given class. Also, memory access is reduced
as much as possible. Digital signal processors (DSPs) are among the most widely used
domain-specific processors.
A DSP is a specialized processor used to speed up the computation of repetitive, numerically
intensive tasks in signal processing areas like telecommunication, multimedia,
automobile, radar, sonar, seismic, image processing, etc. The most often cited feature
of DSPs is their ability to perform one or more
multiply-accumulate (MAC) operations in a single cycle. Usually, MAC operations have to
be performed on a huge set of data. In a MAC operation, data are first multiplied and then
added to an accumulated value. A normal Von Neumann computer would perform a
MAC in 10 steps. The first instruction (multiply) would be fetched, then decoded, then the
operands would be read and multiplied, and the result would be stored back; then the next
instruction (accumulate) would be read, the result stored in the previous step would be
read again and added to the accumulated value, and the result would be stored back.
DSPs avoid those steps by using specialized hardware which directly performs the
addition after the multiplication without having to access the memory.
This specialization of DSPs increases the performance of the processor and improves
the device utilization. However, the flexibility is reduced, since the processor can no longer be used
to implement applications other than those for which it was optimally designed.
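The difference between the two approaches can be put into numbers with a small sketch. It assumes the step counts given above (two instructions of five steps each on a Von Neumann machine) and a DSP that completes a fixed number of MAC operations per cycle; the function names and the `macs_per_cycle` parameter are hypothetical.

```python
STEPS_PER_INSTRUCTION = 5  # IR, D, R, EX, W

def dot_product_mac(xs, ws):
    """Accumulate x*w pairs the way a MAC unit does: multiply, then add to the accumulator."""
    acc = 0
    for x, w in zip(xs, ws):
        acc += x * w  # one multiply-accumulate operation
    return acc

def von_neumann_steps(n_samples):
    """Separate multiply and accumulate instructions, five steps each, per sample."""
    return n_samples * 2 * STEPS_PER_INSTRUCTION

def dsp_mac_cycles(n_samples, macs_per_cycle=1):
    """Cycles for a DSP that finishes `macs_per_cycle` MACs every cycle (assumed, rounded up)."""
    return -(-n_samples // macs_per_cycle)

samples = [1, 2, 3]
weights = [4, 5, 6]
print("result:", dot_product_mac(samples, weights))
print("Von Neumann steps:", von_neumann_steps(len(samples)))
print("DSP cycles:", dsp_mac_cycles(len(samples)))
```

The comparison is deliberately simplified: it ignores pipelining on the general-purpose side and data movement on the DSP side.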
Application Specific Processors (ASP)

Although DSPs incorporate a degree of application specific features like MAC and data
width optimization, they still incorporate the Von Neumann approach and, therefore,
remain sequential machines. Their performance is limited. If a processor has to be used
for only one application, which is known and fixed in advance, then the processing unit
could be designed and optimized for that particular application. In this case, we say that
"the hardware adapts itself to the application".
A processor designed for only one application is called an Application Specific Processor
(ASIP). ASIPs are usually implemented as single chips called Application Specific
Integrated Circuits (ASICs).
In an ASIP, the instruction cycles (IR, D, EX, W) are eliminated. The instruction set of the
application is directly implemented in hardware. Input data stream into the processor
through its inputs, the processor performs the required computation and the results can
be collected at the outputs of the processor.
ASIPs use a spatial approach to implement only one application. The functional units
needed for the computation of all parts of the application must be available on the surface
of the final processor. This kind of computation is called "Spatial Computing". Once again,
an ASIP that is built to perform a given computation cannot be used for other tasks other
than those for which it has been originally designed.
Reconfigurable Computing (RC)

Processors are characterized by two main parameters: flexibility and performance.
Von Neumann computers are very flexible because they are able to compute any
kind of task. This is the reason why the term GPP (General Purpose Processor)
is used for the Von Neumann machine. They do not deliver much performance, because
they cannot compute in parallel.
Moreover, the five steps (IR, D, R, EX, W) needed to perform one instruction become a
major drawback, in particular if the same instruction has to be executed on huge sets of
data. Flexibility is possible because "the application must always adapt to the hardware"
in order to be executed.
ASIPs bring much performance because they are optimized for a particular application.
The instruction set required for that application can then be built in a chip. Performance
is possible because "the hardware is always adapted to the application".
We would like to have a device able "to adapt to the application" on the fly. We call such
a hardware device reconfigurable hardware, a reconfigurable device or a reconfigurable
processing unit (RPU), in analogy to the Central Processing Unit (CPU). Reconfigurable
Computing is defined as the study of computation using reconfigurable devices.
For a given application, at a given time, the spatial structure of the device is modified
so as to use the best computing approach to speed up that application. If a new
application has to be computed, the device structure is modified again to match the
new application. Contrary to Von Neumann computers, which are programmed by a
set of instructions to be executed sequentially, the structure of reconfigurable devices is
changed by modifying all or part of the hardware at compile-time or at run-time, usually
by downloading a so-called bitstream into the device.
Differentiation between Multiple Computing Techniques

| Sr # | Parameter | GPC | DSP | ASP | RC |
|---|---|---|---|---|---|
| 1 | Performance | Low | Medium | High | High |
| 2 | Utilization | Low | Medium | High | High |
| 3 | Flexibility | High | Medium | Low | High |
| 4 | Sequential machines or operations | Yes | Yes | Yes | No |
| 5 | Is application fixed? | No | No | Yes | No |
| 6 | Is application known in advance? | No | Yes | Yes | No |
| 7 | IR, D, R, EX, W followed completely | Yes | No | No (R) | No |
| 8 | Instruction set directly implemented in HW | No | No | Yes | Yes |
| 9 | Spatial computing | No | No | Yes | Yes |
| 10 | Parallel computing | No | No | No | Yes |
| 11 | Run configuration management | No | No | No | Yes |
| 12 | Example | Microcontroller | DSP | Multimedia processor | FPGA |
Draw and Describe CPLD Architecture? Explain the working of its various
blocks?
The FastFLASH XC9500XL family is a 3.3V CPLD family targeted for high-performance,
low-voltage applications in leading-edge communications and computing
systems, where high device reliability and low power dissipation are important. The
XC9500XL architectural features address the requirements of in-system programmability.
Enhanced pin-locking capability avoids costly board rework. Each XC9500XL device is a
subsystem consisting of multiple Function Blocks (FBs) and I/O Blocks (IOBs) fully
interconnected by the FastCONNECT II switch matrix. The IOB provides buffering for
device inputs and outputs. Each FB provides programmable logic capability with
extra-wide 54 inputs and 18 outputs. The
FastCONNECT II switch matrix connects all FB outputs and input signals to the FB inputs.
For each FB, up to 18 outputs (depending on package pin-count) and associated output
enable signals drive directly to the IOBs.
Function Block
Each Function Block comprises 18 independent macrocells, each capable of
implementing a combinatorial or registered function. The FB also receives global clock,
output enable, and set/reset signals. The FB generates 18 outputs that drive the
FastCONNECT switch matrix. These 18 outputs and their corresponding output
enable signals also drive the IOB. Logic within the FB is implemented using a
sum-of-products representation. Fifty-four inputs provide 108 true and complement signals into
the programmable AND-array to form 90 product terms.
Macrocell
Each XC9500XL macrocell may be individually configured for a combinatorial or
registered function. The figure shows the macrocell
and associated FB logic. Five direct product terms from the AND-array are available for
use as primary data inputs (to the OR and XOR gates) to implement combinatorial
functions, or as control inputs including clock, clock enable, set/reset, and output enable.
The product term allocator associated with each macrocell selects how the five direct
terms are used.
Product Term Allocator

The product term allocator controls how the five direct product terms are assigned to each
macrocell. For example, all five direct terms can drive the OR function. The product term
allocator can re-assign other
product terms within the FB to increase the logic capacity of a macrocell beyond five direct
terms. Any macrocell requiring additional product terms can access uncommitted product
terms in other macrocells within the FB. Up to 15 product terms can be available to a
single macrocell with only a small incremental delay of tPTA.
Fast Connect II Switch matrix

The FastCONNECT II switch matrix connects signals to the FB inputs. All IOB outputs
(corresponding to user pin inputs) and all FB outputs drive the FastCONNECT II matrix.
Any of these (up to a fan-in limit of 54) may be selected to drive each FB with a uniform
delay.
I/O Block
The I/O Block (IOB) interfaces between the internal logic and the device user I/O pins.
Each IOB includes an input buffer, output driver, output enable selection multiplexer, and
user programmable ground control. The input buffer is compatible with 5V CMOS, 5V
TTL, 3.3V CMOS, and 2.5V CMOS signals. The input buffer uses the internal 3.3V voltage
supply (VCCINT) to ensure that the input thresholds are constant and do not vary with
the VCCIO voltage. Each input buffer provides input hysteresis (50 mV typical) to help
reduce system noise for input signals with slow rise or fall edges.
Draw and Describe FPGA Architecture? Explain the working of its various
blocks?
FPGAs are prefabricated silicon chips that can be programmed electrically to implement
digital designs. The first FPGAs were based on static memory (SRAM), which is used for configuring
both logic and interconnection using a stream of configuration bits. Today’s modern
FPGAs contain approximately 330,000 logic blocks and around 1,100 inputs and outputs.
The FPGA architecture consists of three major components:
• Configurable Logic Blocks, which implement logic functions
• Configurable Routing (interconnects), which connects the logic blocks and I/O blocks
• I/O blocks, which are used to make off-chip connections
Configurable Logic Blocks
The configurable logic block provides basic computation and storage elements used in
digital systems. A basic logic element consists of configurable combinational logic, a
flip-flop, and some fast carry logic to reduce area and delay cost. Modern FPGAs contain a
heterogeneous mixture of different blocks, such as dedicated memory blocks and multiplexers.
Configuration memory is used throughout the logic blocks to control the specific function
of each element.
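A basic logic element of this kind can be modeled in software as a look-up table (LUT) whose contents are the configuration bits, followed by an optional flip-flop. The sketch below is a simplified assumption, not the structure of any specific FPGA family: it omits the fast carry logic, and the names `LogicElement` and `configure_lut` are hypothetical.

```python
class LogicElement:
    """A k-input LUT followed by an optional D flip-flop (carry logic omitted)."""

    def __init__(self, lut_bits, registered=False):
        # lut_bits[i] is the LUT output for the input combination whose
        # binary value is i (input 0 is the least significant bit).
        self.lut_bits = list(lut_bits)
        self.registered = registered
        self.q = 0  # flip-flop state

    def lut(self, inputs):
        index = sum(bit << pos for pos, bit in enumerate(inputs))
        return self.lut_bits[index]

    def output(self, inputs):
        """Registered elements show the stored value, others the LUT value."""
        return self.q if self.registered else self.lut(inputs)

    def clock_edge(self, inputs):
        """On a clock edge the flip-flop captures the current LUT output."""
        self.q = self.lut(inputs)

def configure_lut(func, k):
    """Derive the 2**k configuration bits of a k-input LUT from a Python function."""
    return [func(*[(i >> pos) & 1 for pos in range(k)]) & 1 for i in range(2 ** k)]

# Combinational use: a 3-input XOR.
xor3 = LogicElement(configure_lut(lambda a, b, c: a ^ b ^ c, 3))
print(xor3.output([1, 1, 0]))

# Registered use: an inverter fed by its own flip-flop toggles on every clock edge.
toggle = LogicElement(configure_lut(lambda q: 1 - q, 1), registered=True)
sequence = []
for _ in range(4):
    toggle.clock_edge([toggle.q])
    sequence.append(toggle.output([toggle.q]))
print(sequence)
```

Changing only the list passed as `lut_bits` changes the function that the element implements, which is the role the configuration memory plays in the description above.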
Programmable Routing
The programmable routing establishes connections between logic blocks and
Input/Output blocks to complete a user-defined design unit. It consists of multiplexers,
pass transistors and tri-state buffers. Pass transistors and multiplexers are used in a logic
cluster to connect the logic elements.
Programmable I/O
The programmable I/O pads are used to interface the logic blocks and routing architecture
to the external components. The I/O pad and the surrounding logic circuit form an I/O
cell. These cells consume a large portion of the FPGA’s area. The design of
programmable I/O blocks is complex, as there are great differences in the supply voltage and
reference voltage. The selection of standards is important in I/O architecture design.
Supporting a large number of standards can increase the silicon chip area required for
I/O cells.
Write Note on Power, Energy, Clock Optimization. Also explain Network
Delay.
Power Optimization
When we refer to reconfigurable systems, power consumption is a very important
metric in new-generation devices. One important way to reduce a gate’s power
consumption is to make it change its output as few times as possible. While the
gate would not be useful if it never changed its output value, it is possible to
design the logic network to reduce the number of unnecessary changes to a
gate’s output as it works to compute the desired value.
Glitches and its impact on Power

Figure shows an example of power-consuming glitching in a logic network.
Glitches are more likely to occur in multi-level logic networks because the signals
arrive at gates at different times. In this example, the NOR gate at the output
starts at 0 and ends at 0, but differences in arrival times between the gate input
connected to the primary input and the output of the NAND gate cause the NOR
gate’s output to glitch to 1.
Glitches become more frequent as the chains of operations get longer.
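The glitch mechanism can be reproduced with a small unit-delay simulation. The sketch assumes a circuit of the kind described above, out = NOR(a, NAND(b, c)), in which every gate has a delay of one time step; the chosen input waveforms and the function names are hypothetical.

```python
def nand(a, b):
    return 1 - (a & b)

def nor(a, b):
    return 1 - (a | b)

def simulate_unit_delay(a_wave, b_wave, c_wave):
    """Simulate out = NOR(a, NAND(b, c)) with one time step of delay per gate.

    Each *_wave lists the value of one primary input at every time step.
    Returns the value of the NOR output at every time step.
    """
    # Assume the circuit has settled on the initial input values.
    nand_out = nand(b_wave[0], c_wave[0])
    nor_out = nor(a_wave[0], nand_out)
    out_wave = [nor_out]
    for t in range(1, len(a_wave)):
        # A gate output at time t reflects the gate's inputs at time t - 1.
        new_nor = nor(a_wave[t - 1], nand_out)
        new_nand = nand(b_wave[t - 1], c_wave[t - 1])
        nor_out, nand_out = new_nor, new_nand
        out_wave.append(nor_out)
    return out_wave

def count_transitions(wave):
    """Number of output changes; each one charges or discharges the load."""
    return sum(1 for x, y in zip(wave, wave[1:]) if x != y)

# Inputs a and b both fall at time step 1; c stays high.
a = [1, 0, 0, 0, 0]
b = [1, 0, 0, 0, 0]
c = [1, 1, 1, 1, 1]
out = simulate_unit_delay(a, b, c)
print(out, "transitions:", count_transitions(out))
```

The output starts and ends at 0, yet it pulses to 1 for one time step because the change on `a` reaches the NOR gate one gate delay before the change on the NAND output does. The two extra transitions consume power without contributing to the final result.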
Many different techniques are used to reduce power consumption. Some of the
main ones are:
a. Eliminating Glitches: Eliminating glitching is one of the most important
techniques for power reduction in CMOS logic. Glitch reduction can often
be applied more effectively in sequential systems than is possible in
combinational logic. Sequential machines can use registers to stop the
propagation of glitches, independent of the logic function being
implemented.
b. Re-timing: Many sequential timing optimizations can be thought of as
retiming. The figure illustrates how flip-flops can be used to reduce power
consumption by blocking glitches from propagating to high capacitance
nodes. (The flip-flop and its clock connection do, of course,
consume some power of their own.) A well-placed flip-flop will be positioned
after the logic with high signal transition probabilities and before high
capacitance nodes on the same path.
c. Blocking Glitch Propagation: Beyond retiming, we can also add extra levels
of registers to keep glitches from propagating. Adding registers can be
useful when there are more glitch-producing segments of logic than there
are ranks of flip-flops to catch the glitches. Such changes, however, will
change the number of cycles required to compute the machine’s outputs
and must be compatible with the rest of the system. Proper state
assignment may help reduce power consumption. For example, a one-hot
encoding requires only two signal transitions per cycle—on the old state and
new state signals. However, one-hot encoding requires a large number of
memory elements. The power consumption of the logic that computes the
required current-state and next-state functions must also be taken into
account.
d. Transistor sizing: adjusting the size of each gate or transistor for minimum
power.
e. Voltage scaling: lower supply voltages use less power, but go slower.
f. Voltage islands: Different blocks can be run at different voltages, saving
power. This design practice may require the use of level-shifters when two
blocks with different supply voltages communicate with each other.
g. Variable VDD: The voltage for a single block can be varied during operation
- high voltage (and high power) when the block needs to go fast, low voltage
when slow operation is acceptable.
h. Multiple threshold voltages: Modern processes can build transistors with
different thresholds. Power can be saved by using a mixture of CMOS
transistors with two or more different threshold voltages. In the simplest
form there are two different thresholds available, commonly called High-Vt
and Low-Vt, where Vt stands for threshold voltage. High threshold
transistors are slower but leak less, and can be used in non-critical circuits.
i. Power gating: This technique uses high Vt sleep transistors which cut-off a
circuit block when the block is not switching. The sleep transistor sizing is
an important design parameter. This technique, also known as MTCMOS,
or Multi-Threshold CMOS reduces stand-by or leakage power, and also
enables Iddq testing.
j. Long-Channel transistors: Transistors of more than minimum length leak
less, but are bigger and slower.
k. Stacking and parking states: Logic gates may leak differently during
logically equivalent input states (say 10 on a NAND gate, as opposed to
01). State machines may have less leakage in certain states.
l. Logic styles: dynamic and static logic, for example, have different
speed/power tradeoffs.
Energy Optimization
Power and energy consumption are calculated as below:
From the above equation, it is evident that we can optimize the energy
consumption if we optimize the power consumption.
We are concerned about power consumption because it leads to heat dissipation, cooling
requirements and physical deterioration due to temperature. However, sometimes we want to
reduce the total energy consumed to enhance battery life, specifically when we use
reconfigurable devices in handheld or portable equipment.
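The relationship between power, energy and supply voltage can be made concrete with a short numerical sketch. It assumes the usual first-order CMOS model, dynamic power P = a * C * Vdd^2 * f and energy E = P * t, which is a standard textbook approximation rather than a formula quoted from this document; the operating point values are hypothetical.

```python
def dynamic_power(activity, capacitance_f, vdd_v, freq_hz):
    """Dynamic switching power in watts: P = activity * C * Vdd^2 * f (first-order model)."""
    return activity * capacitance_f * vdd_v ** 2 * freq_hz

def energy(power_w, seconds):
    """Energy in joules consumed at a constant power: E = P * t."""
    return power_w * seconds

# Hypothetical operating point: 20% switching activity, 1 nF switched
# capacitance, 100 MHz clock, supply lowered from 1.2 V to 0.9 V.
base = dynamic_power(activity=0.2, capacitance_f=1e-9, vdd_v=1.2, freq_hz=100e6)
scaled = dynamic_power(activity=0.2, capacitance_f=1e-9, vdd_v=0.9, freq_hz=100e6)

print(f"power at 1.2 V: {base * 1e3:.1f} mW")
print(f"power at 0.9 V: {scaled * 1e3:.1f} mW")
print(f"ratio: {scaled / base:.4f}")
print(f"energy over 10 s at 1.2 V: {energy(base, 10):.3f} J")
```

Under this model, lowering the supply voltage by a quarter cuts dynamic power to roughly 56% at the same clock frequency, and halving the switching activity (for example by removing glitches) halves it. Leakage power is not included.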
Clock Optimization
Clock trees are a large source of dynamic power because they switch at the
maximum rate and typically have larger capacitive loads. Optimizing the clock
therefore also helps power optimization.
The clock can be shielded so that noise is not coupled to other signals, but shielding
increases area by 12 to 15%. Clock optimization is achieved by buffer sizing,
gate sizing, buffer relocation, level adjustment and HFN (high fan-out net)
synthesis (cloning is the technique used for HFNs). We try to improve setup slack in the
pre-placement, placement and post-placement optimization stages before CTS (clock tree
synthesis), while neglecting hold slack. In post-placement optimization after CTS, hold slack
is improved. As a result of CTS, a lot of buffers are added.
The different options in CTO (clock tree optimization) to reduce skew are described in the following list:
Buffer and Gate sizing
• Sizes up or down buffers and gates to improve both skew and insertion
delay.
• You can impose a limit on the type of buffers and gates to be used.
• No new clock tree hierarchy will be introduced during this operation.
Buffer and Gate Relocation

• Physical location of the buffer or gate is moved to reduce skew and
insertion delay.
Level Adjustment
• Adjust the level of the clock pins to a higher or lower part of the clock tree
hierarchy.
Reconfiguration
• Clustering of sequential logic.
• Buffer placement is performed after clustering.
• Longer runtimes.
Delay Insertion
• Delay is inserted for shortest paths.
• Delay cells can be user defined or can be extracted by the tool.
• By adding new buffers to the clock path the clock tree hierarchy will
change.
Dummy Load Insertion

• Uses load balancing to fine tune the clock skew by increasing the shortest
path delay.
• Dummy load cells can be user defined or can be extracted by the tool.
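The effect of delay insertion on skew can be illustrated with a small calculation. The sketch below is a simplified assumption of how such balancing could work, not the algorithm of any particular CTS tool: skew is taken as the difference between the latest and earliest clock arrival, and whole delay cells of a fixed size are added to the shorter paths. All names and numbers are hypothetical.

```python
def clock_skew(arrivals):
    """Skew = latest minus earliest clock arrival time among the sinks."""
    return max(arrivals.values()) - min(arrivals.values())

def insert_delay(arrivals, cell_delay):
    """Pad the shorter paths with whole delay cells without exceeding the longest path."""
    longest = max(arrivals.values())
    balanced = {}
    for sink, delay in arrivals.items():
        cells = (longest - delay) // cell_delay  # number of delay cells that fit
        balanced[sink] = delay + cells * cell_delay
    return balanced

# Hypothetical clock arrival times in picoseconds at four flip-flop clock pins.
arrivals = {"ff0": 410, "ff1": 455, "ff2": 520, "ff3": 380}
balanced = insert_delay(arrivals, cell_delay=40)

print("skew before:", clock_skew(arrivals), "ps")
print("after delay insertion:", balanced)
print("skew after:", clock_skew(balanced), "ps")
```

The remaining skew is bounded by the size of one delay cell, which is why a finer-grained technique such as dummy load insertion is used afterwards for fine tuning.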
Combinational Network Delay

A combinational network is a complex network of logic gates interconnected with
each other through wires. The delay through a combinational network depends
in part on the number of gates the signal must go through.
The propagation delay of a networked combinational circuit is the sum of the
propagation delays through each element on the critical path of that network.
The contamination delay is the sum of the contamination delays through each
element on the shortest path.
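Both definitions amount to a path search over the gate network, which can be sketched as follows. The network, its delay values and the function names are hypothetical; the sketch assumes an acyclic network in which each gate has one propagation delay and one contamination delay.

```python
# Hypothetical network: gate -> (propagation delay, contamination delay, fan-in gates).
# Gates with an empty fan-in list are driven directly by primary inputs.
NETWORK = {
    "g1": (3, 1, []),
    "g2": (2, 1, []),
    "g3": (4, 2, ["g1", "g2"]),
    "g4": (1, 1, ["g2"]),
    "out": (2, 1, ["g3", "g4"]),
}

def path_delay(network, gate, index, pick):
    """Delay from the primary inputs to the output of `gate`.

    index selects the per-gate delay (0 = propagation, 1 = contamination);
    pick is max for the critical (longest) path and min for the shortest path.
    """
    own = network[gate][index]
    fanin = network[gate][2]
    if not fanin:
        return own
    return own + pick(path_delay(network, g, index, pick) for g in fanin)

def propagation_delay(network, output):
    """Sum of propagation delays along the critical path."""
    return path_delay(network, output, 0, max)

def contamination_delay(network, output):
    """Sum of contamination delays along the shortest path."""
    return path_delay(network, output, 1, min)

print("propagation delay:", propagation_delay(NETWORK, "out"))
print("contamination delay:", contamination_delay(NETWORK, "out"))
```

In this example the critical path runs through g1, g3 and out, while the shortest path runs through g2, g4 and out. The plain recursion is adequate for small networks; a larger one would call for memoization or a topological traversal.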
Following are different sources of Delays.
Fanout Delay:
• Logic gates that have large fanout (many gates attached to the output)
are prime candidates for slow operation.
• Even if all the fanout gates use minimum-size transistors, presenting the
smallest possible load, they may add up to a large load capacitance.
• Some of the fanout gates may use transistors that are larger than they
need to be, in which case those transistors can be reduced in size to speed up
the previous gate.
• In many cases this fortuitous situation does not occur, leaving two
possible solutions:
• The transistors of the driving gate can be enlarged, in severe cases
using the buffer chains.
• The logic can be redesigned to reduce the gate’s fanout.
Path Delay
• In other cases, performance may be limited not by a single gate, but by a
path through a number of gates.
• Combinational network delay is measured over paths through the network.
• We can trace a causality chain from the inputs to the worst-case output.
• Critical path: the path which creates the longest delay.
• We can trace the transitions which cause delays that are elements of the critical
delay path.
What are various implementation approaches? Describe each in detail with
Examples.
A field-programmable gate array (FPGA) is an integrated circuit that consists of a large,
uncommitted array of
programmable logic and programmable interconnect that can be easily
configured and reconfigured by the end user to implement a wide range of digital
circuits. FPGA-based systems achieve high levels of performance by
using FPGAs
to implement custom algorithm-specific circuits that accelerate overall algorithm
execution. Yet, unlike other custom VLSI approaches, these systems remain
flexible because the same custom circuitry for one algorithm can be reused as
the custom circuitry for a completely different and unrelated algorithm. It is this
ability to remain flexible while also being able to implement custom,
algorithm-specific circuitry that is fueling interest in reconfigurable logic as a new paradigm
for high-performance system design.
With the increasing size and speed of reconfigurable processors, it is possible to
implement many large modules on a reconfigurable device at the same time.
Moreover, for some reconfigurable devices, only a part of the device can be
configured while the rest continues to operate. This partial reconfiguration
capability enables many functions to be temporally implemented on the device.
Depending on the time at which the reconfiguration sequences are defined, the
computation and configuration flow on a reconfigurable device can be classified
in two categories:
Compile-time reconfiguration (CTR): In this case, the computation and
configuration sequences as well as the data exchange are defined at compile-time
and never change during a computation. This approach is more interesting
for devices which can only be fully reconfigured. However, it can be applied to
partially reconfigurable devices that are logically or physically partitioned in a set
of reconfigurable bins.
CTR has shown promising results by replacing software computing with
customized logic. Most of these applications are developed as a static hardware
configuration that remains on the FPGA for the duration of the application. CTR
basically involves the development of discrete hardware images for each
application on the reconfigurable resource.
Because hardware resources remain static for the life of the application,
conventional design tools provide adequate support for application development.
Run-time reconfiguration (RTR): The computation and configuration
sequences are not known at compile time. A request to implement a given task
becomes known only at run time and must be handled dynamically. The
reconfiguration process exchanges part of the device to adapt the system to
changing operational and environmental conditions. Run-time reconfiguration is
a difficult process that must handle side effects such as defragmentation of
the device and communication between newly placed modules. The management of
the reconfigurable device is usually done by a scheduler and a placer that can
be implemented as part of an operating system running on a processor. The
processor can reside either inside or outside the reconfigurable chip.
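The placer's job and the defragmentation side effect can be illustrated with a small sketch. It is a toy model under stated assumptions, not a description of any real tool: the device is treated as a single row of equal-width columns, placement is first-fit, and the class name `ColumnPlacer` and the module names are hypothetical.

```python
from typing import Dict, Iterator, Optional, Tuple


class ColumnPlacer:
    """Toy 1-D placer: the device is modelled as a row of equal columns."""

    def __init__(self, num_columns: int) -> None:
        self.num_columns = num_columns
        # module name -> (start column, width in columns)
        self.placed: Dict[str, Tuple[int, int]] = {}

    def _free_gaps(self) -> Iterator[Tuple[int, int]]:
        """Yield (start, width) for each run of unoccupied columns."""
        cursor = 0
        for start, width in sorted(self.placed.values()):
            if start > cursor:
                yield cursor, start - cursor
            cursor = start + width
        if cursor < self.num_columns:
            yield cursor, self.num_columns - cursor

    def place(self, name: str, width: int) -> Optional[int]:
        """First-fit placement; returns the start column, or None on failure."""
        for start, gap in self._free_gaps():
            if gap >= width:
                self.placed[name] = (start, width)
                return start
        return None

    def remove(self, name: str) -> None:
        """Free the columns held by a module that is no longer needed."""
        del self.placed[name]

    def defragment(self) -> None:
        """Pack all modules to the left, preserving their order."""
        cursor = 0
        by_start = sorted(self.placed.items(), key=lambda item: item[1][0])
        for name, (_, width) in by_start:
            self.placed[name] = (cursor, width)
            cursor += width


placer = ColumnPlacer(num_columns=10)
placer.place("fir", 3)   # occupies columns 0-2
placer.place("fft", 4)   # occupies columns 3-6
placer.place("crc", 3)   # occupies columns 7-9
placer.remove("fir")
placer.remove("crc")

# Six columns are free, but they are split into two gaps of three columns.
first_try = placer.place("aes", 5)
placer.defragment()
second_try = placer.place("aes", 5)
print(first_try, second_try)
```

The first request fails even though enough columns are free in total, which is the fragmentation problem; after the modules are packed together the same request succeeds. A real system would also have to relocate the running module's state and reconnect its communication links, which this sketch ignores.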
In order to take advantage of the flexibility and performance gained by using
FPGAs, some applications reconfigure hardware resources during application
execution. By doing so, applications can optimize hardware resources by
replacing idle, unneeded logic with usable performance-enhancing modules.
RTR provides a dynamic hardware allocation model.
With RTR, applications may allocate hardware resources as run-time conditions
dictate. Allowing dynamic hardware allocation, however, introduces a number of
additional design problems that current design automation tools do not address.
The lack of sufficient design tools and of a well-defined design methodology
prevents the widespread use of this technique.
Following is a comparison between CTR and RTR.

Sr # | Parameter | Compile-time reconfiguration (CTR) | Run-time reconfiguration (RTR)
1 | Single system-wide configuration | Yes | No
2 | Diagrammatic explanation | (figure) | (figure)
3 | Hardware allocation strategy | Static | Dynamic
4 | Simplicity of design | More | Less
5 | Multiple configurations | No | Yes
6 | Definition of computation, configuration sequences and data exchange | Compile time | Run time
7 | Design tool availability | Easy | Difficult
8 | Advantages | 1. Simplicity; 2. Commonly followed approach | 1. Flexibility; 2. Enhanced optimization of hardware
9 | Constraints | 1. Flexibility | 1. Defragmentation of the device; 2. Communication between newly placed modules; 3. Temporal partitioning
10 | Examples | 1. PAM - utilizes the ability of custom hardware to perform efficient long integer operations; 2. SPLASH; 3. ASIC | 1. Hardware acceleration of image processing - the image processing algorithm is partitioned into several well-defined steps that are implemented at run time; 2. Neural network; 3. Programmable processors
What are the causes of delays and how to overcome them?
Delay is generally used to mean the time it takes for a gate’s output to arrive at 50% of
its final value. Following are the different sources of delay, starting with the
delay through a single gate.
Fanout Delay
o Logic gates that have large fanout (many gates attached to the output) are
prime candidates for slow operation.
o Even if all the fanout gates use minimum-size transistors, presenting the
smallest possible load, they may add up to a large load capacitance.
o Some of the fanout gates may use transistors that are larger than they need
to be, in which case those transistors can be reduced in size to speed up the
driving gate.
o Solution: In many cases this fortuitous situation does not occur, leaving two
possible solutions:
▪ The transistors of the driving gate can be enlarged, in severe cases
using buffer chains.
▪ The logic can be redesigned to reduce the gate’s fanout.
Both follow from the two general ways of speeding up a gate: increasing
the sizes of its transistors, or reducing the capacitance attached to it.
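The effect of fanout and of driver sizing can be sketched numerically. The snippet assumes a first-order RC model in which the 50% delay is ln(2)·R·C; the values of `R_UNIT` and `C_IN` are illustrative assumptions rather than data for any real process, and the model ignores the driver's own output capacitance.

```python
import math

R_UNIT = 10e3   # assumed on-resistance of a minimum-size driver, in ohms
C_IN = 2e-15    # assumed input capacitance of one minimum-size gate, in farads


def gate_delay_50(r_driver: float, c_load: float) -> float:
    """50% delay of a first-order RC stage: t = ln(2) * R * C."""
    return math.log(2) * r_driver * c_load


def fanout_delay(fanout: int, driver_size: float = 1.0) -> float:
    """Delay of a driver of relative size driver_size loaded by `fanout` gates.

    Enlarging the driver is assumed to divide its resistance by driver_size.
    """
    return gate_delay_50(R_UNIT / driver_size, fanout * C_IN)


d_fo1 = fanout_delay(1)                        # one gate attached
d_fo8 = fanout_delay(8)                        # eight gates attached
d_fo8_big = fanout_delay(8, driver_size=4.0)   # same load, 4x larger driver
d_fo4 = fanout_delay(4)                        # logic redesigned for less fanout

for label, d in [("fanout 1", d_fo1), ("fanout 8", d_fo8),
                 ("fanout 8, 4x driver", d_fo8_big), ("fanout 4", d_fo4)]:
    print(f"{label}: {d * 1e12:.1f} ps")
```

In this model delay grows linearly with fanout and shrinks in proportion to driver size. In practice a larger driver also presents a larger load to the gate before it, which is why buffer chains are used in severe cases.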
Path Delay
o In other cases, performance may be limited not by a single gate, but by a
path through a number of gates. Combinational network delay is measured
over paths through the network.
o Critical path: the path that creates the longest delay.
o Solution:
▪ Trace the causality chain from the inputs to the worst-case output.
▪ Trace the transitions that cause the delays along the critical
delay path.
▪ Speed up the gates on the critical path, using the same remedies as
for fanout delay. Speeding up a gate off the critical path does not
reduce the network delay.
▪ Use Boolean identities to reduce delay, for example by replacing a
deep circuit implementation with a shallow one.
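Finding the critical path and comparing a deep with a shallow implementation can be sketched as follows. The netlist format, the unit gate delay `AND2` and the signal names are assumptions made for the example; the two circuits compute the same four-input AND, rewritten using associativity.

```python
from typing import Dict, List, Tuple

# A netlist maps each gate output to (gate delay, list of input signals).
# Signals that never appear as keys are primary inputs arriving at time 0.
Netlist = Dict[str, Tuple[float, List[str]]]


def critical_path(netlist: Netlist, output: str) -> Tuple[float, List[str]]:
    """Return (delay, path) of the slowest input-to-output path."""
    cache: Dict[str, Tuple[float, List[str]]] = {}

    def arrival(signal: str) -> Tuple[float, List[str]]:
        if signal not in netlist:  # primary input
            return 0.0, [signal]
        if signal not in cache:
            delay, inputs = netlist[signal]
            # The latest-arriving input determines this gate's output time.
            worst_time, worst_path = max(
                (arrival(i) for i in inputs), key=lambda tp: tp[0]
            )
            cache[signal] = (worst_time + delay, worst_path + [signal])
        return cache[signal]

    return arrival(output)


AND2 = 1.0  # assumed unit delay of a two-input AND gate

# Deep form: ((a AND b) AND c) AND d
deep: Netlist = {
    "n1": (AND2, ["a", "b"]),
    "n2": (AND2, ["n1", "c"]),
    "y": (AND2, ["n2", "d"]),
}

# Shallow form: (a AND b) AND (c AND d)
shallow: Netlist = {
    "n1": (AND2, ["a", "b"]),
    "n2": (AND2, ["c", "d"]),
    "y": (AND2, ["n1", "n2"]),
}

deep_delay, deep_path = critical_path(deep, "y")
shallow_delay, shallow_path = critical_path(shallow, "y")
print(deep_delay, deep_path)
print(shallow_delay, shallow_path)
```

Both circuits use three gates, but the deep chain has a critical path of three gate delays while the balanced tree has two. The traced path also shows which gates are worth speeding up.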
Wire Delay
o In many modern chips, the delay through wires is larger than the delay
through gates, so studying the delay through wires is as important as
studying the delay through gates.
o Sources of wire delay:
▪ Delay through resistive interconnect
▪ Delay through RC trees
▪ Delay through inductive interconnect
o Solution:
▪ Buffer insertion: a series of buffers equally spaced along the line
restores the signal.
▪ Wire sizing: wider wires near the source and narrower wires near
the sinks minimize delay.
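Why equally spaced buffers help can be seen with a rough Elmore delay estimate. The sketch below models a uniform wire as a ladder of RC sections and assumes ideal buffers with a fixed delay `T_BUF`, ignoring their drive resistance and input capacitance; all numeric values are illustrative assumptions.

```python
import math


def elmore_delay_uniform_wire(r_total: float, c_total: float,
                              sections: int = 100) -> float:
    """Elmore delay at the far end of a uniform wire split into RC sections."""
    r = r_total / sections
    c = c_total / sections
    # The resistance of section i (1-based) charges capacitors i..N.
    return sum(r * c * (sections - i + 1) for i in range(1, sections + 1))


def buffered_wire_delay(r_total: float, c_total: float,
                        segments: int, t_buf: float) -> float:
    """Delay of the wire cut into equal segments, each driven by a buffer."""
    per_segment = elmore_delay_uniform_wire(r_total / segments,
                                            c_total / segments)
    return segments * (per_segment + t_buf)


R_WIRE = 1000.0   # assumed total wire resistance, in ohms
C_WIRE = 1e-12    # assumed total wire capacitance, in farads
T_BUF = 20e-12    # assumed delay of one ideal buffer, in seconds

unbuffered = elmore_delay_uniform_wire(R_WIRE, C_WIRE)
buffered = buffered_wire_delay(R_WIRE, C_WIRE, segments=4, t_buf=T_BUF)
print(f"unbuffered: {unbuffered * 1e12:.1f} ps")
print(f"4 buffered segments: {buffered * 1e12:.1f} ps")
```

The unbuffered delay grows with the product of total resistance and total capacitance, so it is quadratic in wire length. Cutting the wire into k segments divides that term by k at the cost of k buffer delays, so too many buffers eventually cost more than they save.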