
Feature

FPGA Architecture:
Principles and Progression
Andrew Boutros and Vaughn Betz

Abstract
Since their inception more than thirty years ago, field-programmable gate arrays (FPGAs) have been widely used to implement a myriad of applications from different domains. As a result of their low-level hardware reconfigurability, FPGAs have much faster design cycles and lower development costs compared to custom-designed chips. The design of an FPGA architecture involves many different design choices starting from the high-level architectural parameters down to the transistor-level implementation details, with the goal of making a highly programmable device while minimizing the area and performance cost of reconfigurability. As the needs of applications and the capabilities of process technology are constantly evolving, FPGA architecture must also adapt. In this article, we review the evolution of the different key components of modern commercial FPGA architectures and shed light on their main design principles and implementation challenges.

I. Introduction

Field-programmable gate arrays (FPGAs) are reconfigurable computer chips that can be programmed to implement any digital hardware circuit. As depicted in Fig. 1, FPGAs consist of an array of different types of programmable blocks (logic, IO, and others) that can be flexibly interconnected using pre-fabricated routing tracks with programmable switches between them. The functionality of all the FPGA blocks and the configuration of the routing switches are controlled using millions of static random access memory (SRAM) cells that are programmed (i.e. written) at runtime to realize a specific function. The user describes the desired functionality in a hardware description language (HDL) such as Verilog or VHDL, or possibly uses high-level synthesis to translate C or OpenCL to HDL. The HDL design is then compiled using a complex computer-aided design (CAD) flow into the bitstream file used to program all the FPGA's configuration SRAM cells.

Compared to building a custom application-specific integrated circuit (ASIC), FPGAs have a much lower non-recurring engineering cost and shorter time-to-market. A pre-fabricated off-the-shelf FPGA can be used to implement a complete system in a matter of weeks, skipping the physical design, layout, fabrication, and verification stages that a custom ASIC would normally go through. They also allow continuous hardware upgrades to support new features or fix bugs by simply loading a new bitstream after deployment in-field, thus the name field-programmable.

Digital Object Identifier 10.1109/MCAS.2021.3071607
Date of current version: 24 May 2021

4  IEEE CIRCUITS AND SYSTEMS MAGAZINE 1531-636X/21©2021IEEE SECOND QUARTER 2021

This makes FPGAs a compelling solution for medium and small volume designs, especially with the fast-paced product cycles in today's markets. The bit-level reconfigurability of FPGAs enables implementation of the exact hardware needed for each application (e.g. datapath bitwidth, pipeline stages, number of parallel compute units, memory subsystem, etc.) instead of the fixed one-size-fits-all architecture of general-purpose processors (CPUs) or graphics processing units (GPUs). Consequently, they can achieve higher efficiency than CPUs or GPUs by implementing instruction-free streaming hardware [1] or a processor overlay with an application-customized pipeline and instruction set [2].

These advantages motivated the adoption of FPGAs in many application domains including wireless communications, embedded signal processing, networking, ASIC prototyping, high-frequency trading, and many more [3]–[7]. They have also been recently deployed on a large scale in datacenters to accelerate search engines [8], packet processing [9], and machine learning [10] workloads, among others. However, the flexibility of FPGA hardware comes with an efficiency cost vs. ASICs. Kuon and Rose [11] show that circuits using only the FPGA's programmable logic blocks average 35× larger and 4× slower than corresponding ASIC implementations. A more recent study [12] shows that for full-featured designs which heavily utilize the other FPGA blocks (e.g. RAMs and DSPs), this area gap is reduced but is still 9×. FPGA architects seek to reduce this efficiency gap as much as possible while maintaining the programmability that makes FPGAs useful across a wide range of applications.

In this article, we introduce key principles of FPGA architecture, and highlight the progression of these devices over the past 30 years. Fig. 1 shows how FPGAs evolved from simple arrays of programmable logic and IO blocks to complex heterogeneous multi-die systems with embedded block RAMs, digital signal processing (DSP) blocks, processor subsystems, diverse high-performance external interfaces, system-level interconnect, and more. First, we give a brief overview of the CAD flows and methodology used to evaluate new FPGA architecture ideas. We then detail the architecture challenges and design principles for each of the key components of an FPGA. We highlight key innovations in the design and implementation of each of these components over the past three decades along with areas of ongoing research.

II. FPGA Architecture Evaluation

As shown in Fig. 2, the FPGA architecture evaluation flow consists of three main components: a suite of benchmark applications, an architecture model, and a CAD system. Unlike an ASIC built for a specific functionality, an FPGA is a general-purpose platform designed for many use

Andrew Boutros and Vaughn Betz are with the Department of Electrical and Computer Engineering, University of Toronto and the Vector Institute for
Artificial Intelligence. (email: andrew.boutros@mail.utoronto.ca, vaughn@eecg.utoronto.ca).

cases, some of which may not even exist when the FPGA is architected. Therefore, an FPGA architecture is evaluated based on its efficiency when implementing a wide variety of benchmark designs that are representative of the key FPGA markets and application domains. Typically, each FPGA vendor has a carefully selected set of benchmark designs collected from proprietary system implementations and various customer applications. There are also several open-source benchmark suites such as the classic MCNC20 [13], the VTR [14], and the Titan23 [15] suites which are commonly used in academic FPGA architecture and CAD research. While early academic FPGA research used the MCNC suite of designs, these circuits are now too small (thousands of logic primitives) and simple (only IOs and logic) to represent modern FPGA use cases. The VTR and particularly the Titan suite are larger and more complex, making them more representative, but as FPGA capacity and application complexity continue to grow, new benchmark suites are regularly needed.

The second part of the evaluation flow is the FPGA architecture model. The design of an FPGA involves many different decisions from architecture-level organization (e.g. number and type of blocks, distribution of wire segment lengths, size of logic clusters and logic elements) down to transistor-level circuit implementation (e.g. programmable switch type, routing buffer transistor sizing, register implementation). It also involves different implementation styles; the logic blocks and programmable routing are designed and laid out as full-custom circuits, while most hardened blocks (e.g. DSPs) mix standard-cell and full-custom design for the block core and peripherals, respectively. Some blocks (RAM, IO) even include significant analog circuitry. All these different components need to be carefully modeled to evaluate the FPGA architecture in its entirety. This is typically captured using an architecture description file that specifies the organization and types of the different FPGA blocks and the routing architecture, in addition to area, timing and power models obtained from circuit-level implementations for each of these components.
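The kind of information such an architecture description captures can be sketched with a toy data structure. Real flows such as VTR use a much richer XML format; the block types, counts, and area/delay numbers below are invented purely for illustration.

```python
# Toy sketch of what an FPGA architecture description file captures:
# the set of block types with physical models, plus the routing fabric.
# All names and numbers are invented illustrative values, not real data.

toy_architecture = {
    "blocks": {
        "logic":     {"count": 1000, "area_um2": 500,  "delay_ns": 0.4},
        "block_ram": {"count": 100,  "area_um2": 9000, "delay_ns": 1.2},
        "dsp":       {"count": 50,   "area_um2": 6000, "delay_ns": 1.0},
    },
    "routing": {
        "wire_segment_lengths": [1, 4, 16],   # in logic-block spans
        "switch_type": "buffered_mux",
    },
}

def total_block_area(arch):
    """Upper bound on block area if every instance were used."""
    return sum(b["count"] * b["area_um2"] for b in arch["blocks"].values())

print(total_block_area(toy_architecture))  # 1700000
```

An evaluation flow reads a description like this together with the benchmark netlists, so a new architecture idea is tested by editing the description rather than redesigning silicon.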
Figure 1. Early FPGA architecture with programmable logic and IOs vs. modern heterogeneous FPGA architecture with RAMs, DSPs, and other hard blocks. All blocks are interconnected using bit-level programmable routing.

Figure 2. FPGA architecture evaluation flow.

Finally, a re-targetable CAD system such as VTR [14] is used to map the selected benchmark applications on the specified FPGA architecture. Such a CAD system consists of a sequence of complex optimization algorithms that synthesizes a benchmark written in an HDL into a circuit netlist, maps it to the different FPGA blocks, places the mapped blocks at specific locations on the FPGA, and routes the connections between them using the specified programmable routing architecture. The implementation produced by the CAD system is then used to evaluate several key metrics. Total area is the sum of the areas of the FPGA blocks used by the application, along with the programmable routing included with them. A timing analyzer finds the critical path(s) through the blocks and routing to determine the maximum frequencies of the application's clock(s). Power consumption is estimated based on resources used and signal toggle rates. FPGAs are never designed for only one application, so these metrics are averaged across all the benchmarks. Finally, the overall evaluation blends these average area, delay, and power metrics appropriately depending on the architecture goal (e.g. high performance or low power). Other metrics such as CAD tool runtime and whether or not the

CAD tools fail to route some benchmarks on an architecture are also often considered.

As an example, a key set of questions in FPGA architecture is: What functionality should be hardened (i.e. implemented as a new ASIC-style block) in the FPGA architecture? How flexible should this block be? How much of the FPGA die area should be dedicated to it? Ideally, an FPGA architect would like the hardened functionality to be usable by as many applications as possible at the least possible silicon cost. An application that can make use of the hard block will benefit by being smaller, faster and more power-efficient than when implemented solely in the programmable fabric. This motivates having more programmability in the hard block to capture more use cases; however, higher flexibility generally comes at the cost of larger area and reduced efficiency of the hard block. On the other hand, if a hard block is not usable by an application circuit, its silicon area is wasted; the FPGA user would rather have more of the usable general-purpose logic blocks in the area of the unused hard block. The impact of this new hard block on the programmable routing must also be considered: does it need more interconnect or lead to slow routing paths to and from the block? To evaluate whether a specific functionality should be hardened or not, both the cost and gain of hardening it have to be quantified empirically using the flow described in this section. FPGA architects may try many ideas before landing on the right combination of design choices that adds just the right amount of programmability in the right spots to make this new hard block a net win.

In the following section, we detail many different components of FPGAs and key architecture questions for each. While we describe the key results without detailing the experimental methodology used to find them, in general they came from a holistic architecture evaluation flow similar to that in Fig. 2.

III. FPGA Architecture Evolution

A. Programmable Logic
The earliest reconfigurable computing devices were programmable array logic (PAL) architectures. PALs consisted of an array of and gates feeding another array of or gates, as shown in Fig. 3, and could implement any Boolean logic expression as a two-level sum-of-products function. PALs achieve configurability through programmable switches that select the inputs to each of the and/or gates to implement different Boolean expressions. The design tools for PALs were very simple since the delay through the device is constant no matter what logic function is implemented. However, PALs do not scale well; as device logic capacity increased, the wires forming the and/or arrays became increasingly longer and slower and the number of programmable switches required grew quadratically.

Figure 3. Programmable array logic (PAL) architecture with an and array feeding an or array. The crosses are reconfigurable switches that are used to program any Boolean expression as a two-level sum-of-products function.

Subsequently, complex programmable logic devices (CPLDs) kept the and/or arrays as the basic logic elements, but attempted to solve the scalability challenge by integrating multiple PALs on the same die with a crossbar interconnect between them at the cost of more complicated design tools. Shortly after, Xilinx pioneered the first lookup-table-based (LUT-based) FPGA in 1984, which consisted of an array of SRAM-based LUTs with programmable interconnect between them. This style of reconfigurable devices was shown to scale very well, with LUTs achieving much higher area efficiency compared to the and/or logic in PALs and CPLDs. Consequently, LUT-based architectures became increasingly dominant and today LUTs form the fundamental logic

element in all commercial FPGAs. Several research attempts [16]–[18] investigated replacing LUTs with a different form of configurable and gates: a full binary tree of and gates with programmable output/input inversion known as an and-inverter cone (AIC). However, when thoroughly evaluated in [19], AIC-based FPGA architectures had significantly larger area than LUT-based ones, with delay gains only on small benchmarks that have short critical paths.

Figure 4. (a) Transistor-level implementation of a 4-LUT with internal buffers between the second and third LUT stages, (b) Two-level multiplexer circuitry, (c) Basic logic element (BLE), and (d) Logic block (LB) internal architecture.

A K-LUT can implement any K-input Boolean function by storing its truth table in configuration SRAM cells. K input signals are used as multiplexer select lines to choose an output from the 2^K values of the truth table. Fig. 4(a) shows the transistor-level circuit implementation

of a 4-LUT using pass-transistor logic. In addition to the output buffer, an internal buffering stage (shown between the second and third stages of the LUT in Fig. 4(a)) is typically implemented to mitigate the quadratic increase in delay when passing through a chain of pass-transistors. The sizing of the LUT's pass-transistors and the internal/output buffers is carefully tuned to achieve the best area-delay product. Classic FPGA literature [20] defines the basic logic element (BLE) as a K-LUT coupled with an output register and bypassing 2:1 multiplexers as shown in Fig. 4(c). Thus, a BLE can either implement just a flip-flop (FF) or a K-LUT with registered or unregistered output. As illustrated in Fig. 4(d), BLEs are typically clustered in logic blocks (LBs), such that an LB contains N BLEs along with local interconnect. The local interconnect in the logic block consists of multiplexers between signal sources (BLE outputs and logic block inputs) and destinations (BLE inputs). These multiplexers are often arranged to form a local full [21] or partial [22] crossbar. At the circuit level, these multiplexers are usually built as two levels of pass transistors, followed by a two-stage buffer as shown in Fig. 4(b); this is the most efficient circuit design for FPGA multiplexers in most cases [23]. Fig. 4(d) also shows the switch and connection block multiplexers forming the programmable routing that allows logic blocks to connect to each other; this routing is discussed in detail in Section III-B.

Over the years, the sizes of both LUTs (K) and LBs (N) have gradually increased as device logic capacity has grown. As K increases, more functionality can be packed into a single LUT, reducing not only the number of LUTs needed but also the number of logic levels on the critical path, which increases performance. In addition, the demand for inter-LB routing decreases as more connections are captured into the fast local interconnect by increasing N. On the other hand, the area of the LUT increases exponentially with K (due to the 2^K SRAM cells) and its speed degrades linearly (as the multiplexer constitutes a chain of K pass transistors with periodic buffering). If the LB local interconnect is implemented as a crossbar, its size increases quadratically and its speed degrades linearly with the number of BLEs in the LB, N. Ahmed and Rose [24] empirically evaluated these trade-offs and concluded that LUTs of size 4–6 and LBs of size 3–10 BLEs offer the best area-delay product for an FPGA architecture, with 4-LUTs leading to a better area but 6-LUTs yielding a higher speed. Historically, the first LUT-based FPGA from Xilinx, the XC2000 series in 1984, had an LB that contained only two 3-LUTs (i.e. N = 2, K = 3). LB size gradually increased over time and by 1999, Xilinx's Virtex family included four 4-LUTs and Altera's Apex 20K family included ten 4-LUTs in each LB.

Figure 5. 6-LUT fracturable into two 5-LUTs with (a) no additional input ports, leading to 5 shared inputs (A-E) or (b) two additional input ports and steering multiplexers, leading to only 2 shared inputs (C, D).

The next major change in architecture came in 2003 from Altera, with the introduction of fracturable LUTs in their Stratix II architecture [25]. Ahmed and Rose in [24] showed that an LB with ten 6-LUTs achieved 14% higher performance than an LB with ten 4-LUTs, but at a 17% higher area. Fracturable LUTs seek to combine the best of both worlds, achieving the performance of a larger LUT with the area-efficiency of smaller LUTs. A major factor in the area increase with traditional 6-LUTs is under-utilization: Lewis et al. found that 64% of the LUTs in benchmark applications used fewer than 6 inputs, wasting some of a 6-LUT's functionality [26]. A fracturable {K, M}-LUT can be configured as a single LUT of size K or can be fractured into two LUTs of size up to K – 1 that collectively use no more than K + M distinct inputs. Fig. 5(a) shows that a 6-LUT is internally composed of two 5-LUTs plus a 2:1 multiplexer. Consequently, almost no circuitry (only the red added output) is necessary to allow a 6-LUT to instead operate as two 5-LUTs that share the same inputs. However, requiring the two 5-LUTs to share all their

inputs will limit how often both can be simultaneously used. Adding extra routing ports as shown in Fig. 5(b) increases the area of the fracturable 6-LUT, but makes it easier to find two logic functions that can be packed together into it. The adaptive logic module (ALM) in the Stratix II architecture implemented a {6, 2}-LUT that had 8 input and 2 output ports. Thus, an ALM can implement a 6-LUT or two 5-LUTs sharing 2 inputs (and therefore a total of 8 distinct inputs). Pairs of smaller LUTs could also be implemented without any shared inputs, such as two 4-LUTs or one 5-LUT and one 3-LUT. With a fracturable 6-LUT, larger logic functions are implemented in 6-LUTs, reducing the logic levels on the critical path and achieving performance improvement. On the other hand, smaller logic functions can be packed together (each using only half an ALM), improving area-efficiency. The LB in Stratix II not only increased the performance by 15%, but also reduced the logic and routing area by 2.6% compared to a baseline 4-LUT-based LB [26].

Xilinx later adopted a related fracturable LUT approach in their Virtex-5 architecture. Like Stratix II, a Virtex-5 6-LUT can be decomposed into two 5-LUTs. However, Xilinx chose to minimize the extra circuitry added for fracturability as shown in Fig. 5(a): no extra input routing ports or steering multiplexers are added. This results in a lower area per fracturable LUT, but makes it more difficult to pack two smaller LUTs together as they must use no more than 5 distinct inputs [27]. While subsequent architectures from both Altera/Intel and Xilinx have also been based on fracturable 6-LUTs, a recent Microsemi study [28] revisited the 4-LUT vs. 6-LUT efficiency trade-off for newer process technologies, CAD tools and designs than those used in [24]. It shows that a LUT structure with two tightly coupled 4-LUTs, one feeding the other, can achieve performance close to plain 6-LUTs along with the area and power advantages of 4-LUTs. In terms of LB size, FPGA architectures from Altera/Intel and Xilinx converged on the use of relatively large LBs with ten and eight BLEs respectively, for several generations. However, the recently announced Versal architecture from Xilinx further increases the number of BLEs per LB to thirty-two [29]. The reasons for this large increase are two-fold. First, inter-LB wire delay is scaling poorly with process shrinks, so capturing more connections within an LB's local routing is increasingly beneficial. Second, ever-larger FPGA designs tend to increase CAD tool runtime, but larger LBs can help mitigate this trend by simplifying placement and inter-LB routing.

Another important architecture choice is the number of FFs per BLE. Early FPGAs coupled a (non-fracturable) LUT with a single FF as shown in Fig. 4(c). When they moved to fracturable LUTs, both Altera/Intel and Xilinx architectures added a second FF to each BLE so that both outputs of the fractured LUT could be registered as shown in Fig. 5(a) and 5(b). In the Stratix V architecture, the number of FFs was further increased from two to four per BLE in order to accommodate increased demand for FFs as designs became more deeply pipelined to achieve higher performance [30]. Low-cost multiplexing circuitry allows sharing the existing inputs between the LUTs and FFs to avoid adding more costly routing ports. Stratix V also implements FFs as pulse latches instead of edge-triggered FFs. As shown in Fig. 6(b), this removes one of the two latches that would be present in a master-slave FF (Fig. 6(a)), reducing the register delay and area. A pulse latch acts as a cheaper FF with worse hold time as it latches the data input during a very short pulse instead of a clock edge as in conventional FFs. If a pulse generator was built for each FF, the overall area per FF would increase rather than decrease. Instead, Stratix V contains only two configurable pulse generators per LB; each of the 40 pulse latches in an LB selects which generator provides its pulse input. The FPGA CAD tools can also program the pulse width in these generators, allowing a limited amount of time borrowing between source and destination registers. Longer pulses further degrade hold time, but generally any hold violations can be solved by the FPGA routing algorithm using longer wiring paths to delay signals. Xilinx also uses pulse latches as its FFs in its Ultrascale+ architecture [31].

Figure 6. Circuitry for (a) Master-slave positive-edge-triggered flip-flop, and (b) Pulse latch.

Arithmetic operations (add and subtract) are very common in FPGA designs: Murray et al. found that 22% of the logic elements in a suite of FPGA designs were implementing arithmetic [32]. While these operations can be implemented with LUTs, each bit of arithmetic in a ripple carry adder requires two LUTs (one for the sum output and one for the carry). This leads to both high

logic utilization and a slow critical path due to connecting many LUTs in series to compute carries for multi-bit additions. Consequently, all modern FPGA architectures include hardened arithmetic circuitry in their logic blocks. There are many variants, but all have several common points. First, to avoid adding expensive routing ports, the arithmetic circuits re-use the LUT routing ports or are fed by the LUT outputs. Second, the carry bits are propagated on special, dedicated interconnect with little or no programmability so that the crucial carry path is fast. The lowest cost arithmetic circuitry hardens ripple carry structures and achieves a large speed gain over LUTs (3.4× for a 32-bit adder in [32]). Hardening more sophisticated structures like carry skip adders further improves speed (an additional 20% speed-up at 32-bits in [33]). The latest Versal architecture from Xilinx hardens the carry logic for 8-bit carry look-ahead adders (i.e. the addition can only start on every eighth BLE), while the sum, propagate and generate logic is all implemented in the fracturable 6-LUTs feeding the carry logic as shown in Fig. 7(a) [29]. This organization allows implementing 1-bit of arithmetic per logic element. On the other hand, the latest Intel Agilex architecture can implement 2-bits of arithmetic per logic element, with dedicated interconnect for the carry between logic elements as shown in Fig. 7(b). It achieves that by hardening 2-bit carry-skip adders that are fed by the four 4-LUTs contained within a 6-LUT [34]. The study by Murray et al. [32] shows that the combination of fracturable LUTs and 2 bits of arithmetic (similar to that adopted in Altera/Intel FPGAs) is particularly efficient compared to architectures with non-fracturable LUTs or 1 bit of arithmetic per logic element. It also concludes that having dedicated arithmetic circuits (i.e. hardening adders and carry chains) inside the FPGA logic elements increases average performance by 75% and 15% for arithmetic microbenchmarks and general benchmark circuits, respectively.

Recently, deep learning (DL) has become a key workload in many end-user applications, with its core operation being multiply-accumulate (MAC). Generally, MACs can be implemented in DSP blocks as will be described in Section III-E; however, low-precision MACs with 8-bit or narrower operands (which are becoming increasingly popular in DL workloads) can also be implemented efficiently in the programmable logic [9]. LUTs are used to generate the partial products of a multiplier array followed by an adder tree to reduce the partial products and perform the accumulation. Consequently, multiple recent studies [35]–[37] have investigated increasing the density of hardened adders in the FPGA's logic fabric to enhance its performance when implementing arithmetic-heavy applications such as DL acceleration. The work in [36] and [37] proposed multiple different logic block architectures that incorporate 4 bits of arithmetic per logic element arranged in one or two carry chains with different configurations, instead of just 2 bits of arithmetic in an Intel Stratix-like ALM. These proposals do not require increasing the number of the (relatively expensive) routing ports in the logic clusters when implementing multiplications due to the high degree of input sharing in a multiplier array (i.e. for an N-bit multiplier, only 2N inputs are needed to generate N^2 partial products). The most promising of these proposals increases the density of MAC operations by 1.7× while simultaneously improving their speed. It also reduces the required logic and routing area by 8% for general benchmarks, highlighting that more arithmetic density is beneficial for applications beyond DL.

Figure 7. Overview of the hard arithmetic circuitry (in red) in the logic elements of (a) Xilinx and (b) Altera/Intel FPGAs. A[i] and B[i] are the ith bits of the two addition operands A and B. The Xilinx LEs compute carry propagate and generate in the LUTs, while the Altera/Intel ones use LUTs to pass inputs to the hard adders. Unlabeled inputs are unused when implementing adders.

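The division of labor in these hard arithmetic circuits can be modeled in software. The following Python sketch is an illustrative model only (not code from the article; all names are ours): the soft LUTs of each logic element compute the propagate and generate signals for one bit, while the dedicated carry chain ripples the carry directly to the next element without entering the general routing.

```python
def ripple_carry_adder(a_bits, b_bits, cin=0):
    """Model of a hard carry chain across logic elements.

    a_bits/b_bits are LSB-first lists of 0/1. Each loop iteration
    models one logic element: the LUTs compute propagate (p) and
    generate (g), the sum output uses p XOR carry-in, and the hard
    carry interconnect forwards the carry to the next element.
    """
    sum_bits, carry = [], cin
    for a, b in zip(a_bits, b_bits):
        p, g = a ^ b, a & b          # computed in the soft LUTs
        sum_bits.append(p ^ carry)   # sum output of this element
        carry = g | (p & carry)      # hard carry chain to next element
    return sum_bits, carry
```

For example, adding 11 (`[1, 1, 0, 1]` LSB-first) and 5 (`[1, 0, 1, 0]`) yields a sum of 16: all four sum bits are 0 and the carry-out is 1.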
SECOND QUARTER 2021 IEEE CIRCUITS AND SYSTEMS MAGAZINE 11

Authorized licensed use limited to: Universidad Federal de Pernambuco. Downloaded on September 18,2022 at 03:04:10 UTC from IEEE Xplore. Restrictions apply.
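The soft-logic MAC scheme described earlier (LUTs generating the partial products of a multiplier array, followed by an adder tree that reduces them and performs the accumulation) can be sketched as a minimal Python model; the function name and structure are illustrative assumptions, not from the article.

```python
def mac_partial_products(a, b, acc, width=8):
    """Multiply two unsigned `width`-bit operands and accumulate.

    Mimics the soft-logic structure: each partial-product row is the
    AND of `a` with one bit of `b` (as LUTs would compute), shifted
    into position; a pairwise adder tree then reduces the rows.
    Illustrative model only.
    """
    assert 0 <= a < 2**width and 0 <= b < 2**width
    # Partial products: one row per bit of b.
    rows = [(a if (b >> i) & 1 else 0) << i for i in range(width)]
    # Adder tree: pairwise reduction, log2(width) levels deep.
    while len(rows) > 1:
        rows = [rows[i] + rows[i + 1] for i in range(0, len(rows) - 1, 2)] \
               + ([rows[-1]] if len(rows) % 2 else [])
    return acc + rows[0]
```

For instance, `mac_partial_products(7, 9, 5)` forms two nonzero rows (7 and 7 << 3), reduces them to 63, and returns 68 after accumulation.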
B. Programmable Routing
Programmable routing commonly accounts for over 50% of both the fabric area and the critical path delay of applications [38], so its efficiency is crucial. Programmable routing is composed of pre-fabricated wiring segments and programmable switches. By programming an appropriate sequence of switches to be on, any function block output can be connected to any input. There are two main classes of FPGA routing architecture. Hierarchical FPGAs are inspired by the fact that designs are inherently hierarchical: higher-level modules instantiate lower-level modules and connect signals between them. Communication is more frequent between modules that are near each other in the design hierarchy, and hierarchical FPGAs can realize these connections with short wires that connect small regions of a chip. As shown in Fig. 8, to communicate with more distant regions of a hierarchical FPGA, a connection (highlighted in red) passes through multiple wires and switches as it traverses different levels of the interconnect hierarchy. This style of architecture was popular in many earlier FPGAs, such as Altera's 7K and Apex 20K families, but it leads to very long wires at the upper levels of the interconnect hierarchy, which became problematic as process scaling made such wires increasingly resistive. A strictly hierarchical routing architecture also results in some blocks that are physically close together (e.g. the blue blocks in Fig. 8) which still require several wires and switches to connect. Consequently, it is primarily used today for smaller FPGAs, such as the FlexLogix FPGA IP cores that can be embedded in larger SoC designs [39].

Figure 8. Routing architecture in hierarchical FPGAs.

The other type of FPGA interconnect is island-style, as depicted in Fig. 9. This architecture was pioneered by Xilinx and is inspired by the fact that a regular two-dimensional layout of horizontal and vertical directed wire segments can be efficiently laid out. As shown in Fig. 9, island-style routing includes three components: routing wire segments, connection blocks (multiplexers) that connect function block inputs to the routing wires, and switch blocks (programmable switches) that connect routing wires together to realize longer routes. The placement engine in FPGA CAD tools chooses which function block implements each element of a design in order to minimize the required wiring. Consequently, most connections between function blocks span a small distance and can be implemented with a few routing wires, as illustrated by the red connection in Fig. 9.

Figure 9. Island-style routing architecture. Thick solid lines are routing wires while dashed lines are programmable switches. Connection and switch blocks are shaded in yellow and green, respectively.

Creating a good routing architecture involves managing many complex trade-offs. It should contain enough programmable switching and wire segments that the vast majority of circuits can be implemented; however, too many wires and switches waste area. A routing architecture should also match the needs of applications: ideally short connections will be made with short wires to minimize capacitance and layout area, while long connections can use longer wiring segments to avoid the extra

delay of passing through many routing switches. Some of the routing architecture parameters include: how many routing wires each logic block input or output can connect to (Fc), how many other routing wires each wire can connect to (Fs), the lengths of the routing wire segments, the routing switch pattern, the electrical design of the wires and switches themselves, and the number of routing wires per channel [20]. In Fig. 9 for example, Fc = 3, Fs = 3, the channel width is 4 wires, and some routing wires are of length 1, while others are of length 2. Fully evaluating these trade-offs for target applications and at a specific process node requires experimentation using a full CAD flow as detailed in Section II.

Early island-style architectures incorporated only short wires that traversed a single logic block between programmable switches. Later research showed that this resulted in more programmable switches than necessary, and that making all wiring segments span four logic blocks before terminating reduced application delay by 40% and routing area by 25% [40]. Modern architectures include multiple lengths of wiring segments to better match the needs of short and long connections, but the most plentiful wire segments remain of moderate length, with four logic blocks being a popular choice. Longer distance connections can achieve lower delay using longer wire segments, but in recent process nodes wires that span many (e.g. 16) logic blocks must use wide and thick metal traces on upper metal layers to achieve acceptable resistance [41]. The amount of such long-distance wiring one can include in a metal stack is limited. To best leverage such scarce wiring, Intel's Stratix FPGAs allow long wire segments to be connected only to short wire segments, rather than function block inputs or outputs [42]. This creates a form of routing hierarchy within an island-style FPGA, where short connections use only the shorter wires, but longer connections pass through short wires to reach the long wire network. Another area where hierarchical FPGA concepts are used within island-style FPGAs is within the logic blocks. As illustrated in Fig. 4(d), most logic blocks now group multiple BLEs together with local routing. This means each logic block is a small cluster in a hierarchical FPGA; island-style routing interconnects the resulting thousands of logic clusters.

There has been a great deal of research into the optimal amount of switching, and how to best arrange the switches. While there are many detailed choices, a few principles have emerged. The first is that the connectivity between function block pins and wires (Fc) can be relatively low: typically only 10% or less of the wires that pass by a pin will have switches to connect to it. Similarly, the number of other wires that a routing wire can connect to at its end (Fs) can also be low, but it should be at least 3 so that a signal can turn left, right, or go straight at a wire end point. The local routing in a logic cluster (described in Section III-A) allows some block inputs and some block outputs to be swapped during routing. By leveraging this extra degree of flexibility and considering all the options presented by the multi-stage programmable routing network, the routing CAD tool can achieve high completion rates even with low Fc and Fs values. Switch patterns that give more options to the routing CAD tool also help routability; for example, the Wilton switch pattern ensures that following a different sequence of channels lets the router reach different wire segments near a destination block [43].

Figure 10. Different implementations for SRAM-controlled programmable switches using pass transistors (left), tri-state buffers (middle), or buffered multiplexers (right).

There are also multiple options for the electrical design of programmable switches, as shown in Fig. 10. Early FPGAs used pass gate transistors controlled by SRAM cells to connect wires. While this is the smallest switch possible in a conventional CMOS process, the delay of routing wires connected in series by pass transistors grows quadratically, making them very slow for large FPGAs. Adding some tri-state buffer switches costs area, but improves speed [40]. Most recent FPGAs primarily use a multiplexer built out of pass gates followed by a buffer that cannot be tri-stated, as shown in detail in Fig. 4(b). The pass transistors in this direct drive switch can be small as they are lightly loaded, while the buffer can be larger to drive the significant capacitance of a routing wire segment. Such direct drive switches create a major constraint on the switch pattern: a wire can only be driven at one point, so only function block outputs and routing wires near that point can feed its routing multiplexer inputs and hence be possible signal sources. Despite this constraint, both academic and industrial work has concluded that direct drive switches improve both area and speed due to their superior electrical characteristics [42], [44]. The exception is expensive or rare wires such as long wires implemented on wide metal traces on upper metal layers or the interposer-crossing wires discussed later in Section III-G. These wires often have multiple tri-state buffers that can drive them, as the cost of

these larger programmable switches is merited to allow more flexible usage of these expensive wires.

A major challenge for FPGA routing is that the delay of long wires is not improving with process scaling, which means that the delay to cross the chip is stagnating or increasing even as clock frequencies rise. This has led FPGA application developers to increase the amount of pipelining in their designs, thereby allowing multiple clock cycles for long routes. To make this strategy more effective, some FPGA manufacturers have integrated registers within the routing network itself. Intel's Stratix 10 device allows each routing driver (i.e. multiplexer followed by a buffer) to be configured as a pulse latch as shown in Fig. 6(b), thereby acting as a register with low delay but relatively poor hold time. This allows deep pipelining of interconnect without using expensive logic resources, at the cost of a modest area and delay increase to the routing driver [45]. Hold time concerns mean that using pulse latches in immediately consecutive Stratix 10 routing switches is not possible, so Intel refined this approach in their next-generation Agilex devices by integrating actual registers (with better hold time) on only one-third of the interconnect drivers (to mitigate the area cost) [34]. Rather than integrating registers throughout the interconnect, Xilinx's Versal devices instead add bypassable registers only on the inputs to function blocks. Unlike Intel's interconnect registers, these input registers are full-featured, with clock enable and clear signals [46].

C. Programmable IO
FPGAs include unique programmable IO structures to allow them to communicate with a very wide variety of other devices, making FPGAs the communications hub of many systems. For a single set of physical IOs to programmably support many different IO interfaces and standards is challenging, as it requires adaptation to different voltage levels, electrical characteristics, timing specifications, and command protocols. Both the value and the challenge of programmable IO are highlighted by the large area devoted to IOs on FPGAs. For example, Altera's Stratix II (90 nm) devices support 28 different IO standards and devote 20% (largest device) to 48% (smallest device) of their die area to IO-related structures. As Fig. 11 shows, FPGAs address this challenge using a combination of approaches [47]–[49]. First, FPGAs use IO buffers that can operate across a range of voltages. These IOs are grouped into banks (commonly on the order of 50 IOs per bank), where each bank has a separate Vddio rail for the IO buffer. This allows different banks to operate at different voltage levels; e.g. IOs in one bank could be operating at 1.8 V while those in a different bank operate at 1.2 V. Second, each IO can be used separately for single-ended standards, or pairs of IOs can be programmed to form the positive and negative line for differential IO standards. Third, IO buffers are implemented with multiple parallel pull-up and pull-down transistors so that their drive strengths can be programmably adjusted by enabling or disabling different numbers of pull-up/pull-down pairs. By programming some pull-up or pull-down transistors to be enabled even when no output is being driven, FPGA IOs can also be programmed to implement different on-chip termination resistances to minimize signal reflections. Programmable delay chains provide a fourth level of configurability, allowing fine delay adjustments of signal timing to and from the IO buffer.

In addition to electrical and timing programmability, FPGA IO blocks contain additional hardened digital circuitry to simplify capturing and transferring IO data to the fabric. Generally some or all of this hardened circuitry can be bypassed by SRAM-controlled muxes, allowing FPGA users to choose which hardened functions are desirable for a given design and IO protocol. Part ➄ of Fig. 11 shows a number of common digital logic options on the IO input path: a capture register, double to single-data rate conversion registers (used with DDR memories), and serial-to-parallel converters to allow transfer to the fabric at a lower frequency. Most FPGAs now also contain by-passable higher-level blocks that connect to a group of IOs and implement higher-level protocols like DDR memory controllers. Together these approaches allow the general-purpose FPGA IOs to service many different protocols, at speeds up to 3.2 Gb/s.

The highest speed IOs implement serial protocols, such as PCIe and Ethernet, that embed the clock in data transitions and can run at 28 Gb/s or more. To achieve these speeds, FPGAs include a separate group of differential-only IOs with less voltage and electrical programmability; they can only be used as serial transceivers [50]. Just as for the general-purpose IOs, these serial IOs have a sequence of high-speed hardened circuits between them and the fabric, some of which can be optionally bypassed to allow end-users to customize the exact interface protocol.

Overall, FPGA IO design is very challenging, due to the dual (and competing) demands to make the IO not only very fast but also programmable. In addition, distributing the very high data bandwidths from IO interfaces requires wide soft buses in the fabric, which creates additional challenges as discussed later in Section III-F.

D. On-Chip Memory
The first form of on-chip memory elements in FPGA architectures was FFs integrated in the FPGA's logic blocks as described in Section III-A. However, as FPGA logic capacity grew, they were used to implement larger systems which almost always require memory to buffer and

Figure 11. Overview of the different techniques for implementing programmable IOs on FPGAs. (Callouts: ① different Vddio rails for the IO buffers in different banks; ② each pair of IOs configurable as two single-ended IOs or one differential IO; ③ programmable drive strength of output buffers via multiple parallel pull-up/pull-down transistors, and programmable termination resistances to minimize signal reflections; ④ programmable delay chains (PDC); ⑤ different options for capturing input.)
re-use data on chip, making it highly desirable to have denser on-chip storage since building large RAMs out of registers and LUTs is over 100× less dense than an (ASIC-style) SRAM block. At the same time, the RAM needs of applications implemented on FPGAs are very diverse, including (but not limited to) small coefficient storage RAMs for FIR filters, large buffers for network packets, caches and register files for processor-like modules, read-only memory for instructions, and FIFOs of myriad sizes to decouple computation modules. This means that there is no single RAM configuration (capacity, word width, number of ports) used universally in FPGA designs, making it challenging to decide on what kind(s) of RAM blocks should be added to an FPGA such that they are efficient for a broad range of uses. The first FPGA to include hard functional blocks for memory (block RAMs or BRAMs) was the Altera Flex 10K in 1995 [51]. It included columns of small (2 kb) BRAMs that connect to the rest of the fabric through the programmable routing. FPGAs have gradually incorporated larger and more diverse BRAMs, and it is typical for ~25% of the area of a modern FPGA to be dedicated to BRAMs [52].

Figure 12. Organization and circuitry of a conventional dual-port SRAM-based FPGA BRAM. The components highlighted in blue are common in any SRAM-based memory module, while those highlighted in green are FPGA-specific. This BRAM has a maximum data width of 8 bits, but the output crossbar is configured for 4-bit output mode.

An FPGA BRAM consists of an SRAM-based memory core, with additional peripheral circuitry to make it more configurable for multiple purposes and to connect it to the programmable routing. An SRAM-based BRAM is typically organized as illustrated in Fig. 12. It consists of a two-dimensional array of SRAM cells to store bits, and a considerable amount of peripheral circuitry to orchestrate access to these cells for read/

write operations. To simplify timing of the read and write operations, all modern FPGA BRAMs register all their inputs. During a write operation, the column decoder activates the write drivers, which in turn charge the bitlines (BL and its complement BL̄) according to the input data to be written to the memory cells. Simultaneously, the row decoder activates the wordline of the row specified by the input write address, connecting one row of cells to their bitlines so they are overwritten with new data. During a read operation, both BL and BL̄ are pre-charged high and then the row decoder activates the wordline of the row specified by the input read address. The contents of the activated cells cause a slight difference in the voltage between BL and BL̄, which is sensed and amplified by the sense amplifier circuit to produce the output data [52].

The main architectural decisions in designing FPGA BRAMs are choosing their capacity, data word width, and number of read/write ports. More capable BRAMs cost more silicon area, so architects must carefully balance BRAM design choices while taking into account the most common use cases in application circuits. The area occupied by the SRAM cells grows linearly with the capacity of the BRAM, but the area of the peripheral circuitry and the number of routing ports grows sub-linearly. This means that larger BRAMs have lower area per bit, making large on-chip buffers more efficient. On the other hand, if an application requires only small RAMs, much of the capacity of a larger BRAM may be wasted. Similarly, a BRAM with a larger data width can provide higher data bandwidth to downstream logic. However, it costs more area than a BRAM with the same capacity but a smaller word width, as the larger data word width necessitates more sense amplifiers, write drivers and programmable routing ports. Finally, increasing the number of read/write ports to a BRAM increases the area of both the SRAM cells and the peripheral circuitry, but again increases the data bandwidth the BRAM can provide and allows more diverse uses. For example, FIFOs (which are ubiquitous in FPGA designs) require both a read and a write port. The implementation details of a dual-port SRAM cell are shown at the bottom of Fig. 12. Implementing a second port to the SRAM cell (port B, highlighted in red) adds two transistors, increasing the area of the SRAM cells by 33%. In addition, the second port also needs an additional copy of the sense amplifiers, write drivers and row decoders (the "Read/Write Circuitry B" and "Row Decoder B" blocks in Fig. 12). If both ports are read/write (r/w), we also have to double the number of ports to the programmable routing.

Because the FPGA on-chip memory must satisfy the needs of every application implemented on that FPGA, it is also common to add extra configurability to BRAMs to allow them to adapt to application needs [53], [54]. FPGA BRAMs are designed to have configurable width and depth by adding low-cost multiplexing circuitry to the peripherals of the memory array. For example, in Fig. 12 the actual SRAM array is implemented as a 4 × 8-bit array, meaning it naturally stores 8-bit data words. By adding multiplexers controlled by 3 address bits to the output crossbar, and extra decoding and enabling logic to the read/write circuitry, this RAM can also operate in 8 × 4-bit, 16 × 2-bit or 32 × 1-bit modes. The width configurability decoder (WCnfg Dec.) selects between Vdd and address bits, as shown in the top-left of Fig. 12 for a maximum word size of 8 bits. The multiplexers are programmed using configuration SRAM cells and are used to generate the column select (CS) and write enable (Wen) signals that control the sense amplifiers and write drivers for narrow read and write operations, respectively. For typical BRAM sizes (several kb or more), the cost of this additional width configurability circuitry is small compared to the cost of a conventional SRAM array, and it does not require any additional routing ports.

Another unique component of FPGA BRAMs compared to conventional memory blocks is their interface to the programmable routing fabric. This interface is generally designed to be similar to that of the logic blocks described in Section III-A; it is easier to create a routing architecture that balances flexibility and cost well if all block types connect to it in similar ways. Connection block multiplexers, followed by local crossbars in some FPGAs, form the BRAM input routing ports, while the read outputs drive switch block multiplexers to form the output routing ports. These routing interfaces are costly, particularly for small BRAMs; they constitute 5% to 35% of the BRAM tile area for 256 kb down to 8 kb BRAMs, respectively [55]. This motivates minimizing the number of routing ports to a BRAM as much as possible without unduly compromising its functionality. Table I summarizes the number of routing ports required for different numbers and types of BRAM read and write ports. For example, a single-port BRAM (1r/w) requires W + log2(D) input ports for write data and read/write address, and W

Table I. Number of routing ports needed for different numbers and types of BRAM read/write ports (W: data width, D: BRAM depth).

BRAM Ports   BRAM Mode              # Routing Ports
1r           Single-port ROM        log2(D) + W
1r/w         Single-port RAM        log2(D) + 2W
1r+1w        Simple dual-port RAM   2 log2(D) + 2W
2r/w         True dual-port RAM     2 log2(D) + 4W
2r+2w        Quad-port RAM          4 log2(D) + 4W

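The entries of Table I follow a simple pattern: each read or write address needs log2(D) routing ports and each data port needs W. A small Python helper (a hypothetical illustration of the table, not code from the article) reproduces the counts:

```python
import math

def bram_routing_ports(mode, W, D):
    """Routing ports needed for each BRAM configuration of Table I.

    W is the maximum data word width and D the BRAM depth; each
    address port contributes log2(D) wires and each data port
    contributes W wires. Illustrative model only.
    """
    A = int(math.log2(D))
    return {
        "1r":    A + W,          # ROM: read address + read data
        "1r/w":  A + 2 * W,      # shared address, write data + read data
        "1r+1w": 2 * A + 2 * W,  # separate read and write ports
        "2r/w":  2 * A + 4 * W,  # two full read/write ports
        "2r+2w": 4 * A + 4 * W,  # quad-port
    }[mode]
```

For a 1024-deep, 8-bit-wide BRAM, simple dual-port mode needs 2 × 10 + 2 × 8 = 36 routing ports, while true dual-port mode needs 52, illustrating the 2W-port gap discussed below.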
output ports for read data, where W and D are the maximum word width and the BRAM depth, respectively. The table shows that a true dual-port (2r/w) BRAM requires 2W more ports compared to a simple dual-port (1r+1w) BRAM, which significantly increases the cost of the routing interfaces. While true dual-port memory is useful for register files, caches and shared memory switches, the most common use of multi-ported RAMs on FPGAs is for FIFOs, which require only one read and one write port (1r+1w rather than 2r/w ports). Consequently, FPGA BRAMs typically have true dual-port SRAM cores but with only enough routing interfaces for simple dual-port mode at the full width supported by the SRAM core (W), and limit the width of the true dual-port mode to only half of the maximum width (W/2).

Another way to mitigate the cost of additional BRAM ports is to multi-pump the memory blocks (i.e. operate the BRAMs at a frequency that is a multiple of that used for the rest of the design logic). By doing so, a physically single-ported SRAM array can implement a logically multi-ported BRAM without the cost of additional ports, as in Tabula's Spacetime architecture [56]. Multi-pumping can also be used with conventional FPGA BRAMs by building the time-multiplexing logic in the soft fabric; however, this leads to aggressive timing constraints for the time-multiplexing logic, which can make timing closure more challenging and increase compile time. Altera introduced quad-port BRAMs in its Mercury devices in the early 2000s to make shared memory switches (useful in packet processing) and register files more efficient [57]. However, this feature increased the BRAM size and was not sufficiently used to justify its inclusion in subsequent FPGA generations. Instead, designers use a variety of techniques to combine dual-ported FPGA BRAMs and soft logic to make highly-ported structures when needed, albeit at lower efficiency [58], [59]. We refer the interested reader to both [52] and [55] for extensive details about the design of BRAM core and peripheral circuitry.

In addition to building BRAMs, FPGA vendors can add circuitry that allows designers to repurpose the LUTs that form the logic fabric into additional RAM blocks. The truth tables in the logic block K-LUTs are 2^K × 1-bit read-only memories; they are written once by the configuration circuitry when the design bitstream is loaded. Since LUTs already have read circuitry (they read out a stored value based on a K-bit input/address), they can be used as small distributed LUT-based RAMs (LUT-RAMs) just by adding designer-controlled write circuitry. However, a major concern is the number of additional routing ports necessary to implement the write functionality to change a LUT to a LUT-RAM. For example, an ALM in recent Altera/Intel architectures is a 6-LUT that can be fractured into two 5-LUTs and has 8 input routing ports, as explained in Section III-A. This means it can operate as a 64 × 1-bit or a 32 × 2-bit memory with 6 or 5 bits for read address, respectively. This leaves only 2 or 3 unused routing ports, which are not enough for write address, data, and write enable (8 total signals) if we want to read and write in each cycle (simple dual-port mode), which is the most commonly used RAM mode in FPGA designs. To overcome this problem, an entire logic block of 10 ALMs is configured as a LUT-RAM to amortize the control circuitry and address bits across 10 ALMs. The write address and write enable are assembled by bringing one signal in from an unused routing port in each ALM and broadcasting the resulting address and enable to all ALMs [60]. Consequently, a logic block can implement a 64 × 10-bit or 32 × 20-bit simple dual-port RAM, but has a restriction that a single logic block cannot mix logic and LUT-RAM. Xilinx Ultrascale similarly converts an entire logic block to LUT-RAM, but all the routing ports of one out of the eight LUTs in a logic block are repurposed to drive the (broadcast) write address and enable signals. Therefore, a Xilinx logic block can implement a 64 × 7-bit or 32 × 14-bit simple dual-port RAM, or a slightly wider single-port RAM (64 × 8-bit or 32 × 16-bit). Avoiding extra routing ports keeps the cost of LUT-RAM low, but it still adds some area. Since it would be very unusual for a design to convert more than 50% of the logic fabric to LUT-RAMs, both Altera/Intel and Xilinx have elected to make only half of their logic blocks LUT-RAM capable in their recent architectures, thereby further reducing the area cost.

Designers require many different RAMs in a typical design, all of which must be implemented by the fixed BRAM and LUT-RAM resources on the chip. Forcing designers to determine the best way to combine BRAM and LUT-RAM for each memory configuration they need and writing Verilog to implement them would be laborious and would also tie the design to a specific FPGA architecture. Instead, the vendor CAD tools include a RAM mapping stage that implements the logical memories in the user's design using the physical BRAMs and LUT-RAMs on the chip. The RAM mapper chooses the physical memory implementation (i.e. memory type and the width/number/type of its ports) and generates any additional logic required to combine multiple BRAMs or LUT-RAMs to implement each logical RAM. Fig. 13 gives an example of mapping a logical 2048 × 32-bit RAM with 2 read and 1 write ports to an FPGA with physical 1024 × 8-bit dual-port BRAMs. First, four physical BRAMs are combined in parallel to make wider RAMs with no extra logic. Then, soft logic resources are used to perform depth-wise stitching of two sets of four physical BRAMs, such that the most-significant bits of the write and read addresses are used as write enable and read output mux select signals, respectively. Finally, in this case we require two read ports

and one write port while the physical BRAMs only support a maximum of 2r/w ports. To implement the second read port, the whole structure is either replicated (see Fig. 13) or double-pumped as previously explained. Several algorithms for optimizing RAM mapping are described in [61], [62].

Figure 13. Mapping a 2048 × 32-bit 2r+1w logical RAM to an FPGA with 1024 × 8-bit 1r+1w physical BRAMs.

Over the past 25 years, FPGA memory architecture has evolved considerably and has also become increasingly important, as the ratio of memory to logic on an FPGA die has grown significantly. Fig. 14 plots the memory bits per logic element (including LUT-RAM) versus the number of logic elements in Altera/Intel devices starting from the 350 nm Flex 10K devices (1995) to 10 nm Agilex devices (2019). There has been a gradual increase in the memory richness of FPGAs over time, and to meet the demand for more bits, modern BRAMs have larger capacities (20 kb) than the first BRAMs (2 kb). Some FPGAs have had highly heterogeneous BRAM architectures in order to provide some physical RAMs that are efficient for small or wide logical RAMs, and others that are efficient for large and relatively narrow logical RAMs. For example, Stratix (130 nm) had 3 types of BRAM, with capacities of 512 b, 4 kb and 512 kb. The introduction of LUT-RAM in Stratix III (65 nm) reduced the need for small BRAMs, so it moved to a memory architecture with 9 kb and 144 kb BRAMs. Stratix V (28 nm) and later Intel devices have moved to a combination of LUT-RAM and a single medium-sized BRAM (20 kb) to
simplify both the FPGA layout as well as RAM mapping and placement. Tatsumura et al. [52] plot a similar trend for on-chip memory density in Xilinx devices as well. Similarly to Intel, Xilinx's RAM architecture combines LUT-RAM and a medium-sized 18 kb BRAM, but also includes hard circuitry to combine two BRAMs into a single 36 kb block. However, Xilinx's most recent devices have also included a large 288 kb BRAM (UltraRAM) to be more efficient for very large buffers, showing that there is still no general agreement on the best BRAM architecture.

Figure 14. Trends in memory bits per LE for Altera/Intel FPGAs, starting from the 350 nm Flex 10K (1995) to the 10 nm Agilex (2019) architecture. The labels show the sizes of the BRAMs in each of these architectures.

To give some insight into the relative areas and efficiencies of different RAM blocks, Table II shows the resource usage, silicon area, and frequency of a 2048 × 72-bit logical RAM when it is implemented by Quartus (the CAD flow for Altera/Intel FPGAs) in a variety of ways on a Stratix IV device. The silicon areas are computed using the published Stratix III block areas from [63] and scaling them from 65 nm down to 40 nm, as Stratix III and IV have the same architecture but use different process nodes. As this logical RAM is a perfect fit for the 144 kb BRAM in Stratix IV, it achieves the best area when mapped to a single 144 kb BRAM. Interestingly, mapping to eighteen 9 kb BRAMs is only 1.9× larger in silicon area (note that output width limitations lead to 18 BRAMs instead of the 16 one might expect). The 9 kb BRAM implementation is actually faster than the 144 kb BRAM implementation, as the smaller BRAMs have higher maximum operating frequencies. Mapping such a large logical RAM to LUT-RAMs is inefficient, requiring 12.8× more area and running at 40% of the frequency. Finally, mapping only to the logic and routing resources shows how important on-chip RAM is: the area is over 300× larger than the 144 kb BRAM. While the 144 kb BRAM is most efficient for this single test case, real designs have diverse logical RAMs, and for small or shallow memories the 9 kb and LUT-RAM options would outperform the 144 kb BRAM, motivating a diversity of on-chip RAM resources. To choose the best mix of BRAM sizes and maximum word widths, one needs both a RAM mapping tool and tools to estimate the area, speed and power of each BRAM [55]. Published studies into BRAM architecture trade-offs for FPGAs include [30], [55], [64].

To date, all commercial FPGAs use only SRAM-based memory cells in their BRAMs. With the desire for denser BRAMs that would enable more memory-rich FPGAs, and with SRAM scaling becoming increasingly difficult due to process variation, a few academic studies (e.g. [52], [65]) have explored the use of other emerging memory technologies, such as magnetic tunnel junctions (MTJs), to build FPGA memory blocks. According to [52], MTJ-based BRAMs could increase the FPGA memory capacity by up to 2.95× with the same die size; however, they would increase the process complexity.
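The 18-versus-16 BRAM count above can be reproduced with simple arithmetic. The sketch below is a simplification of what a real RAM mapper does (it greedily picks the widest M9K depth × width mode that covers the required depth); the mode list is the standard M9K configuration set.

```python
import math

# Rough BRAM-count estimate for mapping a logical RAM onto 9 kb M9K blocks
# (depths up to 8192 only; a real mapper also weighs ports and layout).
M9K_MODES = [(8192, 1), (4096, 2), (2048, 4), (1024, 9), (512, 18), (256, 36)]

def m9k_count(depth, width):
    usable = [(d, w) for d, w in M9K_MODES if d >= depth]
    d, w = max(usable, key=lambda m: m[1])   # widest mode covering the depth
    return math.ceil(width / w)

# Pure-capacity lower bound vs. the width-limited count for 2048 x 72:
capacity_bound = math.ceil(2048 * 72 / 9216)
print(capacity_bound, m9k_count(2048, 72))   # -> 16 18
```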

Table II. Implementation results for a 2048 × 72-bit 1r+1w RAM using BRAMs, LUT-RAMs and registers on Stratix IV.

Implementation   half-ALMs   9 kb BRAMs   144 kb BRAMs   Area (mm²)     Freq. (MHz)
144 kb BRAMs     0           0            1              0.22 (1.0×)    336 (1.0×)
9 kb BRAMs       0           18           0              0.41 (1.9×)    497 (1.5×)
LUT-RAM          6597        0            0              2.81 (12.8×)   134 (0.4×)
Registers        165155      0            0              68.8 (313×)    129 (0.4×)

E. DSP Blocks
Initially, the only dedicated arithmetic circuits in commercial FPGA architectures were carry chains to implement efficient adders, as discussed in Section III-A. Thus, multipliers had to be implemented in the soft logic using LUTs and carry chains, incurring a substantial area and delay penalty.
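The relative factors in Table II follow directly from its absolute columns; recomputing them makes the trade-offs explicit (a throwaway script, with the values copied from the table):

```python
# Recompute Table II's relative area and frequency factors from the absolute
# columns (values copied from the table above).
areas = {"144 kb BRAM": 0.22, "9 kb BRAMs": 0.41,
         "LUT-RAM": 2.81, "Registers": 68.8}            # mm^2
freqs = {"144 kb BRAM": 336, "9 kb BRAMs": 497,
         "LUT-RAM": 134, "Registers": 129}              # MHz

for impl in areas:
    print(f"{impl}: {areas[impl] / areas['144 kb BRAM']:.1f}x area, "
          f"{freqs[impl] / freqs['144 kb BRAM']:.1f}x frequency")
```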

As high-multiplier-density signal processing and communication applications constituted a major FPGA market, designers proposed novel implementations to mitigate the inefficiency of multiplier implementations in soft logic. For example, the multiplier-less distributed arithmetic technique was proposed to implement efficient finite impulse response (FIR) filter structures on LUT-based FPGAs [66], [67].

With the prevalence of multipliers in FPGA designs from key application domains, and their lower area/delay/power efficiency when implemented in soft logic, they quickly became a candidate for hardening as dedicated circuits in FPGA architectures. An N-bit multiplier array consists of N² logic elements with only 2N inputs and outputs. Therefore, the gains of hardening the multiplier logic outweighed the cost of the programmable interfaces to the FPGA's routing fabric, resulting in a net efficiency gain that strongly advocated for adopting hard multipliers in subsequent FPGA architectures. As shown at the top left of Fig. 15, Xilinx introduced its Virtex-II architecture with the industry's first 18 × 18 bit hard multiplier blocks [68]. To simplify the layout integration with the full-custom FPGA fabric, these multipliers were arranged in columns right beside the BRAM columns. In order to further reduce the interconnect cost, the multiplier block and its adjacent BRAM had to share some interconnect resources, limiting the maximum usable data width of the BRAM block. Multiple hard 18-bit multipliers could be combined to form bigger multipliers or FIR filters using soft logic resources.

In 2002, Altera adopted a different approach by introducing full-featured DSP blocks targeting the communications and signal processing domains in their Stratix architecture [42] (see the second block in Fig. 15). The main design philosophy of this DSP block was to minimize the amount of soft logic resources used to implement common DSP algorithms by hardening more functionality inside the DSP block, and to enhance its flexibility so that more applications could make use of it. The Stratix DSP block was highly configurable, with support for different modes of operation and multiplication precisions, unlike the fixed-function hard 18-bit multipliers in the Virtex-II architecture. Each Stratix variable-precision DSP block spanned 8 rows and could implement eight 9 × 9 bit multipliers, four 18 × 18 bit multipliers, or one 36 × 36 multiplier. These modes of operation selected by Altera highlight an important theme of designing FPGA hard blocks: increasing the configurability and utility of these blocks by adding low-cost circuitry. For example, an 18 × 18 multiplier array can be decomposed into two 9 × 9 arrays that together use the same number of inputs and outputs (and hence routing ports). Similarly, four 18 × 18 multipliers can be combined into one 36 × 36 array using cheap glue logic. Fig. 16 shows how an 18 × 18 multiplier array can be fractured into multiple 9 × 9 arrays. It could be split into four 9 × 9 arrays by doubling the number of input and output pins. However, to avoid adding these costly routing interfaces, the 18 × 18 array is split into only two 9 × 9 arrays (colored blue in Fig. 16).
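The reason this fracturing is cheap is purely arithmetic: an 18 × 18 product is exactly a weighted sum of four 9 × 9 sub-products, and likewise a 36 × 36 product of four 18 × 18 ones. The Python sketch below checks the value-level identity only; it does not model the compressor-tree hardware:

```python
# An 18 x 18 product as a weighted sum of four 9 x 9 sub-products, and a
# 36 x 36 product as a weighted sum of four 18 x 18 products (value-level
# check only, not a hardware model).
def mul18_from_9x9(a, b):
    a1, a0 = a >> 9, a & 0x1FF       # split 18-bit operands into 9-bit halves
    b1, b0 = b >> 9, b & 0x1FF
    return (a1 * b1 << 18) + ((a1 * b0 + a0 * b1) << 9) + a0 * b0

def mul36_from_18x18(a, b):
    a1, a0 = a >> 18, a & 0x3FFFF    # four 18 x 18 products + glue adders
    b1, b0 = b >> 18, b & 0x3FFFF
    return (a1 * b1 << 36) + ((a1 * b0 + a0 * b1) << 18) + a0 * b0

assert mul18_from_9x9(0x2ABCD, 0x31234) == 0x2ABCD * 0x31234
assert mul36_from_18x18(0x8ABCD1234, 0x7DEF01234) == 0x8ABCD1234 * 0x7DEF01234
```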

Figure 15. DSP block evolution in Altera/Intel and Xilinx FPGAs. Incrementally added features are highlighted in red. (Panels, left to right: Xilinx Virtex-II (2001), Altera Stratix (2002), Xilinx Virtex-6 (2009), Intel Arria 10 (2013), Xilinx Versal (2019), Intel Agilex (2019), Intel Stratix 10 NX (2020).)

The 18 × 18 array is fractured by splitting the partial product compressor trees at the positions indicated by the red dashed lines in Fig. 16, and by adding inverting capabilities to the border cells of the top-right array, marked with crosses in Fig. 16, to implement two's complement signed multiplication using the Baugh-Wooley algorithm [69] (the bottom-left array already has the inverting capability from the 18 × 18 array).

In addition to the fracturable multiplier arrays, the Stratix DSP block also incorporated an adder/output block to perform summation and accumulation operations, as well as hardened input registers that could be configured as shift registers with dedicated cascade interconnect between them to implement efficient FIR filter structures [70]. The latest 28 nm architectures from Lattice also have a variable-precision DSP block that can implement the same range of precisions, in addition to special 1D and 2D symmetry modes for filter structures and video processing applications, respectively. Xilinx also adopted a full-featured DSP block approach by introducing their DSP48 tiles in the Virtex-4 architecture [71]. Each DSP tile had two fixed-precision 18 × 18 bit multipliers with similar functionalities to the Stratix DSP block (e.g. input cascades, adder/subtractor/accumulator). Virtex-4 also introduced the ability to cascade the adders/accumulators using dedicated interconnects to implement high-speed systolic FIR filters with hardened reduction chains.

Figure 16. Fracturing an 18 × 18 multiplier array into two 9 × 9 arrays with the same number of input/output ports.

An N-tap FIR filter performs a discrete 1D convolution between the samples of a signal X = \{x_0, x_1, \ldots, x_T\} and coefficients C = \{c_0, c_1, \ldots, c_N\} that represent the impulse response of the desired filter, as shown in eq. (1):

y_n = c_0 x_n + c_1 x_{n-1} + \cdots + c_N x_{n-N} = \sum_{i=0}^{N} c_i x_{n-i}    (1)

Many of the FIR filters used in practice are symmetric, with c_i = c_{N-i} for i = 0 to N/2. As a result of this symmetry, the filter computation can be refactored as shown in eq. (2):

y_n = c_0 [x_n + x_{n-N}] + \cdots + c_{N/2-1} [x_{n-N/2+1} + x_{n-N/2-1}]    (2)

Fig. 17 shows the structure of a systolic symmetric FIR filter circuit, which is a key use case for FPGAs in wireless base stations. Both the Stratix and Virtex-4 DSP blocks can implement the portions highlighted by the dotted boxes, resulting in significant efficiency gains compared to implementing them in the FPGA's soft logic. Interestingly, while FPGA CAD tools will automatically implement a multiplication (*) operation in DSP blocks, they will generally not make use of any of the advanced DSP block features (e.g. accumulation, systolic registers for FIR filters) unless a designer manually instantiates a DSP block in the proper mode. Consequently, using the more powerful DSP block features makes a design less portable.

Figure 17. Systolic symmetric FIR filter circuit.

The Stratix III DSP block was similar to the Stratix II one, but could implement four 18 × 18 multipliers per half DSP block (instead of two) if their results are summed, to limit the number of output routing interfaces [72]. Table III lists the implementation results of both symmetric and asymmetric 51-tap FIR filters, with and without using the hard DSP blocks on a Stratix IV device. When DSP blocks are not used, we experiment with two different cases: fixed filter coefficients, and filter coefficients that can change at runtime. If the filter coefficients are fixed, the multiplier arrays implemented in the soft logic are optimized by synthesizing away the parts of the partial product generation logic that correspond to zero bits in the coefficient values. Hence, they have lower resource utilization than with input coefficients that can change at runtime.
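A quick numerical check (with made-up coefficients and samples, and an even number of taps so every coefficient pairs up) confirms that the refactoring in eq. (2) computes the same output as eq. (1) with half the multiplications:

```python
# Verify that the symmetric refactoring of eq. (1) into eq. (2) gives the
# same result while each coefficient now multiplies a pre-added sample pair.
def fir_direct(x, c, n):
    N = len(c) - 1
    return sum(c[i] * x[n - i] for i in range(N + 1))            # eq. (1)

def fir_symmetric(x, c, n):
    N = len(c) - 1                       # requires c[i] == c[N - i], N odd
    return sum(c[i] * (x[n - i] + x[n - N + i]) for i in range((N + 1) // 2))

c = [1, 3, 5, 7, 7, 5, 3, 1]             # symmetric 8-tap impulse response
x = [2, -1, 4, 0, 3, 5, -2, 6, 1, 0, 7, -3]
for n in range(len(c) - 1, len(x)):
    assert fir_direct(x, c, n) == fir_symmetric(x, c, n)
print(f"{len(c) // 2} multiplications per output instead of {len(c)}")
```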

For the symmetric filter, even when using the DSP blocks, we still need to use some soft logic resources to implement the input cascade chains and pre-adders shown in Fig. 17. Using the hard DSP blocks results in 3× higher area efficiency vs. using the soft fabric in the case of fixed coefficients. This gap grows to 6.2× for filter coefficients that are changeable at runtime. For the asymmetric filter, the complete FIR filter structure can be implemented in the DSP blocks without any soft logic resources. Thus, the area efficiency gap increases to 3.9× and 8.5× for fixed and input coefficients, respectively. These gains are large but still less than the 35× gap between FPGAs and ASICs [11] usually cited in academia. The difference is partly due to some soft logic remaining in most application circuits, but even in the case where the FIR filter perfectly fits into DSP blocks with no soft logic, the area reduction hits a maximum of 8.5×. The primary reasons for the lower-than-35× gain of [11] are the interfaces to the programmable routing and the general inter-tile programmable routing wires and multiplexers that must be implemented in the DSP tile. In all cases, using the hard DSP blocks results in about a 2× frequency improvement, as shown in Table III.

Table III. Implementation results for 51-tap symmetric and asymmetric FIR filters on Stratix IV with and without using the hardened DSP blocks.

Symmetric Filter
Implementation                half-ALMs   DSPs   Area (mm²)    Freq. (MHz)
With DSPs                     403         32     0.49 (1.0×)   510 (1.0×)
Without DSPs (fixed coeff.)   3505        0      1.46 (3.0×)   248 (0.5×)
Without DSPs (input coeff.)   7238        0      3.01 (6.2×)   220 (0.4×)

Asymmetric Filter
Implementation                half-ALMs   DSPs   Area (mm²)    Freq. (MHz)
With DSPs                     0           63     0.63 (1.0×)   510 (1.0×)
Without DSPs (fixed coeff.)   5975        0      2.48 (3.9×)   245 (0.5×)
Without DSPs (input coeff.)   12867       0      5.35 (8.5×)   217 (0.4×)

The subsequent FPGA architecture generations from both Altera and Xilinx witnessed only minor changes in the DSP block architecture. The main focus of both vendors was to fine-tune the DSP block capabilities for key application domains without adding costly programmable routing interfaces. In Stratix V, the DSP block was greatly simplified to natively support two 18 × 18 bit multiplications (the key precision used in wireless base station signal processing) or one 27 × 27 multiplication (useful for single-precision floating-point mantissas). As a result, the simpler Stratix V DSP block spanned a single row, which is more friendly to Altera's row redundancy scheme. In addition, input pre-adders as well as embedded coefficient banks to store read-only filter weights were added [73], which allowed implementing the whole symmetric FIR filter structure shown in Fig. 17 inside the DSP blocks without the need for any soft logic resources. On the other hand, Xilinx switched from 18 × 18 multipliers to 25 × 18 multipliers in their Virtex-5 DSP48E tile [74], after which they incorporated input pre-adders and enhanced their adder/accumulator unit to also support bitwise logic operations in their Virtex-6 DSP48E1 tile [75]. Then, they increased their multiplication width again to 27 × 18 bit and added a fourth input to their ALU in the UltraScale family DSP48E2 tile [76].

As illustrated in Fig. 15, up to 2009 the evolution of the DSP block architecture was mainly driven by the precisions and requirements of communication applications, especially in wireless base stations, with very few academic research explorations [77], [78]. More recently, FPGAs have been widely deployed in datacenters to accelerate various types of workloads such as search engines and network packet processing [9]. In addition, DL has emerged as a key component of many applications in both datacenter and edge workloads, with multiply-accumulate (MAC) being its core arithmetic operation. Driven by these new trends, the DSP block architecture has evolved in two different directions. The first direction targets the high-performance computing (HPC) domain by adding native support for single-precision floating-point (fp32) multiplication. Before that, FPGA vendors would supply designers with IP cores that implement floating-point arithmetic out of fixed-point DSPs and a considerable amount of soft logic resources. This created a huge barrier for FPGAs to compete with CPUs and GPUs (which have dedicated floating-point units) in the HPC domain. Native floating-point capabilities were first introduced in Intel's Arria 10 architecture, with a key design goal of avoiding a large increase in DSP block area [79].
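To see why a wide fixed-point array is reusable for fp32, note that a floating-point multiply reduces to a fixed-point multiply of the two 24-bit mantissas plus exponent addition. The sketch below is deliberately simplified (normal numbers only, truncation instead of IEEE rounding, no overflow or subnormal handling) and is our own illustration, not Intel's actual datapath:

```python
import struct

# fp32 multiply built from a 24 x 24 fixed-point mantissa multiply plus
# exponent addition (simplified: normals only, truncation, no overflow).
def fp32_fields(x):
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31
    exp = (bits >> 23) & 0xFF
    mant = (bits & 0x7FFFFF) | 0x800000     # restore the implicit leading 1
    return sign, exp, mant

def fp32_mul_via_fixed(a, b):
    sa, ea, ma = fp32_fields(a)
    sb, eb, mb = fp32_fields(b)
    prod = ma * mb                          # the fixed-point multiply
    shift = 24 if prod >> 47 else 23        # renormalize to 24 mantissa bits
    mant = prod >> shift
    exp = ea + eb - 127 + (shift - 23)
    bits = ((sa ^ sb) << 31) | (exp << 23) | (mant & 0x7FFFFF)
    return struct.unpack("<f", struct.pack("<I", bits))[0]

assert fp32_mul_via_fixed(1.5, 2.5) == 3.75
assert fp32_mul_via_fixed(3.0, 7.0) == 21.0
```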

By reusing the same interface to the programmable routing, not supporting uncommon features like subnormals, flags and multiple rounding schemes, and maximizing the reuse of the existing fixed-point hardware, the block area increase was limited to only 10% (i.e. a 0.5% total die area increase). Floating-point capabilities will also be supported in the DSP58 tiles of the next-generation Xilinx Versal architecture [80].

The second direction targets increasing the density of low-precision integer multiplication, specifically for DL inference workloads. Prior work has demonstrated the use of low-precision fixed-point arithmetic (8-bit and below) instead of fp32 at negligible or no accuracy degradation, but greatly reduced hardware cost [81]–[83]. However, the required precision is model-dependent and can even vary between different layers of the same model. As a result, FPGAs have emerged as an attractive solution for DL inference due to their ability to implement custom-precision datapaths, their energy efficiency compared to GPUs, and their lower development cost compared to custom ASICs. This has led both academic researchers and FPGA vendors to investigate adding native support for low-precision multiplication to DSP blocks. The authors of [84] enhance the fracturability of an Intel-like DSP block to support more int9 and int4 multiply and MAC operations, while keeping the same DSP block routing interface and ensuring its backward compatibility. The proposed DSP block could implement four int9 and eight int4 multiply/MAC operations along with Arria-10-like DSP block functionality at the cost of a 12% DSP block area increase, which is equivalent to only a 0.6% increase in total die area. This DSP block increased the performance of 8-bit and 4-bit DL accelerators by 1.3× and 1.6× while reducing the utilized FPGA resources by 15% and 30% respectively, compared to an FPGA with DSPs that do not natively support these modes of operation. Another academic work [85] enhanced a Xilinx-like DSP block by including a fracturable multiplier array instead of the fixed-precision multiplier in the DSP48E2 block to support int9, int4 and int2 precisions. It also added a FIFO register file and special dedicated interconnect between DSP blocks to enable more efficient standard, point-wise and depth-wise convolution layers. Shortly after, Intel announced that the same int9 mode of operation will be added to the next-generation Agilex DSP block along with half-precision floating-point (fp16) and brain float (bfloat16) precisions [86]. Also, the next-generation Xilinx Versal architecture will natively support int8 multiplications in its DSP58 tiles [80].

Throughout the years, the DSP block architecture has evolved to best suit the requirements of key application domains of FPGAs, and to provide higher flexibility such that many different applications can benefit from its capabilities. The common focus across all the steps of this evolution was reusing multiplier arrays and routing ports as much as possible to best utilize both of these costly resources. However, this becomes harder with the recent divergence in the DSP block requirements of key FPGA application domains between high-precision floating-point in HPC, medium-precision fixed-point in communications, and low-precision fixed-point in DL. As a result, Intel has recently announced an AI-optimized FPGA, the Stratix 10 NX, which replaces conventional DSP blocks with AI tensor blocks [87]. The new tensor blocks drop support for the legacy DSP modes and precisions that were targeting the communications domain and adopt new ones targeting the DL domain specifically. This tensor block significantly increases the number of int8 and int4 MACs to 30 and 60 per block respectively, at almost the same die size [88]. Feeding all the multipliers with inputs without adding more routing ports is a key concern. Accordingly, the NX tensor block introduces a double-buffered data reuse register network that can be sequentially loaded from a smaller number of routing ports, while allowing common DL compute patterns to make the best use of all available multipliers [89]. The next-generation Speedster7t FPGA from Achronix will also include a machine learning processing (MLP) block [90]. It supports a variety of precisions from int16 down to int3, in addition to the fp24, fp16 and bfloat16 floating-point formats. The MLP block in Speedster7t will also feature a tightly coupled BRAM and circular register file that enable the reuse of both input values and output results. Each of these tightly integrated memory banks has a 72-bit external input but can be configured to have an up-to-144-bit output that feeds the MLP's multiplier arrays, reducing the number of required routing ports by 2×.

F. System-Level Interconnect: Network-on-Chip
FPGAs have continuously increased both in capacity and in the bandwidth of their external IO interfaces such as DDR, PCIe and Ethernet. Distributing the data traffic between these high-speed interfaces and the ever-larger soft fabric is a challenge. This system-level interconnect has traditionally been built by configuring parts of the FPGA logic and routing to implement soft buses that realize the multiplexing, arbitration, pipelining and wiring between the relevant endpoints. These external interfaces operate at higher frequencies than the FPGA fabric can achieve, and therefore the only way to match their bandwidth is to use wider (soft) buses. For example, a single channel of high-bandwidth memory (HBM) has a 128-bit double data rate interface operating at 1 GHz, so a bandwidth-matched soft bus running at 250 MHz must be 1024 bits wide. With recent FPGAs incorporating up to 8 HBM channels [91] as well as numerous PCIe, Ethernet and other interfaces, system-level interconnect can rapidly use a major fraction of the FPGA logic and routing resources. In addition, system-level interconnect tends to span large distances.
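The 1024-bit figure above can be reproduced with back-of-envelope arithmetic (the function below is just that calculation, nothing vendor-specific):

```python
# Soft-bus sizing: to match an interface's bandwidth, a bus clocked at fabric
# speed must be proportionally wider (x2 for double data rate). The numbers
# reproduce the HBM example from the text.
def matched_width(bits, clock_hz, ddr, fabric_hz):
    interface_bps = bits * clock_hz * (2 if ddr else 1)
    return interface_bps // fabric_hz     # required soft-bus width in bits

print(matched_width(bits=128, clock_hz=1_000_000_000,
                    ddr=True, fabric_hz=250_000_000))   # -> 1024
```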

The combination of very wide and physically long buses makes timing closure challenging and usually requires deep pipelining of the soft bus, further increasing its resource use. The system-level interconnect challenge is becoming more difficult in advanced process nodes, as the number and speed of FPGA external interfaces increase, and as the metal wire parasitics (and thus interconnect delay) scale poorly [92].

Abdelfattah and Betz [93]–[95] proposed embedding a hard, packet-switched network-on-chip (NoC) in the FPGA fabric to enable more efficient and easier-to-use system-level interconnect. Although a full-featured packet-switched NoC could be implemented using the soft logic and routing of an FPGA, an NoC with hardened routers and links is 23× more area efficient, 6× faster, and consumes 11× less power compared to a soft NoC [95]. Designing a hard NoC for an FPGA is challenging, since the FPGA architect must commit many choices to silicon (e.g. the number of routers, link width and NoC topology) yet still maintain the flexibility of an FPGA to implement a wide variety of applications using many different external interfaces and communication endpoints. The work in [95] advocates for a mesh topology with a moderate number of routers (e.g. 16) and fairly wide (128-bit) links; these choices keep the area cost to less than 2% of the FPGA while ensuring the NoC is easier to lay out and a single NoC link can carry the entire bandwidth of a DDR channel. A hard NoC must also be able to flexibly connect to user logic implemented in the FPGA fabric; Abdelfattah et al. [96] introduced the fabric port, which interfaces the hard NoC routers to the FPGA programmable fabric by performing width adaptation, clock domain crossing and voltage translation. This decouples the NoC from the FPGA fabric such that the NoC can run at a fixed (high) frequency and still interface to FPGA logic and IO interfaces of different speeds and bandwidth requirements with very little glue logic. Hard NoCs also appear very well suited to FPGAs in datacenters. Datacenter FPGAs are normally configured in two parts: a shell provides system-level interconnect to the external interfaces, and a role implements the application acceleration functionality [9]. The resource use of the shell can be significant: it requires 23% of the device resources in the first generation of Microsoft's Catapult systems [8]. Yazdanshenas et al. [97] showed that a hard NoC significantly improves resource utilization, operating frequency and routing congestion in such shell + role FPGA use cases. Other studies have proposed FPGA-specific optimizations to increase the area efficiency and performance of soft NoCs [98]–[100]. However, [101] shows that even optimized soft NoCs still trail hard NoCs in most respects (usable bandwidth, latency, area and routing congestion).

Recent Xilinx (Versal) and Achronix (Speedster7t) FPGAs integrate a hard NoC [102], [103] similar to the academic proposals discussed above. Versal uses a hard NoC for system-level communication between various endpoints (Gigabit transceivers, processor, AI subsystems, soft fabric), and it is in fact the only way for FPGA external memory interfaces to communicate with the rest of the device. It uses 128-bit wide links running at 1 GHz, matching a DDR channel's bandwidth. Its topology is related to a mesh, but with all horizontal links pushed to the top and bottom of the device to make it easier to lay out within the FPGA floorplan. The Versal NoC contains multiple rows (i.e. chains of links and routers) at the top and bottom of the device, and a number of vertical NoC columns (similar to any other hard block columns, such as DSPs) depending on the device size, as shown in Fig. 18(a). The NoC has programmable routing tables that are configured at boot time, and it provides standard AXI interfaces [104] as its fabric ports.

Figure 18. Network-on-Chip system-level interconnect in the next-generation (a) Xilinx Versal and (b) Achronix Speedster7t architectures.
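As a sanity check of the link-sizing claim, one 128-bit link at 1 GHz covers a DDR channel; the DDR channel figures below (64-bit DDR3-1600, 1600 MT/s) are our own example, not taken from [95]:

```python
# One hard NoC link vs. one DDR channel. NoC link width/clock are those
# advocated in [95]; the DDR3-1600 channel numbers are our assumption.
noc_link_bps = 128 * 1_000_000_000        # 128-bit link at 1 GHz
ddr_channel_bps = 64 * 1600 * 1_000_000   # bus width x mega-transfers/s
assert noc_link_bps >= ddr_channel_bps
print(noc_link_bps / ddr_channel_bps)     # -> 1.25
```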

fabric transfers. It consists of a peripheral ring around the-art). Combining multiple smaller dice on a silicon in-
the fabric with NoC rows and columns at regular inter- terposer is an alternative approach that can have higher
vals over the FPGA fabric as shown in Fig. 18(b). The yield. A second motivation for 2.5 D systems is to enable
peripheral ring NoC can operate independently without integration of different specialized chiplets (possibly us-
configuring the FPGA fabric to route the traffic between ing different process technologies) into a single system.
different external interfaces. There is no direct connectivity between the NoC rows and columns; the packets from a master block connecting to a NoC row will pass through the peripheral ring to reach a slave block connected to a NoC column.

G. Interposers
FPGAs have been early adopters of interposer technology that allows dense interconnection of multiple silicon dice. As shown in Fig. 19(a), a passive interposer is a silicon die (often in a trailing process technology to reduce cost) with conventional metal layers forming routing tracks and thousands of microbumps on its surface that connect to two or more dice flipped on top of it. One motivation for interposer-based FPGAs is achieving higher logic capacity at a reasonable cost. Both high-end systems and emulation platforms to validate ASIC designs before fabrication demand FPGAs with high logic capacity. However, large monolithic (i.e. single-silicon-die) devices have poor yield, especially early in the lifetime of a process technology (exactly when the FPGA is state-of-the-art). Integrating several smaller (and hence better-yielding) dice on an interposer sidesteps this yield problem, and also allows dice from different process technologies to be combined in one package. This approach is also attractive for FPGAs as the fabric's programmability can bridge disparate chiplet functionality and interface protocols.

Xilinx's largest Virtex-7 (28 nm) and Virtex Ultrascale (20 nm) FPGAs use passive silicon interposers to integrate three to four FPGA dice that each form a portion of the FPGA's rows. The largest interposer-based devices provide more than twice the logic elements of the largest monolithic FPGAs at the same process node. The FPGA programmable routing requires a large amount of interconnect, raising the question of whether the interposer microbumps (which are much larger and slower than conventional routing tracks) will limit the routability of the system. For example, in Virtex-7 interposer-based FPGAs, only 23% of the vertical routing tracks cross between dice through the interposer [105], with an estimated additional delay of ~1 ns [106]. The study in [105] showed that CAD tools that place the FPGA logic to minimize crossing of an interposer boundary, combined with architecture changes that increase the switch flexibility to the interposer-crossing tracks, can largely mitigate the impact of this reduced signal count. The entire vertical bandwidth of the NoC in the next-generation Xilinx Versal architecture (discussed in Section III-F) crosses between dice, helping to provide more interconnect bandwidth. An embedded NoC makes good use of the limited number of wires that can cross an interposer, as it runs its links at a high frequency and they can be shared by different communication streams as they are packet switched.

Intel FPGAs instead use smaller interposers called embedded multi-die interconnect bridges (EMIB) carved into the package substrate as shown in Fig. 19(b). Intel Stratix 10 devices use EMIB to integrate a large FPGA fabric die with smaller IO transceiver or HBM chiplets in the same package, decoupling the design and process technology choices of these two crucial elements of an FPGA. Some recent studies [107]–[109] used EMIB technology to tightly couple an FPGA fabric with specialized ASIC accelerator chiplets for DL applications. This approach offloads specific kernels of the computation (e.g. matrix-matrix or matrix-vector multiplications) to the more efficient specialized chiplets, while leveraging the FPGA fabric to interface to the outside world and to implement rapidly changing DL model components.
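The placement trade-off described above can be illustrated with a toy cost model. This is a hypothetical sketch, not the actual CAD flow of [105]: the die boundary, delay constants, and cost function below are invented for illustration. It simply shows why a placer that penalizes die crossings (which consume scarce microbump tracks and add roughly a nanosecond of delay) will keep most nets within a single die.

```python
# Toy illustration of interposer-aware placement cost (hypothetical model,
# not the algorithm of [105]). Blocks sit at (x, y) grid locations on a
# 2-die device split at an assumed boundary y = 50.
DIE_BOUNDARY_Y = 50
CROSSING_PENALTY_NS = 1.0      # extra delay per interposer crossing (from [106])
BASE_DELAY_NS_PER_UNIT = 0.01  # assumed intra-die wire delay per unit distance

def net_cost(src, dst):
    """Estimated delay of a 2-pin net; penalized if it crosses dice."""
    (x1, y1), (x2, y2) = src, dst
    delay = (abs(x1 - x2) + abs(y1 - y2)) * BASE_DELAY_NS_PER_UNIT
    crosses = (y1 < DIE_BOUNDARY_Y) != (y2 < DIE_BOUNDARY_Y)
    return delay + (CROSSING_PENALTY_NS if crosses else 0.0)

def placement_cost(nets):
    """Total cost a placer would minimize over all 2-pin nets."""
    return sum(net_cost(src, dst) for src, dst in nets)

# Two candidate placements of the same connected pair: keeping both
# endpoints on one die is cheaper even though the wire is longer.
same_die  = [((10, 10), (40, 10))]   # 30 units of wire, no crossing
cross_die = [((10, 45), (10, 55))]   # 10 units of wire, crosses the boundary
assert placement_cost(same_die) < placement_cost(cross_die)
```

With these assumed constants, the short die-crossing net costs more than a wire three times as long that stays on one die, which is the intuition behind placement algorithms that minimize interposer-boundary crossings.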
Figure 19. Different interposer technologies used for integrating multiple chips in one package in: (a) Xilinx multi-die interposer-based FPGAs and (b) Intel devices with EMIB-connected transceiver chiplets.

H. Other FPGA Components
Modern FPGA architectures contain other important components that we will not cover in detail.

26  IEEE CIRCUITS AND SYSTEMS MAGAZINE SECOND QUARTER 2021

One such component is the configuration circuitry that loads the bitstream into the millions of SRAM cells that control the LUTs, routing switches and configuration bits in hard blocks. On power up, a configuration controller loads this bitstream serially from a source such as on-board flash or a hardened PCIe interface. When a sufficient group of configuration bits are buffered, they are written in parallel to a group of configuration SRAM cells, in a manner similar to writing a (very wide) word to an SRAM array. This configuration circuitry can also be accessed by the FPGA soft logic, allowing partial reconfiguration of one part of the device while another portion continues processing. A complete FPGA application is very valuable intellectual property, and without security measures it could be cloned simply by copying the programming bitstream. To avoid this, FPGA CAD tools can optionally encrypt a bitstream, and FPGA devices can have a private decryption key programmed in by the manufacturer, making a bitstream usable only by a single customer who purchases FPGAs with the proper key.

Since FPGA applications are often communicating with many different devices at different speeds, they commonly include dozens of clocks. Most of these clocks are generated on-chip by programmable phase-locked loops (PLLs), delay-locked loops (DLLs) and clock data recovery (CDR) circuits. Distributing many high-frequency clocks in different ways for different applications is challenging, and leads to special interconnect networks for clocks. These clock networks are similar in principle to the programmable interconnect of Section III-B, but use routing wire and switch topologies that allow construction of low-skew networks like H-trees, and are implemented using wider metal and shielding conductors to reduce crosstalk and hence jitter.

IV. Conclusion and Future Directions
FPGAs have evolved from simple arrays of programmable logic blocks and IOs interconnected via programmable routing into more complex multi-die systems with many different embedded components such as BRAMs, DSPs, high-speed external interfaces, and system-level NoCs. The recent adoption of FPGAs in the HPC and datacenter domains, along with the emergence of new high-demand applications such as deep learning, is ushering in a new phase of FPGA architecture design. These new applications and the multi-user paradigm of the datacenter create opportunities for architectural innovation. At the same time, process technology scaling is changing in fundamental ways. Wire delay is scaling poorly, which motivates rethinking programmable routing architecture. Interposers and 3D integration enable entirely new types of heterogeneous systems. Controlling power consumption is an overriding concern, and is likely to lead to FPGAs with more power-gating and more heterogeneous hard blocks. We do not claim to predict the future of FPGA architecture, except that it will be interesting and different from today!

Acknowledgments
The authors would like to thank Fynn Schwiegelshohn for valuable feedback, and the NSERC/Intel industrial research chair in programmable silicon and the Vector Institute for funding support.

Andrew Boutros received his B.Sc. degree in electronics engineering from the German University in Cairo in 2016, and his M.A.Sc. degree in electrical and computer engineering from the University of Toronto in 2018. He was a research scientist at Intel's Accelerator Architecture Lab in Oregon before he returned to the University of Toronto, where he is currently pursuing his Ph.D. degree. His research interests include FPGA architecture and CAD, deep learning acceleration, and domain-specific architectures. He is a post-graduate affiliate of the Vector Institute for Artificial Intelligence and the Center for Spatial Computational Learning. He received two best paper awards at ReConFig 2016 and FPL 2018.

Vaughn Betz received his B.Sc. degree in electrical engineering from the University of Manitoba in 1991, his M.S. degree in electrical and computer engineering from the University of Illinois at Urbana–Champaign in 1993, and his Ph.D. degree in electrical and computer engineering from the University of Toronto in 1998. He is the original developer of the widely used VPR FPGA placement, routing and architecture evaluation CAD flow, and a lead developer in the VTR project that has built upon VPR. He co-founded Right Track CAD to commercialize VPR, and joined Altera upon its acquisition of Right Track CAD. Dr. Betz spent 11 years at Altera, ultimately as Senior Director of software engineering, and is one of the architects of the Quartus CAD system and the first five generations of the Stratix and Cyclone FPGA families. He is currently a professor and the NSERC/Intel Industrial Research Chair in Programmable Silicon at the University of Toronto. He holds 101 US patents and has published over 100 technical articles in the FPGA area, thirteen of which have won best or most significant paper awards. Dr. Betz is a Fellow of the IEEE and the NAI, and Faculty Affiliate of the Vector Institute for Artificial Intelligence.

References
[1] M. Hall and V. Betz, "HPIPE: Heterogeneous layer-pipelined and sparse-aware CNN inference for FPGAs," 2020, arXiv:2007.10451.
[2] P. Yiannacouras et al., "Data parallel FPGA workloads: Software versus hardware," in Proc. IEEE Int. Conf. Field-Programmable Logic Appl. (FPL), 2009, pp. 51–58.
[3] M. Cummings and S. Haruyama, "FPGA in the software radio," IEEE Commun. Mag., vol. 37, no. 2, pp. 108–112, 1999. doi: 10.1109/35.747258.
[4] J. Rettkowski et al., "HW/SW co-design of the HOG algorithm on a Xilinx Zynq SoC," J. Parallel Distrib. Comput., vol. 109, pp. 50–62, 2017. doi: 10.1016/j.jpdc.2017.05.005.
[5] A. Bitar et al., "Bringing programmability to the data plane: Packet processing with a NoC-enhanced FPGA," in Proc. IEEE Int. Conf. Field-Programmable Technol. (FPT), 2015, pp. 24–31.
[6] H. Krupnova and G. Saucier, "FPGA-based emulation: Industrial and custom prototyping solutions," in Proc. Int. Workshop Field-Programmable Logic Appl. (FPL), Springer-Verlag, 2000, pp. 68–77.
[7] A. Boutros et al., "Build fast, trade fast: FPGA-based high-frequency trading using high-level synthesis," in Proc. IEEE Int. Conf. Reconfigurable Comput. FPGAs (ReConFig), 2017, pp. 1–6.
[8] A. Putnam et al., "A reconfigurable fabric for accelerating large-scale datacenter services," in Proc. ACM/IEEE Int. Symp. Comput. Architecture (ISCA), 2014, pp. 13–24.
[9] A. M. Caulfield et al., "A cloud-scale acceleration architecture," in Proc. IEEE/ACM Int. Symp. Microarchitecture (MICRO), 2016, pp. 1–13.
[10] J. Fowers et al., "A configurable cloud-scale DNN processor for real-time AI," in Proc. ACM/IEEE Int. Symp. Comput. Architecture (ISCA), 2018, pp. 1–14.
[11] I. Kuon and J. Rose, "Measuring the gap between FPGAs and ASICs," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 26, no. 2, pp. 203–215, 2007. doi: 10.1109/TCAD.2006.884574.
[12] A. Boutros et al., "You cannot improve what you do not measure: FPGA vs. ASIC efficiency gaps for convolutional neural network inference," ACM Trans. Reconfigurable Technol. Syst. (TRETS), vol. 11, no. 3, pp. 1–23, 2018. doi: 10.1145/3242898.
[13] S. Yang, Logic Synthesis and Optimization Benchmarks User Guide: Version 3.0. Microelectronics Center of North Carolina, 1991.
[14] K. Murray et al., "VTR 8: High-performance CAD and customizable FPGA architecture modelling," ACM Trans. Reconfigurable Technol. Syst. (TRETS), vol. 13, no. 2, pp. 1–55, June 2020. doi: 10.1145/3388617.
[15] K. Murray et al., "Titan: Enabling large and complex benchmarks in academic CAD," in Proc. IEEE Int. Conf. Field-Programmable Logic Appl. (FPL), 2013, pp. 1–8.
[16] H. Parandeh-Afshar et al., "Rethinking FPGAs: Elude the flexibility excess of LUTs with and-inverter cones," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays (FPGA), 2012, pp. 119–128.
[17] H. Parandeh-Afshar et al., "Shadow AICs: Reaping the benefits of and-inverter cones with minimal architectural impact," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays (FPGA), 2013, pp. 279–279.
[18] H. Parandeh-Afshar et al., "Shadow and-inverter cones," in Proc. IEEE Int. Conf. Field-Programmable Logic Appl. (FPL), 2013, pp. 1–4.
[19] G. Zgheib et al., "Revisiting and-inverter cones," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays (FPGA), 2014, pp. 45–54. doi: 10.1145/2554688.2554791.
[20] V. Betz et al., Architecture and CAD for Deep-Submicron FPGAs. Springer Science & Business Media, 1999.
[21] V. Betz and J. Rose, "How much logic should go in an FPGA logic block?" IEEE Design Test Comput., vol. 15, no. 1, pp. 10–15, 1998. doi: 10.1109/54.655177.
[22] G. Lemieux et al., "Generating highly-routable sparse crossbars for PLDs," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays (FPGA), 2000, pp. 155–164. doi: 10.1145/329166.329199.
[23] C. Chiasson and V. Betz, "COFFE: Fully-automated transistor sizing for FPGAs," in Proc. IEEE Int. Conf. Field-Programmable Technol. (FPT), 2013, pp. 34–41.
[24] E. Ahmed and J. Rose, "The effect of LUT and cluster size on deep-submicron FPGA performance and density," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 12, no. 3, pp. 288–298, 2004. doi: 10.1109/TVLSI.2004.824300.
[25] "Stratix II Device Handbook, Volume 1 (SII5V1-4.5)," Altera Corp., 2007.
[26] D. Lewis et al., "The Stratix II logic and routing architecture," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays (FPGA), 2005, pp. 14–20.
[27] T. Ahmed et al., "Packing techniques for Virtex-5 FPGAs," ACM Trans. Reconfigurable Technol. Syst. (TRETS), vol. 2, no. 3, pp. 1–24, 2009. doi: 10.1145/1575774.1575777.
[28] W. Feng et al., "Improving FPGA performance with a S44 LUT structure," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays (FPGA), 2018, pp. 61–66. doi: 10.1145/3174243.3174272.
[29] "Versal ACAP Configurable Logic Block Architecture Manual (AM005 v1.0)," Xilinx Inc., 2020.
[30] D. Lewis et al., "Architectural enhancements in Stratix V," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays (FPGA), 2013, pp. 147–156.
[31] I. Ganusov and B. Devlin, "Time-borrowing platform in the Xilinx Ultrascale+ family of FPGAs and MPSoCs," in Proc. IEEE Int. Conf. Field-Programmable Logic Appl. (FPL), 2016, pp. 1–9.
[32] K. Murray et al., "Optimizing FPGA logic block architectures for arithmetic," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 28, no. 6, pp. 1378–1391, 2020. doi: 10.1109/TVLSI.2020.2965772.
[33] S. Yazdanshenas and V. Betz, "Automatic circuit design and modelling for heterogeneous FPGAs," in Proc. IEEE Int. Conf. Field-Programmable Technol. (ICFPT), 2017, pp. 9–16.
[34] J. Chromczak et al., "Architectural enhancements in Intel Agilex FPGAs," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays (FPGA), 2020, pp. 140–149.
[35] S. Rasoulinezhad et al., "LUXOR: An FPGA logic cell architecture for efficient compressor tree implementations," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays (FPGA), 2020, pp. 161–171.
[36] A. Boutros et al., "Math doesn't have to be hard: Logic block architectures to enhance low-precision multiply-accumulate on FPGAs," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays (FPGA), 2019, pp. 94–103.
[37] M. Eldafrawy et al., "FPGA logic block architectures for efficient deep learning inference," ACM Trans. Reconfigurable Technol. Syst. (TRETS), vol. 13, no. 3, pp. 1–34, 2020. doi: 10.1145/3393668.
[38] C. Chiasson and V. Betz, "Should FPGAs abandon the pass gate?" in Proc. Int. Conf. Field-Programmable Logic Appl. (FPL), 2013, pp. 1–8.
[39] FlexLogix eFPGA. https://flex-logix.com/efpga/
[40] V. Betz and J. Rose, "FPGA routing architecture: Segmentation and buffering to optimize speed and density," in Proc. ACM Int. Symp. FPGAs, 1999, pp. 59–68.
[41] O. Petelin and V. Betz, "The speed of diversity: Exploring complex FPGA routing topologies for the global metal layer," in Proc. IEEE Int. Conf. Field-Programmable Logic Appl. (FPL), 2016, pp. 1–10.
[42] D. Lewis et al., "The Stratix routing and logic architecture," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays (FPGA), 2003, pp. 12–20.
[43] X. Tang et al., "A study on switch block patterns for tileable FPGA routing architectures," in Proc. IEEE Int. Conf. Field-Programmable Technol. (FPT), 2019, pp. 247–250.
[44] G. Lemieux et al., "Directional and single-driver wires in FPGA interconnect," in Proc. IEEE Int. Conf. Field-Programmable Technol. (FPT), 2004, pp. 41–48.
[45] D. Lewis et al., "The Stratix 10 highly pipelined FPGA architecture," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays (FPGA), 2016, pp. 159–168.
[46] B. Gaide et al., "Xilinx adaptive compute acceleration platform: Versal architecture," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays (FPGA), 2019, pp. 84–93. doi: 10.1145/3289602.3293906.
[47] J. Tyhach et al., "A 90 nm FPGA I/O buffer design with 1.6 Gbps data rate for source-synchronous system and 300 MHz clock rate for external memory interface," in Proc. IEEE Custom Integrated Circuits Conf., 2004, pp. 431–434.
[48] N. Zhang et al., "Low-voltage and high-speed FPGA I/O cell design in 90nm CMOS," in Proc. IEEE Int. Conf. ASIC, 2009, pp. 533–536.
[49] T. Qian et al., "A 1.25Gbps programmable FPGA I/O buffer with multi-standard support," in Proc. IEEE Int. Conf. Integr. Circuits Microsyst., 2018, pp. 362–365.
[50] P. Upadhyaya et al., "A fully-adaptive wideband 0.5–32.75Gb/s FPGA transceiver in 16nm FinFET CMOS technology," in Proc. IEEE Symp. VLSI Circuits, 2016, pp. 1–2.
[51] "Implementing RAM functions in FLEX 10K Devices (A-AN-052-01)," Altera Corp., 1995.
[52] K. Tatsumura et al., "High density, low energy, magnetic tunnel junction based block RAMs for memory-rich FPGAs," in Proc. IEEE Int. Conf. Field-Programmable Technol. (FPT), 2016, pp. 4–11.

[53] T. Ngai et al., "An SRAM-programmable field-configurable memory," in Proc. IEEE Custom Integr. Circuits Conf. (CICC), 1995, pp. 499–502.
[54] S. Wilton et al., "Architecture of centralized field-configurable memory," in Proc. ACM Int. Symp. Field-Programmable Gate Arrays (FPGA), 1995, pp. 97–103.
[55] S. Yazdanshenas et al., "Don't forget the memory: Automatic block RAM modelling, optimization, and architecture exploration," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays (FPGA), 2017, pp. 115–124.
[56] T. R. Halfhill, "Tabula's time machine," Microprocessor Rep., vol. 131, 2010.
[57] "Mercury programmable logic device family (DS-MERCURY-2.2)," Altera Corp., 2003.
[58] C. E. LaForest and J. G. Steffan, "Efficient multi-ported memories for FPGAs," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays (FPGA), 2010, pp. 41–50. doi: 10.1145/1723112.1723122.
[59] C. E. LaForest et al., "Multi-ported memories for FPGAs via XOR," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays (FPGA), 2012, pp. 209–218. doi: 10.1145/2145694.2145730.
[60] D. Lewis et al., "Architectural enhancements in Stratix-III and Stratix-IV," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays (FPGA), 2009, pp. 33–42.
[61] R. Tessier et al., "Power-efficient RAM mapping algorithms for FPGA embedded memory blocks," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 26, no. 2, pp. 278–290, 2007. doi: 10.1109/TCAD.2006.887924.
[62] B.-C. C. Lai and J.-L. Lin, "Efficient designs of multiported memory on FPGA," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 1, pp. 139–150, 2016. doi: 10.1109/TVLSI.2016.2568579.
[63] H. Wong et al., "Comparing FPGA vs. custom CMOS and the impact on processor microarchitecture," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays (FPGA), 2011, pp. 5–14. doi: 10.1145/1950413.1950419.
[64] E. Kadric et al., "Impact of memory architecture on FPGA energy consumption," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays (FPGA), 2015, pp. 146–155. doi: 10.1145/2684746.2689062.
[65] L. Ju et al., "NVM-based FPGA block RAM with adaptive SLC-MLC conversion," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 37, no. 11, pp. 2661–2672, 2018. doi: 10.1109/TCAD.2018.2857261.
[66] P. Longa and A. Miri, "Area-efficient FIR filter design on FPGAs using distributed arithmetic," in Proc. IEEE Int. Symp. Signal Process. Inform. Technol., 2006, pp. 248–252.
[67] P. K. Meher et al., "FPGA realization of FIR filters by efficient and flexible systolization using distributed arithmetic," IEEE Trans. Signal Process., vol. 56, no. 7, pp. 3009–3017, 2008. doi: 10.1109/TSP.2007.914926.
[68] "Virtex-II platform FPGAs: Complete data sheet (DS031 v4.0)," Xilinx Inc., 2014.
[69] C. R. Baugh and B. A. Wooley, "A two's complement parallel array multiplication algorithm," IEEE Trans. Comput., vol. C-22, no. 12, pp. 1045–1047, 1973. doi: 10.1109/T-C.1973.223648.
[70] "Using the DSP Blocks in Stratix & Stratix GX Devices (AN-214-3.0)," Altera Corp., 2002.
[71] "XtremeDSP for Virtex-4 FPGAs (UG073 v2.7)," Xilinx Inc., 2008.
[72] "DSP Blocks in Stratix III Devices (SIII51005-1.7)," Altera Corp., 2010.
[73] "Stratix V Device Handbook Volume 1: Device Interfaces and Integration (SV-5V1)," Altera Corp., 2020.
[74] "Virtex-5 FPGA XtremeDSP Design Considerations (UG193 v3.6)," Xilinx Inc., 2017.
[75] "Virtex-6 FPGA DSP48E1 Slice (UG369 v1.3)," Xilinx Inc., 2011.
[76] "UltraScale Architecture DSP Slice (UG579 v1.9)," Xilinx Inc., 2019.
[77] H. Parandeh-Afshar and P. Ienne, "Highly versatile DSP blocks for improved FPGA arithmetic performance," in Proc. IEEE Int. Symp. Field-Programmable Custom Comput. Mach. (FCCM), 2010, pp. 229–236.
[78] A. Cevrero et al., "Field programmable compressor trees: Acceleration of multi-input addition on FPGAs," ACM Trans. Reconfigurable Technol. Syst. (TRETS), vol. 2, no. 2, pp. 1–36, 2009. doi: 10.1145/1534916.1534923.
[79] M. Langhammer and B. Pasca, "Floating-point DSP block architecture for FPGAs," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays (FPGA), 2015, pp. 117–125. doi: 10.1145/2684746.2689071.
[80] S. Ahmad et al., "Xilinx first 7nm device: Versal AI Core (VC1902)," in Proc. Hot Chips Symp., 2019, pp. 1–28.
[81] P. Gysel et al., "Hardware-oriented approximation of convolutional neural networks," 2016, arXiv:1604.03168.
[82] N. Mellempudi et al., "Mixed low-precision deep learning inference using dynamic fixed point," 2017, arXiv:1701.08978.
[83] A. Mishra et al., "WRPN: Wide reduced-precision networks," 2017, arXiv:1709.01134.
[84] A. Boutros et al., "Embracing diversity: Enhanced DSP blocks for low-precision deep learning on FPGAs," in Proc. IEEE Int. Conf. Field-Programmable Logic Appl. (FPL), 2018, pp. 35–357.
[85] S. Rasoulinezhad et al., "PIR-DSP: An FPGA DSP block architecture for multi-precision deep neural networks," in Proc. IEEE Int. Symp. Field-Programmable Custom Comput. Mach. (FCCM), 2019, pp. 35–44.
[86] "Intel Agilex variable precision DSP blocks user guide (UG-20213)," Intel Corp., 2020.
[87] "Intel Stratix 10 NX FPGA: AI-optimized FPGA for high-bandwidth, low-latency AI acceleration (SS-1121-1.0)," Intel Corp., 2020.
[88] L. Gwennap, "Stratix 10 NX adds AI blocks," The Linley Group Newsletters, 2020.
[89] A. Boutros et al., "Beyond peak performance: Comparing the real performance of AI-optimized FPGAs and GPUs," in Proc. IEEE Int. Conf. Field-Programmable Technol. (FPT), 2020.
[90] "Speedster7t machine learning processing user guide (UG088)," Achronix Corp., 2019.
[91] "High Bandwidth Memory (HBM2) Interface Intel FPGA IP User Guide (UG-20031)," Intel Corp., 2020.
[92] M. T. Bohr, "Interconnect scaling: The real limiter to high performance ULSI," in Proc. Int. Electron Devices Meeting, 1995, pp. 241–244.
[93] M. S. Abdelfattah and V. Betz, "Design tradeoffs for hard and soft FPGA-based networks-on-chip," in Proc. IEEE Int. Conf. Field-Programmable Technol. (FPT), 2012, pp. 95–103.
[94] M. S. Abdelfattah and V. Betz, "The power of communication: Energy-efficient NoCs for FPGAs," in Proc. IEEE Int. Conf. Field-Programmable Logic Appl. (FPL), 2013, pp. 1–8.
[95] M. S. Abdelfattah and V. Betz, "The case for embedded networks on chip on field-programmable gate arrays," IEEE Micro, vol. 34, no. 1, pp. 80–89, 2013.
[96] M. S. Abdelfattah et al., "Take the highway: Design for embedded NoCs on FPGAs," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays (FPGA), 2015, pp. 98–107. doi: 10.1145/2684746.2689074.
[97] S. Yazdanshenas and V. Betz, "Quantifying and mitigating the costs of FPGA virtualization," in Proc. IEEE Int. Conf. Field-Programmable Logic Appl. (FPL), 2017, pp. 1–7.
[98] N. Kapre and J. Gray, "Hoplite: A deflection-routed directional torus NoC for FPGAs," ACM Trans. Reconfigurable Technol. Syst. (TRETS), vol. 10, no. 2, pp. 1–24, 2017. doi: 10.1145/3027486.
[99] M. K. Papamichael and J. C. Hoe, "CONNECT: Re-examining conventional wisdom for designing NoCs in the context of FPGAs," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays (FPGA), 2012, pp. 37–46.
[100] Y. Huan and A. DeHon, "FPGA optimized packet-switched NoC using split and merge primitives," in Proc. IEEE Int. Conf. Field-Programmable Technol. (FPT), 2012, pp. 47–52.
[101] S. Yazdanshenas and V. Betz, "Interconnect solutions for virtualized field-programmable gate arrays," IEEE Access, vol. 6, pp. 10,497–10,507, 2018. doi: 10.1109/ACCESS.2018.2806618.
[102] I. Swarbrick et al., "Network-on-chip programmable platform in Versal ACAP architecture," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays (FPGA), 2019, pp. 212–221.
[103] "Speedster7t network on chip user guide (UG089)," Achronix Corp., 2019.
[104] "AMBA AXI and ACE protocol specification," ARM Holdings, Tech. Rep., 2013.
[105] E. Nasiri et al., "Multiple dice working as one: CAD flows and routing architectures for silicon interposer FPGAs," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 24, no. 5, pp. 1821–1834, 2015. doi: 10.1109/TVLSI.2015.2478280.
[106] R. Chaware et al., "Assembly and reliability challenges in 3D integration of 28nm FPGA die on a large high density 65nm passive interposer," in Proc. IEEE Electronic Components Technol. Conf., 2012, pp. 279–283.
[107] E. Nurvitadhi et al., "In-package domain-specific ASICs for Intel Stratix 10 FPGAs: A case study of accelerating deep learning using TensorTile ASIC," in Proc. IEEE Int. Conf. Field-Programmable Logic Appl. (FPL), 2018, pp. 106–1064.
[108] E. Nurvitadhi et al., "Evaluating and enhancing Intel Stratix 10 FPGAs for persistent real-time AI," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays (FPGA), 2019, pp. 119–119.
[109] E. Nurvitadhi et al., "Why compete when you can work together: FPGA-ASIC integration for persistent RNNs," in Proc. IEEE Int. Symp. Field-Programmable Custom Comput. Mach. (FCCM), 2019, pp. 199–207.
