
FPGA PPT Presentation On Flow


What Are FPGAs

Field-Programmable Gate Array


Can be configured to act like any circuit – More later!
Can do many things, but we focus on computation acceleration
FPGAs Come In Many Forms

PCIe-Attached In-Storage

CPU Integrated In-Network


How Is It Different From CPU/GPUs
GPU – The other major accelerator
CPU/GPU hardware is fixed
o “General purpose”
o we write programs (sequence of instructions) for them
FPGA hardware is not fixed
o “Special purpose”
o Hardware can be whatever we want
o Will our hardware require/support software? Maybe!
Optimized hardware is very efficient
o GPU-level performance**
o 10x power efficiency (300 W vs 30 W)
Analogy
CPU/GPU comes with fixed circuits
FPGA gives you a big bag of components
o To build whatever
o Could be a CPU/GPU!

[Images: “The Z-Berry”; “Experimental Investigations on Radiation Characteristics of IC Chips”; benryves.com “Z80 Computer”; Shadi Soundation: Homebrew 4 bit CPU]
Fine-Grained Parallelism of Special-Purpose Circuits
Example -- Calculating gravitational force:
Ret = (G × m1 × m2) / ((x1 - x2)^2 + (y1 - y2)^2)
8 instructions on a CPU → 8 cycles**
Far fewer cycles on a special-purpose circuit

4 cycles with basic operations
Cycle 1: A = G × m1    C = x1 - x2    E = y1 - y2
Cycle 2: B = A × m2    D = C^2        F = E^2
Cycle 3: H = D + F
Cycle 4: Ret = B / H

3 cycles with compound operations
Cycle 1: A = G × m1 × m2    B = (x1 - x2)^2    C = (y1 - y2)^2
Cycle 2: D = B + C
Cycle 3: Ret = A / D

1 cycle with even further compound operations
Cycle 1: Ret = (G × m1 × m2) / ((x1 - x2)^2 + (y1 - y2)^2)

Compound operations may slow down the clock
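The 8-versus-4 cycle count for basic operations can be checked by treating the calculation as a dataflow graph. The following is a minimal sketch, not a model of any real scheduler: each operation becomes ready one cycle after its latest dependency, and independent operations share a cycle. The sum of the squared distances is called H here to keep it distinct from the constant G; the function names are illustrative.

```python
# Each operation lists the values it depends on. Plain inputs have no entry.
OPS = {
    "A": ("G", "m1"),     # A = G * m1
    "B": ("A", "m2"),     # B = A * m2
    "C": ("x1", "x2"),    # C = x1 - x2
    "D": ("C",),          # D = C^2
    "E": ("y1", "y2"),    # E = y1 - y2
    "F": ("E",),          # F = E^2
    "H": ("D", "F"),      # H = D + F
    "Ret": ("B", "H"),    # Ret = B / H
}


def level(name):
    """Earliest cycle in which `name` can be computed (inputs are level 0)."""
    if name not in OPS:
        return 0
    return 1 + max(level(dep) for dep in OPS[name])


def schedule():
    """Group operations by the cycle in which they can execute."""
    cycles = {}
    for op in OPS:
        cycles.setdefault(level(op), []).append(op)
    return [cycles[c] for c in sorted(cycles)]


sequential_cycles = len(OPS)       # one operation per cycle, as on a simple CPU
parallel_cycles = len(schedule())  # independent operations share a cycle

for i, ops in enumerate(schedule(), start=1):
    print(f"cycle {i}: {', '.join(ops)}")
```

The depth of the graph (its critical path) sets the cycle count of the special-purpose circuit, while the operation count sets it for a one-instruction-per-cycle processor.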
Coarse-Grained Parallelism of Special-Purpose Circuits
Typical unit of parallelism for general-purpose processors is the thread (~= one per core)
Special-purpose processing units can also be replicated for parallelism
o Large, complex processing units: Few can fit in chip
o Small, simple processing units: Many can fit in chip
Only generates hardware useful for the application
o Instructions? Decoding? Cache? Coherence? Only if the application needs them
How Is It Different From ASICs
ASIC (Application-Specific Integrated Circuit)
o Special chip purpose-built for an application
o E.g., ASIC bitcoin miner, Intel neural network accelerator
o Function cannot be changed once expensively built
 + FPGAs can be field-programmed
o Function can be changed completely whenever
o FPGA fabric emulates custom circuits
 - Emulated circuits are not as efficient as bare-metal
o ASICs get ~10x performance (larger circuits, faster clock)
o ASICs get ~10x power efficiency
Basic FPGA Architecture
“Configurable logic block (CLB)”
o 6-input look-up table (LUT)
o Flip-flop (FF) or latch for sequential circuit construction
Programmable I/O block
Programmable interconnect

Ex) 2-LUT for “AND”

Input 1  Input 2  Output
0        0        0
0        1        0
1        0        0
1        1        1
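A look-up table can be modeled in software as a small memory indexed by its inputs. This is a minimal sketch, assuming a hypothetical `LUT` class (not part of any vendor tool) that stores 2^k configuration bits for k inputs. Loading a different bit pattern into the same structure yields a different logic function, which is the sense in which the fabric is programmable.

```python
class LUT:
    """A k-input look-up table: 2**k stored bits, indexed by the inputs."""

    def __init__(self, k, bits):
        if len(bits) != 2 ** k:
            raise ValueError("a k-input LUT stores exactly 2**k bits")
        self.k = k
        self.bits = list(bits)

    def __call__(self, *inputs):
        if len(inputs) != self.k:
            raise ValueError(f"expected {self.k} inputs")
        index = 0
        for bit in inputs:  # the first input is the most significant index bit
            index = (index << 1) | (bit & 1)
        return self.bits[index]


# Configuration bits listed in truth-table order: 00, 01, 10, 11
and_lut = LUT(2, [0, 0, 0, 1])
xor_lut = LUT(2, [0, 1, 1, 0])  # same structure, different configuration

for a in (0, 1):
    for b in (0, 1):
        print(a, b, and_lut(a, b))
```

A 6-input LUT works the same way with 64 configuration bits.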
Basic FPGA Architecture – DSP Blocks
CLBs act as gates – Many needed to implement high-level logic
Arithmetic operations provided as efficient ALU blocks
o “Digital Signal Processing (DSP) blocks”
o Each block provides an adder + multiplier (× and +/-)
Basic FPGA Architecture – Block RAM
CLB can act as flip-flops
o (~1 bit/block) – tiny!
Some on-chip SRAM provided as blocks
o ~18/36 Kbit/block, MBs per chip
o Massively parallel access to data → multi-TB/s bandwidth
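The step from small blocks to multi-TB/s aggregate bandwidth is simple multiplication. The sketch below works through it with hypothetical figures that are assumptions for illustration only, not taken from any datasheet: 2000 blocks of 36 Kbit, two ports per block, 36 bits per port, and a 500 MHz clock.

```python
def aggregate_bandwidth_bytes_per_s(num_blocks, ports_per_block, bits_per_port, clock_hz):
    """Peak bandwidth if every port of every block is accessed each cycle."""
    bits_per_cycle = num_blocks * ports_per_block * bits_per_port
    return bits_per_cycle * clock_hz / 8


def capacity_bytes(num_blocks, kbit_per_block):
    """Total on-chip block RAM capacity, taking 1 Kbit = 1024 bits."""
    return num_blocks * kbit_per_block * 1024 // 8


# Hypothetical chip parameters (assumptions, chosen only for illustration)
bandwidth = aggregate_bandwidth_bytes_per_s(
    num_blocks=2000, ports_per_block=2, bits_per_port=36, clock_hz=500e6
)
capacity = capacity_bytes(num_blocks=2000, kbit_per_block=36)

print(f"{bandwidth / 1e12:.1f} TB/s aggregate, {capacity / 1e6:.1f} MB total")
```

Under these assumptions the total capacity is only a few megabytes, but because every block can be accessed in the same cycle, the peak bandwidth lands in the TB/s range.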
Basic FPGA Architecture – Hard Cores
Some functions are provided as efficient, non-configurable “hard cores”
o Multi-core ARM cores (“Zynq” series)
o Multi-Gigabit Transceivers
o PCIe/Ethernet PHY
o Memory controllers
o …
Example Accelerator Card Architecture
“FPGA Mezzanine Card” (FMC) Expansion
o Network Ports, Memory, Storage, PCIe, …
o Connects through General-Purpose I/O Pins and Multi-Gigabit Transceivers

[Figure: FPGA connected to FMC, 1GbE, 40GbE, DRAM, and PCIe]
Example Accelerator Card (VCU108)
Programming FPGAs
Languages and tools overlap with ASIC/VLSI design
FPGA design for acceleration is typically done with either
o Hardware Description Languages (HDL): Register-Transfer Level (RTL) languages
o High-Level Synthesis: Compiler translates software programming languages to RTL
RTL models a circuit using:
o Registers (state), and
o Combinational logic (computation)
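The register/combinational split can be illustrated with a small software model. This is a sketch under stated assumptions, not RTL: the hypothetical `Accumulator` class keeps its state in one register, computes the next state with a pure function of inputs and current state, and updates the register only on a simulated clock edge.

```python
class Accumulator:
    """RTL-style model: one register plus combinational next-state logic."""

    def __init__(self, width=8):
        self.mask = (1 << width) - 1
        self.total = 0  # register (state)

    def next_state(self, data_in, enable):
        """Combinational logic: a pure function of inputs and current state."""
        if enable:
            return (self.total + data_in) & self.mask  # wraps like a fixed-width adder
        return self.total

    def clock(self, data_in, enable):
        """Clock edge: the register captures the computed next state."""
        self.total = self.next_state(data_in, enable)
        return self.total


acc = Accumulator(width=8)
trace = [acc.clock(d, en) for d, en in [(5, 1), (7, 1), (100, 0), (250, 1)]]
print(trace)
```

The third input is ignored because `enable` is 0, and the last addition wraps around at 8 bits, as a fixed-width hardware adder would.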
Hardware Description Language
Software programming languages: Describe process
Hardware description languages: Describe structure

Software (C++): instructions for a CPU; queues exist in memory

    std::queue<float> input_queue;
    std::queue<float> output_queue;
    float factor;

    while (true) {
        if ( !input_queue.empty() ) {
            float ret = input_queue.front() * factor;
            output_queue.push(ret);
            input_queue.pop();
        }
    }

Hardware (Bluespec): creates circuits; FIFOs exist on chip

    FIFO#(Float) input_queue <- mkFIFO;
    FIFO#(Float) output_queue <- mkFIFO;
    Reg#(Float) factor <- mkReg;
    FloatMultIfc mult <- mkFloatMult;

    rule in;
        mult.enq(factor, input_queue.first);
        input_queue.deq;
    endrule
    rule out;
        ret <- mult.result;
        output_queue.enq(ret);
    endrule
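The difference between the sequential loop and the two rules shows up when both rules are allowed to fire in the same cycle. The sketch below is a rough cycle-by-cycle software model, not Bluespec semantics: the multiplier is assumed to be a pipeline with a hypothetical latency of 3 cycles, and `simulate` is an illustrative name.

```python
from collections import deque

MULT_LATENCY = 3  # hypothetical pipeline depth of the multiplier, in cycles


def simulate(inputs, factor, cycles):
    """Run the 'in' and 'out' rules side by side for a number of cycles."""
    input_queue = deque(inputs)
    output_queue = deque()
    mult_pipeline = deque()  # entries: [cycles_remaining, product]

    for _ in range(cycles):
        # rule out: fires when the oldest product has left the multiplier
        if mult_pipeline and mult_pipeline[0][0] == 0:
            output_queue.append(mult_pipeline.popleft()[1])

        # products still inside the multiplier advance by one stage
        for entry in mult_pipeline:
            entry[0] -= 1

        # rule in: fires when an input is available, in the same cycle as 'out'
        if input_queue:
            mult_pipeline.append([MULT_LATENCY, input_queue.popleft() * factor])

    return list(output_queue)


print(simulate([1.0, 2.0, 3.0], factor=2.0, cycles=10))
```

After the pipeline fills, one result leaves per cycle even though each multiplication takes several cycles, because a new input enters while earlier products are still in flight.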
Major Hardware Description Languages
Verilog: Most widely used in industry
o Relatively low-level language supported by everyone
Chisel – Compiles to Verilog
o Relatively high-level language from Berkeley
o Embedded in the Scala programming language
o Prominently used in RISC-V development (Rocket core, etc)
Bluespec – Compiles to Verilog
o Relatively high-level language from MIT
o Supports types, interfaces, etc
o Also used in active RISC-V development (Piccolo, etc)
High-Level Synthesis
Compiler translates software programming languages to RTL
High-Level Synthesis compilers from Xilinx, Altera/Intel
o Compile C/C++, annotated with #pragmas, into RTL
o Theory/history behind it is a complex can of worms we won’t go into
o Personal experience: needs to be HEAVILY annotated to get performance
o Anecdote: Naïve RISC-V in Vivado HLS achieves IPC of 0.0002 [1], 0.04 after optimizations [2]
OpenCL
o Inherently parallel language more efficiently translated to hardware
o Stable software interface

[1] http://msyksphinz.hatenablog.com/entry/2019/02/20/040000
[2] http://msyksphinz.hatenablog.com/entry/2019/02/27/040000
FPGA Compilation Toolchain
[Figure: compilation flow]
High-level HDL code → language compiler (high-level language vendor tool) → Verilog/VHDL → synthesize → netlist → map/place/route → bitfile
o Functional simulation and cycle-level simulation
o Constraint file: “Which transceiver instance should top_transceiver_01 map to?” And so, so much more…
o Synthesis through bitfile: FPGA vendor toolchain (few open source)
Programming/Using an FPGA Accelerator
Bitfile is programmed to FPGA over “JTAG” interface
o Typically used over USB cable
o Supports FPGA programming, limited debugging access, etc
PCIe-attached FPGA accelerator card is typically used similarly to GPUs
o Program FPGA, execute software
o Software copies data to FPGA board, notifies FPGA
-> FPGA logic performs computations
-> Software copies data back from FPGA
FPGA flexibility gives immense freedom of usage patterns
o Streaming, coherent memory, …
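The GPU-like usage pattern above (copy in, notify, compute, copy out) can be sketched from the host's point of view. Everything here is a stand-in: `FakeAcceleratorCard` and its methods are hypothetical names for illustration, not a real driver API, and the "FPGA logic" is assumed to square each element.

```python
class FakeAcceleratorCard:
    """Stand-in for a PCIe-attached FPGA board; not a real driver API."""

    def __init__(self):
        self.board_memory = []
        self.done = False

    def copy_to_board(self, data):
        self.board_memory = list(data)
        self.done = False

    def notify(self):
        # On real hardware the FPGA logic would run here; this model
        # assumes the configured circuit squares each element.
        self.board_memory = [x * x for x in self.board_memory]
        self.done = True

    def copy_from_board(self):
        if not self.done:
            raise RuntimeError("results requested before the FPGA finished")
        return list(self.board_memory)


def offload(card, data):
    card.copy_to_board(data)       # software copies data to the FPGA board
    card.notify()                  # software notifies the FPGA, which computes
    return card.copy_from_board()  # software copies the results back


result = offload(FakeAcceleratorCard(), [1, 2, 3])
print(result)
```

A streaming or coherent-memory design would replace the explicit copies with a different interface, which is the flexibility the last bullet refers to.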
Partial Reconfiguration

Parts of the FPGA (sub-components) can be swapped out dynamically without turning off the FPGA
o Physical area is drawn on chip
Used in Amazon F1, etc
Toolchain support for isolation
FPGAs In The Cloud
Amazon EC2 F1 instance (1 – 4 FPGAs)
Microsoft Azure, etc…
