Resource-efficient RISC-V Vector Extension Architecture for FPGA-based Accelerators

Md Ashraful Islam (ashraful@arch.cs.titech.ac.jp), School of Computing, Tokyo Institute of Technology, Meguro-ku, Tokyo, Japan
Kenji Kise (kise@c.titech.ac.jp), School of Computing, Tokyo Institute of Technology, Meguro-ku, Tokyo, Japan

ABSTRACT
For the increasing demands of embedded computation, hardware accelerators are widely used with processors. FPGA provides the flexibility to design such accelerators because it is a programmable device, but developing a custom accelerator for each application is time-consuming and not reusable. Vector processing, on the other hand, brings the opportunity to accelerate computation by taking advantage of data-level parallelism.
This paper presents the architecture of a scalable soft vector processing unit for FPGA based on a subset of the RISC-V vector extension instruction set. The maximum vector length and the number of lanes are configurable in the proposed architecture. We have integrated our proposed vector processing unit into a 32-bit scalar RISC-V core and implemented it in FPGA. The implementation results show that our proposed architecture consumes significantly fewer FPGA resources and achieves more than four times higher operating frequency than other vector processing units. It achieves 11.9 giga operations per second for the 8-bit integer convolution operation. We demonstrate that the performance of the proposed vector processing unit is scalable with the maximum vector length and the number of lanes.

CCS CONCEPTS
• Computer systems organization → Embedded systems; Architecture.

KEYWORDS
RISC-V, Soft Processor, Vector Extension, Variable Precision, IoT, Edge computing

ACM Reference Format:
Md Ashraful Islam and Kenji Kise. 2023. Resource-efficient RISC-V Vector Extension Architecture for FPGA-based Accelerators. In The International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies 2023 (HEART 2023), June 14–16, 2023, Kusatsu, Japan. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3597031.3597047

1 INTRODUCTION
The cloud-based data processing approach is limited by bandwidth, data privacy, and latency, and the demand for real-time workloads on embedded systems at the edge is growing. Many of these embedded applications are data-intensive, but their computation performance on a scalar processor is limited by the lack of data-level parallelism. In an FPGA-based design, such data-parallel applications can be addressed by custom-designed hardware accelerators; however, developing embedded applications with such custom accelerators requires hardware design experience. A vector processing unit, on the other hand, allows the developer to exploit this data-level parallelism without redesigning the accelerator every time.
The advent of quantized integer-based AI inference enables the development of applications such as image classification, voice recognition, and many others at the edge. These applications require integer computations such as multiplication, addition, dot operations (for convolution), and min/max, with varying numbers of parallel operations. Single instruction multiple data (SIMD) architectures execute operations over wide registers of a fixed length, such as ARM NEON[10] and AVX-512[6]. In a vector architecture, by contrast, the vector length can be changed by the application up to a predefined maximum vector length (MVL), as in ARM SVE[16] and the RISC-V vector extension[7].
The RISC-V vector extension (RVV) is an open-source instruction set architecture (ISA) that allows researchers to develop their own architecture to implement the vector processing unit. This ISA defines arithmetic instructions for integer, fixed-point, and floating-point data types. Since FPGA is a resource-constrained device, implementing the vector processor with all data types will consume a lot of resources, and support for all of those data types may not be required for a specific application. In this research, we have focused on the integer data type implementation, though the same architecture can be extended to other data types as well. The ISA also defines the vector memory load and store instructions for memory operations, reduction instructions to process the contents of a single vector register, permutation instructions to shuffle the elements in a vector register, and vector mask instructions for conditional execution.
In the past, several soft vector processing units and co-processors have been presented, such as VESPA[17], a vector co-processor coupled to a MIPS processor, and VIPERS[18], a vector processing unit. Both VESPA and VIPERS have poor support for variable-width integer data types. VEGAS[5] supports variable-width data types, but it uses scratchpad memory instead of a vector register file and has no load-store instructions. VENICE[14] and MXP[15] have reduced the area of VEGAS, but they still rely on scratchpad memory, and custom instructions need to be issued by the scalar core.


The first RISC-V vector extension-based architecture was proposed in Ara[3], based on RVV version 0.5. This architecture targets ASIC implementation and has a vector slide unit (SLDU) that performs inter-lane data transfer involving data from all the vector register file (VRF) banks at once. Such a banked VRF with an inter-lane connection to the SLDU would be a major overhead for an FPGA soft vector processing unit implementing multiple lanes, because the number of elements to be transferred among the lanes increases with the number of lanes, and interconnecting these elements over the VRF banks of multiple lanes requires a large crossbar or another interconnect with multiple ports. Another key point is that the Ara architecture does not support the vector reduction sum operation, which is important for the dot/convolution operation.
Vicuna[13] is a timing-predictable vector processor implemented on FPGA. It implements the vector register file as a monolithic multi-ported RAM, and its vector functional units read the whole vector register into a large shift register, which will overwhelm the FPGA resources with increasing vector length. A SIMD-based vector processing unit was proposed in [2][1]; it has a 2-stage pipeline that might limit the FPGA operating frequency, and it does not support the reduction instructions.
A pluggable vector processing unit with banked SRAM as the VRF is presented in [11]. To reduce the cross-lane communication for the vector permutation instructions, the VRF banks are consolidated into a single unit, and a locking protocol is maintained to avoid hazards across different functional units. However, such a centralized VRF with multiple bank arbiters requires much FPGA logic, and the complexity grows with the number of functional units and the vector length. Spatz[4] is a vector processing unit with a shared L1 cache; it, too, is based on a centralized VRF and is implemented using latches for ASIC.
In this work, the architecture of a scalable soft vector processing unit (VPU) is presented, supporting a subset of the RISC-V vector extension ISA similar to [4], except for the permutation instructions. It is pluggable into any scalar RISC-V processor core. Thanks to compiler support, developing an embedded application for such a VPU is easy, and exploiting data-level parallelism accelerates the performance of such applications. The main contributions of this paper are summarized below:
• We introduce a 5-stage pipeline-based VPU, which is parameterized and designed for FPGA. Our design can accommodate other functional units to support different vector instructions.
• We show that our distributed VRF-based architecture provides the flexibility to configure the number of lanes and the MVL. Our VPU performance is scalable with the number of lanes and the MVL.
• We present an efficient reduction sum operation in conjunction with a vector multiplier to reduce the latency of the integer dot/convolution operation.
• We implement the VPU in FPGA for different parameters and show that the FPGA resource utilization is much lower than that of other proposed designs and that the operating frequency of our VPU drops only negligibly with the number of lanes.
This paper is structured as follows: Section 2 describes the architecture of the proposed VPU. Section 3 presents the FPGA implementation results and compares our work to published works, and Section 4 concludes the paper.

2 ARCHITECTURE
We implement a subset of the RVV[7] v1.0 ISA extension (Zve32) for integer computation, which is commonly used for acceleration tasks such as convolution and matrix multiplication. It does not support the widening and permutation instructions; however, it supports the integer scalar move instructions. RVV defines the vector element length ELEN, which is the maximum size of a single vector element in bits. Users can select an element width SEW from 8-bit up to ELEN bits. The number of bits in a single vector register is defined as the vector length (VLEN); depending on the implementation, VLEN can be 32-bit to 65,536-bit. The number of elements in a single vector register is found by dividing VLEN by SEW. In this architecture, ELEN is 32-bit, and SEW can be 8-bit, 16-bit, or 32-bit. The number of lanes can be configured from 1 to 16, and the maximum vector length (MVL) can be configured from 128-bit to 4096-bit.
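As a worked instance of this relation:

\[
\#\text{elements} = \frac{\text{VLEN}}{\text{SEW}}, \qquad \text{e.g.,}\ \frac{4096}{8} = 512\ \text{INT8 elements per vector register at the largest MVL.}
\]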
RVV also defines vector control and status registers (CSRs), which are used by software to control the execution and check the status. For example, if application software wants to process the vector register contents as 8-bit elements, it should set the SEW field of the vector type (vtype) CSR to 0 by executing a vector configuration instruction.
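To make this field encoding concrete, the following SystemVerilog fragment is a minimal sketch of deriving the element width from vtype, assuming the RVV v1.0 field layout (vsew in bits [5:3]); the module and signal names are illustrative rather than taken from the actual implementation.

// Sketch: deriving SEW from the vtype CSR written by a vector
// configuration instruction (RVV v1.0 layout assumed).
module vtype_decode (
    input  logic [31:0] vtype,        // CSR value set by vsetvli
    output logic [5:0]  sew,          // selected element width in bits (8/16/32)
    output logic [2:0]  ops_per_lane  // parallel operations per 32-bit lane (4/2/1)
);
  logic [2:0] vsew;
  assign vsew = vtype[5:3];           // 0 -> SEW=8, 1 -> SEW=16, 2 -> SEW=32
  always_comb begin
    sew          = 6'd8 << vsew;      // 8, 16, 32
    ops_per_lane = 3'd4 >> vsew;      // 4, 2, 1 sub-elements per lane
  end
endmodule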
Figure 1 shows the block diagram of the proposed vector microarchitecture. Our proposed vector processing unit (VPU) executes the vector instructions irrespective of the scalar RISC-V core architecture. The VPU can have a single lane or multiple lanes, up to 16. Each lane has a 32-bit data path consisting of its own part of the vector register file and its own execute and write-back units. Each lane can perform 4/2/1 operations in parallel if the selected element width (SEW) is 8/16/32 bits, respectively.
Vector instructions are executed in a 5-stage pipeline: decode (DEC), sequencer (SEQ), vector register file (VRF), execute (EXE), and writeback (WB). Once the scalar processor fetches and decodes a vector instruction, it pushes that instruction to the vector extension queue, given that branch prediction and other hazards and exceptions are resolved. Some vector instructions require a scalar operand from the scalar core register file, which is also pushed into the same queue along with the vector instruction. A few instructions return a single vector register element value from the VPU to the scalar core; for such instructions, the write-back unit of lane 0 returns the value directly to the scalar core, and the scalar core must stall and wait for the return value from the VPU. For the other vector instructions, the scalar core can execute in parallel with the VPU, given that there are no exceptions or race conditions between the scalar core and the VPU. The functional description of the major blocks is given in the following sections.

2.1 Decoder
The vector decoder reads 32-bit instructions from the instruction queue, which are pushed by the scalar RISC-V core. It then decodes each instruction into the appropriate control signals and the required source and destination addresses of the vector registers for the subsequent pipeline stages.


Figure 1: Proposed Vector Processing Unit’s microarchitecture

If the instruction requires a scalar value from the scalar core register file, that value is also popped from the queue.
The decoder output is passed to the sequencer stage and waits for the sequencer to become ready if the previous instruction in the sequencer has not finished. The decoder then reads the next instruction from the queue if the queue is not empty. For a vector configuration instruction (e.g., vsetvli), the decoder reads and sets the vector CSR values and returns the new value of the vector length to the scalar core. The updated vector configuration values are provided by the decoder to the sequencer for the execution of the forthcoming instructions.

2.2 Vector Register File and Sequencer
In our architecture, each lane has a partial set of the vector register file. The maximum vector length (MVL) of the vector registers is parameterized and is divided equally among the lanes. Since the VRF is distributed among the lanes and there are no interconnections among the VRFs of different lanes, this architecture does not support instructions that require inter-lane VRF data exchange. Owing to that fact, it does not support the widening and permutation instructions, because those instructions need to access VRF elements across multiple lanes. Figure 2 shows the vector register file organization for MVL=512, NLANE=4 lanes, and SEW=32 (i.e., each element e in the figure is 32-bit); the elements of each vector register are indexed as e0, e1, e2, and so on. Our motivation for this VRF organization is the following:
• The VRF is implemented using block RAM (BRAM), which is available in the FPGA as a memory resource. A BRAM usually provides a few kilobytes of RAM (in modern Xilinx FPGA devices, a minimum of 4KB). According to the RISC-V vector ISA, the VRF has 32 registers; for a 32-bit element width, this requires only 128 bytes, which underutilizes the available BRAM. To increase memory utilization, we allocate more bits to a single vector register by increasing the vector length (VLEN). Thus, for any given SEW, the number of elements in a single vector register increases.
• Increasing the number of elements per vector register increases the MVL, which benefits application acceleration through fewer software loop iterations and may hide memory latency for long memory bursts.
• Implementing a monolithic register file with banked memory for multiple lanes is inefficient on FPGA, as it requires interconnecting the memory banks, and the area grows rapidly with the number of lanes. In our design, each lane has its own VRF holding a partial set of the elements of the 32 vector registers and requires no interconnect for inter-lane data transfer from the VRF. Implementing multiple lanes is therefore straightforward, and the lane count can be configured according to the application requirements.
The VRF is implemented as a 2-read, 2-write port (2R2W) register file using the live value table (LVT) technique[9]. The two read ports perform the two register operand reads (vs1/vs3 and vs2). The scalar value from the scalar core is multiplexed with the vs1/vs3 vector register read data. One register file write port is used for vector load operations, and the other write port is used for the ALU and multiplication operations. There is no data bypassing path from the write ports to the read ports.
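For reference, the following SystemVerilog fragment is a minimal sketch of a 2R2W register file in the LVT style of [9], assuming simple dual-port BRAM-style banks; parameter values and names are illustrative, not the exact implementation:

// Illustrative LVT-based 2R2W register file. One bank per write port,
// replicated per read port (four RAMs in total, matching the "two BRAMs
// used as four 2KB RAMs" arrangement described in section 3); a small
// live value table records which bank holds the live value per address.
module lvt_2r2w #(
    parameter int DEPTH = 512,
    parameter int AW    = $clog2(DEPTH)
) (
    input  logic          clk,
    input  logic          we0, we1,
    input  logic [AW-1:0] waddr0, waddr1,
    input  logic [31:0]   wdata0, wdata1,
    input  logic [AW-1:0] raddr0, raddr1,
    output logic [31:0]   rdata0, rdata1
);
  logic [31:0] b0_r0 [DEPTH], b0_r1 [DEPTH];  // bank written by port 0
  logic [31:0] b1_r0 [DEPTH], b1_r1 [DEPTH];  // bank written by port 1
  logic        lvt   [DEPTH];                 // 0: bank 0 live, 1: bank 1 live

  always_ff @(posedge clk) begin
    if (we0) begin b0_r0[waddr0] <= wdata0; b0_r1[waddr0] <= wdata0; lvt[waddr0] <= 1'b0; end
    if (we1) begin b1_r0[waddr1] <= wdata1; b1_r1[waddr1] <= wdata1; lvt[waddr1] <= 1'b1; end
  end

  // Registered BRAM reads plus a registered bank select from the LVT;
  // no bypass from the write ports, as in the design described above.
  logic [31:0] q0_b0, q0_b1, q1_b0, q1_b1;
  logic        sel0, sel1;
  always_ff @(posedge clk) begin
    q0_b0 <= b0_r0[raddr0]; q0_b1 <= b1_r0[raddr0]; sel0 <= lvt[raddr0];
    q1_b0 <= b0_r1[raddr1]; q1_b1 <= b1_r1[raddr1]; sel1 <= lvt[raddr1];
  end
  assign rdata0 = sel0 ? q0_b1 : q0_b0;
  assign rdata1 = sel1 ? q1_b1 : q1_b0;
endmodule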


Figure 2: Vector register file organization

As the vector registers have multiple elements, the vector execution has to iterate over these 32-bit elements; we name the iteration count SEQLEN (MVL/(NLANE x 32)). For example, according to figure 2, lane 0 has to process elements e0, e4, e8, and e12 for a given instruction, thus requiring four iterations. The sequencer performs this iterative execution over the multiple lanes in parallel.
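Concretely:

\[
\text{SEQLEN} = \frac{\text{MVL}}{\text{NLANE} \times 32}, \qquad \text{e.g.,}\ \frac{512}{4 \times 32} = 4\ \text{iterations for the configuration of figure 2.}
\]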
The sequencer also checks for read-after-write (RAW), write-after-read (WAR), and write-after-write (WAW) data hazards before dispatching an instruction to the VRF for subsequent execution. It tracks the state of the 32 vector registers of the VRF using a state table to detect data hazards: the state of the destination register (vd) is updated by the sequencer, and during writeback the corresponding vd register state is restored by the writeback module. If a hazard is detected, the sequencer inserts a bubble into the subsequent pipeline stages.
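A simplified SystemVerilog sketch of this state table is shown below. It tracks only pending writes, which covers the RAW and WAW checks (tracking pending reads for WAR is omitted for brevity), and the module and signal names are illustrative:

// Illustrative per-register state table for hazard detection.
// A register is marked pending when an instruction targeting it is
// dispatched; the writeback stage restores (clears) the state.
module vreg_state_table (
    input  logic       clk, rst,
    input  logic       dispatch_en,   // sequencer dispatches an instruction
    input  logic [4:0] vs1, vs2, vd,  // source and destination indices
    input  logic       wb_en,         // writeback completes a register
    input  logic [4:0] wb_vd,
    output logic       hazard         // stall: insert a bubble
);
  logic [31:0] pending;               // one pending-write bit per register

  // RAW: a source register is still being written; WAW: vd is pending.
  assign hazard = pending[vs1] | pending[vs2] | pending[vd];

  always_ff @(posedge clk) begin
    if (rst) pending <= '0;
    else begin
      if (wb_en)                  pending[wb_vd] <= 1'b0;
      if (dispatch_en && !hazard) pending[vd]    <= 1'b1;
    end
  end
endmodule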
The pipeline diagram of our VPU is illustrated in figure 3. In our design, the execution stage takes two clock cycles, and the other stages take one clock cycle each. While the sequencer is performing its iterative execution, the decoder is stalled by the sequencer. Figure 3 shows the example for SEQLEN=2, so the sequencer has to iterate two times (SEQ 0 and SEQ 1) to execute an instruction. In this example, instruction K+2 has a read-after-write (RAW) data hazard with instruction K+1; as a result, the sequencer stalls for two clock cycles. From this figure, it can easily be verified that if SEQLEN is greater than or equal to 4, there is no stall due to data hazards, except for the reduction operation, which is explained in section 2.3.

2.3 Execution and Write back
Figure 4(a) shows the block diagram of the execution (EX) and writeback (WB) stages. The execution unit has a vector integer unit (VINT) and a vector load-store unit (VLSU). VINT executes the vector integer arithmetic instructions (e.g., vadd, vsub, vmin, vmax, vsll, vsrl) and the multiplication instructions (vmul, vmulh, etc.), while the VLSU performs the vector load-store instructions (vle, vse). Vector multiplication instructions are executed in the VMUL module, and the VALU module performs the vector integer arithmetic and logical instructions. The vector division instructions are not implemented, and the shift instructions are optional, as both are less frequently used in many applications.
VINT takes two operands from the VRF and takes two clock cycles to compute the result in the pipeline. It performs four operations for SEW=8-bit, two operations for SEW=16-bit, and one operation for SEW=32-bit. For variable-precision (8/16/32-bit) vector multiplication, we use an approach similar to the one presented in [12]; the block diagram of VMUL is shown in figure 4(b). In the first clock cycle, it calculates the partial products from the 16-bit input multiplicand and multiplier (for 8-bit multiplication, the inputs are signed/unsigned extended to 16-bit) and stores them in registers. In the second clock cycle, based on the multiplier precision, it performs a summation of the partial products and puts the calculated multiplication result into a register (R).
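The two-cycle behavior can be sketched as follows in SystemVerilog. This is a simplified behavioral illustration, not the actual design: it computes each precision's products in parallel instead of sharing the 16-bit partial-product multipliers as in [12], and it omits the signed/unsigned handling and the dot-product accumulator (ACC).

// Behavioral sketch of a two-cycle variable-precision lane multiplier.
// Stage 1 registers the candidate sub-element products; stage 2 packs
// the result by precision, giving 4/2/1 multiplies per 32-bit lane.
module vmul_lane (
    input  logic        clk,
    input  logic [1:0]  sew,          // 0: 8-bit, 1: 16-bit, 2: 32-bit
    input  logic [31:0] a, b,         // packed lane operands
    output logic [31:0] r             // packed low halves of the products
);
  logic [15:0] p8  [4];               // four  8x8  products
  logic [31:0] p16 [2];               // two  16x16 products
  logic [63:0] p32;                   // one  32x32 product
  logic [1:0]  sew_q;

  always_ff @(posedge clk) begin      // stage 1: multiply and register
    for (int i = 0; i < 4; i++) p8[i]  <= a[8*i  +: 8]  * b[8*i  +: 8];
    for (int i = 0; i < 2; i++) p16[i] <= a[16*i +: 16] * b[16*i +: 16];
    p32   <= a * b;
    sew_q <= sew;
  end

  always_ff @(posedge clk) begin      // stage 2: select and pack into R
    unique case (sew_q)
      2'd0:    r <= {p8[3][7:0], p8[2][7:0], p8[1][7:0], p8[0][7:0]};
      2'd1:    r <= {p16[1][15:0], p16[0][15:0]};
      default: r <= p32[31:0];
    endcase
  end
endmodule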
Since the VRF can produce only two operands, instructions that require three inputs, such as the multiply-accumulate instruction, require two operations, for example, a multiply operation followed by an add operation. Implementing a third read port (3R2W) for the VRF operand reads would use two more BRAMs compared to 2R2W; however, if the target application is multiply-accumulate intensive, one can configure the VRF as 3R2W without any architectural changes.
Since many accelerators center on multiplication and convolution functions, we have optimized the dot operation for these functions. We do not support the widening operations in general, but we fuse the widening multiplication (VWMUL) and widening reduction (VWREDSUM) instructions internally to support the dot operation. The VWMUL instruction calculates the multiplication results and stores them in the internal accumulator (ACC) shown in figure 4(b), which accumulates the summation over the elements of the destination vector register of its own lane. For example, if there are eight elements per vector register per lane (SEQLEN=8), the accumulator stores the summation of 8 vector element multiplication results. During the VWREDSUM instruction, the reduction tree module performs a summation of these accumulators from all the lanes.


Figure 3: Pipeline execution for two elements per vector register per lane

Figure 4: (a) Execution and Write back block diagram (b) VMUL block diagram (c) reduction tree

When the number of lanes increases, the addition of these accumulators over multiple lanes becomes a timing-critical path because of long wire delays and LUT chaining delays. To overcome this issue, we use the quad-tree reduction tree presented in figure 4(c). The accumulators of the lanes are the input nodes of this graph, and the intermediate nodes calculate the summation of 4 inputs from the previous layer of nodes and register the results. These node registers act as pipeline registers that break up the critical path of the reduction operation over multiple lanes.
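A minimal SystemVerilog sketch of this registered quad tree for 16 lanes is shown below (module and signal names are illustrative); each tree level registers its 4-input sums, giving the two-cycle latency discussed next.

// Illustrative registered quad-tree reduction over 16 lane accumulators.
// Level 1: four registered 4-input adders; level 2: one registered
// 4-input adder. The level registers pipeline the long inter-lane wires.
module reduction_tree16 #(
    parameter int W = 32
) (
    input  logic         clk,
    input  logic [W-1:0] acc [16],   // one accumulator per lane
    output logic [W-1:0] sum         // valid two cycles after acc
);
  logic [W-1:0] level1 [4];

  always_ff @(posedge clk) begin
    // Cycle 1: sum each group of four neighbouring lanes.
    for (int i = 0; i < 4; i++)
      level1[i] <= acc[4*i] + acc[4*i+1] + acc[4*i+2] + acc[4*i+3];
    // Cycle 2: sum the four intermediate node registers.
    sum <= level1[0] + level1[1] + level1[2] + level1[3];
  end
endmodule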
The aim of this reduction tree is to reduce the number of clock cycles required for the reduction sum instruction while achieving a higher operating frequency. For example, in a 16-lane VPU with SEQLEN=8, iterative execution would take eight clock cycles, while the reduction tree takes only two. Moreover, iterative execution would require 16 additions per clock cycle, which would slow down the maximum operating frequency; the proposed quad-tree reduction tree, in contrast, takes only two clock cycles for 16 lanes. Since the reduction instruction executes on all the vector elements at once, there is a RAW data hazard, and the sequencer inserts two bubbles after the multiply instruction before executing the reduction sum instruction.
For instance, if the MVL is 4,096-bit and SEW is 8-bit, there are 512 8-bit integer (INT8) elements in a vector register, and if NLANE is 16, there are 32 elements per vector register per lane. It takes 8 clock cycles to multiply 2 vector registers and produce the 16-bit results; during the multiplication, the products are also accumulated along each lane. A reduction sum instruction issued immediately after the multiplication instruction stalls for 2 clock cycles due to the RAW data hazard on the multiplication result's vector register. After the 2-cycle stall, the sequencer issues the reduction instruction, and the reduction sum operation takes 2 clock cycles to perform the summation in the quad tree over the 16 lanes. In total, 12 clock cycles are required to perform a dot/convolution operation on 512 INT8 elements and produce the result.
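This cycle count decomposes as

\[
\underbrace{\frac{512}{16 \times 4}}_{\text{VWMUL iterations}} + \underbrace{2}_{\text{RAW stall}} + \underbrace{2}_{\text{quad tree}} = 8 + 2 + 2 = 12,
\]

in clock cycles, since each of the 16 lanes performs four INT8 multiplies per iteration.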
The vector load-store unit (VLSU) has a memory address queue for the load and store operations, a write-data queue for the store operation, and a read-data queue for the load operation. The address and write-data queues are 16 entries deep, so with SEQLEN=8 they can queue two vector load or store instructions. Each queue location holds the address of one iteration of a load or store instruction from the sequencer. The read-data queue is only one entry deep, because load data from the TCM is written back to the VRF immediately. RVV defines unit-stride, constant-stride, and indexed addressing for the vector load and store instructions; our architecture implements unit-stride as the main feature, while the other strided load and store instructions are optional. The VLSU does not support the segmented vector load and store instructions.
There are tightly coupled data memories (TCM) connected to each lane. A TCM consists of dual-port BRAM and has a local and a global port, as presented in [8]. The TCMs are organized as word-interleaved (32-bit) memory among the lanes (lane 0's TCM holds bytes 0 to 3, lane 1's TCM holds bytes 4 to 7, and so on).


Each lane has dedicated read and write ports to its own TCM (the local port), while the other port of the TCM (the global port) is connected to the ring interconnect. If a lane accesses its own TCM, there is a fixed 2-clock-cycle latency; if it accesses another lane's TCM, the access goes through the ring bus.
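Under this interleaving, the target lane and the word offset within that lane's TCM follow directly from the byte address. A small illustrative sketch, assuming NLANE is a power of two (names are ours):

// Illustrative address split for word-interleaved TCMs.
// Bits [1:0] select the byte within a word, the next log2(NLANE) bits
// select the lane, and the remaining bits address the word in that TCM.
module tcm_addr_map #(
    parameter int NLANE = 4,
    parameter int LW    = $clog2(NLANE)
) (
    input  logic [31:0]    byte_addr,
    output logic [LW-1:0]  lane_id,    // which lane's TCM holds the word
    output logic [29-LW:0] word_addr   // word offset inside that TCM
);
  assign lane_id   = byte_addr[2 +: LW];
  assign word_addr = byte_addr[31 : 2+LW];
endmodule

For example, with NLANE=4, byte addresses 0-3 map to lane 0, 4-7 to lane 1, 8-11 to lane 2, 12-15 to lane 3, and address 16 wraps back to lane 0.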
Since the memory operation latency can vary with the memory location, VINT can execute in parallel while the VLSU queue is not empty, given that there is no data dependency between them. The write-back unit multiplexes the data from the VALU and VMUL and writes it back to one of the write ports of the VRF in that lane; the load data from the VLSU is written through the other write port of that VRF.

3 EVALUATION
We implemented our vector extension core in SystemVerilog and integrated it with RVCoreP-32IM [8], which is a 5-stage pipeline 32-bit RISC-V soft processor core with the RV32IM extension. We configured our vector processing unit's number of lanes (NLANE) as 1, 4, 8, and 16 for the evaluation.
We evaluate the maximum frequency and hardware resource utilization for both Xilinx Artix-7 and Zynq UltraScale+ FPGA devices. The Artix-7 is a relatively cost-optimized, smaller FPGA compared to the Zynq UltraScale+, and for edge computing, smaller and cost-effective FPGA devices like the Artix-7 are sometimes preferred. We use the Nexys A7 FPGA board, which carries the xc7a100tcsg324-1 device from the Artix-7 family, speed grade -1. Vivado 2021.1 is used for synthesis, placement, and routing with the default strategy to evaluate the operating frequency and hardware resource utilization. For this evaluation, we use 4KB of tightly coupled data RAM for each vector lane. The RISC-V core RVCoreP-32IM is connected to these lanes using a ring interconnect; it has its own instruction and data tightly coupled memories of 32KB for program execution and a UART as a peripheral.
The FPGA implementation results are shown in table 1. From the table, we can see that the look-up table (LUT) and register utilization do not increase linearly with NLANE, as the vector instruction decoder and sequencer are a fixed overhead for all lane configurations. Since the vector register file uses BRAM and VMUL uses DSPs, the BRAM and DSP utilizations scale directly with NLANE. In the FPGA implementation, two BRAMs are used as four 2KB RAMs for the 2R2W ports of the vector register file. In all configurations, RVCoreP-32IM uses around 1,500 LUTs.
The maximum operating frequency (FMAX) drops by only 5 MHz for the 16-lane configuration, while the other configurations reach 125 MHz. In our design, the critical path is the partial product sum generation through the vector multiplier units. For the same NLANE, increasing the MVL (that is, the number of elements per vector register per lane) increases neither the FPGA logic utilization nor the maximum operating frequency penalty.
We present a comparison of resource usage with other soft vector processors in table 2. Since those publications mostly used Xilinx Zynq UltraScale+ devices, we also synthesized for a Zynq UltraScale+ FPGA with a configuration of 8 lanes and MVL=1024. VPU1[11] has a dedicated unit for permutation, which could be a reason for its higher resource usage. From table 2, we can see that our proposed architecture achieves more than four times the operating frequency of VPU3 and more than three times that of the other VPUs. Considering the NLANE, our proposed VPU also uses significantly fewer FPGA resources than the other VPUs.
VPU2[2] and VPU3[1] have the same architecture with different configurations. VPU3[1] has logic utilization similar to the proposed architecture, though its operating frequency is much lower; compared to our 5-stage pipeline, VPU3 has a 2-stage pipeline, which could be the reason for this frequency difference. Vicuna[13] also has a 2-stage pipeline, though its execution can take multiple clock cycles, and its architecture takes much more logic resources than the proposed VPU, which might negatively impact Vicuna's operating frequency.
For the performance evaluation, we use convolution, because it is widely used in signal processing, image processing, and convolutional neural networks. We perform cycle-accurate RTL simulation of the convolution operation with kernel size 3x3x256 and input matrix size 16x16x256, producing an output matrix of size 14x14. The convolution filter and input data types are 8-bit (INT8), 16-bit (INT16), and 32-bit (INT32) integers, and the output is a 32-bit (INT32) integer. We use the vector load, widening multiplication, reduction sum, and move instructions to perform the convolution operation.

Figure 5: Giga operations per second (GOPS) for different VPU configurations

The cycle counts required to perform the convolution operation are shown in table 3. For a given number of lanes, the performance improves with a higher MVL because of the lower instruction overhead. Since our architecture supports 4 INT8, 2 INT16, and 1 INT32 operations per lane, the execution time also increases linearly from INT8 to INT16 to INT32. Table 3 further illustrates that adding more lanes significantly improves the performance, owing to more parallel operations.
We calculated the giga operations per second (GOPS) from the simulation results (using the FMAX for the Zynq UltraScale+ FPGA) and present it in figure 5. To calculate GOPS, we take the total number of operations by multiplying the number of vector instructions executed by the VPU with the number of elements processed by a single vector instruction. We then measure the execution time of the convolution operation by multiplying the clock period at FMAX for the Zynq UltraScale+ device with the cycle count given in table 3. Finally, we obtain GOPS as the total number of operations divided by the execution time.
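Written out, the metric computed above is

\[
\text{GOPS} = \frac{N_{\mathrm{instr}} \times N_{\mathrm{elem}}}{N_{\mathrm{cycles}} \times T_{\mathrm{clk}}} \times 10^{-9},
\]

where \(N_{\mathrm{instr}}\) is the number of vector instructions executed, \(N_{\mathrm{elem}}\) is the number of elements processed by a single vector instruction, \(N_{\mathrm{cycles}}\) is the cycle count from table 3, and \(T_{\mathrm{clk}} = 1/F_{\mathrm{MAX}}\) for the Zynq UltraScale+ device.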


Table 1: FPGA (Artix-7) resource utilization for the vector processing unit

NLANE            1       1       4       4       8       8      16      16
MVL            128     256     512    1024    1024    2048    2048    4096
LUT          1,236   1,209   3,750   3,741   7,119   7,187  13,978  13,852
Register       737     743   2,059   2,068   3,871   3,853   7,375   7,382
BRAM             2       2       8       8      16      16      32      32
DSP              5       5      20      20      40      40      80      80
FMAX (MHz)     125     125     125     125     125     125     120     120

Table 2: Comparison of FPGA resource utilization (for Zynq UltraScale+ device)

VPU          MVL, NLANE       LUT      Register   FMAX (MHz)
Vicuna[13]   2048, NLANE=32   80k      40k        80
VPU1[11]     512              136.5k   37.9k      75
VPU2[2]      256, NLANE=8     20.5k    9.5k       N/A
VPU3[1]      128, NLANE=2     7.3k     4.85k      62.5
Proposal     2048, NLANE=8    6.5k     3.8k       270

Table 3: Required number of clock cycles to perform the 3x3x256 convolution operation

VPU Configuration        INT8       INT16       INT32
NLANE 1, MVL 128      682,276   1,359,652   2,714,404
NLANE 1, MVL 256      513,716   1,021,748   2,037,812
NLANE 4, MVL 512      174,244     343,588     682,276
NLANE 4, MVL 1024     132,692     259,700     513,716
NLANE 8, MVL 1024      89,572     174,244     343,588
NLANE 8, MVL 2048      69,188     132,692     259,700
NLANE 16, MVL 2048     47,236      89,572     174,244
NLANE 16, MVL 4096     40,964      69,188     132,692

Table 4: Comparison of GOPS with other VPUs

VPU          NLANE   GOPS
Vicuna[13]      32   10
VEGAS[5]        32   15
Proposal        16   11.9

For INT8 with NLANE=16 and MVL=4096, we achieved the maximum of 11.9 GOPS. We compare the GOPS values with a few other VPUs in table 4, which shows that our 16-lane VPU achieves remarkable GOPS compared to the other VPUs. The key factors behind our VPU's performance are the substantially higher operating frequency due to pipelined execution, the reduced inter-lane communication owing to the distributed VRF, and the optimized reduction tree with its low latency.

4 CONCLUSION
In this paper, we presented a vector processing unit with a configurable MVL, allowing more elements to be processed with lower instruction overhead. We support a parameterized NLANE, which lets a longer vector execute in a shorter time using parallel lanes. The performance increases with the MVL for a given NLANE, whereas the FPGA resource utilization does not increase; the resource utilization also does not grow linearly with increasing NLANE. Our FPGA implementation demonstrated a 4x frequency improvement and the lowest FPGA resource utilization, and considering the resource utilization, our proposed VPU achieves higher performance than the other VPUs. Given those factors, our architecture provides a resource-efficient implementation for FPGA while maintaining a higher frequency. Even though the implemented subset of the vector extension is suitable for many embedded applications, some other applications, such as encryption and cryptography, require the vector permutation instructions, which are not supported in this architecture. We also have not implemented the widening instructions, to keep the VRF simple. In future work, we will implement support for the vector permutation instructions and study the requirements of the widening instructions. We will also investigate embedded benchmark programs using RISC-V vector instructions.

REFERENCES
[1] Muhammad Ali and Diana Göhringer. 2022. Application Specific Instruction-Set Processors for Machine Learning Applications. In 2022 International Conference on Field-Programmable Technology (ICFPT). IEEE, 1–4.
[2] Muhammad Ali, Matthias von Ameln, and Diana Goehringer. 2021. Vector processing unit: a RISC-V based SIMD co-processor for embedded processing. In 2021 24th Euromicro Conference on Digital System Design (DSD). IEEE, 30–34.
[3] Matheus Cavalcante, Fabian Schuiki, Florian Zaruba, Michael Schaffner, and Luca Benini. 2019. Ara: A 1-GHz+ scalable and energy-efficient RISC-V vector processor with multiprecision floating-point support in 22-nm FD-SOI. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 28, 2 (2019), 530–543.
[4] Matheus Cavalcante, Domenic Wüthrich, Matteo Perotti, Samuel Riedel, and Luca Benini. 2022. Spatz: A Compact Vector Processing Unit for High-Performance and Energy-Efficient Shared-L1 Clusters. In Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design. 1–9.
[5] Christopher H Chou, Aaron Severance, Alex D Brant, Zhiduo Liu, Saurabh Sant, and Guy GF Lemieux. 2011. VEGAS: Soft vector processor with scratchpad memory. In Proceedings of the 19th ACM/SIGDA international symposium on Field programmable gate arrays. 15–24.
[6] Intel Corporation. 2022. Intel® 64 and IA-32 Architectures Software Developer's Manual. Retrieved March 28, 2023 from https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html
[7] RISC-V International. 2022. Working draft of the proposed RISC-V V vector extension. Retrieved March 28, 2023 from https://github.com/riscv/riscv-v-spec
[8] Md Ashraful Islam and Kenji Kise. 2022. An Efficient Resource Shared RISC-V Multicore Architecture. IEICE TRANSACTIONS on Information and Systems 105, 9 (2022), 1506–1515.


[9] Charles Eric LaForest and J Gregory Steffan. 2010. Efficient multi-ported memories for FPGAs. In Proceedings of the 18th annual ACM/SIGDA international symposium on Field programmable gate arrays. 41–50.
[10] ARM Limited. 2009. Introducing NEON Development Article. Retrieved March 28, 2023 from https://developer.arm.com/documentation/dht0002/a/?lang=en
[11] Vincenzo Maisto and Alessandro Cilardo. 2022. A pluggable vector unit for RISC-V vector extension. In 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 1143–1148.
[12] Stefania Perri, Pasquale Corsonello, Maria Antonia Iachino, Marco Lanuzza, and Giuseppe Cocorullo. 2004. Variable precision arithmetic circuits for FPGA-based multimedia processors. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 12, 9 (2004), 995–999.
[13] Michael Platzer and Peter Puschner. 2021. Vicuna: a timing-predictable RISC-V vector coprocessor for scalable parallel computation. In 33rd Euromicro Conference on Real-Time Systems (ECRTS 2021). Schloss Dagstuhl-Leibniz-Zentrum für Informatik.
[14] Aaron Severance and Guy Lemieux. 2012. VENICE: A compact vector processor for FPGA applications. In 2012 International Conference on Field-Programmable Technology. IEEE, 261–268.
[15] Aaron Severance and Guy GF Lemieux. 2013. Embedded supercomputing in FPGAs with the VectorBlox MXP matrix processor. In 2013 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS). IEEE, 1–10.
[16] Nigel Stephens, Stuart Biles, Matthias Boettcher, Jacob Eapen, Mbou Eyole, Giacomo Gabrielli, Matt Horsnell, Grigorios Magklis, Alejandro Martinez, Nathanael Premillieu, et al. 2017. The ARM scalable vector extension. IEEE Micro 37, 2 (2017), 26–39.
[17] Peter Yiannacouras, J Gregory Steffan, and Jonathan Rose. 2008. VESPA: portable, scalable, and flexible FPGA-based vector processors. In Proceedings of the 2008 international conference on Compilers, architectures and synthesis for embedded systems. 61–70.
[18] Jason Yu, Christopher Eagleston, Christopher Han-Yu Chou, Maxime Perreault, and Guy Lemieux. 2009. Vector processing as a soft processor accelerator. ACM Transactions on Reconfigurable Technology and Systems (TRETS) 2, 2 (2009), 1–34.
