FPGA-Based Accelerators of Deep Learning Networks for Learning and Classification
ABSTRACT Due to recent advances in digital technologies and the availability of credible data, an area of artificial intelligence, deep learning, has emerged and has demonstrated its ability and effectiveness in solving complex learning problems not possible before. In particular, convolutional neural networks (CNNs) have demonstrated their effectiveness in image detection and recognition applications. However, they are computationally intensive and require considerable memory bandwidth, which makes general-purpose CPUs fail to achieve the desired performance levels. Consequently, hardware accelerators that use application-specific integrated circuits, field-programmable gate arrays (FPGAs), and graphic processing units have been employed to improve the throughput of CNNs. More precisely, FPGAs have recently been adopted for accelerating the implementation of deep learning networks due to their ability to maximize parallelism and their energy efficiency. In this paper, we review recent techniques for accelerating deep learning networks on FPGAs. We highlight the key features employed by the various techniques for improving acceleration performance. In addition, we provide recommendations for enhancing the utilization of FPGAs for CNN acceleration. The techniques investigated in this paper represent the recent trends in FPGA-based accelerators of deep learning networks. Thus, this paper is expected to direct the future advances on efficient hardware accelerators and to be useful for deep learning researchers.
INDEX TERMS Adaptable architectures, convolutional neural networks (CNNs), deep learning,
dynamic reconfiguration, energy-efficient architecture, field programmable gate arrays (FPGAs), hardware
accelerator, machine learning, neural networks, optimization, parallel computer architecture, reconfigurable
computing.
produces higher level features, for example, semi-circles and squares [11]. The next layer assembles the output of the previous layer to parts of familiar objects, and a subsequent layer detects the objects. As we go through more layers, the network yields an activation map that represents more and more complex features. The deeper you go into the network, the filters begin to be more responsive to a larger region of the pixel space. Higher level layers amplify aspects of the received inputs that are important for discrimination and suppress irrelevant variations.

CNNs have achieved even better accuracy in classification and various computer vision tasks. The classification accuracy in ILSVRC improved to 88.8% [48], 93.3% [31], and 96.4% [49] in the 2013, 2014, and 2015 competitions, respectively. Fig. 1 shows the accuracy loss for the winners of ImageNet competitions before and after the emergence of deep learning algorithms.
In practice, CNNs are trained off-line using the back-propagation process [54]. Then, the off-line trained CNNs are used to perform recognition tasks using the feed-forward process [55]. Therefore, the speed of the feed-forward process is what matters.

GPUs are the most widely used hardware accelerators for improving both training and classification processes in CNNs [56]. This is due to their high memory bandwidth and throughput as they are highly efficient in floating-point matrix-based operations [57]–[59]. However, GPU accelerators consume a large amount of power. Therefore, their use in CNN-based applications implemented as a cloud service on large servers or in battery operated devices becomes a challenge. Furthermore, GPUs gain their performance from their ability to process a large image batch in parallel. For some applications like a video stream, input images should be processed frame by frame as the latency of the result of each frame is critical to the application's performance. For some tracking algorithms, the result of one frame affects the processing of the next frame [60]. Nurvitadhi et al. [61] recently evaluated emerging DNN algorithms on the latest generations of GPUs (i.e., NVIDIA Titan X Pascal) and FPGAs (i.e., Intel Arria 10 GX 1150 and Intel Stratix 10 2800). The experimental results show that current trends in deep neural networks favor FPGA platforms as they offer higher power efficiency (a.k.a., performance per Watt).

FPGA and ASIC hardware accelerators have relatively limited memory, I/O bandwidths, and computing resources compared with GPU-based accelerators. However, they can achieve at least moderate performance with lower power consumption [62]. The throughput of an ASIC design can be improved by customizing the memory hierarchy and assigning dedicated resources [63]. However, the development cycle, cost, and flexibility are not satisfactory in ASIC-based acceleration of deep learning networks [64], [65]. As an alternative, FPGA-based accelerators are currently in use to provide high throughput at a reasonable price with low power consumption and reconfigurability [66], [67]. The availability of high-level synthesis (HLS) tools, using C or C++, from FPGA vendors lowers the programming hurdle and shortens the development time of FPGA-based hardware accelerators [68]–[70].

Convolutional neural networks have a very useful property, that is, each feature map neuron shares its weights with all other neurons [71]. Hameed et al. [72] and Keckler et al. [73] proved that the highest energy expense results from accessing the off-chip DRAM memory for data movement rather than computation. In other words, the energy cost of the increased memory accesses and data movement due to the large number of CNN operations often exceeds the energy cost of computation [64], [74]. Thus, CNN accelerators need to carefully consider this to achieve an efficient architecture in terms of time and power.

In this paper, we review the current status of using FPGAs as accelerators for implementing deep learning networks. We highlight the implementation challenges and design directions used to tackle those challenges. We also provide future recommendations to maximize the performance of FPGAs as accelerators for deep learning networks and simplify their use.

The remainder of the paper is organized as follows. Section II provides background information about CNNs, their key operations, and some well-known deep learning networks. In addition, it introduces the basic structure of FPGAs and highlights their features enabling them to accelerate computationally intensive applications. It also discusses the implementation challenges of deep learning networks on FPGAs and how these challenges can be overcome. Section III reviews existing CNN compression techniques and presents the current status of accelerating deep learning networks using ASIC-based and FPGA-based accelerators. Section IV describes the use of metaheuristics in the design and optimization of CNN implementations. Section V summarizes existing design approaches for accelerating deep learning networks and provides recommendations for future directions that will simplify the use of FPGA-based accelerators and enhance their performance. Finally, Section VI concludes the paper.

II. BACKGROUND AND TERMINOLOGY
This section gives an overview of the key operations and terminology used in convolutional neural networks (CNNs) and provides examples of well-known deep learning networks. In addition, it illustrates the basic structure of field programmable gate arrays (FPGAs) and how deep learning methods can benefit from the capabilities of FPGAs. The last subsection highlights the challenges of implementing deep learning networks on FPGAs.

A. CONVOLUTIONAL NEURAL NETWORKS (CNNs)
In this subsection, we describe the key operations and terminology involved in the construction of CNNs including convolution, activation functions, normalization, pooling, and the characteristics of fully connected layers.

1) CONVOLUTION (CONV)
A convolution operation can be thought of as the production of a matrix smaller in size than the original image matrix, representing pixels, by sliding a small window (called a filter, feature identifier, or kernel) of size k × k over the image (called the input feature map (FM)), to produce an output feature neuron value [75]. The filter is an array of numbers called weights or parameters. These weights are computed during the training phase. As the filter slides over the feature map, it multiplies the values in the filter with the original pixel values, that is, it first performs element-wise multiplication, and then sums the products, to produce a single number. The inputs and outputs of the CONV layer are a series of FM arrays.

This operation, starting from the top left corner of the FM, is repeated by moving the window S strides at a time, first in the right direction, until the end of the FM is reached, and then
proceeding downwards until the FM is completely scanned and all the elements of the FM are covered. The sliding of the filter window and performing the operation is known by the verb convolving, hence the noun convolution [11], [76]. Normally, the size of the kernel is very small, less than or equal to 11 × 11. Each output-input FM pair has a set of weights equal to the kernel size and each output FM is computed based on the sum of the convolution operations performed on all input FMs. Note that different CONV layers in the same CNN model vary considerably in their sizes.

In summary, the convolution operation comprises four levels of loops: the output FMs loop (Loop-4), the loop across the input FMs (Loop-3), the loop along the dimensions of a single input FM (scan operation, Loop-2), and the kernel window size loop (multiply-and-accumulate (MAC) operation, Loop-1). CONV layers are dominant in CNN algorithms since they often constitute more than 90% of the total CNN operations [28], [29], [49], [74], [77], [78]. Therefore, many attempts have been made to speed up CONV operations using the loop unrolling technique [55], [79], as will be discussed later. Loop unrolling maximizes the parallelism of CONV MAC computations, which requires a special consideration of the processing elements (PEs) and register array architecture. Fig. 2 illustrates the unrolling of the CONV loop levels.

FIGURE 2. CONV Loops Unrolling [83]: (a) Unrolling Loop-1; (b) Unrolling Loop-2; (c) Unrolling Loop-3; (d) Unrolling Loop-4, where Pkx, Pky, Pix, Piy, Pif, and Pof are loop unrolling design variables for the kernel window width, kernel window height, input FM width, input FM height, number of input FMs, and the number of output FMs, respectively.
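To make the four loop levels concrete, the following Python sketch (our own illustration; the array names and shapes are assumptions, not taken from any surveyed accelerator) computes one CONV layer exactly as described above, with Loop-1 as the innermost MAC over the k × k kernel window and Loop-4 as the outermost scan over output FMs.

import numpy as np

def conv_layer(in_fms, weights, stride=1):
    """Naive CONV layer. in_fms: (Nif, H, W); weights: (Nof, Nif, K, K)."""
    Nof, Nif, K, _ = weights.shape
    _, H, W = in_fms.shape
    out_h = (H - K) // stride + 1
    out_w = (W - K) // stride + 1
    out_fms = np.zeros((Nof, out_h, out_w))
    for of in range(Nof):                      # Loop-4: across output FMs
        for i in range(Nif):                   # Loop-3: across input FMs
            for y in range(out_h):             # Loop-2: scan one input FM
                for x in range(out_w):
                    acc = 0.0
                    for ky in range(K):        # Loop-1: K x K MAC window
                        for kx in range(K):
                            acc += weights[of, i, ky, kx] * \
                                   in_fms[i, y * stride + ky, x * stride + kx]
                    out_fms[of, y, x] += acc   # accumulate over all input FMs
    return out_fms

Loop unrolling replicates hardware for some of these loop iterations (e.g., computing all K × K products of Loop-1 in a single cycle), which is what the design variables Pkx, Pky, Pix, Piy, Pif, and Pof in Fig. 2 control.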
2) ACTIVATION FUNCTIONS (AFs)
An activation function in neural networks is similar to the action potential in animal cells such as neurons. A neuron is said to fire if it emits an action potential. A popularly used activation function is the sigmoid function which can be expressed as

f(x) = 1 / (1 + e^(-x))    (1)

where x represents the weighted sum of the neuron inputs and if it is a sufficiently large positive number, the sigmoid function approximates to unity. For sufficiently large negative values of x, the sigmoid function is close to 0. Another popular activation function is

f(x) = tanh(x)    (2)

The above standard sigmoid and tanh non-linear functions require long training time [28]. A recently proposed and commonly used AF in CNNs is the rectified linear unit (ReLU) which is defined as

f(x) = max(x, 0)    (3)

The ReLU activation function is known to converge faster in training, and has lower computational complexity [80], [81] than the standard sigmoid and tanh functions. In addition, it does not require input normalization to prevent it from saturating [28], [80], [82].
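A minimal numerical sketch of the three activation functions above (our own NumPy illustration of Eqs. (1)–(3), not code from any surveyed design):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # Eq. (1): saturates at 0 and 1

def tanh_af(x):
    return np.tanh(x)                  # Eq. (2): saturates at -1 and 1

def relu(x):
    return np.maximum(x, 0.0)          # Eq. (3): one comparison, no saturation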
3) NORMALIZATION
In real life, a phenomenon called 'lateral inhibition' appears, which refers to the capacity of an excited neuron to subdue its neighbors, thereby creating a contrast in that area. In CNNs, to accomplish this, local response normalization (LRN), or simply normalization, is used, particularly when dealing with ReLU neurons, because they have unbounded activation that needs normalization. It detects high frequency features with a large response. If we normalize around the local neighborhood of the excited neuron, it becomes even more sensitive as compared to its neighbors. At the same time, it will dampen the responses that are uniformly large in any given local neighborhood. If all the values are large, then normalizing those values will diminish all of them. So, basically it performs some kind of inhibition and boosts the neurons with relatively larger activations.

Normalization can be done within the same feature or across neighboring features by a factor that depends on the neighboring neurons. Expressions to compute the response normalized activity can be found in [28] and [80].

4) POOLING
Pooling, also known as subsampling, is employed to progressively reduce the spatial size of the representation, thereby reducing the amount of parameters and computation in the network. Pooling layers are periodically inserted in between successive convolutional layers. They operate independently on every depth slice of the input and resize it spatially using the MAX operation. The most common form is a pooling layer with filters of size 2 × 2, where the MAX operation takes the maximum over 4 samples, thereby discarding 75 percent of the activations [84]. In addition to the popular MAX pooling, the pooling units in some CNNs are also used to perform other functions, such as AVG and MIN operations [80].
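For illustration, a minimal 2 × 2 MAX pooling over one feature map might look as follows (a sketch under our own naming assumptions, not code from any surveyed design):

import numpy as np

def max_pool_2x2(fm):
    """2x2 MAX pooling with stride 2 over one (H, W) feature map."""
    H, W = fm.shape
    out = np.zeros((H // 2, W // 2))
    for y in range(0, (H // 2) * 2, 2):
        for x in range(0, (W // 2) * 2, 2):
            # keep the largest of the 4 samples (75% of activations discarded)
            out[y // 2, x // 2] = fm[y:y + 2, x:x + 2].max()
    return out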
5) FULLY CONNECTED LAYER (FC)
A common form of a convolutional neural network architecture comprises stacks of a few convolutional and ReLU layers, followed by layers for pooling, and this pattern is
repeated until the image has merged spatially to a small size. This is followed by one or more fully connected layers, also known as inner-product layers, whose neurons have full connections to all activations in the previous layer, hence the name. The last fully connected layer is the classification layer and it holds the output such as the class scores [80].

B. EXAMPLES OF DEEP LEARNING NETWORKS
We list in this subsection some of the well-known deep learning networks.
• AlexNet (2012) is a convolutional neural network consisting of 5 convolutional layers, interspersed by 2 normalization layers, as well as 3 fully connected layers [28]. Each convolutional layer performs the activation function using ReLU. In addition, 3 pooling layers are employed with the first, second, and last convolutional layers. The architecture of the AlexNet CNN is shown in Fig. 3. AlexNet won the 2012 ImageNet challenge by classifying 224 × 224 input color images to 1,000 different output classes.
• VGG (2014) is a convolutional neural network model similar to AlexNet in terms of the number of fully connected layers. However, it consists of 5 groups of convolutional layers [29], [81]. The exact number of CONV layers in each group depends on the version of the VGG, visual geometry group, model. Table 1 shows the number of CONV and FC layers for the most commonly used VGG models.

TABLE 1. CNN layers for VGG models.

• ResNets (2016) are deep residual networks with extremely irregular and complex structures compared to AlexNet and VGG CNN models [49], [85], [86]. This is due to having more types of layers, where non-adjacent layers incorporate shortcuts to compute the residual functions, as well as having highly deep structures, that is, between 50 and 1000 CONV layers. Unlike AlexNet and VGG models where the layers are connected in sequence, the interconnections in ResNet layers are in the form of a directed acyclic graph (DAG). ResNet-50 and ResNet-152 are widely used, especially for image classification. The ResNet-50/152 structure contains 53/155 CONV layers (most of them are followed by batch normalization (BatchNorm), scale, and ReLU layers), 1/1 MAX pooling, 1/1 average pooling, 1/1 FC, and 16/50 element-wise (Eltwise) layers, respectively.

C. FIELD PROGRAMMABLE GATE ARRAYS (FPGAs)
FPGAs are off-the-shelf programmable devices that provide a flexible platform for implementing custom hardware functionality at a low development cost. They consist mainly of a set of programmable logic cells, called configurable logic blocks (CLBs), a programmable interconnection network, and a set of programmable input and output cells around the device [87]. In addition, they have a rich set of embedded components such as digital signal processing (DSP) blocks, which are used to perform arithmetic-intensive operations such as multiply-and-accumulate, block RAMs (BRAMs), look-up tables (LUTs), flip-flops (FFs), clock management units, high speed I/O links, and others. Fig. 4 shows a basic structure of an FPGA.

FPGAs are widely considered as accelerators for computationally-intensive applications as they enable models with highly flexible fine-grained parallelism and associative operations such as broadcast and collective response [88]. In [89] and [90], FPGA computing models used for application acceleration are presented, including data streaming, associative computing, highly parallel memory access, use of standard hardware structures such as first in first out (FIFO) buffers, stacks and priority queues, and functional parallelism.

FPGAs have the advantage of maximizing performance per Watt of power consumption, reducing costs for large scale operations [91]. This makes them an excellent choice as accelerators for battery operated devices and in cloud
and field programmable gate array (FPGA) implementations. In general, hardware accelerators focus on designing specific modules and architectures that ensure data reuse, enhance data locality, and accelerate convolutional (CONV) layer operations based on performing the needed operations in parallel.

A. CNNs COMPRESSION
In this subsection, we review techniques that target the compression of CNNs, which results in significantly reducing their implementation complexity with minimal impact on accuracy.

Denton et al. [102] proposed a technique to reduce the memory footprint for the network weights in object recognition systems. They used singular value decomposition (SVD) [101] and filter clustering methods for this purpose. The results for the convolutional model of 15 layers in [48] show that the proposed technique speeds up the operations in convolutional layers by a factor of 2, compared to a CPU Eigen3-based library implementation [103]. In addition, it successfully achieved 13× memory footprint reduction for the fully connected layers while preserving the recognition accuracy within 99%.
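As a rough illustration of how SVD-based compression of a fully connected layer works (a sketch with our own rank choice and function names; not the exact procedure of [102]), the weight matrix is replaced by two thin factors, trading a small approximation error for a large reduction in stored parameters:

import numpy as np

def compress_fc_svd(W, rank):
    """Approximate an FC weight matrix W (m x n) with two rank-r factors."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]          # m x r
    B = Vt[:rank, :]                    # r x n
    return A, B                         # store r*(m+n) values instead of m*n

def fc_forward(x, A, B, bias):
    # y = W x is approximated by A (B x): two small products replace one large one
    return A @ (B @ x) + bias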
In another work, Han et al. [104] employed network pruning techniques [105]–[107] to reduce the over-fitting and complexity of neural network models. Their results demonstrated that pruning redundant connections as well as less influential connections achieved 9× and 13× compression for the AlexNet and VGG-16 models, respectively, while achieving zero accuracy loss for both.

In a subsequent work, Han et al. [108] proposed a deep compression technique for further reduction of the storage requirements of CNNs through the enforcement of weight sharing. Deep compression basically consists of pruning, trained weights quantization, and Huffman coding pipeline stages. The experimental results show that the proposed compression technique successfully reduced the storage requirements of the AlexNet and VGG-16 CNN models by 35× and 49×, respectively, without affecting their accuracy. This also improved the power efficiency (a.k.a., performance per Watt) by 3× to 7×.
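A toy sketch of the first two deep-compression stages, magnitude pruning followed by weight sharing through a small codebook (our own simplified illustration with a uniform codebook rather than the authors' clustering, and with the Huffman coding stage omitted):

import numpy as np

def prune_by_magnitude(W, threshold):
    """Zero out connections whose absolute weight is below the threshold."""
    mask = np.abs(W) >= threshold
    return W * mask, mask

def share_weights(W, n_clusters=16):
    """Quantize surviving weights to a small codebook (weight sharing)."""
    nz = W[W != 0]
    # simple codebook: uniformly spaced centroids between min and max weight
    codebook = np.linspace(nz.min(), nz.max(), n_clusters)
    idx = np.abs(W[..., None] - codebook).argmin(axis=-1)
    W_shared = np.where(W != 0, codebook[idx], 0.0)
    return W_shared, codebook, idx      # store small indices plus the codebook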
B. ASIC-BASED ACCELERATORS
In this subsection, we present some recent work in the area of hardware-based accelerators (ASICs).

An ASIC-based hardware accelerator referred to as DianNao [109] was designed for large-scale convolutional neural networks and deep neural networks. DianNao accelerates neural networks by minimizing memory transfers, which opened a new paradigm for hardware accelerators. Since the weights are repeatedly used in the computations of convolution layers, frequent memory accesses can significantly degrade the overall performance. Therefore, the authors exploited the locality properties of neural network layers to design custom storage structures that take advantage of these properties. In addition, they employed dedicated buffers and tiling techniques to reduce the overall external memory traffic through increasing data locality.

Chen et al. [109] also observed that using a short fixed-point representation of feature maps (FMs) and weights can also significantly reduce computation resources and memory footprint. They found that the area and power of a 32-bit multiplier can be reduced by a factor of 0.164× and 0.136×, respectively, using 16-bit multipliers. Consequently, DianNao has been implemented using 65nm fabrication technology with 16-bit fixed-point arithmetic units, 6 bits of which are used for the integer part and the remaining 10 for the fractional part. The experimental results demonstrated that DianNao has an average performance of 452 GOPS with a power consumption of 485 mW. The results depicted that using 16-bit arithmetic units instead of 32-bit ones introduced only 0.26% accuracy loss on the MNIST dataset [110]. On the other hand, the scalability and efficiency of the DianNao accelerator are severely limited by the bandwidth constraints of the memory system.
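To illustrate the 16-bit fixed-point format mentioned above (6 integer bits, 10 fractional bits), the following sketch quantizes a floating-point tensor to that format; this is a generic Q6.10 illustration on our part, not DianNao's actual conversion logic:

import numpy as np

def to_fixed_q6_10(x):
    """Quantize to signed 16-bit fixed point: 6 integer bits, 10 fractional bits."""
    scale = 1 << 10                                  # 2^10 fractional resolution
    q = np.round(x * scale)
    q = np.clip(q, -(1 << 15), (1 << 15) - 1)        # saturate to the 16-bit range
    return q.astype(np.int16)

def from_fixed_q6_10(q):
    return q.astype(np.float32) / (1 << 10)          # rounding error <= 2^-11 per value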
In a related research work, Chen et al. [111] and Luo et al. [112] proposed the DaDianNao multi-chip supercomputer, which offers sufficient memory capacity suitable for on-chip storage of all weights in CNNs. This system is mainly important for today's large-scale deployments of sophisticated industry and consumer services. DaDianNao uses 16-bit fixed-point numbers in the inference process like DianNao, but it is implemented using 28nm technology. The results show that DaDianNao outperforms the performance of a single GPU architecture by up to 656.63× and reduces the average energy consumption by 184.05× with only 0.01% accuracy error rate on the MNIST dataset for a 64-chip system.

Another member of the DianNao family, called PuDianNao [113], has been designed using a TSMC 65nm process to support multiple techniques and scenarios of machine learning (ML). PuDianNao accelerates different ML techniques through extracting their critical locality properties and computational primitives with the use of on-chip storage as well as 7 novel functional units. Experimental results show that PuDianNao is 1.20× faster and 128.41× more energy-efficient, respectively, than the NVIDIA K20M GPU architecture. However, both the DaDianNao [111] and PuDianNao architectures have not been optimized to be used for embedded applications.

To improve the scalability and energy efficiency of the DianNao design discussed in [109], the ShiDianNao accelerator was proposed [114]. ShiDianNao is designed especially for real-time object recognition applications such as self-driving cars, smartphones, and security, using 65nm CMOS technology. The proposed accelerator directly connects with a CMOS/CCD sensor in the image processing chip. In addition, all the weights of the CNN layers are stored in SRAM on-chip memory, as the target here is small CNN models. ShiDianNao is embedded inside the processing chip to eliminate off-chip DRAM memory accesses and minimize data movements between the SRAM holding the CNN model and the individual processing elements from the sensor. ShiDianNao has a
implementing full-fledged multipliers. Fortunately, recent digital signal processing (DSP)-oriented FPGAs include large numbers of multiply-and-accumulate (MAC) units which allow for extremely fast and low power CNN implementations.

Thereafter, FPGA implementations of deep learning networks have mainly focused on accelerating the computational engine through optimizing CONV layer operations. Several studies in the literature [120]–[126] have reported FPGA-based implementations of the convolution operation.

Farabet et al. [127] presented an FPGA implementation of CNN that uses one dedicated hardware convolver and a soft-processor for data processing and controlling, respectively. The proposed implementation is referred to as the convolutional network processor (CNP). CNP exploits the parallelism of CONV layers to accelerate the computational engine of CNNs while fully utilizing the large number of DSPs, the MAC hardware units on FPGA. The proposed architecture consists of a Virtex4 SX35 FPGA platform and external memory. The authors designed a dedicated hardware interface with the external memory to allow 8 simultaneous read/write accesses transparently. In addition, they used first in first out (FIFO) buffers between the FPGA and the external memory chip in both directions to guarantee the steadiness of dataflow.

The vector arithmetic and logic unit in CNP implements 2D CONV, pooling, and non-linear activation function operations of convolutional networks. The implementation of 2D CONV with a kernel of size 3 (i.e., K = 3) is shown in Fig. 6, where x is the data from the input feature map (FM), y is the partial result to be combined with the current result, z is the result to the output FM, Wij is the weight value in the convolution kernel, and W is the width of the input image. It can be seen that the proposed convolutional module accomplishes K² MAC operations simultaneously in each clock cycle.

FIGURE 6. 2D Convolution Module of 3 × 3 Kernel [127].
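The behavior of this module can be sketched functionally as follows (our own model of the idea in Fig. 6, not the CNP RTL): for each output position it consumes a 3 × 3 window of x and a partial sum y, performs the nine multiplications in parallel (one clock cycle in hardware), and emits z.

import numpy as np

def cnp_convolve_row(x_rows, y_row, W3x3):
    """Functional model of a 3x3 convolver.
    x_rows: 3 consecutive input-FM rows, shape (3, W)
    y_row : partial results for this output row, shape (W - 2,)
    Each output element needs K*K = 9 MACs, done in one cycle in hardware."""
    width = x_rows.shape[1]
    z_row = np.empty(width - 2)
    for j in range(width - 2):
        window = x_rows[:, j:j + 3]
        z_row[j] = y_row[j] + np.sum(W3x3 * window)   # 9 products + adder tree
    return z_row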
CNP represents FMs and weights using a 16-bit (Q8.8) fixed-point format. The proposed accelerator has been implemented for a face detection system with the LeNet-5 architecture [128]. It utilized 90% and 28% of the general logic and multipliers, respectively. In addition, CNP consumed less than 15 Watts of power.

Sankaradas et al. [129] proposed a massively parallel coprocessor to accelerate CNNs using a Virtex5 LX330T FPGA platform. The proposed accelerator mainly focused on optimizing the computation engine by employing the parallelism within the convolution kernel and FMs. The coprocessor can be considered as parallel clusters of vector processing elements (VPEs) where each cluster is designed using 2D convolvers, adders, sub-samplers, and look-up tables. Each VPE consists of multiplier-accumulator and programmable register units to hold kernel weights and FM data. To hold the massive intermediate data of CNNs, the authors employed a dedicated off-chip memory (4 DDR2 memory banks) with a large bandwidth on the coprocessor card. Moreover, the proposed accelerator uses a low precision data representation feature with memory packing to further improve the memory bandwidth as well as the throughput. 20-bit and 16-bit fixed-point representations were utilized for kernel weights and FMs, respectively.

The authors examined their architecture on a CNN with 4 CONV layers and without any fully connected (FC) layer for a face recognition application. The results show that the proposed coprocessor is 6× faster than a software implementation on a 2.2 GHz AMD Opteron processor with less than 11 Watts of power dissipation. However, the proposed accelerator cannot be used to accelerate full CNNs as it uses few CONV layers without any FC layer. A full CNN model consists of both CONV layers and FC layers. Thus, an efficient CNN accelerator for real-life applications needs to consider both. Similar approaches to the work of Sankaradas et al. [129] are presented in [130] and [131] to accelerate support vector machines (SVMs).

MAPLE [132] is a programmable FPGA prototype system presented to accelerate both learning and classification tasks in applications with unstructured large amounts of data. The authors analyzed five workload domains to help in designing MAPLE. These workloads are SVM [133], supervised semantic indexing (SSI) [134], K-means [135], generalized learning vector quantization (GLVQ) [136], and CNNs [71]. They found that their computations can be structured as parallel streams of vector or matrix operations. Thus, they architected MAPLE as a 2D grid of vector processing elements as shown in Fig. 7. To efficiently perform matrix multiplication, they allocate a private local storage to each PE which is used to store a column, or part of it, from the multiplier matrix. In this way, matrix multiplication is accomplished by streaming the multiplicand matrix rows through the PEs where each PE performs a MAC operation. The PEs are organized in clusters, where each group is served by a separate memory bank of the banked off-chip memories, which creates independent streams for processor-memory computation. Moreover, MAPLE uses on-chip smart memory blocks to process the large intermediate data on-the-fly using
FIGURE 7. MAPLE Processing Core Architecture [132].

cycle as described in [127] with the use of Q8.8 coding to represent FMs and weights. The proposed system also uses a multi-port direct memory access (DMA) streaming engine to allow individual streams of data to operate seamlessly within processing blocks. The results show that the proposed stream processor system can run small CNNs at up to 30 fps while consuming about 15 Watts.

An improved version of the CNP architectures given in [127] and [142] was presented in [143] and referred to as neuFlow. Particularly, neuFlow has replaced the 2D grid of ALUs with a 2D grid of processing tiles (PTs). The proposed architecture contains a 2D grid of PTs, a control unit, and a smart DMA module, as shown in Fig. 10. Each PT consists of local operators and a routing multiplexer (MUX). The top three PTs have been implemented to perform MAC operations. Thus, they can be used to perform 2D convolution, simple dot-products, and spatial pooling. General-purpose operations, such as dividing and squaring, have been implemented at the middle three PTs. Therefore, the middle row of neuFlow can be used for normalization. Finally, neuFlow's bottom PT row implements non-linear operations. Moreover, each operator employed input and output FIFOs to stall its pipeline when required. On the other hand, a PT's MUX is used to connect its local operators with the neighboring PT's streaming operators and off-chip memory instead of the local routers and global router used in [142].

FIGURE 10. The Architecture of neuFlow [143].

neuFlow uses a dataflow compiler, named luaFlow, to translate a high-level flow-graph representation of CNNs in Torch5 [144] into HDL scripts with different levels of parallelism. In addition, luaFlow produces a binary code configuration file and holds it in the embedded control unit. Thereafter, the control unit configures the 2D grid of PTs (connections and streaming operators) and the DMA ports through run-time configuration buses. A smart memory module has been designed to support multiple asynchronous accesses of off-chip memory through its reconfigurable ports. By targeting the larger Xilinx Virtex6 VLX240T FPGA, neuFlow achieved 147 GOPS at 10 Watts for the street scene parsing CNN in [145] with the use of 16 bits to represent FMs and weights.

FIGURE 9. The Architecture of DC-CNN [100].

Peemen et al. [146] utilized the flexible off-chip memory hierarchy method to design a configurable memory-centric accelerator template for a variety of CNN models. This accelerator exploits data reuse in complex access patterns to reduce off-chip memory communication, which minimizes the bandwidth requirements. The memory-centric accelerator maximizes the efficiency of on-chip memories for better data locality using loop transformation (to optimize the tiling parameters) and block RAM (BRAM)-based multi-bank on-chip buffers [147]. At the same time, it minimizes the size of FPGA on-chip memories to optimize energy and area usage, which are key requirements for embedded platforms.

The memory-centric accelerator uses a SIMD cluster of MAC PEs with flexible reuse buffers to accelerate the CONV layer. The acceleration template has been implemented on Virtex6 FPGAs. In addition, a MicroBlaze processor has been utilized to configure and communicate with the accelerator via a FIFO-based fast simplex link (FSL). The proposed accelerator has been analyzed for a CNN vision task of size 2.74 GMAC and the results show that the memory-centric accelerator is 11× faster than the standard implementation of similar FPGA resources.

Neural network next (nn-X) [148] is a real-time system-on-chip (SoC) computing system for deep learning networks on mobile devices. The architecture of nn-X consists of a host processor, a co-processor, and external memory. The co-processor accelerates the learning networks by parallelizing their operations throughout arrays of configurable processing elements referred to as collections. Each collection contains one convolution engine, one pooling module, and one non-linear operator. The CONV engine accelerates the CONV operation by fully pipelining the incoming data with the use of cache memories. The collections are able to communicate with one another using the collection route component to achieve cascaded pipelining, which results in reducing accesses to external memory. The data transfer between the collections and the external memory is accomplished throughout the co-processor full-duplex memory router, which provides independent data streams. The nn-X has been prototyped on a Xilinx ZC706 board which contains a Zynq XC7Z045, two ARM Cortex-A9 processors, and 1 GB DDR3. Eight collections have been employed to achieve large parallelism. The results for the face recognition model in [149] show that nn-X is 115× faster than the two embedded ARM processors.

Zhang et al. [55] proposed a roofline-based model to accelerate convolutional neural networks on FPGAs. The roofline model is an intuitive visual performance model used to relate the attainable performance to the peak performance that can be provided by the hardware platform and the off-chip memory traffic [150]. The focus in their work is primarily on accelerating the convolutional layers as they consume more than 90% of the computational time during the prediction process [77]. In doing so, the authors optimized both the computation operations and the memory access operations in convolutional layers. They considered a CNN
application composed of five convolutional layers that won the ImageNet competition in 2012 [28]. The proposed accelerator uses polyhedral-based data dependence analysis [151] to fully utilize all FPGA computational resources through loop unrolling, loop pipelining, and loop tile size enumeration. Note that loop unrolling maximizes the parallel computation of CONV MAC operations. On the other hand, local memory promotion and loop transformation are used to reduce redundant communication operations and to maximize data sharing/reuse, respectively.

Subsequently, the roofline performance model is used to identify the optimal design from all possible solutions in the design space. Specifically, the authors model all possible legal designs delivered from the polyhedral analysis in the roofline to find the optimal unrolling factors ⟨Tm, Tn⟩ for every convolutional layer, where Tm and Tn are the tile sizes for the output FMs and input FMs, respectively. However, designing a CNN accelerator with different unrolling factors for each convolutional layer is challenging. Therefore, the proposed architecture enumerates all possible valid designs to find uniform cross-layer unrolling factors. Thereafter, the hardware accelerator is implemented based on the cross-layer optimal unrolling factors.
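The spirit of this roofline-guided search can be sketched as follows (a simplified illustration with made-up resource limits; the per-layer computation-to-communication model is a placeholder, not the authors' actual analytical model). Every candidate ⟨Tm, Tn⟩ is scored by its attainable performance, i.e., the minimum of its computational roof and the bandwidth ceiling times its computation-to-communication ratio, and the best feasible candidate is kept.

def attainable_gops(comp_roof, bandwidth, ctc_ratio):
    # Roofline: performance is bounded by the compute roof or by
    # (memory bandwidth) x (computation-to-communication ratio).
    return min(comp_roof, bandwidth * ctc_ratio)

def explore_uniform_unroll(layer_ctc_models, max_pe, bandwidth, freq=0.1):
    """Pick uniform cross-layer <Tm, Tn>. layer_ctc_models is a list of
    functions ctc(Tm, Tn) -> computation-to-communication ratio per layer."""
    best = None
    for Tm in range(1, 65):
        for Tn in range(1, 65):
            if Tm * Tn > max_pe:                      # resource constraint
                continue
            comp_roof = 2.0 * Tm * Tn * freq          # GOPS with all MACs busy
            perf = min(attainable_gops(comp_roof, bandwidth, ctc(Tm, Tn))
                       for ctc in layer_ctc_models)   # worst layer dominates
            if best is None or perf > best[0]:
                best = (perf, Tm, Tn)
    return best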
The proposed accelerator, composed of a computational engine and a memory sub-system, is depicted in Fig. 11. The computation engine is designed as Tm duplicated tree-shaped poly structures with Tn inputs from the input FMs, Tn inputs from the weights, and one input from the bias. On the other hand, the memory sub-system is implemented as four sets of on-chip buffers: two sets to store the input FMs and weights, each with Tn buffer banks, and two buffer sets of Tm independent banks for storing the output FMs. To overlap data transfer with computation, the on-chip buffers are operated in a ping-pong manner. In addition, two independent channels are implemented for load and off-load operations to increase the bandwidth utilization. Moreover, a MicroBlaze processor is used to send configuration parameters and commands for the accelerator over an AXI4lite bus. The CNN accelerator communicates with external data transfer engines through FIFO interfaces, where the data transfer engines are used to access DDR3 DRAM memory through the AXI4 bus.

FIGURE 11. Zhang et al. [55] Accelerator Architecture.

The accelerator is designed using the Vivado 2013.4 high level synthesis tool and implemented on a Xilinx VC707 FPGA board clocked at 100 MHz. The experimental results depict that the proposed implementation achieves a peak performance of 61.62 GFLOPS as well as a 17.42× speedup over the software implementation on an Intel Xeon CPU E5-2430 at 2.20 GHz with 15 MB cache and 16 threads. In addition to this, the results show that the proposed FPGA architecture is 24.6× more energy-efficient than the software implementation as the total power consumption is only 18.6 Watts. The proposed implementation has some limitations, such as designing the accelerator with new cross-layer unrolling factors for different architectures of CNNs. Furthermore, using the CNN accelerator with uniform unrolling factors might be sub-optimal for some CONV layers, which affects the overall performance.

In 2014, the Microsoft research team of the Catapult project integrated FPGA boards into data center applications to successfully achieve a 2× speedup for Bing Ranking (the large-scale search engine) [67]. A year later, Ovtcharov et al. [152] at Microsoft Research utilized the Catapult hardware infrastructure, a dual-socket Xeon server equipped with a Stratix-V GSMD5 FPGA, to design specialized hardware for accelerating the forward propagation of deep CNNs in a power-constrained data center.

The top-level architecture of the proposed CNN accelerator is shown in Fig. 12. A multi-banked input buffer and a kernel weight buffer are used to provide an efficient buffering scheme for FMs and weights, respectively. To minimize the off-chip memory traffic, a specialized network on-chip has been designed to re-distribute the output FMs on the multi-banked input buffer instead of transferring them to the external memory. The 3D convolution operations (such as the dot-product) and other CNN operations are

FIGURE 12. Top-Level Architecture of Microsoft CNN Accelerator [152].
compiler to generate a run-time control flow which provides an energy-efficient and better data reuse implementation. In addition, the DeepBurning compiler investigates the accelerator on-chip memory size and throughput to properly tile and partition the NN weights and feature data layouts. Moreover, DeepBurning uses the address flow component to automatically fetch and store off-chip memory and on-chip memory data. The authors compared the performance of DeepBurning with that in [55], considering the AlexNet CNN model, as they both operate at 100 MHz. They considered a high-budget, resource-constrained DeepBurning on the Zynq-7045 device. The results show that DeepBurning is 1.13× slower but 1.45× more energy-efficient.
An OpenCL-based optimization framework to accelerate large-scale convolutional neural network models was proposed by Suda et al. [80]. They found that the number of CONV MAC operations performed in parallel (N_CONV), the SIMD vectorization factor (S_CONV), the normalization layer loop unrolling factor (N_NORM), the number of parallel pooling outputs in one cycle (N_POOL), and the number of parallel FC MAC operations (N_FC) are the key variables that determine the parallelism of the design. Subsequently, they analytically and empirically modeled the execution time for each layer as a function of the above mentioned variables. Then, a genetic algorithm was used to explore the design space for finding the optimal combination of the key design variables considering the resource constraints.
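The following is a minimal sketch of such a genetic search (our own illustration; the execution-time model, resource check, and search grid are placeholders for the analytical/empirical models of [80]):

import random

VARS = ["n_conv", "s_conv", "n_norm", "n_pool", "n_fc"]
CHOICES = {v: [1, 2, 4, 8, 16, 32, 64] for v in VARS}     # assumed search grid

def genetic_search(exec_time, fits_on_fpga, pop=30, generations=50):
    """exec_time(design) -> modeled latency; fits_on_fpga(design) -> bool."""
    def random_design():
        return {v: random.choice(CHOICES[v]) for v in VARS}

    population = [random_design() for _ in range(pop)]
    for _ in range(generations):
        feasible = [d for d in population if fits_on_fpga(d)] or population
        feasible.sort(key=exec_time)
        parents = feasible[: max(2, pop // 4)]             # selection
        if len(parents) < 2:
            parents = parents * 2
        children = []
        while len(children) < pop:
            a, b = random.sample(parents, 2)
            child = {v: random.choice([a[v], b[v]]) for v in VARS}   # crossover
            if random.random() < 0.1:                                 # mutation
                v = random.choice(VARS)
                child[v] = random.choice(CHOICES[v])
            children.append(child)
        population = children
    return min((d for d in population if fits_on_fpga(d)),
               key=exec_time, default=None)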
The authors implemented the scalable CONV block in a similar fashion to that in [138] as a matrix multiplication by flattening and on-the-fly rearrangement of the feature data. The OpenCL software has been utilized in their work due to its parallel programming model as well as its ability to integrate the compiled RTL design with external memory interfacing IPs [156], which uses a memory coalescing technique with complex load and store units. In addition, it has optimized matrix multiplication and CPU-FPGA communication libraries [157], [158].

The framework is used on both VGG-16 and AlexNet CNN models which are implemented on P395-D8 [159] and DE5-Net [160] FPGA boards with fixed-point operations according to their precision study. They compared the proposed implementation with a 3.3 GHz Core i5-4590 CPU implementation that uses the Caffe tool [58] with the ATLAS [161] optimized library for matrix/vector operations. The results show that the OpenCL optimized framework on P395-D8 achieved 5.5× (117.8 GOPS) and 9.5× (72.4 GOPS) speedups for VGG-16 and AlexNet models, respectively. On the other hand, the DE5-Net FPGA achieved less throughput speedup than the P395-D8 (2.2× (47.5 GOPS) for VGG-16, and 4.2× (31.8 GOPS) for AlexNet) as it has 7.67× fewer DSPs than what is available on P395-D8.
Zhang et al. [153], [162] analyzed the transformation of CONV and FC layers to regular matrix multiplication presented in prior work [98]. For the VGG-16 model, they found that such a transformation necessitates up to 25× duplication of input FMs. To address this problem and improve the bandwidth utilization, they designed a uniformed matrix multiplication kernel that uses either input-major mapping (IMM) or weight-major mapping (WMM) techniques while computing the FC layer. In IMM, the designed kernel batches a group of different input FMs together, and then performs the matrix multiplication. The IMM technique improves the data reuse of FC weights. On the other hand, the designed kernel with the WMM technique makes use of the fact that the FC layer is communication-bound, in which the weight matrix is much larger than the input FM matrix. In particular, it loads the input FM matrix to a weight buffer and loads the weight matrix to the input FM buffer. Subsequently, a regular matrix multiplication is performed on these matrices. As a result, WMM may allow for a higher data reuse than IMM, especially for input FMs that can be reused multiple times considering the limited hardware resources.
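A functional sketch of how an FC layer becomes a matrix multiplication under the two mappings (our own illustration of the idea; Caffeine's actual kernel buffers and streams operands in hardware rather than calling a library multiply):

import numpy as np

def fc_as_matmul_imm(W, input_batch):
    """Input-major mapping: batch B input vectors into a matrix and stream the
    weights; each weight row is reused across the whole batch."""
    X = np.stack(input_batch, axis=1)        # (n_in, B)
    return W @ X                             # (n_out, B)

def fc_as_matmul_wmm(W, input_batch):
    """Weight-major mapping: keep the (small) input matrix resident and stream
    the large weight matrix through the same kernel, so each input value is
    reused against many weight rows."""
    X = np.stack(input_batch, axis=1)        # (n_in, B)
    return (X.T @ W.T).T                     # same result, operands swapped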
For the above, the roofline model was applied to identify the optimal mapping technique under different batch sizes and data precisions. The results demonstrate that WMM is better than IMM in terms of data reuse and bandwidth utilization, especially for small batch sizes, which are required for real-time inference. Hence, the same matrix multiplication kernel is utilized for the computation of both CONV and FC layers, but with the use of IMM in the CONV layer and WMM in the FC layer. Based on this, the authors proposed a software/hardware co-design library, which they named Caffeine, to accelerate CNNs on FPGAs.

With an easy-to-use developed tool, Caffeine aids in automatically choosing the best hardware parameters, using the model files from Caffe and the FPGA device specifications obtained from the user. The Caffeine FPGA engine uses a high-level synthesis (HLS)-based systolic-like architecture to implement the matrix multiplication kernel. It allows changing parameters such as the number of PEs, precision, and FM size. Caffeine further maximizes the FPGA computing capability by optimizing the multi-level data parallelism discussed in [55] and pipeline parallelism using the polyhedral-based optimization framework given in [163]. The Caffeine framework also handles the weights and biases reorganization in off-chip DRAM to maximize the underlying memory bandwidth utilization. In addition, the double-buffering technique is employed to prefetch the next data tile for each PE. Caffeine has been evaluated by implementing AlexNet and VGG-16 CNNs on Ultrascale KU060 (20nm and 200 MHz) and on Virtex7 690T (28nm and 150 MHz) considering different precisions. The VGG-16 implementation with 16-bit fixed-point on Ultrascale KU060 and Virtex7 690T provided 43.5× and 65× overall throughput enhancement, respectively, compared to the implementation on a two-socket server, each socket with a 6-core Intel CPU (E5-2609 at 1.9 GHz).

A special case of dataflow, referred to as synchronous dataflow (SDF) [164], is a paradigm of computation that allows for representing a computing system as a streaming problem. In this way, the SDF model can represent the hardware implementation of CNNs using linear algebra and a directed SDF graph (SDFG). Each node of the SDFG represents a
hardware building block that can immediately start its computation as soon as the data are available through its input arcs. Such a representation of the CNN model offers a fast design space exploration. Venieris and Bouganis [165] employed the SDF model to optimize the mapping of CNNs onto FPGAs based on HLS.

In particular, the proposed fpgaConvNet framework in [165] takes as input a high-level script programmed by a DL expert describing the CNN model, along with specifications of the targeted FPGA platform. Thereafter, it parses the input script through a developed domain-specific language (DSL) processor to model the CNN in the form of a directed acyclic graph (DAG) where each node corresponds to a CNN layer. Then, the DAG-based CNN is transformed into an SDFG representation and modeled as a topology matrix. The topology matrix contains the number of incoming parallel streams, the width of each data stream, and the production or consumption rates at each node. In addition, the DSL processor extracts information about the platform-specific resource constraints.
Unlike other attempts, instead of exploring the design space for the optimal parameters of loop unrolling and tiling, fpgaConvNet explores the design space of the topology matrix components while considering the resource constraints. In doing so, fpgaConvNet performs graph partitioning, coarse-grained folding, and fine-grained folding. The graph partitioning splits the original SDFG into subgraphs and each subgraph is then mapped to a distinct bitstream as shown in Fig. 15. Note that the proposed multi-bitstream architecture might have multiple CONV layer processors (CLPs), as in the provided example. This way, on-chip RAM is used for intermediate results and data reuse within the subgraph, while access to off-chip memory is minimized and limited to the input and output streams of the subgraph. However, this scheme adds a reconfiguration penalty due to the need for reconfiguring the FPGA when the data flows between adjacent subgraphs. To amortize this overhead, several input data streams are processed in a pipelined manner.

Rather than a fully parallel implementation of the dot-product, which produces one dot-product per cycle with the use of a high number of multipliers and adders, fpgaConvNet uses a smaller number of MAC units and schedules the execution of different operations using time-multiplexing. A trade-off between the performance and the required hardware resources can be achieved by changing the unroll factor and the degree of multiplexing. Therefore, fpgaConvNet employed simulated annealing [166] to find the optimal partitioning points and folding factors. Finally, fpgaConvNet uses optimal components to derive the configuration of PEs and buffers, and generates a synthesizable Vivado HLS hardware design.

The fpgaConvNet framework has been evaluated by mapping LeNet-5 and the scene labeling [167] small CNN models with Q8.8 fixed-point representation onto a Zynq-7000 XC7Z020 FPGA platform working at 100 MHz. In mapping LeNet-5, fpgaConvNet achieves up to 1.62× the performance density of CNP [127]. Compared to the Tegra K1 GPU implementation of the scene labeling CNN, fpgaConvNet surpasses Tegra K1's power efficiency by 1.05×.

Ma et al. [78] proposed a Python-based modularized RTL compiler to accelerate CNNs by employing loop unrolling optimization [55], [79] for CONV layer operations. A detailed review article of this work has been recently published and referred to as ALAMO [168]. The proposed compiler integrates both the RTL finer level optimization and the flexibility of HLS to generate efficient Verilog parameterized RTL scripts for ASIC or FPGA platforms under the available number of parallel computing resources (i.e., the number of multipliers (Nm)). If Nm is greater than the number of input FMs (Nif), the proposed compiler fully unrolls Loop-3 (Nif, refer to subsection II-A.1 for more details) while it partially unrolls Loop-4 (Nof) to exploit the data reuse of shared features among Nm/Nif output FMs. Otherwise, it partially unrolls Loop-3, which results in Nif/Nm repeated slidings of the kernel window. On the other hand, Loop-2 (X × Y) is serially computed after Loop-1 (K) to minimize the number of partial sums.
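This unrolling rule can be captured in a few lines (our own simplified reading of the strategy in [78], [168]; the returned tuple and names are illustrative only):

def alamo_unroll_factors(Nm, Nif, Nof):
    """Choose CONV loop unroll factors from the available multipliers Nm."""
    if Nm >= Nif:
        # Fully unroll Loop-3 (all Nif input FMs in parallel) and partially
        # unroll Loop-4 so Nm/Nif output FMs share the fetched input features.
        unroll_loop3 = Nif
        unroll_loop4 = min(Nof, Nm // Nif)
        loop3_passes = 1
    else:
        # Partially unroll Loop-3: the kernel window slides Nif/Nm times.
        unroll_loop3 = Nm
        unroll_loop4 = 1
        loop3_passes = -(-Nif // Nm)        # ceil(Nif / Nm)
    return unroll_loop3, unroll_loop4, loop3_passes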
The overall modules of the proposed CNN accelerator are shown in Fig. 16. The controller is responsible for directing and ensuring in-order computation of the CNN modules for each layer. The data routers oversee the selection of data read and data write of two adjacent modules as well as the assignment of buffer outputs to the shared or pool multipliers of the multiplier bank. The feature buffers hold the FMs using on-chip RAMs. The weight buffers are used to ensure the availability of the CONV and FC layers' weights before their computation as well as to overlap the transfer of FC layer weights with its computation. The CONV module consists of control logic, groups of adder trees, and ReLU components. The control logic component parameterizes the loop unrolling factors based on the configuration of each layer (Nif, Nof, X, Y, and K). The CONV module contains Nm/Nif adders to sum Nif parallel multiplier results and accumulate them. Moreover, the adder trees can be shared by layers with identical Nif as one single module. The ReLU component checks the input pixel sign bit to either output zero or the data pixel itself. The POOL module contains accumulators or comparators to perform the average or maximum operation, respectively. The NORM module maintains the required components to perform the operations of local response normalization such as square, non-linear (using a look-up table), and multiplication operations. Finally, the FC module shares the multiplier bank module with the CONV module to perform the matrix-vector multiplication (MVM).

The ALAMO architecture permits the output pixels to be stored only in the feature buffers, which makes ALAMO suitable for CNNs with only small intermediate data volumes. The proposed RTL compiler has been tested by accelerating two CNN models: AlexNet and NiN [169]. The generated parameterized RTL scripts for AlexNet and NiN are synthesized using the Altera Quartus synthesis tool and implemented on the DE5-Net FPGA board. The experimental results for the AlexNet model are compared with the results for the OpenCL-based design [80] as both use the same FPGA board with similar hardware resources for AlexNet. ALAMO achieved 1.9× and 1.3× improvement for throughput and power consumption, respectively. Moreover, the overall throughput of the NiN model is 1.03× better than that of AlexNet. This is because NiN has more CONV layers and many of them have the same Nif.

Liu et al. [170] proposed a parallel framework for FPGA-based CNN accelerators that exploits four levels of parallelism: task level, layer level, loop level, and operator level. Task-level parallelism involves executing multiple image prediction tasks simultaneously. Layer-level parallelism exploits pipelining across layers to enable parallel execution of all layers with different images. Loop-level parallelism utilizes loop unrolling in performing convolutions and this can be achieved either through intra-output or inter-output parallelism. Finally, operator-level parallelism is achieved by parallelizing the k × k MAC operations needed for the convolution operation in convolutional layers or the n MACs needed for the inner-product computation in fully connected layers. Fig. 17 shows the parallel framework exploiting these four levels of parallelism.

The authors have used a 16-bit fixed-point format for representing pixels in input feature maps and output feature maps. However, they have used 32 bits for intermediate results, which get truncated to 16 bits. In addition, they have used 8 bits for representing kernels and weights. They have presented a systematic methodology for design space exploration to find the optimal solution that maximizes the throughput of an FPGA-based accelerator under given FPGA constraints such as on-chip memory, computational resources, external memory bandwidth, and clock frequency.

The proposed technique has been evaluated by implementing three CNN accelerators on the VC709 board for LeNet, AlexNet, and VGG-S. It has achieved a throughput of 424.7 GOPS, 445.6 GOPS, and 473.4 GOPS for the LeNet, AlexNet, and VGG-S accelerators, respectively. In addition, the performance has been compared with the MatConvNet tool running the CNN models on an Intel Core i7-4790K CPU (4.0 GHz) and an NVIDIA GTX-770 GPU (1,536 CUDA cores, 2 GB GDDR5, 224.3 GB/s memory bandwidth). Compared to the CPU implementations, the accelerators for LeNet, AlexNet, and VGG-S achieved 14.84×, 6.96×, and 4.79× in performance, respectively, and 51.84×, 24.69×, and 16.46× in power efficiency, respectively. Compared to the GPU implementations, the accelerators achieved better performance in the small-scale network LeNet (3.17×), comparable performance in the medium-scale network AlexNet (0.96×), and worse performance in the large-scale network VGG-S (0.56×). However, the accelerators achieved higher power efficiency than the GPU implementations in all three networks, with 28.3× for LeNet, 8.7× for AlexNet, and 4.98× for VGG-S.
FP-DNN [171] is an end-to-end framework that automatically generates optimized FPGA-based implementations of deep neural networks (DNNs) using an RTL-HLS hybrid library. The FP-DNN compiler, programmed using C++ and OpenCL, takes TensorFlow symbolic descriptions [172] of DNNs, and then performs model inference through the use of model mapper, software generator, and hardware generator modules. The model mapper extracts the topological structure and layer configurations of the DNN model from the TensorFlow descriptions and generates an execution graph for the target model. The execution graph shows layer-by-layer operations and read/write data transactions.
FP-DNN compiler allocates off-chip DRAM data buffers fixed-point network.
to store intermediate data, weights, and model parameters Subsequently, the authors proposed a framework, referred
and configurations. The model mapper maximizes the storage to as FINN [181], that maps a trained BNN onto FPGA.
resource reuse through minimizing the number of required FINN generates a synthesizable C++ network description of
physical buffers. Specifically, it formulates the data reuse a flexible heterogeneous streaming architecture. The archi-
problem as a graph coloring problem [173], and then the left- tecture consists of pipelined compute engines that commu-
edge algorithm is applied to generate kernel configuration and nicate via on-chip data streams. Each BNN layer has been
kernel schedule. Subsequently, the software generator uses implemented using dedicated compute engines with 1-bit
the kernel schedule to generate a host C++ program which values for weights and FMs; +1 and −1 are used to represent
initializes the model, manages the data buffers, and sched- a set bit and unset bit, respectively.
ules the kernel execution. On the other hand, the hardware The authors have optimized accumulation, batch normal-
generator uses the kernel configuration and the execution ization (batchnorm), activation, and pooling operations of
graph to generate the FPGA hardware codes by instantiating BNNs. In particular, the accumulation of a binary dot-product
the corresponding optimized templates from an expandable has been implemented as a counter of set bits (popcount
RTL-HLS hybrid library. Each template is comprised of operation). The popcount-accumulate reduces the number of
Verilog-based computational engine and OpenCL-based con- required look-up tables (LUTs) and flip-flops (FFs) by a half,
trol logics engine. compared to the implementation of signed-accumulation.
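The buffer-reuse step can be illustrated with a short sketch. The following Python fragment is only an illustration of the general interval-packing idea behind left-edge buffer sharing (the tensor names, live ranges, and sizes are invented and do not come from FP-DNN): tensors whose lifetimes over the execution graph do not overlap are assigned to the same physical buffer, which minimizes the number of required buffers.

# Minimal sketch of left-edge style buffer sharing: tensors whose live
# ranges (first_use, last_use) do not overlap may share one physical buffer.
def assign_buffers(tensors):
    # Sort by the start of the live range (classic left-edge ordering).
    ordered = sorted(tensors, key=lambda t: t["first_use"])
    buffers = []  # each buffer tracks when it becomes free and its size
    for t in ordered:
        placed = False
        for buf in buffers:
            if buf["free_at"] < t["first_use"]:        # lifetimes do not overlap
                buf["free_at"] = t["last_use"]
                buf["size"] = max(buf["size"], t["size"])
                buf["tenants"].append(t["name"])
                placed = True
                break
        if not placed:                                  # open a new physical buffer
            buffers.append({"free_at": t["last_use"], "size": t["size"],
                            "tenants": [t["name"]]})
    return buffers

# Hypothetical intermediate feature maps of a small model (sizes in MB).
tensors = [
    {"name": "fm1", "first_use": 0, "last_use": 1, "size": 6},
    {"name": "fm2", "first_use": 1, "last_use": 2, "size": 4},
    {"name": "fm3", "first_use": 2, "last_use": 3, "size": 3},
]
for i, b in enumerate(assign_buffers(tensors)):
    print(f"buffer {i}: {b['size']} MB shared by {b['tenants']}")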
The architecture of the proposed FPGA-based accelerator consists of matrix multiplication and data arranger modules. Matrix multiplication module is a hand-written Verilog code that is designed and optimized based on the hardware constraints of Altera Stratix-V GSMD5 FPGA. It applies tiling and ping-pong double buffers techniques to improve the throughput. On the other hand, data arranger is an OpenCL-based module that is responsible for mapping the computational part of a layer to matrix multiplication as well as performing data communication with off-chip memory and matrix multiplication module. Mapping DNNs computational operations to matrix multiplication has been widely applied in prior studies [80], [132], [174]. FP-DNN maps FC layer to matrix multiplication by batching input vectors together. Before model deployment, FMs and weights are rearranged in DRAM using the channel-major scheme to optimize the communication between the accelerator and off-chip DRAM. On the other hand, both floating-point and fixed-point representations have been supported for implementation, and they can be adjusted by the user.

The proposed RTL-HLS hybrid framework has been evaluated by accelerating VGG-19, LSTM-LM [175], and ResNet-152 DNNs on Stratix-V GSMD5 FPGA. Note that this is the first work that implements ResNet-152 on FPGA. The experimental results demonstrated that the speedup of FP-DNN for 16-bit fixed-point implementations is about 1.9× - 3.06× compared with the server that includes 2 processors each with 8-core Intel Xeon E5-2650v2 at 2.6 GHz.

In line with the current trends towards compressed neural networks, with dramatically reduced weights and activations bit-width using 1-bit or 2-bit quantization [176]–[180], Umuroglu et al. [181] conducted a set of experiments to estimate the trade-off between the network size and precision using the roofline model. They found that binarized neural networks (BNNs) [180] require 2 to 11 times more operations and parameters than an 8-bit fixed-point CNN to achieve a comparable accuracy on MNIST [71] dataset. However, the performance of BNN is found to be 16× faster than the fixed-point network.

Subsequently, the authors proposed a framework, referred to as FINN [181], that maps a trained BNN onto FPGA. FINN generates a synthesizable C++ network description of a flexible heterogeneous streaming architecture. The architecture consists of pipelined compute engines that communicate via on-chip data streams. Each BNN layer has been implemented using dedicated compute engines with 1-bit values for weights and FMs; +1 and −1 are used to represent a set bit and unset bit, respectively.

The authors have optimized accumulation, batch normalization (batchnorm), activation, and pooling operations of BNNs. In particular, the accumulation of a binary dot-product has been implemented as a counter of set bits (popcount operation). The popcount-accumulate reduces the number of required look-up tables (LUTs) and flip-flops (FFs) by a half, compared to the implementation of signed-accumulation. BNN batchnorm and activation operations have been simplified and implemented together as unsigned comparison with a threshold τk; +1 is produced when the input value is greater than or equal to τk, and −1 otherwise. The value of τk is computed during run-time. Such an implementation of batchnorm-activation operations requires much smaller number of LUTs, without the need for DSPs and FFs, compared to regular implementation of batchnorm-activation. Max-pooling, average-pooling, and min-pooling have been effectively implemented with Boolean OR-operator, Boolean majority function, and Boolean AND-operator, respectively.

The accelerator architecture is composed of building blocks from the FINN hardware library. The matrix-vector-threshold unit (MVTU) is the core computational building block as matrix-vector operations followed by thresholding form the majority of BNN operations. The design of MVTU consists of an input buffer, an array of P parallel PEs each with S SIMD lanes, and an output buffer. BNN weight matrix is distributed across the PEs and stored locally in on-chip memory. Subsequently, the input images are streamed through the MVTU and multiplied with the weight matrix. Particularly, the PE computes the dot-product between an input vector and a row of weight matrix, each of S-bits wide, using an XNOR gate, as shown in Fig. 18. Then, it compares the number of set bits to a threshold and produces a 1-bit output value as previously discussed.

FIGURE 18. The Architecture of MVTU PE [181].
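To make the XNOR, popcount, and thresholding datapath concrete, the following is a small Python sketch of what a single MVTU PE computes for one weight row (a behavioral illustration only, not FINN code; the bit patterns and threshold value are arbitrary examples):

# Sketch of a binarized matrix-vector-threshold operation (MVTU-style).
# Bits encode +1 as 1 and -1 as 0; XNOR + popcount replaces multiply-accumulate.
def xnor_popcount_dot(x_bits, w_bits, n_bits):
    xnor = ~(x_bits ^ w_bits) & ((1 << n_bits) - 1)   # 1 where the signs agree
    return bin(xnor).count("1")                        # popcount of agreeing positions

def mvtu_row(x_bits, w_bits, n_bits, tau):
    # The signed pre-activation would be 2*popcount - n_bits; comparing the raw
    # popcount against a precomputed threshold tau folds batchnorm + activation
    # into a single unsigned comparison, as described above.
    return +1 if xnor_popcount_dot(x_bits, w_bits, n_bits) >= tau else -1

# Example: 8-bit input vector and weight row, threshold of 5 agreeing bits.
x = 0b10110101
w = 0b10010111
print(mvtu_row(x, w, n_bits=8, tau=5))   # -> +1 or -1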
Umuroglu et al. [181] implemented the CONV layer using a sliding window unit (SWU) and an MVTU, where convolutional operation is transformed to matrix-multiplication of image matrix and filter matrix. SWU generates the image matrix to MVTU by moving the sliding window over the input FMs, while the filter matrix is generated by packing the weights from the convolution filters as shown in Fig. 19. In order to meet the user throughput requirement, MVTU is folded (time-multiplexed) by controlling the values of P and S. Folding of MVM decides partitioning of the matrix across PEs. Every row of matrix tile is mapped to a distinct PE and every column of PE buffer is mapped to a distinct SIMD lane. In this way, the required number of cycles to compute one MVM (total fold) is obtained as (X × Y)/(P × S), where X and Y are the dimensions of the matrix. The folding factors of BNN layers have been determined such that every BNN layer takes nearly the same number of cycles.

FIGURE 19. Transforming CONV to Matrix-Multiplication [181], where ifm and ofm are the input and output feature maps, respectively.
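A minimal sketch of the folding arithmetic is given below (illustrative Python, not taken from FINN; the matrix shapes and the chosen P and S values are invented) showing how the total fold (X × Y)/(P × S) can be balanced across layers so that no single layer dominates the pipeline:

# Total fold (cycles per matrix-vector product) for an X x Y binary matrix
# processed by P PEs with S SIMD lanes each, as described above.
def total_fold(X, Y, P, S):
    assert X % P == 0 and Y % S == 0, "matrix must tile evenly across PEs/SIMD lanes"
    return (X * Y) // (P * S)

# Hypothetical BNN layers (rows X, cols Y) and per-layer folding choices that
# keep the cycle counts roughly balanced across the streaming pipeline.
layers = {"conv2": (256, 2304), "conv3": (256, 2304), "fc1": (512, 4096)}
folding = {"conv2": (16, 36), "conv3": (16, 36), "fc1": (32, 64)}
for name, (X, Y) in layers.items():
    P, S = folding[name]
    print(name, "cycles per output vector:", total_fold(X, Y, P, S))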
To evaluate FINN, the authors implemented CNV topology on Xilinx Zynq-7000 board at 200 MHz to accelerate BNNs inference on CIFAR-10 [182]. CNV contains three repetitions of two 3 × 3 CONVs and 2 × 2 max-pooling layers. Its topology is inspired by VGG-16 and BinaryNet [180]. Although CNV accepts images with 24-bits/pixel as an input and produces a 10-element vector of 16-bit values, 2-bits are used for representing intermediate results while 1-bit is used for representing CONV and FC weights. Experimental results demonstrated that the proposed design provides high performance (2.5 TOPS) while incurring low energy consumption (11.7 Watts). FINN outperforms the design by Ovtcharov et al. [152] by over 13.8× for throughput.

In [83], loop optimization techniques [55], [79] have been employed in FPGA to design a customized CNN accelerator through speeding up CONV layer operations. Firstly, an in-depth analysis is provided to numerically characterize loop unrolling, loop tiling, and loop interchange optimization techniques. In doing so, 8 CONV dimensions parameters (N∗), 8 loop unrolling design variables (P∗), and 8 loop tiling design variables (T∗) have been used with a constraint, as for a specific loop level, 1 ≤ P∗ ≤ T∗ ≤ N∗. Note that unrolling Loop-1 and Loop-3 requires Pkx × Pky and Pif multipliers, respectively, an adder tree with fan-in of Pkx × Pky and Pif, respectively, and an accumulator. On the other hand, unrolling Loop-2 requires Pix × Piy parallel units of MAC to reuse the same weight for Pix × Piy times, while the input feature pixel can be reused by Pof times when unrolling Loop-4 with the use of Pof parallel MAC units. Thus, Pkx × Pky × Pif × Pix × Piy × Pof multipliers are required. Please refer to Fig. 2 for more details on CONV loops levels and their parameters. In loop tile optimization, the authors have numerically set the lower bound on the required size of the input pixel buffer, the weight buffer, and output pixel buffer that ensures reading each input feature pixel and weight from the off-chip memory only once. On the other hand, loop interchange technique has a great impact on the times of memory access as well as the number of partial sums since it determines the order of computing CONV loops.

Secondly, the authors have provided a quantitative analysis of the design variables to minimize each of computing latency, partial sum storage, on-chip buffer access, and off-chip DRAM access. Subsequently, MATLAB scripts are used to randomly sample a subset of the solution space to find the optimal design configurations. This is due to the large solution space, more than 7.2 × 10^13 possible configurations for loop tiling variables of width (Pox) and height (Poy) output FM alone. According to the random sampling results for VGG-16 CNN model on Arria 10 GX 1150 FPGA, uniform unrolling factors for CONV layers are used with Pix = Pox = Piy = Poy = 14 and Pof = 16 for Loop-2 and Loop-4, respectively, to reuse input feature pixels and weights. On the other hand, Loop-1 and Loop-3 are serially computed to prevent the movement of the partial sums between the MAC units and consume them ASAP since both Loop-1 and Loop-3 need to be finished in order to obtain one final output pixel. More importantly, the order of loops computation has been found to be as follows. Loop-1 is computed first, then comes Loop-3, and finally Loop-2 and Loop-4 are computed in any order.
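The random sampling step can be mimicked in a few lines of Python (shown here in place of the authors' MATLAB scripts; the layer shape, candidate unrolling values, and multiplier budget are placeholders): sampled design points are discarded if they exceed the multiplier budget and otherwise scored by their ideal compute latency.

import random

# Toy random sampling of CONV unrolling factors; resource limits and the
# layer shape are illustrative, not taken from [83].
random.seed(0)
Nkx, Nky, Nif, Nix, Niy, Nof = 3, 3, 256, 56, 56, 256   # example CONV layer dims
DSP_BUDGET = 3136                                        # available multipliers (assumed)

def sample_point():
    # Unrolling factors P* must not exceed the corresponding loop bounds N*.
    Pix = random.choice([7, 14, 28])
    Piy = Pix                                            # keep the output tile square
    Pof = random.choice([8, 16, 32])
    return Pix, Piy, Pof

best = None
for _ in range(1000):
    Pix, Piy, Pof = sample_point()
    multipliers = Pix * Piy * Pof                        # Loop-1 and Loop-3 kept serial
    if multipliers > DSP_BUDGET:
        continue                                         # violates the resource constraint
    macs = Nkx * Nky * Nif * Nix * Niy * Nof             # total MACs of the layer
    cycles = macs / multipliers                          # ideal compute latency
    if best is None or cycles < best[0]:
        best = (cycles, Pix, Piy, Pof)
print("best sampled point:", best)

With the assumed budget of 3,136 multipliers, the sampled optimum coincides with the uniform factors Pix = Piy = 14 and Pof = 16 quoted above.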
FIGURE 20. CONV Acceleration Architecture and Dataflow [83], where Pix = Pox = 3, Piy = Poy = 3, and Pof = 3.

Finally, a customized convolution accelerator module with efficient dataflow has been designed based on the previous results and used for all VGG-16 CONV layers. The CONV accelerator consists of 3,136 (Pix × Piy × Pof) independent MAC units and 14 (Pof) input pixel buffers. Fig. 20 shows an example of the designed CONV accelerator when Pix, Piy, and Pof are all equal to 3. The input pixels are shifted after fetching them out of the input pixel buffers. Subsequently, they can be reused among the input register arrays. Then, the input pixels are fed into the associated MAC units. The figure also shows that the input pixels and weights are shared by Pof and Pix × Piy MAC units, respectively.

The overall CNN acceleration system mainly consists of two SDRAM banks that hold the input feature pixels and weights, two modular Scatter-Gather DMA (mSGDMA) engines to facilitate the simultaneous read/write from/to the SDRAMs, and a controller to govern the sequential computation of layers as well as the iterations of the four CONV loops. On the other hand, dual weight buffers have been used to increase the throughput of FC layer through overlapping the inner-product computation with off-chip communication. The acceleration system has been written as parameterized Verilog scripts. The experimental results show that the proposed accelerator has a throughput of 645.25 GOPS, which is more than 3.2× enhancement compared to prior VGG-16 FPGA-based implementations [80], [98].
Venieris and Bouganis [183] further extended fpgaConvNet framework [165] to allow for optimizing either throughput or latency depending on the size of the workload. For large workloads, weights reloading transformation has been introduced to efficiently design latency-critical CNNs on FPGA. In contrast with fpgaConvNet, where a distinct architecture is designed for each subgraph, the weights reloading transformation allows for generating a single flexible architecture, named as the reference architecture and derived using pattern matching, to execute the workloads of all subgraphs by transitioning to different modes. Upon the execution of a new subgraph, the subgraph's weights are read into the on-chip memory and the multiplexers are configured to form the appropriate datapath. Fig. 21 demonstrates how weights reloading is applied. The authors have mentioned that the required time for transferring subgraph's weights is much smaller than the average time for full FPGA reconfiguration, 272.7× less when loading 4.5 MB of weights for a VGG-16 layer on Zynq XC7Z045.

FIGURE 21. Weights Reloading [183].

In the situation discussed above, due to limited on-chip memory capacity, it might not be possible to load all weights required for a single CONV layer. To handle this, the authors introduced an input FMs folding factor (fin) with each CONV layer. A CONV layer (CONVi) is partitioned into fini subgraphs in which each subgraph executes a fraction of CONVi to produce a fraction of the output FMs. The proposed latency-driven methodology has been evaluated by implementing AlexNet and VGG-16 with 16-bit fixed-point precision for both on Zynq XC7Z045 at 125 MHz. The experimental results showed 1.49× and 0.65× higher CONV throughput than DeepBurning [155] and the embedded FPGA accelerator in [98] for AlexNet and VGG-16 implementations, respectively.
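As a rough illustration of the folding factor (hypothetical Python, not fpgaConvNet tooling; the on-chip weight budget is assumed), the sketch below splits one CONV layer along its input FMs so that each subgraph's weights fit on chip:

import math

# Partition a CONV layer's weights into fin subgraphs so each slice fits on chip.
def folding_factor(num_input_fms, num_output_fms, k, bytes_per_weight, on_chip_bytes):
    layer_bytes = num_input_fms * num_output_fms * k * k * bytes_per_weight
    f_in = max(1, math.ceil(layer_bytes / on_chip_bytes))   # subgraphs along input FMs
    fms_per_subgraph = math.ceil(num_input_fms / f_in)
    return f_in, fms_per_subgraph, layer_bytes

# Illustrative VGG-16-like layer: 512 x 512 x 3 x 3 weights at 16-bit precision
# (about 4.5 MB, as quoted above) against an assumed 1 MB on-chip weight budget.
f_in, fms, total = folding_factor(512, 512, 3, 2, on_chip_bytes=1 << 20)
print(f"layer weights: {total / 2**20:.1f} MB -> fin = {f_in}, "
      f"{fms} input FMs per subgraph")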
Lavin and Gray [184] demonstrated that CNN algorithms with small filters can be efficiently derived using Winograd algorithm [185] and fast Fourier transform (FFT) algorithm [186] due to their advantages in improving resource efficiency and reducing arithmetic complexity. Winograd computation involves a mix of element-wise (Eltwise) and general-purpose matrix multiplication, where some of the matrices need to be transformed. In particular, Winograd algorithm exploits the structure similarity among n × n tiled input FM pixels given a filter of size r × r to generate m × m tiled pixels of the output FM, where m represents the stride between Winograd tiles (m = n − r + 1), while minimizing the number of required CONV multiplications from m²r² for conventional CONV algorithm to n². In another work, Zhang et al. [187] implemented FFT algorithm for CNN on FPGA platform. However, their proposed implementation shows little reduction of computation complexity with small filters such as 3 × 3.
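The reduction in multiplications is easy to quantify. The short calculation below is a generic illustration of the m = n − r + 1 relation rather than code from [184]; it compares the multiplications per output tile with and without the Winograd transform for a 3 × 3 filter:

# Multiplications per m x m output tile for an r x r filter:
#   conventional CONV:        m*m * r*r
#   Winograd F(m x m, r x r): n*n element-wise products, with n = m + r - 1.
def winograd_saving(m, r):
    n = m + r - 1
    conventional = (m * m) * (r * r)
    winograd = n * n
    return n, conventional, winograd, conventional / winograd

for m in (2, 4):
    n, conv, wino, gain = winograd_saving(m, 3)
    print(f"F({m}x{m}, 3x3): tile n={n}, {conv} mults -> {wino} mults "
          f"({gain:.2f}x fewer)")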
Aydonat et al. [188] presented a deep learning architecture (DLA) based on OpenCL. Their proposed architecture reduces the external memory bandwidth requirements by an order-of-magnitude for both the convolutional and fully connected layers. This is achieved by caching all intermediate feature maps on-chip in stream buffers. For fully connected layers, image batching is used where a batch of images are processed together through the fully connected layers. The approach utilizes the Winograd transformation to reduce the multiply-accumulate operations, which could reduce the number of needed operations by about 50%. In addition, it uses half-precision (FP16) floating-point operations with shared exponents, which significantly reduces the needed computational resources.
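The shared-exponent format can be viewed as a block floating-point quantizer: one exponent is kept for a group of values and only short mantissas are stored, so dot-products operate on narrow fixed-point mantissas. The snippet below is a simplified, hypothetical model of this idea and is not the DLA's actual number format:

import math

# Block floating point: one shared exponent per group, short signed mantissas.
def shared_exponent_quantize(values, mantissa_bits=10):
    max_abs = max(abs(v) for v in values)
    shared_exp = math.frexp(max_abs)[1] if max_abs != 0 else 0   # exponent of the largest value
    scale = 2.0 ** (shared_exp - mantissa_bits)
    mantissas = [int(round(v / scale)) for v in values]           # narrow integers to store
    dequantized = [m * scale for m in mantissas]
    return shared_exp, mantissas, dequantized

vals = [0.0312, -0.0075, 0.0021, 0.0288]
exp, mants, approx = shared_exponent_quantize(vals)
print("shared exponent:", exp)
print("stored mantissas:", mants)
print("reconstructed   :", [round(a, 4) for a in approx])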
The overall DLA architecture is shown in Fig. 22. Each PE consists of dot-product units, accumulators, and caches, for performing dot-products for convolution and fully connected layers. Caches are used for storing filter weights. To avoid idle computation cycles, double-buffering is used such that filter weights for the next convolution layer are prefetched onto the caches while filter weights are loaded from the caches for a particular convolution layer. Stream buffers store feature data and stream it to PEs. Each stream buffer is double-buffered similar to filter caches. Images are loaded from the DDR and are stored in stream buffers before the first convolution layer starts execution. During a convolution layer execution, while feature data for a convolution layer is being streamed into the PEs, the outputs of convolutions are simultaneously stored in the buffers. The StreamBuffer unit applies the Winograd transformations to features, and streams the transformed features to the first PE which are forwarded through all the PEs via the daisy-chained input connections between them. The ReLU unit receives the outputs of the PEs via daisy-chained output connections. Then, the normalization unit receives the outputs of the ReLU unit and applies the normalization formula across the feature maps. The pooling unit receives the outputs of the normalization unit and computes the maximum value in a window. The output of the pooling unit is stored back in the stream buffer for further processing, if more convolution layers are to follow. Otherwise, the outputs of the pooling unit are stored in external memory. For the fully connected layers, features data are stored on PEs caches while filter weights are stored in stream buffers. For the first fully connected layer, features data are read back from external memory and loaded onto the PE caches. The ReLU output is sent directly to DDR, without applying normalization or pooling. The sequencer generates the control signals to control the operation of the various blocks in DLA according to the topology of the executed CNN. Executing a different CNN requires just changing the sequencer configuration.

FIGURE 22. Overall DLA Architecture [188].

The DLA has been evaluated by implementing AlexNet CNN on Intel's Arria 10 dev kit which contains an A10-1150 device (20nm) using a 96 batch size for the fully connected layers. It achieved a performance of 1020 images/s. In addition, it achieved 8.4× more GFLOPS than the latest Ultrascale (KU 20nm) result reported in [162], which uses a 32 batch size for the fully connected layers, and 19× more GFLOPS than the latest Stratix V result reported in [80]. Furthermore, it has achieved energy efficiency at 23 images/s/W, which is similar to what is achieved with the best publicly known implementation of AlexNet on NVIDIA Titan X GPU.

Unlike DLA architecture [188] where a 1D Winograd algorithm was employed to reduce arithmetic complexity, Lu et al. [189] implemented a novel FPGA architecture with a two-dimensional Winograd algorithm [185] to accelerate convolutional computation of CNNs. The overall architecture consists of line buffer structure and Winograd PE engine, as shown in Fig. 23. Particularly, n + m input lines and m output lines of on-chip buffers are used to effectively reuse FM data among different tiles. While Winograd PE engine reads the first n input lines to perform Winograd computation, the next m input lines load pixels from off-chip memory using FIFOs to overlap the data transfer and computation. Thereafter, the input lines are rotated in a circular fashion to make the next n input lines ready. On the other hand, Winograd PE engine composed of 4 pipelined stages performs transformation, element-wise matrix multiplication, additional transformation, and accumulation of output tiles, respectively.
the overall off-chip memory accesses. Note that the optimal dimension of each CLP is found based on the work in [55]. Subsequently, C++ (HLS) templates are parameterized to design CLPs and to form a complete implementation of CNN. A standard AXI crossbar is used to interconnect the independent CLPs. The ping-pong double-buffering technique is also used for input FMs, output FMs, and weights to allow for transferring data while computation is in progress. The experimental results of implementing AlexNet with a single precision floating-point using multi-CLP accelerator on Virtex7 485T and 690T FPGAs at 100 MHz demonstrate 1.31× and 1.54× higher throughput than the state-of-the-art single CLP design in [55], respectively. For the more recent SqueezeNet network, the proposed multi-CLP accelerator results in speedup of 1.9× and 2.3× on Virtex7 485T and 690T FPGAs at 170 MHz with 16-bit fixed-point, respectively.
Xuechao et al. [195] presented a systolic architecture for automatically implementing a given CNN on FPGA based on OpenCL description, maximizing clock frequency and resource utilization. The proposed systolic architecture is shown in Fig. 28. Each PE shifts the data of the weights (W) and inputs (IN) horizontally and vertically to the neighboring PEs in each cycle. The 2D structure of PEs is designed to match the FPGA 2D layout structure to reduce routing complexity and achieve timing constraints.

FIGURE 28. Systolic Array Architecture for CNN [195].
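A purely illustrative, cycle-level software model of such a dataflow is sketched below (plain Python, not the OpenCL implementation of [195]); it reproduces the systolic timing skew in which each PE receives its operand pair one cycle after its upstream neighbors:

# Toy cycle-level model of a 2D systolic array computing C = A @ B.
# A's rows enter from the left (skewed), B's columns enter from the top (skewed);
# every cycle each PE multiplies the pair passing through it and accumulates.
def systolic_matmul(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]
    cycles = 3 * n - 2                      # enough cycles to drain the array
    for t in range(cycles):
        for i in range(n):
            for j in range(n):
                k = t - i - j               # which operand pair reaches PE(i, j) at cycle t
                if 0 <= k < n:
                    C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(systolic_matmul(A, B))                # [[19, 22], [43, 50]]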
The technique first finds a feasible mapping for the given CNN to the systolic array to guarantee that proper data is available at specific locations in the PE array at every cycle. Then, the size of PE array (dimensions) is determined which has an impact on the required number of DSPs, the clock frequency, and the DSPs efficiency. Finally, the data reuse strategy is determined by choosing proper tiling sizes. The proposed technique has been evaluated using AlexNet and VGG16 on Intel's Arria 10 GT 1150 board. The technique has explored the use of both 32-bit floating-point and fixed-point using 8-bits for weights and 16-bits for data. Evaluation results show that, for the VGG16 CNN, the technique achieves up to 1,171 GOPS on Intel's Arria 10 device with a clock frequency of 231.85 MHz and (8-16)-bit fixed-point representation.

In another recent research work, Ma et al. [196] generalized the previously proposed accelerator in [83] to efficiently accelerate ResNet-50 and ResNet-152 on Arria 10 GX 1150 FPGA. In doing so, they designed flexible and scalable CONV, ReLU, BatchNorm, scale, pooling, FC, and Eltwise primitives. In addition, local control logic and registers have been used with each primitive to control their computation order and to hold their configurations, respectively. By doing so, ResNets primitives can be efficiently reused for different parameters of each layer.

For ResNets scalable CONV primitive, there are four (kernel, stride) size configurations; (3 × 3, 1), (1 × 1, 1), (1 × 1, 2), and (7 × 7, 2). Therefore, a similar architecture and dataflow to that shown in Fig. 20 has been used for CONV but with the use of two sets of register arrays; with shifting between the registers (which is shown in Fig. 20, Set-1), and without shifting between the registers (Set-2). The CONV primitive with 3 × 3 kernel and stride of 1 uses Set-1 register array, while Set-2 is used with (1 × 1, 1), (1 × 1, 2), and (7 × 7, 2) configurations. In CONV primitive with Set-2, the input pixels are fed from the input pixel buffers into the corresponding registers without shifting, and then to MAC units. The skipped input pixels in (1 × 1, 2) configuration are not stored to the input pixel buffers. On the other hand, the (7 × 7, 2) configuration of the kernel and stride sizes is retained as the (1 × 1, 1) case while transferring repeated input pixels into the input pixel buffers and rearranging their storage patterns. The CONV primitive also takes care of zero-paddings for different (kernel, stride) size configurations.

The loop unrolling and tiling techniques in [83] have also been employed to accelerate CONV primitive with a uniform mapping of PEs to all ResNets CONV layers. However, designing of efficient CNN modules is not enough, as the memory accesses and data movements between these modules must also be minimized. Therefore, the authors have designed a layer-by-layer computation flow. The global control logic is responsible for governing the sequential operations of primitives and their dataflow through predefined and preloaded layered-based execution flowchart, as shown in Fig. 29. In addition, it has been modeled to reconfigure ResNet primitives according to the parameters of each layer during runtime. For instance, it maps a particular number of PEs to CONV layer based on loop unrolling parameters as well as it controls the selection of register array type (Set-1 or Set-2) based on CONV (kernel, stride) parameters.
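A highly simplified software analogue of this control scheme is sketched below (illustrative only; the configuration fields and the layer table are invented): a preloaded per-layer record selects the primitive, its (kernel, stride) mode, and the register-array type, and a small loop walks the layers in order.

# Sketch of a preloaded layer-by-layer execution table driving reusable primitives.
LAYER_TABLE = [
    {"prim": "CONV", "kernel": 7, "stride": 2, "pe_cols": 16},
    {"prim": "POOL", "window": 3, "stride": 2},
    {"prim": "CONV", "kernel": 3, "stride": 1, "pe_cols": 16},
    {"prim": "ELTWISE"},
    {"prim": "FC", "outputs": 1000},
]

def run_network(layer_table):
    for idx, cfg in enumerate(layer_table):
        if cfg["prim"] == "CONV":
            # Register-array Set-1 (with shifting) only for the 3x3, stride-1 case.
            mode = "Set-1" if (cfg["kernel"], cfg["stride"]) == (3, 1) else "Set-2"
            print(f"layer {idx}: CONV {cfg['kernel']}x{cfg['kernel']}/s{cfg['stride']} "
                  f"using {mode}, {cfg['pe_cols']} PE columns")
        else:
            print(f"layer {idx}: {cfg['prim']} with config {cfg}")

run_network(LAYER_TABLE)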
On the other hand, a custom DMA manager has been designed to control the operations of DMA. Note that the DMA is responsible for transferring the input FM pixels, weights, and output FM pixels between off-chip memory and on-chip buffers. Unlike ALAMO architecture [168] where the output pixels are only stored in on-chip buffers, this work as well as the work discussed in [83] store the output pixels in off-chip memory with the use of loop tiling technique in order to have a flexible architecture that can process large-scale CNNs. The dual weight buffers technique has not been used in this work due to the current trend in CNNs where either
this consists of determining the number of convolution layers, number of fully connected layers, sizes of feature maps in each layer, along with other operators. Recent research has demonstrated that a large number of weights in fully connected layers could be eliminated with minimal impact on accuracy. In addition, although the suggested CNN structures by experts perform well for various applications, the question arises whether the suggested structures could be optimized for performance with minimal impact on accuracy. Since the designed CNN has a significant impact on the complexity of its implementation, we review in this section some approaches attempting to optimize the design of CNNs using metaheuristics.

NP-hard combinatorial optimization problems [206] appear in the design of CNNs. Some examples of areas include design of CNN structures, selection of weights and bias values to improve accuracy, and determination of optimal values of variables to reduce run-time. Below, we briefly touch upon some existing literature in these areas.
A. CNN STRUCTURE OPTIMIZATION
In the design of CNNs, the number of possible network structures increases exponentially with the number of layers. Xie and Yuille [207] used genetic algorithm in learning deep network structures. The objective was to find the best CNN structure that would minimize the error rate. The cost function was the CNN accuracy. They proposed an elegant encoding of chromosome using a fixed length binary string to represent each network structure. A CNN string represents only the convolution layers.

In each generation, using standard genetic operations new individuals are generated and weak ones eliminated. The quality of an individual was assessed by its recognition accuracy which is obtained via the time consuming operation of training the network, and evaluating it on a validation set. Two small data sets were used (MNIST and CIFAR-10) to run the genetic implementation via which they demonstrated the discovery of new structures.
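The fixed-length binary encoding and the basic genetic operations can be sketched as follows (generic Python illustration; the string length, population size, mutation rate, and the fitness stub are placeholders, whereas in [207] the fitness is the validation accuracy obtained after training the decoded network):

import random

random.seed(1)
STRING_LEN = 20            # fixed-length binary chromosome encoding the CONV part

def random_individual():
    return [random.randint(0, 1) for _ in range(STRING_LEN)]

def crossover(a, b):
    cut = random.randrange(1, STRING_LEN)
    return a[:cut] + b[cut:]

def mutate(ind, rate=0.05):
    return [bit ^ 1 if random.random() < rate else bit for bit in ind]

def fitness(ind):
    # Placeholder: the real cost is recognition accuracy obtained by training
    # the decoded CNN and evaluating it on a validation set.
    return sum(ind) / STRING_LEN

population = [random_individual() for _ in range(8)]
for generation in range(5):
    population.sort(key=fitness, reverse=True)
    parents = population[:4]                      # weak individuals are eliminated
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(4)]
    population = parents + children
best = max(population, key=fitness)
print("best fitness after 5 generations:", round(fitness(best), 2))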
B. CNN WEIGHTS AND BIAS VALUES OPTIMIZATION
An attempt to train CNNs using metaheuristics (that is, determine weights and bias values) is presented in [208]. The objective again was to improve accuracy and minimize the estimated error. The authors experiment with three metaheuristic algorithms, namely, simulated annealing, differential evolution, and harmony search. The algorithms compute the values of weights and bias in the last layer. These values are used as the solution vector denoted by x which is to be optimized. The move comprised adding a small value of Δx to perturb the state. The cost function y is modeled as

y = \frac{1}{2} \left( \frac{\sum_{i=n}^{N} (o - u)^{2}}{N} \right)^{0.5} \qquad (4)

where o is the expected output, u is the real output, and N is the number of used samples. The stopping criterion is when the iteration count is reached or when the cost function goes below a pre-specified value.
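A compact sketch of the simulated annealing variant is given below (illustrative Python; the tiny network, perturbation range, and cooling schedule are invented, but the cost follows Eq. (4)):

import math, random

random.seed(2)
samples = [([0.0, 1.0], 1.0), ([1.0, 0.0], 0.0), ([1.0, 1.0], 1.0)]  # (inputs, expected o)

def cost(x):                       # Eq. (4): y = 0.5 * sqrt(sum((o - u)^2) / N)
    err = 0.0
    for inp, o in samples:
        u = x[0] * inp[0] + x[1] * inp[1] + x[2]      # last-layer weights + bias
        err += (o - u) ** 2
    return 0.5 * math.sqrt(err / len(samples))

x = [random.uniform(-1, 1) for _ in range(3)]          # solution vector (weights, bias)
temperature, y = 1.0, cost(x)
for it in range(2000):
    candidate = [xi + random.uniform(-0.1, 0.1) for xi in x]   # perturb by a small delta
    y_new = cost(candidate)
    if y_new < y or random.random() < math.exp((y - y_new) / temperature):
        x, y = candidate, y_new                        # accept better (or occasionally worse) moves
    temperature *= 0.995                               # cooling schedule
print("final cost:", round(y, 4), "weights:", [round(v, 2) for v in x])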
C. CNN DESIGN VARIABLES OPTIMIZATION
Suda et al. [80] presented a systematic methodology for design space exploration with the objective of maximizing the throughput of an OpenCL-based FPGA accelerator for a given CNN model (please see subsection III-C). FPGA resource constraints such as on-chip memory, registers, computational resources and external memory bandwidth are considered. The optimization problem comprises finding the best combination of NCONV, SCONV, NNORM, NPOOL, and NFC variables, where
• NCONV is the size of the filter (or neuron or kernel);
• SCONV is the factor by which computational resources are vectorized to execute in a single-instruction stream multiple-data streams (SIMD) fashion;
• NNORM represents the number of normalization operations performed in a single cycle;
• NPOOL is the number of parallel outputs of the pooling layer in a single cycle to achieve acceleration; and,
• NFC is the number of parallel multiply and accumulate (MAC) operations performed in a single work-item within the fully connected layer.

The objective function to be minimized is the run-time (RT), and is given by

RT = \sum_{i=0}^{T_L} RT_i \left[ N_{CONV}, S_{CONV}, N_{NORM}, N_{POOL}, N_{FC} \right] \qquad (5)

subject to digital signal processing (DSP) slices, logic, and memory constraints, where T_L represents the total number of CNN layers including the repeated layers. The convolution layer run-time (RT_CONV) is analytically modeled as a function of design variables as

RT_{CONV_i} = \frac{\text{\# of Convolution Ops}_i}{N_{CONV} \times S_{CONV} \times \text{Frequency}} \qquad (6)

As for the other layers, that is, normalization, pooling, and fully connected, the following general model is proposed

RT_{Layer_i} = \frac{\text{\# of Layer Ops}_i}{\text{Unroll factor} \times \text{Frequency}} \qquad (7)

The above analytical models are later validated by performing full synthesis at selective points and running them on the FPGA accelerator.
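The analytical model of Eqs. (5)-(7) translates directly into a few lines of Python; the operation counts and clock frequency below are placeholders used only to show how a candidate combination of the design variables would be scored:

# Score a candidate design point using the run-time model of Eqs. (5)-(7).
FREQ_HZ = 200e6                                   # assumed accelerator clock

def layer_runtime(layer, p):
    if layer["type"] == "conv":                   # Eq. (6)
        return layer["ops"] / (p["N_CONV"] * p["S_CONV"] * FREQ_HZ)
    unroll = {"norm": "N_NORM", "pool": "N_POOL", "fc": "N_FC"}[layer["type"]]
    return layer["ops"] / (p[unroll] * FREQ_HZ)   # Eq. (7)

def total_runtime(layers, p):                     # Eq. (5): sum over all T_L layers
    return sum(layer_runtime(layer, p) for layer in layers)

# Illustrative operation counts (not measured numbers).
layers = [{"type": "conv", "ops": 210e6}, {"type": "pool", "ops": 0.6e6},
          {"type": "conv", "ops": 450e6}, {"type": "fc", "ops": 75e6}]
point = {"N_CONV": 64, "S_CONV": 8, "N_NORM": 2, "N_POOL": 4, "N_FC": 16}
print(f"estimated run-time: {total_runtime(layers, point) * 1e3:.3f} ms")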
Clearly, in order to determine the best values of the discussed design variables, exhaustive search, especially if the number of variables and/or FPGA resources is large, is infeasible. We have to resort to iterative non-deterministic heuristics [206] such as simulated annealing, simulated evolution, tabu search, genetic algorithm, particle swarm optimization, cuckoo search, etc., or any of the modern metaheuristics, to efficiently traverse the search space to find acceptable solutions.

The proposed methodology employing genetic algorithm was demonstrated by optimizing the implementation of two representative CNNs, AlexNet and VGG, on two Altera Stratix-V FPGA platforms, DE5-Net and P395-D8 boards, both of which have different hardware resources. Peak performance is achieved for both, for the convolution operations, and for the entire CNN network.

One major issue related to use of non-deterministic iterative heuristics in the design of neural networks and CNNs is the large amount of memory required to store the state of solution and the amount of time taken to determine the cost of the solution, be it accuracy/error estimation, run-time, or any other objective. Reasonable estimation techniques and analytical formulations are required to efficiently traverse the design space in search of efficient solutions.
V. SUMMARY AND RECOMMENDATIONS
In this section, we highlight the key features discussed in the acceleration of convolutional neural networks (CNNs) implemented on FPGAs, and provide recommendations to enhance the effectiveness of employing FPGAs in the acceleration of CNNs.

All reviewed techniques are centered around accelerating the convolution (CONV) operation as it consumes around 90% of the computational time. This is achieved by utilizing parallel multiply-accumulate operations bounded by resource limitations. In addition, careful design of data access patterns is targeted to minimize the memory bandwidth requirements utilizing internal memory structures and maximizing data reuse. This is crucial in the acceleration process due to the large memory data that needs to be accessed including feature maps (FMs) and weights. To minimize the memory footprint and to achieve effective utilization of resources, some techniques optimize the number of bits used to represent the feature maps and weights with minimal impact on accuracy. This is combined with the optimized selection of the number of fraction bits used for each layer. Other techniques optimize the number of used weights in the fully connected (FC) layers as they are memory-intensive. Coprocessors are also employed to automatically configure both the software and the hardware elements to fully exploit parallelism [100].
To optimize parallelization of convolution operations, several approaches have been attempted. Work load analysis has been tried to determine computations that can be structured as parallel streams [132]. The roofline model based accelerator uses polyhedral-based data dependence analysis to find the optimal unrolling factor for every convolutional layer [150], and to fully utilize all FPGA computational resources through loop pipelining. To optimize performance, tiled matrix multiplication is structured as a pipelined binary adder tree for performing multiplication and generating partial sums [198]. An optimization framework has been proposed by Suda et al. [80] who identified the key variables of the design and optimized them to maximize parallelism.

To reduce computational complexity of CONV layers and improve resource efficiency, a number of approaches such as [184], [188], and [189] utilized Winograd transformation in performing CONV operations as this reduces the computational complexity by around 50%.

To maximize throughput, several techniques such as [165], [170], and [192] have used multiple CONV layer processors (CLPs) instead of using a single CLP that is optimized for all CONV layers. This pipelines the operation of the multiple CLPs achieving layer-level parallelism which maximizes resource utilization and enhances performance in comparison to using a single CLP.

Since the computational requirement of FC layers is significantly less than that of CONV layers, to improve performance, and maximize resource utilization, a number of techniques such as [153], [162], [188], and [189] create batches by grouping different input FMs and processing them together in FC layers.

Complex access patterns and data locality are used in DeepBurning tool [155] for better data reuse.

Wang et al. [197] explored hot spots profiling to determine the computational parts that need to be accelerated to improve the performance. Acceleration is accomplished by reducing the memory bandwidth requirements. Techniques proposed exploit data reuse to reduce off-chip memory communications. Loop transformations have also been used by reducing tiling parameters to improve data locality, and to reduce redundant communication operations to maximize the data sharing/reuse.
Efficient buffering, where the weight buffers are used to ensure the availability of CONV and FC layers' weights before their computation, as well as to overlap the transfer of FC layer weights with its computation, helps in improving performance [78], [168]. In the Catapult project, FPGA boards were integrated into data center applications and achieved speedup. Microsoft Research's Catapult utilized multi-banked input buffer and kernel weight buffer to provide an efficient buffering scheme of feature maps and weights, respectively. To minimize the off-chip memory traffic, a specialized network on-chip was designed to re-distribute the output feature maps on the multi-banked input buffer instead of transferring them to the external memory [152].

To further reduce memory footprint and bandwidth requirement, optimal fractional length for weights and feature maps in each layer are used. Singular value decomposition (SVD) has also been applied to the weight matrix of FC layer in order to reduce memory footprint at this layer [98]. Tiling techniques have been proposed where large-scale input data is partitioned into small subsets or tiles whose size is configured to leverage the trade-off between the hardware cost and the speedup [197].
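As a quick illustration of the SVD idea (generic NumPy, not the implementation in [98]), keeping only the top-k singular vectors replaces one large FC weight matrix by two thin factors, reducing the parameters that must be stored and streamed:

import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024)).astype(np.float32)   # an FC weight matrix

k = 128                                                     # retained singular vectors
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :k] * s[:k]                                        # 1024 x k
B = Vt[:k, :]                                               # k x 1024

original = W.size
compressed = A.size + B.size
print(f"parameters: {original:,} -> {compressed:,} "
      f"({original / compressed:.1f}x smaller)")

x = rng.standard_normal(1024).astype(np.float32)
full = W @ x
approx = A @ (B @ x)                                        # FC layer becomes two thin matmuls
print("relative error on one activation vector:",
      round(float(np.linalg.norm(full - approx) / np.linalg.norm(full)), 3))

A randomly generated matrix compresses poorly, as the reported error shows; trained FC weight matrices are far more redundant, which is why [98] can apply SVD with minimal accuracy loss.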
Automation tools have been developed that automatically build neural networks with optimized performance [155]. They employ pre-constructed register transfer level (RTL) module library that holds hardware (including logical and arithmetic operations) and configuration scripts. DeepBurning, for example, generates the hardware description for neural network scripts. Another modularized RTL compiler, ALAMO, integrates both the RTL finer level optimization and the flexibility of high-level synthesis (HLS) to generate efficient Verilog parameterized RTL scripts for ASIC or FPGA platform under the available number of parallel computing resources (i.e., the number of multipliers) [78], [168]. Acceleration is achieved by employing loop unrolling technique for CONV layer operations. Some of the reviewed techniques also help minimize the size of FPGA on-chip memories to optimize energy and area usage [146], [147].

TABLE 4. Optimization mechanisms employed for FPGA-based acceleration of deep learning networks.

TABLE 5. Optimization mechanisms employed for FPGA-based acceleration of deep learning networks.

In Table 4 and Table 5, we list the optimization mechanisms utilized by each of the reviewed techniques to maximize performance and throughput of FPGA-based deep learning networks.

To enhance utilization of FPGAs in CNNs acceleration and to maximize their effectiveness, we recommend the
development of a framework that includes a user-friendly interface that allows the user to easily specify the CNN model to be accelerated. This includes specifying the CNN model parameters in terms of number of convolution layers and their sizes, and number of fully connected layers along with other intermediate operations. The specified CNN model weights will be read from a file. In addition, the user should have the option of specifying the FPGA platform that will be used for implementing the CNN accelerator and the maximum tolerable error, along with the selection of a library from a set of applications to be used for model optimization and evaluation. The framework then should perform optimizations to find the minimum number of bits that need to be used for representing the weights and feature maps and the number of fraction bits to be used for each layer. In addition, optimization of fully connected layers is performed to minimize the memory requirements. All such optimizations are carried out bounded by the maximum error specified by the user for the specified application library.
The framework should be designed based on the development of a scalable hardware architecture that works for any given FPGA platform and achieves higher speedup with the availability of higher resources. Based on the available resources, specified by the FPGA platform, the tool will perform optimizations to maximize parallelism and data reuse, given the resource limitations. The tool will then automatically generate the CNN model that will fit on the given FPGA platform and will allow the user to evaluate the performance based on the chosen application library. This will allow the user to evaluate the performance gains while evaluating different FPGA platforms with different resources. The tool should have the option to generate performance measures based on different performance metrics as selected by the user such as number of frames processed per second or number of operations performed per second. In addition, the tool will report other design metrics such as resource utilization, memory sizes and bandwidth, and power dissipation.
Furthermore, it is desired to have the option for the user to specify the desired performance for a given CNN model and have the tool perform necessary analysis and evaluation and recommend to the user candidate FPGA platforms for achieving the desired performance levels. This will require the development of reasonably accurate analytical models that will estimate the needed resources for achieving the desired performance. The user can then choose the recommended FPGA platform and perform complete evaluation to verify that the desired performance levels are met.
VI. CONCLUSION
In this paper, we reviewed recent developments in the area of acceleration of deep learning networks and, in particular, convolution neural networks (CNNs) on field programmable gate arrays (FPGAs). The paper begins with a brief overview of deep learning techniques highlighting their importance, key operations, and applications. Special emphasis is given to CNNs as they have wide applications in the area of image detection and recognition and require both CPU and memory intensive operations that can be effectively accelerated utilizing FPGA inherent ability to maximize parallelism of operations.

While the paper briefly touches upon the acceleration techniques for deep learning algorithms and CNNs from both software and hardware perspectives, the core of this article has been the review of recent techniques employed in the acceleration of CNNs on FPGAs. A thorough up-to-date review is provided that illustrates the employment of various possibilities and techniques such as exploitation of parallelism utilizing loop tiling and loop unrolling, effective use of internal memory to maximize data reuse, operation pipelining, and effective use of data sizes to minimize memory footprint, and, to optimize FPGA resource utilization. The paper also presented the use of tools for generating register transfer level (RTL) scripts that not only help in automating the design process, but also help in exploring the design space and suggesting efficient hardware. The paper discusses the use of analytics such as: (i) work load analysis in determining the computations that can be parallelized, (ii) optimal loop unrolling factors, (iii) determining access patterns to improve data locality, etc. In addition, a brief review of the use of non-deterministic heuristics in solving NP-hard combinatorial optimization problems in the design and implementation of CNNs has been presented. Finally, the paper summarizes the key features employed by the various FPGA-based CNN acceleration techniques and provides recommendations for enhancing the effectiveness of utilizing FPGAs in CNNs acceleration.

ACKNOWLEDGMENT
The authors would like to thank King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia, for all support. They would also like to thank Dr. Blair P. Bremberg and Ms. Sumaiya Hussain Sadiq for their help in professional English editing of this manuscript.

REFERENCES
[1] Y. Bengio, ''Learning deep architectures for AI,'' Found. Trends Mach. Learn., vol. 2, no. 1, pp. 1–127, 2009.
[2] J. Schmidhuber, ''Deep learning in neural networks: An overview,'' Neural Netw., vol. 61, pp. 85–117, Jan. 2015.
[3] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep Learning, vol. 1. Cambridge, MA, USA: MIT Press, 2016.
[4] L. Zhang, S. Wang, and B. Liu, ''Deep learning for sentiment analysis: A survey,'' in Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. Hoboken, NJ, USA: Wiley, 2018, p. e1253.
[5] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, ''Learning representations by back-propagating errors,'' Nature, vol. 323, no. 6088, p. 533, 1986.
[6] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, ''Learning representations by back-propagating errors,'' in Neurocomputing: Foundations of Research. Cambridge, MA, USA: MIT Press, 1988, pp. 696–699.
[7] M. A. Nielsen, Neural Networks and Deep Learning, vol. 25. Washington, DC, USA: Determination Press, 2015.
[8] T. Weyand, I. Kostrikov, and J. Philbin, ''PlaNet—Photo geolocation with convolutional neural networks,'' in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2016, pp. 37–55.
[9] MathWorks. (2018). What Is Deep Learning? [Online]. Available: https://www.mathworks.com/discovery/deep-learning.html/
[10] Y. LeCun, Y. Bengio, and G. Hinton, ‘‘Deep learning,’’ Nature, vol. 521, [35] U. Muller, J. Ben, E. Cosatto, B. Flepp, and Y. L. Cun, ‘‘Off-road
no. 7553, p. 436, 2015. obstacle avoidance through end-to-end learning,’’ in Proc. Adv. Neural
[11] A. Deshpande. (2018). A Beginner’s Guide To Understanding Convolu- Inf. Process. Syst., 2006, pp. 739–746.
tional Neural Networks. https://adeshpande3.github.io/A-Beginner%27s- [36] R. Hadsell et al., ‘‘A multi-range vision strategy for autonomous offroad
Guide-To-Understanding-Convolutional-Neural-Networks/ navigation,’’ in Proc. Robot. Appl. (RA), vol. 1, no. 7, 2007, pp. 457–463.
[12] J. E. Dayhoff, Neural Network Architectures: An Introduction. New York, [37] P. Sermanet et al., ‘‘A multirange architecture for collision-free off-road
NY, USA: Van Nostrand Reinhold, 1990. robot navigation,’’ J. Field Robot., vol. 26, no. 1, pp. 52–87, 2009.
[13] Y. LeCun and Y. Bengio, ‘‘Convolutional networks for images, speech, [38] B. Blanco-Filgueira, D. García-Lesta, M. Fernández-Sanjurjo,
and time series,’’ in The Handbook of Brain Theory and Neural Networks, V. M. Brea, and P. López. (2018). ‘‘Deep learning-based multiple object
vol. 3361, no. 10. Cambridge, MA, USA: MIT Press, 1995. visual tracking on embedded system for IoT and mobile edge computing
[14] J. Hauswald et al., ‘‘DjiNN and Tonic: DNN as a service and its impli- applications.’’ [Online]. Available: https://arxiv.org/abs/1808.01356
cations for future warehouse scale computers,’’ ACM SIGARCH Comput. [39] P. D. McNelis, Neural Networks in Finance: Gaining Predictive Edge in
Archit. News, vol. 43, no. 3, pp. 27–40, 2015. the Market. New York, NY, USA: Academic, 2005.
[15] J. Y.-H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, [40] P. J. G. Lisboa and E. C. Ifeachor, Artificial Neural Networks in
and G. Toderici, ‘‘Beyond short snippets: Deep networks for video classi- Biomedicine. London, U.K.: Springer, 2000.
fication,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, [41] P. W. Mirowski, Y. LeCun, D. Madhavan, and R. Kuzniecky, ‘‘Comparing
pp. 4694–4702. SVM and convolutional networks for epileptic seizure prediction from
[16] Y. LeCun et al., ‘‘Handwritten digit recognition with a back-propagation intracranial EEG,’’ in Proc. IEEE Workshop Mach. Learn. Signal Process.
network,’’ in Proc. Adv. Neural Inf. Process. Syst., 1990, pp. 396–404. (MLSP), Oct. 2008, pp. 244–249.
[17] P. Barros, S. Magg, C. Weber, and S. Wermter, ‘‘A multichannel convo- [42] G. E. Dahl, T. N. Sainath, and G. E. Hinton, ‘‘Improving deep neural
lutional neural network for hand posture recognition,’’ in Proc. Int. Conf. networks for LVCSR using rectified linear units and dropout,’’ in Proc.
Artif. Neural Netw. Cham, Switzerland: Springer 2014, pp. 403–410. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2013,
[18] A. Graves, A.-R. Mohamed, and G. Hinton, ‘‘Speech recognition with pp. 8609–8613.
deep recurrent neural networks,’’ in Proc. IEEE Int. Conf. Acoust., Speech [43] R. Hadsell et al., ‘‘Learning long-range vision for autonomous off-road
Signal Process. (ICASSP), May 2013, pp. 6645–6649. driving,’’ J. Field Robot., vol. 26, no. 2, pp. 120–144, Feb. 2009.
[19] P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, and [44] L. Deng and D. Yu, ‘‘Deep learning: Methods and applications,’’
L. Heck, ‘‘Learning deep structured semantic models for web search Found. Trends Signal Process., vol. 7, nos. 3–4, pp. 197–387,
using clickthrough data,’’ in Proc. 22nd ACM Int. Conf. Conf. Inf. Knowl. Jun. 2014.
Manage., 2013, pp. 2333–2338. [45] R. Girshick, J. Donahue, T. Darrell, and J. Malik, ‘‘Rich feature hierar-
[20] O. Abdel-Hamid, A.-R. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, chies for accurate object detection and semantic segmentation,’’ in Proc.
‘‘Convolutional neural networks for speech recognition,’’ IEEE/ACM IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 580–587.
Trans. Audio, Speech Lang. Process., vol. 22, no. 10, pp. 1533–1545, [46] B. Wu, F. Iandola, P. H. Jin, and K. Keutzer, ‘‘SqueezeDet: Unified,
Oct. 2015. small, low power fully convolutional neural networks for real-time object
[21] P. Y. Simard, D. Steinkraus, and J. C. Platt, ‘‘Best practices for convolu- detection for autonomous driving,’’ in Proc. CVPR Workshops, 2017,
tional neural networks applied to visual document analysis,’’ in Proc. 7th pp. 446–454.
Int. Conf. Document Anal. Recognit., Aug. 2003, pp. 958–963. [47] K. He, X. Zhang, S. Ren, and J. Sun, ‘‘Delving deep into rectifiers:
[22] S. Lai, L. Xu, K. Liu, and J. Zhao, ‘‘Recurrent convolutional neu- Surpassing human-level performance on ImageNet classification,’’ in
ral networks for text classification,’’ in Proc. AAAI, vol. 333, 2015, Proc. IEEE Int. Conf. Comput. Vis., Jun. 2015, pp. 1026–1034.
pp. 2267–2273. [48] M. D. Zeiler and R. Fergus, ‘‘Visualizing and understanding convolu-
[23] Y. Kim. (2014). ‘‘Convolutional neural networks for sentence classifica- tional networks,’’ in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland:
tion.’’ [Online]. Available: https://arxiv.org/abs/1408.5882 Springer, 2014, pp. 818–833.
[24] R. Collobert and J. Weston, ‘‘A unified architecture for natural language [49] K. He, X. Zhang, S. Ren, and J. Sun, ‘‘Deep residual learning for
processing: deep neural networks with multitask learning,’’ in Proc. 25th image recognition,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
Int. Conf. Mach. Learn., 2008, pp. 160–167. Jun. 2016, pp. 770–778.
[25] R. Sarikaya, G. E. Hinton, and A. Deoras, ‘‘Application of deep belief [50] Image-Net. (2018). The ImageNet Large Scale Visual Recognition
networks for natural language understanding,’’ IEEE/ACM Trans. Audio, Challenge (ILSVRC). [Online]. Available: http://image-net.org/
Speech, Lang. Process., vol. 22, no. 4, pp. 778–784, Apr. 2014. challenges/LSVRC/
[26] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and [51] A. Mohamed, G. E. Dahl, and G. Hinton, ‘‘Acoustic modeling using deep
L. Fei-Fei, ‘‘Large-scale video classification with convolutional neu- belief networks,’’ IEEE Trans. Audio, Speech, Language Process., vol. 20,
ral networks,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., no. 1, pp. 14–22, Jan. 2012.
Jun. 2014, pp. 1725–1732. [52] O. Nomura and T. Morie, ‘‘Projection-field-type VLSI convolutional
[27] J. Mutch and D. G. Lowe, ‘‘Multiclass object recognition with sparse, neural networks using merged/mixed analog-digital approach,’’ in Proc.
localized features,’’ in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Int. Conf. Neural Inf. Process. Berlin, Germany: Springer, 2007,
Pattern Recognit., vol. 1, Jun. 2006, pp. 11–18. pp. 1081–1090.
[28] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ‘‘ImageNet classification [53] T. M. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman, ‘‘Project
with deep convolutional neural networks,’’ in Proc. Adv. Neural Inf. Adam: Building an efficient and scalable deep learning training system,’’
Process. Syst., 2012, pp. 1097–1105. in Proc. OSDI, vol. 14, 2014, pp. 571–582.
[29] K. Simonyan and A. Zisserman. (2014). ‘‘Very deep convolutional [54] Y. LeCun et al., ‘‘Backpropagation applied to handwritten zip code
networks for large-scale image recognition.’’ [Online]. Available: recognition,’’ Neural Comput., vol. 1, no. 4, pp. 541–551, 1989.
https://arxiv.org/abs/1409.1556 [55] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, ‘‘Optimizing
[30] O. Russakovsky et al., ‘‘ImageNet large scale visual recognition chal- FPGA-based accelerator design for deep convolutional neural networks,’’
lenge,’’ Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, Dec. 2015. in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, 2015,
[31] C. Szegedy et al. (Sep. 2015). ‘‘Going deeper with convolutions.’’ pp. 161–170.
[Online]. Available: https://arxiv.org/abs/1409.4842 [56] A. Yazdanbakhsh, J. Park, H. Sharma, P. Lotfi-Kamran, and
[32] S. Ren, K. He, R. Girshick, and J. Sun, ‘‘Faster R-CNN: Towards real-time H. Esmaeilzadeh, ‘‘Neural acceleration for GPU throughput processors,’’
object detection with region proposal networks,’’ in Proc. Adv. Neural Inf. in Proc. 48th Int. Symp. Microarchitecture, 2015, pp. 482–493.
Process. Syst., 2015, pp. 91–99. [57] G. Hinton et al., ‘‘Deep neural networks for acoustic modeling in speech
[33] K. Korekado, T. Morie, O. Nomura, T. Nakano, M. Matsugu, and recognition: The shared views of four research groups,’’ IEEE Signal
A. Iwata, ‘‘An image filtering processor for face/object recognition using Process. Mag., vol. 29, no. 6, pp. 82–97, Nov. 2012.
merged/mixed analog-digital architecture,’’ in Symp. VLSI Circuits Dig. [58] Y. Jia et al., ‘‘Caffe: Convolutional architecture for fast feature embed-
Tech. Papers, 2005, pp. 220–223. ding,’’ in Proc. 22nd ACM Int. Conf. Multimedia, 2014, pp. 675–678.
[34] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua, ‘‘A convolutional neural [59] A. Vasudevan, A. Anderson, and D. Gregg, ‘‘Parallel multi channel
network cascade for face detection,’’ in Proc. IEEE Conf. Comput. Vis. convolution using general matrix multiplication,’’ in Proc. IEEE 28th Int.
Pattern Recognit., Jun. 2015, pp. 5325–5334. Conf. Appl.-Specific Syst., Archit. Process. (ASAP), Jul. 2017, pp. 19–24.
[60] K. Guo et al., ‘‘Angel-eye: A complete design flow for mapping CNN [83] Y. Ma, Y. Cao, S. Vrudhula, and J.-S. Seo, ‘‘Optimizing loop operation
onto embedded FPGA,’’ IEEE Trans. Comput.-Aided Design Integr. Cir- and dataflow in FPGA acceleration of deep convolutional neural net-
cuits Syst., vol. 37, no. 1, pp. 35–47, Jan. 2018. works,’’ in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays,
[61] E. Nurvitadhi et al., ‘‘Can FPGAs beat GPUs in accelerating next- 2017, pp. 45–54.
generation deep neural networks?’’ in Proc. ACM/SIGDA Int. Symp. [84] A. Karpathy. (2018). Convolutional Neural Networks for Visual
Field-Program. Gate Arrays, 2017, pp. 5–14. Recognition. [Online]. Available: http://cs231n.github.io/convolutional-
[62] J. Misra and I. Saha, ‘‘Artificial neural networks in hardware: A sur- networks/
vey of two decades of progress,’’ Neurocomputing, vol. 74, nos. 1–3, [85] K. He, X. Zhang, S. Ren, and J. Sun, ‘‘Identity mappings in deep residual
pp. 239–255, 2010. networks,’’ in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer,
[63] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, ‘‘Neural 2016, pp. 630–645.
acceleration for general-purpose approximate programs,’’ in Proc. [86] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, ‘‘Inception-v4,
45th Annu. IEEE/ACM Int. Symp. Microarchitecture, Dec. 2012, inception-resnet and the impact of residual connections on learning,’’ in
pp. 449–460. Proc. AAAI, vol. 4, 2017, p. 12.
[64] S. Han et al., ‘‘EIE: Efficient inference engine on compressed deep [87] J. Villasenor and W. H. Mangione-Smith, ‘‘Configurable computing,’’
neural network,’’ in Proc. ACM/IEEE 43rd Annu. Int. Symp. Comput. Sci. Amer., vol. 276, no. 6, pp. 66–71, 1997.
Archit. (ISCA), Jun. 2016, pp. 243–254. [88] S. D. Brown, R. J. Francis, J. Rose, and Z. G. Vranesic, Field-
AHMAD SHAWAHNA received the B.Sc. degree in computer engineering from An-Najah National University, Palestine, in 2012, and the M.S. degree in computer engineering from the King Fahd University of Petroleum and Minerals (KFUPM), Saudi Arabia, in 2016, where he is currently pursuing the Ph.D. degree with the Department of Computer Engineering. He is currently with the Center for Communications and IT Research, KFUPM. His research interests include hardware accelerators, deep learning, convolutional neural networks, field-programmable gate arrays, wireless security, network security, the Internet of Things, and cloud computing.

SADIQ M. SAIT was born in Bengaluru, India. He received the bachelor's degree in electronics engineering from Bangalore University, in 1981, and the master's and Ph.D. degrees in electrical engineering from the King Fahd University of Petroleum and Minerals (KFUPM), in 1983 and 1987, respectively. He is currently a Professor of computer engineering and the Director of the Center for Communications and IT Research, KFUPM. He has authored over 200 research papers, has contributed chapters to technical books, and has lectured in over 25 countries. He is the principal author of two books. He is a Senior Member of the IEEE. In 1981, he received the Best Electronic Engineer Award from the Indian Institute of Electrical Engineers, Bengaluru.

AIMAN EL-MALEH received the B.Sc. degree (Hons.) in computer engineering from the King Fahd University of Petroleum and Minerals (KFUPM), in 1989, the M.A.Sc. degree in electrical engineering from the University of Victoria, Canada, in 1991, and the Ph.D. degree in electrical engineering, with dean's honor list, from McGill University, Canada, in 1995. He is currently a Professor with the Computer Engineering Department, KFUPM. He was a Member of Scientific Staff with Mentor Graphics Corporation and the Leader in design automation, from 1995 to 1998. He holds five U.S. patents. His research interests include synthesis, testing, and verification of digital systems, defect and soft-error tolerance design, VLSI design, design automation, and efficient FPGA implementations of deep learning algorithms and data compression techniques. He received the Best Paper Award for the most outstanding contribution to the field of test at the 1995 European Design and Test Conference, the Excellence in Teaching Award from KFUPM in 2001 and 2002, in 2006 and 2007, and in 2011 and 2012, the Excellence in Advising Award from KFUPM in 2013 and 2014 and in 2017 and 2018, the Excellence in Research Award from KFUPM in 2010 and 2011 and in 2015 and 2016, and the First Instructional Technology Award from KFUPM in 2009 and 2010. His paper presented at the 1995 Design Automation Conference was also nominated for the Best Paper Award.