Computer Vision Algorithms and Hardware Implementations: A Survey
Keywords: Computer vision; Hardware accelerator; Deep convolutional neural network; Artificial intelligence

Abstract: The field of computer vision is experiencing a great-leap-forward development today. This paper aims at providing a comprehensive survey of the recent progress on computer vision algorithms and their corresponding hardware implementations. In particular, the prominent achievements in computer vision tasks such as image classification, object detection and image segmentation brought by deep learning techniques are highlighted. On the other hand, a review of techniques for implementing and optimizing deep-learning-based computer vision algorithms on GPUs, FPGAs and other new generations of hardware accelerators is presented to facilitate real-time and/or energy-efficient operations. Finally, several promising directions for future research are presented to motivate further development in the field.
E-mail addresses: xfeng@cqut.edu.cn (X. Feng), naria_petrova@163.com (Y. Jiang), 1064699383@qq.com (X. Yang), duming@dhu.edu.cn (M. Du), xinli.ece@duke.edu (X. Li).
https://doi.org/10.1016/j.vlsi.2019.07.005
Received 11 April 2019; Received in revised form 23 June 2019; Accepted 27 July 2019
© 2019 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
from ImageNet and the modern central processing units (CPUs) and graphics processing units (GPUs), methods based on deep neural networks (DNNs) achieve state-of-the-art performance and bring an unprecedented development of computer vision in both algorithms and hardware implementations. In recent years, the CNN has become the de-facto standard computation framework in computer vision. A number of deeper and more complicated networks have been developed to make CNNs deliver near-human accuracy in many computer vision applications, such as classification, detection and segmentation. The high accuracy, however, comes at the price of large computational cost. As a result, dedicated hardware platforms, from general-purpose GPUs to application-specific processors, are investigated to optimize for DNN-based workloads.

In this paper, we look into this rapid evolution of the computer vision field by presenting a brief survey on the key algorithms that enable computer systems to perceive and the underlying hardware platforms that make these algorithms applicable. In particular, we discuss how the recent DNN algorithms accomplish the computer vision tasks (i.e. image classification, object detection and image segmentation) with high perception accuracy, and summarize the notable hardware units, including GPUs, field-programmable gate arrays (FPGAs) and other advanced mobile hardware platforms, that are adapted or designed to accelerate DNN-based computer vision algorithms. To our knowledge, there are recent summaries in the literature that discuss DNN-based algorithms for particular tasks, including image classification [7], object detection [8] and image segmentation [9], as well as the corresponding hardware accelerators such as FPGAs [10]. There is, however, no comprehensive survey that covers both algorithms and hardware simultaneously. A thorough review of existing works on both topics is essential for researchers to understand the entire picture and motivate further progress in the computer vision field.

The remainder of this paper is organized as follows. In Section 2, we overview the computer vision algorithms for three visual perception tasks: image classification, object detection and image segmentation. Important hardware platforms, including GPUs, FPGAs and other hardware accelerators for implementing the DNN-based algorithms, are discussed in Section 3. Finally, we conclude in Section 4.

2. Computer vision algorithms

2.1. Image classification

Image classification is a biologically primary ability of the human visual perception system. It has been an active task and plays a crucial role in the field of computer vision, which aims to automatically classify images into pre-defined classes. For decades, researchers have developed advanced techniques to improve the classification accuracy. Traditionally, classification models could perform well only on small datasets such as CIFAR-10 [11] and MNIST [12]. The great-leap-forward development of image classification occurred when the large-scale image dataset "ImageNet" was created by Fei-Fei Li in 2009 [6]. It was almost the same time when the well-known deep learning technologies started to show great performance in classification and stepped onto the stage of computer vision.

Before the explosion of deep learning methods, research works put lots of effort into designing scale-invariant features (e.g. SIFT [13], HOG [14], GIST [15]), feature representations (e.g. Bag-of-Features [16], Fisher Kernel [17]) and classifiers (e.g. SVM [18]) for image classification [19,20]. However, these manually crafted features struggle with objects in natural images that exhibit complicated backgrounds and variations in color, texture, illumination, pose and viewpoint. At the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012, AlexNet [21] won the first prize by a significant margin over the second place, which was based on SIFT and Fisher Vectors (FVs) [20]. It demonstrates that a classification model based on a deep CNN performs much more robustly than the conventional methods in the presence of large-scale variations. It also represents a remarkable milestone in the modern history of neural networks after a long trough period.

A typical deep CNN model consists of several convolution layers, each followed by an activation function and a pooling layer, and several fully connected layers before prediction. The deep structure facilitates the filtering mechanism by performing convolutions on multi-scale feature maps, leading to highly abstract and discriminative features. AlexNet has 5 convolution layers, 3 pooling layers and 3 fully connected layers, with a total of 60 million parameters. It successfully uses ReLU as the activation function instead of sigmoid. Furthermore, its data augmentation and dropout are widely used today as efficient learning strategies. AlexNet is hence known as the foundation work of modern deep CNNs.

Inspired by AlexNet, VGGNet [22] and GoogleNet [23] focus on designing deeper networks to further improve accuracy. They were the runner-up and winner of ILSVRC in 2014, respectively. By repeatedly stacking 3×3 convolutional kernels and 2×2 maximum pooling layers, VGGNet successfully constructs a convolutional neural network of 16–19 layers. GoogleNet has 22 layers, but its floating-point operations and number of parameters are much less than those of AlexNet and VGGNet, as it removes the fully-connected layers and optimizes the operations on sparse matrices.

Although deeper networks offer better accuracy, simply increasing the number of layers cannot continuously improve accuracy because gradients vanish or explode during network training. ResNet [24], which makes another great progress in deep network structure, proposes a shortcut connection between residual blocks to make full use of the information from previous layers and preserve the gradients during backward propagation. By using this residual block, ResNet successfully trains very deep networks with up to 152 layers and was the winner of ILSVRC in 2015. Following the idea of ResNet, DenseNet [25] establishes connections between all previous layers and the current layer. It concatenates and, therefore, reuses the features from all previous layers. DenseNet presents a great advantage in classification accuracy on ImageNet, as we can see in Table 1. Based on these works in the literature, connecting different network layers has shown promising improvement in learning representations of deeper networks.
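To make the shortcut connection concrete, the following minimal NumPy sketch implements an identity residual block of the form y = ReLU(x + F(x)). The naive convolution routine, the weight shapes and the random inputs are illustrative assumptions rather than the published ResNet configuration, which also uses batch normalization and projection shortcuts:

```python
import numpy as np

def conv3x3(x, w):
    # Naive "same"-padded 3x3 convolution: x is (C_in, H, W), w is (C_out, C_in, 3, 3).
    c_out = w.shape[0]
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    y = np.zeros((c_out, h, wd))
    for o in range(c_out):
        for i in range(h):
            for j in range(wd):
                y[o, i, j] = np.sum(xp[:, i:i + 3, j:j + 3] * w[o])
    return y

def residual_block(x, w1, w2):
    # y = ReLU(x + F(x)): the identity shortcut lets information (and, during
    # backpropagation, gradients) bypass the convolutional path F entirely.
    f = np.maximum(conv3x3(x, w1), 0.0)   # conv -> ReLU
    f = conv3x3(f, w2)                    # second conv
    return np.maximum(x + f, 0.0)         # add shortcut, then ReLU

x = np.random.randn(8, 16, 16)            # 8 channels, 16x16 feature map
w1 = np.random.randn(8, 8, 3, 3) * 0.1
w2 = np.random.randn(8, 8, 3, 3) * 0.1
print(residual_block(x, w1, w2).shape)    # (8, 16, 16)
```

Because the identity path adds x directly to the output, the gradient with respect to x always contains an identity term, which is what keeps gradients alive through a stack of 152 layers.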
Table 1
Summary of different CNN models on the ImageNet classification task.

| Model | Time | Accuracy | Num. of Parameters | Num. of FLOPs | Num. of Layers |
|---|---|---|---|---|---|
| AlexNet [21] | 2012 | 57.2% | 60 M | 720 M | 8 |
| VGGNet [22] | 2014 | 71.5% | 138 M | 15,300 M | 16 |
| GoogleNet [23] | 2014 | 69.8% | 6.8 M | 1,500 M | 22 |
| ResNet [24] | 2015 | 78.6% | 55 M | 2,300 M | 152 |
| DenseNet [25] | 2017 | 79.2% | 25.6 M | 1,150 M | 190 |
| SENet [26] | 2017 | 82.7% | 145.8 M | 42,300 M | – |
| NASNet [27] | 2018 | 82.7% | 88.9 M | 23,800 M | – |
| SqueezeNet [29] | 2016 | 57.5% | 1.2 M | 833 M | – |
| MobileNet [30] | 2017 | 70.6% | 4.2 M | 569 M | 28 |
| ShuffleNet [31] | 2018 | 73.7% | 4.7 M | 524 M | – |
| ShiftNet-A [32] | 2018 | 70.1% | 4.1 M | 1,400 M | – |
| FE-Net [33] | 2019 | 75.0% | 5.9 M | 563 M | – |

By using ResNet or DenseNet as the major backbone structure, researchers have focused on improving the functionality of neural network blocks. SENet [26], which was the winner of ILSVRC 2017, proposes a "squeeze-and-excitation" (SE) unit that takes the channel relationship into account. It learns to recalibrate channel-wise feature maps by explicitly modeling the interdependencies among channels, which is consequently exploited to enhance informative channels and suppress less useful ones.
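A minimal sketch of this recalibration, assuming a single feature map of shape (C, H, W) and illustrative bottleneck weights w1 and w2 (the published SENet places one such unit inside every residual block):

```python
import numpy as np

def se_block(x, w1, w2):
    # x: (C, H, W). "Squeeze": global average pooling -> per-channel descriptor.
    z = x.mean(axis=(1, 2))                      # (C,)
    # "Excitation": bottleneck MLP + sigmoid yields per-channel weights.
    s = np.maximum(w1 @ z, 0.0)                  # (C/r,) with ReLU
    s = 1.0 / (1.0 + np.exp(-(w2 @ s)))          # (C,) in (0, 1)
    # Recalibrate: scale each channel map by its learned importance.
    return x * s[:, None, None]

c, r = 16, 4                                     # channels, reduction ratio
w1 = np.random.randn(c // r, c) * 0.1
w2 = np.random.randn(c, c // r) * 0.1
y = se_block(np.random.randn(c, 12, 12), w1, w2)
```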
Despite the high classification performance of the aforementioned CNN models, appropriately designing the optimal network structure often requires significant engineering work. NASNet [27] studies a paradigm to learn the optimal convolutional architecture from training data. It adopts a neural architecture search (NAS) framework derived from reinforcement learning [28]. In addition, it designs a new search space to enable network mapping from a proxy dataset (e.g. CIFAR-10) to ImageNet, and a regularization technique for generalization purposes. Offering less computational complexity than SENet, NASNet achieves the state-of-the-art accuracy on ImageNet, as shown in Table 1.
The aforementioned deep networks make classification increasingly accurate. However, in many real-life classification applications, such as robotics, autonomous driving and smartphones, the classification task is highly constrained by the available computational resources. The problem thus becomes pursuing the optimal accuracy subject to a limited computational budget (i.e. memory and/or MFLOPs). Therefore, a set of lightweight networks, such as SqueezeNet [29], MobileNet [30], ShuffleNet [31], ShiftNet [32] and FE-Net [33], has started a wave.
SqueezeNet substitutes most 3×3 filters with 1×1 filters and cuts down the number of input channels for the 3×3 filters to reduce the network complexity. To maximize the accuracy with a limited number of network parameters, it delays the down-sampling operation to avoid information loss in early layers. SqueezeNet is 50× smaller than AlexNet. If combined with deep compression [34], it can even be reduced to be 510× smaller than AlexNet. In MobileNet [30], depth-wise separable convolution is employed to decompose the standard convolution into a depth-wise convolution and a point-wise convolution. The depth-wise convolution performs convolution on each input channel with one filter, while the point-wise convolution combines those separate channels using 1×1 convolutions. This novel design reduces both the computational complexity and the number of parameters.
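The following NumPy sketch contrasts the two stages; the 3×3 depth-wise filters dw and the 1×1 point-wise kernels pw are illustrative placeholders. For a k×k kernel, the factorization costs roughly k²·C_in + C_in·C_out multiplications per output pixel, instead of k²·C_in·C_out for a standard convolution:

```python
import numpy as np

def depthwise_separable_conv(x, dw, pw):
    # x: (C_in, H, W); dw: (C_in, 3, 3), one filter per input channel;
    # pw: (C_out, C_in), the 1x1 kernels that mix channels.
    c_in, h, w = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    # Depth-wise: each channel is convolved independently with its own filter.
    d = np.zeros((c_in, h, w))
    for c in range(c_in):
        for i in range(h):
            for j in range(w):
                d[c, i, j] = np.sum(xp[c, i:i + 3, j:j + 3] * dw[c])
    # Point-wise: a 1x1 convolution combines the separate channels.
    return np.tensordot(pw, d, axes=1)            # (C_out, H, W)

x = np.random.randn(8, 16, 16)
y = depthwise_separable_conv(x, np.random.randn(8, 3, 3), np.random.randn(16, 8))
print(y.shape)  # (16, 16, 16)
```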
ShuffleNet [31] uses point-wise group convolution, which divides the input feature maps into groups and performs convolutions separately on each group, to reduce the computational cost. However, because the grouping operations limit the communication between different channels, ShuffleNet further shuffles the channels and feeds each group in the following layer with channels from multiple different groups, in order to distribute information across channels.
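The shuffle itself is just a reshape-transpose-reshape on the channel axis, as this sketch shows for an illustrative (C, H, W) tensor:

```python
import numpy as np

def channel_shuffle(x, groups):
    # x: (C, H, W) with C divisible by `groups`. Reshape to (g, C/g, H, W),
    # swap the first two axes, and flatten back: channel i of every group
    # becomes adjacent, so the next grouped convolution sees all groups.
    c, h, w = x.shape
    return (x.reshape(groups, c // groups, h, w)
             .transpose(1, 0, 2, 3)
             .reshape(c, h, w))

x = np.arange(8)[:, None, None] * np.ones((8, 2, 2))  # channel c holds value c
print(channel_shuffle(x, 2)[:, 0, 0])                 # [0. 4. 1. 5. 2. 6. 3. 7.]
```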
In addition to the strategies adopted to reduce the computational cost of spatial convolution (e.g., depth-wise convolution), ShiftNet [32] presents a parameter-free, FLOP-free shift operation to replace expensive spatial convolutions. The proposed shift operation provides spatial information communication by shifting feature maps, making it possible to aggregate spatial information in the following point-wise convolutional layer. More recently, FE-Net [33] further finds that only a few shift operations are sufficient to provide spatial information communication. A sparse shift layer (SSL) is proposed to perform shift operations on a small portion of the feature maps only. With only 563 M FLOPs, FE-Net achieves the state-of-the-art performance among all major lightweight classification models on ImageNet, as shown in Table 1.
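A minimal NumPy sketch of the shift operation: each channel is displaced by its assigned offset with zero padding, using no parameters and no multiplications. The offset assignment here is an illustrative placeholder (ShiftNet assigns offsets systematically, and FE-Net's SSL shifts only a sparse subset of channels):

```python
import numpy as np

def shift(x, offsets):
    # x: (C, H, W); offsets[c] = (di, dj) moves channel c spatially, with
    # zeros filling the vacated border. A following 1x1 convolution then
    # mixes the displaced channels to aggregate spatial information.
    y = np.zeros_like(x)
    _, h, w = x.shape
    for c, (di, dj) in enumerate(offsets):
        src = x[c,
                max(0, -di):h - max(0, di),
                max(0, -dj):w - max(0, dj)]
        y[c,
          max(0, di):h - max(0, -di),
          max(0, dj):w - max(0, -dj)] = src
    return y

x = np.random.randn(4, 8, 8)
y = shift(x, [(0, 0), (1, 0), (-1, 0), (0, 1)])  # identity, down, up, right
```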
The aforementioned network models are briefly summarized in Table 1. In addition to the conventional image classification problem with thousands of classes and complex scenes, multi-label classification (e.g. face attributes [35,36]) and fine-grained classification (e.g. Stanford Dogs classification [37]) are also of great interest in the computer vision area.

Furthermore, the great success of deep learning in the image domain has stimulated a variety of techniques to learn robust feature representations for video classification, where semantic contents such as human actions [38] or complex events [39] are automatically categorized. Early works often treat a video clip as a collection of frames. Video classification is implemented by aggregating frame-level CNN features through averaging or encoding [40]. Standard classifiers, such as SVM, are finally used for recognition [41,42].

In contrast to the frame-level classification methods, there are a number of other approaches that apply end-to-end CNN models to learn the hidden spatio-temporal patterns in video. For example, the typical C3D features [43] are derived from a deep 3-D convolutional network trained on the large-scale UCF101 dataset. Moreover, a two-stream approach [44] is proposed to factorize the learning of video representations into spatial and temporal cues separately. Specifically, a spatial CNN is adopted to model the appearance information from RGB frames, while a temporal CNN is used to learn the motion information from the dense optical flow among adjacent frames.

Since the two-stream approach only depicts movements within a short time window and fails to consider the temporal order of different frames, several recurrent connection models for sequential data, including recurrent neural networks (RNNs) and long short-term memory (LSTM) models, are leveraged to model the temporal dynamics of videos. In Ref. [45], two two-layer LSTM networks are trained with features from the two-stream approach for action recognition. In Ref. [46], the LSTM model and CNN model are combined to jointly learn spatial-temporal cues for video classification. In Refs. [47,48], an attention mechanism is introduced for convolutional LSTM models to discover relevant spatio-temporal volumes for video classification.

2.2. Object detection

Object detection, which is to determine and locate the object instances either from a large number of predefined categories in natural images or for a given particular object (e.g., Donald Trump's face, the distorted area in an image, etc.), is another important and challenging task in computer vision. Object detection and image classification share a similar technical challenge: both of them must handle a large number of highly variable objects. However, object detection is more difficult than image classification, as it must identify the accurate localization of the object of interest.

Historically, most research efforts focused on detecting a single category of given objects, such as pedestrians [14,49] and faces [50], by designing a set of appropriate features (e.g. HOG [14,49], Haar-like [50], LBP [51], etc.). In these works, objects are detected by matching a set of predefined feature templates with each location in the image or feature pyramids. Standard classifiers such as SVM [14,49] and Adaboost [50] are often used for this purpose.

In order to build a general-purpose, robust object detection system, the research community has started to develop large-scale, multi-class datasets in recent years. Pascal-VOC 2007 [52] with 20 classes and MS-COCO [53] with 80 object categories are two iconic object detection datasets. In these two datasets, detection results are evaluated by two possible metrics: (i) Average Precision (AP), which counts the correctly detected bounding boxes for which the overlap ratio exceeds 0.5, and (ii) mean Average Precision (mAP), which averages the AP values associated with different thresholds of the overlap ratio.
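The overlap ratio in both metrics is the intersection-over-union (IoU) of a detected box and a ground-truth box. A minimal sketch, assuming corner-format boxes:

```python
def iou(box_a, box_b):
    # Boxes as (x1, y1, x2, y2). IoU = intersection area / union area.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A detection counts as correct when its IoU with the ground truth exceeds
# the threshold (0.5 for Pascal-VOC-style AP).
print(iou((0, 0, 10, 10), (5, 5, 15, 15)) > 0.5)   # False: IoU = 25/175
```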
Recently, deep learning has substantially advanced the object detection field. As shown in Fig. 1, striking improvements in object detection accuracy have been demonstrated over both Pascal-VOC 2007 and MS-COCO by taking advantage of deep learning techniques.

R-CNN [54] was the first two-stage method among the earliest CNN-based generic object detection techniques. It adopts AlexNet to extract a fixed-length feature vector from each resized region proposal, which is the object candidate generated by the selective search algorithm [55]. Each region is then classified by a set of category-specific linear SVMs. The method shows significant improvement in mAP over the traditional state-of-the-art DPM detector [49]. It is, however, inelegant and inefficient, due to its complex multistage pipeline and the redundant CNN feature extraction from numerous region proposals.

Inspired by the spatial pyramid pooling in SPPnet [56], which leverages fixed-length feature outputs for arbitrary input image sizes, Fast R-CNN [57] incorporates a ROI pooling layer before the fully-connected layer to obtain a fixed-length feature vector for each proposed region, so that only a single convolution pass is required for the input image. Fast R-CNN substantially improves the detection efficiency over R-CNN and SPPnet. However, it still requires expensive computation for external region proposals.
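A minimal sketch of ROI max pooling, assuming integer feature-map coordinates and an illustrative 2×2 output grid (Fast R-CNN uses a larger grid and handles sub-cell rounding more carefully):

```python
import numpy as np

def roi_max_pool(fmap, roi, out_size=2):
    # fmap: (C, H, W) feature map from the single shared convolution pass.
    # roi: (r1, c1, r2, c2) region in feature-map coordinates. The region is
    # divided into an out_size x out_size grid and max-pooled per cell, so
    # every ROI yields a fixed-length vector regardless of its extent.
    r1, c1, r2, c2 = roi
    rows = np.linspace(r1, r2, out_size + 1).astype(int)
    cols = np.linspace(c1, c2, out_size + 1).astype(int)
    out = np.zeros((fmap.shape[0], out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            cell = fmap[:, rows[i]:max(rows[i + 1], rows[i] + 1),
                           cols[j]:max(cols[j + 1], cols[j] + 1)]
            out[:, i, j] = cell.max(axis=(1, 2))
    return out.reshape(-1)   # fixed-length feature vector for this region

v = roi_max_pool(np.random.randn(8, 32, 32), (4, 6, 20, 28))
print(v.shape)   # (32,) = 8 channels x 2 x 2
```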
Table 2
Summary of different object detection architectures (FPS based on a Pascal Titan X GPU).

| Architecture | mAP (Pascal-VOC 2007) | mAP (MS-COCO) | Num. of FLOPs | FPS |
|---|---|---|---|---|
Fig. 2. An example of different visual perception problems [75]: (a) image classification, (b) object detection, (c) semantic segmentation, and (d) instance
segmentation.
They are two important datasets for image segmentation on which most research works are evaluated. In addition to them, Cityscapes [77] and MVD [78] provide street scenes with a large number of traffic objects in each image. The popular metrics to evaluate pixel-level segmentation accuracy are pixel accuracy (PA, i.e., the proportion of pixels predicted correctly), mean pixel accuracy (mPA) over all classes, mean intersection over union (mIOU), and frequency weighted intersection over union (FWIOU). Among them, mIOU is often preferred for its simplicity and representativeness.
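As a sketch of how mIOU is computed from two label maps (the class-count and averaging conventions vary slightly across benchmarks):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    # pred, gt: integer label maps of the same shape. For each class,
    # IoU = |pred AND gt| / |pred OR gt|; mIOU averages over the classes
    # that actually appear.
    ious = []
    for c in range(num_classes):
        p, g = (pred == c), (gt == c)
        union = np.logical_or(p, g).sum()
        if union > 0:
            ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))

pred = np.array([[0, 0, 1], [0, 1, 1]])
gt   = np.array([[0, 0, 1], [1, 1, 1]])
print(mean_iou(pred, gt, 2))   # (2/3 + 3/4) / 2 ~ 0.708
```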
Early segmentation techniques based on deep learning usually apply a CNN as the feature descriptor for each pixel, where a pixel is described by its surrounding patch [79,80]. This patch-based framework is inefficient due to redundant feature extraction, and it is not sufficiently accurate. The fully convolutional network (FCN) [81] is the forerunner that successfully implements pixel-wise dense prediction for semantic segmentation in an end-to-end CNN structure. It replaces the fully-connected layers of the well-known classification architectures (e.g. VGG [22], GoogleNet [23], etc.) with convolution layers to accommodate inputs of arbitrary sizes, and outputs a heatmap rather than a vector to indicate classification scores. The prediction loss is then measured by the pixel-wise loss between the heatmap, upsampled by deconvolution, and the labeled image of original size. FCN shows great improvement in pixel accuracy over traditional segmentation methods on Pascal-VOC 2012. However, the basic FCN structure fails to capture a large number of features and does not consider the spatial consistency between pixels, which hinders its application to certain problems and scenarios.
In any case, the success of the FCN architecture has made it popular, and it has been actively followed by many subsequent segmentation works [82–84]. Generally, using a classification model without fully-connected layers as the backbone network to produce low-resolution feature maps is referred to as the encoder, while the symmetric mapping from the low-resolution maps to the pixel-wise classification outcome is termed the decoder. With a well-known backbone network as the encoder, alternative CNN-based segmentation works usually differ in the decoder implementation. For example, the decoding stage of SegNet [82] uses the max-pooling indices from the corresponding feature maps in its encoder for upsampling. The resultant maps are then convolved with a set of filters to generate the restored feature maps for dense prediction. In another typical encoder-decoder framework, U-Net [83], which is designed for biomedical image segmentation, the upsampling layer directly concatenates with a set of cropped duplicates of the corresponding feature maps in the encoder to enhance resolution during the decoding process.

Image segmentation is a difficult problem that requires both good pixel-level accuracy, which relies on fine-grained local features, and good classification accuracy, for which the global context of the image is crucial to resolve local ambiguities. However, the pooling strategy in classic CNN architectures is a defect here, as detailed information is lost when multiple pooling steps are performed.

One possible and common solution to integrate context knowledge is to refine the output to have fine-grained details for accurate segmentation by a Conditional Random Field (CRF) [85–87]. Alternatively, instead of using the pooling strategy, the problem can be solved by expanding the receptive fields, in which each neuron is connected to a subset of neurons in the previous layer. Dilated convolution [88], which is a regular convolution with upsampled or dilated filters, is proposed to exponentially expand receptive fields without sacrificing resolution, as shown in Fig. 3. The DeepLab [85] model takes advantage of both dilated convolution and CRF refinement by post-processing to integrate context knowledge. As can be seen in Fig. 4, it achieves much higher prediction accuracy than SegNet and FCN, and it is thus considered a milestone work for semantic segmentation.
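A minimal single-channel NumPy sketch of dilated convolution: with dilation rate d, the nine taps of a 3×3 kernel are spread d pixels apart, so the effective footprint grows to (2d+1)×(2d+1) while the parameter count stays fixed. Stacking layers with d = 1, 2, 4 then grows the receptive field exponentially, as illustrated in Fig. 3:

```python
import numpy as np

def dilated_conv3x3(x, w, d):
    # 3x3 convolution with dilation rate d over a (H, W) input, "same" output.
    h, wd = x.shape
    xp = np.pad(x, d)
    y = np.zeros_like(x)
    for i in range(h):
        for j in range(wd):
            patch = xp[i:i + 2 * d + 1:d, j:j + 2 * d + 1:d]   # 3x3 taps, d apart
            y[i, j] = np.sum(patch * w)
    return y

x = np.random.randn(16, 16)
w = np.random.randn(3, 3)
y1 = dilated_conv3x3(x, w, 1)   # ordinary 3x3, receptive field 3x3
y2 = dilated_conv3x3(x, w, 2)   # same 9 weights, receptive field 7x7
```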
Fig. 3. The receptive field of (a) 1-dilated convolution, (b) 2-dilated convolution, and (c) 4-dilated convolution [88].
Several recently-proposed methods, such as RefineNet [84] and PSPNet [89], try to avoid or restore the loss caused by down-sampling in the encoder by fusing low-level and high-level features. RefineNet designs a decoder module that uses both short-range and long-range residual connections to capture rich contextual information. It has achieved the state-of-the-art performance on 7 public datasets. In PSPNet, a pyramid pooling module is proposed to aggregate region-based context information at different scales and thereby exploit the capability of global context information.

Inspired by spatial pyramid pooling, DeepLab V2 [87] investigates an atrous spatial pyramid pooling (ASPP) module by incorporating dilated convolutions with different sampling rates and spatial pyramid pooling to capture multi-level context information. In the latest DeepLab V3+ [90], an Xception module [91] is introduced to the encoder, together with the improved ASPP module. It obtains the state-of-the-art performance on Pascal-VOC 2012, as shown in Fig. 4.

The aforementioned end-to-end architectures mainly focus on semantic segmentation. Comparatively, most works on instance segmentation follow the pipeline in which segmentation precedes recognition. For example, DeepMask [92] and the instance-sensitive fully convolutional network [93] use Fast-RCNN to classify the learned segment proposals. The fully convolutional instance segmentation (FCIS) of Ref. [94] combines the segment proposal system [93] with object detection to predict object classes, boxes, and masks simultaneously. However, this method is not as accurate as Mask-RCNN [59] on the MS-COCO instance segmentation challenge. Mask-RCNN extends Faster-RCNN by adding a mask prediction branch in addition to bounding box regression and class recognition. The very recent path aggregation network (PANet) [95] enhances the feature hierarchy by a bottom-up path augmentation. With subtle computation overhead, it reaches the first place in the MS-COCO instance segmentation challenge and also represents the state-of-the-art on MVD and Cityscapes.

Although pixel-wise image segmentation is progressing rapidly with superior accuracy, it is still far from practical usage, such as video semantic segmentation, due to the high complexity of dense prediction. Therefore, developing a highly efficient image segmentation framework is one of the grand challenges in the computer vision community.

Fig. 4. mIOU for different semantic segmentation methods on Pascal-VOC 2012.

3. Hardware implementation

Historically, the von-Neumann-style compute-centric architectures (e.g. CPUs) are primarily designed for effective serial computations with complex task scheduling. They suffer from high energy consumption and low memory bandwidth for data movement when evaluating deep CNN networks, which require parallel dense computation, high data reusability and large memory bandwidth [99].

Fig. 5(a) compares neural networks with other approaches in terms of accuracy and scale (i.e. data/model size). The traditional machine learning methods, such as decision trees, SVMs, etc., are referred to as "Other Approaches" in Fig. 5(a). They are generally based on manually designed features. Due to the limited learning capability of these traditional methods, their accuracy cannot continuously increase with data/model scale. On the other hand, deep neural networks are highly scalable in their learning capability when deeper network structures and larger datasets are adopted.

Fig. 5. (a) Comparison between neural networks and other approaches in terms of accuracy and scale (i.e. data/model size) [96], and (b) trade-off between flexibility and efficiency for different hardware implementations [97].
Practically, an operation can be computed faster on a hardware platform that is application-specifically designed and/or programmed. Hardware acceleration thus steps forward with heavy customization of the processing capability, by allowing great parallelism, having specific datapaths for temporal variants, and reducing the overhead of instruction control [98]. For decades, hardware customization in the form of GPUs, FPGAs, and application-specific integrated circuits (ASICs) has offered a promising path that trades flexibility for computation efficiency, as seen in Fig. 5(b). In this section, we review a number of popular hardware implementations, including GPUs, FPGAs and other application-specific accelerators.

3.1. Graphics processing units (GPUs)

GPUs were initially developed to accelerate graphics processing. A GPU is particularly designed for integrated transform, lighting, triangle setup/clipping, and rendering [100]. A modern GPU is not only a powerful graphics engine but also a highly parallelized computing processor featuring high throughput and high memory bandwidth for massively parallel algorithms, which is dubbed GPU computing or general-purpose computing on GPU (GPGPU).

In contrast to multicore CPUs, which are typically out-of-order, multi-instructional, running at high frequencies and using large caches to minimize the latency of a single thread, GPGPUs consist of thousands of cores that are in-order, operate at lower frequencies and rely on smaller caches. To create high-performance GPU-accelerated applications with parallel programming, a variety of development platforms, such as the compute unified device architecture (CUDA) [101] and the open computing language (OpenCL) [102], are studied and utilized for GPU-accelerated embedded systems, desktop workstations, enterprise data centers, cloud-based platforms and high-performance computing (HPC) servers.

A number of hardware vendors have produced GPUs. Among them, Intel, Nvidia and AMD/ATI have been the market share leaders [100]. As shown in Fig. 6, the evolution of GPGPUs began in 2007, when Nvidia released its CUDA development environment. A great variety of GPUs have been designed for specific usages, such as the Nvidia GeForce GTX and AMD Radeon HD GPUs for powerful gaming, and the Nvidia Quadro and Titan X series for professional workstations [100]. More recently, the emergence of deep learning technology has ushered in significant advances in GPU computing.

Taking the CNN as an example, it can take advantage of the nature of its algorithmic parallelism in the following aspects [103]: (i) the convolution operation of an n×n matrix using a k×k kernel can be performed in parallel; (ii) the subsampling/pooling operation can be parallelized by executing different pooling operations separately; (iii) the activation of each neuron in a fully connected layer can be parallelized by creating a binary-tree multiplier. With great parallel processing structures and strong floating-point capabilities, GPGPUs have been recognized to be a good fit to accelerate deep learning. A number of GPU-based CNN libraries have been developed to facilitate highly optimized CNN implementations on GPUs, including cuDNN [104], Cuda-convnet [105] and several other libraries built upon the popular deep learning frameworks, such as Caffe [106], Torch [107], Tensorflow [108], etc.
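Aspect (i) dominates CNN workloads, and it maps onto a GPU precisely because every output pixel is an independent dot product. A common lowering trick, shown in this illustrative NumPy sketch, rearranges all k×k patches into a matrix ("im2col") so the whole convolution becomes one dense matrix multiply — the kind of operation that GPU libraries are heavily optimized for:

```python
import numpy as np

def conv2d_im2col(x, w):
    # x: (H, W) input; w: (k, k) kernel; stride 1, valid padding.
    # Each output pixel depends only on its own k x k patch, so all
    # H_out * W_out dot products are independent and can run in parallel.
    k = w.shape[0]
    h_out, w_out = x.shape[0] - k + 1, x.shape[1] - k + 1
    cols = np.stack([x[i:i + h_out, j:j + w_out].ravel()
                     for i in range(k) for j in range(k)])  # (k*k, H_out*W_out)
    return (w.ravel() @ cols).reshape(h_out, w_out)         # one big matmul

x = np.random.randn(32, 32)
w = np.random.randn(3, 3)
assert np.allclose(conv2d_im2col(x, w)[0, 0], np.sum(x[:3, :3] * w))
```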
Computational throughput, power consumption and memory efficiency are three important metrics when implementing deep learning on GPUs. Fig. 6 summarizes the peak performance of recent Nvidia GPUs for single-precision floating-point (FP32) arithmetic, measured in GFLOPs, and their power consumption, gauged by the Thermal Design Power (TDP). The GeForce 10 series, based on the powerful "Nvidia Pascal" GPU architecture, is a set of consumer graphics cards released by Nvidia in 2016 [110]. With an inexpensive GeForce GTX 1060, composed of 1280 CUDA cores delivering 3855 GFLOPs of computational throughput, one can get into deep learning at affordable cost.

For professional usage, the Titan V and Tesla V100, based on the new Volta architecture, are much more powerful and scalable than the Pascal-based GeForce 10 series. Volta integrates CUDA cores with the new Tensor core technology. Tensor cores are especially designed for deep learning and offer an extremely wide memory bus. Compared to CUDA cores, they improve the peak performance by up to 12× for training and up to 6× for inference. In addition to their high throughput, Tensor cores allow efficient computation with 16-bit word-length, implying that the amount of transferred data can be doubled over 32-bit arithmetic with the same memory bandwidth.

Nvidia Jetson is a leading low-power embedded platform that enables server-grade computing performance on edge devices. Jetson TX2 is based on the 16 nm NVIDIA Tegra "Parker" system-on-a-chip (SoC), which delivers 1 TFLOPs of throughput in a credit-card-sized module. A new series of RTX gaming cards (i.e., RTX 2070/2080/2080Ti) with the Turing architecture was unveiled in August 2018. They have Tensor cores on board and support unrestricted 16-bit floating-point (FP16), 8-bit integer (INT8) and 4-bit integer (INT4) arithmetic. Among them, the RTX 2080Ti offers promising performance with more than 100 TFLOPs in FP16.
Fig. 6. Computational throughput in terms of GFLOPS and power consumption in terms of TDP (watts) for single-precision floating-point arithmetic [109].
In many FPGA accelerator implementations, processing elements (PEs) are arranged in a 2D grid as a systolic array [121–123]. Because such a simple architecture limits the CNN kernel size and does not offer data caching, it cannot achieve extremely high performance. Recently, loop optimization techniques, including loop reordering, unrolling, pipelining and tiling, have been proposed to address the aforementioned issue. Loop reordering tries to prevent redundant memory accesses between loops to increase cache usage efficiency [124]. Loop unrolling and pipelining maximize the utilization of FPGA resources by exploiting the parallelism of loop iterations [115,125,126]. Loop tiling deals with the issue posed by the insufficient on-chip memory of FPGAs. It partitions the feature maps and weights of each layer fetched from memory into chunks, also referred to as tiles, to fit them into on-chip buffers [124,125,127].
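The following Python loop nest is a stand-in for the HLS-style C loops actually synthesized on an FPGA; the tile size and the matrix-vector layer are illustrative. Each outer iteration loads one weight tile and one activation slice small enough for on-chip buffers, and the innermost multiply-accumulates are the part that unrolling and pipelining parallelize:

```python
import numpy as np

def tiled_matvec(w, x, tile=64):
    # Tiled fully-connected layer: y = w @ x, computed block by block so
    # that each (tile x tile) weight block and matching activation slice
    # fit into fast on-chip memory before being consumed.
    y = np.zeros(w.shape[0])
    for r in range(0, w.shape[0], tile):          # tile over output neurons
        for c in range(0, w.shape[1], tile):      # tile over input neurons
            w_blk = w[r:r + tile, c:c + tile]     # "load" one weight tile
            x_blk = x[c:c + tile]                 # "load" one activation slice
            y[r:r + tile] += w_blk @ x_blk        # fully unrollable inner loops
    return y

w, x = np.random.randn(256, 512), np.random.randn(512)
assert np.allclose(tiled_matvec(w, x), w @ x)
```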
Model compression: DNNs often carry a significant number of redundant parameters and are mainly used for error-tolerant applications. Hence, a lot of effort has been made to simplify DNN models and, consequently, reduce the complexity of their hardware implementation. These model compression methods can be broadly classified into three different categories: (i) pruning [115,128–131], (ii) low-rank approximation [114,132], and (iii) quantization [114,121,133–137].

First, pruning is usually the first step to reduce model redundancy by removing the least-important connections and/or parameters. Taking the CNN as an example, we can remove its weights that are extremely small [34] and/or cause high energy consumption [130]. After pruning, the CNN model is highly sparse and can be efficiently implemented on an FPGA by masking out the zero weights for multiplications. Second, low-rank approximation decomposes the weight matrix of a convolutional or fully connected layer into a set of low-rank filters that can be evaluated with low computational cost [114]. Finally, because fixed-point arithmetic requires fewer computational resources than floating-point arithmetic, feature maps, weight matrices and/or convolutional kernels can be quantized using a fixed-point representation to further reduce the computational cost.
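A sketch of the simplest pruning variant mentioned above, magnitude-based pruning at a chosen sparsity level (energy-aware pruning [130] ranks weights differently):

```python
import numpy as np

def prune_small_weights(w, sparsity=0.9):
    # Zero out the smallest-magnitude fraction of weights; the surviving
    # mask lets hardware skip the multiplications for zero entries.
    thresh = np.quantile(np.abs(w), sparsity)
    mask = np.abs(w) >= thresh
    return w * mask, mask

w = np.random.randn(64, 64)
w_sparse, mask = prune_small_weights(w, 0.9)
print(mask.mean())   # ~0.1 of the weights survive
```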
A straightforward approach is to encode each numerical value with the desired word-length according to its range [114,121]. Alternatively, a dynamic scheme may be adopted to assign different scaling factors to different numerical parameters within the same network [133,134]. When the dynamic quantization method is applied to both the convolutional and fully-connected layers of AlexNet without fine-tuning, the classification accuracy is almost unchanged (<1%) [135].
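A sketch of such a per-tensor dynamic fixed-point scheme, assuming 8-bit signed integers and one power-of-two scale per layer (the published schemes [133–135] choose word-lengths and scales per layer from profiling):

```python
import numpy as np

def quantize_fixed(w, bits=8):
    # Choose a power-of-two scale that covers the tensor's value range,
    # then round to signed (bits)-bit integer levels. Each layer can carry
    # its own scale factor, which is the "dynamic" part of the scheme.
    scale = 2.0 ** np.ceil(np.log2(np.abs(w).max() + 1e-12))
    q_levels = 2 ** (bits - 1) - 1                   # e.g. 127 for 8 bits
    q = np.clip(np.round(w / scale * q_levels), -q_levels, q_levels)
    return q.astype(np.int32), scale / q_levels      # integers + dequant step

w = np.random.randn(128, 128).astype(np.float32)
q, step = quantize_fixed(w, bits=8)
print(np.abs(w - q * step).max())   # worst-case rounding error ~ step / 2
```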
In the extreme case, a DNN may use binary weights and activations, resulting in an extremely compact representation that is referred to as a binary neural network (BNN) [115,133,136,137]. A BNN can be evaluated with extremely low computational cost, as binary addition and multiplication can both be implemented with simple logic gates. Other than the BNN, a ternary DNN [138] sets its weights to +1, 0 or −1, allowing each weight to be represented by 2 bits, while the numerical operations in all neurons are implemented with floating-point arithmetic (FP32).
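The arithmetic saving comes from replacing multiply-accumulate with XNOR and popcount once the ±1 values are packed into machine words, as in this illustrative sketch:

```python
import numpy as np

def binary_dot(a_bits, w_bits, n):
    # With activations/weights in {-1, +1} packed as {0, 1} bits, a dot
    # product reduces to XNOR + popcount: (#matches) - (#mismatches).
    matches = bin(~(a_bits ^ w_bits) & ((1 << n) - 1)).count("1")
    return 2 * matches - n

# Pack +1 -> 1, -1 -> 0 for an 8-element vector.
a = np.array([1, -1, 1, 1, -1, -1, 1, -1])
w = np.array([1, 1, -1, 1, -1, 1, 1, -1])
pack = lambda v: int("".join('1' if x > 0 else '0' for x in v), 2)
print(binary_dot(pack(a), pack(w), 8), int(a @ w))   # both print 2
```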
Table 3 summarizes the performance of several typical CNN models deployed on FPGAs using different model optimization methods. CNN models based on quantized arithmetic are highly efficient in terms of hardware utilization and power consumption; however, their accuracy is often compromised. On the other hand, CNN models based on low-rank approximation (e.g. SVD) and pruning carry a smaller number of weights, while simultaneously achieving high classification accuracy. The ternary ResNet [115], implemented on an Intel Stratix 10 FPGA [140], achieves a throughput of 12 TGOP/s, outperforming the throughput of the Titan X Pascal GPU by 10%.

Table 3
Performance comparison of FPGA-based CNN accelerators.

| CNN Model | FPGA Device | Optimization Method | Accuracy (Top-5) | # of Param (M) | Computation (GOP) | Precision | Frequency (MHz) | Throughput (GOP/s) | Power (W) |
|---|---|---|---|---|---|---|---|---|---|
| VGG-19 [139] | Arria10 GX1150 | – | 90.1% | 138 | 30.8 | float32 | 370 | 866 | 41.7 |
| VGG-16 [114] | Zynq 7Z045 | SVD | 87.96% | 50.2 | 30.5 | fixed16 | 150 | 137 | 9.6 |
| VGG-16 [134] | Arria10 GX1150 | Dynamic | 88.1% | 138 | 30.8 | fixed8 | 150 | 645 | – |
| BNN: XNOR-Net [137] | Stratix5 GSD8 | Binary | 66.8% | 87.1 | 2.3 | fixed1 | 150 | 1,964 | 26.2 |
| Ternary ResNet [115] | Stratix10 | Ternary, pruning | 79.7% | 61 | 1.4 | float32 | 500 | 12,000 | 141.2 |

To make a comprehensive comparison of the state-of-the-art hardware accelerators, we present the key performance metrics of several mainstream accelerators (i.e. CPU, GPU and FPGA) with different network models (i.e. VGG and ResNet) in Table 4 [141]. The FPGA cluster in Table 4 is composed of 15 FPGA chips, as described in Ref. [142]. Two important observations can be made from the data in Table 4. First, the throughput of FPGAs is substantially higher than that of CPUs, but it is often lower than the throughput of GPUs. Second, among the FPGA, CPU and GPU, the FPGA offers the highest energy efficiency.

Table 4
Performance comparison of neural networks on different hardware platforms [141].

| Model | Platform | Device | Precision | Frequency (MHz) | Throughput (GOP/s) | Energy Efficiency (GOP/J) | Power (W) |
|---|---|---|---|---|---|---|---|

3.3. Application-specific hardware accelerators
A typical computer system is often heterogeneous, composed of a suite of processors in which the CPU is augmented with other, dissimilar processors to meet specific computing requirements. Processors that complement CPUs are known as application-specific coprocessors. In addition to notable coprocessors such as GPUs and FPGAs, there are several specialized hardware units, in the form of either stand-alone devices or coprocessors, that are particularly developed for deep learning and/or other AI applications.

The tensor processing unit (TPU) [143,144], a customized ASIC developed by Google, is a stand-alone device specifically designed for neural networks and tailored to the Google Tensorflow framework [108]. The TPU targets a high volume of low-precision (e.g., 8-bit) arithmetic. It has already powered many applications at Google, such as the search engine and AlphaGo [144]. The Intel Nervana neural network processor (NNP) [145] is designed to provide the required flexibility of deep learning primitives while making its core hardware components as efficient as possible. Mobileye EyeQ [146] is a family of SoC devices specialized for vision processing in autonomous driving. It shows the ability to handle complex and computationally intensive vision tasks while maintaining low power consumption.

Various AI coprocessors for mobile platforms have been developed recently, where an arms race seems around the corner. The mobile processor Qualcomm Snapdragon 845 contains a Hexagon 685 DSP core that supports sophisticated, on-device AI processing in camera, voice and gaming applications [147]. Imagination Technologies [148] develops a series of neural network accelerators (NNAs), such as the PowerVR Series 2NX/3NX NNA. They are intellectual property (IP) cores that are designed to deliver high-performance computation and low power consumption for embedded and mobile devices. The new "neural engine" by Apple [149], incorporated in the Apple A11/A12 Bionic SoC, is a pair of processing cores dedicated to specific machine learning algorithms, including Face ID, augmented reality, etc. HiSilicon Kirin 970 is the first mobile AI platform developed by HUAWEI [150]. With a dedicated neural processing unit (NPU), its new heterogeneous computing architecture improves the throughput and energy efficiency by up to 25× and 50×, respectively, over a quad-core Cortex-A73 CPU cluster.

4. Conclusions

As a scientific discipline, computer vision has been a challenging research area and has received significant attention. With the emergence of big data, advanced deep learning algorithms and powerful hardware accelerators, modern computer vision systems have dramatically evolved. In this paper, we conduct a comprehensive survey of computer vision techniques. Specifically, we have highlighted the recent accomplishments in both the algorithms for a variety of computer vision tasks, such as image classification, object detection and image segmentation, and the promising hardware platforms, such as GPUs, FPGAs and other new generations of hardware accelerators, that implement DNNs efficiently for practical applications.

In the future, increasingly compact and efficient DNNs are needed for real-time and embedded applications. In addition, weakly supervised or unsupervised DNN schemes must be investigated to perceive all object categories in all open-world scenes. Furthermore, highly energy-efficient hardware engines are required to extend the existing accelerators to a broad spectrum of challenging scenarios. To address the aforementioned grand challenges, massive innovations in computer vision systems, in terms of both algorithm development and hardware design, are expected over the next five or even ten years.

References

[1] James Kobielus, Powering AI: the Explosion of New AI Hardware Accelerators, InfoWorld, 2018 [Online]. Available: https://www.infoworld.com/article/3290104/powering-ai-the-explosion-of-new-ai-hardware-accelerators.html.
[2] Computer vision, Wikipedia. [Online]. Available: https://en.wikipedia.org/wiki/Computer_vision.
[3] Richard Szeliski, Computer Vision: Algorithms and Applications, Springer Science & Business Media, 2010.
[4] Wilson Geisler, Vision: A Computational Investigation into the Human Representation and Processing of Visual Information, Psyccritiques, 1983, pp. 581–582.
[5] Laura McClure, Building an AI with the Intelligence of a Toddler: Fei-Fei Li at TED2015, TED, 2015 [Online]. Available: https://blog.ted.com/building-an-ai-with-the-intelligence-of-a-toddler-fei-fei-li-at-ted2015/.
[6] Jia Deng, et al., Imagenet: a large-scale hierarchical image database, in: Conference on Computer Vision and Pattern Recognition, 2009.
[7] M. Sornam, Kavitha Muthusubash, V. Vanitha, A survey on image classification and activity recognition using deep convolutional neural network architecture, in: International Conference on Advanced Computing, 2017.
[8] Li Liu, et al., Deep Learning for Generic Object Detection: a Survey, arXiv:1809.02165, 2018.
[9] Alberto Garcia-Garcia, et al., A Review on Deep Learning Techniques Applied to Semantic Segmentation, arXiv:1704.06857, 2017.
[10] Sparsh Mittal, A survey of FPGA-based accelerators for convolutional neural networks, Neural Comput. Appl. (2018) 1–31.
[11] Alex Krizhevsky, Geoffrey Hinton, Learning Multiple Layers of Features from Tiny Images, Technical Report, vol. 1, no. 4, University of Toronto, 2009.
[12] Yann LeCun, et al., Gradient-based learning applied to document recognition, Proc. IEEE (1998) 2278–2324.
[13] David Lowe, Object recognition from local scale-invariant features, in: International Conference on Computer Vision, vol. 2, 1999.
[14] Navneet Dalal, Bill Triggs, Histograms of oriented gradients for human detection, in: Conference on Computer Vision and Pattern Recognition, vol. 1, 2005.
[15] Aude Oliva, Antonio Torralba, Modeling the shape of the scene: a holistic representation of the spatial envelope, Int. J. Comput. Vis. (2001) 145–175.
[16] Fei-Fei Li, Pietro Perona, A bayesian hierarchical model for learning natural scene categories, in: Conference on Computer Vision and Pattern Recognition, 2005.
[17] Kostas Daniilidis, Patros Maragos, Nikos Paragios, Improving the Fisher kernel for large-scale image classification, in: European Conference on Computer Vision, 2010, pp. 143–156.
[18] Corinna Cortes, Vladimir Vapnik, Support-vector Networks, Machine Learning, 1995, pp. 273–297.
[19] Jianchao Yang, et al., Linear spatial pyramid matching using sparse coding for image classification, in: Conference on Computer Vision and Pattern Recognition, 2009.
[20] Jorge Sanchez, Florent Perronnin, High-dimensional signature compression for large-scale image classification, in: Conference on Computer Vision and Pattern Recognition, 2011.
[21] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, Adv. Neural Inf. Process. Syst. (2012) 1097–1105.
[22] Karen Simonyan, Andrew Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, arXiv:1409.1556, 2014.
[23] Christian Szegedy, et al., Going deeper with convolutions, in: Conference on Computer Vision and Pattern Recognition, 2015.
[24] Kaiming He, et al., Deep residual learning for image recognition, in: Conference on Computer Vision and Pattern Recognition, 2016.
[25] Gao Huang, et al., Densely connected convolutional networks, in: Conference on Computer Vision and Pattern Recognition, 2017.
[26] Jie Hu, Li Shen, Gang Sun, Squeeze-and-excitation networks, in: Conference on Computer Vision and Pattern Recognition, 2017.
[27] Barret Zoph, et al., Learning transferable architectures for scalable image recognition, in: Conference on Computer Vision and Pattern Recognition, 2018.
[28] B. Zoph, Q.V. Le, Neural architecture search with reinforcement learning, in: International Conference on Learning Representations, 2017.
[29] Forrest Iandola, et al., Squeezenet: Alexnet-Level Accuracy with 50x Fewer Parameters and < 0.5 Mb Model Size, arXiv:1602.07360, 2016.
[30] Andrew Howard, et al., Mobilenets: Efficient Convolutional Neural Networks for Mobile Vision Applications, arXiv:1704.04861, 2017.
[31] Xiangyu Zhang, et al., ShuffleNet: an extremely efficient convolutional neural network for mobile devices, in: Conference on Computer Vision and Pattern Recognition, 2018.
[32] Bichen Wu, et al., Shift: a zero flop, zero parameter alternative to spatial convolutions, in: Conference on Computer Vision and Pattern Recognition, 2018.
[33] Weijie Chen, et al., All you need is a few shifts: designing efficient convolutional neural networks for image classification, in: Conference on Computer Vision and Pattern Recognition, 2019.
[34] Song Han, Huizi Mao, William J. Dally, Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding, in: International Conference on Learning Representations, 2016.
[35] Shuo Yang, et al., From facial parts responses to face detection: a deep learning approach, in: International Conference on Computer Vision, 2015.
[36] Bart Thomee, et al., YFCC100M: the New Data in Multimedia Research, arXiv:1503.01817, 2015.
[37] Aditya Khosla, et al., Novel dataset for fine-grained image categorization: stanford dogs, in: Conference on Computer Vision and Pattern Recognition Workshop on Fine-Grained Visual Categorization, 2011.
[38] K. Soomro, A.R. Zamir, M. Shah, UCF101: a dataset of 101 human actions classes from videos in the wild, arXiv:1212.0402, 2012.
[39] TREC Video Retrieval Evaluation: Multimedia event detection. [Online]. Available: https://trecvid.nist.gov/.
[40] Herve Jegou, et al., Aggregating local descriptors into a compact image representation, in: Conference on Computer Vision and Pattern Recognition, 2010.
[41] Zhongwen Xu, Yi Yang, Alexander G. Hauptmann, A discriminative CNN video representation for event detection, in: Conference on Computer Vision and Pattern Recognition, 2015.
[42] Hongteng Xu, Y. Zhen, H. Zha, Trailer generation via a point process-based visual attractiveness model, in: International Joint Conference on Artificial Intelligence, 2015.
[43] Du Tran, et al., Learning spatiotemporal features with 3D convolutional networks, in: International Conference on Computer Vision, 2015.
[44] Karen Simonyan, Andrew Zisserman, Two-stream convolutional networks for action recognition in videos, Adv. Neural Inf. Process. Syst. (2014) 568–576.
[45] Jeff Donahue, et al., Long-term recurrent convolutional networks for visual recognition and description, Trans. Pattern Anal. Mach. Intell. (2014) 677–691.
[46] Zuxuan Wu, et al., Modeling spatial-temporal clues in a hybrid deep learning framework for video classification, in: International Conference on Multimedia, 2015.
[47] Shikhar Sharma, Ryan Kiros, Ruslan Salakhutdinov, Action Recognition Using Visual Attention, arXiv:1511.04119, 2015.
[48] Z. Li, et al., Video LSTM convolves, attends and flows for action recognition, Comput. Vis. Image Understand. 166 (2018) 41–50.
[49] Pedro F. Felzenszwalb, et al., Object detection with discriminatively trained part-based models, Trans. Pattern Anal. Mach. Intell. (2010) 1627–1645.
[50] Paul Viola, Michael J. Jones, Robust real-time face detection, Int. J. Comput. Vis. (2004) 137–154.
[51] Timo Ahonen, Abdenour Hadid, Matti Pietikainen, Face description with local binary patterns: application to face recognition, Trans. Pattern Anal. Mach. Intell. (2006) 2037–2041.
[52] Mark Everingham, et al., The pascal visual object classes (voc) challenge, Int. J. Comput. Vis. (2010) 303–338.
[53] Tsung-Yi Lin, et al., Microsoft coco: common objects in context, in: European Conference on Computer Vision, 2014.
[54] Ross Girshick, et al., Rich feature hierarchies for accurate object detection and semantic segmentation, in: Conference on Computer Vision and Pattern Recognition, 2014.
[55] Jasper Uijlings, et al., Selective search for object recognition, Int. J. Comput. Vis. (2013) 154–171.
[56] Kaiming He, et al., Spatial pyramid pooling in deep convolutional networks for visual recognition, Trans. Pattern Anal. Mach. Intell. (2015) 1904–1916.
[57] Ross Girshick, Fast R-CNN, in: International Conference on Computer Vision, 2015.
[58] Shaoqing Ren, Kaiming He, Ross Girshick, et al., Faster R-CNN: towards real-time object detection with region proposal networks, Trans. Pattern Anal. Mach. Intell. (2017) 1137–1149.
[59] Kaiming He, Georgia Gkioxari, Piotr Dollar, et al., Mask R-CNN, in: International Conference on Computer Vision, 2017.
[60] Joseph Redmon, Santosh Divvala, Ross Girshick, et al., You only look once: unified, real-time object detection, in: Conference on Computer Vision and Pattern Recognition, 2016.
[61] Wei Liu, Dragomir Anguelov, Dumitru Erhan, et al., SSD: single shot multibox detector, in: European Conference on Computer Vision, 2016.
[62] Joseph Redmon, Ali Farhadi, YOLO9000: better, faster, stronger, in: Conference on Computer Vision and Pattern Recognition, 2017.
[63] Joseph Redmon, Ali Farhadi, Yolov3: an Incremental Improvement, arXiv:1804.02767, 2018.
[64] Han Hu, et al., Relation networks for object detection, in: Conference on Computer Vision and Pattern Recognition, 2018.
[65] Yancheng Bai, et al., SOD-MTGAN: small object detection via multi-task generative adversarial network, in: European Conference on Computer Vision, 2018, pp. 8–14.
[66] Yancheng Bai, et al., Finding tiny faces in the wild with generative adversarial network, in: Conference on Computer Vision and Pattern Recognition, 2018.
[67] Alex Bewley, et al., Simple online and realtime tracking with a deep association metric, in: International Conference on Image Processing, 2017.
[68] Alex Bewley, et al., Simple online and realtime tracking, in: International Conference on Image Processing, 2016.
[69] Xizhou Zhu, et al., Flow-Guided feature aggregation for video object detection, in: International Conference on Computer Vision, 2017.
[70] Xizhou Zhu, et al., Deep feature flow for video recognition, in: Conference on Computer Vision and Pattern Recognition, 2017.
[71] Jonathan Huang, et al., Speed/Accuracy Trade-Offs for Modern Convolutional Object Detectors, 2017 [Online]. Available: http://image-net.org/challenges/talks_2017/Imagenet2017VID.pdf.
[72] Ron Ohlander, Keith Price, D. Raj Reddy, Picture segmentation using a recursive region splitting method, Comput. Graph. Image Process. (1978) 313–333.
[73] Pedro F. Felzenszwalb, Daniel P. Huttenlocher, Efficient graph-based image segmentation, Int. J. Comput. Vis. (2004) 167–181.
[74] Wenxian Yang, et al., User-friendly interactive image segmentation through unified combinatorial user inputs, Trans. Image Process. (2010) 2470–2479.
[75] Alberto Garcia-Garcia, et al., A Review on Deep Learning Techniques Applied to Semantic Segmentation, arXiv:1704.06857, 2017.
[76] Mark Everingham, et al., The pascal visual object classes challenge: a retrospective, Int. J. Comput. Vis. (2015) 98–136.
[77] Marius Cordts, et al., The cityscapes dataset for semantic urban scene understanding, in: Conference on Computer Vision and Pattern Recognition, 2016.
[78] Gerhard Neuhold, et al., The mapillary vistas dataset for semantic understanding of street scenes, in: International Conference on Computer Vision, 2017.
[79] R. Giuly, M. Martone, M. Ellisman, Method: automatic segmentation of mitochondria utilizing patch classification, contour pair classification, and automatically seeded level sets, BMC Bioinf. 13 (1) (2012) 29.
[80] Holger Roth, et al., Deeporgan: multi-level deep convolutional networks for automated pancreas segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015.
[81] Jonathan Long, Evan Shelhamer, Trevor Darrell, Fully convolutional networks for semantic segmentation, in: Conference on Computer Vision and Pattern Recognition, 2015.
[82] Vijay Badrinarayanan, Alex Kendall, Roberto Cipolla, Segnet: a deep convolutional encoder-decoder architecture for image segmentation, Trans. Pattern Anal. Mach. Intell. (2017) 2481–2495.
[83] Olaf Ronneberger, Philipp Fischer, Thomas Brox, U-net: convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015.
[84] Guosheng Lin, et al., RefineNet: multi-path refinement networks for high-resolution semantic segmentation, in: Conference on Computer Vision and Pattern Recognition, 2017.
[85] Liang-Chieh Chen, et al., Semantic image segmentation with deep convolutional nets and fully connected CRFs, Comput. Sci. (2014) 357–361.
[86] Shuai Zheng, et al., Conditional random fields as recurrent neural networks, in: International Conference on Computer Vision, 2015.
[87] Liang-Chieh Chen, et al., Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs, Trans. Pattern Anal. Mach. Intell. (2018) 834–848.
[88] Fisher Yu, Vladlen Koltun, Multi-scale Context Aggregation by Dilated Convolutions, arXiv:1511.07122, 2015.
[89] Hengshuang Zhao, et al., Pyramid scene parsing network, in: Conference on Computer Vision and Pattern Recognition, 2017.
[90] Liang-Chieh Chen, et al., Encoder-decoder with atrous separable convolution for semantic image segmentation, in: European Conference on Computer Vision, 2018.
[91] François Chollet, Xception: deep learning with depthwise separable convolutions, in: Conference on Computer Vision and Pattern Recognition, 2017.
[92] P.O. Pinheiro, R. Collobert, P. Dollar, Learning to segment object candidates, Adv. Neural Inf. Process. Syst. (2015) 1990–1998.
[93] Jifeng Dai, et al., Instance-sensitive fully convolutional networks, in: European Conference on Computer Vision, 2016.
[94] J. Dai, et al., R-FCN: object detection via region-based fully convolutional networks, Adv. Neural Inf. Process. Syst. (2016) 379–387.
[95] Shu Liu, et al., Path aggregation network for instance segmentation, in: Conference on Computer Vision and Pattern Recognition, 2018.
[96] Jeff Dean, Keynote: Recent Advances in Artificial Intelligence via Machine Learning and the Implications for Computer System Design, Hotchips, 2017.
[97] Shaaban, "Advanced computer architecture: digital signal processing (DSP): architecture & processors," Lecture EECC722 [Online]. Available: http://meseec.ce.rit.edu/eecc722-fall2012/722-10-10-2012.pdf.
[98] Hardware acceleration, Wikipedia. [Online]. Available: https://en.wikipedia.org/wiki/Hardware_acceleration.
[99] S. Mittal, J.S. Vetter, A survey of CPU-GPU heterogeneous computing techniques, ACM Comput. Surv. 47 (4) (2015) 69.
[100] Graphics processing unit, Wikipedia. [Online]. Available: https://en.wikipedia.org/wiki/Graphics_processing_unit.
[101] Michael Garland, et al., Parallel computing experiences with CUDA, IEEE Micro (2008) 13–27.
[102] John E. Stone, D. Gohara, G. Shi, OpenCL: a parallel programming standard for heterogeneous computing systems, Comput. Sci. Eng. (2010) 66–73.
[103] Magnus Halvorsen, Hardware Acceleration of Convolutional Neural Networks, MS thesis, Norwegian University of Science and Technology, 2015.
[104] Sharan Chetlur, et al., CUDNN: Efficient Primitives for Deep Learning, arXiv:1410.0759, 2014.
[105] Alex Krizhevsky, Cuda-convnet2. [Online]. Available: https://code.google.com/archive/p/cuda-convnet2/.
[106] Yangqing Jia, et al., Caffe: convolutional architecture for fast feature embedding, in: International Conference on Multimedia, 2014.
[107] Ronan Collobert, Koray Kavukcuoglu, Clement Farabet, Torch7: a matlab-like environment for machine learning, in: Conference on Neural Information Processing Systems Workshop, 2011.
[108] TensorFlow. URL: https://www.tensorflow.org/.
[109] Grigory Sapunov, Hardware for Deep Learning, Intento, 2018 [Online]. Available: https://blog.inten.to/hardware-for-deep-learning-current-state-and-trends-51c01ebbb6dc.
[110] PASCAL GPU Architecture. URL: https://www.nvidia.com/en-us/data-center/pascal-gpu-architecture/.
[111] Eugenio Culurciello, Computation and Memory Bandwidth in Deep Neural Networks, A Medium Corporation, 2017 [Online]. Available: https://medium.com/@culurciello/computation-and-memory-bandwidth-in-deep-neural-networks-16cbac63ebd5.
[112] NVIDIA Collective Communications Library (NCCL). URL: https://developer.nvidia.com/nccl.
[113] Kalin Ovtcharov, et al., Accelerating Deep Convolutional Neural Networks Using Specialized Hardware, Microsoft Research Whitepaper, 2015, pp. 1–4.
[114] Jiantao Qiu, et al., Going deeper with embedded fpga platform for convolutional neural network, in: International Symposium on Field-Programmable Gate Arrays, 2016.