
Received March 22, 2020; accepted April 19, 2020; date of publication April 27, 2020; date of current version May 7, 2020.

Digital Object Identifier 10.1109/ACCESS.2020.2990416

Deep SCNN-Based Real-Time Object Detection for Self-Driving Vehicles Using LiDAR Temporal Data

SHIBO ZHOU1, YING CHEN2, XIAOHUA LI1 (Senior Member, IEEE), AND ARINDAM SANYAL3 (Member, IEEE)
1 Department of Electrical and Computer Engineering, The State University of New York at Binghamton, Binghamton, NY 13902, USA
2 Department of Management Science and Engineering, School of Management, Harbin Institute of Technology, Harbin 150000, China
3 Department of Electrical Engineering, The State University of New York at Buffalo, Buffalo, NY 14260, USA

Corresponding author: Ying Chen (yingchen@hit.edu.cn)

This work was supported in part by the National Natural Science Foundation of China under Grant 91846301.

ABSTRACT Real-time accurate detection of three-dimensional (3D) objects is a fundamental necessity for self-driving vehicles. Most existing computer vision approaches are based on convolutional neural networks (CNNs). Although CNN-based approaches can achieve high detection accuracy, their high energy consumption is a severe drawback. To resolve this problem, novel energy-efficient approaches should be explored. The spiking neural network (SNN) is a promising candidate because it has orders-of-magnitude lower energy consumption than a CNN. Unfortunately, the study of SNNs has so far been limited to small networks, and the application of SNNs to large 3D object detection networks remains largely open. In this paper, we integrate a spiking convolutional neural network (SCNN) with temporal coding into the YOLOv2 architecture for real-time object detection. To take advantage of spiking signals, we develop a novel data preprocessing layer that translates 3D point-cloud data into spike time data. We propose an analog circuit to implement the non-leaky integrate and fire neuron used in our SCNN, from which the energy consumption of each spike is estimated. Moreover, we present a method to calculate the network sparsity and the energy consumption of the overall network. Extensive experiments have been conducted on the KITTI dataset, which show that the proposed network can reach detection accuracy competitive with existing approaches, yet with much lower average energy consumption. If implemented in dedicated hardware, our network could have a mean sparsity of 56.24% and an extremely low total energy consumption of only 0.247 mJ per frame. Implemented on an NVIDIA GTX 1080i GPU, the network achieves a frame rate of 35.7 fps, high enough for real-time object detection.

INDEX TERMS Spiking convolutional neural network, LiDAR temporal data, energy consumption, real-time object detection.

I. INTRODUCTION
In recent years, increased attention has been paid to point cloud data processing for autonomous driving applications because of significant improvements in automotive light detection and ranging (LiDAR) sensors, which deliver three-dimensional (3D) point clouds of the environment in real time. Point cloud data have highly variant density distributions throughout the measurement area [14], which can be exploited for object detection [1], [7], [22]. Nevertheless, unlike camera images, LiDAR point clouds are unordered and sparse, which makes real-time object detection difficult.

To address the point cloud object detection challenge, many approaches have been proposed, which can be divided into three general classes. The first class projects point clouds into a perspective view and detects objects via image-based algorithms [4], [12]. The second class converts point clouds into a 3D voxel grid and uses hand-crafted features to encode each voxel [2], [22]. The third class is similar to the second but replaces the hand-crafted features with machine-learned features [5].

Owing to the machine-learned features, the third class can achieve much better object detection performance. Qi et al. [14] proposed PointNet, which learns point-wise features of point clouds using deep neural networks. Qi et al. [15] proposed PointNet++ to allow networks to learn local structures at different scales.


Zhou and Tuzel [24] developed the VoxelNet method, which can learn discriminative feature representations from point clouds and predict accurate 3D bounding boxes in an end-to-end module. Simon et al. [19] developed Complex-YOLO, a real-time 3D object detector that uses an enhanced region-proposal network (E-RPN) to estimate the orientation of objects, coded with imaginary and real parts for each box. Recently, Simon et al. [20] presented a novel fusion of neural networks (i.e., Complexer-YOLO) that uses a state-of-the-art 3D detector and visual semantic segmentation in the field of autonomous driving. The accuracy of these methods has been demonstrated on the KITTI vision benchmark dataset [3].

There is much less work that focuses on the energy consumption of real-time object detection, although low energy consumption is a critical requirement for many practical applications such as autonomous vehicles. Convolutional neural networks (CNNs) have been the most popular technique for object detection [20], [24]. However, their high energy consumption has been a challenging issue. By comparison, it is well known that spiking neural networks (SNNs) are energy efficient and can potentially have orders-of-magnitude lower energy consumption than CNNs [21]. Although SNNs have been investigated far less than CNNs, numerous studies have shown that SNNs are able to achieve similarly high image classification accuracy [8], [9]. One of the major challenges for SNNs lies in the non-differentiability of spiking activities, which makes the training of large-scale complex networks difficult. Zhou et al. [23] proposed a direct-training-based spiking CNN (SCNN) that could recognize the CIFAR-10 dataset using much less energy than a CNN while reaching the same state-of-the-art accuracy. Nevertheless, efficient SCNN methods have not yet been reported for more complex data sets such as the KITTI 3D point clouds.

For point cloud data, a special issue is how to translate the point-cloud format into a spiking format suitable for SNNs. If the input data are camera images, the input image can be converted into spike trains based on pixel intensity [21] or encoded into spike times [9]. Unfortunately, this approach is not efficient enough for point cloud data because of the sparsity and uneven density distributions of point clouds.

In this paper, similar to [20] and [24], we first quantize the point cloud into a 3D voxel representation so as to reduce the amount of input data. Then, we design an innovative data-preprocessing layer that converts the 3D voxels into spike signals by adding time information to each voxel. Using this special temporal coding method, the input data are converted into spike times directly, which permits us to design SCNNs with energy-efficient temporal coding. Finally, we use such SCNNs to replace the CNNs of the YOLOv2 architecture [17] to develop a large-scale object detection network. Note that, similar to [23], the network is an end-to-end object detection network that combines feature extraction and bounding-box prediction.

We evaluate our developed network on the bird's-eye-view and 3D detection tasks provided by the KITTI benchmark. Implemented on an NVIDIA GTX 1080i graphics processing unit (GPU), experimental results show that our network can reach a high frame rate of 35.7 fps, enough for real-time operation. Additionally, the detection accuracies for cars, pedestrians, and cyclists can reach the state-of-the-art level. To show the potential energy efficiency of our network, we propose an analog circuit implementation of the spiking neuron, based on which our proposed network would consume an average of only 0.247 mJ for processing each frame. This connotes our proposed network's high performance and energy efficiency.

The contributions of this paper are listed as follows:
1) We develop a novel data preprocessing layer to add temporal information to voxel data. With such temporal coding, we develop SCNNs with sparse spiking patterns to save energy.
2) We combine the SCNNs with the YOLOv2 architecture to develop an efficient object detection network. We propose two variants of the network: one with skip connections (SC) and one without SC. The network without SC is suitable for current neuromorphic chips, while the one with SC can be implemented with future neuromorphic chips that support skip connections.
3) We provide an analog circuit to implement the non-leaky integrate and fire neuron used in our SCNNs, based on which the energy consumption of a spike is estimated. We also provide a way to estimate the sparsity of the network. Combining low spike energy consumption and high network sparsity, the overall network energy consumption will be much lower than that of existing models. Simulation results demonstrate the extremely low energy consumption of our network.

II. NETWORK ARCHITECTURE
As shown in Fig. 1, the proposed network comprises three functional blocks: a point cloud data preprocessing layer (LiDAR spike generation), spiking-convolution layers (feature learning), and a detection layer. In the following subsections, we provide a detailed description of each of these three blocks.

A. PREPROCESSING LAYER
For 3D point cloud data such as the KITTI data, each point i is described by a four-number vector (x_i, y_i, z_i, r_i), where x_i, y_i, z_i are the 3D position of the reflecting object point and r_i is the received laser light reflection intensity. The LiDAR device emits a pulse, which is reflected from the object and received by the LiDAR reception device. If the laser emission equipment is the origin of the coordinate system, the distance between each reflection point i and the laser emission equipment can be calculated as

d_i = \sqrt{x_i^2 + y_i^2 + z_i^2}.    (1)

Considering the volume of the space that LiDAR scans, a 3D point cloud data set is usually huge and sparse.


FIGURE 1. Proposed object detection architecture.

Using the 3D point cloud directly as input to deep networks is not computationally efficient. To reduce complexity (or to compress the LiDAR data set), one way is to quantize the point clouds with a 3D voxel representation [19]. We adopt this approach and consider the 3D region of [0, 60] m × [−40, 40] m × [−2.73, 1.27] m. Specifically, we consider all the KITTI data points with x_i ∈ [0, 60] m, y_i ∈ [−40, 40] m, and z_i ∈ [−2.73, 1.27] m. We quantize the region into 768 × 1024 × 21 voxels, with the size of each voxel cell approximately equal to 0.08 × 0.08 × 0.19 m^3, as shown in Fig. 2. The voxelated space is a regular 3D coordinate system (x_v, y_v, z_v), with length x_v ∈ [0, 767], width y_v ∈ [0, 1023], and height z_v ∈ [0, 20]. With the voxel representation, we construct a 3D tensor with shape 768 × 1024 × 21 from each KITTI 3D point cloud data file and use it as input to the deep networks.

FIGURE 2. Voxelized point clouds within a 3D coordinate system.

One of the differences between our work and existing voxel-based works is that we use the propagation time as the value of each voxel (i.e., tensor element) rather than the received light intensity or the number of data points. As introduced in Behroozpour et al. [25], the round-trip delay of the LiDAR emitted light for each point can be calculated using the following equation:

t_i = \frac{2 d_i}{c},    (2)

where c is the propagation speed of the laser pulse. Since t_i is the time information of each point, it can be used to represent the spike time for SCNN object detection networks.

For the voxel (x_v, y_v, z_v), the value t_v is calculated according to the t_i of all points inside this voxel. Due to the sparsity of the point cloud, many voxels contain no data points. In this case, we let t_v = 0. For the voxels that have one or more data points, we randomly select one data point and use its time information to represent the voxel, i.e., t_v = t_i for a randomly selected data point i in this voxel. Note that we tried using the average time t_v = \frac{1}{I} \sum_{i=1}^{I} t_i over all the I data points in this voxel, but found that this averaging method did not lead to an obvious performance gain. Therefore, we adopt the simpler random selection method.

This propagation-time-based encoding method is more desirable than other encoding methods. The intensity has large variations due to object shapes, materials, and environmental conditions such as rain, which increases the difficulty of deep network generalization. In contrast, the propagation time for each voxel is more regular. Compared with using the number of data points as the voxel value, the propagation time leads naturally to an SCNN with a temporal coding that directly uses t_v as the spiking time. This temporal coding considers short-term stimuli to produce a small number of spikes [26], and thus leads to sparse spiking patterns.

After the LiDAR-data-to-spike-time conversion, the input to the first layer of the spiking convolution is the 3D tensor with shape 768 × 1024 × 21. The element value of the tensor is t_v. In the convolutional layers, we adopt 2D convolutions rather than 3D convolutions, where the filter sliding is conducted over the first two dimensions (length and width) only. For example, the first spiking convolutional layer uses filters with shape 3 × 3 × 21. In other words, we take the height dimension as the color dimension of conventional images. The reason is that the height dimension has a shape of 21, which is very small and much less than the other two dimensions. Using 2D convolution leads to a great reduction of the computational complexity without obvious performance loss.
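As a concrete illustration of this preprocessing layer, the following is a minimal NumPy sketch (our own illustration, not the authors' released code) that converts a KITTI-style point array into the 768 × 1024 × 21 spike-time tensor, assuming the region bounds, voxel counts, and random-point selection rule described above; all function and variable names are ours.

```python
import numpy as np

C_LIGHT = 3e8  # propagation speed of the laser pulse (m/s)

def points_to_spike_tensor(points, rng=np.random.default_rng(0)):
    """points: (N, 4) array of (x, y, z, r) LiDAR returns.
    Returns a (768, 1024, 21) tensor of spike times t_v (0 for empty voxels),
    following Eqs. (1)-(2) and the random-point rule of Section II-A."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]

    # Keep only the region [0, 60] m x [-40, 40] m x [-2.73, 1.27] m.
    mask = (x >= 0) & (x <= 60) & (y >= -40) & (y <= 40) & (z >= -2.73) & (z <= 1.27)
    x, y, z = x[mask], y[mask], z[mask]

    # Round-trip time of each point: t_i = 2 * d_i / c, with d_i = sqrt(x^2 + y^2 + z^2).
    t = 2.0 * np.sqrt(x**2 + y**2 + z**2) / C_LIGHT

    # Quantize into 768 x 1024 x 21 voxels (cell size ~ 0.08 x 0.08 x 0.19 m^3).
    xv = np.minimum((x / (60.0 / 768)).astype(int), 767)
    yv = np.minimum(((y + 40.0) / (80.0 / 1024)).astype(int), 1023)
    zv = np.minimum(((z + 2.73) / (4.0 / 21)).astype(int), 20)

    tensor = np.zeros((768, 1024, 21), dtype=np.float32)
    # Shuffle the points so that, for voxels with several points, the point kept by
    # the final write is effectively a random member of the voxel (t_v = t_i).
    order = rng.permutation(len(t))
    tensor[xv[order], yv[order], zv[order]] = t[order]
    return tensor
```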


B. SPIKING-CONVOLUTION LAYERS
Following Mostafa [9], Göltz et al. [28], Comsa et al. [29], and Kheradpisheh et al. [30], in the spiking-convolution layers we use non-leaky integrate and fire neurons with exponentially decaying synaptic current kernels. A neuron's membrane dynamics are described by

\frac{dV_{mem}^{j}(t)}{dt} = \sum_i w_{ji} \sum_r k(t - t_i^r),    (3)

where the right-hand side of Eq. (3) is the synaptic current, V_{mem}^{j} is the membrane potential of neuron j, w_{ji} is the weight of the synaptic connection from neuron i to neuron j, t_i^r is the time of the r-th spike from neuron i, and k(x) is the synaptic current kernel given by

k(x) = \theta(x) \exp\!\left(-\frac{x}{\tau_{syn}}\right), \quad \theta(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ 0 & \text{otherwise.} \end{cases}    (4)

The synaptic current jumps immediately when an input spike arrives and then decays exponentially with a time constant \tau_{syn}. Both r and \tau_{syn} are set to 1 for the rest of this paper.

FIGURE 3. A model of a spiking neuron involving accumulating and thresholding operations.

We assume that a neuron receives N spikes at times {t_1, ..., t_N} with weights {w_1, ..., w_N} from N source neurons, and these spike times accumulate. As shown in Fig. 3, the neuron spikes when its membrane potential exceeds the firing threshold. After a spike, the membrane potential automatically resets to 0.

If the neuron spikes at time t_{out}, the membrane potential for t < t_{out} can be derived as

V_{mem}(t) = \sum_{i=1}^{N} \theta(t - t_i)\, w_i \,(1 - \exp(-(t - t_i))).    (5)

Assume the threshold membrane potential is 1. Then t_{out} satisfies

1 = \sum_{i \in C} w_i (1 - \exp(-(t_{out} - t_i))),    (6)

where C = {i : t_i < t_{out}}. Therefore,

\exp(t_{out}) = \frac{\sum_{i \in C} w_i \exp(t_i)}{\sum_{i \in C} w_i - 1}.    (7)

We use this non-leaky integrate-and-fire neuron model to implement the convolutional neural networks. For example, in the first convolutional layer, the input neuron time (t_i of Eq. (7)) is the voxel time t_v. The convolution kernel consists of the weights w_i in Eq. (7). The output neuron value is t_{out}, which is calculated by Eq. (7). As a result, communication between convolution layers occurs entirely through the temporally coded spike signals.

The convolution operation is illustrated in Fig. 4. Prior to the matrix multiplication, we sort the spiking-time values t_i in the small square of the input data in layer n, from small to large, into a vector T. The elements w_i in the convolutional kernel are reordered to generate a matrix W according to the changed positions of the spiking-time values. According to Eq. (7), for w_i ∈ W and t_i ∈ T, a dot product is performed between w_i and exp(t_i). The neuron fires when the accumulated dot products reach the threshold. Consequently, the value of the mapping element, t_{out}, becomes the output. As noted, after the neuron fires, it is not allowed to fire again.

FIGURE 4. Illustration of spiking convolution in layer n.

Two important remarks are necessary. First, the sorting of t_i, which can be computationally demanding, is not needed when the SCNN is implemented in dedicated hardware. This is because t_i represents the actual arrival time of the input neuron: a smaller t_i means a spike arrives earlier and is thus accumulated earlier. Second, the input neurons with time t_i > t_{out} do not participate in the dot product and accumulation. As a matter of fact, these neurons would never fire spikes when implemented in dedicated hardware. Therefore, with temporal coding, the fired spikes can be quite sparse, which greatly conserves energy.

Using the SCNN to replace the CNN of the YOLOv2 backbone in [17], we obtain a new YOLOv2 network that detects and locates objects from spiking-time data. Combined with the real-time nature of YOLOv2, the presented network can be implemented efficiently on a neuromorphic architecture.
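To make the firing rule concrete, here is a small numerical sketch (ours, not the authors' implementation) that computes t_{out} from Eq. (7) under the unit-threshold and \tau_{syn} = 1 assumptions above: input spikes are scanned in time order, and the earliest causal set C whose candidate firing time precedes the next input spike is accepted.

```python
import numpy as np

def firing_time(times, weights):
    """times, weights: 1D arrays for the N input spikes of one neuron.
    Returns t_out solving Eq. (7) with threshold 1, or None if the neuron never fires."""
    order = np.argsort(times)          # sort input spikes by arrival time
    t, w = times[order], weights[order]

    num, den = 0.0, -1.0               # running sum(w_i * e^{t_i}) and sum(w_i) - 1
    for k in range(len(t)):
        num += w[k] * np.exp(t[k])
        den += w[k]
        if den <= 0.0:
            continue                   # membrane potential cannot reach the threshold yet
        t_out = np.log(num / den)      # candidate from Eq. (7) with C = first k+1 spikes
        # Accept the candidate only if it occurs before the next input spike, so that
        # the causal set C = {i : t_i < t_out} is exactly these k+1 spikes.
        if k == len(t) - 1 or t_out <= t[k + 1]:
            return t_out
    return None                        # the membrane potential never crosses the threshold
```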
C. DETECTION LAYER
Following Simon et al. [19], we use the E-RPN to derive the object's position b_x, b_y, length b_l, width b_w, probability p_0, class scores {p_1, ..., p_n}, and orientation b_\phi. To achieve a proper orientation, we exploit the updated Grid-RPN approach from [19] to obtain

b_x = \sigma(t_x) + c_x,    (8)
b_y = \sigma(t_y) + c_y,    (9)
b_w = p_w e^{t_w},    (10)
b_l = p_l e^{t_l},    (11)
b_\phi = \arg(|z| e^{i b_\phi}) = \arctan_2(t_{Im}, t_{Re}),    (12)


where t_x, t_y, t_w, t_l, t_{Im}, and t_{Re} are the 6 coordinate parameters for each bounding box that the network predicts. (c_x, c_y) is the cell offset from the top left corner of the image. The bounding box prior has width and length p_w and p_l, respectively. In addition, t_x, t_y, t_w, t_l, t_{Im}, and t_{Re} are the responsible regression parameters. With t_x, t_y, t_w, t_l, and \arctan_2(t_{Im}, t_{Re}), we can easily calculate the position, width, length, and angle of each bounding box.
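For illustration, a short sketch (ours) showing how one predicted box is decoded from its regression parameters according to Eqs. (8)-(12); sigma denotes the logistic function, and the argument names are ours.

```python
import numpy as np

def decode_box(t_x, t_y, t_w, t_l, t_im, t_re, c_x, c_y, p_w, p_l):
    """Decode one predicted box from its 6 regression parameters (Eqs. (8)-(12)).
    (c_x, c_y): offset of the grid cell; (p_w, p_l): width/length of the box prior."""
    sigma = lambda v: 1.0 / (1.0 + np.exp(-v))   # logistic activation
    b_x = sigma(t_x) + c_x                        # Eq. (8)
    b_y = sigma(t_y) + c_y                        # Eq. (9)
    b_w = p_w * np.exp(t_w)                       # Eq. (10)
    b_l = p_l * np.exp(t_l)                       # Eq. (11)
    b_phi = np.arctan2(t_im, t_re)                # Eq. (12): orientation from the complex angle
    return b_x, b_y, b_w, b_l, b_phi
```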
Our regression parameters are directly linked to the loss function L_{loss} based on the Complex-YOLO of [19]. Specifically, the loss function is defined as

L_{loss} = L_{YOLO} + L_{Euler},    (13)

where L_{YOLO} is the sum of squared errors using the multi-part loss introduced in [17]. Additionally, according to [19], the Euler regression part of the loss function, L_{Euler}, is defined as

L_{Euler} = \lambda_{coord} \sum_{i=0}^{s^2} \sum_{j=0}^{B} \alpha_{ij}^{obj} \left[ (t_{im} - \hat{t}_{im})^2 + (t_{re} - \hat{t}_{re})^2 \right],    (14)

where \alpha_{ij}^{obj} indicates the j-th bounding-box predictor in cell i, which has the highest intersection over union (IoU) in comparison with the ground truth for that prediction. \lambda_{coord} is a scaling factor used to guarantee stable convergence during the early phases. \hat{t}_{im} and \hat{t}_{re} are the estimated responsible regression parameters.
sion parameters. fires and produces an output of ‘1’ which represents firing
of neuron j. The comparator output also resets the second
III. ENERGY CONSUMPTION integrator.
Similar to [19] and [24], our network is a single-stage detector The amplifiers in the integrators are designed using simple
that can be trained in an end-to-end manner. The spike-based two-stage operational-amplifier structure with 55dB gain.
time coding strategy we use in this paper is similar to the A strongARM latch architecture is used to design the com-
scheme of time-to-first-spike in [13]. Hence, our proposed parator. Designed in 65nm CMOS process, the neuron con-
network will have lower running time than CNN-based mod- sumes 19 pJ energy for 1MHz input pulses. Since each neuron
els [5], [18]. Moreover, our network is more energy efficient is only allowed to spike no more than once, each spike is thus
because the network signals are transmitted via spikes. In averaged to consume p = 19 pJ on the analog circuit. Static
this section, we analyze the energy consumption by first power consumed by the amplifiers dominates the energy
proposing an analog circuit of the spiking neuron, and then consumption of each neuron. However, gain-bandwidth of the
pointing out the sparsity of our proposed network. amplifiers scale with frequency of input spikes, and energy


Since all the computations are done in the analog domain, the proposed neuron does not need digital memory to store intermediate results [10], [27], and hence eliminates frequent data movements between memory and computation circuits. Thus, the proposed analog neuron is expected to have significantly lower energy consumption compared to a digital implementation of the neuron. The weights of the spiking neural network, i.e., the strengths of the synaptic connections between neurons, are encoded as capacitance values, which further reduces the storage requirement.

The implementation shown in Fig. 5 can be used when the SCNN has been trained and the weights are fixed. To make the neural weights programmable, we can apply non-volatile memory (NVM) on chip to make the capacitors programmable. Each capacitor can be split into several slices, and the weights stored in the NVM can select the number of slices for each capacitor and thereby set its weight. For a given application, if the weights are updated only once at start-up, then the energy consumption due to memory access is not an issue. Memory access energy only becomes problematic if the memory has to be accessed every cycle of operation. Therefore, in a programmable deep network, the major cost would be the area increase, since the memory has to be stored on the chip. The power increase is not going to be significant.

B. NETWORK SPARSITY FOR ENERGY EFFICIENCY
As seen in Eq. (6), a neuron spikes only when its membrane potential reaches the threshold. After the neuron spikes, it is not allowed to spike again. To recognize a pattern or detect a scene, if the first N neurons from the input layer can cause a neuron of the next layer to spike, the remaining M neurons in the input layer will not spike, which means many neurons will not spike in each layer. Hence, we can calculate the sparsity of each layer using the following equation:

S_i = \frac{N_i}{M_i + N_i},    (15)

where S_i is the sparsity of the i-th layer, N_i is the number of spiking (active) neurons in the i-th layer, and M_i is the number of non-active neurons in the i-th layer. The overall sparsity of a K-layer network is

S_{total} = \frac{N_{total}}{M_{total} + N_{total}} = \frac{\sum_{i=1}^{K} N_i}{\sum_{i=1}^{K} M_i + \sum_{i=1}^{K} N_i},    (16)

where M_{total} is the total number of non-active neurons and N_{total} is the total number of active neurons. The total energy consumption of the network can be calculated as

P = N_{total} \times p,    (17)

where p is the energy consumed by each spike.
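The sparsity and energy bookkeeping of Eqs. (15)-(17) reduces to a few sums; the following sketch (ours) makes that explicit, with p = 19 pJ per spike taken from the analog neuron above.

```python
def network_energy(active_per_layer, inactive_per_layer, energy_per_spike=19e-12):
    """Layer sparsities S_i (Eq. (15)), overall sparsity S_total (Eq. (16)),
    and total energy P = N_total * p (Eq. (17)), with p = 19 pJ per spike."""
    layer_sparsity = [n / (m + n) for n, m in zip(active_per_layer, inactive_per_layer)]
    n_total = sum(active_per_layer)
    m_total = sum(inactive_per_layer)
    overall_sparsity = n_total / (m_total + n_total)
    total_energy = n_total * energy_per_spike        # joules
    return layer_sparsity, overall_sparsity, total_energy

# Sanity check against the figures reported later in Section IV: about 13 million
# active neurons at 19 pJ per spike gives 13e6 * 19e-12 = 2.47e-4 J = 0.247 mJ.
```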
IV. TRAINING AND EXPERIMENTS
In this section, we describe the proposed network and its training for the evaluation on the KITTI data in detail.

A. TRAINING DETAILS
We evaluated the proposed network using the challenging KITTI object detection benchmark, which contained 7481 and 7518 samples for training and testing, respectively. Because the input was LiDAR data, we only focused on bird's-eye-view and 3D object detection for cars, pedestrians, and cyclists. Similar to the literature, each class was evaluated based on three difficulty levels (i.e., easy, moderate, and hard), considering the object's size, distance, occlusion, and truncation.

The detailed architecture of the proposed network is given in Table 1. Although our proposed network was based on the Complex-YOLO design from [19], we used only 9 spiking-convolutional layers, 1 traditional convolutional layer, 5 maxpool layers, and 3 intermediate layers. As a comparison, [19] used 18 convolutional layers, 5 maxpool layers, and 3 intermediate layers. Our proposed network was comparatively simpler than that of [19]. From the perspective of energy consumption, the simpler the network architecture, the less energy the network will consume.

As a special note, in the last layer we used a traditional convolutional layer instead of a spiking-convolutional layer. Using a spiking-convolutional layer as the last layer would degrade performance compared with the current design, which was a surprising observation from our preliminary tests. A heuristic explanation is that, because the input values of the SCNN are time information, negative values are not allowed, whereas the values of some coordinates of the real 3D LiDAR data are negative based on the range presented in Section II. Traditional convolutional layers can handle negative values with the linear activation function f(x) = x.

Another special note is that we tested our network both with skip connections (SC) and without skip connections. As introduced by He et al. [6] and Orhan and Pitkow [11], SCs are simply extra connections between nodes at different layers of a neural network that skip one or more layers of nonlinear processing. They can improve the training of very deep neural networks without changing their main structure. Hence, in our network, we used SCs to improve the performance. However, the latest detection chips (e.g., HTPU from CORAL and Akida from BrainChip) do not support SCs. The main reason is that the nodes of the hardware do not support event-packet synchronization across multiple layers. An SC also consumes extra energy in practice. Due to these considerations, we also experimented with our network without SC.

The network was trained from scratch using stochastic gradient descent with a weight decay of 5e−4 and a momentum of 0.9. The implementation was based on a modified version of the YOLOv2 framework [17].


TABLE 1. Proposed network architecture.

Because the proposed model was a supervised learning-based network and the testing samples in the KITTI dataset have no labels, following the practice of [1], [19], [24], we divided the training set with available ground truth and allocated 85% of the data for training and 15% for testing. To train the model well, during the first epoch we started with a learning rate of 5e−5 to ensure convergence. After four epochs, we scaled the learning rate up to 5e−4 and then gradually decreased it over the course of training, for up to 1000 epochs.
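A minimal PyTorch-style sketch (ours; the authors used a modified YOLOv2 framework rather than this code) of the optimizer settings and learning-rate schedule described above; the exact shape of the decay after epoch four is our assumption.

```python
import torch

def make_optimizer(model):
    # SGD with momentum 0.9 and weight decay 5e-4, starting at the warm-up rate 5e-5.
    return torch.optim.SGD(model.parameters(), lr=5e-5, momentum=0.9, weight_decay=5e-4)

def set_learning_rate(optimizer, epoch, total_epochs=1000):
    """Hold 5e-5 for the first four epochs, jump to 5e-4, then decay gradually
    toward the end of training (the linear decay here is only illustrative)."""
    if epoch < 4:
        lr = 5e-5
    else:
        lr = 5e-4 * (1.0 - (epoch - 4) / float(total_epochs - 4))
    for group in optimizer.param_groups:
        group["lr"] = lr
```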
Based on Eqs. (15) and (16), we obtained the sparsity of each layer and the sparsity of the overall network for each sample. There were 1122 samples for validation, and the minimum, maximum, and mean sparsity values of our networks are given in Table 2. For the two trained networks (with SC and without SC), there was no major difference between their maximum and minimum sparsity. The mean sparsity of the networks was 56.24% for the KITTI dataset.

TABLE 2. Sparsity of the network for the KITTI dataset.

Assume each spike consumes 19 pJ. The energy consumption of the network was then 0.247 mJ, because the number of active neurons was about 13 million on average. Note that the estimated energy consumption does not include the energy spent in the last layer of the network and the preprocessing layer, because these two layers do not involve spiking neurons. In CNN-based networks and many other SNN-based networks, all neurons are used for object detection or recognition. Thus, their energy consumption should be much higher than the value we obtained. The small energy consumption value connotes the energy efficiency of our proposed network.
B. EXPERIMENTS
We set up our experiments following the official KITTI evaluation protocol, where the IoU thresholds were 0.7 for the Car class and 0.5 for the Pedestrian and Cyclist classes. The IoU threshold was the same for both the bird's-eye-view and the full 3D evaluation. We compared the methods using the average precision (AP) metric.
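For reference, a simple sketch (ours) of the IoU test used to count a detection as correct; it uses axis-aligned boxes, whereas the official KITTI evaluation uses rotated boxes, so this is only illustrative.

```python
def iou_2d(box_a, box_b):
    """Axis-aligned bird's-eye-view IoU; boxes are (x1, y1, x2, y2).
    A detection counts as correct when IoU >= 0.7 (Car) or 0.5 (Pedestrian/Cyclist)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```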
1) EVALUATION IN THE BIRD'S-EYE VIEW
Our evaluation results for bird's-eye-view detection are given in Table 3. Simon et al. [19] compared their proposed model, Complex-YOLO, with the first five leading models presented in Table 3 and demonstrated that their model outperformed all five in terms of running time and efficiency, while still achieving detection accuracy comparable with the state of the art. As noted, although Complexer-YOLO [20] was more complicated than Complex-YOLO, its detection accuracies for all classes were lower than those of Complex-YOLO. Hence, we first focused on the comparison between Complex-YOLO and our network. As seen, all accuracy values of our proposed network with SC for detecting the car, pedestrian, and cyclist were higher than those of Complex-YOLO. Our network with SC showed better performance in object detection than our network without SC. Therefore, in the sequel, we mainly consider the model with SC and compare it with the others.

Running the experiments on an NVIDIA GTX 1080i GPU, our network achieved a frame rate of 35.7 fps.


TABLE 3. Performance comparison for bird's-eye-view detection: APs (in %) for our proposed networks compared with existing leading models.

FIGURE 6. Qualitative results. 3D boxes detected with LiDAR are projected onto the RGB images.

Although this frame rate was lower than the 50.4 fps of Complex-YOLO, it was still much higher than those of the other five models. We also compared our detection results with the latest ones in [31]–[33]. Even though Point-RCNN [31] had higher accuracies than the others in Car and Cyclist (easy and moderate) detection, its frame rate was low. As observed in Table 3, at the hard cyclist level, our proposed model had higher accuracies than the others, and its accuracies at the other levels were competitive with the others.

As an ablation study, following the architecture in Fig. 1, we used a CNN instead of an SCNN to build the network. This CNN-based network had a higher frame rate but much lower accuracy at all levels than the SCNN network, which indicates the benefits of using the SCNN.

2) 3D OBJECT DETECTION
Apart from the bird's-eye-view detection, we applied our proposed network to 3D object detection. The results are presented in Table 4. Similar to [19], we did not directly estimate the height information with regression but instead used a fixed spatial height location extracted from the ground truth to implement the 3D object detection. As seen in Table 4, the detection accuracies of our network with SC on the car, pedestrian, and cyclist were all better than those of Complex-YOLO. Additionally, the accuracies of the proposed network with SC were comparable to those of the other models. Moreover, our network reached its highest accuracy at the moderate and hard cyclist levels. The proposed network with SC in all cases showed better performance than the network without SC and the ablation network with CNN only.

To illustrate the detection performance of our proposed network with SC, several 3D detection examples are presented in Fig. 6. For better visualization, we projected the 3D boxes detected using LiDAR onto the red-green-blue (RGB) images. As seen in Fig. 6, the proposed network with SC produced highly accurate 3D bounding boxes in all categories.


TABLE 4. Performance comparison for 3D object detection: APs (in %) for our proposed networks compared with existing leading models.

V. CONCLUSION
Existing LiDAR-based 3D real-time object detection methods use CNNs. Although they can achieve high detection accuracy, their high energy consumption is a great concern for practical vehicular applications. This paper is the first to report the development of an SCNN-based YOLOv2 architecture for real-time object detection over the KITTI 3D point-cloud dataset that considers the energy consumption. We designed a novel data preprocessing layer to translate the 3D point clouds directly into spike times. To better show the energy efficiency of the proposed network in real-time object detection, we built an analog neuron circuit to obtain the energy cost of each spike. We also proposed an energy consumption and network sparsity estimation method. Our proposed network had a mean spiking sparsity of 56.24% and consumed an average of only 0.247 mJ, indicating high energy efficiency. Experimental results over the KITTI dataset demonstrated that our proposed network reached state-of-the-art accuracy in the bird's-eye-view and full 3D detection. In some cases, our proposed network performed better than other typical models reported in the literature.

ACKNOWLEDGMENT
The authors would like to thank Prof. Wenfeng Zhao of Binghamton University for the valuable suggestions during the revision process.

REFERENCES
[1] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, ‘‘Multi-view 3D object detection network for autonomous driving,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 6526–6534.
[2] M. Engelcke, D. Rao, D. Z. Wang, C. H. Tong, and I. Posner, ‘‘Vote3Deep: Fast object detection in 3D point clouds using efficient convolutional neural networks,’’ in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), May 2017, pp. 1355–1361.
[3] A. Geiger, P. Lenz, and R. Urtasun, ‘‘Are we ready for autonomous driving? The KITTI vision benchmark suite,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 3354–3361.
[4] A. Gonzalez, G. Villalonga, J. Xu, D. Vazquez, J. Amores, and A. M. Lopez, ‘‘Multiview random forest of local experts combining RGB and LiDAR data for pedestrian detection,’’ in Proc. IEEE Intell. Vehicles Symp., Seoul, South Korea, Jun. 2015, pp. 356–361.
[5] R. Girshick, J. Donahue, T. Darrell, and J. Malik, ‘‘Rich feature hierarchies for accurate object detection and semantic segmentation,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2014, pp. 580–587.
[6] K. He, X. Zhang, S. Ren, and J. Sun, ‘‘Deep residual learning for image recognition,’’ 2015, arXiv:1512.03385. [Online]. Available: http://arxiv.org/abs/1512.03385
[7] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. Waslander, ‘‘Joint 3D proposal generation and object detection from view aggregation,’’ 2017, arXiv:1712.02294. [Online]. Available: http://arxiv.org/abs/1712.02294
[8] S. R. Kheradpisheh, M. Ganjtabesh, S. J. Thorpe, and T. Masquelier, ‘‘STDP-based spiking deep convolutional neural networks for object recognition,’’ Neural Netw., vol. 99, pp. 56–67, Mar. 2018.
[9] H. Mostafa, ‘‘Supervised learning based on temporal coding in spiking neural networks,’’ IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 7, pp. 3227–3235, Jul. 2018.
[10] H. Valavi, P. J. Ramadge, E. Nestler, and N. Verma, ‘‘A mixed-signal binarized convolutional-neural-network accelerator integrating dense weight storage and multiplication for reduced data movement,’’ in Proc. IEEE Symp. VLSI Circuits, Jun. 2018, pp. 141–142.
[11] A. E. Orhan and X. Pitkow, ‘‘Skip connections eliminate singularities,’’ 2017, arXiv:1701.09175. [Online]. Available: http://arxiv.org/abs/1701.09175
[12] C. Premebida, J. Carreira, J. Batista, and U. Nunes, ‘‘Pedestrian detection combining RGB and dense LiDAR data,’’ in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Sep. 2014, pp. 4112–4117.
[13] F. Ponulak and A. Kasinski, ‘‘Introduction to spiking neural networks: Information processing, learning and applications,’’ Acta Neurobiol. Exp., vol. 71, no. 4, pp. 409–433, 2011.
[14] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, ‘‘PointNet: Deep learning on point sets for 3D classification and segmentation,’’ 2016, arXiv:1612.00593. [Online]. Available: http://arxiv.org/abs/1612.00593
[15] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, ‘‘PointNet++: Deep hierarchical feature learning on point sets in a metric space,’’ 2017, arXiv:1706.02413. [Online]. Available: http://arxiv.org/abs/1706.02413
[16] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, ‘‘Frustum PointNets for 3D object detection from RGB-D data,’’ 2017, arXiv:1711.08488. [Online]. Available: http://arxiv.org/abs/1711.08488
[17] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, ‘‘You only look once: Unified, real-time object detection,’’ 2015, arXiv:1506.02640. [Online]. Available: http://arxiv.org/abs/1506.02640
[18] S. Ren, K. He, R. Girshick, and J. Sun, ‘‘Faster R-CNN: Towards real-time object detection with region proposal networks,’’ 2015, arXiv:1506.01497. [Online]. Available: http://arxiv.org/abs/1506.01497


[19] M. Simon, S. Milz, K. Amende, and H.-M. Gross, ‘‘Complex-YOLO: Real-time 3D object detection on point clouds,’’ 2018, arXiv:1803.06199. [Online]. Available: http://arxiv.org/abs/1803.06199
[20] M. Simon, K. Amende, A. Kraus, J. Honer, T. Sämann, H. Kaulbersch, S. Milz, and H. M. Gross, ‘‘Complexer-YOLO: Real-time 3D object detection and tracking on semantic point clouds,’’ 2019, arXiv:1904.07537. [Online]. Available: http://arxiv.org/abs/1904.07537
[21] A. Tavanaei, M. Ghodrati, S. R. Kheradpisheh, T. Masquelier, and A. Maida, ‘‘Deep learning in spiking neural networks,’’ Neural Netw., vol. 111, pp. 47–63, Mar. 2019.
[22] D. Z. Wang and I. Posner, ‘‘Voting for voting in online point cloud object detection,’’ in Proc. Robot., Sci. Syst., Rome, Italy, Jul. 2015, pp. 1–9.
[23] S. Zhou, Y. Chen, Q. Ye, and J. Li, ‘‘Direct training based spiking convolutional neural networks for object recognition,’’ 2019, arXiv:1909.10837. [Online]. Available: http://arxiv.org/abs/1909.10837
[24] Y. Zhou and O. Tuzel, ‘‘VoxelNet: End-to-end learning for point cloud based 3D object detection,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Salt Lake City, UT, USA, Jun. 2018, pp. 4490–4499.
[25] B. Behroozpour, P. A. M. Sandborn, M. C. Wu, and B. E. Boser, ‘‘LiDAR system architectures and circuits,’’ IEEE Commun. Mag., vol. 55, no. 10, pp. 135–142, Oct. 2017.
[26] P. Dayan and L. F. Abbott, Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. Cambridge, MA, USA: MIT Press, 2001.
[27] A. Biswas and A. P. Chandrakasan, ‘‘Conv-RAM: An energy-efficient SRAM with embedded convolution computation for low-power CNN-based machine learning applications,’’ in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2018, pp. 488–490.
[28] J. Göltz, A. Baumbach, S. Billaudelle, O. Breitwieser, D. Dold, L. Kriener, A. F. Kungl, W. Senn, J. Schemmel, K. Meier, and M. A. Petrovici, ‘‘Fast and deep neuromorphic learning with time-to-first-spike coding,’’ 2019, arXiv:1912.11443. [Online]. Available: http://arxiv.org/abs/1912.11443
[29] I. M. Comsa, K. Potempa, L. Versari, T. Fischbacher, A. Gesmundo, and J. Alakuijala, ‘‘Temporal coding in spiking neural networks with alpha synaptic function,’’ 2019, arXiv:1907.13223. [Online]. Available: http://arxiv.org/abs/1907.13223
[30] S. R. Kheradpisheh and T. Masquelier, ‘‘S4NN: Temporal backpropagation for spiking neural networks with one spike per neuron,’’ 2019, arXiv:1910.09495. [Online]. Available: http://arxiv.org/abs/1910.09495
[31] S. Shi, X. Wang, and H. Li, ‘‘PointRCNN: 3D object proposal generation and detection from point cloud,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 770–779.
[32] J. Deng and K. Czarnecki, ‘‘MLOD: A multi-view 3D object detection based on robust feature fusion method,’’ in Proc. IEEE Intell. Transp. Syst. Conf. (ITSC), Oct. 2019, pp. 279–284.
[33] Z. Wang, H. Fu, L. Wang, L. Xiao, and B. Dai, ‘‘SCNet: Subdivision coding network for object detection based on 3D point cloud,’’ IEEE Access, vol. 7, pp. 120449–120462, 2019.

SHIBO ZHOU received the B.S. degree in mechatronic engineering from the Hebei University of Architecture, Hebei, China, in 2013, and the M.S. degree in mechanical engineering from The University of Texas at Arlington, Arlington, TX, USA, in 2016. He is currently pursuing the Ph.D. degree in electrical engineering with The State University of New York at Binghamton, Binghamton, NY, USA. His research interests include low-power neuromorphic ASIC simulation and development, novel unsupervised learning algorithms, spiking neural networks for object recognition and detection, computer vision and machine learning, deep neural networks, and 3D perception for self-driving cars.

YING CHEN received the Ph.D. degree in industrial engineering from The University of Texas at Arlington. He is currently an Assistant Professor with the School of Economics and Management, Harbin Institute of Technology. His research interests include data mining, machine learning, and decision-making under uncertainty and optimization.

XIAOHUA LI (Senior Member, IEEE) received the B.S. and M.S. degrees from Shanghai Jiao Tong University, Shanghai, China, in 1992 and 1995, respectively, and the Ph.D. degree in electrical engineering from the University of Cincinnati, Cincinnati, OH, USA, in 2000. He was an Assistant Professor with the Department of Electrical and Computer Engineering, The State University of New York at Binghamton, Binghamton, NY, USA, from 2000 to 2006, where he has been an Associate Professor since 2006. His research interests include signal processing, machine learning, deep learning, wireless communications, and wireless information assurance.

ARINDAM SANYAL (Member, IEEE) received the B.E. degree from Jadavpur University, India, in 2007, the M.Tech. degree from IIT Kharagpur, in 2009, and the Ph.D. degree from The University of Texas at Austin, in 2016. He is currently an Assistant Professor with the Electrical Engineering Department, The State University of New York at Buffalo. Prior to this, he was a Design Engineer working on low jitter PLLs at Silicon Laboratories, Austin. His research interests include analog/mixed signal design, bio-medical sensor design, analog security, and on-chip artificial neural networks. He was a recipient of the Mamraj Agarwal Award, in 2001, the Intel/Texas Instruments/Catalyst Foundation CICC Student Scholarship Award, in 2014, and the National Science Foundation CISE Research Initiation Initiative (CRII) Award, in 2020.
