
Received March 22, 2020; accepted April 19, 2020; date of publication April 27, 2020; date of current version May 7, 2020.

Digital Object Identifier 10.1109/ACCESS.2020.2990416

Deep SCNN-Based Real-Time Object Detection for Self-Driving Vehicles Using LiDAR Temporal Data

SHIBO ZHOU1, YING CHEN2, XIAOHUA LI1 (Senior Member, IEEE), AND ARINDAM SANYAL3 (Member, IEEE)
1 Department of Electrical and Computer Engineering, The State University of New York at Binghamton, Binghamton, NY 13902, USA
2 Department of Management Science and Engineering, School of Management, Harbin Institute of Technology, Harbin 150000, China
3 Department of Electrical Engineering, The State University of New York at Buffalo, Buffalo, NY 14260, USA

Corresponding author: Ying Chen (yingchen@hit.edu.cn)

This work was supported in part by the National Natural Science Foundation of China under Grant 91846301.

ABSTRACT Real-time accurate detection of three-dimensional (3D) objects is a fundamental necessity for self-driving vehicles. Most existing computer vision approaches are based on convolutional neural networks (CNNs). Although CNN-based approaches can achieve high detection accuracy, their high energy consumption is a severe drawback. To resolve this problem, novel energy-efficient approaches should be explored. The spiking neural network (SNN) is a promising candidate because it has orders-of-magnitude lower energy consumption than a CNN. Unfortunately, the study of SNNs has so far been limited to small networks, and the application of SNNs to large 3D object detection networks remains largely open. In this paper, we integrate a spiking convolutional neural network (SCNN) with temporal coding into the YOLOv2 architecture for real-time object detection. To take advantage of spiking signals, we develop a novel data preprocessing layer that translates 3D point-cloud data into spike time data. We propose an analog circuit to implement the non-leaky integrate and fire neuron used in our SCNN, from which the energy consumption of each spike is estimated. Moreover, we present a method to calculate the network sparsity and the energy consumption of the overall network. Extensive experiments have been conducted on the KITTI dataset, which show that the proposed network can reach detection accuracy competitive with existing approaches, yet with much lower average energy consumption. If implemented in dedicated hardware, our network could have a mean sparsity of 56.24% and an extremely low total energy consumption of only 0.247 mJ per frame. Implemented on an NVIDIA GTX 1080i GPU, the network achieves a frame rate of 35.7 fps, high enough for real-time object detection.

INDEX TERMS Spiking convolutional neural network, LiDAR temporal data, energy consumption, real-time object detection.

I. INTRODUCTION
In recent years, increased attention has been paid to point cloud data processing for autonomous driving applications because of significant improvements in automotive light detection and ranging (LiDAR) sensors, which deliver three-dimensional (3D) point clouds of the environment in real time. Point cloud data have highly variant density distributions throughout the measurement area [14], which can be exploited for object detection [1], [7], [22]. Nevertheless, unlike camera images, LiDAR point clouds are unordered and sparse, which makes real-time object detection difficult.

To address the point cloud object detection challenge, many approaches have been proposed, which can be divided into three general classes. The first class projects point clouds into a perspective view and detects objects via image-based algorithms [4], [12]. The second class converts point clouds into a 3D voxel grid and uses hand-crafted features to encode each voxel [2], [22]. The third class is similar to the second but replaces the hand-crafted features with machine-learned features [5].

Owing to the machine-learned features, the third class can achieve much better object detection performance. Qi et al. [14] proposed PointNet, which learns point-wise features of point clouds using deep neural networks. Qi et al. [15] proposed PointNet++ to allow networks to learn local structures at different scales.


Zhou and Tuzel [24] developed the VoxelNet method, which can learn discriminative feature representations from point clouds and predict accurate 3D bounding boxes in an end-to-end module. Simon et al. [19] developed Complex-YOLO, a real-time 3D object detector that uses an enhanced region-proposal network (E-RPN) to estimate the orientation of objects, coded with imaginary and real parts for each box. Recently, Simon et al. [20] presented a novel fusion of neural networks (i.e., Complexer-YOLO) that uses a state-of-the-art 3D detector and visual semantic segmentation in the field of autonomous driving. The accuracy of these methods has been demonstrated on the KITTI vision benchmark dataset [3].

There is much less work that focuses on the energy consumption of real-time object detection, although low energy consumption is a critical requirement for many practical applications such as autonomous vehicles. Convolutional neural networks (CNNs) have been the most popular technique for object detection [20], [24]. However, their high energy consumption has been a challenging issue. By comparison, it is well known that spiking neural networks (SNNs) are energy efficient and can potentially have orders-of-magnitude lower energy consumption than CNNs [21]. Although SNNs have been investigated far less than CNNs, numerous studies have shown that SNNs are able to achieve similarly high image classification accuracy [8], [9]. One of the major challenges for SNNs lies in the non-differentiability of spiking activities, which makes the training of large-scale complex networks difficult. Zhou et al. [23] proposed a direct-training-based spiking CNN (SCNN) that could recognize the CIFAR-10 dataset using much less energy than a CNN while reaching the same state-of-the-art accuracy. Nevertheless, efficient SCNN methods have not yet been reported for more complex data sets such as the KITTI 3D point clouds.

For point cloud data, a special issue is how to translate the point-cloud format into a spiking format suitable for SNNs. If the input data are camera images, the input image can be converted into spike trains based on pixel intensity [21] or encoded into spike times [9]. Unfortunately, this approach is not efficient enough for point cloud data because of the sparsity and uneven density distributions of point clouds.

In this paper, similar to [20] and [24], we first quantize the point cloud into a 3D voxel representation so as to reduce the amount of input data. Then, we design an innovative data-preprocessing layer that converts the 3D voxels into spike signals by adding time information to each voxel. Using this special temporal coding method, the input data are converted into spike times directly, which permits us to design SCNNs with energy-efficient temporal coding. Finally, we use such SCNNs to replace the CNNs of the YOLOv2 architecture [17] to develop a large-scale object detection network. Note that, similar to [23], the network is an end-to-end object detection network that combines feature extraction and bounding-box prediction.

We evaluate our developed network on the bird's-eye-view and 3D detection tasks provided by the KITTI benchmark. Implemented on an NVIDIA GTX 1080i graphics processing unit (GPU), experimental results show that our network can reach a high frame rate of 35.7 fps, enough for real-time operation. Additionally, the detection accuracies for cars, pedestrians, and cyclists can reach the state-of-the-art level. To show the potential energy efficiency of our network, we propose an analog circuit implementation of the spiking neuron, based on which our proposed network would consume an average of only 0.247 mJ for processing each frame. This connotes our proposed network's high performance and energy efficiency.

The contributions of this paper are listed as follows:
1) We develop a novel data preprocessing layer to add temporal information to voxel data. With such temporal coding, we develop SCNNs with sparse spiking patterns to save energy.
2) We combine the SCNNs with the YOLOv2 architecture to develop an efficient object detection network. We propose two variants of the network: one with skip connections (SC) and one without SC. The network without SC is suitable for current neuromorphic chips, while the one with SC can be implemented with future neuromorphic chips that support skip connections.
3) We provide an analog circuit to implement the non-leaky integrate and fire neuron used in our SCNNs, based on which the energy consumption of a spike is estimated. We also provide a way to estimate the sparsity of the network. Combining low spike energy consumption and high network sparsity, the overall network energy consumption will be much lower than that of existing models. Simulation results demonstrate the extremely low energy consumption of our network.

II. NETWORK ARCHITECTURE
As shown in Fig. 1, the proposed network comprises three functional blocks: a point cloud data preprocessing layer (LiDAR spike generation), spiking-convolution layers (feature learning), and a detection layer. In the following subsections, we provide a detailed description of each of these three blocks.

A. PREPROCESSING LAYER
For 3D point cloud data such as the KITTI data, each point i is described by a four-number vector (x_i, y_i, z_i, r_i), where x_i, y_i, z_i are the 3D position of the reflecting object point and r_i is the received laser light reflection intensity. The LiDAR device emits a pulse, which is reflected from the object and received by the LiDAR reception device. If the laser emission equipment is the origin of the coordinate system, the distance between each reflection point i and the laser emission equipment can be calculated as

d_i = \sqrt{x_i^2 + y_i^2 + z_i^2}.    (1)

Considering the volume of the space that LiDAR scans, a 3D point cloud data set is usually huge and sparse.


FIGURE 1. Proposed object detection architecture.

Using the 3D point cloud directly as input to deep networks is not computationally efficient. To reduce complexity (or to compress the LiDAR data set), one way is to quantize the point clouds with a 3D voxel representation [19]. We adopt this approach and consider the 3D region of [0, 60] m × [−40, 40] m × [−2.73, 1.27] m. Specifically, we consider all the KITTI data points with x_i ∈ [0, 60] m, y_i ∈ [−40, 40] m, and z_i ∈ [−2.73, 1.27] m. We quantize the region into 768 × 1024 × 21 voxels, with the size of each voxel cell approximately equal to 0.08 × 0.08 × 0.19 m^3, as shown in Fig. 2. The voxelated space is a regular 3D coordinate system (x_v, y_v, z_v), with length x_v ∈ [0, 767], width y_v ∈ [0, 1023], and height z_v ∈ [0, 20]. With the voxel representation, we construct a 3D tensor with shape 768 × 1024 × 21 from each KITTI 3D point cloud data file and use it as input to the deep networks.

FIGURE 2. Voxelized point clouds within a 3D coordinate system.

One of the differences between our work and existing voxel-based works is that we use the propagation time as the value of each voxel (i.e., tensor element) rather than the received light intensity or the number of data points. As introduced in Behroozpour et al. [25], the round-trip delay of the LiDAR emitted light for each point can be calculated using the following equation:

t_i = \frac{2 d_i}{c},    (2)

where c is the propagation speed of the laser pulse. Since t_i is the time information of each point, it can be used to represent the spike time for SCNN object detection networks.

For the voxel (x_v, y_v, z_v), the value t_v is calculated according to the t_i of all points inside this voxel. Due to the sparsity of the point cloud, many voxels contain no data points. In this case, we let t_v = 0. For the voxels that have one or more data points, we randomly select one data point and use its time information to represent the voxel, i.e., t_v = t_i for a randomly selected data point i in this voxel. Note that we tried using the average time t_v = \frac{1}{I} \sum_{i=1}^{I} t_i over all the I data points in this voxel, but found that this averaging method did not lead to an obvious performance gain. Therefore, we adopt the simpler random selection method.

This propagation-time-based encoding method is more desirable than other encoding methods. The intensity has large variations due to object shapes, materials, and environmental conditions such as rain, which increases the difficulty of deep network generalization. In contrast, the propagation time for each voxel is more regular. Compared with using the number of data points as the voxel value, the propagation time leads naturally to an SCNN with a temporal coding that directly uses t_v as the spiking time. This temporal coding considers short-term stimuli to produce a small number of spikes [26], and thus leads to sparse spiking patterns.

After the LiDAR-data-to-spike-time conversion, the input to the first layer of the spiking convolution is the 3D tensor with shape 768 × 1024 × 21. The element value of the tensor is t_v. In the convolutional layers, we adopt 2D convolutions rather than 3D convolutions, where the filter sliding is conducted over the first two dimensions (length and width) only. For example, the first spiking convolutional layer uses filters with shape 3 × 3 × 21. In other words, we take the height dimension as the color dimension of conventional images. The reason is that the height dimension has a shape of 21, which is very small and much less than the other two dimensions. Using 2D convolution leads to a great reduction of the computational complexity without obvious performance loss.
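As a concrete illustration of this preprocessing layer, the following is a minimal NumPy sketch (our own illustration, not the authors' released code) that converts a KITTI-style point array into the 768 × 1024 × 21 spike-time tensor, assuming the region bounds, voxel counts, and random-point selection rule described above; all function and variable names are ours.

```python
import numpy as np

C_LIGHT = 3e8  # propagation speed of the laser pulse (m/s)

def points_to_spike_tensor(points, rng=np.random.default_rng(0)):
    """points: (N, 4) array of (x, y, z, r) LiDAR returns.
    Returns a (768, 1024, 21) tensor of spike times t_v (0 for empty voxels),
    following Eqs. (1)-(2) and the random-point rule of Section II-A."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]

    # Keep only the region [0, 60] m x [-40, 40] m x [-2.73, 1.27] m.
    mask = (x >= 0) & (x <= 60) & (y >= -40) & (y <= 40) & (z >= -2.73) & (z <= 1.27)
    x, y, z = x[mask], y[mask], z[mask]

    # Round-trip time of each point: t_i = 2 * d_i / c, with d_i = sqrt(x^2 + y^2 + z^2).
    t = 2.0 * np.sqrt(x**2 + y**2 + z**2) / C_LIGHT

    # Quantize into 768 x 1024 x 21 voxels (cell size ~ 0.08 x 0.08 x 0.19 m^3).
    xv = np.minimum((x / (60.0 / 768)).astype(int), 767)
    yv = np.minimum(((y + 40.0) / (80.0 / 1024)).astype(int), 1023)
    zv = np.minimum(((z + 2.73) / (4.0 / 21)).astype(int), 20)

    tensor = np.zeros((768, 1024, 21), dtype=np.float32)
    # Shuffle the points so that, for voxels with several points, the point kept by
    # the final write is effectively a random member of the voxel (t_v = t_i).
    order = rng.permutation(len(t))
    tensor[xv[order], yv[order], zv[order]] = t[order]
    return tensor
```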


B. SPIKING-CONVOLUTION LAYERS
Following Mostafa [9], Göltz et al. [28], Comsa et al. [29], and Kheradpisheh et al. [30], in the spiking-convolution layers we use non-leaky integrate and fire neurons with exponentially decaying synaptic current kernels. A neuron's membrane dynamics are described by

\frac{dV_{mem}^{j}(t)}{dt} = \sum_i w_{ji} \sum_r k(t - t_i^r),    (3)

where the right-hand side of Eq. (3) is the synaptic current, V_{mem}^{j} is the membrane potential of neuron j, w_{ji} is the weight of the synaptic connection from neuron i to neuron j, t_i^r is the time of the r-th spike from neuron i, and k(x) is the synaptic current kernel given by

k(x) = \theta(x) \exp\!\left(-\frac{x}{\tau_{syn}}\right), \quad \theta(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ 0 & \text{otherwise.} \end{cases}    (4)

The synaptic current jumps immediately when an input spike arrives and then decays exponentially with a time constant \tau_{syn}. Both r and \tau_{syn} are set to 1 for the rest of this paper.

FIGURE 3. A model of a spiking neuron involving accumulating and thresholding operations.

We assume that a neuron receives N spikes at times {t_1, ..., t_N} with weights {w_1, ..., w_N} from N source neurons, and these spike times accumulate. As shown in Fig. 3, the neuron spikes when its membrane potential exceeds the firing threshold. After a spike, the membrane potential automatically resets to 0.

If the neuron spikes at time t_{out}, the membrane potential for t < t_{out} can be derived as

V_{mem}(t) = \sum_{i=1}^{N} \theta(t - t_i)\, w_i \,(1 - \exp(-(t - t_i))).    (5)

Assume the threshold membrane potential is 1. Then t_{out} satisfies

1 = \sum_{i \in C} w_i (1 - \exp(-(t_{out} - t_i))),    (6)

where C = {i : t_i < t_{out}}. Therefore,

\exp(t_{out}) = \frac{\sum_{i \in C} w_i \exp(t_i)}{\sum_{i \in C} w_i - 1}.    (7)

We use this non-leaky integrate-and-fire neuron model to implement the convolutional neural networks. For example, in the first convolutional layer, the input neuron time (t_i of Eq. (7)) is the voxel time t_v. The convolution kernel consists of the weights w_i in Eq. (7). The output neuron value is t_{out}, which is calculated by Eq. (7). As a result, communication between convolution layers occurs entirely through the temporally coded spike signals.

The convolution operation is illustrated in Fig. 4. Prior to the matrix multiplication, we sort the spiking-time values t_i in the small square of the input data in layer n, from small to large, into a vector T. The elements w_i in the convolutional kernel are reordered to generate a matrix W according to the changed positions of the spiking-time values. According to Eq. (7), for w_i ∈ W and t_i ∈ T, a dot product is performed between w_i and exp(t_i). The neuron fires when the accumulated dot products reach the threshold. Consequently, the value of the mapping element, t_{out}, becomes the output. As noted, after the neuron fires, it is not allowed to fire again.

FIGURE 4. Illustration of spiking convolution in layer n.

Two important remarks are necessary. First, the sorting of t_i, which can be computationally demanding, is not needed when the SCNN is implemented in dedicated hardware. This is because t_i represents the actual arrival time of the input neuron: a smaller t_i means a spike arrives earlier and is thus accumulated earlier. Second, the input neurons with time t_i > t_{out} do not participate in the dot product and accumulation. As a matter of fact, these neurons would never fire spikes when implemented in dedicated hardware. Therefore, with temporal coding, the fired spikes can be quite sparse, which greatly conserves energy.

Using the SCNN to replace the CNN of the YOLOv2 backbone in [17], we obtain a new YOLOv2 network that detects and locates objects from spiking-time data. Combined with the real-time nature of YOLOv2, the presented network can be implemented efficiently on a neuromorphic architecture.
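To make the firing rule concrete, here is a small numerical sketch (ours, not the authors' implementation) that computes t_{out} from Eq. (7) under the unit-threshold and \tau_{syn} = 1 assumptions above: input spikes are scanned in time order, and the earliest causal set C whose candidate firing time precedes the next input spike is accepted.

```python
import numpy as np

def firing_time(times, weights):
    """times, weights: 1D arrays for the N input spikes of one neuron.
    Returns t_out solving Eq. (7) with threshold 1, or None if the neuron never fires."""
    order = np.argsort(times)          # sort input spikes by arrival time
    t, w = times[order], weights[order]

    num, den = 0.0, -1.0               # running sum(w_i * e^{t_i}) and sum(w_i) - 1
    for k in range(len(t)):
        num += w[k] * np.exp(t[k])
        den += w[k]
        if den <= 0.0:
            continue                   # membrane potential cannot reach the threshold yet
        t_out = np.log(num / den)      # candidate from Eq. (7) with C = first k+1 spikes
        # Accept the candidate only if it occurs before the next input spike, so that
        # the causal set C = {i : t_i < t_out} is exactly these k+1 spikes.
        if k == len(t) - 1 or t_out <= t[k + 1]:
            return t_out
    return None                        # the membrane potential never crosses the threshold
```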
C. DETECTION LAYER
Following Simon et al. [19], we use the E-RPN to derive the object's position b_x, b_y, length b_l, width b_w, probability p_0, class scores {p_1, ..., p_n}, and orientation b_\phi. To achieve a proper orientation, we exploit the updated Grid-RPN approach from [19] to obtain

b_x = \sigma(t_x) + c_x,    (8)
b_y = \sigma(t_y) + c_y,    (9)
b_w = p_w e^{t_w},    (10)
b_l = p_l e^{t_l},    (11)
b_\phi = \arg(|z| e^{i b_\phi}) = \arctan_2(t_{Im}, t_{Re}),    (12)


where t_x, t_y, t_w, t_l, t_{Im}, and t_{Re} are the 6 coordinate parameters for each bounding box that the network predicts. (c_x, c_y) is the cell offset from the top left corner of the image. The bounding box prior has width and length p_w and p_l, respectively. In addition, t_x, t_y, t_w, t_l, t_{Im}, and t_{Re} are the responsible regression parameters. With t_x, t_y, t_w, t_l, and \arctan_2(t_{Im}, t_{Re}), we can easily calculate the position, width, length, and angle of each bounding box.
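For illustration, a short sketch (ours) showing how one predicted box is decoded from its regression parameters according to Eqs. (8)-(12); sigma denotes the logistic function, and the argument names are ours.

```python
import numpy as np

def decode_box(t_x, t_y, t_w, t_l, t_im, t_re, c_x, c_y, p_w, p_l):
    """Decode one predicted box from its 6 regression parameters (Eqs. (8)-(12)).
    (c_x, c_y): offset of the grid cell; (p_w, p_l): width/length of the box prior."""
    sigma = lambda v: 1.0 / (1.0 + np.exp(-v))   # logistic activation
    b_x = sigma(t_x) + c_x                        # Eq. (8)
    b_y = sigma(t_y) + c_y                        # Eq. (9)
    b_w = p_w * np.exp(t_w)                       # Eq. (10)
    b_l = p_l * np.exp(t_l)                       # Eq. (11)
    b_phi = np.arctan2(t_im, t_re)                # Eq. (12): orientation from the complex angle
    return b_x, b_y, b_w, b_l, b_phi
```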
Our regression parameters are directly linked to the loss function L_{loss} based on the Complex-YOLO of [19]. Specifically, the loss function is defined as

L_{loss} = L_{YOLO} + L_{Euler},    (13)

where L_{YOLO} is the sum of squared errors using the multi-part loss introduced in [17]. Additionally, according to [19], the Euler regression part of the loss function, L_{Euler}, is defined as

L_{Euler} = \lambda_{coord} \sum_{i=0}^{s^2} \sum_{j=0}^{B} \alpha_{ij}^{obj} \left[ (t_{im} - \hat{t}_{im})^2 + (t_{re} - \hat{t}_{re})^2 \right],    (14)

where \alpha_{ij}^{obj} indicates the j-th bounding-box predictor in cell i, which has the highest intersection over union (IoU) in comparison with the ground truth for that prediction. \lambda_{coord} is a scaling factor used to guarantee stable convergence during the early phases. \hat{t}_{im} and \hat{t}_{re} are the estimated responsible regression parameters.
sion parameters. fires and produces an output of ‘1’ which represents firing
of neuron j. The comparator output also resets the second
III. ENERGY CONSUMPTION integrator.
Similar to [19] and [24], our network is a single-stage detector The amplifiers in the integrators are designed using simple
that can be trained in an end-to-end manner. The spike-based two-stage operational-amplifier structure with 55dB gain.
time coding strategy we use in this paper is similar to the A strongARM latch architecture is used to design the com-
scheme of time-to-first-spike in [13]. Hence, our proposed parator. Designed in 65nm CMOS process, the neuron con-
network will have lower running time than CNN-based mod- sumes 19 pJ energy for 1MHz input pulses. Since each neuron
els [5], [18]. Moreover, our network is more energy efficient is only allowed to spike no more than once, each spike is thus
because the network signals are transmitted via spikes. In averaged to consume p = 19 pJ on the analog circuit. Static
this section, we analyze the energy consumption by first power consumed by the amplifiers dominates the energy
proposing an analog circuit of the spiking neuron, and then consumption of each neuron. However, gain-bandwidth of the
pointing out the sparsity of our proposed network. amplifiers scale with frequency of input spikes, and energy


Since all the computations are done in the analog domain, the proposed neuron does not need digital memory to store intermediate results [10], [27], and hence eliminates frequent data movements between memory and computation circuits. Thus, the proposed analog neuron is expected to have significantly lower energy consumption compared to a digital implementation of the neuron. The weights of the spiking neural network, i.e., the strengths of the synaptic connections between neurons, are encoded as capacitance values, which further reduces the storage requirement.

The implementation shown in Fig. 5 can be used when the SCNN has been trained and the weights are fixed. To make the neural weights programmable, we can apply non-volatile memory (NVM) on chip to make the capacitors programmable. Each capacitor can be split into several slices, and the weights stored in the NVM can select the number of slices for each capacitor and thereby set its weight. For a given application, if the weights are updated only once at start-up, then the energy consumption due to memory access is not an issue. Memory access energy only becomes problematic if the memory has to be accessed every cycle of operation. Therefore, in a programmable deep network, the major cost would be the area increase, since the memory has to be stored on the chip. The power increase is not going to be significant.

B. NETWORK SPARSITY FOR ENERGY EFFICIENCY
As seen in Eq. (6), a neuron spikes only when its membrane potential reaches the threshold. After the neuron spikes, it is not allowed to spike again. To recognize a pattern or detect a scene, if the first N neurons from the input layer can cause a neuron of the next layer to spike, the remaining M neurons in the input layer will not spike, which means many neurons will not spike in each layer. Hence, we can calculate the sparsity of each layer using the following equation:

S_i = \frac{N_i}{M_i + N_i},    (15)

where S_i is the sparsity of the i-th layer, N_i is the number of spiking (active) neurons in the i-th layer, and M_i is the number of non-active neurons in the i-th layer. The overall sparsity of a K-layer network is

S_{total} = \frac{N_{total}}{M_{total} + N_{total}} = \frac{\sum_{i=1}^{K} N_i}{\sum_{i=1}^{K} M_i + \sum_{i=1}^{K} N_i},    (16)

where M_{total} is the total number of non-active neurons and N_{total} is the total number of active neurons. The total energy consumption of the network can be calculated as

P = N_{total} \times p,    (17)

where p is the energy consumed by each spike.
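The sparsity and energy bookkeeping of Eqs. (15)-(17) reduces to a few sums; the following sketch (ours) makes that explicit, with p = 19 pJ per spike taken from the analog neuron above.

```python
def network_energy(active_per_layer, inactive_per_layer, energy_per_spike=19e-12):
    """Layer sparsities S_i (Eq. (15)), overall sparsity S_total (Eq. (16)),
    and total energy P = N_total * p (Eq. (17)), with p = 19 pJ per spike."""
    layer_sparsity = [n / (m + n) for n, m in zip(active_per_layer, inactive_per_layer)]
    n_total = sum(active_per_layer)
    m_total = sum(inactive_per_layer)
    overall_sparsity = n_total / (m_total + n_total)
    total_energy = n_total * energy_per_spike        # joules
    return layer_sparsity, overall_sparsity, total_energy

# Sanity check against the figures reported later in Section IV: about 13 million
# active neurons at 19 pJ per spike gives 13e6 * 19e-12 = 2.47e-4 J = 0.247 mJ.
```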
IV. TRAINING AND EXPERIMENTS
In this section, we describe the proposed network and its training for the evaluation on the KITTI data in detail.

A. TRAINING DETAILS
We evaluated the proposed network using the challenging KITTI object detection benchmark, which contained 7481 and 7518 samples for training and testing, respectively. Because the input was LiDAR data, we only focused on bird's-eye-view and 3D object detection for cars, pedestrians, and cyclists. Similar to the literature, each class was evaluated based on three difficulty levels (i.e., easy, moderate, and hard), considering the object's size, distance, occlusion, and truncation.

The detailed architecture of the proposed network is given in Table 1. Although our proposed network was based on the Complex-YOLO design from [19], we used only 9 spiking-convolutional layers, 1 traditional convolutional layer, 5 maxpool layers, and 3 intermediate layers. As a comparison, [19] used 18 convolutional layers, 5 maxpool layers, and 3 intermediate layers. Our proposed network was comparatively simpler than that of [19]. From the perspective of energy consumption, the simpler the network architecture, the less energy the network will consume.

As a special note, in the last layer we used a traditional convolutional layer instead of a spiking-convolutional layer. Using a spiking-convolutional layer as the last layer would degrade performance compared with the current design, which was a surprising observation from our preliminary tests. A heuristic explanation is that, because the input values of the SCNN are time information, negative values are not allowed, whereas the values of some coordinates of the real 3D LiDAR data are negative based on the range presented in Section II. Traditional convolutional layers can handle negative values with the linear activation function f(x) = x.

Another special note is that we tested our network both with skip connections (SC) and without skip connections. As introduced by He et al. [6] and Orhan and Pitkow [11], SCs are simply extra connections between nodes at different layers of a neural network that skip one or more layers of nonlinear processing. They can improve the training of very deep neural networks without changing their main structure. Hence, in our network, we used SCs to improve the performance. However, the latest detection chips (e.g., HTPU from CORAL and Akida from BrainChip) do not support SCs. The main reason is that the nodes of the hardware do not support event-packet synchronization across multiple layers. An SC also consumes extra energy in practice. Due to these considerations, we also experimented with our network without SC.

The network was trained from scratch using stochastic gradient descent with a weight decay of 5e−4 and a momentum of 0.9. The implementation was based on a modified version of the YOLOv2 framework [17].


TABLE 1. Proposed network architecture.

Because the proposed model was a supervised learning-based network and the testing samples in the KITTI dataset have no labels, following the practice of [1], [19], [24], we divided the training set with available ground truth and allocated 85% of the data for training and 15% for testing. To train the model well, during the first epoch we started with a learning rate of 5e−5 to ensure convergence. After four epochs, we scaled the learning rate up to 5e−4 and then gradually decreased it over the course of training, for up to 1000 epochs.
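A minimal PyTorch-style sketch (ours; the authors used a modified YOLOv2 framework rather than this code) of the optimizer settings and learning-rate schedule described above; the exact shape of the decay after epoch four is our assumption.

```python
import torch

def make_optimizer(model):
    # SGD with momentum 0.9 and weight decay 5e-4, starting at the warm-up rate 5e-5.
    return torch.optim.SGD(model.parameters(), lr=5e-5, momentum=0.9, weight_decay=5e-4)

def set_learning_rate(optimizer, epoch, total_epochs=1000):
    """Hold 5e-5 for the first four epochs, jump to 5e-4, then decay gradually
    toward the end of training (the linear decay here is only illustrative)."""
    if epoch < 4:
        lr = 5e-5
    else:
        lr = 5e-4 * (1.0 - (epoch - 4) / float(total_epochs - 4))
    for group in optimizer.param_groups:
        group["lr"] = lr
```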
Based on Eqs. (15) and (16), we obtained the sparsity of each layer and the sparsity of the overall network for each sample. There were 1122 samples for validation, and the minimum, maximum, and mean sparsity values of our networks are given in Table 2. For the two trained networks (with SC and without SC), there was no major difference between their maximum and minimum sparsity. The mean sparsity of the networks was 56.24% for the KITTI dataset.

TABLE 2. Sparsity of the network for the KITTI dataset.

Assume each spike consumes 19 pJ. The energy consumption of the network was then 0.247 mJ, because the number of active neurons was about 13 million on average. Note that the estimated energy consumption does not include the energy spent in the last layer of the network and the preprocessing layer, because these two layers do not involve spiking neurons. In CNN-based networks and many other SNN-based networks, all neurons are used for object detection or recognition. Thus, their energy consumption should be much higher than the value we obtained. The small energy consumption value connotes the energy efficiency of our proposed network.
B. EXPERIMENTS
We set up our experiments following the official KITTI evaluation protocol, where the IoU thresholds were 0.7 for the Car class and 0.5 for the Pedestrian and Cyclist classes. The IoU threshold was the same for both the bird's-eye-view and the full 3D evaluation. We compared the methods using the average precision (AP) metric.
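For reference, a simple sketch (ours) of the IoU test used to count a detection as correct; it uses axis-aligned boxes, whereas the official KITTI evaluation uses rotated boxes, so this is only illustrative.

```python
def iou_2d(box_a, box_b):
    """Axis-aligned bird's-eye-view IoU; boxes are (x1, y1, x2, y2).
    A detection counts as correct when IoU >= 0.7 (Car) or 0.5 (Pedestrian/Cyclist)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```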
1) EVALUATION IN THE BIRD'S-EYE VIEW
Our evaluation results for bird's-eye-view detection are given in Table 3. Simon et al. [19] compared their proposed model, Complex-YOLO, with the first five leading models presented in Table 3 and demonstrated that their model outperformed all five in terms of running time and efficiency, while still achieving detection accuracy comparable with the state of the art. As noted, although Complexer-YOLO [20] was more complicated than Complex-YOLO, its detection accuracies for all classes were lower than those of Complex-YOLO. Hence, we first focused on the comparison between Complex-YOLO and our network. As seen, all accuracy values of our proposed network with SC for detecting the car, pedestrian, and cyclist were higher than those of Complex-YOLO. Our network with SC showed better performance in object detection than our network without SC. Therefore, in the sequel, we mainly consider the model with SC and compare it with the others.

Running the experiments on an NVIDIA GTX 1080i GPU, our network achieved a frame rate of 35.7 fps.


TABLE 3. Performance comparison for bird's-eye-view detection: APs (in %) for our proposed networks compared with existing leading models.

FIGURE 6. Qualitative results. 3D boxes detected with LiDAR are projected onto the RGB images.

Although this frame rate was lower than the 50.4 fps of Complex-YOLO, it was still much higher than those of the other five models. We also compared our detection results with the latest ones in [31]–[33]. Even though Point-RCNN [31] had higher accuracies than the others in Car and Cyclist (easy and moderate) detection, its frame rate was low. As observed in Table 3, at the hard cyclist level, our proposed model had higher accuracies than the others, and its accuracies at the other levels were competitive with the others.

As an ablation study, following the architecture in Fig. 1, we used a CNN instead of an SCNN to build the network. This CNN-based network had a higher frame rate but much lower accuracy at all levels than the SCNN network, which indicates the benefits of using the SCNN.

2) 3D OBJECT DETECTION
Apart from the bird's-eye-view detection, we applied our proposed network to 3D object detection. The results are presented in Table 4. Similar to [19], we did not directly estimate the height information with regression but instead used a fixed spatial height location extracted from the ground truth to implement the 3D object detection. As seen in Table 4, the detection accuracies of our network with SC on the car, pedestrian, and cyclist were all better than those of Complex-YOLO. Additionally, the accuracies of the proposed network with SC were comparable to those of the other models. Moreover, our network reached its highest accuracy at the moderate and hard cyclist levels. The proposed network with SC in all cases showed better performance than the network without SC and the ablation network with CNN only.

To illustrate the detection performance of our proposed network with SC, several 3D detection examples are presented in Fig. 6. For better visualization, we projected the 3D boxes detected using LiDAR onto the red-green-blue (RGB) images. As seen in Fig. 6, the proposed network with SC produced highly accurate 3D bounding boxes in all categories.


TABLE 4. Performance comparison for 3D object detection: APs (in %) for our proposed networks compared with existing leading models.

V. CONCLUSION
Existing LiDAR-based 3D real-time object detection methods use CNNs. Although they can achieve high detection accuracy, their high energy consumption is a great concern for practical vehicular applications. This paper is the first to report the development of an SCNN-based YOLOv2 architecture for real-time object detection over the KITTI 3D point-cloud dataset that considers the energy consumption. We designed a novel data preprocessing layer to translate the 3D point clouds directly into spike times. To better show the energy efficiency of the proposed network in real-time object detection, we built an analog neuron circuit to obtain the energy cost of each spike. We also proposed an energy consumption and network sparsity estimation method. Our proposed network had a mean spiking sparsity of 56.24% and consumed an average of only 0.247 mJ, indicating high energy efficiency. Experimental results over the KITTI dataset demonstrated that our proposed network reached state-of-the-art accuracy in the bird's-eye-view and full 3D detection. In some cases, our proposed network performed better than other typical models reported in the literature.

ACKNOWLEDGMENT
The authors would like to thank Prof. Wenfeng Zhao of Binghamton University for the valuable suggestions during the revision process.

REFERENCES
[1] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, ‘‘Multi-view 3D object detection network for autonomous driving,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 6526–6534.
[2] M. Engelcke, D. Rao, D. Z. Wang, C. H. Tong, and I. Posner, ‘‘Vote3Deep: Fast object detection in 3D point clouds using efficient convolutional neural networks,’’ in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), May 2017, pp. 1355–1361.
[3] A. Geiger, P. Lenz, and R. Urtasun, ‘‘Are we ready for autonomous driving? The KITTI vision benchmark suite,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 3354–3361.
[4] A. Gonzalez, G. Villalonga, J. Xu, D. Vazquez, J. Amores, and A. M. Lopez, ‘‘Multiview random forest of local experts combining RGB and LiDAR data for pedestrian detection,’’ in Proc. IEEE Intell. Vehicles Symp., Seoul, South Korea, Jun. 2015, pp. 356–361.
[5] R. Girshick, J. Donahue, T. Darrell, and J. Malik, ‘‘Rich feature hierarchies for accurate object detection and semantic segmentation,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2014, pp. 580–587.
[6] K. He, X. Zhang, S. Ren, and J. Sun, ‘‘Deep residual learning for image recognition,’’ 2015, arXiv:1512.03385. [Online]. Available: http://arxiv.org/abs/1512.03385
[7] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. Waslander, ‘‘Joint 3D proposal generation and object detection from view aggregation,’’ 2017, arXiv:1712.02294. [Online]. Available: http://arxiv.org/abs/1712.02294
[8] S. R. Kheradpisheh, M. Ganjtabesh, S. J. Thorpe, and T. Masquelier, ‘‘STDP-based spiking deep convolutional neural networks for object recognition,’’ Neural Netw., vol. 99, pp. 56–67, Mar. 2018.
[9] H. Mostafa, ‘‘Supervised learning based on temporal coding in spiking neural networks,’’ IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 7, pp. 3227–3235, Jul. 2018.
[10] H. Valavi, P. J. Ramadge, E. Nestler, and N. Verma, ‘‘A mixed-signal binarized convolutional-neural-network accelerator integrating dense weight storage and multiplication for reduced data movement,’’ in Proc. IEEE Symp. VLSI Circuits, Jun. 2018, pp. 141–142.
[11] A. E. Orhan and X. Pitkow, ‘‘Skip connections eliminate singularities,’’ 2017, arXiv:1701.09175. [Online]. Available: http://arxiv.org/abs/1701.09175
[12] C. Premebida, J. Carreira, J. Batista, and U. Nunes, ‘‘Pedestrian detection combining RGB and dense LiDAR data,’’ in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Sep. 2014, pp. 4112–4117.
[13] F. Ponulak and A. Kasinski, ‘‘Introduction to spiking neural networks: Information processing, learning and applications,’’ Acta Neurobiol. Exp., vol. 71, no. 4, pp. 409–433, 2011.
[14] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, ‘‘PointNet: Deep learning on point sets for 3D classification and segmentation,’’ 2016, arXiv:1612.00593. [Online]. Available: http://arxiv.org/abs/1612.00593
[15] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, ‘‘PointNet++: Deep hierarchical feature learning on point sets in a metric space,’’ 2017, arXiv:1706.02413. [Online]. Available: http://arxiv.org/abs/1706.02413
[16] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, ‘‘Frustum PointNets for 3D object detection from RGB-D data,’’ 2017, arXiv:1711.08488. [Online]. Available: http://arxiv.org/abs/1711.08488
[17] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, ‘‘You only look once: Unified, real-time object detection,’’ 2015, arXiv:1506.02640. [Online]. Available: http://arxiv.org/abs/1506.02640
[18] S. Ren, K. He, R. Girshick, and J. Sun, ‘‘Faster R-CNN: Towards real-time object detection with region proposal networks,’’ 2015, arXiv:1506.01497. [Online]. Available: http://arxiv.org/abs/1506.01497


[19] M. Simon, S. Milz, K. Amende, and H.-M. Gross, ‘‘Complex-YOLO: Real-time 3D object detection on point clouds,’’ 2018, arXiv:1803.06199. [Online]. Available: http://arxiv.org/abs/1803.06199
[20] M. Simon, K. Amende, A. Kraus, J. Honer, T. Sämann, H. Kaulbersch, S. Milz, and H. M. Gross, ‘‘Complexer-YOLO: Real-time 3D object detection and tracking on semantic point clouds,’’ 2019, arXiv:1904.07537. [Online]. Available: http://arxiv.org/abs/1904.07537
[21] A. Tavanaei, M. Ghodrati, S. R. Kheradpisheh, T. Masquelier, and A. Maida, ‘‘Deep learning in spiking neural networks,’’ Neural Netw., vol. 111, pp. 47–63, Mar. 2019.
[22] D. Z. Wang and I. Posner, ‘‘Voting for voting in online point cloud object detection,’’ in Proc. Robot., Sci. Syst., Rome, Italy, Jul. 2015, pp. 1–9.
[23] S. Zhou, Y. Chen, Q. Ye, and J. Li, ‘‘Direct training based spiking convolutional neural networks for object recognition,’’ 2019, arXiv:1909.10837. [Online]. Available: http://arxiv.org/abs/1909.10837
[24] Y. Zhou and O. Tuzel, ‘‘VoxelNet: End-to-end learning for point cloud based 3D object detection,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Salt Lake City, UT, USA, Jun. 2018, pp. 4490–4499.
[25] B. Behroozpour, P. A. M. Sandborn, M. C. Wu, and B. E. Boser, ‘‘LiDAR system architectures and circuits,’’ IEEE Commun. Mag., vol. 55, no. 10, pp. 135–142, Oct. 2017.
[26] P. Dayan and L. F. Abbott, Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. Cambridge, MA, USA: MIT Press, 2001.
[27] A. Biswas and A. P. Chandrakasan, ‘‘Conv-RAM: An energy-efficient SRAM with embedded convolution computation for low-power CNN-based machine learning applications,’’ in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2018, pp. 488–490.
[28] J. Göltz, A. Baumbach, S. Billaudelle, O. Breitwieser, D. Dold, L. Kriener, A. F. Kungl, W. Senn, J. Schemmel, K. Meier, and M. A. Petrovici, ‘‘Fast and deep neuromorphic learning with time-to-first-spike coding,’’ 2019, arXiv:1912.11443. [Online]. Available: http://arxiv.org/abs/1912.11443
[29] I. M. Comsa, K. Potempa, L. Versari, T. Fischbacher, A. Gesmundo, and J. Alakuijala, ‘‘Temporal coding in spiking neural networks with alpha synaptic function,’’ 2019, arXiv:1907.13223. [Online]. Available: http://arxiv.org/abs/1907.13223
[30] S. R. Kheradpisheh and T. Masquelier, ‘‘S4NN: Temporal backpropagation for spiking neural networks with one spike per neuron,’’ 2019, arXiv:1910.09495. [Online]. Available: http://arxiv.org/abs/1910.09495
[31] S. Shi, X. Wang, and H. Li, ‘‘PointRCNN: 3D object proposal generation and detection from point cloud,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 770–779.
[32] J. Deng and K. Czarnecki, ‘‘MLOD: A multi-view 3D object detection based on robust feature fusion method,’’ in Proc. IEEE Intell. Transp. Syst. Conf. (ITSC), Oct. 2019, pp. 279–284.
[33] Z. Wang, H. Fu, L. Wang, L. Xiao, and B. Dai, ‘‘SCNet: Subdivision coding network for object detection based on 3D point cloud,’’ IEEE Access, vol. 7, pp. 120449–120462, 2019.

SHIBO ZHOU received the B.S. degree in mechatronic engineering from the Hebei University of Architecture, Hebei, China, in 2013, and the M.S. degree in mechanical engineering from The University of Texas at Arlington, Arlington, TX, USA, in 2016. He is currently pursuing the Ph.D. degree in electrical engineering with The State University of New York at Binghamton, Binghamton, NY, USA. His research interests include low-power neuromorphic ASIC simulation and development, novel unsupervised learning algorithms, spiking neural networks for object recognition and detection, computer vision and machine learning, deep neural networks, and 3D perception for self-driving cars.

YING CHEN received the Ph.D. degree in industrial engineering from The University of Texas at Arlington. He is currently an Assistant Professor with the School of Economics and Management, Harbin Institute of Technology. His research interests include data mining, machine learning, and decision-making under uncertainty and optimization.

XIAOHUA LI (Senior Member, IEEE) received the B.S. and M.S. degrees from Shanghai Jiao Tong University, Shanghai, China, in 1992 and 1995, respectively, and the Ph.D. degree in electrical engineering from the University of Cincinnati, Cincinnati, OH, USA, in 2000. He was an Assistant Professor with the Department of Electrical and Computer Engineering, The State University of New York at Binghamton, Binghamton, NY, USA, from 2000 to 2006, where he has been an Associate Professor since 2006. His research interests include signal processing, machine learning, deep learning, wireless communications, and wireless information assurance.

ARINDAM SANYAL (Member, IEEE) received the B.E. degree from Jadavpur University, India, in 2007, the M.Tech. degree from IIT Kharagpur, in 2009, and the Ph.D. degree from The University of Texas at Austin, in 2016. He is currently an Assistant Professor with the Electrical Engineering Department, The State University of New York at Buffalo. Prior to this, he was a Design Engineer working on low jitter PLLs at Silicon Laboratories, Austin. His research interests include analog/mixed signal design, bio-medical sensor design, analog security, and on-chip artificial neural networks. He was a recipient of the Mamraj Agarwal Award, in 2001, the Intel/Texas Instruments/Catalyst Foundation CICC Student Scholarship Award, in 2014, and the National Science Foundation CISE Research Initiation Initiative (CRII) Award, in 2020.
