Voxelnet: End-To-End Learning For Point Cloud Based 3D Object Detection
Abstract
Figure 2. VoxelNet architecture. The feature learning network takes a raw point cloud as input, partitions the space into voxels, and transforms the points within each voxel to a vector representation characterizing the shape information. The space is represented as a sparse 4D tensor. The convolutional middle layers process the 4D tensor to aggregate spatial context. Finally, an RPN generates the 3D detection.
tures as in [29, 30] results in high computational and memory requirements. Scaling up 3D feature learning networks to orders of magnitude more points and to 3D detection tasks are the main challenges that we address in this paper.

Region proposal network (RPN) [32] is a highly optimized algorithm for efficient object detection [17, 5, 31, 24]. However, this approach requires data to be dense and organized in a tensor structure (e.g. image, video), which is not the case for typical LiDAR point clouds. In this paper, we close the gap between point set feature learning and RPN for the 3D detection task.

We present VoxelNet, a generic 3D detection framework that simultaneously learns a discriminative feature representation from point clouds and predicts accurate 3D bounding boxes, in an end-to-end fashion, as shown in Figure 2. We design a novel voxel feature encoding (VFE) layer, which enables inter-point interaction within a voxel by combining point-wise features with a locally aggregated feature. Stacking multiple VFE layers allows learning complex features for characterizing local 3D shape information. Specifically, VoxelNet divides the point cloud into equally spaced 3D voxels, encodes each voxel via stacked VFE layers, and then 3D convolution further aggregates local voxel features, transforming the point cloud into a high-dimensional volumetric representation. Finally, an RPN consumes the volumetric representation and yields the detection result. This efficient algorithm benefits both from the sparse point structure and from efficient parallel processing on the voxel grid.

We evaluate VoxelNet on the bird's eye view detection and the full 3D detection tasks provided by the KITTI benchmark [11]. Experimental results show that VoxelNet outperforms the state-of-the-art LiDAR based 3D detection methods by a large margin. We also demonstrate that VoxelNet achieves highly encouraging results in detecting pedestrians and cyclists from LiDAR point clouds.

1.1. Related Work

Rapid development of 3D sensor technology has motivated researchers to develop efficient representations to detect and localize objects in point clouds. Some of the earlier methods for feature representation are [39, 8, 7, 19, 40, 33, 6, 25, 1, 34, 2]. These hand-crafted features yield satisfactory results when rich and detailed 3D shape information is available. However, their inability to adapt to more complex shapes and scenes, and to learn the required invariances from data, resulted in limited success in uncontrolled scenarios such as autonomous navigation.

Given that images provide detailed texture information, many algorithms inferred the 3D bounding boxes from 2D images [4, 3, 42, 43, 44, 36]. However, the accuracy of image-based 3D detection approaches is bounded by the accuracy of the depth estimation.

Several LiDAR-based 3D object detection techniques utilize a voxel grid representation. [41, 9] encode each nonempty voxel with 6 statistical quantities that are derived from all the points contained within the voxel. [37] fuses multiple local statistics to represent each voxel. [38] computes the truncated signed distance on the voxel grid. [21] uses binary encoding for the 3D voxel grid. [5] introduces a multi-view representation for a LiDAR point cloud by computing a multi-channel feature map in the bird's eye view and the cylindrical coordinates in the frontal view. Several other studies project point clouds onto a perspective view and then use image-based feature encoding schemes [28, 15, 22].
There are also several multi-modal fusion methods that combine images and LiDAR to improve detection accuracy [10, 16, 5]. These methods provide improved performance compared to LiDAR-only 3D detection, particularly for small objects (pedestrians, cyclists) or when the objects are far, since cameras provide an order of magnitude more measurements than LiDAR. However, the need for an additional camera that is time synchronized and calibrated with the LiDAR restricts their use and makes the solution more sensitive to sensor failure modes. In this work we focus on LiDAR-only detection.
1.2. Contributions

• We propose a novel end-to-end trainable deep architecture for point-cloud-based 3D detection, VoxelNet, that directly operates on sparse 3D points and avoids information bottlenecks introduced by manual feature engineering.

• We present an efficient method to implement VoxelNet which benefits both from the sparse point structure and from efficient parallel processing on the voxel grid.

• We conduct experiments on the KITTI benchmark and show that VoxelNet produces state-of-the-art results in LiDAR-based car, pedestrian, and cyclist detection benchmarks.

2. VoxelNet

In this section we explain the architecture of VoxelNet, the loss function used for training, and an efficient algorithm to implement the network.

2.1. VoxelNet Architecture

The proposed VoxelNet consists of three functional blocks: (1) Feature learning network, (2) Convolutional middle layers, and (3) Region proposal network [32], as illustrated in Figure 2. We provide a detailed introduction of VoxelNet in the following sections.

2.1.1 Feature Learning Network

Voxel Partition Given a point cloud, we subdivide the 3D space into equally spaced voxels as shown in Figure 2. Suppose the point cloud encompasses a 3D space with range D, H, W along the Z, Y, X axes respectively. We define each voxel of size vD, vH, and vW accordingly. The resulting 3D voxel grid is of size D′ = D/vD, H′ = H/vH, W′ = W/vW. Here, for simplicity, we assume D, H, W are a multiple of vD, vH, vW.

Grouping We group the points according to the voxel they reside in. Due to factors such as distance, occlusion, object's relative pose, and non-uniform sampling, the LiDAR point cloud is sparse and has highly variable point density throughout the space. Therefore, after grouping, a voxel will contain a variable number of points. An illustration is shown in Figure 2, where Voxel-1 has significantly more points than Voxel-2 and Voxel-4, while Voxel-3 contains no point.

Random Sampling Typically a high-definition LiDAR point cloud is composed of ∼100k points. Directly processing all the points not only imposes increased memory/efficiency burdens on the computing platform, but the highly variable point density throughout the space might also bias the detection. To this end, we randomly sample a fixed number, T, of points from those voxels containing more than T points. This sampling strategy has two purposes: (1) computational savings (see Section 2.3 for details); and (2) reducing the imbalance of points between voxels, which decreases the sampling bias and adds more variation to training.

Stacked Voxel Feature Encoding The key innovation is the chain of VFE layers. For simplicity, Figure 2 illustrates the hierarchical feature encoding process for one voxel. Without loss of generality, we use VFE Layer-1 to describe the details in the following paragraph. Figure 3 shows the architecture for VFE Layer-1.

Figure 3. Voxel feature encoding layer.

Denote V = {p_i = [x_i, y_i, z_i, r_i]^T ∈ R^4}_{i=1...t} as a non-empty voxel containing t ≤ T LiDAR points, where p_i contains the XYZ coordinates of the i-th point and r_i is the received reflectance. We first compute the local mean as the centroid of all the points in V, denoted as (v_x, v_y, v_z). Then we augment each point p_i with its relative offset w.r.t. the centroid and obtain the input feature set V_in = {p̂_i = [x_i, y_i, z_i, r_i, x_i − v_x, y_i − v_y, z_i − v_z]^T ∈ R^7}_{i=1...t}. Next, each p̂_i is transformed through the fully connected network (FCN) into a feature space, where we can aggregate information from the point features f_i ∈ R^m to encode the shape of the surface contained within the voxel. The FCN is composed of a linear layer, a batch normalization (BN) layer, and a rectified linear unit (ReLU) layer. After obtaining point-wise feature representations, we use element-wise MaxPooling across all f_i associated to V to get the locally aggregated feature f̃ ∈ R^m for V.
Figure 4. Region proposal network architecture.
Finally, we augment each f_i with f̃ to form the point-wise concatenated feature as f_i^out = [f_i^T, f̃^T]^T ∈ R^(2m). Thus we obtain the output feature set V_out = {f_i^out}_{i=1...t}. All non-empty voxels are encoded in the same way and they share the same set of parameters in the FCN.

We use VFE-i(c_in, c_out) to represent the i-th VFE layer that transforms input features of dimension c_in into output features of dimension c_out. The linear layer learns a matrix of size c_in × (c_out/2), and the point-wise concatenation yields the output of dimension c_out.

Because the output feature combines both point-wise features and the locally aggregated feature, stacking VFE layers encodes point interactions within a voxel and enables the final feature representation to learn descriptive shape information. The voxel-wise feature is obtained by transforming the output of VFE-n into R^C via FCN and applying element-wise MaxPool, where C is the dimension of the voxel-wise feature, as shown in Figure 2.

Sparse Tensor Representation By processing only the non-empty voxels, we obtain a list of voxel features, each uniquely associated to the spatial coordinates of a particular non-empty voxel. The obtained list of voxel-wise features can be represented as a sparse 4D tensor of size C × D′ × H′ × W′, as shown in Figure 2. Although the point cloud contains ∼100k points, more than 90% of voxels typically are empty. Representing the non-empty voxel features as a sparse tensor greatly reduces the memory usage and computation cost during backpropagation, and it is a critical step in our efficient implementation.

2.1.2 Convolutional Middle Layers

We use ConvMD(c_in, c_out, k, s, p) to represent an M-dimensional convolution operator, where c_in and c_out are the numbers of input and output channels, and k, s, and p are the M-dimensional vectors corresponding to kernel size, stride size, and padding size respectively. When the size across the M dimensions is the same, we use a scalar to represent the size, e.g. k for k = (k, k, k).

Each convolutional middle layer applies 3D convolution, BN layer, and ReLU layer sequentially. The convolutional middle layers aggregate voxel-wise features within a progressively expanding receptive field, adding more context to the shape description. The detailed sizes of the filters in the convolutional middle layers are explained in Section 3.

2.1.3 Region Proposal Network

Recently, region proposal networks [32] have become an important building block of top-performing object detection frameworks [38, 5, 23]. In this work, we make several key modifications to the RPN architecture proposed in [32], and combine it with the feature learning network and convolutional middle layers to form an end-to-end trainable pipeline.

The input to our RPN is the feature map provided by the convolutional middle layers. The architecture of this network is illustrated in Figure 4. The network has three blocks of fully convolutional layers. The first layer of each block downsamples the feature map by half via a convolution with a stride size of 2, followed by a sequence of convolutions of stride 1 (×q means q applications of the filter). After each convolution layer, BN and ReLU operations are applied. We then upsample the output of every block to a fixed size and concatenate to construct the high resolution feature map. Finally, this feature map is mapped to the desired learning targets: (1) a probability score map and (2) a regression map.
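For reference, a PyTorch-style sketch of this RPN is given below, using the Conv2D/Deconv2D block configurations listed in Figure 4 (arguments read as (c_in, c_out, k, s, p)). The module structure is our assumption, and the padding of the last upsampling layer is set to 1 here (the figure lists 0) so that the three upsampled maps share the same spatial size before concatenation:

```python
import torch
import torch.nn as nn

def conv(c_in, c_out, k, s, p):
    # Conv2D(c_in, c_out, k, s, p) followed by BN and ReLU.
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, s, p),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

def deconv(c_in, c_out, k, s, p):
    # Deconv2D(c_in, c_out, k, s, p) followed by BN and ReLU.
    return nn.Sequential(nn.ConvTranspose2d(c_in, c_out, k, s, p),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class RPN(nn.Module):
    """Region proposal network with the block configuration of Figure 4."""
    def __init__(self):
        super().__init__()
        # Each block halves the map with its first stride-2 convolution.
        self.block1 = nn.Sequential(conv(128, 128, 3, 2, 1),
                                    *[conv(128, 128, 3, 1, 1) for _ in range(3)])
        self.block2 = nn.Sequential(conv(128, 128, 3, 2, 1),
                                    *[conv(128, 128, 3, 1, 1) for _ in range(5)])
        self.block3 = nn.Sequential(conv(128, 256, 3, 2, 1),
                                    *[conv(256, 256, 3, 1, 1) for _ in range(5)])
        # Upsample every block output to the same (H'/2, W'/2) resolution.
        self.up1 = deconv(128, 256, 3, 1, 1)   # figure lists padding 0; 1 keeps sizes aligned
        self.up2 = deconv(128, 256, 2, 2, 0)
        self.up3 = deconv(256, 256, 4, 4, 0)
        # Heads: probability score map and regression map.
        self.score_head = nn.Conv2d(768, 2, 1)
        self.reg_head = nn.Conv2d(768, 14, 1)

    def forward(self, x):
        x1 = self.block1(x)                    # (N, 128, H'/2, W'/2)
        x2 = self.block2(x1)                   # (N, 128, H'/4, W'/4)
        x3 = self.block3(x2)                   # (N, 256, H'/8, W'/8)
        feat = torch.cat([self.up1(x1), self.up2(x2), self.up3(x3)], dim=1)  # (N, 768, ...)
        return self.score_head(feat), self.reg_head(feat)
```

With a 128 × H′ × W′ input, the two heads produce a 2-channel probability score map and a 14-channel regression map (two anchor rotations × 7 regression targets) at H′/2 × W′/2 resolution.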
2.2. Loss Function

Let {a_i^pos}_{i=1...N_pos} be the set of N_pos positive anchors and {a_j^neg}_{j=1...N_neg} be the set of N_neg negative anchors. We parameterize a 3D ground truth box as (x_c^g, y_c^g, z_c^g, l^g, w^g, h^g, θ^g), where x_c^g, y_c^g, z_c^g represent the center location, l^g, w^g, h^g are the length, width, and height of the box, and θ^g is the yaw rotation around the Z-axis. To retrieve the ground truth box from a matching positive anchor parameterized as (x_c^a, y_c^a, z_c^a, l^a, w^a, h^a, θ^a), we define the residual vector u* ∈ R^7 containing the 7 regression targets corresponding to the center location ∆x, ∆y, ∆z, the three dimensions ∆l, ∆w, ∆h, and the rotation ∆θ, which are computed as:

∆x = (x_c^g − x_c^a) / d^a,   ∆y = (y_c^g − y_c^a) / d^a,   ∆z = (z_c^g − z_c^a) / h^a,
∆l = log(l^g / l^a),   ∆w = log(w^g / w^a),   ∆h = log(h^g / h^a),                      (1)
∆θ = θ^g − θ^a

where d^a = sqrt((l^a)^2 + (w^a)^2) is the diagonal of the base of the anchor box. Here, we aim to directly estimate the oriented 3D box and normalize ∆x and ∆y homogeneously with the diagonal d^a, which is different from [32, 38, 22, 21, 4, 3, 5]. We define the loss function as follows:

L = α (1/N_pos) Σ_i L_cls(p_i^pos, 1) + β (1/N_neg) Σ_j L_cls(p_j^neg, 0) + (1/N_pos) Σ_i L_reg(u_i, u_i^*)      (2)

where p_i^pos and p_j^neg represent the softmax output for positive anchor a_i^pos and negative anchor a_j^neg respectively, while u_i ∈ R^7 and u_i^* ∈ R^7 are the regression output and ground truth for positive anchor a_i^pos. The first two terms are the normalized classification losses for {a_i^pos}_{i=1...N_pos} and {a_j^neg}_{j=1...N_neg}, where L_cls stands for the binary cross entropy loss and α, β are positive constants balancing the relative importance. The last term L_reg is the regression loss, where we use the SmoothL1 function [12, 32].

2.3. Efficient Implementation

GPUs are optimized for processing dense tensor structures. The problem with working directly on the point cloud is that the points are sparsely distributed across space and each voxel has a variable number of points. We devised a method that converts the point cloud into a dense tensor structure where stacked VFE operations can be processed in parallel across points and voxels.

Figure 5. Illustration of efficient implementation.

The method is summarized in Figure 5. We initialize a K × T × 7 dimensional tensor structure to store the voxel input feature buffer, where K is the maximum number of non-empty voxels, T is the maximum number of points per voxel, and 7 is the input encoding dimension for each point. The points are randomized before processing. For each point in the point cloud, we check if the corresponding voxel already exists. This lookup operation is done efficiently in O(1) using a hash table where the voxel coordinate is used as the hash key. If the voxel is already initialized, we insert the point to the voxel location if there are fewer than T points; otherwise the point is ignored. If the voxel is not initialized, we initialize a new voxel, store its coordinate in the voxel coordinate buffer, and insert the point to this voxel location. The voxel input feature and coordinate buffers can be constructed via a single pass over the point list, so the construction complexity is O(n). To further improve the memory/compute efficiency, it is possible to only store a limited number of voxels (K) and ignore points coming from voxels with few points.

After the voxel input buffer is constructed, the stacked VFE only involves point level and voxel level dense operations which can be computed on a GPU in parallel. Note that, after the concatenation operations in VFE, we reset the features corresponding to empty points to zero such that they do not affect the computed voxel features. Finally, using the stored coordinate buffer we reorganize the computed sparse voxel-wise structures to the dense voxel grid. The following convolutional middle layers and RPN operations work on a dense voxel grid which can be efficiently implemented on a GPU.

3. Training Details

In this section, we explain the implementation details of VoxelNet and the training procedure.

3.1. Network Details

Our experimental setup is based on the LiDAR specifications of the KITTI dataset [11].

Car Detection For this task, we consider point clouds within the range of [−3, 1] × [−40, 40] × [0, 70.4] meters along the Z, Y, X axes respectively. Points that are projected outside of image boundaries are removed [5]. We choose a voxel size of vD = 0.4, vH = 0.2, vW = 0.2 meters, which leads to D′ = 10, H′ = 400, W′ = 352. We set T = 35 as the maximum number of randomly sampled points in each non-empty voxel. We use two VFE layers, VFE-1(7, 32) and VFE-2(32, 128). The final FCN maps the VFE-2 output to R^128. Thus our feature learning net generates a sparse tensor of shape 128 × 10 × 400 × 352. To aggregate voxel-wise features, we employ three convolutional middle layers sequentially as Conv3D(128, 64, 3, (2,1,1), (1,1,1)), Conv3D(64, 64, 3, (1,1,1), (0,1,1)), and Conv3D(64, 64, 3, (2,1,1), (1,1,1)), which yields a 4D tensor of size 64 × 2 × 400 × 352. After reshaping, the input to the RPN is a feature map of size 128 × 400 × 352, where the dimensions correspond to channel, height, and width of the 3D tensor.
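As a sanity check of these shapes, here is a short PyTorch-style sketch of the three convolutional middle layers (each Conv3D followed by BN and ReLU, per Section 2.1.2); the reduced H′ × W′ in the smoke test is only to keep the example light:

```python
import torch
import torch.nn as nn

def conv3d_block(c_in, c_out, k, s, p):
    # ConvMD(c_in, c_out, k, s, p) followed by BN and ReLU (Section 2.1.2).
    return nn.Sequential(nn.Conv3d(c_in, c_out, k, stride=s, padding=p),
                         nn.BatchNorm3d(c_out),
                         nn.ReLU(inplace=True))

# The three convolutional middle layers used for car detection.
middle = nn.Sequential(
    conv3d_block(128, 64, 3, (2, 1, 1), (1, 1, 1)),
    conv3d_block(64, 64, 3, (1, 1, 1), (0, 1, 1)),
    conv3d_block(64, 64, 3, (2, 1, 1), (1, 1, 1)),
)

# Dense voxel feature volume (C, D', H', W'); a reduced H', W' keeps the demo small.
x = torch.zeros(1, 128, 10, 100, 88)      # the full car-detection grid is (1, 128, 10, 400, 352)
y = middle(x)                             # -> (1, 64, 2, 100, 88)
rpn_in = y.reshape(1, 64 * 2, 100, 88)    # depth folded into channels: 128 x H' x W'
print(y.shape, rpn_in.shape)
```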
Figure 4 illustrates the detailed network architecture for this task. Unlike [5], we use only one anchor size, l^a = 3.9, w^a = 1.6, h^a = 1.56 meters, centered at z_c^a = −1.0 meters, with two rotations, 0 and 90 degrees. Our anchor matching criteria are as follows: An anchor is considered positive if it has the highest Intersection over Union (IoU) with a ground truth or its IoU with a ground truth is above 0.6 (in bird's eye view). An anchor is considered negative if its IoU with every ground truth box is less than 0.45. We treat anchors as don't care if they have 0.45 ≤ IoU ≤ 0.6 with any ground truth. We set α = 1.5 and β = 1 in Eqn. 2.

Pedestrian and Cyclist Detection The input range¹ is [−3, 1] × [−20, 20] × [0, 48] meters along the Z, Y, X axes respectively. We use the same voxel size as for car detection, which yields D′ = 10, H′ = 200, W′ = 240. We set T = 45 in order to obtain more LiDAR points for better capturing shape information. The feature learning network and convolutional middle layers are identical to the networks used in the car detection task. For the RPN, we make one modification to block 1 in Figure 4 by changing the stride size in the first 2D convolution from 2 to 1. This allows finer resolution in anchor matching, which is necessary for detecting pedestrians and cyclists. We use anchor size l^a = 0.8, w^a = 0.6, h^a = 1.73 meters centered at z_c^a = −0.6 meters with 0 and 90 degrees rotation for pedestrian detection, and anchor size l^a = 1.76, w^a = 0.6, h^a = 1.73 meters centered at z_c^a = −0.6 meters with 0 and 90 degrees rotation for cyclist detection. The specific anchor matching criteria are as follows: We assign an anchor as positive if it has the highest IoU with a ground truth, or its IoU with a ground truth is above 0.5. An anchor is considered negative if its IoU with every ground truth is less than 0.35. For anchors having 0.35 ≤ IoU ≤ 0.5 with any ground truth, we treat them as don't care.

¹ Our empirical observation suggests that beyond this range, LiDAR returns from pedestrians and cyclists become very sparse and therefore detection results will be unreliable.

During training, we use stochastic gradient descent (SGD) with a learning rate of 0.01 for the first 150 epochs and decrease the learning rate to 0.001 for the last 10 epochs. We use a batch size of 16 point clouds.

3.2. Data Augmentation

With fewer than 4000 training point clouds, training our network from scratch will inevitably suffer from overfitting. To reduce this issue, we introduce three different forms of data augmentation. The augmented training data are generated on-the-fly without the need to be stored on disk [20].

Define the set M = {p_i = [x_i, y_i, z_i, r_i]^T ∈ R^4}_{i=1,...,N} as the whole point cloud, consisting of N points. We parameterize a 3D bounding box b_i as (x_c, y_c, z_c, l, w, h, θ), where x_c, y_c, z_c are the center location, l, w, h are length, width, height, and θ is the yaw rotation around the Z-axis. We define Ω_i = {p | x ∈ [x_c − l/2, x_c + l/2], y ∈ [y_c − w/2, y_c + w/2], z ∈ [z_c − h/2, z_c + h/2], p ∈ M} as the set containing all LiDAR points within b_i, where p = [x, y, z, r] denotes a particular LiDAR point in the whole set M.

The first form of data augmentation applies perturbation independently to each ground truth 3D bounding box together with those LiDAR points within the box. Specifically, around the Z-axis we rotate b_i and the associated Ω_i with respect to (x_c, y_c, z_c) by a uniformly distributed random variable ∆θ ∈ [−π/10, +π/10]. Then we add a translation (∆x, ∆y, ∆z) to the XYZ components of b_i and to each point in Ω_i, where ∆x, ∆y, ∆z are drawn independently from a Gaussian distribution with mean zero and standard deviation 1.0. To avoid physically impossible outcomes, we perform a collision test between any two boxes after the perturbation and revert to the original if a collision is detected. Since the perturbation is applied to each ground truth box and the associated LiDAR points independently, the network is able to learn from substantially more variations than from the original training data.

Secondly, we apply global scaling to all ground truth boxes b_i and to the whole point cloud M. Specifically, we multiply the XYZ coordinates and the three dimensions of each b_i, and the XYZ coordinates of all points in M, with a random variable drawn from the uniform distribution [0.95, 1.05]. Introducing global scale augmentation improves the robustness of the network for detecting objects with various sizes and distances, as shown in image-based classification [35, 18] and detection tasks [12, 17].

Finally, we apply global rotation to all ground truth boxes b_i and to the whole point cloud M. The rotation is applied along the Z-axis and around (0, 0, 0). The global rotation offset is determined by sampling from the uniform distribution [−π/4, +π/4]. By rotating the entire point cloud, we simulate the vehicle making a turn.

4. Experiments

We evaluate VoxelNet on the KITTI 3D object detection benchmark [11], which contains 7,481 training images/point clouds and 7,518 test images/point clouds, covering three categories: Car, Pedestrian, and Cyclist. For each class, detection outcomes are evaluated based on three difficulty levels: easy, moderate, and hard, which are determined according to the object size, occlusion state, and truncation level. Since the ground truth for the test set is not available and the access to the test server is limited, we conduct comprehensive evaluation using the protocol described in [4, 3, 5] and subdivide the training data into a training set and a validation set, which results in 3,712 data samples for training and 3,769 data samples for validation.
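Before turning to the results, the global scaling and rotation augmentations of Section 3.2 can be summarized in a short NumPy sketch (the array layouts and the helper name are our assumptions; the per-box perturbation additionally requires the collision test described above):

```python
import numpy as np

def global_scale_and_rotate(points, boxes, rng=None):
    """Global scaling and global rotation from Section 3.2.

    points: (N, 4) array of [x, y, z, r] LiDAR points.
    boxes:  (B, 7) array of [xc, yc, zc, l, w, h, theta] ground truth boxes.
    Returns augmented copies of both arrays.
    """
    rng = np.random.default_rng() if rng is None else rng
    points, boxes = points.copy(), boxes.copy()

    # Global scaling: XYZ of every point, and location + dimensions of every box.
    s = rng.uniform(0.95, 1.05)
    points[:, :3] *= s
    boxes[:, :6] *= s

    # Global rotation around the Z-axis about the origin (0, 0, 0).
    phi = rng.uniform(-np.pi / 4, np.pi / 4)
    c, si = np.cos(phi), np.sin(phi)
    rot = np.array([[c, -si], [si, c]])     # 2D rotation acting on (x, y)
    points[:, :2] = points[:, :2] @ rot.T
    boxes[:, :2] = boxes[:, :2] @ rot.T
    boxes[:, 6] += phi                      # yaw shifts by the same angle
    return points, boxes
```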
Method  Modality  Car (Easy / Moderate / Hard)  Pedestrian (Easy / Moderate / Hard)  Cyclist (Easy / Moderate / Hard)
Mono3D [3] Mono 5.22 5.19 4.13 N/A N/A N/A N/A N/A N/A
3DOP [4] Stereo 12.63 9.49 7.59 N/A N/A N/A N/A N/A N/A
VeloFCN [22] LiDAR 40.14 32.08 30.47 N/A N/A N/A N/A N/A N/A
MV (BV+FV) [5] LiDAR 86.18 77.32 76.33 N/A N/A N/A N/A N/A N/A
MV (BV+FV+RGB) [5] LiDAR+Mono 86.55 78.10 76.67 N/A N/A N/A N/A N/A N/A
HC-baseline LiDAR 88.26 78.42 77.66 58.96 53.79 51.47 63.63 42.75 41.06
VoxelNet LiDAR 89.60 84.81 78.57 65.95 61.05 56.98 74.41 52.18 50.49
Table 1. Performance comparison in bird’s eye view detection: average precision (in %) on KITTI validation set.
Method  Modality  Car (Easy / Moderate / Hard)  Pedestrian (Easy / Moderate / Hard)  Cyclist (Easy / Moderate / Hard)
Mono3D [3] Mono 2.53 2.31 2.31 N/A N/A N/A N/A N/A N/A
3DOP [4] Stereo 6.55 5.07 4.10 N/A N/A N/A N/A N/A N/A
VeloFCN [22] LiDAR 15.20 13.66 15.98 N/A N/A N/A N/A N/A N/A
MV (BV+FV) [5] LiDAR 71.19 56.60 55.30 N/A N/A N/A N/A N/A N/A
MV (BV+FV+RGB) [5] LiDAR+Mono 71.29 62.68 56.56 N/A N/A N/A N/A N/A N/A
HC-baseline LiDAR 71.73 59.75 55.69 43.95 40.18 37.48 55.35 36.07 34.15
VoxelNet LiDAR 81.97 65.46 62.85 57.86 53.42 48.87 67.17 47.65 45.11
Table 2. Performance comparison in 3D detection: average precision (in %) on KITTI validation set.
The split avoids samples from the same sequence being included in both the training and the validation set [3]. Finally, we also present the test results using the KITTI server.

For the Car category, we compare the proposed method with several top-performing algorithms, including image based approaches: Mono3D [3] and 3DOP [4]; LiDAR based approaches: VeloFCN [22] and 3D-FCN [21]; and a multi-modal approach MV [5]. Mono3D [3], 3DOP [4] and MV [5] use a pre-trained model for initialization, whereas we train VoxelNet from scratch using only the LiDAR data provided in KITTI.

To analyze the importance of end-to-end learning, we implement a strong baseline that is derived from the VoxelNet architecture but uses hand-crafted features instead of the proposed feature learning network. We call this model the hand-crafted baseline (HC-baseline). HC-baseline uses the bird's eye view features described in [5], which are computed at 0.1 m resolution. Different from [5], we increase the number of height channels from 4 to 16 to capture more detailed shape information; further increasing the number of height channels did not lead to performance improvement. We replace the convolutional middle layers of VoxelNet with similar size 2D convolutional layers, which are Conv2D(16, 32, 3, 1, 1), Conv2D(32, 64, 3, 2, 1), Conv2D(64, 128, 3, 1, 1). Finally, the RPN is identical in VoxelNet and HC-baseline. The total number of parameters is very similar in HC-baseline and VoxelNet. We train the HC-baseline using the same training procedure and data augmentation described in Section 3.

4.1. Evaluation on KITTI Validation Set

Metrics We follow the official KITTI evaluation protocol, where the IoU threshold is 0.7 for class Car and 0.5 for classes Pedestrian and Cyclist. The IoU threshold is the same for both bird's eye view and full 3D evaluation. We compare the methods using the average precision (AP) metric.

Evaluation in Bird's Eye View The evaluation result is presented in Table 1. VoxelNet consistently outperforms all the competing approaches across all three difficulty levels. HC-baseline also achieves satisfactory performance compared to the state-of-the-art [5], which shows that our base region proposal network (RPN) is effective. For the Pedestrian and Cyclist detection tasks in bird's eye view, we compare the proposed VoxelNet with HC-baseline. VoxelNet yields substantially higher AP than the HC-baseline for these more challenging categories, which shows that end-to-end learning is essential for point-cloud based detection.

We would like to note that [21] reported 88.9%, 77.3%, and 72.7% for the easy, moderate, and hard levels respectively, but these results are obtained based on a different split of 6,000 training frames and ∼1,500 validation frames, and they are not directly comparable with the algorithms in Table 1. Therefore, we do not include these results in the table.

Evaluation in 3D Compared to the bird's eye view detection, which requires only accurate localization of objects in the 2D plane, 3D detection is a more challenging task as it requires finer localization of shapes in 3D space. Table 2 summarizes the comparison. For the class Car, VoxelNet significantly outperforms all other approaches in AP across all difficulty levels. Specifically, using only LiDAR, VoxelNet significantly outperforms the multi-modal approach MV (BV+FV+RGB) [5], which uses both LiDAR and RGB images (see Table 2).
Figure 6. Qualitative results for Car, Pedestrian, and Cyclist. For better visualization, the 3D boxes detected using LiDAR are projected onto the RGB images.