Pointseg: Real-Time Semantic Segmentation Based On 3D Lidar Point Cloud
Abstract— We propose PointSeg, a real-time end-to-end semantic segmentation method for road objects based on spherical images. The spherical image is transformed from the 3D LiDAR point cloud with the shape 64 × 512 × 5 and taken as the input of convolutional neural networks (CNNs) to predict the point-wise semantic mask.

are not suitable for production; (2) they cannot recognize the pixel-level object category in the way semantic segmentation does, which makes it difficult to apply them to some autonomous driving tasks. To solve these problems, we design a light network
took multi-scale features into consideration and combined image features from multiple refined paths. Among these methods, networks like Deeplabv3 and FCN were designed to push accuracy, while SegNet and ICNet are able to achieve real-time performance. Although they bring a large improvement in either speed or accuracy, these methods still trade one off against the other.

B. Convolutional neural networks with 3D point cloud data

3D data carries rich features and attracts much research attention. With the rapid development of deep learning, many methods apply convolutional neural networks (CNNs) to 3D point cloud data directly. 3DFCN [13] and VoxelNet [14] used a 3D-CNN [15] to extract features from width, height and depth simultaneously. MV3D [16] fused perception from a bird's-eye view, a front view and a camera view to obtain a more robust feature representation. In addition, some works [17] considered the representation of the three-dimensional data itself and divided it into voxels that carry features such as intensity, distance, local mean and disparity. Although all of the above methods achieve good accuracy, they still cost too much computation time, which limits their application in real-time tasks. In this paper, we aim to improve the real-time performance while keeping good accuracy.

to improve the 3D task performance. However, real-time performance is still challenging. SqueezeSeg [4] is similar to our task: it used SqueezeNet [3] as the backbone and achieved comparable results. However, it only relied on a CRF to refine the predicted 2D spherical masks, which can lose location information in the 3D space. Without considering the 3D constraints in the original point cloud, the results of SqueezeSeg are strongly limited by this CRF post-process.

III. METHOD

We introduce the generation of spherical images and the key features of the network structure in this section. Network structures and parameters are also included at the end of the section.

A. Spherical image generation from point cloud
Fig. 3: The network structure of PointSeg. Our network is based on the famous light-weight structures SqueezeNet [3] and SqueezeSeg [4]. Several ideas from state-of-the-art RGB semantic segmentation methods are incorporated, which improves both the efficiency and the accuracy on this 3D task. Fire is the fire layer as in SqueezeNet, EL is the enlargement layer and SR is the squeeze reweighting layer.
A direct voxel representation of the sparse LiDAR point cloud is memory-consuming because many voxels would be empty. To solve this problem, we transform the LiDAR point cloud by spherical projection, in the same way as SqueezeSeg [4], to obtain a dense representation:

\alpha = \arcsin\left(\frac{z}{\sqrt{x^2 + y^2 + z^2}}\right), \quad \bar{\alpha} = \left\lfloor \frac{\alpha}{\Delta\alpha} \right\rfloor, \qquad (1)

\beta = \arcsin\left(\frac{y}{\sqrt{x^2 + y^2}}\right), \quad \bar{\beta} = \left\lfloor \frac{\beta}{\Delta\beta} \right\rfloor, \qquad (2)

where \alpha and \beta are the zenith and azimuth angles, as shown in Fig. 2; \Delta\alpha and \Delta\beta are the resolutions that generate a fixed-shape spherical image; and \bar{\alpha} and \bar{\beta} are the indexes that set the positions of points on the 2D spherical image. After applying Equation 1 and Equation 2 to the point cloud data, we obtain an array of shape H × W × C. Here, the data is generated by a Velodyne HDL-64E LiDAR with 64 vertical channels, so we set H = 64. Considering that in a real self-driving system most attention is focused on the front view, we filter the data to keep only the front-view area (−45°, 45°) and discretize it into 512 indexes, so W = 512. C is the number of input channels. In this paper, we consider the x, y, z coordinates, the intensity of each point and the distance d = \sqrt{x^2 + y^2 + z^2}, five channels in total. Therefore, the transformed data has shape 64 × 512 × 5. By this transformation, we can feed it into traditional convolutional layers.

We directly extract features from the transformed data, which has a dense and regular distribution. The time cost is dramatically reduced compared with taking the raw 3D point cloud as input.
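To make the projection concrete, a minimal NumPy sketch is given below. The vertical field of view, the forward direction along +x, and the handling of out-of-range points are our assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def lidar_to_spherical(points, H=64, W=512,
                       v_fov=(np.radians(-24.8), np.radians(2.0)),
                       h_fov=(np.radians(-45.0), np.radians(45.0))):
    """Project an (N, 4) LiDAR scan (x, y, z, intensity) onto an
    H x W x 5 spherical image following Eq. (1) and (2).
    The vertical field of view and +x forward direction are assumptions."""
    x, y, z, intensity = points.T
    d = np.sqrt(x ** 2 + y ** 2 + z ** 2) + 1e-8              # range per point
    alpha = np.arcsin(z / d)                                    # zenith angle, Eq. (1)
    beta = np.arcsin(y / (np.sqrt(x ** 2 + y ** 2) + 1e-8))    # azimuth angle, Eq. (2)

    # Keep only the front-view area described in the text.
    keep = (x > 0) & (beta >= h_fov[0]) & (beta < h_fov[1]) \
               & (alpha >= v_fov[0]) & (alpha < v_fov[1])

    # Discretize the angles with resolutions delta_alpha, delta_beta so the
    # field of view maps onto H rows and W columns.
    d_alpha = (v_fov[1] - v_fov[0]) / H
    d_beta = (h_fov[1] - h_fov[0]) / W
    row = np.floor((alpha[keep] - v_fov[0]) / d_alpha).astype(int).clip(0, H - 1)
    col = np.floor((beta[keep] - h_fov[0]) / d_beta).astype(int).clip(0, W - 1)

    img = np.zeros((H, W, 5), dtype=np.float32)
    feats = np.stack([x, y, z, intensity, d], axis=1)[keep]     # the five channels
    img[row, col] = feats
    return img
```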
B. Network structure

The proposed PointSeg has three main functional layers: (1) the fire layer (from SqueezeNet [3]), (2) the squeeze reweighting layer and (3) the enlargement layer. The network structure is shown in Fig. 3.

1) Fire layer: Assessing SqueezeNet [3], we find that its fire unit can construct a light-weight network which achieves performance similar to AlexNet [23] while costing far fewer parameters. Therefore, we take the fire module as our basic network unit. We follow SqueezeNet to construct our feature extraction layers, shown in Fig. 3 (Fire1 to Fire9); the fire module itself is shown in Fig. 4 (a). We do not adopt MobileNet [24] or ShuffleNet [25], because neither can use different strides in height and width during downsampling, and downsampling the same number of times in height would hurt accuracy considerably. During the feature-extraction downsampling process, we use the fire module to replace the common convolutional layer. The fire module contains one squeeze module and one expansion module. The squeeze module is a single 1 × 1 convolutional layer which compresses the channel dimension from C to C/4, where C is the channel number of the input tensor. The expansion module, with one 1 × 1 convolutional layer and one 3 × 3 convolutional layer, helps the network obtain richer feature representations from different kernel sizes. We replace the deconvolutional layers with F-deconv modules as in SqueezeSeg [4], shown in Fig. 4 (b).
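As a concrete reference for the fire and F-deconv modules described above, a minimal PyTorch sketch follows. The exact channel bookkeeping and the width-only stride of the deconvolution are assumptions based on the text and Fig. 4, not the authors' released code.

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """Fire module: a 1x1 squeeze to C/4 channels followed by parallel 1x1 and
    3x3 expansion convolutions (C/2 channels each), concatenated back to C."""
    def __init__(self, channels):
        super().__init__()
        squeeze, expand = channels // 4, channels // 2
        self.squeeze = nn.Sequential(nn.Conv2d(channels, squeeze, 1), nn.ReLU(inplace=True))
        self.expand1x1 = nn.Sequential(nn.Conv2d(squeeze, expand, 1), nn.ReLU(inplace=True))
        self.expand3x3 = nn.Sequential(nn.Conv2d(squeeze, expand, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        s = self.squeeze(x)
        return torch.cat([self.expand1x1(s), self.expand3x3(s)], dim=1)

class FireDeconv(nn.Module):
    """F-deconv module: squeeze, a x2 deconvolution that upsamples only the
    width dimension (stride (1, 2)), then the same expansion branches."""
    def __init__(self, channels):
        super().__init__()
        squeeze, expand = channels // 4, channels // 2
        self.squeeze = nn.Sequential(nn.Conv2d(channels, squeeze, 1), nn.ReLU(inplace=True))
        self.deconv = nn.ConvTranspose2d(squeeze, squeeze, kernel_size=(1, 4),
                                         stride=(1, 2), padding=(0, 1))
        self.expand1x1 = nn.Sequential(nn.Conv2d(squeeze, expand, 1), nn.ReLU(inplace=True))
        self.expand3x3 = nn.Sequential(nn.Conv2d(squeeze, expand, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        s = self.deconv(self.squeeze(x))
        return torch.cat([self.expand1x1(s), self.expand3x3(s)], dim=1)
```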
2) Enlargement layer: Pooling layers expand the receptive field and aggregate context information, but they also discard location information, which is indispensable for semantic segmentation tasks. In PointSeg we therefore reduce the number of pooling layers to keep more location information. Instead of relying on pooling layers to obtain a large receptive field, we deploy dilated convolutional layers to enlarge the receptive field after Fire9 and SR-3. Similar to Atrous Spatial Pyramid Pooling (ASPP) [26], we combine dilated convolutions with different rates in parallel.
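A plausible form of such an enlargement layer, sketched in PyTorch with the rates (6, 9, 12) that the ablation study below selects, is given here; the concatenate-and-fuse combination of the parallel branches is our assumption.

```python
import torch
import torch.nn as nn

class EnlargementLayer(nn.Module):
    """ASPP-style enlargement layer: parallel 3x3 dilated convolutions with
    different rates enlarge the receptive field without extra pooling.
    The fusion by concatenation plus 1x1 convolution is an assumption."""
    def __init__(self, in_channels, out_channels, rates=(6, 9, 12)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 3, padding=r, dilation=r),
                nn.ReLU(inplace=True))
            for r in rates
        ])
        # A 1x1 convolution fuses the concatenated multi-rate responses.
        self.fuse = nn.Conv2d(out_channels * len(rates), out_channels, 1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
```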
Fig. 4: (a) The fire module: a Conv 1×1, C/4 squeeze followed by Conv 1×1, C/2 and Conv 3×3, C/2 expansion branches whose outputs are concatenated. (b) The F-deconv module: a Conv 1×1, C/4 squeeze, a Deconv ×2 upsampling, then the same concatenated expansion branches. A further diagram shows the squeeze reweighting layer: a 1×1×C global descriptor is computed from the H×W×C input, passed through two Fc layers and used to scale the input channel-wise.
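The squeeze reweighting layer illustrated above follows the global-descriptor, fully-connected, scale pattern of squeeze-and-excitation networks [27]. A minimal PyTorch sketch, with the channel-reduction ratio as an assumption, could look like this:

```python
import torch
import torch.nn as nn

class SqueezeReweight(nn.Module):
    """Squeeze reweighting (SR) layer: global average pooling produces a
    1x1xC descriptor, two fully connected layers produce per-channel weights,
    and the input feature map is rescaled channel-wise (cf. [27]).
    The reduction ratio of 16 is an assumption."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)           # H x W x C -> 1 x 1 x C
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid())

    def forward(self, x):
        n, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(n, c)).view(n, c, 1, 1)
        return x * w                                   # channel-wise rescaling
```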
IV. EXPERIMENTS

A. Dataset and evaluation metrics

We train our model on a published dataset¹ from SqueezeSeg [4], which converts Velodyne data from the KITTI 3D object detection dataset [22]. It is split into a training set with around 8000 frames and a validation set with around 2800 frames.

¹ www.dropbox.com/s/pnzgcitvppmwfuf/lidar_2d.tgz

The evaluation precision, recall and IoU are defined as follows:

P_n = \frac{|\rho_n \cap G_n|}{|\rho_n|}, \quad R_n = \frac{|\rho_n \cap G_n|}{|G_n|}, \quad IoU_n = \frac{|\rho_n \cap G_n|}{|\rho_n \cup G_n|},

where \rho_n is the set of predictions belonging to class n and G_n is the corresponding ground-truth set.
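These per-class metrics can be computed directly from predicted and ground-truth label masks, for example with the short NumPy sketch below (ours, for illustration only):

```python
import numpy as np

def class_metrics(pred, gt, class_id):
    """Per-class precision, recall and IoU over label masks,
    following the definitions above."""
    p = (pred == class_id)
    g = (gt == class_id)
    inter = np.logical_and(p, g).sum()
    union = np.logical_or(p, g).sum()
    precision = inter / max(p.sum(), 1)
    recall = inter / max(g.sum(), 1)
    iou = inter / max(union, 1)
    return precision, recall, iou
```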
B. Ablation study

Our network is mainly built on SqueezeNet [3]. Compared with the closely related SqueezeSeg [4], we improve the network structure in several parts, and here we report the corresponding ablation results. All the results in Table I are given as percentages.

1) Downsampling times: The original SqueezeNet [3] contains four downsampling layers, so the generated feature has shape (1/16H × 1/16W × C) after downsampling four times. Through this structure, if our input is 64 × 512 × C and we do not downsample in height, the output feature is 64 × 32 × C. To retain more location information, as mentioned in Section III-B.2, we remove the last downsampling layer of the basic SqueezeNet.

In the first part of Table I, we compare the results of different downsampling times (downsampling-4 for four downsampling steps and downsampling-3 for three). Note that downsampling-4 is effectively the same as SqueezeSeg [4] without CRF; the first row therefore shows the result of SqueezeSeg without CRF trained by ourselves, to evaluate the efficiency of our proposed strategy.

The results show that reducing the number of downsampling steps from four to three dramatically improves the accuracy for pedestrian and cyclist, which are relatively small and easily affected by downsampling, while the predictions for cars remain comparable. Thus we choose to downsample three times in PointSeg. All the following ablation experiments are also based on this downsampling-3 structure.
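The effect of downsampling only in the width dimension can be checked with a few lines of PyTorch; the kernel size and padding here are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Width-only downsampling: stride (1, 2) halves the 512 columns but keeps all
# 64 rows, so three such stages give a 64 x 64 feature map (four stages, as in
# the original SqueezeNet schedule, would give 64 x 32).
pool = nn.MaxPool2d(kernel_size=3, stride=(1, 2), padding=1)

x = torch.zeros(1, 5, 64, 512)            # N x C x H x W spherical input
for _ in range(3):                         # downsampling-3 configuration
    x = pool(x)
print(x.shape)                             # torch.Size([1, 5, 64, 64])
```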
2) Ablation study for the enlargement layer: The output feature of Fire9 is 64 × 64 in height and width. To make the enlargement layer described in Section III-B.2 achieve better performance, we evaluate different rates for the enlargement layer, following previous experience from Deeplabv3 [26] and hybrid dilated convolution [29].

In the second part of Table I, we evaluate different enlargement parameters based on the downsampling-3 structure mentioned in Section IV-B.1. We only apply the enlargement layer after Fire9, as shown in Fig. 3, which increases the memory cost of the whole structure from 1.6 GB to 1.8 GB.

We also tried adding another enlargement layer after Fire4, but obtained little performance improvement at a large extra memory cost. The different rate sets have the same memory cost; the difference is that they produce different receptive fields. Based on the results shown in Table I, we choose (6, 9, 12) as the rates of the enlargement layer in our proposed PointSeg. At this stage, we notice that although the performance for the car class has improved, the performance for pedestrian and cyclist still does not reach the level we expected. A possible explanation is that the distortion and uncommon deformation in the input spherical image make it difficult for the network to predict those relatively small objects with similar patterns.

3) Ablation study for the reweight layer: Based on the discussion in Section IV-B.2, we utilize reweight layers to enhance the channel-wise feature representation for small objects in our scenarios. We experiment with three ways to combine the network with the reweight layer: (i) reweight-down adds reweighting layers at the end of each size-invariant block in the feature downsampling process, like SR-1, SR-2 and SR-3 in Fig. 3; (ii) reweight-up adds reweighting layers after each size-variant upsampling step, located after F-deconv1, F-deconv3 and F-deconv4 in Fig. 3; (iii) reweight-down/up adds reweighting layers on the three skip connections, where the downsampling and upsampling features of the same size are combined as the input of the reweighting layers. Experiments in this section are based on the downsampling-3 structure (Section IV-B.1) without the enlargement layer (Section IV-B.2).
In Table I, both reweight-down and reweight-down/up show better performance than the baseline downsampling-3 structure. From these experiments we find that most of the key features for this task come from the downsampling process, which is therefore the best place to reweight the channel weights. If we instead place the reweight layer where the global descriptor is generated from the upsampled feature, as in reweight-up, the results clearly decrease, because reweighting features produced by deconvolution layers may add noise to feature locations. We therefore add the reweight layers only in the downsampling process, as in reweight-down. We also tried adding a reweight layer after every layer in the downsampling process, but the accuracy improves only slightly at a large extra time cost.

4) Comparison with SqueezeSeg: Finally, we compare our results with SqueezeSeg [4], as summarized in the third part of Table I. Our results for the pedestrian class are comparable with SqueezeSeg (without CRF), and we show a great improvement for car and cyclist. Our system costs 12 ms per frame on our workstation during the forward pass, with a memory cost of 2 GB. The comparison of runtime performance is shown in Table II. During the back projection from the mask on the spherical image to the point cloud, we use random sample consensus (RANSAC) for outlier removal. This operation helps the proposed PointSeg obtain a refined segmentation result, as shown in Fig. 7, and only costs around 2 ms of extra time. The evaluation result of PointSeg aided with RANSAC is shown in the last row of Table I; we do not compare it with SqueezeSeg due to the randomness of RANSAC.

Fig. 7: Visualizations of the raw inputs, PointSeg predictions and results after back projection with or without RANSAC refinement, from top to bottom. The third row shows results projected back without RANSAC and the fourth row shows results projected back with RANSAC.

TABLE II: Comparison of runtime performance

Methods                         Time (ms)
SqueezeSeg [4] w/ CRF           13.5
SqueezeSeg [4] w/o CRF          8.5
PointSeg w/ RANSAC              14
PointSeg w/o RANSAC             12
PointSeg on TX2 w/o RANSAC      98
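The paper does not spell out the exact RANSAC formulation, so the sketch below shows one plausible version: a plane is fitted to the back-projected 3D points of a predicted mask and points far from the consensus are treated as outliers. The threshold and iteration count are assumptions.

```python
import numpy as np

def ransac_plane_inliers(points, n_iters=100, threshold=0.1, rng=None):
    """Simple RANSAC plane fit: returns a boolean mask of points lying within
    `threshold` metres of the best plane. One plausible way to remove
    back-projection outliers; parameters are assumptions."""
    rng = np.random.default_rng() if rng is None else rng
    best_mask = np.zeros(len(points), dtype=bool)
    for _ in range(n_iters):
        sample = points[rng.choice(len(points), size=3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-8:
            continue                       # degenerate (collinear) sample
        normal /= norm
        dist = np.abs((points - sample[0]) @ normal)
        mask = dist < threshold
        if mask.sum() > best_mask.sum():
            best_mask = mask
    return best_mask
```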
V. CONCLUSION

In this paper, we improved the feature-wise and channel-wise attention of the network to obtain a robust feature representation, which yields an essential improvement on the 3D semantic segmentation task from spherical images. The proposed PointSeg system can be directly applied in autonomous driving systems and implemented on embedded AI computing devices with a limited memory cost.

Besides, we focused on computational ability and memory cost during the implementation. Therefore, our approach achieves high accuracy at real-time speed while sparing enough space and computation for other tasks in driving systems. After an efficient RANSAC post-process, our method dramatically surpasses the state-of-the-art method.

In the proposed PointSeg, the performance on small objects such as pedestrians still does not reach a very high level. A possible explanation is that much useful information is lost during the downsampling process, even though we only downsample three times, because those objects occupy a very small area in the original spherical image. We leave this as future work.
REFERENCES

[1] C. Feng, Y. Taguchi, and V. R. Kamat, "Fast plane extraction in organized point clouds using agglomerative hierarchical clustering," in Robotics and Automation (ICRA), 2014 IEEE International Conference on. IEEE, 2014, pp. 6218–6225.
[2] M. Himmelsbach, A. Mueller, T. Lüttel, and H.-J. Wünsche, "Lidar-based 3d object perception," in Proceedings of 1st International Workshop on Cognition for Technical Systems, vol. 1, 2008.
[3] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer, "Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1mb model size," CoRR, vol. abs/1602.07360, 2016. [Online]. Available: http://arxiv.org/abs/1602.07360
[4] B. Wu, A. Wan, X. Yue, and K. Keutzer, "Squeezeseg: Convolutional neural nets with recurrent CRF for real-time road-object segmentation from 3d lidar point cloud," CoRR, vol. abs/1710.07368, 2017. [Online]. Available: http://arxiv.org/abs/1710.07368
[5] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, "Pyramid scene parsing network," in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2881–2890.
[6] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[7] L. Chen, G. Papandreou, F. Schroff, and H. Adam, "Rethinking atrous convolution for semantic image segmentation," CoRR, vol. abs/1706.05587, 2017. [Online]. Available: http://arxiv.org/abs/1706.05587
[8] F. Yu and V. Koltun, "Multi-scale context aggregation by dilated convolutions," CoRR, vol. abs/1511.07122, 2015. [Online]. Available: http://arxiv.org/abs/1511.07122
[9] P. Krähenbühl and V. Koltun, "Efficient inference in fully connected crfs with gaussian edge potentials," in Advances in Neural Information Processing Systems, 2011, pp. 109–117.
[10] V. Badrinarayanan, A. Kendall, and R. Cipolla, "Segnet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481–2495, Dec 2017.
[11] H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia, "Icnet for real-time semantic segmentation on high-resolution images," CoRR, vol. abs/1704.08545, 2017. [Online]. Available: http://arxiv.org/abs/1704.08545
[12] G. Lin, A. Milan, C. Shen, and I. Reid, "RefineNet: Multi-path refinement networks for high-resolution semantic segmentation," in CVPR, July 2017.
[13] B. Li, "3d fully convolutional network for vehicle detection in point cloud," in Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on. IEEE, 2017, pp. 1513–1518.
[14] Y. Zhou and O. Tuzel, "Voxelnet: End-to-end learning for point cloud based 3d object detection," CoRR, vol. abs/1711.06396, 2017. [Online]. Available: http://arxiv.org/abs/1711.06396
[15] B. Graham, "Sparse 3d convolutional neural networks," CoRR, vol. abs/1505.02890, 2015. [Online]. Available: http://arxiv.org/abs/1505.02890
[16] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, "Multi-view 3d object detection network for autonomous driving," in IEEE CVPR, vol. 1, no. 2, 2017, p. 3.
[17] R. Dubé, A. Cramariuc, D. Dugas, J. Nieto, R. Siegwart, and C. Cadena, "SegMap: 3d segment mapping using data-driven descriptors," in Robotics: Science and Systems (RSS), 2018.
[18] R. Schnabel, R. Wahl, and R. Klein, "Efficient ransac for point-cloud shape detection," in Computer Graphics Forum, vol. 26, no. 2. Wiley Online Library, 2007, pp. 214–226.
[19] Z. Lin, J. Jin, and H. Talbot, "Unseeded region growing for 3d image segmentation," in Selected Papers from the Pan-Sydney Workshop on Visualisation - Volume 2. Australian Computer Society, Inc., 2000, pp. 31–37.
[20] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, "Pointnet: Deep learning on point sets for 3d classification and segmentation," in Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, vol. 1, no. 2, 2017, p. 4.
[21] R. Dubé, M. G. Gollub, H. Sommer, I. Gilitschenski, R. Siegwart, C. Cadena, and J. Nieto, "Incremental-segment-based localization in 3-d point clouds," IEEE Robotics and Automation Letters, vol. 3, no. 3, pp. 1832–1839, 2018.
[22] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The kitti dataset," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
[23] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[24] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "Mobilenets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[25] X. Zhang, X. Zhou, M. Lin, and J. Sun, "Shufflenet: An extremely efficient convolutional neural network for mobile devices," CoRR, vol. abs/1707.01083, 2017. [Online]. Available: http://arxiv.org/abs/1707.01083
[26] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, April 2018.
[27] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," CoRR, vol. abs/1709.01507, 2017. [Online]. Available: http://arxiv.org/abs/1709.01507
[28] J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," Journal of Machine Learning Research, vol. 12, no. Jul, pp. 2121–2159, 2011.
[29] P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, and G. W. Cottrell, "Understanding convolution for semantic segmentation," CoRR, vol. abs/1702.08502, 2017. [Online]. Available: http://arxiv.org/abs/1702.08502