[Figure 2: SMSnet network architecture diagram. Legend: n×n convolution (stride s, dilation d), 2×2 max pooling, batch normalization, up-convolution, ReLU.]
Fig. 2. Depiction of the proposed SMSnet architecture for semantic motion segmentation from two consecutive images as input. The stream shown in green learns deep motion features and in parallel the stream in gray learns semantic features, which are then concatenated, and further fused representations are learned in the stream depicted in orange. The legend for the network architecture is shown with a red outline.
Motion Feature Learning, Semantic Feature Learning, and Semantic Motion Fusion. The following sections describe each of these streams in detail.

a) Motion Feature Learning: This stream generates features that represent motion-specific information. Successive frames (x_{j-1}, x_j) are first passed through a section of this stream that generates high quality optical flow maps X̂. In this work, we embed the recently proposed deep convolutional architecture FlowNet2 [10] for this purpose. However, any network with the ability to generate optical flow maps can be embedded in its place. The flow generation network yields the optical flow in the x and y direction, and in addition we also compute the magnitude of the flow. This output tensor is of size 3 × 384 × 768, which is the same dimensions as the input RGB images. Figure 1 (c) shows a generated optical flow image from this section, while the consecutive input frames are shown in Figure 1 (a) and (b).

Moving objects appear as motion patterns that differ in scale, geometry and magnitude. In order to enable the network to reason about object class and its borders, we further convolve and pool the optical flow features through multiple network blocks. These additional network blocks can be represented as a function f_o(X̂; θ_o) | θ_o ⊂ θ of the optical flow maps, yielding a feature map tensor of size 512 × 24 × 48.

b) Semantic Feature Learning: The final output of our network is a combined label that denotes a semantic class C and the state of motion M. While the stream described in the previous section yields information about the motion in the scene, the network still requires semantic features to learn the combined semantic motion segmentation. The semantic feature learning stream depicted in gray blocks in Figure 2 takes as input the image x_j and generates semantic features f_s(x_j; θ_s) | θ_s ⊂ θ. The structure of this stream is similar to our previously proposed unimodal AdapNet [23] architecture for semantic segmentation. The architecture follows the design of a contractive segment that aggregates semantic information while decreasing the spatial dimensions of the feature maps, and an expansive segment that upsamples the feature maps back to the full input resolution. The architecture incorporates many recent improvements including multiscale ResNet blocks that learn scale-invariant deep features, skip connections that enable training of the deep architecture, and dilated convolutions that enable the integration of information from different spatial scales. In our proposed SMSnet, the low resolution features from the last layer of the contractive segment are fused with the learned motion features in the Semantic Motion Fusion stream that follows. The expansive segment then in parallel yields the full semantic labels for the input frame x_j.
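As a concrete illustration of the motion stream's input described above, the following NumPy sketch stacks the two flow channels produced by the embedded flow network with their per-pixel magnitude into the 3 × 384 × 768 tensor that is subsequently convolved and pooled down to 512 × 24 × 48. The function name is ours and the flow network is treated as a black box; this is a sketch of the data layout, not of the actual SMSnet layers.

```python
import numpy as np

def motion_stream_input(flow_xy):
    """Stack flow-x, flow-y and the flow magnitude into one tensor.

    flow_xy: array of shape (2, H, W) holding the optical flow in x and y
    direction (e.g. the output of a FlowNet2 forward pass, treated here as
    a black box). Returns an array of shape (3, H, W).
    """
    fx, fy = flow_xy[0], flow_xy[1]
    magnitude = np.sqrt(fx ** 2 + fy ** 2)        # per-pixel flow magnitude
    return np.stack([fx, fy, magnitude], axis=0)  # e.g. 3 x 384 x 768

# Dummy flow field at the input resolution used in the paper.
flow = np.random.randn(2, 384, 768).astype(np.float32)
assert motion_stream_input(flow).shape == (3, 384, 768)
```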
c) Semantic Motion Fusion: The final stream in the SMSnet architecture, depicted using orange blocks in Figure 2, fuses the complementary motion and semantic features which are generated in the aforementioned streams of the network. The feature tensors from f_o(X̂; θ_o) and f_s(x_j; θ_s) are concatenated and further deep representations are learned through a series of additional layers. No further pooling is performed on these features and therefore a downsampling factor of 16 is maintained in comparison to the input x_j. Similar to the semantic feature learning stream, multiscale ResNet blocks from [23] that utilize dilated convolutions for aggregating information over different fields of view are used in the layers that follow the concatenation segment. Finally, towards the end of this stream, we use deconvolution, also known as transposed convolution, for upsampling the low resolution feature maps from 2048 × 24 × 48 back to the input resolution of |C| × |M| × 384 × 768. This upsampled output has joint labels in C × M corresponding to a semantic class and a motion status: static or moving. Thus the final activation function of the SMSnet is given by:

f(x_i; θ) = f_m(f_o(X̂; θ_o), f_s(x_j; θ_s); θ_f) | θ_o, θ_s, θ_f ⊂ θ     (2)
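To make the shape bookkeeping of Eq. (2) concrete, the following PyTorch sketch mimics the fusion stream: it concatenates the two 512 × 24 × 48 feature tensors, applies one dilated convolution block as a stand-in for the multiscale ResNet blocks of [23], and undoes the downsampling factor of 16 with a transposed convolution to obtain |C| × |M| output channels. The class name, the assumption of 512 semantic channels and the single fusion block are illustrative; they do not reproduce the exact SMSnet layer configuration.

```python
import torch
import torch.nn as nn

class FusionStreamSketch(nn.Module):
    """Illustrative stand-in for the Semantic Motion Fusion stream, Eq. (2)."""

    def __init__(self, motion_ch=512, semantic_ch=512, num_classes=10, num_motion=2):
        super().__init__()
        fused_ch = 2048
        # One dilated conv block in place of the multiscale ResNet blocks.
        self.fuse = nn.Sequential(
            nn.Conv2d(motion_ch + semantic_ch, fused_ch, kernel_size=3, padding=2, dilation=2),
            nn.BatchNorm2d(fused_ch),
            nn.ReLU(inplace=True),
        )
        # Transposed convolution that undoes the downsampling factor of 16.
        self.up = nn.ConvTranspose2d(fused_ch, num_classes * num_motion, kernel_size=16, stride=16)

    def forward(self, f_o, f_s):
        x = torch.cat([f_o, f_s], dim=1)   # concatenate motion and semantic features
        return self.up(self.fuse(x))       # joint logits over the C x M labels

# Feature maps of the size given in the text: 512 x 24 x 48 per stream.
f_o = torch.randn(1, 512, 24, 48)
f_s = torch.randn(1, 512, 24, 48)
out = FusionStreamSketch()(f_o, f_s)
print(out.shape)  # torch.Size([1, 20, 384, 768]): 10 classes x 2 motion states
```

Since the 20 output channels enumerate every (class, motion) pair, a per-pixel argmax over them yields both the semantic label and the motion status at once.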
B. Introducing Ego-Flow Suppression

Movement of the camera leads to ego-motion, introducing additional optical flow magnitudes that are not induced by moving objects. This induced flow can cause ambiguities since objects can appear with high optical flow magnitudes although they are not moving. In order to circumvent this problem, we propose a further variant of the SMSnet that predicts the optical flow map X̂′ which is purely caused by the ego-motion. We first estimate the backward camera translation T and the rotation matrix R from the position at the current frame x_j to the previous frame x_{j-1}. Using IMU and odometry data we can then estimate X̂′ as:

X̂′ = K R K^{-1} X + K T / z     (3)

where K is the intrinsic camera matrix, X = (u, v, 1)^T is the homogeneous coordinate of the pixel in image coordinates, and z is the depth of the corresponding pixel in meters. Calculating the flow vector for every pixel coordinate yields the 2-dimensional optical flow image which purely represents the ego-motion. For estimating the depth z, we use the recently proposed DispNet [17], which is based on DCNNs and has fast inference times. We then subtract the ego-flow X̂′ from the optical flow X̂ calculated by the embedded flow generation network within the SMSnet architecture. This subtraction suppresses the ego-flow while keeping the flow magnitudes evoked by other moving objects. An example of the optical flow with ego-flow suppression (EFS) is shown in Figure 1 (d). We present evaluations on both variants of our SMSnet, without and with EFS, in Section V-A.
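The following NumPy sketch shows one way to evaluate Eq. (3) densely and subtract the result from the predicted flow. The helper names are ours; the depth map is assumed to come from a disparity network such as DispNet [17], and converting the reprojected homogeneous coordinates into a 2D displacement field (by dehomogenizing and subtracting the original pixel grid) is our reading of the equation rather than a detail stated in the text.

```python
import numpy as np

def ego_flow(K, R, T, depth):
    """Dense ego-flow from Eq. (3): X' = K R K^-1 X + K T / z for every pixel.

    K: 3x3 intrinsics, R: 3x3 rotation, T: length-3 translation (backward
    motion from frame x_j to x_{j-1}), depth: (H, W) depth in meters.
    Returns an array of shape (2, H, W) with the flow caused by ego-motion.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))                     # pixel grid
    X = np.stack([u, v, np.ones_like(u)]).reshape(3, -1).astype(np.float64)

    warped = K @ R @ np.linalg.inv(K) @ X + (K @ T).reshape(3, 1) / depth.reshape(1, -1)
    warped = warped[:2] / warped[2:3]       # back to inhomogeneous pixel coordinates
    return (warped - X[:2]).reshape(2, H, W)

def suppress_ego_flow(predicted_flow, K, R, T, depth):
    """Subtract the ego-flow from the flow predicted by the embedded flow network."""
    return predicted_flow - ego_flow(K, R, T, depth)

# Toy example: identity rotation, small forward translation, constant depth.
K = np.array([[700.0, 0.0, 384.0], [0.0, 700.0, 192.0], [0.0, 0.0, 1.0]])
R, T = np.eye(3), np.array([0.0, 0.0, 0.5])
depth = np.full((384, 768), 10.0)
flow_efs = suppress_ego_flow(np.zeros((2, 384, 768)), K, R, T, depth)
print(flow_efs.shape)  # (2, 384, 768)
```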
C. Training

We train our network on a system with an Intel Xeon E5 with 2.4 GHz and four NVIDIA TITAN X GPUs. We first train the Semantic Feature Learning stream in SMSnet that generates semantic features for all the C classes. Subsequently, we train the embedded flow generation network that produces the optical flow maps which are further processed in the SMSnet architecture. Finally, we train the entire SMSnet while keeping the weights of the semantic feature learning stream and the flow generation network fixed. We train the network with an initial learning rate λ_0 = 10^{-7} and with the poly learning rate policy, λ_N = λ_0 × (1 − N/N_max)^c, where λ_N is the current learning rate, N is the iteration number, N_max is the maximum number of iterations and c is the power. We train using stochastic gradient descent with a momentum of 0.99 and a mini-batch size of 2 for 50,000 iterations, which takes about a day to complete.
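A minimal sketch of the poly learning-rate policy described above; the power c is not given in this excerpt, so the value below is only a placeholder for illustration.

```python
def poly_learning_rate(base_lr, iteration, max_iterations, power):
    """lambda_N = lambda_0 * (1 - N / N_max) ** c"""
    return base_lr * (1.0 - float(iteration) / max_iterations) ** power

base_lr, max_iters, c = 1e-7, 50000, 0.9   # lambda_0 and N_max from the text; c is a placeholder
for n in (0, 25000, 49999):
    print(n, poly_learning_rate(base_lr, n, max_iters, c))
```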
IV. DATASET

One of the main requirements to train a neural network is a large dataset with ground truth annotations. Data augmentation can help expand datasets, but training a network from scratch and optimizing millions of parameters requires thousands of labelled images. While there are several large datasets for various scene understanding problems such as classification, segmentation and detection, for the task of semantic motion segmentation there exists only one public dataset [9] with 200 labelled images, which is highly insufficient for training DCNNs. Obtaining ground truth for pixel-wise motion status is particularly hard as visible pixel displacement quickly decreases with increasing distance from the camera. In addition, any ego-motion can make the labelling an arduous task. To facilitate training of neural networks for semantic motion segmentation and to allow for credible quantitative evaluation, we create the following datasets and make them publicly available at http://deepmotion.cs.uni-freiburg.de/. Each of these datasets has pixel-wise semantic labels for 10 object classes and their motion status (static or moving). Annotations are provided for the following classes: sky, building, road, sidewalk, cyclist, vegetation, pole, car, sign and pedestrian.

KITTI-Motion: The KITTI benchmark itself does not provide any semantic or moving object annotations. Existing research on semantic motion segmentation has been benchmarked using the annotations for 200 images from the KITTI dataset provided by [9]; however, there are no annotated images that can be used for training learning-based approaches. In order to train our neural network, we create the KITTI-Motion dataset consisting of 255 images taken from the KITTI Raw dataset which do not intersect with the test set provided by [9]. The images are of resolution 1280 × 384 pixels and contain scenes of freeways, residential areas and inner cities. We manually annotated the images with pixel-wise semantic class labels and moving object annotations for the category of cars. In addition, we combine two publicly available KITTI semantic segmentation datasets [6] and [24] for pretraining the semantic stream of our network, which yields a total of 253 images. These images also do not overlap with the test set [9] or the KITTI-Motion dataset that we introduced.

Cityscapes-Motion: The Cityscapes dataset [4] is a more recent dataset containing 2975 training images and 500 validation images. Semantic annotations are provided for 30 categories and images are of resolution 2048 × 1024 pixels. The Cityscapes dataset is highly challenging as it contains images from over 50 cities with different weather conditions, varying seasons and many dynamic objects. We manually annotated all the Cityscapes images with motion labels for the category of cars. We use this dataset in addition to KITTI-Motion for benchmarking the performance.

City-KITTI-Motion: As the KITTI-Motion dataset by itself is not sufficient to train deep networks, and to facilitate comparison with other approaches that are evaluated on KITTI data, we merge the KITTI-Motion and Cityscapes-Motion training sets. Additionally, we merge the 200-image KITTI test set [9] with the 500 validation images from Cityscapes to compose a corresponding evaluation set. Combining them also helps the network learn more generalized feature representations. As we use an input resolution of 768 × 384 for our network, we downsample the Cityscapes-Motion images to this size. However, as the images in the KITTI-Motion dataset have a wider resolution of 1280 × 384, we slice each image into three partially overlapping images. In total the combined dataset yields 3734 training images and 1100 for validation. Furthermore, the dataset also contains 15 preceding frames for every annotated image and is thus perfectly suited for sequence-based approaches.

In order to create additional training data, we randomly apply the following augmentations to the training images: rotation, translation, scaling, vignetting, cropping, flipping, and color, brightness and contrast modulation. As the SMSnet takes two consecutive images as input, we augment the pair jointly with the same parameters.

V. EXPERIMENTAL RESULTS

For the network implementation, we use the Caffe [11] deep learning library with the cuDNN backend for acceleration. We quantify the performance using the standard Jaccard Index, commonly known as the average intersection-over-union (IoU) metric. It can be computed as IoU = TP/(TP + FP + FN), where TP, FP and FN correspond to true positives, false positives and false negatives respectively.

A. Baseline Comparison

In order to compare the performance of our network with state-of-the-art techniques, we train our network on the combined City-KITTI-Motion dataset and benchmark its performance on the KITTI set from [9] on which the other approaches have reported their results. We compare the motion segmentation against three state-of-the-art techniques: geometric-based motion segmentation (GEO-M) [13], joint labelling of motion and superpixel-based image segmentation (AHCRF+Motion) [14], and CRF-based semantic motion segmentation (CRF-M) [20]. Table I summarizes the results of this experiment and shows the average IoU of the moving object, static object and background classes.

TABLE I
COMPARISON OF MOTION SEGMENTATION PERFORMANCE WITH STATE-OF-THE-ART APPROACHES ON THE KITTI DATASET.

Approach                   Moving IoU   Static IoU   Background IoU
GEO-M [13]                 46.50        N/A          49.80
AHCRF+Motion [14]          60.20        N/A          75.80
CRF-M [20]                 73.50        N/A          82.40
SMSnet 10-class            73.98        80.28        97.65
SMSnet 10-class with EFS   80.87        83.77        97.84
SMSnet 2-class             74.03        80.78        97.59
SMSnet 2-class with EFS    84.69        84.50        98.01

Other approaches consider all the elements in the scene that are movable but not moving, such as a stationary car, and permanently static elements, such as buildings, to be under the same static class, which we denote as background in our evaluations. However, as it is more informative in the context of robotics to split these two cases into different categories, we consider the static class to only contain objects that are movable but are stationary at that time.

It can be seen that the method that jointly predicts the semantic class and motion (CRF-M) substantially outperforms approaches that perform only motion segmentation (GEO-M and AHCRF+Motion). This can be attributed to the fact that such joint approaches learn to correlate motion features with the learned semantic features, which improves the overall motion segmentation accuracy. Intuitively, the approaches learn that there is a higher probability of a car moving than a building or a pole. Although Fan et al. [7] also propose an approach for semantic motion segmentation, the KITTI scene flow dataset that they evaluate on has inconsistent class labels, which does not allow for a meaningful comparison. Finally, we show the performance using variants of our proposed SMSnet architecture, specifically with and without the subtraction of the optical flow induced by the ego-motion (ego-flow), as well as considering all the semantic classes in KITTI and considering only the semantic classes that are potentially movable. All the SMSnet variants shown in Table I outperform the existing approaches, while our best performing models achieve the state-of-the-art performance of 84.69% for the moving classes, 84.50% for the static classes and 98.01% for the background class. It can be observed that the subtraction of the ego-flow helps in improving the moving object segmentation.

Since we are interested in predicting both the motion status and the semantic label, we show the performance of semantic segmentation in comparison to recent neural network based approaches in Table II. As described in Section IV, the KITTI benchmark does not provide any official ground truth for semantic segmentation, therefore to train the semantic stream of our network, we combine the Cityscapes dataset with the KITTI semantic ground truth from [6] and [24] to obtain the most generalized training set. We then test the performance individually on the Cityscapes test set, as well as on the KITTI semantic motion test set that was also used in the motion segmentation comparison. For the experiments on the KITTI semantic motion test set, we observe that our SMSnet outperforms the other approaches for most of the semantic classes.
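For reference, the IoU metric defined at the beginning of this section can be computed per class as in the sketch below; the function name and toy labels are ours and are only meant to spell out the TP/FP/FN bookkeeping behind the values reported in Tables I and II.

```python
import numpy as np

def class_iou(prediction, target, class_id):
    """IoU = TP / (TP + FP + FN) for a single class on integer label maps."""
    pred_c = prediction == class_id
    gt_c = target == class_id
    tp = np.logical_and(pred_c, gt_c).sum()
    fp = np.logical_and(pred_c, ~gt_c).sum()
    fn = np.logical_and(~pred_c, gt_c).sum()
    denom = tp + fp + fn
    return tp / float(denom) if denom > 0 else float('nan')

# Toy 2x3 label maps: for class 1 -> TP=1, FP=1, FN=1 -> IoU = 1/3.
pred = np.array([[1, 1, 0], [0, 2, 2]])
gt   = np.array([[1, 0, 0], [0, 2, 1]])
print(class_iou(pred, gt, 1))
```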
TABLE II
COMPARISON OF SEMANTIC SEGMENTATION PERFORMANCE WITH STATE-OF-THE-ART APPROACHES ON THE KITTI AND CITYSCAPES DATASETS.

Test Set    Approach        Sky    Building  Road   Sidewalk  Cyclist  Vegetation  Pole   Car    Sign   Pedestrian
KITTI       FCN-8s [16]     77.35  74.24     74.41  51.41     35.79    78.80       15.99  76.20  35.97  40.87
KITTI       SegNet [1]      77.27  60.34     75.03  43.62     19.76    76.58       24.34  63.88  17.01  21.96
KITTI       ParseNet [15]   81.26  70.42     73.85  42.12     41.04    71.48       32.02  77.20  31.60  47.49
KITTI       SMSnet (ours)   78.39  74.27     78.10  46.11     26.85    79.88       34.84  83.63  37.70  42.88
Cityscapes  FCN-8s [16]     76.05  75.94     92.73  59.68     46.50    78.78       15.27  76.54  37.96  41.57
Cityscapes  SegNet [1]      69.93  59.87     83.25  43.35     27.25    68.83       19.23  60.80  23.81  23.14
Cityscapes  ParseNet [15]   77.58  76.23     92.76  60.04     47.96    79.68       22.66  76.85  40.99  44.54
Cityscapes  SMSnet (ours)   85.43  81.08     94.50  66.89     49.26    84.85       37.92  82.40  47.48  46.47
Secondly, the KITTI semantic motion test set consists of images containing sidewalks with outgrown grass labelled as sidewalk as opposed to vegetation. Such examples are consistently labelled as vegetation in the Cityscapes dataset, which affects the corresponding scores of our network on KITTI. In contrast, while testing on the Cityscapes test set, our proposed SMSnet substantially outperforms the other networks in all the classes.

[Figure: IoU of moving object segmentation for models trained with maximum object distances of 20 m, 40 m and 60 m and tested with maximum distances of 20 m, 40 m, 60 m and ∞, shown as heatmaps (a) and (b).]

Approach                  Time
CRF-M [20]                240,000 ms
U-Disp-CRF-FCN [7]        1,060 ms
SMSnet (ours)             153 ms
SMSnet with EFS (ours)    313 ms

Fig. 5. Qualitative semantic motion segmentation results on the Cityscapes dataset. The network demonstrates robustness to complex scenes with many different dynamic objects, some that are even partially occluded.
E. Qualitative Evaluation

In this section, we show qualitative results on various datasets with our SMSnet trained on City-KITTI-Motion and critique its performance in diverse scenes. Figure 4 shows results on images from the KITTI test set. The segmented images are color coded according to the labels shown in Table II. Dark blue pixels indicate static cars and light green pixels indicate moving cars. Figure 4 (a) and (b) are scenes from residential areas which have cars moving with low velocities and Figure 4 (c) shows a scene on a highway which has cars moving at much higher velocities. These scenes also have objects of different scales and lighting conditions. We can see that the network accurately segments the scene and distinguishes between the static and moving cars even in these diverse situations. Figure 5 presents results on the Cityscapes test set which contains more complex scenes than the KITTI dataset. Figure 5 (a) shows a moving car over 80 m away and SMSnet succeeds in capturing this motion while precisely segmenting the object. Figure 5 (b) shows a scene with a moving car that is partially occluded by a tree, yet the entire car is captured in the segmentation. This demonstrates the ability of the SMSnet to handle diverse real-world scenarios.
[Figure 6 panels, left to right: input frame 2, KITTI-Motion model, Cityscapes-Motion model, City-KITTI-Motion model; rows (a) and (b).]

Fig. 6. Qualitative comparison of semantic motion segmentation models trained on various datasets and evaluated on real-world data from Freiburg. Note that the model trained on the City-KITTI-Motion dataset generalizes better to the previously unseen city than others. Our network robustly handles challenging conditions such as glare (a) and low lighting (b).
F. Evaluation of Transferability and Platform Independence

In this section, we demonstrate the platform independence of our SMSnet model trained on the City-KITTI-Motion dataset by presenting qualitative evaluations on images captured using a different camera setup than those used in the KITTI and Cityscapes datasets. We mounted a ZED stereo camera on the hood of a car and collected over 61,000 images of driving scenes in Freiburg, Germany. The recorded images include adverse conditions such as low lighting, glare and motion blur, which pose a great challenge for semantic motion segmentation. Figure 6 shows a comparison of results using the SMSnet trained on the KITTI-Motion, Cityscapes-Motion and the combined City-KITTI-Motion datasets. In Figure 6 (a), we see that the Cityscapes-Motion model misclassifies the car as static while it is moving and has false positives on the sides of the image due to motion blur. The KITTI-Motion model, on the other hand, segments the car as moving but fails to segment it as a whole, in addition to having numerous false positives. Figure 6 (b) shows a residential scene in low lighting. We see that the KITTI-Motion model misclassifies the moving car on the left as static and the Cityscapes-Motion model misclassifies the static car as moving. Both these models also have difficulty segmenting the sidewalk entirely. Overall, one can note that the model trained on City-KITTI-Motion performs substantially better in segmenting the static and moving classes as well as having negligible false positives, demonstrating the good generality of the learned kernels.

VI. CONCLUSION

In this paper, we presented a convolutional neural network that takes two consecutive images as input and learns to predict both the semantic class label and the motion status of each pixel in an image. We introduced two large first-of-a-kind datasets with ground-truth annotations that enable training of deep neural networks for semantic motion segmentation. We presented comprehensive quantitative evaluations and demonstrated that the performance of our network exceeds the state of the art, both in terms of accuracy and prediction time. We investigated the performance of motion segmentation at varying object distances and showed that our network performs well even for distant moving objects. We also presented extensive qualitative results that show the applicability to autonomous driving scenarios. Furthermore, we presented qualitative evaluations of various SMSnet models on real-world driving data from Freiburg that contain challenging perceptual conditions and showed that the model trained on our City-KITTI-Motion dataset generalized effectively to previously unseen conditions.

REFERENCES

[1] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A deep convolutional encoder-decoder architecture for image segmentation,” arXiv:1511.00561, 2015.
[2] T. Chen and S. Lu, “Object-level motion detection from moving cameras,” TCSVT, vol. PP, no. 99, pp. 1–1, 2016.
[3] W. Choi, C. Pantofaru, and S. Savarese, “A general framework for tracking multiple people from a moving camera,” PAMI, 2013.
[4] M. Cordts et al., “The Cityscapes dataset for semantic urban scene understanding,” in CVPR, 2016.
[5] B. Drayer and T. Brox, “Object detection, tracking, and motion segmentation for object-level video segmentation,” arXiv:1608.03066, 2016.
[6] G. Ros et al., “Vision-based offline-online perception paradigm for autonomous driving,” in WACV, 2015.
[7] Q. Fan et al., “Semantic motion segmentation for urban dynamic scene understanding,” in CASE, 2016.
[8] P. Gao, X. Sun, and W. Wang, “Moving object detection based on Kirsch operator combined with optical flow,” in IASP, 2010.
[9] N. Haque, D. Reddy, and K. M. Krishna, “KITTI semantic ground truth,” https://github.com/native93/KITTI-Semantic-Ground-Truth/, 2016.
[10] E. Ilg et al., “FlowNet 2.0: Evolution of optical flow estimation with deep networks,” arXiv:1612.01925, Dec. 2016.
[11] Y. Jia et al., “Caffe: Convolutional architecture for fast feature embedding,” arXiv:1408.5093, 2014.
[12] J. Y. Kao et al., “Moving object segmentation using depth and optical flow in car driving sequences,” in ICIP, 2016, pp. 11–15.
[13] A. Kundu et al., “Moving object detection by multi-view geometric techniques from a single camera mounted robot,” in IROS, 2009.
[14] T. H. Lin and C. C. Wang, “Deep learning of spatio-temporal features with geometric-based moving point detection for motion segmentation,” in ICRA, May 2014, pp. 3058–3065.
[15] W. Liu, A. Rabinovich, and A. C. Berg, “ParseNet: Looking wider to see better,” arXiv:1506.04579, 2015.
[16] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in CVPR, 2015.
[17] N. Mayer et al., “A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation,” in CVPR, 2016.
[18] G. Oliveira, A. Valada, C. Bollen, W. Burgard, and T. Brox, “Deep learning for human part discovery in images,” in ICRA, 2016.
[19] M. P. Patel and S. K. Parmar, “Moving object detection with moving background using optic flow,” in ICRAIE, 2014.
[20] N. D. Reddy, P. Singhal, and K. M. Krishna, “Semantic motion segmentation using dense CRF formulation,” in ICVGIP, 2014.
[21] C. S. Royden and K. D. Moore, “Use of speed cues in the detection of moving objects by moving observers,” Vision Research, vol. 59, pp. 17–24, 2012.
[22] P. Spagnolo, T. Orazio, M. Leo, and A. Distante, “Moving object segmentation by background subtraction and temporal analysis,” Image Vision Comput., vol. 24, no. 5, pp. 411–423, May 2006.
[23] A. Valada et al., “AdapNet: Adaptive semantic segmentation in adverse environmental conditions,” in ICRA, 2017.
[24] P. Xu et al., “Multimodal information fusion for urban scene understanding,” Machine Vision and Applications, pp. 331–349, 2016.