Smsnet: Semantic Motion Segmentation Using Deep Convolutional Neural Networks

This paper proposes a convolutional neural network called SMSnet that performs semantic motion segmentation on image pairs. SMSnet learns to predict the object label and motion status of each pixel by fusing features from optical flow maps and semantic segmentation. It is trained on a new dataset called Cityscapes-Motion containing over 2,900 manually annotated semantic motion labels. The paper shows that SMSnet outperforms existing approaches on standard datasets while being substantially faster, making it suitable for applications requiring real-time performance like autonomous driving.

Uploaded by

Malik Hashmat

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

84 views

Smsnet: Semantic Motion Segmentation Using Deep Convolutional Neural Networks

Uploaded by

Malik Hashmat

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 8

SMSnet: Semantic Motion Segmentation

using Deep Convolutional Neural Networks

Johan Vertens∗ Abhinav Valada∗ Wolfram Burgard

Abstract— Interpreting the semantics and motion of objects

are prerequisites for autonomous robots that enable them
to reason and operate in dynamic real-world environments.
Existing approaches that tackle the problem of semantic motion
segmentation consist of long multistage pipelines and typically
require several seconds to process each frame. In this paper, (a) Input frame 1 (b) Input frame 2
we present a novel convolutional neural network architecture
that learns to predict both the object label and motion status of
each pixel in an image. Given a pair of consecutive images, the
network learns to fuse features from self-generated optical flow
maps and semantic segmentation kernels to yield pixel-wise se-
mantic motion labels. We also introduce the Cityscapes-Motion
dataset which contains over 2,900 manually annotated semantic (c) Generated optical flow (d) Ego-motion subtracted flow
motion labels, which is the largest dataset of its kind so far. We
demonstrate that our network outperforms existing approaches
achieving state-of-the-art performance on the KITTI dataset,
as well as in the more challenging Cityscapes-Motion dataset
while being substantially faster than existing techniques.
I. I NTRODUCTION
(e) Semantic motion segmentation (f) Overlay of static and moving cars
The advancement in robotic technology and machine
learning has led to the successful use of robots to accomplish Fig. 1. Illustration of semantic motion segmentation using our proposed
SMSnet on the Cityscapes dataset. (e) shows the output from our network
tasks in various structured and semi-structured environments consisting of semantic and motion labels, while (f) shows the segmented
such as factory floors, domestic homes and offices. This re- moving cars in green and static cars in blue for better visualization.
cent success has now paved the way to tackle more complex
tasks in challenging urban environments that contain many semantic motion annotations that enable training of deep
dynamic objects. Thus scene understanding plays a crucial networks and allow for credible quantitative evaluations.
role in ensuring the viability and safe operation in such In this work, we propose a composite deep convolutional
scenarios. The ability to classify, segment and infer the state neural network architecture that learns to predict both the
of motion of dynamic objects such as cars and pedestrians, semantic category and motion status of each pixel from a
allows robotic systems to increase their awareness, reason pair of consecutive monocular images. The composition of
about behaviours and plan autonomous actions. Previous our SMSnet architecture can be deconstructed into three
work [20], [7] has successfully shown the benefit of solving components: a section that learns motion features from
the problem of semantic motion segmentation jointly, as generated optical flow maps, a parallel section that generates
features learned for semantic labelling can help infer motion features for semantic segmentation, and a fusion section that
labels and vice versa. However, the multistage pipelines combines both the motion and semantic features and further
currently in use have long processing times deeming them learns deep representations for pixel-wise semantic motion
impractical for real-world applications. segmentation. For this work, we consider an urban driving
Deep convolutional neural network (DCNN) based ap- scenario containing moving cars that appear in different
proaches have significantly improved the state-of-art in both scales. Therefore we utilize our previously proposed multi-
semantic segmentation [16] and motion estimation [10]. scale ResNet skip layers [23] in the architecture to incorpo-
Yet, their applicability to the task of joint semantic motion rate scale invariance. Training such a network requires sev-
segmentation has not been explored. There are several chal- eral thousands of labelled consecutive image pairs. Currently
lenges that make this problem inherently hard including the the only publicly available semantic motion segmentation
ego-motion of the camera, lighting changes between consec- dataset contains 200 labelled images from the KITTI [9]
utive frames, motion blur and varying pixel displacements benchmark, which is highly insufficient for training networks
due to motion with different velocities. Another major hin- of this scale. To overcome this impediment, we release the
drance is the lack of a large enough dataset with ground truth Cityscapes-Motion dataset containing over 2,900 labelled
∗ These authors contributed equally. All authors are with the Department
images and a KITTI-Motion dataset with 255 labelled im-
of Computer Science, University of Freiburg, Germany. This work has been ages, both with additional preceding frames. Additionally,
supported by Samsung Electronics Co., Ltd. under the GRO program. we investigate the utility of combining these datasets to
measure the generalization capabilities to scenes of unseen a decade. Chen et al. [2] propose an approach that detects
cities. Utilizing efficient GPU implementations, our approach object-level motion from a moving camera using two consec-
is several times faster than existing techniques and achieves utive image frames and provides 2D bounding boxes as the
state-of-the-art performance on multiple datasets. output. They design a robust context-aware motion descriptor
that considers moving speed, as well as the direction of
II. R ELATED W ORK objects and combines them with an object classifier. The
Semantic segmentation and motion segmentation are two descriptor measures the inconsistency between local optical
fundamental problems in scene understanding that both flow histograms of objects and their surroundings, giving a
have substantial amount of literature in their areas. Early measure of the state of motion.
convolutional neural network (CNN) based segmentation Dinesh et al. [20] propose an approach that generates
approaches involve small model capacities, multi-scale pyra- motion likelihoods based on depth and optical flow estima-
mid processing, saturating tanh non-linearities and post- tions, while combining them with semantic and geometric
processing such as superpixel computation, filtering and constraints within a dense conditional random field (CRF).
random field regularization. A recent breakthrough that does More recently, a multistep framework was proposed in [7],
not require any pre or post processing was proposed by where first sparse image features in two consecutive stereo
Long et al. [16], in which they extend a CNN designed for image pairs are extracted and matched. The matched feature
classification with learned deconvolution layers that are able points are then classified using RANSAC into inliers caused
to upsample low-resolution feature maps to higher resolution by the camera and outliers caused by moving objects. Fol-
segmentation outputs. Since then, several improvements to lowing which, the outliers are clustered in a U-disparity map
this fully convolutional network (FCN) architecture have which provides the motion information of objects. Finally, a
been proposed that improve the resolution of segmentation dense CRF is used to merge the motion information with
with additional refinement layers [18], alternative schemes the semantic segmentation provided by a FCN. A major
for non-linear upsampling eliminating the need for learning disadvantage of these approaches is their long run-times
to upsample [1], and deeper networks based on the residual that range from several seconds to even minutes, making
learning framework that incorporates scale invariance [23]. them unusable for applications that require near real-time
There are numerous approaches that have been proposed performance such as autonomous driving. In contrast to these
for segmenting moving objects from stationary camera im- existing multistage techniques, we propose an approach that
ages [22], [8]. However, they cannot be directly applied is entirely based on convolutional neural networks, com-
to moving camera images, as the movement causes a dual posing of a simpler but deeper structure while being more
motion appearance which consists of the background motion accurate and several orders faster than existing techniques.
and the object motion. In general, methods that detect motion
from freely moving cameras partition the image into coherent III. S EMANTIC M OTION S EGMENTATION
regions with homogenous motion. This process splits the In this section, we first formulate the problem state-
image into background and moving clusters. These methods ment for semantic motion segmentation. We then describe
can be categorized into optical flow based and tracking based our SMSnet architecture in detail and the training pro-
approaches. Optical flow based techniques [19], [21] check cedure that we employ. We represent the training set
if the motion speed and direction of a region is consistent as S = {(Xn−1 , Xn ,Yn ), n = 1, . . . , N}, where Xn = {x j , j =
with its radially surrounding pattern. It is then classified as a 1, . . . , |Xn |} denotes the input frame and Xn−1 denotes the
moving object if the motion of this region deviates from this preceding frame. The corresponding ground truth mask can
pattern. The disadvantage of these methods is that they are be denoted as Yn = {y j , j = 1, . . . , |Xn |}, y j ∈ {0} ∪ C × M ,
prone to occlusion, noise in the optical flow map and edge where C = {1, ...,C} is the set of C semantic classes and
effects. In recent work [12], the authors derive a geometric each class can also take the label of static or moving
model that relates 2D motion to a 3D motion field relative to M = {m1 , m2 }. Where, m1 denotes a static pixel and m2
the camera based on estimated depth and motion of vanishing denotes a moving pixel. Let θ be the network parameters
points in the scene. Spectral clustering is then applied on and a = f (x j ; θ ) be the activation function. The goal of our
the recovered 3D motion field to obtain the moving object network is to learn semantic motion features by minimizing
segmentation. Although qualitative evaluations have been the cross-entropy (so f tmax) loss that can be computed as
k)
shown on the KITTI benchmark, no quantitative comparison L (k) = − Cexp(a . Using stochastic gradient descent, we
∑l=1 exp(al )
has been reported. Tracking based techniques [3], [5], [14], then solve
[13] on the other hand, aim to detect and localize target
N×|Xn |
objects in successive frames. Tracking of objects yields
θ ∗ = argmin ∑ L ( f (xi ; θ ), yi ) (1)
movement trajectories and by estimating the ego-motion of θ i=1
the camera, objects can be segmented from the background
motion. These approaches typically have long processing A. Network Architecture
pipelines resulting in high computation times and coarse We propose a novel fully convolutional neural network
segmentations. architecture that represents the sought f (x j ; θ ). Figure 2
There is only a handful of work that jointly estimates the depicts our SMSnet architecture for semantic motion seg-
semantic motion labels, mostly accounting to the last half mentation. The network consists of three different streams:
3x Input 3x
384x768 Frame 1 384x768 64 64 64 128 128 128
1 1 1
2 1 2 2 512x
7 24x48
256 256 256 512 512 512
Flow Generation 64x 512x 512x
64x 256x 256x 256x
Network 192x384 48x96 48x96
96x192 96x192 96x192 96x192

Input
Frame 2 64 64 128 128 128
64 128
1 1
1 2 1 2 64
3 7 2 1

256 256 256 512 512 512 512

64x 512x 512x 512x 512x 12x
64x 256x 256x 256x
192x384 48x96 48x96 48x96 48x96 48x96
96x192 96x192 96x192 96x192

256 256 256 256 256 256 512 512 512

1 1 1 1 2 2 2
2 1
256 256 256 256 512 512 512
2 4 8 16 4 8 16 1
1024 1024 1024 1024 1024 1024 2048 2048 2048 12x 6x
48x96 384x768
Eltsum
1024x24x48 2048x24x48 Full Semantics
Concat 384x768

128 128 128 256 256 256 512 512

1 1 1 4 4
2 1
1 128 128 256 256 256
2 4 8 16 16 1
Motion Feature Learning 512 512 512 1024 1024 1024 2048 2048 3x
384x768
Semantic Feature Learning
1024x24x48 2048x24x48 Moving/Static
Semantic Motion Fusion 384x768

s stride s
s 1 1 1 p 1
n nxn convolution
1 3 1 1 p 1 1 3 1
d dilation d, stride =1 d1
d1 d1 1 3 1
n nxn convolution 1 1 1 d1
d1 d1 d2 d1 d3 d2
1 3 1
2x2 max pooling p d1 d3 d2 p 2
d3 2 d3
Batch norm
d1 d1 d2 s s d d
d 1
d 3 d3
1 1
3 d3 2
Up-convolution d2
d2 d2 2 d2
d2 d2
ReLU

Fig. 2. Depiction of the proposed SMSnet architecture for semantic motion segmentation from two consecutive images as input. The stream shown in green
learns deep motion features and in parallel the stream in gray learns semantic features, which are then both concatenated and further fused representations
are learned in the stream depicted in orange. The legend for the network architecture is shown with a red outline.

Motion Feature Learning, Semantic Feature Learning, and b) Semantic Feature Learning: The final output of
Semantic Motion Fusion. The following sections describe our network is a combined label that denotes a semantic
each of these streams in detail. class C and the state of motion M . While the stream
a) Motion Feature Learning: This stream generates described in the previous section yields information about
features that represent motion specific information. Succes- the motion in the scene, the network still requires semantic
sive frames (x j−1 , x j ) are first passed through a section of features to learn the combined semantic motion segmen-
this stream that generates high quality optical flow maps tation. The semantic feature learning stream depicted in
X̂. In this work, we embed the recently proposed deep gray blocks in Figure 2 takes as input the image x j and
convolutional architecture FlowNet2 [10] for this purpose. generates semantic features fs (x j ; θs ) | θs ⊂ θ . The structure
However, any network with the ability to generate optical of this stream is similar to our previously proposed unimodal
flow maps can be embedded in place. The flow generation AdapNet [23] architecture for semantic segmentation. The
network yields the optical flow in the x and y direction and architecture follows the design of a contractive segment
in addition we also compute the magnitude of the flow. This that aggregates semantic information while decreasing the
output tensor is of size 3 × 384 × 768, which is the same spatial dimensions of the feature maps and an expansive
dimensions as the input RGB images. Figure 1 (c) shows segment that upsamples the feature maps back to the full
a generated optical flow image from this section, while the input resolution. The architecture incorporates many recent
consecutive input frames are shown in Figure 1 (a) and (b). improvements including multiscale ResNet blocks that learn
Moving objects appear as motion patterns that differ scale invariant deep features, skip connections that enable
in scale, geometry and magnitude. In order to enable the training of the deep architecture and dilated convolutions that
network to reason about object class and its borders, we enable the integration of information from different spatial
further convolve and pool the optical flow features through scales. In our proposed SMSnet, the low resolution features
multiple network blocks. These additional network blocks from the last layer of the contractive segment are fused with
can be represented as a function fo (X̂; θo ) | θo ⊂ θ of the the learned motion features in the Semantic Motion Fusion
optical flow maps yielding a feature map tensor of size stream that follows. The expansive segment then in parallel
512 × 24 × 48. yields the full semantic labels for the input frame x j .
c) Semantic Motion Fusion: The final stream in the semantic features for all the C classes. Consecutively, we
SMSnet architecture depicted using orange blocks in Fig- train the embedded flow generation network that produces
ure 2, fuses the complementary motion and semantic features the optical flow maps which are processed and generated in
which are generated in the aforementioned streams of the the SMSnet architecture. Finally, we train the entire SMSnet
network. The feature tensors from fo (X̂; θo ) and fs (x j ; θs ) while keeping the weights of the semantic feature learning
are concatenated and further deep representations are learned stream and the flow generation network fixed. We train the
through a series of additional layers. No further pooling is network with an initial learning rate λ0 =10−7 and with the
c
performed on these features and therefore a downsampling poly learning rate policy as, λN = λ0 × 1 − NmaxN
, where
factor of 16 is maintained in comparison to the input x j . λN is the current learning rate, N is the iteration number,
Similar to the semantic feature learning stream, multiscale Nmax is the maximum number of iterations and c is the power.
ResNet blocks from [23] that utilize dilated convolutions for We train using stochastic gradient decent with a momentum
aggregating information over different field of views are used of 0.99 and a mini-batch of 2 for 50, 000 iterations which
in the layers that follow the concatenation segment. Finally, takes about a day to complete.
towards the end of this stream, we use deconvolution, also
known as transposed convolution, for upsampling the low IV. DATASET
resolution feature maps from 2048 × 24 × 48 back to the
input resolution of |C | × |M | × 384 × 768. This upsampled One of the main requirements to train a neural net-
output has joint labels in C ×M corresponding to a semantic work is a large dataset with ground truth annotations. Data
class and a motion status: static or moving. Thus the final augmentation can help expand datasets but for training a
activation function of the SMSnet is given by: network from scratch and optimizing a network with millions
of parameters, thousands of labelled images are required.
f (xi ; θ ) = fm ( fo (X̂; θo ), fs (x j ; θs ); θ f ) | θo , θs , θ f ⊂ θ (2) While there are several large datasets for various scene
understanding problems such as classification, segmentation
B. Introducing Ego-Flow Suppression
and detection, for the task of semantic motion segmentation
Movement of the camera leads to ego-motion introducing however, there only exists one public dataset [9] with 200
additional optical flow magnitudes that are not induced by labelled images which is highly insufficient for training
moving objects. This induced flow can cause ambiguities DCNNs. Obtaining ground truth for pixel-wise motion status
since objects can appear with high optical flow magnitudes is particularly hard as visible pixel displacement quickly
although they are not moving. In order to circumvent this decreases with increasing distance from the camera. In
problem, we propose a further variant of the SMSnet that addition, any ego-motion can make the labelling an arduous
predicts the optical flow map X̂ 0 which is purely caused task. To facilitate training of neural networks for semantic
by the ego-motion. We first estimate the backward camera motion segmentation and to allow for credible quantita-
translation T and the rotation matrix R from the position at tive evaluation, we create the following datasets and make
the current frame x j to the previous frame x j−1 . Using IMU them publicly available at http://deepmotion.cs.
and odometry data we can then estimate X̂ 0 as: uni-freiburg.de/. Each of these datasets have pixel-
wise semantic labels for 10 object classes and their motion
T
X̂ 0 = KRK −1 X + K (3) status (static or moving). Annotations are provided for the
z following classes: sky, building, road, sidewalk, cyclist, veg-
where, K is the intrinsic camera matrix, X = (u, v, 1)T is the etation, pole, car, sign and pedestrian.
homogenous coordinate of the pixel in image coordinates KITTI-Motion: The KITTI benchmark itself does not
and z is the depth of the corresponding pixel in meters. Cal- provide any semantic or moving object annotations. Ex-
culating the flow vector for every pixel coordinate yields the isting research on semantic motion segmentation has been
2-dimensional optical flow image which purely represents the benchmarked using the annotations for 200 images from the
ego-motion. For estimating the depth z, we use the recently KITTI dataset provided by [9], however, there are no images
proposed DispNet [17] which is based on DCNNs and has with annotations that can be used for training learning based
fast inference times. We then subtract the ego-flow X̂ 0 from approaches. In order to train our neural network, we create a
the optical flow calculated by the embedded flow generation KITTI-Motion dataset consisting of 255 images taken from
network X̂ within the SMSnet architecture. This subtraction the KITTI Raw dataset and which do not intersect with
yields to suppression of the ego-flow while keeping the flow the test set provided by [9]. The images are of resolution
magnitudes evoked from other moving objects. An example 1280 × 384 pixels and contain scenes of freeways, residential
of the optical flow with ego-flow suppression (EFS) is shown areas and inner-cities. We manually annotated the images
in Figure 1 (d). We present evaluations on both variants of with pixel-wise semantic class labels and moving object
our SMSnet, without and with EFS in Section V-A. annotations for the category of cars. In addition, we combine
two publicly available KITTI semantic segmentation datasets
C. Training [6] and [24] for pretraining the semantic stream of our
We train our network on a system with an Intel Xeon E5 network, which yields a total of 253 images. These images
with 2.4 GHz and four NVIDIA TITAN X. We first train the also do not overlap with the test set [9] or the KITTI-Motion
Semantic Feature Learning stream in SMSnet that generates dataset that we introduced.
TABLE I
Cityscapes-Motion: The Cityscapes dataset [4] is a more
C OMPARISON OF MOTION SEGMENTATION PERFORMANCE WITH
recent dataset containing 2975 training images and 500
STATE - OF - THE - ART APPROACHES ON THE KITTI DATASET.
validation images. Semantic annotations are provided for 30
categories and images are of resolution 2048 × 1024 pixels. Approach IoU
The Cityscapes dataset is highly challenging as it contains Moving Static Background
images from over 50 cities and different weather conditions,
GEO-M [13] 46.50 N/A 49.80
varying seasons and many dynamic objects. We manually AHCRF+Motion [14] 60.20 N/A 75.80
annotated all the Cityscapes images with motion labels for CRF-M [20] 73.50 N/A 82.40
the category of cars. We use this dataset in addition to KITTI- SMSnet 10-class 73.98 80.28 97.65
Motion for benchmarking the performance. SMSnet 10-class with EFS 80.87 83.77 97.84
City-KITTI-Motion: As the KITTI-Motion dataset by itself SMSnet 2-class 74.03 80.78 97.59
SMSnet 2-class with EFS 84.69 84.50 98.01
is not sufficient to train deep networks and to facilitate
comparison with other approaches that are evaluated on
KITTI data, we merge the KITTI-Motion and Cityscapes- permanently static elements such as buildings to be under
Motion training sets. Additionally, we merge the 200 image the same static class, which we denote as background in our
KITTI test set [9] with the 500 validation images from evaluations. However, as it is more informative in the context
Cityscapes to compose a corresponding evaluation set. Com- of robotics to split these two cases into different categories,
bining them also helps the network learn more generalized we consider the static class to only contain objects that are
feature representations. As we use an input resolution of movable but are stationary at that time.
768 × 384 for our network, we downsample the Citiscapes- It can be seen that the method that jointly predicts the se-
Motion images to this size. However, as the images in the mantic class and motion (CRF-M) substantially outperforms
KITTI-Motion dataset have wider resolution 1280 × 384, we approaches that perform only motion segmentation (GEO-
slice each image into three partially overlapping images. In M and AHCRF+Motion). This can be attributed to the fact
total the combined dataset yields 3734 training images and that these approaches learn to correlate motion features with
1100 for validation. Furthermore, the dataset also contains the learned semantic features which improves the overall
15 preceding frames for every annotated image and is thus motion segmentation accuracy. Intuitively, the approaches
perfectly suited for sequence based approaches. learn that there is a higher probability of a car moving than
In order to create additional training data, we randomly a building or a pole. Although Fan et al. [7] also propose
apply the following augmentations on the training images: an approach for semantic motion segmentation, the KITTI
rotation, translation, scaling, vignetting, cropping, flipping, scene flow dataset that they evaluate on have inconsistent
color, brightness and contrast modulation. As the SMSnet class labels which does not allow for meaningful comparison.
takes two consecutive images as input, we augment the pair Finally, we show the performance using variants of our
jointly with the same parameters. proposed SMSnet architecture, specifically, with and without
the subtraction of the optical flow induced by the ego-motion
V. E XPERIMENTAL R ESULTS (ego-flow), as well as considering all the semantic classes
For the network implementation, we use the Caffe [11] in KITTI and considering only the semantic classes that
deep learning library with cuDNN backend for acceleration. are potentially moveable. All the SMSnet variants shown in
We quantify the performance using the standard Jaccard In- Table I outperform the existing approaches, while our best
dex which is commonly known as average intersection-over- performing models achieve the state-of-the-art performance
union (IoU) metric. It can be computed as IoU = T P/(T P + of 84.69% for the moving classes, 84.50% for the static
FP + FN), where T P, FP and FN correspond to true posi- classes and 98.01% for the background class. It can be
tives, false positives and false negatives respectively. observed the subtraction of the ego-flow helps in improving
the moving object segmentation.
A. Baseline Comparison Since we are interested in predicting both the motion status
In order to compare the performance of our network and the semantic label, we show the performance of semantic
with state-of-the-art techniques, we train our network on segmentation in comparison to recent neural network based
the combined City-KITTI-Motion dataset and benchmark approaches in Table II. As described in Section IV, the
its performance on the KITTI set from [9] on which KITTI benchmark does not provide any official ground truth
the other approaches have reported their results. We com- for semantic segmentation, therefore to train the semantic
pare the motion segmentation against three state-of-the-art stream of our network, we combine the Cityscapes dataset
techniques including geometric-based motion segmentation with the KITTI semantic ground truth from [6] and [24] to
(GEO-M) [13], joint labelling of motion and superpixels obtain the most generalized training set. We then test the
based image segmentation (AHCRF+Motion) [14] and CRF- performance individually on the Cityscapes test set, as well
based semantic motion segmentation [20]. Table I summa- as on the KITTI semantic motion test set that was also used
rizes the results of this experiment and shows the average IoU in the motion segmentation comparison. For the experiments
of the moving object, static object and background classes. on the KITTI semantic motion test set, we observe that
Other approaches consider all the elements in the scene that our SMSnet outperforms the other approaches for most of
are movable but not moving such as a stationary car and the semantic classes. Secondly, the KITTI semantic motion
TABLE II
C OMPARISON OF SEMANTIC SEGMENTATION PERFORMANCE WITH STATE - OF - THE - ART APPROACHES ON THE KITTI AND C ITYSCAPES DATASETS .

Test Approach Sky Building Road Sidewalk Cyclist Vegetation Pole Car Sign Pedestrian
Set

FCN-8s [16] 77.35 74.24 74.41 51.41 35.79 78.80 15.99 76.20 35.97 40.87
KITTI

SegNet [1] 77.27 60.34 75.03 43.62 19.76 76.58 24.34 63.88 17.01 21.96
ParseNet [15] 81.26 70.42 73.85 42.12 41.04 71.48 32.02 77.20 31.60 47.49
SMSnet (ours) 78.39 74.27 78.10 46.11 26.85 79.88 34.84 83.63 37.70 42.88
Cityscapes

FCN-8s [16] 76.05 75.94 92.73 59.68 46.50 78.78 15.27 76.54 37.96 41.57
SegNet [1] 69.93 59.87 83.25 43.35 27.25 68.83 19.23 60.80 23.81 23.14
ParseNet [15] 77.58 76.23 92.76 60.04 47.96 79.68 22.66 76.85 40.99 44.54
SMSnet (ours) 85.43 81.08 94.50 66.89 49.26 84.85 37.92 82.40 47.48 46.47

Tested on:
test set consists of images containing sidewalks with out-
20m 40m 60m ∞
grown grass labelled as sidewalk as opposed to vegetation.
Such examples are consistently labelled as vegetation in the 20m 82.01 73.97 70.28 69.19
81.0

Cityscapes dataset, consequently causing misclassification. 79.5

Whereas, while testing on the Cityscapes test set, our pro- 78.0

Trained on:
40m 81.51 81.88 79.84 78.87
posed SMSnet substantially outperforms other networks in 76.5
75.0
all the classes. 60m 80.0 80.68 79.36 78.61
73.5

B. Influence of Range on Motion Segmentation Accuracy 72.0

∞ 80.64 81.52 80.43 79.77 70.5
In this section, we investigate the performance of motion
segmentation using SMSnet to various ranges within which
the moving objects might lie. One of the primary challenges Fig. 3. Moving object segmentation performance of our proposed SMSnet
while considering objects within different maximum ranges.
is learning motion features of moving objects that are at
far away distances from the camera, as the appearance of
the object and the pixel displacement are both very small.
public datasets, including our newly proposed City-KITTI-
To quantify this influence, we train models on examples
Motion dataset and evaluate its efficacy on test sets from
that have moving objects within certain maximum distance
complementary datasets. Specifically, we train our network
from the camera and objects that lie beyond this distance are
individually on KITTI, Cityscapes and the combined City-
ignored for training. We then evaluate each of these models
KITTI-Motion dataset and evaluate each of them on all their
on test sets containing moving objects at varying distances.
individual test sets. For this experiment, we use the 2-class
On the one hand, including training examples that are far
model with EFS trained and evaluated with a maximum
away might enable learning of more multiscale features that
range of 40 m. In Table III we summarize the results from
can cover a wide variety of motion appearances, but on
this experiment and show the mean IoU for the static
the other hand these highly difficult training examples can
and moving classes. It can be seen in Table III that the
also confuse the training if the network is unable to learn
model trained on the combined City-KITTI dataset performs
features that can distinguish the state of distant objects. For
well on both the Cityscapes and KITTI test sets than the
this experiment, we train models on the City-KITTI-Motion
models trained on their individual counterparts. The models
dataset and also evaluate on the corresponding test set as we
trained only the KITTI dataset or only on the Cityscapes
want the evaluation to generalize over both the Cityscapes
dataset have a substantially lower performance when they
and KITTI datatsets. The results of this experiment are shown
are tested on the Cityscapes or KITTI test set respectively
in Figure 3. As hypothesized, the best accuracy is obtained
than the model trained on the City-KITTI-Motion dataset.
for a maximum distance of 20 m and the accuracy gradually
This shows the utility of combining these datasets and the
decreases with increasing maximum distance. The best trade-
good generalization that it provides.
off is obtained for a maximum distance of 40 m. It can also
be seen that the model trained with the maximum distance
at infinity performs impressively well even for challenging TABLE III
moving object examples that are at far away distances. C OMPARISON OF MODELS TRAINED - TESTED ON DIFFERENT DATASETS .

C. Generality of the Network to Different Datasets Trained On Static Moving

KITTI Cityscapes KITTI Cityscapes
A large amount of good quality training data that encom-
passes the possible scenarios is essential for successful train- KITTI 78.51 52.05 70.32 34.84
ing of neural networks. Ideally, the training dataset should Cityscapes 61.29 84.27 51.65 75.31
City-KITTI 86.10 84.03 87.02 72.78
generalize to previously unseen scenes. In this section, we
investigate the performance of models trained on various
(a) (b) (c)
Fig. 4. Qualitative semantic motion segmentation results on the KITTI dataset. In each column top to bottom: input frame 2, corresponding ego-motion
subtracted flow, semantic motion segmentation output and segmented static and moving cars overlay. The network accurately segments and classifies moving
objects of different scales and at varying velocities, in addition to pixel-wise classification of the entire scene.

D. Prediction Time Comparison

Fast prediction time is one of the most critical require-
ments for perception algorithms used in real-world robotics.
Therefore, we designed SMSnet while keeping this critical
factor in mind. To the best of our knowledge, there exists
two alternate approaches that perform semantic motion seg-
mentation to which we compare our inference time with in
Table IV. Our proposed SMSnet takes 153 ms to predict a
single frame which is significantly faster that the other two
existing approaches. The SMSnet with ego-flow subtraction
takes 313 ms which includes the disparity prediction and
solving Equation 3. In contrast to existing approaches both
variants of our proposed SMSnet enable interactive speeds
which is a prerequisite for robotic applications.
TABLE IV
C OMPARISON OF PREDICTION TIME WITH STATE - OF - THE - ART.

Approach Time
(a) (b)
CRF-M [20] 240, 000 ms
U-Disp-CRF-FCN [7] 1, 060 ms Fig. 5. Qualitative semantic motion segmentation results on the Cityscapes
dataset. The network demonstrates robustness to complex scenes with many
SMSnet (ours) 153 ms different dynamic objects, some that are even partially occluded.
SMSnet with EFS (ours) 313 ms
and distinguishes between the static and moving cars even
E. Qualitative Evaluation in these diverse situations. Figure 5 presents results on the
In this section, we show qualitative results on various Cityscapes test set which contains more complex scenes than
datasets with our SMSnet trained on City-KITTI-Motion and the KITTI dataset. Figure 5 (a) shows a moving car over
critique its performance in diverse scenes. Figure 4 shows 80 m away and SMSnet succeeds in capturing this motion
results on images from the KITTI test set. The segmented while precisely segmenting the object. Figure 5 (b) shows
images are color coded according to the labels shown in a scene with a moving car that is partially occluded by a
Table II. Dark blue pixels indicate static cars and light green tree, yet the entire car is captured in the segmentation. This
pixels indicate moving cars. Figure 4 (a) and (b) are scenes demonstrates the ability of the SMSnet to handle diverse
from residential areas which have cars moving with low real-world scenarios.
velocities and Figure 4 (c) shows a scene on a highway which
has cars moving at much higher velocities. These scenes F. Evaluation of Transferability and Platform Independence
also have objects of different scales and lighting conditions. In this section, we demonstrate the platform independence
We can see that the network accurately segments the scene of our SMSnet model trained on the City-KITTI-Motion
Input frame 2 KITTI-Motion model Cityscapes-Motion model City-KITTI-Motion model

(a)

(b)
Fig. 6. Qualitative comparison of semantic motion segmentation models trained on various datasets and evaluated on real-world data from Freiburg.
Note that the model trained on the City-KITTI-Motion dataset generalizes better to the previously unseen city than others. Our network robustly handles
challenging conditions such as glare (a) and low lighting (b).

dataset by presenting qualitative evaluations on images cap- on our City-KITTI-Motion dataset generalized effectively to
tured using a different camera setup than those used in KITTI previously unseen conditions.
and Cityscapes datasets. We mounted a ZED stereo camera R EFERENCES
on the hood of a car and collected over 61, 000 images of
[1] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep
driving scenes in Freiburg, Germany. The recorded images convolutional encoder-decoder architecture for image segmentation,”
comprise of adverse conditions such as low lighting, glare arXiv: 1511.00561, 2015.
and motion blur, which pose a great challenge for semantic [2] T. Chen and S. Lu, “Object-level motion detection from moving
cameras,” TCSVT, vol. PP, no. 99, pp. 1–1, 2016.
motion segmentation. Figure 6 shows a comparison of results [3] W. Choi, C. Pantofaru, and S. Savarese, “A general framework for
using the SMSnet trained on KITTI-Motion, Cityscapes- tracking multiple people from a moving camera,” PAMI, 2013.
Motion and the combined City-KITTI-Motion datasets. In [4] M. Cordts et al., “The cityscapes dataset for semantic urban scene
understanding,” in CVPR, 2016.
Figure 6 (a), we see that the Cityscapes-Motion model [5] B. Drayer and T. Brox, “Object detection, tracking, and motion
misclassifies the car as static while it is moving and it has segmentation for object-level video segmentation,” arxiv:1608.03066,
false positives on the sides of the image due to motion blur. 2016.
[6] G. R. et al., “Vision-based offline-online perception paradigm for
The KITTI-Motion model on the other hand, segments the autonomous driving,” in WACV, 2015.
car as moving but fails to segment it as a whole, in addition [7] Q. Fan et al., “Semantic motion segmentation for urban dynamic scene
to having numerous false positives. Figure 6 (b) shows a understanding,” in CASE, 2016.
[8] P. Gao, X. Sun, and W. Wang, “Moving object detection based on
residential scene in low lighting. We see that the KITTI- kirsch operator combined with optical flow,” in IASP, 2010.
Motion model misclassifies the moving car in the left as static [9] N. Haque, D. Reddy, and K. M. Krishna, “Kitti semantic ground truth,”
and the Cityscapes-Motion model misclassies the static car as https://github.com/native93/KITTI-Semantic-Ground-Truth/, 2016.
[10] E. Ilg et al., “Flownet 2.0: Evolution of optical flow estimation with
moving. Both these models also have a difficulty segmenting deep networks,” arXiv:1612.01925, Dec 2016.
the sidewalk entirely. Overall, one can note that the model [11] Y. Jia et al., “Caffe: Convolutional architecture for fast feature em-
trained on City-Kitti-Motion performs substantially better in bedding,” arXiv preprint arXiv:1408.5093, 2014.
[12] J. Y. Kao et al., “Moving object segmentation using depth and optical
segmenting the static and moving classes as well as having flow in car driving sequences,” in ICIP, 2016, pp. 11–15.
negligible false positives, demonstrating the good generality [13] A. Kundu et al., “Moving object detection by multi-view geometric
of the learned kernels. techniques from a single camera mounted robot,” in IROS, 2009.
[14] T. H. Lin and C. C. Wang, “Deep learning of spatio-temporal features
VI. C ONCLUSION with geometric-based moving point detection for motion segmenta-
tion,” in ICRA, May 2014, pp. 3058–3065.
In this paper, we presented a convolutional neural network [15] W. Liu, A. Rabinovich, and A. C. Berg, “Parsenet: Looking wider to
that takes as input two input images and learns to predict both see better,” arXiv preprint arXiv: 1506.04579, 2015.
[16] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks
the semantic class label and motion status of each pixel in an for semantic segmentation,” in CVPR, 2015.
image. We introduced two large first-of-a-kind datasets with [17] N.Mayer et al., “A large dataset to train convolutional networks for
ground-truth annotations that enable training of deep neural disparity, optical flow, and scene flow estimation,” in CVPR, 2016.
[18] G. Oliveira, A. Valada, C. Bollen, W. Burgard, and T. Brox, “Deep
networks for semantic motion segmentation. We presented learning for human part discovery in images,” in ICRA, 2016.
comprehensive quantitative evaluations and demonstrated [19] M. P. Patel and S. K. Parmar, “Moving object detection with moving
that the performance of our network exceeds the state of background using optic flow,” in ICRAIE, 2014.
[20] N. D. Reddy, P. Singhal, and K. M. Krishna, “Semantic motion
the art, both in terms of accuracy and prediction time. We segmentation using dense CRF formulation,” in ICVGIP, 2014.
investigated the performance of motion segmentation to vary- [21] C. S. Royden and K. D. Moore, “Use of speed cues in the detection
ing object distances and showed that our network performs of moving objects by moving observers,” Vision Research, vol. 59, pp.
17 – 24, 2012.
well even for distant moving objects. We also presented [22] P. Spagnolo, T. Orazio, M. Leo, and A. Distante, “Moving object
extensive qualitative results that show the applicability to segmentation by background subtraction and temporal analysis,” Image
autonomous driving scenarios. Furthermore, we presented Vision Comput., vol. 24, no. 5, pp. 411–423, May 2006.
[23] A. Valada et al., “Adapnet: Adaptive semantic segmentation in adverse
qualitative evaluations of various SMSnet models on real- environmental conditions,” in ICRA, 2017.
world driving data from Freiburg that contain challenging [24] P. Xu et al., “Multimodal information fusion for urban scene under-
perceptual conditions and showed that the model trained standing,” Machine Vision and Applications, pp. 331–349, 2016.