Blitznet: A Real-Time Deep Network For Scene Understanding
Abstract
Object detection. The field of object detection has been recently dominated by variants of the R-CNN architecture [5, 25], where bounding-box proposals are independently classified by a convolutional neural network, and then filtered by a non-maximum suppression algorithm. It provides great accuracy, but relatively low inference speed since it requires a significant amount of computation per proposal. R-FCN [13] is a fully-convolutional variant that further improves detection and significantly reduces the computational cost per proposal. Its region-based mechanism is however dedicated to object detection only.
SSD [16] is a recent state-of-the-art object detector, which uses a sliding window approach instead of generated proposals to classify all boxes directly. SSD creates a scale pyramid to find objects of various sizes in one forward pass. Because of its speed and high accuracy, we have chosen to build our work on, and subsequently improve, the SSD approach. Finally, YOLO [23, 24] also provides real-time object detection and shares some ideas with SSD.
Semantic segmentation and deconvolutional layers. Deconvolutional architectures consist of adding to a classical convolutional neural network with feature pooling, a sequence of layers whose purpose is to increase the resolution of the output feature maps. This idea is natural in
3. Scene Understanding with BlitzNet
In this section, we introduce the BlitzNet architecture and discuss its different building blocks.
3.1. Global View of the Pipeline
The joint object detection and segmentation pipeline is presented in Figure 2. The input image is first processed by a convolutional neural network to produce a map that carries high-level features. Because of its high performance for classification and good trade-off for speed, we use the network ResNet-50 [9] as our feature encoder.
Then, the resolution of the feature map is iteratively reduced to perform a multi-scale search of bounding boxes, following the SSD approach [16]. Inspired by the hourglass architecture [19] for pose estimation and an earlier work on semantic segmentation [20], the feature maps are then up-scaled via deconvolutional layers in order to subsequently predict precise segmentation maps. The recent DSSD approach [4] uses a similar strategy for object detection; the top part of our architecture presented in Figure 2 may be seen as a variant of DSSD with a simpler “deconvolution module”, called ResSkip, that involves residual and skip connections.
Finally, prediction is achieved by single convolutional layers, one for detection, and one for segmentation, in one
Fig. 2: The BlitzNet architecture, which performs object detection and segmentation with one fully convolutional network.
On the left, CNN denotes a feature extractor, here ResNet-50 [9]; it is followed by the downscale-stream (in blue) and the last
part of the net is the upscale-stream (in purple), which consists of a sequence of deconvolution layers interleaved with ResSkip
blocks (see Figure 3). The localization and classification of bounding boxes (top) and pixelwise segmentation (bottom) are
performed in a multiscale fashion by single convolutional layers operating on the output of deconvolution layers.
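The ResSkip block referenced in the caption is described in Section 3.3 as: bilinear upsampling of the incoming map to the skip connection's size, concatenation, a 1 × 1 / 3 × 3 / 1 × 1 convolution block, and a residual sum with the upsampled input. The following is a minimal sketch of that data flow only, under stated assumptions: it uses plain nested lists in height × width × channel layout, random weights, and no activations or normalization (the text does not specify them). All function names (`bilinear_resize`, `conv`, `make_kernel`, `res_skip`) and the channel counts are hypothetical and not taken from the authors' code.

```python
import random

def bilinear_resize(x, out_h, out_w):
    """Bilinearly resize an H x W x C feature map stored as nested lists."""
    in_h, in_w, ch = len(x), len(x[0]), len(x[0][0])
    out = []
    for i in range(out_h):
        # Fractional source row for this output row (corners are aligned).
        fy = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(fy)
        y1, wy = min(y0 + 1, in_h - 1), fy - y0
        row = []
        for j in range(out_w):
            fx = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(fx)
            x1, wx = min(x0 + 1, in_w - 1), fx - x0
            row.append([
                (1 - wy) * ((1 - wx) * x[y0][x0][k] + wx * x[y0][x1][k])
                + wy * ((1 - wx) * x[y1][x0][k] + wx * x[y1][x1][k])
                for k in range(ch)
            ])
        out.append(row)
    return out

def conv(x, w, k):
    """k x k convolution, stride 1, zero padding; w[dy][dx][c_in][c_out]."""
    h, wd = len(x), len(x[0])
    c_in, c_out = len(w[0][0]), len(w[0][0][0])
    r = k // 2
    out = [[[0.0] * c_out for _ in range(wd)] for _ in range(h)]
    for i in range(h):
        for j in range(wd):
            for dy in range(k):
                for dx in range(k):
                    y, xx = i + dy - r, j + dx - r
                    if 0 <= y < h and 0 <= xx < wd:
                        for ci in range(c_in):
                            v = x[y][xx][ci]
                            for co in range(c_out):
                                out[i][j][co] += v * w[dy][dx][ci][co]
    return out

def make_kernel(k, c_in, c_out, init):
    """Build a k x k kernel whose entries come from the callable `init`."""
    return [[[[init() for _ in range(c_out)] for _ in range(c_in)]
             for _ in range(k)] for _ in range(k)]

def res_skip(up_in, skip, weights):
    """Fuse an upscale-stream map with a skip-connection map."""
    h, w = len(skip), len(skip[0])
    up = bilinear_resize(up_in, h, w)                   # match skip size
    cat = [[up[i][j] + skip[i][j] for j in range(w)]    # channel concat
           for i in range(h)]
    y = conv(conv(conv(cat, weights[0], 1), weights[1], 3), weights[2], 1)
    return [[[a + b for a, b in zip(y[i][j], up[i][j])]  # residual sum
             for j in range(w)] for i in range(h)]

# Toy example: a 2 x 2 map with 2 channels fused with a 4 x 4 skip map with 3.
rng = random.Random(0)
up_in = [[[rng.random() for _ in range(2)] for _ in range(2)] for _ in range(2)]
skip = [[[rng.random() for _ in range(3)] for _ in range(4)] for _ in range(4)]
init = lambda: rng.uniform(-0.1, 0.1)
weights = [make_kernel(1, 5, 4, init), make_kernel(3, 4, 4, init),
           make_kernel(1, 4, 2, init)]
out = res_skip(up_in, skip, weights)
```

The output has the spatial size of the skip connection and the channel count of the upsampled input, which is what makes the residual sum well defined. A real implementation would use a tensor library and learned weights.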
forward pass, which is the main originality of our work.
3.2. SSD and Downscale Stream
The Single Shot MultiBox Detector [16] tiles an input image with a regular grid of anchor boxes and then uses a convolutional neural network to classify these boxes and predict corrections to their initial coordinates. In the original paper [16], the base network VGG-16 [27] is followed by a cascade of convolutional and pooling layers to form a sequence of feature maps with progressively decreasing spatial resolution and increasing field of view. In [16], each of these layers is processed separately in order to classify and predict coordinate corrections for a set of default bounding boxes of a particular scale. At test time, the set of predicted bounding boxes is filtered by non-maximum suppression (NMS) to form the final output.
Our pipeline uses such a cascade (see Figure 2), but the classification of bounding boxes and pixels to build the segmentation maps is performed in subsequent layers, called deconvolutional layers, which will be described next.
3.3. Deconvolution Layers and ResSkip Blocks
Modeling visual context is often a key to parsing complicated scenes, which is typically achieved by pooling layers in a convolutional neural network, leading to large receptive fields for each output neuron. For semantic segmentation, precise localization is equally important, and [20] proposes to use deconvolutional operations to solve that issue. Later, this process was improved in [19] by adding skip connections. Apart from combining high- and low-level features, it also eases the learning process [9].
Like [4] for object detection and [19] for pose estimation, we also use such a mechanism with skip connections that combines feature maps from the downscale and upscale streams (see Figure 2). More precisely, maps from the downscale and upscale streams are combined with a simple strategy, which we call ResSkip, presented in Figure 3. First, incoming feature maps are upsampled to the size of the corresponding skip connection via bilinear interpolation. Then both skip connection feature maps and upsampled maps are concatenated and passed through a block (1 × 1 convolution, 3 × 3 convolution, 1 × 1 convolution) and summed with the upsampled input through a residual connection. The benefits of this topology will be justified and discussed in more detail in the experimental section.
Fig. 3: ResSkip block integrating feature maps from the upscale and downscale streams, with skip connection.
3.4. Multiscale Detection and Segmentation
The problems of semantic segmentation and object detection share several key properties. They both require per-region classification, based on the pixels inside an object while taking into account its surrounding, and benefit from rich features that include localization information. Instead of training a separate network to perform these two tasks, we train a single one that allows weight sharing, such that both tasks can benefit from each other.
In our pipeline, most of the weights are shared. Object detection is performed by a single convolutional layer that predicts a class and coordinate corrections for each bounding box in the feature maps of the upscale stream. Similarly, a single convolutional layer is used to predict the pixel labels and produce segmentation maps. To achieve this we upscale all the activations of the upscale stream, concatenate them and feed them to the final classification layer.
3.5. Speeding up Non-Maximum Suppression
Increasing the number of anchor boxes heavily affects inference time because NMS is then performed on a potentially huge number of proposals (in the worst case scenario, it may be all of them). Indeed, we observed that by using sliding window proposals, addition of small scale proposals slows down the inference even more than increasing image resolution. Surprisingly, non-maximum suppression may then become the bottleneck at inference time. We observed that this occurred sometimes for particular object classes that return a lot of bounding box candidates.
Therefore, we suggest a different post-processing strategy to accelerate detection when there are too many proposals. For each class, we pre-select the top 400 boxes with largest scores, and perform NMS leaving only 50 of them. Overall, the final detection is the top 200 highest scoring boxes per image after non-maximum suppression. This strategy yields a reasonable computational time for NMS, and has marginal impact on accuracy.
3.6. Training and Loss Functions
Given labeled training data where each data point is annotated with segmentation maps, or bounding boxes, or with both, we consider a loss function which is simply the sum of the loss functions of the two tasks. Note that we tried reweighting the two loss functions, but we did not observe noticeable improvements in terms of accuracy.
For segmentation, the loss is the cross-entropy between predicted and target class distribution of pixels [1]. Specifically, we use a 1 × 1 convolutional operation with 64 channels to map each layer of the upscale-stream to an intermediate representation. After this, each layer is upscaled to the size of the last layer using bilinear interpolation and all maps are concatenated. This representation is mapped to c feature maps, where c is the number of classes, by using 3 × 3 convolutions to predict posterior class probabilities.
For detection, we use the same loss function as [16] when performing tiling of the input image with anchor boxes and matching them to ground truth bounding boxes. We use activations of each layer in the upscale-stream to regress corrections for coordinates of the anchor boxes and to predict the class probability distribution. We use the same data augmentation suggested in the original SSD pipeline, namely photometric distortions, random crops, horizontal flips and zoom-out operation.
4. Experiments
We now present various experiments conducted on the COCO, Pascal VOC 2007 and 2012 datasets, for which both bounding box annotations and segmentation maps are available. Section 4.1 discusses in more detail the datasets and the metrics we used; Section 4.2 presents technical details that are useful to make our work reproducible, and then each subsequent subsection is devoted to a particular experiment. The last two sections discuss the inference speed and clarify particular choices in the network architecture. Our code is now available as an open-source software package at http://thoth.inrialpes.fr/research/blitznet/.
4.1. Datasets and Metrics
We use the COCO [15], VOC07, and VOC12 datasets [2]. All images in the VOC datasets are annotated with ground truth bounding boxes of objects and only a subset of VOC12 is annotated with target segmentation masks. The VOC07 dataset is divided into 2 subsets, trainval (5011 images) and test (4952 images). The VOC12-train subset contains 5717 images annotated for detection and 1464 of them have segmentation ground truth as well (VOC12-train-seg), while VOC12-val has 5823 images for detection and 1449 images for segmentation (we call this subset VOC12-val-seg). Both datasets have 20 object classes.
The COCO dataset includes 80 object categories for detection and instance segmentation. For the task of detection, there are 80k images for training and 40k for validation. There is neither a protocol for evaluating semantic segmentation nor annotations to train it from. In this work, we are interested particularly in semantic segmentation masks so we obtain them from instance segmentation annotations by combining instances of one category.
To carry out more extensive experiments we leverage extra annotations for VOC12 segmentation provided by [8], which gives a total of 10,582 fully annotated images for training that we call VOC12-train-seg-aug. We still keep the original PASCAL annotations in VOC12 val-seg, even if a more precise annotation is available in [8], for a fair comparison with other methods that do not benefit from these extra annotations.
In VOC12 and VOC07 datasets, a predicted bounding box is correct if its intersection over union with the ground truth bounding box is higher than 0.5. The metric for evaluating detection performance is the mean average precision (mAP) and the quality of predicted segmentation masks is measured with mean intersection over union (mIoU).
We further improve the results by training for detection and segmentation jointly, achieving 79.1% and 81.5% mAP with BlitzNet300 (s4) and BlitzNet512 (s8) respectively. We think that the performance gain for BlitzNet300 over BlitzNet512 could be explained by the smaller stride used for the last layer (4, versus 8 for BlitzNet512), which seems to help learn finer details. Unfortunately, training BlitzNet512 with stride 4 was impossible because of memory limitations on our single GPU.
4.4. PASCAL VOC 2012
Table 1: Comparison of detection performance on Pascal VOC 2007 test set. The models were trained on VOC07 trainval
+ VOC12 trainval. The models that have suffix “+ seg” were trained for segmentation jointly with data from VOC12 trainval
and extra annotations provided by [8]. The values in columns correspond to average precision per class (%).
network backbone mAP aero bike bird boat bottle bus car cat chair cow table dog horse mbike persn plant sheep sofa train tv
SSD300* [16] VGG-16 75.8 88.1 82.9 74.4 61.9 47.6 82.7 78.8 91.5 58.1 80.0 64.1 89.4 85.7 85.5 82.6 50.2 79.8 73.6 86.6 72.1
BlitzNet300 ResNet50 75.4 87.4 82.1 74.5 61.6 45.9 81.5 78.3 91.4 58.2 80.3 64.9 89.1 83.5 85.7 81.5 50.5 79.9 74.7 84.8 71.1
BlitzNet300 + COCO ResNet50 80.2 91.0 86.5 80.0 70.1 54.7 84.4 84.1 92.5 65.1 83.5 69.2 91.2 88.1 88.5 85.7 55.8 85.4 79.3 89.8 78.2
R-FCN[13] ResNet-101 77.6 86.9 83.4 81.5 63.8 62.4 81.6 81.1 93.1 58.0 83.8 60.8 92.7 86.0 84.6 84.4 59.0 80.8 68.6 86.1 72.9
Faster RCNN ResNet-101 73.8 86.5 81.6 77.2 58.0 51.0 78.6 76.6 93.2 48.6 80.4 59.0 92.1 85.3 84.8 80.7 48.1 77.3 66.5 84.7 65.6
YOLO [23] YOLOnet 57.9 77.0 67.2 57.7 38.3 22.7 68.3 55.9 81.4 36.2 60.8 48.5 77.2 72.3 71.3 63.5 28.9 52.2 54.8 73.9 50.8
SSD512* [16] VGG-16 78.5 90.0 85.3 77.7 64.3 58.5 85.1 84.3 92.6 61.3 83.4 65.1 89.9 88.5 88.2 85.5 54.4 82.4 70.7 87.1 75.6
BlitzNet512 ResNet50 79.0 89.9 85.2 80.4 67.2 53.6 82.9 83.6 93.8 62.5 84.0 65.8 91.6 86.6 87.6 84.6 56.8 84.7 73.9 88.0 75.7
BlitzNet512 + COCO ResNet50 83.8 93.1 89.4 84.7 75.5 65.0 86.6 87.4 94.5 69.9 88.8 71.7 92.5 91.6 91.1 88.9 61.2 90.4 79.2 91.8 83.0
Table 2: Comparison of detection performance on Pascal VOC 2012 test set. The models were trained on VOC07
trainval + VOC12 trainval. The BlitzNet models were trained for segmentation jointly with data from VOC12 trainval and
extra annotations provided by [8]. Suffix ‘+ COCO’ means that the model was pretrained on the COCO dataset. The reported
values correspond to average precision per class (%). Detailed results of submissions are available on the VOC12 test server.
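The average precision values in these tables rest on the matching rule stated in Section 4.1: a predicted box counts as correct if its intersection over union (IoU) with a ground-truth box is higher than 0.5. A minimal sketch of that rule follows. The function names and the (xmin, ymin, xmax, ymax) box layout with continuous coordinates are assumptions made for illustration; the official VOC evaluation code has its own conventions.

```python
def box_iou(a, b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    # Width and height of the overlap rectangle, clamped at zero.
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def is_correct(pred_box, gt_box, threshold=0.5):
    """Apply the VOC-style criterion: IoU strictly higher than the threshold."""
    return box_iou(pred_box, gt_box) > threshold

# Two boxes of area 4 that overlap on half of their area: IoU = 2 / 6.
half_overlap = box_iou((0, 0, 2, 2), (1, 0, 3, 2))
```

In this example `half_overlap` is one third, so the prediction would not count as a correct detection.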
network      seg  det  mIoU  mAP
BlitzNet300       X    -     78.9
BlitzNet300  X    X    72.8  80.0
BlitzNet300  X         72.4  -
Table 3: The effect of joint learning on both tasks. The networks were trained on VOC12 train-seg-aug, and tested on VOC12 val.
network      seg  det  mIoU  mAP
BlitzNet300       X    -     83.0
BlitzNet300  X    X    75.7  83.6
BlitzNet300  X         72.4  -
Table 4: The effect of extra data with bounding box annotations on segmentation performance. The networks were trained on VOC12 trainval (aug) + VOC07 trainval. Detection performance is measured in average precision (%) and mean IoU is the metric for segmentation (%).
network      seg  det  mIoU  mAP
BlitzNet512       X    -     33.2
BlitzNet512  X    X    53.5  34.1
BlitzNet512  X         48.3  -
Table 5: The effect of joint training tested on COCO minival2014. The networks were trained on COCO train.
method       minival2014         test-dev2015
             int   0.5   0.75    int   0.5   0.75
BlitzNet300  29.7  49.4  31.2    29.8  49.7  31.1
BlitzNet512  34.1  55.1  35.9    34.2  55.5  35.8
Table 6: Detection performance of BlitzNet on the COCO dataset, with minival2014 and test-dev2015 splits. The networks were trained on COCO trainval dataset. Detection performance is measured in average precision (%) with different criteria, namely, minimum Jaccard overlap between annotated and predicted bounding box is 0.5, 0.75 or integrated from 0.5 to 0.95 (column “int”).
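The segmentation results in Tables 3 to 5 are reported as mean intersection over union (mIoU). As an illustration, here is a minimal sketch of the usual definition: for each class, the number of pixels where prediction and ground truth agree on that class, divided by the number of pixels where either one has that class, then averaged over classes. The function name, the flat label lists and the choice to skip classes absent from both inputs are assumptions; benchmark servers typically accumulate counts over the whole dataset and handle "void" labels, which this sketch does not.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU from two equal-length sequences of per-pixel class indices."""
    inter = [0] * num_classes   # pixels where both agree on the class
    union = [0] * num_classes   # pixels where either one has the class
    for p, t in zip(pred, target):
        if p == t:
            inter[p] += 1
            union[p] += 1
        else:
            union[p] += 1
            union[t] += 1
    # Average only over classes that occur in prediction or ground truth.
    ious = [inter[c] / union[c] for c in range(num_classes) if union[c] > 0]
    return sum(ious) / len(ious) if ious else 0.0

# Four pixels, two classes: IoU is 1/2 for class 0 and 2/3 for class 1.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

Here `score` is the average of 1/2 and 2/3, that is 7/12.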
network backbone mAP % FPS # proposals input resolution
Faster-RCNN[25] VGG-16 73.2 7 - ∼ 1000 × 600
R-FCN[13] ResNet-101 80.5 9 - ∼ 1000 × 600
SSD300*[16] VGG-16 77.1 46 8732 300 × 300
SSD512*[16] VGG-16 80.6 19 24564 512 × 512
YOLO [23] YOLO net 63.4 46 - -
BlitzNet300 (s4) ResNet-50 79.1 24 45390 300 × 300
BlitzNet512 (s8) ResNet-50 81.5 19.5 32766 512 × 512
Table 7: Comparison of inference time on PASCAL VOC 2007, when running on a Titan X (Maxwell) GPU.
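The proposal counts in Table 7 explain why Section 3.5 caps the work done by non-maximum suppression: per class, keep the 400 highest-scoring boxes, run NMS and retain at most 50, then keep the 200 highest-scoring detections per image. A minimal sketch of that post-processing is given below. It uses greedy NMS; the IoU threshold of 0.5, the function names and the data layout are assumptions, since the text does not specify them.

```python
def box_iou(a, b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(dets, iou_thresh):
    """Greedy NMS over (score, box) pairs; returns survivors, best first."""
    kept = []
    for score, box in sorted(dets, key=lambda d: d[0], reverse=True):
        # Keep a box only if it does not overlap a better one too much.
        if all(box_iou(box, k[1]) <= iou_thresh for k in kept):
            kept.append((score, box))
    return kept

def postprocess(per_class, pre_nms=400, per_class_keep=50,
                per_image_keep=200, iou_thresh=0.5):
    """per_class maps a class name to a list of (score, box) candidates."""
    final = []
    for cls, dets in per_class.items():
        # Step 1: pre-select the highest-scoring candidates of this class.
        top = sorted(dets, key=lambda d: d[0], reverse=True)[:pre_nms]
        # Step 2: NMS, then cap the number of survivors per class.
        kept = nms(top, iou_thresh)[:per_class_keep]
        final.extend((score, cls, box) for score, box in kept)
    # Step 3: keep the best detections over all classes for the image.
    final.sort(key=lambda d: d[0], reverse=True)
    return final[:per_image_keep]

candidates = {
    "cat": [(0.9, (0, 0, 10, 10)), (0.8, (1, 1, 10, 10)),
            (0.3, (20, 20, 30, 30))],
    "dog": [(0.7, (0, 0, 10, 10))],
}
result = postprocess(candidates)
```

In the toy input, the second "cat" box overlaps the first with IoU 0.81 and is suppressed, while the "dog" box survives because NMS is applied per class. The pre-selection bounds the quadratic cost of NMS regardless of how many anchors the network emits.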
Block type mAP mIoU
Hourglass-style [19] 78.7 75.6
Refine-style [22] 78.0 76.1
ResSkip (no res) 78.4 75.3
ResSkip (ours) 79.1 75.7
Table 8: The effect of fusion block type on performance, measured on detection (VOC07-test) and segmentation (VOC12-val). The networks were trained on VOC12-train (aug) + VOC07 trainval, see Sec. 4.1. Detection performance is measured in average precision (%) and mean IoU is the metric for segmentation (%).
detection results on COCO test-dev2015 in Table 6. Our results are also publicly available on the COCO evaluation test server.
4.6. Inference Speed Comparison
In Table 7 and Figure 4, we report speed comparison to other state-of-the-art detection pipelines. Our approach is the most accurate among the real-time detectors working at 24 frames per second (FPS), and in the setting close to real time (19 FPS) it provides the most accurate detections among the counterparts, while also providing semantic segmentation masks. Note that all methods are run using the same GPU (Titan X, Maxwell architecture).
Fig. 4: Speed comparison with other methods. The detection accuracy of different methods measured in mAP is depicted on y-axis. x-coordinate is their speed, in FPS.
4.7. Study of the Network Architecture
The BlitzNet pipeline simultaneously operates with several types of data. To demonstrate the effectiveness of the ResSkip block, we set up the following experiment: we leave the pipeline unchanged while only substituting this block with another one. We consider in particular fusion blocks that appear in the state-of-the-art approaches on semantic segmentation [19, 22, 26]. Table 8 shows that our ResSkip block performs similarly to or better than (on average) all counterparts, which may be due to the fact that its design uses skip connections similar to those of the backbone network ResNet50, making the overall architecture more homogeneous.
Optimal parameters for the size of intermediate representations in the segmentation stream (64) as well as the number of channels in the upscale-stream (512) were found by using a validation set. We did not conduct experiments by changing the number of layers in the upscale-stream, since our architecture is designed to be symmetric with respect to the convolution and deconvolution steps. Reducing the number of steps will result in a smaller number of layers in the upscale stream, which may deteriorate the performance as noted in [16].
5. Conclusion
In this paper, we introduce a joint approach for object detection and semantic segmentation. By using a single fully-convolutional network to solve both problems at the same time, learning is facilitated by weight sharing between the two tasks, and inference is performed in real time. Moreover, we show that our pipeline is competitive in terms of accuracy, and that the two tasks benefit from each other.
Fig. 5: Effect of extra data annotated for detection on the quality of estimated segmentation masks. The first column
displays test images; the second column contains their segmentation ground truth masks. The third column corresponds to
segmentations predicted by BlitzNet300 trained on VOC12 train-segmentation augmented with extra segmentation masks
and VOC07. The last column shows segmentation masks produced by the same architecture but trained without VOC07.
Acknowledgements. This work was supported by a grant from ANR (MACARON, ANR-14-CE23-0003-01) and by the ERC projects SOLARIS and ALLEGRO. We gratefully acknowledge the Intel gift and the support of NVIDIA Corporation with the donation of GPUs used for this research.
References
[1] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015.
[2] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
[3] S. Fidler, R. Mottaghi, A. Yuille, and R. Urtasun. Bottom-up segmentation for top-down detection. In CVPR, 2013.
[4] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg. DSSD: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659, 2017.
[5] R. Girshick. Fast R-CNN. In ICCV, 2015.
[6] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[7] S. Gould, T. Gao, and D. Koller. Region-based segmentation and object detection. In NIPS, 2009.
[8] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV, 2011.
[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[10] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[11] I. Kokkinos. UberNet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In CVPR, 2017.
[12] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989.
[13] Y. Li, K. He, J. Sun, et al. R-FCN: Object detection via region-based fully convolutional networks. In NIPS, 2016.
[14] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[15] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[16] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In ECCV, 2016.
[17] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[18] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille. The role of context for object detection and semantic segmentation in the wild. In CVPR, 2014.
[19] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.
[20] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In ICCV, 2015.
[21] G. Papandreou, L.-C. Chen, K. P. Murphy, and A. L. Yuille. Weakly-and semi-supervised learning of a deep convolutional network for semantic image segmentation. In ICCV, 2015.
[22] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár. Learning to refine object segments. In ECCV, 2016.
[23] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.
[24] J. Redmon and A. Farhadi. YOLO9000: better, faster, stronger. arXiv preprint arXiv:1612.08242, 2016.
[25] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[26] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
[27] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[28] M. Teichmann, M. Weber, M. Zoellner, R. Cipolla, and R. Urtasun. Multinet: Real-time joint semantic reasoning for autonomous driving. arXiv preprint arXiv:1612.07695, 2016.
[29] J. Yao, S. Fidler, and R. Urtasun. Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In CVPR, 2012.
Supplementary Material
Fig. 6: Qualitative results for the task of object detection. The results are obtained by BlitzNet512 trained on VOC07
and VOC12 train-val augmented with extra segmentation masks.
Fig. 7: Improved and failure cases of detection by BlitzNet300 compared to SSD300. Each pair of images corresponds
to the results of detection by SSD300 (left) and BlitzNet300 (right). The cases of improved detection are presented in the
top part of the figure and the cases where both methods still fail are placed below the dashed line. It’s clear that our pipeline
provides more accurate detections in the presence of small objects, complicated scenes and objects consisting of several parts
with different appearance. The failure cases indicate that modern pipelines still struggle to handle ambiguous big objects (top
left), intraclass variability (top right), misleading context (bottom right) and highly occluded objects (bottom left).