Real-Time Instance Segmentation of Traffic Videos for Embedded Devices
1. Introduction
2. Related Work
3. Proposed Method
3.1. Backbone Architecture
3.2. Segmentation Head
- (a)
- To achieve a faster execution time, only the P3 feature maps are used as input to the proposed segmentation head branch. The highest-resolution feature maps were selected because they can produce high-resolution prototype masks using only one upsample layer. Note that the FPN computes the P3 feature maps from Stage 3, Stage 4, and Stage 5, which yields good-quality masks even for small objects.
- (b)
- The use of the CoordConv layer is a novel idea that is not part of the original YOLACT [5] design. With it, the prototype masks are generated not only from the feature maps produced by the FPN, but also from the pixel locations. Note that each prototype specializes in a specific segmentation task [5], e.g., segmenting foreground objects in the left part of the image, and the CoordConv layer is used to reinforce this behavior.
- (c)
- The output of the prototype masks is unbounded, so that the network can output strong activations for the prototypes with high confidence.
- (d)
- The proposed design limits the number of high resolution segmented masks generated by SOLACT by setting Note that the number of prototypes is independent of the number of objects captured by the image, and it is smaller than the one used in the SOLO architecture [6]. The proposed design helps to reduce the memory footprint of the proposed architecture and the overloading of computational resources.
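The role of the CoordConv layer in item (b) can be illustrated with a minimal sketch in plain Python. It assumes the common CoordConv convention of appending two extra channels holding x and y coordinates normalized to [-1, 1]; the function names `coord_channels` and `add_coords` and the toy feature map are hypothetical and are not taken from the SOLACT implementation.

```python
def coord_channels(height, width):
    """Return two H x W grids with normalized x and y coordinates in [-1, 1]."""
    def norm(i, n):
        # Map index 0..n-1 linearly onto [-1, 1]; a single-pixel axis maps to 0.
        return 0.0 if n == 1 else 2.0 * i / (n - 1) - 1.0

    xs = [[norm(c, width) for c in range(width)] for _ in range(height)]
    ys = [[norm(r, height) for _ in range(width)] for r in range(height)]
    return xs, ys


def add_coords(feature_map):
    """Append x and y coordinate channels to a C x H x W feature map (nested lists)."""
    height, width = len(feature_map[0]), len(feature_map[0][0])
    xs, ys = coord_channels(height, width)
    return feature_map + [xs, ys]


# Toy stand-in for a P3 feature map: 1 channel, 2 rows, 3 columns (assumed values).
p3 = [[[0.5, 0.1, 0.9], [0.3, 0.7, 0.2]]]
augmented = add_coords(p3)
print(len(augmented))   # 3 channels: 1 original + x + y
print(augmented[1][0])  # x coordinates of the first row: [-1.0, 0.0, 1.0]
print(augmented[2])     # y coordinates: [[-1.0, -1.0, -1.0], [1.0, 1.0, 1.0]]
```

A convolution applied to the augmented map sees the pixel position as ordinary input channels, which is what allows a prototype to favor, for example, the left part of the image.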
3.3. Object Detection Head
- (a)
- Feature map downsample: Each feature map is downsampled to an grid cell to reduce the execution time with the drawback of limiting the number of possible instances detected, i.e., if there are several very close objects, it will not be possible to distinguish them.
- (b)
- Coordinate convolution: SOLACT follows the approach from SOLO [6] and introduces a pre-processing step based on the CoordConv layer. Thanks to this simple operation, the spatial information is added with a negligible speed impact.
- (d)
- Anchor-free approach: Traditional object detection networks follow the anchor-based approach, where, instead of generating a single prediction for each cell in the feature maps, several predictions are created, each assigned an object size and shape. Such an approach can help to detect different object sizes and shapes; however, recent studies tend to avoid the design complexity of anchor-based detection [25,26,27,28,29]. The anchor-free strategy improves training performance, because each location on the feature map is responsible for detecting the objects centered on it, independently of their size or shape. This reduces the complexity, and as a result, the network learns the object features at each location better.
- (e)
- Bounding box-free segmentation: Unlike most image segmentation networks, SOLACT does not depend on the bounding box prediction to construct the final segmentation masks. This may cause some false mask pieces to fall outside the hypothetical bounding box; however, it gives the network much more flexibility for layer pruning and network acceleration.
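The grid-based, anchor-free assignment described in the list above, and the limitation noted for very close objects, can be sketched as follows. This is an illustrative example only: the grid size, the image size, the object centers, and the rule that an occupied cell keeps its first object are all assumptions, and the helper names `center_to_cell` and `assign_objects` are hypothetical.

```python
def center_to_cell(cx, cy, image_w, image_h, grid_size):
    """Map an object center (in pixels) to the (row, col) of the grid cell containing it."""
    col = min(int(cx / image_w * grid_size), grid_size - 1)
    row = min(int(cy / image_h * grid_size), grid_size - 1)
    return row, col


def assign_objects(centers, image_w, image_h, grid_size):
    """Assign each object to one cell; an occupied cell keeps its first object (assumed rule)."""
    cells = {}
    dropped = []
    for name, (cx, cy) in centers.items():
        cell = center_to_cell(cx, cy, image_w, image_h, grid_size)
        if cell in cells:
            dropped.append(name)  # two centers fall into the same cell
        else:
            cells[cell] = name
    return cells, dropped


# Hypothetical object centers in a 640 x 480 image with an 8 x 8 grid.
centers = {"car_a": (100, 200), "car_b": (110, 205), "bus": (500, 300)}
cells, dropped = assign_objects(centers, image_w=640, image_h=480, grid_size=8)
print(cells)    # {(3, 1): 'car_a', (5, 6): 'bus'}
print(dropped)  # ['car_b']
```

Because each cell is responsible for at most one object regardless of its size or shape, no anchor boxes are needed, but two objects whose centers share a cell cannot both be represented.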
3.4. Post-Processing Algorithm
3.4.1. Initial Filtering
3.4.2. Masks’ Generation and Maskness Computation
3.4.3. Final Filtering and Non-Maximum Suppression
3.5. Training Details
3.5.1. Labels’ Assignment
3.5.2. Loss Function Formulation
3.5.3. Learning Rate Adjustment
4. Experimental Validation
4.1. Experimental Setup
- (1)
- A network architecture was trained using the training set, and the trained model was saved.
- (2)
- The model was tested on the COCO test set, and the instance segmentation results based on the COCO metrics were saved as a json file.
- (3)
- The json file was uploaded online on the COCO Detection Challenge (Segmentation Mask) website [32], where the numerical results for each class are provided as a scoring output log file.
- (4)
- The results obtained for the traffic classes were extracted. In this paper, the following six values are reported: (i) the Average Precision (AP) for IoU = 0.50:0.05:0.95 and all areas (small, medium, and large), called AP; (ii) AP for IoU = 0.50 and all areas, called AP50; (iii) AP for IoU = 0.75 and all areas, called AP75; (iv) AP for IoU = 0.50:0.05:0.95 and small areas, called APs; (v) AP for IoU = 0.50:0.05:0.95 and medium areas, called APm; (vi) AP for IoU = 0.50:0.05:0.95 and large areas, called APℓ.
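The COCO metrics listed above rest on the Intersection over Union (IoU) between a predicted mask and a ground-truth mask: in the COCO convention, AP50 and AP75 count a prediction as correct when its IoU reaches 0.50 and 0.75, respectively. The sketch below shows that computation on toy binary masks; the masks and the function name `mask_iou` are assumptions for illustration, and the official evaluation is performed by the COCO server, not by this code.

```python
def mask_iou(pred, truth):
    """Intersection over Union of two binary masks given as equally sized nested lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter / union if union else 0.0


# Toy 1 x 5 masks: 3 pixels overlap, 5 pixels are covered by either mask.
pred = [[1, 1, 1, 1, 0]]
truth = [[0, 1, 1, 1, 1]]

iou = mask_iou(pred, truth)
matches_at_50 = iou >= 0.50  # counted as a true positive for AP50
matches_at_75 = iou >= 0.75  # not counted as a true positive for AP75
print(iou, matches_at_50, matches_at_75)  # 0.6 True False
```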
4.2. Experimental Results
4.3. Qualitative Results
4.4. Ablation Study
- (a)
- Among the variations, the lightweight SOLACT comes closest in performance to the basic SOLACT architecture.
- (b)
- The lightweight SOLACT equipped with a PeleeNet-based backbone represents the best choice in terms of speed when deployed on the NVIDIA Tegra TX2 embedded device.
- (c)
- The lightweight SOLACT variation equipped with a MobileNetV2-based backbone represents the best choice in terms of speed when deployed on the NVIDIA AGX Xavier embedded device.
- (d)
- The basic SOLACT architecture remains the best choice in terms of performance.
- (e)
- All proposed architectures achieve real-time performance (more than 30 Frames Per Second (FPS)) when deployed on the NVIDIA AGX Xavier [9].
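The observations above can be cross-checked against the values reported in the ablation table (test-set AP and speed on both embedded devices). The short script below is only a reading aid under that assumption: the numbers are copied from the table, every row other than the basic SOLACT is treated as a variation, and the variable names are hypothetical.

```python
# (test-set AP, FPS on Tegra TX2, FPS on AGX Xavier), copied from the ablation table.
results = {
    "SOLACT": (31.57, 6.66, 33.15),
    "Lightweight segmentation head": (27.50, 8.32, 40.04),
    "Channel pruning": (27.64, 12.49, 58.86),
    "Lightweight SOLACT": (28.20, 9.04, 43.54),
    "Reduced patch size": (26.90, 13.72, 64.58),
    "PeleeNet-based backbone": (25.14, 15.19, 59.76),
    "MobileNetV2-based backbone": (22.08, 12.69, 66.25),
}
REAL_TIME_FPS = 30.0  # real-time threshold stated in item (e)

# All rows except the basic architecture are treated as variations (assumption).
variants = {name: row for name, row in results.items() if name != "SOLACT"}

closest_to_basic = max(variants, key=lambda name: variants[name][0])
fastest_on_tx2 = max(variants, key=lambda name: variants[name][1])
fastest_on_xavier = max(variants, key=lambda name: variants[name][2])
best_ap = max(results, key=lambda name: results[name][0])
all_real_time_on_xavier = all(row[2] > REAL_TIME_FPS for row in results.values())

print(closest_to_basic)         # Lightweight SOLACT
print(fastest_on_tx2)           # PeleeNet-based backbone
print(fastest_on_xavier)        # MobileNetV2-based backbone
print(best_ap)                  # SOLACT
print(all_real_time_on_xavier)  # True
```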
5. Conclusions
Author Contributions
Data Availability Statement
Conflicts of Interest
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the International Conference on Neural Information Processing Systems, Stateline, NV, USA, 2012; Volume 1, pp. 1097–1105. Available online: (accessed on 16 August 2020).
- Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes Dataset for Semantic Urban Scene Understanding. arXiv 2016, arXiv:1604.01685.
- Kirillov, A.; He, K.; Girshick, R.B.; Rother, C.; Dollár, P. Panoptic Segmentation. arXiv 2018, arXiv:1801.00868.
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R.B. Mask R-CNN. arXiv 2017, arXiv:1703.06870.
- Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. YOLACT: Real-Time Instance Segmentation. In Proceedings of the International Conference on Computer Vision (ICCV), Seoul, Korea, October 2019; pp. 9156–9165. Available online: (accessed on 16 August 2020).
- Wang, X.; Kong, T.; Shen, C.; Jiang, Y.; Li, L. SOLO: Segmenting Objects by Locations. arXiv 2019, arXiv:1912.04488.
- Li, Y.; Qi, H.; Dai, J.; Ji, X.; Wei, Y. Fully Convolutional Instance-aware Semantic Segmentation. arXiv 2016, arXiv:1611.07709.
- NVIDIA. NVIDIA Jetson TX2 Delivers Twice the Intelligence to the Edge | NVIDIA Developer Blog. Available online: (accessed on 16 August 2020).
- NVIDIA. AI-Powered Autonomous Machines at Scale | NVIDIA Jetson AGX Xavier. Available online: (accessed on 16 August 2020).
- Ren, S.; He, K.; Girshick, R.B.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv 2015, arXiv:1506.01497.
- Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object Detection via Region-based Fully Convolutional Networks. arXiv 2016, arXiv:1605.06409.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385.
- Lin, T.; Dollár, P.; Girshick, R.B.; He, K.; Hariharan, B.; Belongie, S.J. Feature Pyramid Networks for Object Detection. arXiv 2016, arXiv:1612.03144.
- Huang, Z.; Huang, L.; Gong, Y.; Huang, C.; Wang, X. Mask Scoring R-CNN. arXiv 2019, arXiv:1903.00241.
- Chen, L.; Hermans, A.; Papandreou, G.; Schroff, F.; Wang, P.; Adam, H. MaskLab: Instance Segmentation by Refining Object Detection with Semantic and Direction Features. arXiv 2017, arXiv:1712.04837.
- Carreira, J.; Sminchisescu, C. CPMC: Automatic Object Segmentation Using Constrained Parametric Min-Cuts. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 1312–1328.
- Pinheiro, P.H.O.; Collobert, R.; Dollár, P. Learning to Segment Object Candidates. arXiv 2015, arXiv:1506.06204.
- Zhang, Y.; Chu, J.; Leng, L.; Miao, J. Mask-Refined R-CNN: A Network for Refining Object Details in Instance Segmentation. Sensors 2020, 20, 1010.
- Lin, T.; Goyal, P.; Girshick, R.B.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. arXiv 2017, arXiv:1708.02002.
- Yu, J.G.; Li, Y.; Gao, C.; Gao, H.; Xia, G.S.; Yu, Z.L.; Li, Y. Exemplar-Based Recursive Instance Segmentation with Application to Plant Image Analysis. IEEE Trans. Image Process. 2020, 29, 389–404.
- Sun, Y.; Liao, S.; Gao, C.; Xie, C.; Yang, F.; Zhao, Y.; Sagata, A. Weakly Supervised Instance Segmentation Based on Two-Stage Transfer Learning. IEEE Access 2020, 8, 24135–24144.
- Liu, Y.; Wu, Y.H.; Wen, P.S.; Shi, Y.J.; Qiu, Y.; Cheng, M.M. Leveraging Instance-, Image- and Dataset-Level Information for Weakly Supervised Instance Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2020.
- Liu, R.; Lehman, J.; Molino, P.; Such, F.P.; Frank, E.; Sergeev, A.; Yosinski, J. An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution. arXiv 2018, arXiv:1807.03247.
- Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv 2015, arXiv:1502.03167.
- Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully Convolutional One-Stage Object Detection. arXiv 2019, arXiv:1904.01355.
- Lee, Y.; Park, J. CenterMask: Real-Time Anchor-Free Instance Segmentation. arXiv 2020, arXiv:1911.06667.
- Kong, T.; Sun, F.; Liu, H.; Jiang, Y.; Shi, J. FoveaBox: Beyond Anchor-based Object Detector. arXiv 2019, arXiv:1904.03797.
- Xiang, C.; Tian, S.; Zou, W.; Xu, C. SAIS: Single-stage Anchor-free Instance Segmentation. arXiv 2019, arXiv:1912.01176.
- Yang, H.; Deng, R.; Lu, Y.; Zhu, Z.; Chen, Y.; Roland, J.T.; Lu, L.; Landman, B.A.; Fogo, A.B.; Huo, Y. CircleNet: Anchor-free Detection with Circle Representation. arXiv 2020, arXiv:2006.02474.
- Milletari, F.; Navab, N.; Ahmadi, S. V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 565–571.
- Lin, T.; Maire, M.; Belongie, S.J.; Bourdev, L.D.; Girshick, R.B.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. arXiv 2014, arXiv:1405.0312.
- Cui, Y.; Lin, T.Y.; Kirillov, A.; Ronchi, M.R.; Girshick, R.; Dollár, P. COCO Detection Challenge (Segmentation Mask). 2009. Available online: (accessed on 30 September 2020).
- Wu, Y.; Kirillov, A.; Massa, F.; Lo, W.Y.; Girshick, R. Detectron2. 2019. Available online: (accessed on 13 January 2020).
- PyTorch. Available online: (accessed on 23 July 2020).
- NVIDIA. TITAN X Specifications. Available online: (accessed on 6 November 2020).
- Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
- ONNX Homepage. Available online: (accessed on 15 August 2020).
- NVIDIA TensorRT. Available online: (accessed on 23 July 2020).
- Macq S.A./N.V. Smart Mobility Solutions. Available online: (accessed on 15 July 2020).
- Wang, R.J.; Li, X.; Ao, S.; Ling, C.X. Pelee: A Real-Time Object Detection System on Mobile Devices. arXiv 2018, arXiv:1804.06882.
- Sandler, M.; Howard, A.G.; Zhu, M.; Zhmoginov, A.; Chen, L. Inverted Residuals and Linear Bottlenecks: Mobile Networks for Classification, Detection and Segmentation. arXiv 2018, arXiv:1801.04381.
Class | Number of Instances (Training) | Number of Instances (Validation) |
person | 257,253 | 10,777 |
bicycle | 7056 | 314 |
car | 43,533 | 1918 |
motorcycle | 8654 | 367 |
bus | 6061 | 283 |
train | 4570 | 190 |
truck | 9970 | 414 |
traffic light | 12,842 | 634 |
stop sign | 1983 | 75 |
TOTAL | 352,922 | 14,972 |
Class | SOLACT AP | SOLACT AP50 | SOLACT AP75 | SOLACT APs | SOLACT APm | SOLACT APl | YOLACT [5] AP | YOLACT [5] AP50 | YOLACT [5] AP75 | YOLACT [5] APs | YOLACT [5] APm | YOLACT [5] APl |
person | 32 | 51.1 | 34.7 | 9.2 | 37.8 | 60.5 | 22.1 | 37.5 | 23.5 | 5.2 | 22.1 | 48.6 |
bicycle | 9.9 | 21.1 | 6.8 | 2.2 | 11.5 | 25.6 | 9.7 | 25 | 5.1 | 1.8 | 10.9 | 26 |
car | 19.8 | 29.6 | 22.3 | 8.2 | 35.3 | 41 | 23 | 41 | 22.6 | 12.5 | 36.8 | 44.5 |
motorcycle | 22.8 | 40.1 | 22.2 | 2.7 | 16.4 | 42.7 | 21.6 | 43.3 | 19.5 | 2.5 | 14.7 | 42.2 |
bus | 56.1 | 66 | 62 | 1.8 | 36.9 | 74.2 | 51.4 | 62.7 | 56.8 | 4.8 | 29.4 | 69.7 |
train | 57 | 73.5 | 66.1 | 12.6 | 26.2 | 65.4 | 55.4 | 72.4 | 65.3 | 11.6 | 29.3 | 63 |
truck | 15.5 | 21 | 17.9 | 2 | 12.9 | 28.6 | 22 | 33.3 | 24.8 | 3.4 | 17.9 | 41.5 |
traffic light | 14.2 | 26.5 | 14.3 | 8.4 | 31.6 | 38.6 | 9.5 | 18 | 8.7 | 4.9 | 21.8 | 38.4 |
stop sign | 56.8 | 63.1 | 62 | 16.3 | 58.3 | 80.7 | 58.3 | 66.5 | 65 | 18.7 | 60.9 | 81.6 |
AVERAGE | 31.57 | 43.56 | 34.26 | 7.04 | 29.66 | 50.81 | 30.33 | 44.41 | 32.37 | 7.27 | 27.09 | 50.61 |
Method | Validation AP (↑) | Validation AP50 (↑) | Validation AP75 (↑) | Validation APs (↑) | Validation APm (↑) | Validation APl (↑) | Test AP (↑) | Speed on Tegra TX2 (FPS) (↑) | Speed on AGX Xavier (FPS) (↑) |
SOLACT | 30.63 | 42.03 | 33.70 | 7.15 | 28.46 | 53.08 | 31.57 | 6.66 | 33.15 |
Lightweight segmentation head | 26.14 | 35.63 | 28.29 | 2.90 | 23.80 | 45.87 | 27.50 | 8.32 | 40.04 |
Channel pruning | 25.39 | 34.53 | 27.55 | 3.46 | 23.93 | 44.13 | 27.64 | 12.49 | 58.86 |
Lightweight SOLACT | 26.91 | 37.22 | 29.23 | 3.48 | 24.93 | 46.54 | 28.20 | 9.04 | 43.54 |
Reduced patch size | 25.18 | 35.69 | 26.98 | 2.77 | 23.58 | 45.53 | 26.90 | 13.72 | 64.58 |
PeleeNet-based backbone | 24.20 | 34.97 | 26.17 | 2.16 | 19.13 | 46.07 | 25.14 | 15.19 | 59.76 |
MobileNetV2-based backbone | 20.82 | 30.07 | 22.51 | 0.79 | 18.16 | 40.88 | 22.08 | 12.69 | 66.25 |
[Table of per-class example images; the images are not reproduced here. Columns: Person, Bike, Car, Motorcycle, Bus, Train, Truck, Traffic Light, Stop Sign. Rows: SOLACT, Lightweight segmentation head, Channel pruning, Lightweight SOLACT, Reduced patch size, PeleeNet-based backbone, MobileNetV2-based backbone.]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (
Panero Martinez, R.; Schiopu, I.; Cornelis, B.; Munteanu, A. Real-Time Instance Segmentation of Traffic Videos for Embedded Devices. Sensors 2021, 21, 275.