DOI: 10.1145/3474085.3475409
research-article

Video Semantic Segmentation via Sparse Temporal Transformer

Published: 17 October 2021

Abstract

Video semantic segmentation currently faces two main challenges: 1) the demand for temporal consistency; 2) the balance between segmentation accuracy and inference efficiency. For the first challenge, existing methods usually use optical flow to capture temporal relations between consecutive frames and maintain temporal consistency, but the low inference speed of optical flow limits real-time applications. For the second challenge, flow-based key frame warping is one mainstream solution. However, its unbalanced inference latency makes it unsatisfactory for real-time applications. Considering both segmentation accuracy and inference efficiency, we propose a novel Sparse Temporal Transformer (STT) that adaptively bridges temporal relations among video frames and is equipped with query selection and key selection. The key selection and query selection strategies are applied separately to filter out temporal and spatial redundancy in our temporal transformer. Specifically, our STT can reduce the time complexity of the temporal transformer by a large margin without harming segmentation accuracy or temporal consistency. Experiments on two benchmark datasets, Cityscapes and CamVid, demonstrate that our method achieves state-of-the-art segmentation accuracy and temporal consistency with comparable inference speed.
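
To make the idea of sparsifying a temporal transformer concrete, the following is a minimal illustrative sketch, not the authors' implementation. The abstract does not say how queries or keys are chosen, so everything here is an assumption: the function name `sparse_temporal_attention`, the caller-supplied `query_idx` (standing in for query selection), and the fixed local `window` in the previous frame (standing in for key selection) are all hypothetical. The sketch only shows why restricting both sets cuts the number of attention scores from N x M to roughly |selected queries| x |selected keys|.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sparse_temporal_attention(cur, prev, query_idx, window):
    """Cross-frame attention restricted to selected queries and keys.

    cur, prev: lists of feature vectors, one per position, for the
               current and the previous frame (1-D layout for brevity).
    query_idx: positions of `cur` to refine (hypothetical query selection).
    window:    half-width of the neighborhood in `prev` that each selected
               query may attend to (hypothetical key selection).
    Returns the refined features and the number of scores computed.
    """
    dim = len(cur[0])
    scale = 1.0 / math.sqrt(dim)
    out = [list(v) for v in cur]  # unselected positions pass through unchanged
    n_scores = 0
    for i in query_idx:
        lo, hi = max(0, i - window), min(len(prev), i + window + 1)
        keys = prev[lo:hi]  # keys double as values in this sketch
        weights = softmax([dot(cur[i], k) * scale for k in keys])
        n_scores += len(keys)
        agg = [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]
        out[i] = [c + a for c, a in zip(cur[i], agg)]  # residual connection
    return out, n_scores

# Toy frames: 8 positions, 2-dimensional features.
cur = [[float(i), 1.0] for i in range(8)]
prev = [[float(i), 0.5] for i in range(8)]
out, n_sparse = sparse_temporal_attention(cur, prev, query_idx=[2, 5], window=1)
n_dense = len(cur) * len(prev)  # scores a dense cross-frame attention would need
print(n_sparse, n_dense)  # 6 64
```

In this toy setting, two selected queries with three candidate keys each need 6 scores instead of the 64 required by dense attention; a learned or content-dependent selection rule would replace the fixed indices and window used here.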

Supplementary Material

ZIP File (mfp1394aux.zip)

    Published In

    MM '21: Proceedings of the 29th ACM International Conference on Multimedia
    October 2021
    5796 pages
    ISBN: 9781450386517
    DOI: 10.1145/3474085
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. semantic segmentation
    2. semi-supervised learning
    3. temporal consistency
    4. transformer
    5. video semantic segmentation

    Qualifiers

    • Research-article

    Funding Sources

    • SenseTime Collaborative Research Grant
    • Shanghai Science and Technology RD Program of China
    • Shanghai Municipal Science and Technology Major Project
    • National Natural Science Foundation of China
    • National Key R&D Program of China

    Conference

    MM '21: ACM Multimedia Conference
    October 20 - 24, 2021
    Virtual Event, China

    Acceptance Rates

    Overall Acceptance Rate 995 of 4,171 submissions, 24%

    Article Metrics

    • Downloads (last 12 months): 139
    • Downloads (last 6 weeks): 5
    Reflects downloads up to 03 Oct 2024

    Cited By

    • (2024) Temporally-Consistent Video Semantic Segmentation with Bidirectional Occlusion-guided Feature Propagation. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 674-684. DOI: 10.1109/WACV57701.2024.00074. Online publication date: 3-Jan-2024.
    • (2024) Learning Local and Global Temporal Contexts for Video Semantic Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 46:10, 6919-6934. DOI: 10.1109/TPAMI.2024.3387326. Online publication date: Oct-2024.
    • (2024) Structural and Pixel Relation Modeling for Semisupervised Instrument Segmentation From Surgical Videos. IEEE Transactions on Instrumentation and Measurement 73, 1-14. DOI: 10.1109/TIM.2023.3342222. Online publication date: 2024.
    • (2024) Event-Based Monocular Depth Estimation With Recurrent Transformers. IEEE Transactions on Circuits and Systems for Video Technology 34:8, 7417-7429. DOI: 10.1109/TCSVT.2024.3378742. Online publication date: Aug-2024.
    • (2024) Vanishing-Point-Guided Video Semantic Segmentation of Driving Scenes. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3544-3553. DOI: 10.1109/CVPR52733.2024.00340. Online publication date: 16-Jun-2024.
    • (2024) Global and Compact Video Context Embedding for Video Semantic Segmentation. IEEE Access 12, 135589-135600. DOI: 10.1109/ACCESS.2024.3409150. Online publication date: 2024.
    • (2024) A Hybrid Transformer-LSTM Model With 3D Separable Convolution for Video Prediction. IEEE Access 12, 39589-39602. DOI: 10.1109/ACCESS.2024.3375365. Online publication date: 2024.
    • (2024) AMANet: An Adaptive Memory Attention Network for video cloud detection. Pattern Recognition 155, 110616. DOI: 10.1016/j.patcog.2024.110616. Online publication date: Nov-2024.
    • (2024) Video Generalized Semantic Segmentation via Non-Salient Feature Reasoning and Consistency. Knowledge-Based Systems 292, 111584. DOI: 10.1016/j.knosys.2024.111584. Online publication date: May-2024.
    • (2024) TDSNet: A temporal difference based network for video semantic segmentation. Information Sciences, 121335. DOI: 10.1016/j.ins.2024.121335. Online publication date: Aug-2024.
