research-article

Video Semantic Segmentation via Sparse Temporal Transformer

Authors:

Liqing Zhang

MM '21: Proceedings of the 29th ACM International Conference on Multimedia

Pages 59 - 68

https://doi.org/10.1145/3474085.3475409

Published: 17 October 2021

Abstract

Video semantic segmentation currently faces two main challenges: 1) the demand for temporal consistency; and 2) the balance between segmentation accuracy and inference efficiency. For the first challenge, existing methods usually use optical flow to capture the temporal relations between consecutive frames and maintain temporal consistency, but the low inference speed of optical flow limits real-time applications. For the second challenge, flow-based key frame warping is one mainstream solution. However, its unbalanced inference latency makes it unsatisfactory for real-time applications. Considering both segmentation accuracy and inference efficiency, we propose a novel Sparse Temporal Transformer (STT) that adaptively bridges temporal relations among video frames and is equipped with query selection and key selection. The key selection and query selection strategies are applied separately to filter out temporal and spatial redundancy in our temporal transformer. Specifically, our STT reduces the time complexity of the temporal transformer by a large margin without harming segmentation accuracy or temporal consistency. Experiments on two benchmark datasets, Cityscapes and CamVid, demonstrate that our method achieves state-of-the-art segmentation accuracy and temporal consistency with comparable inference speed.
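To make the complexity argument concrete, the following is a minimal sketch of attention restricted to a subset of queries and a subset of keys. It is not the STT implementation: the abstract does not state how queries or keys are chosen, so the sketch assumes a caller-supplied list of query positions (`query_idx`) and a fixed local window of keys (`radius`) over a 1D layout of positions. The function name and both parameters are hypothetical.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sparse_temporal_attention(cur, prev, query_idx, radius):
    """Attend from selected positions of the current frame to nearby
    positions of a previous frame.

    cur, prev : lists of feature vectors, one per position.
    query_idx : positions to refine (ASSUMED query selection rule).
    radius    : half-width of the key window (ASSUMED key selection rule).
    Returns the refined features and the number of dot products computed.
    """
    dim = len(cur[0])
    scale = 1.0 / math.sqrt(dim)
    out = [list(v) for v in cur]  # unselected positions pass through unchanged
    n_dots = 0
    for i in query_idx:
        lo, hi = max(0, i - radius), min(len(prev), i + radius + 1)
        keys = range(lo, hi)
        scores = [dot(cur[i], prev[j]) * scale for j in keys]
        n_dots += len(scores)
        weights = softmax(scores)
        # Replace the query feature with the attention-weighted mix of keys.
        out[i] = [sum(w * prev[j][d] for w, j in zip(weights, keys))
                  for d in range(dim)]
    return out, n_dots


cur = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
prev = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
out, n_dots = sparse_temporal_attention(cur, prev, query_idx=[1, 2], radius=1)
dense_dots = len(cur) * len(prev)  # cost of attending from every query to every key
print(n_dots, dense_dots)  # prints "6 16"
```

In this toy setting, dense attention would need one dot product per query-key pair, while the restricted version only scores the selected queries against their windowed keys. The saving grows with the number of positions and frames; how much accuracy is retained depends entirely on the selection rules, which are the paper's contribution and are not reproduced here.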



Index Terms

Computing methodologies → Artificial intelligence → Computer vision → Computer vision problems → Video segmentation


Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021, 5796 pages
ISBN: 9781450386517
DOI: 10.1145/3474085

General Chairs: Heng Tao Shen (University of Electronic Science and Technology of China, China), Yueting Zhuang (Zhejiang University, China), John R. Smith (IBM, USA)

Program Chairs: Yang Yang (University of Electronic Science and Technology of China, China), Pablo Cesar (CWI & TU Delft, The Netherlands), Florian Metze (FACEBOOK, Inc., USA), Balakrishnan Prabhakaran (University of Texas at Dallas, USA)

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States


Funding Sources

SenseTime Collaborative Research Grant
Shanghai Science and Technology R&D Program of China
Shanghai Municipal Science and Technology Major Project
National Natural Science Foundation of China
National Key R&D Program of China

Conference

MM '21: ACM Multimedia Conference
Sponsor: SIGMM
October 20 - 24, 2021
Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 995 of 4,171 submissions, 24%



Cited By

Baghbaderani R, Li Y, Wang S, Qi H. (2024). Temporally-Consistent Video Semantic Segmentation with Bidirectional Occlusion-guided Feature Propagation. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 674-684. Online publication date: 3-Jan-2024.
https://doi.org/10.1109/WACV57701.2024.00074

Sun G, Liu Y, Ding H, Wu M, Van Gool L. (2024). Learning Local and Global Temporal Contexts for Video Semantic Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46:10, 6919-6934. Online publication date: Oct-2024.
https://doi.org/10.1109/TPAMI.2024.3387326

Li C, Li Y, Liu R, Wang G, Lv J, Jin Y, Si W, Heng P. (2024). Structural and Pixel Relation Modeling for Semisupervised Instrument Segmentation From Surgical Videos. IEEE Transactions on Instrumentation and Measurement, 73, 1-14. Online publication date: 2024.
https://doi.org/10.1109/TIM.2023.3342222

Liu X, Li J, Shi J, Fan X, Tian Y, Zhao D. (2024). Event-Based Monocular Depth Estimation With Recurrent Transformers. IEEE Transactions on Circuits and Systems for Video Technology, 34:8, 7417-7429. Online publication date: Aug-2024.
https://doi.org/10.1109/TCSVT.2024.3378742

Guo D, Fan D, Lu T, Sakaridis C, Van Gool L. (2024). Vanishing-Point-Guided Video Semantic Segmentation of Driving Scenes. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3544-3553. Online publication date: 16-Jun-2024.
https://doi.org/10.1109/CVPR52733.2024.00340

Sun L, Liu Y, Sun G, Wu M, Xu Z, Wang K, Gool L. (2024). Global and Compact Video Context Embedding for Video Semantic Segmentation. IEEE Access, 12, 135589-135600. Online publication date: 2024.
https://doi.org/10.1109/ACCESS.2024.3409150

Mathai M, Liu Y, Ling N. (2024). A Hybrid Transformer-LSTM Model With 3D Separable Convolution for Video Prediction. IEEE Access, 12, 39589-39602. Online publication date: 2024.
https://doi.org/10.1109/ACCESS.2024.3375365

Luo C, Feng S, Quan Y, Ye Y, Xu Y, Li X, Zhang B. (2024). AMANet: An Adaptive Memory Attention Network for video cloud detection. Pattern Recognition, 155, 110616. Online publication date: Nov-2024.
https://doi.org/10.1016/j.patcog.2024.110616

Zhang Y, Zhang Z, Liao M, Tian S, You R, Zou W, Xu C. (2024). Video Generalized Semantic Segmentation via Non-Salient Feature Reasoning and Consistency. Knowledge-Based Systems, 292, 111584. Online publication date: May-2024.
https://doi.org/10.1016/j.knosys.2024.111584

Yuan H, Peng J, Cai Z. (2024). TDSNet: A temporal difference based network for video semantic segmentation. Information Sciences, 121335. Online publication date: Aug-2024.
https://doi.org/10.1016/j.ins.2024.121335
