research-article

End-to-End Video Object Detection with Spatial-Temporal Transformers

Authors:

Guangliang Cheng,

Liqing Zhang

MM '21: Proceedings of the 29th ACM International Conference on Multimedia

Pages 1507 - 1516

https://doi.org/10.1145/3474085.3475285

Published: 17 October 2021

Abstract

Recently, DETR and Deformable DETR have been proposed to eliminate the need for many hand-designed components in object detection while achieving performance on par with earlier, more complex hand-crafted detectors. However, their performance on Video Object Detection (VOD) has not been well explored. In this paper, we present TransVOD, an end-to-end video object detection model based on a spatial-temporal Transformer architecture. The goal of this paper is to streamline the VOD pipeline, removing the need for many hand-crafted feature-aggregation components such as optical flow, recurrent neural networks, and relation networks. Moreover, thanks to the object query design in DETR, our method does not need complicated post-processing such as Seq-NMS or tubelet rescoring, which keeps the pipeline simple and clean. In particular, we present a temporal Transformer that aggregates both the spatial object queries and the feature memories of each frame. It consists of three components: a Temporal Deformable Transformer Encoder (TDTE) to encode the spatial details of multiple frames, a Temporal Query Encoder (TQE) to fuse object queries, and a Temporal Deformable Transformer Decoder (TDTD) to obtain the detection results for the current frame. These designs improve the strong Deformable DETR baseline by a significant margin (3%-4% mAP) on the ImageNet VID dataset. TransVOD yields comparable performance on the ImageNet VID benchmark. We hope TransVOD can provide a new perspective for video object detection.
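To make the idea of fusing object queries across frames more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a single attention head, no learned projections, and plain scaled dot-product attention in place of the deformable attention used in the paper; the function names `softmax`, `dot`, and `fuse_queries` are hypothetical. Each current-frame query attends over the object queries of the reference frames, and the resulting context is added back as a residual.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_queries(current_queries, reference_queries):
    """Fuse current-frame object queries with reference-frame queries.

    Assumption: identity query/key/value projections and one head.
    Each current query is updated as q + sum_j w_j * r_j, where w is the
    softmax of the scaled dot products between q and every reference query.
    """
    dim = len(current_queries[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for q in current_queries:
        weights = softmax([dot(q, r) * scale for r in reference_queries])
        context = [
            sum(w * r[i] for w, r in zip(weights, reference_queries))
            for i in range(dim)
        ]
        # Residual connection keeps the original query information.
        fused.append([qi + ci for qi, ci in zip(q, context)])
    return fused


if __name__ == "__main__":
    current = [[1.0, 0.0], [0.0, 1.0]]               # queries of the current frame
    reference = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # queries pooled from other frames
    for row in fuse_queries(current, reference):
        print([round(v, 3) for v in row])
```

A trained model would add learned projections, multiple heads, normalization, and feed-forward layers around this step; the sketch only shows how the attention weights mix temporal context into each query.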

Cited By

Amin M, Hu Y, Hu J (2024). Analyzing temporal coherence for deepfake video detection. Electronic Research Archive 32:4 (2621-2641). Online publication date: 2024.
https://doi.org/10.3934/era.2024119
van Leeuwen M, Fokkinga E, Huizinga W, Baan J, Heslinga F (2024). Toward Versatile Small Object Detection with Temporal-YOLOv8. Sensors 24:22 (7387). Online publication date: 20-Nov-2024.
https://doi.org/10.3390/s24227387
Sun J, Wei M, Wang J, Zhu M, Lin H, Nie H, Deng X (2024). CenterADNet: Infrared Video Target Detection Based on Central Point Regression. Sensors 24:6 (1778). Online publication date: 9-Mar-2024.
https://doi.org/10.3390/s24061778

Index Terms

End-to-End Video Object Detection with Spatial-Temporal Transformers
1. Computer systems organization
  1. Embedded and cyber-physical systems
    1. Embedded systems

Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia

October 2021

5796 pages

ISBN:9781450386517

DOI:10.1145/3474085

General Chairs:
Heng Tao Shen, University of Electronic Science and Technology of China, China
Yueting Zhuang, Zhejiang University, China
John R. Smith, IBM, USA

Program Chairs:
Yang Yang, University of Electronic Science and Technology of China, China
Pablo Cesar, CWI & TU Delft, The Netherlands
Florian Metze, Facebook, Inc., USA
Balakrishnan Prabhakaran, University of Texas at Dallas, USA

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

National Natural Science Foundation of China
Shanghai Municipal Science and Technology Major Project
National Key Research and Development Program of China
Zhejiang Lab
National Key R&D Program of China
Shanghai Science and Technology RD Program of China

Conference

MM '21

Sponsor:

SIGMM

MM '21: ACM Multimedia Conference

October 20 - 24, 2021

Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Bibliometrics

Article Metrics

Total Citations: 67
Total Downloads: 1,133

Downloads (Last 12 months): 312
Downloads (Last 6 weeks): 51

Reflects downloads up to 23 Dec 2024

Citations

Cited By

Cui Y, Han C, Liu D (2024). Stepwise Spatial Global-local Aggregation Networks for Autonomous Driving. ACM Journal on Autonomous Transportation Systems. Online publication date: 20-Jun-2024.
https://doi.org/10.1145/3674117
Long S, Zhou Q, Li X, Lu X, Ying C, Luo Y, Ma L, Yan S, Cai J, Kankanhalli M, Prabhakaran B, Boll S, Subramanian R, Zheng L, Singh V, Cesar P, Xie L, Xu D (2024). DGMamba: Domain Generalization via Generalized State Space Model. Proceedings of the 32nd ACM International Conference on Multimedia (3607-3616). Online publication date: 28-Oct-2024.
https://dl.acm.org/doi/10.1145/3664647.3681247
Mahmud T, Liu C, Yaman B, Marculescu D (2024). SSVOD: Semi-Supervised Video Object Detection with Sparse Annotations. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (6759-6768). Online publication date: 3-Jan-2024.
https://doi.org/10.1109/WACV57701.2024.00663
Liu N, Nan K, Zhao W, Yao X, Han J (2024). Learning Complementary Spatial–Temporal Transformer for Video Salient Object Detection. IEEE Transactions on Neural Networks and Learning Systems 35:8 (10663-10673). Online publication date: Aug-2024.
https://doi.org/10.1109/TNNLS.2023.3243246
Fang X, Liu D, Zhou P, Xu Z, Li R (2024). Hierarchical Local-Global Transformer for Temporal Sentence Grounding. IEEE Transactions on Multimedia 26 (3263-3277). Online publication date: 1-Jan-2024.
https://dl.acm.org/doi/10.1109/TMM.2023.3309551
Qi Q, Yan Y, Wang H (2024). Class-Aware Dual-Supervised Aggregation Network for Video Object Detection. IEEE Transactions on Multimedia 26 (2109-2123). Online publication date: 1-Jan-2024.
https://dl.acm.org/doi/10.1109/TMM.2023.3292615
Luo X, Li Z, Xu C, Zhang B, Zhang L, Zhu J, Huang P, Wang X, Yang M, Chang S (2024). Semi-Supervised Thyroid Nodule Detection in Ultrasound Videos. IEEE Transactions on Medical Imaging 43:5 (1792-1803). Online publication date: May-2024.
https://doi.org/10.1109/TMI.2023.3348949