DOI: 10.1145/3474085.3475285
research-article

End-to-End Video Object Detection with Spatial-Temporal Transformers

Published: 17 October 2021 Publication History

Abstract

Recently, DETR and Deformable DETR have been proposed to eliminate the need for many hand-designed components in object detection while performing on par with earlier, more complex hand-crafted detectors. However, their performance on Video Object Detection (VOD) has not been well explored. In this paper, we present TransVOD, an end-to-end video object detection model based on a spatial-temporal Transformer architecture. The goal of this paper is to streamline the VOD pipeline, removing the need for many hand-crafted feature-aggregation components such as optical flow, recurrent neural networks, and relation networks. In addition, thanks to the object query design in DETR, our method does not need complicated post-processing methods such as Seq-NMS or tubelet rescoring, which keeps the pipeline simple and clean. In particular, we present a temporal Transformer that aggregates both the spatial object queries and the feature memories of each frame. Our temporal Transformer consists of three components: a Temporal Deformable Transformer Encoder (TDTE) to encode the spatial details of multiple frames, a Temporal Query Encoder (TQE) to fuse object queries, and a Temporal Deformable Transformer Decoder (TDTD) to obtain the detection results for the current frame. These designs improve the strong Deformable DETR baseline by a significant margin (3%-4% mAP) on the ImageNet VID dataset. TransVOD yields comparable performance on the ImageNet VID benchmark. We hope TransVOD can provide a new perspective for video object detection.
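
The abstract names a Temporal Query Encoder that fuses object queries across frames but does not spell out its internals. The sketch below is only an illustration of the general idea, under loudly stated assumptions: it uses plain single-head scaled dot-product cross-attention with a residual connection and no learned projections, which may differ from the paper's actual design. The function names `softmax`, `fuse_queries`, and `make_frame`, and all shapes, are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_queries(current, references):
    """Fuse current-frame object queries with reference-frame queries.

    current:    list of query vectors for the current frame
    references: list of frames, each a list of query vectors
    Returns one fused vector per current-frame query.

    ASSUMPTION: single-head scaled dot-product cross-attention plus a
    residual connection, with no learned weights. This is an illustrative
    stand-in, not the TQE described in the paper.
    """
    # Pool the queries of all reference frames into one key/value set.
    pooled = [q for frame in references for q in frame]
    dim = len(current[0])
    fused = []
    for query in current:
        # Similarity of this query to every pooled reference query.
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                  for key in pooled]
        weights = softmax(scores)
        # Attention-weighted average of the reference queries.
        context = [sum(w * key[i] for w, key in zip(weights, pooled))
                   for i in range(dim)]
        # Residual connection keeps the current-frame information.
        fused.append([q + c for q, c in zip(query, context)])
    return fused


def make_frame(num_queries, dim):
    """Random stand-in for the object queries of one frame."""
    return [[random.gauss(0, 1) for _ in range(dim)]
            for _ in range(num_queries)]


random.seed(0)
current_queries = make_frame(4, 8)
reference_queries = [make_frame(4, 8) for _ in range(3)]
fused_queries = fuse_queries(current_queries, reference_queries)
print(len(fused_queries), len(fused_queries[0]))  # 4 8
```

The output keeps one vector per current-frame query, so a downstream decoder could consume it in place of the unfused queries. A real model would add learned query/key/value projections, multiple heads, and normalization.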


    Published In

    MM '21: Proceedings of the 29th ACM International Conference on Multimedia
    October 2021
    5796 pages
    ISBN:9781450386517
    DOI:10.1145/3474085
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. temporal object query
    2. transformers
    3. video object detection

    Qualifiers

    • Research-article

    Conference

    MM '21: ACM Multimedia Conference
    October 20 - 24, 2021
    Virtual Event, China

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

    Article Metrics

    • Downloads (Last 12 months): 312
    • Downloads (Last 6 weeks): 51
    Reflects downloads up to 23 Dec 2024

    Cited By

    • (2024) Analyzing temporal coherence for deepfake video detection. Electronic Research Archive 32:4 (2621-2641). DOI: 10.3934/era.2024119. Online publication date: 2024
    • (2024) Toward Versatile Small Object Detection with Temporal-YOLOv8. Sensors 24:22 (7387). DOI: 10.3390/s24227387. Online publication date: 20-Nov-2024
    • (2024) CenterADNet: Infrared Video Target Detection Based on Central Point Regression. Sensors 24:6 (1778). DOI: 10.3390/s24061778. Online publication date: 9-Mar-2024
    • (2024) Stepwise Spatial Global-local Aggregation Networks for Autonomous Driving. ACM Journal on Autonomous Transportation Systems. DOI: 10.1145/3674117. Online publication date: 20-Jun-2024
    • (2024) DGMamba: Domain Generalization via Generalized State Space Model. Proceedings of the 32nd ACM International Conference on Multimedia (3607-3616). DOI: 10.1145/3664647.3681247. Online publication date: 28-Oct-2024
    • (2024) SSVOD: Semi-Supervised Video Object Detection with Sparse Annotations. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (6759-6768). DOI: 10.1109/WACV57701.2024.00663. Online publication date: 3-Jan-2024
    • (2024) Learning Complementary Spatial–Temporal Transformer for Video Salient Object Detection. IEEE Transactions on Neural Networks and Learning Systems 35:8 (10663-10673). DOI: 10.1109/TNNLS.2023.3243246. Online publication date: Aug-2024
    • (2024) Hierarchical Local-Global Transformer for Temporal Sentence Grounding. IEEE Transactions on Multimedia 26 (3263-3277). DOI: 10.1109/TMM.2023.3309551. Online publication date: 1-Jan-2024
    • (2024) Class-Aware Dual-Supervised Aggregation Network for Video Object Detection. IEEE Transactions on Multimedia 26 (2109-2123). DOI: 10.1109/TMM.2023.3292615. Online publication date: 1-Jan-2024
    • (2024) Semi-Supervised Thyroid Nodule Detection in Ultrasound Videos. IEEE Transactions on Medical Imaging 43:5 (1792-1803). DOI: 10.1109/TMI.2023.3348949. Online publication date: May-2024
