DOI: 10.1145/3503161.3547841

Hierarchical Hourglass Convolutional Network for Efficient Video Classification

Published: 10 October 2022

Abstract

Videos naturally contain dynamic variation over the temporal axis, which results in the same visual cues (e.g., semantics, objects) changing their scale, position, and perspective patterns between adjacent frames. A primary trend in video CNNs is to adopt spatial 2D convolution for spatial semantics and temporal 1D convolution for temporal dynamics. Though this direction achieves a favorable balance between efficiency and efficacy, it suffers from misalignment of visual cues with large displacements. In particular, a rigid temporal convolution fails to capture the correct motion when a target moves out of the receptive field of the temporal convolution between adjacent frames.
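To make the failure mode concrete, below is a minimal PyTorch sketch (ours, not from the paper) of the factorized spatial-2D + temporal-1D block this paragraph refers to. The class name and kernel sizes are illustrative assumptions; the point is that the 3x1x1 temporal convolution only mixes features at the same spatial position across frames, so a target with a large inter-frame displacement escapes its receptive field.

import torch
import torch.nn as nn

class Factorized2Plus1D(nn.Module):
    """Illustrative factorized video block: 2D spatial conv + 1D temporal conv.
    Input is a video tensor of shape (N, C, T, H, W)."""

    def __init__(self, channels: int):
        super().__init__()
        # Spatial semantics: a 1x3x3 convolution applied frame by frame.
        self.spatial = nn.Conv3d(channels, channels,
                                 kernel_size=(1, 3, 3), padding=(0, 1, 1))
        # Temporal dynamics: a 3x1x1 convolution. It only mixes features at
        # the SAME spatial position across adjacent frames, which is why a
        # fast-moving target can escape its receptive field.
        self.temporal = nn.Conv3d(channels, channels,
                                  kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.temporal(self.spatial(x))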
To tackle large visual displacements between temporal neighbors, we propose a new temporal convolution named Hourglass Convolution (HgC). The temporal receptive field of HgC has an hourglass shape, where the spatial receptive field is enlarged in the prior and post frames, enabling the capture of large displacements. Moreover, since videos contain long- and short-term movements viewed at multiple temporal interval levels, we organize the HgC network hierarchically to capture temporal dynamics at both the frame (short-term) and clip (long-term) levels. For efficiency, we also adopt strategies such as low resolution for short-term modeling and channel reduction for long-term modeling. With HgC, our H²CN equips off-the-shelf CNNs with a strong ability to capture spatio-temporal dynamics at negligible computation overhead. We validate the efficiency and efficacy of HgC on standard action recognition benchmarks, including Something-Something V1&V2, Diving48, and EGTEA Gaze+. We also analyze the complementarity of frame-level and clip-level motion with visualizations. The code and models will be available at https://github.com/ty-97/H2CN.
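As a rough illustration of the hourglass shape, here is a minimal, hypothetical PyTorch sketch based only on the abstract: the spatial receptive field is widened on the prior and post frames and kept narrow on the center frame. The class name, the depthwise design, and the 5x5 neighbor kernel are our assumptions, not the authors' implementation; see the repository above for the official code.

import torch
import torch.nn as nn

class HourglassConv(nn.Module):
    """Hypothetical hourglass-shaped temporal convolution: wide spatial
    receptive field on the prior/post frames, narrow 1x1 'waist' on the
    center frame."""

    def __init__(self, channels: int, neighbor_kernel: int = 5):
        super().__init__()
        p = neighbor_kernel // 2
        # Depthwise spatial convolutions on the neighboring frames keep the
        # enlarged receptive field cheap (matching the efficiency concern).
        self.prev_conv = nn.Conv2d(channels, channels, neighbor_kernel,
                                   padding=p, groups=channels)
        self.next_conv = nn.Conv2d(channels, channels, neighbor_kernel,
                                   padding=p, groups=channels)
        self.center_conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, T, H, W). Build t-1 / t+1 neighbors by shifting along
        # time, zero-padded at the clip boundaries.
        n, c, t, h, w = x.shape
        prev = torch.cat([torch.zeros_like(x[:, :, :1]), x[:, :, :-1]], dim=2)
        nxt = torch.cat([x[:, :, 1:], torch.zeros_like(x[:, :, -1:])], dim=2)

        def per_frame(conv, v):
            # Fold time into the batch axis to run a 2D conv on each frame.
            v = v.transpose(1, 2).reshape(n * t, c, h, w)
            return conv(v).reshape(n, t, c, h, w).transpose(1, 2)

        # Sum the wide prior/post responses with the narrow center response.
        return (per_frame(self.prev_conv, prev)
                + per_frame(self.center_conv, x)
                + per_frame(self.next_conv, nxt))

Used residually, e.g. y = x + HourglassConv(c)(x), such a module would add temporal modeling to a 2D backbone at small overhead, which is consistent with the plug-in usage the abstract describes.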

Supplementary Material

MP4 File (MM22-fp0407.mp4)
Presentation video of MM22 paper "Hierarchical Hourglass Convolutional Network for Efficient Video Classification"


Cited By

  • (2024) Endoscopic video aided identification method for gastric area. Journal of King Saud University - Computer and Information Sciences 36(9), Article 102208. DOI: 10.1016/j.jksuci.2024.102208. Online publication date: Nov-2024
  • (2024) PosMLP-Video: Spatial and Temporal Relative Position Encoding for Efficient Video Recognition. International Journal of Computer Vision 132(12), 5820-5840. DOI: 10.1007/s11263-024-02154-z. Online publication date: 27-Jun-2024
  • (2023) A Fuzzy Error Based Fine-Tune Method for Spatio-Temporal Recognition Model. Pattern Recognition and Computer Vision, 97-108. DOI: 10.1007/978-981-99-8429-9_8. Online publication date: 24-Dec-2023


Published In

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022, 7537 pages
ISBN: 9781450392037
DOI: 10.1145/3503161

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 10 October 2022

    Author Tags

    1. attention
    2. convolution
    3. neural network
    4. video classification

    Qualifiers

    • Research-article

    Conference

    MM '22

    Acceptance Rates

Overall Acceptance Rate: 2,145 of 8,556 submissions (25%)
