DOI: 10.1145/3503161.3547841

Hierarchical Hourglass Convolutional Network for Efficient Video Classification

Published: 10 October 2022

Abstract

Videos naturally contain dynamic variation over the temporal axis, which results in the same visual cues (e.g., semantics, objects) changing their scale, position, and perspective patterns between adjacent frames. A primary trend in video CNNs is to adopt spatial 2D convolution for spatial semantics and temporal 1D convolution for temporal dynamics. Though this direction achieves a favorable balance between efficiency and efficacy, it suffers from misalignment of visual cues with large displacements. In particular, a rigid temporal convolution fails to capture the correct motion when a target moves out of the receptive field of the temporal convolution between adjacent frames.
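To make the failure mode concrete, below is a minimal PyTorch sketch (ours, not from the paper) of the factorized spatial-2D + temporal-1D block this paragraph refers to. The class name and kernel sizes are illustrative assumptions; the point is that the 3x1x1 temporal convolution only mixes features at the same spatial position across frames, so a target with a large inter-frame displacement escapes its receptive field.

import torch
import torch.nn as nn

class Factorized2Plus1D(nn.Module):
    """Illustrative factorized video block: 2D spatial conv + 1D temporal conv.
    Input is a video tensor of shape (N, C, T, H, W)."""

    def __init__(self, channels: int):
        super().__init__()
        # Spatial semantics: a 1x3x3 convolution applied frame by frame.
        self.spatial = nn.Conv3d(channels, channels,
                                 kernel_size=(1, 3, 3), padding=(0, 1, 1))
        # Temporal dynamics: a 3x1x1 convolution. It only mixes features at
        # the SAME spatial position across adjacent frames, which is why a
        # fast-moving target can escape its receptive field.
        self.temporal = nn.Conv3d(channels, channels,
                                  kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.temporal(self.spatial(x))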
To tackle large visual displacements between temporal neighbors, we propose a new temporal convolution named Hourglass Convolution (HgC). The temporal receptive field of HgC has an hourglass shape, where the spatial receptive field is enlarged in the prior and post frames, enabling the capture of large displacements. Moreover, since videos contain long- and short-term movements viewed at multiple temporal interval levels, we organize the HgC network hierarchically to capture temporal dynamics at both the frame (short-term) and clip (long-term) levels. For efficiency, we also adopt strategies such as low resolution for short-term modeling and channel reduction for long-term modeling. With HgC, our H²CN equips off-the-shelf CNNs with a strong ability to capture spatio-temporal dynamics at negligible computation overhead. We validate the efficiency and efficacy of HgC on standard action recognition benchmarks, including Something-Something V1&V2, Diving48, and EGTEA Gaze+. We also analyze the complementarity of frame-level and clip-level motion with visualizations. The code and models will be available at https://github.com/ty-97/H2CN.
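As a rough illustration of the hourglass shape, here is a minimal, hypothetical PyTorch sketch based only on the abstract: the spatial receptive field is widened on the prior and post frames and kept narrow on the center frame. The class name, the depthwise design, and the 5x5 neighbor kernel are our assumptions, not the authors' implementation; see the repository above for the official code.

import torch
import torch.nn as nn

class HourglassConv(nn.Module):
    """Hypothetical hourglass-shaped temporal convolution: wide spatial
    receptive field on the prior/post frames, narrow 1x1 'waist' on the
    center frame."""

    def __init__(self, channels: int, neighbor_kernel: int = 5):
        super().__init__()
        p = neighbor_kernel // 2
        # Depthwise spatial convolutions on the neighboring frames keep the
        # enlarged receptive field cheap (matching the efficiency concern).
        self.prev_conv = nn.Conv2d(channels, channels, neighbor_kernel,
                                   padding=p, groups=channels)
        self.next_conv = nn.Conv2d(channels, channels, neighbor_kernel,
                                   padding=p, groups=channels)
        self.center_conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, T, H, W). Build t-1 / t+1 neighbors by shifting along
        # time, zero-padded at the clip boundaries.
        n, c, t, h, w = x.shape
        prev = torch.cat([torch.zeros_like(x[:, :, :1]), x[:, :, :-1]], dim=2)
        nxt = torch.cat([x[:, :, 1:], torch.zeros_like(x[:, :, -1:])], dim=2)

        def per_frame(conv, v):
            # Fold time into the batch axis to run a 2D conv on each frame.
            v = v.transpose(1, 2).reshape(n * t, c, h, w)
            return conv(v).reshape(n, t, c, h, w).transpose(1, 2)

        # Sum the wide prior/post responses with the narrow center response.
        return (per_frame(self.prev_conv, prev)
                + per_frame(self.center_conv, x)
                + per_frame(self.next_conv, nxt))

Used residually, e.g. y = x + HourglassConv(c)(x), such a module would add temporal modeling to a 2D backbone at small overhead, which is consistent with the plug-in usage the abstract describes.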

Supplementary Material

MP4 File (MM22-fp0407.mp4)
Presentation video of MM22 paper "Hierarchical Hourglass Convolutional Network for Efficient Video Classification"


Cited By

  • (2024) Endoscopic video aided identification method for gastric area. Journal of King Saud University - Computer and Information Sciences 36(9), Article 102208. DOI: 10.1016/j.jksuci.2024.102208. Online publication date: Nov-2024
  • (2024) PosMLP-Video: Spatial and Temporal Relative Position Encoding for Efficient Video Recognition. International Journal of Computer Vision 132(12), 5820-5840. DOI: 10.1007/s11263-024-02154-z. Online publication date: 27-Jun-2024
  • (2023) A Fuzzy Error Based Fine-Tune Method for Spatio-Temporal Recognition Model. Pattern Recognition and Computer Vision, 97-108. DOI: 10.1007/978-981-99-8429-9_8. Online publication date: 24-Dec-2023


Published In

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022, 7537 pages
ISBN: 9781450392037
DOI: 10.1145/3503161

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 10 October 2022

    Author Tags

    1. attention
    2. convolution
    3. neural network
    4. video classification

    Qualifiers

    • Research-article

    Conference

    MM '22

    Acceptance Rates

Overall Acceptance Rate: 2,145 of 8,556 submissions (25%)
