SPViM: Sparse Pyramid Video Representation Learning Framework for Fine-Grained Action Retrieval

Wang, Lutong; Yang, Chenglei; Luan, Hongqiu; Gai, Wei; Geng, Wenxiu; Zheng, Yawen

doi:10.1007/978-981-97-5594-3_27

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 14866))

Included in the following conference series:

International Conference on Intelligent Computing

Abstract

Existing research has achieved remarkable success in video-based action understanding. However, current work mainly focuses on recognizing external actions at a coarse granularity and pays less attention to fine-grained action understanding, which impedes the precise localization and retrieval of internal content. To this end, we propose a Sparse Pyramid Video representation learning framework (SPViM) that aims to achieve frame-to-frame retrieval related to high-level action semantics. First, an appearance encoder constructs independent visual descriptors for each input frame, where a shifted-window mechanism captures the underlying inter-frame nuances. Second, a temporal encoder containing sparse self-attention and a multi-granularity local context awareness mechanism is constructed to comprehensively describe the action hierarchy. Here, inspired by the human brain's cognitive process when retrieving specific content, we design a set of sparse constraints that guide self-attention to converge gradually from globally sparse to locally dense, centered on the target frame. Furthermore, we develop a Transformer-based temporal pyramid structure to integrate multi-scale spatio-temporal features, thereby generating comprehensive and discriminative video frame representations. Extensive experiments show that our fine-grained video retrieval method with the SPViM architecture outperforms state-of-the-art methods on three challenging datasets.
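
The abstract does not spell out the sparse constraints, so the following is only an illustrative sketch of the general idea of attention that moves from globally sparse to locally dense around each target frame. The helper names (layer_mask, masked_self_attention), the (span, stride) schedule, and the toy two-dimensional frame features are all assumptions made for this example and are not taken from the paper.

```python
import math


def layer_mask(num_frames, span, stride):
    """Hypothetical mask: frame i may attend to frame j when j lies within
    `span` frames of i and their offset is a multiple of `stride`."""
    return [[abs(i - j) <= span and abs(i - j) % stride == 0
             for j in range(num_frames)]
            for i in range(num_frames)]


def masked_self_attention(frames, mask):
    """Single-head dot-product self-attention restricted by a boolean mask.
    Queries, keys and values are the raw frame features (no learned weights)."""
    dim = len(frames[0])
    outputs, all_weights = [], []
    for i, query in enumerate(frames):
        # Offset 0 always passes the mask, so `allowed` is never empty.
        allowed = [j for j, ok in enumerate(mask[i]) if ok]
        scores = [sum(q * k for q, k in zip(query, frames[j])) / math.sqrt(dim)
                  for j in allowed]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [0.0] * len(frames)
        for j, e in zip(allowed, exps):
            weights[j] = e / total
        outputs.append([sum(weights[j] * frames[j][c] for j in allowed)
                        for c in range(dim)])
        all_weights.append(weights)
    return outputs, all_weights


# Assumed schedule of (span, stride) per layer: wide and sparse first,
# narrow and dense last.
SCHEDULE = [(16, 4), (8, 2), (2, 1)]

# Toy per-frame descriptors standing in for appearance-encoder outputs.
frames = [[math.sin(0.3 * t), math.cos(0.3 * t)] for t in range(16)]
for span, stride in SCHEDULE:
    frames, _ = masked_self_attention(frames, layer_mask(len(frames), span, stride))
```

In this sketch the first layer lets each frame see every fourth frame across the whole clip, while the last layer sees only its immediate neighbours at full density. A learned model would add projection matrices and multiple heads, and the actual constraints in SPViM may differ.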

Acknowledgments

This research was supported by the National Natural Science Foundation of China (Grant Nos. 62277035 and 62332017).

Author information

Authors and Affiliations

School of Software, Shandong University, Jinan, China
Lutong Wang, Chenglei Yang, Hongqiu Luan, Wei Gai, Wenxiu Geng & Yawen Zheng

Corresponding author

Correspondence to Chenglei Yang.

Editor information

Editors and Affiliations

Eastern Institute of Technology, Ningbo, China
De-Shuang Huang
Tianjin University of Science and Technology, Tianjin, China
Xiankun Zhang
Xiamen University, Xiamen, China
Jiayang Guo

About this paper

Cite this paper

Wang, L., Yang, C., Luan, H., Gai, W., Geng, W., Zheng, Y. (2024). SPViM: Sparse Pyramid Video Representation Learning Framework for Fine-Grained Action Retrieval. In: Huang, DS., Zhang, X., Guo, J. (eds) Advanced Intelligent Computing Technology and Applications. ICIC 2024. Lecture Notes in Computer Science, vol 14866. Springer, Singapore. https://doi.org/10.1007/978-981-97-5594-3_27

DOI: https://doi.org/10.1007/978-981-97-5594-3_27
Published: 14 August 2024
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-5593-6
Online ISBN: 978-981-97-5594-3
eBook Packages: Computer Science (R0)
