Abstract
Vision-Language Pre-trained (VLP) models have shown strong capability on many video tasks. For action recognition, recent studies predominantly adapt VLP models to the video domain with meticulously designed prompt tokens or positional encodings, which makes performance dependent on the quality of that design and of the learning process. Moreover, in mainstream fine-tuning settings, models are supervised only by the downstream task, which is a coarse-grained objective for temporal modeling. To address these issues, we propose an Explicit Temporal Modeling (ETM) method that consists of two key designs and is decoupled from the image model. To introduce temporal supervision, we focus on frame order and design a temporal task trained in a contrastive manner. To reduce the dependence on design and learning quality when modeling temporality, we propose a module with temporality-aware computation and make it compatible with the newly added task. Extensive experiments on real-world datasets demonstrate that ETM improves VLP models' performance on action recognition. Our model also generalizes well to few-shot and zero-shot tasks. Code and supplementary material are available at https://github.com/lyxwest/ETM.
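The abstract describes the added temporal supervision as a contrastive task built on frame order. The sketch below is a rough illustration of that idea only, not the paper's actual loss or module: it contrasts a clip in its original frame order against a shuffled copy of the same frames, with `temporal_module` standing in for the temporality-aware module; all names and the two-way contrast formulation are assumptions made for this sketch.

```python
# Minimal sketch of a frame-order contrastive objective (illustrative, not the authors' loss).
import torch
import torch.nn.functional as F


def order_contrastive_loss(frame_feats, temporal_module, temperature=0.07):
    """frame_feats: (B, T, D) per-frame embeddings from a frozen image encoder, in temporal order.
    temporal_module: hypothetical callable mapping (B, T, D) -> (B, D) clip embeddings."""
    B, T, D = frame_feats.shape

    # Positive view: the clip with frames in their original order.
    pos = temporal_module(frame_feats)                               # (B, D)
    # Negative view: the same frames with their temporal order shuffled.
    perm = torch.stack([torch.randperm(T) for _ in range(B)]).to(frame_feats.device)
    shuffled = torch.gather(frame_feats, 1, perm.unsqueeze(-1).expand(-1, -1, D))
    neg = temporal_module(shuffled)                                  # (B, D)

    # Order-agnostic anchor (mean pooling), so only temporal order separates pos from neg.
    anchor = F.normalize(frame_feats.mean(dim=1), dim=-1)
    pos, neg = F.normalize(pos, dim=-1), F.normalize(neg, dim=-1)

    # Two-way contrast: the correctly ordered view should score higher than the shuffled one.
    logits = torch.stack([(anchor * pos).sum(-1), (anchor * neg).sum(-1)], dim=1) / temperature
    target = torch.zeros(B, dtype=torch.long, device=logits.device)  # index 0 = ordered view
    return F.cross_entropy(logits, target)
```

Because the anchor is order-agnostic and both views contain the same frames, such a loss can only be minimized if the temporal module is sensitive to frame order, which is the kind of explicit temporal signal the abstract motivates.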
Acknowledgements
This research was supported by Tencent and by the advanced computing resources provided by the Supercomputing Center of USTC.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Liu, Y., Zhang, W., Chen, S., Zhang, X. (2025). Improving Video Representation of Vision-Language Model with Decoupled Explicit Temporal Modeling. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15037. Springer, Singapore. https://doi.org/10.1007/978-981-97-8511-7_37
DOI: https://doi.org/10.1007/978-981-97-8511-7_37
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-8510-0
Online ISBN: 978-981-97-8511-7
eBook Packages: Computer Science (R0)