Improving Video Representation of Vision-Language Model with Decoupled Explicit Temporal Modeling

  • Conference paper
Pattern Recognition and Computer Vision (PRCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15037)


Abstract

Vision-Language Pre-trained (VLP) models have shown significant ability in many video tasks. For action recognition, recent studies predominantly use meticulously designed prompt tokens or positional encodings to adapt VLP models to video domains, which leads to a reliance on the quality of those designing and learning processes. Moreover, in mainstream fine-tuning settings, models are guided only by the downstream task, which is a coarse-grained objective for temporal modeling. To address these issues, we propose an Explicit Temporal Modeling (ETM) method that consists of two key designs and is decoupled from the image model. To add temporal supervision, we focus on frame-sequential order and design a temporal-related task in a contrastive manner. To reduce dependence on the quality of design and learning when modeling temporality, we propose a module with temporality-aware computation approaches and make it compatible with the newly added task. Extensive experiments on real-world datasets demonstrate that the proposed ETM improves VLP models’ performance on action recognition tasks. In addition, our model demonstrates generalization ability in few/zero-shot tasks. Code and supplementary material are available at https://github.com/lyxwest/ETM.
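
For intuition, the sketch below illustrates the general idea of frame-order contrastive supervision described in the abstract, assuming frozen per-frame features from a VLP image encoder such as CLIP. It is a minimal PyTorch sketch, not the authors' ETM implementation (see the linked repository for that); the names TemporalHead and frame_order_contrastive_loss, the transformer head, and the temperature value are illustrative assumptions.

# Hypothetical illustration only -- not the authors' ETM module; the head
# architecture, names, and temperature are assumptions made for this sketch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalHead(nn.Module):
    """Small temporal module that summarizes a sequence of frozen frame features."""
    def __init__(self, dim: int, max_frames: int = 64, num_layers: int = 2, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.pos = nn.Parameter(torch.zeros(1, max_frames, dim))  # learned temporal positions

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, dim), e.g. per-frame CLIP image embeddings
        t = frame_feats.size(1)
        x = self.encoder(frame_feats + self.pos[:, :t])
        return x.mean(dim=1)  # video-level embedding

def frame_order_contrastive_loss(head: TemporalHead,
                                 frame_feats: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """Treat a second pass over the correctly ordered frames as the positive and a
    temporally shuffled copy of the same video as the negative (dropout in the head
    makes the two ordered passes differ during training)."""
    b, t, _ = frame_feats.shape
    perm = torch.randperm(t, device=frame_feats.device)
    anchor = F.normalize(head(frame_feats), dim=-1)              # ordered frames
    positive = F.normalize(head(frame_feats), dim=-1)            # ordered frames, second pass
    negative = F.normalize(head(frame_feats[:, perm]), dim=-1)   # shuffled frames
    logits = torch.stack([(anchor * positive).sum(-1),
                          (anchor * negative).sum(-1)], dim=1) / temperature
    target = torch.zeros(b, dtype=torch.long, device=logits.device)  # index 0 = positive
    return F.cross_entropy(logits, target)

# Usage (illustrative): add to the usual classification objective, e.g.
#   loss = cls_loss + lambda_t * frame_order_contrastive_loss(temporal_head, clip_frame_feats)

Because such a temporal head operates on frozen per-frame features as a separate module, an objective of this kind stays decoupled from the image backbone, matching the decoupling emphasized in the abstract.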



Acknowledgements

This research was supported by Tencent and by the advanced computing resources provided by the Supercomputing Center of USTC.

Author information

Corresponding author

Correspondence to Xinming Zhang.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Liu, Y., Zhang, W., Chen, S., Zhang, X. (2025). Improving Video Representation of Vision-Language Model with Decoupled Explicit Temporal Modeling. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15037. Springer, Singapore. https://doi.org/10.1007/978-981-97-8511-7_37

  • DOI: https://doi.org/10.1007/978-981-97-8511-7_37

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-8510-0

  • Online ISBN: 978-981-97-8511-7

  • eBook Packages: Computer Science, Computer Science (R0)
