
Act-ChatGPT: Introducing Action Features into Multi-modal Large Language Models for Video Understanding

  • Conference paper
  • First Online:
Pattern Recognition (ICPR 2024)

Abstract

In the last few years, the advancement of GPT-4 and similar large language models has significantly influenced the field of video understanding, and models have been developed to exploit these advances for interactive video comprehension. However, existing models generally encode video using image-language or video-language models with sparse sampling, overlooking the vital action features present in each video segment. To address this gap, we propose Act-ChatGPT, an innovative interactive video comprehension model that integrates action features. Act-ChatGPT incorporates a dense-sampling-based action recognition model as an additional visual encoder, enabling it to generate responses that take the action in each video segment into account. Comparative analysis reveals Act-ChatGPT's superiority over a base model, with qualitative evidence highlighting its adeptness at recognizing actions and responding based on them.
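
The sketch below illustrates, at a high level, the dual visual-encoder design described in the abstract: sparsely sampled frames pass through a general visual encoder, densely sampled clips pass through an action-recognition encoder, and both token streams are projected into the language model's embedding space alongside the text prompt. This is a minimal, hypothetical PyTorch illustration; all module choices, dimensions, and sampling settings are assumptions for exposition and are not taken from the paper.

```python
# Minimal sketch (not the authors' code) of a dual visual-encoder front end:
# sparse frames and dense clips are encoded separately, projected into the
# LLM token space, and concatenated with the text prompt embeddings.
# All dimensions and shapes below are illustrative assumptions.
import torch
import torch.nn as nn


class DualEncoderSketch(nn.Module):
    def __init__(self, vis_dim=1024, act_dim=768, llm_dim=4096):
        super().__init__()
        # Stand-ins for the two (normally pretrained, frozen) visual encoders.
        self.frame_encoder = nn.Linear(3 * 224 * 224, vis_dim)        # sparse frames
        self.action_encoder = nn.Linear(3 * 16 * 112 * 112, act_dim)  # dense clips
        # Projections mapping each modality into the LLM embedding space.
        self.frame_proj = nn.Linear(vis_dim, llm_dim)
        self.action_proj = nn.Linear(act_dim, llm_dim)

    def forward(self, sparse_frames, dense_clips, text_embeds):
        # sparse_frames: (num_frames, 3*224*224) flattened RGB frames
        # dense_clips:   (num_clips, 3*16*112*112) flattened 16-frame clips
        # text_embeds:   (num_text_tokens, llm_dim) prompt embeddings
        frame_tokens = self.frame_proj(self.frame_encoder(sparse_frames))
        action_tokens = self.action_proj(self.action_encoder(dense_clips))
        # The concatenated token sequence would be fed to the (omitted) LLM.
        return torch.cat([frame_tokens, action_tokens, text_embeds], dim=0)


if __name__ == "__main__":
    sketch = DualEncoderSketch()
    tokens = sketch(torch.randn(8, 3 * 224 * 224),
                    torch.randn(4, 3 * 16 * 112 * 112),
                    torch.randn(12, 4096))
    print(tokens.shape)  # torch.Size([24, 4096])
```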



Acknowledgments

This work was supported by JSPS KAKENHI Grant Numbers 22H00540 and 22H00548.

Author information


Corresponding author

Correspondence to Keiji Yanai.



Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Nakamizo, Y., Yanai, K. (2025). Act-ChatGPT: Introducing Action Features into Multi-modal Large Language Models for Video Understanding. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15323. Springer, Cham. https://doi.org/10.1007/978-3-031-78347-0_17

  • DOI: https://doi.org/10.1007/978-3-031-78347-0_17

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-78346-3

  • Online ISBN: 978-3-031-78347-0

  • eBook Packages: Computer Science, Computer Science (R0)
