Abstract
In recent years, advances in GPT-4 and other large language models have strongly influenced video understanding, and a number of models have been developed to exploit these advances for interactive video comprehension. However, existing models generally encode video with image-language models or with video-language models that rely on sparse sampling, overlooking the important action features present in each video segment. To address this gap, we propose Act-ChatGPT, an interactive video comprehension model that integrates action features. Act-ChatGPT incorporates a dense-sampling-based action recognition model as an additional visual encoder, enabling it to generate responses that take the action in each video segment into account. Comparative experiments show that Act-ChatGPT outperforms a baseline model, and qualitative results highlight its ability to recognize actions and respond based on them.
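To make the two-stream design described above concrete, the sketch below shows one plausible way an additional action encoder could be combined with a standard sparsely sampled visual encoder before feeding an LLM. This is not the authors' implementation: the module name, feature dimensions, and simple concatenation-based fusion are illustrative assumptions, meant only to clarify the idea of projecting both feature streams into the LLM's token space.

```python
# Minimal sketch (assumed architecture, not the paper's code): fuse tokens from
# a sparse-sampling frame encoder and a dense-sampling action recognition
# encoder so an LLM can attend to both alongside the text prompt.
import torch
import torch.nn as nn


class ActionAwareVideoAdapter(nn.Module):
    def __init__(self, frame_dim=1024, action_dim=768, llm_dim=4096):
        super().__init__()
        # Linear projections mapping each visual stream into the LLM embedding space.
        self.frame_proj = nn.Linear(frame_dim, llm_dim)
        self.action_proj = nn.Linear(action_dim, llm_dim)

    def forward(self, frame_feats, action_feats):
        # frame_feats:  (batch, n_sparse_frames, frame_dim)  from an image/video encoder
        # action_feats: (batch, n_segments, action_dim)      from a dense-sampling action model
        frame_tokens = self.frame_proj(frame_feats)
        action_tokens = self.action_proj(action_feats)
        # Concatenate both token streams; the LLM would attend to them with the prompt.
        return torch.cat([frame_tokens, action_tokens], dim=1)


if __name__ == "__main__":
    adapter = ActionAwareVideoAdapter()
    frames = torch.randn(1, 8, 1024)   # e.g. 8 sparsely sampled frames
    actions = torch.randn(1, 4, 768)   # e.g. 4 densely sampled video segments
    print(adapter(frames, actions).shape)  # torch.Size([1, 12, 4096])
```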
Acknowledgments
This work was supported by JSPS KAKENHI Grant Numbers 22H00540 and 22H00548.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Nakamizo, Y., Yanai, K. (2025). Act-ChatGPT: Introducing Action Features into Multi-modal Large Language Models for Video Understanding. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15323. Springer, Cham. https://doi.org/10.1007/978-3-031-78347-0_17
DOI: https://doi.org/10.1007/978-3-031-78347-0_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-78346-3
Online ISBN: 978-3-031-78347-0
eBook Packages: Computer Science, Computer Science (R0)