Abstract
In recent years, advances in GPT-4 and other large language models have strongly influenced video understanding, and a number of models have been developed to exploit these advances for interactive video comprehension. However, existing models generally encode video with image-language models or with video-language models that rely on sparse sampling, overlooking the important action features present in each video segment. To address this gap, we propose Act-ChatGPT, an interactive video comprehension model that integrates action features. Act-ChatGPT incorporates a dense-sampling-based action recognition model as an additional visual encoder, enabling it to generate responses that take the action in each video segment into account. Comparative experiments show that Act-ChatGPT outperforms a baseline model, and qualitative results highlight its ability to recognize actions and respond based on them.
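To make the two-stream design described above concrete, the sketch below shows one plausible way an additional action encoder could be combined with a standard sparsely sampled visual encoder before feeding an LLM. This is not the authors' implementation: the module name, feature dimensions, and simple concatenation-based fusion are illustrative assumptions, meant only to clarify the idea of projecting both feature streams into the LLM's token space.

```python
# Minimal sketch (assumed architecture, not the paper's code): fuse tokens from
# a sparse-sampling frame encoder and a dense-sampling action recognition
# encoder so an LLM can attend to both alongside the text prompt.
import torch
import torch.nn as nn


class ActionAwareVideoAdapter(nn.Module):
    def __init__(self, frame_dim=1024, action_dim=768, llm_dim=4096):
        super().__init__()
        # Linear projections mapping each visual stream into the LLM embedding space.
        self.frame_proj = nn.Linear(frame_dim, llm_dim)
        self.action_proj = nn.Linear(action_dim, llm_dim)

    def forward(self, frame_feats, action_feats):
        # frame_feats:  (batch, n_sparse_frames, frame_dim)  from an image/video encoder
        # action_feats: (batch, n_segments, action_dim)      from a dense-sampling action model
        frame_tokens = self.frame_proj(frame_feats)
        action_tokens = self.action_proj(action_feats)
        # Concatenate both token streams; the LLM would attend to them with the prompt.
        return torch.cat([frame_tokens, action_tokens], dim=1)


if __name__ == "__main__":
    adapter = ActionAwareVideoAdapter()
    frames = torch.randn(1, 8, 1024)   # e.g. 8 sparsely sampled frames
    actions = torch.randn(1, 4, 768)   # e.g. 4 densely sampled video segments
    print(adapter(frames, actions).shape)  # torch.Size([1, 12, 4096])
```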
Acknowledgments
This work was supported by JSPS KAKENHI Grant Numbers 22H00540 and 22H00548.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Nakamizo, Y., Yanai, K. (2025). Act-ChatGPT: Introducing Action Features into Multi-modal Large Language Models for Video Understanding. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15323. Springer, Cham. https://doi.org/10.1007/978-3-031-78347-0_17
DOI: https://doi.org/10.1007/978-3-031-78347-0_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-78346-3
Online ISBN: 978-3-031-78347-0
eBook Packages: Computer Science, Computer Science (R0)