Abstract
We explore how combining several foundation models (large language models and vision-language models) with a novel unified memory mechanism can tackle the challenging problem of video understanding, especially capturing long-term temporal relations in lengthy videos. In particular, the proposed multimodal agent VideoAgent 1) constructs a structured memory that stores both generic temporal event descriptions and object-centric tracking states of the video, and 2) given an input task query, employs tools including video segment localization and object memory querying, along with other visual foundation models, to interactively solve the task, leveraging the zero-shot tool-use ability of LLMs. VideoAgent demonstrates impressive performance on several long-horizon video understanding benchmarks, with an average gain of 6.6% on NExT-QA and 26.0% on EgoSchema over baselines, closing the gap between open-source models and private counterparts such as Gemini 1.5 Pro. The code and demo can be found at https://videoagent.github.io.
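The abstract above outlines a two-part structured memory (generic temporal event descriptions plus object-centric tracking states) and a set of tools the agent can invoke. The following is a minimal, hedged sketch of how such a design could be organized; all class names, fields, and the keyword-matching "localization" are hypothetical simplifications for illustration, not the authors' released implementation (see https://videoagent.github.io for the actual code).

```python
# A minimal, self-contained sketch (not the authors' implementation) of a
# two-part video memory and the tool interface described in the abstract.
# All names here are hypothetical illustrations.
from dataclasses import dataclass, field


@dataclass
class SegmentEntry:
    """Temporal memory: one event description per video segment."""
    start: float   # segment start time (seconds)
    end: float     # segment end time (seconds)
    caption: str   # event description produced by a captioning model


@dataclass
class ObjectEntry:
    """Object-centric memory: tracked states of a single object."""
    object_id: int
    category: str
    track: list = field(default_factory=list)  # (timestamp, bounding box) pairs


class StructuredMemory:
    def __init__(self):
        self.temporal = []   # list[SegmentEntry]
        self.objects = {}    # object_id -> ObjectEntry

    # Tool 1: naive keyword-overlap segment localization. A real system would
    # use a text-video retrieval or temporal grounding model instead.
    def locate_segments(self, query, top_k=3):
        words = set(query.lower().split())
        scored = [(len(words & set(s.caption.lower().split())), s)
                  for s in self.temporal]
        scored.sort(key=lambda x: x[0], reverse=True)
        return [s for score, s in scored[:top_k] if score > 0]

    # Tool 2: query the object memory by category name.
    def query_objects(self, category):
        return [o for o in self.objects.values() if o.category == category]


def answer(query, memory):
    """Toy dispatch loop: pick a tool from the query, then summarize results.
    In an agent, an LLM would choose tools and compose their outputs."""
    if "object" in query.lower() or query.lower().startswith("where"):
        hits = memory.query_objects(query.split()[-1].rstrip("?"))
        return f"{len(hits)} matching object track(s) found."
    segments = memory.locate_segments(query)
    return " / ".join(s.caption for s in segments) or "No relevant segment found."
```

In the actual system, segment localization and object querying would be backed by dedicated video-language and tracking models, and the LLM would decide which tool to call at each reasoning step rather than relying on the fixed rules sketched here.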
Y. Fan and X. Ma—Equal contribution.
Acknowledgements
We thank the anonymous reviewers for their constructive suggestions. Their insights have greatly improved the quality and clarity of our work. This work was partly supported by the National Science and Technology Major Project (2022ZD0114900).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Fan, Y., et al. (2025). VideoAgent: A Memory-Augmented Multimodal Agent for Video Understanding. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. Lecture Notes in Computer Science, vol 15080. Springer, Cham. https://doi.org/10.1007/978-3-031-72670-5_5
DOI: https://doi.org/10.1007/978-3-031-72670-5_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72669-9
Online ISBN: 978-3-031-72670-5
eBook Packages: Computer Science, Computer Science (R0)