Abstract
We explore how combining several foundation models (large language models and vision-language models) with a novel unified memory mechanism can tackle the challenging problem of video understanding, especially capturing long-term temporal relations in lengthy videos. In particular, the proposed multimodal agent VideoAgent 1) constructs a structured memory that stores both generic temporal event descriptions and object-centric tracking states of the video, and 2) given an input task query, employs tools including video segment localization and object memory querying, along with other visual foundation models, to interactively solve the task, leveraging the zero-shot tool-use ability of LLMs. VideoAgent demonstrates impressive performance on several long-horizon video understanding benchmarks, with an average gain of 6.6% on NExT-QA and 26.0% on EgoSchema over baselines, closing the gap between open-source models and private counterparts such as Gemini 1.5 Pro. The code and demo can be found at https://videoagent.github.io.
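The abstract above outlines a two-part structured memory (generic temporal event descriptions plus object-centric tracking states) and a set of tools the agent can invoke. The following is a minimal, hedged sketch of how such a design could be organized; all class names, fields, and the keyword-matching "localization" are hypothetical simplifications for illustration, not the authors' released implementation (see https://videoagent.github.io for the actual code).

```python
# A minimal, self-contained sketch (not the authors' implementation) of a
# two-part video memory and the tool interface described in the abstract.
# All names here are hypothetical illustrations.
from dataclasses import dataclass, field


@dataclass
class SegmentEntry:
    """Temporal memory: one event description per video segment."""
    start: float   # segment start time (seconds)
    end: float     # segment end time (seconds)
    caption: str   # event description produced by a captioning model


@dataclass
class ObjectEntry:
    """Object-centric memory: tracked states of a single object."""
    object_id: int
    category: str
    track: list = field(default_factory=list)  # (timestamp, bounding box) pairs


class StructuredMemory:
    def __init__(self):
        self.temporal = []   # list[SegmentEntry]
        self.objects = {}    # object_id -> ObjectEntry

    # Tool 1: naive keyword-overlap segment localization. A real system would
    # use a text-video retrieval or temporal grounding model instead.
    def locate_segments(self, query, top_k=3):
        words = set(query.lower().split())
        scored = [(len(words & set(s.caption.lower().split())), s)
                  for s in self.temporal]
        scored.sort(key=lambda x: x[0], reverse=True)
        return [s for score, s in scored[:top_k] if score > 0]

    # Tool 2: query the object memory by category name.
    def query_objects(self, category):
        return [o for o in self.objects.values() if o.category == category]


def answer(query, memory):
    """Toy dispatch loop: pick a tool from the query, then summarize results.
    In an agent, an LLM would choose tools and compose their outputs."""
    if "object" in query.lower() or query.lower().startswith("where"):
        hits = memory.query_objects(query.split()[-1].rstrip("?"))
        return f"{len(hits)} matching object track(s) found."
    segments = memory.locate_segments(query)
    return " / ".join(s.caption for s in segments) or "No relevant segment found."
```

In the actual system, segment localization and object querying would be backed by dedicated video-language and tracking models, and the LLM would decide which tool to call at each reasoning step rather than relying on the fixed rules sketched here.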
Y. Fan and X. Ma—Equal contribution.
Acknowledgements
We thank the anonymous reviewers for their constructive suggestions. Their insights have greatly improved the quality and clarity of our work. This work was partly supported by the National Science and Technology Major Project (2022ZD0114900).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Fan, Y., et al. (2025). VideoAgent: A Memory-Augmented Multimodal Agent for Video Understanding. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. Lecture Notes in Computer Science, vol 15080. Springer, Cham. https://doi.org/10.1007/978-3-031-72670-5_5
DOI: https://doi.org/10.1007/978-3-031-72670-5_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72669-9
Online ISBN: 978-3-031-72670-5
eBook Packages: Computer Science, Computer Science (R0)