
VideoAgent: A Memory-Augmented Multimodal Agent for Video Understanding

  • Conference paper
  • In: Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

We explore how reconciling several foundation models (large language models and vision-language models) with a novel unified memory mechanism could tackle the challenging video understanding problem, especially capturing the long-term temporal relations in lengthy videos. In particular, the proposed multimodal agent VideoAgent: 1) constructs a structured memory to store both the generic temporal event descriptions and object-centric tracking states of the video; 2) given an input task query, employs tools including video segment localization and object memory querying, along with other visual foundation models, to interactively solve the task, utilizing the zero-shot tool-use ability of LLMs. VideoAgent demonstrates impressive performance on several long-horizon video understanding benchmarks, with an average increase of 6.6% on NExT-QA and 26.0% on EgoSchema over baselines, closing the gap between open-source models and private counterparts including Gemini 1.5 Pro. The code and demo can be found at https://videoagent.github.io.
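To make the memory mechanism described in the abstract concrete, below is a minimal, illustrative Python sketch of a memory-augmented agent loop. It is not the authors' implementation: the class names (VideoMemory, EventEntry, ObjectEntry), the toy keyword-matching controller standing in for the LLM, and the example data are all hypothetical, intended only to show how a temporal event log and an object-centric track store can back tools such as video segment localization and object memory querying.

    # Hypothetical sketch of a structured video memory and two tools over it.
    # In the actual system these tools would be selected zero-shot by an LLM;
    # here a toy keyword-based controller plays that role for illustration.
    from dataclasses import dataclass, field

    @dataclass
    class EventEntry:
        start_sec: float   # segment start time
        end_sec: float     # segment end time
        caption: str       # generic description of the event in this segment

    @dataclass
    class ObjectEntry:
        object_id: int     # persistent track id
        category: str      # e.g. "cup", "person"
        appearances: list = field(default_factory=list)  # (timestamp, bbox) pairs

    class VideoMemory:
        """Structured memory: a temporal event log plus object-centric tracking states."""
        def __init__(self):
            self.events: list[EventEntry] = []
            self.objects: dict[int, ObjectEntry] = {}

        # Tool 1: localize video segments whose captions mention the query keyword.
        def localize_segments(self, query: str) -> list[EventEntry]:
            return [e for e in self.events if query.lower() in e.caption.lower()]

        # Tool 2: query the object memory by category.
        def query_objects(self, category: str) -> list[ObjectEntry]:
            return [o for o in self.objects.values() if o.category == category]

    def answer(question: str, memory: VideoMemory) -> str:
        """Toy controller: pick a tool, gather evidence, compose an answer."""
        keyword = question.split()[-1].strip("?")
        if question.lower().startswith("when"):
            hits = memory.localize_segments(keyword)
            return ", ".join(f"{h.start_sec:.0f}-{h.end_sec:.0f}s: {h.caption}" for h in hits) or "no matching segment"
        objs = memory.query_objects(keyword)
        return f"{len(objs)} matching object track(s) in memory"

    if __name__ == "__main__":
        mem = VideoMemory()
        mem.events.append(EventEntry(0.0, 12.0, "a person pours coffee into a cup"))
        mem.objects[1] = ObjectEntry(1, "cup", [(3.0, (100, 80, 40, 40))])
        print(answer("When does the person pour coffee", mem))  # uses segment localization
        print(answer("How many distinct cup", mem))             # uses object memory querying

In the paper's setting, the memory would be populated by captioning and tracking models over the full video, and the controller role is played by an LLM choosing among these tools zero-shot rather than by keyword rules.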

Y. Fan and X. Ma—Equal contribution.



Notes

  1. https://platform.openai.com/docs/guides/embeddings.

  2. https://www.langchain.com/.

  3. https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf.



Acknowledgements

We thank the anonymous reviewers for their constructive suggestions. Their insights have greatly improved the quality and clarity of our work. This work was partly supported by the National Science and Technology Major Project (2022ZD0114900).

Author information


Correspondence to Xiaojian Ma or Qing Li.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 7210 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Fan, Y. et al. (2025). VideoAgent: A Memory-Augmented Multimodal Agent for Video Understanding. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15080. Springer, Cham. https://doi.org/10.1007/978-3-031-72670-5_5


  • DOI: https://doi.org/10.1007/978-3-031-72670-5_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72669-9

  • Online ISBN: 978-3-031-72670-5

  • eBook Packages: Computer Science, Computer Science (R0)
