VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning

Liu, Ye; Lin, Kevin Qinghong; Chen, Chang Wen; Shou, Mike Zheng

arXiv:2503.13444 (cs)

[Submitted on 17 Mar 2025 (v1), last revised 1 Apr 2025 (this version, v2)]

Abstract: Videos, with their unique temporal dimension, demand precise grounded understanding, where answers are directly linked to visual, interpretable evidence. Despite significant breakthroughs in reasoning capabilities within Large Language Models, multi-modal reasoning, especially for videos, remains unexplored. In this work, we introduce VideoMind, a novel video-language agent designed for temporal-grounded video understanding. VideoMind incorporates two key innovations: (i) We identify essential capabilities for video temporal reasoning and develop a role-based agentic workflow, including a planner for coordinating different roles, a grounder for temporal localization, a verifier to assess temporal interval accuracy, and an answerer for question-answering. (ii) To efficiently integrate these diverse roles, we propose a novel Chain-of-LoRA strategy, enabling seamless role-switching via lightweight LoRA adaptors while avoiding the overhead of multiple models, thus balancing efficiency and flexibility. Extensive experiments on 14 public benchmarks, including 3 on grounded video question-answering (Grounded VideoQA), 6 on video temporal grounding (VTG), and 5 on general video question-answering (VideoQA), verify that our agent achieves state-of-the-art performance on diverse video understanding tasks, underscoring its effectiveness in advancing video agents and long-form temporal reasoning.
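
The role-based workflow and role-switching idea described in the abstract can be illustrated with a small, purely hypothetical Python sketch. It is not the authors' implementation: the class `SharedBackbone`, the helper `build_model`, the keyword rule in the planner, the candidate intervals, and the "prefer the longer interval" verifier are all assumptions made up for illustration, and plain functions stand in for LoRA adaptors. The only point it tries to show is that one shared model object serves every role, and that changing roles amounts to activating a different lightweight adaptor.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)


@dataclass
class SharedBackbone:
    """Stand-in for a single base model; each role is a swappable adaptor (here, a plain function)."""
    adapters: Dict[str, Callable[..., object]]
    active: str = ""
    switches: List[str] = field(default_factory=list)

    def run(self, role: str, *args):
        # Changing roles only changes the active adaptor; no second model is created.
        if role != self.active:
            self.active = role
            self.switches.append(role)
        return self.adapters[role](*args)


def planner(question: str) -> List[str]:
    # Hypothetical rule: temporal cue words trigger localization before answering.
    cues = ("when", "after", "before")
    if any(cue in question.lower() for cue in cues):
        return ["grounder", "verifier", "answerer"]
    return ["answerer"]


def grounder(question: str) -> List[Interval]:
    # Made-up candidate intervals; a real grounder would predict these from the video.
    return [(12.0, 20.0), (41.0, 55.0)]


def verifier(question: str, candidates: List[Interval]) -> Interval:
    # Placeholder scoring (assumption): keep the longest candidate interval.
    return max(candidates, key=lambda iv: iv[1] - iv[0])


def answerer(question: str, interval: Optional[Interval]) -> str:
    if interval is None:
        return "answer from full video"
    return f"answer from {interval[0]:.0f}s-{interval[1]:.0f}s"


def build_model() -> SharedBackbone:
    return SharedBackbone(adapters={
        "planner": planner,
        "grounder": grounder,
        "verifier": verifier,
        "answerer": answerer,
    })


def answer_question(model: SharedBackbone, question: str) -> str:
    """Run the roles chosen by the planner, in order, on the same shared model."""
    plan = model.run("planner", question)
    candidates: List[Interval] = []
    interval: Optional[Interval] = None
    for role in plan:
        if role == "grounder":
            candidates = model.run("grounder", question)
        elif role == "verifier":
            interval = model.run("verifier", question, candidates)
        elif role == "answerer":
            return model.run("answerer", question, interval)
    raise ValueError("plan did not end with the answerer")
```

Calling `answer_question(build_model(), "What happens after the dog jumps?")` walks through planner, grounder, verifier, and answerer in that order, while a question without a temporal cue goes from the planner straight to the answerer. In a real system each stub would be replaced by a forward pass of the base model with the corresponding adaptor active.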

Comments:	Project Page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2503.13444 [cs.CV]
	(or arXiv:2503.13444v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.13444

Submission history

From: Ye Liu
[v1] Mon, 17 Mar 2025 17:59:33 UTC (6,431 KB)
[v2] Tue, 1 Apr 2025 03:49:08 UTC (6,445 KB)
