Can VLMs be used on videos for action recognition? LLMs are Visual Reasoning Coordinators

Lunia, Harsh

Abstract: Recent advances have produced multiple vision-language models (VLMs) that demonstrate impressive commonsense reasoning across various domains. Despite their individual capabilities, the potential of combining these complementary VLMs remains underexplored. The Cola framework addresses this by showing how a large language model (LLM) can efficiently coordinate multiple VLMs through natural language communication, leveraging their distinct strengths. We verified this claim on the challenging A-OKVQA dataset, confirming the effectiveness of such coordination. Building on this, our study investigates whether the same methodology can be applied to surveillance videos for action recognition. Specifically, we explore whether the combined knowledge of the VLMs and the LLM can deduce actions from a video when given only a few selectively important frames and minimal temporal information. Our experiments demonstrate that an LLM coordinating different VLMs can recognize patterns and deduce actions in various scenarios despite the weak temporal signals. However, our findings suggest that, for this approach to become a viable alternative, it would benefit from a stronger temporal signal and from exposing the models to slightly more frames.

Comments:	LLMs, VLMs, Action Recognition
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2407.14834 [cs.CV]
	(or arXiv:2407.14834v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2407.14834
