Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM

Zhang, Huaxin; Xu, Xiaohao; Wang, Xiang; Zuo, Jialong; Han, Chuchu; Huang, Xiaonan; Gao, Changxin; Wang, Yuehuan; Sang, Nong

arXiv:2406.12235 (cs)

[Submitted on 18 Jun 2024 (v1), last revised 29 Jun 2024 (this version, v2)]

Abstract: In open-ended Video Anomaly Detection (VAD), existing methods often exhibit biased detection when faced with challenging or unseen events, and they lack interpretability. To address these drawbacks, we propose Holmes-VAD, a novel framework that leverages precise temporal supervision and rich multimodal instructions to enable accurate anomaly localization and comprehensive explanations. First, towards an unbiased and explainable VAD system, we construct the first large-scale multimodal VAD instruction-tuning benchmark, VAD-Instruct50k. This dataset is created using a carefully designed semi-automatic labeling paradigm: efficient single-frame annotations are applied to the collected untrimmed videos, and these annotations are then synthesized into high-quality analyses of both abnormal and normal video clips using a robust off-the-shelf video captioner and a large language model (LLM). Building upon the VAD-Instruct50k dataset, we develop a customized solution for interpretable video anomaly detection. We train a lightweight temporal sampler to select frames with high anomaly response and fine-tune a multimodal large language model (MLLM) to generate explanatory content. Extensive experimental results validate the generality and interpretability of the proposed Holmes-VAD, establishing it as a novel interpretable technique for real-world video anomaly analysis. To support the community, our benchmark and model will be publicly available at this https URL.
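The abstract describes a temporal sampler that selects frames with high anomaly response before they are passed to the multimodal model. As a rough illustration of that idea only, the sketch below picks the top-k frames by a per-frame anomaly score and returns them in temporal order. The function name `select_frames`, the parameter `k`, and the example scores are hypothetical; the abstract does not specify the actual selection rule, so plain top-k is an assumption made here, not the authors' method.

```python
from typing import List, Sequence


def select_frames(scores: Sequence[float], k: int) -> List[int]:
    """Return indices of the k highest-scoring frames, in temporal order.

    Assumption: `scores` holds one anomaly score per frame, produced by
    some upstream scorer that is not modeled here.
    """
    if k <= 0:
        return []
    # Rank frame indices by descending score; ties favor the earlier frame.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    # Restore temporal order so downstream models see frames chronologically.
    return sorted(ranked[:k])


# Hypothetical per-frame anomaly scores for a six-frame clip.
scores = [0.05, 0.10, 0.92, 0.88, 0.07, 0.61]
selected = select_frames(scores, k=3)
print(selected)
```

Returning the indices in temporal order rather than score order keeps the selected frames usable as a short chronological clip.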

Comments:	19 pages, 9 figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2406.12235 [cs.CV]
	(or arXiv:2406.12235v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2406.12235

Submission history

From: Huaxin Zhang
[v1] Tue, 18 Jun 2024 03:19:24 UTC (4,628 KB)
[v2] Sat, 29 Jun 2024 08:15:27 UTC (4,628 KB)
