VideoLLM-online: Online Video Large Language Model for Streaming Video

Chen, Joya; Lv, Zhaoyang; Wu, Shiwei; Lin, Kevin Qinghong; Song, Chenan; Gao, Difei; Liu, Jia-Wei; Gao, Ziteng; Mao, Dongxing; Shou, Mike Zheng

arXiv:2406.11816 (cs)

[Submitted on 17 Jun 2024]

Abstract: Recent Large Language Models have been enhanced with vision capabilities, enabling them to comprehend images, videos, and interleaved vision-language content. However, the learning methods of these large multimodal models typically treat videos as predetermined clips, making them less effective and less efficient at handling streaming video inputs. In this paper, we propose a novel Learning-In-Video-Stream (LIVE) framework, which enables temporally aligned, long-context, and real-time conversation within a continuous video stream. Our LIVE framework comprises three components for achieving video streaming dialogue: (1) a training objective designed to perform language modeling for continuous streaming inputs, (2) a data generation scheme that converts offline temporal annotations into a streaming dialogue format, and (3) an optimized inference pipeline to speed up the model responses in real-world video streams. With our LIVE framework, we build the VideoLLM-online model upon Llama-2/Llama-3 and demonstrate its significant advantages in processing streaming videos. For instance, on average, our model can support streaming dialogue in a 5-minute video clip at over 10 FPS on an A100 GPU. Moreover, it also achieves state-of-the-art performance on public offline video benchmarks covering tasks such as recognition, captioning, and forecasting. The code, model, data, and demo have been made available at this https URL.
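
As a rough illustration of component (2), the sketch below turns offline annotations with timestamps into a per-frame stream in which each frame carries either a response or silence. It is not the authors' implementation: the names `Annotation`, `StreamStep`, and `to_streaming_dialogue`, the 2 FPS sampling rate, and the rule of emitting each annotation at the first frame at or after its start time are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Annotation:
    """Hypothetical offline annotation: a text label with a time span in seconds."""
    start: float
    end: float
    text: str


@dataclass
class StreamStep:
    """One sampled frame in the stream; response is None when the model should stay silent."""
    time: float
    response: Optional[str]


def to_streaming_dialogue(
    annotations: List[Annotation], duration: float, fps: float = 2.0
) -> List[StreamStep]:
    """Assumed conversion rule: emit each annotation's text at the first
    sampled frame whose timestamp is at or after the annotation's start."""
    pending = sorted(annotations, key=lambda a: a.start)
    next_idx = 0
    steps: List[StreamStep] = []
    for i in range(int(duration * fps)):
        t = i / fps
        response = None
        # At most one response per frame; a second annotation that starts
        # within the same frame interval is deferred to the next frame.
        if next_idx < len(pending) and t >= pending[next_idx].start:
            response = pending[next_idx].text
            next_idx += 1
        steps.append(StreamStep(time=t, response=response))
    return steps


annotations = [
    Annotation(1.0, 3.0, "picks up knife"),
    Annotation(3.5, 5.0, "cuts onion"),
]
steps = to_streaming_dialogue(annotations, duration=5.0, fps=2.0)
spoken = [(s.time, s.response) for s in steps if s.response is not None]
print(spoken)
```

Most frames in the resulting stream carry no response, which is the property that distinguishes a streaming dialogue target from an ordinary clip-level caption: a model trained on such data has to learn when to stay silent as well as what to say.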

Comments:	CVPR 2024. This arXiv version is upgraded with Llama-3
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2406.11816 [cs.CV]
	(or arXiv:2406.11816v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2406.11816

Submission history

From: Joya Chen
[v1] Mon, 17 Jun 2024 17:55:32 UTC (19,575 KB)
