
Text-Conditioned Resampler For Long Form Video Understanding

  • Conference paper
Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15144)


Abstract

In this paper, we present a text-conditioned video resampler (TCR) module that uses a pre-trained and frozen visual encoder and large language model (LLM) to process long video sequences for a task. TCR localises relevant visual features from the video given a text condition and provides them to an LLM to generate a text response. Due to its lightweight design and use of cross-attention, TCR can process more than 100 frames at a time with plain attention and without optimised implementations. We make the following contributions: (i) we design a transformer-based sampling architecture that can process long videos conditioned on a task, together with a training method that enables it to bridge pre-trained visual and language models; (ii) we identify tasks that could benefit from longer video perception; and (iii) we empirically validate its efficacy on a wide variety of evaluation tasks including NextQA, EgoSchema, and the EGO4D-LTA challenge.
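
To make the architecture concrete, the following is a minimal sketch, not the authors' implementation, of how such a text-conditioned resampler can be wired up: a small set of learned query tokens cross-attends over the frozen per-frame visual features together with the embedded text condition, producing a fixed-size set of tokens that is then handed to the frozen LLM. All module names, dimensions, and layer counts below are illustrative assumptions.

import torch
import torch.nn as nn


class TextConditionedResampler(nn.Module):
    """Sketch of a text-conditioned resampler: learned queries cross-attend over
    (frozen) frame features and text-condition embeddings. Illustrative only;
    hyper-parameters are assumptions, not the paper's values."""

    def __init__(self, dim=768, num_queries=64, num_heads=8, num_layers=2):
        super().__init__()
        # Learned latent queries that will summarise the video for the LLM.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.layers = nn.ModuleList([
            nn.ModuleDict({
                "cross_attn": nn.MultiheadAttention(dim, num_heads, batch_first=True),
                "norm1": nn.LayerNorm(dim),
                "ffn": nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)),
                "norm2": nn.LayerNorm(dim),
            })
            for _ in range(num_layers)
        ])

    def forward(self, frame_feats, text_feats):
        # frame_feats: (B, T*P, dim) tokens from a frozen visual encoder over many frames
        # text_feats:  (B, L, dim)   embedded text condition (task prompt / question)
        batch = frame_feats.size(0)
        context = torch.cat([frame_feats, text_feats], dim=1)  # condition on video and text
        x = self.queries.unsqueeze(0).expand(batch, -1, -1)
        for layer in self.layers:
            attn_out, _ = layer["cross_attn"](x, context, context)  # queries attend to context
            x = layer["norm1"](x + attn_out)
            x = layer["norm2"](x + layer["ffn"](x))
        return x  # (B, num_queries, dim): fixed-size token set passed to the frozen LLM

Because the number of learned queries is fixed, the attention cost grows only linearly with the number of frame tokens, which is consistent with the claim that well over 100 frames can be processed at once with plain attention.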

Author information


Corresponding author

Correspondence to Bruno Korbar.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 447 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Korbar, B., Xian, Y., Tonioni, A., Zisserman, A., Tombari, F. (2025). Text-Conditioned Resampler For Long Form Video Understanding. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15144. Springer, Cham. https://doi.org/10.1007/978-3-031-73016-0_16

  • DOI: https://doi.org/10.1007/978-3-031-73016-0_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73015-3

  • Online ISBN: 978-3-031-73016-0

  • eBook Packages: Computer Science, Computer Science (R0)
