Abstract
In this paper we present a text-conditioned video resampler (TCR) module that uses a pre-trained and frozen visual encoder and large language model (LLM) to process long video sequences for a given task. TCR localises relevant visual features from the video given a text condition and passes them to an LLM, which generates a text response. Due to its lightweight design and use of cross-attention, TCR can process more than 100 frames at a time with plain attention and without optimised implementations. We make the following contributions: (i) we design a transformer-based sampling architecture that can process long videos conditioned on a task, together with a training method that enables it to bridge pre-trained visual and language models; (ii) we identify tasks that could benefit from longer video perception; and (iii) we empirically validate its efficacy on a wide variety of evaluation tasks including NextQA, EgoSchema, and the EGO4D-LTA challenge.
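The abstract describes the TCR architecture only at a high level. As a rough illustration of the general idea, the sketch below shows a text-conditioned resampler in PyTorch: a fixed set of learned query tokens cross-attends to frozen per-frame visual features concatenated with embedded text-condition tokens, so the text condition determines which parts of a long video are retained, and the resulting fixed-length token set is projected to the language model's embedding width. This is not the authors' implementation; the module name, dimensions, number of queries, and layer layout are all illustrative assumptions.

```python
# Minimal illustrative sketch of a text-conditioned resampler (all sizes,
# names, and the layer layout are assumptions, not the paper's implementation).
import torch
import torch.nn as nn


class TextConditionedResampler(nn.Module):
    def __init__(self, vis_dim=1024, txt_dim=768, llm_dim=4096,
                 hidden_dim=768, num_queries=96, num_layers=4, num_heads=12):
        super().__init__()
        # Learned query tokens that will summarise the (long) video.
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim) * 0.02)
        self.vis_proj = nn.Linear(vis_dim, hidden_dim)   # frozen visual-encoder features -> resampler width
        self.txt_proj = nn.Linear(txt_dim, hidden_dim)   # text-condition embeddings -> resampler width
        self.layers = nn.ModuleList([
            nn.ModuleDict({
                "cross_attn": nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True),
                "self_attn": nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True),
                "ffn": nn.Sequential(nn.Linear(hidden_dim, 4 * hidden_dim),
                                     nn.GELU(),
                                     nn.Linear(4 * hidden_dim, hidden_dim)),
                "norm1": nn.LayerNorm(hidden_dim),
                "norm2": nn.LayerNorm(hidden_dim),
                "norm3": nn.LayerNorm(hidden_dim),
            })
            for _ in range(num_layers)
        ])
        self.out_proj = nn.Linear(hidden_dim, llm_dim)   # fixed-size token set handed to the frozen LLM

    def forward(self, frame_feats, text_feats):
        # frame_feats: (B, T * P, vis_dim) patch features for many frames (e.g. 100+)
        # text_feats:  (B, L, txt_dim) embedded task / text condition
        B = frame_feats.shape[0]
        kv = torch.cat([self.vis_proj(frame_feats), self.txt_proj(text_feats)], dim=1)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        for blk in self.layers:
            # Queries cross-attend to (visual + text) tokens: the text condition
            # steers which visual content the fixed query set picks up.
            q = q + blk["cross_attn"](blk["norm1"](q), kv, kv, need_weights=False)[0]
            qn = blk["norm2"](q)
            q = q + blk["self_attn"](qn, qn, qn, need_weights=False)[0]
            q = q + blk["ffn"](blk["norm3"](q))
        return self.out_proj(q)  # (B, num_queries, llm_dim)


if __name__ == "__main__":
    # e.g. 128 frames x 16 patch tokens from a frozen visual encoder, plus a short text condition
    resampler = TextConditionedResampler()
    frames = torch.randn(1, 128 * 16, 1024)
    text = torch.randn(1, 12, 768)
    print(resampler(frames, text).shape)  # torch.Size([1, 96, 4096])
```

Because the number of query tokens is fixed, the cost of the cross-attention grows only linearly with the number of input frames, which is why such a resampler can handle 100+ frames with plain (unoptimised) attention before handing a compact token set to the LLM.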
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Korbar, B., Xian, Y., Tonioni, A., Zisserman, A., Tombari, F. (2025). Text-Conditioned Resampler For Long Form Video Understanding. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15144. Springer, Cham. https://doi.org/10.1007/978-3-031-73016-0_16
DOI: https://doi.org/10.1007/978-3-031-73016-0_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73015-3
Online ISBN: 978-3-031-73016-0
eBook Packages: Computer Science, Computer Science (R0)