
Text-Conditioned Resampler For Long Form Video Understanding

  • Conference paper
Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15144)


Abstract

In this paper, we present a text-conditioned video resampler (TCR) module that uses a pre-trained and frozen visual encoder and large language model (LLM) to process long video sequences for a task. TCR localises relevant visual features from the video given a text condition and provides them to an LLM to generate a text response. Due to its lightweight design and use of cross-attention, TCR can process more than 100 frames at a time with plain attention and without optimised implementations. We make the following contributions: (i) we design a transformer-based sampling architecture that can process long videos conditioned on a task, together with a training method that enables it to bridge pre-trained visual and language models; (ii) we identify tasks that could benefit from longer video perception; and (iii) we empirically validate its efficacy on a wide variety of evaluation tasks including NextQA, EgoSchema, and the EGO4D-LTA challenge.
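
To make the architecture concrete, the following is a minimal sketch, not the authors' implementation, of how such a text-conditioned resampler can be wired up: a small set of learned query tokens cross-attends over the frozen per-frame visual features together with the embedded text condition, producing a fixed-size set of tokens that is then handed to the frozen LLM. All module names, dimensions, and layer counts below are illustrative assumptions.

import torch
import torch.nn as nn


class TextConditionedResampler(nn.Module):
    """Sketch of a text-conditioned resampler: learned queries cross-attend over
    (frozen) frame features and text-condition embeddings. Illustrative only;
    hyper-parameters are assumptions, not the paper's values."""

    def __init__(self, dim=768, num_queries=64, num_heads=8, num_layers=2):
        super().__init__()
        # Learned latent queries that will summarise the video for the LLM.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.layers = nn.ModuleList([
            nn.ModuleDict({
                "cross_attn": nn.MultiheadAttention(dim, num_heads, batch_first=True),
                "norm1": nn.LayerNorm(dim),
                "ffn": nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)),
                "norm2": nn.LayerNorm(dim),
            })
            for _ in range(num_layers)
        ])

    def forward(self, frame_feats, text_feats):
        # frame_feats: (B, T*P, dim) tokens from a frozen visual encoder over many frames
        # text_feats:  (B, L, dim)   embedded text condition (task prompt / question)
        batch = frame_feats.size(0)
        context = torch.cat([frame_feats, text_feats], dim=1)  # condition on video and text
        x = self.queries.unsqueeze(0).expand(batch, -1, -1)
        for layer in self.layers:
            attn_out, _ = layer["cross_attn"](x, context, context)  # queries attend to context
            x = layer["norm1"](x + attn_out)
            x = layer["norm2"](x + layer["ffn"](x))
        return x  # (B, num_queries, dim): fixed-size token set passed to the frozen LLM

Because the number of learned queries is fixed, the attention cost grows only linearly with the number of frame tokens, which is consistent with the claim that well over 100 frames can be processed at once with plain attention.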

Author information


Corresponding author

Correspondence to Bruno Korbar.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 447 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Korbar, B., Xian, Y., Tonioni, A., Zisserman, A., Tombari, F. (2025). Text-Conditioned Resampler For Long Form Video Understanding. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15144. Springer, Cham. https://doi.org/10.1007/978-3-031-73016-0_16

  • DOI: https://doi.org/10.1007/978-3-031-73016-0_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73015-3

  • Online ISBN: 978-3-031-73016-0

  • eBook Packages: Computer Science, Computer Science (R0)
