Omnipotent Distillation with LLMs for Weakly-Supervised Natural Language Video Localization: When Divergence Meets Consistency
DOI:
https://doi.org/10.1609/aaai.v38i2.27832
Keywords:
CV: Language and Vision, CV: Multi-modal Vision, CV: Video Understanding & Activity Analysis, NLP: Language Grounding & Multi-modal NLP
Abstract
Natural language video localization plays a pivotal role in video understanding, and leveraging weakly-labeled data is considered a promising approach to circumvent the labor-intensive process of manual annotation. However, this approach encounters two significant challenges: 1) limited input distribution, namely that the limited writing styles of the language queries written by human annotators hinder the model’s generalization to real-world scenarios with diverse vocabularies and sentence structures; 2) incomplete ground truth, which provides insufficient supervision. To overcome these challenges, we propose an omnipotent distillation algorithm with large language models (LLMs). The distribution of the input sample is enriched to obtain diverse multi-view versions, while a consistency constraint regularizes their results for distillation. Specifically, we first train our teacher model with the proposed intra-model agreement, where multiple sub-models are supervised by each other. Then, we leverage the LLM to paraphrase the language query and distill the teacher model into a lightweight student model by enforcing consistency between the localization results of the paraphrased sentence and those of the original one. In addition, to assess the generalization of the model across different dimensions of language variation, we create extensive datasets by building upon existing datasets. Our experiments demonstrate substantial performance improvements across diverse kinds of language queries.
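The consistency idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify the loss used in the paper, so everything here is an assumption made for illustration: localization results are represented as (start, end) segments in seconds, disagreement is measured as one minus temporal IoU, and the names `temporal_iou` and `consistency_loss` are hypothetical.

```python
from typing import List, Tuple

# Assumption: a localization result is a (start, end) pair in seconds.
Segment = Tuple[float, float]


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two temporal segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def consistency_loss(teacher_segment: Segment,
                     student_segments: List[Segment]) -> float:
    """Mean (1 - IoU) between the teacher's segment for the original query
    and the student's segments for each paraphrased query.

    Illustrative stand-in only; not the loss defined in the paper.
    """
    if not student_segments:
        return 0.0
    total = sum(1.0 - temporal_iou(teacher_segment, s) for s in student_segments)
    return total / len(student_segments)
```

Under these assumptions, the loss is zero when the student localizes every paraphrase to the same segment the teacher predicts for the original query, and it grows as the segments drift apart.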
Published
2024-03-24
How to Cite
Bao, P., Shao, Z., Yang, W., Ng, B. P., Er, M. H., & Kot, A. C. (2024). Omnipotent Distillation with LLMs for Weakly-Supervised Natural Language Video Localization: When Divergence Meets Consistency. Proceedings of the AAAI Conference on Artificial Intelligence, 38(2), 747-755. https://doi.org/10.1609/aaai.v38i2.27832
Issue
Section
AAAI Technical Track on Computer Vision I