Moment is Important: Language-Based Video Moment Retrieval via Adversarial Learning

Published: 16 February 2022


The emerging task of language-based video moment retrieval aims to retrieve a target moment from an untrimmed video given a natural language query. Compared with traditional whole-video retrieval, it is more applicable in practice because it localizes a specific moment within the video. In this work, we propose a novel solution that investigates language-based video moment retrieval under adversarial learning. The key idea is to formulate the task as an adversarial learning problem with two tightly connected components. Specifically, reinforcement learning is employed as the generator to produce a set of candidate video moments. Meanwhile, multi-task learning is used as the discriminator, which integrates inter-modal and intra-modal tasks in a unified framework through a sequential update strategy. Finally, the generator and the discriminator reinforce each other during adversarial learning, jointly optimizing both video moment ranking and video moment localization. Extensive experiments on two challenging benchmarks, Charades-STA and TACoS, demonstrate the effectiveness and rationality of the proposed solution. On the larger and unbiased datasets ActivityNet Captions and ActivityNet-CD, the proposed framework also exhibits excellent robustness.
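As a rough illustration of the generator side of such a setup, the sketch below samples candidate moments from a softmax policy and updates that policy with a REINFORCE-style step driven by a score for each sample. It is not the authors' implementation. The candidate spans, the target span, the function names, and the hyperparameters are all hypothetical, and the learned multi-task discriminator is replaced here by a fixed stand-in score (temporal IoU against a known target), which the abstract does not specify.

```python
import math
import random


def temporal_iou(a, b):
    """Temporal intersection-over-union of two (start, end) spans."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_toy_generator(seed=0, steps=500, k=8, lr=0.5):
    """Toy generator loop; returns the final policy over the candidates."""
    rng = random.Random(seed)
    # Hypothetical candidate moments and target, in seconds.
    candidates = [(0, 10), (5, 15), (10, 20), (15, 25)]
    target = (6, 14)
    logits = [0.0] * len(candidates)  # uniform initial policy
    for _ in range(steps):
        probs = softmax(logits)
        # Generator: sample k candidate moments from the current policy.
        sampled = rng.choices(range(len(candidates)), weights=probs, k=k)
        # Assumption: a fixed IoU score stands in for the learned discriminator.
        rewards = [temporal_iou(candidates[i], target) for i in sampled]
        baseline = sum(rewards) / k  # mean reward as a variance-reducing baseline
        for i, r in zip(sampled, rewards):
            advantage = r - baseline
            for j in range(len(logits)):
                # d log pi(i) / d logit_j = 1[j == i] - probs[j]
                grad = (1.0 if j == i else 0.0) - probs[j]
                logits[j] += lr * advantage * grad / k
    return softmax(logits)
```

Calling `train_toy_generator()` returns a probability for each candidate span. Under these toy assumptions the policy tends to concentrate on the candidate that overlaps the target most. In a full adversarial setup the score would come from a discriminator trained alongside the generator, not from a fixed function.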


Information & Contributors


Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 18, Issue 2
May 2022
494 pages


Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 February 2022
Accepted: 01 July 2021
Revised: 01 June 2021
Received: 01 January 2021
Published in TOMM Volume 18, Issue 2


Author Tags

  1. Video moment retrieval
  2. cross-modal retrieval
  3. adversarial learning
  4. reinforcement learning
  5. continual multi-task learning


  • Research-article
  • Refereed

Funding Sources

  • National Natural Science Foundation of China
  • Natural Science Foundation of Hunan Province
  • National Key Research and Development Project of China
  • Science and Technology Key Projects of Hunan Province
  • Special Funds for the Construction of Innovative Provinces in Hunan Province of China
  • Science and Technology Project of Changsha City
  • Fundamental Research Funds for the Central Universities


