Deep Learning for Video Retrieval by Natural Language

X Li - Proceedings of the 1st International Workshop on …, 2019 - dl.acm.org
Proceedings of the 1st International Workshop on Fairness, Accountability …, 2019dl.acm.org
Videos are everywhere. Video retrieval, ie, finding videos that meet the information need of a
specific user, is important for a wide range of applications including communication,
education, entertainment, business, security etc. Among multiple ways of expressing the
information need, a natural-language text is the most intuitive to start a retrieval process. For
instance, to find video shots showing" a person in front of a blackboard talking or writing in a
classroom". Such a query can be submitted easily, by typing or speech recognition, to a …
Videos are everywhere. Video retrieval, i.e., finding videos that meet the information need of a specific user, is important for a wide range of applications including communication, education, entertainment, business, security etc. Among multiple ways of expressing the information need, a natural-language text is the most intuitive to start a retrieval process. For instance, to find video shots showing "a person in front of a blackboard talking or writing in a classroom". Such a query can be submitted easily, by typing or speech recognition, to a video retrieval system. Given a video as a sequence of frames and a query as a sequence of words, a fundamental problem in video retrieval by natural language is how to properly associate visual and linguistic information presented in sequential order. We attempt to address the fundamental problem by decomposing our quest along the following three dimensions: (1) Query representation, (2) Video representation, (3) Common space. The three dimensions also account for major designs in the state-of-the-art systems. We introduce a set of deep learning methods recently developed by our joint team of RUC, ZJGU, UvA and CAS. We evaluate the deep models on the TRECVID Ad-hoc Video Search (AVS) benchmark over the last three years (2016-2018). Much room exists for future research. Compared to video retrieval with semantic representations, deep learning approaches lack an intuitive explanation of the results obtained, in particular when the results are unsatisfactory. As the retrieval performance continues to improve, the accountability of a video retrieval model requires more research attention. While a well-performed deep model can be largely expected given adequate training data, novel algorithms that enable learning a video retrieval model from limited training resource are much in demand. Consider, for instance, visual annotation and retrieval for a target language other than English. Data and code used for this research are available at http://github.com/li-xirong/video-retrieval.
ACM Digital Library