DOI: 10.1145/3397271.3401151
Research Article

Tree-Augmented Cross-Modal Encoding for Complex-Query Video Retrieval

Published: 25 July 2020

  • Abstract

    The rapid growth of user-generated videos on the Internet has intensified the need for text-based video retrieval. Traditional methods mainly follow the concept-based paradigm, which handles simple queries but is usually ineffective for complex queries that carry far richer semantics. Recently, the embedding-based paradigm has emerged as a popular alternative: it maps queries and videos into a shared embedding space in which semantically similar texts and videos lie close to each other. Despite its simplicity, this paradigm forgoes the syntactic structure of text queries, making it suboptimal for modeling complex queries.
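    The embedding-based paradigm described above reduces retrieval to nearest-neighbor search in a shared space. A minimal sketch, assuming query and video embeddings have already been produced by some pair of encoders (all names here are illustrative, not from the paper):

    ```python
    import numpy as np

    def cosine(a, b):
        """Cosine similarity between two embedding vectors."""
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def rank_videos(query_emb, video_embs):
        """Rank candidate videos by cosine similarity to the query
        in the shared embedding space (highest similarity first)."""
        sims = [cosine(query_emb, v) for v in video_embs]
        order = sorted(range(len(sims)), key=lambda i: -sims[i])
        return order, sims
    ```

    If one candidate's embedding coincides with the query embedding, it ranks first with similarity 1; training then amounts to pulling matching text-video pairs together and pushing mismatched pairs apart in this space.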
    To facilitate video retrieval with complex queries, we propose a Tree-augmented Cross-modal Encoding method that jointly learns the linguistic structure of queries and the temporal representation of videos. Specifically, given a complex user query, we first recursively compose a latent semantic tree that structurally describes the query. We then design a tree-augmented query encoder to derive a structure-aware query representation, and a temporal attentive video encoder to model the temporal characteristics of videos. Finally, both the query and the videos are mapped into a joint embedding space for matching and ranking. This approach yields a better understanding and modeling of complex queries and, in turn, better video retrieval performance. Extensive experiments on large-scale video retrieval benchmark datasets demonstrate the effectiveness of our approach.
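    The three components above (latent tree composition, temporal attentive video encoding, joint-space matching) can be sketched as follows. This is a simplified stand-in, not the paper's implementation: `compose`, `merge_score`, and the weight vectors `w_str` and `w_att` are hypothetical placeholders for the learned Tree-LSTM-style composition cell, structure scorer, and attention parameters, operating on pre-extracted word and frame features.

    ```python
    import numpy as np

    def compose(a, b):
        # Hypothetical stand-in for the learned tree composition cell.
        return np.tanh((a + b) / 2.0)

    def merge_score(a, b, w_str):
        # Hypothetical stand-in for the learned structure scorer that
        # decides which adjacent pair of nodes to compose next.
        return float(w_str @ np.tanh(a + b))

    def tree_encode_query(word_vecs, w_str):
        """Recursively compose a latent binary tree over word embeddings:
        greedily merge the highest-scoring adjacent pair until a single
        structure-aware query vector remains."""
        nodes = [np.asarray(v, dtype=float) for v in word_vecs]
        while len(nodes) > 1:
            i = max(range(len(nodes) - 1),
                    key=lambda j: merge_score(nodes[j], nodes[j + 1], w_str))
            nodes[i:i + 2] = [compose(nodes[i], nodes[i + 1])]
        return nodes[0]

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def attentive_video_encode(frame_feats, w_att):
        """Temporal attentive pooling: weight frame features (T, D) by
        softmaxed attention scores and sum them into one video vector."""
        weights = softmax(frame_feats @ w_att)   # (T,)
        return weights @ frame_feats             # (D,)

    def match(query_vec, video_vec):
        """Cosine similarity in the joint embedding space."""
        return float(query_vec @ video_vec /
                     (np.linalg.norm(query_vec) * np.linalg.norm(video_vec) + 1e-8))
    ```

    In the actual model all of these pieces are trained end to end with a ranking loss; the sketch only shows how a structure-aware query vector and an attention-pooled video vector meet in one space for scoring.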




      Published In

      SIGIR '20: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval
      July 2020, 2548 pages
      ISBN: 9781450380164
      DOI: 10.1145/3397271

      Publisher

      Association for Computing Machinery, New York, NY, United States


      Author Tags

      1. latent tree structure
      2. multimedia retrieval
      3. natural language understanding
      4. video search

      Conference

      SIGIR '20

      Acceptance Rates

      Overall acceptance rate: 792 of 3,983 submissions, 20%

      Cited By

      • (2024) Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement. Proceedings of the 2024 International Conference on Multimedia Retrieval, 394-403. DOI: 10.1145/3652583.3658088. Online publication date: 30 May 2024.
      • (2024) An empirical study of excitation and aggregation design adaptions in CLIP4Clip for video-text retrieval. Neurocomputing 596, 127905. DOI: 10.1016/j.neucom.2024.127905. Online publication date: September 2024.
      • (2024) Confidence-diffusion instance contrastive learning for unsupervised domain adaptation. Knowledge-Based Systems 293, 111717. DOI: 10.1016/j.knosys.2024.111717. Online publication date: June 2024.
      • (2024) DI-VTR: Dual Inter-modal Interaction Model for Video-Text Retrieval. Journal of Information and Intelligence. DOI: 10.1016/j.jiixd.2024.03.003. Online publication date: March 2024.
      • (2024) Improving semantic video retrieval models by training with a relevance-aware online mining strategy. Computer Vision and Image Understanding 245, 104035. DOI: 10.1016/j.cviu.2024.104035. Online publication date: August 2024.
      • (2024) VLG: General Video Recognition with Web Textual Knowledge. International Journal of Computer Vision. DOI: 10.1007/s11263-024-02081-z. Online publication date: 25 May 2024.
      • (2023) Visual-linguistic-stylistic Triple Reward for Cross-lingual Image Captioning. ACM Transactions on Multimedia Computing, Communications, and Applications 20(4), 1-23. DOI: 10.1145/3634917. Online publication date: 28 November 2023.
      • (2023) Transform-Equivariant Consistency Learning for Temporal Sentence Grounding. ACM Transactions on Multimedia Computing, Communications, and Applications 20(4), 1-19. DOI: 10.1145/3634749. Online publication date: 27 November 2023.
      • (2023) Improving Causality in Interpretable Video Retrieval. Proceedings of the 20th International Conference on Content-based Multimedia Indexing, 249-255. DOI: 10.1145/3617233.3617269. Online publication date: 20 September 2023.
      • (2023) Transformer-Based Visual Grounding with Cross-Modality Interaction. ACM Transactions on Multimedia Computing, Communications, and Applications 19(6), 1-19. DOI: 10.1145/3587251. Online publication date: 9 March 2023.
