DOI: 10.1145/3397271.3401151
Research Article

Tree-Augmented Cross-Modal Encoding for Complex-Query Video Retrieval

Published: 25 July 2020

  • Abstract

    The rapid growth of user-generated videos on the Internet has intensified the need for text-based video retrieval. Traditional methods mainly follow the concept-based paradigm, which handles simple queries but is usually ineffective for complex queries that carry far richer semantics. Recently, the embedding-based paradigm has emerged as a popular alternative: it maps queries and videos into a shared embedding space in which semantically similar texts and videos lie close to each other. Despite its simplicity, this paradigm forgoes the syntactic structure of text queries, making it suboptimal for modeling complex queries.
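    The embedding-based paradigm described above reduces retrieval to nearest-neighbor search in a shared space. A minimal sketch, assuming query and video embeddings have already been produced by some pair of encoders (all names here are illustrative, not from the paper):

    ```python
    import numpy as np

    def cosine(a, b):
        """Cosine similarity between two embedding vectors."""
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def rank_videos(query_emb, video_embs):
        """Rank candidate videos by cosine similarity to the query
        in the shared embedding space (highest similarity first)."""
        sims = [cosine(query_emb, v) for v in video_embs]
        order = sorted(range(len(sims)), key=lambda i: -sims[i])
        return order, sims
    ```

    If one candidate's embedding coincides with the query embedding, it ranks first with similarity 1; training then amounts to pulling matching text-video pairs together and pushing mismatched pairs apart in this space.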
    To facilitate video retrieval with complex queries, we propose a Tree-augmented Cross-modal Encoding method that jointly learns the linguistic structure of queries and the temporal representation of videos. Specifically, given a complex user query, we first recursively compose a latent semantic tree that structurally describes the query. We then design a tree-augmented query encoder to derive a structure-aware query representation, and a temporal attentive video encoder to model the temporal characteristics of videos. Finally, both the query and the videos are mapped into a joint embedding space for matching and ranking. This approach yields a better understanding and modeling of complex queries and, in turn, better video retrieval performance. Extensive experiments on large-scale video retrieval benchmark datasets demonstrate the effectiveness of our approach.
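    The three components above (latent tree composition, temporal attentive video encoding, joint-space matching) can be sketched as follows. This is a simplified stand-in, not the paper's implementation: `compose`, `merge_score`, and the weight vectors `w_str` and `w_att` are hypothetical placeholders for the learned Tree-LSTM-style composition cell, structure scorer, and attention parameters, operating on pre-extracted word and frame features.

    ```python
    import numpy as np

    def compose(a, b):
        # Hypothetical stand-in for the learned tree composition cell.
        return np.tanh((a + b) / 2.0)

    def merge_score(a, b, w_str):
        # Hypothetical stand-in for the learned structure scorer that
        # decides which adjacent pair of nodes to compose next.
        return float(w_str @ np.tanh(a + b))

    def tree_encode_query(word_vecs, w_str):
        """Recursively compose a latent binary tree over word embeddings:
        greedily merge the highest-scoring adjacent pair until a single
        structure-aware query vector remains."""
        nodes = [np.asarray(v, dtype=float) for v in word_vecs]
        while len(nodes) > 1:
            i = max(range(len(nodes) - 1),
                    key=lambda j: merge_score(nodes[j], nodes[j + 1], w_str))
            nodes[i:i + 2] = [compose(nodes[i], nodes[i + 1])]
        return nodes[0]

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def attentive_video_encode(frame_feats, w_att):
        """Temporal attentive pooling: weight frame features (T, D) by
        softmaxed attention scores and sum them into one video vector."""
        weights = softmax(frame_feats @ w_att)   # (T,)
        return weights @ frame_feats             # (D,)

    def match(query_vec, video_vec):
        """Cosine similarity in the joint embedding space."""
        return float(query_vec @ video_vec /
                     (np.linalg.norm(query_vec) * np.linalg.norm(video_vec) + 1e-8))
    ```

    In the actual model all of these pieces are trained end to end with a ranking loss; the sketch only shows how a structure-aware query vector and an attention-pooled video vector meet in one space for scoring.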




      Published In

      SIGIR '20: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval
      July 2020, 2548 pages
      ISBN: 9781450380164
      DOI: 10.1145/3397271

      Publisher

      Association for Computing Machinery, New York, NY, United States


      Author Tags

      1. latent tree structure
      2. multimedia retrieval
      3. natural language understanding
      4. video search

      Conference

      SIGIR '20

      Acceptance Rates

      Overall acceptance rate: 792 of 3,983 submissions, 20%

      Cited By

      • (2024) Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement. Proceedings of the 2024 International Conference on Multimedia Retrieval, 394-403. DOI: 10.1145/3652583.3658088. Online publication date: 30 May 2024.
      • (2024) An empirical study of excitation and aggregation design adaptions in CLIP4Clip for video-text retrieval. Neurocomputing 596, 127905. DOI: 10.1016/j.neucom.2024.127905. Online publication date: September 2024.
      • (2024) Confidence-diffusion instance contrastive learning for unsupervised domain adaptation. Knowledge-Based Systems 293, 111717. DOI: 10.1016/j.knosys.2024.111717. Online publication date: June 2024.
      • (2024) DI-VTR: Dual Inter-modal Interaction Model for Video-Text Retrieval. Journal of Information and Intelligence. DOI: 10.1016/j.jiixd.2024.03.003. Online publication date: March 2024.
      • (2024) Improving semantic video retrieval models by training with a relevance-aware online mining strategy. Computer Vision and Image Understanding 245, 104035. DOI: 10.1016/j.cviu.2024.104035. Online publication date: August 2024.
      • (2024) VLG: General Video Recognition with Web Textual Knowledge. International Journal of Computer Vision. DOI: 10.1007/s11263-024-02081-z. Online publication date: 25 May 2024.
      • (2023) Visual-linguistic-stylistic Triple Reward for Cross-lingual Image Captioning. ACM Transactions on Multimedia Computing, Communications, and Applications 20(4), 1-23. DOI: 10.1145/3634917. Online publication date: 28 November 2023.
      • (2023) Transform-Equivariant Consistency Learning for Temporal Sentence Grounding. ACM Transactions on Multimedia Computing, Communications, and Applications 20(4), 1-19. DOI: 10.1145/3634749. Online publication date: 27 November 2023.
      • (2023) Improving Causality in Interpretable Video Retrieval. Proceedings of the 20th International Conference on Content-based Multimedia Indexing, 249-255. DOI: 10.1145/3617233.3617269. Online publication date: 20 September 2023.
      • (2023) Transformer-Based Visual Grounding with Cross-Modality Interaction. ACM Transactions on Multimedia Computing, Communications, and Applications 19(6), 1-19. DOI: 10.1145/3587251. Online publication date: 9 March 2023.
