DOI: 10.1145/3487553.3524207
Short paper
Open access

Multi-task Ranking with User Behaviors for Text-video Search

Published: 16 August 2022
  Abstract

    Text-video search has become an important demand in many industrial video-sharing platforms, e.g., YouTube, TikTok, and WeChat Channels, and has therefore attracted increasing research attention. Traditional relevance-based ranking methods for text-video search concentrate on exploiting the semantic relevance between the video and the query. However, relevance is no longer the principal issue in the ranking stage, because the candidate items retrieved in the matching stage already guarantee adequate relevance. Instead, we argue that boosting user satisfaction should be the ultimate goal of ranking, and that it is promising to mine cheap and abundant user behaviors for model training. To achieve this goal, we propose an effective Multi-Task Ranking pipeline with User Behaviors (MTRUB) for text-video search. Specifically, to exploit the multi-modal data effectively, we put forward a Heterogeneous Multi-modal Fusion Module (HMFM) that fuses the query and video features of different modalities in adaptive ways. Besides that, we design an Independent Multi-modal Input Scheme (IMIS) to alleviate the problem of competing task correlations in multi-task learning. Experiments on an offline dataset gathered from WeChat Search demonstrate that MTRUB outperforms the baseline by 12.0% in mean gAUC and 13.3% in mean nDCG@10. We also conduct live experiments on a large-scale mobile search engine, i.e., WeChat Search, where MTRUB obtains substantial improvements over the traditional relevance-based ranking model.
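
    To make the abstract's architecture concrete, the following is a minimal, hypothetical PyTorch sketch of the two ideas it names: an adaptive, gate-weighted fusion of query/video features from several modalities (a stand-in for HMFM) and a separate fusion-plus-tower path per user-behavior task (one plausible reading of IMIS). The modalities, task names (click, finish-play), dimensions, and gating form are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical sketch (not the authors' code): multi-task ranking over fused
# text-video features, loosely following the abstract's HMFM / IMIS ideas.
import torch
import torch.nn as nn


class GatedFusion(nn.Module):
    """Adaptive fusion: weight each modality embedding with a softmax gate
    computed from the concatenation of all modality embeddings."""

    def __init__(self, num_modalities: int, dim: int):
        super().__init__()
        self.gate = nn.Linear(num_modalities * dim, num_modalities)

    def forward(self, feats):                 # feats: list of [B, dim] tensors
        stacked = torch.stack(feats, dim=1)   # [B, M, dim]
        gates = torch.softmax(self.gate(torch.cat(feats, dim=-1)), dim=-1)
        return (gates.unsqueeze(-1) * stacked).sum(dim=1)   # [B, dim]


class MultiTaskRanker(nn.Module):
    """Each behavior task gets its own fusion module over the raw modality
    inputs (one reading of the Independent Multi-modal Input Scheme) plus an
    MLP tower that outputs a logit for that behavior."""

    def __init__(self, dim: int = 256, tasks=("click", "finish_play")):
        super().__init__()
        self.tasks = tasks
        self.fusions = nn.ModuleDict({t: GatedFusion(3, dim) for t in tasks})
        self.towers = nn.ModuleDict({
            t: nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))
            for t in tasks
        })

    def forward(self, query_emb, title_emb, visual_emb):
        feats = [query_emb, title_emb, visual_emb]           # each [B, dim]
        return {t: self.towers[t](self.fusions[t](feats)).squeeze(-1)
                for t in self.tasks}


# Toy usage: each head is supervised with a binary user-behavior label.
model = MultiTaskRanker()
q, title, vis = (torch.randn(8, 256) for _ in range(3))
logits = model(q, title, vis)                                # dict of [8] tensors
loss = sum(
    nn.functional.binary_cross_entropy_with_logits(
        logits[t], torch.randint(0, 2, (8,)).float())
    for t in logits
)
loss.backward()
```

    At serving time the per-task logits would typically be combined into a single ranking score. For evaluation, gAUC is commonly computed as the impression-weighted average of per-user AUC, which is consistent with the "mean gAUC" reported above.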


    Cited By

    • (2024) A holistic view on positive and negative implicit feedback for micro-video recommendation. Knowledge-Based Systems 284:C. DOI: 10.1016/j.knosys.2023.111299. Online publication date: 17 April 2024.
    • (2024) Understanding user intent modeling for conversational recommender systems: a systematic literature review. User Modeling and User-Adapted Interaction. DOI: 10.1007/s11257-024-09398-x. Online publication date: 6 June 2024.

      Published In

      WWW '22: Companion Proceedings of the Web Conference 2022
      April 2022
      1338 pages
      ISBN: 9781450391306
      DOI: 10.1145/3487553
      This work is licensed under a Creative Commons Attribution 4.0 International License.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 16 August 2022

      Author Tags

      1. Multi-modal Fusion
      2. Multi-task Learning
      3. Ranking Model
      4. Text-video Search
      5. User Behaviors

      Qualifiers

      • Short-paper
      • Research
      • Refereed limited

      Conference

      WWW '22
      WWW '22: The ACM Web Conference 2022
      April 25 - 29, 2022
      Virtual Event, Lyon, France

      Acceptance Rates

      Overall acceptance rate: 1,899 of 8,196 submissions (23%)

      Article Metrics

      • Downloads (last 12 months): 138
      • Downloads (last 6 weeks): 17

      Reflects downloads up to 09 Aug 2024
