Meta Self-Paced Learning for Cross-Modal Matching

Published: 17 October 2021

Abstract

Cross-modal matching has attracted growing attention due to the rapid emergence of multimedia data on the web and in social applications. Recently, many re-weighting methods have been proposed to accelerate model training by designing a mapping function from similarity scores to weights. However, these re-weighting methods are difficult to apply universally in practice, since manually pre-set weighting functions inevitably involve hyper-parameters. In this paper, we propose a Meta Self-Paced Network (Meta-SPN) that automatically learns a weighting scheme from data for cross-modal matching. Specifically, a meta self-paced network composed of a fully connected neural network is designed to fit the weight function; it takes the similarity scores of sample pairs as input and outputs the corresponding weight values. Our meta self-paced network considers not only the self-similarity scores but also their potential interactions (e.g., relative similarity) when learning the weights. Motivated by the success of meta-learning, we use the validation set to update the meta self-paced network during the training of the matching network. Experiments on two image-text matching benchmarks and two video-text matching benchmarks demonstrate the generalization ability and effectiveness of our method.
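
As an illustration only (not the authors' code), the PyTorch sketch below shows one way the two pieces described above could fit together: a small fully connected network that maps per-pair similarity features to weights, and a meta step that updates that network through a virtual update of the matching model using an unweighted validation loss. The toy matching network, the hinge-style pair loss, the two-feature input (a raw similarity plus a crude relative-similarity term), and all dimensions and learning rates are assumptions made for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.func import functional_call  # PyTorch >= 2.0


class MetaSPN(nn.Module):
    """Weight-generating network: maps per-pair similarity features to a
    weight in (0, 1). The 2-d input (self-similarity plus a relative term)
    is an assumption based on the abstract, not the paper's exact design."""

    def __init__(self, in_dim: int = 2, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, feats):
        return self.net(feats).squeeze(-1)


class MatchNet(nn.Module):
    """Toy matching network: projects image/text features into a shared
    space and scores each pair by cosine similarity."""

    def __init__(self, img_dim: int = 512, txt_dim: int = 300, emb: int = 128):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, emb)
        self.txt_proj = nn.Linear(txt_dim, emb)

    def forward(self, img, txt):
        a = F.normalize(self.img_proj(img), dim=-1)
        b = F.normalize(self.txt_proj(txt), dim=-1)
        return (a * b).sum(-1)  # per-pair cosine similarity


def pair_loss(sim, label, margin: float = 0.2):
    # hinge-style per-pair loss: pull positives toward 1, push negatives
    # below the margin (an illustrative stand-in for the matching loss)
    return label * (1.0 - sim) + (1.0 - label) * torch.clamp(sim - margin, min=0.0)


def sim_features(sim):
    # self-similarity plus the gap to the batch mean, a crude proxy for the
    # "relative-similarity" interactions mentioned in the abstract
    return torch.stack([sim, sim - sim.mean()], dim=-1)


def meta_step(match_net, meta_net, meta_opt, train_batch, val_batch, lr_inner=0.1):
    img, txt, y = train_batch
    sim = match_net(img, txt)
    w = meta_net(sim_features(sim).detach())      # weights depend only on meta net
    inner_loss = (w * pair_loss(sim, y)).mean()   # weighted matching loss

    # virtual SGD step on the matching net; create_graph=True keeps the
    # path from the fast weights back to the meta net's parameters
    names, params = zip(*match_net.named_parameters())
    grads = torch.autograd.grad(inner_loss, params, create_graph=True)
    fast = {n: p - lr_inner * g for n, p, g in zip(names, params, grads)}

    # unweighted validation loss under the virtual parameters
    v_img, v_txt, v_y = val_batch
    v_sim = functional_call(match_net, fast, (v_img, v_txt))
    val_loss = pair_loss(v_sim, v_y).mean()

    # update only the meta net through the virtual step
    meta_opt.zero_grad()
    val_loss.backward()
    meta_opt.step()
    match_net.zero_grad()  # drop second-order grads left on the matching net
    return val_loss.item()


match_net, meta_net = MatchNet(), MetaSPN()
meta_opt = torch.optim.Adam(meta_net.parameters(), lr=1e-3)
# toy batches: (image feats, text feats, pair labels in {0, 1})
make = lambda n: (torch.randn(n, 512), torch.randn(n, 300),
                  torch.randint(0, 2, (n,)).float())
meta_step(match_net, meta_net, meta_opt, make(32), make(32))
```

A full training loop would alternate this meta step with an ordinary weighted update of the matching network using the freshly updated weight network, mirroring the validation-driven scheme the abstract describes.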


    Published In

    MM '21: Proceedings of the 29th ACM International Conference on Multimedia
    October 2021
    5796 pages
    ISBN:9781450386517
    DOI:10.1145/3474085

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 17 October 2021


    Author Tags

    1. cross-modal matching
    2. deep metric learning
    3. meta self-paced network

    Qualifiers

    • Research-article

    Conference

    MM '21
    Sponsor: MM '21: ACM Multimedia Conference
    October 20 - 24, 2021
    Virtual Event, China

    Acceptance Rates

    Overall Acceptance Rate 995 of 4,171 submissions, 24%


    Article Metrics

    • Downloads (Last 12 months): 96
    • Downloads (Last 6 weeks): 11
    Reflects downloads up to 21 Sep 2024


    Cited By

    • (2024) Image–Text Cross-Modal Retrieval with Instance Contrastive Embedding. Electronics 13:2 (300). DOI: 10.3390/electronics13020300. Online publication date: 9-Jan-2024.
    • (2024) Complex Relation Embedding for Scene Graph Generation. IEEE Transactions on Neural Networks and Learning Systems 35:6 (8321-8335). DOI: 10.1109/TNNLS.2022.3226871. Online publication date: Jun-2024.
    • (2024) Runge-Kutta Guided Feature Augmentation for Few-Sample Learning. IEEE Transactions on Multimedia 26 (7349-7358). DOI: 10.1109/TMM.2024.3366404. Online publication date: 15-Feb-2024.
    • (2024) Integrating listwise ranking into pairwise-based image-text retrieval. Knowledge-Based Systems 287:C. DOI: 10.1016/j.knosys.2024.111431. Online publication date: 16-May-2024.
    • (2023) Revisiting Hard Negative Mining in Contrastive Learning for Visual Understanding. Electronics 12:23 (4884). DOI: 10.3390/electronics12234884. Online publication date: 4-Dec-2023.
    • (2023) Open-Scenario Domain Adaptive Object Detection in Autonomous Driving. Proceedings of the 31st ACM International Conference on Multimedia (8453-8462). DOI: 10.1145/3581783.3611854. Online publication date: 26-Oct-2023.
    • (2023) HGAN: Hierarchical Graph Alignment Network for Image-Text Retrieval. IEEE Transactions on Multimedia 25 (9189-9202). DOI: 10.1109/TMM.2023.3248160. Online publication date: 23-Feb-2023.
    • (2023) Confidence-Aware Active Feedback for Interactive Instance Search. IEEE Transactions on Multimedia 25 (7173-7184). DOI: 10.1109/TMM.2022.3217965. Online publication date: 1-Jan-2023.
    • (2023) Less is Better: Exponential Loss for Cross-Modal Matching. IEEE Transactions on Circuits and Systems for Video Technology 33:9 (5271-5280). DOI: 10.1109/TCSVT.2023.3249754. Online publication date: 1-Sep-2023.
    • (2023) Multilateral Semantic Relations Modeling for Image Text Retrieval. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2830-2839). DOI: 10.1109/CVPR52729.2023.00277. Online publication date: Jun-2023.
    • Show More Cited By
