Meta Self-Paced Learning for Cross-Modal Matching

Published: 17 October 2021

Abstract

Cross-modal matching has attracted growing attention due to the rapid emergence of multimedia data on the web and in social applications. Recently, many re-weighting methods have been proposed to accelerate model training by designing a mapping function from similarity scores to weights. However, these re-weighting methods are difficult to apply universally in practice, since manually pre-set weighting functions inevitably involve hyper-parameters. In this paper, we propose a Meta Self-Paced Network (Meta-SPN) that automatically learns a weighting scheme from data for cross-modal matching. Specifically, a meta self-paced network composed of a fully connected neural network is designed to fit the weight function; it takes the similarity scores of sample pairs as input and outputs the corresponding weight values. Our meta self-paced network considers not only the self-similarity scores but also their potential interactions (e.g., relative similarity) when learning the weights. Motivated by the success of meta-learning, we use the validation set to update the meta self-paced network during the training of the matching network. Experiments on two image-text matching benchmarks and two video-text matching benchmarks demonstrate the generalization ability and effectiveness of our method.
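
As an illustration only (not the authors' code), the PyTorch sketch below shows one way the two pieces described above could fit together: a small fully connected network that maps per-pair similarity features to weights, and a meta step that updates that network through a virtual update of the matching model using an unweighted validation loss. The toy matching network, the hinge-style pair loss, the two-feature input (a raw similarity plus a crude relative-similarity term), and all dimensions and learning rates are assumptions made for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.func import functional_call  # PyTorch >= 2.0


class MetaSPN(nn.Module):
    """Weight-generating network: maps per-pair similarity features to a
    weight in (0, 1). The 2-d input (self-similarity plus a relative term)
    is an assumption based on the abstract, not the paper's exact design."""

    def __init__(self, in_dim: int = 2, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, feats):
        return self.net(feats).squeeze(-1)


class MatchNet(nn.Module):
    """Toy matching network: projects image/text features into a shared
    space and scores each pair by cosine similarity."""

    def __init__(self, img_dim: int = 512, txt_dim: int = 300, emb: int = 128):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, emb)
        self.txt_proj = nn.Linear(txt_dim, emb)

    def forward(self, img, txt):
        a = F.normalize(self.img_proj(img), dim=-1)
        b = F.normalize(self.txt_proj(txt), dim=-1)
        return (a * b).sum(-1)  # per-pair cosine similarity


def pair_loss(sim, label, margin: float = 0.2):
    # hinge-style per-pair loss: pull positives toward 1, push negatives
    # below the margin (an illustrative stand-in for the matching loss)
    return label * (1.0 - sim) + (1.0 - label) * torch.clamp(sim - margin, min=0.0)


def sim_features(sim):
    # self-similarity plus the gap to the batch mean, a crude proxy for the
    # "relative-similarity" interactions mentioned in the abstract
    return torch.stack([sim, sim - sim.mean()], dim=-1)


def meta_step(match_net, meta_net, meta_opt, train_batch, val_batch, lr_inner=0.1):
    img, txt, y = train_batch
    sim = match_net(img, txt)
    w = meta_net(sim_features(sim).detach())      # weights depend only on meta net
    inner_loss = (w * pair_loss(sim, y)).mean()   # weighted matching loss

    # virtual SGD step on the matching net; create_graph=True keeps the
    # path from the fast weights back to the meta net's parameters
    names, params = zip(*match_net.named_parameters())
    grads = torch.autograd.grad(inner_loss, params, create_graph=True)
    fast = {n: p - lr_inner * g for n, p, g in zip(names, params, grads)}

    # unweighted validation loss under the virtual parameters
    v_img, v_txt, v_y = val_batch
    v_sim = functional_call(match_net, fast, (v_img, v_txt))
    val_loss = pair_loss(v_sim, v_y).mean()

    # update only the meta net through the virtual step
    meta_opt.zero_grad()
    val_loss.backward()
    meta_opt.step()
    match_net.zero_grad()  # drop second-order grads left on the matching net
    return val_loss.item()


match_net, meta_net = MatchNet(), MetaSPN()
meta_opt = torch.optim.Adam(meta_net.parameters(), lr=1e-3)
# toy batches: (image feats, text feats, pair labels in {0, 1})
make = lambda n: (torch.randn(n, 512), torch.randn(n, 300),
                  torch.randint(0, 2, (n,)).float())
meta_step(match_net, meta_net, meta_opt, make(32), make(32))
```

A full training loop would alternate this meta step with an ordinary weighted update of the matching network using the freshly updated weight network, mirroring the validation-driven scheme the abstract describes.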


    Published In

    MM '21: Proceedings of the 29th ACM International Conference on Multimedia
    October 2021
    5796 pages
    ISBN:9781450386517
    DOI:10.1145/3474085

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 17 October 2021


    Author Tags

    1. cross-modal matching
    2. deep metric learning
    3. meta self-paced network

    Qualifiers

    • Research-article

    Conference

    MM '21
    Sponsor: MM '21: ACM Multimedia Conference
    October 20 - 24, 2021
    Virtual Event, China

    Acceptance Rates

    Overall Acceptance Rate 995 of 4,171 submissions, 24%


    Article Metrics

    • Downloads (Last 12 months): 96
    • Downloads (Last 6 weeks): 11
    Reflects downloads up to 21 Sep 2024


    Cited By

    • (2024) Image–Text Cross-Modal Retrieval with Instance Contrastive Embedding. Electronics 13:2 (300). DOI: 10.3390/electronics13020300. Online publication date: 9-Jan-2024.
    • (2024) Complex Relation Embedding for Scene Graph Generation. IEEE Transactions on Neural Networks and Learning Systems 35:6 (8321-8335). DOI: 10.1109/TNNLS.2022.3226871. Online publication date: Jun-2024.
    • (2024) Runge-Kutta Guided Feature Augmentation for Few-Sample Learning. IEEE Transactions on Multimedia 26 (7349-7358). DOI: 10.1109/TMM.2024.3366404. Online publication date: 15-Feb-2024.
    • (2024) Integrating listwise ranking into pairwise-based image-text retrieval. Knowledge-Based Systems 287:C. DOI: 10.1016/j.knosys.2024.111431. Online publication date: 16-May-2024.
    • (2023) Revisiting Hard Negative Mining in Contrastive Learning for Visual Understanding. Electronics 12:23 (4884). DOI: 10.3390/electronics12234884. Online publication date: 4-Dec-2023.
    • (2023) Open-Scenario Domain Adaptive Object Detection in Autonomous Driving. Proceedings of the 31st ACM International Conference on Multimedia (8453-8462). DOI: 10.1145/3581783.3611854. Online publication date: 26-Oct-2023.
    • (2023) HGAN: Hierarchical Graph Alignment Network for Image-Text Retrieval. IEEE Transactions on Multimedia 25 (9189-9202). DOI: 10.1109/TMM.2023.3248160. Online publication date: 23-Feb-2023.
    • (2023) Confidence-Aware Active Feedback for Interactive Instance Search. IEEE Transactions on Multimedia 25 (7173-7184). DOI: 10.1109/TMM.2022.3217965. Online publication date: 1-Jan-2023.
    • (2023) Less is Better: Exponential Loss for Cross-Modal Matching. IEEE Transactions on Circuits and Systems for Video Technology 33:9 (5271-5280). DOI: 10.1109/TCSVT.2023.3249754. Online publication date: 1-Sep-2023.
    • (2023) Multilateral Semantic Relations Modeling for Image Text Retrieval. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2830-2839). DOI: 10.1109/CVPR52729.2023.00277. Online publication date: Jun-2023.
    • Show More Cited By
