research-article
Open access

Learnable Pillar-based Re-ranking for Image-Text Retrieval

Published: 18 July 2023

Abstract

Image-text retrieval aims to bridge the modality gap and retrieve cross-modal content based on semantic similarities. Prior work usually focuses on the pairwise relations (i.e., whether a data sample matches another) but ignores the higher-order neighbor relations (i.e., a matching structure among multiple data samples). Re-ranking, a popular post-processing practice, has revealed the superiority of capturing neighbor relations in single-modality retrieval tasks. However, it is ineffective to directly extend existing re-ranking algorithms to image-text retrieval. In this paper, we analyze the reason from four perspectives, i.e., generalization, flexibility, sparsity, and asymmetry, and propose a novel learnable pillar-based re-ranking paradigm. Concretely, we first select top-ranked intra- and inter-modal neighbors as pillars, and then reconstruct data samples with the neighbor relations between them and the pillars. In this way, each sample can be mapped into a multimodal pillar space only using similarities, ensuring generalization. After that, we design a neighbor-aware graph reasoning module to flexibly exploit the relations and excavate the sparse positive items within a neighborhood. We also present a structure alignment constraint to promote cross-modal collaboration and align the asymmetric modalities. On top of various base backbones, we carry out extensive experiments on two benchmark datasets, i.e., Flickr30K and MS-COCO, demonstrating the effectiveness, superiority, generalization, and transferability of our proposed re-ranking paradigm.
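
The pillar idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it only shows, under simplifying assumptions, how a text query and a candidate image might each be re-expressed as a vector of similarities to a shared set of pillars. The function names (`select_pillars`, `to_pillar_space`), the toy similarity values, and the choice of k = 2 are all hypothetical; the learnable graph reasoning module and the structure alignment constraint are not modeled here.

```python
# Illustrative sketch only; names and numbers are assumptions, not from the paper.

def select_pillars(scores, k):
    """Return the indices of the k highest-scoring items, best first."""
    # sorted() is stable, so ties keep their original order.
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]


def to_pillar_space(sim_to_images, sim_to_texts, image_pillars, text_pillars):
    """Describe a sample purely by its similarities to the chosen pillars."""
    return ([sim_to_images[i] for i in image_pillars]
            + [sim_to_texts[j] for j in text_pillars])


# Toy similarities produced by some base backbone (values are made up).
query_to_images = [0.1, 0.9, 0.4, 0.7]   # text query vs. 4 gallery images
query_to_texts = [0.8, 0.2, 0.6]         # text query vs. 3 other texts
image_to_images = [
    [1.0, 0.3, 0.2, 0.1],
    [0.3, 1.0, 0.5, 0.6],
    [0.2, 0.5, 1.0, 0.3],
    [0.1, 0.6, 0.3, 1.0],
]
image_to_texts = [
    [0.2, 0.7, 0.1],
    [0.9, 0.1, 0.5],
    [0.4, 0.1, 0.6],
    [0.6, 0.3, 0.7],
]

k = 2
image_pillars = select_pillars(query_to_images, k)  # inter-modal pillars
text_pillars = select_pillars(query_to_texts, k)    # intra-modal pillars

# The query and a candidate image now live in the same 2k-dimensional space.
query_vec = to_pillar_space(query_to_images, query_to_texts,
                            image_pillars, text_pillars)
candidate_vec = to_pillar_space(image_to_images[2], image_to_texts[2],
                                image_pillars, text_pillars)
```

Because both vectors are built only from similarity scores, a sketch like this does not depend on the feature dimensionality of any particular backbone, which is the sense in which the abstract ties the pillar space to generalization.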

Published In

SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2023
3567 pages
ISBN: 9781450394086
DOI: 10.1145/3539618
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. cross-modal retrieval
  2. image-text matching
  3. re-ranking

Qualifiers

  • Research-article

Funding Sources

  • the Defence Science and Technology Agency
  • the National Natural Science Foundation of China
  • the Special Fund for distinguished professors of Shandong Jianzhu University

Conference

SIGIR '23
Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Article Metrics

  • Downloads (Last 12 months): 374
  • Downloads (Last 6 weeks): 51
Reflects downloads up to 08 Feb 2025

Cited By

  • (2025) Generating counterfactual negative samples for image-text matching. Information Processing & Management 62:3 (103990). DOI: 10.1016/j.ipm.2024.103990. Online publication date: May-2025
  • (2024) MORE'24 Multimedia Object Re-ID: Advancements, Challenges, and Opportunities. Proceedings of the 2024 International Conference on Multimedia Retrieval (1336-1338). DOI: 10.1145/3652583.3658892. Online publication date: 30-May-2024
  • (2024) CaLa: Complementary Association Learning for Augmenting Composed Image Retrieval. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (2177-2187). DOI: 10.1145/3626772.3657823. Online publication date: 10-Jul-2024
  • (2024) Hierarchical Semantics Alignment for 3D Human Motion Retrieval. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (1083-1092). DOI: 10.1145/3626772.3657804. Online publication date: 10-Jul-2024
  • (2024) Audio meets text: a loss-enhanced journey with manifold mixup and re-ranking. Knowledge and Information Systems. DOI: 10.1007/s10115-024-02283-4. Online publication date: 19-Nov-2024
  • (2023) Target-Guided Composed Image Retrieval. Proceedings of the 31st ACM International Conference on Multimedia (915-923). DOI: 10.1145/3581783.3611817. Online publication date: 26-Oct-2023
