research-article

Open access

Learnable Pillar-based Re-ranking for Image-Text Retrieval

Authors:

Tat-Seng Chua

SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval

Pages 1252 - 1261

https://doi.org/10.1145/3539618.3591712

Published: 18 July 2023

Abstract

Image-text retrieval aims to bridge the modality gap and retrieve cross-modal content based on semantic similarities. Prior work usually focuses on the pairwise relations (i.e., whether a data sample matches another) but ignores the higher-order neighbor relations (i.e., a matching structure among multiple data samples). Re-ranking, a popular post-processing practice, has revealed the superiority of capturing neighbor relations in single-modality retrieval tasks. However, it is ineffective to directly extend existing re-ranking algorithms to image-text retrieval. In this paper, we analyze the reason from four perspectives, i.e., generalization, flexibility, sparsity, and asymmetry, and propose a novel learnable pillar-based re-ranking paradigm. Concretely, we first select top-ranked intra- and inter-modal neighbors as pillars, and then reconstruct data samples with the neighbor relations between them and the pillars. In this way, each sample can be mapped into a multimodal pillar space using only similarities, ensuring generalization. After that, we design a neighbor-aware graph reasoning module to flexibly exploit the relations and excavate the sparse positive items within a neighborhood. We also present a structure alignment constraint to promote cross-modal collaboration and align the asymmetric modalities. On top of various base backbones, we carry out extensive experiments on two benchmark datasets, i.e., Flickr30K and MS-COCO, demonstrating the effectiveness, superiority, generalization, and transferability of our proposed re-ranking paradigm.
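
The pillar selection and reconstruction step described in the abstract can be illustrated with a minimal sketch. It assumes that similarity matrices from some base model are already available; the function names (`top_k`, `pillar_encode`), the parameter `k`, and the toy values are hypothetical and not taken from the paper. The learned neighbor-aware graph reasoning module and the structure alignment constraint are not reproduced here.

```python
import numpy as np


def top_k(scores, k):
    """Return the indices of the k largest scores, best first."""
    return np.argsort(-scores, kind="stable")[:k]


def pillar_encode(q, sim_it, sim_ii, sim_tt, k=2):
    """Map image query q and all text candidates into the pillar space of q.

    sim_it: (n_img, n_txt) image-to-text similarities (assumed precomputed)
    sim_ii: (n_img, n_img) image-to-image similarities
    sim_tt: (n_txt, n_txt) text-to-text similarities
    """
    # Pillars: top-ranked inter-modal (text) and intra-modal (image) neighbors.
    txt_pillars = top_k(sim_it[q], k)
    img_pillars = top_k(sim_ii[q], k)

    # The query is described only by its similarities to the pillars.
    q_vec = np.concatenate([sim_ii[q, img_pillars], sim_it[q, txt_pillars]])

    # Each text candidate is described by its similarities to the same pillars.
    cand = np.concatenate(
        [sim_it[img_pillars].T, sim_tt[:, txt_pillars]], axis=1
    )
    return img_pillars, txt_pillars, q_vec, cand


# Toy data (illustrative values only): 2 images, 3 texts.
sim_it = np.array([[0.9, 0.1, 0.5],
                   [0.2, 0.8, 0.3]])
sim_ii = np.array([[1.0, 0.4],
                   [0.4, 1.0]])
sim_tt = np.array([[1.0, 0.2, 0.6],
                   [0.2, 1.0, 0.3],
                   [0.6, 0.3, 1.0]])

img_p, txt_p, q_vec, cand = pillar_encode(0, sim_it, sim_ii, sim_tt, k=2)
print(q_vec.shape, cand.shape)
```

Because the encoding uses only similarity scores, the same routine applies to any base model that produces such matrices, which is the sense of generalization the abstract refers to. A re-ranking model would then operate on `q_vec` and `cand` instead of on raw features.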

Index Terms

Learnable Pillar-based Re-ranking for Image-Text Retrieval
1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking
      1. Novelty in information retrieval
    2. Specialized information retrieval
      1. Multimedia and multimodal retrieval

Information

Published In

SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval

July 2023

3567 pages

ISBN:9781450394086

DOI:10.1145/3539618

General Chairs: Hsin-Hsi Chen (National Taiwan University), Wei-Jou (Edward) Duh (National Taiwan University), Hen-Hsen Huang (Academia Sinica)

Program Chairs: Makoto P. Kato (Spotify), Josiane Mothe (Universite de Toulouse), Barbara Poblete (University of Chile and Amazon Visiting Academic)

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

the Defence Science and Technology Agency
the National Natural Science Foundation of China
the Special Fund for distinguished professors of Shandong Jianzhu University

Conference

SIGIR '23

Sponsor:

SIGIR

SIGIR '23: The 46th International ACM SIGIR Conference on Research and Development in Information Retrieval

July 23 - 27, 2023

Taipei, Taiwan

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Article Metrics

Total Citations: 6
Total Downloads: 495
Downloads (Last 12 months): 374
Downloads (Last 6 weeks): 51

Reflects downloads up to 08 Feb 2025

Cited By

Su X, Song D, Li W, Ren T, Liu A (2025). Generating counterfactual negative samples for image-text matching. Information Processing & Management 62:3 (103990). Online publication date: May-2025.
https://doi.org/10.1016/j.ipm.2024.103990
Zheng Z, Wang Y, Qian X, Zhong Z, Wang Z, Zheng L, Gurrin C, Kongkachandra R, Schoeffmann K, Dang-Nguyen D, Rossetto L, Satoh S, Zhou L (2024). MORE'24 Multimedia Object Re-ID: Advancements, Challenges, and Opportunities. Proceedings of the 2024 International Conference on Multimedia Retrieval (1336-1338). Online publication date: 30-May-2024.
https://dl.acm.org/doi/10.1145/3652583.3658892
Jiang X, Wang Y, Li M, Wu Y, Hu B, Qian X, Hui Yang G, Wang H, Han S, Hauff C, Zuccon G, Zhang Y (2024). CaLa: Complementary Association Learning for Augmenting Composed Image Retrieval. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (2177-2187). Online publication date: 10-Jul-2024.
https://dl.acm.org/doi/10.1145/3626772.3657823
Yang Y, Shi H, Zhang H, Hui Yang G, Wang H, Han S, Hauff C, Zuccon G, Zhang Y (2024). Hierarchical Semantics Alignment for 3D Human Motion Retrieval. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (1083-1092). Online publication date: 10-Jul-2024.
https://dl.acm.org/doi/10.1145/3626772.3657804
Suryawanshi Y, Shah V, Randar S, Joshi A (2024). Audio meets text: a loss-enhanced journey with manifold mixup and re-ranking. Knowledge and Information Systems. Online publication date: 19-Nov-2024.
https://doi.org/10.1007/s10115-024-02283-4
Wen H, Zhang X, Song X, Wei Y, Nie L, El Saddik A, Mei T, Cucchiara R, Bertini M, Tobon Vallejo D, Atrey P, Hossain M (2023). Target-Guided Composed Image Retrieval. Proceedings of the 31st ACM International Conference on Multimedia (915-923). Online publication date: 26-Oct-2023.
https://dl.acm.org/doi/10.1145/3581783.3611817
