DOI: 10.1145/3539618.3591705
research-article

Keyword-Based Diverse Image Retrieval by Semantics-aware Contrastive Learning and Transformer

Published: 18 July 2023

Abstract

Besides relevance, diversity is an important yet less studied performance metric of cross-modal image retrieval systems, and it is critical to user experience. Existing solutions for diversity-aware image retrieval either explicitly post-process the raw results of standard retrieval systems or learn multi-vector representations of images to capture their diverse semantics. However, neither approach balances relevance and diversity well. On the one hand, standard retrieval systems are usually biased toward common semantics and seldom exploit diversity-aware regularization in training, which makes it difficult to promote diversity by post-processing. On the other hand, multi-vector representation methods are not guaranteed to learn robust multiple projections. As a result, irrelevant images and images with rare or unique semantics may be projected inappropriately, which degrades both the relevance and the diversity of results produced by typical selection algorithms such as top-k. To address these problems, this paper presents a new method called CoLT that aims to generate more representative and robust representations so that images can be classified accurately. Specifically, CoLT first extracts semantics-aware image features by enhancing the preliminary representations of an existing one-to-one cross-modal system with semantics-aware contrastive learning. Then, a transformer-based token classifier assigns each feature to its corresponding category. Finally, a post-processing algorithm retrieves images from each category to form the final retrieval result. Extensive experiments on two real-world datasets, Div400 and Div150Cred, show that CoLT effectively boosts diversity and outperforms existing methods overall (achieving a higher F1 score).
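
The abstract does not spell out the post-processing algorithm, so the following is only a minimal sketch of one plausible reading of "retrieve images from each category": a round-robin pick across predicted categories, contrasted with plain top-k. The function name `select_diverse`, its parameters, and the round-robin policy are all assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def select_diverse(
    relevance: Sequence[float],
    categories: Sequence[int],
    k: int,
) -> List[int]:
    """Pick up to k image indices by cycling over predicted categories.

    Hypothetical helper: `relevance` holds one query-image score per image,
    `categories` holds one predicted category id per image.
    """
    if len(relevance) != len(categories):
        raise ValueError("relevance and categories must have the same length")

    # Group image indices by predicted category.
    buckets: Dict[int, List[int]] = defaultdict(list)
    for idx, cat in enumerate(categories):
        buckets[cat].append(idx)

    # Within each category, put the most relevant image first.
    for members in buckets.values():
        members.sort(key=lambda i: relevance[i], reverse=True)

    # Visit categories in order of their best image's relevance.
    order = sorted(buckets, key=lambda c: relevance[buckets[c][0]], reverse=True)

    selected: List[int] = []
    rank = 0  # rank 0 = best image of each category, rank 1 = second best, ...
    while len(selected) < k:
        added = False
        for cat in order:
            if rank < len(buckets[cat]):
                selected.append(buckets[cat][rank])
                added = True
                if len(selected) == k:
                    break
        if not added:  # every category is exhausted
            break
        rank += 1
    return selected


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.7, 0.6, 0.5]   # made-up relevance scores
    labels = [0, 0, 0, 1, 2]             # made-up category predictions
    print(select_diverse(scores, labels, k=3))  # [0, 3, 4]
```

On this toy input, plain top-k by score would return images 0, 1, and 2, all from category 0, whereas the round-robin pick returns one image from each of the three categories. This is the relevance/diversity trade-off the abstract describes, shown under the stated assumptions only.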

Supplemental Material

MP4 File
Presentation video - short version

Cited By

  • (2024) Hierarchical Semantics Alignment for 3D Human Motion Retrieval. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1083-1092. DOI: 10.1145/3626772.3657804. Online publication date: 10-Jul-2024

    Published In

    SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
    July 2023
    3567 pages
    ISBN:9781450394086
    DOI:10.1145/3539618
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. cross-modal retrieval
    2. diversification retrieval
    3. keyword-based image retrieval
    4. transformer

    Qualifiers

    • Research-article

    Conference

    SIGIR '23

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%

    Article Metrics

    • Downloads (last 12 months): 97
    • Downloads (last 6 weeks): 7
    Reflects downloads up to 17 Oct 2024
