DOI: 10.1145/3581783.3612283
research-article

Hi-SIGIR: Hierachical Semantic-Guided Image-to-image Retrieval via Scene Graph

Published: 27 October 2023

Abstract

Image-to-image retrieval, a fundamental task, aims at matching similar images based on a query image. Existing methods built on convolutional neural networks are usually sensitive to low-level visual features and ignore high-level semantic relationship information. This makes retrieving complicated images with multiple objects and various relationships a significant challenge. Although some works introduce the scene graph to capture the global semantic features of the objects and their relations, they ignore the local visual representations. In addition, because representations from a single modality are fragile, poisoning attacks in adversarial scenarios are easily achieved, hurting the robustness of the visual-guided foundation image retrieval model. To overcome these issues, we propose a novel hierarchical semantic-guided image-to-image retrieval method via scene graph, called Hi-SIGIR. Specifically, our method first generates the scene graph of an image. Then, our model extracts and learns both the visual and semantic features of the nodes and relations within the scene graphs. Next, these features are fused to obtain local information and sent to a graph neural network to obtain global information. Using this information, the similarity between the scene graphs of several images is calculated at both the local and global levels to perform image retrieval. Finally, we introduce a surrogate that calculates relevance in a cross-modal manner to understand image content better. Experimental evaluations on several widely used benchmarks demonstrate the superiority of the proposed method.
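
The two-level comparison described in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: the abstract does not specify the fusion operator, the graph neural network, or how the two levels are combined. The code below assumes concatenation for fusion, a single round of mean-neighbour aggregation in place of the graph neural network, best-match cosine similarity for the local score, and a hypothetical weight `alpha` for mixing the two scores. All function names and the dictionary layout are illustrative.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def fuse(visual, semantic):
    """Fuse per-node visual and semantic features (assumed: concatenation)."""
    return l2_normalize(np.concatenate([visual, semantic], axis=1))


def propagate(nodes, adjacency):
    """One round of mean-neighbour aggregation, a stand-in for a GNN layer."""
    adj = adjacency + np.eye(len(nodes))        # add self-loops
    adj = adj / adj.sum(axis=1, keepdims=True)  # row-normalise
    return l2_normalize(adj @ nodes)


def local_similarity(a, b):
    """Symmetric average of best-match cosine similarities between nodes."""
    sim = a @ b.T
    return float(0.5 * (sim.max(axis=1).mean() + sim.max(axis=0).mean()))


def global_similarity(a, adj_a, b, adj_b):
    """Cosine similarity between mean-pooled graph embeddings."""
    graph_a = l2_normalize(propagate(a, adj_a).mean(axis=0))
    graph_b = l2_normalize(propagate(b, adj_b).mean(axis=0))
    return float(graph_a @ graph_b)


def hierarchical_similarity(g1, g2, alpha=0.5):
    """Mix local and global scores; alpha is a hypothetical weight."""
    a = fuse(g1["visual"], g1["semantic"])
    b = fuse(g2["visual"], g2["semantic"])
    local = local_similarity(a, b)
    glob = global_similarity(a, g1["adj"], b, g2["adj"])
    return alpha * local + (1.0 - alpha) * glob


def random_graph(rng, num_nodes, dim=8):
    """Build a toy scene graph with random features and random edges."""
    adj = (rng.random((num_nodes, num_nodes)) < 0.3).astype(float)
    adj = np.maximum(adj, adj.T)  # make the graph undirected
    np.fill_diagonal(adj, 0.0)
    return {
        "visual": rng.standard_normal((num_nodes, dim)),
        "semantic": rng.standard_normal((num_nodes, dim)),
        "adj": adj,
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    query = random_graph(rng, 5)
    gallery = [query, random_graph(rng, 4), random_graph(rng, 6)]
    scores = [hierarchical_similarity(query, g) for g in gallery]
    print(int(np.argmax(scores)))  # index of the best-matching gallery graph
```

Ranking a gallery then amounts to sorting images by this score. In a learned system the features and the aggregation would be trained rather than fixed, which this sketch does not attempt.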

Cited By

  • (2024) A Digital Companion Architecture for Ambient Intelligence. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8:2 (1-26). DOI: 10.1145/3659610. Online publication date: 15-May-2024

    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN: 9798400701085
    DOI: 10.1145/3581783
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. deep neural network
    2. graph similarity
    3. image retrieval
    4. surrogate relevance

    Qualifiers

    • Research-article

    Funding Sources

    • National Key R&D Program of China
    • National Natural Science Foundation of China
    • Shenzhen Science and Technology Program

    Conference

    MM '23
    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

    Article Metrics

    • Downloads (Last 12 months): 202
    • Downloads (Last 6 weeks): 25
    Reflects downloads up to 13 Jan 2025
