DOI: 10.1145/3397271.3401176
research-article

Regional Relation Modeling for Visual Place Recognition

Published: 25 July 2020

Abstract

In the process of visual perception, humans perceive not only the appearance of the objects in a place but also the relationships among them (e.g., their spatial layout). However, dominant approaches to visual place recognition assume that two images depict the same place if they contain enough similar objects, and they neglect this relation information. In this paper, we propose a regional relation module that models regional relationships and converts convolutional feature maps into relational feature maps. We further design a cascaded pooling method that yields discriminative relation descriptors by suppressing the influence of confusing relations while preserving as much useful information as possible. Extensive experiments on two place recognition benchmarks demonstrate that training with the proposed regional relation module improves the appearance descriptors, and that the relation descriptors are complementary to the appearance descriptors. When the two kinds of descriptors are concatenated, the resulting combined descriptors outperform state-of-the-art methods.
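
The final step of the abstract, concatenating appearance and relation descriptors and using the result for retrieval, can be illustrated with a minimal sketch. The abstract does not specify normalization, similarity measure, or descriptor sizes, so everything below is an assumption: the helper names (`l2_normalize`, `combine`, `rank`), the use of L2 normalization with cosine similarity, and the toy vectors are all hypothetical and are not the authors' implementation.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]

def combine(appearance, relation):
    """Assumed combination rule: normalize each descriptor, concatenate, renormalize."""
    return l2_normalize(l2_normalize(appearance) + l2_normalize(relation))

def rank(query, database):
    """Return database indices sorted by descending cosine similarity to the query.

    All descriptors are unit length, so the dot product equals the cosine similarity.
    """
    scores = [sum(q * d for q, d in zip(query, item)) for item in database]
    return sorted(range(len(database)), key=lambda i: scores[i], reverse=True)

# Toy (appearance, relation) descriptors, made up purely for illustration.
query = combine([1.0, 0.0], [0.0, 1.0])
database = [
    combine([1.0, 0.0], [1.0, 0.0]),  # same appearance, different relation
    combine([1.0, 0.0], [0.0, 1.0]),  # same appearance, same relation
    combine([0.0, 1.0], [1.0, 0.0]),  # appearance and relation both differ
]
order = rank(query, database)
print(order)
```

In this toy setup the first two database entries are indistinguishable by appearance alone; the relation part of the combined descriptor is what separates them, which is the kind of complementarity the abstract describes.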

Supplementary Material

MP4 File (3397271.3401176.mp4)
Regional Relation Modeling for Visual Place Recognition Presentation Video

Published In

SIGIR '20: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2020
2548 pages
ISBN: 9781450380164
DOI: 10.1145/3397271
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. content-based image retrieval
  2. convolutional neural network
  3. relation modeling
  4. visual place recognition

Qualifiers

  • Research-article

Funding Sources

  • Major Fundamental Research Project in the Science and Technology Plan of Shenzhen
  • National Natural Science Foundation of China
  • Natural Science Foundation of Guangdong Province of China

Conference

SIGIR '20

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 9
  • Downloads (Last 6 weeks): 2
Reflects downloads up to 15 Jan 2025

Cited By

  • (2023) Double-Domain Adaptation Semantics for Retrieval-Based Long-Term Visual Localization. IEEE Transactions on Multimedia 26, 6050-6064. DOI: 10.1109/TMM.2023.3345138. Online publication date: 20-Dec-2023
  • (2022) DMPCANet: A Low Dimensional Aggregation Network for Visual Place Recognition. Proceedings of the 2022 International Conference on Multimedia Retrieval, 24-28. DOI: 10.1145/3512527.3531427. Online publication date: 27-Jun-2022
  • (2022) Patch-NetVLAD+: Learned patch descriptor and weighted matching strategy for place recognition. 2022 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI), 1-8. DOI: 10.1109/MFI55806.2022.9913860. Online publication date: 20-Sep-2022
  • (2021) Patch-NetVLAD: Multi-Scale Fusion of Locally-Global Descriptors for Place Recognition. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 14136-14147. DOI: 10.1109/CVPR46437.2021.01392. Online publication date: Jun-2021
