Regional Relation Modeling for Visual Place Recognition

Authors:

Zhou Zhao

SIGIR '20: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval

Pages 821 - 830

https://doi.org/10.1145/3397271.3401176

Published: 25 July 2020

Abstract

In visual perception, humans perceive not only the appearance of the objects in a place but also the relationships among them (e.g., their spatial layout). However, the dominant approaches to visual place recognition assume that two images depict the same place if they contain enough similar objects, and neglect relation information. In this paper, we propose a regional relation module that models regional relationships and converts convolutional feature maps into relational feature maps. We further design a cascaded pooling method that yields discriminative relation descriptors by suppressing the influence of confusing relations while preserving as much useful information as possible. Extensive experiments on two place recognition benchmarks demonstrate that training with the proposed regional relation module improves the appearance descriptors, and that the relation descriptors are complementary to the appearance descriptors. When the two kinds of descriptors are concatenated, the resulting combined descriptors outperform state-of-the-art methods.
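
The final step of the abstract, concatenating an appearance descriptor with a relation descriptor and using the result for retrieval, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the descriptors are normalized or compared, so the L2 normalization, the dot-product ranking, the function names (`l2_normalize`, `combine`, `rank`), and the toy vectors are all assumptions made for illustration.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]

def combine(appearance, relation):
    """Concatenate two L2-normalized descriptors and re-normalize the result.

    Assumption: normalizing each part first gives both parts equal weight.
    """
    return l2_normalize(l2_normalize(appearance) + l2_normalize(relation))

def rank(query, database):
    """Return database indices sorted by descending dot product with the query."""
    scores = [sum(q * d for q, d in zip(query, item)) for item in database]
    return sorted(range(len(database)), key=lambda i: scores[i], reverse=True)

# Toy descriptors with made-up values, purely illustrative.
query = combine([1.0, 0.0], [0.0, 1.0])
database = [
    combine([0.0, 1.0], [1.0, 0.0]),  # differs in both parts
    combine([1.0, 0.0], [0.0, 1.0]),  # matches in both parts
    combine([1.0, 0.0], [1.0, 0.0]),  # same appearance, different relation
]
ranking = rank(query, database)
print(ranking)
```

In this toy setup, an image that matches the query in appearance but not in relation ranks below one that matches in both, which is the kind of complementarity the abstract describes.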

Supplementary Material

MP4 File (3397271.3401176.mp4)

Regional Relation Modeling for Visual Place Recognition Presentation Video


Index Terms

Regional Relation Modeling for Visual Place Recognition
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision tasks
        Visual content-based indexing and retrieval

Published In

SIGIR '20: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval

July 2020

2548 pages

ISBN: 9781450380164

DOI: 10.1145/3397271

General Chairs: Jimmy Huang (York University, Canada), Yi Chang (Jilin University, China), Xueqi Cheng (Chinese Academy of Sciences, China)

Program Chairs: Jaap Kamps (University of Amsterdam, Netherlands), Vanessa Murdock (Amazon, U.S.A.), Ji-Rong Wen (Renmin University of China, China), Yiqun Liu (Tsinghua University, China)

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Major Fundamental Research Project in the Science and Technology Plan of Shenzhen
National Natural Science Foundation of China
Natural Science Foundation of Guangdong Province of China

Conference

SIGIR '20

Sponsor:

SIGIR

SIGIR '20: The 43rd International ACM SIGIR conference on research and development in Information Retrieval

July 25 - 30, 2020

Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%
