Self-supervising Fine-Grained Region Similarities for Large-Scale Image Localization

  • Conference paper
Computer Vision – ECCV 2020 (ECCV 2020)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12349)


The task of large-scale retrieval-based image localization is to estimate the geographical location of a query image by recognizing its nearest reference images from a city-scale dataset. However, public benchmarks generally provide only noisy GPS labels for the training images, which act as weak supervision for learning image-to-image similarities. Such label noise prevents deep neural networks from learning discriminative features for accurate localization. To tackle this challenge, we propose to self-supervise image-to-region similarities in order to fully explore the potential of difficult positive images alongside their sub-regions. The estimated image-to-region similarities serve as extra training supervision for improving the network over successive generations, and the improved network in turn gradually refines the fine-grained similarities. Our self-enhanced image-to-region similarity labels address the training bottleneck in state-of-the-art pipelines without any additional parameters or manual annotations in either training or inference. Our method outperforms state-of-the-art methods on the standard localization benchmarks by noticeable margins and shows excellent generalization capability on multiple image retrieval datasets. Code is available at https://github.com/yxgeee/SFRS.
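
As a rough illustration of the retrieval setting described above, the following minimal sketch ranks reference images by cosine similarity to a query descriptor and returns the GPS tag of the best match. It also shows one plausible way to turn similarities into soft labels with a temperature softmax. This is not the authors' implementation: the function names, the toy descriptors, the GPS values, and the temperature of 0.1 are all assumptions made for illustration, and the published code at the repository above should be consulted for the actual method.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cosine_similarities(query, references):
    """Cosine similarity between one query descriptor and each reference descriptor."""
    q = l2_normalize(query)
    return [sum(a * b for a, b in zip(q, l2_normalize(r))) for r in references]

def localize(query, references, gps_tags, top_k=1):
    """Return the GPS tags of the top_k most similar reference images."""
    sims = cosine_similarities(query, references)
    ranked = sorted(range(len(references)), key=lambda i: sims[i], reverse=True)
    return [gps_tags[i] for i in ranked[:top_k]]

def soft_similarity_labels(sims, temperature=0.1):
    """Hypothetical helper: temperature softmax over raw similarities."""
    scaled = [s / temperature for s in sims]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy data (assumed): three 2-D reference descriptors with made-up GPS tags.
references = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
gps_tags = [(35.0, 139.0), (35.1, 139.1), (35.2, 139.2)]
query = [0.9, 0.1]

top_match = localize(query, references, gps_tags, top_k=1)
soft_labels = soft_similarity_labels(cosine_similarities(query, references))
```

In a real pipeline the descriptors would come from a trained network and the reference set would hold many thousands of images, so an approximate nearest-neighbor index would usually replace the exhaustive sort used here.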


This work is supported in part by SenseTime Group Limited, in part by the General Research Fund through the Research Grants Council of Hong Kong under Grants CUHK 14202217/14203118/14205615/14207814/14213616/14208417/14239816, and in part by the CUHK Direct Grant.

Author information

Corresponding author

Correspondence to Hongsheng Li.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Ge, Y., Wang, H., Zhu, F., Zhao, R., Li, H. (2020). Self-supervising Fine-Grained Region Similarities for Large-Scale Image Localization. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol 12349. Springer, Cham. https://doi.org/10.1007/978-3-030-58548-8_22

  • DOI: https://doi.org/10.1007/978-3-030-58548-8_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58547-1

  • Online ISBN: 978-3-030-58548-8

  • eBook Packages: Computer Science, Computer Science (R0)
