Covariant Peak Constraint for Accurate Keypoint Detection and Keypoint-Specific Descriptor Learning

Published: 15 November 2023

Abstract

Local feature extraction consists of keypoint detection and local descriptor extraction. First, in keypoint detector learning, existing covariant constraint loss functions cannot constrain the shapes of the probability distributions in the local probability maps surrounding keypoints, and the auxiliary peak loss functions used to alleviate this problem impair the performance of local feature methods. To address this, we propose a novel Covariant Peak constraint Loss (CP Loss), defined as the expectation of each local probability map's position error. Minimizing the CP Loss makes local probability maps peak accurately at reliable keypoints. Second, in descriptor learning, the Neural Reprojection Error (NRE) constrains the dense descriptor maps of images, but we argue that only the descriptors of keypoints need to be constrained. We therefore propose a novel Conditional Neural Reprojection Error (CNRE) that is conditioned only on keypoints. Compared with the NRE, the CNRE is considerably more efficient and produces more keypoint-specific descriptors with better matching performance. We use the CP Loss and CNRE to train a local feature network named CPCN-Feat. Experimental results show that CPCN-Feat achieves state-of-the-art performance on four challenging benchmarks.
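The abstract describes the CP Loss only at a high level, as the expectation of a keypoint's position error under its local probability map. As a reading aid, the sketch below illustrates that idea in PyTorch; the tensor shapes, window convention, and function name are assumptions made for illustration, not the authors' implementation.

```python
import torch


def cp_loss_sketch(local_prob: torch.Tensor, gt_xy: torch.Tensor) -> torch.Tensor:
    """Expected position error of local probability maps (illustrative only).

    local_prob: (B, N, N) probabilities over an N x N window around each
                candidate keypoint; each map is assumed to sum to 1.
    gt_xy:      (B, 2) ground-truth keypoint coordinates (x, y) expressed in
                the same local window.
    Returns the batch-mean expected Euclidean distance between a position
    sampled from the map and the ground-truth keypoint.
    """
    B, N, _ = local_prob.shape
    ys, xs = torch.meshgrid(
        torch.arange(N, dtype=local_prob.dtype, device=local_prob.device),
        torch.arange(N, dtype=local_prob.dtype, device=local_prob.device),
        indexing="ij",
    )
    coords = torch.stack([xs, ys], dim=-1)                       # (N, N, 2) pixel grid
    dist = torch.linalg.norm(
        coords.unsqueeze(0) - gt_xy[:, None, None, :], dim=-1
    )                                                            # (B, N, N) per-cell error
    return (local_prob * dist).sum(dim=(1, 2)).mean()            # E[ ||p - p_gt|| ]
```

Under this reading, minimizing the expectation concentrates each map's probability mass at the ground-truth keypoint, which is consistent with the abstract's claim that the local probability maps then peak accurately at reliable keypoints.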



Published In

IEEE Transactions on Multimedia, Volume 26, 2024, 11427 pages

Publisher

IEEE Press
