Covariant Peak Constraint for Accurate Keypoint Detection and Keypoint-Specific Descriptor Learning

Published: 15 November 2023

Abstract

Local feature extraction consists of keypoint detection and local descriptor extraction. First, in keypoint detector learning, existing covariant constraint loss functions cannot constrain the shapes of the probability distributions in the local probability maps surrounding keypoints, and the auxiliary peak loss functions used to alleviate this problem impair the performance of local feature methods. To address this, we propose a novel Covariant Peak constraint Loss (CP Loss), defined as the expectation of each local probability map's position error. Minimizing the CP Loss makes local probability maps peak accurately at reliable keypoints. Second, in descriptor learning, the Neural Reprojection Error (NRE) constrains the dense descriptor maps of images, but we argue that only the descriptors of keypoints need to be constrained. We therefore propose a novel Conditional Neural Reprojection Error (CNRE) that is conditioned only on keypoints. Compared with the NRE, the CNRE is considerably more efficient and produces more keypoint-specific descriptors with better matching performance. We use the CP Loss and CNRE to train a local feature network named CPCN-Feat. Experimental results show that CPCN-Feat achieves state-of-the-art performance on four challenging benchmarks.
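The abstract describes the CP Loss only at a high level, as the expectation of a keypoint's position error under its local probability map. As a reading aid, the sketch below illustrates that idea in PyTorch; the tensor shapes, window convention, and function name are assumptions made for illustration, not the authors' implementation.

```python
import torch


def cp_loss_sketch(local_prob: torch.Tensor, gt_xy: torch.Tensor) -> torch.Tensor:
    """Expected position error of local probability maps (illustrative only).

    local_prob: (B, N, N) probabilities over an N x N window around each
                candidate keypoint; each map is assumed to sum to 1.
    gt_xy:      (B, 2) ground-truth keypoint coordinates (x, y) expressed in
                the same local window.
    Returns the batch-mean expected Euclidean distance between a position
    sampled from the map and the ground-truth keypoint.
    """
    B, N, _ = local_prob.shape
    ys, xs = torch.meshgrid(
        torch.arange(N, dtype=local_prob.dtype, device=local_prob.device),
        torch.arange(N, dtype=local_prob.dtype, device=local_prob.device),
        indexing="ij",
    )
    coords = torch.stack([xs, ys], dim=-1)                       # (N, N, 2) pixel grid
    dist = torch.linalg.norm(
        coords.unsqueeze(0) - gt_xy[:, None, None, :], dim=-1
    )                                                            # (B, N, N) per-cell error
    return (local_prob * dist).sum(dim=(1, 2)).mean()            # E[ ||p - p_gt|| ]
```

Under this reading, minimizing the expectation concentrates each map's probability mass at the ground-truth keypoint, which is consistent with the abstract's claim that the local probability maps then peak accurately at reliable keypoints.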



Published In

IEEE Transactions on Multimedia, Volume 26, 2024, 11427 pages

Publisher

IEEE Press
