research-article

Object and spatial discrimination makes weakly supervised local feature better

Published: 01 December 2024

Abstract

Local feature extraction plays a crucial role in many visual tasks. However, both descriptors and keypoints leave room for improvement, particularly in the discriminative power of descriptors and the localization precision of keypoints. To address these challenges, this study introduces a novel local feature extraction pipeline named OSDFeat (Object and Spatial Discrimination Feature). OSDFeat employs a decoupling strategy, training the descriptor and detection networks independently. Inspired by semantic correspondence, we propose an Object and Spatial Discrimination ResUNet (OSD-ResUNet). OSD-ResUNet captures features from the feature map that differentiate object appearance and spatial context, thus enhancing descriptor performance. To further improve the discriminative capability of descriptors, we propose a Discrimination Information Retained Normalization module (DIRN). DIRN integrates spatial-wise normalization and channel-wise normalization in a complementary manner, yielding descriptors that are more distinguishable and informative. In the detection network, we propose a Cross Saliency Pooling module (CSP). CSP employs a cross-shaped kernel to aggregate long-range context along both the vertical and horizontal dimensions. By enhancing the saliency of keypoints, CSP enables the detection network to use descriptor information effectively and to localize keypoints more precisely. OSDFeat achieves a Mean Matching Accuracy of 79.4% on the local feature matching task, improving on the previous best local feature extraction methods by 1.9% and achieving state-of-the-art results. Additionally, OSDFeat achieves competitive results in visual localization and 3D reconstruction. The results of this study indicate that object and spatial discrimination can improve the accuracy and robustness of local features, even in challenging environments. The code is available at https://github.com/pandaandyy/OSDFeat.
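The abstract does not give the formulation of DIRN, so the following is only a minimal NumPy sketch of the general idea of combining two normalizations on a dense descriptor map. It assumes that "channel-wise" means statistics taken over the channel axis at each pixel and that "spatial-wise" means statistics taken over all pixels for each channel; the paper may define these terms differently. The function names, the simple summation, and the final L2 normalization are illustrative choices, not the authors' method.

```python
import numpy as np

def channel_wise_norm(feat, eps=1e-6):
    # feat has shape (C, H, W); standardize each pixel's vector over C.
    mean = feat.mean(axis=0, keepdims=True)
    std = feat.std(axis=0, keepdims=True)
    return (feat - mean) / (std + eps)

def spatial_wise_norm(feat, eps=1e-6):
    # Standardize each channel over all H x W positions.
    mean = feat.mean(axis=(1, 2), keepdims=True)
    std = feat.std(axis=(1, 2), keepdims=True)
    return (feat - mean) / (std + eps)

def combined_norm(feat, eps=1e-6):
    # Hypothetical fusion: sum both views, then L2-normalize per pixel
    # so every descriptor has (approximately) unit length.
    mixed = channel_wise_norm(feat, eps) + spatial_wise_norm(feat, eps)
    length = np.linalg.norm(mixed, axis=0, keepdims=True)
    return mixed / (length + eps)

rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 4, 5))  # toy map: 8 channels, 4 x 5 pixels
desc = combined_norm(feat)
```

Each normalization discards different statistics (per-pixel versus per-channel), which is one plausible reading of why the abstract calls the two complementary.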


Highlights

We propose OSD-ResUNet, which enhances descriptor learning by incorporating object appearance and spatial context.
We propose DIRN, which combines spatial-wise and channel-wise normalization to preserve discriminative information.
We propose CSP, which enhances keypoint saliency by aggregating global and local information with long-range dependencies.
We propose OSDFeat, a local feature extraction pipeline that achieves state-of-the-art results on HPatches and competitive results on the Aachen Day-Night and ETH benchmarks.
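To illustrate what aggregating context with a cross-shaped kernel can look like, here is a minimal NumPy sketch. It assumes the simplest possible variant: each pixel of a 2D score map is replaced by the average over its full row and full column, with the center counted once. The function name `cross_pool` and the choice of plain averaging are hypothetical; the actual CSP module is a learned component whose details are not given in the abstract.

```python
import numpy as np

def cross_pool(score):
    # score has shape (H, W). For each pixel, average all values that lie
    # on the same row or the same column (a full-length cross).
    h, w = score.shape
    row_sum = score.sum(axis=1, keepdims=True)  # shape (H, 1)
    col_sum = score.sum(axis=0, keepdims=True)  # shape (1, W)
    # The center pixel appears in both sums, so subtract it once.
    return (row_sum + col_sum - score) / (h + w - 1)

score = np.arange(12, dtype=float).reshape(3, 4)
pooled = cross_pool(score)
```

Because the row and column sums are computed once and broadcast, the cost is linear in the number of pixels even though every output depends on H + W - 1 inputs.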

Published In

Neural Networks, Volume 180, Issue C
Dec 2024
1432 pages

Publisher

Elsevier Science Ltd.
United Kingdom

Author Tags

1. Cross normalization
2. Decoupled training
3. Image long-range context modeling
4. Semantic correspondence
5. Weakly supervised local feature learning
