Object and spatial discrimination makes weakly supervised local feature better

Published: 01 December 2024


Local feature extraction plays a crucial role in many visual tasks. However, both descriptors and keypoints leave room for improvement, particularly in the discriminative power of descriptors and the localization precision of keypoints. To address these challenges, this study introduces a novel local feature extraction pipeline named OSDFeat (Object and Spatial Discrimination Feature). OSDFeat employs a decoupling strategy, training the descriptor and detection networks independently. Inspired by semantic correspondence, we propose an Object and Spatial Discrimination ResUNet (OSD-ResUNet). OSD-ResUNet extracts from the feature map the features that differentiate object appearance and spatial context, thus enhancing descriptor performance. To further improve the discriminative capability of descriptors, we propose a Discrimination Information Retained Normalization module (DIRN). DIRN integrates spatial-wise and channel-wise normalization in a complementary manner, yielding descriptors that are more distinguishable and informative. In the detection network, we propose a Cross Saliency Pooling module (CSP). CSP employs a cross-shaped kernel to aggregate long-range context along both the vertical and horizontal dimensions. By enhancing the saliency of keypoints, CSP enables the detection network to use descriptor information effectively and to localize keypoints more precisely. On the local feature matching task, OSDFeat achieves a Mean Matching Accuracy of 79.4%, an improvement of 1.9% over the previous best local feature extraction methods and a new state of the art. Additionally, OSDFeat achieves competitive results in visual localization and 3D reconstruction. These results indicate that object and spatial discrimination can improve the accuracy and robustness of local features, even in challenging environments. The code is available at https://github.com/pandaandyy/OSDFeat.
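To make the idea of a cross-shaped kernel concrete, the following is a minimal sketch of pooling along the row and column of each pixel. It is not the CSP module itself: the function name `cross_pool`, the use of averaging, the `radius` parameter, and the border handling are all assumptions made for illustration, and the code uses plain Python lists rather than a tensor library.

```python
def cross_pool(score_map, radius):
    """Average each pixel with its neighbours on the same row and column.

    score_map: 2D list of floats (H x W).
    radius: arm length of the cross (hypothetical parameter).
    Positions outside the map are skipped rather than zero-padded.
    """
    h, w = len(score_map), len(score_map[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Horizontal arm, which includes the centre pixel once.
            vals = [score_map[y][xx]
                    for xx in range(max(0, x - radius), min(w, x + radius + 1))]
            # Vertical arm, with the centre excluded to avoid double counting.
            vals += [score_map[yy][x]
                     for yy in range(max(0, y - radius), min(h, y + radius + 1))
                     if yy != y]
            out[y][x] = sum(vals) / len(vals)
    return out


# A single bright pixel spreads only along its own row and column.
demo = [[0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0]]
pooled = cross_pool(demo, radius=1)
```

With a larger `radius`, each output value depends on pixels far away along the two axes while the kernel still touches far fewer positions than a full square window of the same extent, which is one plausible reason to prefer a cross shape for long-range context.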


We propose OSD-ResUNet, which enhances descriptor learning by incorporating object appearance and spatial context.
We propose DIRN, which combines spatial-wise and channel-wise normalization to preserve discriminative information.
We propose CSP, which enhances keypoint saliency by aggregating global and local information with long-range dependencies.
We propose OSDFeat, a local feature extraction pipeline that achieves state-of-the-art results on HPatches and competitive results on the Aachen Day-Night and ETH benchmarks.
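The HPatches result above is reported as Mean Matching Accuracy. As a rough illustration of how such a metric is commonly computed on image pairs related by a known homography, the sketch below counts a match as correct when its reprojection error is within a pixel threshold, then averages over pairs. The function names, the data layout, and the threshold value are assumptions for illustration; the evaluation protocol actually used for OSDFeat is not specified here.

```python
import math


def project(H, pt):
    """Map a 2D point through a 3x3 homography given as nested lists."""
    x, y = pt
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)


def matching_accuracy(matches, H, threshold):
    """Fraction of matches whose reprojection error is at most `threshold` pixels.

    matches: list of (point_in_image_1, point_in_image_2) tuples.
    """
    if not matches:
        return 0.0
    correct = 0
    for p_src, p_dst in matches:
        px, py = project(H, p_src)
        if math.hypot(px - p_dst[0], py - p_dst[1]) <= threshold:
            correct += 1
    return correct / len(matches)


def mean_matching_accuracy(pairs, threshold):
    """Average the per-pair accuracy over a list of (matches, H) image pairs."""
    return sum(matching_accuracy(m, H, threshold) for m, H in pairs) / len(pairs)


# Toy data: one pair related by a 10-pixel shift, one by the identity.
shift = [[1.0, 0.0, 10.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pairs = [
    ([((0, 0), (10, 0)), ((5, 5), (15, 5.5)), ((1, 1), (20, 1))], shift),
    ([((0, 0), (0, 0))], identity),
]
mma = mean_matching_accuracy(pairs, threshold=3.0)  # threshold is an assumed value
```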


