1 Implementation Details
Transformer. We design a four-stage hierarchical Transformer that takes gray-scale images as input, i.e., with a single input channel. Each stage contains a positional patch embedding layer and three attention layers. The channel dimension of the feature maps increases gradually over the four stages as {128, 192, 256, 512}, while the resolution decreases to {1/2, 1/4, 1/8, 1/16} of the input (in the large version) or {1/4, 1/8, 1/16, 1/32} (in the lite version). Our backbone does not contain a stem layer [3]; we use a large 7×7 convolution for the first patch embedding layer and a 3×3 convolution for each of the next three.
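For illustration, a minimal sketch of the four patch embedding layers is given below. The strides and paddings are assumptions inferred from the stated resolutions (a stride-2 first embedding for the large version, stride-4 for the lite version, followed by stride-2 3×3 embeddings), not the exact implementation.

import torch.nn as nn

# Hypothetical sketch of the four patch embedding layers (large version):
# channels {128, 192, 256, 512}, resolutions {1/2, 1/4, 1/8, 1/16}.
# Strides and paddings are assumptions; the lite version would use
# first_stride=4 to start at 1/4 resolution.
def build_patch_embeds(in_ch=1, dims=(128, 192, 256, 512), first_stride=2):
    layers = []
    prev = in_ch
    for i, dim in enumerate(dims):
        if i == 0:
            # large 7x7 convolution for the first patch embedding layer
            layers.append(nn.Conv2d(prev, dim, kernel_size=7,
                                    stride=first_stride, padding=3))
        else:
            # 3x3 convolution for the remaining patch embedding layers
            layers.append(nn.Conv2d(prev, dim, kernel_size=3,
                                    stride=2, padding=1))
        prev = dim
    return nn.ModuleList(layers)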
MLP. Inspired by the MLP design of SegFormer [8], we place an MLP layer after each attention layer in our match-aware encoder; it consists of two linear layers and a depth-wise convolution layer. To avoid excessive computation, we set the hidden features ratio [8] of all MLPs to 4. The MLP layers enhance the features extracted by attention and introduce residual connections.
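As a rough sketch of this block (not the exact implementation), the MLP can be written as two linear layers around a depth-wise convolution with an expansion ratio of 4 and a residual connection; the activation placement and the token-to-feature-map reshape are assumptions.

import torch
import torch.nn as nn

class MatchMLP(nn.Module):
    # Sketch of a SegFormer-style MLP block: Linear -> DWConv -> GELU -> Linear,
    # with hidden ratio 4 and a residual connection. Details are assumptions.
    def __init__(self, dim, ratio=4):
        super().__init__()
        hidden = dim * ratio
        self.fc1 = nn.Linear(dim, hidden)
        self.dwconv = nn.Conv2d(hidden, hidden, 3, 1, 1, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x, H, W):
        # x: (B, N, C) token sequence with N = H * W
        shortcut = x
        x = self.fc1(x)
        B, N, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, H, W)      # tokens -> feature map
        x = self.dwconv(x).flatten(2).transpose(1, 2)  # back to tokens
        x = self.fc2(self.act(x))
        return x + shortcut                            # residual connection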
Interleaving Self-/Cross-Attention. The extract-and-match strategy is constructed by interleaving self- and cross-attention within our MatchFormer model. There are four stages in the match-aware encoder. As the feature maps of the shallow stages (i.e., stage-1 and stage-2) emphasize textural information, more self-attention layers are applied to focus on exploring the features themselves. As the feature maps of the deep stages (i.e., stage-3 and stage-4) are biased toward semantic information, more cross-attention layers are applied to explore similarity across images. The code of MatchFormer is reported in Algorithm 1.
More Structural Analysis. To explore the effect of the attention module arrangement inside the backbone of MatchFormer, we extensively analyze various self- and cross-attention schemes at each stage, where the two modules interact in a separate or interleaved manner. To be consistent with the ablation study setting, we utilize the indoor model trained on 10% of ScanNet [2] to conduct this experiment.
# Algorithm 1: positional patch embedding and interleaved self-/cross-attention
import torch
import torch.nn as nn

class PosPE(nn.Module):
    # Positional patch embedding: a patch convolution whose output is
    # re-weighted by a sigmoid-gated depth-wise convolution.
    def __init__(self, in_ch, out_ch, patch_size, stride):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch, patch_size, stride, patch_size // 2)
        self.dwconv = nn.Conv2d(out_ch, out_ch, 3, 1, 1, groups=out_ch)

    def forward(self, image):
        image = self.proj(image)
        weight = torch.sigmoid(self.dwconv(image))  # positional weight map
        return image * weight                       # image_enhance

def attention(Q_A, K_A, V_A, Q_B, K_B, V_B, cross):
    if cross:  # cross-attention: each image queries the other image
        attn_A = (Q_A @ K_B.transpose(-2, -1)).softmax(dim=-1)
        attn_B = (Q_B @ K_A.transpose(-2, -1)).softmax(dim=-1)
        image_A, image_B = attn_A @ V_B, attn_B @ V_A
    else:      # self-attention: each image queries itself
        attn_A = (Q_A @ K_A.transpose(-2, -1)).softmax(dim=-1)
        attn_B = (Q_B @ K_B.transpose(-2, -1)).softmax(dim=-1)
        image_A, image_B = attn_A @ V_A, attn_B @ V_B
    return image_A, image_B

# MatchFormer stages
# stage1: cross_flags in 3 layers = [False, False, True]
# stage2: cross_flags in 3 layers = [False, False, True]
# stage3: cross_flags in 3 layers = [False, False, True]
# stage4: cross_flags in 3 layers = [False, False, True]
As shown in Table 1, the result in the first row indicates that using only self-attention without cross-attention limits the matching capacity of the transformer-based encoder. The results of the other separate arrangements show that placing cross-attention modules after the self-attention stages of MatchFormer improves the pose estimation performance, reaching 81.8% in precision (P) when three stages are constructed with cross-attention modules. However, excessive usage of cross-attention degrades the performance due to the lack of self-attention modules. Thus, we propose an attention-interleaving strategy that combines self- and cross-attention within each individual stage of the backbone. In the experiments of the last four rows, the interleaving attention scheme of MatchFormer achieves the best performance (86.7% in P). The results indicate the effectiveness of our proposed interleaving arrangement and confirm our observation that building a match-aware transformer-based encoder to perform the extract-and-match strategy benefits feature matching.
The coarse feature maps of size $\frac{H_1}{r_c} \times \frac{W_1}{r_c}$ and $\frac{H_2}{r_c} \times \frac{W_2}{r_c}$ are reshaped into sequences $I_1^c$ and $I_2^c$ to calculate the score $S_{i,j} = \frac{1}{\tau}\cdot\langle I_1^c(i), I_2^c(j)\rangle$ of the matrix $S \in \mathbb{R}^{\frac{H_1 W_1}{r_c^2} \times \frac{H_2 W_2}{r_c^2}}$, where $\langle\cdot,\cdot\rangle$ is the inner product, $\tau$ is the temperature coefficient, and $H$ and $W$ are the image height and width. To calculate the probability of soft mutual closest neighbor matching, we use softmax on both dimensions of $S$ (referred to as 2D-softmax). The coarse matching probability $P^c_{i,j}$ is calculated via Eq. (1):

$$P^c_{i,j} = \mathrm{softmax}(S_{i,\cdot})_j \cdot \mathrm{softmax}(S_{\cdot,j})_i. \qquad (1)$$
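For clarity, a minimal sketch of this dual-softmax computation is given below, assuming flattened coarse feature sequences feat1 of shape (N1, C) and feat2 of shape (N2, C); the function name and the default temperature are illustrative only.

import torch

def coarse_matching_prob(feat1, feat2, temp=0.1):
    # feat1: (N1, C), feat2: (N2, C) -- flattened coarse feature sequences.
    # Score matrix S of shape (N1, N2), scaled by the temperature.
    S = (feat1 @ feat2.transpose(0, 1)) / temp
    # 2D-softmax: softmax over both dimensions, multiplied elementwise,
    # giving the soft mutual closest neighbor matching probability P^c.
    P = S.softmax(dim=1) * S.softmax(dim=0)
    return P

In practice, a coarse match (i, j) would typically be kept where the probability exceeds a confidence threshold and is the mutual maximum of its row and column.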
Table 2. Indoor pose estimation on ScanNet with less training data. The AUC of three different thresholds and the average matching precision (P) are evaluated.

Method                 Data percent  AUC@5°         AUC@10°        AUC@20°        P
LoFTR [6]              10%           15.47          31.72          48.63          82.6
MatchFormer-large-SEA  10%           18.01 (+2.54)  35.87 (+4.15)  53.46 (+4.83)  86.7 (+4.1)
4 Homography Estimation
Qualitative Comparisons. To evaluate feature matching on the benchmark for geometric relation estimation, we perform Homography Estimation on HPatches [1] with MatchFormer-large-LA. In Fig. 3, we visualize more qualitative comparisons based on the matching results of MatchFormer-large-LA, LoFTR [6], and SuperGlue [5]. MatchFormer performs denser and more confident matching than SuperGlue, and it brings further improvements over LoFTR.
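For reference, a homography can be estimated from the predicted correspondences with RANSAC, e.g., via OpenCV. This is a generic sketch; the function name, variable names, and reprojection threshold below are placeholders rather than the evaluation settings used in the paper.

import cv2
import numpy as np

def estimate_homography(kpts0, kpts1, ransac_thresh=3.0):
    # kpts0, kpts1: (N, 2) arrays of matched pixel coordinates in image 0 / image 1.
    # Returns the 3x3 homography and the inlier mask found by RANSAC.
    H, inlier_mask = cv2.findHomography(kpts0.astype(np.float32),
                                        kpts1.astype(np.float32),
                                        cv2.RANSAC, ransac_thresh)
    return H, inlier_mask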
[Figure: qualitative matching comparisons; panel rows labeled SuperGlue, LoFTR, MatchFormer-large-SEA, and MatchFormer-large-LA.]
5 Image Matching
Following the experimental setup of Patch2Pix [9], we choose the same 108 HPatches sequences, including 52 sequences with illumination change and 56 sequences with viewpoint change. Each sequence contains six images. We match the first image with each of the other five and report the mean matching accuracy (MMA) at thresholds from 1 to 10 pixels, as well as the number of matches and features. The input image size is set to 1024, the matching threshold to 0.2, and the RANSAC threshold to 2 pixels.
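As an illustrative sketch only (not the actual evaluation code), the MMA at each pixel threshold can be obtained by warping the matched keypoints with the ground-truth homography and counting the fraction of matches within the threshold; all names below are placeholders.

import numpy as np

def mean_matching_accuracy(kpts0, kpts1, H_gt, thresholds=range(1, 11)):
    # kpts0, kpts1: (N, 2) matched keypoints; H_gt: 3x3 ground-truth homography.
    pts = np.concatenate([kpts0, np.ones((len(kpts0), 1))], axis=1)  # homogeneous coords
    warped = (H_gt @ pts.T).T
    warped = warped[:, :2] / warped[:, 2:3]          # back to pixel coordinates
    errors = np.linalg.norm(warped - kpts1, axis=1)  # reprojection error per match
    # Fraction of matches whose error is within each pixel threshold.
    return {t: float((errors <= t).mean()) for t in thresholds}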
8 Acknowledgments
This work was supported in part by the Federal Ministry of Labor and Social
Affairs (BMAS) through the AccessibleMaps project under Grant 01KM151112,
in part by the University of Excellence through the “KIT Future Fields” project,
in part by the Helmholtz Association Initiative and Networking Fund on the
HAICORE@KIT partition, and in part by Hangzhou SurImage Technology Com-
pany Ltd.
References
1. Balntas, V., Lenc, K., Vedaldi, A., Mikolajczyk, K.: HPatches: A benchmark and evaluation of handcrafted and learned local descriptors. In: CVPR (2017)
2. Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In: CVPR (2017)
3. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
4. Li, Z., Snavely, N.: MegaDepth: Learning single-view depth prediction from internet photos. In: CVPR (2018)
5. Sarlin, P.E., DeTone, D., Malisiewicz, T., Rabinovich, A.: SuperGlue: Learning feature matching with graph neural networks. In: CVPR (2020)
6. Sun, J., Shen, Z., Wang, Y., Bao, H., Zhou, X.: LoFTR: Detector-free local feature matching with transformers. In: CVPR (2021)
7. Taira, H., Okutomi, M., Sattler, T., Cimpoi, M., Pollefeys, M., Sivic, J., Pajdla, T., Torii, A.: InLoc: Indoor visual localization with dense matching and view synthesis. In: CVPR (2018)
8. Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: SegFormer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS (2021)
9. Zhou, Q., Sattler, T., Leal-Taixé, L.: Patch2Pix: Epipolar-guided pixel-level correspondences. In: CVPR (2021)