
Bridging the Modality Gap: Dimension Information Alignment and Sparse Spatial Constraint for Image-Text Matching

Xiang Ma (ORCID 0000-0002-4963-8705, xiangma@sdu.edu.cn), Shandong University, Jinan, Shandong, China
Xuemei Li (xmli@sdu.edu.cn), Shandong University, Jinan, Shandong, China
Lexin Fang (fanglexin@mail.sdu.edu.cn), Shandong University, Jinan, Shandong, China
Caiming Zhang (czhang@sdu.edu.cn), Shandong University, Jinan, Shandong, China
(2024)
Abstract.

Many contrastive-learning-based models have achieved strong performance in image-text matching tasks. The key to these models lies in analyzing the correlation between image-text pairs, which involves cross-modal interaction of embeddings in corresponding dimensions. However, the embeddings of different modalities come from different models or modules, and there is a significant modality gap between them. Directly interacting such embeddings lacks rationality and may capture inaccurate correlations. Therefore, we propose a novel method called DIAS to bridge the modality gap from two aspects: (1) We align the information representation of embeddings from different modalities in corresponding dimensions to ensure that the correlation calculation is based on interactions of similar information. (2) Spatial constraints on inter- and intra-modality unmatched pairs are introduced to ensure the effectiveness of the model's semantic alignment. In addition, a sparse correlation algorithm is proposed to select strongly correlated spatial relationships, enabling the model to learn more significant features and avoid being misled by weak correlations. Extensive experiments demonstrate the superiority of DIAS, which achieves 4.3%-10.2% rSum improvements on the Flickr30k and MSCOCO benchmarks.

Image-text Matching, Information Aligning, Spatial Constraint, Sparse Algorithm
copyright: acmlicensed
journalyear: 2024
doi: 10.1145/3664647.3681424
conference: Proceedings of the 32nd ACM International Conference on Multimedia; October 28-November 1, 2024; Melbourne, VIC, Australia
booktitle: Proceedings of the 32nd ACM International Conference on Multimedia (MM ’24), October 28-November 1, 2024, Melbourne, VIC, Australia
isbn: 979-8-4007-0686-8/24/10
ccs: Information systems Information retrieval

1. Introduction

Image-text matching is a fundamental task in computer vision (CV) and natural language processing (NLP), providing support for applications such as image captioning (Li et al., 2022b; Zhang et al., 2022a), text retrieval (Li et al., 2022c), and text-to-image generation (Huang et al., 2022; Liao et al., 2022). This task aims to discover semantic correlations between images and text, and bridge the semantic gap between these two heterogeneous modalities. The key challenge lies in adjusting embeddings by utilizing matched and unmatched relationships between images and texts to achieve high-quality semantic alignment.

The matching process typically operates on embeddings constructed from images and texts. Existing methods can be roughly divided into two categories: global and local (Fu et al., 2023; Liu et al., 2020). Global-based matching extracts global embeddings from whole images and texts and lets them interact to calculate correlations (Chen et al., 2021; Qu et al., 2021). Local-based matching adopts a fine-grained approach that extracts local embeddings from image regions and text words, and it usually obtains better performance (Qu et al., 2020; Chen et al., 2020a; Zhang et al., 2020). Both aim to align semantics by computing and adjusting the correlation between embeddings of different modalities, which involves the interaction of corresponding dimensions. For example, cosine similarity (Sidorov et al., 2014) calculates the correlation between two embeddings dimension by dimension. However, the embeddings generally come from different models or modules, resulting in significant differences in the information represented by each dimension. For instance, the image embeddings may represent color information in a certain dimension, while the text embeddings may represent the information of a word in the corresponding dimension. Note that the corresponding dimension may not necessarily be in the same column of the embeddings. This is known as the modality gap problem. The cross-modal interaction of such embeddings lacks rationality and can potentially lead to inaccurate correlation calculation.
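To make the role of corresponding dimensions concrete, the following minimal sketch computes cosine similarity as a normalized sum of per-dimension products. It then shows that storing the same values in a different dimension order changes the score. The toy vectors and the helper name `cosine_similarity` are illustrative assumptions and are not taken from the DIAS implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: a normalized sum of per-dimension products."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings (illustrative values only, not real model outputs).
image_emb = [1.0, 0.0, 2.0]
text_emb = [1.0, 0.0, 2.0]

# The same three values, but stored in a different dimension order.
text_emb_permuted = [2.0, 1.0, 0.0]

aligned_score = cosine_similarity(image_emb, text_emb)
misaligned_score = cosine_similarity(image_emb, text_emb_permuted)
print(aligned_score, misaligned_score)
```

In this toy case the score drops once the dimensions no longer correspond, although the permuted embedding carries the same values.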

To enhance the rationality and effectiveness of cross-modal interaction, we propose a novel image-text matching method based on Dimensional Information Alignment and Sparse Spatial Constraint (DIAS), aiming to bridge the gap between image and text modalities from two perspectives:

(1) To ensure the rationality of correlation calculation, we enhance the correlation of the embeddings from different modalities in corresponding dimensions. In subsequent processes, the interaction then involves related information of the embeddings in their corresponding dimensions. However, emphasizing only the correlation of dimensions may lead to feature redundancy, where each dimension provides similar information and lacks discriminative features. Feature redundancy can cause overfitting and reduce the generalization ability of models. Therefore, we enhance the independence of non-corresponding dimensions by reducing their correlation, which preserves the amount of information contained in the embeddings.

(2) Most existing methods primarily focus on constraining the relationships between matched image-text pairs and place weaker emphasis on unmatched pairs. This can lead to suboptimal semantic alignment. More importantly, the relationships between matched pairs are cross-modal constraints, and their effectiveness is significantly affected by the modality gap. We augment existing constraints by introducing spatial inter- and intra-modality constraints for unmatched pairs. The inter-modality constraint promotes semantic consistency by requiring distance consistency between inter-modality unmatched pairs. As shown in Fig. 1(a), the distance between image $i$ and text $j$ is constrained to be consistent with the distance between image $j$ and text $i$. The intra-modality constraint emphasizes spatial structure consistency by requiring distance consistency between unmatched pairs within each modality. As shown in Fig. 1(b), the distance between image $i$ and image $j$ is constrained to be consistent with the distance between text $i$ and text $j$. However, these two types of constraints assume that the spatial relationships between images and texts are symmetric, which is not always valid. Strictly following these constraints may lead the model to learn inaccurate features. Therefore, we propose a sparse correlation algorithm that sparsifies the spatial constraints by selecting only strong correlations, reducing the need for symmetry.

Specifically, DIAS first obtains local embeddings of image regions and text words, and calculates the correlations between them in all dimensions to construct the correlation matrix. Each value in the matrix represents the correlation between the corresponding region (row) and word (column). To align the information of embeddings from different modalities, we propose a regularizer that increases the correlation values of corresponding dimensions. Meanwhile, the correlation values between non-corresponding dimensions are decreased to suppress feature redundancy. Then, DIAS aggregates and updates the local embeddings, and merges them into global embeddings by pooling. As the correlations of local embeddings have been adjusted in the previous step, the construction of global embeddings becomes more reasonable. Subsequently, DIAS obtains the spatial distances between inter- and intra-modality unmatched pairs, and employs the proposed sparse correlation algorithm to select strong correlations from them. The proposed algorithm introduces conditional probabilities of instance correlation and adapts them into a sparse regularization term, enabling the model to automatically learn how to identify strong correlations for each instance. Finally, the selected spatial relationships are used as constraints and combined with the constraints between matched pairs to achieve semantic alignment.

Our contributions are summarized as follows:

(1) We propose a dimension information alignment method for embeddings of different modalities, aiming to enhance the rationality of cross-modal interaction and suppress feature redundancy.

(2) We introduce novel inter- and intra-modality constraints to ensure the effectiveness of semantic alignment.

(3) A sparse correlation algorithm is proposed to select strongly correlated spatial relationships, reducing the need for symmetry of embeddings.

Figure 1. Illustration of distance consistency.

2. Related Work

Based on how cross-modal interactions are implemented, image-text matching methods can be broadly categorized into global-based matching and local-based matching methods.

Global-based matching. Typical global methods obtain global embeddings of images and texts, project them into a shared embedding space through two branches, and align image-text semantics. One line of work focuses on how to accurately describe correlations between global embeddings. Some studies (Goel et al., 2022; Chen et al., 2020b) focus on improving correlation algorithms. For example, Jiang (Jiang et al., 2023) introduces the concept of geometric consistency to enhance the constraint on image-text pairs. Additionally, some studies (Karpathy and Fei-Fei, 2015; Klein et al., 2015; Li et al., 2019, 2022a; Wehrmann et al., 2020) propose complex models to construct more robust global embeddings. Especially in recent years, pre-trained networks (Radford et al., 2021; Li et al., 2020) trained with extensive resources have enriched the information contained in global embeddings. However, these methods still follow the existing paradigm, assuming that embeddings from different modalities interact on the same information during correlation computation. In contrast, we focus on aligning the information representation of embeddings to enhance the rationality of correlation computation.

Local-based matching. Learning semantic alignment between local embeddings from image regions and text words is popular and offers better interpretability than global methods. Karpathy (Karpathy and Fei-Fei, 2015) makes the first attempt to infer matching between regions and words by aggregating similarities across all regions and words to obtain the correlation between image and text. One line of work focuses on constructing thoughtful aggregation rules to find the important region-word pairs. Chen (Chen et al., 2020a) proposes recurrent cross-attention to iteratively refine and elaborate shared semantics across different levels. Zhang (Zhang et al., 2022b) introduces negative-aware attention on unmatched pairs to enhance matching accuracy. Pan (Pan et al., 2023) argues that effective image-text semantic matching can be achieved by relying solely on the maximum region-word correlation and provides a theoretical derivation. Another line of work focuses on exploiting more information. Wang (Wang et al., 2020a) introduces scene graphs during matching to enrich the relationships between local embeddings. Additionally, models combining consensus knowledge (Wang et al., 2020b) and external pre-training knowledge (Wei et al., 2020; Qu et al., 2021) have been employed to enhance cross-modal alignment. However, they still rarely consider the differences in information representation across dimensions caused by the modality gap. As mentioned earlier, we bridge the modality gap by aligning the information representation of embeddings.

Figure 2. Overview of DIAS, which mainly contains two steps: local embedding interaction and global embedding interaction. First, DIAS extracts features from image regions and text words to construct local embeddings, and performs dimension information alignment to adjust the information representation of the embeddings in different dimensions ($\mathcal{L}_{dim}$). Then, DIAS aggregates the local embeddings to construct global embeddings. Inter- and intra-modality spatial constraints are obtained from the distance relationships between global embeddings to suppress the influence of the modality gap, and the sparse correlation algorithm is used to select the strongly correlated spatial relationships ($\mathcal{L}_{inter}$ and $\mathcal{L}_{intra}$). Finally, the image-text relevance is inferred via a contrastive learning loss function ($\mathcal{L}_{loc}$).

3. Methodology

Considering effectiveness and interpretability, DIAS adopts the local-based matching method. In this section, we introduce the framework of the local-based matching method (Sec. 3.1) and the details of DIAS. As shown in Fig. 2, DIAS first performs dimension information alignment to adjust the information representation of the embeddings in different dimensions (Sec. 3.2). Then inter- and intra-modality spatial constraints are introduced to suppress the influence of the modality gap (Sec. 3.3), and the sparse correlation algorithm is used to select the strongly correlated spatial relationships (Sec. 3.4).

3.1. The Framework of Local-based Matching

Formally, given an image $\mathbf{V}$, we use Faster-RCNN (Ren et al., 2016) to extract the salient regions and obtain the local image embeddings $\mathbf{V}=\{\mathbf{v}_i \mid i\in[1,n_v],\ \mathbf{v}_i\in\mathbb{R}^{d}\}$ by the pre-trained ResNet-101 (He et al., 2016). $\mathbf{v}_i$ is the local embedding of the $i$-th region, and $n_v$ denotes the number of regions. Similarly, given a text $\mathbf{T}$, we employ Bidirectional Gated Recurrent Units (BiGRU) (Schuster and Paliwal, 1997) or BERT (Devlin et al., 2018) to extract the local text embeddings $\mathbf{T}=\{\mathbf{t}_j \mid j\in[1,n_t],\ \mathbf{t}_j\in\mathbb{R}^{d}\}$. $\mathbf{t}_j$ is the local embedding of the $j$-th word, and $n_t$ denotes the number of words.

Local-based matching first conducts local embedding interaction to update the local embeddings based on the correlation between regions and words. The updating of $\mathbf{v}_i$ can be described as follows:

$$\begin{aligned}
\hat{\mathbf{v}}_i &= \frac{\sum_{j=1}^{n_t} s_{i,j}\,\mathbf{t}_j}{\sum_{j=1}^{n_t} s_{i,j}},\quad i\in[1,n_v] \\
s_{i,j} &= \sigma_l(\mathbf{v}_i,\mathbf{t}_j)
\end{aligned} \tag{1}$$

Here $\hat{\mathbf{v}}_i$ represents the new local embedding. $\sigma_l(\cdot)$ is the correlation function for local embeddings. $s_{i,j}$ is the correlation value between $\mathbf{v}_i$ and $\mathbf{t}_j$. Then, local embeddings are transformed into global embeddings by pooling, formally as:

$$\hat{\mathbf{V}} = pool(\{\hat{\mathbf{v}}_i \mid i\in[1,n_v]\}) \tag{2}$$

Here $\hat{\mathbf{V}}$ is the global embedding of image $\mathbf{V}$, and $pool(\cdot)$ denotes the pooling operation. Through a similar process, we can obtain the local embedding $\hat{\mathbf{t}}_j$ of each word and the global embedding $\hat{\mathbf{T}}$ of the text.
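The local interaction in Eq. 1 and the pooling in Eq. 2 can be sketched as follows. This is a minimal illustration under stated assumptions: $\sigma_l$ is taken to be cosine similarity, $pool(\cdot)$ is taken to be mean pooling, the correlation weights of each region sum to a nonzero value, and the names `update_regions` and `mean_pool` are hypothetical. The paper leaves both functions abstract at this point.

```python
import math

def cosine(a, b):
    """Assumed choice for the local correlation function sigma_l."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def update_regions(regions, words, corr=cosine):
    """Eq. (1): each region becomes a correlation-weighted mean of the word embeddings."""
    dim = len(words[0])
    updated = []
    for v in regions:
        weights = [corr(v, t) for t in words]   # s_{i,j} for a fixed region i
        total = sum(weights)                    # assumed nonzero in this sketch
        updated.append(
            [sum(w * t[k] for w, t in zip(weights, words)) / total for k in range(dim)]
        )
    return updated

def mean_pool(local_embs):
    """Eq. (2) with mean pooling as an assumed instance of pool(.)."""
    n = len(local_embs)
    return [sum(e[k] for e in local_embs) / n for k in range(len(local_embs[0]))]

# Toy data: two regions and two words in a 2-dimensional space.
regions = [[1.0, 0.0], [0.0, 1.0]]
words = [[1.0, 0.0], [1.0, 1.0]]

updated_regions = update_regions(regions, words)
global_image = mean_pool(updated_regions)
print(updated_regions, global_image)
```

The same two functions, called with the roles of regions and words swapped, give the updated word embeddings and the global text embedding.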

The correlation between image and text is obtained based on global embedding interaction. The triplet loss is the most commonly used method for achieving semantic alignment, and the objective function can be expressed as:

$$\mathcal{L}_{loc} = [\alpha - \sigma_g(\hat{\mathbf{V}},\hat{\mathbf{T}}) + \sigma_g(\hat{\mathbf{V}},\hat{\mathbf{T}}^{-})]_{+} + [\alpha - \sigma_g(\hat{\mathbf{V}},\hat{\mathbf{T}}) + \sigma_g(\hat{\mathbf{V}}^{-},\hat{\mathbf{T}})]_{+} \tag{3}$$

Here $\alpha$ is a margin parameter, and $[\cdot]_{+}=max(\cdot,0)$. $\sigma_g$ is the correlation function for instances. $(\hat{\mathbf{V}},\hat{\mathbf{T}})$ is a positive image-text pair, and $(\hat{\mathbf{V}},\hat{\mathbf{T}}^{-})$ and $(\hat{\mathbf{V}}^{-},\hat{\mathbf{T}})$ are negative image-text pairs in the batch. We use distance-weighted sampling (Wu et al., 2017) for hard negative mining.
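As a small worked illustration of Eq. 3, the sketch below evaluates the loss for one positive pair and one negative of each kind. The correlation values are taken as given, so negative selection (distance-weighted sampling in the paper) is outside its scope. The margin value 0.2 and the function names are assumptions made for the example.

```python
def hinge(x):
    """[x]_+ = max(x, 0)."""
    return max(x, 0.0)

def triplet_loss(sim_pos, sim_neg_text, sim_neg_image, margin=0.2):
    """Eq. (3) for one positive pair, one negative text and one negative image.

    sim_pos       : sigma_g(V, T)   correlation of the matched pair
    sim_neg_text  : sigma_g(V, T^-) correlation with a negative text
    sim_neg_image : sigma_g(V^-, T) correlation with a negative image
    margin        : alpha (0.2 is an assumed value, not taken from the paper)
    """
    return hinge(margin - sim_pos + sim_neg_text) + hinge(margin - sim_pos + sim_neg_image)

# The negative text is too close to the image (0.7 vs 0.8), so it is penalized.
loss_violated = triplet_loss(sim_pos=0.8, sim_neg_text=0.7, sim_neg_image=0.3)

# Both negatives are farther than the margin, so the loss vanishes.
loss_satisfied = triplet_loss(sim_pos=0.9, sim_neg_text=0.1, sim_neg_image=0.2)
print(loss_violated, loss_satisfied)
```

Once every negative correlation is at least $\alpha$ below the positive one, the loss is zero and the unmatched pairs receive no further gradient.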

3.2. Dimension Information Alignment

The correlation calculation in Eq. 1 involves the cross-modal interaction of embeddings in corresponding dimensions. As mentioned earlier, because they come from different sources, $\mathbf{v}_i$ and $\mathbf{t}_j$ differ significantly in the information they represent in different dimensions. Their interaction can result in calculation biases and a lack of rationality. Thus, we propose a dimension information alignment method that uses a regularizer to align the information representation before the interaction. It improves the correlation of $\mathbf{v}_i$ and $\mathbf{t}_j$ in corresponding dimensions. Meanwhile, to suppress the feature redundancy that may occur during the alignment, the regularizer also reduces the correlation values between non-corresponding dimensions. Below is a detailed introduction to this process.

Figure 3. Illustration of dimension information alignment. We extract the dimension vector of each dimension, and construct the correlation matrix by calculating the correlation between dimension vectors from different modalities. The proposed regularizer is applied to the correlation matrix to align the information representation of each dimension.

Assume there are $N$ image-text pairs. As shown in Fig. 3, we first extract the dimension vectors of all local embeddings and integrate them into $\mathbf{m}^{V}=\{\mathbf{m}_i^{V} \mid i\in[1,d],\ \mathbf{m}_i^{V}\in\mathbb{R}^{N_V}\}$ and $\mathbf{m}^{T}=\{\mathbf{m}_j^{T} \mid j\in[1,d],\ \mathbf{m}_j^{T}\in\mathbb{R}^{N_T}\}$, respectively. Here $\mathbf{m}_i^{V}$ contains the information distribution of all local image embeddings in the $i$-th dimension, and $\mathbf{m}_j^{T}$ contains the information distribution of all local text embeddings in the $j$-th dimension. Because the number of regions varies across images and the number of words varies across texts, we use $N_V$ and $N_T$ to denote the total number of regions and words, respectively. Then, we compute the correlation between $\mathbf{m}_i^{V}$ and $\mathbf{m}_j^{T}$, formally as:

$$c_{i,j} = \sigma_c(\mathbf{m}_i^{V},\mathbf{m}_j^{T}),\quad i,j\in[1,d] \tag{4}$$

Here $c_{i,j}$ is the correlation value of $\mathbf{m}_i^{V}$ and $\mathbf{m}_j^{T}$. $\sigma_c$ denotes the correlation algorithm for dimension vectors. The correlation matrix $\mathbf{C}=\{c_{i,j} \mid i,j\in[1,d]\}$ can be obtained via Eq. 4.
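The construction of the $d \times d$ correlation matrix in Eq. 4 can be sketched as below. Two simplifying assumptions are made purely for illustration and are not confirmed by this section. First, $\sigma_c$ is taken to be cosine similarity. Second, both modalities contribute the same number of local embeddings ($N_V = N_T$), so that dimension vectors have equal length. The helper names are hypothetical.

```python
import math

def cosine(a, b):
    """Assumed choice for sigma_c; requires vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def dimension_vector(embeddings, k):
    """Collect the k-th coordinate of every local embedding (one column of the stack)."""
    return [row[k] for row in embeddings]

def dimension_correlation(image_embs, text_embs, corr=cosine):
    """Eq. (4): C[i][j] = sigma_c(m_i^V, m_j^T) for all pairs of dimensions."""
    d = len(image_embs[0])
    return [
        [corr(dimension_vector(image_embs, i), dimension_vector(text_embs, j))
         for j in range(d)]
        for i in range(d)
    ]

# Toy stacks of local embeddings: 3 regions and 3 words, d = 2.
image_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_embs = [[2.0, 0.0], [0.0, 3.0], [2.0, 3.0]]

C = dimension_correlation(image_embs, text_embs)
print(C)
```

In this toy case the diagonal entries are the largest in their rows and columns, which is the situation the regularizer introduced next tries to encourage.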

Then, we use a regularizer to improve the correlation of corresponding dimensions and reduce the correlation between non-corresponding dimensions. For ease of understanding, we assume the corresponding dimensions are in the same column of the embeddings. This means the corresponding dimension of $\mathbf{m}_i^{V}$ is $\mathbf{m}_i^{T}$, and $c_{i,i}$ is their correlation value. The regularizer can be expressed as:

$$\mathcal{L}_{dim} = -\sum_{i=1}^{d} c_{i,i} + \sum_{i=1}^{d}\sum_{j=1,\,j\neq i}^{d} c_{i,j} \tag{5}$$

The first term of Eq. 5 mainly aligns the corresponding dimensions, and the second term misaligns the non-corresponding dimensions. This formulation is relatively intuitive, but it fails to account for the magnitude differences between rows or columns of $\mathbf{C}$, potentially leading to computational bias. Therefore, we improve it to the following formula:

$$\mathcal{L}_{dim} = -\sum_{i=1}^{d}\left(\frac{c_{i,i}}{\sum_{j=1}^{d} c_{i,j}} + \frac{c_{i,i}}{\sum_{j=1}^{d} c_{j,i}}\right) \tag{6}$$

As shown in Eq. 6, the regularizer increases the proportion of $c_{i,i}$ in the corresponding rows and columns of $\mathbf{C}$, avoiding the impact of inconsistent orders of magnitude.
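The difference between Eq. 5 and Eq. 6 can be checked numerically. The sketch below implements both regularizers for a correlation matrix given as a nested list. It assumes that every row sum and column sum of $\mathbf{C}$ is nonzero, and the function names are hypothetical.

```python
def dim_loss_simple(C):
    """Eq. (5): reward diagonal entries, penalize off-diagonal entries."""
    d = len(C)
    diagonal = sum(C[i][i] for i in range(d))
    off_diagonal = sum(C[i][j] for i in range(d) for j in range(d) if i != j)
    return -diagonal + off_diagonal

def dim_loss_normalized(C):
    """Eq. (6): reward the share of c_ii within its own row and its own column."""
    d = len(C)
    loss = 0.0
    for i in range(d):
        row_sum = sum(C[i][j] for j in range(d))   # assumed nonzero
        col_sum = sum(C[j][i] for j in range(d))   # assumed nonzero
        loss -= C[i][i] / row_sum + C[i][i] / col_sum
    return loss

# Toy correlation matrix and the same matrix with every entry scaled by 10.
C_small = [[1.0, 0.5], [0.5, 1.0]]
C_large = [[10.0, 5.0], [5.0, 10.0]]

simple_small, simple_large = dim_loss_simple(C_small), dim_loss_simple(C_large)
norm_small, norm_large = dim_loss_normalized(C_small), dim_loss_normalized(C_large)
print(simple_small, simple_large, norm_small, norm_large)
```

Scaling the whole matrix changes the value of Eq. 5 by the same factor, whereas Eq. 6 depends only on ratios and is unchanged in this example.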

After aligning the dimension information, the process of aggregating and updating the local embeddings in Eq. 1 generates more reasonable correlations. Moreover, the information representations of $\hat{\mathbf{V}}$ and $\hat{\mathbf{T}}$ in the corresponding dimensions obtained by Eq. 2 are also more similar.

3.3. Spatial Constraint

After obtaining the global embeddings $\hat{\mathbf{V}}$ and $\hat{\mathbf{T}}$, we calculate their correlation and use the loss function (Eq. 3) to achieve semantic alignment. For each instance, the number of unmatched instances far exceeds the number of matched instances. Existing methods often impose stronger constraints on matched pairs and weaker constraints on unmatched pairs. For example, Eq. 3 requires the correlation of a matched pair to exceed that of every unmatched pair, while an unmatched pair only needs its correlation to fall below that of the matched pair by the margin $\alpha$. To ensure the effectiveness of semantic alignment, we propose two spatial constraint regularizers to strengthen the constraint on unmatched pairs: the inter-modality and intra-modality constraints.

Figure 4. Histogram statistics of the spatial distance between instances within and across modalities. We randomly select some images and texts, calculate their distances, and observe the distribution pattern. The inter- and intra-modality distance distributions approach a normal distribution. The embeddings used for the computation are from the state-of-the-art method (Zhang et al., 2024).
Figure 5. Illustration of the sparse correlation algorithm. We obtain the spatial matrix $\mathbf{L}_x$, and the model learns a soft threshold based on the conditional probability to select strong correlations for each instance.

On the one hand, we aim to maintain semantic consistency by pursuing spatial distance consistency of inter-modality unmatched pairs. Concretely, we compute the distance of all global embeddings between different modalities:

$$x_{i,j} = \sigma_x(\hat{\mathbf{V}}_i,\hat{\mathbf{T}}_j),\quad i,j\in[1,N] \tag{7}$$

Here $\hat{\textbf{V}}_{i}$ is the global embedding of the $i$-th image, and $\hat{\textbf{T}}_{j}$ is the global embedding of the $j$-th text. $N$ is the number of image-text pairs, and we assume the matched pair of $\hat{\textbf{V}}_{i}$ is $\hat{\textbf{T}}_{i}$. $\sigma_{x}$ is the distance function. $x_{i,j}$ is the spatial distance between $\hat{\textbf{V}}_{i}$ and $\hat{\textbf{T}}_{j}$. We combine all $x_{i,j}$ to construct the spatial matrix $\textbf{X}=\{x_{i,j}|i,j\in[1,N]\}$. The regularizer for inter-modality unmatched pairs is as follows:

(8) $\begin{aligned}\mathcal{L}_{inter}&=||\textbf{L}_{x}||^{2}_{2}=||\textbf{X}-\textbf{X}^{\top}||^{2}_{2}=\sum_{i=1}^{N}\sum_{j=1}^{N}(x_{i,j}-x_{j,i})^{2}\\
&=\sum_{i=1}^{N}\sum_{j=1}^{N}(\sigma_{x}(\hat{\textbf{V}}_{i},\hat{\textbf{T}}_{j})-\sigma_{x}(\hat{\textbf{V}}_{j},\hat{\textbf{T}}_{i}))^{2}\end{aligned}$

Here $\textbf{L}_{x}=|\textbf{X}-\textbf{X}^{\top}|$ is the inter-modality spatial matrix to be optimized. This regularizer imposes a strong distance constraint only on unmatched pairs (the matched-pair terms with $i=j$ vanish), which partially compensates for the shortcomings of Eq.3. The regularizer can effectively reduce the model's sensitivity and enhance its robustness and generalization when handling diverse modality data. However, it still operates on inter-modality embeddings, which are limited by the modality gap.
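The inter-modality regularizer of Eq.8 can be sketched in a few lines of NumPy. This is only an illustrative sketch: the distance function $\sigma_{x}$ is assumed here to be cosine distance (one minus cosine similarity), and the function names `cosine_distance` and `inter_modality_loss` are hypothetical, not taken from the released code.

```python
import numpy as np

def cosine_distance(a, b):
    """Pairwise cosine distance between the rows of a and the rows of b.
    Assumption: sigma_x is 1 - cosine similarity."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a @ b.T

def inter_modality_loss(v_hat, t_hat):
    """Eq.8: sum over i, j of (x_ij - x_ji)^2, with x_ij = sigma_x(V_i, T_j)."""
    x = cosine_distance(v_hat, t_hat)   # spatial matrix X, shape (N, N)
    l_x = np.abs(x - x.T)               # L_x = |X - X^T|
    return float(np.sum(l_x ** 2)), l_x

# Toy batch: N = 4 image-text pairs with 8-dimensional global embeddings.
rng = np.random.default_rng(0)
v_hat = rng.normal(size=(4, 8))
t_hat = rng.normal(size=(4, 8))
loss, l_x = inter_modality_loss(v_hat, t_hat)
```

The diagonal of `l_x` is zero by construction, so matched pairs contribute nothing and only unmatched pairs are penalized.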

So, on the other hand, we aim to maintain structure consistency of different modalities by pursuing spatial distance consistency of intra-modality unmatched pairs. We compute the distance of all global embeddings in each modality:

(9) $\begin{aligned}y_{i,j}&=\sigma_{y}(\hat{\textbf{V}}_{i},\hat{\textbf{V}}_{j}),\ \ i,j\in[1,N]\\
z_{i,j}&=\sigma_{z}(\hat{\textbf{T}}_{i},\hat{\textbf{T}}_{j})\end{aligned}$

Here $y_{i,j}$ is the spatial distance between $\hat{\textbf{V}}_{i}$ and $\hat{\textbf{V}}_{j}$, and $z_{i,j}$ is the spatial distance between $\hat{\textbf{T}}_{i}$ and $\hat{\textbf{T}}_{j}$. $\sigma_{y}$ and $\sigma_{z}$ are the distance functions of images and texts, respectively. We combine all $y_{i,j}$ to construct $\textbf{Y}=\{y_{i,j}|i,j\in[1,N]\}$ and all $z_{i,j}$ to construct $\textbf{Z}=\{z_{i,j}|i,j\in[1,N]\}$. The regularizer for intra-modality unmatched pairs is as follows:

(10) $\begin{aligned}\mathcal{L}_{intra}&=||\textbf{L}_{yz}||^{2}_{2}=||\textbf{Y}-\textbf{Z}||^{2}_{2}=\sum_{i=1}^{N}\sum_{j=1}^{N}(y_{i,j}-z_{i,j})^{2}\\
&=\sum_{i=1}^{N}\sum_{j=1}^{N}(\sigma_{y}(\hat{\textbf{V}}_{i},\hat{\textbf{V}}_{j})-\sigma_{z}(\hat{\textbf{T}}_{i},\hat{\textbf{T}}_{j}))^{2}\end{aligned}$

Here $\textbf{L}_{yz}=|\textbf{Y}-\textbf{Z}|$ is the intra-modality spatial matrix to be optimized. This regularizer constrains embeddings of different modalities to have the same spatial relationships, enhancing the consistency of their spatial structure. Furthermore, it only processes embeddings within each modality and is therefore not affected by the modality gap. Even if the modality gap is not completely eliminated, this regularizer can still encourage our model to learn effective features.
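A small NumPy sketch of Eq.10 illustrates why the intra-modality term is insensitive to the modality gap. It assumes cosine distance for both $\sigma_{y}$ and $\sigma_{z}$; the helper names are hypothetical. If the text embeddings are a rotated copy of the image embeddings (a simple stand-in for a modality gap), the pairwise structure is identical and the loss is numerically zero.

```python
import numpy as np

def pairwise_cosine_distance(e):
    """Cosine distance between every pair of rows of e (assumed sigma_y / sigma_z)."""
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    return 1.0 - e @ e.T

def intra_modality_loss(v_hat, t_hat):
    """Eq.10: sum over i, j of (y_ij - z_ij)^2."""
    y = pairwise_cosine_distance(v_hat)   # image-image distances Y
    z = pairwise_cosine_distance(t_hat)   # text-text distances Z
    l_yz = np.abs(y - z)
    return float(np.sum(l_yz ** 2)), l_yz

rng = np.random.default_rng(1)
v_hat = rng.normal(size=(5, 6))
# An orthogonal matrix models a rigid offset between the two embedding spaces.
rotation, _ = np.linalg.qr(rng.normal(size=(6, 6)))
loss_rotated, _ = intra_modality_loss(v_hat, v_hat @ rotation)
loss_random, _ = intra_modality_loss(v_hat, rng.normal(size=(5, 6)))
```

Here `loss_rotated` vanishes up to floating-point error, whereas unrelated text embeddings (`loss_random`) give a clearly positive value.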

3.4. Sparse Correlation Algorithm

The spatial constraints assume the spatial relationships between images and texts exhibit symmetry, but this assumption is not always valid. These inaccurate relationships can affect the performance of the model. Fig. 4 shows the distribution of spatial distance between instances within and across modalities. It can be observed that the relationships between instances are mostly weakly correlated, and these relationships have little effect on characterizing the spatial position of instances. Considering the effectiveness and efficiency, we propose a sparse correlation algorithm. This algorithm concentrates spatial constraints on strong correlation relationships to capture more significant and important features. More importantly, this algorithm can reduce the need for embedding symmetry, making it more flexible.

The key issue is how to determine which instances exhibit strong correlations. The correlation distribution of different instances varies greatly, making it unsuitable to set a unified hard-threshold to distinguish strong and weak correlations. Therefore, we propose a sparse correlation algorithm to adaptively distinguish strong and weak correlations based on the situation of the instance itself. This algorithm builds conditional probabilities of correlation and uses them to obtain the soft-threshold, as shown in Fig. 5. Taking matrix $\textbf{L}_{x}$ as an example, we set its $i$-th row vector as $\textbf{l}_{i}^{V}=\{l_{i,j}^{V}|j\in[1,N]\}$, where $l_{i,j}^{V}=|x_{i,j}-x_{j,i}|$. $\textbf{l}_{i}^{V}$ indicates the correlation between image $\textbf{V}_{i}$ and all texts. Similarly, we set the $j$-th column vector of $\textbf{L}_{x}$ as $\textbf{l}_{j}^{T}=\{l_{j,i}^{T}|i\in[1,N]\}$, where $l_{j,i}^{T}=|x_{j,i}-x_{i,j}|$. To quantify this explicitly, we represent the conditional probability of each image as:

(11) $p(l_{i}^{V}|l_{j}^{T})=\mathrm{Sigmoid}(-|x_{i,j}-x_{j,i}|),\ \ j\in[1,N]$

Here $p(l_{i}^{V}|l_{j}^{T})\in[0,1]$ represents the dependency degree of $\hat{\textbf{V}}_{i}$ on $\hat{\textbf{T}}_{j}$. A larger value of $p(l_{i}^{V}|l_{j}^{T})$ indicates a stronger dependency. We expect the model to discover strong correlations for each image and text based on the latent semantics of $\textbf{L}_{x}$, to avoid interference from other weakly correlated instances and to be as concise as possible. Specifically, we observe that the histogram of the conditional probabilities $\{p(l_{i}^{V}|l_{j}^{T})\}_{j=1}^{d}$ approximates a normal distribution, as shown in Fig. 4. Therefore, based on the statistical features of the conditional probabilities, we can enable the model to learn a soft-threshold for distinguishing strong and weak correlations for each instance:

(12) $\kappa^{V}_{i}=\mu_{i}+\beta_{i}\cdot\theta_{i}$

Here $\kappa^{V}_{i}$ is the soft-threshold of $\textbf{l}_{i}^{V}$. $\mu_{i}$ and $\theta_{i}$ are the mean and standard deviation of the probability values sampled from $\{p(l_{i}^{V}|l_{j}^{T})\}_{j=1}^{d}$, respectively. $\beta_{i}$ is a learnable parameter that adjusts the degree of sparsity.
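Eqs.11 and 12 can be sketched as follows. The sketch makes two assumptions that are not stated in the text: the statistics $\mu_{i}$ and $\theta_{i}$ are taken over all $N$ entries of row $i$, and the learnable $\beta_{i}$ is replaced by a fixed array. The function names are hypothetical. Because the argument of the sigmoid in Eq.11 is never positive, the computed probabilities fall in $(0, 0.5]$, with $0.5$ reached when $x_{i,j}=x_{j,i}$.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def soft_thresholds(l_x, beta):
    """Row-wise soft-thresholds kappa_i = mu_i + beta_i * theta_i (Eq.12).

    l_x  : (N, N) matrix with entries |x_ij - x_ji|
    beta : (N,) sparsity parameters (learnable in the paper, fixed here)
    """
    p = sigmoid(-l_x)            # Eq.11, p[i, j] = p(l_i^V | l_j^T)
    mu = p.mean(axis=1)          # per-row mean
    theta = p.std(axis=1)        # per-row standard deviation
    return mu + beta * theta, p

# Toy symmetric L_x with a zero diagonal.
l_x = np.array([[0.0, 0.2, 0.9],
                [0.2, 0.0, 0.4],
                [0.9, 0.4, 0.0]])
kappa_v, p = soft_thresholds(l_x, beta=np.ones(3))
```

A larger `beta` raises the threshold and therefore keeps fewer relationships, which is how the parameter controls sparsity.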

We combine all soft-thresholds as $\textbf{K}^{V}=\{\kappa^{V}_{i}|i\in[1,N]\}$, and obtain $\textbf{K}^{T}=\{\kappa^{T}_{j}|j\in[1,N]\}$ in a similar process. It is important to note that $\textbf{K}^{V}$ and $\textbf{K}^{T}$ are distinct. For example, image $\textbf{V}_{i}$ may exhibit a high dependency degree on text $\textbf{T}_{j}$, but $\textbf{T}_{j}$ may not necessarily have a high dependency degree on $\textbf{V}_{i}$. To avoid introducing weakly correlated information, we select spatial relationships that meet the requirements of both $\textbf{K}^{V}$ and $\textbf{K}^{T}$:

(13) $\hat{\textbf{L}}_{x}=\textbf{B}_{x}\textbf{L}_{x}$

Here $\textbf{B}_{x}=\{b^{x}_{i,j}|i,j\in[1,N]\}$ is a binary mask matrix that keeps the strong correlation relationships, and:

(14) $b^{x}_{i,j}=\begin{cases}1,&|x_{i,j}-x_{j,i}|>\max(\kappa^{V}_{i},\kappa^{T}_{j})\\0,&\text{otherwise}\end{cases}$

$\max(\cdot)$ is the function for calculating the maximum value. Based on the sparse inter-modality spatial matrix $\hat{\textbf{L}}_{x}$, we update Eq.8 as:

(15) $\begin{aligned}\mathcal{L}_{inter}&=||\hat{\textbf{L}}_{x}||^{2}_{2}=\textbf{B}_{x}||\textbf{X}-\textbf{X}^{\top}||^{2}_{2}=\sum_{i=1}^{N}\sum_{j=1}^{N}b^{x}_{i,j}(x_{i,j}-x_{j,i})^{2}\\
&=\sum_{i=1}^{N}\sum_{j=1}^{N}b^{x}_{i,j}(\sigma_{x}(\hat{\textbf{V}}_{i},\hat{\textbf{T}}_{j})-\sigma_{x}(\hat{\textbf{V}}_{j},\hat{\textbf{T}}_{i}))^{2}\end{aligned}$
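The masking step of Eqs.13 to 15 can be illustrated with a short NumPy sketch. It follows Eq.14 literally, comparing $|x_{i,j}-x_{j,i}|$ against $\max(\kappa^{V}_{i},\kappa^{T}_{j})$, and reads the product in Eq.13 as element-wise, which is what the double sum in Eq.15 implies. The function name and the threshold values below are hypothetical, chosen only to make the example easy to follow.

```python
import numpy as np

def sparse_inter_loss(l_x, kappa_v, kappa_t):
    """Masked inter-modality loss.

    l_x     : (N, N) matrix with entries |x_ij - x_ji|
    kappa_v : (N,) soft-thresholds for images (rows)
    kappa_t : (N,) soft-thresholds for texts (columns)
    """
    # threshold[i, j] = max(kappa_v[i], kappa_t[j])
    threshold = np.maximum(kappa_v[:, None], kappa_t[None, :])
    b_x = (l_x > threshold).astype(l_x.dtype)   # Eq.14, binary mask
    l_hat = b_x * l_x                           # Eq.13, element-wise reading
    return float(np.sum(l_hat ** 2)), b_x       # Eq.15

l_x = np.array([[0.0, 0.2, 0.9],
                [0.2, 0.0, 0.4],
                [0.9, 0.4, 0.0]])
kappa_v = np.array([0.5, 0.3, 0.5])   # illustrative values only
kappa_t = np.array([0.5, 0.3, 0.5])
loss, b_x = sparse_inter_loss(l_x, kappa_v, kappa_t)
```

In this toy case only the pair (0, 2) and its mirror (2, 0) exceed both thresholds, so the loss reduces to $2 \times 0.9^{2} = 1.62$.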

By performing similar operations on $\textbf{L}_{yz}$, we can obtain the sparse intra-modality spatial matrix $\hat{\textbf{L}}_{yz}$ and update Eq.10 as:

(16) $\begin{aligned}\mathcal{L}_{intra}&=||\hat{\textbf{L}}_{yz}||^{2}_{2}=\textbf{B}_{yz}||\textbf{Y}-\textbf{Z}||^{2}_{2}=\sum_{i=1}^{N}\sum_{j=1}^{N}b^{yz}_{i,j}(y_{i,j}-z_{i,j})^{2}\\
&=\sum_{i=1}^{N}\sum_{j=1}^{N}b^{yz}_{i,j}(\sigma_{y}(\hat{\textbf{V}}_{i},\hat{\textbf{V}}_{j})-\sigma_{z}(\hat{\textbf{T}}_{i},\hat{\textbf{T}}_{j}))^{2}\end{aligned}$

Here $\textbf{B}_{yz}=\{b^{yz}_{i,j}|i,j\in[1,N]\}$ is the corresponding binary mask matrix.

3.5. Objective Function

We combine the proposed regularization terms with the triplet loss to obtain the loss function of DIAS:

(17) $\mathcal{L}=\mathcal{L}_{loc}+\omega_{dim}\mathcal{L}_{dim}+\omega_{inter}\mathcal{L}_{inter}+\omega_{intra}\mathcal{L}_{intra}$

Here $\omega_{dim}$, $\omega_{inter}$ and $\omega_{intra}$ are hyper-parameters that control the weight of each term. To ensure effective cross-modal interactions, we use neighbor sampling instead of random sampling for batches. First, we apply K-means (Steinley, 2006; Ma et al., 2023) clustering on the local image embeddings. Then, we randomly select $M$ clusters and choose $P$ images from each cluster. Finally, we pair each image with a positive text instance and obtain $N=P\times K$ image-text pairs for each batch.
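The neighbor sampling step can be sketched as below. The sketch assumes that cluster labels have already been produced by K-means and are passed in as an integer array. The function name, the eligibility rule for clusters with fewer than $P$ members, and the toy sizes are all assumptions made for illustration.

```python
import numpy as np

def sample_neighbor_batch(cluster_labels, num_clusters, per_cluster, rng):
    """Pick num_clusters (M) clusters at random, then per_cluster (P) images from each.

    cluster_labels : (num_images,) cluster id of every image, e.g. from K-means
    Returns the indices of the sampled images.
    """
    clusters = np.unique(cluster_labels)
    # Assumption: only clusters with at least P members can be selected.
    eligible = [c for c in clusters if np.sum(cluster_labels == c) >= per_cluster]
    chosen = rng.choice(eligible, size=num_clusters, replace=False)
    batch = []
    for c in chosen:
        members = np.flatnonzero(cluster_labels == c)
        batch.extend(rng.choice(members, size=per_cluster, replace=False))
    return np.array(batch)

# Toy setting: 50 images in 5 clusters of 10; sample M = 3 clusters, P = 4 images each.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(5), 10)
batch = sample_neighbor_batch(labels, num_clusters=3, per_cluster=4, rng=rng)
```

Each sampled image index would then be paired with one of its positive texts to form the batch.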

4. Experiments

4.1. Experimental Setup

Table 1. Comparisons with state-of-the-art methods on the Flickr30k and MSCOCO 1K test sets. BUTD denotes using Faster-RCNN (Chen et al., 2021) to extract local image embeddings. BiGRU and BERT denote using BiGRU (Schuster and Paliwal, 1997) or BERT (Devlin et al., 2018) to extract local text embeddings. * denotes the ensemble results of two models. The best results are in bold.
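| Backbone | Methods | Flickr30K IMG→TEXT R@1 | Flickr30K IMG→TEXT R@5 | Flickr30K IMG→TEXT R@10 | Flickr30K TEXT→IMG R@1 | Flickr30K TEXT→IMG R@5 | Flickr30K TEXT→IMG R@10 | Flickr30K rSum | MSCOCO 1K IMG→TEXT R@1 | MSCOCO 1K IMG→TEXT R@5 | MSCOCO 1K IMG→TEXT R@10 | MSCOCO 1K TEXT→IMG R@1 | MSCOCO 1K TEXT→IMG R@5 | MSCOCO 1K TEXT→IMG R@10 | MSCOCO 1K rSum |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BUTD+BiGRU | GSMN* (2020) (Liu et al., 2020) | 76.4 | 94.3 | 97.3 | 57.4 | 82.3 | 89.0 | 496.8 | 78.4 | 96.4 | 98.6 | 63.3 | 90.1 | 95.7 | 522.5 |
| BUTD+BiGRU | GPO (2021) (Chen et al., 2021) | 76.5 | 94.2 | 97.7 | 56.4 | 83.4 | 89.9 | 498.1 | 78.5 | 96.0 | 98.7 | 61.7 | 90.3 | 95.6 | 520.8 |
| BUTD+BiGRU | MV (2022) (Li et al., 2022a) | 79.0 | 94.9 | 97.7 | 59.1 | 84.6 | 90.6 | 505.8 | 78.7 | 95.7 | 98.7 | 62.7 | 90.4 | 95.7 | 521.9 |
| BUTD+BiGRU | NAAF* (2022) (Zhang et al., 2022b) | 81.9 | 96.1 | 98.3 | 61.0 | 85.3 | 90.6 | 513.2 | 80.5 | 96.5 | 98.8 | 64.1 | 90.7 | 96.5 | 527.2 |
| BUTD+BiGRU | CHAN (2023) (Pan et al., 2023) | 79.7 | 94.5 | 97.3 | 60.2 | 85.3 | 90.7 | 507.8 | 79.7 | 96.7 | 98.7 | 63.8 | 90.4 | 95.8 | 525.0 |
| BUTD+BiGRU | NUIF-d (2024) (Zhang et al., 2024) | 81.8 | 94.7 | 97.6 | 59.4 | 85.6 | 91.1 | 509.3 | 80.6 | 96.3 | 98.8 | 64.7 | 91.4 | 96.2 | 528.0 |
| BUTD+BiGRU | DIAS | 81.8 | 96.1 | 98.6 | 60.7 | 84.9 | 91.3 | 513.4 | 81.3 | 96.8 | 98.9 | 64.9 | 90.4 | 95.9 | 528.2 |
| BUTD+BERT | DSRAN (2020) (Wen et al., 2020) | 77.8 | 95.1 | 97.6 | 59.2 | 86.0 | 91.9 | 507.6 | 78.3 | 95.7 | 98.4 | 64.5 | 90.8 | 95.8 | 523.5 |
| BUTD+BERT | VSRN++* (2022) (Li et al., 2022d) | 79.2 | 94.6 | 97.5 | 60.6 | 85.6 | 91.4 | 508.9 | 77.9 | 96.0 | 98.5 | 64.1 | 91.0 | 96.1 | 523.6 |
| BUTD+BERT | MV (2022) (Li et al., 2022a) | 82.1 | 95.8 | 97.9 | 63.1 | 86.7 | 92.3 | 517.5 | 80.4 | 96.6 | 99.0 | 64.9 | 91.2 | 96.0 | 528.1 |
| BUTD+BERT | CHAN (2023) (Pan et al., 2023) | 80.6 | 96.1 | 97.8 | 63.9 | 87.5 | 92.6 | 518.5 | 81.4 | 96.9 | 98.9 | 66.5 | 92.1 | 96.7 | 532.6 |
| BUTD+BERT | HREM* (2023) (Fu et al., 2023) | 84.0 | 96.1 | 98.6 | 64.4 | 88.0 | 93.1 | 524.2 | 82.9 | 96.9 | 99.0 | 67.1 | 92.0 | 96.6 | 534.6 |
| BUTD+BERT | DIAS | 83.8 | 96.6 | 98.3 | 64.5 | 88.0 | 93.3 | 524.5 | 83.4 | 97.1 | 99.1 | 67.6 | 92.4 | 96.6 | 536.2 |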

Datasets and Evaluation Metrics. We evaluate DIAS mainly on the Flickr30k (Young et al., 2014) and MSCOCO (Lin et al., 2014) datasets. Flickr30k contains 29,000 images for training, 1,000 images for validation, and 1,000 images for testing. MSCOCO contains 123,287 images in total, with 113,287 images for training, 5,000 images for validation, and 5,000 images for testing. Each image in the two datasets is associated with 5 texts. The results on MSCOCO are reported both as the average over 5 folds of 1,000 test images and on the entire 5,000 test images. As is common practice in information retrieval (Chen et al., 2021), we adopt Recall at K (R@K) to measure the performance, and set K=1,5,10. R@K is the percentage of queries whose ground truth appears in the top-K retrieved list. rSum reflects the overall matching performance and is the sum of all R@K values in both image-to-text and text-to-image matching.
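As a rough illustration of these metrics, the sketch below computes R@K and rSum from a similarity matrix. It is deliberately simplified: it assumes one text per image, so the ground truth of query $i$ is candidate $i$, whereas the benchmarks pair every image with 5 texts. The function names are hypothetical.

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """sim[i, j] is the similarity of query i to candidate j.
    Simplifying assumption: the ground truth of query i is candidate i."""
    order = np.argsort(-sim, axis=1)                        # best candidate first
    truth = np.arange(sim.shape[0])[:, None]
    ranks = np.argmax(order == truth, axis=1)               # 0-based rank of the ground truth
    return {k: 100.0 * float(np.mean(ranks < k)) for k in ks}

def rsum(sim):
    """Sum of R@1, R@5, R@10 over both retrieval directions."""
    image_to_text = recall_at_k(sim)
    text_to_image = recall_at_k(sim.T)
    return sum(image_to_text.values()) + sum(text_to_image.values())

# Toy similarity matrix: 3 images vs. 3 texts, with image 0 preferring the wrong text.
sim = np.eye(3)
sim[0, 1] = 2.0
r1_image_to_text = recall_at_k(sim, ks=(1,))[1]
```

The same arithmetic can be checked against Table 1: for DIAS with BUTD+BiGRU on Flickr30K, 81.8 + 96.1 + 98.6 + 60.7 + 84.9 + 91.3 = 513.4.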

Implementation Details. We use the pre-extracted local image embeddings (Chen et al., 2021) for images, and BiGRU (Schuster and Paliwal, 1997) or BERT (Devlin et al., 2018) to extract local text embeddings. All correlation algorithms default to cosine similarity (Sidorov et al., 2014). The experiments are conducted on an NVIDIA GeForce RTX 4090 GPU. We train for 30 epochs, with a batch size of 128 for Flickr30k and 256 for MSCOCO. The Adam optimizer is adopted with an initial learning rate of $5e^{-4}$ and decaying by 10% every epochs. The code is available online at https://github.com/XiangMa-Shaun/DIAS.

4.2. Comparisons with State-of-the-art Methods

To verify the superiority of our proposed DIAS, we compare it with state-of-the-art models on the two datasets. For fair comparison, existing methods are divided into two groups according to their feature backbones. Their experimental results are cited directly from the respective papers. For DIAS, we report single-model performance without ensembling.

Quantitative results on the Flickr30K and MSCOCO 1K test sets are shown in Table 1. DIAS outperforms state-of-the-art methods on R@K and rSum, and achieves consistent superiority across different textual encoders. Furthermore, on the larger MSCOCO 5K test set (Table 2), DIAS also performs best on nearly all metrics.

Table 2. Comparisons with state-of-the-art methods on MSCOCO 5K test-set. * denotes the ensemble results of two models. The best results are in bold.
Methods I→T T→I rSum
R@1 R@5 R@10 R@1 R@5 R@10
BUTD+BiGRU
GPO 56.6 83.6 91.4 39.3 69.9 81.1 421.9
MV 56.7 84.1 91.4 40.3 70.6 81.6 424.6
NAAF* 58.9 85.2 92.0 42.5 70.9 81.4 430.9
CHAN 60.2 85.9 92.4 41.7 71.5 81.7 433.4
NUIF-d 59.3 85.5 92.0 41.9 71.3 81.8 431.8
DIAS 59.8 86.0 92.5 42.7 71.8 82.5 435.3
BUTD+BERT
VSRN++* 54.7 82.9 90.9 42.0 72.2 82.7 425.4
MV 59.1 86.3 92.5 42.5 72.8 83.1 436.3
CHAN 59.8 87.2 93.3 44.9 74.5 84.2 443.9
HREM* 64.0 88.5 93.7 45.4 75.1 84.3 450.9
DIAS 64.4 88.9 94.1 47.2 76.5 85.2 456.3
Figure 6. The effectiveness of sparse correlation algorithm.
Table 3. The effect of applying dimension information alignment (abbreviated as DIA) to other models.
Flickr30K MSCOCO 1K
I→T T→I I→T T→I
R@1 R@5 R@1 R@5 R@1 R@5 R@1 R@5
MV 82.1 95.8 63.1 86.7 80.4 96.6 64.9 91.2
+DIA 82.9 96.2 63.8 87.2 81.4 96.8 65.8 91.9
CHAN 80.6 96.1 63.9 87.5 81.4 96.9 66.5 92.1
+DIA 82.0 96.4 64.2 87.8 81.7 96.9 66.9 92.3
HREM 84.0 96.1 64.4 88.0 82.9 96.9 67.1 92.0
+DIA 84.2 96.5 64.6 88.0 83.0 97.2 67.5 92.5
Figure 7. Performance comparison on varying $\omega_{dim}$, $\omega_{inter}$ and $\omega_{intra}$.

4.3. Ablation Study and Discussion

To demonstrate the effectiveness of the components in DIAS, we conduct ablation studies on both datasets. The baseline w/o DIA means DIAS without dimension information alignment. w/o $\textbf{L}_{x}$ and w/o $\textbf{L}_{yz}$ denote the lack of inter- and intra-modality spatial constraints, respectively. w/o $\hat{\textbf{L}}_{x}$ and w/o $\hat{\textbf{L}}_{yz}$ mean that no sparsity regularization is applied to the inter- and intra-modality spatial matrices, respectively. According to the results shown in Table 4, we have the following observations:

(1) The effectiveness of the model design. Removing any component of DIAS reduces performance, which indicates that the proposed dimension information alignment, spatial constraints, and sparse correlation algorithm are all effective for image-text matching.

(2) Discussion on dimension information alignment. The performance of w/o DIA is the worst among the baselines, indicating that aligning dimension information is the most crucial component for DIAS. To further discuss the effectiveness of this component, we apply it to other models. The results shown in Table 3 demonstrate dimension information alignment can also improve the performance of other models to a certain extent.

Table 4. Ablation studies of our model on Flickr30K and MSCOCO 1K.
Methods Flickr30K MSCOCO 1K
I→T T→I I→T T→I
R@1 R@5 R@1 R@5 R@1 R@5 R@1 R@5
BUTD+BiGRU
w/o DIA 79.3 94.9 58.9 84.0 78.9 95.6 63.0 90.2
w/o $\textbf{L}_{x}$ 81.1 95.7 59.5 84.6 80.4 96.2 64.2 90.3
w/o $\textbf{L}_{yz}$ 80.8 95.2 59.5 84.6 80.1 96.2 63.7 90.2
w/o $\hat{\textbf{L}}_{x}$ 81.6 96.0 60.2 84.8 81.2 96.5 64.7 90.3
w/o $\hat{\textbf{L}}_{yz}$ 81.5 95.8 59.9 84.7 80.9 96.4 64.1 90.2
DIAS 81.8 96.1 60.7 84.9 81.3 96.8 64.9 90.4
BUTD+BERT
w/o DIA 80.8 95.5 62.9 85.9 80.7 96.1 65.1 91.1
w/o $\textbf{L}_{x}$ 83.3 96.2 64.4 87.8 82.9 97.0 66.9 92.1
w/o $\textbf{L}_{yz}$ 82.7 95.9 63.7 87.2 82.1 96.8 66.3 91.8
w/o $\hat{\textbf{L}}_{x}$ 83.5 96.2 64.4 87.9 83.0 97.1 67.2 92.2
w/o $\hat{\textbf{L}}_{yz}$ 83.4 96.2 64.0 87.8 82.8 97.0 67.0 92.2
DIAS 83.8 96.6 64.5 88.0 83.4 97.1 67.6 92.4
Table 5. Generalization ability comparison of models trained on MSCOCO and validated on Flickr30K test-set.
I→T T→I rSum
R@1 R@5 R@10 R@1 R@5 R@10
BUTD+BiGRU
Baseline 53.2 82.1 88.7 42.5 71.1 79.5 417.1
DIAS 69.2 91.2 95.0 54.5 79.4 87.0 476.3
BUTD+BERT
Baseline 60.6 85.4 91.4 46.7 73.7 81.8 439.6
DIAS 73.9 92.2 96.2 57.6 80.8 87.6 488.3

(3) Discussion on spatial constraint. The performance of w/o $\textbf{L}_{yz}$ is inferior to that of w/o $\textbf{L}_{x}$, indicating that the intra-modality spatial constraint is more effective for DIAS than the inter-modality constraint. This result supports the view that the intra-modality constraint is not affected by the modality gap and can directly assist the model in learning robust features.

(4) Discussion on sparse correlation algorithm. The performance of both w/o $\hat{\textbf{L}}_{x}$ and w/o $\hat{\textbf{L}}_{yz}$ is inferior to that of the complete DIAS, suggesting that the sparse correlation algorithm helps the model learn significant features by selecting strongly correlated relationships. To further examine the effectiveness of this algorithm, we compare it with the Top-k strategy and the L1 sparse strategy. The Top-k strategy retains the top-k most relevant relationships for each instance. The L1 sparse strategy constrains the correlation matrix using the L1-norm. The results in Fig. 6 show that the sparse correlation algorithm outperforms these two baselines.
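For reference, the two comparison strategies named in observation (4) can be sketched as follows. This is only an illustration of the Top-k and L1 baselines, not of the sparse correlation algorithm of DIAS itself. The function names, the ranking by raw value rather than absolute value, and the `weight` parameter are assumptions.

```python
from typing import List, Sequence


def top_k_sparsify(corr: Sequence[Sequence[float]], k: int) -> List[List[float]]:
    """Keep the k largest entries of each row and set the rest to zero."""
    sparse = []
    for row in corr:
        # Indices of the k most strongly correlated relationships in this row.
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        keep = set(order[:k])
        sparse.append([v if j in keep else 0.0 for j, v in enumerate(row)])
    return sparse


def l1_penalty(corr: Sequence[Sequence[float]], weight: float = 1.0) -> float:
    """L1-norm regularizer: weighted sum of absolute entries of the matrix."""
    return weight * sum(abs(v) for row in corr for v in row)
```

The design difference is that Top-k imposes a hard, fixed budget of relationships per instance, whereas the L1 term is a soft penalty added to the training loss that shrinks all entries toward zero.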

4.4. Robustness Analysis

Parameter sensitivity. We aim to understand how our model performs when varying the values of the hyper-parameters $\omega_{dim}$, $\omega_{inter}$ and $\omega_{intra}$, as shown in Fig. 7. When varying any one of these hyper-parameters, we fix the others at their default settings. $\omega_{dim}$, $\omega_{inter}$ and $\omega_{intra}$ obtain optimal results at 10, 0.05, and 0.1, respectively.
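The one-at-a-time protocol described above can be sketched as a small configuration generator. The keys `w_dim`, `w_inter` and `w_intra`, the helper name, and the candidate values in the usage example are hypothetical. The defaults are assumed here to equal the reported optima, which the text does not state explicitly.

```python
from typing import Dict, List, Sequence

# Assumption: default settings coincide with the reported optimal values.
DEFAULTS: Dict[str, float] = {"w_dim": 10.0, "w_inter": 0.05, "w_intra": 0.1}


def one_at_a_time_grid(defaults: Dict[str, float],
                       candidates: Dict[str, Sequence[float]]) -> List[Dict[str, float]]:
    """Vary one hyper-parameter at a time while fixing the others at defaults."""
    configs = []
    for name, values in candidates.items():
        for value in values:
            config = dict(defaults)  # copy so the defaults stay untouched
            config[name] = value
            configs.append(config)
    return configs


if __name__ == "__main__":
    # Hypothetical candidate values, chosen only to show the shape of the sweep.
    sweep = one_at_a_time_grid(DEFAULTS, {"w_dim": [1.0, 10.0, 100.0],
                                          "w_inter": [0.01, 0.05, 0.1]})
    for cfg in sweep:
        print(cfg)
```

Such a sweep needs one training run per candidate value rather than one per combination, at the cost of not probing interactions between the three weights.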

Generalization study. To validate the generalization capability of DIAS in learning latent semantics, we conduct cross-dataset experiments following Zhang et al. (2023). Specifically, we take the model trained on the MSCOCO dataset and evaluate its zero-shot transferability on the Flickr30K test-set. The results in Table 5 indicate that DIAS generalizes better than the baseline (rSum of 476.3 vs. 417.1 with BiGRU and 488.3 vs. 439.6 with BERT), confirming that DIAS is capable of learning cross-modality latent semantics.

5. Conclusion

This paper proposes a novel image-text matching model based on dimension information alignment and a sparse spatial correlation algorithm (DIAS). We explicitly align the information representation of embeddings in corresponding dimensions, to address the lack of rationality in correlation calculation caused by the modality gap. Additionally, by introducing inter- and intra-modality spatial relationships, we strengthen the constraints during cross-modal interaction. More importantly, we propose a sparse correlation algorithm that selects strong spatial relationships and thereby relaxes the requirement for symmetry of the embeddings, allowing the model to focus on learning more significant structural features. Extensive experiments and analyses conducted on two datasets show the superiority and rationality of DIAS.

Acknowledgements.
Supported by the National Natural Science Foundation of China (NSFC) Joint Fund with Zhejiang Integration of Informatization and Industrialization under Key Project (Grant No. U22A2033) and NSFC (Grant No. 62072281).

References

  • Chen et al. (2020a) Hui Chen, Guiguang Ding, Xudong Liu, Zijia Lin, Ji Liu, and Jungong Han. 2020a. Imram: Iterative matching with recurrent attention memory for cross-modal image-text retrieval. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 12655–12663.
  • Chen et al. (2021) Jiacheng Chen, Hexiang Hu, Hao Wu, Yuning Jiang, and Changhu Wang. 2021. Learning the best pooling strategy for visual semantic embedding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 15789–15798.
  • Chen et al. (2020b) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020b. A simple framework for contrastive learning of visual representations. In International conference on machine learning. PMLR, 1597–1607.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
  • Fu et al. (2023) Zheren Fu, Zhendong Mao, Yan Song, and Yongdong Zhang. 2023. Learning semantic relationship among instances for image-text matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 15159–15168.
  • Goel et al. (2022) Shashank Goel, Hritik Bansal, Sumit Bhatia, Ryan Rossi, Vishwa Vinay, and Aditya Grover. 2022. Cyclip: Cyclic contrastive language-image pretraining. Advances in Neural Information Processing Systems 35 (2022), 6704–6719.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
  • Huang et al. (2022) Mengqi Huang, Zhendong Mao, Penghui Wang, Quan Wang, and Yongdong Zhang. 2022. Dse-gan: Dynamic semantic evolution generative adversarial network for text-to-image generation. In Proceedings of the 30th ACM International Conference on Multimedia. 4345–4354.
  • Jiang et al. (2023) Qian Jiang, Changyou Chen, Han Zhao, Liqun Chen, Qing Ping, Son Dinh Tran, Yi Xu, Belinda Zeng, and Trishul Chilimbi. 2023. Understanding and constructing latent modality structures in multi-modal representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7661–7671.
  • Karpathy and Fei-Fei (2015) Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3128–3137.
  • Klein et al. (2015) Benjamin Klein, Guy Lev, Gil Sadeh, and Lior Wolf. 2015. Associating neural word embeddings with deep image representations using fisher vectors. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4437–4446.
  • Li et al. (2022b) Jingyu Li, Zhendong Mao, Shancheng Fang, and Hao Li. 2022b. ER-SAN: Enhanced-Adaptive Relation Self-Attention Network for Image Captioning. In IJCAI. 1081–1087.
  • Li et al. (2019) Kunpeng Li, Yulun Zhang, Kai Li, Yuanyuan Li, and Yun Fu. 2019. Visual semantic reasoning for image-text matching. In Proceedings of the IEEE/CVF international conference on computer vision. 4654–4662.
  • Li et al. (2022d) Kunpeng Li, Yulun Zhang, Kai Li, Yuanyuan Li, and Yun Fu. 2022d. Image-text embedding learning via visual and textual semantic reasoning. IEEE transactions on pattern analysis and machine intelligence 45, 1 (2022), 641–656.
  • Li et al. (2022c) Pandeng Li, Hongtao Xie, Jiannan Ge, Lei Zhang, Shaobo Min, and Yongdong Zhang. 2022c. Dual-stream knowledge-preserving hashing for unsupervised video retrieval. In European Conference on Computer Vision. Springer, 181–197.
  • Li et al. (2020) Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. 2020. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16. Springer, 121–137.
  • Li et al. (2022a) Zheng Li, Caili Guo, Zerun Feng, Jenq-Neng Hwang, and Xijun Xue. 2022a. Multi-View Visual Semantic Embedding. In IJCAI, Vol. 2. 7.
  • Liao et al. (2022) Wentong Liao, Kai Hu, Michael Ying Yang, and Bodo Rosenhahn. 2022. Text to image generation with semantic-spatial aware gan. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 18187–18196.
  • Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, 740–755.
  • Liu et al. (2020) Chunxiao Liu, Zhendong Mao, Tianzhu Zhang, Hongtao Xie, Bin Wang, and Yongdong Zhang. 2020. Graph structured network for image-text matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10921–10930.
  • Ma et al. (2023) Xiang Ma, Xuemei Li, Wenzhi Feng, Lexin Fang, and Caiming Zhang. 2023. Dynamic graph construction via motif detection for stock prediction. Information Processing & Management 60, 6 (2023), 103480.
  • Pan et al. (2023) Zhengxin Pan, Fangyu Wu, and Bailing Zhang. 2023. Fine-grained image-text matching by cross-modal hard aligning network. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 19275–19284.
  • Qu et al. (2020) Leigang Qu, Meng Liu, Da Cao, Liqiang Nie, and Qi Tian. 2020. Context-aware multi-view summarization network for image-text matching. In Proceedings of the 28th ACM International Conference on Multimedia. 1047–1055.
  • Qu et al. (2021) Leigang Qu, Meng Liu, Jianlong Wu, Zan Gao, and Liqiang Nie. 2021. Dynamic modality interaction modeling for image-text retrieval. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1104–1113.
  • Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748–8763.
  • Ren et al. (2016) Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2016. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE transactions on pattern analysis and machine intelligence 39, 6 (2016), 1137–1149.
  • Schuster and Paliwal (1997) Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural networks. IEEE transactions on Signal Processing 45, 11 (1997), 2673–2681.
  • Sidorov et al. (2014) Grigori Sidorov, Alexander Gelbukh, Helena Gómez-Adorno, and David Pinto. 2014. Soft similarity and soft cosine measure: Similarity of features in vector space model. Computación y Sistemas 18, 3 (2014), 491–504.
  • Steinley (2006) Douglas Steinley. 2006. K-means clustering: a half-century synthesis. Brit. J. Math. Statist. Psych. 59, 1 (2006), 1–34.
  • Wang et al. (2020b) Haoran Wang, Ying Zhang, Zhong Ji, Yanwei Pang, and Lin Ma. 2020b. Consensus-aware visual-semantic embedding for image-text matching. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIV 16. Springer, 18–34.
  • Wang et al. (2020a) Sijin Wang, Ruiping Wang, Ziwei Yao, Shiguang Shan, and Xilin Chen. 2020a. Cross-modal scene graph matching for relationship-aware image-text retrieval. In Proceedings of the IEEE/CVF winter conference on applications of computer vision. 1508–1517.
  • Wehrmann et al. (2020) Jonatas Wehrmann, Camila Kolling, and Rodrigo C Barros. 2020. Adaptive cross-modal embeddings for image-text alignment. In Proceedings of the AAAI conference on artificial intelligence, Vol. 34. 12313–12320.
  • Wei et al. (2020) Xi Wei, Tianzhu Zhang, Yan Li, Yongdong Zhang, and Feng Wu. 2020. Multi-modality cross attention network for image and sentence matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10941–10950.
  • Wen et al. (2020) Keyu Wen, Xiaodong Gu, and Qingrong Cheng. 2020. Learning dual semantic relations with graph attention for image-text matching. IEEE transactions on circuits and systems for video technology 31, 7 (2020), 2866–2879.
  • Wu et al. (2017) Chao-Yuan Wu, R Manmatha, Alexander J Smola, and Philipp Krahenbuhl. 2017. Sampling matters in deep embedding learning. In Proceedings of the IEEE international conference on computer vision. 2840–2848.
  • Young et al. (2014) Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics 2 (2014), 67–78.
  • Zhang et al. (2024) Huatian Zhang, Lei Zhang, Kun Zhang, and Zhendong Mao. 2024. Identification of Necessary Semantic Undertakers in the Causal View for Image-Text Matching. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 7105–7114.
  • Zhang et al. (2022a) Jingjing Zhang, Shancheng Fang, Zhendong Mao, Zhiwei Zhang, and Yongdong Zhang. 2022a. Fine-tuning with multi-modal entity prompts for news image captioning. In Proceedings of the 30th ACM International Conference on Multimedia. 4365–4373.
  • Zhang et al. (2022b) Kun Zhang, Zhendong Mao, Quan Wang, and Yongdong Zhang. 2022b. Negative-aware attention framework for image-text matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 15661–15670.
  • Zhang et al. (2023) Kun Zhang, Lei Zhang, Bo Hu, Mengxiao Zhu, and Zhendong Mao. 2023. Unlocking the Power of Cross-Dimensional Semantic Dependency for Image-Text Matching. In Proceedings of the 31st ACM International Conference on Multimedia. 4828–4837.
  • Zhang et al. (2020) Qi Zhang, Zhen Lei, Zhaoxiang Zhang, and Stan Z Li. 2020. Context-aware attention network for image-text retrieval. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 3536–3545.