Bridging the Modality Gap: Dimension Information Alignment and Sparse Spatial Constraint for Image-Text Matching

Xiang Ma (ORCID 0000-0002-4963-8705, xiangma@sdu.edu.cn), Xuemei Li (xmli@sdu.edu.cn), Lexin Fang (fanglexin@mail.sdu.edu.cn), and Caiming Zhang (czhang@sdu.edu.cn). All authors: Shandong University, Jinan, Shandong, China.

(2024)

Abstract.

Many contrastive learning based models have achieved advanced performance in image-text matching tasks. The key to these models lies in analyzing the correlation between image-text pairs, which involves cross-modal interaction of embeddings in corresponding dimensions. However, the embeddings of different modalities come from different models or modules, and there is a significant modality gap. Directly interacting such embeddings lacks rationality and may capture inaccurate correlations. Therefore, we propose a novel method called DIAS to bridge the modality gap from two aspects: (1) We align the information representation of embeddings from different modalities in corresponding dimensions to ensure the correlation calculation is based on interactions of similar information. (2) Spatial constraints on inter- and intra-modality unmatched pairs are introduced to ensure the effectiveness of semantic alignment. In addition, a sparse correlation algorithm is proposed to select strongly correlated spatial relationships, enabling the model to learn more significant features and avoid being misled by weak correlations. Extensive experiments demonstrate the superiority of DIAS, achieving 4.3%-10.2% rSum improvements on Flickr30k and MSCOCO benchmarks.

Image-text Matching, Information Aligning, Spatial Constraint, Sparse Algorithm

Published in: Proceedings of the 32nd ACM International Conference on Multimedia (MM ’24), October 28-November 1, 2024, Melbourne, VIC, Australia. DOI: 10.1145/3664647.3681424. ISBN: 979-8-4007-0686-8/24/10. CCS concepts: Information systems, Information retrieval.

1. Introduction

Image-text matching is a fundamental task in computer vision (CV) and natural language processing (NLP), providing support for applications such as image captioning (Li et al., 2022b; Zhang et al., 2022a), text retrieval (Li et al., 2022c), and text-to-image generation (Huang et al., 2022; Liao et al., 2022). This task aims to discover semantic correlations between images and text, and bridge the semantic gap between these two heterogeneous modalities. The key challenge lies in adjusting embeddings by utilizing matched and unmatched relationships between images and texts to achieve high-quality semantic alignment.

The matching process typically operates on embeddings constructed from images and texts. Existing methods can be roughly divided into two categories: global and local (Fu et al., 2023; Liu et al., 2020). Global-based matching extracts global embeddings from whole images and texts and lets them interact to calculate correlations (Chen et al., 2021; Qu et al., 2021). Local-based matching adopts a fine-grained approach that extracts local embeddings from image regions and text words, and usually obtains better performance (Qu et al., 2020; Chen et al., 2020a; Zhang et al., 2020). Both aim at aligning semantics by computing and adjusting the correlation between embeddings of different modalities, which involves interaction of corresponding dimensions. For example, cosine similarity (Sidorov et al., 2014) calculates the correlation between two embeddings in each dimension. However, the embeddings generally come from different models or modules, resulting in significant differences in the information represented by each dimension. For instance, the image embeddings may represent color information in a certain dimension, while the text embeddings may represent the information of a word in the corresponding dimension. Note that the corresponding dimension may not necessarily be in the same column of the embeddings. This is known as the modality gap problem. The cross-modal interaction of such embeddings lacks rationality and can lead to inaccurate correlation calculation.

To enhance the rationality and effectiveness of cross-modal interaction, we propose a novel image-text matching method based on Dimensional Information Alignment and Sparse Spatial Constraint (DIAS), aiming to bridge the gap between image and text modalities from two perspectives:

(1) To ensure the rationality of correlation calculation, we enhance the correlation of the embeddings from different modalities in corresponding dimensions. In subsequent processes, the interaction then involves related information of the embeddings in their corresponding dimensions. However, emphasizing only the correlation of dimensions may lead to feature redundancy, where each dimension provides similar information and lacks discriminative features. Feature redundancy can cause overfitting, reducing the generalization ability of models. Therefore, we enhance the independence of non-corresponding dimensions by reducing the correlation between them, to preserve the amount of information contained in the embeddings.

(2) Most existing methods primarily focus on constraining the relationships between matched image-text pairs, with weaker emphasis on unmatched pairs. This can lead to suboptimal performance in semantic alignment. More importantly, the relationship between matched pairs is a cross-modal constraint, whose effectiveness is significantly affected by the modality gap. We augment existing constraints by introducing spatial inter- and intra-modality constraints for unmatched pairs. The inter-modality constraint promotes semantic consistency by requiring distance consistency between inter-modality unmatched pairs. As shown in Fig. 1(a), the distance between image $i$ and text $j$ is constrained to be consistent with the distance between image $j$ and text $i$. The intra-modality constraint emphasizes spatial structure consistency by requiring distance consistency between unmatched pairs within each modality. As shown in Fig. 1(b), the distance between image $i$ and image $j$ is constrained to be consistent with the distance between text $i$ and text $j$. However, these two types of constraints assume that the spatial relationships between images and texts exhibit symmetry, which is not always valid. Strictly following these constraints may lead the model to learn inaccurate features. Therefore, we propose a sparse correlation algorithm that selects strong correlations to sparsify the spatial constraints, reducing the reliance on symmetry.

Specifically, DIAS first obtains local embeddings of image regions and text words, and calculates the correlations between them in all dimensions to construct the correlation matrix. Each value in the matrix means the correlation of the corresponding region (row) and word (column). To align the information of embeddings from different modalities, we propose a regularizer to increase the correlation values of corresponding dimensions. Meanwhile, the correlation values between non-corresponding dimensions are decreased to suppress feature redundancy. Then, DIAS aggregates and updates the local embeddings, and merges them into global embeddings by pooling. As the correlations of local embeddings have been adjusted in the previous step, the construction of global embeddings becomes more reasonable. Subsequently, DIAS obtains the spatial distances between inter- and intra-modality unmatched pairs, and further employs the proposed sparse correlation algorithm to select strong correlations among them. The proposed algorithm introduces conditional probabilities of instance correlation and adapts them into a sparse regularization term, enabling the model to automatically learn how to identify strong correlations for each instance. Finally, the selected spatial relationships are used as constraints, combined with the constraints between matched pairs, to achieve semantic alignment.

Our contributions are summarized as follows:

(1) We propose a dimension information alignment method for embeddings of different modalities, aiming to enhance the rationality of cross-modal interaction and suppress feature redundancy.

(2) We introduce novel inter- and intra-modality constraints to ensure the effectiveness of semantic alignment.

(3) A sparse correlation algorithm is proposed to select strongly correlated spatial relationships, reducing the reliance on symmetry of embeddings.

Figure 1. Illustration of distance consistency.

2. Related Work

Based on the implementation of cross-modal interactions, image-text matching methods can be broadly categorized into global-based and local-based matching methods.

Global-based matching. Typical global methods obtain global embeddings of images and texts, project them into a shared embedding space through two branches, and align image-text semantics. One line of work focuses on how to accurately describe correlations between global embeddings. Some studies (Goel et al., 2022; Chen et al., 2020b) focus on improving correlation algorithms. For example, Jiang (Jiang et al., 2023) introduces the concept of geometric consistency to enhance the constraint on image-text pairs. Additionally, some studies (Karpathy and Fei-Fei, 2015; Klein et al., 2015; Li et al., 2019, 2022a; Wehrmann et al., 2020) propose complex models to construct more robust global embeddings. Especially in recent years, pre-trained networks (Radford et al., 2021; Li et al., 2020) with extensive resources enrich the information contained in global embeddings. However, these methods still follow the existing paradigm, assuming embeddings from different modalities interact with the same information during correlation computation. In contrast, we focus on aligning the information representation of embeddings to enhance the rationality of correlation computation.

Local-based matching. Learning semantic alignment between local embeddings from image regions and text words is popular and offers better interpretability compared to global methods. Karpathy (Karpathy and Fei-Fei, 2015) makes the first attempt to infer matching between regions and words by aggregating similarities across all regions and words to obtain the correlation between image and text. One line of work focuses on constructing thoughtful aggregation rules to find the important region-word pairs. Chen (Chen et al., 2020a) proposes recurrent cross-attention to iteratively refine and elaborate shared semantics across different levels. Zhang (Zhang et al., 2022b) introduces negative-aware attention on unmatched pairs to enhance matching accuracy. Pan (Pan et al., 2023) argues that effective image-text semantic matching can be achieved by relying solely on the maximum region-word correlation and provides a theoretical derivation. Another line of work focuses on exploiting more information. Wang (Wang et al., 2020a) introduces scene graphs during matching to enrich relationships between local embeddings. Additionally, models combining consensus knowledge (Wang et al., 2020b) and external pre-training knowledge (Wei et al., 2020; Qu et al., 2021) have been employed to enhance the cross-modal alignment. However, they still rarely consider the differences of information representation in different dimensions caused by the modality gap. As mentioned earlier, we bridge the modality gap by aligning the information representation of embeddings.

3. Methodology

Considering effectiveness and interpretability, DIAS adopts the local-based matching method. In this section, we introduce the framework of the local-based matching method (Sec. 3.1) and the details of DIAS. As shown in Fig. 2, DIAS first performs dimension information alignment to adjust the information representation of the embeddings in different dimensions (Sec. 3.2). Then inter- and intra-modality spatial constraints are introduced to suppress the influence of the modality gap (Sec. 3.3), and the sparse correlation algorithm is used to select the strongly correlated spatial relationships (Sec. 3.4).

3.1. The Framework of Local-based Matching

Formally, given an image V, we use Faster-RCNN (Ren et al., 2016) to extract the salient regions and obtain the local image embeddings $\textbf{V}=\{\textbf{v}_{i}|i\in[1,n_{v}],\textbf{v}_{i}\in\mathbb{R}^{d}\}$ by the pre-trained ResNet-101 (He et al., 2016). $\textbf{v}_{i}$ is the local embedding of the $i$-th region. $n_{v}$ denotes the number of regions. Similarly, given text T, we employ Bidirectional Gated Recurrent Units (BiGRU) (Schuster and Paliwal, 1997) or BERT (Devlin et al., 2018) to extract local text embeddings $\textbf{T}=\{\textbf{t}_{j}|j\in[1,n_{t}],\textbf{t}_{j}\in\mathbb{R}^{d}\}$. $\textbf{t}_{j}$ is the local embedding of the $j$-th word. $n_{t}$ denotes the number of words.

Local-based matching first conducts local embedding interaction to update local embeddings based on the correlation between regions and words. The updating of $\textbf{v}_{i}$ can be described as follows:

(1)

\begin{aligned}\hat{\textbf{v}}_{i}&=\frac{\sum_{j=1}^{n_{t}}s_{i,j}\textbf{t}_{j}}{\sum_{j=1}^{n_{t}}s_{i,j}},\ \ \ i\in[1,n_{v}]\\ s_{i,j}&=\sigma_{l}(\textbf{v}_{i},\textbf{t}_{j})\end{aligned}

Here $\hat{\textbf{v}}_{i}$ represents the new local embedding. $\sigma_{l}(\cdot)$ is the correlation function for local embeddings. $s_{i,j}$ is the correlation value between $\textbf{v}_{i}$ and $\textbf{t}_{j}$ . Then, local embeddings are transformed into global embeddings by pooling, formally as:

(2)

\hat{\textbf{V}}=pool(\{\hat{\textbf{v}}_{i}|i\in[1,n_{v}]\})

Here $\hat{\textbf{V}}$ is the global embedding of image V. $pool(\cdot)$ means the pooling operation. Through a similar process, we can obtain the updated local embedding $\hat{\textbf{t}}_{j}$ of each word and the global embedding $\hat{\textbf{T}}$ of the text.
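The interaction and pooling steps of Eq.1 and Eq.2 can be sketched in NumPy as follows. This is a minimal illustration rather than the released implementation: it assumes cosine similarity for $\sigma_{l}$, clips negative correlations to zero so that the normalizer in Eq.1 stays positive, and uses mean pooling for $pool(\cdot)$. The function names `cosine_matrix`, `update_regions` and `pool` are hypothetical.

```python
import numpy as np

def cosine_matrix(a, b, eps=1e-8):
    """Pairwise cosine similarity between rows of a (n, d) and rows of b (m, d)."""
    a_n = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b_n = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    return a_n @ b_n.T

def update_regions(V, T, eps=1e-8):
    """Eq. 1: each region becomes a correlation-weighted average of the words."""
    S = cosine_matrix(V, T)        # s_{i,j}, shape (n_v, n_t)
    S = np.clip(S, 0.0, None)      # assumption: ignore negative correlations
    return (S @ T) / (S.sum(axis=1, keepdims=True) + eps)

def pool(local):
    """Eq. 2 with mean pooling (assumption: the pooling operator is not fixed here)."""
    return local.mean(axis=0)

rng = np.random.default_rng(0)
V = rng.normal(size=(36, 8))       # 36 region embeddings, d = 8
T = rng.normal(size=(12, 8))       # 12 word embeddings, d = 8
V_hat_local = update_regions(V, T) # updated local embeddings, shape (36, 8)
V_hat = pool(V_hat_local)          # global image embedding, shape (8,)
```

Swapping the roles of `V` and `T` in `update_regions` gives the corresponding update for the word embeddings.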

The correlation between image and text is obtained based on global embedding interaction. The triplet loss is the most commonly used method for achieving semantic alignment, and the objective function can be expressed as:

(3)

\mathcal{L}_{loc}=[\alpha-\sigma_{g}(\hat{\textbf{V}},\hat{\textbf{T}})+\sigma_{g}(\hat{\textbf{V}},\hat{\textbf{T}}^{-})]_{+}+[\alpha-\sigma_{g}(\hat{\textbf{V}},\hat{\textbf{T}})+\sigma_{g}(\hat{\textbf{V}}^{-},\hat{\textbf{T}})]_{+}

Here $\alpha$ is a margin parameter and $[\cdot]_{+}=max(\cdot,0)$. $\sigma_{g}$ is the correlation function for instances. $(\hat{\textbf{V}},\hat{\textbf{T}})$ is a positive image-text pair, and $(\hat{\textbf{V}},\hat{\textbf{T}}^{-})$ and $(\hat{\textbf{V}}^{-},\hat{\textbf{T}})$ are negative image-text pairs in the batch. We use distance-weighted sampling (Wu et al., 2017) for hard negative mining.
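A batch-level version of Eq.3 can be sketched as below. For brevity the sketch picks the hardest in-batch negative for each image and each text, which is a simplification swapped in for the distance-weighted sampling used in the paper. The margin value `alpha=0.2` and the name `triplet_loss` are illustrative assumptions.

```python
import numpy as np

def triplet_loss(S, alpha=0.2):
    """Eq. 3 summed over a batch, using the hardest in-batch negatives.

    S[i, j] is sigma_g(V_i, T_j); the diagonal holds the matched pairs.
    Requires at least two pairs so that every instance has a negative.
    """
    n = S.shape[0]
    pos = np.diag(S)
    # Exclude matched pairs before searching for negatives.
    masked = np.where(np.eye(n, dtype=bool), -np.inf, S)
    hardest_text = masked.max(axis=1)    # hardest negative text for each image
    hardest_image = masked.max(axis=0)   # hardest negative image for each text
    loss_i2t = np.maximum(alpha - pos + hardest_text, 0.0)
    loss_t2i = np.maximum(alpha - pos + hardest_image, 0.0)
    return float((loss_i2t + loss_t2i).sum())

separated = triplet_loss(np.eye(3))            # matched pairs clearly ahead
confused = triplet_loss(np.full((3, 3), 0.5))  # matched and unmatched tie
```

When matched pairs exceed every negative by at least the margin, both hinge terms vanish; when all correlations tie, each term contributes exactly the margin.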

3.2. Dimension Information Alignment

A correlation calculation such as Eq.1 involves cross-modal interaction in corresponding dimensions of the embeddings. As mentioned earlier, because they come from different sources, $\textbf{v}_{i}$ and $\textbf{t}_{j}$ differ significantly in the information they represent in each dimension. Their interaction can result in calculation biases and a lack of rationality. Thus, we propose a dimension information alignment method that uses a regularizer to align the information representation before the interaction. It improves the correlation of $\textbf{v}_{i}$ and $\textbf{t}_{j}$ in corresponding dimensions. Meanwhile, to suppress feature redundancy that may occur during the alignment, the regularizer also reduces the correlation values between non-corresponding dimensions. Below is a detailed introduction to this process.

Assume there are $N$ image-text pairs. As shown in Fig. 3, we first extract dimension vectors of all local embeddings, and integrate them into $\textbf{m}^{V}=\{\textbf{m}_{i}^{V}|i\in[1,d],\textbf{m}_{i}^{V}\in\mathbb{R}^{N_{V}}\}$ and $\textbf{m}^{T}=\{\textbf{m}_{j}^{T}|j\in[1,d],\textbf{m}_{j}^{T}\in\mathbb{R}^{N_{T}}\}$, respectively. Here $\textbf{m}_{i}^{V}$ contains the information distribution of all local image embeddings in the $i$-th dimension, and $\textbf{m}_{j}^{T}$ contains the information distribution of all local text embeddings in the $j$-th dimension. The number of regions varies across images and the number of words varies across texts, so we use $N_{V}$ and $N_{T}$ to represent the total number of regions and words, respectively. Then, we compute the correlation between $\textbf{m}_{i}^{V}$ and $\textbf{m}_{j}^{T}$, formally as:

(4)

c_{i,j}=\sigma_{c}(\textbf{m}_{i}^{V},\textbf{m}_{j}^{T}),\ \ i,j\in[1,d]

Here $c_{i,j}$ is the correlation value of $\textbf{m}_{i}^{V}$ and $\textbf{m}_{j}^{T}$ . $\sigma_{c}$ denotes the correlation algorithm for dimension vectors. The correlation matrix $\textbf{C}=\{c_{i,j}|i,j\in[1,d]\}$ can be obtained via Eq.4.

Then, we use a regularizer to improve the correlation of corresponding dimensions and reduce the correlation between non-corresponding dimensions. For ease of understanding, we assume the corresponding dimensions are at the same column of embeddings. It means the corresponding dimension of $\textbf{m}_{i}^{V}$ is $\textbf{m}_{i}^{T}$ and $c_{i,i}$ is the correlation value of them. The regularizer can be expressed as:

(5)

\mathcal{L}_{dim}=-\sum^{d}_{i=1}c_{i,i}+\sum^{d}_{i=1}\sum^{d}_{j=1,j\neq i}c_{i,j}

The first term of Eq.5 mainly aligns the corresponding dimensions, and the second term misaligns the non-corresponding dimensions. This function is relatively intuitive, but it fails to account for the differences in magnitude across rows or columns of $\textbf{C}$, potentially leading to computational bias. Therefore, we improve it to the following formula:

(6)

\mathcal{L}_{dim}=\sum^{d}_{i=1}-(\frac{c_{i,i}}{\sum^{d}_{j=1}c_{i,j}}+\frac{c_{i,i}}{\sum^{d}_{j=1}c_{j,i}})

As shown in Eq.6, the regularizer increases the proportion of $c_{i,i}$ in its corresponding row and column of $\textbf{C}$, avoiding the impact of inconsistent orders of magnitude.

After aligning the dimension information, the process of aggregating and updating the local embeddings in Eq.1 generates more reasonable correlations. Moreover, the information representation of $\hat{\textbf{V}}$ and $\hat{\textbf{T}}$ in the corresponding dimensions obtained by Eq.2 is also more similar.
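The regularizer of Eq.4 and Eq.6 can be illustrated with a small NumPy sketch. It assumes cosine similarity for $\sigma_{c}$ and, because cosine similarity needs vectors of equal length, it further assumes that the two sets of dimension vectors have the same number of rows; how $N_{V}$ and $N_{T}$ are reconciled in practice is not specified here. The names `dimension_correlation` and `dim_alignment_loss` are hypothetical.

```python
import numpy as np

def dimension_correlation(MV, MT, eps=1e-8):
    """Eq. 4: C[i, j] is the cosine similarity of column i of MV and column j of MT.

    MV and MT have shape (n, d); each column is one dimension vector.
    """
    mv = MV / (np.linalg.norm(MV, axis=0, keepdims=True) + eps)
    mt = MT / (np.linalg.norm(MT, axis=0, keepdims=True) + eps)
    return mv.T @ mt               # shape (d, d)

def dim_alignment_loss(C, eps=1e-8):
    """Eq. 6: reward a large share of c_{i,i} within row i and within column i."""
    diag = np.diag(C)
    row_share = diag / (C.sum(axis=1) + eps)   # c_{i,i} / sum_j c_{i,j}
    col_share = diag / (C.sum(axis=0) + eps)   # c_{i,i} / sum_j c_{j,i}
    return float(-(row_share + col_share).sum())

d = 4
aligned = np.eye(d)                # each dimension carries distinct information
redundant = np.ones((d, d))        # every dimension carries the same information
aligned_loss = dim_alignment_loss(dimension_correlation(aligned, aligned))
redundant_loss = dim_alignment_loss(dimension_correlation(redundant, redundant))
```

In this toy case the aligned, non-redundant embeddings reach the minimum of $-2d$, whereas fully redundant dimensions only reach $-2$, which matches the intent of suppressing feature redundancy.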

3.3. Spatial Constraint

After obtaining the global embeddings $\hat{\textbf{V}}$ and $\hat{\textbf{T}}$, we calculate their correlation and use the loss function (Eq.3) to achieve semantic alignment. For each instance, the number of unmatched instances far exceeds the number of matched instances. Existing methods often impose stronger constraints on matched pairs and weaker constraints on unmatched pairs. For example, Eq.3 requires the correlation of a matched pair to be greater than that of all unmatched pairs, while an unmatched pair only needs its correlation to fall below that of the matched pair by the margin $\alpha$. To ensure the effectiveness of semantic alignment, we propose two spatial constraint regularizers to enhance the constraint on unmatched pairs, including inter- and intra-modality constraints.

On the one hand, we aim to maintain semantic consistency by pursuing spatial distance consistency of inter-modality unmatched pairs. Concretely, we compute the distance of all global embeddings between different modalities:

(7)

x_{i,j}=\sigma_{x}(\hat{\textbf{V}}_{i},\hat{\textbf{T}}_{j}),\ \ i,j\in[1,N]

Here $\hat{\textbf{V}}_{i}$ is the global embedding of the $i$-th image, and $\hat{\textbf{T}}_{j}$ is the global embedding of the $j$-th text. $N$ is the number of image-text pairs, and we assume the matched pair of $\hat{\textbf{V}}_{i}$ is $\hat{\textbf{T}}_{i}$. $\sigma_{x}$ is the distance function. $x_{i,j}$ is the spatial distance between $\hat{\textbf{V}}_{i}$ and $\hat{\textbf{T}}_{j}$. We combine all $x_{i,j}$ to construct the spatial matrix $\textbf{X}=\{x_{i,j}|i,j\in[1,N]\}$. The regularizer for inter-modality unmatched pairs is as follows:

(8)

\begin{aligned}\mathcal{L}_{inter}&=\|\textbf{L}_{x}\|^{2}_{2}=\|\textbf{X}-\textbf{X}^{\top}\|^{2}_{2}=\sum_{i=1}^{N}\sum_{j=1}^{N}(x_{i,j}-x_{j,i})^{2}\\ &=\sum_{i=1}^{N}\sum_{j=1}^{N}(\sigma_{x}(\hat{\textbf{V}}_{i},\hat{\textbf{T}}_{j})-\sigma_{x}(\hat{\textbf{V}}_{j},\hat{\textbf{T}}_{i}))^{2}\end{aligned}

Here $\textbf{L}_{x}=|\textbf{X}-\textbf{X}^{\top}|$ is the inter-modality spatial matrix to be optimized. Its diagonal entries, which correspond to matched pairs, are always zero, so this regularizer imposes a strong distance constraint only on unmatched pairs, which partially compensates for the shortcomings of Eq.3. The regularizer can effectively reduce the model’s sensitivity and enhance its robustness and generalization when handling diverse modality data. However, it still handles inter-modality embeddings, which are limited by the modality gap.
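A minimal NumPy sketch of Eq.7 and Eq.8 follows. It assumes cosine similarity for $\sigma_{x}$ (using cosine distance instead would only flip the sign of each difference and leave the squared loss unchanged), and the function names are hypothetical.

```python
import numpy as np

def cosine_matrix(a, b, eps=1e-8):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a_n = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b_n = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    return a_n @ b_n.T

def inter_modality_loss(V_hat, T_hat):
    """Eq. 8: penalize asymmetry between x_{i,j} and x_{j,i}."""
    X = cosine_matrix(V_hat, T_hat)   # Eq. 7, x_{i,j} = sigma_x(V_i, T_j)
    L_x = X - X.T                     # diagonal (matched pairs) is zero
    return float((L_x ** 2).sum())

rng = np.random.default_rng(0)
V_hat = rng.normal(size=(5, 8))       # 5 global image embeddings
T_hat = rng.normal(size=(5, 8))       # 5 global text embeddings
loss_unaligned = inter_modality_loss(V_hat, T_hat)
loss_identical = inter_modality_loss(V_hat, V_hat)
```

If the two modalities were embedded identically, the matrix would be symmetric and the loss would vanish; unrelated embeddings give a positive loss.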

So, on the other hand, we aim to maintain structure consistency of different modalities by pursuing spatial distance consistency of intra-modality unmatched pairs. We compute the distance of all global embeddings in each modality:

(9)

\begin{aligned}y_{i,j}&=\sigma_{y}(\hat{\textbf{V}}_{i},\hat{\textbf{V}}_{j}),\ \ i,j\in[1,N]\\ z_{i,j}&=\sigma_{z}(\hat{\textbf{T}}_{i},\hat{\textbf{T}}_{j})\end{aligned}

Here $y_{i,j}$ is the spatial distance between $\hat{\textbf{V}}_{i}$ and $\hat{\textbf{V}}_{j}$, and $z_{i,j}$ is the spatial distance between $\hat{\textbf{T}}_{i}$ and $\hat{\textbf{T}}_{j}$. $\sigma_{y}$ and $\sigma_{z}$ are the distance functions of images and texts, respectively. We combine all $y_{i,j}$ to construct $\textbf{Y}=\{y_{i,j}|i,j\in[1,N]\}$ and all $z_{i,j}$ to construct $\textbf{Z}=\{z_{i,j}|i,j\in[1,N]\}$. The regularizer for intra-modality unmatched pairs is as follows:

(10)

\begin{aligned}\mathcal{L}_{intra}&=\|\textbf{L}_{yz}\|^{2}_{2}=\|\textbf{Y}-\textbf{Z}\|^{2}_{2}=\sum_{i=1}^{N}\sum_{j=1}^{N}(y_{i,j}-z_{i,j})^{2}\\ &=\sum_{i=1}^{N}\sum_{j=1}^{N}(\sigma_{y}(\hat{\textbf{V}}_{i},\hat{\textbf{V}}_{j})-\sigma_{z}(\hat{\textbf{T}}_{i},\hat{\textbf{T}}_{j}))^{2}\end{aligned}

Here $\textbf{L}_{yz}=|\textbf{Y}-\textbf{Z}|$ is the intra-modality spatial matrix to be optimized. It can be observed that this regularizer constrains embeddings of different modalities to have the same spatial relationships, enhancing the consistency of their spatial structure. Furthermore, it only compares embeddings within each modality and is therefore not affected by the modality gap. Even if the modality gap is not completely eliminated, this regularizer can still encourage our model to learn effective features.
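The following sketch of Eq.9 and Eq.10 illustrates why the intra-modality term is insensitive to a gap between the two embedding spaces. It assumes cosine similarity for $\sigma_{y}$ and $\sigma_{z}$, and it models the modality gap, purely for illustration, as a random orthogonal transform of the text space. All names are hypothetical.

```python
import numpy as np

def cosine_matrix(a, b, eps=1e-8):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a_n = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b_n = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    return a_n @ b_n.T

def intra_modality_loss(V_hat, T_hat):
    """Eq. 10: compare the image-image structure Y with the text-text structure Z."""
    Y = cosine_matrix(V_hat, V_hat)   # y_{i,j}
    Z = cosine_matrix(T_hat, T_hat)   # z_{i,j}
    return float(((Y - Z) ** 2).sum())

rng = np.random.default_rng(0)
V_hat = rng.normal(size=(5, 8))
# Toy "modality gap": the text space is an orthogonally transformed copy of the image space.
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))
T_rotated = V_hat @ Q
T_random = rng.normal(size=(5, 8))

loss_rotated = intra_modality_loss(V_hat, T_rotated)
loss_random = intra_modality_loss(V_hat, T_random)
```

Because an orthogonal transform preserves angles within a modality, `loss_rotated` is numerically zero even though the two spaces are not aligned dimension by dimension, while structurally unrelated texts give a positive loss.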

3.4. Sparse Correlation Algorithm

The spatial constraints assume the spatial relationships between images and texts exhibit symmetry, but this assumption is not always valid. These inaccurate relationships can affect the performance of the model. Fig. 4 shows the distribution of spatial distance between instances within and across modalities. It can be observed that the relationships between instances are mostly weakly correlated, and these relationships have little effect on characterizing the spatial position of instances. Considering the effectiveness and efficiency, we propose a sparse correlation algorithm. This algorithm concentrates spatial constraints on strong correlation relationships to capture more significant and important features. More importantly, this algorithm can reduce the need for embedding symmetry, making it more flexible.

The key issue is how to determine which instances exhibit strong correlations. The correlation distribution varies greatly across instances, making it unsuitable to set a unified hard threshold to distinguish strong and weak correlations. Therefore, we propose a sparse correlation algorithm to adaptively distinguish strong and weak correlations based on each instance itself. This algorithm builds conditional probabilities of correlation and uses them to obtain a soft threshold, as shown in Fig. 5. Taking matrix $\textbf{L}_{x}$ as an example, we set its $i$-th row vector as $\textbf{l}_{i}^{V}=\{l_{i,j}^{V}|j\in[1,N]\}$, where $l_{i,j}^{V}=|x_{i,j}-x_{j,i}|$. $\textbf{l}_{i}^{V}$ indicates the correlation between image $\textbf{V}_{i}$ and all texts. Similarly, we set the $j$-th column vector of $\textbf{L}_{x}$ as $\textbf{l}_{j}^{T}=\{l_{j,i}^{T}|i\in[1,N]\}$, where $l_{j,i}^{T}=|x_{j,i}-x_{i,j}|$. To quantify this explicitly, we represent the conditional probability of each image as:

(11)

p(l_{i}^{V}|l_{j}^{T})=Sigmoid(-|x_{i,j}-x_{j,i}|),\ \ j\in[1,N]

Here $p(l_{i}^{V}|l_{j}^{T})\in[0,1]$ represents the dependency degree of $\hat{\textbf{V}}_{i}$ on $\hat{\textbf{T}}_{j}$. A larger value of $p(l_{i}^{V}|l_{j}^{T})$ indicates a stronger dependency. We expect the model to discover strong correlations for each image and text based on the latent semantics of $\textbf{L}_{x}$, to avoid interference from other weakly correlated instances and to be as concise as possible. Specifically, we observed that the histogram of the conditional probabilities $\{p(l_{i}^{V}|l_{j}^{T})\}_{j=1}^{N}$ approximates a normal distribution, as shown in Fig. 4. Therefore, based on the statistical features of the conditional probabilities, we can enable the model to learn a soft threshold for distinguishing strong and weak correlations for each instance:

(12)

\kappa^{V}_{i}=\mu_{i}+\beta_{i}\cdot\theta_{i}

Here $\kappa^{V}_{i}$ is the soft threshold of $\textbf{l}_{i}^{V}$. $\mu_{i}$ and $\theta_{i}$ are the mean and standard deviation of the probability values sampled from $\{p(l_{i}^{V}|l_{j}^{T})\}_{j=1}^{N}$, respectively. $\beta_{i}$ is a learnable parameter to adjust the degree of sparsity.
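Eq.11 and Eq.12 can be sketched as below. In this illustration $\beta_{i}$ is passed in as a fixed array, whereas in the paper it is a learnable parameter; the population standard deviation is used for $\theta_{i}$, which is also an assumption. The name `image_soft_thresholds` is hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def image_soft_thresholds(X, beta):
    """Per-image soft thresholds kappa^V from the inter-modality matrix X.

    X[i, j] = sigma_x(V_i, T_j); beta has one entry per image.
    """
    L_x = np.abs(X - X.T)          # l_{i,j}^V = |x_{i,j} - x_{j,i}|
    P = sigmoid(-L_x)              # Eq. 11, P[i, j] = p(l_i^V | l_j^T)
    mu = P.mean(axis=1)            # mean over the texts j for each image i
    theta = P.std(axis=1)          # standard deviation over the texts j
    kappa_V = mu + beta * theta    # Eq. 12
    return P, kappa_V

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(6, 6))
beta = np.ones(6)                  # stand-in for the learnable beta_i
P, kappa_V = image_soft_thresholds(X, beta)
```

Since the argument of the sigmoid is never positive, every probability lies in $(0, 0.5]$, with $0.5$ reached exactly when $x_{i,j}=x_{j,i}$.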

We combine all soft-thresholds as $\textbf{K}^{V}=\{\kappa^{V}_{i}|i\in[1,N]\}$ , and obtain $\textbf{K}^{T}=\{\kappa^{T}_{j}|j\in[1,N]\}$ in a similar process. It is important to note that $\textbf{K}^{V}$ and $\textbf{K}^{T}$ are distinct. For example, image $\textbf{V}_{i}$ may exhibit a high dependency degree on text $\textbf{T}_{j}$ , but $\textbf{T}_{j}$ may not necessarily have a high dependency degree on $\textbf{V}_{i}$ . To avoid introducing weakly correlated information, we select spatial relationships that meet the requirements of both $\textbf{K}^{V}$ and $\textbf{K}^{T}$ :

(13)

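\hat{\textbf{L}}_{x}=\textbf{B}_{x}\odot\textbf{L}_{x}

Here $\odot$ denotes the element-wise product, consistent with the per-entry weighting by $b^{x}_{i,j}$ in Eq.15, and $\textbf{B}_{x}=\{b^{x}_{i,j}|i,j\in[1,N]\}$ is a binary mask matrix that keeps the strong correlation relationships: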

(14)

\displaystyle b^{x}_{i,j}=\left\{\begin{aligned} &1,&&|x_{i,j}-x_{j,i}|>max(\kappa^{V}_{i},\kappa^{T}_{j})\\ &0,&&otherwise\end{aligned}\right.

$max(\cdot)$ is the function for calculating the maximum value. Based on the sparse inter-modality spatial matrix $\hat{\textbf{L}}_{x}$, we update Eq.8 as:

(15)

\begin{aligned}\mathcal{L}_{inter}&=\|\hat{\textbf{L}}_{x}\|^{2}_{2}=\textbf{B}_{x}\|\textbf{X}-\textbf{X}^{\top}\|^{2}_{2}=\sum_{i=1}^{N}\sum_{j=1}^{N}b^{x}_{i,j}(x_{i,j}-x_{j,i})^{2}\\ &=\sum_{i=1}^{N}\sum_{j=1}^{N}b^{x}_{i,j}(\sigma_{x}(\hat{\textbf{V}}_{i},\hat{\textbf{T}}_{j})-\sigma_{x}(\hat{\textbf{V}}_{j},\hat{\textbf{T}}_{i}))^{2}\end{aligned}

By performing similar operations on $\textbf{L}_{yz}$, we can obtain the sparse intra-modality spatial matrix $\hat{\textbf{L}}_{yz}$ and update Eq.10 as:

(16)

\begin{aligned}\mathcal{L}_{intra}&=\|\hat{\textbf{L}}_{yz}\|^{2}_{2}=\textbf{B}_{yz}\|\textbf{Y}-\textbf{Z}\|^{2}_{2}=\sum_{i=1}^{N}\sum_{j=1}^{N}b^{yz}_{i,j}(y_{i,j}-z_{i,j})^{2}\\ &=\sum_{i=1}^{N}\sum_{j=1}^{N}b^{yz}_{i,j}(\sigma_{y}(\hat{\textbf{V}}_{i},\hat{\textbf{V}}_{j})-\sigma_{z}(\hat{\textbf{T}}_{i},\hat{\textbf{T}}_{j}))^{2}\end{aligned}

Here $\textbf{B}_{yz}=\{b^{yz}_{i,j}|i,j\in[1,N]\}$ .
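The masking of Eq.13 and Eq.14 and the sparse losses of Eq.15 and Eq.16 can be sketched as follows. The sketch follows Eq.14 as written, treats the learnable $\beta$ values as fixed arrays, and applies the same routine to either $\textbf{L}_{x}$ or $\textbf{L}_{yz}$. The names `thresholds`, `sparse_mask` and `sparse_loss` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def thresholds(L, beta_rows, beta_cols):
    """Soft thresholds for the rows (kappa^V) and columns (kappa^T) of L (Eq. 11-12)."""
    P = sigmoid(-L)
    k_rows = P.mean(axis=1) + beta_rows * P.std(axis=1)
    k_cols = P.mean(axis=0) + beta_cols * P.std(axis=0)
    return k_rows, k_cols

def sparse_mask(L, beta_rows, beta_cols):
    """Eq. 14: b_{i,j} = 1 only if entry (i, j) exceeds both soft thresholds."""
    k_rows, k_cols = thresholds(L, beta_rows, beta_cols)
    limit = np.maximum(k_rows[:, None], k_cols[None, :])
    return (L > limit).astype(L.dtype)

def sparse_loss(L, beta_rows, beta_cols):
    """Eq. 15 / Eq. 16: squared entries of L, kept only where the mask is 1."""
    B = sparse_mask(L, beta_rows, beta_cols)
    return float((B * L ** 2).sum()), B

# Toy inter-modality matrix with a single asymmetric pair (0, 1).
X = np.array([[0.0, 2.0, 0.0],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
L_x = np.abs(X - X.T)
beta = np.ones(3)                  # stand-in for the learnable parameters
loss, B_x = sparse_loss(L_x, beta, beta)
```

In this toy example only the entries $(0,1)$ and $(1,0)$ pass the thresholds, so the mask keeps two relationships and the loss is $2^{2}+2^{2}=8$. Note that $|x_{i,j}-x_{j,i}|$ is symmetric in $i$ and $j$, so in this sketch the row and column thresholds can differ only through the two $\beta$ arrays.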

3.5. Objective Function

We combine the proposed regularization terms with the triplet loss to obtain the loss function of DIAS:

(17)

\mathcal{L}=\mathcal{L}_{loc}+\omega_{dim}\mathcal{L}_{dim}+\omega_{inter}\mathcal{L}_{inter}+\omega_{intra}\mathcal{L}_{intra}

Here $\omega_{dim}$, $\omega_{inter}$ and $\omega_{intra}$ are hyper-parameters that control the weight of each term. To ensure effective cross-modal interactions, we use neighbor sampling instead of random sampling for batches. First, we apply K-means (Steinley, 2006; Ma et al., 2023) clustering on the local image embeddings. Then, we randomly select $M$ clusters and choose $P$ images from each cluster. Finally, we pair each image with a positive text instance and obtain $N=P\times M$ image-text pairs for each batch.
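The neighbor sampling step can be sketched as follows, assuming the cluster assignment of every image has already been computed by some K-means implementation. The function `neighbor_batch` and its handling of clusters with fewer than $P$ members (they are skipped) are illustrative assumptions.

```python
import numpy as np

def neighbor_batch(labels, M, P, rng):
    """Pick M clusters at random, then P distinct images from each.

    labels[i] is the cluster id of image i. Returns an array of image indices.
    """
    labels = np.asarray(labels)
    clusters, counts = np.unique(labels, return_counts=True)
    eligible = clusters[counts >= P]   # clusters large enough to supply P images
    if eligible.size < M:
        raise ValueError("not enough clusters with at least P members")
    chosen = rng.choice(eligible, size=M, replace=False)
    picks = [rng.choice(np.flatnonzero(labels == c), size=P, replace=False)
             for c in chosen]
    return np.concatenate(picks)

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(10), 20)  # stand-in for K-means labels of 200 images
batch = neighbor_batch(labels, M=4, P=8, rng=rng)
```

Each returned index would then be paired with one of its positive texts, so that a batch contains images that are close to each other in the embedding space and therefore supply harder unmatched pairs.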

4. Experiments

4.1. Experimental Setup

Table 1. Comparisons with state-of-the-art methods on Flickr30k and MSCOCO 1K test-sets. BUTD represents using Faster-RCNN (Chen et al., 2021) to extract local image embeddings. BiGRU and BERT represent using BiGRU (Schuster and Paliwal, 1997) or BERT (Devlin et al., 2018) to extract local text embeddings. * denotes the ensemble results of two models. The bests are in bold.

Methods	Flickr30K							MSCOCO 1K
	IMG $\rightarrow$ TEXT			TEXT $\rightarrow$ IMG				IMG $\rightarrow$ TEXT			TEXT $\rightarrow$ IMG			
	R@1	R@5	R@10	R@1	R@5	R@10	rSum	R@1	R@5	R@10	R@1	R@5	R@10	rSum
BUTD+BiGRU
GSMN*(2020)(Liu et al., 2020)	76.4	94.3	97.3	57.4	82.3	89.0	496.8	78.4	96.4	98.6	63.3	90.1	95.7	522.5
GPO(2021)(Chen et al., 2021)	76.5	94.2	97.7	56.4	83.4	89.9	498.1	78.5	96.0	98.7	61.7	90.3	95.6	520.8
MV(2022)(Li et al., 2022a)	79.0	94.9	97.7	59.1	84.6	90.6	505.8	78.7	95.7	98.7	62.7	90.4	95.7	521.9
NAAF*(2022)(Zhang et al., 2022b)	81.9	96.1	98.3	61.0	85.3	90.6	513.2	80.5	96.5	98.8	64.1	90.7	96.5	527.2
CHAN(2023)(Pan et al., 2023)	79.7	94.5	97.3	60.2	85.3	90.7	507.8	79.7	96.7	98.7	63.8	90.4	95.8	525.0
NUIF-d(2024)(Zhang et al., 2024)	81.8	94.7	97.6	59.4	85.6	91.1	509.3	80.6	96.3	98.8	64.7	91.4	96.2	528.0
DIAS	81.8	96.1	98.6	60.7	84.9	91.3	513.4	81.3	96.8	98.9	64.9	90.4	95.9	528.2
BUTD+BERT
DSRAN(2020)(Wen et al., 2020)	77.8	95.1	97.6	59.2	86.0	91.9	507.6	78.3	95.7	98.4	64.5	90.8	95.8	523.5
VSRN++*(2022)(Li et al., 2022d)	79.2	94.6	97.5	60.6	85.6	91.4	508.9	77.9	96.0	98.5	64.1	91.0	96.1	523.6
MV(2022)(Li et al., 2022a)	82.1	95.8	97.9	63.1	86.7	92.3	517.5	80.4	96.6	99.0	64.9	91.2	96.0	528.1
CHAN(2023)(Pan et al., 2023)	80.6	96.1	97.8	63.9	87.5	92.6	518.5	81.4	96.9	98.9	66.5	92.1	96.7	532.6
HREM*(2023)(Fu et al., 2023)	84.0	96.1	98.6	64.4	88.0	93.1	524.2	82.9	96.9	99.0	67.1	92.0	96.6	534.6
DIAS	83.8	96.6	98.3	64.5	88.0	93.3	524.5	83.4	97.1	99.1	67.6	92.4	96.6	536.2

Datasets and Evaluation Metrics. We evaluate DIAS mainly on the Flickr30k (Young et al., 2014) and MSCOCO (Lin et al., 2014) datasets. Flickr30k contains 29,000 images for training, 1,000 images for validation, and 1,000 images for testing. MSCOCO contains 123,287 images for training, 5,000 images for validation, and 5,000 images for testing. Each image in the two datasets is associated with 5 texts. The results on MSCOCO are reported both as the average over 5 folds of 1,000 test images and on the entire 5,000 test images. As is common practice in information retrieval (Chen et al., 2021), we adopt Recall at K (R@K) to measure the performance, and set K=1,5,10. R@K is the percentage of queries whose ground truth appears in the retrieved top-K list. rSum reflects the overall matching performance, and is the sum of R@K in both image-to-text and text-to-image matching.
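For concreteness, R@K and rSum can be computed from an image-text similarity matrix as sketched below. The sketch assumes that the 5 texts of image $i$ occupy columns $5i$ to $5i+4$ and that, for image-to-text retrieval, the best-ranked ground-truth text determines the rank; both conventions are assumptions of this illustration, and the function names are hypothetical.

```python
import numpy as np

def recall_at_k(ranks, ks=(1, 5, 10)):
    """Percentage of queries whose ground truth has a 0-based rank below k."""
    ranks = np.asarray(ranks)
    return [100.0 * np.mean(ranks < k) for k in ks]

def evaluate(S, captions_per_image=5):
    """S[i, j] is the similarity of image i and text j. Returns (i2t, t2i, rSum)."""
    n_img, n_txt = S.shape
    owner = np.arange(n_txt) // captions_per_image   # matching image of each text
    # Image-to-text: rank of the best-placed ground-truth text for each image.
    order = np.argsort(-S, axis=1)
    i2t_ranks = [np.flatnonzero(owner[order[i]] == i).min() for i in range(n_img)]
    # Text-to-image: rank of the ground-truth image for each text.
    order_t = np.argsort(-S.T, axis=1)
    t2i_ranks = [np.flatnonzero(order_t[j] == owner[j])[0] for j in range(n_txt)]
    i2t = recall_at_k(i2t_ranks)
    t2i = recall_at_k(t2i_ranks)
    return i2t, t2i, sum(i2t) + sum(t2i)

n_img = 20
owner = np.arange(5 * n_img) // 5
S_perfect = (owner[None, :] == np.arange(n_img)[:, None]).astype(float)
i2t, t2i, rsum = evaluate(S_perfect)
_, _, rsum_worst = evaluate(1.0 - S_perfect)
```

With six recall values of at most 100 each, rSum is bounded by 600, which is reached by the perfect similarity matrix in the example.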

Implementation Details. We use the pre-extracted local image embeddings (Chen et al., 2021) for images, and the BiGRU (Schuster and Paliwal, 1997) or BERT (Devlin et al., 2018) to extract local text embeddings. All correlation algorithms default to cosine similarity (Sidorov et al., 2014). The experiments are conducted on an NVIDIA GeForce RTX 4090 GPU. We set 30 training epochs, and the batch size is 128 for Flickr30k and 256 for MSCOCO. Adam optimizer is adopted with an initial learning rate of $5\times10^{-4}$ and decaying by 10% every epochs. The code is available online at https://github.com/XiangMa-Shaun/DIAS.

4.2. Comparisons with State-of-the-art Methods

To verify the performance of DIAS, we compare it with state-of-the-art models on both datasets. For fair comparison, existing methods are grouped into two types according to their feature backbones. Their results are cited directly from the respective papers. For our model, we report single-model performance without ensembling.

Quantitative results on the Flickr30K and MSCOCO 1K test-sets are shown in Table 1. DIAS outperforms state-of-the-art methods with impressive margins for the R@K and rSum, and achieves consistent superiority on different textual encoders. Furthermore, on the larger MSCOCO 5K test-set (Table 2), DIAS also performs best on nearly all metrics.

Table 2. Comparisons with state-of-the-art methods on the MSCOCO 5K test-set. * denotes the ensemble results of two models. The best results are in bold.

Methods	I $\rightarrow$ T			T $\rightarrow$ I			rSum
	R@1	R@5	R@10	R@1	R@5	R@10	
BUTD+BiGRU
GPO	56.6	83.6	91.4	39.3	69.9	81.1	421.9
MV	56.7	84.1	91.4	40.3	70.6	81.6	424.6
NAAF*	58.9	85.2	92.0	42.5	70.9	81.4	430.9
CHAN	60.2	85.9	92.4	41.7	71.5	81.7	433.4
NUIF-d	59.3	85.5	92.0	41.9	71.3	81.8	431.8
DIAS	59.8	86.0	92.5	42.7	71.8	82.5	435.3
BUTD+BERT
VSRN++*	54.7	82.9	90.9	42.0	72.2	82.7	425.4
MV	59.1	86.3	92.5	42.5	72.8	83.1	436.3
CHAN	59.8	87.2	93.3	44.9	74.5	84.2	443.9
HREM*	64.0	88.5	93.7	45.4	75.1	84.3	450.9
DIAS	64.4	88.9	94.1	47.2	76.5	85.2	456.3

Table 3. The effect of applying dimension information alignment (abbreviated as DIA) to other models.

	Flickr30K				MSCOCO 1K
	I $\rightarrow$ T		T $\rightarrow$ I		I $\rightarrow$ T		T $\rightarrow$ I
	R@1	R@5	R@1	R@5	R@1	R@5	R@1	R@5
MV	82.1	95.8	63.1	86.7	80.4	96.6	64.9	91.2
+DIA	82.9	96.2	63.8	87.2	81.4	96.8	65.8	91.9
CHAN	80.6	96.1	63.9	87.5	81.4	96.9	66.5	92.1
+DIA	82.0	96.4	64.2	87.8	81.7	96.9	66.9	92.3
HREM	84.0	96.1	64.4	88.0	82.9	96.9	67.1	92.0
+DIA	84.2	96.5	64.6	88.0	83.0	97.2	67.5	92.5

4.3. Ablation Study and Discussion

To demonstrate the effectiveness of the components of DIAS, we conduct ablation studies on both datasets. The baseline w/o DIA denotes DIAS without dimension information alignment. w/o $\textbf{L}_{x}$ and w/o $\textbf{L}_{yz}$ denote the removal of the inter- and intra-modality spatial constraints, respectively. w/o $\hat{\textbf{L}}_{x}$ and w/o $\hat{\textbf{L}}_{yz}$ mean that no sparsity regularization is applied to the inter- and intra-modality spatial matrices, respectively. From the results shown in Table 4, we make the following observations:

(1) Effectiveness of the model design. Removing any component of DIAS reduces performance, which indicates that the proposed dimension information alignment, spatial constraints, and sparse correlation algorithm are all effective for image-text matching.

(2) Discussion on dimension information alignment. The w/o DIA variant performs worst among the ablated baselines, indicating that aligning dimension information is the most crucial component of DIAS. To further examine the effectiveness of this component, we apply it to other models. The results in Table 3 demonstrate that dimension information alignment also improves the performance of other models to a certain extent.

Table 4. Ablation studies of our model on Flickr30K and MSCOCO 1K.

BUTD+BiGRU
Methods	Flickr30K				MSCOCO 1K
	I $\rightarrow$ T		T $\rightarrow$ I		I $\rightarrow$ T		T $\rightarrow$ I
	R@1	R@5	R@1	R@5	R@1	R@5	R@1	R@5
w/o DIA	79.3	94.9	58.9	84.0	78.9	95.6	63.0	90.2
w/o $\textbf{L}_{x}$	81.1	95.7	59.5	84.6	80.4	96.2	64.2	90.3
w/o $\textbf{L}_{yz}$	80.8	95.2	59.5	84.6	80.1	96.2	63.7	90.2
w/o $\hat{\textbf{L}}_{x}$	81.6	96.0	60.2	84.8	81.2	96.5	64.7	90.3
w/o $\hat{\textbf{L}}_{yz}$	81.5	95.8	59.9	84.7	80.9	96.4	64.1	90.2
DIAS	81.8	96.1	60.7	84.9	81.3	96.8	64.9	90.4
BUTD+BERT
w/o DIA	80.8	95.5	62.9	85.9	80.7	96.1	65.1	91.1
w/o $\textbf{L}_{x}$	83.3	96.2	64.4	87.8	82.9	97.0	66.9	92.1
w/o $\textbf{L}_{yz}$	82.7	95.9	63.7	87.2	82.1	96.8	66.3	91.8
w/o $\hat{\textbf{L}}_{x}$	83.5	96.2	64.4	87.9	83.0	97.1	67.2	92.2
w/o $\hat{\textbf{L}}_{yz}$	83.4	96.2	64.0	87.8	82.8	97.0	67.0	92.2
DIAS	83.8	96.6	64.5	88.0	83.4	97.1	67.6	92.4

Table 5. Generalization ability comparison of models trained on MSCOCO and validated on Flickr30K test-set.

	I $\rightarrow$ T			T $\rightarrow$ I			rSum
	R@1	R@5	R@10	R@1	R@5	R@10	
BUTD+BiGRU
Baseline	53.2	82.1	88.7	42.5	71.1	79.5	417.1
DIAS	69.2	91.2	95.0	54.5	79.4	87.0	476.3
BUTD+BERT
Baseline	60.6	85.4	91.4	46.7	73.7	81.8	439.6
DIAS	73.9	92.2	96.2	57.6	80.8	87.6	488.3

(3) Discussion on spatial constraint. The performance of w/o $\textbf{L}_{yz}$ is inferior to w/o $\textbf{L}_{x}$ , indicating that introducing intra-modality spatial constraint is more effective for DIAS than inter-modality constraint. This result provides evidence for the viewpoint that intra-modality constraint is not affected by modality gap and can directly assist the model in learning robust features.

(4) Discussion on sparse correlation algorithm. Both w/o $\hat{\textbf{L}}_{x}$ and w/o $\hat{\textbf{L}}_{yz}$ perform worse than the complete DIAS, suggesting that the sparse correlation algorithm helps the model learn significant features by selecting strongly correlated relationships. To further examine the effectiveness of this algorithm, we compare it with the Top-k strategy and the L1 sparse strategy. The Top-k strategy retains the top-k most relevant relationships for each instance. The L1 sparse strategy constrains the correlation matrix using the L1-norm. The results in Fig. 6 show that the sparse correlation algorithm outperforms both baselines.
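The two comparison strategies can be sketched as follows. This illustrates only the Top-k and L1 baselines as described in the text, not the proposed sparse correlation algorithm. The function names are hypothetical, and "most relevant" is assumed here to mean the largest correlation value in each row.

```python
from typing import List, Sequence


def top_k_sparsify(corr: Sequence[Sequence[float]], k: int) -> List[List[float]]:
    """Keep the k largest entries of each row and set the rest to zero."""
    out = []
    for row in corr:
        # Indices of the k largest values in this row (assumed relevance criterion).
        keep = set(sorted(range(len(row)), key=lambda j: -row[j])[:k])
        out.append([v if j in keep else 0.0 for j, v in enumerate(row)])
    return out


def l1_penalty(corr: Sequence[Sequence[float]], weight: float) -> float:
    """L1-norm regularization term that would be added to the training loss."""
    return weight * sum(abs(v) for row in corr for v in row)


toy_corr = [[0.9, 0.1, 0.5],
            [0.2, 0.8, 0.3]]
toy_sparse = top_k_sparsify(toy_corr, k=2)
```

Top-k imposes a fixed number of retained relationships per instance, whereas the L1 term only encourages small entries to shrink and leaves the degree of sparsity to the optimizer.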

4.4. Robustness Analysis

Parameter sensitivity. We study how the model performs when varying the hyper-parameters $\omega_{dim}$ , $\omega_{inter}$ and $\omega_{intra}$ , as shown in Fig. 7. When varying one of these hyper-parameters, we fix the others at their default settings. $\omega_{dim}$ , $\omega_{inter}$ and $\omega_{intra}$ obtain optimal results at 10, 0.05, and 0.1, respectively.
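The one-at-a-time protocol described above can be sketched as a small configuration generator. The default values below are assumed to equal the reported optima, and the candidate grids are hypothetical placeholders, since the exact values explored are shown only in the figure.

```python
from typing import Dict, List, Sequence, Tuple

# Assumption: defaults are set to the reported optimal values.
DEFAULTS = {"omega_dim": 10.0, "omega_inter": 0.05, "omega_intra": 0.1}

# Hypothetical candidate grids, for illustration only.
GRIDS = {
    "omega_dim": [1.0, 5.0, 10.0, 20.0],
    "omega_inter": [0.01, 0.05, 0.1],
    "omega_intra": [0.05, 0.1, 0.5],
}


def one_at_a_time(defaults: Dict[str, float],
                  grids: Dict[str, Sequence[float]]) -> List[Tuple[str, Dict[str, float]]]:
    """Vary one hyper-parameter at a time while fixing the others at defaults."""
    configs = []
    for name, values in grids.items():
        for value in values:
            cfg = dict(defaults)  # copy so the defaults stay untouched
            cfg[name] = value
            configs.append((name, cfg))
    return configs


sweep = one_at_a_time(DEFAULTS, GRIDS)
```

Each entry of `sweep` would then be passed to a training and evaluation run, which is omitted here.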

Generalization study. To validate the generalization capability of DIAS in learning latent semantics, we conduct cross-dataset validation experiments following Zhang et al. (2023). Specifically, we take the model trained on the MSCOCO dataset and evaluate its zero-shot transferability on the Flickr30K test-set. The results in Table 5 show that DIAS generalizes better than the baseline, raising rSum from 417.1 to 476.3 with BiGRU and from 439.6 to 488.3 with BERT, confirming that DIAS is capable of learning cross-modality latent semantics.

5. Conclusion

This paper proposes DIAS, a novel image-text matching model based on dimension information alignment and a sparse spatial correlation algorithm. We explicitly align the information representation of embeddings in corresponding dimensions, addressing the lack of rationality in correlation calculation caused by the modality gap. Additionally, by introducing inter- and intra-modality spatial relationships, we strengthen the constraints during cross-modal interaction. More importantly, we propose a sparse correlation algorithm that selects strong spatial relationships, which reduces the requirement for symmetry of embeddings and allows the model to focus on learning more significant structural features. Extensive experiments and analyses conducted on two datasets show the superiority and rationality of DIAS.

Acknowledgements.

Supported by the National Natural Science Foundation of China (NSFC) Joint Fund with Zhejiang Integration of Informatization and Industrialization under Key Project (Grant No.U22A2033) and NSFC (Grant No.62072281).

References

Chen et al. (2020a) Hui Chen, Guiguang Ding, Xudong Liu, Zijia Lin, Ji Liu, and Jungong Han. 2020a. Imram: Iterative matching with recurrent attention memory for cross-modal image-text retrieval. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 12655–12663.
Chen et al. (2021) Jiacheng Chen, Hexiang Hu, Hao Wu, Yuning Jiang, and Changhu Wang. 2021. Learning the best pooling strategy for visual semantic embedding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 15789–15798.
Chen et al. (2020b) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020b. A simple framework for contrastive learning of visual representations. In International conference on machine learning. PMLR, 1597–1607.
Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
Fu et al. (2023) Zheren Fu, Zhendong Mao, Yan Song, and Yongdong Zhang. 2023. Learning semantic relationship among instances for image-text matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 15159–15168.
Goel et al. (2022) Shashank Goel, Hritik Bansal, Sumit Bhatia, Ryan Rossi, Vishwa Vinay, and Aditya Grover. 2022. Cyclip: Cyclic contrastive language-image pretraining. Advances in Neural Information Processing Systems 35 (2022), 6704–6719.
He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
Huang et al. (2022) Mengqi Huang, Zhendong Mao, Penghui Wang, Quan Wang, and Yongdong Zhang. 2022. Dse-gan: Dynamic semantic evolution generative adversarial network for text-to-image generation. In Proceedings of the 30th ACM International Conference on Multimedia. 4345–4354.
Jiang et al. (2023) Qian Jiang, Changyou Chen, Han Zhao, Liqun Chen, Qing Ping, Son Dinh Tran, Yi Xu, Belinda Zeng, and Trishul Chilimbi. 2023. Understanding and constructing latent modality structures in multi-modal representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7661–7671.
Karpathy and Fei-Fei (2015) Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3128–3137.
Klein et al. (2015) Benjamin Klein, Guy Lev, Gil Sadeh, and Lior Wolf. 2015. Associating neural word embeddings with deep image representations using fisher vectors. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4437–4446.
Li et al. (2022b) Jingyu Li, Zhendong Mao, Shancheng Fang, and Hao Li. 2022b. ER-SAN: Enhanced-Adaptive Relation Self-Attention Network for Image Captioning. In IJCAI. 1081–1087.
Li et al. (2019) Kunpeng Li, Yulun Zhang, Kai Li, Yuanyuan Li, and Yun Fu. 2019. Visual semantic reasoning for image-text matching. In Proceedings of the IEEE/CVF international conference on computer vision. 4654–4662.
Li et al. (2022d) Kunpeng Li, Yulun Zhang, Kai Li, Yuanyuan Li, and Yun Fu. 2022d. Image-text embedding learning via visual and textual semantic reasoning. IEEE transactions on pattern analysis and machine intelligence 45, 1 (2022), 641–656.
Li et al. (2022c) Pandeng Li, Hongtao Xie, Jiannan Ge, Lei Zhang, Shaobo Min, and Yongdong Zhang. 2022c. Dual-stream knowledge-preserving hashing for unsupervised video retrieval. In European Conference on Computer Vision. Springer, 181–197.
Li et al. (2020) Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. 2020. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16. Springer, 121–137.
Li et al. (2022a) Zheng Li, Caili Guo, Zerun Feng, Jenq-Neng Hwang, and Xijun Xue. 2022a. Multi-View Visual Semantic Embedding. In IJCAI, Vol. 2. 7.
Liao et al. (2022) Wentong Liao, Kai Hu, Michael Ying Yang, and Bodo Rosenhahn. 2022. Text to image generation with semantic-spatial aware gan. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 18187–18196.
Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, 740–755.
Liu et al. (2020) Chunxiao Liu, Zhendong Mao, Tianzhu Zhang, Hongtao Xie, Bin Wang, and Yongdong Zhang. 2020. Graph structured network for image-text matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10921–10930.
Ma et al. (2023) Xiang Ma, Xuemei Li, Wenzhi Feng, Lexin Fang, and Caiming Zhang. 2023. Dynamic graph construction via motif detection for stock prediction. Information Processing & Management 60, 6 (2023), 103480.
Pan et al. (2023) Zhengxin Pan, Fangyu Wu, and Bailing Zhang. 2023. Fine-grained image-text matching by cross-modal hard aligning network. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 19275–19284.
Qu et al. (2020) Leigang Qu, Meng Liu, Da Cao, Liqiang Nie, and Qi Tian. 2020. Context-aware multi-view summarization network for image-text matching. In Proceedings of the 28th ACM International Conference on Multimedia. 1047–1055.
Qu et al. (2021) Leigang Qu, Meng Liu, Jianlong Wu, Zan Gao, and Liqiang Nie. 2021. Dynamic modality interaction modeling for image-text retrieval. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1104–1113.
Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748–8763.
Ren et al. (2016) Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2016. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE transactions on pattern analysis and machine intelligence 39, 6 (2016), 1137–1149.
Schuster and Paliwal (1997) Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural networks. IEEE transactions on Signal Processing 45, 11 (1997), 2673–2681.
Sidorov et al. (2014) Grigori Sidorov, Alexander Gelbukh, Helena Gómez-Adorno, and David Pinto. 2014. Soft similarity and soft cosine measure: Similarity of features in vector space model. Computación y Sistemas 18, 3 (2014), 491–504.
Steinley (2006) Douglas Steinley. 2006. K-means clustering: a half-century synthesis. Brit. J. Math. Statist. Psych. 59, 1 (2006), 1–34.
Wang et al. (2020b) Haoran Wang, Ying Zhang, Zhong Ji, Yanwei Pang, and Lin Ma. 2020b. Consensus-aware visual-semantic embedding for image-text matching. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIV 16. Springer, 18–34.
Wang et al. (2020a) Sijin Wang, Ruiping Wang, Ziwei Yao, Shiguang Shan, and Xilin Chen. 2020a. Cross-modal scene graph matching for relationship-aware image-text retrieval. In Proceedings of the IEEE/CVF winter conference on applications of computer vision. 1508–1517.
Wehrmann et al. (2020) Jonatas Wehrmann, Camila Kolling, and Rodrigo C Barros. 2020. Adaptive cross-modal embeddings for image-text alignment. In Proceedings of the AAAI conference on artificial intelligence, Vol. 34. 12313–12320.
Wei et al. (2020) Xi Wei, Tianzhu Zhang, Yan Li, Yongdong Zhang, and Feng Wu. 2020. Multi-modality cross attention network for image and sentence matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10941–10950.
Wen et al. (2020) Keyu Wen, Xiaodong Gu, and Qingrong Cheng. 2020. Learning dual semantic relations with graph attention for image-text matching. IEEE transactions on circuits and systems for video technology 31, 7 (2020), 2866–2879.
Wu et al. (2017) Chao-Yuan Wu, R Manmatha, Alexander J Smola, and Philipp Krahenbuhl. 2017. Sampling matters in deep embedding learning. In Proceedings of the IEEE international conference on computer vision. 2840–2848.
Young et al. (2014) Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics 2 (2014), 67–78.
Zhang et al. (2024) Huatian Zhang, Lei Zhang, Kun Zhang, and Zhendong Mao. 2024. Identification of Necessary Semantic Undertakers in the Causal View for Image-Text Matching. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 7105–7114.
Zhang et al. (2022a) Jingjing Zhang, Shancheng Fang, Zhendong Mao, Zhiwei Zhang, and Yongdong Zhang. 2022a. Fine-tuning with multi-modal entity prompts for news image captioning. In Proceedings of the 30th ACM International Conference on Multimedia. 4365–4373.
Zhang et al. (2022b) Kun Zhang, Zhendong Mao, Quan Wang, and Yongdong Zhang. 2022b. Negative-aware attention framework for image-text matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 15661–15670.
Zhang et al. (2023) Kun Zhang, Lei Zhang, Bo Hu, Mengxiao Zhu, and Zhendong Mao. 2023. Unlocking the Power of Cross-Dimensional Semantic Dependency for Image-Text Matching. In Proceedings of the 31st ACM International Conference on Multimedia. 4828–4837.
Zhang et al. (2020) Qi Zhang, Zhen Lei, Zhaoxiang Zhang, and Stan Z Li. 2020. Context-aware attention network for image-text retrieval. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 3536–3545.