Abstract
Image–text matching (retrieval) has attracted growing attention due to the growth of multimodal data. This task returns the images relevant to a textual query or to a description of a visual scene, and vice versa. The core challenge is how to compute the similarity between text and image precisely, which requires understanding both modalities by accurately extracting the relevant information. Although many deep learning (DL) approaches have been established for matching textual data and visual content, few reviews of DL-based image–text matching are available. In this review, we present and clarify modern DL-based techniques for the image–text matching problem by providing an extensive study of existing matching models, current architectures, benchmark datasets, and evaluation methods. First, we explain the matching task and illustrate frequently used architectures. Second, we classify present approaches according to two important concepts: the alignment between image and text, and the learning approach. Third, we report standard datasets and evaluation techniques. Finally, we outline current challenges to serve as an inspiration to new researchers in this field.
1 Introduction
Image–text matching (ITM) bridges the gap between scenes and annotations to achieve an effective understanding of the different modalities (text and image). It measures image–text similarity based on visual and semantic information [1]. For instance, given an image, it is feasible to search for the relevant sentences (I2T) and vice versa (T2I), since the same notion can be represented in many ways, such as text, image, and audio. Matching between images and texts is employed in various multimodal tasks such as image captioning (IC), cross-modal retrieval (CMR), text-to-image synthesis, and visual question answering (VQA) [2].
Many statistical algorithms have been applied to matching, such as Canonical Correlation Analysis (CCA) [3], Cluster CCA [4], and Partial Least Squares (PLS) [5]. These algorithms determine the correlation among data and learn a low-dimensional joint embedding. In recent years, deep learning (DL) approaches have demonstrated notable results in many domains, such as natural language processing (NLP) and computer vision (CV), because of their ability to represent data without requiring handcrafted features. This has motivated researchers to explore the strength of DL in multimodal tasks.
A few overviews discuss the multimedia retrieval problem, but they did not focus only on image–text retrieval (ITR) and also included audio and video modalities. Peng et al. [6] and Aygun et al. [7] provided cross-media reviews covering retrieval across image, text, video, and audio modalities. However, these overviews did not focus deeply on ITR and did not give sufficient information about the ITM/ITR task. Lately, ITR has attracted increasing attention because of the growth of multimodal data. Therefore, Chen et al. [8] wrote a short review that focused on ITR approaches proposed in the period from 2018 to 2019. Chen et al. classified the retrieval models according to their embedding methods into four groups: pairwise, adversarial, attributes, and interaction. However, this review covered only some of the existing methods and neglected the hybrid methods. After that, Abdullah et al. [2] presented a review that also focused on ITR, but they classified the existing approaches based on the alignment between image and text parts.
In this study, we provide a survey of ITM methods. Our focus is on bi-directional ITM models that use DL with different learning approaches; bi-directional ITM models can retrieve from image to text (I2T) and vice versa (T2I). We propose classifying these methods according to their learning approach into the following categories: (1) deep canonical correlation analysis; (2) rank; (3) interaction; (4) adversarial; (5) cycle-consistent; (6) few-shot; (7) hybrid; and (8) vision-language pre-trained models (VL-PTMs). In contrast to previous overviews, we explore the link between learning and alignment methods, as in Fig. 2. Furthermore, we summarize the structures of the present models in Table 1, highlighting the encoders, loss functions, and optimizers used in order to illustrate the main variations in the architectures. In addition, we discuss the key challenges and future directions.
The rest of this article is arranged as follows. In Sect. 2, a general background is presented to give an overview of the ITM task. In Sect. 3, a taxonomy of bi-directional ITM (Bi-ITM) approaches is reported and summarized in Fig. 7, where each approach is described briefly. In Sect. 4, datasets and evaluation methods are presented. In Sect. 5, a discussion and new directions are given. Finally, Sect. 6 concludes the article.
2 Background
In this section, we give a detailed overview of the basic structure of ITM models. Generally, a bi-directional CMR architecture consists mainly of three components, as illustrated in Fig. 1: (1) an image features branch, (2) a text features branch, and (3) a common space (or latent space) [2, 8]. Although the feature vectors for an image and its reference text have similar semantics, they are distributed in distinct spaces and therefore cannot be compared directly. Hence, a common (shared) space is used to embed these vectors in one space, where they can be compared using Euclidean or cosine distance [9] as part of the loss function. The matching task is often hard since a deep understanding of images and sentences is required. The image–text retrieval model aims to assign a similarity score to a pair of an image and a sentence. If the score is high, the sentence sufficiently describes the image; if the score is low, the sentence is unrelated to the input image. Consequently, the similarity scores are used to determine the most appropriate images and sentences in both retrieval directions [10].
Before the learning process, modality features must be extracted through suitable representation methods, and the choice of representation method significantly influences performance. Here, we present the popular feature extractors for the text and image modalities without going deeply into technical specifics.
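As a minimal sketch of the scoring step described above, the following Python snippet assumes that image and sentence features have already been projected into a common space. The toy vectors and the function names `cosine_similarity_matrix` and `retrieve` are illustrative assumptions and are not taken from any surveyed model. It computes cosine similarities and ranks candidates in both retrieval directions (I2T and T2I).

```python
import numpy as np

def cosine_similarity_matrix(image_emb, text_emb):
    """Return S where S[i, j] is the cosine similarity of image i and text j."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return img @ txt.T

def retrieve(sim, k=1):
    """Top-k text indices per image (I2T) and top-k image indices per text (T2I)."""
    i2t = np.argsort(-sim, axis=1)[:, :k]
    t2i = np.argsort(-sim.T, axis=1)[:, :k]
    return i2t, t2i

# Hypothetical embeddings in a 3-d common space; row i of each array is a matched pair.
image_emb = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
text_emb = np.array([[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.9]])

sim = cosine_similarity_matrix(image_emb, text_emb)
i2t, t2i = retrieve(sim, k=1)
```

In this toy setting each image retrieves its own sentence and each sentence retrieves its own image, because the matched pairs have the highest similarity in their row and column.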
2.1 Image Representation
Because of the characteristics of convolutional neural networks (CNNs), many powerful DL models for image feature extraction are built on them, such as LeNet [11], AlexNet [12], GoogLeNet [13], visual geometry group network (VGGNet) [14], region-based convolutional neural network (R-CNN) [15], faster region-based convolutional neural network (Faster R-CNN) [16], and residual network (ResNet) [17]. These models can be directly integrated into multimodal models and trained simultaneously. However, the common direction for feature representation, especially in multimodal learning, is to use existing pre-trained CNNs, due to limited computational resources and the amount of data required for sufficient training. Recently, multi-step feature extraction has been suggested to obtain better representations. For instance, Lee et al. [18] and Wang et al. [19] used Faster R-CNN and then fed its output to ResNet to obtain the image features.
2.2 Text Representation
Since sentences and images differ in nature, different representation methods are used to encode words. The most popular representation methods are bag of words (BOW) [20], term frequency-inverse document frequency (TF-IDF) [21], latent Dirichlet allocation (LDA) [22], Word2Vec [23], GloVe [24], and, more recently, BERT [25]. These embedding methods map words into a vector space under the same distribution in order to measure the similarity between words. Many DL networks have been established to deal with the varying length of sentences by considering a sentence as a sequence of words. Recurrent neural networks (RNNs) [26] are a robust tool for dealing with sequential data such as sentences because an RNN has an internal memory. However, RNNs cannot capture long-term dependencies since they suffer from vanishing gradients. For that reason, modified versions of the RNN, such as long short-term memory (LSTM) [27] and gated recurrent unit (GRU) [28], are used to obtain better performance. GRU is a simplified variant of LSTM with two gates (update and reset) instead of the three gates (input, forget, and output) of LSTM, so GRU needs less computational power than LSTM. Additionally, bi-directional versions have been proposed for RNN [29], LSTM [30], and GRU [26], and they are widely used for capturing the semantics of text. In practice, the weights of these DL networks are often initialized using Word2Vec, GloVe, or BERT, or randomly. In addition to these methods, CNNs have also shown remarkable results when used for extracting textual features, as in [31, 32].
2.3 Alignment
Alignment between the image (as a whole or as objects) and the text (as a whole or as words) plays an important role in understanding these different modalities. The alignment shows how the input vectors interact with each other. The main alignment categories are illustrated in Fig. 2 as follows: (1) Global alignment: the whole-image and whole-text features are passed directly to the common space to compute the similarity between each pair; (2) Local alignment: it focuses on the correlation between image regions and text snippets; (3) Hybrid alignment: it combines the global and local approaches to achieve a more precise understanding and obtain better matching results.
In addition, some works, such as [33, 34], incorporate relational alignment with one of the above categories to enhance matching scores, where relational alignment is concerned with the relations between image objects (e.g., “a boy is holding a ball above the footpath”, Relations: (a boy, holding), (holding, a ball), (above, footpath)). In the matching task, most existing works follow the global approach, as seen in the figure, and significant improvements have been achieved with it. However, the global approach is unable to take advantage of the interaction between image and text to find the common semantics. Consequently, several studies use the local approach to link image areas or objects with text snippets (see Fig. 3). However, it does not focus on non-object elements in the image, such as background and snow. Recently, the attention mechanism has been used in several works [18, 19] to associate words with image areas and compute the local similarity between each pair; it also allows the model to focus on salient regions (e.g., sky, ground). Since the local approach ignores fine-grained information that can give additional cues for learning to match, hybrid alignment has been suggested in many studies to achieve better visual and textual matching: Wang et al. [35] employed local and global approaches in separate branches, and Li et al. [36] fused global and local similarities to obtain accurate matching. However, hybrid alignment increases memory and computation time requirements.
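The difference between the three alignment categories can be illustrated with a small Python sketch. It rests on several assumptions that are not taken from the surveyed papers: global features are single vectors, local similarity matches each word to its best region and averages over words, and the hybrid score is a weighted sum with a hypothetical `weight` parameter. Actual models use learned encoders and more elaborate pooling.

```python
import numpy as np

def l2_normalize(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def global_similarity(image_vec, text_vec):
    """Cosine similarity between one whole-image and one whole-text vector."""
    return float(l2_normalize(image_vec) @ l2_normalize(text_vec))

def local_similarity(regions, words):
    """Match every word to its most similar region, then average over words."""
    sim = l2_normalize(words) @ l2_normalize(regions).T  # (n_words, n_regions)
    return float(sim.max(axis=1).mean())

def hybrid_similarity(image_vec, text_vec, regions, words, weight=0.5):
    """Weighted sum of the global and local scores (weight is an assumption)."""
    g = global_similarity(image_vec, text_vec)
    l = local_similarity(regions, words)
    return weight * g + (1.0 - weight) * l

# Toy example: two region vectors and two word vectors in a 2-d space.
regions = np.array([[0.0, 1.0], [1.0, 0.0]])
words = np.array([[1.0, 0.0], [1.0, 0.0]])
score = hybrid_similarity(regions.mean(axis=0), words.mean(axis=0), regions, words)
```

Here both words match the second region exactly, so the local score is 1, while the mean-pooled global vectors only partly agree. This is the kind of complementary signal that hybrid alignment tries to exploit, at the cost of computing both scores.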
2.4 Loss Functions
Several formulations of the loss function have been suggested; they differ in the factors that shape the loss, such as the distance space of the features, the label relation, and the similarity measurement. The triplet loss [37] is commonly used in image classification and retrieval. Usually, the triplet is expressed as (anchor (a), positive (p), negative (n)). The triplet loss aims to decrease the distance between an anchor (given) sample and a positive (similar) sample, and to maximize the distance to a negative (dissimilar) sample. It can be expressed as in Eq. 1, which is called the hinge triplet loss, where I is the image encoding, T is the text encoding, N is the set of negative samples, \(\alpha\) is the margin, S is the similarity function, and \({\left[m\right]}_{+}=\mathrm{max}(0,m)\):
\[L(I,T)=\sum_{{T}^{\prime}\in N}{\left[\alpha -S(I,T)+S(I,{T}^{\prime})\right]}_{+}+\sum_{{I}^{\prime}\in N}{\left[\alpha -S(I,T)+S({I}^{\prime},T)\right]}_{+} \tag{1}\]
Faghri et al. [38] used the hard-negatives concept to enhance the triplet loss, where the hard negatives (HN) \({T}_{h}^{\prime}\) and \({I}_{h}^{\prime}\) are defined as \({T}_{h}^{\prime}={\mathrm{argmax}}_{c\ne T}S(I,c)\) and \({I}_{h}^{\prime}={\mathrm{argmax}}_{c\ne I}S(c,T)\), as in Eq. 2:
\[{L}_{HN}(I,T)={\left[\alpha -S(I,T)+S(I,{T}_{h}^{\prime})\right]}_{+}+{\left[\alpha -S(I,T)+S({I}_{h}^{\prime},T)\right]}_{+} \tag{2}\]
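The hinge triplet loss and its hard-negative variant described above can be sketched in Python over a batch similarity matrix. This is an illustrative sketch under stated assumptions rather than the implementation of [37] or [38]: `sim[i, j]` is assumed to hold S(I_i, T_j), matched pairs are assumed to lie on the diagonal, all other items in the batch act as the negative set, and the function name and margin value are hypothetical.

```python
import numpy as np

def hinge_triplet_losses(sim, margin=0.2):
    """Return (sum-of-hinges loss, hard-negative loss), summed over the batch."""
    n = sim.shape[0]
    pos = np.diag(sim)                       # S(I, T) for the matched pairs
    off_diag = ~np.eye(n, dtype=bool)        # excludes the positive pair itself

    # cost_t[i, j]: image i is the anchor and text j is a negative sentence.
    cost_t = np.maximum(0.0, margin - pos[:, None] + sim) * off_diag
    # cost_i[i, j]: text j is the anchor and image i is a negative image.
    cost_i = np.maximum(0.0, margin - pos[None, :] + sim) * off_diag

    # Sum over all negatives in both directions.
    sum_loss = cost_t.sum() + cost_i.sum()
    # Keep only the hardest negative for each anchor.
    hard_loss = cost_t.max(axis=1).sum() + cost_i.max(axis=0).sum()
    return float(sum_loss), float(hard_loss)

# Hypothetical 3x3 similarity matrix for three image-sentence pairs.
sim = np.array([[0.90, 0.80, 0.75],
                [0.20, 0.70, 0.60],
                [0.30, 0.10, 0.50]])
sum_loss, hard_loss = hinge_triplet_losses(sim, margin=0.2)
```

With hard negatives, only the single most violating negative per anchor contributes, so the hard-negative loss can never exceed the sum over all negatives.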
3 Retrieval Approaches’ Categorization
A variety of multimodal embedding learning approaches exist. They may share similar constructions, while the methods used for capturing the features of each modality can be quite different, as illustrated in Table 1. In this section, we describe the DL approaches to image–text matching in detail according to our proposed categorization and discuss their pros and cons, as summarized in Table 2. Fig. 7 summarizes the works in the literature based on two factors, the learning category and the alignment method in each category, where the learning categories are:
3.1 Deep Canonical Correlation Analysis
Deep canonical correlation analysis (DCCA) focuses on learning composite non-linear transformations for various modalities of data by maximizing the total correlation through deep networks, where the resulting representations are linearly correlated [39], as shown in the common space of Fig. 1. Yan et al. [40] proposed an extension of DCCA for image–text retrieval; this extension works under specific constraints to avoid the overfitting issue and the eigenvalue decomposition problem. After that, Shao et al. [41] integrated DCCA with progressive learning to reduce the data required for training, and they used hypergraph learning to extract semantic information from textual features; related image–text pairs are then clustered in the latent space according to the semantic information. However, the models proposed in [40] and [41] did not directly take dissimilar pairs into account, which produces false positives in the retrieval results. To solve this issue, Hua et al. [42] recently designed a loss based on metric learning to learn distance metrics between text and image and measure similarity using inter- and intra-correlation knowledge. In addition, they used multiple scales to represent the similarity as a real value instead of in a binary way (similar or not). Although DCCA aims to maximize the correlation among different modalities, it requires huge memory and it does not capture non-linear relations among various modalities. Because of this, it is not easy to implement DCCA if the number of modalities is greater than two. In addition, its loss is sensitive to the batch size.
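The total correlation that DCCA maximizes can be illustrated with a NumPy sketch of the linear CCA objective evaluated on two batches of features. The function name `total_correlation` and the regularization constant `reg` are assumptions made for illustration; in DCCA the inputs would be the outputs of the two deep networks, and the negative of this quantity would serve as the training loss.

```python
import numpy as np

def total_correlation(h1, h2, reg=1e-4):
    """Sum of canonical correlations between two views (rows are samples)."""
    n = h1.shape[0]
    h1c = h1 - h1.mean(axis=0)
    h2c = h2 - h2.mean(axis=0)

    # Covariance estimates from the batch, regularized to stay invertible.
    s11 = h1c.T @ h1c / (n - 1) + reg * np.eye(h1.shape[1])
    s22 = h2c.T @ h2c / (n - 1) + reg * np.eye(h2.shape[1])
    s12 = h1c.T @ h2c / (n - 1)

    def inv_sqrt(m):
        # Inverse matrix square root via eigendecomposition of a symmetric matrix.
        vals, vecs = np.linalg.eigh(m)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    t = inv_sqrt(s11) @ s12 @ inv_sqrt(s22)
    # The singular values of t are the canonical correlations.
    return float(np.linalg.svd(t, compute_uv=False).sum())

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 3))
mix = np.array([[2.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 1.0, 3.0]])
related = total_correlation(x, x @ mix)                       # linearly related views
unrelated = total_correlation(x, rng.normal(size=(500, 3)))   # independent views
```

Because the covariance matrices are estimated from the current batch and then decomposed, the sketch also hints at why the loss is sensitive to batch size and why the eigenvalue decomposition is a practical concern.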
3.2 Rank Learning
Rank learning has three approaches: pointwise, pairwise, and list-wise. They differ in how they deal with the input data. Pointwise learning takes one input element and calculates the score between that element and the input query; pairwise learning takes a pair of elements and then ranks all available pairs in descending order; and list-wise learning takes the entire input list and optimizes its order. The rank of an element is based on its loss: if the loss is low, its rank will be high [43].
In ITM, pairwise learning is widely used. This approach attempts to find a loss function that calculates the distance between image and text pairs in the common space, so that the distance between related images and texts is reduced and the distance between unrelated samples is increased. The use of DL in matching started with CNNs for obtaining visual features, as in [44, 45, 46]. These works applied different CNN structures to capture visual features and LDA to represent textual features. Instead of using LDA, Karpathy et al. [33] employed dependency trees (DT) to encode word relations in a given sentence, where these relations are used as sentence fragments. The local similarity is then measured between the image regions and sentence snippets, and the global similarity is obtained by accumulating the similarity scores of all region-word pairs. After that, Karpathy et al. [47] modified their previous work by using a bi-directional RNN (BRNN) as the sentence encoder instead of the DT in [33] to achieve better performance, since DT parsers might be trained on unrelated text corpora, which affects performance. Furthermore, CNNs have been used to capture the features of annotations: Ma et al. [31] used CNNs for both sentence and image representation.
Dealing with various modalities is a critical task during learning; therefore, many approaches have been suggested to embed different modalities jointly. For instance, Frome et al. [48] proposed a joint embedding model that addressed the limited number of categories in available data through a joint representation that used image labels and unannotated data to obtain objects. Since then, this embedding method has been commonly used in CMR. Instead of using a joint representation directly in one step, Peng et al. [49] proposed a hierarchical combination of the modalities’ representations. This aimed to acquire intra- and inter-media information to capture the correlation between media forms. Another approach was proposed by Mithun et al. [50], who used web data to develop CMR based on web-supervised learning to reach a strong joint embedding; they used the hard-negatives loss proposed by Faghri et al. [38] for the CMR task.
The loss function has a significant impact since it shows the error level during learning, so selecting or designing the loss is essential to reach the desired output. Accordingly, many attempts have been made to introduce new loss functions for ITM. Zhang et al. [51] introduced a loss function called cross-modal projection matching (CMPM), which aims to boost the correlation between matched pairs. In addition to CMPM, they proposed a classification loss called CMPC to obtain discriminative features. After that, Jian et al. [52] proposed a bi-triple loss to decrease the gap between images and sentences using data label information, where the similarity is computed using Euclidean distance. In this work, the features are extracted by a two-layer network. To improve data quality, Zhen et al. [53] introduced an invariance loss that aims to eliminate inconsistency between data, and a discrimination loss that preserves discrimination among different semantic classes based on label information. In addition, Liu et al. [54] proposed a K-nearest-neighbor (KNN) loss to avoid overfitting or handle noisy data, where the KNN loss builds on the hard-negatives loss proposed by Faghri et al. [38]. Then, Wang et al. [32] proposed a discriminative embedding by representing the different modalities hierarchically, and suggested a semantic discrepancy (SD) loss to deal with multiple semantic levels. The model can share information between input and output through a reverse connection. Recently, Biten et al. [55] modified the HN loss by introducing a semantic adaptive margin (SAM), where the sentences are used to update the margins to find the most similar samples, unlike the HN loss, which is concerned only with the ground truth. From another perspective, Chen et al. [56] proposed integrating an intra-modal loss with HN to improve learning with intra-modal information.
To enhance feature embedding in rank learning, Wang et al. [57] extended the matching model of Faghri et al. [38] by incorporating consensus knowledge for scenes and their captions into the embedding using a graph, whereas Shi et al. [58] incorporated scene information to improve the embedding. In addition, Liu et al. [59] introduced a neighbor-aware loss to increase the distance between different neighbors based on their semantics. Another way to improve CMR results is to enhance the way the similarity is computed. Wang et al. [35] tried to build a matching model to serve two separate tasks, image–text retrieval and phrase localization, by building an embedding network with neighborhood constraints and a similarity network, but the similarity network failed in the image-sentence task. Lately, Yang et al. [60] suggested computing the similarity using the Wasserstein distance instead of cosine or Euclidean distance, where the similarity is measured using the distribution of samples. The model extracts mutual information to improve matching.
Instead of using pairwise ranking, Xu et al. [61] employed list-wise ranking in CMR. This means that, for a given image, the loss is computed over all available annotations at the same time. This avoids a drawback of pairwise ranking: because the number of unrelated annotations is larger than the number of related ones, irrelevant annotations are sometimes ranked before relevant ones, which decreases model accuracy. Recently, a set of re-ranking approaches has been introduced as a post-processing step to enhance CMR performance at the testing stage and compensate for this pairwise ranking drawback. For example, Wang et al. [62] proposed a re-ranking approach to improve testing results without further training. Then, Niu et al. [63] proposed another re-ranking approach. These models start by creating a fusion method for the image and annotation modalities and then use the re-ranking method to obtain improved outcomes.
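To make the distinction between the three ranking approaches concrete, the sketch below scores the same candidate list with one possible loss of each kind. The specific losses (squared error, margin hinge, and a softmax cross-entropy over the whole list) are common choices assumed here for illustration; the function names and the margin value are hypothetical and are not drawn from [43].

```python
import numpy as np

def pointwise_loss(scores, labels):
    """Squared error between each candidate's score and its relevance label."""
    return float(np.mean((scores - labels) ** 2))

def pairwise_loss(scores, labels, margin=0.2):
    """Hinge loss over every (relevant, irrelevant) pair of candidates."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    return float(np.maximum(0.0, margin - pos[:, None] + neg[None, :]).sum())

def listwise_loss(scores, labels):
    """Cross-entropy between softmax(scores) and the normalized labels."""
    shifted = scores - scores.max()                  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    target = labels / labels.sum()
    return float(-(target * log_probs).sum())

# One query with three candidates; only the first candidate is relevant.
labels = np.array([1.0, 0.0, 0.0])
good_scores = np.array([2.0, 0.5, 0.1])   # relevant item ranked first
bad_scores = np.array([0.1, 0.5, 2.0])    # relevant item ranked last
```

A ranking that places the relevant candidate first yields a lower loss under all three formulations, but only the list-wise loss looks at the entire candidate list in a single term.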
3.3 Interaction Learning
In this approach, information is transferred among the modalities before entering the common space, as in Fig. 3. Lou et al. [64] proposed a CMR model based on multitask learning, using a correlation network to learn the common information and to distinguish unrelated image–text pairs. In addition, a relation-enhanced autoencoder is used to correlate the hidden embeddings, where the interaction between modalities takes place. Simultaneously, Wang et al. [65] proposed message passing between the two modalities, where the salient information from one modality is aggregated and the aggregation result is then passed to the other modality.
Recently, many methods have been suggested to solve the matching task by adding attention layers that transfer information between the image and text modalities, in order to obtain the salient regions in the image and achieve better understanding. Lee et al. [18] designed a network with stacked attention layers to obtain full alignment between image and text. This model attended to salient areas such as snow and ground, using bottom-up attention [66] to extract scene features. Following the same idea, Wang et al. [19] extended the model of Lee et al. [18] by adding an attention layer that detects object positions to enhance matching results. In contrast to Lee et al. [18] and Wang et al. [19], Huang et al. [34] focused on the relations among objects and words and disregarded salient regions. In addition, many studies motivated by the work of Lee et al. employed interaction learning, such as Wu et al. [67], Ji et al. [68], Diao et al. [69], Li et al. [36], and Qi et al. [70]. These studies proposed two-level networks that deal with global and local characteristics and compute the final similarity after obtaining the similarities based on the hybrid alignments. They guided the learning process with the important words in the input sentence and objects in the image. Furthermore, Ji et al. [68] presented a method to produce pairwise representations for image–text pairs consistent with their semantic association, and employed intra-modal interaction between image regions. By contrast, Nam et al. [71] proposed a matching network with dual attention, where the model concentrated on certain regions and words to capture the fine-grained correspondence between them. Another attention network was designed by Liu et al. [72]; it differs from the previous studies in that it focused on learning weights, and it eliminated unrelated fragments from the shared space. Zhang et al. [73] used confidence to indicate the degree of consistency of each region with the global text, in order to enhance alignment, especially for salient regions. Generally, the previous attention models applied alignment between all words and all regions, which consumes time and memory when computing the overall similarity. To reduce the effect of this issue, Zhang et al. [74] recently proposed incorporating a relevance threshold between fragments into the embedding learning network to distinguish relevant from irrelevant fragments and obtain better alignment.
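The word-to-region attention used by the models above can be approximated with a short sketch. It is a simplified illustration and not the architecture of [18] or any other cited work: every word attends over all regions through a softmax on cosine similarities, and the attended region vector is compared back to the word. The `temperature` value and the function names are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def attention_similarity(words, regions, temperature=9.0):
    """Image-sentence score from word-to-region attention, plus the weights."""
    words_n = l2_normalize(words)
    regions_n = l2_normalize(regions)
    sim = words_n @ regions_n.T                      # (n_words, n_regions)
    weights = softmax(temperature * sim, axis=1)     # attention over regions
    attended = weights @ regions_n                   # one visual vector per word
    per_word = np.sum(words_n * l2_normalize(attended), axis=1)
    return float(per_word.mean()), weights

regions = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
matching_words = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
unrelated_words = np.array([[0.0, 0.0, 1.0]])

match_score, match_weights = attention_similarity(matching_words, regions)
miss_score, _ = attention_similarity(unrelated_words, regions)
```

The similarity matrix has one entry per word-region pair, which is the source of the time and memory cost noted above; thresholding approaches aim to prune the irrelevant pairs.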
By contrast, instead of attention layers, graphs have been used to align image objects and words. Liu et al. [75] built a graph network that treated objects (cat, person, etc.), relations (playing, eating, etc.), and words as a structured phrase. In addition, Li et al. [76] proposed a multi-level similarity measurement based on a graph network. Recently, Long et al. [77] introduced a dual graph representation for text and image to bridge the gap between modalities, where text semantics are used to improve the visual representation and visual information is used to improve the textual representation based on an attention mechanism. They used graph convolutional networks (GCNs) to implement their approach. Dong et al. [78] also used a GCN: they proposed a hierarchical aggregation of features in which the GCN extracts the relations between objects, and the object features are then integrated with the global feature from the other modality to narrow the modality gap and facilitate the fusion of cross-modal features. The proposed aggregation combined object attributes and object relations hierarchically in both modalities. The idea behind using a graph is to embed the relations between objects and words to enhance learning.
However, this learning approach increases computations and makes learning complicated, since transferring information between modalities is done before projecting the features in latent space.
3.4 Adversarial Learning
The adversarial concept is a new learning approach based on the generative adversarial network (GAN). A GAN is a neural network composed of a generative network, which generates samples that are close to the given data based on the data distribution, and a discriminative network, which aims to distinguish between real data and the examples generated by the generative network [79, 80]. Fig. 4 illustrates how a GAN works.
Lately, several attempts have employed GANs in multimodal retrieval. The first was by Park et al. [81], whose model focuses on category information instead of image and sentence pairs. Consequently, the embedding is learned through category prediction and domain classification procedures: the category predictor aims to characterize the features from the different modalities, and the domain classifier aims to make the features extracted from the images and sentences have the same distribution. This is achieved through a gradient reversal layer (GRL) that creates an adversarial relation between the domain classifier and the embedding network. Because this model depends on category tags, the learning performance may be affected if the prediction is not accurate. Following the same idea, Wang et al. [82] and Sarafianos et al. [83] modified the model suggested in [81] by using different image and text representation methods and different optimization techniques. Gu et al. [84] incorporated a GAN in the embedding phase to capture local and global features. Zhu et al. [85] and Wang et al. [86] used GANs to learn how to match food recipes and images.
Recently, an integration of adversarial learning and information theory was proposed in [9] for cross-modal retrieval. This framework employs information theory to reduce the semantic gap between image and text, where information entropy is combined with modality categorization in an adversarial way based on the relation between modality uncertainty and information entropy.
3.5 Cycle-Consistent Learning
The cycle-consistent concept between text and image means that the retrieval model is able to translate text features into corresponding image features and vice versa. This translation is based on reconstruction constraints that guarantee that the reconstructed text or image is equivalent to the original [87], as in Fig. 5. The advantage of this approach appears when it is difficult to gather many image and sentence pairs, since the models learn to map directly between different modalities [80].
Cornia et al. [87] introduced a cycle-consistent network for the image and text matching task by transforming textual data into an appropriate representation in the visual domain, and visual data into the textual domain, where this mapping is synchronized for both modalities based on a similarity condition. The model applied a hinge-based triplet loss instead of a cycle-consistency loss to decrease the model’s complexity. In addition, Liu et al. [88] suggested a more complicated model based on cycle learning for the same task, where the retrieval model employed cycle embedding for image–text retrieval by incorporating intra-modal consistency and inter-modal correlation to achieve better translation among different modalities. The loss function of the model is defined as the sum of three loss functions in both directions: the hard-negative, reconstruction, and latent losses. Simultaneously, Chen et al. [89] used semantic consistency to learn the modalities’ embedding spaces jointly, where a consistency constraint is incorporated into the loss function.
The models in [87, 88] do not utilize local relations between image parts and the given text. In general, such models are not simple to implement without regularization based on reconstruction constraints to ensure translation quality.
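The reconstruction constraint can be illustrated with a minimal sketch in which the two translation functions are plain linear maps. This is an assumption made purely for illustration; the surveyed models use deep networks, and the names `t2i_map`, `i2t_map`, and `cycle_consistency_loss` are hypothetical.

```python
import numpy as np

def cycle_consistency_loss(text_feat, image_feat, t2i_map, i2t_map):
    """Mean squared reconstruction error of both round trips."""
    text_cycle = (text_feat @ t2i_map) @ i2t_map     # text -> image -> text
    image_cycle = (image_feat @ i2t_map) @ t2i_map   # image -> text -> image
    text_err = np.mean((text_cycle - text_feat) ** 2)
    image_err = np.mean((image_cycle - image_feat) ** 2)
    return float(text_err + image_err)

# Hypothetical 2-d features for two sentences and two images.
text_feat = np.array([[1.0, 2.0], [0.5, -1.0]])
image_feat = np.array([[2.0, 1.0], [1.0, -0.5]])

t2i_map = np.array([[2.0, 0.0], [0.0, 0.5]])
consistent_i2t = np.array([[0.5, 0.0], [0.0, 2.0]])  # exact inverse of t2i_map
inconsistent_i2t = np.eye(2)                         # does not undo t2i_map

loss_ok = cycle_consistency_loss(text_feat, image_feat, t2i_map, consistent_i2t)
loss_bad = cycle_consistency_loss(text_feat, image_feat, t2i_map, inconsistent_i2t)
```

When the two maps invert each other, both round trips reproduce their input and the loss is zero; any mismatch between the translations shows up as a positive reconstruction error, without requiring paired image and sentence examples for that term.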
3.6 Few-Shot Learning
Few-shot learning (also known as attribute learning) simulates human thinking and learns the features (attributes) of objects, such as shape, color, and texture, from a few instances. It has a generalization capacity because of its representation learning sequence, where attribute information can be shared among known and unknown classes [90], as in Fig. 6. The main advantage of few-shot learning is that it reduces the large amount of data required to train DL models. In addition, it can handle unseen classes [91].
Unlike the statistical CMR model proposed by Yuan et al. [92], and inspired by Ji et al. [93], who proposed a model for image retrieval from text or image queries using the attribute learning approach, a number of CMR models have been proposed. For example, Chakraborty et al. [94] used textual attention (self-attention) to enhance CMR performance in zero-shot cases. The model used a simple recurrent unit (SRU) to represent the sentences; in addition, inter-modality fusion is applied between the different modalities to focus on the important areas in the images. Simultaneously, Huang et al. [91] proposed an aligned memory for CM to store rarely occurring content, using a graph convolutional network to control the memory. In addition, a bi-directional GRU is used to retain the sequence order of the sentences.
Compared with the statistical model of Yuan et al. [92], the DL models above achieve better results. However, this learning approach faces many challenges, such as inadequate training examples for unseen instances during the training phase, a lack of class knowledge because of the inconsistency between seen and unseen classes, and distribution heterogeneity due to the multimodal data.
3.7 Hybrid Learning
A set of studies combine several learning approaches to achieve better results or to solve the issues of one learning approach with another. To address the absence of paired data, Wu et al. [95] suggested combining the cycle approach with adversarial learning. In the proposed model, called CycleGAN, a cycle-consistency loss is used to encourage the codes generated by the hash functions to carry semantic information between inputs and outputs. Using a GAN increases the correlation between the given inputs and the corresponding outputs, so the model can translate each modality to the other. The disadvantage of this model is its complexity. In addition, Xu et al. [96] designed an improved GAN structure based on the Wasserstein distance and developed three alignment methods with cycle-consistency constraints instead of using triplet ranking. They also took zero-shot (ZS) scenarios into consideration during the testing phase.
To address insufficient training data in ZS scenarios, ZS learning has been integrated with GANs to generate more data, as in Xu et al. [90] and Xu et al. [97]. In addition, Lin et al. [98] integrated ZS with cycle learning; they used a variational autoencoder instead of a GAN to produce latent embeddings, in order to achieve stable training and enhance retrieval results. In another attempt to link GANs with ZS, Xu et al. [99] designed a CM with three subnetworks: two for capturing semantic features, and the remaining one, trained with self-supervision, to transfer knowledge to unknown labels. In addition, Huang et al. [100] proposed a gated visual semantic embedding to enhance the modalities’ representations. First, the model learned two parallel visual semantic embeddings (VSEs), one for uncommon and the other for common instances, with two metrics for measuring common and uncommon similarities; these are then fused by the proposed gated fusion matrix to produce the final representation matrix. However, the gated model needs more training epochs than normal models. Furthermore, interaction learning has been combined with other methods. For instance, GAN performance has been enhanced by combining GANs with attention: Wei et al. [101] proposed a CM that combined GAN and attention to enhance embeddings in ITM. In addition, Ma et al. [102] proposed solving the drawbacks of DCCA using selective attention by incorporating local and global features and using intra-modality knowledge. Although these attempts achieve enhancements in ITM, they suffer from complexity, especially in the loss structure, which affects learning time and resources.
3.8 Vision-Language Pre-trained Models
All the previous learning methods have succeeded in the ITM task, but most of the resulting models focus on ITM only and are trained on ITM datasets. In vision-language tasks such as ITM, visual question answering (VQA), or image annotation (IA), working on a single task results in poor transferability. In transfer learning, a model developed for one task (e.g., classification) is reused as the starting point for a model on a second task (e.g., ITM, VQA); pre-trained models are commonly used as such starting points (e.g., for feature representations) in CV and NLP tasks. This helps researchers with tasks that demand high resources, since integrating pre-trained models into new models reduces the required time and resources. Pre-training was first explored in the CV field, where it showed effective results [14], and pre-trained models such as VGGNet and the others mentioned in Sect. 2.1 are widely used to capture visual features. The success of pre-trained models in CV extended to NLP after the release of the transformer [103], BERT [25], and GPT-3 [104]. The transformer is a DL model that adopts self-attention (SA) to handle long-range dependencies by weighting the input segments differentially [103], and it is the backbone of the pre-trained language models BERT [25] and GPT-3 [104]. Inspired by the success of pre-trained models in NLP and CV, several attempts have been made to pre-train large-scale models on both vision and language by training them on large, general image–text datasets. These models are known as vision-language pre-trained models (VL-PTMs). VL-PTMs aim to improve performance on image–text tasks such as ITR and VQA by acquiring general representations of the task modalities.
A VL-PTM is then adapted to different tasks by fine-tuning it on the desired task.
VL-PTMs are usually built in three stages: (1) defining image and text encoders to obtain the input representations; (2) defining the interaction schema that links the modalities by designing the pre-trained model architecture; and (3) choosing the pre-training tasks used to train the model on a general dataset. In the encoding stage, BERT [25] is frequently used as the text encoder, as in ViLBERT [105], while Faster R-CNN [16] and ResNet [17] are used for image encoding. The text and image representations are then fed to an encoder that integrates the information of both modalities; this image–text interaction is the second stage. To align image parts with text snippets, the model must integrate the information extracted from the input texts and images and then localize the corresponding regions or objects in the image–text pairs. According to how the information from the modalities is integrated, the encoders can be classified as fusion encoders (single stream or dual stream), dual encoders, or a combination of both.
In a fusion encoder, the hidden states of the last attention layer, after SA or cross-attention, are used as the fused representation of the modalities. There are two types of fusion encoder: single stream and dual stream. In the single-stream type, the text and image representations are concatenated and passed to a single transformer (SA) encoder to produce the fused representation. This approach is implemented in VL-BERT [106], which uses segment encoding for the inputs instead of a global image–text pair encoding, and in OSCAR [107], which uses the objects detected in an image as tags to achieve better alignment, so that each image–text pair is represented as a word-tag-image triple. However, the single-stream method does not take intra-modality information into consideration, and the dual-stream type was proposed to solve this issue. Rather than a joint SA over the concatenated inputs, the dual-stream type adopts a cross-attention layer in which the query vectors come from one modality and the key and value vectors come from the other. The cross-attention layer contains two sub-layers, one per modality, that exchange information between the modalities; this allows intra-modal interaction within each modality and separates intra-modal from cross-modal interaction. This schema is employed in many VL-PTMs such as ViLBERT [105], LXMERT [108], and ALBEF [109]. More recently, BLIP [110] was proposed and achieves high performance. However, applying a fusion encoder to ITR requires encoding every candidate image–text pair to calculate the similarity scores, and each encoding passes through a transformer, so the time complexity increases and inference becomes quite slow.
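The difference between the two fusion-encoder types can be illustrated with a small sketch of scaled dot-product attention. This is a simplified illustration, not the architecture of any cited model: the learned query, key, and value projections, multiple heads, and feed-forward layers are all omitted, and the toy vectors are made up.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Scaled dot-product attention; learned projections are omitted (assumption)."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


text = [[1.0, 0.0], [0.0, 1.0]]               # two token vectors (toy values)
image = [[1.0, 1.0], [0.5, 0.0], [0.0, 2.0]]  # three region vectors (toy values)

# Single stream: concatenate both sequences, then apply SA over the joint sequence.
joint = text + image
single_stream = attention(joint, joint, joint)

# Dual stream: cross-attention, with queries from one modality and
# keys/values from the other, one sub-layer per direction.
text_attends_image = attention(text, image, image)
image_attends_text = attention(image, text, text)

print(len(single_stream), len(text_attends_image), len(image_attends_text))
```

In both cases the fused representation of a candidate pair depends on both inputs at once, so it cannot be cached per image or per sentence; this is the source of the inference cost noted above.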
In contrast to the fusion encoder, the dual encoder uses single-modal encoders to encode each modality separately. It then adopts an attention layer as in [18], or a dot product as in ALIGN [111] and CLIP [112], on image and text embedding vectors projected into a shared latent space to compute the similarity scores, without the complex cross-attention layers of the fusion encoder. This makes the dual encoder more efficient for ITR than the fusion encoder, since the embedding vectors of both images and sentences can be pre-computed and stored. More recently, MACK [113] was proposed as a re-ranking method, based on a cycle-consistent loss, that enhances the performance of ALIGN [111] and CLIP [112].
Fusion and dual encoders are combined in FLAVA [114] and VLMO [115]: since the fusion encoder performs better on vision-language understanding while the dual encoder performs better on ITR, it is natural to combine the benefits of the two architectures. Table 3 shows the differences between the VL-PTMs used in the ITM/ITR task. Once the inputs are encoded as vectors and the interaction is defined, the next stage is to design pre-training tasks such as ITM, VQA, and IC for the VL-PTM. These tasks influence what the VL-PTM can learn from the input data. Some models, such as BLIP [110] and GLIP [116], also consider ZS settings, in which the model is evaluated without fine-tuning on the evaluation dataset. In addition, adversarial data samples are used to enhance the pre-trained model and overcome overfitting, as proposed in [117]. This shows how VL-PTMs incorporate the other learning approaches. In general, after pre-training, VL-PTMs are fine-tuned on a specific task using the evaluation dataset.
Finally, to summarize our work, Table 2 lists the advantages and disadvantages of each learning approach. Table 1 summarizes the techniques used in each existing work in terms of the text and image representations, the loss function used in learning, and the optimization technique, such as stochastic gradient descent (SGD) or Adam, both of which are common in ITM. Figure 7 shows a taxonomy of the literature based on the learning approaches and alignment methods.
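The retrieval advantage of the dual encoder can be sketched as follows. The encoders themselves are not modeled here: the 3-dimensional vectors are hand-made stand-ins for encoder outputs, and the helper names (`build_index`, `retrieve`) are hypothetical, not part of ALIGN, CLIP, or any library.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def build_index(embeddings: Dict[str, Vector]) -> Dict[str, Vector]:
    """Normalize gallery embeddings once, offline; queries then need only dot products."""
    return {key: l2_normalize(vec) for key, vec in embeddings.items()}


def retrieve(query: Vector, index: Dict[str, Vector],
             top_k: int = 5) -> List[Tuple[str, float]]:
    """Rank all indexed items by cosine similarity to the query embedding."""
    q = l2_normalize(query)
    scored = [(key, dot(q, vec)) for key, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical image embeddings, computed and stored before any query arrives.
image_index = build_index({
    "dog.jpg": [0.9, 0.1, 0.0],
    "beach.jpg": [0.0, 0.2, 0.9],
    "city.jpg": [0.1, 0.9, 0.1],
})

text_query = [0.8, 0.2, 0.1]  # stand-in for a text-encoder output
results = retrieve(text_query, image_index, top_k=2)
print(results)
```

At query time only the text is encoded, and each candidate costs one dot product, whereas a fusion encoder would need a full forward pass per image–text pair.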
4 Datasets and Evaluation Methods
4.1 Evaluation
There are two popular evaluation methods for CMR. The first is Recall@K (R@K), the fraction of queries for which a correct item (text or image) appears among the top K results; K usually takes the values 1, 5, and 10 in retrieval tasks. The second is mean average precision (MAP), computed over all returned results for both retrieval directions; the MAP value is the mean of the average precision (AP) over all retrieval queries. In image–text retrieval, R@K is more commonly used than MAP.
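Both measures can be sketched in a few lines. The data layout (a ranked list per query and a set of relevant items per query) and the toy identifiers are assumptions made for illustration; AP is normalized here by the number of relevant items, which is one common convention among several.

```python
from typing import Dict, Sequence, Set


def recall_at_k(ranked: Dict[str, Sequence[str]],
                relevant: Dict[str, Set[str]], k: int) -> float:
    """Fraction of queries with at least one relevant item in the top k results."""
    hits = sum(
        1 for query, items in ranked.items()
        if any(item in relevant[query] for item in items[:k])
    )
    return hits / len(ranked)


def average_precision(items: Sequence[str], rel: Set[str]) -> float:
    """AP for one query: mean of precision@rank over the ranks of relevant items."""
    found = 0
    precision_sum = 0.0
    for rank, item in enumerate(items, start=1):
        if item in rel:
            found += 1
            precision_sum += found / rank
    return precision_sum / len(rel) if rel else 0.0


def mean_average_precision(ranked: Dict[str, Sequence[str]],
                           relevant: Dict[str, Set[str]]) -> float:
    """MAP: the mean of AP over all queries."""
    return sum(average_precision(ranked[q], relevant[q]) for q in ranked) / len(ranked)


# Toy I2T example: each image query returns a ranked list of caption ids.
ranked = {
    "img1": ["c3", "c1", "c7", "c2"],
    "img2": ["c5", "c9", "c4", "c6"],
}
relevant = {"img1": {"c1", "c2"}, "img2": {"c4"}}

print(recall_at_k(ranked, relevant, 1))  # 0.0
print(recall_at_k(ranked, relevant, 3))  # 1.0
print(mean_average_precision(ranked, relevant))
```

R@K only asks whether any correct item reaches the top K, while AP rewards placing every relevant item early, which is why the two measures can rank models differently.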
4.2 Datasets
Many benchmark datasets exist for the CMR domain, particularly for image–text annotation and search tasks. They are summarized in Table 4 and include:
-
Wiki [124]: it consists of 2,866 images representing 10 categories, of which 2,173 image–text pairs are used for training and 693 for testing. The data are selected from Wikipedia articles; each text is an article about events, places, or people, and the related image illustrates the article content, as shown in Fig. 8.
-
Pascal Sentences [125]: also called PASCAL1K, it contains 1 K images, each with five captions collected using Amazon’s Mechanical Turk (AMT), as shown in Fig. 9. The images are taken at random from PASCAL VOC 2008 [126], which contains 20 classes.
-
Flickr8K [127] and Flickr30K [128]: Flickr30K is an extension of Flickr8K. Flickr8K contains nearly 8 K images, each with 5 clear annotations of the image content; the images were selected manually from 6 different Flickr groups. Flickr30K, released later, contains nearly 31 K images and 155 K sentences, with 5 descriptions per image.
-
MSCOCO [129]: the MSCOCO (2014) version contains approximately 164 K images, with approximately 83 K for training, 41 K for validation, and the rest for testing; each image has 5 corresponding descriptions. In the later MSCOCO (2017) version, approximately 118 K images are used for training, 5 K for validation, and the rest for testing.
Table 5 shows the evaluation approaches and datasets used in the literature. In addition, Tables 6 and 7 report publicly recorded performance on the two most common datasets, Flickr30K and MSCOCO, using the R@K evaluation method.
5 Discussion and New Directions
5.1 Discussion
In this survey, we used 66 previous studies related to ITM that are trained on the ITM datasets in Table 4. For a fair comparison and discussion, we exclude 4 of them, namely [55], [63], [85], and [86]. Biten et al. [55] evaluated their approach on the models proposed in [1], [57], and [69], and Niu et al. [63] likewise evaluated their re-ranking method on the models proposed in [38], [48], and [100]. Zhu et al. [85] and Wang et al. [86] used datasets related to food recipes. We also exclude the VL-PTMs used in the ITM task from the statistics, since they follow a different architecture, but we compare their results with those of the other learning approaches.
Tables 6 and 7 show some recent results on Flickr30K and MSCOCO for all learning approaches to give insight into their performance, since these are the most commonly used datasets, as shown in Fig. 10(c). The tables show that the quality of bi-directional retrieval differs between I2T and T2I: some ITM models perform well in I2T but not in T2I, and vice versa, depending on how each modality's features are represented in the common space. Performance also varies from one dataset to another, depending on the variation in the data.
Looking more closely at the structures of the ITM methods in Table 1, Fig. 10 presents statistics that help to understand how the alignment and learning approaches are distributed across the previous studies. According to Fig. 10(b), 55% of the studies use the global approach, 22% the local approach, and 23% the hybrid approach. Fig. 10(a) shows that the ranking method has the highest share, 35%, with interaction learning in second place. Fig. 7 illustrates the relation between the learning and alignment approaches, which influences the model structures.
Based on Table 1, we summarize the embedding methods used for images and text in Fig. 10(d, e). Here, we count how many times each method is used in the previous work, bearing in mind that a model may use more than one embedding method. Fig. 10(c) shows how often each dataset is used: Flickr30K is the most frequent, followed by MSCOCO. Building an ITM model requires not only selecting an appropriate modality representation and learning approach but also knowing the available resources, such as memory and storage space, since feature extraction and learning consume resources. Table 1 reveals similarities in the construction of ITM models, as in [18, 19, 65, 67, 69] and [72,73,74,75] in interaction learning. VGGNet is used mostly in adversarial and hybrid learning, whereas R-CNN and Faster R-CNN are used in interaction learning. A dataset may also require a specific treatment to obtain better results: Wiki contains long texts, so most studies use Doc2Vec to embed its text. The previous studies can be divided into two groups, one using Flickr and MSCOCO (with R@K for evaluation) and the other using Wiki and Pascal (with MAP for evaluation); this is due to the available resources. Furthermore, the loss is an essential factor in learning, and triplet loss, especially with hard negatives (HN), is commonly used.
Comparing the VL-PTMs with the other learning approaches in ITR, the VL-PTMs clearly outperform the others, as shown in Tables 6 and 7. This is because VL-PTMs are trained on large-scale, general data comprising millions of image–text pairs and are evaluated on a data partition that is small compared with the training data. However, they need huge resources compared with the other learning approaches. In addition, the performance of any VL-PTM can change with the amount of pre-training data.
5.2 New Directions
Although some encouraging achievements have been made in bi-directional image–text matching, a set of open issues still requires more investigation. In this subsection, we highlight the following prospective research opportunities:
-
Similarity or correlation measurement: searching for matched texts and images is a hard task, since images and words lie in heterogeneous spaces. Therefore, it is difficult to measure their similarity before the features are projected into a common space.
-
Datasets: the available datasets have issues such as small size or uncategorized data. For instance, the Pascal Sentences and Wiki datasets are small compared with the rest, and making decisions automatically from scarce data is a challenge. From another perspective, a small dataset of real data could serve as a pilot dataset, a resource that is not currently available for the matching task, so that researchers need not start with large datasets and can save time and resources. Unlike Wiki and Pascal Sentences, Flickr30K has uncategorized data, so details such as whether the data are balanced cannot be obtained. Consequently, it is preferable to design new general datasets or modify the existing ones for upcoming research. In addition, the images in the available datasets mostly do not include textual data, which could help the learning process achieve robust performance. As a first step in this direction, Mafla et al. [130] established a dataset of images with scene-text examples for use in CMR. Today, the easiest way to create a dataset is from social media, keeping in mind that the text descriptions may be formal or informal.
-
Evaluation: in the standard datasets, each image has only 5 annotations, and sentences are labeled as relevant or not, ignoring the degree of relevance. The existing ITM evaluation methods therefore need enhancement to capture deeper semantic relations between modalities. This motivated Biten et al. [55] to propose new evaluation metrics for ITM inspired by the evaluation metrics used in the image captioning task.
-
Features Representation: the embedding approaches play a vital role in the retrieval task, so it is important to select a suitable approach for representing words and images. Some recent studies apply multi-step feature extraction to improve the understanding of each modality and to create powerful language models that understand images well. In addition, most studies ignore intra-modal information, which provides useful information alongside inter-modal information during learning. Another issue is that using the same encoders for dissimilar datasets may not be effective. For example, Jian et al. [52] used different encoders for different datasets to achieve better results.
-
Features Dimensions: in multimodal retrieval, the data dimensions are high, so compression techniques that compact the modality representations are a promising way to decrease hardware requirements, such as storage space, memory, and GPUs, and to reduce training time. For instance, Serra et al. [131] proposed a compact embedding approach that deals with one-hot data. Another option is network compression, which decreases the size of the created models, as in [132].
-
Loss Functions: loss functions are important for minimizing the output error, but some of them have issues. For instance, with triplet loss it is hard to sample suitable triplets and to select proper margins. In addition, the complex structure of the loss function, especially in hybrid models, takes more time to compute.
-
Statistical Information: incorporating statistical information with the features extracted using DL may improve understanding, as in [60] and [57].
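As an illustration of the triplet loss with hard negatives (HN) discussed under Loss Functions, the following minimal sketch computes a hinge-based loss in the spirit of VSE++ [38] from a similarity matrix. The batch layout (matched pairs on the diagonal), the averaging over the batch, the margin of 0.2, and the toy similarity values are assumptions for illustration rather than settings taken from any cited work.

```python
from typing import List

Matrix = List[List[float]]


def hardest_negative_triplet_loss(sim: Matrix, margin: float = 0.2) -> float:
    """Hinge loss on the hardest in-batch negatives.

    sim[i][j] is the similarity of image i and caption j; matched pairs lie
    on the diagonal. Requires a batch of at least two pairs.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        # Hardest negative caption for image i (row i, off-diagonal).
        hardest_caption = max(sim[i][j] for j in range(n) if j != i)
        # Hardest negative image for caption i (column i, off-diagonal).
        hardest_image = max(sim[j][i] for j in range(n) if j != i)
        total += max(0.0, margin + hardest_caption - pos)
        total += max(0.0, margin + hardest_image - pos)
    return total / n


# Toy similarity matrix for a batch of three image-caption pairs.
sim = [
    [0.9, 0.8, 0.1],
    [0.2, 0.7, 0.3],
    [0.1, 0.4, 0.6],
]
loss = hardest_negative_triplet_loss(sim, margin=0.2)
print(loss)
```

Only the single most confusing negative per query contributes, which sidesteps sampling triplets explicitly but leaves the choice of margin open, the difficulty noted in the list above.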
6 Conclusion
This paper presents an overview of the image–text matching task using deep learning. It summarizes the current learning methods and categorizes them into DCCA, ranking, interaction, adversarial, cycle-consistent, few-shot, hybrid, and VL-PTMs. We also classify the existing works by alignment method into global, local, and hybrid (global–local). For further illustration, we summarize the techniques used in the existing architectures to show the main differences between them. We then present commonly used benchmark datasets and empirically assess the performance of some existing works. Finally, we discuss the challenges and future trends in image–text matching. Although remarkable progress has been made in the matching task, more work is needed to reach performance that can mimic human behavior. We hope this review helps junior researchers understand the state of the art in image–text matching and motivates them to produce more significant work.
Data Availability
Data are available on request and the links for the mentioned datasets are in Table 1.
Abbreviations
- DL: Deep learning
- ITM: Image–text matching
- ITR: Image–text retrieval
- IC: Image caption
- VQA: Visual question answering
- VL-PTMs: Vision-language pre-trained models
- Bi-ITM: Bi-directional image–text matching
- I2T: Image-to-text
- T2I: Text-to-image
- CMR: Cross-modal retrieval
- CCA: Canonical correlation analysis
- PLS: Partial least squares
- NLP: Natural language processing
- CV: Computer vision
- CNNs: Convolutional neural networks
- VGGNet: Visual geometry group
- R-CNN: Region-based convolutional neural network
- Faster R-CNN: Faster region-based convolutional neural networks
- ResNet: Residual network
- BOW: Bag of words
- TF-IDF: Term frequency-inverse document frequency
- LDA: Latent Dirichlet allocation
- RNN: Recurrent neural network
- LSTM: Long short-term memory
- GRU: Gated recurrent units
- Bi-LSTM: Bi-directional long short-term memory
- Bi-GRU: Bi-directional gated recurrent units
- SA: Self-attention
- HN: Hard-negatives
- DCCA: Deep canonical correlation analysis
- DT: Dependency trees
- CMPM: Cross-modal projection matching
- CMPC: Cross-modal projection classification
- KNN: K-Nearest-neighbor
- SD: Semantic discrepancy
- SAM: Semantic adaptive margin
- GRL: Gradient reversal layer
- GCN: Graph convolutional networks
- GAN: Generative adversarial network
- SRU: Simple recurrent unit
- ZS: Zero-shot
- VSE: Visual semantic embedding
- SIFT: Scale invariant feature transform
- L-BFGS: Limited Broyden–Fletcher–Goldfarb–Shanno
- SGD: Stochastic gradient descent
References
Li, K., Zhang, Y., Li, K., Li, Y., Fu, Y. (2019) Visual Semantic Reasoning for Image-Text Matching. in Proceedings of the IEEE/CVF International Conference on Computer Vision.
Abdullah, T., Rangarajan, L.: Image-Text Matching: Methods and Challenges, pp. 213–222. Inventive Systems and Control, Springer, Singapore (2021)
Hardoon, D.R., Szedmak, S., Shawe-Taylor, J.: Canonical correlation analysis: an overview with application to learning methods. Neural Comput. 16(12), 2639–2664 (2004)
Rasiwasia, N., Mahajan, D., Mahadevan, V., Aggarwal, G. (2014) Cluster Canonical Correlation Analysis. in Artificial Intelligence and Statistics. PMLR
Rosipal, R., Krämer, N.: Overview and Recent Advances in Partial Least Squares. In: International Statistical and Optimization Perspectives Workshop" Subspace, Latent Structure and Feature Selection", pp. 34–51. Springer, Berlin (2005)
Peng, Y., Huang, X., Zhao, Y.: An overview of cross-media retrieval: concepts, methodologies, benchmarks, and challenges. IEEE Trans. Circuits Syst. Video Technol. 28(9), 2372–2385 (2017)
Aygun, R. Benesova, W. (2018) Multimedia Retrieval that Works. in 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR). IEEE
Chen, J., Zhang, L., Bai, C., Kpalma, K. (2020) Review of Recent Deep Learning Based Methods for Image-Text Retrieval. in Proceeding of 2020 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR). IEEE
Chen, W., Liu, Y., Bakker, E.M., Lew, M.S.: Integrating information theory and adversarial learning for cross-modal retrieval. Pattern Recogn. 117, 107983 (2021)
Messina, N., Amato, G., Esuli, A., Falchi, F., Gennaro, C., Marchand-Maillet, S.: Fine-grained visual textual alignment for cross-modal retrieval using transformer encoders. ACM Transact Multimed Comput Commun App (TOMM). 17(4), 1–23 (2021)
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Adv. Neural. Inf. Process. Syst. 25, 1097–1105 (2012)
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich. (2015) Going Deeper with Convolutions. in Proceedings of the IEEE conference on computer vision and pattern recognition.
Simonyan, K., Zisserman, A. (2015) Very Deep Convolutional Networks for Large-Scale Image Recognition. in 3rd International Conference on Learning Representations. San Diego, CA, USA (May 7–9)
Girshick, R., Donahue, J., Darrell, T., Malik, J. (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Ren, S., He, K., Girshick, R., Sun, J. (2015) Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. in Proceeding of Conference on Neural Information Processing Systems.
He, K., Zhang, X., Ren, S., Sun, J. (2016) Deep Residual Learning for Image Recognition. in Proceedings of the IEEE conference on computer vision and pattern recognition.
Lee, K.-H., Chen, X., Hua, G., Hu, H., He, X. (2018) Stacked Cross Attention for Image-Text Matching. in Proceedings of the European Conference on Computer Vision (ECCV).
Wang, Y., Yang, H., Qian, X., Ma, L., Lu, J., Li, B., Fan, X.: Position focused attention network for image-text matching. in Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence. (2019)
Harris, Z.S.: Distributional Structure. Word. 10(2–3), 146–162 (1954)
Ramos, J. (2003) Using Tf-Idf to Determine Word Relevance in Document Queries. in Proceedings of the First Instructional Conference on Machine Learning. Citeseer
Blei, D.M., Ng, A.Y., Jordan, M.I. (2003) Latent Dirichlet Allocation. Journal of Machine Learning Research. 3(Jan): p. 993–1022
Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean (2013) Efficient Estimation of Word Representations in Vector Space. in Proceeding of the International Conference on Learning Representations (ICLR) Workshop Track. Arizona, USA
Pennington, J., Socher, R., Manning, C.D. (2014) GloVe: Global Vectors for Word Representation. in Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP).
Devlin, J., Chang, M.-W., Lee, K., Toutanova, K. (2019) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. in Proceedings of NAACL-HLT.
Chung, J., Gulcehre, C., Cho, K., Bengio, Y. (2014) Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. in NIPS 2014 Workshop on Deep Learning, December 2014.
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Bengio, Y. (2014) Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. in Empirical Methods in Natural Language Processing (EMNLP).
Schuster, M., Paliwal, K.K.: Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 45(11), 2673–2681 (1997)
Graves, A., Schmidhuber, J.: Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 18(5–6), 602–610 (2005)
Ma, L., Lu, Z., Shang, L., Li, H. (2015) Multimodal Convolutional Neural Networks for Matching Image and Sentence. in Proceedings of the IEEE international conference on computer vision.
Wang, H., Ji, Z., Lin, Z., Pang, Y., Li, X.: Stacked squeeze-and-excitation recurrent residual network for visual-semantic matching. Pattern Recogn. 105, 107359 (2020)
Andrej Karpathy, Armand Joulin, Li Fei-Fei. (2014) Deep Fragment Embeddings for Bidirectional Image Sentence Mapping. in Proceedings of the 27th International Conference on Neural Information Processing Systems.
Huang, F., Zhang, X., Zhao, Z., Li, Z.: Bi-directional spatial-semantic attention networks for image-text matching. IEEE Trans. Image Process. 28(4), 2008–2020 (2018)
Wang, L., Li, Y., Huang, J., Lazebnik, S.: Learning two-branch neural networks for image-text matching tasks. IEEE Trans. Pattern Anal. Mach. Intell. 41(2), 394–407 (2018)
Li, Z., Ling, F., Zhang, C., Ma, H.: Combining global and local similarity for cross-media retrieval. IEEE Access. 8, 21847–21856 (2020)
Schroff, F., Kalenichenko, D., Philbin, J. (2015) FaceNet: A unified Embedding for Face Recognition and Clustering. in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition.
Faghri, F., Fleet, D.J., Kiros, J.R., Fidler, S. (2018) VSE++: Improving Visual-Semantic Embeddings with Hard Negatives. in Proceedings of the British Machine Vision Conference (BMVC).
Galen Andrew, Raman Arora, Jeff Bilmes, Livescu, K. (2013) Deep Canonical Correlation Analysis. in Proceeding of International conference on machine learning. PMLR
Yan, F. Mikolajczyk, K. (2015) Deep Correlation for Matching Images and Text. in Proceedings of the IEEE conference on computer vision and pattern recognition.
Shao, J., Wang, L., Zhao, Z., Cai, A.: Deep canonical correlation analysis with progressive and hypergraph learning for cross-modal retrieval. Neurocomputing 214, 618–628 (2016)
Hua, Y., Yang, Y., Du, J.: Deep multi-modal metric learning with multi-scale correlation for image-text retrieval. Electronics 9(3), 466 (2020)
Li, H.: A Short Introduction to Learning to Rank. IEICE Trans. Inf. Syst. 94(10), 1854–1862 (2011)
Wang, C., Yang, H., Meinel, C. (2015) Deep Semantic Mapping for Cross-Modal Retrieval. in IEEE 27th International Conference on Tools With Artificial Intelligence (ICTAI). IEEE
Wang, J., Wang, Y., Kang, C., Xiang, S., Pan, C. (2015) Image-Text Cross-Modal Retrieval via Modality-Specific Feature Learning. in Proceedings of the 5th ACM on International Conference on Multimedia Retrieval.
Wei, Y., Zhao, Y., Lu, C., Wei, S., Liu, L., Zhu, Z., Yan, S.: Cross-modal retrieval with cnn visual features: a new baseline. IEEE Transact. On Cybernet. 47(2), 449–460 (2016)
Karpathy, A. Fei-Fei, L. (2015) Deep Visual-Semantic Alignments for Generating Image Descriptions. in Proceedings of the IEEE conference on computer vision and pattern recognition.
Frome, A., Corrado, G., Shlens, J., Bengio, S., Dean, J., Ranzato, M.A., Mikolov, T.: Devise: a deep visual-semantic embedding model. Adv. Neural. Inf. Process. Syst. 26, 154–162 (2013)
Peng, Y., Huang, X., Qi, J. (2016) Cross-Media Shared Representation by Hierarchical Learning with Multiple Deep Networks. in IJCAI.
Mithun, N.C., Panda, R., Papalexakis, E.E., Roy-Chowdhury, A.K. (2018) Webly Supervised Joint Embedding for Cross-Modal Image-Text Retrieval. in Proceedings of the 26th ACM international conference on Multimedia.
Zhang, Y. Lu, H. (2018) Deep Cross-Modal Projection Learning for Image-Text Matching. in Proceedings of the European Conference on Computer Vision (ECCV).
Jian, Y., Xiao, J., Cao, Y., Khan, A., Zhu, J. (2019) Deep Pairwise Ranking with Multi-Label Information for Cross-Modal Retrieval. in IEEE International Conference on Multimedia and Expo (ICME). IEEE
Zhen, L., Hu, P., Wang, X., Peng, D.: Deep Supervised Cross-Modal Retrieval. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2019)
Liu, F. Ye, R. (2019) A Strong and Robust Baseline for Text-Image Matching. in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop.
Biten, A.F., Mafla, A., Gómez, L., Karatzas, D. (2022) Is An Image Worth Five Sentences? A New Look into Semantics for Image-Text Matching. in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision.
Chen, J., Zhang, L., Wang, Q., Bai, C., Kpalma, K. (2022) Intra-Modal Constraint Loss for Image-Text Retrieval. in 2022 IEEE International Conference on Image Processing (ICIP). IEEE
Wang, H., Zhang, Y., Ji, Z., Pang, Y., Ma, L. (2020) Consensus-Aware Visual-Semantic Embedding for Image-Text Matching. in European Conference on Computer Vision. Springer
Shi, B., Ji, L., Lu, P., Niu, Z., Duan, N. (2019) Knowledge Aware Semantic Concept Expansion for Image-Text Matching. in IJCAI.
Chunxiao Liu, Z.M., Wenyu Zang, Bin Wang. (2019) A Neighbor-aware Approach for Image-text Matching. in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE
Yang, C., Deng, Z., Li, T., Liu, H., Liu, L. (2021) Variational Deep Representation Learning for Cross-Modal Retrieval. in Chinese Conference on Pattern Recognition and Computer Vision (PRCV). Springer
Xu, Q., Li, M., Yu, M.: Learning to rank with relational graph and pointwise constraint for cross-modal retrieval. Soft. Comput. 23(19), 9413–9427 (2019)
Wang, T., Xu, X., Yang, Y., Hanjalic, A., Shen, H.T., Song, J. (2019) Matching Images and Text with Multi-modal Tensor Fusion and Re-ranking. in Proceedings of the 27th ACM international conference on multimedia.
Niu, K., Huang, Y., Wang, L.: Re-ranking image-text matching by adaptive metric fusion. Pattern Recogn. 104, 107351 (2020)
Luo, J., Shen, Y., Ao, X., Zhao, Z., Yang, M. (2019) Cross-modal Image-Text Retrieval with Multitask Learning. in Proceedings of the 28th ACM International Conference on Information and Knowledge Management.
Wang, Z., Liu, X., Li, H., Sheng, L., Yan, J., Wang, X., Shao, J. (2019) CAMP: Cross-Modal Adaptive Message Passing for Text-Image Retrieval. in Proceedings of the IEEE/CVF International Conference on Computer Vision.
Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., Zhang, L. (2018) Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Wu, Y., Wang, S., Song, G., Huang, Q. (2019) Learning Fragment Self-Attention Embeddings for Image-Text Matching. in Proceedings of the 27th ACM International Conference on Multimedia.
Ji, Z., Wang, H., Han, J., Pang, Y.: SMAN: Stacked Multimodal Attention Network for Cross-Modal Image-Text Retrieval. IEEE Transactions on Cybernetics (2020)
Diao, H., Zhang, Y., Ma, L., Lu, H. (2021) Similarity Reasoning and Filtration for Image-Text Matching. in Proceedings of the AAAI Conference on Artificial Intelligence.
Qi, X., Zhang, Y., Qi, J., Lu, H.: Self-attention guided representation learning for image-text matching. Neurocomputing 450, 143–155 (2021)
Nam, H., Ha, J.-W., Kim, J. (2017) Dual Attention Networks for Multimodal Reasoning and Matching. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Liu, C., Mao, Z., Liu, A.-A., Zhang, T., Wang, B., Zhang, Y. (2019) Focus Your Attention: A Bidirectional Focal Attention Network for Image-Text Matching. in Proceedings of the 27th ACM International Conference on Multimedia.
Zhang, H., Mao, Z., Zhang, K., Zhang, Y.: Show your faith: cross-modal confidence-aware network for image-text matching. AAAI (2022). https://doi.org/10.1609/aaai.v36i3.20235
Zhang, K., Mao, Z., Liu, A., Zhang, Y.: Unified Adaptive Relevance Distinguishable Attention Network for Image-Text Matching. IEEE Transactions on Multimedia (2022). https://doi.org/10.1109/TMM.2022.3141603
Liu, C., Mao, Z., Zhang, T., Xie, H., Wang, B., Zhang, Y. (2020) Graph Structured Network for Image-Text Matching. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Li, W.-H., Yang, S., Wang, Y., Song, D., Li, X.-Y.: Multi-level similarity learning for image-text retrieval. Inf. Process. Manage. 58(1), 102432 (2021)
Long, S., Han, S.C., Wan, X., Poon, J. (2022) GraDual: Graph-Based Dual-Modal Representation for Image-Text Matching. in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision.
Dong, X., Zhang, H., Zhu, L., Nie, L., Liu, L.: Hierarchical feature aggregation based on transformer for image-text matching. IEEE Transact. Circ. Syst. Vid. Technol (2022). https://doi.org/10.1109/TCSVT.2022.3164230
Frolov, S., Hinz, T., Raue, F., Hees, J., Dengel, A.: Adversarial text-to-image synthesis: a review. Neural Netw. 144, 187–209 (2021)
Li, Z., Hu, Y., He, R., Sun, Z.: Learning disentangling and fusing networks for face completion under structured occlusions. Pattern Recogn. 99, 107073 (2020)
Park, G., Im, W. (2018) Image-Text Multi-Modal Representation Learning by Adversarial Backpropagation. in Proceedings of the 40th European Conference on Information Retrieval Research (ECIR). France
Wang, B., Yang, Y., Xu, X., Hanjalic, A., Shen, H.T. (2017) Adversarial Cross-Modal Retrieval. in Proceedings of the 25th ACM international conference on Multimedia. USA
Sarafianos, N., Xu, X., Kakadiaris, I.A. (2019) Adversarial representation learning for text-to-image matching. in Proceedings of the IEEE International Conference on Computer Vision.
Gu, J., Cai, J., Joty, S., Niu, L., Wang, G. (2018) Look, Imagine And Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Zhu, B., Ngo, C.-W., Chen, J., Hao, Y. (2019) R2GAN: Cross-Modal Recipe Retrieval with Generative Adversarial Network. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Wang, H., Sahoo, D., Liu, C., Lim, E.-p., Hoi, S.C. (2019) Learning Cross-Modal Embeddings with Adversarial Networks for Cooking Recipes and Food Images. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Cornia, M., Baraldi, L., Tavakoli, H.R., Cucchiara, R. (2018) Towards Cycle-Consistent Models for Text and Image Retrieval. in Proceedings of the European Conference on Computer Vision (ECCV) Workshops.
Liu, Y., Guo, Y., Liu, L., Bakker, E.M., Lew, M.S.: CycleMatch: a cycle-consistent embedding network for image-text matching. Pattern Recogn. 93, 365–379 (2019)
Chen, H., Ding, G., Lin, Z., Zhao, S., Han, J. (2019) Cross-Modal Image-Text Retrieval with Semantic Consistency. in Proceedings of the 27th ACM International Conference on Multimedia.
Xu, X., Tian, J., Lin, K., Lu, H., Shao, J., Shen, H.T. (2021) Zero-shot Cross-modal Retrieval by Assembling AutoEncoder and Generative Adversarial Network. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM). 17(1s): p. 1–17
Huang, Y., Wang, L. (2019) ACMM: Aligned Cross-Modal Memory for Few-Shot Image and Sentence Matching. in Proceedings of the IEEE/CVF International Conference on Computer Vision.
Yuan, X., Wang, G., Chen, Z., Zhong, F.: CHOP: an orthogonal hashing method for zero-shot cross-modal retrieval. Pattern Recogn. Lett. 145, 247–253 (2021)
Ji, Z., Sun, Y., Yu, Y., Pang, Y., Han, J.: Attribute-guided network for cross-modal zero-shot hashing. IEEE Transact. Neural Net. Learning Syst. 31(1), 321–330 (2019)
Chakraborty, B., Wang, P., Wang, L. (2021) Inter-Modality Fusion Based Attention for Zero-Shot Cross-Modal Retrieval. in 2021 IEEE International Conference on Image Processing (ICIP). IEEE
Wu, L., Wang, Y., Shao, L.: Cycle-consistent deep generative hashing for cross-modal retrieval. IEEE Trans. Image Process. 28(4), 1602–1612 (2018)
Xu, X., Lin, K., Yang, Y., Hanjalic, A., Shen, H.T.: Joint feature synthesis and embedding: adversarial cross-modal retrieval revisited. IEEE Transact. Pattern Anal. Machine Intell. 44(6), 3030–3047 (2020)
Xu, X., Lin, K., Lu, H., Gao, L., Shen, H.T. (2020) Correlated Features Synthesis and Alignment for Zero-Shot Cross-Modal Retrieval. in Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval.
Lin, K., Xu, X., Gao, L., Wang, Z., Shen, H.T. (2020) Learning Cross-Aligned Latent Embeddings for Zero-Shot Cross-Modal Retrieval. in Proceedings of the AAAI Conference on Artificial Intelligence.
Xu, X., Lu, H., Song, J., Yang, Y., Shen, H.T., Li, X.: Ternary adversarial networks with self-supervision for zero-shot cross-modal retrieval. IEEE Transact. Cybernetics. 50(6), 2400–2413 (2019)
Huang, Y., Long, Y., Wang, L. (2019) Few-shot image and sentence matching via gated visual-semantic embedding. in Proceedings of the AAAI Conference on Artificial Intelligence.
Wei, K., Zhou, Z.: Adversarial attentive multi-modal embedding learning for image-text matching. IEEE Access. 8, 96237–96248 (2020)
Ma, L., Jiang, W., Jie, Z., Wang, X.: Bidirectional image-sentence retrieval by local and global deep matching. Neurocomputing 345, 36–44 (2019)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I. (2017) Attention is All You Need. Advances in neural information processing systems. 30
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A.: Language models are few-shot learners. Adv. Neural. Inf. Process. Syst. 33, 1877–1901 (2020)
Lu, J., Batra, D., Parikh, D., Lee, S. (2019) ViLBERT: Pretraining Task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems. 32
Su, W., Zhu, X., Cao, Y., Li, B., Lu, L., Wei, F., Dai, J. (2020) VL-BERT: Pre-training of Generic Visual-Linguistic Representations. in International Conference on Learning Representations.
Li, X., Yin, X., Li, C., Zhang, P., Hu, X., Zhang, L., Wang, L., Hu, H., Dong, L., Wei, F. (2020) OSCAR: Object-Semantics Aligned Pre-training for Vision-Language Tasks. in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16. Springer
Tan, H., Bansal, M. (2019) LXMERT: Learning Cross-Modality Encoder Representations from Transformers. arXiv preprint arXiv:1908.07490
Li, J., Selvaraju, R., Gotmare, A., Joty, S., Xiong, C., Hoi, S.C.H.: Align before fuse: vision and language representation learning with momentum distillation. Adv. Neural. Inf. Process. Syst. 34, 9694–9705 (2021)
Li, J., Li, D., Xiong, C., Hoi, S. (2022) BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. in International Conference on Machine Learning. PMLR
Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H., Le, Q., Sung, Y.-H., Li, Z., Duerig, T. (2021) Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. in International Conference on Machine Learning. PMLR
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J. (2021) Learning Transferable Visual Models From Natural Language Supervision. in International conference on machine learning. PMLR
Huang, Y., Wang, Y., Zeng, Y., Wang, L.: MACK: Multimodal aligned conceptual knowledge for unpaired image-text matching. Adv. Neural Info. Proc. Syst. 35, 7892–7904 (2022)
Singh, A., Hu, R., Goswami, V., Couairon, G., Galuba, W., Rohrbach, M., Kiela, D. (2022) FLAVA: A Foundational Language and Vision Alignment Model. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Bao, H., Wang, W., Dong, L., Liu, Q., Mohammed, O.K., Aggarwal, K., Som, S., Piao, S., Wei, F.: VLMO: unified vision-language pre-training with mixture-of-modality-experts. Adv. Neural. Inf. Process. Syst. 35, 32897–32912 (2022)
Li, L.H., Zhang, P., Zhang, H., Yang, J., Li, C., Zhong, Y., Wang, L., Yuan, L., Zhang, L., Hwang, J.-N. (2022) Grounded Language-Image Pre-Training. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Gan, Z., Chen, Y.-C., Li, L., Zhu, C., Cheng, Y., Liu, J.: Large-scale adversarial training for vision-and-language representation learning. Adv. Neural. Inf. Process. Syst. 33, 6616–6628 (2020)
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S. (2020) An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. in International Conference on Learning Representations.
Li, Y., Liang, F., Zhao, L., Cui, Y., Ouyang, W., Shao, J., Yu, F., Yan, J. (2021) Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-Training Paradigm. arXiv preprint arXiv:2110.05208
Jiang, K., He, X., Xu, R., Wang, X.E. (2022) ComCLIP: Training-Free Compositional Image and Text Matching. arXiv preprint arXiv:2211.13854
Shao, S., Li, Z., Zhang, T., Peng, C., Yu, G., Zhang, X., Li, J., Sun, J. (2019) Objects365: A Large-Scale, High-Quality Dataset for Object Detection. in Proceedings of the IEEE/CVF international conference on computer vision.
Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision 60(2), 91–110 (2004)
Le, Q., Mikolov, T. (2014) Distributed Representations of Sentences and Documents. in International Conference on Machine Learning. PMLR
Rasiwasia, N., Costa Pereira, J., Coviello, E., Doyle, G., Lanckriet, G.R., Levy, R., Vasconcelos, N. (2010) A new approach to cross-modal multimedia retrieval. in Proceedings of the 18th ACM international conference on Multimedia.
Rashtchian, C., Young, P., Hodosh, M., Hockenmaier, J. (2010) Collecting Image Annotations Using Amazon’s Mechanical Turk. in Proceedings of the NAACL HLT 2010 workshop on creating speech and language data with Amazon’s Mechanical Turk.
Everingham, M., Eslami, S., Gool, L.V., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes challenge: a retrospective. Int. J. Comput. Vision 111(1), 98–136 (2015)
Hodosh, M., Young, P., Hockenmaier, J.: Framing image description as a ranking task: Data, models and evaluation metrics. J Artificial Intell Res. 47, 853–899 (2013)
Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Transact Assoc Comput Linguistics. 2, 67–78 (2014)
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common Objects in Context. in 13th European Conference on Computer Vision (ECCV 2014). Zurich, Switzerland (Sept. 6–12, 2014)
Mafla, A., Rezende, R.S., Gomez, L., Larlus, D., Karatzas, D. (2021) StacMR: Scene-Text Aware Cross-Modal Retrieval. in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision.
Serrà, J., Karatzoglou, A. (2017) Getting Deep Recommenders Fit: Bloom Embeddings for Sparse Binary Input/Output Networks. in Proceedings of the Eleventh ACM Conference on Recommender Systems.
Bai, Z., Li, Y., Woźniak, M., Zhou, M., Li, D.: DecomVQANet: Decomposing visual question answering deep network via tensor decomposition and regression. Pattern Recognition. 110, 107538 (2021)
Acknowledgements
I would like to thank my supervisors, Dr. Magda M. Madbouly and Prof. Adel A. El-Zoghabi, for guiding me and revising this paper.
Funding
Open access funding provided by The Science, Technology & Innovation Funding Authority (STDF) in cooperation with The Egyptian Knowledge Bank (EKB). Not applicable.
Author information
Contributions
DBE had the idea for this article and performed the literature search and data analysis; MMM and AAEZ revised this work.
Ethics declarations
Conflict of Interest
We declare that we do not have any conflicts of interest to disclose.
Ethical Approval
Not applicable.
Consent for Publication
All the authors agreed to publish the article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Ebaid, D.B., Madbouly, M.M. & El-Zoghabi, A.A. Bi-directional Image–Text Matching Deep Learning-Based Approaches: Concepts, Methodologies, Benchmarks and Challenges. Int J Comput Intell Syst 16, 81 (2023). https://doi.org/10.1007/s44196-023-00260-3
DOI: https://doi.org/10.1007/s44196-023-00260-3