BiCro: Noisy Correspondence Rectification for Multi-modality Data via Bi-directional Cross-modal Similarity Consistency
[Figure 2 omitted; one of the noisy captions shown reads "Soccer player is beaten to the ball by soccer player."]
Figure 2. Some noisy data pairs in the Flickr30K [53] and Conceptual Captions [36] datasets. The first row shows our estimated soft correspondence labels for the image-text pairs illustrated in the second and third rows. We highlight the matched words in green and the mismatched words in red.
grained detail is captured by mapping semantically similar items. For example, [5, 16] adopt attention mechanisms to explore semantic region-word correspondences, and [6, 23] infer relation-aware similarities with both local and global alignments via graph convolutional networks. However, all these alignment methods assume that the training data pairs are perfectly aligned, an assumption that is impossible to satisfy in practice due to the high collection and annotation cost.

2.2. Learning with Noisy Labels

Supervised training of deep learning models requires precisely labeled datasets, and noisy labels can significantly degrade the generalization of models. Therefore, various methods have been proposed to improve robustness against noisy labels. Typical algorithms for combating noisy labels include adding regularization [9, 10, 22, 24, 40, 41, 44], designing robust loss functions [26, 27, 30, 45, 47-49], selecting possibly clean samples [11, 13, 29, 34, 46, 52, 54], and correcting the labels [28, 38, 42, 50, 51]. We mainly introduce the label correction approaches, which are most related to this work. Specifically, label correction algorithms aim to identify and correct suspicious labels in an iterative framework. Bootstrapping [33] combines the noisy label and the DNN prediction into a soft pseudo-label that replaces the original label. Dynamic bootstrapping [2] further uses a Beta-Mixture-Model to combine the labels dynamically for every sample. Differently, Joint Optimization [39] first trains the model with a large learning rate for several epochs and then generates pseudo-labels by averaging the predictions of those epochs. Recently, SELFIE [37] and AdaCorr [59] selectively combine the labels of noisy data that correspond to the true labels with high probability. However, almost all existing noisy label studies focus on the classification task and cannot be directly adapted to cross-modal matching, where the noisy label refers to alignment errors between data pairs rather than categorical labeling errors. The label of cross-modal matching can be considered as y_i ∈ {0, 1}, where a data pair is positively correlated (y_i = 1) or not (y_i = 0). Therefore, label correction of noisy multi-modal data pairs can be translated into generating a continuous value between 0 and 1, called a soft correspondence label in this paper, to measure the true degree of correspondence. In this view, NCR [12] and DECL [31] both improve their noise-robustness by correcting the soft correspondence label: NCR [12] recasts the network's prediction as the estimated soft correspondence label, and DECL [31] models the uncertainty of cross-modal correspondence to predict the correct correspondence of paired data. However, these methods rely on the network's own predictions, which causes severe confirmation bias: confident but wrong predictions are used to guide subsequent training, leading to a loop of self-reinforcing errors [4]. Differently, we rectify the noisy correspondence labels by the Bidirectional Cross-modal similarity consistency (BiCro) that is inherently contained in the paired data. Our soft correspondence labels are generated by considering the characteristics of the multi-modal data itself, which avoids the confirmation bias problem of previous methods.

3. Methodology

3.1. Problem Definition

We first introduce the cross-modal matching task by taking image-text matching as an example; then we present the noisy correspondence problem in cross-modal matching.

Traditionally, we are given a dataset D = {(I_i, T_i, y_i)}_{i=1}^{N}, where N indicates the number of training samples, (I_i, T_i) is an image-text pair, and y_i ∈ {0, 1} is the label. The binary label y_i is a hard correspondence score indicating whether the pair (I_i, T_i) is positively correlated (y_i = 1) or not (y_i = 0). The aim of cross-modal matching is to project the two modalities (visual and textual in our case) into a shared feature space wherein positive data pairs have higher feature similarities and negative data pairs have lower feature similarities. Generally, the similarity of a given image-text pair can be computed by S(f(I), g(T)), where S is a similarity measure and f and g are two modal-specific feature extractors. In the following, we denote S(f(I), g(T)) as S(I, T) for notational simplicity. The feature extractors f and g can be learned by minimizing the following triplet loss:
\mathcal{L}_{hard}(I_i, T_i) = [\alpha - S(I_i, T_i) + S(I_i, \hat{T}_h)]_+ + [\alpha - S(I_i, T_i) + S(\hat{I}_h, T_i)]_+    (1)

[Figure 3 omitted: the overall training pipeline, showing the warmup stage, anchor point selection, and the two co-trained models A and B. Caption fragment: "... self-sample-selection error accumulation. At the inference stage, we average the predictions of A and B."]

where [x]_+ = max(x, 0), α is a margin parameter, and T̂_h and Î_h are hard negative samples. In practice, however, an unknown portion of the data in a noisily-collected dataset D̃ = {(I_i, T_i, ỹ_i)}_{i=1}^{N} is mis-labeled: some data pairs (I, T) are mis-matched or only weakly-matched but are wrongly labeled as ỹ = 1. Minimizing Eq. 1 on D̃ would result in a poorly-generalized cross-modal matching model, since the model would overfit to the noisy dataset and pull those negative samples close.
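To make Eq. 1 concrete, here is a minimal PyTorch-style sketch that mines the hardest in-batch negatives, in the spirit of VSE++ [8]; the function name, the in-batch mining, and the assumption of L2-normalized features are ours, not the authors' released code.

```python
import torch

def hard_triplet_loss(img_emb, txt_emb, alpha=0.2):
    # S(I, T) for all pairs in the batch: a (B, B) similarity matrix
    # whose diagonal holds the positive-pair scores.
    scores = img_emb @ txt_emb.t()
    pos = scores.diag().view(-1, 1)                      # S(I_i, T_i)

    # Mask the diagonal so a pair cannot serve as its own negative.
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    neg = scores.masked_fill(mask, -float('inf'))

    s_i_that = neg.max(dim=1).values.view(-1, 1)         # S(I_i, T̂_h): hardest negative text
    s_ihat_t = neg.max(dim=0).values.view(-1, 1)         # S(Î_h, T_i): hardest negative image

    loss = (alpha - pos + s_i_that).clamp(min=0) \
         + (alpha - pos + s_ihat_t).clamp(min=0)         # [x]_+ = max(x, 0)
    return loss.squeeze(1)                               # per-sample loss ℓ_i
```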
3.2. Robust Matching Loss by Softening the Correspondence Label

As discussed in Section 3.1, in a noisily-labeled dataset D̃ the hard correspondence labels ỹ are unreliable and cannot accurately reflect the degree of correlation between the two modalities. Therefore, the noisy labels ỹ need to be rectified into a more accurate estimate of the soft correspondence score y* ∈ [0, 1] between the modalities I and T. The rectified soft labels y* are expected to well depict the correspondence degree between I and T (i.e., y* grows from 0 to 1 as the correlation of I and T increases). The rectified soft labels y* can then be recast as a soft margin for learning the shared feature space:

\mathcal{L}_{soft}(I_i, T_i) = [\hat{\alpha}_i - S(I_i, T_i) + S(I_i, \hat{T}_h)]_+ + [\hat{\alpha}_i - S(I_i, T_i) + S(\hat{I}_h, T_i)]_+    (2)

where α̂_i is the soft margin determined by y*_i, i.e., \hat{\alpha}_i = \frac{m^{y_i^*} - 1}{m - 1}\,\alpha, m is a hyper-parameter [12], and Î_h and T̂_h are hard negative samples.
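The soft margin is straightforward to compute; the following sketch (with a placeholder value for the hyper-parameter m) illustrates how α̂_i interpolates between a zero margin at y* = 0 and the full margin α at y* = 1, so that mismatched pairs are no longer pulled together.

```python
import torch

def soft_margin(y_soft, alpha=0.2, m=10.0):
    # y_soft: tensor of estimated soft correspondence labels in [0, 1].
    # Returns α̂_i = (m^{y*_i} − 1) / (m − 1) · α per sample.
    return (m ** y_soft - 1.0) / (m - 1.0) * alpha

# L_soft is then Eq. 1 with the per-sample margin α̂_i substituted
# for the fixed margin α.
```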
Now the most challenging problem is: how do we estimate accurate soft correspondence labels y* using only the hard-labeled noisy dataset D̃ = {(I_i, T_i, ỹ_i)}_{i=1}^{N}? NCR [12] proposes to leverage the network predictions to assign pseudo-labels for every data pair. However, this causes severe confirmation bias: confident but wrong predictions are used to guide subsequent training, leading to a loop of self-reinforcing errors [4]. In this work, we propose to estimate the correspondence labels by leveraging the inherent similarity relationships between the two modalities, which gets rid of the confirmation bias problem.

3.3. Soft Correspondence Label Estimation

The key idea of our method is the assumption that similar images should have similar textual descriptions and vice versa. Assuming we have collected some clean positive data pairs (whose soft correspondence labels y* are 1), we can use these collected anchor points to infer the soft correspondence labels of the remaining noisy data pairs, based on the aforementioned assumption. We first show how to collect the anchor points out of a given noisy dataset; we then present how to infer the soft correspondence labels of the noisy data pairs by using the anchor points.

3.3.1 Anchor Point Selection by Modeling the Per-sample Loss Value Distribution

We aim to identify clean samples in a noisy dataset D̃ to serve as anchor points. The memorization effect of deep neural networks [56] reveals that DNNs first memorize training data with clean labels and only later those with noisy labels. This phenomenon indicates that, during the early epochs of training, noisy examples have higher loss values while clean ones have lower loss values. Therefore, given a matching model (f, g, S), we first compute the per-sample loss:

\ell_{(f,g,S)} = \{\ell_i\}_{i=1}^{N} = \{\mathcal{L}_{hard}(I_i, T_i)\}_{i=1}^{N}    (3)

Then, we can utilize the difference in the per-sample loss value distribution to identify clean data pairs. NCR [12] adopts a two-component Gaussian Mixture Model to fit the per-sample loss distribution:

p(\ell) = \sum_{k=1}^{K} \lambda_k\, p(\ell \mid k)    (4)

where k = 0/1 denotes that the data pair (I_i, T_i, y_i) is clean/noisy. [Figure omitted: empirical pdfs of the per-sample loss under noise rates 0.2 and 0.4.] In our method, the per-sample loss distribution is instead fitted with a two-component Beta-Mixture-Model (BMM; see Sections 4.3.1 and 4.5 for comparisons against the GMM), and the posterior probability p(k = 0 | ℓ_i) that a pair is clean (Eq. 6) is used to select a set of anchor points D̃_c = {I_c, T_c, y_c = 1} from the noisy dataset D̃:

\tilde{\mathcal{D}}_c = \{(I_i, T_i, y_i = 1) \mid p(k = 0 \mid \ell_i) > \delta,\ \forall (I_i, T_i) \in \tilde{\mathcal{D}}\}    (7)
where δ is a threshold. Since the remaining data D̃_n = D̃ \ D̃_c are possibly noisy, we drop their labels:

\tilde{\mathcal{D}}_n = \{(I_i, T_i) \mid p(k = 0 \mid \ell_i) \le \delta,\ \forall (I_i, T_i) \in \tilde{\mathcal{D}}\}    (8)

In Section 3.3.2, we show how to estimate soft correspondence labels for D̃_n by using the anchor point samples in D̃_c. The collected clean data and the noisy data with their estimated soft labels can then be used together for model training.
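As an illustration of Eqs. 3, 4, 7, and 8, the sketch below fits the two-component Gaussian Mixture Model used by NCR [12] with scikit-learn and thresholds the clean posterior at δ; BiCro replaces the GMM with a two-component Beta-Mixture-Model fitted in the same spirit (Eqs. 5-6), which we only indicate here. The function and variable names are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_anchors(losses, delta=0.5):
    # losses: per-sample values ℓ_i of L_hard (Eq. 3), shape (N,)
    losses = losses.reshape(-1, 1)
    gmm = GaussianMixture(n_components=2).fit(losses)    # Eq. 4 with K = 2
    clean_comp = int(np.argmin(gmm.means_))              # clean data <-> lower mean loss
    p_clean = gmm.predict_proba(losses)[:, clean_comp]   # posterior p(k = 0 | ℓ_i)
    anchor_idx = np.where(p_clean > delta)[0]            # D̃_c, Eq. 7 (labels kept as y = 1)
    noisy_idx = np.where(p_clean <= delta)[0]            # D̃_n, Eq. 8 (labels dropped)
    return anchor_idx, noisy_idx
```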
However, training a model with high-confidence (low-loss) examples selected by the model itself causes a severe error accumulation problem, which is widely acknowledged in noisy label learning [11, 19, 54]. Similar to NCR [12], we adopt the co-teaching [11] paradigm to alleviate this problem. Specifically, we simultaneously train two networks A = {f^A, g^A, S^A} and B = {f^B, g^B, S^B} with the same architecture but different initializations and data sequences. At each training epoch, the data division for each network is produced by its peer network, as detailed in Algorithm 1.
Algorithm 1: The training pipeline of our robust cross-modal matching framework.
Input: a noisily-labeled dataset D̃
Required: the clean-probability threshold δ; two individual matching models A and B with different initializations and data batch sequences
1:  Warm up the models (A, B) using L_hard (Eq. 1).
2:  for i = 1 : num_epochs do
3:      // Section 3.3.1: model the per-sample loss distribution using Eq. 6
4:      P^A = {p^A_i | p^A_i = p(k = 0 | ℓ_i)}_{i=1}^{N} ← BetaMixtureModel(D̃, B)
5:      P^B = {p^B_i | p^B_i = p(k = 0 | ℓ_i)}_{i=1}^{N} ← BetaMixtureModel(D̃, A)
6:      for k ∈ {A, B} do
7:          D̃^k_c = {(I_i, T_i, y_i = 1) | p^k_i > δ, ∀(I_i, T_i) ∈ D̃}    // Section 3.3.1: anchor point selection (Eq. 7)
8:          D̃^k_n = {(I_i, T_i) | p^k_i ≤ δ, ∀(I_i, T_i) ∈ D̃}    // Section 3.3.1: noisy data selection (Eq. 8)
9:          for j = 1 : num_steps do
10:             Sample a mini-batch {B^c_j = (I_c, T_c, y_c = 1), B^n_j = (I_n, T_n)} from {D̃^k_c, D̃^k_n}
11:             // Section 3.3.2: estimate soft labels by bidirectional cross-modal similarity consistency
12:             Keep the labels in B^c_j; estimate soft correspondence labels y* for the noisy data B^n_j using Eq. 11
13:             // Section 3.2: optimize the soft matching loss with the estimated soft correspondence labels
14:             Train network k on {B^c_j = (I_c, T_c, y_c = 1), B^n_j = (I_n, T_n, y_n = y*)} by minimizing Eq. 2
Output: matching models (A, B)
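A heavily condensed sketch of one epoch of Algorithm 1 follows. The four helper callables are hypothetical stand-ins for the pieces formalized above (Eqs. 3, 6, 11, and 2, respectively); this is an organizational sketch, not the authors' released implementation.

```python
def train_one_epoch(model_a, model_b, dataset, delta,
                    per_sample_loss, fit_clean_posterior,
                    estimate_soft_labels, train_step):
    for model, peer in ((model_a, model_b), (model_b, model_a)):
        # Co-teaching: the peer network divides the data for `model`,
        # alleviating self-sample-selection error accumulation [11].
        losses = per_sample_loss(peer, dataset)               # Eq. 3, computed with the peer
        p_clean = fit_clean_posterior(losses)                 # p(k = 0 | ℓ_i), Eq. 6
        anchors = [s for s, p in zip(dataset, p_clean) if p > delta]   # D̃_c, Eq. 7
        noisy   = [s for s, p in zip(dataset, p_clean) if p <= delta]  # D̃_n, Eq. 8
        soft = estimate_soft_labels(model, noisy, anchors)    # y*, Eq. 11
        train_step(model, anchors, noisy, soft)               # minimize Eq. 2 on both subsets
```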
3.3.2 Soft Correspondence Label Estimation by Cross-modal Similarity Consistency

In this section, we show how to estimate accurate correspondence labels for D̃_n by using the anchor points D̃_c collected by Eq. 7. The key idea of our method rests on the rational assumption that similar images should have similar textual descriptions and vice versa. In other words, in an image-text shared feature space, if two images I_1 and I_2 are very close but their corresponding texts T_1 and T_2 are far away from each other, we can tell that at least one of the two data pairs (i.e., (I_1, T_1) or (I_2, T_2)) is mislabeled. Furthermore, if we assume that (I_1, T_1) is clean (well-matched), then the gap between the image feature distance D(f(I_2), f(I_1)) and the text feature distance D(g(T_2), g(T_1)) reflects, to some extent, the degree of correlation between I_2 and T_2, where D(·, ·) is a distance function in the feature space; in the following we write D(f(I_i), f(I_j)) as D(I_i, I_j) and D(g(T_i), g(T_j)) as D(T_i, T_j). That is to say, the more consistent the distances between (I_2, I_1) and (T_2, T_1) are, the more correlated I_2 and T_2 are.

Formally, for the i-th noisy data pair (I^i_n, T^i_n) in D̃_n, we first search for its closest image I^△_c in the collected anchor point set D̃_c, and then compute its image2text similarity consistency by comparing the image feature distance D(I^i_n, I^△_c) with the corresponding text feature distance D(T^i_n, T^△_c):

C_{i2t} = \frac{D(I_n^i, I_c^\triangle)}{D(T_n^i, T_c^\triangle)}, \quad (I_n^i, T_n^i) \in \tilde{\mathcal{D}}_n,\ (I_c^\triangle, T_c^\triangle) \in \tilde{\mathcal{D}}_c    (9)

Similarly, we can compute its text2image similarity consistency:

C_{t2i} = \frac{D(T_n^i, T_c^\diamond)}{D(I_n^i, I_c^\diamond)}, \quad (I_n^i, T_n^i) \in \tilde{\mathcal{D}}_n,\ (I_c^\diamond, T_c^\diamond) \in \tilde{\mathcal{D}}_c    (10)

where T^♢_c is the closest text feature to T^i_n in the collected anchor point set D̃_c. The estimated soft correspondence label y*_i of (I^i_n, T^i_n) is finally given by the bidirectional cross-modal similarity consistency:

y_i^* = \frac{1}{2}\left(\frac{D(I_n^i, I_c^\triangle)}{D(T_n^i, T_c^\triangle)} + \frac{D(T_n^i, T_c^\diamond)}{D(I_n^i, I_c^\diamond)}\right)    (11)

The noisy data with the estimated soft correspondence labels, D̃_n = {I_n, T_n, y_n = y*}, and the clean data, D̃_c = {I_c, T_c, y_c = 1}, can then be combined to train the matching model by minimizing the soft triplet loss in Eq. 2. The detailed training pipeline is given in Algorithm 1.
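A minimal NumPy sketch of Eqs. 9-11 follows. It retrieves the anchor with the closest image (I^△_c, T^△_c) and the anchor with the closest text (I^♢_c, T^♢_c), then averages the two distance ratios; the Euclidean distance, the eps guard, and the final clipping to [0, 1] are our assumptions for a self-contained example.

```python
import numpy as np

def bicro_soft_label(img_n, txt_n, anchor_imgs, anchor_txts, eps=1e-8):
    # img_n, txt_n: features of one noisy pair, shape (d,)
    # anchor_imgs, anchor_txts: row-aligned anchor pair features, shape (M, d)
    d_img = np.linalg.norm(anchor_imgs - img_n, axis=1)   # D(I_n, I_c) for every anchor
    d_txt = np.linalg.norm(anchor_txts - txt_n, axis=1)   # D(T_n, T_c) for every anchor

    tri = int(np.argmin(d_img))   # anchor with the closest image: (I△_c, T△_c)
    dia = int(np.argmin(d_txt))   # anchor with the closest text:  (I♢_c, T♢_c)

    c_i2t = d_img[tri] / (d_txt[tri] + eps)   # Eq. 9: image2text consistency
    c_t2i = d_txt[dia] / (d_img[dia] + eps)   # Eq. 10: text2image consistency

    # Eq. 11; clipping to [0, 1] is our assumption to keep y* a valid soft label.
    return float(np.clip(0.5 * (c_i2t + c_t2i), 0.0, 1.0))
```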
4. Experiments

In this section, we evaluate the effectiveness of our proposed method, BiCro, on three image-text matching datasets: Flickr30K [53], MS-COCO [25], and Conceptual Captions [36]. Flickr30K and MS-COCO are two well-annotated datasets, so we randomly corrupt a specific percentage (i.e., the noise ratio) of their image-text pairs to simulate the noisy correspondence issue. Conceptual Captions [36] comes with real noisy correspondence from the wild.

4.1. Datasets and Evaluation Metrics

The following three widely-used image-text matching datasets are used to evaluate our method and the baselines:
Flickr30K [53] contains 31,000 images collected from the Flickr website, each associated with five captions. We use 1,000 images for model validation, 1,000 images for model testing, and 29,000 for model training.
MS-COCO [25] has 123,287 images with 5 captions each. Among them, 5,000 images are used for model validation, 5,000 for model testing, and 113,287 for model training.
Conceptual Captions [36] is a large-scale image-text dataset with a real-world noisy correspondence problem. It contains 3.3M images, each associated with one caption. All data pairs in Conceptual Captions are automatically harvested from the Internet; as a result, about 3%∼20% of the image-text pairs in the dataset are mismatched or weakly-matched [36]. Following NCR [12], we use a subset of Conceptual Captions, CC152K, in our experiments: 150,000 images are used for model training, 1,000 for model validation, and 1,000 for model testing.
Table 1. Image-Text Retrieval on Flickr30K and MS-COCO 1K.
Flickr30K MS-COCO
Image→Text Text→Image Image→Text Text→Image
Noise Methods R@1 R@5 R@10 R@1 R@5 R@10 Sum R@1 R@5 R@10 R@1 R@5 R@10 Sum
SCAN 58.5 81.0 90.8 35.5 65.0 75.2 406.0 62.2 90.0 96.1 46.2 80.8 89.2 464.5
VSRN 33.4 59.5 71.3 25.0 47.6 58.6 295.4 61.8 87.3 92.9 50.0 80.3 88.3 460.6
IMRAM 22.7 54.0 67.8 16.6 41.8 54.1 257.0 69.9 93.6 97.4 55.9 84.4 89.6 490.8
SAF 62.8 88.7 93.9 49.7 73.6 78.0 446.7 71.5 94.0 97.5 57.8 86.4 91.9 499.1
20% SGR 55.9 81.5 88.9 40.2 66.8 75.3 408.6 25.7 58.8 75.1 23.5 58.9 75.1 317.1
NCR 73.5 93.2 96.6 56.9 82.4 88.5 491.1 76.6 95.6 98.2 60.8 88.8 95.0 515.0
DECL 77.5 93.8 97.0 56.1 81.8 88.5 494.7 77.5 95.9 98.4 61.7 89.3 95.4 518.2
BiCro 78.3 94.1 97.3 60.0 83.7 89.5 502.9 78.2 95.9 98.4 62.5 89.8 95.5 520.3
BiCro* 78.1 94.4 97.5 60.4 84.4 89.9 504.7 78.8 96.1 98.6 63.7 90.3 95.7 523.2
SCAN 26.0 57.4 71.8 17.8 40.5 51.4 264.9 42.9 74.6 85.1 24.2 52.6 63.8 343.2
VSRN 2.6 10.3 14.8 3.0 9.3 15.0 55.0 29.8 62.1 76.6 17.1 46.1 60.3 292.0
IMRAM 5.3 25.4 37.6 5.0 13.5 19.6 106.4 51.8 82.4 90.9 38.4 70.3 78.9 412.7
SAF 7.4 19.6 26.7 4.4 12.2 17.0 87.3 13.5 43.8 48.2 16.0 39.0 50.8 211.3
40% SGR 4.1 16.6 24.1 4.1 13.2 19.7 81.8 1.3 3.7 6.3 0.5 2.5 4.1 18.4
NCR 68.1 89.6 94.8 51.4 78.4 84.8 467.1 74.7 94.6 98.0 59.6 88.1 94.7 509.7
DECL 72.7 92.3 95.4 53.4 79.4 86.4 479.6 75.6 95.5 98.3 59.5 88.3 94.8 512.0
BiCro 73.6 93.0 96.4 56.0 80.8 87.4 487.2 76.4 95.2 98.6 61.5 89.4 95.5 516.6
BiCro* 74.6 92.7 96.2 55.5 81.1 87.4 487.5 77.0 95.9 98.3 61.8 89.2 94.9 517.1
SCAN 13.6 36.5 50.3 4.8 13.6 19.8 138.6 29.9 60.9 74.8 0.9 2.4 4.1 173.0
VSRN 0.8 2.5 5.3 1.2 4.2 6.9 20.9 11.6 34.0 47.5 4.6 16.4 25.9 140.0
IMRAM 1.5 8.9 17.4 1.9 5.0 7.8 42.5 18.2 51.6 68.0 17.9 43.6 54.6 253.9
SAF 0.1 1.5 2.8 0.4 1.2 2.3 8.3 0.1 0.5 0.7 0.8 3.5 6.3 11.9
60% SGR 1.5 6.6 9.6 0.3 2.3 4.2 24.5 0.1 0.6 1.0 0.1 0.5 1.1 3.4
NCR 13.9 37.7 50.5 11.0 30.1 41.4 184.6 0.1 0.3 0.4 0.1 0.5 1.0 2.4
DECL 65.2 88.4 94.0 46.8 74.0 82.2 450.6 73.0 94.2 97.9 57.0 86.6 93.8 502.5
BiCro 68.3 90.4 93.8 51.9 76.9 84.4 465.7 73.9 94.7 97.7 58.7 87.0 93.8 505.8
BiCro* 67.6 90.8 94.4 51.2 77.6 84.7 466.3 73.9 94.4 97.8 58.3 87.2 93.9 505.5
4.2. Implementation Details

As a general framework, BiCro can be applied to many existing cross-modal matching methods. Like NCR [12] and DECL [31], we implement BiCro on top of SGRAF [7], whose performance is state-of-the-art in image-text matching. We follow the same training settings (e.g., optimizer, network architecture, and all hyperparameters) as previous works [12, 31] to make a fair comparison; please refer to NCR [12] for more training details.

We first warm up the matching models A and B for 10 epochs so that they reach an initial convergence. To reduce the effect of noisy data pairs, we select a small portion of data with small loss values in each batch for warmup, based on a predefined warmup selection ratio ϵ; the choice of ϵ is discussed in Section 4.4. At the training stage, the total number of iterations is 40 epochs: the first 20 epochs are trained with the clean samples screened by the BMM, and the next 20 epochs are trained on all samples (with our estimated soft labels). At each training epoch, we select the 10% of data pairs with the highest probability of being clean in Eq. 7 as the anchor points; the remaining data are regarded as noisy. At the inference stage, we average the similarities predicted by networks A and B for the retrieval evaluation. We propose two strategies to handle the noisy data pairs: (1) feed all of the data pairs with their estimated soft labels to the matching model (denoted as BiCro in Table 1 and Table 2); (2) treat those data pairs whose estimated soft labels fall below a threshold (the mismatch threshold θ) as mismatched data and set their correspondence labels to zero (denoted as BiCro* in Table 1 and Table 2). The effect of the mismatch threshold θ is discussed in Section 4.4.
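The two labeling strategies can be summarized in a few lines; the snippet below shows the BiCro* variant, with the θ value only as a placeholder (its effect is analyzed in Section 4.4).

```python
import numpy as np

def apply_mismatch_threshold(y_soft, theta=0.4):
    # y_soft: estimated soft correspondence labels, shape (N,)
    y = np.asarray(y_soft, dtype=float).copy()
    y[y < theta] = 0.0   # BiCro*: pairs below θ are treated as mismatched
    return y
```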
4.3. Comparison with State-of-the-Art Methods

To demonstrate the effectiveness of the proposed BiCro, we evaluate it against several baseline methods, including general matching methods (SCAN [17], VSRN [20], IMRAM [3], and the SAF and SGR variants of SGRAF [7]) and robust learning methods against noisy correspondence (NCR [12] and DECL [31]).

4.3.1 Results on Simulated Noise

Table 1 reports the experimental results on the 1K test images of the Flickr30K dataset and over 5 folds of 1K test images of the MS-COCO dataset. From the experimental results, we find that the proposed BiCro performs significantly better than the state-of-the-art methods. In comparison with the best baseline, DECL [31], BiCro improves the retrieval sum score by 8.2, 7.6, 15.1, 2.1, 4.6, and 3.3 points under the different noise rates, respectively.
Table 2. Image-Text Retrieval on CC152K.
          Image→Text         Text→Image
Methods   R@1  R@5  R@10     R@1  R@5  R@10    Sum
SCAN      30.5 55.3 65.3     26.9 53.0 64.7    295.7
VSRN      32.6 61.3 70.5     32.5 59.4 70.4    326.7
IMRAM     33.1 57.6 68.1     29.0 56.8 67.4    312.0
SAF       31.7 59.3 68.2     31.9 59.0 67.9    318.0
SGR       11.3 29.7 39.6     13.1 30.1 41.6    165.4
NCR       39.5 64.5 73.5     40.3 64.6 73.2    355.6
DECL      39.0 66.1 75.5     40.7 66.3 76.7    364.3
BiCro     40.7 67.3 76.7     39.7 67.6 76.9    368.9
BiCro*    40.8 67.2 76.1     42.1 67.6 76.4    370.2
Table 3. Ablation studies on Flickr30K with 40% noise (✓ = component enabled).
Co-teaching  Soft label  BMM  Warmup | Image→Text R@1 R@5 R@10 | Text→Image R@1 R@5 R@10
✓ ✓ ✓ ✓ | 74.6 92.7 96.2 | 55.5 81.1 87.4
✓ ✓ ✓   | 69.9 91.9 95.7 | 52.0 79.0 85.5
✓ ✓ ✓   | 72.0 91.0 95.0 | 53.3 78.8 86.1
✓ ✓ ✓   | 72.0 92.3 95.5 | 55.2 79.8 86.6
✓ ✓ ✓   | 54.3 81.9 88.1 | 40.9 65.8 73.6

[Figure 5 omitted: average of R@1/R@5/R@10 (roughly 79.5-82.5) plotted against the mismatch threshold θ from 0.0 to 1.0.]
Figure 5. Variation of retrieval performance with different selection ratios ϵ and mismatch thresholds θ. Note that θ = 0 denotes using the originally generated soft labels for all data pairs.
In addition, BiCro* further improves the overall performance by filtering out the mismatched pairs. The results at a noise rate of 60% show the strong robustness of our BiCro against noisy correspondence even at high noise rates. The failure of NCR at high noise rates may stem from its reliance on the GMM, which cannot divide the noisy and clean pairs well [12]; our method overcomes this drawback by using the BMM.

4.3.2 Results on Real-world Noise

We evaluate the proposed method under the real noisy correspondence of CC152K. The experimental results are reported in Table 2. From the results, one can observe that our BiCro achieves competitive performance under real noise. Specifically, BiCro is 4.6 points higher than the best baseline, DECL, in terms of the retrieval sum score. Moreover, the performance gap between BiCro and BiCro* shows that filtering data pairs according to their soft correspondence labels can further reduce the impact of the data mismatch issue on performance.

4.4. Hyperparameter Analysis

We now analyze the effect of the warmup selection ratio ϵ and the mismatch threshold θ, which denote, respectively, the ratio of data that participates in warmup and the threshold below which a soft label is set to 0. We report the average image and text retrieval performance with different ϵ and θ on Flickr30K with a noise ratio of 40%. As shown in Fig. 5, setting ϵ = 0.3 achieves the best overall retrieval accuracy. Using all data (ϵ = 1) in warmup gives limited performance, which indicates that it is necessary to select low-loss data for warmup. On the other hand, as θ increases, our BiCro* improves the accuracy compared with using the originally generated soft labels (θ = 0). However, further increasing θ leads to a performance drop, since a too-large mismatch threshold treats weakly-matched samples as mislabeled ones, degrading generalization.

4.5. Ablation Study

In this section, we carry out an ablation study on Flickr30K with a noise ratio of 40%; the effect of each component is shown in Table 3. To probe the effectiveness of the soft label estimation in BiCro, we run a comparison experiment in which the labels of the clean and noisy samples divided by co-teaching are set to 1 and 0, respectively (third row). The fourth row compares our method with a GMM in place of the BMM. From the results, we draw the following conclusions: 1) Since the quality of noise-robust cross-modal learning depends on the soft correspondence labels, adopting the soft labels estimated by BiCro surpasses using the original binary labels. 2) Although the Gaussian-Mixture-Model achieves decent results, the performance can be further improved by the Beta-Mixture-Model. 3) Co-teaching and warmup, as important modules of the base framework, are also crucial in BiCro. 4) The model achieves the best accuracy by utilizing all of the components, which shows their complementarity in our proposed BiCro.

5. Conclusion

This paper focuses on the challenge of robust cross-modal matching on noisy data. To address this problem, we propose a general framework, bidirectional cross-modal similarity consistency (BiCro), for soft correspondence label estimation given only noisily-collected data pairs. The effectiveness of the proposed framework was verified on both synthetic noisy datasets and a real noisy dataset. The visualization results also demonstrate that our estimated soft labels are accurate estimates of the true correlation degree between data of different modalities.
6. Acknowledgements

This work was partially supported by the National Key R&D Program of China (No. 2021ZD0110901).

References

[1] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, pages 6077–6086, 2018.
[2] Eric Arazo, Diego Ortego, Paul Albert, Noel E. O'Connor, and Kevin McGuinness. Unsupervised label noise modeling and loss correction. In ICML, pages 312–321, 2019.
[3] Hui Chen, Guiguang Ding, Xudong Liu, Zijia Lin, Ji Liu, and Jungong Han. IMRAM: Iterative matching with recurrent attention memory for cross-modal image-text retrieval. In CVPR, pages 12655–12663, 2020.
[4] Mingcai Chen, Hao Cheng, Yuntao Du, Ming Xu, Wenyu Jiang, and Chongjun Wang. Two wrongs don't make a right: Combating confirmation bias in learning with label noise. arXiv preprint arXiv:2112.02960, 2021.
[5] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. UNITER: Universal image-text representation learning. In ECCV, pages 104–120, 2020.
[6] Haiwen Diao, Ying Zhang, Lin Ma, and Huchuan Lu. Similarity reasoning and filtration for image-text matching. In AAAI, pages 1218–1226, 2021.
[7] Haiwen Diao, Ying Zhang, Lin Ma, and Huchuan Lu. Similarity reasoning and filtration for image-text matching. In AAAI, pages 1218–1226, 2021.
[8] Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, and Sanja Fidler. VSE++: Improving visual-semantic embeddings with hard negatives. arXiv preprint arXiv:1707.05612, 2017.
[9] Sheng Guo, Weilin Huang, Haozhi Zhang, Chenfan Zhuang, Dengke Dong, Matthew R. Scott, and Dinglong Huang. CurriculumNet: Weakly supervised learning from large-scale web images. In ECCV, pages 135–150, 2018.
[10] Bo Han, Jiangchao Yao, Gang Niu, Mingyuan Zhou, Ivor Tsang, Ya Zhang, and Masashi Sugiyama. Masking: A new perspective of noisy supervision. In NeurIPS, pages 5836–5846, 2018.
[11] Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In NeurIPS, 2018.
[12] Zhenyu Huang, Guocheng Niu, Xiao Liu, Wenbiao Ding, Xinyan Xiao, Hua Wu, and Xi Peng. Learning with noisy correspondence for cross-modal matching. In NeurIPS, pages 29406–29419, 2021.
[13] Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In ICML, pages 2309–2318, 2018.
[14] Qing-Yuan Jiang and Wu-Jun Li. Deep cross-modal hashing. In CVPR, pages 3232–3240, 2017.
[15] Kushal Kafle and Christopher Kanan. Visual question answering: Datasets, algorithms, and future challenges. Computer Vision and Image Understanding, 163:3–20, 2017.
[16] Wonjae Kim, Bokyung Son, and Ildoo Kim. ViLT: Vision-and-language transformer without convolution or region supervision. In ICML, pages 5583–5594, 2021.
[17] Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. Stacked cross attention for image-text matching. In ECCV, 2018.
[18] Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. Stacked cross attention for image-text matching. In ECCV, pages 201–216, 2018.
[19] Junnan Li, Richard Socher, and Steven C.H. Hoi. DivideMix: Learning with noisy labels as semi-supervised learning. In ICLR, 2020.
[20] Kunpeng Li, Yulun Zhang, Kai Li, Yuanyuan Li, and Yun Fu. Visual semantic reasoning for image-text matching. In ICCV, pages 4653–4661, 2019.
[21] Linghui Li, Sheng Tang, Lixi Deng, Yongdong Zhang, and Qi Tian. Image caption with global-local attention. In AAAI, 2017.
[22] Mingchen Li, Mahdi Soltanolkotabi, and Samet Oymak. Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks. In AISTATS, 2020.
[23] Sheng Li, Zhiqiang Tao, Kang Li, and Yun Fu. Visual to text: Survey of image and video captioning. IEEE Transactions on Emerging Topics in Computational Intelligence, 3(4):297–312, 2019.
[24] Yuncheng Li, Jianchao Yang, Yale Song, Liangliang Cao, Jiebo Luo, and Li-Jia Li. Learning from noisy labels with distillation. In ICCV, pages 1910–1918, 2017.
[25] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[26] Tongliang Liu and Dacheng Tao. Classification with noisy labels by importance reweighting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(3):447–461, 2016.
[27] Xingjun Ma, Hanxun Huang, Yisen Wang, Simone Romano, Sarah M. Erfani, and James Bailey. Normalized loss functions for deep learning with noisy labels. In ICML, 2020.
[28] Xingjun Ma, Yisen Wang, Michael E. Houle, Shuo Zhou, Sarah M. Erfani, Shu-Tao Xia, Sudanthi Wijewickrema, and James Bailey. Dimensionality-driven learning with noisy labels. In ICML, pages 3361–3370, 2018.
[29] Eran Malach and Shai Shalev-Shwartz. Decoupling "when to update" from "how to update". In NeurIPS, pages 960–970, 2017.
[30] Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In CVPR, pages 1944–1952, 2017.
[31] Yang Qin, Dezhong Peng, Xi Peng, Xu Wang, and Peng Hu. Deep evidential learning with noisy correspondence for cross-modal retrieval. In ACM MM, pages 4948–4956, 2022.
[32] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763, 2021.
[33] Scott E. Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru Erhan, and Andrew Rabinovich. Training deep neural networks on noisy labels with bootstrapping. In ICLR Workshop Track, 2015.
[34] Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. In ICML, pages 4331–4340, 2018.
[35] Nikolaos Sarafianos, Xiang Xu, and Ioannis A. Kakadiaris. Adversarial representation learning for text-to-image matching. In ICCV, pages 5813–5823, 2019.
[36] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018.
[37] Hwanjun Song, Minseok Kim, and Jae-Gil Lee. SELFIE: Refurbishing unclean samples for robust deep learning. In ICML, pages 5907–5915, 2019.
[38] Daiki Tanaka, Daiki Ikami, Toshihiko Yamasaki, and Kiyoharu Aizawa. Joint optimization framework for learning with noisy labels. In CVPR, 2018.
[39] Daiki Tanaka, Daiki Ikami, Toshihiko Yamasaki, and Kiyoharu Aizawa. Joint optimization framework for learning with noisy labels. In CVPR, pages 5552–5560, 2018.
[40] Arash Vahdat. Toward robustness against label noise in training deep discriminative neural networks. In NeurIPS, pages 5596–5605, 2017.
[41] Andreas Veit, Neil Alldrin, Gal Chechik, Ivan Krasin, Abhinav Gupta, and Serge Belongie. Learning from noisy large-scale datasets with minimal supervision. In CVPR, pages 839–847, 2017.
[42] Kai Wang, Xiangyu Peng, Shuo Yang, Jianfei Yang, Zheng Zhu, Xinchao Wang, and Yang You. Reliable label correction is a good booster when learning with extremely noisy labels. arXiv preprint arXiv:2205.00186, 2022.
[43] Liwei Wang, Yin Li, Jing Huang, and Svetlana Lazebnik. Learning two-branch neural networks for image-text matching tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2):394–407, 2019.
[44] Songhua Wu, Xiaobo Xia, Tongliang Liu, Bo Han, Mingming Gong, Nannan Wang, Haifeng Liu, and Gang Niu. Class2Simi: A new perspective on learning with label noise. arXiv preprint arXiv:2006.07831, 2020.
[45] Xiaobo Xia, Tongliang Liu, Bo Han, Chen Gong, Nannan Wang, Zongyuan Ge, and Yi Chang. Robust early-learning: Hindering the memorization of noisy labels. In ICLR, 2021.
[46] Xiaobo Xia, Tongliang Liu, Bo Han, Mingming Gong, Jun Yu, Gang Niu, and Masashi Sugiyama. Sample selection with uncertainty of losses for learning with noisy labels. arXiv preprint arXiv:2106.00445, 2021.
[47] Xiaobo Xia, Tongliang Liu, Nannan Wang, Bo Han, Chen Gong, Gang Niu, and Masashi Sugiyama. Are anchor points really indispensable in label-noise learning? In NeurIPS, 2019.
[48] Xiaobo Xia, Tongliang Liu, Nannan Wang, Bo Han, Chen Gong, Gang Niu, and Masashi Sugiyama. Are anchor points really indispensable in label-noise learning? In NeurIPS, pages 6838–6849, 2019.
[49] Yilun Xu, Peng Cao, Yuqing Kong, and Yizhou Wang. L_DMI: A novel information-theoretic loss function for training deep nets robust to label noise. In NeurIPS, pages 6222–6233, 2019.
[50] Shuo Yang, Erkun Yang, Bo Han, Yang Liu, Min Xu, Gang Niu, and Tongliang Liu. Estimating instance-dependent label-noise transition matrix using DNNs. arXiv preprint, 2021.
[51] Shuo Yang, Erkun Yang, Bo Han, Yang Liu, Min Xu, Gang Niu, and Tongliang Liu. Estimating instance-dependent Bayes-label transition matrix using a deep neural network. In ICML, pages 25302–25312, 2022.
[52] Quanming Yao, Hansi Yang, Bo Han, Gang Niu, and James T. Kwok. Searching to exploit memorization effect in learning with noisy labels. In ICML, 2020.
[53] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2014.
[54] Xingrui Yu, Bo Han, Jiangchao Yao, Gang Niu, Ivor W. Tsang, and Masashi Sugiyama. How does disagreement benefit co-teaching? In ICML, 2019.
[55] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.
[56] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021.
[57] Qi Zhang, Zhen Lei, Zhaoxiang Zhang, and Stan Z. Li. Context-aware attention network for image-text retrieval. In CVPR, pages 3533–3542, 2020.
[58] Zhou Zhao, Qifan Yang, Deng Cai, Xiaofei He, and Yueting Zhuang. Video question answering via hierarchical spatio-temporal attention networks. In IJCAI, 2017.
[59] Songzhu Zheng, Pengxiang Wu, Aman Goswami, Mayank Goswami, Dimitris N. Metaxas, and Chao Chen. Error-bounded correction of noisy labels. In ICML, pages 11447–11457, 2020.