MDMMT: Multidomain Multimodal Transformer For Video Retrieval

MDMMT: Multidomain Multimodal Transformer for Video
Retrieval
A P REPRINT
Maksim Dzabraev1,2 , Maksim Kalashnikov1 , Stepan Komkov1,2 , Aleksandr Petiushko1,2

1
Lomonosov Moscow State University
arXiv:2103.10699v1 [cs.CV] 19 Mar 2021
2
Huawei Moscow Research Center
dzabraev.maksim@intsys.msu.ru, kalashnikov.maxim@intsys.msu.ru,
stepan.komkov@intsys.msu.ru, petyushko.alexander1@huawei.com
ABSTRACT There are two major directions which allow calculate the
relevance between a textual search query and a video seg-
We present a new state-of-the-art on the text to video re-
ment. The first direction is single stream approaches [32],
trieval task on MSRVTT and LSMDC benchmarks where
where a query and a video together are given to a net-
our model outperforms all previous solutions by a large
work and then become fused from the beginning of the
margin. Moreover, state-of-the-art results are achieved
processing. The schematic illustration of this approach is
with a single model on two datasets without finetuning.
presented in Fig. 1a.
This multidomain generalisation is achieved by a proper
combination of different video caption datasets. We show
that training on different datasets can improve test results
of each other. Additionally we check intersection between
many popular datasets and found that MSRVTT has a sig-
nificant overlap between the test and the train parts, and
the same situation is observed for ActivityNet.
Keywords video, language, retrieval, multi-modal,

cross-modal, temporality, transformer, attention
(a) Scheme for a single-stream (b) Scheme for a two-stream
neural network. neural network.
1 Introduction
Figure 1: Two types of fusion
Video is a quite popular data format, 500+ hours of video
are uploaded on YouTube every minute. Many personal
mobile phones have gigabytes of video. Since video format This type of approaches have access to all input data from
gets more popular every year the importance of modern the beginning of its processing and can make a strong
search methods is increasing as well. verdict about data. But these approaches have a significant
drawback because it is not scalable: every new query the
In this work we present our research about text to video
search system should calculate the full forward pass for
retrieval task. In this task system should return for a given
this query and for each video segment from the gallery.
textual query the most relevant video segments from a
gallery. The query is a textual description of what we Another direction is two stream neural networks [24], [8],
want to find in the video. The query may describe objects, where a textual query and a video are processed by two
actions, sounds, ..., and relations between them. different neural networks. As a result the networks pro-
duce embeddings inside the same embedding space, where
Such search methods are a promising direction for mobile
semantically close textual queries and video segments will
devices because every year manufacturers increase the
be placed next to each other. The schematic illustration is
available memory on devices. The large part of the memory
presented in Fig. 1b.
is filled by media data. For end users it is getting difficult
to search for a video made one or two years ago. But users The two stream approach is scalable, it allows to precom-
can easily describe the content of the video using natural pute video embeddings for all videos from the gallery,
language, which can be effectively used as a search query. and to do only one forward pass with the text network for
MDMMT: Multidomain Multimodal Transformer for Video Retrieval A P REPRINT
each new query and then to compute the cosine similarity and the test part has 3k videos. There are two important
between the new query embedding and all precomputed properties of this split: 1. there are no two video segments
embeddings. cropped from the same video so as the first segment is
placed in the train part and the second segment is placed in
To make a strong video retrieval solution it is important to
the test part; 2. there are no two video segments, retrieved
show to the model a lot of situations, actions and objects
from the same query so as the first one is placed in the
from real life. There exist a lot of video datasets, but none
train part and the second one is placed in the test part.
of them cover a significant portion of real life situations.
One of the first steps to tackle this problem is to formulate Another two splits are called 1k-A [40] (sometimes called
the rules for combining different existing datasets to a jsfusion) and 1k-B [21] (sometimes called miech). Both
single large train database. of them have different 1k videos for testing. They were
created by randomly sampling 1k videos from the original
Text to video retrieval is a modern direction, where one
test part (full split). 1k-A train part consists of the original
of the first works was published at 2016 [33]. One of the
train split and the rest of the videos from the test part, so it
most universal solution for video retrieval task is Multi
has 1k videos for the test part and 9k videos for the train
Modal Transformer [8] architecture which uses BERT [4]
part. 1k-B has 1k videos for the test part and 6.5k videos
backbone for a video network. It allows in a natural way to
for the train. Additionally both splits use only one caption
process the temporal dependencies inside the multi modal
per segment (instead of 20 captions).
data source.
Unfortunately 1k-A and 1k-B mixed up the train and test
To train a text to video retrieval neural network the training
parts. This led to violation in properties 1. and 2. which
database should consists of pairs: (a video segment, a
the full split satisfies.
textual description of this video segment). Traditionally
such sort of datasets was created for a video captioning Another problem is that all these splits have the overlap
task. But it turns out that these datasets perfectly can between the test and train parts, see C.2 for details. To
be used for a video retrieval task. One of the first video be strict we remove the overlap between the test part and
captioning dataset was MSVD, which was created in 2010. the train part of MSRVTT full split. We called this split
Today there exist more than a dozen of different video MSRVTT full clean, and refer to it as Mc . It is worth
captioning datasets. to mention that we do not modify the test part, we only
remove some videos from the train part.
The most popular datasets for text to video retrieval is
MSRVTT [39], ActivityNet [17] and LSMDC [29]. Many The Large Scale Movie Description Challenge
researchers test their solutions mostly on these three (LSMDC) [29] is the extension of two independent
datasets. datasets: MPII Movie Description Dataset (MPII-
MD) [28], and Montreal Video Annotation Dataset
Our main contributions in this work are the following:
(M-VAD) [34].
• We present a new state-of-the-art (SotA) result on
Video segments for this dataset were cropped from movies,
MSRVTT and LSMDC benchmarks;
where movie textualized transcriptions were used as cap-
• We present a model which shows good results on three
tions. A movie transcription is an audio description of a
different benchmarks without finetuning: MSRVTT
video segment that helps blind people to watch movies by
(SotA), LSMDC (SotA) and ActivityNet at the same
describing what happens, who appears in this time, what
time;
is on background right now and so on.
• We present a practical approach which helps us to find
the overlap between the train and the test parts of used In this work for testing we use LSMDC public test, which
datasets. consists of 1k video segments.
ActivityNet captions dataset [17] consists of 20k videos
2 Related work and 100k captions, where captions cover the full video
length for the most of videos, and neighbour captions may
intersect. The annotations were made with Amazon Me-
2.1 Datasets chanical Turk.
MSRVTT [39] was created in 2016. This dataset is tradi- The situation when some video segments may overlap
tionally used by researchers as the main dataset for testing makes a problem for text to video retrieval testing. Suppose
text to video retrieval models. This dataset consists of we have two video-caption pairs (S1 , C1 ) and (S2 , C2 )
10k video segments, each segment has 20 captions. The where the video segment S1 has a non empty overlap with
authors collected 257 popular search queries and gathered the video segment S2 . Now suppose that for query C1 the
from YouTube 118 most relevant videos for each of them. system returns the video segment S2 . Is it mistake or not?
The dataset has 42 hours of video. The captions were made What to do in this case?
by 1327 amazon workers.
Many previous works used ActivityNet test dataset in a
Today there are three different test/train splits. The official paragraph retrieval mode. In this mode all captions for all
split is called full split, where the train part has 7k videos
2
video segments are concatenated, then the concatenated stream solution designed for a text to video retrieval task.
text is used as a textual query and the whole video should The extraction of features from the input video stream is
be retrieved for this query. Such mode has two drawbacks. done in the following way. An input video is preprocessed
The first one is that paragraph retrieval is not a classical by several pretrained frozen neural networks (these net-
video retrieval mode, it is another task. One can ask: if works are called experts). Original solution uses seven
a model is good in paragraph retrieval will it be good modalities: motion, RGB, scene, face, OCR, speech, au-
for video retrieval? The second drawback is that queries dio, and one pretrained network for each modality is used.
will be long, video segments will be long (compared to The motion modality is processed with video recognition
a classical video retrieval mode). This issue requires to networks like S3D, SlowFast, irCSN, where several input
enlarge the input for the model. frames are used as a single input. The RGB modality uses
a single frame as an input. The audio modality uses the raw
Another way to use the test part of ActivityNet is just to
input sound from a video. After embeddings are extracted
sample once a single random segment from each video. As
from input data by these experts, it will be augmented by
a result we will have non intersected video segments and
adding positional encoding tokens (representing time) and
captions with usual length. We use ActivityNet test part
expert tokens. Then the augmented embeddings are passed
in this way. We take all videos from val1 and val2 parts,
through MMT backbone. MMT backbone is a standard
and sample a single random segment from each video. All
transformer encoder architecture. Each input modality pro-
results on ActivityNet are reported on this split.
duces one embedding, so in total there are seven output
Additionally in this work the following datasets are used: embedding from MMT.
NIST TRECVID Twitter vines [1], TGIF [18], MSVD [2],
For encoding the textual query the authors use pretrained
YouCook2 [43], Something-something V2 [10], Kinetics
BERT model where the output [CLS] token is used. The
700 [31], HowTo100M [23].
output is postprocessed with shallow networks (one net-
work per modality) to extract the modality related infor-
2.2 Prior Art mation, in total seven feature vectors will be produced. In
addition to embeddings from the text query seven weights
A dominant approach to train video retrieval models is representing how much the query describes one of seven
contrastive learning. The idea of this approach is that we modalities are produced. For example, if a query does
have a set of pairs (videoi , texti ) and elements of each not represent the sound, the small weight for the audio
pair should be placed next to each other in some metric modality should be produced.
space: distance(videoi , texti ) → 0, at the same time the
The final similarity score is done by a sum of seven
element videoi should be far from all other textj ), j 6= i:
weighted dot products of embeddings.
distance(videoi , textj ) → +∞. The bi-directional max-
margin ranking loss [13] represents this idea. The MMT is trained with the bi-directional max-margin
ranking loss [13]:
When training data have a lot of noise the MIL NCE
loss [22] can be applied in the training procedure. Sup-
pose that we know that a videoi should be close to one of 1 X B Xh i
(or several) texts texti1 , ..., textik . This approach tries to max(0, sij −sii +m)+max(0, sji −sii +m)
reduce the distance between the videoi and all texti1 , ..., B i=1 j6=i
textik at the same time.
All video captions datasets have the following problem. where B, sij , m represent the batch size, the similarity
Suppose the distance between (videoi , texti ) is to be min- between the i-th query and the j-th video inside this batch,
imized while the distance between (videoi , textj ), j 6= i and some predefined margin correspondingly.
is to be maximized, but texti and textj are quite similar
(from the semantical point of view). Maybe the optimal
scenario in this situation is to minimize the distance be-
tween (videoi , textj ), j 6= i. In [25] the authors show the 3 Methodology
approach which deals with this problem.
As far as an input video is the temporal sequence of tokens Our work is mostly based on MMT. We use the same loss
(frames or video segments) it is important to efficiently and a similar architecture, but with different hyperparame-
aggregate the information from all tokens. Many ideas for ters. In this work we study the following questions:
such aggregation in the previous works are borrowed from • Which publicly available pretrained motion expert is the
the natural language processing. Convolution filters for best for text to video retrieval nowadays, Sec. 3.1.
aggregation are used in [25], a transformer encoder as a • How to combine several video caption datasets in or-
video aggregator is used in [8], many different aggregation der to train a strong model without specialisation for a
functions are tested in [26]. particular dataset, Sec. 3.2.
We think that the most promising aggregation method is • How to find and prevent the overlap between the test and
Multi Modal Transformer (MMT) [8]. MMT is a two train parts when combining datasets, Sec. C.
3
3.1 Motion experts dataset (the data will be memorized). The "information
sizes" of some used datasets are illustrated in Fig. 2.
The MMT video backbone does not process the raw in-
put video stream, and instead the input video stream is
processed by one or more pretrained experts, where each ActivityNet
expert produces time series of features. The most impor-
466h
tant modality is motion: a motion expert processes several
video frames as a single input unit and extracts the infor-
total video duration

mation about actions and objects within a segment.
We may say that the motion modality is the basis of the
MMT. If a motion expert doesn’t extract some information,
there is a high probability that MMT won’t know about
some events in the video stream. That’s why improving TGIF
the motion expert is very important. LSMDC
We consider several best solutions from Kinetics [15] 102h YC2 MSRVTT
benchmark as well as several promising video recogni- 41h Vines MSVD
tion models and check which one works in the best way as
a motion expert. We present all details in Sec. 4.2. 10K 50K 100K 166K
number of unique captions
3.2 Dataset creation Figure 2: Radius of the ball represent the “information
size” of dataset. The biggest balls have more diversity in
It is possible to train a video retrieval model by two means. data.
The first way is the way of specialization for a single
domain. For example: create model that will work good
Fig. 2 is made with a simple algorithm. We take the orig-
only for MSRVTT benchmark (or domain) but at the same
inal training procedure of MMT and for a given dataset
time this model will show poor results on other datasets
we change the number of examples that will be shown to a
(domains). In this way MMT [8] was trained. The authors
network during training. We define the radius of the ball
trained three different models for MSRVTT, ActivityNet
as the number of training examples after which the perfor-
and LSMDC datasets. Each of these three networks works
mance gets saturated (i.e. increasing the training time does
good on domain X if and only if it was trained on X, but
not give the better model).
at the same time works poor on another domain Y 6= X.
A proof of this statement we provide in Tab. 6. The key question is: what is the proper way for sampling
examples from several datasets taking into account the
The second way is to create a model that will work good
different information size?
for all domains at the same time. We use this way.
We use these obvious rules:
Obviously the model trained in the first way can’t work
good with real users, because the event when a user writes 1. If a dataset X is larger than Y , we should sample from
a search query similar to some caption from a small train X more often than from Y ;
database is very rare. 2. Training on X and Y combined requires longer train
than training solely on X or Y ;
The second drawback here is that each video retrieval train
3. Training on X and Y combined may require a deeper
dataset is not that big, and it causes the situation that model
model than for X or Y .
doesn’t see many words and real life situations during
training. For example, MSRVTT has only 9k videos and If we achieve the same results on X after combining X
200k captions in total for training, obviously this is not and Y it is still good because model gets better on Y .
enough to train a neural network that will know most of Our experiments show that the proper usage of rules 1–3
real life situations, different items and persons. To tackle often improves the results for a specific test dataset (e.g.
with this problem we can take several datasets with videos MSRVTT) after extending the train dataset.
and captions and concatenate it.
We managed to combine the following datasets: MSRVTT,
Different datasets have the different number of videos, the ActivityNet, LSMDC, TwitterVines, YouCook2, MSVD,
different number of captions per video, some datasets may TGIF and Something to something V2 (SomethingV2). In
have long captions, some may have short captions, differ- total we increase the number of video segments by 40 times
ent rules for creating captions were used by human writers, and the number of unique captions by 4 times compared
and so on. Due to these factors some datasets may contain with MSRVTT dataset. In Tab. 1 we summarize the sizes of
more information and require longer training time, some used datasets. We separate SomethingV2 dataset from all
datasets may contain less information and require shorter other datasets because: 1. all video segments are created
training time. On the other hand, if we use long training artificially, 2. the structure of text captions is quite limited.
time for a small dataset it could lead to overfitting on this At the same time videos for all other datasets are collected
4
from the Internet and captions being created by humans Abbreviate Composition
have quite a rich structure.
M MSRVTT full split
Mc MSRVTT full clean split, see Sec. 2.1
M1k-A MSRVTT 1k-A split
Num Num Num Has M1k-B MSRVTT 1k-B split
Dataset video pairs unique YouTube A ActivityNet
captions Id Aval1 ActivityNet val1 validation set
Aval2 ActivityNet val2 validation set
MSRVTT 10k 200k 167k Yes
Ap/r ActivityNet paragraph retrieval, see Sec. 2.1
ActivityNet 14k 70k 69k Yes
LSMDC 101k 101k 101k No L LSMDC
TwitterVines 6.5k 23k 23k No K Kinetics700
YouCook2 1.5k 12k 12k Yes V Twitter Vines
MSVD 1.5k 80k 64k Yes Y YouCook2
TGIF 102k 125k 125k No HT100M HowTo100M
Sum above 236k 611k 561k — MSRVTT + ActivityNet +
MALV
SomethingV2 193k 193k 124k No LSMDC + TwitterVines
Sum above 429k 804k 685k —
MSRVTT + ActivityNet +
Table 1: The "Num video" column represents the number MALVYMT LSMDC + TwitterVines +
of video clips in the dataset, the "Num pairs" column rep- YouCook2 + MSVD + TGIF
resents the total number of video-caption pairs, the "Num
unique captions" column represents the number of unique MSRVTT + ActivityNet +
captions in the dataset. LSMDC + TwitterVines +
MALVYMTS
YouCook2 + MSVD + TGIF +
Something to Something V2
Table 2: The left column represents the abbreviate name
for the set of datasets from the right column.
3.3 Intersection
4 Experiments
It is important to extend the training database carefully, not
allowing the addition to the train part of video segments
that already exist in the test part. 4.1 Architecture
To find the intersection between the test part and the train We use exactly the same neural network architecture as
part we use the two stage filtration. The first stage is to use original MMT [8], our method is significantly based on
the YouTube ID, if it is available. We should not allow to their codebase. The difference is in the following: 1. we
use in the test and train parts simultaneously any two video use the more aggressive dropout equals to 0.2 for the text
segments sampled from the same video. In the second BERT and the video BERT (against the original value of
stage we compute the similarity score between each video 0.1); 2. we found that the deeper and wider transformer
from the test part and each video from the train part, then encoder for a video network gives better results — we use
we manually assess the pairs with the highest scores. In 6 layers and 8 heads for the motion only modality and 9
total we assessed more than 100K pairs of the most relevant layers and 8 heads for the motion + audio setting (against
segments, see Sec. C.1 for details. 4 layer and 4 head in the original implementation).
We found the significant overlap between the MSRVTT
1k-A test and train parts, and the similar situation is with 4.2 Stronger motion experts
the 1k-B test and train parts and the less significant overlap
is found between the MSRVTT full split test and train As the input data for MMT is embeddings from experts,
parts. The similar situation is with the ActivityNet train the obvious question can arise: if a better expert is used,
and validation 1,2 parts. will we have a stronger model? To answer this question
we train MMT on MSRVTT dataset with the only motion
Additionally we estimate (but did not find) the overlap
modality. For motion experts we try several architectures
between HowTo100M and MSRVTT, and found that it
pretrained on different datasets, these models are presented
may be significant. Our approach allows to approximately
in Tab. 3. We take the architectures which show the best re-
estimate the total number of videos in the intersection
sults on Kinetics 400 benchmark having publicly available
without finding the exact intersection, please see the details
pretrained weights: [38] [7] [36] [35] [9] .
in Sec. C.1.2. The similar estimation is for ActivityNet
and Kinetics700, an our approximation shows that there The results in Tab. 3 are made with the same hyperparame-
may be a significant overlap, see all details in Sec. C.1. ters as in [8]. For the train dataset we use only MSRVTT
5
full clean split. The first line in Tab. 3 represents the motion some weight will be larger than needed, this dataset will
feature extractor from the original MMT paper. be overseen and as a result the performance will be lower
comparing to the optimal case. The opposite case is when
As we can see, usually stronger models provide better
a small weight was selected, this causes the situation when
results, but not always. Refer to r(2+1)d 152 rows, this
during training a network does not see the required number
network demonstrates one of the best performance on Ki-
of examples from this dataset.
netics 400 benchmark, but works poorly as motion expert.
Maybe this network is over specialized for Kinetics 400. For experiments in this section we use MMT with the only
More shallow analogue of r(2+1)d 152 is r(2+1)d 34 which motion modality. Embeddings for the motion modality
shows much better results. are computed with irCSN152 pretrained on IG65M. All
configurations are trained with 50 epochs and different
An interesting observation is that the best results are
number examples per epoch. The initial learning rate is 5e-
achieved with the networks trained in the unsupervised
5. After each epoch we multiply learning rate by 0.95. The
manner. CLIP and models trained on IG65M outperform
MALVYMTS (see Tab. 2 for abbreviations.) configuration
all other models trained on Kinetics in the supervised man-
is trained with 150K examples per epoch. Configurations
ner. Another weakly supervised dataset is Sports1M [14].
with the less number of datasets are trained with the less
Models trained on this dataset provide weak embeddings
number of examples per epoch. The number of examples
similar to the weak s3d model trained on Kinetics dataset.
per epoch can be represented as a product of 150K by a sum
The CLIP [27] (ViT-B/32) image feature extractor with
of normalized weights (weights from Tab. 5 divided by a
a large margin outperforms all other models. The model
sum of all weights) for each dataset (the initial sum equals
s3dg MIL-NCE is a video encoder from the work [22].
to 1): 150K = 150K × (pMSRVTT + pActivityNet + pLSMDC +
This network was trained from scratch on HowTo100M
pTwitter Vines + pYouCook2 + pMSVD + pTGIF + pSomething V2 ). If
dataset.
some dataset is removed from the training, we remove the
As we show in Sec. C Kinetics dataset has an overlap with corresponding coefficient from this sum, so the resulting
MSRVTT dataset, and we don’t know whether it affects to length will be 150K multiplied by a value less than 1.
overfitting or not. Also it is worth to mention that IG65M
As far as we use the configurations Mc , A, L as the base-
and CLIP datasets are not publicly available, so we do not
lines, we need to be sure that the results for these con-
know if there is an overlap with MSRVTT and other video
figurations are the optimal values. In addition to the rule
retrieval datasets.
described above we try several values for a number of ex-
For more details about our usage of pretrained video examples per epoch parameter, and report the results for the
perts please refer to Sec. A. best found value.
Tab. 6 summarizes our experiments on the datasets combi-
4.3 Datasets combination nation (for more details please refer to Sec. B). The main
point here is that the proper combination of datasets leads
to the best solution.
In this section we show our experiments about the com-
bination of different datasets. Nowadays video-caption
datasets are not big enough to capture all real life situa- 4.4 Final result
tions, also some datasets may be biased. The combination
of different datasets may help to tackle this problem. In this section we compare our solution with the prior art.
Our two best solution uses three modalities: the audio, the
Our experiments show that the proper combination of
motion and the RGB. To fuse modalities we use MMT ar-
datasets allows to train a single model that can capture
chitecture with 9 layers and 8 heads. As a feature extractor
the knowledge from all used datasets. Important thing here
for the audio stream the vggish [12] network is used. For
is that in most cases the model trained on the combination
the video encoding we use CLIP ViT-B/32 (RGB modal-
of datasets is better than the model trained on a single
ity) and irCSN152 (motion modality) pretrained on IG65M
dataset.
dataset. The details about preprocessing videos for both
In our experiments we combine all datasets presented in networks are presented in Sec. A.
Tab. 5. The important thing is how to sample minibatches
Additionally we report separate results for motion + audio
during training. In our experiments we first sample a
encoders and RGB + audio encoders because we do not
dataset, then we uniformly sample a video segment, if
know whether the IG65M or CLIP train database has a
this sampled video segment has more than one caption
significant overlap with any of the test datasets or not.
we sample a single caption uniformly. Column weight
in Tab. 5 describes the probability of sampling the corre- All our models presented in Tab. 4,7 and 8 are trained
sponding dataset. To obtain the probability of sampling based on the pretrain HowTo100M model. We present the
the dataset with the weight w we should divide w by the details about pretraining in Sec. E.
sum of all weights.
The results for MSRVTT are presented in Tab. 7. As
The weights for all datasets are manually adjusted. It is we can see our solution MDMMT(MALVYMTS) L9H8
important to find a good weight combination, because if CLIP+irCSN152+audio significantly outperforms all pre-
6
Text → Video
Video expert Dataset
R@1↑ R@5↑ R@10↑ MnR↓ MdR↓
s3d Kinetics 600 7.7±0.1 24.0±0.2 34.9±0.2 129.6±1.0 23.7±0.5
SlowFast 32x2 R101 Kinetics 600 9.3±0.1 27.5±0.1 39.1±0.1 110.8±1.1 18.7±0.5
ipCSN152 IG65M 9.5±0.1 27.9±0.2 39.6±0.2 106.1±1.1 18.0±0.0
ipCSN152 IG65M → K400 8.3±0.1 25.2±0.1 36.5±0.2 124.3±0.2 21.0±0.0
ipCSN152 Sports1M 7.4±0.2 22.4±0.1 32.7±0.2 140.6±1.0 27.0±0.0
ipCSN152 Sports1M → K400 7.8±0.1 24.2±0.1 35.2±0.1 129.9±0.2 23.0±0.0
irCSN152 IG65M 9.5±0.1 27.9±0.2 39.5±0.2 105.5±0.4 18.0±0.0
irCSN152 IG65M → K400 8.4±0.1 25.3±0.1 36.5±0.2 120.4±0.4 21.0±0.0
irCSN152 Sports1M 6.9±0.1 21.6±0.1 31.6±0.1 141.9±0.4 28.7±0.5
irCSN152 Sports1M → K400 7.7±0.1 24.1±0.1 35.1±0.1 127.6±0.6 23.0±0.0
r(2+1)d 152 IG65M 5.7±0.1 18.5±0.1 27.8±0.1 178.5±1.5 37.7±0.9
r(2+1)d 152 IG65M → K400 5.5±0.1 18.1±0.1 27.3±0.1 184.1±1.2 39.3±0.5
r(2+1)d 152 Sports1M → K400 5.3±0.1 17.3±0.1 26.0±0.1 193.4±3.6 42.3±0.5
r(2+1)d 34 IG65M 9.1±0.2 27.2±0.2 38.7±0.2 108.1±0.0 19.0±0.0
r(2+1)d 34 IG65M → K400 8.2±0.2 25.3±0.3 36.7±0.1 120.8±0.7 21.0±0.0
CLIP CLIP 14.4±0.1 37.4±0.3 50.2±0.3 70.3±0.3 10.3±0.5
s3dg MIL-NCE HowTo100M 8.6±0.4 26.3±0.5 37.9±0.7 104.4±2.2 19.3±0.5
Table 3: Comparison of the best available pretrain models as the motion experts for MMT. IG65M → K400 means that
model was trained on IG65M and then fine tuned on Kinetics400. Results for each experiment are computed over three
runs with random seeds. The results are reported on MSRVTT full clean split.
ActivityNet text → video

model
R@1↑ R@5↑ R@10↑ MnR↓ MdR↓
CLIP [27] 0.02 0.06 0.2 2210 2251
MMT (Ap/r ) motion+audio [8] 7.3 22.5 31 283.9 30
Ours MDMMT(Mc ALVYMTS) L9H8 irCSN152+audio 15.1±0.1 38.3±0.1 51.5±0.3 92.4±2.3 10.0±0.0
Ours MDMMT(Mc ALVYMTS) L9H8 CLIP+audio 17.7±0.1 41.6±0.3 54.3±0.2 76.0±1.0 8.3±0.5
Ours MDMMT(Mc ALVYMTS) L9H8 CLIP+irCSN152+audio 20.1±0.5 45.1±0.5 58.0±0.6 70.8±0.1 7.0±0.0
Table 4: Test results on our split (see Sec. 2.1) on ActivityNet.
Dataset Weight Test Text → Video R@5 ↑

Dataset
MSRVTT ActivityNet LSMDC
MSRVTT 140
ActivityNet 100 Mc 29.0±0.2 13.4±0.3 12.9±0.6
LSMDC 70 A 14.7 ±0.1 30.9±0.6 10.4±0.3
Twitter Vines 60 L 8.8±0.1 7.2±0.2 24.7±0.6
YouCook2 9 Mc ALV 32.1±0.1 32.0±0.2 26.5±0.7
MSVD 9 Mc ALVYMT 33.8±0.1 32.3±0.2 27.3±0.4
TGIF 102 Mc ALVYMTS 34.5±0.1 32.4±0.5 27.4±0.6
Something V2 169 Table 6: See abbreviations for the first column in Tab. 2.
Table 5: These datasets were used in our train procedure. The first three rows Mc ,A,L report the quality of models
The "Weight" column describes how often we sample ex- trained on a single domain, and tested on other domains.
amples from the dataset. The probability of obtaining an Italic means that the model did not see data from this do-
example from the dataset with the weight w equals to w main during training. In this table the only motion modality
divides by a sum of all weights. (irCSN152) is used.
outperforms the original MMT (the motion, the RGB and

vious solutions on all splits: full, 1k-A and 1k-B. Our
the audio and 4 other modalities) by 8.7%, 10.5% and 14.4
solution is better than the previous SotA (on R@5) on
(R@5) on full, 1k-A and 1k-B correspondingly.
8.7%, 10.5% and 14.4% on full, 1k-A and 1k-B corre-
spondingly. It is also worth to mention that our MDMMT We also report the results for the original CLIP [27]. The
(using only the motion, the RGB and the audio modalities) CLIP model has an image encoder and a text encoder, both
7
pretrained in an unsupervised way. To test the CLIP model

we take a single frame from the middle of the video (this
is the original testing protocol for CLIP). The row CLIP
agg [26] represents the usage of CLIP model with several
frames using some specific aggregation procedure from
this work.
In Tab. 8 we report the results on LSMDC. On this bench-
mark we outperform the previous SotA solution by 8.6%.
As we mention in Sec. 2.1 we do not use the standard
ActivityNet paragraph retrieval test protocol. Instead we
use the text to video retrieval protocol. To compare our
solution with the previous work we take the previous SotA
approach (MMT) in text to video retrieval and test it on
our split. The results are reported in Tab. 4. Our solution
outperforms MMT by 22.6%. The row MMT (Ap/r ) mo-
tion+audio means that this network was trained only on
ActivityNet dataset with the paragraph retrieval mode. It is
also worth to mention that CLIP shows very bad results on
this benchmark. We try to aggregate with the mean pool-
ing of 2, 4 and 16 uniformly taken embeddings, take the
first 10, 20 and 70 words from a caption, and no method
improves the results.
The important property of our model is that we train a
single model and test it on different test sets. The authors
of previous SotA approach (MMT) trained three different
models for MSRVTT, ActivityNet and LSMDC, while in
Tab. 6 we show that the model trained in such a manner
has poor generalization and can show good performance
on the test part of the dataset X if and only if it was trained
on the train part of the dataset X.
5 Conclusions and Discussion

In this work we present a new text to video retrieval state-
of-the-art model on MSRVTT and LSMDC benchmarks.
We do not use ActivityNet dataset in the paragraph retrieval
mode as many previous works do, so we can’t compare
with them. But we show that on ActivityNet in the video
retrieval mode we outperform the previous state-of-the-art
model (MMT) by a large margin. Our model has captured
knowledge from many video caption datasets, thus it is
able to show the best results on several datasets at the same
time without finetuning.
We also present a practical approach to find the overlap
between two different video datasets. Using this approach
we find the overlap between several datasets. Especially
we find a large overlap between the MSRVTT test and
train parts, and between the ActivityNet test and train
parts. Removing this overlap from the MSRVTT train part
significantly decreases the performance of previous best
models on MSRVTT benchmark.
Acknowledgments. We would like to thank Andrey
Ivanyuta and other colleagues from Intelligent Systems
and Data Science Lab for helping to find the overlap be-
tween datasets.
8
MSRVTT text → video
split
model
R@1↑ R@5↑ R@10↑ MnR↓ MdR↓
Random baseline 0.0 0.2 0.3 1500 1500
VSE [24] 5.0 16.4 24.6 — 47
VSE++ [24] 5.7 17.1 24.8 — 65
Multi Cues [24] 7.0 20.9 29.7 — 38
W2VV [5] 6.1 18.7 27.5 — 45
Dual Enc. [6] 7.7 22.0 31.8 — 32
CE [19] 10.0±0.1 29.0±0.3 41.2±0.2 86.8±0.3 16.0±0.0
full
MMT (M) 7mod [8] 10.7±0.2 31.1±0.1 43.4±0.2 88.2±0.7 15.0±0.0
CLIP [27] 15.1 31.8 40.4 184.2 21
CLIP agg [26] 21.5 41.1 50.4 — 4
Ours MDMMT(MALVYMTS) L9H8 irCSN152+audio 15.7±0.1 38.8±0.1 51.1±0.2 76.0±0.7 10.0±0.0
Ours MDMMT(MALVYMTS) L9H8 CLIP+audio 21.7±0.2 47.6±0.3 59.8±0.1 55.9±0.2 6.0±0.0
Ours MDMMT(MALVYMTS) L9H8 CLIP+irCSN152+audio 23.1±0.1 49.8±0.1 61.8±0.1 52.8±0.2 6.0±0.0
MMT (Mc ) 7mod [8] 10.4±0.1 30.2±0.4 42.3±0.2 89.4±0.6 15.7±0.5
clean
full
Random baseline 0.1 0.5 1.0 500.0 500.0
JSFusion [40] 10.2 31.2 43.2 — 13
E2E [22] 9.9 24.0 32.4 — 29.5
HT [23] 14.9 40.2 52.8 — 9
CE [19] 20.9±1.2 48.8±0.6 62.4±0.8 28.2±0.8 6.0±0.0
CLIP [27] 1k-A 22.5 44.3 53.7 61.7 8
MMT (M1k-A ) 7mod [8] 26.6±1.0 57.1±1.0 69.6±0.2 24.0±0.8 4.0±0.0
AVLnet[30] 27.1 55.6 66.6 — 4
SSB [25] 30.1 58.5 69.3 — 3.0
CLIP agg [26] 31.2 53.7 64.2 — 4
Ours MDMMT(M1k-A ALVYMTS) L9H8 irCSN152+audio 31.3±0.1 60.4±1.2 71.8±1.0 24.0±0.4 3.0±0.0
Ours MDMMT(M1k-A ALVYMTS) L9H8 CLIP+audio 38.9±1.0 68.3±0.7 78.8±0.2 17.3±0.5 2.0±0.0
Ours MDMMT(M1k-A ALVYMTS) L9H8 CLIP+irCSN152+audio 38.9±0.6 69.0±0.1 79.7±0.6 16.5±0.4 2.0±0.0
Random baseline 0.1 0.5 1.0 500.0 500.0
MEE [21] 13.6 37.9 51.0 — 10.0
JPose [37] 14.3 38.1 53.0 — 9
MEE-COCO [21] 14.2 39.2 53.8 — 9.0
1k-B
CE [19] 18.2±0.7 46.0±0.4 60.7±0.2 35.3±1.1 7.0±0.0

MMT (M1k-B ) 7mod [8] 24.5±0.5 54.4±0.8 68.0±0.5 26.6±0.2 4.7±0.5
CLIP [27] 24.5 46.2 56.8 60.9 7
Ours MDMMT(M1k-B ALVYMTS) L9H8 irCSN152+audio 28.8±0.9 58.8±0.3 71.2±0.3 28.5±0.5 3.7±0.5
Ours MDMMT(M1k-B ALVYMTS) L9H8 CLIP+audio 35.1±0.1 66.5±0.9 77.6±0.3 21.5±0.4 2.7±0.5
Ours MDMMT(M1k-B ALVYMTS) L9H8 CLIP+irCSN152+audio 37.4±1.5 68.8±0.4 79.4±0.4 21.3±0.4 2.0±0.0
Table 7: Results on MSRVTT dataset.
LSMDC text → video

model
R@1↑ R@5↑ R@10↑ MnR↓ MdR↓
CT-SAN [41] 5.1 16.3 25.2 — 46
JSFusion [40] 9.1 21.2 34.1 — 36
MEE [21] 9.3 25.1 33.4 — 27
MEE-COCO [21] 10.1 25.6 34.6 — 27
CE [19] 11.2±0.4 26.9±1.1 34.8±2.0 96.8±5.0 25.3±3.1
CLIP agg [26] 11.3 22.7 29.2 — 56.5
CLIP [27] 12.4 23.7 31.0 142.5 45
MMT (L) 7mod [8] 12.9±0.1 29.9±0.7 40.1±0.8 75.0±1.2 19.3±0.2
Table 8: Test results on LSMDC public test (1k video)

9
References [15] Will Kay et al. The Kinetics Human Action Video
Dataset. 2017. arXiv: 1705.06950 [cs.CV].
[1] George Awad et al. “TRECVID 2020: comprehen- [16] Giorgos Kordopatis-Zilos et al. “Near-duplicate
sive campaign for evaluating video retrieval tasks video retrieval with deep metric learning”. In: Pro-
across multiple application domains”. In: Proceed- ceedings of the IEEE International Conference on
ings of TRECVID 2020. NIST, USA. 2020. Computer Vision Workshops. 2017, pp. 347–356.
[2] David Chen and William Dolan. “Collecting Highly [17] Ranjay Krishna et al. “Dense-Captioning Events in
Parallel Data for Paraphrase Evaluation”. In: Pro- Videos”. In: International Conference on Computer
ceedings of the 49th Annual Meeting of the Associ- Vision (ICCV). 2017.
ation for Computational Linguistics: Human Lan- [18] Yuncheng Li et al. “TGIF: A New Dataset and
guage Technologies. Portland, Oregon, USA: Asso- Benchmark on Animated GIF Description”. In: The
ciation for Computational Linguistics, June 2011, IEEE Conference on Computer Vision and Pattern
pp. 190–200. URL: https://www.aclweb.org/ Recognition (CVPR). 2016.
anthology/P11-1020. [19] Yang Liu et al. Use What You Have: Video Retrieval
[3] J. Deng et al. “ImageNet: A Large-Scale Hierarchi- Using Representations From Collaborative Experts.
cal Image Database”. In: CVPR09. 2009. 2020. arXiv: 1907.13487 [cs.CV].
[4] Jacob Devlin et al. BERT: Pre-training of Deep Bidi- [20] Dhruv Kumar Mahajan et al. “Exploring the Lim-
rectional Transformers for Language Understand- its of Weakly Supervised Pretraining”. In: ECCV.
ing. 2019. arXiv: 1810.04805 [cs.CL]. 2018.
[5] Jianfeng Dong, Xirong Li, and Cees G. M. Snoek. [21] Antoine Miech, Ivan Laptev, and Josef Sivic. Learn-
“Predicting Visual Features From Text for Image and ing a Text-Video Embedding from Incomplete and
Video Caption Retrieval”. In: IEEE Transactions on Heterogeneous Data. 2020. arXiv: 1804 . 02516
Multimedia 20.12 (2018), 3377–3388. ISSN: 1941- [cs.CV].
0077. DOI: 10.1109/tmm.2018.2832602. URL: [22] Antoine Miech et al. End-to-End Learning of Vi-
http : / / dx . doi . org / 10 . 1109 / TMM . 2018 . sual Representations from Uncurated Instructional
2832602. Videos. 2020. arXiv: 1912.06430 [cs.CV].
[6] Jianfeng Dong et al. Dual Encoding for Zero- [23] Antoine Miech et al. “HowTo100M: Learning a
Example Video Retrieval. 2019. arXiv: 1809.06181 Text-Video Embedding by Watching Hundred Mil-
[cs.CV]. lion Narrated Video Clips”. In: ICCV. 2019.
[7] Christoph Feichtenhofer et al. SlowFast Networks [24] Niluthpol Chowdhury Mithun et al. “Learning joint
for Video Recognition. 2019. arXiv: 1812.03982 embedding with multimodal cues for cross-modal
[cs.CV]. video-text retrieval”. In: Proceedings of the 2018
[8] Valentin Gabeur et al. Multi-modal Transformer ACM on International Conference on Multimedia
for Video Retrieval. 2020. arXiv: 2007 . 10639 Retrieval. 2018, pp. 19–27.
[cs.CV]. [25] Mandela Patrick et al. Support-set bottlenecks for
[9] Deepti Ghadiyaram et al. Large-scale weakly- video-text representation learning. 2021. arXiv:
supervised pre-training for video action recognition. 2010.02824 [cs.CV].
2019. arXiv: 1905.00561 [cs.CV]. [26] Jesús Andrés Portillo-Quintero, José Carlos Ortiz-
[10] Raghav Goyal et al. The "something something" Bayliss, and Hugo Terashima-Marín. A Straightfor-
video database for learning and evaluating vi- ward Framework For Video Retrieval Using CLIP.
sual common sense. 2017. arXiv: 1706 . 04261 2021. arXiv: 2102.12443 [cs.CV].
[cs.CV]. [27] Alec Radford et al. “Learning Transferable Visual
[11] Kaiming He et al. Deep Residual Learning for Models From Natural Language Supervision”. In:
Image Recognition. 2015. arXiv: 1512 . 03385 Image 2 (), T2.
[cs.CV]. [28] Anna Rohrbach et al. A Dataset for Movie Descrip-
[12] Shawn Hershey et al. CNN Architectures for Large- tion. 2015. arXiv: 1501.02530 [cs.CV].
Scale Audio Classification. 2017. arXiv: 1609 . [29] Anna Rohrbach et al. Movie Description. 2016.
09430 [cs.SD]. arXiv: 1605.03705 [cs.CV].
[13] Andrej Karpathy, Armand Joulin, and Li Fei-Fei. [30] Andrew Rouditchenko et al. AVLnet: Learning
“Deep Fragment Embeddings for Bidirectional Im- Audio-Visual Language Representations from In-
age Sentence Mapping”. In: Proceedings of the 27th structional Videos. 2020. arXiv: 2006 . 09199
International Conference on Neural Information [cs.CV].
Processing Systems - Volume 2. NIPS’14. Montreal,
Canada: MIT Press, 2014, 1889–1897. [31] Lucas Smaira et al. A Short Note on the Kinetics-
700-2020 Human Action Dataset. 2020. arXiv:
[14] Andrej Karpathy et al. “Large-scale Video Classi- 2010.10864 [cs.CV].
fication with Convolutional Neural Networks”. In:
CVPR. 2014.
10
[32] Chen Sun et al. VideoBERT: A Joint Model for For SlowFast 32x2 R101 we resize a video to 256 on the
Video and Language Representation Learning. 2019. short side and then take a 256x256 center crop. For ipCSN
arXiv: 1904.01766 [cs.CV]. 152 and irCSN 152 we resize a video to 224 on the short
[33] Atousa Torabi, Niket Tandon, and Leonid Si- side and take a 224x224 center crop. For r(2+1)d 152 and
gal. Learning Language-Visual Embedding for r(2+1)d 34 we resize a video to 112 on the short side and
Movie Understanding with Natural-Language. 2016. then take a 112x112 center crop.
arXiv: 1609.08124 [cs.CV]. Pretrained models for ipCSN, irCSN and r(2+1)d are avail-
[34] Atousa Torabi et al. Using Descriptive Video Ser- able here1 , for SlowFast 32x2 R101 here2 , and for s3d
vices to Create a Large Data Source for Video here3 .
Annotation Research. 2015. arXiv: 1503 . 01070
[cs.CV]. For the CLIP model [27] we resize a video to 224 on the
short side and take a center crop, then we extract 1 frame
[35] Du Tran et al. A Closer Look at Spatiotemporal per second. We use a publicly available image encoder.
Convolutions for Action Recognition. 2018. arXiv: We do not use the text encoder from CLIP.
1711.11248 [cs.CV].
[36] Du Tran et al. Video Classification with Channel- Model s3dg MIL-NCE is a video encoder from the
Separated Convolutional Networks. 2019. arXiv: work [22]. This network was trained from scratch on
1904.02811 [cs.CV]. HowTo100M dataset. For this network we resize the input
video stream to the size of 228x228 pixels, then take a
[37] Michael Wray et al. Fine-Grained Action Re-
center crop.
trieval Through Multiple Parts-of-Speech Embed-
dings. 2019. arXiv: 1908.03477 [cs.CV].
[38] Saining Xie et al. Rethinking Spatiotemporal Fea- B Datasets combination
ture Learning: Speed-Accuracy Trade-offs in Video
Classification. 2018. arXiv: 1712.04851 [cs.CV]. In Fig. 3,4,5 we present 6 models. Abbreviations Mc ALV,
[39] Jun Xu et al. “MSR-VTT: A Large Video Descrip- Mc ALVYMTS and Mc ALVYMTS represent the same
tion Dataset for Bridging Video and Language”. In: three models on these figures. The first model, called
IEEE International Conference on Computer Vision Mc , is trained on the MSRVTT full clean split only, the
and Pattern Recognition (CVPR), 2016. second one, called A, is trained on ActivityNet only.
[40] Youngjae Yu, Jongseok Kim, and Gunhee Kim. A And the third model, called L, is trained on LSMDC
Joint Sequence Fusion Model for Video Question only. These three models are taken as baselines. Adding
Answering and Retrieval. 2018. arXiv: 1808.02559 more datasets should be not worse than these baseline.
[cs.CV]. The forth model is called Mc ALV. This model is trained
on the combination of MSRVTT, ActivityNet, LSMDC
[41] Youngjae Yu et al. End-to-end Concept Word Detec- and TwitterVines. As we can see Mc →Mc ALV gives
tion for Video Captioning, Retrieval, and Question +3.07% on MSRVTT (full clean split), A→Mc ALV gives
Answering. 2017. arXiv: 1610.02947 [cs.CV]. +1.06% on ActivityNet, and L→Mc ALV gives +1.77% on
[42] Bolei Zhou et al. “Places: A 10 million Image LSMDC. The next model is called Mc ALVYMT and it is
Database for Scene Recognition”. In: IEEE Transac- trained on combination of MSRVTT, ActivityNet, LSMDC,
tions on Pattern Analysis and Machine Intelligence TwitterVines, YouCook2, MSVD, TGIF. The transitions
(2017). Mc →Mc ALVYMT, A→Mc ALVYMT, L→Mc ALVYMT
[43] Luowei Zhou, Nathan Louis, and Jason J. Corso. give +4.85%, +1.45% and +2.63% correspondingly. The
Weakly-Supervised Video Object Grounding from last transitions Mc →Mc ALVYMTS, A→Mc ALVYMTS,
Text by Loss Weighting and Object Interaction. 2018. L→Mc ALVYMTS slightly improve the performance on
arXiv: 1805.02834 [cs.CV]. ActivityNet and LSMDC and significantly improve the
performance on MSRVTT. Finally, the combination of all
datasets gives +5.5% for MSRVTT, +1.47% for Activi-
A Pretrain experts usage tyNet and +2.74% for LSMDC.
The important data preparing stage is how to sample frames
from a video to achieve the best performance. For s3d ex- C Test and train intersection
periments the input video is converted to 30 frames per sec-
ond, for all other experiments we convert the input video In this section we present our analysis of overlapping of
to 32 frames per second. As a result we compute a single popular text to video datasets. Since we compose the train
embedding for each second, having 1 second window with dataset from several different datasets it is important to be
1 second shift (no overlapping). sure that there is no the same video segment in the train
part and in the test part. Our aim is to find the overlap
The input frame size is important. We use the different
sizes for the different models. For each model we use the 1. https://github.com/facebookresearch/VMZ
recommended input size. For s3d we resize a video to 2. https://github.com/facebookresearch/SlowFast/blob/master/MODEL_ZOO.md
256 on the short side and then take a 224x224 center crop. 3. https://github.com/princeton-vl/d3dhelper/blob/master/d3d_helper.ipynb
11
MSRVTT full clean R@5 ↑ LSMDC R@5↑

34.5 27.4
33.8
26.5
32.1
29.0 24.7
Mc Mc ALV Mc ALVYMT Mc ALVYMTS L Mc ALV Mc ALVYMT Mc ALVYMTS
Figure 3: Increasing R@5 metric on the MSRVTT full Figure 5: Increasing R@5 metric on the LSMDC test set
clean split while enriching the train part. while enriching the train part.
ActivityNet R@5 ↑ available) and remove from train parts all video segments
32.4 that have the corresponding pair in test parts. In Tab. 1
the information about the availability of YouTube IDs in
datasets is presented. We collect the YouTube ID for all
32.0 videos from MSRVTT full test and ActivityNet validation
1,2 and remove corresponding video segments from the
train part.
The second stage is based on matching frames by embed-
dings. For each video we compute several embeddings
then we compute the similarity between each video from
the train part and the test part. After we manually assess
30.9 several thousands of video segments with highest scores
for each pair of datasets. Then we extend found duplicates
by either the YouTube ID or the internal dataset ID. This
A Mc ALV Mc ALVYMT Mc ALVYMTS means that if a video V1 is marked as a duplicate and a
video V2 is not marked as a duplicate, but they have the
Figure 4: Increasing R@5 metric on the ActivityNet test same YouTube ID or same internal dataset ID, we will
set while enriching the train part. remove V1 and V2 from the train part. In case of LSMDC
we do not have the YouTube ID, but have the name of the
movie from which the video segment was taken, so if a
between the train part of used datasets — MSRVTT, Activ- video segment V1 is marked as a duplicate, we remove
ityNet, LSMDC, YouCook2, MSVD, TGIF, TwitterVines, all segments taken from the movie of V1 . The detailed
HowTo100M, Kinetics700 and the test parts of MSRVTT, description of the second stage is described in Sec. C.1.
ActivityNet and LSMDC, and then to remove found dupli-
Surprisingly we found that the MSRVTT test has a signifi-
cates from the train parts.
cant overlap with the MSRVTT train part. This problem is
Note that for training we use Something to Something V2 relevant for the full, 1k-A and 1k-B splits. The ActivityNet
dataset, but we do not try to find overlap between it and dataset suffers from the same problem.
test datasets because this dataset is artificially created, thus
For large datasets like HowTo100M and Kinetics700 we
the probability to find duplicates is very low.
can not find the whole intersection, but we estimate the
We decided to find the overlap only for MSRVTT, Activ- approximate number of videos in the intersection. We
ityNet and LSMDC because these are the most popular found that HowTo100M may have about 300 (10% of the
datasets and we do not have enough human resources to MSRVTT full test part) video segments that can be in the
find the overlap for the test part of all other datasets. MSRVTT full test part.
Our cleaning method consists of two stages. The first stage The similar situation is about Kinetics700 and ActivityNet
is to match video segments by the YouTube ID (if the ID is datasets. Kinetics700 may have approximately 500-600
12
video segments (10% of the ActivityNet test) that may have

duplicates in ActivityNet validation 1,2. Another problem
ab
with the Kinetics dataset is that many motion models are Sij = max Sij
a=1,...,si −K
pretrained on it. b=1,...,pj −K
This circumstance means that researchers should carefully
use HowTo100M and Kinetics700 along with MSRVTT
and ActivityNet correspondingly, because for today we and the corresponding video segments as
don’t know whether a neural network overfits for some
portion of this intersection or not.
All duplicates can be considered as two groups of pairs. (a, a + K), (b, b + K)
Pairs from the first group have the same videos, but dif-
ferent brightness, aspect ratio, size, presence/absence of
where
a logo and so on. The second group has pairs with quite
similar videos, for example it can be the same person on
the same background, doing the same things, but wearing
ab
different clothes. We think that it is better to remove such a, b = argmax Sij
videos from the train part to prevent overfitting. Several a=1,...,si −K
b=1,...,pj −K
found examples are presented in Fig. 6.
C.1 Near duplicate video search Finally we sorted all Sij in the descending order and man-
ually assess candidate pairs.
C.1.1 Approach
In this section we explain our approach that is used to find C.1.2 Number of pairs to assess
the same or quite similar video segments in test and train
parts. Suppose we search duplicates in datasets Q and G and we
Suppose we have two sets of videos Q = {q1 , ...., qk } and have seen N pairs with the highest scores and find M pairs
G = {g1 , ..., gn } called the query set and the gallery set. with duplicates. The important question is: what is the
We want to find all pairs (qi , gj ) where qi and gj have a total number of duplicates and how many percents of them
common video segment. have we found.
From each qi and gj we extract 1 frame per second. Each For each pair of Q and G we construct the following test
video is then represented by a sequence of pictures: qi = procedure. The first step is to augment Q, and let us call
p
[qi1 , ...., qisi ] and gj = [gj1 , ..., gj j ]. Then a 2D pretrained the result of augmentation as Q̂. To augment a dataset we
neural network is used to extract features from each image: apply two transformations: 1. we randomly crop a side of
q̄ab = neuralnet(qab ) and ḡab = neuralnet(gab ). each video, where each side can be 70%–100% of original
side length (aspect ratio can be changed); 2. we randomly
Then we compute the matrix of cosines between the fea- shift the start of the video by a random value between 0
<q̄ia ,ḡjb >
tures from Q and G: sab ij = ||q̄ b ||2 ||ḡ b ||2 . and 1 seconds.
a a
Now each pair (qi , gj ) is represented by the matrix: Having Q, Q̂ and G we compute sets of positive and neg-
ative scores: Pos and Neg. The Pos is the set of scores
p between i-th video from Q and the corresponding aug-
gj1 ... gj j mented video from Q̂. Neg is the set of scores between
1p
qi1 s11
ij ... sij j each video from Q and G. Having Pos and Neg sets we can
... plot a curve, where x axis represents the fraction of found
s pj
qisi ssiji 1 ... siji pairs with duplicates and Y axis represents the number of
negative pairs that we need to assess to find fraction x of
Suppose that videos qi and gj are intersected at time mo- positive pairs, call this curve F (x). We present the algo-
ments tq and tg , it is naturally to assume that the next sev- rithm that computes F using Pos and Neg sets in Lst. 1.
eral seconds tq +1, ..., tq +K −1 and tg +1, ..., tg +K −1 Suppose we have seen N + M pairs and have found M
(K ≤ min(si , pj )) represent the same video segment. pairs with duplicates. The total number of pairs with dupli-
Motivated by this fact we compute the mean cosine for cates can be estimated as M/F −1 (N ). By the definition
t t F (x) connects the fraction of found positive pairs with the
each interval of K seconds (we use K=4): Sijq j =
t tg t +K−1,tg +K−1 number of seen negative pairs. The value F −1 (N ) repre-
sijq +...+sijq
K . The sum in the numerator is sents approximation of the fraction of found positive pairs.
t t
the sum of diagonal elements started with sijq j . So if we know, that M is approximately 100∗F −1 (N )% of
positive pairs, then we can approximately compute 100%
We define the intersection score between (qi , gj ) as of positive pairs as M/F −1 (N ).
13
Figure 6: The left image is taken from the MSRVTT test split and the right one from MSRVTT Train. The numbers in
the upper left corner represent the MSRVTT video ID. The faces are blurred in order to avoid legal claims.
# first element is highest C.1.3 Best 2D feature extractor

P = np . sort ( P )[:: -1] # Pos
N = np . sort ( N )[:: -1] # Neg The key component of a duplicate search system is a fea-
xs = [] ture extractor. A good feature extractor significantly re-
ys = [] duces the number of pairs for manual assessment. To
for x , p in enumerate ( P ): compare different 2D feature extractors we use the fol-
# how many negative scores lowing test procedure. The test consists of two datasets.
# greater than p ? The first dataset is the train part from the MSRVTT full
j = np . searchsorted (N , p ) split. The second dataset is random 596k videos from the
xs . append ( x ) HowTo100M dataset. From each video of the taken part of
ys . append ( j ) HowTo100M we take a random 30 seconds segment. We
Listing 1: Numpy pseudocode for building the search curve apply random augmentation to MSRVTT, as described
F (x) in Sec. C.1.2. Define MSRVTT as Q, the augmented
14
MSRVTT dataset as Q̂ and the taken part of HowTo100M curve should be estimated for each used pair of datasets Q
as G. For each feature extractor we compute curve F (x), and G.
as described in Sec. C.1.2.
The best expert has the lowest curve. For example, if we
want to find 95% of duplicates, we should see many of C.1.4 Black frames
candidates, some of them are duplicates, but majority of
them are not. So, the value F (0.95) is the approximation Often two consecutive video segments are glued with sev-
of how many not duplicates we need to see to find 95% of eral black frames. The cosine similarity of embeddings
duplicates. Ideally F (0.95) = 0, where all seen candidates of two black or near black frames are close to 1. In this
are duplicates. So, a lower value F (0.95) requires to see case the most probable candidates for duplicates are black
less number of false candidates, that’s why the lower curve video segments. To prevent this we apply the following
is better. rule. Suppose we have a frame U and the unit length em-
bedding v computed from U . We find the prevalent color
We consider several feature extractors: resnet18 and in U and compute the area S0 filled by this color. Then
resnet101 [11] pretrained on ImageNet [3], resnet50 we compute the value S0 /(hw), where h and w are the
pretrained on Places365 [42] and resnext101-32x8d, height and width of U . If this fraction is greater than 0.7
resnext101-32x32d, resnext101-32x48d pretrained on one we define µv = 1 − S0 /(hw), otherwise µv = 1. To
billion images from Instagram [20] and finetuned on Ima- calculate similarity between embeddings v1 and v2 we use
geNet. We report search curves F (x) for these pretrained weighted cosine similarity: µ1 µ2 cosv1 , v2 , instead of clas-
networks in Fig. 7. sical cosine similarity. This rule removes majority of all
There exist networks [27] [16] trained especially for match near black frames from the most relevant candidates for
the duplicate frames or video segments, but they are not duplicates.
publicly available.
C.1.5 Screensavers detection
resnet18-imagenet
Many videos from ActivityNet, HowTo100m, YouCook2
resnet50-places365 contain screensavers at the beginning or at the end. It
6
10 resnet101-imagenet causes a problem like mentioned above with near black
resnext101-32x8d-wsl frames, because most of relevant proposals are the same
# of negative pairs
resnext101-32x16d-wsl screensavers, but the video content of the remainder video

resnext101-32x48d-wsl part are different.
105 Using the system described in Sec. C.1.6 we search for du-
plicates in the ActivityNet dataset, where a lot of the most
relevant segments are screensavers. We collect several hun-
dreds of screensavers and then compute embeddings for
104 each of them. Let us call the resulting set of embeddings as
E. Then we apply the following rule: if some embedding
v has the similarity greater that 0.9 to one of embeddings
ratio of positive pairs from E, we set v = 0. So if the video segment has a
part of a screensaver, it will never be in the most relevant
0 0.2 0.4 0.6 0.8 1 proposals.
Figure 7: Search curves F for different pretrained models.

Curve F is used to estimate the minimal number of negative C.1.6 GUI
pairs (y = F (x)) that human assessors need to inspect
before they find the fraction x of positive pairs. The lower The important part of the video duplicate search system is
the curve F the better (need to inspect manually less pairs). the user interface. Without ergonomic and fast interface it
The curves are built with the query set Q = MSRVTT is impossible to assess tens thousands of video pairs. Our
full train, the gallery set G = random 596k videos from system is presented in Fig. 8.
HowTo100M. The system shows video pairs with the highest scores on
top. A user needs to scroll down a web page (new videos
are loaded dynamically with ajax), and if a video duplicate
As we see resnext101-32x48d-wsl shows the best result.
is detected, a user should press the Duplicate button, if
We use this network for searching for duplicates.
there are no duplicates in the current viewport, no action is
It is worth to mention that here we just compare different required. When a user scrolls a web page, all non-duplicate
networks on a fixed benchmark, and pick the best one. But pairs automatically are saved to a log file. Additionally
the search curve F (x) significantly depends on data. This several users at the same time can assess video pairs.
15
Separate results for the first and the second stages are
reported in Sec. C.2.1.
Note that columns "test" and "train" in Tab. 9 may have
different values. Consider the situation when the test part
have a video segment A, and the train part have two video
segments A1 and A2. And both are marked as duplicates
with A. In this case the video segment A brings +1 to the
"test" column and A1, A2 bring +2 to the "train" column.
The most problematic datasets in terms of the number of
duplicates are MSRVTT and ActivityNet. These datasets
overlap with itself (e.g. MSRVTT test overlap with
Figure 8: Web system used to find duplicates. Images on MSRVTT train). We found more than 100 duplicate
the first and third row are not duplicates and the second pairs for both of them. Other problematic datasets are
row contains duplicate. HowTo100M and Kinetics700, these datasets are large,
so we can’t assess the required number of video pairs
to find 95% or 99% of duplicates. But we can assess a
M A L smaller number of pairs and using search curves F (see
dataset Sec. C.1.2) can extrapolate this value to 100%. We found
test train test train test train
that HowTo100M may have the intersection with MSRVTT
M 114 223 6 10 0 0 test full by about 300 videos (10% of the MSRVTT test
A 10 6 127 163 0 0 full). The similar situation is about the ActivityNet test set
L 6 2744 0 0 0 0 and Kinetics700, the intersection could be near 500-600
YouCook2 13 27 7 10 0 0 videos (10% of the ActivityNet test set).
MSVD 1 1 1 1 1 1
TGIF 6 8 0 0 0 0 In Tab. 10 we report results on MSRVTT for MMT retrain-
Twitter Vines 3 3 0 0 0 0 ing with no cleaning, after cleaning by the YouTube ID and
Kinetics700 4 5 456 464 0 0 cleaning combination by the YouTube ID and the manual
HowTo100M 177 154 209 209 0 0 assessment. The manual cleaning for 1k-A and 1k-B is
incomplete because we only do cleaning for the full split.
Table 9: The leftmost column represents train parts of The following situation takes place for 1k-A, 1k-B splits:
datasets, and the upper row represents test parts of datasets. when 1k videos from the full test are taken for test and
Column "test" means how many video segments are in the remaining 2k videos are moved to the train part, the
the test part that have the corresponding pair in the train additional overlapping is introduced, because these 1k and
part either with the same YouTube ID or manually marked 2k videos are overlapping. We do not remove this overlap
as a duplicate. Column "train" represents the number of in this research.
video segments in the train part that have corresponding
pair in the test dataset either with the same YouTube ID or
no by ID +
manually marked as a duplicate. All segments counted in split by ID
clean manual
the "train" column are removed from the train part. For ex-
ample consider the column "A" and the row "M". train=10 full 31.1±0.1 31.1±0.1 30.2±0.4
means that the MSRVTT train part contains 10 video seg- 1k-A 54.8±0.5 50.7±0.9 49.4±0.5
ments that have a pair in the ActivityNet test part. These 1k-B 51.1±0.9 46.1±0.1 46.4±0.6
10 videos must be removed from train part when dataset Table 10: Comparison for original MMT trained (7 modal-
are combined. test=6 means that ActivityNet test has 6 ities) on MSRVTT without cleaning, with cleaning by the
video segments that have a pair in the MSRVTT test part. YouTube ID only, and with cleaning by the YouTube ID
plus the manual assessment.
C.2 Cleaning results

As you can see after cleaning the performance is signif-
Recall that our cleaning method consists of two stages. In icantly decreased on 1k-A and 1k-B splits for original
the first stage we throw out from the train part all video MMT.
segments that have a pair with the same YouTube ID in
test parts of MSRVTT or ActivityNet. The second stage C.2.1 Intersection by YouTube ID and embeddings
is matching video segments by embeddings and manually
In Tab. 11 we report the intersection by the YouTube ID
assess several thousands pairs with the highest score.
between test parts of MSRVTT (full, 1k-A, 1k-B) and
In Tab. 9 we report how many duplicates are found for ActivityNet with train parts of MSRVTT (full, 1k-A, 1k-
each pairs of datasets. This table represents the final result B), ActivityNet, Kinetics700, YouCook2, HowTo100m,
after applying these two stages. MSVD.
16
data M M1k-a M1k-b A In this table we present the number of manually found
set test train test train test train test train duplicates and the estimated maximum number of dupli-
cates for a given pair of datasets. We managed to find the
M 0 0 0 0 104 179 2 4 intersection for almost all pairs of datasets.
M1k-a 2362 1990 372 415 827 1007 2 4
M1k-b 1689 1367 563 634 380 407 2 4 The maximum number of duplicates is computed based
A 0 0 0 0 0 0 0 0 on the search curve F (x). As we told in Sec. C.1.2 the
K 5 4 1 1 0 0 408 408 search curve significantly depends on data. We compute
Y 8 4 2 2 2 2 3 3 the search curve for all pairs of datasets in Tab. 12. The
HT100M 147 117 39 38 57 53 175 175 search curve for each particular pair of datasets is build
MSVD 3 1 2 1 0 0 1 1 exactly in the same way as described in Sec. C.1.2. For
Table 11: First stage. The leftmost column represents train example, to compute the search curve for MSRVTT test
parts of datasets, and the upper row represents test parts and ActivityNet train we define MSRVTT test as Q, Activ-
of datasets. Column "test" represents number of video ityNet train as G, then augment Q to produce Q̂, and use
segments in test part that have corresponding video in train the algorithm described in Sec. C.1.2.
part with the same YouTube ID. Column "train" represents Using the column "seen" from Tab. 12 we can compute
number of video in train part that have corresponding pair how many pairs need to be assessed to find the full over-
in test part with the same ID. For example: if we com- lap between datasets. For example, inspect 5k pairs for
bine M and YouCook2, we should remove 4 video from HowTo100M dataset and MSRVTT (the row "HT100M"
YouCook2 train. and the column "M"), we found 15 duplicates, so the ap-
proximate maximum number of duplicates is 320: 5k *
(320 / 15) = 106k. So, to find the full overlap using the
current version of algorithm it is needed to manually assess
It is worth to mention that MSRVTT 1k-A test and 1k-B 106k video pairs and it is too much, that’s why we do not
test have a large overlap ratio by the YouTube ID with the find full intersection for this specific pair of datasets.
1k-A train and the 1k-B train parts correspondingly. Both
splits have the overlap ratio of about 38% between the train
part and the test part. We also emphasize that the original D Hyperparameters
MSRVTT full split does not overlap by the YouTube ID
between the test and train parts. To train our best networks (MMT(MALVYMTS)
L9H8 CLIP+audio, MDMMT(MALVYMTS) L9H8
irCSN152+audio and MMT(MALVYMTS) L9H8
data M A L
CLIP+irCSN152+audio) we use 50 epochs and define a
set seen found total seen found total seen found total
single epoch as 150K examples per GPU (in total 1.2M
M 10k 114 114 1k 6 6 1k 0 0 examples per epoch on 8 GPUs). We use Adam optimizer
A 10k 10 10 15k 127 142 1k 0 0 without weight decay, the initial value for a learning rate is
L 3k 6 6 2k 0 0 — — — 5e-5, after each epoch we multiply the learning rate by
Y 2k 13 13 1k 7 7 1k 0 0 0.95. Batch size of 32 examples per GPU is used. We
MSVD 1k 1 1 1k 1 1 1k 1 1 do not exchange embeddings between GPUs. We use
T 2k 6 6 2k 0 0 3k 0 0 bi-directional max-margin ranking loss with margin 0.05.
V 2k 3 3 0k 0 0 1k 0 0 In Bert and the video transformer encoder we use dropout
K 2k 1 2 30k 227 539 2k 0 0 0.2 in attention and in FFN block. We use 8 Nvidia V100
HT100M 5k 15 320 — — — — — — 32GB GPUs. The training time is about 14 hours.
Table 12: Second stage. The leftmost column represents
train parts of datasets, and the upper row represents test
parts of datasets. Column "seen" represents the number of
E Pretrained model
video segments that we manually assess for a given pair of
datasets. Column "found" represents the number of videos The well known method to boost the performance in video
in the test part for which there exists the corresponding retrieval tasks is to use a pretrained model. First the neural
duplicate video segment in the train part. Column "total" network is trained on some large dataset, then at second
represents the approximately estimated total number of stage it is finetuned for target target dataset. In video
videos from the test part that have a duplicate pair in the retrieval task HowTo100M dataset is often used for pre-
train part. Symbol "—" means that the intersection is not training. In this work we use HowTo100M for pretraining
computed because it requires too much human resources. in the same way.
In our training procedure we use 8 Nvidia V100 32Gb
GPUs, we train for 200 epochs where one epoch is defined
In Tab. 12 we report the statistics for the second deduplica- as 80k examples on each GPU (in total network sees 640k
tion stage (searching by embeddings). We do not compute examples on 8 GPUs per epoch). We use batch size 64
an intersection for MSRVTT 1k-A and 1k-B splits. for each GPU and do not exchange embeddings between
17
pretr
MSRVTT full clean text → video
model
R@1↑ R@5↑ R@10↑ MnR↓ MdR↓
Ours MDMMT(Mc ALVYMTS) L9H8 irCSN152+audio yes 15.8±0.1 38.9±0.1 51.0±0.1 76.4±0.5 10.0±0.0
Ours MDMMT(Mc ALVYMTS) L9H8 irCSN152+audio no 14.5±0.1 36.8±0.3 48.8±0.3 82.2±0.6 11.0±0.0
Ours MDMMT(Mc ALVYMTS) L9H8 CLIP+audio yes 21.5±0.1 47.4±0.2 59.6±0.1 57.7±0.4 6.0±0.0
Ours MDMMT(Mc ALVYMTS) L9H8 CLIP+audio no 20.0±0.1 45.1±0.1 57.3±0.1 63.1±0.1 7.0±0.0
Table 13: Performance on the MSRVTT full clean split with and without pretrained model (HowTo100m).
pretr
ActivityNet text → video
model
R@1↑ R@5↑ R@10↑ MnR↓ MdR↓
Table 14: Performance on ActivityNet with and without pretrained model (HowTo100m). The performance reported for
the text to video retrieval task on our own subset of the original ActivityNet test part. See Sec. 2.1 for details.
pretr
LSMDC text → video
model
R@1↑ R@5↑ R@10↑ MnR↓ MdR↓
Table 15: Performance on LSMDC with and without pretrained model (HowTo100m).
GPU. Initial learning rate is 5e-5. After each epoch we

multiply learning rate by 0.98. We use the full HowTo00M
dataset. The model is trained either with two modalities:
motion/RGB and audio or with three modalities: motion,
RGB and audio, depending on how many modalities are
used in final model. The total training time is about 24
hours. We use bi-directional max-margin ranking loss with
margin 0.05.
In Tab. 13, 14 and 15 we compare two our models:
MDMMT(Mc ALVYMTS) L9H8 irCSN152+audio and
MDMMT(Mc ALVYMTS) L9H8 CLIP+audio when they
are trained from the pretrained model or not. In these three
tables we present the same four models (no special finetun-
ing for the target dataset) tested on different datasets.
As we can see in Tab. 13 the pretrained model increases R1
metric by 1% and R5 by 2%. The pretrained model also
increase performance on ActivityNet dataset, see Tab. 14.
For R1 metric the improvement is about 2% and for R5
metric is about 4%. For LSMDC dataset, see Tab 15,
we have approximately the same results with and without
pretraining.
18

MDMMT: Multidomain Multimodal Transformer For Video Retrieval

Uploaded by

Copyright:

Available Formats

MDMMT: Multidomain Multimodal Transformer For Video Retrieval

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

MDMMT: Multidomain Multimodal Transformer For Video Retrieval

Uploaded by

Copyright:

Available Formats

MDMMT: Multidomain Multimodal Transformer for Video

Maksim Dzabraev1,2 , Maksim Kalashnikov1 , Stepan Komkov1,2 , Aleksandr Petiushko1,2

Keywords video, language, retrieval, multi-modal,

total video duration

ActivityNet text → video

Table 4: Test results on our split (see Sec. 2.1) on ActivityNet.

Dataset Weight Test Text → Video R@5 ↑

outperforms the original MMT (the motion, the RGB and

pretrained in an unsupervised way. To test the CLIP model

5 Conclusions and Discussion

MSRVTT text → video

CE [19] 18.2±0.7 46.0±0.4 60.7±0.2 35.3±1.1 7.0±0.0

Table 7: Results on MSRVTT dataset.

LSMDC text → video

Table 8: Test results on LSMDC public test (1k video)

MSRVTT full clean R@5 ↑ LSMDC R@5↑

Mc Mc ALV Mc ALVYMT Mc ALVYMTS L Mc ALV Mc ALVYMT Mc ALVYMTS

video segments (10% of the ActivityNet test) that may have

# first element is highest C.1.3 Best 2D feature extractor

resnext101-32x16d-wsl screensavers, but the video content of the remainder video

Figure 7: Search curves F for different pretrained models.

C.2 Cleaning results

GPU. Initial learning rate is 5e-5. After each epoch we

You might also like