Abstract
This paper focuses on self-supervised video representation learning. Most existing approaches follow the contrastive learning pipeline to construct positive and negative pairs by sampling different clips. However, this formulation tends to be biased toward the static background and has difficulty establishing global temporal structures. The major reason is that the positive pairs, i.e., different clips sampled from the same video, have limited temporal receptive fields, and usually share similar backgrounds but differ in motions. To address these problems, we propose a framework that jointly utilizes local clips and global videos to learn from detailed region-level correspondence as well as general long-term temporal relations. Based on a set of designed controllable augmentations, we implement accurate appearance and motion pattern alignment through soft spatio-temporal region contrast. Our formulation avoids the low-level redundancy shortcut with an adversarial mutual information minimization objective to improve the generalization ability. Moreover, we introduce local-global temporal order dependency to further bridge the gap between clip-level and video-level representations for robust temporal modeling. Extensive experiments demonstrate that our framework is superior on three video benchmarks in action recognition and video retrieval, and captures more accurate temporal dynamics.
1 Introduction
Video representation learning is fundamental to various downstream video-related applications, e.g., action recognition [1, 2], spatio-temporal detection [3, 4], video retrieval [5, 6], etc. Traditional supervised learning schemes require costly human labeling, and the performance is usually restricted by the granularity of the annotations. More precisely, coarse-grained video-level annotations could lead the model to attend to the background [1, 7], while fine-grained annotations greatly facilitate general video analysis but are much more expensive [8, 9]. Unsupervised video representation learning has begun to attract more attention to solve this problem. Some early works designed diverse pretext tasks to learn the video characteristics in a self-supervised manner [10–15]. Recently, the formulation of contrastive learning further improved the performance by a large margin [16–19].
A prevalent method for contrastive video representation learning is to sample several clips and regard those from the same video as positive pairs [17, 20–22]. However, this formulation has two drawbacks. On the one hand, these methods tend to be biased toward the static background [23, 24]. This is because the sampled clips mostly share the same background, but subtle differences in motions probably exist. For example, in Fig. 1, the video contains a high jump scene. The clip sampled at an early timestamp shows the running action, but the clip sampled at a later timestamp presents the jumping action. Thus, pulling these two clips closer in the feature space will lead the model to neglect their distinct motions and only attend to the background of the stadium. On the other hand, there remains an obvious gap between clip-level features and video-level representation. The sampled clips have a limited temporal receptive field, and thus cannot provide comprehensive information. For example, Clip 1 in Fig. 1 only shows the momentary process of running. When we jointly leverage the two correctly ordered clips, i.e., the running action occurring before the jumping, we can understand the original video. Motivated by these observations, we intend to address these problems from two aspects: one is detailed region-level correspondence, and the other is general long-term temporal perception.
In this paper, we propose a framework to learn comprehensive appearance and motion patterns in videos. Concretely, we develop a set of controllable augmentations to achieve this goal. First, we use constrained spatio-temporal cropping to sample several local clips from each video such that the clips cover diverse timestamps of the video. Then we generate dense spatio-temporal position-wise correspondences between the local clip and global video feature maps based on the cropping parameters. In Fig. 1, we present a toy example of the temporal correspondence; the spatial correspondence is established analogously. We employ these soft correspondence codes to align features in corresponding regions. In this way, we can match the exact same appearance and motion content, while avoiding the alignment of inconsistent motions across different timestamps. However, there also exist “shortcuts” for identifying the overlapping regions between local clips and global videos, e.g., the low-level color statistics. These shortcuts could prevent the model from learning useful semantics. To avoid them, we define different intensity levels of color jitter and Gaussian blur augmentations, and regard the samples generated by the same level of augmentation as sharing similar low-level attributes. We then minimize the mutual information between them to mitigate the impact of low-level shortcuts on the extracted representation.
To further bridge the gap between clip-level and video-level representations, we intuitively introduce a learning objective to model temporal order dependency between local clips and global video. In particular, we have access to the temporal order of the sampled clips in accordance with the cropping parameters. With that, we aim to maximize the mutual information between correctly ordered clip features and the global video features. Through this operation, we facilitate the temporal awareness of the model in the pretraining stage.
In summary, our contributions are as follows:
1) We propose a unified framework to learn video representations from detailed local contrast and general long-term temporal modeling.
2) We develop controllable augmentations to match the visual contents in corresponding spatio-temporal positions for detailed content alignment, and perform mutual information minimization to avoid low-level shortcuts.
3) We introduce the temporal order dependency between the local clips and global video to enhance general temporal structure modeling.
4) We achieve superior results on downstream action recognition and video retrieval tasks, while capturing more accurate motion patterns.
2 Related work
2.1 Contrastive learning
Recently, contrastive learning [25–27] has revolutionized self-supervised learning. Its core idea is to discriminate different instances by attracting the positive pairs and repelling the negative pairs in the feature space [28, 29]. Following this, Wu et al. [30] formulated instance discrimination as a non-parametric classification problem. Van den Oord et al. [27] introduced mutual information estimation with InfoNCE loss [29], which led to easy optimization and fast convergence. Inspired by this, a line of works [25, 26, 31, 32] adopted this learning objective for image representation learning and showed significant improvement on downstream tasks. Later, Xie et al. [33] and Wang et al. [34] developed dense contrastive learning, which performed pixel-level contrast. Compared to instance-level discrimination, dense contrastive learning preserves richer characteristics, and performs better on dense prediction tasks and visual correspondence learning. In our work, we focus on video representation learning. Considering that natural spatio-temporal correspondences exist in video domains, we propose to utilize them as self-supervisory signals for spatio-temporal region contrast to learn more comprehensive video representations.
2.2 Video representation learning
Unlike images, videos contain internal temporal structures that are crucial for video content analysis. To this end, many works [11, 14, 35] designed various pretext tasks to leverage the natural spatio-temporal correspondence as self-supervisory signals. Some typical pretext tasks include temporal ordering [11, 14, 19], spatio-temporal puzzles [12, 15], colorization [36], playback speed prediction [10, 13], temporal cycle-consistency [37–39] and future prediction [40, 41]. There were also some works [42, 43] using cross-modal correspondence for self-supervised pretraining. Inspired by the success of contrastive learning in the image domain, a series of works [16–18, 44] extended this pipeline to the video domain. Particularly, Han et al. [45, 46] employed information noise contrastive estimation (InfoNCE) loss for dense future prediction, while Wang et al. [18] and Yang et al. [47] sampled clips of different rates as positive pairs for visual content learning. However, video contrastive learning could lead the model to place more emphasis on the static scene and focus less on motion [23]. To solve this problem, Chen et al. [48] and Jenni et al. [13] integrated contrastive learning with temporal pretext tasks to enhance temporal awareness. Han et al. [20] and Li et al. [49] used optical flow to assist motion modeling. Qian et al. [50] and Ding et al. [51] used static frames, frame differences and consecutive frames to balance appearance and motion perception. Liu et al. [52] and Ding et al. [53] carefully designed motion-focused augmentations to place more emphasis on dynamic motions. In our work, we do not resort to frame differences or optical flow to enhance motion learning and temporal modeling. Instead, we hypothesize that the underlying reason for the static scene bias lies in the positive pair formulation. Most existing works use either different frames [16, 19] or different clips [17, 21] from the same video as the positive pair, which usually have similar backgrounds but possess different motions. Hence, we propose to consider the corresponding regions within local and global views to form accurate positive pairs, concurrently with low-level shortcut elimination, which captures the desired static and dynamic characteristics. In addition, we develop a temporal dependency between these views to bridge the gap between local clip and global video representations, while learning robust temporal structures.
2.3 Local-global views for video representation
There have also been some works using local and global views for self-supervised video representation learning [21, 54–57]. The major difference between our work and these works lies in the definition of local-global views and the learning target. In our work, “local-global” means short and long video clips, and the major target is to construct spatio-temporal overlaps and formulate a soft learning objective, which guides detailed region-level video content alignment. In Ref. [54], local-global meant local fine-grained and global coarse-grained features, which were designed for general audio-visual correspondence. Recasens et al. [55] aimed to extrapolate the neighboring video content in the global view based on the observation from the local view. Dave et al. [56] designed a loss function to learn temporal correspondence between local and global clips, but still with hard positive assignment. Behrmann et al. [57] employed local-global views to decompose stationary and non-stationary features, and Kuang et al. [21] used them for segment-based positive sampling. Qing et al. [58] built hierarchical structures on videos and employed multi-level temporal consistency to guide local and global video representation learning.
3 Method
The core idea of our proposed framework is to enhance self-supervised video representation learning by comprehensive appearance and motion content modeling. As displayed in Fig. 2, we utilize a set of controllable augmentations to achieve detailed spatio-temporal region contrast, low-level shortcut elimination and general temporal dependency modeling.
Specifically, we divide the augmentations into two parts: spatio-temporal position transformations \(\tau _{\text{p}}\) that include crop and horizontal flip, and low-level statistic transformations \(\tau _{\text{l}}\) that include color jitter and Gaussian blur. Following the data preprocessing pipeline, given a video v, we first use \(\tau _{\text{p}}\) to sample several local clips and then perform \(\tau _{\text{l}}\) to generate the input to the encoder.
3.1 Spatio-temporal region contrast
Given a video v with temporal length T, we first use a set of spatio-temporal position transformations \(\tau _{\text{p}}^{k}\in \{\tau _{\text{p}}^{1},\tau _{\text{p}}^{2}, \ldots ,\tau _{\text{p}}^{K}\}\) to sample K clips \(v_{k}\in \{v_{1},v_{2},\ldots ,v_{K}\}\), to provide the local feature descriptions. To let the sampled clips contain as much information as the original video, we manually constrain the temporal cropping parameters in \(\tau _{\text{p}}^{k}\) to control the central timestamp of \(v_{k}\) in the range of \([{(k-1)T}/{K},{kT}/{K} ]\). In this way, sampled clips cover different temporal segments and they jointly present the rich information in v. As mentioned in Sect. 1, there could be inconsistencies in motions between different local clips such that it is not optimal to align the representations between different clips. Hence, we need to determine the exact corresponding content for feature alignment. To this end, considering that there is a natural correspondence between local clips and global video, we leverage v and \(v_{k}\) as two views for feature matching.
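As a minimal illustration of this constrained temporal cropping, the following sketch (our own, with a hypothetical helper name) draws the center of the kth clip uniformly from \([{(k-1)T}/{K},{kT}/{K} ]\) so that the K clips jointly cover the video:

```python
import numpy as np

def sample_clip_starts(T, K, clip_len=16, rng=None):
    """Sample K local clips whose central timestamps fall in
    [(k-1)T/K, kT/K], so the clips jointly cover the whole video."""
    rng = rng or np.random.default_rng()
    starts = []
    for k in range(1, K + 1):
        center = rng.uniform((k - 1) * T / K, k * T / K)
        start = int(np.clip(center - clip_len / 2, 0, T - clip_len))
        starts.append((start, start + clip_len))
    return starts

# A 160-frame video sampled into K = 4 temporally constrained 16-frame clips.
print(sample_clip_starts(T=160, K=4))
```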
For local clip feature extraction, we denote the feature extractor as \(f(\cdot )\) and the local clip feature map as \(f(v_{k})\in \mathbb{R}^{CT_{\text{c}}HW}\), where C, H and W denote the channel, height and width dimensions, and \(T_{\text{c}}\) denotes the temporal dimension of the clip feature map. For global video feature extraction, we perform sparse sampling to represent v, and set the temporal stride of some convolution layers to 1 to make \(f'(v)\in \mathbb{R}^{CT_{\text{v}}HW}\) possess a higher temporal resolution, i.e., temporal dimension \(T_{\text{v}}>T_{\text{c}}\). Note that f and \(f'\) share the same architecture and only differ in the temporal stride. Details of the network settings are described in Sect. 4.2.
Based on \(f(v_{k})\) and \(f'(v)\), we refer to the augmentation parameters in \(\tau _{\text{p}}^{k}\) to calculate the dense spatio-temporal position correspondence. Specifically, we use \(S_{k}\in \mathbb{R}^{N_{\text{c}}\times N_{\text{v}}}\) to indicate the correspondence result, where \(N_{\text{c}}=T_{\text{c}}HW\) and \(N_{\text{v}}=T_{\text{v}}HW\). \(S_{k}(i,j)\) reveals the correspondence score between the ith spatio-temporal grid in \(f(v_{k})\) and the jth grid in \(f'(v)\). Essentially, each grid on the feature map is equivalent to a tube covering a certain spatio-temporal area as illustrated in Fig. 2, and \(S_{k}(i,j)\) is measured by the ratio of the intersection of the two tubes over the volume of tube \(f(v_{k})[i]\):
\[ S_{k}(i,j)=\frac{\mathit{inter} (f(v_{k})[i],f'(v)[j] )}{\mathit{vol} (f(v_{k})[i] )}, \]
where \([\cdot ]\) denotes the grid index, \(\mathit{vol}(\cdot )\) measures the spatio-temporal volume of the given feature tube, and \(\mathit{inter}(\cdot ,\cdot )\) measures the intersecting volume between two tubes. The detailed computation process is illustrated in Sect. 4.2. In this formulation, each row of \(S_{k}\) sums to 1, i.e., \(\sum_{j=1}^{N_{\text{v}}}S_{k}(i,j)=1\) for every i. This indicates that each row in \(S_{k}\) can be treated as a probability distribution that describes the correspondence between \(f(v_{k})[i]\) and each grid in \(f'(v)\).
Therefore, we utilize the calculated correspondence matrix \(S_{k}\) as the reference distribution to guide spatio-temporal region feature contrast for accurate visual content alignment. Specifically, we take \(f(v_{k})[i]\) as a query for illustration. Recall that the InfoNCE loss can be written as the cross-entropy between a prior distribution, i.e., the indicator function, and the feature similarity distribution:
\[ \mathcal{L}_{\text{nce}}=-\sum_{j}\mathbb{I}_{ij}\log \frac{\mathit{sim}(q_{i},k_{j})}{\sum_{m}\mathit{sim}(q_{i},k_{m})}, \]
where q and k respectively denote the query and key features in contrastive learning, \(\mathbb{I}_{ij}=1\) if \(i=j\) and \(\mathbb{I}_{ij}=0\) otherwise, and \(\mathit{sim}(\cdot ,\cdot )=\exp (\cos (\cdot ,\cdot )/\tau )\) measures the feature similarity. In our formulation, we replace the prior \(\mathbb{I}_{ij}\) with the soft distribution \(S_{k}(i,j)\) for accurate region contrast. Since the correspondence between \(v_{k}\) and clips from other videos naturally equals 0, we can intuitively enlarge the negative pool by introducing features from other videos. Thus, the spatio-temporal region contrast loss over all \(f(v_{k})[i]\) can be formulated as
\[ \mathcal{L}_{\text{rc}}=-\frac{1}{N_{\text{c}}}\sum_{i=1}^{N_{\text{c}}}\sum_{j=1}^{N_{\text{v}}}S_{k}(i,j)\log \frac{\mathit{sim} (f(v_{k})[i],f'(v)[j] )}{\sum_{m=1}^{N_{\text{v}}}\mathit{sim} (f(v_{k})[i],f'(v)[m] )+\sum_{\boldsymbol{n}}\mathit{sim} (f(v_{k})[i],\boldsymbol{n} )}, \]
where \(\boldsymbol{n}\in \mathbb{R}^{C}\) denotes a negative feature sampled from other videos in the mini-batch. Note that we sample the global views of other videos to form the negative pairs by default, and we include an ablation study on this choice in the experimental part. In this way, we are able to align the exact corresponding visual contents, including both static appearance and dynamic motions, in videos.
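For concreteness, here is a minimal PyTorch sketch of this soft region contrast, assuming the clip and video grid features are flattened to \((N_{\text{c}}, C)\) and \((N_{\text{v}}, C)\), the negatives are pooled global-view grids from other videos, and the rows of \(S_{k}\) sum to 1; tensor names and shapes are our own illustration rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def region_contrast_loss(clip_feat, video_feat, S, negatives, tau=0.1):
    """Soft spatio-temporal region contrast.

    clip_feat:  (Nc, C) local clip grid features  f(v_k)[i]
    video_feat: (Nv, C) global video grid features f'(v)[j]
    S:          (Nc, Nv) correspondence matrix, each row sums to 1
    negatives:  (Nn, C) grid features sampled from other videos
    """
    q = F.normalize(clip_feat, dim=1)
    k_pos = F.normalize(video_feat, dim=1)
    k_neg = F.normalize(negatives, dim=1)

    pos_logits = q @ k_pos.t() / tau               # (Nc, Nv)
    neg_logits = q @ k_neg.t() / tau               # (Nc, Nn)
    logits = torch.cat([pos_logits, neg_logits], dim=1)
    log_prob = F.log_softmax(logits, dim=1)        # denominator includes negatives

    # Cross-entropy against the soft correspondence distribution S:
    # only in-video grids carry positive mass, negatives get weight 0.
    return -(S * log_prob[:, : S.size(1)]).sum(dim=1).mean()

# Toy shapes: 2x4x4 clip grid (32), 8x4x4 video grid (128), 256 negatives.
Nc, Nv, C = 32, 128, 128
S = torch.rand(Nc, Nv)
S = S / S.sum(dim=1, keepdim=True)
loss = region_contrast_loss(torch.randn(Nc, C), torch.randn(Nv, C), S, torch.randn(256, C))
print(loss.item())
```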
3.2 Low-level shortcut elimination
However, the local-global spatio-temporal correspondence for region feature contrast can be exploited through a “shortcut” that relies merely on low-level statistics, e.g., the color distribution, to identify the overlapping areas. This shortcut could prevent the model from learning meaningful semantic features. To this end, we aim to mitigate the impact of low-level statistics on the extracted representations.
An intuitive way to solve this problem is by utilizing strong augmentations. However, we find that this is not enough in the video domain. Unlike images, the temporal continuity between sampled frames could provide extra cues to learn these shortcuts. For example, the continuous change in illumination helps to determine the corresponding segments in the local-global view. It is nontrivial to design augmentations to decouple such low-level information from the final representations. Motivated by adversarial learning, a promising approach is to learn a low-level information estimator from semantically inconsistent samples that share similar low-level statistics. Then, we let the encoder minimize this estimated information.
We note that the color and blur augmentations \(\tau _{\text{l}}\) are effective at distorting low-level statistics. In other words, similar augmentations generate samples that share similar low-level characteristics. Hence, we define several intensity levels of \(\tau _{\text{l}}\) by constraining the augmentation parameters to certain ranges. As such, we can use the controlled \(\tau _{\text{l}}\) to generate frame sequences that possess distinct semantics but similar low-level statistics. Then, we build a mutual information estimator on top of the extracted feature representation for low-level information extraction. Note that there are several ways to approximate the mutual information; we compare different estimation methods in Sect. 4.4. For illustration, we take MINE [59] as an example. Following Ref. [59], we approximate the mutual information between two variables by
\[ I(X;Y)\geqslant \mathcal{E}_{\mathcal{P}_{XY}} [G_{\theta}(X,Y) ]-\log \mathcal{E}_{\mathcal{P}_{X}\otimes \mathcal{P}_{Y}} [\mathrm{e}^{G_{\theta}(X,Y)} ], \]
where \(\mathcal{P}\) denotes a probability distribution, and \(\mathcal{E}\) represents taking the expectation over the corresponding distribution. X and Y are the feature representations extracted by the encoder f. The projection function G maps the combination of the two variables X and Y, sampled from the distribution spaces \(\mathcal{X}\) and \(\mathcal{Y}\), to a scalar value, i.e., \(G_{\theta}:\mathcal{X}\times \mathcal{Y}\rightarrow \mathbb{R}\). It is instantiated by a neural network with parameters \(\theta \in \Theta \), where Θ is the parameter set for optimization. Empirically, we instantiate \(G_{\theta}\) as a two-layer multi-layer perceptron (MLP). We regard the features of sample pairs generated from the same intensity level of \(\tau _{\text{l}}\) as the joint distribution \(\mathcal{P}_{XY}\), and the features of arbitrary sample pairs as the marginal \(\mathcal{P}_{X}\otimes \mathcal{P}_{Y}\), where ⊗ denotes the combination of two marginal distributions. During training, we formulate the learning objective as
\[ \mathcal{L}_{\text{mi}}=\mathcal{E}_{\mathcal{P}_{XY}} [G_{\theta}(X,Y) ]-\log \mathcal{E}_{\mathcal{P}_{X}\otimes \mathcal{P}_{Y}} [\mathrm{e}^{G_{\theta}(X,Y)} ]. \]
We maximize Eq. (6) with respect to the MLP parameters θ to obtain a reliable low-level information extractor, but reverse the gradient back-propagated to the encoder f so that f minimizes Eq. (6). With the learned low-level information estimator \(G_{\theta}\), we further apply it to the aforementioned local-global pairs, \(f(v_{k})\) and \(f'(v)\), to minimize the low-level shortcut by optimizing f while keeping θ fixed. In this way, we minimize the impact of low-level statistics on the spatio-temporal region feature contrast, and facilitate detailed semantic alignment.
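The sketch below shows one way to wire this up in PyTorch, assuming MINE as the estimator: a two-layer MLP \(G_{\theta}\) scores feature pairs, joint pairs come from the same intensity level, marginal pairs are formed by shuffling within the batch, and a gradient-reversal function lets a single backward pass maximize the bound for the head while minimizing it for the encoder. Class and function names are ours for illustration.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates the gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

class MINEHead(nn.Module):
    """Two-layer MLP G_theta mapping a feature pair to a scalar score."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=1)).squeeze(1)

def mine_estimate(head, x, y):
    """MINE bound: E_joint[G] - log E_marginal[exp(G)]. Joint pairs share the
    same augmentation intensity level; marginal pairs are built by shuffling y."""
    joint = head(x, y).mean()
    marginal = head(x, y[torch.randperm(y.size(0))])
    return joint - (torch.logsumexp(marginal, dim=0) - torch.log(torch.tensor(float(y.size(0)))))

# The head ascends the bound; the reversed gradient makes the encoder descend it.
dim = 128
head = MINEHead(dim)
feat_x = torch.randn(64, dim, requires_grad=True)   # stand-ins for encoder outputs
feat_y = torch.randn(64, dim, requires_grad=True)
mi = mine_estimate(head, GradReverse.apply(feat_x), GradReverse.apply(feat_y))
(-mi).backward()
```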
3.3 Local-global temporal dependency
Now, we have learned robust clip features from the detailed region semantic contrast, and the remaining task is to bridge the gap between clip-level and video-level representations. Considering that internal temporal relationships exist among the sampled local clips and are naturally contained in the global video, we propose to model the temporal order dependency between \(f(v_{k})\), \(k=\{1,2,\ldots ,K\}\), and \(f'(v)\) to enhance video-level understanding.
Similar to Sect. 3.2, we also use mutual information to measure the local-global temporal order dependency. The target is to maximize the mutual information between correctly ordered clip-level features and the video-level representation. Mathematically, we define the sequentially ordered clip features as \(\overline{f}(v)= [f(v_{1})\circ f(v_{2})\circ \cdots \circ f(v_{K}) ]\), where ∘ denotes the concatenation operation, and the arbitrarily ordered features as \(\widetilde{f}(v)\). To model the temporal dependency, we regard \(\overline{f}(v)\) and \(f'(v)\) as sampled from the joint distribution \(\mathcal{P}_{XY}\), and \(\widetilde{f}(v)\) and \(f'(v)\) as sampled from the marginal distribution \(\mathcal{P}_{X}\otimes \mathcal{P}_{Y}\). In this formulation, the learning objective can be written as
\[ \mathcal{L}_{\text{td}}=- (\mathcal{E}_{\mathcal{P}_{XY}} [G_{\psi} (\overline{f}(v),f'(v) ) ]-\log \mathcal{E}_{\mathcal{P}_{X}\otimes \mathcal{P}_{Y}} [\mathrm{e}^{G_{\psi} (\widetilde{f}(v),f'(v) )} ] ), \]
where \(G_{\psi}\) is the mutual information estimation head with parameters ψ. Several alternatives exist to instantiate \(G_{\psi}\), and we discuss this in Sect. 4.4.
There are some alternatives for establishing the marginal distribution \(\mathcal{P}_{X}\otimes \mathcal{P}_{Y}\) in Eq. (7). By default, we formulate it as a uniform distribution consisting of differently ordered video clips \(\widetilde{f}(v)\). Empirically, this formulation places equal emphasis on all order combinations. However, among the shuffled orders, some are quite trivial to discriminate while others are difficult to perceive. To this end, we refer to the transformation parameters \(\tau _{\text{p}}\) in our controllable augmentations, and evaluate the difficulty of each order to pay more attention to the hard examples. In particular, an order indicator \(\mathcal{O}\in \mathbb{N}^{K}\) is given, and we have \(\mathcal{O}[k]=k\) for the correct order. We denote the central timestamp of the kth clip as \(t_{\mathcal{O}[k]}\), and the oracle central timestamp is \(\hat{t}_{k}={(2k-1)T}/{(2K)}\). We calculate the summation of the central timestamp deviations to produce the difficulty score of order \(\mathcal{O}\):
\[ d(\mathcal{O})=\sum_{k=1}^{K} \vert t_{\mathcal{O}[k]}-\hat{t}_{k} \vert . \]
The lower deviation indicates the higher learning difficulty. We take softmax normalization over the difficulty scores to generate the sampling probability of each order combination. In this way, we formulate a marginal distribution which emphasizes hard examples to improve the learning efficiency.
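As a small worked example of this difficulty-aware sampling, the NumPy sketch below scores each shuffled order by the summed deviation of its clip centers from the oracle centers \((2k-1)T/(2K)\) and converts the scores into sampling probabilities with a temperature-controlled softmax; the negated deviation in the softmax reflects our reading that a lower deviation means a harder, more heavily sampled order.

```python
import numpy as np
from itertools import permutations

def order_sampling_probs(centers, T, temperature=1.0):
    """Difficulty-aware sampling distribution over shuffled clip orders.

    centers: central timestamps of the K clips in their correct order.
    An order whose reordered centers deviate little from the oracle centers
    (2k-1)T/(2K) is hard to distinguish from the correct order, so it
    receives more sampling mass (softmax over the negated deviation)."""
    K = len(centers)
    oracle = np.array([(2 * k - 1) * T / (2 * K) for k in range(1, K + 1)])
    orders = [p for p in permutations(range(K)) if p != tuple(range(K))]
    deviation = np.array([np.abs(np.asarray(centers)[list(p)] - oracle).sum() for p in orders])
    logits = -deviation / temperature   # lower temperature -> sharper, harder orders dominate
    probs = np.exp(logits - logits.max())
    return orders, probs / probs.sum()

orders, probs = order_sampling_probs(centers=[20, 60, 100, 140], T=160)
shuffled = orders[np.random.choice(len(orders), p=probs)]
print(shuffled)
```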
It is worth noting that there are some previous works using temporal order to build pretext tasks for self-supervised learning [11, 14, 35]. The major difference is that our approach incorporates the video-level feature to determine whether the clips are correctly ordered, while Refs. [11, 14, 35] have no access to the global feature. In this way, our formulation could avoid the ambiguity problem when encountering temporal structures that cannot be determined solely from local clips. For example, in a complex gymnastic scene, it is difficult to determine the temporal order of the gymnastic actions only with local clips. However, it is practical to reach the correct order with reference to the global video feature. Thus, our local-global mutual temporal order constraint is a better way to embed video-level temporal structures into the extracted representations.
3.4 Training
We jointly train our model with the aforementioned three objectives:
\[ \mathcal{L}=\mathcal{L}_{\text{rc}}+\alpha \mathcal{L}_{\text{mi}}+\beta \mathcal{L}_{\text{td}}, \]
where α and β serve as balancing hyper-parameters. We set \(\alpha =\beta =1\) by default, and find that the performance is fairly robust to these hyper-parameters.
In addition, we also explore applying a curriculum evolving strategy to the parameters of the controllable augmentations to adjust the training process. Intuitively, motivated by Refs. [60, 61], it is promising to learn from easier samples first and then gradually expand to more difficult tasks. This phenomenon is more obvious in self-supervised learning in the absence of human annotations [62]. To this end, in this work, we design an evolving strategy for the augmentation parameters to allow the model to learn in an easy-to-hard manner. Specifically, at the beginning of the training process, we strictly constrain the cropping parameters in \(\tau _{\text{p}}\) to construct clearly distinct local clips, which reduces the difficulty of dense region contrast and temporal dependency modeling. Additionally, we define easily distinguishable intensity levels in \(\tau _{\text{l}}\) so that the low-level statistics are easier to capture. During training, we gradually relax the constraints on the augmentation parameters to increase the learning difficulty. We present the detailed formulation of the dynamic parameter evolving process in the implementation details, and show the empirical comparison with constant augmentation parameters in the ablation study.
4 Experiment
4.1 Datasets
We use 4 video action recognition datasets, Kinetics-400 [1], UCF-101 [7], HMDB-51 [63] and Diving-48 [9]. Kinetics-400 [1] is a large-scale dataset consisting of 240 K video clips with 400 human action classes. UCF-101 [7] contains over 13 K clips covering 101 action classes. HMDB-51 [63] covers 51 action categories and approximately 7 K annotated clips. Diving-48 [9] contains 48 different diving actions, which mainly vary in motion patterns and share similar backgrounds. In our experiments, we use the training set of UCF-101 or Kinetics-400 for self-supervised pretraining. For the downstream tasks, following Refs. [10, 46, 56], we use split 1 of UCF-101 and HMDB-51, and the test split V1 of Diving-48 for evaluation.
4.2 Implementation details
Self-supervised pretraining
For global video input, we sparsely sample 16 frames with weak spatial cropping. For local clip input, we constrain the temporal cropping parameters to make K 16-frame clips approximately uniformly distributed in the video. The local clips are spatially cropped within the global view to ensure position-wise correspondence. For low-level augmentations, we define a set of color jitter and Gaussian blur parameters to form different intensity-level transformations. We resize the input frame sequence into \(16\times 112\times 112\), and use R3D-18 [64] as the video encoder. For local clip feature extraction, we follow the default setting and the feature resolution is \(2\times 4\times 4\). For global video feature extraction, we set the temporal stride of the last 3 stages to 1, so that the feature resolution is \(8\times 4\times 4\). We calculate the spatio-temporal correspondence matrix between local and global feature maps based on the cropping and flipping parameters for optimization.
In terms of training settings, we use a batch size of 128, and set the number of local clips K to 4 by default. We train our model on UCF-101 for 200 epochs and on Kinetics-400 for 100 epochs. We use the Adam optimizer with an initial learning rate of \(1.0\times 10^{-3}\) and a weight decay of \(1.0\times 10^{-5}\). The learning rate is decayed by a factor of 10 at epoch 70 for Kinetics-400 and at epoch 150 for UCF-101.
Action recognition
We load the pretrained video encoder parameters except for the last fully-connected layer. There are two evaluation protocols: (1) end-to-end finetuning of the whole network with action labels; (2) freezing the encoder and only training a linear classifier, which is known as the linear probe. For evaluation, we follow Refs. [14, 18] to uniformly sample 10 clips for each video, which are center cropped and resized to \(112\times 112\). We average the softmax probabilities of the clips as the final prediction and report the Top-1 accuracy.
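A minimal sketch of this clip-averaging evaluation (the toy classifier only stands in for the finetuned R3D-18; it is not the actual network):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def video_top1(model, clips):
    """clips: (10, 3, 16, 112, 112), ten uniformly sampled, center-cropped
    clips from one video. Average the per-clip softmax scores, take argmax."""
    probs = F.softmax(model(clips), dim=1).mean(dim=0)
    return probs.argmax().item()

# Toy stand-in classifier over 101 UCF-101 classes (pool + linear head).
toy_model = nn.Sequential(nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(3, 101))
print(video_top1(toy_model, torch.randn(10, 3, 16, 112, 112)))
```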
Video retrieval
We directly use the pretrained model to extract video features without finetuning. Following Refs. [14, 65], we regard videos in the test set as queries, and retrieve nearest neighbors from the training set. Similar to action recognition, we average the features of ten uniformly sampled clips as the global representation. We report Top-k recall R@k.
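Below is a small sketch of the R@k computation under the common convention that a query counts as correct if any of its k nearest training-set neighbors (by cosine similarity) shares the query's action class; function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def recall_at_k(query_feats, query_labels, gallery_feats, gallery_labels, k=5):
    """R@k: a query is correct if any of its k nearest training-set neighbors
    (cosine similarity) shares the query's action class."""
    q = F.normalize(query_feats, dim=1)
    g = F.normalize(gallery_feats, dim=1)
    topk = (q @ g.t()).topk(k, dim=1).indices                 # (Nq, k)
    hits = (gallery_labels[topk] == query_labels[:, None]).any(dim=1)
    return hits.float().mean().item()

# Toy example: 100 test queries retrieved against 1000 training videos.
print(recall_at_k(torch.randn(100, 128), torch.randint(0, 101, (100,)),
                  torch.randn(1000, 128), torch.randint(0, 101, (1000,))))
```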
Controllable augmentations
We also provide a detailed illustration of our controllable augmentations. We respectively describe the implementations for random spatial crop, random temporal crop, random horizontal flip, color jitter and Gaussian blur. We use the default setting, 4 local clips and 512 low-level augmentation intensities for illustration, and provide the detailed evolving progress of the dynamic augmentation parameters.
1) Random temporal crop. For global video, we do not perform temporal cropping, but uniformly sample 16 frames. For local clips, we constrain the central frame to respectively be located in the timespan \([0.00,0.25]\), \([0.25,0.50]\), \([0.50,0.75]\) and \([0.75,1.00]\) to cover the whole video. For the dynamic augmentation parameter design, we constrain the time interval to \([0.10,0.15]\), \([0.35,0.40]\), \([0.60,0.65]\) and \([0.85,0.90]\) at the beginning to generate clips with more easily distinguishable temporal boundaries. Then we linearly expand the time interval to the default \([0.00,0.25]\), \([0.25,0.50]\), \([0.50,0.75]\) and \([0.75,1.00]\) in each training iteration.
2) Random spatial crop. For global video, we perform weak spatial cropping, with the area ratio in the range \([0.8,1.0]\). For local clips, we perform strong spatial cropping relative to the cropped global video, with an area ratio of \([0.4,0.8]\). Similarly, for the dynamic curriculum training, we first constrain the local clip cropping ratio to \([0.7,0.8]\) for more discriminative region contrast. Then we linearly reduce the lower limit of the area ratio to \([0.4,0.8]\) to increase the learning difficulty with more diverse and noisier dense correspondence.
3) Random horizontal flip. For both global video and local clips, we perform random horizontal flipping with a probability of 0.5.
4) Color jitter. Referring to the default settings in Refs. [20, 22], the brightness (B), contrast (C) and saturation (S) are in the range \([0.0,0.4]\), and the hue (H) is in the range \([0,0.1]\). We uniformly divide B, C and S into 4 groups each, with intensities from weak to strong in the ranges \([0.0,0.1]\), \([0.1,0.2]\), \([0.2,0.3]\) and \([0.3,0.4]\), and divide H into 2 groups in the ranges \([0.00,0.05]\) and \([0.05,0.10]\). In the dynamic training, we initialize the intensity levels as \([0.04,0.06]\), \([0.14,0.16]\), \([0.24,0.26]\) and \([0.34,0.36]\) so that the differences in low-level statistics are easy to capture. As training progresses, we linearly expand the intensity levels to the default setting, \([0.0,0.1]\), \([0.1,0.2]\), \([0.2,0.3]\) and \([0.3,0.4]\).
5) Gaussian blur. We adopt 2 different radii, 7 and 11, and 2 different sigma ranges, \([0.1,0.5]\) and \([0.5,2.0]\), resulting in 4 combinations. In the same manner, we initially set the sigma ranges to \([0.25,0.35]\) and \([1.2,1.3]\), then linearly relax them to \([0.1,0.5]\) and \([0.5,2.0]\) to formulate easy-to-hard training (a sketch of this linear relaxation is given after this list).
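All of the evolving schedules above reduce to the same operation: linearly interpolating a parameter range from its strict initial setting to the default one over training iterations. A hedged sketch with our own helper name:

```python
def evolve_range(start_range, end_range, step, total_steps):
    """Linearly relax an augmentation parameter range from its strict initial
    setting to the default one as training progresses (easy-to-hard)."""
    t = min(step / total_steps, 1.0)
    lo = start_range[0] + t * (end_range[0] - start_range[0])
    hi = start_range[1] + t * (end_range[1] - start_range[1])
    return lo, hi

# Example: the first clip's temporal-center interval grows from [0.10, 0.15]
# at iteration 0 to the default [0.00, 0.25] by the end of training.
for step in (0, 5000, 10000):
    print(step, evolve_range((0.10, 0.15), (0.00, 0.25), step, total_steps=10000))
```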
With the help of these controllable augmentations, the spatio-temporal correspondence is calculated as the ratio of the intersection of two tubes. As demonstrated in Fig. 3, we have the global video feature map \(F_{\text{v}}\) and the local clip feature map \(F_{\text{c}}\). We aim to calculate the spatio-temporal correspondence matrix S, where \(S[i,j]\) indicates the correspondence score between the ith grid in \(F_{\text{c}}\) and the jth grid in \(F_{\text{v}}\). For better illustration, we assume \(F_{\text{c}}[i]\) covers the area \([(t_{\text{c}}^{1},t_{\text{c}}^{2}),(h_{\text{c}}^{1},h_{\text{c}}^{2}),(w_{\text{c}}^{1},w_{\text{c}}^{2})]\) and \(F_{\text{v}}[j]\) covers the area \([(t_{\text{v}}^{1},t_{\text{v}}^{2}),(h_{\text{v}}^{1},h_{\text{v}}^{2}),(w_{\text{v}}^{1},w_{\text{v}}^{2})]\). Then the intersection can be written as
\[ \mathit{inter} (F_{\text{c}}[i],F_{\text{v}}[j] )=\max (0,\min (t_{\text{c}}^{2},t_{\text{v}}^{2})-\max (t_{\text{c}}^{1},t_{\text{v}}^{1}) )\cdot \max (0,\min (h_{\text{c}}^{2},h_{\text{v}}^{2})-\max (h_{\text{c}}^{1},h_{\text{v}}^{1}) )\cdot \max (0,\min (w_{\text{c}}^{2},w_{\text{v}}^{2})-\max (w_{\text{c}}^{1},w_{\text{v}}^{1}) ). \]
\(S[i,j]\) is the ratio of the intersection over the volume of \(F_{\text{c}}[i]\):
\[ S[i,j]=\frac{\mathit{inter} (F_{\text{c}}[i],F_{\text{v}}[j] )}{(t_{\text{c}}^{2}-t_{\text{c}}^{1})(h_{\text{c}}^{2}-h_{\text{c}}^{1})(w_{\text{c}}^{2}-w_{\text{c}}^{1})}. \]
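A compact NumPy sketch of this per-grid computation, assuming the clip-tube coordinates have already been mapped into the global video's coordinate system by the cropping and flipping parameters:

```python
import numpy as np

def tube_overlap_ratio(tube_c, tube_v):
    """Correspondence score S[i, j]: the volume of the intersection of the
    clip tube F_c[i] and the video tube F_v[j], divided by the volume of
    the clip tube. Each tube is ((t1, t2), (h1, h2), (w1, w2))."""
    inter = 1.0
    for (c1, c2), (v1, v2) in zip(tube_c, tube_v):
        inter *= max(0.0, min(c2, v2) - max(c1, v1))
    vol_c = np.prod([c2 - c1 for c1, c2 in tube_c])
    return inter / vol_c

# A clip tube half-overlapping a video tube along time, fully inside spatially.
clip_tube  = ((4, 8),  (0, 28), (0, 28))
video_tube = ((6, 10), (0, 28), (0, 28))
print(tube_overlap_ratio(clip_tube, video_tube))   # 0.5
```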
4.3 Comparison with existing works
Action recognition
We first present the comparison between our method and recent video representation learning approaches on action recognition in Table 1. We report the Top-1 accuracy on UCF-101 and HMDB-51 under the linear probe and finetune protocols. We exclude methods that use different evaluation settings and much deeper backbones, such as Refs. [17, 49, 71], or those that rely on audio and text modalities, such as Refs. [72, 73]. In Table 1, we use ‘V+F’ to denote the use of both red green blue (RGB) frames and optical flow in the self-supervised pretraining stage. All evaluation results are obtained using only RGB at test time.
Under the linear probe, our method outperforms other RGB-only approaches by a large margin. The superiority over RSPNet [48], which integrates a temporal pretext task with contrastive learning, demonstrates the effectiveness of our general temporal structure learning scheme. Note that our method also dramatically narrows the gap between RGB-only and RGB-flow based methods. This indicates that our method significantly improves motion pattern modeling. Under finetuning, our method achieves promising results when pretrained on UCF-101, even surpassing RGB-flow based architectures [15, 20] and methods trained with higher resolution [52]. When pretrained on Kinetics-400, our method is generally comparable with the state-of-the-art RGB-flow approaches [15, 20], motion-focused methods [52, 53], and probabilistic and hierarchical pretraining [58, 68]. This indicates that our controllable augmentations have the potential to cover long-range temporal dynamics and lead to a comprehensive perception of appearance and motion cues. In addition, due to limited computational resources, we do not compare with works using very large backbones, such as Refs. [17, 71], but we present an ablation in the bottom three rows. The results indicate that our method has the potential to scale to longer training schedules, deeper backbones and larger resolutions.
In addition, we also provide results on Diving-48 [9], a dataset that mainly relies on dynamic motions to distinguish different action categories. We compare the results of both supervised (the second to the fourth row) and self-supervised methods (bottom three rows) in Table 2. Since the appearance is similar across different videos, the Top-1 accuracy well reflects the ability in motion understanding. We observe that in this case, semantic label supervision is not effective, and our method improves the performance by a notable margin. This demonstrates that our learning approach is superior in capturing motion patterns, with less reliance on background information.
Video retrieval
Table 3 depicts the comparison on video retrieval with R@k. The model is pretrained on UCF-101. Our method remarkably outperforms most RGB-based approaches. Note that some methods, especially PCL [77], achieve impressive results when k increases to 20. This is because when k is large, it becomes easier to rely on the background as a shortcut to retrieve videos of the same category. We reach comparable or even better performance, although STS [15] and CoCLR [20] adopt both RGB and optical flow. This demonstrates once again that our integration of detailed local feature alignment and general long-term temporal modeling is effective in enhancing motion pattern modeling without resorting to motion-biased input data.
Visualization analysis
We also display some visualization results to analyze the learned feature representations in Fig. 4. We employ class-agnostic activation maps (CAAM) [78] to reveal the spatio-temporal distributions of the extracted features. Generally, vanilla contrastive learning based on SimCLR [26] leads the model to focus on representative background cues, e.g., the soccer field, swimming pool and fitness equipment. In contrast, our pretrained model focuses on the moving foregrounds that contain actions, such as the moving human body and moving boat.
4.4 Ablation study
In this section, we provide several ablation studies to analyze our video representation learning framework. Unless specifically mentioned, all models are pretrained on UCF-101 for 150 epochs, with R3D-18 as the backbone.
Local-global sampling
We first explore the impact of the local-global settings. We investigate two aspects: the number of local clips K and the global video feature temporal resolution \(T_{\text{v}}\), which is obtained by adjusting the temporal convolution stride. We present the results in Table 4. By varying the number of local clips K from 1 to 4, we find that having more local clips tends to improve the performance due to more fine-grained feature alignment. It is worth noting that when the ratio \(T_{\text{v}}/(KT_{\text{c}})<1\), the granularity of the local-global correspondence becomes too coarse, which restricts the performance. Overall, accurate spatio-temporal region correspondence provides a reliable reference for appearance and motion pattern matching, and significantly improves action recognition.
Negative pair formulation
With the local-global sampling, there are several alternatives to formulate the negative feature pool in Eq. (4). In detail, given the query local clip feature \(f(v_{k})[i]\), we sample the matched global view features as positive pairs. Correspondingly, by default we sample the global view features from other videos to formulate the negative pairs. In this ablation study, we also compare with two other variants: (1) sampling local clip features from other videos as negative pairs; (2) combining both local and global view features from other videos as the negative samples. We show the empirical comparison in Table 5. We observe that sampling negative features from the global view works much better than sampling from the local views. Integrating both local and global features as negative pairs leads to negligible improvements. This is because the global view features provide richer visual contexts with larger receptive fields, and serve as more informative reference signals in dense region contrast.
Random perturbations in dense correspondence
Recalling Eq. (1), we calculate the intersection ratio of two spatio-temporal cubes to produce the dense correspondence matrix to indicate the positive pairs. This strict correspondence is purely based on the geometric positions and might neglect some visual contexts that help high-level understanding. To this end, we compare adding independent random Gaussian noise and applying spatio-temporal Gaussian blur to the dense correspondence matrix in Table 6. We find that the random Gaussian noise leads to a substantial performance drop since it destroys the correspondence relations. In contrast, Gaussian blur basically maintains the original correspondence distribution and slightly expands the positive sampling area. It improves the performance by including more diverse visual contexts as positive features.
Low-level augmentation levels
We also explore the setting of the intensity levels of the low-level augmentations. We follow conventional implementations: for color jitter, the controllable parameters of brightness, contrast, saturation and hue are set as \((B,C,S,H)= (0.4,0.4,0.4,0.1)\) by default [20, 22]. For Gaussian blur, we control the radius and sigma. We set different numbers of intensity levels for each controllable parameter, as listed in Table 7. Note that since B, C and S share the same default value, we also set the same number of levels for them. The total number of predefined intensity levels equals the product of the numbers of levels of all parameters, i.e., 32 for the first row, 512 for the second row, etc. For consistency, in each iteration, we randomly sample 32 intensity levels from all possible levels, resulting in 32 groups of features that share similar low-level statistics for mutual information minimization. We observe that too few or too many levels both lead to a performance drop. This is because more levels lead to less difference between different groups, while fewer levels mean more difference within each group. We conclude that a trade-off exists which requires balancing to achieve the best possible training.
Mutual information estimation
We also delve into several methods for mutual information estimation. For low-level shortcut elimination, we need to force the encoder to minimize the estimated mutual information. Theoretically, we should minimize an estimated upper bound. However, we find that training is difficult to converge with CLUB [79], which is an upper bound estimation approach. Therefore, we also adopt the lower bound estimation methods MINE [59], JS [80] and InfoNCE [27] for comparison. These methods provide tight lower bound estimates which lead to easier convergence, thus achieving superior performance on action recognition. Compared to the baseline, as illustrated in Table 8, our method obtains a significant improvement, especially when pretrained on mini-Kinetics and evaluated on UCF-101. This indicates that mutual information minimization helps mitigate low-level shortcuts and enhances the generalization ability.
Temporal dependency head
To further examine implementations of the temporal dependency head, we compare three typical examples: (1) MLP: concatenate \(f'(v)\) with \(\overline{f}(v)\) or \(\widetilde{f}(v)\) and pass the result through a multi-layer perceptron (MLP) to obtain a scalar value. (2) GRU: use a gated recurrent unit (GRU) network to process the clip feature sequence, and calculate the cosine similarity between \(f'(v)\) and the GRU output. (3) GRU+MLP: the GRU processes the clip feature sequence, whose output is then concatenated with \(f'(v)\) and passed through an MLP to obtain a scalar value. The results are listed in Table 9. Compared with no temporal constraint, all three implementations gain significant improvements. We note that our MLP implementation is similar to VCOP [14] but differs in the learning objective. This improvement reveals that introducing the global video feature as a reference could enhance temporal structure modeling.
Marginal distribution formulation
In addition to the mutual information estimation head, we also compare different marginal distribution formulations for temporal dependency modeling. By default, we instantiate it as a uniform distribution containing different shuffled orders. We also compare this with a difficulty-aware marginal distribution that places more emphasis on hard examples. The results are presented in Table 10. The temperature hyper-parameter controls the concentration of the softmax normalization over the difficulty scores in Eq. (8). A lower temperature leads to a sharper distribution, thus resulting in repeatedly sampling difficult examples. From the comparison, we observe that a lower temperature impairs the performance while a smoother distribution brings improvements. This indicates the necessity of using a small number of easy examples to guide the model to discriminate temporal relations and prevent the model from falling into ambiguities. Only on this basis can sampling more hard examples further facilitate temporal perception.
Dynamic augmentation parameters
Here, we provide a quantitative comparison between the default augmentation parameter setting and the dynamic parameter evolving in Table 11. In the dynamic learning stage, we manually control the augmentation parameters to construct the training samples in a curriculum manner. It is clear that the dynamic augmentation parameter evolving in an easy-to-hard manner leads to performance improvement. This dynamic setting contributes to determining optimal augmentation parameter combinations that facilitate video representation learning.
Training efficiency
We also report the training efficiency of the variants of our method and other works. For a fair comparison, we pretrain the R3D-18 backbone with a resolution of \(112\times 112\) and a clip length of 16 frames on Kinetics-400 on a server with 8 NVIDIA 3090 GPUs, and report the training time in hours as well as the total GPU hours. For the variants of our method, we compare different numbers of local clips K and present the results in Table 12. We observe that sampling more local clips leads to faster convergence, so the training time does not increase linearly. Compared with the other two baselines [51, 67], based on our reimplementations, our method achieves better performance with fewer training hours, demonstrating the high training efficiency of our framework.
Overall learning objectives
We finally show the ablation of the designed learning objectives in Table 13, where \(\mathcal{L}_{\text{nce}}\) is the standard contrastive loss used in existing works. We observe that the integration of \(\mathcal{L}_{\text{rc}}\) and \(\mathcal{L}_{\text{mi}}\) significantly outperforms \(\mathcal{L}_{\text{nce}}\), which indicates that the detailed region contrast with low-level shortcut elimination is more effective than naive global contrast. In addition, \(\mathcal{L}_{\text{td}}\) further enables the model to go beyond local clips and establish long-term relationships. The improvement demonstrates that our method well integrates detailed region-level contrast and general long-term temporal perception.
5 Conclusion
In this paper, we propose a framework that leverages local clips and the global video to enhance self-supervised video representation learning. We employ a set of controllable augmentations to crop local clips and to generate groups of samples that share similar low-level attributes. We then use the soft codes computed from the crop and flip parameters to guide detailed spatio-temporal region contrastive learning, and minimize the mutual information within the same low-level group to avoid shortcuts. We also incorporate local-global temporal dependency to embed general temporal structures into the extracted video representations. Experiments on downstream tasks of action recognition and video retrieval demonstrate the superiority of our formulation, especially in modeling dynamic motion patterns.
Data availability
The datasets generated during and/or analyzed during the current study are available in the kinetics-dataset repository, https://github.com/cvdfoundation/kinetics-dataset.
Abbreviations
- GRU: gated recurrent unit
- InfoNCE: information noise contrastive estimation
- MLP: multi-layer perceptron
- RGB: red green blue
References
Carreira, J., & Zisserman, A. (2017). Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4724–4733). Piscataway: IEEE.
Xie, S., Sun, C., Huang, J., Tu, Z., & Murphy, K. (2018). Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification. In V. Ferrari, M. Hebert, C. Sminchisescu, et al. (Eds.), Proceedings of the 15th European conference on computer vision (pp. 318–335). Cham: Springer.
Gu, C., Sun, C., Ross, D. A., Vondrick, C., Pantofaru, C., Li, Y., et al. (2018). AVA: a video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 6047–6056). Piscataway: IEEE.
Heilbron, F. C., Escorcia, V., Ghanem, B., & Niebles, J. C. (2015). Activitynet: a large-scale video benchmark for human activity understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 961–970). Piscataway: IEEE.
Liu, Y., Albanie, S., Nagrani, A., & Zisserman, A. (2019). Use what you have: video retrieval using representations from collaborative experts. arXiv preprint. arXiv:1907.13487.
Miech, A., Zhukov, D., Alayrac, J.-B., Tapaswi, M., Laptev, I., & Sivic, J. (2019). Howto100m: learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 2630–2640). Piscataway: IEEE.
Soomro, K., Zamir, A. R., & Shah, M. (2012). UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint. arXiv:1212.0402.
Goyal, R., Kahou, S. E., Michalski, V., Materzynska, J., Westphal, S., Kim, H., et al. (2017). The “something something” video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision (pp. 5843–5851). Piscataway: IEEE.
Li, Y., Li, Y., & Vasconcelos, N. (2018). RESOUND: towards action recognition without representation bias. In V. Ferrari, M. Hebert, C. Sminchisescu, et al. (Eds.), Proceedings of the 15th European conference on computer vision (pp. 520–535). Cham: Springer.
Benaim, S., Ephrat, A., Lang, O., Mosseri, I., Freeman, W. T., Rubinstein, M., et al. (2020). SpeedNet: learning the speediness in videos. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9919–9928). Piscataway: IEEE.
Misra, I., Zitnick, C. L., & Hebert, M. (2016). Shuffle and learn: unsupervised learning using temporal order verification. In B. Leibe, J. Matas, N. Sebe, et al. (Eds.), Proceedings of the 14th European conference on computer vision (pp. 527–544). Cham: Springer.
Kim, D., Cho, D., & Kweon, I. S. (2019). Self-supervised video representation learning with space-time cubic puzzles. In Proceedings of the 33rd AAAI conference on artificial intelligence (pp. 8545–8552). Palo Alto: AAAI Press.
Jenni, S., Meishvili, G., & Favaro, P. (2020). Video representation learning by recognizing temporal transformations. In A. Vedaldi, H. Bischof, T. Brox, et al. (Eds.), Proceedings of the 16th European conference on computer vision (pp. 425–442). Cham: Springer.
Xu, D., Xiao, J., Zhao, Z., Shao, J., Xie, D., & Zhuang, Y. (2019). Self-supervised spatiotemporal learning via video clip order prediction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10334–10343). Piscataway: IEEE.
Wang, J., Jiao, J., Bao, L., He, S., Liu, W., & Liu, Y. (2022). Self-supervised video representation learning by uncovering spatio-temporal statistics. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(7), 3791–3806.
Gordon, D., Ehsani, K., Fox, D., & Farhadi, A. (2020). Watching the world go by: representation learning from unlabeled videos. arXiv preprint. arXiv:2003.07990.
Qian, R., Meng, T., Gong, B., Yang, M.-H., Wang, H., Belongie, S. J., et al. (2021). Spatiotemporal contrastive video representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 6964–6974). Piscataway: IEEE.
Wang, J., Jiao, J., & Liu, Y.-H. (2020). Self-supervised video representation learning by pace prediction. In A. Vedaldi, H. Bischof, T. Brox, et al. (Eds.), Proceedings of the 16th European conference on computer vision (pp. 504–521). Cham: Springer.
Yao, T., Zhang, Y., Qiu, Z., Pan, Y., & Mei, T. (2021). SeCo: exploring sequence supervision for unsupervised representation learning. In Proceedings of the 35th AAAI conference on artificial intelligence (pp. 10656–10664). Palo Alto: AAAI Press.
Han, T., Xie, W., & Zisserman, A. (2020). Self-supervised co-training for video representation learning. In H. Larochelle, M. Ranzato, R. Hadsell, et al. (Eds.), Proceedings of the 34th international conference on neural information processing systems, Red Hook: Curran Associates.
Kuang, H., Zhu, Y., Zhang, Z., Li, X., Tighe, J., Schwertfeger, S., et al. (2021). Video contrastive learning with global context. In Proceedings of the IEEE/CVF international conference on computer vision workshops (pp. 3188–3197). Piscataway: IEEE.
Pan, T., Song, Y., Yang, T., Jiang, W., & Liu, W. (2021). Videomoco: contrastive video representation learning with temporally adversarial examples. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11205–11214). Piscataway: IEEE.
Wang, J., Gao, Y., Li, K., Lin, Y., Ma, A. J., Cheng, H., et al. (2021). Removing the background by adding the background: towards background robust self-supervised video representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11804–11813). Piscataway: IEEE.
Wang, J., Gao, Y., Li, K., Hu, J., Jiang, X., Guo, X., et al. (2021). Enhancing unsupervised video representation learning by decoupling the scene and the motion. In Proceedings of the 35th AAAI conference on artificial intelligence (pp. 10129–10137). Menlo Park: AAAI Press.
He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. B. (2020). Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9726–9735). Piscataway: IEEE.
Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. E. (2020). A simple framework for contrastive learning of visual representations. In Proceedings of the 37th international conference on machine learning (pp. 1597–1607). Stroudsburg: International Machine Learning Society.
van den Oord, A., Li, Y., & Vinyals, O. (2018). Representation learning with contrastive predictive coding. arXiv preprint. arXiv:1807.03748.
Hadsell, R., Chopra, S., & LeCun, Y. (2006). Dimensionality reduction by learning an invariant mapping. In Proceedings of the IEEE computer society conference on computer vision and pattern recognition (pp. 1735–1742). Piscataway: IEEE.
Gutmann, M., & Hyvärinen, A. (2010). Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In Y. W. Teh & D. M. Titterington (Eds.), Proceedings of the 13th international conference on artificial intelligence and statistics. Retrieved November 3, 2023, from http://proceedings.mlr.press/v9/gutmann10a.html.
Wu, Z., Xiong, Y., Yu, S. X., & Lin, D. (2018). Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 3733–3742). Piscataway: IEEE.
Tian, Y., Krishnan, D., & Isola, P. (2020). Contrastive multiview coding. In A. Vedaldi, H. Bischof, T. Brox, et al. (Eds.), Proceedings of the 16th European conference on computer vision (pp. 776–794). Cham: Springer.
Hjelm, R. D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., et al. (2019). Learning deep representations by mutual information estimation and maximization. In Proceedings of the 7th international conference on learning representations. Retrieved November 3, 2023, from https://openreview.net/forum?id=Bklr3j0cKX.
Xie, Z., Lin, Y., Zhang, Z., Cao, Y., Lin, S., & Hu, H. (2021). Propagate yourself: exploring pixel-level consistency for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 16684–16693). Piscataway: IEEE.
Wang, X., Zhang, R., Shen, C., Kong, T., & Li, L. (2021). Dense contrastive learning for self-supervised visual pre-training. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 3024–3033). Piscataway: IEEE.
Lee, H.-Y., Huang, J.-B., Singh, M., & Yang, M.-H. (2017). Unsupervised representation learning by sorting sequences. In Proceedings of the IEEE international conference on computer vision (pp. 667–676). Piscataway: IEEE.
Vondrick, C., Shrivastava, A., Fathi, A., Guadarrama, S., & Murphy, K. (2018). Tracking emerges by colorizing videos. In V. Ferrari, M. Hebert, C. Sminchisescu, et al. (Eds.), Proceedings of the 15th European conference on computer vision (pp. 402–419). Cham: Springer.
Wang, X., Jabri, A., & Efros, A. A. (2019). Learning correspondence from the cycle-consistency of time. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 2566–2576). Piscataway: IEEE.
Jabri, A., Owens, A., & Efros, A. A. (2020). Space-time correspondence as a contrastive random walk. In H. Larochelle, M. Ranzato, R. Hadsell, et al. (Eds.), Proceedings of the 34th international conference on neural information processing systems, Red Hook: Curran Associates.
Li, X., Liu, S., De Mello, S., Wang, X., Kautz, J., & Yang, M.-H. (2019). Joint-task self-supervised learning for temporal correspondence. In H. M. Wallach, H. Larochelle, A. Beygelzimer, et al. (Eds.), Proceedings of the 33rd international conference on neural information processing systems (pp. 317–327). Red Hook: Curran Associates.
Villegas, R., Yang, J., Hong, S., Lin, X., & Lee, H. (2017). Decomposing motion and content for natural video sequence prediction. In Proceedings of the 5th international conference on learning representations. Retrieved Novermber 3, 2023, from https://openreview.net/forum?id=rkEFLFqee.
Luo, Z., Peng, B., Huang, D.-A., Alahi, A., & Li, F.F. (2017). Unsupervised learning of long-term motion dynamics for videos. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7101–7110). Piscataway: IEEE.
Alwassel, H., Mahajan, D., Korbar, B., Torresani, L., Ghanem, B., & Tran, D. (2020). Self-supervised learning by cross-modal audio-video clustering. In H. Larochelle, M. Ranzato, R. Hadsell, et al. (Eds.), Proceedings of the 34th international conference on neural information processing systems (pp. 1–13). Red Hook: Curran Associates.
Piergiovanni, A. J., Angelova, A., & Ryoo, M. S. (2020). Evolving losses for unsupervised video representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 130–139). Piscataway: IEEE.
Liu, Y., Wang, K., Lan, H., & Lin, L. (2021). Temporal contrastive graph for self-supervised video representation learning. arXiv preprint. arXiv:2101.00820.
Han, T., Xie, W., & Zisserman, A. (2019). Video representation learning by dense predictive coding. In Proceedings of the IEEE/CVF international conference on computer vision workshops (pp. 1483–1492). Piscataway: IEEE.
Han, T., Xie, W., & Zisserman, A. (2020). Memory-augmented dense predictive coding for video representation learning. In A. Vedaldi, H. Bischof, T. Brox, et al. (Eds.), Proceedings of the 16th European conference on computer vision (pp. 312–329). Cham: Springer.
Yang, C., Xu, Y., Dai, B., & Zhou, B. (2020). Video representation learning with visual tempo consistency. arXiv preprint. arXiv:2006.15489.
Chen, P., Huang, D., He, D., Long, X., Zeng, R., Wen, S., et al. (2021). RSPNet: relative speed perception for unsupervised video representation learning. In Proceedings of the 35th AAAI conference on artificial intelligence (pp. 1045–1053). Palo Alto: AAAI Press.
Li, R., Zhang, Y., Qiu, Z., Yao, T., Liu, D., & Mei, T. (2021). Motion-focused contrastive learning of video representations. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 2085–2094). Piscataway: IEEE.
Qian, R., Ding, S., Liu, X., & Lin, D. (2022). Static and dynamic concepts for self-supervised video representation learning. In S. Avidan, G. J. Brostow, M. Cissé, et al. (Eds.), Proceedings of the 17th European conference on computer vision (pp. 145–164). Cham: Springer.
Ding, S., Qian, R., & Xiong, H. (2022). Dual contrastive learning for spatio-temporal representation. In J. Magalhães, A. Del Bimbo, S. Satoh, et al. (Eds.), Proceedings of the 30th ACM international conference on multimedia (pp. 5649–5658). New York: ACM.
Liu, Y., Chen, J., & Wu, H. (2022). MoQuad: motion-focused quadruple construction for video contrastive learning. In L. Karlinsky, T. Michaeli, & K. Nishino (Eds.), Proceedings of the 17th European conference on computer vision workshops (pp. 20–38). Cham: Springer.
Ding, S., Li, M., Yang, T., Qian, R., Xu, H., Chen, Q., et al. (2022). Motion-aware contrastive video representation learning via foreground-background merging. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9706–9716). Piscataway: IEEE.
Ma, S., Zeng, Z., McDuff, D., & Song, Y. (2021). Contrastive learning of global and local video representations. In M. Ranzato, A. Beygelzimer, Y. N. Dauphin, et al. (Eds.), Proceedings of the 35th international conference on neural information processing systems (pp. 7025–7040). Red Hook: Curran Associates.
Recasens, A., Luc, P., Alayrac, J.-B., Wang, L., Strub, F., Tallec, C., et al. (2021). Broaden your views for self-supervised video learning. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 1235–1245). Piscataway: IEEE.
Dave, I. R., Gupta, R., Rizve, M. N., & Shah, M. (2022). TCLR: temporal contrastive learning for video representation. Computer Vision and Image Understanding, 219, 103406.
Behrmann, N., Fayyaz, M., Gall, J., & Noroozi, M. (2021). Long short view feature decomposition via contrastive video representation learning. In Proceedings of the IEEE/CVF international conference on computer vision, Piscataway: IEEE.
Qing, Z., Zhang, S., Huang, Z., Xu, Y., Wang, X., Gao, C., et al. (2023). Self-supervised learning from untrimmed videos via hierarchical consistency. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(10), 12408–12426.
Belghazi, M.I., Baratin, A., Rajeswar, S., Ozair, S., Bengio, Y., Hjelm, R. D., et al. (2018). Mutual information neural estimation. In J. G. Dy & A. Krause (Eds.), Proceedings of the 35th international conference on machine learning (pp. 530–539). Stroudsburg: International Machine Learning Society.
Elman, J. L. (1993). Learning and development in neural networks: the importance of starting small. Cognition, 48(1), 71–99.
Bengio, Y., Louradour, J., Collobert, R., & Weston, J. (2009). Curriculum learning. In A. P. Danyluk, L. Bottou, & M. L. Littman (Eds.), Proceedings of the 26th annual international conference on machine learning (pp. 41–48). Stroudsburg: International Machine Learning Society.
Murali, A., Pinto, L., Gandhi, D., & Gupta, A. (2018). CASSL: curriculum accelerated self-supervised learning. In Proceedings of the IEEE international conference on robotics and automation (pp. 6453–6460). Piscataway: IEEE.
Kuehne, H., Jhuang, H., Garrote, E., Poggio, T. A., & Serre, T. (2011). HMDB: a large video database for human motion recognition. In D. N. Metaxas, L. Quan, A. Sanfeliu, et al. (Eds.), IEEE international conference on computer vision (pp. 2556–2563). Piscataway: IEEE.
Hara, K., Kataoka, H., & Satoh, Y. (2017). Learning spatio-temporal features with 3D residual networks for action recognition. In Proceedings of the IEEE international conference on computer vision workshops (pp. 3154–3160). Piscataway: IEEE.
Luo, D., Liu, C., Zhou, Y., Yang, D., Ma, C., Ye, Q., et al. (2020). Video cloze procedure for self-supervised spatio-temporal learning. In Proceedings of the 34th AAAI conference on artificial intelligence (pp. 11701–11708). Palo Alto: AAAI Press.
Sun, C., Baradel, F., Murphy, K., & Schmid, C. (2019). Learning video representations using contrastive bidirectional transformer. arXiv preprint. arXiv:1906.05743.
Qian, R., Li, Y., Liu, H., See, J., Ding, S., Liu, X., et al. (2021). Enhancing self-supervised video representation learning via multi-level feature optimization. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 7970–7981). Piscataway: IEEE.
Park, J., Lee, J., Kim, I.-J., & Sohn, K. (2022). Probabilistic representations for video contrastive learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 14691–14701). Piscataway: IEEE.
Huang, D., Wu, W., Hu, W., Liu, X., He, D., Wu, Z., et al. (2021). ASCNet: self-supervised video representation learning with appearance-speed consistency. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 8076–8085). Piscataway: IEEE.
Simon, J., & Jin, H. (2021). Time-equivariant contrastive video representation learning. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 9950–9960). Piscataway: IEEE.
Feichtenhofer, C., Fan, H., Xiong, B., Girshick, R. B., & He, K. (2021). A large-scale study on unsupervised spatiotemporal representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 3299–3309). Piscataway: IEEE.
Asano, Y. M., Patrick, M., Rupprecht, C., & Vedaldi, A. (2020). Labelling unlabelled videos from scratch with multi-modal self-supervision. In H. Larochelle, M. Ranzato, R. Hadsell, et al. (Eds.), Proceedings of the 34th international conference on neural information processing systems (pp. 1–12). Red Hook: Curran Associates.
Patrick, M., Asano, Y. M., Kuznetsova, P., Fong, R., Henriques, J. F., Zweig, G., et al. (2020). Multi-modal self-supervision from generalized data transformations. arXiv preprint. arXiv:2003.04298.
Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., et al. (2016). Temporal segment networks: towards good practices for deep action recognition. In B. Leibe, J. Matas, N. Sebe, et al. (Eds.), Proceedings of the 14th European conference on computer vision (pp. 20–36). Cham: Springer.
Choi, J., Gao, C., Messou, J. C. E., & Huang, J.-B. (2019). Why can’t I dance in the mall? Learning to mitigate scene bias in action recognition. In H. M. Wallach, H. Larochelle, A. Beygelzimer, et al. (Eds.), Proceedings of the 33rd international conference on neural information processing systems (pp. 851–863). Red Hook: Curran Associates.
Yao, Y., Liu, C., Luo, D., Zhou, Y., & Ye, Q. (2020). Video playback rate perception for self-supervised spatio-temporal representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 6547–6556). Piscataway: IEEE.
Tao, L., Wang, X., & Yamasaki, T. (2020). Self-supervised video representation using pretext-contrastive learning. arXiv preprint. arXiv:2010.15464.
Baek, K., Lee, M., & Psynet, H. S. (2020). Self-supervised approach to object localization using point symmetric transformation. In Proceedings of the 34th AAAI conference on artificial intelligence (pp. 10451–10459). Palo Alto: AAAI Press.
Cheng, P., Hao, W., Dai, S., Liu, J., Gan, Z., & Carin, L. (2020). CLUB: a contrastive log-ratio upper bound of mutual information. In Proceedings of the 37th international conference on machine learning (pp. 1779–1788). Stroudsburg: International Machine Learning Society.
Nowozin, S., Cseke, B., & Tomioka, R. (2016). f-GAN: training generative neural samplers using variational divergence minimization. In D. D. Lee, M. Sugiyama, U. von Luxburg, et al. (Eds.), Proceedings of the 30th international conference on neural information processing systems (pp. 271–279). Red Hook: Curran Associates.
Acknowledgements
We would like to acknowledge Yuxi Li and Huabin Liu for their constructive feedback and discussions on the paper.
Funding
This work was supported in part by the National Natural Science Foundation of China (Nos. 62325109 and U21B2013).
Author information
Contributions
For this research article, the authors’ contributions are as follows: Conceptualization: RQ, WL; Methodology: RQ, WL; Formal analysis and investigation: RQ, WL; Writing – original draft preparation: RQ; Writing – review and editing: WL, JS, DL; Funding acquisition: WL; Resources: WL; Supervision: WL. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
Weiyao Lin is an Associate Editor for Visual Intelligence and was not involved in the editorial review of, or the decision to publish, this article. The authors declare that there are no other competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Qian, R., Lin, W., See, J. et al. Controllable augmentations for video representation learning. Vis. Intell. 2, 1 (2024). https://doi.org/10.1007/s44267-023-00034-7