Abstract
This paper focuses on self-supervised video representation learning. Most existing approaches follow the contrastive learning pipeline to construct positive and negative pairs by sampling different clips. However, this formulation tends to be biased toward the static background and has difficulty establishing global temporal structures. The major reason is that the positive pairs, i.e., different clips sampled from the same video, have limited temporal receptive fields, and usually share similar backgrounds but differ in motions. To address these problems, we propose a framework that jointly utilizes local clips and global videos to learn from detailed region-level correspondence as well as general long-term temporal relations. Based on a set of designed controllable augmentations, we implement accurate appearance and motion pattern alignment through soft spatio-temporal region contrast. Our formulation avoids the low-level redundancy shortcut with an adversarial mutual information minimization objective to improve the generalization ability. Moreover, we introduce local-global temporal order dependency to further bridge the gap between clip-level and video-level representations for robust temporal modeling. Extensive experiments demonstrate that our framework is superior on three video benchmarks in action recognition and video retrieval, and captures more accurate temporal dynamics.
1 Introduction
Video representation learning is fundamental to various downstream video-related applications, e.g., action recognition [1, 2], spatio-temporal detection [3, 4], video retrieval [5, 6], etc. Traditional supervised learning schemes require costly human labeling, and the performance is usually restricted by the granularity of the annotations. More precisely, coarse-grained video-level annotations could lead the model to attend to the background [1, 7], while fine-grained annotations greatly facilitate general video analysis but are much more expensive [8, 9]. Unsupervised video representation learning has begun to attract more attention to solve this problem. Some early works designed diverse pretext tasks to learn the video characteristics in a self-supervised manner [10–15]. Recently, the formulation of contrastive learning further improved the performance by a large margin [16–19].
A prevalent method for contrastive video representation learning is to sample several clips and regard those from the same video as positive pairs [17, 20–22]. However, this formulation has two drawbacks. On the one hand, these methods tend to be biased toward the static background [23, 24]. This is because the sampled clips mostly share the same background, but subtle differences in motions probably exist. For example, in Fig. 1, the video contains a high jump scene. The clip sampled at an early timestamp shows the running action, but the clip sampled at a later timestamp presents the jumping action. Thus, pulling these two clips closer in the feature space will lead the model to neglect their distinct motions and only attend to the background of the stadium. On the other hand, there remains an obvious gap between clip-level features and video-level representation. The sampled clips have a limited temporal receptive field, and thus cannot provide comprehensive information. For example, Clip 1 in Fig. 1 only shows the momentary process of running. When we jointly leverage the two correctly ordered clips, i.e., the running action occurring before the jumping, we can understand the original video. Motivated by these observations, we intend to address these problems from two aspects: one is detailed region-level correspondence, and the other is general long-term temporal perception.
In this paper, we propose a framework to learn comprehensive appearance and motion patterns in videos. Concretely, we develop a set of controllable augmentations to achieve this goal. First, we use constrained spatio-temporal cropping to sample several local clips from each video such that the clips cover diverse timestamps of the video. Then we generate dense spatio-temporal position-wise correspondences between the local clip and global video feature maps based on the cropping parameters. In Fig. 1, we present a toy example of the temporal correspondence; the spatial correspondence is established analogously. We employ these soft correspondence codes to align features in corresponding regions. In this way, we can match the exact same appearance and motion content, while avoiding the alignment of inconsistent motions across different timestamps. However, there also exist “shortcuts” for identifying the overlapping regions between local clips and global videos, e.g., the low-level color statistics. These shortcuts could prevent the model from learning useful semantics. To avoid them, we define different intensity levels of color jitter and Gaussian blur augmentations, and regard the samples generated by the same level of augmentation as sharing similar low-level attributes. We then minimize the mutual information between them to mitigate the impact of low-level shortcuts on the extracted representation.
To further bridge the gap between clip-level and video-level representations, we intuitively introduce a learning objective to model temporal order dependency between local clips and global video. In particular, we have access to the temporal order of the sampled clips in accordance with the cropping parameters. With that, we aim to maximize the mutual information between correctly ordered clip features and the global video features. Through this operation, we facilitate the temporal awareness of the model in the pretraining stage.
In summary, our contributions are as follows:
1) We propose a unified framework to learn video representations from detailed local contrast and general long-term temporal modeling.
2) We develop controllable augmentations to match the visual contents in corresponding spatio-temporal positions for detailed content alignment, and perform mutual information minimization to avoid low-level shortcuts.
3) We introduce the temporal order dependency between the local clips and global video to enhance general temporal structure modeling.
4) We achieve superior results on downstream action recognition and video retrieval tasks, while capturing more accurate motion patterns.
2 Related work
2.1 Contrastive learning
Recently, contrastive learning [25–27] has revolutionized self-supervised learning. Its core idea is to discriminate different instances by attracting the positive pairs and repelling the negative pairs in the feature space [28, 29]. Following this, Wu et al. [30] formulated instance discrimination as a non-parametric classification problem. Van den Oord et al. [27] introduced mutual information estimation with InfoNCE loss [29], which led to easy optimization and fast convergence. Inspired by this, a line of works [25, 26, 31, 32] adopted this learning objective for image representation learning and showed significant improvement on downstream tasks. Later, Xie et al. [33] and Wang et al. [34] developed dense contrastive learning, which performed pixel-level contrast. Compared to instance-level discrimination, dense contrastive learning preserves richer characteristics, and performs better on dense prediction tasks and visual correspondence learning. In our work, we focus on video representation learning. Considering that natural spatio-temporal correspondences exist in video domains, we propose to utilize them as self-supervisory signals for spatio-temporal region contrast to learn more comprehensive video representations.
2.2 Video representation learning
Unlike images, videos contain internal temporal structures that are crucial for video content analysis. To this end, many works [11, 14, 35] designed various pretext tasks to leverage the natural spatio-temporal correspondence as self-supervisory signals. Some typical pretext tasks include temporal ordering [11, 14, 19], spatio-temporal puzzles [12, 15], colorization [36], playback speed prediction [10, 13], temporal cycle-consistency [37–39] and future prediction [40, 41]. There were also some works [42, 43] using cross-modal correspondence for self-supervised pretraining. Inspired by the success of contrastive learning in the image domain, a series of works [16–18, 44] extended this pipeline to the video domain. Particularly, Han et al. [45, 46] employed information noise contrastive estimation (InfoNCE) loss for dense future prediction, while Wang et al. [18] and Yang et al. [47] sampled clips of different rates as positive pairs for visual content learning. However, video contrastive learning could lead the model to place more emphasis on the static scene and focus less on motion [23]. To solve this problem, Chen et al. [48] and Jenni et al. [13] integrated contrastive learning with temporal pretext tasks to enhance temporal awareness. Han et al. [20] and Li et al. [49] used optical flow to assist motion modeling. Qian et al. [50] and Ding et al. [51] used static frames, frame differences and consecutive frames to balance appearance and motion perception. Liu et al. [52] and Ding et al. [53] carefully designed motion-focused augmentations to place more emphasis on dynamic motions. In our work, we do not resort to frame differences or optical flow to enhance motion learning and temporal modeling. Instead, we hypothesize that the underlying reason for the static scene bias lies in the positive pair formulation. Most existing works use either different frames [16, 19] or different clips [17, 21] from the same video as the positive pair, which usually have similar backgrounds but possess different motions. Hence, we propose to consider the corresponding regions within local and global views to form accurate positive pairs, concurrently with low-level shortcut elimination, which captures the desired static and dynamic characteristics. In addition, we develop a temporal dependency between these views to bridge the gap between local clip and global video representations, while learning robust temporal structures.
2.3 Local-global views for video representation
There have also been some works using local and global views for self-supervised video representation learning [21, 54–57]. The major difference between our work and these works lies in the definition of local-global views and the learning target. In our work, “local-global” means short and long video clips, and the major target is to construct spatio-temporal overlaps and formulate a soft learning objective, which guides detailed region-level video content alignment. In Ref. [54], local-global meant local fine-grained and global coarse-grained features, which were designed for general audio-visual correspondence. Recasens et al. [55] aimed to extrapolate the neighboring video content in the global view based on the observation from the local view. Dave et al. [56] designed a loss function to learn temporal correspondence between local and global clips, but still with hard positive assignment. Behrmann et al. [57] employed local-global views to decompose stationary and non-stationary features, and Kuang et al. [21] used them for segment-based positive sampling. Qing et al. [58] built hierarchical structures on videos and employed multi-level temporal consistency to guide local and global video representation learning.
3 Method
The core idea of our proposed framework is to enhance self-supervised video representation learning by comprehensive appearance and motion content modeling. As displayed in Fig. 2, we utilize a set of controllable augmentations to achieve detailed spatio-temporal region contrast, low-level shortcut elimination and general temporal dependency modeling.
Specifically, we divide the augmentations into two parts: spatio-temporal position transformations \(\tau _{\text{p}}\) that include crop and horizontal flip, and low-level statistic transformations \(\tau _{\text{l}}\) that include color jitter and Gaussian blur. Following the data preprocessing pipeline, given a video v, we first use \(\tau _{\text{p}}\) to sample several local clips and then perform \(\tau _{\text{l}}\) to generate the input to the encoder.
3.1 Spatio-temporal region contrast
Given a video v with temporal length T, we first use a set of spatio-temporal position transformations \(\tau _{\text{p}}^{k}\in \{\tau _{\text{p}}^{1},\tau _{\text{p}}^{2}, \ldots ,\tau _{\text{p}}^{K}\}\) to sample K clips \(v_{k}\in \{v_{1},v_{2},\ldots ,v_{K}\}\), to provide the local feature descriptions. To let the sampled clips contain as much information as the original video, we manually constrain the temporal cropping parameters in \(\tau _{\text{p}}^{k}\) to control the central timestamp of \(v_{k}\) in the range of \([{(k-1)T}/{K},{kT}/{K} ]\). In this way, sampled clips cover different temporal segments and they jointly present the rich information in v. As mentioned in Sect. 1, there could be inconsistencies in motions between different local clips such that it is not optimal to align the representations between different clips. Hence, we need to determine the exact corresponding content for feature alignment. To this end, considering that there is a natural correspondence between local clips and global video, we leverage v and \(v_{k}\) as two views for feature matching.
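As a minimal illustration of this constrained temporal cropping, the following sketch (our own, with a hypothetical helper name) draws the center of the kth clip uniformly from \([{(k-1)T}/{K},{kT}/{K} ]\) so that the K clips jointly cover the video:

```python
import numpy as np

def sample_clip_starts(T, K, clip_len=16, rng=None):
    """Sample K local clips whose central timestamps fall in
    [(k-1)T/K, kT/K], so the clips jointly cover the whole video."""
    rng = rng or np.random.default_rng()
    starts = []
    for k in range(1, K + 1):
        center = rng.uniform((k - 1) * T / K, k * T / K)
        start = int(np.clip(center - clip_len / 2, 0, T - clip_len))
        starts.append((start, start + clip_len))
    return starts

# A 160-frame video sampled into K = 4 temporally constrained 16-frame clips.
print(sample_clip_starts(T=160, K=4))
```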
For local clip feature extraction, we denote the feature extractor as \(f(\cdot )\) and the local clip feature map as \(f(v_{k})\in \mathbb{R}^{CT_{\text{c}}HW}\), where C, H and W denote the channel, height and width dimensions, and \(T_{\text{c}}\) denotes the temporal dimension of the clip feature map. For global video feature extraction, we perform sparse sampling to represent v, and set the temporal stride of some convolution layers to 1 to make \(f'(v)\in \mathbb{R}^{CT_{\text{v}}HW}\) possess a higher temporal resolution, i.e., temporal dimension \(T_{\text{v}}>T_{\text{c}}\). Note that f and \(f'\) share the same architecture and only differ in the temporal stride. Details of the network settings are described in Sect. 4.2.
Based on \(f(v_{k})\) and \(f'(v)\), we refer to the augmentation parameters in \(\tau _{\text{p}}^{k}\) to calculate the dense spatio-temporal position correspondence. Specifically, we use \(S_{k}\in \mathbb{R}^{N_{\text{c}}\times N_{\text{v}}}\) to indicate the correspondence result, where \(N_{\text{c}}=T_{\text{c}}HW\) and \(N_{\text{v}}=T_{\text{v}}HW\). \(S_{k}(i,j)\) reveals the correspondence score between the ith spatio-temporal grid in \(f(v_{k})\) and the jth grid in \(f'(v)\). Essentially, each grid on the feature map is equivalent to a tube covering a certain spatio-temporal area as illustrated in Fig. 2, and \(S_{k}(i,j)\) is measured by the ratio of the intersection of the two tubes over the volume of tube \(f(v_{k})[i]\):
\[ S_{k}(i,j)=\frac{\mathit{inter} (f(v_{k})[i],f'(v)[j] )}{\mathit{vol} (f(v_{k})[i] )}, \]
where \([\cdot ]\) denotes the grid index, \(\mathit{vol}(\cdot )\) measures the spatio-temporal volume of the given feature tube, and \(\mathit{inter}(\cdot ,\cdot )\) measures the intersecting volume between two tubes. The detailed computation process is illustrated in Sect. 4.2. In this formulation, each row of \(S_{k}\) sums to 1, i.e., \(\sum_{j=1}^{N_{\text{v}}}S_{k}(i,j)=1\) for every i. This indicates that each row in \(S_{k}\) can be treated as a probability distribution that describes the correspondence between \(f(v_{k})[i]\) and each grid in \(f'(v)\).
Therefore, we utilize the calculated correspondence matrix \(S_{k}\) as the reference distribution to guide spatio-temporal region feature contrast for accurate visual content alignment. Specifically, we take \(f(v_{k})[i]\) as a query for illustration. Recall that the InfoNCE loss can be written as the cross-entropy between a prior distribution, i.e., the indicator function, and the feature similarity distribution:
\[ \mathcal{L}_{\text{nce}}=-\sum_{j}\mathbb{I}_{ij}\log \frac{\mathit{sim}(q_{i},k_{j})}{\sum_{m}\mathit{sim}(q_{i},k_{m})}, \]
where q and k respectively denote the query and key features in contrastive learning, \(\mathbb{I}_{ij}=1\) if \(i=j\) and \(\mathbb{I}_{ij}=0\) otherwise, and \(\mathit{sim}(\cdot ,\cdot )=\exp (\cos (\cdot ,\cdot )/\tau )\) measures the feature similarity. In our formulation, we replace the prior \(\mathbb{I}_{ij}\) with the soft distribution \(S_{k}(i,j)\) for accurate region contrast. Since the correspondence between \(v_{k}\) and clips from other videos naturally equals 0, we can intuitively enlarge the negative pool by introducing features from other videos. Thus, the spatio-temporal region contrast loss over all \(f(v_{k})[i]\) can be formulated as
\[ \mathcal{L}_{\text{rc}}=-\frac{1}{N_{\text{c}}}\sum_{i=1}^{N_{\text{c}}}\sum_{j=1}^{N_{\text{v}}}S_{k}(i,j)\log \frac{\mathit{sim} (f(v_{k})[i],f'(v)[j] )}{\sum_{m=1}^{N_{\text{v}}}\mathit{sim} (f(v_{k})[i],f'(v)[m] )+\sum_{\boldsymbol{n}}\mathit{sim} (f(v_{k})[i],\boldsymbol{n} )}, \]
where \(\boldsymbol{n}\in \mathbb{R}^{C}\) denotes a negative feature sampled from other videos in the mini-batch. Note that we sample the global views of other videos to form the negative pairs by default, and we include an ablation study on this choice in the experimental part. In this way, we are able to align the exact corresponding visual contents, including both static appearance and dynamic motions, in videos.
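For concreteness, here is a minimal PyTorch sketch of this soft region contrast, assuming the clip and video grid features are flattened to \((N_{\text{c}}, C)\) and \((N_{\text{v}}, C)\), the negatives are pooled global-view grids from other videos, and the rows of \(S_{k}\) sum to 1; tensor names and shapes are our own illustration rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def region_contrast_loss(clip_feat, video_feat, S, negatives, tau=0.1):
    """Soft spatio-temporal region contrast.

    clip_feat:  (Nc, C) local clip grid features  f(v_k)[i]
    video_feat: (Nv, C) global video grid features f'(v)[j]
    S:          (Nc, Nv) correspondence matrix, each row sums to 1
    negatives:  (Nn, C) grid features sampled from other videos
    """
    q = F.normalize(clip_feat, dim=1)
    k_pos = F.normalize(video_feat, dim=1)
    k_neg = F.normalize(negatives, dim=1)

    pos_logits = q @ k_pos.t() / tau               # (Nc, Nv)
    neg_logits = q @ k_neg.t() / tau               # (Nc, Nn)
    logits = torch.cat([pos_logits, neg_logits], dim=1)
    log_prob = F.log_softmax(logits, dim=1)        # denominator includes negatives

    # Cross-entropy against the soft correspondence distribution S:
    # only in-video grids carry positive mass, negatives get weight 0.
    return -(S * log_prob[:, : S.size(1)]).sum(dim=1).mean()

# Toy shapes: 2x4x4 clip grid (32), 8x4x4 video grid (128), 256 negatives.
Nc, Nv, C = 32, 128, 128
S = torch.rand(Nc, Nv)
S = S / S.sum(dim=1, keepdim=True)
loss = region_contrast_loss(torch.randn(Nc, C), torch.randn(Nv, C), S, torch.randn(256, C))
print(loss.item())
```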
3.2 Low-level shortcut elimination
However, the local-global spatio-temporal correspondence for region feature contrast can be exploited through a “shortcut” that relies merely on low-level statistics, e.g., the color distribution, to identify the overlapping areas. This shortcut could prevent the model from learning meaningful semantic features. To this end, we aim to mitigate the impact of low-level statistics on the extracted representations.
An intuitive way to solve this problem is by utilizing strong augmentations. However, we find that this is not enough in the video domain. Unlike images, the temporal continuity between sampled frames could provide extra cues to learn these shortcuts. For example, the continuous change in illumination helps to determine the corresponding segments in the local-global view. It is nontrivial to design augmentations to decouple such low-level information from the final representations. Motivated by adversarial learning, a promising approach is to learn a low-level information estimator from semantically inconsistent samples that share similar low-level statistics. Then, we let the encoder minimize this estimated information.
We note that the color and blur augmentations \(\tau _{\text{l}}\) are effective at distorting low-level statistics. In other words, similar augmentations generate samples that share similar low-level characteristics. Hence, we define several intensity levels of \(\tau _{\text{l}}\) by constraining the augmentation parameters to certain ranges. As such, we can use the controlled \(\tau _{\text{l}}\) to generate frame sequences that possess distinct semantics but similar low-level statistics. Then, we build a mutual information estimator on top of the extracted feature representation for low-level information extraction. Note that there are several ways to approximate the mutual information; we compare different estimation methods in Sect. 4.4. For illustration, we take MINE [59] as an example. Following Ref. [59], we approximate the mutual information between two variables by
\[ I(X;Y)\geqslant \mathcal{E}_{\mathcal{P}_{XY}} [G_{\theta}(X,Y) ]-\log \mathcal{E}_{\mathcal{P}_{X}\otimes \mathcal{P}_{Y}} [\mathrm{e}^{G_{\theta}(X,Y)} ], \]
where \(\mathcal{P}\) denotes a probability distribution, and \(\mathcal{E}\) represents taking the expectation over the corresponding distribution. X and Y are the feature representations extracted by the encoder f. The projection function G maps the combination of the two variables X and Y, sampled from the distribution spaces \(\mathcal{X}\) and \(\mathcal{Y}\), to a scalar value, i.e., \(G_{\theta}:\mathcal{X}\times \mathcal{Y}\rightarrow \mathbb{R}\). It is instantiated by a neural network with parameters \(\theta \in \Theta \), where Θ is the parameter set for optimization. Empirically, we instantiate \(G_{\theta}\) as a two-layer multi-layer perceptron (MLP). We regard the features of sample pairs generated from the same intensity level of \(\tau _{\text{l}}\) as the joint distribution \(\mathcal{P}_{XY}\), and the features of arbitrary sample pairs as the marginal \(\mathcal{P}_{X}\otimes \mathcal{P}_{Y}\), where ⊗ denotes the combination of two marginal distributions. During training, we formulate the learning objective as
\[ \mathcal{L}_{\text{mi}}=\mathcal{E}_{\mathcal{P}_{XY}} [G_{\theta}(X,Y) ]-\log \mathcal{E}_{\mathcal{P}_{X}\otimes \mathcal{P}_{Y}} [\mathrm{e}^{G_{\theta}(X,Y)} ]. \]
We maximize Eq. (6) with respect to the MLP parameters θ to obtain a reliable low-level information extractor, but reverse the gradient back-propagated to the encoder f so that f minimizes Eq. (6). With the learned low-level information estimator \(G_{\theta}\), we further apply it to the aforementioned local-global pairs, \(f(v_{k})\) and \(f'(v)\), to minimize the low-level shortcut by optimizing f while keeping θ fixed. In this way, we minimize the impact of low-level statistics on the spatio-temporal region feature contrast, and facilitate detailed semantic alignment.
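The sketch below shows one way to wire this up in PyTorch, assuming MINE as the estimator: a two-layer MLP \(G_{\theta}\) scores feature pairs, joint pairs come from the same intensity level, marginal pairs are formed by shuffling within the batch, and a gradient-reversal function lets a single backward pass maximize the bound for the head while minimizing it for the encoder. Class and function names are ours for illustration.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates the gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

class MINEHead(nn.Module):
    """Two-layer MLP G_theta mapping a feature pair to a scalar score."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=1)).squeeze(1)

def mine_estimate(head, x, y):
    """MINE bound: E_joint[G] - log E_marginal[exp(G)]. Joint pairs share the
    same augmentation intensity level; marginal pairs are built by shuffling y."""
    joint = head(x, y).mean()
    marginal = head(x, y[torch.randperm(y.size(0))])
    return joint - (torch.logsumexp(marginal, dim=0) - torch.log(torch.tensor(float(y.size(0)))))

# The head ascends the bound; the reversed gradient makes the encoder descend it.
dim = 128
head = MINEHead(dim)
feat_x = torch.randn(64, dim, requires_grad=True)   # stand-ins for encoder outputs
feat_y = torch.randn(64, dim, requires_grad=True)
mi = mine_estimate(head, GradReverse.apply(feat_x), GradReverse.apply(feat_y))
(-mi).backward()
```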
3.3 Local-global temporal dependency
Now, we have learned robust clip features from the detailed region semantic contrast, and the remaining task is to bridge the gap between clip-level and video-level representations. Considering that internal temporal relationships exist among the sampled local clips and are naturally contained in the global video, we propose to model the temporal order dependency between \(f(v_{k})\), \(k=\{1,2,\ldots ,K\}\), and \(f'(v)\) to enhance video-level understanding.
Similar to Sect. 3.2, we also use mutual information to measure the local-global temporal order dependency. The target is to maximize the mutual information between correctly ordered clip-level features and the video-level representation. Mathematically, we define the sequentially ordered clip features as \(\overline{f}(v)= [f(v_{1})\circ f(v_{2})\circ \cdots \circ f(v_{K}) ]\), where ∘ denotes the concatenation operation, and the arbitrarily ordered features as \(\widetilde{f}(v)\). To model the temporal dependency, we regard \(\overline{f}(v)\) and \(f'(v)\) as sampled from the joint distribution \(\mathcal{P}_{XY}\), and \(\widetilde{f}(v)\) and \(f'(v)\) as sampled from the marginal distribution \(\mathcal{P}_{X}\otimes \mathcal{P}_{Y}\). In this formulation, the learning objective can be written as
\[ \mathcal{L}_{\text{td}}=- (\mathcal{E}_{\mathcal{P}_{XY}} [G_{\psi} (\overline{f}(v),f'(v) ) ]-\log \mathcal{E}_{\mathcal{P}_{X}\otimes \mathcal{P}_{Y}} [\mathrm{e}^{G_{\psi} (\widetilde{f}(v),f'(v) )} ] ), \]
where \(G_{\psi}\) is the mutual information estimation head with parameters ψ. Several alternatives exist to instantiate \(G_{\psi}\), and we discuss this in Sect. 4.4.
There are some alternatives for establishing the marginal distribution \(\mathcal{P}_{X}\otimes \mathcal{P}_{Y}\) in Eq. (7). By default, we formulate it as a uniform distribution consisting of differently ordered video clips \(\widetilde{f}(v)\). Empirically, this formulation places equal emphasis on all order combinations. However, among the shuffled orders, some are quite trivial to discriminate while others are difficult to perceive. To this end, we refer to the transformation parameters \(\tau _{\text{p}}\) in our controllable augmentations, and evaluate the difficulty of each order to pay more attention to the hard examples. In particular, an order indicator \(\mathcal{O}\in \mathbb{N}^{K}\) is given, and we have \(\mathcal{O}[k]=k\) for the correct order. We denote the central timestamp of the kth clip as \(t_{\mathcal{O}[k]}\), and the oracle central timestamp is \(\hat{t}_{k}={(2k-1)T}/{(2K)}\). We calculate the summation of the central timestamp deviations to produce the difficulty score of order \(\mathcal{O}\):
\[ d(\mathcal{O})=\sum_{k=1}^{K} \vert t_{\mathcal{O}[k]}-\hat{t}_{k} \vert . \]
The lower deviation indicates the higher learning difficulty. We take softmax normalization over the difficulty scores to generate the sampling probability of each order combination. In this way, we formulate a marginal distribution which emphasizes hard examples to improve the learning efficiency.
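As a small worked example of this difficulty-aware sampling, the NumPy sketch below scores each shuffled order by the summed deviation of its clip centers from the oracle centers \((2k-1)T/(2K)\) and converts the scores into sampling probabilities with a temperature-controlled softmax; the negated deviation in the softmax reflects our reading that a lower deviation means a harder, more heavily sampled order.

```python
import numpy as np
from itertools import permutations

def order_sampling_probs(centers, T, temperature=1.0):
    """Difficulty-aware sampling distribution over shuffled clip orders.

    centers: central timestamps of the K clips in their correct order.
    An order whose reordered centers deviate little from the oracle centers
    (2k-1)T/(2K) is hard to distinguish from the correct order, so it
    receives more sampling mass (softmax over the negated deviation)."""
    K = len(centers)
    oracle = np.array([(2 * k - 1) * T / (2 * K) for k in range(1, K + 1)])
    orders = [p for p in permutations(range(K)) if p != tuple(range(K))]
    deviation = np.array([np.abs(np.asarray(centers)[list(p)] - oracle).sum() for p in orders])
    logits = -deviation / temperature   # lower temperature -> sharper, harder orders dominate
    probs = np.exp(logits - logits.max())
    return orders, probs / probs.sum()

orders, probs = order_sampling_probs(centers=[20, 60, 100, 140], T=160)
shuffled = orders[np.random.choice(len(orders), p=probs)]
print(shuffled)
```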
It is worth noting that there are some previous works using temporal order to build pretext tasks for self-supervised learning [11, 14, 35]. The major difference is that our approach incorporates the video-level feature to determine whether the clips are correctly ordered, while Refs. [11, 14, 35] have no access to the global feature. In this way, our formulation could avoid the ambiguity problem when encountering temporal structures that cannot be determined solely from local clips. For example, in a complex gymnastic scene, it is difficult to determine the temporal order of the gymnastic actions only with local clips. However, it is practical to reach the correct order with reference to the global video feature. Thus, our local-global mutual temporal order constraint is a better way to embed video-level temporal structures into the extracted representations.
3.4 Training
We jointly train our model with the aforementioned three objectives:
\[ \mathcal{L}=\mathcal{L}_{\text{rc}}+\alpha \mathcal{L}_{\text{mi}}+\beta \mathcal{L}_{\text{td}}, \]
where α and β serve as balancing hyper-parameters. We set \(\alpha =\beta =1\) by default, and find that the performance is fairly robust to these hyper-parameters.
In addition, we also explore applying a curriculum evolving strategy to the parameters of the controllable augmentations to adjust the training process. Intuitively, motivated by Refs. [60, 61], it is promising to learn from easier samples first and then gradually expand to more difficult tasks. This phenomenon is more obvious in self-supervised learning in the absence of human annotations [62]. To this end, in this work, we design an evolving strategy for the augmentation parameters to allow the model to learn in an easy-to-hard manner. Specifically, at the beginning of the training process, we strictly constrain the cropping parameters in \(\tau _{\text{p}}\) to construct clearly distinct local clips, which reduces the difficulty of dense region contrast and temporal dependency modeling. Additionally, we define easily distinguishable intensity levels in \(\tau _{\text{l}}\) so that the low-level statistics are easier to capture. During training, we gradually relax the constraints on the augmentation parameters to increase the learning difficulty. We present the detailed formulation of the dynamic parameter evolving process in the implementation details, and show the empirical comparison with constant augmentation parameters in the ablation study.
4 Experiment
4.1 Datasets
We use 4 video action recognition datasets, Kinetics-400 [1], UCF-101 [7], HMDB-51 [63] and Diving-48 [9]. Kinetics-400 [1] is a large-scale dataset consisting of 240 K video clips with 400 human action classes. UCF-101 [7] contains over 13 K clips covering 101 action classes. HMDB-51 [63] covers 51 action categories and approximately 7 K annotated clips. Diving-48 [9] contains 48 different diving actions, which mainly vary in motion patterns and share similar backgrounds. In our experiments, we use the training set of UCF-101 or Kinetics-400 for self-supervised pretraining. For the downstream tasks, following Refs. [10, 46, 56], we use split 1 of UCF-101 and HMDB-51, and the test split V1 of Diving-48 for evaluation.
4.2 Implementation details
Self-supervised pretraining
For global video input, we sparsely sample 16 frames with weak spatial cropping. For local clip input, we constrain the temporal cropping parameters to make K 16-frame clips approximately uniformly distributed in the video. The local clips are spatially cropped within the global view to ensure position-wise correspondence. For low-level augmentations, we define a set of color jitter and Gaussian blur parameters to form different intensity-level transformations. We resize the input frame sequence into \(16\times 112\times 112\), and use R3D-18 [64] as the video encoder. For local clip feature extraction, we follow the default setting and the feature resolution is \(2\times 4\times 4\). For global video feature extraction, we set the temporal stride of the last 3 stages to 1, so that the feature resolution is \(8\times 4\times 4\). We calculate the spatio-temporal correspondence matrix between local and global feature maps based on the cropping and flipping parameters for optimization.
In terms of training settings, we use a batch size of 128, and set the number of local clips K to 4 by default. We train our model on UCF-101 for 200 epochs and on Kinetics-400 for 100 epochs. We use the Adam optimizer with an initial learning rate of \(1.0\times 10^{-3}\) and a weight decay of \(1.0\times 10^{-5}\). The learning rate is decayed by a factor of 10 at epoch 70 for Kinetics-400 and at epoch 150 for UCF-101.
Action recognition
We load the pretrained video encoder parameters except for the last fully-connected layer. There are two evaluation protocols: (1) end-to-end finetuning of the whole network with action labels; (2) freezing the encoder and only training a linear classifier, which is known as the linear probe. For evaluation, we follow Refs. [14, 18] to uniformly sample 10 clips for each video, which are center cropped and resized to \(112\times 112\). We average the softmax probabilities of the clips as the final prediction and report the Top-1 accuracy.
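A minimal sketch of this clip-averaging evaluation (the toy classifier only stands in for the finetuned R3D-18; it is not the actual network):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def video_top1(model, clips):
    """clips: (10, 3, 16, 112, 112), ten uniformly sampled, center-cropped
    clips from one video. Average the per-clip softmax scores, take argmax."""
    probs = F.softmax(model(clips), dim=1).mean(dim=0)
    return probs.argmax().item()

# Toy stand-in classifier over 101 UCF-101 classes (pool + linear head).
toy_model = nn.Sequential(nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(3, 101))
print(video_top1(toy_model, torch.randn(10, 3, 16, 112, 112)))
```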
Video retrieval
We directly use the pretrained model to extract video features without finetuning. Following Refs. [14, 65], we regard videos in the test set as queries, and retrieve nearest neighbors from the training set. Similar to action recognition, we average the features of ten uniformly sampled clips as the global representation. We report Top-k recall R@k.
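Below is a small sketch of the R@k computation under the common convention that a query counts as correct if any of its k nearest training-set neighbors (by cosine similarity) shares the query's action class; function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def recall_at_k(query_feats, query_labels, gallery_feats, gallery_labels, k=5):
    """R@k: a query is correct if any of its k nearest training-set neighbors
    (cosine similarity) shares the query's action class."""
    q = F.normalize(query_feats, dim=1)
    g = F.normalize(gallery_feats, dim=1)
    topk = (q @ g.t()).topk(k, dim=1).indices                 # (Nq, k)
    hits = (gallery_labels[topk] == query_labels[:, None]).any(dim=1)
    return hits.float().mean().item()

# Toy example: 100 test queries retrieved against 1000 training videos.
print(recall_at_k(torch.randn(100, 128), torch.randint(0, 101, (100,)),
                  torch.randn(1000, 128), torch.randint(0, 101, (1000,))))
```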
Controllable augmentations
We also provide a detailed illustration of our controllable augmentations. We respectively describe the implementations for random spatial crop, random temporal crop, random horizontal flip, color jitter and Gaussian blur. We use the default setting, 4 local clips and 512 low-level augmentation intensities for illustration, and provide the detailed evolving progress of the dynamic augmentation parameters.
1) Random temporal crop. For global video, we do not perform temporal cropping, but uniformly sample 16 frames. For local clips, we constrain the central frame to respectively be located in the timespan \([0.00,0.25]\), \([0.25,0.50]\), \([0.50,0.75]\) and \([0.75,1.00]\) to cover the whole video. For the dynamic augmentation parameter design, we constrain the time interval to \([0.10,0.15]\), \([0.35,0.40]\), \([0.60,0.65]\) and \([0.85,0.90]\) at the beginning to generate clips with more easily distinguishable temporal boundaries. Then we linearly expand the time interval to the default \([0.00,0.25]\), \([0.25,0.50]\), \([0.50,0.75]\) and \([0.75,1.00]\) in each training iteration.
2) Random spatial crop. For global video, we perform weak spatial cropping, with the area ratio in the range \([0.8,1.0]\). For local clips, we perform strong spatial cropping relative to the cropped global video, with an area ratio of \([0.4,0.8]\). Similarly, for the dynamic curriculum training, we first constrain the local clip cropping ratio to \([0.7,0.8]\) for more discriminative region contrast. Then we linearly reduce the lower limit of the area ratio to \([0.4,0.8]\) to increase the learning difficulty with more diverse and noisier dense correspondence.
3) Random horizontal flip. For both global video and local clips, we perform random horizontal flipping with a probability of 0.5.
4) Color jitter. Referring to the default settings in Refs. [20, 22], the brightness (B), contrast (C) and saturation (S) are in the range \([0.0,0.4]\), and the hue (H) is in the range \([0,0.1]\). We uniformly divide B, C and S into 4 groups each, with intensities from weak to strong in the ranges \([0.0,0.1]\), \([0.1,0.2]\), \([0.2,0.3]\) and \([0.3,0.4]\), and divide H into 2 groups in the ranges \([0.00,0.05]\) and \([0.05,0.10]\). In the dynamic training, we initialize the intensity levels as \([0.04,0.06]\), \([0.14,0.16]\), \([0.24,0.26]\) and \([0.34,0.36]\) so that the differences in low-level statistics are easy to capture. As training progresses, we linearly expand the intensity levels to the default setting, \([0.0,0.1]\), \([0.1,0.2]\), \([0.2,0.3]\) and \([0.3,0.4]\).
5) Gaussian blur. We adopt 2 different radii, 7 and 11, and 2 different sigma ranges, \([0.1,0.5]\) and \([0.5,2.0]\), resulting in 4 combinations. In the same manner, we initially set the sigma ranges to \([0.25,0.35]\) and \([1.2,1.3]\), then linearly relax them to \([0.1,0.5]\) and \([0.5,2.0]\) to formulate easy-to-hard training (a sketch of this linear relaxation is given after this list).
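All of the evolving schedules above reduce to the same operation: linearly interpolating a parameter range from its strict initial setting to the default one over training iterations. A hedged sketch with our own helper name:

```python
def evolve_range(start_range, end_range, step, total_steps):
    """Linearly relax an augmentation parameter range from its strict initial
    setting to the default one as training progresses (easy-to-hard)."""
    t = min(step / total_steps, 1.0)
    lo = start_range[0] + t * (end_range[0] - start_range[0])
    hi = start_range[1] + t * (end_range[1] - start_range[1])
    return lo, hi

# Example: the first clip's temporal-center interval grows from [0.10, 0.15]
# at iteration 0 to the default [0.00, 0.25] by the end of training.
for step in (0, 5000, 10000):
    print(step, evolve_range((0.10, 0.15), (0.00, 0.25), step, total_steps=10000))
```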
With the help of these controllable augmentations, the spatio-temporal correspondence is calculated as the ratio of the intersection of two tubes. As demonstrated in Fig. 3, we have the global video feature map \(F_{\text{v}}\) and the local clip feature map \(F_{\text{c}}\). We aim to calculate the spatio-temporal correspondence matrix S, where \(S[i,j]\) indicates the correspondence score between the ith grid in \(F_{\text{c}}\) and the jth grid in \(F_{\text{v}}\). For better illustration, we assume \(F_{\text{c}}[i]\) covers the area \([(t_{\text{c}}^{1},t_{\text{c}}^{2}),(h_{\text{c}}^{1},h_{\text{c}}^{2}),(w_{\text{c}}^{1},w_{\text{c}}^{2})]\) and \(F_{\text{v}}[j]\) covers the area \([(t_{\text{v}}^{1},t_{\text{v}}^{2}),(h_{\text{v}}^{1},h_{\text{v}}^{2}),(w_{\text{v}}^{1},w_{\text{v}}^{2})]\). Then the intersection can be written as
\[ \mathit{inter} (F_{\text{c}}[i],F_{\text{v}}[j] )=\max (0,\min (t_{\text{c}}^{2},t_{\text{v}}^{2})-\max (t_{\text{c}}^{1},t_{\text{v}}^{1}) )\cdot \max (0,\min (h_{\text{c}}^{2},h_{\text{v}}^{2})-\max (h_{\text{c}}^{1},h_{\text{v}}^{1}) )\cdot \max (0,\min (w_{\text{c}}^{2},w_{\text{v}}^{2})-\max (w_{\text{c}}^{1},w_{\text{v}}^{1}) ). \]
\(S[i,j]\) is the ratio of the intersection over the volume of \(F_{\text{c}}[i]\):
\[ S[i,j]=\frac{\mathit{inter} (F_{\text{c}}[i],F_{\text{v}}[j] )}{(t_{\text{c}}^{2}-t_{\text{c}}^{1})(h_{\text{c}}^{2}-h_{\text{c}}^{1})(w_{\text{c}}^{2}-w_{\text{c}}^{1})}. \]
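A compact NumPy sketch of this per-grid computation, assuming the clip-tube coordinates have already been mapped into the global video's coordinate system by the cropping and flipping parameters:

```python
import numpy as np

def tube_overlap_ratio(tube_c, tube_v):
    """Correspondence score S[i, j]: the volume of the intersection of the
    clip tube F_c[i] and the video tube F_v[j], divided by the volume of
    the clip tube. Each tube is ((t1, t2), (h1, h2), (w1, w2))."""
    inter = 1.0
    for (c1, c2), (v1, v2) in zip(tube_c, tube_v):
        inter *= max(0.0, min(c2, v2) - max(c1, v1))
    vol_c = np.prod([c2 - c1 for c1, c2 in tube_c])
    return inter / vol_c

# A clip tube half-overlapping a video tube along time, fully inside spatially.
clip_tube  = ((4, 8),  (0, 28), (0, 28))
video_tube = ((6, 10), (0, 28), (0, 28))
print(tube_overlap_ratio(clip_tube, video_tube))   # 0.5
```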
4.3 Comparison with existing works
Action recognition
We first present the comparison between our method and recent video representation learning approaches on action recognition in Table 1. We report the Top-1 accuracy on UCF-101 and HMDB-51 under the linear probe and finetune protocols. We exclude methods that use different evaluation settings and much deeper backbones, such as Refs. [17, 49, 71], or those that rely on audio and text modalities, such as Refs. [72, 73]. In Table 1, we use ‘V+F’ to denote the use of both red green blue (RGB) frames and optical flow in the self-supervised pretraining stage. All evaluation results are obtained using only RGB at test time.
Under the linear probe, our method outperforms other RGB-only approaches by a large margin. The superiority over RSPNet [48], which integrates a temporal pretext task with contrastive learning, demonstrates the effectiveness of our general temporal structure learning scheme. Note that our method also dramatically narrows the gap between RGB-only and RGB-flow based methods. This indicates that our method significantly improves motion pattern modeling. Under finetuning, our method achieves promising results when pretrained on UCF-101, even surpassing RGB-flow based architectures [15, 20] and methods trained with higher resolution [52]. When pretrained on Kinetics-400, our method is generally comparable with the state-of-the-art RGB-flow approaches [15, 20], motion-focused methods [52, 53], and probabilistic and hierarchical pretraining [58, 68]. This indicates that our controllable augmentations have the potential to cover long-range temporal dynamics and lead to a comprehensive perception of appearance and motion cues. In addition, due to limited computational resources, we do not compare with works using very large backbones, such as Refs. [17, 71], but we present an ablation in the bottom three rows. The results indicate that our method has the potential to scale to longer training schedules, deeper backbones and larger resolutions.
In addition, we also provide results on Diving-48 [9], a dataset that mainly relies on dynamic motions to distinguish different action categories. We compare the results of both supervised (the second to the fourth row) and self-supervised methods (bottom three rows) in Table 2. Since the appearance is similar across different videos, the Top-1 accuracy well reflects the ability in motion understanding. We observe that in this case, semantic label supervision is not effective, and our method improves the performance by a notable margin. This demonstrates that our learning approach is superior in capturing motion patterns, with less reliance on background information.
Video retrieval
Table 3 depicts the comparison on video retrieval with R@k. The model is pretrained on UCF-101. Our method remarkably outperforms most RGB-based approaches. Note that some methods, especially PCL [77], achieve impressive results when k increases to 20. This is because when k is large, it becomes easier to rely on the background as a shortcut to retrieve videos of the same category. We reach comparable or even better performance, although STS [15] and CoCLR [20] adopt both RGB and optical flow. This demonstrates once again that our integration of detailed local feature alignment and general long-term temporal modeling is effective in enhancing motion pattern modeling without resorting to motion-biased input data.
Visualization analysis
We also display some visualization results to analyze the learned feature representations in Fig. 4. We employ class-agnostic activation maps (CAAM) [78] to reveal the spatio-temporal distributions of the extracted features. Generally, vanilla contrastive learning based on SimCLR [26] leads the model to focus on representative background cues, e.g., the soccer field, swimming pool and fitness equipment. In contrast, our pretrained model focuses on the moving foregrounds that contain actions, such as the moving human body and moving boat.
4.4 Ablation study
In this section, we provide several ablation studies to analyze our video representation learning framework. Unless specifically mentioned, all models are pretrained on UCF-101 for 150 epochs, with R3D-18 as the backbone.
Local-global sampling
We first explore the impact of the local-global settings. We investigate two aspects: the number of local clips K and the global video feature temporal resolution \(T_{\text{v}}\), which is obtained by adjusting the temporal convolution stride. We present the results in Table 4. By varying the number of local clips K from 1 to 4, we find that having more local clips tends to improve the performance due to more fine-grained feature alignment. It is worth noting that when the ratio \(T_{\text{v}}/(KT_{\text{c}})<1\), the granularity of the local-global correspondence becomes too coarse, which restricts the performance. Overall, accurate spatio-temporal region correspondence provides a reliable reference for appearance and motion pattern matching, and significantly improves action recognition.
Negative pair formulation
With the local-global sampling, there are several alternatives to formulate the negative feature pool in Eq. (4). In detail, given the query local clip feature \(f(v_{k})[i]\), we sample the matched global view features as positive pairs. Correspondingly, by default we sample the global view features from other videos to formulate the negative pairs. In this ablation study, we also compare with two other variants: (1) sampling local clip features from other videos as negative pairs; (2) combining both local and global view features from other videos as the negative samples. We show the empirical comparison in Table 5. We observe that sampling negative features from the global view works much better than sampling from the local views. Integrating both local and global features as negative pairs leads to negligible improvements. This is because the global view features provide richer visual contexts with larger receptive fields, and serve as more informative reference signals in dense region contrast.
Random perturbations in dense correspondence
Recalling Eq. (1), we calculate the intersection ratio of two spatio-temporal cubes to produce the dense correspondence matrix to indicate the positive pairs. This strict correspondence is purely based on the geometric positions and might neglect some visual contexts that help high-level understanding. To this end, we compare adding independent random Gaussian noise and applying spatio-temporal Gaussian blur to the dense correspondence matrix in Table 6. We find that the random Gaussian noise leads to a substantial performance drop since it destroys the correspondence relations. In contrast, Gaussian blur basically maintains the original correspondence distribution and slightly expands the positive sampling area. It improves the performance by including more diverse visual contexts as positive features.
Low-level augmentation levels
We also explore the setting of the intensity levels of the low-level augmentations. We follow conventional implementations: for color jitter, the controllable parameters of brightness, contrast, saturation and hue are set as \((B,C,S,H)= (0.4,0.4,0.4,0.1)\) by default [20, 22]. For Gaussian blur, we control the radius and sigma. We set different numbers of intensity levels for each controllable parameter, as listed in Table 7. Note that since B, C and S share the same default value, we also set the same number of levels for them. The total number of predefined intensity levels equals the product of the numbers of levels of all parameters, i.e., 32 for the first row, 512 for the second row, etc. For consistency, in each iteration, we randomly sample 32 intensity levels from all possible levels, resulting in 32 groups of features that share similar low-level statistics for mutual information minimization. We observe that too few or too many levels both lead to a performance drop. This is because more levels lead to less difference between different groups, while fewer levels mean more difference within each group. We conclude that a trade-off exists which requires balancing to achieve the best possible training.
Mutual information estimation
We also delve into several methods for mutual information estimation. For low-level shortcut elimination, we need to force the encoder to minimize the estimated mutual information. Theoretically, we should minimize an estimated upper bound. However, we find that training is difficult to converge with CLUB [79], which is an upper bound estimation approach. Therefore, we also adopt the lower bound estimation methods MINE [59], JS [80] and InfoNCE [27] for comparison. These methods provide tight lower bound estimates which lead to easier convergence, thus achieving superior performance on action recognition. Compared to the baseline, as illustrated in Table 8, our method obtains a significant improvement, especially when pretrained on mini-Kinetics and evaluated on UCF-101. This indicates that mutual information minimization helps mitigate low-level shortcuts and enhances the generalization ability.
Temporal dependency head
To further examine implementations of the temporal dependency head, we compare three typical examples: (1) MLP: concatenate \(f'(v)\) with \(\overline{f}(v)\) or \(\widetilde{f}(v)\) and pass the result through a multi-layer perceptron (MLP) to obtain a scalar value. (2) GRU: use a gated recurrent unit (GRU) network to process the clip feature sequence, and calculate the cosine similarity between \(f'(v)\) and the GRU output. (3) GRU+MLP: the GRU processes the clip feature sequence, whose output is then concatenated with \(f'(v)\) and passed through an MLP to obtain a scalar value. The results are listed in Table 9. Compared with no temporal constraint, all three implementations gain significant improvements. We note that our MLP implementation is similar to VCOP [14] but differs in the learning objective. This improvement reveals that introducing the global video feature as a reference could enhance temporal structure modeling.
Marginal distribution formulation
In addition to the mutual information estimation head, we also compare different marginal distribution formulations for temporal dependency modeling. By default, we instantiate it as a uniform distribution containing different shuffled orders. We also compare this with a difficulty-aware marginal distribution that places more emphasis on hard examples. The results are presented in Table 10. The temperature hyper-parameter controls the concentration of the softmax normalization over the difficulty scores in Eq. (8). A lower temperature leads to a sharper distribution, thus resulting in repeatedly sampling difficult examples. From the comparison, we observe that a lower temperature impairs the performance while a smoother distribution brings improvements. This indicates the necessity of using a small number of easy examples to guide the model to discriminate temporal relations and prevent the model from falling into ambiguities. Only on this basis can sampling more hard examples further facilitate temporal perception.
Dynamic augmentation parameters
Here, we provide a quantitative comparison between the default augmentation parameter setting and the dynamic parameter evolving in Table 11. In the dynamic learning stage, we manually control the augmentation parameters to construct the training samples in a curriculum manner. It is clear that the dynamic augmentation parameter evolving in an easy-to-hard manner leads to performance improvement. This dynamic setting contributes to determining optimal augmentation parameter combinations that facilitate video representation learning.
Training efficiency
We also report the training efficiency of the variants of our method and other works. For a fair comparison, we pretrain the R3D-18 backbone with a resolution of \(112\times 112\) and a clip length of 16 frames on Kinetics-400 on a server with 8 NVIDIA 3090 GPUs, and report the training time in hours as well as the total GPU hours. For the variants of our method, we compare different numbers of local clips K and present the results in Table 12. We observe that sampling more local clips leads to faster convergence, so the training time does not increase linearly. Compared with the other two baselines [51, 67], based on our reimplementations, our method achieves better performance with fewer training hours, demonstrating the high training efficiency of our framework.
Overall learning objectives
We finally show the ablation of the designed learning objectives in Table 13, where \(\mathcal{L}_{\text{nce}}\) is the standard contrastive loss used in existing works. We observe that the integration of \(\mathcal{L}_{\text{rc}}\) and \(\mathcal{L}_{\text{mi}}\) significantly outperforms \(\mathcal{L}_{\text{nce}}\), which indicates that the detailed region contrast with low-level shortcut elimination is more effective than naive global contrast. In addition, \(\mathcal{L}_{\text{td}}\) further enables the model to go beyond local clips and establish long-term relationships. The improvement demonstrates that our method well integrates detailed region-level contrast and general long-term temporal perception.
5 Conclusion
In this paper, we propose a framework that leverages local clips and the global video to enhance self-supervised video representation learning. We employ a set of controllable augmentations to crop local clips and to generate groups of samples that share similar low-level attributes. We then use the soft codes computed from the crop and flip parameters to guide detailed spatio-temporal region contrastive learning, and minimize the mutual information within the same low-level group to avoid shortcuts. We also incorporate local-global temporal dependency to embed general temporal structures into the extracted video representations. Experiments on downstream tasks of action recognition and video retrieval demonstrate the superiority of our formulation, especially in modeling dynamic motion patterns.
Data availability
The datasets generated during and/or analyzed during the current study are available in the kinetics-dataset repository, https://github.com/cvdfoundation/kinetics-dataset.
Abbreviations
- GRU: gated recurrent unit
- InfoNCE: information noise contrastive estimation
- MLP: multi-layer perceptron
- RGB: red green blue
References
Carreira, J., & Zisserman, A. (2017). Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4724–4733). Piscataway: IEEE.
Xie, S., Sun, C., Huang, J., Tu, Z., & Murphy, K. (2018). Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification. In V. Ferrari, M. Hebert, C. Sminchisescu, et al. (Eds.), Proceedings of the 15th European conference on computer vision (pp. 318–335). Cham: Springer.
Gu, C., Sun, C., Ross, D. A., Vondrick, C., Pantofaru, C., Li, Y., et al. (2018). AVA: a video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 6047–6056). Piscataway: IEEE.
Heilbron, F. C., Escorcia, V., Ghanem, B., & Niebles, J. C. (2015). Activitynet: a large-scale video benchmark for human activity understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 961–970). Piscataway: IEEE.
Liu, Y., Albanie, S., Nagrani, A., & Zisserman, A. (2019). Use what you have: video retrieval using representations from collaborative experts. arXiv preprint. arXiv:1907.13487.
Miech, A., Zhukov, D., Alayrac, J.-B., Tapaswi, M., Laptev, I., & Sivic, J. (2019). Howto100m: learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 2630–2640). Piscataway: IEEE.
Soomro, K., Zamir, A. R., & Shah, M. (2012). UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint. arXiv:1212.0402.
Goyal, R., Kahou, S. E., Michalski, V., Materzynska, J., Westphal, S., Kim, H., et al. (2017). The “something something” video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision (pp. 5843–5851). Piscataway: IEEE.
Li, Y., Li, Y., & Vasconcelos, N. (2018). RESOUND: towards action recognition without representation bias. In V. Ferrari, M. Hebert, C. Sminchisescu, et al. (Eds.), Proceedings of the 15th European conference on computer vision (pp. 520–535). Cham: Springer.
Benaim, S., Ephrat, A., Lang, O., Mosseri, I., Freeman, W. T., Rubinstein, M., et al. (2020). SpeedNet: learning the speediness in videos. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9919–9928). Piscataway: IEEE.
Misra, I., Zitnick, C. L., & Hebert, M. (2016). Shuffle and learn: unsupervised learning using temporal order verification. In B. Leibe, J. Matas, N. Sebe, et al. (Eds.), Proceedings of the 14th European conference on computer vision (pp. 527–544). Cham: Springer.
Kim, D., Cho, D., & Kweon, I. S. (2019). Self-supervised video representation learning with space-time cubic puzzles. In Proceedings of the 33rd AAAI conference on artificial intelligence (pp. 8545–8552). Palo Alto: AAAI Press.
Jenni, S., Meishvili, G., & Favaro, P. (2020). Video representation learning by recognizing temporal transformations. In A. Vedaldi, H. Bischof, T. Brox, et al. (Eds.), Proceedings of the 16th European conference on computer vision (pp. 425–442). Cham: Springer.
Xu, D., Xiao, J., Zhao, Z., Shao, J., Xie, D., & Zhuang, Y. (2019). Self-supervised spatiotemporal learning via video clip order prediction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10334–10343). Piscataway: IEEE.
Wang, J., Jiao, J., Bao, L., He, S., Liu, W., & Liu, Y. (2022). Self-supervised video representation learning by uncovering spatio-temporal statistics. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(7), 3791–3806.
Gordon, D., Ehsani, K., Fox, D., & Farhadi, A. (2020). Watching the world go by: representation learning from unlabeled videos. arXiv preprint. arXiv:2003.07990.
Qian, R., Meng, T., Gong, B., Yang, M.-H., Wang, H., Belongie, S. J., et al. (2021). Spatiotemporal contrastive video representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 6964–6974). Piscataway: IEEE.
Wang, J., Jiao, J., & Liu, Y.-H. (2020). Self-supervised video representation learning by pace prediction. In A. Vedaldi, H. Bischof, T. Brox, et al. (Eds.), Proceedings of the 16th European conference on computer vision (pp. 504–521). Cham: Springer.
Yao, T., Zhang, Y., Qiu, Z., Pan, Y., & Mei, T. (2021). SeCo: exploring sequence supervision for unsupervised representation learning. In Proceedings of the 35th AAAI conference on artificial intelligence (pp. 10656–10664). Palo Alto: AAAI Press.
Han, T., Xie, W., & Zisserman, A. (2020). Self-supervised co-training for video representation learning. In H. Larochelle, M. Ranzato, R. Hadsell, et al. (Eds.), Proceedings of the 34th international conference on neural information processing systems, Red Hook: Curran Associates.
Kuang, H., Zhu, Y., Zhang, Z., Li, X., Tighe, J., Schwertfeger, S., et al. (2021). Video contrastive learning with global context. In Proceedings of the IEEE/CVF international conference on computer vision workshops (pp. 3188–3197). Piscataway: IEEE.
Pan, T., Song, Y., Yang, T., Jiang, W., & Liu, W. (2021). Videomoco: contrastive video representation learning with temporally adversarial examples. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11205–11214). Piscataway: IEEE.
Wang, J., Gao, Y., Li, K., Lin, Y., Ma, A. J., Cheng, H., et al. (2021). Removing the background by adding the background: towards background robust self-supervised video representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11804–11813). Piscataway: IEEE.
Wang, J., Gao, Y., Li, K., Hu, J., Jiang, X., Guo, X., et al. (2021). Enhancing unsupervised video representation learning by decoupling the scene and the motion. In Proceedings of the 35th AAAI conference on artificial intelligence (pp. 10129–10137). Menlo Park: AAAI Press.
He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. B. (2020). Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9726–9735). Piscataway: IEEE.
Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. E. (2020). A simple framework for contrastive learning of visual representations. In Proceedings of the 37th international conference on machine learning (pp. 1597–1607). Stroudsburg: International Machine Learning Society.
van den Oord, A., Li, Y., & Vinyals, O. (2018). Representation learning with contrastive predictive coding. arXiv preprint. arXiv:1807.03748.
Hadsell, R., Chopra, S., & LeCun, Y. (2006). Dimensionality reduction by learning an invariant mapping. In Proceedings of the IEEE computer society conference on computer vision and pattern recognition (pp. 1735–1742). Piscataway: IEEE.
Gutmann, M., & Hyvärinen, A. (2010). Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In Y. W. Teh & D. M. Titterington (Eds.), Proceedings of the 13th international conference on artificial intelligence and statistics. Retrieved November 3, 2023, from http://proceedings.mlr.press/v9/gutmann10a.html.
Wu, Z., Xiong, Y., Yu, S. X., & Lin, D. (2018). Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 3733–3742). Piscataway: IEEE.
Tian, Y., Krishnan, D., & Isola, P. (2020). Contrastive multiview coding. In A. Vedaldi, H. Bischof, T. Brox, et al. (Eds.), Proceedings of the 16th European conference on computer vision (pp. 776–794). Cham: Springer.
Hjelm, R. D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., et al. (2019). Learning deep representations by mutual information estimation and maximization. In Proceedings of the 7th international conference on learning representations. Retrieved November 3, 2023, from https://openreview.net/forum?id=Bklr3j0cKX.
Xie, Z., Lin, Y., Zhang, Z., Cao, Y., Lin, S., & Hu, H. (2021). Propagate yourself: exploring pixel-level consistency for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 16684–16693). Piscataway: IEEE.
Wang, X., Zhang, R., Shen, C., Kong, T., & Li, L. (2021). Dense contrastive learning for self-supervised visual pre-training. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 3024–3033). Piscataway: IEEE.
Lee, H.-Y., Huang, J.-B., Singh, M., & Yang, M.-H. (2017). Unsupervised representation learning by sorting sequences. In Proceedings of the IEEE international conference on computer vision (pp. 667–676). Piscataway: IEEE.
Vondrick, C., Shrivastava, A., Fathi, A., Guadarrama, S., & Murphy, K. (2018). Tracking emerges by colorizing videos. In V. Ferrari, M. Hebert, C. Sminchisescu, et al. (Eds.), Proceedings of the 15th European conference on computer vision (pp. 402–419). Cham: Springer.
Wang, X., Jabri, A., & Efros, A. A. (2019). Learning correspondence from the cycle-consistency of time. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 2566–2576). Piscataway: IEEE.
Jabri, A., Owens, A., & Efros, A. A. (2020). Space-time correspondence as a contrastive random walk. In H. Larochelle, M. Ranzato, R. Hadsell, et al. (Eds.), Proceedings of the 34th international conference on neural information processing systems, Red Hook: Curran Associates.
Li, X., Liu, S., De Mello, S., Wang, X., Kautz, J., & Yang, M.-H. (2019). Joint-task self-supervised learning for temporal correspondence. In H. M. Wallach, H. Larochelle, A. Beygelzimer, et al. (Eds.), Proceedings of the 33rd international conference on neural information processing systems (pp. 317–327). Red Hook: Curran Associates.
Villegas, R., Yang, J., Hong, S., Lin, X., & Lee, H. (2017). Decomposing motion and content for natural video sequence prediction. In Proceedings of the 5th international conference on learning representations. Retrieved Novermber 3, 2023, from https://openreview.net/forum?id=rkEFLFqee.
Luo, Z., Peng, B., Huang, D.-A., Alahi, A., & Li, F.F. (2017). Unsupervised learning of long-term motion dynamics for videos. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7101–7110). Piscataway: IEEE.
Alwassel, H., Mahajan, D., Korbar, B., Torresani, L., Ghanem, B., & Tran, D. (2020). Self-supervised learning by cross-modal audio-video clustering. In H. Larochelle, M. Ranzato, R. Hadsell, et al. (Eds.), Proceedings of the 34th international conference on neural information processing systems (pp. 1–13). Red Hook: Curran Associates.
Piergiovanni, A. J., Angelova, A., & Ryoo, M. S. (2020). Evolving losses for unsupervised video representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 130–139). Piscataway: IEEE.
Liu, Y., Wang, K., Lan, H., & Lin, L. (2021). Temporal contrastive graph for self-supervised video representation learning. arXiv preprint. arXiv:2101.00820.
Han, T., Xie, W., & Zisserman, A. (2019). Video representation learning by dense predictive coding. In Proceedings of the IEEE/CVF international conference on computer vision workshops (pp. 1483–1492). Piscataway: IEEE.
Han, T., Xie, W., & Zisserman, A. (2020). Memory-augmented dense predictive coding for video representation learning. In A. Vedaldi, H. Bischof, T. Brox, et al. (Eds.), Proceedings of the 16th European conference on computer vision (pp. 312–329). Cham: Springer.
Yang, C., Xu, Y., Dai, B., & Zhou, B. (2020). Video representation learning with visual tempo consistency. arXiv preprint. arXiv:2006.15489.
Chen, P., Huang, D., He, D., Long, X., Zeng, R., Wen, S., et al. (2021). RSPNet: relative speed perception for unsupervised video representation learning. In Proceedings of the 35th AAAI conference on artificial intelligence (pp. 1045–1053). Palo Alto: AAAI Press.
Li, R., Zhang, Y., Qiu, Z., Yao, T., Liu, D., & Mei, T. (2021). Motion-focused contrastive learning of video representations. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 2085–2094). Piscataway: IEEE.
Qian, R., Ding, S., Liu, X., & Lin, D. (2022). Static and dynamic concepts for self-supervised video representation learning. In S. Avidan, G. J. Brostow, M. Cissé, et al. (Eds.), Proceedings of the 17th European conference on computer vision (pp. 145–164). Cham: Springer.
Ding, S., Qian, R., & Xiong, H. (2022). Dual contrastive learning for spatio-temporal representation. In J. Magalhães, A. Del Bimbo, S. Satoh, et al. (Eds.), Proceedings of the 30th ACM international conference on multimedia (pp. 5649–5658). New York: ACM.
Liu, Y., Chen, J., & Wu, H. (2022). MoQuad: motion-focused quadruple construction for video contrastive learning. In L. Karlinsky, T. Michaeli, & K. Nishino (Eds.), Proceedings of the 17th European conference on computer vision workshops (pp. 20–38). Cham: Springer.
Ding, S., Li, M., Yang, T., Qian, R., Xu, H., Chen, Q., et al. (2022). Motion-aware contrastive video representation learning via foreground-background merging. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9706–9716). Piscataway: IEEE.
Ma, S., Zeng, Z., McDuff, D., & Song, Y. (2021). Contrastive learning of global and local video representations. In M. Ranzato, A. Beygelzimer, Y. N. Dauphin, et al. (Eds.), Proceedings of the 35th international conference on neural information processing systems (pp. 7025–7040). Red Hook: Curran Associates.
Recasens, A., Luc, P., Alayrac, J.-B., Wang, L., Strub, F., Tallec, C., et al. (2021). Broaden your views for self-supervised video learning. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 1235–1245). Piscataway: IEEE.
Dave, I. R., Gupta, R., Rizve, M. N., & Shah, M. (2022). TCLR: temporal contrastive learning for video representation. Computer Vision and Image Understanding, 219, 103406.
Behrmann, N., Fayyaz, M., Gall, J., & Noroozi, M. (2021). Long short view feature decomposition via contrastive video representation learning. In Proceedings of the IEEE/CVF international conference on computer vision, Piscataway: IEEE.
Qing, Z., Zhang, S., Huang, Z., Xu, Y., Wang, X., Gao, C., et al. (2023). Self-supervised learning from untrimmed videos via hierarchical consistency. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(10), 12408–12426.
Belghazi, M.I., Baratin, A., Rajeswar, S., Ozair, S., Bengio, Y., Hjelm, R. D., et al. (2018). Mutual information neural estimation. In J. G. Dy & A. Krause (Eds.), Proceedings of the 35th international conference on machine learning (pp. 530–539). Stroudsburg: International Machine Learning Society.
Elman, J. L. (1993). Learning and development in neural networks: the importance of starting small. Cognition, 48(1), 71–99.
Bengio, Y., Louradour, J., Collobert, R., & Weston, J. (2009). Curriculum learning. In A. P. Danyluk, L. Bottou, & M. L. Littman (Eds.), Proceedings of the 26th annual international conference on machine learning (pp. 41–48). Stroudsburg: International Machine Learning Society.
Murali, A., Pinto, L., Gandhi, D., & Gupta, A. (2018). CASSL: curriculum accelerated self-supervised learning. In Proceedings of the IEEE international conference on robotics and automation (pp. 6453–6460). Piscataway: IEEE.
Kuehne, H., Jhuang, H., Garrote, E., Poggio, T. A., & Serre, T. (2011). HMDB: a large video database for human motion recognition. In D. N. Metaxas, L. Quan, A. Sanfeliu, et al. (Eds.), IEEE international conference on computer vision (pp. 2556–2563). Piscataway: IEEE.
Hara, K., Kataoka, H., & Satoh, Y. (2017). Learning spatio-temporal features with 3D residual networks for action recognition. In Proceedings of the IEEE international conference on computer vision workshops (pp. 3154–3160). Piscataway: IEEE.
Luo, D., Liu, C., Zhou, Y., Yang, D., Ma, C., Ye, Q., et al. (2020). Video cloze procedure for self-supervised spatio-temporal learning. In Proceedings of the 34th AAAI conference on artificial intelligence (pp. 11701–11708). Palo Alto: AAAI Press.
Sun, C., Baradel, F., Murphy, K., & Schmid, C. (2019). Learning video representations using contrastive bidirectional transformer. arXiv preprint. arXiv:1906.05743.
Qian, R., Li, Y., Liu, H., See, J., Ding, S., Liu, X., et al. (2021). Enhancing self-supervised video representation learning via multi-level feature optimization. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 7970–7981). Piscataway: IEEE.
Park, J., Lee, J., Kim, I.-J., & Sohn, K. (2022). Probabilistic representations for video contrastive learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 14691–14701). Piscataway: IEEE.
Huang, D., Wu, W., Hu, W., Liu, X., He, D., Wu, Z., et al. (2021). ASCNet: self-supervised video representation learning with appearance-speed consistency. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 8076–8085). Piscataway: IEEE.
Simon, J., & Jin, H. (2021). Time-equivariant contrastive video representation learning. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 9950–9960). Piscataway: IEEE.
Feichtenhofer, C., Fan, H., Xiong, B., Girshick, R. B., & He, K. (2021). A large-scale study on unsupervised spatiotemporal representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 3299–3309). Piscataway: IEEE.
Asano, Y. M., Patrick, M., Rupprecht, C., & Vedaldi, A. (2020). Labelling unlabelled videos from scratch with multi-modal self-supervision. In H. Larochelle, M. Ranzato, R. Hadsell, et al. (Eds.), Proceedings of the 34th international conference on neural information processing systems (pp. 1–12). Red Hook: Curran Associates.
Patrick, M., Asano, Y. M., Kuznetsova, P., Fong, R., Henriques, J. F., Zweig, G., et al. (2020). Multi-modal self-supervision from generalized data transformations. arXiv preprint. arXiv:2003.04298.
Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., et al. (2016). Temporal segment networks: towards good practices for deep action recognition. In B. Leibe, J. Matas, N. Sebe, et al. (Eds.), Proceedings of the 14th European conference on computer vision (pp. 20–36). Cham: Springer.
Choi, J., Gao, C., Messou, J. C. E., & Huang, J.-B. (2019). Why can’t I dance in the mall? Learning to mitigate scene bias in action recognition. In H. M. Wallach, H. Larochelle, A. Beygelzimer, et al. (Eds.), Proceedings of the 33rd international conference on neural information processing systems (pp. 851–863). Red Hook: Curran Associates.
Yao, Y., Liu, C., Luo, D., Zhou, Y., & Ye, Q. (2020). Video playback rate perception for self-supervised spatio-temporal representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 6547–6556). Piscataway: IEEE.
Tao, L., Wang, X., & Yamasaki, T. (2020). Self-supervised video representation using pretext-contrastive learning. arXiv preprint. arXiv:2010.15464.
Baek, K., Lee, M., & Psynet, H. S. (2020). Self-supervised approach to object localization using point symmetric transformation. In Proceedings of the 34th AAAI conference on artificial intelligence (pp. 10451–10459). Palo Alto: AAAI Press.
Cheng, P., Hao, W., Dai, S., Liu, J., Gan, Z., & Carin, L. (2020). CLUB: a contrastive log-ratio upper bound of mutual information. In Proceedings of the 37th international conference on machine learning (pp. 1779–1788). Stroudsburg: International Machine Learning Society.
Nowozin, S., Cseke, B., & Tomioka, R. (2016). f-GAN: training generative neural samplers using variational divergence minimization. In D. D. Lee, M. Sugiyama, U. von Luxburg, et al. (Eds.), Proceedings of the 30th international conference on neural information processing systems (pp. 271–279). Red Hook: Curran Associates.
Acknowledgements
We would like to acknowledge Yuxi Li and Huabin Liu for their constructive feedback and discussions on the paper.
Funding
This work was supported in part by the National Natural Science Foundation of China (Nos. 62325109 and U21B2013).
Author information
Contributions
For this research article, the authors’ contributions are as follows: Conceptualization: RQ, WL; Methodology: RQ, WL; Formal analysis and investigation: RQ, WL; Writing – original draft preparation: RQ; Writing – review and editing: WL, JS, DL; Funding acquisition: WL; Resources: WL; Supervision: WL. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
Weiyao Lin is an Associate Editor for Visual Intelligence and was not involved in the editorial review of, or the decision to publish, this article. The authors declare that there are no other competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Qian, R., Lin, W., See, J. et al. Controllable augmentations for video representation learning. Vis. Intell. 2, 1 (2024). https://doi.org/10.1007/s44267-023-00034-7