1 Introduction
Video frame synthesis is a classic video processing task that generates frames in-between (interpolation) or subsequent to (extrapolation) reference frames. It has many practical applications, such as video compression [17, 28, 31], video view synthesis [6, 14], slow-motion generation [12], and motion blur synthesis [3]. Recently, various deep-learning-based algorithms have been proposed to handle video frame synthesis problems. Most of them focus on interpolation [2, 9, 12, 15, 16, 25, 26, 27, 32, 33, 39, 43] or extrapolation [8, 38], while some others [19] can deal with both interpolation and extrapolation in a unified framework.
Among existing video frame synthesis algorithms, flow-based schemes predict the current frame by warping reference frames with estimated optical flows. Many flow-based interpolation schemes [1, 2, 12] first estimate bi-directional optical flows and then approximate intermediate flows by flow reversal [12, 18]. Some recent schemes [9, 16, 22, 26, 27, 43] directly estimate intermediate flows and achieve superior performance. However, most existing methods fail to estimate large motion or the motion of small objects, mainly because of the limited receptive field and motion capture capability of convolutional neural networks (CNNs).
Correspondence matching has proven effective in capturing long-term correlations in multimedia tasks like video object segmentation [4, 42] and optical flow estimation [13, 37]. In these scenarios, by matching pixels of the current frame against the reference frame, a correspondence vector can be established to guide the generation of the mask or flow (Figure 2, left). However, in video frame synthesis, we only have two reference frames, and the current frame is not available. As a result, correspondence matching cannot be performed directly, and how to perform correspondence matching in video frame synthesis remains an open question.
In this article, we introduce a neighbor correspondence matching (NCM) algorithm that establishes correspondences in a current-frame-agnostic fashion to enhance flow estimation in video frame synthesis. Observing that objects usually move continuously and locally within a small region in natural videos, we propose to perform correspondence matching between the spatial-temporal neighbors of each pixel. Specifically, for each pixel in the current frame, we use pixels in the local windows of adjacent reference frames to calculate the correspondence matrix (Figure 2, middle and right), so the pixel value of the current frame is not required. The matched neighbor correspondence matrix can effectively model object correlations, from which sufficient motion cues can be inferred to guide flow estimation. In addition, multi-scale neighbor correspondence matching is performed to extend the receptive field and capture large motion.
Based on NCM, we further propose a unified video frame synthesis network for both interpolation and extrapolation. The proposed model accurately estimates intermediate flows in a heterogeneous coarse-to-fine manner: the coarse-scale module leverages the multi-scale neighbor correspondence matrix to capture accurate motion at low resolution, while the fine-scale module refines the coarse flows to high resolution in a computationally efficient fashion. In addition, to eliminate the resolution gap between the training dataset and real-world high-resolution videos, we train the coarse- and fine-scale modules with a progressive training strategy. Combining all the above designs, we augment the RIFE [9] framework into an NCM-based network for video frame synthesis. The model is not only effective but also efficient, especially for high-resolution videos.
To fully understand how NCM works, we further explore its mechanism. By visualizing the neighbor correspondence, we conclude that NCM provides multiple-hypotheses motion information for frame synthesis. In addition, we compare the estimated flows between interpolation and extrapolation and find that the extrapolation model follows a single-motion-hypothesis synthesis process and cannot fully exploit such multiple-hypotheses motion information. Based on these analyses, we introduce a more powerful multiple-hypotheses framework for video frame extrapolation, NCM-MH. NCM and NCM-MH demonstrate new state-of-the-art results on several benchmarks. As shown in Figure 1, on the challenging X4K1000FPS benchmark, NCM improves peak signal-to-noise ratio (PSNR) by 0.57 dB (from 31.06 dB of AMT [16] to 31.63 dB) in \(8\times\) interpolation, and NCM-MH improves PSNR by 2.65 dB (from 25.43 dB of DMVFN [8] to 28.08 dB) in extrapolation. The experimental results demonstrate the capability of the proposed methods in capturing large motion and handling real-scenario videos.
In summary, the main contributions of this article are as follows:
(1) We introduce a neighbor correspondence matching algorithm for video frame synthesis, which is simple yet effective in capturing large motion and the motion of small objects.
(2) We propose a heterogeneous coarse-to-fine structure and train it in a progressive fashion to eliminate the resolution gap between training and inference. Combining all the designs above, we propose a unified framework for frame synthesis, NCM, which generates intermediate flows accurately and efficiently.
(3) By exploring the multiple-hypotheses motion mechanism of NCM, we introduce a more robust multiple-hypotheses framework, NCM-MH, for the challenging video frame extrapolation task.
This article is an extension of our previous conference version [11]. The current work adds to the initial version in several significant aspects. First, we explore the theoretical mechanism of neighbor correspondence matching, accompanied by visualization results, to provide a more comprehensive understanding of NCM. Second, based on this analysis, we introduce multiple motion hypotheses into NCM, resulting in a more powerful frame extrapolation framework, NCM-MH, which demonstrates new state-of-the-art results. Third, we incorporate considerable new experimental results, including comparisons with recently proposed methods, ablation studies, model settings, visual comparisons, and analyses. We hope that these extensions contribute to a more comprehensive and insightful exploration of the subject matter.
3 Methodology
The overview of the proposed video frame synthesis network is shown in Figure 3. The network consists of three parts: (1) neighbor correspondence matching with a feature pyramid (yellow in Figure 3), (2) heterogeneous coarse-to-fine motion estimation (blue in Figure 3), and (3) frame synthesis. For completeness, we first briefly introduce RIFE [9], from which we adopt some block designs, and then describe the details of each module.
3.1 Background
We base our network design on the RIFE [9] framework. The fundamental concept of RIFE is to directly estimate the intermediate flows for interpolation, denoted as \(F=(F_{t\rightarrow 0}, F_{t\rightarrow 1})\), which point from the target frame \(I_t\) to the reference frames \(I_0, I_1\). With the intermediate flows and an estimated fusion mask \(M\), the target frame can be generated through the following process:
\[
\hat{I_t} = M \odot warp(I_0, F_{t\rightarrow 0}) + (1 - M) \odot warp(I_1, F_{t\rightarrow 1}),
\]
where \(\odot\) denotes the pixelwise product and \(warp\) denotes the backward warping operation. RIFE employs an intermediate flow estimation network (IFNet) to estimate \(F=(F_{t\rightarrow 0}, F_{t\rightarrow 1})\) and \(M\) from the reference frames \(I_0\) and \(I_1\). Within the IFNet, three cascaded intermediate flow blocks (IFBlocks) (Figure 5, left) are integrated. Each block refines the flow and mask generated by the previous one by estimating residual values for the flow and mask. Subsequently, with the generated \(\hat{I_t}\), \(F\), and \(M\), a U-Net-like frame synthesis network is employed to refine \(\hat{I_t}\) into the synthesized frame \(\tilde{I_t}\).
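For concreteness, the following is a minimal PyTorch sketch of the warp-and-fuse synthesis in the equation above; `backward_warp` is a generic bilinear backward-warping routine written for illustration, not RIFE's exact implementation.

```python
import torch
import torch.nn.functional as F

def backward_warp(img, flow):
    """Sample img at (pixel position + flow) with bilinear interpolation.
    img: (B, C, H, W); flow: (B, 2, H, W) with (x, y) displacements."""
    B, _, H, W = img.shape
    gy, gx = torch.meshgrid(torch.arange(H, device=img.device),
                            torch.arange(W, device=img.device), indexing='ij')
    # Normalize sampling coordinates to [-1, 1] as required by grid_sample
    x = (gx.unsqueeze(0) + flow[:, 0]) / (W - 1) * 2 - 1
    y = (gy.unsqueeze(0) + flow[:, 1]) / (H - 1) * 2 - 1
    grid = torch.stack([x, y], dim=-1)                  # (B, H, W, 2)
    return F.grid_sample(img, grid, align_corners=True)

def synthesize(I0, I1, F_t0, F_t1, M):
    # hat(I_t) = M * warp(I0, F_{t->0}) + (1 - M) * warp(I1, F_{t->1})
    return M * backward_warp(I0, F_t0) + (1 - M) * backward_warp(I1, F_t1)
```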
RIFE is lightweight and runs in real time, but its synthesis quality is unsatisfactory due to the limited receptive field and motion capture capability of the fully convolutional network. In addition, RIFE cannot adapt to high-resolution videos, where the motion is even larger. To overcome these limitations, we propose neighbor correspondence matching and a heterogeneous coarse-to-fine structure for video frame synthesis, which can estimate accurate intermediate flows even on 4K videos.
3.2 Neighbor Correspondence Matching
3.2.1 Overview.
Based on the observation that an object usually moves continuously and locally within a small region in natural videos, the core idea of NCM is to explore motion information by establishing spatial-temporal correlations between neighboring regions. In detail, for each pixel we compute the correspondences between the local windows of the two adjacent reference frames, as shown in Figure 2. No information from the current frame is needed, so the matching can be performed in a current-frame-agnostic fashion to meet the needs of frame synthesis.
It is worth noting that the position of the local windows is determined by the position of the pixel, which differs from cost-volume-based methods [9, 26] that establish a cost volume around the location the estimated flows point to. If the estimated flow is inaccurate, then such flow-centric matching may be performed in the wrong region, and the matched correlations become ineffective. On the contrary, NCM is pixel-centric and will not be misled by inaccurate flows.
3.2.2 Mathematical Formulation.
Given a pair of reference frames \(I_0, I_1\in \mathbb {R}^{3\times H \times W}\) , an \(n\) -layer feature pyramid \(f_{0, 1}^l\in \mathbb {R}^{C_l\times H_l\times W_l}\) is first extracted with several residual blocks, where \(l\in \lbrace 1, \dots , n\rbrace\) denotes different layers and \(C_l, H_l, W_l\) are the channel number, height, and width of the feature from the \(l\) th layer. Features from the deepest layer \(f_{0, 1}^n\in \mathbb {R}^{C_n\times H_n\times W_n}\) are used for neighbor correspondence matching.
For a pixel at spatial position \((i, j)\) in \(f_{0, 1}^n\), we perform NCM to compute the correspondences in \(d\times d\) windows by
\[
corr^0(i, j, \delta _0, \delta _1) = f_0^n(i+\delta _{i_0}, j+\delta _{j_0}) \cdot f_1^n(i+\delta _{i_1}, j+\delta _{j_1}),
\]
where \(\delta _0=(\delta _{i_0}, \delta _{j_0})\) and \(\delta _1=(\delta _{i_1}, \delta _{j_1})\) with \(\delta _{i_{0,1}}, \delta _{j_{0,1}}\in \lbrace -d/2, -d/2+1,\dots ,d/2\rbrace\) denote the location offsets within the two windows, and \(\cdot\) denotes the channelwise dot product. The computed correspondence matrix \(corr^0\in \mathbb {R}^{d^4\times H_n\times W_n}\) contains correlations of all pairs in the neighborhoods, which can be further leveraged to extract motion information.
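A minimal PyTorch sketch of this single-scale matching is given below: it gathers the \(d\times d\) neighbors of every pixel with `F.unfold` and computes the channelwise dot product for every pair of window locations. The function name and the zero padding at borders are our assumptions.

```python
import torch
import torch.nn.functional as F

def neighbor_correspondence(f0, f1, d=3):
    """f0, f1: (B, C, H, W) deepest-layer features of the two reference frames.
    Returns corr: (B, d**4, H, W), the correlation of every pair of
    locations in the d x d windows around each pixel."""
    B, C, H, W = f0.shape
    # Gather the d*d spatial neighbors of every pixel (zero padding at borders)
    u0 = F.unfold(f0, kernel_size=d, padding=d // 2).view(B, C, d * d, H * W)
    u1 = F.unfold(f1, kernel_size=d, padding=d // 2).view(B, C, d * d, H * W)
    # Channelwise dot product for every window-location pair (p, q)
    corr = torch.einsum('bcpn,bcqn->bpqn', u0, u1)   # (B, d*d, d*d, H*W)
    return corr.reshape(B, d ** 4, H, W)
```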
To enlarge the receptive field and capture large motion, we further perform multi-scale correspondence matching. As shown in Figure 4, we first downsample \(f_{0, 1}^n\) to \(s=1/2^k\) resolution to generate multi-scale features \(f_{0, 1}^{n^k}, k\in \lbrace 0,1,\dots , K\rbrace\). For each level \(k\), the correspondences can be computed by
\[
corr^k(i, j, \delta _0, \delta _1) = f_0^{n^k}(i_k+\delta _{i_0}, j_k+\delta _{j_0}) \cdot f_1^{n^k}(i_k+\delta _{i_1}, j_k+\delta _{j_1}),
\]
where \((i, j)\) is the position of the pixel in the original feature map \(f_{0, 1}^n\), and \((i_k,j_k) = (i/2^k,j/2^k)\) is the position in the downsampled feature map \(f_{0, 1}^{n^k}\). Note that the \(k\)th scale correspondence matrix \(corr^k\in \mathbb {R}^{d^4\times H_n\times W_n}\) has the same shape as \(corr^0\). We use bilinear interpolation to sample \(f_{0, 1}^{n^k}\) at non-integer positions.
The final multi-scale neighbor correspondences are generated by simply concatenating the correspondences at different levels,
\[
corr = corr^0\, |\, corr^1\, |\, \dots \, |\, corr^K,
\]
where \(|\) denotes channel concatenation. In this article, we extract an \(n=4\)-layer feature pyramid at \(1, 1/2, 1/4, 1/8\) of the input resolution and perform NCM at 4 scales (\(K=3\)) with window size \(d=3\).
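The multi-scale matching can be sketched as below, reusing `neighbor_correspondence` from the sketch above. For simplicity, this version computes the correspondences on each downsampled grid and bilinearly resizes them back to \((H_n, W_n)\); the paper instead bilinearly samples the downsampled features at the fractional positions \((i/2^k, j/2^k)\), so this is an approximation. Average pooling as the downsampler is also our assumption.

```python
def multiscale_ncm(f0, f1, d=3, K=3):
    """Concatenate neighbor correspondences over K+1 scales.
    Returns a (B, (K+1)*d**4, H, W) multi-scale correspondence matrix."""
    B, C, H, W = f0.shape
    corrs = []
    for k in range(K + 1):
        s = 2 ** k
        g0 = F.avg_pool2d(f0, s) if k > 0 else f0   # features at 1/2**k resolution
        g1 = F.avg_pool2d(f1, s) if k > 0 else f1
        c = neighbor_correspondence(g0, g1, d)      # (B, d**4, H/s, W/s)
        # Resize so every scale aligns with the full-resolution pixel grid
        corrs.append(F.interpolate(c, size=(H, W), mode='bilinear',
                                   align_corners=False))
    return torch.cat(corrs, dim=1)
```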
3.3 Heterogeneous Coarse-to-fine Flow Estimation
Existing coarse-to-fine flow estimation methods [9, 26, 36] usually adopt the same upsampling factor and the same model structure from the coarse to the fine scale. However, this may not be the best solution, because the coarse and fine scales have different focuses in motion estimation. At the coarse scale, the flows need to be estimated from scratch, so strong motion capture capability is preferred. At the fine scale, we only need to refine the coarse-scale flows, which can be done at a lower cost. Based on this idea, we propose a heterogeneous coarse-to-fine structure that adopts different module designs for the coarse and fine scales.
Our heterogeneous coarse-to-fine structure comprises a coarse-scale module and a fine-scale module. To adapt to different input resolutions, we downsample \(I_0, I_1\) to \((h,w)\) to feed into the coarse-scale module, and the value of \((h,w)\) can be decided by the resolution of the input video. The estimated coarse flow is upsampled back to the original resolution and fed to the subsequent fine-scale module.
The coarse-scale module is designed to leverage the neighbor correspondences for more accurate flows. In detail, we perform NCM to obtain the feature pyramid \(f^l\) and neighbor correspondences \(corr\), which are fed into three augmented IFBlocks to estimate the coarse-scale flows. As shown in Figure 5 (right), in each IFBlock, we warp \(f^l\) to generate an image feature \(f_I\) and fuse \(corr\) and the flows into a motion feature \(f_M\). The residual flows and mask \(\Delta F^l, \Delta M^l\) are estimated to refine those of the previous block,
\[
F^l = F^{l-1} + \Delta F^l, \quad M^l = M^{l-1} + \Delta M^l.
\]
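This cascaded refinement can be sketched as a simple loop; here `blocks` stands in for the augmented IFBlocks and `inputs` for the warped features, correspondences, and frames they consume (both are placeholders, as the blocks' exact signature is not spelled out in the text).

```python
def cascade_refine(blocks, inputs, F_init, M_init):
    """Each block predicts residual flow/mask and refines the previous estimate."""
    F_l, M_l = F_init, M_init
    for block in blocks:
        dF, dM = block(inputs, F_l, M_l)   # residuals Delta F^l, Delta M^l
        F_l, M_l = F_l + dF, M_l + dM      # F^l = F^{l-1} + Delta F^l, etc.
    return F_l, M_l
```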
For the fine-scale module, we adopt the original IFBlocks for computational efficiency. Two IFBlocks refine the coarse-scale flows using the high-resolution frames.
The estimation resolution in each IFBlock can be controlled flexibly by the size \((h,w)\) and the downsampling factor \(K_{c,f}\) to adapt to the resolution of the input video. Assuming \(H\lt W\) , we use parameter \(a\) to control the size by \((h,w)=(a,W/H\times a)\) . In the fine-scale module, we set \(K_f=(2,1)\) if \(a/H\lt 1/2\) , and otherwise set \(K_f=(1,1)\) . We set \(K_c = (4,2,1)\) in the coarse-scale module.
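As an illustration of this resolution control, the following helper derives \((h, w)\) and the per-block downsampling factors from the input size; the function name is ours, and \(a\) must be chosen per input video as described above.

```python
def resolution_config(H, W, a):
    """Assumes H < W as in the text. Returns the coarse-module input size
    and the per-IFBlock downsampling factors K_c, K_f."""
    h, w = a, round(W / H * a)               # coarse-scale input resolution
    K_c = (4, 2, 1)                          # three coarse-scale (augmented) IFBlocks
    K_f = (2, 1) if a / H < 0.5 else (1, 1)  # two fine-scale IFBlocks
    return (h, w), K_c, K_f
```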
3.4 Synthesis Network
The synthesis network predicts the reconstruction residual \(\Delta I_t\) to refine the synthesized frame by \(\tilde{I_t}=\hat{I_t} + \Delta I_t\). Following previous works [9], we use a contextual U-Net as the synthesis network. The synthesis network helps the model generate finer high-frequency details and reduce artifacts.
3.5 Exploring Multiple-hypotheses Mechanism of NCM
In this section, we investigate the mechanism of neighbor correspondence matching in video frame synthesis. We start with correspondence matching, which has proven beneficial to optical flow estimation [10, 37, 41, 44].
Mechanism of Correspondence Matching. As shown in Figure 6 (bottom), for each object in the current frame, correspondence matching measures its similarity with objects in the reference frame. The matched similarity vector implies powerful motion cues for flow estimation. For example, if the current object at position \((x_c, y_c)\) is matched as most similar to the reference object at \((x, y)\), then we can simply infer that the most probable optical flow at \((x_c, y_c)\) is \((x-x_c, y-y_c)\). In practice, such a similarity vector is usually fed into a deep neural network to estimate flows more accurately.
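As a toy illustration of this inference step, the best-matching offset in a search window directly yields a flow estimate; real methods feed the full similarity vector into a network instead, and the function below is purely illustrative.

```python
def flow_from_similarity(sim, r):
    """sim: (B, (2r+1)**2, H, W) similarities of each current-frame pixel
    against a (2r+1) x (2r+1) search window in the reference frame."""
    best = sim.argmax(dim=1)                        # index of the best match
    side = 2 * r + 1
    dy = best // side - r                           # offset = match position
    dx = best % side - r                            #          minus window center
    return torch.stack([dx, dy], dim=1).float()     # (B, 2, H, W) flow hypothesis
```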
As demonstrated above, correspondence matching is object-aware, but objects in the current frame are unknown in video frame synthesis. In this article, an object-agnostic matching scheme, neighbor correspondence matching, is proposed to solve this problem.
Mechanism of Neighbor Correspondence Matching. As shown in Figure 6 (top), for each position in the current frame, neighbor correspondence matching measures the similarity between all object pairs in the neighbor windows of the reference frames. In this case, the matched similarity matrix contains multiple motion hypotheses. For a position \((x_c, y_c)\) in the current frame, there may be multiple pairs of similar objects \(\lbrace (x_1^0, y_1^0), (x_1^1, y_1^1)\rbrace , \dots , \lbrace (x_n^0, y_n^0), (x_n^1, y_n^1)\rbrace\) in the neighbor windows. Any of these pairs may correspond to the object at the current position, resulting in \(2n\) possible motions \((x_p^{\lbrace 0, 1\rbrace }, y_p^{\lbrace 0, 1\rbrace }), p \in \lbrace 1, \dots , n\rbrace\).
It is worth noting that in NCM, high correspondence does not necessarily mean high probability. In detail, a high correspondence value only indicates that the pair of objects is similar; the actual motion may not point to the pair with the highest similarity. For example, in Figure 6 (top), the pair of objects with index \((0, 1)\) has the highest correspondence value, because the two front windows of the car are the most similar. However, the actual motion is close to that of the license plate with index \((4, 5)\), which is far from \((0, 1)\).
In practice, spatial context and image information can be integrated to determine the probability of each motion hypothesis. In the proposed network, IFBlocks integrate such information to estimate the intermediate flows. Combining the neighbor correspondence with neural networks, the model can better handle complex motions and occlusions.
3.6 Multiple Hypotheses Motion Estimation for Extrapolation
In this section, we first perform a preliminary experiment showing that the proposed extrapolation model cannot fully exploit the multiple-hypotheses information in the neighbor correspondence. Based on this observation, we propose a multiple-hypotheses motion estimation scheme to enhance the performance of NCM on extrapolation.
Preliminary Experiment. We select a triplet from the Vimeo90K [40] test set and visualize the estimated intermediate flows \(F\) and masks \(M\) in Figure 7(a) and (b). By observing the estimation results, we find the following:
— The interpolation model follows a two-hypotheses synthesizing process. That is, the model estimates two flows for each position in the current frame and learns to judge which motion hypothesis is more reliable, assigning it more weight in synthesizing. For example, in Figure 7(a), the model assigns more weight to the left edge of the person in flow \(F_{t\rightarrow 0}\), while assigning more to the right edge in \(F_{t\rightarrow 1}\).
— However, the extrapolation model shows a stronger tendency toward single-hypothesis synthesizing. Even though the model also estimates two flows, it prefers to assign much more weight to the flow \(F_{t\rightarrow 1}\) of the closer reference frame because of its shorter temporal distance. For example, in Figure 7(b), \(F_{t\rightarrow 1}\) has more weight than \(F_{t\rightarrow 0}\) in most regions.
With these observations, we infer that the extrapolation model may lack the capability of multiple-hypotheses synthesizing. Since it tends to assign most of the weight to only one flow, that flow must be highly accurate to guarantee the synthesizing performance, which brings difficulties to motion modeling and estimation.
Methods. To fully exploit the multiple-hypotheses motion information in NCM, we propose a multiple-hypotheses motion estimation process for video frame extrapolation. Specifically, we estimate \(n\) flows \(F_{t\rightarrow j}^i\) and \(n\) masks \(M_j^i\) for each reference frame, where \(i \in \lbrace 0, 1, \dots , n-1\rbrace\) is the index of the flow and \(j\in \lbrace 0, 1\rbrace\) indexes the reference frames. The IFBlocks estimate the flows and masks progressively, and, finally, the frame can be synthesized by
\[
\hat{I_t} = \sum _{j\in \lbrace 0,1\rbrace }\sum _{i=0}^{n-1} M_j^i\odot warp(I_j, F_{t\rightarrow j}^i).
\]
In the IFBlocks, all warp operations on the \(j\)th reference frame are replaced by the weighted average of its warped images, \(\sum _i M_j^i\odot warp(I_j, F_{t\rightarrow j}^i)\), and the same goes for the warp operation in image feature generation. In this article, we set \(n=5\) hypotheses for each reference frame by default.
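A sketch of this multiple-hypotheses fusion is given below, reusing `backward_warp` from the sketch in Section 3.1. Normalizing the \(2n\) mask weights with a softmax is our assumption, as the text does not spell out the normalization.

```python
def synthesize_mh(refs, flows, masks):
    """refs: [I0, I1], each (B, 3, H, W).
    flows[j]: (B, n, 2, H, W) -- n flow hypotheses toward reference frame j.
    masks[j]: (B, n, 1, H, W) -- unnormalized per-hypothesis weights M_j^i."""
    # One weight map per hypothesis, normalized over all 2n hypotheses
    w = torch.softmax(torch.cat(masks, dim=1), dim=1)   # (B, 2n, 1, H, W)
    out, idx = 0.0, 0
    for j, I in enumerate(refs):
        for i in range(flows[j].shape[1]):
            out = out + w[:, idx] * backward_warp(I, flows[j][:, i])
            idx += 1
    return out
```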
These designs ensure that every estimated flow \(F_{t\rightarrow 1}^i\) (or \(F_{t\rightarrow 0}^i\)) has the same temporal distance from the target frame. As a result, the network will not assign mask weights based on temporal distance, so the single-hypothesis synthesizing problem caused by the uneven temporal distances can be addressed. As shown in Figure 7(c), the proposed method successfully resolves the difficulty of multiple-hypotheses synthesizing in the extrapolation model.
CUDA Acceleration. Since the warp operations for different flows cannot be performed in parallel in PyTorch, we implement the multiple-motion warp operation in CUDA for acceleration. Testing on a 1080p image with an NVIDIA 1080Ti GPU, the CUDA implementation is 19% faster for the warp operation and 10% faster for the whole model.