1 Introduction
Video frame synthesis is a classic video processing task that generates frames in-between (interpolation) or subsequent to (extrapolation) reference frames. It has many practical applications, such as video compression [17, 28, 31], video view synthesis [6, 14], slow-motion generation [12], and motion blur synthesis [3]. Recently, various deep-learning-based algorithms have been proposed to handle video frame synthesis problems. Most of them focus on interpolation [2, 9, 12, 15, 16, 25, 26, 27, 32, 33, 39, 43] or extrapolation [8, 38], while some others [19] can deal with both interpolation and extrapolation in a unified framework.
Among existing video frame synthesis algorithms, flow-based schemes predict the current frame by warping reference frames with estimated optical flows. Many flow-based interpolation schemes [1, 2, 12] first estimate bi-directional optical flows and then approximate intermediate flows by flow reversal [12, 18]. Some recent schemes [9, 16, 22, 26, 27, 43] directly estimate intermediate flows and achieve superior performance. However, most existing methods fail to estimate large motion or the motion of small objects, mainly because of the limited receptive field and motion capture capability of convolutional neural networks (CNNs).
Correspondence matching has proven effective in capturing long-term correlations in multimedia tasks like video object segmentation [4, 42] and optical flow estimation [13, 37]. In these scenarios, by matching pixels of the current frame against the reference frame, a correspondence vector can be established to guide the generation of the mask or flow (Figure 2, left). However, in video frame synthesis, we only have two reference frames, and the current frame is not available. As a result, correspondence matching cannot be performed directly, and how to perform correspondence matching in video frame synthesis remains an open question.
In this article, we introduce a neighbor correspondence matching (NCM) algorithm that establishes correspondences in a current-frame-agnostic fashion to enhance flow estimation in video frame synthesis. Observing that objects usually move continuously and locally within a small region in natural videos, we propose to perform correspondence matching between the spatial-temporal neighbors of each pixel. Specifically, for each pixel in the current frame, we use pixels in the local windows of adjacent reference frames to calculate the correspondence matrix (Figure 2, middle and right), so the pixel value of the current frame is not required. The matched neighbor correspondence matrix can effectively model object correlations, from which sufficient motion cues can be inferred to guide flow estimation. In addition, multi-scale neighbor correspondence matching is performed to extend the receptive field and capture large motion.
Based on NCM, we further propose a unified video frame synthesis network for both interpolation and extrapolation. The proposed model accurately estimates intermediate flows in a heterogeneous coarse-to-fine manner: the coarse-scale module leverages the multi-scale neighbor correspondence matrix to capture accurate motion at low resolution, while the fine-scale module refines the coarse flows to high resolution in a computationally efficient fashion. In addition, to eliminate the resolution gap between the training dataset and real-world high-resolution videos, we train the coarse- and fine-scale modules with a progressive training strategy. Combining all the above designs, we augment the RIFE [9] framework into an NCM-based network for video frame synthesis. The model is not only effective but also efficient, especially for high-resolution videos.
To fully understand how NCM works, we further explore its mechanism. By visualizing the neighbor correspondence, we conclude that NCM provides multiple-hypotheses motion information for frame synthesis. In addition, we compare the estimated flows between interpolation and extrapolation and find that the extrapolation model follows a single-motion-hypothesis synthesis process and cannot fully exploit such multiple-hypotheses motion information. Based on these analyses, we introduce a more powerful multiple-hypotheses framework for video frame extrapolation, NCM-MH. NCM and NCM-MH demonstrate new state-of-the-art results on several benchmarks. As shown in Figure 1, on the challenging X4K1000FPS benchmark, NCM improves peak signal-to-noise ratio (PSNR) by 0.57 dB (from 31.06 dB of AMT [16] to 31.63 dB) in \(8\times\) interpolation, and NCM-MH improves PSNR by 2.65 dB (from 25.43 dB of DMVFN [8] to 28.08 dB) in extrapolation. The experimental results demonstrate the capability of the proposed methods in capturing large motion and handling real-scenario videos.
In summary, the main contributions of this article are as follows:
(1) We introduce a neighbor correspondence matching algorithm for video frame synthesis, which is simple yet effective in capturing large motion and the motion of small objects.
(2) We propose a heterogeneous coarse-to-fine structure and train it in a progressive fashion to eliminate the resolution gap between training and inference. Combining all the designs above, we propose a unified framework for frame synthesis, NCM, which generates intermediate flows accurately and efficiently.
(3) By exploring the multiple-hypotheses motion mechanism of NCM, we introduce a more robust multiple-hypotheses framework, NCM-MH, for the challenging video frame extrapolation task.
This article is an extension of our previous conference version [11]. The current work adds to the initial version in several significant aspects. First, we explore the theoretical mechanism of neighbor correspondence matching, accompanied by visualization results, to provide a more comprehensive understanding of NCM. Second, based on this analysis, we introduce multiple motion hypotheses into NCM, resulting in a more powerful frame extrapolation framework, NCM-MH, which demonstrates new state-of-the-art results. Third, we incorporate considerable new experimental results, including comparisons with recently proposed methods, ablation studies, model settings, visual comparisons, and analyses. We hope that these extensions contribute to a more comprehensive and insightful exploration of the subject matter.
3 Methodology
The overview of the proposed video frame synthesis network is shown in Figure 3. The network consists of three parts: (1) neighbor correspondence matching with a feature pyramid (yellow in Figure 3), (2) heterogeneous coarse-to-fine motion estimation (blue in Figure 3), and (3) frame synthesis. For completeness, we first briefly introduce RIFE [9], from which we adopt some block designs, and then describe the details of each module.
3.1 Background
We base our network design on the RIFE [9] framework. The fundamental concept of RIFE is to directly estimate the intermediate flows for interpolation, denoted as \(F=(F_{t\rightarrow 0}, F_{t\rightarrow 1})\), which point from the target frame \(I_t\) to the reference frames \(I_0, I_1\). With the intermediate flows and an estimated fusion mask \(M\), the target frame can be generated through the following process:
\[
\hat{I_t} = M \odot warp(I_0, F_{t\rightarrow 0}) + (1 - M) \odot warp(I_1, F_{t\rightarrow 1}),
\]
where \(\odot\) denotes the pixelwise product and \(warp\) denotes the backward warping operation. RIFE employs an intermediate flow estimation network (IFNet) to estimate \(F=(F_{t\rightarrow 0}, F_{t\rightarrow 1})\) and \(M\) from the reference frames \(I_0\) and \(I_1\). Within the IFNet, three cascaded intermediate flow blocks (IFBlocks) (Figure 5, left) are integrated. Each block refines the flow and mask generated by the previous one by estimating residual values for the flow and mask. Subsequently, with the generated \(\hat{I_t}\), \(F\), and \(M\), a U-Net-like frame synthesis network is employed to refine \(\hat{I_t}\) into the synthesized frame \(\tilde{I_t}\).
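For concreteness, the following is a minimal PyTorch sketch of the warp-and-fuse synthesis in the equation above; `backward_warp` is a generic bilinear backward-warping routine written for illustration, not RIFE's exact implementation.

```python
import torch
import torch.nn.functional as F

def backward_warp(img, flow):
    """Sample img at (pixel position + flow) with bilinear interpolation.
    img: (B, C, H, W); flow: (B, 2, H, W) with (x, y) displacements."""
    B, _, H, W = img.shape
    gy, gx = torch.meshgrid(torch.arange(H, device=img.device),
                            torch.arange(W, device=img.device), indexing='ij')
    # Normalize sampling coordinates to [-1, 1] as required by grid_sample
    x = (gx.unsqueeze(0) + flow[:, 0]) / (W - 1) * 2 - 1
    y = (gy.unsqueeze(0) + flow[:, 1]) / (H - 1) * 2 - 1
    grid = torch.stack([x, y], dim=-1)                  # (B, H, W, 2)
    return F.grid_sample(img, grid, align_corners=True)

def synthesize(I0, I1, F_t0, F_t1, M):
    # hat(I_t) = M * warp(I0, F_{t->0}) + (1 - M) * warp(I1, F_{t->1})
    return M * backward_warp(I0, F_t0) + (1 - M) * backward_warp(I1, F_t1)
```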
RIFE is lightweight and runs in real time, but its synthesis quality is unsatisfactory due to the limited receptive field and motion capture capability of the fully convolutional network. In addition, RIFE cannot adapt to high-resolution videos, where the motion is even larger. To overcome these limitations, we propose neighbor correspondence matching and a heterogeneous coarse-to-fine structure for video frame synthesis, which can estimate accurate intermediate flows even on 4K videos.
3.2 Neighbor Correspondence Matching
3.2.1 Overview.
Based on the observation that an object usually moves continuously and locally within a small region in natural videos, the core idea of NCM is to explore motion information by establishing spatial-temporal correlations between neighboring regions. In detail, for each pixel we compute the correspondences between the local windows of the two adjacent reference frames, as shown in Figure 2. No information from the current frame is needed, so the matching can be performed in a current-frame-agnostic fashion to meet the needs of frame synthesis.
It is worth noting that the position of the local windows is determined by the position of the pixel, which differs from cost-volume-based methods [9, 26] that establish a cost volume around the location the estimated flows point to. If the estimated flow is inaccurate, then such flow-centric matching may be performed in the wrong region, and the matched correlations become ineffective. On the contrary, NCM is pixel-centric and will not be misled by inaccurate flows.
3.2.2 Mathematical Formulation.
Given a pair of reference frames \(I_0, I_1\in \mathbb {R}^{3\times H \times W}\) , an \(n\) -layer feature pyramid \(f_{0, 1}^l\in \mathbb {R}^{C_l\times H_l\times W_l}\) is first extracted with several residual blocks, where \(l\in \lbrace 1, \dots , n\rbrace\) denotes different layers and \(C_l, H_l, W_l\) are the channel number, height, and width of the feature from the \(l\) th layer. Features from the deepest layer \(f_{0, 1}^n\in \mathbb {R}^{C_n\times H_n\times W_n}\) are used for neighbor correspondence matching.
For a pixel at spatial position \((i, j)\) in \(f_{0, 1}^n\), we perform NCM to compute the correspondences in \(d\times d\) windows by
\[
corr^0(i, j, \delta _0, \delta _1) = f_0^n(i+\delta _{i_0}, j+\delta _{j_0}) \cdot f_1^n(i+\delta _{i_1}, j+\delta _{j_1}),
\]
where \(\delta _0=(\delta _{i_0}, \delta _{j_0})\) and \(\delta _1=(\delta _{i_1}, \delta _{j_1})\) with \(\delta _{i_{0,1}}, \delta _{j_{0,1}}\in \lbrace -d/2, -d/2+1,\dots ,d/2\rbrace\) denote the location offsets within the two windows, and \(\cdot\) denotes the channelwise dot product. The computed correspondence matrix \(corr^0\in \mathbb {R}^{d^4\times H_n\times W_n}\) contains correlations of all pairs in the neighborhoods, which can be further leveraged to extract motion information.
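A minimal PyTorch sketch of this single-scale matching is given below: it gathers the \(d\times d\) neighbors of every pixel with `F.unfold` and computes the channelwise dot product for every pair of window locations. The function name and the zero padding at borders are our assumptions.

```python
import torch
import torch.nn.functional as F

def neighbor_correspondence(f0, f1, d=3):
    """f0, f1: (B, C, H, W) deepest-layer features of the two reference frames.
    Returns corr: (B, d**4, H, W), the correlation of every pair of
    locations in the d x d windows around each pixel."""
    B, C, H, W = f0.shape
    # Gather the d*d spatial neighbors of every pixel (zero padding at borders)
    u0 = F.unfold(f0, kernel_size=d, padding=d // 2).view(B, C, d * d, H * W)
    u1 = F.unfold(f1, kernel_size=d, padding=d // 2).view(B, C, d * d, H * W)
    # Channelwise dot product for every window-location pair (p, q)
    corr = torch.einsum('bcpn,bcqn->bpqn', u0, u1)   # (B, d*d, d*d, H*W)
    return corr.reshape(B, d ** 4, H, W)
```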
To enlarge the receptive field and capture large motion, we further perform multi-scale correspondence matching. As shown in Figure 4, we first downsample \(f_{0, 1}^n\) to \(s=1/2^k\) resolution to generate multi-scale features \(f_{0, 1}^{n^k}, k\in \lbrace 0,1,\dots , K\rbrace\). For each level \(k\), the correspondences can be computed by
\[
corr^k(i, j, \delta _0, \delta _1) = f_0^{n^k}(i_k+\delta _{i_0}, j_k+\delta _{j_0}) \cdot f_1^{n^k}(i_k+\delta _{i_1}, j_k+\delta _{j_1}),
\]
where \((i, j)\) is the position of the pixel in the original feature map \(f_{0, 1}^n\), and \((i_k,j_k) = (i/2^k,j/2^k)\) is the position in the downsampled feature map \(f_{0, 1}^{n^k}\). Note that the \(k\)th scale correspondence matrix \(corr^k\in \mathbb {R}^{d^4\times H_n\times W_n}\) has the same shape as \(corr^0\). We use bilinear interpolation to sample \(f_{0, 1}^{n^k}\) at non-integer positions.
The final multi-scale neighbor correspondences are generated by simply concatenating the correspondences at different levels,
\[
corr = corr^0\, |\, corr^1\, |\, \dots \, |\, corr^K,
\]
where \(|\) denotes channel concatenation. In this article, we extract an \(n=4\)-layer feature pyramid at \(1, 1/2, 1/4, 1/8\) of the input resolution and perform NCM at 4 scales (\(K=3\)) with window size \(d=3\).
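The multi-scale matching can be sketched as below, reusing `neighbor_correspondence` from the sketch above. For simplicity, this version computes the correspondences on each downsampled grid and bilinearly resizes them back to \((H_n, W_n)\); the paper instead bilinearly samples the downsampled features at the fractional positions \((i/2^k, j/2^k)\), so this is an approximation. Average pooling as the downsampler is also our assumption.

```python
def multiscale_ncm(f0, f1, d=3, K=3):
    """Concatenate neighbor correspondences over K+1 scales.
    Returns a (B, (K+1)*d**4, H, W) multi-scale correspondence matrix."""
    B, C, H, W = f0.shape
    corrs = []
    for k in range(K + 1):
        s = 2 ** k
        g0 = F.avg_pool2d(f0, s) if k > 0 else f0   # features at 1/2**k resolution
        g1 = F.avg_pool2d(f1, s) if k > 0 else f1
        c = neighbor_correspondence(g0, g1, d)      # (B, d**4, H/s, W/s)
        # Resize so every scale aligns with the full-resolution pixel grid
        corrs.append(F.interpolate(c, size=(H, W), mode='bilinear',
                                   align_corners=False))
    return torch.cat(corrs, dim=1)
```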
3.3 Heterogeneous Coarse-to-fine Flow Estimation
Existing coarse-to-fine flow estimation methods [9, 26, 36] usually adopt the same upsampling factor and the same model structure from the coarse to the fine scale. However, this may not be the best solution, because the coarse and fine scales have different focuses in motion estimation. At the coarse scale, the flows need to be estimated from scratch, so strong motion capture capability is preferred. At the fine scale, we only need to refine the coarse-scale flows, which can be done at a lower cost. Based on this idea, we propose a heterogeneous coarse-to-fine structure that adopts different module designs for the coarse and fine scales.
Our heterogeneous coarse-to-fine structure comprises a coarse-scale module and a fine-scale module. To adapt to different input resolutions, we downsample \(I_0, I_1\) to \((h,w)\) to feed into the coarse-scale module, and the value of \((h,w)\) can be decided by the resolution of the input video. The estimated coarse flow is upsampled back to the original resolution and fed to the subsequent fine-scale module.
The coarse-scale module is designed to leverage the neighbor correspondences for more accurate flows. In detail, we perform NCM to obtain the feature pyramid \(f^l\) and neighbor correspondences \(corr\), which are fed into three augmented IFBlocks to estimate the coarse-scale flows. As shown in Figure 5 (right), in each IFBlock, we warp \(f^l\) to generate an image feature \(f_I\) and fuse \(corr\) and the flows into a motion feature \(f_M\). The residual flows and mask \(\Delta F^l, \Delta M^l\) are estimated to refine those of the previous block,
\[
F^l = F^{l-1} + \Delta F^l, \quad M^l = M^{l-1} + \Delta M^l.
\]
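This cascaded refinement can be sketched as a simple loop; here `blocks` stands in for the augmented IFBlocks and `inputs` for the warped features, correspondences, and frames they consume (both are placeholders, as the blocks' exact signature is not spelled out in the text).

```python
def cascade_refine(blocks, inputs, F_init, M_init):
    """Each block predicts residual flow/mask and refines the previous estimate."""
    F_l, M_l = F_init, M_init
    for block in blocks:
        dF, dM = block(inputs, F_l, M_l)   # residuals Delta F^l, Delta M^l
        F_l, M_l = F_l + dF, M_l + dM      # F^l = F^{l-1} + Delta F^l, etc.
    return F_l, M_l
```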
For the fine-scale module, we adopt the original IFBlocks for computational efficiency. Two IFBlocks refine the coarse-scale flows using the high-resolution frames.
The estimation resolution in each IFBlock can be controlled flexibly by the size \((h,w)\) and the downsampling factor \(K_{c,f}\) to adapt to the resolution of the input video. Assuming \(H\lt W\) , we use parameter \(a\) to control the size by \((h,w)=(a,W/H\times a)\) . In the fine-scale module, we set \(K_f=(2,1)\) if \(a/H\lt 1/2\) , and otherwise set \(K_f=(1,1)\) . We set \(K_c = (4,2,1)\) in the coarse-scale module.
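As an illustration of this resolution control, the following helper derives \((h, w)\) and the per-block downsampling factors from the input size; the function name is ours, and \(a\) must be chosen per input video as described above.

```python
def resolution_config(H, W, a):
    """Assumes H < W as in the text. Returns the coarse-module input size
    and the per-IFBlock downsampling factors K_c, K_f."""
    h, w = a, round(W / H * a)               # coarse-scale input resolution
    K_c = (4, 2, 1)                          # three coarse-scale (augmented) IFBlocks
    K_f = (2, 1) if a / H < 0.5 else (1, 1)  # two fine-scale IFBlocks
    return (h, w), K_c, K_f
```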
3.4 Synthesis Network
The synthesis network predicts the reconstruction residual \(\Delta I_t\) to refine the synthesized frame by \(\tilde{I_t}=\hat{I_t} + \Delta I_t\). Following previous works [9], we use a contextual U-Net as the synthesis network. The synthesis network helps the model generate finer high-frequency details and reduce artifacts.
3.5 Exploring Multiple-hypotheses Mechanism of NCM
In this section, we investigate the mechanism of neighbor correspondence matching in video frame synthesis. We start with correspondence matching, which has proven beneficial to optical flow estimation [10, 37, 41, 44].
Mechanism of Correspondence Matching. As shown in Figure 6 (bottom), for each object in the current frame, correspondence matching measures its similarity with objects in the reference frame. The matched similarity vector implies powerful motion cues for flow estimation. For example, if the current object at position \((x_c, y_c)\) is matched as most similar to the reference object at \((x, y)\), then we can simply infer that the most probable optical flow at \((x_c, y_c)\) is \((x-x_c, y-y_c)\). In practice, such a similarity vector is usually fed into a deep neural network to estimate flows more accurately.
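As a toy illustration of this inference step, the best-matching offset in a search window directly yields a flow estimate; real methods feed the full similarity vector into a network instead, and the function below is purely illustrative.

```python
def flow_from_similarity(sim, r):
    """sim: (B, (2r+1)**2, H, W) similarities of each current-frame pixel
    against a (2r+1) x (2r+1) search window in the reference frame."""
    best = sim.argmax(dim=1)                        # index of the best match
    side = 2 * r + 1
    dy = best // side - r                           # offset = match position
    dx = best % side - r                            #          minus window center
    return torch.stack([dx, dy], dim=1).float()     # (B, 2, H, W) flow hypothesis
```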
As demonstrated above, correspondence matching is object-aware, but objects in the current frame are unknown in video frame synthesis. In this article, an object-agnostic matching scheme, neighbor correspondence matching, is proposed to solve this problem.
Mechanism of Neighbor Correspondence Matching. As shown in Figure 6 (top), for each position in the current frame, neighbor correspondence matching measures the similarity between all object pairs in the neighbor windows of the reference frames. In this case, the matched similarity matrix contains multiple motion hypotheses. For a position \((x_c, y_c)\) in the current frame, there may be multiple pairs of similar objects \(\lbrace (x_1^0, y_1^0), (x_1^1, y_1^1)\rbrace , \dots , \lbrace (x_n^0, y_n^0), (x_n^1, y_n^1)\rbrace\) in the neighbor windows. Any of these pairs may correspond to the object at the current position, resulting in \(2n\) possible motions \((x_p^{\lbrace 0, 1\rbrace }, y_p^{\lbrace 0, 1\rbrace }), p \in \lbrace 1, \dots , n\rbrace\).
It is worth noting that in NCM, high correspondence does not necessarily mean high probability. In detail, a high correspondence value only indicates that the pair of objects is similar; the actual motion may not point to the pair with the highest similarity. For example, in Figure 6 (top), the pair of objects with index \((0, 1)\) has the highest correspondence value, because the two front windows of the car are the most similar. However, the actual motion is close to that of the license plate with index \((4, 5)\), which is far from \((0, 1)\).
In practice, spatial context and image information can be integrated to determine the probability of each motion hypothesis. In the proposed network, IFBlocks integrate such information to estimate the intermediate flows. Combining the neighbor correspondence with neural networks, the model can better handle complex motions and occlusions.
3.6 Multiple Hypotheses Motion Estimation for Extrapolation
In this section, we first perform a preliminary experiment showing that the proposed extrapolation model cannot fully exploit the multiple-hypotheses information in the neighbor correspondence. Based on this observation, we propose a multiple-hypotheses motion estimation scheme to enhance the performance of NCM on extrapolation.
Preliminary Experiment. We select a triplet from the Vimeo90K [40] test set and visualize the estimated intermediate flows \(F\) and masks \(M\) in Figure 7(a) and (b). By observing the estimation results, we find the following:
— The interpolation model follows a two-hypotheses synthesizing process. That is, the model estimates two flows for each position in the current frame and learns to judge which motion hypothesis is more reliable, assigning it more weight in synthesizing. For example, in Figure 7(a), the model assigns more weight to the left edge of the person in flow \(F_{t\rightarrow 0}\), while assigning more to the right edge in \(F_{t\rightarrow 1}\).
— However, the extrapolation model shows a stronger tendency toward single-hypothesis synthesizing. Even though the model also estimates two flows, it prefers to assign much more weight to the flow \(F_{t\rightarrow 1}\) of the closer reference frame because of its shorter temporal distance. For example, in Figure 7(b), \(F_{t\rightarrow 1}\) has more weight than \(F_{t\rightarrow 0}\) in most regions.
With these observations, we infer that the extrapolation model may lack the capability of multiple-hypotheses synthesizing. Since it tends to assign most of the weight to only one flow, that flow must be highly accurate to guarantee the synthesizing performance, which brings difficulties to motion modeling and estimation.
Methods. To fully exploit the multiple-hypotheses motion information in NCM, we propose a multiple-hypotheses motion estimation process for video frame extrapolation. Specifically, we estimate \(n\) flows \(F_{t\rightarrow j}^i\) and \(n\) masks \(M_j^i\) for each reference frame, where \(i \in \lbrace 0, 1, \dots , n-1\rbrace\) is the index of the flow and \(j\in \lbrace 0, 1\rbrace\) indexes the reference frames. The IFBlocks estimate the flows and masks progressively, and, finally, the frame can be synthesized by
\[
\hat{I_t} = \sum _{j\in \lbrace 0,1\rbrace }\sum _{i=0}^{n-1} M_j^i\odot warp(I_j, F_{t\rightarrow j}^i).
\]
In the IFBlocks, all warp operations on the \(j\)th reference frame are replaced by the weighted average of its warped images, \(\sum _i M_j^i\odot warp(I_j, F_{t\rightarrow j}^i)\), and the same goes for the warp operation in image feature generation. In this article, we set \(n=5\) hypotheses for each reference frame by default.
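A sketch of this multiple-hypotheses fusion is given below, reusing `backward_warp` from the sketch in Section 3.1. Normalizing the \(2n\) mask weights with a softmax is our assumption, as the text does not spell out the normalization.

```python
def synthesize_mh(refs, flows, masks):
    """refs: [I0, I1], each (B, 3, H, W).
    flows[j]: (B, n, 2, H, W) -- n flow hypotheses toward reference frame j.
    masks[j]: (B, n, 1, H, W) -- unnormalized per-hypothesis weights M_j^i."""
    # One weight map per hypothesis, normalized over all 2n hypotheses
    w = torch.softmax(torch.cat(masks, dim=1), dim=1)   # (B, 2n, 1, H, W)
    out, idx = 0.0, 0
    for j, I in enumerate(refs):
        for i in range(flows[j].shape[1]):
            out = out + w[:, idx] * backward_warp(I, flows[j][:, i])
            idx += 1
    return out
```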
These designs ensure that every estimated flow \(F_{t\rightarrow 1}^i\) (or \(F_{t\rightarrow 0}^i\)) has the same temporal distance from the target frame. As a result, the network will not assign mask weights based on temporal distance, so the single-hypothesis synthesizing problem caused by the uneven temporal distances can be addressed. As shown in Figure 7(c), the proposed method successfully resolves the difficulty of multiple-hypotheses synthesizing in the extrapolation model.
CUDA Acceleration. Since the warp operations for different flows cannot be performed in parallel in PyTorch, we implement the multiple-motion warp operation in CUDA for acceleration. Testing on a 1080p image with an NVIDIA 1080Ti GPU, the CUDA implementation is 19% faster for the warp operation and 10% faster for the whole model.