
Exploring Neighbor Correspondence Matching for Multiple-hypotheses Video Frame Synthesis

Published: 11 January 2024
    Abstract

    Video frame synthesis, which consists of interpolation and extrapolation, is an essential video processing technique that can be applied to various scenarios. However, most existing methods cannot handle small objects or large motion well, especially in high-resolution videos such as 4K videos. To eliminate such limitations, we introduce a neighbor correspondence matching (NCM) algorithm for flow-based frame synthesis. Since the current frame is not available in video frame synthesis, NCM is performed in a current-frame-agnostic fashion to establish multi-scale correspondences in the spatial-temporal neighborhoods of each pixel. Based on the powerful motion representation capability of NCM, we propose a heterogeneous coarse-to-fine scheme for intermediate flow estimation. The coarse-scale and fine-scale modules are trained progressively, making NCM computationally efficient and robust to large motions. We further explore the mechanism of NCM and find that neighbor correspondence is powerful, since it provides multiple-hypotheses motion information for synthesis. Based on this analysis, we introduce a multiple-hypotheses estimation process for video frame extrapolation, resulting in a more robust framework, NCM-MH. Experimental results show that NCM and NCM-MH achieve 31.63 and 28.08 dB for interpolation and extrapolation on the most challenging X4K1000FPS benchmark, outperforming all the other state-of-the-art methods that use two reference frames as input.

    1 Introduction

    Video frame synthesis is a classic video processing task that generates frames in-between (interpolation) or subsequent to (extrapolation) reference frames. It can be applied to many practical applications, such as video compression [17, 28, 31], video view synthesis [6, 14], slow-motion generation [12], and motion blur synthesis [3]. Recently, various deep-learning-based algorithms have been proposed to handle video frame synthesis problems. Most of them focus on interpolation [2, 9, 12, 15, 16, 25, 26, 27, 32, 33, 39, 43] or extrapolation [8, 38], while some others [19] deal with both interpolation and extrapolation in a unified framework.
    Among existing video frame synthesis algorithms, flow-based schemes predict the current frame by warping reference frames with estimated optical flows. Many flow-based interpolation schemes [1, 2, 12] first estimate bi-directional optical flows and then approximate intermediate flows by flow reversal [12, 18]. Some recent schemes [9, 16, 22, 26, 27, 43] directly estimate intermediate flows and achieve superior performance. However, most existing methods fail to estimate large motion or the motion of small objects, mainly because of the limited receptive field and motion capture capability of convolutional neural networks (CNNs).
    Correspondence matching is proven to be effective in capturing long-term correlations in multimedia tasks like video object segmentation [4, 42] and optical flow estimation [13, 37]. In these scenarios, by matching pixels of the current frame in the reference frame, a correspondence vector can be established to guide the generation of the mask or flow (Figure 2, left). However, in video frame synthesis, we only have two reference frames, and the current frame is not available. As a result, correspondence matching cannot be performed directly, and how to perform correspondence matching in video frame synthesis remains an open question.
    Fig. 1.
    Fig. 1. (a) Performance comparison of 8 \(\times\) video frame interpolation on the X4K1000FPS dataset [33]. (b) We propose a neighbor correspondence matching (NCM) algorithm for flow-based video frame synthesis, which efficiently enhances the motion-capturing capability for frame synthesis.
    Fig. 2.
    Fig. 2. Illustration of neighbor correspondence matching. Blue regions in the reference frames denote the matching regions. The correspondence matching [4, 37] is performed between the current frame and the reference frame, while our neighbor correspondence matching is performed in a current-frame-agnostic fashion to match pixels in the spatial-temporal neighborhoods of the current frame. A more detailed example can be found in Figure 6.
    In this article, we introduce a neighbor correspondence matching (NCM) algorithm to enhance the flow estimation process in video frame synthesis that establishes correspondences in a current-frame-agnostic fashion. Observing that objects usually move continuously and locally within a small region in natural videos, we propose to perform correspondence matching between the spatial-temporal neighbors of each pixel. Specifically, for each pixel in the current frame, we use pixels in the local windows of adjacent reference frames to calculate the correspondence matrix (Figure 2, middle and right), so the pixel value of the current frame is not required. The matched neighbor correspondence matrix can effectively model the object correlations, from which we can infer sufficient motion cues to guide the estimation of flows. In addition, multi-scale neighbor correspondence matching is performed to extend the receptive field and capture large motion.
    Based on NCM, we further propose a unified video frame synthesis network for both interpolation and extrapolation. The proposed model can accurately estimate intermediate flows in a heterogeneous coarse-to-fine manner. Specifically, the coarse-scale module is designed to utilize the multi-scale neighbor correspondence matrix to capture accurate motion in low resolution, while the fine-scale module refines the coarse flows to high resolution in a computationally efficient fashion. In addition, to eliminate the resolution gap between the training dataset and real-world high-resolution videos, we propose to train the coarse- and fine-scale modules using a progressive training strategy. Combining all the above designs, we augment the RIFE [9] framework into an NCM-based network for video frame synthesis. The model is not only effective but also efficient, especially for high-resolution videos.
    To fully understand how NCM works, we further explore its mechanism. By visualizing the neighbor correspondence, we conclude that NCM provides multiple-hypotheses motion information for frame synthesis. In addition, we compare the estimated flows between interpolation and extrapolation and find that the extrapolation model follows a single-motion-hypotheses synthesis process and cannot fully exploit such multiple-hypotheses motion information. Based on these analyses, we introduce a more powerful multiple-hypotheses framework for video frame extrapolation, NCM-MH. NCM and NCM-MH achieve new state-of-the-art results on several benchmarks. As shown in Figure 1, on the challenging X4K1000FPS benchmark, NCM improves peak signal-to-noise ratio (PSNR) by 0.57 dB (from 31.06 dB of AMT [16] to 31.63 dB) in \(8\times\) interpolation, and NCM-MH improves PSNR by 2.65 dB (from 25.43 dB of DMVFN [8] to 28.08 dB) in extrapolation. The experimental results show the capability of the proposal in capturing large motion and handling real-scenario videos.
    In summary, the main contributions of this article are as follows:
    (1)
    We introduce a neighbor correspondence matching algorithm for video frame synthesis, which is simple yet effective in capturing large motion or small objects.
    (2)
    We propose a heterogeneous coarse-to-fine structure and train it in a progressive fashion to eliminate the resolution gap between training and inference. Combining all the designs above, we propose a unified framework for frame synthesis, NCM, which can generate intermediate flows accurately and efficiently.
    (3)
    By exploring the multiple motion hypotheses mechanisms of NCM, we introduce a more robust multiple-hypotheses framework, NCM-MH, for the challenging video frame extrapolation task.
    This article is an extension of our previous conference version [11]. The current work adds to the initial version in some significant aspects. First, we explore the theoretical mechanism of neighbor correspondence matching, accompanied by visualization results, to provide a more comprehensive understanding of NCM. Second, based on the analysis, we introduce multiple motion hypotheses into NCM, resulting in a more powerful frame extrapolation framework, NCM-MH, which demonstrates new state-of-the-art results. Third, we incorporate considerable new experimental results, including comparison with recently proposed methods, ablation study, model setting, visualization comparison, and analysis. We hope that these extensions can contribute to a more comprehensive and insightful exploration of the subject matter.

    2 Related Works

    2.1 Video Frame Interpolation

    Video frame interpolation (VFI) is a sub-task of video frame synthesis, which aims to predict the intermediate frame between input frames. Learning-based VFI methods can be categorized as kernel-based methods [15, 25, 32] and flow-based methods [2, 9, 12, 19, 26, 27, 39]. Kernel-based VFI learns motion implicitly using dynamic kernels [25] and deformable kernels [15], which can preserve structural stability but might generate blurry frames because of the lack of explicit motion guidance. On the contrary, flow-based VFI explicitly models the motion with dense pixelwise flows and performs forward-warp [23, 24] or backward-warp [9, 12, 26, 27] to predict the frame, which can achieve superior performance. Since forward-warping can cause holes and overlaps in the warped image, backward-warping is more widely exploited and applied in flow-based VFI.
    For flow-based VFI that performs backward-warping, the key is how to estimate the intermediate flows. The intermediate flows should be spatially aligned with the current synthesized frame, but such spatial information is unknown during inference, which makes it difficult to estimate accurate intermediate flows. Early flow-based VFI schemes leverage advanced optical flow methods [13, 29, 36, 37] to estimate bi-directional flows and perform flow reversal to generate intermediate flows. Later, Park et al. [26] estimate symmetric bilateral motion with a bilateral cost volume, which is further improved by Park et al. [27] by introducing asymmetric motion to achieve superior performance. Recently, Huang et al. [9] proposed to estimate intermediate flows directly with privileged distillation supervision, which shows a new paradigm for intermediate flow estimation. Lu et al. [22] further take advantage of the Transformer to model long-range pixel correlation. However, these schemes cannot handle the large motion of small objects well and are limited by the resolution gap between training and inference. This inspires us to explore more effective motion representations for intermediate flow estimation.

    2.2 Video Frame Extrapolation

    Video frame extrapolation aims to predict the frame subsequent to input frames. It is much more challenging than interpolation, because unseen objects may exist in the current frame. Liu et al. [19] proposed a unified framework for both interpolation and extrapolation, which models intermediate flow as a three-dimensional (3D) voxel flow and synthesizes the current frame by trilinear sampling. Recently, Wu et al. [38] introduced a method for achieving extrapolation by optimizing an interpolation task, resulting in significantly improved extrapolation visual quality. However, this approach tends to incur high computational costs. In a related development, Hu et al. [8] proposed a dynamic multi-scale voxel flow network that strikes a balance between accuracy and efficiency in video frame extrapolation.

    2.3 Correspondence Matching

    Correspondence matching is a technique to establish correspondences between images that has been widely used in many computer vision and graphics tasks. In many 3D vision tasks [7, 30], correspondences are computed between different views to explore the 3D structure. In video object segmentation [4, 42], correspondence matching is performed to search for similar pixels in the reference frames to propagate the mask. Benefiting from the long-term correlation modeling capability of correspondence matching, these schemes achieve remarkable performance.
    Recently, correspondence has been leveraged in flow estimation [26, 37] and achieves superior performance. RAFT [37] builds an all-pairs correspondence matrix and looks it up to refine the estimated optical flow recurrently, but it cannot be effectively applied in video frame synthesis, because the current frame is not available to compute correspondences. BMBC [26] establishes a bilateral cost volume in video frame interpolation, but it is limited by the symmetric linear motion assumption. In addition, these schemes introduce correspondence as a means to refine estimated flows, which can be easily misled if inaccurate flows are given. In this article, we rethink the correspondence matching process in flow estimation. Based on the assumption that objects usually move continuously and locally in natural videos, we introduce neighbor correspondence matching as a new method for motion correlation matching.

    2.4 Multiple-hypotheses Motion Estimation

    The main idea of multiple-hypotheses motion estimation is to estimate multiple possible motions and assign each one a probability for synthesis. Compared with single-hypotheses motion estimation, multiple-hypotheses motion estimation is proven to be more robust and effective in certain interpolation scenarios. Liu et al. [7] first proposed a multiple-hypotheses Bayesian FRUC scheme for estimating the intermediate frame with maximum a posteriori probability. Recently, AdaCoF [15] was proposed to merge multiple flows through a dilated deformable convolution for video frame interpolation. In this article, we introduce a different multiple-hypotheses motion estimation process that merges multiple motion hypotheses more flexibly, without the necessity of dilated deformable convolution. Furthermore, our observations suggest that multiple motion hypotheses provide greater benefits to extrapolation than to interpolation, addressing the challenge of single-motion-hypotheses synthesis in extrapolation models.

    3 Methodology

    The overview of the proposed video frame synthesis network is shown in Figure 3. The network consists of three parts: (1) neighbor correspondence matching with a feature pyramid (yellow in Figure 3), (2) heterogeneous coarse-to-fine motion estimation (blue in Figure 3), and (3) frame synthesis. For completeness, we first briefly introduce RIFE [9] from which we adopt some block designs and then demonstrate details of each module in this section.
    Fig. 3.
    Fig. 3. Overview of the proposed network. Our network is based on RIFE [9]. We augment RIFE with (1) the neighbor correspondence matching (NCM, yellow color) with a feature pyramid and (2) the heterogeneous coarse-to-fine modules (blue color). IFBlocks estimate flows with features \(f_{0,1}^l\) , correspondences \(corr\) or frame \(I_{0,1}\) as inputs, which will be illustrated in Figure 5.

    3.1 Background

    We base our network design on the RIFE [9] framework. The fundamental concept of RIFE is to directly estimate the intermediate flows for interpolation, denoted as \(F=(F_{t\rightarrow 0}, F_{t\rightarrow 1})\) , which point from the target frame \(I_t\) to the reference frames \(I_0, I_1\) . With the intermediate flows and an estimated fusion mask \(M\) , the target frame can be generated through the following process:
    \(\begin{equation} \hat{I_t} = M\odot warp(I_0, F_{t\rightarrow 0}) + (1-M)\odot warp(I_1, F_{t\rightarrow 1}), \end{equation}\)
    (1)
    where \(\odot\) denotes the pixelwise product and \(warp\) denotes the backward warping operation. RIFE employs an intermediate flow estimation network (IFNet) to estimate \(F=(F_{t\rightarrow 0}, F_{t\rightarrow 1})\) and \(M\) using the reference frames \(I_0\) and \(I_1\) . Within the IFNet, three cascaded intermediate flow blocks (IFBlocks) (Figure 5, left) are integrated. Each block refines the flow and mask generated by the previous one by estimating residual values for the flow and mask. Subsequently, with the generated \(\hat{I_t}\) , \(F\) , and \(M\) , a U-Net-like frame synthesis network is employed to refine \(\hat{I_t}\) and produce the synthesized frame \(\tilde{I_t}\) .
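    To make the warping-and-fusion step concrete, the following is a minimal PyTorch sketch of Equation (1); the function names, tensor shapes, and the grid_sample-based warp are our assumptions rather than the authors' released code.

```python
# Minimal sketch of Equation (1): backward-warp both reference frames with the
# intermediate flows and blend them with the fusion mask M (illustrative only).
import torch
import torch.nn.functional as F


def backward_warp(img, flow):
    """Sample `img` (B, C, H, W) at positions displaced by `flow` (B, 2, H, W)."""
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=img.device, dtype=img.dtype),
        torch.arange(w, device=img.device, dtype=img.dtype),
        indexing="ij",
    )
    grid_x = xs.unsqueeze(0) + flow[:, 0]  # horizontal displacement
    grid_y = ys.unsqueeze(0) + flow[:, 1]  # vertical displacement
    # Normalize sampling coordinates to [-1, 1] for grid_sample.
    grid = torch.stack(
        (2.0 * grid_x / (w - 1) - 1.0, 2.0 * grid_y / (h - 1) - 1.0), dim=-1
    )
    return F.grid_sample(img, grid, align_corners=True)


def synthesize(i0, i1, flow_t0, flow_t1, mask):
    """Equation (1): blend the two warped reference frames with mask M."""
    return mask * backward_warp(i0, flow_t0) + (1.0 - mask) * backward_warp(i1, flow_t1)
```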
    Fig. 4.
    Fig. 4. Illustration of neighbor correspondence matching at the \(k\) th scale.
    Fig. 5.
    Fig. 5. Structure of IFBlock. In the left, we show the original IFBlock in IFNet-HD [9], which is used as fine-scale IFBlock \(_f\) in the proposed network. In the right, we show our modification to add image features and motion features into IFBlock, which is used only in coarse-scale IFBlock \(_c\) .
    RIFE is lightweight and real time, but the synthesis quality is not satisfactory due to the limited receptive field and motion capture capability of the designed fully convolutional network. In addition, RIFE cannot adapt to high-resolution videos where the motion is even larger. To eliminate these limitations, we propose neighbor correspondence matching and a heterogeneous coarse-to-fine structure for video frame synthesis, which can effectively estimate accurate intermediate flows even on 4K videos.

    3.2 Neighbor Correspondence Matching

    3.2.1 Overview.

    Based on the observation that an object usually moves continuously and locally within a small region in natural videos, the core idea of NCM is to explore the motion information by establishing a spatial-temporal correlation between neighboring regions. In detail, we compute the correspondences between the local windows of two adjacent reference frames for each pixel, as shown in Figure 2. This means no information from the current frame is required, so the matching can be performed in a current-frame-agnostic fashion to meet the needs of frame synthesis.
    It is worth noting that the position of local windows is determined by the position of the pixel, which is different from the cost-volume-based methods [9, 26] that establish a cost volume around where the estimated flows point. If the estimated flow is inaccurate, then the flow-centric matching may be performed in the wrong region and the matched correlations are ineffective. On the contrary, NCM is pixel centric and will not be misled by inaccurate flows.

    3.2.2 Mathematical Formulation.

    Given a pair of reference frames \(I_0, I_1\in \mathbb {R}^{3\times H \times W}\) , an \(n\) -layer feature pyramid \(f_{0, 1}^l\in \mathbb {R}^{C_l\times H_l\times W_l}\) is first extracted with several residual blocks, where \(l\in \lbrace 1, \dots , n\rbrace\) denotes different layers and \(C_l, H_l, W_l\) are the channel number, height, and width of the feature from the \(l\) th layer. Features from the deepest layer \(f_{0, 1}^n\in \mathbb {R}^{C_n\times H_n\times W_n}\) are used for neighbor correspondence matching.
    For a pixel at spatial position \((i, j)\) in \(f_{0, 1}^n\) , we perform NCM to compute the correspondences in \(d\times d\) windows by
    \(\begin{equation} corr^0(i,j) = \left\lbrace f_0^n(i+\delta _{i_0},j+\delta _{j_0})\cdot f_1^n(i+\delta _{i_1},j+\delta _{j_1})\right\rbrace _{\delta _{i,j_{0,1}}}, \end{equation}\)
    (2)
    where \(\delta _{i,j_{0,1}}\in \lbrace -d/2, -d/2+1,\dots ,d/2\rbrace\) denote different location pairs in the window, and \(\cdot\) denotes the channelwise dot product. The computed correspondence matrix \(corr^0\in \mathbb {R}^{d^4\times H_n\times W_n}\) contains correlations of all pairs in the neighborhoods, which can be further leveraged to extract motion information.
    To enlarge the receptive field and capture large motion, we further perform multi-scale correspondence matching. As shown in Figure 4, we first downsample \(f_{0, 1}^n\) to \(s=1/2^k\) resolution to generate multi-scale features \(f_{0, 1}^{n^k}, k\in \lbrace 0,1,\dots , K\rbrace\) . For each level \(k\) , the correspondences can be computed by
    \(\begin{equation} corr^k(i,j) = \left\lbrace f_0^{n^k}(i_k+\delta _{i_0},j_k+\delta _{j_0})\cdot f_1^{n^k}(i_k+\delta _{i_1},j_k+\delta _{j_1})\right\rbrace _{\delta _{i,j_{0,1}}}, \end{equation}\)
    (3)
    where \((i, j)\) is the position of the pixel in the original feature map \(f_{0, 1}^n\) , and \((i_k,j_k) = (i/2^k,j/2^k)\) is the position in the downsampled feature map \(f_{0, 1}^{n^k}\) . Note that the \(k\) th scale correspondence matrix \(corr^k\in \mathbb {R}^{d^4\times H_n\times W_n}\) has the same shape as \(corr^0\) . We use bilinear interpolation to sample \(f_{0, 1}^{n^k}\) for non-integer positions.
    The final multi-scale neighbor correspondences can be generated by simply concatenating correspondences at different levels,
    \(\begin{equation} corr = corr^0\ |\ corr^1\ |\ \cdots \ |\ corr^K, \end{equation}\)
    (4)
    where \(|\) denotes channel concatenation. In this article, we extract an \(n=4\)-layer feature pyramid at \(1, 1/2, 1/4, 1/8\) of the input resolution and perform NCM at 4 scales ( \(K=3\) ) with window size \(d=3\) .
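    As an illustration, the PyTorch sketch below computes the multi-scale neighbor correspondences of Equations (2)-(4) with F.unfold; for coarser scales it bilinearly resamples the correspondence maps back to the base resolution, which approximates the non-integer-position sampling described above. Function names and this resampling order are our simplifications.

```python
# Simplified sketch of multi-scale neighbor correspondence matching.
import torch
import torch.nn.functional as F


def neighbor_correspondence(f0, f1, d=3):
    """All-pairs window correspondences: two (B, C, H, W) maps -> (B, d^4, H, W)."""
    b, c, h, w = f0.shape
    pad = d // 2
    # Gather the d*d neighborhood of every pixel: (B, C, d*d, H, W).
    n0 = F.unfold(f0, d, padding=pad).view(b, c, d * d, h, w)
    n1 = F.unfold(f1, d, padding=pad).view(b, c, d * d, h, w)
    # Channelwise dot product between every pair of window positions.
    corr = torch.einsum("bcphw,bcqhw->bpqhw", n0, n1)
    return corr.reshape(b, d ** 4, h, w)


def multi_scale_ncm(f0, f1, num_scales=4, d=3):
    """Concatenate correspondences from scales 1, 1/2, ..., 1/2^K (K = num_scales - 1)."""
    b, c, h, w = f0.shape
    corrs = []
    for k in range(num_scales):
        s = 1.0 / (2 ** k)
        g0 = F.interpolate(f0, scale_factor=s, mode="bilinear") if k else f0
        g1 = F.interpolate(f1, scale_factor=s, mode="bilinear") if k else f1
        ck = neighbor_correspondence(g0, g1, d)
        if k:  # bring coarser correspondence maps back to the base resolution
            ck = F.interpolate(ck, size=(h, w), mode="bilinear")
        corrs.append(ck)
    return torch.cat(corrs, dim=1)  # (B, num_scales * d^4, H, W)
```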

    3.3 Heterogeneous Coarse-to-fine Flow Estimation

    Existing coarse-to-fine flow estimation methods [9, 26, 36] usually adopt the same upsampling factor and the same model structure from coarse to fine scale. However, it may not be the best solution for the coarse-to-fine scheme, because the coarse scale and fine scale have different focuses on motion estimation. In coarse scale, the flows need to be estimated from scratch, so strong motion capture capability is preferred. In fine scale, we only need to refine the coarse-scale flows, which can be done with lower cost. Based on such an idea, we propose a heterogeneous coarse-to-fine structure to adopt different module designs for coarse and fine scales.
    Our heterogeneous coarse-to-fine structure comprises a coarse-scale module and a fine-scale module. To adapt to different input resolutions, we downsample \(I_0, I_1\) to \((h,w)\) to feed into the coarse-scale module, and the value of \((h,w)\) can be decided by the resolution of the input video. The estimated coarse flow is upsampled back to the original resolution and fed to the subsequent fine-scale module.
    The coarse-scale module is designed to leverage the neighbor correspondences for more accurate flows. In detail, we perform NCM to obtain feature pyramid \(f^l\) and neighbor correspondences \(corr\) , which are fed into three augmented IFBlocks to estimate the coarse-scale flows. As shown in Figure 5 right, in each IFBlock, we warp \(f^l\) to generate an image feature \(f_I\) and fuse \(corr\) and flows for a motion feature \(f_M\) . The residual flows and mask \(\Delta F^l, \Delta M^l\) are estimated to refine that of the previous block,
    \(\begin{align} &F^l=F^{l-1}+\Delta F^l, \end{align}\)
    (5)
    \(\begin{align} &M^l=M^{l-1}+\Delta M^l. \end{align}\)
    (6)
    For the fine-scale module, we adopt original IFBlocks for computational efficiency. Two IFBlocks refine the coarse-scale flows using the high-resolution frames.
    The estimation resolution in each IFBlock can be controlled flexibly by the size \((h,w)\) and the downsampling factor \(K_{c,f}\) to adapt to the resolution of the input video. Assuming \(H\lt W\) , we use parameter \(a\) to control the size by \((h,w)=(a,W/H\times a)\) . In the fine-scale module, we set \(K_f=(2,1)\) if \(a/H\lt 1/2\) , and otherwise set \(K_f=(1,1)\) . We set \(K_c = (4,2,1)\) in the coarse-scale module.
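    The resolution control described above can be summarized by a small helper (a sketch with hypothetical names; the choice of \(a\) follows the setting given later in Section 4.4):

```python
# Sketch of the resolution control: returns the coarse-module input size (h, w)
# and the per-IFBlock downsampling factors K_c and K_f (names are illustrative).
def resolution_settings(height, width, a=None):
    if a is None:
        # Section 4.4: a = 384 for videos wider than 720 pixels, otherwise 256.
        a = 384 if width > 720 else 256
    # Assuming H < W, scale the short side to `a` and keep the aspect ratio.
    h, w = a, round(width / height * a)
    k_coarse = (4, 2, 1)
    # Fine-scale blocks run at reduced resolution only when the coarse input is
    # much smaller than the original frame.
    k_fine = (2, 1) if a / height < 0.5 else (1, 1)
    return (h, w), k_coarse, k_fine


# Example: a 4K frame (2160 x 3840) -> coarse input 384 x 683, K_f = (2, 1).
print(resolution_settings(2160, 3840))
```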

    3.4 Synthesis Network

    The synthesis network is used to predict the reconstruction residual \(\Delta I_t\) to refine the synthesized frame by \(\tilde{I_t}=\hat{I_t} + \Delta I_t\) . Following previous works [9], we use a contextual U-Net as the synthesis network. The synthesis network helps the model generate finer high-frequency details and reduce the artifacts.

    3.5 Exploring Multiple-hypotheses Mechanism of NCM

    In this section, we investigate the mechanism of neighbor correspondence matching in video frame synthesis. We start with correspondence matching, which is proven to be beneficial to optical flow estimation [10, 37, 41, 44].
    Mechanism of Correspondence Matching. As shown in Figure 6 (bottom), for each object in the current frame, correspondence matching measures its similarity with objects in the reference frame. The matched similarity vector provides powerful motion cues for flow estimation. For example, if the current object at position \((x_c, y_c)\) is matched as most similar to the reference object at \((x, y)\) , then we can simply infer that the most probable optical flow at \((x_c, y_c)\) is \((x-x_c, y-y_c)\) . In practice, such a similarity vector is usually fed into a deep neural network to estimate flows more accurately.
    Fig. 6.
    Fig. 6. Comparison between correspondence matching (bottom) and neighbor correspondence matching (top). For the object (masked by blue box) in the current frame, correspondence matching measures its similarity with the reference frame to infer the possible motion, while neighbor correspondence matching measures the all-pairs similarity between two reference frames to infer multiple motion hypotheses.
    As demonstrated above, correspondence matching is object aware, but objects in the current frame are unknown in video frame synthesis. In this article, an object-agnostic matching scheme, neighbor correspondence matching, is proposed to solve this problem.
    Mechanism of Neighbor Correspondence Matching. As shown in Figure 6 (top), for each position in the current frame, neighbor correspondence matching measures the similarity between all object pairs in the neighbor windows of the reference frames. In this case, the matched similarity matrix contains multiple motion hypotheses. For a position \((x_c, y_c)\) in the current frame, there may be multiple pairs of similar objects \(\lbrace (x_1^0, y_1^0), (x_1^1, y_1^1)\rbrace , \dots , \lbrace (x_n^0, y_n^0), (x_n^1, y_n^1)\rbrace\) in the neighbor windows. Any of these pairs may correspond to the object at the current position, resulting in \(2n\) possible motions \((x_p^{\lbrace 0, 1\rbrace }, y_p^{\lbrace 0, 1\rbrace }), p \in \lbrace 1, \dots , n\rbrace\) .
    It is worth noting that in NCM, a high correspondence value does not necessarily imply a high probability. In detail, a high correspondence value only indicates that the pair of objects is similar, but the actual motion may not point to the pair with the highest similarity. For example, in Figure 6 (top), the pair of objects with index \((0, 1)\) has the highest correspondence value, because the two front windows of the car are the most similar. However, the actual motion is close to the license plate with index \((4, 5)\) , which is far from \((0, 1)\) .
    In practice, spatial context information and image information can be integrated to determine the probability of each motion hypothesis. In the proposed network, IFBlocks are used to integrate such information to estimate intermediate flows. Combining the neighbor correspondences with neural networks, the network can better handle complex motions and occlusions.

    3.6 Multiple Hypotheses Motion Estimation for Extrapolation

    In this section, we perform a preliminary experiment to show that the proposed extrapolation model cannot fully exploit the multiple-hypotheses information in the neighbor correspondence. Based on such observation, we further propose a multiple-hypotheses motion estimation scheme to enhance the performance of NCM on extrapolation.
    Preliminary Experiment. We select a triplet from the Vimeo90K [40] test set and visualize the estimated intermediate flows \(F\) and masks \(M\) in Figure 7(a) and (b). By observing the estimation results, we find the following:
    Fig. 7.
    Fig. 7. A set of estimation results of NCM on interpolation, extrapolation, and multiple-hypotheses extrapolation.
    The interpolation model follows a two-hypotheses synthesizing process. That is, the model estimates two flows for each position in the current frame. It learns to judge which motion hypothesis is more reliable and assigns it more weight in synthesizing. For example, in Figure 7(a), the model assigns more weight to the left edge of the person in flow \(F_{t\rightarrow 0}\) , while assigning more to the right edge in \(F_{t\rightarrow 1}\) .
    However, the extrapolation model tends toward single-hypotheses synthesizing. Even though the model also estimates two flows, it prefers to assign much more weight to the previous flow \(F_{t\rightarrow 1}\) because of the shorter temporal distance. For example, in Figure 7(b), \(F_{t\rightarrow 1}\) carries more weight than \(F_{t\rightarrow 0}\) in most regions.
    Based on this observation, we infer that the extrapolation model may lack the capability of multiple-hypotheses synthesizing. Since it tends to assign more weight to only one flow, that flow must be more accurate to guarantee the synthesis performance, which brings difficulties in motion modeling and motion estimation.
    Methods. To fully exploit the multiple-hypotheses motion information in NCM, we propose a multiple-hypotheses motion estimation process for video frame extrapolation. Specifically, we estimate \(n\) flows \(F_{t\rightarrow j}^i\) and \(n\) masks \(M_j^i\) for each reference frame, where \(i \in \lbrace 0, 1, \dots , n-1\rbrace\) is the index of flow and \(j\in \lbrace 0, 1\rbrace\) indexes different reference frames. The IFBlocks estimate the flows and masks progressively, and, finally, the frame can be synthesized by
    \(\begin{equation} \hat{I_t} = \sum _i \left[M_0^i \odot warp\left(I_0, F_{t\rightarrow 0}^i\right) + M_1^i\odot warp\left(I_1, F_{t\rightarrow 1}^i\right)\right]. \end{equation}\)
    (7)
    In the IFBlock, all warp operations on the \(j\) th reference frame are replaced by the weighted average of its warped images \(\sum _i M_j^i\odot warp(I_j, F_{t\rightarrow j}^i)\) , and the same applies to the warp operation in image feature generation. In this article, we set \(n=5\) hypotheses for each reference frame by default.
    These designs ensure that each estimated flow \(F_{t\rightarrow 1}^i\) (or \(F_{t\rightarrow 0}^i\) ) has the same temporal distance from the target frame. As a result, the network will not assign mask weight based on the temporal distance, so the single-hypotheses synthesizing problem caused by the uneven temporal distance can be addressed. As shown in Figure 7(c), the proposed method successfully resolves the difficulty of multiple-hypotheses synthesizing in the extrapolation model.
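    A minimal sketch of the multiple-hypotheses synthesis in Equation (7) is given below; it reuses the backward_warp routine sketched for Equation (1), and the list-based interface is an assumption for illustration.

```python
# Sketch of Equation (7): warp each reference frame with n flows and blend the
# results with the corresponding masks (illustrative interface, not the authors' code).
import torch


def synthesize_multi_hypotheses(i0, i1, flows0, flows1, masks0, masks1):
    """
    flows0/flows1: lists of n flow tensors (B, 2, H, W) toward I0 / I1.
    masks0/masks1: lists of n mask tensors (B, 1, H, W).
    backward_warp: the same sampling routine sketched for Equation (1).
    """
    out = torch.zeros_like(i0)
    for f0, f1, m0, m1 in zip(flows0, flows1, masks0, masks1):
        out = out + m0 * backward_warp(i0, f0) + m1 * backward_warp(i1, f1)
    return out
```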
    CUDA Acceleration. Since in PyTorch the warp operations for different flows cannot be performed in parallel, we implement the multiple-motion warp operation in CUDA for acceleration. Testing on a 1080p image with an NVIDIA 1080Ti GPU, the warp operation becomes 19% faster and the whole model 10% faster.

    4 Implementation Details

    4.1 Progressive Learning

    Many existing frame synthesis schemes cannot be well extended to applications due to the resolution gap between training and inference. That is, the training data are low resolution (e.g., \(256\times 256\) ) but the resolution of real-world data may be much higher (e.g., 1080p or 4K). To address this problem, we design a progressive learning scheme for the proposed network. The basic idea is to separate the end-to-end training into two stages to simulate the inference on high-resolution videos:
    In stage I, only the coarse-scale module is trained on low-resolution \(256\times 256\) frames. It can be regarded as training on the low-resolution version of real-world high-resolution images.
    In stage II, the coarse-scale module is fixed, and the fine-scale module is trained to refine the coarse-scale flows to high resolution. Denoting the \(256\times 256\) frames as \(I_{HR}\) , we first randomly downsample them to \(I_{LR}\) with a resolution of \(64\times 64\) , \(128\times 128\) , or \(256\times 256\) . Subsequently, we feed \(I_{LR}\) into the coarse-scale module to estimate low-resolution flows \(F_{LR}\) . Finally, we train the fine-scale module to refine \(F_{LR}\) to the original \(256\times 256\) resolution flow \(F_{HR}\) , which is further adopted to synthesize the target frame. In this process, the fine-scale module learns to refine coarse flows into fine flows, which is consistent with its function during the inference stage.
    In inference, the coarse module can estimate accurate low-resolution flows with stage I, and the fine-scale module can refine such flows to high-resolution with stage II. As a result, our model can be effectively adapted to high-resolution videos.
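    The two training stages can be sketched as follows; the module names, loss interface, and flow-rescaling detail are illustrative assumptions, not the authors' code.

```python
# Schematic sketch of the two-stage progressive learning scheme.
import random
import torch
import torch.nn.functional as F


def train_stage_one(coarse, batch_256, optimizer, loss_fn):
    # Stage I: train only the coarse module on 256x256 crops.
    flow_lr, mask_lr = coarse(batch_256["i0"], batch_256["i1"])
    loss = loss_fn(flow_lr, mask_lr, batch_256)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


def train_stage_two(coarse, fine, batch_256, optimizer, loss_fn):
    # Stage II: simulate high-resolution inference by feeding the coarse module
    # a randomly downsampled copy of the 256x256 frames.
    s = random.choice([64, 128, 256])
    i0_lr = F.interpolate(batch_256["i0"], size=(s, s), mode="bilinear")
    i1_lr = F.interpolate(batch_256["i1"], size=(s, s), mode="bilinear")
    with torch.no_grad():  # the coarse module is frozen in stage II
        flow_lr, mask_lr = coarse(i0_lr, i1_lr)
    # Upsample the coarse flow and rescale its magnitude with the resolution ratio
    # (a standard detail we assume here).
    flow_up = F.interpolate(flow_lr, size=(256, 256), mode="bilinear") * (256 / s)
    flow_hr, mask_hr = fine(batch_256["i0"], batch_256["i1"], flow_up)
    loss = loss_fn(flow_hr, mask_hr, batch_256)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```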

    4.2 Loss Function

    Following RIFE, we adopt a self-supervised privileged distillation scheme to supervise the estimated flows directly. In detail, an additional teacher IFBlock is stacked to refine the estimated flows using the current frame \(I_t\) as input, and the generated \(F^{Tea}\) and \(M^{Tea}\) can supervise the intermediate flows with a distillation loss,
    \(\begin{equation} L_{dis} = \sum _{i\in \lbrace 0,1\rbrace } ||F_{t\rightarrow i}-F_{t\rightarrow i}^{Tea}||, \end{equation}\)
    (8)
    which is applied over all estimated flows from each IFBlock. Its gradient is stopped for the teacher module.
    The overall training loss consists of the reconstruction loss of the student \(L_{rec}\) , the teacher \(L_{rec}^{Tea}\) and the privileged distillation loss \(L_{dis}\) ,
    \(\begin{equation} L = L_{rec}+L_{rec}^{Tea}+\lambda _d L_{dis}, \end{equation}\)
    (9)
    where the reconstruction loss is defined as the \(L_1\) loss between the Laplacian pyramid representations of the synthesized frame and the ground truth, and \(\lambda _d\) is set to 0.01 for interpolation and 0 for extrapolation.
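    A hedged sketch of the objective in Equations (8) and (9) is shown below; the Laplacian pyramid construction and helper names are our illustrative choices, the distillation penalty is assumed to be L1, and \(\lambda _d\) follows the values stated above.

```python
# Illustrative sketch of the training losses (Equations (8)-(9)).
import torch
import torch.nn.functional as F


def laplacian_l1(pred, gt, levels=5):
    """L1 distance between Laplacian pyramid representations of two images."""
    loss = 0.0
    for _ in range(levels):
        pred_down = F.avg_pool2d(pred, 2)
        gt_down = F.avg_pool2d(gt, 2)
        # Laplacian band = image minus its upsampled low-pass version.
        loss = loss + F.l1_loss(
            pred - F.interpolate(pred_down, scale_factor=2, mode="bilinear"),
            gt - F.interpolate(gt_down, scale_factor=2, mode="bilinear"),
        )
        pred, gt = pred_down, gt_down
    return loss + F.l1_loss(pred, gt)


def total_loss(student_frame, teacher_frame, gt, student_flows, teacher_flow,
               lambda_d=0.01):
    # Distillation over all student flows (each entry stacks both directions);
    # the teacher flow is detached so its gradient is stopped.
    l_dis = sum(torch.mean(torch.abs(f - teacher_flow.detach()))
                for f in student_flows)
    return laplacian_l1(student_frame, gt) + laplacian_l1(teacher_frame, gt) \
        + lambda_d * l_dis
```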

    4.3 Experimental Setup

    Training Data. We use the Vimeo90k [40] training split, which has 51,312 triplets with a resolution of \(448\times 256\) . We augment the dataset by randomly flipping, temporal order reversing, and cropping \(256\times 256\) patches.
    Training Strategy. We use AdamW [20] to optimize our model with weight decay \(10^{-3}\) . The learning rate is gradually reduced from \(3\times 10^{-4}\) to \(3\times 10^{-5}\) using cosine annealing for each stage. The batch size is set to 64, and we use four Tesla V100 GPUs to train the coarse-scale module for 230k iterations in stage I and the remaining parts for 76k iterations in stage II. Training takes about 40 hours in total.
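    For reference, this optimization setup corresponds to a standard PyTorch configuration such as the following sketch (the model and iteration count are placeholders):

```python
# Sketch of the optimizer and learning-rate schedule described above.
import torch

model = torch.nn.Conv2d(3, 4, 3)  # placeholder for the actual synthesis network
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=230_000, eta_min=3e-5  # e.g., 230k iterations in stage I
)
```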

    4.4 Benchmarks

    We evaluate our scheme on various benchmarks to verify its performance and generalization ability. On each dataset, we measure the PSNR and structural similarity (SSIM) for quantitative evaluation.
    X4K1000FPS [33]: This is a high-quality dataset of 4K videos, which is challenging due to the high resolution, occlusion, large motion, and scene diversity. The provided X-TEST set supports \(8\times\) interpolation on two frames with a temporal distance of 32 frames and \(2\times\) extrapolation on two frames with a temporal distance of 16 frames.
    SNU-FILM [5]: This contains 1,240 triplets of 240 fps videos with resolution from \(640\times 368\) to \(1280\times 720\) . Four settings (Easy, Medium, Hard, and Extreme) are provided to evaluate from small motion to large motion, and the temporal distance of each setting increases from 2 (120 fps \(\rightarrow\) 240 fps) to 16 (15 fps \(\rightarrow\) 30 fps).
    UCF101 [34]: Liu et al. [19] selected 379 triplets from the UCF101 dataset for video frame synthesis evaluation.
    Vimeo90K [40]: Its test set contains 3,782 triplets with the resolution of \(448\times 256\) . We evaluate schemes on it to verify the robustness of our model on low-resolution videos.
    Different downsampling sizes \((h,w)\) are set to adapt to the resolution of each benchmark. We set \(a=384\) when the width of the video is larger than 720 pixels and \(a=256\) otherwise (the definition of \(a\) is in Section 3.3).

    5 Experiments

    In this section, we perform comparison experiments and ablation studies to show the effectiveness of the proposal.

    5.1 Comparison with Previous Methods

    For video frame interpolation, we compare the proposed scheme with other existing methods: DAIN [1], AdaCoF [15], XVFI [33], SoftSplat [24], BMBC [26], ABME [27], RIFE [9], VFIformer [22], AMT [16], and EMA [43]. For extrapolation, we compare with DVF [19], OPT [38], and DMVFN [8]. We also report their runtimes and parameter counts. All runtimes are measured on a Tesla V100 GPU.¹
    Following RIFE [9] and EMA [43], we also introduce a Large version of our model to meet the needs of scenarios with different computation budgets. We implement model scaling to double the resolution of the feature maps in each IFBlock and the synthesis network by removing the first stride. This preserves more high-frequency information in the feature maps, allowing the model to estimate flows more accurately. In addition, we implement test-time augmentation (denoted as \(2T\) in the tables): we infer twice, once with the original input frames and once with flipped frames, and then average the results. It doubles the runtime but usually leads to better performance [9].
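    The \(2T\) augmentation amounts to averaging the prediction over flipped inputs; a simple sketch follows, assuming a horizontal spatial flip (the text does not specify which flip is used):

```python
# Sketch of the 2T test-time augmentation: infer on original and flipped inputs,
# un-flip the second prediction, and average.
import torch


def infer_2t(model, i0, i1):
    out = model(i0, i1)
    out_flip = model(torch.flip(i0, dims=[-1]), torch.flip(i1, dims=[-1]))
    return 0.5 * (out + torch.flip(out_flip, dims=[-1]))
```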
    Interpolation. We report the quantitative results of interpolation in Table 1 and show the visual comparison in Figure 8. Among real-time schemes, NCM outperforms the state-of-the-art AMT-S by an average PSNR of 0.12 dB with 2 times faster runtime on 1080p videos. Among methods with larger computational cost, NCM-Large \(^{2T}\) achieves results comparable to the recently proposed EMA-Large \(^{2T}\) with faster runtime on 1080p videos. It is worth noting that NCM and NCM-Large \(^{2T}\) outperform the state-of-the-art methods on the most challenging X4K1000FPS benchmark by 0.53 and 0.40 dB, which shows their capability in capturing large motion in high-resolution videos.
    Table 1.
    Model | X4K1000FPS | SNU-FILM Easy | SNU-FILM Medium | SNU-FILM Hard | SNU-FILM Extreme | UCF101 | Vimeo90K | Average | Runtime 480p (ms) | Runtime 1080p (ms) | Parameters (M)
    AdaCoF [15] | 23.90/0.7271 | 39.80/0.9900 | 35.05/0.9754 | 29.46/0.9244 | 24.31/0.9439 | 34.90/0.9680 | 34.47/0.9730 | 31.70/0.9288 | 20 | 136 | 21.8
    AMT-S [16] | 31.06/0.9170 | 39.95/0.9905 | 35.98/0.9796 | 30.60/0.9369 | 25.30/0.8625 | 35.35/0.9712 | 35.97/0.9800 | 33.46/0.9482 | 30 | 159 | 3.0
    RIFE [9] | 30.42/0.9042 | 40.02/0.9905 | 35.73/0.9787 | 30.08/0.9328 | 24.82/0.8530 | 35.28/0.9690 | 35.61/0.9779 | 33.14/0.9437 | 10 | 58 | 10.1
    EMA [43] | 30.89/0.9101 | 39.81/0.9906 | 35.88/0.9795 | 30.69/0.9375 | 25.47/0.8632 | 35.34/0.9696 | 36.07/0.9797 | 33.45/0.9471 | 38 | 144 | 14.5
    NCM | 31.63/0.9185 | 39.98/0.9903 | 35.94/0.9788 | 30.72/0.9359 | 25.55/0.8624 | 35.36/0.9695 | 35.88/0.9795 | 33.58/0.9478 | 27 | 74 | 12.1
    NCM-Large | 31.44/0.9190 | 39.98/0.9903 | 35.99/0.9789 | 30.75/0.9363 | 25.61/0.8635 | 35.38/0.9698 | 36.11/0.9803 | 33.61/0.9483 | 54 | 201 | 12.1
    XVFI [33] | 30.12/0.8704 | 39.92/0.9902 | 35.37/0.9777 | 29.57/0.9272 | 24.17/0.8448 | 35.18/0.9685 | 35.07/0.9756 | 32.77/0.9363 | 65 | 405 | 5.7
    BMBC [26] | 29.35/0.8791 | 39.90/0.9902 | 35.31/0.9774 | 29.33/0.9270 | 23.92/0.8432 | 35.15/0.9689 | 35.01/0.9764 | 32.57/0.9374 | 951 | 6628 | 11.0
    ABME [27] | 30.16/0.8793 | 39.59/0.9901 | 35.77/0.9789 | 30.58/0.9364 | 25.42/0.8639 | 35.38/0.9698 | 36.18/0.9805 | 33.30/0.9425 | 197 | 1506 | 17.6
    VFIFormer [22] | 30.19/0.8975 | 40.13/0.9907 | 36.09/0.9799 | 30.67/0.9378 | 25.43/0.8643 | 35.43/0.9700 | 36.50/0.9816 | 33.49/0.9459 | 994 | 6652 | 24.2
    AMT-G [16] | 31.15/0.9161 | 39.88/0.9905 | 36.12/0.9798 | 30.78/0.9379 | 25.43/0.8640 | 35.45/0.9702 | 36.53/0.9819 | 33.62/0.9486 | 100 | 584 | 30.6
    EMA-Large \(^{2T}\) [43] | 31.46/0.9158 | 39.98/0.9910 | 36.09/0.9801 | 30.94/0.9392 | 25.69/0.8661 | 35.48/0.9701 | 36.64/0.9819 | 33.75/0.9492 | 92 | 446 | 65.7
    NCM-Large \(^{2T}\) | 31.86/0.9225 | 40.14/0.9905 | 36.12/0.9793 | 30.88/0.9370 | 25.70/0.8647 | 35.43/0.9700 | 36.22/0.9807 | 33.76/0.9492 | 108 | 402 | 12.1
    Table 1. Quantitative Comparison (PSNR/SSIM) of Video Frame Interpolation Results
    We divide methods into three groups according to the computational complexity. The best result in each group is shown in red and the second best is shown in blue. Results with * are copied from the original papers.
    Fig. 8.
    Fig. 8. Visual comparisons on interpolation with SOTA methods. For each method, we show the synthesized results, the zoomed details, and the residuals between synthesized images and ground truth. Best viewed in zoom.
    However, some recent works like EMA and AMT show better performance than our models on the Vimeo-90k benchmark. Since many designs in NCM are proposed to handle high-resolution videos, such as heterogeneous coarse-to-fine estimation and progressive learning, its performance on low-resolution videos (i.e., \(256\times 448\) in Vimeo-90k) is less satisfactory. We will improve it in the future.
    Extrapolation. We report the quantitative results of extrapolation in Table 2² and show the visual comparison on X4K1000FPS in Figure 9. Compared with the most advanced method, DMVFN, our models bring an average improvement of 1.09 dB (NCM) and 1.18 dB (NCM-MH) in terms of PSNR. The results also show that NCM-MH outperforms NCM-Large in both visual quality and runtime, which demonstrates the effectiveness of multiple-hypotheses motion estimation in extrapolation.
    Table 2.
    Model | X4K1000FPS | SNU-FILM Easy | SNU-FILM Medium | SNU-FILM Hard | SNU-FILM Extreme | UCF101 | Vimeo90K | Average | Runtime 480p (ms) | Runtime 1080p (ms) | Parameters (M)
    DVF [19] | 19.18/0.6879 | 25.39/0.8728 | 23.30/0.8279 | 21.41/0.7798 | 19.49/0.7251 | 31.29/0.9433 | 26.21/0.8815 | 23.75/0.8169 | 48 | 384 | 3.8
    DMVFN [8] | 25.43/0.8253 | 35.87/0.9798 | 31.48/0.9524 | 26.41/0.8821 | 22.06/0.7959 | 31.32/0.9419 | 30.27/0.9456 | 28.98/0.9033 | 66 | 187 | 3.6
    NCM | 27.93/0.8708 | 36.54/0.9817 | 32.27/0.9562 | 27.32/0.8888 | 22.87/0.8102 | 31.72/0.9444 | 31.84/0.9595 | 30.07/0.9159 | 27 | 74 | 12.1
    NCM-Large | 27.96/0.8760 | 36.50/0.9819 | 32.23/0.9568 | 27.31/0.8898 | 22.88/0.8115 | 31.77/0.9447 | 32.01/0.9608 | 30.11/0.9175 | 54 | 201 | 12.1
    NCM-MH | 28.08/0.8741 | 36.73/0.9823 | 32.36/0.9572 | 27.34/0.8901 | 22.79/0.8113 | 31.81/0.9452 | 32.04/0.9610 | 30.16/0.9173 | 41 | 135 | 13.3
    OPT [38] | 24.52/0.8434 | 34.24/0.9762 | 30.10/0.9475 | 25.53/0.8765 | 21.59/0.7942 | 30.89/0.9382 | 29.29/0.9406 | 28.02/0.9024 | > \(10^5\) | > \(10^6\) | 16.0
    NCM-Large \(^{2T}\) | 28.16/0.8794 | 36.48/0.9818 | 32.30/0.9567 | 27.45/0.8903 | 23.02/0.8131 | 31.81/0.9449 | 32.10/0.9614 | 30.19/0.9182 | 108 | 402 | 12.1
    NCM-MH \(^{2T}\) | 28.20/0.8789 | 36.85/0.9826 | 32.46/0.9577 | 27.42/0.8907 | 22.86/0.8125 | 31.86/0.9453 | 32.15/0.9617 | 30.26/0.9185 | 82 | 270 | 13.3
    Table 2. Quantitative Comparison (PSNR/SSIM) of Video Frame Extrapolation Results
    We divide methods into three groups according to the computational complexity. The best result in each set is shown in red and the second best is shown in blue.
    Fig. 9.
    Fig. 9. Visual comparisons on extrapolation with SOTA methods. For each method, we show the synthesized results, the zoomed details, and the residuals between synthesized images and ground truth. Best viewed in zoom.

    5.2 Ablation Study

    5.2.1 Add-on Study.

    To understand the effectiveness of the proposed components, we perform an ablation study by adding each component to a baseline in Table 3. The baseline (last row in the table) consists of five IFBlocks, and between every two blocks the upsampling rate is set to 2 in both training and inference.
    Table 3.
    Progressive Learning | NCM image feature | NCM motion feature | X4K1000FPS | Vimeo90K 256p | Vimeo90K 480p | Runtime @1080p (ms)
    \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | 31.63/0.9185 | 35.88/0.9795 | 36.44/0.9816 | 74
    \(\checkmark\) | \(\checkmark\) | \(\times\) | 30.98/0.9087 | 35.76/0.9788 | 36.35/0.9812 | 66
    \(\checkmark\) | \(\times\) | \(\times\) | 30.93/0.9088 | 35.36/0.9773 | 36.07/0.9803 | 63
    \(\times\) | \(\times\) | \(\times\) | 30.00/0.8937 | 35.53/0.9779 | 35.93/0.9796 | 63
    Table 3. Add-on Study of Progressive Learning and NCM
    PSNR/SSIM and the runtime (ms) are reported. The best result is shown in bold.
    Progressive learning. It makes the model adapt to high-resolution 4K videos (0.93 dB improvement on X4K1000FPS). However, the performance on 256p videos slightly drops. This is because the baseline is trained only on low-resolution 256p videos, while progressive learning is designed to refine high-resolution videos from low-resolution estimates. If we resize the 256p videos to 480p, this limitation is eliminated and we achieve a 0.14 dB improvement.
    Neighbor correspondence matching comprises a feature pyramid to extract image features and a matching process to extract motion features. The image features mainly improve performance on low-resolution videos, and the matching-based motion features can enhance performance on both high resolution (0.65 dB) and low resolution (0.12 dB).

    5.2.2 Coarse-to-fine and the Matching Region.

    We also perform an ablation study on the coarse-to-fine structure and the matching region in NCM in Table 4 (top).
    Table 4.
    Table 4. Ablation Study on the Coarse-to-fine Manner and Matching Region (Top) and Estimating More Motion Hypotheses on Interpolation (Bottom)
    Heterogeneous coarse-to-fine. Compared with using NCM in both coarse and fine scales (normal coarse-to-fine in Table 4), the proposed heterogeneous coarse-to-fine structure achieves a 3 times shorter runtime with better performance on 4K videos and comparable performance on 256p videos. This shows the efficiency of designing different module structures for the coarse- and fine-scale modules. Using NCM in the fine-scale module on 4K videos can cause a performance drop, because in such high-resolution videos the content in neighboring regions is so similar that the correspondence is not effective for capturing the motion.
    Matching region. We compare the proposed neighbor correspondence matching with flow-guided matching. Guiding the matching by flow causes performance to drop by 0.07 dB on X4K1000FPS, because the matching region may be misled by inaccurately estimated flows. The runtime is also 10 ms longer, since more matching operations are needed. In addition, we find that flow-guided matching is more likely to cause training collapse, which indicates that it is less stable, since the matching region is influenced by the estimated flows.
    Compared with previous cost-volume-based schemes [26, 36, 37] that guide the matching region by estimated flow, NCM works better for video frame synthesis. Due to the complexity of frame synthesis, the estimated flows are usually inaccurate at the beginning of estimation. In this case, if the matching region is guided by the inaccurate flow, then the matching region may move away from where the current pixel is, leading to ineffective matching. On the contrary, NCM directly matches correspondences in fixed neighbor regions to avoid being misled by inaccurate flows, which is more stable and more efficient, since the correspondences only need to be computed once to be applied in different stages of estimation.

    5.2.3 Multiple Hypotheses Motion Estimation in Interpolation.

    In Section 3.6, we propose to improve the extrapolation model with a multiple-hypotheses motion estimation process. We also perform an experiment to estimate more motion hypotheses ( \(n=5\) flows for each reference frame, denoted as “MH”) in the interpolation model, and the results are shown in Table 4 (bottom).
    We find that estimating more motion hypotheses improves the performance by 0.17 and 0.04 dB on the low-resolution benchmarks Vimeo-90k (448 \(\times\) 256) and UCF101 (256 \(\times\) 256), while the performance drops by 0.23 and 0.11 dB on the high-resolution X4K1000FPS (about 4000 \(\times\) 2000) and the Extreme setting of SNU-FILM (up to 1280 \(\times\) 720). This indicates that adding more motion hypotheses to the interpolation model cannot enhance the motion modeling capability in all situations. Since estimating more motion hypotheses also brings more complexity in training, it may degrade the performance of the interpolation model in high-resolution video interpolation. Considering that NCM is mainly proposed to handle high-resolution videos with large motion, we suggest not estimating more motion hypotheses for interpolation models.

    6 Applications

    The proposed scheme can be applied to various scenarios due to its powerful capability to capture large motion in high-resolution videos. In this section, we use motion blur synthesis and video compression as examples to show its potential. More applications and details can be found in the supplemental materials.

    6.1 Motion Blur Synthesis

    Motion blur is the apparent streaking of moving objects in a sequence of frames. In film or animation, simulating motion blur can lead to more realistic visuals. Simulating motion blur can also generate datasets for training deblurring networks.
    Our interpolation model can be applied to generate such motion blur. We interpolate 31 frames between the reference frames and average them to generate the blurred image. As shown in Figure 10, the generated images are blurred in the fast-moving regions while keeping high quality in static regions.
    Fig. 10.
    Fig. 10. Apply NCM in motion blur synthesis.
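    A sketch of this procedure is given below; the timestep-conditioned call model(i0, i1, t=...) is an assumption (recursive midpoint interpolation would also produce the 31 frames), and whether the reference frames are included in the average is not specified in the text.

```python
# Sketch of motion blur synthesis: interpolate 31 intermediate frames and average them.
import torch


def synthesize_motion_blur(model, i0, i1, num_frames=31):
    frames = [model(i0, i1, t=(k + 1) / (num_frames + 1)) for k in range(num_frames)]
    return torch.stack(frames, dim=0).mean(dim=0)
```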

    6.2 Video Compression

    Most learned video compression schemes [17, 21, 31] focus on reducing the redundancy between frames, but the redundancy between flows is not well considered. We leverage NCM to design a scalable bi-directional video compression model to eliminate such redundancy. It can serve as a plug-in on a uni-directional codec to compress B-frames, and the whole video can be compressed in I-B-P-B order. The model supports a bit-free mode for low-cost compression and a bit-need mode for high-quality compression. We evaluate it on the advanced learned P-frame codec TCM [31] and show the rate-distortion curves on the HEVC test videos [35] in Figure 11.
    Fig. 11.
    Fig. 11. Apply NCM in bi-direction video compression. We compare the proposed NCM-based B-frame video codec with TCM [31].
    The bit-free mode adopts NCM to interpolate the B-frame. It saves 10.4% of the bits (0.065 bpp \(\rightarrow\) 0.058 bpp) at \(\rm PSNR=32.0\) dB, and the inference is about 9 times faster for B-frames. The bit-need mode adopts NCM as a motion prediction module and compresses the residual motion instead of the entire motion. It saves 17.5% of the bits (0.065 bpp \(\rightarrow\) 0.053 bpp) at \(\rm PSNR=32.0\) dB. Experiments show that NCM can serve as a superior motion prediction module for video compression. More details about the methods, model structures, and experiments can be found in the supplementary material.

    7 Conclusion

    This article proposes a neighbor correspondence matching algorithm for video frame synthesis, which can capture large motion even in high-resolution videos. With the proposed heterogeneous coarse-to-fine structure design and progressive learning, our model is effective and efficient and can adapt to high-resolution videos. We further explore the mechanism of NCM, and introduce a multiple-hypotheses estimation process for video frame extrapolation. Experiments show the superiority of our model in both interpolation and extrapolation. In addition, our model can be used in many applications, and we use several examples to show its potential. We hope it can be extended to more applications in the future.

    Footnotes

    1
    The runtime differs from the previous conference version [11], since the hardware and the environment are different. In this paper, we use Intel Xeon Platinum 8160T CPU, Python 3.8, CUDA 11.3, CuDNN 8, and PyTorch 1.11.0 for testing.
    2
    The results of NCM in extrapolation are slightly higher than in the previous conference version [11], since we use edge padding instead of zero padding for the reference frames in inference.

    References

    [1]
    Wenbo Bao, Wei-Sheng Lai, Chao Ma, Xiaoyun Zhang, Zhiyong Gao, and Ming-Hsuan Yang. 2019. Depth-aware video frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3703–3712.
    [2]
    Wenbo Bao, Wei-Sheng Lai, Xiaoyun Zhang, Zhiyong Gao, and Ming-Hsuan Yang. 2019. Memc-net: Motion estimation and motion compensation driven neural network for video interpolation and enhancement. IEEE Trans. Pattern Anal. Mach. Intell. 43, 3 (2019), 933–948.
    [3]
    Tim Brooks and Jonathan T Barron. 2019. Learning to synthesize motion blur. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6840–6848.
    [4]
    Ho Kei Cheng, Yu-Wing Tai, and Chi-Keung Tang. 2021. Rethinking space-time networks with improved memory coverage for efficient video object segmentation. Advances in Neural Information Processing Systems 34 (2021), 11781–11794.
    [5]
    Myungsub Choi, Heewon Kim, Bohyung Han, Ning Xu, and Kyoung Mu Lee. 2020. Channel attention is all you need for video frame interpolation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 10663–10671.
    [6]
    John Flynn, Ivan Neulander, James Philbin, and Noah Snavely. 2016. Deepstereo: Learning to predict new views from the world’s imagery. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5515–5524.
    [7]
    Richard Hartley and Andrew Zisserman. 2003. Multiple View Geometry in Computer Vision. Cambridge University Press.
    [8]
    Xiaotao Hu, Zhewei Huang, Ailin Huang, Jun Xu, and Shuchang Zhou. 2023. A dynamic multi-scale voxel flow network for video prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6121–6131.
    [9]
    Zhewei Huang, Tianyuan Zhang, Wen Heng, Boxin Shi, and Shuchang Zhou. 2022. Real-time intermediate flow estimation for video frame interpolation. In European Conference on Computer Vision. Springer, 624–642.
    [10]
    Tak-Wai Hui and Chen Change Loy. 2020. Liteflownet3: Resolving correspondence ambiguity for more accurate optical flow estimation. In European Conference on Computer Vision. Springer, 169–184.
    [11]
    Zhaoyang Jia, Yan Lu, and Houqiang Li. 2022. Neighbor correspondence matching for flow-based video frame synthesis. In Proceedings of the 30th ACM International Conference on Multimedia. 5389–5397.
    [12]
    Huaizu Jiang, Deqing Sun, Varun Jampani, Ming-Hsuan Yang, Erik Learned-Miller, and Jan Kautz. 2018. Super slomo: High quality estimation of multiple intermediate frames for video interpolation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 9000–9008.
    [13]
    Shihao Jiang, Dylan Campbell, Yao Lu, Hongdong Li, and Richard Hartley. 2021. Learning to estimate hidden motions with global motion aggregation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 9772–9781.
    [14]
    Nima Khademi Kalantari, Ting-Chun Wang, and Ravi Ramamoorthi. 2016. Learning-based view synthesis for light field cameras. ACM Trans. Graph. 35, 6 (2016), 1–10.
    [15]
    Hyeongmin Lee, Taeoh Kim, Tae-young Chung, Daehyun Pak, Yuseok Ban, and Sangyoun Lee. 2020. Adacof: Adaptive collaboration of flows for video frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5316–5325.
    [16]
    Zhen Li, Zuo-Liang Zhu, Ling-Hao Han, Qibin Hou, Chun-Le Guo, and Ming-Ming Cheng. 2023. AMT: All-pairs multi-field transforms for efficient frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9801–9810.
    [17]
    Jiaying Liu, Sifeng Xia, and Wenhan Yang. 2019. Deep reference generation with multi-domain hierarchical constraints for inter prediction. IEEE Trans. Multimedia 22, 10 (2019), 2497–2510.
    [18]
    Yihao Liu, Liangbin Xie, Li Siyao, Wenxiu Sun, Yu Qiao, and Chao Dong. 2020. Enhanced quadratic video interpolation. In European Conference on Computer Vision. Springer, 41–56.
    [19]
    Ziwei Liu, Raymond A Yeh, Xiaoou Tang, Yiming Liu, and Aseem Agarwala. 2017. Video frame synthesis using deep voxel flow. In Proceedings of the IEEE International Conference on Computer Vision. 4463–4471.
    [20]
    Ilya Loshchilov and Frank Hutter. 2018. Fixing weight decay regularization in adam. (2018).
    [21]
    Guo Lu, Wanli Ouyang, Dong Xu, Xiaoyun Zhang, Chunlei Cai, and Zhiyong Gao. 2019. Dvc: An end-to-end deep video compression framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11006–11015.
    [22]
    Liying Lu, Ruizheng Wu, Huaijia Lin, Jiangbo Lu, and Jiaya Jia. 2022. Video frame interpolation with transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3532–3542.
    [23]
    Simon Niklaus and Feng Liu. 2018. Context-aware synthesis for video frame interpolation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1701–1710.
    [24]
    S. Niklaus and Feng Liu. 2020. Softmax splatting for video frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5437–5446.
    [25]
    Simon Niklaus, Long Mai, and Feng Liu. 2017. Video frame interpolation via adaptive separable convolution. In Proceedings of the IEEE International Conference on Computer Vision. 261–270.
    [26]
    Junheum Park, Keunsoo Ko, Chul Lee, and Chang-Su Kim. 2020. Bmbc: Bilateral motion estimation with bilateral cost volume for video interpolation. In European Conference on Computer Vision. Springer, 109–125.
    [27]
    Junheum Park, Chul Lee, and Chang-Su Kim. 2021. Asymmetric bilateral motion estimation for video frame interpolation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 14539–14548.
    [28]
    Reza Pourreza and Taco Cohen. 2021. Extending neural p-frame codecs for b-frame coding. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 6680–6689.
    [29]
    Anurag Ranjan and Michael J Black. 2017. Optical flow estimation using a spatial pyramid network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4161–4170.
    [30]
    Johannes L Schonberger and Jan-Michael Frahm. 2016. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4104–4113.
    [31]
    Xihua Sheng, Jiahao Li, Bin Li, Li Li, Dong Liu, and Yan Lu. 2023. Temporal context mining for learned video compression. IEEE Transactions on Multimedia 25 (2023), 7311–7322.
    [32]
    Zhihao Shi, Xiaohong Liu, Kangdi Shi, Linhui Dai, and Jun Chen. 2021. Video frame interpolation via generalized deformable convolution. IEEE Trans. Multimedia 24 (2021), 426–439.
    [33]
    Hyeonjun Sim, Jihyong Oh, and Munchurl Kim. 2021. XVFI: Extreme video frame interpolation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 14489–14498.
    [34]
    Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402. Retrieved from https://arxiv.org/abs/1212.0402.
    [35]
    Gary J Sullivan, Jens-Rainer Ohm, Woo-Jin Han, and Thomas Wiegand. 2012. Overview of the high efficiency video coding (HEVC) standard. IEEE Trans. Circ. Syst. Vid. Technol. 22, 12 (2012), 1649–1668.
    [36]
    Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. 2018. Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 8934–8943.
    [37]
    Zachary Teed and Jia Deng. 2020. Raft: Recurrent all-pairs field transforms for optical flow. In European Conference on Computer Vision. Springer, 402–419.
    [38]
    Yue Wu, Qiang Wen, and Qifeng Chen. 2022. Optimizing video prediction via video frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 17814–17823.
    [39]
    Jing Xiao, Kangmin Xu, Mengshun Hu, Liang Liao, Zheng Wang, Chia-Wen Lin, Mi Wang, and Shin’ichi Satoh. 2022. Progressive motion boosting for video frame interpolation. IEEE Transactions on Multimedia (2022), 1–14.
    [40]
    Tianfan Xue, Baian Chen, Jiajun Wu, Donglai Wei, and William T. Freeman. 2019. Video enhancement with task-oriented flow. Int. J. Comput. Vis. 127, 8 (2019), 1106–1125.
    [41]
    Gengshan Yang and Deva Ramanan. 2019. Volumetric correspondence networks for optical flow. Advances in Neural Information Processing Systems 32 (2019), 794–805.
    [42]
    Zongxin Yang, Yunchao Wei, and Yi Yang. 2021. Collaborative video object segmentation by multi-scale foreground-background integration. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 9 (2021), 4701–4712.
    [43]
    Guozhen Zhang, Yuhan Zhu, Haonan Wang, Youxin Chen, Gangshan Wu, and Limin Wang. 2023. Extracting motion and appearance via inter-frame attention for efficient video frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5682–5692.
    [44]
    Shiyu Zhao, Long Zhao, Zhixing Zhang, Enyu Zhou, and Dimitris Metaxas. 2022. Global matching with overlapping attention for optical flow estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 17592–17601.
