Affiliations: 1 Meta AI, 2 KAUST. * Work done at Meta. † Equal contribution.

MarDini: Masked Autoregressive Diffusion for Video Generation at Scale

Haozhe Liu, Shikun Liu, Zijian Zhou, Mengmeng Xu, Yanping Xie, Xiao Han, Juan C. Pérez, Ding Liu, Kumara Kahatapitiya, Menglin Jia, Jui-Chieh Wu, Sen He, Tao Xiang, Jürgen Schmidhuber, Juan-Manuel Pérez-Rúa
haozhe.liu@kaust.edu.sa, jmpr@meta.com

(October 26, 2024)

Abstract

We introduce MarDini, a new family of video diffusion models that integrate the advantages of masked auto-regression (MAR) into a unified diffusion model (DM) framework. Here, MAR handles temporal planning, while DM focuses on spatial generation in an asymmetric network design: i) a MAR-based planning model containing most of the parameters generates planning signals for each masked frame using low-resolution input; ii) a lightweight generation model uses these signals to produce high-resolution frames via diffusion de-noising. MarDini’s MAR enables video generation conditioned on any number of masked frames at any frame positions: a single model can handle video interpolation (e.g., masking middle frames), image-to-video generation (e.g., masking from the second frame onward), and video expansion (e.g., masking half the frames). The efficient design allocates most of the computational resources to the low-resolution planning model, making computationally expensive but important spatio-temporal attention feasible at scale. MarDini sets a new state-of-the-art for video interpolation; meanwhile, within a few inference steps, it efficiently generates videos on par with those of much more expensive advanced image-to-video models.

Correspondence: Haozhe Liu at haozhe.liu@kaust.edu.sa, Juan-Manuel Pérez-Rúa at jmpr@meta.com

Blogpost: https://mardini-vidgen.github.io

1 Introduction

Auto-regressive (AR) transformers (Vaswani et al., 2017; Peng et al., 2023; Schmidhuber, 1992b; Schlag et al., 2021) have recently demonstrated remarkable success in natural language processing (Dubey et al., 2024; Team et al., 2023; Achiam et al., 2023), sparking efforts to achieve similar breakthroughs in computer vision (Rombach et al., 2022; Dai et al., 2023a; Saharia et al., 2022a). However, unlike the discrete, sequential, and easily tokenized nature of language, visual data consist of continuous pixel signals distributed across a high-dimensional space, making them more difficult to model through 1D auto-regression.

To overcome this challenge, recent studies have explored vector quantization techniques (Van Den Oord et al., 2017; Razavi et al., 2019) to convert continuous pixel data into discrete representations suitable for AR modelling. Unfortunately, these approaches (Yu et al., 2022; Ramesh et al., 2021) rely on causal attention, which is not well suited to high-dimensional visual data, often leading to diminished performance (Li et al., 2024), particularly on large-scale datasets (Xie et al., 2024; Zhou et al., 2024). To mitigate this limitation, masked auto-regression (MAR) has been introduced (Chang et al., 2022; Li et al., 2023a). MAR replaces the causal attention with bi-directional attention (He et al., 2021; Devlin et al., 2019), effectively simulating auto-regressive behaviour while being more capable of handling visual data. Leveraging this approach, MAR exhibits flexibility in handling diverse generation tasks through different masking strategies, such as image generation (Chang et al., 2022; Li et al., 2023a), out-painting (Chang et al., 2022), video expansion (Yu et al., 2023a) and class-conditioned video generation (Yu et al., 2024; Voleti et al., 2022), while maintaining manageable computational overhead. Although MAR shows potential in scaling image and video generation tasks (Chang et al., 2023; Yu et al., 2023a, 2024), its key bottleneck is training instability, which is tied to the reliance on discrete representations (Ramesh et al., 2021; Razavi et al., 2019).

Meanwhile, diffusion models (DMs) (Ho et al., 2020; Neal, 2001; Jarzynski, 1997) have emerged as a successful alternative for scaling vision generative models, offering stable training by modelling visual signals directly in a continuous space. However, DMs tend to incur high inference costs because the diffusion process requires multiple steps. Video generation poses an even greater challenge: video is a strict superset of the image domain, requiring additional modelling for temporal consistency and complex motion dynamics.

To this end, we propose a new paradigm for video generation that combines the flexibility of MAR in a continuous space with the robust generative capabilities of DM. Specifically, we present a scalable training recipe and an efficient neural architecture design for video generation. Our model decomposes video generation into two sub-tasks — temporal and spatial modelling — handled by distinct networks with an asymmetric design based on the following two principles:

1. MAR handles long-range temporal modelling, while DM focuses on detailed spatial modelling.

2. MAR operates with more parameters at a lower resolution, while DM operates with fewer parameters at a higher resolution.

Following these principles, we use the same training batch for both MAR and DM but employ two distinct processes operating at different resolutions. MAR receives randomly masked low-resolution input frames and predicts the corresponding planning signals. Conditioned on these planning signals via cross-attention and the unmasked frames, DM learns to incrementally recover the masked high-resolution frames from noise. Finally, we introduce a progressive training strategy that gradually adjusts the mask ratio together with the data pipeline, allowing our model to be trained from scratch on unlabeled video data. This eliminates the common reliance on text-to-image and text-to-video pre-training, as seen in other video diffusion models (Girdhar et al., 2023; Blattmann et al., 2023a).

Our model integrates MAR-based planning signals with a DiT-based (Peebles and Xie, 2023; Chen et al., 2024c) lightweight, tiny diffusion model, hence the name MarDini. Our empirical study on MarDini highlights the following key characteristics:

• Flexibility. With MAR conditioning, MarDini naturally supports a range of video generation tasks through flexible masking strategies. For example, when given the first frame and masking the rest, it performs image-to-video generation; when given a video and masking subsequent frames, it performs video expansion; and, when given the first and last frames and masking the middle frames, it performs video interpolation. By hierarchically and auto-regressively masking middle frames across multiple inferences, MarDini generates slow-motion videos.

• Scalability. MarDini can be trained from scratch at scale, without relying on generative image-based pre-training. In contrast to most video generation models, which treat video as a secondary task following image generation, MarDini leverages mask ratio tuning to progressively adjust the difficulty of the training task. This approach enables the model to scale from video interpolation to full video generation, directly bypassing the need for image-based pre-training.

• Efficiency. MarDini’s asymmetric design allocates more computational resources to lower resolutions, making it memory-efficient and fast during inference. With lower overall memory usage, MarDini allows the deployment of computationally intensive spatio-temporal attention mechanisms at scale, improving its ability to model complex motion dynamics.
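As an illustration of the masking strategies described under Flexibility, the following minimal sketch builds a per-frame mask for each task. The function name `frame_mask` and the task labels are hypothetical, and the choice of keeping exactly the first half of the frames for video expansion is an assumption made for the example rather than a detail stated in the text.

```python
def frame_mask(num_frames, task):
    """Return one boolean per frame; True marks a frame to be generated (masked).

    The task names and the 'keep the first half' rule for expansion are
    illustrative assumptions, not part of the described model.
    """
    if task == "interpolation":
        # Keep the first and last frames, generate everything in between.
        return [0 < i < num_frames - 1 for i in range(num_frames)]
    if task == "image_to_video":
        # Keep only the first frame, generate from the second frame onward.
        return [i > 0 for i in range(num_frames)]
    if task == "expansion":
        # Keep the first half of the clip, generate the remaining frames.
        half = num_frames // 2
        return [i >= half for i in range(num_frames)]
    raise ValueError(f"unknown task: {task}")


for task in ("interpolation", "image_to_video", "expansion"):
    print(task, frame_mask(6, task))
```

Because the three tasks differ only in which positions are masked, a single model trained with random masks can, in principle, serve all of them at inference time.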

2 MarDini: An Efficient and Asymmetric Video Diffusion Model

2.1 Design Overview

MarDini is a video generation model designed to efficiently generate high-resolution videos using an asymmetric network architecture. As shown in Figure 1, MarDini consists of two networks: a heavy-weight MAR planning model and a light-weight generation DM. During training, the planning network processes randomly masked low-resolution frames and predicts corresponding planning signals. These planning signals compress the semantic and long-range temporal information, guiding the DM’s high-resolution generation process. The DM receives noisy frames at the masked positions and reconstructs them by progressively removing noise.

In this section, we outline and address the key design challenges involved in training MarDini. First, we describe the data representations and their corresponding notations within the MarDini framework (Section 2.2). Next, we describe the design details of the MAR planning network and the DM, along with the integration of additional guidance such as diffusion steps and planning signals (Section 2.3). Finally, we outline the multi-stage training recipe for MarDini, which we found to be essential for ensuring stable training (Section 2.4). Collectively, these innovations enable MarDini to become one of the first video generation models capable of being trained from scratch using only unlabelled video data.

Figure 1: MarDini Training Pipeline Overview. A latent representation is computed for unmasked frames that serve as a conditional signal to a generative process. On the one hand, we have a planning model that autoregressively encodes global conditioning signals from a low-resolution version of the unmasked latent inputs. On the other hand, the planning signals are fed to the diffusion-based generation model through cross-attention layers. A high-resolution version of the input conditions is also ingested by the diffusion model, enabling generation with a coherent temporal structure and a direct mechanism to attend to fine-grained details of the unmasked frames. MarDini is trained end-to-end via masked frame-level diffusion loss.

2.2 Data Representation and Notations

VAE Compressor.

Consistent with prior works (Dai et al., 2023a; Girdhar et al., 2023), we adopt a pre-trained Variational Auto-Encoder (VAE) (Kingma and Welling, 2014), denoted by $\mathcal{D}_{\text{enc}}$, to compress videos into a low-dimensional continuous latent space, which improves both training and inference efficiency. Our VAE employs a 16-channel latent dimension with an $8\times$ spatial compression rate to preserve spatial details, following Dai et al. (2023a). The VAE outputs are then patchified into a shape of $N\times C$, where $N$ represents the token count and $C=16$ represents its latent dimension.
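To make the token count $N$ concrete, the sketch below computes the number of tokens per frame from the frame size. The $8\times$ spatial compression is taken from the text; the default `patch_size=1` is an assumption chosen to be consistent with $C=16$ (each token carrying exactly one 16-channel latent position), and the example resolutions are illustrative.

```python
def latent_token_count(height, width, vae_stride=8, patch_size=1):
    """Tokens per frame after VAE compression and patchification.

    vae_stride=8 follows the 8x spatial compression stated in the text.
    patch_size=1 is an assumption consistent with C = 16 channels per token.
    """
    latent_h = height // vae_stride
    latent_w = width // vae_stride
    return (latent_h // patch_size) * (latent_w // patch_size)


# Illustrative resolutions: doubling the side length quadruples the token count.
print(latent_token_count(256, 256))  # 1024
print(latent_token_count(512, 512))  # 4096
```

The quadratic growth of $N$ with resolution is what makes the low-resolution planning path attractive for attention-heavy computation.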

MAR Planning Model.

Given a low-resolution input video $\mathbf{X}_{\text{low}}=\{x^{\text{low}}_{i}\}_{i=1:K}$ with $K$ frames, we apply the VAE encoder to compress the frames into their corresponding latent representations: $\mathbf{Z}_{\text{low}}=\{z^{\text{low}}_{i}\}_{i=1:K}=\mathcal{D}_{\text{enc}}(\mathbf{X}_{\text{low}})$. To train the MAR planning model $\mathcal{P}$, we randomly select $K^{\prime}<K$ video latents $\{z^{\text{low}}_{j}\}_{j=1:K^{\prime}}\subset\mathbf{Z}_{\text{low}}$ and replace them with a learnable mask token [MASK], resulting in the final masked low-resolution latent inputs $\mathbf{Z}_{\text{low}}^{\text{mask}}$. The planning model then processes $\mathbf{Z}_{\text{low}}^{\text{mask}}$ and predicts $\mathbf{Z}_{\text{cond}}=\mathcal{P}(\mathbf{Z}_{\text{low}}^{\text{mask}})=\{z^{\text{cond}}_{i}\}_{i=1:K}$, where $z^{\text{cond}}_{i}$ is the planning signal for the $i$-th frame, shaped as $N_{\text{low}}\times C_{\text{low}}$, with $N_{\text{low}}$ representing the number of patches per frame.

DM Generation Model.

Conversely, we obtain high-resolution video latents $\mathbf{Z}_{\text{high}}=\{z^{\text{high}}_{i}\}_{i=1:K}=\mathcal{D}_{\text{enc}}(\mathbf{X}_{\text{high}})$ with dimensions $N_{\text{high}}\times C_{\text{high}}$, generated by the VAE encoder using the same video inputs at high resolution: $\mathbf{X}_{\text{high}}=\{x^{\text{high}}_{i}\}_{i=1:K}$. Notably, we have $N_{\text{high}}\gg N_{\text{low}}$. At diffusion step $t$, we sample noise and add it to the $K^{\prime}$ frames that were masked in the planning model (denoted by [NOISE]), leaving the remaining $K-K^{\prime}$ reference frames unchanged (denoted by [REF]). This produces the final noisy high-resolution video latent inputs $\mathbf{Z}_{\text{high}}^{\text{noise},t}$. Then, the generation model $\mathcal{G}$ processes these latent inputs $\mathbf{Z}_{\text{high}}^{\text{noise},t}$ and performs a standard denoising step, where we denote the DM output at time step $t$ as $\mathcal{G}(\mathbf{Z}_{\text{high}}^{\text{noise},t},\mathbf{Z}_{\text{cond}},t)$.
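The construction of the two inputs can be sketched as follows, with each latent frame represented as a flat list of floats. This is a toy illustration under stated assumptions: the helper name `build_inputs` is hypothetical, the learnable [MASK] token is stood in for by a string placeholder, and the noising rule $z_t=\alpha_t z+\sigma_t\epsilon$ is the standard parameterisation assumed here, not a formula quoted from this section.

```python
import random

MASK = "[MASK]"  # placeholder for the learnable mask token (assumption)


def build_inputs(z_low, z_high, masked_idx, alpha_t, sigma_t, rng):
    """Build the masked low-res input and the noisy high-res input.

    The same frame indices are masked for the planning model and noised
    for the generation model; all other frames stay as clean references.
    """
    masked = set(masked_idx)
    # Planning input: masked frames are replaced by the [MASK] placeholder.
    z_low_mask = [MASK if i in masked else frame for i, frame in enumerate(z_low)]
    # Generation input: masked frames become [NOISE], the rest remain [REF].
    z_high_noise = []
    for i, frame in enumerate(z_high):
        if i in masked:
            noisy = [alpha_t * v + sigma_t * rng.gauss(0.0, 1.0) for v in frame]
            z_high_noise.append(noisy)
        else:
            z_high_noise.append(list(frame))
    return z_low_mask, z_high_noise


rng = random.Random(0)
low = [[1.0], [2.0], [3.0]]
high = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
low_mask, high_noise = build_inputs(low, high, [1], alpha_t=0.8, sigma_t=0.6, rng=rng)
print(low_mask)
print(high_noise)
```

Sharing one set of masked indices between both networks is what ties the planning signal for frame $i$ to the frame the generation model is asked to denoise.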

2.3 Architecture Design

In this section, we provide a comprehensive explanation of the MarDini architecture, including its detailed design, model configurations, and variations.

2.3.1 MarDini Block Design

Figure 2 illustrates the design of MarDini’s MAR and DM models, both of which are based on the transformer architecture (Vaswani et al., 2017).

In the MAR planning model, we adhere to the design conventions established in Llama models (Dubey et al., 2024; Touvron et al., 2023), which apply RMS-Norm (Zhang and Sennrich, 2019) to normalize the inputs of each attention block. Additionally, layer normalization (Ba et al., 2016) is applied to normalize the projected features in multi-head attention, enhancing training stability. Due to the use of low-resolution inputs, we manage to directly employ spatio-temporal attention, allowing tokens to attend across frames. This design is feasible only with asymmetric resolution inputs, as it prevents excessive memory consumption.

Concretely, within each attention block in MAR, we utilize rotary positional encoding (RoPE) (Su et al., 2024) to encode both the spatial and temporal positions of the video tokens. To accomplish this, we apply a 2D RoPE to encode the 3-dimensional video data. Specifically, we flatten the image patches into a 1-dimensional token sequence and insert a learnable [NEXT] token to differentiate image patches across different rows, following Gao et al. (2024). This design effectively handles video data with varying aspect ratios and resolutions.
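The flattening step can be illustrated with a short sketch. The placement of the [NEXT] token at the end of every row is an assumption made for this example; the text only states that the token differentiates patches across rows. The function name `flatten_with_next` is hypothetical.

```python
NEXT = "[NEXT]"  # placeholder for the learnable row-separator token


def flatten_with_next(grid):
    """Flatten a 2D grid of patch tokens into a 1D sequence.

    A [NEXT] token is appended after each row (assumed placement), so the
    row structure can be recovered for any width or aspect ratio.
    """
    tokens = []
    for row in grid:
        tokens.extend(row)
        tokens.append(NEXT)
    return tokens


# A 2x3 frame and a 3x2 frame yield different sequences, although both hold 6 patches.
wide = [["p00", "p01", "p02"], ["p10", "p11", "p12"]]
tall = [["p00", "p01"], ["p10", "p11"], ["p20", "p21"]]
print(flatten_with_next(wide))
print(flatten_with_next(tall))
```

With this layout a frame of $h\times w$ patches becomes a sequence of $h(w+1)$ tokens, and the separator positions alone encode the aspect ratio.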

We design the DM model in alignment with MAR, but with three key differences. First, we adopt a DiT-style approach (Peebles and Xie, 2023), using AdaIN (Huang and Belongie, 2017) to integrate the diffusion step as a conditional signal within the spatial attention layers and, additionally, MAR’s planning signal within the MLP layers. Second, we introduce a cross-attention layer to process the planning features predicted by the MAR model. Lastly, we replace spatio-temporal attention with temporal attention (Blattmann et al., 2023b) to reduce the computational cost associated with high-resolution inputs in DM.

2.3.2 Identity Attention

In our initial experiments, we observed significant training instability in MarDini’s DM. We speculate that this is due to two main factors: i) the inherent distributional disparity between noisy ([NOISE]) tokens and clean reference ([REF]) tokens, which is further amplified by the stochastic nature of sampling diffusion steps; and ii) the random positions and varying lengths of these [NOISE] tokens. These factors likely compound, potentially disrupting the DM’s training signals and hindering the model’s ability to converge efficiently.

To address this challenge, we introduce Identity Attention, which enables the model to easily distinguish between [REF] and [NOISE] tokens by employing a separate attention strategy. As illustrated in Figure 3, [REF] tokens simply serve as an identity projection, preserving the input reference frames without attending to other tokens. In contrast, [NOISE] tokens possess a global view, attending to tokens across all frames. The [REF] tokens serve as guidance for generation, so we design them to be isolated from other tokens, while [NOISE] tokens provide global attention to all conditional signals for generation. We incorporate Identity Attention in both the spatio-temporal layers of MAR and the temporal layers of DM, which has been found to significantly enhance training stability in both models.
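One way to picture Identity Attention is as a boolean attention mask, sketched below. Expressing the mechanism as a mask is an assumption made for illustration: a [REF] query that may attend only to itself receives a softmax weight of 1 on its own value, which matches the described behaviour of not attending to other tokens. The function name is hypothetical.

```python
def identity_attention_mask(is_noise_frame, tokens_per_frame):
    """Build mask[q][k]; True means query token q may attend to key token k.

    [REF] tokens attend only to themselves, [NOISE] tokens attend to every
    token in every frame. The mask formulation is an illustrative assumption.
    """
    # Expand the per-frame flags to per-token flags.
    is_noise = [flag for flag in is_noise_frame for _ in range(tokens_per_frame)]
    n = len(is_noise)
    mask = []
    for q in range(n):
        if is_noise[q]:
            mask.append([True] * n)  # global view over all frames
        else:
            mask.append([k == q for k in range(n)])  # isolated reference token
    return mask


# Frame 0 is a reference frame, frame 1 is a noisy frame; 2 tokens per frame.
for row in identity_attention_mask([False, True], tokens_per_frame=2):
    print(row)
```

In this picture the rows belonging to [REF] tokens are diagonal, so their representations cannot be contaminated by noisy tokens, while [NOISE] rows are fully dense.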

2.3.3 Model Configuration

As outlined in Table 1, this study develops four models with distinct configurations. We train two planning models with 3.1B and 1.3B parameters alongside two generation models, employing spatio-temporal or temporal attention mechanisms. To align with our asymmetric design between the planning and generation models, the generation model’s parameter size is reduced to $3\times$ or $10\times$ smaller than that of the planning model. Due to the high computational cost of spatio-temporal attention, we limit MarDini-L/ST and MarDini-S/ST to a 9-frame length for fair comparison on VIDIM-Bench (Jain et al., 2024). Importantly, the model’s ability to autoregressively generate samples ensures that the length of the output video is not constrained.

Configuration	MAR Depth	MAR Hidden Size	MAR MLP Size	MAR Attn.	MAR Param.	DM Depth	DM Hidden Size	DM MLP Size	DM Attn.	DM Param.	Frames
MarDini-S/ST	8	4096	4096	S.-T. Attn.	1.3B	8	1024	4096	S.-T. Attn.	288M	9
MarDini-L/ST	16	4096	8192	S.-T. Attn.	3.1B	8	1024	4096	S.-T. Attn.	288M	9
MarDini-S/T	8	4096	4096	S.-T. Attn.	1.3B	8	1024	4096	T. Attn.	288M	17
MarDini-L/T	16	4096	8192	S.-T. Attn.	3.1B	8	1024	4096	T. Attn.	288M	17

Table 1: Configuration Details of MarDini Models. We provide four models, differing primarily in the size of the planning module (3.1B vs. 1.3B parameters) and the attention mechanisms used in the generation module: spatio-temporal attention (S.-T. Attn.) vs. temporal attention (T. Attn.).

2.4 MarDini Training Recipes

In this section, we outline the training pipeline of MarDini. Specifically, we employ a multi-stage progressive training strategy that gradually increases task difficulty. This approach offers two key benefits: i) progressive learning inherently enhances training stability and improves the performance of generative models, as demonstrated by Karras (2018) and Chen et al. (2024b); and ii) it allows for the collection of checkpoints from earlier stages, which helps mitigate setbacks caused by suboptimal configurations. Below, we elaborate on our progressive training strategy, including the training objectives, architecture design, and training data configurations. A comprehensive training manual for MarDini is shown in Figure 4, with detailed hyper-parameters and optimization methods further outlined in Appendix 8.

2.4.1 Training Tasks: From Frame Interpolation to Video Generation

Our training objectives are organized into three stages: i) Initial Stage: We separately train the planning and generation models, each with its own learning objective, to initialize their model weights. ii) Joint-Model Stage: We combine the models for joint training on a simple video interpolation task, using only a masked diffusion loss. iii) Joint-Task Stage: We further train the model by gradually reducing the number of preserved reference frames, enabling it to jointly learn video interpolation and image-to-video generation tasks.

Initial Stage.

Wang et al. (2024a) pointed out that transformers with a large parameter count often experience unstable training. As such, we simplify the training dynamics by separately warming up the two models as an initial step.

To optimize generation model $\mathcal{G}$ , we employ a masked diffusion loss $\mathcal{L}_{\text{DM}}$ :

$$\mathcal{L}_{\text{DM}}^{\theta}=\left\|\mathbf{M}\cdot\mathbf{V}^{t}-\mathbf{M}\cdot\mathcal{G}_{\theta}(\mathbf{Z}_{\text{high}}^{\text{noise},t},\mathbf{Z}_{\text{uncond}},t)\right\|_{2}^{2}, \tag{1}$$

where $\mathbf{Z}_{\text{uncond}}$ is a learnable token serving as unconditional guidance from the planning model, $\theta$ represents the parameters of the generation model, and $\mathbf{M}$ denotes the binary masks used to mask out all clean reference frames. Inspired by Blattmann et al. (2023b); Salimans and Ho (2022), we apply velocity prediction as the diffusion loss, where the prediction target $\mathbf{V}^{t}=\{v_{i}^{t}\}_{i=1:K}$ represents the velocity at time step $t$ for the $i$-th frame, defined as $v_{i}^{t}=\alpha_{t}\epsilon-\sigma_{t}z^{\text{high}}_{i}$, with $\epsilon\sim\mathcal{N}(0,I)$. Here, $\alpha_{t}$ and $\sigma_{t}$ correspond to the diffusion scheduler at step $t$.
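A small numeric sketch of the masked velocity objective may help. It computes the target $v_{i}^{t}=\alpha_{t}\epsilon-\sigma_{t}z^{\text{high}}_{i}$ and sums squared errors only over frames where the mask is 1 (the noised frames). Frames are flat lists of floats, the helper names are hypothetical, and the values of $\alpha_t$ and $\sigma_t$ are arbitrary illustrative numbers rather than the scheduler used in training.

```python
def velocity_target(z, eps, alpha_t, sigma_t):
    """Per-element velocity target v = alpha_t * eps - sigma_t * z."""
    return [alpha_t * e - sigma_t * x for x, e in zip(z, eps)]


def masked_squared_error(pred, target, mask):
    """Sum of squared errors over frames whose mask entry is 1.

    Reference frames (mask 0) contribute nothing, mirroring the role of M.
    """
    total = 0.0
    for pred_frame, target_frame, m in zip(pred, target, mask):
        if m:
            total += sum((p - t) ** 2 for p, t in zip(pred_frame, target_frame))
    return total


# Two frames: frame 0 is noised (mask 1), frame 1 is a clean reference (mask 0).
targets = [velocity_target([1.0], [0.0], alpha_t=0.6, sigma_t=0.8), [0.0]]
predictions = [[0.0], [5.0]]  # the large error on the reference frame is ignored
print(masked_squared_error(predictions, targets, mask=[1, 0]))
```

Only the error on the noised frame enters the loss, so the model is never penalised for what it outputs at reference positions.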

To optimize MAR planning model $\mathcal{P}$ , we employ a masked reconstruction loss $\mathcal{L}_{\text{MAR}}$ :

$$\mathcal{L}^{\phi,\zeta}_{\text{MAR}}=\left\|\mathbf{M}\cdot\mathbf{Z}_{\text{low}}-\mathbf{M}\cdot f_{\zeta}(\mathcal{P}_{\phi}(\mathbf{Z}_{\text{low}}^{\text{mask}}))\right\|_{2}^{2}, \tag{2}$$

where $f$ denotes a projection layer that depatchifies the model predictions to match the shape of the low-resolution input latents $\mathbf{Z}_{\text{low}}$. $\phi$ and $\zeta$ represent the learnable parameters of the planning model and the projection layer, respectively. Note that $f$ is used only during the initial training stage and is removed in later training stages.

Joint-Model Stage.

After the initial pre-training stage, we then jointly train the planning and generation models end-to-end using a unified masked diffusion learning objective $\mathcal{L}_{\text{MDiff}}$ :

$$\mathcal{L}_{\text{MDiff}}^{\theta,\phi}=\left\|\mathbf{M}\cdot\mathbf{V}^{t}-\mathbf{M}\cdot\mathcal{G}_{\theta}(\mathbf{Z}_{\text{high}}^{\text{noise},t},\mathcal{P}_{\phi}(\mathbf{Z}_{\text{low}}^{\text{mask}}),t)\right\|_{2}^{2}, \tag{3}$$

where $\mathbf{Z}_{\text{cond}}=\mathcal{P}_{\phi}(\mathbf{Z}_{\text{low}}^{\text{mask}})$ is the planning signal predicted by MAR. To enable classifier-free guidance (Ho and Salimans, 2022) on the planning signal, we randomly replace $\mathbf{Z}_{\text{cond}}$ with $\mathbf{Z}_{\text{uncond}}$ with a fixed probability of $1/10$.
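The condition-dropping step used for classifier-free guidance can be sketched in a few lines. The probability of 1/10 comes from the text; the function name and the string stand-ins for the two signals are hypothetical.

```python
import random


def maybe_drop_condition(z_cond, z_uncond, rng, p_uncond=0.1):
    """Return z_uncond with probability p_uncond, otherwise z_cond.

    p_uncond=0.1 matches the fixed 1/10 replacement probability in the text.
    """
    return z_uncond if rng.random() < p_uncond else z_cond


rng = random.Random(0)
draws = [maybe_drop_condition("cond", "uncond", rng) for _ in range(10000)]
print(draws.count("uncond") / len(draws))  # close to 0.1
```

Training on both conditioned and unconditioned inputs is what later allows the two predictions to be combined at inference time.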

Joint-Task Stage.

In the final training stage, we reuse the learning objective from the previous stage, but gradually decrease the masking ratio to induce more challenging generation tasks. Here, mask ratio refers to the proportion of frames preserved during training. This stage requires significantly more computational resources and higher-resolution videos, as it determines the model’s final performance. By gradually decreasing the masking ratio, we smoothly transform the model’s task from video interpolation to single-image-to-video generation. This procedure ultimately enables the model to generate videos with a variable number of input frames at arbitrary temporal locations.
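The gradual reduction of preserved frames could be scheduled as in the sketch below. The linear shape of the schedule, the function name, and the starting count of 8 reference frames are all assumptions made for illustration; the text states only that the ratio decreases gradually until a single reference frame remains.

```python
def num_reference_frames(step, total_steps, start_refs, end_refs=1):
    """Number of preserved (reference) frames at a given training step.

    Interpolates linearly from start_refs down to end_refs. The linear
    shape is an illustrative assumption, not the schedule from the paper.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    refs = start_refs + (end_refs - start_refs) * frac
    return max(end_refs, round(refs))


# Over training, the task drifts from interpolation toward image-to-video.
for step in (0, 250, 500, 750, 1000):
    print(step, num_reference_frames(step, total_steps=1000, start_refs=8))
```

Dividing the returned count by the clip length gives the mask ratio as defined above, namely the proportion of frames preserved.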

2.4.2 DM Architecture: From Spatio-Temporal to Temporal Attention

In conjunction with our progressive training objectives, we also introduce a progressive architectural design. Specifically, we first use spatio-temporal attention in the DM during the initial training stage. This choice promotes convergence, compared to temporal attention, as noted in Gao et al. (2024). Since in our initial stage we train the DM in isolation and on a relatively low-resolution setup, this sophisticated attention incurs only minor computational overhead. When integrating MAR with the DM in the second stage, we replace the spatio-temporal attention with the more cost-effective temporal attention, thus increasing the efficiency of the generation model.
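The cost gap between the two attention types can be estimated by counting query-key pairs. For $K$ frames with $N$ tokens each, full spatio-temporal attention compares all $KN$ tokens with each other, whereas temporal attention compares only the $K$ tokens that share a spatial position. The sketch below is a back-of-the-envelope count under that standard definition; the function name and the example sizes are illustrative.

```python
def attention_pairs(num_frames, tokens_per_frame, kind):
    """Count query-key pairs evaluated by one attention layer.

    spatio_temporal: every token attends to every token, (K*N)^2 pairs.
    temporal: each spatial position attends across frames, N * K^2 pairs.
    """
    if kind == "spatio_temporal":
        n = num_frames * tokens_per_frame
        return n * n
    if kind == "temporal":
        return tokens_per_frame * num_frames * num_frames
    raise ValueError(f"unknown attention kind: {kind}")


frames, tokens = 17, 1024  # illustrative sizes
full = attention_pairs(frames, tokens, "spatio_temporal")
temporal = attention_pairs(frames, tokens, "temporal")
print(full // temporal)  # the ratio equals tokens_per_frame
```

The ratio between the two counts is exactly $N$, so the saving from temporal attention grows with spatial resolution, which is why the switch matters most for the high-resolution generation model.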

2.4.3 Data: Progressive Configuration of Specifications

Analogous to our progressive strategies for the training objective and architecture, we also propose a progressive data configuration. Over time, we gradually increase the video’s spatial resolution while progressively extending its duration. This approach ensures efficient use of computational resources and facilitates effective model scaling, allowing MarDini to handle more complex and high-resolution video data as training progresses.

3 Experiments

We evaluate MarDini on two benchmarks: VIDIM-Bench (Jain et al., 2024) for long-term video interpolation, and VBench (Huang et al., 2024) for image-to-video generation. We further elaborate on the specifics of these benchmarks in Appendix 10. We highly encourage viewing the generated videos on our web page for a comprehensive understanding of their quality.

3.1 Ablation Studies and Analysis

Effectiveness of MAR and DM.

We demonstrate the importance of having a DM on top of our MAR planning model. It is tempting to hypothesize that MAR on its own contains all the ingredients to enable high-quality video interpolation. To explore this, we introduce a projection layer to directly unpatchify the output of the MAR model without intermediate diffusion. Our experiments on VIDIM-Bench reveal that MAR on its own performs poorly on interpolation tasks, as shown by the first two and last two rows in Table 3.1, for both the 1B and 3B settings. This suggests that directly applying MAR to continuous space is suboptimal, consistent with previous findings (Li et al., 2024). Similarly, directly tackling this task with a small DM without global guidance results in sub-optimal performance, according to the third row of Table 3.1. However, by combining MAR’s planning capability with DM’s stable performance in continuous space, we achieve optimal results, demonstrating that both components are beneficial for video generation.

Method	DAVIS-7 MidF-SSIM	DAVIS-7 MidF-LPIPS	DAVIS-7 FID	DAVIS-7 FVD	UCF101-7 MidF-SSIM	UCF101-7 MidF-LPIPS	UCF101-7 FID	UCF101-7 FVD
AMT (Li et al., 2023b)	0.4853	0.2865	34.65	234.50	0.7903	0.1691	31.60	344.50
RIFE (Huang et al., 2022)	0.4546	0.2954	23.98	240.04	0.7769	0.1564	18.72	323.80
FILM (Reda et al., 2022)	0.4718	0.3048	30.16	214.80	0.7869	0.1620	26.06	328.20
LDMVFI (Danier et al., 2024)	0.4175	0.2765	22.10	245.02	0.7712	0.1564	18.09	316.30
VIDIM (Jain et al., 2024)	0.4221	0.2986	28.06	199.32	0.6880	0.1768	34.48	278.00
MarDini-S/ST-256	0.4249	0.3654	49.21	224.07	0.7654	0.2480	45.85	258.08
MarDini-L/ST-256	0.4959	0.2768	20.64	102.87	0.7734	0.2213	28.85	197.69
MarDini-S/ST-512	0.5017	0.3193	25.92	138.86	0.7960	0.2315	30.24	205.71
MarDini-L/ST-512	0.5314	0.2736	20.76	99.05	0.7814	0.2347	30.08	204.20
MarDini-L/T-512	0.5085	0.3083	25.30	117.13	0.7893	0.2270	30.72	198.94

[Figure: qualitative video interpolation comparison. Columns: Reference Frames (First, Last); Generated Frames (Middle) for FILM, LDMVFI, VIDIM, and Ours; Ground-Truth.]
