
ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation

Shenghai Yuan1, Jinfa Huang3, Yongqi Xu1, Yaoyang Liu1, Shaofeng Zhang4,
Yujun Shi5, Ruijie Zhu6, Xinhua Cheng1,2, Jiebo Luo3, Li Yuan1,2,†

Project: https://github.com/PKU-YuanGroup/ChronoMagic-Bench
1 Peking University, Shenzhen Graduate School, 2 Rabbitpre Intelligence,
3 University of Rochester, 4 Shanghai Jiao Tong University,
5 National University of Singapore, 6 University of California Santa Cruz

{2401212886,chengxinhua}@stu.pku.edu.cn, shi.yujun@u.nus.edu, rzhu48@ucsc.edu
{yaoyangliu319, yongqixuing}@gmail.com, sherrylone@sjtu.edu.cn
yuanli-ece@pku.edu.cn, {jhuang90@ur,jluo@cs}.rochester.edu
Abstract

We propose a novel text-to-video (T2V) generation benchmark, ChronoMagic-Bench ("Chrono" is derived from the Greek word "chronos", which means "time"), to evaluate the temporal and metamorphic capabilities of T2V models (e.g., Sora [8] and Lumiere [3]) in time-lapse video generation. In contrast to existing benchmarks that focus on the visual quality and textual relevance of generated videos, ChronoMagic-Bench focuses on the models' ability to generate time-lapse videos with significant metamorphic amplitude and temporal coherence. The benchmark probes T2V models for their physics, biology, and chemistry capabilities through free-form text queries. For these purposes, ChronoMagic-Bench introduces 1,649 prompts and real-world videos as references, categorized into four major types of time-lapse videos: biological, human-created, meteorological, and physical phenomena, which are further divided into 75 subcategories. This categorization ensures a comprehensive evaluation of the models' capacity to handle diverse and complex transformations. To accurately align human preference with the benchmark, we introduce two new automatic metrics, MTScore and CHScore, to evaluate the videos' metamorphic attributes and temporal coherence. MTScore measures the metamorphic amplitude, reflecting the degree of change over time, while CHScore assesses the temporal coherence, checking that the generated videos maintain logical progression and continuity. Based on ChronoMagic-Bench, we conduct comprehensive manual evaluations of ten representative T2V models, revealing their strengths and weaknesses across different categories of prompts, and providing a thorough evaluation framework that addresses current gaps in video generation research. Moreover, we create a large-scale ChronoMagic-Pro dataset, containing 460k high-quality pairs of 720p time-lapse videos and detailed captions. Each caption ensures high physical pertinence and large metamorphic amplitude, which have a far-reaching impact on the T2V generation community.

1 Introduction

Text-to-video (T2V) generative models [84, 86, 40, 22, 69, 62, 28, 79] have developed rapidly in recent years. As the number of models continues to grow, there is an urgent need for evaluation methods that align with human perception and accurately reflect the specific strengths and weaknesses of each model, thereby enabling the community to adopt architectures that meet its requirements more easily.

However, the current T2V benchmarks [43, 61, 59, 75, 26, 72] primarily assess the capability of generating general videos instead of time-lapse videos, failing to reflect the extent of physical priors encoded by the models. Additionally, the evaluation metrics they use mainly focus on visual quality and textual relevance. From early metrics including FID [24], FVD [64], and CLIPScore [23] to more recent ones such as UMTScore [43], T2VQA [30], and UMT-FVD [43], almost all of them overlook two other crucial aspects of videos: metamorphic amplitude and temporal coherence. These limitations hinder the development of T2V models in generating videos with rich physical content.

Table 1: Comparison of the characteristics of our ChronoMagic-Bench with existing T2V benchmarks. Most of them only assess two dimensions: visual quality and text relevance.
Benchmark Type Visual Quality Text Relevance Metamorphic Amplitude Temporal Coherence
UCF-101 [61] General
Make-a-Video-Eval [59] General
MSR-VTT [75] General
FETV [43] General
VBench [26] General
T2VScore [72] General
ChronoMagic-Bench Time-lapse

Because time-lapse videos have greater metamorphic amplitude and temporal coherence, they contain more physical priors than general videos [79]. Therefore, to address the aforementioned issues, we introduce ChronoMagic-Bench, a benchmark for the metamorphic evaluation of time-lapse text-to-video generation, which provides a comprehensive evaluation system for T2V. We specifically design four major categories for time-lapse videos (biological, human-created, meteorological, and physical) and extend these to 75 subcategories. Based on this, we construct ChronoMagic-Bench, comprising 1,649 prompts and their corresponding reference time-lapse videos. As shown in Table 1, in contrast to existing benchmarks [59, 75, 43, 26, 72, 61], ChronoMagic-Bench emphasizes generating videos with high persistence and strong variation, i.e., metamorphic videos with high physical prior content. Additionally, we develop MTScore for evaluating metamorphic amplitude and CHScore for temporal coherence to address the deficiencies in evaluation metrics and perspectives. With ChronoMagic-Bench, we conduct comprehensive qualitative and quantitative evaluations of almost all open-source T2V models, enabling analysis of their strengths and weaknesses. The results highlight the weaknesses of existing T2V models, including 1) failure by almost all models to generate time-lapse videos with large variations; 2) poor adherence to prompts (thus necessitating multiple inferences to achieve satisfactory results); and 3) flickering even though the visual quality of single frames may be high (indicating poor temporal coherence).

Furthermore, we have meticulously curated the dataset ChronoMagic-Pro to provide the community with the first large-scale T2V dataset specifically designed for time-lapse video generation with higher physical prior content. ChronoMagic-Pro stands out from previous T2V datasets [75, 2, 70, 15, 68] as it comprises time-lapse videos (e.g., ice melting and flowers blooming) characterized by strong physical characteristics, high persistence, and variability. Considering the domain differences between time-lapse videos and general videos, we propose an automatic time-lapse video collection framework to ensure the integrity of video content and improve annotation quality.

In summary, the contributions of this work are as follows:

i) A New T2V Benchmark. We introduce ChronoMagic-Bench to evaluate existing T2V models in terms of visual quality, textual relevance, metamorphic amplitude, and temporal coherence.

ii) New Automatic Metrics. We develop MTScore and CHScore, which align better with human judgment than existing metrics, for assessing metamorphic attributes and temporal coherence.

iii) New Insights for T2V Model Selection. Our evaluations using ChronoMagic-Bench provide crucial insights into the strengths and weaknesses of various open-source and closed-source T2V models.

iv) A Large-Scale Time-lapse Video-Text Dataset. We create ChronoMagic-Pro, a dataset with 460k high-quality 720p time-lapse videos and detailed captions, promoting advances in T2V research.

2 Related Work

Automatic Metrics for Text-to-Video Generation.    Existing benchmarks [26, 31, 73, 34, 54] typically utilize Frechet Inception Distance (FID) [24], Frechet Video Distance (FVD) [64], CLIPScore [23], or their improved versions to assess the visual quality and text relevance of generated videos. For example, FETV [43] enhances FVD and CLIPScore within the UMT [36] feature space, resulting in UMT-FVD and UMTScore. Additionally, the CLIPScore feature extractor can be replaced with BLIP [82] to evaluate the relevance between text and generated content. To the best of our knowledge, existing T2V benchmarks [42, 43, 78, 18, 48] mainly assess these two aspects, with prompts based on general videos. This means that temporal coherence and metamorphic amplitude in videos have been overlooked, leading to the absence of automated metrics that indirectly reflect the physical content encoded by video models. Although some works [43, 26, 42] assess coherence, they rely on either feature-space computation, which is not sufficiently intuitive, or human evaluation, which is expensive. Therefore, we propose the Metamorphic Score (MTScore) and Coherence Score (CHScore) to measure the metamorphic degree and temporal coherence of videos, filling this gap in the field.

Datasets for Text-to-Video Generation.    Large-scale high-quality text-content pair data [9, 60, 49, 25] are essential for training generation models [83, 53, 52, 51, 39, 4, 85, 13, 47, 37, 5, 6, 20, 46, 81, 58]. To enable models to learn better representation spaces that simulate the real world, the larger the dataset and the richer the physical knowledge contained in the videos, the better the training effect. Researchers often construct these large-scale datasets through web scraping. For example, existing video generation models typically use WebVid-10M [2], which contains 10 million videos and captions. Recently released datasets, such as Panda-70M [15], HD-VG-130M [68], and InternVid [70], contain 70 million, 130 million, and 234 million text-video pairs, respectively (see Table 2). Despite their large sizes, these datasets consist of general videos with small metamorphic amplitude and short persistence of change, resulting in limited physical knowledge. Consequently, models trained on these datasets struggle to generate metamorphic videos. To address this issue, we propose the first large-scale dataset of time-lapse videos, comprising 460k 720P video clips and their corresponding captions, which features strong persistence of change and high physical content.

Figure 1: Example of four major categories in ChronoMagic-Bench. These categories comprehensively encompass the physical world, allowing our benchmark and dataset to empower the research community. Due to limited space, only the reference video is shown here without prompts.

3 ChronoMagic-Bench

3.1 Benchmark Construction

Prompt Construction.    To comprehensively test the time-lapse video generation capabilities of existing T2V models, the text prompts need to cover as many metamorphic types as possible, and the corresponding reference videos must be of relatively high quality. Constructing such a set entirely by hand is impractical; therefore, to build a T2V benchmark rich in visual concepts, we first manually created a database of search terms covering diverse and broadly applicable time-lapse videos. We then counted the number of videos obtainable for each search term and filtered the terms by frequency, resulting in a search database containing 75 categories of time-lapse videos. We group these 75 subcategories into four major categories of natural phenomena, as shown in Figure 1. Biological covers all content related to living organisms, such as plant growth, animal activities, and microbial movement. Human-created includes all objects created or influenced by human activities, such as the construction of buildings and urban traffic flow. Meteorological includes all content related to meteorological phenomena, such as cloud movement and storm formation. Physical includes all content related to non-biological physical phenomena, such as water flow and volcanic eruptions. Then, we use the search terms to crawl roughly 20 high-quality videos for each category from video platforms. Finally, we use GPT-4o [1] to accurately caption these 1,649 videos and treat these captions as the text prompts for the benchmark. For more details about benchmark construction, please refer to Appendix E.

Figure 2: Categories of Time-lapse Videos: First, we classify the videos into four major categories (biological, human-created, meteorological, physical), which are further subdivided into 75 subcategories (e.g., animal, parking, beach, melting).

Benchmark Statistics.    We collect a total of 1,649 prompts with corresponding videos and categories. The data distribution is shown in Figure 2, which indicates that the 75 categories have a comparable number of test cases, so the benchmark can accurately reflect the time-lapse video generation capabilities of different models. Each data sample in ChronoMagic-Bench consists of four elements: prompt $p$, reference video $v$, sub-category $c_1$, and major category $c_2$. Since existing T2V models typically use CLIP as the text encoder, which supports a maximum input of 77 tokens, we limit the length of $p$ to within 77 tokens for general applicability, as shown in Figure 3(a). Although the length is limited, the diversity remains rich. The word cloud in Figure 3(b) shows that terms related to time-lapse videos, such as "transitioning," "progressing," "increasing," and "gradually," appear most frequently. These terms highlight ChronoMagic-Bench's focus on large metamorphic amplitude, strong persistence of change, and high physical content. In addition, words from all four major categories are represented, such as biological (seed, butterfly, etc.), human-created (Minecraft, traffic, etc.), meteorological (sunset, tide, etc.), and physical (burning, explosion, etc.). For detailed explanations of the 75 subcategories, please refer to Appendix E.

Figure 3: The word cloud and word count range of the prompts in ChronoMagic-Bench. It shows that prompts mainly describe videos with large metamorphic amplitude and long persistence.

3.2 New Automatic Metrics

As previously mentioned, existing evaluation metrics mainly assess two aspects: visual quality and textual relevance, and the prompts only describe general videos. This indicates a lack of metrics for evaluating the capability to generate time-lapse videos, which not only need to measure the aforementioned two aspects but also need to assess metamorphic amplitude and temporal coherence.

Metamorphic Score.    To the best of our knowledge, there is no existing automated evaluation metric for assessing metamorphic amplitude. A simple way is to use questionnaires or GPT-4o [1], which, although highly effective, is expensive. Another way is to use the open-source model [71], which, although less effective, is much cheaper. To address this, we propose both coarse-grained and fine-grained scores to measure the metamorphic amplitude, aiming to balance cost and performance.

For the coarse-grained score (i.e., MTScore), we first design $N$ retrieval sentences (please refer to Appendix B.1 for more details). We then feed these sentences into a video retrieval model [71], which yields retrieval probabilities for $n$ metamorphic and $m$ general sentences. Let $P_i^{\text{meta}}$ and $P_i^{\text{gen}}$ represent the probabilities for the $i$-th metamorphic and general retrieval sentences, respectively. We then integrate these probabilities to derive a coarse-grained metamorphic score $S_c$:

$$S_c = \frac{\sum_{i=1}^{n} P_i^{\text{meta}}}{\sum_{i=1}^{n} P_i^{\text{meta}} + \sum_{i=1}^{m} P_i^{\text{gen}}} \tag{1}$$
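As a minimal sketch of Equation (1), the snippet below computes the coarse-grained score from two lists of retrieval probabilities. The function name `coarse_mtscore` and the example probabilities are hypothetical and chosen only for illustration; obtaining the probabilities from the retrieval model [71] is assumed to happen elsewhere.

```python
def coarse_mtscore(p_meta, p_gen):
    """Equation (1): share of retrieval probability assigned to metamorphic sentences.

    p_meta: probabilities for the n metamorphic retrieval sentences.
    p_gen:  probabilities for the m general retrieval sentences.
    """
    meta_mass = sum(p_meta)
    gen_mass = sum(p_gen)
    total = meta_mass + gen_mass
    if total == 0:
        # Guard added for this sketch; Equation (1) is undefined in this case.
        raise ValueError("all retrieval probabilities are zero")
    return meta_mass / total


# Hypothetical probabilities for n = 3 metamorphic and m = 2 general sentences.
example_meta = [0.30, 0.25, 0.20]
example_gen = [0.15, 0.10]
example_score = coarse_mtscore(example_meta, example_gen)
```

Because the score is a ratio of probability mass, it lies in [0, 1] and grows as the retrieval model favors the metamorphic sentences over the general ones.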

Due to the strong instruction-following capability and world-understanding ability of GPT-4o, it can partially replace humans. For the fine-grained score (GPT4o-MTScore), we use GPT-4o as the evaluator. Specifically, we set a 5-point evaluation standard, then uniformly sample $T$ frames and input them into GPT-4o [1] to obtain the score. More details are provided in Appendix B.1.
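The uniform frame sampling step can be illustrated with a short index computation. This is only a sketch under assumptions: the helper name `uniform_frame_indices` is hypothetical, and the choice to include both the first and last frame is one reasonable reading of "uniformly sample", not necessarily the exact rule used for GPT4o-MTScore.

```python
def uniform_frame_indices(num_frames, t):
    """Return up to t frame indices spread evenly over [0, num_frames - 1]."""
    if num_frames <= 0 or t <= 0:
        raise ValueError("num_frames and t must be positive")
    if t >= num_frames:
        # Fewer frames than requested samples: use every frame.
        return list(range(num_frames))
    if t == 1:
        # Assumption for this sketch: a single sample is taken from the middle.
        return [num_frames // 2]
    step = (num_frames - 1) / (t - 1)
    return [round(k * step) for k in range(t)]


# Hypothetical 16-frame clip sampled at 4 frames.
example_indices = uniform_frame_indices(16, 4)
```

The selected frames would then be passed to the evaluator together with the 5-point rubric described in Appendix B.1.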

Temporal Coherence Score.    Temporal coherence is crucial for time-lapse videos because they span a large time range. Current benchmarks assess coherence either through questionnaires [43] or by employing methods based on feature space calculations [26, 42]. The former approach is time-consuming, whereas the latter lacks intuitiveness and does not support visualization. Therefore, we develop the Coherence Score (CHScore) based on a video tracking model [27]. More details are provided in Appendix B.2. Specifically, we first process the input video using the pre-trained model with grid size $G$ and threshold $T$ to get the point visibility $p_{\text{vis}}$. Then, we compute the proportion of missing tracking points $m[i]$ in each frame, and the change in missed points between consecutive frames $\Delta m[i]$:

$$m[i] \leftarrow \frac{1}{N}\sum_{j=1}^{N}\left(1 - p_{\text{vis}}[i,j]\right) \tag{2}$$
$$\Delta m[i] \leftarrow m[i+1] - m[i] \tag{3}$$

where $N = G \times G$, $i$ represents the position of the frame, and $j$ identifies different tracking points. Based on these, we then calculate $R_{\text{missed}}$, which represents the average proportion of missed points per frame in the video, and $V_{\text{missed}}$, which measures the variation in missed points between consecutive frames, indicating frame-to-frame coherence:

$$R_{\text{missed}} = \frac{1}{F}\sum_{i=1}^{F}\left(\frac{1}{N}\sum_{j=1}^{N}\left(1 - p_{\text{vis}}[i,j]\right)\right) \tag{4}$$
$$V_{\text{missed}} = \sqrt{\frac{1}{F-1}\sum_{i=1}^{F-1}\left(\Delta m[i] - \overline{\Delta m}\right)^{2}} \tag{5}$$

where $\Delta m[i] = m[i+1] - m[i]$, $\overline{\Delta m}$ is the mean of $\Delta m[i]$, $F$ is the total number of frames, $N$ is the number of points per frame, and $p_{\text{vis}}[i,j]$ indicates the visibility of point $j$ in frame $i$. In addition, we calculate $R_{\text{cut}}$, the ratio of frames that need to be cut to the total number of frames, which reflects the extent of video editing required, and $C_{\text{missed}}$, which accumulates the changes in missed points that exceed the threshold, indicating frequent large-scale instability in point tracking:

$$R_{\text{cut}} = \frac{\left|\{i : \Delta m[i] > T\}\right|}{F} \tag{6}$$
$$C_{\text{missed}} = \sum_{\substack{i=1 \\ \Delta m[i] > T}}^{F-1} \Delta m[i] \tag{7}$$

where $T$ is the threshold for significant missed point variation, and $\left|\{i : \Delta m[i] > T\}\right|$ represents the number of frames with significant missed point variation. Then we calculate $M_{\text{missed}}$, which measures the maximum change in missed points between consecutive frames, reflecting the most severe continuity break in the video, and finally obtain the Coherence Score (CHScore):

$$M_{\text{missed}} = \max(\Delta m) \tag{8}$$
$$S_{\text{CHS}} = \frac{1}{R_{\text{missed}} + V_{\text{missed}} + R_{\text{cut}} + C_{\text{missed}} + M_{\text{missed}}} \tag{9}$$
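Equations (2) to (9) can be traced end to end with a small pure-Python sketch. It assumes the tracking model [27] has already produced a binary visibility matrix; the function name `chscore`, the toy matrices, and the small `eps` floor on the denominator (which avoids division by zero for a perfectly tracked video) are assumptions of this sketch rather than details taken from the paper.

```python
import math


def chscore(p_vis, threshold, eps=1e-8):
    """Coherence score from a visibility matrix.

    p_vis[i][j] is 1 if tracking point j is visible in frame i, else 0.
    threshold is T in Equations (6) and (7).
    """
    num_frames = len(p_vis)
    if num_frames < 2:
        raise ValueError("need at least two frames")

    # Eq. (2): proportion of missed points in each frame.
    m = [sum(1 - v for v in frame) / len(frame) for frame in p_vis]
    # Eq. (3): change in missed proportion between consecutive frames.
    dm = [m[i + 1] - m[i] for i in range(num_frames - 1)]

    # Eq. (4): average proportion of missed points per frame.
    r_missed = sum(m) / num_frames
    # Eq. (5): standard deviation of the frame-to-frame changes.
    dm_mean = sum(dm) / len(dm)
    v_missed = math.sqrt(sum((d - dm_mean) ** 2 for d in dm) / len(dm))

    # Eqs. (6) and (7): changes that exceed the threshold.
    large = [d for d in dm if d > threshold]
    r_cut = len(large) / num_frames
    c_missed = sum(large)

    # Eq. (8): largest single frame-to-frame change.
    m_missed = max(dm)

    # Eq. (9), with a floor on the denominator (assumption of this sketch).
    denom = r_missed + v_missed + r_cut + c_missed + m_missed
    return 1.0 / max(denom, eps)


# Toy 4-frame, 4-point videos: one loses half its points mid-way, one is stable.
flicker = [[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 0, 0], [1, 1, 0, 0]]
stable = [[1, 1, 1, 1] for _ in range(4)]
flicker_score = chscore(flicker, threshold=0.3)
stable_score = chscore(stable, threshold=0.3)
```

In the toy case, the abrupt loss of tracked points raises every term in the denominator, so the flickering clip scores far below the stable one, which matches the intent that a higher CHScore means better temporal coherence.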

3.3 Application Scope

ChronoMagic-Bench proposes automatic scores for measuring metamorphic amplitude and temporal coherence. When combined with existing metrics for visual quality and textual relevance, such as FVD [64], CLIPScore [23], UMT-FVD [43], and UMTScore [43], a comprehensive evaluation of T2V models across four dimensions can be achieved. Additionally, we can use human evaluation to more accurately assess these four dimensions.

Table 2: Comparison of the statistics of our ChronoMagic-Pro with existing T2V datasets.
| Dataset | Category | Video clips | Resolution | Type | Average length | Video duration (h) |
|---|---|---|---|---|---|---|
| MSR-VTT [75] | General | 10K | 240p | Video-Text | 15.0s | 40 |
| WebVid-10M [2] | General | 10M | 360p | Video-Text | 18.7s | 52K |
| InternVid [70] | General | 234M | 720p | Video-Text | 11.9s | 760K |
| Panda-70M [15] | General | 70M | 720p | Video-Text | 8.5s | 166K |
| HD-VG-130M [68] | General | 130M | 720p | Video-Text | 4.9s | 178K |
| Time-Lapse-D [74] | Time-lapse | 2K | 360p | Video | - | - |
| Sky Time-Lapse [77] | Time-lapse | 17K | 1080p | Video | - | - |
| ChronoMagic [79] | Time-lapse | 2K | 720p | Video-Text | 11.4s | 7 |
| ChronoMagic-Pro | Time-lapse | 460K | 720p | Video-Text | 234.1s | 30K |
| ChronoMagic-ProH | Time-lapse | 150K | 720p | Video-Text | 190.2s | 8K |

4 ChronoMagic-Pro

Multi-Aspect Data Curation.    As previously mentioned, existing large-scale text-video datasets primarily consist of general videos with limited physical information content, restricting open-source models [5, 59, 22] to generating only general videos rather than time-lapse videos. To address this, we construct the first large-scale time-lapse video dataset by collecting time-lapse videos based on the search terms outlined in Section 3.1, ultimately obtaining 66,226 original videos. Following the Panda-70M method [15], we split these videos to produce 460K semantically consistent single-scene video clips (ChronoMagic-Pro). We then adopt a video annotation strategy similar to that of MagicTime [79], replacing GPT-4V [1] with the open-source ShareGPT4Video [12] to reduce computational overhead while ensuring high-quality video captions. Additionally, we employ a video retrieval model [71] to filter out low-quality videos, resulting in 150K video clips of higher purity and quality (ChronoMagic-ProH). Verification experiments are shown in Appendix D.3.

Dataset Statistics.    We collect time-lapse videos from 75 manually defined categories, in roughly similar proportions. Some samples can be found in Appendix C.3. Unlike previous datasets, ChronoMagic-Pro is the first high-quality large-scale time-lapse T2V dataset and contains more physical knowledge than general-video datasets, as shown in Table 2. As shown in Figure 4, in terms of duration, more than half (53.3%) of the videos last 0-15 seconds, more than a quarter (27.1%) are longer than 60 seconds, 12.1% are between 15 and 30 seconds, and the remaining videos fall between 30 and 60 seconds. Regarding resolution, 97% are high resolution (720P), 2% are ultra-high resolution (1080P), and the remaining videos have lower resolutions ranging from 360P to 480P. As the number of words accepted by text encoders increases, we require the generated captions to be as detailed as possible, with 95% of captions containing more than 100 words. For the aesthetic score [57], 73% of the videos receive high scores ranging from 4 to 6, 14% exceed 6, and only a small portion score below 3. This indicates that the quality of most videos is high. For the word distribution of the generated captions, please refer to Appendix C.2. Similar to Figure 3, the captions of ChronoMagic-Pro mainly describe changes (gradually, progressing, increasing, etc.) and processes spanning a large amount of time, such as flower blooming and ice melting. In addition, the categories, durations, resolutions, and word count ranges of ChronoMagic-ProH are similar to those of ChronoMagic-Pro; however, ChronoMagic-ProH exhibits superior quality and purity.

Table 3: Quantitative comparison with state-of-the-art T2V generation methods for the text-to-video task in ChronoMagic-Bench. "↓" denotes lower is better. "↑" denotes higher is better.

| Method | Venue | Backbone | UMT-FVD ↓ | UMTScore ↑ | MTScore ↑ | CHScore ↑ | GPT4o-MTScore ↑ |
|---|---|---|---|---|---|---|---|
| ModelScopeT2V [66] | Arxiv'23 | U-Net | 194.77 | 2.909 | 0.401 | 11.03 | 2.86 |
| ZeroScope [62] | CVPR'23 | U-Net | 227.02 | 2.350 | 0.400 | 25.13 | 2.09 |
| T2V-zero [28] | ICCV'23 | U-Net | 209.66 | 2.661 | 0.400 | 1.68 | 2.55 |
| LaVie [69] | Arxiv'23 | U-Net | 166.97 | 2.763 | 0.346 | 8.60 | 2.46 |
| AnimateDiff [22] | ICLR'24 | U-Net | 197.89 | 2.944 | 0.467 | 11.36 | 2.62 |
| VideoCrafter2 [10] | Arxiv'24 | U-Net | 178.45 | 2.753 | 0.433 | 8.27 | 2.68 |
| MCM [80] | Arxiv'24 | U-Net | 202.08 | 2.33 | 0.417 | 14.08 | 3.04 |
| MagicTime [79] | Arxiv'24 | U-Net | 257.56 | 1.916 | 0.478 | 10.66 | 3.13 |
| Latte [45] | Arxiv'24 | DiT | 192.12 | 2.111 | 0.363 | 13.81 | 2.20 |
| OpenSora 1.1 [86] | Github'24 | DiT | 195.43 | 2.678 | 0.444 | 10.03 | 2.52 |
| OpenSora 1.2 [86] | Github'24 | DiT | 166.92 | 2.781 | 0.375 | 4.69 | 2.56 |
| OpenSoraPlan v1.1 [40] | Github'24 | DiT | 188.53 | 2.421 | 0.327 | 10.35 | 2.19 |
Figure 4: Video clips statistics in (Top) ChronoMagic-Pro and (Bottom) ChronoMagic-ProH. The dataset includes a diverse range of categories, durations and caption lengths, with most of the videos at the 720P resolution. ChronoMagic-ProH has higher quality and purity (e.g. Aesthetic Score)

5 Experiments

5.1 Evaluation Models

We select ten open-source T2V models for evaluation, including both relatively advanced U-Net based models and emerging DiT-based models. All inference parameters follow the official implementation. More details about the experiment can be found in Appendix D.

U-Net Based Models.    Including ModelScopeT2V [66], ZeroScope [62], T2V-zero [28], LaVie [69], AnimateDiff [22], MCM [80], VideoCrafter2 [10], and MagicTime [79].

DiT-Based Models.    Including Latte [45], OpenSoraPlan v1.1 [40] and OpenSora 1.1 & 1.2 [86].

5.2 Evaluation Setups

Evaluation Criteria.    We assess video quality primarily from the following four aspects: (a) Visual Quality measures the clarity, color saturation, contrast, and overall aesthetic effect, using UMT-FVD [43], an enhanced version of FVD [64]. (b) Text Relevance measures the correlation between the prompt and the video, using UMTScore [43], an enhanced version of CLIPScore [23]. (c) Metamorphic Amplitude measures the diversity and dynamic changes in the video content, using the proposed Metamorphic Score. (d) Temporal Coherence measures the smoothness and logical sequence of the video content over time, using the proposed Coherence Score. Additionally, we use human evaluation to cross-verify the reliability of the four metrics.

Implementation Details.    For each baseline, we generate corresponding triple results based on the 1,649 prompts contained in the ChronoMagic-Bench, resulting in 4,947 videos for each model. We then use the four automated metrics mentioned above to assess all the generated videos.

5.3 Comprehensive Analysis

Qualitative Evaluation.    We first present and analyze the results from a qualitative perspective. All input texts are from our ChronoMagic-Bench. Unlike existing benchmarks that only assess general videos, our evaluation task focuses on generating metamorphic videos, such as the construction of houses in Minecraft, the blooming of flowers, the baking of bread rolls, and the melting of ice cubes. As shown in Figure 5, almost all U-Net-based and DiT-based models are limited to generating general videos and fail to follow prompts to produce videos with significant motion and temporal spans. The exception is MagicTime [79], whose training data contains time-lapse videos, which underscores the importance of the ChronoMagic-Pro dataset. Since T2V-Zero [28] is a zero-shot video generation model, its coherence is significantly lacking, although its visual quality is acceptable. Additionally, videos generated by AnimateDiff [22] have the best visual quality and text relevance, with high clarity and accurate adherence to the prompt's instructions. Among the emerging DiT-based video models, OpenSora 1.2 [86] stands out as a representative model that matches the performance of U-Net based methods. It is followed by OpenSoraPlan v1.1 [40], OpenSora 1.1 [86], and Latte [45]. However, due to the inherent limitations of OpenSora 1.2's Video VAE and 2+1D DiT architecture, the generated videos are prone to flickering, particularly during significant changes, resulting in the lowest temporal coherence (CHScore) among the DiT-based models in Table 3. All videos generated on ChronoMagic-Bench are available on our homepage.

Refer to caption
Figure 5: Qualitative comparison with different T2V generation methods for the text-to-video task in ChronoMaigc-Bench. Most models cannot follow instructions to generate time-lapse videos.

Qualitative Evaluation.    Next, we present and analyze the results of different T2V models from a qualitative perspective. Due to the lack of metrics, we propose the first MTScore and CHScore to evaluate video motion extent and coherence, with results shown in Table 3. Consistent with Figure 5, MagicTime [79], as the only model capable of generating metamorphic videos, has the highest MTScore and GPT4o-MTScore among all models. The other models, trained only on general videos, produce videos with limited motion range due to the minimal physical knowledge encoded in the models. It can also be seen that the results of the MTScore based on feature space with lower overhead and the GPT4o-MTScore based on question answering with higher overhead are roughly similar, proving the effectiveness of the proposed indicators. Additionally, ZeroScope [62] has limited metamorphic amplitude but the best coherence, while the zero-shot algorithm T2V-Zero [28] has the lowest CHScore. U-Net based and DiT-based algorithms have similar CHScore, but the former shows superior average metamorphic amplitude. For visual quality and text relevance, the emerging OpenSoraPlan v1.1 [40] and OpenSora 1.1 & 1.2 [86] have visual quality comparable to U-Net based methods, but slightly inferior text relevance and temporal coherence. Additionally, OpenSora 1.2 [86] has the lowest UMT-FVD [43] with higher color saturation, but Quantitative Evaluation shows that LaVie [69] has the highest clarity and most accurate color reproduction. Only MagicTime [79] follows the prompt to generate a time-lapse video, but the UMTScore [43] is the lowest. We infer that the UMT-FVD [43] and UMTScore [43] are inconsistent with human perception.

Human Preference.    Finally, we cross-validate the effectiveness of the different metrics through a human study. We randomly select the generated videos corresponding to 16 prompts and invite 171 participants to vote, obtaining manual evaluation results. To enhance user satisfaction, we select only five representative baseline results from which users can choose. Figure 6 shows the correlation between automatic metrics and human perception. It is evident that the three proposed metrics, MTScore, CHScore, and GPT4o-MTScore, are consistent with human perception and can accurately reflect the metamorphic amplitude and temporal coherence of T2V models. Notably, although most models exhibit good coherence, they have low metamorphic amplitude. In other words, they cannot generate videos with significant physical priors, such as seed germination, egg hatching, or sunrise. This is a challenge that T2V models need to overcome in the future. Additionally, as mentioned earlier, UMTScore [43] cannot accurately measure text relevance, especially in evaluating time-lapse videos, where its Kendall and Spearman coefficients are the lowest. We infer that its feature space is not suitable for time-lapse video. Appendix D.6 provides more details about human evaluation.
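To make the rank-correlation comparison concrete, here is a minimal sketch of Kendall and Spearman coefficients in plain Python. The score lists are hypothetical placeholders for five baselines, not values from the paper, and the Kendall variant shown is tau-a (ties count as neither concordant nor discordant), which may differ from the variant used by the authors.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rho: Pearson correlation computed on the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-model values: an automatic metric and human vote counts.
metric_scores = [0.35, 0.40, 0.42, 0.47, 0.48]
human_votes = [10, 14, 12, 30, 41]
print(kendall_tau(metric_scores, human_votes))
print(spearman_rho(metric_scores, human_votes))
```

Higher values of either coefficient mean that the metric orders the models more like the human voters do.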

Figure 6: Alignment between automatic metrics and human perception in terms of visual quality, textual relevance, metamorphic amplitude, and temporal coherence. ð and £ represent Kendall ↑ and Spearman ↑ coefficients, respectively. ↑ denotes higher is better.

5.4 Extended Analysis of Closed-Source Models

In this section, we explore the performance and limitations of closed-source models, specifically the U-Net-based Gen-2 [56] and Pika-2.0 [33] and the DiT-based Dream Machine [44] and KeLing [32]. Given the impracticality of manually testing all 1,649 prompts in ChronoMagic-Bench, we selected two hard prompts from each of the 75 subcategories, resulting in ChronoMagic-Bench-150.
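The subset construction can be sketched as a per-category selection. The text does not state how "hard" prompts were identified, so the numeric `difficulty` field below, like the dictionary keys and the function name, is purely a hypothetical stand-in (the actual selection may well have been manual).

```python
from collections import defaultdict


def select_hard_subset(prompts, per_category=2):
    """Pick the `per_category` hardest prompts from every category.

    `prompts` is a list of dicts with hypothetical keys
    'category', 'text', and 'difficulty' (higher means harder).
    """
    by_category = defaultdict(list)
    for prompt in prompts:
        by_category[prompt["category"]].append(prompt)

    subset = []
    for category in sorted(by_category):
        ranked = sorted(
            by_category[category], key=lambda p: p["difficulty"], reverse=True
        )
        subset.extend(ranked[:per_category])
    return subset


# Toy data: 75 categories with 4 prompts each (the real benchmark has 1,649).
toy_prompts = [
    {"category": f"cat_{c:02d}", "text": f"prompt {c}-{i}", "difficulty": i}
    for c in range(75)
    for i in range(4)
]
subset = select_hard_subset(toy_prompts)
print(len(subset))  # 75 categories x 2 prompts = 150
```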

First, we present and analyze the results from a qualitative perspective, as shown in Figure 7. For metamorphic amplitude, most methods can only generate simple time-lapse videos, such as traffic flow; only Dream Machine [44] can generate a moderately challenging full process of night-to-day transformation; no method can generate complex changes like plant growth or building construction. In terms of temporal coherence, the performance of various closed-source models is comparable, with minor visible differences. Regarding visual quality, the DiT-based methods Dream Machine [44] and KeLing [32] outperform those based on U-Net, producing more realistic plants, more accurately saturated sky colors, and clearer traffic flow. In terms of text relevance, all methods adhere to the prompt’s instructions to generate content relevant to the theme, except for Pika-2.0 [33], which mistakenly interprets day-to-night as night-to-day.

Next, we analyze the results from a quantitative perspective, as shown in Table 4. The results are consistent with Figure 7, with Dream Machine [44] performing better in metamorphic amplitude (MTScore, GPT4o-MTScore) and Pika-2.0 [33] showing the worst text relevance (UMTScore). DiT-based methods outperform U-Net based ones in visual quality. To facilitate comparison under a unified standard, we also test open-source models on ChronoMagic-Bench-150. It is evident that for most models, the MTScore and GPT4o-MTScore are low, and they are unable to generate videos involving complex state changes. Additionally, due to the inherent limitations of UMT-FVD [43] and UMTScore [43], they fail to accurately reflect the differences between open-source and closed-source models. However, the qualitative analysis across all models demonstrates that closed-source models consistently surpass open-source models in visual quality and textual relevance. Furthermore, it is worth noting that the results within the same domain (open/closed) align with human evaluations.

Figure 7: Qualitative comparison with closed-source generation methods for the text-to-video task in ChronoMagic-Bench-150. Most methods can only generate simple time-lapse videos such as traffic flows and starry skies, and are incapable of generating complex changes such as plant growth or building construction.
Table 4: Quantitative comparison with closed-source generation methods for the text-to-video task in ChronoMagic-Bench-150. To facilitate comparison under a unified standard, we also test open-source models. The current UMT-FVD [43] and UMTScore [43] metrics lack accuracy, preventing them from effectively highlighting the significant disparity between open-source and closed-source models. ↓ denotes lower is better. ↑ denotes higher is better.

| Method | Venue | Backbone | Status | UMT-FVD ↓ | UMTScore ↑ | MTScore ↑ | GPT4o-MTScore ↑ |
|---|---|---|---|---|---|---|---|
| Gen-2 [56] | Runway | U-Net | Closed-Source | 218.99 | 2.400 | 0.373 | 2.62 |
| Pika-2.0 [33] | PikaLab | U-Net | Closed-Source | 223.05 | 2.317 | 0.347 | 2.48 |
| Dream Machine [44] | LUMA | DiT | Closed-Source | 214.91 | 2.387 | 0.474 | 3.11 |
| KeLing [32] | Kwai | DiT | Closed-Source | 202.32 | 2.517 | 0.369 | 2.74 |
| ModelScopeT2V [66] | Arxiv’23 | U-Net | Open-Source | 230.74 | 2.783 | 0.409 | 3.01 |
| ZeroScope [62] | CVPR’23 | U-Net | Open-Source | 260.61 | 2.232 | 0.403 | 2.29 |
| T2V-Zero [28] | ICCV’23 | U-Net | Open-Source | 250.22 | 2.559 | 0.399 | 2.62 |
| LaVie [69] | Arxiv’23 | U-Net | Open-Source | 210.39 | 2.714 | 0.350 | 2.50 |
| AnimateDiff [22] | ICLR’24 | U-Net | Open-Source | 239.31 | 2.837 | 0.470 | 2.62 |
| VideoCrafter2 [10] | Arxiv’24 | U-Net | Open-Source | 214.06 | 2.763 | 0.437 | 2.87 |
| MCM [80] | Arxiv’24 | U-Net | Open-Source | 244.49 | 2.282 | 0.422 | 3.06 |
| MagicTime [79] | Arxiv’24 | U-Net | Open-Source | 294.72 | 1.763 | 0.479 | 3.05 |
| Latte [45] | Arxiv’24 | DiT | Open-Source | 232.29 | 2.122 | 0.366 | 2.42 |
| OpenSora 1.1 [86] | Github’24 | DiT | Open-Source | 241.09 | 2.676 | 0.448 | 2.57 |
| OpenSora 1.2 [86] | Github’24 | DiT | Open-Source | 210.93 | 2.681 | 0.383 | 2.50 |
| OpenSoraPlan v1.1 [40] | Github’24 | DiT | Open-Source | 228.70 | 2.459 | 0.331 | 2.21 |
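The closed-source comparisons drawn from Table 4 can be re-derived with a few lines of Python. The sketch below copies only the four closed-source rows from the table; the helper names (`best`, `mean_fvd`) and the tuple layout are illustrative choices, not part of the benchmark code.

```python
# Column positions inside each row tuple.
METHOD, BACKBONE, UMT_FVD, UMT_SCORE, MT_SCORE, GPT4O_MT = range(6)

# Closed-source rows copied from Table 4.
CLOSED_SOURCE = [
    ("Gen-2", "U-Net", 218.99, 2.400, 0.373, 2.62),
    ("Pika-2.0", "U-Net", 223.05, 2.317, 0.347, 2.48),
    ("Dream Machine", "DiT", 214.91, 2.387, 0.474, 3.11),
    ("KeLing", "DiT", 202.32, 2.517, 0.369, 2.74),
]


def best(rows, column, higher_is_better=True):
    """Name of the method with the best (or worst) value in `column`."""
    pick = max if higher_is_better else min
    return pick(rows, key=lambda row: row[column])[METHOD]


def mean_fvd(rows, backbone):
    """Average UMT-FVD over all rows that use the given backbone."""
    values = [row[UMT_FVD] for row in rows if row[BACKBONE] == backbone]
    return sum(values) / len(values)


print(best(CLOSED_SOURCE, MT_SCORE))        # highest MTScore
print(best(CLOSED_SOURCE, GPT4O_MT))        # highest GPT4o-MTScore
print(best(CLOSED_SOURCE, UMT_SCORE, higher_is_better=False))  # lowest UMTScore
print(mean_fvd(CLOSED_SOURCE, "U-Net"), mean_fvd(CLOSED_SOURCE, "DiT"))
```

Because UMT-FVD is a lower-is-better metric, a smaller backbone average indicates better visual quality under this metric.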

5.5 Guideline for T2V Models Selection

With the increasing number of T2V models, the community faces challenges in selecting the most appropriate model, since each model tends to showcase only its best results. To address this issue, we provide a guideline for model selection based on ChronoMagic-Bench, following Table 4: (1) Except for MagicTime [79] and Dream Machine [44], most T2V models exhibit minimal metamorphic amplitude and cannot generate complete processes rich in physical changes, such as seed germination, sunrise, or building construction; (2) The visual quality of a single frame may be high, but when viewed in sequence, flickering often occurs, indicating poor temporal coherence. This issue is particularly evident in T2V-Zero [28] and OpenSora 1.2 [86], whereas closed-source models do not exhibit this problem; (3) The emergence of Sora [8] has promoted the rapid development of DiT-based methods. Closed-source models based on DiT have comprehensively surpassed those based on U-Net. However, the visual quality, text-following capability, and metamorphic amplitude of open-source DiT-based models still lag behind those of U-Net-based methods. We speculate that DiT-based models are more scalable and require more data, giving closed-source models a significant advantage over open-source models; (4) Access to massive data and computing resources is expensive. Researchers with limited resources have two options: first, they can build datasets by crawling videos that are free of copyright disputes; second, adopting the U-DiT architecture may balance performance and cost to a certain extent; (5) Ordinary users who want to try T2V models can prioritize Dream Machine [44] and KeLing [32]. Researchers who wish to conduct in-depth research on T2V can prioritize the study of metamorphic video generation with OpenSoraPlan [40] and OpenSora [86], as neither open-source nor closed-source models can yet achieve this function.

6 Conclusion

In this paper, we present ChronoMagic-Bench, the first benchmark specifically designed to assess the generation of time-lapse videos. It addresses the shortcomings of current benchmarks, which primarily focus on standard videos and overlook critical elements such as metamorphic amplitude and coherence. Additionally, we introduce two new automated metrics, MTScore and CHScore, which align with human perception. Based on ChronoMagic-Bench, we conduct a comprehensive evaluation of nearly all leading open-source text-to-video (T2V) models and provide crucial insights into the strengths and weaknesses of various models. Moreover, we propose ChronoMagic-Pro, the first large-scale time-lapse text-to-video dataset, to facilitate further research by the community.

Limitations and Future Work.    While ChronoMagic-Bench offers a robust evaluation framework, there are two limitations to this work. First, although MTScore and CHScore are introduced to assess metamorphic attributes and temporal coherence, these metrics reflect the quality of different T2V models only in relative terms, and they do not ascertain whether the videos adhere to physical laws. For example, the MTScore of MCM [80] is relatively high, but its videos change in implausible ways. Second, we use the best existing metrics, UMT-FVD and UMTScore, to evaluate visual quality and text relevance, but because their feature space is ill-suited to this task, models in different domains (open-source and closed-source) are not comparable. We will address this issue in the future.

References

  • [1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • [2] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738, 2021.
  • [3] Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Yuanzhen Li, Tomer Michaeli, et al. Lumiere: A space-time diffusion model for video generation. arXiv preprint arXiv:2401.12945, 2024.
  • [4] Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. Multidiffusion: Fusing diffusion paths for controlled image generation. ICML, 2023.
  • [5] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
  • [6] Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In CVPR, pages 22563–22575, 2023.
  • [7] Gary Bradski, Adrian Kaehler, et al. Opencv. Dr. Dobb’s journal of software tools, 3(2), 2000.
  • [8] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. OpenAI, 2024.
  • [9] Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. A short note about kinetics-600. arXiv preprint arXiv:1808.01340, 2018.
  • [10] Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. arXiv preprint arXiv:2401.09047, 2024.
  • [11] Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. Pixart-alpha: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023.
  • [12] Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, et al. Sharegpt4video: Improving video understanding and generation with better captions. arXiv preprint arXiv:2406.04325, 2024.
  • [13] Minghao Chen, Iro Laina, and Andrea Vedaldi. Training-free layout control with cross-attention guidance. In CVPR, pages 5343–5353, 2024.
  • [14] Minghao Chen, Iro Laina, and Andrea Vedaldi. Training-free layout control with cross-attention guidance. In CVPR, pages 5343–5353, 2024.
  • [15] Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. arXiv preprint arXiv:2402.19479, 2024.
  • [16] Laion CoCo. Laion coco: 600m synthetic captions from laion2b-en. Laion CoCo, 2022.
  • [17] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. NeurIPS, 36, 2024.
  • [18] Ming Ding, Wendi Zheng, Wenyi Hong, and Jie Tang. Cogview2: Faster and better text-to-image generation via hierarchical transformers. Advances in Neural Information Processing Systems, 35:16890–16902, 2022.
  • [19] Peng Gao, Le Zhuo, Ziyi Lin, Chris Liu, Junsong Chen, Ruoyi Du, Enze Xie, Xu Luo, Longtian Qiu, Yuhang Zhang, et al. Lumina-t2x: Transforming text into any modality, resolution, and duration via flow-based large diffusion transformers. arXiv preprint arXiv:2405.05945, 2024.
  • [20] Songwei Ge, Seungjun Nah, Guilin Liu, Tyler Poon, Andrew Tao, Bryan Catanzaro, David Jacobs, Jia-Bin Huang, Ming-Yu Liu, and Yogesh Balaji. Preserve your own correlation: A noise prior for video diffusion models. In ICCV, pages 22930–22941, 2023.
  • [21] Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Imagebind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15180–15190, 2023.
  • [22] Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023.
  • [23] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718, 2021.
  • [24] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
  • [25] Ming Hu, Lin Wang, Siyuan Yan, Don Ma, Qingli Ren, Peng Xia, Wei Feng, Peibo Duan, Lie Ju, and Zongyuan Ge. Nurvid: A large expert-level video database for nursing procedure activity understanding. Advances in Neural Information Processing Systems, 36, 2024.
  • [26] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. arXiv preprint arXiv:2311.17982, 2023.
  • [27] Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Cotracker: It is better to track together. arXiv preprint arXiv:2307.07635, 2023.
  • [28] Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. arXiv preprint arXiv:2303.13439, 2023.
  • [29] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [30] Tengchuan Kou, Xiaohong Liu, Zicheng Zhang, Chunyi Li, Haoning Wu, Xiongkuo Min, Guangtao Zhai, and Ning Liu. Subjective-aligned dateset and metric for text-to-video quality assessment. arXiv preprint arXiv:2403.11956, 2024.
  • [31] Tengchuan Kou, Xiaohong Liu, Zicheng Zhang, Chunyi Li, Haoning Wu, Xiongkuo Min, Guangtao Zhai, and Ning Liu. Subjective-aligned dateset and metric for text-to-video quality assessment. arXiv preprint arXiv:2403.11956, 2024.
  • [32] Kwai. Keling. Kwai, 2024.
  • [33] Pika Lab. Pika-2.0 lab discord server. Pika Lab, 2024.
  • [34] Tiep Le, Vasudev Lal, and Phillip Howard. Coco-counterfactuals: Automatically constructed counterfactual examples for image-text pairs. Advances in Neural Information Processing Systems, 36, 2024.
  • [35] Bo Li, Kaichen Zhang, Hao Zhang, Dong Guo, Renrui Zhang, Feng Li, Yuanhan Zhang, Ziwei Liu, and Chunyuan Li. Llava-next: Stronger llms supercharge multimodal capabilities in the wild, May 2024.
  • [36] Kunchang Li, Yali Wang, Yizhuo Li, Yi Wang, Yinan He, Limin Wang, and Yu Qiao. Unmasked teacher: Towards training-efficient video foundation models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19948–19960, 2023.
  • [37] Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. In CVPR, 2023.
  • [38] Yumeng Li, William Beluch, Margret Keuper, Dan Zhang, and Anna Khoreva. Vstar: Generative temporal nursing for longer dynamic video synthesis. arXiv preprint arXiv:2403.13501, 2024.
  • [39] Long Lian, Boyi Li, Adam Yala, and Trevor Darrell. Llm-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models. arXiv preprint arXiv:2305.13655, 2023.
  • [40] Bin Lin, Shenghai Yuan, Zhenyu Tang, Junwu Zhang, Xinhua Cheng, Liuhan Chen, Yang Ye, Bin Zhu, Yunyang Ge, Xing Zhou, Shaoling Dong, Yemin Shi, Yonghong Tian, and Li Yuan. Tinysora. In Github, March 2023.
  • [41] Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023.
  • [42] Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan. Evalcrafter: Benchmarking and evaluating large video generation models. arXiv preprint arXiv:2310.11440, 2023.
  • [43] Yuanxin Liu, Lei Li, Shuhuai Ren, Rundong Gao, Shicheng Li, Sishuo Chen, Xu Sun, and Lu Hou. Fetv: A benchmark for fine-grained evaluation of open-domain text-to-video generation. Advances in Neural Information Processing Systems, 36, 2024.
  • [44] Luma. Dream-machine. Luma, 2024.
  • [45] Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048, 2024.
  • [46] Chong Mou, Mingdeng Cao, Xintao Wang, Zhaoyang Zhang, Ying Shan, and Jian Zhang. Revideo: Remake a video with motion and content control. arXiv preprint arXiv:2405.13865, 2024.
  • [47] Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023.
  • [48] Mayu Otani, Riku Togashi, Yu Sawai, Ryosuke Ishigami, Yuta Nakashima, Esa Rahtu, Janne Heikkilä, and Shin’ichi Satoh. Toward verifiable and reproducible human evaluation for text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14277–14286, 2023.
  • [49] Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in gan evaluation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11410–11420, 2022.
  • [50] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
  • [51] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
  • [52] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  • [53] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In ICML, pages 8821–8831, 2021.
  • [54] Arijit Ray, Filip Radenovic, Abhimanyu Dubey, Bryan Plummer, Ranjay Krishna, and Kate Saenko. cola: A benchmark for compositional text-to-image retrieval. Advances in Neural Information Processing Systems, 36, 2024.
  • [55] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bjorn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, Jun 2022.
  • [56] Runway. Gen-2: The next step forward for generative ai. Runway, 2024.
  • [57] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
  • [58] Yujun Shi, Chuhui Xue, Jiachun Pan, Wenqing Zhang, Vincent YF Tan, and Song Bai. Dragdiffusion: Harnessing diffusion models for interactive point-based image editing. arXiv preprint arXiv:2306.14435, 2023.
  • [59] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022.
  • [60] Lucas Smaira, João Carreira, Eric Noland, Ellen Clancy, Amy Wu, and Andrew Zisserman. A short note on the kinetics-700-2020 human action dataset. arXiv preprint arXiv:2010.10864, 2020.
  • [61] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
  • [62] Spencer Sterling. zeroscope. In Huggingface, 2023.
  • [63] Keqiang Sun, Junting Pan, Yuying Ge, Hao Li, Haodong Duan, Xiaoshi Wu, Renrui Zhang, Aojun Zhou, Zipeng Qin, Yi Wang, et al. Journeydb: A benchmark for generative image understanding. Advances in Neural Information Processing Systems, 36, 2024.
  • [64] Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. Fvd: A new metric for video generation. openreview, 2019.
  • [65] Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable length video generation from open domain textual descriptions. In International Conference on Learning Representations, 2022.
  • [66] Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023.
  • [67] Weimin Wang, Jiawei Liu, Zhijie Lin, Jiangqiao Yan, Shuo Chen, Chetwin Low, Tuyen Hoang, Jie Wu, Jun Hao Liew, Hanshu Yan, et al. Magicvideo-v2: Multi-stage high-aesthetic video generation. arXiv preprint arXiv:2401.04468, 2024.
  • [68] Wenjing Wang, Huan Yang, Zixi Tuo, Huiguo He, Junchen Zhu, Jianlong Fu, and Jiaying Liu. Videofactory: Swap attention in spatiotemporal diffusions for text-to-video generation. arXiv preprint arXiv:2305.10874, 2023.
  • [69] Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103, 2023.
  • [70] Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al. Internvid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942, 2023.
  • [71] Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Jilan Xu, Zun Wang, et al. Internvideo2: Scaling video foundation models for multimodal video understanding. arXiv preprint arXiv:2403.15377, 2024.
  • [72] Jay Zhangjie Wu, Guian Fang, Haoning Wu, Xintao Wang, Yixiao Ge, Xiaodong Cun, David Junhao Zhang, Jia-Wei Liu, Yuchao Gu, Rui Zhao, et al. Towards a better metric for text-to-video generation. arXiv preprint arXiv:2401.07781, 2024.
  • [73] Jay Zhangjie Wu, Guian Fang, Haoning Wu, Xintao Wang, Yixiao Ge, Xiaodong Cun, David Junhao Zhang, Jia-Wei Liu, Yuchao Gu, Rui Zhao, et al. Towards a better metric for text-to-video generation. arXiv preprint arXiv:2401.07781, 2024.
  • [74] Wei Xiong, Wenhan Luo, Lin Ma, Wei Liu, and Jiebo Luo. Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In CVPR, pages 2364–2373, 2018.
  • [75] Jun Xu, Tao Mei, Ting Yao, and Yong Rui. MSR-VTT: A large video description dataset for bridging video and language. CVPR, pages 5288–5296, 2016.
  • [76] Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See Kiong Ng, and Jiashi Feng. Pllava: Parameter-free llava extension from images to videos for video dense captioning. arXiv preprint arXiv:2404.16994, 2024.
  • [77] Hongwei Xue, Bei Liu, Huan Yang, Jianlong Fu, Houqiang Li, and Jiebo Luo. Learning fine-grained motion embedding for landscape animation. In Proceedings of the 29th ACM International Conference on Multimedia, pages 291–299, 2021.
  • [78] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2(3):5, 2022.
  • [79] Shenghai Yuan, Jinfa Huang, Yujun Shi, Yongqi Xu, Ruijie Zhu, Bin Lin, Xinhua Cheng, Li Yuan, and Jiebo Luo. Magictime: Time-lapse video generation models as metamorphic simulators. arXiv preprint arXiv:2404.05014, 2024.
  • [80] Yuanhao Zhai, Kevin Lin, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Chung-Ching Lin, David Doermann, Junsong Yuan, and Lijuan Wang. Motion consistency model: Accelerating video diffusion with disentangled motion-appearance distillation. arXiv preprint arXiv:2406.06890, 2024.
  • [81] David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, and Mike Zheng Shou. Show-1: Marrying pixel and latent diffusion models for text-to-video generation. arXiv preprint arXiv:2309.15818, 2023.
  • [82] Lei Zhang, Fangxun Shu, Sucheng Ren, Bingchen Zhao, Hao Jiang, and Cihang Xie. Compress & align: Curating image-text data with human knowledge. arXiv preprint arXiv:2312.06726, 2023.
  • [83] Shaofeng Zhang, Jinfa Huang, Qiang Zhou, Zhibin Wang, Fan Wang, Jiebo Luo, and Junchi Yan. Continuous-multiple image outpainting in one-step via positional query and a diffusion-based approach. arXiv preprint arXiv:2401.15652, 2024.
  • [84] Zhaoyang Zhang, Ziyang Yuan, Xuan Ju, Yiming Gao, Xintao Wang, Chun Yuan, and Ying Shan. Mira: A mini-step towards sora-like long video generation. In Github, April 2024.
  • [85] Shihao Zhao, Dongdong Chen, Yen-Chun Chen, Jianmin Bao, Shaozhe Hao, Lu Yuan, and Kwan-Yee K Wong. Uni-controlnet: All-in-one control to text-to-image diffusion models. NeurIPS, 36, 2024.
  • [86] Zangwei Zheng, Xiangyu Peng, and Yang You. Open-sora: Democratizing efficient video production for all. In Github, March 2024.

Appendix


Appendix A More Related Work: Text-to-Video Generation Models

The emergence of large-scale text-to-image models [83, 53, 52, 51, 39, 4, 85, 13, 47, 37] has significantly advanced the field of Text-to-Video (T2V) generation [59, 5, 6, 20, 66, 81]. Existing T2V architectures can be categorized into two types: U-Net-based and DiT-based. The former typically builds on Stable Diffusion [55], extending the 2D U-Net to a 3D U-Net by adding temporal layers, thereby achieving high-quality video generation [67, 14, 22, 3, 10, 38]. The latter focuses on recreating open-source structures similar to Sora [8], using the DiT (Diffusion-Transformer) [50] framework for T2V generation [40, 86, 84, 19]. However, the generation quality of DiT-based architectures still lags behind that of U-Net-based architectures. MagicTime [79] notes that although these models have achieved basic video generation, the videos are typically limited to simple actions and scenes, resulting in the production of general videos rather than those enriched with physical priors like metamorphic/time-lapse videos. For a more intuitive representation, we have detailed a comparison of the metamorphic video generation capabilities of different algorithms.

Appendix B More Details about Automatic Metrics

B.1 Construction of Retrieval Sentences for Metamorphic Score

To obtain an effective Metamorphic Score (MTScore), we meticulously designed ten distinct retrieval texts to differentiate between time-lapse and normal videos. Although, in theory, only two retrieval sentences are needed to distinguish between general and time-lapse videos, multiple texts were used to enhance the model’s robustness and accuracy. This approach also provides diverse linguistic representations for each video category, ensuring comprehensive coverage and minimizing bias. As shown in Table 5, the first five sentences (Index 1-5) describe general videos, capturing standard, unaltered video content in unique phrasings. The last five sentences (Index 6-10) describe time-lapse videos, characterized by accelerated playback or condensed time sequences, also phrased in various ways to capture different nuances. When calculating the MTScore, the video retrieval model uses these texts to evaluate each frame of the video, assigning probabilities based on the matches. The final result is obtained by summing the general probability and the metamorphic probability. For GPT4o-MTScore, we used a five-point rating scale and provided detailed scoring guidelines in the prompt, as shown in Table 6.

Table 5: Retrieval sentences for coarse-grained score (MTScore)

| Index | Sentence |
|---|---|
| 1 | A conventional video, not a time-condensed video. |
| 2 | A usual video, not an accelerated video sequence. |
| 3 | A normal video, not a time-lapse video. |
| 4 | A standard video, not a time-lapse. |
| 5 | An ordinary video, different from a fast-motion video. |
| 6 | A time-lapse video, distinct from a regular recording. |
| 7 | A time-lapse footage, not your typical video. |
| 8 | A fast-motion video, unlike a standard video. |
| 9 | A time-condensed video, not a conventional video. |
| 10 | An accelerated video sequence, not a usual video. |
Table 6: Scoring Criteria for GPT4o-MTScore. We set guidelines for each score to ensure that GPT-4o makes choices based on consistent criteria.

| Score | Brief Reasoning Statement |
| --- | --- |
| 1 | Minimal change. The scene appears almost like a still image, with static elements remaining motionless and only minor changes in lighting or subtle movements of elements. No significant activity is noticeable. |
| 2 | Slight change. There is a small amount of movement or change in the elements of the scene, such as a few people or vehicles moving and minor changes in light or shadows. The overall variation is still minimal, with changes mostly being quantitative. |
| 3 | Moderate change. Multiple elements in the scene undergo changes, but the overall pace is slow. This includes gradual changes in daylight, moving clouds, growing plants, or occasional vehicle and pedestrian movements. The scene begins to show a transition from quantitative to qualitative change. |
| 4 | Significant change. The elements in the scene show obvious dynamic changes with a higher speed and frequency of variation. This includes noticeable changes in city traffic, crowd activities, or significant weather transitions. The scene displays a mix of quantitative and qualitative changes. |
| 5 | Dramatic change. Elements in the scene undergo continuous and rapid significant changes, creating a very rich visual effect. This includes events like sunrise and sunset, construction of buildings, and seasonal changes, making the variation process vivid and impactful. The scene exhibits clear qualitative change. |

B.2 More Description of Temporal Coherence Score

We present a concise global description of the algorithm for computing the Temporal Coherence Score, as shown in Algorithm 1. First, we process the input video using a pre-trained model with grid size $G$ and threshold $T$ to obtain $p_{\text{vis}}$. Next, we count the number of missing tracking points $m[i]$ in each frame and the change in missed points between consecutive frames $\Delta m[i]$. We then calculate the average proportion of missed points per frame $R_{\text{missed}}$, indicating the overall visibility issue across the video. Following this, we compute the variation in the number of missed points between consecutive frames $V_{\text{missed}}$, measuring frame-to-frame coherence. We also determine the ratio of frames that need to be cut $R_{\text{cut}}$, reflecting the extent of video editing required, and count the number of consecutive changes in missed points exceeding the threshold $C_{\text{missed}}$, indicating frequent large-scale instability in point tracking. Additionally, we measure the maximum continuous change in missed points $M_{\text{missed}}$, highlighting the most severe continuity breaks in the video. Finally, we integrate these metrics to calculate the Coherence Score (CHScore).

Algorithm 1 Calculation of Coherence Score
1: Input: Video, pre-trained model with grid size $G$ and threshold $T$
2: Output: Coherence score $S_{\text{CHS}}$
3: Process input video using pre-trained model with grid size $G$ and threshold $T$ to get $p_{\text{vis}}$
4: for each frame $i$ do
5:     count the number of missing tracking points in each frame
6:     $m[i] \leftarrow \frac{1}{N}\sum_{j=1}^{N}(1 - p_{\text{vis}}[0,i,j])$
7: end for
8: for each frame $i$ do
9:     $\Delta m[i] \leftarrow m[i+1] - m[i]$
10:     if $\Delta m[i] > T$ then
11:         frame $i$ will be added to the set frames_to_be_cut
12:         $C_{\text{missed}} \leftarrow C_{\text{missed}} + \Delta m[i]$
13:     end if
14: end for
15: $R_{\text{cut}} \leftarrow \frac{\text{len}(\text{frames\_to\_be\_cut})}{\text{frames}}$
16: $R_{\text{missed}} \leftarrow \frac{1}{\text{frames}}\sum_{i=1}^{\text{frames}} m[i]$
17: $V_{\text{missed}} \leftarrow \text{std}(\Delta m)$
18: $M_{\text{missed}} \leftarrow \max(\Delta m)$
19: $\text{TSI\_sum} \leftarrow R_{\text{missed}} + V_{\text{missed}} + R_{\text{cut}} + C_{\text{missed}} + M_{\text{missed}}$
20: $S_{\text{CHS}} \leftarrow \frac{1}{\text{TSI\_sum}}$
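
As a rough illustration, the steps of Algorithm 1 can be sketched in plain Python once the tracker's visibility output is available. The sketch below is not the official implementation: the function name `coherence_score`, the frames-by-points layout of `p_vis` (the batch index 0 is dropped), the use of the population standard deviation for std, and the handling of a non-positive TSI_sum are all assumptions made for illustration.

```python
import math
from statistics import pstdev
from typing import List, Sequence


def coherence_score(p_vis: Sequence[Sequence[float]], threshold: float) -> float:
    """Sketch of Algorithm 1 on a frames x points visibility table (1 = visible)."""
    if len(p_vis) < 2:
        raise ValueError("need at least two frames to measure frame-to-frame change")
    num_frames = len(p_vis)

    # m[i]: proportion of tracking points that are not visible in frame i.
    m: List[float] = [sum(1.0 - v for v in frame) / len(frame) for frame in p_vis]

    # delta_m[i] = m[i + 1] - m[i], defined for every frame except the last.
    delta_m = [m[i + 1] - m[i] for i in range(num_frames - 1)]

    # Frames whose increase in missed points exceeds the threshold T.
    frames_to_be_cut = [i for i, d in enumerate(delta_m) if d > threshold]
    c_missed = sum(delta_m[i] for i in frames_to_be_cut)

    r_cut = len(frames_to_be_cut) / num_frames
    r_missed = sum(m) / num_frames
    v_missed = pstdev(delta_m)  # assumption: population standard deviation
    m_missed = max(delta_m)

    tsi_sum = r_missed + v_missed + r_cut + c_missed + m_missed
    # Assumption: a non-positive sum (e.g. every point visible in every frame)
    # is reported as infinity instead of dividing by zero.
    return math.inf if tsi_sum <= 0 else 1.0 / tsi_sum


# Toy input: two of four points disappear between frame 1 and frame 2.
example_video = [[1, 1, 1, 1], [1, 1, 1, 1], [1, 0, 0, 1], [1, 0, 0, 1]]
example_score = coherence_score(example_video, threshold=0.1)
```

In this toy input a single abrupt loss of half the tracked points contributes to every term of TSI_sum, so the score is low; a video in which all points stay visible makes every term zero.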

B.3 Visualization of the Different Scores of MTScore and CHScore

We provide some samples of different scoring magnitudes for MTScore and CHScore, as shown in Figure 8. It can be seen that both scores are consistent with human perception.

Figure 8: Visual Reference for Varying Scores of MTScore and CHScore. It is observed that higher scores correlate with increased metamorphic amplitude and coherence.

Appendix C More Details about ChronoMagic-Pro

C.1 Multi-Aspect Data Preprocessing

Due to the abundance of low-quality videos on video platforms, we filter out lower-quality videos based on metadata such as view counts, comments, and likes after acquiring the original videos, ultimately obtaining 66,226 original videos. Then, we use a three-stage method to further process and filter the data.

Adaptive Video Transition Cutting.    In stage one: since our training data is sourced from video platforms (e.g., YouTube) where videos are designed to engage the audience, they inherently contain many transitions (significant changes in content during video playback). To address this issue, we follow the method described in Panda70M [15] to split the videos into multiple semantically consistent single-scene clips. Specifically, OpenCV [7] initially splits the video by analyzing pixel differences between adjacent frames. Let $I_t$ be the image frame at time $t$; the difference between two adjacent frames can be computed as:

$$D_t = \sum_{i=1}^{H}\sum_{j=1}^{W}\left|I_t(i,j) - I_{t+1}(i,j)\right| \tag{10}$$

where $H$ and $W$ are the height and width of the frame, and $i$ and $j$ represent pixel positions, respectively. Videos are split into clips where $D_t$ exceeds a certain threshold $\tau$. Then, the ImageBind model [21] recombines erroneously split clips by analyzing feature space differences between adjacent clips. Let $\phi(I_t)$ represent the feature vector of frame $I_t$ obtained from the ImageBind model. The feature space difference between adjacent clips $C_i$ and $C_{i+1}$ can be computed as:

$$F_i = \left\|\phi(I_{t_i}) - \phi(I_{t_{i+1}})\right\|_2 \tag{11}$$

where $t_i$ and $t_{i+1}$ are the times of the last frame of $C_i$ and the first frame of $C_{i+1}$, respectively. Clips are recombined where $F_i$ is below a certain threshold $\eta$. This process results in 460K semantically consistent single-scene video clips (ChronoMagic-Pro).
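
The split-then-recombine logic of Eq. (10) and Eq. (11) can be sketched as follows. This is only a minimal illustration under stated assumptions: frames are small grayscale grids given as nested lists, `embed` is a hypothetical stand-in for the ImageBind feature extractor $\phi$, and the function names are invented here. The actual pipeline relies on OpenCV and ImageBind as described above.

```python
from math import dist
from typing import Callable, List, Sequence

Frame = Sequence[Sequence[float]]  # H x W grid of grayscale intensities


def frame_difference(a: Frame, b: Frame) -> float:
    """D_t of Eq. (10): sum of absolute pixel differences between two frames."""
    return sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def split_into_clips(frames: Sequence[Frame], tau: float) -> List[List[int]]:
    """Start a new clip (list of frame indices) wherever D_t exceeds tau."""
    if not frames:
        return []
    clips = [[0]]
    for t in range(len(frames) - 1):
        if frame_difference(frames[t], frames[t + 1]) > tau:
            clips.append([t + 1])
        else:
            clips[-1].append(t + 1)
    return clips


def merge_clips(
    frames: Sequence[Frame],
    clips: List[List[int]],
    embed: Callable[[Frame], Sequence[float]],  # hypothetical feature extractor
    eta: float,
) -> List[List[int]]:
    """Recombine adjacent clips whose boundary features are closer than eta (Eq. 11)."""
    if not clips:
        return []
    merged = [list(clips[0])]
    for clip in clips[1:]:
        last_frame = frames[merged[-1][-1]]   # last frame of C_i
        first_frame = frames[clip[0]]         # first frame of C_{i+1}
        if dist(embed(last_frame), embed(first_frame)) < eta:
            merged[-1].extend(clip)
        else:
            merged.append(list(clip))
    return merged


def mean_intensity(frame: Frame) -> List[float]:
    """Toy one-dimensional 'feature': the mean pixel value of the frame."""
    pixels = [p for row in frame for p in row]
    return [sum(pixels) / len(pixels)]


toy_frames = [
    [[0, 0], [0, 0]],
    [[0, 1], [0, 0]],
    [[9, 9], [9, 9]],  # abrupt change: a transition
    [[9, 9], [9, 8]],
]
toy_clips = split_into_clips(toy_frames, tau=10)
toy_merged = merge_clips(toy_frames, toy_clips, mean_intensity, eta=1.0)
```

With these toy values the pixel-difference pass cuts once between the second and third frames, and the feature distance across that boundary is far above `eta`, so the two clips stay separate. A larger `eta` would recombine them.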

Figure 9: The pipeline of constructing ChronoMagic-Pro/ProH. (Top) We first use OpenCV [7] and ImageBind [21] to split the video and get semantically consistent single-scene video clips. (Middle) Then, we use the video retrieval model [71] to filter out general (non-time-lapse) videos. (Bottom) Finally, we uniformly sample $N$ frames, obtain a caption for each using ShareGPT4Video [12], and let it summarize the video based on these captions and their frame positions.

Video Redundancy Elimination.    In stage two: video publishers often use eye-catching titles, descriptions, or hashtags to attract traffic. As a result, time-lapse videos found through search terms may be general videos. Because manually screening large-scale videos is impractical, we construct a zero-shot metamorphic-general classification strategy based on the video retrieval model [71]. Specifically, we carefully constructed 10 retrieval sentences to describe metamorphic and general videos. Let $T_i$ represent the $i$-th retrieval sentence from the set of retrieval sentences $\{T_1, T_2, \ldots, T_{10}\}$. For each video $V$ in the dataset, the video retrieval model computes the relevance probability $P(T_i|V)$ for each retrieval sentence $T_i$. We denote these probabilities as $P_1, P_2, \ldots, P_{10}$. To determine if a video is general or metamorphic, we sum the relevance probabilities within each group of retrieval sentences. Let $S_{\text{gen}}$ be the total relevance probability for general videos and $S_{\text{meta}}$ be the total relevance probability for metamorphic videos:

$$S_{\text{gen}} = \sum_{i=1}^{5} P(T_i|V) \quad\quad S_{\text{meta}} = \sum_{i=6}^{10} P(T_i|V) \tag{12}$$

We then judge whether the video is a time-lapse video through a voting strategy:

$$\text{Class}(V) = \begin{cases}\text{general} & \text{if } S_{\text{gen}} > 0.5 \\ \text{metamorphic} & \text{otherwise}\end{cases} \tag{13}$$

Assuming the ten relevance probabilities are normalized to sum to one, this is equivalent to comparing the two totals: if $S_{\text{gen}} > S_{\text{meta}}$, the video is classified as general; otherwise, it is classified as metamorphic. In this way, we obtain 150K video clips with higher purity and quality (ChronoMagic-ProH).
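
The voting rule of Eq. (12) and Eq. (13) reduces to a few lines once the ten relevance probabilities are available. The sketch below assumes they are already computed by the retrieval model and ordered as in Table 5 (general sentences first). The function name `classify_video` and the example probabilities are hypothetical.

```python
from typing import Sequence, Tuple


def classify_video(probs: Sequence[float]) -> Tuple[str, float, float]:
    """Return (label, S_gen, S_meta) for ten relevance probabilities P_1..P_10."""
    if len(probs) != 10:
        raise ValueError("expected one probability per retrieval sentence (10 total)")
    s_gen = sum(probs[:5])    # sentences 1-5 describe general videos
    s_meta = sum(probs[5:])   # sentences 6-10 describe time-lapse videos
    label = "general" if s_gen > 0.5 else "metamorphic"
    return label, s_gen, s_meta


# Made-up probabilities for illustration only.
time_lapse_like = [0.02] * 5 + [0.18] * 5
ordinary_like = [0.15] * 5 + [0.05] * 5
label_a, _, _ = classify_video(time_lapse_like)
label_b, _, _ = classify_video(ordinary_like)
```

Here `label_a` is "metamorphic" and `label_b` is "general", since the general total is roughly 0.1 in the first case and roughly 0.75 in the second.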

Time-Aware Annotation    In stage three: after obtaining high-quality time-lapse video clips, it is crucial to add appropriate captions. The simplest approach is to input the video clips into a large multimodal model to generate text descriptions of the video content. However, our experiments found that the 8B [41], 13B [76], and 34B [35] models could not accurately describe the content of time-lapse videos, resulting in severe hallucinations, as shown in Figure 10. Therefore, we decided to follow the annotation strategy of MagicTime [79]. Unlike MagicTime, due to higher costs, we adopted an open-source model [12] instead of the closed-source GPT-4V [1]. As shown in Figure 4, we first uniformly sample $N$ frames from each video segment, input these $N$ frames into the multimodal large model to describe the content, and finally have the model summarize the final video captions based on the textual descriptions of $N$ frames and the corresponding position of each frame in the video. To balance cost and effectiveness, we chose to use the 8B large multimodal model [12] instead of the 34B.
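
The uniform sampling step can be illustrated with a small helper that returns, for each sampled frame, its index and its relative position in the clip, which is the information passed along with the per-frame captions. The exact sampling rule is not specified in the text, so the evenly spaced, endpoint-inclusive scheme below and the name `uniform_frame_positions` are assumptions.

```python
from typing import List, Tuple


def uniform_frame_positions(num_frames: int, n: int) -> List[Tuple[int, float]]:
    """Pick n evenly spaced frame indices and their relative positions in [0, 1]."""
    if num_frames <= 0 or n <= 0:
        raise ValueError("num_frames and n must be positive")
    n = min(n, num_frames)  # cannot sample more distinct frames than exist
    if n == 1:
        indices = [(num_frames - 1) // 2]  # assumption: take the middle frame
    else:
        # Integer arithmetic keeps the first and last frame and avoids rounding ties.
        indices = [(k * (num_frames - 1)) // (n - 1) for k in range(n)]
    last = num_frames - 1
    return [(idx, (idx / last) if last > 0 else 0.0) for idx in indices]


sampled = uniform_frame_positions(num_frames=9, n=3)
```

For a 9-frame clip and three samples this selects the first, middle, and last frames, at relative positions 0.0, 0.5, and 1.0.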

Figure 10: Ablation on different captioning methods. Directly inputting the video into the model and having it describe the content is less effective than inputting keyframes into it.

C.2 Distribution of the Generated Captions

To analyze the word distribution in our generated captions within ChronoMagic-Pro, we computed their frequency distributions. The results, shown in Figure 11, reveal a prevalence of terms related to time-lapse videos, including "change," "transition," and "progressing." Additionally, words from four primary categories are evident: biological (e.g., mealworm, flower, tree), human-created (e.g., building, painting, walking), meteorological (e.g., eclipse, cloud, sunrise), and physical (e.g., burning, explosion). These terms underscore ChronoMagic-Pro’s focus on large-scale metamorphic changes, persistent transformations, and substantial physical interactions. The distribution of captions in ChronoMagic-ProH is similar to that in ChronoMagic-Pro.

Figure 11: The word clouds of the generated captions of (Top) ChronoMagic-Pro and (Bottom) ChronoMagic-ProH. The dataset focuses on changes (gradually, progressing, increasing, etc.), processes spanning a large amount of time, such as flower blooming, ice melting, building construction, sunrise and sunset.

C.3 Samples of the ChronoMagic-Pro

Figure 12 showcases a diverse array of samples from the ChronoMagic-Pro dataset, which features an extensive collection of time-lapse videos across several categories, including plants, buildings, ice, food, and various other objects and phenomena. Each video captures dynamic changes over time, providing rich visual information that surpasses the physical knowledge contained in many existing Text-to-Video (T2V) datasets. These samples illustrate the dataset’s diversity and depth, encompassing biological, human-created, meteorological, and physical categories, designed to support advanced research in high-dynamic text-to-video generation and related fields. Additionally, the dataset includes both time-lapse videos with significant state changes (e.g., flowers blooming) and videos with smaller state changes (e.g., clouds floating).

Figure 12: Samples from the ChronoMagic-Pro dataset. The dataset consists of time-lapse videos, which exhibit more physical knowledge than existing T2V datasets.

Appendix D More Details about Experiment

D.1 Computation Resource Details

We employ two types of GPUs: NVIDIA H100 and NVIDIA A100. All implementations are conducted based on the official code using the PyTorch framework.

D.2 Details of Evaluation T2V Models

ModelScopeT2V.    Model Details. ModelScopeT2V [66], featuring a U-Net architecture, extends the T2I model Stable Diffusion [55] by incorporating 1D temporal convolution and attention modules alongside the 2D modules for video modeling. Its training data consists primarily of image-text pairs (LAION [57]) and general video-text pairs (WebVid-10M [2] and MSR-VTT [75]), but it does not include the time-lapse videos discussed in this paper. Implementation Setups. We utilized the ModelScopeT2V code and model officially released on HuggingFace, maintaining the original parameter settings. We used a spatial resolution of 256×256 and a frame rate of 8 fps to generate a 2-second (16-frame) video.

ZeroScope.    Model Details. ZeroScope [62] is a watermark-free U-Net-based video model built on ModelScopeT2V [66], capable of generating high-quality 16:9 compositions and smooth video outputs. The model is trained on 9,923 clips and 29,769 labeled frames (24 frames per clip, 576×320 resolution) derived from the original weights of ModelScopeT2V [66]. The official documentation does not specify the exact training data; we speculate that time-lapse videos were not included. Implementation Setups. We utilized the ZeroScope_v2_576w code and model officially released on HuggingFace, maintaining the original parameter settings. We used a spatial resolution of 576×320 and a frame rate of 8 fps to generate a 3-second (24-frame) video.

Text2Video-Zero.    Model Details. Text2Video-Zero [28], featuring a U-Net architecture, is a zero-shot video generation method based on the T2I model Stable Diffusion [55]. It generates latent codes for all frames using rich motion dynamics and utilizes a self-attention mechanism to enable all frames to interact with the latent codes of the first frame. This process ultimately achieves high spatial and temporal consistency in the video through denoising. It does not require training data and, therefore, does not use time-lapse videos as training data. Implementation Setups. We utilized the officially released Text2Video-Zero code and model, maintaining the original parameter settings. Specifically, we used the dreamlike-photoreal-2.0 version of Stable Diffusion [55], with a spatial resolution of 512×512 and a frame rate of 8 fps, to generate a 2-second (16-frame) video.

LaVie.    Model Details. LaVie [69], featuring a U-Net architecture, is an extension of the T2I model Stable Diffusion [55]. It converts the T2I model into a T2V model by adding temporal dimension attention after the spatial modules and adopting an image-video joint training strategy. Its training data primarily consists of image-text pairs (LAION [57]) and general video-text pairs (WebVid-10M [2] and Vimeo25M [69]), but it does not include the time-lapse videos discussed in this paper. Implementation Setups. We used the officially released LaVie code and model. Although LaVie [69] provides options for frame interpolation and super-resolution after video generation, we did not use them to maintain fairness. We followed the original parameter settings, using a spatial resolution of 512×320 and a frame rate of 8 fps, to generate a 2-second (16-frame) video.

AnimateDiff.    Model Details. AnimateDiff [22], featuring a U-Net architecture, is an extension of the T2I model Stable Diffusion [55]. It attaches a newly initialized motion modeling module to a frozen text-to-image model, then trains it on video clips to extract reasonable motion priors for video generation. Its training data primarily consists of general video-text pairs (WebVid-10M [2]), excluding the time-lapse videos discussed in this paper. Implementation Setups. We used the officially released AnimateDiffV3 code and model, maintaining the original parameter settings. We used a spatial resolution of 384×256 and a frame rate of 8 fps to generate a 2-second (16-frame) video.

VideoCrafter2.    Model Details. VideoCrafter2 [10], featuring a U-Net architecture, is similar to AnimateDiff [22], as both add temporal modules to Stable Diffusion [55] to achieve video generation. However, VideoCrafter2 differs by encoding fps as a condition into the model and implementing the I2V function. Its training data primarily includes image-text pairs (LAION-COCO [16], JDB [63]) and general video-text pairs (WebVid-10M [2]), but it does not include the time-lapse videos discussed in this paper. Implementation Setups. We used the officially released VideoCrafter2 code and model, maintaining the original parameter settings. We used a spatial resolution of 512×320 and a frame rate of 10 fps to generate a 2-second (20-frame) video.

MCM.    Model Details. MCM [80], featuring a U-Net architecture, is a distillation video generation method based on the T2I model Stable Diffusion [55]. It proposes motion consistency models (MCM) to improve video diffusion distillation by disentangling motion and appearance learning, addressing frame quality issues and training-inference discrepancies. Its training data primarily includes image-text pairs (LAION-aes [57]) and general video-text pairs (WebVid-2M [2]), but it does not include the time-lapse videos discussed in this paper. Implementation Setups. We used the officially released MCM-modelscopet2v-laion code and model, maintaining the original parameter settings. We used a spatial resolution of 256×256 and a frame rate of 7 fps to generate a 2-second (14-frame) video.

MagicTime.    Model Details. MagicTime [79] is a U-Net-based metamorphic video generation model built on AnimateDiff [22]. It is capable of generating time-lapse videos with significant time spans and pronounced state changes, such as the entire process of a seed blooming or building construction. The model is trained using 2,265 metamorphic (time-lapse) clips and the original weights from AnimateDiffV3 [22]. Its training data primarily includes ChronoMagic [79], making it the only existing T2V model that uses time-lapse videos in the training process. Implementation Setups. We used the officially released MagicTime code and model, maintaining the original parameter settings. We used a spatial resolution of 512×512 and a frame rate of 8 fps to generate a 2-second (16-frame) video.

Latte.    Model Details. Latte [45] is a pioneer in open-source DiT-based T2V algorithms. It inherits the pure Transformer architecture of the T2I algorithm PixArt-$\alpha$ [11] and extends it by adding temporal modules after each spatial module, training from the original weights of PixArt-$\alpha$ [11] to achieve a DiT-based T2V algorithm. Its training data primarily includes general video-text pairs (Vimeo25M [69] and WebVid-10M [2]). Although it includes the time-lapse videos mentioned in this paper, they primarily consist of sky videos with fewer physical priors, making it unable to generate videos such as seed germination and flower blooming. Implementation Setup. We used the officially released LatteT2V code and model, maintaining the original parameter settings. We used a spatial resolution of 512×512 and a frame rate of 8 fps to generate a 2-second (16-frame) video.

OpenSoraPlan v1.1.    Model Details. OpenSoraPlan v1.1 [40] is a high-quality video generation model based on Latte [45]. It replaces the Image VAE [29] with Video VAE (CausalVideoVAE [40]), similar to Sora [8], enabling the generation of videos up to approximately 21 seconds long and high-quality images. Its training data consists of videos and images scraped from open-source websites under the CC0 license, labeled using ShareGPT4Video [12] to create a high-quality self-built dataset. The official documentation does not specify the exact training data; we speculate that time-lapse videos were not used. Implementation Setup. We used the officially released OpenSoraPlan v1.1 code and model. Although it provides T2V models in three versions: 65 frames, 221 frames, and 513 frames, we chose the 65-frame version to ensure fairness by maintaining a similar video length to other models. We kept the original parameter settings, using a spatial resolution of 512×512 and a frame rate of 24 fps to generate a 3-second (65-frame) video.

OpenSora 1.1 & 1.2.    Model Details. OpenSora 1.1 & 1.2 [86] is a high-quality DiT-based T2V model that introduces the ST-DiT-2 architecture, building on Latte [45]; the former is based on the Diffusion Model and the latter is based on the Flow Model. It supports the generation of images or videos with any aspect ratio, different resolutions, and durations. Its training data consists of images and videos scraped from open-source websites and a labeled self-built dataset. The official documentation does not specify the exact training data; we speculate that time-lapse videos were not used. Implementation Setup. We used the officially released OpenSora 1.1 & 1.2 code and model. For OpenSora 1.1, we employed the stage-3 checkpoint, setting the spatial resolution to 512×512 and the frame rate to 24 fps, to generate a 2-second (48-frame) video. For OpenSora 1.2, we set the spatial resolution to 1280×720 and the frame rate to 24 fps, producing a 4-second (96-frame) video.

D.3 Verification Experiment on ChronoMagic-Pro

To verify the validity and robustness of the ChronoMagic-Pro dataset, we conducted quantitative and qualitative validation experiments based on OpenSoraPlan v1.1 [40]. Specifically, we fine-tuned the temporal module of the OpenSoraPlan v1.1 model using a uniform frame extraction strategy and LoRA [17], based on the weights of OpenSoraPlan v1.1 [40]. Due to limited computational resources, we randomly selected only 10,000 video-text pairs from ChronoMagic-Pro for training. The results are shown in Table 7. After fine-tuning with ChronoMagic-Pro, the visual quality (UMT-FVD), text relevance (UMTScore), and metamorphic amplitude (MTScore and GPT4o-MTScore) were all effectively improved. Notably, the enhancement in metamorphic amplitude endowed OpenSoraPlan [40] with the ability to generate time-lapse videos of significant state changes, such as blooming flowers and city traffic. We also provide a qualitative analysis, as shown in Figure 13. It is evident that, after fine-tuning, the generated videos can extend changes beyond mere lighting and camera movements to alterations in the state of objects, while ensuring that the visual quality, text relevance, and coherence remain uncompromised. This proves that ChronoMagic-Pro can support existing models in generating high-quality time-lapse videos with significant state changes, providing a new approach for future T2V model training. However, using a uniform frame extraction strategy and LoRA [17] for model fine-tuning may lead to decreased video coherence, which needs to be addressed in future work. In this study, we employ these methods solely for verification experiments.

Table 7: Quantitative comparison of OpenSoraPlan v1.1 [40] before and after fine-tuning using ChronoMagic-Pro. "↓" denotes lower is better. "↑" denotes higher is better.

| Method | Venue | UMT-FVD ↓ | UMTScore ↑ | MTScore ↑ | CHScore ↑ | GPT4o-MTScore ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| OpenSoraPlan v1.1 [40] | Github’24 | 188.53 | 2.421 | 0.327 | 10.35 | 2.19 |
| OpenSoraPlan v1.1 [40] + ChronoMagic-Pro | Ours | 185.72 | 2.753 | 0.341 | 5.626 | 3.03 |

Figure 13: Qualitative comparison of OpenSoraPlan v1.1 [40] before and after fine-tuning using ChronoMagic-Pro. After fine-tuning, the changes in the generated videos are no longer limited to lighting and camera movement, but are extended to changes in the state of objects. Additionally, it ensures that the visual quality, text relevance, and coherence are maintained without loss.

D.4 More Qualitative Evaluation on ChronoMagic-Bench

Due to space limitations, additional time-lapse videos generated by different baseline methods are shown in Figure 14. Similar to the results in the main text, most algorithms, except for MagicTime [79], fail to generate time-lapse videos with significant state changes, such as building construction. However, for time-lapse videos with smaller state changes, essentially faster-moving videos like city traffic changes, U-Net-based methods [66, 62, 28, 69, 22, 10, 79] exhibit much better visual quality, text relevance, and coherence compared to DiT-based methods [45, 40, 86]. This again demonstrates that U-Net-based methods are currently more stable and capable of producing satisfactory results with minimal inference. All videos generated by all models on ChronoMagic-Bench will be made publicly available.

Figure 14: More Qualitative Comparison with different T2V generation methods for the text-to-video task in ChronoMagic-Bench. Most methods struggle to follow the prompt to generate time-lapse videos with high physics prior content.

D.5 More Quantitative Evaluation on ChronoMagic-Bench-150

We conduct a quantitative analysis of the temporal coherence of T2V models, with the results presented in Table 8. CHScore values are comparable within the same group (open-source or closed-source) and reflect the temporal coherence of the algorithms in that group. Specifically, the temporal coherence of closed-source models is generally similar, whereas among open-source models T2V-Zero [28] and OpenSora 1.2 [86] exhibit the worst temporal coherence. The scores of the two groups are not directly comparable for two reasons: (1) current video tracking algorithms perform poorly in detecting fluids (clouds, water flow, traffic flow), which are crucial components in time-lapse videos; (2) closed-source models can generate realistic fluids following prompt instructions, whereas open-source models produce less realistic, static fluids. Consequently, CHScore can currently only be applied to open-source models, which indirectly highlights the significant differences between open-source and closed-source models.

Table 8: More quantitative comparison of T2V generation methods for the text-to-video task in ChronoMagic-Bench-150. "↓" denotes lower is better. "↑" denotes higher is better.

| Method | Venue | Backbone | Status | CHScore ↑ |
| --- | --- | --- | --- | --- |
| Gen-2 [56] | Runway | U-Net | Closed-Source | 5.27 |
| Pika-2.0 [33] | PikaLab | U-Net | Closed-Source | 4.00 |
| Dream Machine [44] | LUMA | DiT | Closed-Source | 2.30 |
| KeLing [32] | Kwai | DiT | Closed-Source | 3.69 |
| ModelScopeT2V [66] | Arxiv’23 | U-Net | Open-Source | 10.64 |
| ZeroScope [62] | CVPR’23 | U-Net | Open-Source | 24.10 |
| T2V-Zero [28] | ICCV’23 | U-Net | Open-Source | 1.84 |
| LaVie [69] | Arxiv’23 | U-Net | Open-Source | 9.58 |
| AnimateDiff [22] | ICLR’24 | U-Net | Open-Source | 11.09 |
| VideoCrafter2 [10] | Arxiv’24 | U-Net | Open-Source | 7.78 |
| MCM [80] | Arxiv’24 | U-Net | Open-Source | 14.14 |
| MagicTime [79] | Arxiv’24 | U-Net | Open-Source | 11.58 |
| Latte [45] | Arxiv’24 | DiT | Open-Source | 13.79 |
| OpenSora 1.1 [86] | Github’24 | DiT | Open-Source | 10.46 |
| OpenSora 1.2 [86] | Github’24 | DiT | Open-Source | 5.60 |
| OpenSoraPlan v1.1 [40] | Github’24 | DiT | Open-Source | 10.32 |
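
The observations about Table 8 can be checked with a short script. The sketch below copies the CHScore values from Table 8 and summarizes them per group; the function names and the "open"/"closed" labels are illustrative choices, not part of the benchmark code.

```python
from statistics import mean

# (method, backbone, status, CHScore), copied from Table 8.
TABLE8 = [
    ("Gen-2", "U-Net", "closed", 5.27),
    ("Pika-2.0", "U-Net", "closed", 4.00),
    ("Dream Machine", "DiT", "closed", 2.30),
    ("KeLing", "DiT", "closed", 3.69),
    ("ModelScopeT2V", "U-Net", "open", 10.64),
    ("ZeroScope", "U-Net", "open", 24.10),
    ("T2V-Zero", "U-Net", "open", 1.84),
    ("LaVie", "U-Net", "open", 9.58),
    ("AnimateDiff", "U-Net", "open", 11.09),
    ("VideoCrafter2", "U-Net", "open", 7.78),
    ("MCM", "U-Net", "open", 14.14),
    ("MagicTime", "U-Net", "open", 11.58),
    ("Latte", "DiT", "open", 13.79),
    ("OpenSora 1.1", "DiT", "open", 10.46),
    ("OpenSora 1.2", "DiT", "open", 5.60),
    ("OpenSoraPlan v1.1", "DiT", "open", 10.32),
]


def summarize(status):
    """Return (min, max, mean) CHScore for one group ('open' or 'closed')."""
    scores = [score for _, _, group, score in TABLE8 if group == status]
    return min(scores), max(scores), mean(scores)


def lowest_open_source(k=2):
    """Names of the k open-source models with the lowest CHScore."""
    rows = [(name, score) for name, _, group, score in TABLE8 if group == "open"]
    return [name for name, _ in sorted(rows, key=lambda row: row[1])[:k]]


if __name__ == "__main__":
    for status in ("closed", "open"):
        lo, hi, avg = summarize(status)
        print(f"{status:>6}: min={lo:.2f} max={hi:.2f} mean={avg:.2f}")
    print("lowest open-source:", lowest_open_source())
```

The closed-source scores fall in a narrow band, while the open-source scores spread much more widely, with T2V-Zero and OpenSora 1.2 at the bottom of the open-source group.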

D.6 Details of Human Evaluation

To validate the effectiveness of the automated metrics, we selected a subset of videos for user evaluation and invited 171 participants to provide manual evaluation results. To improve the participant experience, we presented the results of five representative baselines (AnimateDiff [22], MagicTime [79], VideoCrafter2 [10], OpenSora 1.1 [86], OpenSoraPlan v1.1 [40]) for users to choose from. Following established methodologies from prior studies [53, 65, 79, 59], we designed a detailed questionnaire for human evaluators to rate the generated content. The evaluation focused on four primary aspects: Visual Quality, Text Relevance, Metamorphic Amplitude, and Coherence. For each criterion, we employed a five-point rating scale and provided scoring guidelines to ensure consistent user selections, thereby minimizing assessment bias. For detailed criteria, please refer to Figure 15.

Figure 15: Visualization of the Questionnaire for Human Evaluation. We employ a five-point rating scale and provide scoring guidelines to ensure consistent selections by users, thereby minimizing assessment bias.
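
As an illustration of how five-point ratings over the four criteria could be aggregated, the sketch below averages ratings per model and criterion. This is an assumption about the aggregation step, not the authors' actual procedure, and the sample responses are made up purely for demonstration.

```python
from collections import defaultdict
from statistics import mean

# The four criteria named in the questionnaire.
CRITERIA = ("Visual Quality", "Text Relevance", "Metamorphic Amplitude", "Coherence")


def aggregate(responses):
    """Average five-point ratings per (model, criterion).

    `responses` is an iterable of (model, criterion, rating) tuples,
    where rating is an integer from 1 to 5.
    """
    buckets = defaultdict(list)
    for model, criterion, rating in responses:
        if criterion not in CRITERIA:
            raise ValueError(f"unknown criterion: {criterion}")
        if not 1 <= rating <= 5:
            raise ValueError(f"rating out of range: {rating}")
        buckets[(model, criterion)].append(rating)
    return {key: mean(values) for key, values in buckets.items()}


if __name__ == "__main__":
    # Hypothetical responses, for demonstration only (not data from the study).
    sample = [
        ("MagicTime", "Metamorphic Amplitude", 4),
        ("MagicTime", "Metamorphic Amplitude", 5),
        ("AnimateDiff", "Coherence", 3),
    ]
    for (model, criterion), score in aggregate(sample).items():
        print(f"{model} / {criterion}: {score:.2f}")
```

Mean ratings per criterion can then be compared against the corresponding automatic metric for the same models.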

Appendix E More Details about 75 Subcategories in ChronoMagic-Bench

Due to space limitations, we provide detailed descriptions of the 75 search terms used in ChronoMagic-Bench below; each term includes the phrase "time-lapse". Because of search engine limitations, some precise search terms may not yield optimal results. Therefore, to make the collection more comprehensive, some overlap may exist between broader terms like "plant" and more precise terms like "flower".

Biological:

  • Animal. Captures the movements, behaviors, and interactions of various animals over an extended period. This includes everything from the daily activities of pets to the complex behaviors of wild animals in their natural habitats.

  • Spider Web. Showcases the intricate process of spiders spinning their webs. It highlights the changes the web undergoes over time.

  • Butterfly. Focuses on the life cycle of butterflies, particularly the metamorphosis from caterpillar to chrysalis to adult butterfly. It includes the intricate process of pupation and emergence.

  • Hatching. Documents the hatching process of various eggs, including those of birds, reptiles, and insects. This category captures the moment of emergence and the initial activities of the newborns.

  • Flower Dying. Captures the end-of-life process of flowers, showing how they wilt and decay over time.

  • Mealworm. Showcases the behavior of mealworms, including their feeding habits.

  • Plant Growing. This broad category includes time-lapse videos of various plants as they grow from seeds to mature plants. It encompasses root development, stem elongation, and the emergence of leaves and flowers.

  • Ripening. Documents the ripening process of fruits and vegetables, showing the changes in color, texture, and overall appearance as they become ready for consumption.

  • Leaves. Focuses on the growth, movement, and changes of leaves on plants. This includes the unfolding of new leaves, changes in color, and responses to environmental factors.

  • Seed. Captures the germination and initial growth stages of seeds, from the first signs of sprouting to the establishment of seedlings. It focuses on the early and often delicate stages of plant development.

  • Blooming. Showcases the process of flowers blooming, capturing the gradual opening of petals and the transformation from buds to full blossoms.

  • Mushroom. Captures the rapid growth and development of mushrooms, from the initial emergence of the mycelium to the full development of the fruiting body.

Human-Created:

  • 3D Printing. Captures the process of 3D printing objects. These videos show the additive manufacturing process layer by layer, from the initial base to the final, complete object.

  • Painting. Showcases the process of creating a painting, from the initial sketch to the final strokes.

  • Laser Engraving. Shows the process of laser engraving on various materials, including how patterns form.

  • Building. Documents the construction of various structures, including residential, commercial, and industrial buildings. This category highlights the step-by-step development from foundation to completion.

  • Minecraft Build. Captures the construction of complex structures and landscapes within the game Minecraft.

  • Demolition. Captures the process of demolishing buildings and structures.

  • Fireworks. Captures the display of fireworks, showcasing the entire process from the launch of the explosive into the sky to its transformation into bursts of color and patterns in the night sky.

  • People. Focuses on the activities and movements of people in various settings, including streets, parks, and public spaces.

  • Sport. Captures sporting events and activities, highlighting the movement of athletes, the progression of games, and the energy of the crowd.

  • City. Focuses on the dynamic activities within a city, including urban development, traffic flow, and daily life. These videos often showcase the bustling and ever-changing nature of urban environments.

  • Factory. Highlights the operations within a factory, including assembly lines, manufacturing processes, and the movement of goods.

  • Market. Documents the activities within a market, including the setting up of stalls, movement of people, and trading of goods.

  • Office. Captures the daily activities within an office environment, including the ebb and flow of workers, meetings, and the general hustle and bustle of office life.

  • Restaurant. Documents the activities within a restaurant, including food preparation, service, and customer interactions.

  • Road. Captures traffic flow and changes in road conditions over time.

  • Station. Focuses on the activities within transportation stations, such as train stations, bus terminals, and airports. These videos capture the flow of passengers, arrivals, departures, and the hustle and bustle of travel hubs.

  • Traffic. Captures the movement of vehicles on roads and highways, including traffic flow, congestion, and the changing pace of vehicular movement throughout the day.

  • Walking. Focuses on people walking in various environments, such as city streets, parks, and malls.

  • Parking. Captures the movement of vehicles in parking lots or garages, including the flow of cars as they enter, park, and exit.

Meteorological:

  • Day to Night. Shows the transition from daylight to nighttime, capturing the gradual shift in light and atmosphere as day turns to night.

  • Night to Day. Shows the transition from nighttime to daylight, capturing the gradual change in lighting and environment as night turns to day.

  • Day. Captures the progression of daylight hours, highlighting changes in light intensity, shadows, and weather conditions.

  • Night. Shows sequences of nighttime scenes, often capturing the movement of stars, phases of the moon, and nocturnal activities.

  • Cloud. Shows the formation, movement, and dissipation of clouds, providing a dynamic view of the ever-changing sky.

  • Lunar Eclipse. Shows the gradual movement of the moon through the Earth’s shadow and the resulting changes in appearance during a lunar eclipse.

  • Rainbow. Captures the formation, duration, and fading of rainbows, providing a colorful display over time.

  • Sky. Captures a variety of atmospheric phenomena such as cloud movements, sunrises, sunsets, and weather changes over time.

  • Snowstorm. Shows the accumulation of snow and the changing conditions during and after a snowstorm.

  • Storm. Highlights the intensity and movement of storm clouds and lightning during various types of storms.

  • Sunrise. Captures the gradual increase in light and the awakening of the environment during sunrise.

  • Sunset. Showcases the beautiful colors and gradual fading of light as the day ends during sunset.

  • Aurora. Captures the dynamic changes and movement of the Northern and Southern Lights, showcasing the evolving natural light displays over time.

  • Tide. Illustrates the rise and fall of sea levels and their impact on coastal landscapes over time.

  • Wind. Captures the effects of wind on landscapes, including the movement of vegetation, dust storms, and changing cloud patterns over time.

  • Seasons. Shows the dramatic changes across different seasons, highlighting the transformation of landscapes throughout the year.

  • Nature. Captures various natural scenes, including the growth of plants, changes in landscapes, and wildlife activity.

  • Beach. Illustrates the changes in tides, waves, and shifting weather conditions throughout the day.

  • Desert. Shows the dramatic changes in light, temperature, and atmosphere in desert landscapes over time.

  • Forest. Illustrates changes in foliage, light patterns, and wildlife activity in forests throughout the day or seasons.

  • Grassland. Highlights the subtle yet significant changes in vegetation and weather in grasslands over time.

  • Lake. Captures reflections, water level changes, and the transformation of surrounding landscapes.

  • Mountain. Showcases changes in light, weather, and cloud movement around mountainous peaks over time.

  • Ocean. Highlights the continuous motion of waves, tides, and the impact of weather on ocean scenes over time.

  • Plain. Shows the transformation of open landscapes due to changing light and weather conditions over time.

  • River. Illustrates the flow of water, changes in water levels, and the transformation of surrounding landscapes over time.

  • Valley. Highlights changes in light, weather, and seasonal transformations in valley areas over time.

Physical:

  • Baking. Shows the transformation of dough or batter as it rises and turns into baked goods, highlighting changes in color, texture, and volume over time.

  • Cooking. Shows the various stages of food preparation and cooking, highlighting changes in texture, color, and form.

  • Candle Burning. Illustrates the gradual melting and burning of a candle, including changes in the wax and the flickering flame.

  • Tea Diffusing. Illustrates how tea leaves release their color and flavor into hot water, showing the gradual diffusion process and changes in the liquid.

  • Corrosion. Captures the slow process of materials deteriorating due to chemical reactions with their environment, often resulting in rust or other forms of decay.

  • Decompose. Shows organic materials breaking down over time, illustrating the process of decomposition and the changes in form and structure.

  • Fruit Rotting. Illustrates the gradual decay and breakdown of fruit, showing changes in color, texture, and structure as it rots.

  • Explosion. Captures the rapid and dramatic release of energy, showing the sudden change in materials and the environment.

  • Burning. Captures the process of combustion, showing how materials ignite, burn, and reduce to ash or other residues.

  • Gasification. Shows the process of a solid or liquid turning into gas, highlighting the changes in state and movement of particles.

  • Ice Melting. Captures the transition of ice from solid to liquid, showing the gradual melting process and changes in shape and volume.

  • Ink Diffusing. Illustrates how ink spreads and disperses in a liquid, showing the dynamic patterns and changes in concentration over time.

  • Melting. Shows the process of a solid turning into a liquid, highlighting changes in form and consistency as the material melts.

  • Rusting. Captures the slow formation of rust on metal surfaces, showing the chemical changes and resulting texture and color changes.

  • Water Freezing. Shows the transition of water from liquid to solid, capturing the formation of ice and changes in volume and structure.
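
Since each search term includes the phrase "time-lapse", the queries can be assembled mechanically from the category lists above. The sketch below shows one way this might look for a small subset of the terms; the exact query format (lowercased term followed by the phrase) is an assumption, and `build_queries` is a hypothetical helper rather than part of the released code.

```python
# A small subset of the subcategories listed above, grouped by major category.
SEARCH_TERMS = {
    "Biological": ["Blooming", "Mushroom"],
    "Human-Created": ["3D Printing", "Building"],
    "Meteorological": ["Sunrise", "Aurora"],
    "Physical": ["Ice Melting", "Rusting"],
}


def build_queries(terms_by_category, phrase="time-lapse"):
    """Return (category, query) pairs, appending the phrase to each term.

    The query format is an assumption made for illustration.
    """
    return [
        (category, f"{term.lower()} {phrase}")
        for category, terms in terms_by_category.items()
        for term in terms
    ]


if __name__ == "__main__":
    for category, query in build_queries(SEARCH_TERMS):
        print(f"{category}: {query}")
```

Keeping the category alongside each query makes it straightforward to label retrieved videos with their major type and subcategory.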

Appendix F Licensing, Hosting and Maintenance Plan

Author Statement.    We take full responsibility for the licensing, distribution, and maintenance of our ChronoMagic-Bench and ChronoMagic-Pro.

License.    ChronoMagic-Bench and ChronoMagic-Pro are released under the CC-BY 4.0 license.

Hosting.    The code and dataset are uploaded to GitHub and HuggingFace and made public. The dataset is in the JSON file format.

Metadata.    The metadata and benchmark are uploaded to HuggingFace: https://huggingface.co/spaces/BestWishYsh/ChronoMagic-Bench.