Exploring the Limits of VLMs: A Dataset for Evaluating Text-to-Video Generation✱

Published: 31 December 2024

Abstract

Vision language models (VLMs) integrate vision and language by learning from images and their textual descriptions, enabling the generation of text from images and vice versa. VLMs are used for various tasks such as image captioning and visual question answering. As VLMs extend to video generation tasks, significant challenges arise. The generated videos often lack temporal consistency, and there are alignment issues between the generated video content and the input text. This work proposes a set of prompts for systematically evaluating VLMs on text-to-video generation, focused on object coherence and temporal consistency. The dataset of prompts covers two categories: first, the complexity of prompts, and second, the number of objects and actions. The first category consists of four levels of prompt complexity: simple, mid-level, high-level, and unique and rare object prompts. The second category also consists of four levels: single object, single object in action, multiple objects, and multiple objects in action. Thus, the dataset is a combination of 16 different prompt scenarios, where each scenario has 10 prompts, resulting in a dataset of 160 prompts. This work explores and evaluates the outputs of three models from the family of diffusion-based VLMs for the task of text-to-video generation. The videos generated by the models were assessed by five participants on a 0-5 Likert scale. The VLMs under study generated temporally consistent videos in only 33.63% and properly formed objects in only 39.43% of the total evaluation scenario prompts. Most of these prompts belong to the categories of single objects and simple prompts. Model performance drops as the number of objects and actions and the prompt complexity increase. All models perform poorly on action scenarios with rare prompts. These results reflect the limitations of VLMs in generating videos. The proposed dataset can be extended to other video generation models to benchmark their performance on the basic aspects of consistency and alignment in videos.

1 Introduction

Recent advances in vision language models (VLMs) such as DALL-E 3 and Midjourney generate images from text [1, 16]. VLMs learn the relationships between images and their corresponding textual descriptions; based on these learned representations, they generate images from text and vice versa. There are different families of VLMs for the text-to-image/video generation task, such as GAN-based, diffusion-based, and transformer-based models. The focus of this work is on diffusion-based models. Any diffusion-based vision language model has three primary elements: an encoder, diffusion blocks, and a decoder, as shown in Figure 1. The number of input modalities indicates whether the model is multimodal, and each modality has an associated encoder block.
An encoder block in a VLM transforms input data into dense vector representations. In text encoders, layers of transformers or RNNs process the text, encoding its meaning and structure. The embeddings from the encoders are fed to the diffusion block, where both the forward diffusion process and the denoising process take place during training; at inference time, only the denoising process takes place. In the denoising process, the model is trained to reverse the forward diffusion by removing noise step by step from the noisy images, conditioned on text features, to produce an image embedding representing the target image. Finally, the decoder block converts the embedding into an image. Image generation has gained significant popularity recently, and vision language models are now building on it to extend their applications to generating videos from text.
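To make this pipeline concrete, the following is a minimal, illustrative sketch of a text-conditioned reverse-diffusion loop in Python (PyTorch). The text_encoder, unet, scheduler, and decoder objects and their call signatures are assumptions standing in for the blocks of Figure 1; this is not the API of any specific model.

```python
import torch

@torch.no_grad()
def generate_image(prompt, text_encoder, unet, scheduler, decoder, steps=50):
    """Illustrative text-conditioned reverse diffusion (inference only).

    Assumed interfaces (not a real library API):
      text_encoder(prompt)            -> text embedding tensor
      unet(latent, t, cond)           -> predicted noise at timestep t
      scheduler.step(pred, t, latent) -> slightly less noisy latent
      decoder(latent)                 -> decoded image tensor
    """
    cond = text_encoder(prompt)              # encoder block: text -> embedding
    latent = torch.randn(1, 4, 64, 64)       # start from pure Gaussian noise
    scheduler.set_timesteps(steps)
    for t in scheduler.timesteps:            # reverse diffusion, step by step
        noise_pred = unet(latent, t, cond)   # predict the noise added at step t
        latent = scheduler.step(noise_pred, t, latent)  # remove a little noise
    return decoder(latent)                   # decoder block: latent -> image
```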
The key challenge in text-to-video generation is the lack of large datasets of text-video pairs [22]. Nevertheless, text-image datasets such as LAION-5B [17], WebLI [4], and Visual Genome [12], and text-video datasets such as YFCC [7], are frequently used to train VLMs. Text-to-video models require heavy training, which demands high computational power and increases computational cost. Unlike image generation, which produces a single frame, video generation involves generating multiple frames. A central challenge in text-to-video generation is maintaining temporal consistency across frames while ensuring the generated videos align with the input text. Addressing temporal consistency and text-video alignment is the key factor in creating VLMs for text-to-video generation.
Figure 1: A generic block diagram of diffusion models for text-to-image generation.
To understand the shortcomings of VLMs in temporal consistency and text-video alignment, a dataset is required for evaluating them on these grounds and benchmarking their performance. Such datasets exist for images but not for videos, leaving a research gap to be explored. To address this gap, a dataset of prompts for evaluating VLMs is proposed, which identifies the capabilities of VLMs in generating videos. This dataset tests a VLM's output with respect to object formation, action consistency, and temporal consistency using 16 different types of prompt scenarios. The evaluation reveals that VLMs have notable performance limitations, particularly in scenarios involving multiple objects and multiple objects in action with complex prompts describing high-level, rare, and unique objects. On the other hand, for scenarios with simple prompts and single objects, all models show decent performance, with close to 40% of videos aligned to the text and approximately 32% temporally consistent.
The rest of the paper is organized as follows. The existing families of VLMs for text-to-image/video generation are briefly described in Sec. 2. The proposed method is formally described in Sec. 3. The experiments are detailed in Sec. 4. The results are discussed in Sec. 5, and the conclusion of the work is presented in Sec. 6.

2 Prior Art

In recent years, there has been significant research on image generation from text, but far fewer works have made substantial advances in video generation from text. This section reviews the state-of-the-art methodologies in this field.
The approaches are broadly categorized into three families of models: transformer-based, diffusion-based, and GAN-based models.

2.1 GAN Based Methods

This section discusses GAN-based models. The IRC-GAN framework consists of three parts: a text encoder network, a recurrent transconvolutional generator network, and an introspective discriminator network, and it handles two critical issues that arise while generating videos [5]. First, the generated frames need to be realistic and temporally consistent. Second, the text and video content need to be relevant to each other. The recurrent transconvolutional generator integrates LSTM cells with 2D transconvolutional layers, enabling frame generation based on previous frames. Mutual information introspection helps to semantically align the generated video with the text: it measures semantic consistency to minimize the semantic distance between the generated video and the corresponding text.
The TiVGAN model generates videos frame by frame [10]. The training process is decomposed into two stages: text-to-image generation and evolutionary generation. To generate an image from text, the text is first encoded into a feature vector, which is then processed by a recurrent unit to create an input vector for a generator. The generator uses this input to produce an image that matches the text description. The model is trained in an adversarial framework to ensure the generated image aligns with the input text. In the evolutionary generation stage, the trained model is used to create a sequence of consecutive frames. At each step, the number of frames doubles, with the recurrent unit generating new vectors and the generator creating images. A step-specific discriminator ensures temporal consistency, while the image discriminator maintains image quality.
Video generation from text (VGT) has three model components [13]: conditional gist generators, video generators, and video discriminators. First, a conditional VAE model is used to generate the gist of the video from the input text, where the gist is an image that helps the model predict the background color and object layout for the desired video. The generator creates synthetic video frames to challenge the discriminator, which distinguishes between real and generated frames. The generator and discriminator compete in a minimax game: the generator aims to produce more realistic frames, while the discriminator improves its detection of synthetic content.

2.2 Diffusion Based Methods

This section covers methods that generate images using diffusion models and subsequently convert the images into videos.
Text2Video-Zero is a diffusion model used to generate video from text [9]. Text2Video-Zero employs a zero-shot approach by utilizing the capabilities of currently available text-to-image synthesis techniques such as Stable Diffusion [15] and adapting them to the video domain. The method is termed zero-shot because it generates video frames from text descriptions without requiring prior training or fine-tuning on video datasets. Instead of depending on large-scale text-video paired data, it leverages existing text-to-image models to create temporally consistent video frames directly from text. This approach eliminates the need for extensive video-specific training, making the process more cost-effective and convenient.
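For illustration, the sketch below shows how such a zero-shot pipeline is typically driven from a Stable Diffusion checkpoint using the Hugging Face diffusers library. Exact argument names and the returned frame format can differ between diffusers versions, so treat this as a hedged usage example rather than the setup used in this work.

```python
import torch
import imageio
from diffusers import TextToVideoZeroPipeline

# Reuse a pre-trained Stable Diffusion checkpoint zero-shot for short video clips.
model_id = "runwayml/stable-diffusion-v1-5"
pipe = TextToVideoZeroPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

prompt = "A cat is sitting on a chair"        # an SO_SP prompt from Table 1
result = pipe(prompt=prompt, video_length=8)  # a handful of temporally linked frames
frames = [(frame * 255).astype("uint8") for frame in result.images]

imageio.mimsave("cat_on_chair.mp4", frames, fps=4)  # write the frames as a short clip
```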
SHOW-1 is another model that generates videos from text using a diffusion process [26]. The model is trained on the WebVid-10M dataset, which contains roughly 10 million video-text pairs. This work implements the first hybrid model that integrates pixel and latent video diffusion models (VDMs) for text-to-video generation. Pixel VDMs are used to produce a low-resolution video that has strong text-to-video alignment. After that, a latent VDM is used to convert the low-resolution video into a high-resolution one.
MagicVideo-V2 proposes a multi-stage T2V framework that integrates text-to-image (T2I), image-to-video (I2V), video-to-video (V2V), and video frame interpolator (VFI) into the text-to-video generation pipeline [21]. The T2I module generates an initial image from the text prompt, setting the foundation for the I2V module to produce low-resolution keyframes, which the V2V module enhances in resolution and detail. Finally, the frame interpolation module smooths the motion in the video.
ModelScopeT2V is a text-to-video synthesis model developed from the text-to-image synthesis model Stable Diffusion [20]. Spatial-temporal blocks are incorporated into ModelScopeT2V for consistent frame generation and seamless movement transitions. Three components are used in ModelScopeT2V: a text encoder, a denoising UNet, and a VQGAN. The denoising UNet operates within the latent space. During training, a diffusion process gradually adds Gaussian noise over T steps, resulting in progressively less informative data. During inference, the UNet predicts the noise added at each step, ultimately allowing the generation of an image from random noise. The VQGAN decoder is used to generate a video from these images.
Simple prompt (SP):
- Single object (SO): A cat is sitting on a chair
- Single object in action (SOA): A leaf falls from a tree
- Multiple objects (MO): A cat and a dog play in the backyard
- Multiple objects in action (MOA): A group of friends plays a game of soccer
Mid-level prompt (MP):
- Single object (SO): A dog is running around a tree in the park
- Single object in action (SOA): A gymnast performs a back flip on a balance beam
- Multiple objects (MO): The birds and the squirrels are gathering food in the park
- Multiple objects in action (MOA): A chef flips pancakes on a sizzling griddle
High-level prompt (HP):
- Single object (SO): An origami dragon unfurls its wings and takes flight
- Single object in action (SOA): A professional surfer rides a massive wave with perfect balance
- Multiple objects (MO): Two robots collaborate to assemble a spaceship on Mars
- Multiple objects in action (MOA): A bird is flying over a river while a fish jumps out of the water
Rare and unique objects prompt (RP):
- Single object (SO): A levitating top hat spins mysteriously in mid-air
- Single object in action (SOA): A unicorn is drinking water from a golden fountain in a forest
- Multiple objects (MO): A group of children and adults are playing various sports in a large park
- Multiple objects in action (MOA): A team of researchers studies the effects of time dilation in a controlled laboratory setting
Table 1: Examples of input prompts for each scenario of static single and multiple objects and action by single and multiple objects.

2.3 Transformers Based Methods

In this section, the focus is on transformers-based text-to-video generation methods.
CogVideo is a 9B-parameter model trained by inheriting a pre-trained text-to-image model, CogView2 [6, 8]. This work uses a multi-frame-rate hierarchical training strategy, which helps in better alignment of text with video. The backbone of CogVideo is a transformer with dual-channel attention, consisting of 48 layers and 48 attention heads with a hidden size of 3,072 in each channel; the inherited CogView2 [6] has 6 billion parameters.
VideoPoet is a method for converting any autoregressive language model into a high-quality video generator [11]. It includes key components such as a pre-trained MAGVIT V2 [24] video tokenizer, which converts images and videos into discrete codes compatible with text-based language models. An autoregressive language model predicts the next video token in a sequence, learning across video, image, and text modalities. The model integrates various generative learning objectives, including text-to-video, text-to-image, and video frame continuation, enabling zero-shot capabilities. VideoPoet generates videos with high temporal consistency.

2.4 Evaluation of VLMs

The attribution, relation, and order (ARO) benchmark evaluates VLMs' understanding of object properties, relational context, and order sensitivity [25]. Current VLMs often perform poorly in relational understanding and order sensitivity, even though they are trained on large datasets. This issue arises because models can perform well on standard tests without truly understanding compositional information. To improve VLMs, the authors propose a method called composition-aware hard negative mining. First, the closest neighboring images are added to each batch to help models recognize fine-grained differences between similar scenes. Second, captions with scrambled word order are added to each batch to help models distinguish correct from incorrect sequences. This simple finetuning significantly enhances the model's understanding of attributes and relationships.
An image-based benchmark [14] examines recent text-to-image (T2I) generation models such as Stable Diffusion [15], which have improved in generating high-resolution images from text descriptions but still face issues like artifacts, misalignment, and low aesthetic quality. Inspired by reinforcement learning with human feedback (RLHF), the researchers enhanced feedback signals by marking problematic image regions and identifying misrepresented or missing text elements. They collected detailed human feedback on 18,000 generated images (RichHF-18K) and trained a multimodal transformer to predict this feedback. Additionally, [2, 18, 19] help in understanding the generation of videos from text.
Existing benchmarks do not evaluate VLMs on object formation and action consistency, highlighting a gap in current research. This work is dedicated to addressing this gap by evaluating VLMs on these two evaluation criteria.

3 Methods

The previous section discusses the challenges in generating videos with VLMs. A dataset of prompts is generated to evaluate the videos generated by VLMs on the grounds of the aforementioned challenges. This section discusses the intuition behind the design of the dataset in detail. The dataset is designed to thoroughly test VLM-generated output on the grounds of temporal consistency and text-video alignment. The two main categories are discussed below:
Simple prompts (SP): 10 prompts each for SO, SOA, MO, and MOA (40 total)
Mid-level prompts (MP): 10 prompts each for SO, SOA, MO, and MOA (40 total)
High-level prompts (HP): 10 prompts each for SO, SOA, MO, and MOA (40 total)
Rare and unique objects prompts (RP): 10 prompts each for SO, SOA, MO, and MOA (40 total)
Total prompts per scenario: 40 for each of SO, SOA, MO, and MOA; 160 prompts in total
Table 2: Number of input prompts for each scenario.

3.1 Number of objects and their actions

3.1.1 Single object scenarios.

This section discusses single-object scenarios, which are divided into two subcategories: static single objects and single objects in action.
Static single object: Prompts describing a single object where no action is being performed on the object. Example cases are:
Zoom In/Out: The situations where the video zooms in or out on a single object.
Stable View: The instances in which the background changes constantly but the objects in the video stay stationary.
Single object in action: Prompts describing actions being performed on a single object.

3.1.2 Multiple objects scenarios.

This section discusses multiple-object scenarios, which are divided into two subcategories: static multiple objects and multiple objects in action.
Static multiple objects: Prompts describing multiple objects where no actions are being performed on the objects. Example cases are:
Zoom In/Out: The situations where the video zooms in or out on multiple objects.
Stable View: The instances in which the background changes constantly but the objects in the video stay stationary.
Multiple objects in action: Prompts describing actions being performed on multiple objects.
It should be noted that when multiple objects are in action, their actions might either be correlated with or independent of each other.

3.2 Complexity of prompts

This section categorizes prompt complexity into four levels to evaluate the performance of VLMs when presented with varying amounts of detail.
The levels of complexity of prompts are:
Simple prompts: Simple statements that describe objects and behaviors in an easily understandable way.
Mid-level prompts: Prompts with moderate complexity, involving more detailed descriptions and actions.
High-level prompts: Prompts that include complicated details and subtle actions.
Rare and unique objects prompts: Prompts describing unusual or imaginative situations that make it difficult for the VLM to produce accurate videos.
Covering the cases discussed in Secs. 3.1 and 3.2 during prompt generation results in a combination of 16 scenarios, a few examples of which are shown in Table 1. The generated dataset is available in the supplementary materials.
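For concreteness, the sketch below shows how the 16-scenario structure (4 complexity levels x 4 object/action categories, 10 prompts each) could be organized programmatically. The few prompt strings are the examples from Table 1; the dictionary layout and scenario naming are illustrative assumptions, not the released dataset format.

```python
from itertools import product

complexity_levels = ["SP", "MP", "HP", "RP"]    # simple, mid-level, high-level, rare/unique
object_categories = ["SO", "SOA", "MO", "MOA"]  # single/multiple objects, static or in action
PROMPTS_PER_SCENARIO = 10

# A possible organization of the prompts, keyed by "<objects>_<complexity>".
example_prompts = {
    "SO_SP": ["A cat is sitting on a chair"],
    "MOA_SP": ["A group of friends plays a game of soccer"],
    "SOA_RP": ["A unicorn is drinking water from a golden fountain in a forest"],
    # ... the full dataset has 10 prompts for every scenario
}

scenarios = [f"{obj}_{comp}" for comp, obj in product(complexity_levels, object_categories)]
assert len(scenarios) == 16
print(f"{len(scenarios)} scenarios x {PROMPTS_PER_SCENARIO} prompts = "
      f"{len(scenarios) * PROMPTS_PER_SCENARIO} prompts")   # 160
```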

3.3 Evaluation criteria

Two main criteria—action consistency and object formation—will be the primary focus for evaluating VLM outputs using the previously mentioned sixteen possible combinations of prompts. These criteria are used to assess the videos generated by the models for text-video misalignment and temporal inconsistency.
Object formation: Evaluating the generated video involves checking whether the object described in the provided prompt is accurately formed. This criterion is used to assess the model’s object formation capabilities across the sixteen combinations of input prompts.
Action consistency: Evaluating the consistency with which the object in the video performs the action stated in the provided prompt. Both single objects and multiple objects in motion are evaluated using this criterion. This is the main factor in evaluating VLMs for temporal consistency.

3.4 Importance of the proposed dataset

The proposed dataset of prompts is crucial as it provides a structured and comprehensive framework for evaluating and benchmarking VLMs based on temporal consistency and text-video alignment. Given the lack of an established dataset to benchmark VLMs in video generation on these aspects, it is essential to have a dataset of prompts targeting them, as they represent the main challenges VLMs face in video generation.
This dataset helps with the following:
Identifying the strengths and weaknesses of different VLMs in terms of text-video alignment and temporal consistency.
Identifying the specific scenarios in which VLMs fail to generate videos.
Providing valuable insights into evaluating VLMs, which helps in extending and building similar datasets to benchmark video generation.

4 Experiments

This section outlines the dataset generation and implementation details of this study. This work introduces a benchmark based on human evaluation for text-to-video generation. To evaluate the performance of VLM outputs, a thorough examination of the benchmark results is needed to identify specific areas where the models fall short in producing videos.

4.1 Dataset

A dataset of prompts is developed with careful consideration of the number of objects, their actions, and the complexity levels of the prompts. The dataset is divided into two categories, the number of objects and their actions and the complexity of prompts, each with four levels. Table 2 provides an overview of the number of input prompts. A total of 160 input prompts, representing the 16 scenarios, is generated. The categories are chosen to cover real-life video generation scenarios, where the number of objects and actions increases and the complexity of prompts also rises.

4.2 Implementation details

Three diffusion models, Text2Video-Zero [9], LAVIE [23], and Videocrafter2 [3], are considered for evaluation.
The models are provided with input prompts from the benchmark dataset, resulting in the generation of 160 videos per model. In total, 480 videos are generated, with each video spanning two to four seconds. The Text2Video-Zero [9] and Videocrafter2 [3] models generated videos at a resolution of 512x512, while LAVIE [23] generated videos at a resolution of 320x512. The videos generated by each model are evaluated by five participants. The participants undertook a Likert test, wherein they were presented with a video and asked to rate it on a scale from 0 to 5 based on their observations. The Likert test comprised two criteria: object formation and action consistency. This evaluation method provides detailed insights into the performance of each model in generating coherent and contextually accurate videos from prompt descriptions.
First, participants rated whether the object described in the prompt was accurately formed in the video. Second, participants evaluated whether the action performed by the objects matched the prompt description. For object formation, the Likert scale ratings are as follows: 0: no object formed/object formed but not matching the prompt description; 1: object formed but highly distorted; 2: object formed but less distorted; 3: object formed is average; 4: object formed is good; 5: object formed is perfect. For the action performed, the ratings are: 0: no action performed/action performed but not matching the input prompt; 1: action performed but highly inconsistent; 2: action performed but less inconsistent; 3: action performed is average in consistency; 4: action performed is consistent; 5: action performed is highly consistent. This detailed rating system allowed for a nuanced assessment of object formation and action consistency in the generated videos.
This thorough evaluation process ensured a comprehensive assessment of each model's performance in generating accurate and consistent videos.
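As a concrete illustration, the collected ratings could be aggregated into per-scenario means and standard deviations in the spirit of Figure 2, as in the sketch below. The CSV layout and column names are assumptions about how the ratings might be stored, not the authors' format.

```python
import pandas as pd

# Aggregate the Likert ratings into per-scenario means and standard deviations.
ratings = pd.read_csv("likert_ratings.csv")
# assumed columns: model, scenario (e.g. "SO_SP"), participant,
#                  object_formation (0-5), action_consistency (0-5)

summary = (
    ratings
    .groupby(["model", "scenario"])[["object_formation", "action_consistency"]]
    .agg(["mean", "std"])       # mean and spread of the five participants' ratings
    .round(2)
)
print(summary)
```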
Figure 2: Plot of the mean and standard deviation of the ratings from the Likert test of evaluation criteria: object formation and action consistency for the models Text2Video-zero [9], LAVIE [23], and Videocrafter2 [3].

5 Results and Discussion

As discussed in Section 4.2, five participants conducted a human evaluation of the outputs from three different VLMs: Text2Video-zero [9], LAVIE [23], and Videocrafter2 [3]. The results obtained from the participants for the Likert test were analyzed, and the outcomes are discussed in this section.
Figure 3: Percentage distribution of the Likert test for three different models: Text2Video-Zero [9], LAVIE [23], and Videocrafter2 [3] respectively. Here, in each sub-figure, the left pie chart corresponds to object formation and the right corresponds to the action consistency of the respective models.
Figure 4: Histogram plots of Likert test average ratings of object formation and action consistency for all 16 scenarios where each row has increasing prompt complexity from left to right.
The results, as depicted in Figure 2, reveal that a model's ability to generate videos diminishes as the complexity of the prompts increases from SO_SP to MOA_RP. The plot demonstrates a decline in average ratings with rising prompt complexity for both object formation and action consistency. For instance, as shown in Figure 2, the rating for SO_SP is approximately 3.5-4 for object formation and approximately 3 for action consistency. However, as the number of objects, actions, and the complexity of the prompts increase, the rating drops to a range of 1-1.5 in both cases, except for Videocrafter2 [3], which outperforms the others with a rating of approximately 3.5 in object formation and approximately 2.5 in action consistency. Figure 2 also illustrates a noticeable improvement in ratings for SP; this spike indicates that the models perform better with simple prompts. Visual samples are shown in Figures 5(a) and 5(b), which illustrate video frames with consistent action and appropriate object formation. The models perform better at object formation than at action consistency, which is evident from the average ratings shown in Figure 2. Additionally, the models struggled with video generation as the input prompt shifted from SO to MO, and a similar trend was seen when comparing SOA to MOA. This is evident from Figure 5(d), which highlights video frames with inconsistent action; the circle highlights inappropriate object formation such as distorted human faces and legs.
The percentage distribution of ratings assigned to each model is also calculated and is depicted in Figure 3. The pie charts illustrate the percentage distribution of the average Likert test ratings. Notably, the charts show that no scenario received a rating of 5, indicating that none of the generated videos achieved the highest quality. Moreover, no ratings were categorized as inappropriate, suggesting that all models were capable of generating some relevant content in videos; as shown in Figure 5, the frames are neither perfect nor inappropriate with respect to the prompt description. However, the distribution in Figure 3 reveals that the majority of ratings fall within the bad range, accounting for 45%-55% of the evaluations. Ratings in the good range were minimal, between 1% and 19%, indicating that the generated videos were generally poor in terms of object formation and action consistency. Ratings categorized as average fell between 13% and 36%, reflecting that while some videos were moderately acceptable, they still lacked good object formation and action consistency.
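A Figure 3-style percentage distribution could be computed by binning the per-video average ratings into labelled bands and reporting each band's share per model, as sketched below. The CSV layout, the extra prompt_id column, and the band edges are assumptions inferred from the 0-5 rating scale, not the authors' exact procedure.

```python
import pandas as pd

# Bin per-video average object-formation ratings into labelled bands per model.
ratings = pd.read_csv("likert_ratings.csv")
# assumed columns: model, scenario, prompt_id, participant,
#                  object_formation (0-5), action_consistency (0-5)

per_video = ratings.groupby(["model", "scenario", "prompt_id"])["object_formation"].mean()

bins = [-0.1, 0.5, 1.5, 2.5, 3.5, 4.5, 5.0]
labels = ["inappropriate", "very bad", "bad", "average", "good", "perfect"]
banded = pd.cut(per_video, bins=bins, labels=labels)

distribution = (
    banded.groupby(level="model")         # one distribution per model
          .value_counts(normalize=True)   # fraction of videos in each band
          .mul(100)
          .round(1)
)
print(distribution)
```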
A histogram plot of the average ratings of object formation and action consistency given by the participants to each model output across all scenarios is shown in Figure 4. The plot further illustrates that the ratings for simple prompts in all four scenarios range from bad to good. However, the trend suggests that as prompt complexity increases, the models' ability to generate accurate and coherent videos becomes progressively worse. This is evidenced by a leftward shift in the plot when transitioning from simple prompts (SP) to rare and unique prompts (RP). For rare and unique object prompts, almost all ratings are at or below average. Additionally, as the number of objects and actions increases from top to bottom in Figure 4, the ratings drop significantly, reflecting a marked decline in the quality of video generation. The histogram plots also give meaningful insights into the individual performance of the models. As seen in Figure 4, in the SO_SP scenario all models perform well, but the LAVIE [23] model outperforms the other two with all its ratings given as good; a similar trend is seen across all SO scenarios. In the MOA_RP scenario all models perform poorly, with ratings mostly given as very bad. In the SO_RP scenario, the Text2Video-Zero [9] model underperforms the other two, with ratings given as bad.
Figure 5: Sample frames from videos generated by the LAVIE [23], Videocrafter2 [3], and Text2Video-Zero [9] models, demonstrating both successful and failed cases of generation. Here, the yellow circle highlights the inconsistent region of the video.
Model                 Object formation    Action consistency
LAVIE [23]            2.53                1.88
Videocrafter2 [3]     2.57                1.98
Text2Video-Zero [9]   2.29                1.85
Table 3: Average ratings of the Likert test for object formation and action consistency for the models.
The average ratings of all three models for the two evaluation criteria, object formation and action consistency, are presented in Table 3. The ratings for object formation and action consistency are approximately the same across all three models. The object formation ratings are below 3, indicating that all the models perform below average in terms of object formation, and the action consistency ratings are below 2, indicating that all the models fall into the bad category in terms of action consistency. Figure 5(c) depicts a scenario where action consistency is inappropriate, such as the ball not being present in the starting and ending frames. Among the three models, Videocrafter2 [3] outperforms the other two in both evaluation criteria, as depicted in Table 3.
Sample frames from videos generated by the models LAVIE [23], Videocrafter2 [3], and Text2Video-Zero [9] for a few scenarios are shown in Figure 5. The results show that the models struggle with generating unique objects and often miss some objects. Figure 5(e) shows video frames where the goalkeeper is not formed, even though the prompt states that a goalkeeper should be present in the video. The model also fails to maintain temporal consistency, which is visible at the goal post and net as shown by the highlighted yellow circle. Given these insights from the evaluations, it is clear that there is room for improvement in VLMs for producing relevant and coherent videos.

6 Conclusion

This study benchmarks the performance of three diffusion-based models for text-to-video generation on the grounds of object formation and action consistency. A dataset of 160 prompts is developed to perform this benchmarking, where the prompts are categorized across varying levels of complexity and the number of objects and their actions. The results indicate that model performance declines as prompt complexity increases, with lower average ratings and greater variability in object generation and action consistency. The best average ratings of approximately 3 are obtained with simple prompts and single objects, whereas scenarios involving multiple objects, particularly in action or involving rare and unique objects, exhibit the poorest performance of around 2-2.5 on the Likert scale. The evaluation highlights specific scenarios where vision language models (VLMs) exhibit significant performance shortcomings, which lays a foundation for understanding these shortcomings and addressing them to build better VLMs for text-to-video generation. While this benchmark is currently employed to assess diffusion models, it can be extended to other VLMs as well. This extension would provide a more comprehensive assessment of VLM capabilities and help identify further failure areas in video generation. In the future, the developed dataset can be used to evaluate various video generation models.

References

[1]
James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. 2023. Improving image generation with better captions. https://cdn.openai.com/papers/dall-e-3.pdf (2023).
[2]
Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. 2023. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023).
[3]
Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. 2024. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7310–7320.
[4]
Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al. 2022. Pali: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794 (2022).
[5]
Kangle Deng, Tianyi Fei, Xin Huang, and Yuxin Peng. 2019. IRC-GAN: Introspective Recurrent Convolutional GAN for Text-to-video Generation. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI). 2216–2222.
[6]
Ming Ding, Wendi Zheng, Wenyi Hong, and Jie Tang. 2022. Cogview2: Faster and better text-to-image generation via hierarchical transformers. Advances in Neural Information Processing Systems 35 (2022), 16890–16902.
[7]
Jared Heinly, Johannes L Schonberger, Enrique Dunn, and Jan-Michael Frahm. 2015. Reconstructing the world* in six days*(as captured by the yahoo 100 million image dataset). In Proceedings of the IEEE conference on computer vision and pattern recognition. 3287–3295.
[8]
Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. 2022. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868 (2022).
[9]
Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. 2023. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 15954–15964.
[10]
Doyeon Kim, Donggyu Joo, and Junmo Kim. 2020. Tivgan: Text to image to video generation with step-by-step evolutionary generator. IEEE Access 8 (2020), 153113–153122.
[11]
Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Rachel Hornung, Hartwig Adam, Hassan Akbari, Yair Alon, Vighnesh Birodkar, et al. 2023. Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125 (2023).
[12]
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. 2017. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision 123 (2017), 32–73.
[13]
Yitong Li, Martin Min, Dinghan Shen, David Carlson, and Lawrence Carin. 2018. Video generation from text. In Proceedings of the AAAI conference on artificial intelligence, Vol. 32.
[14]
Youwei Liang, Junfeng He, Gang Li, Peizhao Li, Arseniy Klimovskiy, Nicholas Carolan, Jiao Sun, Jordi Pont-Tuset, Sarah Young, Feng Yang, et al. 2024. Rich human feedback for text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 19401–19411.
[15]
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10684–10695.
[16]
Janus Rose. 2022. Inside Midjourney, The Generative Art AI That Rivals DALL-E. VICE, July 19 (2022).
[17]
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. 2022. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems 35 (2022), 25278–25294.
[18]
Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela. 2022. Flava: A foundational language and vision alignment model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 15638–15650.
[19]
Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. 2022. Phenaki: Variable length video generation from open domain textual descriptions. In International Conference on Learning Representations.
[20]
Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. 2023. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571 (2023).
[21]
Weimin Wang, Jiawei Liu, Zhijie Lin, Jiangqiao Yan, Shuo Chen, Chetwin Low, Tuyen Hoang, Jie Wu, Jun Hao Liew, Hanshu Yan, et al. 2024. Magicvideo-v2: Multi-stage high-aesthetic video generation. arXiv preprint arXiv:2401.04468 (2024).
[22]
Wenhao Wang and Yi Yang. 2024. Vidprom: A million-scale real prompt-gallery dataset for text-to-video diffusion models. arXiv preprint arXiv:2403.06098 (2024).
[23]
Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. 2023. Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103 (2023).
[24]
Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Agrim Gupta, Xiuye Gu, Alexander G Hauptmann, et al. 2023. Language Model Beats Diffusion–Tokenizer is Key to Visual Generation. arXiv preprint arXiv:2310.05737 (2023).
[25]
Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. 2023. When and why vision-language models behave like bags-of-words, and what to do about it?. In The Eleventh International Conference on Learning Representations.
[26]
David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, and Mike Zheng Shou. 2023. Show-1: Marrying pixel and latent diffusion models for text-to-video generation. arXiv preprint arXiv:2309.15818 (2023).
