1 Introduction
Recent advances in vision language models (VLMs) such as DALL-E 3 and Midjourney generate images from text [1, 16]. VLMs learn the relationships between images and their corresponding textual descriptions; based on the learned representations, they generate images from text and vice versa. Several families of VLMs exist for the text-to-image/video generation task, such as GAN-based, diffusion-based, and transformer-based models. The focus of this work is on diffusion-based models. Any diffusion-based vision language model has three primary elements: an encoder, diffusion blocks, and a decoder, as shown in Figure 1. The number of input modalities indicates whether the model is multimodal, and each modality has an associated encoder block.
An encoder block in VLMs transforms input data into dense vector representations. In text encoders, layers of transformers or RNNs process the text, encoding its meaning and structure. The embeddings from the encoders are fed to the diffusion block, where both the forward diffusion process and the denoising process take place during training; at inference time, only the denoising process runs. In the denoising process, the model is trained to reverse the forward diffusion by removing noise step by step from noisy images, conditioned on the text features, to produce an embedding representing the target image. Finally, the decoder block converts this embedding into an image. Image generation has gained significant popularity recently, and vision language models are now extending these capabilities to generate videos from text.
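The interaction of the three blocks can be sketched as follows. This is a minimal, DDPM-style illustration with hypothetical text_encoder, denoiser, and decoder modules and 1-D tensors of noise-schedule coefficients; it is not the implementation of any specific model.

```python
import torch

def forward_diffusion(x0, t, alphas_cumprod):
    """Training only: noise a clean latent x0 to timestep t (t is a batch of integer steps)."""
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    xt = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    return xt, noise                                 # the denoiser is trained to predict `noise`

@torch.no_grad()
def generate(prompt_tokens, text_encoder, denoiser, decoder, alphas, alphas_cumprod, shape):
    """Inference: only the denoising pass runs, conditioned on the text embedding."""
    cond = text_encoder(prompt_tokens)               # encoder block -> dense text embedding
    x = torch.randn(shape)                           # start from pure Gaussian noise
    for t in reversed(range(len(alphas))):
        eps = denoiser(x, t, cond)                   # predict the noise present at step t
        a, a_bar = alphas[t], alphas_cumprod[t]
        x = (x - (1.0 - a) / (1.0 - a_bar).sqrt() * eps) / a.sqrt()
        if t > 0:
            x = x + (1.0 - a).sqrt() * torch.randn_like(x)   # stochastic DDPM update
    return decoder(x)                                # decoder block -> image
```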
The key challenge in text-to-video generation is the lack of large datasets of text-video pairs [22]. However, text-image datasets such as LAION-5B [17], WebLI [4], and Visual Genome [12], and text-video datasets such as YFCC [7], are frequently used to train VLMs. Text-to-video models require heavy training, which demands high computational power and increases computational cost. Unlike image generation, which produces a single frame, video generation involves generating multiple frames. A further challenge in text-to-video generation is maintaining temporal consistency across frames while ensuring that the generated videos align with the input text. Addressing temporal consistency and text-video alignment is the key factor in building VLMs for text-to-video generation.
To understand the shortcomings of VLMs in temporal consistency and text-video alignment, a dataset is required for evaluating them on these grounds and benchmarking their performance. Such datasets exist for images but not for videos, leaving a research gap to be explored. To address this gap, a dataset of prompts for evaluating VLMs is proposed, which identifies the capabilities of VLMs in generating videos. This dataset tests a VLM's output with respect to object formation, action consistency, and temporal consistency using 16 different prompt scenarios. The evaluation reveals that VLMs have notable performance limitations, particularly in scenarios involving multiple objects, or multiple objects in action, with complex prompts describing high-level, rare, and unique objects. On the other hand, for scenarios with simple prompts and single objects, all models show decent performance, with close to 40% of the videos aligned with the text and approximately 32% temporally consistent.
The rest of the paper is organized as follows. The existing families of VLMs for text-to-image/video generation are briefly described in Sec. 2. The proposed method is formally described in Sec. 3. The experiments are detailed in Sec. 4. The results are discussed in Sec. 5, and the conclusion is presented in Sec. 6.
2 Prior Art
In recent years, significant research has focused on image generation from text, but few works have made substantial advances in video generation from text. This section reviews the state-of-the-art methodologies in this field.
The approaches are broadly categorized into three families of models: transformer-based, diffusion-based, and GAN-based models.
2.1 GAN-Based Methods
This section discusses GAN-based models. The IRC-GAN framework [5] consists of three parts, a text encoder network, a recurrent transconvolutional generator network, and an introspective discriminator network, and it addresses two critical issues that arise while generating videos. First, the generated frames need to be realistic and temporally consistent. Second, the text and the video content need to be relevant to each other. The recurrent transconvolutional generator integrates LSTM cells with 2D transposed convolutional layers, enabling frame generation based on previous frames. Mutual information introspection helps semantically align the generated video with the text: it measures semantic consistency in order to minimize the semantic distance between the generated video and the corresponding text.
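A minimal sketch of the recurrent transconvolutional idea, an LSTM cell driving a stack of 2D transposed convolutions so that each frame is conditioned on the previous hidden state, is given below; the layer sizes and output resolution are illustrative and do not correspond to IRC-GAN's actual configuration.

```python
import torch
import torch.nn as nn

class RecurrentTransconvGenerator(nn.Module):
    """Illustrative only: an LSTM cell carries temporal context from frame to frame,
    and 2D transposed convolutions upsample each hidden state into an image."""

    def __init__(self, text_dim=256, hidden_dim=512):
        super().__init__()
        self.cell = nn.LSTMCell(text_dim, hidden_dim)
        self.upsample = nn.Sequential(
            nn.ConvTranspose2d(hidden_dim, 256, 4, 1, 0), nn.ReLU(),  # 1x1 -> 4x4
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(),         # 4x4 -> 8x8
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),          # 8x8 -> 16x16
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh(),            # 16x16 -> 32x32
        )

    def forward(self, text_emb, num_frames):
        batch = text_emb.size(0)
        h = text_emb.new_zeros(batch, self.cell.hidden_size)
        c = torch.zeros_like(h)
        frames = []
        for _ in range(num_frames):
            h, c = self.cell(text_emb, (h, c))             # each frame depends on the previous state
            frames.append(self.upsample(h[:, :, None, None]))
        return torch.stack(frames, dim=1)                  # (batch, num_frames, 3, 32, 32)
```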
The TiVGAN model generates videos frame by frame [10]. The training process is decomposed into two stages: text-to-image generation and evolutionary generation. To generate an image from text, the text is first encoded into a feature vector, which is then processed by a recurrent unit to create an input vector for a generator. The generator uses this input to produce an image that matches the text description, and the model is trained with an adversarial framework to ensure the generated image aligns with the input text. In the evolutionary generation stage, the trained model is used to create a sequence of consecutive frames. At each step, the number of frames doubles, with the recurrent unit generating new vectors and the generator creating images. A step-specific discriminator ensures temporal consistency, while the image discriminator maintains image quality.
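One possible reading of this frame-doubling scheme is sketched below; the recurrent_unit and generator callables, and the interleaving strategy, are assumptions made for illustration and do not reproduce TiVGAN's exact procedure.

```python
def evolutionary_generation(text_vec, recurrent_unit, generator, steps):
    """Illustrative frame-doubling loop: start from a single latent derived from the text
    and double the frame count at every step; `recurrent_unit` and `generator` are
    hypothetical callables standing in for the trained stage-one modules."""
    latents = [recurrent_unit(text_vec)]          # stage one produces a single-frame latent
    for _ in range(steps):
        expanded = []
        for z in latents:
            expanded.append(z)
            expanded.append(recurrent_unit(z))    # insert a new latent after each existing one
        latents = expanded                        # frame count doubles at every step
    return [generator(z) for z in latents]        # decode every latent into a frame
```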
Video generation from text (VGT) has three model components [13]: a conditional gist generator, a video generator, and a video discriminator. First, a conditional VAE model generates the gist of the video from the input text, where the gist is an image that helps the model predict the background color and object layout of the desired video. The generator then creates synthetic video frames to challenge the discriminator, which distinguishes between real and generated frames. The generator and discriminator compete in a minimax game: the generator aims to produce more realistic frames, while the discriminator improves its detection of synthetic content.
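This minimax game can be written in the standard conditional GAN form (a generic formulation rather than VGT's exact objective), with text $t$, gist $g$, real videos $v$, and noise $z$:

\[
\min_{G}\max_{D}\;
\mathbb{E}_{v \sim p_{\mathrm{data}}}\big[\log D(v \mid t)\big]
\;+\;
\mathbb{E}_{z \sim p_{z}}\big[\log\big(1 - D(G(z \mid t, g) \mid t)\big)\big]
\]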
2.2 Diffusion-Based Methods
This section covers methods that generate images using diffusion models and subsequently convert the images into videos.
Text2Video-Zero is a diffusion model for generating video from text [9]. It employs a zero-shot approach, utilizing the capabilities of currently available text-to-image synthesis techniques such as Stable Diffusion [15] and adapting them to the video domain. The method is termed zero-shot because it generates video frames from text descriptions without prior training or fine-tuning on video datasets. Instead of depending on large-scale text-video paired data, it leverages existing text-to-image models to create temporally consistent video frames directly from text. This eliminates the need for extensive video-specific training, making the process more cost-effective and convenient.
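Because the method builds directly on an off-the-shelf Stable Diffusion checkpoint, it can be run in a few lines through the diffusers library. The snippet below follows the library's documented usage pattern; the pipeline name, model identifier, and arguments are assumptions that may vary across diffusers versions.

```python
import torch
import imageio
from diffusers import TextToVideoZeroPipeline

# Reuse an existing text-to-image checkpoint; no video-specific training is involved.
model_id = "runwayml/stable-diffusion-v1-5"
pipe = TextToVideoZeroPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

prompt = "A panda is playing guitar on times square"
frames = pipe(prompt=prompt).images                     # list of frames as float arrays in [0, 1]
frames = [(f * 255).astype("uint8") for f in frames]
imageio.mimsave("video.mp4", frames, fps=4)             # write the frames out as a short clip
```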
SHOW-1 is another model that generates videos from text using a diffusion process [26]. The model is trained on the WebVid-10M dataset, which comprises roughly 10 million video-text pairs. This work implements the first hybrid model that integrates pixel-based and latent video diffusion models (VDMs) for text-to-video generation. Pixel VDMs are used to produce a low-resolution video with strong text-video alignment; a latent VDM then converts the low-resolution video into a high-resolution one.
MagicVideo-V2 proposes a multi-stage T2V framework that integrates text-to-image (T2I), image-to-video (I2V), video-to-video (V2V), and video frame interpolation (VFI) modules into the text-to-video generation pipeline [21]. The T2I module generates an initial image from the text prompt, setting the foundation for the I2V module to produce low-resolution keyframes, which the V2V module enhances in resolution and detail. Finally, the frame interpolation module smooths the motion in the video.
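The staged design amounts to a simple composition of the four modules; the sketch below uses hypothetical stage functions purely to illustrate the data flow described above, not MagicVideo-V2's actual interfaces.

```python
def magicvideo_v2_pipeline(prompt, t2i, i2v, v2v, vfi):
    """Illustrative data flow only; t2i, i2v, v2v, and vfi are hypothetical stand-ins
    for the four modules described above."""
    image = t2i(prompt)                  # T2I: reference image from the text prompt
    keyframes = i2v(image, prompt)       # I2V: low-resolution keyframes from the image
    refined = v2v(keyframes, prompt)     # V2V: enhance resolution and detail
    return vfi(refined)                  # VFI: interpolate frames to smooth the motion
```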
ModelScopeT2V [20] is a text-to-video synthesis model developed from the text-to-image synthesis model Stable Diffusion. Spatial-temporal blocks are incorporated into ModelScopeT2V for consistent frame generation and seamless movement transitions. ModelScopeT2V uses three components: a text encoder, a denoising UNet, and a VQGAN. The denoising UNet operates within the latent space. During training, a diffusion process gradually adds Gaussian noise over T steps, resulting in progressively less informative data. During inference, the UNet predicts the noise added at each step, ultimately allowing the generation of an image from random noise. The VQGAN decoder is then used to assemble a video from these images.
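For reference, the denoising UNet in such latent diffusion models is typically trained with the standard noise-prediction objective (written here in its generic form; ModelScopeT2V's exact weighting may differ), where $z_0$ is the clean latent, $c$ the text embedding, and $\epsilon$ the sampled Gaussian noise:

\[
\mathcal{L} \;=\; \mathbb{E}_{z_0,\, c,\, \epsilon \sim \mathcal{N}(0, I),\, t}
\Big[ \big\lVert \epsilon - \epsilon_\theta(z_t, t, c) \big\rVert_2^2 \Big],
\qquad
z_t \;=\; \sqrt{\bar{\alpha}_t}\, z_0 \;+\; \sqrt{1 - \bar{\alpha}_t}\, \epsilon
\]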
2.3 Transformer-Based Methods
In this section, the focus is on transformer-based text-to-video generation methods.
CogVideo is a 9B-parameter model trained by inheriting the pre-trained text-to-image model CogView2 [6, 8]. This work uses a multi-frame-rate hierarchical training strategy, which helps align the text with the video. The backbone of CogVideo is a transformer with dual-channel attention, consisting of 48 layers, 48 attention heads, and a hidden size of 3,072 in each channel; CogView2 [6] has 6 billion parameters.
VideoPoet is a method for converting any autoregressive language model into a high-quality video generator [11]. It includes key components such as a pre-trained MAGVIT-v2 [24] video tokenizer, which converts images and videos into discrete codes compatible with text-based language models. An autoregressive language model predicts the next video token in a sequence, learning across video, image, and text modalities. The model integrates various generative learning objectives, including text-to-video, text-to-image, and video frame continuation, enabling zero-shot capabilities. VideoPoet generates videos with high temporal consistency.
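This next-token prediction follows the usual autoregressive factorization over the discrete video codes (a generic formulation), with $x_n$ denoting the $n$-th video token and $c$ the conditioning text or image tokens:

\[
p_\theta(x_{1:N} \mid c) \;=\; \prod_{n=1}^{N} p_\theta\big(x_n \mid x_{<n},\, c\big)
\]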
2.4 Evaluation of VLMs
The attribution, relation, and order (ARO) benchmark evaluates VLMs' understanding of object properties, relational context, and order sensitivity [25]. Current VLMs often perform poorly in relational understanding and order sensitivity, even though they are trained on large datasets. This issue arises because models can perform well on standard tests without truly understanding compositional information. To improve VLMs, the authors propose composition-aware hard negative mining. First, the closest neighboring images are added to each batch to help models recognize fine-grained differences between similar scenes. Second, captions with scrambled word order are added to each batch to help models distinguish correct from incorrect sequences. This simple fine-tuning significantly enhances the model's understanding of attributes and relationships.
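The scrambled-caption step is easy to illustrate; the helper below is a sketch of the general idea, not the authors' implementation.

```python
import random

def scrambled_caption_negative(caption, seed=None):
    """Create an order-perturbed hard negative: same words as the caption, shuffled order."""
    rng = random.Random(seed)
    words = caption.split()
    shuffled = words[:]
    rng.shuffle(shuffled)
    while len(words) > 1 and shuffled == words:   # avoid returning the original order
        rng.shuffle(shuffled)
    return " ".join(shuffled)

# Example: "a cat sitting on a red sofa" -> e.g. "sofa red a on cat sitting a"
```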
An image-based benchmark [14] discusses how recent text-to-image (T2I) generation models like Stable Diffusion [15] have improved in generating high-resolution images from text descriptions but still face issues such as artifacts, misalignment, and low aesthetic quality. Inspired by reinforcement learning with human feedback (RLHF), the researchers enhanced feedback signals by marking problematic image regions and identifying misrepresented or missing text elements. They collected detailed human feedback on 18,000 generated images (RichHF-18K) and trained a multimodal transformer to predict this feedback. Additionally, [2, 18, 19] help in understanding the generation of videos from text.
Existing benchmarks do not evaluate VLMs on object formation and action consistency, highlighting a gap in current research. This work addresses that gap by evaluating VLMs on these two criteria.
4 Experiments
This section outlines the dataset generation and the implementation details of this study. This work introduces a benchmark based on human evaluation for text-to-video generation. To evaluate the performance of VLM outputs, a thorough examination of the benchmark results is needed to identify the specific areas where the models fall short in producing videos.
4.1 Dataset
A dataset of prompts is developed with careful consideration of the number of objects, their actions, and the complexity of the prompts. The dataset is divided along two axes, the number of objects and actions and the complexity of the prompts, each with four levels, yielding 16 scenarios. Table 2 provides an overview of the number of prompt inputs. A total of 160 input prompts covering these scenarios is generated, i.e., ten prompts per scenario. The categories are chosen to cover real-life video generation situations, where the number of objects and actions increases and the complexity of the prompts also rises.
4.2 Implementation details
Three diffusion models, Text2Video-Zero [9], LAVIE [23], and VideoCrafter2 [3], are considered for evaluation.
The models are provided with input prompts from the benchmark dataset, resulting in the generation of 160 videos per model. In total, 480 videos are generated, with each video spanning between two and four seconds. The Text2Video-Zero [9] and VideoCrafter2 [3] models generated videos at a resolution of 512x512, while LAVIE [23] generated videos at a resolution of 320x512. The videos generated by each model are evaluated by five participants. The participants undertook a Likert test, wherein they were presented with a video and asked to rate it on a scale from 0 to 5 based on their observations. The Likert test comprised two criteria: object formation and action consistency. This evaluation method provides detailed insights into the performance of each model in generating coherent and contextually accurate videos from prompt descriptions.
First, participants rated whether the object described in the prompt was accurately formed in the video. Second, participants evaluated whether the action performed by the objects matched the prompt description. For object formation, the Likert scale ratings are as follows: 0: no object formed, or an object formed that does not match the prompt description; 1: object formed but highly distorted; 2: object formed but less distorted; 3: object formation is average; 4: object formation is good; 5: object formation is perfect. For the action performed, the ratings were: 0: no action performed, or an action performed that does not match the input prompt; 1: action performed but highly inconsistent; 2: action performed but less inconsistent; 3: action performed with average consistency; 4: action performed consistently; 5: action performed with high consistency. This detailed rating system allowed for a nuanced assessment of object formation and action consistency in the generated videos.
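The participant scores were then averaged per model, scenario, and criterion. A minimal aggregation sketch is shown below; the record structure and field names are hypothetical, chosen only to illustrate how the per-scenario averages and the rating distributions reported later could be computed.

```python
from collections import defaultdict
from statistics import mean

def average_ratings(records):
    """records: iterable of dicts such as
    {"model": "LAVIE", "scenario": "SO_SP", "criterion": "object_formation", "rating": 4}
    (field names are illustrative). Returns the mean rating per (model, scenario, criterion)."""
    buckets = defaultdict(list)
    for r in records:
        buckets[(r["model"], r["scenario"], r["criterion"])].append(r["rating"])
    return {key: mean(vals) for key, vals in buckets.items()}

def rating_distribution(records):
    """Percentage of ratings falling on each point of the 0-5 Likert scale, per model."""
    counts = defaultdict(lambda: defaultdict(int))
    for r in records:
        counts[r["model"]][r["rating"]] += 1
    return {model: {score: 100.0 * n / sum(c.values()) for score, n in c.items()}
            for model, c in counts.items()}
```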
This thorough process ensured a comprehensive evaluation of each model's performance in generating accurate and consistent videos.
5 Results and Discussion
As discussed in Section 4.2, five participants conducted a human evaluation of the outputs from three different VLMs: Text2Video-Zero [9], LAVIE [23], and VideoCrafter2 [3]. The results of the Likert test were analyzed, and the outcomes are discussed in this section.
The results, as depicted in Figure 2, reveal that a model's ability to generate videos diminishes as the complexity of the prompts increases from SO_SP to MOA_RP. The plot demonstrates a decline in average ratings with rising prompt complexity for both object formation and action consistency. For instance, as shown in Figure 2, the rating for SO_SP is approximately 3.5-4 for object formation and approximately 3 for action consistency. However, as the number of objects, the number of actions, and the complexity of the prompts increase, the rating drops to a range of 1-1.5 in both cases, except for VideoCrafter2 [3], which outperforms the others with a rating of approximately 3.5 in object formation and approximately 2.5 in action consistency. Figure 2 also illustrates a noticeable improvement in ratings for SP. This spike indicates that the models perform better with SP; visual samples are shown in Figures 5(a) and 5(b), which illustrate video frames with consistent action and appropriate object formation. The models do better at object formation than at action consistency, which is evident from the average ratings shown in Figure 2. Additionally, the models struggled with video generation as the input prompt shifted from SO to MO, and a similar trend was seen when comparing SOA to MOA; this is evident from Figure 5(d), which highlights video frames with inconsistent action, with the circle marking inappropriate object formation such as distorted faces and legs of humans.
The percentage distribution of ratings assigned to each model is also calculated and depicted in Figure 3. The pie charts illustrate the percentage distribution of the average Likert test ratings. Notably, the charts show that no scenario received a rating of 5, indicating that none of the generated videos achieved the highest quality. Moreover, no ratings were categorized as inappropriate, suggesting that all models were capable of generating some relevant content; as shown in Figure 5, the frames are neither perfect nor inappropriate with respect to the prompt description. However, the distribution in Figure 3 reveals that the majority of ratings fall within the bad range, accounting for 45%-55% of the evaluations. Ratings in the good range were minimal, between 1%-19%, indicating that the generated videos were generally poor in terms of object formation and action consistency. Ratings categorized as average fell between 13%-36%, reflecting that while some videos were moderately acceptable, they still lacked good object formation and action consistency.
A histogram of the average ratings for object formation and action consistency given by the participants to each model's output across all scenarios is shown in Figure 4. The plot further illustrates that the ratings for simple prompts in all four scenarios range from bad to good. However, the trend suggests that as prompt complexity increases, the models' ability to generate accurate and coherent videos becomes progressively worse. This is evidenced by a leftward shift in the plot when transitioning from simple prompts (SP) to rare and unique prompts (RP). For rare and unique object prompts, almost all ratings are at or below average. Additionally, as the number of objects and actions increases from top to bottom in Figure 4, the ratings drop significantly, reflecting a marked decline in the quality of video generation. The histogram plots also give meaningful insights into the individual performance of the models. As seen in Figure 4, in the SO_SP scenario all models perform well, but LAVIE [23] outperforms the other two with all ratings given as good; a similar trend is seen across all SO scenarios. In the MOA_RP scenario, all models perform poorly, with ratings mostly given as very bad. In the SO_RP scenario, Text2Video-Zero [9] underperforms the other two models, with ratings given as bad.
The average ratings of all three models for the two evaluation criteria, object formation and action consistency, are presented in Table 3. The ratings are approximately the same across all three models: they are below 3 for object formation, indicating that all models perform below average on this criterion, and below 2 for action consistency, placing all models in the bad category. Figure 5(c) depicts a scenario where action consistency is inappropriate; for example, the ball is not present in the starting and ending frames. Among the three models, VideoCrafter2 [3] outperforms the other two in both evaluation criteria, as shown in Table 3.
Sample frames from videos generated by LAVIE [23], VideoCrafter2 [3], and Text2Video-Zero [9] for a few scenarios are shown in Figure 5. The results show that the models struggle to generate unique objects and often miss some objects. Figure 5(e) shows video frames where the goalkeeper is not formed, although the prompt states that a goalkeeper should be present in the video. The model also fails to maintain temporal consistency, which is visible at the goal post and net, as highlighted by the yellow circle. Given these insights from the evaluations, it is clear that there is considerable room for improvement in VLMs' ability to produce relevant and coherent videos.