LASER: Tuning-Free LLM-Driven Attention Control for Efficient Text-conditioned Image-to-Animation

Zheng, Haoyu; Zhang, Wenqiao; Wang, Yaoke; Zhou, Hao; Liu, Jiang; Li, Juncheng; Lv, Zheqi; Tang, Siliang; Zhuang, Yueting

Computer Science > Computer Vision and Pattern Recognition

arXiv:2404.13558 (cs)

[Submitted on 21 Apr 2024 (v1), last revised 23 Apr 2024 (this version, v2)]

Title:LASER: Tuning-Free LLM-Driven Attention Control for Efficient Text-conditioned Image-to-Animation

Authors:Haoyu Zheng, Wenqiao Zhang, Yaoke Wang, Hao Zhou, Jiang Liu, Juncheng Li, Zheqi Lv, Siliang Tang, Yueting Zhuang

View PDF HTML (experimental)

Abstract:Revolutionary advancements in text-to-image models have unlocked new dimensions for sophisticated content creation, e.g., text-conditioned image editing, allowing us to edit the diverse images that convey highly complex visual concepts according to the textual guidance. Despite being promising, existing methods focus on texture- or non-rigid-based visual manipulation, which struggles to produce the fine-grained animation of smooth text-conditioned image morphing without fine-tuning, i.e., due to their highly unstructured latent space. In this paper, we introduce a tuning-free LLM-driven attention control framework, encapsulated by the progressive process of LLM planning, prompt-Aware editing, StablE animation geneRation, abbreviated as LASER. LASER employs a large language model (LLM) to refine coarse descriptions into detailed prompts, guiding pre-trained text-to-image models for subsequent image generation. We manipulate the model's spatial features and self-attention mechanisms to maintain animation integrity and enable seamless morphing directly from text prompts, eliminating the need for additional fine-tuning or annotations. Our meticulous control over spatial features and self-attention ensures structural consistency in the images. This paper presents a novel framework integrating LLMs with text-to-image models to create high-quality animations from a single text input. We also propose a Text-conditioned Image-to-Animation Benchmark to validate the effectiveness and efficacy of LASER. Extensive experiments demonstrate that LASER produces impressive, consistent, and efficient results in animation generation, positioning it as a powerful tool for advanced digital content creation.

Comments:	10 pages, 7 figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2404.13558 [cs.CV]
	(or arXiv:2404.13558v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2404.13558

Submission history

From: Haoyu Zheng [view email]
[v1] Sun, 21 Apr 2024 07:13:56 UTC (27,567 KB)
[v2] Tue, 23 Apr 2024 07:17:03 UTC (27,567 KB)

Computer Science > Computer Vision and Pattern Recognition

Title:LASER: Tuning-Free LLM-Driven Attention Control for Efficient Text-conditioned Image-to-Animation

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Computer Vision and Pattern Recognition

Title:LASER: Tuning-Free LLM-Driven Attention Control for Efficient Text-conditioned Image-to-Animation

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators