
PALP: Prompt Aligned Personalization of Text-to-Image Models

Published: 03 December 2024

Abstract

Content creators often aim to create personalized images using personal subjects that go beyond the capabilities of conventional text-to-image models. Additionally, they may want the resulting image to encompass a specific location, style, ambiance, and more. Existing personalization methods may compromise personalization ability or the alignment to complex textual prompts. This trade-off can impede the fulfillment of user prompts and subject fidelity. We propose a new approach focusing on personalization methods for a single prompt to address this issue. We term our approach prompt-aligned personalization. While this may seem restrictive, our method excels in improving text alignment, enabling the creation of images with complex and intricate prompts, which may pose a challenge for current techniques. In particular, our method keeps the personalized model aligned with a target prompt using an additional score distillation sampling term. We demonstrate the versatility of our method in multi- and single-shot settings and further show that it can compose multiple subjects or use inspiration from reference images, such as artworks. We compare our approach quantitatively and qualitatively with existing baselines and state-of-the-art techniques.

1 Introduction

Text-to-image models have shown exceptional abilities to generate a diversity of images in various settings (place, time, style, and appearances), such as "a sketch of Paris on a rainy day" or "a Manga drawing of a teddy bear at night" [Ramesh et al. 2021; Saharia et al. 2022]. Recently, personalization methods even allow one to include specific subjects (objects, animals, or people) in the generated images [Gal et al. 2023a; Ruiz et al. 2023]. In practice, however, such models are difficult to control and may require significant prompt engineering and re-sampling to create the specific image one has in mind. This is even more acute with personalized models, where it is challenging to include the personal item or character in the image while simultaneously fulfilling the textual prompt describing the content and style. This work proposes a method for better personalization and prompt alignment, especially suited for complex prompts.
Fig. 1: Prompt-aligned personalization allows rich and complex scene generation, including all elements of a conditioning prompt (right).
A key ingredient in personalization methods is fine-tuning pre-trained text-to-image models on a small set of personal images while relying on heavy regularization to maintain the model's capacity. Doing so preserves the model's prior knowledge and allows the user to synthesize images with various prompts; however, it impairs the capture of identifying features of the target subject. On the other hand, insisting on identification accuracy can hinder prompt-alignment capabilities. Generally speaking, the trade-off between identity preservation and prompt alignment is a core challenge in personalization methods (see fig. 2).
Content creators and AI artists frequently have a clear idea of the prompt they wish to utilize. It may involve stylization and other factors that current personalization methods struggle to maintain. Therefore, we take a different approach by focusing on excelling with a single prompt rather than offering a general-purpose method intended to perform well with a wide range of prompts. This approach enables both (i) learning the unique features of the subject from a few or even a single input image and (ii) generating richer scenes that are better aligned with the user’s desired prompt (see fig. 1).
Our work is based on the premise that existing models possess knowledge of all elements within the target prompt except the new personal subject. Consequently, we leverage the pre-trained model's prior knowledge to prevent personalized models from losing their understanding of the target prompt. In particular, since we know the target prompt during training, we show how to incorporate score-distillation guidance [Poole et al. 2023] to constrain the personalized model's prediction to stay aligned with the pre-trained one. We therefore introduce a framework comprising two components: personalization, which teaches the model about our new subject, and prompt alignment, which prevents it from forgetting elements included in the target prompt.
Our approach liberates content creators from constraints associated with specific prompts, unleashing the full potential of text-to-image models. We evaluate our method qualitatively and quantitatively. We show superior results compared with the baselines in multi- and single-shot settings, all without pre-training on large-scale data [Arar et al. 2023; Gal et al. 2023b], which can be difficult for certain domains. Finally, we show that our method can accommodate multi-subject personalization with minor modification and offer new applications such as drawing inspiration from a single artistic painting, and not just text (see Figure 3).
Fig. 2: Previous personalization methods struggle with complex prompts (e.g., "A sketch inspired by Vitruvian man"), presenting a trade-off between prompt alignment and subject fidelity. Our method optimizes for both without compromising either.

2 Related work

Text-to-image synthesis has marked unprecedented progress in recent years [Gafni et al. 2022; Nichol et al. 2021; Ramesh et al. 2021; Rombach et al. 2022; Saharia et al. 2022], mostly due to large-scale training on data like LAION-400M [Schuhmann et al. 2021]. Our approach uses pre-trained diffusion models [Ho et al. 2020] to extend their understanding to new subjects. We use the publicly available Stable Diffusion (SD) model [Rombach et al. 2022] for most of our experiments, since most baseline methods provide open-source implementations for SD. We further verify our method on a larger latent diffusion model variant [Rombach et al. 2022].
Text-based editing methods rely on contrastive multi-modal models like CLIP [Radford et al. 2021] as an interface to guide local and global edits [Avrahami et al. 2023b; 2022; Bar-Tal et al. 2022; Gal et al. 2022; Patashnik et al. 2021]. Recently, Prompt-to-Prompt (P2P) [Hertz et al. 2023b] was proposed as a way to edit and manipulate generated images by editing the attention maps in the cross-attention layers of a pre-trained text-to-image model. Later, Mokady et al. [2023] extended P2P to real images by encoding them into the null-conditioning space of classifier-free guidance [Ho and Salimans 2022]. InstructPix2Pix [Brooks et al. 2023] uses an instruction-guided image-to-image translation network trained on synthetic data. Others preserve image structure by using reference attention maps [Parmar et al. 2023] or features extracted using DDIM [Song et al. 2021] inversion [Tumanyan et al. 2023]. Imagic [Kawar et al. 2023] starts from a target prompt, finds a text embedding that reconstructs an input image, and later interpolates between the two to achieve the final edit. UniTune [Valevski et al. 2022], on the other hand, performs the interpolation in pixel space during the backward denoising process. In our work, we focus on the ability to generate images depicting a given subject, which may not necessarily maintain the global structure of an input image.
Early personalization methods like Textual Inversion [Gal et al. 2023a] and DreamBooth [Ruiz et al. 2023] tune pre-trained text-to-image models to represent new subjects, either by finding a new soft word embedding [Gal et al. 2023a] or by calibrating model weights [Ruiz et al. 2023] with existing words to represent the newly added subject. Later methods reduced the memory requirements of these approaches using low-rank updates [Hu et al. 2022; Kumari et al. 2023; Ryu 2023; Tewel et al. 2023] or a compact parameter space [Han et al. 2023]. On another front, NeTI [Alaluf et al. 2023] and P+ [Voynov et al. 2023] enhance TI [Gal et al. 2023a] by using additional tokens to more effectively capture subject-identifying features, whereas Pang et al. [2023] introduce a cross-initialization method to bridge the gap between the learned token embedding and the original model token embeddings. DreamMatcher [Nam et al. 2024] uses appearance matching in the generation process while maintaining the structure path of the original text-to-image model. Personalization can also be used for other tasks: ReVersion [Huang et al. 2023] showed how to learn relational features from reference images, and Vinker et al. [2023] used personalization to decompose and visualize concepts at different abstraction levels. Chefer et al. [2023b] propose an interpretability method for text-to-image models that decomposes concepts into interpretable tokens. Another line of work pre-trains encoders on large-scale data for near-instant, single-shot adaptation [Arar et al. 2023; Chen et al. 2023; Gal et al. 2023b; Valevski et al. 2023; Wei et al. 2023; Ye et al. 2023; Zhou et al. 2023]. Single-image personalization has also been addressed by Avrahami et al. [2023a], who use segmentation masks to personalize a model on different subjects. In our work, we focus on prompt alignment, and our baseline personalization method can be replaced with any of the previous personalization methods.
Score Distillation Sampling (SDS) emerged as a technique for leveraging 2D diffusion model priors [Rombach et al. 2022; Saharia et al. 2022] for 3D generation from textual input. Soon, this technique found its way into different applications such as SVG generation [Iluz et al. 2023; Jain et al. 2023], image editing [Hertz et al. 2023a], and more [Song et al. 2022]. Other variants aim to improve the image-generation quality of SDS [Poole et al. 2023], which suffers from over-saturation and blurriness [Katzir et al. 2023; Wang et al. 2023].
Text-to-image alignment methods address text-related issues that arise in base diffusion generative models. These issues include neglecting specific parts of the text, attribute mixing, and more. Previous methods address these issues through attention-map re-weighting [Feng et al. 2022; Phung et al. 2023; Wu et al. 2023], latent optimization [Chefer et al. 2023a; Rassin et al. 2023], or re-training with additional data [Segalis et al. 2023]. However, none of these methods addresses the prompt alignment of personalized models; instead, they aim to enhance the base models' ability to generate text-aligned images.
Fig. 3: PALP for multi-subject personalization achieves coherent and prompt-aligned results. Our method works when the subject has only one image (e.g., the "Wanderer above the Sea of Fog" artwork by Caspar David Friedrich).

3 Method

In our method, we strive to teach a pre-trained text-to-image model to generate images of a new subject S, which the model does not recognize. Although personalization methods can adapt the model to the new subject, they struggle to create images of S depicted by complex prompts. To overcome this obstacle, we optimize for both subject fidelity and the ability to faithfully fulfill a target prompt. While the latter is non-trivial, we show that by knowing the target prompt at the finetuning stage, we can help the model preserve its prior knowledge about the different elements in the text.
We begin by giving an overview of personalization methods and identifying the prompt misalignment caused by the personalization process. We then present our solution, which relies on knowing the prompt during personalization. More specifically, we simultaneously finetune the model so that it becomes personalized without losing its knowledge about the target prompt ytrg.

3.1 Preliminaries

Fig. 4: Method overview. We propose a framework consisting of a personalization path (left) and a prompt-alignment branch (right) applied simultaneously in the same training step. We achieve personalization by finetuning the pre-trained model using a simple reconstruction loss to denoise the new subject S. To keep the model aligned with the target prompt, we additionally use score sampling to pivot the prediction towards the direction of the target prompt y, e.g., "A sketch of a cat." In this example, when personalization and text alignment are optimized simultaneously, the network learns to denoise the subject towards a "sketch"-like representation. Finally, our method does not induce a significant memory overhead due to the efficient estimation of the score function, following [Poole et al. 2023].
Text-to-image diffusion models perform a backward diffusion process to generate an image x. In this process, a denoising model G with parameters θ progressively cleans an input noise \(x_T \sim \mathcal {N}(0,1)\) to produce a sample x from an underlying data distribution p(X). At each diffusion timestep t ∈ {T, …, 1, 0}, the model predicts a noise \(\hat{\epsilon } = G\left(x_t, t, y; \theta \right)\) conditioned on a prompt y and the timestep t. The generative model is trained to maximize the evidence lower bound (ELBO) using a denoising score matching objective [Ho et al. 2020]:
\begin{equation} \mathcal {L}_{diff} = \mathbb {E}_{ \epsilon \sim \mathcal {N}(0,1) \atop t \sim \mathcal {U}(0,T)} \left[\Vert G\left(x_t, t, y; \theta \right) - \epsilon \Vert _{2}^{2} \right]. \end{equation}
(1)
where, given the noise-scheduler parameter \(\sqrt {\bar{\alpha }_{t}}\), the latent xt at timestep t is given by the forward diffusion:
\begin{equation} x_t = \sqrt {\bar{\alpha }_{t}} {\bf x} + \sqrt {1-\bar{\alpha }_{t}}\epsilon. \end{equation}
(2)
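To make the preliminaries concrete, below is a minimal PyTorch sketch of eqs. (1) and (2). The denoiser `model`, the conditioning embedding `y_emb`, and the schedule tensor `alphas_cumprod` (holding \(\bar{\alpha}_t\)) are assumed placeholders for whatever diffusion backbone is used; this is an illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def denoising_loss(model, x0, y_emb, alphas_cumprod):
    """Sketch of the training objective: sample t and eps, form x_t via the
    forward diffusion of eq. (2), and return the score-matching loss of eq. (1)."""
    b = x0.shape[0]
    T = alphas_cumprod.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)       # t ~ U(0, T)
    eps = torch.randn_like(x0)                             # eps ~ N(0, I)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)             # \bar{alpha}_t per sample
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps     # eq. (2)
    eps_hat = model(x_t, t, y_emb)                         # G(x_t, t, y; theta)
    return F.mse_loss(eps_hat, eps)                        # eq. (1)
```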

3.2 Personalization

Personalization methods finetune the diffusion model parameters θ for a new target subject S, such that the finetuned personalized model can generate images of S. Following prior works, we finetune the model G using the loss defined in eq. (1). Like Textual Inversion [Gal et al. 2023a], we learn a new word embedding, denoted by the placeholder [V], to represent our subject S. We also finetune the model weights, similar to DreamBooth [Ruiz et al. 2023], using the Low-Rank Adaptation (LoRA) method [Hu et al. 2022; Kumari et al. 2023], where each layer weight \(W \in \mathbb {R}^{N\times M}\) is updated using:
\begin{equation} W^{\prime } = W + \Delta W = W + A\times B. \end{equation}
(3)
Here, the matrices \(A\in \mathbb {R}^{N\times r}\) and \(B\in \mathbb {R}^{r\times M}\) are a learned low-rank decomposition of the weight update \(\Delta W \in \mathbb {R}^{N\times M}\). Similar to prior works [Kumari et al. 2023; Tewel et al. 2023], we only update the weights of the self- and cross-attention layers. Finally, finetuning is performed by pairing the image set with a personalization prompt yP that contains the placeholder [V], e.g., yP = "A photo of [V]."
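As an illustration of eq. (3), the sketch below wraps a frozen linear layer with a trainable low-rank update. The class name, rank, and zero initialization of B follow common LoRA practice and are assumptions, not the authors' exact implementation (which applies such updates inside the attention layers).

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """W' = W + A @ B (eq. 3): frozen base weight plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                          # the pre-trained W stays frozen
        n, m = base.out_features, base.in_features           # W in R^{N x M}
        self.A = nn.Parameter(torch.randn(n, rank) * 0.01)   # A in R^{N x r}
        self.B = nn.Parameter(torch.zeros(rank, m))          # B in R^{r x M}; zero init => W' = W at start

    def forward(self, x):
        delta_w = self.A @ self.B                            # Delta W of rank <= r
        return self.base(x) + x @ delta_w.t()
```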
Throughout the rest of the section, we denote by Gθ and \(G_{\theta _{LoRA}}\) the pre-trained and the personalized models, respectively. Given an input image x of our subject S and a conditional text embedding yP, the personalization loss is:
\begin{equation} \mathcal {L}_{P} = \mathbb {E}_{ \epsilon \sim \mathcal {N}(0,{\bf I}) \atop t \sim \mathcal {U}(0,T)} \left[\Vert G_{\theta _{LoRA}}\left(x_t, t, y_P \right) - \epsilon \Vert _{2}^{2} \right]. \end{equation}
(4)

3.3 Over-fitting in Personalization

Let our target prompt be ytrg, for example, "A sketch of [V]." Too many finetuning steps on a small image set cause the model to overfit. In this case, the diffusion model steers the backward denoising process towards one of the training images, regardless of the conditional prompt. To see this, let us visualize the model's prediction by analyzing its estimate of x, the real sample. The estimate \(\hat{x}_0\) can be derived from the model's prediction by inverting the forward diffusion eq. (2):
\begin{equation} \hat{x}_0 = \frac{x_t - \sqrt {1-\bar{\alpha }_{t}}G(x_t,t, y) }{ \sqrt {\bar{\alpha }_{t}} }. \end{equation}
(5)
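Eq. (5) translates directly into a small helper, reusing the conventions of the earlier sketches (assumed `model` and `alphas_cumprod`):

```python
def estimate_x0(model, x_t, t, y_emb, alphas_cumprod):
    """Single-step estimate of the clean sample from the noise prediction, eq. (5)."""
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps_hat = model(x_t, t, y_emb)
    return (x_t - (1 - a_bar).sqrt() * eps_hat) / a_bar.sqrt()
```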
Fig. 5: Single-step visualization.
In fig. 5a, we show the estimates of three models: (1) the pre-trained model Gθ, (2) an overfitted personalized model, and (3) a prompt-aligned personalized model. As can be seen from the figure, after prolonged training, the overfitted personalized model can reconstruct the target image from random noise in a single step. In particular, after one denoising step, elements from the training images, like the background, become more dominant, and the sketchiness fades away, suggesting misalignment with the prompt "A sketch." Throughout the rest of this section, \(\hat{x}_0\) denotes the personalized model's (\(G_{\theta _{LoRA}}\)) estimate obtained from the latent xt.

3.4 Prompt-Aligned Personalization

We want to personalize Gθ to generate images related to a new subject S. However, unlike previous methods, we strongly emphasize achieving near-optimal results for a single textual prompt, denoted by ytrg (e.g., "A sketch of [V]").
Our key idea is to optimize our personalized weights, i.e., θLoRA and [V], while simultaneously pushing the model prediction towards our target prompt ytrg. For example, if ytrg = "A sketch of [V]", then we want the estimate \(\hat{x}_0\) to be both personalized and aligned with ytrg. In other words, we want the estimate to contain elements of the subject S and to maintain a sketchy appearance instead of a photo-realistic one.
To achieve this, we make use of the pre-trained model Gθ, which possesses all knowledge about the target prompt's elements (except for the subject S). By omitting "[V]" from the target prompt, we obtain a clean prompt that the pre-trained model understands (henceforth denoted yc). We can measure the estimate's fidelity to yc using the diffusion loss in eq. (1).
In particular, given a sampled noise ϵ2 and timestep t2, we let \(\hat{x}_{t_2}\) be the latent obtained from \(\hat{x}_0\) by the forward diffusion process. We further denote by \(G^{\alpha }_{\theta }\) the classifier-free guidance prediction of the base model, which is an extrapolation of the conditional and unconditional (y = ∅) noise predictions. The scalar \(\alpha \in \mathbb {R}^+\) controls the extrapolation via:
\begin{equation} G^{\alpha }_{\theta }\left(x_t, t, y\right) = (1-\alpha) G_{\theta }\left(x_t, t, \emptyset \right) + \alpha G_{\theta }\left(x_t, t, y \right). \end{equation}
(6)
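Eq. (6) is the standard classifier-free guidance extrapolation; a minimal sketch, assuming `null_emb` is the embedding of the empty prompt ∅:

```python
def cfg_prediction(model, x_t, t, y_emb, null_emb, alpha: float):
    """Classifier-free guidance of eq. (6): extrapolate from the unconditional
    prediction towards the conditional one with scale alpha."""
    eps_uncond = model(x_t, t, null_emb)   # G(x_t, t, emptyset)
    eps_cond = model(x_t, t, y_emb)        # G(x_t, t, y)
    return (1 - alpha) * eps_uncond + alpha * eps_cond
```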
Then, to maintain alignment between the personalized model's prediction \(\hat{x}_0\) and the non-personalized prompt yc, we can use the diffusion loss. Specifically, given \(\hat{x}_{t_2} = \sqrt {\bar{\alpha }_{t_2}} \hat{x}_0 + \sqrt {1-\bar{\alpha }_{t_2}} \epsilon _2\), we want to perturb the model prediction by minimizing:
\begin{equation} \mathbb {E}_{ \epsilon _2 \sim \mathcal {N}(0,1) \atop t_2 \sim \mathcal {U}(0,T)} \left[\Vert G^\alpha _{\theta }\left(\hat{x}_{t_2}, t_2, y_c\right) - \epsilon _2 \Vert _{2}^{2} \right]. \end{equation}
(7)
In practice, we use Score Distillation Sampling (SDS), which provides a more effective gradient estimation [Poole et al. 2023]:
\begin{equation} \nabla \mathcal {L}_{SDS}(G_\theta , \hat{x}_{t_2}, y_c) = \tilde {w}(t) \left(G^\alpha _{\theta }\left(\hat{x}_{t_2}, t_2, y_c\right) - \epsilon _2 \right) \frac{\partial \hat{x}_0}{\partial \phi }, \end{equation}
(8)
where ϕ denotes the weights controlling the appearance of \(\hat{x}_0\), which in our case are the LoRA weights and [V].
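The sketch below shows one way to realize eqs. (7)-(8) in PyTorch, reusing `cfg_prediction` and the conventions from the earlier snippets. Applying the detached residual to \(\hat{x}_0\) reproduces the SDS gradient without backpropagating through the frozen base model; the weighting \(\tilde{w}(t)\) is omitted for brevity, and all names are illustrative assumptions.

```python
import torch

def sds_loss(base_model, x_hat0, y_c_emb, null_emb, alphas_cumprod, alpha=7.5):
    """SDS term: re-noise the personalized estimate x_hat0 (eq. 7) and push it
    towards the base model's score for the clean prompt y_c (eq. 8)."""
    b = x_hat0.shape[0]
    T = alphas_cumprod.shape[0]
    t2 = torch.randint(0, T, (b,), device=x_hat0.device)
    eps2 = torch.randn_like(x_hat0)
    a_bar = alphas_cumprod[t2].view(b, 1, 1, 1)
    x_t2 = a_bar.sqrt() * x_hat0 + (1 - a_bar).sqrt() * eps2       # forward-diffuse x_hat0
    with torch.no_grad():                                          # the base model is frozen
        eps_base = cfg_prediction(base_model, x_t2, t2, y_c_emb, null_emb, alpha)
    grad = (eps_base - eps2).detach()                              # (G^alpha - eps_2)
    # Backpropagating this scalar applies `grad` as dL/d(x_hat0), i.e., eq. (8) up to w(t).
    return (grad * x_hat0).sum()
```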

3.5 Avoiding Over-saturation and Mode Collapse

Incorporating the gradients defined in eq. (8) into our framework leads to less diverse and over-saturated results. While alternative implementations like [Katzir et al. 2023; Wang et al. 2023] improve diversity, the overall personalization quality is still affected (see supplemental materials).
Previous works [Hertz et al. 2023a; Katzir et al. 2023] have introduced a residual score formulation that better estimates the desired gradient direction. In our case, we want to pivot the personalized model's prediction towards the base model's. Therefore, we perturb the model prediction using the difference between the personalized model's prediction \(G_{\theta _{LoRA}}\) and the pre-trained one Gθ, and use the residual score:
\begin{equation} \nabla \mathcal {L}_{PALP} = \tilde {w}(t) \left(G^\alpha _{\theta }(\hat{x}_{t_2}, t_2, y_c) - G^\beta _{\theta _{LoRA}} (\hat{x}_{t_2}, t_2, y_P) \right) \frac{\partial \hat{x}_0}{\partial \phi }, \end{equation}
(9)
where α and β are the guidance scales of Gθ and \(G_{\theta _{LoRA}}\), respectively, yP is the personalization prompt, and yc is the clean prompt (see the right part of fig. 4).
In fig. 5b, we visualize the residual gradient defined in eq. (9). The figure illustrates that incorporating PALP effectively reduces overfitting while retaining personalization capabilities. Ideally, both the base and personalized models should perform similarly in background regions. An optimal personalized model should have noise predictions for the background comparable to the base model and enhanced noise predictions for the target subject. However, without PALP, DreamBooth overfits by excessively denoising both the background and the target subject (see the middle image in fig. 5b). In contrast, with PALP, the personalized model matches the base model’s noise predictions for the background while improving predictions for the target subject.
Notice that our formulation is derived from the personalization overfitting problem: we calculate a residual with respect to the estimates of two different networks given the same input, whereas previous works used the predictions of the same network over different input images, aiming to improve image-to-image or NeRF tasks.
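Under the same assumptions and naming as the SDS sketch above, the PALP residual of eq. (9) only changes the subtracted term: instead of the sampled noise, we subtract the personalized model's own guided prediction. A minimal sketch:

```python
import torch

def palp_loss(base_model, lora_model, x_hat0, y_c_emb, y_p_emb, null_emb,
              alphas_cumprod, alpha=7.5, beta=5.0):
    """Residual score of eq. (9): base-model CFG prediction for the clean prompt y_c
    minus the personalized model's CFG prediction for y_P, applied as a gradient on x_hat0."""
    b = x_hat0.shape[0]
    T = alphas_cumprod.shape[0]
    t2 = torch.randint(0, T, (b,), device=x_hat0.device)
    eps2 = torch.randn_like(x_hat0)
    a_bar = alphas_cumprod[t2].view(b, 1, 1, 1)
    x_t2 = a_bar.sqrt() * x_hat0 + (1 - a_bar).sqrt() * eps2
    with torch.no_grad():
        eps_base = cfg_prediction(base_model, x_t2, t2, y_c_emb, null_emb, alpha)  # G^alpha_theta
        eps_pers = cfg_prediction(lora_model, x_t2, t2, y_p_emb, null_emb, beta)   # G^beta_theta_LoRA
    grad = (eps_base - eps_pers).detach()       # residual score of eq. (9), w(t) omitted
    return (grad * x_hat0).sum()                # gradients reach only the LoRA weights and [V]
```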

3.6 Overall Training Objective

Our final objective is to optimize θLoRA using \(\mathcal {L}_{P}\) from eq. (4) and \(\mathcal {L}_{PALP}\) from eq. (9), giving the overall loss:
\begin{equation} \mathcal {L} = \mathcal {L}_{P} + \lambda \cdot \mathcal {L}_{PALP}, \end{equation}
(10)
where λ balances prompt-alignment fidelity (we used λ = 0.2). Further, we found that imbalanced guidance scales, i.e., α > β, perform better and that setting t1 = t2 improves numerical stability. We also considered two variant implementations: (1) using the same noise (i.e., ϵ1 = ϵ2) for both terms, or (2) sampling different noises. Using the same noise achieves better text alignment than the latter variant; for further details, please refer to the ablation study in section 4.1.
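A hypothetical single training step tying eqs. (2), (4), (5), (9), and (10) together is sketched below; it reuses the helper functions from the earlier snippets and, for simplicity, lets `palp_loss` draw its own noise and timestep, whereas the paper reports that reusing the same ϵ and t as the personalization term works better.

```python
import torch
import torch.nn.functional as F

LAMBDA_PALP = 0.2    # lambda in eq. (10), value reported in the paper

def training_step(base_model, lora_model, x0, y_p_emb, y_c_emb, null_emb,
                  alphas_cumprod, optimizer):
    b = x0.shape[0]
    T = alphas_cumprod.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)
    eps = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps               # eq. (2)
    eps_hat = lora_model(x_t, t, y_p_emb)
    loss_p = F.mse_loss(eps_hat, eps)                                # eq. (4)
    x_hat0 = (x_t - (1 - a_bar).sqrt() * eps_hat) / a_bar.sqrt()     # eq. (5)
    loss_palp = palp_loss(base_model, lora_model, x_hat0,
                          y_c_emb, y_p_emb, null_emb, alphas_cumprod)
    loss = loss_p + LAMBDA_PALP * loss_palp                          # eq. (10)
    optimizer.zero_grad()
    loss.backward()      # updates only the LoRA parameters and the [V] embedding
    optimizer.step()
    return loss.item()
```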

4 Results

Table 1: Qualitative comparison in the multi-shot setting (columns: Input, Ours, TI+DB, CD, P+, NeTI). Our method achieves state-of-the-art results on complex prompts, better preserving identity and prompt alignment. For TI [Gal et al. 2023a]+DB [Ruiz et al. 2023], we use the same seed to generate our results, emphasizing the gain achieved by incorporating the prompt-alignment path. For other baselines, we chose the best two out of eight samples.
Experimental setup: We use StableDiffusion (SD)-v1.4 [Rombach et al. 2022] for ablation and comparison purposes, as many official implementations of state-of-the-art methods are available for SD-v1.4. We further validate our method with larger text-to-image models (please refer to the supplemental materials for further details). The complete experimental configuration, including learning rate, number of steps, and batch size, appears in the supplemental material.
Evaluation metrics: For evaluation, we follow previous works [Gal et al. 2023a; Ruiz et al. 2023] and use the CLIP score [Radford et al. 2021] to measure alignment with the target clean prompt yc (i.e., without the placeholder [V]). For subject preservation, we also use CLIP feature similarity between the input images and the images generated with the target prompt. We use ViT-B/32 [Dosovitskiy et al. 2021], trained by OpenAI on their proprietary data, for both metrics. This ensures that the CLIP model underlying SD-v1.4 differs from the one used for evaluation; using the same model could compromise the validity of the reported metric.
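These metrics can be approximated with an off-the-shelf CLIP ViT-B/32, e.g., via Hugging Face Transformers; the sketch below is illustrative and may differ from the authors' exact evaluation pipeline (checkpoint name and preprocessing are assumptions).

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_scores(generated: Image.Image, reference: Image.Image, clean_prompt: str):
    """Text alignment: cosine similarity between the generated image and the clean prompt.
    Image alignment: cosine similarity between the generated and input images."""
    inputs = proc(text=[clean_prompt], images=[generated, reference],
                  return_tensors="pt", padding=True)
    img = clip.get_image_features(pixel_values=inputs["pixel_values"])
    txt = clip.get_text_features(input_ids=inputs["input_ids"],
                                 attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img[0] @ txt[0]).item(), (img[0] @ img[1]).item()
```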
Dataset: For the multi-shot setting, we use data collected by previous methods [Gal et al. 2023a; Kumari et al. 2023], with different subjects such as animals, toys, personal items, and buildings. For these subjects, checkpoints of previous methods exist, allowing a fair comparison.

4.1 Ablation studies

Fig. 6: Ablation study: Image alignment (left) and text alignment (right) are reported against the number of fine-tuning steps. The base model results show the pre-trained model's performance using the target subject class.
For ablation, we start with TI [Gal et al. 2023a] and DB [Ruiz et al. 2023] as our baseline personalization method and gradually add different components contributing to our final method. Full experimental details appear in the supplemental materials.
Early stopping: We begin by considering early stopping as a way to control text alignment. The fewer the finetuning iterations, the less likely we are to hurt the model's prior knowledge. However, this comes at the cost of subject fidelity, as evident from fig. 6. The longer we tune the model on the target subject, the more we risk overfitting to the training set.
Adding SDS guidance: improves text alignment, yet it severely harms subject fidelity, and image diversity is substantially reduced (see supplemental materials). Alternative distillation sampling guidance [Katzir et al. 2023] improves on SDS; however, since the distillation sampling guides the personalization optimization towards the center of the subject class's distribution, it still produces less favorable results.
Replacing SDS with PALP guidance: improves text alignment by a considerable margin while maintaining high fidelity to the subject S. We consider two variants: one where we reuse the same noise as the personalization loss, and one where we sample a new noise from the normal distribution. Interestingly, using the same noise helps with prompt alignment. Furthermore, scaling the score sampling equation eq. (9) by \(\sqrt {\bar{\alpha }_{t}} / \sqrt {1-\bar{\alpha }_{t}}\) further enhances the performance. For further details and qualitative samples, please refer to the supplemental material.

4.2 Comparison with Existing Methods

Table 2: Comparisons to prior work. Our method presents better prompt alignment without hindering personalization. The first five columns report Text-Alignment ↑ per prompt element and for the full target prompt; the last column reports Image-Alignment ↑.

Method | Style | Class | Ambiance 1 | Ambiance 2 | Target Prompt | Image-Alignment ↑
P+ | 0.244 | 0.257 | 0.217 | 0.218 | 0.308 | 0.673
NeTI | 0.235 | 0.264 | 0.22 | 0.214 | 0.310 | 0.695
TI+DB | 0.237 | 0.279 | 0.22 | 0.216 | 0.319 | 0.716
Ours | 0.245 | 0.272 | 0.23 | 0.224 | 0.340 | 0.681
Table 3: User study results. For text alignment, we report the percentage of prompt elements found in the generated images. For personalization, users rated the similarity between subject S and the generated images.

Metric | P+ | NeTI | TI+DB | Ours
Text-Alignment ↑ | 68.5% | 63.2% | 73.3% | 91.2%
Personalization ↑ | 61.2% | 70.3% | 60.4% | 72.1%
Table 4: Qualitative comparison against ProFusion [Zhou et al. 2023], IP-Adapter [Ye et al. 2023], E4T [Gal et al. 2023b], and Face0 [Valevski et al. 2023]. Prompts shown: "3D Render as a chef", "Vector art wearing a hat", "A painting by Da Vinci", "Anime drawing", "Pop art", and "A Caricature". On the left, we show the results of our method on two individuals using a single image with multiple prompts. To meet space requirements, we show the results of a different method on each row, in the order ProFusion, IP-Adapter, E4T, and Face0. The full comparison appears in the supplemental material.
Table 5: Additional qualitative comparison in the multi-shot setting (columns: Input, Ours, TI+DB, CD, P+, NeTI).
We compare our method against multi-shot methods, including CustomDiffusion [Kumari et al. 2023], P+ [Voynov et al. 2023], and NeTI [Alaluf et al. 2023]. We further compare against TI [Gal et al. 2023a] and DB [Ruiz et al. 2023] using our implementation, which also highlights the gain we achieve by incorporating our framework into existing personalization methods. Our evaluation set contains ten different complex prompts that include at least four different elements, including a style change (e.g., "sketch of", "anime drawing of"), a time or place (e.g., "in Paris", "at night"), and color palettes (e.g., "warm", "vintage"). Quantitative results appear in table 2, and qualitative comparisons appear in table 1 and table 5.
Our method achieves the best text alignment while maintaining high image alignment. TI+DB achieves the best image alignment; however, this is because TI+DB is prone to overfitting. Indeed, investigating each element in the prompt, we find that TI+DB achieves the best alignment with the class prompt (e.g., "A photo of a cat") while being significantly worse on the style prompt (e.g., "A sketch"). Our method has slightly lower image alignment since we expect an appearance change for stylized prompts. We validate this hypothesis with a user study and find that our method achieves the best user preference in both prompt alignment and personalization (see table 3). Please refer to the supplemental material for further non-stylized results and full details of the user study.

4.3 Applications

Single-shot setting: In a single-shot setting, we aim to personalize text-to-image models using a single image. This setting is helpful for cases where only a single image of the target subject exists (e.g., an old photo of a loved one). For this setting, we qualitatively compare our method with encoder-based methods, including IP-Adapter [Ye et al. 2023], ProFusion [Zhou et al. 2023], Face0 [Valevski et al. 2023], and E4T [Gal et al. 2023b]. We use portraits of two individuals and expect previous methods to generalize to our selected images since all methods are pre-trained on human faces. Note that E4T [Gal et al. 2023b] and ProFusion [Zhou et al. 2023] also perform test-time optimization.
As seen in table 4, our method is both prompt- and identity-aligned. Previous methods, on the other hand, struggle more with identity preservation. We note that optimization-based approaches [Gal et al. 2023b; Zhou et al. 2023] are more identity-preserving, but this comes at the cost of text alignment. Finally, our method achieves a higher success rate, with the quality of the result being independent of the chosen seed.
Multi-concept personalization: Our method accommodates multi-subject personalization via simple modifications. Assume we want to compose two subjects, S1 and S2, in a specific scene depicted by a given prompt y. To do so, we first allocate two different placeholders, [V1] and [V2], to represent the target subjects S1 and S2, respectively. During training, we randomly sample an image from a set of images containing S1 and S2. We assign a different personalization prompt yP to each subject, e.g., "A photo of [V1]" or "A painting inspired by [V2]", depending on the context. Then, we perform PALP with the target prompt in mind, e.g., "A painting of [V1] inspired by [V2]". This allows composing different subjects into coherent scenes or using a single artwork as a reference for generating art-inspired images. Results appear in fig. 3; further details and results appear in the supplemental material.
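A hypothetical configuration for the two-subject example above is sketched below; the structure, file paths, and helper function are illustrative assumptions, not the authors' code. Each training step pairs a randomly chosen subject image with that subject's personalization prompt, while the PALP branch always uses the shared target prompt (with the placeholders omitted for the clean prompt, as in Sec. 3.4).

```python
import random

subjects = {
    "[V1]": {"images": ["photos/person/img0.jpg", "photos/person/img1.jpg"],
             "prompt": "A photo of [V1]"},
    "[V2]": {"images": ["art/wanderer_above_the_sea_of_fog.jpg"],
             "prompt": "A painting inspired by [V2]"},
}
target_prompt = "A painting of [V1] inspired by [V2]"
# Clean prompt for the alignment branch: placeholders omitted, as in Sec. 3.4.
clean_prompt = " ".join(w for w in target_prompt.split() if w not in subjects)

def sample_personalization_pair():
    """Pick one subject at random and return (image_path, personalization_prompt)."""
    token = random.choice(list(subjects))
    entry = subjects[token]
    return random.choice(entry["images"]), entry["prompt"]
```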

5 Conclusions

We have introduced a novel personalization method that allows better prompt alignment. Our approach involves fine-tuning a pre-trained model to learn a given subject while employing score sampling to maintain alignment with the target prompt. We achieve favorable results in both prompt- and subject-alignment and push the boundary of personalization methods to handle complex prompts, comprising multiple subjects, even when one subject has only a single reference image.
While the resulting personalized model still generalizes to other prompts, we must personalize the pre-trained model for each different prompt to achieve optimal results. For practical real-time use cases, there may be better options. However, future directions employing prompt-aligned adapters could enable instant personalization for a specific prompt (e.g., for sketches). Finally, we hope our work will motivate future methods to excel on a subset of prompts, allowing more specialized methods to achieve better and more accurate results.

Supplemental Material

Supplemental material (PDF) for PALP: Prompt Aligned Personalization of Text-to-Image Models.

References

[1]
Yuval Alaluf, Elad Richardson, Gal Metzer, and Daniel Cohen-Or. 2023. A Neural Space-Time Representation for Text-to-Image Personalization. CoRR abs/2305.15391 (2023). arXiv:2305.15391
[2]
Moab Arar, Rinon Gal, Yuval Atzmon, Gal Chechik, Daniel Cohen-Or, Ariel Shamir, and Amit H. Bermano. 2023. Domain-Agnostic Tuning-Encoder for Fast Personalization of Text-To-Image Models. CoRR abs/2307.06925 (2023). arXiv:2307.06925
[3]
Omri Avrahami, Kfir Aberman, Ohad Fried, Daniel Cohen-Or, and Dani Lischinski. 2023a. Break-A-Scene: Extracting Multiple Concepts from a Single Image. CoRR abs/2305.16311 (2023). arXiv:2305.16311
[4]
Omri Avrahami, Ohad Fried, and Dani Lischinski. 2023b. Blended Latent Diffusion. ACM Trans. Graph. 42, 4 (2023), 149:1–149:11.
[5]
Omri Avrahami, Dani Lischinski, and Ohad Fried. 2022. Blended Diffusion for Text-driven Editing of Natural Images. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022. IEEE, 18187–18197.
[6]
Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, and Tali Dekel. 2022. Text2LIVE: Text-Driven Layered Image and Video Editing. In Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XV(Lecture Notes in Computer Science, Vol. 13675), Shai Avidan, Gabriel J. Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner (Eds.). Springer, 707–723.
[7]
Tim Brooks, Aleksander Holynski, and Alexei A. Efros. 2023. InstructPix2Pix: Learning to Follow Image Editing Instructions. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023. IEEE, 18392–18402.
[8]
Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. 2023a. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. ACM Transactions on Graphics (TOG) 42, 4 (2023), 1–10.
[9]
Hila Chefer, Oran Lang, Mor Geva, Volodymyr Polosukhin, Assaf Shocher, Michal Irani, Inbar Mosseri, and Lior Wolf. 2023b. The Hidden Language of Diffusion Models. arXiv preprint arXiv:2306.00966 (2023).
[10]
Wenhu Chen, Hexiang Hu, Yandong Li, Nataniel Ruiz, Xuhui Jia, Ming-Wei Chang, and William W. Cohen. 2023. Subject-driven Text-to-Image Generation via Apprenticeship Learning. CoRR abs/2304.00186 (2023). arXiv:2304.00186
[11]
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net. https://openreview.net/forum?id=YicbFdNTTy
[12]
Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, and William Yang Wang. 2022. Training-free structured diffusion guidance for compositional text-to-image synthesis. arXiv preprint arXiv:2212.05032 (2022).
[13]
Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. 2022. Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors. In Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XV(Lecture Notes in Computer Science, Vol. 13675), Shai Avidan, Gabriel J. Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner (Eds.). Springer, 89–106.
[14]
Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit Haim Bermano, Gal Chechik, and Daniel Cohen-Or. 2023a. An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net. https://openreview.net/pdf?id=NAQvF08TcyG
[15]
Rinon Gal, Moab Arar, Yuval Atzmon, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. 2023b. Encoder-based Domain Tuning for Fast Personalization of Text-to-Image Models. ACM Trans. Graph. 42, 4 (2023), 150:1–150:13.
[16]
Rinon Gal, Or Patashnik, Haggai Maron, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. 2022. StyleGAN-NADA: CLIP-guided domain adaptation of image generators. ACM Trans. Graph. 41, 4 (2022), 141:1–141:13.
[17]
Ligong Han, Yinxiao Li, Han Zhang, Peyman Milanfar, Dimitris Metaxas, and Feng Yang. 2023. SVDiff: Compact parameter space for diffusion fine-tuning. arXiv preprint arXiv:2303.11305 (2023).
[18]
Amir Hertz, Kfir Aberman, and Daniel Cohen-Or. 2023a. Delta Denoising Score. CoRR abs/2304.07090 (2023). arXiv:2304.07090
[19]
Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. 2023b. Prompt-to-Prompt Image Editing with Cross-Attention Control. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net. https://openreview.net/pdf?id=_CDixzkzeyb
[20]
Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33 (2020), 6840–6851.
[21]
Jonathan Ho and Tim Salimans. 2022. Classifier-Free Diffusion Guidance. CoRR abs/2207.12598 (2022). arXiv:2207.12598
[22]
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net. https://openreview.net/forum?id=nZeVKeeFYf9
[23]
Ziqi Huang, Tianxing Wu, Yuming Jiang, Kelvin C. K. Chan, and Ziwei Liu. 2023. ReVersion: Diffusion-Based Relation Inversion from Images. CoRR abs/2303.13495 (2023). arXiv:2303.13495
[24]
Shir Iluz, Yael Vinker, Amir Hertz, Daniel Berio, Daniel Cohen-Or, and Ariel Shamir. 2023. Word-As-Image for Semantic Typography. ACM Trans. Graph. 42, 4 (2023), 151:1–151:11.
[25]
Ajay Jain, Amber Xie, and Pieter Abbeel. 2023. VectorFusion: Text-to-SVG by Abstracting Pixel-Based Diffusion Models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023. IEEE, 1911–1920.
[26]
Oren Katzir, Or Patashnik, Daniel Cohen-Or, and Dani Lischinski. 2023. Noise-Free Score Distillation. arXiv preprint arXiv:2310.17590 (2023).
[27]
Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. 2023. Imagic: Text-Based Real Image Editing with Diffusion Models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023. IEEE, 6007–6017.
[28]
Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. 2023. Multi-Concept Customization of Text-to-Image Diffusion. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023. IEEE, 1931–1941.
[29]
Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. 2023. Null-text Inversion for Editing Real Images using Guided Diffusion Models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023. IEEE, 6038–6047.
[30]
Jisu Nam, Heesu Kim, DongJae Lee, Siyoon Jin, Seungryong Kim, and Seunggyu Chang. 2024. Dreammatcher: Appearance matching self-attention for semantically-consistent text-to-image personalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8100–8110.
[31]
Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. 2021. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741 (2021).
[32]
Lianyu Pang, Jian Yin, Haoran Xie, Qiping Wang, Qing Li, and Xudong Mao. 2023. Cross initialization for personalized text-to-image generation. arXiv preprint arXiv:2312.15905 (2023).
[33]
Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, and Jun-Yan Zhu. 2023. Zero-shot Image-to-Image Translation. arXiv:2302.03027 [cs.CV]
[34]
Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. 2021. StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021. IEEE, 2065–2074.
[35]
Quynh Phung, Songwei Ge, and Jia-Bin Huang. 2023. Grounded Text-to-Image Synthesis with Attention Refocusing. arXiv preprint arXiv:2306.05427 (2023).
[36]
Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. 2023. DreamFusion: Text-to-3D using 2D Diffusion. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net. https://openreview.net/pdf?id=FjNys5c7VyY
[37]
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. CoRR abs/2103.00020 (2021). arXiv:2103.00020
[38]
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-Shot Text-to-Image Generation. CoRR abs/2102.12092 (2021). arXiv:2102.12092
[39]
Royi Rassin, Eran Hirsch, Daniel Glickman, Shauli Ravfogel, Yoav Goldberg, and Gal Chechik. 2023. Linguistic Binding in Diffusion Models: Enhancing Attribute Correspondence through Attention Map Alignment. arXiv:2306.08877 [cs.CL]
[40]
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-Resolution Image Synthesis with Latent Diffusion Models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022. IEEE, 10674–10685.
[41]
Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. 2023. DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023. IEEE, 22500–22510.
[42]
Simo Ryu. 2023. Low-rank Adaptation for Fast Text-to-Image Diffusion Fine-tuning. https://github.com/cloneofsimo/lora.
[43]
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. 2022. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. arXiv preprint arXiv:2205.11487 (2022).
[44]
Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. 2021. LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114 (2021).
[45]
Eyal Segalis, Dani Valevski, Danny Lumen, Yossi Matias, and Yaniv Leviathan. 2023. A Picture is Worth a Thousand Words: Principled Recaptioning Improves Image Generation. arXiv preprint arXiv:2310.16656 (2023).
[46]
Jiaming Song, Chenlin Meng, and Stefano Ermon. 2021. Denoising Diffusion Implicit Models. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net. https://openreview.net/forum?id=St1giarCHLP
[47]
Kunpeng Song, Ligong Han, Bingchen Liu, Dimitris N. Metaxas, and Ahmed Elgammal. 2022. Diffusion Guided Domain Adaptation of Image Generators. CoRR abs/2212.04473 (2022). arXiv:2212.04473
[48]
Yoad Tewel, Rinon Gal, Gal Chechik, and Yuval Atzmon. 2023. Key-Locked Rank One Editing for Text-to-Image Personalization. In ACM SIGGRAPH 2023 Conference Proceedings, SIGGRAPH 2023, Los Angeles, CA, USA, August 6-10, 2023, Erik Brunvand, Alla Sheffer, and Michael Wimmer (Eds.). ACM, 12:1–12:11.
[49]
Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. 2023. Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023. IEEE, 1921–1930.
[50]
Dani Valevski, Matan Kalman, Yossi Matias, and Yaniv Leviathan. 2022. UniTune: Text-driven image editing by fine tuning an image generation model on a single image. arXiv preprint arXiv:2210.09477 (2022).
[51]
Dani Valevski, Danny Wasserman, Yossi Matias, and Yaniv Leviathan. 2023. Face0: Instantaneously Conditioning a Text-to-Image Model on a Face. CoRR abs/2306.06638 (2023). arXiv:2306.06638
[52]
Yael Vinker, Andrey Voynov, Daniel Cohen-Or, and Ariel Shamir. 2023. Concept Decomposition for Visual Exploration and Inspiration. CoRR abs/2305.18203 (2023). arXiv:2305.18203
[53]
Andrey Voynov, Qinghao Chu, Daniel Cohen-Or, and Kfir Aberman. 2023. P+: Extended Textual Conditioning in Text-to-Image Generation. CoRR abs/2303.09522 (2023). arXiv:2303.09522
[54]
Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. 2023. ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation. CoRR abs/2305.16213 (2023). arXiv:2305.16213
[55]
Yuxiang Wei, Yabo Zhang, Zhilong Ji, Jinfeng Bai, Lei Zhang, and Wangmeng Zuo. 2023. ELITE: Encoding visual concepts into textual embeddings for customized text-to-image generation. arXiv preprint arXiv:2302.13848 (2023).
[56]
Qiucheng Wu, Yujian Liu, Handong Zhao, Trung Bui, Zhe Lin, Yang Zhang, and Shiyu Chang. 2023. Harnessing the Spatial-Temporal Attention of Diffusion Models for High-Fidelity Text-to-Image Synthesis. CoRR abs/2304.03869 (2023). arXiv:2304.03869
[57]
Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. 2023. IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models. CoRR abs/2308.06721 (2023). arXiv:2308.06721
[58]
Yufan Zhou, Ruiyi Zhang, Tong Sun, and Jinhui Xu. 2023. Enhancing Detail Preservation for Customized Text-to-Image Generation: A Regularization-Free Approach. CoRR abs/2305.13579 (2023). arXiv:2305.13579
