1 Introduction
Text-to-image models have shown exceptional abilities to generate a diversity of images in various settings (place, time, style, and appearance), such as "a sketch of Paris on a rainy day" or "a Manga drawing of a teddy bear at night" [Ramesh et al. 2021; Saharia et al. 2022]. Recently, personalization methods even allow one to include specific subjects (objects, animals, or people) in the generated images [Gal et al. 2023a; Ruiz et al. 2023]. In practice, however, such models are difficult to control and may require significant prompt engineering and re-sampling to create the specific image one has in mind. This is even more acute with personalized models, where it is challenging to include the personal item or character in the image and simultaneously fulfill the textual prompt describing the content and style. This work proposes a method for better personalization and prompt alignment, especially suited for complex prompts.
A key ingredient in personalization methods is fine-tuning pre-trained text-to-image models on a small set of personal images while relying on heavy regularization to maintain the model's capacity. Doing so preserves the model's prior knowledge and allows the user to synthesize images with various prompts; however, it impairs capturing the identifying features of the target subject. On the other hand, insisting on identification accuracy can hinder the prompt-alignment capabilities. Generally speaking, the trade-off between identity preservation and prompt alignment is a core challenge in personalization methods (see fig. 2).
Content creators and AI artists frequently have a clear idea of the prompt they wish to utilize. It may involve stylization and other factors that current personalization methods struggle to maintain. Therefore, we take a different approach by focusing on excelling with a single prompt rather than offering a general-purpose method intended to perform well with a wide range of prompts. This approach enables both (i) learning the unique features of the subject from a few or even a single input image and (ii) generating richer scenes that are better aligned with the user's desired prompt (see fig. 1).
Our work is based on the premise that existing models possess knowledge of all elements within the target prompt except the new personal subject. Consequently, we leverage the pre-trained model's prior knowledge to prevent personalized models from losing their understanding of the target prompt. In particular, since we know the target prompt during training, we show how to incorporate score-distillation guidance [Poole et al. 2023] to constrain the personalized model's prediction to stay aligned with the pre-trained one. We therefore introduce a framework comprising two components: personalization, which teaches the model about our new subject, and prompt alignment, which prevents it from forgetting elements included in the target prompt.
Our approach liberates content creators from constraints associated with specific prompts, unleashing the full potential of text-to-image models. We evaluate our method qualitatively and quantitatively and show superior results compared with the baselines in multi- and single-shot settings, all without pre-training on large-scale data [Arar et al. 2023; Gal et al. 2023b], which can be difficult to obtain for certain domains. Finally, we show that our method can accommodate multi-subject personalization with minor modifications and offers new applications, such as drawing inspiration from a single artistic painting rather than just text (see fig. 3).
2 Related work
Text-to-image synthesis. has seen unprecedented progress in recent years [Gafni et al. 2022; Nichol et al. 2021; Ramesh et al. 2021; Rombach et al. 2022; Saharia et al. 2022], mostly due to large-scale training on data like LAION-400M [Schuhmann et al. 2021]. Our approach uses pre-trained diffusion models [Ho et al. 2020] and extends their understanding to new subjects. We use the publicly available Stable Diffusion model [Rombach et al. 2022] for most of our experiments, since most baseline methods are open-sourced on SD. We further verify our method on a larger latent diffusion model variant [Rombach et al. 2022].
Text-based editing. methods rely on contrastive multi-modal models like CLIP [Radford et al. 2021] as an interface to guide local and global edits [Avrahami et al. 2023b; 2022; Bar-Tal et al. 2022; Gal et al. 2022; Patashnik et al. 2021]. Recently, Prompt-to-Prompt (P2P) [Hertz et al. 2023b] was proposed as a way to edit and manipulate generated images by editing the attention maps in the cross-attention layers of a pre-trained text-to-image model. Later, Mokady et al. [2023] extended P2P to real images by encoding them into the null-conditioning space of classifier-free guidance [Ho and Salimans 2022]. InstructPix2Pix [Brooks et al. 2023] uses an instruction-guided image-to-image translation network trained on synthetic data. Others preserve image structure by using reference attention maps [Parmar et al. 2023] or features extracted using DDIM [Song et al. 2021] inversion [Tumanyan et al. 2023]. Imagic [Kawar et al. 2023] starts from a target prompt, finds a text embedding that reconstructs an input image, and then interpolates between the two to achieve the final edit. UniTune [Valevski et al. 2022], on the other hand, performs the interpolation in pixel space during the backward denoising process. In our work, we focus on the ability to generate images depicting a given subject, which need not maintain the global structure of an input image.
Early personalization methods. like Textual Inversion [Gal et al. 2023a] and DreamBooth [Ruiz et al. 2023] tune pre-trained text-to-image models to represent new subjects, either by finding a new soft word embedding [Gal et al. 2023a] or by calibrating model weights [Ruiz et al. 2023] together with existing words to represent the newly added subject. Later methods improved the memory requirements of these approaches using low-rank updates [Hu et al. 2022; Kumari et al. 2023; Ryu 2023; Tewel et al. 2023] or a compact parameter space [Han et al. 2023]. On another front, NeTI [Alaluf et al. 2023] and P+ [Voynov et al. 2023] enhance TI [Gal et al. 2023a] by using additional tokens to more effectively capture subject-identifying features, whereas [Pang et al. 2023] introduces a cross-initialization method to bridge the gap between the learned token embedding and the original model token embeddings. DreamMatcher [Nam et al. 2024] uses appearance matching in the generation process while maintaining the structure path of the original text-to-image model. Personalization can also be used for other tasks: ReVersion [Huang et al. 2023] showed how to learn relational features from reference images, and Vinker et al. [2023] used personalization to decompose and visualize concepts at different abstraction levels. Chefer et al. [2023b] propose an interpretability method for text-to-image models that decomposes concepts into interpretable tokens. Another line of work pre-trains encoders on large-scale data for near-instant, single-shot adaptation [Arar et al. 2023; Chen et al. 2023; Gal et al. 2023b; Valevski et al. 2023; Wei et al. 2023; Ye et al. 2023; Zhou et al. 2023]. Single-image personalization has also been addressed by Avrahami et al. [2023a], who use segmentation masks to personalize a model on different subjects. In our work, we focus on prompt alignment, and our baseline personalization method can be replaced by any of the personalization methods above.
Score Distillation Sampling (SDS). emerged as a technique for leveraging 2D diffusion model priors [Rombach et al. 2022; Saharia et al. 2022] for 3D generation from textual input. This technique soon found its way into different applications, such as SVG generation [Iluz et al. 2023; Jain et al. 2023], image editing [Hertz et al. 2023a], and more [Song et al. 2022]. Other variants of SDS [Poole et al. 2023] aim to improve its image-generation quality, which suffers from over-saturation and blurriness [Katzir et al. 2023; Wang et al. 2023].
Text-to-image alignment. methods address text-related issues that arise in base generative diffusion models, including neglecting specific parts of the text, attribute mixing, and more. Previous methods tackle these issues through attention-map re-weighting [Feng et al. 2022; Phung et al. 2023; Wu et al. 2023], latent optimization [Chefer et al. 2023a; Rassin et al. 2023], or re-training with additional data [Segalis et al. 2023]. However, none of these methods address the prompt alignment of personalization methods; instead, they aim to enhance the base models' ability to generate text-aligned images.
3 Method
In our method, we strive to teach a pre-trained text-to-image model to generate images of a new subject S, which the model does not recognize. Although personalization methods can extend the model to such new subjects, they struggle to create images of S described by complex prompts. To overcome this obstacle, we optimize for both subject fidelity and the ability to faithfully fulfill a target prompt. While the latter seems non-trivial, we show that by knowing the target prompt at the finetuning stage, we can help the model preserve its prior knowledge about the different elements in the text.
We begin by giving an overview of personalization methods and identifying the prompt misalignment caused by the personalization process. We then present our solution, which relies on knowing the prompt during personalization. More specifically, we simultaneously finetune the model to become personalized and to not lose its knowledge about the target prompt \(y_{trg}\).
3.1 Preliminaries
Text-to-image Diffusion models perform a backward diffusion process to generate an image \(x\). In this process, a denoising model \(G\) with parameters \(\theta\) progressively cleans an input noise \(x_T \sim \mathcal{N}(0,1)\) to produce a sample \(x\) from an underlying data distribution \(p(X)\). At each diffusion timestep \(t \in \{T, \dots, 1, 0\}\), the model predicts a noise \(\hat{\epsilon} = G\left(x_t, t, y; \theta\right)\) conditioned on a prompt \(y\) and the timestep \(t\). The generative model is trained to maximize the evidence lower bound (ELBO) using a denoising score-matching objective [Ho et al. 2020]:
\[ \mathcal{L}_{diff} = \mathbb{E}_{x, y, \epsilon \sim \mathcal{N}(0,1), t}\left[\left\| \epsilon - G\left(x_t, t, y; \theta\right) \right\|_2^2\right], \tag{1} \]
where, given the noise-scheduler parameters \(\sqrt{\bar{\alpha}_{t}}\), the latent \(x_t\) at timestep \(t\) is given by the forward diffusion:
\[ x_t = \sqrt{\bar{\alpha}_{t}}\, x + \sqrt{1-\bar{\alpha}_{t}}\, \epsilon. \tag{2} \]
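For illustration, a minimal PyTorch sketch of eqs. (1) and (2) for a single training batch; the denoiser `G`, the prompt embedding `y`, and the noise-schedule tensor `alpha_bars` are stand-ins rather than the paper's actual code:

```python
import torch
import torch.nn.functional as F

def forward_diffusion(x0, eps, alpha_bar_t):
    # eq. (2): noise a clean sample x0 to timestep t
    return alpha_bar_t.sqrt() * x0 + (1.0 - alpha_bar_t).sqrt() * eps

def diffusion_loss(G, x0, y, alpha_bars):
    # eq. (1): denoising score-matching objective
    t = torch.randint(0, len(alpha_bars), (x0.shape[0],), device=x0.device)
    eps = torch.randn_like(x0)
    x_t = forward_diffusion(x0, eps, alpha_bars[t].view(-1, 1, 1, 1))
    eps_hat = G(x_t, t, y)  # predicted noise, conditioned on prompt y and timestep t
    return F.mse_loss(eps_hat, eps)
```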
3.2 Personalization
Personalization methods finetune the diffusion model parameters \(\theta\) for a new target subject \(S\), such that the finetuned, personalized model can generate images of \(S\). Following prior works, we finetune the model \(G\) using the loss defined in eq. (1). Like Textual Inversion [Gal et al. 2023a], we learn a new word embedding, denoted by the placeholder [V], to represent our subject \(S\). We also finetune the model weights, similar to DreamBooth [Ruiz et al. 2023], using the Low-Rank Adaptation method (LoRA) [Hu et al. 2022; Kumari et al. 2023], where each layer weight \(W \in \mathbb{R}^{N\times M}\) is updated using:
\[ W' = W + \Delta W = W + A B. \tag{3} \]
Here, the matrices \(A\in \mathbb{R}^{N\times r}\) and \(B\in \mathbb{R}^{r\times M}\) are a learned low-rank decomposition of the weight update \(\Delta W \in \mathbb{R}^{N\times M}\). Similar to prior works [Kumari et al. 2023; Tewel et al. 2023], we only update the weights of the self- and cross-attention layers. Finally, finetuning is performed by pairing each image in the set with a personalization prompt \(y^P\) that contains the placeholder [V], e.g., \(y^P\) = "A photo of [V]."
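For illustration, a generic LoRA layer implementing the update of eq. (3); this is a sketch rather than the authors' implementation, and the rank and scaling values are arbitrary:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable low-rank update W' = W + A @ B (eq. 3)."""
    def __init__(self, base: nn.Linear, rank: int = 4, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # the pre-trained weight W stays frozen
        n, m = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(n, rank) * 0.01)   # A in R^{N x r}
        self.B = nn.Parameter(torch.zeros(rank, m))          # B in R^{r x M}, zero-init
        self.scale = scale

    def forward(self, x):
        delta_w = self.A @ self.B                            # low-rank weight update
        return self.base(x) + self.scale * F.linear(x, delta_w)
```

In practice, such wrappers would only be applied to the self- and cross-attention projections, as described above.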
Throughout the rest of the section, we denote by \(G_{\theta}\) and \(G_{\theta_{LoRA}}\) the pre-trained and the personalized models, respectively. Given an input image \(x\) of our subject \(S\) and a conditional text embedding \(y^P\), the personalization loss is given by:
\[ \mathcal{L}_{P} = \mathbb{E}_{\epsilon_1 \sim \mathcal{N}(0,1),\, t_1}\left[\left\| \epsilon_1 - G_{\theta_{LoRA}}\left(x_{t_1}, t_1, y^P\right) \right\|_2^2\right]. \tag{4} \]
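A minimal sketch of the personalization loss in eq. (4); `G_lora` denotes the LoRA-adapted denoiser and `y_p_embed` the embedding of \(y^P\) containing the learned [V] token (both are assumed interfaces, not the paper's API). The sampled noise and timestep are returned for reuse by the alignment term introduced in section 3.4:

```python
import torch
import torch.nn.functional as F

def personalization_loss(G_lora, x0, y_p_embed, alpha_bars):
    # L_P (eq. 4): the diffusion loss of eq. (1), evaluated with the LoRA-adapted
    # model and a prompt embedding y^P containing the placeholder [V].
    t1 = torch.randint(0, len(alpha_bars), (x0.shape[0],), device=x0.device)
    eps1 = torch.randn_like(x0)
    a = alpha_bars[t1].view(-1, 1, 1, 1)
    x_t1 = a.sqrt() * x0 + (1.0 - a).sqrt() * eps1           # forward diffusion, eq. (2)
    eps_hat = G_lora(x_t1, t1, y_p_embed)
    loss = F.mse_loss(eps_hat, eps1)
    return loss, (x_t1, t1, eps1, eps_hat)
```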
3.3 Over-fitting in Personalization
Let our target prompt be \(y_{trg}\), for example, "A sketch of [V]." Too many finetuning steps on a small image set cause the model to overfit. In this case, the diffusion model steers the backward denoising process towards one of the training images, regardless of the conditional prompt. To see this, let us visualize the model's prediction by analyzing its estimate of \(x\), the real-sample image. The estimate \(\hat{x}_0\) can be derived from the model's prediction via the forward diffusion in eq. (2):
\[ \hat{x}_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_{t}}\, \hat{\epsilon}}{\sqrt{\bar{\alpha}_{t}}}. \tag{5} \]
In fig. 5a, we show the estimates of three models: (1) the pre-trained model \(G_{\theta}\), (2) an overfitted personalized model, and (3) a prompt-aligned personalized model. As can be seen from the figure, after prolonged training, the overfitted personalized model can reconstruct the target image from a random noise in a single step. In particular, after one denoising step, elements from the training images, such as the background, are more dominant, and the sketchiness fades away, suggesting misalignment with the prompt "A sketch." Throughout the rest of this section, we denote by \(\hat{x}_0\) the personalized model's (\(G_{\theta_{LoRA}}\)) estimate from the latent \(x_t\).
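For reference, the single-step estimate of eq. (5) amounts to inverting the forward diffusion; a minimal sketch:

```python
def estimate_x0(x_t, eps_hat, alpha_bar_t):
    # eq. (5): invert the forward diffusion using the model's noise prediction
    return (x_t - (1.0 - alpha_bar_t).sqrt() * eps_hat) / alpha_bar_t.sqrt()
```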
3.4 Prompt-Aligned Personalization
We want to personalize \(G_{\theta}\) to generate images related to a new subject \(S\). However, unlike previous methods, we strongly emphasize achieving near-optimal results for a single textual prompt, denoted by \(y_{trg}\) (e.g., "A sketch of [V]").
Our key idea is to optimize our personalized weights, i.e., \(\theta_{LoRA}\) and [V], while simultaneously pushing the model prediction towards our target prompt \(y_{trg}\). For example, if \(y_{trg}\) = "A sketch of [V]", then we want the estimate \(\hat{x}_0\) to be both personalized and aligned with \(y_{trg}\). In other words, we want the estimate to contain elements of the subject \(S\) and to maintain a sketchy appearance instead of a photo-realistic one.
To achieve this, we make use of the pre-trained model \(G_{\theta}\), which possesses all the knowledge about the target prompt's elements (except for the subject \(S\)). By omitting "[V]" from the target prompt, we obtain a clean prompt that the pre-trained model understands (henceforth denoted \(y^c\)). We can measure the estimate's fidelity to \(y^c\) using the diffusion loss in eq. (1).
In particular, given a sampled noise \(\epsilon_2\) and timestep \(t_2\), we let \(\hat{x}_{t_2}\) be the latent of \(\hat{x}_0\) as given by the forward diffusion process. We further denote by \(G^{\alpha}_{\theta}\) the classifier-free guidance prediction of the base model, which is an extrapolation of the conditional and unconditional (\(y = \emptyset\)) noise predictions. The scalar \(\alpha \in \mathbb{R}^+\) controls the extrapolation via:
\[ G^{\alpha}_{\theta}\left(x_t, t, y\right) = G_{\theta}\left(x_t, t, \emptyset\right) + \alpha \left( G_{\theta}\left(x_t, t, y\right) - G_{\theta}\left(x_t, t, \emptyset\right) \right). \tag{6} \]
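For completeness, a generic sketch of the classifier-free guidance extrapolation in eq. (6), where `null_embed` stands for the embedding of the empty prompt:

```python
def cfg_prediction(G, x_t, t, y_embed, null_embed, guidance_scale):
    # eq. (6): extrapolate between the unconditional and conditional noise predictions
    eps_uncond = G(x_t, t, null_embed)
    eps_cond = G(x_t, t, y_embed)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```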
Then, to maintain alignment between the personalized model's prediction, i.e., \(\hat{x}_0\), and the non-personalized text \(y^c\), we can use the diffusion loss. Specifically, given \(\hat{x}_{t_2} = \sqrt{\bar{\alpha}_{t_2}} \hat{x}_0 + \sqrt{1-\bar{\alpha}_{t_2}} \epsilon_2\), we want to perturb the model prediction by minimizing:
\[ \mathcal{L} = \mathbb{E}_{\epsilon_2, t_2}\left[\left\| \epsilon_2 - G^{\alpha}_{\theta}\left(\hat{x}_{t_2}, t_2, y^c\right) \right\|_2^2\right]. \tag{7} \]
Specifically, we use Score Distillation Sampling (SDS), which provides a more effective gradient estimate [Poole et al. 2023]:
\[ \nabla_{\phi} \mathcal{L}_{SDS} = \left( G^{\alpha}_{\theta}\left(\hat{x}_{t_2}, t_2, y^c\right) - \epsilon_2 \right) \frac{\partial \hat{x}_0}{\partial \phi}, \tag{8} \]
where \(\phi\) are the weights controlling the appearance of \(\hat{x}_0\), which in our case are the LoRA weights and [V].
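A common way to implement such a gradient is a surrogate loss whose gradient with respect to \(\hat{x}_0\) reproduces the residual term in eq. (8); the detach-based sketch below follows this convention (an implementation assumption, not the paper's code) and reuses `cfg_prediction` from above:

```python
import torch

def sds_alignment_loss(G_base, x0_hat, y_c_embed, null_embed, eps2, t2, alpha_bar_t2, alpha):
    a = alpha_bar_t2
    x_t2 = a.sqrt() * x0_hat + (1.0 - a).sqrt() * eps2       # re-noise the estimate x0_hat
    with torch.no_grad():                                     # frozen base model, no gradient
        eps_base = cfg_prediction(G_base, x_t2, t2, y_c_embed, null_embed, alpha)
    grad = eps_base - eps2
    # (grad * x0_hat).sum() has gradient `grad` w.r.t. x0_hat, matching eq. (8);
    # backpropagation then carries it into the LoRA weights and [V] (i.e., phi).
    return (grad * x0_hat).sum()
```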
3.5 Avoiding Over-saturation and Mode Collapse
Incorporating the gradients defined in eq. (8) into our framework leads to less diverse and over-saturated results. While alternative implementations like [Katzir et al. 2023; Wang et al. 2023] improve diversity, the overall personalization is still affected (see supplemental materials).
Previous works [Hertz et al. 2023a; Katzir et al. 2023] have introduced a residual-score formulation that better estimates the desired gradient direction. In our case, we want to pivot the personalized model towards the base model's prediction. We therefore perturb the model prediction using the difference between the personalized model's prediction \(G_{\theta_{LoRA}}\) and the pre-trained one \(G_{\theta}\), and use the residual score:
\[ \nabla_{\phi} \mathcal{L}_{PALP} = \left( G^{\alpha}_{\theta}\left(\hat{x}_{t_2}, t_2, y^c\right) - G^{\beta}_{\theta_{LoRA}}\left(\hat{x}_{t_2}, t_2, y^P\right) \right) \frac{\partial \hat{x}_0}{\partial \phi}, \tag{9} \]
where \(\alpha\) and \(\beta\) are the guidance scales of \(G_{\theta}\) and \(G_{\theta_{LoRA}}\), respectively, \(y^P\) is the personalization prompt, and \(y^c\) is the clean prompt (see the right part of fig. 4).
In fig. 5b, we visualize the residual gradient defined in eq. (9). The figure illustrates that incorporating PALP effectively reduces overfitting while retaining personalization capabilities. Ideally, both the base and personalized models should perform similarly in background regions: an optimal personalized model should produce background noise predictions comparable to the base model's and enhanced noise predictions for the target subject. However, without PALP, DreamBooth overfits by excessively denoising both the background and the target subject (see the middle image in fig. 5b). In contrast, with PALP, the personalized model matches the base model's noise predictions for the background while improving predictions for the target subject.
Notice that our formulation is derived from the personalization overfitting problem, where we calculate a residual with respect to the estimates of two different networks given the same input, whereas previous works used the predictions of the same network over different input images, aiming to improve image-to-image or NeRF tasks.
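Following the same surrogate-loss pattern, a hypothetical sketch of the residual score in eq. (9); the prompt embeddings and guidance scales are assumed inputs, and `cfg_prediction` is the sketch from section 3.4:

```python
import torch

def palp_loss(G_base, G_lora, x0_hat, y_c_embed, y_p_embed, null_embed,
              eps2, t2, alpha_bar_t2, alpha, beta):
    a = alpha_bar_t2
    x_t2 = a.sqrt() * x0_hat + (1.0 - a).sqrt() * eps2
    with torch.no_grad():
        # classifier-free guidance predictions of the base and personalized models (eq. 6)
        eps_base = cfg_prediction(G_base, x_t2, t2, y_c_embed, null_embed, alpha)
        eps_pers = cfg_prediction(G_lora, x_t2, t2, y_p_embed, null_embed, beta)
    residual = eps_base - eps_pers                            # pivot towards the base model
    # gradient w.r.t. x0_hat equals the residual, matching eq. (9)
    return (residual * x0_hat).sum()
```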
3.6 Overall Training Objective
Our final objective is to optimize \(\theta_{LoRA}\) using \(\mathcal{L}_{P}\) from eq. (4) and \(\mathcal{L}_{PALP}\) from eq. (9), giving the loss:
\[ \mathcal{L} = \mathcal{L}_{P} + \lambda \mathcal{L}_{PALP}, \tag{10} \]
where \(\lambda\) balances the prompt-alignment fidelity (we used \(\lambda = 0.2\)). Further, we found imbalanced guidance scales, i.e., \(\alpha > \beta\), to perform better, and setting \(t_1 = t_2\) improves numerical stability. We have also considered two variant implementations: (1) using the same noise (i.e., \(\epsilon_1 = \epsilon_2\)), or (2) using different ones. Using the same noise achieves better text alignment than the latter variant; for further details, please refer to the ablation studies in section 4.1.
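Putting the pieces together, a hypothetical training step for eq. (10) with the choices reported above (shared noise and timestep, \(\lambda = 0.2\), \(\alpha > \beta\)); it builds on the `personalization_loss`, `estimate_x0`, and `palp_loss` sketches from the previous sections, the guidance values shown are illustrative, and the optimizer is assumed to hold only the LoRA weights and the [V] embedding:

```python
def palp_training_step(G_base, G_lora, x0, y_p_embed, y_c_embed, null_embed,
                       alpha_bars, optimizer, lam=0.2, alpha=15.0, beta=7.5):
    loss_p, (x_t1, t1, eps1, eps_hat) = personalization_loss(G_lora, x0, y_p_embed, alpha_bars)
    a = alpha_bars[t1].view(-1, 1, 1, 1)
    x0_hat = estimate_x0(x_t1, eps_hat, a)                    # eq. (5)
    # shared noise and timestep: eps2 = eps1 and t2 = t1
    loss_palp = palp_loss(G_base, G_lora, x0_hat, y_c_embed, y_p_embed, null_embed,
                          eps1, t1, a, alpha, beta)
    loss = loss_p + lam * loss_palp                           # eq. (10)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                          # updates theta_LoRA and [V] only
    return loss.item()
```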
4 Results
Experimental setup: We use Stable Diffusion (SD) v1.4 [Rombach et al. 2022] for ablation and comparison purposes, as official implementations of many state-of-the-art methods are available for SD-v1.4. We further validate our method with larger text-to-image models (please refer to the supplemental materials for further details). The complete experimental configuration, including learning rate, number of steps, and batch size, appears in the supplemental material.
Evaluation metric: for evaluation, we follow previous works [Gal et al. 2023a; Ruiz et al. 2023] and use the CLIP score [Radford et al. 2021] to measure alignment with the target (clean) prompt \(y^c\) (i.e., without the placeholder [V]). For subject preservation, we use CLIP feature similarity between the input images and the images generated with the target prompt. For both metrics, we use ViT-B/32 [Dosovitskiy et al. 2021] trained by OpenAI on their proprietary data. This ensures that the CLIP model underlying SD-v1.4 differs from the one used for evaluation, which could otherwise compromise the validity of the reported metrics.
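A sketch of how these two metrics could be computed with an OpenAI CLIP ViT-B/32 checkpoint through the Hugging Face transformers API; this mirrors the protocol described above but is not the paper's evaluation code:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_scores(generated: Image.Image, reference: Image.Image, clean_prompt: str):
    inputs = processor(text=[clean_prompt], images=[generated, reference],
                       return_tensors="pt", padding=True)
    img_feats = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_feats = model.get_text_features(input_ids=inputs["input_ids"],
                                        attention_mask=inputs["attention_mask"])
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
    text_alignment = (img_feats[0] @ txt_feats[0]).item()   # prompt alignment
    image_alignment = (img_feats[0] @ img_feats[1]).item()  # subject preservation
    return text_alignment, image_alignment
```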
Dataset: for the multi-shot setting, we use data collected by previous methods [Gal et al. 2023a; Kumari et al. 2023], with different subjects such as animals, toys, personal items, and buildings. For these subjects, checkpoints of previous methods exist, allowing a fair comparison.
4.1 Ablation studies
For the ablation, we start with TI [Gal et al. 2023a] and DB [Ruiz et al. 2023] as our baseline personalization method and gradually add the different components that contribute to our final method. Full experimental details appear in the supplemental materials.
Early stopping: We begin by considering early stopping as a way to control text alignment. The lower the number of iterations, the less likely we are to hurt the model's prior knowledge. However, this comes at the cost of subject fidelity, as is evident from fig. 6. The longer we tune the model on the target subject, the more we risk overfitting the training set.
Adding SDS guidance: improves text alignment, yet it severely harms subject fidelity, and image diversity is substantially reduced (see supplemental materials). Alternative distillation-sampling guidance [Katzir et al. 2023] improves on SDS; however, since the distillation sampling guides the personalization optimization towards the center of the subject class's distribution, it still produces less favorable results.
Replacing SDS with PALP guidance: improves text alignment by a considerable margin and maintains high fidelity to the subject \(S\). We consider two variants: one that reuses the same noise as the personalization loss, and one that samples a new noise from the normal distribution. Interestingly, using the same noise helps with prompt alignment. Furthermore, scaling the score sampling equation eq. (9) by \(\sqrt{\bar{\alpha}_{t}} / \sqrt{1-\bar{\alpha}_{t}}\) further enhances performance. For further details and qualitative samples, please refer to the supplemental material.
4.2 Comparison with Existing Methods
We compare our method against multi-shot methods, including CustomDiffusion [Kumari et al. 2023], P+ [Voynov et al. 2023], and NeTI [Alaluf et al. 2023]. We further compare against TI [Gal et al. 2023a] and DB [Ruiz et al. 2023] using our implementation, which also highlights the gain achieved by incorporating our framework into existing personalization methods. Our evaluation set contains ten different complex prompts that include at least four different elements, including style changes (e.g., "sketch of", "anime drawing of"), a time or place (e.g., "in Paris", "at night"), and color palettes (e.g., "warm", "vintage"). Quantitative results appear in table 2, and qualitative comparisons appear in table 1 and table 5.
Our method achieves the best text alignment while maintaining high image alignment. TI+DB achieves the best image alignment; however, this is because TI+DB is prone to over-fitting. Indeed, investigating each element in the prompt, we find that TI+DB achieves the best alignment with the class prompt (e.g., "A photo of a cat") while being significantly worse on the style prompt (e.g., "A sketch"). Our method has slightly lower image alignment since we expect an appearance change for stylized prompts. We validate this hypothesis with a user study and find that our method achieves the best user preference in both prompt alignment and personalization (see table 3). Please refer to the supplemental material for additional non-stylized results and full details of the user study.
4.3 Applications
Single-shot setting: In the single-shot setting, we aim to personalize text-to-image models using a single image. This setting is helpful when only a single image of the target subject exists (e.g., an old photo of a loved one). For this setting, we qualitatively compare our method with encoder-based methods, including IP-Adapter [Ye et al. 2023], ProFusion [Zhou et al. 2023], Face0 [Valevski et al. 2023], and E4T [Gal et al. 2023b]. We use portraits of two individuals and expect previous methods to generalize to our selected images, since all of these methods are pre-trained on human faces. Note that E4T [Gal et al. 2023b] and ProFusion [Zhou et al. 2023] also perform test-time optimization.
As seen in table 4, our method is both prompt- and identity-aligned. Previous methods, on the other hand, struggle more with identity preservation. We note that the optimization-based approaches [Gal et al. 2023b; Zhou et al. 2023] are more identity-preserving, but this comes at the cost of text alignment. Finally, our method achieves a higher success rate, where the quality of the result is independent of the chosen seed.
Multi-concept Personalization: Our method accommodates multi-subject personalization via simple modifications. Assume we want to compose two subjects, \(S_1\) and \(S_2\), in a specific scene depicted by a given prompt \(y\). To do so, we first allocate two different placeholders, [V1] and [V2], to represent the target subjects \(S_1\) and \(S_2\), respectively. During training, we randomly sample an image from a set of images containing \(S_1\) and \(S_2\). We assign a different personalization prompt \(y^P\) to each subject, e.g., "A photo of [V1]" or "A painting inspired by [V2]", depending on the context. Then, we perform PALP with the target prompt in mind, e.g., "A painting of [V1] inspired by [V2]". This allows composing different subjects into coherent scenes, or using a single artwork as a reference for generating art-inspired images. Results appear in fig. 3; further details and results appear in the supplemental material.
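To make the pairing concrete, a hypothetical two-subject setup (file names and prompts are illustrative only): each sampled training image is personalized with its own prompt, while the PALP guidance always uses the joint target prompt.

```python
import random

training_set = [
    {"image": "subject_photo.jpg",     "prompt": "A photo of [V1]"},              # subject S1
    {"image": "reference_artwork.jpg", "prompt": "A painting inspired by [V2]"},  # subject S2
]
target_prompt = "A painting of [V1] inspired by [V2]"

def sample_training_example():
    # personalization uses the per-subject prompt; PALP guidance uses target_prompt
    return random.choice(training_set), target_prompt
```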
5 Conclusions
We have introduced a novel personalization method that allows better prompt alignment. Our approach involves fine-tuning a pre-trained model to learn a given subject while employing score sampling to maintain alignment with the target prompt. We achieve favorable results in both prompt and subject alignment and push the boundary of personalization methods to handle complex prompts comprising multiple subjects, even when one subject has only a single reference image.
While the resulting personalized model still generalizes to other prompts, achieving optimal results for a different prompt requires personalizing the pre-trained model again, so for practical real-time use cases there may be better options. However, future directions employing prompt-aligned adapters could enable instant personalization for a specific prompt (e.g., for sketches). Finally, we hope our work will motivate future methods to excel on a subset of prompts, allowing more specialized methods to achieve better and more accurate results.