1 Introduction
Text-to-image models have shown exceptional abilities to generate a diversity of images in various settings (place, time, style, and appearance), such as "a sketch of Paris on a rainy day" or "a Manga drawing of a teddy bear at night" [Ramesh et al. 2021; Saharia et al. 2022]. Recently, personalization methods even allow one to include specific subjects (objects, animals, or people) in the generated images [Gal et al. 2023a; Ruiz et al. 2023]. In practice, however, such models are difficult to control and may require significant prompt engineering and re-sampling to create the specific image one has in mind. This is even more acute with personalized models, where it is challenging to include the personal item or character in the image and simultaneously fulfill the textual prompt describing the content and style. This work proposes a method for better personalization and prompt alignment, especially suited for complex prompts.
A key ingredient in personalization methods is fine-tuning pre-trained text-to-image models on a small set of personal images while relying on heavy regularization to maintain the model's capacity. Doing so preserves the model's prior knowledge and allows the user to synthesize images with various prompts; however, it impairs capturing the identifying features of the target subject. On the other hand, insisting on identification accuracy can hinder the prompt-alignment capabilities. Generally speaking, the trade-off between identity preservation and prompt alignment is a core challenge in personalization methods (see fig. 2).
Content creators and AI artists frequently have a clear idea of the prompt they wish to utilize. It may involve stylization and other factors that current personalization methods struggle to maintain. Therefore, we take a different approach by focusing on excelling with a single prompt rather than offering a general-purpose method intended to perform well with a wide range of prompts. This approach enables both (i) learning the unique features of the subject from a few or even a single input image and (ii) generating richer scenes that are better aligned with the user's desired prompt (see fig. 1).
Our work is based on the premise that existing models possess knowledge of all elements within the target prompt except the new personal subject. Consequently, we leverage the pre-trained model's prior knowledge to prevent personalized models from losing their understanding of the target prompt. In particular, since we know the target prompt during training, we show how to incorporate score-distillation guidance [Poole et al. 2023] to constrain the personalized model's prediction to stay aligned with the pre-trained one. We therefore introduce a framework comprising two components: personalization, which teaches the model about our new subject, and prompt alignment, which prevents it from forgetting elements included in the target prompt.
Our approach liberates content creators from constraints associated with specific prompts, unleashing the full potential of text-to-image models. We evaluate our method qualitatively and quantitatively and show superior results compared with the baselines in multi- and single-shot settings, all without pre-training on large-scale data [Arar et al. 2023; Gal et al. 2023b], which can be difficult to obtain for certain domains. Finally, we show that our method can accommodate multi-subject personalization with minor modifications and offers new applications, such as drawing inspiration from a single artistic painting rather than just text (see fig. 3).
2 Related work
Text-to-image synthesis. has seen unprecedented progress in recent years [Gafni et al. 2022; Nichol et al. 2021; Ramesh et al. 2021; Rombach et al. 2022; Saharia et al. 2022], mostly due to large-scale training on data like LAION-400M [Schuhmann et al. 2021]. Our approach uses pre-trained diffusion models [Ho et al. 2020] and extends their understanding to new subjects. We use the publicly available Stable Diffusion model [Rombach et al. 2022] for most of our experiments, since most baseline methods are open-sourced on SD. We further verify our method on a larger latent diffusion model variant [Rombach et al. 2022].
Text-based editing. methods rely on contrastive multi-modal models like CLIP [Radford et al. 2021] as an interface to guide local and global edits [Avrahami et al. 2023b; 2022; Bar-Tal et al. 2022; Gal et al. 2022; Patashnik et al. 2021]. Recently, Prompt-to-Prompt (P2P) [Hertz et al. 2023b] was proposed as a way to edit and manipulate generated images by editing the attention maps in the cross-attention layers of a pre-trained text-to-image model. Later, Mokady et al. [2023] extended P2P to real images by encoding them into the null-conditioning space of classifier-free guidance [Ho and Salimans 2022]. InstructPix2Pix [Brooks et al. 2023] uses an instruction-guided image-to-image translation network trained on synthetic data. Others preserve image structure by using reference attention maps [Parmar et al. 2023] or features extracted using DDIM [Song et al. 2021] inversion [Tumanyan et al. 2023]. Imagic [Kawar et al. 2023] starts from a target prompt, finds a text embedding that reconstructs an input image, and then interpolates between the two to achieve the final edit. UniTune [Valevski et al. 2022], on the other hand, performs the interpolation in pixel space during the backward denoising process. In our work, we focus on the ability to generate images depicting a given subject, which need not maintain the global structure of an input image.
Early personalization methods. like Textual Inversion [Gal et al. 2023a] and DreamBooth [Ruiz et al. 2023] tune pre-trained text-to-image models to represent new subjects, either by finding a new soft word embedding [Gal et al. 2023a] or by calibrating model weights [Ruiz et al. 2023] together with existing words to represent the newly added subject. Later methods improved the memory requirements of these approaches using low-rank updates [Hu et al. 2022; Kumari et al. 2023; Ryu 2023; Tewel et al. 2023] or a compact parameter space [Han et al. 2023]. On another front, NeTI [Alaluf et al. 2023] and P+ [Voynov et al. 2023] enhance TI [Gal et al. 2023a] by using additional tokens to more effectively capture subject-identifying features, whereas [Pang et al. 2023] introduces a cross-initialization method to bridge the gap between the learned token embedding and the original model token embeddings. DreamMatcher [Nam et al. 2024] uses appearance matching in the generation process while maintaining the structure path of the original text-to-image model. Personalization can also be used for other tasks: ReVersion [Huang et al. 2023] showed how to learn relational features from reference images, and Vinker et al. [2023] used personalization to decompose and visualize concepts at different abstraction levels. Chefer et al. [2023b] propose an interpretability method for text-to-image models that decomposes concepts into interpretable tokens. Another line of work pre-trains encoders on large-scale data for near-instant, single-shot adaptation [Arar et al. 2023; Chen et al. 2023; Gal et al. 2023b; Valevski et al. 2023; Wei et al. 2023; Ye et al. 2023; Zhou et al. 2023]. Single-image personalization has also been addressed by Avrahami et al. [2023a], who use segmentation masks to personalize a model on different subjects. In our work, we focus on prompt alignment, and our baseline personalization method can be replaced by any of the personalization methods above.
Score Distillation Sampling (SDS). emerged as a technique for leveraging 2D diffusion model priors [Rombach et al. 2022; Saharia et al. 2022] for 3D generation from textual input. This technique soon found its way into different applications, such as SVG generation [Iluz et al. 2023; Jain et al. 2023], image editing [Hertz et al. 2023a], and more [Song et al. 2022]. Other variants of SDS [Poole et al. 2023] aim to improve its image-generation quality, which suffers from over-saturation and blurriness [Katzir et al. 2023; Wang et al. 2023].
Text-to-image alignment. methods address text-related issues that arise in base generative diffusion models, including neglecting specific parts of the text, attribute mixing, and more. Previous methods tackle these issues through attention-map re-weighting [Feng et al. 2022; Phung et al. 2023; Wu et al. 2023], latent optimization [Chefer et al. 2023a; Rassin et al. 2023], or re-training with additional data [Segalis et al. 2023]. However, none of these methods address the prompt alignment of personalization methods; instead, they aim to enhance the base models' ability to generate text-aligned images.
3 Method
In our method, we strive to teach a pre-trained text-to-image model to generate images of a new subject S, which the model does not recognize. Although personalization methods can extend the model to such new subjects, they struggle to create images of S described by complex prompts. To overcome this obstacle, we optimize for both subject fidelity and the ability to faithfully fulfill a target prompt. While the latter seems non-trivial, we show that by knowing the target prompt at the finetuning stage, we can help the model preserve its prior knowledge about the different elements in the text.
We begin by giving an overview of personalization methods and identifying the prompt misalignment caused by the personalization process. We then present our solution, which relies on knowing the prompt during personalization. More specifically, we simultaneously finetune the model to become personalized and to not lose its knowledge about the target prompt \(y_{trg}\).
3.1 Preliminaries
Text-to-image Diffusion models perform a backward diffusion process to generate an image \(x\). In this process, a denoising model \(G\) with parameters \(\theta\) progressively cleans an input noise \(x_T \sim \mathcal{N}(0,1)\) to produce a sample \(x\) from an underlying data distribution \(p(X)\). At each diffusion timestep \(t \in \{T, \dots, 1, 0\}\), the model predicts a noise \(\hat{\epsilon} = G\left(x_t, t, y; \theta\right)\) conditioned on a prompt \(y\) and the timestep \(t\). The generative model is trained to maximize the evidence lower bound (ELBO) using a denoising score-matching objective [Ho et al. 2020]:
\[ \mathcal{L}_{diff} = \mathbb{E}_{x, y, \epsilon \sim \mathcal{N}(0,1), t}\left[\left\| \epsilon - G\left(x_t, t, y; \theta\right) \right\|_2^2\right], \tag{1} \]
where, given the noise-scheduler parameters \(\sqrt{\bar{\alpha}_{t}}\), the latent \(x_t\) at timestep \(t\) is given by the forward diffusion:
\[ x_t = \sqrt{\bar{\alpha}_{t}}\, x + \sqrt{1-\bar{\alpha}_{t}}\, \epsilon. \tag{2} \]
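For illustration, a minimal PyTorch sketch of eqs. (1) and (2) for a single training batch; the denoiser `G`, the prompt embedding `y`, and the noise-schedule tensor `alpha_bars` are stand-ins rather than the paper's actual code:

```python
import torch
import torch.nn.functional as F

def forward_diffusion(x0, eps, alpha_bar_t):
    # eq. (2): noise a clean sample x0 to timestep t
    return alpha_bar_t.sqrt() * x0 + (1.0 - alpha_bar_t).sqrt() * eps

def diffusion_loss(G, x0, y, alpha_bars):
    # eq. (1): denoising score-matching objective
    t = torch.randint(0, len(alpha_bars), (x0.shape[0],), device=x0.device)
    eps = torch.randn_like(x0)
    x_t = forward_diffusion(x0, eps, alpha_bars[t].view(-1, 1, 1, 1))
    eps_hat = G(x_t, t, y)  # predicted noise, conditioned on prompt y and timestep t
    return F.mse_loss(eps_hat, eps)
```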
3.2 Personalization
Personalization methods finetune the diffusion model parameters \(\theta\) for a new target subject \(S\), such that the finetuned, personalized model can generate images of \(S\). Following prior works, we finetune the model \(G\) using the loss defined in eq. (1). Like Textual Inversion [Gal et al. 2023a], we learn a new word embedding, denoted by the placeholder [V], to represent our subject \(S\). We also finetune the model weights, similar to DreamBooth [Ruiz et al. 2023], using the Low-Rank Adaptation method (LoRA) [Hu et al. 2022; Kumari et al. 2023], where each layer weight \(W \in \mathbb{R}^{N\times M}\) is updated using:
\[ W' = W + \Delta W = W + A B. \tag{3} \]
Here, the matrices \(A\in \mathbb{R}^{N\times r}\) and \(B\in \mathbb{R}^{r\times M}\) are a learned low-rank decomposition of the weight update \(\Delta W \in \mathbb{R}^{N\times M}\). Similar to prior works [Kumari et al. 2023; Tewel et al. 2023], we only update the weights of the self- and cross-attention layers. Finally, finetuning is performed by pairing each image in the set with a personalization prompt \(y^P\) that contains the placeholder [V], e.g., \(y^P\) = "A photo of [V]."
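For illustration, a generic LoRA layer implementing the update of eq. (3); this is a sketch rather than the authors' implementation, and the rank and scaling values are arbitrary:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable low-rank update W' = W + A @ B (eq. 3)."""
    def __init__(self, base: nn.Linear, rank: int = 4, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # the pre-trained weight W stays frozen
        n, m = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(n, rank) * 0.01)   # A in R^{N x r}
        self.B = nn.Parameter(torch.zeros(rank, m))          # B in R^{r x M}, zero-init
        self.scale = scale

    def forward(self, x):
        delta_w = self.A @ self.B                            # low-rank weight update
        return self.base(x) + self.scale * F.linear(x, delta_w)
```

In practice, such wrappers would only be applied to the self- and cross-attention projections, as described above.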
Throughout the rest of the section, we denote by \(G_{\theta}\) and \(G_{\theta_{LoRA}}\) the pre-trained and the personalized models, respectively. Given an input image \(x\) of our subject \(S\) and a conditional text embedding \(y^P\), the personalization loss is given by:
\[ \mathcal{L}_{P} = \mathbb{E}_{\epsilon_1 \sim \mathcal{N}(0,1),\, t_1}\left[\left\| \epsilon_1 - G_{\theta_{LoRA}}\left(x_{t_1}, t_1, y^P\right) \right\|_2^2\right]. \tag{4} \]
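A minimal sketch of the personalization loss in eq. (4); `G_lora` denotes the LoRA-adapted denoiser and `y_p_embed` the embedding of \(y^P\) containing the learned [V] token (both are assumed interfaces, not the paper's API). The sampled noise and timestep are returned for reuse by the alignment term introduced in section 3.4:

```python
import torch
import torch.nn.functional as F

def personalization_loss(G_lora, x0, y_p_embed, alpha_bars):
    # L_P (eq. 4): the diffusion loss of eq. (1), evaluated with the LoRA-adapted
    # model and a prompt embedding y^P containing the placeholder [V].
    t1 = torch.randint(0, len(alpha_bars), (x0.shape[0],), device=x0.device)
    eps1 = torch.randn_like(x0)
    a = alpha_bars[t1].view(-1, 1, 1, 1)
    x_t1 = a.sqrt() * x0 + (1.0 - a).sqrt() * eps1           # forward diffusion, eq. (2)
    eps_hat = G_lora(x_t1, t1, y_p_embed)
    loss = F.mse_loss(eps_hat, eps1)
    return loss, (x_t1, t1, eps1, eps_hat)
```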
3.3 Over-fitting in Personalization
Let our target prompt be \(y_{trg}\), for example, "A sketch of [V]." Too many finetuning steps on a small image set cause the model to overfit. In this case, the diffusion model steers the backward denoising process towards one of the training images, regardless of the conditional prompt. To see this, let us visualize the model's prediction by analyzing its estimate of \(x\), the real-sample image. The estimate \(\hat{x}_0\) can be derived from the model's prediction via the forward diffusion in eq. (2):
\[ \hat{x}_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_{t}}\, \hat{\epsilon}}{\sqrt{\bar{\alpha}_{t}}}. \tag{5} \]
In fig. 5a, we show the estimates of three models: (1) the pre-trained model \(G_{\theta}\), (2) an overfitted personalized model, and (3) a prompt-aligned personalized model. As can be seen from the figure, after prolonged training, the overfitted personalized model can reconstruct the target image from a random noise in a single step. In particular, after one denoising step, elements from the training images, such as the background, are more dominant, and the sketchiness fades away, suggesting misalignment with the prompt "A sketch." Throughout the rest of this section, we denote by \(\hat{x}_0\) the personalized model's (\(G_{\theta_{LoRA}}\)) estimate from the latent \(x_t\).
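For reference, the single-step estimate of eq. (5) amounts to inverting the forward diffusion; a minimal sketch:

```python
def estimate_x0(x_t, eps_hat, alpha_bar_t):
    # eq. (5): invert the forward diffusion using the model's noise prediction
    return (x_t - (1.0 - alpha_bar_t).sqrt() * eps_hat) / alpha_bar_t.sqrt()
```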
3.4 Prompt-Aligned Personalization
We want to personalize \(G_{\theta}\) to generate images related to a new subject \(S\). However, unlike previous methods, we strongly emphasize achieving near-optimal results for a single textual prompt, denoted by \(y_{trg}\) (e.g., "A sketch of [V]").
Our key idea is to optimize our personalized weights, i.e., \(\theta_{LoRA}\) and [V], while simultaneously pushing the model prediction towards our target prompt \(y_{trg}\). For example, if \(y_{trg}\) = "A sketch of [V]", then we want the estimate \(\hat{x}_0\) to be both personalized and aligned with \(y_{trg}\). In other words, we want the estimate to contain elements of the subject \(S\) and to maintain a sketchy appearance instead of a photo-realistic one.
To achieve this, we make use of the pre-trained model \(G_{\theta}\), which possesses all the knowledge about the target prompt's elements (except for the subject \(S\)). By omitting "[V]" from the target prompt, we obtain a clean prompt that the pre-trained model understands (henceforth denoted \(y^c\)). We can measure the estimate's fidelity to \(y^c\) using the diffusion loss in eq. (1).
In particular, given a sampled noise \(\epsilon_2\) and timestep \(t_2\), we let \(\hat{x}_{t_2}\) be the latent of \(\hat{x}_0\) as given by the forward diffusion process. We further denote by \(G^{\alpha}_{\theta}\) the classifier-free guidance prediction of the base model, which is an extrapolation of the conditional and unconditional (\(y = \emptyset\)) noise predictions. The scalar \(\alpha \in \mathbb{R}^+\) controls the extrapolation via:
\[ G^{\alpha}_{\theta}\left(x_t, t, y\right) = G_{\theta}\left(x_t, t, \emptyset\right) + \alpha \left( G_{\theta}\left(x_t, t, y\right) - G_{\theta}\left(x_t, t, \emptyset\right) \right). \tag{6} \]
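For completeness, a generic sketch of the classifier-free guidance extrapolation in eq. (6), where `null_embed` stands for the embedding of the empty prompt:

```python
def cfg_prediction(G, x_t, t, y_embed, null_embed, guidance_scale):
    # eq. (6): extrapolate between the unconditional and conditional noise predictions
    eps_uncond = G(x_t, t, null_embed)
    eps_cond = G(x_t, t, y_embed)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```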
Then, to maintain alignment between the personalized model's prediction, i.e., \(\hat{x}_0\), and the non-personalized text \(y^c\), we can use the diffusion loss. Specifically, given \(\hat{x}_{t_2} = \sqrt{\bar{\alpha}_{t_2}} \hat{x}_0 + \sqrt{1-\bar{\alpha}_{t_2}} \epsilon_2\), we want to perturb the model prediction by minimizing:
\[ \mathcal{L} = \mathbb{E}_{\epsilon_2, t_2}\left[\left\| \epsilon_2 - G^{\alpha}_{\theta}\left(\hat{x}_{t_2}, t_2, y^c\right) \right\|_2^2\right]. \tag{7} \]
Specifically, we use Score Distillation Sampling (SDS), which provides a more effective gradient estimate [Poole et al. 2023]:
\[ \nabla_{\phi} \mathcal{L}_{SDS} = \left( G^{\alpha}_{\theta}\left(\hat{x}_{t_2}, t_2, y^c\right) - \epsilon_2 \right) \frac{\partial \hat{x}_0}{\partial \phi}, \tag{8} \]
where \(\phi\) are the weights controlling the appearance of \(\hat{x}_0\), which in our case are the LoRA weights and [V].
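A common way to implement such a gradient is a surrogate loss whose gradient with respect to \(\hat{x}_0\) reproduces the residual term in eq. (8); the detach-based sketch below follows this convention (an implementation assumption, not the paper's code) and reuses `cfg_prediction` from above:

```python
import torch

def sds_alignment_loss(G_base, x0_hat, y_c_embed, null_embed, eps2, t2, alpha_bar_t2, alpha):
    a = alpha_bar_t2
    x_t2 = a.sqrt() * x0_hat + (1.0 - a).sqrt() * eps2       # re-noise the estimate x0_hat
    with torch.no_grad():                                     # frozen base model, no gradient
        eps_base = cfg_prediction(G_base, x_t2, t2, y_c_embed, null_embed, alpha)
    grad = eps_base - eps2
    # (grad * x0_hat).sum() has gradient `grad` w.r.t. x0_hat, matching eq. (8);
    # backpropagation then carries it into the LoRA weights and [V] (i.e., phi).
    return (grad * x0_hat).sum()
```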
3.5 Avoiding Over-saturation and Mode Collapse
Incorporating the gradients defined in eq. (8) into our framework leads to less diverse and over-saturated results. While alternative implementations like [Katzir et al. 2023; Wang et al. 2023] improve diversity, the overall personalization is still affected (see supplemental materials).
Previous works [Hertz et al. 2023a; Katzir et al. 2023] have introduced a residual-score formulation that better estimates the desired gradient direction. In our case, we want to pivot the personalized model towards the base model's prediction. We therefore perturb the model prediction using the difference between the personalized model's prediction \(G_{\theta_{LoRA}}\) and the pre-trained one \(G_{\theta}\), and use the residual score:
\[ \nabla_{\phi} \mathcal{L}_{PALP} = \left( G^{\alpha}_{\theta}\left(\hat{x}_{t_2}, t_2, y^c\right) - G^{\beta}_{\theta_{LoRA}}\left(\hat{x}_{t_2}, t_2, y^P\right) \right) \frac{\partial \hat{x}_0}{\partial \phi}, \tag{9} \]
where \(\alpha\) and \(\beta\) are the guidance scales of \(G_{\theta}\) and \(G_{\theta_{LoRA}}\), respectively, \(y^P\) is the personalization prompt, and \(y^c\) is the clean prompt (see the right part of fig. 4).
In fig. 5b, we visualize the residual gradient defined in eq. (9). The figure illustrates that incorporating PALP effectively reduces overfitting while retaining personalization capabilities. Ideally, both the base and personalized models should perform similarly in background regions: an optimal personalized model should produce background noise predictions comparable to the base model's and enhanced noise predictions for the target subject. However, without PALP, DreamBooth overfits by excessively denoising both the background and the target subject (see the middle image in fig. 5b). In contrast, with PALP, the personalized model matches the base model's noise predictions for the background while improving predictions for the target subject.
Notice that our formulation is derived from the personalization overfitting problem, where we calculate a residual with respect to the estimates of two different networks given the same input, whereas previous works used the predictions of the same network over different input images, aiming to improve image-to-image or NeRF tasks.
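Following the same surrogate-loss pattern, a hypothetical sketch of the residual score in eq. (9); the prompt embeddings and guidance scales are assumed inputs, and `cfg_prediction` is the sketch from section 3.4:

```python
import torch

def palp_loss(G_base, G_lora, x0_hat, y_c_embed, y_p_embed, null_embed,
              eps2, t2, alpha_bar_t2, alpha, beta):
    a = alpha_bar_t2
    x_t2 = a.sqrt() * x0_hat + (1.0 - a).sqrt() * eps2
    with torch.no_grad():
        # classifier-free guidance predictions of the base and personalized models (eq. 6)
        eps_base = cfg_prediction(G_base, x_t2, t2, y_c_embed, null_embed, alpha)
        eps_pers = cfg_prediction(G_lora, x_t2, t2, y_p_embed, null_embed, beta)
    residual = eps_base - eps_pers                            # pivot towards the base model
    # gradient w.r.t. x0_hat equals the residual, matching eq. (9)
    return (residual * x0_hat).sum()
```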
3.6 Overall Training Objective
Our final objective is to optimize \(\theta_{LoRA}\) using \(\mathcal{L}_{P}\) from eq. (4) and \(\mathcal{L}_{PALP}\) from eq. (9), giving the loss:
\[ \mathcal{L} = \mathcal{L}_{P} + \lambda \mathcal{L}_{PALP}, \tag{10} \]
where \(\lambda\) balances the prompt-alignment fidelity (we used \(\lambda = 0.2\)). Further, we found imbalanced guidance scales, i.e., \(\alpha > \beta\), to perform better, and setting \(t_1 = t_2\) improves numerical stability. We have also considered two variant implementations: (1) using the same noise (i.e., \(\epsilon_1 = \epsilon_2\)), or (2) using different ones. Using the same noise achieves better text alignment than the latter variant; for further details, please refer to the ablation studies in section 4.1.
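Putting the pieces together, a hypothetical training step for eq. (10) with the choices reported above (shared noise and timestep, \(\lambda = 0.2\), \(\alpha > \beta\)); it builds on the `personalization_loss`, `estimate_x0`, and `palp_loss` sketches from the previous sections, the guidance values shown are illustrative, and the optimizer is assumed to hold only the LoRA weights and the [V] embedding:

```python
def palp_training_step(G_base, G_lora, x0, y_p_embed, y_c_embed, null_embed,
                       alpha_bars, optimizer, lam=0.2, alpha=15.0, beta=7.5):
    loss_p, (x_t1, t1, eps1, eps_hat) = personalization_loss(G_lora, x0, y_p_embed, alpha_bars)
    a = alpha_bars[t1].view(-1, 1, 1, 1)
    x0_hat = estimate_x0(x_t1, eps_hat, a)                    # eq. (5)
    # shared noise and timestep: eps2 = eps1 and t2 = t1
    loss_palp = palp_loss(G_base, G_lora, x0_hat, y_c_embed, y_p_embed, null_embed,
                          eps1, t1, a, alpha, beta)
    loss = loss_p + lam * loss_palp                           # eq. (10)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                          # updates theta_LoRA and [V] only
    return loss.item()
```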
4 Results
Experimental setup: We use Stable Diffusion (SD) v1.4 [Rombach et al. 2022] for ablation and comparison purposes, as official implementations of many state-of-the-art methods are available for SD-v1.4. We further validate our method with larger text-to-image models (please refer to the supplemental materials for further details). The complete experimental configuration, including learning rate, number of steps, and batch size, appears in the supplemental material.
Evaluation metric: for evaluation, we follow previous works [Gal et al. 2023a; Ruiz et al. 2023] and use the CLIP score [Radford et al. 2021] to measure alignment with the target (clean) prompt \(y^c\) (i.e., without the placeholder [V]). For subject preservation, we use CLIP feature similarity between the input images and the images generated with the target prompt. For both metrics, we use ViT-B/32 [Dosovitskiy et al. 2021] trained by OpenAI on their proprietary data. This ensures that the CLIP model underlying SD-v1.4 differs from the one used for evaluation, which could otherwise compromise the validity of the reported metrics.
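A sketch of how these two metrics could be computed with an OpenAI CLIP ViT-B/32 checkpoint through the Hugging Face transformers API; this mirrors the protocol described above but is not the paper's evaluation code:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_scores(generated: Image.Image, reference: Image.Image, clean_prompt: str):
    inputs = processor(text=[clean_prompt], images=[generated, reference],
                       return_tensors="pt", padding=True)
    img_feats = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_feats = model.get_text_features(input_ids=inputs["input_ids"],
                                        attention_mask=inputs["attention_mask"])
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
    text_alignment = (img_feats[0] @ txt_feats[0]).item()   # prompt alignment
    image_alignment = (img_feats[0] @ img_feats[1]).item()  # subject preservation
    return text_alignment, image_alignment
```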
Dataset: for the multi-shot setting, we use data collected by previous methods [Gal et al. 2023a; Kumari et al. 2023], with different subjects such as animals, toys, personal items, and buildings. For these subjects, checkpoints of previous methods exist, allowing a fair comparison.
4.1 Ablation studies
For the ablation, we start with TI [Gal et al. 2023a] and DB [Ruiz et al. 2023] as our baseline personalization method and gradually add the different components that contribute to our final method. Full experimental details appear in the supplemental materials.
Early stopping: We begin by considering early stopping as a way to control text alignment. The lower the number of iterations, the less likely we are to hurt the model's prior knowledge. However, this comes at the cost of subject fidelity, as is evident from fig. 6. The longer we tune the model on the target subject, the more we risk overfitting the training set.
Adding SDS guidance: improves text alignment, yet it severely harms subject fidelity, and image diversity is substantially reduced (see supplemental materials). Alternative distillation-sampling guidance [Katzir et al. 2023] improves on SDS; however, since the distillation sampling guides the personalization optimization towards the center of the subject class's distribution, it still produces less favorable results.
Replacing SDS with PALP guidance: improves text alignment by a considerable margin and maintains high fidelity to the subject \(S\). We consider two variants: one that reuses the same noise as the personalization loss, and one that samples a new noise from the normal distribution. Interestingly, using the same noise helps with prompt alignment. Furthermore, scaling the score sampling equation eq. (9) by \(\sqrt{\bar{\alpha}_{t}} / \sqrt{1-\bar{\alpha}_{t}}\) further enhances performance. For further details and qualitative samples, please refer to the supplemental material.
4.2 Comparison with Existing Methods
We compare our method against multi-shot methods, including CustomDiffusion [Kumari et al. 2023], P+ [Voynov et al. 2023], and NeTI [Alaluf et al. 2023]. We further compare against TI [Gal et al. 2023a] and DB [Ruiz et al. 2023] using our implementation, which also highlights the gain achieved by incorporating our framework into existing personalization methods. Our evaluation set contains ten different complex prompts that include at least four different elements, including style changes (e.g., "sketch of", "anime drawing of"), a time or place (e.g., "in Paris", "at night"), and color palettes (e.g., "warm", "vintage"). Quantitative results appear in table 2, and qualitative comparisons appear in table 1 and table 5.
Our method achieves the best text alignment while maintaining high image alignment. TI+DB achieves the best image alignment; however, this is because TI+DB is prone to over-fitting. Indeed, investigating each element in the prompt, we find that TI+DB achieves the best alignment with the class prompt (e.g., "A photo of a cat") while being significantly worse on the style prompt (e.g., "A sketch"). Our method has slightly lower image alignment since we expect an appearance change for stylized prompts. We validate this hypothesis with a user study and find that our method achieves the best user preference in both prompt alignment and personalization (see table 3). Please refer to the supplemental material for additional non-stylized results and full details of the user study.
4.3 Applications
Single-shot setting: In the single-shot setting, we aim to personalize text-to-image models using a single image. This setting is helpful when only a single image of the target subject exists (e.g., an old photo of a loved one). For this setting, we qualitatively compare our method with encoder-based methods, including IP-Adapter [Ye et al. 2023], ProFusion [Zhou et al. 2023], Face0 [Valevski et al. 2023], and E4T [Gal et al. 2023b]. We use portraits of two individuals and expect previous methods to generalize to our selected images, since all of these methods are pre-trained on human faces. Note that E4T [Gal et al. 2023b] and ProFusion [Zhou et al. 2023] also perform test-time optimization.
As seen in table 4, our method is both prompt- and identity-aligned. Previous methods, on the other hand, struggle more with identity preservation. We note that the optimization-based approaches [Gal et al. 2023b; Zhou et al. 2023] are more identity-preserving, but this comes at the cost of text alignment. Finally, our method achieves a higher success rate, where the quality of the result is independent of the chosen seed.
Multi-concept Personalization: Our method accommodates multi-subject personalization via simple modifications. Assume we want to compose two subjects, \(S_1\) and \(S_2\), in a specific scene depicted by a given prompt \(y\). To do so, we first allocate two different placeholders, [V1] and [V2], to represent the target subjects \(S_1\) and \(S_2\), respectively. During training, we randomly sample an image from a set of images containing \(S_1\) and \(S_2\). We assign a different personalization prompt \(y^P\) to each subject, e.g., "A photo of [V1]" or "A painting inspired by [V2]", depending on the context. Then, we perform PALP with the target prompt in mind, e.g., "A painting of [V1] inspired by [V2]". This allows composing different subjects into coherent scenes, or using a single artwork as a reference for generating art-inspired images. Results appear in fig. 3; further details and results appear in the supplemental material.
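To make the pairing concrete, a hypothetical two-subject setup (file names and prompts are illustrative only): each sampled training image is personalized with its own prompt, while the PALP guidance always uses the joint target prompt.

```python
import random

training_set = [
    {"image": "subject_photo.jpg",     "prompt": "A photo of [V1]"},              # subject S1
    {"image": "reference_artwork.jpg", "prompt": "A painting inspired by [V2]"},  # subject S2
]
target_prompt = "A painting of [V1] inspired by [V2]"

def sample_training_example():
    # personalization uses the per-subject prompt; PALP guidance uses target_prompt
    return random.choice(training_set), target_prompt
```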
5 Conclusions
We have introduced a novel personalization method that allows better prompt alignment. Our approach involves fine-tuning a pre-trained model to learn a given subject while employing score sampling to maintain alignment with the target prompt. We achieve favorable results in both prompt and subject alignment and push the boundary of personalization methods to handle complex prompts comprising multiple subjects, even when one subject has only a single reference image.
While the resulting personalized model still generalizes to other prompts, achieving optimal results for a different prompt requires personalizing the pre-trained model again, so for practical real-time use cases there may be better options. However, future directions employing prompt-aligned adapters could enable instant personalization for a specific prompt (e.g., for sketches). Finally, we hope our work will motivate future methods to excel on a subset of prompts, allowing more specialized methods to achieve better and more accurate results.