License: arXiv.org perpetual non-exclusive license
arXiv:2312.06708v1 [cs.CV] 10 Dec 2023

Neutral Editing Framework for Diffusion-based Video Editing

Sunjae Yoon       Gwanhyeong Koo       Ji Woo Hong       Chang D. Yoo
Korea Advanced Institute of Science and Technology (KAIST)
{sunjae.yoon,kookie,jiwoohong93,cd_yoo}@kaist.ac.kr
Abstract

Text-conditioned image editing has succeeded in various types of editing based on a diffusion framework. Unfortunately, this success has not carried over to video, which continues to be challenging. Existing video editing systems are still limited to rigid-type editing such as style transfer and object overlay. To this end, this paper proposes the Neutral Editing (NeuEdit) framework to enable complex non-rigid editing by changing the motion of a person/object in a video, which has never been attempted before. NeuEdit introduces a concept of ‘neutralization’ that enhances the tuning-editing process of diffusion-based editing systems in a model-agnostic manner by leveraging the input video and text without any other auxiliary aids (e.g., visual masks, video captions). Extensive experiments on numerous videos demonstrate the adaptability and effectiveness of the NeuEdit framework. The website of our work is available here: https://neuedit.github.io/

Figure 1: Neutral editing framework enables the diffusion-based editing models to perform various text-based non-rigid editing such as motion variation of objects spanning from fine-grained variations to large dynamic variations while preserving fidelity to the input video.
Figure 2: (a) Edited videos of current systems [38, 19] based on target prompt in terms of motion change. (b) Categorical analysis of different types of editing on videos of DAVIS [22]. Illustration of (c) current editing framework and (d) proposed neutral editing framework.

1 Introduction

The recent success of generative frameworks [16, 10, 4] and large-scale models [5, 24, 25] has produced surreal outputs surpassing the boundaries of human capabilities. The diffusion models [32, 34, 6] lay the foundation for such innovative advancements, bridging a diverse range of large-scale generative models [33, 27]. To be specific, diffusion-based text-to-image (T2I) models [27, 21] synthesize natural images of high fidelity and further edit [28, 14] them by modifying specific attributes corresponding to the input text. Expanding the work on images, diffusion-based text-to-video (T2V) models [1, 38] have also been considered. Due to insufficient training resources for videos, significant technical contributions [31, 13] were made in early work to transfer the knowledge of pre-trained T2I models into T2V models. Currently, researchers are striving to refine this text-based video generation into a more controlled and fine-grained approach by modifying specific attributes in a video according to users’ requirements given as text, ultimately performing text-based video editing.

In a formal definition of text-based video editing, as shown in Figure 2 (a), systems are given an input video and a target textual prompt describing desired modifications in the video such that they produce an edited video that conforms to the target prompt. To achieve this, editing systems largely perform two sequential processes: (1) video tuning and (2) video editing. In video tuning, the editing system is trained to generate the input video and to comprehend the contextual meaning of the video. In video editing, the system generates the variants of the input video that conform to the meaning of the target prompt. To provide necessary attributes for editing, pre-trained vision-language models [25, 27] also have to be integrated into the system.

Despite recent advancements in video editing systems, their capabilities are still restricted to rigid modifications within the realm of inpainting such as style transfer and object overlay. To be specific, in Figure 2 (a), for a given target prompt (e.g. “A man jumps on the moon”) requiring non-rigid modifications by changing a motion of an object, current systems do not conform to the target prompt and return the original input video under over-fidelity. Otherwise, they often show impractical results by mixing up the original content (i.e. walking) and targeted content (i.e. jumping). In Figure 2 (b), our categorical analysis of textual alignment with video according to different types of editing (i.e. style transfer, object overlay, motion change) demonstrates that current systems are facing difficulties in changing a motion in a video. Therefore the results of complex non-rigid editing are still unsatisfactory.

One of the reasons for the unsatisfactory editing is rooted in the conventional tuning and editing process within diffusion-based editing frameworks. As illustrated in Figure 2 (c), current editing frameworks require an additional input caption about the video (i.e. source prompt) in the tuning process. After tuning, the model edits the video based on a target prompt. However, employing a source prompt leads to a functionally unnecessary tuning of content (e.g. ‘walking’) in the video, which is unrelated to the intended editing (e.g. ‘jumping’) and results in suboptimal editing. Furthermore, the outcomes are sensitive to variations in the source prompt. Therefore, frameworks employing a source prompt are inadequate for effective text-based video editing.

To this end, we propose the Neutral Editing (NeuEdit) framework, which performs effective video editing, including non-rigid editing of a video, with only a target prompt. As shown in Figure 2 (d), the NeuEdit framework introduces a novel concept of neutralization, which enables current editing systems to conduct (1) neutral prompt tuning and (2) neutral video editing. The neutral prompt tuning refers to tuning a model based on a neutral prompt. This prompt (e.g. “A man [mask] on the moon”) is a text that reduces factors (e.g. “jumps”) contributing to editing from a target prompt, allowing a model to tune a video without relying on a source prompt. (Footnote 1: Masking is an intuitive approach for reducing the factors. See also other approaches in Method.) Furthermore, the target prompt holds effective differences from this neutral prompt by the factors related to editing. To implement a neutral prompt, we first introduce neutralization, which refers to disentangling a factor related to editing in the input. After tuning with the neutral prompt, the T2V model performs video editing. Our studies found that current models struggle with non-rigid editing, primarily due to constraints imposed by the original content in the input video (e.g. Figure 2 (a)). To address this, we also construct a neutral video by applying neutralization, which reduces the influence of original content in the region of the video to be edited and thereby amplifies the possibility of non-rigid editing. NeuEdit can be applied to diffusion-based editing systems in a model-agnostic manner, enhancing various types of editing including object motion change. Extensive experiments validate its adaptability and visual effectiveness.

2 Related Works

2.1 Diffusion-based generative models

Deep diffusion models [10, 33] exhibit significant capability by outperforming the prior best qualities of generative adversarial networks [8]. Applied to text-to-image (T2I) generation, diffusion has brought significant advancements, where diffusion-based T2I models [26, 29] produce high-fidelity images from textual inputs. Recently, the T2I models have expanded their visual generative capabilities into the domain of videos to perform text-to-video (T2V) generation. Earlier studies [12, 37, 13] in T2V generation modified the T2I model by introducing a temporal axis for video data, thus transferring pre-trained knowledge from the T2I model. To enhance temporal consistency in generated frames, temporal attentions [11, 31] are also designed. Recently, these diffusion-based models have been successful in various generative works including inpainting and super-resolution [30, 20]. Among them, visual editing emerges as a new challenge that requires control and reasoning over selective synthesis, which is discussed in detail below.

2.2 Image and video editing

Text-based image editing aims to edit a given image based on text descriptions. To perform this, DiffusionCLIP [15] first proposed a tuning-editing framework for diffusion models based on CLIP embedding, and Prompt-To-Prompt [9] proposed weight blending to perform effective rigid editing in this framework. For efficient editing, InstructPix2Pix [2] designs zero-shot edits without fine-tuning. Similar to the work on images, video editing has also expanded. In particular, to keep temporal consistency, several technical solutions have been introduced, such as layered editing [3] and attention control [19]. However, current video editing is limited to rigid types of editing and still struggles with dynamic motion change. NeuEdit is thus the first to perform complex non-rigid editing of a video based on text.

3 Preliminaries

3.1 Denoising diffusion probabilistic models

Denoising diffusion probabilistic models (DDPMs) [10] are parameterized Markov chains to reconstruct a sequence of data $\{x_{1}, \cdots, x_{T}\}$. Given raw data $x_{0}$, the Markov transition gradually adds Gaussian noise up to $x_{T}$ using $q(x_{t}|x_{t-1})=\mathcal{N}(x_{t};\sqrt{\alpha_{t}}x_{t-1},(1-\alpha_{t})I)$ under a pre-defined schedule $\alpha_{t}$ for $t=1,\cdots,T$. This process is referred to as the forward process of the diffusion model.
In the reverse process, the diffusion model approximates $q(x_{t-1}|x_{t})$ using trainable Gaussian transitions $p_{\theta}(x_{t-1}|x_{t})=\mathcal{N}(x_{t-1};\mu_{\theta}(x_{t},t),\sigma_{\theta}(x_{t},t))$ starting at the normal distribution $p(x_{T})=\mathcal{N}(x_{T};0,I)$. The training objective is to maximize the log-likelihood $\log(p_{\theta}(x_{0}))$, where we can also apply variational inference by maximizing the variational lower bound of this. This makes a closed-form of KL divergence between the distributions of $p_{\theta}$ and $q$ while optimizing the parameter $\theta$ (Footnote 2: See the detailed proof in Appendix D).
The beauty of DDPM is that this process can be summarized as a denoising network $\epsilon_{\theta}(x_{t},t)$ for predicting noise $\epsilon\sim\mathcal{N}(0,I)$ as given below:

$$\mathbb{E}_{x,\epsilon\sim\mathcal{N}(0,1),t\sim\mathcal{U}\{1,T\}}\left[\|\epsilon-\epsilon_{\theta}(x_{t},t)\|_{2}^{2}\right]. \tag{1}$$
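As a rough illustration of Eq. (1), the sketch below draws one Monte Carlo sample of the noise-prediction loss. It is not the authors' code. It assumes the well-known DDPM closed form $x_t=\sqrt{\bar{\alpha}_t}\,x_0+\sqrt{1-\bar{\alpha}_t}\,\epsilon$ with $\bar{\alpha}_t=\prod_{s\le t}\alpha_s$, and the names `cumulative_alphas`, `q_sample`, `noise_prediction_loss`, and `eps_model` are hypothetical.

```python
import numpy as np

def cumulative_alphas(alphas):
    """Cumulative product of the per-step schedule alpha_t (often written alpha-bar)."""
    return np.cumprod(alphas)

def q_sample(x0, t, alpha_bar, eps):
    """Sample x_t directly from x_0 using the closed form of the forward process.

    t is 1-indexed to match t = 1, ..., T in the text.
    """
    a = alpha_bar[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

def noise_prediction_loss(eps_model, x0, alpha_bar, rng):
    """Single-sample estimate of Eq. (1): ||eps - eps_model(x_t, t)||_2^2.

    eps_model is a hypothetical callable standing in for the denoising network.
    """
    T = len(alpha_bar)
    t = int(rng.integers(1, T + 1))      # t ~ U{1, T}
    eps = rng.standard_normal(x0.shape)  # eps ~ N(0, I)
    x_t = q_sample(x0, t, alpha_bar, eps)
    return float(np.sum((eps - eps_model(x_t, t)) ** 2))
```

In practice the estimate would be averaged over a batch of data, timesteps, and noise draws before taking a gradient step on $\theta$.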

To keep the robustness in all steps, $t$ is sampled from the discrete uniform distribution $\mathcal{U}\{1,T\}$. Based on the trained $\epsilon_{\theta}$, denoising is performed, where the denoising diffusion implicit model (DDIM) [33] has been a popular choice for denoising with a small number of sampling steps as below:

$$x_{t-1}=\sqrt{\frac{\alpha_{t-1}}{\alpha_{t}}}x_{t}+\left(\sqrt{\frac{1-\alpha_{t-1}}{\alpha_{t-1}}}-\sqrt{\frac{1-\alpha_{t}}{\alpha_{t}}}\right)\cdot\epsilon_{\theta}. \tag{2}$$
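A minimal sketch of one deterministic DDIM update may help make Eq. (2) concrete. It follows the standard predict-$x_0$ formulation of DDIM rather than any code released with this paper, and it assumes that $\alpha$ in Eq. (2) denotes the cumulative noise level (written `abar` below). The function name `ddim_step` is hypothetical. Expanding this form gives the same $\sqrt{\alpha_{t-1}/\alpha_{t}}$ coefficient on $x_t$, and a noise coefficient equal to $\sqrt{\alpha_{t-1}}$ times the bracketed term of Eq. (2).

```python
import numpy as np

def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    """One deterministic DDIM update from x_t to x_{t-1}.

    abar_t and abar_prev are the cumulative noise levels at steps t and t-1
    (assumed notation). eps_pred is the network's noise prediction at step t.
    """
    # Estimate the clean sample implied by the current noise prediction.
    x0_pred = (x_t - np.sqrt(1.0 - abar_t) * eps_pred) / np.sqrt(abar_t)
    # Re-noise that estimate to the (lower) noise level of step t-1.
    return np.sqrt(abar_prev) * x0_pred + np.sqrt(1.0 - abar_prev) * eps_pred
```

When `eps_pred` equals the true noise used to form `x_t`, the update lands exactly on the sample with the same noise at level `abar_prev`, which is why a small number of steps can suffice.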

3.2 Text-guided diffusion model

The text-guided diffusion model is a DDPM that restores the output data $x_{0}$ from random noise with a guided condition of a text prompt $\mathcal{T}$. Thus, the training objective is also formulated with this condition under latent space to interact with the textual modality as $\mathbb{E}_{z,\epsilon,t}\left[\|\epsilon-\epsilon_{\theta}(z_{t},t,\mathbf{c})\|_{2}^{2}\right]$, where $z_{t}=E(x_{t})$ is a latent noise encoding (e.g. VQ-VAE [36]) and $\mathbf{c}=\psi(\mathcal{T})$ is a conditional textual embedding (e.g. CLIP [25]). In video editing, $z_{t}$ is a latent encoding of video data, and $\epsilon_{\theta}$ can be pre-trained video diffusion networks.

Figure 3: Illustration of neutralization composed of (a) factor identification $f$ and (b) semantic disentanglement $g$. The $f$ localizes editing factors in video and text and the $g$ produces neutralized video and text via semantically reducing the editing factors in the video and text. The textual neutralization $(g\circ f)_{\mathcal{T}}$ and visual neutralization $(g\circ f)_{\mathcal{V}}$ are the applications of $f$ and $g$ for performing the neutralization.
Figure 4: Illustration of Neutral Editing framework composed of (a) neutral prompt tuning and (b) neutral video editing. The neutral prompt is a text that removes factors contributing to editing from the target prompt. The neutral video is a video that reduces factors related to editing enabling the systems to perform effective editing including non-rigid editing without auxiliary inputs.

4 Neutral Editing Framework

The Neutral Editing (NeuEdit) framework aims to enhance existing diffusion-based video editing systems, enabling more effective non-rigid modifications in a model-agnostic approach using only an input video and a target prompt. To achieve this, as shown in Figure 4 (a), NeuEdit introduces the novel concept of neutralization which enables current editing systems to conduct (1) neutral prompt tuning and (2) neutral video editing. The neutral prompt tuning refers to model tuning based on a neutral prompt. This neutral prompt is a text that reduces factors (e.g. specific words or features) contributing to editing from a target prompt, effectively resolving the issue of spurious reliance on an additional source prompt in current editing systems. Hence, the target prompt keeps effective differences from the neutral prompt by the factors related to editing. After tuning, the systems edit a video based on the target prompt. However, our studies found that original content within the editing region in the video imposes constraints on non-rigid edits. Thus we construct a neutral video that reduces the influence of original content in the region of the video to be edited, amplifying the possibilities of non-rigid editing. The neutral prompt and neutral video are constructed by following our proposed neutralization operation.

4.1 Neutralization

Neutralization aims to disentangle factors contributing to editing in the input modality (i.e. video, text). To perform this, as shown in Figure 3, it comprises two sequential processes: (1) factor identification $f$ and (2) semantic disentanglement $g$. To provide a formal definition of the neutralization, it takes inputs of target prompt $\mathcal{T}$ and video $\mathcal{V}$ and produces neutral prompt $\mathcal{T}_{n}$ and neutral video $\mathcal{V}_{n}$ as below:

$$\mathcal{T}_{n},\mathcal{V}_{n}=(g\circ f)(\mathcal{T},\mathcal{V}), \tag{3}$$

where $f$ is factor identification, which localizes the editing factors within each modality, and $g$ is semantic disentanglement, which produces a version of the modality in which the meaning of the identified editing factors is semantically reduced. The implementations of $f$ and $g$ are specified depending on the target (i.e. text or video) of neutralization, referred to as textual neutralization and visual neutralization in the following.

Textual Neutralization

Textual neutralization $(g\circ f)_{\mathcal{T}}$ at the top of Figure 3 is an application of the neutralization to the target prompt, which obtains neutral prompt $\mathcal{T}_{n}$ from target prompt $\mathcal{T}$ and video $\mathcal{V}$. To implement the factor identification $f_{\mathcal{T}}$ and the semantic disentanglement $g_{\mathcal{T}}$ in the textual neutralization, we first define the editing factors in the target prompt as ‘words’ contributing to editing. Thus, $f_{\mathcal{T}}$ aims to localize these words in the target prompt $\mathcal{T}$. Since the target prompt is a description of desired modifications in the current video, the prompt and the video exhibit semantic misalignment due to the words associated with the modifications (i.e. editing factors). Motivated by this observation, we score all words in $\mathcal{T}$ by their cosine similarities with $\mathcal{V}$ and localize the words exhibiting low similarities as the editing factors. Therefore, we define a textual factor identification $f_{\mathcal{T}}(\mathcal{T},\mathcal{V})$ that produces localization scores for the editing factors in the target prompt based on their similarity scores as given below:

$$f_{\mathcal{T}}(\mathcal{T},\mathcal{V})=1-\textrm{mean}(\mathbf{w}\mathbf{v}^{\top})\in\mathbb{R}^{M}, \tag{4}$$

where $\mathbf{w}=\psi_{T}(\mathcal{T})\in\mathbb{R}^{M\times d}$ is the $d$-dimensional normalized word features of the target prompt, $M$ is the number of words, and $\psi_{T}$ is the CLIP [25] text encoder. $\mathbf{v}=\psi_{I}(\mathcal{V})\in\mathbb{R}^{L\times d}$ is the $d$-dimensional video features with frame length $L$, where $\psi_{I}$ is the CLIP image encoder. $\textrm{mean}(\cdot)$ is a mean-pooling along the frame axis. As editing factors have low similarity scores with the video, we invert the scores by subtracting them from one, finally producing the textual editing factor score denoted as $\mathbf{z}_{\mathcal{T}}=f_{\mathcal{T}}(\mathcal{T},\mathcal{V})$. This score $\mathbf{z}_{\mathcal{T}}$ is utilized for the following textual semantic disentanglement.
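The computation in Eq. (4) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the arrays `w` and `v` stand in for CLIP word and frame features, both are L2-normalized here so that the dot product is a cosine similarity (normalizing `v` is an assumption), and the function names are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to unit L2 norm (eps guards against division by zero)."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def textual_factor_score(w, v):
    """Eq. (4): z_T = 1 - mean over frames of (w v^T).

    w: (M, d) word features of the target prompt.
    v: (L, d) per-frame features of the input video.
    Returns an (M,) vector; higher means less aligned with the video.
    """
    w = l2_normalize(w)
    v = l2_normalize(v)
    sim = w @ v.T                   # (M, L) word-to-frame cosine similarities
    return 1.0 - sim.mean(axis=1)   # mean-pool along the frame axis, then invert
```

A word whose feature matches every frame gets a score near 0, while a word unrelated to the video content gets a score near 1 and is therefore a candidate editing factor.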

The textual semantic disentanglement $g_{\mathcal{T}}$ aims to build a neutral prompt $\mathcal{T}_{n}$ from a target prompt $\mathcal{T}$ by semantically reducing editing factors in $\mathcal{T}$. To perform this, $g_{\mathcal{T}}$ employs the editing factor score $\mathbf{z}_{\mathcal{T}}$ to identify the editing factors and reduce their meaning by disentangling them. To be specific, we present two technical contributions about the disentangling methods: (1) factor swapping and (2) factor deforming. The factor swapping is to swap the identified editing factors with other words. For the scores above a specific value $s$ (e.g. 0.7) in $\mathbf{z}_{\mathcal{T}}$ (e.g. [0.1, 0.9, $\cdots$, 0.2]), their corresponding words are decided as editing factors, denoted as $W_{\mathbf{z}_{\mathcal{T}}>s}$ in word space, where $\mathbf{z}_{\mathcal{T}}>s$ is the indices of words in target prompt $\mathcal{T}$ with a higher score than $s$. Thus, the $W_{\mathbf{z}_{\mathcal{T}}>s}$ are swapped with other word tokens, where to mitigate a semantic intervention by swapping tokens, we define a dummy token <DMY> for the swapping.
As a result, the textual semantic disentanglement with factor swapping ultimately produces a neutral prompt $g_{\mathcal{T}}(\mathbf{z}_{\mathcal{T}})=\mathcal{T}_{n}=[W_{1},\cdots,W_{i},\cdots,W_{M}]$ including $W_{\mathbf{z}_{\mathcal{T}}>s}=$ <DMY>, where $W_{i}$ denotes the corresponding $i$-th word in the target prompt $\mathcal{T}$. Although the $\mathcal{T}_{n}$ with factor swapping maintains a distinct difference from $\mathcal{T}$ by the editing factors $W_{\mathbf{z}_{\mathcal{T}}>s}$ (Footnote 3: Experimental studies in Appendix F provide the optimal region of $s$), it relies on a heuristic for selecting the editing factors and has difficulty distinguishing differences among the factors. To this end, we further devised a feature-level disentanglement referred to as factor deforming.
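Factor swapping as described above amounts to a thresholded replacement over the words of the target prompt. The following sketch assumes one score per word (the mapping from tokenizer tokens to words is left out) and uses made-up scores in the usage example. The function name `factor_swap` is hypothetical.

```python
def factor_swap(words, scores, s=0.7, dummy="<DMY>"):
    """Replace every word whose editing factor score exceeds s with a dummy token.

    words:  list of M words from the target prompt.
    scores: list of M textual editing factor scores (z_T), one per word.
    """
    if len(words) != len(scores):
        raise ValueError("words and scores must have the same length")
    return [dummy if z > s else w for w, z in zip(words, scores)]
```

For example, with illustrative scores `[0.1, 0.2, 0.9, 0.1, 0.1, 0.3]` for “A man jumps on the moon”, only “jumps” exceeds the threshold of 0.7 and is replaced by the dummy token, giving the neutral prompt “A man <DMY> on the moon”.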
Factor deforming first disentangles the target prompt features $\mathbf{w}\in\mathbb{R}^{M\times d}$ into a linear combination using the factor score $\mathbf{z}_{\mathcal{T}}\in\mathbb{R}^{M}$ as $\mathbf{w}=\mathbf{z}_{\mathcal{T}}\circ\mathbf{w}+(1-\mathbf{z}_{\mathcal{T}})\circ\mathbf{w}$, where $\circ$ is element-wise multiplication with broadcasting (Footnote 4: Here, $\circ$ is different from the composite operation in $f\circ g$). After that, we deform the features attended by $\mathbf{z}_{\mathcal{T}}$ using a deformable operation $h(\cdot)$ as:

$$\mathbf{w}_{n}=\mathbf{z}_{\mathcal{T}}\circ h(\mathbf{w})+(1-\mathbf{z}_{\mathcal{T}})\circ\mathbf{w}. \tag{5}$$

This format selectively deforms the text features concerning the editing factor while keeping disparities among features of editing factors. We apply feature down-scaling for the deforming as $h(\mathbf{w})=\alpha\times\mathbf{w}$ with scalar $0\leq\alpha<1$ (Footnote 5: Appendix F also provides other factor deforming methods). For $\alpha=0$, all deformed features become identical, working similarly to the factor swapping. Finally, we define the neutral prompt features $g_{\mathcal{T}}(\mathbf{z}_{\mathcal{T}})=\mathbf{w}_{n}\in\mathbb{R}^{M\times d}$, which are utilized in the tuning process of the NeuEdit framework.
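Eq. (5) with the down-scaling choice $h(\mathbf{w})=\alpha\times\mathbf{w}$ reduces to a per-word blend between the original feature and its scaled copy. The sketch below is a minimal NumPy version under that choice. The function name `factor_deform` and the default `alpha=0.5` are illustrative assumptions, not values from the paper.

```python
import numpy as np

def factor_deform(w, z, alpha=0.5):
    """Eq. (5) with h(w) = alpha * w.

    w: (M, d) target prompt word features.
    z: (M,) textual editing factor scores in [0, 1].
    Returns the (M, d) neutral prompt features w_n.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must satisfy 0 <= alpha < 1")
    z = np.asarray(z, dtype=float)[:, None]  # (M, 1) so it broadcasts over d
    return z * (alpha * w) + (1.0 - z) * w
```

With this choice each row of `w` is scaled by $1-(1-\alpha)z$, so a word with score 0 is left untouched and a word with score 1 is scaled by exactly $\alpha$, with intermediate scores interpolating between the two.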

Visual Neutralization

Visual neutralization $(g\circ f)_{\mathcal{V}}$ at the bottom of Figure 3 is another application of neutralization to the input video, which produces neutral video $\mathcal{V}_{n}$ from target prompt $\mathcal{T}$ and video $\mathcal{V}$. Neutral video improves the effectiveness of editing by reducing the influence of the original content in a region to be edited. To construct a factor identification $f_{\mathcal{V}}$ and semantic disentanglement $g_{\mathcal{V}}$ in the visual neutralization, we also define the editing factors in the video as ‘pixels’ contributing to editing. Thus $f_{\mathcal{V}}$ aims to localize these pixels in the input video $\mathcal{V}$. As editing models’ multi-modal interaction modules (i.e. cross-attention) contain information about the interactions between texts and frames, we employ this information to build $f_{\mathcal{V}}$. (Footnote 6: Video editing is based on pre-trained knowledge (e.g. Stable Diffusion [27]) with multi-modal attention to provide required modifications.)
We first embed each $i$-th frame of the video into patch-wise features $\mathbf{p}^{i}\in\mathbb{R}^{(W_{p}\times H_{p})\times d}$ and perform cross attention with the target prompt features $\mathbf{w}\in\mathbb{R}^{M\times d}$ to get cross attention maps $\mathbf{m}^{i}\in\mathbb{R}^{(W_{p}\times H_{p})\times M}$, where $(W_{p}\times H_{p})$ is the number of patches in the frame and $M$ is the number of words in the target prompt. Among the attention maps $\mathbf{m}^{i}$, we highlight the maps related to the editing using the textual editing factor score $\mathbf{z}_{\mathcal{T}}\in\mathbb{R}^{M\times 1}$ as below (see also the illustration at the bottom of Figure 3):

$$\mathbf{z}_{\mathcal{V}}^{i} = \mathbf{m}^{i}\mathbf{z}_{\mathcal{T}} \in \mathbb{R}^{W_p \times H_p}, \tag{6}$$

where $\mathbf{z}_{\mathcal{V}}^{i}$ is the visual editing factor score of the $i$-th frame. After restoring $\mathbf{z}_{\mathcal{V}}^{i}$ to the original frame scale ($W \times H$) and aggregating all frames, we define the visual factor identification as $f_{\mathcal{V}}(\mathcal{T}, \mathcal{V}) = \mathbf{z}_{\mathcal{V}} \in \mathbb{R}^{L \times (W \times H)}$, where $L$ is the number of video frames. The visual editing factor score $\mathbf{z}_{\mathcal{V}}$ is then used for the visual semantic disentanglement.
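
As a rough illustration of Equation 6 and the aggregation over frames, the following NumPy sketch computes the visual editing factor scores from precomputed attention maps. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `visual_editing_scores` and its arguments are hypothetical, the attention maps are assumed to be available as an array with patches stored row by row, and nearest-neighbour repetition stands in for the interpolation that restores the frame scale.

```python
import numpy as np

def visual_editing_scores(attn_maps, z_text, frame_hw, patch_hw):
    """Sketch of Eq. (6) plus upsampling and aggregation over frames.

    attn_maps: (L, Hp*Wp, M) cross-attention maps m^i, one per frame.
    z_text:    (M,) textual editing factor score z_T.
    Returns z_V with shape (L, H, W).
    """
    num_frames, num_patches, num_words = attn_maps.shape
    hp, wp = patch_hw
    h, w = frame_hw
    assert num_patches == hp * wp and z_text.shape == (num_words,)
    assert h % hp == 0 and w % wp == 0

    # Eq. (6): weight each word's attention map by its editing score.
    z_patch = attn_maps @ z_text                  # (L, Hp*Wp)
    # Assumption: patches are laid out row by row within a frame.
    z_patch = z_patch.reshape(num_frames, hp, wp)

    # Restore the frame scale. Nearest-neighbour repetition is a
    # stand-in here; it is an illustrative choice only.
    return z_patch.repeat(h // hp, axis=1).repeat(w // wp, axis=2)
```

With the settings reported later in the paper ($W = H = 512$, $W_p = H_p = 16$), each patch score would cover a $32 \times 32$ pixel block in this sketch.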

The visual semantic disentanglement $g_{\mathcal{V}}$ aims to build the neutral video $\mathcal{V}_n$ from the input video $\mathcal{V}$ by reducing the meaning of the editing factors in the video. To this end, based on the visual editing factor score $\mathbf{z}_{\mathcal{V}}$, $g_{\mathcal{V}}$ identifies the factors contributing to editing at the pixel level and semantically reduces them. Similar to the textual semantic disentanglement $g_{\mathcal{T}}$, we apply factor deforming: we separate the video pixels into two groups and deform the group related to editing as follows:

$$\mathcal{V}_n = \mathbf{z}_{\mathcal{V}} \circ h(\mathcal{V}) + (1 - \mathbf{z}_{\mathcal{V}}) \circ \mathcal{V}, \tag{7}$$

where $\circ$ denotes element-wise multiplication and $h(\cdot)$ deforms the video content associated with the editing factors. We use Gaussian blurring, $h(\mathcal{V}) = \mathcal{V} * G$, with Gaussian kernel $G(x, y) = \frac{1}{2\pi\sigma^{2}} e^{-(x^{2} + y^{2})/2\sigma^{2}}$. The visual semantic disentanglement is therefore summarized as $g_{\mathcal{V}}(\mathbf{z}_{\mathcal{V}}) = \mathcal{V}_n$. The neutral video $\mathcal{V}_n$ is used for editing instead of $\mathcal{V}$, which alleviates the restrictions imposed by the original content and facilitates dynamic variations within the editing areas.
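
The blending in Equation 7 can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the helper names are hypothetical, frames are assumed to be single-channel arrays, the Gaussian kernel is truncated at roughly $3\sigma$ and renormalized, and frame borders are handled by edge padding (none of these choices are stated in the paper).

```python
import numpy as np

def gaussian_kernel_1d(sigma):
    """Normalized 1-D Gaussian, truncated at about 3 sigma (assumption)."""
    radius = max(int(3 * sigma), 1)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def blur_frame(frame, sigma):
    """h(V) for one (H, W) frame: separable Gaussian blur with edge padding."""
    k = gaussian_kernel_1d(sigma)
    r = len(k) // 2
    padded = np.pad(frame, r, mode="edge")
    # The 2-D Gaussian factorizes, so filter along rows, then columns.
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)

def neutralize_video(video, z_v, sigma=3.0):
    """Eq. (7): V_n = z_V * h(V) + (1 - z_V) * V, element-wise.

    video, z_v: arrays of shape (L, H, W); z_v is assumed to lie in [0, 1].
    """
    blurred = np.stack([blur_frame(frame, sigma) for frame in video])
    return z_v * blurred + (1.0 - z_v) * video
```

Where the score is 0 the original pixel is kept unchanged, and where it is 1 the pixel is fully replaced by its blurred version, so the score acts as a per-pixel blending weight.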

4.2 Plug-and-play NeuEdit framework

We integrate the neutral prompt $\mathcal{T}_n$ and the neutral video $\mathcal{V}_n$ into a diffusion-based video editing system. The editing system comprises two processes: (1) video tuning and (2) video editing. The neutral prompt is introduced in the tuning process as the guiding condition features $\mathbf{w}_n = \psi_T(\mathcal{T}_n)$ in the training objective of a text-guided diffusion model (see Section 2 for details):

$$\mathbb{E}_{z,\epsilon,t}\left[\lVert \epsilon - \epsilon_{\theta}(z_t, t, \mathbf{w}_n) \rVert_2^2\right]. \tag{8}$$
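
To make the objective in Equation 8 concrete, the sketch below draws one Monte Carlo sample of the loss. It is a hedged illustration rather than the paper's training code: `tuning_loss`, `eps_model`, and `alpha_bar` are hypothetical names, and the noisy latent is assumed to follow the standard DDPM forward process $z_t = \sqrt{\bar{\alpha}_t}\, z + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$.

```python
import numpy as np

def tuning_loss(z0, w_n, eps_model, alpha_bar, rng):
    """One-sample estimate of E[ ||eps - eps_theta(z_t, t, w_n)||^2 ].

    z0:        clean latent of the input video (any shape).
    w_n:       neutral prompt features, passed through to the model.
    eps_model: callable (z_t, t, w_n) -> predicted noise (hypothetical).
    alpha_bar: 1-D array of cumulative noise-schedule products.
    """
    t = rng.integers(0, len(alpha_bar))       # random timestep
    eps = rng.standard_normal(z0.shape)       # target noise
    # Assumed standard DDPM forward noising of the clean latent.
    z_t = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    pred = eps_model(z_t, t, w_n)
    return float(np.sum((eps - pred) ** 2))
```

In an actual tuning loop this quantity would be averaged over many timesteps and noise samples and minimized with respect to the model parameters; the only change relative to ordinary tuning is that the condition is the neutral prompt feature rather than the feature of the original prompt.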

After tuning with the neutral prompt, the model performs editing by denoising an initial latent noise conditioned on the target prompt $\mathcal{T}$, producing the edited video $\mathcal{V}_{\textrm{edit}}$ as:

$$\mathcal{V}_{\textrm{edit}} = \textrm{Denoise}(f_{init}(\mathcal{V}_n), \mathcal{T}), \tag{9}$$

where $\textrm{Denoise}(\cdot, \cdot)$ is the reverse process of the diffusion model, which denoises gradually by sequentially applying Equation 2, and $f_{init}$ is an initial latent noise encoding, such as DDIM inversion, that enhances reconstruction of the input video (Appendix E gives details of DDIM inversion and gradual denoising). We provide the neutral video $\mathcal{V}_n$ instead of $\mathcal{V}$ to improve dynamic modifications in the editing region.
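
The structure of Equation 9 can be illustrated with deterministic DDIM updates. The sketch below is an assumption-laden toy, not the paper's pipeline: `eps_model` is a hypothetical noise predictor, latents stand in for encoded videos, the condition used during inversion is left to the caller, and the schedule `alpha_bar` is assumed to be given.

```python
import numpy as np

def ddim_step(x, eps, a_from, a_to):
    """Deterministic DDIM move between noise levels a_from -> a_to."""
    x0_pred = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0_pred + np.sqrt(1.0 - a_to) * eps

def ddim_invert(x0, cond, eps_model, alpha_bar):
    """f_init: map a clean latent to an initial noisy latent."""
    x, a_prev = x0, 1.0
    for t, a_t in enumerate(alpha_bar):
        # Usual approximation: predict the noise from the less noisy latent.
        eps = eps_model(x, t, cond)
        x = ddim_step(x, eps, a_prev, a_t)
        a_prev = a_t
    return x

def denoise(x_T, cond, eps_model, alpha_bar):
    """Reverse process: gradually denoise under the given condition."""
    x = x_T
    for t in range(len(alpha_bar) - 1, -1, -1):
        a_prev = alpha_bar[t - 1] if t > 0 else 1.0
        eps = eps_model(x, t, cond)
        x = ddim_step(x, eps, alpha_bar[t], a_prev)
    return x
```

Under these assumptions, Equation 9 corresponds to `denoise(ddim_invert(v_n_latent, inv_cond, eps_model, alpha_bar), target_cond, eps_model, alpha_bar)`, where `v_n_latent` is the latent of the neutral video rather than of the original video, and `inv_cond` and `target_cond` are placeholder names for the conditions.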

Figure 5: Qualitative results of applying the NeuEdit framework to recent editing systems for (a) non-rigid editing (i.e., motion variation) and (b) rigid editing (i.e., style transfer, object overlay). (TAV: Tune-A-Video [38], P2P: Video-P2P [19]).

5 Experiment

5.1 Experimental Settings

Implementation Details.

For textual factor identification $f_{\mathcal{T}}$, the CLIP model (ViT-L/14) [25] is used to extract text and image features. The attention map for visual neutralization is taken from the cross-attention module of the video diffusion model [38] (i.e., Stable Diffusion v1.5). The experimental settings are $W = H = 512$ and $W_p = H_p = 16$, and experiments are run on an NVIDIA A100 GPU.

Dataset and Baselines.

We validate on videos from DAVIS [22] and LOVEU-TGVE [39], a video editing challenge dataset (https://sites.google.com/view/loveucvpr23/track4); the videos comprise 32 to 128 frames each. The NeuEdit framework is validated on non-rigid and rigid editing with recent editing systems, including Tune-A-Video, Video-P2P, and FateZero [23], using their public code.

5.2 Evaluation Metric

We validate editing results based on four assessments: (1) textual alignment, (2) fidelity to the input video, (3) frame consistency, and (4) human preference. Textual alignment measures the semantic alignment between a target prompt and an edited video using the CLIP score and PickScore [18]; PickScore approximates human preferences with a model trained at large scale. Fidelity measures the preservation of original content in the unedited region using peak signal-to-noise ratio (PSNR), learned perceptual image patch similarity (LPIPS), and the structural similarity index measure (SSIM); Appendix C explains in detail how the unedited region is captured. Frame consistency is measured by image CLIP scores between sequential frames and by the Fréchet video distance (FVD), which evaluates the naturalness of videos. For the human evaluation, we survey preferences between videos edited by the baseline editing models and by the same models with NeuEdit, given the target prompt.
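
As an illustration of the frame-consistency score, the sketch below averages the cosine similarity between embeddings of consecutive frames. It assumes the per-frame embeddings (for example from a CLIP image encoder) have already been computed elsewhere; the function name is hypothetical and the exact aggregation used in the paper is not specified here.

```python
import numpy as np

def frame_consistency(frame_embeddings):
    """Mean cosine similarity between embeddings of consecutive frames.

    frame_embeddings: array of shape (L, d), one embedding per frame,
    assumed to come from an image encoder such as CLIP.
    """
    e = np.asarray(frame_embeddings, dtype=float)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)   # unit-normalize
    # Dot product of each frame with its successor.
    return float(np.mean(np.sum(e[:-1] * e[1:], axis=1)))
```

A video whose consecutive frames have identical embeddings scores 1 under this definition, and lower values indicate larger frame-to-frame changes in embedding space.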

5.3 Experimental Results

Qualitative Comparisons.

Figure 5 shows the qualitative results of recent editing systems [38, 19] with the proposed neutral editing framework (see also the qualitative results in Appendix G). To validate the qualitative effectiveness of neutralization, we perform case studies on two types of editing: (a) non-rigid editing and (b) rigid editing. In non-rigid editing, the results of current editing systems are not aligned with the target prompt: they either reproduce the original input video or incorrectly mix original content (e.g., trees) with the required variations (e.g., wings). With the NeuEdit framework, however, the same models perform effective non-rigid editing on various targets, including humans and objects. Notably, the thumbs-up motion edit is applied conditionally, depending on the visibility of the skier’s hand, because visual neutralization is applied only to visible editing factors (i.e., the hand). We provide further analysis of this in Section 5.4. In rigid editing (top: style transfer, bottom: object overlay), both the current models and the models with NeuEdit provide qualitatively proper modifications. On closer inspection, however, only the models with the NeuEdit framework maintain finer fidelity to the unedited region (i.e., yellow box) of the video. We attribute this to the feature commonality between the neutral prompt and the target prompt, excluding the editing factor, which improves selective fidelity in the model. At the bottom, we also edit a kite-surfing man to resemble Spider-Man. Interestingly, the model with NeuEdit also changes the action of catching a kite into catching it with a spider web. We consider that the neutral video shown in the pink box contributes to effective editing by mitigating the restriction imposed by the original content in the area to be edited.

Column groups: Textual Alignment (CLIP$^{\star}$, PickScore), Fidelity to Input Video (PSNR, LPIPS, SSIM), Frame Consistency (CLIP$^{\dagger}$, FVD), Human (Preference).

| Method | CLIP$^{\star}$ | PickScore ↑ | PSNR ↑ | LPIPS ↓ | SSIM ↑ | CLIP$^{\dagger}$ | FVD ↓ | Preference ↑ |
|---|---|---|---|---|---|---|---|---|
| TAV [38] | 22.6 / 27.1 | 19.5 / 20.2 | 13.1 / 14.1 | 0.1813 / 0.1934 | 0.621 / 0.653 | 0.921 / 0.952 | 3481 / 3392 | 0.14 |
| TAV + NeuEdit | 27.6 / 28.5 | 20.6 / 20.9 | 19.2 / 18.9 | 0.1438 / 0.1411 | 0.706 / 0.711 | 0.962 / 0.971 | 3270 / 3151 | 0.86 |
| FateZero [23] | 21.2 / 26.1 | 19.4 / 20.1 | 14.1 / 13.6 | 0.1653 / 0.1731 | 0.636 / 0.643 | 0.958 / 0.960 | 3319 / 3106 | 0.34 |
| FateZero + NeuEdit | 27.3 / 28.7 | 20.1 / 21.2 | 16.8 / 17.3 | 0.1621 / 0.1724 | 0.637 / 0.657 | 0.969 / 0.968 | 3209 / 3071 | 0.66 |
| Video-P2P [19] | 22.5 / 27.2 | 19.6 / 20.0 | 14.7 / 15.5 | 0.1738 / 0.1814 | 0.645 / 0.677 | 0.961 / 0.958 | 3231 / 3095 | 0.38 |
| Video-P2P + NeuEdit | 27.9 / 29.6 | 20.9 / 21.3 | 19.3 / 19.8 | 0.1298 / 0.1388 | 0.727 / 0.733 | 0.966 / 0.973 | 3135 / 2953 | 0.62 |

Table 1: Evaluations about edited videos based on DAVIS and TGVE in terms of non-rigid/rigid type editing corresponding to textual alignment, fidelity to input video, frame consistency, and human preference. CLIP$^{\star}$: text-video clip score, CLIP$^{\dagger}$: image-image clip score.

Quantitative Results.

Table 1 presents evaluations of non-rigid/rigid editing on videos from DAVIS and TGVE for recent editing systems with NeuEdit, across four assessments (i.e., alignment, fidelity, consistency, human evaluation); Appendix C provides further results on UCF101 [35]. The effectiveness of NeuEdit is confirmed for all models. Textual alignment improves most in non-rigid editing (e.g., CLIP$^{\star}$ rises from 22.6 to 27.6 for TAV and from 22.5 to 27.9 for Video-P2P). Fidelity evaluates the preservation of unedited areas in the video, so we measure it after masking the identical editing-related regions in the input and edited videos. Fidelity is enhanced more in the tuning-based models (i.e., TAV, Video-P2P) than in the tuning-free model (i.e., FateZero); for example, non-rigid PSNR rises from 13.1 to 19.2 for TAV but only from 14.1 to 16.8 for FateZero. This suggests that the neutral prompt contributes to improving fidelity.

Figure 6: Ablation studies on the neutral video and the neutral prompt. The input video and target are shown in Figure 1.
Figure 7: Sensitivity analysis of textual and visual semantic disentanglement. $g_{\mathcal{T}}$ employs feature down-scaling with scaler $\alpha$, and $g_{\mathcal{V}}$ employs Gaussian blurring with $\sigma$. A single frame of the neutral video for each blurring level $\sigma$ is shown below curve (b).

5.4 Ablation Study

Figure 6 presents ablation studies on the neutral video and the neutral prompt for motion editing. In Figure 6 (a), the current model is ineffective at motion editing (i.e., thumbs-up) and produces a video that closely resembles the input video. We apply neutralization to this model step by step. Results (b) and (c) come from applying neutral prompt tuning to the editing model: (b) uses a neutral prompt built with factor swapping, while (c) uses neutral prompt features built with factor deforming. As the yellow circles in (b) and (c) show, the astronaut slightly extends his arm to give a thumbs-up, and the action in (c) appears more effective than in (b). We consider that the neutral prompt features identify the differences among editing factors that should be emphasized. Nevertheless, the motion editing remains unnatural because of constraints posed by the original arm motion. Video (d) is the result of editing with the neutral video (e); it shows dynamic motion edits, such as bending the arm for a thumbs-up. Notably, neutralization is applied selectively to the editing factors. When the hand disappears behind the leg, the neutralization effect (i.e., red circle) vanishes, which lets the original motion recover naturally and yields a temporal motion change in the result.

Figure 7 shows a sensitivity analysis of factor deforming in textual and visual semantic disentanglement. The textual disentanglement $g_{\mathcal{T}}$ is modulated by the feature down-scaler $\alpha$. Neutral prompt features are effective for values below 0.3, but below 0.15 performance deteriorates because the editing factors become harder to distinguish. For the visual disentanglement $g_{\mathcal{V}}$, the Gaussian blurring is controlled by $\sigma$. Effective motion editing occurs for $\sigma > 3$, in line with the ambiguity caused by blurring (i.e., yellow to green in Figure 7 (b)). At lower values, editing is restricted by the original motion of the video.

6 Conclusion

This paper proposes a diffusion-based video editing framework, referred to as Neutral Editing (NeuEdit), which enables complex non-rigid editing of a person/object in a video. NeuEdit introduces a ‘neutralization’ concept to enhance the current tuning-editing process of diffusion-based editing systems in a model-agnostic manner. Extensive experiments validate its editability and visual effectiveness.

References

  • Bar-Tal et al. [2022] Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, and Tali Dekel. Text2live: Text-driven layered image and video editing. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XV, pages 707–723. Springer, 2022.
  • Brooks et al. [2022] Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. arXiv preprint arXiv:2211.09800, 2022.
  • Chai et al. [2023] Wenhao Chai, Xun Guo, Gaoang Wang, and Yan Lu. Stablevideo: Text-driven consistency-aware diffusion video editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23040–23050, 2023.
  • Creswell et al. [2018] Antonia Creswell, Tom White, Vincent Dumoulin, Kai Arulkumaran, Biswa Sengupta, and Anil A Bharath. Generative adversarial networks: An overview. IEEE signal processing magazine, 35(1):53–65, 2018.
  • Devlin et al. [2018] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • Dhariwal and Nichol [2021] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
  • Esser et al. [2023] Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7346–7356, 2023.
  • Goodfellow et al. [2020] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
  • Hertz et al. [2022] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
  • Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  • Ho et al. [2022a] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022a.
  • Ho et al. [2022b] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. arXiv preprint arXiv:2204.03458, 2022b.
  • Hong et al. [2022] Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022.
  • Kawar et al. [2022] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. arXiv preprint arXiv:2210.09276, 2022.
  • Kim et al. [2022] Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. Diffusionclip: Text-guided diffusion models for robust image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2426–2435, 2022.
  • Kingma and Welling [2013] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Kirillov et al. [2023] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
  • Kirstain et al. [2023] Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation. arXiv preprint arXiv:2305.01569, 2023.
  • Liu et al. [2023] Shaoteng Liu, Yuechen Zhang, Wenbo Li, Zhe Lin, and Jiaya Jia. Video-p2p: Video editing with cross-attention control. arXiv preprint arXiv:2303.04761, 2023.
  • Lugmayr et al. [2022] Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11461–11471, 2022.
  • Nichol et al. [2021] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
  • Pont-Tuset et al. [2017] Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017.
  • Qi et al. [2023] Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, and Qifeng Chen. Fatezero: Fusing attentions for zero-shot text-based video editing. arXiv preprint arXiv:2303.09535, 2023.
  • Radford et al. [2018] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • Ramesh et al. [2022] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
  • Ruiz et al. [2022] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242, 2022.
  • Saharia et al. [2022a] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022a.
  • Saharia et al. [2022b] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022b.
  • Singer et al. [2022] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022.
  • Sohl-Dickstein et al. [2015] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015.
  • Song et al. [2020a] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020a.
  • Song et al. [2020b] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020b.
  • Soomro et al. [2012] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
  • Van Den Oord et al. [2017] Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017.
  • Wu et al. [2022a] Chenfei Wu, Jian Liang, Lei Ji, Fan Yang, Yuejian Fang, Daxin Jiang, and Nan Duan. Nüwa: Visual synthesis pre-training for neural visual world creation. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XVI, pages 720–736. Springer, 2022a.
  • Wu et al. [2022b] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Weixian Lei, Yuchao Gu, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. arXiv preprint arXiv:2212.11565, 2022b.
  • Wu et al. [2023] Jay Zhangjie Wu, Xiuyu Li, Difei Gao, Zhen Dong, Jinbin Bai, Aishani Singh, Xiaoyu Xiang, Youzeng Li, Zuwei Huang, Yuanxi Sun, et al. Cvpr 2023 text guided video editing competition. arXiv preprint arXiv:2310.16003, 2023.

Supplementary Material

Appendix A Broader Impacts and Ethics Statement

Visual generative models raise several ethical concerns, such as illegal counterfeit content, potential invasion of privacy, and fairness issues. Our work relies on the underlying framework of these generative models, which makes it vulnerable to the same concerns. Addressing them effectively requires various regulations, including technical safeguards. Crucially, researchers should take responsibility for these concerns and actively work to build such safeguards. To mitigate potential concerns and maintain transparency, we will release our source code, including specifications of the models and data we employed, under a license that encourages ethical and legal usage. We also consider introducing further safeguards such as learning-based digital forensics and digital watermarking. Collectively, these measures aim to navigate the ethical landscape of visual generative models and to foster their responsible and beneficial use.

Appendix B Limitation and Future work

The editing systems appear susceptible to unintended bias when modifying the required attributes. For instance, Figure 14 shows a failure case of our method. When a specific attribute of an object in a video is modified (e.g., the motion of riding a snowboard), the scene in the video is also changed to a context (e.g., snow) that is primarily associated with the desired attribute. We define this issue as editing bias, and mitigating it is our future work. Although NeuEdit is also successfully applicable to non-rigid image editing, including still-pose editing (i.e., Figure 10), in the video domain it was challenging to edit the motion of a moving object so that it becomes still. Because the temporal attention in the video diffusion model acts to preserve temporal consistency, this consistency carries over to the object being edited: the object takes a still pose but still follows its original movements.

Appendix C Further details and more evaluations

We present further details of our experiments including implementations, evaluations, and results.

C.1 Implementation details

For video encoding, we utilize VQ-VAE [36], which provides patch-wise features of each frame, and for text encoding we employ the CLIP model (ViT-L/14) [25]. In visual neutralization, we apply bicubic interpolation to restore the editing factor score $\mathbf{z}_{\mathcal{V}}^{i}$ of each $i$-th frame to the original frame scale. For the cross-attention module, we use the cross-attention weights of the first up-block attention layer in the Stable Diffusion U-Net [27], which yields an attention map of size $16 \times 16$. Empirically, other mid-block and up-block layers also work properly for designing the visual editing factor. However, the down-block layers were not effective due to insufficient early multimodal interactions between text and image.

For the details of neutral video, we set visual editing factor scores below 0.2 uniformly to zero. This approach improves the effectiveness of editing in the targeted region by establishing a clear boundary between the visual regions associated with the editing factor and those that are not.
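
The thresholding described above amounts to a single masking operation on the score map. The sketch below is illustrative only: the function name and the `tau` parameter are hypothetical, and the scores are assumed to be stored as a NumPy array.

```python
import numpy as np

def threshold_scores(z_v, tau=0.2):
    """Set visual editing factor scores below tau to zero.

    Scores equal to or above tau are kept unchanged, which separates
    pixels associated with the editing factor from the rest.
    """
    return np.where(z_v < tau, 0.0, z_v)
```

Pixels whose score is zeroed are then left untouched by the blending in Equation 7, so only the region associated with the editing factor is deformed.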

C.2 Evaluation details

Fidelity evaluation.

To measure fidelity to the input video, we apply the identical zero mask to the edited area in both the input video and the output video, as shown in Figure 8. Masking the area to be edited lets us measure the similarity between the input and output videos in terms of the preservation of unedited content, producing PSNR, LPIPS, and SSIM scores. In early experiments, we tried several automatic detectors, such as segmentation-based detectors [17], to identify the areas to be edited in the input and output videos, but specification by humans was the most accurate.
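
For illustration, the masked fidelity measurement can be sketched for PSNR as follows. This is a minimal sketch under assumptions, not the evaluation script used in the paper: the function name is hypothetical, videos are assumed to be arrays with values in $[0, 1]$, and `edit_mask` is assumed to be 1 inside the edited area and 0 elsewhere.

```python
import numpy as np

def masked_psnr(video_in, video_out, edit_mask, max_val=1.0):
    """PSNR between input and edited video after zero-masking the edited area.

    video_in, video_out: arrays of identical shape, values in [0, max_val].
    edit_mask: same shape, 1 inside the edited area and 0 elsewhere.
    """
    keep = 1.0 - edit_mask
    # Apply the identical zero mask to both videos.
    a = video_in * keep
    b = video_out * keep
    mse = np.mean((a - b) ** 2)
    if mse == 0:
        return float("inf")   # unedited content is perfectly preserved
    return float(10.0 * np.log10(max_val ** 2 / mse))
```

Because the same mask zeroes the edited area in both videos, differences inside that area do not affect the score, and only changes to the unedited content lower it.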

Human evaluation.

Human evaluation measures the preference for edited results given a target prompt. Participants made a discrete choice between the outcomes of current editing systems and those generated with the NeuEdit framework. The survey involved 36 participants with diverse academic backgrounds (e.g., engineering, literature, art), including both native and non-native English speakers.

C.3 Evaluations on videos of different domains

UCF101

Collected from YouTube, the UCF101 [35] dataset is designed for action recognition and comprises 101 action categories. To measure video editing performance of the editing models with and without the NeuEdit framework, we selected 83 videos from the dataset. Overall, the performance of all models (i.e., TAV: Tune-A-Video, FateZero, and Video-P2P) improves with the NeuEdit framework in terms of textual alignment, fidelity, and consistency. Furthermore, the tuning-editing models (i.e., TAV, Video-P2P) with the NeuEdit framework exhibit a significant enhancement in fidelity, similar to the DAVIS dataset, demonstrating general effectiveness across videos.

Figure 8: Illustration of masked input and output video for measuring fidelity metrics. Identical masking is applied between the input video and the output edited video.
Column groups: Textual Alignment (CLIP$^{\star}$, PickScore), Fidelity to Input Video (PSNR, LPIPS, SSIM), Frame Consistency (CLIP$^{\dagger}$, FVD), Human (Preference).

| Method | CLIP$^{\star}$ | PickScore ↑ | PSNR ↑ | LPIPS ↓ | SSIM ↑ | CLIP$^{\dagger}$ | FVD ↓ | Preference ↑ |
|---|---|---|---|---|---|---|---|---|
| TAV [38] | 22.3 / 26.7 | 19.2 / 19.7 | 12.7 / 13.8 | 0.1911 / 0.2031 | 0.573 / 0.609 | 0.913 / 0.937 | 3537 / 3441 | 0.15 |
| TAV + NeuEdit | 26.8 / 27.9 | 20.1 / 20.6 | 18.2 / 17.4 | 0.1532 / 0.1509 | 0.658 / 0.669 | 0.946 / 0.952 | 3321 / 3202 | 0.85 |
| FateZero [23] | 21.0 / 25.7 | 19.1 / 19.6 | 13.1 / 13.2 | 0.1749 / 0.1836 | 0.583 / 0.598 | 0.936 / 0.941 | 3372 / 3151 | 0.30 |
| FateZero + NeuEdit | 26.7 / 27.9 | 20.0 / 20.8 | 15.9 / 16.9 | 0.1723 / 0.1811 | 0.589 / 0.601 | 0.947 / 0.946 | 3252 / 3123 | 0.70 |
| Video-P2P [19] | 22.3 / 26.8 | 19.3 / 19.6 | 14.2 / 14.9 | 0.1821 / 0.1923 | 0.598 / 0.623 | 0.942 / 0.939 | 3282 / 3142 | 0.31 |
| Video-P2P + NeuEdit | 27.1 / 28.4 | 20.5 / 21.1 | 18.4 / 18.9 | 0.1312 / 0.1413 | 0.676 / 0.681 | 0.948 / 0.954 | 3182 / 3001 | 0.69 |

Table 2: Evaluations about edited videos based on UCF101 in terms of non-rigid/rigid type editing corresponding to textual alignment, fidelity to input video, frame consistency, and human preference. CLIP$^{\star}$: text-video clip score, CLIP$^{\dagger}$: image-image clip score.

Appendix D Proof for the closed form of KL divergence in reverse diffusion process

The reverse process of denoising diffusion probabilistic models approximates $q(x_{t-1}|x_t)$ using parameterized Gaussian transitions $p_{\theta}(x_{t-1}|x_t)=\mathcal{N}(x_{t-1};\mu_{\theta}(x_t,t),\sigma_{\theta}(x_t,t))$. Considering all $T$ parameterized transition steps, they are composed sequentially as given below:

$$p_{\theta}(X)=p_{\theta}(x_T)\prod_{t=1}^{T}p_{\theta}(x_{t-1}|x_t), \tag{10}$$

where we take $X=x_{0:T}$ and the process starts from the normal distribution $p(x_T)=\mathcal{N}(x_T;0,I)$. To optimize $p_{\theta}(X)$, the training objective is to maximize the log-likelihood $\log p_{\theta}(X)$, where we can also apply variational inference by maximizing the variational lower bound $-L_{VLB}$ as given below:

$$\begin{aligned} -L_{VLB} &= \log p_{\theta}(X)-D_{\textrm{KL}}(q(Z|X)\,||\,p_{\theta}(Z|X)) \\ &\leq \log p_{\theta}(X), \end{aligned} \tag{11}$$

where $D_{\textrm{KL}}$ is the Kullback-Leibler (KL) divergence and $Z$ is the latent variable, as obtained with the reparameterization trick in the variational auto-encoder. The distribution $q$ can be any distribution that is easy to handle. We use this inequality in the form $-\log p_{\theta}(X)\leq L_{VLB}$. $L_{VLB}$ can be expanded as $L_{VLB}=L_T+L_{T-1}+\cdots+L_0$, where the terms are defined for $1\leq t\leq T-1$ as given below:

$$\begin{aligned} L_T &= D_{\textrm{KL}}(q(x_T|x_0)\,||\,p_{\theta}(x_T)), \\ L_t &= D_{\textrm{KL}}(q(x_t|x_{t+1},x_0)\,||\,p_{\theta}(x_t|x_{t+1})), \\ L_0 &= -\log p_{\theta}(x_0|x_1). \end{aligned} \tag{12}$$

Therefore, the terms $L_t$ take the closed form of a KL divergence at each step $t$, and together with $L_0$ and $L_T$ they cover the whole range $0\leq t\leq T$.
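Since both arguments of each KL term are Gaussian, the divergence can be evaluated analytically. The sketch below illustrates this for scalar Gaussians and compares the closed form against a Monte Carlo estimate. It is an illustration under the assumption of scalar means and standard deviations; the function names are hypothetical and not part of the paper.

```python
import math
import random

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) for scalars."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)

def monte_carlo_kl(mu_q, sigma_q, mu_p, sigma_p, n=50_000, seed=0):
    """Estimate the same KL by averaging log q(x) - log p(x) over samples x ~ q."""
    rng = random.Random(seed)

    def log_pdf(x, mu, sigma):
        return (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
                - (x - mu) ** 2 / (2.0 * sigma ** 2))

    total = 0.0
    for _ in range(n):
        x = rng.gauss(mu_q, sigma_q)
        total += log_pdf(x, mu_q, sigma_q) - log_pdf(x, mu_p, sigma_p)
    return total / n
```

When both distributions share the same standard deviation $\sigma$, the closed form reduces to $(\mu_q-\mu_p)^2/(2\sigma^2)$, which is why fixed-variance diffusion objectives can be written as a squared difference of means.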

Appendix E DDIM sampling and DDIM inversion

To accelerate the reverse process of DDPM, the denoising diffusion implicit model (DDIM) [33] was proposed; it samples latent features with a small number of denoising steps as:

$$z_{t-1}=\sqrt{\frac{\alpha_{t-1}}{\alpha_t}}z_t+\sqrt{\alpha_{t-1}}\left(\sqrt{\frac{1}{\alpha_{t-1}}-1}-\sqrt{\frac{1}{\alpha_t}-1}\right)\epsilon. \tag{13}$$

We can also reverse this process to map the latent features back to latent noise, which gives the corresponding latent features as below:

$$z_{t+1}=\sqrt{\frac{\alpha_{t+1}}{\alpha_t}}z_t+\sqrt{\alpha_{t+1}}\left(\sqrt{\frac{1}{\alpha_{t+1}}-1}-\sqrt{\frac{1}{\alpha_t}-1}\right)\epsilon, \tag{14}$$

which is referred to as the DDIM inversion process. Starting the reverse process from the inverted latent maintains higher fidelity to the input than starting from Gaussian noise.
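A minimal sketch of deterministic DDIM sampling and inversion is given below, written in the standard form that first predicts the clean latent and then re-noises it to the target noise level. It assumes that $\alpha$ denotes the cumulative signal level in $(0, 1]$ and that the noise $\epsilon$ is supplied by the caller (in practice it comes from the denoising network). The function name and the list-based latents are hypothetical choices for illustration.

```python
import math

def ddim_step(z, eps, alpha_from, alpha_to):
    """Move latent z deterministically from noise level alpha_from to alpha_to.

    alpha_to > alpha_from corresponds to sampling (denoising);
    alpha_to < alpha_from corresponds to inversion (adding noise back).
    """
    # Predict the clean latent from the current latent and the noise.
    z0 = [(zi - math.sqrt(1.0 - alpha_from) * ei) / math.sqrt(alpha_from)
          for zi, ei in zip(z, eps)]
    # Re-noise the predicted clean latent to the target noise level.
    return [math.sqrt(alpha_to) * z0i + math.sqrt(1.0 - alpha_to) * ei
            for z0i, ei in zip(z0, eps)]

# Example: invert one step, then sample back with the same noise.
z_t = [0.3, -1.2, 0.8]
eps = [0.5, 0.1, -0.7]
alpha_t, alpha_next = 0.6, 0.4            # hypothetical schedule values
z_next = ddim_step(z_t, eps, alpha_t, alpha_next)   # inversion step
z_back = ddim_step(z_next, eps, alpha_next, alpha_t)  # sampling step
```

With the same $\epsilon$ in both directions the round trip recovers the starting latent up to floating-point error. In a real pipeline the network predicts a slightly different $\epsilon$ at each latent, so the inversion is only approximate.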

Figure 9: Sensitivity analysis of the score $s$ for selecting the editing factor for factor swapping in terms of textual alignment (i.e., CLIP score, PickScore) and fidelity (i.e., SSIM).

Appendix F Studies of textual semantic disentanglement

F.1 Sensitivity analysis of factor swapping

Factor swapping involves the binary deletion of a word according to a decision on the editing factor. Hence, the proper selection of editing factors contributes to an efficient neutral prompt. To explore this further, we investigate the optimal operating range of factor swapping by varying the threshold score $s$ used to decide the editing factor. Figure 9 depicts how the editing performance varies with the score $s$; the range $0.74<s<0.78$ is optimal for factor swapping to work effectively as a neutral prompt.
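The thresholding step can be illustrated with a short sketch. It assumes that each word comes with a precomputed editing factor score and that a word counts as an editing factor when its score exceeds $s$; the direction of the comparison, the example scores, and the function name are assumptions made for illustration. The default $s=0.76$ is simply the midpoint of the range reported above.

```python
DUMMY = "<DMY>"

def factor_swap(words, scores, s=0.76):
    """Replace every word whose editing factor score exceeds s with a dummy token."""
    return [DUMMY if score > s else word
            for word, score in zip(words, scores)]

# Hypothetical per-word scores for a target prompt.
words = ["a", "man", "is", "jumping"]
scores = [0.10, 0.42, 0.15, 0.91]
neutral_prompt = factor_swap(words, scores)
```

A lower threshold removes more words and pushes the prompt further from the target, while a higher threshold keeps more of the editing-related words in the neutral prompt.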

F.2 Ablation studies about factor deforming

We introduce further empirical methods performed for textual semantic disentanglement. In the main paper, we employ factor deforming $\mathbf{w}_n=\mathbf{z}_{\mathcal{T}}\circ h(\mathbf{w})+(1-\mathbf{z}_{\mathcal{T}})\circ\mathbf{w}$ with the deformable operation of down-scaling as below:

$$h(\mathbf{w})=\alpha\mathbf{w}\in\mathbb{R}^{M\times d}, \tag{15}$$

where $0\leq\alpha<1$ is the down-scaler. Our studies also consider other deformable operations, defined as (1) deformable swapping and (2) factor blurring. Detailed explanations are presented in the following.
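Factor deforming with down-scaling can be sketched as follows, assuming the prompt features are given as $M$ rows of $d$ values and that $\mathbf{z}_{\mathcal{T}}$ holds one editing factor score in $[0, 1]$ per word, broadcast over the feature dimension. The shape of the scores and the function name are assumptions for illustration.

```python
def factor_deform(w, z, alpha=0.5):
    """Compute w_n = z * h(w) + (1 - z) * w with h(w) = alpha * w, row by row.

    w: list of M word features, each a list of d floats.
    z: list of M editing factor scores in [0, 1], one per word.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must satisfy 0 <= alpha < 1")
    return [[zi * alpha * x + (1.0 - zi) * x for x in row]
            for row, zi in zip(w, z)]

# Two words with d = 2; only the second word is an editing factor.
w = [[1.0, 2.0], [3.0, 4.0]]
z = [0.0, 1.0]
w_n = factor_deform(w, z, alpha=0.5)
```

Each row ends up scaled by $1-z(1-\alpha)$, so words with a score near zero are left untouched and words with a score near one are shrunk by the full factor $\alpha$.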

Deformable swapping

Deformable swapping is an extended version of factor swapping. Factor swapping transforms all words identified as editing factors into a unified dummy token <DMY>. The limitation of this approach is that all modified dummy tokens become indistinguishable from one another. Therefore, we integrated the format of factor deforming with factor swapping and adaptively changed the magnitudes of the dummy token features according to the editing factor score as given below:

$$\mathbf{w}_n=\mathbf{z}_{\mathcal{T}}\circ\mathbf{w}_n^{\textrm{swp}}+(1-\mathbf{z}_{\mathcal{T}})\circ\mathbf{w}, \tag{16}$$

where $\mathbf{w}_n^{\textrm{swp}}$ is the text feature obtained by factor swapping. This format imparts a distinguishable influence to each dummy token feature based on its editing factor score.

Factor blurring

Another attempt is factor blurring. Similar to the factor deforming in visual semantic disentanglement, we blur the features related to editing factors. Hence, rather than employing feature down-scaling, we adopt an alternative approach that introduces an additional $d$-dimensional noise feature to deform the target prompt features as given below:

$$\mathbf{w}_n=\mathbf{z}_{\mathcal{T}}\circ(\mathbf{w}+\epsilon)+(1-\mathbf{z}_{\mathcal{T}})\circ\mathbf{w}, \tag{17}$$

where $\epsilon\sim\mathcal{N}(0,1)$ is the noise feature added to the target prompt features. It deforms the prompt features corresponding to editing factors.
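The two alternative operations, deformable swapping (Eq. 16) and factor blurring (Eq. 17), can be sketched side by side. The sketch assumes one editing factor score per word, approximates the swapped feature $\mathbf{w}_n^{\textrm{swp}}$ by replacing the rows of editing-factor words (score above a threshold $s$) with a fixed dummy token feature, and draws independent standard normal noise per feature entry. These choices, the threshold, and the function names are hypothetical.

```python
import random

def deformable_swap(w, z, dummy, s=0.76):
    """Eq. (16): blend the swapped features with the originals by score."""
    # Rows of editing-factor words are replaced by the dummy token feature.
    w_swp = [list(dummy) if zi > s else list(row) for row, zi in zip(w, z)]
    return [[zi * a + (1.0 - zi) * b for a, b in zip(swp_row, row)]
            for swp_row, row, zi in zip(w_swp, w, z)]

def factor_blur(w, z, seed=0):
    """Eq. (17): add Gaussian noise to the features, weighted by score."""
    rng = random.Random(seed)
    return [[zi * (x + rng.gauss(0.0, 1.0)) + (1.0 - zi) * x for x in row]
            for row, zi in zip(w, z)]

w = [[1.0, 1.0], [4.0, 2.0]]
swapped = deformable_swap(w, [0.2, 0.8], dummy=[0.0, 0.0])
blurred = factor_blur(w, [0.0, 1.0])
```

In deformable swapping, two dummy tokens with different scores receive different magnitudes and thus stay distinguishable, whereas factor blurring keeps the original feature and perturbs it in proportion to the score.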

| | Alignment | Fidelity | Consistency |
| Method | CLIP$^{\star}$ | SSIM ↑ | CLIP$^{\dagger}$ |
|---|---|---|---|
| factor deforming (M) | 28.6 | 0.730 | 0.969 |
| factor blurring (A) | 27.7 | 0.697 | 0.948 |
| factor swapping (M) | 27.5 | 0.707 | 0.951 |
| deformable swapping (A) | 28.1 | 0.711 | 0.961 |

Table 3: Ablation study of different methods of factor deforming for textual semantic disentanglement. M: method in the main paper, A: method in the Appendix.

Table 3 presents ablation studies of the factor deforming used for textual semantic disentanglement. The first row is the original factor deforming with feature down-scaling from the main paper. The second row is the factor blurring method, which was less effective than the other methods. We consider that although blurring is effective in visual semantic disentanglement, in the case of words it seems difficult for the network to obtain meaningful information from blurred features because of the discrete nature of words. The third and fourth rows are factor swapping and deformable swapping. Building distinguishable features among the dummy token features enhances the editing performance of deformable swapping. This demonstrates that the editing factor scores properly contain information about how effective each factor is in performing the editing.

Figure 10: Application of NeuEdit to image editing for non-rigid editing (e.g., pose variations).

Appendix G Further qualitative results

Application to image editing

The NeuEdit framework can be structurally applied to any diffusion-based editing system, so we extend our work to image editing. Figure 10 shows image editing results under the NeuEdit framework. Notably, non-rigid editing is also successfully applied based on the input image.

Further qualitative results

In the following, we showcase our qualitative editing results in terms of non-rigid and rigid editing. All the videos in our experiments are taken from the publicly available sources in [22, 35, 7, 38].

Figure 11: Illustration of non-rigid editing including motion change.
Figure 12: Illustration of rigid editing about object overlay.
Figure 13: Illustration of rigid editing about style transfer.
Figure 14: Failure case by editing bias. The scene is unintentionally changed into a snow background correlated with riding a snowboard.