We introduce TMED, a text-driven motion editing diffusion model. Given a short 3D human motion, a textual instruction describing a modification, and a noise vector to enable randomness, the model generates an edited motion. Similar tasks have been addressed in the image domain for text-based image editing [Brooks et al. 2023], from which we take inspiration for our model design. We further build on the motion diffusion model (MDM) [Tevet et al. 2023], which takes only text as input and generates a motion. In contrast, our model is additionally conditioned on the source motion and therefore requires a different training dataset (as described in Section 3). In the following, we present the components of our TMED model.
4.1 3D Human Motion Representation
We use a sequence of SMPL [Loper et al. 2015] body parameters to represent a human motion. SMPL is a linear function that maps the shape and pose parameters of \(J\) joints, along with the global body translation and orientation, to a 3D mesh. The joint positions \(J_p\) can be obtained from the vertices via the learned SMPL joint regressor. Following previous work [Petrovich et al. 2022] that discards the shape parameters, we set them to zero (mean shape), since motion is primarily parameterized by the pose parameters.
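For illustration, a minimal sketch of how SMPL parameters map to a mesh and joint positions, assuming the smplx Python package and a placeholder model path (neither is specified by this section):

```python
# A minimal sketch (not the authors' code): obtaining vertices and joint
# positions from SMPL parameters with the smplx package. The model path is
# a placeholder and must point to a downloaded SMPL model file.
import torch
import smplx

body_model = smplx.create("models/", model_type="smpl")  # placeholder path

T = 60  # number of frames in the motion
output = body_model(
    betas=torch.zeros(T, 10),          # shape parameters set to zero (mean shape)
    global_orient=torch.zeros(T, 3),   # global root orientation (axis-angle)
    body_pose=torch.zeros(T, 23 * 3),  # per-joint pose parameters (axis-angle)
    transl=torch.zeros(T, 3),          # global body translation
)
vertices = output.vertices  # (T, 6890, 3) mesh vertices
joints = output.joints      # (T, num_joints, 3) joint positions from the learned joint regressor
```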
Various alternative representations have been used based on joint positions with respect to the local coordinate system of the body [Guo et al. 2022a; Holden et al. 2016; Starke et al. 2019]. Unlike prior works [Guo et al. 2022a; Tevet et al. 2023; Zhang et al. 2023a] that fit SMPL bodies to skeleton generations, we aim to enable direct regression of SMPL parameters, bypassing the need for a costly post-processing optimization [Bogo et al. 2016] and thus making our method directly usable in animation frameworks.
A common approach for representing SMPL pose parameters within a learning framework is to employ 6D rotations [Zhou et al. 2019] and to apply first-frame canonicalization to the motions [Athanasiou et al. 2022; Petrovich et al. 2022]. Similarly, we canonicalize our motions prior to training, so that they all face the same direction in the first frame and share the same initial global position. Inspired by [Holden et al. 2016; Petrovich et al. 2024], we represent the global body translation as differences between consecutive frames. Supervising with such relative translations helps the denoiser generate better trajectories; we observed non-smooth generations when using absolute translations. Similar to STMC [Petrovich et al. 2024], we factor out the \(z\)-rotation from the pelvis orientation and represent the global orientation separately as the \(xy\)-orientation and the \(z\)-orientation, the latter expressed as differences between rotations in consecutive frames (resulting in 12 features, i.e., a 6D representation each for \(xy\) and \(z\)). We represent the body pose with 6D rotations [Zhou et al. 2019]. Similar to [Petrovich et al. 2022], we exclude the hand joints, as they mostly do not move in the datasets we use. We additionally append the local joint positions after removing the \(z\)-rotation of the body [Holden et al. 2016; Petrovich et al. 2024] (resulting in 192-dimensional features: 6 × 21 for rotations and 22 × 3 for joints, including the root joint). Thus, each motion frame has dimension \(d_p = 207\), consisting of 3 features for the global translation, 12 for the global orientation, and 192 for the body pose. A motion is represented as a sequence of these pose representations. During training, all features are normalized by their mean and variance computed over the training set.
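To make the feature layout concrete, below is a minimal sketch of assembling the per-frame representation; the function name, argument names, and input shapes are illustrative assumptions, not the authors' code:

```python
import numpy as np

def build_frame_features(transl, rot_xy_6d, rot_z_6d, body_pose_6d, joints_local):
    """Assemble the d_p = 207 per-frame features described above.

    transl:        (T, 3)     global translation (differences between consecutive frames)
    rot_xy_6d:     (T, 6)     6D xy-orientation of the pelvis
    rot_z_6d:      (T, 6)     6D z-orientation (differences between consecutive frames)
    body_pose_6d:  (T, 21, 6) 6D rotations for 21 body joints (hands excluded)
    joints_local:  (T, 22, 3) local joint positions (including the root joint)
    """
    T = transl.shape[0]
    features = np.concatenate(
        [
            transl,                           # 3  features
            rot_xy_6d,                        # 6  features
            rot_z_6d,                         # 6  features
            body_pose_6d.reshape(T, 21 * 6),  # 126 features
            joints_local.reshape(T, 22 * 3),  # 66 features
        ],
        axis=-1,
    )
    assert features.shape[-1] == 207  # d_p
    return features
```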
4.2 Conditional Diffusion Model
To learn TMED, we use our new training data, where each data sample comprises a source motion \(M_S\), a target motion \(M_T\), and a language instruction \(L\). We train a conditional diffusion model that learns to edit the source motion with respect to the instruction. We design a model similar to that of InstructPix2Pix [Brooks et al. 2023], where the generation from a random noise vector is conditioned on two further inputs, \(L\) and \(M_S\). Here, instead of a sequence of image patch tokens, the motion modality is represented as a variable-length sequence of motion frames. The noised target motion, the text condition \(L\), and the source motion condition \(M_S\) are all fed as input to the denoiser at every diffusion step.
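To make the data format concrete, a minimal sketch of one training sample as a data structure follows; the field names and the example instruction are illustrative, not taken from the dataset:

```python
from dataclasses import dataclass
import torch

@dataclass
class MotionEditSample:
    """One training sample: edit a source motion according to an instruction."""
    source_motion: torch.Tensor  # (F_S, d_p) source motion M_S, variable length
    target_motion: torch.Tensor  # (F_T, d_p) target motion M_T, variable length
    instruction: str             # language instruction L, e.g., "raise the right arm higher"
```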
Diffusion models [Sohl-Dickstein et al. 2015] learn to gradually turn random noise into a sample from a data distribution through a sequence of denoising autoencoders. This is achieved by a diffusion process that adds noise \(\epsilon^{t}\) to an input signal \(M_T\). We denote the noise level added to the input signal by using the diffusion timestep \(t\) as a superscript; this produces a diffused sample \(M^{t}_{T}\). The amount of noise added at timestep \(t = 1, \ldots, N\) is defined a priori through a noise schedule. We train a denoiser network to reverse this process given the timestep \(t\), the instruction \(L\), the noised target motion \(M^{t}_{T}\), and the source motion \(M_S\). As supervision, the output of the denoiser network, \(\tilde{M}^{t}_{T}\), is compared against the ground-truth target motion \(M_T\). Our model is therefore trained to minimize a standard mean-squared-error loss between the diffusion output and the ground-truth target motion. We choose to predict the denoised target motion rather than the noise itself, as we found this to produce visually better results.
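Written out, a sketch of this objective under the standard DDPM formulation (the noise-schedule notation \(\bar{\alpha}^{t}\) and the denoiser symbol \(D\) are introduced here for illustration):

\[
M^{t}_{T} = \sqrt{\bar{\alpha}^{t}}\, M_T + \sqrt{1-\bar{\alpha}^{t}}\, \epsilon^{t}, \qquad \epsilon^{t} \sim \mathcal{N}(0, I),
\]
\[
\mathcal{L} = \mathbb{E}_{t,\,\epsilon^{t}} \Big[ \big\| \tilde{M}^{t}_{T} - M_T \big\|_2^2 \Big], \qquad \tilde{M}^{t}_{T} = D\big(M^{t}_{T},\, t,\, L,\, M_S\big).
\]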
The architecture overview is illustrated in Figure 3 (left). Our model consists of encoders for each input modality (\(E_T\) for the timestep, \(E_L\) for the text, and \(E_M\) for the motion) and a transformer encoder \(D\) that operates on all inputs. The timestep \(t\) is encoded via \(E_T\) similarly to MDM [Tevet et al. 2023]: it is first converted into a sinusoidal positional embedding and then projected through a feed-forward network (two linear layers with a SiLU activation [Elfwing et al. 2018] in between). As in [Tevet et al. 2023], we use the CLIP [Radford et al. 2021] text encoder for \(E_L\). We pass the source and noised target motions through a linear layer (\(E_M\)), shared across frames, and obtain \(M^{enc}_{S}=E_M(M_S)\) and \({M^{t}_{T}}^{enc}=E_M(M^{t}_{T})\). Given the variable duration of the source and target motions, we add a learnable separation token SEP between them [Devlin et al. 2019] when concatenating them, so that the transformer is informed where the target motion ends and the source motion starts. Once all encoded inputs have the same feature dimensionality \(d\), they are combined into a single sequence that is fed to the transformer, as shown in Figure 3, and sinusoidal positional embeddings are subsequently added. During training, to enable classifier-free guidance, we randomly drop the source motion condition 5% of the time, the text condition 5% of the time, and both conditions together 5% of the time; all inputs are used the remaining 85% of the time. To sample from a diffusion model with two conditions, we apply classifier-free guidance with respect to both: the input motion \(M_S\) and the text instruction \(L\). We introduce separate guidance scales, \(s_{M_S}\) and \(s_L\), that allow adjusting the influence of each condition.
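A minimal PyTorch-style sketch of how the conditioning tokens could be assembled and randomly dropped during training; the token ordering, the zeroing used to drop a condition, and the tensor names are illustrative assumptions, not the authors' implementation:

```python
import random
import torch

def build_transformer_input(t_emb, text_emb, src_motion_enc, tgt_motion_enc, sep_token,
                            p_drop=0.05, training=True):
    """Combine the encoded inputs into a single sequence of dimensionality d.

    t_emb:          (1, d)    encoded diffusion timestep E_T(t)
    text_emb:       (77, d)   CLIP token embeddings E_L(L)
    src_motion_enc: (F_S, d)  encoded source motion E_M(M_S)
    tgt_motion_enc: (F_T, d)  encoded noised target motion E_M(M_T^t)
    sep_token:      (1, d)    learnable separation token SEP
    """
    if training:
        r = random.random()
        if r < p_drop:                      # 5%: drop the source motion condition
            src_motion_enc = torch.zeros_like(src_motion_enc)
        elif r < 2 * p_drop:                # 5%: drop the text condition
            text_emb = torch.zeros_like(text_emb)
        elif r < 3 * p_drop:                # 5%: drop both conditions
            src_motion_enc = torch.zeros_like(src_motion_enc)
            text_emb = torch.zeros_like(text_emb)
        # otherwise (85% of the time) keep all inputs

    # Single sequence: timestep, text, noised target, SEP, source motion.
    seq = torch.cat([t_emb, text_emb, tgt_motion_enc, sep_token, src_motion_enc], dim=0)
    return seq  # sinusoidal positional embeddings are added afterwards
```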
For simplicity, we now abuse notation by dropping the timestep superscripts when deriving the sampling process. Our generative model, TMED, learns the probability distribution over target motions \(M_T\) conditioned on the source motion and the text condition, \(P(M_T \mid M_S, L)\). Expanding this conditional probability and, as in the original diffusion formulation, casting it as a score-function optimization problem by taking its logarithm and differentiating with respect to the input gives the score estimate \(\tilde{e}_\theta (M_T, s_{M_S}, s_L)\), learned under classifier-free guidance. We sample from TMED using this modified score estimate for two-way conditioning of the diffusion model.
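A sketch of the resulting two-condition classifier-free guidance rule, following the form used in InstructPix2Pix [Brooks et al. 2023] with our conditions substituted in (\(\emptyset\) denotes a dropped condition; the exact notation of the original derivation may differ):

\[
\tilde{e}_\theta(M_T, s_{M_S}, s_L) = e_\theta(M_T, \emptyset, \emptyset)
+ s_{M_S}\,\big(e_\theta(M_T, M_S, \emptyset) - e_\theta(M_T, \emptyset, \emptyset)\big)
+ s_L\,\big(e_\theta(M_T, M_S, L) - e_\theta(M_T, M_S, \emptyset)\big).
\]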
We further ablate the guidance scales, which control the generation at test time, in Section 5.
Implementation details. All models are trained for 1000 epochs using a cosine noise schedule with the DDPM scheduler [Ho et al. 2020]. We use \(N = 300\) diffusion timesteps, as we find this to be a good compromise between speed and quality. The guidance scales are chosen for each model based on their best performance on the validation set of MotionFix (\(s_L = 2\), \(s_{M_S} = 2\)). We follow the same process for training the MDM [Tevet et al. 2023] baselines described in the next section. In terms of architectural details, the dimensionality of the embeddings before they are input to the transformer is \(d = 512\). We use a pre-trained and frozen CLIP [Radford et al. 2021] model with all 77 token outputs of the ViT-B/32 backbone [Dosovitskiy et al. 2021] as our text encoder \(E_L\). We use the text masks from \(E_L\) to mask the padded portion of the text inputs. The motion encoder \(E_M\) that precedes the transformer is a simple linear projection of dimensionality \(d_p \times d\), where the feature dimension of each motion frame is \(d_p = 207\) (as described in Section 4.1).
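For reference, a sketch of an equivalent noise-schedule configuration using the Hugging Face diffusers library (the choice of library is an assumption; the paper does not state which implementation is used):

```python
# A sketch of a matching scheduler configuration with Hugging Face diffusers
# (an assumption about tooling; any DDPM implementation with a cosine schedule
# and x0-prediction would do).
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(
    num_train_timesteps=300,            # N = 300 diffusion timesteps
    beta_schedule="squaredcos_cap_v2",  # cosine noise schedule
    prediction_type="sample",           # predict the denoised target motion, not the noise
)
```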