DOI: 10.1145/3680528.3687559
SIGGRAPH Asia Conference Proceedings
Research article | Open access

MotionFix: Text-Driven 3D Human Motion Editing

Published: 03 December 2024

Abstract

The focus of this paper is 3D motion editing. Given a 3D human motion and a textual description of the desired modification, our goal is to generate an edited motion as described by the text. The challenges include the lack of training data and the design of a model that faithfully edits the source motion. In this paper, we address both these challenges. We build a methodology to semi-automatically collect a dataset of triplets in the form of (i) a source motion, (ii) a target motion, and (iii) an edit text, and create the new MotionFix dataset. Having access to such data allows us to train a conditional diffusion model, TMED, that takes both the source motion and the edit text as input. We further build various baselines trained only on datasets of text-motion pairs, and show the superior performance of our model trained on triplets. We introduce new retrieval-based metrics for motion editing, and establish a new benchmark on the evaluation set of MotionFix. Our results are encouraging, paving the way for further research on fine-grained motion generation. Code, models and data are available at our project website.

1 Introduction

Human motion control is an essential component of the animation pipeline, and involves creating a motion, as well as editing it until the motion matches the desired outcome. Text descriptions have emerged as one of the prominent ways to control motion generation [Guo et al. 2022a; Petrovich et al. 2022; Tevet et al. 2023; Zhang et al. 2023a]. However, due to the inherent ambiguity in high-level language instructions, the resulting generation may not necessarily correspond to the motion one has in mind. An animator may then need to edit the motion further. Motion editing is non-trivial, arguably more complex than static pose editing, and may involve multiple types of instructions, such as changing the speed of a motion, modifying the repetitions for cyclic actions, adjusting the posture of a particular body part, or modifying a certain temporal segment of a motion. In this work, given an initial source motion and an edit description, we aim to generate a new motion that follows the source motion and edits it according to the text instruction.
Fig. 1: Our text-driven motion diffusion model (TMED) enables 3D human motion editing from natural language descriptions. To train this model, we introduce a semi-automatically collected dataset MotionFix that contains diverse types of editing such as modifying body parts, changing certain moments of the motion, or editing the speed or the style.
There are existing approaches that can modify body coordinates [Karunratanakul et al. 2023] and lower/upper limbs [Tevet et al. 2023; Zhang et al. 2023a]. However, they require manual selection of the body parts, which prevents edits beyond local modifications, such as changing the speed of the overall action. On the other hand, some methods have been proposed to add more fine-grained control to text-to-motion generation; examples include temporal [Athanasiou et al. 2022; Zhang et al. 2023c] and spatial [Athanasiou et al. 2023; Goel et al. 2024; Zou et al. 2024] compositions. These methods can be repurposed for editing, but would be limited to the addition or subtraction of actions. Consequently, all these works are limited to specific types of edits. Instead, our work considers unrestricted edits described by language instructions. To this end, we collect a manually annotated dataset, MotionFix, that supports training generative models for the task of text-driven motion editing.
Constructing a dataset for 3D human motion editing is non-trivial. In contrast to text-based image editing, where methods such as InstructPix2Pix [Brooks et al. 2023] exploit large text-to-image generation models [Rombach et al. 2022] to automatically create training data, there exists no 3D motion generation model that generalizes faithfully to unrestricted text inputs. Moreover, dynamic edits can be more complex than static image edits. In fact, PoseFix [Delmas et al. 2023] is a successful example of a 3D human body pose editing dataset; however, the differences between static poses are mapped to text in a rule-based manner using joint distances, which would not be applicable to dynamic motions. In this work, we take a different route and mine existing motion capture (MoCap) datasets to find suitable motion pairs automatically, for which the differences are then manually described by typing text. By not relying on a generative model, we ensure motion quality; and by annotating the text (which is relatively fast), we achieve unrestricted edits.
The key challenge in our semi-automatic data curation pipeline is how to find motion pairs that are similar enough for a meaningful and concise edit text to describe the difference. The difference between the source and target motions should not be too large; otherwise, the annotator would need to type an overly complex text or describe the target motion entirely, ignoring the source. The motions should have similarities for the annotator to potentially make reference to the source. Our solution is to employ the recent TMR motion embedding space [Petrovich et al. 2023] that effectively captures semantics, as well as sufficient details for the body dynamics, thanks to its contrastively and generatively trained motion encoder. To form our candidate pairs for annotation, for each motion in a large MoCap collection [Mahmood et al. 2019], we retrieve the top-ranked motions according to their embedding similarity. Using crowdsourcing, we collect textual annotations for these pairs. The resulting dataset, MotionFix, is the first text-based motion editing dataset, which contains different types of edits as can be seen in Figure 1 (top). Some edits involve a specific body part (e.g., the hand should “make the circular motion a bit wider”), others alter the overall body dynamics (e.g., “make the circle bigger and walk faster” when walking in a circle).
MotionFix enables both training and benchmarking for this new task. We design and train a Text-based Motion Editing Diffusion model, TMED, that is conditioned on both the source motion and the edit text. The results of our TMED model are encouraging as shown in Figure 1 (bottom), generating different types of edits. For example, the model can edit the overall spatial coordinates of a motion (“do a left side tilt instead of back one”), the way a motion is performed (“sit straight, don’t get up”), parts of the body (“do it with the hand raised lower”), or the speed of a motion (“faster”).
To benchmark our model and compare against baselines, we introduce new metrics on the evaluation set of MotionFix. Following the commonly adopted retrieval-based metrics in text-to-motion generation benchmarks [Guo et al. 2022a], we perform motion-to-motion retrieval and check how often the ground-truth target motion is in the top ranks. We also report the ranking of the source motion to evaluate the proximity to the source. While this metric should not be too high – otherwise there would be no edit – it gives an intuition on whether the generated motion deviates too much from the source. Our experiments demonstrate that our conditional model trained on triplets generates motions that are closer to the target, compared to strong baselines we build on top of state-of-the-art text-to-motion generation methods, which have only access to text-motion pairs for training.
Our contributions are the following: (i) We introduce MotionFix, the first language-based motion editing dataset, that provides motion-motion-text triplets annotated through our semi-automatic data collection methodology. This dataset allows both for training and benchmarking for this new task. (ii) We introduce several baselines based on text-to-motion generation, together with edit-relevant body parts detection using language models. While our baselines achieve promising results, we show that models trained on text-motion pairs fall behind those trained on our triplets. (iii) We propose TMED, a diffusion-based model for motion editing given language instructions. We demonstrate both qualitatively and quantitatively that TMED outperforms all the baselines.

2 Related Work

In the following, we briefly overview relevant works on motion generation, editing, and datasets.
3D motion generation from text. In contrast to the relatively mature areas of text-to-image generation [Rombach et al. 2022] and text-based image editing [Brooks et al. 2023], language-based 3D human motion generation is in its infancy. Initial work employs VAEs [Kingma and Welling 2014] with action label conditioning using a small set of categories [Guo et al. 2020; Petrovich et al. 2021]. With the introduction of recent text-motion datasets [Guo et al. 2022a; Lin et al. 2023; Punnakkal et al. 2021], there has been increased interest in conditioning the generation on free-form language inputs [Guo et al. 2024; 2022a; 2022b; Petrovich et al. 2022; Tevet et al. 2022; Uchida et al. 2024; Zou et al. 2024]. Recently, diffusion models [Ho et al. 2020] have been successfully integrated [Shafir et al. 2024; Tevet et al. 2023; Wan et al. 2023; Xie et al. 2024; Zhang et al. 2023a], producing state-of-the-art results in text-to-motion generation.
Several works focus on increased controllability in motion generation, going beyond a single textual input. Examples include enabling temporal compositionality (a series of motions) [Athanasiou et al. 2022; Lee et al. 2022; Shafir et al. 2024], spatial compositionality (simultaneous motions) [Athanasiou et al. 2023; Zhang et al. 2023a], and a unified framework of timeline control [Petrovich et al. 2024]. Diffusion-based models were shown to be suitable for fine-grained local control such as joint trajectories [Karunratanakul et al. 2023; Shafir et al. 2024; Xie et al. 2024] or keyframes [Cohan et al. 2024]. Our work is similar in spirit in terms of providing more control to users; however, in contrast to the above, our focus is motion editing using language.
Language-based human body editing. There are numerous traditional methods for editing 3D humans to generate movies [Catmull 1972], imposing space or time constraints [Cohen 1992], interactively generating motions using procedural animation [Lee and Shin 1999; Perlin 1995] or physics-based approaches [Popović and Witkin 1999]. Recent work using language can be grouped into pose [Delmas et al. 2023; Kim et al. 2021] or motion [Fieraru et al. 2021; Goel et al. 2024] editing. In FixMyPose [Kim et al. 2021], the focus is on editing athletic human poses in synthetic images. In [Delmas et al. 2023], a text-based 3D human pose editing method is developed, enabled through the collection of the PoseFix dataset containing language descriptions of differences between pairs of poses. PoseFix builds on the previous work of PoseScript [Delmas et al. 2022], where a dataset of pose descriptions is automatically collected through a rule-based approach. Unlike PoseFix, which concentrates on static poses, our MotionFix dataset involves dynamic motions, where the space of possible edits is much larger, necessitating a different approach to data collection.
In terms of dynamic bodies, current motion editing approaches can be separated into three categories: (a) Style-based editing or motion style transfer exploits datasets that contain a small set of style labels such as ‘angry’ and ‘old’ [Aberman et al. 2020; Kobayashi et al. 2023; Mason et al. 2022]. This line of work focuses mostly on copying the style of one motion onto another, typically performing the same action. (b) Part-based editing considers selecting a subset of the body. MDM [Tevet et al. 2023] shows the potential of diffusion models to edit the upper/lower body by text-conditioned motion inpainting. Similarly, MotionDiffuse [Zhang et al. 2023a] and FLAME [Kim et al. 2023] manually specify body parts to edit them with text. More recently, CoMo [Huang et al. 2024] and FineMoGen [Zhang et al. 2023b] use LLMs to produce edit texts and demonstrate promising results for part editing. (c) Among heuristic-based approaches [Fieraru et al. 2021; Goel et al. 2024], AIFit [Fieraru et al. 2021] can edit domain-specific exercise poses from a pre-defined grammar and is focused on a limited set of cyclic motions from their Fit3D dataset. Iterative Motion Editing [Goel et al. 2024] relies on captioned source motions, which are passed through an LLM along with a pre-defined set of ‘Motion Editing Operators’ (MEOs), to detect which joints and frames should be edited. A pre-trained diffusion model is then used to infill these locations. In contrast with prior work, we do not focus on a specific type of motion editing or any heuristics. Our MotionFix contains diverse edits, as can be seen in Figure 1 (top) and Figure 2. Closest to ours is the concurrent work of [Goel et al. 2024]; however, as mentioned above, their approach is not fully automatic, as it requires a captioned source motion. Their keyframe selection heuristic further limits the applicability to certain edit types. Moreover, since open-source code was unavailable at the time of writing, we do not provide comparisons in this work.
Dataset | #motions | vocab. | label type
KIT-ML [Plappert et al. 2016] | 3911 | 1623 | motion description
BABEL [Punnakkal et al. 2021] | 10881 | 1347 | motion description, action
HumanML3D [Guo et al. 2022a] | 14616 | 5371 | motion description
PoseFix [Delmas et al. 2023] | 6157 x 2 | 1068 | pose editing
MotionFix (ours) | 6730 x 2 | 1479 | motion editing
Table 1: Comparison with existing datasets: MotionFix is the first dataset supporting the task of text-based motion editing.
3D human & language datasets. The progress in controlling 3D humans with text has been driven by new datasets that pair 3D humans with language descriptions. In Table 1, we summarize the three most popular motion description datasets. KIT [Plappert et al. 2016] is the first source of such data containing textual annotations for ∼ 4k motion sequences. To obtain sufficient data to train deep networks, AMASS [Mahmood et al. 2019] unifies several MoCap collections in the SMPL body format [Loper et al. 2015]; however, it lacks language descriptions. This is addressed by BABEL [Punnakkal et al. 2021] and HumanML3D [Guo et al. 2022a], which concurrently collect semantic annotations in the form of action categories and/or textual descriptions. While KIT, BABEL, HumanML3D enable text-to-motion generation training, they do not support editing. On the other hand, as can be next seen in Table 1, PoseFix [Delmas et al. 2023] provides pose editing triplets, but does not support motion editing. Our MotionFix dataset supports motion editing training, while being at a similar scale to PoseFix in terms of the number of triplets and the vocabulary of edit texts.

3 The New MotionFix Dataset

Fig. 2: Dataset samples: We display source motions (red) overlaid with target motions (green) from our MotionFix dataset, together with their corresponding text annotations.
Appropriate training data for text-based motion editing would be in the form of triplets: source motions, target motions and edit texts. As discussed in Section 1, a big challenge in motion editing from language instructions is the lack of training data. To overcome this challenge, we design a semi-automatic data creation methodology. We first automatically construct candidate motion pairs that are similar (and different) enough, so the edit can potentially be described by language in simple words. We then ask annotators to manually type the edit text. In the following, we detail our procedure.
We make use of a motion embedding space to find motion pairs that are similar. Specifically, we employ the recent text-to-motion retrieval model TMR [Petrovich et al. 2023]. TMR is trained with a contrastive loss on the latent space of motions and texts, and reports state-of-the-art results for text-motion retrieval. We observe that such a model, by design, produces latent motion representations that, for a given motion, rank the semantically close ones nearby in the embedding space. We then use TMR to perform motion-to-motion retrieval. This is similar in spirit to using CLIP [Radford et al. 2021] for image similarity. We construct our dataset by finding such motion pairs from the AMASS MoCap collection [Mahmood et al. 2019].
For each motion in AMASS, we first extract TMR motion embeddings with sliding windows of 3 to 5 seconds. In a preliminary analysis, we found that using longer motion pairs reduces the probability of finding good candidates that differ by simple edits, while using shorter ones usually yields motion pairs that have a high probability of being almost identical. Then, we compute the pairwise embedding similarities and filter out all motion pairs with similarity ≥ 0.99 to avoid identical motions. We extract the top-2 most similar motions for a given motion and include these pairs in the annotation pool. We experimented with thresholding instead of a top-k selection approach, but the TMR feature similarity is not well calibrated across motion pairs, which makes finding a constant threshold difficult. Finally, we align each motion pair to have the same initial translation and global orientation around the gravity axis, to avoid labeling redundant edits that could be trivially created by changing the initial body translation and orientation.
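A minimal sketch of this mining step is given below, assuming pre-computed, L2-normalized TMR embeddings for all motion windows; the function and variable names are illustrative rather than the released code.

```python
import numpy as np

def mine_candidate_pairs(embeddings, sim_max=0.99, top_k=2):
    """Select candidate (source, target) pairs for annotation from
    L2-normalized TMR motion embeddings of shape (num_windows, dim)."""
    sim = embeddings @ embeddings.T            # cosine similarities (N, N)
    np.fill_diagonal(sim, -np.inf)             # ignore self-matches
    sim[sim >= sim_max] = -np.inf              # drop (near-)identical motions

    pairs = []
    for i in range(sim.shape[0]):
        # Top-k most similar remaining motions for motion window i.
        best = np.argsort(-sim[i])[:top_k]
        pairs.extend((i, int(j)) for j in best if np.isfinite(sim[i, j]))
    return pairs
```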
Once we curate a list of candidate motion pairs, we give them to annotators from Amazon Mechanical Turk (AMT). In the annotation interface, along with the instructions, we give representative examples with multiple plausible edit texts for each motion pair. We allow the option to skip motion pairs if they are too similar (no difference to describe), or if they are too different (no easy way to describe the difference). Quantitatively, 7% and 55% of the pairs were considered too similar or too different, respectively. For the remaining pairs, we found that the majority are suitable candidates for editing. Our annotation interface is shown in the supplementary material (sup. mat.).
We performed several rounds of data collection. After the initial round, we observed that some annotators tend to overanalyze the edit, which results in describing the target motion alone or in overcomplicated edits. Hence, after computing statistics on a manually curated set of good annotations, we started encouraging the annotators to keep their edit texts around 3–12 words, and no longer than 15 words. We explicitly request the annotators to refer to the source motion and encourage them to use words indicative of edit texts, e.g., “instead”, “higher/lower”, “same/opposite”.
The resulting MotionFix dataset contains 6730 triplets of source-target motions and text annotations. We partition the data into train/validation/test splits randomly with 80%/5%/15% ratios, and obtain 5387/330/1013 triplets for each split, respectively. As shown in Table 1, in contrast to previous motion description datasets that provide text-motion pairs [Guo et al. 2022a; Plappert et al. 2016; Punnakkal et al. 2021], our dataset enables training for motion editing by also including a source motion. MotionFix is similar in spirit to PoseFix [Delmas et al. 2023], but our labels describe the difference between dynamic motions, as opposed to static poses. Our dataset involves unrestricted edits, leading to different edit types such as spatial edits (“throw from higher”), temporal subtraction of actions (“start standing not bent down”), a mixture of both (“bend down a bit more, stand up faster”), and repetitions with adjustments of the whole body motion (“do one more repetition and extend arms and legs wider apart”). We include several visual examples in Figure 2 that show body part editing (“keep arms at shoulder height”), directional (“move in the other direction”) and temporal (“start crawling earlier”) changes. We provide dynamic video examples on our project webpage through the supplementary video and the data exploration interface. Detailed statistics regarding the texts and motions can also be viewed in Section A of the appendix.

4 Text-Driven Motion Editing Diffusion Model

Fig. 3: Models overview: (left) We illustrate our TMED model during training. We noise the target motion for t steps, and the transformer model is trained to denoise it back by one step. The conditions – text and source motion – are appended to the input. CLIP backbone is frozen, while components denoted in pink are learned during training. At test time, the iterative diffusion process is initialized from random noise instead of the noised target. (right) Our MDM-BP baseline is repurposed from a pretrained text-to-motion generation model to be used only at test time for motion editing. The model is initialized from random noise and the body parts not to be edited according to GPT are copied from the source motion.
We introduce TMED, a text-driven motion editing diffusion model. Given a short 3D human motion, a textual instruction describing a modification, and a noise vector to enable randomness, the model generates an edited motion. Similar tasks have been addressed in the image domain for text-based image editing [Brooks et al. 2023], from which we take inspiration for our model design. We further build on the motion diffusion model (MDM) [Tevet et al. 2023], which takes only text as input and generates a motion. In contrast, our model is additionally conditioned on the source motion and thus requires a different training dataset (as described in Section 3). In the following, we present the components of our TMED model.

4.1 3D Human Motion Representation

We use a sequence of SMPL [Loper et al. 2015] body parameters to represent a human motion. SMPL is a linear function that maps the shape and pose parameters of J joints, along with the global body translation and orientation, to a 3D mesh. The joint positions, Jp, can be obtained from the vertices via the learned SMPL joint regressor. Following previous work [Petrovich et al. 2022] that discards the shape parameters, we set them to zero (mean shape), since motion is parameterized primarily by the pose.
Various alternative representations have been used based on joint positions with respect to the local coordinate system of the body [Guo et al. 2022a; Holden et al. 2016; Starke et al. 2019]. Unlike prior works [Guo et al. 2022a; Tevet et al. 2023; Zhang et al. 2023a] that fit SMPL bodies to skeleton generations, we aim to enable direct regression of SMPL parameters, bypassing the need for a costly post-processing optimization [Bogo et al. 2016] and thus making our method ready to use in animation frameworks.
A common approach for representing SMPL pose parameters within a learning framework is to employ 6D rotations [Zhou et al. 2019], and to apply first-frame canonicalization for motions [Athanasiou et al. 2022; Petrovich et al. 2022]. Similarly, we canonicalize our motions prior to training, so that all face the same direction in the first frame and have the same initial global position. Inspired by [Holden et al. 2016; Petrovich et al. 2024], we represent the global body translation as differences between consecutive frames. Supervising with such relative translations helps the denoiser to generate better trajectories, as we observed unsmooth generations when using the absolute translation. Similar to STMC [Petrovich et al. 2024], we factor out the z-rotation from the pelvis orientation and separately represent the global orientation as the xy-orientation and the z-orientation as the differences between rotations in consecutive frames (resulting in 12 features, i.e., 6D representation for xy and z). We represent the body pose with 6D rotations [Zhou et al. 2019]. Similar to [Petrovich et al. 2022], we exclude the hand joints as they mostly do not move in the datasets we use. We additionally append the local joint positions after removing z-rotation of the body [Holden et al. 2016; Petrovich et al. 2024] (resulting in 192 dimensional features with 6 × 21 for rotations and 22 × 3 for joints including the root joint). Thus, each motion frame has a dimension dp = 207, consisting of 3 features for the global translation, 12 for the global orientation, and 192 for the body pose. The motion is represented as a sequence of the pose representations. During training, all features are normalized according to their mean and variance over the training set.
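To make the per-frame layout concrete, the sketch below assembles the 207-dimensional feature vector described above; the input arrays and their names are illustrative placeholders, not the released code.

```python
import numpy as np

def frame_features(delta_trans, xy_orient_6d, delta_z_orient_6d,
                   body_pose_6d, local_joints):
    """Assemble one motion frame into a 207-d vector:
         3  global translation differences between consecutive frames
        12  global orientation (6D xy-orientation + 6D z-rotation differences)
       126  body pose: 21 joints x 6D rotation (hands excluded)
        66  local joint positions: 22 joints x 3, with the z-rotation removed
    """
    feats = np.concatenate([
        delta_trans,                # (3,)
        xy_orient_6d,               # (6,)
        delta_z_orient_6d,          # (6,)
        body_pose_6d.reshape(-1),   # (21, 6) -> (126,)
        local_joints.reshape(-1),   # (22, 3) -> (66,)
    ])
    assert feats.shape == (207,)
    return feats
```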

4.2 Conditional Diffusion Model

To learn TMED, we use our new training data, where each data sample comprises a source motion MS, target motion MT, and a language instruction L. We train a conditional diffusion model that learns to edit the source motion with respect to the instruction. We design a model similar to that of InstructPix2Pix [Brooks et al. 2023], where the generation from a random noise vector is conditioned on two further inputs L and MS. Here, instead of a sequence of image patch tokens, the motion modality is represented as a variable-length sequence of motion frames. The noised target motion, the text condition L, and the source motion condition MS are all fed as input to the denoiser at every diffusion step.
Diffusion models [Sohl-Dickstein et al. 2015] learn to gradually turn random noise into a sample from a data distribution by a sequence of denoising autoencoders. This is achieved by a diffusion process that adds noise ϵt to an input signal, MT. We denote the noise level added to the input signal by using t, the diffusion timestep, as a superscript. This produces a diffused sample, \({M}^{t}_{T}\). The amount of noise added at timestep t = 1, …, N is defined a priori through a noise schedule. We train a denoiser network D to reverse this process, given the timestep t, the instruction L, the noised target motion \({M}^{t}_{T}\), and the source motion MS. As supervision, the output of the denoiser network \(\tilde{M}^{t}_{T}\) is compared against the ground-truth denoised target motion MT. Our model is therefore trained to minimize:
\begin{equation} { \mathbb {E}_{\epsilon \sim \mathcal {N}(0,1), t, L, M_S} \left\Vert D\left({M^{t}_{T}}; t, L, M_S\right) - M_{T} \right\Vert } \end{equation}
(1)
We use standard mean-squared-error as the loss function to compare the diffusion output with the ground-truth target motion. We choose to predict the denoised target motion, as we found this to produce better results visually than predicting the noise itself.
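A minimal training-step sketch of this objective is shown below, assuming a diffusers-style noise scheduler and a denoiser that accepts the noised target, the timestep, and the two conditions; all names are illustrative, not the released code.

```python
import torch
import torch.nn.functional as F

def training_step(denoiser, scheduler, M_T, M_S, text_emb):
    """One diffusion training step: noise the target motion and regress the
    clean target (x0 prediction) with an MSE loss, as in Eq. (1)."""
    B = M_T.shape[0]
    t = torch.randint(0, scheduler.config.num_train_timesteps, (B,), device=M_T.device)
    noise = torch.randn_like(M_T)
    M_T_noised = scheduler.add_noise(M_T, noise, t)     # forward diffusion
    M_T_pred = denoiser(M_T_noised, t, text_emb, M_S)   # predict the denoised target
    return F.mse_loss(M_T_pred, M_T)
```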
The architecture overview is illustrated in Figure 3 (left). Our model consists of multiple encoders for each input modality (ET for timestep, EL for text, and EM for motion) and a transformer encoder D that operates on all inputs. The timestep t is encoded via ET similar to MDM [Tevet et al. 2023], by first converting into a sinusoidal positional embedding, and then projecting through a feed-forward network (consisting of two linear layers with a SiLU activation [Elfwing et al. 2018] in between). As in [Tevet et al. 2023], we use the CLIP [Radford et al. 2021] text encoder for EL. We pass the source and noised target motions through a linear layer (EM), shared across frames, and obtain \(M^{enc}_{S}=E_M(M_S)\) and \({M^{t}_{T}}^{enc}=E_M({M}^{t}_{T})\). Given the variable duration of source and target motions, we add a learnable separation token SEP in between [Devlin et al. 2019] when appending them (so that the information on when the target motion ends and the source motion starts is communicated to the transformer). Once all encoded inputs have the same feature dimensionality d, they are combined into a single sequence to be fed to the transformer, as shown in Figure 3, and sinusoidal positional embeddings are subsequently added. During training, to enable classifier-free guidance, the source motion condition is randomly dropped 5% of the time, the text condition 5%, both conditions together 5%, and all the inputs are used 85% of the time. For sampling from a diffusion model with two conditions, we apply classifier-free guidance with respect to two conditions: the input motion MS and the text instruction L. We introduce separate guidance scales \(s_{M_S}\) and sL that allow adjusting the influence of each conditioning.
For simplicity, we now abuse the notation by dropping the timestep indices when deriving the sampling process. Our generative model, TMED, learns the probability distribution over the target motions MT, conditioned on the source motion and the text condition, \(P(M_T \mid M_S, L)\). Expanding this conditional probability gives:
\begin{equation} \begin{split} P(M_T \mid M_S, L) &= \frac{P(M_T, M_S, L)}{P(L, M_S)} \\ &=\frac{P(L \mid M_S, M_T) P(M_S \mid M_T) P(M_T)}{P(L, M_S)}. \end{split} \end{equation}
(2)
As in the original diffusion, we formulate this as a score function optimization problem by first taking the logarithm of Eq.(2):
\begin{equation} \begin{split} \log (P(M_T \mid M_S, L)) &= \log (P(L \mid M_S, M_T)) \\ &\quad + \log (P(M_S \mid M_T)) \\ &\quad + \log (P(M_T)) \\ &\quad - \log (P(L, M_S)). \end{split} \end{equation}
(3)
Then, the derivative with respect to the input of Eq.(3) gives the score estimate \(\tilde{e}_\theta (M_T, s_{M_S}, s_L)\), learned under classifier-free guidance:
\begin{equation} { \begin{aligned} \nabla _{M_T} \log (P(M_T \mid M_S, L)) &= \nabla _{M_T} \log (P(L \mid M_S, M_T)) \\ &\quad + \nabla _{M_T} \log (P(M_S \mid M_T)) \\ &\quad + \nabla _{M_T} \log (P(M_T)). \\ \end{aligned} } \end{equation}
(4)
Hence, from Eq. (4), we sample from TMED using the modified score estimate for two-way conditioning of the diffusion model:
\begin{equation} { \begin{aligned} \tilde{e}_\theta (M_T, s_{M_S}, s_L) &= e_\theta (M_T, \emptyset , \emptyset) \\ &\quad + s_{M_S} \cdot (e_\theta (M_T, M_S, \emptyset) - e_\theta (M_T, \emptyset , \emptyset)) \\ &\quad + s_L \cdot (e_\theta (M_T, M_S, L) - e_\theta (M_T, M_S, \emptyset)). \end{aligned} } \end{equation}
(5)
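Eq. (5) translates directly into code; the sketch below is a hedged illustration in which the null condition ∅ is realized by passing None for the corresponding input (function names and signatures are assumptions, not the released implementation).

```python
def guided_estimate(denoiser, M_T_noised, t, M_S, text_emb, s_motion, s_text):
    """Two-way classifier-free guidance (Eq. 5): combine the unconditional,
    motion-conditioned, and fully conditioned predictions."""
    e_uncond = denoiser(M_T_noised, t, text=None, source=None)
    e_motion = denoiser(M_T_noised, t, text=None, source=M_S)
    e_full = denoiser(M_T_noised, t, text=text_emb, source=M_S)
    return (e_uncond
            + s_motion * (e_motion - e_uncond)
            + s_text * (e_full - e_motion))
```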
We further ablate the guidance scales which control the generation at test time in Section 5.
Implementation details. All models are trained for 1000 epochs using a cosine noise schedule with a DDPM scheduler [Ho et al. 2020]. We use N = 300 diffusion timesteps, as we find this to be a good compromise between speed and quality. The guidance scales are chosen for each model based on their best performance on the validation set of MotionFix (\(s_L=2, s_{M_S}=2\)). We follow the same process for training the MDM [Tevet et al. 2023] baselines described in the next section. In terms of architectural details, the dimensionality of the embeddings before inputting to the transformer is d = 512. We use a pre-trained and frozen CLIP [Radford et al. 2021] with all 77 token outputs of the ViT-B/32 backbone [Dosovitskiy et al. 2021] as our text encoder EL. We use the text masks from EL to mask the padded area of the text inputs. The motion encoder EM that precedes the transformer is a simple linear projection with dimensionality dp × d, where the feature dimension of each motion frame is dp = 207 (as described in Section 4.1).
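For reference, such a schedule could be configured with a diffusers-style DDPM scheduler roughly as follows; this is one plausible setup under the stated hyperparameters, not necessarily the authors' exact configuration.

```python
from diffusers import DDPMScheduler

# Cosine ("squaredcos_cap_v2") noise schedule, 300 timesteps,
# predicting the clean sample (x0) rather than the noise.
scheduler = DDPMScheduler(
    num_train_timesteps=300,
    beta_schedule="squaredcos_cap_v2",
    prediction_type="sample",
)
```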

5 Experiments

We start by describing our evaluation metrics for the new MotionFix benchmark (Section 5.1). We then present the main results on this task, comparing our proposed model to our baseline designs (Section 5.2). Next, we provide ablations on the training data size and guidance hyperparameters (Section 5.3). Finally, we demonstrate qualitative results, comparisons with the baselines and samples from our dataset (Section 5.4).
Methods | Data | Source input | generated-to-target retrieval R@1 / R@2 / R@3 / AvgR | generated-to-source retrieval R@1 / R@2 / R@3 / AvgR
GT | n/a | n/a | 100.0 / 100.0 / 100.0 / 1.00 | 74.01 / 84.52 / 89.91 / 2.03
MDM | HumanML3D | – | 4.03 / 7.56 / 10.48 / 15.55 | 2.62 / 6.15 / 9.38 / 15.88
MDMS | HumanML3D | ✓, init | 3.63 / 7.06 / 10.08 / 15.64 | 2.62 / 6.25 / 9.78 / 15.84
MDM-BPS | HumanML3D | ✓, init&BP | 38.10 / 48.99 / 54.84 / 6.47 | 60.28 / 69.46 / 73.89 / 4.23
MDM-BP | HumanML3D | ✓, BP | 39.10 / 50.09 / 54.84 / 6.46 | 61.28 / 69.55 / 73.99 / 4.21
TMED | MotionFix | ✓, condition | 62.90 / 76.51 / 83.06 / 2.71 | 71.77 / 84.07 / 89.52 / 1.96
Table 2: Results on the MotionFix benchmark (test set): We first evaluate several variants of our text-to-motion synthesis baseline (MDM) on the motion editing task. Subscript S denotes models that denoise the source motion initialization (init) instead of starting the diffusion from noise. BP indicates GPT-based body part labeling described in Section 4 to mask the source body parts which are kept unchanged during diffusion. Our model TMED effectively learns how to utilize the source motion conditioning, thanks to the MotionFix training data. See text for detailed comments.
Methods | generated-to-target retrieval R@1 / R@2 / R@3 / AvgR | generated-to-source retrieval R@1 / R@2 / R@3 / AvgR
GT | 100.0 / 100.0 / 100.0 / 1.00 | 74.01 / 84.52 / 89.91 / 2.03
10% | 19.25 / 30.65 / 38.71 / 8.92 | 22.98 / 37.50 / 45.97 / 7.50
50% | 47.08 / 61.49 / 69.66 / 4.23 | 54.44 / 70.06 / 78.12 / 3.33
100% | 62.90 / 76.51 / 83.06 / 2.71 | 71.77 / 84.07 / 89.52 / 1.96
Table 3: Effect of training data size in MotionFix: We observe significant performance improvement as we increase the amount of training data.

5.1 Evaluation Metrics

As in text-to-motion synthesis, distance-based metrics for evaluating motion generation quality are problematic due to the multiple plausible ground-truth motions for a given text. Prior work has extensively used text-to-motion retrieval metrics for evaluating text-to-motion synthesis [Guo et al. 2022b; Tevet et al. 2023], by training a text-motion contrastive model and using its features. To evaluate motion editing, we introduce motion-to-motion retrieval metrics. Given a generated motion, we measure how well the source (generated-to-source retrieval) or the target motion (generated-to-target retrieval) can be retrieved. We use TMR [Petrovich et al. 2023] as the feature extractor, but train it ourselves to support our feature representation, using the same regime as in the original paper with HumanML3D data [Guo et al. 2022a]. We report the standard metrics R@1, R@2, R@3, and AvgR, using a gallery size of 32 with batches randomly sampled from the test set. Recall at rank k (R@k) computes the percentage of times the correct motion is among the top k results. Note that we fix the batches so there is no randomness across evaluations. The performance is averaged across batches. We report results with the full test set as the gallery in Tables A.2 and A.3 of the appendix, where the same conclusions hold. While the main performance measure is generated-to-target retrieval, we also monitor how close our generations remain to the source. As indicative values, we provide ground-truth (GT) values for the latter. We provide additional measures (FID, L2) and perceptual studies in Sections B and C of the appendix, respectively, which further confirm the results of our proposed retrieval-based benchmarking.
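The sketch below illustrates how such batched motion-to-motion retrieval metrics can be computed from L2-normalized embeddings; it is meant as an illustration of R@k and AvgR, not the benchmark code itself.

```python
import numpy as np

def retrieval_metrics(gen_emb, gallery_emb, ks=(1, 2, 3)):
    """R@k and average rank for one gallery batch: row i of `gen_emb`
    should retrieve row i of `gallery_emb` (e.g., generated-to-target)."""
    sim = gen_emb @ gallery_emb.T                 # (B, B) cosine similarities
    order = np.argsort(-sim, axis=1)              # descending similarity per query
    ranks = np.array([np.where(order[i] == i)[0][0] + 1 for i in range(len(sim))])
    metrics = {f"R@{k}": 100.0 * np.mean(ranks <= k) for k in ks}
    metrics["AvgR"] = float(np.mean(ranks))       # 1 = correct motion ranked first
    return metrics
```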

5.2 Comparison to Baselines

We report our main results in Table 2. We compare the performance of TMED trained on our MotionFix triplets against several baselines trained on the larger, HumanML3D text-motion pairs [Guo et al. 2022a]. We build our baselines by training MDM [Tevet et al. 2023] with our human motion representation and by repurposing this model for motion editing, described next.
We first introduce two simple baselines: (a) MDM, which purely uses the edit text as input to text-to-motion generation (i.e., without a source motion), and (b) MDMS, which additionally uses the source motion as input instead of noise during inference. For the latter, we also investigated reducing the number of diffusion steps when initializing from the source; however, we observed performance drops and therefore kept the full 300 diffusion steps. Inspired by [Athanasiou et al. 2023], we design two additional strong baselines (MDM-BPS and MDM-BP) that are based on body-part labels extracted by querying GPT with the edit texts. We automatically detect body parts which are irrelevant to the text and keep them constant via masking. We again initialize the diffusion process either from the source motion (MDM-BPS) or from noise (MDM-BP) for the body parts that need to change according to the GPT response. For more details on the query and example GPT outputs, we refer to Section D of the appendix.
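A hedged sketch of the masking idea behind the MDM-BP baselines is given below: at each denoising step, features of body parts that GPT marks as irrelevant to the edit are overwritten with the source motion (noised to the current level), so that only the selected parts are generated. The feature-level `part_mask`, the diffusers-style scheduler calls, and the function names are assumptions for illustration, not the exact baseline code.

```python
import torch

def masked_denoise_step(denoiser, scheduler, x_t, t, text_emb, M_S, part_mask):
    """One inpainting-style reverse step: generate only the body-part features
    selected by `part_mask` (True = to be edited); keep the rest from the source."""
    noise = torch.randn_like(M_S)
    M_S_t = scheduler.add_noise(M_S, noise, t)     # source motion at noise level t
    x_t = torch.where(part_mask, x_t, M_S_t)       # freeze unedited body parts
    x0_pred = denoiser(x_t, t, text_emb)           # text-to-motion denoiser (MDM)
    return scheduler.step(x0_pred, t, x_t).prev_sample
```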
We first observe from Table 2 that, for all the baselines, initializing from noise performs better than initializing from source motion. Our strong baselines based on body-part detection (MDM-BP, MDM-BPS) clearly outperform the naive baselines. However, all baselines fall behind our TMED that successfully leverages the access to training triplets, and significantly outperforms alternatives.
Moreover, while MDM-BP and MDM-BPS are both strong baselines, relying on GPT body-part labels might not capture all edit types, such as those that require modifying the whole body. We demonstrate this further in our supplementary video from the project webpage and in our qualitative comparisons (Section 5.4).

5.3 Ablations

In the following, we investigate the effect of training data size and the guidance scales on the TMED model performance.
Training data size. In Table 3, we present the performance of TMED for different data sizes from MotionFix. We clearly observe that increasing the data size has a large impact on the performance, justifying our data collection. The non-saturated trend is encouraging for scaling up the training further.
Fig. 4: Guidances of conditions: We illustrate the R@1 performance of TMED for generated-to-target (left) and generated-to-source (right) retrieval benchmarks for \(s_L, s_{M_S} \in [1, 5]\).
Fig. 5: TMED generations: We illustrate several generations from our model with overlaid source (red) and generated (blue) motions. We showcase a variety of test cases ranging from elaborate edits (first example in top left) to short commands (e.g., “mirror”). TMED is able to perform both edits that describe temporal (e.g., “slow down”) or spatial (e.g., “raise your arms higher so it is overhead”) modifications.
Fig. 6: Failure cases: We show four failure examples from our model. For each sample, we provide the source motion (red) overlaid both with the generation (blue, left) or the ground-truth target motion (green, right). In the top row, we observe that the model may fail to generate the edited motions when the edit text is detailed and the motions differences are subtle. In the bottom row, although the generated motions follow the edit text, they diverge from the source motions.
Fig. 7: Qualitative comparisons with baselines: We provide example results by comparing TMED against the baselines on the MF test set: MDMS (top), MDM-BPS (middle) and MDM-BP (bottom). First two columns show the source (red) and ground-truth target (green) motions. Third column is reserved for baselines, the last column for TMED. Generations are denoted in blue.
Guidance hyperparameters. In Figure 4, we present how TMED performs across different guidance values for both conditions. The x-axis controls the text guidance sL, and the y-axis controls the source motion guidance \(s_{M_S}\) at test time. We report both generated-to-target (left) and generated-to-source (right) R@1 retrieval results. We observe that there needs to be a balance between the two guidance values, and that performance decreases towards the extremes (e.g., the top-left and bottom-right corners of the plots, where only one of the two conditions has a higher guidance). This highlights the need to rely on both conditions to perform the task.

5.4 Qualitative Results

We display several generations from TMED in Figure 5 to enable qualitative assessment. We observe that our model can perform different types of edits such as the addition of actions (“rotate wrists instead of stretching like yawn”), temporal edits in a motion (“get up a bit earlier”), speed edits (“slow down”) and combinations of these. We refer to our supplementary video for dynamic visualizations, which may be easier to interpret.
In Figure 6, we further provide examples of failure cases from TMED. In the top row, we analyze cases with long edit texts. The model struggles with complex details: it does not “keep the body straight” in the left example, nor does it follow the “bend arms in the elbows” instruction on the right, although the wider legs are correctly edited. In the bottom row, we illustrate examples where the model faithfully follows the edit text but does not resemble the source motion. In the left generation, the steps are correctly wider, but the movement does not continue to a position similar to the source motion. Finally, on the right, the body kneels down faster as instructed, but towards the opposite direction.
We additionally provide a qualitative comparison between TMED and various baselines in Figure 7. We provide two comparisons for each baseline (top block for MDMS, middle block for MDM-BPS, and bottom block for MDM-BP).
We observe that MDMS picks up the action from the prompt, but fails to faithfully follow the source motion. In the first row, the generation by MDMS raises both hands, instead of adjusting only the height of the hand raised in the source motion. Similarly, in the second row, the MDMS generation raises the arm, but in front of the body rather than higher, as prompted by the edit text.
In the next two rows, we visualize MDM-BPS results. Given the text “rotate wrists instead of stretching like yawn”, GPT correctly suggests editing both hands; however, the generated motion no longer resembles the source motion as the wide-open hands are not preserved. For the example edit “turn in the opposite direction”, all body parts are involved, but MDM-BPS does not deviate too much from the source, perhaps because traditional text-to-motion generation models rarely see relative words such as “opposite direction”.
Finally, we illustrate generations from our strongest baseline MDM-BP in the last two rows. Both edits involve all parts of the body (e.g., “slow down”), making it hard to follow the source motion. In comparison, our model faithfully performs most of the edits. The disadvantage of TMED, on the other hand, might be its generalization to motion pairs where the TMR similarity is low, as such edits were unseen during training. We briefly discuss more limitations in the following.

6 Conclusions

In this work, we studied the task of motion editing from language instructions. Given the scarcity of training data, we introduced a new dataset MotionFix, collected in a semi-automatic manner. We exploit motion retrieval models to obtain “edit-ready” motion pairs which we annotate with language labels. We design a conditional diffusion model TMED that is trained on MotionFix, and generates edited motions that follow the source motion and the edit text. We show both quantitatively and qualitatively that our model outperforms all baselines. We hope that our dataset and findings will assist the research community and pave the way for exploring this new task.
Limitations. Our approach comes with limitations. Assuming that two TMR-similar motions are an editing distance apart is not always accurate, but it serves as a good starting point. Furthermore, in our data collection, we constrain the motions to be up to 5 seconds, since longer motions tend to produce many dissimilar pairs. Regarding model performance, TMED exhibits difficulty generalizing to unseen or complex edit texts and maintaining faithfulness to the source motion. Moreover, while our model can be used iteratively, we do not explore this capability in this paper and leave it for future work.

Acknowledgments

The authors would like to thank Benjamin Pellkofer for building the data exploration website, Tsvetelina Alexiadis for guidance in data collection and perceptual studies, Arina Kuznetcova, Asuka Bertler, Claudia Gallatz, Suraj Bhor, Tithi Rakshit, Taylor McConnell, Tomasz Niewiadomski for data annotation, Lea Müller and Mathis Petrovich for helpful discussions, Yuliang Xiu for proofreading and Peter Kulits for the support and seatmating. GV acknowledges the ANR project CorVis ANR-21-CE23-0003-01. Disclosure: https://files.is.tue.mpg.de/black/CoI_CVPR_2024.txt

Supplemental Material

MP4 and PDF supplementary material for MotionFix.

References

[1]
Kfir Aberman, Yijia Weng, Dani Lischinski, Daniel Cohen-Or, and Baoquan Chen. 2020. Unpaired motion style transfer from video to animation. Transactions on Graphics (TOG) (2020).
[2]
Nikos Athanasiou, Mathis Petrovich, Michael J Black, and Gül Varol. 2022. TEACH: Temporal Action Composition for 3D Humans. In International Conference on 3D Vision (3DV).
[3]
Nikos Athanasiou, Mathis Petrovich, Michael J. Black, and Gül Varol. 2023. SINC: Spatial Composition of 3D Human Motions for Simultaneous Action Generation. International Conference on Computer Vision (ICCV) (2023).
[4]
Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J. Black. 2016. Keep it SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image. In European Conference on Computer Vision (ECCV).
[5]
Tim Brooks, Aleksander Holynski, and Alexei A Efros. 2023. InstructPix2Pix: Learning to follow image editing instructions. In Computer Vision and Pattern Recognition (CVPR).
[6]
Edwin Catmull. 1972. A system for computer generated movies. In Proceedings of the ACM Annual Conference.
[7]
Setareh Cohan, Guy Tevet, Daniele Reda, Xue Bin Peng, and Michiel van de Panne. 2024. Flexible Motion In-betweening with Diffusion Models. In International Conference on Computer Graphics and Interactive Techniques (SIGGRAPH).
[8]
Michael F Cohen. 1992. Interactive spacetime control for animation. In International Conference on Computer Graphics and Interactive Techniques (SIGGRAPH).
[9]
Ginger Delmas, Philippe Weinzaepfel, Thomas Lucas, Francesc Moreno-Noguer, and Grégory Rogez. 2022. PoseScript: 3D Human Poses from Natural Language. In European Conference on Computer Vision (ECCV).
[10]
Ginger Delmas, Philippe Weinzaepfel, Francesc Moreno-Noguer, and Grégory Rogez. 2023. PoseFix: Correcting 3D Human Poses with Natural Language. In International Conference on Computer Vision (ICCV).
[11]
J. Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).
[12]
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations (ICLR).
[13]
Stefan Elfwing, Eiji Uchibe, and Kenji Doya. 2018. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural networks (2018).
[14]
Mihai Fieraru, Mihai Zanfir, Silviu Cristian Pirlea, Vlad Olaru, and Cristian Sminchisescu. 2021. AIFit: Automatic 3D human-interpretable feedback models for fitness training. In Computer Vision and Pattern Recognition (CVPR).
[15]
Purvi Goel, Kuan-Chieh Wang, C Karen Liu, and Kayvon Fatahalian. 2024. Iterative Motion Editing with Natural Language. In International Conference on Computer Graphics and Interactive Techniques (SIGGRAPH).
[16]
Chuan Guo, Yuxuan Mu, Muhammad Gohar Javed, Sen Wang, and Li Cheng. 2024. MoMask: Generative Masked Modeling of 3D Human Motions. In Computer Vision and Pattern Recognition (CVPR).
[17]
Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, and Ji. 2022a. Generating diverse and natural 3D human motions from text. In Computer Vision and Pattern Recognition (CVPR).
[18]
Chuan Guo, Xinxin Zuo, Sen Wang, and Li Cheng. 2022b. TM2T: Stochastic and Tokenized Modeling for the Reciprocal Generation of 3D Human Motions and Texts. In European Conference on Computer Vision (ECCV).
[19]
Chuan Guo, Xinxin Zuo, Sen Wang, Shihao Zou, and Sun. 2020. Action2Motion: Conditioned Generation of 3D Human Motions. In ACM International Conference on Multimedia (MM).
[20]
Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising Diffusion Probabilistic Models. In Conference on Neural Information Processing Systems (NeurIPS).
[21]
Daniel Holden, Jun Saito, and Taku Komura. 2016. A Deep Learning Framework for Character Motion Synthesis and Editing. Transactions on Graphics (TOG) (2016).
[22]
Yiming Huang, Weilin Wan, Yue Yang, Chris Callison-Burch, Mark Yatskar, and Lingjie Liu. 2024. CoMo: Controllable Motion Generation through Language Guided Pose Code Editing. arXiv:2403.13900 (2024).
[23]
Korrawe Karunratanakul, Konpat Preechakul, Supasorn Suwajanakorn, and Siyu Tang. 2023. Guided Motion Diffusion for Controllable Human Motion Synthesis. In Computer Vision and Pattern Recognition (CVPR).
[24]
Hyounghun Kim, Abhay Zala, Graham Burri, and Mohit Bansal. 2021. FixMyPose: Pose Correctional Captioning and Retrieval. AAAI Conference on Artificial Intelligence (2021).
[25]
Jihoon Kim, Jiseob Kim, and Sungjoon Choi. 2023. FLAME: Free-form language-based motion synthesis & editing. In AAAI Conference on Artificial Intelligence.
[26]
Diederik P Kingma and Max Welling. 2014. Auto-encoding variational bayes. In International Conference on Learning Representations (ICLR).
[27]
Makito Kobayashi, Chen-Chieh Liao, Keito Inoue, Sentaro Yojima, and Masafumi Takahashi. 2023. Motion capture dataset for practical use of AI-based motion editing and stylization. arXiv:2306.08861 (2023).
[28]
Jehee Lee and Sung Yong Shin. 1999. A hierarchical approach to interactive motion editing for human-like figures. In International Conference on Computer Graphics and Interactive Techniques (SIGGRAPH).
[29]
Taeryung Lee, Gyeongsik Moon, and Kyoung Mu Lee. 2022. MultiAct: Long-Term 3D Human Motion Generation from Multiple Action Labels. In AAAI Conference on Artificial Intelligence.
[30]
Jing Lin, Ailing Zeng, Shunlin Lu, Yuanhao Cai, Ruimao Zhang, Haoqian Wang, and Lei Zhang. 2023. Motion-X: A Large-scale 3D Expressive Whole-body Human Motion Dataset. In Conference on Neural Information Processing Systems (NeurIPS).
[31]
Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. 2015. SMPL: A Skinned Multi-Person Linear Model. Transactions on Graphics (TOG) (2015).
[32]
Naureen Mahmood, Nima Ghorbani, Nikolaus F Troje, Gerard Pons-Moll, and Michael J Black. 2019. AMASS: Archive of Motion Capture As Surface Shapes. In International Conference on Computer Vision (ICCV).
[33]
Ian Mason, Sebastian Starke, and Taku Komura. 2022. Real-time style modelling of human locomotion via feature-wise transformations and local motion phases. Proceedings of the ACM on Computer Graphics and Interactive Techniques (i3D) (2022).
[34]
Ken Perlin. 1995. Real time responsive animation with personality. IEEE Transactions on Visualization and Computer Graphics (1995).
[35]
Mathis Petrovich, Michael J Black, and Gül Varol. 2021. Action-conditioned 3D human motion synthesis with Transformer VAE. In International Conference on Computer Vision (ICCV).
[36]
Mathis Petrovich, Michael J Black, and Gül Varol. 2022. TEMOS: Generating diverse human motions from textual descriptions. In European Conference on Computer Vision (ECCV).
[37]
Mathis Petrovich, Michael J. Black, and Gül Varol. 2023. TMR: Text-to-Motion Retrieval Using Contrastive 3D Human Motion Synthesis. In International Conference on Computer Vision (ICCV).
[38]
Mathis Petrovich, Or Litany, Umar Iqbal, Michael J. Black, Gül Varol, Xue Bin Peng, and Davis Rempe. 2024. Multi-Track Timeline Control for Text-Driven 3D Human Motion Generation. In Computer Vision and Pattern Recognition Workshops (CVPRW).
[39]
Matthias Plappert, Christian Mandery, and Tamim Asfour. 2016. The KIT Motion-Language Dataset. Big Data (2016).
[40]
Zoran Popović and Andrew Witkin. 1999. Physically based motion transformation. In International Conference on Computer Graphics and Interactive Techniques (SIGGRAPH).
[41]
Abhinanda R. Punnakkal, Arjun Chandrasekaran, Nikos Athanasiou, Alejandra Quiros-Ramirez, and Michael J. Black. 2021. BABEL: Bodies, Action and Behavior with English Labels. In Computer Vision and Pattern Recognition (CVPR).
[42]
Alec Radford, Jong Wook Kim, Chris Hallacy, and Ramesh. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML).
[43]
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Computer Vision and Pattern Recognition (CVPR).
[44]
Yonatan Shafir, Guy Tevet, Roy Kapon, and Bermano. 2024. Human Motion Diffusion as a Generative Prior. In International Conference on Learning Representations (ICLR).
[45]
Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning (ICML).
[46]
Sebastian Starke, He Zhang, Taku Komura, and Jun Saito. 2019. Neural state machine for character-scene interactions. Transactions on Graphics (TOG) (2019).
[47]
Guy Tevet, Brian Gordon, Amir Hertz, Amit H Bermano, and Daniel Cohen-Or. 2022. MotionCLIP: Exposing human motion generation to clip space. In European Conference on Computer Vision (ECCV).
[48]
Guy Tevet, Sigal Raab, Brian Gordon, and Shafir. 2023. Human Motion Diffusion Model. In International Conference on Learning Representations (ICLR).
[49]
Kengo Uchida, Takashi Shibuya, Yuhta Takida, Naoki Murata, Shusuke Takahashi, and Yuki Mitsufuji. 2024. MoLA: Motion Generation and Editing with Latent Diffusion Enhanced by Adversarial Training. arXiv:2103.15691 (2024).
[50]
Weilin Wan, Yiming Huang, Shutong Wu, Taku Komura, Wenping Wang, Dinesh Jayaraman, and Lingjie Liu. 2023. DiffusionPhase: Motion Diffusion in Frequency Domain. arXiv:2312.04036 (2023).
[51]
Yiming Xie, Varun Jampani, Lei Zhong, Deqing Sun, and Huaizu Jiang. 2024. OmniControl: Control Any Joint at Any Time for Human Motion Generation. In International Conference on Learning Representations (ICLR).
[52]
Mingyuan Zhang, Zhongang Cai, Liang Pan, and Hong. 2023a. MotionDiffuse: Text-driven human motion generation with diffusion model. Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (2023).
[53]
Mingyuan Zhang, Huirong Li, Zhongang Cai, Jiawei Ren, Lei Yang, and Ziwei Liu. 2023b. FineMoGen: Fine-Grained Spatio-Temporal Motion Generation and Editing. In Conference on Neural Information Processing Systems (NeurIPS).
[54]
Qinsheng Zhang, Jiaming Song, Xun Huang, and Chen. 2023c. DiffCollage: Parallel generation of large content with diffusion models. In Computer Vision and Pattern Recognition (CVPR).
[55]
Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. 2019. On the Continuity of Rotation Representations in Neural Networks. In Computer Vision and Pattern Recognition (CVPR).
[56]
Qiran Zou, Shangyuan Yuan, Shian Du, Yu Wang, Chang Liu, Yi Xu, Jie Chen, and Xiangyang Ji. 2024. ParCo: Part-Coordinating Text-to-Motion Synthesis. In European Conference on Computer Vision (ECCV).

Published In

SA '24: SIGGRAPH Asia 2024 Conference Papers, Tokyo, Japan, December 3–6, 2024. Association for Computing Machinery, New York, NY, United States. 1620 pages. ISBN: 9798400711312. DOI: 10.1145/3680528.

Author Tags

  1. Motion Editing
  2. Motion from Instructions