We introduce TMED, a text-driven motion editing diffusion model. Given a short 3D human motion, a textual instruction describing a modification, and a noise vector to enable randomness, the model generates an edited motion. Similar tasks have been addressed in the image domain for text-based image editing [Brooks et al. 2023], from which we take inspiration for our model design. We further build on the motion diffusion model (MDM) [Tevet et al. 2023], which takes only text as input and generates a motion. In contrast, our model is additionally conditioned on the source motion and therefore requires a different training dataset (as described in Section 3). In the following, we present the components of our TMED model.
4.1 3D Human Motion Representation
We use a sequence of SMPL [Loper et al. 2015] body parameters to represent a human motion. SMPL is a linear function that maps the shape and pose parameters of \(J\) joints, along with the global body translation and orientation, to a 3D mesh. The joint positions \(J_p\) can be obtained from the vertices via the learned SMPL joint regressor. Following previous work [Petrovich et al. 2022] that discards the shape parameters, we set them to zero (mean shape), since motion is primarily parameterized by the pose parameters.
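For illustration, a minimal sketch of how SMPL parameters map to a mesh and joint positions, assuming the smplx Python package and a placeholder model path (neither is specified by this section):

```python
# A minimal sketch (not the authors' code): obtaining vertices and joint
# positions from SMPL parameters with the smplx package. The model path is
# a placeholder and must point to a downloaded SMPL model file.
import torch
import smplx

body_model = smplx.create("models/", model_type="smpl")  # placeholder path

T = 60  # number of frames in the motion
output = body_model(
    betas=torch.zeros(T, 10),          # shape parameters set to zero (mean shape)
    global_orient=torch.zeros(T, 3),   # global root orientation (axis-angle)
    body_pose=torch.zeros(T, 23 * 3),  # per-joint pose parameters (axis-angle)
    transl=torch.zeros(T, 3),          # global body translation
)
vertices = output.vertices  # (T, 6890, 3) mesh vertices
joints = output.joints      # (T, num_joints, 3) joint positions from the learned joint regressor
```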
Various alternative representations have been used based on joint positions with respect to the local coordinate system of the body [Guo et al. 2022a; Holden et al. 2016; Starke et al. 2019]. Unlike prior works [Guo et al. 2022a; Tevet et al. 2023; Zhang et al. 2023a] that fit SMPL bodies to skeleton generations, we aim to enable direct regression of SMPL parameters, bypassing the need for a costly post-processing optimization [Bogo et al. 2016] and thus making our method directly usable in animation frameworks.
A common approach for representing SMPL pose parameters within a learning framework is to employ 6D rotations [Zhou et al. 2019] and to apply first-frame canonicalization to the motions [Athanasiou et al. 2022; Petrovich et al. 2022]. Similarly, we canonicalize our motions prior to training, so that they all face the same direction in the first frame and share the same initial global position. Inspired by [Holden et al. 2016; Petrovich et al. 2024], we represent the global body translation as differences between consecutive frames. Supervising with such relative translations helps the denoiser generate better trajectories; we observed non-smooth generations when using absolute translations. Similar to STMC [Petrovich et al. 2024], we factor out the \(z\)-rotation from the pelvis orientation and represent the global orientation separately as the \(xy\)-orientation and the \(z\)-orientation, the latter expressed as differences between rotations in consecutive frames (resulting in 12 features, i.e., a 6D representation each for \(xy\) and \(z\)). We represent the body pose with 6D rotations [Zhou et al. 2019]. Similar to [Petrovich et al. 2022], we exclude the hand joints, as they mostly do not move in the datasets we use. We additionally append the local joint positions after removing the \(z\)-rotation of the body [Holden et al. 2016; Petrovich et al. 2024] (resulting in 192-dimensional features: 6 × 21 for rotations and 22 × 3 for joints, including the root joint). Thus, each motion frame has dimension \(d_p = 207\), consisting of 3 features for the global translation, 12 for the global orientation, and 192 for the body pose. A motion is represented as a sequence of these pose representations. During training, all features are normalized by their mean and variance computed over the training set.
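To make the feature layout concrete, below is a minimal sketch of assembling the per-frame representation; the function name, argument names, and input shapes are illustrative assumptions, not the authors' code:

```python
import numpy as np

def build_frame_features(transl, rot_xy_6d, rot_z_6d, body_pose_6d, joints_local):
    """Assemble the d_p = 207 per-frame features described above.

    transl:        (T, 3)     global translation (differences between consecutive frames)
    rot_xy_6d:     (T, 6)     6D xy-orientation of the pelvis
    rot_z_6d:      (T, 6)     6D z-orientation (differences between consecutive frames)
    body_pose_6d:  (T, 21, 6) 6D rotations for 21 body joints (hands excluded)
    joints_local:  (T, 22, 3) local joint positions (including the root joint)
    """
    T = transl.shape[0]
    features = np.concatenate(
        [
            transl,                           # 3  features
            rot_xy_6d,                        # 6  features
            rot_z_6d,                         # 6  features
            body_pose_6d.reshape(T, 21 * 6),  # 126 features
            joints_local.reshape(T, 22 * 3),  # 66 features
        ],
        axis=-1,
    )
    assert features.shape[-1] == 207  # d_p
    return features
```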
4.2 Conditional Diffusion Model
To learn TMED, we use our new training data, where each data sample comprises a source motion \(M_S\), a target motion \(M_T\), and a language instruction \(L\). We train a conditional diffusion model that learns to edit the source motion with respect to the instruction. We design a model similar to that of InstructPix2Pix [Brooks et al. 2023], where the generation from a random noise vector is conditioned on two further inputs, \(L\) and \(M_S\). Here, instead of a sequence of image patch tokens, the motion modality is represented as a variable-length sequence of motion frames. The noised target motion, the text condition \(L\), and the source motion condition \(M_S\) are all fed as input to the denoiser at every diffusion step.
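To make the data format concrete, a minimal sketch of one training sample as a data structure follows; the field names and the example instruction are illustrative, not taken from the dataset:

```python
from dataclasses import dataclass
import torch

@dataclass
class MotionEditSample:
    """One training sample: edit a source motion according to an instruction."""
    source_motion: torch.Tensor  # (F_S, d_p) source motion M_S, variable length
    target_motion: torch.Tensor  # (F_T, d_p) target motion M_T, variable length
    instruction: str             # language instruction L, e.g., "raise the right arm higher"
```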
Diffusion models [Sohl-Dickstein et al. 2015] learn to gradually turn random noise into a sample from a data distribution through a sequence of denoising autoencoders. This is achieved by a diffusion process that adds noise \(\epsilon^{t}\) to an input signal \(M_T\). We denote the noise level added to the input signal by using the diffusion timestep \(t\) as a superscript; this produces a diffused sample \(M^{t}_{T}\). The amount of noise added at timestep \(t = 1, \ldots, N\) is defined a priori through a noise schedule. We train a denoiser network to reverse this process given the timestep \(t\), the instruction \(L\), the noised target motion \(M^{t}_{T}\), and the source motion \(M_S\). As supervision, the output of the denoiser network, \(\tilde{M}^{t}_{T}\), is compared against the ground-truth target motion \(M_T\). Our model is therefore trained to minimize a standard mean-squared-error loss between the diffusion output and the ground-truth target motion. We choose to predict the denoised target motion rather than the noise itself, as we found this to produce visually better results.
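Written out, a sketch of this objective under the standard DDPM formulation (the noise-schedule notation \(\bar{\alpha}^{t}\) and the denoiser symbol \(D\) are introduced here for illustration):

\[
M^{t}_{T} = \sqrt{\bar{\alpha}^{t}}\, M_T + \sqrt{1-\bar{\alpha}^{t}}\, \epsilon^{t}, \qquad \epsilon^{t} \sim \mathcal{N}(0, I),
\]
\[
\mathcal{L} = \mathbb{E}_{t,\,\epsilon^{t}} \Big[ \big\| \tilde{M}^{t}_{T} - M_T \big\|_2^2 \Big], \qquad \tilde{M}^{t}_{T} = D\big(M^{t}_{T},\, t,\, L,\, M_S\big).
\]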
The architecture overview is illustrated in Figure 3 (left). Our model consists of encoders for each input modality (\(E_T\) for the timestep, \(E_L\) for the text, and \(E_M\) for the motion) and a transformer encoder \(D\) that operates on all inputs. The timestep \(t\) is encoded via \(E_T\) similarly to MDM [Tevet et al. 2023]: it is first converted into a sinusoidal positional embedding and then projected through a feed-forward network (two linear layers with a SiLU activation [Elfwing et al. 2018] in between). As in [Tevet et al. 2023], we use the CLIP [Radford et al. 2021] text encoder for \(E_L\). We pass the source and noised target motions through a linear layer (\(E_M\)), shared across frames, and obtain \(M^{enc}_{S}=E_M(M_S)\) and \({M^{t}_{T}}^{enc}=E_M(M^{t}_{T})\). Given the variable duration of the source and target motions, we add a learnable separation token SEP between them [Devlin et al. 2019] when concatenating them, so that the transformer is informed where the target motion ends and the source motion starts. Once all encoded inputs have the same feature dimensionality \(d\), they are combined into a single sequence that is fed to the transformer, as shown in Figure 3, and sinusoidal positional embeddings are subsequently added. During training, to enable classifier-free guidance, we randomly drop the source motion condition 5% of the time, the text condition 5% of the time, and both conditions together 5% of the time; all inputs are used the remaining 85% of the time. To sample from a diffusion model with two conditions, we apply classifier-free guidance with respect to both: the input motion \(M_S\) and the text instruction \(L\). We introduce separate guidance scales, \(s_{M_S}\) and \(s_L\), that allow adjusting the influence of each condition.
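A minimal PyTorch-style sketch of how the conditioning tokens could be assembled and randomly dropped during training; the token ordering, the zeroing used to drop a condition, and the tensor names are illustrative assumptions, not the authors' implementation:

```python
import random
import torch

def build_transformer_input(t_emb, text_emb, src_motion_enc, tgt_motion_enc, sep_token,
                            p_drop=0.05, training=True):
    """Combine the encoded inputs into a single sequence of dimensionality d.

    t_emb:          (1, d)    encoded diffusion timestep E_T(t)
    text_emb:       (77, d)   CLIP token embeddings E_L(L)
    src_motion_enc: (F_S, d)  encoded source motion E_M(M_S)
    tgt_motion_enc: (F_T, d)  encoded noised target motion E_M(M_T^t)
    sep_token:      (1, d)    learnable separation token SEP
    """
    if training:
        r = random.random()
        if r < p_drop:                      # 5%: drop the source motion condition
            src_motion_enc = torch.zeros_like(src_motion_enc)
        elif r < 2 * p_drop:                # 5%: drop the text condition
            text_emb = torch.zeros_like(text_emb)
        elif r < 3 * p_drop:                # 5%: drop both conditions
            src_motion_enc = torch.zeros_like(src_motion_enc)
            text_emb = torch.zeros_like(text_emb)
        # otherwise (85% of the time) keep all inputs

    # Single sequence: timestep, text, noised target, SEP, source motion.
    seq = torch.cat([t_emb, text_emb, tgt_motion_enc, sep_token, src_motion_enc], dim=0)
    return seq  # sinusoidal positional embeddings are added afterwards
```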
For simplicity, we now abuse notation by dropping the timestep superscripts when deriving the sampling process. Our generative model, TMED, learns the probability distribution over target motions \(M_T\) conditioned on the source motion and the text condition, \(P(M_T \mid M_S, L)\). Expanding this conditional probability and, as in the original diffusion formulation, casting it as a score-function optimization problem by taking its logarithm and differentiating with respect to the input gives the score estimate \(\tilde{e}_\theta (M_T, s_{M_S}, s_L)\), learned under classifier-free guidance. We sample from TMED using this modified score estimate for two-way conditioning of the diffusion model.
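A sketch of the resulting two-condition classifier-free guidance rule, following the form used in InstructPix2Pix [Brooks et al. 2023] with our conditions substituted in (\(\emptyset\) denotes a dropped condition; the exact notation of the original derivation may differ):

\[
\tilde{e}_\theta(M_T, s_{M_S}, s_L) = e_\theta(M_T, \emptyset, \emptyset)
+ s_{M_S}\,\big(e_\theta(M_T, M_S, \emptyset) - e_\theta(M_T, \emptyset, \emptyset)\big)
+ s_L\,\big(e_\theta(M_T, M_S, L) - e_\theta(M_T, M_S, \emptyset)\big).
\]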
We further ablate the guidance scales, which control the generation at test time, in Section 5.
Implementation details. All models are trained for 1000 epochs using a cosine noise schedule with the DDPM scheduler [Ho et al. 2020]. We use \(N = 300\) diffusion timesteps, as we find this to be a good compromise between speed and quality. The guidance scales are chosen for each model based on their best performance on the validation set of MotionFix (\(s_L = 2\), \(s_{M_S} = 2\)). We follow the same process for training the MDM [Tevet et al. 2023] baselines described in the next section. In terms of architectural details, the dimensionality of the embeddings before they are input to the transformer is \(d = 512\). We use a pre-trained and frozen CLIP [Radford et al. 2021] model with all 77 token outputs of the ViT-B/32 backbone [Dosovitskiy et al. 2021] as our text encoder \(E_L\). We use the text masks from \(E_L\) to mask the padded portion of the text inputs. The motion encoder \(E_M\) that precedes the transformer is a simple linear projection of dimensionality \(d_p \times d\), where the feature dimension of each motion frame is \(d_p = 207\) (as described in Section 4.1).
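For reference, a sketch of an equivalent noise-schedule configuration using the Hugging Face diffusers library (the choice of library is an assumption; the paper does not state which implementation is used):

```python
# A sketch of a matching scheduler configuration with Hugging Face diffusers
# (an assumption about tooling; any DDPM implementation with a cosine schedule
# and x0-prediction would do).
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(
    num_train_timesteps=300,            # N = 300 diffusion timesteps
    beta_schedule="squaredcos_cap_v2",  # cosine noise schedule
    prediction_type="sample",           # predict the denoised target motion, not the noise
)
```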