SINC: Spatial Composition of 3D Human Motions
for Simultaneous Action Generation

Nikos Athanasiou¹ Mathis Petrovich¹,²* Michael J. Black¹ Gül Varol²
¹Max Planck Institute for Intelligent Systems, Tübingen, Germany
²LIGM, École des Ponts, Univ Gustave Eiffel, CNRS, France
sinc.is.tue.mpg.de *Equal contribution

Abstract

Our goal is to synthesize 3D human motions given textual inputs describing simultaneous actions, for example ‘waving hand’ while ‘walking’ at the same time. We refer to generating such simultaneous movements as performing spatial compositions. In contrast to temporal compositions that seek to transition from one action to another, spatial compositing requires understanding which body parts are involved in which action, to be able to move them simultaneously. Motivated by the observation that the correspondence between actions and body parts is encoded in powerful language models, we extract this knowledge by prompting GPT-3 with text such as “what are the body parts involved in the action <action name>?”, while also providing the parts list and few-shot examples. Given this action-part mapping, we combine body parts from two motions together and establish the first automated method to spatially compose two actions. However, training data with compositional actions is always limited by the combinatorics. Hence, we further create synthetic data with this approach, and use it to train a new state-of-the-art text-to-motion generation model, called SINC (“SImultaneous actioN Compositions for 3D human motions”). In our experiments, we find that training with such GPT-guided synthetic data improves spatial composition generation over baselines. Our code is publicly available at sinc.is.tue.mpg.de.

Figure 1: Goal: We demonstrate the task of spatial compositions in human motion synthesis. We generate 3D motions for a pair of actions, defined by a pair of textual descriptions. Here, we provide six sample input-output illustrations from our model. For example, we input the set of actions {‘put hands on the waist’, ‘move torso left’} and generate one motion that simultaneously performs both.

1 Introduction

Text-conditioned 3D human motion generation has recently attracted increasing interest in the research community [44, 15, 4], where the task is to input natural language descriptions of actions and to output motion sequences that semantically correspond to the text. Such controlled motion synthesis has a variety of applications in fields that rely on motion capture data, such as special effects, games, and virtual reality. While there have been promising results in this direction, fine-grained descriptions remain out of reach. Consider the scenario in which a movie production needs a particular motion of someone jumping down from a building. One may generate an initial motion with one description, and then gradually refine it until the desired motion is obtained, e.g., {‘jumping down’, ‘with arms behind the back’, ‘while bending the knees’}. State-of-the-art methods [44, 9] often fail to produce reasonable motions when conditioned on fine-grained text describing multiple actions. In this work, we take a step towards this goal by focusing on the spatial composition of motions. In other words, we aim to generate one motion depicting multiple simultaneous actions; see Figure 1. This paves the way for further research on fine-grained human motion generation.

Previous work [33, 2, 13, 44] initially explored the text-conditioned motion synthesis problem on the small-scale KIT Motion-Language dataset [46]. Recently, work [15, 4] has shifted to the large-scale motion capture collection AMASS [37], and its language labels from BABEL [47] or HumanML3D [15]. In particular, similar to this work, TEACH [4] focuses on fine-grained descriptions by addressing temporal compositionality, that is, generating a sequence of actions, one after the other. We argue that composition in time is simpler for a model to learn since the main challenge is to smoothly transition between actions. This does not necessarily require action-specific knowledge, and a simple interpolation method such as Slerp [51] may provide a decent solution. On the other hand, there is no such trivial solution for compositions in space, since one needs to know action-specific body parts to combine two motions. If one knows that ‘waving’ involves the hand and ‘walking’ involves the legs, then compositing the two actions can be performed by cutting and pasting the hand motion into the walking motion. This is often done manually in the animation industry.
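The cut-and-paste idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: the joint layout in `PART_JOINTS`, the array shape `(frames, joints, features)`, and the function name `compose` are hypothetical and are not taken from the paper.

```python
import numpy as np

# Hypothetical skeleton: joint indices per body part (illustrative only,
# not the joint layout used by the paper).
PART_JOINTS = {
    "torso": [0, 1, 2],
    "left arm": [3, 4, 5],
    "right arm": [6, 7, 8],
    "left leg": [9, 10, 11],
    "right leg": [12, 13, 14],
}


def compose(base, donor, donor_parts):
    """Copy the joints of `donor_parts` from `donor` into a copy of `base`.

    Both motions are arrays of shape (frames, joints, features). They are
    cropped to the shorter length so that frames line up one to one.
    """
    n = min(len(base), len(donor))
    out = base[:n].copy()
    for part in donor_parts:
        idx = PART_JOINTS[part]
        out[:, idx] = donor[:n, idx]
    return out


walking = np.zeros((60, 15, 3))  # stand-in for a 'walking' motion
waving = np.ones((45, 15, 3))    # stand-in for a 'waving' motion
combined = compose(walking, waving, ["right arm"])
```

Here the right-arm joints come from the waving clip and every other joint from the walking clip. Cropping to the shorter clip is one simple way to align two motions of different length; other alignment choices are possible.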

To automate this process, we observe that pretrained language models such as GPT-3 [7] encode knowledge about which body parts are involved in different actions. This allows us to first establish a spatial composition baseline (analogous to the Slerp baseline for temporal compositions); i.e., independently generating actions then combining with heuristics. Not surprisingly, we find that this is suboptimal. Instead, we use the synthesized compositions of actions as additional training data for a text-to-motion network. This enriched dataset enables our model, called SINC (“SImultaneous actioN Compositions for 3D human motions”), to outperform the baseline. Our GPT-based approach is similar in spirit to work that incorporates external linguistic knowledge into visual tasks [64, 60, 6].
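As a rough sketch of how such a query could be assembled, the snippet below builds a text prompt from a parts list and a few worked examples. The part vocabulary, the example pairs, the prompt wording, and the name `build_prompt` are all assumptions made for illustration; the text only states that a parts list and few-shot examples accompany the question. No language-model API is called here.

```python
# Assumed part vocabulary and few-shot examples (hypothetical): the paper
# says a parts list and few-shot examples are provided, not what they are.
BODY_PARTS = ["left arm", "right arm", "left leg", "right leg", "torso", "neck"]
FEW_SHOT = [
    ("move right arm in circular motion", ["right arm"]),
    ("make large circles with left leg in front of body", ["left leg"]),
]


def build_prompt(action, parts=BODY_PARTS, examples=FEW_SHOT):
    """Return a few-shot prompt asking which listed parts an action involves."""
    lines = ["Choose the answer from this list: " + ", ".join(parts) + "."]
    for ex_action, ex_parts in examples:
        lines.append(f"Q: What are the body parts involved in the action: {ex_action}?")
        lines.append("A: " + ", ".join(ex_parts))
    # The final question is left unanswered for the language model to complete.
    lines.append(f"Q: What are the body parts involved in the action: {action}?")
    lines.append("A:")
    return "\n".join(lines)


prompt = build_prompt("overhead throw")
print(prompt)
```

Restricting the answer to a fixed list keeps the response easy to map onto skeleton parts, which a free-form answer does not guarantee.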

While BABEL [47] and HumanML3D [15] have relatively large vocabularies of actions, they contain a limited number of simultaneous actions. A single temporal segment is rarely annotated with multiple texts. For example, BABEL contains only roughly 2.5K segments with simultaneous actions, while it has ~25K segments with only one action. This highlights the difficulty of obtaining compositional data at scale. Moreover, for any reasonably large set of actions, it is impractical to collect data for all possible pairwise, or greater, combinations of actions such that there exists no unseen combination at test time [62, 64]. With existing datasets, it is easy to learn spurious correlations. For example, if waving is only ever observed while someone is standing, a model will learn that waving involves moving the arm with straight legs. Thus generating waving and sitting would be highly unlikely. In our work, we address this challenge by artificially creating compositional data for training using GPT-3. By introducing more variety, our generative model is better able to understand what is essential to an action like ‘waving’.
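To make the combinatorial argument concrete, the number of unordered action pairs grows quadratically with the vocabulary size. The vocabulary sizes below are hypothetical round numbers chosen for illustration, not figures from BABEL or HumanML3D.

```python
from math import comb


def num_pairs(num_actions):
    """Number of unordered pairs of distinct actions."""
    return comb(num_actions, 2)


# Hypothetical vocabulary sizes, for illustration only.
for n in (10, 100, 1000):
    print(n, num_pairs(n))
# 10 45
# 100 4950
# 1000 499500
```

Even a vocabulary of a thousand actions yields roughly half a million pairs, far more than the few thousand simultaneous-action segments mentioned above, and triples or larger sets grow faster still.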

Our method, SINC, extends the generative text-to-motion model TEMOS [44] such that it becomes robust to input text describing more than one action, thanks to our synthetic training. We intentionally build on an existing model to focus the analysis on our proposed synthetic data. Given a mix of real single actions, real pairs of actions, and synthetic pairs of actions, we train a probabilistic text-conditioned motion generation model. We introduce several baselines to measure sensitivity to the model design, as well as to check whether our learned motion decoder outperforms a simpler compositing technique (i.e., simply using our GPT-guided data creation approach, along with a single-action generation model). We observe limited realism when compositing different body parts together, and need to incorporate several heuristics, for example when merging motions whose body parts overlap. While such synthetic data is imperfect, it helps the model disentangle the body parts that are relevant for an action and avoid learning spurious correlations. Moreover, since our motion decoder also has access to real motions, it learns to generate realistic motions, eliminating the realism problem of the synthetic composition baseline.

Our contributions are the following: (i) We establish a new benchmark on the problem of spatial compositions for 3D human motions, compare a number of baseline models on this new problem, and introduce a new evaluation metric that is based on a motion encoder that has been trained with text supervision. (ii) To address the data scarcity problem, we propose a GPT-guided synthetic data generation scheme by combining action-relevant body parts from two motions. (iii) We provide an extensive set of experiments on the BABEL dataset, including ablations that demonstrate the advantages of our synthetic training, as well as an analysis quantifying the ability of GPT-3 to assign part labels to actions. Our code is available for research purposes.

2 Related Work

Human motion generation. While motion prediction [41, 71, 65, 10, 5, 38, 49, 34], synthesis [18, 31] and in-betweening [19, 54, 73, 27] represent the most common motion-generation tasks, conditional synthesis through other modalities (e.g., text) has recently received increasing interest. Example conditions include music [32, 40], speech [17, 1], scenes [53, 20, 69, 59], action [16, 43] or text [33, 2, 13, 44, 4, 15]. In the following, we focus on work involving text-conditioned motion synthesis, which is most closely related to our work.

3D human motion and natural language. Unlike methods that use categorical action labels to control the motion synthesis [16, 43, 36], text-conditioned methods [33, 2, 13, 44, 4, 15] seek to input free-form language descriptions that go beyond a closed set of classes. The KIT-ML dataset [46] comprises textual annotations for motion capture data, representing the first benchmark for this task. More recently, the larger scale AMASS [37] motion capture collection is labeled with language descriptions by BABEL [47] and HumanML3D [15]. A common solution to text-conditioned synthesis is to design a cross-modal joint space between motions and language [2, 13, 44]. TM2T [14] introduces a framework to jointly perform text-to-motion and motion-to-text, integrating a back-translation loss. In contrast to the deterministic methods of [2, 13], TEMOS [44] employs a VAE-based probabilistic approach (building on ACTOR [43]) that can generate multiple motions per textual input, and establishes the state of the art on the KIT benchmark [46] with a non-autoregressive architecture. Following the success of diffusion models [52, 22], very recently, MDM [56], FLAME [28], MotionDiffuse [67], and MoFusion [12] demonstrate diffusion-based motion synthesis. Recent work [9] shows the potential of latent diffusion to address the slow inference limitation. On the other hand, T2M-GPT [66] obtains competitive performance compared with diffusion using VQ-VAEs. Our approach is complementary and applicable to existing models for text-to-motion synthesis. In this work, we adopt TEMOS [44] and retrain it on the data from [47] together with our proposed synthetic compositions.

In contrast to previous work, our focus is on the composition of simultaneous actions. Prior work on compositional actions focuses on temporal compositions (i.e., inputting a sequence of textual descriptions). Early influential work [3] employs dynamic-programming approaches to compose existing motions from a motion database with action labels. Recently, Wang et al. [59] generate a sequence of actions in 3D scenes by synthesizing pose anchors that are then placed in the scene and refined by infilling. TEACH [4] extends TEMOS [44] by incorporating an action-level recursive design that generates the next action conditioned on the past motion. ActionGPT [25] improves this model by retraining it with text augmentations using language models. Concurrently, MultiAct [30] similarly aims to produce continuous transitions between generated actions. In contrast to previous work [4, 30, 25], we focus on spatial compositionality, inputting text that describes simultaneous actions. In this direction, MotionCLIP [55] and MDM [56] test the compositional capabilities of their methods, but only show preliminary analyses. The concurrent work of MotionDiffuse [67] injects manually labeled body-part information and performs noise interpolation to obtain spatial compositionality.

External linguistic knowledge. Large language models have been exploited for many visual tasks such as instruction-conditioned image editing [6], visual relationship detection [64], and human-object reconstruction [60], among others. Similar to us, Wang et al. [60] incorporate GPT by asking what body part is in contact with a given object, which in turn is used for image-based 3D human-object reconstruction. On the other hand, we exploit GPT to extract knowledge about body parts that are involved in an action. To the best of our knowledge, we are the first to systematically model such body part associations from textual descriptions.

Training with synthetic data. Using synthetic data to train machine learning models is a standard approach for solving many visual recognition tasks, such as 3D body pose estimation [8, 42], 2D body part segmentation [58], 3D hand pose estimation [21], video action recognition [57], 2D body pose estimation [48], pedestrian detection [45], and optical flow estimation [24]. In a similar spirit to us, the recent work of HUMANISE [61] creates a synthetic dataset of human-scene interactions by combining 4 actions from BABEL [47] with 3D scenes, and pairing them with language descriptions. In this work, we generate synthetic training data by combining existing 3D motion assets and language labels to overcome the data scarcity problem for compositional learning, helping our method to avoid learning spurious correlations.

Action	Prompt Type	GPT-3 Response
move right arm in circular motion	Free-form	The person’s right arm, shoulder, and possibly the upper part of their body.
	Choosing from a list	right arm
	Choosing from a list + Few-shot examples	right arm
make large circles with left leg in front of body	Free-form	The left leg and the hips
	Choosing from a list	left leg
	Choosing from a list + Few-shot examples	left leg
overhead throw	Free-form	The deltoid muscle in the shoulder and the triceps muscle in the arm are moving when someone is doing an overhead throw.
	Choosing from a list	left arm right arm
	Choosing from a list + Few-shot examples	left arm right arm torso
walk backwards with arms attach to the waist	Free-form	The body parts involved in the action of walking backwards with arms attached to the waist are the legs, arms, back, and abdomen.
	Choosing from a list	right leg left leg buttocks
	Choosing from a list + Few-shot examples	left arm right arm left leg right leg waist
put down bottle with left hand	Free-form	Left arm Left hand Fingers
	Choosing from a list	left arm
	Choosing from a list + Few-shot examples	left arm torso
rotate shoulders	Free-form	The body parts involved in the action of rotating the shoulders are the neck, shoulders, arms, and back.
	Choosing from a list	left arm right arm arm torso neck
	Choosing from a list + Few-shot examples	left arm right arm arm torso
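The list-based responses in the table are space-separated part names, some of which span two words, and occasionally contain a stray token such as a bare "arm". A minimal sketch of turning such a response into a clean list of parts is given below. The vocabulary `KNOWN_PARTS` is assembled from the responses shown in the table and is an assumption, as is the function name `parse_parts`; the authors' actual post-processing may differ.

```python
# Assumed vocabulary, collected from the responses shown in the table
# (hypothetical; not necessarily the list given to GPT-3 by the authors).
KNOWN_PARTS = ["left arm", "right arm", "left leg", "right leg",
               "torso", "neck", "buttocks", "waist"]


def parse_parts(response, known=KNOWN_PARTS):
    """Greedily match the longest known part name at each word position."""
    words = response.lower().split()
    max_len = max(len(p.split()) for p in known)
    found, i = [], 0
    while i < len(words):
        for size in range(max_len, 0, -1):
            if i + size > len(words):
                continue  # not enough words left for a phrase of this size
            cand = " ".join(words[i:i + size])
            if cand in known:
                if cand not in found:
                    found.append(cand)  # keep first-seen order, drop repeats
                i += size
                break
        else:
            i += 1  # skip a token that matches no known part (e.g. bare "arm")
    return found


print(parse_parts("left arm right arm arm torso neck"))
# ['left arm', 'right arm', 'torso', 'neck']
```

Matching the longest phrase first prevents "left arm" from being split into unknown single words, and unmatched tokens are dropped rather than guessed at.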

	Model used for TEMOS score
	Single-action	SINC
Single-action	0.601	0.594
SINC	0.644	0.637


Abstract

1 Introduction

2 Related Work

3 Spatial Composition of Motions from Textual Descriptions

3.1 GPT-guided synthetic training data creation

Body part label extraction from GPT-3.

Body part composition to create new motions.

3.2 Learning to generate spatial compositions

3.3 Implementation details

4 Experiments

4.1 Data and evaluation metrics

4.2 Single-action baselines

4.3 The effect of the input text format

4.4 Training with different sets of data

4.5 Qualitative analysis

4.6 Limitations

5 Conclusions

References

Appendix A Additional experiment with diffusion models

Appendix B Body Part Labeling with GPT-3

Appendix C Synthetic Data Creation

Appendix D TEMOS Score

Appendix E Additional Quantitative Evaluation

E.1 More conjunction words

E.2 TEMOS score with various TEMOS models

E.3 Diversity

E.4 Full validation set
