PoseFix: Correcting 3D Human Poses with Natural Language

Ginger Delmas1,2, Philippe Weinzaepfel2, Francesc Moreno-Noguer1, Grégory Rogez2
1 Institut de Robòtica i Informàtica Industrial, CSIC-UPC, Barcelona, Spain
1{gdelmas, fmoreno}@iri.upc.edu, 2{name.surname}@naverlabs.com

Automatically producing instructions to modify one’s posture could open the door to endless applications, such as personalized coaching and in-home physical therapy. Tackling the reverse problem (i.e., refining a 3D pose based on some natural language feedback) could help for assisted 3D character animation or robot teaching, for instance. Although a few recent works explore the connections between natural language and 3D human pose, none focus on describing 3D body pose differences. In this paper, we tackle the problem of correcting 3D human poses with natural language. To this end, we introduce the PoseFix dataset, which consists of several thousand paired 3D poses and their corresponding text feedback, which describes how the source pose needs to be modified to obtain the target pose. We demonstrate the potential of this dataset on two tasks: (1) text-based pose editing, which aims at generating corrected 3D body poses given a query pose and a text modifier; and (2) correctional text generation, where instructions are generated based on the differences between two body poses. The dataset and the code are available at https://europe.naverlabs.com/research/computer-vision/posefix/.

1 Introduction

How many puzzles could you solve with two human body poses and a description of their differences? Call this description a feedback. It could be automatically generated by a fitness application based on the comparison between the gold standard fitness pose and the pose of John Doe, exercising in front of their smartphone camera in their living room (“straighten your back”). In another context, the feedback can be considered a modifying instruction, provided by a digital animation artist to automatically modify the pose of a character, without having to redesign everything by hand. This feedback could be some kind of constraint, to be applied to a whole sequence of poses (make them run, but “with hands on the hips!”). It could also be a hint, to guide pose estimation from images in failure cases: start from an initial 3D body pose fit, and give step-by-step instructions for the model to improve its pose estimation (“the left elbow should be bent to the back”).

Figure 1: Illustration of the tasks addressed with the new PoseFix dataset, which consists of textual descriptions of the difference between two 3D body poses.
Figure 2: Examples of pose pairs and their annotated modifier in PoseFix. The source pose is shown in gray and the target pose in purple. Poses from in-sequence (IS) pairs are from the same motion clip, unlike those from out-of-sequence (OOS) pairs.

In this paper, we focus on free-form feedback describing the change between two static 3D human poses (which can be extracted from actual pose sequences). Why so static? There exist many settings that require the semantic understanding of fine-grained changes of static body poses. For instance, yoga poses are extremely challenging and specific (with a lot of subtle variations), and they are static. Some sport motions require almost-perfect postures at every moment: for better efficiency, to avoid any pain or injury, or just for better rendering, e.g. in classical dance, yoga, karate, etc. What is more, the realization of complex motions sometimes calls for precise step-by-step instructions, in order to assimilate the gesture or to perform it correctly.

Natural language can help in all these scenarios, in that it is highly semantic and unconstrained, in addition to being a very intuitive way to convey ideas. While 3D poses can be manually edited within a design framework [40], language is particularly efficient for non-experts or when direct manipulation is not possible. The pose semantics we propose to learn here can be leveraged for other modalities (e.g. images) or in other settings (e.g. robot teaching).

While the link between language and images has been extensively studied in tasks like image captioning [34, 22] or image editing [66], the research on leveraging natural language for 3D human modeling is still in its infancy. A few works use textual descriptions to generate motion [18, 56], to describe the difference in poses from synthetic 2D renderings [26] or to describe a single static pose [12]. Nevertheless, there currently exists no dataset that associates pairs of 3D poses with textual instructions to move from one source pose to one target pose. In this work, we thus introduce the PoseFix dataset, which contains over 6,000 textual modifiers written by human annotators for this scenario. In addition, we design a pipeline similar to [12] to generate modifiers automatically and increase the size of the data; see Figure 2 for some examples.

Leveraging the PoseFix dataset, we tackle two tasks: text-based pose editing, where the goal is to generate new poses from an initial pose and modification instructions, and correctional text generation, where the objective is to produce a textual description of the difference between a pair of poses (see Figure 1). For the first task, we use a baseline consisting of a conditional Variational Auto-Encoder (cVAE). For the second, we consider a baseline built from an auto-regressive transformer model. We provide a detailed evaluation of both baselines, and show promising results.

In summary, our contributions are threefold:

    We introduce the PoseFix dataset (Section 3) that associates pairs of 3D human poses and human-written textual descriptions of their differences.

    We introduce the task of text-based pose editing (Section 4), that can be tackled with a cVAE baseline.

    We study the task of correctional text generation with a conditioned auto-regressive model (Section 5).

2 Related Work

3D pose and text datasets. AMASS [38] gathers several datasets of 3D human motions in SMPL [37] format. BABEL [49] and HumanML3D [18] build on top of it to provide free-form text descriptions of the sequences, similarly to the earlier and smaller Kit Motion-Language dataset [48]. These datasets focus on sequence semantics (high-level actions) rather than individual pose semantics (fine-grained egocentric relations). To complement them, PoseScript [12] links static 3D human poses with descriptions in natural language about fine-grained pose aspects. However, PoseScript does not make it possible to relate two poses together in a straightforward way, as we attempt by introducing the new PoseFix dataset. In contrast to FixMyPose [26], the PoseFix dataset we introduce comprises poses from more diverse sequences, and the textual annotations were collected based on actual 3D data and not synthetic 2D image renderings (reduced depth ambiguity).

3D human pose generation. Previous works have mainly focused on the generation of pose sequences, conditioning on music [30, 31], context [10], past poses [63, 64], text labels [20, 46] and mostly on text descriptions [32, 1, 61, 2, 17, 47, 18, 56, 19, 27]. Some works push it one step further and also attempt to synthesize the mesh appearance [24, 62], leveraging large pretrained models like CLIP [50]. Similarly to PoseScript [12], we depart from generic actions and focus on static poses and fine-grained aspects of the human body, to learn about precise egocentric relations. However, we consider two poses instead of one, to capture detailed pose modifications. Different from ProtoRes [40], which proposes to manually design a human pose inside a 3D environment based on sparse constraints, we use text for controllability. Like PoseScript and VPoser [44], an (unconditioned) pose prior, we use a VAE-based [29] model to generate the 3D human poses.

Pose correctional feedback generation. Recent advances in text generation have led to a shift from recurrent neural networks [55] to large pretrained transformer models, such as GPT [8]. These models can be effectively conditioned using prompting [41] or cross-attention mechanisms [51]. While multi-modal text generation tasks, such as image captioning, have been extensively studied [34, 22, 58], no previous work has focused on using 3D human poses to generate free-form feedback. In this regard, AIFit [15] extracts 3D data to compare the video performance of a trainee against a coach’s and provides feedback based on predefined templates. [65] also outputs predefined texts for a small set of exercises, and [35] does not provide any natural language instructions either. Besides, FixMyPose [26] is based on highly-synthetic 2D images.

Compositional learning consists in using a query made of multiple distinct elements, which can be of different modalities, as for visual question answering [4] or composed image retrieval [59]. Similarly to the latter, we are interested in bi-modal queries composed of a textual “modifier” which specifies changes to apply on the first element. Modifiers first took the form of single-word attributes [43, 39, 14] and evolved into free-form texts [60, 36]. While a large body of works focus on text-conditioned image editing [25, 7, 23] or text-enhanced image search [59, 5, 11], few study 3D human body poses. ClipFace [3] proposes to edit 3D morphable face models and StyleGAN-Human [16] generates 2D images of human bodies in very model-like poses. PoseTutor [13] provides an approach to highlight joints with incorrect angles on 2D yoga/Pilates/kung-fu images. More related to our work, FixMyPose [26] performs composed image retrieval. In contrast to these works, we propose to generate a 3D pose based on an initial static pose and a modifier expressed in natural language.

3 The PoseFix dataset

To tackle the two pose correctional tasks considered in this paper, we introduce the new PoseFix dataset. It consists of 135k triplets of {pose A, pose B, text modifier}, where pose $B$ (the target pose) is the result of the correction of pose $A$ (the source pose), as specified by the text modifier. The 3D human body poses were sampled from AMASS [38]. All pairs were captioned in natural language using our automatic comparative pipeline; 6157 pairs were additionally presented to human annotators on the crowdsourcing platform Amazon Mechanical Turk. We next present the pair selection method, the annotation process and some dataset statistics.

3.1 Pair selection process

In- and Out-of-sequence pairs. Pose pairs can be of two types: “in-sequence” (IS) or “out-of-sequence” (OOS). In the first case, the two poses belong to the same AMASS sequence and are temporally ordered (pose $A$ happens before pose $B$). We select them with a maximum time difference of 0.5 second, so that the textual modifiers precisely describe atomic motion sub-sequences for which ground-truth motion is available. With a larger time difference between the two poses, there could be an infinity of plausible in-between motions, which would weaken this supervision signal. Out-of-sequence pairs are made of two poses from different sequences; they help generalize to less common motions and make it possible to study poses of similar configuration but different style, empowering “pose correction” besides “motion continuation”.

Selecting pose B. As we aim to obtain pose $B$ from pose $A$, we consider that pose $B$ guides the annotation the most: while the text modifier should account for pose $A$ and refer to it, its true target is pose $B$. Thus, to build the triplets, we first choose the set of poses $B$. To maximize the diversity of poses, we follow [12] and get a set $S$ of 100k poses sampled with a farthest-point algorithm. Poses $B$ are then iteratively selected from $S$.
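As an illustration of the farthest-point idea mentioned above, here is a minimal pure-Python sketch. It is not the authors' implementation: the function name `farthest_point_sampling`, the Euclidean distance, the choice of the first point, and the toy 2D "features" are all assumptions made for the example.

```python
import math

def farthest_point_sampling(features, k, start=0):
    """Greedy farthest-point sampling (illustrative sketch).

    features: list of equal-length float tuples (stand-ins for pose features).
    Returns the indices of the k selected items, in selection order.
    """
    n = len(features)
    k = min(k, n)
    selected = [start]
    # Distance from every point to its nearest already-selected point.
    min_dist = [math.dist(f, features[start]) for f in features]
    while len(selected) < k:
        # Pick the point that is farthest from the current selection.
        nxt = max(range(n), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i in range(n):
            min_dist[i] = min(min_dist[i], math.dist(features[i], features[nxt]))
    return selected

# Toy data: five 2D points standing in for pose features (assumption).
toy = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (0.0, 3.0), (5.0, 3.0)]
order = farthest_point_sampling(toy, k=3)
print(order)
```

The selection order matters here because the text later iterates over poses "following the order defined by $S$"; early items are, by construction, the most mutually distant ones.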

Selecting pose A. The paired poses should satisfy two main constraints. First, poses $A$ and $B$ should be similar enough for the text modifier not to become a complete description of pose $B$: if $A$ and $B$ are too different, it is easier for the annotator to just ignore $A$ and directly characterize $B$ [60, 36]. Yet, we aim at learning fine-grained and subtle differences between two poses. Hence, we rank all poses in $S$ with regard to each pose $B$ based on the cosine similarity of their PoseScript semantic pose features [12]. Pose $A$ is to be selected within the top 100. Second, the two poses should be different enough, so that the modifier does not collapse to oversimple instructions like ‘raise your right hand’, which would not be representative of realistic scenarios. While we expect the poses to be quite different as they belong to $S$, we go one step further and leverage posecode information [12] to ensure that the two poses differ in at least 15 (resp. 20) low-level properties for IS (resp. OOS) pairs.
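The two constraints above (similar enough, yet different enough) can be sketched as a ranking step followed by a filter. The code below is a hedged illustration only: `candidate_sources` is a hypothetical helper, the feature vectors stand in for PoseScript features, and counting "different properties" as the symmetric difference of two label sets is an assumption.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def candidate_sources(b_idx, features, posecodes, top_k=100, min_diff=15):
    """Indices of poses that could play the role of pose A for pose B.

    features:  list of feature vectors (stand-in for semantic pose features).
    posecodes: list of sets of low-level property labels, one set per pose.
    """
    # Constraint 1: keep the top_k poses most similar to B.
    ranked = sorted(
        (i for i in range(len(features)) if i != b_idx),
        key=lambda i: cosine_similarity(features[i], features[b_idx]),
        reverse=True,
    )[:top_k]
    # Constraint 2: keep only poses that differ on enough low-level properties.
    return [i for i in ranked if len(posecodes[i] ^ posecodes[b_idx]) >= min_diff]

# Toy example with tiny thresholds (all values are made up).
features = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (1.0, 0.05)]
posecodes = [
    {"l_knee_bent", "r_arm_up"},
    {"l_knee_bent", "r_arm_up"},
    {"l_knee_straight", "r_arm_down"},
    {"l_knee_straight", "r_arm_up", "hands_close"},
]
print(candidate_sources(0, features, posecodes, top_k=2, min_diff=2))
```

In the toy example, pose 1 is very similar to pose 0 but shares all its properties, so it is filtered out; pose 3 is both close in feature space and different on three properties.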

One- and Two-way pairs. We consider all possible IS pairs $A \to B$, with $A$ and $B$ in $S$, that meet the selection constraints. Then, following the order defined by $S$, we sample OOS pairs: for each selected pair $A \to B$, if $A$ was not already used for another pair, we also consider $B \to A$. We call such pairs ‘two-way’ pairs, as opposed to ‘one-way’ pairs. Two-way pairs could be used for cycle consistency.

Figure 3: Left: Data presented to the annotators. The slider makes it possible to look at the poses under different viewpoints. Right: word cloud of the PoseFix annotations.

| Property | Proportion | Example |
|---|---|---|
| Egocentric relations | 74% | Join your hands in front of your chest. |
| Analogies | 5% | like you’re about to clap your hands. |
| Implicit side description | 25% | Place your left toes on the ground and extend your Ø leg slightly. |

Table 1: Semantic analysis on 104 sampled human texts.

Splits. We use the same sequence-based split as [12], and perform pose pair selection independently in each subset. Since we also use the same ordered set $S$, some poses are annotated both with a description and a modifier: such complementary information can be used in a multitask setting.

3.2 Collection of human annotations

We collected the textual modifiers on Amazon Mechanical Turk from English-speaking annotators with a 95% approval rate who had already completed at least 5000 tasks. To limit perspective-based mistakes, we presented both poses rendered under different viewpoints (see Figure 3, left). An annotation could not be submitted until it was more than 10 words long and several viewpoints had been considered. The orientation of the poses was normalized so they would both face the annotator in the front view. For in-sequence pairs only, we applied the normalization of pose $A$ to pose $B$, to stay faithful to the global change of orientation in the ground-truth motion sequences.

The annotators were given the following instruction: “You are a coach or a trainer. Your student is in pose A, but should be in pose B. Please write the instructions so they can correct the pose on at least 3 aspects.” Annotators were required to describe the position of the body parts relative to the others (e.g. ‘Your right hand should be close to your neck.’), to use directions (such as ‘left’ and ‘right’) in the subject’s frame of reference and to mention the rotation of the body, if any. They were also encouraged to use analogies (e.g. ‘in a push-up pose’). For the annotations to be size-agnostic, distance metrics were prohibited.

The task was first made available to any worker in small batches. Annotations were carefully scrutinized, and only the best workers were qualified to proceed to larger batches, with lighter supervision. In total, about 15% of the annotations were manually reviewed, and corrected when needed. We further cleaned the annotations by fixing misspelled and duplicated words, detected automatically. Figure 2 shows some pose pairs and their annotated modifiers.

3.3 Generating annotations automatically

To scale up the dataset, we design a pipeline to automatically generate thousands of modifiers, by relying on low-level properties as in [12]. The process takes as input the 3D keypoint positions of two poses $A$ and $B$, and outputs a textual instruction to obtain pose $B$ from pose $A$. First, it measures and classifies the variation of atomic pose configurations to obtain a set of “paircodes”. For instance, we attend to the motion of the keypoints along each axis (“move the right hand slightly to the left” (x-axis), “lift the left knee” (y-axis)), to the variation of distance between two keypoints (“move your hands closer”) or to the angle change (“bend your left elbow”). We further define “super-paircodes”, resulting from the combination of several paircodes or posecodes [12]; e.g. the paircode “bend the left knee less”, associated with the posecode “the left knee is slightly bent” on pose $A$, leads to the super-paircode “straighten the left leg”. The super-paircodes make it possible to describe higher-level concepts or to refine some assessments (e.g. only tell to move the hands farther away from each other if they are close to begin with). The paircodes are next aggregated using the same set of rules as in [12], then they are structurally ordered, to gather information about the same general part of the body within the description. Ultimately, for each paircode, we sample and complete one of the associated template sentences. Their concatenation yields the automatic modifier. Please refer to the supplementary for more details. The whole process produced 135k annotations in less than 15 minutes. Some examples are shown in Figure 2. In this paper, we use the automatic data for pretraining only.
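To make the paircode idea more concrete, the following toy sketch classifies the displacement of each keypoint along the x- and y-axes and fills a template sentence. It only covers axis-wise motion, not distances, angles, super-paircodes or aggregation. The thresholds (`small`, `large`), the sign convention (positive x means "to the left") and the template wording are assumptions for illustration, not the values used in the actual pipeline.

```python
def axis_paircodes(name, pos_a, pos_b, small=0.05, large=0.20):
    """Template instructions for one keypoint moving from pose A to pose B.

    pos_a, pos_b: (x, y, z) positions; thresholds are hypothetical.
    """
    # (axis index, wording if the coordinate decreases, wording if it increases)
    axes = [(0, "to the right", "to the left"), (1, "down", "up")]
    sentences = []
    for axis, neg_word, pos_word in axes:
        delta = pos_b[axis] - pos_a[axis]
        if abs(delta) < small:
            continue  # negligible change: no paircode
        amount = "slightly " if abs(delta) < large else ""
        direction = pos_word if delta > 0 else neg_word
        sentences.append(f"move your {name} {amount}{direction}")
    return sentences

def automatic_modifier(keypoints_a, keypoints_b):
    """Concatenate the template sentences of all keypoints into one modifier."""
    sentences = []
    for name in keypoints_a:
        sentences.extend(axis_paircodes(name, keypoints_a[name], keypoints_b[name]))
    return ". ".join(s.capitalize() for s in sentences) + ("." if sentences else "")

# Toy poses with two keypoints each (made-up coordinates).
pose_a = {"right hand": (0.30, 1.00, 0.0), "left knee": (-0.10, 0.50, 0.0)}
pose_b = {"right hand": (0.40, 1.00, 0.0), "left knee": (-0.10, 0.80, 0.0)}
text = automatic_modifier(pose_a, pose_b)
print(text)
```

Because everything is rule-based and template-driven, such a procedure is cheap to run at scale, which is consistent with the short generation time reported above.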

3.4 Statistics and semantic analysis

Table 2: Number of pairs of each set and type.

| | automatic | human |
|---|---|---|
| in-sequence | 25,201 | 2,615 |
| out-of-sequence | 110,104 | 3,542 |
| both-way | 93,180 | 2,710 |
| one-way | 42,125 | 3,447 |
| total | 135,305 | 6,157 |

Table 3: Number of poses per type or shared with [12].

| | automatic | human |
|---|---|---|
| different poses | 99,231 | 7,433 |
| different poses A | 87,793 | 5,343 |
| different poses B | 98,939 | 5,922 |
| in PoseScript | 6,249 | 3,551 |
| A in PoseScript | 6,160 | 2,753 |
| B in PoseScript | 6,226 | 3,143 |

PoseFix contains 6157 (resp. 135k) human- (resp. automatically-) annotated pairs, split according to a 70%-10%-20% proportion. On average, human-written text modifiers are close to 30 words long, with a minimum of 10 words. Altogether, they form a cleaned vocabulary of 1068 words, a word cloud of which is shown in Figure 3 (right).

Negation particles were detected in 3.6% of the annotations, which makes textual queries with negations a bit harder, akin to similar datasets [60, 12]. A semantic analysis carried out on 104 annotations taken at random is reported in Table 1. We found that textual modifiers provide correctional instructions about 4 different body parts on average, which vary depending on the context (pose $A$).

Figure 4: Overview of our text-based pose editing baseline. The top part represents a standard VAE, where poses are encoded into a Gaussian distribution. At training time, a latent variable is sampled and decoded into a pose to learn pose reconstruction. The bottom left part represents the conditioning: the text is encoded using a frozen DistilBERT with a small transformer on top. It is combined with source pose features in the fusion module, from which we predict a Gaussian distribution. A KL loss ensures the alignment of the distributions from the standard VAE and the conditioning. At test time, we sample from the latter to predict the target pose.

A few other annotation behaviors were found to be quite difficult to quantify, in particular “missing” instructions. Sometimes, details are omitted in the text because the context given by pose $A$ is “taken for granted”. For instance, in the 3rd example shown in Figure 2, the “45-degree angle” is to be understood with regard to the “0 degree” plane defined by the back of the body in pose $A$. Moreover, the annotator did not specify how the position of the arms has changed, supposedly because this change comes naturally once the back is straightened up, from the structure of the kinematic chain. These challenges are inherent to the task.

Detailed statistics are presented in Tables 2 and 3.

4 Application to Text-based Pose Editing

We introduce a VAE [29] baseline to perform text-based 3D human pose editing. Specifically, we aim to generate plausible new poses based on two input elements: an initial pose $A$ providing some context (a starting point for modifications), and a textual modifier which specifies the changes to be made. Figure 4 gives an overview of our model.

Data processing. Poses are characterized by their SMPL-H [52] body joint rotations in axis-angle representation. Their global orientation is first normalized along the y-axis. For in-sequence pairs, the same normalization that was applied to pose $A$ is applied to pose $B$ in order to preserve information about the change of global orientation.

Training phase. During training, the model encodes both the query pose $A$ and the ground-truth target pose $B$ using a shared pose encoder, yielding respectively features $\mathbf{a}$ and $\mathbf{b}$ in $\mathbb{R}^d$. The tokenized text modifier is fed into a pretrained embedding module to extract expressive word encodings. These are further processed by a learned textual model, to yield a global textual representation $\mathbf{m} \in \mathbb{R}^n$. Next, the two input embeddings $\mathbf{a}$ and $\mathbf{m}$ are provided to a fusing module which outputs a single vector $\mathbf{p} \in \mathbb{R}^d$. Both $\mathbf{b}$ and $\mathbf{p}$ then go through specific fully connected layers to produce the parameters of two Gaussian distributions: the posterior $\mathcal{N}_b = \mathcal{N}(\cdot \mid \boldsymbol{\mu}(\mathbf{b}), \boldsymbol{\Sigma}(\mathbf{b}))$ and the prior $\mathcal{N}_p = \mathcal{N}(\cdot \mid \boldsymbol{\mu}(\mathbf{p}), \boldsymbol{\Sigma}(\mathbf{p}))$ conditioned on $\mathbf{p}$ from the fusion of $\mathbf{a}$ and $\mathbf{m}$. Finally, a sampled latent variable $\mathbf{z}_b \sim \mathcal{N}_b$ is decoded into a reconstructed pose $\hat{B}$.

The loss consists of the sum of a reconstruction term $\mathcal{L}_R(B, \hat{B})$ and the Kullback-Leibler (KL) divergence between $\mathcal{N}_b$ and $\mathcal{N}_p$. The former enables the generation of plausible poses, while the latter acts as a regularization term to align the two spaces. The combined loss is then:

$$\mathcal{L}_{\text{pose editing}} = \mathcal{L}_{R}(B, \hat{B}) + \mathcal{L}_{KL}(\mathcal{N}_b, \mathcal{N}_p). \tag{1}$$

We use the same negative log likelihood-based reconstruction loss as in [12]: it is applied to the output joint rotations in the continuous 6D representation [67], and to both the joint and vertex positions inferred from the output by the SMPL-H [52] model.
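For intuition, the KL term of Eq. (1) has a closed form when both Gaussians have diagonal covariance. The sketch below assumes diagonal covariances and the direction KL(posterior || prior), neither of which is stated explicitly in the text; the reconstruction term is treated as an already-computed scalar, and the function names are hypothetical.

```python
import math

def kl_diag_gaussians(mu_b, var_b, mu_p, var_p):
    """KL( N(mu_b, diag(var_b)) || N(mu_p, diag(var_p)) ) in closed form.

    Assumes diagonal covariances; each argument is a list of floats.
    """
    return 0.5 * sum(
        math.log(vp / vb) + (vb + (mb - mp) ** 2) / vp - 1.0
        for mb, vb, mp, vp in zip(mu_b, var_b, mu_p, var_p)
    )

def pose_editing_loss(reconstruction_nll, mu_b, var_b, mu_p, var_p):
    """Sum of a (precomputed) reconstruction term and the KL term, as in Eq. (1)."""
    return reconstruction_nll + kl_diag_gaussians(mu_b, var_b, mu_p, var_p)

# Identical distributions give a zero KL term.
print(kl_diag_gaussians([0.0, 1.0], [1.0, 2.0], [0.0, 1.0], [1.0, 2.0]))
# Unit-variance Gaussians whose means differ by 1 in one dimension.
print(kl_diag_gaussians([0.0], [1.0], [1.0], [1.0]))
```

Minimizing this term pulls the text-and-pose-conditioned prior toward the posterior computed from the target pose, which is what allows sampling from the prior alone at test time.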

Inference phase. The input pose $A$ and the text modifier are processed as in the training phase. However, this time we sample $\mathbf{z}_p \sim \mathcal{N}_p$ to obtain the generated pose $\hat{B}$.

Evaluation metrics. We report the Evidence Lower Bound (ELBO) for the size-normalized rotations, joints and vertices, as well as the Fréchet inception distance (FID), which compares the distribution of the generated poses with that of the expected poses, based on their semantic PoseScript features. The ELBO and the FID are mostly sensitive to complementary traits (support coverage and sample quality respectively). When a setting does not improve all metrics, we base our decisions on the metrics with the largest difference. While the ELBO is better suited to evaluate generative models than reconstruction metrics, for intuitiveness, we also report the MPJE (mean-per-joint error, in mm), the MPVE (mean-per-vertex error, in mm) and the geodesic distance for joint rotations (in degrees) between the target and the best (i.e., closest) generated sample out of $N = 30$ in all experiments.
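The "best of $N$" reconstruction protocol can be sketched as follows for the MPJE. This is an illustrative reading of the description above, with hypothetical function names; joints are plain (x, y, z) tuples and the unit is whatever unit the inputs use.

```python
import math

def mpje(joints_pred, joints_gt):
    """Mean per-joint Euclidean error between two lists of 3D joints."""
    errors = [math.dist(p, g) for p, g in zip(joints_pred, joints_gt)]
    return sum(errors) / len(errors)

def best_of_n_mpje(samples, joints_gt):
    """MPJE of the closest sample among the N generated poses."""
    return min(mpje(sample, joints_gt) for sample in samples)

# Toy target with two joints, and N = 2 generated samples (made-up values).
target = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
samples = [
    [(0.0, 0.0, 3.0), (0.0, 1.0, 4.0)],   # per-joint errors 3 and 4
    [(0.0, 0.0, 0.0), (0.0, 1.0, 1.0)],   # per-joint errors 0 and 1
]
print(best_of_n_mpje(samples, target))
```

Taking the minimum over samples rewards a model that covers the target with at least one guess, which is why this number should be read together with the ELBO and FID rather than on its own.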

Architecture details and ablations. We use the VPoser [44] architecture for the pose auto-encoder, resulting in features of dimension $d = 32$. The variance of the decoder is considered a learned constant [53]. We experiment with two different text encoders (Table 4, top): (i) a bi-GRU [9] mounted on top of pretrained GloVe word embeddings [45], or (ii) a transformer followed by average-pooling, processing frozen DistilBERT [54] word embeddings. We find that the transformer pipeline outperforms the other in terms of ELBO (+0.24 on average) when no additional pretraining is involved, supposedly because it uses already strong general-pretrained weights. Pretraining on our automatic modifiers brings the bi-GRU pipeline on par with the transformer one (+0.04). For simplicity, we use the former (GloVe + bi-GRU) hereafter.

For fusion, we use TIRG [59], a widely used module for compositional learning. It consists of a gating mechanism composed of two 2-layer Multi-Layer Perceptrons (MLP) $f$ and $g$ balanced by learned scalars $w_f$ and $w_g$ such that:

$$\mathbf{p} = w_f \, f([\mathbf{a}, \mathbf{m}]) \odot \mathbf{a} + w_g \, g([\mathbf{a}, \mathbf{m}]). \tag{2}$$

It is designed to ‘preserve’ the main modality feature $\mathbf{a}$ while applying the modification as a residual connection.
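The fusion of Eq. (2) can be written in a few lines of dependency-free Python. This is a sketch under assumptions: the hidden size, the ReLU non-linearity and the final sigmoid on $f$ (so that it behaves as a gate, as in the original TIRG formulation) are not specified in the text, and `zero_mlp` is a hypothetical helper used only to build a predictable example.

```python
import math

def linear(x, weights, bias):
    """Affine map: one output per (weight row, bias) pair."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def mlp2(x, layer1, layer2):
    """2-layer MLP with a ReLU in between; each layer is (weights, bias)."""
    hidden = [max(0.0, h) for h in linear(x, *layer1)]
    return linear(hidden, *layer2)

def tirg_fuse(a, m, f_params, g_params, w_f=1.0, w_g=1.0):
    """p = w_f * f([a, m]) * a (element-wise) + w_g * g([a, m]), as in Eq. (2).

    Assumption: f ends with a sigmoid, so it gates each dimension of a.
    """
    x = list(a) + list(m)  # concatenation [a, m]
    gate = [1.0 / (1.0 + math.exp(-v)) for v in mlp2(x, *f_params)]
    residual = mlp2(x, *g_params)
    return [w_f * gi * ai + w_g * ri for gi, ai, ri in zip(gate, a, residual)]

def zero_mlp(d_in, d_hidden, d_out):
    """Hypothetical helper: an MLP whose weights and biases are all zero."""
    layer1 = ([[0.0] * d_in for _ in range(d_hidden)], [0.0] * d_hidden)
    layer2 = ([[0.0] * d_hidden for _ in range(d_out)], [0.0] * d_out)
    return layer1, layer2

a = [0.2, -0.4, 1.0]  # toy pose feature (d = 3)
m = [0.5, 0.1]        # toy text feature (n = 2)
p = tirg_fuse(a, m, zero_mlp(5, 4, 3), zero_mlp(5, 4, 3))
print(p)  # with all-zero MLPs, the gate is 0.5 everywhere and the residual is 0
```

With untrained (zero) MLPs the output is simply a scaled copy of $\mathbf{a}$, which illustrates why the module tends to preserve the pose feature unless the text pushes the gate or the residual away from their defaults.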

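Text encoder (with/without pretraining)

| Pretraining | Text encoder | FID ↓ | ELBO jts ↑ | ELBO v2v ↑ | ELBO rot ↑ | MPJE ↓ (best of 30) | MPVE ↓ (best of 30) | Geodesic ↓ (best of 30) |
|---|---|---|---|---|---|---|---|---|
| without | GloVe + bi-GRU | 0.19 | 0.61 | 1.51 | 0.50 | 278 | 217 | 9.89 |
| without | DistilBERT + transformer | 0.10 | 0.95 | 1.51 | 0.63 | 226 | 180 | 9.22 |
| with | GloVe + bi-GRU | 0.02 | 1.40 | 1.88 | 0.99 | 199 | 165 | 8.59 |
| with | DistilBERT + transformer | 0.02 | 1.37 | 1.84 | 0.93 | 201 | 167 | 8.70 |

Data augmentations (with/without pretraining, GloVe + bi-GRU config)

| Pretraining | Augmentation | FID ↓ | ELBO jts ↑ | ELBO v2v ↑ | ELBO rot ↑ | MPJE ↓ (best of 30) | MPVE ↓ (best of 30) | Geodesic ↓ (best of 30) |
|---|---|---|---|---|---|---|---|---|
| without | no augmentation | 0.19 | 0.61 | 1.51 | 0.50 | 278 | 217 | 9.89 |
| without | + L/R flip | 0.13 | 1.10 | 1.73 | 0.58 | 250 | 196 | 9.57 |
| without | + paraphrases | 0.19 | 0.90 | 1.45 | 0.58 | 233 | 186 | 9.44 |
| without | + PoseMix | 0.10 | 0.63 | 1.12 | 0.58 | 254 | 202 | 9.61 |
| without | + PoseMix & PoseCopy | 0.04 | 1.03 | 1.50 | 0.78 | 221 | 178 | 9.07 |
| with | no augmentation | 0.02 | 1.40 | 1.88 | 0.99 | 199 | 165 | 8.59 |
| with | + L/R flip | 0.02 | 1.47 | 1.94 | 0.97 | 197 | 163 | 8.65 |
| with | + paraphrases | 0.02 | 1.43 | 1.90 | 0.97 | 198 | 164 | 8.58 |
| with | + PoseMix | 0.06 | 0.68 | 1.13 | 0.91 | 214 | 174 | 8.74 |
| with | + PoseMix & PoseCopy | 0.03 | 1.23 | 1.71 | 0.98 | 208 | 172 | 8.75 |
| with | + L/R flip & paraphrases | 0.02 | 1.44 | 1.92 | 0.97 | 196 | 162 | 8.62 |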

Table 4: Text-based pose editing results for various architectures, data augmentations and training strategies. We show the best result in bold and underline the second best.

Training data and augmentations ablations. We experiment with several kinds of data augmentations and training data. Corresponding results are reported in Table 4 (bottom). First, we try left/right flipping by swapping the rotations of the left and right body joints (e.g. the left hand becomes the right hand) and changing the text accordingly. This significantly improves the relevance of the generated poses (ELBO), especially when the model did not benefit from pretraining on diverse synthetic data (+37% average improvement of the ELBO).
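The text side of the left/right flip can be approximated with a word-level swap, as in the hedged sketch below. It only handles the modifier text (the pose side, i.e. swapping left and right joint rotations, is not shown), and `flip_text` is a hypothetical helper, not the function used by the authors.

```python
import re

_SWAP = {"left": "right", "right": "left"}

def flip_text(text):
    """Swap the whole words 'left' and 'right', preserving initial capitals."""
    def repl(match):
        word = match.group(0)
        swapped = _SWAP[word.lower()]
        return swapped.capitalize() if word[0].isupper() else swapped
    # \b keeps words such as 'upright' untouched.
    return re.sub(r"\b(left|right)\b", repl, text, flags=re.IGNORECASE)

original = "Move your left hand to the right, staying upright."
flipped = flip_text(original)
print(flipped)
```

A purely lexical swap like this one would also flip expressions such as "right angle", so a real pipeline would likely need a short list of exceptions; that caveat is an observation about the sketch, not about the paper's method.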

Next, we use InstructGPT [41] to obtain 2 paraphrases per annotation. This form of data augmentation was found helpful, particularly when training on a small amount of data, i.e., without pretraining (+20%).

In order to encourage the model to fully leverage the textual cue, we define PoseMix, which gathers both the PoseScript [12] and the PoseFix datasets. When training with PoseScript data, which consists of pairs of poses and textual descriptions, we set pose $A$ to 0. We observe only a mixed improvement, and even a drop in performance in the pretrained case. One possible reason for that is the difference in formulation between PoseScript descriptions (“The person is … with their left hand…”) and PoseFix modifiers (“Move your left hand…”). Another is that the model then learns to ignore $A$, which is nonetheless crucial in the PoseFix setting. To circumvent this last-mentioned issue, we improve the balance of the training data by introducing PoseCopy. This consists in providing the model with the same pose in the role of pose $A$ and pose $B$, along with an empty modifier, assuming that a non-existent textual query will force the model to attend to pose $A$. The PoseMix & PoseCopy setting yields a great improvement over all metrics for the non-pretrained case (+41%). This further shows that the formulation gap was not the main issue. As a side product, the fusing branch is now able to work as a pseudo auto-encoder, and to output a copy of the input pose when no modification instruction is provided.
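The two kinds of extra training triplets described above can be summarized with the following sketch. The dictionary layout, the function names and the flat list-of-floats pose representation are assumptions made for illustration.

```python
def posemix_sample(pose, description):
    """PoseScript-style item cast as a triplet: pose A is set to 0."""
    return {"pose_a": [0.0] * len(pose), "text": description, "pose_b": list(pose)}

def posecopy_sample(pose):
    """PoseCopy item: same pose as source and target, with an empty modifier."""
    return {"pose_a": list(pose), "text": "", "pose_b": list(pose)}

toy_pose = [0.1, -0.2, 0.3]  # made-up pose parameters
mix = posemix_sample(toy_pose, "The person is standing with their left hand raised.")
copy = posecopy_sample(toy_pose)
print(mix["pose_a"], repr(copy["text"]))
```

The first kind of item makes the text the only informative input, while the second makes pose $A$ the only informative input, which is one way to read the "balance" argument made in the paragraph above.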

Overall, pretraining has a more significant impact than using any kind of data augmentation (+84%). Besides, the data augmentations become much less effective in this setting (+1%). The model thus benefits more from pretraining on a large set of new pairs with synthetic instructions than from training on more human-written modifiers of the same pose pairs. We obtain our best model by combining pretraining, left/right flip and paraphrases (last row).

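Pair subset

| Subset (number of pairs) | FID ↓ | ELBO jts ↑ | ELBO v2v ↑ | ELBO rot ↑ | MPJE ↓ (best of 30) | MPVE ↓ (best of 30) | Geodesic ↓ (best of 30) |
|---|---|---|---|---|---|---|---|
| in-sequence (530) | 0.04 | 1.33 | 1.78 | 0.88 | 188 | 154 | 8.47 |
| out-of-sequence (709) | 0.03 | 1.53 | 2.02 | 1.04 | 206 | 168 | 8.80 |
| full PoseFix test set (1239) | 0.02 | 1.44 | 1.92 | 0.97 | 196 | 162 | 8.62 |

Input type (full PoseFix test set - 1239)

| Input | FID ↓ | ELBO jts ↑ | ELBO v2v ↑ | ELBO rot ↑ | MPJE ↓ (best of 30) | MPVE ↓ (best of 30) | Geodesic ↓ (best of 30) |
|---|---|---|---|---|---|---|---|
| pose $A$ only | 0.04 | 1.43 | 1.92 | 0.97 | 219 | 180 | 8.91 |
| modifier only | 0.42 | 1.30 | 1.92 | 0.92 | 378 | 339 | 13.03 |
| pose $A$ + modifier | 0.02 | 1.44 | 1.92 | 0.97 | 196 | 162 | 8.62 |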

Table 5: Pose editing results for various subsets and input types, using the best model as per Table 4.
Figure 5: Generated poses for the text-based pose editing task on PoseFix queries from the left blocks. Two views of each pose are shown on the same ground plane for better visualization of the 3D. Generated poses are shown in blue. Original poses B from the PoseFix dataset are in the supplementary material.

Detailed analysis. In Table 5, we evaluate our best pose editing model on several subsets of pairs and with different input types.

First, we notice higher ELBO performance on the out-of-sequence (OOS) pair set than on the in-sequence (IS) set, suggesting that pairs in the latter are harder. This may be due to pose $A$ and pose $B$ being more similar in IS than in OOS, as they belong to the same sequence with a maximum delay of 0.5s. We indeed measure a mean per-joint distance of 311mm between $A$ and $B$ in IS vs. 350mm in OOS: the differences between IS poses thus ought to be more subtle, yielding more complex modifiers. This drop in ELBO performance also shows that the model struggles more with IS modifiers, meaning that it most probably generates, on average, poses that are close to pose $A$; in other words, it takes guesses in the surroundings of pose $A$. This would actually be a good fall-back strategy, because the two poses are rather similar in general. In the IS case, since pose $A$ and pose $B$ are particularly close to each other, the model may end up finding, with enough guesses, a pose closer to pose $B$ than it would in the OOS case, where the two poses are more different. This could explain why the reconstruction metrics using the best sample out of 30 are lower for the IS subset than for the OOS subset.
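To make the "best sample out of 30" reasoning concrete, the following minimal sketch computes a mean per-joint error and keeps the lowest value over a set of generated samples. The function names, the random toy data and the noise level are illustrative assumptions, not the authors' evaluation code.

```python
import numpy as np

def mean_per_joint_error(pred, target):
    """Mean Euclidean distance between corresponding joints, for (J, 3) arrays."""
    return float(np.linalg.norm(pred - target, axis=-1).mean())

def best_of_n_error(samples, target):
    """Lowest mean per-joint error among N generated samples of shape (N, J, 3)."""
    return min(mean_per_joint_error(s, target) for s in samples)

rng = np.random.default_rng(0)
pose_b = rng.normal(size=(52, 3))                            # hypothetical target joints
samples = pose_b + rng.normal(scale=0.1, size=(30, 52, 3))   # 30 noisy guesses (toy data)
single = mean_per_joint_error(samples[0], pose_b)
best = best_of_n_error(samples, pose_b)                      # never worse than any single guess
```

Because the minimum is taken over all samples, guesses scattered around a starting pose that is already near the target are more likely to land close to it, which is the effect discussed above for the IS subset.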

Next, we compare the results when querying with pose $A$ only or with the modifier only. The former already achieves high performance, showing that the initial pose $A$ alone provides a good approximation of the expected pose $B$: indeed, the pair selection process constrained pose $A$ and pose $B$ to be quite similar. The latter yields poor FID and reconstruction metrics: the textual cue is only a modifier, and the same instructions could apply to a large variety of poses. Looking around pose $A$ remains a better strategy than relying on the modifier alone to generate the expected pose. Ultimately, both parts of the query are complementary: pose $A$ serves as a strong contextual cue, and the modifier guides the search starting from it (the pose being provided through the gating mechanism in TIRG). Both are crucial to reach pose $B$ (last row).

Figure 6: Overview of our baseline for correctional text generation. The bottom part represents a standard auto-regressive transformer model: the next word is predicted from the previously generated tokens. The decoder outputs a distribution of probabilities over the vocabulary for each token. The top part represents the conditioning on the pose pair: the two pose embeddings are fused together into a set of “pose tokens”, further used for conditioning via prompting or via cross-attentions in the transformer. At inference, the modifier is generated iteratively using the greedy approach.

Qualitative results. Last, we present qualitative results for text-based 3D human pose editing in Figure 5. It appears that the model has a relatively good semantic comprehension of the different body parts and of the actions to modify their positions. Some egocentric relations (“Raise your right elbow slightly.”, first row) are better understood than others, in particular contact requirements (“Bend your elbow so it’s almost touching the inside of your knee”, second row). When some specifications are missing, the model generates various pose configurations (e.g. the extent of the left leg extension in the first example). It can handle a number of instructions at once (third row), but may fail to attend to all of them. Crouching and lying-down poses are the most challenging (see the failure case in the last row, and how the crouch is hardly preserved in the third row).

5 Application to correctional text generation

We next present a baseline for correctional text generation. We aim to produce feedback in natural language explaining how the source pose $A$ should be modified to obtain the target pose $B$. We rely on an auto-regressive model conditioned on the pose pair, which iteratively predicts the next word given the previously generated ones (see Fig. 6).

Training phase. Let $T_{1:L}$ be the $L$ tokens of the text modifier. An auto-regressive generative model seeks to predict the next token $l+1$ from the first $l$ tokens $T_{1:l}$. Let $p(\cdot|T_{1:l})$ be the predicted probability distribution over the vocabulary. The model is trained, via a cross-entropy loss, to maximize the probability of generating the ground-truth token $T_{l+1}$ given the previous ones: $p(T_{l+1}|T_{1:l})$.

To predict $p(\cdot|T_{1:l})$, the tokens $T_{1:l}$ are first embedded, and positional encodings are injected. The result is fed to a series of transformer blocks [57], and projected into a space whose dimension is the vocabulary size $q$. Let $\mathbf{t} \in \mathbb{R}^{q}$ denote the outcome. The probability distribution over the vocabulary for the next token, $p(\cdot|T_{1:l})$, is then obtained as $\mathrm{Softmax}(\mathbf{t})$.

Transformer-based auto-regressive models can be trained efficiently, in a single pass, using causal attention masks which, for each token $l$, prevent the network from attending to any future token $l' > l$.
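The training objective described above can be sketched in a few lines of NumPy: a lower-triangular causal mask, and a next-token cross-entropy computed for all positions at once. The function names and toy shapes are assumptions made for illustration; the actual transformer that would produce the logits is left out.

```python
import numpy as np

def causal_mask(length):
    """Boolean (L, L) mask: entry [i, j] is True when position i may attend to j <= i."""
    return np.tril(np.ones((length, length), dtype=bool))

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def next_token_loss(logits, tokens):
    """Mean cross-entropy of predicting tokens[l+1] from the logits at position l.

    logits: (L, q) outputs of a causally masked decoder; tokens: (L,) integer ids.
    """
    log_probs = log_softmax(logits[:-1])   # predictions made at positions 1..L-1
    targets = tokens[1:]                   # ground-truth next tokens
    return float(-log_probs[np.arange(len(targets)), targets].mean())

mask = causal_mask(4)
tokens = np.array([0, 3, 1, 2])                             # toy token ids
uniform_loss = next_token_loss(np.zeros((4, 5)), tokens)    # uniform prediction over q=5
```

With uniform logits over a vocabulary of size 5, the loss equals log(5), which gives a quick sanity check for the implementation.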

Now, how do poses come into the picture? Pose $A$ and pose $B$ are encoded using a shared encoder, and combined in the fusing module, which outputs a set of $N$ ‘pose’ tokens. To condition the text generation on pose information, we experiment with two alternatives: those pose tokens can either be used for prompting, i.e., added as extra tokens at the beginning of the text modifier, or serve in cross-attention mechanisms within the text transformer.

Inference phase. For inference, we provide the model with the pose tokens and the special <BOS> token, which indicates the beginning of the sequence. We decode the output $\mathbf{t}_{2}$ in a greedy fashion, i.e., we predict the next token as the word that maximizes the likelihood (equivalently, minimizes the negative log-likelihood). We proceed iteratively, giving the previously decoded tokens $T_{1:l}$ to the model so as to obtain the subsequent token $l+1$, until the <EOS> token (denoting the end of the sequence) is decoded.
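The greedy decoding loop can be sketched as follows. The special-token ids, the `max_len` safeguard and the scripted `toy_model` are hypothetical stand-ins for the pose-conditioned decoder, which is not reproduced here.

```python
import numpy as np

BOS, EOS = 0, 1  # hypothetical ids for the special tokens

def greedy_decode(next_token_logits, max_len=50):
    """Greedy decoding: start from <BOS> and repeatedly append the arg-max token.

    next_token_logits(tokens) -> (q,) array of scores for the next token.
    Decoding stops when <EOS> is produced or max_len tokens are reached.
    """
    tokens = [BOS]
    while len(tokens) < max_len:
        next_id = int(np.argmax(next_token_logits(tokens)))
        tokens.append(next_id)
        if next_id == EOS:
            break
    return tokens

def toy_model(tokens):
    """Stand-in for the decoder: emits tokens 2, 3, 4, then <EOS>."""
    script = [2, 3, 4, EOS]
    logits = np.zeros(5)
    logits[script[min(len(tokens) - 1, len(script) - 1)]] = 1.0
    return logits

decoded = greedy_decode(toy_model)
```

Since the softmax is monotonic, taking the arg-max of the logits selects the same token as taking the arg-max of the probabilities.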

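| | R@1 ↑ | R@2 ↑ | R@3 ↑ | BLEU-4 ↑ | Rouge-L ↑ | METEOR ↑ | Reconstruction MPJE ↓ (best of 30) | Reconstruction MPVE ↓ (best of 30) | Reconstruction Geodesic ↓ (best of 30) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Control measures | | | | | | | | | |
| random text | 3.13 | 6.25 | 9.38 | 7.11 | 26.33 | 26.88 | 225 | 185 | 9.07 |
| original text | 62.71 | 74.01 | 79.26 | 100.00 | 100.00 | 100.00 | 196 | 162 | 8.62 |
| Injection type (with/without pretraining) | | | | | | | | | |
| without pretraining, prompt | 3.63 | 7.10 | 10.73 | 9.74 | 31.88 | 27.72 | 226 | 184 | 8.94 |
| without pretraining, cross-attention | 6.78 | 12.27 | 17.35 | 10.62 | 31.66 | 28.74 | 220 | 180 | 8.85 |
| with pretraining, prompt | 15.09 | 22.28 | 30.35 | 11.15 | 32.58 | 29.76 | 211 | 175 | 8.79 |
| with pretraining, cross-attention | 58.43 | 71.35 | 77.56 | 12.19 | 33.94 | 31.30 | 192 | 161 | 8.55 |
| Data augmentations (with pretraining & cross-attention injection) | | | | | | | | | |
| no augmentation | 58.43 | 71.35 | 77.56 | 12.19 | 33.94 | 31.30 | 192 | 161 | 8.55 |
| with L/R flip | 60.69 | 71.51 | 78.85 | 12.14 | 34.02 | 30.90 | 189 | 159 | 8.54 |
| with paraphrases | 53.91 | 67.72 | 74.98 | 10.56 | 33.07 | 30.15 | 194 | 162 | 8.65 |
| with PoseMix | 45.12 | 56.66 | 64.89 | 10.94 | 33.22 | 30.12 | 197 | 164 | 8.74 |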
Table 6: Correctional text generation results for various pose injections and data augmentations. For reference, we also provide numbers for the ground-truth texts and an annotated text chosen at random.
Figure 7: Generated correctional texts for PoseFix pose pairs (pose $A$ is grey, pose $B$ is purple). The original human annotations for these pose pairs are available in the supplementary material.

Evaluation metrics. We resort to standard natural language metrics: BLEU-4 [42], Rouge-L [33] and METEOR [6], which measure different kinds of n-gram overlap between the reference text and the generated one. Yet, we notice that these metrics do not reliably reflect the model quality for this task. Indeed, we only have one reference text and, given the initial pose, very different instructions can lead to the same result (e.g. “lower your arm at your side” and “move your right hand next to your hip”); it is not just a matter of formulation. Thus, we report the top-k R-precision metrics proposed in TM2T [19]: we use contrastive learning to train a joint embedding space for the modifiers and the concatenation of poses $A$ and $B$, then we look at the ranking of the correct pose pair for each generated text within a set of 32 pose pairs. We also report reconstruction metrics on the pose generated by our best model from Section 4 using the generated text. These added metrics assess the semantic correctness of the generated texts.
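As an illustration of the retrieval-based metric, the sketch below computes top-k R-precision from precomputed embeddings. It assumes cosine similarity and a single pool in which row i of the text embeddings matches row i of the pose-pair embeddings; how the pools of 32 pairs are actually sampled, and the embedding model itself, are not specified here and are left as assumptions.

```python
import numpy as np

def top_k_r_precision(text_emb, pair_emb, ks=(1, 2, 3)):
    """Fraction of texts whose matching pose pair ranks within the top k.

    text_emb, pair_emb: (n, d) arrays; row i of each forms a matching pair,
    so the n rows of pair_emb act as the candidate pool for every text.
    """
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    p = pair_emb / np.linalg.norm(pair_emb, axis=1, keepdims=True)
    sims = t @ p.T                              # cosine similarities, shape (n, n)
    correct = np.diag(sims)[:, None]            # similarity to the matching pair
    ranks = (sims > correct).sum(axis=1)        # 0 means the match is ranked first
    return {k: float((ranks < k).mean()) for k in ks}

rng = np.random.default_rng(0)
pairs = rng.normal(size=(32, 16))               # toy pool of 32 pose-pair embeddings
perfect = top_k_r_precision(pairs, pairs)       # identical embeddings: always ranked first
shifted = top_k_r_precision(pairs, np.roll(pairs, 1, axis=0))  # misaligned pool
```

In the misaligned case, each text is most similar to a non-matching row, so the top-1 score drops to zero.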

Quantitative results are presented in Table 6. We experiment with the same fusing module as before: TIRG [59], where the gating is applied on the leading pose of the pair (pose $B$), thus using $N=1$. We try both prompting and cross-attention to inject the pose information into the text decoder, and find the latter to yield the best results. Pretraining on automatic modifiers significantly boosts the performance: e.g., with cross-attention injection, R@2 increases from 12.27% to 71.35%. Regarding data augmentations, the left/right flip yields additional gains (+1.7% of average R-precision), with results close to those obtained with the ground-truth texts, both for R-precision and reconstruction. Even if the generated text does not have the same wording as the original text (low NLP metrics), it manages, combined with pose $A$, to produce a satisfactory pose $\hat{B}$, meaning that it carries the right correctional information. Of course, one should recall that the added metrics rely on imperfect models, which have their own limitations. Finally, we observe a decrease in performance with the paraphrases or the PoseMix settings: we hypothesize that these settings are harder than the regular one for this task, due to new words and formulations.

Qualitative results. Fig. 7 shows some generated texts. The model is able to produce satisfying feedback: it generates egocentric relations (third and fourth examples) and groups indications by body part (second column). However, it tends to mix up pose $A$ and pose $B$ (last two examples). It also sometimes describes only a subset of the differences.

6 Conclusion

This paper lays the groundwork for investigating the challenge of correcting 3D human poses using natural language instructions. Going beyond existing methods that utilize language to model global motion or entire body poses, we aim to capture the subtle differences between pairs of body poses, which requires a new level of semantic understanding. For this purpose, we have introduced PoseFix, a novel dataset with paired poses and their corresponding correctional descriptions. We also presented promising results for two baselines which address the derived tasks of text-based pose editing and correctional text generation.

Acknowledgments. This work is supported by the Spanish government with the project MoHuCo PID2020-120049RB-I00, and by NAVER LABS Europe under technology transfer contract ‘Text4Pose’.


Supplementary Material

In this supplementary material, we first provide additional details and statistics on the PoseFix dataset in Section A. The original triplets from PoseFix for the generated results presented in the main paper are available in Section B. Additional visualizations are provided in Section C. Finally, we give implementation details in Section D.

Appendix A PoseFix complementary information

In this section, we provide additional details about the creation of the PoseFix dataset.

A.1 Human annotations

Sequences of origin. The poses in PoseFix were extracted from AMASS [38]. In Figure A1, we present the proportion of poses coming from each of the datasets included in AMASS. We notice that most poses belong to the DanceDB dataset (44%), presumably because this is where the poses are the most diverse. Recall that poses were chosen following a farthest-point sampling algorithm to ensure we would get a diverse subset of poses. Besides, we note that most of the sequences available in DanceDB (94%) and MPI-limits (83%) provided at least one pose to PoseFix, which suggests that PoseFix could help in handling very complex, extreme poses.

Figure A1: Origin of the human-annotated poses in PoseFix. The top plot shows the proportion of poses in PoseFix that come from each sub-dataset in AMASS [38]. The lower plot shows the proportion of sequences, in each of the sub-datasets, that provided at least one pose to PoseFix.
Figure A2: Distribution of the number of words in the human-written annotations from PoseFix.

Turkers' qualifications and statistics. The annotations were collected on Amazon Mechanical Turk. Participating workers (“Turkers”) had to come from English-speaking countries (Australia, Canada, New Zealand, United Kingdom, USA), have completed at least 5,000 other tasks, and have an approval rate greater than 95%. In total, 105 different annotators participated. We qualified 20 of them for access to the larger batches, on the basis of at least 3 good annotations. Another 50 workers were excluded from our annotation task because of poor writing, misunderstanding of the task or cheating. The remaining participants did not complete enough annotations of good quality to be qualified for accessing more. In the end, over 90% of the annotations were made by 8 annotators.

Pricing. Properly completing an annotation, after a bit of training, was timed to take approximately 1 min 10 s. Annotations from the smaller qualifying batches were rewarded $0.25 each. Once a worker had completed 3 of them correctly, they were granted access to the larger batches, where annotations were rewarded $0.32 each, based on the minimum wage in California for 2023. We additionally paid a 10% bonus for every 30 annotations.

Figure A3: Automatic Comparative Pipeline, which generates modifiers based on the 3D keypoint coordinates of two input poses. L (resp. R) stands for ‘left’ (resp. ‘right’).

Quality assessment. Annotations from the early, smaller qualifying batches, which were open to any worker, were systematically reviewed. In contrast, only up to 10% of the trusted workers' annotations were randomly selected for manual review. The quality of the annotations was assessed based on the following criteria:

  • completeness: most of the differences between pose $A$ and pose $B$ were addressed in the annotation;

  • direction accuracy: the annotation explains how to go from pose $A$ to pose $B$, and not the reverse;

  • left/right accuracy: the words ‘left’ and ‘right’ were used in the body’s frame of reference;

  • 3D consideration: the annotation fits the 3D information, no guess was taken on occluded body parts, or ambiguous postures;

  • no distance metric: the annotation does not contain any distance metric (e.g., ‘one meter apart’), which would not scale to bodies of different size;

  • writing quality: correct grammar and formulation.

Length of the human-written annotations. Figure A2 shows the length distribution of the collected annotations. Here, length refers to the number of words, excluding punctuation. While the annotations were constrained to be at least 10 words long, they tend to count about 30 words, suggesting that the differences between two similar poses $A$ and $B$ are both subtle and numerous.

A.2 Automatic annotations

We explain here in more detail the learning-free process used to automatically generate modifiers. The different steps of the pipeline are illustrated in Figure A3. We comment on some of those steps.

Code extraction. Two of the elementary paircodes are essentially variation versions of the initial posecodes [12]: we look at the change in angle posecode or distance posecode between pose $A$ and pose $B$. The third kind of paircode studies the variation in position of a keypoint along the x-, y- or z-axis. All three paircodes are computed on the orientation-normalized bodies, so that the produced instructions do not depend on the change in global orientation of the body between pose $A$ and pose $B$. This change in global orientation is treated separately, and yields a sentence that is added at the beginning of the modifier.
We also resort to the posecodes of both poses $A$ and $B$ to define super-paircodes, and thus gain in abstraction or formulation quality. There can be several ways to achieve the same super-paircode, each way comprising at least two conditions (posecode and paircode mixed together). Some posecodes of pose $B$, if statistically rare, are also included in the final modifier, e.g. ‘the hands should be shoulder-width apart’, ‘the left thigh should be parallel with the ground’. Posecodes of pose $A$ are only useful for super-paircode computations.
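To illustrate the idea of an angle-variation paircode, here is a minimal sketch that measures a joint angle in each pose from 3D keypoints and categorises the change. The 20-degree threshold, the category labels and the toy keypoints are illustrative assumptions; the actual definitions are those of the released code.

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle in degrees at joint b formed by the 3D keypoints a-b-c."""
    u, v = a - b, c - b
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def angle_paircode(angle_a, angle_b, threshold=20.0):
    """Categorise the change in a joint angle from pose A to pose B.

    The threshold and the labels are hypothetical, chosen for illustration only.
    """
    delta = angle_b - angle_a
    if delta > threshold:
        return "extend"
    if delta < -threshold:
        return "bend"
    return "no change"

shoulder = np.array([0.0, 0.0, 0.0])
elbow = np.array([1.0, 0.0, 0.0])
wrist_a = np.array([2.0, 0.0, 0.0])   # straight arm in pose A (180 degrees)
wrist_b = np.array([1.0, 1.0, 0.0])   # right angle at the elbow in pose B (90 degrees)
code = angle_paircode(joint_angle(shoulder, elbow, wrist_a),
                      joint_angle(shoulder, elbow, wrist_b))
```

Here the elbow angle decreases by 90 degrees between the two poses, so the toy paircode falls in the "bend" category.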

Code selection and aggregation. We proceed as in [12]. Trivial codes are removed. The codes (paircodes + posecodes) are aggregated based on simple syntactic rules depending on shared information between codes.

Code ordering. The final set of codes is semantically ordered to produce modifiers that are easier to read and closer to what a human would write (i.e., describe everything related to the right arm at once, instead of scattering pieces of information throughout the text). This step did not exist in the PoseScript automatic pipeline. Specifically, we design a directed graph where the nodes represent the body parts and the edges define a relation of inclusion or proximity between them (e.g. torso → left shoulder, arm → forearm). For each pose pair, we perform a randomized depth walk through the graph: starting from the body node, we choose one node at random among the ones directly accessible, then repeat the process from that node until we reach a leaf; at that point, we come back to the last visited node leading to non-visited nodes and sample one child node at random. We use the order in which the body parts are visited to order the paircodes.
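The randomized depth walk can be sketched as a depth-first traversal in which the children of each node are shuffled before being visited. The toy graph below is a hypothetical example; the actual nodes and edges are defined in the released code.

```python
import random

# Hypothetical toy graph of body parts (parent -> children).
BODY_GRAPH = {
    "body": ["torso", "left arm", "right arm"],
    "torso": ["left shoulder", "right shoulder"],
    "left arm": ["left forearm"],
    "right arm": ["right forearm"],
    "left forearm": ["left hand"],
    "right forearm": ["right hand"],
}

def random_depth_walk(graph, root="body", rng=random):
    """Visit every node reachable from root depth-first, picking children at random."""
    order, visited = [], set()

    def visit(node):
        visited.add(node)
        order.append(node)
        children = list(graph.get(node, []))
        rng.shuffle(children)            # random choice among directly accessible nodes
        for child in children:
            if child not in visited:     # backtracking resumes with unvisited children
                visit(child)

    visit(root)
    return order

order = random_depth_walk(BODY_GRAPH, rng=random.Random(0))
```

Whatever the random draw, all nodes below a given body part are visited before the walk moves to a sibling, so codes about the same limb end up next to each other in the resulting order.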

Code conversion. Codes are converted to pieces of text by plugging information into a randomly chosen template sentence associated with each of them. The pieces of text are then concatenated using transition texts. Verbs are conjugated according to the chosen transition (e.g. “while + gerund”) and code (e.g. posecodes lead to “[…] should be” sentences).

We refer to the code for the detailed and complete list of paircode and super-paircode definitions.

Appendix B Original triplets of the generation examples

In this section, we provide the original triplets for the generation results presented in Figure 5 (see Figure A4) and in Figure 7 (see Figure A5). While this ground truth may ease the comparison, it is not the only true answer for a generative model: multiple valid results could be produced. The GT was purposely omitted to prevent judgment bias, but is added here for reference.

Figure A4: Original poses $B$ for the text-based pose editing task and PoseFix queries presented in Figure 5. Two views of each pose are shown on the same ground plane. Pose $A$ is shown in grey, pose $B$ in purple.
Figure A5: Original correctional feedback annotations for the PoseFix pose pairs presented in Figure 7. Pose $A$ is shown in grey, pose $B$ in purple.

Appendix C Miscellaneous visualizations

Robot teaching application. The choice of modifiers in natural language to learn the difference between two poses proves especially useful in applications where direct manipulation is not possible, for instance in the case of robot teaching. Figure A6 shows a snapshot of a demo where a two-arm robot pose is optimized to match SMPL keypoints obtained from textual instructions.

Figure A6: Robot teaching application.

The PoseCopy behavior. The PoseCopy setting for the text-based pose editing task consists in training the model with a proportion of the data where the text is emptied and pose $B$ becomes a copy of pose $A$. This training configuration makes it possible for the model to yield the exact same pose as the initial one when no correctional instruction is specified; see Figure A7 for an example. Besides, we hypothesize that this setting encourages the model to pay more attention to pose $A$.

Figure A7: Effect of training with PoseCopy.

Appendix D Implementation details

Architecture details. We follow the VPoser [44] architecture for our pose encoder, modified to account for the 52 joints of the SMPL-H [52] body model. In the ‘glove+bigru’ configuration of our pose editing baseline, GloVe word embeddings are of size 300 and we use a bidirectional GRU with one layer and hidden state features of size 512. In the transformer configuration, we use a frozen pretrained DistilBERT model to encode the text tokens. The subsequent transformer is composed of 4 layers with 4 heads and feed-forward networks with 1024 dimensions. It relies on GELU [21] activations and uses dropout. The text embedding is eventually obtained by performing an average pooling. The transformer in our correctional text generation baseline is the same as for pose editing, except that we use 8 heads. In our models for both tasks, the poses and texts are encoded in latent spaces of dimensions $d=32$ and $n=128$ ($n=512$ for the text generation task) respectively.

Optimization and training details. We trained our models with the Adam [28] optimizer, a batch size of 128, a learning rate of $10^{-5}$ ($10^{-4}$ for pretraining; and $10^{-6}$ for finetuning in the case of pose editing) and a weight decay of $10^{-4}$ ($10^{-5}$ for finetuning in the case of pose editing). The pose editing model was trained for 10,000 epochs (half for pretraining and half for finetuning, or 10,000 straight if no pretraining was involved), while the text generation model was trained for 3,000 epochs for pretraining and 2,000 for finetuning. In the PoseCopy setting, 50% of the batch is randomly used in “copy” mode (i.e., empty text, with poses $A$ and $B$ being the same).
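The “copy” mode of the PoseCopy setting could be implemented along the following lines. This is a minimal sketch: the function name, the array shapes and the choice of selecting exactly half of the batch (rather than sampling each element independently) are assumptions.

```python
import numpy as np

def apply_pose_copy(poses_a, poses_b, texts, copy_ratio=0.5, rng=None):
    """Turn a random subset of a batch into 'copy' samples.

    For the selected samples, the modifier is emptied and pose B is replaced by
    pose A. Selecting exactly round(copy_ratio * batch_size) samples is an assumption.
    """
    if rng is None:
        rng = np.random.default_rng()
    n = len(texts)
    idx = rng.choice(n, size=int(round(copy_ratio * n)), replace=False)
    poses_b = poses_b.copy()             # leave the caller's array untouched
    poses_b[idx] = poses_a[idx]
    texts = list(texts)
    for i in idx:
        texts[i] = ""
    return poses_a, poses_b, texts, idx

rng = np.random.default_rng(0)
a = rng.normal(size=(8, 52, 3))          # toy batch of poses A
b = rng.normal(size=(8, 52, 3))          # toy batch of poses B
texts = [f"modifier {i}" for i in range(8)]
_, b_mixed, texts_mixed, copied = apply_pose_copy(a, b, texts, rng=rng)
```

Samples that are not selected keep their original pose B and modifier, so the regular editing objective is still trained on the rest of the batch.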

Why use the ELBO metric? The ELBO is well suited to VAEs [29]: it combines the reconstruction term and the KL term into a lower bound on the data log-likelihood, a universal quantity for comparing likelihood-based generative models. It accounts for the probabilistic nature of the model by evaluating the target under the output distribution. In a VAE framework, reporting reconstruction errors alone does not penalize the model for storing a lot of information in the latent variable produced by the encoder. In the extreme case, an encoder that learns an identity function would appear optimal, yet fail at test time when the ground truth is no longer available for encoding. By contrast, the ELBO also accounts for the amount of information given by the encoder (the KL term).
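For reference, the generic form of this bound for a conditional VAE can be written as below. The notation is assumed here for illustration and may differ from the one used in the main paper: $x$ stands for the target (e.g. pose $B$), $c$ for the conditioning query (e.g. pose $A$ and the modifier), $z$ for the latent variable, $q_\phi$ for the encoder and $p_\theta$ for the decoder and prior.

```latex
\log p_\theta(x \mid c) \;\ge\;
\underbrace{\mathbb{E}_{q_\phi(z \mid x, c)}\big[\log p_\theta(x \mid z, c)\big]}_{\text{reconstruction term}}
\;-\;
\underbrace{\mathrm{KL}\big(q_\phi(z \mid x, c)\,\|\,p_\theta(z \mid c)\big)}_{\text{information given by the encoder}}
```

An encoder that copies its input into $z$ maximizes the first term but pays for it through a large KL term, which is why the bound, unlike a reconstruction error alone, does not reward the identity solution.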

Hand data. We used the hand data (finger joints) for all our experiments, but note that this was not necessary, given that the hands all have the same pose for PoseFix human-annotated pose pairs. In case more data with relevant hand information is annotated in the future, we suggest keeping the original hand data for the pairs annotated in this version of the dataset, as some annotators may have referred to them in their instructions.