TM2T: Stochastic and Tokenized Modeling for the Reciprocal Generation of 3D Human Motions and Texts

Published: 23 October 2022

Abstract

Inspired by the strong ties between vision and language, the two intimate human sensing and communication modalities, this paper explores the generation of 3D human full-body motions from texts, as well as its reciprocal task, referred to as text2motion and motion2text, respectively. To tackle the existing challenges, especially to enable the generation of multiple distinct motions from the same text and to avoid the undesirable production of trivial motionless pose sequences, we propose the use of motion tokens, a discrete and compact motion representation. This places the two signals on a level playing field, as motion tokens and text tokens, respectively. Moreover, our motion2text module is integrated into the inverse alignment process of our text2motion training pipeline, where a significant deviation of the synthesized text from the input text is penalized by a large training loss; empirically, this is shown to improve performance effectively. Finally, the mappings between the two modalities of motions and texts are facilitated by adapting neural machine translation (NMT) models to our context. This autoregressive modeling of the distribution over discrete motion tokens further enables non-deterministic production of pose sequences of variable lengths from an input text. Our approach is flexible and can be used for both the text2motion and motion2text tasks. Empirical evaluations on two benchmark datasets demonstrate the superior performance of our approach over a variety of state-of-the-art methods on both tasks. Project page: https://ericguo5513.github.io/TM2T/
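
A minimal sketch may help make the abstract's two key mechanisms concrete: motions are discretized into compact motion tokens via a nearest-neighbour lookup in a learned codebook (a VQ-VAE-style quantization), and the distribution over those tokens is modeled autoregressively conditioned on the text, which is what makes generation stochastic and variable-length. The Python sketch below illustrates both ideas under stated assumptions; the feature size, codebook size, end-token convention, and the random-projection "decoder" are hypothetical stand-ins, not the authors' implementation.

    # Minimal sketch (not the authors' code) of the two mechanisms named in the
    # abstract: (1) quantizing continuous motion features into discrete motion
    # tokens by nearest-neighbour codebook lookup, and (2) sampling motion tokens
    # autoregressively from a text-conditioned distribution, which yields
    # stochastic, variable-length outputs. All sizes and the toy decoder are
    # illustrative assumptions.
    import torch
    import torch.nn.functional as F

    MOTION_DIM = 64        # per-snippet motion feature size (assumed)
    CODEBOOK_SIZE = 512    # number of discrete motion tokens (assumed)
    END_TOKEN = 0          # reserved id that terminates generation (assumed)

    codebook = torch.randn(CODEBOOK_SIZE, MOTION_DIM)   # learned in a real model

    def quantize(motion_feat: torch.Tensor) -> torch.Tensor:
        """Map continuous motion features (T, MOTION_DIM) to token ids (T,)."""
        dists = torch.cdist(motion_feat, codebook)       # (T, CODEBOOK_SIZE)
        return dists.argmin(dim=-1)                      # nearest codebook entry

    def sample_motion_tokens(text_emb: torch.Tensor, max_len: int = 50) -> list:
        """Autoregressively sample motion tokens given a text embedding.

        A real system would use an NMT-style encoder-decoder; a fixed random
        projection stands in for the decoder so the sketch runs end to end.
        """
        decoder = torch.randn(MOTION_DIM, CODEBOOK_SIZE)  # toy stand-in decoder
        hidden = text_emb.clone()
        tokens = []
        for _ in range(max_len):
            probs = F.softmax(hidden @ decoder, dim=-1)   # p(next token | history, text)
            tok = torch.multinomial(probs, 1).item()      # stochastic: differs per call
            if tok == END_TOKEN:                          # variable-length output
                break
            tokens.append(tok)
            hidden = hidden + codebook[tok]               # fold the new token back in
        return tokens

    # Example usage with stand-in data: tokenize a 10-snippet motion clip, then
    # sample a token sequence from a 64-dim stand-in text embedding. Decoding
    # tokens back to poses would run the codebook vectors through a motion decoder.
    token_ids = quantize(torch.randn(10, MOTION_DIM))
    generated = sample_motion_tokens(torch.randn(MOTION_DIM))
    print(token_ids.tolist(), generated)

Because the next token is sampled rather than chosen greedily, repeated calls with the same text produce distinct motions, and the end token lets the model decide the sequence length.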




        Information

        Published In

        Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV
        Oct 2022
        801 pages
        ISBN:978-3-031-19832-8
        DOI:10.1007/978-3-031-19833-5

        Publisher

        Springer-Verlag

        Berlin, Heidelberg

        Publication History

        Published: 23 October 2022

        Author Tags

        1. Motion captioning
        2. Text-to-motion generation

        Qualifiers

        • Article


        Cited By

        • (2024) MCM. Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, pp. 1083–1091. DOI: 10.24963/ijcai.2024/120. Online publication date: 3-Aug-2024
        • (2024) MotionFix: Text-Driven 3D Human Motion Editing. SIGGRAPH Asia 2024 Conference Papers, pp. 1–11. DOI: 10.1145/3680528.3687559. Online publication date: 3-Dec-2024
        • (2024) CLaM: An Open-Source Library for Performance Evaluation of Text-driven Human Motion Generation. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 11194–11197. DOI: 10.1145/3664647.3685523. Online publication date: 28-Oct-2024
        • (2024) EGGesture: Entropy-Guided Vector Quantized Variational AutoEncoder for Co-Speech Gesture Generation. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 6113–6122. DOI: 10.1145/3664647.3681392. Online publication date: 28-Oct-2024
        • (2024) MMHead: Towards Fine-grained Multi-modal 3D Facial Animation. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 7966–7975. DOI: 10.1145/3664647.3681366. Online publication date: 28-Oct-2024
        • (2024) MoConVQ: Unified Physics-Based Motion Control via Scalable Discrete Representations. ACM Transactions on Graphics 43(4), pp. 1–21. DOI: 10.1145/3658137. Online publication date: 19-Jul-2024
        • (2024) WalkTheDog: Cross-Morphology Motion Alignment via Phase Manifolds. ACM SIGGRAPH 2024 Conference Papers, pp. 1–10. DOI: 10.1145/3641519.3657508. Online publication date: 13-Jul-2024
        • (2024) LGTM: Local-to-Global Text-Driven Human Motion Diffusion Model. ACM SIGGRAPH 2024 Conference Papers, pp. 1–9. DOI: 10.1145/3641519.3657422. Online publication date: 13-Jul-2024
        • (2024) PepperPose: Full-Body Pose Estimation with a Companion Robot. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pp. 1–16. DOI: 10.1145/3613904.3642231. Online publication date: 11-May-2024
        • (2024) Improved Text-Driven Human Motion Generation via Out-of-Distribution Detection and Rectification. Computational Visual Media, pp. 218–231. DOI: 10.1007/978-981-97-2095-8_12. Online publication date: 10-Apr-2024