TM2T: Stochastic and Tokenized Modeling for the Reciprocal Generation of 3D Human Motions and Texts

Published: 23 October 2022

Abstract

Inspired by the strong ties between vision and language, the two intimate human sensing and communication modalities, this paper explores the generation of 3D human full-body motions from texts, as well as its reciprocal task, referred to as text2motion and motion2text, respectively. To tackle the existing challenges, especially to enable the generation of multiple distinct motions from the same text and to avoid the undesirable production of trivial motionless pose sequences, we propose the use of motion tokens, a discrete and compact motion representation. This places the two signals on a level playing field, as motion tokens and text tokens, respectively. Moreover, our motion2text module is integrated into the inverse alignment process of our text2motion training pipeline, where a significant deviation of the synthesized text from the input text is penalized by a large training loss; empirically, this is shown to improve performance effectively. Finally, the mappings between the two modalities of motions and texts are facilitated by adapting neural machine translation (NMT) models to our context. This autoregressive modeling of the distribution over discrete motion tokens further enables non-deterministic production of pose sequences of variable lengths from an input text. Our approach is flexible and can be used for both the text2motion and motion2text tasks. Empirical evaluations on two benchmark datasets demonstrate the superior performance of our approach over a variety of state-of-the-art methods on both tasks. Project page: https://ericguo5513.github.io/TM2T/
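
A minimal sketch may help make the abstract's two key mechanisms concrete: motions are discretized into compact motion tokens via a nearest-neighbour lookup in a learned codebook (a VQ-VAE-style quantization), and the distribution over those tokens is modeled autoregressively conditioned on the text, which is what makes generation stochastic and variable-length. The Python sketch below illustrates both ideas under stated assumptions; the feature size, codebook size, end-token convention, and the random-projection "decoder" are hypothetical stand-ins, not the authors' implementation.

    # Minimal sketch (not the authors' code) of the two mechanisms named in the
    # abstract: (1) quantizing continuous motion features into discrete motion
    # tokens by nearest-neighbour codebook lookup, and (2) sampling motion tokens
    # autoregressively from a text-conditioned distribution, which yields
    # stochastic, variable-length outputs. All sizes and the toy decoder are
    # illustrative assumptions.
    import torch
    import torch.nn.functional as F

    MOTION_DIM = 64        # per-snippet motion feature size (assumed)
    CODEBOOK_SIZE = 512    # number of discrete motion tokens (assumed)
    END_TOKEN = 0          # reserved id that terminates generation (assumed)

    codebook = torch.randn(CODEBOOK_SIZE, MOTION_DIM)   # learned in a real model

    def quantize(motion_feat: torch.Tensor) -> torch.Tensor:
        """Map continuous motion features (T, MOTION_DIM) to token ids (T,)."""
        dists = torch.cdist(motion_feat, codebook)       # (T, CODEBOOK_SIZE)
        return dists.argmin(dim=-1)                      # nearest codebook entry

    def sample_motion_tokens(text_emb: torch.Tensor, max_len: int = 50) -> list:
        """Autoregressively sample motion tokens given a text embedding.

        A real system would use an NMT-style encoder-decoder; a fixed random
        projection stands in for the decoder so the sketch runs end to end.
        """
        decoder = torch.randn(MOTION_DIM, CODEBOOK_SIZE)  # toy stand-in decoder
        hidden = text_emb.clone()
        tokens = []
        for _ in range(max_len):
            probs = F.softmax(hidden @ decoder, dim=-1)   # p(next token | history, text)
            tok = torch.multinomial(probs, 1).item()      # stochastic: differs per call
            if tok == END_TOKEN:                          # variable-length output
                break
            tokens.append(tok)
            hidden = hidden + codebook[tok]               # fold the new token back in
        return tokens

    # Example usage with stand-in data: tokenize a 10-snippet motion clip, then
    # sample a token sequence from a 64-dim stand-in text embedding. Decoding
    # tokens back to poses would run the codebook vectors through a motion decoder.
    token_ids = quantize(torch.randn(10, MOTION_DIM))
    generated = sample_motion_tokens(torch.randn(MOTION_DIM))
    print(token_ids.tolist(), generated)

Because the next token is sampled rather than chosen greedily, repeated calls with the same text produce distinct motions, and the end token lets the model decide the sequence length.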




        Information

        Published In

        Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV
        Oct 2022
        801 pages
        ISBN:978-3-031-19832-8
        DOI:10.1007/978-3-031-19833-5

        Publisher

        Springer-Verlag

        Berlin, Heidelberg

        Publication History

        Published: 23 October 2022

        Author Tags

        1. Motion captioning
        2. Text-to-motion generation

        Qualifiers

        • Article


        Cited By

        • (2024) MCM. Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, pp. 1083–1091. DOI: 10.24963/ijcai.2024/120. Online publication date: 3-Aug-2024
        • (2024) MotionFix: Text-Driven 3D Human Motion Editing. SIGGRAPH Asia 2024 Conference Papers, pp. 1–11. DOI: 10.1145/3680528.3687559. Online publication date: 3-Dec-2024
        • (2024) CLaM: An Open-Source Library for Performance Evaluation of Text-driven Human Motion Generation. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 11194–11197. DOI: 10.1145/3664647.3685523. Online publication date: 28-Oct-2024
        • (2024) EGGesture: Entropy-Guided Vector Quantized Variational AutoEncoder for Co-Speech Gesture Generation. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 6113–6122. DOI: 10.1145/3664647.3681392. Online publication date: 28-Oct-2024
        • (2024) MMHead: Towards Fine-grained Multi-modal 3D Facial Animation. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 7966–7975. DOI: 10.1145/3664647.3681366. Online publication date: 28-Oct-2024
        • (2024) MoConVQ: Unified Physics-Based Motion Control via Scalable Discrete Representations. ACM Transactions on Graphics 43(4), pp. 1–21. DOI: 10.1145/3658137. Online publication date: 19-Jul-2024
        • (2024) WalkTheDog: Cross-Morphology Motion Alignment via Phase Manifolds. ACM SIGGRAPH 2024 Conference Papers, pp. 1–10. DOI: 10.1145/3641519.3657508. Online publication date: 13-Jul-2024
        • (2024) LGTM: Local-to-Global Text-Driven Human Motion Diffusion Model. ACM SIGGRAPH 2024 Conference Papers, pp. 1–9. DOI: 10.1145/3641519.3657422. Online publication date: 13-Jul-2024
        • (2024) PepperPose: Full-Body Pose Estimation with a Companion Robot. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pp. 1–16. DOI: 10.1145/3613904.3642231. Online publication date: 11-May-2024
        • (2024) Improved Text-Driven Human Motion Generation via Out-of-Distribution Detection and Rectification. Computational Visual Media, pp. 218–231. DOI: 10.1007/978-981-97-2095-8_12. Online publication date: 10-Apr-2024