Abstract
In this paper, we investigate building a sequence-to-sequence architecture for motion-to-language translation and synchronization. The aim is to translate motion-capture inputs into English natural-language descriptions such that the descriptions are generated synchronously with the actions performed, enabling semantic segmentation as a byproduct, but without requiring synchronized training data. We propose a new recurrent formulation of local attention suited to synchronous/live text generation, as well as an improved motion-encoder architecture better suited to smaller datasets and to synchronous generation. We evaluate both contributions in individual experiments on the KIT motion-language dataset, using the standard BLEU4 metric as well as a simple semantic equivalence measure. In a follow-up experiment, we assess the quality of synchronization of the generated text in our proposed approaches through multiple evaluation metrics. We find that the contributions to the attention mechanism and to the encoder architecture additively improve the quality of the generated text (BLEU and semantic equivalence), as well as the quality of synchronization.
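To make the synchronous decoding idea concrete, the following is a minimal sketch of one recurrent local-attention step, assuming the aligned position advances monotonically from its previous value via a sigmoid-bounded increment and that content-based attention scores are shaped by a truncated Gaussian window around that position. The module name, the increment parameterization, and the window size are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch only: a recurrent local-attention step in PyTorch.
# Assumptions (not from the paper): the aligned position p_t advances
# monotonically by a sigmoid-bounded increment, and a truncated Gaussian
# window of half-width D shapes the content-based scores.
import torch
import torch.nn as nn

class RecurrentLocalAttention(nn.Module):
    def __init__(self, hidden_size: int, window: int = 5):
        super().__init__()
        self.window = window                                           # half-width D
        self.pos_mlp = nn.Linear(hidden_size, 1)                       # position increment
        self.score = nn.Linear(hidden_size, hidden_size, bias=False)   # "general" score

    def forward(self, dec_state, enc_outputs, prev_pos):
        # dec_state: (B, H); enc_outputs: (B, S, H); prev_pos: (B,) previous p_{t-1}
        src_len = enc_outputs.size(1)
        # Monotonic position update: p_t = p_{t-1} + delta, with delta in (0, D)
        delta = self.window * torch.sigmoid(self.pos_mlp(dec_state)).squeeze(-1)
        pos = torch.clamp(prev_pos + delta, max=src_len - 1)
        # Content-based scores: h_enc W h_dec
        scores = torch.bmm(enc_outputs, self.score(dec_state).unsqueeze(-1)).squeeze(-1)
        # Truncated Gaussian window centred on p_t (sigma = D / 2)
        idx = torch.arange(src_len, device=dec_state.device).float().unsqueeze(0)
        gauss = torch.exp(-((idx - pos.unsqueeze(1)) ** 2) / (2 * (self.window / 2) ** 2))
        weights = torch.softmax(scores, dim=-1) * gauss
        weights = weights / weights.sum(dim=-1, keepdim=True)
        context = torch.bmm(weights.unsqueeze(1), enc_outputs).squeeze(1)
        return context, weights, pos
```

At each decoding step, the returned position is fed back as `prev_pos` for the next step; this recurrence is what keeps the attention window moving forward with the motion and allows text to be emitted synchronously with it.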
Data availability
The KIT human motion-to-language dataset used throughout this work is available at https://motion-annotation.humanoids.kit.edu/dataset/. The HumanML3D dataset is available at https://github.com/EricGuo5513/HumanML3D. Any scripts performing post-processing of these data before use, the portion of the data we annotated with synchronization information, and the full implementation of our models will be made available in our git repository at https://github.com/rd20karim/M2T-Segmentation.
Notes
\([\![i, j]\!]\) denotes the set of integers between \(i\) and \(j\), both included. We use the Bourbaki convention throughout the paper.
"\(\cdot \)” is the scalar product.
Equivalent to applying a truncated Gaussian kernel.
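For reference, the truncated Gaussian kernel mentioned above corresponds to the window term of Luong-style local attention; an illustrative formulation, where \(p_t\) is the predicted aligned position, \(D\) the window half-width, and \(\sigma = D/2\) a common choice (the recurrent variant proposed in this paper may parameterize \(p_t\) differently), is:

```latex
% Illustrative only: local attention with a truncated Gaussian window
a_t(s) = \operatorname{align}(h_t, \bar{h}_s)\,
         \exp\!\left(-\frac{(s - p_t)^2}{2\sigma^2}\right),
\qquad s \in [\![\, p_t - D,\ p_t + D \,]\!]
```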
Abbreviations
- KIT-ML: The KIT motion-language dataset
- HumanML3D: 3D human motion-language dataset
- MLP: Multilayer perceptron
- BLEU: Bilingual evaluation understudy
- GRU: Gated recurrent unit
- PA-ResGCN: Part-attention residual graph convolutional network
- NMT: Neural machine translation
Acknowledgements
This work is supported by a scholarship granted by the Occitanie Region of France (Grant number ALDOCT-001100 20007383).
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix A
In all conducted experiments, a common observation is that most motion descriptions start with the words "a person"; for such words, which depend on the language model, the network learns to align them with the start of the motion.
Figure 11 shows the case of a motion composed of two sub-motions, 'move to the left' then 'move to the right', where the attention weights were correctly localized, allowing a correct segmentation of the motion.
Figure 12 shows the case of a push-backward action; the attention positions were correctly distributed in relation to the range of the action.
Figure 13 shows the attention map for "walk down"; observing the motion, the action is performed in the frame range \([\![11, 30]\!]\).
Figure 14 presents the association of \(P_m\) to \(L_m\) for the model using the deep-MLP encoder.
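As an illustration of how such attention maps can be turned into a segmentation, the sketch below assigns each generated word the frame with maximal attention and groups consecutive words into clauses delimited by assumed boundary tokens, yielding frame ranges such as \([\![11, 30]\!]\) for "walk down". The function, its parameters, and the boundary tokens are hypothetical; this is not the paper's exact procedure.

```python
# Illustrative sketch only: deriving motion segments from a decoder attention map.
import numpy as np

def attention_to_segments(attn, words, boundary_tokens=("then", "and", ".")):
    """attn: (num_words, num_frames) attention weights; words: generated tokens."""
    peaks = attn.argmax(axis=1)              # frame of maximal attention per word
    segments, start = [], 0
    for i, w in enumerate(words):
        if w in boundary_tokens or i == len(words) - 1:
            chunk = peaks[start:i + 1]
            if len(chunk) > 0:
                segments.append((words[start:i + 1], int(chunk.min()), int(chunk.max())))
            start = i + 1
    return segments  # list of (tokens, first_frame, last_frame)

# Example with two sub-motions, 'move to the left' then 'move to the right':
# attn = np.random.rand(9, 60)
# words = "a person moves left then moves to the right".split()
# print(attention_to_segments(attn, words))
```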
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Radouane, K., Tchechmedjiev, A., Lagarde, J. et al. Motion2language, unsupervised learning of synchronized semantic motion segmentation. Neural Comput & Applic 36, 4401–4420 (2024). https://doi.org/10.1007/s00521-023-09227-z