
Motion2language, unsupervised learning of synchronized semantic motion segmentation

  • Original Article
  • Published in Neural Computing and Applications

Abstract

In this paper, we investigate building a sequence-to-sequence architecture for motion-to-language translation and synchronization. The aim is to translate motion-capture inputs into English natural-language descriptions such that the descriptions are generated synchronously with the actions performed, enabling semantic segmentation as a byproduct, without requiring synchronized training data. We propose a new recurrent formulation of local attention suited to synchronous/live text generation, as well as an improved motion encoder architecture better suited to smaller datasets and to synchronous generation. We evaluate both contributions in individual experiments, using the standard BLEU4 metric and a simple semantic equivalence measure, on the KIT motion-language dataset. In a follow-up experiment, we assess the quality of the synchronization of the generated text in our proposed approaches through multiple evaluation metrics. We find that the two contributions, the attention mechanism and the encoder architecture, additively improve not only the quality of the generated text (BLEU and semantic equivalence) but also the quality of synchronization.
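To make the overall pipeline concrete, the following is a minimal sketch of a motion-to-language seq2seq model in which each generated word receives an attention position over the motion frames. The class name, dimensions, plain dot-product attention, and greedy decoding are illustrative assumptions only; they do not reproduce the recurrent local attention or the improved encoder proposed in the paper.

```python
import torch
import torch.nn as nn

class MotionToText(nn.Module):
    """Illustrative motion-to-language seq2seq skeleton (not the authors' exact model)."""

    def __init__(self, pose_dim, vocab_size, hidden=256):
        super().__init__()
        # Hypothetical frame-level encoder: an MLP on each pose vector, then a GRU over time.
        self.frame_mlp = nn.Sequential(nn.Linear(pose_dim, hidden), nn.ReLU())
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.GRUCell(2 * hidden, hidden)   # input = [word embedding ; context]
        self.out = nn.Linear(hidden, vocab_size)

    @torch.no_grad()
    def generate(self, poses, bos_id=0, max_words=20):
        # poses: (1, T, pose_dim) motion-capture frames for a single sequence.
        enc, h = self.encoder(self.frame_mlp(poses))     # enc: (1, T, hidden)
        enc = enc.squeeze(0)                             # (T, hidden)
        dec_h = h.squeeze(0)                             # (1, hidden)
        word = torch.tensor([bos_id])
        words, positions = [], []
        for _ in range(max_words):
            # Dot-product attention over motion frames. The attended frame index
            # acts as a per-word timestamp, which is what enables synchronization
            # and, by grouping words, semantic segmentation as a byproduct.
            alpha = (enc @ dec_h.squeeze(0)).softmax(dim=0)                     # (T,)
            context = alpha @ enc                                               # (hidden,)
            step_in = torch.cat([self.embed(word).squeeze(0), context]).unsqueeze(0)
            dec_h = self.decoder(step_in, dec_h)                                # (1, hidden)
            word = self.out(dec_h).argmax(dim=-1)                               # greedy decoding
            words.append(word.item())
            positions.append(int(alpha.argmax()))                               # frame aligned to this word
        return words, positions
```

In this sketch the word-to-frame alignment comes from a global softmax; the recurrent local attention described in the paper instead constrains each word to a window around a recurrently predicted position, which is what supports synchronous/live generation.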


Data availability

The KIT human motion to language dataset used throughout this work is available at https://motion-annotation.humanoids.kit.edu/dataset/. The HumanML3D dataset is available at https://github.com/EricGuo5513/HumanML3D. Any scripts used to post-process this data, the portion we annotated with synchronization information, and the full implementation of our models will be made available in our git repository at https://github.com/rd20karim/M2T-Segmentation.

Notes

  1. \( [\![ i, j ]\!] \) denotes the set of integers between i and j, both included; for example, \( [\![ 8, 23 ]\!] = \{8, 9, \ldots, 23\} \). We will use this Bourbaki convention throughout the paper.

  2. "\(\cdot \)” is the scalar product.

  3. Equivalent to applying a truncated Gaussian kernel; see the illustrative sketch after these notes.

  4. https://anonymous.4open.science/r/Motion2Language_Animation-BDB4/README.md.
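
Note 3 presumably refers to the weighting applied inside the local attention window. Under that assumption, here is a minimal numpy sketch of what applying a truncated Gaussian kernel to the alignment scores could look like; the function name and the sigma = D/2 choice follow Luong-style local attention and are illustrative, not the authors' exact formulation.

```python
import numpy as np

def truncated_gaussian_attention(scores, center, D=5):
    """Hypothetical illustration of note 3: restrict attention to a window of
    half-width D around `center` and modulate it with a Gaussian, which amounts
    to applying a truncated Gaussian kernel to the alignment scores.

    scores: (T,) raw alignment scores over motion frames
    center: predicted attention position p_t (a value in [0, T-1])
    D:      half-width of the local window (sigma = D / 2, Luong-style)
    """
    positions = np.arange(scores.shape[0])
    window = (np.abs(positions - center) <= D).astype(float)        # truncation mask
    gaussian = np.exp(-((positions - center) ** 2) / (2 * (D / 2) ** 2))
    weights = np.exp(scores - scores.max()) * gaussian * window
    return weights / weights.sum()
```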

Abbreviations

KIT-ML:

The KIT motion-language dataset

HumanML3D:

3D human motion-language dataset

MLP:

Multilayer perceptron

BLEU:

Bilingual evaluation understudy

GRU:

Gated recurrent unit

PA-ResGCN:

Part-attention residual graph convolutional network

NMT:

Neural machine translation


Acknowledgements

This work is supported by a scholarship granted by the Occitanie Region of France (Grant number ALDOCT-001100 20007383).

Author information


Corresponding author

Correspondence to Karim Radouane.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A

Across all conducted experiments, a common observation is that most motion descriptions start with the words "a person"; for such words, which depend on the language model, the network learns to align them with the start of the motion.

Figure 11 shows the case of a motion composed of two sub-motions, 'move to the left' followed by 'move to the right', where the attention weights were correctly localized, allowing a correct segmentation of the motion.

Fig. 11

Truncated Gaussian DeepMLP-GRU, \(D=5\), [move left, move right] (move to the left in the range \( [\![ 8, 23 ]\!] \), move to the right in the range \( [\![ 24, 35 ]\!] \))
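
As an illustration of how a frame range such as \( [\![ 8, 23 ]\!] \) might be read off an attention map, here is a minimal sketch; the function name and the half-maximum threshold are assumptions for illustration, not the segmentation protocol evaluated in the paper.

```python
import numpy as np

def segment_from_attention(attn, phrase_word_ids):
    """Read a frame range off an attention map for one action phrase.

    attn:            (N_words, T) attention map, one row per generated word
    phrase_word_ids: indices of the words forming the phrase, e.g. "move to the left"
    Returns (first_frame, last_frame) where the pooled attention of the phrase
    exceeds half of its peak value -- a simple heuristic for illustration only.
    """
    pooled = attn[phrase_word_ids].mean(axis=0)              # (T,) attention mass of the phrase
    active = np.where(pooled >= 0.5 * pooled.max())[0]
    return int(active.min()), int(active.max())
```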

Figure 12 shows the case of a push backward action; the attention positions were correctly distributed relative to the range of the action.

Fig. 12

Truncated Gaussian MLP-GRU, \(D=5\), [Push backward] (pushing action in the range \( [\![ 15, 26 ]\!] \))

Figure 13 represents the attention map for "walk down"; when observing the motion, the action is performed in the range \( [\![ 11, 30 ]\!] \).

Fig. 13

Truncated Gaussian DeepMLP-GRU, \(D=5\), [stand up] (range \( [\![ 11, 30 ]\!] \))

Figure 14 presents the association of \(P_m\) to \(L_m\) for the model using the deep-MLP encoder.

Fig. 14

Truncated Gaussian DeepMLP-GRU, \(D=5\), mapping \(P_m \rightarrow L_m\)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Radouane, K., Tchechmedjiev, A., Lagarde, J. et al. Motion2language, unsupervised learning of synchronized semantic motion segmentation. Neural Comput & Applic 36, 4401–4420 (2024). https://doi.org/10.1007/s00521-023-09227-z

