A U-Shaped Spatio-Temporal Transformer as Solver for Motion Capture

Yang, Huabin; Zhang, Zhongjian; Wang, Yan; Guan, Deyu; Guo, Kangshuai; Chang, Yu; Zhang, Yanru

doi:10.1007/978-981-97-2095-8_15

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14592)

Included in the following conference series:

International Conference on Computational Visual Media

Abstract

Motion capture (MoCap) data inevitably contain noise. Raw markers can be mislabeled, occluded, or corrupted by positional noise, and must be cleaned before they can be used in production. However, cleaning MoCap data is costly, repetitive work that requires manual intervention by trained experts. To address this problem, this paper proposes U-Solver, a novel end-to-end Transformer-based framework that obtains joint transformations directly from raw markers (a step called solving). With a hierarchical framework composed of decoupled spatio-temporal (DeST) Transformers and a motion-aware network (MAN) introduced into the temporal self-attention mechanism, U-Solver learns motion dynamics along both the spatial and temporal dimensions, so raw markers can be cleaned and solved automatically. Experimental results show that U-Solver outperforms previous state-of-the-art methods in robustness, efficiency, and precision.
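
To make the idea of decoupled spatio-temporal attention concrete, the following is a minimal sketch and not the authors' implementation. It assumes a marker sequence stored as `[frame][marker][feature]`, uses identity query/key/value projections instead of learned weights, and applies a spatial pass before a temporal pass; that ordering, the residual connections, and all function names are illustrative assumptions. The U-shaped hierarchy and the motion-aware network (MAN) are not modeled here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections.

    tokens: list of n feature vectors, each a list of d floats.
    A real model would apply learned projections; they are omitted here.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out

def spatial_attention(seq):
    """Attend across markers, independently within each frame."""
    return [self_attention(frame) for frame in seq]

def temporal_attention(seq):
    """Attend across frames, independently for each marker."""
    n_frames, n_markers = len(seq), len(seq[0])
    per_marker = [self_attention([seq[t][m] for t in range(n_frames)])
                  for m in range(n_markers)]
    # Transpose back to [frame][marker][feature].
    return [[per_marker[m][t] for m in range(n_markers)] for t in range(n_frames)]

def add(a, b):
    """Element-wise sum of two [frame][marker][feature] sequences."""
    return [[[x + y for x, y in zip(u, v)] for u, v in zip(fa, fb)]
            for fa, fb in zip(a, b)]

def decoupled_block(seq):
    """Spatial pass, then temporal pass, each with a residual connection."""
    seq = add(seq, spatial_attention(seq))
    return add(seq, temporal_attention(seq))

# Toy input (hypothetical data): 3 frames, 2 markers, 3-D positions.
markers = [[[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
           [[0.1, 0.0, 0.0], [1.1, 0.0, 0.0]],
           [[0.2, 0.0, 0.0], [1.2, 0.0, 0.0]]]
refined = decoupled_block(markers)
```

Splitting attention this way keeps each attention call small: the spatial pass compares only the markers within one frame, and the temporal pass compares only the frames of one marker, rather than attending over every frame-marker pair at once.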

Acknowledgments

This work was supported in part by Guangdong High Level Innovation Research Institute (2019B090909005, 2019B090917008) and by GDST (2020B1212030003).

Author information

Authors and Affiliations

University of Electronic Science and Technology of China, Chengdu, China
Huabin Yang, Zhongjian Zhang, Yan Wang, Kangshuai Guo, Yu Chang & Yanru Zhang
team randomersharp, Chengdu, China
Deyu Guan
Shenzhen Institute for Advanced Study, UESTC, Shenzhen, China
Yanru Zhang
Guangdong-Macau Joint Laboratory for Advanced and Intelligent Computing, Hengqin, China
Yan Wang
Guangdong Qinzhi Technology Research Institute, Hengqin, China
Yan Wang

Corresponding author

Correspondence to Yanru Zhang.

Editor information

Editors and Affiliations

Victoria University of Wellington, Wellington, New Zealand
Fang-Lue Zhang
Ben-Gurion University, Be'er Sheva, Israel
Andrei Sharf

Cite this paper

Yang, H. et al. (2024). A U-Shaped Spatio-Temporal Transformer as Solver for Motion Capture. In: Zhang, FL., Sharf, A. (eds) Computational Visual Media. CVM 2024. Lecture Notes in Computer Science, vol 14592. Springer, Singapore. https://doi.org/10.1007/978-981-97-2095-8_15

DOI: https://doi.org/10.1007/978-981-97-2095-8_15
Published: 30 March 2024
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-2094-1
Online ISBN: 978-981-97-2095-8
eBook Packages: Computer Science (R0)
