Abstract
Motion capture (MoCap) data is inevitably noisy. Raw markers can be mislabeled, occluded, or contaminated by positional noise, so they must be refined before being used in production. However, cleaning up MoCap data is costly, repetitive work that requires manual intervention by trained experts. To address this problem, this paper proposes U-Solver, a novel end-to-end Transformer-based framework that obtains joint transformations directly from raw markers (a process called solving). Through a hierarchical framework built from the decoupled spatio-temporal (DeST) Transformer, and with a motion-aware network (MAN) introduced into the temporal self-attention mechanism, U-Solver effectively learns motion dynamics along both the spatial and temporal dimensions. Raw markers can thus be cleaned and solved automatically by U-Solver. Experimental results demonstrate that U-Solver outperforms previous state-of-the-art methods in robustness, efficiency, and precision.
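To make the idea of decoupling spatial and temporal attention concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a marker sequence stored as an array of shape (frames, markers, features), uses single-head attention with identity query/key/value projections, and omits the U-shaped hierarchy and the motion-aware network entirely. The names `self_attention` and `decoupled_st_block` are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x):
    """Single-head scaled dot-product self-attention over the second-to-last axis.

    Assumption: identity Q/K/V projections (no learned weights), to keep the
    sketch short. x has shape (..., N, D); tokens are the N axis.
    """
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)  # (..., N, N)
    return softmax(scores, axis=-1) @ x               # (..., N, D)


def decoupled_st_block(x):
    """Apply spatial attention, then temporal attention, with residual links.

    x has shape (T, M, D): T frames, M markers, D features per marker.
    """
    # Spatial step: within each frame, every marker attends to all markers.
    x = x + self_attention(x)                # frames act as the batch axis

    # Temporal step: for each marker, every frame attends to all frames.
    xt = np.swapaxes(x, 0, 1)                # (M, T, D), markers as batch axis
    xt = xt + self_attention(xt)
    return np.swapaxes(xt, 0, 1)             # back to (T, M, D)


rng = np.random.default_rng(0)
markers = rng.normal(size=(8, 5, 16))        # 8 frames, 5 markers, 16 features
out = decoupled_st_block(markers)
print(out.shape)
```

Splitting attention this way keeps each attention matrix small (markers by markers, then frames by frames) instead of attending over every marker in every frame at once, which is one common motivation for decoupled designs.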
Acknowledgments
This work was supported in part by Guangdong High Level Innovation Research Institute (2019B090909005, 2019B090917008), GDST (2020B1212030003).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Yang, H. et al. (2024). A U-Shaped Spatio-Temporal Transformer as Solver for Motion Capture. In: Zhang, FL., Sharf, A. (eds) Computational Visual Media. CVM 2024. Lecture Notes in Computer Science, vol 14592. Springer, Singapore. https://doi.org/10.1007/978-981-97-2095-8_15
DOI: https://doi.org/10.1007/978-981-97-2095-8_15
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-2094-1
Online ISBN: 978-981-97-2095-8
eBook Packages: Computer Science (R0)