
Cross Refinement Techniques for Markerless Human Motion Capture

Published: 04 March 2020


This article presents a global 3D human pose estimation method for markerless motion capture. Given two calibrated images of a person, the method first obtains the 2D joint locations in each image using a pre-trained 2D pose CNN, then reconstructs the 3D pose by stereo triangulation. To improve the accuracy and stability of the system, we propose two efficient optimization techniques for the joints. The first, called cross-view refinement, optimizes the joints based on epipolar geometry. The second, called cross-joint refinement, optimizes the joints using bone-length constraints. Our method automatically detects and corrects unreliable joints and is consequently robust against heavy occlusion, symmetry ambiguity, motion blur, and highly distorted poses. We evaluate our method on a number of benchmark datasets covering indoor and outdoor scenes; the results show that it is better than or on par with state-of-the-art methods. As an application, we create a 3D human pose dataset using the proposed motion capture system, which contains about 480K images of both indoor and outdoor scenes, and demonstrate the usefulness of the dataset for human pose estimation.
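
The geometric ingredients named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses standard linear (DLT) triangulation, the textbook point-to-epipolar-line distance as one possible cue for an unreliable joint, and a plain bone-length deviation. All function names, the camera matrices, and the sample point are assumptions made up for illustration.

```python
import numpy as np

def project(P, X):
    """Project a 3D point X with a 3x4 projection matrix P to pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def triangulate_dlt(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one joint from two calibrated views."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The homogeneous solution is the right singular vector with the
    # smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

def fundamental_from_projections(P1, P2):
    """Fundamental matrix F with x2^T F x1 = 0, built as [e2]_x P2 P1^+."""
    _, _, vt = np.linalg.svd(P1)
    C1 = vt[-1]                      # camera centre of view 1 (null vector of P1)
    e2 = P2 @ C1                     # epipole in view 2
    e2_cross = np.array([[0.0, -e2[2], e2[1]],
                         [e2[2], 0.0, -e2[0]],
                         [-e2[1], e2[0], 0.0]])
    return e2_cross @ P2 @ np.linalg.pinv(P1)

def epipolar_distance(F, x1, x2):
    """Pixel distance from x2 to the epipolar line that x1 induces in view 2."""
    line = F @ np.append(x1, 1.0)
    return abs(line @ np.append(x2, 1.0)) / np.hypot(line[0], line[1])

def bone_length_error(Xa, Xb, ref_len):
    """Absolute deviation of a reconstructed bone from a reference length."""
    return abs(np.linalg.norm(Xa - Xb) - ref_len)

# Hypothetical stereo rig: identical intrinsics, second camera shifted 0.5 along x.
K = np.array([[1000.0, 0.0, 320.0],
              [0.0, 1000.0, 240.0],
              [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])
F = fundamental_from_projections(P1, P2)

X_true = np.array([0.2, -0.1, 3.0])          # made-up joint position
x1, x2 = project(P1, X_true), project(P2, X_true)
X_rec = triangulate_dlt(P1, P2, x1, x2)      # reconstruction from clean detections
d_clean = epipolar_distance(F, x1, x2)       # consistent pair of detections
d_noisy = epipolar_distance(F, x1, x2 + np.array([0.0, 10.0]))  # perturbed detection
```

In a sketch like this, a large epipolar distance for a joint would suggest that at least one of its two 2D detections is unreliable, and a large bone-length error would flag an implausible pair of reconstructed joints; how the published method turns such cues into corrections is described in the full article, not here.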






    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 16, Issue 1
    February 2020
    363 pages
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 04 March 2020
    Accepted: 01 November 2019
    Revised: 01 November 2019
    Received: 01 May 2019
    Published in TOMM Volume 16, Issue 1



    Author Tags

    1. Human pose estimation
    2. camera calibration
    3. convolutional neural network
    4. epipolar geometry


    • Research-article
    • Research
    • Refereed

    Funding Sources

    • NSFC
    • FaceUnity Technology


    Bibliometrics & Citations


    Article Metrics

    • Downloads (Last 12 months): 30
    • Downloads (Last 6 weeks): 0
    Reflects downloads up to 25 Feb 2025


    Cited By

    • (2023) Lightweight multi-person motion capture system in the wild. SCIENTIA SINICA Informationis 53:11 (2230). DOI: 10.1360/SSI-2022-0397. Online publication date: 31-Oct-2023
    • (2023) A Novel Model for Intelligent Pull-Ups Test Based on Key Point Estimation of Human Body and Equipment. Mobile Information Systems 2023. DOI: 10.1155/2023/3649217. Online publication date: 1-Jan-2023
    • (2022) Full-body Human Motion Reconstruction with Sparse Joint Tracking Using Flexible Sensors. ACM Transactions on Multimedia Computing, Communications, and Applications. DOI: 10.1145/3564700. Online publication date: 29-Sep-2022
    • (2021) A Systematic Review of the Application of Camera-Based Human Pose Estimation in the Field of Sport and Physical Exercise. Sensors 21:18 (5996). DOI: 10.3390/s21185996. Online publication date: 7-Sep-2021
