A Baseline for Cross-Database 3D Human Pose Estimation
Abstract
1. Introduction
- We proposed a method for harmonizing the dataset-specific skeleton joint definitions (see Section 4.1). It facilitates cross-dataset experiments and training with multiple datasets while avoiding systematic errors. The source code is available at https://github.com/mihau2/Cross-Data-Pose-Estimation (accessed on 27 May 2021);
- We proposed a scale normalization method that significantly improves generalization across cameras, subjects, and databases by up to 50% (see Section 4.2; a hedged sketch of such a normalization follows this list). Although normalization is a well-known concept, it has not been used consistently in 3D human pose estimation, especially with 3D skeletons;
- We conducted cross-dataset experiments using the method of Martinez et al. [26] (Section 5), showing the negative effect of dataset biases on generalization and the positive impact of the proposed scale normalization. Additional experiments investigated the effect of using more or fewer cameras (including virtual cameras), training with multiple datasets, applying a proposed anatomy-based pose validation step, and using OpenPose as the basis for the 3D pose estimation. Finally, we discussed our findings, the limitations of our work, and future directions (Section 6).
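As a rough illustration of the scale normalization idea referenced above, the following minimal sketch rescales a 3D skeleton so that its total bone length takes a canonical value. The bone list and the sum-of-bone-lengths criterion are illustrative assumptions, not necessarily the exact procedure of Section 4.2:

```python
import numpy as np

# Bones of the harmonized 14-joint skeleton as (parent, child) index pairs,
# using the joint order: R Hip, R Knee, R Ankle, L Hip, L Knee, L Ankle,
# Neck, Head, L Shoulder, L Elbow, L Hand, R Shoulder, R Elbow, R Hand.
# NOTE: this edge list is an assumption made for illustration.
BONES = [(0, 1), (1, 2), (3, 4), (4, 5),     # right and left legs
         (6, 7),                             # neck -> head
         (6, 8), (8, 9), (9, 10),            # left arm
         (6, 11), (11, 12), (12, 13)]        # right arm

def scale_normalize(pose3d: np.ndarray, target_size: float = 1.0) -> np.ndarray:
    """Rescale a (14, 3) skeleton so its summed bone length equals target_size,
    removing body-size differences between subjects, cameras, and datasets."""
    size = sum(np.linalg.norm(pose3d[a] - pose3d[b]) for a, b in BONES)
    return pose3d * (target_size / size)
```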
2. Related Work
2.1. 2D Human Pose Estimation
2.2. 3D Human Pose Estimation from 2D Images
2.3. 3D Human Pose Estimation from the 2D Pose
2.4. Cross-Dataset Generalization
2.5. Non-Vision-Based Approaches
3. Datasets
3.1. HumanEva-I (HE1)
3.2. Human3.6M (H36M)
3.3. Panoptic (Pan)
3.4. Comparison and Dataset Biases
- Lighting: The recordings are homogeneously lit, typically without any overexposed or strongly shadowed areas. Further, there is no variation in lighting color and color temperature. Real-world data are often more challenging, e.g., consider an outdoor scene with unilateral sunlight or a nightclub scene with colored and moving lighting;
- Background: The backgrounds are static and homogeneous. Real-world data often include cluttered and changing backgrounds, which may challenge the computer vision algorithms more;
- Occlusion: In real-world data, people are often partially occluded by their own body parts, other people, furniture, or other objects, or parts of the body are outside the image. Self-occlusion is covered in all three databases. Human3.6M contains more self-occlusions than the other datasets (and also some occlusions by chairs) because it includes many occlusion-causing actions such as sitting, lying down, or bending down. Occlusions by other people are common in Panoptic’s multi-person sequences. Additionally, parts of the bodies are quite frequently outside of the cameras’ field of view in Panoptic;
- Subject appearance: Human3.6M and especially HumanEva-I suffer from a low number of subjects, which restricts variability in body shapes, clothing, hair, age, ethnicity, skin color, etc. Although Panoptic includes many more and quite diverse subjects, it may still not sufficiently cover the huge diversity of real-world human appearances;
- Cameras: In-the-wild data are recorded from different viewpoints with varying resolutions, noise, motion blur, fields of view, depths of field, white-balance, camera-to-subject distance, etc. Within the three databases, only the viewpoint is varied systematically, and the other factors are mostly constant. With more than 500 cameras, Panoptic is the most diverse regarding viewpoint (also using three types of cameras). In contrast to the others, it also includes high-angle and low-angle views (down- and up-looking cameras). If only a few cameras are used, as in Human3.6M and HumanEva-I, there may be a bias in the body poses, because people tend to turn towards one of the cameras (also see [86] on this issue);
- Actions and poses: HumanEva-I and Human3.6M contain acted behavior from several action categories, although the instructions in Human3.6M allowed rather free interpretation and performance. Further, the actions and poses in Human3.6M are much more diverse than in HumanEva-I, including many everyday activities and non-upright poses such as sitting, lying down, or bending down (compared to only upright poses in HumanEva-I). However, some of the acted behavior in Human3.6M involved imaginary objects and interaction partners, which may cause subtle behavioral biases compared to natural interaction. Panoptic captured natural behavior in real social interactions of multiple people, including interactions with real objects such as musical instruments. Thus, it should more closely resemble real-world behavior;
- Annotated skeleton joints: The labels, i.e., the ground-truth joints provided, differ among the datasets in their number and meaning. Most obviously, the head, neck, and hip joints were defined differently by the dataset creators. In Section 4.1, we discuss this issue in detail and propose a way to handle it.
4. Methods
4.1. Joint Harmonization
4.2. Scale Normalization
4.3. Baseline Model and Training
4.4. Anatomical Pose Validation
4.5. Use of Datasets
4.5.1. Dataset Split Details
4.5.2. Virtual Camera Augmentation
4.6. Implementation Details
5. Results
5.1. Joint Harmonization
5.2. Number of Cameras
5.3. Scale Normalization
5.4. Multi-Database Training
5.5. OpenPose Evaluation
5.6. Rotation Errors
5.7. Anatomical Pose Validation
6. Discussion
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Lo Presti, L.; La Cascia, M. 3D skeleton-based human action classification: A survey. Pattern Recognit. 2016, 53, 130–147. [Google Scholar] [CrossRef]
- Handrich, S.; Rashid, O.; Al-Hamadi, A. Non-intrusive Gesture Recognition in Real Companion Environments. In Companion Technology: A Paradigm Shift in Human-Technology Interaction; Biundo, S., Wendemuth, A., Eds.; Springer International Publishing: Cham, Switzerland, 2017; pp. 321–343. [Google Scholar] [CrossRef]
- Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019. [Google Scholar]
- Yan, S.; Xiong, Y.; Lin, D. Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI), New Orleans, LA, USA, 2–7 February 2018; pp. 7444–7452. [Google Scholar]
- Zhang, X.; Xu, C.; Tao, D. Context Aware Graph Convolution for Skeleton-Based Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
- Liu, Z.; Zhang, H.; Chen, Z.; Wang, Z.; Ouyang, W. Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
- Cheng, K.; Zhang, Y.; He, X.; Chen, W.; Cheng, J.; Lu, H. Skeleton-Based Action Recognition With Shift Graph Convolutional Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
- Li, C.; Zhang, X.; Liao, L.; Jin, L.; Yang, W. Skeleton-Based Gesture Recognition Using Several Fully Connected Layers with Path Signature Features and Temporal Transformer Module. Proc. AAAI Conf. Artif. Intell. 2019, 33, 8585–8593. [Google Scholar] [CrossRef]
- Joo, H.; Simon, T.; Cikara, M.; Sheikh, Y. Towards Social Artificial Intelligence: Nonverbal Social Signal Prediction in a Triadic Interaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019. [Google Scholar]
- Joo, H.; Liu, H.; Tan, L.; Gui, L.; Nabbe, B.; Matthews, I.; Kanade, T.; Nobuhara, S.; Sheikh, Y. Panoptic Studio: A Massively Multiview System for Social Motion Capture. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015. [Google Scholar]
- Joo, H.; Simon, T.; Li, X.; Liu, H.; Tan, L.; Gui, L.; Banerjee, S.; Godisart, T.; Nabbe, B.; Matthews, I.; et al. Panoptic Studio: A Massively Multiview System for Social Interaction Capture. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 190–204. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Iskakov, K.; Burkov, E.; Lempitsky, V.; Malkov, Y. Learnable Triangulation of Human Pose. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–3 November 2019. [Google Scholar]
- Shotton, J.; Fitzgibbon, A.; Cook, M.; Sharp, T.; Finocchio, M.; Moore, R.; Kipman, A.; Blake, A. Real-time human pose recognition in parts from single depth images. In Proceedings of the 24th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2011), Colorado Springs, CO, USA, 20–25 June 2011; pp. 1297–1304. [Google Scholar]
- Handrich, S.; Al-Hamadi, A. Localizing body joints from single depth images using geodetic distances and random tree walk. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 146–150. [Google Scholar] [CrossRef]
- Handrich, S.; Waxweiler, P.; Werner, P.; Al-Hamadi, A. 3D Human Pose Estimation Using Stochastic Optimization in Real Time. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 555–559. [Google Scholar]
- Adib, F.; Kabelac, Z.; Katabi, D.; Miller, R.C. 3D Tracking via Body Radio Reflections. In Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation, NSDI’14, Seattle, WA, USA, 2–4 April 2014; USENIX Association: Berkeley, CA, USA, 2014; pp. 317–329. [Google Scholar]
- Zhao, M.; Li, T.; Alsheikh, M.A.; Tian, Y.; Zhao, H.; Torralba, A.; Katabi, D. Through-Wall Human Pose Estimation Using Radio Signals. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7356–7365. [Google Scholar] [CrossRef]
- Wang, Z.; Liu, Y.; Liao, Q.; Ye, H.; Liu, M.; Wang, L. Characterization of a RS-LiDAR for 3D Perception. In Proceedings of the 2018 IEEE 8th Annual International Conference on CYBER Technology in Automation, Control, and Intelligent Systems (CYBER), Tianjin, China, 18–23 July 2018; pp. 564–569. [Google Scholar] [CrossRef] [Green Version]
- Ionescu, C.; Li, F.; Sminchisescu, C. Latent Structured Models for Human Pose Estimation. In Proceedings of the International Conference on Computer Vision (ICCV), Barcelona, Spain, 6–13 November 2011. [Google Scholar]
- Ionescu, C.; Papava, D.; Olaru, V.; Sminchisescu, C. Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 1325–1339. [Google Scholar] [CrossRef] [PubMed]
- Sigal, L.; Black, M.J. HumanEva: Synchronized Video and Motion Capture Dataset for Evaluation of Articulated Human Motion; Technical Report; Brown University: Providence, RI, USA, 2006. [Google Scholar]
- Sigal, L.; Balan, A.O.; Black, M.J. HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. Int. J. Comput. Vis. 2010, 87, 4–27. [Google Scholar] [CrossRef]
- Mehta, D.; Rhodin, H.; Casas, D.; Fua, P.; Sotnychenko, O.; Xu, W.; Theobalt, C. Monocular 3D Human Pose Estimation in the Wild Using Improved CNN Supervision. In Proceedings of the 2017 International Conference on 3D Vision (3DV), Verona, Italy, 10–12 October 2017. [Google Scholar] [CrossRef] [Green Version]
- Fabbri, M.; Lanzi, F.; Calderara, S.; Palazzi, A.; Vezzani, R.; Cucchiara, R. Learning to Detect and Track Visible and Occluded Body Joints in a Virtual World. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
- Torralba, A.; Efros, A.A. Unbiased look at dataset bias. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Colorado Springs, CO, USA, 20–25 June 2011; pp. 1521–1528. [Google Scholar] [CrossRef] [Green Version]
- Martinez, J.; Hossain, R.; Romero, J.; Little, J.J. A simple yet effective baseline for 3d human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar] [CrossRef] [Green Version]
- Wei, S.E.; Ramakrishna, V.; Kanade, T.; Sheikh, Y. Convolutional Pose Machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar] [CrossRef] [Green Version]
- Newell, A.; Yang, K.; Deng, J. Stacked Hourglass Networks for Human Pose Estimation. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 8–16 October 2016. [Google Scholar]
- Chen, Y.; Wang, Z.; Peng, Y.; Zhang, Z.; Yu, G.; Sun, J. Cascaded Pyramid Network for Multi-Person Pose Estimation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7103–7112. [Google Scholar]
- Cao, Z.; Simon, T.; Wei, S.E.; Sheikh, Y. Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar] [CrossRef] [Green Version]
- Cao, Z.; Hidalgo Martinez, G.; Simon, T.; Wei, S.; Sheikh, Y.A. OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 172–186. [Google Scholar] [CrossRef] [Green Version]
- Nibali, A.; He, Z.; Morgan, S.; Prendergast, L. Numerical Coordinate Regression with Convolutional Neural Networks. arXiv 2019, arXiv:1801.07372. [Google Scholar]
- Papandreou, G.; Zhu, T.; Kanazawa, N.; Toshev, A.; Tompson, J.; Bregler, C.; Murphy, K. Towards Accurate Multi-person Pose Estimation in the Wild. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3711–3719. [Google Scholar]
- Pishchulin, L.; Insafutdinov, E.; Tang, S.; Andres, B.; Andriluka, M.; Gehler, P.; Schiele, B. DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4929–4937. [Google Scholar]
- Nie, X.; Feng, J.; Xing, J.; Yan, S. Pose Partition Networks for Multi-Person Pose Estimation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 684–699. [Google Scholar]
- Xiao, B.; Wu, H.; Wei, Y. Simple baselines for human pose estimation and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 466–481. [Google Scholar] [CrossRef] [Green Version]
- Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep High-Resolution Representation Learning for Human Pose Estimation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5686–5696. [Google Scholar]
- Habibie, I.; Xu, W.; Mehta, D.; Pons-Moll, G.; Theobalt, C. In the Wild Human Pose Estimation Using Explicit 2D Features and Intermediate 3D Representations. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 10897–10906. [Google Scholar]
- Pavlakos, G.; Zhou, X.; Daniilidis, K. Ordinal Depth Supervision for 3D Human Pose Estimation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7307–7316. [Google Scholar] [CrossRef] [Green Version]
- Zhou, X.; Huang, Q.; Sun, X.; Xue, X.; Wei, Y. Towards 3D Human Pose Estimation in the Wild: A Weakly-supervised Approach. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 398–407. [Google Scholar]
- Chen, C.H.; Ramanan, D. 3D human pose estimation = 2D pose estimation + matching. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5759–5767. [Google Scholar] [CrossRef] [Green Version]
- Sun, X.; Xiao, B.; Wei, F.; Liang, S.; Wei, Y. Integral human pose regression. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 529–545. [Google Scholar] [CrossRef] [Green Version]
- Zhou, X.; Zhu, M.; Leonardos, S.; Derpanis, K.; Daniilidis, K. Sparseness Meets Deepness: 3D Human Pose Estimation from Monocular Video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4966–4975. [Google Scholar]
- Luo, C.; Chu, X.; Yuille, A. OriNet: A Fully Convolutional Network for 3D Human Pose Estimation. In Proceedings of the British Machine Vision Conference BMVC, Newcastle, UK, 3–6 September 2018. [Google Scholar]
- Tome, D.; Russell, C.; Agapito, L. Lifting from the Deep: Convolutional 3D Pose Estimation from a Single Image. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2500–2509. [Google Scholar]
- Rogez, G.; Weinzaepfel, P.; Schmid, C. LCR-Net: Localization-Classification-Regression for Human Pose. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Nibali, A.; He, Z.; Morgan, S.; Prendergast, L. 3D Human Pose Estimation with 2D Marginal Heatmaps. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 7–11 January 2019. [Google Scholar]
- Pavlakos, G.; Zhu, L.; Zhou, X.; Daniilidis, K. Learning to Estimate 3D Human Pose and Shape from a Single Color Image. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 459–468. [Google Scholar] [CrossRef] [Green Version]
- Luvizon, D.C.; Picard, D.; Tabia, H. 2D/3D Pose Estimation and Action Recognition Using Multitask Deep Learning. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
- Wang, C.; Wang, Y.; Lin, Z.; Yuille, A.L.; Gao, W. Robust Estimation of 3D Human Poses from a Single Image. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 2369–2376. [Google Scholar] [CrossRef] [Green Version]
- Dabral, R.; Mundhada, A.; Kusupati, U.; Afaque, S.; Sharma, A.; Jain, A. Learning 3D Human Pose from Structure and Motion. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
- Tekin, B.; Márquez-Neila, P.; Salzmann, M.; Fua, P. Learning to Fuse 2D and 3D Image Cues for Monocular Body Pose Estimation. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
- Li, S.; Ke, L.; Pratama, K.; Tai, Y.W.; Tang, C.K.; Cheng, K.T. Cascaded Deep Monocular 3D Human Pose Estimation With Evolutionary Training Data. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 6172–6182. [Google Scholar] [CrossRef]
- Chen, C.H.; Tyagi, A.; Agrawal, A.; Drover, D.; Rohith, M.V.; Stojanov, S.; Rehg, J.M. Unsupervised 3D Pose Estimation With Geometric Self-Supervision. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5707–5717. [Google Scholar] [CrossRef] [Green Version]
- Lin, J.; Lee, G.H. Trajectory Space Factorization for Deep Video-Based 3D Human Pose Estimation. In Proceedings of the British Machine Vision Conference (BMVC), Cardiff, UK, 9–12 September 2019. [Google Scholar]
- Katircioglu, I.; Tekin, B.; Salzmann, M.; Lepetit, V.; Fua, P. Learning Latent Representations of 3D Human Pose with Deep Neural Networks. Int. J. Comput. Vis. 2018, 126, 1326–1341. [Google Scholar] [CrossRef] [Green Version]
- Chen, T.; Fang, C.; Shen, X.; Zhu, Y.; Chen, Z.; Luo, J. Anatomy-aware 3D Human Pose Estimation with Bone-based Pose Decomposition. IEEE Trans. Circuits Syst. Video Technol. 2021. [Google Scholar] [CrossRef]
- Benzine, A.; Luvison, B.; Pham, Q.C.; Achard, C. Single-shot 3D multi-person pose estimation in complex images. Pattern Recognit. 2021, 112, 107534. [Google Scholar] [CrossRef]
- Wu, H.; Xiao, B. 3D Human Pose Estimation via Explicit Compositional Depth Maps. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 12378–12385. [Google Scholar] [CrossRef]
- Sárándi, I.; Linder, T.; Arras, K.O.; Leibe, B. Synthetic Occlusion Augmentation with Volumetric Heatmaps for the 2018 ECCV PoseTrack Challenge on 3D Human Pose Estimation. arXiv 2018, arXiv:1809.04987v3. [Google Scholar]
- Cheng, Y.; Yang, B.; Wang, B.; Wending, Y.; Tan, R. Occlusion-Aware Networks for 3D Human Pose Estimation in Video. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–3 November 2019; pp. 723–732. [Google Scholar] [CrossRef]
- Popa, A.I.; Zanfir, M.; Sminchisescu, C. Deep Multitask Architecture for Integrated 2D and 3D Human Sensing. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Honolulu, HI, USA, 21–26 July 2017; pp. 4714–4723. [Google Scholar] [CrossRef] [Green Version]
- Zanfir, A.; Marinoiu, E.; Sminchisescu, C. Monocular 3D Pose and Shape Estimation of Multiple People in Natural Scenes—The Importance of Multiple Scene Constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
- Zanfir, A.; Marinoiu, E.; Zanfir, M.; Popa, A.I.; Sminchisescu, C. Deep Network for the Integrated 3D Sensing of Multiple People in Natural Images. In Advances in Neural Information Processing Systems; Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2018; Volume 31. [Google Scholar]
- Radwan, I.; Dhall, A.; Goecke, R. Monocular Image 3D Human Pose Estimation under Self-Occlusion. In Proceedings of the 2013 IEEE International Conference on Computer Vision (ICCV), Sydney, Australia, 1–8 December 2013; pp. 1888–1895. [Google Scholar] [CrossRef]
- Yasin, H.; Iqbal, U.; Kruger, B.; Weber, A.; Gall, J. A Dual-Source Approach for 3D Pose Estimation from a Single Image. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4948–4956. [Google Scholar] [CrossRef] [Green Version]
- Moreno-Noguer, F. 3D Human Pose Estimation from a Single Image via Distance Matrix Regression. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1561–1570. [Google Scholar]
- Pavlakos, G.; Zhou, X.; Derpanis, K.G.; Daniilidis, K. Coarse-to-Fine Volumetric Prediction for Single-Image 3D Human Pose. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Li, S.; Chan, A.B. 3D human pose estimation from monocular images with deep convolutional neural network. In Proceedings of the Asian Conference on Computer Vision (ACCV), Singapore, 1–5 November 2014; Springer: Cham, Switzerland, 2014; pp. 332–347. [Google Scholar] [CrossRef]
- Kanazawa, A.; Black, M.J.; Jacobs, D.W.; Malik, J. End-to-end Recovery of Human Shape and Pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar] [CrossRef] [Green Version]
- Trumble, M.; Gilbert, A.; Hilton, A.; Collomosse, J. Deep autoencoder for combined human pose estimation and body model upscaling. In Proceedings of the European Conference on Computer Vision ECCV, Munich, Germany, 8–14 September 2018; pp. 784–800. [Google Scholar] [CrossRef] [Green Version]
- Güler, R.A.; Neverova, N.; Kokkinos, I. DensePose: Dense Human Pose Estimation In The Wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
- Rhodin, H.; Salzmann, M.; Fua, P. Unsupervised geometry-aware representation for 3D human pose estimation. In Proceedings of the European Conference on Computer Vision ECCV, Munich, Germany, 8–14 September 2018; pp. 765–782. [Google Scholar] [CrossRef] [Green Version]
- Pavllo, D.; Feichtenhofer, C.; Grangier, D.; Auli, M. 3D human pose estimation in video with temporal convolutions and semi-supervised training. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 7745–7754. [Google Scholar]
- Hossain, M.R.I.; Little, J.J. Exploiting temporal information for 3D human pose estimation. In Proceedings of the European Conference on Computer Vision ECCV, Munich, Germany, 8–14 September 2018; pp. 68–84. [Google Scholar] [CrossRef] [Green Version]
- Zhao, L.; Peng, X.; Tian, Y.; Kapadia, M.; Metaxas, D.N. Semantic Graph Convolutional Networks for 3D Human Pose Regression. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3420–3430. [Google Scholar] [CrossRef] [Green Version]
- Vicon. Available online: https://ien.vicon.eu (accessed on 27 May 2021).
- The Captury. Available online: https://captury.com (accessed on 27 May 2021).
- Wang, L.; Chen, Y.; Guo, Z.; Qian, K.; Lin, M.; Li, H.; Ren, J.S. Generalizing monocular 3D human pose estimation in-the-wild. In Proceedings of the 2019 International Conference on Computer Vision Workshop (ICCVW), Seoul, Korea, 27–28 October 2019; pp. 4024–4033. [Google Scholar] [CrossRef] [Green Version]
- Rogez, G.; Weinzaepfel, P.; Schmid, C. LCR-Net++: Multi-Person 2D and 3D Pose Detection in Natural Images. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 1146–1161. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Chen, W.; Wang, H.; Li, Y.; Su, H.; Wang, Z.; Tu, C.; Lischinski, D.; Cohen-Or, D.; Chen, B. Synthesizing Training Images for Boosting Human 3D Pose Estimation. In Proceedings of the 2016 4th International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 479–488. [Google Scholar]
- de Souza, C.R.; Gaidon, A.; Cabon, Y.; Peña, A.M.L. Procedural Generation of Videos to Train Deep Action Recognition Networks. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Honolulu, HI, USA, 21–26 July 2017; pp. 2594–2604. [Google Scholar]
- Varol, G.; Romero, J.; Martin, X.; Mahmood, N.; Black, M.J.; Laptev, I.; Schmid, C. Learning from Synthetic Humans. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Honolulu, HI, USA, 21–26 July 2017; pp. 4627–4635. [Google Scholar] [CrossRef] [Green Version]
- Peng, X.; Sun, B.; Ali, K.; Saenko, K. Learning Deep Object Detectors from 3D Models. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015. [Google Scholar]
- Rogez, G.; Schmid, C. Image-based Synthesis for Deep 3D Human Pose Estimation. Int. J. Comput. Vis. 2018, 126, 993–1008. [Google Scholar] [CrossRef] [Green Version]
- Wang, Z.; Shin, D.; Fowlkes, C.C. Predicting Camera Viewpoint Improves Cross-dataset Generalization for 3D Human Pose Estimation. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020. [Google Scholar]
- Zhao, M.; Tian, Y.; Zhao, H.; Alsheikh, M.A.; Li, T.; Hristov, R.; Kabelac, Z.; Katabi, D.; Torralba, A. RF-based 3D skeletons. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, Budapest, Hungary, 20–25 August 2018; pp. 267–281. [Google Scholar] [CrossRef]
- Wang, F.; Zhou, S.; Panev, S.; Han, J.; Huang, D. Person-in-WiFi: Fine-Grained Person Perception Using WiFi. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–3 November 2019; pp. 5451–5460. [Google Scholar] [CrossRef] [Green Version]
- Jiang, W.; Xue, H.; Miao, C.; Wang, S.; Lin, S.; Tian, C.; Murali, S.; Hu, H.; Sun, Z.; Su, L. Towards 3D human pose construction using wifi. In Proceedings of the 26th Annual International Conference on Mobile Computing and Networking, New York, NY, USA, 21–25 September 2020; pp. 1–14. [Google Scholar] [CrossRef]
- Hougne, P.; Imani, M.F.; Diebold, A.V.; Horstmeyer, R.; Smith, D.R. Learned Integrated Sensing Pipeline: Reconfigurable Metasurface Transceivers as Trainable Physical Layer in an Artificial Neural Network. Adv. Sci. 2020, 7, 1901913. [Google Scholar] [CrossRef] [Green Version]
- Li, L.; Shuang, Y.; Ma, Q.; Li, H.; Zhao, H.; Wei, M.; Liu, C.; Hao, C.; Qiu, C.W.; Cui, T.J. Intelligent metasurface imager and recognizer. Light Sci. Appl. 2019, 8, 97. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Li, H.Y.; Zhao, H.T.; Wei, M.L.; Ruan, H.X.; Shuang, Y.; Cui, T.J.; del Hougne, P.; Li, L. Intelligent Electromagnetic Sensing with Learnable Data Acquisition and Processing. Patterns 2020, 1, 100006. [Google Scholar] [CrossRef] [PubMed]
- Kim, K.; Konda, P.C.; Cooke, C.L.; Appel, R.; Horstmeyer, R. Multi-element microscope optimization by a learned sensing network with composite physical layers. Opt. Lett. 2020, 45, 5684. [Google Scholar] [CrossRef] [PubMed]
- Li, T.; Liu, Q.; Zhou, X. Practical Human Sensing in the Light. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services, MobiSys’16, Singapore, 26–30 June 2016; pp. 71–84. [Google Scholar] [CrossRef] [Green Version]
- Andriluka, M.; Pishchulin, L.; Gehler, P.; Schiele, B. 2D human pose estimation: New benchmark and state-of-the-art analysis. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 3686–3693. [Google Scholar] [CrossRef]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; Springer: Cham, Switzerland, 2014; Volume 8693 LNCS, pp. 740–755. [Google Scholar] [CrossRef] [Green Version]
- Werner, P.; Saxen, F.; Al-Hamadi, A. Handling Data Imbalance in Automatic Facial Action Intensity Estimation. In Proceedings of the British Machine Vision Conference (BMVC), Swansea, UK, 7–10 September 2015; pp. 124.1–124.12. [Google Scholar] [CrossRef] [Green Version]
- Zhu, Y.; Long, Y.; Guan, Y.; Newsam, S.; Shao, L. Towards Universal Representation for Unseen Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
- Othman, E.; Werner, P.; Saxen, F.; Al-Hamadi, A.; Walter, S. Cross-database evaluation of pain recognition from facial video. In Proceedings of the International Symposium on Image and Signal Processing and Analysis (ISPA), Dubrovnik, Croatia, 23–25 September 2019; pp. 181–186. [Google Scholar] [CrossRef]
- Werner, P.; Lopez-Martinez, D.; Walter, S.; Al-Hamadi, A.; Gruss, S.; Picard, R. Automatic Recognition Methods Supporting Pain Assessment: A Survey. IEEE Trans. Affect. Comput. 2019. [Google Scholar] [CrossRef]
- Li, S.; Deng, W. Deep Facial Expression Recognition: A Survey. IEEE Trans. Affect. Comput. 2020, 1–20. [Google Scholar] [CrossRef] [Green Version]
- Wang, M.; Dong, W. Deep Face Recognition: A Survey. arXiv 2020, arXiv:1804.06655. [Google Scholar]
- Pietak, A.; Ma, S.; Beck, C.W.; Stringer, M.D. Fundamental ratios and logarithmic periodicity in human limb bones. J. Anat. 2013, 222, 526–537. [Google Scholar] [CrossRef] [PubMed]
Short Biography of Authors
Michal Rapczynski received his B.Sc. and M.Sc. degrees from Otto von Guericke University Magdeburg, Germany. Since 2013, he has been a researcher and Ph.D. candidate in the Neuro-Information Technology Group at Otto von Guericke University Magdeburg. His research focuses on computer vision, image processing, machine learning, and biomedical signal processing.
Philipp Werner received his Master's degree (Dipl.-Ing.-Inf.) in computer science from Otto von Guericke University Magdeburg, Germany, in 2011. Since then, he has been working as a Research Assistant and Ph.D. candidate in the Neuro-Information Technology Group of the Otto von Guericke University, where he has also been a research team leader since 2018. His research focuses on pain recognition, facial expression recognition, human behavior recognition, computer vision, pattern recognition, and deep learning. He has authored and co-authored more than 40 articles, which have been cited more than 700 times. See http://philipp-werner.info for more details.
Sebastian Handrich received his B.S. and M.S. degrees in electrical engineering from the University of Magdeburg, Germany, in 2008. After working as a research assistant at the University of Oldenburg in the field of biological psychology, he is currently working on his Ph.D. in electrical engineering and information technology at the University of Magdeburg. His research focuses on human pose estimation, facial expression analysis, affective computing, and human-machine interaction.
Ayoub Al-Hamadi received the Ph.D. degree in technical computer science in 2001 and the Habilitation degree in artificial intelligence and the Venia Legendi in pattern recognition and image processing from Otto von Guericke University Magdeburg, Germany, in 2010. He is a Professor and the Head of the Neuro-Information Technology Department (NIT) at Otto von Guericke University Magdeburg. He is the author of more than 350 papers in peer-reviewed international journals, conferences, and books. His research interests include computer vision, pattern recognition, artificial intelligence, and human-robot interaction. See http://www.iikt.ovgu.de/al_hamadi.html for more details.
Method (Reference) | MPJPE (mm) | Method (Reference) | MPJPE (mm) |
---|---|---|---|
Ionescu et al. [20] | 162.1 | Habibie et al. [38] | 65.7 |
Pavlakos et al. [39] | 115.1 | Zhou et al. [40] | 64.9 |
Chen and Ramanan [41] | 114.2 | Sun et al. [42] | 64.1 |
Zhou et al. [43] | 113.0 | Luo et al. [44] | 61.3 |
Tome et al. [45] | 88.4 | Rogez et al. [46] | 61.2 |
Martinez et al. [26] | 87.3 | Nibali et al. [47] | 55.4 |
Pavlakos et al. [48] | 75.9 | Luvizon et al. [49] | 53.2 |
Wang et al. [50] | 71.9 | Dabral et al. [51] | 52.1 |
Tekin et al. [52] | 69.7 | Li et al. [53] | 50.9 |
Chen et al. [54] | 68.0 | Lin and Lee [55] | 46.6 |
Katircioglu et al. [56] | 67.3 | Chen et al. [57] | 44.1 |
Benzine et al. [58] | 66.4 | Wu and Xiao [59] | 43.2 |
Sárándi et al. [60] | 65.7 | Cheng et al. [61] | 42.9 |
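The tables in this section report the mean per-joint position error (MPJPE), i.e., the mean Euclidean distance between predicted and ground-truth joint positions. A minimal reference implementation of the metric as commonly defined:

```python
import numpy as np

def mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean Per-Joint Position Error in the input units (mm in the tables).

    pred, gt: arrays of shape (..., J, 3); the mean is taken over all joints
    (and, if present, all frames).
    """
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))
```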
Method (Reference) | MPJPE (mm) |
---|---|
Popa et al. [62] | 203.4 |
Zanfir et al. [63] | 153.4 |
Zanfir et al. [64] | 72.1 |
Benzine et al. [58] | 68.5 |
Method (Reference) | MPJPE (mm) |
---|---|
Radwan et al. [65] | 89.5 |
Wang et al. [50] | 71.3 |
Yasin et al. [66] | 38.9 |
Moreno-Noguer [67] | 26.9 |
Pavlakos et al. [68] | 25.5 |
Martinez et al. [26] | 24.6 |
Pavlakos et al. [39] | 18.3 |
Property | HumanEva-I | Human3.6M | Panoptic
---|---|---|---
Subjects | 4 | 11 | >100 |
Actions | 6 | 15 | many |
Multi-person | - | - | ✓ |
Recording duration | 10 min | 298 min | 689 min |
Cameras | 7 | 4 | >500 |
Total frames | 0.26 M | 3.6 M | >500 M |
Skeleton joints | 15 | 32 | 19 |
Joint | HumanEva-I | Human3.6M | Panoptic | OpenPose |
---|---|---|---|---|
R Hip | 1 * | 1 * | 12 | 9 |
R Knee | 2 | 2 | 13 | 10 |
R Ankle | 3 | 3 | 14 | 11 |
L Hip | 4 * | 6 * | 6 | 12 |
L Knee | 5 | 7 | 7 | 13 |
L Ankle | 6 | 8 | 8 | 14 |
Neck | 7 | 13 * | 0 | 1 |
Head | 8 * | 15 * | (17 + 18)/2 | (17 + 18)/2 |
L Shoulder | 9 | 17 | 3 | 5 |
L Elbow | 10 | 18 | 4 | 6 |
L Hand | 11 | 19 | 5 | 7 |
R Shoulder | 12 | 25 | 9 | 2 |
R Elbow | 13 | 26 | 10 | 3 |
R Hand | 14 | 27 | 11 | 4 |
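The index mapping in the table above translates directly into code. The following minimal sketch (the function name and structure are ours, for illustration) performs the index selection, averaging joints 17 and 18 (the ears) to obtain the head position for Panoptic and OpenPose. Joints marked with an asterisk additionally require the redefinitions described in Section 4.1, which are omitted here:

```python
import numpy as np

# Target order of the harmonized skeleton (the rows of the table above).
HARMONIZED = ["RHip", "RKnee", "RAnkle", "LHip", "LKnee", "LAnkle",
              "Neck", "Head", "LShoulder", "LElbow", "LHand",
              "RShoulder", "RElbow", "RHand"]

# Source indices per dataset, taken directly from the table. Tuple entries
# average two source joints (ear midpoint as head for Panoptic/OpenPose).
JOINT_MAP = {
    "HumanEva-I": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
    "Human3.6M":  [1, 2, 3, 6, 7, 8, 13, 15, 17, 18, 19, 25, 26, 27],
    "Panoptic":   [12, 13, 14, 6, 7, 8, 0, (17, 18), 3, 4, 5, 9, 10, 11],
    "OpenPose":   [9, 10, 11, 12, 13, 14, 1, (17, 18), 5, 6, 7, 2, 3, 4],
}

def harmonize(joints: np.ndarray, dataset: str) -> np.ndarray:
    """Map dataset-specific joints of shape (J_src, D) to the 14-joint skeleton."""
    out = []
    for idx in JOINT_MAP[dataset]:
        if isinstance(idx, tuple):            # average two joints, e.g., the ears
            out.append(joints[list(idx)].mean(axis=0))
        else:
            out.append(joints[idx])
    return np.stack(out)
```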
Dataset | Training Set (Reduced) | Training Set (Full) | Testing Set
---|---|---|---
HE1 | 113 | 225 | 17.8 |
H36M | 1169 | 2312 | 137.7 |
PAN | 4131 | 9809 | 292.0 |
HE1 (OP) | - | - | 1.7 |
H36M (OP) | - | - | 52.1 |
Training Data ↓ / Test Data → | HE1 | H36M | PAN
---|---|---|---
original joints (mean 133.7) | |||
HE1 | 95.9 ± 2.9 | 299.7 ± 9.5 | 148.8 ± 4.9 |
H36M | 142.1 ± 3.9 | 67.6 ± 0.6 | 95.1 ± 3.2 |
PAN | 166.7 ± 2.4 | 143.6 ± 1.2 | 43.9 ± 0.3 |
harmonized joints (mean 120.0) | |||
HE1 | 91.7 ± 1.9 | 254.1 ± 5.8 | 125.4 ± 4.3 |
H36M | 141.7 ± 3.8 | 67.0 ± 0.6 | 98.3 ± 2.2 |
PAN | 117.8 ± 2.4 | 140.4 ± 1.3 | 43.7 ± 0.2 |
mean error change | |||
HE1 | −4.3% | −15.2% | −15.7% |
H36M | −0.2% | −0.9% | 3.4% |
PAN | −29.3% | −2.3% | −0.6% |
Training Data ↓ / Test Data → | HE1 | H36M | PAN
---|---|---|---
reduced camera set (mean 132.6) | |||
HE1 | 96.6 ± 1.9 | 270.6 ± 11.5 | 176.2 ± 5.4 |
H36M | 166.4 ± 7.7 | 75.1 ± 0.4 | 105.2 ± 5.5 |
PAN | 129.5 ± 2.1 | 131.4 ± 0.7 | 42.2 ± 0.3 |
full camera set (mean 120.0) | |||
HE1 | 91.7 ± 1.9 | 254.1 ± 5.8 | 125.4 ± 4.3 |
H36M | 141.7 ± 3.8 | 67.0 ± 0.6 | 98.3 ± 2.2 |
PAN | 117.8 ± 2.4 | 140.4 ± 1.3 | 43.7 ± 0.2 |
mean error change | |||
HE1 | −5.1% | −6.1% | −28.8% |
H36M | −14.8% | −10.7% | −6.6% |
PAN | −9.0% | 6.8% | 3.4% |
Training Data ↓ / Test Data → | HE1 | H36M | PAN
---|---|---|---
no scale normalization (mean 120.0) | |||
HE1 | 91.7 ± 1.9 | 254.1 ± 5.8 | 125.4 ± 4.3 |
H36M | 141.7 ± 3.8 | 67.0 ± 0.6 | 98.3 ± 2.2 |
PAN | 117.8 ± 2.4 | 140.4 ± 1.3 | 43.7 ± 0.2 |
with scale normalization (mean 90.1) | |||
HE1 | 69.2 ± 0.7 | 170.3 ± 4.0 | 152.7 ± 2.7 |
H36M | 86.0 ± 1.2 | 55.2 ± 0.5 | 89.2 ± 0.7 |
PAN | 67.3 ± 1.0 | 83.1 ± 0.6 | 37.9 ± 0.4 |
mean error change | |||
HE1 | −24.6% | −33.0% | 21.8% |
H36M | −39.3% | −17.7% | −9.3% |
PAN | −42.9% | −40.8% | −13.2% |
Training Data ↓ / Test Data → | HE1 | H36M | PAN
---|---|---|---
scale error (full cam set) | |||
HE1 | 0.89 | 1.09 | 0.97 |
H36M | 0.90 | 0.99 | 1.01 |
PAN | 0.84 | 0.90 | 1.00 |
Training Data ↓ / Test Data → | HE1 | H36M | PAN
---|---|---|---
no scale normalization (mean 103.0) | |||
H36M + PAN | 130.2 ± 2.9 | 103.8 ± 1.6 | 43.0 ± 0.3 |
HE1 + PAN | 115.0 ± 1.3 | 143.0 ± 2.2 | 45.5 ± 1.6 |
HE1 + H36M | 135.5 ± 1.2 | 75.1 ± 1.1 | 103.7 ± 4.3 |
with scale normalization (mean 69.0) | |||
H36M + PAN | 64.9 ± 0.5 | 63.0 ± 0.4 | 38.3 ± 0.3 |
HE1 + PAN | 67.2 ± 0.7 | 83.2 ± 0.8 | 38.3 ± 0.7 |
HE1 + H36M | 100.4 ± 1.2 | 62.6 ± 0.8 | 103.0 ± 2.1 |
HE1 | 69.2 ± 0.7 | 170.3 ± 4.0 | 152.7 ± 2.7 |
H36M | 86.0 ± 1.2 | 55.2 ± 0.5 | 89.2 ± 0.7 |
PAN | 67.3 ± 1.0 | 83.1 ± 0.6 | 37.9 ± 0.4 |
mean error change | |||
H36M + PAN | −50.1% | −39.3% | −10.8% |
HE1 + PAN | −41.5% | −41.8% | −15.9% |
HE1 + H36M | −25.8% | −16.6% | −0.6% |
Training Data ↓ / Evaluation Data → | HE1 (OP) | H36M (OP) | HE1 | H36M | PAN
---|---|---|---|---|---
no alignment | |||||
HE1 | 138.3 ± 1.3 | 184.4 ± 4.0 | 69.2 ± 0.7 | 170.3 ± 4.0 | 152.7 ± 2.7 |
H36M | 151.3 ± 1.7 | 108.6 ± 0.7 | 86.0 ± 1.2 | 55.2 ± 0.5 | 89.2 ± 0.7 |
PAN | 126.1 ± 1.2 | 130.8 ± 1.1 | 67.3 ± 1.0 | 83.1 ± 0.6 | 37.9 ± 0.4 |
Procrustes alignment | |||||
HE1 | 105.8 ± 0.6 | 109.5 ± 1.3 | 57.8 ± 0.7 | 105.0 ± 2.3 | 104.6 ± 1.9 |
H36M | 103.1 ± 0.8 | 65.6 ± 0.6 | 61.4 ± 0.6 | 41.5 ± 0.2 | 48.6 ± 0.9 |
PAN | 93.4 ± 0.7 | 71.9 ± 0.5 | 55.0 ± 0.8 | 55.4 ± 0.3 | 28.2 ± 0.3 |
mean error change | |||||
HE1 | −23.5% | −40.6% | −16.4% | −38.3% | −31.5% |
H36M | −31.8% | −39.6% | −28.6% | −24.8% | −45.5% |
PAN | −26.0% | −45.0% | −18.3% | −33.3% | −25.6% |
Training Data ↓ / Test Data → | HE1 (OP) | H36M (OP) | HE1 | H36M | PAN
---|---|---|---|---|---
rotation error (reduced cam set, no scale norm) | |||||
HE1 | 23.2° | 35.2° | 9.3° | 31.3° | 20.2° |
H36M | 28.5° | 11.2° | 22.7° | 8.2° | 11.0° |
PAN | 22.3° | 18.6° | 15.3° | 17.1° | 4.2° |
rotation error (full cam set, no scale norm) | |||||
HE1 | 21.3° | 35.1° | 10.1° | 30.6° | 13.6° |
H36M | 26.7° | 10.9° | 18.1° | 7.2° | 10.2° |
PAN | 19.8° | 18.6° | 12.1° | 18.7° | 4.1° |
rotation error (full cam set, using scale norm) | |||||
HE1 | 24.8° | 24.3° | 8.9° | 20.9° | 24.1° |
H36M | 18.1° | 12.5° | 8.8° | 6.5° | 9.3° |
PAN | 18.9° | 14.8° | 7.7° | 8.9° | 4.7° |
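The rotation errors in this table can be read as the angle of the best-fitting rotation between the centered predicted and ground-truth skeletons. Assuming that definition (our reading of the metric; Section 5.6 gives the exact one), a sketch:

```python
import numpy as np

def rotation_error_deg(pred: np.ndarray, gt: np.ndarray) -> float:
    """Angle (degrees) of the optimal rotation between two (J, 3) skeletons."""
    P, G = pred - pred.mean(axis=0), gt - gt.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ G)             # Kabsch-style rotation fit
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    R = U @ D @ Vt
    # Angle of a rotation matrix: cos(theta) = (trace(R) - 1) / 2.
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_theta)))
```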
Training Data ↓ / Test Data → | HE1 (OP) | H36M (OP) | HE1 | H36M | PAN
---|---|---|---|---|---
no validation | |||||
HE1 | 138.3 ± 1.3 | 184.4 ± 4.0 | 69.2 ± 0.7 | 170.3 ± 4.0 | 152.7 ± 2.7 |
H36M | 151.3 ± 1.7 | 108.6 ± 0.7 | 86.0 ± 1.2 | 55.2 ± 0.5 | 89.2 ± 0.7 |
PAN | 126.1 ± 1.2 | 130.8 ± 1.1 | 67.3 ± 1.0 | 83.1 ± 0.6 | 37.9 ± 0.4 |
using validation | |||||
HE1 | 125.8 ± 2.4 | 166.1 ± 4.3 | 67.5 ± 0.8 | 155.4 ± 6.6 | 138.6 ± 2.3 |
H36M | 142.4 ± 1.9 | 108.3 ± 0.7 | 84.9 ± 1.3 | 54.9 ± 0.6 | 88.9 ± 0.7 |
PAN | 113.9 ± 1.1 | 130.4 ± 1.0 | 65.9 ± 1.0 | 81.8 ± 0.5 | 37.6 ± 0.4 |
mean error change | |||||
HE1 | −9.0% | −9.9% | −2.5% | −8.8% | −9.3% |
H36M | −5.8% | −0.3% | −1.3% | −0.5% | −0.3% |
PAN | −9.7% | −0.3% | −2.1% | −1.6% | −0.7% |
rate of rejected poses | |||||
HE1 | 15.8% | 15.7% | 1.8% | 13.8% | 20.7% |
H36M | 12.2% | 1.3% | 1.1% | 0.3% | 2.2% |
PAN | 15.6% | 2.9% | 1.3% | 3.3% | 0.6% |
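As a rough illustration of what such an anatomical validation step can look like, the sketch below rejects poses whose bone-length ratios fall outside plausible ranges. The bones and numeric limits shown are placeholder assumptions; Section 4.4 derives its actual limits from anatomical data [153]:

```python
import numpy as np

# Illustrative only: bone-length ratios relative to the torso (neck to hip
# center). The values below are rough placeholders, NOT the paper's limits.
RATIO_LIMITS = {
    ("RHip", "RKnee"):       (0.55, 1.05),   # thigh / torso
    ("RKnee", "RAnkle"):     (0.55, 1.05),   # lower leg / torso
    ("RShoulder", "RElbow"): (0.40, 0.80),   # upper arm / torso
    ("RElbow", "RHand"):     (0.40, 0.85),   # forearm / torso
}

def is_anatomically_valid(pose: dict) -> bool:
    """Reject poses whose bone-length ratios leave plausible anatomical ranges.

    pose: mapping from joint name to a 3D position (np.ndarray of shape (3,)).
    """
    hip_center = 0.5 * (pose["RHip"] + pose["LHip"])
    torso = np.linalg.norm(pose["Neck"] - hip_center)
    for (a, b), (lo, hi) in RATIO_LIMITS.items():
        ratio = np.linalg.norm(pose[a] - pose[b]) / torso
        if not lo <= ratio <= hi:
            return False
    return True
```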
Method (Train→Test) | HE1→HE1 | HE1→H36M | HE1→PAN | H36M→HE1 | H36M→H36M | H36M→PAN | PAN→HE1 | PAN→H36M | PAN→PAN | Mean
---|---|---|---|---|---|---|---|---|---|---
Martinez et al. [26] | 95.9 | 299.7 | 148.8 | 146.0 | 78.7 | 107.8 | 166.7 | 143.6 | 43.9 | 136.8 |
Proposed | 69.2 | 170.3 | 152.7 | 86.0 | 55.2 | 89.2 | 67.3 | 83.1 | 37.9 | 90.1 |
Proposed + APV | 67.5 | 155.4 | 138.6 | 84.9 | 54.9 | 88.9 | 65.9 | 81.8 | 37.6 | 86.2 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. |
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).