EMHIFormer: An Enhanced Multi-Hypothesis Interaction Transformer for 3D human pose estimation in video

Published: 01 September 2023

Abstract

Monocular 3D human pose estimation is challenging because of depth ambiguity and occlusion. Recent methods exploit spatio-temporal information and generate multiple hypotheses, representing diverse plausible solutions, to alleviate these problems. However, these methods do not fully exploit spatial and temporal information or the relationships among hypotheses. To address these limitations, we propose EMHIFormer (Enhanced Multi-Hypothesis Interaction Transformer) for more accurate 3D human pose estimation. Specifically, we build connections between different Transformer layers so that the model can integrate spatio-temporal information from the previous layer and establish more comprehensive hypotheses. Furthermore, we propose a cross-hypothesis model consisting of a parallel Transformer to strengthen the relationships among hypotheses. We also design an enhanced regression head that adaptively adjusts the channel weights to output the final 3D human pose. We evaluate EMHIFormer with extensive experiments on two challenging datasets, Human3.6M and MPI-INF-3DHP. The results show that EMHIFormer achieves competitive performance on Human3.6M and state-of-the-art performance on MPI-INF-3DHP. Compared with its closest counterpart, MHFormer, our model improves P-MPJPE by 0.6% and MPJPE by 0.5% on Human3.6M, and MPJPE by 46.0% on MPI-INF-3DHP.
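
The abstract reports results in MPJPE and P-MPJPE without defining them. As commonly used in 3D pose estimation, MPJPE is the mean Euclidean distance between predicted and ground-truth joints, and P-MPJPE is the same error after a rigid (Procrustes) alignment of the prediction to the ground truth. The NumPy sketch below illustrates these standard definitions; it is not the authors' evaluation code, and the function names and array shapes are assumptions made for illustration.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean per-joint position error.

    pred, gt: arrays of shape (..., J, 3) in the same units (e.g. millimetres).
    """
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))


def procrustes_align(pred, gt):
    """Align one pose pred (J, 3) to gt (J, 3) with scale, rotation and translation."""
    mu_pred, mu_gt = pred.mean(axis=0), gt.mean(axis=0)
    p0, g0 = pred - mu_pred, gt - mu_gt
    # SVD of the 3x3 cross-covariance gives the optimal rotation (Kabsch).
    u, s, vt = np.linalg.svd(p0.T @ g0)
    # Flip the last singular direction if needed so the result is a proper
    # rotation (determinant +1) rather than a reflection.
    d = np.ones(3)
    if np.linalg.det(u @ vt) < 0:
        d[-1] = -1.0
    rot = u @ np.diag(d) @ vt
    # Least-squares optimal scale for the rotated, centred prediction.
    scale = float(np.sum(s * d) / np.sum(p0 ** 2))
    return scale * p0 @ rot + mu_gt


def p_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment, for a single pose of shape (J, 3)."""
    return mpjpe(procrustes_align(pred, gt), gt)
```

Because P-MPJPE removes global scale, rotation and translation, a prediction that differs from the ground truth only by such a transform has a P-MPJPE of approximately zero even when its MPJPE is large.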

Highlights

We propose the CMHG module to enhance hypothesis generation.
We design the CHR module to strengthen the connections among hypotheses.
We design the ERH to adaptively adjust the channel weights for a better 3D pose.
We propose a combined loss for fast convergence and stable descent.
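
This page does not describe how the regression head adjusts channel weights. One generic way to re-weight feature channels adaptively is a squeeze-and-excitation-style gate: pool the features, pass them through a small bottleneck, and scale each channel by a sigmoid weight. The NumPy sketch below shows that generic technique only; it is an assumption for illustration, not the paper's ERH design, and every name, shape and weight matrix in it is hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_gate(features, w1, w2):
    """Squeeze-and-excitation-style channel re-weighting (illustrative only).

    features: (T, C) per-frame features; w1: (C, C_r); w2: (C_r, C).
    Returns the re-weighted features and the per-channel weights.
    """
    squeezed = features.mean(axis=0)          # (C,) average over frames
    hidden = np.maximum(squeezed @ w1, 0.0)   # (C_r,) ReLU bottleneck
    weights = sigmoid(hidden @ w2)            # (C,) one weight per channel in (0, 1)
    return features * weights, weights


def regress_pose(features, w_out, num_joints=17):
    """Map the centre-frame feature (C,) to a (num_joints, 3) pose.

    w_out: (C, num_joints * 3), a hypothetical linear regression layer.
    """
    center = features[features.shape[0] // 2]
    return (center @ w_out).reshape(num_joints, 3)
```

In a trained model the matrices `w1`, `w2` and `w_out` would be learned; here they are plain arrays so the data flow can be inspected directly.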

Cited By

  • (2024) DBGAN: Dual Branch Generative Adversarial Network for Multi-Modal MRI Translation, ACM Transactions on Multimedia Computing, Communications, and Applications 20:8 (1-22), DOI 10.1145/3657298. Online publication date: 13-Jun-2024
  • (2024) Exploring multi-level transformers with feature frame padding network for 3D human pose estimation, Multimedia Systems 30:5, DOI 10.1007/s00530-024-01451-4. Online publication date: 1-Oct-2024

Published In

Journal of Visual Communication and Image Representation, Volume 95, Issue C
Sep 2023
481 pages

Publisher

Academic Press, Inc., United States

Author Tags

1. 3D human pose estimation
2. Transformer
3. Cross-hypothesis
4. Enhanced regression head

Qualifiers

• Research-article
