Abstract
Human pose estimation is a fundamental yet challenging task in computer vision. Recently, with the involvement of deep neural networks, human pose estimation has made great progress. However, existing pose estimation networks still have difficulty detecting small-scale keypoints and distinguishing semantically confusing keypoints. In this paper, a novel convolutional neural network named the multi-scale position enhancement network is proposed to address these two problems. First, a multi-scale adaptive fusion unit is proposed to dynamically select and fuse features at different scales, allowing small-scale keypoints to obtain more of the detailed information that benefits their detection. Second, we observe that although appearance-similar parts are difficult to distinguish semantically, they differ significantly in spatial location. Therefore, a position enhancement module is designed to highlight features at true joint locations while learning more discriminative features that suppress responses from similar-looking joint regions. Finally, a global context block is applied to refine the prediction results and further improve network performance. Experiments on both single- and multi-person pose estimation benchmarks show that our approach yields more accurate and reliable results.
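To make the fusion idea concrete, here is a minimal NumPy sketch of scale-adaptive feature fusion: feature maps from several scales are aligned to a common resolution and combined with softmax-normalized per-scale weights. This is only an illustration under stated assumptions, not the authors' actual unit; the fixed `gates` scores stand in for weights that the network would predict, and nearest-neighbour upsampling stands in for a learned upsampling path.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def upsample(feat, factor):
    # nearest-neighbour upsampling of a (C, H, W) feature map
    return feat.repeat(factor, axis=1).repeat(factor, axis=2)

def adaptive_fuse(feats, gates):
    """Fuse same-channel feature maps from different scales using
    per-scale gate scores (fixed here; learned in a real network)."""
    target_h = max(f.shape[1] for f in feats)
    aligned = [upsample(f, target_h // f.shape[1]) for f in feats]
    w = softmax(np.array(gates, dtype=float))  # one weight per scale
    return sum(wi * fi for wi, fi in zip(w, aligned))

rng = np.random.default_rng(0)
scales = [rng.standard_normal((16, 64, 64)),   # high-resolution branch
          rng.standard_normal((16, 32, 32)),   # mid-resolution branch
          rng.standard_normal((16, 16, 16))]   # low-resolution branch
fused = adaptive_fuse(scales, gates=[0.5, 1.0, 0.2])
print(fused.shape)  # (16, 64, 64)
```

Because the gate weights sum to one, the fused map stays on the same numeric scale as its inputs; emphasizing the high-resolution branch is what lets small-scale keypoints keep fine spatial detail.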
Acknowledgements
This research is partially supported by the Beijing Natural Science Foundation (No. 4212025) and National Natural Science Foundation of China (No. 61876018, No. 61906014, No. 61976017).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Human and animal rights
This article does not contain any studies with human participants and/or animals performed by any of the authors.
Informed consent
Informed consent was not required for this study.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Xu, J., Liu, W., Xing, W. et al. MSPENet: multi-scale adaptive fusion and position enhancement network for human pose estimation. Vis Comput 39, 2005–2019 (2023). https://doi.org/10.1007/s00371-022-02460-y