Vital information is only worth one thumbnail: Towards efficient human pose estimation

Published: 04 March 2024

Abstract

In pursuit of high accuracy, existing DCNN-based approaches to human pose estimation usually train deep models with massive networks and large input images. To deploy these methods in real-time systems, current works compress the network by reducing the number of layers and channels, but such approaches are complex and generalize poorly because they require elaborate design of small-scale network structures. Based on the observation that large images contain redundant information, this paper explores the influence of image size on system complexity and proposes a novel framework, ThumbPose, which accelerates and compresses deep models by inferring on thumbnail representations for human pose estimation. In our framework, we first propose a style-supervised online downscaler that reduces an input image to a thumbnail image. Furthermore, a dual-branch auto-encoding training strategy is designed to obtain an effective and accurate thumbnail representation in a knowledge distillation manner, which is then used to keep the performance on thumbnail images at the level of the original-size input images. For heat-map-based human pose estimation, ThumbPose is an orthogonal and implementation-friendly method that not only compresses and accelerates the inference network but also yields an image downscaler, learned in a supervised manner, that can be used in other high-level tasks in practical applications (e.g., detection and segmentation). Extensive experiments on the MS COCO dataset demonstrate the effectiveness of the proposed method: ThumbPose achieves superior performance (+1.3% AP and +0.7% AR) with negligible additional cost (<0.2 GFLOPs) compared to previous state-of-the-art methods when using small-size images as inputs. Moreover, experiments on MPII show that our model achieves higher accuracy (+0.2% PCKh@0.5) with minimal computation (2.5 GFLOPs) compared to superior lightweight models obtained by network compression methods.
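The link between image size and complexity that the abstract relies on can be illustrated with simple arithmetic: the cost of a convolutional layer grows with the spatial size of its output, so halving both sides of the input cuts the multiply-accumulate count by roughly four. The sketch below is only an illustration under stated assumptions. The layer list `TOY_LAYERS` and the two input sizes are hypothetical and are not the ThumbPose network or its reported GFLOPs.

```python
def conv_macs(h, w, c_in, c_out, k, stride=1):
    """Multiply-accumulates of one k x k convolution with 'same' padding."""
    h_out = -(-h // stride)  # ceiling division
    w_out = -(-w // stride)
    return h_out * w_out * c_in * c_out * k * k


def stack_macs(h, w, layers):
    """Total MACs for a sequence of (c_in, c_out, k, stride) layers."""
    total = 0
    for c_in, c_out, k, stride in layers:
        total += conv_macs(h, w, c_in, c_out, k, stride)
        # The next layer sees the downsampled feature map.
        h = -(-h // stride)
        w = -(-w // stride)
    return total


# Hypothetical toy backbone ending in 17 heat-map channels (assumption,
# chosen only to make the arithmetic concrete).
TOY_LAYERS = [(3, 64, 3, 2), (64, 64, 3, 2), (64, 128, 3, 1), (128, 17, 1, 1)]

full_size = stack_macs(256, 192, TOY_LAYERS)   # assumed original input size
thumbnail = stack_macs(128, 96, TOY_LAYERS)    # assumed thumbnail size
ratio = full_size / thumbnail

print(f"full-size MACs: {full_size}")
print(f"thumbnail MACs: {thumbnail}")
print(f"reduction factor: {ratio}")
```

For this toy stack the reduction factor is exactly 4 because every feature map shrinks by half along each side. Real networks with fully connected layers or fixed-size components would deviate from this ratio.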

Highlights

A novel framework is proposed for efficient human pose estimation.
A style supervised online downscaler is designed to compress input images to thumbnail ones.
A dual-branch auto-encoding strategy is designed for refining the thumbnail representation.
We obtain SOTA performance on COCO and MPII datasets with low computational costs.
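To make the knowledge distillation idea in the highlights concrete, the following is a minimal sketch of a generic heat-map distillation objective, in which a branch fed with thumbnails (the student) is trained to match both the ground-truth heat-maps and the heat-maps of a branch fed with original-size images (the teacher). It is not the loss used by ThumbPose, whose style supervision and auto-encoding terms are not specified in this abstract. The function names and the weight `alpha` are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equally shaped 2D heat-maps."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(a, b):
        for x, y in zip(row_a, row_b):
            total += (x - y) ** 2
            count += 1
    return total / count


def distillation_loss(student, teacher, target, alpha=0.5):
    """Blend a ground-truth term with a teacher-imitation term.

    alpha is a hypothetical weight: 0 ignores the teacher,
    1 ignores the ground truth.
    """
    return (1 - alpha) * mse(student, target) + alpha * mse(student, teacher)


# Tiny 1 x 2 heat-maps, purely illustrative values.
student_map = [[0.0, 0.5]]
teacher_map = [[0.0, 0.5]]
target_map = [[0.0, 1.0]]

loss = distillation_loss(student_map, teacher_map, target_map, alpha=0.5)
print(loss)
```

In practice such a loss would be computed per keypoint channel on tensors in a deep learning framework. Plain lists are used here only to keep the sketch dependency-free.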

Published In

Pattern Recognition, Volume 147, Issue C
Mar 2024
771 pages

Publisher

Elsevier Science Inc., United States

Author Tags

1. Human pose estimation
2. Small-size input
3. Knowledge distillation
4. Network compression and acceleration
