
KITPose: Keypoint-Interactive Transformer for Animal Pose Estimation

  • Conference paper

Pattern Recognition and Computer Vision (PRCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13534)

Abstract

Animal pose estimation has received increasing attention in recent years. The main challenge of this task is the diversity of animal species, compared with the human counterpart. To address this issue, we design a keypoint-interactive Transformer model for high-resolution animal pose estimation, namely KITPose. Since a high-resolution network maintains local perception while the self-attention module in a Transformer excels at modelling long-range dependencies, we equip the high-resolution network with a Transformer to enhance the model capacity, achieving keypoint interaction at the decision stage. Moreover, to better fit the pose estimation task, we simultaneously train the model parameters and the joint weights, which automatically adjusts the loss weight for each specific keypoint. The experimental results obtained on the AP10K and ATRW datasets demonstrate the merits of KITPose, as well as its superior performance over state-of-the-art approaches.
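The two ideas the abstract describes can be sketched in a few lines: self-attention applied across per-keypoint tokens, so that keypoints interact at the decision stage, and a per-keypoint loss weight trained jointly with the model parameters. This is an illustrative sketch only, not the authors' implementation; the single-head attention, the token dimensions, and the softmax normalisation of the joint weights are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def keypoint_self_attention(tokens, Wq, Wk, Wv):
    """Single-head self-attention across K keypoint tokens.

    tokens: (K, d) -- one feature vector per keypoint.
    Each keypoint attends to every other, capturing long-range
    dependencies between joints."""
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (K, K) keypoint affinities
    return softmax(scores, axis=-1) @ v       # (K, d) interacted tokens

def weighted_mse(pred, target, joint_logits):
    """Per-keypoint MSE weighted by learnable joint weights.

    pred/target: (K, H, W) heatmaps; joint_logits: (K,) parameters
    trained alongside the model. Softmax keeps the weights positive
    and of mean one (a normalisation choice assumed here)."""
    w = softmax(joint_logits) * len(joint_logits)
    per_joint = ((pred - target) ** 2).mean(axis=(1, 2))  # (K,)
    return float((w * per_joint).mean())

# Toy usage: 17 keypoints with 64-dimensional tokens.
rng = np.random.default_rng(0)
K, d = 17, 64
tokens = rng.normal(size=(K, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
out = keypoint_self_attention(tokens, Wq, Wk, Wv)
print(out.shape)  # (17, 64)
```

With uniform joint logits the weights are all one and the loss reduces to a plain mean-squared heatmap error; during training, gradients through `joint_logits` let the model emphasise keypoints that are harder to localise.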


References

  1. Yu, H., Xu, Y., Zhang, J., Zhao, W., Guan, Z., Tao, D.: AP-10K: a benchmark for animal pose estimation in the wild. arXiv preprint arXiv:2108.12617 (2021)

  2. Li, S., Li, J., Tang, H., Qian, R., Lin, W.: ATRW: a benchmark for amur tiger re-identification in the wild. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 2590–2598 (2020)

  3. Pereira, T.D., et al.: SLEAP: multi-animal pose tracking. bioRxiv (2020)

  4. Pereira, T.D., et al.: SLEAP: a deep learning system for multi-animal pose tracking. Nat. Methods 19, 486–495 (2022). https://doi.org/10.1038/s41592-022-01426-1

  5. Lauer, J., et al.: Multi-animal pose estimation, identification and tracking with DeepLabCut. Nat. Methods 19, 496–504 (2022). https://doi.org/10.1038/s41592-022-01443-0

  6. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48

  7. Andriluka, M., Pishchulin, L., Gehler, P., Schiele, B.: 2D human pose estimation: new benchmark and state of the art analysis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3686–3693 (2014)

  8. Wei, S.E., Ramakrishna, V., Kanade, T., Sheikh, Y.: Convolutional pose machines. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4724–4732 (2016)

  9. Cao, Z., Simon, T., Wei, S.E., Sheikh, Y.: Realtime multi-person 2D pose estimation using part affinity fields. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7291–7299 (2017)

  10. Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 483–499. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_29

  11. Fang, H.S., Xie, S., Tai, Y.W., Lu, C.: RMPE: regional multi-person pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2334–2343 (2017)

  12. Xiao, B., Wu, H., Wei, Y.: Simple baselines for human pose estimation and tracking. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11210, pp. 472–487. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01231-1_29

  13. Wang, J., et al.: Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 43(10), 3349–3364 (2020)

  14. Sun, K., Xiao, B., Liu, D., Wang, J.: Deep high-resolution representation learning for human pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5693–5703 (2019)

  15. Cheng, B., et al.: HigherHRNet: scale-aware representation learning for bottom-up human pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5386–5395 (2020)

  16. MMPose Contributors: OpenMMLab pose estimation toolbox and benchmark. https://github.com/open-mmlab/mmpose (2020)

  17. Li, K., et al.: Pose recognition with cascade transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1944–1953 (2021)

  18. Yang, S., Quan, Z., Nie, M., Yang, W.: TransPose: keypoint localization via transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11802–11812 (2021)

  19. Li, Y., et al.: TokenPose: learning keypoint tokens for human pose estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11313–11322 (2021)

  20. Yuan, Y., et al.: HRFormer: high-resolution vision transformer for dense prediction. In: Advances in Neural Information Processing Systems 34 (2021)

  21. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13

  22. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)

  23. Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)

  24. Mathis, A., et al.: Pretraining boosts out-of-domain robustness for pose estimation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1859–1868 (2021)

  25. Graving, J.M., Chae, D., et al.: DeepPoseKit, a software toolkit for fast and robust animal pose estimation using deep learning. eLife 8, e47994 (2019). https://doi.org/10.7554/eLife.47994

  26. Cao, J., Tang, H., Fang, H.S., Shen, X., Lu, C., Tai, Y.W.: Cross-domain adaptation for animal pose estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9498–9507 (2019)

  27. Li, C., Lee, G.H.: From synthetic to real: unsupervised domain adaptation for animal pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1482–1491 (2021)

  28. Labuguen, R., et al.: MacaquePose: a novel "in the wild" macaque monkey pose dataset for markerless motion capture. bioRxiv (2020)

  29. Pereira, T.D., et al.: Fast animal pose estimation using deep neural networks. Nat. Methods 16(1), 117–125 (2019)

  30. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  31. Newell, A., Huang, Z., Deng, J.: Associative embedding: end-to-end learning for joint detection and grouping. In: Advances in Neural Information Processing Systems 30 (2017)

  32. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems 30 (2017)

  33. Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)

  34. Mu, J., Qiu, W., Hager, G.D., Yuille, A.L.: Learning from synthetic animals. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12386–12395 (2020)

  35. Zhang, F., Zhu, X., Dai, H., Ye, M., Zhu, C.: Distribution-aware coordinate representation for human pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7093–7102 (2020)

  36. Geng, Z., et al.: Bottom-up human pose estimation via disentangled keypoint regression. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14676–14686 (2021)

  37. Luo, Z., et al.: Rethinking the heatmap regression for bottom-up human pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13264–13273 (2021)

  38. Jin, L., et al.: Grouping by center: predicting centripetal offsets for bottom-up human pose estimation. IEEE Trans. Multimedia (2022)

  39. Harding, E.J., Paul, E.S., Mendl, M.: Cognitive bias and affective state. Nature 427(6972), 312 (2004)

  40. Touvron, H., et al.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021)


Acknowledgement

This work was supported in part by the National Natural Science Foundation of China (U1836218, 61876072, 61902153, 62106089).

Author information

Authors and Affiliations

Authors

Corresponding authors

Correspondence to Tianyang Xu or Xiaoning Song.

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Rao, J., Xu, T., Song, X., Feng, ZH., Wu, XJ. (2022). KITPose: Keypoint-Interactive Transformer for Animal Pose Estimation. In: Yu, S., et al. Pattern Recognition and Computer Vision. PRCV 2022. Lecture Notes in Computer Science, vol 13534. Springer, Cham. https://doi.org/10.1007/978-3-031-18907-4_51

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-18907-4_51

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-18906-7

  • Online ISBN: 978-3-031-18907-4

  • eBook Packages: Computer Science, Computer Science (R0)
