Abstract
Animal pose estimation has received increasing attention in recent years. The main challenge of this task, compared with its human counterpart, is the much greater diversity of animal species. To address this issue, we design a keypoint-interactive Transformer for high-resolution animal pose estimation, namely KITPose. Since a high-resolution network preserves local perception while the self-attention module of the Transformer excels at modelling long-range dependencies, we equip the high-resolution network with a Transformer to enhance the model capacity, achieving keypoint interaction at the decision stage. Besides, to better fit the pose estimation task, we train the model parameters and joint weights simultaneously, automatically adjusting the loss weight of each specific keypoint. The experimental results obtained on the AP-10K and ATRW datasets demonstrate the merits of KITPose, as well as its superior performance over state-of-the-art approaches.
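To make the two ideas in the abstract concrete, the following PyTorch sketch illustrates (i) a Transformer encoder operating on per-keypoint tokens drawn from high-resolution features, so that keypoints interact at the decision stage, and (ii) a heatmap loss with learnable per-joint weights trained jointly with the model. All module names, shapes, and hyper-parameters (the 1x1-conv token projection, 64x64 heatmaps, softmax-normalised joint weights) are our illustrative assumptions, not the authors' released implementation.

```python
# A minimal, hypothetical sketch of KITPose's two ingredients as described in
# the abstract. Everything here is an illustrative assumption, not the paper's code.
import torch
import torch.nn as nn


class KeypointInteractionHead(nn.Module):
    """Turns backbone feature maps into one token per keypoint and lets a
    Transformer encoder model long-range dependencies between keypoints."""

    def __init__(self, num_keypoints: int, embed_dim: int = 256,
                 num_layers: int = 4, num_heads: int = 8):
        super().__init__()
        # Assumed: a 1x1 conv produces one channel (token) per keypoint.
        self.to_tokens = nn.LazyConv2d(num_keypoints, kernel_size=1)
        self.proj = nn.LazyLinear(embed_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers)
        self.head = nn.Linear(embed_dim, 64 * 64)  # assumed heatmap size

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) high-resolution features, e.g. from HRNet.
        tokens = self.to_tokens(feats).flatten(2)  # (B, K, H*W)
        tokens = self.proj(tokens)                 # (B, K, D)
        tokens = self.encoder(tokens)              # keypoint interaction
        heatmaps = self.head(tokens)               # (B, K, 64*64)
        return heatmaps.view(feats.size(0), -1, 64, 64)


class AdaptiveJointWeightLoss(nn.Module):
    """Heatmap MSE with a learnable loss weight per keypoint, optimised
    jointly with the model parameters, as the abstract describes."""

    def __init__(self, num_keypoints: int):
        super().__init__()
        self.joint_weights = nn.Parameter(torch.ones(num_keypoints))

    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # Per-keypoint MSE over batch and spatial dims, then a weighted sum;
        # softmax normalisation of the weights is our assumption.
        per_joint = ((pred - target) ** 2).mean(dim=(0, 2, 3))  # (K,)
        return (self.joint_weights.softmax(dim=0) * per_joint).sum()


# Usage sketch: AP-10K annotates 17 keypoints; feats would come from the backbone.
model = KeypointInteractionHead(num_keypoints=17)
loss_fn = AdaptiveJointWeightLoss(num_keypoints=17)
feats = torch.randn(2, 32, 64, 64)
loss = loss_fn(model(feats), torch.rand(2, 17, 64, 64))
```

Because the joint weights are trainable, gradient descent redistributes the loss emphasis across keypoints instead of relying on hand-tuned per-joint weights; how KITPose regularises this is described in the paper itself.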
References
Yu, H., Xu, Y., Zhang, J., Zhao, W., Guan, Z., Tao, D.: AP-10K: a benchmark for animal pose estimation in the wild. arXiv preprint arXiv:2108.12617 (2021)
Li, S., Li, J., Tang, H., Qian, R., Lin, W.: ATRW: a benchmark for Amur tiger re-identification in the wild. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 2590–2598 (2020)
Pereira, T.D., et al.: SLEAP: multi-animal pose tracking. bioRxiv (2020)
Pereira, T.D., et al.: SLEAP: a deep learning system for multi-animal pose tracking. Nat. Methods 19, 486–495 (2022). https://doi.org/10.1038/s41592-022-01426-1
Lauer, J., et al.: Multi-animal pose estimation, identification and tracking with DeepLabCut. Nat. Methods 19, 496–504 (2022). https://doi.org/10.1038/s41592-022-01443-0
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Andriluka, M., Pishchulin, L., Gehler, P., Schiele, B.: 2D human pose estimation: new benchmark and state of the art analysis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3686–3693 (2014)
Wei, S.E., Ramakrishna, V., Kanade, T., Sheikh, Y.: Convolutional pose machines. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4724–4732 (2016)
Cao, Z., Simon, T., Wei, S.E., Sheikh, Y.: Realtime multi-person 2D pose estimation using part affinity fields. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7291–7299 (2017)
Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 483–499. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_29
Fang, H.S., Xie, S., Tai, Y.W., Lu, C.: RMPE: regional multi-person pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2334–2343 (2017)
Xiao, B., Wu, H., Wei, Y.: Simple baselines for human pose estimation and tracking. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11210, pp. 472–487. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01231-1_29
Wang, J., et al.: Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 43(10), 3349–3364 (2020)
Sun, K., Xiao, B., Liu, D., Wang, J.: Deep high-resolution representation learning for human pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5693–5703 (2019)
Cheng, B., et al.: HigherHRNet: scale-aware representation learning for bottom-up human pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5386–5395 (2020)
MMPose Contributors: OpenMMLab pose estimation toolbox and benchmark. https://github.com/open-mmlab/mmpose (2020)
Li, K., et al.: Pose recognition with cascade transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1944–1953 (2021)
Yang, S., Quan, Z., Nie, M., Yang, W.: TransPose: keypoint localization via transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11802–11812 (2021)
Li, Y., et al.: TokenPose: learning keypoint tokens for human pose estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11313–11322 (2021)
Yuan, Y., et al.: HRFormer: high-resolution vision transformer for dense prediction. In: Advances in Neural Information Processing Systems 34 (2021)
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Mathis, A., et al.: Pretraining boosts out-of-domain robustness for pose estimation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1859–1868 (2021)
Graving, J.M., Chae, D., et al.: DeepPoseKit, a software toolkit for fast and robust animal pose estimation using deep learning. eLife 8, e47994 (2019). https://doi.org/10.7554/eLife.47994
Cao, J., Tang, H., Fang, H.S., Shen, X., Lu, C., Tai, Y.W.: Cross-domain adaptation for animal pose estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9498–9507 (2019)
Li, C., Lee, G.H.: From synthetic to real: unsupervised domain adaptation for animal pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1482–1491 (2021)
Labuguen, R., et al.: MacaquePose: a novel “in the wild” macaque monkey pose dataset for markerless motion capture. bioRxiv (2020)
Pereira, T.D., et al.: Fast animal pose estimation using deep neural networks. Nat. Methods 16(1), 117–125 (2019)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Newell, A., Huang, Z., Deng, J.: Associative embedding: end-to-end learning for joint detection and grouping. In: Advances in Neural Information Processing Systems 30 (2017)
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems 30 (2017)
Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Mu, J., Qiu, W., Hager, G.D., Yuille, A.L.: Learning from synthetic animals. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12386–12395 (2020)
Zhang, F., Zhu, X., Dai, H., Ye, M., Zhu, C.: Distribution-aware coordinate representation for human pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7093–7102 (2020)
Geng, Z., et al.: Bottom-up human pose estimation via disentangled keypoint regression. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14676–14686 (2021)
Luo, Z., et al.: Rethinking the heatmap regression for bottom-up human pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13264–13273 (2021)
Jin, L., et al.: Grouping by center: predicting centripetal offsets for the bottom-up human pose estimation. IEEE Trans. Multimedia (2022)
Harding, E.J., Paul, E.S., Mendl, M.: Cognitive bias and affective state. Nature 427(6972), 312 (2004)
Touvron, H., et al.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021)
Acknowledgement
This work was supported in part by the National Natural Science Foundation of China (U1836218, 61876072, 61902153, 62106089).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Rao, J., Xu, T., Song, X., Feng, ZH., Wu, XJ. (2022). KITPose: Keypoint-Interactive Transformer for Animal Pose Estimation. In: Yu, S., et al. Pattern Recognition and Computer Vision. PRCV 2022. Lecture Notes in Computer Science, vol 13534. Springer, Cham. https://doi.org/10.1007/978-3-031-18907-4_51
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-18906-7
Online ISBN: 978-3-031-18907-4
eBook Packages: Computer Science, Computer Science (R0)